Amazon is investigating claims of abusive data collection

June 28, 2024

Amazon’s cloud division has opened an investigation into Perplexity AI. At issue is whether the AI search startup violated Amazon Web Services rules by scraping websites that attempted to prevent it from doing so, WIRED has learned.

An AWS spokesperson, who spoke to WIRED on condition of anonymity, confirmed the company’s investigation into Perplexity. WIRED previously found that the startup, which is backed by the Jeff Bezos family fund and Nvidia and was recently valued at $3 billion, appears to rely on content scraped from websites that had forbidden access through the Robots Exclusion Protocol, a common web standard. While the Robots Exclusion Protocol is not legally binding, terms of service generally are.

The Robots Exclusion Protocol is a decades-old web standard that involves placing a plaintext file (like Wired.com/robots.txt) on a domain to indicate which pages should not be accessed by bots and automated crawlers. While companies that use scrapers can choose to ignore the protocol, most have traditionally respected it. An Amazon spokesperson told WIRED that AWS customers must adhere to the robots.txt standard when crawling websites.
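
To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard-library urllib.robotparser. The file contents, the bot names ExampleBot and OtherBot, and the example.com URLs are all hypothetical and chosen only for illustration; they do not reproduce any real site's rules or any company's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
# It bans one named crawler from the whole site and keeps
# every other crawler out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("ExampleBot", "https://example.com/article"),
        ("OtherBot", "https://example.com/article"),
        ("OtherBot", "https://example.com/private/page"),
    ]
    for agent, url in checks:
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent} -> {url}: {verdict}")
```

The check runs entirely on the crawler's side. Nothing in the protocol stops a crawler that skips this step, which is why compliance has always been voluntary.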

“AWS’s terms of service prohibit customers from using our services for any illegal activity, and our customers are responsible for complying with our terms and all applicable laws,” the spokesperson said.

Scrutiny of Perplexity’s practices follows a June 11 report from Forbes that accused the startup of plagiarizing at least one of its articles. WIRED’s investigation confirmed this behavior and found additional evidence of scraping abuse and plagiarism by systems linked to Perplexity’s AI-powered search chatbot. Engineers at Condé Nast, WIRED’s parent company, block Perplexity’s crawlers on all of its sites using a robots.txt file. But WIRED found that the company had access to a server using an unpublished IP address, 44.221.181.252, that visited Condé Nast properties at least hundreds of times in the past three months, apparently to collect data from Condé Nast websites.

The machine associated with Perplexity appears to be engaging in widespread crawling of news websites that ban bots from accessing their content. Spokespeople for The Guardian, Forbes and The New York Times also said they discovered the IP address on their servers multiple times.

WIRED traced the IP address to a virtual machine, known as an Elastic Compute Cloud (EC2) instance, hosted on AWS. AWS launched its investigation after we asked whether using AWS infrastructure to scrape websites that forbade it violated the company’s terms of service.

Last week, Perplexity CEO Aravind Srinivas responded to WIRED’s investigation by first saying that the questions we asked the company “reflect a deep and fundamental misunderstanding of how Perplexity and the internet work.” Srinivas then told Fast Company that the secret IP address WIRED observed crawling Condé Nast sites and a test site we created was operated by a third-party company that performs web crawling and indexing services. He declined to name the company, citing a nondisclosure agreement. When asked whether he would tell the third party to stop crawling WIRED, Srinivas replied, “It’s complicated.”