Perplexity Is Allegedly Scraping Websites It's Not Supposed To, Again

1 month ago

Web crawlers deployed by Perplexity to scrape websites are allegedly skirting restrictions, according to a new study from Cloudflare. Specifically, the study claims that the company's bots appear to be "stealth crawling" sites by disguising their identity to get around robots.txt files and firewalls.

Robots.txt is a simple file websites host that lets web crawlers know whether they can scrape a website's content or not. Perplexity's official web crawling bots are "PerplexityBot" and "Perplexity-User." In Cloudflare's tests, Perplexity was still able to display the content of a new, unindexed website, even when those specific bots were blocked by robots.txt. The behavior extended to websites with specific Web Application Firewall (WAF) rules that restricted web crawlers, as well.
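To make the mechanism concrete, here is a minimal sketch of how a robots.txt file that blocks the two named bots is interpreted, using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.com` URL, the helper name `check_agents`, and the Chrome-on-macOS user-agent string are all illustrative assumptions, not taken from Cloudflare's study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks Perplexity's two declared crawlers
# and allows everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""

# Illustrative generic browser user agent (an assumption for this sketch).
GENERIC_CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def check_agents(robots_txt, agents, url):
    """Return {agent: True/False} for whether robots.txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = check_agents(
    ROBOTS_TXT,
    ["PerplexityBot", "Perplexity-User", GENERIC_CHROME_UA],
    "https://example.com/page",
)
for agent, allowed in result.items():
    print(f"{agent[:20]:<20} allowed={allowed}")
```

The rules only match on the name a crawler announces, and compliance is voluntary. A client that presents itself as an ordinary browser falls under the catch-all rule, which is why robots.txt alone cannot stop an undeclared crawler.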

A flowchart created by Cloudflare to illustrate the different ways Perplexity's web crawlers attempt to access the content of a website.

Cloudflare believes that Perplexity is getting around those obstacles by using "a generic browser intended to impersonate Google Chrome on macOS" when robots.txt prohibits its normal bots. In Cloudflare's tests, the company's undeclared crawler could also rotate through IP addresses not listed in Perplexity's official IP range to get through firewalls. Cloudflare says that Perplexity appears to be doing the same thing with autonomous system numbers (ASNs), identifiers for groups of IP addresses operated by the same business, writing that it spotted the crawler switching ASNs "across tens of thousands of domains and millions of requests per day."
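The following is a rough sketch of why rules based on a declared name and a published IP range miss this kind of traffic. It is not Cloudflare's detection method. The CIDR ranges are reserved documentation addresses standing in for a published range, and the `classify` helper and its labels are hypothetical.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), standing in for a
# crawler operator's published IP list. These are NOT Perplexity's ranges.
DECLARED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]
DECLARED_AGENTS = ("PerplexityBot", "Perplexity-User")


def classify(ip, user_agent):
    """Label a request using only its source IP and announced user agent."""
    addr = ipaddress.ip_address(ip)
    in_range = any(addr in net for net in DECLARED_RANGES)
    declared = any(name.lower() in user_agent.lower() for name in DECLARED_AGENTS)
    if declared and in_range:
        return "declared crawler"
    if declared:
        return "declared name, unlisted IP"
    if in_range:
        return "listed IP, undeclared name"
    return "unattributed"


print(classify("192.0.2.10", "PerplexityBot/1.0"))
print(classify("203.0.113.5", "Mozilla/5.0 (Macintosh) Chrome/124.0.0.0"))
```

A request that uses a browser-like user agent from an unlisted address lands in the "unattributed" bucket, so a firewall rule keyed on these two signals has nothing to match against.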

Engadget has reached out to Perplexity for comment on Cloudflare's report. We'll update this article if we hear back.

Up-to-date information from websites is critical to companies training AI models, particularly as services like Perplexity are used as replacements for search engines. Perplexity has also been caught in the past circumventing the rules to stay up-to-date. Multiple websites reported in 2024 that Perplexity was still accessing their content despite them forbidding it in robots.txt, something the company blamed on the third-party web crawlers it was using at the time. Perplexity later partnered with multiple publishers to share revenue earned from ads displayed alongside their content, seemingly as a make-good for its past behavior.

Stopping companies from scraping content from the web will likely remain a game of whack-a-mole. In the meantime, Cloudflare has removed Perplexity's bots from its list of verified bots and implemented a way to identify and block Perplexity's stealth crawler from accessing its customers' content.