this post was submitted on 19 Aug 2025

695 points (98.9% liked)

The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall (www.searchenginejournal.com)

submitted 1 day ago* (last edited 1 day ago) by Davriellelouna@lemmy.world to c/technology@lemmy.world

[–] Ekybio@lemmy.world 20 points 1 day ago (3 children)

Can someone with more knowledge shine a bit more light on this whole situation? I'm out of the loop on the technical details.

[–] spankmonkey@lemmy.world 55 points 1 day ago

AI crawlers tend to overwhelm websites by doing the least efficient scraping of data possible, basically DDOSing a huge portion of the internet. Perplexity already scraped the net for training data and is now hammering it inefficiently for searches.

Cloudflare is just trying to keep the bots from overwhelming everything.

[–] panda_abyss@lemmy.ca 33 points 1 day ago* (last edited 1 day ago) (4 children)

Cloudflare runs as a CDN/cache/gateway service in front of a ton of websites. Their service is to help protect against DDOS and malicious traffic.

A few weeks ago Cloudflare announced they were going to block AI crawling (good, in my opinion). However, they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.

This is a response to that from Perplexity, which runs an AI search company. I don’t actually know how their service works, but they were specifically called out in the announcement, and Cloudflare accused them of “stealth scraping”, ignoring robots.txt, and other things.
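For anyone wondering what "ignoring robots.txt" means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot name `ExampleAIBot`, and the helper names are made up for illustration; this says nothing about how Perplexity's crawler is actually written. A well-behaved crawler would run a check like this before every request and skip anything that comes back as disallowed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (not taken from any real site).
# It bans one named bot entirely and keeps everyone out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
```

The catch is that robots.txt is purely advisory. A crawler that sends a browser-like user agent instead of its own name simply falls under the `*` rules, which is roughly what the "stealth scraping" accusation is about.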

[–] very_well_lost@lemmy.world 31 points 23 hours ago* (last edited 23 hours ago) (2 children)

A few weeks ago Cloudflare announced they were going to block AI crawling (good, in my opinion). However, they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.

I think it's also worth pointing out that all of the big AI companies are currently burning through cash at an absolutely astonishing rate, and none of them are anywhere close to being profitable. So pay-walling the data they use is probably gonna be pretty painful for their already-tortured bottom line (good).

[–] Tollana1234567@lemmy.today 1 points 4 hours ago

They already said they weren't profitable; they're trying to stay on life support until the VC funds run out.

[–] Dogiedog64@lemmy.world 16 points 22 hours ago (1 children)

It's more than simply astonishing, it's mind-blowingly bonkers how much money they have to burn to see ANY amount of return. You think a normal company is bad, blowing a few thousand bucks on materials, equipment, and labor per day in order to make a few bucks revenue (not profit)? AI companies have to blow HUNDREDS OF BILLIONS on massive data center complexes in order to train their bots, and then the energy cost and water cost of running them adds a couple more million a day. ALL so they can make negative hundreds of dollars on every prompt you can dream of.

The ONLY reason AI firms are still a thing in the current tech tree is because Techbros everywhere have convinced the uberwealthy VC firms that AGI is RIGHT AROUND THE CORNER, and will save them SO much money on labor and efficiency that it'll all be worth it in permanent, pure, infinite profit. If that sounds like too much of a pipe dream to be realistic, congratulations, you're a sane and rational human being.

[–] ubergeek@lemmy.today 7 points 20 hours ago

It’s more than simply astonishing, it’s mind-blowingly bonkers how much money they have to burn to see ANY amount of return

See, that's the trick, and it's used by LOADS of startups:

You don't actually have to see a return... You just have to have a good story showing there MAY be a GIANT return. The founders collect enormous salaries (funded by VC dollars, not their own), burn through the money to create more illusion, then ask for more, then burn through that, all while foretelling the coming days when the money will just roll in!

Meanwhile, just before it's "projected" to become insanely profitable, they sell out to someone, walk away with a giant check, and the product evaporates.

[–] _cryptagion@lemmy.dbzer0.com 10 points 22 hours ago

It should be pointed out that Cloudflare didn't say they were going to block AI traffic, they give you the option to. The service is a free opt-in for people who want it.

[–] nutsack@lemmy.dbzer0.com 6 points 22 hours ago* (last edited 22 hours ago)

they don't outright block ai crawlers. they added some new tools and options for managing or blocking ai bot traffic which the cloudflare customer can choose to use or to not use.

i'm running a free educational resource and i let the crawlers hit my site all they want because it's useful knowledge unavailable anywhere else and it's served to them from cloudflare's free tier cache. i just don't know why they have to read it ten thousand times a day.

[–] RogueBanana@piefed.zip 4 points 23 hours ago

But the website owner can still choose to continue blocking them, right? Without using additional stuff like Anubis, that is.

[–] BetaDoggo_@lemmy.world 21 points 23 hours ago* (last edited 23 hours ago) (1 children)

Perplexity (an "AI search engine" company with 500 million in funding) can't bypass Cloudflare's anti-bot checks. For each search, Perplexity scrapes the top results and summarizes them for the user. Cloudflare intentionally blocks Perplexity's scrapers because they ignore robots.txt and mimic real users to get around Cloudflare's blocking features. Perplexity argues that their scraping is acceptable because it's user-initiated.

Personally I think Cloudflare is in the right here. The scraped sites get 0 revenue from Perplexity searches (unless the user decides to go through the sources section and click the links), and Perplexity's scraping is unnecessarily traffic-intensive since they don't cache the scraped data.
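To make the caching point concrete, here is a rough sketch of a time-to-live (TTL) cache in Python. Everything in it is an assumption for illustration: the `TTLPageCache` name, the injected `fetch` function, and the 300-second default are not from the article and do not describe Perplexity's or Cloudflare's systems. The idea is only that repeated requests for the same URL inside the TTL window never reach the origin site.

```python
import time
from typing import Callable, Dict, Tuple


class TTLPageCache:
    """Serve repeat requests from memory until an entry is older than the TTL."""

    def __init__(
        self,
        fetch: Callable[[str], str],              # hypothetical downloader: url -> body
        ttl_seconds: float = 300.0,               # assumed freshness window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.origin_hits = 0  # how many times the origin site was actually contacted

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: no traffic to the site
        body = self._fetch(url)  # missing or stale: go to the site once
        self.origin_hits += 1
        self._entries[url] = (now, body)
        return body
```

With something like this in front of the scraper, a thousand users asking about the same page within five minutes would cost the site one request instead of a thousand. The trade-off is that answers can be up to one TTL out of date.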

[–] lividweasel@lemmy.world 6 points 21 hours ago (3 children)

…and Perplexity's scraping is unnecessarily traffic-intensive since they don't cache the scraped data.

That seems almost maliciously stupid. We need to train a new model. Hey, where’d the data go? Oh well, let’s just go scrape it all again. Wait, did we already scrape this site? No idea, let’s scrape it again just to be sure.

[–] rdri@lemmy.world 1 points 3 hours ago

First we complain that AI steals and trains on our data. Then we complain when it doesn't train. Cool.

[–] jballs@lemmy.world 2 points 20 hours ago

It's worth giving the article a read. It seems that they're not using the data for training, but for real-time results.

[–] spankmonkey@lemmy.world 0 points 20 hours ago

They do it this way in case the data changed, similar to how a person would be viewing the current site. The training was for the basic understanding, the real time scraping is to account for changes.

It is also horribly inefficient and works like a small-scale DDoS attack.
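Checking for changes does not have to mean downloading the whole page again. HTTP has conditional requests: the client sends the `ETag` it saw last time in an `If-None-Match` header, and the server answers `304 Not Modified` with no body if nothing changed. The sketch below simulates that in Python with an in-memory stand-in for a web server; `FakeOrigin`, `RevalidatingClient`, and the hash-based ETag are assumptions for illustration, not anyone's real implementation.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Response:
    status: int            # 200 (full body) or 304 (not modified)
    etag: str
    body: Optional[str]    # None on a 304


class FakeOrigin:
    """In-memory stand-in for a web server that supports ETag revalidation."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.pages = dict(pages)
        self.bytes_sent = 0  # body bytes the "server" actually had to send

    def get(self, url: str, if_none_match: Optional[str] = None) -> Response:
        body = self.pages[url]
        # Assumed ETag scheme: a short hash of the current content.
        etag = hashlib.sha256(body.encode()).hexdigest()[:16]
        if if_none_match == etag:
            return Response(304, etag, None)  # unchanged: no body sent
        self.bytes_sent += len(body.encode())
        return Response(200, etag, body)


class RevalidatingClient:
    """Always asks the origin, but only re-downloads when the content changed."""

    def __init__(self, origin: FakeOrigin) -> None:
        self._origin = origin
        self._cache: Dict[str, Tuple[str, str]] = {}  # url -> (etag, body)

    def get(self, url: str) -> str:
        cached = self._cache.get(url)
        resp = self._origin.get(url, if_none_match=cached[0] if cached else None)
        if resp.status == 304 and cached is not None:
            return cached[1]  # reuse the stored copy
        assert resp.body is not None
        self._cache[url] = (resp.etag, resp.body)
        return resp.body
```

This still costs the site one small request per check, so it is not free, but it gets the "current version of the page" behavior without resending the full content every time.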