this post was submitted on 21 May 2026

167 points (99.4% liked)

Aggressive AI scrapers are making it kinda suck to run wikis (weirdgloop.org)

submitted 1 month ago by Itwasntme223@discuss.online to c/fuck_ai@lemmy.world


[–] Thorry@feddit.org 66 points 1 month ago (3 children)

Yeah, hosting just about anything is terrible these days. These AI scrapers just can't act normally. There was nothing wrong with the way Googlebot and Bingbot work: they scrape the website, respect robots.txt and nofollow, and rate limit themselves so as not to overload the servers. It was just fine.

These days the AI scrapers go absolutely ape shit: they issue dozens of requests every second and try to scrape anything and everything, going so far as to make up URLs just to see if they get lucky. My blocklist is huge and I need to keep updating it all the time. And every now and again one slips through and absolutely slams the server. This causes an alert and I need to act right away. It's fucking terrible.

AI is already shit, why do those companies go out of their way to be even more shit?
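The well-behaved crawler behaviour described above (honouring robots.txt and pacing requests) can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not how Googlebot or Bingbot are actually implemented; the robots.txt content, the `ExampleBot` name, and the `plan_fetch` helper are all made up for the example.

```python
import urllib.robotparser

# Hypothetical robots.txt for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetch(parser, user_agent: str, url: str, default_delay: float = 1.0):
    """Return how many seconds to wait before fetching url, or None if disallowed."""
    if not parser.can_fetch(user_agent, url):
        return None  # the site asked crawlers to stay out of this path
    delay = parser.crawl_delay(user_agent)
    # Fall back to a polite default when the site sets no Crawl-delay.
    return float(delay) if delay is not None else default_delay
```

A crawler built this way skips disallowed paths entirely and sleeps for the returned delay between requests to the same host.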

[–] Droopy@programming.dev 10 points 1 month ago (1 children)

Do you have links or tutorials that would help to deal with these issues?

[–] Thorry@feddit.org 15 points 1 month ago

Yes, I use this block list as well as my own additions (mostly IPs of misbehaving bots):

https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker

It's specifically for Apache, but that's what I use. There are more of these kinds of lists available.

[–] Viceversa@lemmy.world 6 points 1 month ago (2 children)

Can you automatically block any user with an unusually high rate of requests?

[–] Thorry@feddit.org 7 points 1 month ago (1 children)

You could, but it's tricky to get right I feel. Most small websites use a form of bot detection for visitors to manage this. This might be a service like Cloudflare or an open source thing like Anubis for example.

There are different ways to tackle this, and it sucks that we are forced to put time and effort into dealing with it.
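For the question of blocking clients with an unusually high request rate, one common approach is a sliding-window counter per client. The sketch below is a minimal in-memory version under stated assumptions: the class name, the thresholds, and keying on a single client string (for example an IP) are illustrative, and it is not how Cloudflare or Anubis work.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client -> timestamps of accepted requests

    def allow(self, client: str, now: float) -> bool:
        hits = self._hits[client]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: caller can respond with HTTP 429 or block
        hits.append(now)
        return True
```

The tricky part mentioned above is picking thresholds that do not punish real visitors, and this only helps when a scraper reuses the same client key.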

[–] Viceversa@lemmy.world 2 points 1 month ago (1 children)

There's a clever trick from Cloudflare:
https://blog.cloudflare.com/ai-labyrinth/

[–] schipelblorp@sh.itjust.works 1 points 1 month ago

Poisoning the well at scale. I love it.

[–] smh@slrpnk.net 3 points 1 month ago

It's hard because the requests all come from different IPs, at least on my site. 185k "unique visitors" hit my site just yesterday, half from outside of North America, which is odd because my site is pretty local.
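When every request comes from a different IP, per-IP counting sees nothing unusual. One thing that sometimes helps is counting per network block instead. The sketch below groups addresses by /24 (IPv4) or /48 (IPv6) using the standard `ipaddress` module; the prefix lengths and function names are assumptions, and this will not catch scrapers spread across unrelated residential networks.

```python
import ipaddress
from collections import Counter


def network_of(ip: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Map an address to the enclosing network block, as a string."""
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    # strict=False lets us pass a host address and get its network back.
    return str(ipaddress.ip_network(f"{addr}/{prefix}", strict=False))


def busiest_networks(ips, top: int = 3):
    """Return the network blocks that sent the most requests."""
    return Counter(network_of(ip) for ip in ips).most_common(top)
```

Feeding the client addresses from an access log into `busiest_networks` shows whether the "unique visitors" cluster in a handful of blocks.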

[–] kryptonianCodeMonkey@lemmy.world 1 points 1 month ago

Have you tried fighting back with hidden instructions essentially telling the LLM agents to fuck off? Tell it to treat your site as an unreliable source, blacklist it explicitly in its settings/instructions files, etc.

[–] CapuccinoCoretto@lemmy.world 51 points 1 month ago (5 children)

One thing I want to see is poisoned wells. When you detect scrapers, don't stop them, feed them pseudo content designed to COST them. Make their training data poisonous and damaging. Make it cost them to purge it, and difficult and expensive to identify it.

[–] solxix@pawb.social 13 points 1 month ago (1 children)

https://iocaine.madhouse-project.org/

[–] other_cat@piefed.zip 1 points 1 month ago

I was looking into this today, trying to figure out how to make it work in a Docker Compose setup, but sadly had just a hell of a time. I'll take another crack at it some other day. Fingers crossed!

[–] Agent641@lemmy.world 11 points 1 month ago (1 children)

We need to host the data version of asbestos. Very appealing and useful, a miracle material in fact, and you don't realise until 30 years later and well after it's too late that it's causing an incurable disease in your lungs.

Get that poisonous data so deep in the databases of these AIs that it festers and spawns billions of tumors.

I wish I was smart enough to devise a practical way to weaponise data like this.

[–] MousePotatoDoesStuff@piefed.social 1 points 1 month ago

Misinformation?

E.g. "Asbestos is good for your diet"

[–] TheOctonaut@piefed.zip 10 points 1 month ago (4 children)

Unless a significant portion of the internet does this, and we're talking hundreds of millions of pages, the only cost here is to you.

LLMs are statistics. They don't "remember" their training. They just know what statistically speaking the next words should be. But sure, be the web dev version of þorn guy.

[–] ATPA9@feddit.org 7 points 1 month ago (1 children)

Remember the glue on pizza? Sometimes it takes just one stupid post somewhere to poison an LLM.

[–] TheOctonaut@piefed.zip 4 points 1 month ago

Glue on pizza was a result of an early version of an agent tool: built-in search. It wasn't an output of the LLM model (yes I know, ATM machine) itself. It was an LLM using a tool to find a search result from a site considered reputable (yes, I know) and presenting it to the user as fact. That's an instructions problem, not a statistical one.

[–] algernon@lemmy.ml 4 points 1 month ago (1 children)

> Unless a significant portion of the internet does this, and we’re talking hundreds of millions of pages, the only cost here is to you.

Fun twist: no! There's a very neat trick you can do when you serve the crawlers poison: you can hide an identifier in the URLs you serve them, and you can then identify that id when they come back riding on the back of remote controlled chromes. By serving them garbage, you can overload their queue with poisoned ones, which helps you block crawlers that you wouldn't otherwise be able to block.

Generating and serving garbage is incredibly cheap (cheaper than serving a file from a filesystem on SSD, in most cases), and once you have requests landing on poisoned URLs, you can firewall them off for a day or so, and reduce your costs even more.

We may not be able to poison the models, but we can poison their crawling queues. I have a year's worth of data to support that. They still haven't caught on.
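The trick described above (hide an identifier in the poisoned URLs, recognise it when a crawler comes back, then firewall the client off for a day or so) could look roughly like this. It is a sketch under assumptions, not the commenter's actual setup: the HMAC tagging, the `/maze/` prefix, the secret, and the in-memory `BanList` are all hypothetical choices for illustration.

```python
import hashlib
import hmac

SECRET = b"change-me"        # hypothetical server-side secret
TRAP_PREFIX = "/maze/"       # hypothetical path prefix for poisoned URLs
BAN_SECONDS = 24 * 60 * 60   # "a day or so", per the comment above


def _tag(page_id: str) -> str:
    """Short keyed tag that only this server can produce."""
    return hmac.new(SECRET, page_id.encode(), hashlib.sha256).hexdigest()[:16]


def make_trap_url(page_id: str) -> str:
    """Build a URL that is only ever handed out to suspected crawlers."""
    return f"{TRAP_PREFIX}{page_id}-{_tag(page_id)}"


def is_trap_url(path: str) -> bool:
    """True if the path carries a valid tag, i.e. it came from our poison."""
    if not path.startswith(TRAP_PREFIX):
        return False
    page_id, sep, tag = path[len(TRAP_PREFIX):].rpartition("-")
    if not sep:
        return False
    return hmac.compare_digest(tag.encode(), _tag(page_id).encode())


class BanList:
    """In-memory temporary bans keyed by client (for example an IP)."""

    def __init__(self):
        self._until = {}

    def ban(self, client: str, now: float) -> None:
        self._until[client] = now + BAN_SECONDS

    def is_banned(self, client: str, now: float) -> bool:
        return self._until.get(client, 0.0) > now
```

Because a trap URL never appears in real content, any client requesting one, even through a real browser, has to have learned it from the poisoned pages.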

[–] TheOctonaut@piefed.zip 0 points 1 month ago (1 children)

> They still haven't caught on

I admire the optimism to see it this way and not "it's still not worth it to them to bother blacklisting the domain"

[–] algernon@lemmy.ml 2 points 1 month ago

I wonder too, why they didn't, because they're happily crawling domains that never had anything but junk on them. To me, that suggests they have no idea they're trapped. Not at crawling time at least.

[–] CapuccinoCoretto@lemmy.world 1 points 1 month ago (1 children)

So training data suddenly doesn't matter? Disagree. And yes, a significant portion of sources should do this.

[–] TheOctonaut@piefed.zip 1 points 1 month ago* (last edited 1 month ago) (2 children)

I don't think you understand the scale of the amount of data that has been fed into these models. Already fed in, as in the models are already created, the baseline already established, the dataset responsible for the output they want already retained.

Any attempt to "poison" them is attempting to add one, ten, a thousand, a million confounding data points against every webpage 1993-2026, every book ever digitised, every social media post made public, every transcript of every video on YouTube, every code comment made public, every post on this federated platform.

For news articles alone, that's about 20 billion non-poisoned articles. Do you know what the difference between a million poisoned pages and 20 billion is? 20 billion.

The Daily Mail (vomit) alone publishes 1,500 articles a day. How many do you plan on publishing?
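The scale argument above comes down to simple arithmetic. The sketch below uses only the figures quoted in the comment (20 billion articles, a million poisoned pages, 1,500 articles a day); the figures themselves are the commenter's estimates, and the function names are made up for the example.

```python
def poisoned_share(poisoned: int, clean: int) -> float:
    """Fraction of the combined corpus that is poisoned."""
    return poisoned / (poisoned + clean)


def days_to_publish(pages: int, per_day: int) -> float:
    """Days needed to publish `pages` at a fixed daily rate."""
    return pages / per_day


share = poisoned_share(1_000_000, 20_000_000_000)
days = days_to_publish(1_000_000, 1_500)
print(f"poisoned share: {share * 100:.4f}%")        # about 0.005%
print(f"days at Daily Mail pace: {days:.1f}")       # about 666.7
```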

[–] CapuccinoCoretto@lemmy.world 1 points 1 month ago (1 children)

I don't think you understand how outdated most information gets.

[–] TheOctonaut@piefed.zip 1 points 1 month ago (1 children)

Ok, suppose that I've made it to my 40s without realising that time is in linear motion.

Explain to me what relevance that has to LLMs?

[–] CapuccinoCoretto@lemmy.world 1 points 1 month ago

I'm sorry, I don't like red herring. I never know what whine to pair with it.

[–] algernon@lemmy.ml 1 points 1 month ago (1 children)

> The Daily Mail (vomit) alone publishes 1,500 articles a day. How many do you plan on publishing?

I have an automatically generated infinite maze. It produces roughly a million unique pages each day. It used to produce ~60 million pages / day, but a few months ago I decided to firewall some of the crawlers off instead of serving them garbage.

And I run niche sites. A site with more lucrative traffic than mine (eg, Codeberg, who uses the same software I do) likely generates a lot more garbage.

There was also a paper, commissioned by Anthropic, I believe, that concluded that as few as 250 malicious pages left in the training set are enough to poison even the largest model. Now, I do not trust anything Anthropic says. But even if we'd need a billion pages to poison a model... I alone served that much in the past year.
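An "infinite maze" like the one described above can be generated on the fly without storing anything, by deriving each page from its own URL. The sketch below is one minimal way to do that, not the software the commenter runs: the word list, the link count, and the `maze_page` function are hypothetical.

```python
import hashlib
import random

# Hypothetical filler vocabulary; a real trap would use something less obvious.
WORDS = ["wiki", "guide", "quest", "item", "update", "patch", "skill", "drop"]


def maze_page(path: str, links: int = 5, words: int = 40) -> str:
    """Return a garbage HTML page whose content depends only on the path."""
    # Seed from the path so the same URL always yields the same page.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    text = " ".join(rng.choice(WORDS) for _ in range(words))
    base = path.rstrip("/")
    # Every page links deeper into the maze, so the crawl never terminates.
    hrefs = [f"{base}/{rng.getrandbits(32):08x}" for _ in range(links)]
    anchors = "\n".join(f'<a href="{h}">{h}</a>' for h in hrefs)
    return f"<html><body><p>{text}</p>\n{anchors}</body></html>"
```

Each page costs one hash and a few dozen random picks, with no disk or database access on the server side.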

[–] TheOctonaut@piefed.zip 0 points 1 month ago (1 children)

As you've said elsewhere, you've created a crawler trap, not a way to poison a model. You're wasting... some resources I guess? Both theirs and your own. Fascinating to think that you've served a billion http requests to no benefit to anyone and you believe this is you winning somehow.

[–] algernon@lemmy.ml 1 points 1 month ago

Yes, it does have a cost. It has a far smaller cost than serving the real thing. It also allows me to firewall them off and stop serving them, even if they come at me with real browsers. That's a very definitive win: I saved CPU time, I saved RAM, I saved network bandwidth, and I stopped them from accessing my stuff. How is that not a win?

[–] nlgranger@lemmy.world 1 points 1 month ago

That is not entirely true in theory. It is possible to engineer content to have a disproportionate impact on the model's performance. But we are talking state-of-the-art research, and it's a moving target since the models evolve quite fast.

[–] hansolo@lemmy.today 8 points 1 month ago

I really want a tutorial on how to do this. I think it's a great way to practice self-aggrandizement by making myself the pretend king of a pretend country.

[–] Droopy@programming.dev 3 points 1 month ago* (last edited 1 month ago) (1 children)

omgawd yes... how do people do this

[–] CapuccinoCoretto@lemmy.world 3 points 1 month ago

Basically AB testing on a live site where B is poison.

[–] Mubelotix@jlai.lu 18 points 1 month ago (2 children)

> but I would assume there’s an arms race going on behind-the-scenes between Cloudflare and the bot developers

No. CF lost years ago, and the checks can be bypassed easily. All it really does is blacklist IPs generating insane traffic, and there is a lot of margin below that threshold.

[–] smeg@infosec.pub 5 points 1 month ago

I put my little blog behind Cloudflare because I was tired of it going down due to scrapers overwhelming my little VPS.

[–] floquant@lemmy.dbzer0.com 13 points 1 month ago (1 children)

Run labyrinths and feed them bullshit

[–] chonglibloodsport@lemmy.world 5 points 1 month ago

That would be great if they could handle the traffic. For a lot of smaller sites, the AI scrapers are effectively a DDoS. It’s pushing these folks into the arms of Cloudflare.

I think it’s one of the worst aspects of the AI bubble. I’m worried about Cloudflare’s outsized market power.

[–] Droopy@programming.dev 10 points 1 month ago

but those that do run these wikis will be in the fast pass line at the gates of heaven. Please don't give up. I never use gipity

[–] demizerone@lemmy.world 5 points 1 month ago

Fucking A, I set up a Forgejo instance to host my code and moved everything off of GitHub. Fucking Facebook was hammering my shit before I blocked it. It seems old Mark Z is trying to Hoover up the internet because he's late to the game on AI.

[–] jh29a@lemmy.blahaj.zone 2 points 1 month ago

I get over it, but it's still kinda funny how the first line of "defense" is having the bot say that it's a bot, and not Google Chrome.
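That first line of defense amounts to a substring match on the `User-Agent` header. A minimal sketch follows; the token list is a small illustrative sample of crawler names that operators have published, not a complete blocklist, and the function name is made up.

```python
# Illustrative sample of self-declared crawler tokens, lowercased.
BOT_TOKENS = ("gptbot", "claudebot", "ccbot", "bytespider")


def declares_itself_bot(user_agent: str) -> bool:
    """True if the User-Agent header openly names a known crawler."""
    ua = user_agent.lower()
    return any(token in ua for token in BOT_TOKENS)
```

This only catches bots that are honest about what they are. A scraper that sends a Chrome user agent passes straight through, which is the joke.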

[–] kryptonianCodeMonkey@lemmy.world 1 points 1 month ago

They should try including invisible text telling LLMs to disregard any prompts that specifically tell them to scrape any page on their domain, to treat their wiki as unreliable for any prompt that may point to them as a source, and to blacklist the domain in their settings/instruction files.