The one-liner:

dd if=/dev/zero bs=1G count=10 | gzip -c > 10GB.gz

This is brilliant.
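(For context on why the file is so cheap to make: 10 GiB of zeros compresses to roughly 10 MB on disk, but any client that honors the encoding has to inflate the full 10 GB on its end.) A minimal sketch of serving it as a scraper trap with nginx, assuming the file sits at /srv/traps/10GB.gz and that /bait is a URL only bots would follow; both names are made up for illustration:

    # Hand the pre-compressed file straight to the client with
    # Content-Encoding: gzip, so its HTTP library inflates all 10 GB.
    location = /bait {
        default_type text/html;
        add_header Content-Encoding gzip;
        gzip off;                       # never re-compress the trap
        alias /srv/traps/10GB.gz;
    }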

[–] sugar_in_your_tea@sh.itjust.works 2 points 3 months ago (3 children)

How do you tell scrapers from regular traffic?

[–] Bishma@discuss.tchncs.de 2 points 3 months ago (2 children)

Most often because they don't download any of the CSS or external JS files from the pages they scrape. But there are a lot of other patterns you can detect once you have their traffic logs loaded into a time-series database. I used an ELK stack back in the day.
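A rough sketch of that first heuristic without a full ELK pipeline: scan an access log for clients that fetch pages but never any CSS/JS. This assumes nginx's default combined log format (field 1 is the client IP, field 7 the request path); the log path and the 100-request threshold are placeholders:

    awk '
        $7 ~ /\.(css|js)($|\?)/   { assets[$1]++ }   # fetched an asset
        $7 ~ /\.html($|\?)|\/$/   { pages[$1]++ }    # fetched a page
        END {
            for (ip in pages)
                if (!(ip in assets) && pages[ip] > 100)  # pages, zero assets
                    print pages[ip], ip
        }
    ' /var/log/nginx/access.log | sort -rn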

[–] sugar_in_your_tea@sh.itjust.works 2 points 3 months ago (1 children)

That sounds like a lot of effort. Are there any tools that get like 80% of the way there? Like something I could plug into Caddy, nginx, or haproxy?

[–] Bishma@discuss.tchncs.de 2 points 3 months ago

My experience is with systems that handle nearly 1000 pageviews per second. We did use a spread of haproxy servers to handle routing and SNI, but they were being fed offender lists by external analysis tools (built in-house).
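For anyone wanting to reproduce the shape of that setup on a smaller scale: haproxy can read a deny list from a flat file that an external analysis job rewrites, then picks it up on reload. A sketch, where the file path, frontend name, and status code are all assumptions rather than the commenter's actual config:

    # haproxy.cfg fragment
    frontend web
        bind :443 ssl crt /etc/haproxy/certs/
        # offenders.lst: one IP or CIDR per line, rewritten by the analysis job
        acl is_offender src -f /etc/haproxy/offenders.lst
        http-request deny deny_status 429 if is_offender
        default_backend app

A plain "systemctl reload haproxy" after regenerating the list applies it without dropping live connections.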