this post was submitted on 13 Jan 2026
714 points (98.8% liked)

Selfhosted


Reddit's API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.

The key point: This doesn't touch Reddit's servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.
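For the LAN case, any static file server works. A minimal sketch using only the Python standard library (the directory contents here are a stand-in; point it at wherever the tool wrote its HTML):

```python
import functools
import http.server
import os
import tempfile
import threading
import urllib.request

# Stand-in for the generated archive directory.
root = tempfile.mkdtemp()
with open(os.path.join(root, "index.html"), "w") as f:
    f.write("<html><body>archive home</body></html>")

# Serve the directory. On a real LAN box you would bind to 0.0.0.0
# so other machines can reach it; port 0 here just picks a free port.
handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=root)
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
page = urllib.request.urlopen(f"http://127.0.0.1:{port}/").read().decode()
server.shutdown()
print(page)
```

On a Raspberry Pi the one-liner equivalent is `python3 -m http.server 8080 --bind 0.0.0.0`, run from inside the archive directory.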

What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.

API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.
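Since everything runs locally, hitting the API is just an HTTP request to your own machine. The endpoint and parameter names below are hypothetical, for illustration only; check the project's API docs for the real routes:

```python
from urllib.parse import urlencode

# Hypothetical base URL and route names -- adjust to the actual API.
BASE = "http://localhost:8080/api"

def search_url(query, subreddit=None, limit=25, offset=0):
    """Build a full-text search request against a local archive API."""
    params = {"q": query, "limit": limit, "offset": offset}
    if subreddit:
        params["subreddit"] = subreddit
    return f"{BASE}/search?{urlencode(params)}"

url = search_url("raspberry pi", subreddit="selfhosted")
print(url)
```

Fetch the resulting URL with `curl` or `urllib.request.urlopen` once the Docker stack is up; no external service is involved at any point.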

Self-hosting options:

  • USB drive / local folder (just open the HTML files)
  • Home server on your LAN
  • Tor hidden service (2 commands, no port forwarding needed)
  • VPS with HTTPS
  • GitHub Pages for small archives
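The Tor option above typically boils down to a hidden-service stanza in torrc, pointing at whatever local port serves the archive (paths below are examples; adjust to your system):

```
# /etc/tor/torrc
HiddenServiceDir /var/lib/tor/redd-archive/
HiddenServicePort 80 127.0.0.1:8080
```

After restarting Tor, the `hostname` file inside HiddenServiceDir contains the .onion address. No port forwarding is needed because Tor makes only outbound connections.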

Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.

Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B post dataset, run multiple instances by topic.

How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is "trust but verify" – it accelerates the boring parts but you still own the architecture.

Live demo: https://online-archives.github.io/redd-archiver-example/

GitHub: https://github.com/19-84/redd-archiver (Public Domain)

Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4

50 comments
[–] Mubelotix@jlai.lu 1 points 6 days ago

I do not consent for this

[–] Clbull@lemmy.world 1 points 6 days ago (2 children)
[–] 19_84@lemmy.dbzer0.com 1 points 6 days ago (3 children)

i will always take more data sources, including lemmy!

[–] K3can@lemmy.radio 1 points 1 week ago (1 children)

Can anyone figure out what the minimum process is to just use the SSG function? I'm having a really hard time trying to understand the documentation.

[–] 19_84@lemmy.dbzer0.com 1 points 6 days ago (1 children)

did you check the quickstart?

[–] K3can@lemmy.radio 1 points 6 days ago (1 children)

Yes, both the standalone quickstart and the quickstart section of the readme (which differ from each other).

Is it possible to get the static sites without spinning up a DB backend?

[–] 19_84@lemmy.dbzer0.com 1 points 6 days ago

the database is required

[–] HugeNerd@lemmy.ca -3 points 6 days ago

Boring. I want the Kuro5hin site. That was actually good and hysterically funny at the best times. ASCII reenactment players of Michael Crawford anyone?

[–] breakingcups@lemmy.world 127 points 1 week ago (2 children)

Just so you're aware, it is very noticeable that you also used AI to help write this post and its use of language can throw a lot of people off.

Not to detract from your project, which looks cool!

[–] 19_84@lemmy.dbzer0.com 146 points 1 week ago (18 children)

Yes I used AI, English is not my first language. Thank you for the kind words!

[–] offspec@lemmy.world 61 points 1 week ago (16 children)

It would be neat for someone to migrate this data set to a Lemmy instance

[–] cyberpunk007@lemmy.ca 12 points 1 week ago

Now this is a good idea.

[–] a1studmuffin@aussie.zone 56 points 1 week ago

This seems especially handy for anyone who wants a snapshot of Reddit from the pre-enshittification, pre-AI era, when content was more authentic and less driven by bots and commercial manipulation of opinion. Just choose the cutoff date you want and stick with that dataset.

[–] frongt@lemmy.zip 55 points 1 week ago (2 children)

And only a 3.28 TB database? Oh, because it's compressed. Includes comments too, though.

[–] 19_84@lemmy.dbzer0.com 34 points 1 week ago

Yes! Too many comments to count in a reasonable amount of time!

[–] douglasg14b@lemmy.world 12 points 1 week ago* (last edited 1 week ago) (1 children)

Yeah, it should inflate to 15TB or more I think

[–] muusemuuse@sh.itjust.works 15 points 1 week ago (1 children)

If only I had the space and bandwidth. I would host a mirror via Lemmy and drag the traffic away.

Actually, isn’t there a way to decentralize this so it can be accessed from regular browsers on the internet? Live content here, archive everywhere.

[–] SteveCC@lemmy.world 39 points 1 week ago (1 children)

Wow, great idea. So much useful information and discussion that users have contributed. Looking forward to checking this out.

[–] tanisnikana@lemmy.world 31 points 1 week ago (1 children)

Reddit is hot stinky garbage but can be useful for stuff like technical support and home maintenance.

Voat and Ruqqus are straight-up misinformation and fascist propaganda, and if you excise them from your data set, your data will dramatically improve.

[–] 19_84@lemmy.dbzer0.com 16 points 1 week ago

the great part is that since everything is already built, it is easy to support additional data sources! there is even an issue template for submitting new data sources! https://github.com/19-84/redd-archiver/blob/main/.github/ISSUE_TEMPLATE/submit-data-source.yml

[–] Tiger@sh.itjust.works 26 points 1 week ago (3 children)

What is the timing of the dataset, up through which date in time?

[–] 19_84@lemmy.dbzer0.com 45 points 1 week ago (1 children)

2005-06 to 2024-12

however, the data through 2025-12 has already been released; it just needs to be split and reprocessed for 2025 by watchful1. once that happens you can host an archive up to the end of 2025. i will probably add support for importing data from the Arctic Shift dumps instead, so that archives can be updated monthly.

[–] 19_84@lemmy.dbzer0.com 26 points 1 week ago (2 children)

PLEASE SHARE ON REDDIT!!! I have never had a reddit account and they will NOT let me post about this!!


So kinda like Kiwix, but for Reddit. That is so cool.

[–] MedicPigBabySaver@lemmy.world 19 points 1 week ago (7 children)

Fuck Reddit and Fuck Spez.

[–] BigDiction@lemmy.world 17 points 1 week ago

You should be very proud of this project!! Thank you for sharing.
