Took me a while to get back to this, but yeah, I agree it seems at least conceptually solid. The big barrier is that, as jarfil mentioned, you'd need at least 200 million sites indexed, so you'd need a good number of users for it to work. And those users would need to consent to running software that essentially logs every page they visit. There's also a privacy concern: from the "node" an indexed result was pulled from, you can tell that the user behind that node has visited the site. Maybe that could be fixed by each user also downloading indexed site data from others, beyond what they personally use, so their own activity gets mixed indistinguishably with everyone else's? There are probably clever vulnerabilities in that too, though.
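A minimal sketch of that mixing idea, assuming nothing about the actual protocol (all names here are made up): the node keeps locally visited pages and pages replicated from peers in a single pool with no provenance flag, so serving a result for a URL doesn't prove the node's owner visited it.

```python
import random

class MixedIndex:
    """Hypothetical node index: local and replicated entries stored identically."""

    def __init__(self):
        self.entries = {}  # url -> page text; deliberately no "this was mine" flag

    def add_local_page(self, url, text):
        self.entries[url] = text

    def absorb_from_peer(self, peer_entries, max_new=1000):
        # Pull a random subset of a peer's entries so the local index becomes
        # a blend of the owner's browsing and other people's.
        sample = random.sample(list(peer_entries.items()),
                               k=min(len(peer_entries), max_new))
        self.entries.update(sample)

    def search(self, term):
        # Naive substring match; a real node would use a proper inverted index.
        return [url for url, text in self.entries.items() if term in text]
```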
Structurally it's a lot like DNS. If DNS servers were willing to store embeddings of site content and make them queryable, that would seemingly accomplish the same idea, except it would put everything in the hands of DNS operators. Of course, it would also massively multiply the amount of data those servers need to store, probably to an impossible degree.
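To illustrate what "queryable embeddings" could mean (purely a sketch; nothing here is an existing DNS feature, and the embedding model and dimensions are assumptions): each site maps to a fixed-size vector and queries are ranked by cosine similarity.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def query(site_embeddings, query_vec, top_k=10):
    # site_embeddings: {domain: [float, ...]} from some text-embedding model.
    ranked = sorted(site_embeddings.items(),
                    key=lambda kv: cosine(kv[1], query_vec),
                    reverse=True)
    return ranked[:top_k]
```

For a rough sense of scale: assuming one 384-dimensional float32 vector per site, that's about 1.5 KB × 200 million sites ≈ 300 GB per full replica, before any per-page granularity or the index structures needed to search it quickly.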
I still need to read up on what primitive indexing really looks like and how much space it takes to store per site.
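For reference, "primitive" indexing is usually just an inverted index: terms mapped to the URLs that contain them. A toy version (illustrative only) looks like this, and the per-page cost is roughly the number of distinct terms times a few bytes each.

```python
import re
from collections import defaultdict

inverted = defaultdict(set)  # term -> set of URLs containing it

def index_page(url, text):
    # Index each distinct lowercase alphanumeric token on the page.
    for term in set(re.findall(r"[a-z0-9]+", text.lower())):
        inverted[term].add(url)

def search(terms):
    # Return URLs containing every query term (simple AND query).
    sets = [inverted.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()
```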
Oh, yeah, that would be bad. Maybe something like an onion network would help, but I suspect it'd be subject to timing attacks, and it would eliminate all potential "friend peer" configuration benefits. I suppose another mitigation would be, as you said, some caching from peers. I was thinking limited caching, but if you even doubled or tripled the cache size, such that only 1/3 of the index "belonged" to the peer and the rest came from other nodes, you'd have a sort of Freenet situation where you couldn't prove anything about the peer itself. How big would indexes get, anyway? My buku cache is around 3.2 MB, and I can easily afford to allocate 50 MB for replicating data from other peers' DBs. However, buku doesn't index full sites; it only fetches the URL, title, tags, and description. We'd want something that at least fully indexes the URL's page, and real search engines crawl entire sites.
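As a back-of-the-envelope check on that split (the per-page figure below is a pure assumption; it depends on how much of each page gets indexed and how well it compresses):

```python
# 1/3-local / 2/3-replicated split under a 50 MB budget.
BUDGET_MB = 50
LOCAL_FRACTION = 1 / 3   # "only 1/3 of the index belongs to the peer"
PER_PAGE_KB = 5          # assumed: compressed postings + metadata per page

local_mb = BUDGET_MB * LOCAL_FRACTION
replicated_mb = BUDGET_MB - local_mb
print(f"local:      {local_mb:.1f} MB  (~{local_mb * 1024 / PER_PAGE_KB:,.0f} pages)")
print(f"replicated: {replicated_mb:.1f} MB  (~{replicated_mb * 1024 / PER_PAGE_KB:,.0f} pages)")
# With these assumptions: roughly 3,400 local pages and 6,800 replicated ones.
```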
Maybe it'd be infeasible.