this post was submitted on 28 May 2026
95 points (76.8% liked)

Selfhosted


Recent post re: AI as utility

https://www.tomsguide.com/ai/people-will-buy-intelligence-from-us-on-a-meter-chatgpts-ceo-sam-altman-has-critics-worried-with-his-ai-vision

Myself, I'm a fan of local LLM / self hosted ML.... but if you ever needed a clarion call that a hard pivot is coming (soon) for online/ cloud based AI...Altman et al are making some concerning mouth noises (to say nothing of broader concerns with OAI, Anthropic etc).

Right now, I'm sketching out a plan where my Raspberry Pi (always on, 2-3w) uses a magic packet to wake up my modest AI server (Lenovo P330 with Tesla P4) if/when needed (Qwen 3.6-35B-A3B); no point in chugging down 80-100w, 24/7 for no good reason.
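For a rough sense of scale, here is a back-of-envelope sketch of the yearly energy use behind that reasoning. It uses the wattages from the post (about 3 W for the Pi, about 100 W for the server). The 2 hours per day of on-demand use is purely an assumed figure, and the server's draw while asleep is ignored.

```python
def annual_kwh(watts: float, hours_per_day: float = 24.0) -> float:
    """Energy used in a year (kWh) by a device drawing `watts` for `hours_per_day`."""
    return watts * hours_per_day * 365 / 1000


# Wattages taken from the post; the 2 h/day duty cycle is an assumption.
pi_always_on = annual_kwh(3)
server_always_on = annual_kwh(100)
server_on_demand = annual_kwh(100, hours_per_day=2)

print(f"Pi, 24/7:            {pi_always_on:.1f} kWh/year")
print(f"Server, 24/7:        {server_always_on:.1f} kWh/year")
print(f"Server, 2 h per day: {server_on_demand:.1f} kWh/year")
```

Multiply by your own tariff to get a cost; the point is only that the always-on and wake-on-demand cases differ by roughly an order of magnitude under these assumptions.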

If the trend continues in the direction it appears to be heading (increasing costs, environmental impacts etc) then I'd feel a lot better hosting my own as first port of call and replacing simpler tasks with more traditional programs. YMMV.

[–] GreenKnight23@lemmy.world 0 points 5 days ago (1 children)

if you're selfhosting AI, make sure you at least firewall it off from the internet. many providers still send metrics back home that includes usage and content.

[–] SuspiciousCarrot78@aussie.zone 3 points 5 days ago* (last edited 2 days ago)

Respectfully, that's not really how local LLMs work.

A GGUF model sitting on my hard drive has no ability to "send content back home" any more than a PDF or a JPEG does. If you're running something like llama.cpp or Ollama entirely locally, the model weights are just data files.

The real privacy concerns are cloud APIs, telemetry in front-ends, browser extensions, analytics, update services, or accidentally exposing a service to the public internet.

"Self-hosted AI" isn't one thing. There's a huge difference between:

  • Running ChatGPT through an API
  • Running a commercial AI appliance
  • Running a local Qwen/Mistral/Llama model on your own hardware

Firewalling internet-facing services is good advice. Assuming every local model is secretly uploading prompts is not.

EDIT: for the record, I didn't down vote you - that was someone else.

[–] brucethemoose@lemmy.world 55 points 1 week ago* (last edited 1 week ago)

Yeah.

It’s not even about efficiency, really, but independence from corporations, privacy, and principle. Kind of like Lemmy.

[–] irmadlad@lemmy.world 23 points 1 week ago (1 children)

'People will buy intelligence from us on a meter'

We have governmental surveillance and we have surveillance capitalism. Surveillance capitalism works so well that governments are now very interested in the data they collect, which is alarming. Unfounded conspiracy theory: It's probably one of the reasons that governments don't seem interested in AI's regulation. If I had the proper equipment to run AI entirely local and efficiently so that the expenditure would justify it, I would.

[–] SuspiciousCarrot78@aussie.zone 5 points 1 week ago (2 children)

You probably could. A Tesla P4 or P40 (old data centre cards) is more than up to the job. My Lenovo tiny hosts a P4 (card cost $100 on eBay; the Lenovo itself was $200ish) and runs Qwen3.5-35B-A3B at about 20 tok/s. Smaller models are even faster.

https://www.youtube.com/watch?v=8F_5pdcD3HY

If you're not bound by the one liter shoebox design, then the P40 is still a great and inexpensive card.

I think I mentioned elsewhere but right now I'm trying to figure out if I can use a magic packet from the Raspberry Pi to wake up the Lenovo as needed rather than leaving it on all the time.
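A minimal sketch of that "wake only when needed" flow, assuming the server exposes some TCP port once it is up (an inference API, SSH, whatever). The function names, the timeouts and the `send_wake` callable are all hypothetical; `send_wake` stands in for whatever actually emits the magic packet.

```python
import socket
import time
from typing import Callable


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ensure_awake(
    host: str,
    port: int,
    send_wake: Callable[[], None],
    max_wait: float = 120.0,
    poll_interval: float = 2.0,
) -> bool:
    """Wake the server if it is not answering, then poll until it is (or give up)."""
    if is_port_open(host, port):
        return True  # already up, no need to send anything

    send_wake()  # hypothetical hook: send the magic packet here
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if is_port_open(host, port):
            return True
        time.sleep(poll_interval)
    return False
```

On the Pi this could be called as `ensure_awake("192.168.1.50", 8080, send_wake=my_wol_function)` before forwarding a request, where the address, port and `my_wol_function` are placeholders. Putting the machine back to sleep after an idle period is a separate problem not covered here.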

[–] klangcola@reddthat.com 3 points 1 week ago* (last edited 1 week ago) (2 children)

If you're already using node-red, the Wake On Lan node works well, and with node-red it's easy to trigger the magic packet based on whatever trigger condition you want.

The only limitation I know of is that WOL doesn't work after a power outage, because the switch and RPi don't know where to find the target machine

Thanks for the tips on reusable enterprise cards btw

[–] WhyJiffie@sh.itjust.works 2 points 1 week ago* (last edited 1 week ago) (1 children)

The only limitation I know of is that WOL doesn't work after a power outage, because the switch and RPi don't know where to find the target machine

maybe, but the pi does not need to know that, only the mac address and the interface. the switch doesn't need to know either because it's a broadcast frame, it's forwarded to all cables. the problem sometimes is that if you configure WOL from linux, the network adapter will probably forget on power cycling that it is supposed to react to magic packets. I think not all hardware is susceptible to that, but even then it could help to configure WOL in the BIOS

@SuspiciousCarrot78@aussie.zone
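To make the "only the MAC address" point concrete, here is a small sketch of a magic packet sender. The payload layout (6 bytes of 0xFF followed by the target MAC repeated 16 times) is the standard one. Sending it as a UDP broadcast to port 9 is a common convention rather than the only option, and the function names and defaults below are my own choices.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN payload: 6 x 0xFF, then the MAC repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(
    mac: str, broadcast_ip: str = "255.255.255.255", port: int = 9
) -> None:
    """Send the magic packet as a UDP broadcast on the local network."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast_ip, port))
```

Usage would be something like `send_magic_packet("00:11:22:33:44:55")` with the target's real MAC. On a Pi with more than one network interface, passing the subnet's broadcast address (for example `192.168.1.255`, a placeholder) instead of `255.255.255.255` is one way to steer which interface the packet leaves from.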

[–] klangcola@reddthat.com 1 points 1 week ago (2 children)

Maybe something else is going on then, but I've never gotten WOL to work after a blackout when there are two switches between sender and receiver. After powering up the receiver once, WOL works again

[–] WhyJiffie@sh.itjust.works 2 points 1 week ago

that's probably the BIOS only loading the configuration on the first boot. you could try enabling fast boot or disabling the right energy saving settings in the BIOS and see if that fixes it.

[–] homik@slrpnk.net 1 points 1 week ago (1 children)

Switches probably need to figure out which way a particular MAC is (unlike a hub, which just repeats everything everywhere). That's the switching part. If they power off, the tables will be empty.
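For anyone following along, a toy model of a learning switch may help test that assumption. This is a simplified sketch of standard transparent bridging as I understand it, not the firmware of any real switch; the class and method names are made up. The behaviour worth noticing is that an empty table causes flooding rather than dropping, and broadcast frames are flooded regardless of what the table holds.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningSwitch:
    """Toy model: learn source MACs per port, flood anything it cannot place."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def handle_frame(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is forwarded out of."""
        self.mac_table[src_mac] = in_port  # learn where the sender lives

        if dst_mac != BROADCAST and dst_mac in self.mac_table:
            out_port = self.mac_table[dst_mac]
            # Known unicast: forward to one port, or drop if it is the ingress port.
            return [] if out_port == in_port else [out_port]

        # Broadcast or unknown destination: flood to every other port.
        return [p for p in self.ports if p != in_port]


switch = LearningSwitch(ports=[1, 2, 3])
# Table is empty, as after a power cut; a broadcast still reaches ports 2 and 3.
print(switch.handle_frame(1, "aa:aa:aa:aa:aa:aa", BROADCAST))
```

Under this model an empty MAC table after a power outage would not by itself stop a broadcast magic packet, though it says nothing about what the sleeping NIC does with it.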

[–] klangcola@reddthat.com 2 points 1 week ago (1 children)

Yeah that was my assumption. But I hadn't considered WOL being broadcast, so now I'm not so sure. I would assume it's broadcast on both IP and Ethernet layer. It's time to do some wiresharking :)

[–] homik@slrpnk.net 2 points 1 week ago

I don't think WoL works over IP. In my mind it's lower (LAN, e.g. ethernet) level. But if it used IP, you'd need to get ARP going before it routes. An "offline" network chip could probably manage that, though.

I'm curious to know what you find. Wireshark is always fun and enlightening. :)

[–] SuspiciousCarrot78@aussie.zone 0 points 1 week ago* (last edited 1 week ago)

Good tips - thanks!

PS: sad to report the 24GB Tesla P40s are now around $250 USD on eBay, so not quite as cheap as I remembered. P4s are still cheap, though frankly if you're going that end of town, a 1080 is about on par, less fussy and probably cheaper - it just won't fit in a uSFF.

[–] irmadlad@lemmy.world 2 points 1 week ago

Thing is, if I were going to do in house AI, I'd want to do it up right and from what I can gather, a system like that is going to cost me some jack.

[–] pogmommy@lemmy.ml 10 points 1 week ago

My issue with the orphan-crushing machine isn't only that it's not in my children's bedroom

[–] sobchak@programming.dev 9 points 1 week ago

I think they know it's a somewhat viable option and is part of the reason they're doing the hardware cartel/circlejerk thing.

[–] noxypaws@pawb.social 9 points 1 week ago (1 children)

not gonna self host bullshit that wastes resources and makes me dumber.

[–] toor@lemmy.world 51 points 1 week ago (1 children)

Me, looking at my Jellyfin server…

Oh. Ok.

[–] noxypaws@pawb.social 10 points 1 week ago

NO that makes you dumber in a GOOD WAY THO.

[–] Auli@lemmy.ca 6 points 1 week ago (2 children)

Sure but all these self hosted ais are still done by companies who used massive amounts of power and water to train it.

[–] KatherinaReichelt@feddit.org 16 points 1 week ago

Which is an interesting dilemma: Those AIs are already trained. That power and water was used. If you use them, you will not pollute anything. But you may encourage those companies to train another AI

[–] brucethemoose@lemmy.world 16 points 1 week ago* (last edited 1 week ago)

No.

Even the biggest open weights models are trained on pennies compared to OpenAI and Claude. They just don’t have the hardware to be so wasteful.

In fact, the Nvidia GPU ban was the best thing to ever happen to “small” AI devs. It made them thrifty.

[–] commander@lemmy.world 5 points 1 week ago (1 children)

Altman can try to hype up how everyone's going to subscribe to them someday, all the while their subscriber base is being eaten up by competitors.

https://www.wheresyoured.at/openai-projects-chatgpt-plus-subscriptions-to-drop-by-80-from-44-million-in-2025-to-9-million-in-2026-made-up-using-cheaper-subscriptions-somehow/

Local stuff. I still believe the small parameter, ~1B free local, ones will suffice for the vast majority of how people use LLMs and there's still going to be a few years of improvements there until investments dry up. Eventually I bet more and more phone companies will include one of these small ones out the box. Pretty much like a nice search engine that works offline like if you're out on a major hike. Cloud stuff, there'll be stuff like Proton's Lumo where they're taking free open weight stuff and piecing them together for users.

OpenAI's thing is they'll make up for falling subscribers with advertising. So pretty much we're advancing fast through the search engine race of the 90s/early aughts. We'll at least have Gemini. ChatGPT maybe ends up crashing in value someday and gets bought up by Microsoft or some other company. Deepseek, Qwen, Kimi. Claude, like ChatGPT, maybe survives or crashes and gets absorbed by another company. Proton continues to exist as the company making AI products out of free stuff. Eventually the pace of improvements moves at a crawl and it's pointless to be paying for the best paywalled stuff. Just use the free stuff like how everyone mostly uses free search engines

[–] SuspiciousCarrot78@aussie.zone 4 points 1 week ago* (last edited 1 week ago)

Agree. And re small models - very agree. In fact I made an ablated version of Qwen 3.5-2B for use with my Pi, before thinking a bit harder and realising I can probably code something bespoke that doesn't need a stochastic parrot as a squawk box at all.

https://huggingface.co/BobbyLLM/polaris-heretic-Q4_K_M-GGUF

Still, as a SLM, it's perfectly cromulent and does well with tool calling etc which is what I wanted it for.

[–] superglue@lemmy.dbzer0.com 5 points 1 week ago (3 children)

Does anyone have a recommendation for a local model that can run well on a 5070 12GB? It pretty much would only get used for help with homelabbing and simple scripts.

[–] monoboy@lemmy.zip 7 points 1 week ago

Qwen 3.6-35B-A3B (which OP mentioned) would work great as long as you have some system RAM to offload it.

[–] SuspiciousCarrot78@aussie.zone 6 points 1 week ago (1 children)

There's an argument to be had regarding a MoE versus a small dense model. I guess it depends on what exactly you need doing with it. I would be tempted to run a smaller dense model (like a Qwen 3-14B or a Qwen 3.5 9B) as at a reasonable quant, it might fit mostly or entirely on the GPU, thereby giving you excellent speeds.

PS: I'm actually in the process of designing an expert system (not a LLM) for pretty much the task you described. The intention is that you would still interact with it like a large language model, but the actual brains underneath it would be something more traditional.

[–] brucethemoose@lemmy.world 1 points 1 week ago* (last edited 1 week ago)

MoEs can be very fast with hybrid inference. I run Xiaomi Mimo 2.5 (a 310B model, 116GB weights) on my single 3090 + 7800 CPU, and it outputs faster than I can read it.

It's also easier to fit long context, if you need that.

It's best to use the ik_llama.cpp fork for that, though. It gives a huge boost to hybrid MoE speeds.

[–] brucethemoose@lemmy.world 4 points 1 week ago* (last edited 1 week ago) (1 children)

Depends on how much CPU RAM you have, and how fast it is.

As others said, Qwen 35B at the very least. But you can get better models with more CPU RAM.

[–] superglue@lemmy.dbzer0.com 1 points 1 week ago (1 children)
[–] brucethemoose@lemmy.world 2 points 1 week ago* (last edited 1 week ago)

Probably Qwen 35B then. ~9GB free VRAM + (let's say) ~16GB of free CPU RAM is a good size for that, and squeezing bigger models in would be hard unless it's a headless linux server.
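A rough way to sanity-check that kind of sizing is sketched below. The ~4.5 bits per weight (roughly what a 4-bit K-quant might average) and the 2 GB allowance for KV cache and runtime overhead are both assumptions, and the function names are made up for illustration.

```python
def quantized_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(
    params_billion: float,
    bits_per_weight: float,
    vram_gb: float,
    ram_gb: float,
    overhead_gb: float = 2.0,  # assumed allowance for KV cache, buffers, etc.
) -> bool:
    """True if weights plus overhead fit in the combined free VRAM and RAM."""
    needed = quantized_weights_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= vram_gb + ram_gb


size = quantized_weights_gb(35, 4.5)
print(f"35B at ~4.5 bits/weight: about {size:.1f} GB of weights")
print("Fits in 9 GB VRAM + 16 GB RAM:", fits(35, 4.5, vram_gb=9, ram_gb=16))
print("Fits in 9 GB VRAM alone:      ", fits(35, 4.5, vram_gb=9, ram_gb=0))
```

Actual GGUF file sizes vary by quant type and long contexts need more than the 2 GB placeholder, so treat this as a first filter rather than a guarantee.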

[–] sturmblast@lemmy.world 3 points 1 week ago (2 children)

P100s are dirt cheap on ebay fyi

[–] brucethemoose@lemmy.world 3 points 1 week ago* (last edited 1 week ago) (1 children)

In practice, they’re not very good because of broken FP16, broken kernels, high idle usage and a bunch of other things.

Same with the AMD MI50 and MI100. Looks great on paper, not practical IRL, unless you want to pay a whole team of software devs to fix them for you.

Better to just save up for a 2080 TI or 3090, sadly.

[–] sturmblast@lemmy.world 1 points 1 week ago

Not having issues

[–] SuspiciousCarrot78@aussie.zone 2 points 1 week ago* (last edited 1 week ago) (1 children)

Huh - cheaper than the P40s (though less VRAM) but larger bandwidth due to HBM2. Good looking out

[–] sturmblast@lemmy.world 1 points 1 week ago (1 children)
[–] surewhynotlem@lemmy.world 1 points 1 week ago (1 children)

I was looking at that. Does it end up faster than something like a 1080?

[–] SuspiciousCarrot78@aussie.zone 2 points 1 week ago* (last edited 1 week ago)

On memory bandwidth, roughly 2-3x. The 16GB P100 is around 732 GB/s. The 1080 is about 320 GB/s. HBM2 simply has larger bandwidth. The 1080 was a gaming card; the P100 is a server / number cruncher.
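Why bandwidth matters so much: token generation is usually memory-bound, so a common rule of thumb is that every generated token has to read the active weights once. The sketch below turns that into an optimistic ceiling. The bandwidth inputs are spec-sheet figures as far as I know (roughly 732 GB/s for the 16GB P100 and 320 GB/s for the GTX 1080, so check them for your exact card), and the 3B active parameters at ~4.5 bits per weight are assumptions for a small MoE at a 4-bit quant.

```python
def decode_tokens_per_sec_upper_bound(
    bandwidth_gb_s: float, active_params_billion: float, bits_per_weight: float
) -> float:
    """Optimistic ceiling on decode speed if memory bandwidth is the only limit."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Assumed inputs: spec-sheet bandwidths, 3B active params, ~4.5 bits per weight.
p100 = decode_tokens_per_sec_upper_bound(732, 3, 4.5)
gtx1080 = decode_tokens_per_sec_upper_bound(320, 3, 4.5)

print(f"P100 ceiling:     {p100:.0f} tok/s")
print(f"GTX 1080 ceiling: {gtx1080:.0f} tok/s")
print(f"Ratio:            {p100 / gtx1080:.2f}x")
```

Real throughput lands well below these ceilings (compute, kernel quality and any CPU offload all eat into it), but the ratio between two cards is a reasonable first guess at how they compare when the whole model fits in VRAM.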

[–] Decronym@lemmy.decronym.xyz 3 points 1 week ago* (last edited 5 days ago)

Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I've seen in this thread:

Fewer Letters   More Letters
ARP             Address Resolution Protocol, translates IPs to MAC addresses
IP              Internet Protocol
RPi             Raspberry Pi brand of SBC
SBC             Single-Board Computer

[Thread #321 for this comm, first seen 30th May 2026, 09:50]

[–] somegeek@programming.dev 2 points 1 week ago* (last edited 1 week ago) (1 children)

I started working toward self-hosting an LLM for my small company using ollama and opencode as the agent. But I realized a good model like GLM 5 requires 250GB of RAM and 24GB VRAM with a 4080?? I don't know, this is what the LLM told me itself.

I ended up using qwen-code2.7-7b-16k.

Currently the best thing I have is my laptop, 16GB ram, i7 9750H gtx1650

How do you guys selfhost? What models do you use that are actually good?

[–] SuspiciousCarrot78@aussie.zone 2 points 6 days ago* (last edited 6 days ago) (1 children)

I mean...that entirely depends on your use case - and I hate saying that. For me and what I do, Qwen SLM (esp Qwen3-4B 2507 instruct and Qwen3.5-2B) are exceptional. But I'm not trying to do Claude at home.

Best bet? Spend $10 on OpenRouter and try different models. In a head to head with ChatGPT 5.4 mini (excellent for coding BTW), I've found Qwen 3.5 27B more than able to hold its own for coding tasks...IF you narrowly gate it/confine it. The last batch of Qwen's really are something. Dunno about the 3.7 series.

Having said ALL that, I'm really tempted to go back in time and code myself a deterministic expert system, with user updatable knowledge cascade, tool calling and a minimal amount of Markov chain word garnish for flavour. I think we used to just call that "a program" lol.

Really tempted actually, because if 50% of llm use case is basically Super Google but not shit...well, I can make that myself. I just need to point my autism at it.

PS: this might help

https://www.youtube.com/watch?v=0AqpaFm11oI

[–] somegeek@programming.dev 1 points 6 days ago (1 children)

Qwen 3.5 24B is way too large for my specs. I'm barely running qwen2.5 7B

[–] SuspiciousCarrot78@aussie.zone 2 points 6 days ago* (last edited 6 days ago) (1 children)

Hmm....it runs on a 1060...it's a MoE not a dense. 24B is even lighter. Worth a shot.

https://www.youtube.com/watch?v=8F_5pdcD3HY

Else, if you're looking for a coding model (??) something like Sara or Fara might suit

https://huggingface.co/microsoft/Fara-7B

[–] somegeek@programming.dev 1 points 6 days ago

Thanks. I will look into it.

[–] heartSagan5@lemmy.zip 1 points 1 week ago

And are you sure you’re self-hosting or is it a plugin (that you’re self-hosting)? Also, I don’t invite SkyNet into my perimeter.

[–] Hiro8811@lemmy.world -1 points 1 week ago* (last edited 1 week ago) (1 children)

You're still paying for electricity and a big part of the world is in an electricity crisis. "AI" has few real uses and LLMs are not one of them.

[–] brucethemoose@lemmy.world 22 points 1 week ago* (last edited 1 week ago)

This is a “feel guilty about missing recycling” kind of complaint.

Having a server run for an hour or two (?) a day is negligible. You use more energy running a fridge, or leaving a few lights on, or browsing Lemmy for a while. Or running a docker container for other services. You release more greenhouse gasses eating beef, or driving anywhere, or even opening your front door a few times, and individual industries are going to use vastly more electricity than a few self hosters ever would. If you own an EV, you’ve probably blown out your entire zip code of self hosters.

But if it still bothers you, you can find an ewaste smartphone(s) and host on that. This is actually a very neat use case IMO.


However, if you get to the homelab scale of “an EPYC + 3090s running all the time” that electricity use does start to add up. But that’s quite a rare hobbyist tier, I’d say, and it really shouldn’t be running 24/7.