I'm using anythingllm. It's quite easy to setup and use. I'm impressed of the perf on comodity hardware.
Selfhosted
A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.
Rules:
-
Be civil: we're here to support and learn from one another. Insults won't be tolerated. Flame wars are frowned upon.
-
No spam.
-
Posts here are to be centered around self-hosting. Please ensure it is clear in your post how it relates to self-hosting.
-
Don't duplicate the full text of your blog or git here. Just post the link for folks to click.
-
Submission headline should match the article title.
-
No trolling.
-
Promotion posts require your active participation in selfhosting or related communities, or the post will be removed. No more than 10% of your posts or comments may be self-promotional, or your post will be removed. F/LOSS Exception: If your post is about a project that is completely open source & can be self-hosted in full without payment, your post is exempt from this rule as long as you continue to engage in comments.
Resources:
- selfh.st Newsletter and index of selfhosted software and apps
- awesome-selfhosted software
- awesome-sysadmin resources
- Self-Hosted Podcast from Jupiter Broadcasting
Any issues on the community? Report it using the report flag.
Questions? DM the mods!
I have a simple slow model running on CPU in my cluster for karakeep. I've tried running a variety of models on my 7900XT but even with 16GB their performance just isn't there. My new work m5 Mac book with 48GB of ram is the first time I've seen usable performance for local models and it has been pretty impressive.
I hosted Qwen 3.5 9b uncensored on my site at https://masland.tech/ for a while. I didn't really use it and no one else used it so I took it down. These days I'm spending most of my time finding uses for AI and accessibility. One of the next things I'm planning is a video to text reasoning system, primarily for the purpose of grading used electronic devices.
Yes, I got a Strix Halo machine before the RAM price hike and use it to run all my ML stuff on it.
Currently using llama-swap with llama.cpp/ComfyUI and opencode/Open WebUI as frontend.
I'm running Qwen3.6-27b, Voxtral Mini 4b, Piper and Qwen Image. Also, some embedding and reranking models.
I use them for:
- Tagging and classification of my documents in Paperless
- Home Assistant (voice assistant)
- Translations (both text and image)
- Transcriptions
- Some light coding and debugging
- Avatar/Backdrop generation for DnD sessions
What sort of tok/s are you getting on the strix?
About 200 t/s prompt processing and 10-20 t/s with MTP.
Greatly depends on the task, predictable things like code generates at 18-20 t/s. Creative writing more like 10-17 t/s.
i don't use it at all, i do want some selfhosted speech to text model (whisper?) but my computer is ancient so it would be awfully slow. i have some multi hour audio recordings from presentations, would be nice to have them in text and searchable..
How ancient is ancient? TTS and STT are much lighter than llm. (eg: Whisper, Piper, Kokoro, Coqui etc)....you might have more capability than you think, especially if you're doing batch processing like that.
a haswell xeon e5-1650 machine, i remember running llama 7b in llama.cpp in like 2023 and it was quite sluggish. guess i should try whisper at some point..
Ha. You were doing inference on CPU on a haswell era. Been there, done that.
OTOH...whisper.cpp is heavily optimised for it.
Plus, you're doing batch transcription, not real-time, so slow doesn't actually matter.
Fire Whisper small or medium overnight and wake up to searchable text.
PS: if you want a good fast little llm, something like Qwen 3.6 2B will work well on the Xeon.
No, too expensive. I wish I could but it doesn't make sense financially for me right now, it is much cheaper to buy openrouter credits from time to time
I've been running ministral on CPU on a home-server: works pretty nicely, not very performant for everyday tasks and the savings were not sufficient for it to make sense. It still was cheaper and faster to just use Mistral API and get better models.
Yeah, mostly for translation purposes.
I think I currently have gemma 4 set up.
I have the setup, never found a use for it though.
I tried but I only have 16g of ram and it wouldn't complete a thought alas
I'm running dwarfstar which is a 2 bit deepseek v4 flash. It's quite capable even at 2 bit.
This dwarfstar looks interesting, can you elaborate on your setup and what kind of inference speeds you are getting?
I have a 5080 and 128gb of ram running on a AMD 9950X.
Depending on the task I can get over 170-200t/s when the MOE only calls a few agents and can fit inside the VRAM or as low as 5-10ts when it calls more agents and has to hit the system memory. But for grunt work that doesn't need professor level tasks, it's more than capable and if you have the time, it's super worth it because it's basically free tokens.
I only use this for overnight work to save on tokens during the day. When I'm pulling analytics for my work and it just needs basic analysis that doesn't touch multiple tooks.
During work hours I'm using GLM5.2 for web development, Kimi k2.7 for complicated data analysis and Minimax m3 if I need the context window to be bigger than what kimik2.7 can give me.
An aside for anyone reading this:
https://sleepingrobots.com/dreams/stop-using-ollama/
And that barely scratches the surface. Please.
Use anything but Ollama. Even APIs.
Llama.cpp or death!
Or exllama! Vllm, sglang, Lorax. Koboldcpp, Aphrodite, text-generation-webui, LM Studio, really whatever floats your boat. Just not ollama.
I agree that the concerns listed there are smells, and I wasn't aware of some of the options listed there.
Thank you for sharing this!
I ran through lmstudio because it really eazy, I ran some kind of qwen 3.6 27b imatrix neo code DI, it is the best local model for coding I tried, I think it can be better than some cloud model
Why would I?
I tried Qwen 3.6 a3b and Gemma 4 a4b, but both were too stupid for everyday work.
Yes. My Actual Intelligence lives in my head, and runs mostly on coffee.
Just coffee?!? That's cool.
Mine runs on:
- coffee
- spite
- tortilla chips
- & shame
Mostly on coffee, not exclusively. Noticable amounts of spite & tortilla chips are also present, yes, but... no shame.
Nice!
If that's not already on a shirt it should be
Yeah, I'm using qwen 31b a3b on an amd 9070xt requires a bit of cpu offloading, but still plenty fast. Using it wall llama.cpp. Combine that with some mcp's such as ddg-search to make it truly useful by actually being able to search online.
I mostly use it for small tedious tasks with well defined inputs and outputs. For example when hyprland recently changed from their own configuration language to lua. At first I started going line by line translating my config to the new lua language until I realized oh wait this is exactly the type of thing that ML is useful for. Going from the well defined hyprland configuration language to their also well defined lua syntax. It banged it out in less than a minute with only a single mistake which I easily fixed. The mistake it made was that it forgot to translate the comments to lua. It did it in less than a minute and worked first try. Where as I had made several typos and gotten a few lines wrong when I was doing it by hand.
Not to say that I couldn't do it. I would have gotten it done in about half an hour, but less than a minute is a lot faster.
I also used it to transform a bunch of unstructured data into json data, so that I could then use purpose built tools like jq to parse that. If I'm having trouble finding certain information. I'll ask it to find me some resources to look at.
Basically small well defined tasks and parsing data is what I use it for and it seems to be pretty good at that.
What I don't like is the way companies try to market it to people. I don't believe people should be trying to summarize emails or messages from loved ones, writing essays or any other creative tasks for the most part. Translating is okay. I don't expect a machine to be able to decide things for me or to be some filter between me and others.
I set up ollama on our thinkstation in the lab and I use it for looking up documentation, generating readmes, searching papers, and sometimes coding when I know what to do but don't feel it is worth it to spend time on it myself. So basically the chat with web search.
Which models did you find particularly useful for those tasks?
Gemma 4, gpt oss, and nemotron. Currently I've been sticking with Gemma more, the 31 billion parameters one.
Found vLLM to be the most efficient local runtime service. And "ray" as a good (but complicated) way to distribute the load: https://docs.ray.io/
Yes. Openwebui/ollama for LLM, comfyui for stable diffusion. I just dick around with it as a toy.
I was put off by ComfyUI, seems awfully complex. How is your experience?
Any suggestions to start? I have Fooocus installed now
I run Handy with Parakeet for speech to text, and home assistant with Whiper for the same. Whisper+ on my phone.
I think that counts. But I have more relevant and useful things to do on my hardware and no 2000€+ to get LLM-capable hardware 😂