this post was submitted on 24 Jun 2026
55 points (76.7% liked)

Selfhosted

60093 readers
751 users here now

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.

Rules:

  1. Be civil: we're here to support and learn from one another. Insults won't be tolerated. Flame wars are frowned upon.

  2. No spam.

  3. Posts here are to be centered around self-hosting. Please ensure it is clear in your post how it relates to self-hosting.

  4. Don't duplicate the full text of your blog or git here. Just post the link for folks to click.

  5. Submission headline should match the article title.

  6. No trolling.

  7. Promotion posts require your active participation in selfhosting or related communities, or the post will be removed. No more than 10% of your posts or comments may be self-promotional, or your post will be removed. F/LOSS Exception: If your post is about a project that is completely open source & can be self-hosted in full without payment, your post is exempt from this rule as long as you continue to engage in comments.

Resources:

Any issues on the community? Report it using the report flag.

Questions? DM the mods!

founded 3 years ago
MODERATORS
 

Do you host your own ML / AI / LLM? What do you use, and what do you use it for?

top 50 comments
sorted by: hot top controversial new old
[–] toebert@piefed.social 1 points 3 minutes ago

I have the setup, never found a use for it though.

[–] placebo@lemmy.zip 2 points 32 minutes ago

I tried Qwen 3.6 a3b and Gemma 4 a4b, but both were too stupid for everyday work.

[–] Meatwagon@lemmy.dbzer0.com 3 points 1 hour ago

I tried but I only have 16g of ram and it wouldn't complete a thought alas

[–] JustEnoughDucks@slrpnk.net 1 points 39 minutes ago

I run Handy with Parakeet for speech to text, and home assistant with Whiper for the same. Whisper+ on my phone.

I think that counts. But I have more relevant and useful things to do on my hardware and no 2000€+ to get LLM-capable hardware 😂

[–] dfgxx@lemmy.zip 4 points 1 hour ago

I ran through lmstudio because it really eazy, I ran some kind of qwen 3.6 27b imatrix neo code DI, it is the best local model for coding I tried, I think it can be better than some cloud model

[–] november@piefed.blahaj.zone 2 points 1 hour ago

Why would I?

[–] fubarx@lemmy.world 1 points 1 hour ago

Found vLLM to be the most efficient local runtime service. And "ray" as a good (but complicated) way to distribute the load: https://docs.ray.io/

[–] wrinkle2409@lemmy.cafe 2 points 2 hours ago

I set up ollama on our thinkstation in the lab and I use it for looking up documentation, generating readmes, searching papers, and sometimes coding when I know what to do but don't feel it is worth it to spend time on it myself. So basically the chat with web search.

[–] algernon@lemmy.ml 44 points 8 hours ago (9 children)

Yes. My Actual Intelligence lives in my head, and runs mostly on coffee.

[–] portifornia@piefed.social 5 points 5 hours ago (1 children)

Just coffee?!? That's cool.

Mine runs on:

  • coffee
  • spite
  • tortilla chips
  • & shame
[–] searabbit@piefed.social 6 points 5 hours ago

If that's not already on a shirt it should be

[–] tal@lemmy.today 5 points 7 hours ago (1 children)

Do you get many hallucinations?

[–] algernon@lemmy.ml 8 points 7 hours ago (1 children)

Only when I'm deprived of coffee.

load more comments (1 replies)
[–] SuspiciousCarrot78@aussie.zone 7 points 8 hours ago* (last edited 8 hours ago)

I'll make sure to send you flowers, Algernon.

[–] GreenCrunch@piefed.blahaj.zone 5 points 7 hours ago

critical security bug: if coffee is taken away my head hurts :(

load more comments (5 replies)
[–] D_Air1@lemmy.ml 11 points 6 hours ago

Yeah, I'm using qwen 31b a3b on an amd 9070xt requires a bit of cpu offloading, but still plenty fast. Using it wall llama.cpp. Combine that with some mcp's such as ddg-search to make it truly useful by actually being able to search online.

I mostly use it for small tedious tasks with well defined inputs and outputs. For example when hyprland recently changed from their own configuration language to lua. At first I started going line by line translating my config to the new lua language until I realized oh wait this is exactly the type of thing that ML is useful for. Going from the well defined hyprland configuration language to their also well defined lua syntax. It banged it out in less than a minute with only a single mistake which I easily fixed. The mistake it made was that it forgot to translate the comments to lua. It did it in less than a minute and worked first try. Where as I had made several typos and gotten a few lines wrong when I was doing it by hand.

Not to say that I couldn't do it. I would have gotten it done in about half an hour, but less than a minute is a lot faster.

I also used it to transform a bunch of unstructured data into json data, so that I could then use purpose built tools like jq to parse that. If I'm having trouble finding certain information. I'll ask it to find me some resources to look at.

Basically small well defined tasks and parsing data is what I use it for and it seems to be pretty good at that.

What I don't like is the way companies try to market it to people. I don't believe people should be trying to summarize emails or messages from loved ones, writing essays or any other creative tasks for the most part. Translating is okay. I don't expect a machine to be able to decide things for me or to be some filter between me and others.

[–] Steve@startrek.website 6 points 5 hours ago (6 children)

I recently gave it a try with qwen3.5 and deepseek coder v2. I have a RTX3090 and these are the largest models that can run comfortably on it.

Conclusion, they are both fucking useless. Free tier claude runs circles.

[–] e0qdk@reddthat.com 3 points 58 minutes ago

If you just pulled the default version of qwen3.5 from ollama's repo you downloaded a mediocre one that only uses ~6GB.

Check ollama show qwen3.5 and see if you get something like this in the result:

  Model
    architecture        qwen35    
    parameters          9.7B      
    context length      262144    
    embedding length    4096      
    quantization        Q4_K_M 

This is the default version I got when I first tried using ollama without any experience. It worked, but it's a heavily quantized, lower parameter version of the model -- i.e. it's pretty dumb -- compared to what you can actually run on your hardware.

[–] SuspiciousCarrot78@aussie.zone 2 points 5 hours ago

Yeah :(

Were not there yet on consumer rigs.

load more comments (4 replies)
[–] brucethemoose@lemmy.world 5 points 5 hours ago* (last edited 5 hours ago) (2 children)

An aside for anyone reading this:

https://sleepingrobots.com/dreams/stop-using-ollama/

And that barely scratches the surface. Please.

Use anything but Ollama. Even APIs.

[–] pinball_wizard@lemmy.zip 2 points 1 hour ago

I agree that the concerns listed there are smells, and I wasn't aware of some of the options listed there.

Thank you for sharing this!

[–] comrademiao@piefed.social 3 points 4 hours ago* (last edited 4 hours ago) (1 children)

looks like extreme nitpicking without any real issues beyond some VC funding a FOSS issues.

//whyre you spamming the comment to everyone? its quite alarmist actually

[–] brucethemoose@lemmy.world 2 points 2 hours ago* (last edited 1 hour ago)

I completely disagree.

Frankly, I find the description "VC funding a FOSS" offensive. They aren't funding the engine. I've been messing with LLM inference engines since 2022, and Ollama is the worst I've seen in the community.

They misname models for SEO. They leech off llama.cpp while deliberately hiding attribution yet redirecting GH support requests there. They sometimes make their own GGUFs+forked releases which are broken and incompatibile with upstream llama.cpp, just so they can get a release out a day ahead for hype, even though it doesn't really work and they'll never upstream one line. They set a default context size thats basically unusable, they screw up chat templates and deep internal code with no obvious indicators, they release suboptimal quants without iMatrix, they gate you into their internal quantization repo and model card format, they hide model downloads on your hard drive, they mess with standard APIs for no good reason other than to mess up other backends. I could go on and on.

And if that's all fine, they're enshittifying the app with closed code, and pointers to cloud models.

They GIVE LLM inference a bad name, by making it a terrible quality engine that happens to show up in search as the "default." Hence the comments below of people being unimpressed with local inference. And they sap attention from actual llama.cpp devs, without contributing a single dime. Everyone in the localllama communtity hates their guts, and that's not even getting into the interpersonal drama they've stirred.

They are a leech that's a net drag to the whole community, that we can't get rid of because they're attention grifters. And they've gotten worse and worse over time.


It's more morale to use any cloud API over Ollama, in my eyes. They're a grift.


EDIT: And, to be clear, I’m not against VC funded downstream stuff.

LM Studio is good! Even though it’s closed source.

Tons of downstream projects are great.

[–] mierdabird@lemmy.dbzer0.com 3 points 5 hours ago* (last edited 4 hours ago) (1 children)

I started out playing around with code generation using Ollama/open-webui and qwen 2.5 coder 14b on a 3060 12GB, but ended up on a winding journey with an ex datacenter card called the AMD V620. Its roughly equivalent to an RX 6800XT, but with double the VRAM. At this point i've really done nothing productive with it but learned a lot about bios settings, GPU/ROCm drivers, and custom fan solutions/PWM controls trying to get it setup and optimized haha.

It's pretty sick though, that amount of VRAM with 512GB/s bandwidth can run Qwen 3.6 27B dense with 100k context window at 20 tokens/sec in LM studio. Draws 300 watts at the wall on my ITX chassis (idling about 30w).

I've been dabbling in building an aviation weather and field condition report application using this, but my next step is to rebuild my VS Code environment into a new machine. I'm kinda enjoying just fucking around with building the hardware too though

load more comments (1 replies)
[–] frongt@lemmy.zip 27 points 9 hours ago (2 children)

Yes. Openwebui/ollama for LLM, comfyui for stable diffusion. I just dick around with it as a toy.

[–] Shimitar@downonthestreet.eu 1 points 2 hours ago

I was put off by ComfyUI, seems awfully complex. How is your experience?

Any suggestions to start? I have Fooocus installed now

[–] mesamunefire@piefed.social 9 points 8 hours ago* (last edited 8 hours ago) (3 children)

Same. Its somewhat useful on some very small scripting or tasks...but its mostly just to try out a new model or two. Its not really useful for anything big.

I will have to say....even my tiny models are about as good as Chatgpt/Claude/etc... which makes me think about how much people are spending on tokens regularly. I was able to get the same kind of python script started with my local tiny model that was comparable to the newest Claude code offerings.

load more comments (3 replies)
[–] Shipgirlboy@sh.itjust.works 6 points 6 hours ago

I've thought about it, but I actually could never think of anything I would do with it.

[–] slazer2au@lemmy.world 16 points 9 hours ago
[–] curbstickle@anarchist.nexus 3 points 6 hours ago (6 children)

Yep.

Ollama + about 8 different models at the moment, hosted on a mac mini with open webui as a front end.

Predominantly for transcription, translation, an extra round of security checks on code, a more context friendly home assistant interface, and a daily run of context evaluation on property I'm looking for with a lot of specific needs (acreage, min elevation change, soil type, area, etc).

[–] surewhynotlem@lemmy.world 2 points 5 hours ago (1 children)

I have to recommend switching to llamacpp. It's SO much faster than ollama.

load more comments (1 replies)
[–] irmadlad@lemmy.world 1 points 5 hours ago (1 children)

mac mini

How? What is your average response time?

[–] curbstickle@anarchist.nexus 2 points 4 hours ago

Apple silicon is pretty good at it as long as you've got the ram for it. I wouldn't do less than 16GB.

A few seconds for most of the tasks

load more comments (4 replies)
[–] rando@lemmy.ml 0 points 3 hours ago

Bought b70 with egpu enclosure and usb4 connection wasn't really planning to actually run anything but now ended up with llama.cpp with openwebui - kids/parents want to/have to use chat, might as well provide local solution than them using industry options. Also started with ollama and Gemma 4 26b a4b - asked it to write script to setup llama.cpp in container.

[–] atzanteol@sh.itjust.works 9 points 9 hours ago (8 children)

I've tried a few times but with only 8gig of vram it's simply not worth it.

load more comments (8 replies)
load more comments
view more: next ›