this post was submitted on 24 Jun 2026
132 points (81.7% liked)

Selfhosted

60093 readers
1224 users here now

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.

Rules:

  1. Be civil: we're here to support and learn from one another. Insults won't be tolerated. Flame wars are frowned upon.

  2. No spam.

  3. Posts here are to be centered around self-hosting. Please ensure it is clear in your post how it relates to self-hosting.

  4. Don't duplicate the full text of your blog or git here. Just post the link for folks to click.

  5. Submission headline should match the article title.

  6. No trolling.

  7. Promotion posts require your active participation in selfhosting or related communities, or the post will be removed. No more than 10% of your posts or comments may be self-promotional, or your post will be removed. F/LOSS Exception: If your post is about a project that is completely open source & can be self-hosted in full without payment, and your account is at least 7 days old, your post is exempt from this rule as long as you continue to engage in comments.

Resources:

Any issues on the community? Report it using the report flag.

Questions? DM the mods!

founded 3 years ago
MODERATORS
 

Do you host your own ML / AI / LLM? What do you use, and what do you use it for?

you are viewing a single comment's thread
view the rest of the comments
[–] brucethemoose@lemmy.world 2 points 1 day ago* (last edited 1 day ago) (1 children)

CPU offloading is too slow unless you use a hybrid MoE model, with the --n-cpu-moe parameter, specifically.

This only offloads "sparse" parts of the model to the CPU, which take up a lot of RAM but are very compute-lite to run. In practice, thats most of the size of modern MoE LLMs.

[–] robber@lemmy.ml 1 points 23 hours ago (1 children)

Since implementation of the --fit parameter and its relatives, and --fit on becoming the default, llama.cpp intelligently decides what to offload. For me, it made --n-cpu-moe obsolete.

[–] brucethemoose@lemmy.world 2 points 22 hours ago* (last edited 22 hours ago) (1 children)

Mostly, yeah.

Sometimes it’s better to “cut it close,” with (for instance) a 27B model that’s nearly OOMing your VRAM fully offloaded, but you know will be fine in regular use without too many programs open.

In my case, with MiMo 2.5, it fills both my CPU and GPU RAM rather completely, so it’s best to set a static value so I don’t swap CPU RAM, and don’t OOM on the GPU either.

[–] robber@lemmy.ml 1 points 4 hours ago

You can control how much context should be fitted with --fit-ctx and how much space the algorithm should leave unallocated (even on a per-GPU basis) with --fit-target.