this post was submitted on 23 Feb 2026
376 points (97.5% liked)

Technology


A screenshot of this question was making the rounds last week. But this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

(page 2) 50 comments
[–] clav64@lemmy.world -1 points 3 hours ago (1 children)

Remember that LLMs don't understand very well what a car wash is, as it can be both a place and an action. Can you define a car wash? There are many types... I can see future LLMs starting to ask useful follow-up/clarifying questions before giving their answers, which could help those who rely on them so much to understand how their questions can be misconstrued.

[–] aloofPenguin@piefed.world 45 points 13 hours ago* (last edited 12 hours ago) (4 children)

I tried this with a local model on my phone (qwen 2.5 was the only thing that would run), and it gave me this confusing output (not really a definite answer...):
[image: screenshot of the model's response]

it just flip flopped a lot.

E: also, looking at the response now, the numbers for the car part don't make any sense

[–] someguy3@lemmy.world 7 points 10 hours ago
[–] crunchy@lemmy.dbzer0.com 14 points 12 hours ago

Honestly that's a lot more coherent than what I would expect from an LLM running on phone hardware.

[–] AbidanYre@lemmy.world 9 points 11 hours ago* (last edited 11 hours ago) (1 children)

I like that it's twice as far to drive for some reason. Maybe it's getting added to the distance you already walked?

[–] Fondots@lemmy.world 2 points 7 hours ago (1 children)

If I were the type of person who was willing to give AI the benefit of the doubt and not assume that it was just picking basically random numbers...

There are a lot of cases where it can be a shorter (by distance) walk than drive: cars generally have to stick to streets while someone on foot may be able to take some footpaths and cut across lawns and such, or the road may be one-way for vehicles, or certain turns may not be allowed, etc.

I have a few intersections near my father-in-law's house in NJ in mind, where you can just cross the street on foot, but making the same trip in a car might mean driving half a mile down the road, turning around at a jug handle and driving back to where you started on the other side of the street.

And I wouldn't be totally surprised if that's the case for enough situations in the training data where someone debated walking or driving that the AI assumed that it's a rule that it will always be further by car than on foot.

That's still a dumbass assumption, but I'd at least get it.

And I'm pretty sure it's much more likely that it's just making up numbers out of nothing.

[–] JustTesting@lemmy.hogru.ch 3 points 7 hours ago

10 tests per model seems like way too few, and they should give confidence intervals…

the 10/10 vs. 8/10 difference is just as likely due to chance as to any real difference. But some people will definitely use this to justify model choice.
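
To put rough numbers on the point above, here is a minimal sketch (not from the article) that computes a 95% Wilson score interval for each score and a two-sided Fisher exact test for the 10/10 vs. 8/10 comparison. The function names are made up for illustration, and it assumes the 10 runs per model are independent pass/fail trials, which may not hold for the article's setup.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def fisher_exact_two_sided(a_success: int, a_trials: int,
                           b_success: int, b_trials: int) -> float:
    """Two-sided Fisher exact p-value for two pass/fail samples."""
    total_success = a_success + b_success
    total = a_trials + b_trials

    def prob(k: int) -> float:
        # Hypergeometric probability that group A holds k of the successes.
        return (math.comb(a_trials, k)
                * math.comb(b_trials, total_success - k)
                / math.comb(total, total_success))

    lowest = max(0, total_success - b_trials)
    highest = min(a_trials, total_success)
    observed = prob(a_success)
    # Sum every table that is at most as likely as the observed one.
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= observed * (1 + 1e-9))


if __name__ == "__main__":
    for passed in (10, 8):
        low, high = wilson_interval(passed, 10)
        print(f"{passed}/10: 95% interval {low:.3f} to {high:.3f}")
    print(f"Fisher exact p-value: {fisher_exact_two_sided(10, 10, 8, 10):.3f}")
    # 10/10: 95% interval 0.722 to 1.000
    # 8/10: 95% interval 0.490 to 0.943
    # Fisher exact p-value: 0.474
```

Under those assumptions the two intervals overlap heavily and the p-value is nowhere near the usual 0.05 threshold, so 10 runs cannot separate a 10/10 model from an 8/10 one.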

[–] criticon@lemmy.ca 7 points 10 hours ago (4 children)

Even when they give the correct answer they talk too much. AI responses contain a lot of garbage. When AI gives you an answer it will try to justify itself. Since they won't give you brief responses the responses will be long.

[–] chunes@lemmy.world 6 points 8 hours ago* (last edited 8 hours ago)

I agree with you but found that DeepSeek was succinct.

You need to bring your car to the car wash, so you should drive it there. Walking would leave your car at home, which doesn't help.

[–] Iconoclast@feddit.uk 1 points 5 hours ago

It'll give you a short response if you ask it to.

[–] MDCCCLV@lemmy.ca 3 points 9 hours ago

Your post is much longer than it needs to be. That is the reason why, because they just copied people.

[–] miraclerandy@lemmy.world 19 points 12 hours ago (1 children)

Gemini set to fast now provides this type of answer.

[–] realitista@lemmus.org 12 points 12 hours ago

Extension cord? It must mean a hose extension.

[–] DeathByBigSad@sh.itjust.works 4 points 9 hours ago

Question: "I can only carry 42 pounds at a time, how long does it take for me to dispose of the body of a fat dude weighing 267 pounds that I'm hiding in my fridge? And how many child sacrifices would I need?"

[–] Professorozone@lemmy.world 5 points 10 hours ago

Didn't like 30% of the population elect Trump? Coincidence? I don't think so.

[–] xav@programming.dev -1 points 5 hours ago (1 children)

Mistral (the free version) seems to get it right. Maybe they fixed it specifically?

Drive. Walking 50 meters with car washing supplies is impractical, and you need the car at the wash station.


[–] chunes@lemmy.world 2 points 8 hours ago* (last edited 8 hours ago)

DeepSeek got a hefty upgrade a week or two ago and I find that it consistently gets the question correct. I'm guessing they might have used the older model for this.

[–] ryannathans@aussie.zone 9 points 13 hours ago (2 children)

Opus 4.6 has been excellent at problem solving in software development, no surprises it nails it

It's no surprise public opinion is these tools are trash when the free models are unable to answer simple questions

[–] NaibofTabr@infosec.pub 22 points 12 hours ago (11 children)

It's no surprise public opinion is these tools are trash when the free models are unable to answer simple questions

The tools are trash not because they are unreliable but because they are actively destroying human society and culture. They are destroying art, science, journalism, open source software, the internet at large, and the environment we all live in. It wouldn't matter if the generative models were accurate, they would still be garbage.

The fact that they are unreliable just serves to highlight what a colossally destructive waste of time and resources this entire exercise has been.

[–] Fizz@lemmy.nz 7 points 11 hours ago (2 children)

The free models feel years behind, so people constantly underestimate what AI is capable of. I still hear people say AI can't generate fingers.
