this post was submitted on 23 Feb 2026
418 points (97.7% liked)

Technology

A screenshot of this question was making the rounds last week, but this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

(page 3) 48 comments
[–] DeathByBigSad@sh.itjust.works 4 points 11 hours ago

Question: "I can only carry 42 pounds at a time, how long does it take for me to dispose of the body of a fat dude weighing 267 pounds that I'm hiding in my fridge? And how many child sacrifices would I need?"

[–] Professorozone@lemmy.world 5 points 11 hours ago

Didn't like 30% of the population elect Trump? Coincidence? I don't think so.

[–] JustTesting@lemmy.hogru.ch 2 points 9 hours ago

10 tests per model seems like way too few, and they should give confidence intervals…

The 10/10 vs. 8/10 gap is just as likely due to chance as to any real difference. But some people will definitely use this to justify model choice.
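
To put rough numbers on that, here is a minimal sketch in Python. It assumes each of the 10 runs is an independent pass/fail trial, and it uses a Wilson score interval and a two-sided Fisher exact test. Both are my choice of method for illustration, not anything the article reports, and the function names are made up for this example.

```python
from math import comb, sqrt


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate (z=1.96)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


def fisher_two_sided(k1, n1, k2, n2):
    """Two-sided Fisher exact p-value for k1/n1 passes vs. k2/n2 passes."""
    total_pass = k1 + k2
    total = n1 + n2

    def prob(x):
        # Hypergeometric probability that group 1 gets exactly x passes.
        return comb(n1, x) * comb(n2, total_pass - x) / comb(total, total_pass)

    observed = prob(k1)
    lowest = max(0, total_pass - n2)
    highest = min(n1, total_pass)
    # Sum every table that is no more likely than the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )


for passes in (10, 8):
    low, high = wilson_interval(passes, 10)
    print(f"{passes}/10 -> 95% interval {low:.2f} to {high:.2f}")

print(f"Fisher p-value: {fisher_two_sided(10, 10, 8, 10):.2f}")
```

Under those assumptions the intervals come out to roughly 0.72 to 1.00 for 10/10 and 0.49 to 0.94 for 8/10, which overlap heavily, and the p-value is about 0.47. So 10 runs per model can't really separate a 10/10 model from an 8/10 one.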

[–] ryannathans@aussie.zone 8 points 14 hours ago (15 children)

Opus 4.6 has been excellent at problem solving in software development, no surprises it nails it

It's no surprise public opinion is that these tools are trash when the free models are unable to answer simple questions

[–] Fizz@lemmy.nz 6 points 13 hours ago (2 children)

The free models feel years behind, so people constantly underestimate what it's capable of. I still hear people say AI can't generate fingers.

[–] chunes@lemmy.world 2 points 10 hours ago* (last edited 10 hours ago)

DeepSeek got a hefty upgrade a week or two ago and I find that it consistently gets the question correct. I'm guessing they might have used the older model for this.

[–] xav@programming.dev -1 points 7 hours ago (1 children)

Mistral (the free version) seems to get it right. Maybe they fixed it specifically?

Drive. Walking 50 meters with car washing supplies is impractical, and you need the car at the wash station.

[–] Rhoeri@lemmy.world -4 points 12 hours ago

I remember years ago getting downvoted into oblivion both here and on Reddit for saying that AI would be a disaster.

[–] Rentlar@lemmy.ca -4 points 15 hours ago (1 children)

Kinda neat about the human responses... sure some are trolling but maybe we have to test our global expectations. In North America, a car wash tends to be this garage thing with either automated cleaning or a set of supplies to clean your car, and your car has to be in the shed to be cleaned effectively. But if washing your car by hand is the norm, I wonder if people in some countries surmise that the cleaning staff could just walk over with the sponges, buckets and hoses and stuff to the car, if you're already 50 metres away from the washing point.
