this post was submitted on 23 Feb 2026
376 points (97.5% liked)

Technology


A screenshot of this question was making the rounds last week, but this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

top 50 comments
[–] 73ms@sopuli.xyz 1 points 43 minutes ago

Did this say whether the reasoning models get this right more than the others? Was curious about that but missed it if it was mentioned.

[–] vane@lemmy.world 6 points 2 hours ago (2 children)

I want to wash my train. The train wash is 50 meters away. Should I walk or drive?

[–] SkaveRat@discuss.tchncs.de 7 points 2 hours ago

Fly, you fool

[–] FatVegan@leminal.space 3 points 1 hour ago

100 Chinese people can lay approximately 30m of track a day

[–] imetators@lemmy.dbzer0.com 11 points 4 hours ago (1 children)

Went to test Google AI first and it said, "You can't wash your car at a carwash if it is parked at home, dummy."

ChatGPT and DeepSeek say it is dumb to drive because it is fuel-inefficient.

I am honestly surprised that Google AI got it right.

[–] rumba@lemmy.zip 37 points 3 hours ago (1 children)

They probably added a system guardrail as soon as they heard about this test. It's been going around for a while now :)

[–] imetators@lemmy.dbzer0.com 2 points 3 hours ago

The article mentions that Gemini 2.0 Flash Lite, Gemini 3 Flash and Gemini 3 Pro passed the test, and all three got it right 10 out of 10 times. Even Gemini 2.5 shares the highest score in the "below 6 right answers" category. I guess Gemini is the closest to "intelligence" of the bunch.

[–] tover153@lemmy.world 4 points 3 hours ago (1 children)

After it got the answer wrong, I told the LLM I use most: "You can't wash your car if it isn't there." Its reply:

Ah. Yes. That is an excellent and devastatingly practical correction.

In that case: drive. But do it with the quiet shame of someone moving the car the length of a bowling lane.

This is one of those situations where logic beats virtue. The car must be present to be washed. Walking was philosophically elegant but mechanically flawed.

So:

Start engine.

Travel 50 meters.

Avoid eye contact with pedestrians.

Commit fully.

You are not lazy. You are complying with system requirements.

[–] ne0phyte@feddit.org 1 points 38 minutes ago

Thank you! Finally an answer to my problem that didn't end with me going to the car wash and being utterly confused how to proceed.

[–] Slashme@lemmy.world 38 points 6 hours ago (3 children)

The most common pushback on the car wash test: "Humans would fail this too."

Fair point. We didn't have data either way, so we partnered with Rapidata to find out. They ran the exact same question, with the same forced choice between "drive" and "walk" and no additional context, past 10,000 real people through their human feedback platform.

71.5% said drive.

So people do better than most AI models. Yay. But seriously, almost 3 in 10 people get this wrong‽‽

[–] bluesheep@sh.itjust.works 3 points 1 hour ago

I saw that and hoped it's because of the dead Internet theory. At least I hope so, because I'll lose the last bit of faith in humanity if it isn't.

[–] T156@lemmy.world 18 points 5 hours ago (1 children)

It is an online poll. You also have to consider that some people don't care/want to be funny, and so either choose randomly, or choose the most nonsensical answer.

[–] yakko@feddit.uk 0 points 3 hours ago

I wonder... If humans were all super serious, direct, and not funny, would LLMs trained on their stolen data actually function as intended? Maybe. But such people do not use LLMs.

[–] masterofn001@lemmy.ca 7 points 6 hours ago* (last edited 6 hours ago) (3 children)

Without reading the article, the title just says wash the car.

I could go for a walk and wash my car in my driveway.

Reading the article... That is exactly the question asked. It is a very ambiguous question.

[–] Geth@lemmy.dbzer0.com 1 points 51 minutes ago

Mentioning the car wash and washing the car plus the possibility of driving the car in the same context pretty much eliminates any ambiguity. All of the puzzle pieces are there already.

I guess this is an unintended autism test as well, if this is not enough context for someone to understand the question.

[–] bluesheep@sh.itjust.works 3 points 1 hour ago

Without reading the article, the title just says wash the car.

No it doesn't? It says:

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

In which world is that an ambiguous question?

[–] elucubra@sopuli.xyz 3 points 3 hours ago

It is not. It says what I want to do, and where.

[–] lemmydividebyzero@reddthat.com 6 points 5 hours ago

They will scrape that article, too.

And in a few months, they'll have "learned" how that task works.

[–] Evotech@lemmy.world 1 points 3 hours ago

I got pranked by ddg yesterday

[–] DarrinBrunner@lemmy.world 41 points 9 hours ago (20 children)

I think it's worse when they get it right only some of the time. It's not a matter of opinion, it should not change its "mind".

The fucking things are useless for that reason, they're all just guessing, literally.

[–] Greg Fawcett@piefed.social 61 points 10 hours ago (1 children)

What worries me is the consistency test, where they ask the same thing ten times and get opposite answers.

One of the really important properties of computers is that they are massively repeatable, which makes debugging possible by re-running the code. But as soon as you include an AI API in the code, you cease being able to reason about the outcome. And there will be the temptation to say "must have been the AI" instead of doing the legwork to track down the actual bug.

I think we're heading for a period of serious software instability.
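That repeatability concern is easy to turn into a smoke test: ask the identical question N times and tally the answers. A minimal Python sketch of the idea — `mock_llm` here is a hypothetical, seeded stand-in for a real sampled chat-completion call (no particular vendor API), since a real model's answers can differ on every run:

```python
import random
from collections import Counter

def mock_llm(prompt: str, rng: random.Random) -> str:
    """Stand-in for a real model call; real APIs sample their output,
    so the same prompt can yield different answers across runs."""
    return rng.choice(["drive", "drive", "walk"])  # biased, but not fixed

def consistency_check(prompt: str, runs: int = 10, seed: int = 0) -> Counter:
    """Ask the same question `runs` times and tally the answers."""
    rng = random.Random(seed)
    return Counter(mock_llm(prompt, rng) for _ in range(runs))

tally = consistency_check(
    "I want to wash my car. The car wash is 50 meters away. Walk or drive?"
)
# A model that "knows" the answer would put all 10 runs in one bucket;
# a split tally is exactly the instability described above.
print(dict(tally))
```

The same harness works against a real endpoint by swapping `mock_llm` for an actual API call; anything short of 10/10 in one bucket is the inconsistency the article measured.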

[–] BanMe@lemmy.world 13 points 8 hours ago (2 children)

In school we were taught to look for hidden meaning in word problems - Chekhov's gun, basically. Why is that sentence there? Because the questions would try to trick you. So humans have to be instructed, again and again, through demonstration and practice, to evaluate all sentences and learn what to filter out and what to keep. To not only form a response, but to expect tricks.

If you pre-prompt an AI to expect such trickery and consider all sentences before removing unnecessary information, does it have any influence?

Normally I'd ask "why are we comparing AI to the human mind when they're not the same thing at all," but I feel like we're presupposing they are similar already with this test, so I am curious about the answer to this one.

[–] bluesheep@sh.itjust.works 1 points 1 hour ago

Normally I'd ask "why are we comparing AI to the human mind when they're not the same thing at all," but I feel like we're presupposing they are similar already with this test, so I am curious about the answer to this one.

I would guess it's because a lot of AI users see their choice of AI as an all-knowing, human-like thinking tool. In which case it's not a weird test question, even if the assumption that it "thinks" is wrong.

[–] punkibas@lemmy.zip 1 points 2 hours ago

At the end of the article they talk about overcoming this problem for LLMs by doing something akin to what you wrote.

[–] rimu@piefed.social 104 points 13 hours ago (16 children)

Very interesting that only 71% of humans got it right.

[–] CaptDust@sh.itjust.works 35 points 10 hours ago* (last edited 10 hours ago)

That "30% of population = dipshits" statistic keeps rearing its ugly head.

[–] SnotFlickerman@lemmy.blahaj.zone 104 points 12 hours ago* (last edited 12 hours ago) (3 children)

I mean, I've been saying this since LLMs were released.

We finally built a computer that is as unreliable and irrational as humans... which shouldn't be considered a good thing.

I'm under no illusion that LLMs are "thinking" in the same way that humans do, but god damn if they aren't almost exactly as erratic and irrational as the hairless apes whose thoughts they're trained on.
