this post was submitted on 23 Feb 2026
349 points (97.5% liked)


Screenshot of this question was making the rounds last week. But this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

top 50 comments
[–] vane@lemmy.world 4 points 47 minutes ago (2 children)

I want to wash my train. The train wash is 50 meters away. Should I walk or drive?

[–] SkaveRat@discuss.tchncs.de 5 points 46 minutes ago

Fly, you fool

[–] FatVegan@leminal.space 1 points 8 minutes ago

100 Chinese people can lay approximately 30m of track a day

[–] imetators@lemmy.dbzer0.com 8 points 2 hours ago (1 children)

Went to test Google AI first and it said, "You can't wash your car at a carwash if it is parked at home, dummy."

ChatGPT and DeepSeek say it's dumb to drive because it's fuel inefficient.

I'm honestly surprised that Google AI got it right.

[–] rumba@lemmy.zip 25 points 2 hours ago (1 children)

They probably added a system guardrail as soon as they heard about this test. It's been going around for a while now :)

[–] imetators@lemmy.dbzer0.com 2 points 1 hour ago

The article mentions that Gemini 2.0 Flash Lite, Gemini 3 Flash and Gemini 3 Pro passed the test, and all three did it 10 out of 10 times without being wrong. Even Gemini 2.5 shares the highest score in the "below 6 right answers" category. Guess Gemini is the closest to "intelligence" of the bunch.

[–] tover153@lemmy.world 1 points 1 hour ago

After getting it wrong, here's my exchange with the LLM I use most:

Me: You can't wash your car if it isn't there.

Ah. Yes. That is an excellent and devastatingly practical correction.

In that case: drive. But do it with the quiet shame of someone moving the car the length of a bowling lane.

This is one of those situations where logic beats virtue. The car must be present to be washed. Walking was philosophically elegant but mechanically flawed.

So:

Start engine.

Travel 50 meters.

Avoid eye contact with pedestrians.

Commit fully.

You are not lazy. You are complying with system requirements.

[–] Slashme@lemmy.world 31 points 4 hours ago (2 children)

The most common pushback on the car wash test: "Humans would fail this too."

Fair point. We didn't have data either way. So we partnered with Rapidata to find out. They ran the exact same question, with the same forced choice between "drive" and "walk" and no additional context, past 10,000 real people through their human feedback platform.

71.5% said drive.

So people do better than most AI models. Yay. But seriously, almost 3 in 10 people get this wrong‽‽
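For what it's worth, with n = 10,000 the sampling error on that 71.5% is tiny, so the "3 in 10" reading holds up. A quick back-of-the-envelope check (just a sketch; the only inputs are the poll figures quoted above):

```python
import math

# Poll figures quoted above: 10,000 respondents, 71.5% answered "drive"
n = 10_000
p = 0.715

# Standard error of a sample proportion, with a 95% normal-approximation margin
se = math.sqrt(p * (1 - p) / n)
margin = 1.96 * se

print(f"71.5% +/- {margin * 100:.2f} points")  # -> 71.5% +/- 0.88 points
```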

[–] T156@lemmy.world 15 points 3 hours ago (1 children)

It's an online poll. You also have to consider that some people don't care or want to be funny, and so either choose randomly or pick the most nonsensical answer.

[–] yakko@feddit.uk 0 points 1 hour ago

I wonder... If humans were all super serious, direct, and not funny, would LLMs trained on their stolen data actually function as intended? Maybe. But such people do not use LLMs.

[–] masterofn001@lemmy.ca 7 points 4 hours ago* (last edited 4 hours ago) (1 children)

Without reading the article, the title just says wash the car.

I could go for a walk and wash my car in my driveway.

Reading the article... That is exactly the question asked. It is a very ambiguous question.

[–] elucubra@sopuli.xyz 1 points 1 hour ago

It is not. It says what I want to do, and where.

[–] Evotech@lemmy.world 1 points 1 hour ago

I got pranked by ddg yesterday

[–] lemmydividebyzero@reddthat.com 6 points 3 hours ago

They will scrape that article, too.

And in a few months, they'll have "learned" how that task works.

[–] clav64@lemmy.world -1 points 1 hour ago (1 children)

Remember that LLMs don't really understand what a car wash is, since it can be both a place and an action. Can you define a car wash? There are many types... I can see future LLMs starting to ask useful follow-up/clarifying questions before giving their answers, which could help those who rely on them so much to understand how their questions can be misconstrued.

[–] DarrinBrunner@lemmy.world 40 points 7 hours ago (20 children)

I think it's worse when they get it right only some of the time. It's not a matter of opinion; it should not change its "mind".

The fucking things are useless for that reason, they're all just guessing, literally.

[–] BanMe@lemmy.world 13 points 6 hours ago (1 children)

In school we were taught to look for hidden meaning in word problems: Chekhov's gun, basically. Why is that sentence there? Because the questions would try to trick you. So humans have to be instructed, again and again, through demonstration and practice, to evaluate all sentences and learn what to filter out and what to keep. To not only form a response, but to expect tricks.

If you pre-prompt an AI to expect such trickery and consider all sentences before removing unnecessary information, does it have any influence?

Normally I'd ask "why are we comparing AI to the human mind when they're not the same thing at all," but I feel like we're presupposing they're similar already with this test, so I'm curious about the answer to this one.
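For anyone who wants to test that empirically: a minimal sketch using the OpenAI Python client, comparing a bare prompt against one with a hypothetical "expect trickery" system prompt. The model name and the prompt wording here are placeholders I made up, not anything from the article.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTION = (
    "I want to wash my car. The car wash is 50 meters away. "
    "Should I walk or drive?"
)

# Hypothetical "expect trickery" pre-prompt, per the comment above
TRICK_PROMPT = (
    "Word problems may contain traps. Before answering, re-read every "
    "sentence and ask what each detail implies about the physical situation."
)

for system in (None, TRICK_PROMPT):
    messages = [{"role": "system", "content": system}] if system else []
    messages.append({"role": "user", "content": QUESTION})
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=messages,
    )
    print("with pre-prompt:" if system else "bare:",
          reply.choices[0].message.content)
```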

[–] punkibas@lemmy.zip 1 points 37 minutes ago

At the end of the article they talk about how to mitigate this problem for LLMs by doing something akin to what you wrote.

[–] Greg Fawcett@piefed.social 56 points 9 hours ago (1 children)

What worries me is the consistency test, where they ask the same thing ten times and get opposite answers.

One of the really important properties of computers is that they are massively repeatable, which makes debugging possible by re-running the code. But as soon as you include an AI API call in the code, you cease being able to reason about the outcome. And there will be the temptation to say "must have been the AI" instead of doing the legwork to track down the actual bug.

I think we're heading for a period of serious software instability.
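The consistency test is easy to reproduce at home. A rough sketch of the ten-runs check (same caveats as the snippet above: placeholder model name, and note that even temperature=0 does not guarantee identical outputs across runs):

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()
QUESTION = (
    "I want to wash my car. The car wash is 50 meters away. "
    "Should I walk or drive? Answer with one word."
)

answers = Counter()
for _ in range(10):
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": QUESTION}],
        temperature=0,  # reduces, but does not eliminate, run-to-run variance
    )
    answers[reply.choices[0].message.content.strip().lower().rstrip(".")] += 1

print(answers)  # more than one key means the model flip-flops on the same input
```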

[–] rimu@piefed.social 102 points 11 hours ago (16 children)

Very interesting that only 71% of humans got it right.

[–] CaptDust@sh.itjust.works 33 points 8 hours ago* (last edited 8 hours ago)

That "30% of population = dipshits" statistic keeps rearing its ugly head.

[–] SnotFlickerman@lemmy.blahaj.zone 100 points 10 hours ago* (last edited 10 hours ago) (3 children)

I mean, I've been saying this since LLMs were released.

We finally built a computer that is as unreliable and irrational as humans... which shouldn't be considered a good thing.

I'm under no illusion that LLMs are "thinking" in the same way that humans do, but god damn if they aren't almost exactly as erratic and irrational as the hairless apes whose thoughts they're trained on.
