this post was submitted on 23 Feb 2026
418 points (97.7% liked)

Technology


Screenshot of this question was making the rounds last week. But this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

[–] Bluewing@lemmy.world 1 points 22 seconds ago

I just asked Google Gemini 3 "The car is 50 miles away. Should I walk or drive?"

In its breakdown comparison between walking and driving, under walking the last reason to not walk was labeled "Recovery: 3 days of ice baths and regret."

And under reasons to walk, "You are a character in a post-apocalyptic novel."

Methinks I detect notes of sarcasm...

[–] TankovayaDiviziya@lemmy.world 4 points 1 hour ago (1 children)

We poked fun at this meme, but it goes to show that the LLM is still like a child that needs to be taught to make implicit assumptions and possess contextual knowledge. The current model of LLM needs a lot more input and instructions to do what you want it to do specifically, like a child.

[–] prole@lemmy.blahaj.zone 2 points 59 minutes ago

I'm sure it'll be worth it at some point 🙄

[–] Fmstrat@lemmy.world 1 points 1 hour ago (1 children)

Qwen3 feels left out. All 30B models I have failed the test.

[–] SuspciousCarrot78@lemmy.world 2 points 58 minutes ago* (last edited 47 minutes ago)

Qwen3-4B HIVEMIND (abliterated) got it in 2, though it scores a lot higher on PIQA, HellaSwag and Winogrande benchmarks than normal Qwen3-30B. I think the new abliteration methods actually strengthen real world understanding.

https://imgur.com/a/7YZme4i

https://imgur.com/a/25ApzDN

I wonder if an abliterated VL model could do even better? They tend to have the best real world model benchmarks. Perhaps a Qwen3-VL-30B ablit (if such a thing exists) could one shot this.

I'd like to think a lot of these gotcha prompts rely on verbal misunderstanding, rather than failure in world models, but I can't say that for certain.

PS: Saw a pearler of a response to this: ChatGPT recommended "yeah, lift the car and carry it on your back. Make sure to bend your knees" (though I'm guessing someone edited that for the lulz)

[–] vane@lemmy.world 11 points 4 hours ago (2 children)

I want to wash my train. The train wash is 50 meters away. Should I walk or drive?

[–] SkaveRat@discuss.tchncs.de 11 points 4 hours ago

Fly, you fool

[–] FatVegan@leminal.space 3 points 3 hours ago

100 Chinese people can lay approximately 30m of track a day

[–] melsaskca@lemmy.ca 0 points 1 hour ago

I don't use AI but read a lot about it. I now want to google how it attacks the trolley problem.

[–] 73ms@sopuli.xyz 1 points 2 hours ago

Did this say whether the reasoning models get this right more than the others? Was curious about that but missed it if it was mentioned.

[–] imetators@lemmy.dbzer0.com 13 points 5 hours ago (1 children)

Went to test Google AI first and it says "You can't wash your car at a carwash if it is parked at home, dummy"

ChatGPT and DeepSeek say it is dumb to drive because it is fuel inefficient.

I am honestly surprised that google AI got it right.

[–] rumba@lemmy.zip 51 points 5 hours ago (1 children)

They probably added a system guardrail as soon as they heard about this test. It's been going around for a while now :)

[–] imetators@lemmy.dbzer0.com 2 points 5 hours ago

The article mentions that Gemini 2.0 Flash Lite, Gemini 3 Flash and Gemini 3 Pro have passed the test. All 3 of them also got it right 10 out of 10 times. Even Gemini 2.5 shares the highest score in the category of "below 6 right answers". Guess Gemini is the closest to "intelligence" out of the bunch.

[–] tover153@lemmy.world 5 points 5 hours ago (2 children)

After getting it wrong, the LLM I use most:

Me: You can't wash your car if it isn't there.

Ah. Yes. That is an excellent and devastatingly practical correction.

In that case: drive. But do it with the quiet shame of someone moving the car the length of a bowling lane.

This is one of those situations where logic beats virtue. The car must be present to be washed. Walking was philosophically elegant but mechanically flawed.

So:

Start engine.

Travel 50 meters.

Avoid eye contact with pedestrians.

Commit fully.

You are not lazy. You are complying with system requirements.

[–] SaltySalamander@fedia.io 1 points 59 minutes ago

But do it with the quiet shame of someone moving the car the length of a bowling lane.

A bowling lane is a bit over 18 meters. =)

[–] ne0phyte@feddit.org 1 points 2 hours ago

Thank you! Finally an answer to my problem that didn't end with me going to the car wash and being utterly confused how to proceed.

[–] Slashme@lemmy.world 42 points 8 hours ago (4 children)

The most common pushback on the car wash test: "Humans would fail this too."

Fair point. We didn't have data either way. So we partnered with Rapidata to find out. They ran the exact same question with the same forced choice between "drive" and "walk," no additional context, past 10,000 real people through their human feedback platform.

71.5% said drive.

So people do better than most AI models. Yay. But seriously, almost 3 in 10 people get this wrong‽‽
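For scale, the arithmetic behind that "almost 3 in 10" can be sketched like this. It's only a rough sketch: it assumes exactly 10,000 respondents (the quoted figure may be rounded) and that everyone who didn't pick "drive" picked "walk", since the choice was forced. The helper name `split_counts` is made up for illustration.

```python
def split_counts(total, drive_pct):
    """Return (drive, walk) head counts for a forced two-way choice."""
    # Assumption: every respondent picked exactly one of the two options.
    drive = round(total * drive_pct / 100)
    walk = total - drive
    return drive, walk


# Assumed inputs: 10,000 respondents, 71.5% answered "drive".
drive, walk = split_counts(10_000, 71.5)
walk_share = walk / 10_000

print(f"drive: {drive}, walk: {walk}, walk share: {walk_share:.1%}")
```

Under those assumptions that is roughly 2,850 people choosing "walk", so "almost 3 in 10" holds.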

[–] snooggums@piefed.world 1 points 3 minutes ago

Have you seen the results of elections?

[–] bluesheep@sh.itjust.works 5 points 3 hours ago

I saw that and hoped it's because of the dead Internet theory. At least I hope so, because I'll be losing the last bit of faith in humanity if it isn't.

[–] T156@lemmy.world 23 points 7 hours ago (1 children)

It is an online poll. You also have to consider that some people don't care/want to be funny, and so either choose randomly, or choose the most nonsensical answer.

[–] masterofn001@lemmy.ca 5 points 7 hours ago* (last edited 7 hours ago) (3 children)

Without reading the article, the title just says wash the car.

I could go for a walk and wash my car in my driveway.

Reading the article... That is exactly the question asked. It is a very ambiguous question.

[–] bluesheep@sh.itjust.works 7 points 3 hours ago

Without reading the article, the title just says wash the car.

No it doesn't? It says:

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

In which world is that an ambiguous question?

[–] Geth@lemmy.dbzer0.com 1 points 2 hours ago

Mentioning the car wash and washing the car plus the possibility of driving the car in the same context pretty much eliminates any ambiguity. All of the puzzle pieces are there already.

I guess this is an unintended autism test as well if this is not enough context for someone to understand the question.

[–] elucubra@sopuli.xyz 4 points 5 hours ago

It is not. It says what I want to do, and where.

[–] lemmydividebyzero@reddthat.com 6 points 7 hours ago

They will scrape that article, too.

And in a few months, they will have "learned" how that task works.

[–] DarrinBrunner@lemmy.world 43 points 11 hours ago (24 children)

I think it's worse when they get it right only some of the time. It's not a matter of opinion, it should not change its "mind".

The fucking things are useless for that reason, they're all just guessing, literally.
