Technology


Car Wash Test on 53 leading AI models: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?" (opper.ai)

submitted 2 months ago by fubarx@lemmy.world to c/technology@lemmy.world


A screenshot of this question was making the rounds last week. But this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

[–] Greg Fawcett@piefed.social 116 points 2 months ago (5 children)

What worries me is the consistency test, where they ask the same thing ten times and get opposite answers.

One of the really important properties of computers is that they are massively repeatable, which makes debugging possible by re-running the code. But as soon as you include an AI API in the code, you cease being able to reason about the outcome. And there will be the temptation to say "must have been the AI" instead of doing the legwork to track down the actual bug.
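For anyone who wants to see what that kind of consistency check looks like, here is a minimal sketch. It does not call any real AI API: `fake_model` is a hypothetical stand-in that picks an answer at random, and the 70/30 split is an assumption made up purely for illustration. The point is only the shape of the test: same prompt, many runs, tally the answers.

```python
import random
from collections import Counter


def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a sampled chatbot reply (not a real API).

    The 70/30 split below is an assumption chosen only for illustration.
    """
    return "drive" if rng.random() < 0.7 else "walk"


def consistency(prompt: str, runs: int, rng: random.Random) -> Counter:
    """Ask the same prompt `runs` times and tally the distinct answers."""
    return Counter(fake_model(prompt, rng) for _ in range(runs))


def is_consistent(tally: Counter) -> bool:
    """True only if every run produced the same answer."""
    return len(tally) == 1


if __name__ == "__main__":
    prompt = (
        "I want to wash my car. The car wash is 50 meters away. "
        "Should I walk or drive?"
    )
    tally = consistency(prompt, runs=10, rng=random.Random(0))
    print(tally, "consistent" if is_consistent(tally) else "inconsistent")
```

Note that the stub is only repeatable here because the random generator is seeded explicitly; that is the kind of control over the source of randomness that ordinary code gives you and a hosted model may not.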

I think we're heading for a period of serious software instability.

[–] XLE@piefed.social 18 points 1 month ago

AI chatbots come with randomization enabled by default. Even if you completely disable it (as another reply mentions, "temperature" can be controlled), you can change a single letter and get a totally different and wrong result too. It's an unfixable "feature" of the chatbot system.

[–] bss03@infosec.pub 8 points 2 months ago* (last edited 2 months ago) (1 children)

Yeah, software is already not as deterministic as I'd like. I've encountered several bugs in my career where erroneous behavior would only show up if uninitialized memory happened to have "the wrong" values -- not zero values, and not the fences that the debugger might try to use. And, mocking or stubbing remote API calls is another way replicable behavior evades realization.

Having "AI" make a control flow decision is just insane. Even the most sophisticated LLMs are just not fit for the task.

What we need is more proved-correct programs via some marriage of proof assistants and CompCert (or another verified compiler pipeline), not more vague specifications and ad-hoc implementations that happen to escape into production.
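To give a flavor of what "proved-correct" means at the smallest possible scale, here is a toy sketch in Lean 4. It is only an illustration, not anything from CompCert or a real verified pipeline: `double` and the theorem name are made up for the example, and it assumes a recent Lean 4 toolchain where the `omega` tactic is available.

```lean
-- A tiny program: double a natural number by adding it to itself.
def double (n : Nat) : Nat := n + n

-- A machine-checked specification: for every input, `double n` equals `2 * n`.
-- The proof unfolds the definition and lets `omega` discharge the arithmetic.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

If the definition were changed so that the statement no longer held, the file would stop compiling, which is the property being asked for: the specification is checked for all inputs rather than sampled by tests.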

But I'm very biased (I'm sure "AI" has "stolen" my IP, and "AI" is coming for my (programming) job(s)), and quite unimpressed with the "AI" models I've interacted with, especially in areas I'm an expert in, but also in areas where I'm not an expert but am very interested and capable of doing any sort of critical verification.

[–] softwarist@programming.dev 2 points 1 month ago (1 children)

You might be interested in Lean.

[–] bss03@infosec.pub 3 points 1 month ago* (last edited 1 month ago) (1 children)

Yes, I've written some Lean. It's not my favorite programming language or proof assistant, but it seems to have "captured the zeitgeist" and has an actively growing ecosystem.

[–] softwarist@programming.dev 2 points 1 month ago (2 children)

Fair enough. So what are your favorites?

[–] bss03@infosec.pub 3 points 1 month ago

Right now, I'm spending more time in Idris. It's not a great proof assistant, but I think it's a lot easier to write programs in. Rocq is the real proof assistant I've used, but I don't have a strong opinion on them because all the proofs I've wanted/needed to write were small enough to need minimal assistance. (The bare bones features that are in Agda or Idris were enough.)

[–] bss03@infosec.pub 2 points 1 month ago* (last edited 1 month ago)

Also, my preference shouldn't matter to anyone else. If you want to increase your proof assistant skill (even from nothing), I suggest Lean. Probably the same if you want to increase programming skill in a dependently typed language.

Honestly, I should get more comfortable with it.

[–] merc@sh.itjust.works 3 points 1 month ago

It's also the case that people are mostly consistent.

Take a question like "how long would it take to drive from here to [nearby city]". You'd expect that someone's answer to that question would be pretty consistent day-to-day. If you asked someone else, you might get a different answer, but you'd also expect that answer to be pretty consistent. If you asked someone that same question a week later and got a very different answer, you'd strongly suspect that they were making the answer up on the spot but pretending to know so they didn't look stupid or something.

Part of what bothers me about LLMs is that they give that same sense of bullshitting answers while trying to cover that they don't know. You know that if you ask the question again, or phrase it slightly differently, you might get a completely different answer.

[–] JcbAzPx@lemmy.world 1 points 1 month ago

This is necessary for sounding like reasonable language and an inherent reason for "hallucinations". If it didn't have variation it would inevitably output the same answer to any given input.

[–] Fmstrat@lemmy.world 0 points 1 month ago (1 children)

This is adjustable via temperature. It is set higher on chatbots, causing the answers to be more random. It's set lower on code assistants to make things more deterministic.
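For reference, a minimal sketch of how temperature is usually described as working: the model's raw scores (logits) are divided by the temperature before being turned into probabilities, so a lower temperature concentrates probability on the top choice and a higher one spreads it out. The function names and the example logits are made up for illustration, and treating temperature 0 as "always pick the top score" is a common convention assumed here, not something taken from any particular vendor's API.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Pick an index: greedy when temperature is 0, random otherwise."""
    if temperature == 0:
        # Assumed convention: temperature 0 means always take the top score.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
    rng = random.Random(0)
    for t in (0.1, 1.0, 2.0):
        print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
    print([sample_token(logits, 1.0, rng) for _ in range(10)])
```

In this sketch, any temperature above 0 still leaves some chance of picking a lower-scored token; only the greedy branch is fully repeatable.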

[–] snooggums@piefed.world 5 points 1 month ago

Changing the amount of randomness still results in enough randomness to be random.