[–] Catoblepas@piefed.blahaj.zone 1 points 1 hour ago (1 children)

It’s highly unlikely they reduced power usage—one of the most consistent criticisms of LLMs and image generation—without advertising it.

 

Using supervised fine-tuning (SFT) to introduce even a small amount of relevant data to the training set can often lead to strong improvements in this kind of "out of domain" model performance. But the researchers say that this kind of "patch" for various logical tasks "should not be mistaken for achieving true generalization. ... Relying on SFT to fix every [out of domain] failure is an unsustainable and reactive strategy that fails to address the core issue: the model’s lack of abstract reasoning capability."
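The researchers' point can be made concrete with a toy analogy. The sketch below is not real SFT (no gradients, no model); it stands in for the failure mode they describe with a lookup-table "model" that only answers patterns it has memorized, where "patching" a failing case fixes exactly that case and nothing adjacent to it. The class and example prompts are invented for illustration:

```python
# Toy illustration (NOT actual supervised fine-tuning): a lookup-table
# "model" that mirrors the paper's claim that patching individual
# out-of-domain failures does not produce general reasoning ability.

class PatternMatcher:
    """Answers by exact lookup over prompt/answer pairs seen in 'training'."""

    def __init__(self, training_pairs):
        self.memory = dict(training_pairs)

    def answer(self, prompt):
        # In-distribution: return the memorized answer.
        # Out-of-distribution: confidently produce something wrong.
        return self.memory.get(prompt, "fluent nonsense")

    def sft_patch(self, new_pairs):
        # The "SFT patch": add the failing cases to memory. This fixes
        # those exact cases but no variant of them.
        self.memory.update(new_pairs)


model = PatternMatcher({"2 + 2": "4", "3 + 3": "6"})
print(model.answer("2 + 2"))   # in-distribution: "4"
print(model.answer("4 + 4"))   # out-of-distribution: "fluent nonsense"
model.sft_patch({"4 + 4": "8"})
print(model.answer("4 + 4"))   # the patched case now works: "8"
print(model.answer("5 + 5"))   # ...but the next variant still fails
```

The reactive loop the researchers warn about is visible in the last two lines: each patch closes one gap while leaving the underlying lack of abstraction untouched.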

Rather than showing the capability for generalized logical inference, these chain-of-thought models are "a sophisticated form of structured pattern matching" that "degrades significantly" when pushed even slightly outside of their training distribution, the researchers write. Further, the ability of these models to generate "fluent nonsense" creates "a false aura of dependability" that does not stand up to a careful audit.

As such, the researchers warn heavily against "equating [chain-of-thought]-style output with human thinking," especially in "high-stakes domains like medicine, finance, or legal analysis." Current tests and benchmarks should prioritize tasks that fall outside of any training set to probe for these kinds of errors, while future models will need to move beyond "surface-level pattern recognition to exhibit deeper inferential competence," they write.

 

The article title is very clickbaity, but I found the actual discussion and reasoning for why this will happen and how it can be stopped to be thoughtful.

 

https://archive.is/wtjuJ

Errors with Google’s healthcare models have persisted. Two months ago, Google debuted MedGemma, a newer and more advanced healthcare model that specializes in AI-based radiology results, and medical professionals found that phrasing the same question differently could change the model’s answers and lead to inaccurate outputs.

In one example, Dr. Judy Gichoya, an associate professor in the department of radiology and informatics at Emory University School of Medicine, asked MedGemma about a problem with a patient’s rib X-ray with a lot of specifics — “Here is an X-ray of a patient [age] [gender]. What do you see in the X-ray?” — and the model correctly diagnosed the issue. When the system was shown the same image but with a simpler question — “What do you see in the X-ray?” — the AI said there weren’t any issues at all. “The X-ray shows a normal adult chest,” MedGemma wrote.

In another example, Gichoya asked MedGemma about an X-ray showing pneumoperitoneum, or gas under the diaphragm. The first time, the system answered correctly. But with slightly different query wording, the AI hallucinated multiple types of diagnoses.
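The behavior Gichoya observed, where wording alone flips the diagnosis, is exactly what a paraphrase-consistency probe is designed to catch. The sketch below is a hypothetical harness, not any real MedGemma client: `ask_model` is a stand-in for whatever API serves the model, and `flaky_model` is a stub that imitates the reported behavior (detail-rich prompts succeed, terse ones do not):

```python
# Hypothetical consistency probe: ask the same question several ways and
# flag the case if the model's answers disagree. `ask_model` is a stand-in
# for a real model API; nothing here calls an actual service.

def consistency_check(ask_model, prompts):
    """Return (is_consistent, answers) for a list of paraphrased prompts."""
    answers = [ask_model(p) for p in prompts]
    return len(set(answers)) == 1, answers


# Stub imitating the prompt sensitivity described in the article:
# a detailed prompt yields the finding, a terse one yields "normal".
def flaky_model(prompt):
    if "age" in prompt and "gender" in prompt:
        return "rib fracture"
    return "normal adult chest"


consistent, answers = consistency_check(flaky_model, [
    "Here is an X-ray of a patient [age] [gender]. What do you see in the X-ray?",
    "What do you see in the X-ray?",
])
print(consistent)  # False: wording alone changed the diagnosis
```

In practice one would compare normalized findings rather than raw strings, but even this crude check would have flagged both the rib X-ray and the pneumoperitoneum cases before a clinician relied on the output.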

“The question is, are we going to actually question the AI or not?” Shah says. Even if an AI system is only listening to a doctor-patient conversation to generate clinical notes, or translating a doctor’s own shorthand, he says, those uses carry hallucination risks that could be even more dangerous. That’s because medical professionals may be less likely to double-check the AI-generated text, especially since it’s often accurate.