this post was submitted on 01 Dec 2025
74 points (72.3% liked)

Technology

77084 readers
2534 users here now

This is a most excellent place for technology news and articles.


Our Rules


  1. Follow the lemmy.world rules.
  2. Only tech related news or articles.
  3. Be excellent to each other!
  4. Mod approved content bots can post up to 10 articles per day.
  5. Threads asking for personal tech support may be deleted.
  6. Politics threads may be removed.
  7. No memes allowed as posts, OK to post as comments.
  8. Only approved bots from the list below, this includes using AI responses and summaries. To ask if your bot can be added please contact a mod.
  9. Check for duplicates before posting, duplicates may be removed
  10. Accounts 7 days and younger will have their posts automatically removed.

Approved Bots


founded 2 years ago
MODERATORS
 

For one month beginning on October 5, I ran an experiment: Every day, I asked ChatGPT 5 (more precisely, its "Extended Thinking" version) to find an error in "Today's featured article". In 28 of these 31 featured articles (90%), ChatGPT identified what I considered a valid error, often several. I have so far corrected 35 such errors.
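The post describes doing this check by hand each day. Purely as an illustration, a scripted version might look roughly like the sketch below; the Wikimedia endpoints, response field names, prompt wording, and model name are all assumptions for the example, not details from the post.

```python
# Hypothetical sketch only: the author ran the experiment manually in the ChatGPT UI.
# Endpoints, field names, prompt wording, and model name are illustrative assumptions.
from datetime import date

import requests
from openai import OpenAI

today = date.today()
feed = requests.get(
    f"https://en.wikipedia.org/api/rest_v1/feed/featured/{today:%Y/%m/%d}",
    timeout=30,
).json()
title = feed["tfa"]["title"]  # "tfa" = today's featured article

# Pull the article's plain text via the MediaWiki Action API
# (note: very long pages may come back truncated by this endpoint).
extract = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "prop": "extracts", "explaintext": 1,
            "format": "json", "titles": title},
    timeout=30,
).json()
article_text = next(iter(extract["query"]["pages"].values()))["extract"]

client = OpenAI()  # expects OPENAI_API_KEY in the environment
reply = client.chat.completions.create(
    model="gpt-5",  # placeholder model name
    messages=[{
        "role": "user",
        "content": "Find at least one factual error in this Wikipedia article:\n\n"
                   + article_text,
    }],
)
print(title)
print(reply.choices[0].message.content)  # every claimed error still needs human review
```

Even scripted, the workflow from the post keeps a human in the loop: the model only flags candidate errors, and a person decides which ones are real before editing Wikipedia.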

top 50 comments
[–] chronicledmonocle@lemmy.world 23 points 3 days ago (3 children)

Congrats. You just burned down 4 trees in the rainforest for every article you had an LLM analyze.

LLMs can be incredibly useful, but everybody forgets how much of an environmental nightmare this shit is.

[–] GooseFinger@sh.itjust.works 3 points 3 days ago (1 children)

Had to look up ChatGPT's energy usage because you made me curious.

Seems like OpenAI claims GPT-4o uses about 0.34 Wh per "query." This is apparently consistent with third-party estimates. The average Google search is about 0.03 Wh, for reference.

Issue is, "query" isn't defined, and it's possible this figure covers the energy consumption of the GPUs alone, omitting the other contributors to the full picture (conversion losses, cooling, infrastructure, etc.). It's also unclear whether the figure accounts for model training or only normal use.

I also briefly saw estimates that GPT-5 uses between 18 and 40 Wh per query, so roughly 50-120x more than GPT-4o. The OP used GPT-5.

It sounds like the energy consumption is relatively bad no matter how it's spun, but consider that it replaces other forms of compute and reduces workload for people, and the net energy tradeoff may not be that bad. Consider the task from the OP - how much longer/how many more people would it take to accomplish the same result that GPT 5 and the lone author accomplished? I bet the net energy difference isn't that far from zero.

Here's the article I found: https://towardsdatascience.com/lets-analyze-openais-claims-about-chatgpt-energy-use/

[–] REDACTED@infosec.pub 8 points 3 days ago (2 children)

How would this compare to one person with a 5090 gaming for a week?

[–] GooseFinger@sh.itjust.works 11 points 3 days ago

A setup with one monitor and a computer with a 5090 will draw about 1 kW under load. That's 7 kWh per week if the average is 1 hour a day.

So that's about:

  • 233k Google searches
  • 20k GPT-4o queries
  • 175 GPT-5 queries
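Spelling out the arithmetic behind those three figures, using the per-query estimates from the parent comment (all rough, contested numbers):

```python
# Back-of-the-envelope check of the comparison above, using the rough per-query
# figures quoted in this thread. All inputs are contested estimates.
weekly_gaming_wh = 1_000 * 7  # ~1 kW rig, 1 hour/day for a week = 7 kWh = 7,000 Wh

per_query_wh = {
    "Google searches": 0.03,
    "GPT-4o queries": 0.34,
    "GPT-5 queries": 40,  # upper end of the 18-40 Wh estimate
}

for name, wh in per_query_wh.items():
    print(f"{name}: ~{weekly_gaming_wh / wh:,.0f}")
# Google searches: ~233,333   GPT-4o queries: ~20,588   GPT-5 queries: ~175
```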
[–] Frigger@feddit.uk 1 points 3 days ago

Lol good question

[–] Kellenved@sh.itjust.works -1 points 3 days ago* (last edited 3 days ago)

This is my number 1 reason to oppose them. It is not worth the damage.

[–] Pika@rekabu.ru -2 points 3 days ago (1 children)

Not much when you use an already trained model, actually.

[–] SoftestSapphic@lemmy.world 3 points 3 days ago* (last edited 3 days ago)

Unfortunately, unless you're hosting your own model, or using something like DeepSeek that had a cutoff on its training data, it's a perpetually training model.

When you ask ChatGPT things, it is horrible for the world. It digs us a little deeper into an unsalvageable situation that will probably make us go extinct.

[–] W3dd1e@lemmy.zip 7 points 3 days ago* (last edited 3 days ago)

This headline is a bit misleading. The article also says that only 2/3 of the errors GPT found were verified errors (according to the author).

  • Overall, ChatGPT identified 56 supposed errors in these 31 featured articles.
  • I confirmed 38 of these (i.e. 68%) as valid errors in my assessment: I implemented corrections for 35 of them and agreed with 3 additional ones without yet implementing a correction myself. I disagreed with 13 of the alleged errors (23%).
  • I rated 4 as inconclusive (7%), and one as not applicable (in the sense that ChatGPT's observation appeared factually correct but would only have implied an error if that part of the article was intended in a particular way, a possibility that the ChatGPT response had explicitly acknowledged).
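For reference, the percentages in that breakdown follow directly from the counts; a trivial tally (numbers taken from the excerpt above):

```python
# Tally of the review outcomes quoted above; counts come from the article excerpt.
outcomes = {
    "confirmed": 38,       # 35 corrected + 3 agreed with but not yet fixed
    "disagreed": 13,
    "inconclusive": 4,
    "not applicable": 1,
}

total = sum(outcomes.values())  # 56 supposed errors overall
for label, count in outcomes.items():
    print(f"{label}: {count}/{total} ({count / total:.0%})")
# confirmed: 38/56 (68%), disagreed: 13/56 (23%),
# inconclusive: 4/56 (7%), not applicable: 1/56 (2%)
```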
[–] Gonzako@lemmy.world 23 points 3 days ago

"Liar thinks truth is also a lie. More at 11"

[–] dukemirage@lemmy.world 93 points 5 days ago (7 children)
[–] anamethatisnt@sopuli.xyz 90 points 5 days ago (2 children)

I find that an extremely simplified way of judging whether a given use of an LLM is good or not is whether its output is used as a finished product. Here the human uses it to identify possible errors, then verifies the LLM output before acting, and the corrections themselves don't mention AI at all.

The only danger I see is that errors the LLM didn't find will continue to go undiscovered, but they probably would have gone undiscovered without the LLM too.

[–] porcoesphino@mander.xyz 19 points 5 days ago* (last edited 5 days ago) (1 children)

I think the first part you wrote is a bit hard to parse but I think this is related:

I think the problematic part of most genAI use cases is validation at the end. If you're doing something that has a large amount of exploration but a small amount of validation, like this, then it's useful.

A friend was using it to learn the Linux command line; that can be framed as having a single command at the end that you copy, paste, and validate. It isn't perfect, because the explanation could still be off and wouldn't be validated, but I think it's still a better use case than most.

If you're asking for the grand unifying theory of gravity then:

  • validation isn't built into the task (so you're unlikely to take the time to do it).
  • validation could be as time-intensive as the task itself (so there's no efficiency gain if you do validate).
  • it's beyond your ability to validate, so if the model says nice things about you, a subset of people will decide the tool is amazing.
[–] anamethatisnt@sopuli.xyz 8 points 4 days ago

Yeah, my morning brain was trying to say that when it's used as a tool by someone who can validate the output and act upon it, then it's often good. When it's used by someone who can't, or won't, validate the output and simply treats it as the finished product, then it usually isn't any good.

Regarding your friend learning to use the terminal, I'd still recommend validating the output before running it. If it's asking genAI about flags for ls, then sure, no big deal, but if a genAI ends up switching sda and sdb in your dd command and wipes a drive, you've only got yourself to blame for not checking the manual.

[–] shiroininja@lemmy.world 6 points 4 days ago* (last edited 4 days ago)

Or it falsely flags something as an error and the human has so much faith in the system that they assume it must be correct, and either wastes time chasing a fix or bends reality to “correct” it, a human form of hallucinating BS. Especially dangerous if claiming there is an error supports the individual’s personal beliefs.

Edit:

I’ll call it “AI-induced confirmation bias,” a cousin to AI-induced psychosis.

[–] ordnance_qf_17_pounder@reddthat.com 13 points 4 days ago (1 children)

"AI" summed up. 95% of the time it's pointless bullshit being shoehorned into absolutely everything. 5% of the time it can be useful.

[–] dukemirage@lemmy.world 5 points 4 days ago (1 children)

Something weird about corporations spending billions on "the Comic Sans of technology"

[–] Treczoks@lemmy.world 9 points 4 days ago (2 children)

Yep. Let it flag potential problems, and have humans react to them, e.g. by reviewing and correcting things manually. AI can do a lot of things quickly and efficiently, but it must be supervised like a toddler.

[–] architect@thelemmy.club 2 points 3 days ago

So… the same as most employees but cheaper.

People here are above average and overestimate the vast majority of humanity.

This is an interesting idea:

The "at least one" in the prompt is deliberately aggressive, and seems likely to force hallucinations in case an article is definitely error-free. So, while the sample here (running the prompt only once against a small set of articles) would still be too small for it, it might be interesting to investigate using this prompt to produce a kind of article quality metric: If it repeatedly results only in invalid error findings (i.e. what a human reviewer no Disagrees with), that should indicate that the article is less likely to contain factual errors

[–] passepartout@feddit.org 4 points 4 days ago (2 children)

Yes and no. I have enjoyed reading through this approach, but it seems like a slippery slope from this to "vibe knowledge," where LLMs are used to actually try to add or infer information.

[–] LastYearsIrritant@sopuli.xyz 14 points 4 days ago

Don't discard a good technique because it can be implemented poorly.

[–] architect@thelemmy.club 1 points 3 days ago

The issue is that some people are lazy cheaters no matter what you do. Banning every tool because of those people isn’t helpful to the rest of humanity.

[–] selokichtli@lemmy.ml 3 points 2 days ago* (last edited 2 days ago)

Just wanted to point out the insane disparity between the cost of running Wikipedia and that of ChatGPT. The question here is not whether LLMs are useful for some things, but whether they're worth it for most things.

[–] Stefan_S_from_H@discuss.tchncs.de 63 points 4 days ago (4 children)

A tool that gives at least 40% wrong answers, used to find 90% errors?

[–] acosmichippo@lemmy.world 19 points 4 days ago (1 children)

90% errors isn't accurate. It's not that 90% of all facts in wikipedia are wrong. 90% of the featured articles contained at least one error, so the articles were still mostly correct.

[–] pulsewidth@lemmy.world 1 points 3 days ago* (last edited 3 days ago)

And the featured articles are usually quite large. As an example, today's featured article is on a type of crab - the article is over 3,700 words with 129 references and 30-something books in the bibliography.

It's not particularly surprising that a single error can be found in articles that complex.

Bias needs to be reinforced!

[–] echodot@feddit.uk 13 points 3 days ago

The problem is that a lot of this is almost impossible to actually verify. After all, if an article says a skyscraper has 70 stories, even people working in the building may not be able to verify that.

I have worked in a building where the elevator only went to every other floor, and I must have been there for at least 3 months before I noticed, because the ground floor obviously had access and the floor I worked on just happened to have an elevator stop, so it never occurred to me that there might be other floors not listed.

For something the size of a 63-story (or whatever it actually was) building, it's not really apparent from the outside either; you'd have to put in the effort to count the windows. Plus, oftentimes the facade suggests more stories than there are, so even counting windows doesn't necessarily give you an accurate answer, not that anyone would have the inclination to do so anyway. So yeah, I'm not surprised that errors like that exist.

More to the point, the bigger issue is whether the AI can actually prove that it is correct. In the article there was contradictory information in official sources, so how does the AI know which one was right? Could somebody be employed to go and check? Presumably even the building management don't know the article is incorrect, otherwise they would have been inclined to fix it.

[–] crypt0cler1c@infosec.pub 35 points 4 days ago (1 children)

This is way overblown. Wikipedia is on par with the most accurate encyclopedias, with 3-4 factual errors per article.

[–] TheBlackLounge@lemmy.zip 14 points 4 days ago

More like 1, sometimes 2, errors in 90% of Wikipedia's longest and most active articles.

[–] helpImTrappedOnline@lemmy.world 33 points 4 days ago* (last edited 4 days ago) (1 children)

The first edit was undoing vandalism that had persisted for 5 years. Someone changed the number of floors a building had from 67 to 70.

A friendly reminder to only use Wikipedia as a summary/reference aggregate for serious research.

This is a cool tool for checking these sorts of things: run everything through the LLM to flag errors and go after them like a game of whack-a-mole instead of a hidden-object game.

[–] mika_mika@lemmy.world 1 points 3 days ago
[–] kalkulat@lemmy.world 11 points 4 days ago (1 children)

Finding inconsistencies is not so hard. Pointing them out might be a -little- useful. But resolving them based on trustworthy sources can be a -lot- harder. Most science papers require privileged access. Many news stories may have been grounded in old, mistaken histories ... if not in outright guesses, distortions or even lies. (The older the history, the worse.)

And, since LLMs are usually incapable of citing sources for their own (often batshit) claims anyway -- where will 'the right answers' come from? I've seen LLMs, when questioned again, apologize that their previous answers were wrong.

[–] architect@thelemmy.club -1 points 3 days ago (2 children)

Which LLMs are incapable of citing sources?

[–] kalkulat@lemmy.world 1 points 2 days ago* (last edited 2 days ago)

To quote ChatGPT:

"Large Language Models (LLMs) like ChatGPT cannot accurately cite sources because they do not have access to the internet and often generate fabricated references. This limitation is common across many LLMs, making them unreliable for tasks that require precise source citation."

It also mentions Claude. Without a cite, of course.

Reliable information must be provided by a source with a reputation for accuracy ... a trustworthy one. Otherwise it's little more than a rumor. Of course, to reveal a source is to reveal having read that source ... which might leave the provider open to a copyright lawsuit.

[–] jacksilver@lemmy.world 7 points 3 days ago

All of them. If you're seeing sources cited, it means it's RAG (an LLM with extra bits). The extra bits make a big difference, as they mean the response is limited to a select few points of reference rather than drawing on all known knowledge of the subject.
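To make that distinction concrete, here is a toy sketch of the RAG pattern being described: documents are retrieved first and handed to the model along with the question, which is why the sources can be cited. The retrieval step is a deliberately crude keyword overlap, just to keep the example self-contained.

```python
# Toy sketch of retrieval-augmented generation (RAG): fetch a few known documents,
# then ask the model to answer only from them so each claim maps to a real source.
# The "retrieval" is a crude keyword overlap purely to keep the sketch self-contained.
sources = {
    "doc1": "Wikipedia was launched in January 2001.",
    "doc2": "The featured article process highlights Wikipedia's best work.",
    "doc3": "Large language models generate text from statistical patterns.",
}

def retrieve(question: str, k: int = 2) -> dict[str, str]:
    """Rank sources by keyword overlap with the question and keep the top k."""
    q_words = set(question.lower().split())
    ranked = sorted(sources.items(),
                    key=lambda kv: len(q_words & set(kv[1].lower().split())),
                    reverse=True)
    return dict(ranked[:k])

def build_prompt(question: str) -> str:
    """Assemble the prompt an LLM would receive in a RAG setup."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieve(question).items())
    return (f"Answer using ONLY the sources below, citing their IDs.\n"
            f"{context}\n\nQuestion: {question}")

print(build_prompt("When was Wikipedia launched?"))
# The model can now cite [doc1] etc. because the sources were supplied up front,
# not reconstructed from training-data memory.
```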

[–] kepix@lemmy.world 21 points 4 days ago (1 children)

The tool that is mainly based on Wikipedia info?

[–] x00z@lemmy.world 2 points 4 days ago

The tool doesn't just check the text for errors it would know of. It can also check sources, compare articles, and find inconsistencies within the article itself.

There's a list of the problems it found that often explains where it got the correct information from.

[–] Tollana1234567@lemmy.today 1 points 3 days ago

Wikipedia does have some outdated info on certain things, mostly around certain species' discovery and phylogeny.