
Leaking people's personally identifiable information (PII) is harmful as a practice, even if this particular instance of leakage turned out to be harmless.

When proponents of AI respond to the argument from creatives that training Generative AI involves stealing creative works, they often assert that the training method means the original works are not contained within the end model, and that the process is analogous to how humans learn. In a technical sense, I do agree with this characterisation of training as a sort of informational distillation. However, there appear to be instances where an unreasonable amount of the original work is still retained in the final model. An analogy I'd draw here is that in determining whether a work derived from an existing one is fair use, one of the factors is how much of the original work is contained within the derivative, and in what context. If a model is able to regurgitate data that it was trained on, then morally speaking, it's harder to justify this as fair use (I say "morally" because I'm drawing on the ethical theme of fair use rather than using it in its strict legal sense). Of course, the question here isn't about the stealing of art or other copyright concerns, but considering this separate problem is useful for understanding why this leakage is problematic.
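To make "regurgitation" concrete: a rough way to probe for verbatim memorisation is to prompt the model with the opening of a text you suspect is in its training set and check whether its most likely continuation reproduces the rest word for word. The sketch below is only illustrative and not how any lab actually audits its models; it assumes the Hugging Face transformers library, and "gpt2" is just a stand-in for whichever model you want to probe.

```python
# Illustrative sketch only: probe a causal language model for verbatim memorisation
# by prompting it with the start of a known text and comparing its greedy
# continuation against the real remainder. "gpt2" is a placeholder model name.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; substitute the model under scrutiny
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def reproduces_verbatim(known_text: str, prefix_chars: int = 200, check_chars: int = 100) -> bool:
    """Return True if the model's greedy continuation of the prefix matches
    the true remainder of `known_text` character for character."""
    prefix, remainder = known_text[:prefix_chars], known_text[prefix_chars:]
    inputs = tokenizer(prefix, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,  # greedy decoding: the single most likely continuation
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, not the prompt.
    continuation = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )
    return remainder[:check_chars].strip() == continuation[:check_chars].strip()
```

Extraction research uses fancier statistics than an exact string match, but the underlying idea is the same: if the model can complete someone's details verbatim from a short prefix, that data is, for practical purposes, still in the model.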

One of the big problems with AI, whether we're talking about training on creative works or the leakage of PII, is that these models are incredibly opaque. It is exceptionally hard, if not impossible, to determine what parts of the training data have been preserved in the final model; I don't even know whether the AI companies themselves are able to glean that information. These models are so incredibly complex, and are trained on such unfathomable amounts of data, that we keep seeing more and more instances of inappropriate levels of reproduction of the training data.
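The closest thing anyone has to looking inside is indirect probing from the outside. One common heuristic, sketched below with a placeholder model, made-up strings, and an arbitrary threshold, is that text a model has memorised tends to get an unusually low average loss compared with a paraphrase of the same content.

```python
# Illustrative sketch only: flag possible memorisation by comparing the model's
# average per-token loss on an exact string against a paraphrase of it.
# Model name, example strings, and threshold are all placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; substitute the model under scrutiny
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def avg_token_loss(text: str) -> float:
    """Average negative log-likelihood per token of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        # Passing `labels` makes the model return the mean cross-entropy loss.
        loss = model(input_ids=ids, labels=ids).loss
    return loss.item()

exact = "Jane Doe, 12 Example Street, Springfield, phone 555-0123"  # hypothetical record
paraphrase = "A woman named Jane Doe lives on a street in Springfield."
if avg_token_loss(exact) < 0.5 * avg_token_loss(paraphrase):
    print("Suspiciously low loss on the exact string: possible memorisation.")
```

Even this only tells you about strings you already suspect are in there; it says nothing about everything you haven't thought to check, which is exactly the opacity problem.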

The key questions are:

  • If the model can reproduce this, are there more harmful things that could plausibly be retrieved via the AI? (Given that we have been seeing models trained on extremely sensitive medical or legal data, the answer is "almost certainly".)
  • How can we know what PII or other sensitive data may have been contained in the training data, i.e. how do we gauge the severity of the risk of sensitive material being reproduced? (We certainly can't from the outside, and I'm doubtful whether even the engineers behind the models could effectively answer this; a naive scanning sketch follows this list.)
  • If we know for certain that sensitive material has been included in the training data, how do we stop (or reduce the likelihood of) that data being reproduced? Is it possible to train a general-purpose AI on sensitive data without significant risk of said data being reproduced? (Speaking as someone who has done a lot of nitty-gritty data work and coding with machine-learning systems, and who tries to keep up with the literature, to my knowledge we can't, and we might never be able to.)
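On the second question, the naive approach is to scan the raw training text for obvious patterns, roughly as in the sketch below. The patterns and the example string are made up, and this only catches the easy cases; names, addresses, and free-text identifiers sail straight through, which is part of why I don't think anyone can give a confident answer.

```python
# Illustrative sketch only: a naive regex scan for obvious PII patterns in raw
# training text. The patterns and the sample string are made up, and most real
# PII (names, addresses, free-text identifiers) will not match anything here.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_for_pii(document: str) -> dict[str, list[str]]:
    """Return the obviously PII-looking substrings found in `document`."""
    hits = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(document)
        if matches:
            hits[label] = matches
    return hits

sample = "Contact Jane at jane.doe@example.com or 555-123-4567."
print(scan_for_pii(sample))
# -> {'email': ['jane.doe@example.com'], 'us_phone': ['555-123-4567']}
```

Dedicated PII-detection tooling does better than a handful of regexes, but none of it is exhaustive, and once the data has gone into the weights there is no comparable scan you can run on the model itself.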

I consider this leakage of PII to be pretty serious already, but it is also an example of why people are so concerned about these systems being rolled out the way they have been. This particular instance barely scratches the surface of a much wider and deeper problem.