
AZJonnie

(3,976 posts)
8. Okay, so forgive me for asking AI this question, but it can say it better than I ever could
Sat May 9, 2026, 09:37 AM

Despite Sam Altman's claim in court along the lines of "our product HAS to be allowed to (steal copyrighted materials)," that claim was specific to the product HE wanted to make: one with the capabilities he wanted, built to be competitive (in no small part, I imagine, "because China," and all that word entails in this context).

But there are foundational models out there that are more narrowly trained on ethically-sourced data.

Prompt: What I would want to use is an AI system I *KNOW* was not trained using unethical means, like pilfering copyrighted materials like songs and literary works that the artists were not compensated for. Is there such a thing? Or is one stuck starting with models of that sort, unless they make an enormous expenditure?

Answer from Claude:

This is one of the most genuinely thorny problems in AI right now, and your concern is well-founded — virtually every major foundation model has some copyright controversy attached to it:

The Uncomfortable Reality
OpenAI, Meta, Google, Anthropic — all are facing ongoing litigation over training data. The NYT lawsuit against OpenAI, the authors' class actions, and the music industry suits are all live cases.

"We'll ask forgiveness later" was essentially the implicit industry strategy for the first wave of foundation models

Truly knowing what any model was trained on is difficult because training data documentation ("datasheets") is inconsistently published and often incomplete

Cleaner Options That Exist
Fully licensed/ethical training data models:
* Bloomberg GPT — trained exclusively on licensed financial data, one of the cleaner provenance stories
* Adobe Firefly (image generation) — explicitly trained only on licensed Adobe Stock and public domain content, specifically because Adobe knew their customers needed legal cover
* Fairly Trained — an actual certification organization that audits AI companies for licensed training data, worth checking which models carry their certification

Open source with documented data:
* EleutherAI models trained on documented public domain datasets
* Falcon models from UAE's Technology Innovation Institute with published data cards
* Some Llama variants fine-tuned on documented clean datasets, though Meta's base training has its own controversies

The Honest Assessment
For a business fraud detection use case specifically, you'd likely be fine-tuning primarily on your own proprietary transaction data anyway — meaning the base model's training controversy becomes somewhat less relevant since your domain-specific layer is entirely clean. The base model is just providing general language/reasoning capability, not the fraud-specific knowledge.

But if clean provenance all the way down is a genuine requirement — either ethical or legal — Fairly Trained certified models are probably your most defensible starting point right now. The market for provably ethical training data is growing fast precisely because businesses are starting to ask exactly the question you're asking.


Any assertion along the lines that there's basically no way to make a working AI model that's not trained via "illegal" means is not quite accurate. Also, "legality" is still to be determined, and the matter will almost certainly end up settled in civil courts, with monetary awards rather than criminal charges. So I think "unethically" is a better descriptor, but to each their own.

Anyway, there IS a means to have AI, even GenAI, that doesn't rely on stealing copyrighted materials (at least not purposefully), and if these companies get stuck with huge payouts because they LOSE these ongoing cases, then things will change. To be very clear, I WANT THEM TO LOSE, BIGLY, just as (obviously) you do.

I guess you might say I'm more on the bandwagon of "it should be torn down and done morally correctly" than "all genAI must die", but again, to each their own
