Despite Sam Altman's claim in court along the lines of "our product HAS to be allowed to (steal copyrighted materials)", that claim was specific to the product that HE wanted to make, to the capabilities he wanted it to have, and to making it competitive (in no small part, I imagine, "because China", and all that word entails in this context).
But there are foundational models out there that are more narrowly trained on ethically-sourced data.
Prompt: What I would want to use is an AI system I *KNOW* was not trained using unethical means, like pilfering copyrighted materials like songs and literary works that the artists were not compensated for. Is there such a thing? Or is one stuck starting with models of that sort, unless they make an enormous expenditure?
Answer from Claude:
This is one of the most genuinely thorny problems in AI right now, and your concern is well-founded: virtually every major foundation model has some copyright controversy attached to it.
The Uncomfortable Reality
* OpenAI, Meta, Google, and Anthropic are all facing ongoing litigation over training data. The NYT lawsuit against OpenAI, the authors' class actions, and the music industry suits are all live cases
* "We'll ask forgiveness later" was essentially the implicit industry strategy for the first wave of foundation models
* Truly knowing what any model was trained on is difficult because training data documentation ("datasheets") is inconsistently published and often incomplete
Cleaner Options That Exist
Fully licensed/ethical training data models:
* Bloomberg GPT: trained exclusively on licensed financial data; one of the cleaner provenance stories
* Adobe Firefly (image generation): explicitly trained only on licensed Adobe Stock and public domain content, specifically because Adobe knew their customers needed legal cover
* Fairly Trained: an actual certification organization that audits AI companies for licensed training data; worth checking which models carry their certification
Open source with documented data:
* EleutherAI models: trained on documented public domain datasets
* Falcon models: from UAE's Technology Innovation Institute, with published data cards
* Some Llama variants fine-tuned on documented clean datasets, though Meta's base training has its own controversies
The Honest Assessment
For a business fraud detection use case specifically, you'd likely be fine-tuning primarily on your own proprietary transaction data anyway, meaning the base model's training controversy becomes somewhat less relevant, since your domain-specific layer is entirely clean. The base model is just providing general language/reasoning capability, not the fraud-specific knowledge.
But if clean provenance all the way down is a genuine requirement, whether ethical or legal, then Fairly Trained-certified models are probably your most defensible starting point right now.
The market for provably ethical training data is growing fast precisely because businesses are starting to ask exactly the question you're asking.
Any assertion along the lines that there's basically no way to make a working AI model that's not trained via "illegal" means is not quite accurate. Also, "legality" is still to be determined, and almost certainly the matter will end up being settled in civil courts, with monetary awards rather than criminal charges. So I think "unethically" is a better descriptor, but to each their own.
Anyways, there IS a means to have AI, even GenAI, that doesn't rely on stealing copyrighted materials (at least not purposefully), and if these companies get stuck with huge payouts because they LOSE these ongoing cases, then things will change. To be very clear, I WANT THEM TO LOSE, BIGLY, just as (obviously) you do.
I guess you might say I'm more on the bandwagon of "it should be torn down and rebuilt morally correctly" than "all genAI must die", but again, to each their own.