Despite Sam Altman's claim in court along the lines of "our product HAS to be allowed to (steal copyrighted materials)", that claim was specific to the product that HE wanted to make, to the capabilities he wanted it to have, and to making it competitive (in no small part, I imagine, "because China", and all that word entails in this context).
But there are foundational models out there that are more narrowly trained on ethically-sourced data.
Prompt: What I would want to use is an AI system I *KNOW* was not trained using unethical means, like pilfering copyrighted materials like songs and literary works that the artists were not compensated for. Is there such a thing? Or is one stuck starting with models of that sort, unless they make an enormous expenditure?
Answer from Claude:
This is one of the most genuinely thorny problems in AI right now, and your concern is well-founded: virtually every major foundation model has some copyright controversy attached to it.
The Uncomfortable Reality
* OpenAI, Meta, Google, and Anthropic are all facing ongoing litigation over training data. The NYT lawsuit against OpenAI, the authors' class actions, and the music industry suits are all live cases
* "We'll ask forgiveness later" was essentially the implicit industry strategy for the first wave of foundation models
* Truly knowing what any model was trained on is difficult because training data documentation ("datasheets") is inconsistently published and often incomplete
Cleaner Options That Exist
Fully licensed/ethical training data models:
* Bloomberg GPT: trained exclusively on licensed financial data; one of the cleaner provenance stories
* Adobe Firefly (image generation): explicitly trained only on licensed Adobe Stock and public domain content, specifically because Adobe knew their customers needed legal cover
* Fairly Trained: an actual certification organization that audits AI companies for licensed training data; worth checking which models carry their certification
Open source with documented data:
* EleutherAI models: trained on documented public domain datasets
* Falcon models: from UAE's Technology Innovation Institute, with published data cards
* Some Llama variants fine-tuned on documented clean datasets, though Meta's base training has its own controversies
The Honest Assessment
For a business fraud detection use case specifically, you'd likely be fine-tuning primarily on your own proprietary transaction data anyway, meaning the base model's training controversy becomes somewhat less relevant, since your domain-specific layer is entirely clean. The base model is just providing general language/reasoning capability, not the fraud-specific knowledge.
But if clean provenance all the way down is a genuine requirement, whether ethical or legal, then Fairly Trained-certified models are probably your most defensible starting point right now.
The market for provably ethical training data is growing fast precisely because businesses are starting to ask exactly the question you're asking.
Any assertion along the lines that there's basically no way to make a working AI model that's not trained via "illegal" means is not quite accurate. Also, "legality" is still to be determined, and almost certainly the matter will end up being settled in civil courts, with monetary awards rather than criminal charges. So I think "unethically" is a better descriptor, but to each their own.
Anyways, there IS a means to have AI, even GenAI, that doesn't rely on stealing copyrighted materials (at least not purposefully), and if these companies get stuck with huge payouts because they LOSE these ongoing cases, then things will change. To be very clear, I WANT THEM TO LOSE, BIGLY, just as (obviously) you do.
I guess you might say I'm more on the bandwagon of "it should be torn down and rebuilt morally correctly" than "all genAI must die", but again, to each their own.