News · April 10, 2026

Stop Calling It 'Open Source' AI — It's a Marketing Trick, and We're All Falling for It

🤖 This article was AI-generated. Sources listed below.

The Most Abused Term in AI Right Now

Let me say it plainly: most "open source" AI models are not open source. They're open-weight. They're open-ish. They're open-when-it-suits-us. But they are not open source in any meaningful sense that the software world has understood for the past three decades.

And yet, every press release, every keynote, every breathless LinkedIn post treats the phrase like a magic wand — wave it around, and suddenly your trillion-dollar corporation looks like a scrappy community project.

It's time to call this what it is: a marketing trick dressed up as a movement.


What "Open Source" Actually Means (And Why AI Doesn't Qualify)

In traditional software, open source has a clear definition maintained by the Open Source Initiative (OSI): you release the source code, and anyone can inspect it, modify it, and redistribute it. The whole enchilada. No secret ingredients.

Now look at what Meta does with Llama. They release model weights — which is genuinely useful — but the training data? Secret. The exact training recipe and infrastructure details? Vague at best. And the license? It includes commercial restrictions that kick in at 700 million monthly active users, plus a list of use-case limitations [¹]. That's not "open source." That's a freemium product with a generous trial period.
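
To make that scale clause concrete, here it is reduced to a toy predicate. The 700 million figure comes from the Llama Community License; the function, its name, and the example values are my own simplification, not legal advice:

```python
# Sketch of the Llama Community License's scale clause. The 700M figure
# is from the license text; everything else here is an illustrative
# simplification, not legal advice.
LLAMA_MAU_THRESHOLD = 700_000_000

def needs_separate_meta_license(monthly_active_users: int) -> bool:
    # Above this threshold, the license does not grant commercial rights
    # automatically; you must request a separate license from Meta.
    return monthly_active_users > LLAMA_MAU_THRESHOLD

print(needs_separate_meta_license(50_000_000))   # a typical startup: False
print(needs_separate_meta_license(900_000_000))  # a hyperscaler: True
```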

Google's Gemma models follow a similar playbook. So does Mistral with many of its releases. Weights? Sure. Training data? Nope. Full reproducibility? Dream on.

"Open source means more than making something downloadable. It means giving people the ability to fully understand, reproduce, and build upon the work." — Stefano Maffulli, Executive Director, Open Source Initiative [²]

The OSI spent over a year working on a formal definition of "Open Source AI," finally releasing version 1.0 in October 2024. Their conclusion? A model must include enough information about training data, code, and parameters that a "skilled person can substantially recreate" the system [²]. By that standard, most of what companies are calling "open source" doesn't come close.
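
Here is that standard as a minimal checklist sketch. The Release class, its field names, and the example entries are my own illustration, not an official OSI tool:

```python
# Sketch: the OSI Open Source AI Definition's requirements as a
# checklist. Field names and example entries are illustrative.
from dataclasses import dataclass

@dataclass
class Release:
    name: str
    weights_available: bool        # parameters you can download
    training_code_available: bool  # the full training pipeline, not just inference code
    data_info_sufficient: bool     # enough data detail to substantially recreate the system

    def qualifies_as_open_source_ai(self) -> bool:
        # Under OSAID 1.0, all three must hold; weights alone don't cut it.
        return (self.weights_available
                and self.training_code_available
                and self.data_info_sufficient)

for r in [
    Release("typical 'open source' LLM", True, False, False),
    Release("fully open model (e.g., OLMo)", True, True, True),
]:
    print(f"{r.name}: open source AI? {r.qualifies_as_open_source_ai()}")
```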


Why This Isn't Just Semantics

You might be thinking: Who cares what we call it? The weights are out there. People are building cool stuff. Stop being pedantic.

Here's why it matters:

  • Reproducibility is the backbone of trust. If you can't reproduce a model, you can't independently verify claims about its safety, its biases, or its capabilities. You're taking a corporation's word for it. Sound familiar?

  • "Open source" builds regulatory goodwill. The EU AI Act explicitly carves out exceptions for open-source models [³]. If companies can slap that label on anything and dodge regulation, the exception becomes a loophole big enough to drive a data center through.

  • It co-opts community power. Real open source thrives because communities own the project together. When Meta calls Llama "open source," it borrows the credibility and warm fuzzy feelings of the open-source movement — while retaining total control over the roadmap, the data, and the future.

"There is a meaningful difference between 'open weights' and 'open source,' and conflating them creates confusion that benefits incumbents." — Yann LeCun's position notwithstanding, even some Meta-friendly researchers have acknowledged this tension [⁴]

Meta's VP of AI research, Joelle Pineau, has argued that releasing Llama's weights promotes safety and innovation, and that full training data disclosure raises privacy and copyright concerns [⁵]. That's a reasonable point — and I'll get to it. But reasonable justifications don't change the definition of a term.


The Counterargument (And Why It's Half-Right)

Let's be fair. The counterargument is strong:

1. Open weights are genuinely valuable. Researchers, startups, and developers worldwide are fine-tuning Llama, Gemma, and Mistral models to build things that would have been impossible two years ago. That's real.

2. Training data is legally radioactive. These models are trained on datasets that include copyrighted material, personal information, and web-scraped content. Releasing that data isn't just impractical — it could be illegal in multiple jurisdictions.

3. Perfect is the enemy of good. If we gatekeep the term "open source" so aggressively that companies stop releasing anything, we all lose.

I hear all of that. And I agree that the release of model weights represents a meaningful contribution to the AI ecosystem. What I reject is the labeling.

You can call it "open weights." You can call it "community-available." You can call it "source-available" — a term the software world already uses for code that's visible but not fully open. There are perfectly good, honest terms for what's happening.

But calling it "open source" when it fails the definition? That's not generosity. That's brand strategy.


The Real Stakes: Who Benefits From the Confusion?

Follow the money and it gets crystal clear.

Meta releases Llama as "open source" and gets:

  • A massive developer ecosystem building on their architecture (not OpenAI's, not Google's)
  • Regulatory cover in the EU and beyond
  • A reputation as the "good guy" of AI — the democratizer, the people's champion
  • Free QA, fine-tuning, and feedback from thousands of researchers worldwide

All while retaining the most valuable assets: the training data, the pipeline, and the knowledge of how to build the next model.

"Companies are essentially getting free R&D labor by releasing weights while keeping the secret sauce proprietary. That's a brilliant business strategy. It's just not open source." — Shayne Longpre, MIT researcher and lead author of the Data Provenance Initiative [⁶]

A 2024 analysis from the Data Provenance Initiative found that the documentation around training data for major "open" models had actually gotten worse over time, not better — with key datasets becoming more restricted even as companies trumpeted their openness [⁶].


What I'd Actually Like to See

I'm not calling for companies to stop releasing weights. Please keep doing that. I'm calling for three simple things:

  1. Use honest terminology. If you release weights but not data, call it "open-weight." The AI community is smart enough to handle nuance.

  2. Regulators should define terms precisely. The EU AI Act's open-source exemption should require compliance with the OSI's Open Source AI Definition, not accept whatever label a company's PR team invents [³].

  3. Invest in truly open alternatives. Projects like EleutherAI, BigScience's BLOOM, and the Allen Institute for AI's OLMo — which actually release training data, code, and weights — deserve way more attention and funding than they get [⁷]. If you want to see what real open-source AI looks like, start there (see the sketch below).
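
As a taste of what that fuller openness buys you, here is a minimal sketch of pulling one of those models. It assumes a recent version of the transformers library (one with OLMo 2 support) and enough hardware for a 7B checkpoint:

```python
# Sketch: running a fully open model. Unlike weight-only releases,
# OLMo also publishes its training data (Dolma) and training code,
# so the whole system can in principle be reproduced, not just used.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "allenai/OLMo-2-1124-7B"  # a fully open release from the Allen Institute for AI
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tokenizer("Open source means", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```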


The Bottom Line

Language shapes power. When we let the biggest companies in the world redefine "open source" to mean whatever's convenient for their quarterly strategy, we lose a term that meant something — a term that powered Linux, Firefox, Wikipedia, and an entire philosophy of building technology together.

The AI industry's open-weight releases are valuable. They are not open source. And until we start being precise about the difference, we're doing Big Tech's marketing for free.

Don't fall for the label. Read the license.
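
If you'd rather let a script do the label-checking, here is a minimal sketch using the huggingface_hub library. The allowlist of OSI-approved licenses and the example repo ids are my own illustrative choices:

```python
# Sketch: reading the declared license tag from a Hugging Face model
# repo before assuming anything about openness. The allowlist and the
# example repos are illustrative choices, not an exhaustive audit.
from huggingface_hub import model_info

OSI_APPROVED = {"apache-2.0", "mit", "bsd-3-clause"}  # rough allowlist, not exhaustive

def declared_license(repo_id: str) -> str | None:
    """Return the license tag a repo declares, e.g. 'apache-2.0' or 'llama3.1'."""
    info = model_info(repo_id)
    for tag in info.tags or []:
        if tag.startswith("license:"):
            return tag.removeprefix("license:")
    return None

for repo in ("meta-llama/Llama-3.1-8B", "allenai/OLMo-2-1124-7B"):
    lic = declared_license(repo)
    verdict = "OSI-approved" if lic in OSI_APPROVED else "custom / restricted"
    print(f"{repo}: {lic} ({verdict})")
```

Even an OSI-approved license tag only tells you the terms on the weights file; it says nothing about training data or code, which is exactly the distinction this piece is about.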


Sources
