Demystifying AI Model Jargon - A Plain English Guide
Ever found yourself nodding along in an AI discussion whilst secretly wondering what everyone's actually talking about? You're not alone. The world of AI models is packed with jargon that sounds impressive but can be genuinely confusing.
Let's break down some of the most common terms you'll hear, in plain English.
What's Actually Inside an AI Model?
When someone says they're downloading an AI model, what are they actually getting?
Weights: The Brain of the Operation
Think of weights as billions of tiny knobs that have been carefully adjusted during training. Each weight is just a number that determines how strongly one part of the network influences another.
During training, the model sees millions of examples and slowly tweaks these knobs until it gets good at predicting what comes next. A model like GPT-4 has hundreds of billions of these weights, which is why the files are so massive.
The weights represent everything the model learned. That's why they're so valuable and why companies either guard them closely or release them strategically.
Parameters: The Bigger Picture
You'll often hear people say "GPT-4 has 1.7 trillion parameters" and "open-weight model" in the same conversation. So which is it, weights or parameters?
Technically, parameters = weights + biases.
Biases are additional adjustable values that help the model make better predictions. Think of weights as the volume controls and biases as the bass and treble adjustments. Together, they're all parameters.
In practice, most people use "weights" and "parameters" interchangeably because weights make up 95%+ of the total. When you hear "open-weight model", you're getting all the parameters.
The Supporting Cast
A complete model isn't just weights. When you download something like GLM-4.7, you're actually getting a folder with several pieces:
- The tokenizer - Converts your text into numbers the model understands. Without the exact tokenizer the model was trained with, nothing works properly.
- Architecture definition - The blueprint showing how many layers, what types, and how they connect. It's like the difference between knowing a building has bricks (weights) versus knowing how those bricks are arranged.
- Configuration files - All the settings: context window size, special tokens, recommended parameters.
- Metadata - Version info, licensing, and documentation about what the model can and can't do.
Open-Weight vs Open-Source: What's the Difference?
This distinction trips people up constantly.
Open-Weight Models
An open-weight model means you can download the trained weights and run the model yourself. You get the weights, the inference code, and usually the tokenizer and config files. What you typically don't get:
- The training code
- The training data
- The full methodology to reproduce the model from scratch
Examples: GLM-4.7, Llama, Mixtral
True Open-Source
A fully open-source model gives you everything needed to recreate it from scratch -- training code, data sources, the works. This is genuinely rare. Training data tends to be either proprietary or an absolute mess to release cleanly.
Closed (Proprietary)
Models like GPT-4, Claude, and Gemini keep their weights private. You access them through APIs, paying per token.
Why It Matters
With open-weight models, you can:
- Run them on your own servers
- Keep your data completely private
- Avoid per-token API costs
- Fine-tune them for specific tasks
- Customise inference behaviour
The trade-off is you can't fully reproduce the training process or audit every decision that went into creating the model. It's a middle ground between fully open and fully closed -- which for most people is more than enough.
Mixture of Experts: Efficiency Through Specialisation
This is where things get interesting.
The Problem with Dense Models
Traditional models (called "dense" models) use every single parameter for every single token. If you have a 70 billion parameter model, all 70 billion are active on every word you generate. That's expensive and slow to run.
The MoE Solution
A Mixture of Experts (MoE) model splits the work between multiple smaller "expert" networks. Instead of one massive network doing everything, you have specialists -- and a router that decides which ones to call on.
For each token, only a small subset of experts actually run. The rest stay dormant. It's like a hospital: instead of one doctor handling everything from broken bones to cardiac surgery, you route patients to the right specialist.
The Benefits
- Speed and cost - You might have 70 billion total parameters but only activate 20 billion per token. Much faster inference.
- Scaling - You can build larger models without proportional cost increases.
- Specialisation - Different experts can naturally gravitate towards different types of knowledge: maths, coding, creative writing, and so on.
Real Examples
MoE models are often described in a format like "8x7B" -- meaning 8 experts with 7 billion parameters each. Total capacity: 56 billion parameters. But if only 2 experts run per token, the actual inference cost is closer to a 14 billion parameter model. You get the knowledge base of a large model at a fraction of the running cost.
Some notable MoE models:
- Mixtral 8x7B
- GLM-4.7
- DeepSeek-V3
- Reportedly GPT-4 (though OpenAI has never confirmed this)
The Trade-offs
MoE isn't a free lunch. These models are harder to train -- you need to ensure the routing works well and that experts get balanced usage rather than a few doing all the work. You also still need to load all the experts into memory, even the dormant ones.
But when done right, MoE gives you a model that punches well above its apparent weight.
Mistral vs Mixtral: Easy to Mix Up
Quick one, because these names cause genuine confusion:
- Mistral AI - A French AI company building excellent open-weight models
- Mistral 7B - Their dense model with 7 billion parameters
- Mixtral 8x7B - Their MoE model; the "x" is a hint at the multiplication
Mistral is the company. Mixtral is specifically their MoE line. Both worth knowing about.
Wrapping Up
Most of the jargon in AI conversations is actually pretty logical once you know what's underneath:
- Weights are the learned numbers that make the model work
- Parameters include weights plus biases -- used interchangeably in practice
- Open-weight means you can download and run it yourself, without necessarily having the full training story
- MoE routes each token to a small set of specialist networks rather than running everything at once
Next time someone mentions they're running Mixtral 8x7B locally, you'll know they're using an open-weight MoE model -- one that's far more efficient to run than a dense model of similar total size.
Now you can nod along with actual understanding.