Models: The Engines of AI
What are AI models, how are they trained, and why does it matter?
What is a model?
Think of an AI model as compressed knowledge. A model takes massive amounts of text, code, images, or other data and distills it into a set of numerical weights -- billions of them -- that capture patterns in that data. When you ask a question, the model doesn't look up an answer in a database. It generates one, word by word, based on the patterns it learned.
The most prominent category of AI models today is the large language model, or LLM.
Large Language Model (LLM)
A type of AI model trained on vast amounts of text data to understand and generate human language. "Large" refers to both the training data (trillions of tokens) and the model itself (billions of parameters). Examples include GPT-4, Claude, Gemini, and Llama.
A model's "parameters" are the internal numbers adjusted during training. More parameters generally mean the model can capture more nuance, but they also require more compute to train and run. GPT-3 had 175 billion parameters. Today's frontier models are believed to have trillions.
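Parameter counts translate directly into hardware requirements. A rough sketch of the arithmetic, using GPT-3's published size and common serving precisions (the helper name is ours, for illustration):

```python
def model_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed just to hold the model's weights."""
    return num_params * bytes_per_param / 1e9

params = 175e9  # GPT-3-scale: 175 billion parameters

# fp16 stores each parameter in 2 bytes; int8 in 1 byte.
print(model_memory_gb(params, 2))  # 350.0 -> ~350 GB in fp16
print(model_memory_gb(params, 1))  # 175.0 -> ~175 GB in int8
```

Weights alone at this scale exceed any single GPU's memory, which is one reason frontier models are served across many accelerators.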
How models are trained
Training an LLM is a multi-stage process, each stage refining the model's capabilities.
Pre-training
The model reads enormous amounts of text from books, websites, code repositories, and scientific papers. It learns to predict the next word in a sequence. This stage is the most expensive, often costing tens of millions of dollars in compute. The result is a "base model" that understands language but isn't yet useful as an assistant.
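The core idea -- predict the next word from what came before -- can be illustrated with a toy statistical model. This is a deliberately tiny sketch (bigram counts over a few words, nothing like a real neural network), but the training objective is the same in spirit:

```python
from collections import Counter, defaultdict

# Toy "pre-training": count which word follows which in a tiny corpus.
corpus = ("the model reads text . the model learns patterns . "
          "the model predicts the next word").split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequent continuation seen during 'training'."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("the"))  # 'model' -- the most common continuation
```

An LLM does the same thing with a neural network instead of a lookup table, assigning probabilities to every possible next token rather than memorizing counts.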
Token
The basic unit a language model reads and generates. A token is roughly three-quarters of a word in English. The sentence "AI models are fascinating" is about five tokens, though exact counts vary by tokenizer. Training data is measured in tokens -- modern models train on trillions of them.
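The three-quarters rule of thumb gives a quick way to estimate token counts without a real tokenizer (the function below is an illustrative approximation, not any model's actual tokenization):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~3/4 of a word per token in English,
    i.e. tokens ~= words * 4/3. Real tokenizers vary by model."""
    words = len(text.split())
    return round(words * 4 / 3)

print(estimate_tokens("AI models are fascinating"))  # 5
```

Estimates like this are handy for budgeting prompts, but code, non-English text, and unusual words can tokenize very differently.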
Fine-tuning
The base model is then trained on smaller, carefully curated datasets. These might include question-answer pairs, coding exercises, or domain-specific documents. Fine-tuning teaches the model how to respond in useful formats rather than just completing text.
RLHF (Reinforcement Learning from Human Feedback)
Human evaluators rank model outputs by quality. A separate "reward model" learns from these rankings, and the main model is then optimized to produce outputs the reward model scores highly. This is what makes the difference between a model that generates plausible text and one that gives genuinely helpful, safe answers.
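The reward model's training signal can be sketched with the pairwise ranking loss commonly used in this setting (a Bradley-Terry-style objective; this is a simplified illustration, not any lab's exact implementation):

```python
import math

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Ranking loss: small when the human-preferred response scores
    higher than the rejected one, large when the reward model disagrees."""
    margin = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin))

# Reward model agrees with the human ranking -> low loss.
print(pairwise_loss(2.0, -1.0))
# Reward model disagrees -> high loss, pushing its weights to correct it.
print(pairwise_loss(-1.0, 2.0))
```

Minimizing this loss over many human-ranked pairs teaches the reward model to score outputs the way humans do; the main model is then tuned to maximize that score.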
Key players
The model landscape spans both closed-source companies offering API access and open-source projects releasing weights publicly.
| Model | Developer | Access | Context Window | Strengths |
|---|---|---|---|---|
| GPT-4o | OpenAI | API | 128K tokens | Multimodal, strong reasoning |
| Claude 4 (Opus/Sonnet) | Anthropic | API | 200K tokens | Long context, safety, coding |
| Gemini 2.5 | Google DeepMind | API | 1M tokens | Massive context, multimodal |
| Llama 3.1 405B | Meta | Open weights | 128K tokens | Best open-weight model at scale |
| Mistral Large | Mistral AI | Open weights / API | 128K tokens | Efficient, strong multilingual |
| DeepSeek-V3 | DeepSeek | Open weights / API | 128K tokens | Cost-efficient training, strong coding |
Context window
A model's context window is how much text it can consider at once. Early models handled 4,000 tokens (about 3,000 words). Today's models handle 128K to 1M tokens, enabling them to process entire codebases or books in a single conversation.
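In practice, the context window is a hard budget: prompt tokens plus the reply you want back must fit inside it. A minimal sketch of that check, reusing the ~4/3 tokens-per-word estimate (function and numbers are illustrative):

```python
def fits_in_context(prompt_words: int, reply_budget_tokens: int,
                    context_window: int = 128_000) -> bool:
    """Estimate prompt tokens (~4/3 per word) and check that prompt
    plus reply budget fit inside the model's context window."""
    prompt_tokens = round(prompt_words * 4 / 3)
    return prompt_tokens + reply_budget_tokens <= context_window

# A ~90,000-word book (~120K tokens) plus a 4,000-token reply:
print(fits_in_context(90_000, 4_000))                        # True on 128K
print(fits_in_context(90_000, 4_000, context_window=4_096))  # False on 4K
```

The same book that overflows an early 4K-token model fits comfortably in a modern 128K window, which is exactly the shift the paragraph above describes.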
Open vs closed
This is one of the defining debates in AI.
Closed-source models (GPT-4o, Claude, Gemini) are accessed through APIs. The model weights, training data, and techniques remain proprietary. Companies argue this is necessary for safety and to recoup research costs.
Open-weight models (Llama, Mistral, DeepSeek) release the trained model weights publicly. Anyone can download, modify, and deploy them. This enables independent research, customization, and deployment without API dependencies, though the training data and process may still be proprietary.
The practical difference matters: with a closed model, you send data to someone else's servers. With an open model, you can run it on your own infrastructure, retaining full control over your data.
Neither approach is strictly better. Closed models tend to be more capable at the frontier, while open models offer transparency, customizability, and independence.
Scale and cost
Training a frontier model is staggeringly expensive. To put the numbers in context:
- GPT-4's training reportedly cost over $100 million in compute alone
- A single training run for a frontier model can consume the electricity of a small town for months
- Training clusters use tens of thousands of GPUs running in parallel, each costing $20,000-$40,000
- Total investment for a leading AI lab runs $2-5 billion per year between compute, talent, and data
This cost is why only a handful of organizations can train frontier models. But inference (running a trained model) is far cheaper, and costs drop steadily. Running a query that would have cost $0.10 in 2023 might cost $0.001 today.
Inference
Running a trained model to generate outputs. Distinct from training, which is the process of creating the model. Inference is what happens when you send a prompt and get a response. It's orders of magnitude cheaper per query than training, but at scale the costs are enormous -- billions of queries per day across all AI services.
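The scale math behind "cheap per query, enormous in aggregate" is worth making concrete. Using the illustrative per-query prices mentioned above and a hypothetical service volume:

```python
queries_per_day = 1_000_000  # hypothetical service volume

cost_2023 = queries_per_day * 0.10    # $0.10 per query
cost_today = queries_per_day * 0.001  # $0.001 per query

print(f"2023:  ${cost_2023:,.0f}/day")              # 2023:  $100,000/day
print(f"today: ${cost_today:,.0f}/day")             # today: $1,000/day
print(f"reduction: {cost_2023 / cost_today:.0f}x")  # reduction: 100x
```

A hundredfold price drop turns inference from a cost center into commodity infrastructure, even as total industry-wide spend keeps growing with query volume.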
What's next
The model layer is evolving in several directions simultaneously:
Multimodal models handle text, images, audio, and video within a single architecture. This is already standard at the frontier -- you can show a model a photo and ask questions about it, or have it generate images from text.
Reasoning models (like OpenAI's o3 and DeepSeek-R1) use chain-of-thought techniques to work through complex problems step by step, trading speed for accuracy on hard tasks like math, logic, and coding.
Agent capabilities allow models to use tools, browse the web, execute code, and take actions in the real world. Rather than just answering questions, agent-capable models can complete multi-step tasks autonomously.
Efficiency gains are making smaller models more capable. Techniques like distillation, quantization, and mixture-of-experts architectures mean that a 70-billion parameter model today can match what a 175-billion parameter model did two years ago.
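Quantization, one of the efficiency techniques named above, trades a little precision for a lot of memory. A minimal sketch of symmetric int8 quantization (toy weights, pure Python; real implementations work per-tensor or per-channel on GPU):

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: scale floats into [-127, 127] ints."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate floats from the int8 representation."""
    return [v * scale for v in quantized]

weights = [0.12, -0.53, 0.98, -0.27]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each weight now takes 1 byte instead of 4, at a small accuracy cost.
print(quantized)
print([round(w, 3) for w in restored])
```

Halving or quartering the bytes per parameter is what lets a model that needed a server rack run on a single GPU, at a small and usually acceptable loss in fidelity.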
The trend is clear: models are becoming more capable, more efficient, and more accessible. What was a research curiosity five years ago is now infrastructure that millions of developers build on daily.