How does a Large Language Model (LLM) like ChatGPT work?
I implemented a Large Language Model on my home computer to help me understand its workings. Here's what I found out.
Like most, I’ve been fascinated by Large Language Models (LLMs) since they hit the mainstream two years ago. My interest in Natural Language Processing (NLP) started in the 2010s, when I designed a searchable document store for NHS Wales that now contains in the order of a hundred million clinical documents. As a proof of concept, I experimented with an open source clinical NLP tool called cTAKES to extract coded data from the free text, including medicines and diagnoses, which were fed back into the document store as metadata tags to help navigate the record.

I learned more about language models during 2020-2021, through a master’s level AI and Machine Learning course run by an innovative UK training company. In our NLP topic we created our own text generation models: primitive language models that used statistics about different permutations of letter sequences in the novel Pride and Prejudice to generate new text. With just a few lines of Python code these toy models produced amusing text that looked at a glance like it could have come from a Jane Austen novel, as long as you didn’t read too closely! We also learned about more sophisticated models that can process and generate text, like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks.
So I knew some of the theory, but when I first played with ChatGPT in 2022 it really felt like one of those rare moments in the evolution of technology and that nothing would ever be the same again. I even used (and continue to use) AI to learn more about AI as, with seemingly infinite patience, ChatGPT explained that it wasn’t based on a Recurrent Neural Network architecture, but something different that I hadn’t learned about - a transformer (the “T” in GPT) with a decoder-only architecture.
One of the troubling things about AI is understanding how and why it produces the outputs it does. As the size and complexity of an AI model increases, it becomes an impossible task to understand and visualise how a machine “thinks” its way from an input prompt to an intelligible output, even though the actual steps the algorithm follows are predetermined and well understood, and the outputs it produces are deterministic and repeatable.
An engineer at heart, I like to understand how things work and to really grasp the intuition behind them, so I’ve spent some time over the last week implementing an open source large language model (LLaMA 3) from scratch, running entirely on my Mac. To be more accurate, what I’ve implemented is inference: using the pre-trained model’s weights to generate text from an input prompt, as opposed to training, where the model learns those weights to be able to do useful things.

LLaMA 3, an open Large Language Model
LLaMA 3 works in a similar way to the GPT models used by ChatGPT, but with some differences:
The LLaMA version I used has 8 billion learned parameters (aka weights). This is just small enough to fit in the 24GB of memory on my Mac. A larger version of the model has 70 billion parameters. The older model used by ChatGPT, GPT-3, has 175 billion parameters, and the newer GPT-4 model is reported to have over 1 trillion parameters!
More parameters mean a more intelligent and knowledgeable machine. But the mechanics of how the models work are similar, so if you understand the operation of LLaMA 3 then you’ve also got a good grasp of how the GPT models function.
LLaMA 3 is an open model - anyone can access the model’s trained parameters and implement the AI for themselves, whereas the GPT models can only be accessed via an Application Programming Interface (API) or via the ChatGPT application.
I’d like to walk through what I’ve learned about how LLMs work, because if I can explain it simply and intuitively, then perhaps I’ve understood it sufficiently myself!
I used Python running in a Jupyter Notebook and followed a great tutorial by Fareed Khan. You can access my code on GitHub and run it for yourself, but you’ll need enough memory to load the model’s trained parameters, which take about 15GB.
Let’s make a start! I won’t explain everything in detail, only what I can cover in a roughly ten minute read, but hopefully enough to give an intuitive grasp of some of the concepts.
Black boxes and functions
You can think of a Large Language Model as a “black box” that does one task. That task is simply to choose the most appropriate next word, based on a sequence of words fed to the model. The model repeats that simple step over and over, until it’s finished its response. The black box has just one input: a sequence of text, like a question or a prompt that we type into ChatGPT. It has just one output: the most suitable word to come next in the sequence.
Let’s imagine I want the language model to write a story. I might feed in the prompt:
“tell me a story.”
The language model might determine that the most suitable word to come next is “Once”. Just one technical detail to note:
In NLP jargon, words are referred to as tokens. More accurately, a token can be a whole word, part of a word, a punctuation mark or a symbol. I’ll use the terms interchangeably, but to help readability I’ll mostly refer to words rather than tokens.
Now when I say a “black box”, think of a cash machine. You don’t need to know what’s inside it, but you know what input it expects: a bank card, a PIN and an amount to withdraw; and you know what output to expect: your cash and balance. Here’s our black box for a Large Language Model:
If you’re mathematically minded or you remember learning about functions, that’s just what we have here - a mathematical function:
The predicted word gets added to the sequence and the process starts over again, until the LLM outputs a special “end of text” token, which signals that the text generation has finished and there’s nothing more to come. Think “over and out” on an old walkie talkie, or those situations where it’s important to know when to stop talking 🤔.
That’s simple enough to grasp, if we don’t mind the details of how it works for now: a mathematical function that, when provided with a sequence of words, predicts the most suitable next word.
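The predict-and-append loop described above can be sketched in a few lines of Python. Here `predict_next_word` is a stand-in name for the whole black box (the real model is, of course, far more than a lookup), and the toy model below just recites the same four-word story:

```python
def generate(predict_next_word, prompt_tokens, max_new_tokens=50,
             end_token="<|end_of_text|>"):
    """Repeatedly ask the model for the next word until it signals the end."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_word(tokens)
        if next_token == end_token:
            break  # "over and out" - the model has finished talking
        tokens.append(next_token)
    return tokens

def toy_model(tokens):
    # A stand-in "black box" that always continues with the same story,
    # then signals that it has finished.
    story = ["Once", "upon", "a", "time", "<|end_of_text|>"]
    words_generated_so_far = len(tokens) - 4  # our prompt below is 4 tokens
    return story[words_generated_so_far]

result = generate(toy_model, ["tell", "me", "a", "story."])
# result: ["tell", "me", "a", "story.", "Once", "upon", "a", "time"]
```

Everything the rest of this post describes happens inside that single `predict_next_word` call.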
But mathematical functions deal with numbers, don’t they, not words? Computer processors, too, only deal with numbers: binary numbers made up of zeros and ones. So how do we get from words to numbers?
That takes two steps: first tokenization, to split the text into words, and second word embedding, to transform those words into numbers.
Tokenization
In the early days of NLP, the main units that NLP algorithms processed were the individual letters that make up some text. Take my clinical NLP example from earlier: it processed the text one character at a time, then inferred higher order concepts like words, parts of speech (verbs, nouns etc.) and named entities (medications, people’s names etc.). Those higher order concepts could be tagged to the relevant part of the text using character indices, e.g. to indicate that characters 10 through 15 are a medication.
But the problem with that approach is that the individual characters hold no meaning on their own. An algorithm that needs to understand the meaning of text needs to deal with the words themselves as its main currency, rather than the letters that make up the words.
That’s where tokenization comes in. Text is split into a set of tokens using an algorithm like Byte Pair Encoding (BPE). Those tokens might be words, parts of words or punctuation. Each of those unique tokens is assigned a number to represent it, and all the possible tokens together form a vocabulary for the model.
But these numbers are arbitrary; they don’t convey anything about the actual meaning of words. If tokens are numbered in alphabetical order, then the number for the token “aardvark” might be close to the number for “abacus”, because they both appear early in the alphabet, but that tells us nothing about the relationship of these words to each other or their meaning.
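A toy sketch makes that arbitrariness concrete. This uses a tiny hand-built vocabulary rather than the real thing - LLaMA 3 learns a vocabulary of roughly 128,000 tokens with Byte Pair Encoding, and BPE would split unknown words into smaller sub-word pieces rather than failing on them:

```python
# A toy vocabulary mapping each token to an arbitrary id.
vocab = {"tell": 0, "me": 1, "a": 2, "story": 3, ".": 4, "Once": 5}
id_to_token = {i: t for t, i in vocab.items()}

def tokenize(text):
    # Crude split on spaces, with punctuation peeled off as its own token.
    words = text.replace(".", " .").split()
    return [vocab[w] for w in words]

token_ids = tokenize("tell me a story.")
# token_ids: [0, 1, 2, 3, 4] - numbers that carry no meaning in themselves
```

The ids are just labels; nothing about `0` and `1` tells us that “tell” and “me” are related or unrelated. That’s the problem word embeddings solve next.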
How can we represent words as numbers?
In 2013, some very clever people at Google invented a way of capturing and representing the meaning of words using high dimensional vectors. The name Word2Vec was coined and it proved to be a foundational step in the evolution of the LLMs we use today.
Think for a minute about the way we explain the meaning of a word to a child. We use other words! Similarly, Word2Vec captures something about the meaning of words in relation to other words by mapping every word to a vector in a high dimensional space, based on where that word tends to appear in relation to other words in a corpus of training texts.
To visualise this, think of a space inhabited by words, like stars in space. The distance between words in this space is meaningful. E.g. if I make a journey from the word “king” to the word “queen”, then because of their similar relationship to each other I’d expect to make a similar journey from the word “prince” to the word “princess”.
This means that these word vectors can be added and subtracted to infer things. For example, to solve an analogy problem like…
king is to queen as prince is to WHAT?
We can do this simply by adding and subtracting vectors. If I add queen to prince (“I want a female version of a prince…”) but then subtract the vector for king (“…but without the seniority”), the nearest word to the resulting vector should be princess.
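The analogy arithmetic can be sketched with made-up 2-dimensional “embeddings” (real embeddings have thousands of dimensions, and these particular numbers are invented for illustration, not taken from any trained model):

```python
# Toy 2D vectors: dimension 0 loosely encodes seniority, dimension 1 gender.
words = {
    "king":     [0.9, 0.1],
    "queen":    [0.9, 0.9],
    "prince":   [0.4, 0.1],
    "princess": [0.4, 0.9],
    "aardvark": [0.0, 0.5],
}

def nearest(vec, exclude):
    # The vocabulary word whose vector is closest (squared Euclidean distance).
    def dist(w):
        return sum((a - b) ** 2 for a, b in zip(vec, words[w]))
    return min((w for w in words if w not in exclude), key=dist)

# queen - king + prince: female, but without the seniority.
v = [q - k + p for q, k, p in zip(words["queen"], words["king"], words["prince"])]
answer = nearest(v, exclude={"queen", "king", "prince"})
# answer: "princess"
```

With real Word2Vec embeddings the arithmetic is the same, just in hundreds of dimensions rather than two.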
The axes in this high dimensional space will correspond to some “meaning” or common feature that’s learned during training, but they won’t usually be easily interpretable to us humans. LLaMA3 uses 4096 dimensions for its word embeddings - quite hard for us dwellers in 3D space to visualise!
So we have a 4096 dimensional vector (aka a vector embedding) to represent each word in our vocabulary, with each vector consisting of a sequence of 4096 numbers and providing a rich representation of the “meaning” of a word in relation to all the other words.
**** COMMODORE 64 BASIC V2 ****
64K RAM SYSTEM 38911 BASIC BYTES FREE
READY.
As an aside, each number takes up 4 bytes, so that's 16KB for each word embedding! When I was a child in the late 80s, I had a Commodore 64, which I've recently resurrected on a Raspberry Pi. The whole free memory of my childhood computer could only have stored two of these word embeddings! That helps to put into context how modern advances in language processing are only possible because of the exponential increase in computer memory and power available to us.

So we’re beginning to build a picture of how LLaMA works and what’s inside that black box. Here are the steps we have so far, with the ones we’ve touched on highlighted in red. We’ve turned our input into tokens and we’ve encoded each of those tokens as a vector embedding with 4096 dimensions, using some pretrained vector embeddings for our model’s vocabulary.
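The Commodore 64 aside above is just arithmetic, assuming 32-bit (4-byte) floats per embedding value:

```python
bytes_per_number = 4                         # one 32-bit float
dims = 4096                                  # LLaMA 3 embedding dimensions
embedding_bytes = dims * bytes_per_number    # 16384 bytes = 16KB per word

c64_free_bytes = 38911                       # "38911 BASIC BYTES FREE"
embeddings_per_c64 = c64_free_bytes // embedding_bytes  # just 2
```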
The next step is for the model to understand how the individual words in our input relate to each other.
How are the relationships between words captured?
To state the obvious, word order determines how words relate to each other in a sentence. Consider these two sentences containing the same words in a different order:
the dog ate my science experiment
versus:
my science experiment ate the dog
How does a language model understand the all important relationships between words? That’s where the next step in our language model will come in, the self-attention mechanism.
But before beginning to understand relationships between words, the model needs to capture some information about where each word sits in the sequence. All we have so far is our word vectors, which convey rich information about the word in relation to all other words in the vocabulary, but nothing about where the word sits in this input sequence.
How can each word “know” its own position in the sentence?
What if we could alter those vectors slightly based on where they are in the sequence, so that they still contain their original information but also some additional information about where they sit in relation to the other words in the sentence? This is called positional embedding: the word vectors in the input sequence are altered based on where they sit in the sequence.
I found this confusing initially. Surely altering these vectors must change the information encoded in them, obscuring the meaning of the words they represent?
This is how I visualise the positional encoding. Think of what you can see in front of you right now. If you move your head slightly to the right or left, then in one sense the whole picture has changed. If you imagine a screen in your mind representing what you can see, all the “pixels” on that screen are different now to what they were before. But in another sense, very little has changed, things have just shifted a little - all the salient information is still there, maybe just rotated or shifted a little.
There are different methods for positional encoding, and LLaMA uses one called Rotary Positional Embedding (RoPE). Each word’s vector is rotated slightly depending on where it sits in the sentence. This provides additional positional information that later steps in the model can learn from during training, or infer from when using the model.
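A minimal sketch of the idea (simplified - the real implementation applies this to the query and key vectors inside attention, and the details of the frequencies vary): treat each consecutive pair of dimensions as 2D coordinates and rotate them by an angle that depends on the word’s position, with earlier pairs spinning faster than later ones.

```python
import math

def rope(vector, position, base=10000.0):
    """Rotate consecutive pairs of a word vector by position-dependent angles."""
    out = []
    for i in range(0, len(vector), 2):
        x, y = vector[i], vector[i + 1]
        # Each pair gets its own rotation speed; together the rotations
        # form a pattern that uniquely encodes the position.
        theta = position * base ** (-i / len(vector))
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

v = [1.0, 0.0, 1.0, 0.0]
rotated = rope(v, position=3)
# Position 0 leaves the vector unchanged, and rotation preserves each
# pair's length - the "salient information" survives the shift.
```

That length-preserving property is why the head-turning intuition above works: the picture rotates, but nothing in it is lost.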
Self-attention: making each word “pay attention” to the other words
Now that each word knows “where it is”, it must also understand how it relates to the other words that came before it. This is what’s called self-attention: each word pays attention to each of the words that came before it. Once again, this is done by altering each word vector, in this case to incorporate some additional information about the other word vectors. And this is done many times over from many different perspectives, applying a different “filter” for each perspective, so that the model can learn to capture different types of relationships between words.
In technical terms, the model uses something called QKV, or “query, key and value”, to do this. I won’t be able to do it justice here (possibly a deep dive in a future post?), but ultimately it is simply a set of matrix multiplications, so that the vector for each word is altered to convey information from the other word vectors, with each other word having a greater or lesser impact depending on the type of relationship that has been learned and is being attended to by the current self-attention head.
This process is repeated over a number of self-attention heads (32 query heads in LLaMA 3, sharing eight key-value heads), to capture rich contextual information from different perspectives. The output is then passed through a feed forward network (a neural network), which allows the model to learn complex patterns, refine its understanding of relationships between tokens, and progressively shape representations that are increasingly predictive of the next token.
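A single attention head can be sketched as below. This is a deliberately simplified version: it omits the causal mask (which stops each word attending to words that come *after* it) and uses tiny hand-made matrices, but the core mechanic - project to queries, keys and values, score, softmax, blend - is the real one.

```python
import math

def softmax(scores):
    # Turn raw scores into weights that are positive and sum to 1.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs, Wq, Wk, Wv):
    """One attention head: scaled dot-product attention over word vectors xs."""
    matvec = lambda W, v: [sum(w * x for w, x in zip(row, v)) for row in W]
    qs = [matvec(Wq, x) for x in xs]   # what each word is "looking for"
    ks = [matvec(Wk, x) for x in xs]   # what each word "offers"
    vs = [matvec(Wv, x) for x in xs]   # the information each word carries
    d = len(qs[0])
    out = []
    for q in qs:
        # How strongly this word attends to every word, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in ks]
        weights = softmax(scores)
        # Blend the value vectors according to the attention weights.
        out.append([sum(w * v[i] for w, v in zip(weights, vs))
                    for i in range(d)])
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 0.0], [0.0, 1.0]]
out = self_attention(xs, identity, identity, identity)
# Each output vector is a weighted blend of all the value vectors,
# with each word attending most strongly to itself here.
```

In the real model, Wq, Wk and Wv are learned during training, and each of the heads has its own set, giving each head its own “perspective” on the relationships between words.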
In other words, while capturing the relationships of each word to those that came before it, the model is also narrowing down the options for suitable words to come next and beginning to move towards a prediction for the next token.
This process of multi-head self-attention is repeated over a number of layers (32 for LLaMA 3), with the output of each layer feeding into the input of the next, moving ever closer to the final prediction.
At the end, we have a set of vectors, one per word, just like we started with, but now the vectors have captured context about their relationships to the other words and have progressively refined a prediction about the most appropriate word to come next.
Making a prediction of the next word
The final step is to pass these through a “linear layer”, a last neural network layer that multiplies the final matrix of vectors by a matrix of learned weights to create a score for every word in the vocabulary, with a higher score indicating a better candidate for the next word.
In a deterministic model like the one I’ve implemented, you simply pick the highest scoring word as the next word in the sequence. In practice, to make a model appear more “natural” or human, the next word is sampled at random according to a probability distribution, so that one of the highest scoring tokens will tend to get selected while avoiding an entirely deterministic output.
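Both choices can be sketched in a few lines. The “temperature” parameter here controls how adventurous the sampling is: near zero it collapses towards the deterministic highest-score pick, while higher values give less likely words more of a chance.

```python
import math
import random

def sample_next(scores, temperature=0.7, seed=None):
    """Sample a token id from vocabulary scores via a softmax distribution."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]    # the probability distribution
    return random.Random(seed).choices(range(len(probs)), weights=probs)[0]

scores = [2.0, 1.0, 0.1]                 # one score per word in a tiny vocabulary
greedy_choice = max(range(len(scores)), key=scores.__getitem__)  # always picks 0
```

Real implementations typically add refinements like top-k or top-p filtering on top of this, but temperature-scaled sampling is the core idea.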
Then it’s just rinse and repeat. The predicted word is added to the sequence and the whole process is repeated. This goes on until the model predicts a special END_OF_TEXT token, which means over and out, time to stop talking.
Some final observations
I feel like I’ve only scratched the surface, but some final observations:
1. It’s all matrix multiplication
I’ve not described the maths involved, but if you look at the code you’ll see the whole process is pretty much just a lot of matrix multiplications from beginning to end. The same is true of neural networks: matrix multiplication happens to be a very simple and convenient way of feeding all the outputs of one layer of neurons, scaled by a set of learned weights, into all the neurons in the next layer.
2. It’s deterministic
Contrary to appearances, the process is entirely deterministic, albeit there are ways of making it appear more natural, less predictable and more “human” by introducing an element of randomness when picking the next word, using a probability distribution so that more probable (i.e. higher scoring) words will tend to get picked over less suitable choices.
3. It really is just a machine
This really is a machine - you turn the handle enough times (in this case the CPU cycles) and out pops your resulting text.
4. It doesn’t think like we think (yet)
A language model doesn’t think like a human thinks. It doesn’t mull over possible responses. It doesn’t reflect, meditate and daydream. It isn’t influenced by various competing desires and internal and external pressures. It just predicts the next word and repeats until finished.
It’s no more and no less than a machine, albeit a very clever and complex one.
And I think an appropriate way to end would be…
END_OF_TEXT