
Video review (Deep Dive into LLMs like ChatGPT)
I had a chance to go over the video Deep Dive into LLMs like ChatGPT by Andrej Karpathy over the weekend. First of all, this is an awesome video for anyone who wants to understand how LLMs work (just like the title says). The video doesn't cover the nitty-gritty mathematical details of each component in an LLM, so it is super friendly even if you are not a math/technical person. Also, the examples and wording in the video are very intuitive, which makes it easier to understand.
The video first describes how an LLM is built. The process is divided into three steps: pre-training, post-training, and reinforcement learning.
In the pre-training step, the company essentially collects all the available data on the internet and feeds it into the model. In this step, data quantity matters most, and the training process is very expensive: huge GPU clusters run for days and cost a lot, so pre-training doesn't happen often, probably about once a year. The web pages are first converted into plain text and stored in .txt files, then a tokenizer turns the text into a sequence of tokens (integer IDs). The model is trained with supervised learning: the tokens are fed into a giant neural network whose output is a prediction of the next token, the weights are updated, and so on. That is a brief summary of what happens in this step.
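To make the "predict the next token" loop concrete, here is a minimal sketch in PyTorch. It is not the real pipeline: it uses a character-level stand-in for the tokenizer and a tiny GRU instead of a Transformer, and corpus.txt is just a placeholder for the scraped text. The point is only the shape of the training loop: tokenize the text, predict the token one position ahead, and update the weights.

```python
# A minimal sketch of the pre-training loop described above (PyTorch).
# Everything is a stand-in: a character-level "tokenizer" instead of BPE,
# a tiny GRU instead of a Transformer, and corpus.txt as a placeholder
# for the scraped web text.
import torch
import torch.nn as nn

text = open("corpus.txt").read()                 # plain text from the web (placeholder file)
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}     # "tokenizer": character -> integer ID
ids = torch.tensor([stoi[ch] for ch in text])

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)                      # logits for the next token at each position

model = TinyLM(len(vocab))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

block = 32
for step in range(1000):
    i = torch.randint(0, len(ids) - block - 1, (1,)).item()
    x = ids[i : i + block].unsqueeze(0)           # current tokens
    y = ids[i + 1 : i + block + 1].unsqueeze(0)   # the same tokens shifted by one position
    loss = loss_fn(model(x).view(-1, len(vocab)), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Real pre-training uses the same objective, just with a learned subword tokenizer, a Transformer with billions of parameters, and a web-scale corpus instead of one text file.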
At the end of pre-training, you have a model called the base model, but this is not enough: the base model just continues a document given some input. To turn the model into an assistant, we use post-training, where we turn the LLM into a chat assistant. In the basic format, we design/create a dataset of conversation pairs, e.g.:
[ {'human': 'What is the weather like today?', 'assistant': 'It is sunny and warm.'}, {'human': 'And what about tomorrow?', 'assistant': 'Tomorrow it will be cloudy.'}, {'human': 'Will it rain?', 'assistant': 'There is a 30% chance of rain.'} ]
The model is then trained with datasets like the one above. The post-training step typically requires less data than pre-training, but the data needs to be high quality, which involves significantly more labor to verify its correctness. This is also an opportunity to provide the model with cognitive knowledge and enable it to respond with "I don't know" to unknown queries. In the video, the author points out that earlier models often fabricated answers. This is partly because the training data was unbalanced, lacking "I don't know" responses, so the model always answered confidently, even when incorrect. This tendency, while amusing, highlights the importance of balanced training data. Cognitive knowledge involves training the model on information about itself, such as its creators and capabilities. Safety features are also included in this step, training the model on what it should and should not answer.
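To make the post-training data format a bit more concrete, here is a minimal sketch assuming a made-up chat template. The role markers below are invented for illustration (every lab defines its own format); the key point is that each conversation is flattened into one token sequence and trained on with the same next-token objective as pre-training.

```python
# Minimal sketch: flattening conversation pairs into a single training string.
# The <|...|> role markers are made up for illustration; real chat templates
# differ between labs and models.
conversations = [
    [{"role": "human", "content": "What is the weather like today?"},
     {"role": "assistant", "content": "It is sunny and warm."}],
    [{"role": "human", "content": "Will it rain?"},
     {"role": "assistant", "content": "There is a 30% chance of rain."}],
]

def render(conversation):
    """Join all turns into one string with role markers the model can learn."""
    return "".join(
        f"<|{turn['role']}|>{turn['content']}<|end|>" for turn in conversation
    )

for conv in conversations:
    text = render(conv)
    # In real post-training this string is tokenized and trained on with the
    # same next-token objective as pre-training, typically masking the loss so
    # only the assistant's tokens are predicted.
    print(text)
```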
At this point, some LLMs are actually ready to use and behave like the ChatGPT models. The third step is reinforcement learning, where we give the model the opportunity to think: to generate the logic behind an answer and arrive at the correct answer through trial and error. The features we see in more modern models today, such as DeepSeek's DeepThink and OpenAI's reasoning models, are achieved with reinforcement learning. The idea of reinforcement learning in the video is to give the model a prompt (question) and the answer, then ask the model to figure out the needed steps and the correct path to the correct answer. In my understanding, reinforcement learning aligns closely with how humans actually learn: you go into an environment, you have different actions, and when you perform different actions in different states, you get different rewards. In the LLM setting, the agent is the LLM itself, the action is the LLM generating words, the state can be artificial, and the reward is the evaluation of the answer. That is the core idea.
Now, there are tons of technical details I've skipped, since it's probably still a top-tier research problem. One good reference is the RLHF paper published by OpenAI. We can imagine that evaluating LLM outputs can be very difficult. For some prompts, like 'write a poem' or 'write a joke,' humans themselves have to evaluate the output, and evaluating the output millions of times is not feasible. RLHF proposes a technique where the model generates different outcomes, and then humans rank or score these outcomes. These rankings or scores are then used to train a reward model, which approximates human preferences and guides the LLM during reinforcement learning. However, since we don't have control over how the model learns, sometimes the model generates things that don't make sense to us, such as 'The joke is the the the the the the the the.....', which... is funny, but is a bad joke. The model sort of tricks the reward model, because it does whatever maximizes the reward.
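To make the reward-model part of RLHF a bit more concrete, here is a minimal sketch of the pairwise ranking idea: human rankings become (chosen, rejected) pairs, and a small scorer is trained so the chosen output gets a higher score. Everything here is a toy stand-in; the "features" are hashed character counts rather than real LLM hidden states, and the preference data is made up.

```python
# Toy sketch of training a reward model from human preference pairs.
# The featurizer is a stand-in for an LLM encoder; the data is invented.
import torch
import torch.nn as nn
import torch.nn.functional as F

def featurize(text, dim=128):
    """Stand-in for LLM hidden states: hashed character counts."""
    v = torch.zeros(dim)
    for ch in text:
        v[hash(ch) % dim] += 1.0
    return v

reward_head = nn.Linear(128, 1)
opt = torch.optim.AdamW(reward_head.parameters(), lr=1e-3)

# (prompt, preferred completion, rejected completion) triples from human rankings
preferences = [
    ("write a joke",
     "Why did the model cross the road? To maximize reward.",
     "The joke is the the the the the"),
]

for epoch in range(100):
    for prompt, chosen, rejected in preferences:
        r_chosen = reward_head(featurize(prompt + chosen))
        r_rejected = reward_head(featurize(prompt + rejected))
        # Pairwise ranking loss: push the chosen score above the rejected one.
        loss = -F.logsigmoid(r_chosen - r_rejected).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```

During reinforcement learning, this reward model's score stands in for the human judge, which is exactly where the "the the the" failure mode can sneak in if the reward model turns out to be easy to exploit.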
At the end of the video, the author discusses the potential of reinforcement learning: supervised learning can get a model to the highest standard of human knowledge in a shorter time, but reinforcement learning is the way to go beyond the boundary of the highest human knowledge, at least for now. This is followed by some potential future directions for LLMs, such as tokenizing audio, pictures, and video to develop large picture models, large audio models, etc. Finally, the future is exciting.
That has been my review of the video. I really enjoyed it.