Language Model Training Fundamentals
It actually has more to do with "spoon-feeding" than "training".
As should be evident now, training a large language model is no small task. Researchers and companies have been testing out new techniques and developing an understanding of what yields better results, but it’s still early days. Like I’ve said, it is very much both an art and a science. Perhaps more so in our case, since the fundamental paradigm the Satoshi models are meant to reflect is opposite to, or out of phase with, just about every kind of model out there right now, whether open or closed source. How we dealt with these challenges is discussed in more detail in this post.
In this article, I’d like to help elucidate the differences between training, tuning and augmentation - each of which is often misrepresented as a way to “train your own model”.
Training versus Fine Tuning versus Augmentation
There is a lot of misinformation out there regarding “training” models.
If you trust the average ‘AI bro’ on Twitter, you’ll believe that you can just upload a few PDFs, a couple of books, a podcast, or a company’s financial report to “train your own AI”.
This is completely false.
For a number of reasons, two of them being:
You cannot train a model (or even effectively fine-tune it) without a sizeable corpus of data.
Even if you have A LOT of data, the data must be transformed into a format which represents the style and kind of output you want the model to produce. For example, if you’re training a model to answer questions, you cannot just “feed it some books.” You actually have to transform the content within the book into a set of questions and answers, which become the examples, or “training data set”.
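To make this concrete, here is a minimal sketch of what that transformation might look like. The passage and the question-and-answer pairs are invented placeholders; a real training set contains thousands (often millions) of such examples, formatted however your training framework expects them (JSON Lines is a common choice).

```python
# A minimal sketch of what "transforming a book into training data" means in
# practice. The passage and Q&A pairs below are illustrative placeholders.
import json

passage = (
    "Bitcoin is a peer-to-peer electronic cash system that allows online payments "
    "to be sent directly from one party to another without going through a "
    "financial institution."
)

# Each example pairs a question with the answer you want the model to learn
# to produce. Thousands of such pairs make up a training set.
examples = [
    {
        "question": "What problem does Bitcoin solve?",
        "answer": "It enables direct peer-to-peer online payments without relying on a financial institution.",
    },
    {
        "question": "Does Bitcoin require a trusted intermediary to process payments?",
        "answer": "No. Payments are sent directly between parties, with no financial institution in between.",
    },
]

# Training frameworks typically consume this as JSON Lines: one example per line.
with open("training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```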
We will explain what this all means, and the different options one has with respect to training a model, fine-tuning it, or augmenting an existing model with a semantic database that it references.
Understanding Large Language Model (LLM) Training
Large Language Models are AI models that can understand and generate human-like text. Unlike more focused AI applications, LLM training doesn't target a specific domain or task; instead, it aims to develop a comprehensive understanding of language, often across multiple languages and contexts. This training involves feeding vast amounts of quality* text data into a model, enabling it to learn patterns, nuances, and structures of language.
Notice the use of the word “quality” here. Quality is a subjective term. In the context of data for LLM training, quality refers to how representative the training data set is of what you want the model to output later.
The process of LLM training is complex and multifaceted. It's not just about accumulating data but also about preparing it in a way that's conducive to learning the intricacies of language.
Imagine it as teaching a child language by exposing them to an extensive library of books, conversations, and writings, but at a scale and speed that’s only possible in the digital world.
The Essence of LLM Training: Comprehensive Language Understanding
LLM training is about creating a foundation of language understanding that's broad and deep. The trained model should be capable of not just repeating what it has seen but also generating new, coherent, and contextually-appropriate content.
The Process and Components
Data Collection and Curation: The first step involves gathering a diverse and extensive dataset. This dataset can range from general literature, websites, and articles to more specific texts depending on the desired expertise of the model.
Data Preparation: The collected data needs to be cleaned and formatted. This involves removing irrelevant or sensitive information, correcting errors, and ensuring that the data is in a uniform format for the model to not only process efficiently, but as mentioned earlier, representative of how you’d like the model to output content later.
Training Phase: In this phase, the model is exposed to the prepared data. Using various algorithms, the model learns to predict the next word (or token) in a sentence, understand context, and generate coherent responses, all by learning statistical relationships between tokens, words and sentences. The weights and biases the model learns along the way are its “parameters”. This phase requires large amounts of computational power and can take days, weeks or even months depending on the model's complexity and the size of the dataset. (A toy sketch of this next-word objective follows this list.)
Dataset Size and Variety: The effectiveness of an LLM is directly related to the size and the diversity (or specificity) of its training dataset. A larger and more varied dataset enables the model to develop a more nuanced understanding of language and its many applications, but also leads it to produce more generalizations. In other words, the more general you want the capabilities of the model to be, the more varied your dataset must be (this is known as your data-blend). The more specific the capabilities, the more specific the dataset needs to be.
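To illustrate the next-word objective described in the training phase above, here is a toy sketch in PyTorch. The seven-word vocabulary, single sentence and tiny embedding-plus-linear “model” are all placeholders; real LLMs are transformers trained on billions of tokens, but the learning signal has this same shape: predict the next token, measure the error, nudge the parameters.

```python
# A toy illustration of next-token prediction. Real LLM training uses
# transformer architectures and vast datasets; this only shows the shape of
# the learning signal, not a realistic model.
import torch
import torch.nn as nn

vocab = ["<pad>", "bitcoin", "is", "peer", "to", "electronic", "cash"]
stoi = {w: i for i, w in enumerate(vocab)}

# One training sentence, encoded as token ids.
tokens = torch.tensor([stoi[w] for w in
                       ["bitcoin", "is", "peer", "to", "peer", "electronic", "cash"]])
inputs, targets = tokens[:-1], tokens[1:]  # predict each next token from the previous one

model = nn.Sequential(nn.Embedding(len(vocab), 16), nn.Linear(16, len(vocab)))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    logits = model(inputs)            # (sequence_length, vocab_size)
    loss = loss_fn(logits, targets)   # how wrong the next-token guesses are
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.3f}")
```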
Compute Requirements
The computational requirements for training an LLM are not trivial. High-end GPUs or TPUs, often available only to well-funded organizations or research institutions, are needed. And of course, this all requires energy. The process is not only data-intensive, but compute-intensive. At the micro level, this is not a concern. Companies like us just plug into what’s available, use “free credits” from cloud providers where possible, and seek compute through whatever means possible (centralized, or decentralized like GPUtopia).
Of course, at the macro level, this is a concern, and it will be interesting to see what the Bitcoin industry can teach the AI industry when it comes to the efficient scaling of compute.
All this is to say that when people tell you that “you can train a model on your own data” - they have absolutely no idea what they’re talking about.
Understanding Fine-Tuning in AI Models
Fine-tuning typically follows the pre-training phase of the LLM development cycle, although in computational terms it’s quite similar. The difference here is the specificity of the data and the time involved. It's akin to honing a broadly-educated mind to specialize in a specific field. After an LLM has been pre-trained on a vast, general dataset to understand language, fine-tuning adjusts the model to excel in specific tasks or comprehend particular domains.
Think of fine-tuning as customizing an all-purpose tool to perform specific jobs with greater efficiency and accuracy.
This phase is crucial for tailoring a model to specific needs, whether it's understanding medical terminology, generating marketing content, engaging in casual conversation, or in our case, speaking like a bitcoiner!
The Essence of Fine-Tuning: Specialization Over Generalization
Fine-tuning shifts the focus from a general understanding of language to specialized knowledge or capabilities. It actually involves retraining the model, but now with a dataset that's closely aligned with the intended application.
The Process and Variations
There are three “general” categories for fine tuning. Once again, this is not 100% the case, all of the time. It’s just a useful way to understand it.
Full Fine-Tune: This involves retraining the entire model on a new, domain-specific dataset. It's like giving the model an intensive course in a new subject, reshaping its understanding and response patterns to align with specific requirements.
Low-Rank Adaptation (LoRA): LoRA is a more targeted approach to fine-tuning. Instead of retraining the whole model, LoRA freezes the original weights and trains a small set of additional low-rank matrices alongside them, so only a tiny fraction of the model's parameters are actually updated. This method is efficient and requires less computational power. It's particularly useful when access to the entire model structure is restricted or when computational resources are limited. It’s a bit like an 80/20 rule, but more like 80/2: you tune the 2% that matters, to give you 80% of the result. (A minimal sketch of the low-rank idea follows this list.)
Partial Fine-Tuning: In some cases, the fine-tuning process might be constrained to the top layers of the model. This form of fine-tuning still allows significant customization but within the framework of the original model's broader understanding. Fine-tuning OpenAI’s models is such an example.
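For the curious, here is a minimal sketch of the low-rank idea behind LoRA in plain PyTorch. The layer size and rank are arbitrary choices for illustration, and in practice you would use a library such as Hugging Face's peft rather than rolling your own; the point is simply how small the trainable fraction of parameters ends up being.

```python
# A minimal sketch of the low-rank idea behind LoRA. Not a full implementation:
# it wraps a single linear layer to show which parameters actually get trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # the original weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Two small low-rank matrices are the only trainable parameters.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        # Frozen base output plus the learned low-rank correction.
        return self.base(x) + x @ self.A.T @ self.B.T

base = nn.Linear(4096, 4096)    # stands in for one frozen layer of a pretrained model
lora = LoRALinear(base, rank=8)

frozen = sum(p.numel() for p in lora.parameters() if not p.requires_grad)
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,} of {frozen + trainable:,} "
      f"({100 * trainable / (frozen + trainable):.2f}%)")
```

On a layer this size, roughly 0.4% of the parameters end up trainable, which is where the “80/2” intuition comes from.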
Applications and Varied Approaches
The choice between full fine-tuning, LoRA, and partial fine-tuning depends on several factors:
Intended Use: The specific task or domain for which the model is being fine-tuned can dictate the depth and approach of fine-tuning.
Resource Availability: Full fine-tuning requires substantial computational resources and data, whereas LoRA is more resource-efficient.
Model Accessibility: Some models, especially proprietary ones like DaVinci by OpenAI, may have limitations on how deeply they can be fine-tuned.
Testing: If you’re testing, it’s often best to start with a LoRA tune and then, if the results are positive, move on to a full fine-tune.
Understanding Reinforcement Learning
The final stage in LLM development focuses on aligning the model with specific human standards and preferences. This 'last mile' stage is essential for refining the model’s decision-making capabilities and ensuring its outputs align with desired outcomes, particularly in terms of relevance, style and accuracy.
Imagine this as the final tuning of a high-performance engine, ensuring it not only runs smoothly but also responds precisely as intended.
There are two main options for reinforcement learning, and a blend of both can be used. Let’s look at each.
Reinforcement Learning from Human Feedback (RLHF)
RLHF uses human feedback to build a reward model, which is then used to further refine the LLM. The steps are as follows (a minimal sketch of the comparison data and pairwise loss appears after the list):
Collecting Human Comparisons: RLHF starts with human evaluators providing qualitative feedback on the model’s outputs, effectively teaching the model what is considered a desirable response. Think “ranking” and “scoring” response variants.
Training a Reward Model: This feedback is used to train a separate 'reward model'. This model learns to predict which responses will be favored based on human evaluations.
Iterative Refinement: The LLM is then fine-tuned using reinforcement learning techniques, such as Proximal Policy Optimization (PPO), or the more recent Direct Preference Optimization (DPO), to maximize the rewards as predicted by the reward model. This process is iterative, continually evolving the model’s output quality based on new feedback.
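To make the first two steps concrete, here is a minimal sketch of a single human comparison and the pairwise ranking loss a reward model is commonly trained with. The prompt, the two responses and the reward scores are invented placeholders; in a real pipeline the scores come from the reward model itself, and the loss is averaged over many thousands of comparisons.

```python
# A minimal sketch of RLHF comparison data and a pairwise (Bradley-Terry style)
# ranking loss. The prompt, responses and reward scores are placeholders.
import torch
import torch.nn.functional as F

comparison = {
    "prompt": "Explain why Bitcoin has a fixed supply.",
    "chosen": "The protocol caps issuance at 21 million coins, enforced by every node...",
    "rejected": "Bitcoin's supply is set by its developers and can be changed at any time.",
}

# In a real system these scalars come from a reward model that reads the
# prompt and each response; here they are stand-in values.
reward_chosen = torch.tensor(1.3)
reward_rejected = torch.tensor(0.2)

# Push the chosen response's reward above the rejected one's. Minimizing this
# across many human comparisons teaches the reward model which outputs people prefer.
loss = -F.logsigmoid(reward_chosen - reward_rejected)
print(f"pairwise ranking loss: {loss.item():.3f}")
```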
Reinforcement Learning from AI Feedback (RLAIF)
Very similar to RLHF, except we use existing LLMs to provide the feedback. The feedback is ultimately of lower quality, but also lower cost in both time and money.
Automating Feedback: RLAIF involves using another AI model to provide feedback, making the process more scalable. This AI-generated feedback aims to mimic human evaluations, guiding the LLM towards desirable outputs (a sketch of a simple “judge” prompt follows below).
Challenges in RLAIF: While RLAIF enhances scalability, it also introduces complexities in ensuring the AI feedback’s quality and reliability, which is crucial for the model’s accurate and ethical alignment.
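Here is a minimal sketch of what AI feedback can look like: an existing model is prompted to act as a judge and pick the better of two answers, producing the same kind of chosen/rejected pairs a human labeler would. The ask_judge_model function is a hypothetical stand-in for whichever LLM API you call; it is not a real library function.

```python
# A minimal sketch of AI feedback (RLAIF). `ask_judge_model` is a hypothetical
# stand-in for whatever LLM API you use; it is not a real library call.
JUDGE_PROMPT = """You are evaluating two answers to the same question.

Question: {question}

Answer A: {answer_a}
Answer B: {answer_b}

Reply with exactly "A" or "B", naming the more accurate and helpful answer."""

def collect_ai_preference(question, answer_a, answer_b, ask_judge_model):
    verdict = ask_judge_model(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )).strip()
    # The verdict becomes a (chosen, rejected) pair, just like human feedback.
    if verdict == "A":
        return {"question": question, "chosen": answer_a, "rejected": answer_b}
    return {"question": question, "chosen": answer_b, "rejected": answer_a}
```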
Conclusion
Reinforcement Learning, whether RLHF, RLAIF or some blend, plays a vital role in the final stages of LLM development. It is the last mile alignment stage, and really puts the icing on the cake, so to speak.
Now that we have the training stages out of the way, let’s look at what most people erroneously call “training” today, and understand why it is fundamentally different to training a model, but still useful in particular contexts.
Augmentation
Retrieval Augmented Generation (RAG) is the most popular way to augment or enhance a model, so we’ll put our focus here. RAG is a novel way to get a model to produce responses that are more accurate or “relevant” to a domain or point of view. Contrary to popular belief, RAG has nothing to do with actually “training a model”. Instead, it focuses on augmenting the capabilities of an existing model. This augmentation is achieved not through retraining with new data, but by enhancing its responses through a sophisticated use of embeddings and external data retrieval.
Think of it as a smart way to do dynamic prompting by abstracting away the context injections using semantic tooling.
It’s a bit like asking a model to answer a question by referencing some specific context you pasted into the prompt. Imagine you just copied a relevant section from a book, pasted it into ChatGPT, then asked the model to answer a question by referencing that context. It’s actually pretty simple, conceptually speaking.
In fact, most people who use ChatGPT (or any other model) do this already, only somewhat manually. They make sophisticated prompts so that the model can reply more accurately. RAG just allows you to do it dynamically and programmatically. It abstracts away the manual process.
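To make that concrete, here is a minimal sketch of such a prompt. The context passage and the question are invented placeholders; RAG simply automates the step of finding that passage and pasting it in.

```python
# A minimal sketch of the "manual" version described above: a retrieved
# passage is simply pasted into the prompt as context. The excerpt and
# question are illustrative placeholders.
context = (
    "Excerpt: The timechain orders transactions into blocks, each referencing "
    "the hash of the previous block, making history expensive to rewrite."
)

question = "Why is it hard to rewrite Bitcoin's transaction history?"

prompt = f"""Answer the question using only the context below.

Context:
{context}

Question: {question}"""

print(prompt)  # this string is what actually gets sent to the model
```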
The Essence of RAG: Augmentation Over Training
RAG enhances AI applications by allowing them to dynamically access and incorporate information from external databases. This method effectively broadens the AI’s knowledge base without altering its foundational training. The core AI model, already trained on a substantial dataset, is coupled with a retrieval system that fetches relevant information from a vast external database in response to specific queries or content requirements.
The Process and Components
Data Ingestion and Embedding: RAG starts with embedding large amounts of data into a vector database. This database is optimized for quick semantic searches, crucial for retrieving relevant information rapidly. Embedding is simply the process of transforming text into numerical vectors that capture its meaning, so the most relevant pieces can be found quickly and handed to the LLM as context.
Contextual Response Generation: When a user query is received, RAG identifies relevant data from the vector database and uses this context to enhance the AI model's response. This process involves interpreting the user's input, searching for pertinent information, and then integrating that information into the response. (A minimal sketch of this retrieve-then-prompt flow follows this list.)
Size of Dataset & Vector Space: The beauty of RAG is that you can augment a model on a small dataset, for example a single book or article, or you can use massive datasets, although that requires a lot more upfront work with data chunking, metadata and, assuming you want high quality results, ensuring all of the embeddings are of high quality (this can take quite a bit of time).
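Here is a minimal sketch of the full ingest-retrieve-prompt flow. TF-IDF vectors stand in for learned embeddings and a Python list stands in for the vector database; a production system would use an embedding model and a dedicated vector store, but the shape of the flow is the same.

```python
# A minimal sketch of retrieve-then-generate. TF-IDF vectors stand in for
# learned embeddings; an in-memory list stands in for a vector database.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "The halving cuts the block subsidy in half roughly every four years.",
    "Difficulty adjusts every 2016 blocks to keep block times near ten minutes.",
    "Nodes validate every transaction independently and reject invalid blocks.",
]

vectorizer = TfidfVectorizer(stop_words="english")
chunk_vectors = vectorizer.fit_transform(chunks)   # the "vector database"

def retrieve(query: str, k: int = 1):
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, chunk_vectors)[0]
    top = scores.argsort()[::-1][:k]               # highest-similarity chunks first
    return [chunks[i] for i in top]

question = "How often does the network difficulty adjust?"
context = "\n".join(retrieve(question))

# The retrieved chunk is injected into the prompt that goes to the LLM.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```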
Practical Applications and Limitations
RAG is valuable in areas where AI responses need to be supplemented with up-to-date or specialized information. However, RAG’s effectiveness hinges on the quality of the external data sources and the system's ability to accurately match query embeddings with relevant information. The complexity of setting up and maintaining such a system, especially when dealing with vast and continually updating data sources, is precisely where things get challenging.
Furthermore, the core model has not been changed, so it’s not producing anything novel or unique. The model is still the same underlying model, and as soon as a question is asked that’s outside of what’s in the vector store, it will revert to its default behavior, or simply not answer.
If your goal is a little widget, this is a useful solution. If your goal is to reference an internal document more easily, or perhaps turn your company FAQs into something that you can reference conversationally, then great. But this is not a new model, and it will not perform as well as a fully trained model.
Conclusion
In summary, LLM training is a powerful but resource-intensive process aimed at creating AI models with a broad and deep understanding of human language. It's a complex endeavor that combines data science, machine learning, and linguistic expertise, resulting in models that can interpret and generate human-like text across various contexts and applications. The training process not only shapes the capabilities of the model but also sets the limitations within which it operates.
Fine-tuning allows for the customization of a general model to meet specific needs and perform specialized tasks. It can be done via a low-resource approach like LoRA or as a full update to the model's parameters. Reinforcement learning is the last mile of the process, and aligns the model.
Finally, RAG and other approaches to augmentation are not training, but enhancements or wrappers on models, which are great for very narrow applications, prototyping and demonstrations.
Understanding these different elements and their implications is key to leveraging the full potential of AI in a targeted and efficient manner.
In the upcoming post, we will cover what we did to build the Satoshi models, and we’ll compare the models on the market through a Bitcoin lens.
If the convergence of AI and Bitcoin is a rabbit hole you want to explore further, you can download the full NEXUS Report here. It’s the first annual Bitcoin <> AI Industry Report. It contains loads of interesting data and helps sort the real from the hype. You will also learn how we leverage Bitcoin to crowd-source the human feedback necessary to train our open-source language model.