Pretraining vs Fine-Tuning

Recently I was invited to be part of an upcoming project that involves building an LLM from the ground up, and many people have been asking me about the feasibility of the different approaches.

So I thought I would summarize it in simple English.

When building an LLM, there are mainly two approaches (you can also go for a hybrid, which I will not cover here):

  • Training an LLM from scratch (Pretraining)
  • Building on top of an existing LLM (Fine-tuning)

In an everyday analogy,

Pretraining is like a person studying all subjects from grade 1 through college: he develops a general understanding of a wide range of topics and skills.

Fine-tuning is like that same person now choosing to specialize in Electronics Engineering: he becomes excellent in this domain, but might forget some of the earlier, unrelated topics like history or art. (This is sometimes called “catastrophic forgetting.”)

So the question becomes: should you pretrain or fine-tune? The answer is simple. In 99% of cases you should just fine-tune (and even that is relatively expensive compared to simply using an already available domain-specific model).

There is practically no need for you as an individual or a company to build your own LLM from the ground up, unless it is done with a very specific goal in mind and with about 50 million USD at your disposal.

Let’s go through the different aspects so that you understand the difference.

Purpose

Pretraining’s Purpose is “Learn general language, knowledge, and reasoning”

Fine-Tuning’s Purpose is “Adapt an existing model for specific tasks/domains”

Starting Point

Pretraining starts from scratch (for those familiar with basic ML, think of it as starting from random weights, which is more or less what actually happens).

Fine-Tuning starts with an already pretrained model (a good set of weights is already available).
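
To make the “starting point” difference concrete, here is a minimal sketch using the Hugging Face transformers library (my choice for illustration, not something this post prescribes). The only difference between the two paths is whether you load trained weights or start from random ones:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Pretraining path: build the architecture with randomly initialized
# weights. This model is useless until you train it yourself.
config = AutoConfig.from_pretrained("gpt2")  # architecture definition only
scratch_model = AutoModelForCausalLM.from_config(config)

# Fine-tuning path: load weights someone else already spent the money
# training. This model works out of the box; you only adapt it.
pretrained_model = AutoModelForCausalLM.from_pretrained("gpt2")
```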

Dataset Size

Pretraining needs a dataset in the range of billions to trillions of tokens (for context, GPT-3 was trained on 300 billion tokens, and GPT-4 is reportedly trained on about 13 trillion tokens; that’s roughly 10 trillion words).

Fine-Tuning needs, depending on your use case, anywhere from a few thousand to a few million tokens.
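
As a side note, the tokens-to-words conversion above is simple napkin math. English text averages roughly 0.75 words per token, though the exact ratio depends on the tokenizer (the ratio here is my assumption, not a figure from any model card):

```python
gpt4_tokens = 13e12      # reported pretraining token count
words_per_token = 0.75   # rough average for English text (assumption)
print(f"{gpt4_tokens * words_per_token:.2e} words")  # ~9.75e+12, roughly 10 trillion
```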

Domain

Pretraining’s domain is usually very generic. We feed it books, articles, essentially the whole internet, since the idea is to give it as much knowledge as possible.

Fine-Tuning’s domain is usually very specific: for example medical, financial, or legal. We feed it information that is specific to the domain we are fine-tuning it for.

Cost

Pretraining is extremely costly. It can cost anywhere from about 1 million dollars all the way up to close to 100 million (for reference, the reported training cost of GPT-4 is 78 million dollars).

Fine-Tuning is comparatively very cheap. Depending on your use case, it might cost anywhere from a few hundred dollars to a few thousand.

Resources

Pretraining definitely requires GPUs or TPUs; there is no way around that. Ideally, specialized versions of them. (For context, GPT-4 training reportedly involved around 25,000 NVIDIA A100 GPUs.)

Fine-Tuning can be done on consumer-grade GPUs, and even on consumer-facing cloud infrastructure.
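
The usual trick for fitting fine-tuning onto a consumer GPU is to train only a small number of added parameters instead of the whole model. The post doesn’t prescribe any particular method, but as a sketch, here is what that looks like with LoRA via the peft library:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a pretrained base model; its original weights stay frozen.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA injects small trainable low-rank matrices into the attention layers,
# so only a tiny fraction of the parameters need gradients (and GPU memory).
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # roughly 0.2% of parameters are trainable
```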

Time

Pretraining usually takes weeks to months (GPT-4 reportedly took around 100 to 140 days, so roughly three to five months).

Fine-Tuning usually takes just a few hours to a few days.

Energy

Pretraining needs a huge amount of energy; we are talking about MWh and GWh. (According to my back-of-the-napkin calculation, GPT-4 would have taken around 24 GWh, that’s 24,000 megawatt-hours, which is a crazy amount of energy for a program to consume.)
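
For the curious, here is one way to reproduce that napkin estimate from the figures quoted earlier in this post. The per-GPU power draw is my assumption (roughly an A100’s TDP), and it ignores cooling and other data-center overhead:

```python
gpus = 25_000          # reported A100 count for GPT-4 training
watts_per_gpu = 400    # assumed average draw, about an A100 SXM's TDP
days = 100             # lower end of the reported training duration

energy_wh = gpus * watts_per_gpu * days * 24   # watt-hours
print(f"{energy_wh / 1e9:.0f} GWh")            # -> 24 GWh
```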

Fine-Tuning needs a negligible amount of energy compared to pretraining.

Human Capital

Pretraining needs a very big research and development team.

Fine-Tuning can be done by one or a handful of ML engineers.

To give you some context, a proper pretraining team needs:

  1. Machine Learning Engineers
  2. Data Scientists
  3. Infrastructure Engineers
  4. Linguists & Computational Linguists
  5. Mathematicians
  6. Philosophers (Ethics & Reasoning)
  7. Ethics and Policy Experts
  8. Psychologists & Cognitive Scientists
  9. Domain Experts
  10. AI & ML Researchers
  11. Data Engineers
  12. Security Experts
  13. Quality Assurance & Testing Engineers
  14. Product Managers & Project Managers
  15. Designers (UX/UI Experts)
  16. Legal and Compliance Experts

So you see that it is a huge undertaking.

Data Labeling

Pretraining doesn’t necessarily need labeled data. It is usually self-supervised (no manual labels needed). This becomes obvious when you think about the trillions of tokens used for LLM training: not even the whole of humanity would be enough to label that much data.
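
If “self-supervised” sounds abstract, here is the whole idea in a few lines: the training labels are just the input shifted by one token, so raw text labels itself (the token IDs below are made up for illustration):

```python
tokens = [464, 3290, 318, 257, 922]   # a tokenized sentence

inputs  = tokens[:-1]   # the model sees:    [464, 3290, 318, 257]
targets = tokens[1:]    # and must predict:  [3290, 318, 257, 922]

for x, y in zip(inputs, targets):
    print(f"given {x}, predict {y}")  # no human labeler involved
```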

Fine-Tuning carries the risks of overfitting and catastrophic forgetting, and it is not uncommon to need to re-fine-tune. It also needs close supervision and, more or less, a good set of labeled data.
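
For contrast with the self-supervised setup above, the “good set of labeled data” for fine-tuning typically means human-written input/output pairs, something like this (the examples are invented for illustration):

```python
train_examples = [
    {"prompt": "What does ECG stand for?",
     "response": "Electrocardiogram."},
    {"prompt": "Summarize the patient's symptoms in one line.",
     "response": "Three days of fever with a persistent dry cough."},
]
```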

Flexibility

Pretraining is by design expected to produce an LLM that is flexible (just like the student who studied from grade 1 to college knows about many things and can flexibly take on many different tasks).

Fine-Tuning is by design expected to produce an LLM that is specialized (just like the student becoming an Electronics engineer: he will excel at Electronics tasks, but might have forgotten his grade school history facts, so he is not as flexible).

Reusability

Pretraining builds an LLM that can be reused as well as fine-tuned.

Fine-Tuning usually binds an LLM to a specific use case or domain, so the likelihood of reuse is low.

Few Examples

Pretraining gives you a general-purpose foundation model like GPT-4, Claude, LLaMA-2, Falcon, or Mistral.

Fine-Tuning gives you a specialized model like Med-PaLM (healthcare, tuned from PaLM) or Alpaca (instruction-tuned from LLaMA). (Note that some well-known domain models, such as BioGPT and BloombergGPT, were actually pretrained from scratch on domain-heavy data rather than fine-tuned.)

With all that, you now understand that in 99.99% of cases you will never have to worry about pretraining an LLM.

But if you have a curious mind, an engineering mind that can’t sleep without knowing how things tick under the hood, then I encourage you to take a peek under the hood of LLMs, because the ‘Inner World of LLMs’ is truly fascinating.