I read through the Llama 3 paper. Some random thoughts:
The big model performs more or less as well as the other major models
(GPT, Gemini, and Claude), but you can pull it down and fine-tune it
for your own needs. This is a remarkable move, presumably intended to
undermine the competitive advantage of the big AI companies. It means
you don't need $10 billion to enter the AI race in a deep way.
It took 54 days running on 16,000 H100s. That is a lot of compute.
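Back of the envelope (my arithmetic, not the paper's; the ~700 W
figure is the H100 SXM's spec-sheet board power):

    # Rough scale of the 405B pre-training run. 54 days and 16,000
    # GPUs are from the paper; 700 W per GPU is the H100 SXM spec.
    gpus = 16_000
    days = 54
    watts_per_gpu = 700

    gpu_hours = gpus * days * 24
    energy_gwh = gpu_hours * watts_per_gpu / 1e9  # GPUs only, no cooling

    print(f"{gpu_hours / 1e6:.1f}M GPU-hours, ~{energy_gwh:.0f} GWh")
    # -> 20.7M GPU-hours, ~15 GWh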
From the paper: "During training, tens of thousands of GPUs may
increase or decrease power consumption at the same time, for example,
due to all GPUs waiting for checkpointing or collective communications
to finish, or the startup or shutdown of the entire training job. When
this happens, it can result in instant fluctuations of power
consumption across the data center on the order of tens of megawatts,
stretching the limits of the power grid. This is an ongoing challenge
for us as we scale training for future, even larger Llama models."
Moving data around – both training data and intermediate checkpoints
– required a huge amount of engineering work. The Meta
infrastructure – even outside of the compute stuff – was
instrumental to this effort.
Also from the paper: "One interesting observation is the impact of
environmental factors on training performance at scale. For Llama 3
405B, we noted a diurnal 1-2% throughput variation based on
time-of-day. This fluctuation is the result of higher mid-day
temperatures impacting GPU dynamic voltage and frequency scaling."
Sourcing quality input data seemed like it was all cobbled together;
there was a bunch of work just to pull usable text out of webpages.
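The paper describes building a custom HTML parser tuned to keep
things like math and code intact. As a flavor of what that work looks
like, here's a toy sketch using BeautifulSoup – nothing like Meta's
actual pipeline, just the general shape:

    # Toy version of "pull the main text out of a webpage".
    from bs4 import BeautifulSoup

    def extract_text(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        # Drop the obvious non-content elements before extracting.
        for tag in soup(["script", "style", "nav", "header", "footer"]):
            tag.decompose()
        # Collapse whitespace; keep paragraph boundaries as newlines.
        lines = (line.strip() for line in soup.get_text("\n").splitlines())
        return "\n".join(line for line in lines if line)

    html = "<html><body><nav>menu</nav><p>Actual content.</p></body></html>"
    print(extract_text(html))  # -> "Actual content."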
It's mostly trained on English input, with a much smaller fraction of
other languages. I would imagine that quality in English is much
higher, and people who use the models in other languages are at a
disadvantage.
It filtered out the stuff I'd expect, like how to make a bomb or
create a bioweapon, but I was surprised that it filtered out "sexual
content", which it labeled under "adult content". So if sexuality is
part of your life, don't expect the models to know anything about it.
There's the general pre-training model, which was fed a sort of
mishmash of data – "better quality input", whatever that objectively
means at this sort of scale.
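The paper mentions model-based quality classifiers (fasttext, plus
classifiers trained on Llama 2 quality judgments) doing this at
scale. Here's a toy heuristic stand-in, just to show the shape of
such a filtering pass – score every document, keep the ones above a
threshold:

    # Toy stand-in for a learned quality classifier.
    def quality_score(doc: str) -> float:
        words = doc.split()
        if not words:
            return 0.0
        # Crude proxies: reasonable length, low ALL-CAPS ratio.
        caps = sum(w.isupper() for w in words) / len(words)
        length_ok = 1.0 if 20 <= len(words) <= 10_000 else 0.3
        return length_ok * (1.0 - caps)

    corpus = ["BUY CHEAP PILLS NOW " * 10,
              "An ordinary paragraph of text. " * 10]
    kept = [doc for doc in corpus if quality_score(doc) > 0.7]
    # kept contains only the second document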
Post-training is basically taking a whole bunch of expert
human-produced data and making sure that the models answer in that
sort of way. So the knowledge, and whatever else is embedded, is sort
of forced in at that stage.
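Concretely, the supervised fine-tuning step looks something like
this: pair a prompt with an expert-written answer and compute the
usual next-token loss only on the answer tokens. A schematic sketch,
with a toy tokenizer standing in for the real one:

    # Loss is masked over the prompt, so the model learns to *answer*
    # in the expert's style rather than to predict the prompt.
    def tokenize(text: str) -> list[int]:
        # Toy stand-in: hash each word into a small id space.
        return [hash(tok) % 32_000 for tok in text.split()]

    prompt = "How do I reverse a list in Python?"
    response = "Use reversed(xs) or xs[::-1]."

    prompt_ids = tokenize(prompt)
    response_ids = tokenize(response)

    input_ids = prompt_ids + response_ids
    # -100 is the conventional "ignore this position" label; only the
    # response tokens contribute to the cross-entropy loss.
    labels = [-100] * len(prompt_ids) + response_ids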
Pre-training, then, is like putting in the full corpus of how
language works and the concepts our languages have embedded. This is
interesting in itself because it represents how we model the world in
our communication. But while it's fully capable of spitting out
coherent bullshit, it doesn't really have any of the "understanding
of experts" that would differentiate knowing what you are talking
about.
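For all the philosophical weight, the objective itself is tiny:
predict the next token, over the whole corpus. A minimal sketch in
PyTorch, with random tensors standing in for a real model and real
text:

    import torch
    import torch.nn.functional as F

    vocab, seq = 128_256, 8  # Llama 3's vocab size; toy sequence length
    tokens = torch.randint(0, vocab, (1, seq))  # stand-in for real text
    logits = torch.randn(1, seq, vocab)         # stand-in for model output

    # Shift by one: the prediction at position t is scored against
    # the actual token at position t+1.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab),
        tokens[:, 1:].reshape(-1),
    )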
Post-training is what puts in the capabilities that are actually
useful – both in terms of elevating accepted knowledge, and also
other capabilities like tool use. This sort of tuning seems like
cheating, or at least a very pragmatic engineering method that "gets
the model to produce the types of answers we want".
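Tool use here includes a code-interpreter mode: the model emits
Python, something runs it, and the output gets fed back into the
conversation. A toy sketch of that loop (calling exec() on raw model
output is for illustration only – real systems sandbox it):

    import contextlib
    import io
    import re

    def run_python_block(model_output: str) -> str | None:
        # Find a fenced python block in the model's reply, if any.
        match = re.search(r"```python\n(.*?)```", model_output, re.DOTALL)
        if match is None:
            return None
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(match.group(1))  # never do this outside a sandbox
        return buf.getvalue()

    reply = "Let me compute that.\n```python\nprint(17 * 23)\n```"
    print(run_python_block(reply))  # prints "391"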
The obvious thing is the -instruct variation, which adds things like
"system prompt" and "assistant" and "user", so you can layer on the
chat interface that everyone knows and loves. But tool use and code
generation – it can spit out Python code for evaluation when it
needs a quantitative answer – are also part of that. I believe this
sort of post-training is of a different sort than "process all of the
words so I understand the embedded conceptions in linguistic
communication".
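For concreteness, here's roughly what the -instruct chat template
looks like (special-token names as given in the Llama 3 model card):

    # The "chat interface" is just a formatting convention the model
    # learned in post-training: header tokens wrap each role.
    def format_chat(system: str, user: str) -> str:
        return (
            "<|begin_of_text|>"
            "<|start_header_id|>system<|end_header_id|>\n\n"
            f"{system}<|eot_id|>"
            "<|start_header_id|>user<|end_header_id|>\n\n"
            f"{user}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n"
        )

    print(format_chat("You are a helpful assistant.", "What is 2+2?"))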
The paper is also a sort of blueprint of what you'd need to do if you
wanted to make your own foundation model. They didn't necessarily use
the most advanced techniques – preferring to push the envelope on
data quality and training time – but the results work, and I suppose
they're in tune with the general "more data, less clever" idea in AI.
The methodology of training these things is probably well known to
the experts out there, but if it was obfuscated knowledge before, it
no longer is.