July 24, 2024 6:15 am

Thoughts on reading the llama 3.1 paper

I read through the Llama 3 paper. Some random thoughts:

The big model performs more or less as well as the other major models (GPT, Gemini, and Claude), but you can pull it down and fine-tune it for your needs. This is a remarkable move, I assume intended to undermine the competitive advantage of the big AI companies. It means that you don't need $10 billion to enter the AI race in a deep way.

It took 54 days running on 16,000 H100s. That is a lot of compute.

During training, tens of thousands of GPUs may increase or decrease power consumption at the same time, for example, due to all GPUs waiting for checkpointing or collective communications to finish, or the startup or shutdown of the entire training job. When this happens, it can result in instant fluctuations of power consumption across the data center on the order of tens of megawatts, stretching the limits of the power grid. This is an ongoing challenge for us as we scale training for future, even larger Llama models.

Moving data around, both training data and intermediate training checkpoints, required a huge amount of engineering work. The Meta infrastructure – even outside of the compute stuff – was instrumental to this effort.

One interesting observation is the impact of environmental factors on training performance at scale. For Llama 3 405B, we noted a diurnal 1-2% throughput variation based on time-of-day. This fluctuation is the result of higher mid-day temperatures impacting GPU dynamic voltage and frequency scaling.

Sourcing quality input data seemed like it was all cobbled together. There was a bunch of work to pull data out of webpages.

It's mostly trained on English input, with a much smaller fraction of other languages. I would imagine that quality in English is much higher, and people who use the models in other languages would be at a disadvantage.

It filtered out stuff I'd expect, like how to make a bomb or create a bioweapon, but I was surprised that it filtered out "sexual content" which it labeled under "adult content". So if sexuality is part of your life, don't expect the models to know anything about it.

There's the general pre-trained model, which was fed a sort of mishmash of data. "Better quality input", whatever that objectively means at this sort of scale.

Post-training is basically taking a whole bunch of expert human-produced data and making sure that the models answer in that sort of way. So the knowledge, and whatever else is embedded, is sort of forced into the model at that stage.

Pre-training, then, is like putting in the full corpus of how language works and the concepts that our languages have embedded. This is interesting in itself because it represents how we model the world in our communication, but while the result is fully capable of spitting out coherent bullshit, it doesn't really have any of the "understanding of experts" that would differentiate knowing what you are talking about.

The post-training is to put in capabilities that are actually useful – both elevating accepted knowledge and adding other capabilities like tool use. This sort of tuning seems like cheating, or at least a very pragmatic engineering method that "gets the model to produce the types of answers we want".

The obvious thing is the -instruct variation, which adds things like "system prompt" and "agent" and "user", so you can layer on the chat interface that everyone knows and loves. But tool use and code generation – it can spit out Python code for evaluation when it needs a quantitative answer – are also part of that. I believe that this sort of post-training is of a different sort than the "process all of the words so I understand embedded conceptions in linguistic communication".
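
A rough sketch of what that layering looks like, just to make it concrete – the role names below follow the common chat convention, but the exact prompt template and tool-calling syntax are whatever the model card specifies, not something I'm quoting from the paper:

  // Hypothetical chat structure; the -instruct post-training is what teaches
  // the model to treat these roles as conversational turns rather than just more text.
  type Role = "system" | "user" | "assistant";

  interface ChatMessage {
    role: Role;
    content: string;
  }

  const conversation: ChatMessage[] = [
    { role: "system", content: "You are a concise assistant. Use Python for any calculation." },
    { role: "user", content: "What is 17% of 2,340?" },
    // A tool-using model can answer by emitting Python for an external
    // interpreter to run, then folding the result back into its reply.
  ];

  // The base (pre-trained) model would simply continue this text; only the
  // -instruct variant has been tuned to respect the role boundaries.
  console.log(conversation);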

The paper is also a sort of blueprint of what you'd need to do if you wanted to make your own foundation model. They didn't necessarily use the most advanced techniques – preferring to push the envelope on data quality and training time – but the results work, and I suppose are in tune with the general "more data, less clever" idea in AI.

The methodology of training these things is probably well known by the experts out there, but if it was obscured knowledge before, it no longer is.


July 8, 2024 8:28 am

Vacation Book Reading

Spent a few weeks in Spain, managed to get some good reading in!

The Gutenberg Parenthesis: The Age of Print and Its Lessons for the Age of the Internet (2023)

Hard to get into, but once you are there it's worth the read. The biggest takeaway for me was the idea of "the mass" as being created by the medium, so people reading (say) Twitter all of a sudden become this group, which doesn't really have an existence but, like Santa Claus, changes everyone's reality.

Tomorrow, and Tomorrow, and Tomorrow (2022)

Expected very little of this, got much more than expected.

A Stainless Steel Trio: A Stainless Steel Rat Is Born/The Stainless Steel Rat Gets Drafted/The Stainless Steel Rat Sings the Blues (1985)

It's funny to reread these books when you are older and get all of the references. Really holds up, both as a satire on the Heinlein adventure-boy genre and as an interesting discussion on how to fit in between places. It's a very impressive feat to put so much philosophy and sociology in a page-turning absurdist caper plot that keeps 12-year-olds' attention.

The Latchkey Murders (2015)

Since Ksenia is Russian I have a new appreciation of the Russians and Soviets. Not totally satisfying as a mystery, but felt like I got a glimpse into Moscow in the 60s.

Server Driven Webapps with HTMX (2024)

As far as server-side JavaScript frameworks go, this is exceedingly clever, but I'm not sure that it really makes things simpler.

Wintersmith (2015)

Ah… the English and their disdain for the wrong sort of hegemonic thought. In many ways the same sort of book as the Stainless Steel Rat – satire and commentary under the guise of silliness, as a way to harmlessly subvert – but ultimately just one thing after another. Fun if you are in the right sort of mood.

Cynicism and Magic: Intelligence and Intuition on the Buddhist Path (2021)

I can't possibly do it much justice. The smallest book that took the longest to read. Very interesting to see the early ways and hows of how Buddhism got into the States, and also so many important reminders.

Deja Dead (2007)

What's funny about this book is that it spawned an empire, and reading just the first one you'd never really expect it. Also a bit of time travel here, since it's set in the pre-cell-phone days and it's hard to find anyone.

Surfing the Internet (1995)

This book was amazing – both obscure and also exactly of my world: I knew all of the references and all of the things they talked about, and I had forgotten so much. The interview with what we would now call an incel was also interesting, from before the time when the sadness had turned into hostility towards these losers. I very much respected this author and tracked down a whole bunch of other stuff she wrote.

The Enigma of Room 622 (2023)

An absurd story inside of a dumb story wrapped in a magical outer story that brought it all home at the end. I do want to read more of his work, but not in a rush.

Narrative of Arthur Gordon Pym of Nantucket (1838)

Poe invented nearly everything! The so-called ending of this book is infuriating, but the ripples of it have influenced so much. It's the sort of book that's a key to understanding a whole bunch of other books, so necessary in a complete-your-education sort of way, but without the context it's a bit strange.

The Prisoner of Heaven (2013)

I had never heard of Carlos Ruiz Zafón before, but picked this up in the small English section of a Benidorm bookstore. Very clever, moody, and oddly soothing. Will go through the rest of his oeuvre.

Homage to Catalonia (1938)

Reading Orwell after reading the Gutenberg Parenthesis and reflecting on Foucault's 40 years was a tremendous experience. I look at intellectual efforts completely differently now, and honestly feel better about the state of the world than I did before. With the Supreme Court making kings and the farce of the elections, it's not getting worse; it's actually the same as it ever was, and we were just fed a load of liberal democracy nonsense all this time – it was capital and power all along.


June 15, 2024 6:03 pm

Four freedoms

We went to the Norman Rockwell Museum today, and there was an inspired pairing with a MAD exhibition. Did you know that MAD Magazine still exists? Me neither.

American Ideals at their finest.

One thing that I didn't know about was the Four Freedoms:

  1. Freedom of Speech
  2. Freedom from Want
  3. Freedom from Fear
  4. Freedom of Worship

These images captured something that is embedded in my psyche – especially Freedom of Speech, which I think of every time someone posts to the local town mailing list with whatever their take is on something. Freedom from Want – which is a picture of a large Thanksgiving turkey – is probably the reason why I think we need to have turkey on Thanksgiving. On any other day I'm not sure I'd ever order turkey.

The other funny thing about this is that I'd only heard of "four freedoms" in the context of the four freedoms of Free Software:

Freedom 0: The freedom to use the program for any purpose.

Freedom 1: The freedom to study how the program works, and change it to make it do what you wish.

Freedom 2: The freedom to redistribute and make copies so you can help your neighbor.

Freedom 3: The freedom to improve the program, and release your improvements (and modified versions in general) to the public, so that the whole community benefits.


June 9, 2024 12:40 pm

Adapting to new mediums

Technologies are artificial, but - paradox again - artificiality is natural to human beings. Technology, properly interiorized, does not degrade human life but on the contrary enhances it… such shaping of a tool to oneself, learning a technological skill, is hardly dehumanizing. The use of a technology can enrich the human psyche, enlarge the human spirit, and intensify its interior life.

– Walter Ong, Orality and Literacy, pp. 81-82

How long did it take for humans to interiorize writing and its tools? How long will it take for us to interiorize the network?

– Jeff Jarvis, The Gutenberg Parenthesis

In the beginning, there were ABC, NBC, and CBS, and they were good. Midcentury American man could come home after eight hours of work and turn on his television and know where he stood in relation to his wife, and his children, and his neighbors, and his town, and his country, and his world. And that was good. Or he could open the local paper in the morning in the ritual fashion, taking his civic communion with his coffee, and know that identical scenes were unfolding in households across the country.

– Joseph Bernstein, Bad News: Selling the Story of Disinformation


April 30, 2024 9:53 pm

I need a trigger warning

All of these protests and the war in Gaza have brought up all the old feelings of being in that terrorist attack in Rome when I was 8, and I just spent 2 hours searching for this passage by Frederick Douglass that I read, what, 30 years ago in high school? Memory is weird.

My mistress was, as I have said, a kind and tender-hearted woman; and in the simplicity of her soul she commenced, when I first went to live with her, to treat me as she supposed one human being ought to treat another. In entering upon the duties of a slaveholder, she did not seem to perceive that I sustained to her the relation of a mere chattel, and that for her to treat me as a human being was not only wrong, but dangerously so. Slavery proved as injurious to her as it did to me. When I went there, she was a pious, warm, and tender-hearted woman. There was no sorrow or suffering for which she had not a tear. She had bread for the hungry, clothes for the naked, and comfort for every mourner that came within her reach. Slavery soon proved its ability to divest her of these heavenly qualities. Under its influence, the tender heart became stone, and the lamblike disposition gave way to one of tiger-like fierceness. The first step in her downward course was in her ceasing to instruct me. She now commenced to practise her husband's precepts. She finally became even more violent in her opposition than her husband himself. She was not satisfied with simply doing as well as he had commanded; she seemed anxious to do better.

  • CHAPTER VII of Narrative of the Life of Frederick Douglass

April 18, 2024 9:56 am

Oh javascript

Somehow this sort of thing in Ruby is charming, and in JavaScript just a never-ending source of confusion.

Value equality is based on the SameValueZero algorithm. (It used to use SameValue, which treated 0 and -0 as different. Check browser compatibility.) This means NaN is considered the same as NaN (even though NaN !== NaN) and all other values are considered equal according to the semantics of the === operator.

Set documentation
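
A minimal sketch of the behavior, in TypeScript (assuming any modern runtime):

  // Set membership uses SameValueZero: NaN equals NaN, and 0 equals -0.
  const seen = new Set<number>();
  seen.add(NaN);
  seen.add(NaN);
  console.log(seen.size);          // 1 – the second NaN is treated as a duplicate
  console.log(NaN === NaN);        // false – even though === disagrees

  seen.add(0);
  seen.add(-0);
  console.log(seen.size);          // 2 – 0 and -0 collapse into a single entry
  console.log(Object.is(0, -0));   // false – SameValue (Object.is) would keep them apart

The 0/-0 case is also why the docs note the switch from SameValue to SameValueZero.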


March 26, 2024 4:41 pm

Discovering idagio

I was a primephonic user before it got whisked away into the Apple ecosystem, and sort of fell back on Spotify. I recently discovered idagio and I’m amazed at the quality. Originally I was thinking that the classical services were better because they organized the music in a more sane way, so you could hear 5 different renditions of a particular piece.

Each recording, though, isn't just different; it's that Spotify seems to license the cheapest one it can get. The playing on this particular one really is phenomenal.

bachtrack streaming services


March 24, 2024 10:00 am

Things I love about my phone

Let's not forget how cool smartphones are.

I can open up WhatsApp and send a message, and the person on the other side will receive it. I could be walking in the woods when a thought occurs to me, and they could be on an entirely different continent, and it doesn't matter.

I really like ordering stuff on the go, like when I'm waiting for an elevator and I make a quick order of, say, some mechanical pencils. And then later it'll just show up.

Other things are mixed. Looking up information and having it all at your fingertips sometimes pulls you out of yourself. Do I really need to know what mechanical sand is made out of, for example, rather than just playing with it? Are the opening hours of the store actually up to date on Google? (Surprisingly, often not.) And often the response you get back is bullshit, which is to say both definitively stated and also wrong.

But still, being able to reach out to any connection anytime, anywhere on the planet is mind-blowing. I can remember a time when you needed to schedule and coordinate when you'd make an expensive long-distance call, and more often than not you'd simply be uncontactable for days or weeks.


March 14, 2024 10:28 am

My physical relationship to the internet

Where I live there is very spotty cell service. If I'm not connected to WiFi at home or at the market, my connectivity to everyone is sort of fire and forget – the messages will go out, the messages will come in, but not immediately.

The feeling is: I go over the hill and all of a sudden my phone blows up with notifications, then I'm off the grid for the next few miles.

Coverage isn't better inside, or on the road, or anything like that – I get better service way out in the woods behind my house than I do on the town green.

I'm so used to this reality that it's jarring when I take the train back to the city. At first, coverage is really bad on the train. I'll tether the laptop to the phone and watch the packet loss slowly improve right up until I get to Grand Central.

At home, everything is downloaded to the phone. PocketCasts' streaming feature is basically pointless for me; I need to wait till an episode is downloaded before getting in the car.

When I get to the city, the process is inverted. Why would I connect to the hotel's WiFi when it's so much easier and more reliable to tether through the phone?

The internet on my home turf is more like being on an airplane – it works in certain situations, but it needs to be offline-first.


March 1, 2024 9:42 am

Why are LLMs so small?

LLMs are compressing information in a wildly different way than I understand. If we compare a couple of open source LLMs to Wikipedia, they are all 20%-25% of the size of the compressed version of English Wikipedia. And yet you can ask the LLM questions, they can – in a sense – reason about things, and they know how to code.

NAME            SIZE
gemma:7b        5.2 GB
llava:latest    4.7 GB
mistral:7b      4.1 GB
zephyr:latest   4.1 GB

Contrast that to the size of English Wikipedia – 22 GB. That's without media or images.

Shannon entropy is a measure of information density, and whatever happens in training LLMs gets a lot closer to that limit than our current ways of sharing information.
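
As a back-of-the-envelope illustration, here's a character-level entropy estimate in TypeScript – the sample string and the unigram model are just mine for illustration; better models of English get down toward one bit per character, and an LLM is, in effect, a far better model still:

  // H = -Σ p(x) · log2(p(x)), estimated from character frequencies in the text.
  function entropyBitsPerChar(text: string): number {
    const counts = new Map<string, number>();
    for (const ch of text) {
      counts.set(ch, (counts.get(ch) ?? 0) + 1);
    }
    let bits = 0;
    for (const count of counts.values()) {
      const p = count / text.length;
      bits -= p * Math.log2(p);
    }
    return bits;
  }

  const sample = "the quick brown fox jumps over the lazy dog";
  // Prints a bit over 4 bits per character for this unigram view of English-like text.
  console.log(entropyBitsPerChar(sample).toFixed(2));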