<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: jeremy-howard</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/jeremy-howard.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-04-05T22:47:58+00:00</updated><author><name>Simon Willison</name></author><entry><title>Initial impressions of Llama 4</title><link href="https://simonwillison.net/2025/Apr/5/llama-4-notes/#atom-tag" rel="alternate"/><published>2025-04-05T22:47:58+00:00</published><updated>2025-04-05T22:47:58+00:00</updated><id>https://simonwillison.net/2025/Apr/5/llama-4-notes/#atom-tag</id><summary type="html">
    &lt;p&gt;Dropping a model release as significant as Llama 4 on a weekend is plain unfair! So far the best place to learn about the new model family is &lt;a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/"&gt;this post on the Meta AI blog&lt;/a&gt;. They've released two new models today: Llama 4 Maverick is a 400B model (128 experts, 17B active parameters), text and image input with a 1 million token context length. Llama 4 Scout is 109B total parameters (16 experts, 17B active), also multi-modal and with a claimed 10 million token context length - an industry first.&lt;/p&gt;

&lt;p&gt;They also describe Llama 4 Behemoth, a not-yet-released "288 billion active parameter model with 16 experts that is our most powerful yet and among the world’s smartest LLMs". Behemoth has 2 trillion parameters total and was used to train both Scout and Maverick.&lt;/p&gt;
&lt;p&gt;No news yet on a Llama reasoning model beyond &lt;a href="https://www.llama.com/llama4-reasoning-is-coming/"&gt;this coming soon page&lt;/a&gt; with a looping video of an academic-looking llama.&lt;/p&gt;

&lt;p id="lmarena"&gt;Llama 4 Maverick is now sat in second place on &lt;a href="https://lmarena.ai/?leaderboard"&gt;the LM Arena leaderboard&lt;/a&gt;, just behind Gemini 2.5 Pro. &lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: It turns out that's not the same model as the Maverick they released - I missed that their announcement says "Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can try them out using the chat interface from OpenRouter (or through the OpenRouter API) for &lt;a href="https://openrouter.ai/meta-llama/llama-4-scout"&gt;Llama 4 Scout&lt;/a&gt; and &lt;a href="https://openrouter.ai/meta-llama/llama-4-maverick"&gt;Llama 4 Maverick&lt;/a&gt;. OpenRouter are proxying through to &lt;a href="https://console.groq.com/docs/models"&gt;Groq&lt;/a&gt;, &lt;a href="https://fireworks.ai/models"&gt;Fireworks&lt;/a&gt; and &lt;a href="https://docs.together.ai/docs/serverless-models"&gt;Together&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Scout may claim a 10 million token input length, but the available providers currently cap it at 128,000 (Groq and Fireworks) or 328,000 (Together) - I wonder who will win the race to get that full-sized 10 million token window running?&lt;/p&gt;
&lt;p&gt;Llama 4 Maverick claims a 1 million token input length - Fireworks offers 1.05M while Together offers 524,000. Groq isn't offering Maverick yet.&lt;/p&gt;
&lt;p&gt;Meta AI's &lt;a href="https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/build_with_llama_4.ipynb"&gt;build_with_llama_4 notebook&lt;/a&gt; offers a hint as to why 10M tokens is difficult:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Scout supports upto 10M context. On 8xH100, in bf16 you can get upto 1.4M tokens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Jeremy Howard &lt;a href="https://twitter.com/jeremyphoward/status/1908607345393098878"&gt;says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The models are both giant MoEs that can't be run on consumer GPUs, even with quant. [...]&lt;/p&gt;
&lt;p&gt;Perhaps Llama 4 will be a good fit for running on a Mac. Macs are particularly useful for MoE models, since they can have a lot of memory, and their lower compute perf doesn't matter so much, since with MoE fewer params are active. [...]&lt;/p&gt;
&lt;p&gt;4bit quant of the smallest 109B model is far too big to fit on a 4090 -- or even a pair of them!&lt;/p&gt;
&lt;/blockquote&gt;
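&lt;p&gt;That last claim is easy to sanity-check with some back-of-the-envelope arithmetic - a rough sketch counting weights only, at 4 bits per parameter, ignoring the KV cache and runtime overhead that push the real requirement higher:&lt;/p&gt;

```python
def quantized_weight_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GB.

    Ignores KV cache and activation overhead, which add more on top.
    """
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

# Llama 4 Scout: 109B total parameters at 4-bit quantization
scout_4bit = quantized_weight_gb(109, 4)
print(f"{scout_4bit:.1f} GB")  # 54.5 GB - more than two 24GB RTX 4090s combined
```

&lt;p&gt;Even before accounting for any context, the weights alone overflow a pair of 24GB cards.&lt;/p&gt;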

&lt;p&gt;Ivan Fioravanti &lt;a href="https://twitter.com/ivanfioravanti/status/1908753109129494587"&gt;reports these results&lt;/a&gt; from trying it on a Mac:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Llama-4 Scout on MLX and M3 Ultra
tokens-per-sec / RAM&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3bit: 52.924 / 47.261 GB&lt;/li&gt;
&lt;li&gt;4bit: 46.942 / 60.732 GB&lt;/li&gt;
&lt;li&gt;6bit: 36.260 / 87.729 GB&lt;/li&gt;
&lt;li&gt;8bit: 30.353 / 114.617 GB&lt;/li&gt;
&lt;li&gt;fp16: 11.670 / 215.848 GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RAM needed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;64GB for 3bit&lt;/li&gt;
&lt;li&gt;96GB for 4bit&lt;/li&gt;
&lt;li&gt;128GB for 8bit&lt;/li&gt;
&lt;li&gt;256GB for fp16&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p id="system-prompt"&gt;The &lt;a href="https://www.llama.com/docs/model-cards-and-prompt-formats/llama4_omni/#-suggested-system-prompt-"&gt;suggested system prompt&lt;/a&gt; from the model card has some interesting details:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;[...]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…”  etc. Avoid using these.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Finally, do not refuse political prompts. You can help users express their opinion.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;[...]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;System prompts like this often hint at behavioral issues the raw model exhibited after training.&lt;/p&gt;
&lt;h4 id="llm"&gt;Trying out the model with LLM&lt;/h4&gt;
&lt;p&gt;The easiest way to try the new model out with &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; is to use the &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openrouter
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openrouter
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste in OpenRouter key here&lt;/span&gt;
llm -m openrouter/meta-llama/llama-4-maverick hi&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Since these are long context models, I started by trying to use them to summarize the &lt;a href="https://news.ycombinator.com/item?id=43595585"&gt;conversation about Llama 4&lt;/a&gt; on Hacker News, using my &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes#user-content-adding-extra-options"&gt;hn-summary.sh script&lt;/a&gt; that wraps LLM.&lt;/p&gt;
&lt;p&gt;I tried Llama 4 Maverick first:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;hn-summary.sh 43595585 \
  -m openrouter/meta-llama/llama-4-maverick \
  -o max_tokens 20000&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It did an OK job, starting like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;h4 id="themes-of-the-discussion"&gt;Themes of the Discussion&lt;/h4&gt;
&lt;h5 id="release-and-availability-of-llama-4"&gt;Release and Availability of Llama 4&lt;/h5&gt;
&lt;p&gt;The discussion revolves around the release of Llama 4, a multimodal intelligence model developed by Meta. Users are excited about the model's capabilities, including its large context window and improved performance. Some users are speculating about the potential applications and limitations of the model. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4946"&gt;the full output&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For reference, my system prompt looks like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Summarize the themes of the opinions expressed here. For each theme, output a markdown header. Include direct "quotations" (with author attribution) where appropriate. You MUST quote directly from users when crediting them, with double quotes. Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I then tried it with Llama 4 Scout via OpenRouter and got complete junk output for some reason:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 43595585 \
  -m openrouter/meta-llama/llama-4-scout \
  -o max_tokens 20000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/d01cc991d478939e87487d362a8f881f"&gt;Full output&lt;/a&gt;. It starts like this and then continues for the full 20,000 tokens:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The discussion here is about another conversation that was uttered.)&lt;/p&gt;
&lt;p&gt;Here are the results.)&lt;/p&gt;
&lt;p&gt;The conversation between two groups, and I have the same questions on the contrary than those that are also seen in a model."). The fact that I see a lot of interest here.)&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;The reason) The reason) The reason &lt;em&gt;(loops until it runs out of tokens)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This looks broken. I was using OpenRouter so it's possible I got routed to a broken instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 7th April 2025&lt;/strong&gt;: Meta AI's &lt;a href="https://twitter.com/ahmad_al_dahle/status/1909302532306092107"&gt;Ahmed Al-Dahle&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in. We'll keep working through our bug fixes and onboarding partners.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I later managed to run the prompt directly through Groq (with the &lt;a href="https://github.com/angerman/llm-groq"&gt;llm-groq&lt;/a&gt; plugin) - but that had a 2048 limit on output size for some reason:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 43595585 \
  -m groq/meta-llama/llama-4-scout-17b-16e-instruct \
  -o max_tokens 2048
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/a205c5fc131a1d4e9cd6c432a07feedb"&gt;the full result&lt;/a&gt;. It followed my instructions but was &lt;em&gt;very&lt;/em&gt; short - just 630 tokens of output.&lt;/p&gt;
&lt;p&gt;For comparison, here's &lt;a href="https://gist.github.com/simonw/f21ecc7fb2aa13ff682d4ffa11ddcbfd"&gt;the same thing&lt;/a&gt; run against Gemini 2.5 Pro. Gemini's result was &lt;em&gt;massively&lt;/em&gt; better, producing 5,584 output tokens (it spent an additional 2,667 tokens on "thinking").&lt;/p&gt;
&lt;p&gt;I'm not sure how much to judge Llama 4 by these results, to be honest - the model has only been out for a few hours, and it's quite possible that the providers I've tried running against aren't yet optimally configured for this kind of long-context prompt.&lt;/p&gt;
&lt;h4 id="my-hopes-for-llama-4"&gt;My hopes for Llama 4&lt;/h4&gt;
&lt;p&gt;I'm hoping that Llama 4 plays out in a similar way to Llama 3.&lt;/p&gt;
&lt;p&gt;The first Llama 3 models released were 8B and 70B, &lt;a href="https://ai.meta.com/blog/meta-llama-3/"&gt;last April&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Llama 3.1 followed &lt;a href="https://ai.meta.com/blog/meta-llama-3-1/"&gt;in July&lt;/a&gt; at 8B, 70B, and 405B. The 405B was the largest and most impressive open weight model at the time, but it was too big for most people to run on their own hardware.&lt;/p&gt;
&lt;p&gt;Llama 3.2 &lt;a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/"&gt;in September&lt;/a&gt; is where things got really interesting: 1B, 3B, 11B and 90B. The 1B and 3B models both work on my iPhone, and are surprisingly capable! The 11B and 90B models were the first Llamas to support vision, and the 11B &lt;a href="https://simonwillison.net/2024/Sep/25/llama-32/"&gt;ran on my Mac&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then Llama 3.3 landed in December with a 70B model that &lt;a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/"&gt;I wrote about as a GPT-4 class model that ran on my Mac&lt;/a&gt;. It claimed performance similar to the earlier Llama 3.1 405B!&lt;/p&gt;
&lt;p&gt;Today's Llama 4 models are 109B and 400B, both of which were trained with the help of the so-far unreleased 2T Llama 4 Behemoth.&lt;/p&gt;
&lt;p&gt;My hope is that we'll see a whole family of Llama 4 models at varying sizes, following the pattern of Llama 3. I'm particularly excited to see if they produce an improved ~3B model that runs on my phone. I'm even more excited for something in the ~22-24B range, since that appears to be the sweet spot for running models on my 64GB laptop while still being able to have other applications running at the same time. Mistral Small 3.1 is a 24B model and is &lt;a href="https://simonwillison.net/2025/Mar/17/mistral-small-31/"&gt;absolutely superb&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/groq"&gt;groq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="jeremy-howard"/><category term="llm"/><category term="gemini"/><category term="vision-llms"/><category term="groq"/><category term="meta"/><category term="mlx"/><category term="long-context"/><category term="llm-release"/><category term="openrouter"/><category term="chatbot-arena"/></entry><entry><title>Finally, a replacement for BERT: Introducing ModernBERT</title><link href="https://simonwillison.net/2024/Dec/24/modernbert/#atom-tag" rel="alternate"/><published>2024-12-24T06:21:29+00:00</published><updated>2024-12-24T06:21:29+00:00</updated><id>https://simonwillison.net/2024/Dec/24/modernbert/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.answer.ai/posts/2024-12-19-modernbert.html"&gt;Finally, a replacement for BERT: Introducing ModernBERT&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;a href="https://en.wikipedia.org/wiki/BERT_(language_model)"&gt;BERT&lt;/a&gt; was an early language model released by Google in October 2018. Unlike modern LLMs it wasn't designed for generating text. BERT was trained for masked token prediction and was generally applied to problems like Named Entity Recognition or Sentiment Analysis. BERT also wasn't very useful on its own - most applications required you to fine-tune a model on top of it.&lt;/p&gt;
&lt;p&gt;In exploring BERT I decided to try out &lt;a href="https://huggingface.co/dslim/distilbert-NER"&gt;dslim/distilbert-NER&lt;/a&gt;, a popular Named Entity Recognition model fine-tuned on top of DistilBERT (a smaller distilled version of the original BERT model). &lt;a href="https://til.simonwillison.net/llms/bert-ner"&gt;Here are my notes&lt;/a&gt; on running that using &lt;code&gt;uv run&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Jeremy Howard's &lt;a href="https://www.answer.ai/"&gt;Answer.AI&lt;/a&gt; research group, &lt;a href="https://www.lighton.ai/"&gt;LightOn&lt;/a&gt; and friends supported the development of ModernBERT, a brand new BERT-style model that applies many enhancements from the past six years of advances in this space.&lt;/p&gt;
&lt;p&gt;While BERT was trained on 3.3 billion tokens, producing 110 million and 340 million parameter models, ModernBERT was trained on 2 trillion tokens, resulting in 140 million and 395 million parameter models. The parameter count hasn't increased much because it's designed to run on lower-end hardware. It has an 8,192 token context length, a significant improvement on BERT's 512.&lt;/p&gt;
&lt;p&gt;I was able to run one of the demos from the announcement post using &lt;code&gt;uv run&lt;/code&gt; like this (I'm not sure why I had to use &lt;code&gt;numpy&amp;lt;2.0&lt;/code&gt; but without that I got an error about &lt;code&gt;cannot import name 'ComplexWarning' from 'numpy.core.numeric'&lt;/code&gt;):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --with &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;numpy&amp;lt;2.0&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; --with torch --with &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;git+https://github.com/huggingface/transformers.git&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; python&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then this Python:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;torch&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;transformers&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pipeline&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;pprint&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pprint&lt;/span&gt;
&lt;span class="pl-s1"&gt;pipe&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;pipeline&lt;/span&gt;(
    &lt;span class="pl-s"&gt;"fill-mask"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;model&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"answerdotai/ModernBERT-base"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;torch_dtype&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;torch&lt;/span&gt;.&lt;span class="pl-c1"&gt;bfloat16&lt;/span&gt;,
)
&lt;span class="pl-s1"&gt;input_text&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"He walked to the [MASK]."&lt;/span&gt;
&lt;span class="pl-s1"&gt;results&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;pipe&lt;/span&gt;(&lt;span class="pl-s1"&gt;input_text&lt;/span&gt;)
&lt;span class="pl-en"&gt;pprint&lt;/span&gt;(&lt;span class="pl-s1"&gt;results&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;Which downloaded 573MB to &lt;code&gt;~/.cache/huggingface/hub/models--answerdotai--ModernBERT-base&lt;/code&gt; and output:&lt;/p&gt;
&lt;pre&gt;[{&lt;span class="pl-s"&gt;'score'&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.11669921875&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'sequence'&lt;/span&gt;: &lt;span class="pl-s"&gt;'He walked to the door.'&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'token'&lt;/span&gt;: &lt;span class="pl-c1"&gt;3369&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'token_str'&lt;/span&gt;: &lt;span class="pl-s"&gt;' door'&lt;/span&gt;},
 {&lt;span class="pl-s"&gt;'score'&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.037841796875&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'sequence'&lt;/span&gt;: &lt;span class="pl-s"&gt;'He walked to the office.'&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'token'&lt;/span&gt;: &lt;span class="pl-c1"&gt;3906&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'token_str'&lt;/span&gt;: &lt;span class="pl-s"&gt;' office'&lt;/span&gt;},
 {&lt;span class="pl-s"&gt;'score'&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.0277099609375&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'sequence'&lt;/span&gt;: &lt;span class="pl-s"&gt;'He walked to the library.'&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'token'&lt;/span&gt;: &lt;span class="pl-c1"&gt;6335&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'token_str'&lt;/span&gt;: &lt;span class="pl-s"&gt;' library'&lt;/span&gt;},
 {&lt;span class="pl-s"&gt;'score'&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.0216064453125&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'sequence'&lt;/span&gt;: &lt;span class="pl-s"&gt;'He walked to the gate.'&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'token'&lt;/span&gt;: &lt;span class="pl-c1"&gt;7394&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'token_str'&lt;/span&gt;: &lt;span class="pl-s"&gt;' gate'&lt;/span&gt;},
 {&lt;span class="pl-s"&gt;'score'&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.020263671875&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'sequence'&lt;/span&gt;: &lt;span class="pl-s"&gt;'He walked to the window.'&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'token'&lt;/span&gt;: &lt;span class="pl-c1"&gt;3497&lt;/span&gt;,
  &lt;span class="pl-s"&gt;'token_str'&lt;/span&gt;: &lt;span class="pl-s"&gt;' window'&lt;/span&gt;}]&lt;/pre&gt;

&lt;p&gt;I'm looking forward to trying out models that use ModernBERT as their base. The model release is accompanied by a paper (&lt;a href="https://arxiv.org/abs/2412.13663"&gt;Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference&lt;/a&gt;) and &lt;a href="https://huggingface.co/docs/transformers/main/en/model_doc/modernbert"&gt;new documentation&lt;/a&gt; for using it with the Transformers library.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://bsky.app/profile/benjaminwarner.dev/post/3ldur45oz322b"&gt;@benjaminwarner.dev&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/bert"&gt;bert&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nlp"&gt;nlp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers"&gt;transformers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="bert"/><category term="nlp"/><category term="python"/><category term="transformers"/><category term="ai"/><category term="jeremy-howard"/><category term="hugging-face"/><category term="uv"/></entry><entry><title>Quoting fast.ai Discord Server</title><link href="https://simonwillison.net/2024/Nov/9/fastai-discord-server/#atom-tag" rel="alternate"/><published>2024-11-09T06:59:02+00:00</published><updated>2024-11-09T06:59:02+00:00</updated><id>https://simonwillison.net/2024/Nov/9/fastai-discord-server/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/jeremyphoward/status/1855093111929946582"&gt;&lt;p&gt;This is a very friendly and supportive place where you are surrounded by peers - we all want to help each other succeed. The golden rule of this server is:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Don't&lt;/strong&gt; ever try to impress anyone here with your knowledge! Instead try to impress folks here with your &lt;em&gt;desire to learn&lt;/em&gt;, and &lt;em&gt;desire to help others learn&lt;/em&gt;.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/jeremyphoward/status/1855093111929946582"&gt;fast.ai Discord Server&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/community"&gt;community&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fastai"&gt;fastai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/discord"&gt;discord&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;&lt;/p&gt;



</summary><category term="community"/><category term="fastai"/><category term="discord"/><category term="jeremy-howard"/></entry><entry><title>Quoting Jeremy Howard</title><link href="https://simonwillison.net/2024/Jun/29/jeremy-howard/#atom-tag" rel="alternate"/><published>2024-06-29T22:52:41+00:00</published><updated>2024-06-29T22:52:41+00:00</updated><id>https://simonwillison.net/2024/Jun/29/jeremy-howard/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/jeremyphoward/status/1807162709664047144"&gt;&lt;p&gt;Absolutely any time I try to explore something even slightly against commonly accepted beliefs, LLMs &lt;em&gt;always&lt;/em&gt; just rehash the commonly accepted beliefs.&lt;/p&gt;
&lt;p&gt;As a researcher, I find this behaviour worse than unhelpful. It gives the mistaken impression that there's nothing to explore.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/jeremyphoward/status/1807162709664047144"&gt;Jeremy Howard&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="llms"/><category term="jeremy-howard"/></entry><entry><title>fastlite</title><link href="https://simonwillison.net/2024/May/27/fastlite/#atom-tag" rel="alternate"/><published>2024-05-27T21:14:01+00:00</published><updated>2024-05-27T21:14:01+00:00</updated><id>https://simonwillison.net/2024/May/27/fastlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://answerdotai.github.io/fastlite/"&gt;fastlite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New Python library from Jeremy Howard that adds some neat utility functions and syntactic sugar to my &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt; Python library, specifically for interactive use in Jupyter notebooks.&lt;/p&gt;
&lt;p&gt;The autocomplete support through newly exposed dynamic properties is particularly neat, as is the &lt;code&gt;diagram(db.tables)&lt;/code&gt; utility for rendering a graphviz diagram showing foreign key relationships between all of the tables.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/jeremyphoward/status/1795170005367050655"&gt;@jeremyphoward&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="sqlite"/><category term="jupyter"/><category term="sqlite-utils"/><category term="jeremy-howard"/></entry><entry><title>Quoting Jeremy Howard</title><link href="https://simonwillison.net/2024/Apr/29/jeremy-howard/#atom-tag" rel="alternate"/><published>2024-04-29T16:04:01+00:00</published><updated>2024-04-29T16:04:01+00:00</updated><id>https://simonwillison.net/2024/Apr/29/jeremy-howard/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.answer.ai/posts/2024-04-29-sb1047.html"&gt;&lt;p&gt;The creator of a model can not ensure that a model is never used to do something harmful – any more so that the developer of a web browser, calculator, or word processor could. Placing liability on the creators of general purpose tools like these mean that, in practice, such tools can not be created at all, except by big businesses with well funded legal teams.&lt;/p&gt;
&lt;p&gt;[...] Instead of regulating the development of AI models, the focus should be on regulating their applications, particularly those that pose high risks to public safety and security. Regulate the use of AI in high-risk areas such as healthcare, criminal justice, and critical infrastructure, where the potential for harm is greatest, would ensure accountability for harmful use, whilst allowing for the continued advancement of AI technology.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.answer.ai/posts/2024-04-29-sb1047.html"&gt;Jeremy Howard&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;&lt;/p&gt;



</summary><category term="ethics"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="jeremy-howard"/><category term="ai-ethics"/></entry><entry><title>You can now train a 70b language model at home</title><link href="https://simonwillison.net/2024/Mar/8/you-can-now-train-a-70b-language-model-at-home/#atom-tag" rel="alternate"/><published>2024-03-08T10:47:53+00:00</published><updated>2024-03-08T10:47:53+00:00</updated><id>https://simonwillison.net/2024/Mar/8/you-can-now-train-a-70b-language-model-at-home/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html"&gt;You can now train a 70b language model at home&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Jeremy Howard and team: “Today, we’re releasing Answer.AI’s first project: a fully open source system that, for the first time, can efficiently train a 70b large language model on a regular desktop computer with two or more standard gaming GPUs (RTX 3090 or 4090).”&lt;/p&gt;

&lt;p&gt;This is about fine-tuning an existing model, not necessarily training one from scratch.&lt;/p&gt;

&lt;p&gt;There are two tricks at play here. The first is QLoRA, which can be used to train quantized models despite the reduced precision usually preventing gradient descent from working correctly.&lt;/p&gt;

&lt;p&gt;QLoRA can bring the memory requirements for a 70b model down to 35GB, but gaming GPUs aren’t quite that big. The second trick is Meta’s Fully Sharded Data Parallel or FSDP library, which can shard a model across GPUs. Two consumer 24GB GPUs can then handle the 70b training run.&lt;/p&gt;
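&lt;p&gt;The arithmetic behind those numbers is easy to sanity-check - a rough sketch counting the quantized base weights only, since the LoRA adapter weights and optimizer state that QLoRA actually trains add comparatively little on top:&lt;/p&gt;

```python
def quantized_weight_gb(params_billions: float, bits_per_param: float = 4) -> float:
    """Approximate memory for the quantized base model weights, in GB."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

total = quantized_weight_gb(70)  # 70B model at 4-bit quantization
per_gpu = total / 2              # FSDP shards the weights across two GPUs
print(f"{total:.0f} GB total, {per_gpu:.1f} GB per GPU")  # 35 GB total, 17.5 GB per card
```

&lt;p&gt;17.5GB per card leaves room on a 24GB GPU for activations and the trainable adapters.&lt;/p&gt;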

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/jeremyphoward/status/1765868543235805232"&gt;@jeremyphoward&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpus"&gt;gpus&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="jeremy-howard"/><category term="gpus"/></entry><entry><title>Getting Started With CUDA for Python Programmers</title><link href="https://simonwillison.net/2024/Jan/29/getting-started-with-cuda-for-python-programmers/#atom-tag" rel="alternate"/><published>2024-01-29T21:23:57+00:00</published><updated>2024-01-29T21:23:57+00:00</updated><id>https://simonwillison.net/2024/Jan/29/getting-started-with-cuda-for-python-programmers/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=nOxKexn3iBo"&gt;Getting Started With CUDA for Python Programmers&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
If, like me, you’ve avoided CUDA programming (writing efficient code that runs on NVIDIA GPUs) in the past, Jeremy Howard has a new 1hr17m video tutorial that demystifies the basics. The code is all run using PyTorch in notebooks running on Google Colab, and it starts with a very clear demonstration of how to convert an RGB image to black and white.
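That grayscale conversion is a nice first kernel because each output pixel is independent. Here is a pure-Python sketch of the idea (my own illustration, not code from the tutorial): the per-pixel function plays the role of a CUDA thread body, and the host loop stands in for the parallel kernel launch. The luminance weights are the standard ITU-R BT.601 coefficients, an assumption on my part.

```python
# Sketch of an RGB-to-grayscale "kernel" in pure Python. In CUDA, the
# body of grayscale_kernel would run once per thread, all in parallel.

def grayscale_kernel(rgb, out, i):
    """Compute one output pixel - the per-thread body of the kernel."""
    r, g, b = rgb[i]
    out[i] = 0.299 * r + 0.587 * g + 0.114 * b  # BT.601 luminance

def rgb_to_grayscale(rgb):
    """The 'host' loop: a CUDA launch would run every i in parallel."""
    out = [0.0] * len(rgb)
    for i in range(len(rgb)):
        grayscale_kernel(rgb, out, i)
    return out

# A tiny 2-pixel "image": pure red and pure white.
print(rgb_to_grayscale([(255, 0, 0), (255, 255, 255)]))
```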

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/jeremyphoward/status/1752071227228008471"&gt;@jeremyphoward&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytorch"&gt;pytorch&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpus"&gt;gpus&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="pytorch"/><category term="jeremy-howard"/><category term="gpus"/></entry><entry><title>Quoting Jeremy Howard</title><link href="https://simonwillison.net/2023/Dec/1/jeremy-howard/#atom-tag" rel="alternate"/><published>2023-12-01T02:49:00+00:00</published><updated>2023-12-01T02:49:00+00:00</updated><id>https://simonwillison.net/2023/Dec/1/jeremy-howard/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.youtube.com/live/6LXw2beprGI?si=I2JEqIccboFRou0K&amp;amp;t=1526"&gt;&lt;p&gt;So something everybody I think pretty much agrees on, including Sam Altman, including Yann LeCun, is LLMs aren't going to make it. The current LLMs are not a path to ASI. They're getting more and more expensive, they're getting more and more slow, and the more we use them, the more we realize their limitations.&lt;/p&gt;
&lt;p&gt;We're also getting better at taking advantage of them, and they're super cool and helpful, but they appear to be behaving as extremely flexible, fuzzy, compressed search engines, which when you have enough data that's kind of compressed into the weights, turns out to be an amazingly powerful operation to have at your disposal.&lt;/p&gt;
&lt;p&gt;[...] And the thing you can really see missing here is this planning piece, right? So if you try to get an LLM to solve fairly simple graph coloring problems or fairly simple stacking problems, things that require backtracking and trying things and stuff, unless it's something pretty similar in its training, they just fail terribly.&lt;/p&gt;
&lt;p&gt;[...] So that's the theory about what something like Q* might be, or just in general, how do we get past this current constraint that we have?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.youtube.com/live/6LXw2beprGI?si=I2JEqIccboFRou0K&amp;amp;t=1526"&gt;Jeremy Howard&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="jeremy-howard"/></entry><entry><title>A Hackers' Guide to Language Models</title><link href="https://simonwillison.net/2023/Sep/25/a-hackers-guide-to-language-models/#atom-tag" rel="alternate"/><published>2023-09-25T00:24:50+00:00</published><updated>2023-09-25T00:24:50+00:00</updated><id>https://simonwillison.net/2023/Sep/25/a-hackers-guide-to-language-models/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=jkrNMKz9pWU"&gt;A Hackers&amp;#x27; Guide to Language Models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Jeremy Howard’s new 1.5 hour YouTube introduction to language models looks like a really useful place to catch up if you’re an experienced Python programmer looking to start experimenting with LLMs. He covers what they are and how they work, then shows how to build against the OpenAI API, build a Code Interpreter clone using OpenAI functions, run models from Hugging Face on your own machine (with NVIDIA cards or on a Mac) and finishes with a demo of fine-tuning a Llama 2 model to perform text-to-SQL using an open dataset.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="jeremy-howard"/><category term="fine-tuning"/><category term="nvidia"/></entry><entry><title>Mojo may be the biggest programming advance in decades</title><link href="https://simonwillison.net/2023/May/4/mojo/#atom-tag" rel="alternate"/><published>2023-05-04T04:41:03+00:00</published><updated>2023-05-04T04:41:03+00:00</updated><id>https://simonwillison.net/2023/May/4/mojo/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.fast.ai/posts/2023-05-03-mojo-launch.html"&gt;Mojo may be the biggest programming advance in decades&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Jeremy Howard makes a very convincing argument for why the new programming language Mojo is a big deal.&lt;/p&gt;

&lt;p&gt;Mojo is a superset of Python designed by a team led by Chris Lattner, who previously created LLVM, Clang and Swift.&lt;/p&gt;

&lt;p&gt;Existing Python code should work unmodified, but it also adds features that enable performant low-level programming—like “fn” for creating typed, compiled functions and “struct” for memory-optimized alternatives to classes.&lt;/p&gt;

&lt;p&gt;It’s worth watching Jeremy’s video where he uses these features to get more than a 2000x speed up implementing matrix multiplication, while still keeping the code readable and easy to follow.&lt;/p&gt;

&lt;p&gt;Mojo isn’t available yet outside of a playground preview environment, but it does look like an intriguing new project.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/hamelhusain/status/1653963183013634048"&gt;@hamelhusain&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/programming-languages"&gt;programming-languages&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mojo"&gt;mojo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;&lt;/p&gt;



</summary><category term="programming-languages"/><category term="python"/><category term="ai"/><category term="mojo"/><category term="jeremy-howard"/></entry><entry><title>From Deep Learning Foundations to Stable Diffusion</title><link href="https://simonwillison.net/2023/Apr/5/from-deep-learning-foundations-to-stable-diffusion/#atom-tag" rel="alternate"/><published>2023-04-05T01:13:19+00:00</published><updated>2023-04-05T01:13:19+00:00</updated><id>https://simonwillison.net/2023/Apr/5/from-deep-learning-foundations-to-stable-diffusion/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.fast.ai/posts/part2-2023.html"&gt;From Deep Learning Foundations to Stable Diffusion&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Brand new free online video course from Jeremy Howard: 30 hours of content, covering everything you need to know to implement the Stable Diffusion image generation algorithm from scratch. I previewed parts of this course back in December and it was fascinating: this field is moving so fast that some of the lectures covered papers that had been released just a few days before.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fastai"&gt;fastai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stable-diffusion"&gt;stable-diffusion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="fastai"/><category term="stable-diffusion"/><category term="generative-ai"/><category term="jeremy-howard"/><category term="text-to-image"/></entry><entry><title>Your own hosted blog, the easy, free, open way (even if you're not a computer expert)</title><link href="https://simonwillison.net/2020/Jan/17/fast-template/#atom-tag" rel="alternate"/><published>2020-01-17T01:12:56+00:00</published><updated>2020-01-17T01:12:56+00:00</updated><id>https://simonwillison.net/2020/Jan/17/fast-template/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.fast.ai/2020/01/16/fast_template/"&gt;Your own hosted blog, the easy, free, open way (even if you&amp;#x27;re not a computer expert)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Jeremy Howard and the fast.ai team have released fast_template—a GitHub repository designed to be used as a template to create new repositories with a complete Jekyll blog configured for use with GitHub Pages. GitHub’s official documentation recommends you install Ruby on your machine to do this, but Jeremy points out that with the right repository setup you can run a blog entirely by editing files through the GitHub web interface.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/jeremyphoward/status/1217909025259442176"&gt;@jeremyphoward&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/blogging"&gt;blogging&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;&lt;/p&gt;



</summary><category term="blogging"/><category term="github"/><category term="jeremy-howard"/></entry><entry><title>Automatically playing science communication games with transfer learning and fastai</title><link href="https://simonwillison.net/2018/Oct/29/transfer-learning/#atom-tag" rel="alternate"/><published>2018-10-29T03:16:33+00:00</published><updated>2018-10-29T03:16:33+00:00</updated><id>https://simonwillison.net/2018/Oct/29/transfer-learning/#atom-tag</id><summary type="html">
    &lt;p&gt;This weekend was the 9th annual &lt;a href="https://sf.sciencehackday.org/"&gt;Science Hack Day San Francisco&lt;/a&gt;, which was also the 100th Science Hack Day held worldwide.&lt;/p&gt;
&lt;p&gt;Natalie and I decided to combine our interests and build something fun.&lt;/p&gt;
&lt;p&gt;I’m currently enrolled in Jeremy Howard’s &lt;a href="http://course.fast.ai/"&gt;Deep Learning course&lt;/a&gt; so I figured this was a great opportunity to try out some computer vision.&lt;/p&gt;
&lt;p&gt;Natalie runs the &lt;a href="https://natbat.github.io/scicomm-calendar/"&gt;SciComm Games calendar&lt;/a&gt; and accompanying &lt;a href="https://twitter.com/SciCommGames"&gt;@SciCommGames&lt;/a&gt; bot to promote and catalogue science communication hashtag games on Twitter.&lt;/p&gt;
&lt;p&gt;Hashtag games? Natalie &lt;a href="https://natbat.github.io/scicomm-calendar/"&gt;explains them here&lt;/a&gt; - essentially they are games run by scientists on Twitter to foster public engagement around an animal or topic by challenging people to identify if a photo is a #cougarOrNot or participate in a #TrickyBirdID or identify #CrowOrNo or many others.&lt;/p&gt;
&lt;p&gt;Combining the two… we decided to build a bot that automatically plays these games using computer vision. So far it’s just trying #cougarOrNot - you can see the bot in action at &lt;a href="https://twitter.com/critter_vision/with_replies"&gt;@critter_vision&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Training_data_from_iNaturalist_14"&gt;&lt;/a&gt;Training data from iNaturalist&lt;/h3&gt;
&lt;p&gt;In order to build a machine learning model, you need to start out with some training data.&lt;/p&gt;
&lt;p&gt;I’m a big fan of &lt;a href="https://www.inaturalist.org/"&gt;iNaturalist&lt;/a&gt;, a citizen science project that encourages users to upload photographs of wildlife (and plants) they have seen and have their observations verified by a community. Natalie and I used it to build &lt;a href="https://www.owlsnearme.com/"&gt;owlsnearme.com&lt;/a&gt; earlier this year - the API in particular is fantastic.&lt;/p&gt;
&lt;p&gt;iNaturalist has &lt;a href="https://www.inaturalist.org/observations?place_id=1&amp;amp;taxon_id=41944"&gt;over 5,000 verified sightings&lt;/a&gt; of felines (cougars, bobcats, domestic cats and more) in the USA.&lt;/p&gt;
&lt;p&gt;The raw data is available as &lt;a href="http://api.inaturalist.org/v1/observations?identified=true&amp;amp;photos=true&amp;amp;identifications=most_agree&amp;amp;quality_grade=research&amp;amp;order=desc&amp;amp;order_by=created_at&amp;amp;taxon_id=41944&amp;amp;place_id=1&amp;amp;per_page=200"&gt;a paginated JSON API&lt;/a&gt;. The &lt;a href="https://static.inaturalist.org/photos/27333309/medium.jpg"&gt;medium sized photos&lt;/a&gt; are just the right size for training a neural network.&lt;/p&gt;
&lt;p&gt;I started by grabbing 5,000 images and saving them to disk with a filename that reflected their identified species:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Bobcat_9005106.jpg
Domestic-Cat_10068710.jpg
Bobcat_15713672.jpg
Domestic-Cat_6755280.jpg
Mountain-Lion_9075705.jpg
&lt;/code&gt;&lt;/pre&gt;
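That filename convention matters because the training code below extracts the label with a regular expression. A quick check (my own snippet, using the same pattern that gets passed to ImageDataBunch.from_name_re) shows how it pulls the species out of each name:

```python
import re

# The pattern from the training code: everything between the final "/"
# and the trailing "_<digits>.jpg" becomes the label. The greedy [^/]+
# backtracks just far enough to leave the numeric ID for _\d+.
label_re = re.compile(r'/([^/]+)_\d+.jpg$')

for path in [
    "/data/images/Bobcat_9005106.jpg",
    "/data/images/Domestic-Cat_10068710.jpg",
    "/data/images/Mountain-Lion_9075705.jpg",
]:
    print(label_re.search(path).group(1))
# Bobcat
# Domestic-Cat
# Mountain-Lion
```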
&lt;h3&gt;&lt;a id="Building_a_model_32"&gt;&lt;/a&gt;Building a model&lt;/h3&gt;
&lt;p&gt;I’m only one week into the &lt;a href="http://www.fast.ai/"&gt;fast.ai&lt;/a&gt; course so this really isn’t particularly sophisticated yet, but it was just about good enough to power our hack.&lt;/p&gt;
&lt;p&gt;The main technique we are learning in the course is called &lt;a href="https://machinelearningmastery.com/transfer-learning-for-deep-learning/"&gt;transfer learning&lt;/a&gt;, and it really is shockingly effective. Instead of training a model from scratch you start out with a pre-trained model and use some extra labelled images to train a small number of extra layers.&lt;/p&gt;
&lt;p&gt;The initial model we are using is &lt;a href="https://www.kaggle.com/pytorch/resnet34"&gt;ResNet-34&lt;/a&gt;, a 34-layer neural network trained on 1,000 labelled categories in the &lt;a href="http://www.image-net.org/"&gt;ImageNet&lt;/a&gt; corpus.&lt;/p&gt;
&lt;p&gt;In class, we learned to use this technique to get 94% accuracy against the &lt;a href="http://www.robots.ox.ac.uk/~vgg/data/pets/"&gt;Oxford-IIIT Pet Dataset&lt;/a&gt; - around 7,000 images covering 12 cat breeds and 25 dog breeds. In 2012 the researchers at Oxford were able to get 59.21% using a sophisticated model - in 2018 we can get 94% with transfer learning and just a few lines of code.&lt;/p&gt;
&lt;p&gt;I started with an example provided in class, which loads and trains images from files on disk using a regular expression that extracts the labels from the filenames.&lt;/p&gt;
&lt;p&gt;My full Jupyter notebook is &lt;a href="https://github.com/simonw/cougar-or-not/blob/master/inaturalist-cats.ipynb"&gt;inaturalist-cats.ipynb&lt;/a&gt; - the key training code is as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from fastai import *
from fastai.vision import *
cat_images_path = Path('/home/jupyter/.fastai/data/inaturalist-usa-cats/images')
cat_fnames = get_image_files(cat_images_path)
cat_data = ImageDataBunch.from_name_re(
    cat_images_path,
    cat_fnames,
    r'/([^/]+)_\d+.jpg$',
    ds_tfms=get_transforms(),
    size=224
)
cat_data.normalize(imagenet_stats)
cat_learn = ConvLearner(cat_data, models.resnet34, metrics=error_rate)
cat_learn.fit_one_cycle(4)
# Save the generated model to disk
cat_learn.save(&amp;quot;usa-inaturalist-cats&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Calling &lt;code&gt;cat_learn.save(&amp;quot;usa-inaturalist-cats&amp;quot;)&lt;/code&gt; created an 84MB file on disk at &lt;code&gt;/home/jupyter/.fastai/data/inaturalist-usa-cats/images/models/usa-inaturalist-cats.pth&lt;/code&gt; - I used &lt;code&gt;scp&lt;/code&gt; to copy that model down to my laptop.&lt;/p&gt;
&lt;p&gt;This model gave me a 24% error rate which is pretty terrible - others on the course have been getting error rates less than 10% for all kinds of interesting problems. My focus was to get a model deployed as an API though so I haven’t spent any additional time fine-tuning things yet.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Deploying_the_model_as_an_API_67"&gt;&lt;/a&gt;Deploying the model as an API&lt;/h3&gt;
&lt;p&gt;The &lt;a href="https://github.com/fastai/fastai"&gt;fastai library&lt;/a&gt; strongly encourages training against a GPU, using &lt;a href="https://pytorch.org/"&gt;pytorch&lt;/a&gt; and &lt;a href="https://mathema.tician.de/software/pycuda/"&gt;PyCUDA&lt;/a&gt;. I’ve been using n1-highmem-8 Google Cloud Platform instance with an attached Tesla P4, then running everything in a Jupyter notebook there. This costs around $0.38 an hour - fine for a few hours of training, but way too expensive to permanently host a model.&lt;/p&gt;
&lt;p&gt;Thankfully, while a GPU is essential for productively training models it’s not nearly as important for evaluating them against new data. pytorch can run in CPU mode for that just fine on standard hardware, and the &lt;a href="https://github.com/fastai/fastai/blob/master/README.md"&gt;fastai README&lt;/a&gt; includes instructions on installing it for a CPU using pip.&lt;/p&gt;
&lt;p&gt;I started out by ensuring I could execute my generated model on my own laptop (since pytorch doesn’t yet work with the GPU built into the Macbook Pro). Once I had that working, I used the resulting code to write a tiny Starlette-powered API server. The code for that can be found &lt;a href="https://github.com/simonw/cougar-or-not/blob/8adafac571aad3385317c76bd229448b3cdaa0ac/cougar.py"&gt;in cougar.py&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;fastai is under very heavy development and the latest version doesn’t quite have a clean way of loading a model from disk without also including the initial training images, so I had to hack around quite a bit to get this working using clues from &lt;a href="https://forums.fast.ai/"&gt;the fastai forums&lt;/a&gt;. I expect this to get much easier over the next few weeks as the library continues to evolve based on feedback from the current course.&lt;/p&gt;
&lt;p&gt;To deploy the API I wrote &lt;a href="https://github.com/simonw/cougar-or-not/blob/8adafac571aad3385317c76bd229448b3cdaa0ac/Dockerfile"&gt;a Dockerfile&lt;/a&gt; and shipped it to &lt;a href="https://zeit.co/now"&gt;Zeit Now&lt;/a&gt;. Now remains my go-to choice for this kind of project, though unfortunately their new (and brilliant) v2 platform imposes &lt;a href="https://github.com/zeit/now-cli/issues/1523"&gt;a 100MB image size limit&lt;/a&gt; - not nearly enough when the model file itself weighs in at 83 MB. Thankfully it’s still possible to &lt;a href="https://github.com/simonw/cougar-or-not/commit/5ad3d5b49c6419e4c2440291bc5fb204625aae83"&gt;specify their v1 cloud&lt;/a&gt; which is more forgiving for larger applications.&lt;/p&gt;
&lt;p&gt;Here’s the result: an API which can accept either the URL to an image or an uploaded image file: &lt;a href="https://cougar-or-not.now.sh/"&gt;https://cougar-or-not.now.sh/&lt;/a&gt; - try it out with &lt;a href="https://cougar-or-not.now.sh/classify-url?url=https://upload.wikimedia.org/wikipedia/commons/9/9a/Oregon_Cougar_ODFW.JPG"&gt;a cougar&lt;/a&gt; and &lt;a href="https://cougar-or-not.now.sh/classify-url?url=https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Bobcat2.jpg/1200px-Bobcat2.jpg"&gt;a bobcat&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;a id="The_Twitter_Bot_81"&gt;&lt;/a&gt;The Twitter Bot&lt;/h3&gt;
&lt;p&gt;Natalie built &lt;a href="https://github.com/natbat/CritterVision"&gt;the Twitter bot&lt;/a&gt;. It runs as a scheduled task on Heroku and works by checking for new #cougarOrNot tweets from &lt;a href="https://twitter.com/drmichellelarue"&gt;Dr. Michelle LaRue&lt;/a&gt;, extracting any images, passing them to my API and replying with a tweet that summarizes the results. Take a look at &lt;a href="https://twitter.com/critter_vision/with_replies"&gt;its recent replies&lt;/a&gt; to get a feel for how well it is doing.&lt;/p&gt;
&lt;p&gt;Amusingly, Dr. LaRue frequently tweets memes to promote upcoming competitions and marks them with the same hashtag. The bot appears to think that most of the memes are bobcats! I should definitely spend some time tuning that model.&lt;/p&gt;
&lt;p&gt;Science Hack Day was great fun. A big thanks to the organizing team, and congrats to all of the other participants. I’m really looking forward to the next one.&lt;/p&gt;
&lt;p&gt;Plus… we won a medal!&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Enjoyed &lt;a href="https://twitter.com/hashtag/scienceHackday?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#scienceHackday&lt;/a&gt; this weekend, made &amp;amp; launched a cool machine learning hack to process images &amp;amp; work out if they have a cougar in them or not! &lt;a href="https://twitter.com/hashtag/CougarOrNot?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#CougarOrNot&lt;/a&gt; &lt;a href="https://twitter.com/critter_vision?ref_src=twsrc%5Etfw"&gt;@critter_vision&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;... we won a medal!&lt;br /&gt;&lt;br /&gt;Bot code: &lt;a href="https://t.co/W2jZcGCnFr"&gt;https://t.co/W2jZcGCnFr&lt;/a&gt;&lt;br /&gt;Machine learning API: &lt;a href="https://t.co/swNiKlcTp0"&gt;https://t.co/swNiKlcTp0&lt;/a&gt; &lt;a href="https://t.co/dcdIhNZy63"&gt;pic.twitter.com/dcdIhNZy63&lt;/a&gt;&lt;/p&gt;&amp;#8212; Natbat (@Natbat) &lt;a href="https://twitter.com/Natbat/status/1056717060116369410?ref_src=twsrc%5Etfw"&gt;October 29, 2018&lt;/a&gt;&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/computer-vision"&gt;computer-vision&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/natalie-downe"&gt;natalie-downe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/inaturalist"&gt;inaturalist&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fastai"&gt;fastai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transferlearning"&gt;transferlearning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/starlette"&gt;starlette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="computer-vision"/><category term="machine-learning"/><category term="natalie-downe"/><category term="inaturalist"/><category term="fastai"/><category term="transferlearning"/><category term="jeremy-howard"/><category term="starlette"/></entry></feed>