Quoting SparseGPT

2023-05-03T19:48:26+00:00

We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. [...] We can execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours, and can reach 60% unstructured sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time.

— SparseGPT, by Elias Frantar and Dan Alistarh

Tags: ai, generative-ai, local-llms, llms, bloom

ROOTS search tool

2023-04-03T20:40:28+00:00

BLOOM is one of the most interesting completely openly licensed language models. The ROOTS corpus is the training data that was collected for it, and this tool lets you run searches directly against that corpus. I tried searching for my own name and got an interesting insight into what it knows about me.

Via Anna Rogers

Tags: ai, generative-ai, llms, bloom, training-data

Could you train a ChatGPT-beating model for $85,000 and run it in a browser?

2023-03-17T15:43:38+00:00

I think it's now possible to train a large language model with similar functionality to GPT-3 for $85,000. And I think we might soon be able to run the resulting model entirely in the browser, and give it capabilities that leapfrog it ahead of ChatGPT.

This is currently wild speculation on my part, but bear with me because I think this is worth exploring further.

Large language models with GPT-3-like capabilities cost millions of dollars to build, thanks to the cost of running the expensive GPU servers needed to train them. Whether you are renting or buying those machines, there are still enormous energy costs to cover.

Just one example of this: the BLOOM large language model was trained in France with the support of the French government. The cost was estimated at $2-5M, training took almost four months, and the project boasts about its low carbon footprint because most of the power came from a nuclear reactor!

[ Fun fact: as of a few days ago you can now run the openly licensed BLOOM on your own laptop, using Nouamane Tazi's adapted copy of the llama.cpp code that made that possible for LLaMA ]

Recent developments have made me suspect that these costs could be made dramatically lower. I think a capable language model can now be trained from scratch for around $85,000.

It's all about that LLaMA

The LLaMA plus Alpaca combination is the key here.

I wrote about these two projects previously:

Large language models are having their Stable Diffusion moment discusses the significance of LLaMA
Stanford Alpaca, and the acceleration of on-device large language model development describes Alpaca

To recap: LLaMA by Meta research provided a GPT-3 class model trained entirely on documented, available public training information, as opposed to OpenAI's continuing practice of not revealing the sources of their training data.

This makes the model training a whole lot more likely to be replicable by other teams.

The paper also describes some enormous efficiency improvements they made to the training process.

The LLaMA research was still extremely expensive though. From the paper:

... we estimate that we used 2048 A100-80GB for a period of approximately 5 months to develop our models

My friends at Replicate told me that a simple rule of thumb for A100 cloud costs is $1/hour.

2048 * 5 * 30 * 24 = $7,372,800

But... that $7M was the cost to both iterate on the model and to train all four sizes of LLaMA that they tried: 7B, 13B, 33B, and 65B.

Here's Table 15 from the paper, showing the cost of training each model.

This shows that the smallest model, LLaMA-7B, was trained on 82,432 hours of A100-80GB GPUs, consuming 36MWh and generating 14 tons of CO2.

(That's about 28 people flying from London to New York.)

Going by the $1/hour rule of thumb, this means that provided you get everything right on your first run you can train a LLaMA-7B scale model for around $82,432.
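The arithmetic behind both figures can be sketched in a few lines of Python. The $1/hour rate is the rule of thumb quoted above, the 30-day month is the same approximation used in the earlier calculation, and the function and variable names are made up for this illustration.

```python
# Rough training-cost estimates using the $1 per A100-hour rule of thumb.
A100_DOLLARS_PER_HOUR = 1  # assumption: rule-of-thumb cloud price, not a real quote


def gpu_hours(gpus: int, months: int, days_per_month: int = 30) -> int:
    """GPU-hours for a fleet running non-stop (30-day months assumed)."""
    return gpus * months * days_per_month * 24


# The whole research effort: 2048 GPUs for roughly 5 months.
total_hours = gpu_hours(gpus=2048, months=5)
total_cost = total_hours * A100_DOLLARS_PER_HOUR

# LLaMA-7B alone: the GPU-hours figure reported in Table 15 of the paper.
llama_7b_hours = 82_432
llama_7b_cost = llama_7b_hours * A100_DOLLARS_PER_HOUR

print(f"Whole project: ${total_cost:,}")
print(f"LLaMA-7B only: ${llama_7b_cost:,}")
print(f"7B share of the total: {llama_7b_hours / total_hours:.1%}")
```

Changing A100_DOLLARS_PER_HOUR is the quickest way to see how sensitive the estimate is to the price you can actually get for GPU time.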

Upgrading to Alpaca

You can run LLaMA 7B on your own laptop (or even on a phone), but you may find it hard to get good results out of it. That's because it hasn't been instruction tuned, so it's not great at answering the kind of prompts that you might send to ChatGPT or GPT-3 or 4.

Alpaca is the project from Stanford that fixes that. They fine-tuned LLaMA on 52,000 instructions (of somewhat dubious origin) and claim to have gotten ChatGPT-like performance as a result... from that smallest 7B LLaMA model!

You can try out their demo (update: no you can't, "Our live demo is suspended until further notice") and see for yourself that it really does capture at least some of that ChatGPT magic.

The best bit? The Alpaca fine-tuning can be done for less than $100. The Replicate team have repeated the training process and published a tutorial about how they did it.

Other teams have also been able to replicate the Alpaca fine-tuning process, for example antimatter15/alpaca.cpp on GitHub.

We are still within our $85,000 budget! And Alpaca - or an Alpaca-like model using different fine-tuning data - is the ChatGPT-on-your-own-device model that we've all been hoping for.

Could we run it in a browser?

Alpaca is effectively the same size as LLaMA 7B - around 3.9GB (after 4-bit quantization à la llama.cpp). And LLaMA 7B has already been shown running on a whole bunch of different personal devices: laptops, Raspberry Pis (very slowly) and even a Pixel 5 phone at a decent speed!
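Where might a figure like 3.9GB come from? Here is a back-of-the-envelope sketch. It assumes roughly 6.7 billion parameters for the 7B model, and a quantization layout in which every block of 32 weights stores 4 bits per weight plus one shared 32-bit scale factor. Both are assumptions made for illustration rather than details taken from the llama.cpp source, and real model files carry some extra overhead.

```python
# Back-of-the-envelope size of a 4-bit quantized 7B model.
PARAMS = 6.7e9        # assumption: approximate parameter count of the 7B model
BITS_PER_WEIGHT = 4   # 4-bit quantization
BLOCK_SIZE = 32       # assumption: number of weights per quantization block
SCALE_BITS = 32       # assumption: one 32-bit scale factor stored per block


def quantized_bytes(params: float, bits: int, block_size: int, scale_bits: int) -> float:
    """Bytes needed when each block of weights shares one scale factor."""
    effective_bits = bits + scale_bits / block_size  # 4 + 32/32 = 5 bits per weight
    return params * effective_bits / 8


size = quantized_bytes(PARAMS, BITS_PER_WEIGHT, BLOCK_SIZE, SCALE_BITS)
print(f"{size / 2**30:.1f} GiB")
```

Under those assumptions the per-block scale adds about a quarter on top of the raw 4 bits per weight, which is why the result lands near 3.9GB rather than the 3.35GB that 4 bits alone would suggest.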

The next frontier: running it in the browser.

I saw two tech demos yesterday that made me think this may be possible in the near future.

The first is Transformers.js. This is a WebAssembly port of the Hugging Face Transformers library of models - previously only available for server-side Python.

It's worth spending some time with their demos, which include some smaller language models and some very impressive image analysis models too.

The second is Web Stable Diffusion. This team managed to get the Stable Diffusion generative image model running entirely in the browser as well!

Web Stable Diffusion uses WebGPU, a still emerging standard that's currently only working in Chrome Canary. But it does work! It rendered me this image of two raccoons eating a pie in the forest in 38 seconds.

The Stable Diffusion model this loads into the browser is around 1.9GB.

LLaMA/Alpaca at 4-bit quantization is 3.9GB.

The sizes of these two models are similar enough that I would not be at all surprised to see an Alpaca-like model running in the browser in the not-too-distant future. I wouldn't be surprised if someone is working on that right now.

Now give it extra abilities with ReAct

A model running in your browser that behaved like a less capable version of ChatGPT would be pretty impressive. But what if it could be MORE capable than ChatGPT?

The ReAct prompt pattern is a simple, proven way of expanding a language model's abilities by giving it access to extra tools.

Matt Webb explains the significance of the pattern in The surprising ease and effectiveness of AI in a loop.

I got it working with a few dozen lines of Python myself, which I described in A simple Python implementation of the ReAct pattern for LLMs.

Here's the short version: you tell the model that it must think out loud and now has access to tools. It can then work through a question like this:

Question: Population of Paris, squared?

Thought: I should look up the population of paris and then multiply it

Action: search_wikipedia: Paris

Then it stops. Your code harness for the model reads that last line, sees the action and goes and executes an API call against Wikipedia. It continues the dialog with the model like this:

Observation: <truncated content from the Wikipedia page, including the 2,248,780 population figure>

The model continues:

Thought: Paris population is 2,248,780 I should square that

Action: calculator: 2248780 ** 2

Control is handed back to the harness, which passes that to a calculator and returns:

Observation: 5057011488400

The model then provides the answer:

Answer: The population of Paris squared is 5,057,011,488,400

Adding new actions to this system is trivial: each one can be a few lines of code.
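The loop described above can be sketched roughly like this. It is a minimal illustration, not the implementation from the linked post: scripted_model is a hypothetical stand-in that replays the example dialog instead of calling a real LLM, search_wikipedia returns a canned snippet instead of hitting the Wikipedia API, and all function names are invented for this example.

```python
import re


def calculator(expression: str) -> str:
    # Allow only digits, whitespace and basic arithmetic characters, so
    # eval() never sees names or function calls. Still a sketch: a huge
    # exponent could take a very long time to evaluate.
    if not re.fullmatch(r"[\d\s+\-*/().]+", expression):
        raise ValueError(f"unsupported expression: {expression!r}")
    return str(eval(expression, {"__builtins__": {}}, {}))


def search_wikipedia(query: str) -> str:
    # Stand-in for a real API call: canned text keeps the sketch offline.
    canned = {"Paris": "Paris is the capital of France, population 2,248,780"}
    return canned.get(query, "No results")


# Adding a new action means adding one more entry to this dictionary.
ACTIONS = {"calculator": calculator, "search_wikipedia": search_wikipedia}
ACTION_RE = re.compile(r"^Action: (\w+): (.*)$", re.MULTILINE)


def scripted_model(transcript: str) -> str:
    """Hypothetical stand-in for an LLM: replays the example dialog."""
    turn = transcript.count("Observation:")
    if turn == 0:
        return ("Thought: I should look up the population of paris and then multiply it\n"
                "Action: search_wikipedia: Paris")
    if turn == 1:
        return ("Thought: Paris population is 2,248,780 I should square that\n"
                "Action: calculator: 2248780 ** 2")
    # Final turn: build the answer from the most recent observation.
    last_observation = transcript.rsplit("Observation: ", 1)[-1]
    return f"Answer: The population of Paris squared is {int(last_observation):,}"


def react(question: str, model=scripted_model, max_turns: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_turns):
        reply = model(transcript)
        transcript += "\n" + reply
        match = ACTION_RE.search(reply)
        if match is None:
            return reply  # no action requested, so treat this as the answer
        name, argument = match.groups()
        if name in ACTIONS:
            observation = ACTIONS[name](argument.strip())
        else:
            observation = f"Unknown action: {name}"
        # Hand the result back to the model as the next Observation line.
        transcript += f"\nObservation: {observation}"
    raise RuntimeError("model did not produce an answer")


print(react("Population of Paris, squared?"))
```

To try this against a real model, you would swap scripted_model for a function that sends the transcript to the model and returns its reply; the harness loop itself would not need to change.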

And as the ReAct paper demonstrates, adding these capabilities to even an under-powered model (such as LLaMA 7B) can dramatically improve its abilities, at least according to several common language model benchmarks.

This is essentially what Bing is! It's GPT-4 with the added ability to run searches against the Bing search index.

Obviously if you're going to give a language model the ability to execute API calls and evaluate code you need to do it in a safe environment! Like for example... a web browser, which runs code from untrusted sources as a matter of habit and has the most thoroughly tested sandbox mechanism of any piece of software we've ever created.

Adding it all together

There are a lot more groups out there that can afford to spend $85,000 training a model than there are that can spend $2M or more.

I think LLaMA and Alpaca are going to have a lot of competition soon, from an increasing pool of openly licensed models.

A fine-tuned LLaMA scale model is leaning in the direction of a ChatGPT competitor already. But... if you hook in some extra capabilities as seen in ReAct and Bing even that little model should be able to way outperform ChatGPT in terms of actual ability to solve problems and do interesting things.

And we might be able to run such a thing on our phones... or even in our web browsers... sooner than you think.

And it's only going to get cheaper

Tobias Lütke on Twitter:

H100s are shipping and you can half this again. Twice (or more) if fp8 works.
- tobi lutke (@tobi) March 17, 2023

The H100 is the new Tensor Core GPU from NVIDIA, which they claim can offer up to a 30x performance improvement over their current A100s.

Tags: ai, webassembly, generative-ai, chatgpt, llama, local-llms, llms, bloom, mlc, transformers-js, llm-tool-use, llama-cpp, tobias-lutke

bloomz.cpp

2023-03-16T00:24:37+00:00

Nouamane Tazi adapted the llama.cpp project to run against the BLOOM family of language models, which were released in July 2022 and trained in France on 45 natural languages and 12 programming languages using the Jean Zay Public Supercomputer, provided by the French government and powered using mostly nuclear energy.

It’s under the RAIL license which allows (limited) commercial use, unlike LLaMA.

Nouamane reports getting 16 tokens/second from BLOOMZ-7B1 running on an M1 Pro laptop.

Via @nouamanetazi

Tags: open-source, ai, generative-ai, llama, local-llms, llms, bloom, llama-cpp, ai-energy-usage

Petals

2023-01-02T23:29:42+00:00

The challenge with large language models in the same scale ballpark as GPT-3 is that they're large, really large. Far too big to run on a single machine at home. Petals is a fascinating attempt to address that problem: it works a little bit like BitTorrent, in that each user of Petals runs a subset of the overall language model on their machine and participates in a larger network to run inference across potentially hundreds of distributed GPUs. I tried it just now in Google Colab and it worked exactly as advertised, after downloading an 8GB subset of the 352GB BLOOM-176B model.

Via @bobek

Tags: ai, gpt-3, generative-ai, llms, bloom, gpus

Simon Willison's Weblog: bloom