Simon Willison's Weblog: nvidia-spark

Using Codex CLI with gpt-oss:120b on an NVIDIA DGX Spark via Tailscale

2025-11-07T07:23:12+00:00

Using Codex CLI with gpt-oss:120b on an NVIDIA DGX Spark via Tailscale

Inspired by a YouTube comment I wrote up how I run OpenAI's Codex CLI coding agent against the gpt-oss:120b model running in Ollama on my NVIDIA DGX Spark via a Tailscale network.

It takes a little bit of work to configure but the result is I can now use Codex CLI on my laptop anywhere in the world against a self-hosted model.

I used it to build this space invaders clone.

Tags: ai, tailscale, til, generative-ai, local-llms, llms, nvidia, coding-agents, space-invaders, codex, nvidia-spark

Getting DeepSeek-OCR working on an NVIDIA Spark via brute force using Claude Code

2025-10-20T17:21:52+00:00

DeepSeek released a new model yesterday: DeepSeek-OCR, a 6.6GB model fine-tuned specifically for OCR. They released it as model weights that run using PyTorch and CUDA. I got it running on the NVIDIA Spark by having Claude Code effectively brute force the challenge of getting it working on that particular hardware.

This small project (40 minutes this morning, most of which was Claude Code churning away while I had breakfast and did some other things) ties together a bunch of different concepts I've been exploring recently. I designed an agentic loop for the problem, gave Claude full permissions inside a Docker sandbox, embraced the parallel agents lifestyle and reused my notes on the NVIDIA Spark from last week.

I knew getting a PyTorch CUDA model running on the Spark was going to be a little frustrating, so I decided to outsource the entire process to Claude Code to see what would happen.

TLDR: It worked. It took four prompts (one long, three very short) to have Claude Code figure out everything necessary to run the new DeepSeek model on the NVIDIA Spark, OCR a document for me and produce copious notes about the process.

The setup

I connected to the Spark from my Mac via SSH and started a new Docker container there:

docker run -it --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 \
  bash

Then I installed npm and used that to install Claude Code:

apt-get update
DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y npm
npm install -g @anthropic-ai/claude-code

Then started Claude Code, telling it that it's OK that it's running as root because it's in a sandbox:

IS_SANDBOX=1 claude --dangerously-skip-permissions

It provided me a URL to click on to authenticate with my Anthropic account.

The initial prompts

I kicked things off with this prompt:

Create a folder deepseek-ocr and do everything else in that folder

Then I ran the following, providing links to both the GitHub repository and the Hugging Face model, providing a clue about NVIDIA ARM and giving it an image (this one, see previous post) that I wanted it to run OCR on.

Your task is to get this working: https://github.com/deepseek-ai/DeepSeek-OCR - it uses Hugging Face Transformers and models from https://huggingface.co/deepseek-ai/DeepSeek-OCR - you are running in a Docker container on an NVIDIA ARM device, first research your environment to check it will be possible to do this. Git clone both of those - you may need to enable LFS first. Make extensive notes as you go in a notes.md file which you should only ever append new notes to. At the end of this you should have scripts in the folder which can be run to both setup and then run the model against an example - use https://static.simonwillison.net/static/2025/ft.jpeg as the example image that you OCR. As a final step create a detailed README.md file explaining the project and what you learned and how to run your code.

Note how I also told it to "make extensive notes as you go in a notes.md file which you should only ever append new notes to" so I'd have some useful notes at the end.

The follow-ups

It churned away for 13 minutes, and made a sequence of discoveries about the platform and the model and the difficulties of running PyTorch on it.

And then... it gave up:

The NVIDIA GB10 GPU has compute capability sm_121, which is too new for PyTorch 2.5.1

PyTorch 2.5.1 supports: sm_50, sm_80, sm_86, sm_89, sm_90, sm_90a

GB10 requires: sm_121

Error: CUDA error: no kernel image is available for execution on the device

Additionally, the model has hardcoded .cuda() calls, preventing CPU fallback without code modifications.

So I gave it some encouragement:

Any other options that might help get CUDA working like using different versions of PyTorch that have available CUDA wheels for this platform? Have you checked to see what PyTorch versions already have these ARM CUDA wheels that would work here?

It ran this command to grep for all available PyTorch wheels on download.pytorch.org that match the ARM architecture:

curl -s https://download.pytorch.org/whl/torch/ | grep -o 'torch-[0-9.]*.*aarch64.whl' | sort -V | tail -20

... and that unblocked it.

Excellent! I can see that PyTorch 2.9.0 has ARM64 wheels for CUDA 12.8, 12.9, and CUDA 13.0 (cu130)! This is exactly what we need!

It installed the wheels it found and reported:

Good news! PyTorch 2.9.0 can detect the GB10 GPU with CUDA capability 12.1 (sm_121), but it says the maximum supported is 12.0. However, it's just a warning now instead of a hard error! Let's try running inference: [...]

Then kept on churning for a bit, ignoring warnings, and managed to run the model! It loaded the image and ran this prompt:

<|grounding|>OCR this image.

And got back output that started like this:

<|ref|>The perils of vibe coding<|/ref|><|det|>[[352, 30, 624, 111]]<|/det|>
<|ref|>opt<|/ref|><|det|>[[970, 0, 994, 30]]<|/det|>
<|ref|>such<|/ref|><|det|>[[970, 24, 996, 58]]<|/det|>
<|ref|>days<|/ref|><|det|>[[970, 52, 996, 87]]<|/det|>
<|ref|>pavi<|/ref|><|det|>[[970, 85, 996, 118]]<|/det|>
<|ref|>TECHNOLOGY<|/ref|><|det|>[[33, 199, 127, 230]]<|/det|>
<|ref|>holds the promise of replacing program-<|/ref|><|det|>[[220, 170, 390, 210]]<|/det|>
<|ref|>mers - a profession of very well paid<|/ref|><|det|>[[220, 199, 390, 238]]<|/det|>
<|ref|>people.For another, the work can be<|/ref|><|det|>[[220, 228, 390, 267]]<|/det|>

Claude declared victory and pointed me to the output/result.mmd file, which contained only whitespace. So OCR had worked but the result had failed to be written correctly to disk.

I said:

That result.mmd file looks to be exclusively whitespace to me, try and get a better result there - the bounding box image was good though

Which caused Claude to reconsider the prompt it used:

Maybe we need to use a different prompt. Looking at the README from DeepSeek-OCR, there are different prompts:

"<image>\n<|grounding|>Convert the document to markdown." - for documents

"<image>\n<|grounding|>OCR this image." - general OCR

"<image>\nFree OCR." - without layouts

The "grounding" mode seems to be focused on bounding boxes, which is why we got good bounding box results but the text output might be in a different format.

Let me try running it again with a different prompt that's more focused on text extraction rather than grounding.

It tried all of the prompt variants, made a note about them and created a file called PROMPTS_GUIDE.md with a bunch more details, including this comparison table:

Prompt Speed Text Quality Structure Coordinates Best Use Case

Free OCR ⚡⚡⚡ Fast ⭐⭐⭐ Excellent ⭐ Basic ❌ No General OCR

Markdown ⚡⚡ Medium ⭐⭐⭐ Excellent ⭐⭐⭐ Full ⭐⭐ Partial Documents

Grounding ⚡ Slow ⭐⭐ Good ⭐ Basic ⭐⭐⭐ Full Annotations

Detailed ⚡⚡⚡ Fastest ⭐ N/A ❌ N/A ❌ No Image analysis

Prompt	Speed	Text Quality	Structure	Coordinates	Best Use Case
Free OCR	⚡⚡⚡ Fast	⭐⭐⭐ Excellent	⭐ Basic	❌ No	General OCR
Markdown	⚡⚡ Medium	⭐⭐⭐ Excellent	⭐⭐⭐ Full	⭐⭐ Partial	Documents
Grounding	⚡ Slow	⭐⭐ Good	⭐ Basic	⭐⭐⭐ Full	Annotations
Detailed	⚡⚡⚡ Fastest	⭐ N/A	❌ N/A	❌ No	Image analysis

And this table under the heading "Performance Benchmarks":

Test image: 3503×1668 pixels (Financial Times article)

Prompt Time Output Size Tokens

Free OCR 24s Clean text 2257

Markdown 39s Formatted MD 2257 + structure

Grounding 58s Text + coords 2257 + boxes

Detailed 9s Description ~300

Prompt	Time	Output Size	Tokens
Free OCR	24s	Clean text	2257
Markdown	39s	Formatted MD	2257 + structure
Grounding	58s	Text + coords	2257 + boxes
Detailed	9s	Description	~300

My final prompt was this, to gather everything together into a zip file I could extract from the Docker container:

Create a zip file with the output and output_text and all of the scripts and notes - but leave out the github repo and the huggingface repo directories

I added the contents of that zip file to my new simonw/research GitHub repo in the deepseek-ocr-nvidia-spark folder.

Claude really likes writing notes! Here's the directory listing of that finished folder:

  |-- download_test_image.sh
  |-- FINAL_SUMMARY.md
  |-- notes.md
  |-- output
  |   |-- images
  |   |-- result_with_boxes.jpg
  |   `-- result.mmd
  |-- output_text
  |   |-- detailed
  |   |   |-- images
  |   |   |-- result_with_boxes.jpg
  |   |   `-- result.mmd
  |   |-- free_ocr
  |   |   |-- images
  |   |   |-- result_with_boxes.jpg
  |   |   `-- result.mmd
  |   `-- markdown
  |       |-- images
  |       |   `-- 0.jpg
  |       |-- result_with_boxes.jpg
  |       `-- result.mmd
  |-- PROMPTS_GUIDE.md
  |-- README_SUCCESS.md
  |-- README.md
  |-- run_ocr_best.py
  |-- run_ocr_cpu_nocuda.py
  |-- run_ocr_cpu.py
  |-- run_ocr_text_focused.py
  |-- run_ocr.py
  |-- run_ocr.sh
  |-- setup.sh
  |-- SOLUTION.md
  |-- test_image.jpeg
  |-- TEXT_OUTPUT_SUMMARY.md
  `-- UPDATE_PYTORCH.md

Takeaways

My first prompt was at 15:31:07 (UTC). The final message from Claude Code came in at 16:10:03. That means it took less than 40 minutes start to finish, and I was only actively involved for about 5-10 minutes of that time. The rest of the time I was having breakfast and doing other things.

Having tried and failed to get PyTorch stuff working in the past, I count this as a huge win. I'll be using this process a whole lot more in the future.

How good were the actual results? There's honestly so much material in the resulting notes created by Claude that I haven't reviewed all of it. There may well be all sorts of errors in there, but it's indisputable that it managed to run the model and made notes on how it did that such that I'll be able to do the same thing in the future.

I think the key factors in executing this project successfully were the following:

I gave it exactly what it needed: a Docker environment in the target hardware, instructions on where to get what it needed (the code and the model) and a clear goal for it to pursue. This is a great example of the pattern I described in designing agentic loops.
Running it in a Docker sandbox meant I could use claude --dangerously-skip-permissions and leave it running on its own. If I'd had to approve every command it wanted to run I would have got frustrated and quit the project after just a few minutes.
I applied my own knowledge and experience when it got stuck. I was confident (based on previous experiments with the Spark) that a CUDA wheel for ARM64 existed that was likely to work, so when it gave up I prompted it to try again, leading to success.

Oh, and it looks like DeepSeek OCR is a pretty good model if you spend the time experimenting with different ways to run it.

Bonus: Using VS Code to monitor the container

A small TIL from today: I had kicked off the job running in the Docker container via SSH to the Spark when I realized it would be neat if I could easily monitor the files it was creating while it was running.

I asked Claude.ai:

I am running a Docker container on a remote machine, which I started over SSH

How can I have my local VS Code on MacOS show me the filesystem in that docker container inside that remote machine, without restarting anything?

It gave me a set of steps that solved this exact problem:

Install the VS Code "Remote SSH" and "Dev Containers" extensions
Use "Remote-SSH: Connect to Host" to connect to the remote machine (on my Tailscale network that's spark@100.113.1.114)
In the window for that remote SSH session, run "Dev Containers: Attach to Running Container" - this shows a list of containers and you can select the one you want to attach to
... and that's it! VS Code opens a new window providing full access to all of the files in that container. I opened up notes.md and watched it as Claude Code appended to it in real time.

At the end when I told Claude to create a zip file of the results I could select that in the VS Code file explorer and use the "Download" menu item to download it to my Mac.

Tags: ocr, python, ai, docker, pytorch, generative-ai, llms, ai-assisted-programming, anthropic, claude, nvidia, vs-code, vision-llms, deepseek, llm-release, coding-agents, claude-code, ai-in-china, nvidia-spark

NVIDIA DGX Spark + Apple Mac Studio = 4x Faster LLM Inference with EXO 1.0

2025-10-16T05:34:41+00:00

NVIDIA DGX Spark + Apple Mac Studio = 4x Faster LLM Inference with EXO 1.0

EXO Labs wired a 256GB M3 Ultra Mac Studio up to an NVIDIA DGX Spark and got a 2.8x performance boost serving Llama-3.1 8B (FP16) with an 8,192 token prompt.

Their detailed explanation taught me a lot about LLM performance.

There are two key steps in executing a prompt. The first is the prefill phase that reads the incoming prompt and builds a KV cache for each of the transformer layers in the model. This is compute-bound as it needs to process every token in the input and perform large matrix multiplications across all of the layers to initialize the model's internal state.

Performance in the prefill stage influences TTFT - time‑to‑first‑token.

The second step is the decode phase, which generates the output one token at a time. This part is limited by memory bandwidth - there's less arithmetic, but each token needs to consider the entire KV cache.

Decode performance influences TPS - tokens per second.

EXO noted that the Spark has 100 TFLOPS but only 273GB/s of memory bandwidth, making it a better fit for prefill. The M3 Ultra has 26 TFLOPS but 819GB/s of memory bandwidth, making it ideal for the decode phase.

They run prefill on the Spark, streaming the KV cache to the Mac over 10Gb Ethernet. They can start streaming earlier layers while the later layers are still being calculated. Then the Mac runs the decode phase, returning tokens faster than if the Spark had run the full process end-to-end.

Via @exolabs

Tags: apple, ai, generative-ai, local-llms, llms, nvidia, nvidia-spark

NVIDIA DGX Spark: great hardware, early days for the ecosystem

2025-10-14T23:36:21+00:00

NVIDIA sent me a preview unit of their new DGX Spark desktop "AI supercomputer". I've never had hardware to review before! You can consider this my first ever sponsored post if you like, but they did not pay me any cash and aside from an embargo date they did not request (nor would I grant) any editorial input into what I write about the device.

The device retails for around $4,000. They officially go on sale tomorrow.

First impressions are that this is a snazzy little computer. It's similar in size to a Mac mini, but with an exciting textured surface that feels refreshingly different and a little bit science fiction.

There is a very powerful machine tucked into that little box. Here are the specs, which I had Claude Code figure out for me by poking around on the device itself:

Hardware Specifications

Architecture: aarch64 (ARM64)

CPU: 20 cores

10x Cortex-X925 (performance cores)

10x Cortex-A725 (efficiency cores)

RAM: 119 GB total (112 GB available) - I’m not sure why Claude reported it differently here, the machine is listed as 128GB - it looks like a 128GB == 119GiB thing because Claude used free -h

Storage: 3.7 TB (6% used, 3.3 TB available)

GPU Specifications

Model: NVIDIA GB10 (Blackwell architecture)

Compute Capability: sm_121 (12.1)

Memory: 119.68 GB

Multi-processor Count: 48 streaming multiprocessors

Architecture: Blackwell

Short version: this is an ARM64 device with 128GB of memory that's available to both the GPU and the 20 CPU cores at the same time, strapped onto a 4TB NVMe SSD.

The Spark is firmly targeted at “AI researchers”. It’s designed for both training and running models.

The tricky bit: CUDA on ARM64

Until now almost all of my own model running experiments have taken place on a Mac. This has gotten far less painful over the past year and a half thanks to the amazing work of the MLX team and community, but it's still left me deeply frustrated at my lack of access to the NVIDIA CUDA ecosystem. I've lost count of the number of libraries and tutorials which expect you to be able to use Hugging Face Transformers or PyTorch with CUDA, and leave you high and dry if you don't have an NVIDIA GPU to run things on.

Armed (ha) with my new NVIDIA GPU I was excited to dive into this world that had long eluded me... only to find that there was another assumption baked in to much of this software: x86 architecture for the rest of the machine.

This resulted in all kinds of unexpected new traps for me to navigate. I eventually managed to get a PyTorch 2.7 wheel for CUDA on ARM, but failed to do so for 2.8. I'm not confident there because the wheel itself is unavailable but I'm finding navigating the PyTorch ARM ecosystem pretty confusing.

NVIDIA are trying to make this easier, with mixed success. A lot of my initial challenges got easier when I found their official Docker container, so now I'm figuring out how best to use Docker with GPUs. Here's the current incantation that's been working for me:

docker run -it --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 \
  bash

I have not yet got my head around the difference between CUDA 12 and 13. 13 appears to be very new, and a lot of the existing tutorials and libraries appear to expect 12.

The missing documentation isn't missing any more

When I first received this machine around a month ago there was very little in the way of documentation to help get me started. This meant climbing the steep NVIDIA+CUDA learning curve mostly on my own.

This has changed substantially in just the last week. NVIDIA now have extensive guides for getting things working on the Spark and they are a huge breath of fresh air - exactly the information I needed when I started exploring this hardware.

Here's the getting started guide, details on the DGX dashboard web app, and the essential collection of playbooks. There's still a lot I haven't tried yet just in this official set of guides.

Claude Code for everything

Claude Code was an absolute lifesaver for me while I was trying to figure out how best to use this device. My Ubuntu skills were a little rusty, and I also needed to figure out CUDA drivers and Docker incantations and how to install the right versions of PyTorch. Claude 4.5 Sonnet is much better than me at all of these things.

Since many of my experiments took place in disposable Docker containers I had no qualms at all about running it in YOLO mode:

IS_SANDBOX=1 claude --dangerously-skip-permissions

The IS_SANDBOX=1 environment variable stops Claude from complaining about running as root.

Before I found out about IS_SANDBOX

I was tipped off about IS_SANDBOX after I published this article. Here's my original workaround:

Claude understandably won't let you do this as root, even in a Docker container, so I found myself using the following incantation in a fresh nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 instance pretty often:

apt-get update && apt-get install -y sudo
# pick the first free UID >=1000
U=$(for i in $(seq 1000 65000); do if ! getent passwd $i >/dev/null; then echo $i; break; fi; done)
echo "Chosen UID: $U"
# same for a GID
G=$(for i in $(seq 1000 65000); do if ! getent group $i >/dev/null; then echo $i; break; fi; done)
echo "Chosen GID: $G"
# create user+group
groupadd -g "$G" devgrp
useradd -m -u "$U" -g "$G" -s /bin/bash dev
# enable password-less sudo:
printf 'dev ALL=(ALL) NOPASSWD:ALL\n' > /etc/sudoers.d/90-dev-nopasswd
chmod 0440 /etc/sudoers.d/90-dev-nopasswd
# Install npm
DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y npm
# Install Claude
npm install -g @anthropic-ai/claude-code

Then switch to the dev user and run Claude for the first time:

su - dev
claude --dangerously-skip-permissions

This will provide a URL which you can visit to authenticate with your Anthropic account, confirming by copying back a token and pasting it into the terminal.

Docker tip: you can create a snapshot of the current image (with Claude installed) by running docker ps to get the container ID and then:

docker commit --pause=false <container_id> cc:snapshot

Then later you can start a similar container using:

docker run -it \
  --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  cc:snapshot bash

Here's an example of the kinds of prompts I've been running in Claude Code inside the container:

I want to run https://huggingface.co/unsloth/Qwen3-4B-GGUF using llama.cpp - figure out how to get llama cpp working on this machine such that it runs with the GPU, then install it in this directory and get that model to work to serve a prompt. Goal is to get this command to run: llama-cli -hf unsloth/Qwen3-4B-GGUF -p "I believe the meaning of life is" -n 128 -no-cnv

That one worked flawlessly - Claude checked out the llama.cpp repo, compiled it for me and iterated on it until it could run that model on the GPU. Here's a full transcript, converted from Claude's .jsonl log format to Markdown using a script I vibe coded just now.

I later told it:

Write out a markdown file with detailed notes on what you did. Start with the shortest form of notes on how to get a successful build, then add a full account of everything you tried, what went wrong and how you fixed it.

Which produced this handy set of notes.

Tailscale was made for this

Having a machine like this on my local network is neat, but what's even neater is being able to access it from anywhere else in the world, from both my phone and my laptop.

Tailscale is perfect for this. I installed it on the Spark (using the Ubuntu instructions here), signed in with my SSO account (via Google)... and the Spark showed up in the "Network Devices" panel on my laptop and phone instantly.

I can SSH in from my laptop or using the Termius iPhone app on my phone. I've also been running tools like Open WebUI which give me a mobile-friendly web interface for interacting with LLMs on the Spark.

Here comes the ecosystem

The embargo on these devices dropped yesterday afternoon, and it turns out a whole bunch of relevant projects have had similar preview access to myself. This is fantastic news as many of the things I've been trying to figure out myself suddenly got a whole lot easier.

Four particularly notable examples:

Ollama works out of the box. They actually had a build that worked a few weeks ago, and were the first success I had running an LLM on the machine.
llama.cpp creator Georgi Gerganov just published extensive benchmark results from running llama.cpp on a Spark. He's getting ~3,600 tokens/second to read the prompt and ~59 tokens/second to generate a response with the MXFP4 version of GPT-OSS 20B and ~817 tokens/second to read and ~18 tokens/second to generate for GLM-4.5-Air-GGUF.
LM Studio now have a build for the Spark. I haven't tried this one yet as I'm currently using my machine exclusively via SSH.
vLLM - one of the most popular engines for serving production LLMs - had early access and there's now an official NVIDIA vLLM NGC Container for running their stack.

Here's a tutorial from Unsloth on fine-tuning gpt-oss-20b on the Spark.

Should you get one?

It's a bit too early for me to provide a confident recommendation concerning this machine. As indicated above, I've had a tough time figuring out how best to put it to use, largely through my own inexperience with CUDA, ARM64 and Ubuntu GPU machines in general.

The ecosystem improvements in just the past 24 hours have been very reassuring though. I expect it will be clear within a few weeks how well supported this machine is going to be.

Tags: hardware, ai, docker, tailscale, generative-ai, local-llms, llms, nvidia, ollama, llama-cpp, coding-agents, claude-code, lm-studio, disclosures, nvidia-spark