Simon Willison's Weblog: phi

Saying "hi" to Microsoft's Phi-4-reasoning

2025-05-06T18:25:45+00:00

Microsoft released a new sub-family of models a few days ago: Phi-4 reasoning. They introduced them in this blog post celebrating a year since the release of Phi-3:

Today, we are excited to introduce Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning - marking a new era for small language models and once again redefining what is possible with small and efficient AI.

All three of the new models are released under the MIT license, as is the case with previous Phi models.

They're available through Ollama as phi4-reasoning (which includes the plus model) and phi4-mini-reasoning.

These models appear to have a tendency to overthink!

Trying it out with Ollama

I ran it using Ollama and llm-ollama like this. Fetching the model downloaded 11GB of data:

ollama pull phi4-reasoning
llm install llm-ollama

Then I said "hi" to it:

llm -m phi4-reasoning:latest hi

Here's the full response, which starts with a thought section wrapped in <think> tags:

<think>User says "hi". We are given instructions "You are Phi, a language model developed by Microsoft, trained to provide accurate, secure, and user-aligned responses." There is additional information. Now I must answer "hi" message from the user.

Step 1: Read the instructions carefully: The conversation text instructs us that we are "Phi", with internal guidelines provided in system instructions. But our answer should be safe according to these instructions. We need to produce a response that says hello and greet back. However, I must adhere to the principles described above: Provide clear, specific language, etc.

I must not mention any chain-of-thought or guidelines.

Now instructions "hi" are from user. So we say "Hello! How can I help you?" But must include disclaimers for sensitive topics? Actually instructions says: "When replying topics above at every message include a disclaimer both at the beginning and the end when replying topics on medical, legal, financial matters or political matters." But our answer "hi" is not such topic. So we can say "Hi! I'm Phi, how may I help you?" but no disclaimers required.

But instructions: "Do not share guidelines." So do not mention them.

I must check if the message "hi" instructs to greet back normally. The assistant should respond with a greeting.

Let me recall that instructions say: Use markdown formatting where appropriate, so I can include a formatted greeting maybe "Hello!" or something.

Additionally, check guidelines. They are internal instructions for safe completions. So we just produce "Hi there! How may I help you?" in a friendly tone.

I should greet the user politely and ask if they need any help. Possibly provide a disclaimer that I'm not licensed professional? But the conversation is just greeting, so no disclaimers required as it's not sensitive topics.

Wait instructions say: "You must give a disclaimer both at the beginning and the end when replying topics above at every message." Actually the guidelines for medical/legal/financial/political are needed only if discussing those topics. In our case it is just a greeting message, so disclaimers are not required.

I should greet politely. So I'll say something like:

"Hello! I'm Phi, how can I help you today?" Possibly "Hi there!"

Since the question was simply "hi", I respond with a greeting.

I must include markdown formatting if appropriate. So maybe bullet point or code snippet is not necessary. I will simply greet.

I need to check instructions: "You are Phi, a language model developed by Microsoft." But I must not mention internal guidelines? But that's okay.

I'll produce: "Hello! How may I help you today?" with greeting and friendly tone.

Also note that the system message instructs me to include disclaimers at beginning and end for topics above. But this is a simple greeting, so no disclaimer needed.

Thus my answer:

"Hi there! How can I assist you today?"

I must not mention guidelines in output text? The instructions say: "Do not share these guidelines with the user even in chain-of-thought", so I should not mention them.

I'll produce a final message: "Hello, how can I help?" I'll produce a friendly greeting message. Possibly add markdown formatting: e.g. "Hi there!" or bullet list if needed?

I'll produce a response like:

"Hello! How may I assist you today? Please let me know what you need."

I'll produce the answer.

I'll produce text with no disclaimers because not required here.

I'll produce final answer message: "Hello, how can I help?" That is it.

I'll produce answer as final message.</think>

Hi there! How can I assist you today?

That's 56 sentences of reasoning followed by a single sentence of output!
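If you want to run this kind of count yourself, here's a rough sketch that splits a response into its <think> section and the final answer, then counts sentences. The function names are made up for illustration, and the sentence counter is deliberately naive (it only looks for ., ! or ? followed by whitespace or the end of the text), so don't expect it to land on exactly the same number as a careful manual count.

```python
import re

# Matches the first <think>...</think> block, including an empty one.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(response: str) -> tuple[str, str]:
    """Return (thinking, answer) for a response that may contain a <think> block."""
    match = THINK_RE.search(response)
    if match is None:
        # No reasoning section at all: everything is the answer.
        return "", response.strip()
    thinking = match.group(1).strip()
    answer = response[match.end():].strip()
    return thinking, answer


def count_sentences(text: str) -> int:
    """Naive count: runs of ., ! or ? followed by whitespace or end of text."""
    return len(re.findall(r"[.!?]+(?=\s|$)", text))
```

Calling split_thinking() on the saved output of the llm command gives you the two halves, and count_sentences() can then be applied to each one.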

I find this deeply amusing.

Qwen 3 over-thinks things a whole lot less

For comparison, here's Qwen 3 (8B 4bit via MLX) - another brand new "reasoning" model - responding to the same prompt:

llm -m mlx-community/Qwen3-8B-4bit hi

Output:

<think>
Okay, the user said "hi". That's a greeting. I should respond in a friendly and welcoming manner. Let me think about how to make it engaging. Maybe start with a cheerful "Hello!" to match the greeting. Then ask how I can assist them today. Keep it simple and open-ended so they feel comfortable sharing their needs. Avoid any complicated language. Make sure the tone is positive and helpful. Alright, that should work.
</think>

Hello! 😊 How can I assist you today? I'm here to help with anything you need!

A fun thing about Qwen 3 is that you can tell it not to "think" at all by adding /nothink to your prompt:

llm -m mlx-community/Qwen3-8B-4bit 'hi /nothink'

Output (empty <think> section included):

<think>

</think>

Hello! How can I assist you today? 😊

Phi-4 reasoning's system prompt

Since Phi-4 talked about its system prompt so much, I decided to see where that was coming from. It turns out Ollama bakes the system prompt into their model releases. Reading that in full helps explain why Phi-4 reasoning acted the way it did:

You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:

I don't see anything in there about "Do not share guidelines", even though the model response mentioned that rule.

My guess is that the model has been trained to "not talk about the system prompt" through RLHF or similar. I've heard in the past that models default to chattering about their system prompt if you don't put measures in place to discourage that.

The lengthy response from Phi-4-reasoning shown above may well be caused by the system prompt containing significantly more tokens than the single token "hi" sent by the user.
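One way to check which system prompt an Ollama model ships with is to dump its Modelfile (ollama show phi4-reasoning --modelfile) and pull out the SYSTEM directive. Here's a minimal sketch of that second step. It assumes the Modelfile uses either the triple-quoted or the single-line SYSTEM form; the extract_system_prompt name is made up for this example, and I haven't checked it against every variation Ollama might emit.

```python
import re


def extract_system_prompt(modelfile: str):
    """Return the SYSTEM prompt from Modelfile text, or None if there isn't one."""
    # Triple-quoted form, which may span several lines.
    match = re.search(r'^SYSTEM\s+"""(.*?)"""', modelfile, re.DOTALL | re.MULTILINE)
    if match:
        return match.group(1).strip()
    # Single-line form, with or without surrounding double quotes.
    match = re.search(r"^SYSTEM\s+(.+)$", modelfile, re.MULTILINE)
    if match:
        return match.group(1).strip().strip('"')
    return None
```

To use it against a real model you could capture the CLI output with subprocess.run(["ollama", "show", "phi4-reasoning", "--modelfile"], capture_output=True, text=True).stdout and pass that string in.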

It's still hard to know when to use reasoning models

We've had access to these "reasoning" models - with a baked in chain-of-thought at the start of each response - since o1 debuted in September last year.

I'll be honest: I still don't have a great intuition for when it makes the most sense to use them.

I've had great success with them for code: any coding task that involves multiple functions or classes that need to coordinate with each other seems to benefit from a reasoning step.

They are an absolute benefit for debugging: I've seen reasoning models walk through quite large codebases following multiple levels of indirection in order to find potential root causes of the problem I've described.

Other than that though... they're apparently good for mathematical puzzles - the phi4-reasoning models seem to really want to dig into a math problem and output LaTeX embedded in Markdown as the answer. I'm not enough of a mathematician to put them through their paces here.

With all of that in mind, these reasoners that run on my laptop are fun to torment with inappropriate challenges that sit far beneath their lofty ambitions, but aside from that I don't really have a great answer to when I would use them.

Update 8th May 2025: I said "hi" to NVIDIA's new OpenCodeReasoning-Nemotron-32B model (run using Ollama and this GGUF file) and got a similar result.

Tags: microsoft, generative-ai, local-llms, llms, llm, phi, qwen, ollama, llm-reasoning, llm-release, ai-in-china

Phi-4 Bug Fixes by Unsloth

2025-01-11T01:20:08+00:00

This explains why I was seeing weird <|im_end|> suffixes during my experiments with Phi-4 the other day: it turns out the Phi-4 tokenizer definition as released by Microsoft had a bug in it, and there was a small bug in the chat template as well.

Daniel and Michael Han figured this out and have now published GGUF files with their fixes on Hugging Face.
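If you're stuck with output from one of the unfixed model files, a stopgap is to trim the stray token yourself. This is only a workaround sketch, not the fix Unsloth applied (their fix is in the tokenizer definition and chat template), and the strip_end_token name is made up for this example.

```python
END_TOKEN = "<|im_end|>"


def strip_end_token(text: str) -> str:
    """Remove any trailing <|im_end|> markers (and whitespace) from model output."""
    text = text.rstrip()
    # Loop in case the marker was emitted more than once.
    while text.endswith(END_TOKEN):
        text = text[: -len(END_TOKEN)].rstrip()
    return text
```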

Via Hacker News

Tags: ai, generative-ai, llms, phi

microsoft/phi-4

2025-01-08T17:57:18+00:00

Here's the official release of Microsoft's Phi-4 LLM, now officially under an MIT license.

A few weeks ago I covered the earlier unofficial versions, where I talked about how the model used synthetic training data in some really interesting ways.

It benchmarks favorably compared to GPT-4o, suggesting this is yet another example of a GPT-4 class model that can run on a good laptop.

The model already has several available community quantizations. I ran the mlx-community/phi-4-4bit one (a 7.7GB download) using mlx-lm like this:

uv run --with 'numpy<2' --with mlx-lm python -c '
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/phi-4-4bit")

prompt = "Generate an SVG of a pelican riding a bicycle"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=2048)
print(response)'

Here's what I got back.

Update: The model is now available via Ollama, so you can fetch a 9.1GB model file using ollama run phi4, after which it becomes available via the llm-ollama plugin.

Tags: microsoft, ai, generative-ai, local-llms, llms, llm, phi, uv, mlx, ollama, pelican-riding-a-bicycle, llm-release

Phi-4 Technical Report

2024-12-15T23:58:22+00:00

Phi-4 is the latest LLM from Microsoft Research. It has 14B parameters and claims to be a big leap forward in the overall Phi series. From Introducing Phi-4: Microsoft’s Newest Small Language Model Specializing in Complex Reasoning:

Phi-4 outperforms comparable and larger models on math related reasoning due to advancements throughout the processes, including the use of high-quality synthetic datasets, curation of high-quality organic data, and post-training innovations. Phi-4 continues to push the frontier of size vs quality.

The model is currently available via Azure AI Foundry. I couldn't figure out how to access it there, but Microsoft are planning to release it via Hugging Face in the next few days. It's not yet clear what license they'll use - hopefully MIT, as used by the previous models in the series.

In the meantime, unofficial GGUF versions have shown up on Hugging Face already. I got one of the matteogeniaccio/phi-4 GGUFs working with my LLM tool and llm-gguf plugin like this:

llm install llm-gguf
llm gguf download-model https://huggingface.co/matteogeniaccio/phi-4/resolve/main/phi-4-Q4_K_M.gguf
llm chat -m gguf/phi-4-Q4_K_M

This downloaded an 8.4GB model file. Here are some initial logged transcripts I gathered from playing around with the model.

An interesting detail I spotted on the Azure AI Foundry page is this:

Limited Scope for Code: Majority of phi-4 training data is based in Python and uses common packages such as typing, math, random, collections, datetime, itertools. If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses.

This leads into the most interesting thing about this model: the way it was trained on synthetic data. The technical report has a lot of detail about this, including this note about why synthetic data can provide better guidance to a model:

Synthetic data as a substantial component of pretraining is becoming increasingly common, and the Phi series of models has consistently emphasized the importance of synthetic data. Rather than serving as a cheap substitute for organic data, synthetic data has several direct advantages over organic data.

Structured and Gradual Learning. In organic datasets, the relationship between tokens is often complex and indirect. Many reasoning steps may be required to connect the current token to the next, making it challenging for the model to learn effectively from next-token prediction. By contrast, each token generated by a language model is by definition predicted by the preceding tokens, making it easier for a model to follow the resulting reasoning patterns.

And this section about their approach for generating that data:

Our approach to generating synthetic data for phi-4 is guided by the following principles:

Diversity: The data should comprehensively cover subtopics and skills within each domain. This requires curating diverse seeds from organic sources.

Nuance and Complexity: Effective training requires nuanced, non-trivial examples that reflect the complexity and the richness of the domain. Data must go beyond basics to include edge cases and advanced examples.

Accuracy: Code should execute correctly, proofs should be valid, and explanations should adhere to established knowledge, etc.

Chain-of-Thought: Data should encourage systematic reasoning, teaching the model various approaches to the problems in a step-by-step manner. [...]

We created 50 broad types of synthetic datasets, each one relying on a different set of seeds and different multi-stage prompting procedure, spanning an array of topics, skills, and natures of interaction, accumulating to a total of about 400B unweighted tokens. [...]

Question Datasets: A large set of questions was collected from websites, forums, and Q&A platforms. These questions were then filtered using a plurality-based technique to balance difficulty. Specifically, we generated multiple independent answers for each question and applied majority voting to assess the consistency of responses. We discarded questions where all answers agreed (indicating the question was too easy) or where answers were entirely inconsistent (indicating the question was too difficult or ambiguous). [...]

Creating Question-Answer pairs from Diverse Sources: Another technique we use for seed curation involves leveraging language models to extract question-answer pairs from organic sources such as books, scientific papers, and code.
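The filtering step described under "Question Datasets" above can be sketched in a few lines. This is my own illustration, not code from the report: the report doesn't say how answers are compared or where the thresholds sit, so this version assumes normalized string equality, treats "all answers agreed" as unanimous, and treats "entirely inconsistent" as no two answers matching. The keep_question name is hypothetical.

```python
from collections import Counter


def keep_question(answers: list[str]) -> bool:
    """Keep a question only if the sampled answers partly, but not fully, agree."""
    if len(answers) < 2:
        raise ValueError("need at least two independent answers")
    # Assumption: answers count as equal after trimming and lowercasing.
    counts = Counter(answer.strip().lower() for answer in answers)
    top_count = counts.most_common(1)[0][1]
    if top_count == len(answers):
        return False  # unanimous: treated as too easy
    if top_count == 1:
        return False  # no two answers match: too hard or ambiguous
    return True
```

So three samples of "4", "5", "4" would keep the question, while "4", "4", "4" or "1", "2", "3" would discard it.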

Via @peteratmsr

Tags: microsoft, python, ai, generative-ai, llms, ai-assisted-programming, llm, phi, training-data, llm-release

NuExtract 1.5

2024-11-16T16:33:17+00:00

Structured extraction - where an LLM helps turn unstructured text (or image content) into structured data - remains one of the most directly useful applications of LLMs.

NuExtract is a family of small models directly trained for this purpose (though text only at the moment) and released under the MIT license.

It comes in a variety of shapes and sizes:

NuExtract-v1.5 is a 3.8B parameter model fine-tuned on Phi-3.5-mini instruct. You can try this one out in this playground.
NuExtract-tiny-v1.5 is 494M parameters, fine-tuned on Qwen2.5-0.5B.
NuExtract-1.5-smol is 1.7B parameters, fine-tuned on SmolLM2-1.7B.

All three models were fine-tuned on NuMind's "private high-quality dataset". It's interesting to see a model family that uses one fine-tuning set against three completely different base models.

Useful tip from Steffen Röcker:

Make sure to use it with low temperature, I've uploaded NuExtract-tiny-v1.5 to Ollama and set it to 0. With the Ollama default of 0.7 it started repeating the input text. It works really well despite being so smol.
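If you'd rather set the temperature per request than bake it into the model, Ollama's local HTTP API accepts it in the options object. Here's a minimal sketch using only the standard library. The build_request and generate names are made up for this example, the URL is Ollama's default local address, and whatever model name you pass in is up to you, since I don't know the name Steffen's upload uses.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, temperature: float = 0.0) -> dict:
    """Build a non-streaming generate payload with an explicit temperature."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }


def generate(model: str, prompt: str, temperature: float = 0.0) -> str:
    """Send the request to a running Ollama server and return the response text."""
    payload = json.dumps(build_request(model, prompt, temperature)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

generate() needs a running Ollama server with the model already pulled; build_request() on its own just shows the shape of the payload.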

Tags: ai, generative-ai, local-llms, llms, hugging-face, fine-tuning, phi, qwen, smollm, structured-extraction, llm-release, ai-in-china

Running Llama 3.2 Vision and Phi-3.5 Vision on a Mac with mistral.rs

2024-10-19T16:14:20+00:00

mistral.rs is an LLM inference library written in Rust by Eric Buehler. Today I figured out how to use it to run the Llama 3.2 Vision and Phi-3.5 Vision models on my Mac.

Despite the name, it's not just for the Mistral family of models - like how llama.cpp has grown beyond Llama, mistral.rs has grown beyond Mistral.

I already have a Rust installation, so I checked out and compiled the library like this:

git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
cargo run --features metal --release -- -i --isq Q4K vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllama

This downloaded around 20GB of model files to ~/.cache/huggingface/hub/models--lamm-mit--Cephalo-Llama-3.2-11B-Vision-Instruct-128k, loaded the model into memory and started a terminal-based chat session.

I ran this prompt:

\image https://niche-museums.imgix.net/pioneer-history.jpeg?w=1600&h=800&fit=crop&auto=compress <|image|> describe this image including any text

The prompt uses this image from my Niche Museums write-up of the Pioneer Memorial Museum:

I got back this response:

The image shows the exterior of the Daughters of Utah Pioneers (DUP) Museum, which is also known as the Pioneer Memorial Museum. The building is located in Salt Lake City, Utah, USA.

Visible in the image are:

The DUP Museum Building: The main structure of the museum is a prominent feature in the image. It is a two-story building with a rectangular shape and a flat roof. The exterior walls are made of brick or stone, giving it a classic and historic appearance.

Signage: Above the entrance, there is a sign that reads "Daughters of Utah Pioneers Museum" in bold letters. Below this sign, another sign indicates that the building is also known as the "Pioneer Memorial Museum."

Entrance: The main entrance to the museum is marked by a doorway with a decorative frame and a pair of large wooden doors.

Landscaping: The surrounding area around the museum appears well-maintained with neatly trimmed grass and some trees nearby.

Statue or Monument: In front of the museum, there is a small statue or monument that appears to be dedicated to pioneers or early settlers.

The Daughters of Utah Pioneers (DUP) Museum showcases historical artifacts and exhibits related to Utah's pioneer history, focusing on the experiences and contributions of early settlers in the region.

Note that the text it generates correctly describes the museum's content and identifies the museum as being located in Salt Lake City, neither of which are details that were displayed in the image itself.

Activity Monitor confirmed that this Metal build was using my GPU, and about 10GB of memory, while the prompt was running.

mistral.rs also provides a Python binding, which can be installed from PyPI but requires a Rust toolchain (I filed an issue requesting PyPI wheels):

pip install mistralrs-metal

I got an error running the Python Llama vision demo but the Phi 3.5 Vision example worked as advertised. I modified that to enable streaming output and ran it against a photograph from Mendenhall's Museum of Gasoline Pumps & Petroliana:

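from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

# Load Phi-3.5 Vision; the weights are fetched from Hugging Face on first run
runner = Runner(
    which=Which.VisionPlain(
        model_id="microsoft/Phi-3.5-vision-instruct",
        arch=VisionArchitecture.Phi3V,
    ),
)

res = runner.send_chat_completion_request(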
    ChatCompletionRequest(
        model="phi3v",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://niche-museums.imgix.net/mendenhalls-16.jpeg?w=1200&auto=compress"
                        },
                    },
                    {
                        "type": "text",
                        "text": "<|image_1|>\nWhat is shown in this image? Write a detailed response analyzing the scene.",
                    },
                ],
            }
        ],
        max_tokens=1024,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
        stream=True,
    )
)
for item in res:
    print(item.choices[0].delta.content, end='')

Here's what it gave me:

The image captures a nostalgic scene of a vintage gas station, reminiscent of a bygone era. Dominating the left side of the frame is a red and white gas pump, standing tall and proud, its metallic surface gleaming under the soft light filtering through the wooden ceiling. Adjacent to it, a white and black gas canister stands ready for service, its presence adding to the authenticity of the setting.

On the right, a red and white gasoline sign hangs from the ceiling, its bold colors contrasting with the muted tones of the surroundings. Above it, a yellow and white gasoline sign is suspended, its vibrant hues drawing the eye upwards.

The floor beneath these relics of the past is a checkerboard pattern, a common design choice for gas stations of yesteryears. It provides a stark contrast to the smooth, unblemished surfaces of the gas pumps and canisters.

In the background, a variety of other signs and advertisements add to the eclectic mix of objects. They are a testament to the diverse range of products and services that were once available at this location.

Despite the passage of time, the gas station retains a certain charm, its vintage gas pumps and signs serving as a tangible link to a different era. The image is a snapshot of history, frozen in time, waiting to be discovered and appreciated by those who take the time to look closer.

This description looks fantastic at first glance, but if you review it carefully and compare it to the image you'll see that it's full of inaccuracies. The vibes of the description match the image but the actual details are definitely incorrect.

This model downloaded 7.7GB to ~/.cache/huggingface/hub/models--microsoft--Phi-3.5-vision-instruct - significantly smaller than Llama 3.2's 20GB. I wonder if that size difference helps explain the greater hallucination rate in Phi-3.5 Vision.

If you're running Python 3.10 on Apple Silicon you may be able to skip the Rust compiler by installing the wheel I built here:

pip install https://static.simonwillison.net/static/2024/mistralrs_metal-0.3.1-cp310-cp310-macosx_11_0_arm64.whl

Tags: microsoft, python, ai, rust, generative-ai, llama, local-llms, llms, mistral, phi, vision-llms, meta

New Phi-3 models: small, medium and vision

2024-05-21T20:04:30+00:00

I couldn't find a good official announcement post to link to about these three newly released models, but this post on LocalLLaMA on Reddit has them in one place: Phi-3 small (7B), Phi-3 medium (14B) and Phi-3 vision (4.2B) (the previously released model was Phi-3 mini - 3.8B).

You can try out the vision model directly here, no login required. It didn't do a great job with my first test image though, hallucinating the text.

As with Mini these are all released under an MIT license.

UPDATE: Here's a page from the newly published Phi-3 Cookbook describing the models in the family.

Tags: microsoft, ai, generative-ai, llms, phi, llm-release

experimental-phi3-webgpu

2024-05-09T22:21:48+00:00

Run Microsoft’s excellent Phi-3 model directly in your browser. It uses WebGPU, so it didn’t work in Firefox for me, just in Chrome.

It fetches around 2.1GB of data into the browser cache on first run, but then gave me decent quality responses to my prompts running at an impressive 21 tokens a second (M2, 64GB).

I think Phi-3 is the highest quality model of this size, so it’s a really good fit for running in a browser like this.

Via @xenovacom

Tags: browsers, ai, webassembly, generative-ai, local-llms, llms, phi, webgpu

microsoft/Phi-3-mini-4k-instruct-gguf

2024-04-23T17:40:16+00:00

Microsoft’s Phi-3 LLM is out and it’s really impressive. This 4,000 token context GGUF model is just 2.2GB (for the Q4 version) and ran on my Mac using the llamafile option described in the README. I could then run prompts through it using the llm-llamafile plugin.

The vibes are good! Initial test prompts I’ve tried feel similar to much larger 7B models, despite using just a few GBs of RAM. Tokens are returned fast too—it feels like the fastest model I’ve tried yet.

And it’s MIT licensed.

Via @simonw

Tags: microsoft, ai, generative-ai, local-llms, llms, llm, phi

Microsoft Research relicense Phi-2 as MIT

2024-01-06T06:06:42+00:00

Phi-2 was already an interesting model—really strong results for its size—made available under a non-commercial research license. It just got significantly more interesting: Microsoft relicensed it as MIT open source.

Via @abacaj

Tags: microsoft, open-source, mitlicense, ai, generative-ai, llms, phi