<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: image-segmentation</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/image-segmentation.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-04-18T13:26:00+00:00</updated><author><name>Simon Willison</name></author><entry><title>Image segmentation using Gemini 2.5</title><link href="https://simonwillison.net/2025/Apr/18/gemini-image-segmentation/#atom-tag" rel="alternate"/><published>2025-04-18T13:26:00+00:00</published><updated>2025-04-18T13:26:00+00:00</updated><id>https://simonwillison.net/2025/Apr/18/gemini-image-segmentation/#atom-tag</id><summary type="html">
    &lt;p&gt;Max Woolf pointed out this new feature of the Gemini 2.5 series (here's my coverage of &lt;a href="https://simonwillison.net/2025/Mar/25/gemini/"&gt;2.5 Pro&lt;/a&gt; and &lt;a href="https://simonwillison.net/2025/Apr/17/start-building-with-gemini-25-flash/"&gt;2.5 Flash&lt;/a&gt;) in &lt;a href="https://news.ycombinator.com/item?id=43720845#43722227"&gt;a comment&lt;/a&gt; on Hacker News:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One hidden note from Gemini 2.5 Flash when diving deep into the documentation: for image inputs, not only can the model be instructed to generate 2D bounding boxes of relevant subjects, but it can also &lt;a href="https://ai.google.dev/gemini-api/docs/image-understanding#segmentation"&gt;create segmentation masks&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;At this price point with the Flash model, creating segmentation masks is pretty nifty.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I built a tool last year to &lt;a href="https://simonwillison.net/2024/Aug/26/gemini-bounding-box-visualization/"&gt;explore Gemini's bounding box abilities&lt;/a&gt;. This new segmentation mask feature represents a significant new capability!&lt;/p&gt;
&lt;p&gt;Here's my new tool to try it out: &lt;strong&gt;&lt;a href="https://tools.simonwillison.net/gemini-mask"&gt;Gemini API Image Mask Visualization&lt;/a&gt;&lt;/strong&gt;. As with my bounding box tool, it's browser-based JavaScript that talks to the Gemini API directly. You provide it with a &lt;a href="https://aistudio.google.com/app/apikey"&gt;Gemini API key&lt;/a&gt;, which isn't logged anywhere I can see it.&lt;/p&gt;
&lt;p&gt;This is what it can do:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/mask-tool.jpg" alt="Screenshot of mask tool. At the top is a select box to pick a model (currently using Gemini 2.5 Pro) and a prompt that reads: Give the segmentation masks for the pelicans. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key &amp;quot;box_2d&amp;quot; and the segmentation mask in key &amp;quot;mask&amp;quot;. Below that is JSON that came back - an array of objects. The mask keys are base64 encoded PNG data. Below that is the original image, then the image with masks overlaid and a coordinate system, then two columns showing each cropped image and mask next to each other." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Give it an image and a prompt of the form:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Give the segmentation masks for the objects. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key "box_2d" and the segmentation mask in key "mask".&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My tool then runs the prompt and displays the resulting JSON. The Gemini API returns segmentation masks as base64-encoded PNG images in strings that start with &lt;code&gt;data:image/png;base64,iVBOR...&lt;/code&gt;. The tool then visualizes those in a few different ways on the page, including overlaid on the original image.&lt;/p&gt;
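&lt;p&gt;As a hypothetical sketch of what handling one of those mask strings involves - strip the data URI prefix, base64-decode the PNG bytes, then read the mask dimensions from the PNG's IHDR chunk - here's the idea in Python (the tool itself does the equivalent in browser JavaScript):&lt;/p&gt;

```python
import base64
import struct

def decode_mask(data_uri):
    """Split a data:image/png;base64,... URI and return (png_bytes, width, height).

    In a valid PNG the 8-byte signature is followed by the IHDR chunk's
    length and type (8 bytes), so the big-endian width and height live
    at byte offsets 16-23."""
    prefix = "data:image/png;base64,"
    assert data_uri.startswith(prefix)
    png = base64.b64decode(data_uri[len(prefix):])
    width, height = struct.unpack(">II", png[16:24])
    return png, width, height
```

A mask decoded this way can then be drawn over the original image, which is what the overlay view in the tool does.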
&lt;p&gt;I &lt;a href="https://simonwillison.net/tags/vibe-coding/"&gt;vibe coded&lt;/a&gt; the whole thing using a combination of Claude and ChatGPT. I started with &lt;a href="https://claude.ai/share/2dd2802a-c8b4-4893-8b61-0861d4fcb0f1"&gt;a Claude Artifacts React prototype&lt;/a&gt;, then pasted the code from my old project into Claude and &lt;a href="https://claude.ai/share/9e42d82b-56c7-46c1-ad0c-fc67c3cad91f"&gt;hacked on that until I ran out of tokens&lt;/a&gt;. I transferred the incomplete result to a new Claude session where I &lt;a href="https://claude.ai/share/f820f361-5aa7-48b5-a96d-f0f8b11d3869"&gt;kept on iterating&lt;/a&gt; until it got stuck in a bug loop (the same bug kept coming back no matter how many times I told it to fix it)... so I switched over to o3 in ChatGPT &lt;a href="https://chatgpt.com/share/6801c8ad-18c8-8006-bdd8-447500eae33e"&gt;to finish it off&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/tools/blob/main/gemini-mask.html"&gt;the finished code&lt;/a&gt;. It's a total mess, but it's also less than 500 lines of code and the interface solves my problem in that it lets me explore the new Gemini capability.&lt;/p&gt;
&lt;p&gt;Segmenting my pelican photo via the Gemini API was &lt;em&gt;absurdly&lt;/em&gt; inexpensive. Using Gemini 2.5 Pro the call cost 303 input tokens and 353 output tokens, for a total cost of 0.2144 cents (less than a quarter of a cent). I ran it again with the new Gemini 2.5 Flash and it used 303 input tokens and 270 output tokens, for a total cost of 0.099 cents (less than a tenth of a cent). I calculated these prices using my &lt;a href="https://tools.simonwillison.net/llm-prices"&gt;LLM pricing calculator&lt;/a&gt; tool.&lt;/p&gt;

&lt;h4 id="gemini-2-5-flash-non-thinking"&gt;1/100th of a cent with Gemini 2.5 Flash non-thinking&lt;/h4&gt;
&lt;p&gt;Gemini 2.5 Flash has two pricing models. Input is a standard $0.15/million tokens, but the output charges differ a lot: in non-thinking mode output is $0.60/million, but if you have thinking enabled (the default) output is $3.50/million. I think of these as "Gemini 2.5 Flash" and "Gemini 2.5 Flash Thinking".&lt;/p&gt;
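&lt;p&gt;Those per-million-token prices make the cost figures in this post easy to check with a few lines of arithmetic:&lt;/p&gt;

```python
# Gemini 2.5 Flash prices as quoted above, in dollars per million tokens.
INPUT_PER_M = 0.15
OUTPUT_THINKING_PER_M = 3.50
OUTPUT_NON_THINKING_PER_M = 0.60

def cost_cents(input_tokens, output_tokens, output_per_m):
    """Return the cost of a single call in cents of a dollar."""
    dollars = (input_tokens * INPUT_PER_M + output_tokens * output_per_m) / 1_000_000
    return dollars * 100

# The thinking-mode segmentation call: 303 input tokens, 270 output tokens.
print(round(cost_cents(303, 270, OUTPUT_THINKING_PER_M), 3))  # -> 0.099
```

The same function with the non-thinking output price and 123 output tokens gives 0.0119 cents, matching the figure later in the post.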
&lt;p&gt;My initial experiments all used thinking mode. I decided to upgrade the tool to try non-thinking mode, but noticed that the API library it was using (&lt;a href="https://github.com/google-gemini/deprecated-generative-ai-js"&gt;google/generative-ai&lt;/a&gt;) is marked as deprecated.&lt;/p&gt;
&lt;p&gt;On a hunch, I pasted the code into &lt;a href="https://simonwillison.net/2025/Apr/16/introducing-openai-o3-and-o4-mini/"&gt;the new o4-mini-high model&lt;/a&gt; in ChatGPT and prompted it with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;This code needs to be upgraded to the new recommended JavaScript  library from Google. Figure out what that is and then look up enough documentation to port this code to it&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;o4-mini and o3 both have search tool access and claim to be good at combining different tool uses.&lt;/p&gt;
&lt;p&gt;This worked &lt;em&gt;extremely&lt;/em&gt; well! It ran a few searches and identified exactly what needed to change:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/o4-thinking.jpg" alt="Screenshot of AI assistant response about upgrading Google Gemini API code. Shows &amp;quot;Thought for 21 seconds&amp;quot; followed by web search results for &amp;quot;Google Gemini API JavaScript library recommended new library&amp;quot; with options including Google AI for Developers, GitHub, and Google for Developers. The assistant explains updating from GoogleGenerativeAI library to @google-ai/generative, with code samples showing: import { GoogleGenAI } from 'https://cdn.jsdelivr.net/npm/@google/genai@latest'; and const ai = new GoogleGenAI({ apiKey: getApiKey() });" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Then it gave me detailed instructions along with an updated snippet of code. Here's &lt;a href="https://chatgpt.com/share/68028f7b-11ac-8006-8150-00c4205a2507"&gt;the full transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I prompted for a few more changes, then had to tell it not to use TypeScript (since I like copying and pasting code directly out of the tool without needing to run my own build step). The &lt;a href="https://tools.simonwillison.net/gemini-mask"&gt;latest version&lt;/a&gt; has been rewritten by o4-mini for the new library, defaults to Gemini 2.5 Flash non-thinking and displays usage tokens after each prompt.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/mask-tool-non-thinking.jpg" alt="Screenshot of the new tool. Gemini 2.5 Flash non-thinking is selected. Same prompt as before. Input tokens: 303 • Output tokens: 123" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Segmenting my pelican photo in non-thinking mode cost me 303 input tokens and 123 output tokens - that's 0.0119 cents, just over 1/100th of a cent!&lt;/p&gt;

&lt;h4 id="but-this-looks-like-way-more-than-123-output-tokens"&gt;But this looks like way more than 123 output tokens&lt;/h4&gt;
&lt;p&gt;The JSON that's returned by the API looks &lt;em&gt;way&lt;/em&gt; too long to fit in just 123 tokens.&lt;/p&gt;
&lt;p&gt;My hunch is that there's an additional transformation layer here. I think the Gemini 2.5 models return a much more efficient token representation of the image masks, then the Gemini API layer converts those into base64-encoded PNG image strings.&lt;/p&gt;
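&lt;p&gt;A back-of-envelope estimate makes the mismatch obvious. Assuming a rough average of ~4 characters per token (a ballpark for English/JSON text, not an exact tokenizer count), 123 tokens covers about 500 characters - while a single base64 mask string in the response runs to thousands:&lt;/p&gt;

```python
# Rough sketch: could 123 output tokens plausibly contain the JSON the
# API returned? Assumes ~4 characters per token on average.
CHARS_PER_TOKEN = 4

def rough_token_estimate(text):
    """Very rough token count from character length."""
    return len(text) // CHARS_PER_TOKEN

# One base64 mask string in the response above is roughly 2,700 characters.
print(rough_token_estimate("x" * 2700))  # -> 675, far more than 123
```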
&lt;p&gt;We do have one clue here: last year DeepMind &lt;a href="https://simonwillison.net/2024/May/15/paligemma/"&gt;released PaliGemma&lt;/a&gt;, an open weights vision model that could generate segmentation masks on demand.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md#tokenizer"&gt;README for that model&lt;/a&gt; includes this note about how their tokenizer works:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;PaliGemma uses the Gemma tokenizer with 256,000 tokens, but we further extend its vocabulary with 1024 entries that represent coordinates in normalized image-space (&lt;code&gt;&amp;lt;loc0000&amp;gt;...&amp;lt;loc1023&amp;gt;&lt;/code&gt;), and another with 128 entries (&lt;code&gt;&amp;lt;seg000&amp;gt;...&amp;lt;seg127&amp;gt;&lt;/code&gt;) that are codewords used by a lightweight referring-expression segmentation vector-quantized variational auto-encoder (VQ-VAE) [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My guess is that Gemini 2.5 is using a similar approach.&lt;/p&gt;
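&lt;p&gt;To make the quoted tokenizer scheme concrete, here's a hypothetical sketch of how a PaliGemma-style &lt;code&gt;locXXXX&lt;/code&gt; coordinate token could map back to a pixel position - assuming the 1024 entries evenly quantize normalized image space; PaliGemma's actual decoder may differ in details:&lt;/p&gt;

```python
def loc_token_to_pixel(token, image_size):
    """Map a location token like 'loc0512' to a pixel coordinate,
    assuming 1024 loc entries evenly quantize normalized image space.
    A sketch of the idea, not PaliGemma's exact decoding."""
    index = int(token.removeprefix("loc"))  # 0..1023
    return round(index / 1023 * (image_size - 1))

# e.g. loc1023 maps to the last pixel of a 768px-wide image
print(loc_token_to_pixel("loc1023", 768))  # -> 767
```

Emitting a handful of loc/seg tokens like this would be dramatically cheaper than emitting base64 PNG text directly, which would explain the low output token counts.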

&lt;h4 id="bonus-image-segmentation-with-llm-and-a-schema"&gt;Bonus: Image segmentation with LLM and a schema&lt;/h4&gt;
&lt;p&gt;Since &lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/"&gt;my LLM CLI tool supports JSON schemas&lt;/a&gt;, we can use those to return the exact JSON shape we want for a given image.&lt;/p&gt;
&lt;p&gt;Here's an example using Gemini 2.5 Flash to return bounding boxes and segmentation masks for all of the objects in an image:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gemini-2.5-flash-preview-04-17 --schema &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{&lt;/span&gt;
&lt;span class="pl-s"&gt;  "type": "object",&lt;/span&gt;
&lt;span class="pl-s"&gt;  "properties": {&lt;/span&gt;
&lt;span class="pl-s"&gt;    "masks": {&lt;/span&gt;
&lt;span class="pl-s"&gt;      "type": "array",&lt;/span&gt;
&lt;span class="pl-s"&gt;      "items": {&lt;/span&gt;
&lt;span class="pl-s"&gt;        "type": "object",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "required": ["box_2d", "mask"],&lt;/span&gt;
&lt;span class="pl-s"&gt;        "properties": {&lt;/span&gt;
&lt;span class="pl-s"&gt;          "box_2d": {&lt;/span&gt;
&lt;span class="pl-s"&gt;            "type": "array",&lt;/span&gt;
&lt;span class="pl-s"&gt;            "items": {&lt;/span&gt;
&lt;span class="pl-s"&gt;              "type": "integer"&lt;/span&gt;
&lt;span class="pl-s"&gt;            }&lt;/span&gt;
&lt;span class="pl-s"&gt;          },&lt;/span&gt;
&lt;span class="pl-s"&gt;          "mask": {&lt;/span&gt;
&lt;span class="pl-s"&gt;            "type": "string"&lt;/span&gt;
&lt;span class="pl-s"&gt;          }&lt;/span&gt;
&lt;span class="pl-s"&gt;        }&lt;/span&gt;
&lt;span class="pl-s"&gt;      }&lt;/span&gt;
&lt;span class="pl-s"&gt;    }&lt;/span&gt;
&lt;span class="pl-s"&gt;  },&lt;/span&gt;
&lt;span class="pl-s"&gt;  "required": ["masks"]&lt;/span&gt;
&lt;span class="pl-s"&gt;}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -a https://static.simonwillison.net/static/2025/two-pelicans.jpg \
  -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Return bounding boxes and segmentation masks for all objects&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That returned:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"masks"&lt;/span&gt;: [
    {&lt;span class="pl-ent"&gt;"box_2d"&lt;/span&gt;: [&lt;span class="pl-c1"&gt;198&lt;/span&gt;, &lt;span class="pl-c1"&gt;508&lt;/span&gt;, &lt;span class="pl-c1"&gt;755&lt;/span&gt;, &lt;span class="pl-c1"&gt;929&lt;/span&gt;], &lt;span class="pl-ent"&gt;"mask"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAACfElEQVR42u3ZS27dMBAF0dr/pjsDBwlsB4ZjfZ7IqjvySMQ96EfRFJRSSimlXJX5E3V5o8L8O/L6GoL5Mvb+2wvMN2Lvv6/AfD8BuOvvKDBjBpj/j73/uNtvJDATgFlgDuXdY3TtVx+KuSzy+ksYzB2R138swdybBB6FMC+Lu/0TDOYJcbd/mcE8LfL69xLMY2Pvf4vBPD7q8lca/PhKZwuCHy+/xxgcWHiHn8KxFVffD46vte6eeM4q674Wzlpg1TfjaU9e9HRw4vOWPCGdOk8rnhJft5s8xeB179KHEJx6oDJfHnSH0i3KKpcJCUSQQAJdKl8uMHIA7ZX6Uh8W+rDSl6W+rAUQgLr/VQLTBLQFdAp4ZtGb/hO0Xggv/YWsAdhTIIAA3AAEEIAaAOQCAcgBCCAAt4AdgADcAATgBkAOQAPQAAQgBiAANwByAAKovxkAOQByAOQABOAGaAAaADUAAbgBCMANQABuAAJwAyAHQA5AAG4B5ADIAZADEIAbADkAcgACcAPU3w2AHIAA3ADIAeovF7ADIAcAtwDIBZALsET0ANcREIBbgADcACAXCEAOwOoABGACIICP7Y/uCywK8Psv5qgAawp8pnABvJOwAXz4MegAPu8GYwfA2T+Av9ugFuAN4dguyPoChwDYIwEEEIC6fwAEEIC7fwAByPsHEIAdgADk/QPQA2DvH0AAdgDs/QMIIAA5AAEEIAfA3j+AAAJw9w+AAAIIwA2QQAABdBRqBAIIoJNAAAEkEIC1//cFApALEIBbANQC7B57f+z9vxYAuQB2AewCdgACCMAtEIBdwA4AcgE7AAG4BZADgFoAadzt3wgo5b78AitLcVa+Qqb7AAAAAElFTkSuQmCC&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;},
    {"box_2d": [415, 95, 867, 547], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAADUklEQVR42u3d7W6rMBCE4bn/m54e6VRVpKoQBX/s+p39WVVm58EYQiiVUjXKhsc3V8A/BY9vBYCdPwDJv2SLZfMTAVbnr3ageTFAtZXGSwHqLbVeCVDwXOOFAO6Q38csNZ8CfPfnzkfa8/zjW+y1/8c32W//j22yY37P2lLZK6B5ADWP/7v8Pjz+bX4fffhvy+8qLl4D8Pegu+fFGoCLMcvn99z8uz8Ybc9ffQX0hG0kPyp/5fn/zgr4tOfrYd0j/wOBm0GPB7C96kJzav5Pu7wbdCuAPRtg/gJTG+B+9///He1ZCzwbwG/N/22TYX9+7T0eJgP48zohv10dYGpP9mkAyc/O75X5uwP4xPxeF7/mKfDtzjyiiuZ/ozGbDWB3EZjTmOEAgPxXrblR/hkArfLP+JzaKf6ED6qNwk8BaJX+abuT8he+E3rbabf8gu9/1dv/tb8LuOkVlt/98w+dAKbld+ez//D7tcnPOwD+frSVMgEMPwBeW4YDmJr/+1EWcH43u/cz67Zd8gMvATIBmufPChCAHAEBCEAAuPkDEIAABIANoADQAYQHUADoAIUIAhABuoDoAqILiC4QALqA6AKiC4guEAC6gOgCyhSAC0hwgQDQBUQXCABdQHSBAEQgAHCBANAFRBcIAF0gAAGAC4guQAeQ4AIBCABcIAB0gQDQBQIQgACwBQIQALgAHUABCABbIABwAQUADSCxASS2gNAAql54ANHzKzMgABEIQAACEIBcCAQAAfCvIS8FqLyrVwiUnugogMsGz89/2aPPB/CugsfPOxPy3hR4/Lw+LC+Qg8fPa0TzJl14fOed+vm/GvD4qwFcrwLAjr8SwOj8rlr0/GanXwJgowFsNoDZADYawEYD2GwAswFsNICNBrDRADYawB0LHn+cgPsWPP4IArcvdvpHAj6m6Pk/IniwqRMIHm2k/zx4OnxzgOeDt14PhozZdl0cNVDTk8O42dTzDDnwUGp5kbB/IWkDcOjNswpXElsFSlxK7hT4/TOTPki/9pxbyESBAORrpADki1QwQZ4lycNUXALsk/RL/5wAsJsrE6hMsdPvEFDBgsdfSKC6BY+/wED1Cx7/l8E4G51R8Pifaujsgse/QRCo4PFfJcYO9wWdFFckoSpT7wAAAABJRU5ErkJggg=="}
  ]
}&lt;/pre&gt;&lt;/div&gt;
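&lt;p&gt;To use those &lt;code&gt;box_2d&lt;/code&gt; values you need to map them back onto the image. Gemini's documented convention (worth verifying against your own images) is &lt;code&gt;[y_min, x_min, y_max, x_max]&lt;/code&gt; on a 0-1000 normalized grid, so the conversion is a simple rescale:&lt;/p&gt;

```python
def box_2d_to_pixels(box_2d, width, height):
    """Convert a Gemini box_2d ([y_min, x_min, y_max, x_max] on a 0-1000
    normalized grid, per Gemini's docs) into pixel coordinates
    (x0, y0, x1, y1) for an image of the given size."""
    y0, x0, y1, x1 = box_2d
    return (
        round(x0 / 1000 * width),
        round(y0 / 1000 * height),
        round(x1 / 1000 * width),
        round(y1 / 1000 * height),
    )

# First pelican above, on a hypothetical 1000x1000 source image:
print(box_2d_to_pixels([198, 508, 755, 929], 1000, 1000))  # -> (508, 198, 929, 755)
```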
&lt;p&gt;I &lt;a href="https://claude.ai/share/2dd2802a-c8b4-4893-8b61-0861d4fcb0f1"&gt;vibe coded a tool&lt;/a&gt; for visualizing that JSON - paste it into &lt;a href="https://tools.simonwillison.net/mask-visualizer"&gt;tools.simonwillison.net/mask-visualizer&lt;/a&gt; to see the results.&lt;/p&gt;
&lt;p&gt;I wasn't sure where the origin of the coordinate system was when I first built the tool, so I had Claude add buttons for switching it to see which one fit. Then I left the buttons in, because you can use them to make my pelican outlines flap around the page!
&lt;img src="https://static.simonwillison.net/static/2025/flap.gif" alt="Animated demo. Two pelican outlines are shown offset from each other - clicking the four different origin buttons causes them to move in relationship to each other." style="max-width: 100%;" /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/max-woolf"&gt;max-woolf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/image-segmentation"&gt;image-segmentation&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="tools"/><category term="ai"/><category term="max-woolf"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="gemini"/><category term="vision-llms"/><category term="llm-pricing"/><category term="vibe-coding"/><category term="image-segmentation"/></entry><entry><title>SAM 2: The next generation of Meta Segment Anything Model for videos and images</title><link href="https://simonwillison.net/2024/Jul/29/sam-2/#atom-tag" rel="alternate"/><published>2024-07-29T23:59:08+00:00</published><updated>2024-07-29T23:59:08+00:00</updated><id>https://simonwillison.net/2024/Jul/29/sam-2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ai.meta.com/blog/segment-anything-2/"&gt;SAM 2: The next generation of Meta Segment Anything Model for videos and images&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Segment Anything is Meta AI's model for image segmentation: for any image or frame of video it can identify which shapes on the image represent different "objects" - things like vehicles, people, animals, tools and more.&lt;/p&gt;
&lt;p&gt;SAM 2 "outperforms SAM on its 23 dataset zero-shot benchmark suite, while being six times faster". Notably, SAM 2 works with video where the original SAM only worked with still images. It's released under the Apache 2 license.&lt;/p&gt;
&lt;p&gt;The best way to understand SAM 2 is to try it out. Meta have a &lt;a href="https://sam2.metademolab.com/demo"&gt;web demo&lt;/a&gt; which worked for me in Chrome but not in Firefox. I uploaded a recent video of my brand new cactus tweezers (for removing detritus from my cacti without getting spiked) and selected the succulent and the tweezers as two different objects:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A video editing interface focused on object tracking. The main part of the screen displays a close-up photograph of a blue-gray succulent plant growing among dry leaves and forest floor debris. The plant is outlined in blue, indicating it has been selected as &amp;quot;Object 1&amp;quot; for tracking. On the left side of the interface, there are controls for selecting and editing objects. Two objects are listed: Object 1 (the succulent plant) and Object 2 (likely the yellow stem visible in the image). At the bottom of the screen is a video timeline showing thumbnail frames, with blue and yellow lines representing the tracked paths of Objects 1 and 2 respectively. The interface includes options to add or remove areas from the selected object, start over, and &amp;quot;Track objects&amp;quot; to follow the selected items throughout the video." src="https://static.simonwillison.net/static/2024/sam-ui.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Then I applied a "desaturate" filter to the background and exported this resulting video, with the background converted to black and white while the succulent and tweezers remained in full colour:&lt;/p&gt;
&lt;video poster="https://static.simonwillison.net/static/2024/cactus-tweezers-still.jpg" controls&gt;
  &lt;source src="https://static.simonwillison.net/static/2024/sam2-cactus-tweezers.mp4" type="video/mp4"&gt;
  Your browser does not support the video tag.
&lt;/video&gt;

&lt;p&gt;Also released today: the &lt;a href="https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/"&gt;full SAM 2 paper&lt;/a&gt;, the &lt;a href="https://ai.meta.com/datasets/segment-anything-video"&gt;SA-V dataset&lt;/a&gt; of "51K diverse videos and 643K spatio-temporal segmentation masks" and a &lt;a href="https://sam2.metademolab.com/dataset"&gt;Dataset explorer tool&lt;/a&gt; (again, not supported by Firefox) for poking around in that collection.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41104523"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/image-segmentation"&gt;image-segmentation&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="training-data"/><category term="meta"/><category term="image-segmentation"/></entry><entry><title>PaliGemma model README</title><link href="https://simonwillison.net/2024/May/15/paligemma/#atom-tag" rel="alternate"/><published>2024-05-15T21:16:36+00:00</published><updated>2024-05-15T21:16:36+00:00</updated><id>https://simonwillison.net/2024/May/15/paligemma/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md"&gt;PaliGemma model README&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;One of the more overlooked announcements from Google I/O yesterday was PaliGemma, an openly licensed VLM (Vision Language Model) in the Gemma family of models.&lt;/p&gt;
&lt;p&gt;The model accepts an image and a text prompt. It outputs text, but that text can include special tokens representing regions on the image. This means it can return both bounding boxes and fuzzier segment outlines of detected objects, behavior that can be triggered using a prompt such as "segment puffins".&lt;/p&gt;
&lt;p&gt;From &lt;a href="https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md#tokenizer"&gt;the README&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;PaliGemma uses the Gemma tokenizer with 256,000 tokens, but we further extend its vocabulary with 1024 entries that represent coordinates in normalized image-space (&lt;code&gt;&amp;lt;loc0000&amp;gt;...&amp;lt;loc1023&amp;gt;&lt;/code&gt;), and another with 128 entries (&lt;code&gt;&amp;lt;seg000&amp;gt;...&amp;lt;seg127&amp;gt;&lt;/code&gt;) that are codewords used by a lightweight referring-expression segmentation vector-quantized variational auto-encoder (VQ-VAE) [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can try it out &lt;a href="https://huggingface.co/spaces/google/paligemma"&gt;on Hugging Face&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's a 3B model, making it feasible to run on consumer hardware.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://blog.roboflow.com/paligemma-multimodal-vision/"&gt;Roboflow: PaliGemma: Open Source Multimodal Model by Google&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google-io"&gt;google-io&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/image-segmentation"&gt;image-segmentation&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="google-io"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="vision-llms"/><category term="gemma"/><category term="image-segmentation"/></entry></feed>