Simon Willison's Weblog: user-agents

ChatGPT agent's user-agent

2025-08-04T22:49:25+00:00

I was exploring how ChatGPT agent works today. I learned some interesting things about how it exposes its identity through HTTP headers, then made a huge blunder in thinking it was leaking its URLs to Bingbot and Yandex... but it turned out that was a Cloudflare feature that had nothing to do with ChatGPT.

ChatGPT agent is the recently released (and confusingly named) ChatGPT feature that provides browser automation combined with terminal access as a feature of ChatGPT - replacing their previous Operator research preview which is scheduled for deprecation on August 31st.

Investigating ChatGPT agent's user-agent

I decided to dig into how it works by creating a logged web URL endpoint using django-http-debug. Then I told ChatGPT agent mode to explore that new page:

My logging captured these request headers:

Via: 1.1 heroku-router
Host: simonwillison.net
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Cf-Ray: 96a0f289adcb8e8e-SEA
Cookie: cf_clearance=zzV8W...
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Priority: u=0, i
Sec-Ch-Ua: "Not)A;Brand";v="8", "Chromium";v="138"
Signature: sig1=:1AxfqHocTf693inKKMQ7NRoHoWAZ9d/vY4D/FO0+MqdFBy0HEH3ZIRv1c3hyiTrzCvquqDC8eYl1ojcPYOSpCQ==:
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36
Cf-Ipcountry: US
X-Request-Id: 45ef5be4-ead3-99d5-f018-13c4a55864d3
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Accept-Encoding: gzip, br
Accept-Language: en-US,en;q=0.9
Signature-Agent: "https://chatgpt.com"
Signature-Input: sig1=("@authority" "@method" "@path" "signature-agent");created=1754340838;keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754344438;nonce="_8jbGwfLcgt_vUeiZQdWvfyIeh9FmlthEXElL-O2Rq5zydBYWivw4R3sV9PV-zGwZ2OEGr3T2Pmeo2NzmboMeQ";tag="web-bot-auth";alg="ed25519"
X-Forwarded-For: 2a09:bac5:665f:1541::21e:154, 172.71.147.183
X-Request-Start: 1754340840059
Cf-Connecting-Ip: 2a09:bac5:665f:1541::21e:154
Sec-Ch-Ua-Mobile: ?0
X-Forwarded-Port: 80
X-Forwarded-Proto: http
Sec-Ch-Ua-Platform: "Linux"
Upgrade-Insecure-Requests: 1

That Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 user-agent header is the one used by the most recent Chrome on macOS - which is a little odd here as the Sec-Ch-Ua-Platform : "Linux" indicates that the agent browser runs on Linux.

At first glance it looks like ChatGPT is being dishonest here by not including its bot identity in the user-agent header. I thought for a moment it might be reflecting my own user-agent, but I'm using Firefox on macOS and it identified itself as Chrome.

Then I spotted this header:

Signature-Agent: "https://chatgpt.com"

Which is accompanied by a much more complex header called Signature-Input:

Signature-Input: sig1=("@authority" "@method" "@path" "signature-agent");created=1754340838;keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754344438;nonce="_8jbGwfLcgt_vUeiZQdWvfyIeh9FmlthEXElL-O2Rq5zydBYWivw4R3sV9PV-zGwZ2OEGr3T2Pmeo2NzmboMeQ";tag="web-bot-auth";alg="ed25519"

And a Signature header too.

These turn out to come from a relatively new web standard: RFC 9421 HTTP Message Signatures' published February 2024.

The purpose of HTTP Message Signatures is to allow clients to include signed data about their request in a way that cannot be tampered with by intermediaries. The signature uses a public key that's provided by the following well-known endpoint:

https://chatgpt.com/.well-known/http-message-signatures-directory

Add it all together and we now have a rock-solid way to identify traffic from ChatGPT agent: look for the Signature-Agent: "https://chatgpt.com" header and confirm its value by checking the signature in the Signature-Input and Signature headers.

And then came Bingbot and Yandex

Just over a minute after it captured that request, my logging endpoint got another request:

Via: 1.1 heroku-router
From: bingbot(at)microsoft.com
Host: simonwillison.net
Accept: */*
Cf-Ray: 96a0f4671d1fc3c6-SEA
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36
Cf-Ipcountry: US
X-Request-Id: 6214f5dc-a4ea-5390-1beb-f2d26eac5d01
Accept-Encoding: gzip, br
X-Forwarded-For: 207.46.13.9, 172.71.150.252
X-Request-Start: 1754340916429
Cf-Connecting-Ip: 207.46.13.9
X-Forwarded-Port: 80
X-Forwarded-Proto: http

I pasted 207.46.13.9 into Microsoft's Verify Bingbot tool (after solving a particularly taxing CAPTCHA) and it confirmed that this was indeed a request from Bingbot.

I set up a second URL to confirm... and this time got a visit from Yandex!

Via: 1.1 heroku-router
From: support@search.yandex.ru
Host: simonwillison.net
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Cf-Ray: 96a16390d8f6f3a7-DME
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Cf-Ipcountry: RU
X-Request-Id: 3cdcbdba-f629-0d29-b453-61644da43c6c
Accept-Encoding: gzip, br
X-Forwarded-For: 213.180.203.138, 172.71.184.65
X-Request-Start: 1754345469921
Cf-Connecting-Ip: 213.180.203.138
X-Forwarded-Port: 80
X-Forwarded-Proto: http

Yandex suggest a reverse DNS lookup to verify, so I ran this command:

dig -x 213.180.203.138 +short

And got back:

213-180-203-138.spider.yandex.com.

Which confirms that this is indeed a Yandex crawler.

I tried a third experiment to be sure... and got hits from both Bingbot and YandexBot.

It was Cloudflare Crawler Hints, not ChatGPT

So I wrote up and posted about my discovery... and Jatan Loya asked:

do you have crawler hints enabled in cf?

And yeah, it turned out I did. I spotted this in my caching configuration page (and it looks like I must have turned it on myself at some point in the past):

Here's the Cloudflare documentation for that feature.

I deleted my posts on Twitter and Bluesky (since you can't edit those and I didn't want the misinformation to continue to spread) and edited my post on Mastodon, then updated this entry with the real reason this had happened.

I also changed the URL of this entry as it turned out Twitter and Bluesky were caching my social media preview for the previous one, which included the incorrect information in the title.

Original "So what's going on here?" section from my post

Here's a section of my original post with my theories about what was going on before learning about Cloudflare Crawler Hints.

So what's going on here?

There are quite a few different moving parts here.

I'm using Firefox on macOS with the 1Password and Readwise Highlighter extensions installed and active. Since I didn't visit the debug pages at all with my own browser I don't think any of these are relevant to these results.
ChatGPT agent makes just a single request to my debug URL ...
... which is proxied through both Cloudflare and Heroku.
Within about a minute, I get hits from one or both of Bingbot and Yandex.

Presumably ChatGPT agent itself is running behind at least one proxy - I would expect OpenAI to keep a close eye on that traffic to ensure it doesn't get abused.

I'm guessing that infrastructure is hosted by Microsoft Azure. The OpenAI Sub-processor List - though that lists Microsoft Corporation, CoreWeave Inc, Oracle Cloud Platform and Google Cloud Platform under the "Cloud infrastructure" section so it could be any of those.

Since the page is served over HTTPS my guess is that any intermediary proxies should be unable to see the path component of the URL, making the mystery of how Bingbot and Yandex saw the URL even more intriguing.

Tags: bing, privacy, search-engines, user-agents, ai, cloudflare, generative-ai, chatgpt, llms, browser-agents, retractions

Claude feature drop

2025-05-02T18:45:57+00:00

It's not in their release notes yet but Anthropic pushed some big new features today. Alex Albert:

We've improved web search and rolled it out worldwide to all paid plans. Web search now combines light Research functionality, allowing Claude to automatically adjust search depth based on your question.

Anthropic announced Claude Research a few weeks ago as a product that can combine web search with search against your private Google Workspace - I'm not clear on how much of that product we get in this "light Research" functionality.

I'm most excited about this detail:

You can also drop a web link in any chat and Claude will fetch the content for you.

In my experiments so far the user-agent it uses is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-User/1.0; +Claude-User@anthropic.com). It appears to obey robots.txt.

Tags: user-agents, ai, generative-ai, llms, anthropic, claude, alex-albert, llm-tool-use, deep-research

Running gunicorn behind nginx on Heroku for buffering and logging

2017-10-02T01:57:20+00:00

Heroku's default setup for Django uses the gunicorn application server. Each Heroku dyno can only run a limited number of gunicorn workers, which means a limited number of requests can be served in parallel (around 4 per dyno is a good rule of thumb).

Where things get nasty is when you have devices on slow connections - like mobile phones. Heroku's router buffers headers but it does not buffer response bodies, so a slow device could hold up a gunicorn worker for several seconds. Too many slow devices at once and the site will become unavailable to other users.

This issue is explained and discussed here: Don't use Gunicorn to host your Django sites on Heroku

That article recommends using waitress as an alternative to gunicorn, but in the comments at the bottom of the article people suggest using a Heroku nginx-buildpack as an alternative.

Here is a slightly out-of-date tutorial on getting this all set up: https://koed00.github.io/Heroku_setups/

I used the following commands to set up the buildpacks:

heroku stack:set cedar-14
heroku buildpacks:clear
heroku buildpacks:add https://github.com/beanieboi/nginx-buildpack.git
heroku buildpacks:add https://github.com/heroku/heroku-buildpack-python.git

Unfortunately the nginx buildpack is not yet compatible with the new heroku-16 stack, so until the nginx buildpack has been updated it's necessary to run the application on the older cedar-14 stack. See this discussion for details: ryandotsmith/nginx-buildpack#68

Adding nginx in this way also gives us the opportunity to fix another limitation of Heroku: its default logging configuration. By default, log lines produced by Heroku (visible using heroku logs --tail or with a logging addon such as Papertrail) look like this:

    Oct 01 18:01:06 simonwillisonblog heroku/router: at=info
        method=GET path="/2017/Oct/1/ship/" host=simonwillison.net
        request_id=bb22f67e-6924-4e81-b6ad-74d1f465cda7
        fwd="2001:8003:74c5:8b00:79e4:80ed:fa85:7b37,108.162.249.198"
        dyno=web.1 connect=0ms service=338ms status=200 bytes=4523 protocol=http

Notably missing here is both the user-agent string and the referrer header sent by the browser! If you're a fan of tailing log files these omissions are pretty disappointing.

The nginx buildback I'm using loads a default configuration file at config/nginx.conf.erb. By including my own copy of this file I can override the original and define my own custom log format.

Having applied this change, the new log lines look like this:

    2017-10-02T01:44:38.762845+00:00 app[web.1]:
        measure#nginx.service=0.133 request="GET / HTTP/1.1" status_code=200
        request_id=8b6402de-d072-42c4-9854-0f71697b30e5 remote_addr="10.16.227.159"
        forwarded_for="199.188.193.220" forwarded_proto="http" via="1.1 vegur"
        body_bytes_sent=12666 referer="-" user_agent="Mozilla/5.0 (Macintosh;
        Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko)
        Chrome/61.0.3163.100 Safari/537.36"

This blog entry started life as a commit message.

Tags: django, logging, nginx, user-agents, heroku, gunicorn

Changes in Opera's user agent string format

2009-05-28T01:16:36+00:00

Changes in Opera's user agent string format

How depressing... Opera 10 will ship with 9.80 in the User-Agent string because badly written browser sniffing scripts can’t cope with double digits.

Via Douglas Crockford

Tags: browsers, browsersniffing, opera, user-agents

minidetector

2008-08-15T08:21:59+00:00

minidetector

Neat piece of Django middleware that adds a “mobile = True” attribute to the request object if the request’s user-agent matches a list of strings of known low-power browsers in mobiles, PDAs or game consoles.

Tags: django, middleware, minidetector, mobile, python, user-agents

Browser detection reconsidered

2003-02-24T12:07:53+00:00

Leonard Lin on The Folly of Depending on CSS Parsing Bugs:

I would not compensate for CSS rendering bugs by exploiting CSS parsing bugs except as a last resort. Think about it from a standardized test perspective: what strong relation does CSS rendering bugs have with CSS parsing bugs? There's no reason (nor right!) to assume that all future browsers with the same rendering bugs will have the same parsing bugs (and vice versa). In fact, if you look at the recent past releases (Safari, Opera, IE), even within browser families you'll see that this is absolutely not true!

He has an interesting point - user agent sniffing, while derided by many, is at least predictable in that you can deliberately target specific versions of specific browsers (as long as you're careful not to feed a user-agent cloaked Opera something nasty by mistake). CSS hacks may target browsers based purely on their capabilities (which cannot be cloaked by a false user agent string) but require careful maintenance against future version changes - as with Safari which has fixed some CSS parsing errors that were being used to filter Safari specific styles.

I suppose it boils down to the question of which is easier to maintain - a site-wide stylesheet (or two) with hacks in vs a server side (I'll ignore client side as it's even uglier) browser detection routine to serve up the right stylesheet. Either way, if you don't have a full set of browsers and platforms to test on the best you can do is cross your fingers and pray nothing breaks.

Of course, this issue is true of table layouts as well. Who knows, maybe in about 5 years time the browsers will have caught up with the current set of standards (and we'll all be stressing over how much of CSS3 we can use without the house of cards tumbling back down again).

Tags: css, leonard-lin, user-agents

Mozilla versions

2002-06-20T15:50:05+00:00

Peter-Paul Koch on evolt: One browser, many names - an invaluable guide to the many different version numbers and user-agent strings resulting from the Mozilla project.

Tags: mozilla, user-agents

User Agent list

2002-06-15T14:25:27+00:00

rudy on thelist pointed me to this handy List of User Agents.

Tags: user-agents