<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: hamel-husain</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/hamel-husain.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-07-03T20:36:56+00:00</updated><author><name>Simon Willison</name></author><entry><title>Frequently Asked Questions (And Answers) About AI Evals</title><link href="https://simonwillison.net/2025/Jul/3/faqs-about-ai-evals/#atom-tag" rel="alternate"/><published>2025-07-03T20:36:56+00:00</published><updated>2025-07-03T20:36:56+00:00</updated><id>https://simonwillison.net/2025/Jul/3/faqs-about-ai-evals/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hamel.dev/blog/posts/evals-faq/"&gt;Frequently Asked Questions (And Answers) About AI Evals&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Hamel Husain and Shreya Shankar have been running a paid, cohort-based course on &lt;a href="https://maven.com/parlance-labs/evals"&gt;AI Evals For Engineers &amp;amp; PMs&lt;/a&gt; over the past few months. Here Hamel collects answers to the most common questions asked during the course.&lt;/p&gt;
&lt;p&gt;There's a ton of actionable advice in here. I continue to believe that a robust approach to evals is the single most important distinguishing factor between well-engineered, reliable AI systems and YOLO cross-your-fingers-and-hope-it-works development.&lt;/p&gt;
&lt;p&gt;Hamel says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It’s important to recognize that evaluation is part of the development process rather than a distinct line item, similar to how debugging is part of software development. [...]&lt;/p&gt;
&lt;p&gt;In the projects we’ve worked on, &lt;strong&gt;we’ve spent 60-80% of our development time on error analysis and evaluation&lt;/strong&gt;. Expect most of your effort to go toward understanding failures (i.e. looking at data) rather than building automated checks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I found this tip to be useful and surprising:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you’re passing 100% of your evals, you’re likely not challenging your system enough. A 70% pass rate might indicate a more meaningful evaluation that’s actually stress-testing your application.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=44430117"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hamel-husain"&gt;hamel-husain&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="hamel-husain"/><category term="evals"/></entry><entry><title>Quoting Hamel Husain</title><link href="https://simonwillison.net/2025/Apr/15/hamel-husain/#atom-tag" rel="alternate"/><published>2025-04-15T18:05:42+00:00</published><updated>2025-04-15T18:05:42+00:00</updated><id>https://simonwillison.net/2025/Apr/15/hamel-husain/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.oreilly.com/radar/a-field-guide-to-rapidly-improving-ai-products/"&gt;&lt;p&gt;The single most impactful investment I’ve seen AI teams make isn’t a fancy evaluation dashboard—it’s building a customized interface that lets anyone examine what their AI is actually doing. I emphasize &lt;em&gt;customized&lt;/em&gt; because every domain has unique needs that off-the-shelf tools rarely address. When reviewing apartment leasing conversations, you need to see the full chat history and scheduling context. For real-estate queries, you need the property details and source documents right there. Even small UX decisions—like where to place metadata or which filters to expose—can make the difference between a tool people actually use and one they avoid. [...]&lt;/p&gt;
&lt;p&gt;Teams with thoughtfully designed data viewers iterate 10x faster than those without them. And here’s the thing: These tools can be built in hours using AI-assisted development (like Cursor or Lovable). The investment is minimal compared to the returns.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.oreilly.com/radar/a-field-guide-to-rapidly-improving-ai-products/"&gt;Hamel Husain&lt;/a&gt;, A Field Guide to Rapidly Improving AI Products&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hamel-husain"&gt;hamel-husain&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="datasette"/><category term="llms"/><category term="ai-assisted-programming"/><category term="hamel-husain"/></entry><entry><title>follow_theirs.py</title><link href="https://simonwillison.net/2024/Nov/24/follow-theirs/#atom-tag" rel="alternate"/><published>2024-11-24T18:57:16+00:00</published><updated>2024-11-24T18:57:16+00:00</updated><id>https://simonwillison.net/2024/Nov/24/follow-theirs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://gist.github.com/simonw/848a3b91169a789bc084a459aa7ecf83"&gt;follow_theirs.py&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Hamel Husain wrote &lt;a href="https://gist.github.com/hamelsmu/fb9ed633de7d784619e4b6da5039e6ae"&gt;this Python script&lt;/a&gt; on top of the &lt;a href="https://pypi.org/project/atproto/"&gt;atproto&lt;/a&gt; Python library for interacting with Bluesky, which lets you specify another user and then follows every account that user is following.&lt;/p&gt;
&lt;p&gt;I forked it and added two improvements: inline &lt;a href="https://peps.python.org/pep-0723/"&gt;PEP 723&lt;/a&gt; dependencies and &lt;code&gt;input()&lt;/code&gt; and &lt;code&gt;getpass.getpass()&lt;/code&gt; to interactively ask for the credentials needed to run the script.&lt;/p&gt;
&lt;p&gt;This means you can run my version using &lt;code&gt;uv run&lt;/code&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run https://gist.githubusercontent.com/simonw/848a3b91169a789bc084a459aa7ecf83/raw/397ad07c8be0601eaf272d9d5ab7675c7fd3c0cf/follow_theirs.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I really like this pattern of being able to create standalone Python scripts with dependencies that can be run from a URL as a one-liner. Here's the comment section at the top of the script that makes it work:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /// script
# dependencies = [
#   "atproto"
# ]
# ///
&lt;/code&gt;&lt;/pre&gt;
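&lt;p&gt;The interactive credential part of my fork looks roughly like this (a minimal sketch of the pattern, not the exact code from the gist): &lt;code&gt;input()&lt;/code&gt; for the visible handle, &lt;code&gt;getpass.getpass()&lt;/code&gt; so the app password isn't echoed to the terminal. The prompt functions are injectable here purely so the logic can be exercised without a terminal:&lt;/p&gt;

```python
import getpass

# Sketch of the interactive credential pattern described above.
# input() shows what you type; getpass.getpass() hides the password.
# ask/ask_secret default to the real prompts but can be swapped for testing.
def prompt_for_credentials(ask=input, ask_secret=getpass.getpass):
    handle = ask("Bluesky handle: ")
    password = ask_secret("App password: ")
    return handle, password
```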


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hamel-husain"&gt;hamel-husain&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/bluesky"&gt;bluesky&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="hamel-husain"/><category term="uv"/><category term="bluesky"/></entry><entry><title>Creating a LLM-as-a-Judge that drives business results</title><link href="https://simonwillison.net/2024/Oct/30/llm-as-a-judge/#atom-tag" rel="alternate"/><published>2024-10-30T18:08:07+00:00</published><updated>2024-10-30T18:08:07+00:00</updated><id>https://simonwillison.net/2024/Oct/30/llm-as-a-judge/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hamel.dev/blog/posts/llm-judge/"&gt;Creating a LLM-as-a-Judge that drives business results&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Hamel Husain's sequel to &lt;a href="https://hamel.dev/blog/posts/evals/"&gt;Your AI product needs evals&lt;/a&gt;. This is &lt;em&gt;packed&lt;/em&gt; with hard-won actionable advice.&lt;/p&gt;
&lt;p&gt;Hamel warns against using scores on a 1-5 scale, instead promoting an alternative he calls "Critique Shadowing". Find a domain expert (one is better than many, because you want to keep their scores consistent) and have them answer the yes/no question "Did the AI achieve the desired outcome?" - providing a critique explaining their reasoning for each of their answers.&lt;/p&gt;
&lt;p&gt;This gives you a reliable score to optimize against, and the critiques mean you can capture nuance and improve the system based on that captured knowledge.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Most importantly, &lt;strong&gt;the critique should be detailed enough so that you can use it in a few-shot prompt for a LLM judge&lt;/strong&gt;. In other words, it should be detailed enough that a new employee could understand it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Once you've gathered this expert data you can switch to using an LLM-as-a-judge. You can then iterate on the prompt you use for it in order to converge its "opinions" with those of your domain expert.&lt;/p&gt;
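&lt;p&gt;The mechanics of turning those expert critiques into a judge prompt can be sketched like this (my illustration of the idea, not code from Hamel's post): each few-shot example pairs the binary expert verdict with the critique that explains it, then the judge is asked the same yes/no question about a new interaction:&lt;/p&gt;

```python
# Hypothetical sketch: build a few-shot prompt for an LLM judge from
# expert-labelled examples. Each example carries the yes/no verdict plus
# the critique explaining the expert's reasoning.
def build_judge_prompt(examples, new_interaction):
    parts = [
        "Did the AI achieve the desired outcome? "
        "Answer yes or no, then give a critique explaining your reasoning."
    ]
    for ex in examples:
        parts.append(
            f"Interaction: {ex['interaction']}\n"
            f"Verdict: {'yes' if ex['pass'] else 'no'}\n"
            f"Critique: {ex['critique']}"
        )
    # The new interaction goes last, leaving the verdict for the judge.
    parts.append(f"Interaction: {new_interaction}\nVerdict:")
    return "\n\n".join(parts)
```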
&lt;p&gt;Hamel concludes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The real value of this process is looking at your data and doing careful analysis. Even though an AI judge can be a helpful tool, going through this process is what drives results. I would go as far as saying that creating a LLM judge is a nice “hack” I use to trick people into carefully looking at their data!&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41995253"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hamel-husain"&gt;hamel-husain&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="hamel-husain"/><category term="evals"/></entry><entry><title>Quoting Hamel Husain</title><link href="https://simonwillison.net/2024/Jun/7/hamel-husain/#atom-tag" rel="alternate"/><published>2024-06-07T18:02:51+00:00</published><updated>2024-06-07T18:02:51+00:00</updated><id>https://simonwillison.net/2024/Jun/7/hamel-husain/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/HamelHusain/status/1798757828100047063"&gt;&lt;p&gt;LLM bullshit knife, to cut through bs&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;RAG -&amp;gt;              Provide relevant context
Agentic -&amp;gt;          Function calls that work
CoT -&amp;gt;              Prompt model to think/plan
FewShot -&amp;gt;          Add examples
PromptEng -&amp;gt;        Someone w/good written comm skills.
Prompt Optimizer -&amp;gt; For loop to find best examples.
&lt;/code&gt;&lt;/pre&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/HamelHusain/status/1798757828100047063"&gt;Hamel Husain&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rag"&gt;rag&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hamel-husain"&gt;hamel-husain&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="rag"/><category term="hamel-husain"/></entry><entry><title>Your AI Product Needs Evals</title><link href="https://simonwillison.net/2024/Mar/31/your-ai-product-needs-evals/#atom-tag" rel="alternate"/><published>2024-03-31T21:53:59+00:00</published><updated>2024-03-31T21:53:59+00:00</updated><id>https://simonwillison.net/2024/Mar/31/your-ai-product-needs-evals/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hamel.dev/blog/posts/evals/"&gt;Your AI Product Needs Evals&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Hamel Husain: “I’ve seen many successful and unsuccessful approaches to building LLM products. I’ve found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems.”&lt;/p&gt;

&lt;p&gt;I’ve been frustrated about this for a while: I know I need to move beyond “vibe checks” for the systems I have started to build on top of LLMs, but I was lacking a thorough guide about how to build automated (and manual) evals in a productive way.&lt;/p&gt;

&lt;p&gt;Hamel has provided exactly the tutorial I was needing for this, with a really thorough example case-study.&lt;/p&gt;

&lt;p&gt;Using GPT-4 to create test cases is an interesting approach: “Write 50 different instructions that a real estate agent can give to his assistant to create contacts on his CRM. The contact details can include name, phone, email, partner name, birthday, tags, company, address and job.”&lt;/p&gt;

&lt;p&gt;Also important: “... unlike traditional unit tests, you don’t necessarily need a 100% pass rate. Your pass rate is a product decision.”&lt;/p&gt;
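&lt;p&gt;That idea is easy to encode: run the suite against a configurable threshold instead of demanding every case pass. A minimal sketch (the names and threshold are mine, for illustration):&lt;/p&gt;

```python
# Sketch of "your pass rate is a product decision": unlike traditional
# unit tests, the suite succeeds when the pass rate clears a threshold
# the product team chose, not when it hits 100%.
def evaluate(cases, check, threshold=0.9):
    passed = sum(1 for case in cases if check(case))
    rate = passed / len(cases)
    return rate, rate >= threshold
```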

&lt;p&gt;Hamel’s guide then covers the importance of traces for evaluating real-world performance of your deployed application, plus the pros and cons of leaning on automated evaluation using LLMs themselves.&lt;/p&gt;

&lt;p&gt;Plus some wisdom from a footnote: “A reasonable heuristic is to keep reading logs until you feel like you aren’t learning anything new.”&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/HamelHusain/status/1773765490663735319"&gt;@HamelHusain&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hamel-husain"&gt;hamel-husain&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;&lt;/p&gt;



</summary><category term="testing"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="hamel-husain"/><category term="evals"/></entry></feed>