<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Evals on Fabian G. Williams</title>
    <link>https://www.fabswill.com/tags/evals/</link>
    <description>Recent content in Evals on Fabian G. Williams</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <lastBuildDate>Sat, 28 Mar 2026 00:00:00 +0000</lastBuildDate>
    
    <atom:link href="https://www.fabswill.com/tags/evals/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>How Do You Trust an Autonomous AI Agent? Evals Are the Answer.</title>
      <link>https://www.fabswill.com/blog/how-do-you-trust-an-autonomous-ai-agent/</link>
      <pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://www.fabswill.com/blog/how-do-you-trust-an-autonomous-ai-agent/</guid>
      <description>TL;DR: I run an autonomous AI agent on a Mac Mini in my house. She handles 16 daily cron jobs: finances, email triage, outreach campaigns, device monitoring, and morning briefings. The agent says &amp;ldquo;done.&amp;rdquo; But did she actually do anything? I built a 9-dimension eval rubric to find out. Along the way I discovered that my evals were broken, my agent was better than I thought, and the most important metric isn&amp;rsquo;t pass/fail; it&amp;rsquo;s whether a failure is your fault or the agent&amp;rsquo;s fault.</description>
    </item>
    
  </channel>
</rss>