8 Key Insights from Building Eval-Agents with GitHub Copilot
Imagine automating the intellectual toil that defines your daily work, then stepping back to maintain the tool so your entire team can do the same. That's exactly what I did as an AI researcher on the Copilot Applied Science team. The journey taught me powerful lessons about using GitHub Copilot not just for code generation, but as a catalyst for building agent-driven development tools. Below are eight insights from creating eval-agents, a system that turns repetitive trajectory analysis into an automated, collaborative process.
1. The Problem: Analyzing Thousands of Agent Trajectories
My core work involves evaluating coding agents against benchmarks like TerminalBench2 or SWEBench-Pro. Each task in these benchmarks produces a "trajectory": a detailed .json file recording every thought and action the agent performed. Multiply that by dozens of tasks, then again by multiple benchmark runs each day, and you're staring at hundreds of thousands of lines of output. Reading all of it by hand is impossible; I had to rely on AI assistance just to surface meaningful patterns. This data volume was the primary motivation for the entire project.
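To make the scale concrete, here is a minimal sketch of what "parsing a run" looks like. The trajectory schema (a list of steps with `thought` and `action` fields) is an assumption for illustration; real benchmark formats vary.

```python
import json
from pathlib import Path

def count_steps(run_dir: str) -> dict:
    """Tally how many steps each trajectory in a benchmark run contains.

    Assumes each *.json file holds a list of step objects; the actual
    schema depends on the benchmark harness.
    """
    counts = {}
    for path in Path(run_dir).glob("*.json"):
        steps = json.loads(path.read_text())
        counts[path.stem] = len(steps)
    return counts
```

Even this trivial tally makes the problem visible: dozens of tasks times multiple daily runs quickly outgrows what one person can read.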

2. The Initial Solution: Using Copilot to Surface Patterns
My first human-AI collaboration loop was simple: I used GitHub Copilot to identify common patterns across trajectories, then manually investigated those findings. This reduced the lines I had to read from hundreds of thousands to a few hundred per run. But the engineer in me saw this repetition and thought, “I want to automate that.” Copilot was already helping me shrink the problem; the next step was to let an agent take over the pattern-spotting entirely, freeing me to focus on higher-level analysis and decision-making.
3. The Automation Spark: Creating ‘Eval-Agents’
Thus, eval-agents was born. This tool automates the intellectual toil of sifting through trajectories by turning pattern recognition into executable agent workflows. Instead of manually repeating the Copilot-aided loop, I built agents that could scan thousands of .json files, flag anomalies, and summarize trends—all without direct human intervention. The key insight was that agents could perform the same analytical steps I would, but at scale and with greater consistency. This project exemplifies how Copilot can move beyond code generation to orchestrate entire research tasks.
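As a flavor of what one such automated pass might look like, here is a hedged sketch of an anomaly-flagging scan: it surfaces trajectories where the same failing command repeats, a pattern I would otherwise have hunted for by hand. The step fields (`action`, `exit_code`) and the threshold are illustrative assumptions, not the actual eval-agents internals.

```python
import json
from collections import Counter
from pathlib import Path

def flag_repeated_failures(run_dir: str, threshold: int = 3) -> list:
    """Flag trajectories where one failing command recurs >= threshold times.

    Hypothetical schema: each step has an "action" string and an
    "exit_code" integer (nonzero means the command failed).
    """
    flagged = []
    for path in sorted(Path(run_dir).glob("*.json")):
        steps = json.loads(path.read_text())
        failed = Counter(
            s["action"] for s in steps if s.get("exit_code", 0) != 0
        )
        if failed and max(failed.values()) >= threshold:
            flagged.append(path.stem)
    return flagged
```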
4. Design Goal #1: Make Agents Easy to Share and Use
From the start, I knew these agents had to be accessible to my entire team. Drawing on my experience as an OSS maintainer on the GitHub CLI, I prioritized simplicity in sharing and usage. This meant packaging the agent logic in a reusable, well-documented format. Any team member could grab an agent, run it on their benchmark data, and immediately get actionable insights. No steep learning curve, no hidden configuration. This goal ensured that the tool wouldn’t remain a solo project but would empower everyone on the Copilot Applied Science team to automate their own analysis.
5. Design Goal #2: Simplify Authoring of New Agents
To truly democratize agent creation, I designed eval-agents with a low‑friction authoring experience. Creating a new agent required only a clear description of the analysis you wanted—like “find all trajectories where the agent failed to read a file”—and the system would scaffold the code. This was possible because GitHub Copilot could generate the underlying script from natural language prompts, which I then wrapped into an agent template. The result: team members, even those less experienced with AI pipelines, could author custom agents in minutes instead of days.
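A low-friction authoring experience like this can be sketched as a tiny agent template: an agent is little more than a name, the natural-language description, and a check function (the part Copilot would scaffold from that description). Everything here, including the `EvalAgent` name and the step fields, is a hypothetical illustration of the pattern, not the project's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalAgent:
    """Minimal agent template (hypothetical): name, the plain-English
    description of the analysis, and a per-trajectory check."""
    name: str
    description: str
    check: Callable[[list], bool]  # True if the trajectory matches

def run_agent(agent: EvalAgent, trajectories: dict) -> list:
    """Return the IDs of trajectories the agent's check flags."""
    return [tid for tid, steps in trajectories.items() if agent.check(steps)]

# The lambda body is the piece Copilot would generate from the description.
missing_read = EvalAgent(
    name="failed-file-read",
    description="find all trajectories where the agent failed to read a file",
    check=lambda steps: any(
        s.get("action", "").startswith("read_file") and s.get("error")
        for s in steps
    ),
)
```

Keeping the contract this small is what makes minutes-not-days authoring plausible: a teammate only supplies the description and the check.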

6. Design Goal #3: Agents as Primary Contribution Vehicle
For long‑term success, I wanted agents to become the primary way the team contributed to the project. Instead of writing static reports or dashboards, we could contribute new analytical capabilities as agents. This shifted the project from a one‑time automation to a living library. Each agent was a discrete, shareable unit of intellectual work. When someone discovered a new pattern worth tracking, they could write an agent for it and push it to a shared repository. Over time, this built a collective intelligence that made all our evaluations faster and richer.
7. Lessons in Effective Collaboration with Copilot
Throughout this process, I learned three critical lessons about collaborating with GitHub Copilot:
- State the intent clearly: Copilot works best when I describe not just what to do, but why—context helps generate more relevant analysis code.
- Iterate in small loops: Breaking a large analysis into small, verifiable steps (e.g., “load this trajectory”, “extract the success flag”) made Copilot’s suggestions more accurate and easier to debug.
- Let Copilot suggest patterns you didn’t ask for: Sometimes the AI would propose analyses I hadn’t considered—like correlating agent verbosity with failure rates—sparking new research questions.
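The "small, verifiable steps" lesson is easiest to see in code. Rather than asking for one monolithic analysis script, each prompt maps to a function small enough to test in isolation, roughly like the two below (the `result.success` field is an assumed schema, shown for illustration):

```python
import json
from pathlib import Path

def load_trajectory(path: str) -> dict:
    """Step 1: load a single trajectory file."""
    return json.loads(Path(path).read_text())

def extract_success(trajectory: dict) -> bool:
    """Step 2: pull out the success flag, defaulting to False.

    Assumes a top-level "result" object with a boolean "success" field.
    """
    return bool(trajectory.get("result", {}).get("success", False))
```

Because each step is independently verifiable, a wrong suggestion from Copilot fails fast and locally instead of corrupting a whole pipeline.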
8. The Outcome: A Faster Development Loop for the Team
Today, the entire Copilot Applied Science team uses eval-agents. We’ve dramatically shortened the time from benchmark run to insight, from days to hours. More importantly, the tool has become a platform for collective innovation: anyone can extend it with new agents, and those contributions automatically benefit everyone. I may have automated my original job, but in its place I’ve found a new role—enabling my peers to do the same. The future of agent‑driven development is collaborative, and GitHub Copilot is the engine that makes it possible.
Conclusion: Building eval-agents proved that with the right design principles and a powerful AI copilot, we can automate not just manual labor, but intellectual analysis itself. The three goals—shareability, ease of authoring, and agent‑first contributions—turned a personal tool into a team‑wide force multiplier. If you’re wrestling with repetitive analytical tasks, consider applying these same principles. You might just find yourself automating your own job into something much more impactful.