Principles of Harness Engineering: Lessons From Building an Agentic Data-Eating Machine

Last month I built a Data-Eating Machine: I point it at API docs and it reads, cleans, and QAs the data. It’s an autonomous team of 6 agents, each owning a different piece of the machine, working in tandem on days-long trajectories. This is starting to be called “Harness Engineering”.

I had built similar kinds of agentic systems in the past few months.

The Data-Eating Machine was the hardest to build. By definition it required convergence towards a verifiable end-state of completeness and correctness of data. It’s a stateful system of data + code that goes on days-long trajectories. Below I explain some principles of such an architecture, and what I’ve seen work.

Forward and Backward Pressure from Constitution

Assume we’re pointing the Data-Eating Machine towards https://docs.polymarket.com/ and want it drained into a data warehouse to full completion. How do you define that end-state? For most systems you’d want to define the end-state as a precisely specified destination or set of criteria to be satisfied. Let’s try a few. You could say:

  1. “drain me all polymarket data”, or you could break it down into narrowly scoped sub-goals and say:

  2. “i want you to paginate over endpoint gamma.polymarket.com/data-api/market-prices?page and for each page …. “

Or you could think a bit harder and arrive at some form of well-generalizing definition that, despite its succinctness, captures a large range of nuances of the terminal end-state:

  3. “if a row exists in the backend of polymarket, then we must have it too in our data warehouse”

Or you could even state it the other way around:

  4. “we have satisfied completeness once there are no more rows we could get that we don’t already have”

So which one do we go with? Each requires a slightly different harness, each takes a different trajectory, and, on long enough trajectories, each will almost certainly end up in a different end-state. Let’s think more systematically.

[Figure: agent trajectory with back-pressure]

Your agentic system has a current state, a terminal end-state, and a trajectory that it needs to traverse in order to reach the end-state. Here we are working with a stateful system of data + code which, once turned on, runs until the backfill is complete and from there runs continuously, eating more and more data as it becomes available. So your end-state is constantly moving. This looks like a trajectory that is constantly growing in length and needs to be swept. Holding the machine on the trajectory requires both a systematized Workflow and standardized Atoms (units) of work.

The idea is that you first define the end-state by carving out a shaped cavity to be filled. In the case of the Data-Eating Machine these shaped cavities are empty tables with schemas, ready to be filled up. Then a simple COUNT(*) tells you where you are on your trajectory. You put an agent on continuously estimating the size of the iceberg (iceberg because you don’t know the actual size of the table in reality) and updating the target count that the machine needs to sweep towards.
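To make this concrete, here is a minimal sketch of that progress check in Python. The warehouse client and the iceberg_estimates table are hypothetical names used for illustration, not the actual implementation:

# Sketch of the forward-pressure check: compare what we have against the
# latest iceberg estimate. The `warehouse` client and the iceberg_estimates
# table are illustrative assumptions, not the real harness.
def trajectory_progress(warehouse, table_name: str) -> float:
    filled = warehouse.query(f"SELECT COUNT(*) AS n FROM {table_name}")[0]["n"]
    # The estimating agent keeps refreshing this target as it learns more
    # about how big the source actually is.
    target = warehouse.query(
        "SELECT estimated_row_count FROM iceberg_estimates"
        f" WHERE table_name = '{table_name}'"
    )[0]["estimated_row_count"]
    return filled / max(target, 1)  # ~1.0 means the cavity is (currently) full

The machine keeps sweeping as long as this ratio lags behind a moving target.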

Constitutions here are essentially bindings. We systematize the workflow by having one standardized YAML file at each step of the workflow. We then bind one agent per YAML, and have it apply pressure at each epoch to hold that binding. For example, the “Target Agent” holds a binding between docs/* and polymarket_spec_target.yaml, and the “Current Agent” holds a binding between the current state of the data infra and polymarket_spec_current.yaml. Other agents measure the gap between where the target constitution requires the system to be and where the system currently is, and one agent actively closes this gap.
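As a rough sketch of what that gap measurement can look like (the YAML layout, an "endpoints" map with row counts, is an assumption made for illustration; only the two file names come from the harness described above):

import yaml

# Sketch of measuring the gap between the target and current constitutions.
# The schema (an "endpoints" map with expected/loaded row counts) is assumed.
def measure_gap(target_path: str = "polymarket_spec_target.yaml",
                current_path: str = "polymarket_spec_current.yaml") -> dict:
    with open(target_path) as f:
        target = yaml.safe_load(f) or {}
    with open(current_path) as f:
        current = yaml.safe_load(f) or {}
    gap = {}
    for endpoint, spec in target.get("endpoints", {}).items():
        have = current.get("endpoints", {}).get(endpoint, {})
        missing = spec.get("expected_rows", 0) - have.get("loaded_rows", 0)
        if missing > 0:
            gap[endpoint] = missing
    return gap  # the agent that closes the gap works this down to empty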

Most components of the data infrastructure are locked, and there are pockets that the agents own and are allowed to work in. The shaped cavities and target count create the forward-pressure for your agentic system. The constitutions and locked components create back-pressure to carve out a reliable trajectory.

Cinematic Universe and Trajectory Steerability

Your agent calls the wrong tools and steps off the path towards the target you’ve defined for it. You patch the prompt by adding a few more lines of instructions. Soon your prompt is a few pages long. Worse, some of the instructions start to slightly contradict each other. And then your agent starts acting “dumb”. You notice this as some form of “collapse”. It’s a quite common failure mode. Here is why this happens:


In pre-training, LLMs learn about the world by encoding a representation in their weights; in post-training, this distributed representation gets conditioned and constrained into a specific shape (following instructions, tool use, refusal, etc.). That shape is called a “Cinematic Universe”. Think of the LLM as an actor in a theater, being told: “You are a Senior Engineer with 20 years of experience”. What the LLM then does, very loosely, is predict what it would be like to be that “Senior Engineer with 20 years of experience”. Here we could swap “Cinematic Universe” with the word “Ontology” or “Frame”. Every single word that you inject into your prompt shapes this ontology, which directly affects the effectiveness of your agent in the real world. The more you load up your prompts, the higher the risk of injecting unhelpful ontologies which “confuse” the agent by biasing it towards unhelpful, sub-optimal, or contradictory cinematic universes. E.g. defining the LLM to be a nurse and asking it to deploy cloud infrastructure wouldn’t work very well!

Now, remember that this is all happening in semantic space, and is therefore approximate and directional as opposed to precise and discrete. Meanwhile, you are trying to delegate a kind of task, like code-gen, that requires precision. You need a prickly and deterministic way of steering the model to follow a precise rubric. So the question arises: if the optimal thing is to avoid dirtying the cinematic universe by cutting down your prompts, instructions, and descriptions, then how do you actually enforce your long list of Dos and Don’ts in order to steer the agent? The answer is hooks! They look like this:

1. "You can't go there." — Pure denial, agent must find another path.
{"hookSpecificOutput": {"hookEventName": "PreToolUse", "permissionDecision": "deny", "permissionDecisionReason": "Production database is read-only."}}

2. "Before you do that, keep this in mind." — Context injection before tool execution, positive steering.
{"hookSpecificOutput": {"hookEventName": "PreToolUse", "additionalContext": "This table has 400M rows. Use LIMIT or TABLESAMPLE to avoid a $50 query."}}

3. "You think you're done, but you're not." — Stop prevention, forces continuation.
{"decision": "block", "reason": "QA is not complete yet.", "hookSpecificOutput": {"hookEventName": "Stop", "additionalContext": "Required: row_count, null_rates, schema_match."}}

...

You can mix and match these (e.g. AND/OR-ing them) to get narrow steerability. That’s how you end up with pockets in your harness within which you define the prompts that create the “forward-pressure” for your agentic system. So ideally what you want is a loose and effective cinematic universe, while simultaneously creating “back-pressure” via constitutions (YAMLs in our case) and using hooks to carve out a reliable trajectory. This looks like tiny prompts and lots of well-designed hooks.
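As a minimal sketch of what one of these hooks can look like as a script, assuming a harness in the Claude Code style that pipes each proposed tool call to the hook as JSON on stdin and reads the decision back from stdout (the input field names and the table names below are assumptions for illustration):

import json
import sys

# Sketch of a PreToolUse hook: deny writes against locked tables, and inject
# steering context for expensive ones. The input fields ("tool_name",
# "tool_input") and the table names are illustrative assumptions.
event = json.load(sys.stdin)
query = str(event.get("tool_input", {}).get("query", "")).lower()

if "drop table" in query or "insert into prod_" in query:
    print(json.dumps({"hookSpecificOutput": {
        "hookEventName": "PreToolUse",
        "permissionDecision": "deny",
        "permissionDecisionReason": "Production tables are locked; work inside your pocket."}}))
elif "market_prices" in query and "limit" not in query:
    print(json.dumps({"hookSpecificOutput": {
        "hookEventName": "PreToolUse",
        "additionalContext": "market_prices is large; add LIMIT or TABLESAMPLE before running."}}))
sys.exit(0)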

What should you put in those tiny prompts? More than anything, use them to tell your LLM as much as you can about 1) who you are and what you want, and 2) what the ideal end-state looks like when you close your eyes. Do this with high specificity and precision. I explain this at the end of this post.

Burden of Reasoning, Scope, and Tool Surfaces

LLMs used to be called “Stochastic Parrots” a lot (back in the 2020–2023 arc), i.e. the claim was that all they do is pattern-match and repeat, in contrast to having “reasoning” capabilities. Today, even though frontier LLMs have been shown to have developed circuits capable of above-human reasoning, it still stands that they are far better at being “Stochastic Parrots” than they are at reasoning.

Take the burden of reasoning off the LLM’s shoulders. Agents want to get the job done; your job is to constantly get the friction out of their way.

The other dimension here is the scope of the task given to the agent. The narrower the scope, the easier it is to get determinism, reliability, and explainability. This is a spectrum: break a task down too much and you lose the immediate shared context; make the task too big and attention is dissipated in too many directions at once, which causes a collapse of attention.

Here’s a short run of the Data-Eating Machine on a single source:

| Harness Phase  | Duration | Tool calls | Cost  |
|----------------|----------|------------|-------|
| Current Agent  | 3.3 min  | 18         | $0.28 |
| Gap Agent      | 2.1 min  | 19         | $0.24 |
| Solution Agent | 1.3 min  | 18         | $0.16 |
| Shepherd Agent | 2.6 min  | 12         | $0.20 |
| Total          | 9.3 min  | 67         | $0.88 |

The shape of the tool surface is about how you design that surface. You could have three tools or you could have 30. How do you decide? How do you divide the tasks between those tools? How do you decide what the arguments are? Should you break one of your tools into two?

You don’t care much about the number of tool calls. Whether your LLM makes 100 tool calls or 10, you don’t necessarily worry about it that much. But if, to get “Thing X”, your LLM has to identify five different tools with different names and arguments, and then reason about how to stitch the results of those five tools together to get the answer, that’s not a good scenario. What you want is for the path to be clear: it wants Thing X, and it can get Thing X with, ideally, one single obvious tool call. That’s much better. But if that comes at the cost of having to pass a series of 20 different things as arguments into that one tool call, then you should break it up. There are tensions here that you need to constantly resolve.
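As a toy illustration of that tension (all tool names and arguments below are made up):

# Too fragmented: to answer "is this table complete yet?" the agent has to
# chain several calls and reason about how their outputs fit together:
#   list_schemas() -> get_table(schema, name) -> get_row_count(table_id)
#   -> get_target_estimate(table_id) -> compare the two numbers itself.

# One obvious tool with a small, clear argument surface:
def check_table_completeness(table_name: str) -> dict:
    """Return loaded rows, estimated target rows, and the remaining gap."""
    ...

# Overcorrected: one mega-tool whose argument list is itself a reasoning burden.
def run_ingestion(table_name, schema, mode, page_size, start_date, end_date,
                  retry_policy, dedupe_keys, qa_checks, output_format, dry_run):
    ...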

Scalable Oversight, Verifiable Correctness, and Monosemanticity

How do you generate “large” codebases that are correct? Or, to reduce the problem: how do you generate narrow little pieces of code that are correct?

You are dealing with stochasticity and entropy. Go long enough and a tiny percentage of errors accumulates.

The state of the art here is https://theorem.dev/blog/lf-lean/, in which you create some form of consistency across code generations by defining task-class specifications. You’re not writing a specification one by one for each task; rather, you write a specification for the task class, and then the per-task specifications are themselves generated one by one.
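As a rough sketch of what a task-class specification can mean in practice (the field names below are assumptions for illustration, not taken from the linked post):

# One hand-written specification for the task class; the per-task specs are
# then generated from it. Field names here are illustrative assumptions.
INGESTION_TASK_CLASS = {
    "inputs": ["endpoint_url", "pagination_param"],
    "outputs": ["warehouse_table"],
    "invariants": [
        "every page is fetched exactly once",
        "warehouse row count >= rows observed across all pages",
        "no NULLs in primary-key columns",
    ],
}

def instantiate_spec(endpoint_url: str, warehouse_table: str) -> dict:
    # Generated per task by the harness, not written by hand.
    return {
        **INGESTION_TASK_CLASS,
        "inputs": {"endpoint_url": endpoint_url, "pagination_param": "page"},
        "outputs": {"warehouse_table": warehouse_table},
    }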

Additionally, you can take advantage of how LLMs understand code, which is in 1) structural and 2) semantic form. The structural pattern you can think of as looking at your code from a distance: your eyes see a structure and pattern in the code (e.g. the structural shape of React components or OOP modules). And then you also have variables, functions, or namespaces like “check_if_we_have” or “THE_THING”, which the LLM understands semantically.

Monosemanticity

TDD works well in code-gen because tests are usually semantically more meaningful. An assert statement reads like a sentence; determining its correctness semantically requires less reasoning than determining what a piece of code does. What this helps with is that when a bug is introduced, it jumps out far more strongly in the test statement than in the implementation. For example, a bug like:

Implementation: 

if SOMETHING ** 2 != 42:

    ANYTHING = SOMETHING + 32 + 78

    if ANYTHING > SOMETHING or ANYTHING:
        return SOMETHING

Test: 

assert SOMETHING > ANYTHING

The assert jumps out as incorrect, versus the case where the same bug (say, an ALL_THINGS variable) is buried somewhere in the implementation inside something like ALL_THINGS + 2 = SOMETHING // 43.

Now, what controls do we have over this? You can take advantage of it through the way you choose your variable names. Your variable names matter. Longer, more descriptive variable names work in the direction of precision when it comes to grabbing the LLM’s attention. For example, these two variables:

let anchor_from_apply_quotes_last_tick = asset.q_mid;
let mark_from_bbo_after_ws_events_this_tick = asset.mid();

as symbols are less prone to being semantically misinterpreted than mid and q_mid. When writing large amounts of code via LLM codegen, with no human reviewer in the loop, it’s possible to push this to ranges not usually seen in any codebase: take an example like a_variable_that_is_this_but_not_that and push it even further. There is a collapse that happens with respect to this.

Precise Specification of Intent and End-state

Designing an agentic harness is highly empirical and experiential; it requires tight feedback loops. Unlike traditional system architecture, you cannot design these systems in the abstract and on paper. You need real feedback. You need to watch one run of the harness and get a feel for what the agent sees and doesn’t see, what it already knows in its base knowledge, and what it’s not as familiar with and is going to struggle with. From there, once you’ve closed the loop once, you can start to hammer out the trajectory by scaffolding hooks that hold the agent on the trajectory, or by tweaking the prompt.

Lastly, LLMs are gradually eating through the layers of HOW -> WHAT -> WHY, in that exact order. As of 2026 the HOW has been eaten pretty significantly. For example, if tomorrow I decided to do some quantum physics math (which I have no clue about), as long as I can barely gesture at and describe WHAT it is that I want, the LLM will pretty reliably have a satisfactory answer for the HOW. Today you see this with harnesses like Claude Code, where agents are gradually going meta and prompting groups of subagents as their delegates, which is a form of eating into the WHAT layer by deciding what needs to be done. That leaves the WHY layer. The good news is that we humans are uniquely good at the WHY.

Concretely, this means less micro-managing and less interfering with the HOW layer (unless you’re confidently on the frontier of your field, in which case it’s better to be opinionated). Instead, do more describing of the WHY and WHAT of the situation: who are you? Why do you want a certain thing? Who are you doing this for? Who are they? What are they about? (Remember the Cinematic Universe?)
