Hate the game.
A contract refusal and the larger ecosystem in which it sits
I had two conversations this week that pointed at the same thing from different angles.
The first was an evaluation opportunity. As we talked through what the client was looking for, the shape of what they wanted became clear without anyone saying it out loud: they had a decision in mind, and they wanted a summative evaluation whose recommendation arrived at it. They needed a third-party expert to put it in writing, and the signature was the unspoken deliverable.
I declined. The request did not meet ethical evaluation standards, which require findings to follow the evidence rather than the other way around.
The second was a conversation with a former colleague who now works in public health. She described a company trying to win a contract; to win it, the company needs to demonstrate specific statistical benchmarks. They are massaging the data to get there. The benchmarks were set up to standardize quality across providers, but in practice, my colleague said, everyone games them, and the actual quality across providers varies wildly.
By her account, the company in question is actually one of the good ones, with a service that an external evaluation found to exceed expectations. The provider is gaming the benchmark not because their work is poor but because the benchmark cannot capture what their work actually does. But the poor-quality providers do the same thing, so rather than focusing on improving actual quality, they focus on succeeding at gaming the system.
Two different sectors, two different stories, and the same problem underneath both.
The contract refusal was a privilege
The privilege is worth naming out loud, because pretending otherwise is dishonest.
Evaluators have industry standards and ethical codes, but declining work depends on being able to, which depends on other work, on savings, on a household income that does not require this one contract. The evaluator who needs this engagement to pay rent is not necessarily less ethical, but they are, by definition, more constrained, and they are also a more useful case study in what the system actually rewards.
When we talk about evaluator integrity, we tend to treat it as a stance you take. Sometimes it is. More often, it is a luxury good purchased with margin, and writing about ethics without naming that costs nothing and changes nothing.
The fish do not make the water
The public health organization is doing what the system asks of them. Their question is existential, and the intervention they offer might be one of the ones that actually works. If they do not produce the numbers in the right shape, the contract goes elsewhere, the program ends, staff lose jobs, and the people they serve lose the service.
In that frame, massaging data starts to look like keeping an effective program alive. One that they know is working. The data is now load-bearing for survival, not just for learning.
The temptation to round up is not exotic, and it is not a character flaw. It is the rational response of a fish trying to swim in the only water available.
This is a structural problem before it is a character problem. The fish are not the issue. The water is.
The dynamic has a name. Goodhart’s Law says that when a measure becomes a target, it ceases to be a good measure. Campbell’s Law, the social-science cousin, says that the more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures. It is the same law in different language, and the field has had fifty years to know it, but keeps building the systems anyway.
I wrote about this dynamic for the AI context last August, in a piece on how people break AI in social impact. That piece looked at how the pattern plays out inside a single model. What follows extends the frame to three sectors I have either worked in or watched closely: international development, education, and healthcare. The public health story sits inside a much larger pattern, and the pattern is the point.
Bad data follows the money in international development
I will start with the field I know best. Justin Sandefur and Amanda Glassman wrote a paper for the Center for Global Development in 2014 called The Political Economy of Bad Data. Their finding, across multiple African countries, was that official statistics systematically exaggerate development progress, and they traced the exaggeration to two specific mechanisms.
The first mechanism is that governments overreport to foreign donors. Their lead example is a results-based aid program that paid for reported vaccination rates, and the reported rates exceeded what independent household surveys could find. The money was tied to the number, and the number obliged.
The second mechanism is that governments are themselves misled by frontline providers. Their example is primary school enrollment, where official numbers diverged sharply from survey estimates after funding shifted from user fees to per-pupil government grants. The schools had been counting students before, more or less honestly, and then the grant changed what the count was for.
USAID’s own Inspector General reached similar conclusions from the inside. Across 21 performance audits in Egypt, Jordan, and the West Bank and Gaza between 2011 and 2013, 71 percent found unreliable data, including overstated indicators and missing documentation. In one case from Ukraine, the implementer overstated an indicator by roughly 100,000 and attributed the overstatement to a typographical error. In Guatemala, another implementer overstated leveraged funds by $3.4 million.
The pattern is not occasional misconduct; it is the predictable output of a funding structure that pays for indicator achievement. Logframes do not cause this. What does is the cost-reimbursement contract, the indicator-tethered disbursement, and the political pressure to show progress, all running at the same time.
Andrew Natsios, who ran USAID from 2001 to 2006, named the dynamic from the inside. He called it “obsessive measurement disorder.” The programs that get measured most precisely, he argued, are the least transformational, while the programs that actually transform anything are the hardest to measure. The aid system, he said, had organized itself around the wrong half of that observation.
Education made evidence into a badge
The Every Student Succeeds Act of 2015 set up a four-tier system for evaluating evidence behind educational programs, with Tier 1 reserved for strong evidence from a randomized trial, Tier 2 for moderate evidence from a quasi-experiment, and the lower tiers for weaker designs. Federal money for certain program categories is tied to using interventions with stronger evidence.
Vendors self-select the tier they claim, and the incentive structure that results is striking. Once a vendor has a single Tier 1 study showing a statistically significant positive effect, additional studies can only weaken the claim. The rational move, then, is to fund one tightly controlled study with extensive implementation support, get the positive result, badge the product, and never study it again. The market fills with Tier 1 claims, and the actual quality among Tier 1 products varies enormously.
The framework was supposed to standardize quality. In practice, it standardized the gaming.
This is the story my colleague was telling me, told in a different sector. Replace ESSA with whatever the equivalent benchmark is for a state public health contract, and the structure is the same. There is a number, that number determines whether the program gets the money, and the shape of the company’s behavior follows.
The older and sharper version of this same pattern is the Atlanta Public Schools cheating scandal, which ended with multiple educators convicted of racketeering for systematically changing student answers on standardized tests. The high-stakes use of a single metric for teacher accountability did not produce better teaching. It produced erased answers.
In healthcare, the metric moves and the reality does not
The Affordable Care Act created the Hospital Readmissions Reduction Program in 2010, under which hospitals with higher-than-expected 30-day readmission rates for selected conditions get penalized through reduced Medicare payments. The program was credited with substantial reductions in readmissions, and for years it was held up as a success of pay-for-performance in healthcare.
In 2019, Ody and colleagues published a paper in Health Affairs showing that the credit was overstated. A coincident change in electronic transaction standards allowed hospitals to document more diagnoses per claim, which meant higher risk scores, which in turn meant lower risk-adjusted readmission rates. The patients were the same, the care was largely the same, and what actually changed was the paperwork. Accounting for that change cuts the apparent decline in risk-adjusted readmissions for targeted conditions by 48 percent, and the authors conclude that either the HRRP had no effect on readmissions, or its effect was roughly half what we thought.
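The mechanism in that finding can be made concrete with a toy calculation. The sketch below uses a simplified observed-to-expected adjustment, which is an assumption made for illustration and not the methodology Medicare actually applies. Every number in it is hypothetical. It shows only the direction of the effect: if fuller coding raises the expected readmission rate while the observed rate stays put, the risk-adjusted rate falls.

```python
def risk_adjusted_rate(observed_rate, expected_rate, reference_rate):
    """Simplified adjustment: scale a reference rate by observed / expected.

    This is an illustrative stand-in, not the official program formula.
    """
    if expected_rate <= 0:
        raise ValueError("expected_rate must be positive")
    return reference_rate * observed_rate / expected_rate


# Hypothetical values. The observed rate is the same in both periods.
OBSERVED = 0.20
REFERENCE = 0.20

# Before: fewer diagnoses per claim, so patients look less sick on paper.
before = risk_adjusted_rate(OBSERVED, expected_rate=0.20, reference_rate=REFERENCE)

# After: more diagnoses per claim raise the expected rate for the same patients.
after = risk_adjusted_rate(OBSERVED, expected_rate=0.22, reference_rate=REFERENCE)

print(f"observed rate in both periods: {OBSERVED:.3f}")
print(f"risk-adjusted before: {before:.3f}")
print(f"risk-adjusted after:  {after:.3f}")
```

Nothing about the patients or the care differs between the two calls. Only the denominator moved.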
The Veterans Affairs wait-times scandal is the cleanest version of the same dynamic. The VA set a 14-day target for new patient appointments and tied executive bonuses to performance on that measure. Across at least 93 VA facilities, schedulers and managers manipulated the data: they started the clock late, kept secret paper waiting lists, and cancelled and rescheduled appointments to reset the count. One scheduler wrote in an internal email, “Yes, it is gaming the system a bit. But you have to know the rules of the game you are playing, and when we exceed the 14-day measure, the front office gets very upset.”
The Inspector General documented the manipulation in 2014. A follow-up audit in 2022 found the practice continuing under different mechanisms, with a 66-day wait being logged as 43 because of how the start date was defined. A decade on, FOIA records show the gaming persists.
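How a 66-day wait becomes a 43-day wait is just a question of where the clock starts. The sketch below assumes a hypothetical request date and a hypothetical later start date 23 days after it; the audit's actual start-date rule is not reproduced here. Only the 66 and 43 figures come from the account above.

```python
from datetime import date, timedelta


def wait_in_days(clock_start, appointment):
    """Days between whichever date starts the clock and the appointment."""
    return (appointment - clock_start).days


# Hypothetical dates chosen to reproduce the 66-versus-43 gap.
request_date = date(2022, 1, 3)                          # patient asks for care
appointment_date = request_date + timedelta(days=66)     # patient is seen
later_clock_start = request_date + timedelta(days=23)    # assumed alternative start

experienced = wait_in_days(request_date, appointment_date)
logged = wait_in_days(later_clock_start, appointment_date)

print(f"wait the patient experienced: {experienced} days")
print(f"wait the system logged:       {logged} days")
```

The appointment date never changes. The metric improves by 23 days because the definition did.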
Gaming is the system working as designed
Three sectors, different actors, different stakes, and the same pattern emerges in each. This is not a moral failure of evaluators, vendors, hospitals, or aid implementers. It is the predictable result of designing accountability around single quantitative targets attached to financial consequences, while leaving the underlying conditions of the work unexamined. The field has not lacked for the analysis. It has lacked for any structural willingness to use it.
My refusal of the predetermined evaluation matters. It protects my integrity, and it may have spared a specific program from a third-party signature on a verdict already written. It is the right thing to do. It does not change the structure that produced the request.
For every evaluator who declines, there is another evaluator who needs the work, and the field selects, over time, for the people who will deliver what is asked. That is not because evaluators are weak. It is because the field is funded that way. The same is true of the public health vendor. They can decline to game, but the next vendor will not, and the next vendor will get the contract. The system is not waiting on individual ethics to fix itself.
This is the part that gets uncomfortable for those of us who think of ourselves as on the right side of the question. Our refusal is necessary, and our refusal is not enough.
Some interventions actually end the gaming
If individual integrity is not enough, the next question is what is.
The research and the field experience point at a few real candidates, and the two with the strongest track records are worth developing in some detail.
The first is closure paired with quality alternatives. Charter schools are an imperfect and sometimes maligned case, and the case needs to be told honestly. In states with strong authorizer accountability, low-performing charter schools do close, and at higher rates than equivalent traditional public schools. CREDO’s 2017 Lights Off study tracked 1,522 low-performing schools that closed between 2006 and 2013 across 26 states, and it found that the charter sector closed a higher share of its low-performing schools than the traditional sector did. That is a real structural intervention, and one that is rare in the rest of the social sector, where ineffective programs tend to keep running because no one has both the authority and the willingness to end them.
The same CREDO study, however, found that fewer than half of displaced students landed in better schools. Closure without somewhere good to go is not accountability; it is dispersal. The intervention is not closure alone. It is closure paired with quality alternatives, and paired with rules that prevent failing programs from authorizer-shopping their way back into the market.
The second is independent verification that the producer cannot prepare for. Sandefur and Glassman flag one working example at the end of the bad-data paper: the World Bank’s Health Results Innovation Trust Fund, which uses small unannounced household surveys to cross-check the administrative data that participating facilities report, and applies penalties for over-reporting. In Cameroon, this cut over-reporting of outpatient consultations by more than 90 percent in less than a year. The verification is built into the funding mechanism, and the people producing the numbers cannot study to the test, because they do not know which numbers will be checked or when the check will arrive.
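The logic of that design can be sketched in a few lines. What follows is a minimal illustration, not the Trust Fund's actual rules: the function names, the 5 percent tolerance, the penalty rate, and the facility data are all hypothetical. The two ideas it tries to capture are that audit targets are drawn at random, and that a facility caught over-reporting is paid less than it would have been for reporting honestly.

```python
import random


def select_for_audit(facility_ids, sample_size, rng):
    """Draw an unannounced audit sample; facilities cannot predict the draw."""
    return rng.sample(list(facility_ids), sample_size)


def settle_payment(reported, verified, unit_price=1.0,
                   tolerance=0.05, penalty_rate=0.5):
    """Pay on the reported count unless it exceeds the verified count
    by more than the tolerance; then pay on the verified count minus a penalty.
    All parameters are hypothetical.
    """
    if verified <= 0:
        raise ValueError("verified must be positive")
    excess = max(0, reported - verified)
    if excess / verified <= tolerance:
        return reported * unit_price
    return max(0.0, (verified - penalty_rate * excess) * unit_price)


# Hypothetical facilities: (reported consultations, survey-verified consultations).
facilities = {
    "A": (100, 100),   # honest
    "B": (150, 100),   # over-reports by 50 percent
    "C": (103, 100),   # within tolerance
}

rng = random.Random()  # unseeded on purpose: the draw is not knowable in advance
audited = select_for_audit(facilities, sample_size=2, rng=rng)

for facility_id in audited:
    reported, verified = facilities[facility_id]
    print(facility_id, settle_payment(reported, verified))
```

Under these assumed parameters, over-reporting is a losing bet whenever the chance of being audited is high enough, which is the property that makes the number worth reporting straight.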
Neither of these is exotic. Neither requires new theory. They require funders willing to design accountability that cannot be gamed by the people whose careers depend on it.
The harder thing this week was not declining the evaluation. The harder thing is staying clear-eyed about which parts of this system I am still inside even when I decline.
I take contracts, I produce indicators, and I write reports that get read by people making decisions about budgets. I am one of the fish. Refusal is honest. Refusal is partial. Building anything different requires naming the ecosystem the game runs on, not just naming the players in it.
Anthralytic is a strategy and evaluation studio for mission-driven organizations. This newsletter is a practitioner's thinking-out-loud about evaluation, AI governance, and the systems social impact work runs through. If you know someone who makes decisions about resources in the social sector, whether or not they call themselves an evaluator, this newsletter is for them.

