This week I published a case study about rebuilding New York City's Child Care benefits portal, a working prototype I built during my engagement with the city's Office of Technology and Innovation, now handed on to the team carrying it toward citizen testing. Most of that work was the patient, unglamorous craft of making a long application humane. One slice of it deserves its own post, because I think most teams shipping AI into consequential workflows will need it, and I rarely see it built.
The slice is what happens when an automated system reads a family's pay documents and gets them wrong.
The stakes, stated plainly
Child care assistance can decide whether a parent takes a job, keeps a job, or steps back from work. To qualify, families prove their income, which means pay documents, and increasingly it means software reading those documents and extracting what it believes the numbers say. On the day that read is right, automation is a gift, minutes instead of weeks. On the day it is wrong, a machine's misreading sits between a family and the help they qualify for, and the family may not even know a machine was involved.
Both days ship together. That is the part I think is easiest for product teams to skip. The error rate of document extraction is not zero and will not be zero, because paystubs are genuinely hard. At the volume of a city, a small percentage of misreads is a large number of families. You do not get to design only for the day the model is right. The other day is in the same release.
Why pay refuses to be tidy
The prototype treats proving income as its deepest slice, a complete cut through every layer from screens down to logic. The reason is what we found when we looked at how people are actually paid. Hours that change week to week. Amounts that vary. A document older than the window the rules expect. A read that times out. A second job on a different pay cycle. None of these are edge cases in any honest sense; they are the ordinary texture of hourly work, and hourly workers are precisely who child care assistance exists for. A pipeline tuned on tidy salaried paystubs meets the population least likely to have one.
So the prototype models the untidy cases as scenarios, part of the sixteen testable situations it carries overall, each one triggerable on demand by anyone on the team. Every income scenario shows what a family would see for each employer, what the system believed, and the way to put it right.
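One way a scenario registry like this might look, as a minimal Python sketch. Everything named here (EmployerRead, SCENARIOS, trigger, the scenario names and the numbers in them) is hypothetical and chosen for illustration; it is not the prototype's actual code, and it carries three scenarios rather than sixteen.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class EmployerRead:
    """What the system believes it read for one employer (hypothetical shape)."""
    employer: str
    status: str  # "ok", "timeout", "stale" or "unreadable" in this sketch
    weekly_hours: Optional[float] = None
    gross_pay: Optional[float] = None


# Named scenarios; each entry builds its fixture only when triggered.
SCENARIOS: Dict[str, Callable[[], List[EmployerRead]]] = {}


def scenario(name: str):
    """Register a fixture builder under a name anyone on the team can use."""
    def register(builder):
        SCENARIOS[name] = builder
        return builder
    return register


@scenario("variable-hours")
def variable_hours() -> List[EmployerRead]:
    return [EmployerRead("Employer A", "ok", weekly_hours=22.5, gross_pay=360.00)]


@scenario("read-timeout")
def read_timeout() -> List[EmployerRead]:
    return [EmployerRead("Employer A", "timeout")]


@scenario("second-job-different-cycle")
def second_job() -> List[EmployerRead]:
    return [
        EmployerRead("Employer A", "ok", weekly_hours=30.0, gross_pay=480.00),
        EmployerRead("Employer B", "ok", weekly_hours=12.0, gross_pay=204.00),
    ]


def trigger(name: str) -> List[EmployerRead]:
    """Return the reads for a named scenario, failing loudly on a typo."""
    if name not in SCENARIOS:
        raise KeyError(f"unknown scenario {name!r}; known: {sorted(SCENARIOS)}")
    return SCENARIOS[name]()
```

The point of the registry shape is that the hard day is addressable by name: a designer or researcher can call trigger("read-timeout") and land on that screen without waiting for a real timeout.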
Designing the disagreement
The screens themselves follow a few rules I have come to believe are general, and I will state them as my working rules, fully expecting other teams' versions to differ in the details.
The machine's belief is shown, never silently applied. The family sees what was read from their documents, per employer, in plain words. An extraction that disappears into a decision cannot be caught by the only person who knows the ground truth, the person who got paid.
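As a rough illustration of "shown, in plain words", here is a small Python sketch that turns an extracted read into a sentence the family can check. The ReadValue fields and the wording of the copy are my assumptions for this post, not the prototype's real data model or text.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ReadValue:
    """One employer's extracted values (hypothetical fields)."""
    employer: str
    pay_period: str  # e.g. "weekly"
    gross_pay: float
    hours: float


def describe_read(read: ReadValue) -> str:
    """State the machine's belief as a checkable sentence, not a hidden input."""
    return (
        f"From your {read.employer} pay document, we read "
        f"${read.gross_pay:,.2f} for {read.hours:g} hours, paid {read.pay_period}. "
        "Does this look right?"
    )


def describe_all(reads: Iterable[ReadValue]) -> List[str]:
    """One sentence per employer, so a second job is never folded into the first."""
    return [describe_read(r) for r in reads]
```

Ending on a question matters more than the exact phrasing: it frames the read as a claim the family is invited to confirm or reject.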
Correction is a first-class path, not an appeal. Disagreeing with the read sits right there as an ordinary action, with no tone of accusation and no implication the family did something wrong by having complicated pay. The system was the one guessing; the person is the one who knows.
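In data terms, one possible reading of this rule is that a correction is stored alongside the machine's value rather than overwriting it. The Python sketch below assumes that shape; IncomeField, correct and the precedence rule (the family's value is the one shown once given) are illustrative choices of mine, and a real system would presumably add review and verification steps that this leaves out.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class IncomeField:
    """A value the machine read, plus an optional correction from the family."""
    machine_value: float
    family_value: Optional[float] = None

    @property
    def effective(self) -> float:
        # Assumption in this sketch: once the family speaks, their value is shown.
        return self.machine_value if self.family_value is None else self.family_value

    @property
    def source(self) -> str:
        return "machine read" if self.family_value is None else "family correction"


def correct(field: IncomeField, value: float) -> IncomeField:
    """Record a correction without discarding what the machine believed."""
    return replace(field, family_value=value)
```

Keeping both values means the disagreement itself stays visible to whoever reviews the case, instead of vanishing into a single number.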
Every failure has a next step that is not a dead end. A timeout, a stale document, an unreadable image: each lands on a screen that says what happened and what to do now, and the path to a human stays visible throughout. The quiet design principle underneath is that the worst day on the form should be a designed experience, deliberately built and rehearsed, never an exception handler with apologetic copy.
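A hedged sketch of how "no dead ends" could be enforced in code rather than by convention. The three failure kinds come from the paragraph above; the ReadFailure enum, the FailureScreen shape and all of the screen copy are hypothetical, written for this example.

```python
from dataclasses import dataclass
from enum import Enum


class ReadFailure(Enum):
    TIMEOUT = "timeout"
    STALE_DOCUMENT = "stale_document"
    UNREADABLE_IMAGE = "unreadable_image"


@dataclass(frozen=True)
class FailureScreen:
    what_happened: str
    next_step: str
    # The path to a human is a default on every screen, not an optional extra.
    human_help: str = "Talk to a person about your application"


SCREENS = {
    ReadFailure.TIMEOUT: FailureScreen(
        "We could not finish reading your document.",
        "Try again, or enter your pay details yourself.",
    ),
    ReadFailure.STALE_DOCUMENT: FailureScreen(
        "This document is older than we can accept.",
        "Upload a more recent pay document.",
    ),
    ReadFailure.UNREADABLE_IMAGE: FailureScreen(
        "We could not read this image.",
        "Retake the photo in good light, or upload a different file.",
    ),
}


def screen_for(failure: ReadFailure) -> FailureScreen:
    """Look up the designed screen for a failure kind."""
    return SCREENS[failure]


# Fail at import time if someone adds a failure kind without designing its screen.
missing = [f for f in ReadFailure if f not in SCREENS]
assert not missing, f"failure kinds without a designed screen: {missing}"
```

The final check is the part worth copying: a new failure mode cannot ship until someone has written what the family sees and what they can do next.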
And the hard cases are reproducible on purpose. Because any scenario can be triggered at will, the difficult moments can be tested with real families before any production commitment, watched, refined, and tested again. The alternative, which is the industry default, is discovering your failure design in production telemetry, one stressed family at a time.
What I think generalizes
Strip the domain away and the shape applies to most of what is being shipped right now under the banner of agentic workflows. Some model reads something, decides something, fills something in, and a person lives with the result. Every one of those products contains a misread-paystub moment, the moment the machine's confident belief and the person's reality disagree, and I think the quality of the product is mostly decided there, not on the happy path.
The happy path is a demo. The disagreement is the product.
If I were reviewing any AI-assisted workflow today, I would ask to see one thing first. Show me the screen where the user discovers the machine was wrong, and walk me through how you tested it. Teams with a real answer have usually thought hard about the rest too. When there is no answer, there is often a polished demo with a gap behind it.
The Child Care work was a team effort across the engagement, and the prototype now belongs to the people taking it forward, who are well placed to test exactly these moments with the families they serve. What I take with me is the conviction this post argues. Failure handling in AI products is not an engineering detail to polish after launch. It is the part of the design where someone's trust is either kept or spent, and it deserves the same deliberate craft we lavish on the parts that go well.