A math problem can pass my entire validation pipeline, every hard gate and a healthy quality score, with the data correct, the text clear, and the answer key right, and still be broken on the actual screen. A visual meant to show three groups of four objects renders as a single cramped column, half the objects pushed below the fold. A child counts what they can see and gets the problem wrong, by doing exactly the right thing.
That kind of failure is one of the most useful things this project has taught me, because it exposed an assumption I did not know I was making. Everything in my pipeline judged the problem as data. Nothing judged it as a picture. The learner only ever meets the picture.
Checks that read, content that renders
The gates and scores I wrote about in August all operate on structure. They parse JSON, validate fields, check the text against the declared visual contents. They are thorough and they are blind in one specific, dangerous way. A scenario can be perfectly described and badly drawn. Layout overflows on a narrow viewport. A runtime error quietly drops one element of an array. Two labels overlap so a 6 reads as an 8. The structure says one thing, the pixels say another, and nearly every check I had was reading the structure.
The fix had to look at pixels, so the pipeline now does what I do when I review. It looks at the rendered screen.
The blind judge
Scenarios that pass the programmatic checks now go through a headless rendering audit. The backend launches an isolated browser, renders the scenario exactly as the learner's client would, fails the run outright on any runtime error, and captures a high-density screenshot of the result.
Then comes the part I am happiest with. A vision model is handed that screenshot and nothing else. No metadata, no canonical answer, no hint of what the problem is supposed to be. It has to solve the problem from the rendered image alone, the way a learner would. If the picture is broken, cropped, contradictory, or misleading, the blind judge gets the answer wrong, and that disagreement is the whole signal. The judge does not need to recognize a layout bug. It just needs to be misled by one.
The judge's answer is compared against the stored ground truth through a small cascade of normalizations, because honest agreement needs slack in the right places. Decimals and integers are reconciled, fraction characters map to their plain ASCII forms, and clock-reading problems tolerate a narrow time window with twelve-hour wraparound handled, since a clock face showing 2:00 is a correct read whether the judge says afternoon or morning. The cascade exists so the comparison fails on real visual problems and never on formatting trivia.
What it has caught
Since the audit went in, the failures it surfaces have sorted into three families. Layout breakage is the obvious one, overflow, cropping, collisions, the kind of thing the opening describes. Silent runtime errors are the second, scenarios where a renderer threw midway and produced a plausible-looking partial screen, maybe the most dangerous family, because nothing about the result announces a failure. The third family is the most interesting. Problems where the picture is technically correct and pedagogically misleading, objects grouped in a way that suggests the wrong operation, visual emphasis landing on a distractor. The blind judge gets those wrong at a noticeably higher rate, and each one it catches teaches me something about visual clarity I did not have words for.
I want to be precise about what I am claiming. The audit has not made the pipeline perfect. It has moved the failure rate of what reaches the pool, and more importantly it changed the kind of failure that gets through, away from the ones a child would silently absorb.
The limits I can see from here
A vision model can be wrong too. It sometimes solves a slightly broken picture correctly, the way a generous adult does, filling gaps a seven-year-old would not fill. The audit therefore leans strict, anything short of clean agreement routes to me, and I accept the false alarms as the price of the direction of failure. There is also a real cost per scenario, a full browser render and a vision call are not free, and at larger catalog volumes I suspect I will need to sample where I currently check everything. I would have sneered at that sentence in February, when everything got my eyes anyway. Volume keeps teaching me what my principles cost.
The deeper lesson sits beyond this pipeline. Every system I have worked on had some version of this gap, the artifact the team validates and the artifact the person receives, drifting apart in the space everyone assumed someone else was watching. The structured data was clean, and what the user actually saw was nobody's test. The blind judge is one answer for one product. The habit of asking what does the person actually see, and building a check that lives there, travels much further than edtech, and I rarely see it built, including in places I have worked.