In February I wrote that every problem surviving my automated checks was still reviewed by a person, and that someday I would have to decide what the machine could clear on its own. I promised to write that decision down when I made it from data instead of instinct. This is that post.

The short version is that my factory now auto-approves content that scores at or above 0.85 on a composite quality score, and routes everything below that line to a person. The longer version is about where that number came from, because the number itself is the least interesting part.

What the score is made of

A generated problem faces two kinds of judgment before any score exists. The first is a battery of hard gates, pass or fail, no nuance. The answer choices have to match the requested format. A true-or-false question has to contain exactly one claim, because a compound statement is unanswerable for a child even when it is technically fine for an adult. A number-sequence problem has to hold a constant step. Currency, measurements, and names have to match the learner's locale. Problem text must never mention objects the visual does not actually show. A failure at any gate discards the scenario outright. No score, no appeal.
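
To make the pass-or-fail shape concrete, here is a minimal sketch of two of those gates: the constant-step check and the text-versus-visual check. It is an illustration, not the factory's actual code. The `Scenario` fields, the `"number_sequence"` label, and the function names are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    # Hypothetical shape of a generated problem; field names are assumptions.
    kind: str
    text: str
    sequence: list = field(default_factory=list)
    visual_objects: set = field(default_factory=set)
    mentioned_objects: set = field(default_factory=set)


def constant_step(s: Scenario) -> bool:
    """A number-sequence problem has to hold a constant step."""
    if s.kind != "number_sequence":
        return True  # gate does not apply to other problem kinds
    steps = {b - a for a, b in zip(s.sequence, s.sequence[1:])}
    # Exactly one distinct step; fewer than two terms yields no step and fails.
    return len(steps) == 1


def text_matches_visual(s: Scenario) -> bool:
    """Problem text must not mention objects the visual does not show."""
    return s.mentioned_objects <= s.visual_objects


GATES = [constant_step, text_matches_visual]


def passes_gates(s: Scenario) -> bool:
    # Any single failure discards the scenario; there is no partial credit.
    return all(gate(s) for gate in GATES)
```

Each gate returns a plain boolean, so adding a new one means appending a function to the list rather than touching the scoring step.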

What survives the gates gets scored. A judge model evaluates the scenario against a rubric: grade fit, clarity of phrasing, whether the difficulty matches what was requested, and whether the context makes sense in the learner's world. Those dimensions combine into one composite between 0 and 1. The gates ask "can this possibly be served?" The score asks "how good is it?"
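
As a sketch of how rubric dimensions can fold into one number, the snippet below takes a weighted mean of per-dimension scores. The weighting scheme is an assumption: this post does not say how the dimensions are combined, and the dimension keys and the equal default weights are hypothetical.

```python
# Hypothetical keys for the four rubric dimensions named above.
DIMENSIONS = ("grade_fit", "clarity", "difficulty_match", "context_fit")


def composite(scores, weights=None):
    """Weighted mean of per-dimension scores, each expected in [0, 1].

    Equal weights are an assumption; the real combination may differ.
    """
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    for d in DIMENSIONS:
        if not 0.0 <= scores[d] <= 1.0:
            raise ValueError(f"{d} must be between 0 and 1, got {scores[d]}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total_weight
```

Because every input is bounded to 0..1 and the weights are normalized, the result stays in the same range.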

How the line was drawn

For months the line did not exist. Everything that passed the gates came to me, and I approved or rejected each scenario by hand. That queue was slow and it was also the most valuable instrument I have built, because every decision I made was recorded alongside the score the machine had already given.

After enough volume, I compared the two columns. Where the machine scored a scenario high, I almost always agreed, and my rare rejections up there were cosmetic: a phrasing I would have tightened, never something that would mislead a learner. The disagreements that mattered clustered lower down: ambiguity, contexts that were plausible but subtly off for the age group, problems that were correct and somehow still confusing. Around 0.85 the character of my rejections changed from "this is wrong" to "I would have done it differently," and that is a distinction worth automating around. Below the line lives judgment. Above it lives my taste, and taste is not what the review queue is for.

So the line went in at 0.85. Not because the number is special, but because that is where my own recorded behavior said the machine and I stopped disagreeing about anything a child would feel.
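
One way to run that two-column comparison is sketched below, under stated assumptions. Each review record is a `(score, decision)` pair, and the decision labels `"approve"`, `"reject_cosmetic"`, and `"reject_substantive"` are hypothetical names for the distinction described above. The function scans candidate thresholds from low to high and returns the first one above which substantive rejections stop.

```python
def substantive_rate_above(records, threshold):
    """Share of reviewed items scoring >= threshold that were rejected
    for a reason a learner would feel. Returns None if nothing scored that high."""
    above = [decision for score, decision in records if score >= threshold]
    if not above:
        return None
    return sum(d == "reject_substantive" for d in above) / len(above)


def lowest_safe_threshold(records, candidates, max_rate=0.0):
    """Lowest candidate threshold whose substantive-rejection rate is
    at or below max_rate. Cosmetic rejections are deliberately ignored."""
    for t in sorted(candidates):
        rate = substantive_rate_above(records, t)
        if rate is not None and rate <= max_rate:
            return t
    return None
```

Setting `max_rate` to zero is the strictest reading. With small volumes a single stray record can move the answer, so the result is a starting point for a decision, not the decision itself.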

Hard gates discard outright failures before any scoring. Survivors get a composite quality score, and 0.85 is the line where months of my own review decisions stopped adding anything a learner would notice.

What stays human on purpose

The threshold is not a retirement of the review queue. Three kinds of content keep a person in the chain no matter what they score.

The first batches of any new concept get reviewed in full, because the score is only trustworthy where my recorded decisions taught it what good looks like, and a new concept is new territory. Anything touching locale and cultural context gets a person when it sits near the line, because those misses embarrass quietly and the gates only catch the crude cases. And a sample from above the line gets pulled for review on an ongoing basis, because the whole arrangement rests on the score staying honest, and the only way to know is to keep checking it against a human.
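
Put together with the threshold, those three exceptions amount to a small routing function. The sketch below is illustrative only. The 0.85 line comes from this post. The margin that defines "near the line", the number of batches that counts as "first", and the audit sample rate are placeholder values I have not stated anywhere, and the parameter names are hypothetical.

```python
import random

THRESHOLD = 0.85           # the line described in this post
NEAR_LINE_MARGIN = 0.05    # assumption: what "near the line" means
NEW_CONCEPT_BATCHES = 3    # assumption: how many early batches get full review
AUDIT_SAMPLE_RATE = 0.05   # assumption: share of auto-approvals pulled for review


def route(score, concept_batches_seen, touches_locale, rng=random.random):
    """Decide where a gate-surviving scenario goes next."""
    if score < THRESHOLD:
        return "human"
    if concept_batches_seen < NEW_CONCEPT_BATCHES:
        return "human"  # new concept: the score is not yet trusted here
    if touches_locale and score < THRESHOLD + NEAR_LINE_MARGIN:
        return "human"  # locale-sensitive content close to the line
    if rng() < AUDIT_SAMPLE_RATE:
        return "human_audit"  # ongoing sample from above the line
    return "auto_approve"
```

Passing `rng` in as a parameter keeps the sampling testable: a fixed function can stand in for `random.random` when checking the other branches.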

That last one matters most. A dashboard now tracks approval rates per template over time, and the thing I watch for is drift. Prompts age, models change, and a threshold calibrated in summer can rot by winter without a single alarming event. If auto-approval rates climb while my sampled agreement falls, the line moves up until I understand why.
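
The drift rule in that last sentence can be written down as a comparison between two review windows for a template. This is a rough sketch, not the dashboard's logic: the `Window` type, the tolerance, and the step size are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    # Hypothetical per-template summary for one review period.
    auto_approval_rate: float  # share of survivors that cleared the line
    sampled_agreement: float   # share of audited auto-approvals I agreed with


def drift_suspected(previous, current, tolerance=0.02):
    """True when approvals climb while sampled agreement falls."""
    approvals_up = current.auto_approval_rate - previous.auto_approval_rate > tolerance
    agreement_down = previous.sampled_agreement - current.sampled_agreement > tolerance
    return approvals_up and agreement_down


def next_threshold(threshold, previous, current, step=0.02, ceiling=1.0):
    """Raise the line by one step when drift is suspected, else leave it alone."""
    if drift_suspected(previous, current):
        return min(ceiling, round(threshold + step, 4))
    return threshold
```

The function only ever moves the line up. Lowering it again is left as a manual decision, which matches the "until I understand why" part of the rule.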

The part I am less sure about

I want to flag the assumption this whole post stands on, because someone will eventually test it harder than I have. My calibration data is my own judgment. The threshold encodes me: my pedagogical instincts, my cultural blind spots, my tired-evening leniency. For a product run by one person that is at least coherent. The moment other reviewers join, the line stops meaning one thing, and I suspect maintaining a shared standard across reviewers is a harder problem than any of the machinery in this post.

I also do not know yet how the threshold behaves at real catalog scale. The volumes I calibrated on are honest but small. It is possible the score distribution shifts as concepts multiply, and 0.85 turns out to be the right line in the wrong place. If that happens, you will read about it here.

What I can say is that the system fails in the right direction. Every doubt routes to a person, every discard is silent and costless, and the worst outcome of a miscalibrated threshold is me reviewing more than I strictly need to. A learning product should inconvenience its builder before it confuses its learner. That ordering is the actual decision in this post, and the number is just where it landed this season.