Making AI honest enough to teach a child
An adaptive learning engine I am designing and building on my own. The whole problem in one line: put AI in front of a child without ever showing them something false. The operating rule that follows from it is simple: AI generates, humans verify, learners learn.
- 91 concepts modelled
- 18 visual renderers
- 9 question types
- 3K+ automated tests
Why I built this
I have kids in primary school. When I went looking for a math platform that genuinely adapted to them, and to me as a parent, I could not find one. Most options changed how hard a problem was, not whether my child actually understood it. Progress was counted in minutes and streaks, not in concepts mastered. Several leaned on punitive scoring, ads, or upsells. So I started building the thing I wanted for my own children.
- Not truly adaptive: Difficulty moved up and down, but nothing tracked why a child was stuck, or adapted at the level of the specific concept.
- Engagement over learning: Time in the app and streaks were the headline. Real concept mastery was hard to see, for the child or the parent.
- Trust as an afterthought: Ads, manipulative loops, shallow content. Nothing I would hand a young child without watching over their shoulder.
The timing
I started this when ChatGPT had only just appeared and AI was not trusted for writing code at all. So the foundation was architected and hand-built on engineering judgment first. Today I use AI for most of the work, but under the same principles I lead with everywhere: I design the product, set the architecture, write the tests the AI's code has to pass, and verify every step.
Put AI in front of a child without ever showing them something false. If a learner could be misled, the system has failed.
This is a large, ongoing build, not a finished product, and I will be honest about that throughout. What follows is how it works, and the design and engineering decisions that make it trustworthy.
AI generates, humans verify, learners learn
AI is never the interface a child talks to. Every problem a learner sees is generated by a model, put through automated checks, and, in the cases that matter, reviewed by a human before it is ever served. A human is in the chain from the first draft to what reaches the child. This is a deliberate design choice, not a temporary limit, and it is the spine the rest of this page follows.
The foundation
This is built to carry a serious initiative, not a demo. A containerized stack runs the backend, the learner and admin apps, background workers, the database, and full observability, with stateless services that scale horizontally.
The data is deliberately structured. Rather than one sprawling table, the database is split into eight separate areas of concern, so accounts, curriculum, the generation pipeline, approved content, learning activity, guardian reporting, admin operations, and the system itself stay cleanly isolated.
And the engine is config-driven. Concepts, grades, question types, visual contracts, and the validation rules all live in configuration, not code. Adding a new subject is a configuration task, not a rewrite, which is the whole reason it can grow beyond math.
- Containerized and scalable: Stateless services, background workers, metrics and dashboards, ready to scale horizontally.
- Eight-schema database: Clean separation across accounts, curriculum, generation, content, activity, reporting, admin, and system.
- Config-driven engine: Subjects, grades, concepts, contracts, and rules live in config. A new subject is new config, not new code.
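To make the config-driven idea concrete, here is a minimal sketch of how a prompt could be assembled from configuration alone. The `CURRICULUM` layout, the field names, and `build_prompt` are all hypothetical, chosen for illustration; the real engine's config format is not shown on this page.

```python
# Hypothetical config. In a real engine this would live in files, not code.
CURRICULUM = {
    "math": {
        "grades": {
            2: {
                "concepts": {
                    "addition_within_100": {
                        "question_types": ["multiple_choice", "fill_in"],
                        "renderer": "number_line",
                        "max_operand": 100,
                    }
                }
            }
        }
    }
}


def build_prompt(config, subject, grade, concept, question_type):
    """Look up a concept in config and turn it into a generation prompt."""
    try:
        spec = config[subject]["grades"][grade]["concepts"][concept]
    except KeyError as exc:
        raise ValueError(f"not in config: {exc}") from None
    if question_type not in spec["question_types"]:
        raise ValueError(f"{question_type} is not allowed for {concept}")
    return (
        f"Write one grade {grade} {subject} problem on '{concept}' "
        f"as a {question_type} question. Use the '{spec['renderer']}' visual "
        f"and keep every number at or below {spec['max_operand']}."
    )
```

Under this assumed layout, adding a subject means adding a new top-level key next to `"math"`; `build_prompt` itself does not change.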
The technology under it
The model layer is built to be model-agnostic and open-source-first. Everything routes through a single provider layer, and I swap models with a config change, so I can test new capabilities as they ship, month after month, without touching the pipeline. As stronger open models arrive, the engine adopts them without a rewrite.
A few current examples, all swappable:
| Task | What it needs | Current example |
|---|---|---|
| Generation | Speed and high throughput | Open-weight Qwen model |
| Validation and judging | Deep reasoning | Larger Qwen reasoning model |
| Blind visual check | Vision | Gemini Flash |
| Illustration | Image generation | Gemini Flash Image, GPT Image |
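A single provider layer with config-based routing might look like the sketch below. The provider names, the `ROUTING` table, and `run_task` are placeholders rather than the project's actual identifiers, and the providers are stubs so the example runs without any network calls.

```python
from typing import Callable, Dict

# A provider is any callable that takes a prompt and returns text.
Provider = Callable[[str], str]


def make_stub(name: str) -> Provider:
    """Stand-in for a real model client; it just tags the prompt with its name."""
    return lambda prompt: f"[{name}] {prompt}"


# Hypothetical registry of available models.
PROVIDERS: Dict[str, Provider] = {
    "open-model-small": make_stub("open-model-small"),
    "open-model-large": make_stub("open-model-large"),
}

# Hypothetical routing config: which model serves which pipeline task.
ROUTING = {
    "generation": "open-model-small",
    "validation": "open-model-large",
}


def run_task(task: str, prompt: str, routing=ROUTING, providers=PROVIDERS) -> str:
    """Every pipeline step calls this; only the routing table names a model."""
    model_name = routing[task]
    return providers[model_name](prompt)
```

Swapping a model is then a change to `ROUTING`, for example `dict(ROUTING, generation="open-model-large")`, with no change to the code that calls `run_task`.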
Every model output is forced through a typed schema. The model must return a valid, structured object, not loose text, and it is automatically retried when it does not. Each of the 18 renderer types carries a matching typed contract from backend to frontend, so what the model promises and what the screen draws can never quietly drift apart.
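The schema-then-retry idea can be sketched with the standard library alone. The `Problem` fields and the function names below are assumptions for illustration; a production system would more likely use a schema library, and the page does not say which one the engine uses.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Problem:
    """Hypothetical typed contract for one generated problem."""
    question: str
    answer: int
    renderer: str


def parse_problem(raw: str) -> Problem:
    """Turn raw model text into a Problem, or raise ValueError."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != {"question", "answer", "renderer"}:
        raise ValueError("unexpected or missing fields")
    if not isinstance(data["question"], str) or not isinstance(data["renderer"], str):
        raise ValueError("question and renderer must be strings")
    if isinstance(data["answer"], bool) or not isinstance(data["answer"], int):
        raise ValueError("answer must be an integer")
    return Problem(**data)


def generate_typed(call_model: Callable[[str], str], prompt: str,
                   max_attempts: int = 3) -> Problem:
    """Call the model until it returns a valid Problem or the budget runs out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_problem(call_model(prompt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")
```

Loose text never leaves `generate_typed`: the caller either gets a `Problem` with checked field types or an exception.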
The pipeline: automation and human checks, together
Content moves through a sequence where the machine does the heavy lifting and a human holds the gate at the points that matter. This is the path a single problem takes.
Scheduler, a human starts a batch
An admin picks subject, grade, concept, and question type, and the engine builds the prompt from config. Automated from there.
The automated gauntlet
Before a human ever sees a problem, it passes a sequence of gates: an arithmetic check, a concept-alignment judge that reads the visual as well as the text, an answer-leak check, complexity-band limits, then a declarative rule engine. Each problem ends up auto-approved, auto-rejected, or sent to a human. Automated.
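A gate sequence with three possible outcomes can be sketched as follows. Only two of the gates are shown, both deliberately naive, and the `Draft` shape, the gate names, and the outcome strings are assumptions made for this example.

```python
import operator
import re
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    REJECT = "reject"
    HUMAN = "human"


@dataclass(frozen=True)
class Draft:
    """Hypothetical shape of a generated two-operand problem."""
    text: str
    a: int
    b: int
    op: str
    answer: int


OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def arithmetic_gate(d: Draft) -> Verdict:
    """Recompute the answer; unknown operators go to a human."""
    fn = OPS.get(d.op)
    if fn is None:
        return Verdict.HUMAN
    return Verdict.PASS if fn(d.a, d.b) == d.answer else Verdict.REJECT


def answer_leak_gate(d: Draft) -> Verdict:
    """Naive leak check: the answer appears in the text and is not an operand."""
    numbers = [int(n) for n in re.findall(r"\d+", d.text)]
    leaked = d.answer in numbers and d.answer not in (d.a, d.b)
    return Verdict.REJECT if leaked else Verdict.PASS


def run_gauntlet(d: Draft, gates) -> str:
    """Any REJECT ends the run; any HUMAN downgrades approval to review."""
    outcome = "auto_approved"
    for gate in gates:
        verdict = gate(d)
        if verdict is Verdict.REJECT:
            return "auto_rejected"
        if verdict is Verdict.HUMAN:
            outcome = "needs_human"
    return outcome
```

For example, `run_gauntlet(Draft("What is 7 + 5?", 7, 5, "+", 12), [arithmetic_gate, answer_leak_gate])` returns `"auto_approved"`, while the same draft with `answer=13` returns `"auto_rejected"`.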
The keystone, a blind visual check
This is the part I am proudest of. A separate model looks only at the rendered picture, tries to solve the problem from the image alone, and its answer is compared to the real answer it never saw. If the picture does not actually carry the math, this catches it before a child ever sees it. Automated.
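The core of the blind check is small: the solver receives only the image, and the comparison happens outside it. In this sketch `solve_from_image` stands in for a call to a vision model, and the normalisation rule is an assumption made for the example.

```python
from typing import Callable, Optional


def normalise(answer: str) -> str:
    """Trim whitespace, lowercase, and drop trailing full stops before comparing."""
    return answer.strip().lower().rstrip(".")


def blind_visual_check(
    image: bytes,
    solve_from_image: Callable[[bytes], Optional[str]],
    expected: str,
) -> bool:
    """Return True only if the solver recovers the answer from the image alone.

    The expected answer is never passed to the solver; it is used only in
    the final comparison.
    """
    guess = solve_from_image(image)
    if guess is None:  # the solver could not read an answer off the picture
        return False
    return normalise(guess) == normalise(expected)
```

Because `expected` never enters `solve_from_image`, a pass means the picture itself carried enough information to reach the answer.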
When it fails, it retries with feedback
A rejected problem is regenerated with the reason for the failure fed back into the prompt, with a budget and a circuit breaker so it can never loop forever. Automated.
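One way to combine feedback, a per-problem budget, and a circuit breaker is sketched below. The class name, the thresholds, and the choice to trip the breaker after consecutive exhausted budgets are assumptions for this example, not a description of the engine's actual policy.

```python
from typing import Callable, Optional


class CircuitOpen(Exception):
    """Raised when too many problems in a row have exhausted their budget."""


class RegenerationLoop:
    def __init__(self, generate: Callable[[str], str],
                 check: Callable[[str], Optional[str]],
                 max_attempts: int = 3, breaker_threshold: int = 5):
        self.generate = generate          # prompt -> draft
        self.check = check                # draft -> rejection reason, or None if it passed
        self.max_attempts = max_attempts  # per-problem budget
        self.breaker_threshold = breaker_threshold
        self.consecutive_failures = 0

    def run(self, base_prompt: str) -> Optional[str]:
        """Return an accepted draft, or None if the budget is exhausted."""
        if self.consecutive_failures >= self.breaker_threshold:
            raise CircuitOpen("too many consecutive failures; stopping the batch")
        prompt = base_prompt
        for _ in range(self.max_attempts):
            draft = self.generate(prompt)
            reason = self.check(draft)
            if reason is None:
                self.consecutive_failures = 0
                return draft
            # Feed the rejection reason back into the next attempt.
            prompt = f"{base_prompt}\nThe previous attempt was rejected: {reason}"
        self.consecutive_failures += 1
        return None
```

The budget bounds the work spent on any one problem, and the breaker bounds the work spent on a batch that keeps failing, so neither level can loop forever.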
Quality evaluation and rubric checks
Beyond per-problem gates, whole batches are scored against rubrics, both statistical and a model acting as judge, for things like phrasing variety, visual diversity, and answer balance. The output is a pass or fail report a human reads before anything moves forward. Automated, then human.
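The statistical half of a batch rubric can be as simple as the sketch below, which checks answer balance and a rough proxy for phrasing variety. The thresholds, the two-word opener heuristic, and the batch item fields are illustrative assumptions; the model-as-judge half is not shown.

```python
from collections import Counter
from typing import Dict, List


def answer_balance(correct_options: List[str], max_share: float = 0.5) -> bool:
    """Pass if no single option is the correct answer too often."""
    counts = Counter(correct_options)
    top_share = max(counts.values()) / len(correct_options)
    return top_share <= max_share


def phrasing_variety(questions: List[str], min_unique_ratio: float = 0.6) -> bool:
    """Pass if enough questions start with a different two-word opener."""
    openers = {" ".join(q.lower().split()[:2]) for q in questions}
    return len(openers) / len(questions) >= min_unique_ratio


def batch_report(batch: List[Dict[str, str]]) -> Dict[str, bool]:
    """Score a batch against each rubric and add an overall pass or fail."""
    if not batch:
        raise ValueError("cannot score an empty batch")
    report = {
        "answer_balance": answer_balance([p["correct_option"] for p in batch]),
        "phrasing_variety": phrasing_variety([p["question"] for p in batch]),
    }
    report["passed"] = all(report.values())
    return report
```

The resulting dictionary is the kind of pass or fail summary a human could read before promoting a batch.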
The human gate, the Visual Playground
A reviewer sees exactly what a learner would (the real rendered visual, the problem, and the answer flow) and decides whether to approve and promote, send back, or discard. Where a coded visual will not fit, the admin generates an illustration, compares attempts across models, and picks one. Human.
Refine and Optimizer
Borderline or rejected problems open in Refine or the Optimizer for a human to edit and regenerate before approval. Human.
Only vetted content reaches a learner
Approved problems go into a pool. Serving uses adaptive difficulty, spaced repetition, and a prerequisite graph, and makes zero AI calls during a learner's session. The child only ever sees content a human has already cleared.
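Serving without model calls means selection is plain computation over stored state. The sketch below combines a prerequisite graph with a simple doubling-interval review schedule. The graph, the interval rule, and every name are assumptions for illustration, not the engine's actual algorithm, and adaptive difficulty is left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Optional

# Hypothetical prerequisite graph, listed in curriculum order.
PREREQS = {"counting": [], "addition": ["counting"], "multiplication": ["addition"]}


@dataclass
class ConceptState:
    mastered: bool = False
    interval_days: int = 1
    due: date = date.min


def unlocked(concept: str, states: Dict[str, ConceptState]) -> bool:
    """A concept is available once all of its prerequisites are mastered."""
    return all(states[p].mastered for p in PREREQS[concept])


def next_concept(states: Dict[str, ConceptState], today: date) -> Optional[str]:
    """Prefer a due review; otherwise pick the first unlocked, unmastered concept."""
    due = sorted((s.due, c) for c, s in states.items() if s.mastered and s.due <= today)
    if due:
        return due[0][1]
    for concept in PREREQS:
        if not states[concept].mastered and unlocked(concept, states):
            return concept
    return None


def record_review(state: ConceptState, correct: bool, today: date) -> None:
    """Double the review interval after a correct answer, reset it after a miss."""
    state.interval_days = state.interval_days * 2 if correct else 1
    state.due = today + timedelta(days=state.interval_days)
```

Nothing in this path calls a model: the session only reads learner state and picks from problems already in the approved pool.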
The experience built on top
The trust work is invisible to the child. What they get is a calm, encouraging place to learn.
- A student-as-assessor mode, where the child checks a character's work and finds the mistake, practising the highest kind of thinking by teaching.
- Voice in and voice out, so a child who cannot yet type can hear the problem and speak the answer.
- Accessibility and offline support built in, not bolted on.
- A guardian view that reports concept-level mastery, what the child actually understands, not minutes played.
Designed and built, by one person
I designed the trust model and the interface, and I wrote the code. The 18-renderer visual system is a design system in its own right. Failure states are designed, not left to chance. The discipline behind this is the same one I bring to a client's product: design and build together, guardrails and evals, and the prototype as the source of truth.
It is not finished. It is an ambitious, ongoing project that came from a real need, built to a standard where I would trust the AI in front of my own children. That standard is the entire point.