Making AI honest enough to teach a child

An adaptive learning engine I am designing and building on my own. The whole problem in one line: put AI in front of a child without ever showing them something false. The operating rule that follows from it is simple: AI generates, humans verify, learners learn.

91 concepts modelled

18 visual renderers

9 question types

3K+ automated tests

Why I built this

I have kids in primary school. When I went looking for a math platform that genuinely adapted to them, and to me as a parent, I could not find one. Most options changed how hard a problem was, not whether my child actually understood it. Progress was counted in minutes and streaks, not in concepts mastered. Several leaned on punitive scoring, ads, or upsells. So I started building the thing I wanted for my own children.

Not truly adaptive

Difficulty moved up and down, but nothing tracked why a child was stuck, or adapted at the level of the specific concept.

Engagement over learning

Time in the app and streaks were the headline. Real concept mastery was hard to see, for the child or the parent.

Trust as an afterthought

Ads, manipulative loops, shallow content. Nothing I would hand a young child without watching over their shoulder.

The timing

I started this when ChatGPT had only just appeared and AI was not trusted for writing code at all. So the foundation was architected and hand-built on engineering judgment first. Today I use AI for most of the work, but under the same principles I lead with everywhere: I design the product, set the architecture, write the tests the AI's code has to pass, and verify every step.

Put AI in front of a child without ever showing them something false. If a learner could be misled, the system has failed.

This is a large, ongoing build, not a finished product, and I will be honest about that throughout. What follows is how it works, and the design and engineering decisions that make it trustworthy.

The principle

AI generates, humans verify, learners learn

AI is never the interface a child talks to. Every problem a learner sees is generated by a model, put through automated checks, and, in the cases that matter, reviewed by a human before it is ever served. A human is in the chain from the first draft to what reaches the child. This is a deliberate design choice, not a temporary limit, and it is the spine the rest of this page follows.

[SCREENSHOT: a simple generate, validate, human-review, serve diagram]

The build

The foundation

This is built to carry a serious initiative, not a demo. A containerized stack runs the backend, the learner and admin apps, background workers, the database, and full observability, with stateless services that scale horizontally.

The data is deliberately structured. Rather than one sprawling table, the database is split into eight separate areas of concern, so accounts, curriculum, the generation pipeline, approved content, learning activity, guardian reporting, admin operations, and the system itself stay cleanly isolated.

And the engine is config-driven. Concepts, grades, question types, visual contracts, and the validation rules all live in configuration, not code. Adding a new subject is a configuration task, not a rewrite, which is the whole reason it can grow beyond math.

Containerized and scalable

Stateless services, background workers, metrics and dashboards, ready to scale horizontally.

Eight-schema database

Clean separation across accounts, curriculum, generation, content, activity, reporting, admin, and system.

Config-driven engine

Subjects, grades, concepts, contracts, and rules live in config. A new subject is new config, not new code.
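
To make "config, not code" concrete, here is a minimal sketch of how a prompt could be built from configuration alone. The config keys, the `BatchRequest` type, and `build_prompt` are all hypothetical names chosen for illustration, not the engine's actual schema.

```python
from dataclasses import dataclass

# Hypothetical configuration. A real system might load this from YAML or JSON.
CONFIG = {
    "math": {
        "grades": {
            3: {
                "concepts": {
                    "fractions_halves": {
                        "question_types": ["multiple_choice", "fill_in"],
                        "renderer": "fraction_bar",
                        "max_operand": 20,
                    }
                }
            }
        }
    }
}


@dataclass(frozen=True)
class BatchRequest:
    """What an admin picks: subject, grade, concept, and question type."""
    subject: str
    grade: int
    concept: str
    question_type: str


def build_prompt(cfg: dict, req: BatchRequest) -> str:
    """Assemble a generation prompt purely from config entries."""
    concept = cfg[req.subject]["grades"][req.grade]["concepts"][req.concept]
    if req.question_type not in concept["question_types"]:
        raise ValueError(
            f"{req.question_type!r} is not configured for {req.concept!r}"
        )
    return (
        f"Write a grade {req.grade} {req.subject} problem on '{req.concept}' "
        f"as a {req.question_type} question. "
        f"Use the '{concept['renderer']}' visual. "
        f"Keep every number at or below {concept['max_operand']}."
    )
```

Under this assumption, adding a subject means adding a new top-level key to `CONFIG`; `build_prompt` itself does not change.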

The technology under it

The model layer is built to be model-agnostic and open-source-first. Everything routes through a single provider layer, and I swap models with a config change, so I can test new capabilities as they ship, month after month, without touching the pipeline. As stronger open models arrive, the engine adopts them without a rewrite.

A few current examples, all swappable:

Task	What it needs	Current example
Generation	Speed and high throughput	Open-weight Qwen model
Validation and judging	Deep reasoning	Larger Qwen reasoning model
Blind visual check	Vision	Gemini Flash
Illustration	Image generation	Gemini Flash Image, GPT Image
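
A rough sketch of what a single provider layer with config-driven routing could look like. The model names, the stub functions, and the `ProviderLayer` class are placeholders assumed for illustration; real entries would wrap actual model clients.

```python
from typing import Callable, Dict

# A model is anything that turns a prompt into text. These stubs stand in
# for real model clients (hypothetical, for illustration only).
ModelFn = Callable[[str], str]


def fast_stub(prompt: str) -> str:
    return f"[fast] {prompt}"


def reasoning_stub(prompt: str) -> str:
    return f"[reasoning] {prompt}"


REGISTRY: Dict[str, ModelFn] = {
    "fast-open-model": fast_stub,
    "reasoning-open-model": reasoning_stub,
}

# The only thing that changes when a model is swapped.
ROUTING: Dict[str, str] = {
    "generation": "fast-open-model",
    "validation": "reasoning-open-model",
}


class ProviderLayer:
    """Single entry point: the pipeline names a task, never a model."""

    def __init__(self, registry: Dict[str, ModelFn], routing: Dict[str, str]):
        self.registry = registry
        self.routing = routing

    def complete(self, task: str, prompt: str) -> str:
        model_name = self.routing[task]
        return self.registry[model_name](prompt)
```

Because callers only pass a task name, pointing `"generation"` at a different registry entry changes the model without touching any pipeline code.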

Every model output is forced through a typed schema. The model must return a valid, structured object, not loose text, and it is automatically retried when it does not. Each of the 18 renderer types carries a matching typed contract from backend to frontend, so what the model promises and what the screen draws can never quietly drift apart.
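
The typed-schema-plus-retry idea can be sketched with the standard library alone. The `Problem` fields and the function names below are assumptions made for the example; the real contracts are richer, one per renderer.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Problem:
    """Hypothetical typed contract for one generated problem."""
    prompt: str
    answer: int
    renderer: str


def parse_problem(raw: str) -> Problem:
    """Turn raw model text into a Problem, or raise ValueError."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {"prompt": str, "answer": int, "renderer": str}
    if set(data) != set(expected):
        raise ValueError("missing or unexpected fields")
    for key, typ in expected.items():
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, typ):
            raise ValueError(f"field {key!r} has the wrong type")
    return Problem(**data)


def generate_typed(call_model: Callable[[], str], max_attempts: int = 3) -> Problem:
    """Call the model until it returns a valid object, within a fixed budget."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_problem(call_model())
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid object after {max_attempts} attempts") from last_error
```

Loose text never gets past `parse_problem`, so downstream code only ever handles a `Problem` with known fields and types.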

The pipeline: automation and human checks, together

Content moves through a sequence where the machine does the heavy lifting and a human holds the gate at the points that matter. This is the path a single problem takes.

Scheduler, a human starts a batch

An admin picks subject, grade, concept, and question type, and the engine builds the prompt from config. Automated from there.

[SCREENSHOT: the Scheduler, batch creation]

The automated gauntlet

Before a human ever sees a problem, it passes a sequence of gates: an arithmetic check, a concept-alignment judge that reads the visual as well as the text, an answer-leak check, complexity-band limits, then a declarative rule engine. Each problem ends up auto-approved, auto-rejected, or sent to a human. Automated.

[SCREENSHOT: a validation score trace across the gates]
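
One way to picture the three-way outcome is a list of gate functions folded into a single verdict. This is a simplified sketch: the gates below are toy versions for an addition problem, the problem dictionary shape is assumed, and the model-based concept-alignment judge is left out.

```python
import re
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    APPROVE = "auto_approved"
    REJECT = "auto_rejected"
    REVIEW = "needs_human"


Gate = Callable[[dict], Verdict]


def arithmetic_gate(problem: dict) -> Verdict:
    """Recompute the answer independently of the model."""
    ok = problem["a"] + problem["b"] == problem["answer"]
    return Verdict.APPROVE if ok else Verdict.REJECT


def answer_leak_gate(problem: dict) -> Verdict:
    """Reject when the answer already appears as a number in the question."""
    numbers_in_text = re.findall(r"\d+", problem["text"])
    leaked = str(problem["answer"]) in numbers_in_text
    return Verdict.REJECT if leaked else Verdict.APPROVE


def complexity_band_gate(problem: dict, max_operand: int = 20) -> Verdict:
    """Out-of-band operands are borderline, so they go to a human."""
    in_band = max(problem["a"], problem["b"]) <= max_operand
    return Verdict.APPROVE if in_band else Verdict.REVIEW


def run_gauntlet(problem: dict, gates: List[Gate]) -> Verdict:
    """Any rejection is final; any doubt escalates; otherwise approve."""
    needs_review = False
    for gate in gates:
        verdict = gate(problem)
        if verdict is Verdict.REJECT:
            return Verdict.REJECT
        if verdict is Verdict.REVIEW:
            needs_review = True
    return Verdict.REVIEW if needs_review else Verdict.APPROVE


GATES: List[Gate] = [arithmetic_gate, answer_leak_gate, complexity_band_gate]
```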

The keystone, a blind visual check

This is the part I am proudest of. A separate model looks only at the rendered picture, tries to solve the problem from the image alone, and its answer is compared to the real answer it never saw. If the picture does not actually carry the math, this catches it before a child ever sees it. Automated.

[SCREENSHOT: the rendered visual and the blind judge's verdict]
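
In outline, the blind check hands the solver nothing but the image and compares afterwards. In this sketch `solve_from_image` stands in for a call to a vision model, and the normalization rule is an assumption; both are illustrative names, not the engine's API.

```python
from typing import Callable


def normalize(answer: str) -> str:
    """Compare answers loosely: ignore case and surrounding whitespace."""
    return answer.strip().lower()


def blind_visual_check(
    rendered_image: bytes,
    expected_answer: str,
    solve_from_image: Callable[[bytes], str],
) -> bool:
    """Return True only if the picture alone leads to the real answer.

    The solver receives the image bytes and nothing else: no problem text
    and no expected answer. The comparison happens here, outside the solver.
    """
    guess = solve_from_image(rendered_image)
    return normalize(guess) == normalize(expected_answer)
```

The key design point is the function signature: because the expected answer is never an argument to the solver, the judge cannot be influenced by it.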

When it fails, it retries with feedback

A rejected problem is regenerated with the reason for the failure fed back into the prompt, with a budget and a circuit breaker so it can never loop forever. Automated.
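
A compact sketch of retry-with-feedback under a fixed budget, plus a simple consecutive-failure circuit breaker. The function names, the prompt wording, and the thresholds are assumptions for illustration.

```python
from typing import Callable, Optional


def generate_with_feedback(
    generate: Callable[[str], str],
    validate: Callable[[str], Optional[str]],
    base_prompt: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate with the rejection reason appended, up to a fixed budget.

    `validate` returns None on success, or a human-readable failure reason.
    """
    prompt = base_prompt
    for _ in range(max_attempts):
        candidate = generate(prompt)
        reason = validate(candidate)
        if reason is None:
            return candidate
        prompt = f"{base_prompt}\nThe previous attempt was rejected: {reason}"
    return None  # budget spent; the caller decides what happens next


class CircuitBreaker:
    """Stops a batch after too many consecutive failed problems."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.threshold

    def record(self, success: bool) -> None:
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1
```

The budget bounds the work spent on one problem, while the breaker bounds the work spent on a batch that keeps failing for the same underlying reason.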

Quality evaluation and rubric checks

Beyond per-problem gates, whole batches are scored against rubrics, using both statistical checks and a model acting as judge, for things like phrasing variety, visual diversity, and answer balance. The output is a pass-or-fail report that a human reads before anything moves forward. Automated, then human.

[SCREENSHOT: the Quality Dashboard, per-concept pass and fail]

[SCREENSHOT: an evaluation report, the rubric check card by card]
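
As an example of the statistical side, an answer-balance rubric could check that no single correct option dominates a batch of multiple-choice problems. The threshold and the report fields here are assumed values, not the engine's real rubric.

```python
from collections import Counter
from typing import Dict, List


def answer_balance_check(correct_options: List[str], max_share: float = 0.5) -> Dict:
    """Fail the batch if one correct option appears too often."""
    if not correct_options:
        raise ValueError("cannot score an empty batch")
    counts = Counter(correct_options)
    top_option, top_count = counts.most_common(1)[0]
    share = top_count / len(correct_options)
    return {
        "rubric": "answer_balance",
        "passed": share <= max_share,
        "top_option": top_option,
        "share": share,
    }
```

A batch where the correct answer is "A" three times out of four would fail this check at the default threshold, which is the kind of pattern a child could learn to exploit without understanding the concept.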

The human gate, the Visual Playground

A reviewer sees exactly what a learner would: the real rendered visual, the problem, and the answer flow. The reviewer then decides whether to approve and promote, send back, or discard. Where a coded visual will not fit, the admin generates an illustration, compares attempts across models, and picks one. Human.

[SCREENSHOT: the Visual Playground review surface]

Refine and Optimizer

Borderline or rejected problems open in Refine or the Optimizer for a human to edit and regenerate before approval. Human.

[SCREENSHOT: refining a scenario before approval]

Only vetted content reaches a learner

Approved problems go into a pool. Serving uses adaptive difficulty, spaced repetition, and a prerequisite graph, and makes zero AI calls during a learner's session. The child only ever sees content a human has already cleared.
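
Serving can be sketched without any model call at all: a prerequisite graph decides which concepts are open, a review interval decides when a concept comes back, and problems are read from the approved pool. The graph, the interval-doubling rule, and the pool layout below are illustrative assumptions, not the engine's actual scheduling algorithm.

```python
from typing import Dict, List, Optional, Set

# Hypothetical prerequisite graph: concept -> concepts that must be mastered first.
PREREQS: Dict[str, List[str]] = {
    "counting": [],
    "addition": ["counting"],
    "multiplication": ["addition"],
}


def unlocked(prereqs: Dict[str, List[str]], mastered: Set[str]) -> List[str]:
    """Concepts not yet mastered whose prerequisites are all mastered."""
    return sorted(
        concept
        for concept, reqs in prereqs.items()
        if concept not in mastered and all(r in mastered for r in reqs)
    )


def next_interval_days(current_days: int, correct: bool) -> int:
    """Simple spaced repetition: double the gap on success, reset on a miss."""
    return max(1, current_days * 2) if correct else 1


def serve(pool: Dict[str, List[str]], concept: str) -> Optional[str]:
    """Take the next pre-approved problem for a concept. No model is called."""
    problems = pool.get(concept, [])
    return problems[0] if problems else None
```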

What a child and parent get

The experience built on top

The trust work is invisible to the child. What they get is a calm, encouraging place to learn.

A student-as-assessor mode, where the child checks a character's work and finds the mistake, practising the highest kind of thinking by teaching.
Voice in and voice out, so a child who cannot yet type can hear the problem and speak the answer.
Accessibility and offline support built in, not bolted on.
A guardian view that reports concept-level mastery, what the child actually understands, not minutes played.

[SCREENSHOT: the student-as-assessor mode]

[SCREENSHOT: the guardian view, concept-level mastery]

Why this matters for how I work

Designed and built, by one person

I designed the trust model and the interface, and I wrote the code. The 18-renderer visual system is a design system in its own right. Failure states are designed, not left to chance. The same discipline I bring to a client's product (design and build together, guardrails and evals, the prototype as the source of truth) is the discipline behind this.

It is not finished. It is an ambitious, ongoing project that came from a real need, built to a standard where I would trust the AI in front of my own children. That standard is the entire point.