
Labas: https://labas.rogasper.com
Most AI apps trust the LLM output. Ours has a 434-line defensive layer that feels like a linter + spellcheck + babysitter for your AI.
When you ask an LLM to generate exam questions in structured JSON, it will lie to you. Not maliciously, just confidently wrong. It'll return "Option A" as an option text. It'll answer "T" when you asked for "TRUE". It'll generate three options when you requested four. And it'll do all of this with perfect JSON formatting.
We built a repair engine to catch every one of these failures before they reach users.
The repair engine (repair.ts) catches placeholder text, coerces answer formats, deduplicates options, and provides exam-specific fallbacks before Zod validation even runs.
LLM responses can also contain reasoning tags (<think>), HTML error pages from proxies, and JSON truncated mid-stream, each requiring specific defensive handling in the API client.
The best LLMs produce structurally valid but semantically incorrect JSON roughly 15-25% of the time when generating complex nested objects. For exam questions, that means up to 1 in 4 questions might have a wrong answer, a missing option, or placeholder text.
Our experience matched this. When we first built the AI question generator for Labas, an open-source exam practice platform supporting IELTS, TOEFL, JLPT, HSK, and German exams, we assumed the LLM would return perfect output. It didn't.
Here's what an LLM actually returns when asked to generate a multiple-choice question:
Four problems in one response:
Zod validation would catch #3 (empty string fails .min(1)). But #1, #2, and #4 are structurally valid; they just produce bad questions.
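The gap between structural and semantic validity can be illustrated with a constructed response. This is a hypothetical sketch, not captured Labas output: the field names, the numbering of problems other than #3, and both check functions are assumptions, and `structuralErrors` is only a hand-written stand-in for a schema rule like `.min(1)`.

```typescript
// Constructed example; field names are assumptions, not the Labas schema.
interface RawQuestion {
  question: string;
  options: { key: string; text: string }[];
  answer: string;
  explanation: string;
}

const rawQuestion: RawQuestion = {
  question: "What is the main idea of the passage?",
  options: [
    { key: "A", text: "Option A" },          // #1: placeholder text
    { key: "B", text: "Rising sea levels" },
    { key: "C", text: "Urban migration" },   // #4: three options, four requested
  ],
  answer: "b",                               // #2: wrong answer format (lowercase key)
  explanation: "",                           // #3: empty string
};

// Stand-in for a structural schema check: every string must be non-empty.
function structuralErrors(q: RawQuestion): string[] {
  const errors: string[] = [];
  if (q.question.length < 1) errors.push("question");
  if (q.answer.length < 1) errors.push("answer");
  if (q.explanation.length < 1) errors.push("explanation");
  q.options.forEach((o, i) => {
    if (o.text.length < 1) errors.push(`options[${i}].text`);
  });
  return errors;
}

// Semantic checks that a purely structural schema would not perform.
function semanticIssues(q: RawQuestion, expectedOptions: number): string[] {
  const issues: string[] = [];
  if (q.options.some((o) => /^option\s+[a-z0-9]$/i.test(o.text.trim()))) {
    issues.push("placeholder option text");
  }
  if (!q.options.some((o) => o.key === q.answer)) {
    issues.push("answer does not match an option key");
  }
  if (q.options.length !== expectedOptions) {
    issues.push("wrong option count");
  }
  return issues;
}

console.log(structuralErrors(rawQuestion));   // only the empty explanation is flagged
console.log(semanticIssues(rawQuestion, 4));  // the other three problems
```

In this sketch the structural check flags one field, while the three remaining problems only surface through the semantic checks.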

The repair engine lives in packages/ai/src/repair.ts: 434 lines of defensive programming. It runs before Zod validation, maximizing the chance that raw LLM output passes structural checks.
The first check catches lazy AI outputs:
This regex catches "Option A", "Pilihan B", "Choice 1", "opsi C", and even "placeholder". The multilingual pattern (pilihan and opsi are both Indonesian) reflects our user base: Indonesian students practicing for exams in English, Japanese, Chinese, and German.
When a placeholder is detected, the question fails semantic validation and enters the regeneration queue.
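A minimal sketch of such a placeholder check, assuming a single anchored regex. The pattern and the function names here are illustrative guesses, not the actual contents of repair.ts.

```typescript
// Hypothetical pattern: a label word followed by a single letter or digit,
// or the bare word "placeholder". Case-insensitive, anchored to the whole string.
const PLACEHOLDER_PATTERN =
  /^(?:(?:option|choice|pilihan|opsi)\s+[a-z0-9]|placeholder)$/i;

function isPlaceholderText(text: string): boolean {
  return PLACEHOLDER_PATTERN.test(text.trim());
}

// A question is flagged if any of its option texts is a placeholder.
function hasPlaceholderOption(optionTexts: string[]): boolean {
  return optionTexts.some(isPlaceholderText);
}

console.log(isPlaceholderText("Pilihan B"));          // true
console.log(isPlaceholderText("Optional subjects"));  // false
```

Anchoring the pattern keeps legitimate option texts that merely start with "option" or "choice" from being flagged.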
Different exam formats expect different answer conventions. The repair engine normalizes them:
An LLM might return "T", "true", "True", or even "yes" for a True/False question. The coercion function maps all of these to the canonical "TRUE". The same applies to "F" -> "FALSE" and "NG" -> "NOT_GIVEN".
For the author_view format (a different IELTS question type), the valid answers are "YES", "NO", or "NOT_GIVEN", and the coercion handles those too.
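The coercion described above could be sketched as a lookup table per format. The format name `true_false_not_given`, the function name, and the exact alias lists are assumptions for illustration; only the canonical values come from the text.

```typescript
type AnswerFormat = "true_false_not_given" | "author_view";

// Alias tables keyed by normalized input (lowercase, single spaces).
const ANSWER_ALIASES: Record<AnswerFormat, Map<string, string>> = {
  true_false_not_given: new Map([
    ["t", "TRUE"], ["true", "TRUE"], ["yes", "TRUE"],
    ["f", "FALSE"], ["false", "FALSE"], ["no", "FALSE"],
    ["ng", "NOT_GIVEN"], ["not given", "NOT_GIVEN"],
  ]),
  author_view: new Map([
    ["y", "YES"], ["yes", "YES"],
    ["n", "NO"], ["no", "NO"],
    ["ng", "NOT_GIVEN"], ["not given", "NOT_GIVEN"],
  ]),
};

// Returns the canonical answer, or null when the input cannot be coerced.
function coerceAnswer(raw: string, format: AnswerFormat): string | null {
  const key = raw.trim().toLowerCase().replace(/[\s_-]+/g, " ");
  return ANSWER_ALIASES[format].get(key) ?? null;
}

console.log(coerceAnswer("T", "true_false_not_given"));      // "TRUE"
console.log(coerceAnswer("Not Given", "author_view"));       // "NOT_GIVEN"
console.log(coerceAnswer("maybe", "true_false_not_given"));  // null
```

Returning null rather than guessing lets the caller treat an uncoercible answer as a validation failure.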
LLMs sometimes generate duplicate option keys:
If an LLM returns options with keys ["A", "B", "B", "C"], deduplication reduces them to ["A", "B", "C"]. The question then fails the "minimum options" check and enters regeneration.
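A small sketch of key-based deduplication followed by the minimum-options check. The `AnswerOption` shape, the choice to keep the first occurrence of each key, and the function names are assumptions.

```typescript
interface AnswerOption {
  key: string;
  text: string;
}

// Keeps the first option seen for each key (compared case-insensitively).
function dedupeOptionsByKey(options: AnswerOption[]): AnswerOption[] {
  const seen = new Set<string>();
  const unique: AnswerOption[] = [];
  for (const option of options) {
    const normalizedKey = option.key.trim().toUpperCase();
    if (seen.has(normalizedKey)) continue;
    seen.add(normalizedKey);
    unique.push(option);
  }
  return unique;
}

function hasMinimumOptions(options: AnswerOption[], minimum: number): boolean {
  return options.length >= minimum;
}

const deduped = dedupeOptionsByKey([
  { key: "A", text: "Paris" },
  { key: "B", text: "Berlin" },
  { key: "B", text: "Madrid" },
  { key: "C", text: "Rome" },
]);
console.log(deduped.map((o) => o.key));     // ["A", "B", "C"]
console.log(hasMinimumOptions(deduped, 4)); // false, so the question is regenerated
```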
When question text is missing or too short, the repair engine inserts language-specific defaults:
For JLPT (Japanese), it inserts a standard Japanese question. For HSK (Chinese), a Chinese one. For TOPIK (Korean), Korean. This ensures that even if the LLM fails to generate question text, the output is still exam-appropriate.
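One way to sketch the fallback lookup is shown below. The fallback sentences, the minimum length, and the function name are placeholders chosen for illustration; the actual defaults in repair.ts are not shown in this post.

```typescript
type ExamType = "IELTS" | "TOEFL" | "JLPT" | "HSK" | "TOPIK";

// Illustrative defaults only; each means roughly "Choose the correct answer."
const FALLBACK_QUESTION_TEXT: Record<ExamType, string> = {
  IELTS: "Choose the correct answer.",
  TOEFL: "Choose the correct answer.",
  JLPT: "正しい答えを選びなさい。",
  HSK: "请选择正确答案。",
  TOPIK: "알맞은 답을 고르십시오.",
};

// Assumed threshold: anything shorter is treated as missing.
const MIN_QUESTION_LENGTH = 5;

function withFallbackQuestionText(
  text: string | null | undefined,
  exam: ExamType,
): string {
  const trimmed = (text ?? "").trim();
  return trimmed.length >= MIN_QUESTION_LENGTH
    ? trimmed
    : FALLBACK_QUESTION_TEXT[exam];
}

console.log(withFallbackQuestionText("", "JLPT"));
console.log(withFallbackQuestionText("Which statement is true?", "HSK"));
```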
The repair engine is one piece of a larger system: the agentic generation pipeline in packages/ai/src/agentic.ts (585 lines). It's a 4-step workflow with a retry loop:
After Step 4, the repair engine runs. If any questions fail validation, the pipeline enters a regeneration loop:
The key insight: only regenerate the invalid subset, not all questions. And pass the LLM context about why the previous attempts failed: the repair log becomes part of the regeneration prompt.
This is the "self-healing" pattern in action. The pipeline doesn't just detect failures; it fixes them automatically, retrying up to 2 times (in "full" strategy) before surfacing results to the user.
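The regeneration loop can be sketched generically. Everything here is an assumption about shape rather than the code in agentic.ts: `validate` returns a failure reason or null, and `regenerate` stands in for the LLM call that receives the failed questions together with those reasons.

```typescript
interface Failure<Q> {
  index: number;
  question: Q;
  reason: string; // entry from the repair log, passed back to the model
}

function findFailures<Q>(
  questions: Q[],
  validate: (q: Q) => string | null,
): Failure<Q>[] {
  const failures: Failure<Q>[] = [];
  questions.forEach((question, index) => {
    const reason = validate(question);
    if (reason !== null) failures.push({ index, question, reason });
  });
  return failures;
}

async function regenerateInvalid<Q>(
  initial: Q[],
  validate: (q: Q) => string | null,
  regenerate: (failed: Failure<Q>[]) => Promise<Q[]>,
  maxRetries = 2,
): Promise<{ questions: Q[]; unresolvedCount: number }> {
  const questions = [...initial];
  let failed = findFailures(questions, validate);
  for (let attempt = 0; attempt < maxRetries && failed.length > 0; attempt++) {
    // Only the invalid subset is sent back, along with the failure reasons.
    const replacements = await regenerate(failed);
    failed.forEach((failure, i) => {
      const replacement = replacements[i];
      if (replacement !== undefined) questions[failure.index] = replacement;
    });
    failed = findFailures(questions, validate);
  }
  return { questions, unresolvedCount: failed.length };
}
```

Valid questions keep their positions and are never sent back to the model, which keeps retries cheap and avoids regressing output that already passed.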
The repair engine handles LLM output. But the API client (packages/ai/src/clients.ts, 317 lines) handles everything else that can go wrong between your code and the LLM.
Users can bring their own AI provider, including local models running on localhost. But in production, we block requests to metadata endpoints and private IPs:
This prevents a malicious user from setting their base_url to http://169.254.169.254/latest/meta-data/ and reading cloud credentials.
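A simplified sketch of such a base URL check, using only the standard `URL` class. The function names and the exact blocklist are assumptions. The check inspects the literal hostname only, so a DNS name that resolves to a private address would need an additional resolution-time check that is not shown here.

```typescript
// True for loopback, link-local (including 169.254.169.254), and RFC 1918 ranges.
function isPrivateIpv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (m === null) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 || a === 10 || a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

function isBlockedBaseUrl(baseUrl: string, isProduction: boolean): boolean {
  let url: URL;
  try {
    url = new URL(baseUrl);
  } catch {
    return true; // unparseable URLs are rejected outright
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return true;
  if (!isProduction) return false; // local models stay usable in development

  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return true;
  if (host === "metadata.google.internal") return true;
  if (host.includes(":")) return true; // conservatively block all IPv6 literals
  return isPrivateIpv4(host);
}

console.log(isBlockedBaseUrl("http://169.254.169.254/latest/meta-data/", true)); // true
console.log(isBlockedBaseUrl("http://localhost:11434/v1", false));               // false
```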
Some API proxies return HTML error pages instead of JSON. The client detects this:
Without this check, the JSON parser would fail with an opaque error. With it, the user gets a clear message about what went wrong.
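The detection can be as simple as inspecting the content type and the first characters of the body. This is a sketch under that assumption; the function names and the error wording are hypothetical.

```typescript
function looksLikeHtml(body: string, contentType?: string): boolean {
  if (contentType !== undefined && contentType.toLowerCase().includes("text/html")) {
    return true;
  }
  const start = body.trimStart().slice(0, 200).toLowerCase();
  return start.startsWith("<!doctype html") || start.startsWith("<html");
}

// Fails with a readable message instead of an opaque JSON syntax error.
function parseProviderResponse(body: string, contentType?: string): unknown {
  if (looksLikeHtml(body, contentType)) {
    throw new Error(
      "The AI provider returned an HTML page instead of JSON. " +
        "Check the base URL and any proxy in front of it.",
    );
  }
  return JSON.parse(body);
}

console.log(looksLikeHtml("<!DOCTYPE html><html><body>502 Bad Gateway</body></html>")); // true
console.log(looksLikeHtml('{"questions": []}', "application/json"));                    // false
```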
Reasoning models (DeepSeek, GLM) wrap their chain-of-thought in <think> tags. The client strips these before JSON parsing:
Without stripping, the LLM response would be <think>...reasoning...</think>{"questions": [...]}, which fails JSON parsing.
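A minimal sketch of the stripping step, assuming well-formed, closed tag pairs. The function name is hypothetical, and unclosed tags are not handled here.

```typescript
// Removes every <think>...</think> block, including multi-line reasoning.
function stripThinkTags(text: string): string {
  return text.replace(/<think>[\s\S]*?<\/think>/gi, "").trim();
}

const rawModelOutput =
  '<think>The user wants one question.\nLet me draft it.</think>\n{"questions": []}';
const cleanedOutput = stripThinkTags(rawModelOutput);
console.log(cleanedOutput);             // {"questions": []}
console.log(JSON.parse(cleanedOutput)); // parses once the tags are gone
```

The non-greedy `[\s\S]*?` matches across newlines and stops at the first closing tag, so several reasoning blocks in one response are each removed separately.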
If the LLM's response looks like truncated JSON (ends mid-object), the client doubles max_tokens and retries:
This handles the common case where the LLM runs out of tokens mid-JSON.
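The truncation heuristic and the single retry could look roughly like this. The bracket-counting check, the `callModel` callback, and the `tokenCeiling` cap are assumptions made for the sketch, not details taken from clients.ts.

```typescript
// Heuristic: text starts like JSON, fails to parse, and ends inside an
// unclosed string, object, or array.
function looksTruncatedJson(text: string): boolean {
  const t = text.trim();
  if (!t.startsWith("{") && !t.startsWith("[")) return false;
  try {
    JSON.parse(t);
    return false;
  } catch {
    // fall through to the structural scan
  }
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = 0; i < t.length; i++) {
    const ch = t.charAt(i);
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{" || ch === "[") depth++;
    else if (ch === "}" || ch === "]") depth--;
  }
  return inString || depth > 0;
}

// Calls the model once, and once more with a doubled budget if the first
// response looks cut off. `callModel` is a stand-in for the real client call.
async function completeWithTokenRetry(
  callModel: (maxTokens: number) => Promise<string>,
  maxTokens: number,
  tokenCeiling = 16384,
): Promise<string> {
  const first = await callModel(maxTokens);
  if (!looksTruncatedJson(first)) return first;
  const doubled = Math.min(maxTokens * 2, tokenCeiling);
  if (doubled <= maxTokens) return first; // already at the cap
  return callModel(doubled);
}
```

Capping the doubled budget keeps a persistently truncating model from driving token usage up without bound.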
The entire system (repair engine, agentic pipeline, defensive API client) embodies one principle: treat the LLM as a junior developer who writes fast but needs review.
You wouldn't ship code from a junior dev without review. Don't ship LLM output without validation either.
The repair engine is the code review. The agentic pipeline is the pair programming session. The API client is the CI pipeline that catches infrastructure issues.