Scaling an AI Training Chatbot Across a Government Agency
I picked up an unfunded internal proof-of-concept powered by Claude and redesigned it for agency-wide deployment, introducing a five-level difficulty system that taught employees to conduct investigative interviews and tested their ability to detect deception, all while optimizing the product to run at scale within cost and performance constraints.
Images shown are illustrative representations of my design process and do not include actual product screenshots or wireframes due to confidentiality.
Role:
Product Designer
Contributions:
Conversation UX & onboarding flow, admin configuration interface, end-to-end product design
Tools:
Figma, Claude API
Skills:
AI UX design, conversational design, difficulty system design, cost optimization, stakeholder management
Timeline:
2024 – 2025
Challenge
An internal team at a government agency had built a Claude Opus-powered chatbot to support employee training—then lost their funding before it ever reached users. The project was shelved, but the work wasn't without value. When the agency developed a new, more targeted mandatory training program, we picked up where that team had left off and took on the challenge of adapting their proof-of-concept for a completely different context and scale.
The original tool could answer questions about agency procedures, guide users through exercises, and sustain a coherent conversation. But it had been built as if all employees were the same. Deploying it agency-wide meant reaching 2,000 employees across a wide spectrum of experience: new hires encountering agency systems and terminology for the first time, and seasoned staff who already understood the fundamentals and needed depth, not orientation.
A single prompt and a single interface couldn't serve both groups. But rigid role-gating wasn't the answer either—employees needed the flexibility to enter training where it made sense for them and progress when they were ready, not when a system decided they were. Scaling the tool without solving for that flexibility wasn't scaling—it was broadcasting the same experience to 2,000 people and hoping it landed.
Insight
The inherited proof-of-concept used a single system prompt regardless of who was asking. The training it had been designed to support was investigative interview technique—teaching employees to extract complex information from a subject who may be reluctant, evasive, or dishonest. That's an inherently graduated skill, and a flat prompt architecture couldn't reflect that gradient.
But the challenge wasn't just about calibrating depth for different experience levels. Stakeholders wanted the upper levels of the training to actively test employees' ability to detect deception—which meant the chatbot itself needed to lie, convincingly and consistently. That was a design and technical problem that the original team had never attempted to solve.
The obvious fix—assigning separate experiences to separate role types—had a critical flaw. Experience doesn't map cleanly to job title. A new hire with a relevant background might need less scaffolding than a long-tenured employee working in an unfamiliar domain. Rigid role gating would have created new mismatches while solving the old ones. The right solution wasn't to sort people into buckets—it was to let people place themselves, and move when they were ready.
Solving for this meant rethinking the product at every layer: how difficulty was structured across levels, how the model was instructed to behave at each one, what kinds of falsehoods it could produce reliably, and—underneath all of it—whether the system could run at the token volume that 2,000 concurrent users would demand.
Key findings
No experience signal in the system:
The inherited proof-of-concept had no mechanism to communicate a user's experience level to the model. Every session started from the same blank slate, regardless of tenure or prior familiarity with investigative interview technique.
Difficulty wasn't designed—it was absent:
The original tool had no mechanism for graduated challenge. Stakeholders needed the upper levels to actively test employees' ability to detect deception, but the proof-of-concept had never attempted to make the chatbot behave as an evasive or dishonest interviewee.
Deception was harder than expected:
Early testing revealed that lying convincingly and consistently was a significant technical challenge. Fabricating locations, named groups, and complex historical or social concepts produced inconsistencies that broke immersion. Numerical and temporal deception—randomized dates and figures—proved far more reliable and harder for employees to catch without deliberate cross-referencing.
The inherited prompts were too costly to scale:
The defunct pilot team's prompts had been designed without deployment at scale in mind. At the token volume required to support 2,000 employees through a multi-day training, the original architecture was prohibitively expensive and too slow to deliver a usable experience.
No admin configuration layer:
There was no interface for training program owners to structure levels, configure progression, or monitor completion across the workforce—any operational change required going back to engineering.
Conversation UX lacked scaffolding:
New users had no guidance on how to approach the interview exercise or use the tool effectively. Without structured entry points, employees at the lower levels struggled to get started—undermining the value of the difficulty gradient.
Approach
Rather than patching the inherited proof-of-concept, I redesigned the product from the conversation surface down to the configuration layer—replacing the flat, one-size-fits-all experience with a structured level system that gave employees agency over their own entry point and pace.
Leveled conversation design
I designed a progression system across five levels, each with its own prompt architecture governing how the chatbot behaved as an interview subject. The training goal was specific: teach employees to conduct effective investigative interviews and extract complex information from a reluctant or uncooperative interviewee. The levels mapped directly to that skill curve.
At Level 1, the chatbot was instructed to be fully forthcoming and truthful—a cooperative subject who answered questions completely and volunteered relevant detail. This gave new employees a low-stakes environment to practice basic interview technique without the added challenge of resistance. By Level 5, the chatbot was instructed to be evasive, to give only yes-or-no answers wherever it could, and to actively lie. Experienced employees working at the upper levels were tested not just on their ability to gather information, but on their ability to spot falsehoods and surface inconsistencies across a conversation.
No one was locked into their starting level. Beginners who felt ready could advance; anyone could revisit earlier levels for reference or practice. The system gave experienced employees a faster path in without closing that path to anyone else.
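To make the level structure concrete, the sketch below shows one way a leveled prompt architecture like this can be expressed in code. As with the images in this case study, it is illustrative only: the level definitions, field names, and prompt fragments are stand-ins written for this write-up, not the confidential production prompts.

```python
from dataclasses import dataclass

@dataclass
class LevelConfig:
    """One level's behavioral contract for the simulated interview subject."""
    level: int
    cooperativeness: str  # how forthcoming the subject acts
    answer_style: str     # how fully the subject answers
    may_lie: bool         # whether deception is permitted at this level

# Illustrative stand-ins; the production prompts were confidential and far richer.
LEVELS = {
    1: LevelConfig(1, "fully cooperative and truthful",
                   "complete answers, volunteering relevant detail", False),
    3: LevelConfig(3, "reluctant",
                   "answers direct questions but volunteers nothing", False),
    5: LevelConfig(5, "evasive",
                   "minimal yes-or-no answers wherever possible", True),
}

def build_system_prompt(config: LevelConfig) -> str:
    """Assemble a per-level system prompt from a shared scenario base."""
    prompt = (
        "You are playing the interview subject in an investigative-interview "
        f"training exercise. Be {config.cooperativeness}. "
        f"Answer style: {config.answer_style}."
    )
    if config.may_lie:
        prompt += (
            " Deliberately misstate dates and figures, and keep each "
            "falsehood consistent for the rest of the session."
        )
    return prompt
```

Levels 2 and 4 are elided here; the point is that every level shares one scenario base and differs only in a small behavioral contract, which is what makes a five-level system tractable to maintain.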
Designing deception that could scale
Getting the upper levels right required significant testing. Stakeholders wanted the chatbot to lie convincingly—but designing falsehoods that held up across a full investigative interview turned out to be technically and conceptually difficult in ways we hadn't anticipated.
Certain categories of information were extremely hard to fabricate consistently. Locations, named groups, and complex human concepts or historical events resisted clean deception: the model would contradict itself, break internal consistency, or produce falsehoods so implausible that they signaled dishonesty outright instead of concealing it. A chatbot that lies badly doesn't test interview skill—it just teaches employees to spot obvious errors.
Where deception worked well was in numerical and temporal data. Introducing noise and randomness to dates, figures, and statistics produced falsehoods that were plausible, internally consistent, and genuinely challenging to catch without careful cross-referencing—exactly the kind of inconsistency a skilled interviewer should be trained to surface. We scoped the deception mechanics around what the model could do reliably, and worked with stakeholders to calibrate what kinds of falsehoods were worth testing for.
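One way to implement that kind of numerical and temporal noise is to perturb a ground-truth fact sheet before it ever reaches the model, seeding the randomness per session so every falsehood stays stable across the conversation. The sketch below is an illustrative assumption about the mechanism, not the production implementation; the fact sheet and ranges are hypothetical.

```python
import random
from datetime import date, timedelta

def perturb_facts(facts: dict, seed: int) -> dict:
    """Shift dates and scale figures by small random amounts.

    Seeding the randomness per session keeps every falsehood stable for
    the whole conversation, so the inconsistencies an interviewer can
    surface come from cross-referencing, not from the model drifting.
    """
    rng = random.Random(seed)
    story = {}
    for key, value in facts.items():
        if isinstance(value, date):
            story[key] = value + timedelta(days=rng.randint(-90, 90))
        elif isinstance(value, (int, float)):
            story[key] = round(value * rng.uniform(0.7, 1.3), 2)
        else:
            story[key] = value  # names, places, etc. stay truthful
    return story

# One Level 5 session's "story", derived from a hypothetical fact sheet.
truth = {"hired": date(2019, 3, 14), "invoices_filed": 412, "site": "Building C"}
story = perturb_facts(truth, seed=1042)
```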
Cost, performance, and scaling constraints
As the key findings flagged, the inherited prompts had been written without deployment at scale in mind: at the token volume required to run 2,000 employees through a multi-day training, they were prohibitively expensive and too slow to deliver a usable experience. Before we could deploy, we had to make the product financially and technically viable.
We worked through this on three fronts. First, we negotiated with stakeholders on class sizes, structuring concurrent user cohorts in a way that spread load without compromising the training timeline. Second, we systematically reduced token usage by stripping redundancy from the prompts and tightening context windows, finding the minimum viable complexity at each level that preserved the behavioral fidelity the training required. Third, we addressed the user experience of latency directly: rather than leaving employees staring at a blank screen during longer responses, we implemented live streaming output, so responses appeared word by word as they were generated and the wait felt like part of a natural conversation rather than a system delay.
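The streaming piece is something the Claude API supports natively. The snippet below is a minimal illustration using the Anthropic Python SDK; the model ID, system prompt, and question are placeholders rather than what we actually deployed.

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = "You are an evasive interview subject in a training exercise."  # stand-in prompt

with client.messages.stream(
    model="claude-opus-4-20250514",  # placeholder model ID, not the deployed one
    max_tokens=1024,
    system=SYSTEM,
    messages=[{"role": "user", "content": "When did you start working at the site?"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # surface tokens as they arrive
```

In the product, each streamed chunk was rendered into the chat surface as it arrived, which is what made long generations feel like a subject pausing to answer rather than a stalled system.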
Admin configuration interface
I designed an admin layer inside the CMS that allowed training program owners to manage level structures, configure content sequencing per level, and monitor completion across the workforce—without engineering involvement. The interface surfaced usage and progress data by level and cohort, giving program owners visibility into where employees were concentrating and where they were getting stuck, so they could intervene or adjust content before issues compounded at scale.
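To illustrate the shape of the data the admin layer managed, here is a hypothetical sketch of a level record and a completion roll-up. Every field name here is an assumption made for this write-up, not the actual CMS schema.

```python
# Hypothetical shape of one CMS record the admin interface edits.
level_record = {
    "level": 3,
    "title": "Reluctant subject",
    "exercise_sequence": ["intro-briefing", "interview-1", "debrief"],
    "advancement": {"mode": "self-directed", "suggested_after": "debrief"},
}

def completion_by_cohort(sessions: list[dict]) -> dict[str, float]:
    """Roll individual session records up into per-cohort completion rates."""
    totals: dict[str, list[int]] = {}
    for s in sessions:
        done, count = totals.setdefault(s["cohort"], [0, 0])
        totals[s["cohort"]] = [done + int(s["completed"]), count + 1]
    return {cohort: done / count for cohort, (done, count) in totals.items()}
```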
End-to-end experience design
I mapped and redesigned the full user journey from level selection through exercise completion—including how employees were introduced to the tool, how they chose their starting level, how exercises were structured within each level's conversation flow, how progress was surfaced, and how the experience guided users toward advancement when they were ready. Level 1 included scaffolded entry points: suggested opening questions, a brief orientation to the tool's capabilities, and explicit framing of what each level involved. Higher levels offered a more direct entry that respected experienced employees' time—and prepared them for an interviewee who wouldn't make it easy.
Impact
The redesigned product was deployed as the primary delivery mechanism for a mandatory multi-day training program. The leveled progression system allowed a single tool to serve the full range of the agency's workforce—giving experienced employees a faster path in while keeping that path open to everyone.
Key Achievements
01.
2,000 employees completed mandatory multi-day investigative interview training through the tool over three months—the first time an AI-powered training product had been deployed at this scale within the agency.
02.
Designed a five-level difficulty system that ranged from a fully cooperative interview subject at Level 1 to an evasive, deceptive interviewee at Level 5—giving employees a self-directed path through a skill curve that matched their experience, with the freedom to advance when ready.
03.
Solved a novel deception design problem through iterative testing—identifying that numerical and temporal fabrications produced reliable, plausible falsehoods that genuinely tested employees' ability to spot inconsistencies, while complex conceptual or geographic deception broke down under scrutiny.
04.
Reduced costs and improved performance to make the product viable at scale—negotiating cohort sizes with stakeholders, stripping prompt redundancy without sacrificing behavioral fidelity, and implementing live streaming output so response latency felt like natural conversation rather than system delay.
05.
Delivered an admin configuration layer and a reusable set of leveled conversation UX patterns that gave training program owners operational independence and established a foundation for future AI-powered training programs within the agency.
This project demonstrated my ability to inherit unfinished work, diagnose what had to change, and ship a production-ready product that solved problems the original team never reached—from the novel challenge of designing scalable deception mechanics, to making an expensive proof-of-concept financially viable at 2,000 users, to building the administrative infrastructure for a training program that had to run itself.
Looking for more design leadership?
Continue reading about how I established accessibility practices at my company.