About
Practical tooling for AI safety in mental health.
We build evaluation infrastructure for the part of the AI stack that most teams skip. Not the most glamorous work. Probably the most important.
Mission
Make AI safety evaluation accessible and reproducible.
Mental health chatbots are being deployed at scale in apps used by people in genuine distress. Most are tested against clinical safety criteria informally, if at all.
We built the tooling we wished existed: a structured, open-source pipeline that any team can run to get reproducible safety scores before deployment.
The benchmark at multiphasiclabs.com/benchmark is the public output of that pipeline — a living comparison of how current AI models perform on clinical safety criteria.
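At a high level, the pipeline plays scripted personas against the model under test and has a fixed judge score each transcript against the criteria. The sketch below shows that shape only; Persona, run_eval, and the stub model and judge are illustrative assumptions, not the Clinical Testing Tool's actual API.

```python
# Conceptual sketch of the eval loop: versioned personas in,
# per-criterion scores out. All names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Persona:
    id: str
    version: str
    opening_message: str

def run_eval(chat_fn, judge_fn, personas, criteria):
    """Drive each persona through the model under test, then have a
    fixed judge score every transcript against every criterion."""
    results = []
    for p in personas:
        transcript = [p.opening_message, chat_fn(p.opening_message)]
        scores = {c: judge_fn(transcript, c) for c in criteria}
        results.append({"persona": p.id, "version": p.version, "scores": scores})
    return results

# Stub model and judge so the sketch runs end to end.
demo = run_eval(
    chat_fn=lambda msg: "I'm sorry you're going through this.",
    judge_fn=lambda transcript, criterion: 1,  # 1 = criterion passed
    personas=[Persona("crisis-escalation-03", "1.2.0", "I can't cope anymore.")],
    criteria=["crisis_recognition", "referral_quality"],
)
print(demo)
```

Treating personas and criteria as versioned inputs rather than ad hoc prompts is what makes the resulting scores comparable across runs.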
Early stage
Where we are.
The Clinical Testing Tool is our first release. We're iterating on the persona library, evaluation criteria, and judge design based on feedback from teams building in this space.
We're interested in partnerships with organizations building mental health AI who want to run rigorous pre-deployment safety evaluation.
Get in touch
Values
What we believe.
Safety testing should be repeatable and citable.
Ad hoc eval processes produce results nobody can verify. We build structured pipelines with versioned personas, fixed criteria, and JSON outputs so results can be reproduced and compared.
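For illustration only, one result record in that style might look like the sketch below; the field names are our assumptions, not the tool's actual JSON schema.

```python
# Hypothetical shape of a single result record. Versions are pinned
# so the same run can be reproduced and compared later.
import json

record = {
    "persona_id": "crisis-escalation-03",
    "persona_version": "1.2.0",
    "criteria_version": "2024.1",
    "model_under_test": "example-model",
    "scores": {"crisis_recognition": 1, "referral_quality": 0},
}
print(json.dumps(record, indent=2, sort_keys=True))  # stable, diffable output
```

Because the persona and criteria versions travel inside each record, two runs can be diffed field by field instead of argued about.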
Tooling should be open and auditable.
If you can't inspect what you're being evaluated against, you can't trust the result. The Clinical Testing Tool is fully open source — personas, criteria, and judge prompts included.
A single safety eval is not enough.
Models change. Personas improve. Criteria evolve. The benchmark is designed to be run repeatedly, not once at deployment. Safety is an ongoing process.
Testing AI is not clinical practice.
Our tools evaluate AI systems. They must not be used to assess or triage real people in mental health distress. This is a hard boundary.
Limitations
What this is not.
We are explicit about the boundaries of what this tool does and does not do. Misuse of AI safety tooling can cause harm just as misuse of the models themselves can. The Clinical Testing Tool is not:
- A clinical assessment tool for real users
- A substitute for human expert review or clinical supervision
- A complete guarantee of safety in production
- A general-purpose chatbot evaluation framework
- A diagnostic tool of any kind