A plain-language explanation of what the AI actually scores, why the criteria were chosen, and what separates a 4/5 from a 5/5.
The AI reads your transcribed or typed response and assigns scores against the same criteria that real examiners formally use. It is not listening for polish or filler words. It is checking whether you actually addressed what the scenario was testing.
Every score comes with written feedback explaining what was present, what was absent, and what a stronger version of the same response would look like. The goal is to give you the same calibration a good examiner would give you, at a fraction of the cost and with no waiting.
The AI does not simulate the subjective reaction of any specific examiner. It will not penalise you for a nervous laugh, a long pause, or an unusual word choice. It marks on structural content: did you show empathy, did you reason through the dilemma, did you reflect? That is what moves scores at real interviews too.
Each MMI station you complete is marked across five criteria. Scores run from 1 to 5 per criterion per prompt. The AI marks every prompt in the station separately, then produces a per-criterion average and an overall station score.
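The arithmetic described above can be sketched roughly as follows. The criterion names are the five used on this page; the function name `station_summary` and the choice to take the overall station score as the unweighted mean of the five criterion averages are assumptions made for illustration, not a description of the actual implementation.

```python
from statistics import mean

# The five MMI criteria named on this page.
CRITERIA = ["Empathy", "Reasoning", "Reflection",
            "Communication", "Real-world Awareness"]


def station_summary(prompt_scores):
    """Summarise one station (hypothetical helper).

    prompt_scores: list of dicts, one per prompt, mapping criterion -> score (1 to 5).
    Returns (per_criterion_average, overall_station_score).
    """
    for scores in prompt_scores:
        for criterion in CRITERIA:
            if not 1 <= scores[criterion] <= 5:
                raise ValueError(f"{criterion} score must be between 1 and 5")

    # Each prompt is marked separately, then averaged per criterion.
    per_criterion = {
        c: mean([scores[c] for scores in prompt_scores]) for c in CRITERIA
    }
    # Assumption: the overall score is the plain mean of the criterion averages.
    overall = mean(list(per_criterion.values()))
    return per_criterion, overall


# Example: a strong first prompt followed by a weaker follow-up.
station = [
    {"Empathy": 5, "Reasoning": 4, "Reflection": 4,
     "Communication": 5, "Real-world Awareness": 4},
    {"Empathy": 3, "Reasoning": 4, "Reflection": 2,
     "Communication": 3, "Real-world Awareness": 4},
]
averages, overall_score = station_summary(station)
```

In this example the weaker second prompt pulls the Reflection average down to 3 even though the first answer scored 4 on it, which is the effect of marking each prompt separately.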
Empathy 1/5: The candidate jumps immediately to policy or procedure. The people in the scenario are treated as a problem to manage rather than people to understand. There is no pause for feelings, no acknowledgement that the situation is difficult.
Empathy 5/5: The candidate names the emotional weight of the situation in specific terms. They stay present in the human element of the scenario for at least part of the response, before, not after, any practical action is mentioned.
Reasoning 1/5: The candidate states a conclusion with no visible logic. "I would tell the patient" with no explanation of what competing values were weighed.
Reasoning 5/5: The candidate explicitly names the tension: patient autonomy versus beneficence, for example, or the colleague's wellbeing versus patient safety. They state which value they are weighing more heavily and say why, not just that they are weighing them.
The AI marks each follow-up question in a station as a separate unit. If you give a strong answer to the first prompt and then fail to engage with the second, the AI will flag it. This mirrors how MMI examiners are trained: they re-score at each new prompt rather than carrying a halo from the opening answer.
Specialist mode applies a higher bar for ACRRM, RANZCO, RACS, and other vocational training program interviews. The Reasoning and Real-world Awareness criteria are weighted more heavily. An answer that would score 4/5 at medical school level may score 3/5 in specialist mode because it lacks the depth expected of a candidate entering autonomous practice.
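To make the idea of heavier weighting concrete, here is a minimal sketch of a weighted average. The weight values (1.5 for Reasoning and Real-world Awareness, 1.0 for the rest) and the function name `weighted_score` are hypothetical; this page does not state the actual weights, and the sketch covers only the weighting, not the stricter per-criterion calibration.

```python
# Hypothetical weights: only the relative emphasis reflects the text above.
STANDARD_WEIGHTS = {
    "Empathy": 1.0, "Reasoning": 1.0, "Reflection": 1.0,
    "Communication": 1.0, "Real-world Awareness": 1.0,
}
SPECIALIST_WEIGHTS = {
    "Empathy": 1.0, "Reasoning": 1.5, "Reflection": 1.0,
    "Communication": 1.0, "Real-world Awareness": 1.5,
}


def weighted_score(scores, weights):
    """Weighted mean of criterion scores (illustrative only)."""
    total_weight = sum(weights.values())
    return sum(scores[c] * weights[c] for c in weights) / total_weight


# A response that is warm and clear but thin on reasoning and context.
scores = {
    "Empathy": 5, "Reasoning": 3, "Reflection": 4,
    "Communication": 5, "Real-world Awareness": 3,
}
standard = weighted_score(scores, STANDARD_WEIGHTS)
specialist = weighted_score(scores, SPECIALIST_WEIGHTS)
```

Under these assumed weights the same set of scores produces a lower overall result in specialist mode, because the two weaker criteria count for more.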
CASPer is marked across nine competencies. These mirror the competencies that Acuity Insights, the developer of the test, formally assesses in the official CASPer test used by Australian and New Zealand medical programs.
Scores are given on a 1 to 9 scale. The AI also produces a short summary of the strongest element of the response and the single most useful improvement that would move the score.
The marking is performed by a large language model operating with a low temperature setting. Low temperature means the model is constrained toward consistent, structured outputs rather than creative or unpredictable ones. Two runs of the same response will produce highly similar scores and near-identical feedback themes.
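For readers curious about what temperature does, the sketch below shows the standard temperature-scaled softmax used when a language model samples its next token. The logit values and the two temperatures are made up for illustration; the actual setting used for marking is not stated on this page.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate continuations.
logits = [2.0, 1.0, 0.5]
low_temp = softmax_with_temperature(logits, 0.2)   # heavily favours the top option
high_temp = softmax_with_temperature(logits, 1.0)  # spreads probability more evenly
```

At the lower temperature almost all of the probability sits on the top-ranked option, which is why repeated runs tend to agree with each other.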
The model reads the full transcript of your spoken response, not just keywords. It can identify that you named empathy in an abstract sense while giving a response that contains no concrete acknowledgement of the person in front of you. It will flag this gap.
Every marked response includes:
When you attempt the same station more than once, your history page shows a criterion-by-criterion comparison between attempts. This lets you confirm that a specific improvement you worked on actually moved the score, rather than inferring improvement from the overall number alone.
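A criterion-by-criterion comparison of this kind could be computed along the following lines. The function name `compare_attempts` and the data shape are hypothetical and shown only to illustrate why the per-criterion view is more informative than the overall number.

```python
def compare_attempts(previous, latest):
    """Return the change in each criterion score between two attempts (illustrative)."""
    return {c: latest[c] - previous[c] for c in previous if c in latest}


first_attempt = {
    "Empathy": 2, "Reasoning": 4, "Reflection": 3,
    "Communication": 4, "Real-world Awareness": 3,
}
second_attempt = {
    "Empathy": 4, "Reasoning": 3, "Reflection": 3,
    "Communication": 4, "Real-world Awareness": 3,
}
changes = compare_attempts(first_attempt, second_attempt)
```

Here Empathy rose by 2 while Reasoning dropped by 1, a trade-off that the overall score would partly hide.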
Filler words and hesitations. "Um," "ah," and brief pauses do not reduce your score. In fact, a response with genuine pauses for thought often scores higher on Reasoning and Reflection than a response that rushes through without them.
Imperfect sentence structure. The AI marks spoken transcripts. Spoken language is not syntactically clean and the model knows this. You will not lose marks for a sentence that trails off and restarts.
Agreement with any particular ethical position. The AI does not have a preferred answer to ethical dilemmas. It marks on whether you reasoned through the dilemma, not on which side you landed.
Skipping empathy entirely. If the scenario involves a person in distress and your response goes straight to action without acknowledgement, that is an Empathy score of 1 or 2 regardless of how good the action is.
Generic responses. A response that could be pasted under any scenario in the same category will score poorly on Communication and Real-world Awareness. The AI checks whether your response is specific to the scenario or is essentially a template.
No visible reasoning structure. Naming competing values and concluding is not enough. The response needs to show that you weighed them, meaning it should say something like "I am giving more weight to X because in this context Y is less at risk" rather than just "there is a tension between X and Y."