Marking methodology

How the AI marks your MMI and CASPer responses

A plain-language explanation of what the AI actually scores, why the criteria were chosen, and what separates a 4/5 from a 5/5.

By Dan Brittain | Updated June 2026

The short version

The AI reads your transcribed or typed response and assigns scores against the same criteria that real examiners formally use. It is not listening for polish or filler words. It is checking whether you actually addressed what the scenario was testing.

Every score comes with written feedback explaining what was present, what was absent, and what a stronger version of the same response would look like. The goal is to give you the same calibration a good examiner would give you, at a fraction of the cost and with no waiting.

What the AI is not

The AI does not simulate the subjective reaction of any specific examiner. It will not penalise you for a nervous laugh, a long pause, or an unusual word choice. It marks on structural content: did you show empathy, did you reason through the dilemma, did you reflect? That is what moves scores at real interviews too.

MMI marking

How MMI responses are marked

Each MMI station you complete is marked across five criteria. Scores run from 1 to 5 per criterion per prompt. The AI marks every prompt in the station separately, then produces a per-criterion average and an overall station score.
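The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual implementation: the function name `aggregate_station` is hypothetical, and the overall station score is assumed here to be an unweighted mean of the per-criterion averages, which the page does not specify.

```python
from statistics import mean

CRITERIA = ["Empathy", "Communication", "Reasoning", "Reflection", "Real-world Awareness"]

def aggregate_station(prompt_scores):
    """Average per-prompt scores into per-criterion averages and one station score.

    prompt_scores: list of dicts, one per prompt, mapping each criterion to a 1-5 score.
    ASSUMPTION: the overall score is the plain mean of the criterion averages.
    """
    for scores in prompt_scores:
        for criterion in CRITERIA:
            if not 1 <= scores[criterion] <= 5:
                raise ValueError(f"{criterion} score must be between 1 and 5")
    per_criterion = {
        criterion: mean([scores[criterion] for scores in prompt_scores])
        for criterion in CRITERIA
    }
    overall = mean(list(per_criterion.values()))
    return per_criterion, overall

# Two prompts in one station: a strong opening answer, then a weaker follow-up.
station = [
    {"Empathy": 4, "Communication": 4, "Reasoning": 5, "Reflection": 3, "Real-world Awareness": 4},
    {"Empathy": 2, "Communication": 4, "Reasoning": 3, "Reflection": 3, "Real-world Awareness": 4},
]
per_criterion, overall = aggregate_station(station)
print(per_criterion)
print(overall)
```

Because each prompt is scored separately before averaging, the weaker second answer pulls the Empathy and Reasoning averages down rather than being hidden behind the first.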

Empathy
Does the response acknowledge the emotional reality of the people in the scenario before reaching for a solution?
Communication
Is the response structured? Does it speak to the person, not past them? Does it avoid sounding like a rehearsed script?
Reasoning
Does the response name and weigh competing values? Is there a clear decision logic, or does it just list considerations?
Reflection
Does the response acknowledge what the candidate does not yet know, or what they would do differently? Does it avoid false certainty?
Real-world Awareness
Does the response show an understanding of how healthcare actually works, not how a textbook says it should work?

What a 1/5 looks like vs a 5/5

Empathy 1/5: The candidate jumps immediately to policy or procedure. The people in the scenario are treated as a problem to manage rather than people to understand. There is no pause for feelings, no acknowledgement that the situation is difficult.

Empathy 5/5: The candidate names the emotional weight of the situation in specific terms. They stay with the human element of the scenario for at least part of the response, and they do so before any practical action is mentioned, not after.

Reasoning 1/5: The candidate states a conclusion with no visible logic. For example, "I would tell the patient," with no explanation of which competing values were weighed.

Reasoning 5/5: The candidate explicitly names the tension: patient autonomy versus beneficence, for example, or the colleague's wellbeing versus patient safety. They state which value they are weighing more heavily and say why, not just that they are weighing them.

Per-prompt scoring

The AI marks each follow-up question in a station as a separate unit. If you gave a strong answer to the first prompt and then failed to engage with the second, the AI will flag it. This mirrors how MMI examiners are trained: they re-score at each new prompt rather than carrying a halo from the opening answer.

Specialist mode

Specialist mode applies a higher calibration bar for ACRRM, RANZCO, RACS, and other vocational training program interviews. The Reasoning and Real-world Awareness criteria are weighted more heavily. An answer that would score 4/5 at medical school level may score 3/5 in specialist mode because it lacks the depth expected of a candidate entering autonomous practice.
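One way to picture heavier weighting is a weighted mean over the criterion scores. The sketch below is illustrative only: the weight values (1.5 for Reasoning and Real-world Awareness) and the names `STANDARD_WEIGHTS`, `SPECIALIST_WEIGHTS`, and `weighted_overall` are assumptions made for this example, not figures published for the product. It also models only the weighting, not the stricter per-criterion bar that specialist mode applies.

```python
CRITERIA = ["Empathy", "Communication", "Reasoning", "Reflection", "Real-world Awareness"]

# ASSUMPTION: equal weights in standard mode.
STANDARD_WEIGHTS = {criterion: 1.0 for criterion in CRITERIA}

# ASSUMPTION: hypothetical weights; the real values are not stated on this page.
SPECIALIST_WEIGHTS = {**STANDARD_WEIGHTS, "Reasoning": 1.5, "Real-world Awareness": 1.5}

def weighted_overall(criterion_scores, weights):
    """Return the weighted mean of criterion scores under the given weights."""
    total_weight = sum(weights[c] for c in criterion_scores)
    weighted_sum = sum(criterion_scores[c] * weights[c] for c in criterion_scores)
    return weighted_sum / total_weight

# A response that is warm and clear but thin on reasoning and practical depth.
scores = {"Empathy": 5, "Communication": 5, "Reasoning": 3, "Reflection": 4, "Real-world Awareness": 3}

standard = weighted_overall(scores, STANDARD_WEIGHTS)
specialist = weighted_overall(scores, SPECIALIST_WEIGHTS)
print(round(standard, 2), round(specialist, 2))
```

With these illustrative weights, the same set of criterion scores produces a lower overall number in specialist mode, because the two weakest criteria count for more.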

CASPer marking

How CASPer responses are marked

CASPer is marked across nine competencies. These mirror the competencies Acuity Insights formally assesses in the official CASPer test used by Australian and New Zealand medical programs.

Collaboration
Working constructively with others; sharing credit; avoiding conflict escalation.
Communication
Clarity, structure, and appropriate tone for the audience described in the scenario.
Empathy
Genuine understanding of the perspective of affected parties, not just token acknowledgement.
Ethics
Application of ethical principles with appropriate nuance; avoidance of black-and-white thinking.
Fairness
Equitable treatment of all parties; awareness of bias or conflicting interests.
Motivation
Demonstration of genuine engagement with the scenario rather than performance of expected answers.
Problem Solving
Practical, realistic steps rather than generic or impossibly idealistic resolutions.
Professionalism
Appropriate handling of hierarchy, institutional constraints, and boundaries.
Service Orientation
Consistent orientation toward the needs of patients and communities over personal benefit.

Scores are given on a 1 to 9 scale. The AI also produces a short summary of the strongest element of the response and the single most useful improvement that would move the score.

How the model works

The model and its constraints

The marking is performed by a large language model operating with a low temperature setting. Low temperature means the model is constrained toward consistent, structured outputs rather than creative or unpredictable ones. Two runs of the same response will produce highly similar scores and near-identical feedback themes.
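For readers curious about what temperature does, the sketch below shows the general mechanism used in language model sampling: candidate scores (logits) are divided by the temperature before being converted to probabilities. The logit values and the two temperature settings are made up for illustration; the page does not state which model or temperature the product uses.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpening them as temperature falls."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# ASSUMPTION: three hypothetical candidate outputs with made-up scores.
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)   # low temperature
high = softmax_with_temperature(logits, 1.0)  # default-style temperature

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At the lower temperature nearly all of the probability sits on the top candidate, which is why repeated runs on the same response tend to land on the same scores and feedback themes.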

The model reads the full transcript of your spoken response, not just keywords. It can identify that you named empathy in an abstract sense while giving a response that contains no concrete acknowledgement of the person in front of you. It will flag this gap.

The AI is calibrated to the same structural patterns that real examiners are formally trained to reward. Whether it captures the exact human reaction of a specific examiner is a harder question. Whether it catches the structural gaps that will cost you marks at a real interview: yes, reliably.

What the feedback tells you

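Every marked response includes a score for each criterion and written feedback explaining what was present, what was absent, and what a stronger version of the same response would look like.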

The delta view

When you attempt the same station more than once, your history page shows a criterion-by-criterion comparison between attempts. This lets you confirm that a specific improvement you worked on actually moved the score, rather than inferring improvement from the overall number alone.
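A criterion-by-criterion comparison of this kind can be sketched as a simple difference between two attempts. The function name `criterion_deltas` and the example scores are hypothetical and are not taken from the product.

```python
def criterion_deltas(previous, current):
    """Return the change in each criterion score from one attempt to the next."""
    return {c: current[c] - previous[c] for c in previous if c in current}

# Hypothetical scores for two attempts at the same station.
first_attempt = {"Empathy": 2, "Communication": 4, "Reasoning": 4, "Reflection": 3, "Real-world Awareness": 3}
second_attempt = {"Empathy": 4, "Communication": 4, "Reasoning": 3, "Reflection": 3, "Real-world Awareness": 3}

deltas = criterion_deltas(first_attempt, second_attempt)
for criterion, change in deltas.items():
    print(f"{criterion}: {change:+d}")
```

In this example the total rises by only one point, but the breakdown shows a two-point gain in Empathy partly offset by a one-point drop in Reasoning, which is detail the overall number alone would hide.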

Common misunderstandings

What the AI does not penalise

Filler words and hesitations. "Um," "ah," and brief pauses do not reduce your score. In fact, a response with genuine pauses for thought often scores higher on Reasoning and Reflection than a response that rushes through without them.

Imperfect sentence structure. The AI marks spoken transcripts. Spoken language is not syntactically clean and the model knows this. You will not lose marks for a sentence that trails off and restarts.

Agreement with any particular ethical position. The AI does not have a preferred answer to ethical dilemmas. It marks on whether you reasoned through the dilemma, not on which side you landed.

What the AI does penalise

Skipping empathy entirely. If the scenario involves a person in distress and your response goes straight to action without acknowledgement, that is an Empathy score of 1 or 2 regardless of how good the action is.

Generic responses. A response that could be pasted under any scenario in the same category will score poorly on Communication and Real-world Awareness. The AI checks whether your response is specific to the scenario or is essentially a template.

No visible reasoning structure. Naming competing values and then stating a conclusion is not enough. The response needs to show that you weighed them, meaning it should say something like "I am giving more weight to X because in this context Y is less at risk" rather than just "there is a tension between X and Y."

See it mark your answer now

Practise a station, read the AI feedback, and use the delta view to track what actually changes between attempts.

Try an MMI station | CASPer practice