Methods

Why the lab asks for judgment before explanation.

Data science decisions often fail between the method and the meeting: metric definitions shift, comparison groups wobble, model scores hide workflow risk, and pressure turns weak evidence into a confident story. The lab is designed to turn those judgment moments into something learners can practice.

How a Case Works

Each case starts as an incomplete evidence scene rather than a lesson summary. Learners inspect artifacts, form a working theory, decide what they would do, and state confidence before the expert replay appears. That order matters because the first answer exposes the learner's current judgment habits.

The interaction is deliberately more than reading comprehension. Learners open audio notes, inspect charts, sort artifacts by evidentiary role, cite the evidence they would rely on, reconstruct what changed, and write the caveat they would use in a real meeting. The replay then compares that reconstruction with the expert record.

Why Productive Struggle Fits Data Science

This is productive struggle, not hidden-answer guessing. The cases are written so that plausible but insufficient stories compete with quieter evidence. Failure is useful when the replay helps learners identify why a tempting interpretation felt stronger than it was.

The design also uses desirable difficulty: learners must retrieve prior ideas, discriminate among similar risks, and delay explanation until after they have committed to a position. The aim is transfer to real analytical work, where nobody announces the topic label before the decision has to be made.

What Data Science Judgment Means Here

Data science judgment is the ability to connect method, evidence, and context without overclaiming. It includes knowing when a dashboard is measuring a changed construct, when a causal claim outruns its design, when a model score ignores deployment risk, and when a technically true statement would mislead a decision-maker.

The lab treats judgment as a learnable practice. It does not ask learners to memorize trap names. It asks them to notice what the evidence can bear, where uncertainty remains, and how confidence should change when pressure, incentives, or missing context enter the room.

Certainty-Based Marking

Constrained decisions use certainty-based marking. Correct answers earn more when confidence is high. Incorrect answers lose more when confidence is high. Low confidence is not punished for its own sake; it can be the responsible choice when the evidence is genuinely thin. The goal is calibration: knowing when the evidence justifies confidence and when it does not.

Outcome	Low	Medium	High
Correct	+1	+2	+3
Partial	0	+1	+1
Incorrect	0	-1	-3
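As a sketch of how the table above could be applied, the Python below encodes its point values as a lookup and then, under a simplifying assumption that each answer is either fully correct or incorrect (partial credit ignored), computes which confidence report maximizes expected points for a given probability of being right. The names `Outcome`, `Confidence`, `cbm_score`, `expected_score`, and `best_confidence` are hypothetical illustrations, not part of the lab.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    PARTIAL = "partial"
    INCORRECT = "incorrect"


class Confidence(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Point values copied from the scoring table (rows: outcome, columns: confidence).
CBM_POINTS = {
    Outcome.CORRECT: {Confidence.LOW: 1, Confidence.MEDIUM: 2, Confidence.HIGH: 3},
    Outcome.PARTIAL: {Confidence.LOW: 0, Confidence.MEDIUM: 1, Confidence.HIGH: 1},
    Outcome.INCORRECT: {Confidence.LOW: 0, Confidence.MEDIUM: -1, Confidence.HIGH: -3},
}


def cbm_score(outcome: Outcome, confidence: Confidence) -> int:
    """Look up the points for one constrained decision."""
    return CBM_POINTS[outcome][confidence]


def expected_score(p_correct: float, confidence: Confidence) -> float:
    """Expected points when the answer is right with probability p_correct.

    Simplifying assumption (hypothetical): partial credit is ignored, so
    every answer is either correct or incorrect.
    """
    return (p_correct * CBM_POINTS[Outcome.CORRECT][confidence]
            + (1 - p_correct) * CBM_POINTS[Outcome.INCORRECT][confidence])


def best_confidence(p_correct: float) -> Confidence:
    """Confidence level with the highest expected score for a given belief.

    Ties go to the lower level, because max() keeps the first maximum in
    the enum's declaration order (LOW, MEDIUM, HIGH).
    """
    return max(Confidence, key=lambda c: expected_score(p_correct, c))


if __name__ == "__main__":
    print(cbm_score(Outcome.INCORRECT, Confidence.HIGH))  # -3
    for p in (0.4, 0.6, 0.9):
        print(p, best_confidence(p).value)  # 0.4 low, 0.6 medium, 0.9 high
```

Working the same arithmetic by hand, the expected scores are p for low, 3p - 1 for medium, and 6p - 3 for high. High confidence therefore pays off only when the learner believes they are right more than two-thirds of the time; between one-half and two-thirds, medium is the better report; below one-half, low confidence maximizes expected points. That is the calibration incentive in numeric form.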

Research Base

The lab is not a formal psychometric instrument. It borrows from learning-science ideas that fit applied data science judgment: productive failure before consolidation, desirable difficulty, retrieval practice, and confidence calibration.

Productive failure: Manu Kapur’s work on learning from complex, ill-structured problem solving before consolidation (Kapur, 2008).
Desirable difficulties and retrieval as learning events (Bjork, 2013).
Testing as a pedagogical tool rather than only assessment (Bjork & Kroll, 2015).
Certainty-based marking as implemented in Moodle: learners report a low, medium, or high certainty level after each answer (MoodleDocs, CBM).
Statistical humility in data claims: the American Statistical Association’s guidance on interpreting p-values and evidence (ASA p-value statement).
Data acumen as a core undergraduate data science outcome (National Academies, 2018).

Scoring Limits

Static scoring can evaluate constrained decisions and confidence, but it should not pretend to fully grade open-ended professional judgment. The written rationale and reconstruction fields are part of the learning record and replay, not an automated essay score.