Instructor mode

Facilitate without flattening the investigation.

This view exposes concept targets, answer maps, key evidence, and facilitation focus for each file. Keep it out of the learner path until after decisions have been submitted.

Case 01 · intro · 8 min

The Dashboard Spike

Product analytics

metric interpretation · data quality · uncertainty calibration

Hidden concept targets

instrumentation · denominator shift · event logging

Key evidence

ev-004: Release checklist excerpt
ev-005: Activity split printout
ev-006: Raw event sample
ev-007: Registry change log
ev-008: Engineering standup clip
ev-011: Legacy metric recompute

Answer map

correct: Metric instrumentation changed inside the reporting window
partial: The redesign likely improved early onboarding, but the WAU headline overstates what changed
partial: The paid social campaign drove the spike
incorrect: The pricing page refresh changed user behavior
partial: Bot or test traffic is the primary cause
incorrect: Approve the executive deck claim and cite the 38 percent lift
correct: Pause the WAU headline; report onboarding evidence separately after a stable-event recompute
partial: Ask for another month of data before responding, without rerunning the current metric
partial: Credit the campaign as the likely cause but add a caveat about tracking

Facilitation focus: The case tests whether you can hold two ideas at once: the product may be improving, and the polished dashboard claim may still be unsupported.
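
If learners struggle to see how an instrumentation change can inflate a headline while a smaller real gain survives, a toy recompute may help. The sketch below is illustrative only: the event names, user IDs, and counts are hypothetical and are not taken from the case files. It compares the percent change in active users under a changed event definition with the change under a stable (legacy) definition.

```python
# Hypothetical event logs as (user_id, event_name) pairs; every value is illustrative.
WEEK_BEFORE = [("u1", "session_start"), ("u2", "session_start"), ("u3", "page_view")]
WEEK_AFTER = [
    ("u1", "session_start"), ("u2", "session_start"), ("u6", "session_start"),
    ("u3", "page_view"), ("u4", "background_sync"), ("u5", "background_sync"),
]

# Assumed definitions: the "changed" definition starts counting a new event mid-window.
LEGACY_EVENTS = {"session_start"}
CHANGED_EVENTS = {"session_start", "background_sync"}


def active_users(events, qualifying_events):
    """Count distinct users with at least one qualifying event."""
    return len({user for user, name in events if name in qualifying_events})


def pct_change(old, new):
    """Percent change from old to new."""
    return 100.0 * (new - old) / old


baseline = active_users(WEEK_BEFORE, LEGACY_EVENTS)

# Headline: the definition changes between the two weeks being compared.
headline_lift = pct_change(baseline, active_users(WEEK_AFTER, CHANGED_EVENTS))

# Stable-event recompute: the same legacy definition is applied to both weeks.
stable_lift = pct_change(baseline, active_users(WEEK_AFTER, LEGACY_EVENTS))

print(f"headline lift: {headline_lift:.0f}%  stable-definition lift: {stable_lift:.0f}%")
```

In this toy data both numbers are positive, which mirrors the two ideas the case asks learners to hold: some improvement may be real, and the headline may still be unsupported.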

Case 02 · standard · 10 min

The Checkout Readout

Experimentation

experiment interpretation · causal caution · red flag detection

Hidden concept targets

multiple comparisons · peeking · sample ratio mismatch

Key evidence

ev-203: Segments workbook tab
ev-204: Analysis timeline
ev-205: Traffic diagnostic export
ev-206: Pre-analysis plan excerpt
ev-207: Experiment review clip
ev-209: Exclusion rule diff

Answer map

correct: The launch story depends on a post-hoc subgroup after peeking and repeated slicing
incorrect: The US paid mobile result is promising enough to treat as a clean targeted launch win
partial: The primary miss should block the launch claim, but the paid-mobile signal may still justify a pre-specified follow-up test
partial: The only issue is browser randomization; the subgroup result is otherwise launch-ready
incorrect: Ship to US paid mobile and cite the significant subgroup lift
correct: Treat the subgroup as diagnostic; fix assignment and rerun or reanalyze with pre-specified rules
incorrect: Keep the test running until the primary metric becomes significant
partial: Ship broadly but monitor the browser issue after launch

Facilitation focus: The case tests whether you can treat a subgroup result as a lead without upgrading it into a launch-grade causal claim.
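
A short calculation can make the multiple-comparisons point concrete during the debrief. The sketch below assumes independent tests at a fixed significance level, which is a simplification (overlapping segments are correlated), and the number of slices is hypothetical rather than taken from the segments workbook. It shows how quickly the chance of at least one spurious "significant" subgroup grows with the number of slices examined.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


ALPHA = 0.05
for n_slices in (1, 5, 20):  # hypothetical numbers of subgroup slices
    fwer = familywise_error_rate(ALPHA, n_slices)
    per_test = bonferroni_threshold(ALPHA, n_slices)
    print(f"{n_slices:>2} slices: P(any false positive) = {fwer:.2f}, "
          f"Bonferroni per-test level = {per_test:.4f}")
```

The Bonferroni correction is used here only because it is the simplest adjustment to explain; it is not presented as the method the case team used or should use.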

Case 03 · standard · 12 min

The Churn Model Pitch

ML evaluation

model evaluation · decision policy · calibration judgment

Hidden concept targets

selective labels · treatment effects · calibration · feedback loops

Key evidence

ev-303: Outreach ledger extract
ev-304: Calibration slice printout
ev-306: Save desk capacity ledger
ev-307: Retraining dataset diff
ev-308: Evaluation design scratchpad
ev-309: Save desk call clip
ev-310: Discount approval log

Answer map

correct: The model may rank risk, but the evidence does not yet prove the proposed outreach policy creates incremental saves
incorrect: The retrospective top-decile results are strong enough to replace the current save queue
partial: The model should be rejected because customer-success judgment already touches many risky accounts
partial: The main issue is capacity; if staffing increases, the validation evidence is sufficient
incorrect: Buy the model and route the save desk by the vendor score next quarter
correct: Run a constrained policy pilot that tests incremental saves within risk bands before routing the queue
partial: Use the score only as a background flag for CSM review while making no ROI or forecast claims
partial: Reject the model outright because the historical labels are contaminated by save activity

Facilitation focus: The case tests whether you separate prediction from intervention value when a model becomes part of the system it predicts.

Case 04 · standard · 12 min

The Inspection Queue

Public policy analytics

administrative data reasoning · sampling judgment · deployment caution

Hidden concept targets

selective labels · measurement opportunity · feedback loops

Key evidence

ev-403: Coverage board photo
ev-404: Inspection ledger extract
ev-405: Program calendar clipping
ev-407: Audit sample drawer
ev-408: Evaluation design notes
ev-409: Shift-change clip
ev-410: Complaint intake flyer

Answer map

correct: The score contains useful signal, but the backtest overstates readiness because labels reflect where and how the city inspected
incorrect: The high-score bands find more critical violations, so the city should route routine inspections by score
partial: The model should be rejected because prior enforcement patterns contaminate the labels
partial: The main issue is neighborhood information, so removing route and geography fields solves the problem
incorrect: Adopt the score as the primary ranker for next month's routine inspection queue
correct: Pilot the score as one input with random routine checks, thin-corridor sampling, and prospective evaluation
partial: Use the score only to generate analyst leads, with no operational routing change yet
partial: Pause all model use until the city collects an entirely new inspection dataset

Facilitation focus: The case tests whether you reconstruct how administrative labels were produced before treating them as ground truth.

Case 05 · standard · 11 min

The Spring Tutoring Brief

Education analytics

claim evaluation · estimand reasoning · evidence design

Hidden concept targets

selection effects · treatment definition · measurement alignment

Key evidence

ev-503: Roster assignment export
ev-505: Baseline balance table
ev-506: Access log clipping
ev-507: Outcome definition note
ev-508: Evaluation margin notes
ev-510: Family access survey
ev-511: Item alignment excerpt

Answer map

correct: The evidence supports a promising association and implementation signal, not the public causal claim as written
incorrect: The session dose-response proves the tutoring platform caused the stronger spring gains
partial: The platform may help when embedded in scheduled class time, but this dataset cannot isolate that effect
incorrect: The delayed rollout schools provide a clean natural experiment for the board claim
incorrect: Publish the causal impact claim and expand districtwide based on the 5+ session result
correct: Revise to descriptive language and fund continuation only with a cleaner evaluation design
partial: Publish only descriptive results and make no funding recommendation
partial: Keep the public claim descriptive, but recommend districtwide expansion from the usage groups as-is

Facilitation focus: The case tests whether you define the treatment before judging the causal claim.

Case 06 · standard · 12 min

The Winter Shelter Forecast

Public service forecasting

uncertainty ranges · forecast evaluation · scenario thinking

Hidden concept targets

backtesting · concept drift · prediction intervals · censored demand

Key evidence

ev-603: Five-winter backtest tab
ev-605: Bed inventory reconciliation
ev-606: Housing court leading indicators
ev-608: Cold-night residuals
ev-610: Outreach van radio clip
ev-611: Scenario worksheet

Answer map

correct: The model fits ordinary observed shelter use, but the planning target needs a wider range for unmet demand and disruption
incorrect: The five-year backtest is strong enough to use the 82-bed point forecast as the operating plan
incorrect: The main uncertainty is winter temperature, so the warmer seasonal outlook should lower the capacity ask
partial: The issue is mostly a bed-inventory bookkeeping problem; demand itself is probably stable
partial: Leading indicators justify planning above the model interval, while still using the model to anchor the lower end of the range
incorrect: Budget to the model point estimate: 82 overflow beds, with no additional staffing trigger
correct: Plan a 96-108 bed base range, pre-authorize a stress trigger, and update weekly from leading signals
partial: Do not set a winter capacity number until January demand is observed
partial: Open 150 overflow beds immediately because the historical model cannot be trusted
incorrect: Reduce the request below 82 beds because the seasonal outlook is warmer than average

Facilitation focus: The case tests whether you can distinguish a forecast of observed use from a public-service planning range for actual need.

Case 07 · advanced · 12 min

The Benefits Queue Score

Government benefits analytics

fairness reasoning · proxy detection · tradeoff communication

Hidden concept targets

proxy variables · disparate impact · subgroup performance · label bias

Key evidence

ev-703: Burden audit printout
ev-704: Feature explanation export
ev-705: Rights and access memo
ev-706: Training label lineage
ev-707: Subgroup calibration slice
ev-709: Manual review capacity ledger
ev-710: Cleared-case sample
ev-711: Threshold stress test

Answer map

correct: Removing protected attributes is not enough; proxy features and biased labels create uneven review burden
incorrect: The aggregate pilot is strong enough for a limited statewide launch if protected-class fields stay excluded and subgroup monitoring is public
partial: Predictive triage may be legally usable only after the agency proves applicants can contest holds before first payment is delayed
incorrect: The main issue is explaining the model better to applicants and advocates
partial: The model is acceptable if the agency simply hires enough manual reviewers
incorrect: Launch the current statewide routing policy because the model excludes protected attributes and improves aggregate queue speed
correct: Defer statewide launch and test a remediated score under binding burden triggers and independent audit sampling
partial: Keep the score in shadow mode until subgroup gaps have confidence intervals, documented causes, and a mitigation plan
partial: Stop this model and use rules-based triage until the agency has unbiased review labels and notice workflows
partial: Launch only the higher threshold with translated notices and appeals, but without random-audit labels or burden caps

Facilitation focus: The case tests whether you recognize fairness as a deployed burden question, not just a protected-column question.

Case 08 · standard · 12 min

The Claimant Chatbot

Public sector AI evaluation

AI evaluation · severity scoring · launch readiness

Hidden concept targets

evaluation sets · hallucination risk · escalation policy · retrieval coverage

Key evidence

ev-803: Golden-answer mismatch log
ev-804: Severity rubric draft
ev-805: Retrieval coverage report
ev-806: Prompt mix audit
ev-807: Escalation policy draft
ev-808: Navigator review clip
ev-809: Red-team transcript excerpt
ev-812: Remediation worksheet

Answer map

correct: The average demo score hides high-severity failures; launch readiness depends on critical-error and escalation performance
incorrect: The chatbot is ready for full homepage launch because the overall answer acceptance rate is above 90 percent
partial: Fixing the retrieval index is enough to launch the current answer policy
partial: Restrict the bot to internal navigator assist until high-stakes handoff behavior is validated under live monitoring
incorrect: Launch the chatbot broadly on the homepage with the current escalation draft
correct: Limit launch to low-risk intents; block or hand off high-stakes policy situations and monitor severe errors
partial: Limit launch to low-risk intents, but skip live severe-error monitoring because the scope is narrow
partial: Allow high-stakes answers when the bot displays citations and offers a handoff link

Facilitation focus: The case tests whether you evaluate AI by severity and use context, not by polished demos or average pass rates.

Case 09 · standard · 12 min

The Payment Hold Dial

Government risk operations

tradeoff reasoning · threshold selection · operational capacity

Hidden concept targets

false positives · false negatives · expected value · human review capacity

Key evidence

ev-901: Threshold dial simulation
ev-903: Historical outcomes by score band
ev-904: Adjudication capacity ledger
ev-905: Payment-hold burden audit
ev-906: Hardship policy memo
ev-907: Hotline escalation clip
ev-911: Score calibration check
ev-912: Staged threshold protocol

Answer map

correct: The threshold should be chosen as an operating policy that balances fraud capture, review capacity, hardship, and subgroup burden
incorrect: The agency should lower the threshold to maximize fraud capture before payments leave
partial: The agency should pay all claims first and investigate later because false positives cause hardship
incorrect: The best policy is the threshold with the highest model precision, regardless of review capacity
incorrect: Set the hold line at score >= 62 to catch the most suspicious claims before payment
correct: Use the middle hold line only with cluster rules, hardship fast lane, and backlog/subgroup guardrails
partial: Set the hold line at score >= 86 to minimize claimant burden
partial: Stop pre-payment holds and investigate suspicious claims only after payment

Facilitation focus: The case tests whether you treat a model threshold as a human operating policy, not a pure model setting.
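
To show why a threshold is an operating policy and not only a model setting, it can help to tabulate what each hold line implies for reviewers and claimants. The sketch below is a minimal illustration: the score bands, claim counts, fraud counts, and review capacity are all hypothetical, and the thresholds are not the ones on the case dial. For each candidate hold line it reports how many claims are held, how many of those are expected to be fraudulent or legitimate, and how far the holds exceed review capacity.

```python
# Hypothetical score bands: (band_floor, claims_in_band, expected_fraud_in_band).
SCORE_BANDS = [
    (90, 100, 60),
    (80, 200, 60),
    (70, 400, 40),
    (60, 800, 24),
]
REVIEW_CAPACITY = 500  # hypothetical number of holds reviewers can clear per period


def evaluate_hold_line(threshold):
    """Summarize the operational consequences of holding every claim at or above threshold."""
    held = [(claims, fraud) for floor, claims, fraud in SCORE_BANDS if floor >= threshold]
    holds = sum(claims for claims, _ in held)
    fraud_held = sum(fraud for _, fraud in held)
    return {
        "holds": holds,
        "fraud_held": fraud_held,
        "legitimate_held": holds - fraud_held,           # false positives: claimant burden
        "backlog": max(0, holds - REVIEW_CAPACITY),      # holds nobody can review in time
    }


for hold_line in (60, 70, 80, 90):
    print(hold_line, evaluate_hold_line(hold_line))
```

In this toy table the lowest hold line catches the most fraud but also produces the largest backlog and the most legitimate claims held, which is the tradeoff the answer map rewards learners for naming. Hardship and subgroup burden are not modeled here and would need their own columns.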

Case 10 · standard · 11 min

The Clearance Rate Metric

Public administration analytics

metric design · incentive reasoning · executive communication

Hidden concept targets

Goodhart-style behavior · leading indicators · metric definitions · balanced scorecards

Key evidence

ev-1003: Metric definition diff
ev-1004: Funnel drift extract
ev-1005: Reopened case audit
ev-1006: Team incentive chat
ev-1008: Caseworker callback clip
ev-1009: Metric burden slice
ev-1010: Dashboard drilldown
ev-1011: Metric options scratchpad
ev-1012: Metric lineage note

Answer map

correct: The new metric captures digital throughput but hides reopened cases, delayed payments, and shifted claimant burden
incorrect: The digital clearance rate proves modernization improved because it rose sharply
partial: The metric can remain in the dashboard if the headline also includes reopened cases, payment delay, and paper-channel burden
incorrect: The metric is acceptable if the dashboard footnote explains its definition
incorrect: Approve digital clearance rate as the headline modernization metric
correct: Make first-payment time the headline and report digital clearance only with reopen, delay, hold, and subgroup guardrails
partial: Publish a balanced dashboard now, but mark conflicting guardrails as unresolved
incorrect: Publish the digital clearance headline with a definition footnote and keep other measures internal

Facilitation focus: The case tests whether you see metrics as designed instruments that create incentives, not neutral mirrors of performance.

Case 11 · standard · 12 min

The Survey Sample Mirage

Survey analytics

sampling judgment · representativeness · uncertainty communication

Hidden concept targets

nonresponse bias · sampling frames · weighting limits

Key evidence

ev-1101: Response composition board
ev-1103: Survey invite path
ev-1104: Preference by service history
ev-1106: Nonresponse check
ev-1107: Question wording card
ev-1108: Weighting sensitivity
ev-1111: Follow-up design sketch

Answer map

correct: The survey captures a real portal-user signal, but the sampling frame and nonresponse make the broad customer-preference claim unsafe
incorrect: The large response count and 64 percent majority are enough to represent customer preference
partial: Weighting by channel history fully fixes the sample problem
partial: The survey should be discarded because online respondents are biased
incorrect: The staffing decision should be based mainly on the falling routine-call cost trend
incorrect: Publish the 64 percent result as evidence that customers prefer digital-first support
correct: Report strong portal-user support for routine questions, remove the broad customer claim, and run a mixed-mode follow-up before staffing cuts
partial: Use the weighted 49 percent estimate as the official customer-preference result and proceed with the staffing shift
partial: Report the portal-user result, but postpone staffing cuts until a mixed-mode nonresponse check is complete
partial: Reject the portal recommendation and expand phone staffing because phone-first respondents prefer phone support

Facilitation focus: The case tests whether you can separate a true respondent finding from an unsupported population claim.

Case 12 · standard · 12 min

The Bed-Ready Field

Healthcare operations

data provenance · measurement validity · cross-site comparison

Hidden concept targets

semantic drift · data lineage · metric comparability

Key evidence

ev-1203: Data dictionary diff
ev-1204: Release and metric timeline
ev-1205: Site split printout
ev-1206: Raw timestamp sample
ev-1208: Warehouse mapping note
ev-1209: Manual chart audit
ev-1211: Lineage route sketch
ev-1212: Metric reconciliation scratchpad

Answer map

correct: The pooled dashboard mixes incompatible field meanings; a local signal may exist, but the network claim is not supportable as written
incorrect: The 22 percent network drop proves the discharge coordination workflow improved flow across hospitals
partial: East probably improved, so the workflow can be scaled if the board packet adds a definition footnote
partial: Network reporting should pause until site-specific field definitions are versioned and bridge-tested across eras
incorrect: The easing ED boarding trend is the main explanation, so the field definition issue is secondary
incorrect: Approve the board claim that the workflow cut bed-ready delay by 22 percent across the network
correct: Freeze the network claim, version and reconcile the bed-ready definition, audit timestamps, and report site/era results separately
partial: Publish an East-only improvement with a caveat and scale to sites that can adopt the same template
partial: Delay all discharge reporting until analytics can build a brand-new metric from scratch
incorrect: Attribute the improvement to lower ED boarding and ignore the dictionary change for this packet

Facilitation focus: The case tests whether you inspect data meaning and lineage instead of trusting a stable field name and a clean aggregate trend.

Case 13 · standard · 12 min

The Missingness Report

Clinical analytics

missing data reasoning · bias detection · evidence qualification

Hidden concept targets

complete-case analysis · missing not at random · measurement opportunity

Key evidence

ev-1302: Missingness pattern board
ev-1303: Full cohort versus retained records
ev-1304: Excluded-record outcome check
ev-1306: Analyst notebook excerpt
ev-1307: Language access slice
ev-1309: Sensitivity check
ev-1310: Manual chart audit
ev-1312: Claim revision draft

Answer map

correct: The complete-case report is useful but narrow; structured missingness makes the broad stable-risk claim unsafe
incorrect: Stable risk among complete records proves the workflow did not change the clinical risk profile
partial: Multiple imputation should settle the issue as long as the final risk estimate remains stable
partial: The report should be discarded because missing clinical fields invalidate the entire analysis
incorrect: The missing fields are mainly documentation quality problems and should be separated from clinical judgment
incorrect: Approve the committee claim that risk stayed stable and intake documentation can be deprioritized
correct: Qualify the complete-case result, characterize missingness, audit excluded charts, and run sensitivity checks before clinical action
partial: Run imputation, publish the revised estimate if it stays close, and keep the committee narrative
partial: Stop all sepsis-risk reporting until intake documentation is nearly complete
incorrect: Exclude hallway and interpreter-needed encounters from the quality report so the metric stays comparable

Facilitation focus: The case tests whether you treat missing data as a signal about the data-generating process instead of a cleanup detail.

Case 14 · advanced · 12 min

The Privacy-Safe Export

Data governance

privacy risk review · governance judgment · stakeholder communication

Hidden concept targets

re-identification risk · consent boundaries · data minimization

Key evidence

ev-1403: Export field inventory
ev-1404: Linkage risk surface
ev-1405: Consent language excerpt
ev-1407: External linkage scan
ev-1408: Agreement gap compare
ev-1409: Risk is not evenly distributed
ev-1411: Minimization redesign table
ev-1412: Lifecycle control timeline

Answer map

correct: The file is not ready as proposed, but a minimized, purpose-limited, access-controlled release could be defensible
incorrect: The file is safe to release because direct identifiers are removed and dates are shifted
incorrect: The public health purpose is strong enough to accept residual privacy risk under the current draft
partial: Sensitive outreach data should never be shared outside the county, even with stronger controls
partial: Only aggregate tables should be shared; row-level access is never justified for this project
incorrect: Release the proposed row-level export this week because it passed the direct-identifier checklist
correct: Share only the needed fields through tiered access, consent alignment, linkage review, and retention controls
partial: Seek legal signoff on the current file and release if counsel agrees it is de-identified
partial: Replace the export with public aggregate tables and deny all row-level partner access
incorrect: Allow the vendor to host and reuse the file because the university partner is accountable for the project

Facilitation focus: The case tests whether you treat privacy as a contextual release decision rather than a mechanical masking checklist.

Case 15 · intro · 9 min

The Board Slide

Executive reporting

visualization critique · scale interpretation · claim wording

Hidden concept targets

axis truncation · visual rhetoric · practical significance

Key evidence

ev-1501: Board slide chart
ev-1503: Raw monthly table
ev-1505: Denominator note
ev-1506: Practical threshold card
ev-1507: Longer baseline pull
ev-1508: Subgroup guardrail
ev-1509: Analyst caveat note
ev-1511: Alternative view spec

Answer map

correct: The chart contains a real early improvement signal, but the cropped frame and headline overstate practical impact and certainty
incorrect: The chart proves the intake model dramatically improved eviction-prevention outcomes
partial: The chart shows a board-ready improvement if the cropped axis is labeled and absolute eviction counts are shown beside it
partial: The board should ignore rates and use only absolute eviction outcome counts
partial: The denominator change fully explains the improvement, so the model has no positive signal
incorrect: Approve the cropped chart and headline saying the model nearly cut failures in half
correct: Revise the board slide to show the modest rate gain with counts, targets, denominator notes, and subgroup guardrails; use qualified wording
partial: Remove the chart entirely because the axis crop makes it unusable
incorrect: Use only the 0-100 axis chart and keep the same "nearly halves failures" headline
partial: Delay all board reporting until a full year of post-model data is available

Facilitation focus: The case tests whether you evaluate what a visualization is being used to claim, not just whether the chart is technically labeled.

Case 16 · standard · 12 min

The Geo Test Winner

Retail media

causal design critique · geo experiment interpretation · claim qualification

Hidden concept targets

matched markets · interference · seasonality · counterfactual uncertainty

Key evidence

ev-1601: Matched market board
ev-1604: Pre-analysis plan excerpt
ev-1605: Pre-period balance table
ev-1606: Campaign and seasonality calendar
ev-1607: Market-pair sales readout
ev-1608: Field sales spillover note
ev-1609: Spillover diagnostic
ev-1610: Counterfactual sensitivity strip
ev-1611: Inventory and merchandising ledger
ev-1612: Causal review memo

Answer map

correct: The geo test is directionally promising, but matching, seasonality, spillover, and bundled operations prevent the +11% national causal claim
incorrect: The treated markets beat controls by enough to prove the media campaign caused an 11% lift
partial: The main issue is that the test has too few markets; a larger sample would solve the design concerns
partial: Because there is some spillover, the test provides no useful evidence at all
partial: The lift is entirely caused by inventory and merchandising, so media had no effect
incorrect: Approve national rollout and report that the geo test proved an 11% incremental sales lift
correct: Qualify the result as promising but not decision-grade, fix the claim wording, and either rerun with cleaner holdouts or scale with reserved test markets
partial: Approve a limited regional scale-up while reserving two clean holdout markets and dropping the national +11% claim
incorrect: Scale to similar warm-weather markets using the adjusted estimate, but do not reserve a holdout
partial: Replace the headline with the +4.8% adjusted estimate and approve national scale

Facilitation focus: The case tests whether you can reconstruct the missing counterfactual in a market-level experiment instead of treating a regional win as proof of national causal lift.

Case 17 · standard · 12 min

The Parallel Trends Slide

Labor policy

difference-in-differences judgment · comparison-group critique · causal claim wording

Hidden concept targets

parallel trends · event timing · placebo checks · comparison validity

Key evidence

ev-1701: Parallel trends slide
ev-1704: Comparison selection note
ev-1705: Policy timing ledger
ev-1706: Pre-trend slope check
ev-1707: Industry recovery table
ev-1708: Local workforce note
ev-1709: Placebo outcome check
ev-1710: Estimate sensitivity card
ev-1711: Robustness check note
ev-1712: Alternative control comparison

Answer map

correct: The pilot evidence is suggestive, but pre-trend and timing problems make the strong causal claim unsafe
incorrect: The five-point post-policy divergence proves the workforce pilot caused the employment gain
incorrect: Because baseline levels were nearly identical in January, the comparison group is credible
partial: Any pre-trend difference invalidates the analysis completely and the pilot should be ignored
partial: Manufacturing reopenings fully explain the gains, so the program had no effect
incorrect: Approve the slide saying the pilot caused a five-point employment gain and recommend statewide expansion
correct: Revise the claim as suggestive, add event-study and robustness checks, and avoid expansion claims until comparison validity is stronger
partial: Recommend statewide expansion using the five-point estimate but add a caveat about pre-trends
partial: The pre-trend weakens the expansion claim, but the pilot may still justify a smaller evidence-building extension
incorrect: Keep the causal claim but move the post-period boundary to March

Facilitation focus: The case tests whether you treat difference-in-differences as a counterfactual argument, not a formula that converts any post-period divergence into causation.

Case 18 · advanced · 13 min

The Cutoff Policy Claim

Benefits eligibility

regression-discontinuity judgment · manipulation diagnostics · causal claim qualification

Hidden concept targets

running variable · cutoff sorting · fuzzy compliance · bandwidth sensitivity

Key evidence

ev-1801: Cutoff inspector
ev-1804: Eligibility rule excerpt
ev-1805: Density diagnostic
ev-1806: Revision audit log
ev-1807: Near-cutoff balance check
ev-1808: Assignment compliance ledger
ev-1809: Navigator intake note
ev-1810: Outreach and scoring timeline
ev-1811: Bandwidth sensitivity card
ev-1812: Evaluator recommendation

Answer map

correct: The cutoff evidence is suggestive, but score sorting, revisions, and fuzzy compliance make the clean causal claim unsafe
incorrect: Because the treatment starts at score 70, the below-versus-above comparison proves the navigator prevented evictions
incorrect: Bunching near 70 may reflect legitimate caseworker discretion, so the fuzzy estimate should be reported as the main sensitivity check
partial: The sorting concern should narrow the claim to a monitored extension rather than a causal proof claim
partial: A fuzzy RD estimate alone is enough to keep the causal claim if the point estimate remains negative
incorrect: Approve the budget slide saying navigator access prevented evictions by 9.4 percentage points
correct: Revise the claim as suggestive, add manipulation and sensitivity checks, and redesign assignment or scoring before making a causal expansion claim
partial: Replace the headline with the fuzzy RD estimate and recommend expansion as proven effective
partial: Recommend canceling the navigator because the cutoff study has sorting concerns
incorrect: Exclude scores 69-71 and present the remaining estimate without discussing the sorting issue

Facilitation focus: The case tests whether you treat regression discontinuity as an assumption-driven design, not a magic property of any threshold rule.

Case 19 · standard · 12 min

The QuickStart Readout

Product experimentation

statistical power judgment · experiment interpretation · decision under uncertainty

Hidden concept targets

minimum detectable effect · confidence intervals · equivalence testing · exposure dilution

Key evidence

ev-1901: Power audit board
ev-1904: Pre-analysis power note
ev-1905: Exposure funnel ledger
ev-1906: Experiment operations timeline
ev-1907: Estimate interpretation table
ev-1908: Subgroup signal check
ev-1909: Experiment analyst caveat
ev-1910: Decision sensitivity card
ev-1911: Support burden check
ev-1912: Experiment review recommendation

Answer map

correct: The experiment is inconclusive: it failed to reach significance, but it was underpowered for the effect size the team cared about
incorrect: Because p = 0.18, the experiment proves QuickStart Coach has no meaningful effect on activation
partial: Because the point estimate is positive, the feature should ship as a proven activation win
partial: The new-planner subgroup proves the feature works for the intended audience
partial: No decision can be made until the experiment is rerun from scratch
incorrect: Declare no measurable impact and remove QuickStart Coach from the roadmap
correct: Revise the readout as inconclusive, fix exposure/logging, and continue or rerun against the original decision threshold
partial: Ship broadly because the point estimate is positive and the support-burden signal is not worse
incorrect: Call it equivalent to no effect because the p-value missed and the confidence interval crosses zero
partial: If a decision is unavoidable, use a limited rollout with a preserved holdout and explicit uncertainty language

Facilitation focus: The case tests whether you distinguish a failed significance test from evidence that an effect is absent.
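
One way to make the distinction concrete is to compare the confidence interval with the effect size the team cared about, not only with zero. The sketch below uses a normal-approximation interval for a difference in proportions; the conversion counts, sample sizes, and the 3-point decision threshold are hypothetical and do not reproduce the case's p = 0.18 readout. An interval that contains both zero and the threshold cannot rule out either "no effect" or "the effect we wanted".

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_control, n_control, conv_treat, n_treat, level=0.95):
    """Normal-approximation CI for (treatment rate - control rate)."""
    p_c = conv_control / n_control
    p_t = conv_treat / n_treat
    std_err = sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treat)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_t - p_c
    return diff - z * std_err, diff + z * std_err


def read_out(interval, decision_threshold):
    """Classify an interval against zero and against the effect size that matters."""
    low, high = interval
    if low > 0:
        return "positive effect detected"
    if high < decision_threshold:
        return "threshold-sized effect ruled out"
    return "inconclusive"


THRESHOLD = 0.03  # hypothetical: the lift the team said would justify shipping

small_test = diff_confidence_interval(400, 2000, 430, 2000)      # hypothetical counts
large_test = diff_confidence_interval(4000, 20000, 4300, 20000)  # same rates, 10x sample

print(small_test, read_out(small_test, THRESHOLD))
print(large_test, read_out(large_test, THRESHOLD))
```

With the same observed rates, the smaller test is inconclusive while the larger one detects a positive effect, which is the underpowering point in the answer map. A formal equivalence test would be the stricter way to claim "no meaningful effect"; this sketch only shows why a missed p-value does not establish it.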

Case 20 · standard · 12 min

The Short-Term Lift

Subscription growth

metric horizon judgment · experiment guardrail review · business outcome reasoning

Hidden concept targets

surrogate metrics · retention cohorts · refund bias · long-term value

Key evidence

ev-2001: Lifecycle lift board
ev-2004: Experiment contract excerpt
ev-2005: Cohort maturity timeline
ev-2006: Retention and refund table
ev-2007: Support queue note
ev-2008: Revenue quality ledger
ev-2009: Acquisition mix check
ev-2010: Cancel reason sample
ev-2011: LTV sensitivity card
ev-2012: Launch review memo

Answer map

correct: The paid-start lift is real, but the global growth-quality claim is unsafe because downstream guardrails are worse or immature
incorrect: Because paid starts rose significantly, FastStart is proven to drive efficient subscription growth
partial: Refund and support guardrails are concerning enough to pause expansion while preserving the paid-start lift as a real short-term result
incorrect: Booked annual revenue is the best decision metric because subscription revenue is recognized at checkout
partial: FastStart should ship only to the organic segment because that segment has higher retained starts
incorrect: Approve global rollout and report FastStart as a subscription-growth win
correct: Report the conversion lift, hold the global claim, wait for mature retention/refund outcomes, and use lifecycle guardrails for rollout
partial: Ship globally with a warning that refunds should be watched after launch
partial: Keep FastStart in paid search only, where retained starts look better, while waiting for mature refund outcomes
incorrect: Replace the primary metric with net contribution after seeing the result and declare the test failed

Facilitation focus: The case tests whether you can preserve a valid short-term experimental result while refusing to overextend it into a long-term business outcome.

Case 21 · advanced · 13 min

The Discharge Score

Hospital readmission

model leakage detection · deployment readiness judgment · feature availability audit

Hidden concept targets

target leakage · temporal validation · decision-time features · prospective validation

Key evidence

ev-2101: Feature-time audit
ev-2103: Validation deck excerpt
ev-2104: Top feature importance
ev-2105: Discharge workflow timeline
ev-2106: Snapshot timing audit
ev-2107: Data engineering caveat
ev-2108: Decision-time replay
ev-2109: Care queue capacity check
ev-2110: Site split performance
ev-2111: Deployment readiness card
ev-2112: Deployment review memo

Answer map

correct: The current model is not ready for discharge-time use because its validation performance depends on post-decision or near-outcome features
incorrect: The model is ready because an AUC above 0.90 on holdout data proves it will rank patients well in production
partial: The late features should be removed, and the remaining signal may still support a redesigned, prospectively validated workflow
partial: This model version is unsuitable for discharge-time use, but the same program could be rebuilt around score-time features
incorrect: The main issue is call-center capacity; the model performance evidence is otherwise sufficient
incorrect: Approve launch next month because the reported AUC is high and the care queue has enough capacity
correct: Block deployment, rebuild using decision-time features, and require prospective shadow validation plus calibration and capacity review
partial: Launch only for the highest-risk threshold while engineers remove the late fields later
partial: Use the current model for retrospective audit while designing a separate discharge-time version
incorrect: Keep the model unchanged but require clinicians to manually review every high-risk patient before action

Facilitation focus: The case tests whether you can treat model performance as conditional on the data-generating and scoring moment, not as a portable property of the model.
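
One way to make the "scoring moment" concrete in discussion is a feature-availability check. The sketch below is illustrative only: the feature names, their availability offsets, and the helper `split_by_score_time` are hypothetical and do not come from the case's feature-time audit. It separates features that exist when the score is produced from those that only appear afterward.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feature:
    name: str
    # Hours relative to the discharge decision at which the value first
    # exists in the source system (negative = before the decision).
    available_at_hours: float

# Hypothetical catalog; real offsets would come from a timing audit.
CATALOG = [
    Feature("age", -72.0),
    Feature("prior_admissions_12m", -72.0),
    Feature("last_lab_panel", -6.0),
    Feature("post_discharge_pharmacy_fill", 24.0),
    Feature("followup_call_outcome", 48.0),
]

def split_by_score_time(features, score_time_hours=0.0):
    """Return (usable, late) feature names for a given scoring moment."""
    usable = [f.name for f in features if f.available_at_hours <= score_time_hours]
    late = [f.name for f in features if f.available_at_hours > score_time_hours]
    return usable, late

usable, late = split_by_score_time(CATALOG)
print("usable at discharge:", usable)
print("late (leakage risk):", late)
```

Any validation metric computed with the "late" features in the training table describes a model that cannot be run at discharge, whatever the holdout score says.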

Case 22 · advanced · 13 min

The Labeling Vendor Benchmark

Trust and safety

label quality audit · benchmark validity judgment · AI deployment governance

Hidden concept targets

ground truth construction · label noise · inter-rater reliability · policy drift

Key evidence

ev-2201: Ground-truth stress test
ev-2203: Launch scorecard excerpt
ev-2204: Vendor contract excerpt
ev-2205: Policy taxonomy map
ev-2206: Policy-slice performance
ev-2207: Rater agreement sample
ev-2208: Gold-slice adjudication
ev-2209: Policy caveat note
ev-2210: Reviewer support chat
ev-2211: Launch queue simulation
ev-2212: Revised validation plan

Answer map

correct: The benchmark is not launch-ready because the vendor labels are noisy, policy-lagged, and uneven across high-risk categories
incorrect: The model is ready because it beats the rules engine on 40,000 vendor-labeled examples
partial: The model may be usable for obvious low-disagreement categories, but not for broad auto-removal
incorrect: The main fix is to replace Vendor A with another vendor that reports higher raw agreement
partial: Keep model development in audit mode while policy categories with high disagreement are adjudicated
incorrect: Approve high-confidence auto-removal before peak season based on the vendor benchmark
correct: Pause broad automation, build an adjudicated gold set, then pilot only slices that clear policy-specific review
partial: Pilot automation only for obvious counterfeit-logo cases while auditing and redesigning the broader label system
incorrect: Switch to a second labeling vendor and launch if its aggregate agreement is higher
partial: Keep the benchmark as-is but raise the confidence threshold until the false positive count looks acceptable

Facilitation focus: The case tests whether you treat ground truth as something constructed by people, incentives, policy versions, and disagreement rules rather than as a fixed column in a benchmark.

Case 23 · advanced · 13 min

The Drift Alarm Nobody Owned

Logistics ETA

model monitoring judgment · operational ownership review · incident response design

Hidden concept targets

data drift · calibration decay · model governance · fallback policy

Key evidence

ev-2301: Drift response board
ev-2303: ETA accuracy trend
ev-2304: Slice calibration table
ev-2305: Operations change log
ev-2306: Monitoring ownership page
ev-2307: Alert policy excerpt
ev-2308: Customer impact sample
ev-2309: Retraining proposal
ev-2310: Shadow guardrail simulation
ev-2311: Ownership caveat
ev-2312: Governance handoff memo

Answer map

correct: The model has degraded in operationally important slices, and the deeper failure is that monitoring is not connected to an owned response process
incorrect: The alarm can be closed because aggregate SLA and overall ETA error are still acceptable
partial: Retraining is probably needed, but only after label completeness and the operational routing change are understood
partial: A temporary fallback rule should be used for affected slices, but governance does not need to change
incorrect: This is mainly an MLOps uptime issue because the alert came from the model monitoring dashboard
incorrect: Silence the drift alarm until aggregate SLA breaches or support volume becomes unmanageable
correct: Open an incident, name an owner, guard affected slices, validate labels, and set the alert-response policy
partial: Retrain immediately on the latest four weeks and redeploy if the backtest improves
partial: Disable the ETA model for all deliveries and use static promise windows until peak season ends
incorrect: Leave the model unchanged but add a dashboard note explaining that carrier mix changed

Facilitation focus: The case tests whether you can distinguish model monitoring from model governance: detecting drift is not enough unless someone owns the decision response.

Case 24 · advanced · 13 min

The Holiday Override

Retail supply chain

target definition review · censored demand reasoning · forecast validation

Hidden concept targets

stockout censoring · lost sales · availability bias · replenishment simulation

Key evidence

ev-2401: Observed sales demand board
ev-2403: Launch deck excerpt
ev-2404: Inventory position table
ev-2405: Sales versus availability audit
ev-2406: Training target definition
ev-2408: Planner caveat
ev-2409: Store segment performance
ev-2410: Lost-sales reconstruction
ev-2411: Replenishment capacity simulation
ev-2412: Model risk review memo

Answer map

correct: The model is not ready for broad replenishment control because it forecasts observed sales censored by inventory rather than unconstrained demand
incorrect: The low observed-sales WAPE proves ShelfSight is accurate enough for automatic planner overrides
partial: The model may be useful for historically unconstrained SKU/store pairs, but constrained pairs need availability-aware validation
partial: Holiday promotion and weather explain the forecast misses, so adding better calendar features is enough
incorrect: The main issue is planner resistance to automation, not a problem with the forecast target
incorrect: Approve automatic overrides for all 300 stores because backtest WAPE is below the launch threshold
correct: Block broad overrides, rebuild around availability-aware demand, and pilot only slices with guardrails
partial: Launch auto-ordering only for historically unconstrained SKU/store pairs while redesigning the constrained-demand target
partial: Keep ShelfSight as an advisory signal for planner review during the holiday period
incorrect: Keep the current observed-sales target but add stockout flags and relaunch the backtest after peak season

Facilitation focus: The case tests whether you can distinguish measured sales from the demand a replenishment decision actually needs to forecast.
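
A small numeric sketch can help learners see why a low WAPE against observed sales says little about demand. Everything here is hypothetical: the daily demand and on-hand figures are invented, and true demand is visible only because this is a toy example (in the case it would have to be reconstructed). Sales are modeled as demand capped by inventory.

```python
def wape(actual, forecast):
    """Weighted absolute percentage error: sum|a - f| / sum(a)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(actual)

# Hypothetical week for one SKU/store pair.
demand = [12, 15, 9, 20, 18, 25, 14]    # unobserved in practice
on_hand = [20, 10, 20, 10, 10, 10, 20]  # units available each day

# Observed sales are censored: you cannot sell more than you stock.
sales = [min(d, s) for d, s in zip(demand, on_hand)]
stockout_days = sum(1 for x, s in zip(sales, on_hand) if x == s)

# A naive forecast fit to the observed (censored) target.
forecast = [sum(sales) / len(sales)] * len(sales)

print("stockout days:", stockout_days)
print("WAPE vs observed sales:", round(wape(sales, forecast), 3))
print("WAPE vs true demand:   ", round(wape(demand, forecast), 3))
```

The same forecast looks far better against censored sales than against the demand a replenishment decision needs, and the gap comes entirely from the stockout days.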

Case 25 · advanced · 13 min

The DealDesk Pilot

Enterprise AI

AI risk evaluation · tool-permission review · adversarial test design

Hidden concept targets

prompt injection · least privilege · untrusted content · human-in-the-loop controls

Key evidence

ev-2501: Tool-risk replay board
ev-2503: Pilot scorecard excerpt
ev-2504: Tool permission inventory
ev-2506: Indirect injection replay
ev-2507: Data boundary map
ev-2508: Red-team results
ev-2509: Security caveat
ev-2510: Audit log excerpt
ev-2511: Launch blast-radius simulation
ev-2512: Launch readiness memo

Answer map

correct: The pilot is not ready for broad tool-enabled rollout because clean-task helpfulness did not test prompt injection, tool misuse, or data exposure
incorrect: The assistant is ready because clean-task success exceeded 90 percent and users liked it
partial: Lower-risk Q&A and summarization may be usable if retrieval is scoped, cited, and logged
partial: Approval gates reduce one failure mode, but they do not solve retrieval, tool-use, or untrusted-content risk
incorrect: The pilot should stay internal until clean workflows and adversarial workflows perform equally well
incorrect: Expand the full tool-enabled pilot to all account managers before renewal season
correct: Keep scoped low-risk workflows, add tool and data controls, and require adversarial evaluation before expansion
partial: Allow draft-only assistance with CRM read access while postponing CRM writes and ticket creation
partial: Launch broadly as long as users must manually send emails and approve final tool actions
incorrect: The main failure is user training on suspicious attachments, so the model and tool architecture can remain unchanged

Facilitation focus: The case tests whether you can see that a helpful assistant becomes a different risk object once it reads untrusted content and can use tools.
