Instructor mode

Facilitate without flattening the investigation.

This view exposes concept targets, answer maps, key evidence, and facilitation focus for each file. Keep it out of the learner path until after decisions have been submitted.

Case 01 · intro · 8 min

The Dashboard Spike

Product analytics

metric interpretation · data quality · uncertainty calibration

Hidden concept targets

instrumentation · denominator shift · event logging

Key evidence

ev-004: Release checklist excerpt; ev-005: Activity split printout; ev-006: Raw event sample; ev-007: Registry change log; ev-008: Engineering standup clip; ev-011: Legacy metric recompute

Answer map

correct: Metric instrumentation changed inside the reporting window

partial: The redesign likely improved early onboarding, but the WAU headline overstates what changed

partial: The paid social campaign drove the spike

incorrect: The pricing page refresh changed user behavior

partial: Bot or test traffic is the primary cause

incorrect: Approve the executive deck claim and cite the 38 percent lift

correct: Pause the WAU headline; report onboarding evidence separately after a stable-event recompute

partial: Ask for another month of data before responding, without rerunning the current metric

partial: Credit the campaign as the likely cause but add a caveat about tracking

Facilitation focus: The case tests whether you can hold two ideas at once: the product may be improving, and the polished dashboard claim may still be unsupported.


Case 02 · standard · 10 min

The Checkout Readout

Experimentation

experiment interpretation · causal caution · red flag detection

Hidden concept targets

multiple comparisons · peeking · sample ratio mismatch

Key evidence

ev-203: Segments workbook tab; ev-204: Analysis timeline; ev-205: Traffic diagnostic export; ev-206: Pre-analysis plan excerpt; ev-207: Experiment review clip; ev-209: Exclusion rule diff

Answer map

correct: The launch story depends on a post-hoc subgroup after peeking and repeated slicing

incorrect: The US paid mobile result is promising enough to treat as a clean targeted launch win

partial: The primary miss should block the launch claim, but the paid-mobile signal may still justify a pre-specified follow-up test

partial: The only issue is browser randomization; the subgroup result is otherwise launch-ready

incorrect: Ship to US paid mobile and cite the significant subgroup lift

correct: Treat the subgroup as diagnostic; fix assignment and rerun or reanalyze with pre-specified rules

incorrect: Keep the test running until the primary metric becomes significant

partial: Ship broadly but monitor the browser issue after launch

Facilitation focus: The case tests whether you can treat a subgroup result as a lead without upgrading it into a launch-grade causal claim.
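Facilitator aid: if learners ask how a sample ratio mismatch is usually detected, the following is a minimal sketch of a chi-square goodness-of-fit check on arm counts. The 50/50 planned split, the function name `srm_p_value`, and the counts are assumptions made for illustration; none of them come from the case evidence.

```python
import math

def srm_p_value(control_n: int, treatment_n: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts, not taken from the traffic diagnostic export.
balanced = srm_p_value(10_000, 10_050)
skewed = srm_p_value(10_000, 10_600)

print(f"balanced split p-value: {balanced:.3f}")
print(f"skewed split p-value:   {skewed:.6f}")
```

A very small p-value suggests the assignment mechanism did not deliver the planned split, which is a reason to question the experiment before reading any lift, subgroup or otherwise.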


Case 03 · standard · 12 min

The Churn Model Pitch

ML evaluation

model evaluation · decision policy · calibration judgment

Hidden concept targets

selective labels · treatment effects · calibration · feedback loops

Key evidence

ev-303: Outreach ledger extract; ev-304: Calibration slice printout; ev-306: Save desk capacity ledger; ev-307: Retraining dataset diff; ev-308: Evaluation design scratchpad; ev-309: Save desk call clip; ev-310: Discount approval log

Answer map

correct: The model may rank risk, but the evidence does not yet prove the proposed outreach policy creates incremental saves

incorrect: The retrospective top-decile results are strong enough to replace the current save queue

partial: The model should be rejected because customer-success judgment already touches many risky accounts

partial: The main issue is capacity; if staffing increases, the validation evidence is sufficient

incorrect: Buy the model and route the save desk by the vendor score next quarter

correct: Run a constrained policy pilot that tests incremental saves within risk bands before routing the queue

partial: Use the score only as a background flag for CSM review while making no ROI or forecast claims

partial: Reject the model outright because the historical labels are contaminated by save activity

Facilitation focus: The case tests whether you separate prediction from intervention value when a model becomes part of the system it predicts.


Case 04 · standard · 12 min

The Inspection Queue

Public policy analytics

administrative data reasoning · sampling judgment · deployment caution

Hidden concept targets

selective labels · measurement opportunity · feedback loops

Key evidence

ev-403: Coverage board photo; ev-404: Inspection ledger extract; ev-405: Program calendar clipping; ev-407: Audit sample drawer; ev-408: Evaluation design notes; ev-409: Shift-change clip; ev-410: Complaint intake flyer

Answer map

correct: The score contains useful signal, but the backtest overstates readiness because labels reflect where and how the city inspected

incorrect: The high-score bands find more critical violations, so the city should route routine inspections by score

partial: The model should be rejected because prior enforcement patterns contaminate the labels

partial: The main issue is neighborhood information, so removing route and geography fields solves the problem

incorrect: Adopt the score as the primary ranker for next month's routine inspection queue

correct: Pilot the score as one input with random routine checks, thin-corridor sampling, and prospective evaluation

partial: Use the score only to generate analyst leads, with no operational routing change yet

partial: Pause all model use until the city collects an entirely new inspection dataset

Facilitation focus: The case tests whether you reconstruct how administrative labels were produced before treating them as ground truth.


Case 05 · standard · 11 min

The Spring Tutoring Brief

Education analytics

claim evaluation · estimand reasoning · evidence design

Hidden concept targets

selection effects · treatment definition · measurement alignment

Key evidence

ev-503: Roster assignment export; ev-505: Baseline balance table; ev-506: Access log clipping; ev-507: Outcome definition note; ev-508: Evaluation margin notes; ev-510: Family access survey; ev-511: Item alignment excerpt

Answer map

correct: The evidence supports a promising association and implementation signal, not the public causal claim as written

incorrect: The session dose-response proves the tutoring platform caused the stronger spring gains

partial: The platform may help when embedded in scheduled class time, but this dataset cannot isolate that effect

incorrect: The delayed rollout schools provide a clean natural experiment for the board claim

incorrect: Publish the causal impact claim and expand districtwide based on the 5+ session result

correct: Revise to descriptive language and fund continuation only with a cleaner evaluation design

partial: Publish only descriptive results and make no funding recommendation

partial: Keep the public claim descriptive, but recommend districtwide expansion from the usage groups as-is

Facilitation focus: The case tests whether you define the treatment before judging the causal claim.


Case 06 · standard · 12 min

The Winter Shelter Forecast

Public service forecasting

uncertainty ranges · forecast evaluation · scenario thinking

Hidden concept targets

backtesting · concept drift · prediction intervals · censored demand

Key evidence

ev-603: Five-winter backtest tab; ev-605: Bed inventory reconciliation; ev-606: Housing court leading indicators; ev-608: Cold-night residuals; ev-610: Outreach van radio clip; ev-611: Scenario worksheet

Answer map

correct: The model fits ordinary observed shelter use, but the planning target needs a wider range for unmet demand and disruption

incorrect: The five-year backtest is strong enough to use the 82-bed point forecast as the operating plan

incorrect: The main uncertainty is winter temperature, so the warmer seasonal outlook should lower the capacity ask

partial: The issue is mostly a bed-inventory bookkeeping problem; demand itself is probably stable

partial: Leading indicators justify planning above the model interval, while still using the model to anchor the lower end of the range

incorrect: Budget to the model point estimate: 82 overflow beds, with no additional staffing trigger

correct: Plan a 96-108 bed base range, pre-authorize a stress trigger, and update weekly from leading signals

partial: Do not set a winter capacity number until January demand is observed

partial: Open 150 overflow beds immediately because the historical model cannot be trusted

incorrect: Reduce the request below 82 beds because the seasonal outlook is warmer than average

Facilitation focus: The case tests whether you can distinguish a forecast of observed use from a public-service planning range for actual need.


Case 07 · advanced · 12 min

The Benefits Queue Score

Government benefits analytics

fairness reasoning · proxy detection · tradeoff communication

Hidden concept targets

proxy variables · disparate impact · subgroup performance · label bias

Key evidence

ev-703: Burden audit printout; ev-704: Feature explanation export; ev-705: Rights and access memo; ev-706: Training label lineage; ev-707: Subgroup calibration slice; ev-709: Manual review capacity ledger; ev-710: Cleared-case sample; ev-711: Threshold stress test

Answer map

correct: Removing protected attributes is not enough; proxy features and biased labels create uneven review burden

incorrect: The aggregate pilot is strong enough for a limited statewide launch if protected-class fields stay excluded and subgroup monitoring is public

partial: Predictive triage may be legally usable only after the agency proves applicants can contest holds before first payment is delayed

incorrect: The main issue is explaining the model better to applicants and advocates

partial: The model is acceptable if the agency simply hires enough manual reviewers

incorrect: Launch the current statewide routing policy because the model excludes protected attributes and improves aggregate queue speed

correct: Defer statewide launch and test a remediated score under binding burden triggers and independent audit sampling

partial: Keep the score in shadow mode until subgroup gaps have confidence intervals, documented causes, and a mitigation plan

partial: Stop this model and use rules-based triage until the agency has unbiased review labels and notice workflows

partial: Launch only the higher threshold with translated notices and appeals, but without random-audit labels or burden caps

Facilitation focus: The case tests whether you recognize fairness as a deployed burden question, not just a protected-column question.


Case 08 · standard · 12 min

The Claimant Chatbot

Public sector AI evaluation

AI evaluation · severity scoring · launch readiness

Hidden concept targets

evaluation sets · hallucination risk · escalation policy · retrieval coverage

Key evidence

ev-803: Golden-answer mismatch log; ev-804: Severity rubric draft; ev-805: Retrieval coverage report; ev-806: Prompt mix audit; ev-807: Escalation policy draft; ev-808: Navigator review clip; ev-809: Red-team transcript excerpt; ev-812: Remediation worksheet

Answer map

correct: The average demo score hides high-severity failures; launch readiness depends on critical-error and escalation performance

incorrect: The chatbot is ready for full homepage launch because the overall answer acceptance rate is above 90 percent

partial: Fixing the retrieval index is enough to launch the current answer policy

partial: Restrict the bot to internal navigator assist until high-stakes handoff behavior is validated under live monitoring

incorrect: Launch the chatbot broadly on the homepage with the current escalation draft

correct: Limit launch to low-risk intents; block or hand off high-stakes policy situations and monitor severe errors

partial: Limit launch to low-risk intents, but skip live severe-error monitoring because the scope is narrow

partial: Allow high-stakes answers when the bot displays citations and offers a handoff link

Facilitation focus: The case tests whether you evaluate AI by severity and use context, not by polished demos or average pass rates.


Case 09 · standard · 12 min

The Payment Hold Dial

Government risk operations

tradeoff reasoning · threshold selection · operational capacity

Hidden concept targets

false positives · false negatives · expected value · human review capacity

Key evidence

ev-901: Threshold dial simulation; ev-903: Historical outcomes by score band; ev-904: Adjudication capacity ledger; ev-905: Payment-hold burden audit; ev-906: Hardship policy memo; ev-907: Hotline escalation clip; ev-911: Score calibration check; ev-912: Staged threshold protocol

Answer map

correct: The threshold should be chosen as an operating policy that balances fraud capture, review capacity, hardship, and subgroup burden

incorrect: The agency should lower the threshold to maximize fraud capture before payments leave

partial: The agency should pay all claims first and investigate later because false positives cause hardship

incorrect: The best policy is the threshold with the highest model precision, regardless of review capacity

incorrect: Set the hold line at score >= 62 to catch the most suspicious claims before payment

correct: Use the middle hold line only with cluster rules, hardship fast lane, and backlog/subgroup guardrails

partial: Set the hold line at score >= 86 to minimize claimant burden

partial: Stop pre-payment holds and investigate suspicious claims only after payment

Facilitation focus: The case tests whether you treat a model threshold as a human operating policy, not a pure model setting.
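Facilitator aid: the sketch below shows one way to make the threshold tradeoff concrete for discussion. Everything in it is hypothetical: the `Band` structure, the band counts and fraud rates, the review capacity of 500, and the middle cut at 74 are assumptions for illustration only. Only the 62 and 86 hold lines appear in the answer map, and the real figures live in the case evidence.

```python
from dataclasses import dataclass

@dataclass
class Band:
    min_score: int      # lower edge of the score band
    claims: int         # claims falling in the band (hypothetical)
    fraud_rate: float   # share of those claims that are fraudulent (hypothetical)

def evaluate_threshold(bands: list[Band], threshold: int, review_capacity: int) -> dict:
    """Summarize what holding every claim at or above `threshold` would imply."""
    held = [b for b in bands if b.min_score >= threshold]
    holds = sum(b.claims for b in held)
    fraud_caught = sum(b.claims * b.fraud_rate for b in held)
    return {
        "holds": holds,
        "fraud_caught": fraud_caught,
        "false_holds": holds - fraud_caught,   # legitimate claimants delayed
        "over_capacity": holds > review_capacity,
    }

# Hypothetical score bands, not the case's historical outcomes.
bands = [Band(50, 1000, 0.02), Band(62, 600, 0.05),
         Band(74, 300, 0.15), Band(86, 100, 0.40)]

results = {t: evaluate_threshold(bands, t, review_capacity=500) for t in (62, 74, 86)}
for threshold, summary in results.items():
    print(threshold, summary)
```

In this toy setup the lowest line catches the most fraud but exceeds review capacity and delays the most legitimate claimants, which is the pattern the facilitation focus asks learners to reason about rather than optimize away.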


Case 10 · standard · 11 min

The Clearance Rate Metric

Public administration analytics

metric design · incentive reasoning · executive communication

Hidden concept targets

Goodhart-style behavior · leading indicators · metric definitions · balanced scorecards

Key evidence

ev-1003: Metric definition diff; ev-1004: Funnel drift extract; ev-1005: Reopened case audit; ev-1006: Team incentive chat; ev-1008: Caseworker callback clip; ev-1009: Metric burden slice; ev-1010: Dashboard drilldown; ev-1011: Metric options scratchpad; ev-1012: Metric lineage note

Answer map

correct: The new metric captures digital throughput but hides reopened cases, delayed payments, and shifted claimant burden

incorrect: The digital clearance rate proves modernization improved because it rose sharply

partial: The metric can remain in the dashboard if the headline also includes reopened cases, payment delay, and paper-channel burden

incorrect: The metric is acceptable if the dashboard footnote explains its definition

incorrect: Approve digital clearance rate as the headline modernization metric

correct: Make first-payment time the headline and report digital clearance only with reopen, delay, hold, and subgroup guardrails

partial: Publish a balanced dashboard now, but mark conflicting guardrails as unresolved

incorrect: Publish the digital clearance headline with a definition footnote and keep other measures internal

Facilitation focus: The case tests whether you see metrics as designed instruments that create incentives, not neutral mirrors of performance.


Case 11 · standard · 12 min

The Survey Sample Mirage

Survey analytics

sampling judgment · representativeness · uncertainty communication

Hidden concept targets

nonresponse bias · sampling frames · weighting limits

Key evidence

ev-1101: Response composition board; ev-1103: Survey invite path; ev-1104: Preference by service history; ev-1106: Nonresponse check; ev-1107: Question wording card; ev-1108: Weighting sensitivity; ev-1111: Follow-up design sketch

Answer map

correct: The survey captures a real portal-user signal, but the sampling frame and nonresponse make the broad customer-preference claim unsafe

incorrect: The large response count and 64 percent majority are enough to represent customer preference

partial: Weighting by channel history fully fixes the sample problem

partial: The survey should be discarded because online respondents are biased

incorrect: The staffing decision should be based mainly on the falling routine-call cost trend

incorrect: Publish the 64 percent result as evidence that customers prefer digital-first support

correct: Report strong portal-user support for routine questions, remove the broad customer claim, and run a mixed-mode follow-up before staffing cuts

partial: Use the weighted 49 percent estimate as the official customer-preference result and proceed with the staffing shift

partial: Report the portal-user result, but postpone staffing cuts until a mixed-mode nonresponse check is complete

partial: Reject the portal recommendation and expand phone staffing because phone-first respondents prefer phone support

Facilitation focus: The case tests whether you can separate a true respondent finding from an unsupported population claim.


Case 12 · standard · 12 min

The Bed-Ready Field

Healthcare operations

data provenance · measurement validity · cross-site comparison

Hidden concept targets

semantic drift · data lineage · metric comparability

Key evidence

ev-1203: Data dictionary diff; ev-1204: Release and metric timeline; ev-1205: Site split printout; ev-1206: Raw timestamp sample; ev-1208: Warehouse mapping note; ev-1209: Manual chart audit; ev-1211: Lineage route sketch; ev-1212: Metric reconciliation scratchpad

Answer map

correct: The pooled dashboard mixes incompatible field meanings; a local signal may exist, but the network claim is not supportable as written

incorrect: The 22 percent network drop proves the discharge coordination workflow improved flow across hospitals

partial: East probably improved, so the workflow can be scaled if the board packet adds a definition footnote

partial: Network reporting should pause until site-specific field definitions are versioned and bridge-tested across eras

incorrect: The easing ED boarding trend is the main explanation, so the field definition issue is secondary

incorrect: Approve the board claim that the workflow cut bed-ready delay by 22 percent across the network

correct: Freeze the network claim, version and reconcile the bed-ready definition, audit timestamps, and report site/era results separately

partial: Publish an East-only improvement with a caveat and scale to sites that can adopt the same template

partial: Delay all discharge reporting until analytics can build a brand-new metric from scratch

incorrect: Attribute the improvement to lower ED boarding and ignore the dictionary change for this packet

Facilitation focus: The case tests whether you inspect data meaning and lineage instead of trusting a stable field name and a clean aggregate trend.


Case 13 · standard · 12 min

The Missingness Report

Clinical analytics

missing data reasoning · bias detection · evidence qualification

Hidden concept targets

complete-case analysis · missing not at random · measurement opportunity

Key evidence

ev-1302: Missingness pattern board; ev-1303: Full cohort versus retained records; ev-1304: Excluded-record outcome check; ev-1306: Analyst notebook excerpt; ev-1307: Language access slice; ev-1309: Sensitivity check; ev-1310: Manual chart audit; ev-1312: Claim revision draft

Answer map

correct: The complete-case report is useful but narrow; structured missingness makes the broad stable-risk claim unsafe

incorrect: Stable risk among complete records proves the workflow did not change the clinical risk profile

partial: Multiple imputation should settle the issue as long as the final risk estimate remains stable

partial: The report should be discarded because missing clinical fields invalidate the entire analysis

incorrect: The missing fields are mainly documentation quality problems and should be separated from clinical judgment

incorrect: Approve the committee claim that risk stayed stable and intake documentation can be deprioritized

correct: Qualify the complete-case result, characterize missingness, audit excluded charts, and run sensitivity checks before clinical action

partial: Run imputation, publish the revised estimate if it stays close, and keep the committee narrative

partial: Stop all sepsis-risk reporting until intake documentation is nearly complete

incorrect: Exclude hallway and interpreter-needed encounters from the quality report so the metric stays comparable

Facilitation focus: The case tests whether you treat missing data as a signal about the data-generating process instead of a cleanup detail.


Case 14 · advanced · 12 min

The Privacy-Safe Export

Data governance

privacy risk review · governance judgment · stakeholder communication

Hidden concept targets

re-identification risk · consent boundaries · data minimization

Key evidence

ev-1403: Export field inventory; ev-1404: Linkage risk surface; ev-1405: Consent language excerpt; ev-1407: External linkage scan; ev-1408: Agreement gap compare; ev-1409: Risk is not evenly distributed; ev-1411: Minimization redesign table; ev-1412: Lifecycle control timeline

Answer map

correct: The file is not ready as proposed, but a minimized, purpose-limited, access-controlled release could be defensible

incorrect: The file is safe to release because direct identifiers are removed and dates are shifted

incorrect: The public health purpose is strong enough to accept residual privacy risk under the current draft

partial: Sensitive outreach data should never be shared outside the county, even with stronger controls

partial: Only aggregate tables should be shared; row-level access is never justified for this project

incorrect: Release the proposed row-level export this week because it passed the direct-identifier checklist

correct: Share only the needed fields through tiered access, consent alignment, linkage review, and retention controls

partial: Seek legal signoff on the current file and release if counsel agrees it is de-identified

partial: Replace the export with public aggregate tables and deny all row-level partner access

incorrect: Allow the vendor to host and reuse the file because the university partner is accountable for the project

Facilitation focus: The case tests whether you treat privacy as a contextual release decision rather than a mechanical masking checklist.


Case 15 · intro · 9 min

The Board Slide

Executive reporting

visualization critique · scale interpretation · claim wording

Hidden concept targets

axis truncation · visual rhetoric · practical significance

Key evidence

ev-1501: Board slide chart; ev-1503: Raw monthly table; ev-1505: Denominator note; ev-1506: Practical threshold card; ev-1507: Longer baseline pull; ev-1508: Subgroup guardrail; ev-1509: Analyst caveat note; ev-1511: Alternative view spec

Answer map

correct: The chart contains a real early improvement signal, but the cropped frame and headline overstate practical impact and certainty

incorrect: The chart proves the intake model dramatically improved eviction-prevention outcomes

partial: The chart shows a board-ready improvement if the cropped axis is labeled and absolute eviction counts are shown beside it

partial: The board should ignore rates and use only absolute eviction outcome counts

partial: The denominator change fully explains the improvement, so the model has no positive signal

incorrect: Approve the cropped chart and headline saying the model nearly cut failures in half

correct: Revise the board slide to show the modest rate gain with counts, targets, denominator notes, and subgroup guardrails; use qualified wording

partial: Remove the chart entirely because the axis crop makes it unusable

incorrect: Use only the 0-100 axis chart and keep the same "nearly halves failures" headline

partial: Delay all board reporting until a full year of post-model data is available

Facilitation focus: The case tests whether you evaluate what a visualization is being used to claim, not just whether the chart is technically labeled.


Case 16 · standard · 12 min

The Geo Test Winner

Retail media

causal design critique · geo experiment interpretation · claim qualification

Hidden concept targets

matched markets · interference · seasonality · counterfactual uncertainty

Key evidence

ev-1601: Matched market board; ev-1604: Pre-analysis plan excerpt; ev-1605: Pre-period balance table; ev-1606: Campaign and seasonality calendar; ev-1607: Market-pair sales readout; ev-1608: Field sales spillover note; ev-1609: Spillover diagnostic; ev-1610: Counterfactual sensitivity strip; ev-1611: Inventory and merchandising ledger; ev-1612: Causal review memo

Answer map

correct: The geo test is directionally promising, but matching, seasonality, spillover, and bundled operations prevent the +11% national causal claim

incorrect: The treated markets beat controls by enough to prove the media campaign caused an 11% lift

partial: The main issue is that the test has too few markets; a larger sample would solve the design concerns

partial: Because there is some spillover, the test provides no useful evidence at all

partial: The lift is entirely caused by inventory and merchandising, so media had no effect

incorrect: Approve national rollout and report that the geo test proved an 11% incremental sales lift

correct: Qualify the result as promising but not decision-grade, fix the claim wording, and either rerun with cleaner holdouts or scale with reserved test markets

partial: Approve a limited regional scale-up while reserving two clean holdout markets and dropping the national +11% claim

incorrect: Scale to similar warm-weather markets using the adjusted estimate, but do not reserve a holdout

partial: Replace the headline with the +4.8% adjusted estimate and approve national scale

Facilitation focus: The case tests whether you can reconstruct the missing counterfactual in a market-level experiment instead of treating a regional win as proof of national causal lift.


Case 17 · standard · 12 min

The Parallel Trends Slide

Labor policy

difference-in-differences judgment · comparison-group critique · causal claim wording

Hidden concept targets

parallel trends · event timing · placebo checks · comparison validity

Key evidence

ev-1701: Parallel trends slide; ev-1704: Comparison selection note; ev-1705: Policy timing ledger; ev-1706: Pre-trend slope check; ev-1707: Industry recovery table; ev-1708: Local workforce note; ev-1709: Placebo outcome check; ev-1710: Estimate sensitivity card; ev-1711: Robustness check note; ev-1712: Alternative control comparison

Answer map

correct: The pilot evidence is suggestive, but pre-trend and timing problems make the strong causal claim unsafe

incorrect: The five-point post-policy divergence proves the workforce pilot caused the employment gain

incorrect: Because baseline levels were nearly identical in January, the comparison group is credible

partial: Any pre-trend difference invalidates the analysis completely and the pilot should be ignored

partial: Manufacturing reopenings fully explain the gains, so the program had no effect

incorrect: Approve the slide saying the pilot caused a five-point employment gain and recommend statewide expansion

correct: Revise the claim as suggestive, add event-study and robustness checks, and avoid expansion claims until comparison validity is stronger

partial: Recommend statewide expansion using the five-point estimate but add a caveat about pre-trends

partial: The pre-trend weakens the expansion claim, but the pilot may still justify a smaller evidence-building extension

incorrect: Keep the causal claim but move the post-period boundary to March

Facilitation focus: The case tests whether you treat difference-in-differences as a counterfactual argument, not a formula that converts any post-period divergence into causation.
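Facilitator aid: a minimal sketch of why the formula alone is not the argument. It computes a simple two-group difference-in-differences estimate alongside a crude pre-period check on whether the treated-minus-control gap was already widening. The series values and function names are hypothetical and are not the figures on the parallel trends slide.

```python
from statistics import mean

def did_estimate(treated_pre, treated_post, control_pre, control_post) -> float:
    """Change in the treated group minus change in the comparison group."""
    return ((mean(treated_post) - mean(treated_pre))
            - (mean(control_post) - mean(control_pre)))

def pre_trend_gap(treated_pre, control_pre) -> float:
    """Average per-period change in the treated-minus-control gap before the policy."""
    gaps = [t - c for t, c in zip(treated_pre, control_pre)]
    return (gaps[-1] - gaps[0]) / (len(gaps) - 1)

# Hypothetical employment rates (percent) by period.
treated_pre, treated_post = [60.0, 61.0, 62.0], [64.0, 65.0]
control_pre, control_post = [60.0, 60.5, 61.0], [61.5, 62.0]

estimate = did_estimate(treated_pre, treated_post, control_pre, control_post)
drift = pre_trend_gap(treated_pre, control_pre)

print(f"DiD estimate: {estimate:.2f} points")
print(f"Pre-period gap drift: {drift:.2f} points per period")
```

In this toy series the groups start at the same level, yet the gap grows before the policy begins, so some of the post-period divergence would have been expected anyway. A nonzero drift does not void the analysis; it limits how strongly the estimate can be worded.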


Case 18 · advanced · 13 min

The Cutoff Policy Claim

Benefits eligibility

regression-discontinuity judgment · manipulation diagnostics · causal claim qualification

Hidden concept targets

running variable · cutoff sorting · fuzzy compliance · bandwidth sensitivity

Key evidence

ev-1801: Cutoff inspector; ev-1804: Eligibility rule excerpt; ev-1805: Density diagnostic; ev-1806: Revision audit log; ev-1807: Near-cutoff balance check; ev-1808: Assignment compliance ledger; ev-1809: Navigator intake note; ev-1810: Outreach and scoring timeline; ev-1811: Bandwidth sensitivity card; ev-1812: Evaluator recommendation

Answer map

correct: The cutoff evidence is suggestive, but score sorting, revisions, and fuzzy compliance make the clean causal claim unsafe

incorrect: Because the treatment starts at score 70, the below-versus-above comparison proves the navigator prevented evictions

incorrect: Bunching near 70 may reflect legitimate caseworker discretion, so the fuzzy estimate should be reported as the main sensitivity check

partial: The sorting concern should narrow the claim to a monitored extension rather than a causal proof claim

partial: A fuzzy RD estimate alone is enough to keep the causal claim if the point estimate remains negative

incorrect: Approve the budget slide saying navigator access prevented evictions by 9.4 percentage points

correct: Revise the claim as suggestive, add manipulation and sensitivity checks, and redesign assignment or scoring before making a causal expansion claim

partial: Replace the headline with the fuzzy RD estimate and recommend expansion as proven effective

partial: Recommend canceling the navigator because the cutoff study has sorting concerns

incorrect: Exclude scores 69-71 and present the remaining estimate without discussing the sorting issue

Facilitation focus: The case tests whether you treat regression discontinuity as an assumption-driven design, not a magic property of any threshold rule.


Case 19 · standard · 12 min

The QuickStart Readout

Product experimentation

statistical power judgment · experiment interpretation · decision under uncertainty

Hidden concept targets

minimum detectable effect · confidence intervals · equivalence testing · exposure dilution

Key evidence

ev-1901: Power audit board; ev-1904: Pre-analysis power note; ev-1905: Exposure funnel ledger; ev-1906: Experiment operations timeline; ev-1907: Estimate interpretation table; ev-1908: Subgroup signal check; ev-1909: Experiment analyst caveat; ev-1910: Decision sensitivity card; ev-1911: Support burden check; ev-1912: Experiment review recommendation

Answer map

correct: The experiment is inconclusive: it failed to reach significance, but it was underpowered for the effect size the team cared about

incorrect: Because p = 0.18, the experiment proves QuickStart Coach has no meaningful effect on activation

partial: Because the point estimate is positive, the feature should ship as a proven activation win

partial: The new-planner subgroup proves the feature works for the intended audience

partial: No decision can be made until the experiment is rerun from scratch

incorrect: Declare no measurable impact and remove QuickStart Coach from the roadmap

correct: Revise the readout as inconclusive, fix exposure/logging, and continue or rerun against the original decision threshold

partial: Ship broadly because the point estimate is positive and the support-burden signal is not worse

incorrect: Call it equivalent to no effect because the p-value missed and the confidence interval crosses zero

partial: If a decision is unavoidable, use a limited rollout with a preserved holdout and explicit uncertainty language

Facilitation focus: The case tests whether you distinguish a failed significance test from evidence that an effect is absent.
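Facilitator aid: a minimal sketch of the distinction, using a normal-approximation confidence interval for a difference in proportions. The counts, the 2-point minimum effect of interest, and the function names are hypothetical; they are not the QuickStart figures. The interval comparison is a simplified stand-in for a formal equivalence test, not a replacement for one.

```python
import math

def diff_ci(conv_c: int, n_c: int, conv_t: int, n_t: int, z: float = 1.96):
    """Approximate 95% CI for (treatment rate - control rate), unpooled standard error."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

def read_out(ci, min_effect: float) -> str:
    """Classify a result against the smallest effect the team cares about."""
    low, high = ci
    if low > 0:
        return "positive effect detected"
    if high < min_effect:
        return "effect of interest ruled out"
    return "inconclusive"

# Hypothetical small test: interval crosses zero but still includes a 2-point lift.
small = diff_ci(400, 2_000, 430, 2_000)
# Hypothetical large test: interval crosses zero and excludes a 2-point lift.
large = diff_ci(40_000, 200_000, 40_200, 200_000)

print(small, read_out(small, min_effect=0.02))
print(large, read_out(large, min_effect=0.02))
```

Both toy results miss significance, but only the large one supports saying the effect of interest is absent. The small one leaves the question open, which is the reading the correct answers above reward.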


Case 20 · standard · 12 min

The Short-Term Lift

Subscription growth

metric horizon judgment · experiment guardrail review · business outcome reasoning

Hidden concept targets

surrogate metrics · retention cohorts · refund bias · long-term value

Key evidence

ev-2001: Lifecycle lift board; ev-2004: Experiment contract excerpt; ev-2005: Cohort maturity timeline; ev-2006: Retention and refund table; ev-2007: Support queue note; ev-2008: Revenue quality ledger; ev-2009: Acquisition mix check; ev-2010: Cancel reason sample; ev-2011: LTV sensitivity card; ev-2012: Launch review memo

Answer map

correct: The paid-start lift is real, but the global growth-quality claim is unsafe because downstream guardrails are worse or immature

incorrect: Because paid starts rose significantly, FastStart is proven to drive efficient subscription growth

partial: Refund and support guardrails are concerning enough to pause expansion while preserving the paid-start lift as a real short-term result

incorrect: Booked annual revenue is the best decision metric because subscription revenue is recognized at checkout

partial: FastStart should ship only to the organic segment because that segment has higher retained starts

incorrect: Approve global rollout and report FastStart as a subscription-growth win

correct: Report the conversion lift, hold the global claim, wait for mature retention/refund outcomes, and use lifecycle guardrails for rollout

partial: Ship globally with a warning that refunds should be watched after launch

partial: Keep FastStart in paid search only, where retained starts look better, while waiting for mature refund outcomes

incorrect: Replace the primary metric with net contribution after seeing the result and declare the test failed

Facilitation focus: The case tests whether you can preserve a valid short-term experimental result while refusing to overextend it into a long-term business outcome.
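
Facilitator aid: "waiting for mature outcomes" can be shown as a simple eligibility rule, where a cohort contributes to a retention or refund readout only once its full observation window has elapsed. This is a minimal sketch under assumed inputs. The cohort names, dates, and 30-day window are hypothetical and are not taken from the evidence files.

```python
from datetime import date, timedelta

def mature_cohorts(cohort_starts, as_of, window_days):
    """Return cohorts whose full observation window has elapsed by as_of."""
    cutoff = as_of - timedelta(days=window_days)
    return [name for name, start in cohort_starts.items() if start <= cutoff]

# Hypothetical cohorts keyed by the date their paid start began.
cohorts = {
    "week_1": date(2024, 1, 1),
    "week_2": date(2024, 1, 8),
    "week_6": date(2024, 2, 5),
}
print(mature_cohorts(cohorts, as_of=date(2024, 2, 20), window_days=30))
```

Cohorts that fail the check can still count toward the paid-start lift. They just cannot yet say anything about retention or refunds.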

Case 21 · advanced · 13 min

The Discharge Score

Hospital readmission

model leakage detection · deployment readiness judgment · feature availability audit

Hidden concept targets

target leakage · temporal validation · decision-time features · prospective validation

Key evidence

ev-2101: Feature-time audit · ev-2103: Validation deck excerpt · ev-2104: Top feature importance · ev-2105: Discharge workflow timeline · ev-2106: Snapshot timing audit · ev-2107: Data engineering caveat · ev-2108: Decision-time replay · ev-2109: Care queue capacity check · ev-2110: Site split performance · ev-2111: Deployment readiness card · ev-2112: Deployment review memo

Answer map

correct: The current model is not ready for discharge-time use because its validation performance depends on post-decision or near-outcome features

incorrect: The model is ready because an AUC above 0.90 on holdout data proves it will rank patients well in production

partial: The late features should be removed, and the remaining signal may still support a redesigned, prospectively validated workflow

partial: This model version is unsuitable for discharge-time use, but the same program could be rebuilt around score-time features

incorrect: The main issue is call-center capacity; the model performance evidence is otherwise sufficient

incorrect: Approve launch next month because the reported AUC is high and the care queue has enough capacity

correct: Block deployment, rebuild using decision-time features, and require prospective shadow validation plus calibration and capacity review

partial: Launch only for the highest-risk threshold while engineers remove the late fields later

partial: Use the current model for retrospective audit while designing a separate discharge-time version

incorrect: Keep the model unchanged but require clinicians to manually review every high-risk patient before action

Facilitation focus: The case tests whether you can treat model performance as conditional on the data-generating and scoring moment, not as a portable property of the model.
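
Facilitator aid: a feature-availability audit amounts to asking, for each input, whether its value exists at the moment the score is produced. The sketch below illustrates that check under stated assumptions. The feature names and timing values are hypothetical and are not the contents of the case's audit files.

```python
def audit_features(first_available_hours, score_hour=0.0):
    """Split features by whether they exist at scoring time.

    first_available_hours maps a feature name to the hour, relative to
    discharge, at which its value is first populated (negative = earlier).
    """
    usable = sorted(f for f, h in first_available_hours.items() if h <= score_hour)
    late = sorted(f for f, h in first_available_hours.items() if h > score_hour)
    return usable, late

# Hypothetical feature timings, in hours relative to discharge.
timings = {
    "age": -72.0,
    "admission_labs": -48.0,
    "followup_call_outcome": 48.0,
    "post_discharge_billing_code": 120.0,
}
usable, late = audit_features(timings)
print("usable at discharge:", usable)
print("late (potential leakage):", late)
```

Any feature in the late list can inflate retrospective validation metrics without being usable when the score is actually needed.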

Case 22 · advanced · 13 min

The Labeling Vendor Benchmark

Trust and safety

label quality audit · benchmark validity judgment · AI deployment governance

Hidden concept targets

ground truth construction · label noise · inter-rater reliability · policy drift

Key evidence

ev-2201: Ground-truth stress test · ev-2203: Launch scorecard excerpt · ev-2204: Vendor contract excerpt · ev-2205: Policy taxonomy map · ev-2206: Policy-slice performance · ev-2207: Rater agreement sample · ev-2208: Gold-slice adjudication · ev-2209: Policy caveat note · ev-2210: Reviewer support chat · ev-2211: Launch queue simulation · ev-2212: Revised validation plan

Answer map

correct: The benchmark is not launch-ready because the vendor labels are noisy, policy-lagged, and uneven across high-risk categories

incorrect: The model is ready because it beats the rules engine on 40,000 vendor-labeled examples

partial: The model may be usable for obvious low-disagreement categories, but not for broad auto-removal

incorrect: The main fix is to replace Vendor A with another vendor that reports higher raw agreement

partial: Keep model development in audit mode while policy categories with high disagreement are adjudicated

incorrect: Approve high-confidence auto-removal before peak season based on the vendor benchmark

correct: Pause broad automation, build an adjudicated gold set, then pilot only slices that clear policy-specific review

partial: Pilot automation only for obvious counterfeit-logo cases while auditing and redesigning the broader label system

incorrect: Switch to a second labeling vendor and launch if its aggregate agreement is higher

partial: Keep the benchmark as-is but raise the confidence threshold until the false positive count looks acceptable

Facilitation focus: The case tests whether you treat ground truth as something constructed by people, incentives, policy versions, and disagreement rules rather than as a fixed column in a benchmark.
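
Facilitator aid: one reason "higher raw agreement" is a weak vendor criterion is that raw agreement ignores how often two raters would agree by chance when one label dominates. The sketch below contrasts raw agreement with Cohen's kappa for two raters. The ten labels are invented for illustration and do not come from the rater agreement sample.

```python
from collections import Counter

def raw_agreement(a, b):
    """Share of items on which two raters gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for chance, for two raters."""
    n = len(a)
    p_o = raw_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's own label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels: both raters mark most items "ok" and never agree on
# which single item should be removed.
rater_1 = ["ok"] * 8 + ["remove", "ok"]
rater_2 = ["ok"] * 8 + ["ok", "remove"]
print(raw_agreement(rater_1, rater_2), cohens_kappa(rater_1, rater_2))
```

In this made-up sample the raters agree on 80 percent of items, yet kappa is below zero because they never agree on a removal. Aggregate agreement can look healthy while the high-risk category has none.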

Case 23 · advanced · 13 min

The Drift Alarm Nobody Owned

Logistics ETA

model monitoring judgment · operational ownership review · incident response design

Hidden concept targets

data drift · calibration decay · model governance · fallback policy

Key evidence

ev-2301: Drift response board · ev-2303: ETA accuracy trend · ev-2304: Slice calibration table · ev-2305: Operations change log · ev-2306: Monitoring ownership page · ev-2307: Alert policy excerpt · ev-2308: Customer impact sample · ev-2309: Retraining proposal · ev-2310: Shadow guardrail simulation · ev-2311: Ownership caveat · ev-2312: Governance handoff memo

Answer map

correct: The model has degraded in operationally important slices, and the deeper failure is that monitoring is not connected to an owned response process

incorrect: The alarm can be closed because aggregate SLA and overall ETA error are still acceptable

partial: Retraining is probably needed, but only after label completeness and the operational routing change are understood

partial: A temporary fallback rule should be used for affected slices, but governance does not need to change

incorrect: This is mainly an MLOps uptime issue because the alert came from the model monitoring dashboard

incorrect: Silence the drift alarm until aggregate SLA breaches or support volume becomes unmanageable

correct: Open an incident, name an owner, guard affected slices, validate labels, and set the alert-response policy

partial: Retrain immediately on the latest four weeks and redeploy if the backtest improves

partial: Disable the ETA model for all deliveries and use static promise windows until peak season ends

incorrect: Leave the model unchanged but add a dashboard note explaining that carrier mix changed

Facilitation focus: The case tests whether you can distinguish model monitoring from model governance: detecting drift is not enough unless someone owns the decision response.
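
Facilitator aid: the "aggregate still looks acceptable" trap is easy to demonstrate, because a small slice with large error can hide inside a healthy overall average. The sketch below computes mean absolute error overall and per slice. The slice names, error values, and 12-minute limit are hypothetical.

```python
from collections import defaultdict

def mae_by_slice(records):
    """records: iterable of (slice_name, predicted_minutes, actual_minutes)."""
    errors = defaultdict(list)
    for slice_name, predicted, actual in records:
        errors[slice_name].append(abs(predicted - actual))
    per_slice = {s: sum(e) / len(e) for s, e in errors.items()}
    all_errors = [x for e in errors.values() for x in e]
    return sum(all_errors) / len(all_errors), per_slice

def breaching_slices(per_slice, limit):
    """Slices whose mean absolute error exceeds the limit."""
    return sorted(s for s, mae in per_slice.items() if mae > limit)

# Hypothetical deliveries: eight urban (5 minutes off), two rural (30 off).
records = [("urban", 60, 65)] * 8 + [("rural", 90, 120)] * 2
overall, per_slice = mae_by_slice(records)
print(overall, per_slice, breaching_slices(per_slice, limit=12))
```

A check like this only detects the problem. Who opens the incident and what fallback applies to the breaching slice still has to be decided by a named owner.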

Case 24 · advanced · 13 min

The Holiday Override

Retail supply chain

target definition review · censored demand reasoning · forecast validation

Hidden concept targets

stockout censoring · lost sales · availability bias · replenishment simulation

Key evidence

ev-2401: Observed sales demand board · ev-2403: Launch deck excerpt · ev-2404: Inventory position table · ev-2405: Sales versus availability audit · ev-2406: Training target definition · ev-2408: Planner caveat · ev-2409: Store segment performance · ev-2410: Lost-sales reconstruction · ev-2411: Replenishment capacity simulation · ev-2412: Model risk review memo

Answer map

correct: The model is not ready for broad replenishment control because it forecasts observed sales censored by inventory rather than unconstrained demand

incorrect: The low observed-sales WAPE proves ShelfSight is accurate enough for automatic planner overrides

partial: The model may be useful for historically unconstrained SKU/store pairs, but constrained pairs need availability-aware validation

partial: Holiday promotion and weather explain the forecast misses, so adding better calendar features is enough

incorrect: The main issue is planner resistance to automation, not a problem with the forecast target

incorrect: Approve automatic overrides for all 300 stores because backtest WAPE is below the launch threshold

correct: Block broad overrides, rebuild around availability-aware demand, and pilot only slices with guardrails

partial: Launch auto-ordering only for historically unconstrained SKU/store pairs while redesigning the constrained-demand target

partial: Keep ShelfSight as an advisory signal for planner review during the holiday period

incorrect: Keep the current observed-sales target but add stockout flags and relaunch the backtest after peak season

Facilitation focus: The case tests whether you can distinguish measured sales from the demand a replenishment decision actually needs to forecast.
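
Facilitator aid: censoring can be shown with a single rule. When a store sold everything it had, the sales figure is only a lower bound on demand. The sketch below flags such rows and reports how much of the data is affected. The field names and values are hypothetical, and flagging alone does not reconstruct lost sales.

```python
def flag_censored(rows):
    """Mark rows where sales consumed all available stock.

    Each row is a dict with units_sold and units_available (assumed
    field names). For censored rows, true demand is at least units_sold.
    """
    return [dict(r, censored=r["units_sold"] >= r["units_available"]) for r in rows]

def censored_share(flagged):
    """Fraction of rows where observed sales understate possible demand."""
    return sum(r["censored"] for r in flagged) / len(flagged)

# Hypothetical SKU/store days.
rows = [
    {"sku": "A", "units_sold": 12, "units_available": 40},
    {"sku": "B", "units_sold": 25, "units_available": 25},  # sold out
    {"sku": "C", "units_sold": 0, "units_available": 0},    # nothing to sell
    {"sku": "D", "units_sold": 7, "units_available": 30},
]
flagged = flag_censored(rows)
print(censored_share(flagged))
```

A forecast scored only against units_sold is rewarded for predicting the sold-out quantity on censored rows, which is why a low error on observed sales says little about demand.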

Case 25 · advanced · 13 min

The DealDesk Pilot

Enterprise AI

AI risk evaluation · tool-permission review · adversarial test design

Hidden concept targets

prompt injection · least privilege · untrusted content · human-in-the-loop controls

Key evidence

ev-2501: Tool-risk replay board · ev-2503: Pilot scorecard excerpt · ev-2504: Tool permission inventory · ev-2506: Indirect injection replay · ev-2507: Data boundary map · ev-2508: Red-team results · ev-2509: Security caveat · ev-2510: Audit log excerpt · ev-2511: Launch blast-radius simulation · ev-2512: Launch readiness memo

Answer map

correct: The pilot is not ready for broad tool-enabled rollout because clean-task helpfulness did not test prompt injection, tool misuse, or data exposure

incorrect: The assistant is ready because clean-task success exceeded 90 percent and users liked it

partial: Lower-risk Q&A and summarization may be usable if retrieval is scoped, cited, and logged

partial: Approval gates reduce one failure mode, but they do not solve retrieval, tool-use, or untrusted-content risk

incorrect: The pilot should stay internal until clean workflows and adversarial workflows perform equally well

incorrect: Expand the full tool-enabled pilot to all account managers before renewal season

correct: Keep scoped low-risk workflows, add tool and data controls, and require adversarial evaluation before expansion

partial: Allow draft-only assistance with CRM read access while postponing CRM writes and ticket creation

partial: Launch broadly as long as users must manually send emails and approve final tool actions

incorrect: The main failure is user training on suspicious attachments, so the model and tool architecture can remain unchanged

Facilitation focus: The case tests whether you can see that a helpful assistant becomes a different risk object once it reads untrusted content and can use tools.
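
Facilitator aid: least privilege for a tool-using assistant can be pictured as a gate in front of every tool call. The sketch below is one possible policy, not the pilot's actual architecture. The tool names, the "untrusted content" flag, and the three outcomes are all assumptions made for illustration, and the gate does not address retrieval scoping or data exposure through read tools.

```python
# Hypothetical tool inventory; real names would come from the permission review.
READ_ONLY_TOOLS = {"crm_read", "doc_search"}
WRITE_TOOLS = {"crm_write", "send_email", "create_ticket"}

def authorize(tool, context_has_untrusted_content, human_approved=False):
    """Decide whether a requested tool call may proceed."""
    if tool in READ_ONLY_TOOLS:
        return "allow"
    if tool in WRITE_TOOLS:
        # Approval alone is not trusted when the request may have been
        # steered by injected text from an attachment or web page.
        if context_has_untrusted_content:
            return "deny"
        return "allow" if human_approved else "needs_approval"
    return "deny"  # unknown tools are denied by default

print(authorize("send_email", context_has_untrusted_content=True, human_approved=True))
print(authorize("crm_write", context_has_untrusted_content=False))
```

The first call is denied even with approval, which mirrors the point that approval gates reduce one failure mode without solving untrusted-content risk.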
