Case 01 · intro · 8 min
The Dashboard Spike
Product analytics
metric interpretation · data quality · uncertainty calibration
Hidden concept targets
instrumentation · denominator shift · event logging
Key evidence
ev-004: Release checklist excerpt
ev-005: Activity split printout
ev-006: Raw event sample
ev-007: Registry change log
ev-008: Engineering standup clip
ev-011: Legacy metric recompute
Answer map
correct: Metric instrumentation changed inside the reporting window
partial: The redesign likely improved early onboarding, but the WAU headline overstates what changed
partial: The paid social campaign drove the spike
incorrect: The pricing page refresh changed user behavior
partial: Bot or test traffic is the primary cause
incorrect: Approve the executive deck claim and cite the 38 percent lift
correct: Pause the WAU headline; report onboarding evidence separately after a stable-event recompute
partial: Ask for another month of data before responding, without rerunning the current metric
partial: Credit the campaign as the likely cause but add a caveat about tracking
Facilitation focus: The case tests whether you can hold two ideas at once: the product may be improving, and the polished dashboard claim may still be unsupported.
Case 02 · standard · 10 min
The Checkout Readout
Experimentation
experiment interpretation · causal caution · red flag detection
Hidden concept targets
multiple comparisons · peeking · sample ratio mismatch
Key evidence
ev-203: Segments workbook tab
ev-204: Analysis timeline
ev-205: Traffic diagnostic export
ev-206: Pre-analysis plan excerpt
ev-207: Experiment review clip
ev-209: Exclusion rule diff
Answer map
correct: The launch story depends on a post-hoc subgroup after peeking and repeated slicing
incorrect: The US paid mobile result is promising enough to treat as a clean targeted launch win
partial: The primary miss should block the launch claim, but the paid-mobile signal may still justify a pre-specified follow-up test
partial: The only issue is browser randomization; the subgroup result is otherwise launch-ready
incorrect: Ship to US paid mobile and cite the significant subgroup lift
correct: Treat the subgroup as diagnostic; fix assignment and rerun or reanalyze with pre-specified rules
incorrect: Keep the test running until the primary metric becomes significant
partial: Ship broadly but monitor the browser issue after launch
Facilitation focus: The case tests whether you can treat a subgroup result as a lead without upgrading it into a launch-grade causal claim.
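Facilitator sketch: the sample ratio mismatch target listed for this case can be demonstrated with a chi-square goodness-of-fit check on arm counts. The function name `srm_check`, the 50/50 intended split, and the counts below are assumptions made for illustration; they are not taken from ev-205 or any other case file.

```python
import math


def srm_check(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) on arm counts."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, chosen only to show a mismatch being flagged.
chi2, p_value = srm_check(n_control=10_000, n_treatment=10_500)
print(f"chi2={chi2:.2f}, p={p_value:.5f}")
```

A very small p-value suggests that assignment or logging did not deliver the intended split, which is a reason to question every downstream comparison, subgroup or not.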
Case 03 · standard · 12 min
The Churn Model Pitch
ML evaluation
model evaluation · decision policy · calibration judgment
Hidden concept targets
selective labels · treatment effects · calibration · feedback loops
Key evidence
ev-303: Outreach ledger extract
ev-304: Calibration slice printout
ev-306: Save desk capacity ledger
ev-307: Retraining dataset diff
ev-308: Evaluation design scratchpad
ev-309: Save desk call clip
ev-310: Discount approval log
Answer map
correct: The model may rank risk, but the evidence does not yet prove the proposed outreach policy creates incremental saves
incorrect: The retrospective top-decile results are strong enough to replace the current save queue
partial: The model should be rejected because customer-success judgment already touches many risky accounts
partial: The main issue is capacity; if staffing increases, the validation evidence is sufficient
incorrect: Buy the model and route the save desk by the vendor score next quarter
correct: Run a constrained policy pilot that tests incremental saves within risk bands before routing the queue
partial: Use the score only as a background flag for CSM review while making no ROI or forecast claims
partial: Reject the model outright because the historical labels are contaminated by save activity
Facilitation focus: The case tests whether you separate prediction from intervention value when a model becomes part of the system it predicts.
Case 04 · standard · 12 min
The Inspection Queue
Public policy analytics
administrative data reasoning · sampling judgment · deployment caution
Hidden concept targets
selective labels · measurement opportunity · feedback loops
Key evidence
ev-403: Coverage board photo
ev-404: Inspection ledger extract
ev-405: Program calendar clipping
ev-407: Audit sample drawer
ev-408: Evaluation design notes
ev-409: Shift-change clip
ev-410: Complaint intake flyer
Answer map
correct: The score contains useful signal, but the backtest overstates readiness because labels reflect where and how the city inspected
incorrect: The high-score bands find more critical violations, so the city should route routine inspections by score
partial: The model should be rejected because prior enforcement patterns contaminate the labels
partial: The main issue is neighborhood information, so removing route and geography fields solves the problem
incorrect: Adopt the score as the primary ranker for next month's routine inspection queue
correct: Pilot the score as one input with random routine checks, thin-corridor sampling, and prospective evaluation
partial: Use the score only to generate analyst leads, with no operational routing change yet
partial: Pause all model use until the city collects an entirely new inspection dataset
Facilitation focus: The case tests whether you reconstruct how administrative labels were produced before treating them as ground truth.
Case 05 · standard · 11 min
The Spring Tutoring Brief
Education analytics
claim evaluation · estimand reasoning · evidence design
Hidden concept targets
selection effects · treatment definition · measurement alignment
Key evidence
ev-503: Roster assignment export
ev-505: Baseline balance table
ev-506: Access log clipping
ev-507: Outcome definition note
ev-508: Evaluation margin notes
ev-510: Family access survey
ev-511: Item alignment excerpt
Answer map
correct: The evidence supports a promising association and implementation signal, not the public causal claim as written
incorrect: The session dose-response proves the tutoring platform caused the stronger spring gains
partial: The platform may help when embedded in scheduled class time, but this dataset cannot isolate that effect
incorrect: The delayed rollout schools provide a clean natural experiment for the board claim
incorrect: Publish the causal impact claim and expand districtwide based on the 5+ session result
correct: Revise to descriptive language and fund continuation only with a cleaner evaluation design
partial: Publish only descriptive results and make no funding recommendation
partial: Keep the public claim descriptive, but recommend districtwide expansion from the usage groups as-is
Facilitation focus: The case tests whether you define the treatment before judging the causal claim.
Case 06 · standard · 12 min
The Winter Shelter Forecast
Public service forecasting
uncertainty ranges · forecast evaluation · scenario thinking
Hidden concept targets
backtesting · concept drift · prediction intervals · censored demand
Key evidence
ev-603: Five-winter backtest tab
ev-605: Bed inventory reconciliation
ev-606: Housing court leading indicators
ev-608: Cold-night residuals
ev-610: Outreach van radio clip
ev-611: Scenario worksheet
Answer map
correct: The model fits ordinary observed shelter use, but the planning target needs a wider range for unmet demand and disruption
incorrect: The five-year backtest is strong enough to use the 82-bed point forecast as the operating plan
incorrect: The main uncertainty is winter temperature, so the warmer seasonal outlook should lower the capacity ask
partial: The issue is mostly a bed-inventory bookkeeping problem; demand itself is probably stable
partial: Leading indicators justify planning above the model interval, while still using the model to anchor the lower end of the range
incorrect: Budget to the model point estimate: 82 overflow beds, with no additional staffing trigger
correct: Plan a 96-108 bed base range, pre-authorize a stress trigger, and update weekly from leading signals
partial: Do not set a winter capacity number until January demand is observed
partial: Open 150 overflow beds immediately because the historical model cannot be trusted
incorrect: Reduce the request below 82 beds because the seasonal outlook is warmer than average
Facilitation focus: The case tests whether you can distinguish a forecast of observed use from a public-service planning range for actual need.
Case 07 · advanced · 12 min
The Benefits Queue Score
Government benefits analytics
fairness reasoning · proxy detection · tradeoff communication
Hidden concept targets
proxy variables · disparate impact · subgroup performance · label bias
Key evidence
ev-703: Burden audit printout
ev-704: Feature explanation export
ev-705: Rights and access memo
ev-706: Training label lineage
ev-707: Subgroup calibration slice
ev-709: Manual review capacity ledger
ev-710: Cleared-case sample
ev-711: Threshold stress test
Answer map
correct: Removing protected attributes is not enough; proxy features and biased labels create uneven review burden
incorrect: The aggregate pilot is strong enough for a limited statewide launch if protected-class fields stay excluded and subgroup monitoring is public
partial: Predictive triage may be legally usable only after the agency proves applicants can contest holds before first payment is delayed
incorrect: The main issue is explaining the model better to applicants and advocates
partial: The model is acceptable if the agency simply hires enough manual reviewers
incorrect: Launch the current statewide routing policy because the model excludes protected attributes and improves aggregate queue speed
correct: Defer statewide launch and test a remediated score under binding burden triggers and independent audit sampling
partial: Keep the score in shadow mode until subgroup gaps have confidence intervals, documented causes, and a mitigation plan
partial: Stop this model and use rules-based triage until the agency has unbiased review labels and notice workflows
partial: Launch only the higher threshold with translated notices and appeals, but without random-audit labels or burden caps
Facilitation focus: The case tests whether you recognize fairness as a deployed burden question, not just a protected-column question.
Case 08 · standard · 12 min
The Claimant Chatbot
Public sector AI evaluation
AI evaluation · severity scoring · launch readiness
Hidden concept targets
evaluation sets · hallucination risk · escalation policy · retrieval coverage
Key evidence
ev-803: Golden-answer mismatch log
ev-804: Severity rubric draft
ev-805: Retrieval coverage report
ev-806: Prompt mix audit
ev-807: Escalation policy draft
ev-808: Navigator review clip
ev-809: Red-team transcript excerpt
ev-812: Remediation worksheet
Answer map
correct: The average demo score hides high-severity failures; launch readiness depends on critical-error and escalation performance
incorrect: The chatbot is ready for full homepage launch because the overall answer acceptance rate is above 90 percent
partial: Fixing the retrieval index is enough to launch the current answer policy
partial: Restrict the bot to internal navigator assist until high-stakes handoff behavior is validated under live monitoring
incorrect: Launch the chatbot broadly on the homepage with the current escalation draft
correct: Limit launch to low-risk intents; block or hand off high-stakes policy situations and monitor severe errors
partial: Limit launch to low-risk intents, but skip live severe-error monitoring because the scope is narrow
partial: Allow high-stakes answers when the bot displays citations and offers a handoff link
Facilitation focus: The case tests whether you evaluate AI by severity and use context, not by polished demos or average pass rates.
Case 09 · standard · 12 min
The Payment Hold Dial
Government risk operations
tradeoff reasoning · threshold selection · operational capacity
Hidden concept targets
false positives · false negatives · expected value · human review capacity
Key evidence
ev-901: Threshold dial simulation
ev-903: Historical outcomes by score band
ev-904: Adjudication capacity ledger
ev-905: Payment-hold burden audit
ev-906: Hardship policy memo
ev-907: Hotline escalation clip
ev-911: Score calibration check
ev-912: Staged threshold protocol
Answer map
correct: The threshold should be chosen as an operating policy that balances fraud capture, review capacity, hardship, and subgroup burden
incorrect: The agency should lower the threshold to maximize fraud capture before payments leave
partial: The agency should pay all claims first and investigate later because false positives cause hardship
incorrect: The best policy is the threshold with the highest model precision, regardless of review capacity
incorrect: Set the hold line at score >= 62 to catch the most suspicious claims before payment
correct: Use the middle hold line only with cluster rules, hardship fast lane, and backlog/subgroup guardrails
partial: Set the hold line at score >= 86 to minimize claimant burden
partial: Stop pre-payment holds and investigate suspicious claims only after payment
Facilitation focus: The case tests whether you treat a model threshold as a human operating policy, not a pure model setting.
Case 10 · standard · 11 min
The Clearance Rate Metric
Public administration analytics
metric design · incentive reasoning · executive communication
Hidden concept targets
Goodhart-style behavior · leading indicators · metric definitions · balanced scorecards
Key evidence
ev-1003: Metric definition diff
ev-1004: Funnel drift extract
ev-1005: Reopened case audit
ev-1006: Team incentive chat
ev-1008: Caseworker callback clip
ev-1009: Metric burden slice
ev-1010: Dashboard drilldown
ev-1011: Metric options scratchpad
ev-1012: Metric lineage note
Answer map
correct: The new metric captures digital throughput but hides reopened cases, delayed payments, and shifted claimant burden
incorrect: The digital clearance rate proves modernization improved because it rose sharply
partial: The metric can remain in the dashboard if the headline also includes reopened cases, payment delay, and paper-channel burden
incorrect: The metric is acceptable if the dashboard footnote explains its definition
incorrect: Approve digital clearance rate as the headline modernization metric
correct: Make first-payment time the headline and report digital clearance only with reopen, delay, hold, and subgroup guardrails
partial: Publish a balanced dashboard now, but mark conflicting guardrails as unresolved
incorrect: Publish the digital clearance headline with a definition footnote and keep other measures internal
Facilitation focus: The case tests whether you see metrics as designed instruments that create incentives, not neutral mirrors of performance.
Case 11 · standard · 12 min
The Survey Sample Mirage
Survey analytics
sampling judgment · representativeness · uncertainty communication
Hidden concept targets
nonresponse bias · sampling frames · weighting limits
Key evidence
ev-1101: Response composition board
ev-1103: Survey invite path
ev-1104: Preference by service history
ev-1106: Nonresponse check
ev-1107: Question wording card
ev-1108: Weighting sensitivity
ev-1111: Follow-up design sketch
Answer map
correct: The survey captures a real portal-user signal, but the sampling frame and nonresponse make the broad customer-preference claim unsafe
incorrect: The large response count and 64 percent majority are enough to represent customer preference
partial: Weighting by channel history fully fixes the sample problem
partial: The survey should be discarded because online respondents are biased
incorrect: The staffing decision should be based mainly on the falling routine-call cost trend
incorrect: Publish the 64 percent result as evidence that customers prefer digital-first support
correct: Report strong portal-user support for routine questions, remove the broad customer claim, and run a mixed-mode follow-up before staffing cuts
partial: Use the weighted 49 percent estimate as the official customer-preference result and proceed with the staffing shift
partial: Report the portal-user result, but postpone staffing cuts until a mixed-mode nonresponse check is complete
partial: Reject the portal recommendation and expand phone staffing because phone-first respondents prefer phone support
Facilitation focus: The case tests whether you can separate a true respondent finding from an unsupported population claim.
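Facilitator sketch: the weighting-limits target can be shown with a few lines of arithmetic. Reweighting fixes the channel mix of the sample, but it still assumes respondents within each channel answer like the nonrespondents in that channel. All shares and rates below are hypothetical and are not the case's 64 percent or 49 percent figures; the group names and `weighted_rate` helper are illustrative only.

```python
def weighted_rate(shares, rates):
    """Average the per-group rates using the given group shares as weights."""
    return sum(shares[group] * rates[group] for group in shares)


# Hypothetical inputs for illustration only.
population_share = {"portal": 0.40, "phone": 0.60}   # true customer mix
sample_share = {"portal": 0.80, "phone": 0.20}       # who actually responded
respondent_rate = {"portal": 0.75, "phone": 0.30}    # share preferring digital-first

raw_estimate = weighted_rate(sample_share, respondent_rate)
weighted_estimate = weighted_rate(population_share, respondent_rate)

# Assumption: phone customers who never saw the invite like digital even less
# than the phone customers who did respond. Weighting cannot detect this.
unobserved_truth = {"portal": 0.75, "phone": 0.15}
true_rate = weighted_rate(population_share, unobserved_truth)

print(f"raw={raw_estimate:.2f} weighted={weighted_estimate:.2f} true={true_rate:.2f}")
```

The gap between the weighted estimate and the assumed true rate is the part only a mixed-mode follow-up can measure.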
Case 12 · standard · 12 min
The Bed-Ready Field
Healthcare operations
data provenance · measurement validity · cross-site comparison
Hidden concept targets
semantic drift · data lineage · metric comparability
Key evidence
ev-1203: Data dictionary diff
ev-1204: Release and metric timeline
ev-1205: Site split printout
ev-1206: Raw timestamp sample
ev-1208: Warehouse mapping note
ev-1209: Manual chart audit
ev-1211: Lineage route sketch
ev-1212: Metric reconciliation scratchpad
Answer map
correct: The pooled dashboard mixes incompatible field meanings; a local signal may exist, but the network claim is not supportable as written
incorrect: The 22 percent network drop proves the discharge coordination workflow improved flow across hospitals
partial: East probably improved, so the workflow can be scaled if the board packet adds a definition footnote
partial: Network reporting should pause until site-specific field definitions are versioned and bridge-tested across eras
incorrect: The easing ED boarding trend is the main explanation, so the field definition issue is secondary
incorrect: Approve the board claim that the workflow cut bed-ready delay by 22 percent across the network
correct: Freeze the network claim, version and reconcile the bed-ready definition, audit timestamps, and report site/era results separately
partial: Publish an East-only improvement with a caveat and scale to sites that can adopt the same template
partial: Delay all discharge reporting until analytics can build a brand-new metric from scratch
incorrect: Attribute the improvement to lower ED boarding and ignore the dictionary change for this packet
Facilitation focus: The case tests whether you inspect data meaning and lineage instead of trusting a stable field name and a clean aggregate trend.
Case 13 · standard · 12 min
The Missingness Report
Clinical analytics
missing data reasoning · bias detection · evidence qualification
Hidden concept targets
complete-case analysis · missing not at random · measurement opportunity
Key evidence
ev-1302: Missingness pattern board
ev-1303: Full cohort versus retained records
ev-1304: Excluded-record outcome check
ev-1306: Analyst notebook excerpt
ev-1307: Language access slice
ev-1309: Sensitivity check
ev-1310: Manual chart audit
ev-1312: Claim revision draft
Answer map
correct: The complete-case report is useful but narrow; structured missingness makes the broad stable-risk claim unsafe
incorrect: Stable risk among complete records proves the workflow did not change the clinical risk profile
partial: Multiple imputation should settle the issue as long as the final risk estimate remains stable
partial: The report should be discarded because missing clinical fields invalidate the entire analysis
incorrect: The missing fields are mainly documentation quality problems and should be separated from clinical judgment
incorrect: Approve the committee claim that risk stayed stable and intake documentation can be deprioritized
correct: Qualify the complete-case result, characterize missingness, audit excluded charts, and run sensitivity checks before clinical action
partial: Run imputation, publish the revised estimate if it stays close, and keep the committee narrative
partial: Stop all sepsis-risk reporting until intake documentation is nearly complete
incorrect: Exclude hallway and interpreter-needed encounters from the quality report so the metric stays comparable
Facilitation focus: The case tests whether you treat missing data as a signal about the data-generating process instead of a cleanup detail.
Case 14 · advanced · 12 min
The Privacy-Safe Export
Data governance
privacy risk review · governance judgment · stakeholder communication
Hidden concept targets
re-identification risk · consent boundaries · data minimization
Key evidence
ev-1403: Export field inventory
ev-1404: Linkage risk surface
ev-1405: Consent language excerpt
ev-1407: External linkage scan
ev-1408: Agreement gap compare
ev-1409: Risk is not evenly distributed
ev-1411: Minimization redesign table
ev-1412: Lifecycle control timeline
Answer map
correct: The file is not ready as proposed, but a minimized, purpose-limited, access-controlled release could be defensible
incorrect: The file is safe to release because direct identifiers are removed and dates are shifted
incorrect: The public health purpose is strong enough to accept residual privacy risk under the current draft
partial: Sensitive outreach data should never be shared outside the county, even with stronger controls
partial: Only aggregate tables should be shared; row-level access is never justified for this project
incorrect: Release the proposed row-level export this week because it passed the direct-identifier checklist
correct: Share only the needed fields through tiered access, consent alignment, linkage review, and retention controls
partial: Seek legal signoff on the current file and release if counsel agrees it is de-identified
partial: Replace the export with public aggregate tables and deny all row-level partner access
incorrect: Allow the vendor to host and reuse the file because the university partner is accountable for the project
Facilitation focus: The case tests whether you treat privacy as a contextual release decision rather than a mechanical masking checklist.
Case 15 · intro · 9 min
The Board Slide
Executive reporting
visualization critique · scale interpretation · claim wording
Hidden concept targets
axis truncation · visual rhetoric · practical significance
Key evidence
ev-1501: Board slide chart
ev-1503: Raw monthly table
ev-1505: Denominator note
ev-1506: Practical threshold card
ev-1507: Longer baseline pull
ev-1508: Subgroup guardrail
ev-1509: Analyst caveat note
ev-1511: Alternative view spec
Answer map
correct: The chart contains a real early improvement signal, but the cropped frame and headline overstate practical impact and certainty
incorrect: The chart proves the intake model dramatically improved eviction-prevention outcomes
partial: The chart shows a board-ready improvement if the cropped axis is labeled and absolute eviction counts are shown beside it
partial: The board should ignore rates and use only absolute eviction outcome counts
partial: The denominator change fully explains the improvement, so the model has no positive signal
incorrect: Approve the cropped chart and headline saying the model nearly cut failures in half
correct: Revise the board slide to show the modest rate gain with counts, targets, denominator notes, and subgroup guardrails; use qualified wording
partial: Remove the chart entirely because the axis crop makes it unusable
incorrect: Use only the 0-100 axis chart and keep the same "nearly halves failures" headline
partial: Delay all board reporting until a full year of post-model data is available
Facilitation focus: The case tests whether you evaluate what a visualization is being used to claim, not just whether the chart is technically labeled.
Case 16 · standard · 12 min
The Geo Test Winner
Retail media
causal design critique · geo experiment interpretation · claim qualification
Hidden concept targets
matched markets · interference · seasonality · counterfactual uncertainty
Key evidence
ev-1601: Matched market board
ev-1604: Pre-analysis plan excerpt
ev-1605: Pre-period balance table
ev-1606: Campaign and seasonality calendar
ev-1607: Market-pair sales readout
ev-1608: Field sales spillover note
ev-1609: Spillover diagnostic
ev-1610: Counterfactual sensitivity strip
ev-1611: Inventory and merchandising ledger
ev-1612: Causal review memo
Answer map
correct: The geo test is directionally promising, but matching, seasonality, spillover, and bundled operations prevent the +11% national causal claim
incorrect: The treated markets beat controls by enough to prove the media campaign caused an 11% lift
partial: The main issue is that the test has too few markets; a larger sample would solve the design concerns
partial: Because there is some spillover, the test provides no useful evidence at all
partial: The lift is entirely caused by inventory and merchandising, so media had no effect
incorrect: Approve national rollout and report that the geo test proved an 11% incremental sales lift
correct: Qualify the result as promising but not decision-grade, fix the claim wording, and either rerun with cleaner holdouts or scale with reserved test markets
partial: Approve a limited regional scale-up while reserving two clean holdout markets and dropping the national +11% claim
incorrect: Scale to similar warm-weather markets using the adjusted estimate, but do not reserve a holdout
partial: Replace the headline with the +4.8% adjusted estimate and approve national scale
Facilitation focus: The case tests whether you can reconstruct the missing counterfactual in a market-level experiment instead of treating a regional win as proof of national causal lift.
Case 17 · standard · 12 min
The Parallel Trends Slide
Labor policy
difference-in-differences judgment · comparison-group critique · causal claim wording
Hidden concept targets
parallel trends · event timing · placebo checks · comparison validity
Key evidence
ev-1701: Parallel trends slide
ev-1704: Comparison selection note
ev-1705: Policy timing ledger
ev-1706: Pre-trend slope check
ev-1707: Industry recovery table
ev-1708: Local workforce note
ev-1709: Placebo outcome check
ev-1710: Estimate sensitivity card
ev-1711: Robustness check note
ev-1712: Alternative control comparison
Answer map
correct: The pilot evidence is suggestive, but pre-trend and timing problems make the strong causal claim unsafe
incorrect: The five-point post-policy divergence proves the workforce pilot caused the employment gain
incorrect: Because baseline levels were nearly identical in January, the comparison group is credible
partial: Any pre-trend difference invalidates the analysis completely and the pilot should be ignored
partial: Manufacturing reopenings fully explain the gains, so the program had no effect
incorrect: Approve the slide saying the pilot caused a five-point employment gain and recommend statewide expansion
correct: Revise the claim as suggestive, add event-study and robustness checks, and avoid expansion claims until comparison validity is stronger
partial: Recommend statewide expansion using the five-point estimate but add a caveat about pre-trends
partial: The pre-trend weakens the expansion claim, but the pilot may still justify a smaller evidence-building extension
incorrect: Keep the causal claim but move the post-period boundary to March
Facilitation focus: The case tests whether you treat difference-in-differences as a counterfactual argument, not a formula that converts any post-period divergence into causation.
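Facilitator sketch: the counterfactual argument behind difference-in-differences can be made concrete with a two-group, two-period calculation plus a crude pre-trend comparison. The monthly employment rates, variable names, and the `monthly_slope` helper below are hypothetical and do not come from ev-1706 or any other case file.

```python
from statistics import mean

# Hypothetical monthly employment rates (percent), for illustration only.
treated = {"pre": [61.0, 62.0, 63.0], "post": [66.0, 67.0]}
control = {"pre": [61.0, 61.2, 61.4], "post": [61.6, 61.8]}


def monthly_slope(series):
    """Average month-over-month change across the series."""
    return (series[-1] - series[0]) / (len(series) - 1)


# Difference-in-differences: treated change minus control change.
treated_change = mean(treated["post"]) - mean(treated["pre"])
control_change = mean(control["post"]) - mean(control["pre"])
did_estimate = treated_change - control_change

# Parallel-trends check: were the groups already diverging before the policy?
pre_trend_gap = monthly_slope(treated["pre"]) - monthly_slope(control["pre"])

print(f"DiD estimate: {did_estimate:.1f} points")
print(f"Pre-period slope gap: {pre_trend_gap:.1f} points per month")
```

In this made-up series the groups start at the same level yet were already diverging before the policy, so part of the estimate would have appeared without any program. Matching baseline levels says nothing about matching trends.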
Case 18 · advanced · 13 min
The Cutoff Policy Claim
Benefits eligibility
regression-discontinuity judgment · manipulation diagnostics · causal claim qualification
Hidden concept targets
running variable · cutoff sorting · fuzzy compliance · bandwidth sensitivity
Key evidence
ev-1801: Cutoff inspector
ev-1804: Eligibility rule excerpt
ev-1805: Density diagnostic
ev-1806: Revision audit log
ev-1807: Near-cutoff balance check
ev-1808: Assignment compliance ledger
ev-1809: Navigator intake note
ev-1810: Outreach and scoring timeline
ev-1811: Bandwidth sensitivity card
ev-1812: Evaluator recommendation
Answer map
correct: The cutoff evidence is suggestive, but score sorting, revisions, and fuzzy compliance make the clean causal claim unsafe
incorrect: Because the treatment starts at score 70, the below-versus-above comparison proves the navigator prevented evictions
incorrect: Bunching near 70 may reflect legitimate caseworker discretion, so the fuzzy estimate should be reported as the main sensitivity check
partial: The sorting concern should narrow the claim to a monitored extension rather than a causal proof claim
partial: A fuzzy RD estimate alone is enough to keep the causal claim if the point estimate remains negative
incorrect: Approve the budget slide saying navigator access prevented evictions by 9.4 percentage points
correct: Revise the claim as suggestive, add manipulation and sensitivity checks, and redesign assignment or scoring before making a causal expansion claim
partial: Replace the headline with the fuzzy RD estimate and recommend expansion as proven effective
partial: Recommend canceling the navigator because the cutoff study has sorting concerns
incorrect: Exclude scores 69-71 and present the remaining estimate without discussing the sorting issue
Facilitation focus: The case tests whether you treat regression discontinuity as an assumption-driven design, not a magic property of any threshold rule.
Case 19 · standard · 12 min
The QuickStart Readout
Product experimentation
statistical power judgment · experiment interpretation · decision under uncertainty
Hidden concept targets
minimum detectable effect · confidence intervals · equivalence testing · exposure dilution
Key evidence
ev-1901: Power audit board
ev-1904: Pre-analysis power note
ev-1905: Exposure funnel ledger
ev-1906: Experiment operations timeline
ev-1907: Estimate interpretation table
ev-1908: Subgroup signal check
ev-1909: Experiment analyst caveat
ev-1910: Decision sensitivity card
ev-1911: Support burden check
ev-1912: Experiment review recommendation
Answer map
correct: The experiment is inconclusive: it failed to reach significance, but it was underpowered for the effect size the team cared about
incorrect: Because p = 0.18, the experiment proves QuickStart Coach has no meaningful effect on activation
partial: Because the point estimate is positive, the feature should ship as a proven activation win
partial: The new-planner subgroup proves the feature works for the intended audience
partial: No decision can be made until the experiment is rerun from scratch
incorrect: Declare no measurable impact and remove QuickStart Coach from the roadmap
correct: Revise the readout as inconclusive, fix exposure/logging, and continue or rerun against the original decision threshold
partial: Ship broadly because the point estimate is positive and the support-burden signal is not worse
incorrect: Call it equivalent to no effect because the p-value missed and the confidence interval crosses zero
partial: If a decision is unavoidable, use a limited rollout with a preserved holdout and explicit uncertainty language
Facilitation focus: The case tests whether you distinguish a failed significance test from evidence that an effect is absent.
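Facilitator sketch: the minimum detectable effect target can be illustrated with the standard normal-approximation formula for a two-arm test on proportions with equal arm sizes. The baseline rate, sample size, and function name below are hypothetical and are not drawn from ev-1904 or the case's actual experiment.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, n_per_arm, alpha=0.05, power=0.80):
    """Approximate absolute lift detectable by a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for target power
    # Standard error of the difference, using the baseline variance in both arms.
    standard_error = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    return (z_alpha + z_power) * standard_error


# Hypothetical inputs: 30 percent baseline activation, 2,000 users per arm.
mde = minimum_detectable_effect(baseline_rate=0.30, n_per_arm=2_000)
print(f"Minimum detectable effect: {mde * 100:.1f} percentage points")
```

If the lift the team cared about is smaller than this value, a non-significant result is what the design would produce most of the time even when the feature works. Exposure dilution makes this worse, because only exposed users can contribute to the effect.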
Case 20 · standard · 12 min
The Short-Term Lift
Subscription growth
metric horizon judgment · experiment guardrail review · business outcome reasoning
Hidden concept targets
surrogate metrics · retention cohorts · refund bias · long-term value
Key evidence
ev-2001: Lifecycle lift board
ev-2004: Experiment contract excerpt
ev-2005: Cohort maturity timeline
ev-2006: Retention and refund table
ev-2007: Support queue note
ev-2008: Revenue quality ledger
ev-2009: Acquisition mix check
ev-2010: Cancel reason sample
ev-2011: LTV sensitivity card
ev-2012: Launch review memo
Answer map
correct: The paid-start lift is real, but the global growth-quality claim is unsafe because downstream guardrails are worse or immature
incorrect: Because paid starts rose significantly, FastStart is proven to drive efficient subscription growth
partial: Refund and support guardrails are concerning enough to pause expansion while preserving the paid-start lift as a real short-term result
incorrect: Booked annual revenue is the best decision metric because subscription revenue is recognized at checkout
partial: FastStart should ship only to the organic segment because that segment has higher retained starts
incorrect: Approve global rollout and report FastStart as a subscription-growth win
correct: Report the conversion lift, hold the global claim, wait for mature retention/refund outcomes, and use lifecycle guardrails for rollout
partial: Ship globally with a warning that refunds should be watched after launch
partial: Keep FastStart in paid search only, where retained starts look better, while waiting for mature refund outcomes
incorrect: Replace the primary metric with net contribution after seeing the result and declare the test failed
Facilitation focus: The case tests whether you can preserve a valid short-term experimental result while refusing to overextend it into a long-term business outcome.
Case 21 · advanced · 13 min
The Discharge Score
Hospital readmission
model leakage detection · deployment readiness judgment · feature availability audit
Hidden concept targets
target leakage · temporal validation · decision-time features · prospective validation
Key evidence
ev-2101: Feature-time audit
ev-2103: Validation deck excerpt
ev-2104: Top feature importance
ev-2105: Discharge workflow timeline
ev-2106: Snapshot timing audit
ev-2107: Data engineering caveat
ev-2108: Decision-time replay
ev-2109: Care queue capacity check
ev-2110: Site split performance
ev-2111: Deployment readiness card
ev-2112: Deployment review memo
Answer map
correct: The current model is not ready for discharge-time use because its validation performance depends on post-decision or near-outcome features
incorrect: The model is ready because an AUC above 0.90 on holdout data proves it will rank patients well in production
partial: The late features should be removed, and the remaining signal may still support a redesigned, prospectively validated workflow
partial: This model version is unsuitable for discharge-time use, but the same program could be rebuilt around score-time features
incorrect: The main issue is call-center capacity; the model performance evidence is otherwise sufficient
incorrect: Approve launch next month because the reported AUC is high and the care queue has enough capacity
correct: Block deployment, rebuild using decision-time features, and require prospective shadow validation plus calibration and capacity review
partial: Launch only for the highest-risk threshold while engineers remove the late fields later
partial: Use the current model for retrospective audit while designing a separate discharge-time version
incorrect: Keep the model unchanged but require clinicians to manually review every high-risk patient before action
Facilitation focus: The case tests whether you can treat model performance as conditional on the data-generating and scoring moment, not as a portable property of the model.
Case 22 · advanced · 13 min
The Labeling Vendor Benchmark
Trust and safety
label quality audit · benchmark validity judgment · AI deployment governance
Hidden concept targets
ground truth construction · label noise · inter-rater reliability · policy drift
Key evidence
ev-2201: Ground-truth stress test
ev-2203: Launch scorecard excerpt
ev-2204: Vendor contract excerpt
ev-2205: Policy taxonomy map
ev-2206: Policy-slice performance
ev-2207: Rater agreement sample
ev-2208: Gold-slice adjudication
ev-2209: Policy caveat note
ev-2210: Reviewer support chat
ev-2211: Launch queue simulation
ev-2212: Revised validation plan
Answer map
correct: The benchmark is not launch-ready because the vendor labels are noisy, policy-lagged, and uneven across high-risk categories
incorrect: The model is ready because it beats the rules engine on 40,000 vendor-labeled examples
partial: The model may be usable for obvious low-disagreement categories, but not for broad auto-removal
incorrect: The main fix is to replace Vendor A with another vendor that reports higher raw agreement
partial: Keep model development in audit mode while policy categories with high disagreement are adjudicated
incorrect: Approve high-confidence auto-removal before peak season based on the vendor benchmark
correct: Pause broad automation, build an adjudicated gold set, then pilot only slices that clear policy-specific review
partial: Pilot automation only for obvious counterfeit-logo cases while auditing and redesigning the broader label system
incorrect: Switch to a second labeling vendor and launch if its aggregate agreement is higher
partial: Keep the benchmark as-is but raise the confidence threshold until the false positive count looks acceptable
Facilitation focus: The case tests whether you treat ground truth as something constructed by people, incentives, policy versions, and disagreement rules rather than as a fixed column in a benchmark.
Case 23 · advanced · 13 min
The Drift Alarm Nobody Owned
Logistics ETA
model monitoring judgment · operational ownership review · incident response design
Hidden concept targets
data drift · calibration decay · model governance · fallback policy
Key evidence
ev-2301: Drift response board
ev-2303: ETA accuracy trend
ev-2304: Slice calibration table
ev-2305: Operations change log
ev-2306: Monitoring ownership page
ev-2307: Alert policy excerpt
ev-2308: Customer impact sample
ev-2309: Retraining proposal
ev-2310: Shadow guardrail simulation
ev-2311: Ownership caveat
ev-2312: Governance handoff memo
Answer map
correct: The model has degraded in operationally important slices, and the deeper failure is that monitoring is not connected to an owned response process
incorrect: The alarm can be closed because aggregate SLA and overall ETA error are still acceptable
partial: Retraining is probably needed, but only after label completeness and the operational routing change are understood
partial: A temporary fallback rule should be used for affected slices, but governance does not need to change
incorrect: This is mainly an MLOps uptime issue because the alert came from the model monitoring dashboard
incorrect: Silence the drift alarm until aggregate SLA breaches or support volume becomes unmanageable
correct: Open an incident, name an owner, guard affected slices, validate labels, and set the alert-response policy
partial: Retrain immediately on the latest four weeks and redeploy if the backtest improves
partial: Disable the ETA model for all deliveries and use static promise windows until peak season ends
incorrect: Leave the model unchanged but add a dashboard note explaining that carrier mix changed
Facilitation focus: The case tests whether you can distinguish model monitoring from model governance: detecting drift is not enough unless someone owns the decision response.
Case 24 · advanced · 13 min
The Holiday Override
Retail supply chain
target definition review · censored demand reasoning · forecast validation
Hidden concept targets
stockout censoring · lost sales · availability bias · replenishment simulation
Key evidence
ev-2401: Observed sales demand board
ev-2403: Launch deck excerpt
ev-2404: Inventory position table
ev-2405: Sales versus availability audit
ev-2406: Training target definition
ev-2408: Planner caveat
ev-2409: Store segment performance
ev-2410: Lost-sales reconstruction
ev-2411: Replenishment capacity simulation
ev-2412: Model risk review memo
Answer map
correct: The model is not ready for broad replenishment control because it forecasts observed sales censored by inventory rather than unconstrained demand
incorrect: The low observed-sales WAPE proves ShelfSight is accurate enough for automatic planner overrides
partial: The model may be useful for historically unconstrained SKU/store pairs, but constrained pairs need availability-aware validation
partial: Holiday promotion and weather explain the forecast misses, so adding better calendar features is enough
incorrect: The main issue is planner resistance to automation, not a problem with the forecast target
incorrect: Approve automatic overrides for all 300 stores because backtest WAPE is below the launch threshold
correct: Block broad overrides, rebuild around availability-aware demand, and pilot only slices with guardrails
partial: Launch auto-ordering only for historically unconstrained SKU/store pairs while redesigning the constrained-demand target
partial: Keep ShelfSight as an advisory signal for planner review during the holiday period
incorrect: Keep the current observed-sales target but add stockout flags and relaunch the backtest after peak season
Facilitation focus: The case tests whether you can distinguish measured sales from the demand a replenishment decision actually needs to forecast.
Case 25 · advanced · 13 min
The DealDesk Pilot
Enterprise AI
AI risk evaluation · tool-permission review · adversarial test design
Hidden concept targets
prompt injection · least privilege · untrusted content · human-in-the-loop controls
Key evidence
ev-2501: Tool-risk replay board
ev-2503: Pilot scorecard excerpt
ev-2504: Tool permission inventory
ev-2506: Indirect injection replay
ev-2507: Data boundary map
ev-2508: Red-team results
ev-2509: Security caveat
ev-2510: Audit log excerpt
ev-2511: Launch blast-radius simulation
ev-2512: Launch readiness memo
Answer map
correct: The pilot is not ready for broad tool-enabled rollout because clean-task helpfulness did not test prompt injection, tool misuse, or data exposure
incorrect: The assistant is ready because clean-task success exceeded 90 percent and users liked it
partial: Lower-risk Q&A and summarization may be usable if retrieval is scoped, cited, and logged
partial: Approval gates reduce one failure mode, but they do not solve retrieval, tool-use, or untrusted-content risk
incorrect: The pilot should stay internal until clean workflows and adversarial workflows perform equally well
incorrect: Expand the full tool-enabled pilot to all account managers before renewal season
correct: Keep scoped low-risk workflows, add tool and data controls, and require adversarial evaluation before expansion
partial: Allow draft-only assistance with CRM read access while postponing CRM writes and ticket creation
partial: Launch broadly as long as users must manually send emails and approve final tool actions
incorrect: The main failure is user training on suspicious attachments, so the model and tool architecture can remain unchanged
Facilitation focus: The case tests whether you can see that a helpful assistant becomes a different risk object once it reads untrusted content and can use tools.