Choosing the Right Statistical Test
The choice depends on the type of variable measured and the nature of the populations compared.
Decision tree
| Situation | Recommended test | Usage | Example |
|---|---|---|---|
| Binary variable (yes/no): conversion rate, enrolment rate | Proportion difference test (Z-test or chi-squared) | Compare two rates: Exposed vs Non-Exposed Population | Enrolment rate: 6.5% vs 4.0% |
| Continuous variable, normal distribution: average basket, revenue per customer | Student’s t-test (two independent samples) | Compare two means on normal distributions | Average basket: €85 vs €72 |
| Continuous variable, non-normal distribution: frequency, LTV | Mann-Whitney test (non-parametric) | Compare two distributions (often summarised by their medians) without assuming normality | Purchases/year: median 3 vs 2 |
| Control variables to integrate (customer profile, history) | Linear or logistic regression with control variables | Isolate Reelevant’s effect while controlling for other factors | Effect of exposure controlling for tenure and segment |
| Observational data without possible randomisation | Propensity Score Matching (PSM) or Difference-in-Differences (DiD) | Quasi-experimentation to approximate a causal effect | Comparison of Exposed/Non-Exposed with matched profiles |
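The table above can be read as a small decision procedure. The sketch below encodes it in Python. The function name `recommend_test`, its parameters, and the order in which the conditions are checked are assumptions made for illustration; the table itself does not rank the situations.

```python
def recommend_test(variable_type, is_normal=True,
                   has_control_variables=False, randomised=True):
    """Return the test recommended by the decision table.

    Assumption: observational data is checked first, then control
    variables, then the variable type. The table does not impose this order.
    """
    if not randomised:
        return "Propensity Score Matching (PSM) or Difference-in-Differences (DiD)"
    if has_control_variables:
        return "Linear or logistic regression with control variables"
    if variable_type == "binary":
        return "Proportion difference test (Z-test or chi-squared)"
    if variable_type == "continuous":
        if is_normal:
            return "Student's t-test (two independent samples)"
        return "Mann-Whitney test (non-parametric)"
    raise ValueError(f"Unknown variable type: {variable_type!r}")


print(recommend_test("binary"))
print(recommend_test("continuous", is_normal=False))
```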
Test 1 — Proportion Difference Test (Z-test)
The most common test for binary metrics such as enrolment rates, conversion rates, or reactivation rates.
Test statistic:
Z = (p̂_E − p̂_C) / √[ p̂(1 − p̂) × (1/n_E + 1/n_C) ]
Where:
p̂_E = observed rate in the Exposed Population
p̂_C = observed rate in the Non-Exposed Population
p̂ = pooled rate = (x_E + x_C) / (n_E + n_C)
n_E = size of the Exposed Population
n_C = size of the Non-Exposed Population
| Threshold | Interpretation |
|---|---|
| \|Z\| > 1.96 | Two-sided p-value < 0.05 → Uplift statistically significant at the 95% confidence level |
| \|Z\| > 2.58 | Two-sided p-value < 0.01 → Uplift statistically significant at the 99% confidence level |
| 95% CI | Uplift ± 1.96 × standard error of the Uplift |
Worked example
| Metric | Exposed Population | Non-Exposed Population |
|---|---|---|
| Size | 50,000 | 50,000 |
| Enrolment rate | 6.5% (3,250) | 4.0% (2,000) |
| Absolute Uplift | +2.5 pts | |
Step-by-step calculation:
p̂_pooled = (3,250 + 2,000) / (50,000 + 50,000) = 5.25%
Standard error = √[ 0.0525 × 0.9475 × (1/50,000 + 1/50,000) ] = 0.001411
Z = (0.065 − 0.040) / 0.001411 = 17.72
→ Significant at 99.9%. 95% CI: [+2.22 pts; +2.78 pts]
A Z of 17.72 means the observed difference lies about 17.7 standard errors away from zero. If exposure had no effect, a gap this large would occur by chance far less often than 1 time in 1,000,000. Chance alone is not a plausible explanation for this Uplift.
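The worked example can be reproduced with a few lines of Python using only the standard library. This is a minimal sketch, not a reference implementation: the function name `two_proportion_ztest` is illustrative, and the confidence interval is computed with the unpooled standard error, which is a common convention but an assumption here since the text does not specify it.

```python
from math import erfc, sqrt
from statistics import NormalDist


def two_proportion_ztest(x_e, n_e, x_c, n_c, confidence=0.95):
    """Two-sided pooled Z-test for the difference between two rates."""
    p_e = x_e / n_e
    p_c = x_c / n_c
    uplift = p_e - p_c

    # Pooled rate and standard error under the hypothesis of no difference.
    p_pool = (x_e + x_c) / (n_e + n_c)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_e + 1 / n_c))
    z = uplift / se_pooled

    # Two-sided p-value; erfc stays accurate for very large |z|.
    p_value = erfc(abs(z) / sqrt(2))

    # Assumption: the interval uses the unpooled standard error.
    se_unpooled = sqrt(p_e * (1 - p_e) / n_e + p_c * (1 - p_c) / n_c)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (uplift - z_crit * se_unpooled, uplift + z_crit * se_unpooled)
    return {"uplift": uplift, "z": z, "p_value": p_value, "ci": ci}


result = two_proportion_ztest(x_e=3250, n_e=50_000, x_c=2000, n_c=50_000)
print(f"Uplift: {result['uplift'] * 100:.2f} pts")
print(f"Z: {result['z']:.2f}")
print(f"95% CI: [{result['ci'][0] * 100:.2f}; {result['ci'][1] * 100:.2f}] pts")
```

With the figures from the table (3,250 and 2,000 enrolments out of 50,000 each), the Z statistic comes out near 17.7 and the interval near [+2.22; +2.78] pts.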
Test 2 — Student’s t-test for Continuous Metrics
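Used to compare means: average basket, annual spend, observed LTV. The statistic below estimates the variance of each group separately (the Welch form), so it does not require the two populations to have equal variances.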
Test statistic:
t = (x̄_E − x̄_C) / √( s²_E/n_E + s²_C/n_C )
Where:
x̄_E = mean of the Exposed Population
x̄_C = mean of the Non-Exposed Population
s²_E = variance of the Exposed Population
s²_C = variance of the Non-Exposed Population
| Threshold | Interpretation |
|---|---|
| \|t\| > 1.96 | Uplift significant at 95% (large samples) |
| 95% CI | (x̄_E − x̄_C) ± t_critical × standard error |
Worked example — impact on average basket
| Group | Mean | Std. dev. | N | t-value | p-value |
|---|---|---|---|---|---|
| Exposed | €87.50 | €32 | 10,000 | | |
| Non-Exposed | €74.20 | €30 | 10,000 | | |
| Uplift | +€13.30 | | | 30.3 | < 0.001 |
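The t-value in the table can be recomputed from the summary statistics alone. The sketch below applies the formula given earlier in this section. The function name `t_from_summary` is illustrative, and the p-value and the 1.96 critical value rely on the normal approximation, an assumption that is reasonable only for large samples such as the 10,000 customers per group here.

```python
from math import erfc, sqrt


def t_from_summary(mean_e, sd_e, n_e, mean_c, sd_c, n_c, critical=1.96):
    """t statistic with separate variances, computed from summary statistics."""
    diff = mean_e - mean_c
    se = sqrt(sd_e**2 / n_e + sd_c**2 / n_c)
    t = diff / se

    # Large-sample assumption: the t distribution is approximated by the normal.
    p_value = erfc(abs(t) / sqrt(2))
    ci = (diff - critical * se, diff + critical * se)
    return t, p_value, ci


t, p_value, ci = t_from_summary(87.50, 32, 10_000, 74.20, 30, 10_000)
print(f"t = {t:.1f}")
print(f"95% CI: [{ci[0]:.2f}; {ci[1]:.2f}] EUR")
```

The standard error is √(32²/10,000 + 30²/10,000) ≈ 0.439, which gives t ≈ 30.3 and a 95% interval of roughly [+€12.44; +€14.16].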
Uplift Modelling — Going Further
Uplift modelling (or causal machine learning) estimates the individual treatment effect, not just the average effect. It identifies four customer profiles:
| Profile | Behaviour without exposure | Behaviour with exposure | Recommended action |
|---|---|---|---|
| Persuadables (true responders) | Does not convert | Converts | TARGET — high priority |
| Sure things (always buyers) | Converts | Converts | Do not expose — waste of resources |
| Lost causes (never buyers) | Does not convert | Does not convert | Do not expose — no effect |
| Sleeping dogs (negative effect) | Converts | Does not convert | Exclude — counter-productive |
In practice, Uplift modelling requires large volumes and data science expertise. For most Reelevant Use Cases, the classic A/B test framework is sufficient and much simpler to implement.
Reading and Communicating Confidence Intervals
Scenario 1 — Positive and significant
95% CI: [+1.8%; +3.2%]
The effect is positive and statistically significant: the whole interval lies above zero. The lower bound, +1.8 pts, is a conservative estimate of the Uplift at the 95% confidence level, not a guaranteed minimum.
→ Actionable result — publish and valorise.
Scenario 2 — Inconclusive
95% CI: [−0.2%; +2.8%]
The interval crosses zero. A null effect cannot be excluded. The result is not conclusive.
→ Increase sample size or extend the test.
Scenario 3 — Negative and significant
95% CI: [−2.1%; −0.3%]
The effect is negative and statistically significant. The Reelevant Content had a counter-productive effect on this segment.
→ Stop exposure on this segment and review the Content.
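The three scenarios reduce to checking where the interval sits relative to zero. A minimal sketch, in which the function name `interpret_interval` and the returned labels are illustrative choices:

```python
def interpret_interval(lower, upper):
    """Classify a confidence interval on the Uplift into one of three scenarios."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if lower > 0:
        return "positive and significant"
    if upper < 0:
        return "negative and significant"
    # The interval contains zero: a null effect cannot be excluded.
    return "inconclusive"


for bounds in [(1.8, 3.2), (-0.2, 2.8), (-2.1, -0.3)]:
    print(bounds, "->", interpret_interval(*bounds))
```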
Common Interpretation Mistakes
- Confusing statistical significance with practical importance: An Uplift of +0.1% can be statistically significant on a large sample and still have little business value.
- Stopping the test as soon as a positive result appears (peeking): This artificially inflates the false-positive rate.
- Running multiple tests without correction (Bonferroni or Benjamini-Hochberg): At a 5% significance level, about 1 false positive is expected out of 20 tests even when no real effect exists.
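- Not verifying test assumptions: The t-test assumes approximately normal data, or samples large enough for the sample means to be approximately normal. On small or heavily skewed samples, prefer the Mann-Whitney test.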
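The multiple-testing point can be made concrete. The sketch below computes the expected number of false positives and the probability of at least one across 20 independent tests at a 5% level, then applies the two corrections named in the list. The p-values are hypothetical and serve only to show how the two corrections differ; the function names are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, k being the largest rank with p_(k) <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


alpha, n_tests = 0.05, 20
expected_false_positives = alpha * n_tests     # 1 when no real effect exists
familywise_error = 1 - (1 - alpha) ** n_tests  # assumes independent tests

p_values = [0.001, 0.008, 0.012, 0.041, 0.20]  # hypothetical p-values
print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {familywise_error:.2f}")
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On these hypothetical values, Bonferroni keeps two results and Benjamini-Hochberg keeps three: the first controls the chance of any false positive, while the second controls the share of false positives among the results declared significant and is therefore less strict.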
Validation Checklist Before Publishing Results
Experimental design
Statistical analysis
Valorisation
Communication