# Sample Size Guide

Reference for calculating sample sizes and test duration.

## Sample Size Fundamentals

### Required Inputs

1. **Baseline conversion rate**: Your current rate
2. **Minimum detectable effect (MDE)**: Smallest change worth detecting
3. **Statistical significance level**: Usually 95% (α = 0.05)
4. **Statistical power**: Usually 80% (β = 0.20)

### What These Mean

**Baseline conversion rate**: If your page converts at 5%, that's your baseline.

**MDE (Minimum Detectable Effect)**: The smallest improvement you care about detecting, expressed here as a relative lift over the baseline. Set this based on:
- Business impact (is a 5% lift meaningful?)
- Implementation cost (worth the effort?)
- Realistic expectations (what have past tests shown?)

**Statistical significance (95%)**: If there is no real difference between the variants, there is at most a 5% chance of seeing a difference this large from random variation alone (the false-positive rate, α).

**Statistical power (80%)**: If there is a real effect of size MDE, you have an 80% chance of detecting it (and a 20% chance, β, of missing it).
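
To show how the four inputs above combine, here is a minimal sketch using the textbook normal-approximation formula for a two-sided two-proportion z-test. The function name `sample_size_per_variant` and its parameters are illustrative assumptions, not something this guide or any particular calculator prescribes. Results depend on the exact formula and settings, so they will not necessarily match a given online calculator or the rounded quick-reference figures in this guide.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors per variant (two-sided two-proportion z-test).

    Assumption: unpooled-variance normal approximation, equal traffic split.
    """
    p1 = baseline                        # control conversion rate
    p2 = baseline * (1 + relative_lift)  # variant rate if the MDE is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Example: 5% baseline, 20% relative lift (5% -> 6%), default 95% / 80%.
n = sample_size_per_variant(baseline=0.05, relative_lift=0.20)
print(f"{n:,} visitors per variant")
```

Halving the MDE roughly quadruples the required sample, which is why the choice of MDE usually matters more than any other input.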

---

## Sample Size Quick Reference Tables

### Conversion Rate: 1%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (1% → 1.05%) | 1,500,000 | 3,000,000 |
| 10% (1% → 1.1%) | 380,000 | 760,000 |
| 20% (1% → 1.2%) | 97,000 | 194,000 |
| 50% (1% → 1.5%) | 16,000 | 32,000 |
| 100% (1% → 2%) | 4,200 | 8,400 |

### Conversion Rate: 3%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (3% → 3.15%) | 480,000 | 960,000 |
| 10% (3% → 3.3%) | 120,000 | 240,000 |
| 20% (3% → 3.6%) | 31,000 | 62,000 |
| 50% (3% → 4.5%) | 5,200 | 10,400 |
| 100% (3% → 6%) | 1,400 | 2,800 |

### Conversion Rate: 5%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (5% → 5.25%) | 280,000 | 560,000 |
| 10% (5% → 5.5%) | 72,000 | 144,000 |
| 20% (5% → 6%) | 18,000 | 36,000 |
| 50% (5% → 7.5%) | 3,100 | 6,200 |
| 100% (5% → 10%) | 810 | 1,620 |

### Conversion Rate: 10%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (10% → 10.5%) | 130,000 | 260,000 |
| 10% (10% → 11%) | 34,000 | 68,000 |
| 20% (10% → 12%) | 8,700 | 17,400 |
| 50% (10% → 15%) | 1,500 | 3,000 |
| 100% (10% → 20%) | 400 | 800 |

### Conversion Rate: 20%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (20% → 21%) | 60,000 | 120,000 |
| 10% (20% → 22%) | 16,000 | 32,000 |
| 20% (20% → 24%) | 4,000 | 8,000 |
| 50% (20% → 30%) | 700 | 1,400 |
| 100% (20% → 40%) | 200 | 400 |

---

## Duration Calculator

### Formula

```
Duration (days) = (Sample per variant × Number of variants) / (Daily traffic × % exposed)
```

### Examples

**Scenario 1: High-traffic page**
- Need: 10,000 per variant (2 variants = 20,000 total)
- Daily traffic: 5,000 visitors
- 100% exposed to test
- Duration: 20,000 / 5,000 = **4 days** (sample reached, but still run at least 1 full week per the minimum duration rules)

**Scenario 2: Medium-traffic page**
- Need: 30,000 per variant (2 variants = 60,000 total)
- Daily traffic: 2,000 visitors
- 100% exposed to test
- Duration: 60,000 / 2,000 = **30 days**

**Scenario 3: Low-traffic with partial exposure**
- Need: 15,000 per variant (2 variants = 30,000 total)
- Daily traffic: 500 visitors
- 50% exposed to test
- Effective daily traffic: 500 × 50% = 250 visitors
- Duration: 30,000 / 250 = **120 days** (too long; well past the 4-8 week guideline)
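
The formula and the three scenarios above can be reproduced with a few lines of Python. This is a small sketch; the function name `duration_days` and the choice to round up to whole days are assumptions made for illustration.

```python
from math import ceil


def duration_days(sample_per_variant, variants, daily_traffic, exposed=1.0):
    """Days needed to collect the full sample, rounded up to whole days.

    `exposed` is the fraction of daily traffic that enters the test (0-1].
    """
    if daily_traffic <= 0 or not 0 < exposed <= 1:
        raise ValueError("daily_traffic must be > 0 and exposed must be in (0, 1]")
    total_sample = sample_per_variant * variants
    return ceil(total_sample / (daily_traffic * exposed))


print(duration_days(10_000, 2, 5_000))        # Scenario 1: 4
print(duration_days(30_000, 2, 2_000))        # Scenario 2: 30
print(duration_days(15_000, 2, 500, 0.5))     # Scenario 3: 120
```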

### Minimum Duration Rules

Even with sufficient sample size, run tests for at least:
- **1 full week**: To capture day-of-week variation
- **2 business cycles**: If B2B (weekday vs. weekend patterns)
- **Through paydays**: If e-commerce (beginning/end of month)

### Maximum Duration Guidelines

Avoid running tests longer than 4-8 weeks:
- Novelty effects wear off
- External factors intervene
- Opportunity cost of other tests
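
One way to apply the minimum and maximum rules above to a raw duration is sketched below. Rounding up to whole weeks is an assumption made here so that every weekday is covered equally; the guide itself only requires at least one full week. The name `planned_run` and the 8-week cutoff (the upper end of the 4-8 week guideline) are likewise illustrative choices.

```python
from math import ceil

MAX_DAYS = 8 * 7  # upper end of the 4-8 week guideline


def planned_run(raw_days):
    """Round a raw duration up to whole weeks and flag runs past the guideline."""
    weeks = max(1, ceil(raw_days / 7))  # never plan less than 1 full week
    days = weeks * 7
    return {"days": days, "weeks": weeks, "too_long": days > MAX_DAYS}


print(planned_run(4))    # {'days': 7, 'weeks': 1, 'too_long': False}
print(planned_run(30))   # {'days': 35, 'weeks': 5, 'too_long': False}
print(planned_run(120))  # {'days': 126, 'weeks': 18, 'too_long': True}
```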

---

## Online Calculators

### Recommended Tools

**Evan Miller's Calculator**
https://www.evanmiller.org/ab-testing/sample-size.html
- Simple interface
- Bookmark-worthy

**Optimizely's Calculator**
https://www.optimizely.com/sample-size-calculator/
- Business-friendly language
- Duration estimates

**AB Test Guide Calculator**
https://www.abtestguide.com/calc/
- Includes Bayesian option
- Multiple test types

**VWO Duration Calculator**
https://vwo.com/tools/ab-test-duration-calculator/
- Duration-focused
- Good for planning

---

## Adjusting for Multiple Variants

With more than 2 variants (A/B/n tests), you need more total sample:

| Variants | Total Sample Multiplier (vs. A/B) |
|----------|------------|
| 2 (A/B) | 1x |
| 3 (A/B/C) | ~1.5x |
| 4 (A/B/C/D) | ~2x |
| 5+ | Consider reducing variants |

**Why?** Every additional variant needs its own full sample, and more comparisons increase the chance of a false positive. You're comparing:
- A vs B
- A vs C
- B vs C (sometimes)

Apply a Bonferroni correction (divide α by the number of comparisons) or use tools that handle this automatically.
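
As a rough illustration of the Bonferroni correction mentioned above, the sketch below divides α by the number of comparisons and feeds the adjusted value into a standard two-proportion sample size formula. The helper names (`comparisons_for`, `sample_size_per_variant`), the 5% baseline, and the 20% lift are assumptions chosen for the example; the resulting totals come from this formula and are not meant to reproduce the multipliers in the table.

```python
from math import ceil
from statistics import NormalDist


def comparisons_for(variants, all_pairs=False):
    """Comparisons made: each challenger vs. control, or every pair of variants."""
    if all_pairs:
        return variants * (variants - 1) // 2
    return variants - 1


def sample_size_per_variant(baseline, relative_lift, alpha, power=0.80):
    """Two-sided two-proportion z-test, normal approximation (an assumption)."""
    p1, p2 = baseline, baseline * (1 + relative_lift)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)


BASE_ALPHA = 0.05
for variants in (2, 3, 4):
    k = comparisons_for(variants)
    adjusted_alpha = BASE_ALPHA / k  # Bonferroni: split alpha across comparisons
    n = sample_size_per_variant(0.05, 0.20, adjusted_alpha)
    print(f"{variants} variants: alpha={adjusted_alpha:.4f}, "
          f"{n:,} per variant, {n * variants:,} total")
```

Bonferroni is conservative, so the per-variant sample grows with each added comparison on top of the extra traffic each new variant already needs.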

---

## Common Sample Size Mistakes

### 1. Underpowered tests
**Problem**: Not enough sample to detect realistic effects
**Fix**: Be realistic about MDE, get more traffic, or don't test

### 2. Overpowered tests
**Problem**: Waiting for the planned sample size when you already have significance
**Fix**: This is actually fine. You committed to a sample size, so honor it; stopping at the first significant result inflates false positives

### 3. Wrong baseline rate
**Problem**: Using the wrong conversion rate for the calculation
**Fix**: Use the specific metric and page, not site-wide averages

### 4. Ignoring segments
**Problem**: Calculating for full traffic, then analyzing segments
**Fix**: If you plan segment analysis, calculate sample for the smallest segment

### 5. Testing too many things
**Problem**: Dividing traffic too many ways
**Fix**: Prioritize ruthlessly, run fewer concurrent tests

---

## When Sample Size Requirements Are Too High

Options when you can't get enough traffic:

1. **Increase MDE**: Accept only detecting larger effects (20%+ lift)
2. **Lower confidence**: Use 90% instead of 95% (risky, document it)
3. **Reduce variants**: Test only the most promising variant
4. **Combine traffic**: Test across multiple similar pages
5. **Test upstream**: Test earlier in funnel where traffic is higher
6. **Don't test**: Make decision based on qualitative data instead
7. **Longer test**: Accept longer duration (weeks/months)

---

## Sequential Testing

Use sequential testing if you must check results before reaching the planned sample size.

### What is it?
A statistical method that adjusts for multiple looks at the data.

### When to use
- High-risk changes
- Need to stop bad variants early
- Time-sensitive decisions

### Tools that support it
- Optimizely (Stats Engine)
- VWO (SmartStats)
- PostHog (Bayesian approach)

### Tradeoff
- More flexibility to stop early
- Slightly larger sample size requirement
- More complex analysis

---

## Quick Decision Framework

### Can I run this test?

```
Daily traffic to page: _____
Baseline conversion rate: _____
MDE I care about: _____

Sample needed per variant: _____ (from tables above)
Days to run: (Sample per variant × Number of variants) / (Daily traffic × % exposed) = _____

If days > 60: Consider alternatives
If days > 30: Acceptable for high-impact tests
If days < 14: Likely feasible
If days < 7: Easy to run, consider running longer anyway
```
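
The thresholds in the worksheet can be written as a small lookup. This is a sketch, with the function name `feasibility` chosen for illustration. The worksheet gives no verdict for 14 to 30 days, so that band is reported as uncovered rather than guessed.

```python
def feasibility(days):
    """Map an estimated test duration (in days) to the worksheet's verdicts."""
    if days > 60:
        return "Consider alternatives"
    if days > 30:
        return "Acceptable for high-impact tests"
    if days < 7:
        return "Easy to run, consider running longer anyway"
    if days < 14:
        return "Likely feasible"
    # Assumption: the worksheet is silent on 14-30 days, so no verdict is invented.
    return "Not covered by the worksheet (14-30 days), use judgment"


for days in (4, 10, 21, 45, 120):
    print(f"{days:>3} days: {feasibility(days)}")
```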