Hypothesis Testing Calculator
Run a complete, guided hypothesis test in 6 structured steps - from stating H₀ to the final conclusion.
📖 What is Hypothesis Testing?
Hypothesis testing is a formal statistical procedure for using sample data to evaluate a claim about a population parameter. It is the backbone of scientific inquiry - from clinical drug trials to manufacturing quality control to psychological experiments. The procedure answers the question: "Given what I observed in my sample, is there enough evidence to reject the assumption that nothing unusual is happening?"
Every hypothesis test involves two competing statements. The null hypothesis (H₀) is the default, conservative position - typically that a population mean equals a reference value, that two groups are the same, or that a proportion equals a claimed value. The alternative hypothesis (H₁) is what the researcher is trying to demonstrate: that there is a real effect, a real difference, or a meaningful departure from the reference.
The test works by computing a test statistic - a number that summarises how far the sample data is from what H₀ predicts. Under H₀, this statistic follows a known distribution (Z, t, F, chi-square). The p-value measures how likely it is to observe a result at least as extreme as yours if H₀ were true. A small p-value (below the chosen significance level α, typically 0.05) is evidence against H₀, and we reject it in favour of H₁.
This calculator supports five major test types: the one-sample Z-test (population σ known), the one-sample t-test (σ estimated from sample), the one-proportion Z-test, the two-sample Welch's t-test, and the paired t-test. For every test, it produces all six standard steps and computes Cohen's d as an effect size to quantify practical significance alongside statistical significance.
📐 Formulas
One-sample Z: Z = (x̄ − μ₀) / (σ / √n) - use when population σ is known
One-sample t: t = (x̄ − μ₀) / (s / √n), df = n − 1 - use when σ is estimated by the sample SD s
One-proportion Z: Z = (p̂ − p₀) / √(p₀(1−p₀)/n) - normal approximation; valid when np₀ ≥ 5 and n(1−p₀) ≥ 5
Two-sample Welch's t: t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂), with df by Welch-Satterthwaite equation
Paired t: t = d̄ / (s_d / √n), df = n − 1, where d̄ = mean of pair differences, s_d = their SD
p-value (two-tailed): p = 2 × P(T > |t_obs|) - compare to α; reject H₀ if p < α
Cohen's d (effect size): d = |x̄ − μ₀| / s for one-sample; d = |x̄₁ − x̄₂| / s_pooled for two-sample; d = |d̄| / s_d for paired
📝 Example Calculations
Example 1 - Medical Treatment Effectiveness (One-Sample t-Test)
A cardiologist wants to know if a new drug changes mean systolic blood pressure from the known baseline of 130 mmHg. A sample of 25 patients shows x̄ = 124, s = 10. Test at α = 0.05, two-tailed.
t = (124 − 130) / (10 / √25) = −6 / 2 = −3.000, df = 24
p ≈ 0.006 < 0.05 - Reject H₀. The drug significantly changes blood pressure.
Cohen's d = |124 − 130| / 10 = 0.60 - medium effect size.
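The arithmetic in Example 1 can be reproduced with a short Python sketch. The standard library has no t-distribution, so the helper names `t_pdf` and `t_upper_tail` are hypothetical: they approximate the tail area by integrating the t density with Simpson's rule. This is an illustration under those assumptions, not how the calculator itself is implemented.

```python
import math

def t_pdf(x: float, df: float) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t: float, df: float, steps: int = 2000) -> float:
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t].

    steps must be even. The area from 0 to t is subtracted from 0.5,
    the total area to the right of zero.
    """
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Summary statistics from Example 1.
x_bar, mu_0, s, n = 124.0, 130.0, 10.0, 25

t_obs = (x_bar - mu_0) / (s / math.sqrt(n))
df = n - 1
p_two_tailed = 2 * t_upper_tail(abs(t_obs), df)
cohens_d = abs(x_bar - mu_0) / s

print(f"t = {t_obs:.3f}, df = {df}, p = {p_two_tailed:.4f}, d = {cohens_d:.2f}")
```

In practice a statistics library would supply the t-distribution tail directly; the numerical integration here only keeps the sketch self-contained.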
Example 2 - Manufacturing Quality Test (One-Sample Z-Test)
A factory claims its bolts have mean diameter μ = 12.00 mm with known σ = 0.05 mm. A QA inspector measures n = 36 bolts and finds x̄ = 12.008 mm. Is the process off-spec? α = 0.05, two-tailed.
Z = (12.008 − 12.000) / (0.05 / √36) = 0.008 / 0.00833 = 0.96
p ≈ 0.337 > 0.05 - Fail to Reject H₀. No significant evidence the process is off-spec.
Example 3 - Election Polling (One-Proportion Z-Test)
An exit poll of 500 voters shows 54% (p̂ = 0.54) supporting Candidate A. Is there significant evidence the candidate leads (> 50%)? α = 0.05, one-tailed right.
SE = √(0.50 × 0.50 / 500) = 0.02236; Z = (0.54 − 0.50) / 0.02236 = 1.789
p ≈ 0.037 < 0.05 - Reject H₀. Statistically significant evidence that Candidate A leads.
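As a rough check of Example 3, the sketch below recomputes the standard error, Z statistic, and right-tailed p-value using Python's standard-library `NormalDist`. The variable names are illustrative, and the normal-approximation check uses the np₀ ≥ 5 rule from the Formulas section.

```python
from math import sqrt
from statistics import NormalDist

# Inputs from Example 3.
p_hat, p_0, n = 0.54, 0.50, 500
alpha = 0.05

# Normal approximation condition: n*p0 >= 5 and n*(1 - p0) >= 5.
approximation_ok = n * p_0 >= 5 and n * (1 - p_0) >= 5

# Standard error is computed under H0, so it uses p_0 rather than p_hat.
se = sqrt(p_0 * (1 - p_0) / n)
z = (p_hat - p_0) / se

# Right-tailed test (H1: p > p_0): area to the right of z.
p_value = 1 - NormalDist().cdf(z)
reject_h0 = p_value < alpha

print(f"SE = {se:.5f}, Z = {z:.3f}, p = {p_value:.3f}, reject H0: {reject_h0}")
```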
Example 4 - A/B Test (Two-Sample Welch's t-Test)
Website A: mean session time 240 s, s = 45, n = 80. Website B: mean 220 s, s = 60, n = 70. Did the redesign change mean session time? α = 0.05, two-tailed.
SE = √(45²/80 + 60²/70) = √(25.31 + 51.43) = √76.74 = 8.76; t = (240 − 220) / 8.76 = 2.283
Welch df ≈ 127; p ≈ 0.024 < 0.05 - Reject H₀. Mean session time differs significantly between the two versions, with Website A 20 s higher.
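The standard error, t statistic, and Welch-Satterthwaite degrees of freedom for Example 4 can be recomputed from the summary statistics alone. The sketch below is a minimal check with illustrative variable names; it stops short of the p-value, which needs a t-distribution CDF from a statistics library.

```python
from math import sqrt

# Summary statistics from Example 4.
mean_a, s_a, n_a = 240.0, 45.0, 80
mean_b, s_b, n_b = 220.0, 60.0, 70

# Squared standard error contributed by each group: s^2 / n.
var_a = s_a ** 2 / n_a
var_b = s_b ** 2 / n_b

se = sqrt(var_a + var_b)
t_obs = (mean_a - mean_b) / se

# Welch-Satterthwaite approximation to the degrees of freedom.
welch_df = (var_a + var_b) ** 2 / (
    var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
)

print(f"SE = {se:.2f}, t = {t_obs:.3f}, Welch df = {welch_df:.1f}")
```

The Welch df always falls between min(n₁, n₂) − 1 and n₁ + n₂ − 2, which gives a quick sanity check on any hand calculation.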
Example 5 - Before/After Study (Paired t-Test)
20 students take a study skills course. Mean improvement in test score = 8.5 points, SD of differences = 12.0. Did the course help? α = 0.05, one-tailed right.
t = 8.5 / (12.0 / √20) = 8.5 / 2.683 = 3.168, df = 19
p ≈ 0.003 < 0.05 - Reject H₀. The course significantly improved scores. Cohen's d = 8.5 / 12.0 = 0.71 (medium effect, approaching large).
❓ Frequently Asked Questions
What are the 6 steps of hypothesis testing?
The 6 standard steps are: (1) State hypotheses - define H₀ (null) and H₁ (alternative); (2) Set the significance level α (e.g., 0.05); (3) Compute the test statistic (Z, t, etc.) from your sample data; (4) Find the p-value - the probability of observing a result at least this extreme if H₀ is true; (5) Determine the critical value - the threshold the test statistic must exceed to reject H₀; (6) State the conclusion - reject or fail to reject H₀, and interpret in context.
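The six steps map naturally onto a few lines of code. The sketch below walks through them for a two-tailed one-sample Z-test using the numbers from Example 2; it relies only on Python's standard-library `NormalDist`, and the variable names are illustrative rather than anything the calculator exposes.

```python
from math import sqrt
from statistics import NormalDist

# Inputs from Example 2 (population sigma known).
x_bar, mu_0, sigma, n = 12.008, 12.000, 0.05, 36

# Step 1: state hypotheses. H0: mu = mu_0, H1: mu != mu_0 (two-tailed).
# Step 2: set the significance level.
alpha = 0.05

# Step 3: compute the test statistic.
z = (x_bar - mu_0) / (sigma / sqrt(n))

# Step 4: find the two-tailed p-value.
std_normal = NormalDist()
p_value = 2 * (1 - std_normal.cdf(abs(z)))

# Step 5: determine the critical value for a two-tailed test.
z_crit = std_normal.inv_cdf(1 - alpha / 2)

# Step 6: state the conclusion.
reject_h0 = p_value < alpha
print(f"Z = {z:.2f}, p = {p_value:.3f}, critical value = ±{z_crit:.3f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Steps 4 and 5 always agree: p < α exactly when |Z| exceeds the critical value, so either one is enough to reach the decision in step 6.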
What is the difference between H₀ and H₁?
H₀ (null hypothesis) is the default assumption - usually that there is no effect, no difference, or the parameter equals a specific value. H₁ (alternative hypothesis) is what you are trying to show evidence for - that there is an effect, a difference, or the parameter is greater/less/not equal to the reference. You never 'prove' H₁; you only find sufficient evidence to reject H₀ in its favour.
When should I use a Z-test vs a t-test for means?
Use a Z-test when the population standard deviation σ is known (rare in practice). Use a t-test when σ must be estimated from the sample (almost always the case). For large samples (n > 30), the t and Z distributions are very similar, but using t is still correct and slightly more conservative. This calculator uses the t-distribution for the one-sample, Welch's two-sample, and paired t-tests, and the Z-distribution when σ is explicitly provided or when testing a proportion.
What does p-value mean in hypothesis testing?
The p-value is the probability of obtaining a test statistic at least as extreme as the observed one, assuming H₀ is true. A small p-value (below α) is evidence against H₀. Crucially, the p-value is NOT the probability that H₀ is true, and it is NOT the probability of making an error. It is a measure of how surprising your data would be under H₀.
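One way to make this definition concrete is to simulate it. The sketch below, which assumes the setup of Example 2, draws many sample means from the sampling distribution implied by H₀ and counts how often they land at least as far from μ₀ as the observed mean did. The simulation size and seed are arbitrary choices for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Setup from Example 2: H0 says mu = 12.000 mm, sigma = 0.05 mm, n = 36.
mu_0, sigma, n = 12.000, 0.05, 36
observed_gap = 0.008  # |x_bar - mu_0| actually observed

# If H0 is true, the sample mean follows Normal(mu_0, sigma / sqrt(n)).
null_means = NormalDist(mu_0, sigma / sqrt(n)).samples(20_000, seed=1)

# Fraction of simulated means at least as extreme as the observed one.
as_extreme = sum(abs(m - mu_0) >= observed_gap for m in null_means)
p_simulated = as_extreme / len(null_means)

# Exact two-tailed p-value from the normal CDF, for comparison.
z = observed_gap / (sigma / sqrt(n))
p_exact = 2 * (1 - NormalDist().cdf(z))

print(f"simulated p = {p_simulated:.3f}, exact p = {p_exact:.3f}")
```

The simulated fraction only estimates how surprising the data are under H₀; it says nothing about the probability that H₀ itself is true.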
What is a Type I and Type II error?
A Type I error (false positive) is rejecting H₀ when it is actually true. Its probability equals α (e.g., 5%). A Type II error (false negative) is failing to reject H₀ when H₁ is actually true. Its probability is β, and statistical power = 1 − β. With α held fixed, increasing the sample size reduces β (raising power) while the Type I error rate stays at α. For a fixed sample size, decreasing α (a stricter test) reduces Type I errors but increases Type II errors.
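The relationship between α, β, and sample size can be explored numerically. The sketch below computes the power of a two-tailed one-sample Z-test for a given true shift; the function name `z_test_power` and the 0.02 mm shift are hypothetical choices for illustration, with σ borrowed from Example 2.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(delta: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Probability of rejecting H0 in a two-tailed Z-test when the
    true mean differs from mu_0 by delta."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n) / sigma  # mean of Z under the true shift
    # Rejection happens when Z > z_crit or Z < -z_crit.
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)

sigma = 0.05   # known population SD, as in Example 2
delta = 0.02   # hypothetical true shift in mm

power_n36 = z_test_power(delta, sigma, n=36)
power_n100 = z_test_power(delta, sigma, n=100)
power_strict = z_test_power(delta, sigma, n=36, alpha=0.01)

# With no true shift, the rejection rate is alpha whatever the sample size.
false_positive_rate = z_test_power(0.0, sigma, n=100)

print(f"power (n=36):  {power_n36:.2f}, beta = {1 - power_n36:.2f}")
print(f"power (n=100): {power_n100:.2f}, beta = {1 - power_n100:.2f}")
print(f"power (n=36, alpha=0.01): {power_strict:.2f}")
print(f"rejection rate when H0 is true: {false_positive_rate:.3f}")
```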
What is Cohen's d and how do I interpret it?
Cohen's d is a standardised effect size for mean tests; for two samples, d = |x̄₁ − x̄₂| / s_pooled. It expresses how many standard deviations apart the means are. Conventional benchmarks (Jacob Cohen, 1988): d < 0.2 = negligible effect; 0.2–0.5 = small; 0.5–0.8 = medium; > 0.8 = large. A study can be statistically significant (low p) with a negligible effect size (small d) when n is very large, which is why both should be reported.
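A small sketch of the two-sample calculation, using the summary statistics from Example 4. The pooled SD here is the standard weighted form, √[((n₁−1)s₁² + (n₂−1)s₂²) / (n₁+n₂−2)], which is an assumption about what s_pooled means on this page. The helper names and the handling of values exactly on a benchmark boundary are also illustrative choices.

```python
from math import sqrt

def pooled_sd(s1: float, n1: int, s2: float, n2: int) -> float:
    """Pooled standard deviation, weighting each variance by its df."""
    return sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))

def cohens_d(mean1: float, mean2: float, sd: float) -> float:
    """Absolute mean difference expressed in standard deviation units."""
    return abs(mean1 - mean2) / sd

def effect_label(d: float) -> str:
    """Map d to the benchmark labels quoted above.

    Boundary values (0.2, 0.5, 0.8) are assigned to the higher
    category here; that tie-break is an assumption.
    """
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"

# Summary statistics from Example 4.
s_pooled = pooled_sd(45.0, 80, 60.0, 70)
d = cohens_d(240.0, 220.0, s_pooled)

print(f"s_pooled = {s_pooled:.2f}, d = {d:.2f} ({effect_label(d)})")
```

Example 4 makes the point about reporting both numbers: the difference is statistically significant at α = 0.05, yet the effect is small by these benchmarks.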
What is the difference between one-tailed and two-tailed tests?
A two-tailed test (H₁: μ ≠ μ₀) detects differences in either direction and is appropriate when you have no strong prior directional hypothesis. A one-tailed test (H₁: μ > μ₀ or H₁: μ < μ₀) only tests one direction and has more power in that direction, but misses effects in the other direction. Most academic journals require two-tailed tests unless a directional hypothesis was pre-specified before data collection.
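The effect of the tail choice on the p-value is easy to see numerically. The sketch below takes the Z statistic from Example 3 and computes the p-value under each of the three possible alternatives, using the standard-library `NormalDist`.

```python
from statistics import NormalDist

z = 1.789      # test statistic from Example 3
alpha = 0.05
nd = NormalDist()

p_right = 1 - nd.cdf(z)            # H1: parameter > reference value
p_left = nd.cdf(z)                 # H1: parameter < reference value
p_two = 2 * (1 - nd.cdf(abs(z)))   # H1: parameter != reference value

for name, p in [("right-tailed", p_right),
                ("left-tailed", p_left),
                ("two-tailed", p_two)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name:>12}: p = {p:.3f} -> {decision}")
```

The same data reject H₀ under the right-tailed alternative but not under the two-tailed one, which is why the direction needs to be fixed before looking at the data.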
What is the one-proportion Z-test used for?
The one-proportion Z-test tests whether an observed sample proportion p̂ differs from a hypothesised population proportion p₀. It uses the normal approximation: Z = (p̂ − p₀) / √(p₀(1−p₀)/n). The approximation is valid when np₀ ≥ 5 and n(1−p₀) ≥ 5. Use cases: election polling (is the proportion above 50%?), quality control (is the defect rate below 2%?), A/B testing (did click-through improve?).
What is a two-sample t-test (Welch's test)?
The two-sample t-test compares the means of two independent groups. This calculator uses Welch's version, which does not assume equal population variances - making it more robust than Student's pooled t-test. Welch's test adjusts the degrees of freedom using the Welch-Satterthwaite equation: df ≈ (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)]. It performs about as well as the pooled test when the variances are equal and controls the Type I error rate better when they are not, so it is recommended as the default.
What is the paired t-test?
The paired t-test (dependent samples t-test) is used when two measurements are linked - typically before/after measurements on the same subjects, or matched-pair experimental designs. Instead of comparing group means, it computes the difference for each pair and performs a one-sample t-test on those differences (H₀: μ_d = 0). The paired test removes between-subject variability, which typically makes it more powerful than an independent two-sample test on the same data when the paired measurements are positively correlated.
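To illustrate the reduction to a one-sample test on differences, the sketch below uses a small made-up set of before/after scores (hypothetical data, not from any example on this page). It computes the paired t statistic and, for contrast, the Welch t statistic obtained by wrongly treating the two columns as independent groups.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical before/after scores for 8 students (illustrative only).
before = [62, 70, 55, 81, 68, 74, 59, 77]
after = [68, 75, 59, 84, 75, 78, 66, 80]
n = len(before)

# Paired test: one-sample t-test on the per-student differences.
diffs = [a - b for a, b in zip(after, before)]
d_bar = mean(diffs)
s_d = stdev(diffs)                 # sample SD of the differences
t_paired = d_bar / (s_d / sqrt(n))
df_paired = n - 1

# For contrast: Welch t statistic, ignoring the pairing.
se_indep = sqrt(variance(before) / n + variance(after) / n)
t_welch = (mean(after) - mean(before)) / se_indep

print(f"mean difference = {d_bar:.3f}, s_d = {s_d:.3f}")
print(f"paired t = {t_paired:.2f} (df = {df_paired}), Welch t = {t_welch:.2f}")
```

Both statistics share the same numerator. The paired version is larger only because the differences vary far less than the raw scores do, which is the between-subject variability being removed.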