The Chi-Squared test is one of the most widely used statistical tests for categorical data. It comes in two flavors: the goodness-of-fit test asks whether an observed frequency distribution matches an expected one, while the test of independence asks whether two categorical variables are associated with each other. In both cases the null hypothesis is that any deviation from the expected pattern is explainable by random chance.
Leveraging the VillageSQL Extension Bundle (VEB)
, you can calculate these statistics directly in SQL with STATS_CHISQ_GOF() (goodness-of-fit) and STATS_CHISQ_INDEP() (test of independence). VillageSQL
is a drop-in MySQL 8.4 replacement, so no additional tools or language runtimes are needed.
COVID-19 Household Size Example
The Technology Networks article “The Chi-Squared Test” poses the question: is household size associated with symptomatic COVID-19 infection among people who tested positive?
The 2×2 contingency table contains 218 participants split by household size (Small: 1–3 members, Large: 4+ members) and symptom status (Symptomatic / Asymptomatic).
CREATE TABLE covid_household (
household_size VARCHAR(10) NOT NULL,
symptom_status VARCHAR(15) NOT NULL,
observed DOUBLE NOT NULL,
expected DOUBLE NOT NULL
);
INSERT INTO covid_household VALUES
('Small', 'Symptomatic', 30, 73 * 126 / 218),
('Small', 'Asymptomatic', 43, 73 * 92 / 218),
('Large', 'Symptomatic', 96, 145 * 126 / 218),
('Large', 'Asymptomatic', 49, 145 * 92 / 218);
WITH indep AS (
SELECT CAST(STATS_CHISQ_INDEP(observed, expected, 2.0, 2.0) AS CHAR(200)) AS json
FROM covid_household
)
SELECT
JSON_EXTRACT(json, '$.chi_sq') AS chi_squared,
JSON_EXTRACT(json, '$.df') AS degrees_of_freedom,
JSON_EXTRACT(json, '$.p') AS p_value,
IF(JSON_EXTRACT(json, '$.p') < 0.05,
'Reject H₀ — household size IS associated with symptomatic COVID-19',
'Retain H₀ — no significant association') AS conclusion
FROM indep\G
Results
*************************** 1. row ***************************
chi_squared: 12.5521237427249
degrees_of_freedom: 1
p_value: 0.00039575626994953
conclusion: Reject H₀ — household size IS associated with symptomatic COVID-19
χ² = 12.55, df = 1, p < 0.001 — we reject the null. Larger households show a significantly higher rate of symptomatic infection.
Kent State University Examples
Kent State Chi-Squared SQL Example
The Kent State University SPSS guide provides two classroom examples. These cover both possible outcomes — a case where the null is retained and one where it is rejected.
Smoking Behavior by Gender (3×2 table)
A survey of 402 people records smoking status (Nonsmoker, Past smoker, Current smoker) against gender. The expected counts are calculated as (row total × column total) / grand total.
WITH indep AS (
SELECT CAST(STATS_CHISQ_INDEP(observed, expected, 3.0, 2.0) AS CHAR(200)) AS json
FROM smoking_gender
)
SELECT
JSON_EXTRACT(json, '$.chi_sq') AS chi_squared,
JSON_EXTRACT(json, '$.df') AS degrees_of_freedom,
JSON_EXTRACT(json, '$.p') AS p_value,
IF(JSON_EXTRACT(json, '$.p') < 0.05,
'Reject Null — smoking behavior IS associated with gender',
'Retain Null — no association between smoking behavior and gender') AS conclusion
FROM indep\G
*************************** 1. row ***************************
chi_squared: 3.17125676733291
degrees_of_freedom: 2
p_value: 0.204819047726126
conclusion: Retain Null — no association between smoking behavior and gender
χ² = 3.17, df = 2, p = 0.205 — we retain the null. Smoking behavior is independent of gender in this sample.
Class Rank by On-Campus Living (2×2 table)
A survey of 388 students records whether they live on or off campus, split by class rank (Underclassman / Upperclassman).
WITH indep AS (
SELECT CAST(STATS_CHISQ_INDEP(observed, expected, 2.0, 2.0) AS CHAR(200)) AS json
FROM class_rank_living
)
SELECT
JSON_EXTRACT(json, '$.chi_sq') AS chi_squared,
JSON_EXTRACT(json, '$.df') AS degrees_of_freedom,
JSON_EXTRACT(json, '$.p') AS p_value,
IF(JSON_EXTRACT(json, '$.p') < 0.05,
'Reject Null — class rank IS associated with living on campus',
'Retain Null — class rank is independent of living on campus') AS conclusion
FROM indep\G
*************************** 1. row ***************************
chi_squared: 138.925986264012
degrees_of_freedom: 1
p_value: 0.0
conclusion: Reject Null — class rank IS associated with living on campus
χ² = 138.93, df = 1, p ≈ 0 — an overwhelming rejection of the null. Upper-classmen are far less likely to live on campus.
Royal Geographical Society Tourist Visitors Example
RGS Tourist Visitors SQL Example
The Royal Geographical Society student guide to chi-squared testing poses a goodness-of-fit question: does the age distribution of visitors to a tourist attraction reflect the age structure of the surrounding town?
Observed visitor counts (724,570 total) are compared against expected counts derived from Census population shares across five age groups. The expected values are computed inline so the formula remains explicit and auditable.
WITH data AS (
SELECT
age_group,
observed,
(pop_of_town / SUM(pop_of_town) OVER ()) *
SUM(observed) OVER () AS expected
FROM tourist_visitors
),
gof AS (
SELECT CAST(STATS_CHISQ_GOF(observed, expected) AS CHAR(200)) AS json
FROM data
)
SELECT
JSON_EXTRACT(json, '$.chi_sq') AS chi_squared,
JSON_EXTRACT(json, '$.df') AS degrees_of_freedom,
JSON_EXTRACT(json, '$.p') AS p_value,
IF(JSON_EXTRACT(json, '$.p') < 0.05,
'Reject Null — visitor age profile does NOT match the town population',
'Retain Null — visitor age profile matches the town population') AS conclusion
FROM gof\G
Results
*************************** 1. row ***************************
chi_squared: 141914.565565322
degrees_of_freedom: 4
p_value: 0
conclusion: Reject Null — visitor age profile does NOT match the town population
χ² = 141,915, df = 4, p ≈ 0 — the attraction draws a very different age profile from the town. Younger age groups are heavily over-represented.
Customer Support Ticket Examples
Two practical business scenarios illustrate both test types on data a support team might already have in their database.
Goodness-of-Fit: Are ticket volumes uniform across weekdays?
500 tickets over one week with a uniform expectation of 100 per day:
INSERT INTO tickets_by_day VALUES
('Monday', 140, 100),
('Tuesday', 110, 100),
('Wednesday', 95, 100),
('Thursday', 85, 100),
('Friday', 70, 100);
WITH gof AS (
SELECT CAST(STATS_CHISQ_GOF(observed, expected) AS CHAR(200)) AS json
FROM tickets_by_day
)
SELECT
JSON_EXTRACT(json, '$.chi_sq') AS chi_squared,
JSON_EXTRACT(json, '$.df') AS degrees_of_freedom,
JSON_EXTRACT(json, '$.p') AS p_value,
IF(JSON_EXTRACT(json, '$.p') < 0.05,
'Reject Null — ticket volume is NOT uniform across weekdays',
'Retain Null — no significant difference across weekdays') AS conclusion
FROM gof\G
*************************** 1. row ***************************
chi_squared: 28.5
degrees_of_freedom: 4
p_value: 0.000009875827068
conclusion: Reject Null — ticket volume is NOT uniform across weekdays
Test of Independence: Is resolution speed linked to agent experience?
200 tickets split by experience level (Junior / Senior) and resolution speed (Fast / Slow):
INSERT INTO resolution_survey VALUES
('Junior', 'Fast', 20, 40),
('Junior', 'Slow', 80, 60),
('Senior', 'Fast', 60, 40),
('Senior', 'Slow', 40, 60);
WITH indep AS (
SELECT CAST(STATS_CHISQ_INDEP(observed, expected, 2.0, 2.0) AS CHAR(200)) AS json
FROM resolution_survey
)
SELECT
JSON_EXTRACT(json, '$.chi_sq') AS chi_squared,
JSON_EXTRACT(json, '$.df') AS degrees_of_freedom,
JSON_EXTRACT(json, '$.p') AS p_value,
IF(JSON_EXTRACT(json, '$.p') < 0.05,
'Reject Null — resolution speed IS linked to agent experience',
'Retain Null — resolution speed is independent of agent experience') AS conclusion
FROM indep\G
*************************** 1. row ***************************
chi_squared: 33.3333333333333
degrees_of_freedom: 1
p_value: 0.000000007764037
conclusion: Reject Null — resolution speed IS linked to agent experience
χ² = 33.33, df = 1, p < 0.000001 — senior agents resolve tickets faster at a highly significant level.
UPenn Linguistics Examples
The UPenn Linguistics 300 chi-squared tutorial provides two textbook examples that illustrate the borderline case and a clear rejection.
Gender × Party Affiliation (2×2 table)
A survey of 100 people, evenly split by gender, records Democrat or Republican party affiliation. With all expected cell counts equal to 25 (50×50/100), this is one of the cleanest possible setups.
INSERT INTO gender_party VALUES
('Male', 'Democrat', 20, 25),
('Male', 'Republican', 30, 25),
('Female', 'Democrat', 30, 25),
('Female', 'Republican', 20, 25);
*************************** 1. row ***************************
chi_squared: 4
degrees_of_freedom: 1
p_value: 0.045500263896302
conclusion: Reject Null — party affiliation IS associated with gender
χ² = 4.0, df = 1, p = 0.0455 — just below the α = 0.05 threshold. A borderline result; the tutorial notes this as “not conclusive.”
Class Attendance × Exam Outcome (2×2 table)
54 students classified by whether they attended class and whether they passed the exam.
INSERT INTO attendance_exam VALUES
('Attended', 'Pass', 25, 31 * 33 / 54),
('Attended', 'Fail', 6, 31 * 21 / 54),
('Skipped', 'Pass', 8, 23 * 33 / 54),
('Skipped', 'Fail', 15, 23 * 21 / 54);
*************************** 1. row ***************************
chi_squared: 11.68601664849
degrees_of_freedom: 1
p_value: 0.000629715686170025
conclusion: Reject Null — exam outcome IS associated with class attendance
χ² = 11.69, df = 1, p < 0.001 — a clear rejection. Attending class is strongly associated with passing the exam.
Summary
All five examples use the same two functions: STATS_CHISQ_GOF() for goodness-of-fit and STATS_CHISQ_INDEP() for tests of independence. The results returned are JSON containing chi_sq, df, and p, which are easily extracted with JSON_EXTRACT() for inline reporting or downstream filtering.
| Example | Test | χ² | df | p-value | Result |
|---|---|---|---|---|---|
| COVID household size | Independence | 12.55 | 1 | 0.0004 | Reject Hâ‚€ |
| Smoking × Gender | Independence | 3.17 | 2 | 0.205 | Retain H₀ |
| Class rank × Living | Independence | 138.93 | 1 | ≈ 0 | Reject H₀ |
| RGS Tourist visitors | Goodness-of-fit | 141,915 | 4 | ≈ 0 | Reject H₀ |
| Support tickets by day | Goodness-of-fit | 28.50 | 4 | < 0.0001 | Reject Hâ‚€ |
| Resolution × Experience | Independence | 33.33 | 1 | < 0.0001 | Reject H₀ |
| Gender × Party | Independence | 4.00 | 1 | 0.0455 | Reject H₀ |
| Attendance × Exam | Independence | 11.69 | 1 | 0.0006 | Reject H₀ |
The extension is available at github.com/ronaldbradford/vsql-statistics and installs with a single statement:
INSTALL EXTENSION vsql_statistics;