Producing Chi-Squared statistics with SQL

Producing Chi-Squared statistics with SQL

The Chi-Squared test is one of the most widely used statistical tests for categorical data. It comes in two flavors: the goodness-of-fit test asks whether an observed frequency distribution matches an expected one, while the test of independence asks whether two categorical variables are associated with each other. In both cases the null hypothesis is that any deviation from the expected pattern is explainable by random chance.

Leveraging the VillageSQL Extension Bundle (VEB) , you can calculate these statistics directly in SQL with STATS_CHISQ_GOF() (goodness-of-fit) and STATS_CHISQ_INDEP() (test of independence). VillageSQL is a drop-in MySQL 8.4 replacement, so no additional tools or language runtimes are needed.

COVID-19 Household Size Example

Chi-Squared SQL Example

The Technology Networks article “The Chi-Squared Test” poses the question: is household size associated with symptomatic COVID-19 infection among people who tested positive?

The 2×2 contingency table contains 218 participants split by household size (Small: 1–3 members, Large: 4+ members) and symptom status (Symptomatic / Asymptomatic).

CREATE TABLE covid_household (
    household_size VARCHAR(10) NOT NULL,
    symptom_status VARCHAR(15) NOT NULL,
    observed       DOUBLE      NOT NULL,
    expected       DOUBLE      NOT NULL
);

INSERT INTO covid_household VALUES
    ('Small', 'Symptomatic',    30,  73 * 126 / 218),
    ('Small', 'Asymptomatic',   43,  73 *  92 / 218),
    ('Large', 'Symptomatic',    96, 145 * 126 / 218),
    ('Large', 'Asymptomatic',   49, 145 *  92 / 218);

WITH indep AS (
    SELECT CAST(STATS_CHISQ_INDEP(observed, expected, 2.0, 2.0) AS CHAR(200)) AS json
    FROM covid_household
)
SELECT
    JSON_EXTRACT(json, '$.chi_sq') AS chi_squared,
    JSON_EXTRACT(json, '$.df')     AS degrees_of_freedom,
    JSON_EXTRACT(json, '$.p')      AS p_value,
    IF(JSON_EXTRACT(json, '$.p') < 0.05,
       'Reject H₀ — household size IS associated with symptomatic COVID-19',
       'Retain H₀ — no significant association') AS conclusion
FROM indep\G

Results

*************************** 1. row ***************************
       chi_squared: 12.5521237427249
degrees_of_freedom: 1
           p_value: 0.00039575626994953
        conclusion: Reject H₀ — household size IS associated with symptomatic COVID-19

χ² = 12.55, df = 1, p < 0.001 — we reject the null. Larger households show a significantly higher rate of symptomatic infection.

Kent State University Examples

Kent State Chi-Squared SQL Example

The Kent State University SPSS guide provides two classroom examples. These cover both possible outcomes — a case where the null is retained and one where it is rejected.

Smoking Behavior by Gender (3×2 table)

A survey of 402 people records smoking status (Nonsmoker, Past smoker, Current smoker) against gender. The expected counts are calculated as (row total × column total) / grand total.

WITH indep AS (
    SELECT CAST(STATS_CHISQ_INDEP(observed, expected, 3.0, 2.0) AS CHAR(200)) AS json
    FROM smoking_gender
)
SELECT
    JSON_EXTRACT(json, '$.chi_sq') AS chi_squared,
    JSON_EXTRACT(json, '$.df')     AS degrees_of_freedom,
    JSON_EXTRACT(json, '$.p')      AS p_value,
    IF(JSON_EXTRACT(json, '$.p') < 0.05,
       'Reject Null — smoking behavior IS associated with gender',
       'Retain Null — no association between smoking behavior and gender') AS conclusion
FROM indep\G
*************************** 1. row ***************************
       chi_squared: 3.17125676733291
degrees_of_freedom: 2
           p_value: 0.204819047726126
        conclusion: Retain Null — no association between smoking behavior and gender

χ² = 3.17, df = 2, p = 0.205 — we retain the null. Smoking behavior is independent of gender in this sample.

Class Rank by On-Campus Living (2×2 table)

A survey of 388 students records whether they live on or off campus, split by class rank (Underclassman / Upperclassman).

WITH indep AS (
    SELECT CAST(STATS_CHISQ_INDEP(observed, expected, 2.0, 2.0) AS CHAR(200)) AS json
    FROM class_rank_living
)
SELECT
    JSON_EXTRACT(json, '$.chi_sq') AS chi_squared,
    JSON_EXTRACT(json, '$.df')     AS degrees_of_freedom,
    JSON_EXTRACT(json, '$.p')      AS p_value,
    IF(JSON_EXTRACT(json, '$.p') < 0.05,
       'Reject Null — class rank IS associated with living on campus',
       'Retain Null — class rank is independent of living on campus') AS conclusion
FROM indep\G
*************************** 1. row ***************************
       chi_squared: 138.925986264012
degrees_of_freedom: 1
           p_value: 0.0
        conclusion: Reject Null — class rank IS associated with living on campus

χ² = 138.93, df = 1, p ≈ 0 — an overwhelming rejection of the null. Upper-classmen are far less likely to live on campus.

Royal Geographical Society Tourist Visitors Example

RGS Tourist Visitors SQL Example

The Royal Geographical Society student guide to chi-squared testing poses a goodness-of-fit question: does the age distribution of visitors to a tourist attraction reflect the age structure of the surrounding town?

Observed visitor counts (724,570 total) are compared against expected counts derived from Census population shares across five age groups. The expected values are computed inline so the formula remains explicit and auditable.

WITH data AS (
    SELECT
        age_group,
        observed,
        (pop_of_town / SUM(pop_of_town) OVER ()) *
            SUM(observed) OVER () AS expected
    FROM tourist_visitors
),
gof AS (
    SELECT CAST(STATS_CHISQ_GOF(observed, expected) AS CHAR(200)) AS json
    FROM data
)
SELECT
    JSON_EXTRACT(json, '$.chi_sq') AS chi_squared,
    JSON_EXTRACT(json, '$.df')     AS degrees_of_freedom,
    JSON_EXTRACT(json, '$.p')      AS p_value,
    IF(JSON_EXTRACT(json, '$.p') < 0.05,
       'Reject Null — visitor age profile does NOT match the town population',
       'Retain Null — visitor age profile matches the town population') AS conclusion
FROM gof\G

Results

*************************** 1. row ***************************
       chi_squared: 141914.565565322
degrees_of_freedom: 4
           p_value: 0
        conclusion: Reject Null — visitor age profile does NOT match the town population

χ² = 141,915, df = 4, p ≈ 0 — the attraction draws a very different age profile from the town. Younger age groups are heavily over-represented.

Customer Support Ticket Examples

Support Tickets SQL Example

Two practical business scenarios illustrate both test types on data a support team might already have in their database.

Goodness-of-Fit: Are ticket volumes uniform across weekdays?

500 tickets over one week with a uniform expectation of 100 per day:

INSERT INTO tickets_by_day VALUES
    ('Monday',     140, 100),
    ('Tuesday',    110, 100),
    ('Wednesday',   95, 100),
    ('Thursday',    85, 100),
    ('Friday',      70, 100);

WITH gof AS (
    SELECT CAST(STATS_CHISQ_GOF(observed, expected) AS CHAR(200)) AS json
    FROM tickets_by_day
)
SELECT
    JSON_EXTRACT(json, '$.chi_sq') AS chi_squared,
    JSON_EXTRACT(json, '$.df')     AS degrees_of_freedom,
    JSON_EXTRACT(json, '$.p')      AS p_value,
    IF(JSON_EXTRACT(json, '$.p') < 0.05,
       'Reject Null — ticket volume is NOT uniform across weekdays',
       'Retain Null — no significant difference across weekdays') AS conclusion
FROM gof\G
*************************** 1. row ***************************
       chi_squared: 28.5
degrees_of_freedom: 4
           p_value: 0.000009875827068
        conclusion: Reject Null — ticket volume is NOT uniform across weekdays

Test of Independence: Is resolution speed linked to agent experience?

200 tickets split by experience level (Junior / Senior) and resolution speed (Fast / Slow):

INSERT INTO resolution_survey VALUES
    ('Junior', 'Fast',  20, 40),
    ('Junior', 'Slow',  80, 60),
    ('Senior', 'Fast',  60, 40),
    ('Senior', 'Slow',  40, 60);

WITH indep AS (
    SELECT CAST(STATS_CHISQ_INDEP(observed, expected, 2.0, 2.0) AS CHAR(200)) AS json
    FROM resolution_survey
)
SELECT
    JSON_EXTRACT(json, '$.chi_sq') AS chi_squared,
    JSON_EXTRACT(json, '$.df')     AS degrees_of_freedom,
    JSON_EXTRACT(json, '$.p')      AS p_value,
    IF(JSON_EXTRACT(json, '$.p') < 0.05,
       'Reject Null — resolution speed IS linked to agent experience',
       'Retain Null — resolution speed is independent of agent experience') AS conclusion
FROM indep\G
*************************** 1. row ***************************
       chi_squared: 33.3333333333333
degrees_of_freedom: 1
           p_value: 0.000000007764037
        conclusion: Reject Null — resolution speed IS linked to agent experience

χ² = 33.33, df = 1, p < 0.000001 — senior agents resolve tickets faster at a highly significant level.

UPenn Linguistics Examples

UPenn Linguistics SQL Example

The UPenn Linguistics 300 chi-squared tutorial provides two textbook examples that illustrate the borderline case and a clear rejection.

Gender × Party Affiliation (2×2 table)

A survey of 100 people, evenly split by gender, records Democrat or Republican party affiliation. With all expected cell counts equal to 25 (50×50/100), this is one of the cleanest possible setups.

INSERT INTO gender_party VALUES
    ('Male',   'Democrat',   20, 25),
    ('Male',   'Republican', 30, 25),
    ('Female', 'Democrat',   30, 25),
    ('Female', 'Republican', 20, 25);
*************************** 1. row ***************************
       chi_squared: 4
degrees_of_freedom: 1
           p_value: 0.045500263896302
        conclusion: Reject Null — party affiliation IS associated with gender

χ² = 4.0, df = 1, p = 0.0455 — just below the α = 0.05 threshold. A borderline result; the tutorial notes this as “not conclusive.”

Class Attendance × Exam Outcome (2×2 table)

54 students classified by whether they attended class and whether they passed the exam.

INSERT INTO attendance_exam VALUES
    ('Attended', 'Pass', 25, 31 * 33 / 54),
    ('Attended', 'Fail',  6, 31 * 21 / 54),
    ('Skipped',  'Pass',  8, 23 * 33 / 54),
    ('Skipped',  'Fail', 15, 23 * 21 / 54);
*************************** 1. row ***************************
       chi_squared: 11.68601664849
degrees_of_freedom: 1
           p_value: 0.000629715686170025
        conclusion: Reject Null — exam outcome IS associated with class attendance

χ² = 11.69, df = 1, p < 0.001 — a clear rejection. Attending class is strongly associated with passing the exam.

Summary

All five examples use the same two functions: STATS_CHISQ_GOF() for goodness-of-fit and STATS_CHISQ_INDEP() for tests of independence. The results returned are JSON containing chi_sq, df, and p, which are easily extracted with JSON_EXTRACT() for inline reporting or downstream filtering.

Example Test χ² df p-value Result
COVID household size Independence 12.55 1 0.0004 Reject Hâ‚€
Smoking × Gender Independence 3.17 2 0.205 Retain H₀
Class rank × Living Independence 138.93 1 ≈ 0 Reject H₀
RGS Tourist visitors Goodness-of-fit 141,915 4 ≈ 0 Reject H₀
Support tickets by day Goodness-of-fit 28.50 4 < 0.0001 Reject Hâ‚€
Resolution × Experience Independence 33.33 1 < 0.0001 Reject H₀
Gender × Party Independence 4.00 1 0.0455 Reject H₀
Attendance × Exam Independence 11.69 1 0.0006 Reject H₀

The extension is available at github.com/ronaldbradford/vsql-statistics and installs with a single statement:

INSTALL EXTENSION vsql_statistics;
Tagged with: MySQL VillageSQL Extensions

Producing Two Sample T-Test statistics with SQL

The two sample t-test for equal variance is a statistical test to determine if the means of two groups are different enough that the difference is likely caused by some underlying difference, rather than random chance.

Building your first VillageSQL Extension with AI skills

This is a technical walkthrough of the vsql-extension-builder recently released May 28 at Percona Live Bay Area 2026 and found at https://github.com/villagesql/villagesql-skills . Highlights Install VillageSQL pre-built binary first Install SDK with pre-built binary second Install the skill Run it with your AI tool The output can be found at https://github.

Why using production workloads over simulated workloads is critical

AI-Assisted SQL Tuning Last week in his keynote speech at Percona Live Bay Area 2026 , Andy Pavlo presented Databases: The Final Boss of Agents and provided some useful insights into query optimization of simulated workloads leveraging AI.