Producing IQR and Outlier statistics with SQL

Ronald Bradford
June 30, 2026

The interquartile range (IQR) measures the spread of the middle 50% of a distribution — the distance between the first quartile (Q1) and the third quartile (Q3). Combined with Tukey’s 1.5×IQR fence rule, it provides a robust method for identifying outliers that is far less sensitive to extreme values than mean ± standard deviation.

MySQL has no built-in quartile or outlier function. Leveraging the VillageSQL Extension Bundle (VEB) , STATS_IQR(col) computes the full five-number summary and fence values in a single aggregate pass, returning them as JSON. VillageSQL is a drop-in MySQL 8.4 replacement — no Python or R required.

Returns JSON: { "q1": ..., "median": ..., "q3": ..., "iqr": ...,
                "lower_fence": ..., "upper_fence": ... }

Field	Description
`q1`, `q3`	Tukey’s hinges (exclusive-median method)
`median`	Middle value of the sorted group
`iqr`	Q3 − Q1
`lower_fence`	Q1 − 1.5 × IQR (Tukey inner lower fence)
`upper_fence`	Q3 + 1.5 × IQR (Tukey inner upper fence)

Values outside the fences are mild Tukey outliers. Use CAST(STATS_IQR(...) AS CHAR) to read results in the mysql CLI, and CAST(col AS DOUBLE) on integer columns to preserve decimal precision in JSON output.

Parkrun Finish Times by Age Group

IQR SQL Example

The first example computes IQR statistics per age group for a community 5 km run, then joins the fence values back to individual finish times to count outliers. IQR grows with age — older runners span a wider range of fitness — and each group contains one walker or injury finish that sits above the upper fence.

WITH iqr_stats AS (
    SELECT
        age_group,
        COUNT(*)                                            AS n,
        CAST(STATS_IQR(finish_time) AS CHAR(300))          AS json
    FROM race_results
    GROUP BY age_group
),
fences AS (
    SELECT
        age_group,
        n,
        ROUND(JSON_EXTRACT(json, '$.q1'),     2)  AS q1,
        ROUND(JSON_EXTRACT(json, '$.median'), 2)  AS median_time,
        ROUND(JSON_EXTRACT(json, '$.q3'),     2)  AS q3,
        ROUND(JSON_EXTRACT(json, '$.iqr'),    2)  AS iqr,
        JSON_EXTRACT(json, '$.lower_fence')        AS lower_fence,
        JSON_EXTRACT(json, '$.upper_fence')        AS upper_fence
    FROM iqr_stats
)
SELECT
    f.age_group,
    f.n,
    CONCAT(f.q1, '–', f.q3, ' min')            AS q1_q3,
    CONCAT(f.median_time, ' min')               AS median,
    CONCAT(f.iqr, ' min')                       AS iqr,
    CONCAT(ROUND(f.upper_fence, 2), ' min')     AS upper_fence,
    SUM(CASE
        WHEN r.finish_time > f.upper_fence
          OR r.finish_time < f.lower_fence THEN 1 ELSE 0
    END)                                        AS outlier_count
FROM race_results r
JOIN fences f USING (age_group)
GROUP BY f.age_group, f.n, f.q1, f.q3, f.median_time, f.iqr, f.lower_fence, f.upper_fence
ORDER BY f.age_group;

Results

age_group  n   q1_q3           median       iqr          upper_fence   outlier_count
18-29      12  19.7–23 min     21.75 min    3.3 min      27.95 min     1
30-44      12  22.75–26.4 min  24.5 min     3.65 min     31.88 min     1
45-59      12  24.85–30.25 min 27.15 min    5.4 min      38.35 min     1
60+        12  29.25–38 min    32.75 min    8.75 min     51.12 min     1

IQR grows steadily from 3.3 minutes (18–29) to 8.75 minutes (60+), confirming that older age groups have wider performance spread. Each group has exactly one outlier above the upper fence — finish times of 35.8, 43.0, 55.0, and 67.0 minutes respectively — consistent with a walker or injury finish in an otherwise competitive pack. The GROUP BY → join-back pattern is the natural structure for per-group outlier counts.

E-Commerce Order Values — Flag Individual Outlier Rows

The second example applies STATS_IQR to a single group of 25 orders and cross-joins the fence values to every row, tagging each order as normal, low outlier, or high outlier. Normal orders cluster between $45 and $90; a $2.00 test/refund entry and a $200.00 bulk purchase both fall outside the Tukey fences.

WITH fences AS (
    SELECT
        JSON_EXTRACT(json, '$.q1')          AS q1,
        JSON_EXTRACT(json, '$.median')      AS median_value,
        JSON_EXTRACT(json, '$.q3')          AS q3,
        JSON_EXTRACT(json, '$.lower_fence') AS lower_fence,
        JSON_EXTRACT(json, '$.upper_fence') AS upper_fence
    FROM (
        SELECT CAST(STATS_IQR(order_value) AS CHAR(300)) AS json
        FROM orders
    ) s
)
SELECT
    o.order_id,
    o.customer,
    o.order_value,
    CASE
        WHEN o.order_value < f.lower_fence THEN 'LOW OUTLIER'
        WHEN o.order_value > f.upper_fence THEN 'HIGH OUTLIER'
        ELSE 'normal'
    END                                     AS flag
FROM orders o, fences f
ORDER BY o.order_value;

Results

order_id  customer   order_value  flag
1         Chen       2            LOW OUTLIER
2         Walsh      45           normal
...
24        Bauer      90           normal
25        Costa      200          HIGH OUTLIER

With n = 25, Q1 = $55, Q3 = $81, and IQR = $26, the fences are $16.00 (lower) and $120.00 (upper). Order #1 ($2.00) is flagged as a low outlier — likely a test transaction or refund. Order #25 ($200.00) is a high outlier — a bulk purchase well outside the normal $45–$90 range. The cross-join pattern works for any single-group outlier scan without needing GROUP BY.

Manufacturing — Cylinder Diameter by Machine Line

The third example combines IQR per machine with engineering tolerance checks. Target diameter is 50.0 mm with a spec of ±2.0 mm (48.0–52.0 mm). Three machines show different process characteristics: Machine A is well-calibrated, Machine B has drifted high, and Machine C has worn tooling with wide spread and multiple violations.

WITH iqr_stats AS (
    SELECT
        machine,
        COUNT(*)                                            AS n,
        CAST(STATS_IQR(diameter) AS CHAR(300))             AS json
    FROM components
    GROUP BY machine
),
fences AS (
    SELECT
        machine,
        n,
        ROUND(JSON_EXTRACT(json, '$.q1'),     2)  AS q1,
        ROUND(JSON_EXTRACT(json, '$.median'), 2)  AS median_mm,
        ROUND(JSON_EXTRACT(json, '$.q3'),     2)  AS q3,
        ROUND(JSON_EXTRACT(json, '$.iqr'),    2)  AS iqr_mm,
        JSON_EXTRACT(json, '$.lower_fence')        AS lower_fence,
        JSON_EXTRACT(json, '$.upper_fence')        AS upper_fence
    FROM iqr_stats
)
SELECT
    f.machine,
    f.median_mm,
    f.q1,
    f.q3,
    f.iqr_mm,
    SUM(CASE
        WHEN c.diameter < 48.0 OR c.diameter > 52.0 THEN 1 ELSE 0
    END)                                        AS spec_violations,
    SUM(CASE
        WHEN c.diameter < f.lower_fence
          OR c.diameter > f.upper_fence THEN 1 ELSE 0
    END)                                        AS tukey_outliers
FROM components c
JOIN fences f USING (machine)
GROUP BY f.machine, f.median_mm, f.q1, f.q3, f.iqr_mm, f.lower_fence, f.upper_fence
ORDER BY f.machine;

Results

machine  median_mm  q1    q3    iqr_mm  spec_violations  tukey_outliers
A        50.0       49.7  50.3  0.6     0                0
B        50.6       50.3  50.8  0.5     0                0
C        50.2       49.0  52.0  3.0     4                1

Machine A is tight and centred on target (IQR = 0.6 mm, median = 50.0 mm) with no violations. Machine B is equally tight (IQR = 0.5 mm) but the median has drifted 0.6 mm above target to 50.6 mm — a calibration issue visible in the median but not yet causing spec failures. Machine C tells a different story: IQR of 3.0 mm is five times wider than Machine A, with four parts outside the ±2.0 mm engineering spec and one statistical outlier at 57.5 mm. IQR quantifies process consistency; combining it with domain-specific tolerance checks gives both statistical and engineering views in one query.

Three Usage Patterns

Pattern	When to use	Example
`GROUP BY` → join fences back	Per-group summary and outlier counts	Parkrun age groups, machine lines
Single `STATS_IQR` → cross-join	Flag individual rows in one group	Order value review
`GROUP BY` + domain thresholds	Statistical outliers plus engineering specs	Manufacturing QC

All three patterns extract fence values from the JSON returned by STATS_IQR and apply them to raw rows — the function computes the distribution summary once per group; the join does the row-level classification.

Summary

All three examples use STATS_IQR(), which returns a JSON object with quartiles, median, IQR, and Tukey fence values computed using the exclusive-median (Tukey’s hinges) method.

Example	Group	Median	IQR	Outliers
Parkrun 18–29	12 runners	21.75 min	3.3 min	1 (35.8 min walker)
Parkrun 30–44	12 runners	24.5 min	3.65 min	1 (43.0 min walker)
Parkrun 45–59	12 runners	27.15 min	5.4 min	1 (55.0 min walker)
Parkrun 60+	12 runners	32.75 min	8.75 min	1 (67.0 min walker)
E-commerce orders	25 orders	$68	$26	2 ($2 low, $200 high)
Machine A	15 parts	50.0 mm	0.6 mm	0
Machine B	15 parts	50.6 mm	0.5 mm	0 (drifting high)
Machine C	15 parts	50.2 mm	3.0 mm	4 spec + 1 Tukey

The extension is available at github.com/ronaldbradford/vsql-statistics and installs with a single statement:

INSTALL EXTENSION vsql_statistics;

Tagged with: MySQL VillageSQL Extensions

Producing Mode statistics with SQL

Ronald Bradford
June 26, 2026

The mode is the value or values that appear most frequently in a dataset. Unlike the mean or median, it applies naturally to categorical and ordinal data — star ratings, product codes, survey responses — and reveals what is most common, not what is average.

Producing Alternative Means statistics with SQL

Ronald Bradford
June 23, 2026

MySQL’s built-in AVG() computes the arithmetic mean — the sum divided by the count. That is the right default for many questions, but it is not always the right measure of central tendency.

Producing One-Sample Z-Test statistics with SQL

Ronald Bradford
June 20, 2026

The one-sample Z-test determines whether a sample mean differs significantly from a known population mean when the population standard deviation is also known. It is the appropriate test when the population parameters are established — quality control benchmarks, national averages, long-run process measurements — and you want to evaluate whether a new sample is consistent with them.

Producing IQR and Outlier statistics with SQL

Parkrun Finish Times by Age Group

Results

E-Commerce Order Values — Flag Individual Outlier Rows

Results

Manufacturing — Cylinder Diameter by Machine Line

Results

Three Usage Patterns

Summary

Related Posts

Producing Mode statistics with SQL

Producing Alternative Means statistics with SQL

Producing One-Sample Z-Test statistics with SQL