Producing IQR and Outlier statistics with SQL

Producing IQR and Outlier statistics with SQL

The interquartile range (IQR) measures the spread of the middle 50% of a distribution — the distance between the first quartile (Q1) and the third quartile (Q3). Combined with Tukey’s 1.5×IQR fence rule, it provides a robust method for identifying outliers that is far less sensitive to extreme values than mean ± standard deviation.

MySQL has no built-in quartile or outlier function. Leveraging the VillageSQL Extension Bundle (VEB) , STATS_IQR(col) computes the full five-number summary and fence values in a single aggregate pass, returning them as JSON. VillageSQL is a drop-in MySQL 8.4 replacement — no Python or R required.

Returns JSON: { "q1": ..., "median": ..., "q3": ..., "iqr": ...,
                "lower_fence": ..., "upper_fence": ... }
Field Description
q1, q3 Tukey’s hinges (exclusive-median method)
median Middle value of the sorted group
iqr Q3 − Q1
lower_fence Q1 − 1.5 × IQR (Tukey inner lower fence)
upper_fence Q3 + 1.5 × IQR (Tukey inner upper fence)

Values outside the fences are mild Tukey outliers. Use CAST(STATS_IQR(...) AS CHAR) to read results in the mysql CLI, and CAST(col AS DOUBLE) on integer columns to preserve decimal precision in JSON output.

Parkrun Finish Times by Age Group

IQR SQL Example

The first example computes IQR statistics per age group for a community 5 km run, then joins the fence values back to individual finish times to count outliers. IQR grows with age — older runners span a wider range of fitness — and each group contains one walker or injury finish that sits above the upper fence.

WITH iqr_stats AS (
    SELECT
        age_group,
        COUNT(*)                                            AS n,
        CAST(STATS_IQR(finish_time) AS CHAR(300))          AS json
    FROM race_results
    GROUP BY age_group
),
fences AS (
    SELECT
        age_group,
        n,
        ROUND(JSON_EXTRACT(json, '$.q1'),     2)  AS q1,
        ROUND(JSON_EXTRACT(json, '$.median'), 2)  AS median_time,
        ROUND(JSON_EXTRACT(json, '$.q3'),     2)  AS q3,
        ROUND(JSON_EXTRACT(json, '$.iqr'),    2)  AS iqr,
        JSON_EXTRACT(json, '$.lower_fence')        AS lower_fence,
        JSON_EXTRACT(json, '$.upper_fence')        AS upper_fence
    FROM iqr_stats
)
SELECT
    f.age_group,
    f.n,
    CONCAT(f.q1, '–', f.q3, ' min')            AS q1_q3,
    CONCAT(f.median_time, ' min')               AS median,
    CONCAT(f.iqr, ' min')                       AS iqr,
    CONCAT(ROUND(f.upper_fence, 2), ' min')     AS upper_fence,
    SUM(CASE
        WHEN r.finish_time > f.upper_fence
          OR r.finish_time < f.lower_fence THEN 1 ELSE 0
    END)                                        AS outlier_count
FROM race_results r
JOIN fences f USING (age_group)
GROUP BY f.age_group, f.n, f.q1, f.q3, f.median_time, f.iqr, f.lower_fence, f.upper_fence
ORDER BY f.age_group;

Results

age_group  n   q1_q3           median       iqr          upper_fence   outlier_count
18-29      12  19.7–23 min     21.75 min    3.3 min      27.95 min     1
30-44      12  22.75–26.4 min  24.5 min     3.65 min     31.88 min     1
45-59      12  24.85–30.25 min 27.15 min    5.4 min      38.35 min     1
60+        12  29.25–38 min    32.75 min    8.75 min     51.12 min     1

IQR grows steadily from 3.3 minutes (18–29) to 8.75 minutes (60+), confirming that older age groups have wider performance spread. Each group has exactly one outlier above the upper fence — finish times of 35.8, 43.0, 55.0, and 67.0 minutes respectively — consistent with a walker or injury finish in an otherwise competitive pack. The GROUP BY → join-back pattern is the natural structure for per-group outlier counts.

E-Commerce Order Values — Flag Individual Outlier Rows

The second example applies STATS_IQR to a single group of 25 orders and cross-joins the fence values to every row, tagging each order as normal, low outlier, or high outlier. Normal orders cluster between $45 and $90; a $2.00 test/refund entry and a $200.00 bulk purchase both fall outside the Tukey fences.

WITH fences AS (
    SELECT
        JSON_EXTRACT(json, '$.q1')          AS q1,
        JSON_EXTRACT(json, '$.median')      AS median_value,
        JSON_EXTRACT(json, '$.q3')          AS q3,
        JSON_EXTRACT(json, '$.lower_fence') AS lower_fence,
        JSON_EXTRACT(json, '$.upper_fence') AS upper_fence
    FROM (
        SELECT CAST(STATS_IQR(order_value) AS CHAR(300)) AS json
        FROM orders
    ) s
)
SELECT
    o.order_id,
    o.customer,
    o.order_value,
    CASE
        WHEN o.order_value < f.lower_fence THEN 'LOW OUTLIER'
        WHEN o.order_value > f.upper_fence THEN 'HIGH OUTLIER'
        ELSE 'normal'
    END                                     AS flag
FROM orders o, fences f
ORDER BY o.order_value;

Results

order_id  customer   order_value  flag
1         Chen       2            LOW OUTLIER
2         Walsh      45           normal
...
24        Bauer      90           normal
25        Costa      200          HIGH OUTLIER

With n = 25, Q1 = $55, Q3 = $81, and IQR = $26, the fences are $16.00 (lower) and $120.00 (upper). Order #1 ($2.00) is flagged as a low outlier — likely a test transaction or refund. Order #25 ($200.00) is a high outlier — a bulk purchase well outside the normal $45–$90 range. The cross-join pattern works for any single-group outlier scan without needing GROUP BY.

Manufacturing — Cylinder Diameter by Machine Line

The third example combines IQR per machine with engineering tolerance checks. Target diameter is 50.0 mm with a spec of ±2.0 mm (48.0–52.0 mm). Three machines show different process characteristics: Machine A is well-calibrated, Machine B has drifted high, and Machine C has worn tooling with wide spread and multiple violations.

WITH iqr_stats AS (
    SELECT
        machine,
        COUNT(*)                                            AS n,
        CAST(STATS_IQR(diameter) AS CHAR(300))             AS json
    FROM components
    GROUP BY machine
),
fences AS (
    SELECT
        machine,
        n,
        ROUND(JSON_EXTRACT(json, '$.q1'),     2)  AS q1,
        ROUND(JSON_EXTRACT(json, '$.median'), 2)  AS median_mm,
        ROUND(JSON_EXTRACT(json, '$.q3'),     2)  AS q3,
        ROUND(JSON_EXTRACT(json, '$.iqr'),    2)  AS iqr_mm,
        JSON_EXTRACT(json, '$.lower_fence')        AS lower_fence,
        JSON_EXTRACT(json, '$.upper_fence')        AS upper_fence
    FROM iqr_stats
)
SELECT
    f.machine,
    f.median_mm,
    f.q1,
    f.q3,
    f.iqr_mm,
    SUM(CASE
        WHEN c.diameter < 48.0 OR c.diameter > 52.0 THEN 1 ELSE 0
    END)                                        AS spec_violations,
    SUM(CASE
        WHEN c.diameter < f.lower_fence
          OR c.diameter > f.upper_fence THEN 1 ELSE 0
    END)                                        AS tukey_outliers
FROM components c
JOIN fences f USING (machine)
GROUP BY f.machine, f.median_mm, f.q1, f.q3, f.iqr_mm, f.lower_fence, f.upper_fence
ORDER BY f.machine;

Results

machine  median_mm  q1    q3    iqr_mm  spec_violations  tukey_outliers
A        50.0       49.7  50.3  0.6     0                0
B        50.6       50.3  50.8  0.5     0                0
C        50.2       49.0  52.0  3.0     4                1

Machine A is tight and centred on target (IQR = 0.6 mm, median = 50.0 mm) with no violations. Machine B is equally tight (IQR = 0.5 mm) but the median has drifted 0.6 mm above target to 50.6 mm — a calibration issue visible in the median but not yet causing spec failures. Machine C tells a different story: IQR of 3.0 mm is five times wider than Machine A, with four parts outside the ±2.0 mm engineering spec and one statistical outlier at 57.5 mm. IQR quantifies process consistency; combining it with domain-specific tolerance checks gives both statistical and engineering views in one query.

Three Usage Patterns

Pattern When to use Example
GROUP BY → join fences back Per-group summary and outlier counts Parkrun age groups, machine lines
Single STATS_IQR → cross-join Flag individual rows in one group Order value review
GROUP BY + domain thresholds Statistical outliers plus engineering specs Manufacturing QC

All three patterns extract fence values from the JSON returned by STATS_IQR and apply them to raw rows — the function computes the distribution summary once per group; the join does the row-level classification.

Summary

All three examples use STATS_IQR(), which returns a JSON object with quartiles, median, IQR, and Tukey fence values computed using the exclusive-median (Tukey’s hinges) method.

Example Group Median IQR Outliers
Parkrun 18–29 12 runners 21.75 min 3.3 min 1 (35.8 min walker)
Parkrun 30–44 12 runners 24.5 min 3.65 min 1 (43.0 min walker)
Parkrun 45–59 12 runners 27.15 min 5.4 min 1 (55.0 min walker)
Parkrun 60+ 12 runners 32.75 min 8.75 min 1 (67.0 min walker)
E-commerce orders 25 orders $68 $26 2 ($2 low, $200 high)
Machine A 15 parts 50.0 mm 0.6 mm 0
Machine B 15 parts 50.6 mm 0.5 mm 0 (drifting high)
Machine C 15 parts 50.2 mm 3.0 mm 4 spec + 1 Tukey

The extension is available at github.com/ronaldbradford/vsql-statistics and installs with a single statement:

INSTALL EXTENSION vsql_statistics;
Tagged with: MySQL VillageSQL Extensions

Producing Mode statistics with SQL

The mode is the value or values that appear most frequently in a dataset. Unlike the mean or median, it applies naturally to categorical and ordinal data — star ratings, product codes, survey responses — and reveals what is most common, not what is average.

Producing Alternative Means statistics with SQL

MySQL’s built-in AVG() computes the arithmetic mean — the sum divided by the count. That is the right default for many questions, but it is not always the right measure of central tendency.

Producing One-Sample Z-Test statistics with SQL

The one-sample Z-test determines whether a sample mean differs significantly from a known population mean when the population standard deviation is also known. It is the appropriate test when the population parameters are established — quality control benchmarks, national averages, long-run process measurements — and you want to evaluate whether a new sample is consistent with them.