Producing Skewness statistics with SQL

Producing Skewness statistics with SQL

Skewness measures the asymmetry of a distribution. A perfectly symmetric distribution has a skewness of zero. A positive skew (right-skewed) means the tail extends to the right — a small number of high values pull the mean above the median. A negative skew (left-skewed) means the tail extends to the left — a small number of low values pull the mean below the median.

Leveraging the VillageSQL Extension Bundle (VEB) , you can calculate skewness directly in SQL using STATS_SKEWNESS(). The function returns two measures as JSON: the moment skewness (g₁ = m₃ / m₂^(3/2), the standardised third central moment) and Pearson’s second skewness coefficient (3 × (mean − median) / σ). VillageSQL is a drop-in MySQL 8.4 replacement — no Python or R required.

A useful rule of thumb: |g₁| < 0.5 is approximately symmetric, 0.5–1.0 is moderate skew, and |g₁| ≥ 1.0 is high skew.

Income and Exam Score Examples

Skewness SQL Example

Two small datasets illustrate the contrast between positive and negative skew clearly.

Income (n=15): most households earn £25k–£60k, with two high-earning outliers at £80k and £120k. The outliers pull the mean above the median — a classic right-skewed distribution.

Exam scores (n=15): most students score 70–98, but one student scored only 20. That single outlier pulls the mean below the median — a left-skewed distribution.

WITH skew AS (
    SELECT
        cohort,
        COUNT(*)                                 AS n,
        ROUND(AVG(value), 3)                     AS mean,
        CAST(STATS_SKEWNESS(value) AS CHAR(200)) AS json
    FROM survey_data
    GROUP BY cohort
)
SELECT
    cohort,
    n,
    mean,
    ROUND(JSON_EXTRACT(json, '$.moment'),  4) AS moment_skewness_g1,
    ROUND(JSON_EXTRACT(json, '$.pearson'), 4) AS pearson_skewness,
    CASE
        WHEN JSON_EXTRACT(json, '$.moment') >  0.5 THEN 'Positive skew — long right tail (mean > median)'
        WHEN JSON_EXTRACT(json, '$.moment') < -0.5 THEN 'Negative skew — long left tail  (mean < median)'
        ELSE                                             'Approximately symmetric'
    END AS interpretation
FROM skew
ORDER BY cohort;

Skewness Results

cohort   n    mean     moment_skewness_g1  pearson_skewness  interpretation
exam     15   78.400   -1.7938             -0.5629           Negative skew — long left tail  (mean < median)
income   15   47.667    1.8931              0.9685           Positive skew — long right tail (mean > median)

London 2012 Men’s Long Jump

Long Jump SQL Example

The London 2012 men’s long jump qualifying round is a useful real-world dataset because it contains three natural subsets — all qualifiers, the top 12 who advanced to the final, and the final results — each with meaningfully different skewness.

-- Full qualifying round: all 40 athletes with a valid jump
WITH stats AS (
    SELECT CAST(STATS_SKEWNESS(distance) AS CHAR(200)) AS json
    FROM longjump_qualifying
)
SELECT
    JSON_EXTRACT(json, '$.moment')  AS skewness_moment,
    JSON_EXTRACT(json, '$.pearson') AS skewness_pearson,
    'Strong negative skew: bulk near 8m, long left tail to 6.55m' AS interpretation
FROM stats\G

All 40 Qualifiers

*************************** 1. row ***************************
 skewness_moment: -1.11865687760678
skewness_pearson: -0.743125461443011
  interpretation: Strong negative skew: bulk near 8m, long left tail to 6.55m

Top 12 Qualifiers (Final Entrants)

*************************** 1. row ***************************
 skewness_moment: -0.124878211697895
skewness_pearson: 0.0815892439830761
  interpretation: Near-symmetric: top qualifiers clustered tightly around 8m

Final Results

*************************** 1. row ***************************
 skewness_moment: -0.085783574381751
skewness_pearson: -0.770172172269908
  interpretation: Very slight negative skew: elite performances closely grouped

The progression is instructive. Across all 40 qualifiers the distribution is strongly negatively skewed (g₁ = −1.12): most athletes land near 8m but a long tail of weaker performances extends down to 6.55m. Once we narrow to the 12 finalists the distribution becomes near-symmetric (g₁ = −0.12) — elite athletes cluster tightly. The final results are even more so (g₁ = −0.09), with Greg Rutherford’s winning jump of 8.31m and Sebastian Bayer’s last-place 7.70m separated by only 61cm across twelve world-class athletes. This example was adapted from an R analysis using the moments package.

Australian Age at Death (2012)

Australian Deaths SQL Example

The Australian Bureau of Statistics catalogue 3302.0 records deaths by age band and sex for 2012. Both distributions are negatively skewed: the bulk of deaths occur at old ages and the tail extends left toward infant and childhood mortality. Females show a more negative skewness because their deaths are even more concentrated at older ages.

STATS_SKEWNESS operates on individual observations, not pre-aggregated bins. For histogram data each age band is expanded to deaths copies of its midpoint via a cross-join numbers table — the standard technique for working with pre-binned data.

WITH
digit AS (
    SELECT 0 n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
    UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
),
nums AS (
    SELECT a.n + b.n * 10 + c.n * 100 + d.n * 1000 + e.n * 10000 + 1 AS n
    FROM digit a, digit b, digit c, digit d, digit e
    HAVING n <= 15430
),
obs AS (
    SELECT s.sex, s.midpoint
    FROM aus_deaths_by_age s
    JOIN nums ON nums.n <= s.deaths
),
stats AS (
    SELECT
        sex,
        COUNT(*)                                    AS total_deaths,
        AVG(midpoint)                               AS mean_age,
        CAST(STATS_IQR(midpoint)      AS CHAR(500)) AS iqr_json,
        CAST(STATS_SKEWNESS(midpoint) AS CHAR(500)) AS skew_json
    FROM obs
    GROUP BY sex
)
SELECT
    CASE sex WHEN 'M' THEN 'Male' ELSE 'Female' END AS sex,
    total_deaths,
    ROUND(mean_age, 1)                              AS mean_age,
    JSON_EXTRACT(iqr_json,  '$.median')             AS median_age,
    JSON_EXTRACT(skew_json, '$.moment')             AS skewness_moment,
    JSON_EXTRACT(skew_json, '$.pearson')            AS skewness_pearson
FROM stats
ORDER BY sex\G

Results

*************************** 1. row ***************************
             sex: Female
    total_deaths: 72303
        mean_age: 80.1
      median_age: 82.5
 skewness_moment: -1.86926181211086
skewness_pearson: -0.439403958278522
*************************** 2. row ***************************
             sex: Male
    total_deaths: 74792
        mean_age: 74.2
      median_age: 77.5
 skewness_moment: -1.49989665868781
skewness_pearson: -0.577062409483288

Both distributions show high negative skew (|g₁| > 1.0). Female deaths peak at age band 85–89 (15,430 deaths) while male deaths peak at 80–84 (12,619 deaths), explaining why females have the more negative moment: their distribution is shifted further right with a longer left tail reaching back to infant mortality. The 3.3-year difference in mean age (80.1 vs 74.2) and the matching shift in medians (82.5 vs 77.5) are consistent with well-established female longevity patterns.

Summary

All three examples use STATS_SKEWNESS(), which returns a JSON object with moment (g₁) and pearson fields. The moment skewness is the more commonly reported measure; Pearson’s coefficient is useful when you want a quick sanity check tied directly to the mean-median gap.

Example n Moment g₁ Interpretation
Household income 15 +1.89 High positive skew
Exam scores 15 −1.79 High negative skew
Long jump — all qualifiers 40 −1.12 High negative skew
Long jump — top 12 12 −0.12 Near-symmetric
Long jump — final 12 −0.09 Near-symmetric
Australian deaths — Female 72,303 −1.87 High negative skew
Australian deaths — Male 74,792 −1.50 High negative skew

The extension is available at github.com/ronaldbradford/vsql-statistics and installs with a single statement:

INSTALL EXTENSION vsql_statistics;
Tagged with: MySQL VillageSQL Extensions

Producing Chi-Squared statistics with SQL

The Chi-Squared test is one of the most widely used statistical tests for categorical data. It comes in two flavors: the goodness-of-fit test asks whether an observed frequency distribution matches an expected one, while the test of independence asks whether two categorical variables are associated with each other.

Producing Two Sample T-Test statistics with SQL

The two sample t-test for equal variance is a statistical test to determine if the means of two groups are different enough that the difference is likely caused by some underlying difference, rather than random chance.

Exploring the vsql-ai extension

The vsql-ai extension adds AI prompt capabilities and text embeddings directly in SQL queries, with support for Anthropic Claude , Google Gemini , OpenAI ChatGPT , or a local LLM such as Ollama .