Skewness measures the asymmetry of a distribution. A perfectly symmetric distribution has a skewness of zero. A positive skew (right-skewed) means the tail extends to the right — a small number of high values pull the mean above the median. A negative skew (left-skewed) means the tail extends to the left — a small number of low values pull the mean below the median.
Leveraging the VillageSQL Extension Bundle (VEB)
, you can calculate skewness directly in SQL using STATS_SKEWNESS(). The function returns two measures as JSON: the moment skewness (g₁ = m₃ / m₂^(3/2), the standardised third central moment) and Pearson’s second skewness coefficient (3 × (mean − median) / σ). VillageSQL
is a drop-in MySQL 8.4 replacement — no Python or R required.
A useful rule of thumb: |g₁| < 0.5 is approximately symmetric, 0.5–1.0 is moderate skew, and |g₁| ≥ 1.0 is high skew.
Income and Exam Score Examples
Two small datasets illustrate the contrast between positive and negative skew clearly.
Income (n=15): most households earn £25k–£60k, with two high-earning outliers at £80k and £120k. The outliers pull the mean above the median — a classic right-skewed distribution.
Exam scores (n=15): most students score 70–98, but one student scored only 20. That single outlier pulls the mean below the median — a left-skewed distribution.
WITH skew AS (
SELECT
cohort,
COUNT(*) AS n,
ROUND(AVG(value), 3) AS mean,
CAST(STATS_SKEWNESS(value) AS CHAR(200)) AS json
FROM survey_data
GROUP BY cohort
)
SELECT
cohort,
n,
mean,
ROUND(JSON_EXTRACT(json, '$.moment'), 4) AS moment_skewness_g1,
ROUND(JSON_EXTRACT(json, '$.pearson'), 4) AS pearson_skewness,
CASE
WHEN JSON_EXTRACT(json, '$.moment') > 0.5 THEN 'Positive skew — long right tail (mean > median)'
WHEN JSON_EXTRACT(json, '$.moment') < -0.5 THEN 'Negative skew — long left tail (mean < median)'
ELSE 'Approximately symmetric'
END AS interpretation
FROM skew
ORDER BY cohort;
Skewness Results
cohort n mean moment_skewness_g1 pearson_skewness interpretation
exam 15 78.400 -1.7938 -0.5629 Negative skew — long left tail (mean < median)
income 15 47.667 1.8931 0.9685 Positive skew — long right tail (mean > median)
London 2012 Men’s Long Jump
The London 2012 men’s long jump qualifying round is a useful real-world dataset because it contains three natural subsets — all qualifiers, the top 12 who advanced to the final, and the final results — each with meaningfully different skewness.
-- Full qualifying round: all 40 athletes with a valid jump
WITH stats AS (
SELECT CAST(STATS_SKEWNESS(distance) AS CHAR(200)) AS json
FROM longjump_qualifying
)
SELECT
JSON_EXTRACT(json, '$.moment') AS skewness_moment,
JSON_EXTRACT(json, '$.pearson') AS skewness_pearson,
'Strong negative skew: bulk near 8m, long left tail to 6.55m' AS interpretation
FROM stats\G
All 40 Qualifiers
*************************** 1. row ***************************
skewness_moment: -1.11865687760678
skewness_pearson: -0.743125461443011
interpretation: Strong negative skew: bulk near 8m, long left tail to 6.55m
Top 12 Qualifiers (Final Entrants)
*************************** 1. row ***************************
skewness_moment: -0.124878211697895
skewness_pearson: 0.0815892439830761
interpretation: Near-symmetric: top qualifiers clustered tightly around 8m
Final Results
*************************** 1. row ***************************
skewness_moment: -0.085783574381751
skewness_pearson: -0.770172172269908
interpretation: Very slight negative skew: elite performances closely grouped
The progression is instructive. Across all 40 qualifiers the distribution is strongly negatively skewed (g₁ = −1.12): most athletes land near 8m but a long tail of weaker performances extends down to 6.55m. Once we narrow to the 12 finalists the distribution becomes near-symmetric (g₁ = −0.12) — elite athletes cluster tightly. The final results are even more so (g₁ = −0.09), with Greg Rutherford’s winning jump of 8.31m and Sebastian Bayer’s last-place 7.70m separated by only 61cm across twelve world-class athletes. This example was adapted from an R analysis using the moments package.
Australian Age at Death (2012)
The Australian Bureau of Statistics catalogue 3302.0 records deaths by age band and sex for 2012. Both distributions are negatively skewed: the bulk of deaths occur at old ages and the tail extends left toward infant and childhood mortality. Females show a more negative skewness because their deaths are even more concentrated at older ages.
STATS_SKEWNESS operates on individual observations, not pre-aggregated bins. For histogram data each age band is expanded to deaths copies of its midpoint via a cross-join numbers table — the standard technique for working with pre-binned data.
WITH
digit AS (
SELECT 0 n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
),
nums AS (
SELECT a.n + b.n * 10 + c.n * 100 + d.n * 1000 + e.n * 10000 + 1 AS n
FROM digit a, digit b, digit c, digit d, digit e
HAVING n <= 15430
),
obs AS (
SELECT s.sex, s.midpoint
FROM aus_deaths_by_age s
JOIN nums ON nums.n <= s.deaths
),
stats AS (
SELECT
sex,
COUNT(*) AS total_deaths,
AVG(midpoint) AS mean_age,
CAST(STATS_IQR(midpoint) AS CHAR(500)) AS iqr_json,
CAST(STATS_SKEWNESS(midpoint) AS CHAR(500)) AS skew_json
FROM obs
GROUP BY sex
)
SELECT
CASE sex WHEN 'M' THEN 'Male' ELSE 'Female' END AS sex,
total_deaths,
ROUND(mean_age, 1) AS mean_age,
JSON_EXTRACT(iqr_json, '$.median') AS median_age,
JSON_EXTRACT(skew_json, '$.moment') AS skewness_moment,
JSON_EXTRACT(skew_json, '$.pearson') AS skewness_pearson
FROM stats
ORDER BY sex\G
Results
*************************** 1. row ***************************
sex: Female
total_deaths: 72303
mean_age: 80.1
median_age: 82.5
skewness_moment: -1.86926181211086
skewness_pearson: -0.439403958278522
*************************** 2. row ***************************
sex: Male
total_deaths: 74792
mean_age: 74.2
median_age: 77.5
skewness_moment: -1.49989665868781
skewness_pearson: -0.577062409483288
Both distributions show high negative skew (|g₁| > 1.0). Female deaths peak at age band 85–89 (15,430 deaths) while male deaths peak at 80–84 (12,619 deaths), explaining why females have the more negative moment: their distribution is shifted further right with a longer left tail reaching back to infant mortality. The 3.3-year difference in mean age (80.1 vs 74.2) and the matching shift in medians (82.5 vs 77.5) are consistent with well-established female longevity patterns.
Summary
All three examples use STATS_SKEWNESS(), which returns a JSON object with moment (g₁) and pearson fields. The moment skewness is the more commonly reported measure; Pearson’s coefficient is useful when you want a quick sanity check tied directly to the mean-median gap.
| Example | n | Moment g₁ | Interpretation |
|---|---|---|---|
| Household income | 15 | +1.89 | High positive skew |
| Exam scores | 15 | −1.79 | High negative skew |
| Long jump — all qualifiers | 40 | −1.12 | High negative skew |
| Long jump — top 12 | 12 | −0.12 | Near-symmetric |
| Long jump — final | 12 | −0.09 | Near-symmetric |
| Australian deaths — Female | 72,303 | −1.87 | High negative skew |
| Australian deaths — Male | 74,792 | −1.50 | High negative skew |
The extension is available at github.com/ronaldbradford/vsql-statistics and installs with a single statement:
INSTALL EXTENSION vsql_statistics;