Producing Alternative Means statistics with SQL

Producing Alternative Means statistics with SQL

MySQL’s built-in AVG() computes the arithmetic mean — the sum divided by the count. That is the right default for many questions, but it is not always the right measure of central tendency. Investment returns compound multiplicatively, not additively. Executive salaries distort typical employee pay. Fuel economy over equal-distance segments is a rate problem where the harmonic mean, not the arithmetic mean, gives the correct overall average.

Leveraging the VillageSQL Extension Bundle (VEB) , STATS_MEAN(value, trim_pct) computes four alternative means in a single aggregate pass and returns them as JSON. VillageSQL is a drop-in MySQL 8.4 replacement — no Python or R required.

Returns JSON: { "trimmed": ..., "winsorized": ..., "geometric": ..., "harmonic": ... }
Field When to use
trimmed Remove the bottom and top trim_pct fraction of values, then average the rest — robust against outliers
winsorized Cap extremes at the same boundary instead of removing them — retains sample size while limiting distortion
geometric exp(mean of ln(x)) — correct for compound growth and multiplicative processes; positive values only
harmonic n / Σ(1/x) — correct for rates where the denominator unit (distance, time) is constant; positive values only

Pass NULL for trim_pct when only geometric and harmonic are needed. Pass a fraction between 0 and 0.5 (for example 0.10 for a 10% trim from each tail) when trimmed and winsorized results are required. Use CAST(STATS_MEAN(...) AS CHAR) to read the JSON in the mysql CLI.

Annual Investment Returns — Geometric Mean and Volatility Drag

Means SQL Example

Two funds each have the same arithmetic mean return of 7% over ten years. The Stable fund earns close to 7% every year. The Volatile fund swings between large gains and large losses while maintaining the same average. Because returns multiply rather than add — a +45% year followed by a −30% year gives a net multiplier of 1.45 × 0.70 = 1.015, not the 1.075 the arithmetic average suggests — the volatile fund’s true compound annual growth rate (CAGR) is far lower.

The data stores each year’s growth multiplier (1 + return), so CAGR = geometric mean − 1.

WITH means AS (
    SELECT
        fund,
        COUNT(*)                                          AS years,
        ROUND(AVG(multiplier) - 1, 4)                    AS arithmetic_avg_return,
        CAST(STATS_MEAN(multiplier, NULL) AS CHAR(200))  AS json
    FROM fund_returns
    GROUP BY fund
)
SELECT
    fund,
    years,
    CONCAT(ROUND(arithmetic_avg_return * 100, 1), '%')                       AS arithmetic_mean_return,
    CONCAT(ROUND((JSON_EXTRACT(json, '$.geometric') - 1) * 100, 1), '%')     AS geometric_mean_return,
    CONCAT(ROUND((POW(JSON_EXTRACT(json, '$.geometric'), years) - 1) * 100, 1), '%')
                                                                              AS total_return_10yr
FROM means
ORDER BY fund;

Results

fund      years  arithmetic_mean_return  geometric_mean_return  total_return_10yr
Stable    10     7%                      7%                     96.7%
Volatile  10     7%                      2.7%                   31.1%

Both funds report the same 7% arithmetic mean, but the volatile fund’s geometric mean is only 2.7% — a 10-year total return of 31.1% versus 96.7% for the stable fund. The arithmetic mean systematically overstates long-run return when volatility is high. This phenomenon is called volatility drag. For performance reporting, the geometric mean is the measure that answers “what did an investor actually earn?”

Employee Compensation — Trimmed and Winsorized Means

The second example compares three measures of typical pay across two departments. Each department has 28 staff earning $45k–$88k plus two executive salaries ($250k–$380k). The arithmetic mean is pulled well above what most employees earn.

A 10% trim removes k = floor(0.10 × 30) = 3 values from each tail — eliminating both executives and the lowest-paid staff. Winsorizing caps at the same boundary values but keeps all 30 observations in the average.

WITH means AS (
    SELECT
        department,
        COUNT(*)                                          AS n,
        ROUND(AVG(salary))                               AS arithmetic_mean,
        CAST(STATS_MEAN(salary, 0.10) AS CHAR(300))      AS json
    FROM employee_salaries
    GROUP BY department
)
SELECT
    department,
    n,
    FORMAT(arithmetic_mean, 0)                           AS arithmetic_mean,
    FORMAT(ROUND(JSON_EXTRACT(json, '$.trimmed')),    0) AS trimmed_10pct,
    FORMAT(ROUND(JSON_EXTRACT(json, '$.winsorized')), 0) AS winsorized_10pct
FROM means
ORDER BY department;

Results

department    n   arithmetic_mean  trimmed_10pct  winsorized_10pct
Engineering   30  89,233           74,417         74,333
Marketing     30  73,667           61,500         61,500

The trimmed mean ($74,417 for Engineering) reflects typical staff compensation much more accurately than the arithmetic mean ($89,233). The winsorized mean ($74,333) is close — the small gap reflects that capping the two executive salaries at the boundary rather than removing them shifts the average slightly. For HR benchmarking and pay-equity reporting, trimmed or winsorized means are often more informative than AVG() when a small number of extreme values dominate the distribution.

Road-Trip Fuel Efficiency — Harmonic Mean for Equal Distances

The third example shows why the harmonic mean is the correct average for rates measured over equal distances. Each trip segment covers 100 km at a recorded fuel economy (km/L). To compute total fuel consumed across the trip, you need the harmonic mean — not the arithmetic mean.

The verification identity is straightforward: total litres = total km / harmonic mean.

WITH means AS (
    SELECT
        trip,
        COUNT(*) * 100                                    AS total_km,
        ROUND(AVG(km_per_l), 2)                          AS arithmetic_mean,
        CAST(STATS_MEAN(km_per_l, NULL) AS CHAR(200))    AS json
    FROM trip_segments
    GROUP BY trip
)
SELECT
    trip,
    total_km,
    arithmetic_mean                                       AS arith_km_per_l,
    ROUND(JSON_EXTRACT(json, '$.harmonic'),  2)           AS harmonic_km_per_l,
    ROUND(JSON_EXTRACT(json, '$.geometric'), 2)           AS geometric_km_per_l,
    ROUND(total_km / JSON_EXTRACT(json, '$.harmonic'), 1) AS total_litres_used
FROM means
ORDER BY trip;

Results

trip              total_km  arith_km_per_l  harmonic_km_per_l  geometric_km_per_l  total_litres_used
City commute      300       9.73            8.91               9.31                  33.7
Interstate run    400       10.58           10.03              10.30                 39.9

The arithmetic mean overstates fuel economy on both trips because it weights each segment equally rather than by fuel consumed. Low-efficiency legs burn disproportionately more fuel and should count more heavily — exactly what the harmonic mean does. For the City commute, the arithmetic mean of 9.73 km/L implies 30.8 litres (300 / 9.73), but the actual consumption is 33.7 litres. Dividing 300 km by the harmonic mean of 8.91 km/L gives the correct 33.7 litres.

The geometric mean (9.31 km/L) sits between the arithmetic and harmonic values, as expected for positive numbers. It would be the appropriate choice if segments represented equal time at each rate rather than equal distance — a different physical constraint, same function.

When to Use Which Mean

Question Use MySQL
Typical value when outliers distort the average Trimmed or winsorized mean STATS_MEAN(col, 0.10)$.trimmed or $.winsorized
Compound growth rate over multiple periods Geometric mean STATS_MEAN(col, NULL)$.geometric
Average rate over equal-distance or equal-time segments Harmonic mean STATS_MEAN(col, NULL)$.harmonic
Simple sum/count average with no special structure Arithmetic mean AVG(col)

STATS_MEAN works with GROUP BY, so department-level pay benchmarks, fund-level CAGR comparisons, and trip-level fuel economy all compute in a single query pass alongside the familiar AVG().

Summary

All three examples use STATS_MEAN(), which returns a JSON object with up to four fields computed from the same accumulated values. Pass NULL for trim_pct when only geometric and harmonic results are needed; pass a trim fraction when robust location estimates are required.

Example Key measure Arithmetic Alternative Insight
Stable fund (10 yr) Geometric 7.0% 7.0% Steady returns: arithmetic ≈ geometric
Volatile fund (10 yr) Geometric 7.0% 2.7% Volatility drag: arithmetic overstates CAGR
Engineering salaries Trimmed 10% $89,233 $74,417 Executives inflate AVG()
Marketing salaries Trimmed 10% $73,667 $61,500 Same pattern, smaller department
City commute (300 km) Harmonic 9.73 km/L 8.91 km/L Arithmetic overstates fuel economy
Interstate run (400 km) Harmonic 10.58 km/L 10.03 km/L Harmonic gives correct litres consumed

The extension is available at github.com/ronaldbradford/vsql-statistics and installs with a single statement:

INSTALL EXTENSION vsql_statistics;
Tagged with: MySQL VillageSQL Extensions

Producing One-Sample Z-Test statistics with SQL

The one-sample Z-test determines whether a sample mean differs significantly from a known population mean when the population standard deviation is also known. It is the appropriate test when the population parameters are established — quality control benchmarks, national averages, long-run process measurements — and you want to evaluate whether a new sample is consistent with them.

Producing Skewness statistics with SQL

Skewness measures the asymmetry of a distribution. A perfectly symmetric distribution has a skewness of zero. A positive skew (right-skewed) means the tail extends to the right — a small number of high values pull the mean above the median.

Producing Chi-Squared statistics with SQL

The Chi-Squared test is one of the most widely used statistical tests for categorical data. It comes in two flavors: the goodness-of-fit test asks whether an observed frequency distribution matches an expected one, while the test of independence asks whether two categorical variables are associated with each other.