Descriptive Statistics With Charts and Outlier Checks
Compute mean, median, mode, variance, standard deviation, skewness, kurtosis, and outliers. Compare datasets or calculate weighted statistics with visual charts.
Before any modeling, you summarize. The mean and median locate the center; standard deviation and variance describe spread around the mean; the interquartile range (IQR) measures spread around the median and stays robust when outliers are present; min, max, and the quartiles bound the distribution; skewness and kurtosis describe shape. Different summaries answer different questions, and using the wrong one is where a lot of analysis goes wrong.
When the distribution is roughly symmetric, mean and standard deviation are the natural pair. When it's skewed (income, response times, anything with a heavy right tail) the mean drifts toward the tail and overstates the typical value, so report the median and IQR instead, or report both. Variance is just SD squared and lives in the wrong units (dollars², not dollars), so SD is the one you put in front of stakeholders. The page returns all of them on one screen with a histogram and box plot, so you can see what the numbers are summarizing. If the histogram is bimodal or shows a long tail, no single-number summary will do justice to the data; report the visuals alongside.
The mean (average) sums all values and divides by count. It uses every data point, making it sensitive to extremes. One outlier can drag the mean far from where most values cluster.
The median is the middle value after sorting. Half the data falls below, half above. It ignores how far outliers lie from the center, so it stays stable even when extreme values appear. For income, house prices, or anything with a long tail, median tells the truer story.
The mode is the most frequent value. A dataset can have one mode (unimodal), two (bimodal), or many (multimodal). When all values are unique, no mode exists. Mode shines for categorical data—like the most common shoe size sold—where arithmetic averages don't apply.
Quick formulas:
• Mean: μ = (Σxᵢ) / n
• Median: middle value after sorting (average of two middles if n is even)
• Mode: value(s) with highest frequency
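As a minimal sketch of the three formulas above, the following uses Python's standard `statistics` module on a made-up dataset (the values, including the extreme 95, are an assumption chosen for illustration). It shows the mean being dragged by one extreme while the median and mode stay put.

```python
import statistics

# Hypothetical sample with one extreme value (the 95).
data = [2, 3, 3, 4, 5, 5, 5, 6, 95]

mean = statistics.mean(data)        # sum of values divided by count
median = statistics.median(data)    # middle value after sorting
modes = statistics.multimode(data)  # list of every most-frequent value

print(f"mean={mean:.2f} median={median} modes={modes}")
```

`multimode` returns a list, so bimodal or multimodal data come back as several values rather than raising an error.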
Variance measures how far values scatter from the mean by averaging squared deviations. Squaring ensures positive and negative deviations don't cancel out, but it leaves units squared (dollars², cm²), making interpretation awkward.
Standard deviation (SD) is the square root of variance, restoring original units. A small SD means data clusters tightly around the mean; a large SD means wide scatter. For normally distributed data, roughly 68% falls within ±1 SD, 95% within ±2 SD, and 99.7% within ±3 SD.
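The 68/95/99.7 figures can be checked directly from the normal cumulative distribution function. This is a small sketch using `statistics.NormalDist`; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

def share_within(k):
    """Fraction of a normal distribution within k SDs of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

These shares hold only for normal data; skewed or heavy-tailed data can put noticeably more or less mass inside each band.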
The interquartile range (IQR) measures the spread of the middle 50% of the data: Q3 minus Q1. Unlike SD, the IQR ignores everything outside the quartiles, which makes it robust to outliers. Box plots visualize the IQR directly as the box, with whiskers extending to the most extreme data points within 1.5×IQR of the quartiles.
Spread formulas:
• Variance: σ² = Σ(xᵢ − μ)² / n (population form; a sample estimate divides by n − 1)
• Std Dev: σ = √(variance)
• IQR = Q3 − Q1
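The spread formulas above can be sketched with the standard `statistics` module. The data values are hypothetical, and the quartile method (`"inclusive"`, which interpolates linearly between data points) is one of several conventions, so other tools may report slightly different Q1 and Q3.

```python
import statistics

data = [4, 8, 15, 16, 23, 42]  # hypothetical values

pop_var = statistics.pvariance(data)  # mean squared deviation, divides by n
pop_sd = statistics.pstdev(data)      # square root of the population variance

# Quartiles by linear interpolation between sorted data points
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(f"variance={pop_var:.2f} sd={pop_sd:.2f} Q1={q1} Q3={q3} IQR={iqr}")
```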
Skewness quantifies asymmetry. A positive (right) skew means a longer right tail—think income distributions where a few high earners stretch the mean above the median. Negative (left) skew means a longer left tail. Zero skewness indicates symmetry, like a bell curve.
Kurtosis measures tail heaviness relative to a normal distribution. High kurtosis (leptokurtic) means fatter tails and more extreme outliers than normal. Low kurtosis (platykurtic) means thinner tails and fewer outliers. A normal distribution has kurtosis of 3, or excess kurtosis of 0.
Together, skewness and kurtosis help indicate whether standard parametric tests apply. Many statistical methods assume normality; when the shape departs strongly from it, consider transformations or non-parametric alternatives.
Rule of thumb: If |skewness| > 1 or |excess kurtosis| > 2, the distribution deviates substantially from normal and may require special handling.
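One way to compute the two shape measures is from the central moments, sketched below. This is the simple uncorrected (moment-based) version; statistical packages often apply small-sample bias corrections, so their numbers can differ slightly. The function name and both datasets are assumptions for illustration.

```python
import statistics

def skew_and_excess_kurtosis(values):
    """Moment-based (uncorrected) skewness and excess kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3  # subtract 3 so a normal scores 0
    return skew, excess_kurt

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 20]

skew_sym, kurt_sym = skew_and_excess_kurtosis(symmetric)
skew_right, kurt_right = skew_and_excess_kurtosis(right_tailed)
print(round(skew_sym, 3), round(kurt_sym, 3))
print(round(skew_right, 3), round(kurt_right, 3))
```

The symmetric list has zero skewness, while the single large value in the second list pushes its skewness past the |skewness| > 1 rule of thumb.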
Outliers are data points far from the bulk of observations. The z-score method flags values beyond ±3 standard deviations from the mean. The IQR method flags anything below Q1 − 1.5×IQR or above Q3 + 1.5×IQR. Neither method is perfect—each catches different types of extremes.
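The two rules can be sketched side by side. The function names, the sample data, and the choice of sample SD and the `"inclusive"` quartile method are assumptions for illustration, not necessarily what this page uses.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` sample SDs from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs(x - mean) > threshold * sd]

def iqr_outliers(values, k=1.5):
    """Values outside the Q1 - k*IQR and Q3 + k*IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(zscore_outliers(data))  # the 102 inflates the SD and hides itself
print(iqr_outliers(data))
```

In this small sample the value 102 inflates the SD enough that its own z-score stays under 3, so only the IQR rule flags it. That masking effect is one reason the two methods can disagree.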
An outlier isn't automatically an error. It might be a data entry mistake, a measurement glitch, or a genuine extreme value worth investigating. Before deleting outliers, understand their source. Legitimate extremes carry real information; erroneous ones distort analysis.
Robust statistics resist outlier influence. The median and IQR are robust; the mean and SD are not. Trimmed means (dropping extreme percentiles) or Winsorized statistics (capping extremes) offer middle-ground approaches when you suspect contamination but don't want to discard data entirely.
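A minimal sketch of the two middle-ground approaches follows. The function names, the 10% proportion, and the data are illustrative assumptions; library versions may round the cut count differently.

```python
import statistics

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the points from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # points to cut per side
    return statistics.fmean(ordered[k:len(ordered) - k])

def winsorized_mean(values, proportion=0.1):
    """Mean after capping each tail at the nearest retained value."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    if k:
        ordered[:k] = [ordered[k]] * k        # raise the lowest k values
        ordered[-k:] = [ordered[-k - 1]] * k  # lower the highest k values
    return statistics.fmean(ordered)

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
plain = statistics.fmean(data)
trimmed = trimmed_mean(data)
winsorized = winsorized_mean(data)
print(plain, trimmed, winsorized)
```

Trimming discards the extreme points, while Winsorizing keeps the sample size and only pulls the extremes in; both land near the bulk of the data here, unlike the plain mean.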
Caution: Automatic outlier removal can bias results. Always examine flagged values in context before deciding to exclude them.
A histogram bins continuous data into intervals and shows frequency per bin as bar height. The shape reveals distribution type: bell-shaped, skewed, bimodal, or uniform. Bin width matters—too few bins hide detail, too many create noise. A common rule: use √n bins, where n is sample size.
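To see how the bin count changes the picture, here is a small text-histogram sketch. The helper `histogram_counts` and the data are hypothetical; the default of ceil(√n) equal-width bins is one reading of the rule above.

```python
import math

def histogram_counts(values, bins=None):
    """Count values per equal-width bin; defaults to ceil(sqrt(n)) bins."""
    if bins is None:
        bins = math.ceil(math.sqrt(len(values)))
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # fall back to 1 if all values are equal
    counts = [0] * bins
    for x in values:
        idx = min(int((x - lo) / width), bins - 1)  # maximum goes in last bin
        counts[idx] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9, 10, 12, 13, 13, 14, 15]
for i, count in enumerate(histogram_counts(data)):
    print(i, "#" * count)
```

With the default four bins the gap between the two clusters is visible; with `bins=2` the same data collapse into two nearly equal bars and the gap disappears.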
A box plot (box-and-whisker) compresses a distribution into five numbers: minimum, Q1, median, Q3, and maximum. The box spans the IQR; the line inside marks the median. In the common Tukey style, whiskers extend to the most extreme data points within 1.5×IQR of the box, and points beyond them are drawn individually as potential outliers. Box plots excel at comparing multiple groups side by side.
Always visualize before computing summary statistics. Datasets with nearly identical means, variances, and correlations can have completely different shapes; Anscombe's quartet famously demonstrates this. Charts reveal patterns that numbers alone miss: clusters, gaps, outliers, multimodality.
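The numbers behind a box plot can be sketched as follows. The function name, dictionary keys, and data are assumptions for illustration, and the quartiles use the `"inclusive"` interpolation method, which may not match every plotting tool.

```python
import statistics

def box_plot_stats(values):
    """Five-number summary plus whisker ends and flagged points."""
    ordered = sorted(values)
    q1, med, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "min": ordered[0], "q1": q1, "median": med, "q3": q3, "max": ordered[-1],
        # whiskers stop at the most extreme data points inside the fences
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }

summary = box_plot_stats([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])
print(summary)
```

Note that the upper whisker ends at 15, an actual data point, rather than at the fence value itself.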
Visualization rule: Histograms for detailed shape, box plots for quick comparison across groups. Use both when space allows.
When should I use median instead of mean?
Use the median when data is skewed or contains outliers. Income, house prices, response times, and customer lifetime values often have long tails where the median better represents the typical case. If mean and median diverge significantly, report both and note the skew.
What's the difference between population and sample standard deviation?
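Population SD divides by n; sample SD divides by n−1 (Bessel's correction). The n−1 adjustment makes the sample variance an unbiased estimator of the population variance; its square root, the sample SD, still slightly underestimates the population SD on average, but it is the conventional estimate. If your data is the entire population, use n. If it's a sample from a larger population, use n−1.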
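In Python's standard library the two divisors map to two separate functions. A short sketch on hypothetical scores:

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

population_sd = statistics.pstdev(scores)  # divides by n
sample_sd = statistics.stdev(scores)       # divides by n - 1

print(population_sd, round(sample_sd, 3))
```

The sample SD is always the larger of the two, and the gap shrinks as n grows.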
How do I compare two datasets with different units?
Use the coefficient of variation (CV = SD / mean × 100%). CV expresses variability relative to the mean, making it unit-free. A dataset with SD = 10 and mean = 100 (CV = 10%) is relatively less variable than one with SD = 5 and mean = 20 (CV = 25%).
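A quick sketch of the CV comparison, using made-up height and weight samples. The function name is hypothetical, the sample SD is an assumed choice, and the measure only makes sense for data with a positive mean on a ratio scale.

```python
import statistics

def coefficient_of_variation(values):
    """Sample SD as a percentage of the mean (assumes a positive mean)."""
    return statistics.stdev(values) / statistics.fmean(values) * 100

heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(f"{cv_heights:.1f}% vs {cv_weights:.1f}%")
```

Both samples have the same SD in their own units, yet the weights vary more relative to their mean.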
Can a dataset have no mode?
Yes. If every value appears exactly once, no mode exists. If all values appear equally often, some define every value as a mode, others say none exists. Mode is most useful when some values repeat significantly more than others.
Why does my histogram look different with different bin widths?
Bin width is a tuning parameter, not a property of the data. Wide bins smooth out detail; narrow bins show noise. There's no single correct choice. Try multiple widths and use the one that reveals the underlying pattern without excessive chatter.
Data cleanliness: these statistics summarize whatever you feed in. They can't detect entry errors, transcription mistakes, or sampling bias. Validate the data first.
Robust vs classical: mean and SD are heavily moved by outliers. With heavy-tailed or skewed data (income, response times, anything with a long right tail), the median and IQR are the more honest summaries. Report both when the distribution is unclear.
Sample vs population: sample statistics estimate population parameters with uncertainty that descriptive measures don't surface. For inference, you want a CI or hypothesis test, not just μ̂ and σ̂.
Plot first, summarize second: datasets can share nearly the same mean, variance, and correlation and look completely different (Anscombe's quartet). Numbers alone hide structure.
Note: pandas.DataFrame.describe() and R's summary() are the standard one-liners for the core summaries (mean, min, max, and quartiles; describe() adds the count and SD). They do not report skewness, kurtosis, or outlier flags, and quartile values can differ slightly between tools. For trimmed/winsorized stats specifically, R's robustbase package and scipy.stats.trim_mean cover the common variants. Tukey (1977) is the original EDA reference and still worth reading.
Mean for symmetric distributions, median for skewed ones or anything with outliers. Income, response times, house prices, anything with a long right tail: the mean drifts toward the tail and overstates the typical value. Bill Gates walks into a bar; the mean income jumps but the median barely moves. Symmetric data (heights, IQ, exam scores at large n) have mean ≈ median and either is fine. The honest move when distribution shape is unclear: report both. If they disagree noticeably, the mean is misleading and you should lead with the median.
SD when the distribution is roughly symmetric and you'll be using it for downstream inference (CIs, t-tests). IQR when the distribution is skewed or has outliers, since IQR ignores the tails by construction (it's the range of the middle 50%). Together: SD describes spread around the mean, IQR describes spread around the median. For normal data the two are linked: roughly SD ≈ IQR / 1.35, so SD/IQR ≈ 0.74. If your ratio is far from that, the data are likely non-normal and you should lean on IQR for description.
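The 1.35 factor comes from the quartiles of the normal distribution, which sit about 0.674 SDs on either side of the mean. A short check using `statistics.NormalDist` (variable names are illustrative):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# IQR of a standard normal, expressed in SD units
iqr_in_sd_units = std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25)
print(round(iqr_in_sd_units, 3))
```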
An observation is flagged as an outlier if it's below Q1 − 1.5·IQR or above Q3 + 1.5·IQR, where IQR = Q3 − Q1. With the 1.5 multiplier (Tukey 1977), about 0.7% of normal data fall outside the fences, which catches genuine extremes without flagging too many points. The 3·IQR rule flags "far outliers" (about 0.0002% of normal data, essentially never by chance). For non-normal data, expect more flagged points; high skew naturally pushes more data past the 1.5·IQR fence on one side. Box plots visualize this directly.
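The share of normal data falling outside the fences can be worked out from the normal quartiles. This sketch assumes an exactly normal distribution; `share_flagged` is a made-up helper name.

```python
from statistics import NormalDist

z = NormalDist()
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1

def share_flagged(k):
    """Fraction of a normal distribution outside the k*IQR fences."""
    lower_fence = q1 - k * iqr
    return 2 * z.cdf(lower_fence)  # symmetric, so double the lower tail

print(f"{share_flagged(1.5):.3%}")
print(f"{share_flagged(3.0):.5%}")
```

The 1.5·IQR fences sit about 2.7 SDs from the mean and the 3·IQR fences about 4.7 SDs, which is why the second rule almost never fires on normal data.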
Skewness ≈ 0 means symmetric. Positive skew (right-tailed) means the mean is pulled right of the median; negative skew (left-tailed) means the mean is pulled left. Rough thresholds: |skew| < 0.5 fairly symmetric, 0.5-1.0 moderate, above 1.0 highly skewed. Income distributions typically run skew = 1.5 to 3. The sign tells direction; the magnitude tells severity. For inference assuming normality, |skew| above 1 is a warning sign and you should consider transforming the data (log, square root) or using non-parametric tests.
Sample SD divides by n − 1 because of Bessel's correction: dividing by n − 1 instead of n produces an unbiased estimator of population variance from a sample. The intuition: the sample mean x̄ is computed from the same data, so the sum of squared deviations from x̄ is systematically smaller than from the true μ. Dividing by n − 1 corrects for that. Population SD (σ, n in denominator) is correct when you have the entire population. Sample SD (s, n − 1) is correct when you have a sample and want to estimate the population. R's sd() uses n − 1. Python's numpy.std() divides by n by default, so pass ddof=1 to get the sample SD; statistics.stdev() already uses n − 1.
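The unbiasedness claim can be checked exactly on a toy case by enumerating every possible sample. The three-value population and the sample size of 2 are assumptions chosen to keep the enumeration tiny.

```python
import itertools
import statistics

population = [1, 2, 3]
true_variance = statistics.pvariance(population)  # 2/3

# Every ordered sample of size 2 drawn with replacement (9 in total)
samples = list(itertools.product(population, repeat=2))

# Average each estimator over all possible samples
avg_divide_by_n = statistics.fmean([statistics.pvariance(s) for s in samples])
avg_divide_by_n_minus_1 = statistics.fmean([statistics.variance(s) for s in samples])

print(true_variance, avg_divide_by_n, avg_divide_by_n_minus_1)
```

Averaged over all samples, the n − 1 version recovers the true variance, while the n version comes out at half of it for this sample size.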
Q1 is the 25th percentile, Q3 the 75th. Different tools can still disagree on their values, because there are at least nine definitions in use, differing in how they interpolate between data points. Hyndman and Fan (1996) catalogued them. R's quantile() default is type 7 (linear interpolation). Python's numpy.quantile defaults to linear too. Excel's QUARTILE.INC uses one method, QUARTILE.EXC another. For large n the differences are negligible. For small n (under 20) you can see different software return different Q1 values for the same data; that's the definition issue, not a bug.
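Python's standard library exposes two of these definitions through the `method` argument of `statistics.quantiles`, which makes the small-n disagreement easy to reproduce. As far as the arithmetic goes, `"inclusive"` lines up with the linear-interpolation (type 7) definition and `"exclusive"` with an (n + 1)-based one; treat that mapping as an assumption to verify against your own tools.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]

inclusive = statistics.quantiles(data, n=4, method="inclusive")
exclusive = statistics.quantiles(data, n=4, method="exclusive")

print("inclusive:", inclusive)
print("exclusive:", exclusive)
```

The medians agree, but Q1 and Q3 differ by half a unit on these eight points.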
Datasets with the same summary statistics can look completely different, and famously so. Anscombe's quartet (1973) shows four datasets with essentially identical mean, variance, correlation, and regression line, but radically different shapes when plotted: linear, curved, outlier-dominated, and one with constant x except for a single point. The Datasaurus Dozen (Matejka and Fitzmaurice 2017) extends this with thirteen datasets (the dinosaur plus twelve others) that share summary statistics yet look nothing like each other on a scatter plot. The lesson: numbers alone never tell the whole story. Plot before reporting summary statistics.
The box runs from Q1 to Q3 (the IQR), with a line at the median (Q2). Whiskers extend to the most extreme data point within 1.5·IQR of the box edges. Points beyond the whiskers are plotted individually as outliers. The box's width carries no information by default. Variants add more: notched boxes show a CI for the median, and violin plots overlay the distribution shape. Box plots compress the distribution to five numbers (min, Q1, median, Q3, max) plus outliers, which is why they're useful for comparing groups but lose information about modes (multimodal data still show as a single box).
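As a small check, the y-values of the first two Anscombe datasets (as commonly reproduced from the 1973 paper; both share the same x-values) can be summarized with the standard library:

```python
import statistics

# y-values of Anscombe's datasets I and II
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, y in (("I", y1), ("II", y2)):
    print(
        name,
        round(statistics.fmean(y), 2),
        round(statistics.variance(y), 2),
        statistics.median(y),
    )
```

The means and sample variances match to two decimals, while the medians do not, a hint that the shapes differ even before plotting.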