Descriptive statistics summarize and describe the key characteristics of a dataset, providing insights into its center (central tendency), spread (variability), and shape (distribution). These fundamental measures are essential for understanding data before applying deeper inferential or predictive techniques in business analytics, scientific research, education, and decision-making.
Measures of Central Tendency
- Mean (Average): The sum of all values divided by the count. Formula: μ = (Σxi) / n. Of the three measures, it is the most sensitive to outliers, so it is best suited to roughly symmetric data without extreme values.
- Median: The middle value when data is sorted in ascending order. Robust to outliers and preferred for skewed distributions (e.g., income, house prices). If n is even, median is the average of the two middle values.
- Mode: The most frequently occurring value(s). A dataset can be unimodal (one mode), bimodal (two modes), or multimodal (multiple modes). Useful for categorical data and identifying common values.
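The three measures above can be compared on a small sample. This is a minimal sketch using Python's standard `statistics` module; the `incomes` list is made-up illustrative data, not taken from any real dataset.

```python
from statistics import mean, median, multimode

# Hypothetical sample: monthly incomes with one extreme value (25000).
incomes = [3200, 3400, 3400, 3600, 3900, 4100, 25000]

avg = mean(incomes)         # pulled upward by the single extreme value
mid = median(incomes)       # middle value of the sorted data: 3600
modes = multimode(incomes)  # list of the most frequent value(s): [3400]

print(f"mean={avg:.2f}, median={mid}, modes={modes}")
```

Here the single extreme value drags the mean well above the median, which is why the median is usually preferred for skewed data such as incomes.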
Measures of Spread (Variability)
- Variance (σ²): The average of squared deviations from the mean. Measures how spread out the data is. Formula: σ² = Σ(xi - μ)² / n. Units are squared (e.g., dollars²), making interpretation less intuitive.
- Standard Deviation (σ): The square root of variance, expressing variability in the same units as the data. A smaller σ indicates data clustered near the mean; larger σ indicates more spread. Approximately 68% of data falls within ±1σ, 95% within ±2σ, and 99.7% within ±3σ (for normal distributions).
- Range: Maximum value minus minimum value. Simple but highly sensitive to outliers.
- Interquartile Range (IQR): Q3 - Q1, representing the middle 50% of data. Robust to outliers and used in box plots.
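As a rough illustration of the spread measures above, the sketch below uses Python's standard `statistics` module on a made-up `data` list. It assumes the population formulas (dividing by n), matching the variance formula given here; sample statistics divide by n - 1 instead. Quartile conventions also differ between tools, so the `method="inclusive"` choice is an assumption and other software may report slightly different Q1 and Q3 values.

```python
from statistics import pvariance, pstdev, quantiles

# Hypothetical sample chosen so the arithmetic is easy to follow (mean = 5).
data = [2, 4, 4, 4, 5, 5, 7, 9]

var = pvariance(data)               # population variance: divides by n
sd = pstdev(data)                   # square root of the population variance
data_range = max(data) - min(data)  # maximum minus minimum

# Quartiles via linear interpolation; "inclusive" treats the data as the
# whole population. Other quartile conventions give slightly different cuts.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(f"variance={var}, sd={sd}, range={data_range}, IQR={iqr}")
```

For this sample the squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the standard deviation is 2.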
Measures of Shape
- Skewness: Measures the asymmetry of the distribution. Positive skew (right-skewed) has a longer right tail (typically mean > median); negative skew (left-skewed) has a longer left tail (typically mean < median). Zero skew indicates a symmetric distribution (e.g., normal distribution).
- Kurtosis: Measures the heaviness of the tails relative to a normal distribution. High kurtosis (> 3) indicates heavy tails with more outliers; low kurtosis (< 3) indicates light tails. Normal distribution has kurtosis = 3 (or excess kurtosis = 0).
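Both shape measures can be computed from central moments. The sketch below assumes the simple population (moment-based) definitions, skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where mk is the average of (xi - μ)^k. The helper name `moment_shape` and both sample lists are hypothetical, and statistical packages often apply bias corrections that give slightly different numbers.

```python
from statistics import fmean

def moment_shape(values):
    """Return (skewness, kurtosis) using population central moments."""
    n = len(values)
    mu = fmean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n  # variance
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mu) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2   # a normal distribution gives 3 (excess = 0)
    return skewness, kurtosis

symmetric = [1, 2, 3, 4, 5]         # hypothetical symmetric sample
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical sample with a long right tail

sym_skew, sym_kurt = moment_shape(symmetric)
rs_skew, rs_kurt = moment_shape(right_skewed)
print(f"symmetric: skew={sym_skew:.3f}, kurtosis={sym_kurt:.3f}")
print(f"right-skewed: skew={rs_skew:.3f}, kurtosis={rs_kurt:.3f}")
```

The symmetric sample has zero skewness and a kurtosis below 3 (light tails), while the sample with one large value has positive skewness.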
Outlier Detection
Outliers are data points that significantly differ from the majority of observations. They can indicate measurement errors, data entry mistakes, or genuine extreme values that deserve special attention. Common detection methods include:
- Z-score method: Values beyond ±3 standard deviations from the mean.
- IQR method: Values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.
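The two detection rules above might be sketched as follows, using only Python's standard library. The function names `zscore_outliers` and `iqr_outliers`, the `readings` data, and the choice of population standard deviation and "inclusive" quartiles are all assumptions made for illustration; other conventions can shift which borderline points are flagged.

```python
from statistics import fmean, pstdev, quantiles

def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []           # all values identical: nothing can be an outlier
    return [x for x in values if abs(x - mu) / sigma > threshold]

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Hypothetical sensor readings with one suspicious value (95).
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]

print("z-score outliers:", zscore_outliers(readings))
print("IQR outliers:", iqr_outliers(readings))
```

Because an extreme value inflates both the mean and the standard deviation, the z-score rule can miss outliers in small samples; the IQR rule relies on quartiles and is less affected.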