Transform Random Variables and Track Mean/SD
Convert between raw values, z-scores, and scaled variables. Explore how linear transformations (Y = aX + b) and min-max scaling change the mean, standard deviation, and distribution of a random variable. This is an educational tool, not for official score reporting or clinical use.
The most common reason to land on this page is z-scoring: converting raw values to z = (x − μ) / σ so they sit on a common scale. That's the simplest case of a linear transformation Y = aX + b. Two facts follow from it and matter downstream: E[aX + b] = a · E[X] + b, so the additive constant shifts the mean, and Var(aX + b) = a² · Var(X), so the additive constant drops out of the variance entirely.
The variance fact catches people. Subtracting μ from every value doesn't change the variance, only the mean. Dividing by σ scales variance by 1/σ², which is why z-scores end up with mean 0 and variance 1 by construction. The page handles z-scoring forward and backward (raw ↔ z given μ and σ), min-max scaling to [0, 1] for ML feature prep, and robust scaling using the median and IQR for data with outliers. Min-max is sensitive to extremes (one outlier squeezes the rest of your data into a tiny slice of the [0, 1] range), so robust scaling is the safer default when the distribution shape is unknown.
Linear Transformations: aX + b Made Simple
A linear transformation Y = aX + b has two parts: a scales (stretches or compresses) the distribution, while b shifts it left or right. Temperature conversion is the classic example: Fahrenheit = 1.8 × Celsius + 32. The coefficient 1.8 expands the scale; the constant 32 shifts the zero point.
Linear transforms preserve shape. If X is normally distributed, so is Y. If X is skewed right, Y is skewed right too. Relative positions stay intact—if Alice scored higher than Bob before the transformation, she still scores higher afterward (assuming a > 0).
Correlation survives linear transformation unchanged (or flips sign if a < 0). This matters in regression and machine learning: scaling features doesn't destroy their predictive relationships.
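A quick numerical check of the correlation claim, as a sketch: the data below are made up for illustration, and `pearson` is a small helper written for this example rather than a library function.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Illustrative data (hypothetical): daily temperature and sales.
celsius = [12.0, 15.5, 19.0, 23.5, 28.0]
sales = [110.0, 135.0, 150.0, 190.0, 240.0]

fahrenheit = [1.8 * c + 32 for c in celsius]  # a = 1.8 > 0
negated = [-c for c in celsius]               # a = -1 < 0

r_original = pearson(celsius, sales)
r_rescaled = pearson(fahrenheit, sales)  # same as r_original
r_flipped = pearson(negated, sales)      # same magnitude, opposite sign
```

Rescaling with a positive coefficient leaves r unchanged up to floating-point rounding; a negative coefficient only flips its sign.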
Key formulas:
• E[aX + b] = a·E[X] + b
• Var[aX + b] = a²·Var[X]
• SD[aX + b] = |a|·SD[X]
Raw Score to Z-Score and Back
The z-score formula z = (x − μ) / σ centers data at zero and scales it to unit variance. Every z-score tells you how many standard deviations the original value sits from the mean. A z = −0.5 means half a standard deviation below average.
To reverse the transformation, use x = μ + zσ. If a standardized test reports z = 1.2 and you know the population mean is 500 with SD 100, the raw score is 500 + 1.2 × 100 = 620.
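The forward and reverse conversions can be sketched as two small functions. The names `to_z` and `from_z` are chosen for this example; the numbers are the ones from the standardized-test case above.

```python
def to_z(x, mu, sigma):
    """Raw value -> z-score."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


def from_z(z, mu, sigma):
    """z-score -> raw value (inverse of to_z for the same mu and sigma)."""
    return mu + z * sigma


raw = from_z(1.2, mu=500, sigma=100)    # about 620
z_back = to_z(raw, mu=500, sigma=100)   # about 1.2 again
```

The round trip only works if the same μ and σ are used in both directions.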
Z-scores make comparison easy. If a student has z = 2.0 in math and z = 1.5 in English, they performed relatively better in math, even if the raw scores were on completely different scales.
Normal distribution benchmarks: About 68% of values fall within z = ±1, 95% within z = ±2, and 99.7% within z = ±3.
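These benchmarks can be reproduced with the standard library's `statistics.NormalDist`. The `coverage` helper is a name introduced here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage(k):
    """Probability that a standard normal value lies within ±k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within ±{k} SD: {coverage(k):.4f}")
# within ±1 SD: 0.6827
# within ±2 SD: 0.9545
# within ±3 SD: 0.9973
```

The "95%" figure is a rounding of 95.45%; exactly 95% coverage corresponds to z ≈ ±1.96.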
How Mean and SD Change Under Scaling
Adding a constant shifts the mean but leaves the standard deviation untouched. If every student gets 5 bonus points, the class average rises by 5, but the spread stays the same.
Multiplying by a constant scales both mean and standard deviation. Doubling all values doubles the mean and doubles the SD. The coefficient a acts as a stretching factor.
Variance scales by a², not a. If you multiply X by 3, variance becomes 9 times larger, and SD becomes 3 times larger. This distinction trips up students who forget to square.
Example: X has μ = 50, σ = 10
Y = 2X + 5 → μ_Y = 2(50) + 5 = 105
σ_Y = |2|(10) = 20 (not 25: the +5 shifts the mean but adds nothing to the spread)
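The same example in code, as a minimal sketch. `transform_params` is a hypothetical helper, and the five-point dataset was picked only because it has exactly μ = 50 and population SD = 10.

```python
import statistics


def transform_params(mu, sigma, a, b):
    """Mean and SD of Y = a*X + b, given the mean and SD of X."""
    return a * mu + b, abs(a) * sigma


mu_y, sigma_y = transform_params(50, 10, a=2, b=5)  # (105, 20)

# Cross-check on data with mean 50 and population SD 10.
x = [35, 45, 50, 55, 65]
y = [2 * v + 5 for v in x]

data_mean_y = statistics.fmean(y)  # 105.0
data_sd_y = statistics.pstdev(y)   # 20.0
```

The population SD (`pstdev`) is used so the data check matches the formulas; with the sample SD, the same scaling by |a| holds.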
Unit Conversions That Preserve Statistics
Converting units is just a linear transformation. Meters to centimeters: multiply by 100 (a = 100, b = 0). Celsius to Fahrenheit: F = 1.8C + 32. These preserve shape and correlation while changing numeric scale.
Z-scores are unit-free. A z = 1.5 in meters or centimeters is still z = 1.5—the transformation to standard units wipes out the original scale. This is why z-scores are useful for cross-variable comparison.
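A small sketch of the unit-free property, using made-up heights. The `z_scores` helper standardizes with the data's own mean and population SD.

```python
import statistics


def z_scores(values):
    """Standardize values using their own mean and population SD."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical heights, first in meters, then in centimeters.
heights_m = [1.52, 1.60, 1.68, 1.75, 1.83]
heights_cm = [100 * h for h in heights_m]

z_m = z_scores(heights_m)
z_cm = z_scores(heights_cm)  # matches z_m up to rounding error
```

The factor of 100 multiplies both the deviations and the SD, so it cancels in the ratio.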
Min-max scaling maps data to a target range like [0, 1]. The formula y = (x − x_min) / (x_max − x_min) puts the minimum at 0 and maximum at 1. Unlike z-scores, this bounds the output, which suits neural networks that expect inputs in [0, 1].
Caution: Min-max scaling is sensitive to outliers. One extreme value stretches the range and compresses everything else.
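To make the outlier effect concrete, here is a minimal sketch with invented numbers. `min_max_scale` is a helper written for this example.

```python
def min_max_scale(values):
    """Map values to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range is zero")
    return [(v - lo) / (hi - lo) for v in values]


clean = [10, 12, 14, 16, 18, 20]
with_outlier = clean + [1000]

scaled_clean = min_max_scale(clean)           # spread evenly over [0, 1]
scaled_outlier = min_max_scale(with_outlier)  # first six values all below 0.02
```

With the single value of 1000 added, the six original points occupy roughly the bottom 1% of the output range.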
Limits: Nonlinear Transforms Need More
Linear rules only apply to Y = aX + b. If you take the log, square root, or any curved function, the simple formulas break down. E[log X] ≠ log E[X] in general—Jensen's inequality governs these cases.
Nonlinear transforms change distribution shape. A right-skewed income distribution becomes more symmetric after a log transformation. That's the point—you're reshaping, not just rescaling.
For nonlinear transforms, use simulation or the delta method (Taylor expansion) to approximate the new mean and variance. These techniques go beyond this tool's scope but are standard in advanced statistics.
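As a rough illustration of both approaches, the sketch below takes Y = X² with X assumed normal (μ = 5, σ = 2). The distribution, seed, and sample size are arbitrary choices for the example.

```python
import random
import statistics

rng = random.Random(0)
mu, sigma = 5.0, 2.0

# Simulation: draw X, apply the nonlinear transform, summarize Y.
x = [rng.gauss(mu, sigma) for _ in range(100_000)]
y = [v ** 2 for v in x]

simulated_mean = statistics.fmean(y)
simulated_var = statistics.fmean([(v - simulated_mean) ** 2 for v in y])

# Naive plug-in: pretend E[g(X)] = g(E[X]).
naive_mean = mu ** 2  # 25, too low

# Delta method with g(x) = x**2, g'(x) = 2x, g''(x) = 2.
delta_mean = mu ** 2 + 0.5 * 2 * sigma ** 2  # second-order: 29
delta_var = (2 * mu) ** 2 * sigma ** 2       # first-order: 400
```

For this particular case the exact values are E[X²] = μ² + σ² = 29 and Var(X²) = 4μ²σ² + 2σ⁴ = 432, so the first-order delta variance of 400 is a visible underestimate while the simulation lands close to the exact figures.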
When linear formulas fail: Y = X², Y = log(X), Y = 1/X, Y = e^X. Each requires distribution-specific or numerical methods.
Transforming variables: working notes
Why does adding a constant not change variance?
Variance measures spread around the mean. Adding a constant shifts every value—including the mean—by the same amount. The distances from each point to the mean stay unchanged, so variance stays unchanged.
Can z-scores be negative?
Yes. A negative z-score means the value is below the mean. In a symmetric distribution, about half of all values have negative z-scores. There's nothing unusual or wrong about z = −1.3.
What if I multiply by a negative constant?
The distribution flips direction. If Y = −X, high values become low and vice versa. Standard deviation uses |a|, so SD stays positive. Correlation with other variables changes sign.
How do I rescale to a new mean and SD?
First standardize: z = (x − μ_old) / σ_old. Then scale to the new parameters: y = μ_new + z × σ_new. This two-step process works for any target mean and standard deviation.
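The two-step recipe can be sketched as one function. The name `rescale` and the score scales below are hypothetical.

```python
def rescale(x, mu_old, sigma_old, mu_new, sigma_new):
    """Standardize x on the old scale, then map it to the new mean and SD."""
    z = (x - mu_old) / sigma_old
    return mu_new + z * sigma_new


# Hypothetical: a score of 82 on a scale with mean 70 and SD 8,
# re-expressed on a scale with mean 500 and SD 100.
y = rescale(82, mu_old=70, sigma_old=8, mu_new=500, sigma_new=100)  # 650.0

# Equivalent single linear transform Y = a*X + b.
a = 100 / 8        # sigma_new / sigma_old
b = 500 - a * 70   # mu_new - a * mu_old
y_linear = a * 82 + b  # 650.0
```

Because both steps are linear, their composition is again of the form aX + b, with a = σ_new / σ_old and b = μ_new − a · μ_old.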
Does standardization make data normal?
No. Z-scores shift and scale but don't change shape. If the original data is skewed, the z-scores are skewed too. To induce normality, you need a nonlinear transform like Box-Cox or rank-based normalization.
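One way to see this numerically is to compare skewness before and after standardizing. The sketch below uses invented right-skewed data and a simple moment-based `skewness` helper.

```python
import statistics


def skewness(values):
    """Moment-based skewness: the mean of cubed z-scores."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((v - mu) / sigma) ** 3 for v in values])


right_skewed = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]

mu = statistics.fmean(right_skewed)
sigma = statistics.pstdev(right_skewed)
z = [(v - mu) / sigma for v in right_skewed]

skew_raw = skewness(right_skewed)
skew_z = skewness(z)  # same as skew_raw: the shape did not change
```

The z-scores have mean 0 and SD 1, but their skewness is identical to that of the raw data.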
Limitations of linear transformations
Linear only: the closed-form rules apply to Y = aX + b. Log(X), √X, X², and any nonlinear function need a different approach (delta method for first-order, simulation for the full distribution).
Variance behavior surprises people: subtracting a constant doesn't change the variance, only the mean. Multiplying by a constant scales the variance by a². The intuition that says "the constant changes everything" is wrong for variance.
Min-max scaling is outlier-sensitive: one extreme value squeezes the rest of your data into a tiny slice of [0, 1] and your downstream model never sees the legitimate variation. Robust scaling on median and IQR is the safer default when distribution shape is unknown.
Sample-based μ̂ and σ̂ carry uncertainty: when you z-score with estimates instead of true parameters, the transformed values inherit the estimation noise, especially at small n.
Note: For ML feature scaling, sklearn.preprocessing.StandardScaler, MinMaxScaler, and RobustScaler are the standard implementations. Their fit/transform split makes train/test separation straightforward: fit on the training set, then transform both training and test sets with the fitted scaler. Doing it manually invites data leakage into the test set, which is harder to catch than you'd expect.
Sources & References
Methods follow standard probability and statistics references:
• Khan Academy: Z-Scores Review
• Penn State STAT 414: Linear Transformations of Random Variables
• Scikit-learn: Preprocessing Data
Scaling and standardization: working questions
What does z-scoring my data actually do?
It centers the data at zero and scales by the standard deviation: z = (x − μ) / σ. The transformed variable has mean 0 and SD 1 by construction. Shape is preserved (symmetric stays symmetric, skewed stays skewed); only location and scale change. Useful for: comparing values across different distributions on a common scale, removing scale dependence in ML models that use distance metrics (k-NN, k-means, SVM with RBF kernel), and reading deviations in standard-deviation units (z = 2.5 means "2.5 SDs above the mean" regardless of the original units).
Min-max scaling vs z-score standardization, which?
Min-max maps your data to [0, 1] via (x − min) / (max − min). Z-score gives mean 0 and SD 1. Use min-max when you need bounded output (neural network inputs, image pixel values), but be aware: it's extremely sensitive to outliers. One extreme value squeezes everything else into a tiny slice of [0, 1]. Z-score is less outlier-sensitive than min-max, because it doesn't depend on the two most extreme values alone, and is what statistics defaults to; the mean and SD are still pulled by extreme values, though. For ML feature scaling, robust scaling (median and IQR) is the safe default when you don't know whether outliers exist.
Why does standardization matter in ML?
Distance-based and gradient-based algorithms care about scale. k-nearest-neighbors with one feature ranging 0-1 and another ranging 0-1000 effectively only uses the second feature in distance calculations. Gradient descent for neural networks converges much slower with un-scaled inputs because gradient steps are mismatched to feature scales. Tree-based models (random forest, gradient boosting) don't care about scale. Linear regression coefficients depend on scale (β changes if you change units), so for interpretation, standardize first if comparing coefficient magnitudes across predictors.
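The k-NN point can be sketched with a handful of invented two-feature points: one feature on a 0-1 scale, the other on a 0-1000 scale. The helpers `standardize_columns` and `nearest` are written for this example only.

```python
import math
import statistics

# Hypothetical points: (feature on a 0-1 scale, feature on a 0-1000 scale).
points = [(0.10, 200.0), (0.90, 210.0), (0.12, 400.0), (0.50, 600.0), (0.80, 800.0)]


def standardize_columns(rows):
    """Z-score each column using its own mean and population SD."""
    params = [(statistics.fmean(c), statistics.pstdev(c)) for c in zip(*rows)]
    return [tuple((v - m) / s for v, (m, s) in zip(row, params)) for row in rows]


def nearest(rows, i):
    """Index of the row closest to rows[i] by Euclidean distance."""
    others = (j for j in range(len(rows)) if j != i)
    return min(others, key=lambda j: math.dist(rows[i], rows[j]))


raw_nn = nearest(points, 0)                          # driven by the 0-1000 feature
scaled_nn = nearest(standardize_columns(points), 0)  # both features now count
```

On the raw data the nearest neighbor of the first point is the one with a similar large-scale value, even though the two differ widely on the small-scale feature; after standardizing, the neighbor changes.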
What's robust scaling and when do I use it?
Subtract the median, divide by the IQR. Resistant to outliers because the median and IQR barely move when a few extreme values are added, unlike the mean and SD. sklearn.preprocessing.RobustScaler is the standard implementation. Use it when you can't visually inspect for outliers and want a safe default, or when downstream models are sensitive to outliers but you want to keep them in the dataset. The cost: the transformed data don't have nice statistical properties (mean isn't exactly 0, SD isn't exactly 1), but for most ML pipelines that doesn't matter.
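A standard-library sketch of the idea, not a replacement for RobustScaler. Quartile conventions differ between tools; this example assumes the "inclusive" method of `statistics.quantiles`, and the data are invented.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; cannot scale")
    return [(v - median) / iqr for v in values]


data = [10, 12, 14, 16, 18, 20, 1000]
scaled = robust_scale(data)
# The six ordinary values land between -1 and about 0.67;
# the outlier stays far away instead of compressing them.
```

Unlike min-max scaling on the same data, the ordinary values keep a usable spread, and the outlier remains visible as a large scaled value.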
How do I invert a z-score back to the original scale?
x = z · σ + μ. The reverse of z = (x − μ) / σ. If you computed z = 1.5 with μ = 100 and σ = 15, then x = 1.5 · 15 + 100 = 122.5. The inversion needs the same μ and σ used in the forward transform. For ML pipelines, this means the scaler object holds the parameters: sklearn's StandardScaler.fit() learns μ and σ from training data; .inverse_transform() applies the inverse. Saving these parameters (or the fitted scaler) is essential when you deploy a model to production data.
Why does variance scale by a² but mean by a?
From the definitions. Mean is linear: E[aX + b] = a·E[X] + b. Variance is quadratic: Var(aX + b) = E[(aX + b − E[aX + b])²] = E[a²(X − E[X])²] = a²·Var(X). The constant b drops out because it shifts both X and E[X] by the same amount, leaving the deviation unchanged. The factor a gets squared because the deviation is squared. Standard deviation, being the square root, scales by |a|, with absolute value because SD must be non-negative. The asymmetry surprises people but follows directly from the algebra.
Linear vs nonlinear transforms, what's the practical difference?
Linear (Y = aX + b): preserves shape, mean and SD transform predictably, the inverse is also linear. Z-score, Fahrenheit-to-Celsius, percentage points to proportions are all linear. Nonlinear (log, square root, square, exp): changes shape, breaks the closed-form rules for E and Var, and may not be invertible without further constraints (squaring loses the sign unless X is restricted to non-negative values). Log transforms are common because they pull in long right tails (income, response times, counts). Box-Cox and Yeo-Johnson generalize the log family with a tunable parameter. For ML feature engineering, nonlinear transforms are routine; the price is losing the simple closed-form rules for the mean and variance.
Standardization on training data, how do I avoid leakage?
Fit the scaler on training data only. Apply the same fitted scaler to validation and test sets. If you fit on the full dataset (train + test), you've leaked test-set information into the training process, which inflates apparent performance and can mask real-world failures. sklearn handles this correctly via the fit/transform pattern: scaler.fit(X_train), then scaler.transform(X_test). Pipeline objects (sklearn.pipeline.Pipeline) chain the scaler with the model so cross-validation respects the train/test boundary. Manual standardization with np.mean(X) and np.std(X) on the full data is the most common cause of leakage I see in production code reviews.
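A minimal sketch of the pattern described above, on synthetic data. The dataset, the label rule, and the choice of LogisticRegression are assumptions made for illustration; the scikit-learn calls themselves are the standard ones.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=[50.0, 500.0], scale=[10.0, 100.0], size=(200, 2))
y = (X[:, 0] / 10.0 + X[:, 1] / 100.0 > 10.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training split only, then reuse the fitted scaler everywhere.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Inside a pipeline, the scaler is refit on each training fold during CV.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
```

The scaled test set will not have mean exactly 0 or SD exactly 1, and that is expected: it was transformed with the training set's parameters, which is what keeps test information out of the fit.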