Fit Regression Models and Inspect Diagnostics
Linear, Multiple & Polynomial Regression Analysis
OLS regression fits the line ŷ = β̂₀ + β̂₁x that minimizes Σ(yᵢ − ŷᵢ)², the sum of squared residuals. Closed-form: β̂₁ = Cov(x, y) / Var(x), β̂₀ = ȳ − β̂₁x̄. The page extends this to multiple predictors via the normal equations β̂ = (XᵀX)⁻¹Xᵀy and to polynomial fits by passing x² and x³ as additional columns.
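The closed-form slope and the normal equations describe the same fit, which a short NumPy sketch can confirm. The data values below are made up for illustration, and the variable names are not taken from the page.

```python
import numpy as np

# Illustrative data (made up for this sketch).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Simple regression: slope = Cov(x, y) / Var(x), intercept = ybar - slope * xbar.
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# Same fit via the normal equations, with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X'X) beta = X'y

print(intercept, slope)
print(beta)
```

Solving the linear system avoids forming the inverse explicitly. For ill-conditioned designs, np.linalg.lstsq is usually the safer choice.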
R² is the share of y's variance explained by the model: 1 − SSR/SST, bounded above by 1. Adding any predictor, even pure noise, can never lower it. Use adjusted R² when comparing models with different predictor counts, or hold-out validation when prediction matters more than fit. Two interpretation points are worth keeping straight. First, β̂₁ in simple regression is the slope; in multiple regression it is the slope after partialling out the other predictors, which is a different quantity. Second, R² isn't a model-correctness score: on U-shaped data an OLS line gets a low R² not because the line was computed incorrectly but because a straight line is the wrong model class.
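To see that R² cannot fall when a predictor is added, the sketch below fits the same simulated outcome with and without a pure-noise column. The data are simulated under an assumed linear model, and r_squared is a helper written for this example.

```python
import numpy as np

def r_squared(X, y):
    """R^2 = 1 - SSR/SST for an OLS fit; X must already include the intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ssr = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ssr / sst

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
noise = rng.normal(size=n)  # unrelated to y by construction

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, noise])

r2_small = r_squared(X_small, y)
r2_big = r_squared(X_big, y)
print(r2_small, r2_big)  # the second value is never smaller than the first
```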
Linear regression assumes a straight-line relationship: y = b₀ + b₁x. It works when scatter plots show points clustered around an obvious line without curvature. Many business relationships—sales vs. staff count, shipping cost vs. weight—follow roughly linear patterns within typical operating ranges.
Polynomial regression adds powers of X: y = b₀ + b₁x + b₂x². A quadratic captures one bend (U-shapes, diminishing returns). A cubic allows two bends with a single inflection point between them, which covers S-shaped curves. Before jumping to polynomials, plot your data. If the scatter looks linear, adding x² terms just invites overfitting without improving real-world accuracy.
Multiple linear regression extends the model to several predictors: y = b₀ + b₁x₁ + b₂x₂ + ... Each coefficient isolates that variable's effect while statistically controlling for the others. House prices might depend on square footage, bedrooms, and neighborhood—multiple regression disentangles those influences.
Practical guidance: Start simple. If a linear fit shows patterned residuals (systematic over- or under-prediction), consider a polynomial. If you have several candidate predictors, test multiple regression. Never choose a model on R² alone; validate on held-out data or use adjusted R².
The intercept (b₀) is the predicted Y when all X variables equal zero. Sometimes that baseline makes sense—a shipping cost when package weight is zero might represent handling fees. Other times X = 0 falls outside realistic ranges, so the intercept is just a mathematical anchor, not a meaningful value.
Slope coefficients carry the analytical payload. If b₁ = 45 in a model predicting revenue (dollars) from marketing spend (thousands of dollars), each $1,000 increase in spend is associated with $45 more revenue. The sign indicates direction: positive means X and Y move together, negative means they move in opposite directions.
In polynomial models, coefficient interpretation gets trickier. The x² term coefficient doesn't translate to a constant per-unit effect—it depends on where you sit on the curve. Report effects at specific X values (e.g., "at X = 10, a one-unit increase predicts a 3.2-unit rise in Y") rather than quoting raw polynomial coefficients.
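For a quadratic model, the effect at a specific X follows from the derivative b₁ + 2b₂x. The sketch below evaluates it at a few points. The coefficients are hypothetical and both function names are invented for this example.

```python
def quadratic_marginal_effect(b1, b2, x):
    """Instantaneous slope dy/dx = b1 + 2*b2*x for y = b0 + b1*x + b2*x**2."""
    return b1 + 2.0 * b2 * x

def quadratic_unit_change(b1, b2, x):
    """Exact change in predicted y when x moves from x to x + 1."""
    return b1 + b2 * (2.0 * x + 1.0)

# Hypothetical coefficients chosen only for illustration.
b1, b2 = 5.0, -0.1
for x0 in (0, 10, 25):
    print(x0, quadratic_marginal_effect(b1, b2, x0), quadratic_unit_change(b1, b2, x0))
```

With these coefficients the slope is positive at X = 0 and X = 10, and it reaches zero at X = 25, where the curve peaks. The same raw coefficients therefore imply different per-unit effects along the curve.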
Watch units: If X is measured in thousands and Y in single units, the coefficient reflects that scaling. Misreading units can lead to conclusions off by orders of magnitude.
R² (coefficient of determination) shows the fraction of Y's variance explained by the model. R² = 0.85 means the predictors account for 85% of variability; 15% remains unexplained. High R² feels reassuring but can mislead if achieved through overfitting or if the model ignores important predictors.
Adjusted R² penalizes added predictors that don't genuinely improve fit. Use it when comparing models with different numbers of variables. If adjusted R² drops when you add a term, that term probably isn't helping.
Residuals—actual minus predicted—reveal model shortcomings. Plot residuals against predicted values. Random scatter around zero indicates adequate fit. A curved pattern signals you need a polynomial or transformation. Residuals fanning out (growing larger at higher predictions) suggest heteroscedasticity, which can distort standard errors.
Large residuals flag potential outliers. Investigate before deleting—an outlier might be a data-entry error or a legitimately unusual case. Removing genuine data points without justification undermines credibility.
A confidence band wraps around the regression line and quantifies uncertainty in the line's position. It answers: "Where might the true mean of Y lie for this X?" Narrower bands mean the line's location is well-estimated.
A prediction band is wider because it accounts for two sources of variation: uncertainty in the line plus natural scatter of individual observations around the line. It answers: "Where might a single new observation fall?"
For forecasting individual outcomes—estimating next month's sales for one client—you need prediction intervals. For statements about average behavior—mean revenue across all clients at a given spend level—confidence intervals apply. Mixing them up leads to either overconfidence or unnecessary hedging.
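The difference between the two bands shows up directly in the standard-error formulas for simple regression. This is a minimal sketch: the data are made up, and the t critical value is passed in by the caller (for example from a t table) rather than computed.

```python
import numpy as np

def interval_half_widths(x, y, x0, t_crit):
    """Half-widths of the confidence (mean) and prediction (new point) intervals at x0.

    t_crit is the t critical value for n - 2 degrees of freedom, supplied by the
    caller; it is not computed here.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    sxx = ((x - x.mean()) ** 2).sum()
    slope = ((x - x.mean()) * (y - y.mean())).sum() / sxx
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    s = np.sqrt((resid ** 2).sum() / (n - 2))        # residual standard error
    leverage = 1.0 / n + (x0 - x.mean()) ** 2 / sxx
    ci_half = t_crit * s * np.sqrt(leverage)          # uncertainty in the line only
    pi_half = t_crit * s * np.sqrt(1.0 + leverage)    # line uncertainty plus individual scatter
    return ci_half, pi_half

x = [1, 2, 3, 4, 5, 6]
y = [2.3, 4.1, 5.8, 8.4, 9.9, 12.2]
ci, pi = interval_half_widths(x, y, x0=3.5, t_crit=2.776)  # 2.776: 95% two-sided, 4 d.f.
print(ci, pi)
```

The extra 1 under the square root is the individual scatter term, which is why the prediction band is always the wider of the two. Both bands widen as x0 moves away from the mean of x.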
Regression describes patterns observed within your data's range. Predicting outside that range—extrapolation—assumes the relationship continues unchanged. Real-world systems rarely cooperate. Growth curves level off, diminishing returns kick in, or structural changes occur.
Polynomial models are especially dangerous for extrapolation. A quadratic fit may look smooth inside the data window but can curve away sharply, toward large positive or negative values, just past the boundary. Never trust polynomial forecasts far beyond observed X values.
Even linear models deserve caution. A cost-per-unit relationship derived from volumes of 500–2,000 units may not hold at 10,000 units, where bulk discounts or capacity constraints alter the slope.
Rule of thumb: Flag any prediction where X lies beyond the training data's minimum or maximum. Treat such forecasts as speculative guidance, not reliable estimates.
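The rule of thumb above is easy to automate. The helper below is a hypothetical example for a single predictor, using illustrative volume numbers.

```python
def flag_extrapolation(train_x, new_x):
    """Return (value, is_extrapolated) pairs; True means outside the training range."""
    lo, hi = min(train_x), max(train_x)
    return [(v, v < lo or v > hi) for v in new_x]

train_volumes = [500, 800, 1200, 1600, 2000]   # illustrative training range
flags = flag_extrapolation(train_volumes, [400, 1500, 10000])
print(flags)
```

With several predictors, a per-variable range check like this is only a first pass. A point can sit inside every individual range and still be far from any observed combination of values.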
What does a negative coefficient mean?
It signals an inverse relationship: as X increases, Y decreases (and vice versa). For example, a negative coefficient on "days since last purchase" when predicting likelihood of repeat purchase makes intuitive sense—longer gaps associate with lower probability.
Can I compare coefficients across different predictors?
Only if predictors share the same scale. Otherwise, standardize variables first (subtract mean, divide by standard deviation). Standardized coefficients show relative importance on a common scale.
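As a sketch of the standardization step, the example below z-scores two predictors on very different raw scales before fitting. The house-price data are simulated with assumed coefficients, and standardized_coefficients is a name invented for this example.

```python
import numpy as np

def standardized_coefficients(X, y):
    """OLS slopes after z-scoring every predictor and the outcome (intercept is zero)."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return beta

rng = np.random.default_rng(1)
n = 200
sqft = rng.normal(1800, 400, size=n)      # large raw scale
bedrooms = rng.normal(3, 1, size=n)       # small raw scale
price = 100 * sqft + 5000 * bedrooms + rng.normal(0, 20000, size=n)

X = np.column_stack([sqft, bedrooms])
std_beta = standardized_coefficients(X, price)
print(std_beta)
```

In this simulation the raw coefficient on bedrooms (about 5000) dwarfs the one on square footage (about 100), yet the standardized coefficients rank them the other way. A one-standard-deviation change in square footage moves price far more.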
Why doesn't high R² guarantee accurate predictions?
R² measures fit to past data, not future performance. An over-fitted model memorizes training noise rather than capturing the true relationship. Validation on fresh data—or cross-validation—is the real test.
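A hold-out split makes the gap between fit and prediction visible. In this sketch the data are simulated from a truly linear relationship, and the 25/15 split and the degree-6 comparison model are arbitrary choices for illustration.

```python
import numpy as np

def mse(coefs, x, y):
    """Mean squared error of a polynomial (np.polyfit coefficient order) on (x, y)."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=40)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, size=40)   # truly linear plus noise

train, test = slice(0, 25), slice(25, 40)
results = {}
for degree in (1, 6):
    coefs = np.polyfit(x[train], y[train], degree)
    results[degree] = (mse(coefs, x[train], y[train]), mse(coefs, x[test], y[test]))
    print(degree, results[degree])
```

The degree-6 fit always matches the training points at least as well as the straight line, because the linear model is nested inside it. Only the held-out error shows whether that extra flexibility captured structure or noise.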
How do I handle categorical variables?
Convert them to dummy (indicator) variables. A "region" field with three values becomes two dummy columns; the third category serves as the reference. Each dummy's coefficient shows the difference from the reference group.
What if residuals show a clear pattern?
Patterns suggest the model misses systematic structure. Try adding polynomial terms, transforming variables (log, square root), or including omitted predictors. A good model leaves residuals looking like random noise.
Linearity: the binding assumption. If a residual-vs-fitted plot shows a systematic curve, the OLS line is wrong and you need a transformation, polynomial term, or non-linear model. Don't use R² as a substitute for residual diagnostics.
Standard error assumptions: independent, normally distributed, homoscedastic errors, checked through the residuals. Heteroscedasticity (residuals fanning out) produces wrong p-values and wrong CIs even though β̂ itself remains unbiased. The remedy is robust standard errors (HC sandwich estimator).
Multicollinearity: correlated predictors inflate coefficient variance without breaking the predictions. VIF above ~5 is a warning sign. Above 10, combine the correlated predictors or drop one.
Extrapolation: predicting past the observed range of x is asking the model to forecast in a regime it was never fit on. Don't.
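The VIF mentioned under multicollinearity can be computed by regressing each predictor on the others. The sketch below does that with NumPy on simulated columns. The function name vif and the data are assumptions made for this example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R_j^2), where R_j^2 comes from
    regressing column j on the remaining columns (plus an intercept)."""
    X = np.asarray(X, float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(3)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.2, size=300)   # nearly a copy of a
c = rng.normal(size=300)                  # unrelated to a and b
v = vif(np.column_stack([a, b, c]))
print(v)
```

Here the two near-duplicate columns should land well above the threshold of 10, while the unrelated column stays close to 1.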
Note: R's lm() and statsmodels.api.OLS produce the same coefficients and the same residual diagnostics, provided the statsmodels design matrix includes an intercept column (for example via sm.add_constant), which lm() adds by default. ISLR (James, Witten, Hastie, Tibshirani) is the standard introductory reference. ESL (Hastie, Tibshirani, Friedman) covers the same material at a deeper level for anyone moving toward regularization or kernel methods.
Formulas and interpretation guidelines are grounded in standard statistical references.
What counts as a good R²?
Field-dependent. In physics and engineering, R² above 0.95 is normal because the underlying signal-to-noise ratio is high. In social science and economics, R² of 0.2 can be a strong result; with messy human behavior, even 20% explained variance is meaningful. R² isn't a quality score for the model so much as a description of how much of y's variance the predictors capture. A low R² with statistically significant coefficients on a well-motivated model is fine; a high R² from kitchen-sinking dozens of predictors is likely overfitting.
When should I use adjusted R² instead of R²?
Whenever you're comparing models with different numbers of predictors. R² never goes down when you add a predictor (even pure noise), so it can't tell you whether the extra predictor earned its place. Adjusted R² penalizes for model complexity: 1 − (1 − R²)·(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. It can go down if a predictor doesn't pay for itself in fit. For describing a single fitted model, report R². For model comparison, report adjusted R² or use AIC/BIC, which are stricter still.
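A worked instance of the adjusted R² formula helps show the penalty in action. The R² values and sample size below are hypothetical, chosen so that a sixth predictor adds almost nothing.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1); p excludes the intercept."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical numbers: a sixth predictor lifts R^2 from 0.800 to 0.802 with n = 30.
adj_five = adjusted_r2(0.800, n=30, p=5)
adj_six = adjusted_r2(0.802, n=30, p=6)
print(adj_five, adj_six)
```

R² rises from 0.800 to 0.802, but adjusted R² falls from roughly 0.758 to roughly 0.750, so the extra predictor did not pay for itself.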
How do I detect and fix heteroscedasticity?
Plot residuals against fitted values. Random scatter around zero is good. A funnel (wider scatter at high or low fitted values) is heteroscedasticity. Breusch-Pagan and White's test give formal p-values, but the residual plot is the diagnostic that matters; you'll usually catch it visually before any test does. Fix: switch to robust (heteroscedasticity-consistent) standard errors. R: lmtest::coeftest(model, vcov = sandwich::vcovHC). Python: call get_robustcov_results(cov_type='HC3') on the fitted statsmodels results object. The point estimates don't change; only the inference does.
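To make the sandwich estimator less of a black box, here is a hand-rolled NumPy sketch of HC3 standard errors next to the classical ones. It is an illustration under simulated funnel-shaped noise, not a replacement for the library routines named above, and ols_with_hc3 is a name invented for this example.

```python
import numpy as np

def ols_with_hc3(X, y):
    """OLS coefficients with classical and HC3 standard errors (X includes the intercept)."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Classical: sigma^2 (X'X)^-1, which assumes constant error variance.
    se_classical = np.sqrt(np.diag(xtx_inv) * (resid @ resid) / (n - k))
    # HC3: sandwich with squared residuals inflated by 1 / (1 - h_i)^2.
    h = np.einsum("ij,jk,ik->i", X, xtx_inv, X)      # leverages (hat-matrix diagonal)
    w = (resid / (1.0 - h)) ** 2
    cov_hc3 = xtx_inv @ (X.T * w) @ X @ xtx_inv
    return beta, se_classical, np.sqrt(np.diag(cov_hc3))

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=500)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x)   # noise grows with x (a funnel)
X = np.column_stack([np.ones_like(x), x])
beta, se_ols, se_hc3 = ols_with_hc3(X, y)
print(beta, se_ols, se_hc3)
```

The coefficient vector is identical under both approaches. Only the standard errors, and therefore the p-values and confidence intervals, differ.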
What is the difference between simple, multiple, and polynomial regression?
Simple: one predictor, one outcome, straight line ŷ = β̂₀ + β̂₁x. Multiple: several predictors, ŷ = β̂₀ + β̂₁x₁ + ⋯ + β̂ₖxₖ. Each β̂ⱼ is the slope after partialling out the other predictors, which is a different quantity from its slope in a simple regression with only that predictor. Polynomial: still one predictor but with higher-order terms, ŷ = β̂₀ + β̂₁x + β̂₂x² + ⋯. Polynomial is just multiple regression with derived columns x², x³, etc., and the same fitting procedure applies. Above degree 3, you usually want splines instead.
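The point that polynomial regression is multiple regression on derived columns can be checked in a few lines. The data below follow an exact quadratic chosen for illustration, so the fit should recover the coefficients.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x - 0.5 * x**2          # exact quadratic, made up for the sketch

# Derived columns 1, x, x^2 turn the polynomial fit into ordinary multiple regression.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                              # intercept, linear, quadratic coefficients

# np.polyfit solves the same least-squares problem (coefficients come highest power first).
poly = np.polyfit(x, y, 2)[::-1]
print(poly)
```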
When is linear regression the wrong tool?
Curved relationships (use polynomial or splines), bounded outcomes like proportions or counts (use logistic or Poisson), heavy heteroscedasticity (use weighted least squares or robust standard errors), strong autocorrelation in time-series residuals (use ARIMA or generalized least squares), and any case where the residuals systematically deviate from zero in a residual-vs-fitted plot. R² alone won't flag any of these; you need the diagnostic plots. The working rule: plot residuals against fitted values and against each predictor before trusting any coefficient.
How do I include categorical predictors?
Dummy-coding (one-hot encoding minus one column). For a 3-level factor, create two dummy variables; the third level becomes the reference. The intercept then estimates the mean for the reference level, and each dummy's coefficient is the difference from that reference. R does this automatically with factor() variables: lm(y ~ group). Python's statsmodels handles it via formulas (smf.ols('y ~ C(group)')). Watch for the dummy-variable trap: if you include all k levels alongside the intercept, you get perfect collinearity and the design matrix becomes singular.
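A small manual example shows what the automatic coding produces. The region labels and outcome values are made up, and dummy_code is a hypothetical helper, not a library function.

```python
import numpy as np

def dummy_code(labels, reference):
    """Return (design columns, level names) with one 0/1 column per non-reference level."""
    levels = [lv for lv in sorted(set(labels)) if lv != reference]
    cols = np.array([[1.0 if lab == lv else 0.0 for lv in levels] for lab in labels])
    return cols, levels

region = ["north", "south", "west", "north", "south", "west"]
y = np.array([10.0, 14.0, 9.0, 12.0, 16.0, 11.0])

D, levels = dummy_code(region, reference="north")
X = np.column_stack([np.ones(len(y)), D])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(levels, beta)   # intercept = mean of "north"; others = differences from it

# Dummy-variable trap: adding a column for the reference level too makes the
# design rank-deficient (4 columns, rank 3).
north = np.array([1.0 if r == "north" else 0.0 for r in region])
X_trap = np.column_stack([X, north])
print(np.linalg.matrix_rank(X_trap))
```

The group means here are 11 (north), 15 (south), and 10 (west), so the fitted coefficients are 11, +4, and −1.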
Why did a coefficient change when I added another predictor?
Because the new predictor is correlated with the others. Multiple regression coefficients are partial slopes: each β̂ⱼ measures the association of xⱼ with y holding the others constant. If you add a predictor correlated with x₁, the slope on x₁ shifts because the partial conditioning changes. This is normal and expected. The simple-regression slope and the multiple-regression slope on the same predictor agree only when the predictors are uncorrelated in the sample (or the added predictor's coefficient is exactly zero). When the sign flips, the change is closely related to Simpson's paradox.
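The shift can be reproduced with a tiny noise-free example. The numbers are made up so that the arithmetic is exact, and slopes is a helper written for this sketch.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])   # positively correlated with x1
y = 1.0 * x1 + 2.0 * x2                          # noise-free, so fits are exact

def slopes(columns, y):
    """OLS slopes (intercept dropped) for the given list of predictor columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

simple = slopes([x1], y)[0]        # x1 alone absorbs part of x2's effect
partial = slopes([x1, x2], y)      # x1's slope with x2 held fixed
print(simple, partial)
```

With x1 alone the slope is about 2.66, because x1 also carries the part of x2's effect that moves with it. Once x2 enters the model, the slope on x1 drops back to 1.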
Should I drop predictors that aren't statistically significant?
Not automatically. "Statistical significance" of a predictor depends heavily on collinearity with other predictors and on the sample size; dropping by p-value alone often produces models that fit your sample well but don't generalize. Better criteria: theoretical justification, AIC/BIC for model comparison, cross-validated prediction error, or LASSO/elastic net, which shrink coefficients and select variables jointly. If you're explicitly looking for the smallest predictive model, use a regularized method, not stepwise p-value pruning.
Enter your data points to perform linear, multiple, or polynomial regression analysis. Get coefficients, R², and visualize the relationship between your variables.
Supported regression types: