Regression analysis is a statistical method used to model and analyze the relationships between a dependent variable (also called the response or outcome variable) and one or more independent variables (predictors or explanatory variables). The goal is to find the mathematical equation that best describes how changes in the independent variables are associated with changes in the dependent variable. Regression is fundamental for prediction, for quantifying the strength and direction of relationships in data, and, when the study design supports it, for investigating cause-and-effect relationships.
Regression analysis answers questions like: "How much does revenue increase for each additional dollar spent on advertising?" or "Can we predict house prices based on square footage, number of bedrooms, and location?" or "Is there a relationship between study hours and exam scores?" The output is a fitted regression equation that can be used to make predictions for new data points, together with statistics that describe how well the model fits the observed data.
Simple Linear Regression
Simple linear regression models the relationship between two variables: one independent variable (X) and one dependent variable (Y). The relationship is assumed to be linear, meaning it can be represented by a straight line. The regression equation has the form:
y = b₀ + b₁x + ε
Where y is the observed value of the dependent variable, b₀ is the y-intercept (the predicted value of y when x = 0), b₁ is the slope (the change in y for each one-unit increase in x), x is the value of the independent variable, and ε (epsilon) represents the error term, whose estimate in a fitted model is the residual (the difference between the observed value and the predicted value b₀ + b₁x). The regression algorithm finds the values of b₀ and b₁ that minimize the sum of squared residuals, a method called ordinary least squares (OLS).
For example, if you're modeling the relationship between hours studied (X) and exam score (Y), simple linear regression might produce the equation y = 50 + 5x. This means a student who studies 0 hours is predicted to score 50, and each additional hour of study is associated with a 5-point increase in the exam score.
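To make the arithmetic concrete, here is a minimal Python sketch (using NumPy) that computes b₀ and b₁ from the closed-form OLS formulas. The hours-studied and exam-score values are invented purely for illustration.

    import numpy as np

    # Hypothetical data: hours studied (x) and exam scores (y)
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    y = np.array([55, 61, 64, 71, 74, 79, 86, 91], dtype=float)

    # OLS closed form: b1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
    x_mean, y_mean = x.mean(), y.mean()
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b0 = y_mean - b1 * x_mean

    print(f"Fitted equation: y = {b0:.2f} + {b1:.2f}x")
    print(f"Predicted score after 4.5 hours of study: {b0 + b1 * 4.5:.1f}")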
Multiple Linear Regression
Multiple linear regression extends simple linear regression to include two or more independent variables. This allows you to model more complex relationships and control for confounding factors. The equation has the form:
y = b₀ + b₁x₁ + b₂x₂ + b₃x₃ + ... + bₖxₖ + ε
Each coefficient (b₁, b₂, etc.) represents the change in y associated with a one-unit change in that specific independent variable, holding all other variables constant. For example, when predicting house prices, you might use square footage, number of bedrooms, and age of the house as predictors. The coefficient for square footage tells you how much the price changes for each additional square foot, controlling for bedrooms and age.
Multiple regression is more realistic than simple regression because real-world outcomes are usually influenced by multiple factors simultaneously. It also helps control for confounding—if two predictors are correlated, multiple regression can separate their individual effects on the outcome.
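As an illustrative sketch (the house data below is invented), multiple regression can be fit in Python with NumPy's least-squares solver; a column of ones is added so the intercept b₀ is estimated alongside the other coefficients.

    import numpy as np

    # Hypothetical house data: square footage, bedrooms, age in years
    X = np.array([
        [1400, 3, 20],
        [1600, 3, 15],
        [1700, 4, 30],
        [1875, 4, 10],
        [2100, 4,  5],
        [2350, 5,  8],
    ], dtype=float)
    prices = np.array([245000, 275000, 268000, 320000, 355000, 399000], dtype=float)

    # Prepend a column of ones so the first fitted coefficient is the intercept b0
    X_design = np.column_stack([np.ones(len(X)), X])

    # Ordinary least squares: minimize ||X_design @ b - prices||^2
    b, *_ = np.linalg.lstsq(X_design, prices, rcond=None)
    b0, b_sqft, b_beds, b_age = b

    print(f"price = {b0:.0f} + {b_sqft:.1f}*sqft + {b_beds:.0f}*bedrooms + {b_age:.0f}*age")

Each printed coefficient is interpreted holding the other predictors constant, as described above.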
Polynomial Regression
Polynomial regression is used when the relationship between X and Y is nonlinear—that is, it curves rather than forming a straight line. Polynomial regression includes powers (squares, cubes, etc.) of the independent variable(s). A quadratic (degree 2) polynomial has the form:
y = b₀ + b₁x + b₂x² + ε
A cubic (degree 3) polynomial adds an x³ term, and so on. Polynomial regression can capture U-shaped or S-shaped curves, growth and decay patterns, and diminishing returns. For example, the relationship between fertilizer amount and crop yield might show increasing returns at low levels but diminishing or even negative returns at high levels—a quadratic model could capture this.
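A quadratic fit like this can be sketched with NumPy's polyfit; the fertilizer and yield numbers below are made up, and the reported yield-maximizing dose is only meaningful for that invented data.

    import numpy as np

    # Hypothetical data: fertilizer applied (kg per plot) vs. crop yield (tonnes)
    fertilizer = np.array([0, 1, 2, 3, 4, 5, 6, 7], dtype=float)
    crop_yield = np.array([2.0, 3.4, 4.5, 5.1, 5.4, 5.3, 4.9, 4.1])

    # Fit yield = b0 + b1*x + b2*x^2 (np.polyfit returns the highest power first)
    b2, b1, b0 = np.polyfit(fertilizer, crop_yield, deg=2)
    print(f"yield = {b0:.2f} + {b1:.2f}x + {b2:.2f}x^2")

    # A negative b2 gives the inverted-U shape: gains at low doses,
    # diminishing and eventually negative returns at high doses
    print(f"Yield-maximizing dose: {-b1 / (2 * b2):.1f} kg per plot")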
However, be cautious with high-degree polynomials (degree 4+). They can overfit the data, fitting noise rather than the true underlying pattern. Overfitting produces excellent fit on the training data but poor predictions on new data. Always validate polynomial models on hold-out data or use cross-validation to check generalization performance.
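One way to see the overfitting risk is to simulate data with a known quadratic relationship and then compare training error with hold-out error for a low-degree and a high-degree fit; this sketch assumes nothing beyond NumPy and simulated data.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data: the true relationship is quadratic, plus random noise
    x = np.linspace(0, 10, 40)
    y = 2 + 1.5 * x - 0.12 * x**2 + rng.normal(scale=0.8, size=x.size)

    # Random split: 30 training points, 10 hold-out points
    idx = rng.permutation(x.size)
    train, test = idx[:30], idx[30:]

    for degree in (2, 8):
        coefs = np.polyfit(x[train], y[train], deg=degree)
        train_mse = np.mean((np.polyval(coefs, x[train]) - y[train]) ** 2)
        test_mse = np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.2f}, hold-out MSE {test_mse:.2f}")

The higher-degree fit typically achieves a lower training error but a larger gap to the hold-out error, which is the signature of overfitting.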
Key Outputs of Regression Analysis
Regression Equation: The fitted equation (y = b₀ + b₁x + ...) that describes the relationship and can be used to make predictions. Plug in values of X to get predicted values of Y.
Coefficients (b₀, b₁, b₂, ...): The slope and intercept values that define the best-fit line or curve. These quantify the strength and direction of relationships. Positive coefficients mean X and Y move together; negative coefficients mean they move in opposite directions.
R² (Coefficient of Determination): A measure of how well the model fits the data, ranging from 0 to 1. R² represents the proportion of variance in Y that is explained by the model. For example, R² = 0.75 means 75% of the variability in Y is accounted for by the predictors, and 25% remains unexplained (due to error or omitted variables). Higher R² indicates better fit, but context matters—some fields naturally have lower R² due to inherent variability.
Adjusted R²: A modified version of R² that adjusts for the number of predictors in the model. Unlike R², which always increases when you add more predictors (even if they're useless), adjusted R² penalizes unnecessary complexity. Use adjusted R² when comparing models with different numbers of predictors.
Residuals: The differences between observed Y values and predicted Y values (residual = observed - predicted). Residuals should be small and randomly scattered around zero if the model fits well. Patterns in residuals (e.g., systematic over- or under-prediction) indicate model misspecification—perhaps you need a polynomial term, a different transformation, or additional predictors.
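Tying these outputs together, the sketch below computes predictions, residuals, R², and adjusted R² for the same invented hours-studied data used earlier; the formulas are the standard ones (R² = 1 − SSres/SStot, with the adjustment using n observations and k predictors).

    import numpy as np

    # Hypothetical data and a fitted simple linear regression (see the earlier sketch)
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    y = np.array([55, 61, 64, 71, 74, 79, 86, 91], dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    y_pred = b0 + b1 * x
    residuals = y - y_pred                       # observed - predicted

    # R^2 = 1 - SS_residual / SS_total
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot

    # Adjusted R^2 penalizes extra predictors: n observations, k predictors
    n, k = len(y), 1
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
    print("Residuals:", np.round(residuals, 2))  # should scatter randomly around zero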