Regression and Correlation
Learning Objectives
- Understand the conceptual difference between correlation and regression.
- Calculate and interpret the Pearson Correlation Coefficient.
- Formulate a simple linear regression model using the Method of Least Squares.
- Evaluate regression models using $R^2$, standard error, and residual analysis.
- Validate the fundamental assumptions (LINE) of linear regression.
- Extend regression concepts to multiple predictors and hypothesis testing.
Engineers frequently need to predict the value of one variable based on the value of another. Understanding the relationship between variables allows for better modeling and prediction of material properties, environmental factors, and system behaviors.
Introduction to Regression and Correlation
For example, an engineer might predict the compressive strength of concrete ($y$) from its curing time ($x$), or estimate a river's peak flow rate ($y$) from rainfall intensity ($x$).
While correlation measures the strength and direction of a linear relationship, regression provides the mathematical equation to make actual predictions.
Correlation
Quantifying the strength of the linear relationship is the first step in data modeling.
Scatter Plots
A graphical tool used to plot two quantitative variables on a Cartesian coordinate system.
Analyzing Scatter Plots
Scatter plots provide the first visual indication of whether a linear relationship, a non-linear relationship, or no relationship exists between $x$ and $y$. They are essential before proceeding with correlation or regression calculations.
Pearson Correlation Coefficient ($r$)
A unitless measure that ranges from -1 to +1, describing how closely data points fall along a straight line.
Pearson Correlation Coefficient
Calculates the strength and direction of the linear relationship between two variables.
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
Variables
| Symbol | Description | Unit |
|---|---|---|
| $r$ | Pearson correlation coefficient | - |
| $x_i$ | Individual sample values of variable X | - |
| $y_i$ | Individual sample values of variable Y | - |
| $\bar{x}$ | Sample mean of variable X | - |
| $\bar{y}$ | Sample mean of variable Y | - |
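As a minimal sketch of the calculation, the coefficient can be computed directly from deviations about the two sample means. The function name `pearson_r` and the five data points are made up for illustration; they are not measurements from any real test.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up illustrative data, not real measurements.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 4))  # 0.7746
```

For this sample the cross-product sum is 6 and the two squared-deviation sums are 10 and 6, so $r = 6/\sqrt{60} \approx 0.77$, a moderately strong positive linear relationship.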
Interpreting Correlation ($r$)
- $r \approx +1$: Strong positive linear relationship (as $x$ increases, $y$ predictably increases).
- $r \approx -1$: Strong negative linear relationship (as $x$ increases, $y$ predictably decreases).
- $r \approx 0$: Weak or no linear relationship (the data appears as a random scatter). Note: A value of $r = 0$ does not mean there is no relationship at all; there could be a strong non-linear (e.g., quadratic) relationship.
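The caution about non-linear relationships can be illustrated with a small sketch. The data below are constructed (not measured) so that $y = x^2$ exactly; the relationship is perfectly deterministic, yet the correlation coefficient comes out as zero because the pattern is symmetric rather than linear.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Constructed example: y depends on x exactly, but not linearly.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

r_quadratic = pearson_r(x, y)
print(r_quadratic)  # 0.0
```

A scatter plot of these points would show the parabola immediately, which is why plotting the data comes before computing $r$.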
Correlation does not imply causation
Just because two variables are highly correlated (e.g., asphalt sales and ice cream sales) does not mean one causes the other. Both might be driven by a lurking third variable (summer weather).
Simple Linear Regression
Modeling the relationship with a straight line.
The Regression Model
We hypothesize that the true relationship is a straight line, plus some random, unobservable error ($\varepsilon$).
True Regression Model
The theoretical linear model relating the independent and dependent variables.
$$y = \beta_0 + \beta_1 x + \varepsilon$$
Variables
| Symbol | Description | Unit |
|---|---|---|
| $y$ | The dependent (response) variable we want to predict | - |
| $x$ | The independent (predictor or explanatory) variable | - |
| $\beta_0$ | The true y-intercept (value of y when x=0) | - |
| $\beta_1$ | The true slope (change in y for a one-unit change in x) | - |
| $\varepsilon$ | The random error term | - |
The Method of Least Squares
Finding the "line of best fit." Because we only have a sample, we estimate the true parameters ($\beta_0$, $\beta_1$) with sample statistics ($b_0$, $b_1$).
Estimated Regression Equation
The line equation constructed from sample data to estimate the true model. The "hat" on $\hat{y}$ indicates it is an estimated or predicted value, not an actual observed value.
Estimated Regression Line
The straight line equation used for making predictions.
$$\hat{y} = b_0 + b_1 x$$
Variables
| Symbol | Description | Unit |
|---|---|---|
| $\hat{y}$ | Predicted or estimated value of the dependent variable | - |
| $b_0$ | Estimated y-intercept | - |
| $b_1$ | Estimated slope | - |
| $x$ | Value of the independent variable | - |
Least Squares Principle
The principle stating that the line of best fit is the one that minimizes the Sum of Squared Errors (SSE). The error (or residual) is the vertical distance between an observed data point ($y_i$) and the predicted value on the line ($\hat{y}_i$).
Sum of Squared Errors (SSE)
The objective function minimized in the method of least squares.
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2$$
Variables
| Symbol | Description | Unit |
|---|---|---|
| $SSE$ | Sum of squared errors | - |
| $y_i$ | Observed data point | - |
| $\hat{y}_i$ | Predicted value on the regression line | - |
| $e_i$ | Residual or error for the i-th observation | - |
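To make the objective concrete, here is a minimal sketch that evaluates SSE for two candidate lines on the same data. The helper `sse` and the data are hypothetical; the second candidate happens to be the least-squares line for this particular sample, so its SSE is the smaller of the two.

```python
def sse(b0, b1, xs, ys):
    """Sum of squared vertical distances between the data and the line b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data, not real measurements.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sse_guess = sse(1.0, 1.0, x, y)  # an arbitrary candidate line: y_hat = 1 + x
sse_best = sse(2.2, 0.6, x, y)   # the least-squares line for this data

print(sse_guess, round(sse_best, 4))
```

The arbitrary line gives an SSE of 4.0, while the least-squares line gives about 2.4. No other straight line through these five points can do better than 2.4.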
Calculating the Slope and Intercept
Using calculus to minimize SSE yields the formulas for the slope ($b_1$) and intercept ($b_0$).
Least Squares Slope
Formula to calculate the estimated slope.
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Variables
| Symbol | Description | Unit |
|---|---|---|
| $b_1$ | Estimated slope of the regression line | - |
| $x_i$ | Individual sample values of variable X | - |
| $y_i$ | Individual sample values of variable Y | - |
| $\bar{x}$ | Sample mean of variable X | - |
| $\bar{y}$ | Sample mean of variable Y | - |
Least Squares Intercept
Formula to calculate the estimated y-intercept.
$$b_0 = \bar{y} - b_1 \bar{x}$$
Variables
| Symbol | Description | Unit |
|---|---|---|
| $b_0$ | Estimated y-intercept of the regression line | - |
| $\bar{y}$ | Sample mean of variable Y | - |
| $b_1$ | Estimated slope | - |
| $\bar{x}$ | Sample mean of variable X | - |
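The slope and intercept formulas translate almost line for line into code. The sketch below uses a hypothetical helper `fit_line` and made-up data; it is one straightforward way to carry out the calculation, not the only one.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx          # slope
    b0 = y_bar - b1 * x_bar   # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Made-up illustrative data, not real measurements.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
y_hat = b0 + b1 * 3.5  # prediction at x = 3.5, inside the observed range
print(round(b0, 4), round(b1, 4), round(y_hat, 4))
```

Here the means are 3 and 4, the cross-product sum is 6, and the squared-deviation sum for $x$ is 10, giving $b_1 = 0.6$ and $b_0 = 4 - 0.6 \cdot 3 = 2.2$. The prediction at $x = 3.5$ is therefore about 4.3.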
Assessing the Model and Residual Analysis
How good is our prediction? Before relying on a regression equation for engineering decisions, we must verify that the model is appropriate.
Coefficient of Determination ($R^2$)
The proportion of the total variation in the dependent variable ($y$) that is explained by the regression model (the independent variable $x$).
Coefficient of Determination
Calculates the proportion of explained variance.
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
Variables
| Symbol | Description | Unit |
|---|---|---|
| $R^2$ | Coefficient of determination | - |
| $SSR$ | Regression sum of squares | - |
| $SST$ | Total sum of squares | - |
| $SSE$ | Sum of squared errors | - |
Interpreting R-Squared
- $0 \le R^2 \le 1$.
- An $R^2 = 0.85$ means that 85% of the variation in concrete strength can be explained by the variation in curing time.
- For simple linear regression (one predictor), $R^2 = r^2$.
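As a quick numerical check of these points, the sketch below computes $R^2$ from the sums of squares and compares it with the square of the correlation coefficient. All names and data are illustrative assumptions.

```python
import math

# Made-up illustrative data, not real measurements.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Least-squares line.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Sums of squares: total, unexplained (error), and explained (regression).
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ssr = sst - sse

r_squared = 1 - sse / sst
r = s_xy / math.sqrt(s_xx * sst)

print(round(r_squared, 4), round(r ** 2, 4))
```

For this sample $SST = 6$, $SSE = 2.4$, and $SSR = 3.6$, so $R^2 = 0.6$: the line explains 60% of the variation in $y$. The squared correlation coefficient gives the same value, as expected for a single predictor.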
Standard Error of the Estimate ($s_e$)
A measure of the typical distance that observed data points fall from the regression line. It estimates the standard deviation of the error term ($\varepsilon$).
Standard Error of the Estimate
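Estimates the standard deviation of the residuals. The divisor is $n - 2$ because two parameters (slope and intercept) are estimated from the data.
$$s_e = \sqrt{\frac{SSE}{n - 2}}$$
Variables
| Symbol | Description | Unit |
|---|---|---|
| $s_e$ | Standard error of the estimate | - |
| $SSE$ | Sum of squared errors | - |
| $n$ | Number of sample observations | - |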
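A minimal sketch of the standard error calculation follows, using made-up data and a hypothetical helper name. The result carries the same units as $y$, which makes it easier to interpret than SSE.

```python
import math

def standard_error_of_estimate(xs, ys):
    """s_e = sqrt(SSE / (n - 2)) for the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))

# Made-up illustrative data, not real measurements.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_e = standard_error_of_estimate(x, y)
print(round(s_e, 4))  # 0.8944
```

With $SSE = 2.4$ and $n = 5$, this gives $s_e = \sqrt{2.4/3} \approx 0.89$, so observed values typically fall a little under one unit of $y$ from the fitted line.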
Residual Analysis (Validating Assumptions)
A critical step. We plot the residuals ($e_i = y_i - \hat{y}_i$) against the predicted values ($\hat{y}_i$) or the predictor ($x_i$). For the linear model to be valid, the residual plot should show a random, structureless horizontal band around zero.
Common Residual Plot Patterns
- Non-linearity: If the residuals show a curved pattern (like a U-shape), a straight line is not appropriate; a curved model such as polynomial regression is needed.
- Heteroscedasticity: If the spread of the residuals increases (a fan shape), the variance of the errors is not constant. This violates a key assumption and typically calls for a data transformation (e.g., taking the log of $y$).
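The non-linearity pattern can be reproduced with a small sketch. The data here are constructed so that $y = x^2$ exactly, then a straight line is fitted anyway. The helper `fit_line` is hypothetical, and the residuals are printed rather than plotted to keep the example dependency-free.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1

# Constructed example: a quadratic relationship forced onto a straight line.
x = [0, 1, 2, 3, 4, 5, 6]
y = [xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

for xi, e in zip(x, residuals):
    print(xi, round(e, 2))
```

The fitted line is $\hat{y} = -5 + 6x$, and the residuals run 5, 0, -3, -4, -3, 0, 5 from left to right: positive at both ends and negative in the middle. That U-shape is the signature of a missing curved term, even though the residuals still sum to zero.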
Assumptions of Linear Regression (LINE)
- Linearity: The relationship between X and Y is fundamentally linear.
- Independence: The observations are independent of each other.
- Normality: The residuals ($e_i$) are normally distributed.
- Equal Variance (Homoscedasticity): The variance of the residuals is constant across all values of X.
Multiple Linear Regression
Using more than one predictor variable. Engineers rarely predict an outcome based on a single variable. Concrete strength depends on water-cement ratio, curing time, temperature, and aggregate type.
Multiple Regression Model
An extension of simple linear regression that predicts the response variable using two or more explanatory variables.
Multiple Regression Equation
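The model incorporating multiple predictor variables.
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$
Variables
| Symbol | Description | Unit |
|---|---|---|
| $\hat{y}$ | Predicted value | - |
| $b_0$ | Estimated y-intercept | - |
| $b_1, b_2, \dots, b_k$ | Estimated coefficients for each predictor | - |
| $x_1, x_2, \dots, x_k$ | Individual predictor variables | - |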
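One way to obtain the coefficients, sketched here under stated assumptions, is to solve the normal equations $(X^T X)\,b = X^T y$. That formulation is the standard least-squares condition but is not spelled out in the text above, and in practice a statistics package would normally do this step. The helper names and the data are hypothetical; the responses were generated from $y = 1 + 2x_1 + 3x_2$ with no noise so the recovered coefficients are easy to check.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations act on both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_multiple(rows, ys):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Constructed data: each row is (x1, x2); y = 1 + 2*x1 + 3*x2 exactly.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors]

coeffs = fit_multiple(predictors, y)
print([round(c, 6) for c in coeffs])
```

Because the data contain no noise, the fit returns coefficients very close to 1, 2, and 3. With real measurements the estimates would differ from any underlying true values, and the residual checks described earlier would still apply.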
Interpreting Multiple Regression
- Each $b_i$ represents the change in the estimated $\hat{y}$ for a one-unit change in $x_i$, holding all other predictor variables constant.
- Adjusted $R^2$: Unlike regular $R^2$ (which never decreases when you add a variable), Adjusted $R^2$ penalizes adding variables that do not significantly improve the model, helping to guard against "overfitting."
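The penalty can be seen with the usual adjusted $R^2$ formula, $1 - (1 - R^2)\frac{n - 1}{n - k - 1}$, where $k$ is the number of predictors. The formula is standard but is not stated in the text above, and the $R^2$ values in this sketch are hypothetical numbers chosen only to show the effect.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical scenario: adding a second predictor nudges R^2 from 0.60 to 0.61.
one_predictor = adjusted_r2(0.60, n=5, k=1)
two_predictors = adjusted_r2(0.61, n=5, k=2)

print(round(one_predictor, 4), round(two_predictors, 4))
```

In this made-up case regular $R^2$ rises slightly, but adjusted $R^2$ falls from about 0.47 to about 0.22, signalling that the extra variable costs more in degrees of freedom than it contributes in explained variation.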
Hypothesis Testing in Regression (ANOVA Approach)
Testing if the model is statistically significant.
Hypothesis Test for the Slope ($\beta_1$)
In simple linear regression, testing whether a linear relationship exists is equivalent to testing the null hypothesis $H_0: \beta_1 = 0$ (the true slope is zero). We use a t-test with $n - 2$ degrees of freedom. If we reject $H_0$, there is sufficient evidence that $x$ provides information in predicting $y$.
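A sketch of the slope test is shown below. It assumes the usual test statistic $t = b_1 / (s_e / \sqrt{S_{xx}})$ with $S_{xx} = \sum (x_i - \bar{x})^2$, which is the standard form but is not written out in the text above. The data are made up, and the critical value 3.182 is the two-sided 5% value for 3 degrees of freedom taken from a t-table.

```python
import math

# Made-up illustrative data, not real measurements.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))   # standard error of the estimate
se_b1 = s_e / math.sqrt(s_xx)    # standard error of the slope
t_stat = b1 / se_b1

T_CRIT = 3.182  # t-table value: two-sided, alpha = 0.05, df = n - 2 = 3
reject_h0 = abs(t_stat) > T_CRIT

print(round(t_stat, 4), reject_h0)
```

For this sample the statistic is about 2.12, which is below 3.182, so $H_0$ is not rejected at the 5% level. With only five points, even a fairly steep fitted slope is not enough evidence of a real linear relationship.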
F-Test for Overall Significance
Tests whether the regression model as a whole is better than simply predicting the mean of $y$ ($\bar{y}$).
Interpreting the F-Test
- $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ (The model is useless).
- If the resulting P-value is very small, we reject $H_0$, indicating at least one predictor is significantly related to $y$.
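The F statistic itself can be sketched from the sums of squares, assuming the usual ANOVA form $F = \dfrac{SSR / k}{SSE / (n - k - 1)}$ with $k$ predictors. That formula is standard but is not given explicitly in the text above, and the data are made up for illustration.

```python
# Made-up illustrative data, not real measurements.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
k = 1  # number of predictors in simple linear regression

x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ssr = sst - sse

msr = ssr / k            # mean square due to regression
mse = sse / (n - k - 1)  # mean square error
f_stat = msr / mse

print(round(f_stat, 4))  # 4.5
```

Here $SSR = 3.6$ and $SSE = 2.4$, so $F = 3.6 / 0.8 = 4.5$ on 1 and 3 degrees of freedom. With a single predictor this equals the square of the slope t statistic, so the F-test and the t-test for the slope lead to the same conclusion.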
t-Tests for Individual Coefficients
If the overall model is significant, we run t-tests on each individual slope ($\beta_i$) to determine which specific variables are actually contributing.
Interpreting Individual t-Tests
- $H_0: \beta_i = 0$ (This specific predictor is useless, assuming all others are already in the model).
- Rejecting $H_0$ implies that the predictor $x_i$ provides significant information even after accounting for the other variables.
Summary
- Correlation ($r$): Measures the strength of a linear relationship (-1 to 1). Does not imply causation.
- Least Squares: The mathematical method used to find the line that minimizes the sum of squared errors (SSE).
- $R^2$: The percentage of variation in $y$ explained by the model.
- Residual Analysis: Crucial for validating the assumptions of linearity and constant variance. Residual plots should look like random noise.
- Multiple Regression: Uses multiple predictors. Each coefficient $b_i$ represents the effect of $x_i$ on $\hat{y}$, holding all other variables constant.
- ANOVA F-Test: Determines if the overall regression model is statistically significant.