Regression and Correlation

Learning Objectives

  • Understand the conceptual difference between correlation and regression.
  • Calculate and interpret the Pearson Correlation Coefficient.
  • Formulate a simple linear regression model using the Method of Least Squares.
  • Evaluate regression models using $R^2$, standard error, and residual analysis.
  • Validate the fundamental assumptions (LINE) of linear regression.
  • Extend regression concepts to multiple predictors and hypothesis testing.

Engineers frequently need to predict the value of one variable based on the value of another. Understanding the relationship between variables allows for better modeling and prediction of material properties, environmental factors, and system behaviors.

Introduction to Regression and Correlation

For example, an engineer might predict the compressive strength of concrete ($y$) from its curing time ($x$), or estimate a river's peak flow rate ($y$) from rainfall intensity ($x$).

While correlation measures the strength and direction of a linear relationship, regression provides the mathematical equation to make actual predictions.

Correlation

Quantifying the strength of the linear relationship is the first step in data modeling.

Scatter Plots

A graphical tool used to plot two quantitative variables on a Cartesian coordinate system.

Analyzing Scatter Plots

Scatter plots provide the first visual indication of whether a linear relationship, a non-linear relationship, or no relationship exists between $X$ and $Y$. They are essential before proceeding with correlation or regression calculations.

Pearson Correlation Coefficient ($r$)

A unitless measure that ranges from -1 to +1, describing how closely data points fall along a straight line.

Pearson Correlation Coefficient

Calculates the strength and direction of the linear relationship between two variables.

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

Variables

| Symbol | Description | Unit |
|---|---|---|
| $r$ | Pearson correlation coefficient | - |
| $x_i$ | Individual sample values of variable X | - |
| $y_i$ | Individual sample values of variable Y | - |
| $\bar{x}$ | Sample mean of variable X | - |
| $\bar{y}$ | Sample mean of variable Y | - |

Interpreting Correlation ($r$)

  • $r \approx 1$: Strong positive linear relationship (as $x$ increases, $y$ predictably increases).
  • $r \approx -1$: Strong negative linear relationship (as $x$ increases, $y$ predictably decreases).
  • $r \approx 0$: Weak or no linear relationship (the data appear as a random scatter). Note: A value of $r = 0$ does not mean there is no relationship at all; there could be a strong non-linear (e.g., quadratic) relationship.
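
As a rough illustration of the formula and of the note on non-linear relationships, the following Python sketch computes $r$ directly from the definition. The function name `pearson_r` and both small data sets are invented for this example and do not come from real measurements.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up data with a moderately strong positive linear trend
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r_linear = pearson_r(x, y)

# Made-up data with a perfect quadratic relationship, symmetric about x = 0
x_sym = [-2, -1, 0, 1, 2]
y_quad = [v ** 2 for v in x_sym]
r_quad = pearson_r(x_sym, y_quad)

print(f"r (linear-looking data): {r_linear:.4f}")  # about 0.7746
print(f"r (quadratic data):      {r_quad:.4f}")    # 0.0000
```

The second data set follows $y = x^2$ exactly, yet $r = 0$, because the positive and negative cross-products cancel. This is why a scatter plot should be inspected before relying on $r$.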

Correlation does not imply causation

Just because two variables are highly correlated (e.g., asphalt sales and ice cream sales) does not mean one causes the other. Both might be driven by a lurking third variable (summer weather).

Simple Linear Regression

Modeling the relationship with a straight line.

The Regression Model

We hypothesize that the true relationship is a straight line, plus some random, unobservable error ($\epsilon$).

True Regression Model

The theoretical linear model relating the independent and dependent variables.

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

Variables

| Symbol | Description | Unit |
|---|---|---|
| $y_i$ | The dependent (response) variable we want to predict | - |
| $x_i$ | The independent (predictor or explanatory) variable | - |
| $\beta_0$ | The true y-intercept (value of $y$ when $x = 0$) | - |
| $\beta_1$ | The true slope (change in $y$ for a one-unit change in $x$) | - |
| $\epsilon_i$ | The random error term | - |

The Method of Least Squares

Finding the "line of best fit." Because we only have a sample, we estimate the true parameters ($\beta_0, \beta_1$) with sample statistics ($b_0, b_1$).

Estimated Regression Equation

The line equation constructed from sample data to estimate the true model. The "hat" on $\hat{y}$ indicates it is an estimated or predicted value, not an actual observed value.

Estimated Regression Line

The straight line equation used for making predictions.

$$\hat{y} = b_0 + b_1 x$$

Variables

| Symbol | Description | Unit |
|---|---|---|
| $\hat{y}$ | Predicted or estimated value of the dependent variable | - |
| $b_0$ | Estimated y-intercept | - |
| $b_1$ | Estimated slope | - |
| $x$ | Value of the independent variable | - |

Least Squares Principle

The principle stating that the line of best fit is the one that minimizes the Sum of Squared Errors (SSE). The error (or residual) is the vertical distance between an observed data point ($y_i$) and the predicted value on the line ($\hat{y}_i$).

Sum of Squared Errors (SSE)

The objective function minimized in the method of least squares.

$$SSE = \sum (y_i - \hat{y}_i)^2 = \sum e_i^2$$

Variables

| Symbol | Description | Unit |
|---|---|---|
| $SSE$ | Sum of squared errors | - |
| $y_i$ | Observed data point | - |
| $\hat{y}_i$ | Predicted value on the regression line | - |
| $e_i$ | Residual or error for the i-th observation | - |
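
To make the objective function concrete, the sketch below evaluates SSE for two candidate lines on a small made-up data set. The helper name `sse` and the numbers are assumptions for illustration only. The second candidate happens to be the least-squares line for these particular points.

```python
def sse(xs, ys, b0, b1):
    """Sum of squared errors for the candidate line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sse_guess = sse(x, y, b0=1.0, b1=1.0)  # an arbitrary hand-drawn line
sse_ls = sse(x, y, b0=2.2, b1=0.6)     # the least-squares line for this data

print(f"SSE of arbitrary line:     {sse_guess:.2f}")  # 4.00
print(f"SSE of least-squares line: {sse_ls:.2f}")     # 2.40
```

Any other choice of intercept and slope gives an SSE of at least 2.40 on this data, which is what "line of best fit" means under the least squares principle.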

Calculating the Slope and Intercept

Using calculus to minimize SSE yields the formulas for the slope ($b_1$) and intercept ($b_0$).
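
The calculus step can be sketched as follows. This is a standard textbook outline rather than a derivation given in these notes: write SSE as a function of the two unknowns, set both partial derivatives to zero, and solve the resulting pair of equations.

```latex
\begin{align*}
SSE(b_0, b_1) &= \sum (y_i - b_0 - b_1 x_i)^2 \\
\frac{\partial SSE}{\partial b_0} &= -2 \sum (y_i - b_0 - b_1 x_i) = 0
  \;\Rightarrow\; b_0 = \bar{y} - b_1 \bar{x} \\
\frac{\partial SSE}{\partial b_1} &= -2 \sum x_i (y_i - b_0 - b_1 x_i) = 0
  \;\Rightarrow\; b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\end{align*}
```

The second result follows after substituting the expression for $b_0$ into the second condition and simplifying.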

Least Squares Slope

Formula to calculate the estimated slope.

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

Variables

| Symbol | Description | Unit |
|---|---|---|
| $b_1$ | Estimated slope of the regression line | - |
| $x_i$ | Individual sample values of variable X | - |
| $y_i$ | Individual sample values of variable Y | - |
| $\bar{x}$ | Sample mean of variable X | - |
| $\bar{y}$ | Sample mean of variable Y | - |

Least Squares Intercept

Formula to calculate the estimated y-intercept.

$$b_0 = \bar{y} - b_1\bar{x}$$

Variables

| Symbol | Description | Unit |
|---|---|---|
| $b_0$ | Estimated y-intercept of the regression line | - |
| $\bar{y}$ | Sample mean of variable Y | - |
| $b_1$ | Estimated slope | - |
| $\bar{x}$ | Sample mean of variable X | - |
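
The slope and intercept formulas translate almost line for line into code. The following is a minimal sketch; the function name `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx          # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
y_hat_new = b0 + b1 * 3.5  # prediction inside the observed range of x

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")           # b0 = 2.20, b1 = 0.60
print(f"prediction at x = 3.5: {y_hat_new:.2f}")  # 4.30
```

The prediction is made at a value of $x$ inside the range of the sample. Predicting far outside that range assumes the straight-line pattern continues, which the data cannot confirm.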

Assessing the Model and Residual Analysis

How good is our prediction? Before relying on a regression equation for engineering decisions, we must verify that the model is appropriate.

Coefficient of Determination ($R^2$)

The proportion of the total variation in the dependent variable ($y$) that is explained by the regression model (the independent variable $x$).

Coefficient of Determination

Calculates the proportion of the variation in $y$ that the model explains.

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

Variables

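| Symbol | Description | Unit |
|---|---|---|
| $R^2$ | Coefficient of determination | - |
| $SSR$ | Regression sum of squares, $\sum (\hat{y}_i - \bar{y})^2$ | - |
| $SST$ | Total sum of squares, $\sum (y_i - \bar{y})^2$ | - |
| $SSE$ | Sum of squared errors, $\sum (y_i - \hat{y}_i)^2$ | - |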

Interpreting R-Squared

  • $0 \le R^2 \le 1$.
  • A value of $R^2 = 0.85$ means that 85% of the variation in concrete strength can be explained by the variation in curing time.
  • For simple linear regression (one predictor), $R^2 = r^2$.
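
The points above can be checked numerically. This sketch, on made-up data, computes $R^2$ both ways and compares it with the squared Pearson coefficient; all variable names are chosen for this example only.

```python
import math

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Least-squares fit and fitted values
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r2_from_ssr = ssr / sst
r2_from_sse = 1 - sse / sst
r = s_xy / math.sqrt(s_xx * sst)  # Pearson correlation coefficient

print(f"SSR/SST     = {r2_from_ssr:.3f}")  # 0.600
print(f"1 - SSE/SST = {r2_from_sse:.3f}")  # 0.600
print(f"r squared   = {r ** 2:.3f}")       # 0.600
```

Both forms agree because $SST = SSR + SSE$ for a least-squares line with an intercept.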

Standard Error of the Estimate ($s_e$)

A measure of the typical distance that observed data points fall from the regression line. It estimates the standard deviation of the error term ($\sigma$).

Standard Error of the Estimate

Estimates the standard deviation of the residuals.

$$s_e = \sqrt{\frac{SSE}{n-2}}$$

Variables

| Symbol | Description | Unit |
|---|---|---|
| $s_e$ | Standard error of the estimate | - |
| $SSE$ | Sum of squared errors | - |
| $n$ | Number of sample observations | - |
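
A short sketch of the calculation on made-up data follows. The divisor is $n - 2$ rather than $n$ because two parameters ($b_0$ and $b_1$) were estimated from the same sample.

```python
import math

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residuals e_i = y_i - y_hat_i
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

s_e = math.sqrt(sse / (n - 2))
print(f"SSE = {sse:.2f}, s_e = {s_e:.4f}")  # SSE = 2.40, s_e = 0.8944
```

Here $s_e$ is in the same units as $y$, so it can be read as the typical size of a prediction error for this sample.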

Residual Analysis (Validating Assumptions)

Residual analysis is a critical step. We plot the residuals ($e_i = y_i - \hat{y}_i$) against the predicted values ($\hat{y}_i$) or the predictor ($x_i$). For the linear model to be valid, the residual plot should show a random, structureless horizontal band around zero.

Common Residual Plot Patterns

  • Non-linearity: If the residuals show a curved pattern (like a U-shape), a straight line is not appropriate; a model that captures the curvature, such as a polynomial regression or a transformed variable, should be considered.
  • Heteroscedasticity: If the spread of the residuals increases (a fan shape), the variance of the errors is not constant. This violates a key assumption and typically calls for a remedy such as a data transformation (e.g., taking the log of $y$).
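
The non-linearity pattern can be reproduced without a plotting library. In this sketch a straight line is deliberately fitted to made-up data that follow $y = x^2$ exactly, and the signs of the residuals are printed in order of $x$. The data and variable names are assumptions for illustration.

```python
# Made-up data that follow y = x^2 exactly (a curved relationship)
x = [0, 1, 2, 3, 4, 5, 6]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residuals of the (inappropriate) straight-line fit
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

for xi, e in zip(x, residuals):
    sign = "+" if e > 0 else "-" if e < 0 else "0"
    print(f"x = {xi}: residual = {e:+.1f}  {sign}")
```

The residuals run positive, then negative, then positive again as $x$ increases. That systematic U-shape, rather than a random band around zero, is the signal that a straight line is missing curvature in the data.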

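Assumptions of Linear Regression (LINE)

  • Linearity: The mean of $y$ is a linear function of $x$.
  • Independence: The errors are independent of one another.
  • Normality: The errors are normally distributed.
  • Equal variance: The errors have constant variance (homoscedasticity) across all values of $x$.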

Multiple Linear Regression

Using more than one predictor variable. Engineers rarely predict an outcome based on a single variable. Concrete strength depends on water-cement ratio, curing time, temperature, and aggregate type.

Multiple Regression Model

An extension of simple linear regression that predicts the response variable using two or more explanatory variables.

Multiple Regression Equation

The model incorporating multiple predictor variables.

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

Variables

| Symbol | Description | Unit |
|---|---|---|
| $\hat{y}$ | Predicted value | - |
| $b_0$ | y-intercept | - |
| $b_1, b_2, \dots, b_k$ | Estimated coefficients for each predictor | - |
| $x_1, x_2, \dots, x_k$ | Individual predictor variables | - |

Interpreting Multiple Regression

  • Each $b_i$ represents the change in the estimated $y$ for a one-unit change in $x_i$, holding all other predictor variables constant.
  • Adjusted $R^2$: Unlike regular $R^2$ (which never decreases when you add a variable), Adjusted $R^2$ penalizes adding variables that do not meaningfully improve the model, helping to guard against "overfitting."
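
A minimal sketch of a two-predictor fit follows, using only the Python standard library. It solves the normal equations $(X^T X)\,b = X^T y$ by Gaussian elimination, which is adequate for a tiny example but is not how numerical libraries usually do it. The helper names, the data, and the adjusted $R^2$ formula $1 - (1 - R^2)\frac{n-1}{n-k-1}$ (the common textbook form, not stated in these notes) are assumptions for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (m[i][n] - tail) / m[i][i]
    return coeffs

def fit_multiple(rows, ys):
    """Return [b0, b1, ..., bk] from the normal equations."""
    design = [[1.0, *row] for row in rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Made-up data: each row is (x1, x2)
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [9.5, 7.5, 18.5, 18.5, 29.5, 27.5]

b = fit_multiple(rows, y)
y_hat = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], row)) for row in rows]

n, k = len(y), len(rows[0])
y_bar = sum(y) / n
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
sst = sum((yi - y_bar) ** 2 for yi in y)
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print("coefficients [b0, b1, b2]:", [round(v, 3) for v in b])
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")
```

Because the penalty factor $\frac{n-1}{n-k-1}$ exceeds 1 whenever $k \ge 1$, adjusted $R^2$ is always below $R^2$ unless the fit is perfect.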

Hypothesis Testing in Regression (ANOVA Approach)

Testing if the model is statistically significant.

Hypothesis Test for the Slope ($\beta_1$)

In simple linear regression, testing whether a linear relationship exists is equivalent to testing the null hypothesis $H_0: \beta_1 = 0$ (the true slope is zero). We use a t-test with $n-2$ degrees of freedom. If we reject $H_0$, there is sufficient evidence that $X$ provides information in predicting $Y$.
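
The test statistic is not written out above. A common textbook form, assumed here, is $t = b_1 / SE(b_1)$ with $SE(b_1) = s_e / \sqrt{\sum (x_i - \bar{x})^2}$. The sketch below applies it to made-up data and compares the result with a critical value read from a t-table, to stay within the standard library.

```python
import math

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))   # standard error of the estimate
se_b1 = s_e / math.sqrt(s_xx)    # standard error of the slope
t_stat = b1 / se_b1

# Two-sided critical value for alpha = 0.05 and n - 2 = 3 degrees of
# freedom, taken from a standard t-table
T_CRIT = 3.182
reject_h0 = abs(t_stat) > T_CRIT

print(f"t = {t_stat:.4f}, reject H0: {reject_h0}")  # t = 2.1213, reject H0: False
```

With only five points, a slope of 0.6 is not enough to reject $H_0$ at the 5% level, even though $r \approx 0.77$ looks fairly strong. Small samples need a much stronger pattern to reach significance.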

F-Test for Overall Significance

Tests whether the regression model as a whole is better than simply predicting the mean of $y$ ($\bar{y}$).

Interpreting the F-Test

  • $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ (none of the predictors is linearly related to $y$, so the model is useless).
  • If the resulting P-value is smaller than the chosen significance level (e.g., $\alpha = 0.05$), we reject $H_0$, indicating at least one predictor is significantly related to $y$.
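
The F statistic itself is not shown above. The usual ANOVA form, assumed here, is $F = \frac{SSR / k}{SSE / (n - k - 1)}$ with $k$ predictors. The sketch below computes it for a one-predictor fit on made-up data and checks the known identity that, when $k = 1$, $F$ equals the square of the slope's t statistic.

```python
import math

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n, k = len(x), 1  # k = number of predictors
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation

msr = ssr / k            # mean square due to regression
mse = sse / (n - k - 1)  # mean square error
f_stat = msr / mse

# t statistic for the slope, for comparison
t_stat = b1 / (math.sqrt(mse) / math.sqrt(s_xx))

print(f"F = {f_stat:.3f}, t^2 = {t_stat ** 2:.3f}")  # F = 4.500, t^2 = 4.500
```

In simple linear regression the F-test and the slope t-test therefore always reach the same conclusion. The F-test only adds information when there are several predictors.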

t-Tests for Individual Coefficients

If the overall model is significant, we run t-tests on each individual slope ($b_i$) to determine which specific variables are actually contributing.

Interpreting Individual t-Tests

  • $H_0: \beta_i = 0$ (This specific predictor is useless, assuming all others are already in the model).
  • Rejecting $H_0$ implies that the predictor $x_i$ provides significant information even after accounting for the other variables.

Key Takeaways
  • Correlation ($r$): Measures the strength and direction of a linear relationship (-1 to 1). Does not imply causation.
  • Least Squares: The mathematical method used to find the line that minimizes the sum of squared errors (SSE).
  • $R^2$: The proportion of variation in $y$ explained by the model.
  • Residual Analysis: Crucial for validating the assumptions of linearity and constant variance. Residual plots should look like random noise.
  • Multiple Regression: Uses multiple predictors. Each $b_i$ coefficient represents the effect of $x_i$, holding all other variables constant.
  • ANOVA F-Test: Determines if the overall regression model is statistically significant.