Regression and Correlation

Learning Objectives

Understand the conceptual difference between correlation and regression.
Calculate and interpret the Pearson Correlation Coefficient.
Formulate a simple linear regression model using the Method of Least Squares.
Evaluate regression models using $R^2$ , standard error, and residual analysis.
Validate the fundamental assumptions (LINE) of linear regression.
Extend regression concepts to multiple predictors and hypothesis testing.

Engineers frequently need to predict the value of one variable based on the value of another. Understanding the relationship between variables allows for better modeling and prediction of material properties, environmental factors, and system behaviors.

Introduction to Regression and Correlation

Typical engineering applications include predicting the compressive strength of concrete ( $y$ ) from its curing time ( $x$ ), or estimating a river's peak flow rate ( $y$ ) from rainfall intensity ( $x$ ).

While correlation measures the strength and direction of a linear relationship, regression provides the mathematical equation to make actual predictions.

Correlation

Quantifying the strength of the linear relationship is the first step in data modeling.

Scatter Plots

A graphical tool used to plot two quantitative variables on a Cartesian coordinate system.

Analyzing Scatter Plots

Scatter plots provide the first visual indication of whether a linear relationship, a non-linear relationship, or no relationship exists between $X$ and $Y$ . They are essential before proceeding with correlation or regression calculations.

Pearson Correlation Coefficient ( $r$ )

A unitless measure that ranges from -1 to +1, describing how closely data points fall along a straight line.

Pearson Correlation Coefficient

Calculates the strength and direction of the linear relationship between two variables.

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

Variables

Symbol	Description	Unit
$r$	Pearson correlation coefficient	-
$x_i$	Individual sample values of variable X	-
$y_i$	Individual sample values of variable Y	-
$\bar{x}$	Sample mean of variable X	-
$\bar{y}$	Sample mean of variable Y	-
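
As a minimal sketch, the formula above can be evaluated term by term in plain Python. The function name `pearson_r` and the five data points are illustrative assumptions, not values from this lesson.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed directly from the definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of cross-products of deviations from the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator pieces: sums of squared deviations
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up illustrative data (not measurements)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

Because both numerator and denominator carry the units of $x$ times $y$, the result is unitless, which is why $r$ is comparable across data sets.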

Interpreting Correlation ( $r$ )

$r \approx 1$ : Strong positive linear relationship (as $x$ increases, $y$ predictably increases).
$r \approx -1$ : Strong negative linear relationship (as $x$ increases, $y$ predictably decreases).
$r \approx 0$ : Weak or no linear relationship (the data appears as a random scatter). Note: A value of $r=0$ does not mean there is no relationship at all; there could be a strong non-linear (e.g., quadratic) relationship.
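
The note about $r = 0$ can be checked numerically. The sketch below uses made-up data in which $y = x^2$ exactly, so $y$ is fully determined by $x$, yet the linear correlation vanishes because the data are symmetric about $x = 0$. The helper `pearson_r` is a hypothetical name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from the deviation-sum definition."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

x = [-2, -1, 0, 1, 2]

# Perfect quadratic relationship: positive and negative deviations cancel
y_quadratic = [xi ** 2 for xi in x]
r_quadratic = pearson_r(x, y_quadratic)

# Perfect decreasing straight line for comparison
y_decreasing = [-xi for xi in x]
r_decreasing = pearson_r(x, y_decreasing)

print(f"quadratic:  r = {r_quadratic:.2f}")
print(f"decreasing: r = {r_decreasing:.2f}")
```

This is one reason a scatter plot should be inspected before relying on $r$ alone.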

Correlation does not imply causation

Just because two variables are highly correlated (e.g., asphalt sales and ice cream sales) does not mean one causes the other. Both might be driven by a lurking third variable (summer weather).

Simple Linear Regression

Modeling the relationship with a straight line.

The Regression Model

We hypothesize that the true relationship is a straight line, plus some random, unobservable error ( $\epsilon$ ).

True Regression Model

The theoretical linear model relating the independent and dependent variables.

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

Variables

Symbol	Description	Unit
$y_i$	The dependent (response) variable we want to predict	-
$x_i$	The independent (predictor or explanatory) variable	-
$\beta_0$	The true y-intercept (mean value of y when x=0)	-
$\beta_1$	The true slope (mean change in y for a one-unit change in x)	-
$\epsilon_i$	The random error term	-

The Method of Least Squares

Finding the "line of best fit." Because we only have a sample, we estimate the true parameters ( $\beta_0, \beta_1$ ) with sample statistics ( $b_0, b_1$ ).

Estimated Regression Equation

The line equation constructed from sample data to estimate the true model. The "hat" on $\hat{y}$ indicates it is an estimated or predicted value, not an actual observed value.

Estimated Regression Line

The straight line equation used for making predictions.

$$\hat{y} = b_0 + b_1 x$$

Variables

Symbol	Description	Unit
$\hat{y}$	Predicted or estimated value of the dependent variable	-
$b_0$	Estimated y-intercept	-
$b_1$	Estimated slope	-
$x$	Value of the independent variable	-

Least Squares Principle

The principle stating that the line of best fit is the one that minimizes the Sum of Squared Errors (SSE). The error (or residual) is the vertical distance between an observed data point ( $y_i$ ) and the predicted value on the line ( $\hat{y}_i$ ).

Sum of Squared Errors (SSE)

The objective function minimized in the method of least squares.

$$SSE = \sum (y_i - \hat{y}_i)^2 = \sum e_i^2$$

Variables

Symbol	Description	Unit
$SSE$	Sum of squared errors	-
$y_i$	Observed data point	-
$\hat{y}_i$	Predicted value on the regression line	-
$e_i$	Residual or error for the i-th observation	-

Calculating the Slope and Intercept

Using calculus to minimize SSE yields the formulas for the slope ( $b_1$ ) and intercept ( $b_0$ ).
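
The calculus step can be sketched as follows. This is the standard derivation written in the symbols used in this lesson; the intermediate pair of equations is commonly called the normal equations, a name this lesson does not otherwise use.

```latex
% Write SSE as a function of the two unknowns
SSE(b_0, b_1) = \sum (y_i - b_0 - b_1 x_i)^2

% Set both partial derivatives to zero (the normal equations)
\frac{\partial SSE}{\partial b_0} = -2 \sum (y_i - b_0 - b_1 x_i) = 0
\frac{\partial SSE}{\partial b_1} = -2 \sum x_i (y_i - b_0 - b_1 x_i) = 0

% The first equation gives the intercept
\sum y_i = n b_0 + b_1 \sum x_i
\quad\Longrightarrow\quad
b_0 = \bar{y} - b_1 \bar{x}

% Substitute b_0 into the second equation
\sum x_i \left[ (y_i - \bar{y}) - b_1 (x_i - \bar{x}) \right] = 0
\quad\Longrightarrow\quad
b_1 = \frac{\sum x_i (y_i - \bar{y})}{\sum x_i (x_i - \bar{x})}

% Because \sum (y_i - \bar{y}) = 0 and \sum (x_i - \bar{x}) = 0,
% subtracting \bar{x} inside each sum changes nothing
b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
```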

Least Squares Slope

Formula to calculate the estimated slope.

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

Variables

Symbol	Description	Unit
$b_1$	Estimated slope of the regression line	-
$x_i$	Individual sample values of variable X	-
$y_i$	Individual sample values of variable Y	-
$\bar{x}$	Sample mean of variable X	-
$\bar{y}$	Sample mean of variable Y	-

Least Squares Intercept

Formula to calculate the estimated y-intercept.

$$b_0 = \bar{y} - b_1\bar{x}$$

Variables

Symbol	Description	Unit
$b_0$	Estimated y-intercept of the regression line	-
$\bar{y}$	Sample mean of variable Y	-
$b_1$	Estimated slope	-
$\bar{x}$	Sample mean of variable X	-
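
The slope and intercept formulas can be applied directly, as in the minimal sketch below. The names `fit_line` and `sse` and the data are illustrative assumptions. The comparison line (intercept 2.0, slope 0.7) is an arbitrary alternative, chosen only to show that moving away from the least squares line increases SSE.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    b1 = s_xy / s_xx           # least squares slope
    b0 = y_bar - b1 * x_bar    # least squares intercept
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared vertical distances between the data and a given line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data (not measurements)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
sse_best = sse(x, y, b0, b1)
sse_other = sse(x, y, 2.0, 0.7)   # arbitrary alternative line

print(f"y_hat = {b0:.2f} + {b1:.2f} x")
print(f"SSE (least squares line) = {sse_best:.2f}")
print(f"SSE (alternative line)   = {sse_other:.2f}")
```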

Assessing the Model and Residual Analysis

How good is our prediction? Before relying on a regression equation for engineering decisions, we must verify that the model is appropriate.

Coefficient of Determination ( $R^2$ )

The proportion of the total variation in the dependent variable ( $y$ ) that is explained by the regression model (the independent variable $x$ ).

Coefficient of Determination

Calculates the proportion of explained variation.

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

Variables

Symbol	Description	Unit
$R^2$	Coefficient of determination	-
$SSR$	Regression sum of squares, $\sum (\hat{y}_i - \bar{y})^2$	-
$SST$	Total sum of squares, $\sum (y_i - \bar{y})^2$	-
$SSE$	Sum of squared errors, $\sum (y_i - \hat{y}_i)^2$	-

Interpreting R-Squared

$0 \le R^2 \le 1$ .
An $R^2 = 0.85$ means that 85% of the variation in concrete strength can be explained by the variation in curing time.
For simple linear regression (one predictor), $R^2 = r^2$ .
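
A short numerical check of these statements, as a sketch on made-up data: it computes the three sums of squares for a least squares line, confirms that the regression and error sums add up to the total, and compares $R^2$ with $r^2$. The function name `sums_of_squares` is hypothetical.

```python
def sums_of_squares(xs, ys):
    """Fit the least squares line and return (SST, SSR, SSE, r_squared)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * x for x in xs]
    sst = s_yy                                          # total variation in y
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)        # explained by the line
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # left unexplained
    r_squared = s_xy ** 2 / (s_xx * s_yy)               # square of Pearson r
    return sst, ssr, sse, r_squared

# Made-up illustrative data (not measurements)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sst, ssr, sse, r_squared = sums_of_squares(x, y)
R2 = 1 - sse / sst

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"R^2 = {R2:.3f}, r^2 = {r_squared:.3f}")
```

For this data set about 60% of the variation in $y$ is explained by the line, and the two routes to $R^2$ agree.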

Standard Error of the Estimate ( $s_e$ )

A measure of the typical distance that observed data points fall from the regression line. It estimates the standard deviation of the error term ( $\sigma$ ).

Standard Error of the Estimate

Estimates the standard deviation of the residuals.

$$s_e = \sqrt{\frac{SSE}{n-2}}$$

Variables

Symbol	Description	Unit
$s_e$	Standard error of the estimate	-
$SSE$	Sum of squared errors	-
$n$	Number of sample observations	-
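
The sketch below evaluates the standard error of the estimate for a small made-up data set. The divisor is $n-2$ because two parameters ( $b_0$ and $b_1$ ) were estimated from the same data. The function name `standard_error_of_estimate` is an illustrative choice.

```python
import math

def standard_error_of_estimate(xs, ys):
    """s_e = sqrt(SSE / (n - 2)) for the least squares line through the data."""
    n = len(xs)
    if n < 3:
        raise ValueError("at least 3 observations are needed (n - 2 > 0)")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))

# Made-up illustrative data (not measurements)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_e = standard_error_of_estimate(x, y)
print(f"s_e = {s_e:.4f}")
```

The result is in the same units as $y$, so it can be read as a typical prediction error.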

Residual Analysis (Validating Assumptions)

A critical step. We plot the residuals ( $e_i = y_i - \hat{y}_i$ ) against the predicted values ( $\hat{y}_i$ ) or the predictor ( $x_i$ ). For the linear model to be valid, the residual plot should show a random, structureless horizontal band around zero.

Common Residual Plot Patterns

Non-linearity: If the residuals show a curved pattern (like a U-shape), a straight line is not appropriate; consider a curved model (e.g., polynomial regression) or a transformation of the variables.
Heteroscedasticity: If the spread of the residuals increases (a fan shape), the variance of the errors is not constant. This violates a key assumption and is commonly addressed with a data transformation (e.g., taking the log of $y$ ) or weighted least squares.
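
The non-linearity pattern can be reproduced with a few lines of Python. In this sketch a straight line is fitted to made-up data that follow $y = x^2$ exactly. The residuals come out positive, negative, then positive again, which is the U-shape described above. The function name `residuals_of_line_fit` is hypothetical.

```python
def residuals_of_line_fit(xs, ys):
    """Fit a least squares line and return the residuals e_i = y_i - y_hat_i."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Made-up data with a purely quadratic relationship
x = [0, 1, 2, 3, 4]
y = [xi ** 2 for xi in x]

residuals = residuals_of_line_fit(x, y)
for xi, ei in zip(x, residuals):
    print(f"x = {xi}: residual = {ei:+.1f}")
```

Least squares residuals always sum to zero when the model includes an intercept, so the pattern of the residuals, not their average, is what reveals the problem.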

Assumptions of Linear Regression (LINE)

Linearity: The relationship between X and Y is fundamentally linear.
Independence: The observations are independent of each other.
Normality: The errors ( $\epsilon_i$ ) are normally distributed with mean zero; this is checked using the residuals ( $e_i$ ).
Equal Variance (Homoscedasticity): The variance of the errors is constant across all values of X.

Multiple Linear Regression

Using more than one predictor variable. Engineers rarely predict an outcome based on a single variable. Concrete strength depends on water-cement ratio, curing time, temperature, and aggregate type.

Multiple Regression Model

An extension of simple linear regression that predicts the response variable using two or more explanatory variables.

Multiple Regression Equation

The model incorporating multiple predictor variables.

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

Variables

Symbol	Description	Unit
$\hat{y}$	Predicted value	-
$b_0$	y-intercept	-
$b_1, b_2, \dots, b_k$	Estimated coefficients for each predictor	-
$x_1, x_2, \dots, x_k$	Individual predictor variables	-
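
As a rough sketch of how the coefficients can be estimated, the code below applies least squares to two predictors by solving the normal equations $(X^T X)\,b = X^T y$ with Gaussian elimination. This lesson does not state the multi-predictor formulas, so the approach, the function names `solve_linear_system` and `fit_multiple`, and the data are all assumptions made for illustration. The data were constructed to satisfy $y = 1 + 2x_1 + 3x_2$ exactly, so the recovered coefficients are known in advance. In practice a numerical library would normally be used instead.

```python
def solve_linear_system(a, rhs):
    """Solve a * b = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    b = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * b[c] for c in range(r + 1, n))
        b[r] = (m[r][n] - tail) / m[r][r]
    return b

def fit_multiple(x_rows, ys):
    """Return [b0, b1, ..., bk] minimizing SSE for y_hat = b0 + sum(b_j * x_j)."""
    design = [[1.0] + list(row) for row in x_rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Constructed data: y = 1 + 2*x1 + 3*x2 with no noise
x_rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]

coefficients = fit_multiple(x_rows, y)
print([round(b, 6) for b in coefficients])
```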

Interpreting Multiple Regression

Each $b_i$ represents the change in the estimated $y$ for a one-unit change in $x_i$ , holding all other predictor variables constant.
Adjusted $R^2$ : Unlike regular $R^2$ (which never decreases when you add a variable, even an irrelevant one), Adjusted $R^2$ penalizes adding variables that do not meaningfully improve the model, which helps guard against overfitting.
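
This lesson does not give a formula for Adjusted $R^2$. A commonly used definition is $1 - (1 - R^2)\frac{n-1}{n-k-1}$, where $n$ is the number of observations and $k$ the number of predictors. The sketch below assumes that definition. The $R^2$ values in the example are hypothetical, chosen to show a case where adding a predictor nudges $R^2$ up while Adjusted $R^2$ goes down.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 20  # hypothetical sample size

# Hypothetical model A: one predictor, R^2 = 0.850
adj_one = adjusted_r2(0.850, n, k=1)

# Hypothetical model B: a second predictor raises R^2 only to 0.851
adj_two = adjusted_r2(0.851, n, k=2)

print(f"one predictor:  adjusted R^2 = {adj_one:.4f}")
print(f"two predictors: adjusted R^2 = {adj_two:.4f}")
```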

Hypothesis Testing in Regression (ANOVA Approach)

Testing if the model is statistically significant.

Hypothesis Test for the Slope ( $\beta_1$ )

In simple linear regression, testing whether a linear relationship exists is equivalent to testing the null hypothesis $H_0: \beta_1 = 0$ (the true slope is zero). We use a t-test with $n-2$ degrees of freedom. If we reject $H_0$ , there is sufficient evidence that $X$ provides information in predicting $Y$ .
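
The test statistic itself is not written out above. The sketch below assumes the usual form $t = b_1 / SE(b_1)$ with $SE(b_1) = s_e / \sqrt{\sum (x_i - \bar{x})^2}$. The function name and data are illustrative. The critical value 3.182 is the standard two-sided table value for $\alpha = 0.05$ with 3 degrees of freedom.

```python
import math

def slope_t_statistic(xs, ys):
    """Return (t, df) for testing H0: beta_1 = 0 in simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))      # standard error of the estimate
    se_b1 = s_e / math.sqrt(s_xx)       # standard error of the slope
    return b1 / se_b1, n - 2

# Made-up illustrative data (not measurements)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

t_stat, df = slope_t_statistic(x, y)
t_critical = 3.182  # table value: two-sided, alpha = 0.05, df = 3
reject_h0 = abs(t_stat) > t_critical

print(f"t = {t_stat:.3f} with {df} degrees of freedom")
print("reject H0" if reject_h0 else "fail to reject H0")
```

With only five points the slope is not significant at the 5% level even though $r$ is fairly large, which illustrates how strongly the conclusion depends on sample size.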

F-Test for Overall Significance

Tests whether the regression model as a whole is better than simply predicting the mean of $y$ ( $\bar{y}$ ).

Interpreting the F-Test

$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ (none of the predictors helps explain $y$ ; the model is no better than $\bar{y}$ ).
If the resulting P-value is very small, we reject $H_0$ , indicating at least one predictor is significantly related to $y$ .
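
The F statistic is not written out above. The sketch below assumes the usual ANOVA form $F = \frac{SSR/k}{SSE/(n-k-1)}$, with $k$ predictors and $n$ observations. The function name `f_statistic`, the data, and the fitted values (from the line $\hat{y} = 2.2 + 0.6x$ fitted to these made-up points) are illustrative.

```python
def f_statistic(ys, y_hats, k):
    """Overall F = (SSR / k) / (SSE / (n - k - 1)) from observed and fitted values."""
    n = len(ys)
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    y_bar = sum(ys) / n
    ssr = sum((yh - y_bar) ** 2 for yh in y_hats)          # explained variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))  # unexplained variation
    msr = ssr / k              # mean square due to regression
    mse = sse / (n - k - 1)    # mean square error
    return msr / mse

# Made-up observations and the fitted values from y_hat = 2.2 + 0.6 * x
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat = [2.2 + 0.6 * xi for xi in x]

F = f_statistic(y, y_hat, k=1)
print(f"F = {F:.2f} with (1, {len(y) - 2}) degrees of freedom")
```

With a single predictor the F statistic equals the square of the slope's t statistic, so the two tests reach the same conclusion in simple linear regression.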

t-Tests for Individual Coefficients

If the overall model is significant, we run t-tests on each individual coefficient ( $\beta_i$ , estimated by $b_i$ ) to determine which specific variables are actually contributing.

Interpreting Individual t-Tests

$H_0: \beta_i = 0$ (This specific predictor is useless, assuming all others are already in the model).
Rejecting $H_0$ implies that the predictor $x_i$ provides significant information even after accounting for the other variables.


Key Takeaways

Correlation ( $r$ ): Measures the strength and direction of a linear relationship (-1 to 1). Does not imply causation.
Least Squares: The mathematical method used to find the line that minimizes the sum of squared errors (SSE).
$R^2$ : The proportion of variation in $y$ explained by the model.
Residual Analysis: Crucial for validating the assumptions of linearity and constant variance. Residual plots should look like random noise.
Multiple Regression: Uses multiple predictors. $b_i$ coefficients represent the effect of $x_i$ holding all other variables constant.
ANOVA F-Test: Determines if the overall regression model is statistically significant.
