Curve Fitting and Interpolation

Curve Fitting and Interpolation

Learning Objectives

Differentiate between regression (curve fitting for noisy data) and interpolation (exact fits for precise data).
Apply least-squares regression techniques including linear, polynomial, and multiple linear regression.
Utilize linearizations for exponential, power, and saturation-growth-rate relationships.
Use Newton's divided-difference and Lagrange interpolating polynomials.
Mitigate Runge's Phenomenon using Chebyshev nodes and piecewise interpolation methods.
Understand spline interpolation and common boundary conditions (e.g., natural, clamped).

Regression

A statistical and numerical method for estimating the relationships among variables, particularly useful for deriving a general trend line from noisy experimental data where the resulting curve does not need to pass exactly through every point.

Interpolation

A method of constructing new data points within the range of a discrete set of known, highly precise data points, ensuring the resulting curve passes exactly through every given point.

Engineers often need to construct a continuous mathematical function that represents discrete data points. There are two distinct, fundamentally different approaches depending on the nature of the data. For instance, civil engineers might use regression to find a trend in noisy soil compaction test data, but they would use interpolation for laying out the precise curvature of a highway or railway track.

Regression vs. Interpolation

Regression (Curve Fitting): Used when the data exhibits significant scatter or error (e.g., experimental measurements with noise). The goal is to derive a single curve that represents the general trend of the data. Crucially, the regression curve does not necessarily pass through any of the specific data points.
Interpolation: Used when the data is known to be highly precise or exact (e.g., tabulated thermodynamic properties or computer-generated coordinates). The goal is to derive a curve that passes exactly through every single data point.

Least-Squares Regression

When fitting a trend line to noisy data, the most common method minimizes the sum of the squares of the residuals (errors) between the discrete points and the continuous curve.

Linear Regression

Fitting a straight line $y = a_0 + a_1x$ to $n$ data points. The coefficients $a_0$ (intercept) and $a_1$ (slope) are found by minimizing the sum of the squared residuals.

Normal Equations

The coefficients are derived by taking the partial derivatives of the sum of squared residuals with respect to $a_0$ and $a_1$ and setting them to zero. This yields the "Normal Equations":

$n a_0 + (\sum x_i) a_1 = \sum y_i$
$(\sum x_i) a_0 + (\sum x_i^2) a_1 = \sum x_i y_i$

Solving this $2 \times 2$ linear system directly yields the formulas for $a_0$ and $a_1$ .

Linear Regression Slope

Computes the slope ( $a_1$ ) of the best-fit line using least-squares.

a_1 = \frac{n \sum x_iy_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}

Variables

Symbol	Description	Unit
$a_1$	Slope of the regression line	-
$n$	Number of data points	-
$x_i$	Independent variable data points	-
$y_i$	Dependent variable data points	-

Linear Regression Intercept

Computes the y-intercept ( $a_0$ ) of the best-fit line.

a_0 = \bar{y} - a_1\bar{x}

Variables

Symbol	Description	Unit
$a_0$	y-intercept of the regression line	-
$\bar{y}$	Mean of the $y$ values	-
$a_1$	Computed slope	-
$\bar{x}$	Mean of the $x$ values	-

To quantify the "goodness of fit", we use the coefficient of determination ( $R^2$ ). It measures how much of the variance in the dependent variable is explained by the model:

Coefficient of Determination ( $R^2$ )

Measures how well the regression line approximates the real data points.

R^2 = \frac{S_t - S_r}{S_t}

Variables

Symbol	Description	Unit
$R^2$	Coefficient of determination. An $R^2$ value close to 1 indicates a perfect fit.	-
$S_t$	Total sum of squares, calculated as $\sum(y_i - \bar{y})^2$	-
$S_r$	Sum of the squares of the residuals, calculated as $\sum(y_i - a_0 - a_1x_i)^2$	-

Standard Error of the Estimate

Measures the standard distance between the observed data points and the regression line.

s_{y/x} = \sqrt{\frac{S_r}{n - 2}}

Variables

Symbol	Description	Unit
$s_{y/x}$	Standard error of the estimate	-
$S_r$	Sum of the squares of the residuals	-
$n$	Number of data points	-

Interactive Simulation

Adjust the data points in the simulation below to observe how the least-squares regression line adapts to minimize the sum of squared residuals.

Least-Squares Linear Regression

Drag the points or click on the graph to add new points. Observe how the best-fit line $y = a_0 + a_1x$ minimizes the sum of squared residuals.

Click to add points (max 15). Drag to move.

Regression Statistics

Equation of the Line

y = 0.329 + 1.032x

Intercept (

a_0

)

0.3286

Slope (

a_1

)

1.0321

R-squared (

r^2

)

0.9820

Sum of Squares (

S_r

)

0.5482

Point	x	y
1	1.00	1.50
2	2.00	2.20
3	3.00	3.10
4	4.00	4.80
5	5.00	5.50
6	6.00	6.90
7	7.00	7.20

Linearization of Nonlinear Relationships

Many nonlinear relationships can be transformed to allow for linear regression by taking the natural logarithm or base-10 logarithm of the variables.

Common Linearizations

Exponential Equation: $y = \alpha e^{\beta x}$ becomes $\ln(y) = \ln(\alpha) + \beta x$ . Let $y^* = \ln(y)$ , $a_0 = \ln(\alpha)$ , $a_1 = \beta$ , $x^* = x$ .
Power Equation: $y = \alpha x^{\beta}$ becomes $\log(y) = \log(\alpha) + \beta \log(x)$ . Let $y^* = \log(y)$ , $a_0 = \log(\alpha)$ , $a_1 = \beta$ , $x^* = \log(x)$ .
Saturation-Growth-Rate: $y = \frac{\alpha x}{\beta + x}$ becomes $\frac{1}{y} = \frac{\beta}{\alpha} \frac{1}{x} + \frac{1}{\alpha}$ . Let $y^* = \frac{1}{y}$ , $a_0 = \frac{1}{\alpha}$ , $a_1 = \frac{\beta}{\alpha}$ , $x^* = \frac{1}{x}$ .

Multiple Linear Regression

When a dependent variable $y$ is a linear function of two or more independent variables (e.g., $y = a_0 + a_1x_1 + a_2x_2$ ), least-squares regression can be extended to find the coefficients by solving a system of simultaneous linear equations formed by partial derivatives of the sum of squared residuals.

Polynomial Regression and Overfitting

Extending the least-squares concept to fit polynomials of degree $m$ : $y = a_0 + a_1x + a_2x^2 + \dots + a_mx^m$ . This involves solving an $(m+1) \times (m+1)$ system of simultaneous linear equations (the general Normal Equations) to find the coefficients.

Overfitting

A modeling error that occurs when a function is too closely aligned with a limited set of data points, capturing the noise rather than the underlying true trend, often resulting in poor predictive performance.

The Danger of Overfitting

While a higher-degree polynomial will technically yield a lower sum of squared residuals, arbitrarily increasing the degree $m$ simply to chase a higher $R^2$ value is a severe error known as overfitting. An overfitted polynomial will weave wildly to hit noise points rather than capturing the true underlying physics. This resulting function becomes highly unstable and nearly useless for making predictions between or beyond the data points. A general rule is to use the lowest degree polynomial that adequately captures the physical trend, validating it against an independent test set.

Orthogonal Polynomials and Fourier Approximation

For continuous functions or extensive data sets, using orthogonal polynomials (like Legendre or Chebyshev polynomials) for regression minimizes numerical errors.

Fourier Series

Instead of fitting polynomials, Fourier approximation uses a series of sine and cosine functions to model periodic phenomena (e.g., tides or vibrations). It's a form of least-squares fitting using orthogonal trigonometric functions.

Fourier Series Approximation

Models periodic data using a sum of trigonometric functions.

f(t) = a_0 + \sum_{k=1}^n (a_k \cos(k\omega_0 t) + b_k \sin(k\omega_0 t))

Variables

Symbol	Description	Unit
$f(t)$	Function value at time $t$	-
$a_0$	Constant term (mean value)	-
$a_k$	Cosine coefficients	-
$b_k$	Sine coefficients	-
$\omega_0$	Fundamental frequency	rad/s
$t$	Time	s

The Dangers of Extrapolation

Interpolation refers to predicting values within the range of the given data set. Extrapolation refers to predicting values outside that range.

The Danger of Extrapolation

Extrapolating outside the bounds of the base data is highly dangerous. Polynomials, in particular, tend to oscillate wildly outside their fitting range. A trend that appears strictly linear or exponential within the given data may completely break down outside it, leading to grossly erroneous predictions.

Newton's Divided-Difference Interpolating Polynomials

Interpolation is used when data is known to be highly precise, and the fitted curve must pass directly through every data point. Newton's method is one of the most popular and useful forms.

Newton's Interpolating Polynomial

A polynomial of order $n$ that passes through $n+1$ points. The coefficients $b_i$ are evaluated using finite divided differences.

Newton's Interpolating Polynomial Formula

Constructs an interpolating polynomial using divided differences.

f_n(x) = b_0 + b_1(x - x_0) + b_2(x - x_0)(x - x_1) + \dots + b_n(x - x_0)(x - x_1)\dots(x - x_{n-1})

Variables

Symbol	Description	Unit
$f_n(x)$	Interpolated value at $x$	-
$b_i$	Coefficients evaluated via finite divided differences	-
$x_i$	Known data point coordinates	-

Inverse Interpolation

In standard interpolation, you provide a value $x$ to find $f(x)$ . In inverse interpolation, you are given a value $f(x)$ and must find the corresponding $x$ .

How to perform Inverse Interpolation

Swapping variables: The simplest approach is to swap $x$ and $y$ (making $x$ the dependent variable) and use a standard method like Newton's or Lagrange. However, this only works well if the function is monotonic.
Root-finding: A more robust approach is to form a new function $g(x) = f(x) - y_{\text{target}}$ and use root-finding methods (like the Secant Method) on the interpolating polynomial.

Runge's Phenomenon and Chebyshev Nodes

Runge's Phenomenon

A problem of oscillation at the edges of an interval that occurs when using polynomial interpolation with polynomials of high degree over a set of equally spaced interpolation points.

When fitting a high-degree polynomial to a large set of equally spaced data points, severe oscillations can occur near the edges of the interval. This is known as Runge's Phenomenon.

Chebyshev Nodes

To mitigate Runge's Phenomenon, we can choose non-equally spaced points called Chebyshev nodes. These nodes are clustered near the ends of the interval and sparse in the center, which minimizes the maximum interpolation error across the domain. For an interval $[a, b]$ , these points are linearly scaled from $[-1, 1]$ . Using Chebyshev nodes with Newton's or Lagrange polynomials dramatically improves stability.

Chebyshev Nodes Formula

Calculates optimal non-equally spaced nodes to minimize interpolation error.

x_k = \cos\left(\frac{2k-1}{2n}\pi\right), \quad k=1,\dots,n

Variables

Symbol	Description	Unit
$x_k$	Position of the $k$ -th Chebyshev node in $[-1, 1]$	-
$n$	Total number of nodes	-

Lagrange Interpolating Polynomials

A reformulation of Newton's polynomial that avoids the computation of divided differences. It is conceptually simpler but computationally more expensive if the order of the polynomial is unknown beforehand.

Lagrange Interpolating Polynomial

Calculates the interpolated value without needing divided differences by using weighting functions.

f_n(x) = \sum_{i=0}^n L_i(x) f(x_i)

Variables

Symbol	Description	Unit
$f_n(x)$	Interpolated value at $x$	-
$L_i(x)$	Lagrange basis polynomial (weighting function)	-
$f(x_i)$	Known function value at $x_i$	-

Lagrange Basis Polynomial

Computes the weighting functions for the Lagrange polynomial.

L_i(x) = \prod_{j=0, j \neq i}^n \frac{x - x_j}{x_i - x_j}

Variables

Symbol	Description	Unit
$L_i(x)$	The $i$ -th Lagrange basis polynomial	-
$x$	The interpolation point	-
$x_i, x_j$	Known data points	-

Hermite Interpolation

Unlike standard interpolation (which only guarantees the curve passes through the given points), Hermite interpolation forces the polynomial to match both the function values and the first derivatives (slopes) at those points.

Hermite Polynomials

To interpolate $n$ points where both $f(x_i)$ and $f'(x_i)$ are specified, a polynomial of degree $2n-1$ is required. This ensures the resulting interpolating curve has smooth tangents at the data points, which is particularly useful in computer graphics and path planning.

Piecewise Linear Interpolation

Before moving to complex splines, the most basic form of spline interpolation is piecewise linear interpolation, which simply connects consecutive data points with straight lines (first-degree polynomials).

Piecewise Linear Interpolation

Connects consecutive data points using first-degree polynomials.

f_1(x) = f(x_{i-1}) + \frac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}} (x - x_{i-1})

Variables

Symbol	Description	Unit
$f_1(x)$	Interpolated value at $x$	-
$x_{i-1}, x_i$	Consecutive known data points	-
$f(x_{i-1}), f(x_i)$	Function values at the data points	-

While easy to implement, its primary drawback is that the first derivative (slope) is discontinuous at the knots, resulting in a curve with sharp corners rather than a smooth profile.

Spline Interpolation

Using a single high-degree polynomial for a large number of points can cause severe oscillations (Runge's phenomenon). Spline interpolation applies lower-degree polynomials to subsets of data points, ensuring smooth transitions at the connection points (knots).

Spline Interpolation

A form of interpolation where the interpolant is a special type of piecewise polynomial called a spline. Spline interpolation is often preferred over standard polynomial interpolation because the interpolation error can be made small even when using low degree polynomials for the spline, thus avoiding Runge's phenomenon.

Cubic Splines

The most common type of spline. It fits a third-degree polynomial to each interval between data points. To ensure a smooth continuous curve across the entire data set, the cubic spline enforces matching conditions at the internal knots:

The function values must match at the knots (continuity).
The first derivatives must match at the knots (smooth slope).
The second derivatives must match at the knots (smooth curvature).

This results in a tridiagonal system of linear equations that can be solved very efficiently.

Spline Boundary Conditions

Because there are $n$ intervals but conditions only specify the internal connections, two extra boundary conditions at the first and last knot are required to close the system:

Natural Spline: The second derivative is set to zero at the endpoints ( $f''(x_0) = f''(x_n) = 0$ ). This makes the curve "straight" at the ends.
Clamped Spline: The first derivatives (slopes) at the endpoints are specified by the user to match a known tangent.
Not-a-knot: Forces the third derivative to be continuous across the first and last internal knots, effectively treating the first two and last two intervals as single cubic functions.

Key Takeaways

Regression creates a general trend line that minimizes errors for noisy data, while Interpolation forces the curve to pass exactly through precise data points.
Regression uses Normal Equations to derive coefficients that minimize the sum of squared residuals.
The Coefficient of Determination ( $R^2$ ) quantifies how well the regression line fits the data.
Linearization allows nonlinear models (exponential, power) to be fitted using linear regression techniques.
Increasing the polynomial degree too high leads to Overfitting, causing wild non-physical oscillations.
Extrapolation outside the data bounds is dangerous because fitted polynomials tend to diverge rapidly.
Runge's Phenomenon causes high-degree polynomials to oscillate on equally spaced points; using Chebyshev Nodes mitigates this.
Hermite Interpolation matches both the function values and their first derivatives at the data points.
Piecewise Linear Interpolation connects points with straight lines but has discontinuous slopes.
Orthogonal polynomials and Fourier series are powerful tools for continuous least-squares fitting, especially for periodic data.
Newton's and Lagrange polynomials provide global interpolation formulas but are susceptible to Runge's phenomenon for high degrees.
Inverse Interpolation finds the $x$ -value for a given $f(x)$ , often by swapping variables or using root-finding methods on the interpolant.
Cubic Splines provide local, piecewise continuous polynomials that maintain smooth slopes and curvatures across knots, avoiding large oscillations.
Spline systems require specific Boundary Conditions (like Natural, Clamped, or Not-a-knot) to fully solve the resulting tridiagonal matrix.

Previous TopicSystems of Linear Equations - Examples & Applications

Quiz Me

Next TopicCurve Fitting and Interpolation - Examples & Applications

Prev Next

Quiz Me

Learning Objectives

Regression

Interpolation

Regression vs. Interpolation

Least-Squares Regression

Linear Regression

Normal Equations

Linear Regression Slope

Linear Regression Intercept

Coefficient of Determination (R2R^2R2)

Standard Error of the Estimate

Interactive Simulation

Least-Squares Linear Regression

Regression Statistics

Linearization of Nonlinear Relationships

Common Linearizations

Multiple Linear Regression

Polynomial Regression and Overfitting

Overfitting

The Danger of Overfitting

Orthogonal Polynomials and Fourier Approximation

Fourier Series

Fourier Series Approximation

The Dangers of Extrapolation

The Danger of Extrapolation

Newton's Divided-Difference Interpolating Polynomials

Newton's Interpolating Polynomial

Newton's Interpolating Polynomial Formula

Inverse Interpolation

How to perform Inverse Interpolation

Runge's Phenomenon and Chebyshev Nodes

Runge's Phenomenon

Chebyshev Nodes

Chebyshev Nodes Formula

Lagrange Interpolating Polynomials

Lagrange Interpolating Polynomial

Lagrange Basis Polynomial

Hermite Interpolation

Hermite Polynomials

Piecewise Linear Interpolation

Piecewise Linear Interpolation

Spline Interpolation

Spline Interpolation

Cubic Splines

Spline Boundary Conditions

Coefficient of Determination ( $R^2$ )