Curve Fitting and Interpolation

Learning Objectives

  • Differentiate between regression (curve fitting for noisy data) and interpolation (exact fits for precise data).
  • Apply least-squares regression techniques including linear, polynomial, and multiple linear regression.
  • Utilize linearizations for exponential, power, and saturation-growth-rate relationships.
  • Use Newton's divided-difference and Lagrange interpolating polynomials.
  • Mitigate Runge's Phenomenon using Chebyshev nodes and piecewise interpolation methods.
  • Understand spline interpolation and common boundary conditions (e.g., natural, clamped).

Regression

A statistical and numerical method for estimating the relationships among variables, particularly useful for deriving a general trend line from noisy experimental data where the resulting curve does not need to pass exactly through every point.

Interpolation

A method of constructing new data points within the range of a discrete set of known, highly precise data points, ensuring the resulting curve passes exactly through every given point.

Engineers often need to construct a continuous mathematical function that represents discrete data points. There are two distinct, fundamentally different approaches depending on the nature of the data. For instance, civil engineers might use regression to find a trend in noisy soil compaction test data, but they would use interpolation for laying out the precise curvature of a highway or railway track.

Regression vs. Interpolation

  • Regression (Curve Fitting): Used when the data exhibits significant scatter or error (e.g., experimental measurements with noise). The goal is to derive a single curve that represents the general trend of the data. Crucially, the regression curve does not necessarily pass through any of the specific data points.
  • Interpolation: Used when the data is known to be highly precise or exact (e.g., tabulated thermodynamic properties or computer-generated coordinates). The goal is to derive a curve that passes exactly through every single data point.

Least-Squares Regression

When fitting a trend line to noisy data, the most common method minimizes the sum of the squares of the residuals (errors) between the discrete points and the continuous curve.

Linear Regression

Fitting a straight line y=a0+a1xy = a_0 + a_1x to nn data points. The coefficients a0a_0 (intercept) and a1a_1 (slope) are found by minimizing the sum of the squared residuals.

Normal Equations

The coefficients are derived by taking the partial derivatives of the sum of squared residuals with respect to a0a_0 and a1a_1 and setting them to zero. This yields the "Normal Equations":

  1. na0+(xi)a1=yin a_0 + (\sum x_i) a_1 = \sum y_i
  2. (xi)a0+(xi2)a1=xiyi(\sum x_i) a_0 + (\sum x_i^2) a_1 = \sum x_i y_i

Solving this 2×22 \times 2 linear system directly yields the formulas for a0a_0 and a1a_1.

Linear Regression Slope

Computes the slope (a1a_1) of the best-fit line using least-squares.

a1=nxiyixiyinxi2(xi)2a_1 = \frac{n \sum x_iy_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}

Variables

SymbolDescriptionUnit
a1a_1Slope of the regression line-
nnNumber of data points-
xix_iIndependent variable data points-
yiy_iDependent variable data points-

Linear Regression Intercept

Computes the y-intercept (a0a_0) of the best-fit line.

a0=yˉa1xˉa_0 = \bar{y} - a_1\bar{x}

Variables

SymbolDescriptionUnit
a0a_0y-intercept of the regression line-
yˉ\bar{y}Mean of the yy values-
a1a_1Computed slope-
xˉ\bar{x}Mean of the xx values-

To quantify the "goodness of fit", we use the coefficient of determination (R2R^2). It measures how much of the variance in the dependent variable is explained by the model:

Coefficient of Determination (R2R^2)

Measures how well the regression line approximates the real data points.

R2=StSrStR^2 = \frac{S_t - S_r}{S_t}

Variables

SymbolDescriptionUnit
R2R^2Coefficient of determination. An R2R^2 value close to 1 indicates a perfect fit.-
StS_tTotal sum of squares, calculated as (yiyˉ)2\sum(y_i - \bar{y})^2-
SrS_rSum of the squares of the residuals, calculated as (yia0a1xi)2\sum(y_i - a_0 - a_1x_i)^2-

Standard Error of the Estimate

Measures the standard distance between the observed data points and the regression line.

sy/x=Srn2s_{y/x} = \sqrt{\frac{S_r}{n - 2}}

Variables

SymbolDescriptionUnit
sy/xs_{y/x}Standard error of the estimate-
SrS_rSum of the squares of the residuals-
nnNumber of data points-

Interactive Simulation

Adjust the data points in the simulation below to observe how the least-squares regression line adapts to minimize the sum of squared residuals.

Least-Squares Linear Regression

Drag the points or click on the graph to add new points. Observe how the best-fit line y=a0+a1xy = a_0 + a_1x minimizes the sum of squared residuals.

1122334455667788991010
Click to add points (max 15). Drag to move.

Regression Statistics

Equation of the Line
y = 0.329 + 1.032x
Intercept (a0a_0)
0.3286
Slope (a1a_1)
1.0321
R-squared (r2r^2)
0.9820
Sum of Squares (SrS_r)
0.5482
Pointxy
11.001.50
22.002.20
33.003.10
44.004.80
55.005.50
66.006.90
77.007.20

Linearization of Nonlinear Relationships

Many nonlinear relationships can be transformed to allow for linear regression by taking the natural logarithm or base-10 logarithm of the variables.

Common Linearizations

  • Exponential Equation: y=αeβxy = \alpha e^{\beta x} becomes ln(y)=ln(α)+βx\ln(y) = \ln(\alpha) + \beta x. Let y=ln(y)y^* = \ln(y), a0=ln(α)a_0 = \ln(\alpha), a1=βa_1 = \beta, x=xx^* = x.
  • Power Equation: y=αxβy = \alpha x^{\beta} becomes log(y)=log(α)+βlog(x)\log(y) = \log(\alpha) + \beta \log(x). Let y=log(y)y^* = \log(y), a0=log(α)a_0 = \log(\alpha), a1=βa_1 = \beta, x=log(x)x^* = \log(x).
  • Saturation-Growth-Rate: y=αxβ+xy = \frac{\alpha x}{\beta + x} becomes 1y=βα1x+1α\frac{1}{y} = \frac{\beta}{\alpha} \frac{1}{x} + \frac{1}{\alpha}. Let y=1yy^* = \frac{1}{y}, a0=1αa_0 = \frac{1}{\alpha}, a1=βαa_1 = \frac{\beta}{\alpha}, x=1xx^* = \frac{1}{x}.

Multiple Linear Regression

When a dependent variable yy is a linear function of two or more independent variables (e.g., y=a0+a1x1+a2x2y = a_0 + a_1x_1 + a_2x_2), least-squares regression can be extended to find the coefficients by solving a system of simultaneous linear equations formed by partial derivatives of the sum of squared residuals.

Polynomial Regression and Overfitting

Extending the least-squares concept to fit polynomials of degree mm: y=a0+a1x+a2x2++amxmy = a_0 + a_1x + a_2x^2 + \dots + a_mx^m. This involves solving an (m+1)×(m+1)(m+1) \times (m+1) system of simultaneous linear equations (the general Normal Equations) to find the coefficients.

Overfitting

A modeling error that occurs when a function is too closely aligned with a limited set of data points, capturing the noise rather than the underlying true trend, often resulting in poor predictive performance.

The Danger of Overfitting

While a higher-degree polynomial will technically yield a lower sum of squared residuals, arbitrarily increasing the degree mm simply to chase a higher R2R^2 value is a severe error known as overfitting. An overfitted polynomial will weave wildly to hit noise points rather than capturing the true underlying physics. This resulting function becomes highly unstable and nearly useless for making predictions between or beyond the data points. A general rule is to use the lowest degree polynomial that adequately captures the physical trend, validating it against an independent test set.

Orthogonal Polynomials and Fourier Approximation

For continuous functions or extensive data sets, using orthogonal polynomials (like Legendre or Chebyshev polynomials) for regression minimizes numerical errors.

Fourier Series

Instead of fitting polynomials, Fourier approximation uses a series of sine and cosine functions to model periodic phenomena (e.g., tides or vibrations). It's a form of least-squares fitting using orthogonal trigonometric functions.

Fourier Series Approximation

Models periodic data using a sum of trigonometric functions.

f(t)=a0+k=1n(akcos(kω0t)+bksin(kω0t))f(t) = a_0 + \sum_{k=1}^n (a_k \cos(k\omega_0 t) + b_k \sin(k\omega_0 t))

Variables

SymbolDescriptionUnit
f(t)f(t)Function value at time tt-
a0a_0Constant term (mean value)-
aka_kCosine coefficients-
bkb_kSine coefficients-
ω0\omega_0Fundamental frequencyrad/s
ttTimes

The Dangers of Extrapolation

Interpolation refers to predicting values within the range of the given data set. Extrapolation refers to predicting values outside that range.

The Danger of Extrapolation

Extrapolating outside the bounds of the base data is highly dangerous. Polynomials, in particular, tend to oscillate wildly outside their fitting range. A trend that appears strictly linear or exponential within the given data may completely break down outside it, leading to grossly erroneous predictions.

Newton's Divided-Difference Interpolating Polynomials

Interpolation is used when data is known to be highly precise, and the fitted curve must pass directly through every data point. Newton's method is one of the most popular and useful forms.

Newton's Interpolating Polynomial

A polynomial of order nn that passes through n+1n+1 points. The coefficients bib_i are evaluated using finite divided differences.

Newton's Interpolating Polynomial Formula

Constructs an interpolating polynomial using divided differences.

fn(x)=b0+b1(xx0)+b2(xx0)(xx1)++bn(xx0)(xx1)(xxn1)f_n(x) = b_0 + b_1(x - x_0) + b_2(x - x_0)(x - x_1) + \dots + b_n(x - x_0)(x - x_1)\dots(x - x_{n-1})

Variables

SymbolDescriptionUnit
fn(x)f_n(x)Interpolated value at xx-
bib_iCoefficients evaluated via finite divided differences-
xix_iKnown data point coordinates-

Inverse Interpolation

In standard interpolation, you provide a value xx to find f(x)f(x). In inverse interpolation, you are given a value f(x)f(x) and must find the corresponding xx.

How to perform Inverse Interpolation

  • Swapping variables: The simplest approach is to swap xx and yy (making xx the dependent variable) and use a standard method like Newton's or Lagrange. However, this only works well if the function is monotonic.
  • Root-finding: A more robust approach is to form a new function g(x)=f(x)ytargetg(x) = f(x) - y_{\text{target}} and use root-finding methods (like the Secant Method) on the interpolating polynomial.

Runge's Phenomenon and Chebyshev Nodes

Runge's Phenomenon

A problem of oscillation at the edges of an interval that occurs when using polynomial interpolation with polynomials of high degree over a set of equally spaced interpolation points.

When fitting a high-degree polynomial to a large set of equally spaced data points, severe oscillations can occur near the edges of the interval. This is known as Runge's Phenomenon.

Chebyshev Nodes

To mitigate Runge's Phenomenon, we can choose non-equally spaced points called Chebyshev nodes. These nodes are clustered near the ends of the interval and sparse in the center, which minimizes the maximum interpolation error across the domain. For an interval [a,b][a, b], these points are linearly scaled from [1,1][-1, 1]. Using Chebyshev nodes with Newton's or Lagrange polynomials dramatically improves stability.

Chebyshev Nodes Formula

Calculates optimal non-equally spaced nodes to minimize interpolation error.

xk=cos(2k12nπ),k=1,,nx_k = \cos\left(\frac{2k-1}{2n}\pi\right), \quad k=1,\dots,n

Variables

SymbolDescriptionUnit
xkx_kPosition of the kk-th Chebyshev node in [1,1][-1, 1]-
nnTotal number of nodes-

Lagrange Interpolating Polynomials

A reformulation of Newton's polynomial that avoids the computation of divided differences. It is conceptually simpler but computationally more expensive if the order of the polynomial is unknown beforehand.

Lagrange Interpolating Polynomial

Calculates the interpolated value without needing divided differences by using weighting functions.

fn(x)=i=0nLi(x)f(xi)f_n(x) = \sum_{i=0}^n L_i(x) f(x_i)

Variables

SymbolDescriptionUnit
fn(x)f_n(x)Interpolated value at xx-
Li(x)L_i(x)Lagrange basis polynomial (weighting function)-
f(xi)f(x_i)Known function value at xix_i-

Lagrange Basis Polynomial

Computes the weighting functions for the Lagrange polynomial.

Li(x)=j=0,jinxxjxixjL_i(x) = \prod_{j=0, j \neq i}^n \frac{x - x_j}{x_i - x_j}

Variables

SymbolDescriptionUnit
Li(x)L_i(x)The ii-th Lagrange basis polynomial-
xxThe interpolation point-
xi,xjx_i, x_jKnown data points-

Hermite Interpolation

Unlike standard interpolation (which only guarantees the curve passes through the given points), Hermite interpolation forces the polynomial to match both the function values and the first derivatives (slopes) at those points.

Hermite Polynomials

To interpolate nn points where both f(xi)f(x_i) and f(xi)f'(x_i) are specified, a polynomial of degree 2n12n-1 is required. This ensures the resulting interpolating curve has smooth tangents at the data points, which is particularly useful in computer graphics and path planning.

Piecewise Linear Interpolation

Before moving to complex splines, the most basic form of spline interpolation is piecewise linear interpolation, which simply connects consecutive data points with straight lines (first-degree polynomials).

Piecewise Linear Interpolation

Connects consecutive data points using first-degree polynomials.

f1(x)=f(xi1)+f(xi)f(xi1)xixi1(xxi1)f_1(x) = f(x_{i-1}) + \frac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}} (x - x_{i-1})

Variables

SymbolDescriptionUnit
f1(x)f_1(x)Interpolated value at xx-
xi1,xix_{i-1}, x_iConsecutive known data points-
f(xi1),f(xi)f(x_{i-1}), f(x_i)Function values at the data points-

While easy to implement, its primary drawback is that the first derivative (slope) is discontinuous at the knots, resulting in a curve with sharp corners rather than a smooth profile.

Spline Interpolation

Using a single high-degree polynomial for a large number of points can cause severe oscillations (Runge's phenomenon). Spline interpolation applies lower-degree polynomials to subsets of data points, ensuring smooth transitions at the connection points (knots).

Spline Interpolation

A form of interpolation where the interpolant is a special type of piecewise polynomial called a spline. Spline interpolation is often preferred over standard polynomial interpolation because the interpolation error can be made small even when using low degree polynomials for the spline, thus avoiding Runge's phenomenon.

Cubic Splines

The most common type of spline. It fits a third-degree polynomial to each interval between data points. To ensure a smooth continuous curve across the entire data set, the cubic spline enforces matching conditions at the internal knots:

  • The function values must match at the knots (continuity).
  • The first derivatives must match at the knots (smooth slope).
  • The second derivatives must match at the knots (smooth curvature).

This results in a tridiagonal system of linear equations that can be solved very efficiently.

Spline Boundary Conditions

Because there are nn intervals but conditions only specify the internal connections, two extra boundary conditions at the first and last knot are required to close the system:

  • Natural Spline: The second derivative is set to zero at the endpoints (f(x0)=f(xn)=0f''(x_0) = f''(x_n) = 0). This makes the curve "straight" at the ends.
  • Clamped Spline: The first derivatives (slopes) at the endpoints are specified by the user to match a known tangent.
  • Not-a-knot: Forces the third derivative to be continuous across the first and last internal knots, effectively treating the first two and last two intervals as single cubic functions.
Key Takeaways
  • Regression creates a general trend line that minimizes errors for noisy data, while Interpolation forces the curve to pass exactly through precise data points.
  • Regression uses Normal Equations to derive coefficients that minimize the sum of squared residuals.
  • The Coefficient of Determination (R2R^2) quantifies how well the regression line fits the data.
  • Linearization allows nonlinear models (exponential, power) to be fitted using linear regression techniques.
  • Increasing the polynomial degree too high leads to Overfitting, causing wild non-physical oscillations.
  • Extrapolation outside the data bounds is dangerous because fitted polynomials tend to diverge rapidly.
  • Runge's Phenomenon causes high-degree polynomials to oscillate on equally spaced points; using Chebyshev Nodes mitigates this.
  • Hermite Interpolation matches both the function values and their first derivatives at the data points.
  • Piecewise Linear Interpolation connects points with straight lines but has discontinuous slopes.
  • Orthogonal polynomials and Fourier series are powerful tools for continuous least-squares fitting, especially for periodic data.
  • Newton's and Lagrange polynomials provide global interpolation formulas but are susceptible to Runge's phenomenon for high degrees.
  • Inverse Interpolation finds the xx-value for a given f(x)f(x), often by swapping variables or using root-finding methods on the interpolant.
  • Cubic Splines provide local, piecewise continuous polynomials that maintain smooth slopes and curvatures across knots, avoiding large oscillations.
  • Spline systems require specific Boundary Conditions (like Natural, Clamped, or Not-a-knot) to fully solve the resulting tridiagonal matrix.