Curve Fitting and Interpolation
Learning Objectives
- Differentiate between regression (curve fitting for noisy data) and interpolation (exact fits for precise data).
- Apply least-squares regression techniques including linear, polynomial, and multiple linear regression.
- Utilize linearizations for exponential, power, and saturation-growth-rate relationships.
- Use Newton's divided-difference and Lagrange interpolating polynomials.
- Mitigate Runge's Phenomenon using Chebyshev nodes and piecewise interpolation methods.
- Understand spline interpolation and common boundary conditions (e.g., natural, clamped).
Regression
A statistical and numerical method for estimating the relationships among variables, particularly useful for deriving a general trend line from noisy experimental data where the resulting curve does not need to pass exactly through every point.
Interpolation
A method of constructing new data points within the range of a discrete set of known, highly precise data points, ensuring the resulting curve passes exactly through every given point.
Engineers often need to construct a continuous mathematical function that represents discrete data points. There are two distinct, fundamentally different approaches depending on the nature of the data. For instance, civil engineers might use regression to find a trend in noisy soil compaction test data, but they would use interpolation for laying out the precise curvature of a highway or railway track.
Regression vs. Interpolation
- Regression (Curve Fitting): Used when the data exhibits significant scatter or error (e.g., experimental measurements with noise). The goal is to derive a single curve that represents the general trend of the data. Crucially, the regression curve does not necessarily pass through any of the specific data points.
- Interpolation: Used when the data is known to be highly precise or exact (e.g., tabulated thermodynamic properties or computer-generated coordinates). The goal is to derive a curve that passes exactly through every single data point.
Least-Squares Regression
When fitting a trend line to noisy data, the most common method minimizes the sum of the squares of the residuals (errors) between the discrete points and the continuous curve.
Linear Regression
Fitting a straight line to data points. The coefficients (intercept) and (slope) are found by minimizing the sum of the squared residuals.
Normal Equations
The coefficients are derived by taking the partial derivatives of the sum of squared residuals with respect to and and setting them to zero. This yields the "Normal Equations":
Solving this linear system directly yields the formulas for and .
Linear Regression Slope
Computes the slope () of the best-fit line using least-squares.
Variables
| Symbol | Description | Unit |
|---|---|---|
| Slope of the regression line | - | |
| Number of data points | - | |
| Independent variable data points | - | |
| Dependent variable data points | - |
Linear Regression Intercept
Computes the y-intercept () of the best-fit line.
Variables
| Symbol | Description | Unit |
|---|---|---|
| y-intercept of the regression line | - | |
| Mean of the values | - | |
| Computed slope | - | |
| Mean of the values | - |
To quantify the "goodness of fit", we use the coefficient of determination (). It measures how much of the variance in the dependent variable is explained by the model:
Coefficient of Determination ()
Measures how well the regression line approximates the real data points.
Variables
| Symbol | Description | Unit |
|---|---|---|
| Coefficient of determination. An value close to 1 indicates a perfect fit. | - | |
| Total sum of squares, calculated as | - | |
| Sum of the squares of the residuals, calculated as | - |
Standard Error of the Estimate
Measures the standard distance between the observed data points and the regression line.
Variables
| Symbol | Description | Unit |
|---|---|---|
| Standard error of the estimate | - | |
| Sum of the squares of the residuals | - | |
| Number of data points | - |
Interactive Simulation
Adjust the data points in the simulation below to observe how the least-squares regression line adapts to minimize the sum of squared residuals.
Least-Squares Linear Regression
Drag the points or click on the graph to add new points. Observe how the best-fit line minimizes the sum of squared residuals.
Regression Statistics
Linearization of Nonlinear Relationships
Many nonlinear relationships can be transformed to allow for linear regression by taking the natural logarithm or base-10 logarithm of the variables.
Common Linearizations
- Exponential Equation: becomes . Let , , , .
- Power Equation: becomes . Let , , , .
- Saturation-Growth-Rate: becomes . Let , , , .
Multiple Linear Regression
When a dependent variable is a linear function of two or more independent variables (e.g., ), least-squares regression can be extended to find the coefficients by solving a system of simultaneous linear equations formed by partial derivatives of the sum of squared residuals.
Polynomial Regression and Overfitting
Extending the least-squares concept to fit polynomials of degree : . This involves solving an system of simultaneous linear equations (the general Normal Equations) to find the coefficients.
Overfitting
A modeling error that occurs when a function is too closely aligned with a limited set of data points, capturing the noise rather than the underlying true trend, often resulting in poor predictive performance.
The Danger of Overfitting
While a higher-degree polynomial will technically yield a lower sum of squared residuals, arbitrarily increasing the degree simply to chase a higher value is a severe error known as overfitting. An overfitted polynomial will weave wildly to hit noise points rather than capturing the true underlying physics. This resulting function becomes highly unstable and nearly useless for making predictions between or beyond the data points. A general rule is to use the lowest degree polynomial that adequately captures the physical trend, validating it against an independent test set.
Orthogonal Polynomials and Fourier Approximation
For continuous functions or extensive data sets, using orthogonal polynomials (like Legendre or Chebyshev polynomials) for regression minimizes numerical errors.
Fourier Series
Instead of fitting polynomials, Fourier approximation uses a series of sine and cosine functions to model periodic phenomena (e.g., tides or vibrations). It's a form of least-squares fitting using orthogonal trigonometric functions.
Fourier Series Approximation
Models periodic data using a sum of trigonometric functions.
Variables
| Symbol | Description | Unit |
|---|---|---|
| Function value at time | - | |
| Constant term (mean value) | - | |
| Cosine coefficients | - | |
| Sine coefficients | - | |
| Fundamental frequency | rad/s | |
| Time | s |
The Dangers of Extrapolation
Interpolation refers to predicting values within the range of the given data set. Extrapolation refers to predicting values outside that range.
The Danger of Extrapolation
Extrapolating outside the bounds of the base data is highly dangerous. Polynomials, in particular, tend to oscillate wildly outside their fitting range. A trend that appears strictly linear or exponential within the given data may completely break down outside it, leading to grossly erroneous predictions.
Newton's Divided-Difference Interpolating Polynomials
Interpolation is used when data is known to be highly precise, and the fitted curve must pass directly through every data point. Newton's method is one of the most popular and useful forms.
Newton's Interpolating Polynomial
A polynomial of order that passes through points. The coefficients are evaluated using finite divided differences.
Newton's Interpolating Polynomial Formula
Constructs an interpolating polynomial using divided differences.
Variables
| Symbol | Description | Unit |
|---|---|---|
| Interpolated value at | - | |
| Coefficients evaluated via finite divided differences | - | |
| Known data point coordinates | - |
Inverse Interpolation
In standard interpolation, you provide a value to find . In inverse interpolation, you are given a value and must find the corresponding .
How to perform Inverse Interpolation
- Swapping variables: The simplest approach is to swap and (making the dependent variable) and use a standard method like Newton's or Lagrange. However, this only works well if the function is monotonic.
- Root-finding: A more robust approach is to form a new function and use root-finding methods (like the Secant Method) on the interpolating polynomial.
Runge's Phenomenon and Chebyshev Nodes
Runge's Phenomenon
A problem of oscillation at the edges of an interval that occurs when using polynomial interpolation with polynomials of high degree over a set of equally spaced interpolation points.
When fitting a high-degree polynomial to a large set of equally spaced data points, severe oscillations can occur near the edges of the interval. This is known as Runge's Phenomenon.
Chebyshev Nodes
To mitigate Runge's Phenomenon, we can choose non-equally spaced points called Chebyshev nodes. These nodes are clustered near the ends of the interval and sparse in the center, which minimizes the maximum interpolation error across the domain. For an interval , these points are linearly scaled from . Using Chebyshev nodes with Newton's or Lagrange polynomials dramatically improves stability.
Chebyshev Nodes Formula
Calculates optimal non-equally spaced nodes to minimize interpolation error.
Variables
| Symbol | Description | Unit |
|---|---|---|
| Position of the -th Chebyshev node in | - | |
| Total number of nodes | - |
Lagrange Interpolating Polynomials
A reformulation of Newton's polynomial that avoids the computation of divided differences. It is conceptually simpler but computationally more expensive if the order of the polynomial is unknown beforehand.
Lagrange Interpolating Polynomial
Calculates the interpolated value without needing divided differences by using weighting functions.
Variables
| Symbol | Description | Unit |
|---|---|---|
| Interpolated value at | - | |
| Lagrange basis polynomial (weighting function) | - | |
| Known function value at | - |
Lagrange Basis Polynomial
Computes the weighting functions for the Lagrange polynomial.
Variables
| Symbol | Description | Unit |
|---|---|---|
| The -th Lagrange basis polynomial | - | |
| The interpolation point | - | |
| Known data points | - |
Hermite Interpolation
Unlike standard interpolation (which only guarantees the curve passes through the given points), Hermite interpolation forces the polynomial to match both the function values and the first derivatives (slopes) at those points.
Hermite Polynomials
To interpolate points where both and are specified, a polynomial of degree is required. This ensures the resulting interpolating curve has smooth tangents at the data points, which is particularly useful in computer graphics and path planning.
Piecewise Linear Interpolation
Before moving to complex splines, the most basic form of spline interpolation is piecewise linear interpolation, which simply connects consecutive data points with straight lines (first-degree polynomials).
Piecewise Linear Interpolation
Connects consecutive data points using first-degree polynomials.
Variables
| Symbol | Description | Unit |
|---|---|---|
| Interpolated value at | - | |
| Consecutive known data points | - | |
| Function values at the data points | - |
While easy to implement, its primary drawback is that the first derivative (slope) is discontinuous at the knots, resulting in a curve with sharp corners rather than a smooth profile.
Spline Interpolation
Using a single high-degree polynomial for a large number of points can cause severe oscillations (Runge's phenomenon). Spline interpolation applies lower-degree polynomials to subsets of data points, ensuring smooth transitions at the connection points (knots).
Spline Interpolation
A form of interpolation where the interpolant is a special type of piecewise polynomial called a spline. Spline interpolation is often preferred over standard polynomial interpolation because the interpolation error can be made small even when using low degree polynomials for the spline, thus avoiding Runge's phenomenon.
Cubic Splines
The most common type of spline. It fits a third-degree polynomial to each interval between data points. To ensure a smooth continuous curve across the entire data set, the cubic spline enforces matching conditions at the internal knots:
- The function values must match at the knots (continuity).
- The first derivatives must match at the knots (smooth slope).
- The second derivatives must match at the knots (smooth curvature).
This results in a tridiagonal system of linear equations that can be solved very efficiently.
Spline Boundary Conditions
Because there are intervals but conditions only specify the internal connections, two extra boundary conditions at the first and last knot are required to close the system:
- Natural Spline: The second derivative is set to zero at the endpoints (). This makes the curve "straight" at the ends.
- Clamped Spline: The first derivatives (slopes) at the endpoints are specified by the user to match a known tangent.
- Not-a-knot: Forces the third derivative to be continuous across the first and last internal knots, effectively treating the first two and last two intervals as single cubic functions.
- Regression creates a general trend line that minimizes errors for noisy data, while Interpolation forces the curve to pass exactly through precise data points.
- Regression uses Normal Equations to derive coefficients that minimize the sum of squared residuals.
- The Coefficient of Determination () quantifies how well the regression line fits the data.
- Linearization allows nonlinear models (exponential, power) to be fitted using linear regression techniques.
- Increasing the polynomial degree too high leads to Overfitting, causing wild non-physical oscillations.
- Extrapolation outside the data bounds is dangerous because fitted polynomials tend to diverge rapidly.
- Runge's Phenomenon causes high-degree polynomials to oscillate on equally spaced points; using Chebyshev Nodes mitigates this.
- Hermite Interpolation matches both the function values and their first derivatives at the data points.
- Piecewise Linear Interpolation connects points with straight lines but has discontinuous slopes.
- Orthogonal polynomials and Fourier series are powerful tools for continuous least-squares fitting, especially for periodic data.
- Newton's and Lagrange polynomials provide global interpolation formulas but are susceptible to Runge's phenomenon for high degrees.
- Inverse Interpolation finds the -value for a given , often by swapping variables or using root-finding methods on the interpolant.
- Cubic Splines provide local, piecewise continuous polynomials that maintain smooth slopes and curvatures across knots, avoiding large oscillations.
- Spline systems require specific Boundary Conditions (like Natural, Clamped, or Not-a-knot) to fully solve the resulting tridiagonal matrix.