Descriptive Statistics

Learning Objectives

  • Understand the core concepts of descriptive statistics and their application in engineering.
  • Calculate and interpret measures of central tendency (Mean, Median, Mode) for both raw and grouped data.
  • Compute and analyze measures of dispersion (Range, Variance, Standard Deviation, Coefficient of Variation).
  • Apply the Empirical Rule and Chebyshev's Theorem to assess data variability.
  • Determine and interpret measures of position (Percentiles, Quartiles, Interquartile Range) and identify outliers.
  • Evaluate the shape of data distributions using skewness and kurtosis.

Descriptive statistics summarize and organize the characteristics of a dataset. They provide simple quantitative summaries of a sample and its measurements, and they form the basis of virtually every quantitative analysis of data. For civil engineers, these statistics describe the fundamental properties of materials, environmental conditions, and structural behaviors.

Measures of Central Tendency

Central Tendency Overview

These measures indicate the "center" or typical value of a data set.

Mean (Arithmetic Average)

The sum of all values divided by the number of values. It incorporates every data point but is sensitive to extreme outliers (e.g., an unusually high compressive strength reading).

Sample Mean

Formula for calculating the arithmetic average of a dataset.

$$\bar{x} = \frac{\sum x_i}{n}$$

Variables

Symbol | Description | Unit
$\bar{x}$ | Sample mean | -
$x_i$ | Value of each individual observation | -
$n$ | Number of observations in the sample | -

Median

The middle value when the data is sorted in ascending or descending order. If there is an even number of observations, it is the arithmetic average of the two middle values. The median is robust and not heavily influenced by extreme outliers, making it a better measure of center for skewed data (e.g., income, or highly variable soil permeability).

Mode

The value that appears most frequently in the data set. A distribution can be unimodal, bimodal (two distinct peaks), or multimodal. It is primarily useful for categorical data (nominal level).
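The three measures can be compared on a single dataset. The sketch below uses Python's standard `statistics` module; the compressive-strength readings are hypothetical values chosen so that one unusually high reading pulls the mean above the median.

```python
import statistics

# Hypothetical compressive-strength readings in MPa (illustrative values only).
# The last reading is deliberately extreme to show its effect on the mean.
strengths = [28.5, 30.1, 29.4, 30.1, 31.0, 29.4, 45.0]

mean_value = statistics.mean(strengths)      # uses every reading
median_value = statistics.median(strengths)  # middle reading after sorting
modes = statistics.multimode(strengths)      # all values tied for most frequent

print(f"mean   = {mean_value:.2f} MPa")
print(f"median = {median_value:.2f} MPa")
print(f"modes  = {modes}")
```

Here two values are tied for most frequent, so the sample is bimodal, and the single high reading moves the mean but leaves the median unchanged.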

Mean for Grouped Data

Grouped Data Overview

When a large dataset is available only as a frequency distribution table, the individual observations are unknown, so the exact mean cannot be calculated. Instead, we approximate it by treating every observation in a class as if it sat at the class midpoint.

Grouped Mean

An approximation of the mean for grouped data calculated using the class midpoints and frequencies.

Grouped Mean

Formula for approximating the mean of data grouped into classes.

$$\bar{x}_{\text{grouped}} \approx \frac{\sum (f_i \cdot m_i)}{\sum f_i}$$

Variables

Symbol | Description | Unit
$\bar{x}_{\text{grouped}}$ | Approximate mean for grouped data | -
$f_i$ | Frequency of the i-th class | -
$m_i$ | Midpoint of the i-th class | -
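As a minimal sketch of the grouped-mean approximation, the snippet below works through a small frequency table. The class limits and frequencies are hypothetical, and `grouped_mean` is a helper name introduced here for illustration.

```python
# Hypothetical frequency table: (lower class limit, upper class limit, frequency).
classes = [
    (20, 25, 4),
    (25, 30, 10),
    (30, 35, 5),
    (35, 40, 1),
]


def grouped_mean(table):
    """Approximate the mean from class midpoints m_i and frequencies f_i."""
    total_frequency = sum(f for _, _, f in table)
    # Each class contributes f_i * m_i, with m_i = (lower + upper) / 2.
    weighted_sum = sum(f * (lower + upper) / 2 for lower, upper, f in table)
    return weighted_sum / total_frequency


approx_mean = grouped_mean(classes)
print(f"approximate mean = {approx_mean:.2f}")
```

The result is only as accurate as the assumption that observations are centered on each class midpoint; narrower classes generally give a closer approximation.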

Interactive Simulation

Descriptive Statistics Explorer: an interactive tool for exploring measures of central tendency and dispersion in civil engineering scenarios. Its example dataset (n = 5) is 5, 8, 12, 15, 20, which gives:

Statistic | Symbol | Value | Meaning
Mean | $\bar{x}$ | 12.00 | Average of all values
Median | $\tilde{x}$ | 12.00 | Middle value
Mode | $\text{Mo}$ | None | Most frequent
Range | $R$ | 15.00 | Max - Min
Variance | $s^2$ | 34.50 | Dispersion squared
Std. Dev. | $s$ | 5.87 | Typical deviation

Measures of Dispersion (Variability)

Overview of Dispersion

These measures describe the spread, scatter, or variability of the data around the central value.

In engineering, variability is often synonymous with risk or uncertainty. High variance in concrete strength means a less reliable material.

Range

The difference between the maximum and minimum values in the dataset. It is a quick measure of total spread but is highly susceptible to extreme outliers.

Range

Formula for calculating the range of a dataset.

$$R = x_{\text{max}} - x_{\text{min}}$$

Variables

Symbol | Description | Unit
$R$ | Range of the dataset | -
$x_{\text{max}}$ | Maximum value in the dataset | -
$x_{\text{min}}$ | Minimum value in the dataset | -

Variance ($s^2$ or $\sigma^2$)

The average of the squared deviations from the mean. It quantifies how far the data points lie from the center, expressed in squared units of the original data.

Population Variance

Used when the dataset includes the entire population of interest.

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

Variables

Symbol | Description | Unit
$\sigma^2$ | Population variance | -
$x_i$ | Value of each individual observation | -
$\mu$ | Population mean | -
$N$ | Total number of observations in the population | -

Sample Variance

Used when working with a sample to estimate the population variance. It uses n-1 (degrees of freedom) in the denominator to provide an unbiased estimate.

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

Variables

Symbol | Description | Unit
$s^2$ | Sample variance | -
$x_i$ | Value of each individual observation | -
$\bar{x}$ | Sample mean | -
$n$ | Number of observations in the sample | -
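The difference between the two denominators is easiest to see on a small dataset. The following sketch uses `statistics.pvariance` (divides by N) and `statistics.variance` (divides by n - 1) from Python's standard library, and checks the sample formula by hand. The data values are illustrative, and `sample_variance_manual` is a helper name introduced here.

```python
import statistics

data = [5, 8, 12, 15, 20]  # small illustrative dataset

population_variance = statistics.pvariance(data)  # sum of squares / N
sample_variance = statistics.variance(data)       # sum of squares / (n - 1)


def sample_variance_manual(values):
    """Apply the sample-variance formula directly."""
    n = len(values)
    mean_value = sum(values) / n
    sum_of_squares = sum((x - mean_value) ** 2 for x in values)
    return sum_of_squares / (n - 1)


print(f"population variance = {population_variance:.2f}")
print(f"sample variance     = {sample_variance:.2f}")
print(f"manual check        = {sample_variance_manual(data):.2f}")
```

The sample variance is always the larger of the two for the same data, and the gap shrinks as n grows.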

Standard Deviation ($s$ or $\sigma$)

The positive square root of the variance. It is expressed in the same units as the original data (e.g., MPa, mm, seconds), making it far easier to interpret practically than variance.

Sample Standard Deviation

Formula for calculating the sample standard deviation.

$$s = \sqrt{s^2}$$

Variables

Symbol | Description | Unit
$s$ | Sample standard deviation | -
$s^2$ | Sample variance | -

Coefficient of Variation ($CV$)

A measure of relative variability. It expresses the standard deviation as a percentage of the mean, allowing for the comparison of dispersion across datasets with different units or vastly different means.

Coefficient of Variation

Formula for calculating the coefficient of variation.

$$CV = \frac{s}{\bar{x}} \times 100\%$$

Variables

Symbol | Description | Unit
$CV$ | Coefficient of Variation | %
$s$ | Sample standard deviation | -
$\bar{x}$ | Sample mean | -
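To illustrate why relative variability matters, the sketch below compares two hypothetical datasets measured in different units. The readings and the helper name `coefficient_of_variation` are assumptions made for this example.

```python
import statistics


def coefficient_of_variation(values):
    """Return CV as a percentage: sample standard deviation / mean * 100."""
    mean_value = statistics.mean(values)
    if mean_value == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean_value * 100


# Hypothetical readings in different units (illustrative values only).
strength_mpa = [30, 32, 28, 31, 29]          # concrete strength, MPa
deflection_mm = [2.0, 3.0, 1.0, 2.5, 1.5]    # beam deflection, mm

cv_strength = coefficient_of_variation(strength_mpa)
cv_deflection = coefficient_of_variation(deflection_mm)

print(f"strength:   s = {statistics.stdev(strength_mpa):.2f} MPa, CV = {cv_strength:.1f}%")
print(f"deflection: s = {statistics.stdev(deflection_mm):.2f} mm,  CV = {cv_deflection:.1f}%")
```

The strength data has the larger standard deviation in absolute terms, yet the deflection data is far more variable relative to its mean, which only the CV reveals.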

The Empirical Rule and Chebyshev's Theorem

Interpreting Standard Deviation

Rules for interpreting standard deviation relative to the mean.

The Empirical Rule (Normal Distributions)

If the data distribution is approximately bell-shaped (normal):

  • Approximately 68% of the data falls within one standard deviation of the mean ($\bar{x} \pm 1s$).
  • Approximately 95% of the data falls within two standard deviations ($\bar{x} \pm 2s$).
  • Approximately 99.7% of the data falls within three standard deviations ($\bar{x} \pm 3s$).

Chebyshev's Theorem (Any Distribution)

For any set of data (regardless of the shape of the distribution), the proportion of values that lie within $k$ standard deviations of the mean is at least $1 - \frac{1}{k^2}$, where $k > 1$.

  • For $k = 2$: At least 75% of the data falls within $\bar{x} \pm 2s$.
  • For $k = 3$: At least 88.9% of the data falls within $\bar{x} \pm 3s$.
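A short sketch can compare Chebyshev's guaranteed minimum with the proportion actually observed in a dataset. The readings below are hypothetical and include one extreme value, so the data is not bell-shaped; the function names are introduced here for illustration.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2


def proportion_within(values, k):
    """Observed proportion of values inside mean +/- k sample standard deviations."""
    center = statistics.mean(values)
    spread = statistics.stdev(values)
    inside = [v for v in values if abs(v - center) <= k * spread]
    return len(inside) / len(values)


# Hypothetical readings with one extreme value (illustrative only).
readings = [12, 15, 14, 10, 18, 20, 13, 16, 15, 40]

for k in (2, 3):
    bound = chebyshev_lower_bound(k)
    observed = proportion_within(readings, k)
    print(f"k = {k}: guaranteed >= {bound:.1%}, observed = {observed:.1%}")
```

The observed proportion can exceed the bound by a wide margin. Chebyshev's Theorem gives a conservative floor, while the Empirical Rule gives sharper figures only when the data is approximately normal.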

Measures of Position

Overview of Position

These describe the relative location of a specific data value within the entire dataset.

Percentiles

Values that divide a sorted dataset into 100 equal parts. The $k^{\text{th}}$ percentile ($P_k$) is a value such that at least $k\%$ of the observations are less than or equal to it and at least $(100 - k)\%$ are greater than or equal to it.
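Several conventions exist for computing percentiles from a finite sample, and software packages differ in their defaults. As one hedged illustration, the sketch below implements the nearest-rank convention, which returns an actual observation satisfying the definition above. The function name and data values are assumptions for this example.

```python
import math


def percentile_nearest_rank(values, k):
    """Return the k-th percentile using the nearest-rank convention.

    This is only one of several conventions; interpolating methods
    can return values that lie between observations.
    """
    if not 0 < k <= 100:
        raise ValueError("k must satisfy 0 < k <= 100")
    ordered = sorted(values)
    # Smallest rank such that at least k% of the observations are <= the value.
    rank = math.ceil(k * len(ordered) / 100)
    return ordered[rank - 1]


data = [20, 35, 40, 50, 55, 60, 75, 95]  # illustrative values

for k in (25, 50, 90):
    print(f"P{k} = {percentile_nearest_rank(data, k)}")
```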

Quartiles and the Five-Number Summary

Values that divide the sorted data into four equal parts. They form the basis of the Five-Number Summary (Min, $Q_1$, Median, $Q_3$, Max) and the Box Plot visualization.

  • $Q_1$ (First Quartile): 25th percentile ($P_{25}$)
  • $Q_2$ (Second Quartile): 50th percentile (Median, $P_{50}$)
  • $Q_3$ (Third Quartile): 75th percentile ($P_{75}$)

Interquartile Range (IQR)

The range of the middle 50% of the sorted data. It is a robust measure of variability.

Interquartile Range

Formula for calculating the interquartile range.

$$IQR = Q_3 - Q_1$$

Variables

Symbol | Description | Unit
$IQR$ | Interquartile Range | -
$Q_3$ | Third Quartile (75th percentile) | -
$Q_1$ | First Quartile (25th percentile) | -

Outlier Detection

Data points are typically considered outliers if they fall below $Q_1 - 1.5(IQR)$ or above $Q_3 + 1.5(IQR)$.
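The fence rule can be sketched in a few lines. This example assumes the median-of-halves convention for quartiles ($Q_1$ and $Q_3$ are the medians of the lower and upper halves of the sorted data); other conventions can give slightly different quartiles and therefore different fences. The function names are introduced here for illustration.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]  # excludes the middle value when n is odd
    return (
        ordered[0],
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
        ordered[-1],
    )


def find_outliers(values):
    """Return the values lying outside the 1.5 * IQR fences."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


data = [20, 35, 40, 50, 55, 60, 75, 95]
print(five_number_summary(data))
print(find_outliers(data))          # no value lies beyond the fences
print(find_outliers(data + [200]))  # a hypothetical extreme reading is flagged
```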

Interactive Simulation

Interactive Box & Whisker Plot: an interactive tool for exploring quartiles, IQR, and outlier detection in experimental datasets. Its example data values are 20, 35, 40, 50, 55, 60, 75, 95 (plotted on an axis from 0 to 100), which give:

Statistic | Value
Median ($Q_2$) | 52.5
$Q_1$ | 37.5
$Q_3$ | 67.5
IQR | 30.0

Outliers are values beyond the fences $[Q_1 - 1.5\,\text{IQR},\ Q_3 + 1.5\,\text{IQR}]$. This dataset has no outliers.

Skewness and Kurtosis

Overview of Skewness and Kurtosis

These measures describe the shape of the data's distribution compared to a standard normal (bell-shaped) curve.

Skewness

A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

  • Positive Skew (Right-Skewed): The right tail is longer or fatter; the mass of the distribution is concentrated on the left. The Mean is typically greater than the Median.
  • Negative Skew (Left-Skewed): The left tail is longer or fatter; the mass of the distribution is concentrated on the right. The Mean is typically less than the Median.
  • Zero Skew: The distribution is perfectly symmetric (e.g., normal distribution). Mean equals Median.

Kurtosis

A measure of the "tailedness" (heavy or light tails) of the probability distribution. It describes how much of the data is clustered in the extreme tails versus the center, relative to a normal distribution.

  • Leptokurtic (High Kurtosis, $> 3$): Heavy tails and a sharper, higher peak compared to a normal distribution. Indicates a higher propensity for extreme outliers (critical in risk assessment for extreme loads).
  • Platykurtic (Low Kurtosis, $< 3$): Light tails and a flatter peak. Fewer extreme outliers.
  • Mesokurtic (Kurtosis $\approx 3$): The kurtosis of a standard normal distribution.
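Both shape measures can be computed from central moments. The sketch below assumes the simple moment-based (population) definitions, skewness $= m_3 / m_2^{3/2}$ and kurtosis $= m_4 / m_2^2$, where $m_r$ is the $r$-th central moment; statistical packages often apply small-sample corrections or report excess kurtosis (kurtosis minus 3), so their results can differ. The function names and data are illustrative.

```python
def central_moment(values, order):
    """Return the central moment of the given order (divides by n)."""
    mean_value = sum(values) / len(values)
    return sum((v - mean_value) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (undefined if all values are equal)."""
    m2 = central_moment(values, 2)
    return central_moment(values, 3) / m2 ** 1.5


def kurtosis(values):
    """Moment-based kurtosis: m4 / m2**2 (a normal distribution gives about 3)."""
    m2 = central_moment(values, 2)
    return central_moment(values, 4) / m2 ** 2


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 20]  # one large value stretches the right tail

print(f"symmetric:    skewness = {skewness(symmetric):.3f}, kurtosis = {kurtosis(symmetric):.3f}")
print(f"right-skewed: skewness = {skewness(right_skewed):.3f}, kurtosis = {kurtosis(right_skewed):.3f}")
```

The evenly spaced sample has zero skewness and a kurtosis well below 3 (platykurtic), while the sample with a single large value shows positive skewness and a much higher kurtosis.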

Interactive Simulation

Distribution Shape: Skewness & Kurtosis. An interactive tool for visualizing skewness and kurtosis and observing how distributions shift in engineering data. Its default state is skewness $\gamma_1 = 0.0$ (symmetric, zero skew) and kurtosis $\beta_2 = 3.0$ (mesokurtic, normal); the controls range from negative to positive skew and from platykurtic (flat) to leptokurtic (peaked).

Statistical Moments

Skewness measures the asymmetry of the PDF around the mean. A positive skew has a tail extending towards more positive values.

Kurtosis measures the "tailedness" of the distribution. Fatter tails and a sharper peak characterize high kurtosis (leptokurtic).

Key Takeaways
  • Mean: Arithmetic average, highly sensitive to extreme values or outliers.
  • Median: The exact middle value, robust against outliers, providing a better center for skewed data.
  • Mode: The most frequent value, useful for categorical analysis.
  • Range: Quick measure of the total spread, easily skewed by outliers.
  • Variance: Average squared deviation from the mean (use $n - 1$ for sample variance to correct for bias).
  • Standard Deviation: The most common measure of spread, expressed in original data units.
  • Empirical Rule: Useful heuristic for bell-shaped distributions (68-95-99.7 rule).
  • Percentiles: Indicate relative standing (e.g., scoring in the 90th percentile).
  • Quartiles: Divide the dataset into quarters ($Q_1, Q_2, Q_3$).
  • IQR: The spread of the middle half of the data, critical for robust outlier detection using the $1.5 \times IQR$ rule.
  • Skewness: Indicates whether data is asymmetric to the left or right of the mean.
  • Kurtosis: Measures the extremity of tails in a distribution, helping engineers predict the likelihood and severity of extreme, outlier events.