Descriptive Statistics

Learning Objectives

Understand the core concepts of descriptive statistics and their application in engineering.
Calculate and interpret measures of central tendency (Mean, Median, Mode) for both raw and grouped data.
Compute and analyze measures of dispersion (Range, Variance, Standard Deviation, Coefficient of Variation).
Apply the Empirical Rule and Chebyshev's Theorem to assess data variability.
Determine and interpret measures of position (Percentiles, Quartiles, Interquartile Range) and identify outliers.
Evaluate the shape of data distributions using skewness and kurtosis.

Descriptive statistics summarize and organize the characteristics of a dataset. They provide simple quantitative summaries of a sample and its measurements, and they form the basis of virtually every quantitative analysis. For civil engineers, these statistics describe the fundamental properties of materials, environmental conditions, and structural behaviors.

Measures of Central Tendency

Central Tendency Overview

These measures indicate the "center" or typical value of a data set.

Mean (Arithmetic Average)

The sum of all values divided by the number of values. It incorporates every data point but is sensitive to extreme outliers (e.g., an unusually high compressive strength reading).

Sample Mean

Formula for calculating the arithmetic average of a dataset.

\bar{x} = \frac{\sum x_i}{n}

Variables

Symbol	Description	Unit
$\bar{x}$	Sample mean	Same as data
$x_i$	Value of each individual observation	Same as data
$n$	Number of observations in the sample	-

Median

The middle value when the data is sorted in ascending or descending order. If there is an even number of observations, it is the arithmetic average of the two middle values. The median is robust and not heavily influenced by extreme outliers, making it a better measure of center for skewed data (e.g., income, or highly variable soil permeability).

Mode

The value that appears most frequently in the data set. A distribution can be unimodal, bimodal (two distinct peaks), or multimodal. It is primarily useful for categorical data (nominal level).
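
The three measures can be compared side by side with a short Python sketch. The readings below are hypothetical compressive strengths chosen only for illustration, with one deliberately extreme value; the variable names are likewise assumptions. The sketch uses the standard-library `statistics` module.

```python
import statistics

# Hypothetical compressive strength readings in MPa (illustrative values only).
# The last reading is deliberately extreme.
strengths = [28.0, 30.0, 30.0, 31.0, 32.0, 33.0, 54.0]

mean_all = statistics.mean(strengths)        # sum of values / number of values
median_all = statistics.median(strengths)    # middle value of the sorted data
modes_all = statistics.multimode(strengths)  # list of the most frequent value(s)

# Drop the extreme reading to see how each measure responds.
trimmed = strengths[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"with outlier:    mean={mean_all:.2f}, median={median_all:.2f}, mode(s)={modes_all}")
print(f"without outlier: mean={mean_trimmed:.2f}, median={median_trimmed:.2f}")
```

For these values, removing the extreme reading moves the mean from 34.00 to 30.67 but the median only from 31.00 to 30.50, which illustrates why the median is preferred for skewed data.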

Mean for Grouped Data

Grouped Data Overview

When a large dataset is available only as a frequency distribution table, the individual values are unknown, so the exact mean cannot be calculated. Instead, we approximate it by treating every observation in a class as if it were located at the class midpoint.

Grouped Mean

An approximation of the mean for grouped data calculated using the class midpoints and frequencies.

Grouped Mean

Formula for approximating the mean of data grouped into classes.

\bar{x}_{\text{grouped}} \approx \frac{\sum (f_i \cdot m_i)}{\sum f_i}

Variables

Symbol	Description	Unit
$\bar{x}_{\text{grouped}}$	Approximate mean for grouped data	Same as data
$f_i$	Frequency of the i-th class	-
$m_i$	Midpoint of the i-th class	Same as data
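
As a minimal sketch of the grouped-mean formula, the Python snippet below weights each class midpoint by its frequency. The frequency table and the function name `grouped_mean` are hypothetical and serve only to illustrate the arithmetic.

```python
# Hypothetical frequency table: (lower class limit, upper class limit, frequency).
classes = [(20, 25, 4), (25, 30, 10), (30, 35, 6)]

def grouped_mean(table):
    """Approximate the mean from class limits and frequencies."""
    total_frequency = sum(f for _, _, f in table)
    # Each class contributes frequency * midpoint, with midpoint = (lower + upper) / 2.
    weighted_sum = sum(f * (lower + upper) / 2 for lower, upper, f in table)
    return weighted_sum / total_frequency

approx_mean = grouped_mean(classes)
print(f"approximate grouped mean = {approx_mean:.2f}")
```

Here the midpoints are 22.5, 27.5, and 32.5, so the weighted sum is 90 + 275 + 195 = 560 and the approximate mean is 560 / 20 = 28.00.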

Measures of Dispersion (Variability)

Overview of Dispersion

These measures describe the spread, scatter, or variability of the data around the central value.

In engineering, variability is often synonymous with risk or uncertainty. High variance in concrete strength means a less reliable material.

Range

The difference between the maximum and minimum values in the dataset. It is a quick measure of total spread but is highly susceptible to extreme outliers.

Range

Formula for calculating the range of a dataset.

R = x_{\text{max}} - x_{\text{min}}

Variables

Symbol	Description	Unit
$R$	Range of the dataset	Same as data
$x_{\text{max}}$	Maximum value in the dataset	Same as data
$x_{\text{min}}$	Minimum value in the dataset	Same as data

Variance ($s^2$ or $\sigma^2$)

The average of the squared deviations from the mean. Squaring keeps positive and negative deviations from cancelling and weights large deviations more heavily; as a result, variance is expressed in squared units of the data.

Population Variance

Used when the dataset includes the entire population of interest.

\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}

Variables

Symbol	Description	Unit
$\sigma^2$	Population variance	Data units squared
$x_i$	Value of each individual observation	Same as data
$\mu$	Population mean	Same as data
$N$	Total number of observations in the population	-

Sample Variance

Used when working with a sample to estimate the population variance. Dividing by $n - 1$ (the degrees of freedom) instead of $n$ makes $s^2$ an unbiased estimator of $\sigma^2$.

s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}

Variables

Symbol	Description	Unit
$s^2$	Sample variance	Data units squared
$x_i$	Value of each individual observation	Same as data
$\bar{x}$	Sample mean	Same as data
$n$	Number of observations in the sample	-
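
The difference between the two denominators can be seen in a short Python sketch. The data values and the helper name `sample_variance` are assumptions made for illustration; `statistics.pvariance` and `statistics.variance` are the standard-library counterparts of the population and sample formulas.

```python
import statistics

# Hypothetical measurements (illustrative values only).
data = [8.0, 10.0, 12.0, 14.0, 16.0]

def sample_variance(xs):
    """Sample variance written out term by term, dividing by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

pop_var = statistics.pvariance(data)   # divides by N
samp_var = statistics.variance(data)   # divides by n - 1

print(f"population variance = {pop_var:.2f}")
print(f"sample variance     = {samp_var:.2f}")
print(f"hand-written sample variance = {sample_variance(data):.2f}")
```

The squared deviations sum to 40, so the population formula gives 40 / 5 = 8.00 while the sample formula gives 40 / 4 = 10.00.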

Standard Deviation ($s$ or $\sigma$)

The positive square root of the variance. It is expressed in the same units as the original data (e.g., MPa, mm, seconds), making it far easier to interpret practically than variance.

Sample Standard Deviation

Formula for calculating the sample standard deviation.

s = \sqrt{s^2}

Variables

Symbol	Description	Unit
$s$	Sample standard deviation	Same as data
$s^2$	Sample variance	Data units squared

Coefficient of Variation ($CV$)

A measure of relative variability. It expresses the standard deviation as a percentage of the mean, allowing for the comparison of dispersion across datasets with different units or vastly different means.

Coefficient of Variation

Formula for calculating the coefficient of variation.

CV = \frac{s}{\bar{x}} \times 100\%

Variables

Symbol	Description	Unit
$CV$	Coefficient of Variation	%
$s$	Sample standard deviation	Same as data
$\bar{x}$	Sample mean	Same as data
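
The sketch below compares the relative variability of two quantities measured in different units. Both datasets and the function name `coefficient_of_variation` are hypothetical; the calculation assumes a positive mean, since the ratio is not meaningful otherwise.

```python
import statistics

# Hypothetical datasets in different units (illustrative values only).
strength_mpa = [30, 32, 34, 36, 38]   # concrete strength, MPa
deflection_mm = [4, 5, 6, 7, 8]       # beam deflection, mm

def coefficient_of_variation(xs):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(xs) / statistics.mean(xs) * 100.0

cv_strength = coefficient_of_variation(strength_mpa)
cv_deflection = coefficient_of_variation(deflection_mm)

print(f"strength:   s = {statistics.stdev(strength_mpa):.2f} MPa, CV = {cv_strength:.2f}%")
print(f"deflection: s = {statistics.stdev(deflection_mm):.2f} mm,  CV = {cv_deflection:.2f}%")
```

Although the deflection data have the smaller standard deviation (about 1.58 mm versus 3.16 MPa), their CV is larger (about 26.35% versus 9.30%), so they are relatively more variable.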

The Empirical Rule and Chebyshev's Theorem

Interpreting Standard Deviation

Rules for interpreting standard deviation relative to the mean.

The Empirical Rule (Normal Distributions)

If the data distribution is approximately bell-shaped (normal):

Approximately 68% of the data falls within one standard deviation of the mean ($\bar{x} \pm 1s$).
Approximately 95% of the data falls within two standard deviations ($\bar{x} \pm 2s$).
Approximately 99.7% of the data falls within three standard deviations ($\bar{x} \pm 3s$).

Chebyshev's Theorem (Any Distribution)

For any set of data (regardless of the shape of the distribution), the proportion of values that lie within $k$ standard deviations of the mean is at least $1 - \frac{1}{k^2}$, where $k > 1$.

For $k=2$: At least 75% of the data falls within $\bar{x} \pm 2s$.
For $k=3$: At least 88.9% of the data falls within $\bar{x} \pm 3s$.
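
As a rough check of Chebyshev's bound, the Python sketch below counts the share of observations within $k$ sample standard deviations of the mean and compares it with $1 - \frac{1}{k^2}$. The dataset is a hypothetical right-skewed sample, and the function names `fraction_within` and `chebyshev_bound` are illustrative assumptions.

```python
import statistics

# Hypothetical right-skewed sample (illustrative values only).
loads = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

def fraction_within(xs, k):
    """Fraction of observations within k sample standard deviations of the mean."""
    xbar = statistics.mean(xs)
    s = statistics.stdev(xs)
    return sum(1 for x in xs if abs(x - xbar) <= k * s) / len(xs)

def chebyshev_bound(k):
    """Minimum proportion guaranteed by Chebyshev's theorem for k > 1."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

for k in (2, 3):
    observed = fraction_within(loads, k)
    bound = chebyshev_bound(k)
    print(f"k={k}: observed {observed:.3f}, Chebyshev guarantees at least {bound:.3f}")
```

For this sample, 90% of the values lie within two standard deviations, which satisfies the 75% guarantee but falls short of the roughly 95% that the Empirical Rule would suggest for bell-shaped data.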

Measures of Position

Overview of Position

These describe the relative location of a specific data value within the entire dataset.

Percentiles

Values that divide a sorted dataset into 100 equal parts. The $k^{\text{th}}$ percentile ($P_k$) is a value such that at least $k\%$ of the observations are less than or equal to it and at least $(100-k)\%$ are greater than or equal to it.

Quartiles and the Five-Number Summary

Values that divide the sorted data into four equal parts. They form the basis of the Five-Number Summary (Min, $Q_1$, Median, $Q_3$, Max) and the Box Plot visualization.

$Q_1$ (First Quartile): 25th percentile ($P_{25}$)
$Q_2$ (Second Quartile): 50th percentile (Median, $P_{50}$)
$Q_3$ (Third Quartile): 75th percentile ($P_{75}$)

Interquartile Range (IQR)

The range of the middle 50% of the sorted data. It is a robust measure of variability.

Interquartile Range

Formula for calculating the interquartile range.

IQR = Q_3 - Q_1

Variables

Symbol	Description	Unit
$IQR$	Interquartile Range	Same as data
$Q_3$	Third Quartile (75th percentile)	Same as data
$Q_1$	First Quartile (25th percentile)	Same as data

Outlier Detection

Data points are typically considered outliers if they fall below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$.
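
The fence rule can be sketched in Python as follows. This is a minimal sketch that assumes quartiles are taken as the medians of the lower and upper halves of the sorted data; other conventions interpolate and may give slightly different quartiles. The function names `five_number_summary` and `tukey_outliers` and the data values are illustrative, not part of any particular library.

```python
import statistics

def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(xs)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count, the overall median is excluded from both halves.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

def tukey_outliers(xs):
    """Return the values lying outside the 1.5 * IQR fences."""
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < low_fence or x > high_fence]

# Eight illustrative values.
values = [20, 35, 40, 50, 55, 60, 75, 95]
print(five_number_summary(values))   # (20, 37.5, 52.5, 67.5, 95)
print(tukey_outliers(values))        # []

# Replacing the largest value with 150 pushes it past the upper fence of 112.5.
print(tukey_outliers(values[:-1] + [150]))   # [150]
```

The sketch needs at least two observations so that both halves are non-empty.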

Box Plot Example

For the eight sorted values 20, 35, 40, 50, 55, 60, 75, 95, taking $Q_1$ and $Q_3$ as the medians of the lower and upper halves gives:

Median ($Q_2$): 52.5
$Q_1$: 37.5
$Q_3$: 67.5
IQR: 30.0

The fences $[Q_1 - 1.5 \times IQR,\ Q_3 + 1.5 \times IQR]$ are $[-7.5,\ 112.5]$. No value lies outside them, so there are no outliers.

Skewness and Kurtosis

Overview of Skewness and Kurtosis

These measures describe the shape of the data's distribution compared to a standard normal (bell-shaped) curve.

Skewness

A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

Positive Skew (Right-Skewed): The right tail is longer or fatter; the mass of the distribution is concentrated on the left. The Mean is typically greater than the Median.
Negative Skew (Left-Skewed): The left tail is longer or fatter; the mass of the distribution is concentrated on the right. The Mean is typically less than the Median.
Zero Skew: A perfectly symmetric distribution (e.g., the normal distribution) has zero skewness, and its Mean equals its Median.

Kurtosis

A measure of the "tailedness" (heavy or light tails) of the probability distribution. It describes how much of the data is clustered in the extreme tails versus the center, relative to a normal distribution.

Leptokurtic (High Kurtosis, $>3$): Heavier tails than a normal distribution, often accompanied by a sharper, higher peak. Indicates a higher propensity for extreme outliers (critical in risk assessment for extreme loads).
Platykurtic (Low Kurtosis, $<3$): Lighter tails and a flatter peak. Fewer extreme outliers.
Mesokurtic (Kurtosis $\approx 3$): Tails comparable to those of a normal distribution, whose kurtosis is exactly 3.
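
Both shape measures can be estimated from the central moments of the data. The Python sketch below uses the simple moment-based definitions (third and fourth central moments divided by powers of the second), with kurtosis on the scale where a normal distribution equals 3. Statistical packages often apply small-sample corrections or report excess kurtosis (kurtosis minus 3), so their output may differ. The function name `moment_shape` and both datasets are hypothetical.

```python
def moment_shape(xs):
    """Return (skewness, kurtosis) from central moments that divide by n."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2                     # normal distribution gives 3
    return skewness, kurtosis

# Illustrative values only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

skew_sym, kurt_sym = moment_shape(symmetric)
skew_right, kurt_right = moment_shape(right_skewed)

print(f"symmetric:    skewness={skew_sym:.3f}, kurtosis={kurt_sym:.3f}")
print(f"right-skewed: skewness={skew_right:.3f}, kurtosis={kurt_right:.3f}")
```

The evenly spaced symmetric sample has zero skewness and a kurtosis of 1.7 (platykurtic), while the sample with a long right tail has positive skewness.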

Key Takeaways

Mean: Arithmetic average, highly sensitive to extreme values or outliers.
Median: The exact middle value, robust against outliers, providing a better center for skewed data.
Mode: The most frequent value, useful for categorical analysis.
Range: Quick measure of the total spread, easily distorted by outliers.
Variance: Average squared deviation from the mean (use $n-1$ for sample variance to correct for bias).
Standard Deviation: The most common measure of spread, expressed in original data units.
Empirical Rule: Useful heuristic for bell-shaped distributions (68-95-99.7 rule).
Percentiles: Indicate relative standing (e.g., scoring in the 90th percentile).
Quartiles: Divide the dataset into quarters ($Q_1, Q_2, Q_3$).
IQR: The spread of the middle half of the data, critical for robust outlier detection using the $1.5 \times IQR$ rule.
Skewness: Indicates whether data is asymmetric to the left or right of the mean.
Kurtosis: Measures the extremity of tails in a distribution, helping engineers predict the likelihood and severity of extreme, outlier events.
