
Numerical Data
🏋️♂️ Goal: Achieve an average of 160 watts
⏲️ Time: 2 minutes per round
💡 Pro Tip: Adjust the resistance to push your wattage higher!
The mean, often called the average, is a common way to measure the center of a distribution of data. For example, assume we record a watt value every second over a two-minute period. To compute the mean watt value, we add up all the watt values and divide by 120, the number of values.
The Sample Mean
Notation
\(\bar{x}\)
\(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\)
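The formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `sample_mean` and the watt readings are made up for the example.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / n


# Hypothetical readings: one watt value per second for two minutes (n = 120).
watts = [150] * 60 + [170] * 60

print(sample_mean(watts))  # 160.0
```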
For the 🚴 Watt Test 🚴 we already know the target mean, e.g., 160 watts.
🤔 Consider how many different ways we can generate that mean value in a two-minute interval.
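To make the point concrete, here is a small sketch with three invented pacing strategies (the names and watt values are assumptions for illustration only). Each produces 120 one-second readings, and each has the same mean even though the individual values differ a lot.

```python
from statistics import mean

# Three hypothetical two-minute efforts, one reading per second.
strategies = {
    "steady": [160] * 120,                 # hold 160 watts throughout
    "slow start": [140] * 60 + [180] * 60, # easy first minute, hard second minute
    "sprints": [100, 220] * 60,            # alternate between 100 and 220 watts
}

for name, watts in strategies.items():
    # Every strategy has 120 readings and the same mean of 160 watts.
    print(f"{name:>10}: n = {len(watts)}, mean = {mean(watts)}, "
          f"min = {min(watts)}, max = {max(watts)}")
```

The mean alone cannot tell these strategies apart, which is why the plots and the measures of spread that follow are useful.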
A dot plot provides the most basic of displays when we are interested in the distribution of a single variable.

We could achieve a similar average in a different way:


We could also consider visualising two variables here as a scatter plot if we include the time aspect.
Consider the following strategy:

We can change the strategy and still maintain the same average.
Consider the following strategy:

A dot plot of the (rounded) loan interest rate data is shown below:
A histogram is a plot that shows the distribution of data by grouping values into bins and displaying their frequencies as bars. A histogram of the loan interest rate data is shown below:
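The binning step behind a histogram can be sketched as follows. The interest rates below are invented for illustration (they are not the loan data from the slides), and `histogram_counts` is a hypothetical helper name; the bin width of 5 is an arbitrary choice.

```python
from collections import Counter


def histogram_counts(values, bin_width, start=0.0):
    """Count how many values fall in each bin [left, left + bin_width)."""
    counts = Counter()
    for v in values:
        index = int((v - start) // bin_width)  # which bin the value belongs to
        left = start + index * bin_width       # left edge of that bin
        counts[left] += 1
    return dict(sorted(counts.items()))


# Hypothetical interest rates in percent.
rates = [5.3, 6.1, 7.9, 9.4, 9.9, 10.4, 11.0, 12.6, 13.1, 17.5]
counts = histogram_counts(rates, bin_width=5.0)

# Draw a rough text histogram: one '#' per observation in the bin.
for left, count in counts.items():
    print(f"[{left:4.1f}, {left + 5.0:4.1f}) " + "#" * count)
```

Changing `bin_width` changes the shape of the display, so it is worth trying a few widths before reading too much into any one picture.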
When working with data we often have a sample from a larger population. The sample mean \(\bar{x}\) then serves as an estimate of the population mean \(\mu\).
Later we’ll explore how reliable \(\bar{x}\) is for estimating \(\mu\)!
The Median
Notation
[1] 11 10 26 10 9 10 17 6 8 13
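A minimal sketch of computing the median for the ten values printed above, assuming the usual definition: sort the data and take the middle value, or the average of the two middle values when the number of observations is even. The function name `sample_median` is made up for this example.

```python
def sample_median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [11, 10, 26, 10, 9, 10, 17, 6, 8, 13]

print(sorted(data))            # [6, 8, 9, 10, 10, 10, 11, 13, 17, 26]
print(sample_median(data))     # 10.0
print(sum(data) / len(data))   # 12.0
```

The single large value 26 pulls the mean up to 12, while the median stays at 10.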
Quantiles are a generalization of the idea of the median.
A quantile partitions the data into proportions.
In general, the (\(\alpha \times 100\))%-quantile (percentile) splits the data such that at least (\(\alpha \times 100\))% of the values are \(\leq\) the quantile value \(\tilde{x}^{(\alpha)}\).
\[ \tilde{x}^{(\alpha)} = \begin{cases} x_{(k)}, & \text{if } n\alpha \text{ is not an integer, where } k \text{ is the smallest integer} > n\alpha \\ \frac{1}{2}\left(x_{(n\alpha)} + x_{(n\alpha + 1)}\right), & \text{if } n\alpha \text{ is an integer} \end{cases} \]
Here \(x_{(i)}\) denotes the \(i\)-th smallest value in the data.
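The case distinction above translates fairly directly into code. The sketch below is one possible reading of that definition, with a made-up function name `quantile`; note that library routines such as `statistics.quantiles` or `numpy.quantile` use interpolation rules of their own by default and may return slightly different values.

```python
import math


def quantile(values, alpha):
    """alpha-quantile following the two-case definition (0 < alpha < 1)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the quantile of an empty sample is undefined")

    position = n * alpha
    nearest = round(position)
    # Case 2: n * alpha is an integer (up to floating-point error).
    if math.isclose(position, nearest) and 1 <= nearest < n:
        # Average x_(n*alpha) and x_(n*alpha + 1); list indices are 0-based.
        return (ordered[nearest - 1] + ordered[nearest]) / 2
    # Case 1: take x_(k), where k is the smallest integer greater than n * alpha.
    k = math.floor(position) + 1
    return ordered[k - 1]


data = [11, 10, 26, 10, 9, 10, 17, 6, 8, 13]
print(quantile(data, 0.25))  # 9
print(quantile(data, 0.50))  # 10.0
print(quantile(data, 0.75))  # 13
```

With \(\alpha = 0.5\) the function returns the median, which is the sense in which quantiles generalize it.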
The range is a measure of dispersion defined as the difference between the maximum and minimum value of the data.
The interquartile range is the difference between the 75% quantile (upper quartile) and 25% quantile (lower quartile).
\[d_{Q} = \tilde{x}^{(0.75)} - \tilde{x}^{(0.25)}\]
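As a worked illustration, take the ten values 11, 10, 26, 10, 9, 10, 17, 6, 8, 13, which sort to 6, 8, 9, 10, 10, 10, 11, 13, 17, 26 (so \(n = 10\)). Applying the quantile rule to these values gives:

\[ n\alpha = 10 \times 0.25 = 2.5 \;\Rightarrow\; k = 3, \quad \tilde{x}^{(0.25)} = x_{(3)} = 9 \]
\[ n\alpha = 10 \times 0.75 = 7.5 \;\Rightarrow\; k = 8, \quad \tilde{x}^{(0.75)} = x_{(8)} = 13 \]
\[ d_{Q} = 13 - 9 = 4, \qquad \text{range} = 26 - 6 = 20 \]

The range is driven entirely by the two extreme values, while the interquartile range describes the spread of the middle half of the data.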
A boxplot is a graphical summary of data that shows its median, quartiles, spread, and potential outliers. Consider a boxplot of the interest rates.
Another measure of dispersion is the variance. The variance is one of the most important measures in statistics.
\[s^2 = \frac{1}{n-1}\sum_{i = 1}^{n}(x_i - \bar{x})^2\]
The variance of the interest rates in the loan dataset is 25.52.
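A short sketch of the variance formula in code, using a small sample of ten values rather than the loan data (which is not reproduced here). The function name `sample_variance` is an assumption for the example; the standard deviation \(s\) is the square root of \(s^2\).

```python
import math


def sample_variance(values):
    """Sample variance: squared deviations from the mean, summed and divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [11, 10, 26, 10, 9, 10, 17, 6, 8, 13]

s2 = sample_variance(data)  # 296 / 9, about 32.89
s = math.sqrt(s2)           # standard deviation, in the same units as the data

print(round(s2, 2), round(s, 2))
```

Dividing by \(n - 1\) rather than \(n\) matches the formula above; Python's `statistics.variance` uses the same convention.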
Which histogram has standard deviation \(s\) = 5, 10, 15, or 20?
Which boxplot has standard deviation \(s\) = 5, 10, 15, or 20?
A scatter plot is a chart that displays individual data points to show the relationship between two variables.