Another Footnote to Plato: Financial Mathematics: Statistics

Probability distributions have a number of properties which help us summarize and characterize them. We'll look at some of these properties, how they are calculated and what they are used for.

I) Mean

The mean, or average or "expected value", we've previously defined as:

$\begin{align*} \mu = E(X) &= \sum_x xf(x) \tag{for discrete random variables} \\ \mu = E(X) &= \int_x xf(x) dx \tag{for continuous random variables} \end{align*}$
When it exists, the mean is always a real number. At times, however, the above formulas fail to produce a real number because the series or integral diverges (for example, it "goes to infinity"). In such cases we say that the mean is "undefined".
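
As a rough illustration of a diverging mean, here is a sketch built on a hypothetical game that is not part of the examples in this post: the payoff is 2^k with probability 1/2^k for k = 1, 2, 3, and so on. Each term of the sum contributes exactly 1, so the partial sums grow without bound and the mean is undefined. The function name `partial_mean` is my own.

```python
from fractions import Fraction

def partial_mean(n_terms):
    """Sum of x * f(x) over the first n_terms outcomes of a hypothetical
    game paying x = 2**k with probability f(x) = 1 / 2**k."""
    return sum(2**k * Fraction(1, 2**k) for k in range(1, n_terms + 1))

# The partial sums never settle down; they equal the number of terms.
for n in (10, 100, 1000):
    print(n, partial_mean(n))
```

Exact fractions are used so that the growth is not blurred by floating-point rounding.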

Example:

Suppose we have the following distribution:

$\begin{align*} f(x) = \left\{ \begin{array}{ll} 25\% &: x = 0 \\ 40\% &: x = 1 \\ 35\% &: x= 2 \end{array} \right\} \end{align*}$
Then the mean of this distribution is:

$\begin{align*} \mu = E(X) &= \sum_x xf(x) \\ &= 0 f(0) + 1f(1) + 2f(2) \\ &= 0 \times 25\% + 1 \times 40\% + 2 \times 35\% \\ &= 1.1 \end{align*}$
One observation to note here is that the mean, or "expected value", need not be a value that can actually occur. You'll get 0, 1 or 2; you can't possibly get 1.1. So "expected value" should not be confused with "what I expect to get".
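
The same calculation can be sketched in a few lines of Python. The dictionary layout and the name `mean` are just my choices for illustration.

```python
# Probability mass function from the example: outcome -> probability.
pmf = {0: 0.25, 1: 0.40, 2: 0.35}

def mean(pmf):
    """Expected value of a discrete random variable: sum of x * f(x)."""
    return sum(x * p for x, p in pmf.items())

mu = mean(pmf)
print(round(mu, 10))  # 1.1
print(mu in pmf)      # False: the mean is not one of the possible outcomes
```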

Units

Now it's worth considering the units of all of this summary information. In the case of the mean, the units are going to be the same thing that one is measuring in. For example, if we're measuring widgets¹ in units of centimeters(cm), the units of the mean value will be cm.

For convenience, I will refer to the units as

$\text{UNITS}$ , the square of the units as

$\text{UNITS}^2$ and so on. So for the mean, the units are simply

$\text{UNITS}$ .

II) Median

The median value is the value at which half of the probability is "to the left" and half of the probability is "to the right". Formally this states that:

$\begin{align*} m\in\Re\text{ is the median of a random variable, $X$, iff:} \\ P(X \leq m)\geq \frac{1}{2}\text{ and }P(X \geq m) \geq \frac{1}{2} \end{align*}$
This simply states that the probability "to the left" of

$m$ must be at least one half and the probability "to the right" of

$m$ must also be at least one half.

Discrete Random Variable Example

Suppose we have the following (as in the above example):

$\begin{align*} f(x) = \left\{ \begin{array}{ll} 25\% &: x = 0 \\ 40\% &: x = 1 \\ 35\% &: x= 2 \end{array} \right\} \end{align*}$
I claim that

$m=1$ is the median value since it satisfies our definition which has two parts. Here's the first part:

$\begin{align*} P(X \leq m) &= P(X=0) + P(X=1) \\ &= f(0) + f(1) \\ &= 25\% +40\% \\ &= 65\% \geq \frac{1}{2}\end{align*}$
So that satisfies the first part. Here's the second part:

$\begin{align*} P(X \geq m) &= P(X=1) + P(X=2) \\ &= f(1) + f(2) \\ &= 40\% +35\% \\ &= 75\% \geq \frac{1}{2} \end{align*}$
And that proves that

$m=1$ is the median for our probability distribution.
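
The two-part check can also be done mechanically for every outcome. This is only a sketch; `is_median` is a name I made up, and it tests nothing more than the definition given above.

```python
pmf = {0: 0.25, 1: 0.40, 2: 0.35}

def is_median(pmf, m):
    """True if P(X <= m) >= 1/2 and P(X >= m) >= 1/2."""
    left = sum(p for x, p in pmf.items() if x <= m)
    right = sum(p for x, p in pmf.items() if x >= m)
    return left >= 0.5 and right >= 0.5

medians = [x for x in pmf if is_median(pmf, x)]
print(medians)  # [1]
```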

Continuous Random Variable

For a continuous random variable, to show that

$m$ is the median, it is sufficient to show the following:

$\int_{-\infty}^m f(x) \, dx = 50\%$
Geometrically, this amounts to measuring the "area under the curve" (

$f(x)$ ) to the left of the candidate median and showing that it's equal to 0.5.
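
Here is a small numerical sketch of that check. It assumes an exponential distribution with rate 2 (my choice, not an example from this post), whose median has the closed form ln(2) divided by the rate, and it approximates the area with a simple midpoint rule.

```python
import math

lam = 2.0  # assumed rate parameter of an exponential distribution

def density(x):
    """Exponential density; it is zero for negative x."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

m = math.log(2) / lam  # closed-form median of the exponential distribution
# The density vanishes below 0, so integrating from 0 is enough.
area_left = integrate(density, 0.0, m)
print(round(area_left, 6))  # 0.5
```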

Units

The units for the median are simply the units which the original attribute was measured in:

$\text{UNITS}$ .

Mean-Median Fallacy

For some distributions, the mean is the same as the median. But this isn't always the case, and people often confuse the two. They will say things like "half of all people are better than average". This may not be the case. In many cases the median may be more or less than the mean. If the median is less than the mean, then fewer than half are better than average, and vice versa.
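
A quick sketch with made-up numbers shows how far apart the two can be. The `incomes` list below is purely hypothetical; one large value drags the mean well above the median.

```python
import statistics

# Hypothetical, made-up values with one large outlier.
incomes = [20, 25, 30, 35, 40, 45, 500]

avg = statistics.mean(incomes)
med = statistics.median(incomes)
share_above_avg = sum(x > avg for x in incomes) / len(incomes)

print(med)                        # 35
print(round(avg, 2))              # 99.29
print(round(share_above_avg, 3))  # 0.143, far fewer than half
```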

III) Variance

If our probability distribution produced the same result every time, there would be no variability: no matter what occurred, we would always see the mean value. The only probability distribution that could do that would be as follows:

$\begin{align*} f(x) = \left\{ \begin{array}{ll} 1 &: x = \mu \\ 0 &: x \neq \mu \end{array} \right\} \end{align*}$
Each time we performed an experiment on something that followed the above distribution, we would get the same value,

$\mu$ .

For most probability distributions, the results aren't always the same; they vary. Variance is the conventional way to quantify that varying.

The variance, which is sometimes notated as

$V(X)$ , is defined as the second central moment:

$V(X) = E[(X-\mu)^2]$
It can be shown (see here), due to the linearity property of expected value, that this is equivalent to:

$V(X) = E(X^2) - E(X)^2$
Sometimes the above is more convenient to calculate, especially when the raw moments are known.
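
For reference, the derivation is short. Expanding the square and using the linearity of expected value, with $\mu = E(X)$ treated as a constant:

$\begin{align*} E[(X-\mu)^2] &= E(X^2 - 2\mu X + \mu^2) \\ &= E(X^2) - 2\mu E(X) + \mu^2 \\ &= E(X^2) - 2\mu^2 + \mu^2 \\ &= E(X^2) - E(X)^2 \end{align*}$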

An alternative way to notate the variance is using the Greek letter sigma,

$\sigma^2$ . The reason why it's squared will be discussed below.

Example:

We'll stick with the example we've been using. In this case we want to calculate the variance:

$\sigma^2 = V(X) = E[(X- \mu)^2] = E(X^2) - E(X)^2$
We'll use both formulas to show that they obtain the same result. Recall that we already found

$\mu=E(X)=1.1$ . Let's start with the first:

$\begin{align*} \sigma^2 = E[(X- \mu)^2] &= E[(X - 1.1)^2] \\ &= \sum_{x=0}^2 (x - 1.1)^2f(x) \\ &= (0 - 1.1)^2 \times 25\% + (1 - 1.1)^2 \times 40\% + (2-1.1)^2 \times 35\% \\ &= 0.59 \end{align*}$
To use the other formula we need to find the second raw moment:

$\begin{align*} E(X^2) &= \sum_{x=0}^2 x^2f(x) \\ &= 0^2 \times 25\% + 1^2 \times 40\% + 2^2 \times 35\% \\ &= 1.8 \end{align*}$
Now we subtract the square of the mean per the formula and hopefully get the same result:

$\begin{align*} V(X) &= E(X^2) - E(X)^2 \\ &= 1.8 - 1.1^2 \\ &= 0.59 \end{align*}$
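
Both routes can be checked with a short sketch. The helper `expect` is a name of my own; it just computes the expected value of a function of X for a discrete distribution.

```python
pmf = {0: 0.25, 1: 0.40, 2: 0.35}

def expect(pmf, g):
    """Expected value of g(X) for a discrete random variable."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expect(pmf, lambda x: x)

# Definition: the second central moment.
var_central = expect(pmf, lambda x: (x - mu) ** 2)
# Shortcut: second raw moment minus the square of the mean.
var_raw = expect(pmf, lambda x: x ** 2) - mu ** 2

print(round(var_central, 10), round(var_raw, 10))  # 0.59 0.59
```
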
Units

The units for the variance are the square of the original measurement units:

$\text{UNITS}^2$ .

Standard Deviation

It's often desirable to calculate the variability in terms of a measurement that has the same units as the original measurement,

$\text{UNITS}$ . As a result, we'll define the standard deviation as the square root of the variance:

$\sigma = \sqrt{\sigma^2} = \sqrt{E[(X-\mu)^2]}$
In finance, the standard deviation frequently gets referred to as "volatility".

Example:

Sticking with our previous example, we have already calculated the variance to be

$\sigma^2 = 0.59$ . As a result, the standard deviation will be:

$\begin{align*} \sigma &= \sqrt{\sigma^2} \\ &= \sqrt{0.59} \\ &\approx 0.768 \end{align*}$

Units

As noted above, the units for the standard deviation are

$\text{UNITS}$ .

Absolute Deviation

The choice to define variability in terms of the square of the difference (variance) is a standard convention that statisticians have adopted. This isn't the only way to define it. We could also define it in terms of the absolute value of the difference:

$E(|X - \mu|)$
Whichever way one defines it, it is important to remain consistent. It's often the case that people conflate standard deviation with absolute deviation.² The two are clearly different and obtain different results.

Example:

Sticking with the same example, the absolute deviation is:

$\begin{align*} E(|X - \mu|) &= |0 - 1.1| \times 25\% + |1-1.1| \times 40\% + |2-1.1|\times 35\% \\ &= 0.63 \end{align*}$
As you can see, this differs from the result we obtained for the standard deviation,

$\sigma \approx 0.768$ .
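
To see the two measures side by side, here is a minimal sketch for the same distribution. The variable names are mine.

```python
import math

pmf = {0: 0.25, 1: 0.40, 2: 0.35}
mu = sum(x * p for x, p in pmf.items())

# Standard deviation: square root of the expected squared difference.
std_dev = math.sqrt(sum((x - mu) ** 2 * p for x, p in pmf.items()))
# Absolute deviation: expected absolute difference.
abs_dev = sum(abs(x - mu) * p for x, p in pmf.items())

print(round(std_dev, 3), round(abs_dev, 3))  # 0.768 0.63
```

The absolute deviation can never exceed the standard deviation, so treating one as the other tends to misstate variability in a consistent direction.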

Units

Like standard deviation, this has the same units as the units of the measurement taken,

$\text{UNITS}$ .

IV) Skewness
The standard formal definition of skewness is the third standardized moment which is the same as the third central moment divided by the standard deviation cubed:

$\frac{\mu_3}{\sigma^3} = \frac{E[(X-\mu)^3]}{\sigma^3}$
To better appreciate what skewness is, I think the seesaw analogy is appropriate.

Seesaw Analogy

A seesaw is a lever with a fulcrum at the center. It's often a toy that children play on.

If we only consider the force of gravity and place two objects on either end (for example, two children), then the side with the heavier object will go down while the side with the lighter object will go up. If both objects have the same mass, the seesaw remains in equilibrium.

But this assumes that the objects are also at the same distance. For it is possible to move the heavier object closer to the fulcrum at the center and balance it out that way. Consider the following diagram (from here):

If you multiply the larger mass (

$M_1$ ) at a shorter distance (

$a$ ) and it is equal to the smaller mass (

$M_2$ ) multiplied by the longer distance (

$b$ ), then the lever will be in balance.

Skewness is a statistical measure of the overall "balance" of a statistical distribution about the mean. The fulcrum can be thought of as the "mean". If the distribution is "balanced" it has a skew of 0.

The lever illustrates a few things. First, even if the lever is not entirely symmetric (namely, the fulcrum is not centered and the masses on each side are not equal), it's still possible to balance the lever. So it is with skewness as well. While it is the case that all symmetric distributions have skew of 0, it's possible to have asymmetric distributions that also have skew of 0.
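
To make the last claim concrete, here is a sketch of one asymmetric distribution whose third central moment is exactly zero. The distribution is a hypothetical one I constructed for this purpose (outcomes -2, 1 and 3 with probabilities 40%, 50% and 10%); exact fractions avoid any rounding doubt.

```python
from fractions import Fraction as F

# Hypothetical distribution chosen so that it balances without being symmetric.
pmf = {-2: F(4, 10), 1: F(5, 10), 3: F(1, 10)}

mu = sum(x * p for x, p in pmf.items())
third_central = sum((x - mu) ** 3 * p for x, p in pmf.items())

# Symmetric about the mean would require the mirror image of each outcome
# (2 * mu - x) to carry the same probability.
is_symmetric = all(pmf.get(2 * mu - x) == p for x, p in pmf.items())

print(mu, third_central, is_symmetric)  # 0 0 False
```

Since the third central moment is zero, the skewness is zero as well, even though the distribution is plainly lopsided.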

The second thing that the lever illustrates is that if you have a smaller mass far enough out, you can actually tilt the lever towards that end. For example, if

$b$ in the above equation was larger such that:

$M_1 \times a < M_2 \times b$
you could lift the larger mass. In this case we would say it has "positive" skew. If, on the other hand, the left side was weighted down, we would say we have "negative" skew.

Calculation

The skew or skewness is defined as the third standardized moment which is the third central moment divided by the standard deviation cubed:

$\begin{align*} \frac{\mu_3}{\sigma^3} &= \frac{E[(X-\mu)^3]}{\sigma^3} \end{align*}$
When you work out this calculation, it will tend to be positive when there are more data points further away from the mean to the right, while the left side has more weight near the mean. The result is that the positively and negatively skewed distributions will tend to look like this:

Example

We'll stick with our previous probability distribution and calculate the skewness for that distribution. First we'll calculate the third central moment:

$\begin{align*} E[(X-\mu)^3] &= (0-1.1)^3 \times 25\% + (1-1.1)^3 \times 40\% + (2-1.1)^3 \times 35\% \\ &= -0.078 \end{align*}$
Now recall we previously found the standard deviation to be

$\approx 0.768$ . As a result, the skewness is:

$\begin{align*} \frac{\mu_3}{\sigma^3} &= \frac{-0.078}{0.768^3} \\ &\approx -0.172 \end{align*}$
So there's a little bit of a negative skew.
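
The calculation above can be reproduced with a short sketch. The helper names `central_moment` and `skewness` are my own.

```python
pmf = {0: 0.25, 1: 0.40, 2: 0.35}

def central_moment(pmf, k):
    """k-th central moment E[(X - mu)^k] of a discrete distribution."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** k * p for x, p in pmf.items())

def skewness(pmf):
    """Third central moment divided by the standard deviation cubed."""
    return central_moment(pmf, 3) / central_moment(pmf, 2) ** 1.5

print(round(central_moment(pmf, 3), 3))  # -0.078
print(round(skewness(pmf), 3))           # -0.172
```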

Units

The units for the third central moment are

$\text{UNITS}^3$ . Likewise, the standard deviation cubed has units of

$\text{UNITS}^3$ . So when we look at skewness, which is the ratio of these two figures, we get:

$\begin{align*} \frac{\text{UNITS}^3}{\text{UNITS}^3}&= 1 \end{align*}$

So skewness has no units; it's a dimensionless quantity.
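
One way to convince yourself of this is to change the units and recompute. The sketch below assumes the outcomes of our example are measured in cm and converts them to mm (an assumption made only for illustration); the skewness should not move.

```python
def skewness(pmf):
    """Third central moment divided by the standard deviation cubed."""
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    third = sum((x - mu) ** 3 * p for x, p in pmf.items())
    return third / var ** 1.5

pmf_cm = {0: 0.25, 1: 0.40, 2: 0.35}             # outcomes in cm (assumed)
pmf_mm = {10 * x: p for x, p in pmf_cm.items()}  # same outcomes in mm

# The factor of 10 cubed appears in both numerator and denominator and cancels.
print(round(skewness(pmf_cm), 6), round(skewness(pmf_mm), 6))
```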

V) Kurtosis

The measure of kurtosis is frequently associated with "fat tails". While this is no doubt true, the picture is a bit more complex. But first we'll start with how kurtosis is calculated.

Like skewness, kurtosis is calculated as a standardized moment, particularly the fourth standardized moment. This is defined as the fourth central moment divided by the standard deviation to the fourth power:

$\begin{align*} \frac{\mu_4}{\sigma^4} &= \frac{E[(X-\mu)^4]}{\sigma^4} \end{align*}$

On Fat Tails

So how do you get "fat tails"? Well, let's look at how kurtosis is calculated. If we simply add some probability to the "tails" of the distribution, it does two things:

1) It increases the fourth central moment:

$E[(X-\mu)^4]$ .

2) It increases the variance or the second central moment:

$E[(X-\mu)^2]$ .

To increase kurtosis we want (1). But if we increase the variance (2), then we increase the denominator of our fraction. And that will negate the effects of increasing the numerator. So how can we increase the top without increasing the bottom?

The answer to that is that we must borrow from the "shoulders" and put it on top. By having more probability near the mean, we can help to lower the variance.

What does all this mean? Perhaps some graphs will help illustrate:

Now all of the above distributions have a mean value of 0, variance of 1 and a skewness of 0.

If you look, for example, at the Laplace distribution, which has an excess kurtosis of 3 (more on excess kurtosis in a minute), it's very tall in the center and it's very curved around the shoulders.

Contrast that with the uniform distribution (with excess kurtosis of -1.2) and you get a distribution that has very sharp, broad shoulders.

In summary, to get a high kurtosis, we need to move some of the probability from the shoulders to the top and tails. To get a low kurtosis, we need to move some of the probability from the top and tails to the shoulders.

Excess Kurtosis

The kurtosis for the normal distribution is 3. So if we subtract 3 from the kurtosis, we get the excess kurtosis.

$\begin{align*} \frac{\mu_4}{\sigma^4} -3 &= \frac{E[(X-\mu)^4]}{\sigma^4} -3 \end{align*}$
This allows us to make comparisons with the normal distribution. If the excess kurtosis is positive, then it has "fatter tails" (and taller peaks) compared with the normal distribution. If it's negative then it has "broader shoulders".
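
The excess kurtosis figures quoted for the Laplace and uniform distributions can be checked numerically. This is a sketch under my own assumptions: both densities are scaled to have mean 0 and variance 1, the integrals use a plain midpoint rule, and the Laplace tails are cut off at plus and minus 40 where the density is negligible.

```python
import math

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def excess_kurtosis(density, a, b):
    """Excess kurtosis of a density that is symmetric about 0 and
    negligible outside [a, b], so the mean is taken to be 0."""
    m2 = integrate(lambda x: x ** 2 * density(x), a, b)
    m4 = integrate(lambda x: x ** 4 * density(x), a, b)
    return m4 / m2 ** 2 - 3

scale = 1 / math.sqrt(2)   # Laplace scale that gives variance 1
half_width = math.sqrt(3)  # uniform half-width that gives variance 1

def laplace(x):
    return math.exp(-abs(x) / scale) / (2 * scale)

def uniform(x):
    return 1 / (2 * half_width) if abs(x) <= half_width else 0.0

print(round(excess_kurtosis(laplace, -40, 40), 3))                  # about 3.0
print(round(excess_kurtosis(uniform, -half_width, half_width), 3))  # about -1.2
```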

Example

And if this example hasn't gotten old . . .

First we'll calculate the fourth central moment:

$\begin{align*} E[(X-\mu)^4] &= (0-1.1)^4 \times 25\% + (1-1.1)^4 \times 40\% + (2-1.1)^4 \times 35\% \\ &= 0.5957 \end{align*}$
Now recall we previously found the standard deviation to be

$\approx 0.768$ . As a result, the kurtosis is:

$\begin{align*} \frac{\mu_4}{\sigma^4} &= \frac{0.5957}{0.768^4} \\ &\approx 1.712 \end{align*}$
Now that's the calculation for kurtosis. But if we want excess kurtosis, we need to subtract three:

$1.712-3 = -1.288$ .
So our example distribution is a broad shoulder type of distribution. And that makes a lot of sense when you look at it graphically:

The distribution doesn't have much of anything on the "tails" or the "peak".
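
For completeness, here is the same kurtosis calculation as a sketch in code, with helper names of my own. Carrying full precision gives a kurtosis of about 1.711 rather than 1.712; the small gap comes from rounding the standard deviation to 0.768 before raising it to the fourth power.

```python
pmf = {0: 0.25, 1: 0.40, 2: 0.35}

def central_moment(pmf, k):
    """k-th central moment E[(X - mu)^k] of a discrete distribution."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** k * p for x, p in pmf.items())

# Fourth central moment over the variance squared (sigma to the fourth).
kurtosis = central_moment(pmf, 4) / central_moment(pmf, 2) ** 2
excess = kurtosis - 3

print(round(central_moment(pmf, 4), 4))      # 0.5957
print(round(kurtosis, 3), round(excess, 3))  # 1.711 -1.289
```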

Units

Like skewness, kurtosis (and excess kurtosis) is a dimensionless quantity; it has no units. This is due to the fact that the units for the fourth central moment are

$\text{UNITS}^4$ and the units for the standard deviation to the fourth power are

$\text{UNITS}^4$ as well. The result is this:

$\begin{align*} \frac{\text{UNITS}^4}{\text{UNITS}^4}&= 1 \end{align*}$
So kurtosis (and excess kurtosis) are just numbers, without units.


¹ Recall that widgets are the only things which economists know how to produce.
² For example, in "We Don't Quite Know What We are Talking About When We Talk About Volatility", Goldstein and Taleb find that many finance professionals confuse standard deviation and absolute deviation.