*2018 October 10*

This post is the first in a short sequence of posts intended to introduce and explore the mathematical concept of a **random variable**. In this post, we shall define random variables formally, familiarize ourselves with them, and classify different types of them Then we shall introduce **measures of central tendency** and **measures of spread**, measurements that can be useful in expressing compactly or predicting the behaviour of a random variable (however, as we shall see, each of these has its shortcomings).

Intuitively, the idea of a random variable is familiar to most. It is the output, of sorts, of a random process (or simply a process that is so complex and unpredictable that it serves as random). For example, we may assign a random variable \(X\) to be the result of a fair coin flip, an unfair coin flip, the roll of a die, the roll of many dice, the shuffling of a deck of cards, or the number of people that enter a restaurant over the course of a day. Of course, none of these things is necessarily *truly* random - in a closed and thoroughly monitored physical system (in a laboratory, say), physicists could reliably predict the result of a coin flip by using data about its initial condition and the laws of physics. However, we shall not concern ourselves with this, and shall study random variables in only the most mathematical sense.

The value of any random variable must be drawn from some set of possible outcomes, or **events**. This set is called the **sample space**, and written \(S\). If \(X\) is the result of a coin flip, then \(S=\{\text{heads},\text{tails}\}\). If \(X\) is the result of the roll of a single six-sided die, then \(S=\{1,2,3,4,5,6\}\). If \(X\) is the top card of a shuffled deck, then \(S\) is the entire deck of cards.

However, this does not suffice to define a random variable. The random variable representing the outcome of an unfair coin flip has the same sample space as with a fair coin flip, but the variables are fundamentally different. Thus, in addition to its sample space \(S\), a random variable \(X\) is further defined by its **probability density function** (for continuous random variables) or **probability mass function** (for discrete random variables), written \(p_X:S\mapsto [0,1]\), assigning to each element of \(S\) the *probability* that \(X\) assumes the value of that element. Because \(X\) must be an element of \(S\), and because no two events in \(S\) can both occur, we have that

\[\sum_{x\in S}p_X(x)=1\]

As a convention, we omit from \(S\) any event with probability \(0\).

For example, if \(X\) represents a fair coin flip, then \(S=\{\text{heads},\text{tails}\}\), and \(p_X(\text{heads})=p_X(\text{tails})=0.5\). If \(X\) represents an unfair coin flip, then \(S=\{\text{heads},\text{tails}\}\), and we might have that \(p_X(\text{heads})=0.3\) and \(p_X(\text{tails})=0.7\). If \(X\) represents a random card drawn fairly from a standard deck, and \(c\in S\) is any card in the deck, then \(p_X(c)=1/52\).

As we can see, there are two types of random variables: **qualitative** and **quantitative**. The former refers to a random variable whose value is not a numerical quantity, such as the result of a coin flip (heads or tails) or the result of a random card drawn from a standard deck. The latter refers to a random variable whose value *is* numerical, like the result of a die roll or the time elapsed between two consecutive customers entering a store. In this post, and in future posts, we shall deal chiefly with quantitative variables - this is a math blog, after all. However, it is sometimes possible to “adjust” the sample space of a qualitative random variable in order to make it quantitative; for example, we might replace “heads” with \(1\) and “tails” with \(0\) in the sample space of a random variable determined by a coin flip, or assign to each card in a standard deck a numerical value, giving face cards the values \(11\) through \(13\) and aces the value \(1\).

Random variables may also be classified as either **discrete** or **continuous** (though it is important to note that some random variables fall not in either category). A discrete random variable typically has a finite sample space \(S\), but if its sample space is infinite, it must be *countably infinite*. For example, the random variable counting the number of customers to enter a store over the course of a day could concievably be arbitrarily large, so its sample space is \(S=\mathbb N\) - ridiculously large values, like \(1,000,000\in S\), we are almost certain will never come to pass, but it is concievable that, under exceptional circumstances, it could be attained (in a purely theoretical world, at least), so \(p_X(1,000,000)\ne 0\). A continuous random variable is chosen from an interval (or a connected set); for example, the height of a random person is a continuous random variable, if errors or limitations in our measuring instruments are ignored.

It is worth noting that we must treat the probability density functions (or PDFs) of continuous random variables much differently. Earlier, we wrote that for any random variable \(X\) with sample space \(S\) and PDF \(p_X\),

\[\sum_{x\in S}p_X(x)=1\]

However, this is not so well defined if \(S\) is an uncountable set. We must instead write

\[\int_S p_X(x)dx=1\]

If this integral exists, while \(p_X(x)\) is nonzero for all \(x\in S\), it must necessarily be true that \(P(X=x)=0\) for all \(x\in S\). Because there are so many possible values of \(X\), the probability that any one of them is chosen is equivalently zero; however, the probability that the value of \(X\) falls within some subset \(E\subset S\) can be nonzero, and is given by

\[P(X\in E)=\int_E p_X(x)dx\]

To understand this intuitively, consider the classic dartboard example. If you throw a dart at a dartboard, the probability that the dart will hit the point at the very center of the dartboard is *zero*, but the probability of hitting some point inside of the bullseye is nonzero, since the bullseye has nonzero area (while a point has area \(0\)).

It is worth noting that from now on we shall deal only with quantitative random variables whose sample space is a subset of the real numbers.

Now we shall consider **measures of central tendency**, or numbers that describe the typical or “expected” behaviour of a random variable, intended to allow one to predict the value of a random variable as reliably as possible. First, we shall consider the most familiar measure of central tendency, called the **mean**, or the **arithmetic mean**, defined and denoted as

\[\mathbb E[X]=\mu_X=\bar{X}:=\sum_{x\in S}xp_X(x)\]

This predicts the value of the random variable \(X\) by summing all of the values in its sample space with *weights* according to their respective probabilities. For example, \(X\) is the value obtained by rolling a fair \(6\)-sided die, then

\[\mathbb E[X]=\frac{1}{6}\cdot 1+\frac{1}{6}\cdot 2+...+\frac{1}{6}\cdot 6=3.5\]

Most mathematicians and math students are familiar of the concept of the arithmetic mean, and are already confident in its ability to express the “average” value of a data set or random variable. However, it is worth asking ourselves why exactly this works, and how to demonstrate that it works.

If we wished to predict the value of a random variable with as high accuracy and reliability as possible, we might consider the error in an arbitrary prediction \(\mu\) and attempt to minimize it. If \(X\) assumes the value of \(x\in S\), then the error increases or decreases with the quantity \((\mu -x)^2\). The probability of this “error” occuring is \(p_X(x)\). Thus, we might consider the quantity

\[\sum_{x\in S}(\mu -x)^2p_X(x)\]

to be our expected error; because the values \(x\in S\) with small probability do not occur frequently, we care less about our estimate being close to those values. Note that this strategy is effective at predicting the random variable \(X\) *over the course of many trials*, rather than predicting it only once; after a total of \(n\) trials, by the law of large numbers, if \(n\) is large, the sum of these squared errors will behave asymptotically like

\[n\cdot \sum_{x\in S}(\mu -x)^2 p_X(x)\]

Now we may demonstrate that \(\mu_X\), as defined above, is the value of \(\mu\) that minimizes this quantity. Let us return momentarily to its definition:

\[\mu_X:=\sum_{x\in S}xp_X(x)\]

Now consider the value of the “error” sum with \(\mu=\mu_X\), after performing a few manipulations:

\[\begin{align}
\sum_{x\in S}(\mu_X -x)^2 p_X(x)
&= \sum_{x\in S}\mu_X^2 p_X(x)+\sum_{x\in S}x^2 p_X(x)-\sum_{x\in S}2\mu_X x p_X(x)\\
&= \mu_X^2\sum_{x\in S} p_X(x)+\sum_{x\in S}x^2 p_X(x)-2\mu_X\sum_{x\in S}x p_X(x)\\
&= \mu_X^2+\mathbb E[X^2]-2\mu_X\mathbb E[X]\\
&= \mathbb E[X^2]-\mathbb E^2[X]\\
\end{align}\]

Then suppose that we let \(\mu=\epsilon+\mu_X\), and consider the new value of the sum:

\[\begin{align}
\sum_{x\in S}(\mu_X+\epsilon -x)^2 p_X(x)
&= \sum_{x\in S}(\epsilon^2+2\epsilon(\mu_X-x)+(\mu_X-x)^2)p_X(x)\\
&= \sum_{x\in S}\epsilon^2 p_X(x)+\sum_{x\in S}2\epsilon(\mu_X-x)p_X(x)+\sum_{x\in S}(\mu_X-x)^2 p_X(x)\\
&= \epsilon^2\sum_{x\in S} p_X(x)+2\epsilon\sum_{x\in S}(\mu_X-x)p_X(x)+\sum_{x\in S}(\mu_X-x)^2 p_X(x)\\
&= \epsilon^2+2\epsilon\mu_X-2\epsilon\sum_{x\in S}x p_X(x)+\mathbb E[X^2]-\mathbb E^2[X]\\
&= \epsilon^2+\mathbb E[X^2]-\mathbb E^2[X]\\
\end{align}\]

Thus, varying \(\mu\) by \(\epsilon\) from \(\mu_X\) *increases* the value of our “error” sum, meaning that \(\mu=\mu_X\) *minimizes* the value of the sum, completing our demonstration. The demonstration proceeds similarly for continuous random variables as well.

Some nice properties of the mean of a random variable \(X\) that follow from its definition are additivity

\[\mathbb E[X+Y]=\mathbb E[X]+\mathbb E[Y]\]

and multiplicativity

\[\mathbb E[XY]=\mathbb E[X]\cdot \mathbb E[Y]\]

which hold if \(X\) and \(Y\) are **independent** random variables; that is, if \(P(X=x,Y=y)=P(X=x)P(Y=y)\), or if \(p_{X,Y}(x,y)=p_X(x)p_Y(y)\).

This measure of central tendency is useful, especially in the analysis of games of chance, and whether they are “fair” or worth playing. For example, suppose a “scratcher” lottery game consists of a card with three hidden numbers, each of which is chosen from \(\{0,1,...,9\}\) with equal probability. If the first two numbers are equal, the player wins \(100\) dollars, and if all three are equal, the player wins \(1000\) dollars, and the game costs \(1\) dollar to play. We may represent the expected winnings of this game as a random variable \(X\), and we may calculate its expected value to determine about how much money one would win per game after playing many times. The probability of the first two numbers being equal without the third equaling them is \(9/100\), and the payoff of this outcome is \(100\) dollars. The probability of all three numbers being equal is \(1/100\), with a payoff of \(1000\) dollars, and the probability of the first two numbers being unequal is \(9/10\), with a payoff of \(0\). Notice that

\[\frac{9}{100}+\frac{1}{100}+\frac{9}{10}=1\]

which verifies that we are on the right track. We may now calculate the expected winnings as

\[\mathbb E[X]=\frac{9}{100}\cdot 100+\frac{1}{100}\cdot 1000+\frac{9}{10}\cdot 0=19\]

Thus, since each ticket costs \(1\) dollar to buy, the expected profit is \(18\) dollars per game. That is, if one were to play \(n\) games (for large \(n\)), the asymptotic expected profit would be \(18n\). A word of advice: should you ever come across a lottery ticket with this expected payoff, buy as many as possible.

However, this measure sometimes fails. Consider for example the random variable \(X\) with the sample space \(S=\{2,4,8,16,...\}\) and probability density function defined as \(p_X(2^n)=2^{-n}\), with \(n\ge 1\). We have that

\[2^{-1}+2^{-2}+2^{-3}+...=1\]

affirming that our PDF is well-defined. However, the expected value of \(X\) is given by

\[\mathbb E[X]=2^{-1}\cdot 2+2^{-2}\cdot 2^2+2^{-3}\cdot 2^2+...=\infty\]

Of course, no matter how many times you play this game, you will not win an infinite amount of money, and so \(\mathbb E[X]=\infty\) is not meaningful or practical for predicting the value of \(X\).

Now is the time to introduce the second measure of central tendency: the **median**. A median \(m=\text{med}(X)=\widetilde{X}\) is defined as a real number \(m\) (often an element of the sample space, but not always) such that \[P(X\le m)=P(X\ge m)=1/2\]. For example, if one assigns values \(1\) through \(13\) to the cards of a standard deck, the median numerical value of a random card from the deck is \(7\). Sometimes, with discrete random variables, the median does not exist within the sample space according to this definition; for example, if \(X\) is the result of a fair \(6\)-sided die roll, because \(S\) has an even number of elements, the sample space contains contains no median. In this case, the median is often taken to be the arithmetic mean of the two “middle elements” of \(S\). In the case of a \(6\)-sided die, the median would then be \(3.5\), but could also be taken, by definition, to be any real number in \((3,4)\).

Sometimes no median exists, as is the case with the random variable \(Y\) with sample space {1,2,3} and PDF \(p_Y(1)=0.4\), \(p_Y(2)=0.2\), and \(p_Y(3)=0.4\). In this case, the median is often taken to be a number \(m\) that minimizes the difference between the quantity \(P(Y\le m)\) and \(1/2\); if this convention were taken here, we might let our median be \(m=2\) with \(P(Y\le m)=0.6\).

As another interesting note, just as the “error” sum mentioned earlier using \((\mu-x)^2\) as the measurement of error was minimized by \(\mu_X\), it also holds that the more intuitive “error sum”

\[\sum_{x\in S}|m -x| p_X(x)\]

is minimized by the median, a fact which we shall not prove here.

Let us consider a continuous example. Suppose a continuous random variable \(X\) takes on the value \(x\in\mathbb R^+\) with probability \(p_X(x)=e^{-x}\). We may verify that

\[\int_0^\infty e^{-x}dx=1\]

Now, if \(m\) is the median of \(X\), then it should be true that

\[\int_0^m e^{-x}dx=\frac{1}{2}\]

and by solving this, we have that \(m=\ln(2)\).

If we return once more to the example that “broke” the arithmetic mean, we can see that the median exists, and is equal to \(1.5\) (but can in fact assume any value in \((1,2)\)).

Now we move on from measures of central tendency to measures of spread. The **variance** \(\sigma_X^2\) of a random variable \(X\) and its **standard deviation** \(\sigma_X\) are defined as

\[\sigma_X^2=\sum_{x\in S}(\mu_X-x)^2 p_X(x)\]

or, for continuous random variables,

\[\sigma_X^2=\int_S(\mu_X-x)^2 p_X(x)dx\]

As discussed earlier, the magnitude of this sum expresses how well \(\mu_X\) approximates \(X\) for each of its possible values, weighting according to their probabilities. If all of the possible values of \(X\) are clustered around its mean \(\mu_X\), then the mean approximates \(X\) very well. However, if the possible values of \(X\) are spread out on either side of its mean, then the variance \(\sigma_X^2\) will be much larger, indicating that the spread of the data is greater.

Consider, for example, the random variables \(X\) and \(Y\) defined as having constant PDFs and respective sample spaces \(\{1,2,4,5\}\) and \(\{2,2,4,4\}\). One may easily calculate \(\mathbb E[X]=\mathbb E[Y]=3\). However, the former sample space is more “spread out,” so we would expect its variance to be greater; sure enough, we have \(\sigma_X^2=5/2\) and \(\sigma_Y^2=1\), verifying our suspicions.

The spread of a random variable can also be analyzed using its **percentiles**. We define a \(p\)-th percentile of a data set similarly to the way we defined the median - \(m_{p}\) is a \(p\)-th percentile of \(X\) if \(P(X\le m_p)=p\) and \(P(X\le m_p)=1-p\). Note that the median \(m\) is, by definition, the \(50\)-th percentile, and the other percentiles can be calculated in the same way that we calculated the median earlier. The two other percentiles most commonly calculated are the \(25\)-th and the \(75\)-th percentile, called the **first quartile** and **third quartile** respectively, and the difference between them is called the **interquartile range**, another important measure of the spread of a random variable.

We shall conclude this post with a demonstration of the power of the first measures of central tendency and spread introduced, namely the mean and standard deviation. Let us consider the probability

\[P(|X-\mu_X|\ge k\sigma_X)\]

that is, the probability that the difference between \(X\) and its mean is greater than \(k\) standard deviations. Clearly, one would expect this to decrease as \(k\) increases and to be rather small at that, since as the distribution of \(X\) is “spread out,” its standard deviation increases to account for this. We therefore seek to bound this probability and demonstrate the extent to which the standard deviation of a random variable determines the probability of it assuming an unexpectedly large or small value. To do this, we shall use the **Iverson Bracket**, a simple piece of notation defined as follows: if \(A\) is a proposition, then \([A]=1\) if \(A\) is true and \([A]=0\) if \(A\) is false. We have that

\[P(|X-\mu_X|\ge k\sigma_X)=\mathbb E[[|X-\mu_X|\ge k\sigma_X]]\]

notice that this is equivalent to

\[\mathbb E\bigg[\bigg[\frac{(X-\mu_X)^2}{k^2\sigma_X^2}\ge 1\bigg]\bigg]\]

Finally, note that

\[\bigg[\frac{(X-\mu_X)^2}{k^2\sigma_X^2}\ge 1\bigg]\le \frac{(X-\mu_X)^2}{k^2\sigma_X^2}\]

which implies that the probability under consideration is less than or equal to

\[\mathbb E\bigg[\frac{(X-\mu_X)^2}{k^2\sigma_X^2}\bigg]=\frac{\mathbb E[(X-\mu_X)^2]}{k^2\sigma_X^2}=\frac{1}{k^2}\]

and so we have, as desired,

\[P(|X-\mu_X|\ge k\sigma_X)\le \frac{1}{k^2}\]

This bound, further demonstrating the validity of the measures of central tendency and spread previously discussed, is called **Chebyshev’s Inequality**.

From Chebyshev’s inequality, one can prove that the mean and median of a random variable differ by no more than one standard deviation; this fact is left as an exercise for the reader.

That’s all for now!