Maximum Likelihood Estimate
Introduction
This is a post in the series about statistical model fitting. It’s my attempt to explain various approaches and algorithms the way I would have liked to learn them when I first started studying this fascinating topic.
In this post, I will be covering a method that relies exclusively on the data and doesn’t require any auxiliary distributions - the Maximum Likelihood Estimate.
I will assume that you are familiar with:
- The concept of a random variable
- Binomial distribution
- Expected value of a random variable
Is this a fair coin?
Imagine walking down the street one day, and seeing a small crowd of people surrounding what appears to be a clown flipping a coin. He notices you as you approach and waves at you with his other hand. Then, he presents you with a challenge.
You need to guess whether his coin is fair or not. He will toss the coin 10 times, telling you the result of each toss. He will then toss it one more time, and you need to guess the result.
What You See Is All There Is (WYSIATI)
This famous mnemonic, coined by Daniel Kahneman in his book “Thinking, Fast and Slow”, gives a hint on how to proceed.
What would be your guess if you had observed 10 Heads and had no other information to go on?
Suppose you didn’t see the greedy smirk on the clown’s face, nor the long face of the person walking away from the table. You only saw the clown toss 10 Heads in a row. What would be the result of the next one?
Clearly, it’d be Heads. And if he had thrown 10 Tails, you would be stupid not to bet on Tails… wouldn’t you?
As these thoughts run through your head, you are trying to guess a certain property - what to expect from the coin as you keep tossing it:
- 10 Heads - \(E[X] = 1\), so bet Heads
- 10 Tails - \(E[X] = 0\), so bet Tails
- 5 Heads and 5 Tails - \(E[X] = 0.5\), so… make a guess :/
We need to guess the expected value of the Random Variable, otherwise known as the mean of its distribution.
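A tiny simulation makes this concrete. Here is a minimal sketch in Python - the clown’s true bias of 0.8 is purely my assumption, invented just to generate data; in the story, all you ever get to see are the tosses themselves:

```python
import random

random.seed(7)
TRUE_HEADS_PROBABILITY = 0.8  # hypothetical bias, used only to generate data

# Ten tosses: 1 for Heads, 0 for Tails.
tosses = [1 if random.random() < TRUE_HEADS_PROBABILITY else 0 for _ in range(10)]

# "What you see is all there is": summarize the data with its mean.
empirical_mean = sum(tosses) / len(tosses)
print(tosses)
print(empirical_mean)  # our guess for E[X]
```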
Relating data to a Random Variable
Our coin tossing experiment is best represented by a Random Variable with a Binomial Distribution.
A discrete Random Variable is a sampler that pulls items out of a set. How frequently certain values show up is dictated by its probability (mass) function. And its expected value (or mean) tells us what value to expect, on average, as we keep sampling.
But how do we reverse it? How do we guess the mean and parameterize the probability (mass) function given the data that has been pulled out?
Binomial distribution for a coin tossing problem
To recover the expected value of the R.V. (a.k.a. the mean of its distribution), we will look for the parameter value at which the R.V.’s probability (mass) function, evaluated on our observed data, reaches its maximum. And the best way to find the maximum of a function is to solve its derivative for its roots!
So let’s remind ourselves of the Binomial distribution’s p.m.f.:
\[f(\theta, n, k) = \binom{n}{k} \theta^{k} (1 - \theta)^{n-k}\]
Where, in our case:
- \(n\) - total number of coin tosses
- \(k\) - number of positive tosses (in our case let’s assume those are represented by Heads)
- \(\theta\) - a latent parameter, the probability of a single toss coming up Heads, which can be thought of as the fairness of our coin.
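To make the formula concrete, here is a direct translation of the p.m.f. into Python, using math.comb from the standard library (the helper name binomial_pmf is mine):

```python
from math import comb

def binomial_pmf(theta: float, n: int, k: int) -> float:
    """Probability of observing k Heads in n tosses of a coin with P(Heads) = theta."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Probability of seeing 7 Heads in 10 tosses of a perfectly fair coin:
print(binomial_pmf(0.5, n=10, k=7))  # ~0.117
```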
Binomial distribution is a discrete distribution. The word discrete suggests something that’s not continuous, something we wouldn’t be able to differentiate, right?
Misleading vocabulary
When I think about a distribution, I think about its probability function. If its support is a discrete domain, then it’s a discrete distribution. If it’s continuous - like the real number line, for example - then the distribution is continuous.
When I look at the probability mass function (they make you use the term mass to highlight the discreteness of the distribution), I see three parameters:
Parameter | Values | Type |
---|---|---|
\(n\) | \(n \in \{0, 1, ...\}\) | Discrete |
\(k\) | \(k \in \{0, 1, ..., n\}\) | Discrete |
\(\theta\) | \(\theta \in [0, 1]\) | Continuous |
We clearly deal with two discrete parameters, and a continuous one. So which is it then?
It depends on the use case, actually. It also depends on the distribution (some, like the Normal distribution, have only continuous parameters). But the bottom line is that you can change the support of the function. In our case:
- if you were interested in finding a probability of a series of coin tosses given a specific coin, identified by a specific \(\theta\), you would fix its value, and the support would become fully discrete.
- if you were told what the results of the tosses were, and asked to guess the property of the coin, its \(\theta\), you would fix \(n\) and \(k\), and the continuous \(\theta\) would become its support.
Confusing, isn’t it? One function, so many different options and use cases.
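A short sketch may untangle it. Below, the same hypothetical binomial_pmf helper is evaluated in both ways: once with \(\theta\) fixed and the data varying, and once with the data fixed and \(\theta\) varying:

```python
from math import comb

def binomial_pmf(theta, n, k):
    # Same hypothetical helper as in the earlier sketch.
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Use case 1: the coin (theta) is fixed, the outcomes are the moving part.
# For a fair coin, what is the probability of each possible number of Heads?
for k in range(11):
    print(k, binomial_pmf(0.5, 10, k))

# Use case 2: the data (n, k) is fixed, theta is the moving part.
# Given 7 Heads out of 10 tosses, how plausible are different coins?
for theta in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(theta, binomial_pmf(theta, 10, 7))
```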
In our case, however, these two parameters are fixed - we are given 10 observations (\(n = 10\)), and \(k\) of them are Heads. What we need to find is the value of \(\theta\) that maximizes the likelihood of our observations:
\[\operatorname{argmax}_{\theta} f(\theta, n, k)\]
Solving the derivative
In order to find the maximum, we’ll use differential calculus - a battle-hardened method for optimizing a function:
\[\frac{df(\theta, n, k)}{d\theta} = 0\]
After a little bit of algebra, we get \(\theta = \frac{k}{n}\).
The result suggests that the parameter of the Binomial Distribution’s probability mass function that maximizes the likelihood of observing the clown’s coin toss results is… the ratio of the number of obtained Heads \(k\) to the number of performed throws \(n\).
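We can sanity-check the algebra numerically. The sketch below (reusing the same hypothetical helper) evaluates the p.m.f. on a dense grid of \(\theta\) values and picks the peak:

```python
from math import comb

def binomial_pmf(theta, n, k):
    # Same hypothetical helper as in the earlier sketch.
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

n, k = 10, 7  # e.g. the clown tossed 7 Heads out of 10

# Evaluate the p.m.f. on a dense grid of theta values and pick the peak.
grid = [i / 1000 for i in range(1001)]
best_theta = max(grid, key=lambda theta: binomial_pmf(theta, n, k))
print(best_theta)  # 0.7
print(k / n)       # 0.7 - matches the closed-form result
```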
Deeper insight - the average
If we assume that Tails are represented by a 0, Heads by a 1, and \(x_i\) represents the \(i^{th}\) coin toss result, then \(k = \sum_{i=1}^{n} x_i\).
Our result then becomes the average value of the observed data:
\[\theta = \frac{k}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i\]
Maximum Likelihood Estimate
The method that relies on finding the maximum of a probability mass/density function is called Maximum Likelihood Estimation.
Its major feature is its independence - the only thing we care about when using it is the data. Unfortunately, that is also one of its major weaknesses, as we’ll see in the subsequent posts about Bayesian methods.
Other benefits of using it are:
- it works with any kind of distribution
- it works for multidimensional distributions
- you can use any kind of optimization method, not necessarily derivatives - see the sketch right after this list.
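To illustrate that last point, here is a minimal sketch that hands the (negated) p.m.f. to a generic off-the-shelf optimizer instead of solving any derivatives by hand - assuming scipy is available:

```python
from math import comb

from scipy.optimize import minimize_scalar

def binomial_pmf(theta, n, k):
    # Same hypothetical helper as in the earlier sketches.
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

n, k = 10, 7  # observed data

# Hand the negated p.m.f. to a generic off-the-shelf optimizer -
# no derivatives solved by hand.
result = minimize_scalar(lambda theta: -binomial_pmf(theta, n, k),
                         bounds=(0.0, 1.0), method="bounded")
print(result.x)  # ~0.7, i.e. k / n
```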
Algebraic derivation of MLE for the Binomial distribution
I really like to dot all the i’s ;) So here’s my attempt at showing you how I derived the formula.
Let’s start with the differentiation:
\[\binom{n}{k} \left( k\theta^{k-1}(1-\theta)^{n-k} + \theta^{k}(n-k)(1-\theta)^{n-k-1} (-1) \right) = 0\]
The derivative of a product breaks down into a sum of two terms (the product rule). And the trailing \((-1)\) comes from taking the derivative of the nested function \(\frac{d(1 - \theta)^{n-k}}{d\theta}\).
Since \(n\) and \(k\) are constants, we can get rid of the ominous binomial coefficient \(\binom{n}{k}\) by dividing both sides of the equation by it (zero divided by anything non-zero remains zero).
\[k\theta^{k-1}(1-\theta)^{n-k} - \theta^{k}(n-k)(1-\theta)^{n-k-1} = 0\]
The subtraction won’t allow us to use division as a way of getting rid of unwanted terms that easily. So my first hunch is to rearrange the terms. That way I can use the equals sign as a mirror and check if there are any symmetrical terms - terms that occur on both sides that I can divide away.
\[k\theta^{k-1}(1-\theta)^{n-k} = \theta^{k}(n-k)(1-\theta)^{n-k-1}\]
The next transformation becomes immediately obvious.
\(\theta^{k}\) can be broken down into \(\theta^{k-1} \cdot \theta\), which gives us \(\theta^{k-1}\) on both sides of the \(=\) sign. The same transformation can be applied to \((1-\theta)^{n-k}\).
\[k\theta^{k-1}(1-\theta)^{n-k-1}(1-\theta) = \theta^{k-1}\theta(n-k)(1-\theta)^{n-k-1}\]
Now we can divide the common factors away, and we get:
\[k(1-\theta) = \theta(n-k)\]
If we expand the terms in the parentheses, we get the term \(-k\theta\) on both sides of the \(=\) sign.
\[k - k\theta = n\theta - k\theta\]
We can get rid of \(-k\theta\), which occurs on both sides of the \(=\) sign:
\[k = n\theta\]
And that leads straight to:
\[\theta = \frac{k}{n}\]
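If you’d rather not take my algebra on faith, a computer algebra system can re-derive the result. Here is a minimal sketch using sympy (assuming it is installed); it works with the logarithm of the p.m.f., which doesn’t move the maximum and lets the binomial coefficient drop out as a constant:

```python
import sympy as sp

theta, n, k = sp.symbols("theta n k", positive=True)

# Log of the binomial p.m.f.; the binomial coefficient does not depend on theta,
# so it is omitted - it would vanish after differentiation anyway.
log_likelihood = k * sp.log(theta) + (n - k) * sp.log(1 - theta)

# Differentiate with respect to theta and solve for the critical point.
critical_points = sp.solve(sp.Eq(sp.diff(log_likelihood, theta), 0), theta)
print(critical_points)  # [k/n]
```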
Literature
- Thinking, Fast and Slow, Daniel Kahneman
- Probability and Statistics, Morris DeGroot, Mark Schervish