Bayesian stats in very plain language
Some years ago, I got into an argument with someone about the relative merits of Bayesian versus maximum likelihood methods in phylogenetics. They asserted the two were basically the same, or would come to the same answers. I countered that while they would often agree, they were measuring different things. Our conversation subsequently got bogged down in a technical discussion that clarified nothing.
Bayesian statistics can be difficult to explain: it involves several foggy and seemingly obscure concepts, and explanations are frequently illustrated (obfuscated?) with cryptic maths. So here is a maths-light approach to help you get an intuitive grasp of the concept.
(Experts, be warned that I'll cut a few corners and gloss over a few things. Criticism is welcome, but be aware of what I'm trying to do here.)
Take 1: outcomes and models
I'll use a few ugly words here, but persist to the nice examples following.
When talking about probability, we usually talk about data and models:
- Outcomes (or observations or data) are the results, the countable things we are directly observing and counting: how many dice come up with a 6, which horse wins a race, how many red balls are pulled out of a bag, etc.
- The model (or hypothesis or system) is the thing that is producing the data, giving rise to it. So it's the set of dice you're rolling (and whether any of them are loaded), all the horses in the race and their relative speeds, the number and colour of all the balls in the bag.
Armed with this idea, we can start to compare Bayesian and conventional statistics ...
Take 2: explaining the model
When we talk about chance and probability in everyday life, we usually talk about how a model explains or causes an outcome:
- The dice were loaded, so that's why you rolled three 6s.
- That horse was the fastest in the race, so it won.
- The bag has hardly any red balls in it, so you probably won't draw any.
Conventional statistics, what we call frequentism, starts with a known model and predicts or explains the results: if half the balls in a bag are red, then it's likely that about half the balls we draw out of the bag will be red too ...
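To make that frequentist direction concrete, here's a quick Python sketch (the bag contents are invented): given a known model, we predict what the outcomes will look like.

```python
import random

# The known model: a bag with half red, half black balls.
bag = ["red"] * 50 + ["black"] * 50

# Predict outcomes: draw (with replacement) many times and count.
draws = [random.choice(bag) for _ in range(10_000)]
red_fraction = draws.count("red") / len(draws)

# With half the balls red, roughly half the draws come out red.
print(round(red_fraction, 2))
```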
But there's a problem here: often in life, it's not the result or outcome that we need to know or understand. We can see the result, what has happened. Instead we want to know about the system that produced the results:
- If I roll three 6s in a row, does that mean these dice are loaded?
- If a specific horse wins a race, what does that tell me about the relative speeds of all horses in the race?
- If I draw 6 red and 3 black balls out of a bag, what does that tell me about the contents of the bag?
This sort of problem is common in scientific research. We do experiments to see how a complicated system (e.g. the human body, millions of years of evolution) behaves and use that to try and deduce how it works.
It's a backwards sort of reasoning, and this is exactly what Bayesian statistics does: using the outcomes, it looks for the model that is best supported by the data. In contrast, methods like maximum likelihood look for the model with the highest probability of producing the data.
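Here's a toy sketch of that difference in Python, using the dice example. The numbers are invented for illustration: assume a loaded die rolls a 6 half the time, and that only 1 die in 1000 is loaded.

```python
# We observe three 6s in a row. Two candidate models: fair, or loaded.
p_data_fair = (1 / 6) ** 3    # P(three 6s | fair die)
p_data_loaded = (1 / 2) ** 3  # P(three 6s | loaded die), assumed

# Maximum likelihood: pick the model with the higher P(data | model).
ml_choice = "loaded" if p_data_loaded > p_data_fair else "fair"

# Bayesian: weight each likelihood by a prior P(model), then compare.
prior_loaded = 0.001  # assumed: loaded dice are rare
prior_fair = 0.999
post_loaded = p_data_loaded * prior_loaded
post_fair = p_data_fair * prior_fair
bayes_choice = "loaded" if post_loaded > post_fair else "fair"

print(ml_choice)     # loaded: three 6s are far more likely from a loaded die
print(bayes_choice)  # fair: loaded dice are so rare that fairness still wins
```

Same data, different answers: maximum likelihood only asks which model best produces the data, while the Bayesian answer also weighs how plausible each model was to begin with.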
Confused? These two things are subtly different. Let me explain.
Take 3: Cancer
This is a classic toy example that I've modified slightly.
Assume there's a test for cancer. If someone has cancer, it will always detect it, 100% of the time. If they don't have cancer, it will usually correctly call this result, 90% of the time. However in 10% of these cases, it will incorrectly report they do have cancer. Looking at this in a table:
| | Cancer detected | Cancer not detected |
|---|---|---|
| Patient has cancer | 100% | 0% |
| Patient doesn't have cancer | 10% | 90% |
This is what we'd call a 0% false negative rate and a 10% false positive rate.
Now let's assume that 1% of people actually have cancer. You go in for a test and unfortunately it reports you have cancer. Statistically, do you actually have cancer?
By conventional statistical approaches, we would say yes:
- If you have cancer, there is a 100% probability we would get the result seen
- If you don't have cancer, there's a 10% chance we would get a positive result

So, the scenario with the greatest probability of producing the result we've seen is that you have cancer. Sorry.
What this approach misses out is the relative probability of the different scenarios. Bayesian stats refers to these things as a prior, what we know or believe about a system before we see any result. (Prior? Before? Get it?)
Let's think of a population of 1000 people and put them into the table:
| | Cancer detected | Cancer not detected |
|---|---|---|
| 10 patients with cancer | 10 | 0 |
| 990 patients without cancer | 99 | 891 |
So, of the 109 people whose tests detect cancer, only 10 actually have it. If you are detected as having cancer, there is roughly a 9% chance you actually do. It's far more likely that the test has got it wrong and you're a false positive.
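The table arithmetic above can be written out as a short Python sketch (using the same made-up numbers: 1000 people, a 1% cancer rate, a 10% false positive rate):

```python
population = 1000
cancer_rate = 0.01           # 1% of people have cancer
false_positive_rate = 0.10   # 10% of healthy people test positive anyway

have_cancer = int(population * cancer_rate)           # 10 people
healthy = population - have_cancer                    # 990 people

true_positives = have_cancer                          # the test always detects cancer
false_positives = int(healthy * false_positive_rate)  # 99 healthy people flagged

total_positives = true_positives + false_positives    # 109 positive tests
p_cancer_given_positive = true_positives / total_positives
print(round(p_cancer_given_positive, 3))  # 0.092
```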
Put it this way:
- The explanation (model) most likely to produce a positive test is that you have cancer
- But if you get a positive test, the most likely explanation (model) is that you don't have cancer
Take 4: maths!
I'll give way and actually put it into an equation here. This is Bayes' theorem:
P(A|B) = P(B|A) * P(A) / P(B)
Which in our case means:
Probability (you have cancer, given a positive test)
= Probability (a positive test, given you have cancer)
* Probability (you have cancer, regardless of the test)
/ Probability (a positive result, regardless of whether you have cancer)
1.0 * 0.01 / 0.109 = 0.092
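The same arithmetic in Python, plugging the numbers straight into Bayes' theorem. The denominator P(B) is built from both ways of getting a positive test: having cancer, or being a healthy false positive.

```python
p_pos_given_cancer = 1.0  # P(B|A): the test always detects cancer
p_cancer = 0.01           # P(A): the prior, 1% of people have cancer
false_positive_rate = 0.10

# P(B): total probability of a positive test, summed over both groups
p_pos = p_pos_given_cancer * p_cancer + false_positive_rate * (1 - p_cancer)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 3))  # 0.092
```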
So, if you get a positive result, there's only about a 9% chance you have cancer. Obviously this is a grossly simplified situation, and using Bayes' theorem here is overkill. But it illustrates a general principle: maximum likelihood picks the single "best" answer, while Bayesian approaches take into account how likely the individual answers are in the first place.