We all love data here. But, don’t get us wrong, we also know that data can be frustrating, misleading, and challenging. Nothing reminds you of that fact more than contemplating some interesting statistical paradoxes and puzzles – those seemingly absurd or self-contradictory phenomena that when investigated or explained may prove to be true.
We usually take data pretty seriously on the blog, so we wanted to share some of the sillier side of statistics. We hope you’ll find plenty of cocktail-party fodder, and we’ve even included some key takeaways that can help you gather some valuable business insights from these statistics puzzles.
Simpson’s Paradox
Simpson’s Paradox demonstrates that what is true in aggregate may not be true for every subset. There may be other variables or important pieces of information that have yet to be considered.
One of the most renowned examples of Simpson’s Paradox happened in 1973 when UC Berkeley was sued for sex discrimination. The evidence against them seemed pretty bad – only 35% of female students who applied were admitted, while 44% of male applicants were admitted.
To get a better understanding of which departments were contributing to this gender discrimination, they took a deeper dive into the data behind each department’s admission rates. Once they looked at each subset contributing to the aggregate data that got them into trouble in the first place, they saw a different story. Out of the six departments, four admitted a higher percentage of women than men. In actuality, the gender bias in most departments was in the women’s favor.
|Department||# of Men||# of Women||Men Accepted||Women Accepted|
So, why did the aggregated data and the categorical data tell a completely different story? There is a confounding variable that is hidden from sight when you look at the summarized data. You can see the same “hidden” variable clearly coming into play if you direct your attention to the acceptance rates and proportions of women and men applying to the program for both Department 1 and Department 6.
It just so happened that a large proportion of women were applying to highly competitive departments with lower acceptance rates. A large proportion of men, on the other hand, were applying to departments that were simply easier to get into. Even though most departments accepted a higher percentage of women than men, the aggregate figures were skewed because so many women applied to the departments that rejected most applicants of either sex.
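To see the reversal in numbers, here is a minimal Python sketch. It is an illustration, not the table from this post: the figures are the commonly cited counts for Berkeley’s six largest departments in fall 1973 (the same ones distributed as R’s `UCBAdmissions` dataset), and the labels A through F and the helper names are our own assumptions. Because these six departments are only part of the applicant pool, the aggregate rates come out a little different from the 35% and 44% quoted above.

```python
# Assumed data: commonly cited fall 1973 figures for the six largest departments,
# stored as (applicants, admitted). Labels A-F are illustrative.
ADMISSIONS = {
    "A": {"men": (825, 512), "women": (108, 89)},
    "B": {"men": (560, 353), "women": (25, 17)},
    "C": {"men": (325, 120), "women": (593, 202)},
    "D": {"men": (417, 138), "women": (375, 131)},
    "E": {"men": (191, 53), "women": (393, 94)},
    "F": {"men": (373, 22), "women": (341, 24)},
}


def rate(applicants, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applicants


def departments_favoring_women(data):
    """Departments where women were admitted at a higher rate than men."""
    return [
        dept
        for dept, groups in data.items()
        if rate(*groups["women"]) > rate(*groups["men"])
    ]


def aggregate_rate(data, group):
    """Admission rate for one group after pooling every department."""
    applicants = sum(groups[group][0] for groups in data.values())
    admitted = sum(groups[group][1] for groups in data.values())
    return rate(applicants, admitted)


for dept, groups in ADMISSIONS.items():
    print(dept, f"men {rate(*groups['men']):.0%}", f"women {rate(*groups['women']):.0%}")

print("Departments favoring women:", departments_favoring_women(ADMISSIONS))
print(f"Pooled: men {aggregate_rate(ADMISSIONS, 'men'):.1%}, "
      f"women {aggregate_rate(ADMISSIONS, 'women'):.1%}")
```

Four of the six departments admit women at a higher rate, yet the pooled rate for men is higher. The reason is visible in the applicant counts: most men applied to A and B, where the majority of applicants got in, while most women applied to C through F, where most applicants were rejected.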
What can we learn from this paradox?
If you are going to work with data, you can’t just accept it at face value. It’s important to think critically and ask questions of your data, such as:
- What is the source of this data?
- Are there any confounding or extraneous variables?
- Is what this data appears to be saying really accurate?
Simpson’s Paradox highlights the problems that can result from combining data from several groups and how data can send misleading messages if you don’t investigate or analyze it properly. There’s nothing worse than letting data trick you into making the wrong decision! In order for data to guide you in the right direction, you need to be a diligent data analyst.
Monty Hall Problem
The Monty Hall Problem is a well-known (and widely debated) probability puzzle. It is named after the original host of Let’s Make a Deal, because it puts you in the position of a contestant on a game show who has to make a choice. It goes like this:
- There are three doors. Behind one door is a car, and there are goats behind the other two doors.
- You pick a door – let’s say it’s Door 1, in this example.
- The game show host, who knows where the car is, opens one of the other two doors to reveal a goat (Door 2).
- The host then gives you a choice to either keep your door (Door 1) or switch to the other remaining unopened door (Door 3).
Knowing that either Door 2 or Door 3 could have had a goat behind it, but now knowing for sure that Door 2 does, would you choose to stick with your original choice (Door 1) or switch to the other remaining door that has yet to be opened (Door 3)? More importantly, does it matter whether you switch or don’t switch?
Your gut may tell you that it’s a 50% chance that the car is behind Door 1 and a 50% chance that the car is behind Door 3. After all, there are two doors and two options for what could be behind the doors. However, that logic isn’t correct. If you switch doors, you’ll win the car 2/3 of the time. The odds change once one of the doors with the goat behind it gets opened, because you have new information that can help you make a decision.
When you pick one door from three, you have a 1/3 chance of choosing the door with a car behind it. If you stick with that choice, you can’t improve your chances of winning. That means the other two doors combined have a 2/3 chance of having a car behind them. When the host opens one of them to reveal a goat, the probability of that door having a car behind it drops to 0. The whole 2/3 probability therefore shifts to the only remaining door from that group.
This table shows another way of looking at your chances of ending up with a car after selecting Door 1 from the beginning:
|Door 1||Door 2||Door 3||Result if Stick with 1||Result if Switch|
|Car||Goat||Goat||Car||Goat|
|Goat||Car||Goat||Goat||Car|
|Goat||Goat||Car||Goat||Car|
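If the 1/3 versus 2/3 split still feels wrong, you can check it yourself. The sketch below is a minimal Python simulation under the usual assumptions: the car is placed at random, and the host always opens a goat door that is not your pick. The function names (`play`, `win_rate`, `exact_win_probability`) are ours, chosen just for this illustration.

```python
import random
from fractions import Fraction

DOORS = (1, 2, 3)


def host_opens(car, pick, rng):
    """The host opens a door that is neither the contestant's pick nor the car."""
    options = [door for door in DOORS if door != pick and door != car]
    return rng.choice(options)


def play(switch, rng):
    """Play one game and return True if the contestant wins the car."""
    car = rng.choice(DOORS)
    pick = rng.choice(DOORS)
    opened = host_opens(car, pick, rng)
    if switch:
        # Move to the one door that is neither the first pick nor the opened door.
        pick = next(door for door in DOORS if door != pick and door != opened)
    return pick == car


def win_rate(switch, trials=20_000, seed=0):
    """Estimate the chance of winning by simulating many games."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials


def exact_win_probability(switch, pick=1):
    """Count the three equally likely car positions, as in the table above."""
    # Sticking wins only when the car is behind the first pick;
    # switching wins in every other case.
    wins = sum(1 for car in DOORS if (car != pick) == switch)
    return Fraction(wins, len(DOORS))


print("Stick: ", exact_win_probability(switch=False), "exact,", win_rate(switch=False), "simulated")
print("Switch:", exact_win_probability(switch=True), "exact,", win_rate(switch=True), "simulated")
```

The simulated rates will wobble a little from run to run if you change the seed, but they should settle close to 1/3 for sticking and 2/3 for switching.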
What can we learn from this puzzle?
At its core, this puzzle is all about re-evaluating your decisions as new information emerges. While your intuition may have been telling you to go with Door 1 at the beginning, you need to re-evaluate that choice once you find out new information that can help you make a better, more informed decision. The more you know about the situation, the better your decision.
Without any evidence or information, two options are equally likely. As you learn more about the situation and gather additional evidence, you can increase your confidence that one option is better than the other. It’s important to recognize how new actions and information can challenge previously held beliefs and decisions.
Base Rate Fallacy
A base rate fallacy occurs when people disregard some relevant information when making a decision about how likely something is. The key information that is disregarded is usually a base rate, probability, or some other statistic.
The false positive paradox is a type of base rate fallacy that easily lends itself to examples. If you are testing for a rare medical condition, using a test that is not 100% accurate, there are two types of errors you can make:
- False positive: You do not have the condition, but the test says that you do. You are misdiagnosed with the condition.
- False negative: You do have the condition, but the test says that you do not. You are misdiagnosed as healthy.
Both of these errors can be troublesome, especially when it comes to medical tests. Researchers are looking to identify reasons for false positives and false negatives in order to improve the accuracy of the diagnosis.
If you are testing someone for a rare medical condition with a test that isn’t 100% accurate, a positive result can be more likely to be a false positive than a correct diagnosis. Let’s say you administer a test to see if someone has a rare medical condition. Even if the test is 99% accurate, the probability that they have the condition (even if the test says they do) is pretty low. The 99% figure describes how often the test is right across the whole population; it is not the chance that any one positive result is correct.
If there are 100,000 people in a population and one person has the rare disease, that one person will probably test positive, because the chance of a false negative is extremely low. However, the test will also be wrong 1% of the time for the other 99,999 people in the population who do not have the disease, which produces roughly 1,000 false positives. Of the roughly 1,001 people who test positive, only one actually has the disease, so a positive result means only about a 0.1% chance of having the condition.
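The same arithmetic can be written as a short calculation using Bayes’ rule. This is a sketch under one stated assumption: we read “99% accurate” as meaning the test is right 99% of the time both for people who have the disease (sensitivity) and for people who don’t (specificity). The function name is ours.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability of actually having the condition, given a positive test."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# One sick person in a population of 100,000, with a test assumed to be
# 99% accurate for both sick and healthy people.
ppv = positive_predictive_value(
    prevalence=1 / 100_000,
    sensitivity=0.99,
    specificity=0.99,
)
print(f"Chance of disease after a positive test: {ppv:.2%}")
```

Try raising `prevalence` to 0.01 (one person in a hundred) and the answer jumps to 50%. The test hasn’t changed at all; only the base rate has.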
What can we learn from this fallacy?
The false positive paradox clearly has implications in the healthcare industry, but the base rate fallacy can be applied across all industries. False negatives can happen in manufacturing, for example. If a defective item passes through inspection, that can impact quality and process efficiency.
In general, if a decision maker is presented with base rate information and specific information, their mind is more likely to focus on the specifics and ignore important, yet generic, information. It’s a type of cognitive bias known as extension neglect, in which the mind ignores the size of the set being considered even when that size is relevant.
In any business environment, it can be easy to get caught up in the excitement of going after a customer segment that has a high average order value, for example. However, you need to make sure you aren’t ignoring the statistics. What proportion does this segment make up of your target audience? If you focus on a segment that is valuable, but small, you may be wasting your time, money, and resources on the wrong people.
It’s important to consider all of the relevant information when making a decision about how likely something is. Casting aside statistics in favor of anecdotal (and oftentimes irrelevant) information to make a decision can be a costly mistake for organizations. If you are realistic about the statistics and base rates involved in the process, you can make more informed projections and forecasts. It can also help you more accurately decide between the different options that are available to you and your organization.