Can statistics be used to make decisions?
In this post, we look at the nature of statistical hypothesis testing: what it is, and why the way it is commonly used may not be theoretically sound!
Introduction
If the weatherman on the news says there is a 95% chance of rain today, you’ll probably go out with an umbrella. Statistics and probabilities have become such integral parts of our lives that we constantly make decisions based on them. But can we always do that? Should we live our lives and leave everything to chance?
God does not play dice — Albert Einstein
In this post, we will be questioning some very fundamental and philosophical issues with the practice of making decisions based on statistical theories.
Acknowledgement: The ideas presented in this post are not mine; they are the culmination of the thoughts of many great minds, especially David Salsburg, author of the book “The Lady Tasting Tea”, so the credit goes to them. If you find any mistakes, that is entirely due to my own limited knowledge and misunderstanding of their ideas.
What is Hypothesis testing?
Before delving into the philosophical questions about whether you should use statistics to make decisions, let us briefly recap what the theory says.
Hypothesis testing is the study of decision-making within the realm of statistics. The setup considers that there are two (possibly contrasting) hypotheses, each representing a decision. For example, in the case of weather forecasting, the two hypotheses are:
It will rain.
It will not rain.
In the case of a trial, the hypotheses are “the defendant is guilty” and “the defendant is not guilty”. In the case of medical drug testing, the hypotheses are “the new medicine does not work” and “the new medicine does work”. Now, in the statistical literature, you need to pick one of these decisions as the default, or “null”, hypothesis: the one that holds most commonly. For the courtroom trial example, “the defendant is not guilty” is the null hypothesis, because most people are not guilty. For the drug testing example, the null hypothesis is that “the drug is no better than the existing one”, which is the common belief about any new drug at the pre-market stage.
Types of errors
Now whenever we make a decision (i.e., choose a hypothesis over another), there is a chance that we make an error. There are two kinds of errors:
The null hypothesis is true, but we reject it and decide to go with the alternative hypothesis (i.e., the other hypothesis). This is called a Type-1 error.
The null hypothesis is actually false, but we still accept it. This is called a Type-2 error.
Jerzy Neyman and Egon Pearson found that it is usually not possible to reduce both errors simultaneously: you have to give up something. Note that accepting the null hypothesis usually requires no active action (you go out as usual), while rejecting the null hypothesis and accepting the alternative requires some active action (you need to find the umbrella, take it along with you, and open it when the rain starts). Since statisticians are lazy, they agreed to stick with the null hypothesis unless there is compelling evidence against it; hence they decided to control the Type-1 error.
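To see this trade-off concretely, here is a minimal sketch in Python (my own illustration, not from the sources above), assuming a one-sided test of a normal mean with known variance: as we tighten the Type-1 error bound, the Type-2 error grows.

```python
# A minimal sketch (my own illustration) of the Type-1 / Type-2 trade-off for a
# one-sided test of a normal mean with known variance.
# Assumed setup: H0: mu = 0 vs H1: mu = 0.3, sigma = 1, sample size n = 25.
from scipy.stats import norm

n, sigma, mu_alt = 25, 1.0, 0.3
se = sigma / n ** 0.5  # standard error of the sample mean

for alpha in [0.10, 0.05, 0.01, 0.001]:
    # Reject H0 when the sample mean exceeds this cutoff (Type-1 error = alpha).
    cutoff = norm.ppf(1 - alpha, loc=0, scale=se)
    # Type-2 error: probability of NOT rejecting H0 when the true mean is mu_alt.
    beta = norm.cdf(cutoff, loc=mu_alt, scale=se)
    print(f"alpha = {alpha:5.3f}  ->  Type-2 error = {beta:.3f}")
```

Making alpha smaller pushes the rejection cutoff further out, which makes it harder to reject the null hypothesis even when it is false.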
P-value
Ronald Aylmer Fisher, one of the pioneers of statistical theory, looked at this problem from a different perspective. Instead of looking at the errors, he asked: what is the root cause of my seeing so much evidence against the null hypothesis? There are only two possibilities.
The null hypothesis is true, but I am very unlucky, and all the evidence has coincidentally turned out against the null hypothesis. (The movie 12 Angry Men shows such a possibility.)
Or, the null hypothesis is false.
Thus, he calculated the probability of the former, and called it the p-value. For any given hypothesis testing problem, his process was to calculate this p-value and, if it was sufficiently small, he would be willing to reject the null hypothesis. Fisher never stated this criterion as a hard and fast rule; exactly what value of the p-value counts as sufficiently small, he left to the judgement of the domain expert. Later, Neyman and Pearson wrote a series of papers, and from there onwards it became popular to compare the p-value against a 5% or 1% threshold in order to make most decisions.
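As a concrete illustration, here is a hedged sketch of the p-value computation for the famous “lady tasting tea” setup that gives Salsburg’s book its title (8 cups, 4 with milk poured first): the p-value is simply the chance that she identifies all the milk-first cups purely by luck, if she has no real ability.

```python
# A sketch of Fisher's p-value idea using the "lady tasting tea" setup:
# 8 cups, 4 with milk poured first. Under the null hypothesis ("she is just
# guessing"), every choice of 4 cups out of 8 is equally likely, and only one
# of those choices is exactly the correct set.
from math import comb

total_cups, milk_first = 8, 4
p_value = 1 / comb(total_cups, milk_first)
print(f"p-value = {p_value:.4f}")   # ~0.0143; small enough for Fisher to doubt H0
```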
Situations where this breaks down
Now, we will consider a few examples, and for each example, we will apply the notion of hypothesis testing, and see where it all starts to break down.
The Paradox of Lottery
L. Jonathan Cohen, in his book “An Introduction to the Philosophy of Induction and Probability”, presents this example. Let’s assume we decide to reject a null hypothesis if it has a p-value less than or equal to 0.01%. Now, we organize a fair lottery with 10,000 numbered tickets, marked 1, 2, up to 10000. Consider the hypothesis that ticket number 1 will win the lottery. The p-value for this is 1 out of 10000, i.e., 0.01%, so we reject the hypothesis. Next, consider the hypothesis that ticket number 2 will win. Since the lottery is fair, the p-value is again 1 out of 10000, i.e., 0.01%, so we reject this hypothesis too. Continuing like this, we keep rejecting the hypothesis for every numbered ticket. And finally, if we combine the decisions that number 1 will not win, number 2 will not win, …, and number 10000 will not win, taken together they imply that the lottery will have no winner at all, even though a fair lottery must produce exactly one.
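A tiny sketch makes the paradox explicit: every one of the 10,000 per-ticket hypotheses is rejected at the 0.01% level, yet exactly one ticket must win.

```python
# Cohen's lottery paradox in code: test each "this ticket will win" hypothesis
# at the 0.01% level, and every single one gets rejected.
n_tickets = 10_000
alpha = 0.0001            # reject when p-value <= 0.01%

rejected = 0
for ticket in range(1, n_tickets + 1):
    p_value = 1 / n_tickets   # fair lottery: each ticket wins with probability 1/10000
    if p_value <= alpha:
        rejected += 1         # "ticket will win" is rejected for this ticket

print(f"Rejected {rejected} of {n_tickets} hypotheses")
print("Yet the lottery is guaranteed to have exactly one winner.")
```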
The paradox arises because we apply logical operations (like “and”) to decisions that came out of statistical theories, without realizing that they are based on different kinds of objects. Mathematics starts with a few assumptions (“axioms”) and, using the rules of logic, produces theorems or established facts. Statistics, on the other hand, starts with a “model of reality” and uses the rules of probability to make inferences, to which we can only apply the rules of probability (not the rules of logic).
Therefore, one should avoid applying the familiar rules of logic to decisions based on a statistical model of reality.
Statistical theories for hypothesis testing can only be used to reject hypotheses, to say what does not work. They cannot be used to “accept” any hypothesis or to say what works.
The gap between managerial decisions and statistical results
This is a real story from the life of Stella Cunliffe, one of the greatest female statisticians of contemporary times. She worked for some time at Guinness, the beer-making company, where her job was to design experiments to ensure the quality of the production. She understood that decisions are usually taken by management, who are not expert statisticians and hence need to rely on multiple layers of people to perform the experiments, consolidate the statistical results, and then convey them in layman’s terms. In her own words,
It is amazing how often the description of an experiment as relayed by somebody several layers above the laboratory workers does not agree with what has actually happened
One such story is as follows: the problem was to control the capacity of the beer casks so that they conformed to specific weights. The woman assigned to measure them had to weigh the empty cask, fill it with water, and weigh the full cask. If the cask differed from its proper size by more than three pints below or more than seven pints above, it was returned for modification. The choice of these boundaries (three pints below or seven pints above) was made based on the statistical theory of confidence intervals.
Cunliffe found that there was a surprisingly large number of casks that were just barely within the limit. So she decided to visit and see the working conditions of the woman who weighed the casks. The woman was required to throw a rejected cask onto a high pile and place an accepted cask onto a conveyor belt. At Cunliffe's suggestion, her weighing position was put on top of the bin for the rejected casks; then all she had to do was kick a rejected cask down into the bin. The excess of casks just barely making it disappeared, because the woman no longer needed to make an extra effort to discard a cask.
Statistics does not take into account the behaviour of people, but the decisions impact people. Therefore, any decision you take based on statistical theories must take into account the rationality of the people involved in the process.
Most probability statements are conditional
When the weatherman says that there is a 95% chance of rain tomorrow, we simply assume that it is going to rain tomorrow and based on that assumption, we apply the rules of logic to deduce that we need to take an umbrella. We do not think that these forecasts are conditional on many facts:
The instruments of the weather agency are working properly.
They passed the information to the news agency reliably.
The worker who got the information from the agent wrote down the percentages correctly.
That information got passed down to the weatherman correctly.
The TV signal is not getting corrupted, perhaps by a crow's nest on top of your house.
Your ear is hearing properly and your brain is interpreting these numbers correctly.
But we never think of these facts; we assume that they hold by default. This means the stated probability of 95% is conditional on many events being true, each of which is itself probabilistic under the statistical model of reality.
To better illustrate this point, consider the following example. Suppose a disease affects 1 in 1,000 people, and there is a diagnostic test that is 99% accurate. If a person tests positive, many would think there is a 99% chance they have the disease. However, if you apply Bayes' theorem, you find that the actual probability of having the disease given a positive test is only about 9%, far smaller than what you would have thought.
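Here is a minimal sketch of that Bayes computation, assuming (as one interpretation of “99% accurate”) that the test has both 99% sensitivity and 99% specificity:

```python
# Bayes' theorem for the disease-testing example above.
# Assumptions: prevalence 1 in 1000, 99% sensitivity, 99% specificity.
prevalence = 1 / 1000
sensitivity = 0.99        # P(test positive | disease)
specificity = 0.99        # P(test negative | no disease)

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(f"P(disease | positive test) = {p_disease_given_positive:.1%}")  # ~9.0%
```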
When using statistics and the rules of probability to make decisions, map the conditioning events to real-life facts and independently verify that they actually hold.
Probability applies to no one
When we toss a coin and say that the probability of a head is 50%, it is intuitively clear to most people that if you toss the same coin a large number of times, approximately 50% of the tosses will turn up heads. Now think about the motivational video or social media post that says, “If you study at least 30 hours a week, then the probability of passing the exam is more than 90%” (you have probably seen similar posts all over the internet). And based on that, you start studying 30 hours a week, with the hope that you will pass the exam.
Note: Here is a dataset containing the number of hours of study by students and whether they pass or fail an exam. You can perform a logistic regression or probit regression analysis based on this data to achieve a similar conclusion.
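For readers who want to try this, here is a hedged sketch of such an analysis; since the linked dataset is not reproduced here, it uses synthetic data with an assumed relationship between study hours and passing:

```python
# A sketch of the kind of analysis the note refers to, using synthetic data
# (the original dataset is not reproduced here): study hours vs pass/fail,
# fit with a logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hours = rng.uniform(0, 50, size=500).reshape(-1, 1)   # hours studied per week
# Assumed "truth" for the simulation: more hours -> higher chance of passing.
p_pass = 1 / (1 + np.exp(-(0.2 * hours.ravel() - 3.5)))
passed = rng.binomial(1, p_pass)

model = LogisticRegression().fit(hours, passed)
print("P(pass | 30 hours/week) =", model.predict_proba([[30]])[0, 1])
```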
However, if you carefully analyze what this probability means, you will see that it again refers to a large number of repetitions of an experiment, where each repetition is a student. Hence, in layman’s terms, if you find 10000 students, each of whom studies exactly 30 hours a week, then approximately 90% of them will pass. But this does not apply to you as an individual. In fact, Chester Bliss, the pioneer of probit regression, showed that it is impossible to determine for a single individual how many hours of study are needed to pass. Hence,
Statistics produces results about the masses, or the average person who does not exist. In contrast, the decisions you make apply to individuals, not to the average person.
Personal probabilities are extreme and asymmetric
Let us again go back to the example of the weatherman. If the weatherman says there is a 2% chance of rain, you probably would not take the umbrella. If he says there is a 95% chance of rain, you probably would. If he says 70%, chances are you would still take the umbrella. But if you keep lowering the probability, what is the exact threshold at which you switch from taking the umbrella to not taking it? Is it 50%?
If you say 50%, just pause and think for a minute. So if the weatherman says there is a 50% chance of rain, you won’t take the umbrella, but if it is 51%, you will? Probably not. Chances are, on some days you would take the umbrella in both cases, and on other days you would not in either. You will rely on your gut feeling.
L. J. Savage and Bruno de Finetti were proponents of the view that everyone has their own personal interpretation of probabilities. These interpretations agree at the extremes (i.e., most people agree on what is meant by “almost surely possible” and “not at all possible”), but they diverge significantly in the middle.
Psychologists Daniel Kahneman (later a Nobel laureate) and Amos Tversky performed experiments to figure out how much these personal probabilities deviate in that middle range across different people.
It turns out that almost everyone ties their personal probabilities to the particular decision they are associated with, and these probabilities are not consistent across different setups. In other words, people are unable to compare a 75% probability of rain with a 75% chance of India winning a cricket match.
However, one consistent pattern is that people weigh the gains and the losses attached to each decision asymmetrically. People are willing to forgo the chance of a huge gain to avoid even a small loss, which Kahneman and Tversky called “loss aversion”. Veritasium has an amazing video that explains this loss-aversion mentality in detail.
To illustrate this, imagine you are asked to participate in a game where a fair coin is flipped: you gain $6 if it lands heads and lose $4 if it lands tails. The expected gain is ($6 × 0.5) - ($4 × 0.5) = $1 per flip, so the mathematics suggests you should play the game. However, most people find the prospect of losing $4 more painful than the prospect of gaining $6 is attractive.
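A quick simulation (my own sketch) confirms the arithmetic: averaged over many flips the game pays about $1 per flip, even though any single flip can lose $4.

```python
# Sanity check of the coin-flip game: +$6 on heads, -$4 on tails.
# The expected gain per flip is 0.5*6 - 0.5*4 = $1.
import random

random.seed(42)
flips = 100_000
total = sum(6 if random.random() < 0.5 else -4 for _ in range(flips))
print(f"Average gain per flip over {flips} flips: ${total / flips:.2f}")  # roughly $1.00
```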
The same idea shows up in taxation. You work at a job, you get a salary of, say, $100, and then you need to pay, say, $30 in tax. The $30 tax, perceived as a loss, feels more painful than the $100 salary feels pleasant. Because of this same loss aversion, you would have been happier receiving only $100 - $30 = $70 in the first place, with no tax to pay. Unfortunately, the people in positions to make such decisions are often far removed from the statistical results and have distorted views of reality (as we discussed above with Cunliffe's story).
Mixing personal probabilities and statistical probabilities leads to inconsistent decisions.
If you use statistical probabilities, you will take decisions that work in the long run (on average), but those decisions may not make sense to you.
If you use personal probabilities to make decisions, your decisions will make sense to you, but they are not guaranteed to work.
Statistics requires you to fix methods first, then collect data
Let’s say you run a medicine company and your research team produces a new drug, believed to reduce blood pressure. So, you take two groups of 5 people each with high blood pressure (randomly assigned), give the first group a placebo, and give the second group your new drug. Here are the blood pressure measurements of those groups.
Group 1 (Placebo): 140, 147, 133, 150, 130
Group 2 (Drug): 141, 142, 143, 144, 145
Ideally, you should take the average blood pressure of Group 1 and the average blood pressure of Group 2 and compare them, to show that the Group 2 average is significantly lower than the Group 1 average. If you do this, the Group 1 average is 140 and the Group 2 average is 143, which says that your drug does not work.
However, since you really want to advertise that your drug works, you decide to redefine your method as “compare the second observation of Group 1 with the second observation of Group 2”. Based on that, Group 1 gives 147 and Group 2 gives 142, “clearly” showing that the drug works.
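For comparison, here is a sketch of the honest analysis of these numbers, using a standard two-sample t-test (my choice of test; the text itself only says to compare the averages):

```python
# Honest analysis of the two groups above: compare the group means with a
# two-sample t-test rather than cherry-picking individual observations.
from scipy import stats

placebo = [140, 147, 133, 150, 130]   # Group 1, mean = 140
drug    = [141, 142, 143, 144, 145]   # Group 2, mean = 143

t_stat, p_value = stats.ttest_ind(placebo, drug, equal_var=False)
print(f"mean(placebo) = {sum(placebo) / 5}, mean(drug) = {sum(drug) / 5}")
print(f"p-value = {p_value:.3f}")  # far above 0.05: no evidence the drug lowers BP
# Comparing only the second observation of each group (147 vs 142) after seeing
# the data is exactly the post-hoc cherry-picking the text warns about.
```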
Obviously, such an unethical practice should not even be allowed: we should not be able to look at the data first and then choose whatever criterion is convenient. However, this practice is abundant nowadays. You get a dataset from Kaggle or some open-source website, try out various methods to see what fits the data, and then produce insights and predictions about the data, on the basis of which business decisions are taken.
Therefore, the validity of the decision directly depends on the quality of the statistical experiments performed.
Any decision derived from statistical knowledge gained from data must follow the proper order: fix the “data model” (the method of analysis) first, and only then collect the “data”.
Recently, there have been some emerging studies that allow a few selective procedures to be chosen after seeing the data, such as methods from discrepancy theory and post-selection inference. However, none of them are widely used outside the academic community. Hence, for the analyses we find on the internet that start with the data and try out different machine learning models one after another, I wonder how much we can really trust them.
Conclusion
My primary goal with this rather philosophical post is to raise awareness of the distinction between statistical or probabilistic implications and the hard-and-fast logical rules that we need in order to make decisions.
Feel free to share and comment on any interesting stories that you might know about people wrongly using statistical theories to make decisions.
Thank you very much for being a valued reader! 🙏🏽
Subscribe below to get notified when the next post is out. 📢
Until next time.