When I was a kid, I was told a story about an elephant and six blind men.
Six blind men were asked to observe an object they had no idea about. They chose one leader to listen and draw a conclusion, while the other five observed the object from different positions. The leader asked, ‘What is an elephant like?’ and they began to touch the object. One of them said, ‘It is like a pillar.’ This blind man had only touched its leg. Another man said, ‘The elephant is like a fan.’ This person had only touched its ears. The third man said, ‘No, it’s a wall.’ This man had touched the belly. The fourth, who touched the trunk, said, ‘No way, it is round and sharp; it must be a spear.’ The last man said, ‘Yes, it is round, but too smooth for a spear, and it keeps moving. It must be a snake.’ This last man had only touched the tail.
The leader was confused. None of the descriptions matched one another, so he failed to draw a conclusion.
The story above shows how the same data, seen from different perspectives, can be interpreted differently. None of the men were lying, but none of them were telling the truth either. Each was right within his own limited observation.
In the real world, this does not happen only to blind men. The way data is represented can mislead anyone.
We often hear two opposite opinions drawn from the same data. So how is that even possible? To explain this, let’s take a look at the following data:
Suppose the data is about a government program that provides medical treatment. The data shows that patients with a mild illness are more likely to recover when treated: 100% recover with treatment versus 75% without. The recovery rate for critical illness also increases by 25 percentage points with treatment. In conclusion, the treatment increases the chance of recovery.
Unfortunately, even data this simple can be seen from a different perspective.
If we instead aggregate the data by whether or not patients were treated, then among 5 treated patients only 2 recover, a survival rate of just 40%. On the other hand, 3 out of 5 untreated patients recover, a rate of 60%. This tricky perspective makes it seem like the treatment reduces the chance of recovery.
This phenomenon is called Simpson’s paradox. It’s a statistical paradox where two or more opposite conclusions can be drawn from the same data, depending on how the data is grouped. It often occurs when the data hides a lurking variable: a factor that significantly influences the results. In this case, the hidden variable is the patient’s level of health.
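To see the paradox concretely, here is a short sketch that reproduces the numbers above. The exact split between mild and critical patients in each group is an assumption chosen to match the stated percentages; only the rates and totals come from the text.

```python
# Simpson's paradox: per-group counts consistent with the figures above.
# Format: (recovered, total). The mild/critical split is an assumption;
# the percentages and the 5-patient totals come from the article.
treated   = {"mild": (1, 1), "critical": (1, 4)}   # 100% and 25%
untreated = {"mild": (3, 4), "critical": (0, 1)}   # 75% and 0%

def rate(recovered, total):
    return recovered / total

# Within each illness category, treatment looks better...
for illness in ("mild", "critical"):
    t = rate(*treated[illness])
    u = rate(*untreated[illness])
    print(f"{illness}: treated {t:.0%} vs untreated {u:.0%}")

# ...but aggregated over all patients, treatment looks worse.
agg_t = rate(sum(r for r, _ in treated.values()),
             sum(n for _, n in treated.values()))
agg_u = rate(sum(r for r, _ in untreated.values()),
             sum(n for _, n in untreated.values()))
print(f"overall: treated {agg_t:.0%} vs untreated {agg_u:.0%}")
# overall: treated 40% vs untreated 60%
```

Note that the reversal comes entirely from the imbalance: most treated patients were critical cases to begin with, while most untreated patients were mild.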
— * —
In this era, it is common to hear about data-driven decision-making, where organizations base their decisions on organized data. It is also common to hear political debates that deploy various data to persuade the audience. Of course, that’s NOT a bad thing in itself. But there’s a problem with it.
In a digital era where the amount of data seems so overwhelming, oversimplifying data looks like an instant solution. No wonder there are bunches of infographics roaming around the internet. The sad thing is, some visual representations of data may have something lurking inside them. Something that can turn the result upside down.
So how can we avoid this kind of paradox?
Unfortunately, there is no one-size-fits-all solution. Data can be grouped or classified in various ways, and grouping it sometimes gives you a better understanding of it. So the key is to study the actual situation and consider whether a lurking variable may be present. Always double-check information that is crucial to you. Otherwise, we leave ourselves vulnerable to those who would use data to manipulate others and push their agendas.