Update: A concise (and IMO more readable) piece has been written on Medium. I recommend checking it out there instead.

### Amazing finding!!!

Recently, a team of *motivated* scientists (University of Raudfays) made a breakthrough discovery. The scientists were studying the ERG24 gene, a gene which is very common in highly intelligent scientists (like themselves) but rare in the wider population. These scientists discovered that the ERG24 gene increases IQ.

How did the scientists do this? Simple. They measured the IQ of a large group of ordinary people. Then, they used gene therapy to introduce ERG24 to half of the group, and then had the entire group perform another IQ test. The result was that the mean IQ of people with the ERG24 gene increased by 2 points, while the IQ of the other participants did not change.

The team hopes to introduce this gene into the wider population, in order to move humanity closer towards utopia, and they are currently trying to identify more genes which can serve a similar purpose.

*But why does a difference in mean IQ equate to people with ERG24 having higher IQ?*

Good question. Why pick the mean? Why not the median?

In fact, the median IQ of people without the gene is 100, and with the gene, 99.1811. Does this now mean that people with this gene absent have the higher IQ?

**This finding in intelligence research is completely fictional.**

## Group Comparisons and Arbitrary Statistics

“…we could create any arbitrary function and then call its result a “statistic”.” – from blog post on Misinformative Statistics.

### Scenario

Suppose that we are motivated researchers. Here is the scenario: we have a data set, and the data set contains data from two groups, A and B. We want to claim that on some variable, group A is greater than group B. Furthermore, we want to state the extent of the difference between group A and group B.

### Decision Rules

We need to make a decision rule which we use to decide when A is greater than B, and by how much.

Let A and B be random variables with the underlying distributions of two groups. We assume that there is no error in the sample, and that the researcher knows the population distribution.

**Mean Rule**

**For direction:** A is greater than B if $E[A] > E[B]$.

- This seems to be what is meant by “on average, A is greater than B.”
- This is commonly used in research and by journalists.

**For magnitude:** The size of the difference is $E[A] - E[B]$.

**Median Rule**

People often learn about the median as a “robust statistic”, as if it is a quick fix to outliers in the data. Thus, when two data sets have high skew, I’d imagine that someone would blindly conclude that comparing the medians of two distributions is therefore an appropriate decision rule. Is it? Let’s see if I can break this statistic.

**For direction:**

- A is greater than B if the median of A is greater than the median of B.
- This is not as commonly used, but a common example is house prices.

**For magnitude:** The size of the difference is $\operatorname{median}(A) - \operatorname{median}(B)$.

**Frequency Rule**

The next rule uses a distribution which I term the “difference distribution”. This is the distribution of $A - B$, and in this context it can be thought of as the distribution of the values generated by subtracting every value of B from every value of A. I have not seen it used in the way I use it here, but I think of it as a potentially useful tool, and I intend to explore it more.

As an example, if X = 6 or 10 and Y = 1 or 2, then X-Y= 4, 5, 8 or 9. Get the idea?
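The toy example above can be reproduced by enumerating every pairing. This is only a minimal sketch; the helper name `difference_values` is hypothetical and chosen for illustration.

```python
from itertools import product

def difference_values(xs, ys):
    """Return every x - y for x in xs and y in ys, sorted ascending."""
    return sorted(x - y for x, y in product(xs, ys))

print(difference_values([6, 10], [1, 2]))  # [4, 5, 8, 9]
```

With equally likely values, each of the four differences would carry probability 1/4, which is what turns this list into a distribution.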

**For direction:** A is greater than B if $P(A - B > 0) > P(A - B < 0)$.

- Observing $A - B > 0$ is equivalent to observing $A > B$.
- A is greater than B if, in your long-term experience, you more frequently observed that A was greater than B.

**For magnitude:** Compare $P(A - B > 0)$ with $P(A - B < 0)$.

- If you were to predict that for any two random observations, A would be greater than B, you would be right $100 \cdot P(A - B > 0)$% of the time. I call this figure the correct prediction rate.

An important thing to note is that $E[A - B] = E[A] - E[B]$, so you can derive the mean rule from the difference distribution.

These three rules are the most obvious rules to me, but there are probably more that I haven’t seen or thought of.
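To make the three rules concrete, here is a rough sketch that applies them to two small samples. Note the assumptions: the text works with known population distributions, whereas this uses finite samples; the function name `compare` and the toy numbers are hypothetical, picked only so that the rules disagree.

```python
from itertools import product
from statistics import mean, median

def compare(a, b):
    """Apply the mean, median and frequency rules to two samples."""
    pairs = list(product(a, b))  # every value of A against every value of B
    wins_a = sum(x > y for x, y in pairs)
    wins_b = sum(x < y for x, y in pairs)
    return {
        "mean_diff": mean(a) - mean(b),        # mean rule
        "median_diff": median(a) - median(b),  # median rule
        "rate_a": wins_a / len(pairs),         # frequency rule, predicting A
        "rate_b": wins_b / len(pairs),         # frequency rule, predicting B
    }

# Hypothetical toy samples: one large value in A drags its mean upwards.
a = [1, 2, 3, 4, 100]
b = [5, 6, 7, 8, 9]
result = compare(a, b)
print(result)
```

Here the mean rule says A is greater (by 15), while the median rule and the frequency rule both say B is greater.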

### Applying The Decision Rules

Let’s test out these decision rules. I apply them to various scenarios, with increasingly pathological distributions. Will we see any problems?

**Scenario 1:** Let’s take the simplest scenario. (Constructed to be similar to IQ, which is commonly modelled as normal with mean 100, standard deviation 15.)

Distributions:

- A is normal with mean 110, standard deviation 15.
- B is normal with mean 100, standard deviation 15.

Mean:

- A: 110
- B: 100
- A is greater by 10

Median:

- A: 110
- B: 100
- A is greater by 10

Difference distribution

- A is greater: 68% correct prediction rate for A

Comments:

- The three decision rules reached the same conclusion: A is greater than B.
- This scenario is the idealised scenario. This seems to be what everyone means by “A is greater than B”. But why would the distributions be identical in every way except location? I don’t believe the distribution would be rigid, but maybe people’s minds are. This scenario is too unrealistic for the real world.
- Looking at the distribution of the differences, there is still quite a bit on the negative side. In fact, if you always predicted that an observed A is greater than an observed B, you would be wrong 32% of the time. Imagine having to cope with this error rate in the real world – not good!
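The 68% figure can be checked with Python's standard library. This is a sketch under one assumption that the scenario does not state explicitly: A and B are independent, which is what subtracting two `NormalDist` objects presumes.

```python
from statistics import NormalDist

a = NormalDist(mu=110, sigma=15)
b = NormalDist(mu=100, sigma=15)

# Subtracting two NormalDist objects treats them as independent,
# so the variances add: the result is normal with mean 10.
diff = a - b
rate_a = 1 - diff.cdf(0)  # P(A - B > 0), the correct prediction rate for A

print(round(diff.mean, 2), round(diff.stdev, 2), round(rate_a, 2))
```

The standard deviation of the difference comes out at about 21.2, larger than the 15 of either group, which is why so much of the difference distribution sits below zero.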

**Scenario 2:** Let’s complicate this a bit: different variance.

Distributions:

- A is normal with mean 110, standard deviation 30.
- B is normal with mean 100, standard deviation 15.

Mean:

- A: 110
- B: 100
- A is greater by 10

Median:

- A: 110
- B: 100
- A is greater by 10

Difference distribution:

- A is greater: 62% correct prediction rate

Comments:

- Scenario 2 adds a layer of complexity to scenario 1.
- A quick judgement would have concluded the same for scenario 1 and 2: A is greater than B. But is “greater than” for scenario 1 the same as “greater than” for scenario 2?
- Consider the graph of the distribution of differences:
    - By increasing the variance, the correct prediction rate is reduced from 68% to 62%.
    - What if it started to reach 50%? Prediction would be no better than guessing. Two things could make this happen: the difference in means getting smaller, or the variances getting larger. Either would make the comparison more problematic.
    - In addition, the variance of the difference distribution has gotten larger. This means that it is not only the frequency of wrong predictions that increases, but also the size of the differences you might observe. And the size increases in both directions: are you satisfied with predicting that some A is greater than some B, just to find out that the B was much greater than the A?

- This scenario makes me increasingly hesitant to compare distributions.
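For two normal groups, the correct prediction rate has a closed form that shows how variance erodes it. This is a worked sketch assuming A and B are independent; here $\Phi$ is the standard normal CDF, and $\mu$ and $\sigma$ are the group means and standard deviations.

$$
A - B \sim \mathcal{N}\left(\mu_A - \mu_B,\ \sigma_A^2 + \sigma_B^2\right),
\qquad
P(A > B) = \Phi\left(\frac{\mu_A - \mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}}\right)
$$

$$
\text{Scenario 1: } \Phi\left(\frac{10}{\sqrt{15^2 + 15^2}}\right) = \Phi(0.471) \approx 0.68,
\qquad
\text{Scenario 2: } \Phi\left(\frac{10}{\sqrt{30^2 + 15^2}}\right) = \Phi(0.298) \approx 0.62
$$

The rate tends to 50% either as the numerator shrinks or as the denominator grows.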

**Scenario 3:** Let’s introduce skewness to complicate things further.

Distributions:

- A is gamma with shape 2, scale 9
- B is gamma with shape 4, scale 4

Mean:

- A: 18
- B: 16
- A is greater by 2

Median:

- A: 15.1
- B: 14.7
- A is greater by 0.4

Difference distribution

- A is greater: 51% correct prediction rate.

Comments:

- Scenario 3 is starting to be more plausible. Now there are three degrees of freedom: mean, variance, skew. This distribution is less flawed for describing the real world.
- In scenario 3, all decision rules lead to the conclusion that A is greater than B.
- The size of the difference is not the same according to each rule.
    - Median rule: B is only 97.3% of A.
    - Mean rule: B is only 88.9% of A.
    - Which one should we report? Which one do we want to report?

- The correct prediction rate is 51% if we predict A. This is nearly the same as guessing. In certain situations, could a comparison of groups be practically insignificant?
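The medians in this scenario can be recomputed without special software, because both shapes are integers and the gamma CDF then has a finite closed form (the Erlang case). This is a sketch only: bisection is my choice of root finder, and the function names are hypothetical.

```python
import math

def erlang_cdf(x, shape, scale):
    """CDF of a gamma distribution whose shape is a positive integer."""
    t = x / scale
    tail = sum(t**n / math.factorial(n) for n in range(shape))
    return 1.0 - math.exp(-t) * tail

def erlang_median(shape, scale, tol=1e-9):
    """Find the point where the CDF crosses 0.5 by bisection."""
    lo, hi = 0.0, 50.0 * shape * scale  # generous upper bracket
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if erlang_cdf(mid, shape, scale) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

med_a = erlang_median(2, 9)
med_b = erlang_median(4, 4)
mean_a, mean_b = 2 * 9, 4 * 4  # gamma mean is shape * scale

print(round(med_a, 1), round(med_b, 1))  # 15.1 14.7
print(round(mean_b / mean_a, 3))         # 0.889
```

The two rules agree on the direction, but the relative gap between the means is several times larger than the relative gap between the medians.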

**Scenario 4:**

Distributions:

- A is gamma with shape 4.1, scale 5
- B is gamma with shape 10, scale 2

Mean:

- A: 20.5
- B: 20
- A is greater by 0.5

Median:

- A: 18.86
- B: 19.3
- B is greater by 0.44

Difference distribution

- B is greater: 52% correct prediction rate.

NOTE: Mathematica (or me) was not able to plot the graph using TransformedDistribution, so I used an approximation.

Comments:

- There is an outright contradiction:
    - The mean rule and median rule lead to different conclusions. Which one should we trust? If people like to use the mean, but the median is the “robust statistic” for asymmetric distributions, then a researcher could pick either one with apparent justification.

- This scenario demolishes the possibility of simple group comparisons.
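The reversal can also be seen by simulation. This is a rough Monte Carlo sketch, not the Mathematica computation behind the figures above; the sample size and seed are arbitrary choices, and the draws of A and B are assumed independent.

```python
import random
from statistics import mean, median

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200_000

# random.gammavariate takes (shape, scale).
a = [rng.gammavariate(4.1, 5) for _ in range(n)]
b = [rng.gammavariate(10, 2) for _ in range(n)]

mean_diff = mean(a) - mean(b)
median_diff = median(a) - median(b)
# Pair the i-th draw of A with the i-th draw of B.
rate_b = sum(y > x for x, y in zip(a, b)) / n

print(f"mean(A) - mean(B)     = {mean_diff:+.2f}")
print(f"median(A) - median(B) = {median_diff:+.2f}")
print(f"P(B > A)              = {rate_b:.3f}")
```

The mean difference should come out positive while the median difference comes out negative, so the same simulated data supports either headline.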

**Scenario 5:** Let’s use a scenario which actually appreciates the complexity of the real world.

Suppose there are two cities: Aesthetic and Boss. Both cities have the same population size, and have the same number of architects. However, there are two types of architects, high-end architects and ordinary architects.

Aesthetic hires many high-end architects (50% of all architects in Aesthetic) to contribute to the many nice buildings in the city. Many of the better-than-average ordinary architects did not feel appreciated in Aesthetic, and left for Boss. Thus, Aesthetic has the lower end of the ordinary architects.

Boss hires fewer high-end architects (10% of all architects in Boss). However, Boss is the home of the country’s government, and the government needs to assert its dominance. Therefore, the government pays higher salaries to attract the best talent.

Distributions:

- A (Aesthetic)
    - Ordinary salary (50% of architects) is distributed normally, mean 40K and standard deviation 6K.
    - High-end salary (50% of architects) is distributed normally, mean 120K, standard deviation 1K.

- B (Boss)
    - Ordinary salary (90% of architects) is distributed normally, mean 60K and standard deviation 6K.
    - High-end salary (10% of architects) is distributed normally, mean 140K, standard deviation 1K.

If you view the ordinary and high-end architects separately, this is just scenario 1 – different means but same standard deviation. The means are different by several standard deviations, so you might conclude that because Boss pays more for ordinary architects and for high end architects, therefore, Boss pays more for architects in general.

Let’s fail to distinguish between the types of architects. After all, this is what we do with real data. We group data, but do not sub-group further. And if we do take sub-groups, we choose one sub-grouping out of an infinite number of different sub-groupings we could have selected. And we could make sub-sub-groups, and so on.

Mean:

- A: 80,000
- B: 68,000
- A is greater by 12,000

Median

- A: 89,758
- B: 60,838
- A is greater by 28,920

Difference distribution

- B is greater: 55% correct prediction rate

Comments:

- Scenario 5 has complexity which can somewhat match the real world.
- We have seemingly reached a paradox:
    - When considering the sub-groups separately, B is greater than A.
    - When considering the entire groups, the mean and median imply that A is greater than B.
    - Is it appropriate to impose judgements on entire groups? An architect concludes that Aesthetic pays higher salaries, and he chooses to work there. But an architect doesn’t get to switch between “ordinary” and “high-end”. Viewing himself as a member of a smaller group, he now concludes that he should go work in Boss. When the conclusion of your comparison depends on the groups you create, and the scale at which you view people, then there is much reason to be apprehensive.

- Look at the difference distribution:
    - It has three peaks, and is generally strange.
    - It does not feel right to make a comparison at the group level. The group comparison seems to provide no information as to what the individual comparisons might be. In fact, it is nearly impossible to find an individual difference that would be close to the implied group difference.
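The group means and the correct prediction rate for this scenario can be computed exactly from the mixture components. This is a sketch assuming the two cities' salaries are drawn independently; salaries are in thousands, and the helper names are hypothetical.

```python
from statistics import NormalDist

# (weight, component) pairs; salaries in thousands.
city_a = [(0.5, NormalDist(40, 6)), (0.5, NormalDist(120, 1))]
city_b = [(0.9, NormalDist(60, 6)), (0.1, NormalDist(140, 1))]

def mixture_mean(mix):
    """Weighted average of the component means."""
    return sum(w * d.mean for w, d in mix)

def prob_first_greater(mix_x, mix_y):
    """P(X > Y) for independent draws from two normal mixtures."""
    total = 0.0
    for wx, dx in mix_x:
        for wy, dy in mix_y:
            diff = dx - dy  # difference of independent normal components
            total += wx * wy * (1 - diff.cdf(0))
    return total

mean_a = mixture_mean(city_a)
mean_b = mixture_mean(city_b)
rate_b = prob_first_greater(city_b, city_a)

print(round(mean_a, 1), round(mean_b, 1))  # 80.0 68.0
print(round(rate_b, 2))                    # 0.55
```

Almost all of the probability comes from two pairings: an ordinary architect in Boss against an ordinary architect in Aesthetic (B nearly always wins), and an ordinary architect in Boss against a high-end architect in Aesthetic (A nearly always wins).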

### Summary of Scenarios and the Key Issues

Scenario | Distribution | Rule 1 (mean) | Rule 2 (freq.) | Rule 3 (median) |
---|---|---|---|---|
Scenario 1: What Everyone Seems To Think | Normal, different means | A is greater by 10 | A is greater (68%) | A is greater by 10 |
Scenario 2: Variability decimates prediction | Increased the variance of A from S1. | A is greater by 10 | A is greater (62%) | A is greater by 10 |
Scenario 3: Manipulating the magnitude | Asymmetric distributions | A is greater by 2 | A is greater (51%) | A is greater by 0.4 |
Scenario 4: Manipulating the direction | Asymmetric distributions | A is greater by 0.5 | B is greater (52%) | B is greater by 0.44 |
Scenario 5: Comparisons at different scales | Mixture distribution | A is greater (12,000) | B is greater (55%) | A is greater (28,920) |

- Scenario 1 shows an idealised scenario.
- Scenario 2 demonstrates that more variable groups decrease your correct prediction rate, and make it more likely that you will find large disparities between random individuals in both directions.
- Scenario 3 demonstrates that switching your choice of statistic can make differences appear greater or smaller.
- Scenario 4 demonstrates that switching your choice of statistic can reverse the direction of a comparative statement.
- Scenario 5 demonstrates that comparison of a population grouped in different ways can produce contradictory results. (Simpson’s Paradox)

In summary, arbitrary statistics and group comparisons are problematic, because they can be very misleading.

- You can’t always make one indisputably correct comparative statement.
    - Using different statistics on the same data can lead to opposite conclusions.

- You can’t definitively state the magnitude of a group difference.
    - Using different statistics on the same data can produce exaggerated or diminished figures.

- You can’t predict the differences between individuals.
    - The difference distribution shows that the magnitude of the group difference does not adequately reflect the possible magnitudes of individually observed differences.
    - Variance decimates any hope of predicting individual differences. The variability of your predicted difference is even greater than the variability within either group, since $\operatorname{Var}(A - B) = \operatorname{Var}(A) + \operatorname{Var}(B)$ for independent $A$ and $B$.
    - This problem grows extremely fast as variance increases. Individual differences are less and less predictable.

- Group comparisons do not have the intuitive properties that individual comparisons have.
    - Comparison of real numbers:
        - Suppose $a_1 > b_1$ and $a_2 > b_2$.
        - Then $a_1 + a_2 > b_1 + a_2$ and $b_1 + a_2 > b_1 + b_2$.
        - We conclude that $a_1 + a_2 > b_1 + b_2$.

    - We might intuitively apply this to statistics.
        - $A$ is a mixture of two sub-populations $A_1$ and $A_2$, and likewise $B$ is a mixture of $B_1$ and $B_2$.
        - If $A_1 > B_1$ and $A_2 > B_2$, then we might automatically think $A > B$.

    - But we already have a counter-example in scenario 5.

### Aside: Why I See Potential In The Difference Distribution

Unlike other ways of comparing groups which insist on one simplified conclusion, the difference distribution is precise. Very precise. And I think it could be very useful.

Consider a scenario where you are developing medicine. The difference distribution would tell you what possible effects the medicine could have on some variable. For example, measure white blood cell count (WBC count), administer a drug, measure WBC count again, and then take the change in WBC count for each individual.

- Were there non-responders (people who were not affected)?
- You do not want the drug prescribed to these people. Little benefit and potential harm, in addition to the risk of unintended side effects.

- Among the responders, what was the mean effect? Was there variability?
- You want the medicine to have a good effect, but if the effect is too variable, then it would be difficult to predict how the medicine might affect a responder. This would complicate the treatment process.

Of course, finding the true distribution of differences is itself a difficult problem. But I can’t help but see the potential of this idea, if it could be implemented.
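As a rough illustration of the before/after analysis described above, here is a sketch on made-up numbers. Everything in it is hypothetical: the WBC values are invented for the example, and the 0.5 cut-off for calling someone a non-responder is an assumption, not a clinical rule.

```python
from statistics import mean, stdev

# Hypothetical WBC counts (10^9 cells/L) before and after the drug.
before = [6.1, 5.8, 7.2, 6.5, 5.9, 6.8, 7.0, 6.3]
after = [7.4, 5.9, 8.9, 6.4, 7.1, 8.5, 7.1, 7.9]

# One observed difference per individual.
changes = [y - x for x, y in zip(before, after)]

# Assumed cut-off: a change smaller than this counts as "no response".
THRESHOLD = 0.5
responders = [c for c in changes if abs(c) >= THRESHOLD]
non_responders = [c for c in changes if abs(c) < THRESHOLD]

print("non-responders:", len(non_responders))
print("mean effect among responders:", round(mean(responders), 2))
print("spread among responders:", round(stdev(responders), 2))
```

Unlike the independent-groups case, each difference here belongs to one person, so the list `changes` is itself a sample from the difference distribution.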

### Summary: Statistical Abuse and Motivated People

What I have now proved is that **there is a POSSIBILITY for the same data to produce different conclusions (in direction and magnitude) under different statistics.**

What are the consequences?

If you could choose a statistic (mean, median) and hide other statistics (variance, skew) or even the entire distribution, then you would have the possibility of creating the conclusion you want. In other words, you could create a narrative.

Who would do such a thing?

Money: A study investigating the effectiveness of a drug would have a conflict of interest if it were funded by the drug manufacturer. The researchers have much to gain by producing the outcome that the manufacturer wants.

Prestige: Many researchers are under pressure to publish their findings into journals, in order to advance their career. What if scientific journals don’t accept uninteresting findings? If every result was either expected or unexpected, and only unexpected findings were interesting, then what result would the researcher want?

Whatever the motivation is, whether it is malicious or not, is irrelevant. We are at risk of having our thoughts and actions manipulated by statistics. Thus, I’m always suspicious of motivated people.

And if we should be suspicious of motivated people, then there’s one person who I’m starting with. The one who I consider to be the most guilty.

Me.