How to Lie with Statistics

R6
Darrell Huff
Meredith Mincey
Readings 5050
Spring 2010
In Darrell Huff’s famous book, he teaches us how to lie with statistics so that we can protect ourselves
from false information. In Chapter 1, Huff tells us that a “surprisingly precise figure” is most likely false.
Anything that is a nice round number or very specific is unlikely to be scientifically accurate. Those who
use those precise figures often haven’t taken an appropriate sample, and samples go bad in all kinds of
ways. If the sample is large enough and selected properly, it will represent the whole better. If the sample
is too small or the creator too biased, the conclusions will be false but appear scientific.
Unfortunately, bad samples lie behind most of what you read. Sometimes respondents to questions lie
because they want to give a pleasing answer. But most of the time, results are only as good as the
samples. Be skeptical.
Creators who are serious about taking accurate samples must eliminate any chance of bias. To do
this, creators can use a basic sample called “the random sample.” Creators choose random samples by
selecting things by chance from the “universe.” The universe is the whole of which the sample is a
part. For instance, perhaps the universe is UNT undergraduates, and you want to see how many
undergraduates plan to enroll in graduate courses. All undergraduates currently at UNT would be the
universe, but that’s a very large population to select samples from. It would be expensive to do a random
sample large enough to accurately predict how many undergraduates plan to go to graduate school. A
more economical substitute for the random sample is the “stratified random sample.” To take a stratified
random sample, creators divide the universe (UNT undergraduates) into several groups in
proportion to their known prevalence and then sample at random within each group. For instance, one
group would be journalism majors who want to enroll in the Mayborn program. Because each group is
much smaller than the whole population, you won’t need as large a sample to make your data accurate.
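To make the idea concrete, here is a minimal sketch of a proportional stratified random sample. The majors, the head counts, and the function name stratified_sample are all made up for illustration; they are not from Huff or from UNT.

```python
import random


def stratified_sample(strata, total_size, seed=0):
    """Draw a proportional stratified random sample.

    strata: dict mapping a group name to the list of its members.
    total_size: approximate number of members to draw overall.
    """
    rng = random.Random(seed)
    population = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Each group contributes in proportion to its share of the universe.
        quota = round(total_size * len(members) / population)
        # Within a group, members are still chosen purely by chance.
        sample[name] = rng.sample(members, quota)
    return sample


# Hypothetical universe of 1,000 undergraduates (invented head counts).
universe = {
    "journalism": [f"J{i}" for i in range(100)],
    "biology": [f"B{i}" for i in range(300)],
    "business": [f"M{i}" for i in range(600)],
}

picked = stratified_sample(universe, total_size=50)
for major, students in picked.items():
    print(major, len(students))
```

With these invented counts, journalism makes up a tenth of the universe, so it also makes up a tenth of the sample.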
In Chapter 2, Huff explains the tricky nature of averages. The word “average” has a loose
meaning. People use averages to trick and influence public opinion or sell products. Readers are fooled
when they are given an average without being told what kind of average it is. Huff explains there are three
kinds of averages: mean, median, and mode. The mean is the sum of all the numbers in a data set divided
by the number of items in the list.
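Example: (1 + 2 + 3 + 4) / 4 = 10 / 4
Mean = 2.5
The median of a finite list of figures is found by arranging all the observations from lowest value to highest
value and picking the middle one. When the list has an even number of items, the median is the mean of
the two middle values.
Example: if a < b < c < d, then the ordered list is {a, b, c, d}
Median = (b + c) / 2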
The mode is the value that occurs the most frequently in a data set.
Example: {2, 2, 3, 6, 2, 7}
Mode=2
Some averages fall so close together that it isn’t vital to distinguish among them, but the mode is often
the most revealing because it shows the most common occurrence in your data set.
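The three averages can be checked with Python's standard statistics module. The first lines reuse the examples above; the salary list is a hypothetical data set, added only to show how far apart the three averages can drift when one value is extreme.

```python
from statistics import mean, median, mode

# The examples from the text.
print(mean([1, 2, 3, 4]))         # 2.5
print(median([1, 2, 3, 4]))       # 2.5 (mean of the two middle values)
print(mode([2, 2, 3, 6, 2, 7]))   # 2

# Hypothetical salaries: six ordinary employees and one owner.
salaries = [25_000, 25_000, 25_000, 30_000, 35_000, 40_000, 450_000]

salary_mean = mean(salaries)      # pulled upward by the one large salary
salary_median = median(salaries)  # the middle employee
salary_mode = mode(salaries)      # the most common salary

print(salary_mean, salary_median, salary_mode)  # 90000 30000 25000
```

All three numbers are an honest "average" of the same list, which is exactly why an unqualified average tells you so little.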
In Chapter 3, Huff warns us of the data that is missing from the sample. People usually make
inadequate samples. And instead of creating an honest headline, they omit the size of their sample.
Unfortunately for advertisers, any change in a large sample group is likely to be too small to make a good
headline. And unfortunately for readers, the small samples that do make headlines are less likely to be
accurate. Sooner or later, a small test group is going to show, by chance alone, an improvement worth a
headline, and that headline is unlikely to be true.
Only a substantial number of trials follows “the law of averages.” The law of averages states that
results settle toward their underlying probabilities only over a long run of trials. It is often misused.
Example: “The roulette wheel has landed on red three consecutive times. The law of averages
says it's due to land on black!”
Of course, the probabilities do not change according to past results. Even if the wheel has landed on red
10 consecutive times, the probability that the next spin will be black is still about 47.4% on an American
wheel (18 black pockets out of 38).
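You can test the roulette claim yourself with a short simulation. This is only a sketch and it assumes an American wheel (18 red, 18 black, and 2 green pockets); the function name black_after_red_streak and the number of spins are arbitrary choices for illustration.

```python
import random
from fractions import Fraction

# Assumed American wheel layout: 18 red, 18 black, 2 green pockets.
POCKETS = ["red"] * 18 + ["black"] * 18 + ["green"] * 2

# Exact chance of black on any single spin.
p_black = Fraction(18, 38)


def black_after_red_streak(spins, streak=3, seed=1):
    """Share of spins landing on black right after `streak` reds in a row."""
    rng = random.Random(seed)
    results = [rng.choice(POCKETS) for _ in range(spins)]
    followups = [
        results[i]
        for i in range(streak, spins)
        if all(r == "red" for r in results[i - streak:i])
    ]
    return sum(r == "black" for r in followups) / len(followups)


print(float(p_black))
print(black_after_red_streak(200_000))
```

The second number should land close to the first: a streak of reds does not make black any more "due" than usual.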
Still, Huff says the law of averages is useful for descriptions and predictions. How useful depends
on how many samples you take. But how many samples do you need to predict something accurately?
The size of your sample depends on how large the population is and how varied the population is.
Sometimes the number of samples can be deceptive. To avoid being fooled, figure out the degree of
significance. Don’t trust an average or graph when important figures are missing. If the creator doesn’t
explain the numbers, the range, or show any data that deviates from the average, they are fighting dirty.
In Chapter 4, Huff explains the sampling method. Any product of the sampling method will have
statistical error. A sample can be taken to represent the whole field of what is being measured, and how
closely it does so can be expressed in figures. There are two measures for this: the probable error and the
standard error.
The probable error is the amount by which the mean of a sample is expected to vary because of chance
alone.
Example: Suppose you measure the size of a field by pacing along a 100-yard fence while counting
your steps. You do this a few times and notice that you came within three yards of the exact 100
yards in half your trials, and missed by more than three yards in the other half. You would state
the result with its probable error like so: 100 ± 3 yards.
Most statisticians use the standard error, which takes in about two-thirds of the cases. You can only
calculate the standard error by knowing the sample’s size. Sometimes, though, people make a big ado
about a difference that is demonstrable but tiny and unimportant.
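One common way to compute the standard error of a mean is the sample standard deviation divided by the square root of the sample size. The sketch below uses that formula; the test scores are invented, and the larger list simply repeats them to imitate a bigger sample with a similar spread.

```python
from math import sqrt
from statistics import stdev


def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))


# Hypothetical test scores (not from the book).
small_sample = [98, 101, 95, 104, 102]
# Repeating the scores 20 times imitates a sample of 100 with similar spread.
large_sample = small_sample * 20

print(round(standard_error(small_sample), 2))
print(round(standard_error(large_sample), 2))
```

The larger sample gives a much smaller standard error, which is why a figure reported without its sample size is hard to judge.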
In Chapter 5, Huff explains what he likes to call “gee-whiz graphs.” Line graphs are the easiest
statistical picture to use, and they’re good for showing trends and explaining something everyone’s
interested in. Unfortunately, they’re also good for misleading the reader, intentionally or unintentionally.
Suppose you want your graph to have more of a “wow” factor. You could cut off part of the graph and
make a bigger impression while still presenting honest data. Your company can use misleading graphs to
influence public opinion by changing the proportions of the graph, and no one can place blame on you.
Isn’t that something?
Example: Which graph looks more impressive? Which one is more honest?
http://www.evsc.virginia.edu/~jhp7e/EVSC503/slides/stats_lie02/sld014.htm
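A little arithmetic shows why a cut-off axis makes a bigger impression. In this sketch the two values (20 and 22) and the function name apparent_ratio are hypothetical; the point is only how the drawn heights compare once the axis no longer starts at zero.

```python
def apparent_ratio(low, high, axis_start=0.0):
    """Ratio of the drawn heights of two bars when the axis starts at axis_start."""
    return (high - axis_start) / (low - axis_start)


# Hypothetical values: a rise from 20 to 22, which is a 10% increase.
honest = apparent_ratio(20, 22)                    # axis starts at zero
truncated = apparent_ratio(20, 22, axis_start=18)  # bottom of the graph cut off

print(honest)     # 1.1
print(truncated)  # 2.0
```

The data is the same in both cases, but on the truncated graph the second bar is drawn twice as tall as the first.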
In Chapter 6, you also learn how to use pictorial graphs or pictographs to fool the reader.
Readers like pictographs because they’re eye-appealing, but readers are less likely to understand the
results correctly. When reading, watch out for bar graphs where bars change widths while representing a
single factor. Is it sloppy craftsmanship or yellow journalism? Who knows?
Example: Just how many adult frogs are in the south pond?
The reader might conclude that frogs are simply bigger in September as compared to May, even though
the title says that the graph displays the number of frogs. The reader will notice the area of the image,
not just the height.
http://wikieducator.org/MathGloss/P/Pictograph
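The area problem can be put in numbers. This is a sketch that assumes the picture is enlarged in both height and width by the same factor; the frog counts are invented for illustration.

```python
def pictograph_area_ratio(value_ratio):
    """Visual area ratio when an icon is scaled in both height and width."""
    return value_ratio ** 2


# Hypothetical counts: 10 frogs in May, 20 frogs in September.
value_ratio = 20 / 10
area_ratio = pictograph_area_ratio(value_ratio)

print(value_ratio)  # 2.0, the real change
print(area_ratio)   # 4.0, the change the eye sees
```

Doubling the count doubles the height of the frog, but the picture covers four times the space on the page.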
In Chapter 7, Huff tells us what a semiattached figure can do. What is a semiattached figure? If
you can’t prove what you want, demonstrate something else and pretend it’s the same thing! Choose
figures that sound best and trust that few readers will recognize how imperfectly they reflect the situation.
You can recognize a semiattached figure when information is missing or variables are not stated.
Most advertisers want to fool you with numbers, but semiattached figures can also occur by inconsistent
reporting at the source. For instance, if the advertiser asked controversial questions, it might lead to false
information because respondents want to give what they believe is an acceptable answer.
Example: 72% of all crow nests in a particular forest are in pine trees; therefore, crows prefer to
nest in pine trees. (But 95% of all the trees in the forest are pine trees!)
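The crow example is easier to see when the share of nests is compared with the share of trees. The sketch below uses only the two percentages from the example; the variable names are my own, and it assumes every nest is in either a pine or a non-pine tree.

```python
from fractions import Fraction

# Figures from the example above.
pine_share_of_nests = Fraction(72, 100)
pine_share_of_trees = Fraction(95, 100)

# Everything that is not a pine tree.
other_share_of_nests = 1 - pine_share_of_nests
other_share_of_trees = 1 - pine_share_of_trees

# Nests relative to what you would expect if crows chose trees at random.
pine_index = pine_share_of_nests / pine_share_of_trees
other_index = other_share_of_nests / other_share_of_trees

# How many times more likely a single non-pine tree is to hold a nest.
relative = other_index / pine_index

print(float(pine_index))
print(float(other_index))
print(float(relative))
```

An index below 1 for pine trees means they hold fewer nests than their numbers alone would predict, so the figures point away from a preference for pines.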
In Chapter 8, Huff explains the common problem of the post hoc fallacy. The post hoc fallacy
occurs when you believe: If B follows A, then A caused B. In other words, because one event occurred
before another, the previous event (A) directly resulted in the next event (B). However, just because A
happens before B doesn’t mean they are related. B may just as well have been caused by a third factor.
Example:
Event A: The US has a high milk consumption rate.
Event B: The US has a higher cancer rate than countries with a low consumption of milk.
Post Hoc fallacy: Because the US has a high milk consumption rate and a higher cancer rate than
countries that consume low amounts of milk, milk causes cancer.
When there are many possible explanations, you shouldn’t pick one just because it suits your tastes. After
all, the correlation can be caused by several things:
a. Chance
b. A co-variation in which the relationship is real, but you don’t know which variable is the
cause or the effect.
c. Sometimes the cause and the effect change places.
d. Both variables are the cause and the effect.
e. Neither variable affects the other, but the correlation is real.
f. When the cause and the effect can only be speculation.
So what have we learned? That people will create false information when they make completely
unwarranted assumptions. People will also create a fallacy in their data based on a conclusion that’s said
to continue beyond the data demonstrated. Ask yourself, how did they connect event A to event B?
In Chapter 9, Huff tells us how to “statisticulate.” Statisticulation is misinforming people by
using statistical material and is caused by incompetence or chicanery. To be fair, statistics are usually
manipulated by people who are not professional statisticians. According to Huff, salesmen, PR experts,
journalists, and copywriters twist data to influence the reader. They frequently exaggerate data and rarely
minimize anything unless it’s negative. They like to paint a picture of giving rather than taking. Maps can
conceal facts and distort relationships, and decimals can be deceiving because they have an air of exactness. They
can use percentages to confuse you, and any percentage based on a small number of cases will be
misleading. And a shifting base price will confuse you about discounts. If you can’t add up percentages
freely, there’s a problem.
Example: An ad for Instant Maxwell House Coffee emphasizes that 45% of those tested in a
recent survey preferred its taste. (But how many people are in the sample?)
So, how can readers protect themselves from learning false information? The first thing to do is
to look for a bias or biased samples. Is the creator trying to prove a pet theory, earn a fee, or protect their
reputation? Look for suppressed data and see if they published only favorable data. When reading
graphs, check to see if units of measure have shifted. Look for unqualified “averages.” Even if the
creator is trying to be honest, their data can still be false. If a claim cites a source, who is that source really?
Huff tells us to watch out for “o.k. names,” names that have some sort of prestige. The unscrupulous
will use o.k. names to influence you without actually having consulted anyone. Check to see if the source
really supports their claim. And watch out for “firsters.” Anyone can claim to be the first at anything.
Check their claim more carefully to find the truth. And finally, watch out for a switch between the raw figure
and the conclusion. Hopefully, by learning how to lie with statistics, you’ll know how to protect yourself
in the future.
Bibliography
Huff, Darrell. (1954). How to Lie with Statistics. New York: W. W. Norton & Company Inc.
Porter, John H. (1998). How to Lie with Statistics. Retrieved on 2010/4.
http://www.evsc.virginia.edu/~jhp7e/EVSC503/slides/stats_lie02/sld001.htm
Kirkman, T.W. (1996). Display of Statistical Data. Statistics to Use. Retrieved on 2010/4.
http://wikieducator.org/MathGloss/P/Pictograph