Personal genomics and the nuances of communicating statistics
by phil on Saturday Jan 17, 2009 8:58 PM
limits of language
A friend of mine, Strange Loops, highlights some potential problems when everybody can get their DNA analyzed on the cheap. He cites an article by Steven Pinker, who really does a great job describing some of the conceptual pickles that personal genomics gets us into:
[T]here is nothing like perusing your genetic data to drive home its limitations as a source of insight into yourself. What should I make of the nonsensical news that I am "probably light-skinned" but have a "twofold risk of baldness"? These diagnoses, of course, are simply peeled off the data in a study: 40 percent of men with the C version of the rs2180439 SNP are bald, compared with 80 percent of men with the T version, and I have the T. But something strange happens when you take a number representing the proportion of people in a sample and apply it to a single individual. The first use of the number is perfectly respectable as an input into a policy that will optimize the costs and benefits of treating a large similar group in a particular way. But the second use of the number is just plain weird. Anyone who knows me can confirm that I'm not 80 percent bald, or even 80 percent likely to be bald; I'm 100 percent likely not to be bald. The most charitable interpretation of the number when applied to me is, "If you knew nothing else about me, your subjective confidence that I am bald, on a scale of 0 to 10, should be 8." But that is a statement about your mental state, not my physical one. If you learned more clues about me (like seeing photographs of my father and grandfathers), that number would change, while not a hair on my head would be different. Some mathematicians say that "the probability of a single event" is a meaningless concept.
My hope is that we will get better at discussing statistical data. I'm totally down with that last sentence. What the hell is a probability anyway? I enjoy watching Intrade Prediction Markets, where people bet actual money to predict things like elections. It turns out that futures markets are among the most accurate predictors of what will happen. But what the hell does it mean that there's a 90% chance Obama will win the election? Does that mean that 1 out of 10 times that the same exact scenario plays out, McCain wins? Does it mean we only feel 90% confident, based on the polling data, that he will win? When I see "90% chance Obama will win," to me that simply means, "he's going to take it."
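One way to cash out the frequentist reading of "90% chance" is calibration: across many events a well-calibrated market priced at 90%, about 9 in 10 should actually happen. A minimal simulation of that idea (the 1,000 events and the true 90% rate are hypothetical, just to illustrate the reading):

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical: 1,000 separate events that a calibrated market priced at 90%.
# Under the frequentist reading, roughly 900 of them should come true.
trials = 1000
wins = sum(random.random() < 0.90 for _ in range(trials))
print(f"{wins} of {trials} events occurred ({wins / trials:.0%})")
```

Of course, this only helps when there are many comparable events to count; for a single election, the number is closer to a degree of confidence than a long-run frequency, which is exactly the ambiguity above.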
Half the fun of looking at those futures markets is the mental exercise of trying to understand what a "75% chance that Slumdog Millionaire will win 'Best Picture'" means.
I think a solution for Pinker would be more nuanced language in the reports:
The two biggest pieces of news I got about my disease risks were a 12.6 percent chance of getting prostate cancer before I turn 80 compared with the average risk for white men of 17.8 percent, and a 26.8 percent chance of getting Type 2 diabetes compared with the average risk of 21.9 percent.
When those DNA reports come back, instead of saying "you have a 12.6% chance" it should say "we can say with a 95% certainty that you have between a 5% and a 22% chance, with the variation depending on lifestyle choices."
The most basic communication of statistics can come across as cold, reductionist, and fatalistic. But I think the vocabulary of statistics is wide enough for us to describe things in hopeful, free-will language. Whenever they publish averages, they should also publish standard deviations. And whenever they publish standard deviations, they should publish what tends to cause the deviation.
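The arithmetic behind the suggested rephrasing is simple: turn the point estimate into an interval of roughly two standard errors on either side. A minimal sketch using Pinker's 12.6 percent figure, with a hypothetical standard error (the actual study's uncertainty isn't given here):

```python
# Hypothetical report values -- the point estimate is from the post,
# the standard error is invented for illustration.
risk = 0.126      # estimated risk of prostate cancer before age 80
std_err = 0.044   # hypothetical standard error of that estimate

# A rough 95% interval: point estimate +/- 1.96 standard errors,
# clamped to the valid probability range.
lo = max(0.0, risk - 1.96 * std_err)
hi = min(1.0, risk + 1.96 * std_err)
print(f"~95% interval: {lo:.1%} to {hi:.1%}")
```

With these made-up numbers the interval comes out to roughly 4% to 21%, in the same spirit as the "between a 5% and a 22% chance" phrasing suggested above.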
Strange Loops said on January 17, 2009 11:41 PM:
Good point about rephrasing stats. The same information can generally be put in the form of a confidence interval (and instead of a tighter p value for more conservative significance tests, you adjust the CI), and it might evoke more accurate representations in the minds of readers who lack statistical training.
Even in the primary lit articles, authors sometimes confuse data summarizing population trends with predictions about an individual result (as opposed to predictions about how an aggregate of novel individual results will average out). But it gets even worse the more diluted the reporting -- from science press release down to science reporting down to everyday media story down to pub conversation.
I admit to being sloppy in my phrasing of stats sometimes, but the more training I get, and the more I run tests on my own data, the more these things stand out to me. Anyway, thankfully Pinker is more engaging than I, and someone articulate out there is trying to clear things up for the genetics/behavior question.