Document:Five More Nails

From AIDS Wiki

NOTWITHSTANDING ANY OTHER NOTICE ON THIS PAGE, the material on this page is NOT available under the GNU Free Documentation License; in accordance with Title 17 U.S.C. section 107, it is posted in the manner of bulletin boards in schools and workplaces, to encourage public education and citizen awareness, without profit or payment, for persons and entities engaging in non-profit research and educational activities and purposes only.




"You Bet Your Life"
29 October 2006



Hi Darin,


I noticed in your recent post that you considered Figure 1 of the Sept 27 JAMA article to be a "mathematical artifact". Take a look at this chart on Dr. Bennett's blog, which he maintains is "proof" of a correlation between CD4 cell count and HIV viral load... He calculates a magic R² value of .93 from Figure 1.
I am leaning towards thinking you are correct in your analysis, but Bennett seems to have a point (or five – not to be gratuitously humorous about the matter because there is nothing at all funny about AIDS, really.)
Any comment? Thanks.


A concerned reader.


Figure 3 of Rodríguez et al., visually displaying the lack of correlation between viral load and CD4 cell loss
Bennett's constructed figure with a magic R² value of .93


Dear Concerned Reader,


He has NOT proved a correlation between viral load and CD4 cell loss in individual patients. He has proved a correlation between the viral load and CD4 cell loss for five values that represent the median CD4 cell loss for their respective viral load subgroups. It is statistical trickery of the most transparent kind.

Here is the best way I can put it: ANY data set such as the one under consideration has a "line of best fit"; for this study, these lines appear in Figure 3. ANY data set at all. The coefficient of determination (R²) tells you how well the data set as a whole "fits" this line of best fit. In Figure 3, the points clearly don't fit it at all.
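For reference, the standard textbook definition of the coefficient of determination for a least-squares line can be written as below. The symbols are notation assumed here, not taken from the article: y_i is an observed response, \hat{y}_i is the value on the fitted line at the same predictor value, and \bar{y} is the mean of the observed responses.

```latex
R^2 \;=\; 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```

A value near 0 means the line accounts for almost none of the scatter in the responses; a value near 1 means the points lie almost exactly on the line.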

Now here is a more mathematical explanation of what I meant by "it's the statistical equivalent of squinting your eyes so hard you can't see any details anymore". Let's say you take a more-or-less random data set (as Figure 3 almost is) and break it up into subgroups by intervals of the predictor variable. All of the data points in any one subgroup come from patients who presented roughly similar HIV viral load levels, yet within each subgroup the data points are still more-or-less randomly scattered. There IS a general pattern, because the line of best fit does have negative slope: if you look at the total set, the points do (very generally) slope down slightly. The same holds for each subgroup, whose points are scattered but (very generally) have a slight downward trend. The point is that the data don't "fit" that trend very well (if at all), as the R² values for the total data set AND for each subgroup reflect.

Now, we have decided to choose the "median" response for each subgroup. It does not take a rocket scientist to figure out that if you choose the median response for each subgroup, those medians are going to lie somewhere very close to the line of best fit. The only way a median point could lie FAR from the line of best fit would be if the points had some strange distribution, like 2/3 of them very low, then a big jump, and the other 1/3 way up high. Just a quick glance at Figure 3 shows they're not strangely distributed like this.

Figure 1 of Rodríguez et al., showing the "simple linear relationship" of median subgroup responses

So, here is the net effect of considering the five points in Figure 1: you have a cloud of almost random data points; you plot the line of best fit through those points (which looks almost as absurd as some of the graphs in Ho/Wei); then you "choose" five data points out of the cloud which all happen to lie very close to the line of best fit. It's no surprise then that they all lie in a straight line and hence give a high correlation to each other. The "correlation" does not reflect any real correlation in the data set itself – it's a mathematical artifact of the way the medians were chosen. It's the statistical equivalent of squinting your eyes real hard and picking five points with your finger. It's ridiculous.
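The effect described above can be tried out on made-up numbers. The following is only a minimal sketch under stated assumptions: the data are simulated, not taken from Rodríguez et al., and the slope, noise level, sample size, number of subgroups, and the names `r_squared` and `simulate` are all hypothetical choices made for illustration.

```python
import random
import statistics


def r_squared(xs, ys):
    """R^2 of the least-squares line through the points.

    For a straight-line fit with one predictor this equals the squared
    Pearson correlation, which is what is computed here.
    """
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def simulate(n=5000, slope=-18.0, noise_sd=100.0, n_bins=5, seed=0):
    """Return (R^2 of the whole cloud, R^2 of the subgroup medians).

    All parameters are assumptions: x stands in for a log10 viral load
    between 2 and 6, y for a yearly CD4 change with a weak downward
    trend buried in large scatter.
    """
    rng = random.Random(seed)
    lo, hi = 2.0, 6.0
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    ys = [slope * x + rng.gauss(0.0, noise_sd) for x in xs]

    # Split the cloud into equal-width subgroups of the predictor.
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        index = min(int((x - lo) / width), n_bins - 1)
        bins[index].append((x, y))

    # Keep one point per subgroup: the median predictor and median response.
    med_x = [statistics.median(p[0] for p in b) for b in bins]
    med_y = [statistics.median(p[1] for p in b) for b in bins]

    return r_squared(xs, ys), r_squared(med_x, med_y)


if __name__ == "__main__":
    r2_cloud, r2_medians = simulate()
    print(f"R^2, all points:       {r2_cloud:.3f}")
    print(f"R^2, subgroup medians: {r2_medians:.3f}")
```

With these assumed settings the full cloud should give a small R², while the handful of subgroup medians should give a much larger one, even though both are computed from the same simulated points. Changing `seed`, `noise_sd`, or `n_bins` shows how sensitive the comparison is to those choices.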

So, all Figure 1 reflects is the slope of the line of best fit from Figure 3, with the lack of correlation obscured, and with some "error bars". The error bars have absolutely no biological meaning; they are confidence intervals for the median points. They are just saying, "look, the median point lies in here somewhere". So what??
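To illustrate the difference between an error bar on a median and the scatter of the individual points, here is a rough sketch on simulated data. Everything in it is an assumption made for illustration: the subgroup size, the centre and spread of the simulated responses, the use of a simple percentile bootstrap for the confidence interval (the article may have used a different method), and the function name `median_ci_vs_spread`.

```python
import random
import statistics


def median_ci_vs_spread(n=1000, centre=-60.0, noise_sd=100.0,
                        n_boot=500, seed=1):
    """Compare a bootstrap interval for the median with the raw spread.

    Returns (ci, spread), each a (low, high) pair:
      ci     - approximate 95% percentile-bootstrap interval for the median
      spread - approximate central 95% range of the individual values
    """
    rng = random.Random(seed)
    # Hypothetical subgroup: widely scattered responses around one centre.
    ys = [rng.gauss(centre, noise_sd) for _ in range(n)]

    # Resample the subgroup with replacement and record each median.
    boot = sorted(
        statistics.median(rng.choices(ys, k=n)) for _ in range(n_boot)
    )
    ci = (boot[int(0.025 * n_boot)], boot[int(0.975 * n_boot) - 1])

    # Range that holds roughly the middle 95% of the individual values.
    ordered = sorted(ys)
    spread = (ordered[int(0.025 * n)], ordered[int(0.975 * n) - 1])
    return ci, spread


if __name__ == "__main__":
    ci, spread = median_ci_vs_spread()
    print(f"interval for the median: {ci[0]:.1f} to {ci[1]:.1f}")
    print(f"middle 95% of the data:  {spread[0]:.1f} to {spread[1]:.1f}")
```

Under these assumptions the interval for the median comes out far narrower than the range covered by the individual values, because it only locates the median and says nothing about how widely the points around it are scattered.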

Then people like Bennett point out the "simple linear relationship" in Figure 1 and claim it's evidence of some kind of "correlation". It's NOT reflecting the correlation of the total data set; it's just reflecting the small slope of the line of best fit. But every data set has such a slope value. You can pull this trick on ANY data set that's more-or-less randomly distributed. The lack of correlation is so clear that the authors of the study couldn't just put the four clouds of data points up front in the article and report the R² values in the abstract; they had to concoct Figure 1 to distract people at the start of reading the article, and then report the median values in the abstract to give the idea there was some "correlation".

Actually, that wasn't good enough, because the median values were still too close to each other. They had to go back over each subgroup individually and run a different model with each one, and I can hear their collective sighs of relief when they finally got numbers spaced out from each other a little more. (Meaning, more than 10-15 cells/mm³/year difference between the most extreme groups.) Then they tried to "rescue" the R² = 0.04 in several ways, but could only get to 0.08 or 0.10 at best. I can just see them after they first saw the data and the actual R² values: OMG, we have to put this in JAMA... what do we do??

This all might mean something if there were any reason to look at the subgroups this way. But I can't find any. The only reason I can find is to smooth the data out and have a nice looking graph like Figure 1. In my 9 Oct post, I point out why I think the boundaries chosen are arbitrary and why I don't think there's any good biological reason to group them this way. And choices like this have to rest on biological reasons. The reasons can't be purely mathematical or just arbitrary. This is all standard stuff. Do a Google search on "subgroup analysis" (in quotes). You'll come up with a slew of articles on how to "misuse/abuse" subgroup analysis. This paper should go down in history as Exhibit A.


© 2006 by Darin Brown
Originally published at "You Bet Your Life"
