A LARGELY STAND-ALONE DISCUSSION FROM THE APPENDIX OF MY UPCOMING BOOK ON OBESITY:
I have concerns about how the academic elite of medicine develop clinical practice guidelines intended to direct the care given by other physicians, as well as to inform insurers what ought and ought not to be covered. There is a special need for scrutiny in this era, when legitimate cost-containment, coupled with the increasing influence of third-party payers (federal and state governments, and the massive private insurance industry), which often prioritize costs over the wants and needs of patients and doctors, is driving “evidence-based medicine” and risks denying patients care and treatments not anointed by the “evidence.”
By this I mean that unless a certain treatment or test is called for by a few top docs at major medical schools in their published guidelines, or is supported by excellently conducted research reported in a scientific journal, it might not be available to real-world patients. Perhaps worse, I think physicians today come out of training cowed by these top experts, devaluing their own thoughts and practice styles and subjugating them to evidence-based (“cookbook”) medicine. Keep in mind that every licensed medical doctor has an undergraduate degree, a four-year medical degree, and usually three to five or more years of postgraduate training. Such professionals are well trained enough to formulate their own preferred ways of managing patients, if encouraged and allowed (that is, paid) to do so, which they often are not.
Now, don’t misunderstand. Well-conducted research, peer-reviewed papers, and clinical practice guidelines are necessary to good medicine. But once that information is out there, it should be the independent physician who figures out how to apply it to each unique patient. No matter how good the research and its conclusions, they can never anticipate all the variables confounding a real-world situation. And even if evidence-based medicine’s answers could be accepted as 100 percent valid, there will never be adequate evidence to answer every question faced by every physician in the course of one routine week.
Our original question from the text was:
Why did it take until less than ten years ago for low-glycemic-index dieting to be taken seriously, scientifically? Smart people were talking about it. Why didn’t opposing theories get a fair shot at the height of the gung-ho, all-fat-is-bad days? Isn’t the job of science to consider all possibilities, test for them, throw out ideas proved wrong, and refine ideas proved right until, gradually, inexorably, we approach some great universal truth?
I used to think so.
Yet my experience has not always shown science to work that way.
There are several questions (low-fat versus low-carb being one, but there are others) in just my subspecialized area of medicine where it seems to me experts wear bizarre blinders about even contemplating dogma-threatening new ideas. And I stumbled upon an explanation, or at least objective confirmation of what I’ve observed, mere weeks before this writing.
The Harvard-trained physicist turned historian and philosopher of science Thomas S. Kuhn (1922–1996) published a landmark book in 1962 titled The Structure of Scientific Revolutions. In it, he threw out the notion that I had, and that most of us have: the romantic view of science as, in the words of North Carolina State University philosopher Jeffrey L. Kasser, “straightforwardly cumulative, progressive, or truth-tracking.”
This traditional, romanticized image of science includes an “openness to criticism,” an almost obsessive drive to disprove itself, which Kuhn felt did not exist in real-world science.
Normal science, according to Kuhn, is governed instead by paradigms. A paradigm is an object of consensus, not open to criticism. The paradigm is assumed to be correct. It is dogma. It determines the puzzles to be solved, which involve fitting nature into the paradigm, and it defines the expected results and the standards for evaluating those results. Science doesn’t seek truth; it seeks to prove the paradigm. “Dietary fat is unhealthy and the main promoter of obesity” was the paradigm in our discussion, and few mainstream researchers were allowed, or funded, to do research other than to prove that proposition.
(Another example in endocrinology involves defining hypothyroidism [low thyroid levels] in terms of blood TSH testing. Without going into details, the paradigm is that a TSH level virtually always accurately assesses thyroid function. I don’t happen to believe that; it may be accurate much of the time, but no test is as reliable as this one is given credit for. Many scientific papers, however, looking at questions like, Does Agent X affect thyroid function? base their conclusions largely, if not exclusively, on TSH testing.)
In Kuhn’s view, a crisis occurs when the paradigm loses its grip: the puzzles repeatedly resist solution (the failure of low-fat dieting, for instance, to quell the epidemic of obesity), and confidence in the paradigm is lost. It is during crisis that the paradigm is questioned and tested, perhaps rejected. When a new paradigm takes over, that is a scientific revolution. Older scientists have their careers invested in the old paradigm, while younger ones make the switch more readily; the old paradigm may die off, literally, as its proponents do.
My discovery of Kuhn was an epiphany—because I can honestly say there have been several occasions where I have been dismayed to witness what seemed to me to be blatant bias driving what I thought should’ve been open-minded scientific inquiry.
I believe there are other weaknesses as well, ones that call into question the role of the randomized, controlled scientific trial as the be-all and end-all of medical decision making (useful to be sure, perhaps vital, but not everything, which I fear is the current paradigm). First, for valid reasons, medical researchers design studies to focus on one parameter, or a limited number of them. If you want to know, for example, the effect of different levels of smoking on lung cancer, you’ll compare groups of subjects that are as much alike as possible (same sex, age, blood pressure, BMI, medications, and so on), differing only in how much they smoke.
But suppose (and I’m making this up) that childhood exposure to Play-Doh plus Lysol, plus adult exposure to battery acid, all in combination, increases the risk of lung cancer in smokers. There might be a study looking at Play-Doh exposure and lung cancer (again: totally making this up), or even Play-Doh plus smoking and lung cancer, but I guarantee you there is no study looking at Play-Doh plus Lysol plus battery acid, in smokers compared to nonsmokers, and lung cancer. My point is that the randomized controlled trial’s focus on one or very few variables, against a background of supposed uniformity, is both a great strength and a great weakness, for human populations are anything but uniform, and it is utter hubris to believe we can draw inviolable conclusions as if they were. Yes, do research, and consider the results; but in the case of medical science in particular, as opposed to physics or economics, do not hold the individual patient and physician hostage to those results, which is what’s happening!
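A bit of made-up arithmetic shows why no trial could ever cover all such combinations. If each exposure is a simple yes or no, the number of distinct exposure combinations doubles with every variable added; the sketch below (using a subjects-per-group figure I picked out of the air) shows how fast the required enrollment explodes:

```python
# Back-of-the-envelope arithmetic (hypothetical, like the exposures above):
# with k yes/no exposures (smoking, Play-Doh, Lysol, battery acid, ...),
# a trial that wanted to cover every combination would need 2**k groups.
for k in [1, 2, 4, 8, 16]:
    combos = 2 ** k
    # Assume, arbitrarily, 100 subjects per group for any statistical power.
    print(f"{k:2d} exposures -> {combos:6,d} combinations -> "
          f"{combos * 100:9,d} subjects at 100 per group")
```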
Lastly I want to take a swipe at the greatest sacred cow in all of science and statistical analysis.
The greatest, most entrenched paradigm of all!
The lowly p-value.
“P” standing for probability.
When the data from any experiment are analyzed, it must be determined whether the results from one group differ from those of another (supporting the hypothesis that Group A differs from Group B for some reason) and, if they do differ, whether that difference is statistically significant. The question that in reality gets asked is this: if the null hypothesis were true, that is, if Groups A and B were not really different, how probable would results like these be?
In other words, under the conditions of the experiment, what are the chances the results could have arisen from random chance rather than as a consequence of some actual difference? If that probability is sufficiently small (a low p-value), the results are said to be statistically significant, and the experiment is said to show that the groups are different.
If, however, that probability is large (a high p-value), then the data are said not to have achieved significance, and the groups are said to be no different. That particular experiment goes down in the annals of science as failing to show a difference between Groups A and B. If the question is an important one, more than one experiment will be done, and if all the studies agree that there is a difference, or that there isn’t one, then consensus is reached and, assuming we all agree that enough of the work was conducted properly, everybody is happy.
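To make that concrete, here is a minimal sketch of one way a p-value can actually be computed: a permutation test, which asks how often pure label-shuffling luck would produce a difference as big as the one observed. The cure counts below are invented purely for illustration (they come from no real trial), and this is just one of several standard ways to arrive at a p-value:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical outcomes, 1 = cured, 0 = not cured. These counts are
# invented purely for illustration; they come from no real trial.
drug    = np.array([1] * 9 + [0] * 3)  # 9 of 12 cured (75 percent)
placebo = np.array([1] * 5 + [0] * 7)  # 5 of 12 cured

observed_diff = drug.mean() - placebo.mean()

# Null hypothesis: the group labels don't matter. Simulate that world by
# shuffling all 24 outcomes and re-splitting them at random, many times.
pooled = np.concatenate([drug, placebo])
n_drug = len(drug)
n_iter = 100_000
hits = 0
for _ in range(n_iter):
    rng.shuffle(pooled)
    if pooled[:n_drug].mean() - pooled[n_drug:].mean() >= observed_diff:
        hits += 1

# The (estimated, one-sided) p-value: the fraction of label-blind worlds
# producing a difference at least as large as the one actually observed.
print(f"observed difference: {observed_diff:.3f}")
print(f"one-sided p-value: {hits / n_iter:.4f}")
```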
If some experiments draw one conclusion, though, and others draw a different one, that’s when things get confusing. That’s when we get a news item one week saying Vitamin Triple-X is good for us, and another the next week saying it causes heart attacks.
That’s when we need to do bigger, better studies.
This is all good and useful. We obviously need some method we can all agree upon to look at sets of data and decide whether they tell us anything. My gripe is with where the line is drawn between significance and nonsignificance. It’s an important question because in medicine today, if a certain test or treatment is studied and the data on its usefulness are deemed not significant, then that test or treatment has a good chance of being thrown away, or at least not supported under evidence-based medicine, which might mean insurance won’t pay for it.
Virtually universally, in all scientific research, that cutoff is a p-value of 5 percent.
If p is less than 5 percent, a study’s results are said to be statistically significant, and we pay attention to whatever a cursory exam of the data already told us. We are saying there is less than a 5 percent chance of seeing results like these if the null hypothesis were true. If p is greater than or equal to 5 percent, the results are deemed not significant, no matter what an eyeballing of the data might suggest. In other words, if Drug Alpha cures people 75 percent of the time, but the p-value is 10 percent (a 10 percent chance of results at least this good even if Drug Alpha were worthless), then Drug Alpha is going to be discredited, or at least have to wait for another experiment.
Why 5 percent?
Why is 5 percent the magic p-value?
Surely there is some sound statistical reasoning, right?
No… not so much…
It was an arbitrary threshold set by one man, the Cambridge-educated geneticist and statistician R.A. Fisher, in 1926, in a paper published in the Journal of the Ministry of Agriculture of Great Britain. In that paper he discusses the pros and cons of various p cutoffs and states, “Personally, the writer prefers…the 5 per cent point.”
Now, I’m not arguing against a 5 percent p-value threshold. It’s a perfectly reasonable number, if we have to have just one number. What I am arguing against, and am flabbergasted by, is that under the potential tyranny of evidence-based medicine, all medical decision making would, in theory, be subjected to a p-value analysis, and any and all major patient-care interventions would be accepted or rejected on the basis of whether p was greater or less than 5 percent, a number that sounded good to one man 86 years ago.
What’s wrong with 4.5 percent, or 3 percent, or 6.35 percent? Especially in this age of highly accurate, push-of-a-button computer computations, in which the multipage tables in the backs of statistics textbooks are obsolete, any reasonable figure could be chosen. Don’t tell me a treatment rejected because of a p equal to 6.0002 percent is really that much less likely to be helpful to some people than a treatment that lucked out with a p of 4.9997 percent.
I think the whole notion of p-values should be revamped.
My proposal: ignore all results failing to reach the significance indicated by a p of, say, 10 percent; but for each study, the investigators conducting the research would establish, ahead of time (a priori, and that’s extremely important to assuring integrity), the p-value threshold of significance for that particular study, a value anywhere between 9.9999 percent and, say, 1.0000 percent, or possibly less. The threshold would no longer be arbitrary, because its selection would be based upon the relative cost of a wrong conclusion: a risk-benefit analysis, which is already part and parcel of good medicine.
If we are investigating the efficacy of Drug Y against Disease Q, we might set a high p-value threshold of, say, 9 percent if Drug Y is very safe and Disease Q is a nonfatal nuisance on the order of the common cold. If, on the other hand, Drug Y has a dangerous collection of side effects and Disease Q is almost always fatal, you wouldn’t want to deem Drug Y an acceptable treatment unless you were pretty darned sure it worked; an appropriate significance threshold in that situation might be less than 1 percent, or perhaps less than 0.1 percent. Any number between the defined extremes might be chosen, depending on the exact combination of risks and benefits intrinsic to the situation being studied.
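For concreteness, here is a sketch of how such a preregistered threshold might be derived from a risk-benefit comparison. The formula, the harm scale, and the numbers are all my own invention, offered only to show that the calculation could be made explicit and fixed before the study begins:

```python
def preregistered_alpha(harm_if_false_positive: float,
                        harm_if_false_negative: float,
                        strictest: float = 0.001,
                        most_lenient: float = 0.10) -> float:
    """Pick a significance threshold, before the study runs, from the
    relative costs of the two kinds of error.

    harm_if_false_positive: cost of adopting a worthless or dangerous drug
    harm_if_false_negative: cost of rejecting a genuinely effective drug
    Both costs are on any consistent scale the investigators choose
    (e.g., expected deaths or serious harms per 10,000 patients).
    """
    # The greater the harm of wrongly adopting the drug, relative to the
    # harm of wrongly rejecting it, the stricter (smaller) alpha becomes.
    ratio = harm_if_false_negative / (harm_if_false_positive +
                                      harm_if_false_negative)
    return strictest + (most_lenient - strictest) * ratio

# Safe drug, nuisance disease: lean toward accepting (lenient threshold).
print(preregistered_alpha(harm_if_false_positive=1,
                          harm_if_false_negative=20))   # ~0.095 (9.5 percent)

# Dangerous drug, fatal disease: demand near-certainty (strict threshold).
print(preregistered_alpha(harm_if_false_positive=50,
                          harm_if_false_negative=5))    # ~0.010 (1 percent)
```

Run on the two scenarios just described, the first call lands near the lenient 9-to-10-percent end and the second near the strict 1 percent end, exactly the behavior the proposal calls for.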
Such a system would introduce utter chaos if deployed throughout the scientific community, across all fields. I do think, however, that such a system, admittedly and intentionally complex, should be considered for the biomedical, or at least the medical, sciences. Medicine is complex and diverse, and the same set of standards cannot be applied to all situations. One important complexity in medicine that does not exist in, say, astrophysics is that there can be important, even fatal, costs to both overcaution and overboldness: as many people could, hypothetically, die from the rejection of an effective treatment as from the acceptance of a dangerous or ineffective one.