A Placebo Straw Man

I acknowledge this setup is a bit of a straw man, but it’s still a justification I hear from time to time so it’s worth examining.

The argument goes that we should be performing certain interventions, such as providing oxygen for pain control, despite them having no evidence base or even a plausible mechanism, because they harness the placebo effect.

“Here’s some oxygen; it will make you feel better.”

The placebo effect is a real phenomenon and it can (and, some argue, sometimes should) be harnessed to improve a patient’s perception of their outcome. In select cases it can even affect objective physiological measurements. Whether and when we should be providing placebos has been debated for years.

There is one common circumstance when we certainly should NOT be administering placebos, however, and that is when there is an intervention or treatment available that has been proven superior. That is the case in my “oxygen for pain control” straw man.

Sure, it would be nice if prehospital care could be simplified by managing the patient with an isolated humerus fracture at the BLS level—especially on days when the paramedic is seeing three ALS calls for every BLS the Basic takes—but this is not the time to try and even the case load.

There are plenty of pharmacological agents available that have been proven superior to placebo for pain control, so when the former options are available, it is decidedly wrong to try and scrape by on the latter. Maybe one could make a case for giving oxygen as a stop-gap in a truly rural setting where BLS is the only level of transport available, but the point of this discussion isn’t to delve into these specifics and what-ifs.

It is to drive home the point that administering oxygen for pain control is not just ridiculous, it is unethical when alternatives that have been proven superior to placebo are available.

Can you think of any other interventions we provide in prehospital and emergency medicine that also fit this bill?

Placebo means never having to say you’re sorry


Often in clinical research it seems like the placebo effect was put on earth solely to cause us headaches. Imagine how simple things would be without it. Controls? Who needs ’em? Just grab a bunch of random dudes with the disease and check back in a month to establish your baseline. Easy peasy.

But here’s another way of looking at it. One of the central tenets of modern EBM is that only treatment benefits above and beyond the effect of placebo (sometimes called “specific” effects) are “real” benefits worth pursuing. In other words, if your drug cured 50% of patients, but a placebo also cured 50%, then your drug had a net effect of zero and did nothing.

This is because we’d like to have medicines that work via the mechanisms we designed them to have, or at least in somewhat predictable and understandable ways. If a sugar pill cures a bunch of people, that’s very nice, but it’s not really medicine, is it? It’s some kind of mind-over-matter trollop that we refuse to accept in our wards.

There are some interesting ethical questions here. For instance, if you’re the one suffering from a painful, debilitating, or deadly disease, do you care if you’re cured by a sugar pill or an exhaustively-developed organic compound? If you feel better, can I really pop from behind a bush and say, “Gotcha! That was the fake stuff! You don’t feel better at all!” And if the placebo benefit hinges on the patient not knowing the truth, would it be ethical to start passing around fake (but effective) drugs?

The current attitude is no. This business leaves a sour taste in our collective mouths, perhaps because medicine spent so many centuries practicing exactly this sort of handwaving mumbo-jumbo. We consider ourselves scientists, and scientists treat using physiological mechanisms, not wishful thinking. Perhaps more importantly, we have an ethical problem with lying to our patients, which seems like a necessary part of intentionally wielding placebos.

But it behooves us to remember that despite all this, the placebo effect can indeed be dramatic and beneficial. That’s the whole reason we try to control for it in trials — because otherwise, every random treatment would look pretty impressive, even if there’s nothing under the hood. No benefit over placebo is often still a big benefit to the patient.

What’s my point? My point is this: in the ivory towers of EBM, we spend a lot of time stroking our beards and worrying that we’re subjecting our patients to drugs, surgeries, and interventions that are not evidence-based. Perhaps there’s even evidence that they’re ineffective, or sometimes even harmful. Until we can banish all such snake oil from our practice, we believe that we’re wasting our patients’ time and money at best and harming them at worst.

But the fact is, this probably isn’t so. Because even our worst snake oil is still convincing, and it probably still carries with it real placebo value. (In fact, this is often why anecdotal reports of treatment success abound even when the controlled trials say “nope.”) When we say that Drug X has no benefit, we mean that it added nothing compared against a placebo — but in clinical practice, we’re not comparing it against anything, we’re just giving it to sick people. In other words, it is the placebo, with all the associated power.

So even without any “specific” effect, odds are that it has some value anyway. It’s not zero, and in fact, the benefit may even outweigh some specific harms.

And so, ladies and gentlemen, we should keep fighting the good fight to rid ourselves of myth and hokum. But in the meanwhile, we shouldn’t lose too much sleep that we’re hurting anyone, because we’d have to hurt them quite a bit to outweigh that sugary placebo goodness.


Epi Timing in Cardiac Arrest (Part 2)

In our last post we examined the effect of the chosen resuscitation end-time on the overall duration of the resuscitation and how that affected the calculated mean time-interval between epinephrine doses. It’s worth reviewing quickly before we resume our discussion here.

The major next point in our examination of the Warren et al. paper on epinephrine dosing in cardiac arrest is a look at the endpoints they used to define a “cardiac arrest.” There were two different ways to hit STOP on the clock measuring duration-of-resuscitation: death or return of spontaneous circulation (ROSC) lasting > 20 minutes. Both have issues.

The former is pretty convenient from a charting standpoint, the “time of death,” but it also has the chance to introduce a lot of bias. It’s my personal experience that epi often flows fast-and-furious early in the code. As time drags on and the chance of a good outcome drops, however, the propensity for other interventions increases (“Try a central line with my off-hand? Why not!?”) and group interest in giving more epi decreases. The data certainly seems to reflect that, with Table 1 clearly showing a dramatic increase in arrest duration accompanying the longer dosing-intervals.

[Image: edited excerpt from Table 1 of Warren et al., arrest duration by dosing interval]

That’s a highly-edited excerpt from Table 1; the original table is gigantic with a ton of characteristics listed, but most of them were pretty comparable across all of the dose-intervals. But that, that is something…

One factor at play is a form of selection bias that I guess I could call an anti-length bias (someone out there correct me if there’s a better term for this). Usually length bias is discussed in the setting of cancer screening, where faster growing cancers are less likely to be picked up by screening but more likely to be aggressive and fatal. As a result, the patients who survive long enough to be picked up on screening have already self-selected to be at lower risk for an aggressive tumor and thus have a lower mortality.

Here, by definition, only patients who stayed in arrest at least 9 minutes could ever populate the 9-10 min/dose group. As a result the shorter dosing-interval groups ended up with a disproportionate number of patients with shorter arrest durations, and correspondingly, lower mortality. Not only do patients do better the sooner they come out of cardiac arrest, but with an average arrest duration of 7.6 min in the 1-3 min/dose group, the great majority of those patients must have been experiencing ROSC. This study only looked at patients experiencing their first in-hospital cardiac arrest, so it’s highly unlikely most of those patients would have been declared dead after only an average of 7.6 min of CPR, leaving ROSC as the only other outcome.

These patients could still go on to experience in-hospital mortality later, but by achieving ROSC they certainly carry a better overall prognosis than patients who died and stayed dead. Disproportionately populating the short-interval group with these ROSCers will skew their mortality lower.

And that isn’t all. Recall that the other stop-point of a defined “cardiac arrest” event was ROSC lasting at least 20 minutes. This is hugely important. At first glance it may seem like a good endpoint because lots of resuscitated patients tend to go back into arrest, especially during the first ten minutes, but it absolutely kills this study (pardon the wording).

The population studied in this paper consisted only of patients from the intensive care unit and inpatient medical floors. These are not patients who usually experience a sudden cardiac arrest; by definition they had to make it upstairs to have even been considered. Instead, this is a population that tends to spiral downwards over time rather than experience an unexpected catastrophe. The latter still occurs, but at a much lower rate than in the community or even the emergency department.

Anyone who’s been at this for even a modest amount of time has seen the patient with a BP of 50/30 mmHg and a rhythm on the monitor who then “loses pulses.” It’s uncertain whether they actually have a cardiac output but a Code Blue is announced, the patient is given 1mg of epinephrine, and then BOOM, pulses come back.

This hypothetical patient could achieve ROSC with the first dose of epi one minute after the Code was announced, keep a decent cardiac output for the next 10 minutes, and then lose her pulses again. You know this game?

The clock has not reset and this is still considered the same “code” according to this study. As before she responds to a dose of epi and then manages to keep her pulses for at least 20 more minutes following the administration of a norepinephrine drip. The clock is now stopped at the second time she regained pulses. So, in essence, she received one additional dose of epi over approximately 10 minutes and will be evaluated in the 9-10 min dosing group, plus her duration of “cardiac arrest” is now recorded at something like 12 min instead of 2 min. Never mind that categorization doesn’t even come close to capturing what really happened, but that’s how she’ll be analyzed in this study.

To the authors’ credit they did exclude patients with intervals > 10 min for this reason, but that eliminates only the most blatant of cases; plenty will still end up in the data. They also excluded patients who received a non-epinephrine vasopressor during the arrest, but this doesn’t account for all of the patients described by the scenario above who received one after “final” ROSC to stave off further arrest.

So, what we see at this point is that this paper is a horrible mess of cross-pollination between study categories. Short dosing-interval patients are being placed into longer-interval categories because of the resuscitation-length issues covered in Part 1 and the intermittent-ROSC factors just discussed. On the other hand, the patients who still manage to make it into the short dosing-intervals are going to show markedly decreased mortality compared to the longer dosing-intervals because many of the latter needed to “stay dead longer” in order to even make it into their dosing-group.

How will this all pan out? Stay tuned for Part 3 where we will finally discuss the outcome data…

Epi Timing in Cardiac Arrest (Part 1)

There’s a new study by Warren et al. out in the most recent issue of Resuscitation that examines the use of epinephrine during in-hospital cardiac arrest. It also purports to show a possible benefit to non-standard dosing regimens.

Your pupils just dilated slightly… I’ve been watching the new season of Sherlock.


Epinephrine is a touchy subject in the world of critical care, both prehospital and in-hospital, so this study is bound to garner a bit of attention. The big questions are whether that attention is deserved and what to do with the information that’s contained within.

If you want to cut to the skinny, it looks like this data isn’t nearly strong enough to affect the next round of ACLS/ILCOR guidelines… at least I hope it won’t. It’s not just weak data; it’s fundamentally flawed and probably garbage. If you care why, and I think you should, allons-y!


…there’s also been some Doctor Who thrown in.

From the abstract, the aim of the paper was, “to evaluate the association between epinephrine average dosing period and survival to hospital discharge in adults with an in-hospital cardiac arrest.” The data examined was prospectively gathered and retrospectively examined from approximately 21,000 in-hospital cardiac arrests at about 500 hospitals in the Get With The Guidelines – Resuscitation (GWTG-R) registry.

This means that the data was gathered with the knowledge that it would be used in future research, but at the time of entry that exact use was unknown. At a later date the authors then looked backwards at a near ten-year chunk of the data and attempted to parse out how varying dose-intervals of epinephrine were associated with patient outcomes. It’s a noble pursuit and an important subject that has been understudied in the past, but one of the reasons why it’s not often researched is that it is difficult to examine outside of the setting of a randomized controlled trial. This design can make for decent hypothesis-generation in the right scenario, but here it’s pretty weak-sauce.

It’s a pretty big registry, and that can be a good thing if you’re asking simple questions with simple answers. The problem is that we’re not asking a simple question. It’s notoriously difficult to record the timing of medications during a cardiac arrest, and in fact, the GWTG-R registry doesn’t even record epinephrine timing after the first dose. How could the researchers even attempt this study?


Well, the GWTG-R registry does record the total number of doses of epinephrine administered during the arrest, along with the time to return of spontaneous circulation (ROSC). The authors, looking at resuscitations that received more than one dose of epi, then divided the time from the first dose of epinephrine to the end of the resuscitation by the subsequent number of doses to come up with the average amount of time between doses. Here’s an example of an ideal patient receiving epi every 5 minutes.

5-6min Epi Interval

q5min dosing categorized as “5 to <6 min dosing”

If you’ve ever been involved in a resuscitation you will quickly realize that this way of calculating epinephrine timing in no way reflects real-world practice. Resuscitations are messy affairs and dose-timing can be all over the place during a single arrest, with one interval of 2 min, another of 8 min, and another of 2 min. While that may average out to a rate of one dose every four minutes, it’s probably very different from giving the patient evenly spaced q4min doses of epi.

The authors recognized that this is only an estimate of the dosing intervals, but another issue that compounds this mess with the timing is that it is exceptionally dependent on when ROSC was noted and recorded. The same “q5min” dosing pattern shown above can result in at least four different interval-stratifications depending on the length of resuscitation and the time ROSC is recognized.

6-7min Epi Interval

q5min dosing categorized as “6 to <7 min dosing”

7-8min Epi Interval

q5min dosing categorized as “7 to <8 min dosing”

9-10min Epi Interval

q5min dosing categorized as “9 to <10 min dosing”

You’ll also note that all of the above patterns bias towards estimating a longer interval than what was really prescribed and administered, never shorter. This will come up later.
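
To make the arithmetic concrete, here is a minimal Python sketch of the averaging described above: the time from the first epi dose to the end of the resuscitation, divided by the number of doses after the first. The function names and the four timelines are my own assumptions for illustration; they are not taken from the paper or the registry, and the one-minute bins simply mirror the category labels in the captions.

```python
import math


def mean_dosing_interval(first_dose_min, end_min, total_doses):
    """Average minutes between doses, estimated as (end - first dose) / later doses."""
    if total_doses < 2:
        raise ValueError("the estimate needs more than one dose")
    return (end_min - first_dose_min) / (total_doses - 1)


def interval_category(interval_min):
    """Label the one-minute bin that an average interval falls into."""
    low = math.floor(interval_min)
    return f"{low} to <{low + 1} min"


# Hypothetical q5min timelines with the first dose at t = 0 (assumed values).
# Each entry: (total doses given, minute the resuscitation ended)
timelines = [
    (3, 10.0),  # doses at 0, 5, 10; clock stops right at the third dose
    (3, 13.0),  # same doses; clock stops 3 min later
    (3, 14.5),  # same doses; clock stops just before a fourth dose is due
    (2, 9.5),   # doses at 0, 5; clock stops just before a third dose is due
]

for total_doses, end_min in timelines:
    interval = mean_dosing_interval(0.0, end_min, total_doses)
    print(f"{total_doses} doses, end at {end_min} min -> "
          f"{interval:.2f} min/dose, binned as {interval_category(interval)}")
```

Under these assumed timelines, the same every-5-minutes schedule lands in four different bins, and the estimate is never shorter than the 5 minutes actually used.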

To keep things reasonable let’s end our initial discussion there for today, but if you like this stuff look forward to Part 2 (and 3… and maybe 4) being posted over the coming days.


No opinion less interesting than an expert opinion


The Greek philosopher Diogenes used to wander around in broad daylight with a lit lantern. Whenever anybody asked, he’d tell them he was looking for an honest man.

Good luck with that.

Who do we get to write clinical guidelines? Or even to perform and author literature reviews, considered one of the highest levels of evidence, because they summarize everything that’s known with respect to a particular clinical question?

We get experts to do it. Physicians, clinicians, and researchers who have spent a long time dealing with the issue, and who know all the ins and outs. We put a bunch of them together, let them review the literature, talk it out, then write down their conclusions. It’s like a judge — or a jury — examining the evidence in a criminal trial and deciding what the facts are. Right?

Maybe not.

A lot of those experts have worked with the drug or device companies that stand to make money from their conclusions. That makes them biased, and it’s hard to avoid. But even if you manage to assemble a panel of experts who lack financial conflicts of interest, that doesn’t mean they’re objective.

These are experts. They’re experts like, say, Barack Obama’s Chief of Staff is an expert on politics. Would you ask him for a dispassionate, unbiased opinion on the Republican party? Of course not.

And similarly, any expert on a medical issue is going to have a bias already. He’s spent his career thinking about the topic, researching it, arguing about it, lecturing on it — in fact, he may have based a large part of his career and reputation as an expert upon the points he preaches. How can he not have an opinion? Any expert without an opinion isn’t much of an expert!

So if we let him look at the literature — even in a very rigorous, comprehensive systematic review — what’s the chance that it’s going to change his mind? How many times have you seen someone in an argument change their mind because their opponent presented some good facts? Approximately never? Yeah, that’s what I thought.

Even if you hand a huge stack of research papers to a room full of experts, if you know their backgrounds, you can probably already predict what each of them is going to conclude, no matter what’s in those papers. Just like you know what Obama’s staff is going to say about Republicans.

So how can we get a truly unbiased analysis of the evidence (without doing it ourselves)?

One idea would be to use people who aren’t experts. They’d have enough general background to understand what they’re reading — probably they’d be physicians, for instance — but no professional background in the relevant debate. A review about stroke treatments wouldn’t involve neurologists, but maybe dermatologists — people with no axe to grind in that arena, and who frankly don’t much care about it.

They should be experts in reading and analyzing scientific literature, interpreting statistics, and looking for flaws and bias. Just like a criminal judge. But they shouldn’t be content experts, because content experts are just as biased as any lawyer arguing their case.

Now, our non-experts shouldn’t have to figure everything out on their own. They should be advised by the experts — just like in a trial, the experts on one side would make their argument, frame their interpretation, and point to the evidence they like. The folks who disagree would make their counter-arguments and offer their preferred literature. Our adjudicating non-experts could ask questions and read the data. Then they’d give their decision on who made the most sense and what the evidence supported — and it’d be worth listening to, because they’re smart enough to know what they’re doing, they’ve seen the relevant evidence, and yet, they have no reason to BS you. They’re basically doing what you’d like to do if you weren’t very busy and unable to review the world’s literature on absolutely every clinical subject that pertains to your practice.

Would it work? I don’t know. But the current way doesn’t.

Graphtistical Tomfoolery

Correlation versus causation

There’s a great Facebook post making the rounds which is worth the read for anyone amused by statistical smoke-and-mirrors.

Hopefully it’s not news to anybody that correlation isn’t the same thing as causation. But it still seems to surprise some people that you can do a study on practically anything (a random drug, some arbitrary diagnostic criteria) and, well… let’s just say that you’ll probably find a positive result.

There are lots of reasons for that, many of which we’ve talked about before, but one is the simple old truism that two numbers going up or down at the same time means very little on its own. Try coming up with some of your own graphs to demonstrate this; I have one here that provides compelling evidence of a strong association between the quantity of iPhones and number of polar vortexes around the globe.

Outcomes (and why you don’t get to supersize them)


Many people have trouble understanding outcomes. Obviously, the outcome is whatever a study is measuring — the results. But what’s the difference between primary and secondary outcomes? Why is it important to specify them before the study is performed, rather than after? Why is it more rigorous to have one or two outcomes instead of fifty?

The simple answer: the more times you spin the wheel, the more likely you’re going to eventually hit the jackpot… but the less impressed we’ll be when it does.


Accurate rifles don’t need ten shots to hit the target

Suppose I told you that I’ve built a miraculous device — the Dice-Guesser 1000. I wave it over a six-sided die, it beeps and whirrs, and after a few moments, it flashes a number: 4. Now you roll the die, and it comes up with a 4.

Pretty impressive, right? Cool trick. It’s cool because the likelihood of guessing that roll by pure chance is only 1 in 6, which isn’t very high… so you think maybe the Dice-Guesser 1000 actually works. Normally I’d assume it was guessing randomly, but the odds were low that it would’ve guessed correctly the first time without any help, so maybe it really can predict dice rolls. If I did this trick many times, you’d be even more convinced, because it’s less and less likely that I’d keep guessing right if my device didn’t have some magic. Buy stock in my company.

Now, let’s do things a little differently. I have another device, the Dice-Guesser 2000. The only difference is this: instead of flashing one number, it flashes six numbers. 1, 2, 3, 4, 5, and 6. You roll a 4. Behold! One of my guesses was correct!

You’re not impressed this time? Why not? Isn’t the chance of correctly guessing the roll still just 1/6?

Sure it is. But I guessed six times, and that meant the chance that any one of my guesses would be correct was much higher (100%, in fact). The original feat was only impressive because I was picking one outcome out of many possibilities, and it was therefore improbable that I’d be right unless my device had some real magic. But if I pick more than one guess, it’s increasingly likely that random chance does explain my right answer — and if I guess enough times, it approaches statistical certainty that I’ll get it right. Doesn’t mean my device has any predictive power. Just means I tried until I got lucky.
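
For anyone who wants to check the dice arithmetic, here is a small Python sketch. It assumes a fair six-sided die and guesses that are all different from one another; the function names are made up for this example.

```python
from fractions import Fraction


def chance_any_guess_hits(distinct_guesses, sides=6):
    """Chance that at least one of several distinct guesses matches one fair roll."""
    return Fraction(min(distinct_guesses, sides), sides)


def chance_of_lucky_streak(rolls, sides=6):
    """Chance of calling every roll correctly with one blind guess per roll."""
    return Fraction(1, sides) ** rolls


# More guesses per roll: a hit gets easier and less impressive.
for guesses in range(1, 7):
    print(guesses, "guess(es) on one roll:", chance_any_guess_hits(guesses))

# One guess per roll, repeated: a run of hits gets harder to explain by luck.
for rolls in (1, 2, 3):
    print(rolls, "correct call(s) in a row:", chance_of_lucky_streak(rolls))
```

The two loops pull in opposite directions, which is the whole point: piling guesses onto a single roll makes a hit meaningless, while repeating a single-guess trick makes luck a worse and worse explanation.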

So if you want to know something about my device, you’d be smart to limit me to only one guess. Maybe a couple. But not six.


My outcomes overfloweth

Now let’s apply this to a clinical study.

We’ll make it the strongest kind of study, a randomized controlled trial. Let’s say it’s a trial of Strokesallbetter, a pill that we think might help people after an ischemic stroke.

Before we run the trial, we specify a primary outcome: mortality. We’re guessing that patients who take Strokesallbetter will be less likely to die than patients who just take a placebo. We run the trial, and behold — patients who took Strokesallbetter were less likely to die! Are you impressed? You should be. Probably the drug works.

Now let’s take a different drug, Strokesjustfine. And let’s change things up. We’ll designate an outcome of mortality. But we’ll also designate another outcome: neurological deficits after 30 days. And we’ll designate another outcome too: duration of hospital stay. In fact, let’s designate one hundred different outcome measures, all unrelated. More data’s always a good thing, right?

We run the trial, and unfortunately, Strokesjustfine produces no difference in mortality. Or neurological deficits. Or hospital stays. In fact, it showed no difference for almost anything we measure, but one outcome was different: patients who took Strokesjustfine had a slightly lower chance of receiving a parking ticket in the 41 days following their stroke.

Are you impressed? Should we market Strokesjustfine as a miracle parking-ticket cure?

Probably not. Why? Because this is just like the case where we kept guessing at the dice roll. It’s probably not random luck if you hit the target with the only shot you take. But if you shoot fifty times and one of them makes it, it probably is random luck. Nice try, no cigar.

Let’s put it a little more simply. In most studies, the statistical threshold for significance (how strong a correlation must be for us to assume it’s real, and not just chance) is 95%, or a p value of .05. That translates to a 1 in 20 chance of seeing a result that strong by luck alone when nothing real is going on. Pretty good numbers… if you only guess once. What if you guess 20 times, in other words, you have 20 different outcomes? Now if one outcome shows a statistically significant correlation, that’s not an impressive result, because each outcome had a 1 in 20 chance of happening by random probability, and you tried 20 times!
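
Here is a short Python sketch that puts a number on that. It assumes the outcomes are independent of each other and that the drug truly does nothing, so every "significant" result is a fluke; real trial outcomes are usually correlated, so treat it as a rough guide rather than an exact figure. The function name is my own.

```python
def chance_of_false_alarm(num_outcomes, alpha=0.05):
    """Chance that at least one of several independent null outcomes looks significant."""
    return 1 - (1 - alpha) ** num_outcomes


for k in (1, 3, 20, 100):
    print(f"{k:>3} outcomes: {chance_of_false_alarm(k):.0%} chance of at least one fluke, "
          f"{k * 0.05:.2f} flukes expected on average")
```

With 20 outcomes the chance of at least one spurious hit is roughly 64%, and with 100 it is close to certain.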

That’s why legitimate studies usually specify only one primary outcome. One hypothesis, one finish line; that’s where you’re putting your money, what you really think and hope might happen. At the most, there might be two or three, particularly if one is a safety outcome (i.e. not whether the drug worked, but whether it hurt anyone, which is a different kettle of fish). Best practice is to limit it to one primary outcome and perhaps two or three secondary outcomes (like the Pips to Gladys Knight, they’re not as central, but they’re something), all of them defined in advance. But that’s all you get. If you start listing dozens and dozens of outcomes, it no longer means anything when one of them happens to show a signal — statistically, that’s expected to happen, and you have no basis to believe it’s anything except chance.

That doesn’t mean you can’t measure or calculate plenty of different datapoints, of course. But when one of them has an interesting result, you don’t get to pretend that was your goal all along. If your primary outcome was a failure, your trial was a failure, period. If you had some secondary outcome that was positive, and there’s a plausible reason it might be a real finding (i.e. not the parking ticket situation above), then you can run another trial using that as the primary outcome. If you find a significant result this time, then great — you win! But merely noticing that result within the mountain of secondary data isn’t the same as designating the outcome ahead of time, for reasons that are hopefully now clear.


Hanging out the data nets

The ultimate example of this phenomenon is the practice of data mining.

Also known as data dredging, this is a lot like the big ocean-going fishing vessels that hang out nets and trawl along until they’re full. They’re not looking for anything in particular, they’re just picking up whatever — tuna, dolphins, old boots — until they notice something they like.

The way it works in research is like this: find a big bank of data. It could be from a study you performed, or it could be somebody else’s; doesn’t matter. Now, put it into a computer, and analyze it in every way possible. By our analogy above, you roll the dice over and over and over until you finally find a statistically-significant result. It doesn’t have to be something that was designated in advance. It doesn’t even have to be particularly plausible or logical. It could be the number of patients who lost between 63 and 66 hairs during the trial period. If it’s statistically significant, then you publish it.

And when you publish it, you highlight its statistical significance. Only a 1/20 chance of this happening randomly! What you certainly don’t highlight, or even mention, is that you ran twenty different analyses until you found this result. So nobody knows how many times you rolled the dice. Presto! We’ve created an impressive-looking association out of thin air.
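
If you would rather see this happen than take the arithmetic on faith, here is a minimal simulation sketch in Python. It assumes there is no real effect anywhere in the data, so each analysis produces a p value that is uniformly random between 0 and 1, and that the analyses are independent. The function names, the seed, and the number of simulated datasets are arbitrary choices of mine.

```python
import random


def dredge_once(rng, num_analyses=20, alpha=0.05):
    """Run several analyses on pure noise; report whether any looks significant."""
    # Assumption: with no real effect, each p value is uniform on [0, 1).
    return any(rng.random() < alpha for _ in range(num_analyses))


def share_with_a_finding(num_datasets=10_000, num_analyses=20, seed=1):
    """Fraction of noise-only datasets that yield at least one significant analysis."""
    rng = random.Random(seed)
    hits = sum(dredge_once(rng, num_analyses) for _ in range(num_datasets))
    return hits / num_datasets


print(f"{share_with_a_finding():.1%} of noise-only datasets gave a publishable result")
print(f"{share_with_a_finding(num_analyses=1):.1%} did so when limited to one analysis")
```

The exact percentages depend on the seed, but with twenty tries per dataset the share should sit well above half, while a single pre-specified analysis should stay near 1 in 20.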

The grim reality is that many published trials, even the big, impressive RCTs, are performed in this same way. They may designate a primary outcome beforehand. But there’s a lot of time and money invested in the study, so if it reaches its conclusion and the primary outcome is negative, do they publish it as a failure? No way. They dig through their data until they find some way of parsing the results that reaches statistical significance. Then they rewrite their study design and pretend that was their primary outcome all along. It’s called the Texas sharpshooter fallacy: fire your rifle at a blank wall, then draw a bulls-eye around wherever the bullets land, and you’ll never miss.

It’s hard to detect this nonsense, because when a study is published, you mostly have to trust that the authors haven’t changed their study design after peeking at the data. The only practical defense is pre-trial registration (ClinicalTrials.gov is the biggest registry), where researchers publish the design of their trials before they run them, and they can’t modify it later without the registry recording the change. That way, if you read their published write-up and it says the primary outcome was X, you can check the registry to confirm that their primary outcome was always X. Researchers are caught red-handed all the time by this, presumably because they hoped nobody would look.


Publication bias: the invisible dice

We talked before about publication bias, the phenomenon where negative trials are less likely to be published than positive ones, often disappearing into a desk drawer instead of appearing in the peer-reviewed literature and hence reaching your eyeballs. It happens all the time, particularly in studies funded by the industry that has a vested interest in their outcomes.

Guess what? Publication bias is another great way to roll extra, unseen dice. A significant result may have a 1/20 chance of occurring randomly, and maybe the trial only had a single outcome. But what if there are 19 other trials of the same drug, and every one of them was negative?

If you knew that, you’d recognize this positive result was probably spurious, despite its internal validity. But due to publication bias, you might never know about all those negatives. You search the literature, you find only the lucky positives, and you never know if they’re surrounded by a backdrop of failures that you simply don’t see.


So what’s the answer?

The take-away message is this:

  1. If it intends to show a significant association, a study should define a very small number of outcomes (preferably just one).
  2. Those outcomes should be designated before the study is conducted, and must not change at any point.
  3. Any findings beyond those pre-defined outcomes may generate hypotheses for future trials, but signify nothing on their own.
  4. Pre-trial registration can help readers verify that the above process was followed.
  5. Publication bias can hide how many attempts lie behind a positive result. Its presence is ubiquitous, and is not easily accounted for.

A Jaundiced Eye for Clinical Guidelines

Investigative journalist Jeanne Lenzer has written a good article at the BMJ on the pitfalls of clinical practice guidelines issued by professional bodies. It’s a good read, not too long, and it’s been made freely available for one week — and only one week, so go read it now.

This subject has a similar flavor to our recent highlight on publication bias. In short, the big professional organizations that comprise the “men in suits” ivory tower of the medical world regularly issue guidelines on various topics of care. These are recommendations, not laws or holy writ, and intrinsically they have no more value than an article in Cosmo; but because they come from bodies with such influence, they’re often seen as de facto standards of care. (Just look at how powerful the American Heart Association’s CPR guidelines have become.)

Unfortunately, doctors in the US are so scared of getting sued that much of their practice is guided by this fear. So if you’re an emergency physician, and the American College of Emergency Physicians publishes a document stating that they strongly recommend the use of tPA for ischemic stroke, the first thing on your mind isn’t whether you think tPA works, or is likely to help the patient, or is supported by evidence. It’s that if you don’t give it, and the patient has a bad outcome, you may be standing in court with a lawyer asking you, “Why did you feel you could disregard the standard of care?”

The trouble is that despite being smart and successful, the folks sitting on these recommendation boards are also, very often, tied to the money. In other words, they’re recommending tPA, and lo and behold, they’ve received money (often vaguely referenced “honoraria”) or favors (speaking engagements, free travel, etc.) from the people who make tPA, or are flat-out employed by them. This is incredibly common. Most of those people would say they’re not biased, and in many cases they probably believe that, but the effects of bias can be subtle and unconscious.

This sort of thing helps explain some of the really bizarre recommendations you’ll encounter. If you believe the professional guidelines, you’d think that we’re surrounded by wonder drugs, all with powerful evidence for their effectiveness. (In reality, of course, profoundly effective treatments are pretty rare.)

So, as always: if it sounds too good to be true, it probably is. If a new treatment sounds like a slam dunk, wait a decade and see. The fact that a bunch of folks with acronyms have something to say shouldn’t mean too much to you unless they’re recommending against a therapy — usually nobody benefits from that except the patients.

Publication bias

Although we’re all (presumably) better off when we try to answer questions by turning to the literature rather than, say, a Magic 8 Ball, it behooves us to remember that it’s still an imperfect source of truth. And while you can hone your ability to peruse a study for flaws or poor quality, one foible may never be detectable: publication bias.

In other words, no matter how careful you are, you can only read the studies that are published — you never know about the ones that aren’t. And odds are, the ones that weren’t published aren’t just a random selection… they’re the ones that weren’t favorable to the sponsors or didn’t find interesting results.

So when you look over the five studies that support an intervention and the one that doesn’t, bear in mind there may be fifteen other studies producing negative results that never saw the light of day. (Oh, and take a look at the conflict of interest declarations, and compare them to who might stand to benefit from a positive study.)

Here’s a good, brisk introduction to the subject

And here’s the exchange Goldacre mentions between the Cochrane group and Roche Pharmaceuticals, if you can stomach some truly shameless evasion.


Hopefully you’re out there reading the research and trying to figure out what you should be doing and why. But inevitably, when you get to the results, they’re reported using a baffling array of acronymic metrics. Some are fairly intuitive; some are truly confusing; but you should understand them all. Here’s a quick run-through on the most common terms, particularly odds ratios, absolute risk reduction, relative risk, and NNT, using two recently-added studies (Kudenchuk 1999 and Jacobs 2011) as examples.