Placebo means never having to say you’re sorry

 

Often in clinical research it seems like the placebo effect was put on earth solely to cause us headaches. Imagine how simple things would be without it. Controls? Who needs ’em? Just grab a bunch of random dudes with the disease and check back in a month to establish your baseline. Easy peasy.

But here’s another way of looking at it. One of the central tenets of modern EBM is that only treatment benefits above and beyond the effect of placebo (sometimes called “specific” effects) are “real” benefits worth pursuing. In other words, if your drug cured 50% of patients, but a placebo also cured 50%, then your drug had a net effect of zero and did nothing.

This is because we’d like to have medicines that work via the mechanisms we designed them to have, or at least in somewhat predictable and understandable ways. If a sugar pill cures a bunch of people, that’s very nice, but it’s not really medicine, is it? It’s some kind of mind-over-matter claptrap that we refuse to accept in our wards.

There are some interesting ethical questions here. For instance, if you’re the one suffering from a painful, debilitating, or deadly disease, do you care whether you’re cured by a sugar pill or by an exhaustively developed organic compound? If you feel better, can I really pop out from behind a bush and say, “Gotcha! That was the fake stuff! You don’t feel better at all!”? And if the placebo benefit hinges on the patient not knowing the truth, would it be ethical to start passing around fake (but effective) drugs?

The current attitude is no. This business leaves a sour taste in our collective mouths, perhaps because medicine spent so many centuries practicing exactly this sort of handwaving mumbo-jumbo. We consider ourselves scientists, and scientists treat disease using physiological mechanisms, not wishful thinking. Perhaps more importantly, we have an ethical problem with lying to our patients, which seems to be a necessary part of intentionally wielding placebos.

But it behooves us to remember that despite all this, the placebo effect can indeed be dramatic and beneficial. That’s the whole reason we try to control for it in trials — because otherwise, every random treatment would look pretty impressive, even if there’s nothing under the hood. No benefit over placebo is often still a big benefit to the patient.

What’s my point? My point is this: in the ivory towers of EBM, we spend a lot of time stroking our beards and worrying that we’re subjecting our patients to drugs, surgeries, and interventions that are not evidence-based. Perhaps there’s even evidence that they’re ineffective, or sometimes even harmful. Until we can banish all such snake oil from our practice, we believe that we’re wasting our patients’ time and money at best and harming them at worst.

But the fact is, this probably isn’t so. Because even our worst snake oil is still convincing, and it probably still carries with it real placebo value. (In fact, this is often why anecdotal reports of treatment success abound even when the controlled trials say “nope.”) When we say that Drug X has no benefit, we mean that it added nothing compared against a placebo — but in clinical practice, we’re not comparing it against anything, we’re just giving it to sick people. In other words, it is the placebo, with all the associated power.

So even without any “specific” effect, odds are that it has some value anyway. It’s not zero, and in fact, the benefit may even outweigh some specific harms.

And so, ladies and gentlemen, we should keep fighting the good fight to rid ourselves of myth and hokum. But in the meantime, we shouldn’t lose too much sleep worrying that we’re hurting anyone, because we’d have to hurt them quite a bit to outweigh that sugary placebo goodness.

 

Does NEXUS work in the elderly?

There aren’t a lot of studies specifically examining whether the NEXUS criteria are reliable (i.e. don’t miss important C-spine injuries) in the elderly.

Most of the big studies that derived and validated the NEXUS rule enrolled patients of all ages, so they do include this population. But their average age was much younger (20s-40s in most cases), so if you pick out the older subset, you’ll knock the n down from thousands to far fewer. (That would be equally true if you asked about validity for any other narrow age range, of course.)

So how many older patients has NEXUS been applied to? It’s hard to know. Neither the original NEXUS derivation nor the validation gives a full age breakdown, although the age range does go up past 100, so there were some patients in the older group. It’s a similar story in a retrospective chart application using NEXUS, and in a study that compared NEXUS with the Canadian C-spine rule. These are all studies with thousands enrolled, though, so even the subsets should carry some weight.

Domeier 2005 studied a modified NEXUS for prehospital use, and they do give an age breakdown; eyeballing the chart, it looks like about 1,900 of those enrolled were age 75+. They found an overall sensitivity a little lower than in the other studies, about 92%, and it’s true that a fair number of their injuries were in the older cohort, but none of the missed injuries mattered (no clinical sequelae).

Now, Goode 2014 was just released and seems to be one of the only studies specifically addressing this. They concluded that NEXUS wasn’t very sensitive at age >65, with a sensitivity of only 65.6%. However, the sensitivity below age 65 was only 84.2%, which is dramatically lower than in the other studies, so they’re clearly doing something different; if we trusted these numbers, we shouldn’t be using NEXUS for anybody. Mainly, the difference seems to be due to the higher-risk population they enrolled; they only looked at patients with

… associated injuries from high-energy mechanisms (e.g., pelvic/long bone fractures), ejection from a vehicle, death in same compartment vehicle, fall from greater than 20 feet, vehicle speed greater than 40 mph, major vehicle deformity/significant intrusion, and pedestrian struck with speed greater than 5 to 20 mph.

In other words, big-sick trauma activations, not the “all blunt trauma” population used in the other studies. This is reflected in the higher rate of C-spine fracture in both groups: 7.4% in the young and 12.8% in the old, far higher than the ~2% rate of fracture in most other studies. Since it’s unlikely that these types of patients are getting clinically cleared anyway — no matter what, they’re getting a collar from most EMS crews and a CT scan from most EDs — I’m not sure how useful this data is. NEXUS is for minor injuries in patients who look okay, not multi-system trauma codes.

So do older patients qualify for NEXUS? The data certainly isn’t as robust for this practice as it is for younger patients. But it does support it; none of the major NEXUS studies put a cap on age, and all of them included at least some patients over 75 or 85 or whatever.

If you are very worried, it may be reasonable to insist upon a specific study validating this age range, with enough power to focus on that specific population, but I’m not sure why you should be so worried. Although older patients may be at higher risk for fracture, that’s not the issue; the issue is whether the NEXUS criteria can detect those fractures, and I don’t think there’s any good reason to say that all old patients can’t reliably report pain or neuro deficits. Obviously, selected patients (those with cognitive impairment, peripheral neuropathies, or other confounding conditions, for instance) may present obstacles, but hopefully your clinical judgment would already tell you that you may not be able to clinically clear those people anyway. NEXUS specifically has caveats to skip patients who can’t reliably report their symptoms — intoxicated, distracting injury, AMS — and if something is present that isn’t on that list but still confounds the picture, you probably shouldn’t clear them. When it comes to corner cases, use the principles, not the letter of the law.

If anybody is really worried about this we can perhaps write to Hoffman or some of the other authors and ask if they have the age breakdowns for their big studies; that way we’d know exactly how many older folks have actually been studied.

Or just use the Canadian C-spine rule, which includes age >65 as an exclusion anyway. (Yep, it’s been validated for prehospital use as well.)

No opinion less interesting than an expert opinion


The Greek philosopher Diogenes used to wander around in broad daylight with a lit lantern. Whenever anybody asked, he’d tell them he was looking for an honest man.

Good luck with that.

Who do we get to write clinical guidelines? Or even to perform and author literature reviews, considered one of the highest levels of evidence, because they summarize everything that’s known with respect to a particular clinical question?

We get experts to do it. Physicians, clinicians, and researchers who have spent a long time dealing with the issue, and who know all the ins and outs. We put a bunch of them together, let them review the literature, talk it out, then write down their conclusions. It’s like a judge — or a jury — examining the evidence in a criminal trial and deciding what the facts are. Right?

Maybe not.

A lot of those experts have worked with the drug or device companies that stand to make money from their conclusions. That makes them biased, and it’s hard to avoid. But even if you manage to assemble a panel of experts who lack financial conflicts of interest, that doesn’t mean they’re objective.

These are experts. They’re experts like, say, Barack Obama’s Chief of Staff is an expert on politics. Would you ask him for a dispassionate, unbiased opinion on the Republican party? Of course not.

And similarly, any expert on a medical issue is going to have a bias already. He’s spent his career thinking about the topic, researching it, arguing about it, lecturing on it — in fact, he may have based a large part of his career and reputation as an expert upon the points he preaches. How can he not have an opinion? Any expert without an opinion isn’t much of an expert!

So if we let him look at the literature — even in a very rigorous, comprehensive systematic review — what’s the chance that it’s going to change his mind? How many times have you seen someone in an argument change their mind because their opponent presented some good facts? Approximately never? Yeah, that’s what I thought.

Even if you hand a huge stack of research papers to a room full of experts, if you know their backgrounds, you can probably already predict what each of them is going to conclude, no matter what’s in those papers. Just like you know what Obama’s staff is going to say about Republicans.

So how can we get a truly unbiased analysis of the evidence (without doing it ourselves)?

One idea would be to use people who aren’t experts. They’d have enough general background to understand what they’re reading — probably they’d be physicians, for instance — but no professional background in the relevant debate. A review about stroke treatments wouldn’t involve neurologists, but maybe dermatologists — people with no axe to grind in that arena, and who frankly don’t much care about it.

They should be experts in reading and analyzing scientific literature, interpreting statistics, and looking for flaws and bias. Just like a criminal judge. But they shouldn’t be content experts, because content experts are just as biased as any lawyer arguing their case.

Now, our non-experts shouldn’t have to figure everything out on their own. They should be advised by the experts — just like in a trial, the experts on one side would make their argument, frame their interpretation, and point to the evidence they like. The folks who disagree would make their counter-arguments and offer their preferred literature. Our adjudicating non-experts could ask questions and read the data. Then they’d give their decision on who made the most sense and what the evidence supported — and it’d be worth listening to, because they’re smart enough to know what they’re doing, they’ve seen the relevant evidence, and yet, they have no reason to BS you. They’re basically doing what you’d like to do if you weren’t very busy and unable to review the world’s literature on absolutely every clinical subject that pertains to your practice.

Would it work? I don’t know. But the current way doesn’t.

Graphtistical Tomfoolery

Correlation versus causation

There’s a great Facebook post making the rounds which is worth the read for anyone amused by statistical smoke-and-mirrors.

Hopefully it’s not news to anybody that correlation isn’t the same thing as causation. But it still seems to surprise some people that you can do a study on practically anything — a random drug, some arbitrary diagnostic criteria — and, well… let’s just say that you’ll probably find a positive result.

There are lots of reasons for that, many of which we’ve talked about before, but one is the simple old truism that two numbers going up or down at the same time means very little on its own. Try coming up with some of your own graphs to demonstrate this; I have one here that provides compelling evidence of a strong association between the number of iPhones and the number of polar vortexes around the globe.
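If you want to roll your own, here’s a minimal sketch in Python (using made-up, randomly generated numbers rather than any real iPhone or weather data) showing how two completely independent series will still correlate strongly so long as they both happen to drift upward over time:

```python
# Toy demonstration (not real data): two series generated completely
# independently of one another will still show a strong Pearson correlation
# if they both happen to trend upward over time.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(2000, 2014)

# Two independent "measurements" that each just drift upward year over year.
iphones_sold = np.cumsum(rng.uniform(1, 5, size=years.size))      # arbitrary units
polar_vortexes = np.cumsum(rng.uniform(0.5, 2, size=years.size))  # arbitrary units

r = np.corrcoef(iphones_sold, polar_vortexes)[0, 1]
print(f"Correlation between two unrelated series: r = {r:.2f}")
# Typically prints r well above 0.9: a "strong association" with no causal
# link whatsoever; the shared upward trend over time does all the work.
```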

DRL Update

Eight new papers have been added to the Digital Research Library. Three are in the Cardiac Arrest shelf, including some important recent stuff on steroids and epi; the other five are in the Miscellaneous section, ranging through Trauma, Hematology, Electrocardiography, Patient Assessment, and Respiratory.

As always, search for the triple asterisk (***) to see new material.

Outcomes (and why you don’t get to supersize them)


Many people have trouble understanding outcomes. Obviously, the outcome is whatever a study is measuring — the results. But what’s the difference between primary and secondary outcomes? Why is it important to specify them before the study is performed, rather than after? Why is it more rigorous to have one or two outcomes instead of fifty?

The simple answer: the more times you spin the wheel, the more likely you are to eventually hit the jackpot… but the less impressed we’ll be when you do.

 

Accurate rifles don’t need ten shots to hit the target

Suppose I told you that I’ve built a miraculous device — the Dice-Guesser 1000. I wave it over a six-sided die, it beeps and whirrs, and after a few moments, it flashes a number: 4. Now you roll the die, and it comes up with a 4.

Pretty impressive, right? Cool trick. It’s cool because the likelihood of guessing that roll by pure chance is only 1 in 6, which isn’t very high… so you think maybe the Dice-Guesser 1000 actually works. Normally I’d assume it was guessing randomly, but the odds were low that it would’ve guessed correctly the first time without any help, so maybe it really can predict dice rolls. If I did this trick many times, you’d be even more convinced, because it’s less and less likely that I’d keep guessing right if my device didn’t have some magic. Buy stock in my company.

Now, let’s do things a little differently. I have another device, the Dice-Guesser 2000. The only difference is this: instead of flashing one number, it flashes six numbers. 1, 2, 3, 4, 5, and 6. You roll a 4. Behold! One of my guesses was correct!

You’re not impressed this time? Why not? Isn’t the chance of correctly guessing the roll still just 1/6?

Sure it is. But I guessed six times, and that meant the chance that any one of my guesses would be correct was much higher (100%, in fact). The original feat was only impressive because I was picking one outcome out of many possibilities, and it was therefore improbable that I’d be right unless my device had some real magic. But if I pick more than one guess, it’s increasingly likely that random chance does explain my right answer — and if I guess enough times, it approaches statistical certainty that I’ll get it right. Doesn’t mean my device has any predictive power. Just means I tried until I got lucky.

So if you want to know something about my device, you’d be smart to limit me to only one guess. Maybe a couple. But not six.

 

My outcomes overfloweth

Now let’s apply this to a clinical study.

We’ll make it the strongest kind of study, a randomized controlled trial. Let’s say it’s a trial of Strokesallbetter, a pill that we think might help people after an ischemic stroke.

Before we run the trial, we specify a primary outcome: mortality. We’re guessing that patients who take Strokesallbetter will be less likely to die than patients who just take a placebo. We run the trial, and behold — patients who took Strokesallbetter were less likely to die! Are you impressed? You should be. Probably the drug works.

Now let’s take a different drug, Strokesjustfine. And let’s change things up. We’ll designate an outcome of mortality. But we’ll also designate another outcome: neurological deficits after 30 days. And we’ll designate another outcome too: duration of hospital stay. In fact, let’s designate one hundred different outcome measures, all unrelated. More data’s always a good thing, right?

We run the trial, and unfortunately, Strokesjustfine produces no difference in mortality. Or neurological deficits. Or hospital stays. In fact, it shows no difference for almost anything we measure, but one outcome is different: patients who took Strokesjustfine had a slightly lower chance of receiving a parking ticket in the 41 days following their stroke.

Are you impressed? Should we market Strokesjustfine as a miracle parking-ticket cure?

Probably not. Why? Because this is just like the case where we kept guessing at the dice roll. It’s probably not random luck if you hit the target with the only shot you take. But if you shoot fifty times and one of them makes it, it probably is random luck. Nice try, no cigar.

Let’s put it a little more simply. In most studies, the statistical threshold for significance — how strong a correlation must be for us to assume it’s real, and not just chance — is 95% (a p value of .05). That translates to a 1 in 20 chance of the result happening randomly. Pretty good numbers… if you only guess once. What if you guess 20 times — in other words, you have 20 different outcomes? Now if one outcome shows a statistically significant correlation, that’s not an impressive result, because it had a 1 in 20 chance of happening by random probability, and you tried 20 times!
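To put numbers on that intuition, here’s a quick back-of-the-envelope sketch in Python (a toy calculation that assumes the outcomes are statistically independent, which real clinical outcomes rarely are, but it makes the point):

```python
# If each outcome is tested at p < 0.05 and the drug truly does nothing, the
# chance that at least one outcome comes up "significant" by luck alone grows
# rapidly with the number of outcomes measured.
alpha = 0.05

for n_outcomes in (1, 3, 5, 20, 100):
    p_at_least_one = 1 - (1 - alpha) ** n_outcomes
    print(f"{n_outcomes:>3} outcomes -> {p_at_least_one:.0%} chance of a spurious 'hit'")

# Output:
#   1 outcomes -> 5% chance of a spurious 'hit'
#   3 outcomes -> 14% chance of a spurious 'hit'
#   5 outcomes -> 23% chance of a spurious 'hit'
#  20 outcomes -> 64% chance of a spurious 'hit'
# 100 outcomes -> 99% chance of a spurious 'hit'
```

With twenty outcomes, in other words, a “significant” result somewhere is more likely than not.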

That’s why legitimate studies usually specify only one primary outcome. One hypothesis, one finish line; that’s where you’re putting your money, what you really think and hope might happen. At the most, there might be two or three, particularly if one is a safety outcome (i.e. not whether the drug worked, but whether it hurt anyone, which is a different kettle of fish). Best practice is to limit it to one primary outcome and perhaps two or three secondary outcomes (like the Pips to Gladys Knight, they’re not as central, but they’re something), all of them defined in advance. But that’s all you get. If you start listing dozens and dozens of outcomes, it no longer means anything when one of them happens to show a signal — statistically, that’s expected to happen, and you have no basis to believe it’s anything except chance.

That doesn’t mean you can’t measure or calculate plenty of different datapoints, of course. But when one of them has an interesting result, you don’t get to pretend that was your goal all along. If your primary outcome was a failure, your trial was a failure, period. If you had some secondary outcome that was positive, and there’s a plausible reason it might be a real finding (i.e. not the parking ticket situation above), then you can run another trial using that as the primary outcome. If you find a significant result this time, then great — you win! But merely noticing that result within the mountain of secondary data isn’t the same as designating the outcome ahead of time, for reasons that are hopefully now clear.

 

Hanging out the data nets

The ultimate example of this phenomenon is the practice of data mining.

Also known as data dredging, this is a lot like the big ocean-going fishing vessels that hang out nets and trawl along until they’re full. They’re not looking for anything in particular, they’re just picking up whatever — tuna, dolphins, old boots — until they notice something they like.

The way it works in research is like this: find a big bank of data. It could be from a study you performed, or it could be somebody else’s; doesn’t matter. Now, put it into a computer, and analyze it in every way possible. By our analogy above, you roll the dice over and over and over until you finally find a statistically-significant result. It doesn’t have to be something that was designated in advance. It doesn’t even have to be particularly plausible or logical. It could be the number of patients who lost between 63 and 66 hairs during the trial period. If it’s statistically significant, then you publish it.

And when you publish it, you highlight its statistical significance. Only a 1/20 chance of this happening randomly! What you certainly don’t highlight, or even mention, is that you ran twenty different analyses until you found this result. So nobody knows how many times you rolled the dice. Presto! We’ve created an impressive-looking association out of thin air.
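Here’s a toy simulation of the process in Python, using entirely made-up data with no treatment effect at all, sliced forty different ways until something crosses the magic threshold (the numbers and subgroups are invented purely for illustration):

```python
# Data dredging on pure noise: the "treatment" does nothing, but if we test
# enough different subgroups of the same dataset, some comparison will cross
# p < 0.05 by chance alone.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_patients = 200
n_analyses = 40  # forty different ways of slicing the same data

# Fake outcome measure with no treatment effect whatsoever.
treatment = rng.normal(size=n_patients)
control = rng.normal(size=n_patients)

significant = 0
for i in range(n_analyses):
    # Each "analysis" looks at a different random subgroup of the same patients.
    subgroup = rng.choice(n_patients, size=50, replace=False)
    p = ttest_ind(treatment[subgroup], control[subgroup]).pvalue
    if p < 0.05:
        significant += 1
        print(f"Analysis #{i}: p = {p:.3f}  <- looks publishable!")

print(f"{significant} of {n_analyses} dredged analyses were 'significant' on pure noise.")
```

Run it a few times with different seeds; on average about one in twenty of those analyses will clear the bar, and that’s the one that gets written up.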

The grim reality is that many published trials, even the big, impressive RCTs, are performed in this same way. They may designate a primary outcome beforehand. But there’s a lot of time and money invested in the study, so if it reaches its conclusion and the primary outcome is negative, do they publish it as a failure? No way. They dig through their data until they find some way of parsing the results that reaches statistical significance. Then they quietly rewrite their study design and pretend that was their primary outcome all along. It’s called the Texas sharpshooter fallacy: fire your rifle at a blank wall, then draw a bulls-eye around wherever the bullets land, and you’ll never miss.

It’s hard to detect this nonsense, because when a study is published, you mostly have to trust that the authors haven’t changed their study design after peeking at the data. The only practical defense is pre-trial registration (ClinicalTrials.gov is the biggest registry), where researchers publish the design of their trials before they run them, and they can’t modify it later without the registry recording the change. That way, if you read their published write-up and it says the primary outcome was X, you can check the registry to confirm that their primary outcome was always X. Researchers are caught red-handed all the time by this, presumably because they hoped nobody would look.

 

Publication bias: the invisible dice

We talked before about publication bias, the phenomenon where negative trials are less likely to be published than positive ones, often disappearing into a desk drawer instead of appearing in the peer-reviewed literature and hence reaching your eyeballs. It happens all the time, particularly in studies funded by the industry that has a vested interest in their outcomes.

Guess what? Publication bias is another great way to roll extra, unseen dice. A significant result may have a 1/20 chance of occurring randomly, and maybe the trial only had a single outcome. But what if there are 19 other trials of the same drug, and every one of them was negative?

If you knew that, you’d recognize this positive result was probably spurious, despite its internal validity. But due to publication bias, you might never know about all those negatives. You search the literature, you find only the lucky positives, and you never know if they’re surrounded by a backdrop of failures that you simply don’t see.
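Here’s a hypothetical sketch of that file-drawer effect in Python (simulated trials of a drug with no true effect; only the lucky “positives” ever reach your literature search):

```python
# Twenty honest trials of a drug that does nothing, each with a single
# pre-specified outcome. Only the "positive" ones get published, so the
# reader's view of the literature is rosier than reality.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

published, file_drawer = [], []
for trial in range(20):
    drug = rng.normal(size=100)      # no true effect
    placebo = rng.normal(size=100)
    p = ttest_ind(drug, placebo).pvalue
    (published if p < 0.05 else file_drawer).append(p)

print(f"Trials run: 20 | 'positive' and published: {len(published)} | "
      f"negative and left in the drawer: {len(file_drawer)}")
# The reader searching the literature sees only the published few, with no
# way to know about the drawer full of negatives sitting behind them.
```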

 

So what’s the answer?

The take-away message is this:

  1. If it intends to show a significant association, a study should define a very small number of outcomes (preferably just one).
  2. Those outcomes should be designated before the study is conducted, and must not change at any point.
  3. Any findings beyond those pre-defined outcomes may generate hypotheses for future trials, but signify nothing on their own.
  4. Pre-trial registration can help readers verify that the above process was followed.
  5. Publication bias can hide how many attempts lie behind a positive result. Its presence is ubiquitous, and is not easily accounted for.

EMTLife Journal Club: The utility of gestures in patients with chest discomfort

New Journal Club at EMT Life. Does the gesture a patient uses to indicate his chest pain or discomfort have predictive value for an ischemic etiology? In other words, can you say, “He pointed with his whole fist — it must be an MI”? Or is that nonsense? You decide!

Journal Club at EMTLife


Journal clubs are an old tradition, often practiced at medical schools, hospitals, research institutions, and similar settings. Evidence-minded folks share recent medical literature of interest, then get together on a regular basis (beer is sometimes involved) to discuss the findings and analyze the methods with a fine-toothed comb. It’s a great way to both keep up on contemporary research and to hone your critical skillset for appraising the good, the bad, and the ugly in a published paper.

A few users over at the busy EMTLife.com forums suggested a forum-based journal club, and I thought it was a splendid idea. So we’re giving it a whirl, and we’d love your participation. The first session went well, and the second has just been posted. Check it out!

A Jaundiced Eye for Clinical Guidelines

Investigative journalist Jeanne Lenzer has written a good article at the BMJ on the pitfalls of clinical practice guidelines issued by professional bodies. It’s a good read, not too long, and it’s been made freely available for one week — and only one week, so go read it now.

This subject has a similar flavor to our recent highlight on publication bias. In short, the big professional organizations that comprise the “men in suits” ivory tower of the medical world regularly issue guidelines on various topics of care. These are recommendations, not laws or holy writ, and intrinsically they have no more value than an article in Cosmo; but because they come from bodies with such influence, they’re often seen as de facto standards of care. (Just look at how powerful the American Heart Association’s CPR guidelines have become.)

Unfortunately, doctors in the US are so scared of getting sued that much of their practice is guided by this fear. So if you’re an emergency physician, and the American College of Emergency Physicians publishes a document stating that they strongly recommend the use of tPA for ischemic stroke, the first thing on your mind isn’t whether you think tPA works, or is likely to help the patient, or is supported by evidence. It’s that if you don’t give it, and the patient has a bad outcome, you may be standing in court with a lawyer asking you, “Why did you feel you could disregard the standard of care?”

The trouble is that despite being smart and successful, the folks sitting on these recommendation boards are also, very often, tied to the money. In other words, they’re recommending tPA, and lo and behold — they’ve received money (often vaguely-referenced “honoraria”), favors (speaking engagements, free travel, etc), or are flat-out employed by the people who make tPA. This is incredibly common. Most of those people would say they’re not biased, and in many cases they probably believe that, but the effects of bias can be subtle and unconscious.

This sort of thing helps explain some of the really bizarre recommendations you’ll encounter. If you believe the professional guidelines, you’d think that we’re surrounded by wonder drugs, all with powerful evidence for their effectiveness. (In reality, of course, profoundly effective treatments are pretty rare.)

So, as always: if it sounds too good to be true, it probably is. If a new treatment sounds like a slam dunk, wait a decade and see. The fact that a bunch of folks with acronyms have something to say shouldn’t mean too much to you unless they’re recommending against a therapy — usually nobody benefits from that except the patients.

Reinforcements

Fighting the good fight for truth, justice, and the evidence-based way is tiring stuff. That’s why we’re always looking for folks to help out with the DRL. There are a whole lot of good studies out there — research people should be reading — that haven’t made it onto our virtual shelves yet, simply because your poor librarians are too busy to add them.

So we’re pleased to announce that a new member has joined our ranks. Derek Sifford is a flight paramedic from the southern US with a passion for progressive EMS; he’ll be helping out behind the scenes, reading and sorting and filing with the rest of us. Not only does he bring wit and wisdom to the table, his background in critical care transport will be particularly helpful for wading through certain subjects.

Say hi if you see him! Oh, and haven’t you been thinking about adding something yourself to the world of prehospital EBM? Drop us a line, because we need your help.