Epi Timing in Cardiac Arrest (Part 2)

In our last post we examined the effect of the chosen resuscitation end-time on the overall duration of the resuscitation and how that affected the calculated mean time-interval between epinephrine doses. It’s worth reviewing quickly before we resume our discussion here.

The major next point in our examination of the Warren et al. paper on epinephrine dosing in cardiac arrest is a look at the endpoints they used to define a “cardiac arrest.” There were two different ways to hit STOP on the clock measuring duration-of-resuscitation: death or return of spontaneous circulation (ROSC) lasting > 20 minutes. Both have issues.

The former is pretty convenient from a charting standpoint, the “time of death,” but it also has the chance to introduce a lot of bias. It’s my personal experience that epi often flows fast-and-furious early in the code. As time drags on and the chance of a good outcome drops, however, the propensity for other interventions increases (“Try a central line with my off-hand? Why not!?”) and group interest in giving more epi decreases. The data certainly seems to reflect that, with Table 1 clearly showing a dramatic increase in arrest duration accompanying the longer dosing-intervals.

That’s a highly-edited excerpt from Table 1; the original table is gigantic with a ton of characteristics listed, but most of them were pretty comparable across all of the dose-intervals. But that, that is something…

One factor at play is a form of selection bias that I guess I could call an anti-length bias (someone out there correct me if there’s a better term for this). Usually length bias is discussed in the setting of cancer screening, where faster growing cancers are less likely to be picked up by screening but more likely to be aggressive and fatal. As a result, the patients who survive long enough to be picked up on screening are already self-selected to be at lower risk of an aggressive tumor and thus have a lower mortality.

Here, by definition, only patients who stayed in arrest at least 9 minutes could ever populate the 9-10 min/dose group. As a result the shorter dosing-interval groups ended up with a disproportionate number of patients with shorter arrest durations, and correspondingly, lower mortality. Not only do patients do better the sooner they come out of cardiac arrest, but with an average arrest duration of 7.6 min in the 1-3 min/dose group, the great majority of those patients must have been experiencing ROSC. This study only looked at patients experiencing their first in-hospital cardiac arrest, so it’s highly unlikely most of those patients would have been declared dead after only an average of 7.6 min of CPR, leaving ROSC as the only other outcome.

These patients could still go on to experience in-hospital mortality later, but by achieving ROSC they certainly carry a better overall prognosis than patients who died and stayed dead. Disproportionately populating the short-interval group with these ROSCers will skew their mortality lower.
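
For anyone who wants to see that skew appear out of thin air, here is a toy simulation in Python. It is only a sketch: the uniform arrest durations, the 1 to 8 follow-up doses, and the two bins are all assumptions made up for illustration, and nothing here is drawn from the Warren et al. data. In this toy world the dosing interval has no effect on anything, yet the long-interval bin still fills up with longer arrests, simply because the interval is calculated as duration divided by the number of doses after the first.

```python
import random


def simulate(n_patients=20000, seed=0):
    """Toy model: the dosing interval has NO causal effect on anything here."""
    rng = random.Random(seed)
    durations_by_bin = {"1 to <3": [], "9 to <10": []}
    for _ in range(n_patients):
        # Minutes from first dose to end of arrest (assumed distribution).
        duration = rng.uniform(1.0, 40.0)
        # Doses given after the first one (assumed, independent of duration).
        later_doses = rng.randint(1, 8)
        # Average interval estimated as duration / doses after the first.
        interval = duration / later_doses
        if 1.0 <= interval < 3.0:
            durations_by_bin["1 to <3"].append(duration)
        elif 9.0 <= interval < 10.0:
            durations_by_bin["9 to <10"].append(duration)
    # Mean arrest duration for the patients who landed in each bin.
    return {label: sum(d) / len(d) for label, d in durations_by_bin.items()}


mean_duration = simulate()
print(mean_duration)
```

With these made-up inputs the 9 to <10 min bin ends up with a clearly longer mean arrest duration than the 1 to <3 min bin, even though nobody in the simulation was helped or harmed by their dosing interval.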

And that isn’t all. Recall that the other stop-point of a defined “cardiac arrest” event was ROSC lasting at least 20 minutes. This is hugely important. At first glance it may seem like a good endpoint because lots of resuscitated patients tend to go back into arrest, especially during the first ten minutes, but it absolutely kills this study (pardon the wording).

The population studied in this paper consisted only of patients from the intensive care unit and inpatient medical floors. These are not patients who usually experience a sudden cardiac arrest; by definition they had to make it upstairs to have even been considered. Instead, this is a population that tends to spiral downwards over time rather than experience an unexpected catastrophe. The latter still occurs, but at a much lower rate than in the community or even the emergency department.

Anyone who’s been at this for even a modest amount of time has seen the patient with a BP of 50/30 mmHg and a rhythm on the monitor who then “loses pulses.” It’s uncertain whether they actually have a cardiac output but a Code Blue is announced, the patient is given 1mg of epinephrine, and then BOOM, pulses come back.

This hypothetical patient could achieve ROSC with the first dose of epi one minute after the Code was announced, keep a decent cardiac output for the next 10 minutes, and then lose her pulses again. You know this game?

The clock has not reset and this is still considered the same “code” according to this study. As before she responds to a dose of epi and then manages to keep her pulses for at least 20 more minutes following the administration of a norepinephrine drip. The clock is now stopped at the second time she regained pulses. So, in essence, she received one additional dose of epi over approximately 10 minutes and will be evaluated in the 9-10 min dosing group, plus her duration of “cardiac arrest” is now recorded at something like 12 min instead of 2 min. Never mind that categorization doesn’t even come close to capturing what really happened, but that’s how she’ll be analyzed in this study.

To the authors’ credit they did exclude patients with intervals > 10 min for this reason, but that eliminates only the most blatant of cases; plenty will still end up in the data. They also excluded patients who received a non-epinephrine vasopressor during the arrest, but this doesn’t account for all of the patients described by the scenario above who received one after “final” ROSC to stave off further arrest.

So, what we see at this point is that this paper is a horrible mess of cross-pollination between study categories. Short dosing-interval patients are being placed into longer-interval categories because of the resuscitation-length issues covered in Part 1 and the intermittent-ROSC factors just discussed. On the other hand, the patients who still manage to make it into the short dosing-intervals are going to show markedly decreased mortality compared to the longer dosing-intervals because many of the latter needed to “stay dead longer” in order to even make it into their dosing-group.

How will this all pan out? Stay tuned for Part 3 where we will finally discuss the outcome data…

Epi Timing in Cardiac Arrest (Part 1)

There’s a new study by Warren et al. out in the most recent issue of Resuscitation that examines the use of epinephrine during in-hospital cardiac arrest. It also purports to show a possible benefit to non-standard dosing regimens.

Your pupils just dilated slightly… I’ve been watching the new season of Sherlock.

Epinephrine is a touchy subject in the world of critical care, both prehospital and in-hospital, so this study is bound to garner a bit of attention. The big questions are whether that attention is deserved and what to do with the information that’s contained within.

If you want to cut to the skinny, it looks like this data isn’t nearly strong enough to affect the next round of ACLS/ILCOR guidelines… at least I hope it won’t. It’s not just weak data; it’s fundamentally flawed and probably garbage. If you care why, and I think you should, allons-y!

…there’s also been some Doctor Who thrown in.

From the abstract, the aim of the paper was, “to evaluate the association between epinephrine average dosing period and survival to hospital discharge in adults with an in-hospital cardiac arrest.” The data examined was prospectively gathered and retrospectively examined from approximately 21,000 in-hospital cardiac arrests at about 500 hospitals in the Get With The Guidelines – Resuscitation (GWTG-R) registry.

This means that the data was gathered with the knowledge that it would be used in future research, but at the time of entry that exact use was unknown. At a later date the authors then looked backwards at a near ten-year chunk of the data and attempted to parse out how varying dose-intervals of epinephrine were associated with patient outcomes. It’s a noble pursuit and an important subject that has been understudied in the past, but one of the reasons why it’s not often researched is that it is difficult to examine outside of the setting of a randomized controlled trial. This design can make for decent hypothesis-generation in the right scenario, but here it’s pretty weak-sauce.

It’s a pretty big registry, and that can be a good thing if you’re asking simple questions with simple answers. The problem is that we’re not asking a simple question. It’s notoriously difficult to record the timing of medications during a cardiac arrest, and in fact, the GWTG-R registry doesn’t even record epinephrine timing after the first dose. How could the researchers even attempt this study?

Well, the GWTG-R registry does record the total number of doses of epinephrine administered during the arrest, along with the time to return of spontaneous circulation (ROSC). The authors, looking at resuscitations that received more than one dose of epi, then divided the time from the first dose of epinephrine to the end of the resuscitation by the number of subsequent doses to come up with the average amount of time between doses. Here’s an example of an ideal patient receiving epi every 5 minutes.

5-6min Epi Interval

q5min dosing categorized as “5 to <6 min dosing”

If you’ve ever been involved in a resuscitation you will quickly realize that this way of calculating epinephrine timing in no way reflects real-world practice. Resuscitations are messy affairs and dose-timing can be all over the place during a single arrest, with one interval of 2 min, another of 8 min, and another of 2 min. While that may average out to a rate of one dose every four minutes, it’s probably very different from giving the patient evenly spaced q4min doses of epi.
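
A quick sketch of that arithmetic in Python may help. It uses the estimate as described above (time from the first dose to the end of the resuscitation, divided by the number of subsequent doses). The function name, the specific dose times, and the assumption that the clock stops at the moment of the last dose are mine, chosen purely for illustration.

```python
def mean_interval(dose_times_min, end_time_min):
    """Average dosing interval: (end - first dose) / number of subsequent doses."""
    later_doses = len(dose_times_min) - 1
    if later_doses < 1:
        raise ValueError("need more than one dose to estimate an interval")
    # Only the first dose time, the dose count, and the end time are used;
    # the timing of the individual later doses never enters the estimate.
    return (end_time_min - dose_times_min[0]) / later_doses


erratic = [0, 2, 10, 12]  # gaps of 2, 8, and 2 minutes (hypothetical)
even = [0, 4, 8, 12]      # evenly spaced q4min doses (hypothetical)

erratic_avg = mean_interval(erratic, end_time_min=12)
even_avg = mean_interval(even, end_time_min=12)
print(erratic_avg, even_avg)
```

Both patients come out at 4 minutes per dose, so the estimate cannot tell the erratic code apart from the evenly spaced one.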

The authors recognized that this is only an estimate of the dosing intervals, but another issue that compounds this mess with the timing is that it is exceptionally dependent on when ROSC was noted and recorded. The same “q5min” dosing pattern shown above can result in at least four different interval-stratifications depending on the length of resuscitation and the time ROSC is recognized.

6-7min Epi Interval

q5min dosing categorized as “6 to <7 min dosing”

7-8min Epi Interval

q5min dosing categorized as “7 to <8 min dosing”

9-10min Epi Interval

q5min dosing categorized as “9 to <10 min dosing”

You’ll also note that all of the above patterns bias towards estimating a longer interval than what was really prescribed and administered, never shorter. This will come up later.
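
Here is a minimal Python sketch of how one dosing pattern can drift across categories. The scenario is my own assumption, not taken from the paper’s data or the figures above: two doses given at 0 and 5 minutes, with the end of the resuscitation recorded at four different (hypothetical) times before a third dose would have been due. The helper names are hypothetical as well.

```python
import math


def estimated_interval(first_dose_min, end_min, total_doses):
    """(end - first dose) / number of doses after the first."""
    return (end_min - first_dose_min) / (total_doses - 1)


def interval_bin(interval_min):
    """Label a one-minute-wide category such as '5 to <6 min'."""
    low = math.floor(interval_min)
    return f"{low} to <{low + 1} min"


# Same q5min pattern (doses at 0 and 5 min); only the recorded end time changes.
end_times = (5, 6.5, 7.5, 9.5)
bins = [interval_bin(estimated_interval(0, end, total_doses=2)) for end in end_times]
print(bins)
```

Because the resuscitation cannot end before the last dose is given, the estimate for an evenly dosed patient can only match the true interval or overshoot it.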

To keep things reasonable let’s end our initial discussion there for today, but if you like this stuff look forward to Part 2 (and 3… and maybe 4) being posted over the coming days.


Graphtistical Tomfoolery

Correlation versus causation

There’s a great Facebook post making the rounds which is worth the read for anyone amused by statistical smoke-and-mirrors.

Hopefully it’s not news to anybody that correlation isn’t the same thing as causation. But it still seems to surprise some people that you can do a study on practically anything (a random drug, some arbitrary diagnostic criteria) and, well… let’s just say that you’ll probably find a positive result.

There are lots of reasons for that, many of which we’ve talked about before, but one is the simple old truism that two numbers going up or down at the same time means very little on its own. Try coming up with some of your own graphs to demonstrate this; I have one here that provides compelling evidence of a strong association between the quantity of iPhones and number of polar vortexes around the globe.

Outcomes (and why you don’t get to supersize them)


Many people have trouble understanding outcomes. Obviously, the outcome is whatever a study is measuring — the results. But what’s the difference between primary and secondary outcomes? Why is it important to specify them before the study is performed, rather than after? Why is it more rigorous to have one or two outcomes instead of fifty?

The simple answer: the more times you spin the wheel, the more likely you’re going to eventually hit the jackpot… but the less impressed we’ll be when it does.


Accurate rifles don’t need ten shots to hit the target

Suppose I told you that I’ve built a miraculous device — the Dice-Guesser 1000. I wave it over a six-sided die, it beeps and whirrs, and after a few moments, it flashes a number: 4. Now you roll the die, and it comes up with a 4.

Pretty impressive, right? Cool trick. It’s cool because the likelihood of guessing that roll by pure chance is only 1 in 6, which isn’t very high… so you think maybe the Dice-Guesser 1000 actually works. Normally I’d assume it was guessing randomly, but the odds were low that it would’ve guessed correctly the first time without any help, so maybe it really can predict dice rolls. If I did this trick many times, you’d be even more convinced, because it’s less and less likely that I’d keep guessing right if my device didn’t have some magic. Buy stock in my company.

Now, let’s do things a little differently. I have another device, the Dice-Guesser 2000. The only difference is this: instead of flashing one number, it flashes six numbers. 1, 2, 3, 4, 5, and 6. You roll a 4. Behold! One of my guesses was correct!

You’re not impressed this time? Why not? Isn’t the chance of correctly guessing the roll still just 1/6?

Sure it is. But I guessed six times, and that meant the chance that any one of my guesses would be correct was much higher (100%, in fact). The original feat was only impressive because I was picking one outcome out of many possibilities, and it was therefore improbable that I’d be right unless my device had some real magic. But if I pick more than one guess, it’s increasingly likely that random chance does explain my right answer — and if I guess enough times, it approaches statistical certainty that I’ll get it right. Doesn’t mean my device has any predictive power. Just means I tried until I got lucky.
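
If you like seeing the arithmetic spelled out, here is a tiny Python sketch. It assumes a fair six-sided die and guesses that are all different numbers; the function name is made up for this example.

```python
from fractions import Fraction


def chance_of_a_hit(n_guesses, sides=6):
    """Probability that at least one of n distinct guesses matches a fair die roll."""
    if not 0 <= n_guesses <= sides:
        raise ValueError("number of distinct guesses must be between 0 and sides")
    # Each distinct guess covers one face, so the chance grows by 1/sides per guess.
    return Fraction(n_guesses, sides)


one_guess = chance_of_a_hit(1)
six_guesses = chance_of_a_hit(6)
print(one_guess, six_guesses)
```

One guess gives 1/6; six guesses give a guaranteed hit, which is exactly why the Dice-Guesser 2000 impresses nobody.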

So if you want to know something about my device, you’d be smart to limit me to only one guess. Maybe a couple. But not six.


My outcomes overfloweth

Now let’s apply this to a clinical study.

We’ll make it the strongest kind of study, a randomized controlled trial. Let’s say it’s a trial of Strokesallbetter, a pill that we think might help people after an ischemic stroke.

Before we run the trial, we specify a primary outcome: mortality. We’re guessing that patients who take Strokesallbetter will be less likely to die than patients who just take a placebo. We run the trial, and behold — patients who took Strokesallbetter were less likely to die! Are you impressed? You should be. Probably the drug works.

Now let’s take a different drug, Strokesjustfine. And let’s change things up. We’ll designate an outcome of mortality. But we’ll also designate another outcome: neurological deficits after 30 days. And we’ll designate another outcome too: duration of hospital stay. In fact, let’s designate one hundred different outcome measures, all unrelated. More data’s always a good thing, right?

We run the trial, and unfortunately, Strokesjustfine produces no difference in mortality. Or neurological deficits. Or hospital stays. In fact, it showed no difference for almost anything we measure, but one outcome was different: patients who took Strokesjustfine had a slightly lower chance of receiving a parking ticket in the 41 days following their stroke.

Are you impressed? Should we market Strokesjustfine as a miracle parking-ticket cure?

Probably not. Why? Because this is just like the case where we kept guessing at the dice roll. It’s probably not random luck if you hit the target with the only shot you take. But if you shoot fifty times and one of them makes it, it probably is random luck. Nice try, no cigar.

Let’s put it a little more simply. In most studies, the statistical threshold for significance — how strong a correlation must be for us to assume it’s real, and not just chance — is 95% (a p value of .05). That translates to a 1 in 20 chance of the result happening randomly. Pretty good numbers… if you only guess once. What if you guess 20 times — in other words, you have 20 different outcomes? Now if one outcome shows a statistically significant correlation, that’s not an impressive result, because it had a 1 in 20 chance of happening by random probability, and you tried 20 times!
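
To put a rough number on that, here is a small Python sketch. It assumes every outcome is tested at the same .05 threshold, that the outcomes are independent of one another, and that the drug truly does nothing. Real trial outcomes are usually correlated, so treat the output as a ballpark figure. The function name is mine.

```python
def chance_of_false_positive(n_outcomes, alpha=0.05):
    """Chance that at least one of n independent null outcomes crosses the threshold."""
    # Each outcome stays non-significant with probability (1 - alpha);
    # all n stay quiet with probability (1 - alpha) ** n.
    return 1 - (1 - alpha) ** n_outcomes


one_outcome = chance_of_false_positive(1)
twenty_outcomes = chance_of_false_positive(20)
hundred_outcomes = chance_of_false_positive(100)
print(one_outcome, twenty_outcomes, hundred_outcomes)
```

Under those assumptions, 20 outcomes give roughly a 64% chance of at least one spurious “hit,” and 100 outcomes push it above 99%.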

That’s why legitimate studies usually specify only one primary outcome. One hypothesis, one finish line; that’s where you’re putting your money, what you really think and hope might happen. At the most, there might be two or three, particularly if one is a safety outcome (i.e. not whether the drug worked, but whether it hurt anyone, which is a different kettle of fish). Best practice is to limit it to one primary outcome and perhaps two or three secondary outcomes (like the Pips to Gladys Knight, they’re not as central, but they’re something), all of them defined in advance. But that’s all you get. If you start listing dozens and dozens of outcomes, it no longer means anything when one of them happens to show a signal — statistically, that’s expected to happen, and you have no basis to believe it’s anything except chance.

That doesn’t mean you can’t measure or calculate plenty of different datapoints, of course. But when one of them has an interesting result, you don’t get to pretend that was your goal all along. If your primary outcome was a failure, your trial was a failure, period. If you had some secondary outcome that was positive, and there’s a plausible reason it might be a real finding (i.e. not the parking ticket situation above), then you can run another trial using that as the primary outcome. If you find a significant result this time, then great — you win! But merely noticing that result within the mountain of secondary data isn’t the same as designating the outcome ahead of time, for reasons that are hopefully now clear.


Hanging out the data nets

The ultimate example of this phenomenon is the practice of data mining.

Also known as data dredging, this is a lot like the big ocean-going fishing vessels that hang out nets and trawl along until they’re full. They’re not looking for anything in particular, they’re just picking up whatever — tuna, dolphins, old boots — until they notice something they like.

The way it works in research is like this: find a big bank of data. It could be from a study you performed, or it could be somebody else’s; doesn’t matter. Now, put it into a computer, and analyze it in every way possible. By our analogy above, you roll the dice over and over and over until you finally find a statistically-significant result. It doesn’t have to be something that was designated in advance. It doesn’t even have to be particularly plausible or logical. It could be the number of patients who lost between 63 and 66 hairs during the trial period. If it’s statistically significant, then you publish it.

And when you publish it, you highlight its statistical significance. Only a 1/20 chance of this happening randomly! What you certainly don’t highlight, or even mention, is that you ran twenty different analyses until you found this result. So nobody knows how many times you rolled the dice. Presto! We’ve created an impressive-looking association out of thin air.

The grim reality is that many published trials, even the big, impressive RCTs, are performed in this same way. They may designate a primary outcome beforehand. But there’s a lot of time and money invested in the study, so if it reaches its conclusion and the primary outcome is negative, do they publish it as a failure? No way. They dig through their data until they find some way of parsing the results that reaches statistical significance. Then they rewrite their study design and pretend that was their primary outcome all along. It’s called the Texas sharpshooter fallacy: fire your rifle at a blank wall, then draw a bulls-eye around wherever the bullets land, and you’ll never miss.

It’s hard to detect this nonsense, because when a study is published, you mostly have to trust that the authors haven’t changed their study design after peeking at the data. The only practical defense is pre-trial registration (ClinicalTrials.gov is the biggest registry), where researchers publish the design of their trials before they run them, and they can’t modify it later without the registry recording the change. That way, if you read their published write-up and it says the primary outcome was X, you can check the registry to confirm that their primary outcome was always X. Researchers are caught red-handed all the time by this, presumably because they hoped nobody would look.


Publication bias: the invisible dice

We talked before about publication bias, the phenomenon where negative trials are less likely to be published than positive ones, often disappearing into a desk drawer instead of appearing in the peer-reviewed literature and hence reaching your eyeballs. It happens all the time, particularly in studies funded by the industry that has a vested interest in their outcomes.

Guess what? Publication bias is another great way to roll extra, unseen dice. A significant result may have a 1/20 chance of occurring randomly, and maybe the trial only had a single outcome. But what if there are 19 other trials of the same drug, and every one of them was negative?

If you knew that, you’d recognize this positive result was probably spurious, despite its internal validity. But due to publication bias, you might never know about all those negatives. You search the literature, you find only the lucky positives, and you never know if they’re surrounded by a backdrop of failures that you simply don’t see.


So what’s the answer?

The take-away message is this:

  1. If it intends to show a significant association, a study should define a very small number of outcomes (preferably just one).
  2. Those outcomes should be designated before the study is conducted, and must not change at any point.
  3. Any findings beyond those pre-defined outcomes may generate hypotheses for future trials, but signify nothing on their own.
  4. Pre-trial registration can help readers verify that the above process was followed.
  5. Publication bias can hide how many attempts lie behind a positive result. Its presence is ubiquitous, and is not easily accounted for.


Hopefully you’re out there reading the research and trying to figure out what you should be doing and why. But inevitably, when you get to the results, they’re reported using a baffling array of acronymic metrics. Some are fairly intuitive; some are truly confusing; but you should understand them all. Here’s a quick run-through on the most common terms, particularly odds ratios, absolute risk reduction, relative risk, and NNT, using two recently-added studies (Kudenchuk 1999 and Jacobs 2011) as examples.