by a literal banana
Introductory summary: The current scientific consensus is that the placebo effect is a real healing effect operating through belief and suggestion. The evidence does not support this. In clinical trials of treatments, outcomes in placebo and no-treatment arms are similar, distinguishable only in tiny differences on self-report measures. Placebo-focused researchers using paradigms designed to exploit demand characteristics (politeness, roleplaying, etc.) produce implausibly large effects, in many cases larger than the effect of fentanyl or morphine, but these studies measure response bias on self-report outcomes (at best). There is no evidence that placebos have effects on objective outcomes like wound healing. Three sources of evidence purport to show that the placebo effect is a real, objective phenomenon: brain imaging studies, the alleged involvement of the endogenous opioid system or dopaminergic system, and animal models. But the brain imaging studies do not demonstrate an objective effect; rather, they are another way of measuring “response bias,” as subjects are capable of changing these measures voluntarily. Studies that claim to demonstrate the involvement of the endogenous opioid system suffer from replicability issues, with most positive results coming from a single laboratory genealogy; other laboratories produce conflicting results. Animal models also suffer from replicability issues, such that the highest-quality research is least likely to produce a placebo effect in animals. Even research designs that do produce a conditioned “placebo effect” in animals cast doubt on the involvement of the endogenous opioid system. In the era of open science, there has been no large-scale, multi-center, preregistered attempt to address the placebo effect in animal models or the involvement of the endogenous opioid system. The one adequately powered preregistered attempt for the dopaminergic system in humans produced no effect. Although placebo and “mind-cure” beliefs are widespread, the most parsimonious interpretation of the evidence is that the “placebo effect” is not a real healing effect, but a product of response bias and questionable research practices. The true power of the placebo is as a blind.
MB: I’ve detected a note of skepticism from you regarding the effects of placebos having actual physical effects on the body, but my rough understanding of it is that it’s not a priori totally ridiculous. The mind and the body (if we’re gonna do Cartesian dualism) are sort of intimately related; they both have effects on the other. And if you believe something is true, then it’s gonna have these psychological effects which are gonna translate into physical effects at some point, right?
CK: Oh no – yeah I have no skepticism about the placebo effect existing, or like you say, what people expect about things, you know where else is it gonna show up? Even the fact that the people subjective self-reporting is different, it wouldn’t be a surprise for me to find that you could find correlates between activity that represents that. So it’s not a huge reach.
(Matthew Browne and Christopher Kavanagh, Decoding the Gurus podcast, November 9, 2023, at about 2:29:25, speculative punctuation mine. Podcast beginning at 1:46:60 assigned in Daniel Lakens’ metascience class.)
The context of the above excerpt is our two scholars very politely and decorously tearing apart some sketchy placebo studies. It is obvious from the podcast that they are both highly intelligent and highly aware of the crisis in the sciences. They demonstrate not only the capability to identify specific flaws in studies, but also a fine olfactory sense for identifying questionable claims by smell. I open my case against placebos with an excerpt from their podcast not because I think they are fools, but precisely the opposite: it seems to be the consensus among intelligent, aware people that the placebo effect is real, and my own claim that the placebo effect is not real is in conflict with this consensus.
It often strikes me as odd, as a rare placebo effect denier, that even while criticizing studies that find, for example, that simply being told that one’s job is good exercise causes people to lose weight, critics take care to say they do not question that the placebo effect exists. The placebo effect as a healing effect is taken for granted, partly because it is not widely understood how shaky the evidentiary foundation for the healing placebo effect is.
Here is a list of sub-beliefs, which I hope to show are misconceptions, that support the belief in the healing placebo effect, some of which can be inferred from the above quotation:
- Randomized placebo-controlled trials use a “placebo arm” because the placebo effect is known to be powerful.
- The placebo effect is a real healing effect, and not just subjects being polite on self-report instruments, regression to the mean, selection on extreme values at the beginning of a study, or questionable research practices.
- The placebo effect is a healing effect that is large enough to be noticeable and clinically relevant.
- The placebo acts not only on subjective states, but also on objective, physical outcomes that are measurable by e.g. laboratory tests.
- We know the placebo effect is objective and not a function of response bias because it causes measurable changes in The Brain, for instance measured in the EEG sustained late positive potential.
- The placebo effect need not be deceptive; it may be evoked with “open-label placebos.”
- We know the placebo effect is an objective phenomenon because it involves endogenous opioids, demonstrated by high-quality and well-replicated research in which the placebo effect is abolished with a hidden injection of an opioid antagonist.
- We know the placebo effect is real because we can induce it in animal models with a conditioning paradigm.
- Expensive placebos are more effective; discount placebos are less effective. Properties of the placebo including the color of the tablet alter the subsequent placebo effect.
Some people are convinced that the placebo effect must exist, because they notice that when children receive a mild injury, kissing the boo-boo or perhaps applying a band-aid seems to soothe any upset. But I don’t think this is a healing placebo effect at all. This is a rational process of people who are new to the world encountering a disturbing situation, and going to people they trust with more experience to find out how freaked out they should be. Deciding how upset to be based on contextual information is different from healing. (Also, band-aids are pretty effective at protecting wounds from being touched, preventing more pain.) Receiving a placebo is a communication that conveys, for example, that no further treatment is forthcoming. It can mean anything between “it’s okay” and “shut up.” Recipients of placebos take this into consideration and may alter their communication in accordance with this, but they do not actually heal.
In faith-healing events like revival meetings, believers may throw away their crutches, eyeglasses, or hearing aids, and this voluntary behavior serves as a communication of faith. However, this behavior is not evidence that they no longer need these helping devices, as they may find themselves sheepishly buying replacements when the strong emotion of the meeting has worn off. It’s important not to confuse communication strategies with healing.
“To amuſe the mind”
It is well known that the placebo has something to do with pleasing, translating as “I shall please,” but it is a bit unclear who is to be pleased. I have sometimes heard that medical students are taught that placebos are given to patients so that they will get better to please their doctors! Dr. J. C. Lettsom, recorded by William Gaitskill in the Memoirs of the Medical Society of London in 1795, gives his patient placebos “to amuſe the mind,” implying the patient is the one to be pleased. I suspect that both senses coexisted throughout the past few hundred years. Some doctors (of various levels of respectability) sell placebo treatments because people like them and want to buy them; others expect placebos to actually work to some degree.
In a lecture given in 1953, Sir John Gaddum smears the two meanings together – a placebo may both please the patient and act to cure him through psychological suggestion – but he offers a further distinction which will be relevant in the next section:
Such tablets are sometimes called placebos, but it is better to call them dummies. According to the Shorter Oxford Dictionary the word placebo has been used since 1811 to mean a medicine given more to please than to benefit the patient. Dummy tablets are not particularly noted for the pleasure which they give to their recipients. One meaning of the word dummy is “a counterfeit object”. This seems to me the right word to describe a form of treatment which is intended to have no effect and I follow those who use it. A placebo is something which is intended to act through a psychological mechanism. It is an aid to therapeutic suggestion, but the effect which it produces may be either psychological or physical.
As we shall see, it might have prevented a great deal of confusion to preserve the distinction Gaddum makes between placebos and dummies:
Dummy tablets may, of course, act as placebos, but, if they do, they lose some of their value as dummy tablets. They have two real functions, one of which is to distinguish pharmacological effects from the effects of suggestion, and the other is to obtain an unbiased assessment of the result of the experiment.
Gaddum is concerned with the usefulness of placebo (dummy) controls in revealing the efficacy (or lack thereof) of drug treatments. He gives as an example the drug thonzylamine, of which “extravagant claims were made” regarding its ability to prevent and treat the common cold. But when compared with placebo control, the typical result was this:
That is, the placebo control (dummy) revealed that the apparent efficacy of the drug was down to factors unrelated to the drug. Pre-post comparisons without control could not have revealed this. But what this kind of experiment cannot reveal is whether the improvement in the placebo group was down to the “psychological” effects of suggestion (mind cure), or to regression to the mean, natural history, or even politeness of the subjects (depending on how “cure” was assessed).
The Placebo Effect in Placebo-Controlled Trials
One of the most common misconceptions about placebos is that placebos are used in randomized controlled trials because the placebo effect is known to be a powerful healing effect. This misconception is addressed by Blease et al. in a paper from 2020, Open-label placebo clinical trials: is it the rationale, the interaction or the pill? They distinguish, on the one hand, the placebo response in placebo-controlled trials, in which patients in the placebo arm often actually do “get better” in a pre-post analysis (a dummy, in Gaddum’s terms), from, on the other hand, the placebo effect, which is the purported healing effect of the placebo.
They grant, as I do, that “there is a scientific consensus that placebo effects constitute genuine psychobiological events that engage perceptual and cognitive processes to produce therapeutic benefits among patients for a range of self-reported conditions and symptoms.” (To be clear, I think this is wrong, but I acknowledge that it is the current scientific consensus, unfortunately.) But the use of placebos as a control in clinical trials is agnostic to these purported healing effects. The placebo in a placebo-controlled trial exists, for one thing, to ensure blinding, so that any measured effects can be attributed to the treatment, rather than to the hopes and beliefs of researchers. Subjects in the placebo arm experience the same passage of time as the treatment arm, so that noise and any regression to the mean can be subtracted out and not mistaken for a healing effect of the treatment. Subjects given placebo are in the same position as subjects given a treatment in terms of being motivated to respond politely to surveys, such that theoretically they should have a similar “response bias” as treatment subjects, if the blind is intact. (Though note that we should expect them to respond politely even if they are given an open-label placebo with a rationale for why it is expected to work.) If “eligibility creep” is a factor, in which the measurements of subjects at baseline are exaggerated to qualify more subjects for the trial, it should be the same for the placebo and the treatment group (this is why “placebo effects” sometimes occur on objective outcomes in randomized controlled trials, especially those without a no-treatment arm for comparison, even though placebos only affect self-report measures).
For all these reasons, any pre-post improvement in a placebo arm (endpoint minus baseline) is not necessarily attributable to the placebo itself. To measure the effect of placebo, one must somehow design a trial comparing placebo to no treatment – but even then, subjects in the no-treatment arm are not given the same motivation to answer politely to surveys. (This is among many reasons why some have questioned whether the waiting list is a “nocebo” control condition, that is, worse than nothing, which makes sense – subjects assigned to a waiting list are invited to portray themselves as still in need of help to potentially qualify for the trial, whereas subjects given at least a placebo no longer need to qualify for the trial. Also, researchers doing the rating may encounter subjects coming to the laboratory for treatment or placebo, but may never encounter subjects assigned to the waiting list, threatening the blind.)
It’s impossible to achieve a blind when comparing placebo to no treatment, since the placebo itself is the main method for blinding in the first place. This is the promise behind the paradigm of hidden injection of opioid antagonists, which we shall address in a later section.
The Powerful Placebo
One of the most influential documents in favor of the scientific view of the healing placebo effect is Henry Beecher’s 1955 paper, The Powerful Placebo. Perhaps we can blame Dr. Beecher for the loss of Dr. Gaddum’s useful distinction between placebos and dummies, for Beecher, quoting Gaddum as I have above, specifically argues that they are the same thing:
Both “dummies” and placebos are the same pharmacologically inert substances; i. e., lactose, saline solution, starch. Since they appear to be differentiable chiefly in the reasons for which they are given and only at times distinguishable in terms of their effects, it seems simpler to use the one term, placebo, whose two principal functions are well stated in Professor Gaddum’s last sentence quoted above. Finally, I do not understand how a dummy tablet could be prevented from having a psychological effect that, if pleasing, would make it a placebo. One term seems to fill the bill. If it falls a bit short of precision, perhaps the language will have to grow a little to include the new use.
Indeed, that is exactly what happened!
Writing in 1955, Dr. Beecher performs what would now be regarded as a rather crude meta-analysis of his own work and that of others:
Fifteen illustrative studies have been chosen at random (doubtless many more could have been included) and are shown in table 2. These are not a selected group: all studies examined that presented adequate data have been included. Thus in 15 studies (7 of our own, 8 of others) involving 1,082 patients, placebos are found to have an average significant effectiveness of 35.2±2.2%, a degree not widely recognized. The great power of placebos provides one of the strongest supports for the view that drugs that are capable of altering subjective responses and symptoms do so to an important degree through their effect on the reaction component of suffering.
That is, rather than presenting a modern effect size, he concludes that the pain of 35.2 plus or minus 2.2% of subjects was “satisfactorily relieved by placebo,” which he later defines as follows:
For example, in our pain work satisfactory relief is defined as “50 per cent or more relief of pain” at two checked intervals, 45 and 90 minutes after administration of the agent. (This is a reproducible judgment patients find easy to make.) Each author has been explicit, and some have required even greater success than indicated above. For example, Gay and Carliner (1949) required, for a positive effect, complete relief of seasickness within 30 minutes of administration of the placebo.
That is a powerful placebo! While Dr. Beecher focuses on “altering subjective responses” and the “reaction component of suffering,” he also claims that placebos are effective on objective, “physiological” criteria. We will return to this claim in a later section.
For now, we will introduce two of the few placebo skeptics I have encountered, Gunver Kienle and Helmut Kiene, who in 1997 published The Powerful Placebo Effect: Fact or Fiction? in response to Beecher’s claims. Among many other criticisms, some of which we will return to, they take issue with Beecher’s reporting of his fifteen trials:
Beecher misquoted 10 of the 15 trials listed in “The Powerful Placebo.” He sometimes inflated the percentage or the number of patients, or he cited as a percentage of patients what in the original publications is referred to as something completely different, such as the number of pills given, the percentage of days treated, the amount of gas applied in an experimental setting, or the frequency of coughs after irritating a patient. The main effects of these errors were false inflations of the alleged placebo effect. A multitude of misquotations can also be found in other placebo literature. (Citations omitted.)
Kienle and Kiene consider misquotation to be “a particular problem of placebo literature,” but in my experience it is a problem of all literatures ever.
Run In, Wash Out
One important aspect of Dr. Beecher’s work is his conception of a “placebo responder.” That is, some patients respond to placebo, and some don’t, depending on the “attitudes, habits, educational background, and personality structure” (but not intelligence!) of the subjects.
Dr. Beecher seems to anticipate the method of “placebo washout” that was later commonly used in antidepressant trials, that is, to deceptively provide subjects with a placebo said to be an effective treatment for a week or two, and then exclude any subjects who report getting better. He says, “as a consequence of the use of placebos, those who react to them in a positive way can be screened out to advantage under some circumstances and the focus sharpened on drug effects.”
This is an important prediction that has now been falsified, at least for a well-studied class of drugs purported to be affected by placebo response, antidepressants. Although there has long been criticism of the practice of placebo washout (also called placebo run-in) as a questionable research practice in antidepressant research (e.g., “we think that the practice of excluding patients during the washout procedure should be suspended due to the potential for distorting results in some studies,” Antonuccio et al. 1999), the actual effect of the procedure turned out to be nil.
As early as 1995, Greenberg et al. (confirming, they report, the results of two earlier studies) found that measures of depression were not affected either in placebo or drug groups from the use of a washout design:
Our results are entirely consistent with the few attempts in the literature to assess the value of the placebo-washout technique. We found no significant difference between washout and nonwashout studies in the percentage of reduction in ratings on depression for subjects in the placebo groups…. There was also no difference in the effectiveness of antidepressant drugs in the two types of studies…. Our analyses showed equivalent percentages of dropouts in the two types of studies for both the patients assigned to the placebo groups….
In 2021, Scott et al. examined a much larger sample of trials and found that while washout (placebo run-in) designs produced slightly lower placebo effects and slightly lower drug effects, the drug-placebo difference was indistinguishable between methodologies:
Studies using PRI periods reported a smaller placebo response (g = 1.05 [95% CI, 0.98-1.11]; I2 = 82%) than studies that did not use a PRI period (g = 1.15 [95% CI, 1.09-1.21]; I2 = 81%; P = .02). Subgroup analysis showed a larger drug response size among studies that did not use a PRI period (g = 1.55 [95% CI, 1.49-1.61]; I2 = 85%) than those that did use a PRI period (g = 1.42 [95% CI, 1.36-1.48]; I2 = 81%; P = .001). The drug-placebo difference did not differ by use of [placebo run-in] periods (g = 0.33 [95% CI, 0.29-0.38]; I2 = 47% for use of a [placebo run-in] period vs g = 0.34 [95% CI, 0.30-0.38]; I2 = 54% for no use of [placebo run-in] periods; P = .92). The likelihood of response to drug vs placebo also did not differ between studies that used a [placebo run-in] period (odds ratio, 1.89 [95% CI, 1.76-2.03]) and those that did not use a [placebo run-in] period (odds ratio, 1.77 [95% CI, 1.65-1.89]; P = .18).
Excluding subjects who are particularly willing to play along slightly reduces pre-post differences for both placebo and treatment, but doesn’t affect the drug-placebo difference.
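To make the arithmetic concrete, here is a toy simulation in Python with invented numbers (not Scott et al.’s data). Both arms improve substantially for reasons that have nothing to do with the pill, so the pre-post “placebo response” comes out enormous while the drug-placebo difference stays small:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # patients per arm

# Invented numbers: both arms improve for reasons unrelated to the pill
# (natural history, regression to the mean, polite reporting); the drug
# adds only a small extra improvement on top.
baseline = rng.normal(25, 5, size=(2, n))   # e.g. depression-scale scores
shared_improvement = 10                      # improvement common to both arms
drug_specific = 2                            # extra improvement in the drug arm
noise = rng.normal(0, 5, size=(2, n))

placebo_end = baseline[0] - shared_improvement + noise[0]
drug_end = baseline[1] - shared_improvement - drug_specific + noise[1]

def hedges_g(a, b):
    """Standardized mean difference (a minus b) with small-sample correction."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd * (1 - 3 / (4 * (na + nb) - 9))

# Crude pre-post "placebo response" (naively treating baseline and endpoint
# as independent groups, as pre-post g calculations often do)
print(hedges_g(baseline[0], placebo_end))   # large, around 1.6
print(hedges_g(placebo_end, drug_end))      # drug-placebo difference, around 0.3
```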
On the other hand, if a researcher wants to produce large placebo effects, selecting only “placebo responders” exploits response bias by excluding subjects not willing to play along. Many research designs described below use this method, particularly in the section on endogenous opioids.
The question of whether placebo responders exist, as a consistent natural kind, seems to be an open one. In a small (n=71) study, Whalley et al. found that while responses to a placebo cream of the same name in different trials were somewhat correlated, there was no significant correlation in responses to placebos given different names. Given my beliefs about the placebo effect, that it is primarily a function of politeness and roleplaying in self-report measures on the part of the subjects, it would not be surprising if some subjects are more polite and better sports than others, but I cannot find much evidence of a consistent placebo responder in the literature. This does not stop many authors from conducting sketchy subgroup analyses of only “placebo responders” to find a placebo effect.
What Outcome Measures Are “Objective”?
I have claimed that the “placebo effect” is exclusively a phenomenon of self-report or subjective measures, and never objective measures. However, this distinction needs some clarification, as there are many different uses of “objective” in the literature. I will review some of these definitions here, as they will be relevant in the sections to come.
One possible meaning is “anything other than a self-report measure,” and this produces three distinct kinds of problems. First, this may include observer-reported outcomes that constitute a summary of a patient interview, such as the Hamilton Depression Rating Scale. Rather than producing an objective measure, like a laboratory test of a blood sample, this combines the subjective impression of two parties. A patient in a trial who has received a placebo may feel encouraged to role-play as if he has improved. But a researcher rating improvement may also interpret ambiguous information as favoring improvement. Indeed, researchers rating subjects produce larger pre-post effects in both placebo and treatment arms of depression trials, as discussed in a later section. Few such trials include a no-treatment arm to pinpoint a “placebo effect,” but the few that do indicate that it is modest, if it exists at all. In short, researchers may exaggerate on essentially subjective and vague criteria even more than subjects role-play as having improved. When the outcome does not allow for exaggeration or roleplaying, such as laboratory tests or wound healing, there is no “placebo effect.”
A second problem is that this definition seems to exclude brain imaging, as an EEG or fMRI result is not, strictly speaking, a self-report measure. However, available evidence suggests that just as verbal or written responses are produced voluntarily and may be changed depending on politeness or roleplaying, many brain imaging outcomes are also under voluntary control. Like the rate of breathing or facial expression or amount of effort exerted, subjects have control over these outcomes, even if they are measured in a manner that doesn’t superficially look like self-report.
The third problem is specific to pain measurement. A rating on a pain scale is clearly a self-report measure – for example, “How bad is your pain, on a scale of 1 to 100?” explicitly asks for a self-report that may change based on how politely the subject is playing along. Other pain measures are sometimes termed “objective” if they are not strictly measured by self-report on a pain scale. Examples include the temperature in a heat pain protocol at which the subject reports unbearable pain, or the length of time a subject is able to endure induced pain. However, even though not a pain rating, these “objective” measures are under the control of the subject and may still be a product of roleplaying. That is not to say that large placebo effects on this kind of pain measure are only produced by roleplaying, as they may also be produced by questionable research practices. For example, no laboratory in the open-label placebo meta-analysis discussed in a later section, Spille et al. (2023), produced a placebo effect using such an “objective” pain measure, but certain laboratories claim to produce enormous effects of this kind, suggesting some difference in methodology that might be regarded as questionable.
The distinction between outcomes that are under voluntary control (and therefore subject to response bias) and truly objective outcomes is a bit fuzzy. Are exercise performance outcomes, for example, objective? Exercise science is still, to put it politely, in the early stages of addressing the replication crisis (with, as Büttner et al. 2020 report, 82.2% of studies reporting that the primary hypothesis of the study was supported, despite, as Abt et al. 2020 report, a median sample size of 19 in a random sample of papers in the Journal of Sports Sciences). However, it is still possible, in my model, that at least some placebo effects on exercise outcomes (time to run a certain distance, for example) are not products of sketchiness and fraud. Effort in exercise is under voluntary control, and certainly influences exercise outcomes as measured in trials. Differential effort, rather than fraud, may even explain placebo effects in studies in which the authors give the subjects a survey about their level of effort and find no difference, as there is no reason to suppose that a survey instrument is a reliable way to retrospectively measure effort during athletic performance.
To me, an objective outcome is one that is not under the voluntary control of the subject and is unlikely to be manipulated by researchers. Some examples of this kind of objective outcome would be wound healing, laboratory blood or urine tests, or pregnancy. Cognitive tests may also be objective measures as long as effort is constant, and indeed cognitive tests do not seem to show a “placebo effect” unless they are produced by the laboratories of known frauds.
The issue of the objectivity of outcomes is also relevant to a distinction our placebo skeptics Kienle and Kiene (1997) make between the placebo effect and the concept of “psychosomatic” effects. They say, just before noting the “uncritical reporting of anecdotes” in the pro-placebo literature:
There is a class of anecdotal reports in the placebo literature, which have nothing to do with placebos, because no placebos were given at all.
The purpose of these anecdotes is to demonstrate the possible power of “nonspecific” causes. Beecher himself reported adventurous episodes from the voodoo culture, when supposedly dying people recovered immediately, or when magic rituals brought about the death of apparently healthy people.
Another classic example is an anecdote in Stewart Wolf’s well known “The Pharmacology of Placebos:” A woman with a gastric ulcer could not respond with gastric acid production during provocative tests with even the most powerful secretory drugs. Yet, immediate acid secretion occurred when she was asked about her husband who, as she had just recently discovered, had been sexually abusing her 12-year-old daughter. Wolf used this story to demonstrate the possible range of placebo effectiveness. However, this is misleading. This was an example of a psychosomatic effect, not the effect of placebo application. The example does not show that the mere ritual of giving a pill can be equated with the effect of discovering the sexual abuse of one’s daughter by one’s husband.
It is worth making a distinction between placebo effects and psychosomatic effects, but it is also worth making a distinction between the measurable objective outcomes of strong emotion (e.g. heart rate increase, crying) and the possible effects of emotion or “mindset” on disease outcomes. For the former, it would be strange for emotions to evolve at all if they had no effects. For the latter, the picture is murkier. For example, for peptic ulcer, a condition widely believed to have psychosomatic causes, hopelessly confounded observational studies often show a link between some measures of stress and some measures of ulcer. But it does not follow that peptic ulcer responds to placebo. Within clinical trials testing a treatment for peptic ulcer, de Craen et al. (1999) found a small difference between trials that had subjects take four placebos, as opposed to two placebos, per day, but even their small result was not robust to various sensitivity checks. (Also, one could imagine that e.g. drinking two extra glasses of water could have some small effect on a gastric outcome.) Claims that a “cancer-prone personality” caused cancer and could be treated with talk therapy turned out to be based on fraud, as we will see in a later section. It is worth keeping placebo claims separate from vague claims of psychosomatic effects, but it is also worth treating claims of psychosomatic effects on disease with skepticism.
Powerless Placebo?
Massive pre-post effect sizes in the placebo arms of placebo-controlled trials can be confused with a large placebo effect (for example, Hedges’ g greater than one for both placebo groups in antidepressant trials analyzed in Scott et al. cited above!). However, any effect of placebo itself must be distinguished from effects from the passage of time (especially in conditions of an episodic nature), and related issues such as inflation of initial scores. A “no treatment” or natural history control group can accomplish much of this. Although the inclusion of a no-treatment group won’t match effects of politeness and roleplaying (“response bias,” “Hawthorne effect,” “demand characteristics,” etc.), it goes a long way toward establishing the true effect of placebo.
Hróbjartsson and Gøtzsche (2001) performed a meta-analysis of studies comparing placebo arms to no-treatment arms (updated in 2004, and again in 2010). Their conclusions were surprising to placebo believers: “We found no evidence of a generally large effect of placebo interventions. A possible small effect on patient-reported continuous outcomes, especially pain, could not be clearly distinguished from bias. (2004)”
There are some interesting aspects of this analysis. Consistent with the politeness hypothesis, the authors found a significant (but surprisingly small) effect of placebo versus no treatment on self-reported outcomes. They found no such effect for observer-reported outcomes. They only found a placebo effect on continuous outcomes, and a perfect null for binary outcomes. They generously conclude, “We have no good explanation for the difference between effects of placebo when measured on a binary and on a continuous scale, but continuous scales could be more sensitive to small effects or biases. (2004)” The binary outcomes produce one of the loveliest funnel plots I’ve ever seen, almost perfectly symmetrical with a peak at the null value (2004):
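For intuition about what such a funnel plot means, here is a hedged little simulation (invented data, not theirs) of the shape you expect when the true effect is zero and nothing has been selectively published: effects scatter symmetrically around the null, with small studies fanning out at the bottom and large studies clustering tightly at the top.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n_studies = 200
n_per_arm = rng.integers(20, 400, size=n_studies)

# True effect of placebo vs no treatment on a binary outcome: exactly zero.
se = np.sqrt(2.0 / n_per_arm)        # rough stand-in for the SE of a log risk ratio
log_rr = rng.normal(0.0, se)         # observed study effects scatter around the null

plt.scatter(log_rr, 1 / se, s=10)
plt.axvline(0.0, linestyle="--")
plt.xlabel("log relative risk (placebo vs no treatment)")
plt.ylabel("precision (1 / SE)")
plt.title("Symmetric funnel centered on the null")
plt.show()
```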
Interestingly, there is a small but statistically significant effect of placebo on the subgroup of continuous outcomes measured by laboratory tests (2004) – in the wrong direction. That is, in laboratory tests with continuous outcomes, subjects given placebo did slightly but statistically significantly worse than subjects given no treatment. This is almost certainly noise, but it’s important to take note of absurd results from noise, as if it had gone in the other direction through chance, it might have been taken as evidence of placebo efficacy. To take another example, So et al. (2009) studied the effects of acupuncture and sham acupuncture on various objective outcome measures during IVF treatment. They found that the sham acupuncture group (the placebo group) had a statistically significantly higher overall pregnancy rate, one of three measures reported, with a p value of .038. I have described their conclusion as a two-sentence horror story:
Placebo acupuncture was associated with a significantly higher overall pregnancy rate when compared with real acupuncture. Placebo acupuncture may not be inert.
Of course, it’s just noise, as you might suspect from the p value. Coyle et al. (2020), meta-analyzing data from eight trials and almost ten times as many subjects as So et al., found no effect of acupuncture or placebo compared to usual care on any outcome measure after IVF. I have a general maxim: if there is a statistically significant effect of placebo on an objective outcome, it is either noise, fraud, questionable research practices, or a mischaracterization of a subjective outcome as objective.
Interestingly, although Hróbjartsson and Gøtzsche found a placebo effect for self-reported but not for observer-reported outcomes, in the case of trials of antidepressants versus placebo, “placebo effects” seem to be larger when a researcher is doing the rating. In trials of depression treatments, both the placebo effect (Rief et al., 2009) and the treatment effect (Cuijpers et al., 2010) are larger for clinician-rated effects compared to self-reported effects. One interpretation of this is that the “placebo effect” in these trials is not so much from patients being polite and exaggerating their benefit, but from researchers exaggerating the change, either innocently or for the purpose of producing a larger apparent effect. (More on the placebo in depression in the appendix.) Interestingly, a 2010 meta-analysis reported that this was also the case for Irritable Bowel Syndrome: higher placebo response rates for physician-reported than subject-reported outcomes. The negative result for outcomes measured by laboratory tests suggests this is exaggeration, rather than a genuine objective improvement.
But the main upshot of Hróbjartsson and Gøtzsche (2004) is how small the placebo effect is. For example, “pain” was the subgroup that performed the best (by its nature a self-report subgroup prone to bias), but the effect was estimated at only 6 points on a 100-point scale, too small to be clinically relevant. To make matters worse, Kamper et al. (2008) conducted an updated meta-analysis of the pain trials in search of trial characteristics that might produce a large, clinically relevant placebo effect (“trial-design, patient-type, or placebo-type”), and found none, but also found the placebo effect on pain to be a mere 3.2 points on a 100-point scale. As they put it, “Our analysis confirms the conclusions of Hróbjartsson and Gøtzsche that, at least in the context of clinical trials, placebo interventions appear to have little effect on pain.”
The “at least in the context of clinical trials” part is important, because researchers in ordinary clinical trials have little incentive to put their thumbs on the scale in favor of a placebo effect. Researchers specifically designing trials to find a large placebo effect are more likely to “find” one.
Here is Kamper et al.’s plot for the pain studies, which also tells a story:
Of course, as all these authors are aware, the small placebo effect on self-reported pain does not answer the question of whether it is real pain relief or just subjects being polite. But assuming for a moment that it is 100% real pain relief, how much pain relief is enough to matter? This is a tricky question, but to take a stab at it, Olsen et al. (2017) (which includes our friend Asbjørn Hróbjartsson as an author) analyzed 37 studies that sought to investigate the magnitude of the clinically important difference in pain, based on mapping numerical scores to patient report of feeling better:
A typical eligible study would ask patients to score their pain intensity, e.g. using a VAS, at baseline and follow-up. At follow-up, patients were also asked to categorise their change in pain intensity using response options such as ‘no change’, ‘a little better’/‘somewhat better’, and ‘a lot better’/‘much better’. The [minimum clinically important difference] was then determined from the change in scores on the pain scale among patients having categorised their change as ‘a little better’ (or a similar expression indicating a minimum clinically important improvement).
Olsen et al. found that the studies varied widely in their conclusions, ranging from 8 points on a 100-point scale to 40 points (points may also be expressed as millimeters on a visual scale). No study found a clinically relevant difference as low as 6, much less 3, points on a 100-point scale. Far from Dr. Beecher’s claims of 50% reduction of pain in over a third of patients, in the typical case, placebos do not seem to make people report feeling even “a little better” in clinical trials compared to no treatment.
In a final update as I am writing, Hohenschurz-Schmidt et al. (2024) confirmed that in three-armed trials (clinical trials of a treatment with a placebo and a no-treatment arm), short-term placebo effects on pain are small, and medium-term and long-term placebo effects are nonexistent.
A large “placebo effect” is only demonstrated when researchers are determined to find it, as we will see in the next section.
Placebo-Controlled Trials versus Placebo Trials
Vase et al. (2002), in A comparison of placebo effects in clinical analgesic trials versus studies of placebo analgesia, found that studies that were trying to find a placebo effect indeed got larger placebo effects (mean .95) than clinical trials that happened to have a placebo and a no-treatment arm (mean .15). Of the fourteen placebo-focused studies they review, eight were from the laboratories of Levine or Benedetti, discussed in a later section.
Hróbjartsson and Gøtzsche raised some methodological issues both with the analysis itself and the studies included in a letter, Unreliable analysis of placebo analgesia in trials of placebo pain mechanisms (2003). The funniest one is an issue that still plagues low-quality science to this day: “results from the included studies were summarised as simple unweighted averages.” They also question study quality: “Thirteen out of 14 mechanism studies did not report the method of concealment. The only study of pain mechanism with a clearly concealed allocation of patients reported a minor effect of placebo (Roelofs et al., 2000).” (Roelofs et al., 2000, is reviewed in the appendix on opioid antagonists.)
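To see why unweighted averaging matters, here is a toy calculation with made-up effect sizes (nothing to do with the actual studies in Vase et al.): a handful of small, noisy studies with big effects dominate a simple average, while standard inverse-variance weighting pulls the pooled estimate toward the large studies. (As it happens, Price et al. respond below that weighting strengthened their estimate, but the general point about the method stands.)

```python
import numpy as np

# (Cohen's d, sample size per group) for five invented studies
studies = [(1.8, 10), (1.5, 12), (1.2, 15), (0.4, 80), (0.1, 150)]
d = np.array([s[0] for s in studies])
n = np.array([s[1] for s in studies], dtype=float)

# Approximate sampling variance of d with two groups of size n each
var_d = 2.0 / n + d**2 / (4.0 * n)
weights = 1.0 / var_d

print("simple unweighted mean d: ", d.mean())                                # 1.00
print("inverse-variance weighted:", np.sum(weights * d) / np.sum(weights))   # about 0.34
```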
Price et al. (2003) (the authors of Vase et al. 2002) respond, in Reliable differences in placebo effects between clinical analgesic trials and studies of placebo analgesia mechanisms (which differences I do not think are the flex these authors think they are). They say,
Admittedly, we did not use a method of weighting studies in order to keep the analyses simple. However, a weighted estimate of d does not change our conclusions, rather it strengthens them.
They produce an even bigger result with a weighted analysis! They also say:
It is true that we did not have concealed allocation of patients as an explicit inclusion criterion. However, all studies of placebo mechanisms in our analysis included random allocation of subjects, and nothing indicates that the randomization was performed in a way that was clearly not concealed. So in that aspect, inclusion of studies in our meta-analysis was no different from inclusion of studies in Hrobjartsson and Gøtzsche’s meta-analysis (2001). Roelofs et al. (2000) described the randomization procedure in great detail and they reported a minor placebo analgesia effect. However, there is no evidence that the randomization procedure was related to the magnitude of their reported placebo effect.
While Hróbjartsson and Gøtzsche had concerns with the methodology of the meta-analysis, I think the real reason for the difference in effect sizes is the difference in methodologies between the clinical trials of treatments and the placebo-focused studies. It’s about study quality and researcher motivation. Proponents of placebo effects claim that designs in clinical trials aren’t sensitive enough to detect these enormous placebo effects, and the difference is simply that the placebo-focused trials do a better job of evoking placebo effects. I think this is putting a positive spin on the fact that these placebo-focused trials exploit response bias. Worse, researchers motivated to find a large placebo effect are more likely to engage in questionable research practices in order to produce such an effect. Both factors probably play a role. Preregistered trials seem to have a particularly difficult time establishing a large placebo effect. Roelofs et al. (2000), mentioned in the above quotation, not only specify the randomization procedure in detail, but also many other details not present in the other studies, down to the method of data storage and the position subjects were to sit in. These especially careful researchers produced a nonsignificant placebo effect of only half a point on a 100-point scale. As in the rat studies examined below, indications of high research quality seem to correlate with not finding a placebo effect, but high-quality research is so rare that it is difficult to evaluate formally.
Since studies specifically seeking to find a placebo effect were the most fruitful for placebo advocates, the analyses continued. Forsberg et al. (2017), in The Placebo Analgesic Effect in Healthy Individuals and Patients: A Meta-Analysis, in the Journal of Psychosomatic Medicine, find an absurdly large placebo effect on pain, even larger than those of Vase et al. (2002):
The average effect size was 1.24 for healthy individuals and 1.49 for patients. In the studies with patients, the average effect sizes of placebo treatment were 1.73 for experimentally induced pain and 1.05 for clinical pain.
To put these effect sizes into context, Watso et al. (2022) found an effect size of .84 for 5 mg of morphine on experimentally induced pain using a cold pressor test, but no effect of placebo. The same laboratory found an effect size of d = 1.48 for a 75 μg dose of fentanyl on the same pain protocol, while again finding no effect of placebo on either objective or subjective measures. (Again, researchers who do not have their thumbs on the scales to find a placebo effect do not find one.) Placebos in placebo studies are not only more powerful than placebos in controlled trials: they are more powerful than fentanyl. Rather than laughing at these absurd results supporting a powerful placebo effect, many otherwise skeptical people take them seriously, just as they did with behavioral priming studies.
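To give these numbers some intuition: assuming roughly normal outcomes, a standardized effect size d can be translated into the probability that a randomly chosen treated subject does better than a randomly chosen control (the “common language effect size,” Φ(d/√2)). A quick sketch, plugging in the d values quoted above:

```python
from math import erf, sqrt

def prob_superiority(d):
    """P(treated subject does better than control) = Phi(d / sqrt(2)), normal model."""
    return 0.5 * (1 + erf((d / sqrt(2)) / sqrt(2)))

for label, d in [("morphine 5 mg (Watso et al.)", 0.84),
                 ("fentanyl 75 ug (same lab)", 1.48),
                 ("placebo, patients (Forsberg et al.)", 1.49),
                 ("placebo, experimental pain in patients", 1.73)]:
    print(f"{label}: d = {d:.2f} -> P(superiority) ~ {prob_superiority(d):.2f}")
```

On that reading, a sugar pill in a placebo-focused study supposedly beats an untreated comparison subject about as often as, or more often than, fentanyl does in the same laboratory’s protocol.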
Despite being published in 2017, Forsberg et al.’s meta-analysis makes no mention of study quality or quality analysis, and concerns about publication bias are minimized. Rather than providing evidence of the placebo effect, I think that the difference between effects found when researchers are and aren’t motivated to find an effect illustrates the power of questionable research practices – and the power of research designs that exploit demand characteristics. We will address the difference in effect sizes between garbage-in-garbage-out meta-analyses and large-scale, preregistered trials in a later section.
On publication bias, Forsberg et al. say:
Visual inspection of the funnel plot for the overall effect revealed some asymmetry, indicating a publication bias with too many small sample studies with large effect sizes and too few small sample studies with small effect sizes. Even if there may be some publication bias, the value of the file drawer statistics indicated that at least 7927 unpublished studies with no effect of placebo treatment would be needed to reduce the placebo analgesic effect to a nonsignificant level. This is considerably higher than the suggested limit (5 K + 10 = 350; 33), and thus, it is unlikely that such a large number of unpublished studies with zero findings should exist.
This type of analysis has always bothered me, as it seems to assume that studies are always produced honestly and that the only issue that can happen is that a study is not published. In reality, it seems that a very common problem is massaging studies until they produce a significant result, which is different from honest non-publication. It seems to me that the numbers would look very different if we acknowledged that many of the studies with large, significant results would have been null if, for example, the analysis had been preregistered and followed. However, when studies are designed to exploit response bias, they may in many cases not even need to mess with their data to produce a result.
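For reference, the “file drawer statistic” invoked above is, I believe, Rosenthal’s fail-safe N: the number of hidden studies averaging exactly zero effect that would be needed to drag the combined result below significance. A minimal sketch with invented z-scores (not Forsberg et al.’s actual data) shows both how the number is computed and the assumption it smuggles in, namely that every missing study is an honest null rather than a study massaged into significance:

```python
import numpy as np

def fail_safe_n(z_scores, alpha_z=1.645):
    """How many hidden studies with average z = 0 would pull the combined
    (Stouffer) z-score below the one-tailed .05 threshold."""
    z_sum = np.sum(z_scores)
    k = len(z_scores)
    return (z_sum / alpha_z) ** 2 - k

# 68 invented studies (the "5 K + 10 = 350" in the quote implies K = 68),
# each individually "significant" with z around 2.5 -- taken entirely at face value.
z = np.full(68, 2.5)
print(fail_safe_n(z))   # roughly 10,600 hypothetical null studies
```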
One study Forsberg et al. mention is Charron et al. (2006), which studied 16 subjects with low back pain and reported enormous effects of deceptive placebo on self-reported pain (over 20 points in one group of 8 patients). Interestingly, they only got a big effect for low back pain, compared to basically no placebo effect for a cold pressor test (putting a hand in a bucket of cold water). This is in contrast with the findings of Forsberg et al., who found a much larger effect for experimental pain than for clinical pain like chronic low back pain. Even Charron et al.’s sample size of 16 should have been enough to detect an effect as large as 1.73, but methodologies vary and the noise mine is very noisy.
After explaining that the enormous placebo effect dwarfs the effect of accepted real treatment methods, as I have noted above regarding morphine and fentanyl, Forsberg et al. caution:
This does not mean that placebo treatment should be the treatment of choice over other evidence-based pain treatments. Most of the included studies in the present meta-analysis were performed in a laboratory, investigating short-term effects of placebo analgesia. Moreover, whereas studies on treatment for pain commonly use a double-blind design, several of the studies on placebo analgesia used a single-blind design, which most likely increased the effect of the placebo treatment.
I personally think prescribing placebos to chronic pain sufferers would be cruel, and the fact that people take this idea seriously shows the damage that scientific fuckery can do in the real world, even if it seems harmless (compared to all the fraudulent Alzheimer’s research, for example). Placebo effects in studies like these are large because the research is sketchy, not because placebos are effective.
Forsberg et al. add: “To our knowledge, there is only one study of placebo treatment of long duration outside of the laboratory. This study showed a large analgesic effect across 50 days of placebo treatment in one patient.” (Citation omitted, emphasis mine.) This study was Kupers et al. 2007, Naloxone-insensitive Epidural Placebo Analgesia in a Chronic Pain Patient, which interestingly found no effect of an opioid antagonist on the patient’s placebo response, an issue discussed in a later section.
Subjects involved in an experiment are in a special context very different from everyday life. They are under the impression that the research is important, and wish to play the role of a good experimental subject. “Demand characteristics” reflect a vast repertoire of abilities subjects bring to the experiment. Subjects improvise along with whatever the researchers are up to. Researchers actively trying to find a placebo effect present subjects with a different kind of game to play, compared to researchers testing a treatment. Even without manipulating data, it is possible to produce a spurious “placebo effect” simply because subjects are willing to play along, interpreting cues from researchers and context to figure out the right way to play. Some methodologies exploit this more than others, but by definition, no study comparing placebo to no treatment is blinded.
When talking about scientific-sounding ideas like effect sizes, it is easy to forget what’s really going on underneath the numbers. Subjects may be prompted to report on a survey that they feel better, but to interpret this as subjects actually feeling better is often a mistake. The picture that emerges is that a placebo pill has almost no effect when administered by researchers who do not care about the placebo effect, but the exact same pill has an enormous effect that dwarfs the effect of all existing treatments when administered by a researcher who really wants the placebo effect to be real. The most parsimonious explanation is that it is the research practices, rather than the placebo, that create the large effect sizes.
The Open-Label Placebo
Placebos have traditionally been administered deceptively, and the deception was thought to be inherent to the efficacy of placebo. If you weren’t told that the pill you were taking was a powerful painkiller, why would you report feeling (slightly) less pain on a survey? But perhaps you have heard that placebos “work” even when you know it’s a placebo, and even if you don’t believe in the placebo effect. This claim comes to us from “open-label” placebo research.
In open-label placebo designs, subjects are told that they are receiving a placebo, and usually given some kind of rationale for why the inert substance should have some effect (a contradiction that some researchers have speculated may induce “cognitive dissonance”). In my model in which placebo effects are driven by politeness and roleplaying on the part of subjects, this rationale should be enough to induce them to change their responses on self-report measures, but as with regular placebos, should have no effect on objective measures. This seems to be the case, and suggests that exploiting demand characteristics plays a larger part in apparent open-label placebo effects than other forms of fuckery.
In the meta-analysis of open-label placebo effects of Spille et al. (2023), just as in the Hróbjartsson and Gøtzsche analyses of deceptive placebo, there were significant effects on self-reported instruments, but not for objective outcomes, such as in Mathur et al. (2018), finding no effect of placebo on wound healing (it was almost significant in the wrong direction). Although “the overall quality of the evidence was rated low to very low,” and in my opinion likely exaggerates the efficacy of open-label placebo on even self-report measures (as I will explain), this is exactly what would be expected if the placebo effect were a function of response bias. By definition, an open-label placebo study cannot be blinded. By the nature of the methodology, roleplaying as if the placebo works is invited, not discouraged.
Some of the pain outcomes classified by Spille et al. as “objective” (none of which found any significant effect of open-label placebo) were still under the voluntary control of subjects, such as heat pain thresholds. While none of the “objective” pain measure results here were significant, including the overall meta-analytic effect, many other authors do report a placebo effect on “objectively” measured pain outcomes, as discussed in an earlier section. It seems plausible to me that these can be products of roleplaying, such that I would not conclude that these results are necessarily fraudulent. (Many do, however, seem too large to be real.) An outcome like wound healing, on the other hand, cannot be the product of roleplaying, as long as a blinded observer is rating the wound. The same goes for pregnancy, discussed above. That is why placebos don’t affect these types of outcomes.
Spille et al. give a meta-analytic effect (standardized mean difference) of 0.43 (95% CI = 0.28, 0.58) in their sample of non-clinical studies (“20 studies comprising 1201 participants were included, of which 17 studies were eligible for meta-analysis”). Interestingly, in a smaller meta-analysis of clinical samples (“We included k = 11 studies (N = 654 participants) into the meta-analysis”), the same lab (Wernsdorff et al. 2021) reports an enormous meta-analytic effect size of .72 (95% CI 0.39–1.05). All of Wernsdorff et al.’s included studies used self-report measures except one, which found no effect. They note that after excluding four studies with high risk of bias, their effect size went down to a still-incredibly-large .49 – though not as large as the hilarious effect sizes reported in Forsberg et al. (2017) for deceptive placebos on pain.
To get an idea of the studies under analysis, let’s look at some of the studies with the largest effect sizes in the Spille et al. (2023) paper. The largest effect size in the self-report category is Guevarra et al. 2020, Placebos without deception reduce self-report and neural measures of emotional distress, a standardized mean difference of a whopping .99 between open-label placebo and no treatment on self-reported emotional distress while viewing scary pictures (over twice the effect reported by Schienle et al. (2023), discussed in the next section, in a preregistered trial). Guevarra et al. is also the only study to produce a statistically significant “objective” effect (SMD=.38), and the only study the authors describe as “high risk of bias” (reference 18 in Spille et al. 2023). Guevarra et al.’s “objective” measure, the “neural” measure, is an EEG measure called the sustained late positive potential. I will discuss why this is not actually an objective measure in the later section on brain imaging.
The second-largest effect size in the self-report category was El Brihi et al. 2019, with an SMD of .74. They gave healthy subjects a package of pills with this excellent art and had them take either one or four tablets per day:
Their subjects gamely reported feeling much better on various self-report wellbeing measures, although there was no dose-response relationship for the placebo. Bräscher et al. (2022) attempted to replicate this result (Open-Label Placebo Effects on Psychological and Physical Well-Being: A Conceptual Replication Study), but the replication attempt failed. Unfortunately, their study was reported too late to be included in the Spille et al. meta-analysis. They also used cool art:
Interestingly, one of the polite reasons they give for possibly not replicating the results of El Brihi et al. is that the name “Pharmacebo” reminded participants that they were taking a placebo, although the El Brihi et al. name is similarly suggestive (“Placibax”) and has the word “Placebo” printed on the packaging. I think the most likely reason for the failed replication is that the Bräscher et al. study was of higher quality (for example, assessing symptoms daily instead of once after five days), and the true effect (if we can even speak of a “true effect” made mostly of response bias) is much closer to zero than to .74.
The third-highest effect was Mundt et al. (2017) for lab-induced thermal pain, with a standardized mean difference of .69 versus no treatment – but what they found was not decreased pain with placebo, but increased pain in the control group at repeated baseline. Here is their figure:
This seems sketchy to me, even ignoring that p = .045, because usually a placebo effect refers to a decrease in pain. But perhaps sensitization is the expected course for this type of trial, and placebos (open-label or deceptive) prevent sensitization? This seems not to be the case from what I can gather. For example, a study they reference, Chung et al. (2007), using the same methodology and rationale, did not find any such sensitization effect:
Mundt et al. 2017:
A Medoc Thermal Sensory Analyzer (TSA-2001, Ramat Yishai, Israel) was used to deliver all thermal stimuli. Thermal stimuli of 3-s duration were delivered to the ventral forearm via a contact thermode. Temperatures ranged from 43 to 51 °C. Temperature levels were computer controlled by a contactor-contained thermistor with a preset baseline of 32 °C. Stimulation sites were alternated such that no site was stimulated within a 3-min interval to preclude sensitization effects.
Chung et al. 2007:
Medoc Thermal Sensory Analyzer
All thermal stimuli were delivered using a computer-controlled Medoc Thermal Sensory Analyzer (TSA-2001, Ramat Yishai, Israel), which is a peltier-element-based stimulator. The stimuli were a range of temperatures from an adapting temperature of 33°C up to 51°C. Stimuli were applied in a counterbalanced order to the forearm by a contact thermode and were 3 seconds in duration. Multiple sites located on the forearms of both arms were employed. Stimuli presentations were timed such that no site was stimulated with less than a 3-minute interval to avoid sensitization of the site.
Another study, Hollins et al. 2011, using the same skin sites over and over, finds the opposite of sensitization (habituation) early on, with later sensitization never returning to the original level. That is, the rationale of not reusing the same site to preclude sensitization seems backwards:
In a similar vein, yet another study (Jepma et al. 2014) only found sensitization when the sites were switched, again the opposite of the stated rationale. Mundt et al. (2017) looks like noise mining to me, rather than a well-demonstrated placebo effect.
Again, I do not expect the placebo effect (which is to say, the politeness or roleplaying effect) to be zero. It might even be large. Open-label placebo studies seem to attract enthusiastic subjects. Wernsdorff et al. say, “Because of the novelty of this kind of treatment, patients seemed to enjoy the treatment and described it as ‘crazy’ according to the intake and exit interviews.” They also say, “The risk for the so-called ‘time lag bias’ is also comparatively high, due to the early state of research in this field. This bias indicated that trials with negative results are published with some delay.” I think that is probably optimistic, given the enormous effects reported in Forsberg et al. (2017), since deceptive placebo research is a more “mature” field. While “time lag bias” was prescient with regard to Bräscher et al. (2022) and Schienle et al. (2023), Benedetti et al. (2023) (discussed in a later section) produced an enormous open-label placebo effect of four points on a ten-point scale! In my opinion, the most likely reason for large open-label placebo effects, as with regular placebo effects, is study sketchiness. Even in the absence of fuckery, when your entire research paradigm exists to exploit demand characteristics, you will indeed produce demand characteristics.
The Placebo and “The Brain”
Brain imaging studies have redressed earlier criticism that placebo effects might merely reflect a response bias.
Elsenbruch and Enck (2015), Placebo effects and their determinants in gastrointestinal disorders
In the Spille et al. meta-analysis, the only study to produce a significant “objective” effect is Guevarra et al. (2020). These authors find a significant difference between open-label placebo and no treatment on an EEG measure, the late positive potential (specifically the sustained late positive potential measured at between one and six seconds from the stimulus), when subjects view scary pictures. This is their money shot:
They say, importantly, “These results show that non-deceptive placebo effects are not merely a product of response bias.”
The problem here, I think, is that their “objective” measure is in fact entirely possible to produce through response bias. Consider a survey that a subject has filled out. Is this an objective measure? It’s written on paper (or typed into a computer), and can be viewed by a neutral observer. But we wouldn’t call it an objective measure, because it is produced through the voluntary behavior of the subject.
Many authors find that the sustained late positive potential can be influenced intentionally and voluntarily on the part of the subject. Moser et al. (2014) find that, when subjects are looking at scary pictures, the practice of voluntary “cognitive reappraisal” (“participant should imagine that the pictured scene improved and to think of the image in a more positive light so as to decrease the intensity of their negative emotions”) can significantly affect the late positive potential in the exact same way:
Wang et al. (2024) make a similar finding (Watch versus Reappraisal conditions):
Studies find similar magnitudes of difference between late positive potential responses for voluntary “cognitive reappraisal” as Guevarra et al. found for open-label placebos.
On the self-report side, “cognitive reappraisal” might be regarded as a kind of maximal control for response bias, since researchers essentially ask subjects to decrease their ratings voluntarily. In a preregistered trial, Schienle et al. (2023) gave their “cognitive reappraisal” subjects an instruction to voluntarily change how they reacted to scary pictures: “participants were instructed to apply the strategy of cognitive reappraisal in the picture viewing task by imagining that the shown situations and objects are not real, but created by a special effects artist for a Halloween movie.” The authors of course found “regions of interest” in fMRI data, both overlapping with open-label placebo subjects and distinct. But more interesting was the effect on subjective ratings of disgust for the pictures. Compared to the disgust ratings of the “passive viewing” group, the open-label placebo group showed an effect of d = 0.39 (much smaller than Guevarra et al.’s finding of .99), whereas the cognitive reappraisal group showed a massive d = 1.02 reduction in disgust ratings. While the open-label placebo design is subtly suggesting that participants portray themselves as less disgusted (or perhaps less pained, or depressed, or allergic), directly asking people to portray themselves as less disgusted seems to be much more effective in changing survey responses.
If brain imaging measures are generally considered “objective” and not under voluntary control, this seems to imply that people do not have voluntary control over their own brains. I am not sure what metaphysical model this claim is working with. But some brain responses may be under more voluntary control than others, as demonstrated with the late positive potential. This may also be true of fMRI results.
In a large meta-analysis (n=603 participants) of individual participant fMRI data, Zunhammer et al. (2021) find a small (“g < 0.2”) but statistically significant difference in “pain-related brain activity, as compared to the matched control conditions” when combining patient data from both conditioning and “suggestion” (deceptive placebo) studies. (Since they found no difference between conditioning paradigms and normal placebo designs, I won’t go into the distinction here, but will present it in the section on animal models.) The placebo effect in “the brain,” as revealed by fMRI, seems to be scattered around an assortment of different brain regions, with one notable exception.
In an earlier paper analyzing the same 603 participants, the same lab found a small (g = −0.08 [95% CI, −0.15 to −0.01]) but (barely) statistically significant effect of placebo on the “Neurological Pain Signature,” a set of brain areas allegedly activated by pain (Zunhammer et al. 2018). In a “conservative” analysis removing studies rated at high risk of bias, the magnitude of the effect was even smaller, and the 95% CI included zero (g = −0.07, 95% CI, −0.15 to 0.00). Even in the most carefully selected subgroup of “placebo responders” (“which included only participants showing a behavioral placebo response greater than the study median and excluded potentially ineffective placebo treatments and outliers”) that they manage to coax a Bayesian-significant result out of, they note that “effects of placebo on the NPS were only 4% to 14% as large as the overall NPS response to painful stimulation.”
In the 2021 meta-analysis, referring to the study just described (Zunhammer et al. 2018), they say “This previous study revealed that behavioral placebo analgesia was associated with significant but small effects in the NPS, pointing to the relevance of other brain areas and networks.” However, in Botvinik-Nezer et al. (2024), Placebo treatment affects brain systems related to affective and cognitive processes, but not nociceptive pain, a study from still the same lab with almost as many (n=392) participants and a “pre-registered analysis,” no placebo effect on the NPS could be detected. They report that “placebo did not decrease pain-related fMRI activity in brain measures linked to nociceptive pain, including the Neurologic Pain Signature (NPS) and spinothalamic pathway regions, with strong support for null effects in Bayes Factor analyses.”
So whatever is going on with placebo effects on the brain, the one set of areas it doesn’t seem to involve is the set of areas associated with pain perception. The authors conclude, “Our results indicate that cognitive and affective processes primarily drive placebo analgesia.” I am just an ignorant banana, but to me the small effect of placebo on fMRI data almost everywhere but the pain-activated areas suggests that, similar to the late positive potential in EEG, it could be a matter of measuring voluntary effects, or “cognitive and affective processes.”
Interestingly, Spille et al. (2023) seem surprised that they do not get a significant result for “objective” outcomes from open-label placebo. They say, in their discussion section:
Regarding the differences between self-reported and objective outcomes, our finding of a null effect for objective outcomes raises the question of whether OLPs and deceptive placebos have the same pattern of effect, as changes in objective outcomes have been repeatedly demonstrated in studies using deceptive placebos. One might therefore hypothesize that OLPs, unlike deceptive placebos, do not entail biological changes. However, Kaptchuk & Miller emphasize that also deceptive placebos primarily affect self-reported and self-appraised symptoms. Further studies comparing the effects of OLPs with deceptive placebos on objective outcomes are needed to clarify this issue.
The claim “changes in objective outcomes have been repeatedly demonstrated in studies using deceptive placebos” might seem surprising after reviewing the clinical trial data from the Hróbjartsson and Gøtzsche meta-analyses (even Forsberg et al. do not claim to find a placebo effect on objective outcomes). But Spille et al. cite for the claim a paper by Fabrizio Benedetti, a great placebo believer and advocate for placebo research and a figure whose research we will learn more about in the opioid antagonist section. The paper, published in 2012 in the Journal of Acupuncture and Meridian Studies, is called Placebo-Induced Improvements: How Therapeutic Rituals Affect the Patient’s Brain. I am surprised by this choice, not because I don’t trust the Journal of Acupuncture and Meridian Studies, but because I would have imagined a series of meta-analyses would provide more certainty of evidence. Benedetti (2012) makes many interesting claims, and the word “objective” does not occur in the document, so it is up to interpretation. His laboratory sometimes finds an effect of placebo on “objective” measures of pain, discussed above, such as time that ischemic arm pain is tolerated. As I have explained, these results may also be a product of roleplaying and politeness, although the results are often so very large as to seem suspicious, and other laboratories seem to have difficulty replicating these effects. The other results he presents that might be interpreted as “objective” outcomes are concerned with measures of chemicals like endogenous opioids and dopamine (as well as a reference to an fMRI study of placebo acupuncture). This is a major focal point of placebo belief, particularly the claim that we “know” the placebo effect is an “objective” phenomenon because it involves the release of e.g. endogenous opioids, and we can abolish the placebo effect with a hidden injection of e.g. opioid antagonists. I will review the evidence for these claims in the next section.
Opioid Antagonists (And Friends)
One of the most understandable bases for the placebo belief is the idea that the placebo effect is objective and measurable, and not response bias or woo, because it is based on endogenous opioids (or similar chemicals, perhaps dopamine), and we can experimentally extinguish the placebo effect by giving a hidden injection of an opioid antagonist like naloxone. This cannot be response bias, because subjects are not aware of whether they are receiving naloxone or saline, and their subsequent pain ratings differentiate the two. This offers the promise of a truly blinded demonstration of a placebo effect.
This idea stems from the work of two important figures: Jon Levine, the pioneering neuroscientist who along with coauthors Newton Gordon and Howard Fields invented the idea, and their former student, Fabrizio Benedetti, referenced in the last section. Almost all of the research confirming the effect comes from these two sources, and other laboratories have found it difficult to reproduce these results. The majority of studies included in a twenty-year-old meta-analysis of the alleged effect, Endogenous opiates and the placebo effect: A meta-analytic review, by Sauro and Greenberg (2005), are from the laboratories of Levine or Benedetti.
The Sauro and Greenberg (2005) meta-analysis is old enough that it doesn’t include an analysis of publication bias or statements like “Study quality ranged from abysmal to hilarious,” but it doesn’t seem to have been updated in the age of open science.
The first thing that stands out from the meta-analysis is the absolutely massive estimated meta-analytic effect of placebo, d+ = 0.89 (95% CI 0.74–1.04). This is even larger than the effects reported in the open-label placebo meta-analyses of various outcomes, which in turn were so much larger than the estimate of .28 from Hróbjartsson and Gøtzsche on self-reported pain (though again still not as large as the estimates in Forsberg et al., 2017, with which it has many overlapping studies). These researchers have access to very powerful placebos indeed! Some subgroup effects are even larger, with the largest being d+ = 1.23 for the placebo effect on tourniquet-induced ischemic pain. The meta-analytic effect of naloxone in reducing the placebo effect is almost as massive, at d+ = 0.55 (95% CI 0.39–0.72). Note that naloxone no longer “abolishes” or “extinguishes” the effect, but merely reduces it somewhat in this analysis. The largest subgroup effect is an incredible d+ = 1.37 for the reduction of placebo effect by naloxone in capsaicin-induced experimental pain (based on two studies, both reported in Benedetti et al., 1999).
Of the twelve papers included in the meta-analysis, seven have Levine or Benedetti as an author, and all seven of those produce positive results supporting all hypotheses. This is a bit surprising, since quick use of a couple of internet calculators suggests that to detect an effect size of .55 at 80% power would require a sample size of at least 84 subjects with 42 in each arm, assuming a one-tailed hypothesis. However, as detailed below, modern studies that perform a power analysis only find the need for the same old 13 or 14 subjects per arm, I think because they assume a much larger effect size, perhaps an effect size even larger than the placebo effect itself, so what do I know. Almost all of these studies have 11 to 17 subjects per arm, and the largest, Benedetti et al. (1999) with 24 to 29 subjects per relevant arm, gets a significant result in every comparison. (See appendix for group counts in all studies, and for a note on the alleged significance of one result.)
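To make that back-of-the-envelope calculation reproducible, here is a minimal power-analysis sketch in Python (using statsmodels; any power calculator gives roughly the same answer):

```python
# Sample size needed to detect the meta-analytic naloxone effect (d = 0.55)
# at 80% power with a one-tailed test and two independent groups.
from statsmodels.stats.power import TTestIndPower

n_per_arm = TTestIndPower().solve_power(effect_size=0.55, alpha=0.05,
                                        power=0.80, ratio=1.0,
                                        alternative='larger')
print(round(n_per_arm))  # ~42 per arm, i.e. ~84 subjects total
```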
Benedetti and Levine get a positive result every time using an eclectic variety of methodologies and measures and surprisingly small sample sizes. On the other hand, of the five studies not conducted by Levine or Benedetti, only one gets a positive result, comparing a group of 16 to a group of 14. Some laboratories have all the luck!
I have briefly summarized every study from the Sauro and Greenberg meta-analysis with basic study characteristics and findings, but since my summary amounts to over 4000 words, I have turned it into an appendix for the curious.
As far as I can tell, there has never been a preregistered or multi-center or Manylabs-style replication attempt for the effect claimed in Sauro and Greenberg (2005). There is a preregistered study from this year, Dopamine has no direct causal role in the formation of treatment expectations and placebo analgesia in humans (Kunkel et al. 2024), which even included a power analysis indicating that 165 subjects would be needed (55 per group), and managed to get 168. The authors, as stated in the title, find no evidence for either dopamine agonists or antagonists in the formation of the placebo response in a conditioning paradigm. But I can’t find anything like this for opioid antagonists.
In the modern era, a recent study (Pontén et al. 2020) that looked at pain ratings and fMRI data in a conditioning paradigm failed to find any effect of naloxone on open-label placebo analgesia (for painful pressure applied to the thumbnail) in either measure reported. They performed a power analysis that justified a small sample size:
An a priori power analysis was performed to determine the sample size required to detect a pain-cue effect (n = 13) based on a previous data set with similar design. Calculations were performed in G*Power (3.1) based on differences in pain ratings (0–100) between two pain cues (dependent means) M = 17, SD of difference = 13, alpha = .05, power (1 − β) =.99, two-tailed.
Benedetti et al. (2023), however, as always, find an enormous positive result (in “placebo responders”) in reversing the placebo effect of open-label placebos in ischemic arm pain. They also perform a power analysis justifying an almost identical sample size:
An a priori analysis of power and sample size was performed regarding the expected difference between saline (group 6) and naloxone (group 7). A sample size of 13 was calculated by setting the desired power at 0.8, P at 0.05, the expected difference between naloxone and saline at 1 or 2, and the expected variability at 1 or 2 SD. Therefore, we decided to test a sample of 14 subjects for group 6 and 14 subjects for group 7.
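Reconstructing those two quoted power analyses (my best reading of the inputs each lab describes, so treat the exact figures as my assumptions), the tiny “required” sample sizes come down entirely to the effect size assumed going in:

```python
# How the assumed effect size drives the "required" sample size.
# Inputs below are my reconstruction of the two quoted power analyses,
# not the labs' actual software settings.
from statsmodels.stats.power import TTestPower, TTestIndPower

# Ponten et al.: paired design, difference M = 17, SD = 13, power .99, two-tailed
dz = 17 / 13  # ~1.31
print(round(TTestPower().solve_power(effect_size=dz, alpha=0.05, power=0.99)))  # ~13 subjects

# Benedetti et al.: two groups, expected difference ~1 point with SD ~1, i.e. d ~ 1
print(round(TTestIndPower().solve_power(effect_size=1.0, alpha=0.05, power=0.80)))  # ~17 per group

# Same calculation using the meta-analytic naloxone effect of d = 0.55 instead
print(round(TTestIndPower().solve_power(effect_size=0.55, alpha=0.05, power=0.80)))  # ~53 per group
```

Assume an effect roughly twice the size of the meta-analytic estimate, and a dozen-odd subjects per arm is all the power analysis demands.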
What could explain the differing results? The pain stimulus was different between the two studies, so perhaps only the placebo effect in ischemic arm pain is affected by naloxone. Pontén et al. did not select their subjects on the basis of being in the dubious group of “placebo responders” (although they did demonstrate a conditioned placebo effect), while Benedetti et al. did. Pontén et al. recruited subjects based on an “advertisement,” whereas Benedetti et al. recruited students and employees at the university and laboratory where the study was conducted. Pontén et al. used a conditioning paradigm and Benedetti et al. used a suggestion paradigm, but Amanzio and Benedetti (1999, discussed in appendix) previously claimed to demonstrate the effect with a conditioning paradigm with similar group sizes.
It’s interesting to look at the Benedetti lab results for deceptive placebo back in 1996:
The effects of naloxone vs. saline amount to less than two points on a ten-point scale and occur mostly 20 to 25 minutes after injection of naloxone in the 1996 deceptive placebo study. On the other hand, the open-label naloxone results from 2023 are larger and occur much earlier. The “placebo effect” of deceptive vs. open-label placebo appears almost identical in magnitude (absolutely massive), but occurs earlier in the 2023 study. Naloxone causes pain ratings to return to baseline much earlier, too, and pain goes up beyond baseline, even though in 2023 the hand squeeze device required less pressure to close (5 kg vs. 6.5 kg) and they used a lower cuff pressure to induce ischemia (200 vs. 250 mm Hg):
It is surprising that naloxone would be so much faster and more effective at reducing the effect of an open-label placebo compared to a deceptive placebo. (In both cases, the relevant groups were constructed of “placebo responders” and the experiment was started when a pain rating of 7 was reached.)
Benedetti et al. (2023), as in all their studies, find no effect of naloxone on pain in the absence of a placebo effect to “abolish.” A recent meta-analysis (Jensen 2021) found a small effect (g = .23) of naloxone on pain in general, but the author notes that “there was considerable heterogeneity present,” and “due to reporting bias in the literature, the size of this effect may be overstated.” It seems odd that, if the endogenous opioid system is involved in pain perception, opioid blockade should have such a small and unreliable effect with regard to various kinds of pain and pain modulation, yet somehow specifically act to reduce the “placebo effect.” This will be expanded on later in this section.
Just recently, Dean et al. (2024) published a preregistered study on the effects of naloxone on placebo analgesia (sham mindfulness) in men and women. With a similar sample size (15 per group), they found an effect in men but not in women, contrary to both Pontén et al. 2020 (whose subjects were all men but did not get a result) and Benedetti et al. (2023) (who found no difference between men and women). In the female subjects in Dean et al. (2024), naloxone actually increased their placebo analgesia, though not significantly. They found a much higher placebo effect in men than in women, contrary to the common finding in both randomized controlled trials and placebo experiments that there is no difference (e.g. Enck and Klosterhalfen, 2019). The hypothesis that researchers are still mining noise cannot be rejected.
I turn now to a 2015 systematic review, examining more studies on the effect of naloxone on pain modulation in various contexts, and visit some contemporary research in this area, to investigate the foundations of this alleged effect.
The review is Endogenous Opioid Antagonism in Physiological Experimental Pain Models: A Systematic Review, by Werner et al. (2015). These authors are not simply focused on the placebo effect, but on the effects of naloxone-type drugs on various types of experimental pain and pain reduction (or increase). They lead their conclusion with a quotation from a 1978 study by Grevert:
The consistent failure to find an effect of naloxone on experimental pain in humans suggests that endorphin release did not occur during these procedures
Werner et al. conclude:
This systematic review on endogenous opioid antagonism in physiological experimental pain models concludes that naloxone appears to have a demonstrable and relatively reliable effect in stress-induced analgesia (in all 7 studies) and repetitive transcranial magnetic stimulation (in all 3 studies). In all other pain models, both naloxone and naltrexone demonstrate a variable and unreliable effect.
Not a very reassuring conclusion! This review looks at studies that attempt to manipulate pain in various ways, either inhibitory (decreasing pain, as with placebo) or sensitizing (increasing pain, as with nocebo), and measure whether naloxone has any effect on the pain ratings. For example, in the seven “stress-induced analgesia” studies mentioned positively in the conclusion (all published between 1980 and 1986), subjects are subjected to pain after being stressed out by being made to do math problems or a public speaking task. After this treatment, but not after control non-treatment, they report lower levels of pain – stress reduces self-reported pain! Opioid antagonists, we are told, reduce this pain relief. Similarly, a few authors claim they can reduce pain with transcranial magnetic stimulation, and that this pain reduction is reversible by opioid antagonists. One study looked at the effects of distance running on laboratory-induced pain, and found that naloxone reversed it for ischemic (tourniquet) pain, but not for heat pain. This is emblematic of the entire enterprise: a mishmash of no result, some result, a result in the opposite direction, etc.
Other studies examine whether naloxone-like drugs affect pain ratings, thresholds, and sensitivity in general, again with widely varying results. One body of research focuses on the inhibition of pain with more pain (of a different type or in a different area, such as having one foot in a bucket of freezing cold water while experiencing laboratory-induced heat pain on your arm). This is called “conditioned pain modulation” (CPM). Does naloxone reduce this modulation? One study (King et al. 2013) produces a summary of studies that perfectly encapsulates the research paradigm, in both sample size and variability of results:
In this paradigm and others, the studies are all over the place, revealing no consistent effect of opioid antagonists on pain or pain modulation (except where exogenous opioids are concerned, which they do seem to effectively and reliably reverse, just as they seem to cure people who are overdosing). What about the two paradigms in which naloxone seems to have a “relatively reliable” effect, stress-induced analgesia and transcranial magnetic stimulation?
I mentioned above that all of the stress-induced analgesia studies that show a reverse of analgesia with naloxone are from the early-to-mid 1980s. Scientific practices have changed somewhat since then. What is up with current research?
A recent study, al’Absi et al. (2021), got a small effect at p = .04, the Cursed Value, for reduction of stress-induced analgesia on a cold pain test (subjects had to hold their hands in a bucket of freezing ice water slurry as long as they could stand). But they found no effect in a heat pain paradigm (“thermal stimulation device”), and Bruehl et al. (2022), same lab, replicated the null result for the heat pain paradigm. Apparently the involvement of endogenous opioids in stress-induced analgesia is not so “relatively reliable.”
As for the three transcranial magnetic stimulation studies, it is possible that TMS is somehow the only method of reducing pain that is reliably blocked by naloxone, but I think that is unlikely as TMS is one of the fakest areas of research outside of psi and telekinesis.
Until there is a large, pre-registered multi-center replication attempt, preferably not led by researchers with high allegiance to the effect who produce positive results over and over with tiny sample sizes, it seems that the evidentiary basis for a placebo effect modulated by endogenous opioids is not very solid.
Unless?
I have implied that I think something sketchy is up with the research on the effects of opioid antagonists on placebo analgesia, given the inconsistency of findings. However, I think there are two possible explanations that would allow the studies finding an effect to be perfectly honest, although still meaningless.
The first is that the administration of opioid antagonists (vs. placebo) may not be blind. That is, subjects may be able to detect that they have received an opioid antagonist, because opioid antagonists are unpleasant. For example, Wardle et al. (2016) found that their subjects were able to feel the effects of both a 25 mg and a 50 mg dose of naltrexone, differentiating it from placebo on all three measured indicators, “feel drug,” “dislike drug,” and “fatigue.” Naltrexone also caused nausea, with “near 0%” of subjects in the placebo condition reporting nausea, while 24% and 35% reported nausea in the 25 mg and 50 mg naltrexone conditions, respectively. This is in contrast with the usual method of confirming blind, which is to assess whether subjects guess which condition they are in, and which is usually found not to differ from chance (e.g. Inagaki et al. 2016). This may simply be a crappy method of assessing blind, as subjects may feel subtly worse but not attribute it to the drug, which presumably few subjects have ever taken before.
This was also the finding of Schull et al. (1981) for naloxone, who found that naloxone increased both the intensity and unpleasantness of the kind of ischemic arm pain that Benedetti et al. usually employ. Subjects given naloxone rated their pain as more intense and tolerated it for less time, and also rated their mood worse. The difference was large, similar in magnitude to the placebo effect reductions that Benedetti et al. usually find.
The second is a weirder possibility, which as far as I can tell has not been investigated. What if the effect of opioid antagonists is to make subjects less interested in playing along with researchers in general? For example, Rütgen et al. (2015) got an effect for reducing placebo analgesia with naltrexone, but got an even larger result on the effect of naltrexone on subjects’ imagined pain ratings for a confederate:
Peciña et al. (2021) found that naltrexone somewhat reduced the reported placebo effects of a placebo said to be a fast-acting antidepressant. Chelnokova et al. (2014) found that naltrexone reduced male subjects’ button pressing to keep an attractive face on the screen. These results are consistent with opioid antagonists generally reducing playing along with researchers (although they are also consistent with opioid antagonists making subjects feel worse in general). While I still suspect that all these results are just noise, it would be worth differentiating these possibilities. What if there was a drug that reduced response bias?
Mind Cure for Mice? Animal Models of Placebo Analgesia
Since animals, unlike bananas, can’t use language in the way humans do, they can’t be treated with placebo by suggestion (for instance, by telling them that an injection is fentanyl when it is actually saline, or giving them a rationale for an open-label placebo). Many of them can, however, learn, so animal models of placebo analgesia rely on conditioning paradigms.
In a typical placebo study, a subject might be given a pill or cream and told it is a powerful pain reliever, whereas a control subject might be told that it is an inert pill or cream (or no pill or cream at all may be given, as in the control arms, really control hands and feet, in Benedetti et al. 1999). In a conditioning study, subjects are trained and kind of gaslit with a conditioning procedure. For example, let’s imagine the placebo is a green light. The subjects are receiving electric shocks and asked to rate their pain. During the conditioning phase, they are given less intense, less painful electric shocks when the green light is on, and more intense, more painful shocks when the green light is off. They are told that the shocks are objectively of the same intensity. Basically, they are trained that green light means less pain. Afterwards, in the testing phase, they briefly guess wrong – specifically, they report less pain when the green light is on, at least until they realize that it no longer has any meaning. This is certainly deceptive, but I’m not convinced that this type of “conditioning” is what people mean when they say the placebo effect is real. Nonetheless, it is a common paradigm, because it is pretty easy to get subjects to essentially guess wrong, at least for a little while (although many authors fail to replicate this). Animals may be capable of being trained to “guess wrong” too.
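To make the “guessing wrong” point concrete, here is a toy simulation in Python of the test phase of that green-light paradigm. Every number in it is invented; the point is only that if subjects shave a few points off their report when the cue is present, identical shocks produce a textbook “placebo effect” on the self-report measure, with no change in anything physiological.

```python
# A toy simulation of the test phase of the green-light paradigm described above.
# All numbers are invented for illustration. The shocks are identical in both
# cue conditions; the only thing that differs is how subjects report them.
import numpy as np

rng = np.random.default_rng(0)
n = 30
true_pain = rng.normal(60, 10, size=n)            # same shock intensity, cue or no cue
learned_bias = rng.normal(8, 5, size=n).clip(0)   # "green light means less pain" expectation

rating_no_cue = true_pain + rng.normal(0, 5, size=n)              # report without the cue
rating_cue = true_pain - learned_bias + rng.normal(0, 5, size=n)  # report shaved down by expectation

diff = rating_no_cue - rating_cue
print(f"mean 'placebo effect': {diff.mean():.1f} points on a 0-100 scale")
print(f"Cohen's dz: {diff.mean() / diff.std(ddof=1):.2f}")
```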
One reason it is difficult to measure a placebo effect in animals is that it is difficult enough to measure pain in animals in the first place. Obviously, they can’t rate their pain on a scale of 1 to 100. Wodarski et al. (2016) attempt to bring the field into the age of open science with their Cross-centre replication of suppressed burrowing behaviour as an ethologically relevant pain outcome measure in the rat: a prospective multicentre study. They do not attempt to condition or demonstrate a placebo effect, but simply to replicate a single measure of pain in the rat, which is a reduced degree of burrowing. They did in fact verify that burrowing was suppressed (measured in grams of material displaced) in seven of the eleven included studies. One interesting problem they identify is that blinding was impossible to maintain, as the substance they used to induce pain was yellowish and viscous compared to saline, such that “allocation concealment could be maintained only in 2 studies.” I applaud their effort, and it is precisely their honesty and attention to detail that highlight how easy it might be to cheat in animal studies (without even going as far as making up data or excluding subjects based on a gut feeling).
Learned analgesic cue association and placebo analgesia in rats: an empirical assessment of an animal model and meta-analysis, a master’s thesis by Swanton (2020), provides a starting point for animal placebo analgesia by helpfully conducting a meta-analysis. Swanton also details the process of conducting three replication attempts of a conditioned placebo effect in rats, offering more detail than is typically seen in scientific papers (e.g. she reports, “Attempts were made to contact the authors [of the replication target study] for further information, but there was no response.”). It is one of the most interesting scientific documents I’ve encountered on this topic, and that is saying a lot. Swanton makes three increasingly valiant attempts (with 15 or 16 rats per comparison group, a large sample by the literature’s standards) to replicate Lee et al. (2014), A new animal model of placebo analgesia: involvement of the dopaminergic system in reward learning. (Fascinatingly, in addition to demonstrating placebo analgesia, Lee et al. claim both the behavioral indicators and laboratory tests of biomarkers were “blocked by a dopamine antagonist but not by an opioid antagonist.” This is in contrast with Kunkel et al. 2024, referenced in the above section, an adequately-powered study that ruled out a role of dopamine in human conditioned placebo responses, and also in contrast with the claims in the Sauro and Greenberg meta-analysis claiming an effect of opioid antagonists.)
Swanton was not able to demonstrate a conditioned placebo response at all, even when going to extremes adding high-visibility visual cues, sound cues, and scent cues to her setup to enhance learning:
After this triple disappointment, Swanton delivers a meta-analysis, noting that there is considerable variability within research paradigms, such that “no two protocols are the same in regards to cue type.” Most studies use a hot plate as an apparatus for pain, modeling the amount of time it takes a rat to withdraw a hind paw as a measure of placebo analgesia (or guessing wrong, in my model), or how much they lick their front paws. There are many other paradigms, as we have already seen with the injected pain-inducing substances, and one team even employed “a model of irritable bowel syndrome that employs an inflated balloon to mimic bowel expansion.” Sometimes rats are conditioned (or not) with morphine, and other times, as here, they are conditioned (or not) with the expectation that the temperature of the hot plate will be lower if certain cues are present.
You might wonder, as I did, why “conditioning” with a drug would result in a placebo effect when an inert substance is substituted, rather than increased pain. (If you drink decaf coffee expecting it to be real coffee, you might feel sleepier, because your body is preparing to deal with a bunch of caffeine that doesn’t come.) Apparently this is a relevant question, and Swanton says that “Early drug conditioning research established that cue-associated morphine resulted in hyperalgesia from drug tolerance, not placebo analgesia,” and “More recent work using almost identical conditioning models have reported opposing results of placebo analgesia.”
21 papers met Swanton’s inclusion criteria for the meta-analysis. Almost all used drugs for conditioning, mostly morphine. Most studies had major quality issues. Almost half of the “main outcomes” of the included studies found no effect (31 out of 65 reported main outcomes). Some studies that did get an effect reported effects as large as a Hedges’ g of 5 (yes, five, not .5). From eyeballing the chart of study quality problems, there was only one study with few quality or blinding issues, Akintola et al. (2019), and it got a null result. Despite these obvious problems with replicability, the overall meta-analytic effect for rodent placebo analgesia was a massive g=0.842, which Swanton drily describes as “consistent with a recent meta analysis in human placebo analgesic effects (Forsberg et al., 2017) that demonstrated a high effect size in people.” There was, however, “extremely significant” heterogeneity, unsurprisingly.
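As a sketch of how that kind of pooled estimate can arise, here is a standard DerSimonian-Laird random-effects calculation run on invented study-level numbers (chosen only to mimic the pattern described – many nulls plus a few enormous effects – and not Swanton’s actual data):

```python
# A sketch of how a few huge effects can dominate a random-effects pooled
# estimate. Study effect sizes and standard errors below are invented.
import numpy as np

g  = np.array([0.0, 0.1, -0.1, 0.05, 0.0, 0.2, 3.5, 4.0, 5.0])  # hypothetical study effects
se = np.array([0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.6, 0.6, 0.7])    # hypothetical standard errors

# DerSimonian-Laird random-effects pooling
w = 1 / se**2
fixed = np.sum(w * g) / np.sum(w)
Q = np.sum(w * (g - fixed)**2)
df = len(g) - 1
tau2 = max(0.0, (Q - df) / (np.sum(w) - np.sum(w**2) / np.sum(w)))  # between-study variance
w_re = 1 / (se**2 + tau2)
pooled = np.sum(w_re * g) / np.sum(w_re)
i2 = max(0.0, (Q - df) / Q) * 100  # heterogeneity

print(f"pooled g = {pooled:.2f}, I^2 = {i2:.0f}%")  # large pooled effect, enormous heterogeneity
```

The pooled number comes out large even though two-thirds of the invented inputs are null, and the heterogeneity statistic duly screams – roughly the situation Swanton describes.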
Swanton says:
This is the first time rodent models of placebo analgesia have been meta analysed and is thus the most compelling evidence we have to date that animals are capable of experiencing placebo analgesia.
It’s amusing to me that the “most compelling evidence” comes on the back of multiple apparently sincere and determined but failed replication attempts. This contrast is also seen in other areas of research, as illustrated in Kvarven et al. (2020), Comparing meta-analyses and preregistered multiple-laboratory replication projects. They say:
We find that meta-analytic effect sizes are significantly different from replication effect sizes for 12 out of the 15 meta-replication pairs. These differences are systematic and, on average, meta-analytic effect sizes are almost three times as large as replication effect sizes. We also implement three methods of correcting meta-analysis for bias, but these methods do not substantively improve the meta-analytic results.
To put it less politely, when a meta-analysis of garbage studies conducted by researchers with their thumbs on the scale is compared to more honest study designs, the more honest designs produce smaller effects, or no effect at all. If these massive effect sizes are real, it is amazing that the researchers with the best study designs should fail to detect them.
The F Word
I want to discuss something that I have alluded to but not addressed as a separate matter: fuckery. There are more kinds of fuckery than can be listed or perhaps even known, engaged in by anyone from a principal investigator to a lowly research assistant, and when fuckery is discovered, it is often impossible to pinpoint who committed it. I am referring to behaviors such as:
- exploiting unblinding by exaggerating measures
- excluding data based on a “gut feeling” or because it wrecks the result
- reporting only results that support the hypothesis
- Wansinking
- continuing to gather data until the result is significant and stopping when it is
- measuring the outcome before the manipulation that’s supposed to cause it happens, but still producing a significant effect
- trying a bunch of different variations of a measure until one works out
- manipulating or faking data
I believe that finding a large placebo effect (or similar mind-cure effect) is a reliable marker of fuckery. Researchers associated with fuckery seem to produce enormous placebo effects, and the same goes for research fields such as marketing. For example, Shiv, Carmon, and, relevantly, Dan Ariely (2005), in Placebo Effects of Marketing Actions: Consumers May Get What They Pay For, produced a large nocebo effect on an objective cognitive outcome (solving word jumble puzzles) from the tiny manipulation of whether the fine print of the study materials said that the energy drink provided to recipients was purchased “at a discount as part of an institutional purchase.” This was published in a marketing journal, the Journal of Marketing Research. Discount energy drinks made the subjects much stupider:
This is in contrast to e.g. a preregistered trial by Kleine-Borgmann et al. (2021) which found no effect of an open-label placebo on test performance, and another preregistered trial by Hartmann et al. (2023) that found no open-label placebo effect on cognitive measures.
Another influential placebo study is Commercial Features of Placebo and Therapeutic Efficacy, a marketing study somehow published in JAMA, again with Dan Ariely as an author. These authors find extremely large placebo effects of 15-30 points on a 100-point scale, in a painful electric shock experiment that was not approved by the Institutional Review Board prior to execution, resulting in Ariely being suspended from MIT. If you cut corners on IRB paperwork, what other corners are you cutting?
In the closest replication attempt conducted, Tang et al. (2013) found much more plausible placebo effects of no effect, one point, and three points on a 100-point scale at different shock intensities under conditions similar to the above study.
The above Ariely study, by the way, while only a “research letter,” is one of the most influential studies for the claim that expensive placebos work better than cheap placebos. Many are also convinced that properties of a placebo other than price affect its healing value, but the evidence for this is also weak. For example, Meissner et al. (2018), Are Blue Pills Better Than Green? How Treatment Features Modulate Placebo Effects, provide a review of the evidence for this idea. The studies do not evaluate efficacy at all; instead, they simply ask subjects on surveys to associate colors with different possible treatment effects of drugs. Linde et al. (2010) found that needles were more effective placebos than other types, although they note that “Due to the heterogeneity of the trials included and the indirect comparison our results must be interpreted with caution.” Most of the studies that use needles as a placebo are acupuncture studies, an area in which many researchers are especially prone to fuckery.
While not precisely a placebo effect, but still a mind-cure effect, Ronald Grossarth-Maticek and Hans Eysenck published a great deal of research, now mostly retracted because of fraud, claiming that they could reduce cancer incidence in people with “cancer-prone personalities” by providing talk therapy, e.g.:
In the exquisitely-titled What a wonderful world it would be: a reanalysis of some of the work of Grossarth-Maticek, Van der Ploeg (1991) reveals an amusing way in which he caught their fraud: they had provided data years earlier with the identifying details redacted with marker ink, but the ink faded over the years, allowing Van der Ploeg to check the study participants against death records. Grossarth-Maticek, faced with proof of his deception, responded that, among other things, he was simply testing Van der Ploeg’s honesty and wanted to know if he would reveal patient identities when the marker ink faded.
Ellen Langer (in Crum & Langer, 2007, Mind-Set Matters: Exercise and the Placebo Effect) found large placebo effects on weight loss after subjects were merely told that their work (as hotel maids) was good exercise. I have already linked to Gelman and Brown’s (2024) extensive criticism of this study, but this is not Langer’s only instance of egregious fuckery. In Eminent Harvard psychologist, mother of positive psychology, New Age quack? (2014), James Coyne notes that she also advocates mind cures for cancer, much like Grossarth-Maticek and Eysenck. She also published a study (Rodin and Langer, 1977, Long-term effects of a control-relevant intervention with the institutionalized aged.) claiming that giving institutionalized old people a plant to be responsible for and water reduced mortality from 30% in the control group to only 15% in the treatment group. Coyne points out that an erratum published a year after the study essentially retracted the result. Langer is still attempting lucrative mind cures for cancer. When you’re a Harvard psychologist, they let you do it.
In Brooks et al. (2016), Don’t stop believing: Rituals improve performance by decreasing anxiety, now retracted, a paper with disgraced Harvard psychologist Francesca Gino as a co-author, a placebo effect of a placebo “ritual” was found on objective math test performance:
Please count out loud slowly up to 10 from 0, then count back down to 0. You should say each number out loud and write each number on the piece of paper in front of you as you say it. You may use the entire paper. Sprinkle salt on your paper. Crinkle up your paper. Throw your paper in the trash.
For the math study, Study 4, the following explanation is given in the retraction notice:
A reanalysis of the data showed that 11 participants’ datapoints were dropped prior to analysis but their removal was not reported in the paper. The authors report that the decisions to drop data were based on RAs’ written notes. The reanalysis shows that the focal effect becomes non-significant once all participants are included.
To repeat my earlier maxim, if there is a statistically significant effect of placebo on an objective outcome, chances are it is either noise, fraud, questionable research practices, or a mischaracterization of a subjective outcome as objective. Here it was fuckery. This is a rare situation in which the truth was revealed. Unfortunately, for most of these bogus studies, we will never know the truth about how their implausible results were constructed.
Conclusion
Initially, the “placebo effect” appeared powerful because natural improvement (the episodic nature of conditions, regression to the mean, or healing over time) was misinterpreted as an effect of suggestion. When placebo arms of controlled trials of treatments were compared to no-treatment arms, it was revealed that the effect of placebo was modest, if it existed at all, and was exclusively a function of subjective outcome measures (response bias or roleplaying). Studies designed specifically to exploit response bias in measuring a “placebo effect” were often able to produce large differences on self-reported outcomes, but never on objective outcomes like wound healing, pregnancy after IVF treatment, or any outcome measured by a laboratory test. Brain imaging studies seem to have confirmed, rather than refuted, the claim that the “placebo effect” is a phenomenon of response bias. Studies finding similar efficacy for “open-label placebos” lend more support to the conclusion that response bias drives placebo effects.
Animal models attempting to demonstrate a placebo effect in animals using a conditioning paradigm suffer from poor replicability, with studies with few quality or blinding issues finding no effect. Studies attempting to prove that the “placebo effect” involves the endogenous opioid system also suffer from poor replicability, with conflicting results from different laboratories. No large multi-center preregistered trial has confirmed the effect, although one such trial produced strong evidence against the involvement of the dopaminergic system. Related research on pain modulation casts doubt on whether the endogenous opioid system is involved with psychological pain modulation at all.
Large placebo and “mind-cure” effects are a common feature of research now known to be fraudulent. The available evidence supports a conclusion that the “placebo effect” is not a real healing effect, but a product of response bias, questionable research practices, and misunderstanding. Placebos only appear to have efficacy to the extent that subjects are encouraged to role-play as if they are effective, and any difference on self-report measures reflects roleplaying, not healing. Inert substances are in fact inert, and are not rendered effective by suggestion. The power of the placebo is to blind subjects and researchers in blinded research designs. The magnitude of “placebo effects” in necessarily-unblinded placebo-focused studies demonstrates the necessity of blinding, not a placebo effect.