Experiments as the gold standard for research: A new twist

Worldwide, more than 1.56 billion people used Facebook daily in March 2019. There were “only” half as many daily active users in 2014, but typical adults in the U.S. already spent 40 minutes a day on Facebook, much of that using News Feed. In June 2014, researchers at Facebook published the results of an experiment conducted in 2012. For one week, they removed 10 to 90% of posts containing positive-sounding words from the News Feed streams of about 155,000 randomly selected users, and 10 to 90% of posts containing negative-sounding words from the News Feed streams of another 155,000 users. For control groups, posts were removed at random from News Feed streams. The researchers then assessed whether these treatments were associated with changes in the use of positive and negative words in the subjects’ subsequent posts. The dating website OkCupid ran a similar experiment at about the same time, informing some pairs of users that they would be good matches when the site’s algorithm predicted that they wouldn’t, or vice versa. These experiments generated much outrage among social media users and the commentariat. I learned this story from new research on people’s attitudes toward experimentation, so I decided to provide some background on the role of experiments in science and then describe this new research. I’ll return to the Facebook and OkCupid experiments at the end.

In Chapter 4 of Tools for Critical Thinking in Biology (TCTB), I described experiments as the gold standard for research, in keeping with common practice. The dictionary definition of experiment is quite general, e.g., “a scientific procedure undertaken to make a discovery, test a hypothesis, or demonstrate a known fact.” I used a much more specific definition in explaining why experiments are the gold standard for research – experiments as randomized, controlled trials. My main examples in Chapter 4 of TCTB were two randomized, controlled trials of the effects of smoking marijuana on pain in human subjects. The methodology of these experiments was typical of a wide variety of clinical trials in medicine, from tests of new drugs to comparisons of different surgical procedures and much more. In a randomized, controlled trial in clinical medicine, the experimental subjects are people who volunteer to be randomly assigned to one of two or more treatment groups. For example, all individuals might suffer from psoriasis. Those assigned to one group would receive a standard treatment for psoriasis while those in a second group would receive a new drug. In this case, the first group would be a control group and the experiment would ask whether the new drug better alleviates the symptoms of psoriasis. The Facebook and OkCupid experiments were also randomized, controlled trials, except that subjects didn’t volunteer to participate.

The main purpose of randomly assigning treatments to volunteers in medical experiments is to protect against conscious and unconscious bias. If there were such bias, we couldn’t confidently attribute any difference in outcome to the difference in treatments that the subjects received because the treatments would be confounded with the source of bias. For an example of unconscious bias, suppose researchers assigned the first 25 volunteers to the new drug and the next 25 to the standard psoriasis treatment. It could be that individuals with more itching and pain would be more motivated to volunteer for the experiment and therefore predominate among the first 25 volunteers. If so, volunteers with more severe psoriasis would get the new drug while those with less severe psoriasis would get the standard treatment. The new drug might be effective, but there might be little or no measurable difference in outcome because the subjects treated with the new drug were in worse condition to start with.
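The confounding in this hypothetical can be made concrete with a short simulation. This is a sketch I have added for illustration, with invented severity scores and group sizes; it is not an analysis from any of the studies discussed here.

```python
# Illustrative sketch: how random assignment protects against the
# enrollment-order bias described above. Severity scores and group
# sizes are invented for demonstration purposes.
import random
import statistics

random.seed(42)  # fixed seed so the example is reproducible

# Suppose 50 volunteers enroll in order of decreasing psoriasis
# severity: the sickest, most motivated patients sign up first.
severity = sorted((random.uniform(1, 10) for _ in range(50)), reverse=True)

# Biased design: the first 25 enrollees get the new drug,
# the next 25 get the standard treatment.
sequential_new, sequential_std = severity[:25], severity[25:]

# Randomized design: shuffle the enrollees before splitting into groups.
shuffled = severity[:]
random.shuffle(shuffled)
random_new, random_std = shuffled[:25], shuffled[25:]

def gap(a, b):
    """Difference in mean baseline severity between two groups."""
    return abs(statistics.mean(a) - statistics.mean(b))

print(f"baseline gap, sequential assignment: {gap(sequential_new, sequential_std):.2f}")
print(f"baseline gap, random assignment:     {gap(random_new, random_std):.2f}")
```

With sequential assignment of enrollees who arrive in order of severity, the baseline gap between groups is as large as it can possibly be; shuffling before splitting shrinks it to whatever chance alone produces. That balance at baseline is what lets a researcher attribute any difference in outcome to the treatments themselves.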

Besides randomization of treatments and comparison to a control group, clinical trials are usually also “double blinded”, meaning that neither the volunteer subjects nor the researchers know which treatment each subject received until after the results are recorded and analyzed. This eliminates another potential source of bias: if the researchers wanted to show that the new drug was more effective, and knew which subjects got the new drug, they might be tempted to understate those subjects’ symptoms in their evaluations.

Experiments are considered the gold standard for research because they can give relatively unambiguous answers to scientific questions, including questions of practical importance like how best to treat a certain disease. But experiments may not be feasible for some questions; for example, questions about processes that occur at large spatial or temporal scales. Why do more species of birds live in the tropics than at higher latitudes? How did flight evolve in the transition from dinosaurs to birds? In TCTB, I discussed several examples of research based largely on correlational and comparative data rather than experiments to illustrate how scientists tackle the complexity of causation. I used two main examples in Chapter 5, “Correlations, comparisons, and causation.” These examples asked the questions: Does use of cell phones increase risk of brain cancer? Does exposure to lead early in childhood contribute to increased crime when exposed children become young adults?

Experiments may also raise ethical questions, as illustrated by the 2014-2016 epidemic of Ebola in West Africa. Ebola spreads readily through contact with the bodily fluids of infected people, and this epidemic killed more than 11,000 people, about 40% of those infected. Several potential vaccines were under development when the epidemic began, but none had been rigorously tested for efficacy. How do you design a randomized, controlled trial of a vaccine that might, or might not, save individuals from bleeding to death within a few weeks after symptoms first appear? Is it ethical to randomly assign some people in an outbreak area to a control group that doesn’t receive the vaccine? In this case, health workers used a procedure called ring vaccination, in which the first appearance of the disease in a village led to vaccination of all direct contacts of the initially afflicted individual, then of all contacts of this group. For some randomly selected villages, ring vaccination began immediately; for others, it was delayed by three weeks. Health workers tested a vaccine developed by Merck in this experiment and found that immediate vaccination was more effective than vaccination delayed by three weeks. This demonstrated convincingly that the Merck vaccine protected against Ebola, but at a cost to some people in the control group, who might have benefitted from being vaccinated sooner. However, suppose the health workers had instead given everyone the Merck vaccine as soon as possible. The epidemic might have ended anyway, whether because of a change in the weather or because most susceptible people had already recovered or died and the epidemic had simply run its course. Without an experiment, there would be no way to distinguish these possibilities from success of the vaccine, and therefore no way to know whether the vaccine would work in a future outbreak.

This background sets the stage for a new twist on the idea that experiments are the gold standard for research. In May 2019, Michelle Meyer and six coauthors reported the results of 16 studies under the provocative title “Objecting to experiments that compare two unobjectionable policies or treatments.” The studies consisted of brief online surveys that 5,873 people volunteered to complete. In most of the studies, volunteers were randomly assigned to one of three treatments. In other words, the studies were themselves experiments – psychological experiments to assess how people respond to the process of experimentation itself. Many scientists consider experiments the gold standard for research; do people in general agree?

As an example of the research by Meyer and her colleagues, subjects in treatment A of study 4 were told that a doctor had decided to prescribe a particular FDA-approved blood pressure medication (A) to all of her patients with hypertension. Subjects in treatment B were told that the doctor had decided to prescribe medication B, also approved by the FDA. Subjects in treatment C were told that the doctor had decided to randomly assign her patients to receive either medication A or medication B. The subjects were simply asked to rate the doctor’s decision on a 5-point scale, with 1 being very inappropriate and 5 being very appropriate. About 35% of the subjects in treatment C rated the doctor’s experimental approach somewhat or very inappropriate, while fewer than 10% of the subjects in treatments A and B rated the doctor’s approach as inappropriate. In other words, participants were more willing to accept a doctor simply deciding that all of her hypertensive patients should use drug A (for participants in the A group) or drug B (for those in the B group) than to accept a doctor randomly assigning her patients to one drug or the other (for those in the C group), even though both drugs had already been approved by the FDA to treat hypertension.

Meyer and her colleagues studied a wide range of scenarios in these experiments – direct-to-consumer genetic testing, design of autonomous vehicles, recruitment of health workers in developing countries, and more. In almost all cases, subjects were “objecting to experiments that compare two unobjectionable policies or treatments.” The researchers considered several possible reasons for this result, concluding that “Regardless of the reasons, the unfortunate lesson for those who care about evidence-based practice is that implementation of an untested policy based on intuition about what works may be less likely to invite objection than rigorous evaluation of two or more otherwise unobjectionable policies.”

Meyer and her colleagues don’t think this attitude makes sense. In the example I described in detail, a doctor who simply prescribes medication A or medication B for all of her hypertensive patients is in effect doing an uncontrolled experiment. But that is a far less interesting or useful experiment than the randomized, controlled trial run by the doctor in scenario C, who would learn which drug worked better for her patients. Comparing alternatives in this way is the heart of the scientific method, which is why experiments are the gold standard for research.

How does this conclusion translate to the Facebook and OkCupid experiments that I described at the beginning? Those cases differ in one important respect: the users of Facebook and OkCupid were not informed that they were subjects of an experiment. But Meyer and her coauthor Christopher Chabris argued that informed consent would have been impossible in these social media experiments; that in any case the users ultimately benefitted from the experiments; and, most fundamentally, that the introduction of a new product like Facebook’s News Feed, or of OkCupid’s algorithm for predicting successful matches, is itself an experiment that users are subjected to, just not a very informative one. Does OkCupid’s algorithm really identify someone who will be compatible with you based on interests the two of you share? Before this experiment, the company had found that pairs identified as good matches were more likely to have four-message exchanges than pairs not so identified. By altering the results the algorithm reported to some users (changing a computed 30% probability of compatibility to 90%, or vice versa), OkCupid was testing whether its algorithm really measured things that contributed to compatibility, against the alternative hypothesis that users simply responded to the power of suggestion, e.g., pursuing message exchanges with others reported to be 90% compatible even when the algorithm implied that they were actually only 30% compatible. The researchers found that both the algorithm and the power of suggestion influenced subjects to pursue message exchanges with people identified as good matches, although the power of suggestion had the somewhat larger effect.

Scientists in medicine, agriculture, ecology, psychology, sociology, education, economics, and many other fields rely on experiments. I taught about experiments in college classes ranging from introductory biology for nonmajors to graduate seminars. The new paper by Meyer’s group makes me wonder how receptive my students were to learning about experimentation during my 37 years of college teaching.
