On Conventions Between Fields in Experimental Design and Analysis

Contextual note: this post is one of several on the memorability debate.

I think of science as a conversation that is carried out through paper-sized units. Any single paper can only do so much – it must have finite scope, so that the work behind it can be done in finite time and described in a finite number of pages. There is a limit on how much framing and explanation can fit into any paper. Supplemental materials can expand that scope somewhat, but even without explicit length limits for them there must still be a boundary.

In the particular case of InfoVis as a venue, the restriction on length is 9 pages of text (plus one more for references). That’s fewer than venues such as cognitive psychology journals, where authors might have dozens of pages and where a single paper commonly covers a series of experiments that hit on different facets of the same fundamental research question. The InfoVis length is longer than venues such as some bioinformatics journals, where the main paper is sometimes only a few pages, with the bulk of the heavy lifting done in supplemental materials.

This inescapable fact of finite scope means that fields develop conventions of standard practice: what’s normally done, the level of detail that’s used to describe it, and the amount of justification that’s reasonable to expect for each decision. These conventions can diverge dramatically between fields. The interdisciplinarity of InfoVis can lead to very different points of view on what’s reasonable and what’s valid.

We discussed both of the memorability papers in our visualization reading group at UBC. The difference in initial opinions based on backgrounds was remarkable.

A person with a vision science background initially thought the methods were completely straightforward: they were closely in line with decades of work in her own field of vision science in particular, and with the larger field of experimental psychology in general. Although the vision scientist could identify some minor quibbles, she was fully satisfied with the rigor. She was intrigued to see that the methods of vision science, which are typically directed at experiments with extremely simple stimuli, were successfully being applied to the more complex stimuli that are of interest in visualization.

In contrast, a person with a biomedical statistics background initially thought the methods were completely indefensible, with far too many variables under study to make any of the statistical inferences meaningful, and most importantly no discussion of the precision around the effect size estimates (usually represented as confidence intervals), nor enough contextualization of how to interpret those effect sizes; she mentioned, as an example, that the convention in her field was often odds ratios or hazard ratios. (I was well aware of confidence intervals, but I hadn’t heard of odds ratios. For a concise introduction to these ideas, see Explaining Odds Ratios by Szumilas.)
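To make the odds ratio idea concrete, here is a small worked example in Python. The counts and the scenario are entirely made up for illustration; they are not drawn from any of the papers under discussion.

```python
# A tiny worked example of an odds ratio, with made-up counts.
# 2x2 table: rows = exposed vs. unexposed, columns = outcome vs. no outcome.
exposed_with_outcome = 30
exposed_without_outcome = 70
unexposed_with_outcome = 10
unexposed_without_outcome = 90

odds_exposed = exposed_with_outcome / exposed_without_outcome        # 30/70
odds_unexposed = unexposed_with_outcome / unexposed_without_outcome  # 10/90

odds_ratio = odds_exposed / odds_unexposed
print(f"odds ratio = {odds_ratio:.2f}")  # ~3.86: roughly 4x the odds of the outcome with exposure
```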

The biostatistician had had this highly negative reaction to many of the papers she’d been reading in the visualization literature, and had been thinking long and hard for the past year about how to understand her misgivings at a deeper level than a first knee-jerk reaction of “they’re just ignorant of the methods of science”. She articulated several crucial points that have helped me think much more crisply about these questions.

There are several fundamental differences between the experimental methods used in vision science and the methods considered the gold standard in medical experiments that test the effectiveness of a particular drug for treating a disease: randomized controlled trials.

Two of the most crucial differences involve the ability to manipulate the experimental variables/factors, and the typical effect sizes.

First, in many medical contexts, some kinds of manipulation of experimental variables are off the table. In most typical clinical trials, repeated-measures designs, namely subjecting the same patient to multiple conditions sequentially, are rarely possible because of carryover effects: you can’t just give the same cancer patient 100 different cancer drugs, one after the other, because the effects will linger instead of stopping when the treatment stops. With great care, it’s sometimes possible to design “case-crossover” experiments for just two conditions, where for example two drugs are tested on the same person, but certainly it’s not possible to test many conditions on the same person. That’s why the common case is to design experiments with between-subjects comparisons, not within-subjects comparisons. Moreover, the trial lasts a long time: months or even years. Thus, the number of trials is typically equal to the number of participants. (Clarification: the word “trial” itself is used ambiguously between the two fields: in medicine it means the whole experiment (e.g. a drug trial or clinical trial), but in vision science it means just one stimulus/response episode. Reader comments led me to understand that I abruptly switched from the first sense to the second sense in the previous sentence!)

Second, when manipulating variables that affect human subjects, you also have to consider harm to the participant. In medicine, there are many situations where you either cannot manipulate a variable (you can’t retroactively expose somebody to asbestos 20 years ago in order to see how sick they are today, and you can’t just divide a set of people into two groups and give one of these groups brain cancer), or you should not manipulate it for ethical reasons (you shouldn’t deliberately expose somebody to a massive dose of radiation today to see how sick they get tomorrow). One response to this situation is to develop methods for “observational” (aka “correlational”) studies, rather than “experimental” studies where the experimenter has full control of the independent variable. For example, in one kind of retrospective observational study, a “cohort” is identified (a group that has been identified as having some property, such as exposure to an environmental toxin) and then it is compared to a similar group that hasn’t been exposed. Selecting appropriate participants for each of these groups is an extremely tricky problem, because of the possibility that the cohort also varies from the control group according to some confounding variable that has a stronger effect than the intended target of study.

What I used to think of as “experimental” studies turn out to be more properly called “quasi-experimental” methods, because the experimenter doesn’t have full control of the independent variable: they can’t tell people to smoke or not to smoke, but they can ask the people who already smoke to do something else. There’s still the extreme hazard of confounds: what if you divide people so that one group happens to have more heavy smokers than the other, or what if an underlying reason that people smoke is stress, so that you’re really measuring stress rather than the effects of smoking per se? The randomized controlled trials that are the gold standard of medicine are in this category. You can divide cancer patients into two groups, one that gets the experimental treatment and the control group that gets the placebo, and then analyze the differences in outcomes to try to uncover their linkage to the intervention. But you can’t control for how virulent a strain of cancer they have, because you didn’t give them cancer. And, as above, you can’t give the same patient both the experimental drug and the placebo.

(One good reference for all of this is the book “How to Design and Report Experiments” by Andy Field and Graham Hole, especially Section 3.2 on “Different Methods for Doing Research”.)

Above, I’ve been alluding to the other crucial aspect, effect size. The typical goal in medicine is to detect quite subtle effects, and thus experiments need to be designed for large statistical power in order to have a hope of detecting these effects.

In contrast, in vision science, life is very different: experimental trials are fast, independent, and harmless; frequently, effect sizes are big. First, trials are very short: just a few seconds in total, and the actual exposure to the visual stimulus is often much shorter than one second! Moreover, it’s straightforward to design experiments that preclude carryover effects when you’re testing a perceptual reaction to a visual stimulus instead of a physiological reaction to an experimental drug. Thus, it’s the extremely common case to run many trials with each participant: dozens, hundreds, or even thousands of trials per participant. When considering the statistical power of an experiment, the designer is concerned with the total number of trials, which is in the realm of hundreds or thousands. The number of participants is typically far, far smaller than in medical experiments, where in order to have thousands of trials you need thousands of participants. Also, in this domain, it’s not just feasible to design within-subjects experiments, it’s actively preferable whenever possible – because these designs provide greater statistical power for the same number of trials compared to between-subjects designs, since you can control for intersubject variability.
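As a rough illustration of that power difference, here is a minimal simulation sketch in Python. All of the numbers (effect size, between-subject variability, trial noise, group sizes) are invented purely for illustration; the point is only that when stable individual differences are large relative to the effect, a paired within-subjects comparison detects the effect far more often than an unpaired between-subjects one with the same number of measurements.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, effect, subject_sd, trial_sd = 20, 0.3, 1.0, 0.5  # invented values

def one_simulated_experiment():
    # Within-subjects: the same 20 people are measured in both conditions,
    # so each person's stable offset appears in both measurements and cancels out.
    offsets = rng.normal(0, subject_sd, n_per_group)
    cond_a = offsets + rng.normal(0, trial_sd, n_per_group)
    cond_b = offsets + effect + rng.normal(0, trial_sd, n_per_group)
    p_within = stats.ttest_rel(cond_a, cond_b).pvalue

    # Between-subjects: two separate groups of 20, each seen in one condition only.
    group_a = rng.normal(0, subject_sd, n_per_group) + rng.normal(0, trial_sd, n_per_group)
    group_b = rng.normal(0, subject_sd, n_per_group) + effect + rng.normal(0, trial_sd, n_per_group)
    p_between = stats.ttest_ind(group_a, group_b).pvalue

    return p_within < 0.05, p_between < 0.05

hits = np.array([one_simulated_experiment() for _ in range(2000)])
print("estimated power, within-subjects: ", hits[:, 0].mean())
print("estimated power, between-subjects:", hits[:, 1].mean())
```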

The combination of these two things — the ability to control for intersubject variability through within-subjects designs, and the ability to run many trials — means that there is not nearly so much concern for confounding variables based on splitting your subjects into groups improperly. One implication is that in this experimental paradigm, multi-factor / “factorial” designs are entirely practical and reasonable. That is, a single experiment can test more than one experimental variable, and each variable might be set to several values. For example, the visual stimuli shown to the participant might systematically vary according to multiple properties, resulting in many possibilities. Another implication is that “convenience sampling” is extremely common and does not require special justification, for example undergrads on campus or workers on Mechanical Turk.
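For concreteness, here is a small sketch of what such a factorial design looks like in code; the factor names and level counts are hypothetical, invented for illustration rather than taken from any of the papers under discussion.

```python
from itertools import product
import random

# Hypothetical factors and levels, purely for illustration.
factors = {
    "mark_size": ["small", "medium", "large"],
    "mark_color": ["red", "blue"],
    "layout": ["grid", "scatter"],
}

conditions = list(product(*factors.values()))     # 3 x 2 x 2 = 12 distinct conditions
repetitions_per_condition = 10

trials = conditions * repetitions_per_condition   # 120 trials per participant
random.shuffle(trials)                            # randomize presentation order
print(len(conditions), "conditions,", len(trials), "trials per participant")
```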

Moreover, it’s even possible to design between-subjects experiments with multi-factor designs, given a crucial assumption: that individual differences have a smaller effect size than the effect size that we’re trying to study. This assumption is reasonable because there’s a huge amount of evidence from decades of work in vision science that it’s true – and moreover you can test that assumption in your statistical analysis of the results. And this point brings me back to the concept of effect sizes as the second key difference between the methods of medical research and vision science. In medical research, individual difference effects (how virulent is your cancer) are usually enormous compared to the variable under study (does the drug help). In vision science, individual differences in low-level visual perception are typically very small compared to the variable under study (does the size of the dot on the screen affect your speed of detecting its color).

All of these points are part of the reason that work in vision science is scientifically valid: the methods are appropriate to the context – even though multi-factor testing with a small number of participants would be ridiculous in the very different context of medical drug trials.

Coming back to visualization, we’re in a context that’s very close to HCI (human-computer interaction) – and controlled laboratory experiments in HCI are a lot closer to vision science than to medicine. It’s common to use multi-factor designs and we run many trials on each participant. There is significant trickiness with carryover effects, typically more so than in vision science, and we often consider “learning effects” in particular as something that must be carefully controlled for in our designs. Our trial times are typically longer than in vision science, ranging from a minute to many minutes – but still far shorter than in medicine. There’s more to say here, but I’ll leave that discussion to another post because I have more ground to cover in this one.

Coming all the way back to the memorability papers and Few’s response to them, this analysis allowed me to interpret a comment from Few somewhat more charitably: his complaint in the response to the Mem13 paper about the demographics of Mechanical Turk not matching up with the population of the US. In the context of HCI research, it seems extremely naive, because there has been enough previous work establishing how to use MTurk in a way that replicates in-person lab experiments that most of us in the field consider it a settled issue. By considering it in the context of randomized drug trials, as I describe above, I can better understand why Few might have thought along these lines – and my discussion above also covers why his criticism is not valid in this context.

(Two of the most relevant papers are from Heer’s group: Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design by Heer and Bostock, from CHI 2010; Strategies for Crowdsourcing Social Data Analysis by Willett, Heer, and Agrawala, from CHI 2012.)

Again coming back to these papers, a contentious point in this whole debate is whether these experiments had sufficient statistical power to draw valid conclusions. Few has contended that the Mem15 paper can’t possibly be valid because there are too few participants. As above, I think this argument is missing the point that in this kind of experiment the power is more appropriately analyzed in terms of the number of trials.

I would certainly be happier with the Mem13 paper if it explicitly discussed confidence intervals and/or effect sizes, but it does not. That’s the common case right now: most papers in HCI and vis don’t, although a few do. I note that Stephen Few did specifically state that he’s critiquing the whole field through this paper as an exemplar, so saying “everybody does it” isn’t a good defense – that’s exactly his point!

Pierre Dragicevic has written extensively and eloquently about how HCI and Visualization as a community might achieve culture change on the question of how to do statistical analysis by emphasizing confidence intervals rather than just doing t-tests: that is, null-hypothesis significance testing (NHST). I do highly recommend his site http://www.aviz.fr/badstats. I also note that he gave a keynote on this very topic at the BELIV14 workshop, a sister event to InfoVis 2014, which sparked extensive discussion. This kind of attention and activity is one of the many reasons I don’t agree with Few’s characterization of the vis research community as being “complacent”.
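As a toy illustration of the contrast (using simulated data and invented condition names, nothing from the papers under discussion): a bare significance test reports only a p-value, while an interval estimate reports how big the difference is and how precisely it has been estimated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
time_a = rng.normal(10.0, 2.0, 40)   # invented completion times (seconds), condition A
time_b = rng.normal(11.2, 2.0, 40)   # invented completion times (seconds), condition B

# NHST view: a significance test yields a test statistic and a p-value.
t_stat, p_value = stats.ttest_ind(time_a, time_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Estimation view: report the mean difference and its 95% confidence interval,
# using the standard pooled-variance formula for two independent groups.
n_a, n_b = len(time_a), len(time_b)
diff = time_b.mean() - time_a.mean()
pooled_var = ((n_a - 1) * time_a.var(ddof=1) + (n_b - 1) * time_b.var(ddof=1)) / (n_a + n_b - 2)
se = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
ci_low, ci_high = stats.t.interval(0.95, n_a + n_b - 2, loc=diff, scale=se)
print(f"mean difference = {diff:.2f} s, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```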

(Dragicevic also contributed to the online discussion on Few15, with posts 6, 19, 40, and 45.)

The biostatistician in my group argued that even this culture change might not be the best end goal; she sees confidence intervals as just one mechanism towards a larger goal of using methods that take effect sizes as a central concern and report on them explicitly in the analysis. She points out that in the medical community there is the concept of levels of evidence: while randomized controlled trials are the gold standard in terms of being the highest level of evidence, they’re absolutely not the only way to do science. In fact, it’s well understood that studies yielding lower levels of evidence are exactly what’s required as steps along the way towards such a gold standard. They’re not invalid, and they’re not pseudo-science; they use different methods to achieve different goals. (For a concise introduction to these ideas, see The Levels of Evidence and their role in Evidence-Based Medicine by Burns, Rohrich, and Chung.)

The upshot is that I do think this question of statistical validity is complex and subtle, and that Few’s approach of just asserting “you’re not following the scientific method” is dramatically oversimplifying a complex reality in a way that’s not very productive. I hope that my analysis above starts to give some sense of the nuance here: the methods of science depend very much on the specific context of what is being studied. Yes, it’s true that we talk about “the” scientific method: observe, hypothesize, predict, test, analyze, model. But when we operationalize this very general idea, the much more interesting point is that there are many, many methods used in science. There is no single answer, and a lot of the training of a scientist involves learning when to use which method; and within every method are many smaller methods that require judgement, and so on – arguably it’s methods all the way down. Methods appropriate for medical drug trials aren’t even the same as those for epidemiology, much less for low-level perception as in vision science, or human behavior as in social science, or the complicated mix of low-level perception, mid-level cognition, and high-level decision making that is visualization.

Moreover, all of this discussion has just been about the relatively narrow question of controlled experiments featuring quantitative measurement! There’s an enormous field of qualitative research methods that are also extremely useful in the context of visualization. But that’s yet another blog post that I might get to in the future. I’ll stop here for now.

(Strikeout/italic edits finally added 24 October 2016, in response to reader comments from way back on 18 January 2016.)

3 comments

  1. ebertini

    Thanks a lot Tamara for writing this post! This reminded me of the amazing FiveThirtyEight piece “science isn’t broken”: http://fivethirtyeight.com/features/science-isnt-broken/. Did you ever see it? I loved it. I think there is a second layer of complexity. Even studies that are absolutely impeccable may very well not describe reality. We should restrain ourselves from taking the outcome of a study as the truth. It is much more of a dialogue than anything else.


  2. Hi Enrico – Aha, that is indeed a great post, thanks for the pointer!

    And yes, I agree completely there’s yet more complexity here. In this post I was focused on internal and construct validity, and didn’t even begin to talk about the problem of ecological validity – is there a match between what we are studying and what people actually do? Some future post, maybe…

    I also agree that a single study is never the final answer, it’s just one line of the conversation/dialogue.

    Back in the land of p-values, I also have several references I like in the further readings section of my grad vis course for the day we talk about validation, including a few nicely snarky titles (“Storks Deliver Babies (p=.008)”, “The Earth is Spherical (p<.05)”). For details see http://www.cs.ubc.ca/~tmm/courses/547-15/#chap4


  3. Thank you Tamara for this insightful post.

    I have the sense that clinical trials also differ in their goal: they try to find out what works, often irrespective of why. The reasons for this focus are easy to understand. In contrast, vision science tries to understand mechanisms, often with little concern for concrete applications. HCI and infovis stand somehow in-between, although they tend to focus more on what works, especially in HCI. There’s generally much less at stake than in medical research, so it’s less clear why.

    Confidence intervals (I now prefer the term “interval estimate” because it is inclusive of Bayesian methods) convey the uncertainty around effect sizes. Put differently, an interval estimate gives you a range of plausible effect sizes. So, far from precluding a focus on effect sizes, interval estimates encourage it. I use the term effect size in a broad sense. Whether to report simple or standardized effect sizes is another question.

    Many epistemologists agree that there is no such thing as the scientific method. I recommend Susan Haack’s essay:

    Haack, Susan. “Six signs of scientism.” Logos & Episteme 3.1 (2012): 75-95. https://pervegalit.files.wordpress.com/2011/03/haack-six-signs-of-scientism-october-17-2009.pdf

    On the problem of ecological validity and to what extent we should really care, this article has a very interesting position:

    Calder, Bobby J., Lynn W. Phillips, and Alice M. Tybout. “The concept of external validity.” Journal of Consumer Research (1982): 240-244. http://www.uta.edu/faculty/richarme/BSAD%206310/Readings/calder%20phillips%20tybout%201982%20Concept%20External%20Validity.pdf

    On the particular problem of sample generalizability and the validity of lab experiments, I warmly recommend this article (you can get the author’s version if you request it on Academia):

    Highhouse, Scott, and Jennifer Z. Gillespie. “Do samples really matter that much.” Statistical and methodological myths and urban legends: Doctrine, verity and fable in the organizational and social sciences (2009): 247-265. APA.

