Design of experiments with users: lessons learned

Author: Mari-Carmen Marcos (Universitat Pompeu Fabra)

Citation: Marcos, Mari-Carmen (2013). "Design of experiments with users: lessons learned". Hipertext.net, 11, http://www.upf.edu/hipertextnet/en/numero-11/experiments_users.html

Abstract: One of the most common methods in research is the observation of subjects in a laboratory. The discipline known as HCI (Human-Computer Interaction) studies the way people interact with technology, its purpose being to adapt the interfaces of devices to the way we naturally relate to them. HCI researchers have very different backgrounds: some come from Computer Science, others from Information Science, others from Journalism. Experimentation as a working method is quite unusual in those disciplines, which means these researchers have had to learn it and overcome some setbacks along the way. This article provides some guidelines on how to conceive an experiment, and shares some cases in which planning or preparation mistakes were made.

Keywords: Experimental Design. Human-Computer Interaction. User Testing. Social Sciences


Table of contents:

1. Researching into Human-Computer Interaction
2. Planning a user test
3. Preparing the testing sessions
4. Lessons learned
Acknowledgements
References

1. Researching into Human-Computer Interaction

The discipline known as Human-Computer Interaction (HCI) is about the interaction of people with technology. Its aim is to adapt the design of the technology to its users. This kind of design is known as user-centered design (UCD). Although HCI was originally focused mostly on computers, nowadays it has a much wider reach and is applied to any technology.

HCI is interdisciplinary and draws on various disciplines. Those most closely related are Psychology (to understand the behavior of people), Design (to design interfaces) and Computer Science (to develop the products). HCI is studied in universities and research centers, and it is also practiced as a profession. Those who have taken up HCI professionally are known as Interaction Designers or User Experience Designers.

The business world is aware of the need to have products that satisfy the needs and expectations of their target audiences. Specifically, organizations which base a significant amount of their business on their websites often conduct tests involving users, or outsource them to usability consulting firms. There are numerous works that explain how to plan, prepare, execute, analyze and present results from studies in which users participate. This literature comes not only from HCI researchers but also from usability professionals with wide experience in this kind of study. Three works are particularly recommended: Dumas and Redish (1999, chapters 7 to 17), Tullis and Albert (2008, chapter 3) and Barnum (2011, chapters 5 and 6).

Having read those who have been working in HCI for many years, and considering our own experience as researchers, we conclude that when a product is planned and designed with users in mind it is essential to observe those people, and also to interview them, to learn how they behave with the product and the reasons for their behavior. Observations can be conducted at three different moments of the process:

  1. Before starting the design. In these cases the user is observed interacting with similar products, or in situations in which he or she could be using the product that is about to be designed. For example, if we are to design an ATM for a bank, we might want to observe the bank's clients interacting with the current ATM; if the ATM is to be installed for a company that does not yet have its own ATM, we might observe clients interacting with the staff at the bank to learn about their needs and interactions in a real context.
  2. During the design and prototyping stage. Although asking a user to perform tasks on an unfinished system can be difficult, if a prototype is available it can provide important clues for staying on the right path. With these tests it is possible to detect whether the architecture of the menus is unclear, whether the vocabulary used is not understood, whether what we designed as one type of element is confused with another, and so on.
  3. Once the product already exists. This is the most common test in usability studies, and sometimes the hardest, because the results obtained might lead to a full redesign of the product, which involves aspects that are difficult to change in an organization, such as database design, programming code and so on. It is called “user testing”, and it is the kind of test we will be referring to from now on.

There are several reasons to undertake HCI research. Industry calls it “user research”, usually referring to very pragmatic studies applied to a very specific problem. For instance, it is often necessary to know which design option is the most usable among many: which one users find easiest to understand, which one they learn most quickly, which one they handle with the most ease or which one they like the most. In the academic world, HCI studies have less pragmatic and wider goals, whose results might be applied to several environments. These studies aim to explore and better understand an issue, a problem or a situation by observing the behavior of users with a technology.

There are several ways to classify the strategies and methodologies applied in research. For the disciplines that study people's behavior (behavioral sciences), McGrath’s diagram (1995) shows how to tackle a study considering several classification facets:

  1. experimental, theoretical,  survey and field strategies
  2. concrete and abstract strategies
  3. obtrusive and unobtrusive strategies
  4. strategies aimed at generalizability, precision or realism

User tests conducted in HCI research would be considered laboratory experiments. Thus, according to McGrath’s diagram (1995), they apply an experimental, concrete, obtrusive research strategy aimed at precise results.

Figure 1. Diagram of the different kinds of research strategies according to McGrath (1995, p. 158).

The most common classifications of research methods frequently rely on the dichotomy between qualitative and quantitative. McGrath’s diagram does not include this distinction, because the dichotomy emphasizes not the research strategy to be adopted but the kind of data the researcher wants to obtain from it. Every strategy can yield both qualitative and quantitative data, depending on the object of study in each piece of research. It is not true that in the Social Sciences, and specifically in the disciplines that study people's behavior, only one kind of data can be obtained. Both kinds of data are always present. Whether the study focuses on one or the other, according to its planned goals, is a separate question.

This article aims to share the lessons learned in several experiments in which users were involved in a laboratory environment. Although five stages can be distinguished (planning, preparation, execution, analysis and presentation of results), this article focuses on the first two, since they are essential to executing the project properly, allowing for a good analysis and for valid results and conclusions.

2. Planning a user test

Planning is a key stage of experimental research. The decisions taken at this stage affect the results obtained. Moreover, they determine what kind of results can be obtained at all, because this is when it is decided which data are to be collected, and in what ways.

Throughout our studies, we have learned about the importance of thoroughly planning everything there is to be done in tests. Several tasks are involved in planning, which might be divided as follows:

  1. Definition of the study. A paragraph must suffice to summarize what is to be studied (object of study), with what purpose (goal), what is expected from it (hypothesis) or what we are asking ourselves to start with (research question), and why it is important to tackle this research (justification).
  2. Previous work. Often there are several previous studies that might be useful as a starting point for a new one. These studies might have been conducted by the same group of researchers, by other groups or by the industry; they might involve experiments with participant users or other techniques such as heuristic evaluation or surveys. If there are no previous studies, it might be worth applying some other methodology before planning a user test, in order to have information from which to conceive the study.
  3. Observation and monitoring techniques. Depending on the goals pursued, different observation techniques will be needed. It might be useful to record the session with video cameras, or to capture what is happening on the screen of the tested device. It might also be useful to have several observers without the user realizing that several people are present during the test, which requires a one-way mirror (a window that is glass on one side and a mirror on the other).
  4. Testing place. Depending on the needs of each study, different kinds of laboratories will be considered. It must be a room in which the device to be tested can be installed. Sometimes a more specific infrastructure is required, depending on the observation and monitoring techniques defined, with recording systems, a one-way mirror and so on.
  5. Users. Regarding the participants in the tests, some decisions must be taken with respect to:
    1. Number of participants. A rough estimate might be made now, but to decide on this number the researcher must consider how many variables are going to be in the design of the experiment. More variables mean a larger sample is needed.
    2. Profile(s) of the participants.
    3. Questions used to filter participants (screener).
    4. Recruitment system.
    5. Managing no-shows: anticipating what to do if a user misses an appointment.
    6. Testing calendar.
    7. Incentive for the participants.
  6. Calendar. Once the number of users to be summoned and the availability of the testing room are determined, a calendar of appointments can be drawn up for the tests, as well as a calendar for the stages of analysis and presentation of results. A Gantt chart might be useful for time management.
  7. Human resources. Depending on the goals, the kind of test, the number of users summoned and the expected duration of the study, an estimate is made of the number of people and the profiles needed for the study. The role of each person is also noted down.
  8. Budget for the project. Considering all the previous data, an estimate is made of the budget needed to carry out the test.

3. Preparing the testing sessions

Between planning the study and executing the tests there is a necessary stage: the preparation of the test. It is at this stage that planning has to turn into getting everything ready for the moment the users come to our testing room. We highlight a set of aspects to take into consideration when preparing the test.

Preparing the object of study. If a website is to be studied, it has to be prepared, for instance with an offline version if we want to avoid connection problems. If two versions of the same website are to be tested, both have to be prepared, and it must be determined which users are going to see each version.

Deciding which data are going to be collected. Qualitative data are often collected in usability studies, such as the comments of the user while performing the tasks, the expression on his or her face, what he or she says in the final interview and so on. In parallel, quantitative measurements are taken to assess effectiveness (whether the user successfully finishes the task), efficiency (the time the user took to perform a task, the number of problems he or she met or the effort required) and satisfaction (whether he or she liked the product or system). If a monitoring system for the mouse or the gaze is being used, other usual metrics are the number of participants clicking on a particular element, the number of gaze fixations on a particular area, how long it takes for the user to fixate on a particular area, and so on. Depending on the kind of study conducted, the measures to be taken have to be specified, and from them the variables are determined. There are two kinds of variables:

  • Independent: variables that the researchers control or select, for instance the traits of the participants (age and gender) or aspects of the object of study (such as the areas of the screen to be analyzed, the kind of information shown to each group of users, and so on). These variables determine the kind of statistical analysis to be conducted once all the data have been collected.
  • Dependent: the variables we measure, that is, what we want to find out, for instance the percentage of people finishing a task successfully, the time that elapses between the moment an element appears on the screen and the first look users take at it, or the percentage of clicks on an element. It is important to define these variables properly, because they determine which data have to be collected. Once the tests have been conducted, data that were not collected can no longer be obtained.
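As an illustration of how the dependent variables described above might be derived from what is recorded in each session, the following is a minimal sketch in Python. The record structure, the field names (`completed`, `seconds`, `satisfaction`), the 1 to 5 rating scale and the sample values are all assumptions made for this example; they do not come from any of the studies discussed in this article.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRecord:
    """One participant's attempt at one task (hypothetical structure)."""
    participant: str
    task: str
    completed: bool      # did the participant finish the task successfully?
    seconds: float       # time spent on the task
    satisfaction: int    # post-task rating, assumed here to be on a 1-5 scale


def summarize(records):
    """Compute completion rate, mean time on completed tasks and mean rating."""
    if not records:
        raise ValueError("no records to summarize")
    completed = [r for r in records if r.completed]
    return {
        "completion_rate": len(completed) / len(records),
        # Time is averaged over successful attempts only (a design choice).
        "mean_seconds_completed": (
            mean(r.seconds for r in completed) if completed else None
        ),
        "mean_satisfaction": mean(r.satisfaction for r in records),
    }


# Invented example data for a single task.
records = [
    TaskRecord("P1", "buy_ticket", True, 95.0, 4),
    TaskRecord("P2", "buy_ticket", False, 180.0, 2),
    TaskRecord("P3", "buy_ticket", True, 105.0, 5),
    TaskRecord("P4", "buy_ticket", True, 100.0, 4),
]
summary = summarize(records)
print(summary)
```

The point of the sketch is that each measure can only be computed if the corresponding field was recorded during the session, which is why the variables have to be defined before testing starts.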

Design of the test tasks. The user will perform a series of tasks guided by us. These tasks must be thoroughly prepared, because the goal is to observe how the user interacts with the system to carry them out. Sometimes what matters is not whether the user finishes the task correctly, but the path he or she follows in trying to solve it.

  • Drafting tasks and scenarios. Tasks are drafted in a script, and they are preferably set in a scenario. This scenario is a context, a situation, that helps the user to understand the task better. For instance, instead of giving a concise task such as “buy a Barcelona-Madrid train ticket for July the 7th”, the user is given a scenario like this: “On July the 7th you have been called to a meeting in Madrid at noon, followed by a lunch, after which you can come back to Barcelona; the company is paying for your ticket, but they want you to buy it first and they will pay you back afterwards. Choose your preferred schedule and buy the ticket with this Visa card we provide you with”.
  • Task zero. Besides preparing the tasks we are interested in for the study, it is often convenient for the user to become familiar with the product first. When the researcher sees fit, a “zero” task might be prepared for this purpose. This task does not have to be measured.

Design of the experiment

  • Randomness in the tasks. Whenever possible, the order of the tasks performed by the users has to be rotated so that uncertainty, nervousness and familiarity do not systematically affect the same tasks. While performing the first tasks of the test the user tends to be more insecure, and the feeling goes away as he or she advances through the tasks.
  • Randomness in the variables. If any of the variables in the object of study takes several values, the order in which those values appear must be varied so that their position on the screen or their order of appearance does not create a bias. For instance, if we test reading efficiency with different font sizes on a page and we provide text in size 10, size 14 and size 20, their order of presentation should rotate so that position does not alter the way of reading; thus, different users will see each size in a different position. The more values a variable has, the more presentation variants there are, which means increasing the sample of users so that we have enough in each designed version.
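The rotation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and participant identifiers are hypothetical, and simple cyclic rotation is one possible counterbalancing scheme, not necessarily the one used in the studies reported here.

```python
def rotated_orders(conditions):
    """Return the cyclic rotations of a list of conditions.

    Across the rotations, every condition appears exactly once in
    every position (first, second, third, ...).
    """
    n = len(conditions)
    return [conditions[i:] + conditions[:i] for i in range(n)]


def assign_orders(participants, conditions):
    """Hand out the rotated orders to participants in turn."""
    orders = rotated_orders(conditions)
    return {p: orders[i % len(orders)] for i, p in enumerate(participants)}


# Font sizes taken from the example in the text; participant IDs are invented.
font_sizes = ["size 10", "size 14", "size 20"]
participants = ["P1", "P2", "P3", "P4", "P5", "P6"]

plan = assign_orders(participants, font_sizes)
for participant, order in plan.items():
    print(participant, order)
```

With six participants and three sizes, each size is shown twice in each position. Note that cyclic rotation balances position only; it does not balance which condition immediately precedes which, so a study concerned with carry-over effects would need a different scheme and, as the text points out, more users.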

Practical matters. There are many details that, although minor compared to the design of tasks, variables and so on, are essential to conducting the tests: for instance, keeping track of the participating users, noting down their contact data and the day and time of their appointment, preparing the consent form they have to sign, having the gifts ready to hand over on the day of the test, fitting out the room with everything that is needed (devices, cables, office supplies, etc.), printing the tasks so that the user always has them at hand, and so on.

Pilot test. The preparation stage ends with a pilot test. Actually, several pilot tests are recommended. One or two tests with people close to us, such as colleagues or relatives, might be useful to check that the tasks are understandable, that they can be performed within the scheduled time and that everything works as expected. Another pilot test with a real user is also recommended, to test the test in the most realistic way possible. These tests allow for early detection of things that might be failing. It is worth investing time in them, because they can save us from mistakes that would otherwise cost us some test sessions.

4. Lessons learned

As previously argued, test planning and test preparation are two fundamental stages that have to be properly completed before testing starts. The factors involved in the whole process are varied, and sometimes they make us change the course of the study or even halt it in order to redesign it. In this section we share experiences from our studies with users over recent years, related to the stages of planning and preparation.

Human resources
The first case is about planning human resources for the project. In small research projects with few or no resources for hiring, as in our case, we do not always have enough people to test as many users as needed. In this situation, we have sought the collaboration of master's or PhD students as laboratory assistants. After some hours of training, they have managed to take on the testing sessions, with the limitation that one of the researchers has to be permanently reachable should any problem arise.

Selection of the sample of participants
The selection of the sample tends to be another problematic aspect. In a study we are conducting, our goal is to compare the behavior of users when making online searches (Marcos, García-Gavilanes, Bataineh, Pasarín, 2013). This study is being conducted by our UPF team and, in parallel, by another research team at Zayed University, Dubai (United Arab Emirates). The problem the Dubai researchers had when recruiting users is that their faculty only accepts female students, so they had more trouble finding men. In the end, their sample had a significantly higher number of women than men, so the results obtained when analyzing the data would not be representative of the population. This lack of initial foresight led us to widen the sample of Dubai users in order to collect more data and obtain reliable results, which has delayed the time frame stipulated in the calendar.

Figure 2. Heat maps and laboratory in a study conducted in parallel in Barcelona and Dubai

Testing room
In a study we conducted about the behavior of users with TV Conectada (Connected TV; Mansilla, Marcos 2013; Marcos, Mansilla 2013) we expected to use the television sets on the UPF campus. The space would be decorated as a living room, even with furniture, so that the user would have an experience closer to the one at home despite being in a laboratory. The problem, which luckily was detected beforehand, was that this area of the university, which happens to be a basement, does not receive the TV antenna signal that is essential to conduct the study. The campus does not have an aerial TV antenna, since it uses IP television. An antenna therefore had to be installed in the testing area, and the basement was not a viable place for it, since it could not pick up the signal there. Given these problems, we fitted out a meeting room where we installed the devices to be tested and the antenna, but we were surprised to find that the signal was very weak in the new location. We tried another side of the building, where the antenna finally picked up the signal correctly. We reinstalled the devices and some furniture there and managed to run the tests in the best way possible.

 Figure 3. Participant and moderator in the testing laboratory for the TV Conectada study

Object of study
In the study with Zayed University we also detected a problem, this time when preparing the test. The faculty we collaborated with teaches in English, so the English level of its students is high and they have no problems with written comprehension. Since the Barcelona team was designing the study for both universities, it was considered advisable to design the pages to be tested in two languages: Spanish for the Barcelona users and English for the Dubai users. When analyzing the results, very significant differences were observed between the search behaviors of people in the two countries, but we are unable to determine how much of this is due to the fact that some users were reading in their native language while others were reading in a non-native one. We are therefore repeating these tests in Barcelona in English, with other users, and in the coming months we will do the same in Arabic with the Dubai participants.

Number of variables
Another example related to the preparation of the test has to do with the definition of the variables. In this same study, when comparing Barcelona users with Dubai users, a long list of independent variables was introduced, meaning variables controlled by us, the researchers. The goal was to study visual behavior on a Google results page. To do so, the data provided by the eye tracker were analyzed according to these independent variables:

  1. The cultural context of the users: Western or from the Middle East.
  2. The kind of results that they would see in the search engine: the results themselves or the advertisements.
  3. The ranking of the results: they could be in any position between the first and the tenth result, or in the top zone or on the right side, those being the zones occupied by Google advertisements.
  4. The relevance of each result for the question they were being asked.
  5. The subject matter of the question asked.
  6. Whether the user found a good answer to the question, meaning success in the task performed.

By defining this many variables the goal was to cover every factor that might influence the behavior of the users, but having so many factors had two consequences. On the one hand, when dividing the sample by each factor we were left with very few users to compare, too few to reach statistical significance. For instance, we wanted to know whether the subject matter influenced their behavior, but we only had a third of the sample for each subject matter, meaning about 20 users out of 60. On the other hand, and this is the most relevant point, we lost sight of the research question, which was whether there were cultural differences in information search. Thus, we learned that although there are many factors which might influence results, it is important to focus on the variables that are closely related to the research question. As a secondary step, we might run analyses on other variables if we detect that any of them might be influencing the results.
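The arithmetic behind the first consequence can be made explicit with a small Python sketch. It assumes, purely for illustration, that the sample is split evenly across the levels of each between-subjects factor; the function name is hypothetical, and the second call (crossing subject matter with two cultural contexts) is an assumed scenario rather than a figure reported from the study.

```python
from math import prod


def users_per_cell(sample_size, levels_per_factor):
    """Average number of users in each combination of factor levels,
    assuming the sample is divided evenly among all combinations."""
    return sample_size / prod(levels_per_factor)


# One factor with three levels (the three subject matters in the text).
by_subject = users_per_cell(60, [3])

# Hypothetical: crossing subject matter (3) with cultural context (2).
by_subject_and_culture = users_per_cell(60, [3, 2])

print(by_subject)               # 20.0
print(by_subject_and_culture)   # 10.0
```

Each additional factor multiplies the number of cells, so the users available per comparison shrink quickly unless the sample grows accordingly.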

Randomness in the order of the tasks
In a study on website legibility that we are conducting, we selected a series of Wikipedia articles and modified the font size and line spacing in each of them. We had several font sizes and several line spacings, and one article for each font-line combination. Luckily, we had only conducted 10 tests when we realized the importance of changing the order of the articles from one participant to another. Otherwise, due to the insecurity of not knowing what was going to happen next, users might take more time with the first article than with the following ones. To achieve this randomness we designed as many article orders as possible and tested a few users with each one. It does not matter that the sample for each order is small, since in the results we pooled what we obtained across the different orders for each font-line pair. The key was to give the research the randomness needed to avoid bias related to the order in which the articles were read.

Randomness in the order of presentation of the variables
Earlier we mentioned a legibility study involving different font sizes (Rello, Marcos 2012). In this study we introduced more variables, among them the contrast between font and background. When designing the experiment we did not take into account the position in which each sentence was presented, so when we analyzed the results doubts arose as to whether people tend to read the sentences at the beginning of a paragraph more intensely and those at the end less so. Because of these doubts, we redesigned the experiment, introducing variability in the position of the sentences, and repeated the tests with new users.

Figure 4. “Separation among paragraphs” variable in a screen legibility study (Rello and Marcos 2012)

We have learned from these mistakes, and we wanted to share them so that they might be useful to others when planning and preparing tests with users. Only good planning can help to foresee possible problems, so that the study does not come to a halt once it is already under way and enough data have been collected by the time the analysis stage is reached.

Acknowledgments

Thanks to all the users who participated in these tests, sometimes with no reward other than our smile. Thanks also to the students who have participated as researchers in these projects; their doubts made me reflect and reconsider the study on more than one occasion. Special thanks to the colleagues from whom I have learned the most regarding the design of experimental studies: Ricardo Baeza-Yates, Ioannis Arapakis and Luz Rello.

References

Barnum, Carol (2011). Usability Testing Essentials: Ready, Set… Test! Morgan Kaufmann.

Dumas, Joseph and Redish, Janice (1999). A Practical Guide to Usability Testing. Norwood: Ablex Publishing Corporation.

Mansilla, Verónica; Marcos, Mari-Carmen (2013). “User experience en Televisión Conectada: un estudio con usuarios”. El Profesional de la Información. March-April 2013. In press.

Marcos, Mari-Carmen; Garcia-Gavilanes, Ruth; Bataineh, Emad; Pasarin, Lara (2013). “Using Eye Tracking to Identify Cultural Differences in Information Seeking Behavior”. Article accepted for the workshop “Many People, Many Eyes” to be held at the CHI 2013 congress (Paris 2013). Unpublished research.

Marcos, Mari-Carmen; Mansilla, Verónica (2013). “Video on Demand: Usability challenges for Connected TV”. Article accepted for the workshop “Exploring and Enhancing the User Experience for TV” to be held at the CHI 2013 congress (Paris 2013). Unpublished research.

McGrath, Joseph (1995). “Methodology Matters: Doing Research in the Behavioral and Social Sciences”. In Baecker, R. M.; Grudin, J.; Buxton, W. A. S.; Greenberg, S. (eds.). Readings in Human-Computer Interaction: Toward the Year 2000, 152-169.

Rello, Luz; Marcos, Mari-Carmen (2012). “An Eye Tracking Study on Text Customization for User Performance and Preference”. LA-Web 2012 (Cartagena, Colombia, October 2012).

Tullis, Tom; Albert, Bill (2008). Measuring the User Experience: Collecting, Analyzing, and Presenting Usability Metrics. Morgan Kaufmann.

Last updated 21-05-2013
© Universitat Pompeu Fabra, Barcelona