How can we improve the way we use digital tracking data to answer important social science questions?
Researchers at the RECSM-UPF research centre have developed a new Total Error Framework (TEM) for online behavioural data, which improves the way social scientists use this type of big data. Applying this innovative tool to the TRI-POL international project, led by UPF, allowed Oriol Bosch Jover and Melanie Revilla to detect various errors, improve the project's design, and thus establish a catalogue of practical recommendations.
Given the widespread adoption of the Internet, measuring what people do and consume online is crucial in almost all areas of social science research. For example, what are the negative consequences of disinformation and fake news, and how can we minimize them? Thanks to the adoption of big data and data science methods, in recent years the use of online behavioural data (otherwise known as web tracking or digital tracking) has become popular to directly and objectively measure what citizens do when they are connected. In general, these data are collected from a sample of participants who, on their devices, voluntarily install or set up technologies to track the digital traces left, such as information about the websites and applications visited.
Despite the great opportunities that these data offer to study a multitude of social phenomena, until now not enough attention has been paid to the errors that arise when using this method, and the results obtained have often been accepted uncritically. This is problematic, since the potential errors in these data could be distorting the conclusions and policy decisions based on them.
“This new framework that we have designed can help improve the quality of research produced using online behavioural data, as well as foster an understanding of how and when this type of data can be combined with survey data”
To overcome this shortcoming, researchers from the Research and Expertise Centre for Survey Methodology (RECSM-UPF), linked to the UPF Department of Political and Social Sciences, have designed a Total Error Framework (TEM) for online behavioural data. This type of framework, traditionally applied in survey design, allows any researcher to better understand the process to follow to collect, process and analyse online behavioural data, while providing them with the tools needed to identify, prevent, and mitigate the various errors that can bias the conclusions drawn.
To validate the usefulness of this tool beyond the theory, the study's researchers used the TEM to design the online data collection of a pioneering international project coordinated by Pompeu Fabra University (TRI-POL), which allowed them to improve the quality of the data collected and the transparency with which unavoidable errors are communicated.
RECSM-UPF members and authors of the research (published in the journal of the Royal Statistical Society) Oriol Bosch Jover, a predoctoral researcher linked to the London School of Economics and Political Science (LSE), and Melanie Revilla, a senior researcher at the Barcelona Institute for International Studies (IBEI), affirm that “this new framework that we have designed can help improve the quality of research produced using online behavioural data, as well as foster an understanding of how and when this type of data can be combined with survey data”.
Application of the TEM framework to a case study: the TRI-POL international project
To illustrate how the TEM framework can help plan online behavioural data collection and minimize errors, the researchers used a case study: the Triangle of Polarisation, Political Trust and Political Communication (TRI-POL) project. It is an international, interuniversity project, led by Mariano Torcal, UPF full professor of Political Science and director of RECSM-UPF, in which Oriol Bosch Jover and Melanie Revilla also participate.
TRI-POL represents the first project in the international sphere to combine data on online behaviour and surveys of the same individuals
TRI-POL represents the first project in the international sphere to combine data on online behaviour and surveys of the same individuals in order, among other objectives, to understand whether and how online behaviours are related to affective polarization in several countries of southern Europe and Latin America. In addition, the TRI-POL project is a pioneer in opening up access to and allowing the free use of all data collected, including online behavioural data, which is generally difficult for most researchers to access. “Using the TEM framework, TRI-POL is the first project designed to recognize errors in online behavioural data, with established strategies to minimize, quantify and disclose these errors. This could help establish quality standards to follow for future research projects based on web tracking data”, the researchers state.
Errors detected by the TEM and their causes
Web tracking data offer many advantages. They are objectively measured: there is no need to trust that individuals will remember what they did online. They are very granular, allowing the collection of more information than would be possible through surveys. And they are collected in real time, enabling the analysis of external shocks that the researchers had not planned to measure.
However, researchers using these data typically need to design their data collection strategies in the dark: “Researchers cannot recognize and report errors they encounter without a clear understanding of what these errors are and how to identify them. This is precisely the void we are filling with our TEM framework”, Oriol Bosch Jover and Melanie Revilla assert.
“Researchers cannot recognize and report errors they encounter without a clear understanding of what these errors are and how to identify them. This is precisely the void we are filling with our TEM”
The TEM shows that the conceptualization of errors in online behavioural data is very similar to what we find for survey data: “The sample from which we obtain the data must be a good representation of the population, and the behaviours that are observed based on the monitoring technologies must represent the real behaviour of the individuals participating in the study. If this does not happen, the data are skewed”, they clarify.
However, the reasons why these data can be biased are new and as yet largely unexplored. For example, a key step in any web tracking project is to make sure that participants are tracked across all the devices they use to connect. If this is not achieved, the researchers will miss part of what people do online, leading to potential biases.
With this in mind, researchers need to clearly define which devices must be tracked and try to maximize coverage. If this is not possible or is beyond their control, they should collect ancillary information to assess the proportion of participants affected by so-called “undercoverage” (not being tracked on all devices used to connect) and report it when it occurs.
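As a minimal sketch of how such an undercoverage check might work, the snippet below compares the devices each participant reports using against the devices actually tracked. The data structure and field names are illustrative assumptions, not taken from the TRI-POL project.

```python
# Hypothetical sketch: estimating the "undercoverage" rate -- the share of
# participants who are not tracked on every device they use to go online.
# The participant records and field names are illustrative only.

participants = [
    {"id": 1, "devices_used": {"pc", "phone"}, "devices_tracked": {"pc", "phone"}},
    {"id": 2, "devices_used": {"pc", "phone", "tablet"}, "devices_tracked": {"phone"}},
    {"id": 3, "devices_used": {"phone"}, "devices_tracked": {"phone"}},
]

def undercovered(p):
    """A participant is undercovered if any device they use goes untracked."""
    return not p["devices_used"] <= p["devices_tracked"]

rate = sum(undercovered(p) for p in participants) / len(participants)
print(f"Undercoverage rate: {rate:.0%}")  # participant 2 is missed on two devices
```

Reporting this rate alongside the results, as the framework recommends, lets readers judge how much of participants' online activity the tracked data can plausibly cover.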
Recommendations and best practices
Based on these errors, the authors propose a series of recommendations and best practices for anyone using web tracking data: clearly define the list of traces (e.g., URLs) to be used to create the variables for analysis; consider the limitations of the tracking technologies used and how they can introduce biases; clearly define the devices to be tracked and seek to maximize their coverage.
Other recommendations are to keep in mind that tracked devices can be used by third parties (for example, family members who read newspapers with completely opposite ideologies), and finally, develop strategies to minimize and correct the errors that can arise when extracting and transforming data (for example, when identifying whether a news item contains disinformation or not), a process that is often done using complex algorithms.
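To make the first recommendation concrete, here is a minimal sketch of turning raw traces (visited URLs) into an analysis variable by matching them against a predefined domain list. The domain list and trace data are invented for illustration; real projects would use far larger lists and more robust matching.

```python
# Hypothetical sketch: building an analysis variable (count of news visits)
# from raw URL traces, given a predefined list of domains of interest.
from urllib.parse import urlparse

NEWS_DOMAINS = {"elpais.com", "lavanguardia.com", "clarin.com"}  # illustrative

def count_news_visits(urls):
    """Count visits whose host matches the predefined news-domain list."""
    count = 0
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Strip a leading "www." so www.elpais.com matches elpais.com
        host = host.removeprefix("www.")
        if host in NEWS_DOMAINS:
            count += 1
    return count

traces = [
    "https://www.elpais.com/politica/articulo.html",
    "https://example.org/blog",
    "https://clarin.com/mundo/nota",
]
print(count_news_visits(traces))  # 2 of the 3 traces match the list
```

Defining the trace list explicitly, before data collection, is what makes omissions visible: any domain not on the list is an error source that can be documented rather than silently ignored.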
“Online behavioural data may be the future of social science research, but we need greater transparency and better design and data-analysis practices, similar to those applied to surveys. We must work with great care and transparency”, the researchers conclude.
Reference work: Bosch, O.J. & Revilla, M. (2022) “When survey science met webtracking: Presenting an error framework for metered data”. Journal of the Royal Statistical Society: Series A (Statistics in Society).