Why should academics care about data engineering principles?

Published: July 24, 2023

What is data engineering?

In an era where many fields in academia are grappling with high-profile cases of faked data (Nelson, Simmons, and Simonsohn 2023) and with pressure to publish quickly for professional advancement, many researchers worry about being part of a team where someone, themselves included, makes a mistake that goes unnoticed during the data cleaning and analysis stages. Meanwhile, as high-profile examples mount of individuals' personal data being accessed without authorization or sold by the companies behind the hardware and software we use every day, the public, academic institutions, and governments are, understandably, increasingly concerned about data privacy and are imposing stricter standards to keep data secure (see Atske 2019).

As a result, quantitative researchers should take extra care to stay current with the tools and practices that help meet these two goals: keeping data secure and keeping analyses free of unnoticed mistakes. Data engineers are the professionals who develop and use tools and practices designed for exactly this. Outside of academia, there is high demand for people familiar with best practices for storing data safely and reproducibly (Molander 2021). As data collection booms and demand grows for employees who can analyze data, so too does the demand to ensure that an organization's data are protected, reliable, and easily accessible.

Though quantitative researchers in academia are constantly collecting and analyzing data, there is little discussion of the concepts and tools central to data engineering. Yet these tools and principles increase the safety of our subjects (particularly when we study human subjects and need to protect their identities), make our data more accessible to ourselves and to others, and lower barriers to reproducibility. Graduate students receive extensive training and warnings about the dangers of mishandling data, but a discussion of data engineering rarely accompanies them.

Of course, academics do not have to know everything. Given the many obligations in academia today, the rising number of research tools, the growth of our datasets in both variables and observations, competition for jobs, and the demands on our time, we need to take stock of what matters to us and invest in learning tools accordingly. Ensuring that our data are accessible and reproducible, and that automation protects us from easy mistakes, should be one such priority for a quantitative researcher.

Data engineering is its own massive and complex field, after all. Academics should not feel pressure to learn heavy-duty tools like Apache Spark or PySpark. However, academic researchers should familiarize themselves with principles that encourage more careful thinking about their data management processes and try to adopt tools designed for those processes. Many researchers are already familiar with some of these principles, but learning the jargon of data engineering helps organize our thinking about data management and makes it clearer which common tools fit which step of the process we are currently in. Learning the jargon is not an end in itself; it makes it easier to adopt the new tools that data engineers introduce.

What are some common principles?

One of the more accessible concepts in data engineering is ETL, the Extract-Transform-Load pipeline. It describes a task that we as academics already know well. Extract refers to collecting and accessing raw data. Transform refers to turning that raw data (stored in what is called a data lake) into something usable for analysis and storing the result in a data warehouse. Once the data have been transformed into something usable, we (and others) should be able to Load them for analysis.

Outside of academia, data engineers are responsible for creating, developing, and maintaining this process so that an organization's analysts can access the data quickly without having to do these steps themselves (let alone do them every time they want to dip into the data).

Let me attempt to illustrate the ETL process. A data engineer at Spotify has a large mound of messy, raw data, perhaps stored in JSON format (if you have ever seen JSON data, it is awful to look at and to parse), that is collected and tossed into Spotify's data lake. The data engineer's job is to parse that JSON into something more readable and analyzable and put it in a data warehouse, such as a repository of CSV files or (more commonly) a database accessed with SQL queries. At this step, the data engineer writes code that "streams" data as it comes in from listeners and converts it to a tabular (row and column) format to be placed in the data warehouse.
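To make the Transform step concrete, here is a minimal sketch in R of flattening nested JSON records into a tabular format. The field names are hypothetical and are not Spotify's actual schema.

```r
# A minimal sketch of the Transform step: flattening nested JSON records
# into a row-and-column format. The field names are hypothetical.
library(jsonlite)
library(dplyr)

raw_json <- '[
  {"user_id": "u001", "track": {"id": "t123", "ms_played": 183000}},
  {"user_id": "u002", "track": {"id": "t456", "ms_played": 20500}}
]'

# fromJSON() with flatten = TRUE turns the nested "track" object into
# ordinary columns (track.id, track.ms_played), which we then rename.
listens <- fromJSON(raw_json, flatten = TRUE) |>
  rename(track_id = track.id, ms_played = track.ms_played)

listens
```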

As you can probably guess, this is a massive amount of data for a company like Spotify. Oftentimes they have to use tools like MongoDB and Hadoop rather than MySQL. Since academics rarely work with that much data in a given project, learning and using expensive, cloud-based tools like MongoDB and Hadoop is not necessary. What is more relevant to academics is that users and regulators have expectations about how user data are protected. When streaming data from users, engineers do not put everything from the JSON file into the same table. That is, there is no single CSV file or database table that contains all of the data in one place. Some of the data go into a financial table or CSV file, such as the user's credit card and subscription information, while other parts go into a separate table or CSV file that stores the user's listening habits. Further, when working with these data, the engineers are not manually sorting through millions of rows. They write code that automates all of these steps, so that if the organization's needs change regarding what data to collect or how to store it, the code automating these steps is all that needs to change.

Another important concept is that these database tables or CSV files remain connected, so that if we need to look at information about a user across the tables, we can still do so. For example, each of these tables or CSV files stores a unique user id column. Having a user id column in common among them allows us to link records across them if we need to. The tables are disaggregated, however, so that not everything about the user sits in one place. Connected tables like these are called relational. These relational tables live in a data warehouse (a central place) that gives us quick and easy access to all of them when needed. So that analysts and data scientists at Spotify know how the tables are connected, the data engineer constructs what is called a schema; that is, they document that the unique user id is the key for linking user information across the tables.
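As a toy illustration of what "relational" means in practice, the sketch below builds two small tables in R that share a user id and links them only when needed. The table and column names are invented for the example.

```r
# Two toy relational tables that share a unique key (user_id).
# Table and column names are invented for illustration.
library(dplyr)

billing <- tibble::tibble(
  user_id = c("u001", "u002"),
  plan    = c("premium", "free")
)

listening <- tibble::tibble(
  user_id  = c("u001", "u001", "u002"),
  track_id = c("t123", "t456", "t123")
)

# Because both tables carry user_id, we can link them on demand,
# without ever storing billing and listening data in the same file.
left_join(listening, billing, by = "user_id")
```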

This way, when a data scientist needs training data for a new music recommendation model, there is documentation on how to access user information (even when it spans multiple tables). This makes it easier for the data scientist to understand the underlying infrastructure while allowing the data engineer to protect the data (from a potential hack or, more likely, a potential error in code).

How does it benefit the open scientist in academia?

Many quantitative researchers interact with human-subjects data every day. One of the main requirements from our Institutional Review Boards is that we are legally and ethically responsible for protecting the identities of our subjects. An additional expectation is that our analysis process is reproducible, from the point of data collection to the tables and figures in our manuscripts. To protect the identities of respondents, one thing researchers often do is simply delete the identifying column from the CSV file they download from their Qualtrics or MTurk account after the study has completed.

One key problem with that is that you are deleting data, and deleting it in a way that you cannot reproduce later. While deleting that column may seem inconsequential in most circumstances, it can lead to a number of issues down the road. What if you or a team member deletes the wrong column? What if you later need to reconsider the inclusion or exclusion of a particular set of subjects? With those data deleted, you no longer can; your only option is to go back to your data lake and re-clean the data, this time without deleting the identifying column. Either way, it slows the process down and makes mistakes more likely.

The ETL framework, with its data lakes and data warehouses, bakes improvements to data privacy into the workflow. While data engineers take many other steps to secure their data, simply adopting the ETL framework already provides privacy advantages, because it necessarily decentralizes our data. Just as a data engineer pulls data from Spotify's data lake and places it into separate but related tables in a data warehouse, we can (and should) do the same. This makes it harder for external actors to access all of the data for our study while preserving the information that lets us course correct if we have made a mistake. Decentralizing the data also means that anyone looking at it needs background information to understand how the pieces relate to one another, which is good for privacy.

A side effect of this decentralization is reproducibility. Creating and sharing a schema (even a simple Google Doc shared with your lab members) that details how a particular study's data warehouse is structured makes it easier for co-authors and lab members to access data quickly while keeping the data decentralized in case any one piece is compromised. Furthermore, when the paper is published, rather than releasing all of the data, you can share parts of your data warehouse and keep the sensitive parts inaccessible to anyone who is not an investigator on the project.

How can academics implement this in their workflow?

Academics need not learn all the fancy (and expensive) cloud computing tools or several coding languages to fold these techniques into a reproducible and secure workflow. Thankfully, the ETL framework does not require any particular set of tools to be part of one's data management workflow.

Academics can integrate the ETL framework into their workflow without too many headaches. You first Extract your data from a data lake: the usual places you go to download your study's data, whether Qualtrics, Amazon's MTurk, a hard drive, or elsewhere. This is a task we all perform already. Next, you can sort the raw data into multiple CSV files that share a common unique identifying column. While cleaning the data for analysis, you Transform them, either by merging data from the related tables on the unique identifier or simply by pulling the table with the columns you need and cleaning those. As you clean, document the changes you make. This should leave you with a data warehouse of multiple CSV files containing your clean data. If a mistake is made, the documentation from the Transform stage lets you reconstruct the data warehouse.
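A minimal sketch in R of this CSV-based workflow might look like the following. The file names and column names are hypothetical stand-ins for a typical survey export.

```r
# Sketch of the CSV-based ETL workflow; file and column names are
# hypothetical stand-ins for a typical survey export.
library(dplyr)
library(readr)

# Extract: read the raw export from the data lake (e.g., Qualtrics).
raw <- read_csv("raw_qualtrics_export.csv")

# Transform: split identifying information and substantive responses
# into separate files that share a respondent_id column.
raw |>
  select(respondent_id, name, email, ip_address) |>
  write_csv("warehouse/identifiers.csv")

raw |>
  select(respondent_id, starts_with("q_")) |>
  write_csv("warehouse/responses.csv")

# Load: analysis scripts read only the de-identified responses; the
# identifiers file can be joined back in by respondent_id if needed.
responses <- read_csv("warehouse/responses.csv")
```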

This way of thinking about data management significantly improves reproducibility and data security. It provides a structured, identifiable set of distinct steps in the data management process, and those steps require neither learning a new tool nor adding much extra work. Even so, there are tools specifically designed to handle these steps quite efficiently. The tools I discuss below are all relatively low cost in that they are open source (no software license required), well documented, and relatively easy to learn.

A case for using a database rather than CSV files

While you have increased the security and reproducibility of your data by separating it into multiple CSV files and documenting those steps, this is still not as safe or as reproducible as it can be.

One additional security and reproducibility measure is to put your study's data in a database file rather than a series of CSV files. There are a number of tools for this, such as MySQL, SQLite, PostgreSQL, or DuckDB, a newer option for working with very low-overhead database files stored locally on your machine rather than on a cloud service.

A database file requires you to connect to it and execute a query in SQL to extract data. If you use R or Python, a number of packages make it easy to connect to the database and pull your data directly in your analysis script, so it adds only a few extra steps. Furthermore, most of the basic SQL that academics would need is quite simple and has inspired packages (such as dplyr) and commands (such as select and joins) that academics already use for data management.
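To give a sense of how little SQL a typical analysis script needs, here is a sketch using the DBI and duckdb packages in R. The database file, table names, and columns are hypothetical.

```r
# A sketch of querying a local database from R with DBI. The database
# file, table names, and columns are hypothetical.
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "study.db")

# A select plus a join covers most of what an analysis script needs.
dat <- dbGetQuery(con, "
  SELECT r.respondent_id, r.q_trust, d.age
  FROM responses AS r
  JOIN demographics AS d USING (respondent_id)
")

# The dplyr/dbplyr equivalent, if you prefer not to write SQL:
# tbl(con, "responses") |>
#   inner_join(tbl(con, "demographics"), by = "respondent_id") |>
#   select(respondent_id, q_trust, age) |>
#   collect()

dbDisconnect(con, shutdown = TRUE)
```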

In terms of security, requiring a query to access and view the data stored in the file makes meaningful unauthorized access much harder. Queries are commands in SQL that return only the data requested, so anyone who wants to view the data must be familiar with the query language and must also understand how the tables in the database file are structured and organized. Compare that with someone simply finding the CSV file(s), opening them, and viewing the contents. In addition, some database tools are cloud-based and offer the option of requiring security keys and passwords just to connect to the database in the first place.

The second benefit of a database file that requires SQL queries is increased reproducibility. One way it does this is by requiring a query to access and edit the data, which significantly reduces the chances of a mistake, especially an undocumented one such as accidentally hitting a key while scrolling through the data. The other way is that it pushes you to document how the database is organized. The same structure that makes the data hard to view without authorization means that, with good documentation shared among those who are meant to have access, reading and editing the database is straightforward and well documented. This is a major advantage, especially as more and more stories emerge about how easy it is to make honest mistakes when working with data (Ingraham 2021).

An illustration of DuckDB in R

DuckDB is an extremely useful tool. It not only implements SQL-based database files that you can store locally on your computer without paying for cloud access, it is also extremely computationally efficient (Raasveldt and Muehleisen 2023). As many of us deal with ever-larger datasets in our analyses, this is a huge advantage. I recommend reading the DuckDB documentation to see all the ways it fits nicely into the ETL framework while staying computationally efficient.

For an academic researcher, the idea is to create separate tables within a single .db file on your computer. You can do this by first downloading the raw data from your data lake, converting that file into a .db file, deleting the .csv, .xlsx, or .dta file of your raw data, and then cleaning and later analyzing the data directly from the .db file. DuckDB integrates fully with both R and Python. You can easily write a simple SQL query to load the data, clean it as you normally would with your preferred libraries (dplyr, data.table, pandas, polars), and then store the cleaned data back into tables in the .db file. Those more comfortable with SQL can write queries that do the cleaning inside the .db file itself rather than loading the data into an R or Python session and saving over the uncleaned data.
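Below is a minimal sketch of that workflow using the duckdb R package. The file, table, and column names are hypothetical, and the cleaning step is just a placeholder for whatever your study requires.

```r
# Sketch of the DuckDB workflow in R. File, table, and column names
# are hypothetical placeholders.
library(DBI)
library(duckdb)
library(dplyr)

# Create (or open) a local database file for the study.
con <- dbConnect(duckdb::duckdb(), dbdir = "study.db")

# Extract: pull the raw export straight into a table inside the .db
# file using DuckDB's built-in CSV reader. After this, the original
# raw_qualtrics_export.csv can be deleted.
dbExecute(con, "
  CREATE TABLE raw_responses AS
  SELECT * FROM read_csv_auto('raw_qualtrics_export.csv')
")

# Transform: clean with the tools you already use (here, dplyr), then
# write the cleaned data back into the database as its own table.
clean <- dbReadTable(con, "raw_responses") |>
  filter(!is.na(respondent_id)) |>
  mutate(q_trust = as.integer(q_trust))

dbWriteTable(con, "clean_responses", clean, overwrite = TRUE)

# Load: later analysis scripts query only the columns they need.
analysis_dat <- dbGetQuery(con, "SELECT respondent_id, q_trust FROM clean_responses")

dbDisconnect(con, shutdown = TRUE)
```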

Figure 1: ETL with CSV files
Figure 2: ETL with DuckDB files

Figure 1 and Figure 2 provide a visual representation of what integrating the ETL process looks like for academics. The workflow summarized in the figures adds only one or two extra steps and requires learning only a command or two in SQL, yet it makes dramatic improvements to the security and reproducibility of one's data. Given how simple these extra steps are and how much they provide, I think quantitative researchers could substantially improve their data management practices by adopting them.

Concluding thoughts

Many quantitative researchers strive to spend as little time as possible on data collection and data cleaning. After all, these steps rarely show up in publications and are sandwiched between the more exciting parts of a manuscript's evolution. As we turn inward to think about how to reduce mistakes, how to make credible promises to our institutions and research participants that we are protecting their privacy, and how to better document the decisions we make over the course of a project, introducing a principled and consistent framework for managing our data helps bridge these (often competing) motivations.

References

Atske, Sara. 2019. "Americans and Privacy: Concerned, Confused and Feeling Lack of Control over Their Personal Information." Pew Research Center: Internet, Science & Tech. November 15, 2019. https://www.pewresearch.org/internet/2019/11/15/americans-and-privacy-concerned-confused-and-feeling-lack-of-control-over-their-personal-information/.
Ingraham, Christopher. 2021. "An Alarming Number of Scientific Papers Contain Excel Errors." Washington Post, November. https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-alarming-number-of-scientific-papers-contain-excel-errors/.
Molander, Oliver. 2021. "The Rise and Future of Data Engineering — What's It All About?" Validio. December 6, 2021. https://medium.com/validio/the-rise-and-future-of-data-engineering-whats-it-all-about-a235dbb4412e.
Nelson, Leif, Joe Simmons, and Uri Simonsohn. 2023. "[109] Data Falsificada (Part 1): 'Clusterfake'." Data Colada. June 17, 2023. https://datacolada.org/109.
Raasveldt, Mark, and Hannes Muehleisen. 2023. "DuckDB." https://github.com/duckdb/duckdb.