Ten Tips For Doing Open Science

June 23, 2020

Originally published on Economics from the Top Down

Blair Fix

Science is the quintessential public good. It’s an iterative process in which new knowledge builds on previous knowledge. For this process to work, science needs to be ‘open’. Both the results and methods of scientific research need to be freely available for all. The open science movement is trying to make this a reality.

In this post, I share some tips for doing open science. If you’re an active researcher, these are tips you can use in your own research. If you’re not a researcher but are interested in science, these are tips that will help you judge the openness of scientific research.

The iceberg

Scientific research is like an iceberg. The published paper is the tip of the iceberg — a summary of the research you’ve done. But this summary is a small part of the actual research. Below the surface lurks the bulk of the iceberg — the data and analysis that underpin your results. For science to progress, fellow researchers need access to both the tip of the iceberg (your paper) and the subsurface bulk (the data and analysis).

Making your paper open access is the easy part of doing open science. First, make sure you upload the preprint to an online repository. Next, try to publish open access. If this is too expensive, you can always self-archive your published paper after the fact.

The more difficult part of doing open science is making your data and methods available to all. This takes effort. It means you have to design your research flow with sharing in mind. Think of the difference between how you plan and write a scientific paper versus how you plan and write a note to yourself. The note takes far less effort because it only needs to be intelligible to you. The scientific paper, in contrast, needs to be intelligible to your peers.

The same principle applies to publishing your data and analysis. If you want your methods to be intelligible to others, you need to plan your analysis just like you would plan your paper. It needs to be designed for sharing from the outset.

With that in mind, here are 10 tips for making this process as pain-free as possible.

1. Upload your data and analysis to a repository

How should you grant access to your data and analysis? One way is to do it manually. At the bottom of your paper you write: “Data and code are available by request”. If someone wants your supplementary material, they email you and you send it to them.

This is an easy way to share your analysis. The problem, though, is that you’re a scientist, not a data curator. Unless you’re diligent about preserving your work, it will get misplaced or lost over time.

Here’s a concrete example.

I’ve done a lot of research on hierarchy within firms. As part of this work, I’ve contacted fellow researchers and asked them to share their data and analysis with me. In one case, the researchers were happy to share their data … except they couldn’t find it! The work had been done in the 1990s. In the intervening 25 years, the original data had been lost. And I can’t blame these researchers. Do you know where your 1990s computer files are? I don’t.

The solution to this problem is to upload your data and analysis to an online repository. These repositories are professional curators, so the hope is that the data will be available long into the future.

There are now many scientific repositories (here’s a list). My favourite repository is the Open Science Framework (OSF). There are a few things I like about the OSF. First, as the name suggests, the Open Science Framework is committed to open science.

Second, the OSF has a great preprint service. You can write a paper, upload the preprint to the OSF, and then link this paper to your data repository. Here, for instance, is the preprint and supplementary materials for a recent paper of mine.

Third, the OSF has version control. So as your research evolves, you can upload revised versions of your paper and analysis. As you update things, the OSF keeps the older versions available.

Fourth, the OSF allows you to link projects. Suppose you’re working on a big research project that has many subprojects. The OSF allows you to link all these things together. For an example of this linking, check out the replication projects sponsored by the OSF. In these projects, many researchers work independently to replicate published research. When finished, the researchers upload their data and methods to the OSF and link it together in a single project. There are now replication projects in psychology, cancer biology, economics, and the social sciences.

Another advantage of uploading your materials to a repository like the OSF is that you get a DOI (Digital Object Identifier) for the resulting project. This means that your supplementary materials are citable. So if a researcher builds on your analysis, they can cite your supplementary work. If you put in the effort to do open science, you might as well get credit for it.

2. Link your paper and analysis from the beginning

Putting your data and analysis in a repository takes work. To make this process as pain-free as possible, I recommend linking your analysis and paper from the outset.

To frame this process, I’ll start by telling you what not to do. When I first started uploading supplementary materials, it was a lot of work. The problem was that I hadn’t integrated this step into my research flow. As I did my research, I would accumulate a hodgepodge of files and folders spread out on my computer. From this hodgepodge, I’d pull things together to write my paper. When I finished writing, I’d manually merge all the research into a file that I’d upload as supplementary material. As you can imagine, this was a pain.

The solution, I realized, was to organize my research into projects. Each project is associated with a paper (or potential paper) and is designed from the outset to be uploaded to the OSF. So when I start new research, I create a folder with the project name, and then put two subfolders inside it: Paper and Supplementary Materials. The Paper folder contains the manuscript in progress. The Supplementary Materials folder contains all the data and analysis.

Having the manuscript and analysis linked from the start is helpful because the act of writing inevitably leads to new analysis. As a novice researcher, I was always surprised by this. By now I expect it. When you write up your ideas, you get new ones that lead to new analysis. I like to keep track of these changes, and I do so by archiving the manuscript and analysis together. Each version of the manuscript has a corresponding version of the analysis that goes with it. When I finish the project (and paper), everything goes on the OSF with no extra work.

3. Make a data analysis pipeline

For your Supplementary Materials to be useful, they need to be organized in a coherent way. I recommend creating a data analysis pipeline so that other researchers (and of course you) can follow your analysis.

Any type of analysis is going to have three basic steps:

  1. Clean the data
  2. Analyze the data
  3. Display the results in a table or figure

To make these steps coherent, I recommend dumping Excel and switching to a coding language like R or python. If you’re a committed Excel user, I understand your reluctance to leave it behind. As a novice researcher, I fell in love with the simplicity of Excel. I could churn out simple analysis in a few minutes, and plot the results on the same page.

But as my research progressed, my love of Excel waned. As I started to analyze data with millions of entries, using Excel became painful. Try sorting 10 million lines of data in Excel and see how your computer reacts. Faced with frequent frozen screens of death, I switched to R and never looked back.

But even if you’re working with small data sets, I still recommend dumping Excel, because it’s hard to build a data analysis ‘pipeline’ in it. A good pipeline needs to be automated. This way, when your data changes, you can easily update your results. The problem with Excel is that it mixes raw data, data cleaning, and analysis. When you use R (or comparable software), you can keep these steps separate and automate them.

There are many ways to create a data pipeline, but here’s what I do. Inside my Supplementary Materials folder, I have a folder called Data in which I keep all the raw data for my analysis. I generally use a separate subfolder for each data set.

Suppose I work with data from the World Inequality Database (WID). Inside my Data folder, I put a subfolder called WID that contains the raw data from WID. If I need to clean the data, I make an R script called clean.R and put it in the WID folder. This script would output cleaned data called WID_clean.csv. Then I make an R script called analysis.R that manipulates this cleaned data and outputs the results as WID_results.csv. At the end of my data pipeline, I plot figures and create tables. I like to put all of my figure scripts in one folder called figures.
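The same structure carries over to any language. Here’s a minimal Python sketch of the clean → analyze steps (the column names and toy data are invented for illustration; in my own work these would be the clean.R and analysis.R scripts above):

```python
def clean(raw_rows):
    """Cleaning step: drop rows with missing values (stand-in for clean.R)."""
    return [row for row in raw_rows if all(row.values())]

def analyze(rows):
    """Analysis step: compute a summary statistic (stand-in for analysis.R)."""
    incomes = [float(row["income"]) for row in rows]
    return {"mean_income": sum(incomes) / len(incomes)}

# Toy rows standing in for the raw data files
raw = [
    {"country": "CA", "income": "50000"},
    {"country": "US", "income": ""},  # missing value, dropped by clean()
    {"country": "FR", "income": "40000"},
]

results = analyze(clean(raw))
print(results)  # {'mean_income': 45000.0}
```

The point is the separation: cleaning and analysis are distinct, automated steps, each reading the previous step’s output.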

At the end of a large research project, I usually have dozens of scripts in my data pipeline. It can get a bit unwieldy to keep track of all this code, which brings me to my next tip.

4. Make a RUN ALL script

As your analysis gets more complicated, it becomes difficult to keep track of your data pipeline. For me, this came to a head when I was working on research that linked many different types of data. When I updated one part of the analysis, I’d have to manually update all the other parts. I’d go to each script in the pipeline and press RUN. This was annoying, not to mention difficult. I can’t tell you how many times I started writing up my results, only to find that a data update had failed to pass through the whole pipeline.

The solution is to create a RUN ALL script that executes your entire data pipeline. The RUN ALL script is useful for a few reasons. First, it’s a table of contents for your data pipeline. If you comment your RUN ALL code well, another researcher can read it and understand your data pipeline.

Second, your RUN ALL script helps you keep track of what you’ve done. You can see, in one file, how all the parts of your data pipeline fit together.

Third, your RUN ALL script automates your analysis, making updates easy. Change some of your analysis? Update some of the data? No problem. Just run your RUN ALL script and everything gets updated.

Lastly, the RUN ALL script makes it easy to debug your data pipeline. Just run the script and see where you get errors.

I usually write my RUN ALL script in Linux Bash, but that’s just a personal preference. Any language will do.
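The idea is simple in any language. Here’s a sketch in Python (the script names are invented, and `runner` would be `["Rscript"]` for an R pipeline):

```python
import subprocess
import sys

def run_pipeline(scripts, runner):
    """Run each script in the pipeline, in order, stopping at the first
    failure. Returns the list of scripts that completed successfully."""
    done = []
    for script in scripts:
        print(f"Running {script} ...")
        result = subprocess.run(runner + [script])
        if result.returncode != 0:
            print(f"Pipeline failed at {script}", file=sys.stderr)
            break
        done.append(script)
    return done

# Hypothetical pipeline -- substitute your own scripts:
# run_pipeline(
#     ["Data/WID/clean.R", "Data/WID/analysis.R", "figures/figures.R"],
#     runner=["Rscript"],
# )
```

Stopping at the first failure means a broken data update can’t silently propagate partway through the pipeline.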

5. Use relative file paths

This is a technical point, but one that’s important when it comes to running your code on other computers. In your data pipeline, you’re going to tell the computer to look in many different directories. The quick and easy way to do this is to specify an absolute file path. For instance, if I had a folder called project on my desktop, the absolute path would be: /home/blair/Desktop/project.

When I first started making data pipelines, I’d use absolute pathways because they’re easy. Just copy the path from your file browser into your code and you’re done. The problem with doing this is that it means you can’t move your project. If, for instance, I moved my project folder off the desktop, all the code would break. That’s annoying for me. It also means that the code won’t work on anyone else’s computer, since their absolute file paths will always differ from mine.

The solution is to use relative file paths. You write your scripts so that the file paths are specified relative to the project folder. This way, no matter where your project folder lives (on your computer or anyone else’s) your code will work.

In R, I use the here package to deal with relative file paths. The command here() gets the file path for the R script you’re running. You can then specify any directory changes relative to this file path.
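Python has no built-in equivalent of here(), but pathlib gives the same effect. A sketch (the project_path name and its base override are my own convention, not a standard API):

```python
from pathlib import Path

def project_path(*parts, base=None):
    """Build a path relative to the project root -- by default, the
    directory containing this script. A rough Python analogue of R's
    here::here(). The `base` argument lets you override the root
    explicitly."""
    root = Path(base) if base is not None else Path(__file__).resolve().parent
    return root.joinpath(*parts)

# e.g. project_path("Data", "WID", "WID_clean.csv") resolves correctly
# wherever the project folder lives.
```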

6. Automate your data downloads

Let’s face it, doing science can be slow. I can’t tell you how many times I’ve done some analysis, got a result, and then put this into a mental folder called “think about it later”. Two years down the road, I return to this research. But now the data is two years out of date and I need to update it.

Enter automated data downloads. If you’re working with an online database, I suggest including the data download in your data pipeline. So if I was working with data from the World Inequality Database (WID), I’d write a little script to download the WID data. When you do this, updating your research is a breeze.
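A sketch of this step in Python (the URL below is a placeholder, not WID’s real download endpoint; check the database’s documentation for how it serves data):

```python
import os
import urllib.request

def fetch(url, dest):
    """Download `url` to `dest` as the first step of the pipeline.
    Reuses a cached copy if one already exists; delete `dest` to
    force a fresh download."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest

# Hypothetical usage at the top of the pipeline:
# fetch("https://example.org/wid_data.csv", "Data/WID/wid_raw.csv")
```

The caching check matters: without it, every pipeline run hammers the data provider’s server.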

When you return to the analysis two years later, you’ll automatically have the most up-to-date data. Moreover, suppose you publish the results and put the analysis in an online repository. Twenty years down the road, another researcher could rerun your analysis and see if the results hold for the twenty years since you published your paper.

7. Use open source software

This should go without saying, but if you want your research to be open, you need to use open source tools. Suppose, for instance, that you do all of your analysis using proprietary software like SAS, SPSS, Stata or Statistica. If another researcher wants to run your code, they have to pay for this software. Some of it (SAS, for instance) can be pricey. In contrast, open source tools like R and python are free.

Open source software also tends to have a stronger online community. If you have a coding problem with R or python, it’s almost guaranteed that someone else has had the same problem, and that there’s a solution on Stack Exchange. This will help you build your data pipeline. It will also help other researchers understand your code.

8. Comment your code

Most scientists (me included) have no formal training in coding. This means we commit the cardinal sin of commenting our code poorly.

When I write code for data analysis, my first impulse is to write it quickly with no comments. I want to see the results now, damn it! I have no time for comments! As I’ve matured, I’ve learned to resist this urge. The problem is that code without comments is unintelligible. I would discover this when I’d return to the analysis a few months later and have no idea what my code did.

The solution is to comment your code. Comments tell the reader what your code does. The best comments are usually high level. Don’t tell the reader that “this is a loop”. Write a comment that tells the reader what your code does and what the major steps are. Assume that the reader understands the basic syntax of the language.
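For example, in Python (a toy calculation; the point is the comment style, not the deflation itself):

```python
# Bad comment -- restates the syntax:
#   "loop over the years and divide"
#
# Good comment -- states the purpose of the step:

def deflate(incomes, cpi, base=100.0):
    """Convert nominal incomes to constant dollars using a CPI series,
    dropping years where the CPI is missing."""
    return {
        year: incomes[year] * base / cpi[year]
        for year in incomes
        if year in cpi
    }
```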

For more information about how to document your code, check out Ten simple rules for documenting scientific software.

9. Automate your statistical tables

I suggest making statistical tables a part of your data pipeline. This may seem obvious, but it’s something I’ve only recently introduced into my data pipeline.

I use the R package xtable to generate tables. I do all of my analysis in R, and then have xtable output the finished table of statistics. This takes some effort to learn, but it will save you time in the long run. When you’re writing a paper, you inevitably update your analysis. I used to then cut and paste the updated statistics into my tables by hand. Now I have everything automated. When I change my analysis, all the tables get updated automatically. In terms of open science, this automation also makes it clear to other researchers how you derived your published statistics.
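If you work in Python instead, the same idea is easy to sketch by hand (a bare-bones stand-in for xtable, which offers much more):

```python
def latex_table(rows, headers):
    """Render result rows as a minimal LaTeX tabular. Rerunning the
    pipeline regenerates the table, so published statistics always
    match the current analysis."""
    cols = "l" * len(headers)
    lines = [f"\\begin{{tabular}}{{{cols}}}"]
    lines.append(" & ".join(headers) + r" \\ \hline")
    for row in rows:
        lines.append(" & ".join(str(x) for x in row) + r" \\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)
```

A call like latex_table([["Gini", 0.42]], ["Statistic", "Value"]) (with made-up numbers here) returns a tabular block you can \input into your manuscript.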

10. Use version control

Version control is something that all trained programmers use. It’s a way of backing up your code and tracking changes over time.

You should use version control for many reasons. The most obvious is that when you break your code, you need an easy way to revert to the last working version. Maybe you tweak your data pipeline, but in the process, you break it horribly. The results are gibberish. What do you do? If you haven’t used version control, you’re out of luck. You’ll have to rewrite your code (by memory) back to the last working version.

If you’ve used version control, reverting is easy. You literally click ‘revert’. There are many tools for doing this. The most popular is git. I use git through the GitKraken GUI.

Using version control means you periodically ‘commit’ your code to a repository (either on your computer, or online to something like GitHub). Each time you commit, git will track the changes to your code. This could be code for a single script, or the code for your entire data pipeline.

Version control helps save you from disaster. It also provides a simple way to keep track of how your data pipeline has changed over time.

Learn open science by doing it

In this post, I’ve shared tips for doing open science that I’ve learned over the last few years. Some of these tips I learned by looking at other people’s work. But many of them I learned by trial and error.

If you’re interested in opening up your research methods, the best way to learn is to just do it. Plan to publish the data pipeline for your next paper. Hopefully the tips I’ve shared will help you. But I’m sure you’ll come up with tips of your own. When you do, please share them. Let’s make science open!