
Reproducibility I: Software Environments

Learning Objectives

After this lesson, you should be able to:

  • Understand the value of reproducible computing
  • Know the challenges of reproducible computing
  • Define a computing environment
  • Share a software environment with a colleague
  • Set up a software project with an environment




Reproducible Scientific Computing


Defining Reproducibility

"Reproducing the result of a computation means running the same software on the same input data and obtaining the same results." Rougier et al. 2016

"Getting someone else's code to run on my computer" - Anonymous




As the graphic below suggests, reproducibility is a spectrum of sharing behaviors.

[Figure: the spectrum of reproducibility, Peng 2011]





Interactive (i.e., point-and-click) Computing


Definition

Manually navigating a mouse across a graphical user interface (GUI) and running commands by selecting from menu options.

Advantages

  • Intuitive and easy to navigate a GUI and click buttons

Limitations

  • It can be slow to sequence through hundreds of clicks to accomplish an analysis.
  • Less reproducible: a click-by-click tutorial is cumbersome to write and to follow





Scripted Computing


Definition

Removing the GUI and instead instructing the computer to run a series of custom commands using a scripting/coding language.

We are automating what used to take many manual clicks.

We can write scripts to install software, clean data, run analyses, and generate figures.


Advantages

  • Much faster to run through commands
  • The script runs identically every time, reducing the human element
  • Easy for someone else to quickly reproduce the exact analysis and result
  • Enables analysis tasks to scale up


Challenges

  • Requires deeper computer knowledge
  • More upfront effort to produce the script



Discussion Question

What are some tasks you have automated or want to automate?

  • Have you ever successfully automated a task?
  • Found a way to make something scale or take less time?
  • What was the task, and how did you do it?
  • Are there any things you wish you could automate?
  • What are some barriers to automating them?





Scripting Languages

The two most common open-source scripting languages (for science) are Python and R.


Both languages consist of base software (Python Standard Library or R Base Package) and MANY additional packages that can be downloaded and installed for increased capabilities.





Software Installation

When you download and install software onto your computer, it is typically installed into one of a set of specific directories that are listed in what we call the System Path.

System Path

In the context of computing, the system path, often referred to simply as PATH, is the set of directories in which the operating system looks for executable files when a command is issued.

When you go to launch an application by clicking on a desktop icon or with a CLI command, the computer will search for the application within the PATH directories. If it finds the executable, it will launch.

Find the PATH on your computer

In Linux and Mac Terminal

echo $PATH


In Windows Command Prompt (in PowerShell, use echo $env:PATH instead)

echo %PATH%




Nice and Short Video Describing the PATH.



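The PATH prefers one version of any given software: if several versions of a program are installed in different PATH directories, the first one found is the one that runs.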
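A minimal Python sketch of how a PATH lookup picks one version: directories are searched in order and the first match wins. The function find_on_path and the tool name mytool are hypothetical names made up for this illustration, and the lookup is a simplified, Unix-style one (Python's own shutil.which handles more cases, such as Windows file extensions).

```python
import os
import stat
import tempfile


def find_on_path(command, path=None):
    """Return the first executable named `command` in the PATH directories, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, command)
        # The search stops at the first match, so earlier directories win.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def make_fake_tool(directory, name="mytool"):
    """Create a tiny executable file (a hypothetical tool, for illustration only)."""
    tool = os.path.join(directory, name)
    with open(tool, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(tool, os.stat(tool).st_mode | stat.S_IXUSR)
    return tool


# Two directories that each hold a different "version" of the same tool.
old_dir = tempfile.mkdtemp(prefix="old_version_")
new_dir = tempfile.mkdtemp(prefix="new_version_")
old_tool = make_fake_tool(old_dir)
new_tool = make_fake_tool(new_dir)

# Whichever directory is listed first determines which version is found.
old_first = os.pathsep.join([old_dir, new_dir])
new_first = os.pathsep.join([new_dir, old_dir])
print(find_on_path("mytool", old_first))
print(find_on_path("mytool", new_first))
```

Swapping the order of the two directories changes which copy is found, which is why updating or reordering what is on the PATH can change the software a script ends up using.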





Computing Environment

A computing environment is the combination of hardware, software, and network resources that provide the infrastructure for computing operations and user interactions.

  • Hardware: CPUs, GPUs, RAM
  • Operating system & version: many flavors of Linux, MacOS, Windows
  • Software versions: R, Python, etc.
  • Package versions: specific R or Python packages, which often depend on other packages

Python Package Dependency



The scripts you create:

  • Were designed to work in your specific computing environment
  • May not work on your computer in the future, because your computing environment will probably change (e.g., updated software versions)
  • May not work on someone else's computer because their computing environment is different
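One way to see what a script's computing environment actually consists of is to record it. The sketch below uses only the Python standard library; the function name describe_environment and the dictionary keys are choices made for this illustration, not part of any tool mentioned in this lesson.

```python
import platform
import sys
from importlib import metadata


def describe_environment():
    """Collect the parts of the computing environment a script depends on."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip malformed installs that have no name
            packages[name] = dist.version
    return {
        "machine": platform.machine(),
        "operating_system": platform.system(),
        "os_release": platform.release(),
        "python_version": platform.python_version(),
        "python_executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


env = describe_environment()
print(env["operating_system"], env["os_release"], env["machine"])
print("Python", env["python_version"], "at", env["python_executable"])
print(len(env["packages"]), "packages installed")
```

Running this on two different computers, or on the same computer a year apart, will usually give different answers, which is the reason a script that works today may not work elsewhere or later.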





Software Dependency Hell

Sometimes, it can be nearly impossible to get your computing environment correct enough to run someone else's code.

This can be caused by incorrect versions of the packages you are using or of their dependencies.

Updating software installed in the system path - to make new code work - can break old code!




Environment Managers

One solution to software dependency hell is to use an Environment Manager.

An environment manager allows you to create software installation directories (similar to PATH) that are isolated from your computer's PATH. You can create unique environments and install specific software versions to run specific scripts.
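To make the idea of an isolated installation directory concrete, here is a small sketch that uses Python's built-in venv module as a stand-in for an environment manager. venv is not one of the tools covered in this lesson and is used only because it ships with Python; the project directory is a throwaway temporary location chosen for the example.

```python
import os
import tempfile
import venv

# Create a throwaway project directory (hypothetical location, for illustration).
project_dir = tempfile.mkdtemp(prefix="demo_project_")
env_dir = os.path.join(project_dir, ".venv")

# with_pip=False keeps the example fast and offline; a real environment would
# normally include pip so that packages can be installed into it.
venv.create(env_dir, with_pip=False)

# The environment has its own executable directory, separate from the system one.
bin_dir = os.path.join(env_dir, "Scripts" if os.name == "nt" else "bin")
print("Isolated executables live in:", bin_dir)
print(sorted(os.listdir(bin_dir)))

# Activating an environment essentially puts this directory first on the PATH,
# so its software is found before anything installed system-wide.
isolated_path = os.pathsep.join([bin_dir, os.environ.get("PATH", "")])
print(isolated_path.split(os.pathsep)[0])
```

Each project can get its own directory like this one, so installing or upgrading software for one project leaves the others untouched.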


Conda - Open Source Environment Manager

Conda is a popular and open source environment manager tool that can be installed on any operating system (Windows, MacOS, Linux).

  • Users can create environments that have their own set of packages, dependencies, and even their own version of Python.
  • Projects can have their own specific requirements without interfering with each other
  • It allows for consistent and reproducible results across different systems and setups




Renv

  • R package that allows you to create unique environments for an R project




Package Managers

A package manager is a software tool that finds, downloads, and installs software packages into the PATH or a virtual environment.

Conda

Software: Python, R, Django, Celery, PostgreSQL, nginx, Node.js, Java programs, C and C++, Perl, and command line tools

Repository: Conda-Forge.


Pip

Software: Python

Repository: PyPI

Note: Pip can be used together with the Conda environment manager.


R

With the R language, a package manager is built directly into the R Base Package.

install.packages('ggplot2')

Repository: Comprehensive R Archive Network (CRAN)



Sharing your Environment with Colleagues

Whether you are using Conda, Pip, or Renv, you should be able to share the specifications of your software environment so colleagues can reproduce the environment.

The general sharing workflow:

  1. Output an environment file that lists the software and versions of the environment

  2. Share the file with colleagues through a platform like GitHub

  3. Colleagues create an empty environment on their computer and populate it with the contents of the environment file


Conda

  1. Export your Conda Environment

    conda env export > my_conda_env.yml
    

  2. Share the .yml file through GitHub

  3. Reproduce the Environment on a Different Computer

    conda env create --file my_conda_env.yml
    

Conda exports your Pip environment as well

Exporting your environment using Conda (conda env export > my_conda_env.yml) will ALSO export your pip environment!



Python

  1. Export the Python libraries present in your environment

    pip3 freeze > requirements.txt 
    

  2. Share the requirements.txt on GitHub

  3. Reproduce the Environment on a Different Computer

    pip install -r requirements.txt
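After reproducing an environment from a requirements.txt, it can be useful to check that the installed versions match the file. The sketch below is one possible way to do that with the Python standard library. The function names parse_requirements and check_pins are hypothetical, the package name in the example is made up, and only simple name==version lines are handled (other line types that pip freeze can produce are skipped).

```python
from importlib import metadata


def parse_requirements(text):
    """Parse 'name==version' lines as written by pip freeze; skip everything else."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Compare pinned versions with what is installed in the current environment."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if installed == wanted:
            report[name] = "ok"
        else:
            report[name] = f"installed {installed}, wanted {wanted}"
    return report


# Hypothetical requirements.txt content; the package name is made up so that
# the example reports it as missing.
example = """
# exported with pip3 freeze
not-a-real-package-xyz==1.2.3
"""
pins = parse_requirements(example)
print(pins)
print(check_pins(pins))
```

To check a real file, the text could be read with open("requirements.txt").read() and passed to parse_requirements in the same way.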
    



Renv

  1. Create an isolated environment

    renv::init()
    

  2. Export R packages to the renv.lock file

    renv::snapshot()
    

  3. Share the renv.lock, .Rprofile, renv/settings.json and renv/activate.R files on GitHub

  4. Reproduce the Environment on a Different Computer

    renv::restore()
    









Reproducibility Tutorial


Installing Conda

When you download and install Conda it comes in two different flavors:

Miniconda - a lightweight (500 MB) program that includes Conda, the environment and package manager, as well as a recent version of the Python Standard Library.

Anaconda - a larger (2.5 GB) program that includes Conda and many more Python libraries pre-installed (in the Conda base environment), as well as a graphical user interface, access to Jupyter notebooks, and support for easily integrating the R language.


Conda, Miniconda, and Anaconda.
Taken from Getting Started with Conda, Medium.

Installing Conda

For the appropriate installation package, visit https://docs.conda.io/en/latest/miniconda.html. ⚠ Note: If you are using the WSL, install the Linux version!!

# Download conda and add right permissions
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh     # Modify this to match the OS you're using.
chmod +x Miniconda3-py39_4.12.0-Linux-x86_64.sh

# Install conda silently (-b), update any existing installation (-u), and initialize conda
./Miniconda3-py39_4.12.0-Linux-x86_64.sh -b -u
~/miniconda3/bin/conda init

# Restart bash so that conda is activated
source ~/.bashrc

You can tell that conda is active when (base) appears next to the shell prompt, such as

(base) user@machine

Conda should now be installed and can be used to install other necessary packages!


Tip: slow Conda? Try Mamba.

Conda is known to take a long time to process some software installations. A solution is to use Mamba, a reimplementation of Conda in C++ for quicker queries and installations. Mamba is invoked by typing mamba instead of conda, while keeping the options and the rest of the command syntax the same.

The quickest way to install mamba is with conda install -c conda-forge mamba, or follow the official installation documentation here.






Conda on CyVerse

OS of choice

This tutorial will be performed using the CyVerse CLI (Command Line Interface), which is a Linux command line. This requires a CyVerse account.

However, if you'd like to use your own computer, feel free to! If you're on Mac or Linux, open your terminal; if you're on Windows, please use the Windows Subsystem for Linux (WSL) so you can follow along.

How to Scroll in CyVerse (Tmux) Cloud Shell

If you're using the CyVerse Cloud Shell, you can scroll up and down by pressing Ctrl + b and then [ to enter scroll mode. You can then use the arrow keys to scroll up and down. Press q to exit scroll mode.

The CLI in CyVerse is controlled with Tmux, software that lets you "window" the CLI. Here is a cheat sheet that will teach you more Tmux tricks!



Environment Management with Conda

When you start a CyVerse Cloud Shell, the prompt will look something like this:

(base) jovyan@a12b272e0:/home/user/data-store$

Miniconda has already been pre-installed, and by default you start in the base Conda environment (base).


View the list of conda environments

conda env list

View the software installed in the base environment. Notice the version of Python.

conda list


Create our own custom environment (type y when prompted).

conda create --name myenv



Activate your new environment with

conda activate myenv

You will notice that the prompt changed to (myenv)


View the software that is installed in your new custom environment. It should be empty!

conda list





Package management with Conda

Within your new custom environment (i.e., myenv), download and install a specific version of Python. This may take a few minutes to complete.

conda install python=3.9


View the new software that has been installed

conda list




Install Salmon and FastQC (genomics software) using Conda

conda install -c bioconda salmon fastqc


Conda channels

Conda operates through channels, specific repositories where packages are stored. A package may appear in multiple channels, so it is always helpful to specify a channel with the -c flag.




Share and Reproduce a Conda Environment

Export all of the software in your custom environment to a file

conda env export > myenv.yml

Let's view the contents of the .yml file

nano myenv.yml

Now we are going to pretend that we are reproducing a conda environment from a .yml file shared by a colleague.

Change the name of the environment within the .yml file from myenv to myenv2, and save the file as myenv2.yml.
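The renaming step can also be scripted rather than done by hand in an editor. The sketch below is one possible approach using plain text processing. The function name rename_environment is hypothetical, and the sample file content is illustrative only (placeholder versions and path, not real output from this tutorial). Dropping the prefix line is a cautious choice of this sketch, so that the copied file cannot point at the original environment's location.

```python
def rename_environment(yml_text, new_name):
    """Return the exported environment text with a new name and no prefix line."""
    output = []
    for line in yml_text.splitlines():
        if line.startswith("name:"):
            output.append(f"name: {new_name}")
        elif line.startswith("prefix:"):
            # The prefix is the install path on the exporting machine;
            # it is dropped here so it cannot point at the old environment.
            continue
        else:
            output.append(line)
    return "\n".join(output) + "\n"


# Illustrative file content; versions and the path are placeholders.
exported = """name: myenv
channels:
  - bioconda
  - defaults
dependencies:
  - python=3.9
prefix: /home/user/miniconda3/envs/myenv
"""

renamed = rename_environment(exported, "myenv2")
print(renamed)

# To apply this to real files (assuming myenv.yml exists in the current directory):
# with open("myenv.yml") as src, open("myenv2.yml", "w") as dst:
#     dst.write(rename_environment(src.read(), "myenv2"))
```

Only the top-level name and prefix lines are touched; the channels and dependencies are copied through unchanged, so the new environment gets the same software as the original.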


Create a new environment and populate it with the .yml environment file

conda env create --file myenv2.yml





Package management with Pip

Pip works similarly to Conda; it is the package manager supported by the Python Software Foundation. If you use Python for your work, it is likely you have installed packages using Pip.

We only have to install a single package required for this tutorial, MultiQC. To install MultiQC using Pip, do:

pip install multiqc

Similar to Conda, you can export your pip environment by doing

pip3 freeze > requirements.txt

Why pip3?

On systems where both Python 2 and Python 3 are installed, pip3 is the Pip that belongs to Python 3, so pip3 freeze > requirements.txt exports the Python 3 packages. On such systems plain pip may point to Python 2, in which case pip freeze > requirements.txt exports the Python 2 environment instead.
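One way to avoid guessing which Python a pip or pip3 command belongs to is to run Pip as a module of a specific interpreter (python -m pip). The sketch below only builds and prints such a command; the helper name pip_command is hypothetical, and the commented-out lines assume Pip is installed for the interpreter running the script.

```python
import sys


def pip_command(*args):
    """Build a pip command that is tied to the Python interpreter running this script."""
    return [sys.executable, "-m", "pip", *args]


freeze_cmd = pip_command("freeze")
print(" ".join(freeze_cmd))

# To actually run it and save the result (assumes pip is installed for this interpreter):
# import subprocess
# with open("requirements.txt", "w") as handle:
#     subprocess.run(freeze_cmd, stdout=handle, check=True)
```

Because the command starts with the full path of the running interpreter, the exported package list always describes that interpreter's environment, whichever names pip and pip3 happen to point to.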