
Reproducibility I: Software Environments

Learning Objectives

After this lesson, you should be able to:

  • Understand the value of reproducible computing
  • Know the challenges of reproducible computing
  • Define a computing environment
  • Share a software environment with a colleague
  • Set up a software project with an environment



Reproducible Scientific Computing


Defining Reproducibility

"Reproducing the result of a computation means running the same software on the same input data and obtaining the same results." Rougier et al. 2016

"Getting someone else's code to run on my computer" - Anonymous






Source: Peng, RD Reproducible Research in Computational Science Science (2011): 1226–1227 via Reproducible Science Curriculum





Interactive (i.e., point-and-click) Computing


Definition

Manually navigating a mouse across a graphical user interface (GUI) and running commands by selecting from menu options.

Advantages

  • Intuitive and easy to navigate a GUI and click buttons

Limitations

  • It can be slow to sequence through hundreds of clicks to accomplish an analysis.
  • Less reproducible - Cumbersome to write and follow a click-by-click tutorial





Scripted Computing


Definition

Removing the GUI and instead instructing the computer to run a series of custom commands using a scripting/coding language.

We are automating what used to take many manual clicks.

We can write scripts to install software, clean data, run analyses, and generate figures.


Advantages

  • Much faster to run through commands
  • The script runs identically every time, reducing the human element
  • Easy for someone else to quickly reproduce the exact analysis and result
  • Enables analysis tasks to scale up


Challenges

  • Requires deeper computer knowledge
  • More upfront effort to produce the script



Discussion Question

What are some tasks you have automated or want to automate?

  • Have you ever successfully automated a task?
  • Found a way to make something scale or take less time?
  • What was the task, and how did you do it?
  • Are there any things you wish you could automate?
  • What are some barriers to automating them?



Scripting Languages

The most common open-source scripting languages (for science) are Python, R, and shell (Bash).


If you recall from lesson 4 How to Talk to Computers, we ran a shell script to back up and compress files. The following admonitions show the original shell script as well as the same instructions in Python and R. Scripting languages are simply different ways to instruct a computer.

Shell Script
#!/bin/bash
# Use the Bash shell to run the following commands

## Variables
#the directory you want to back up (e.g., shell-lesson-data)
SOURCE_DIR=$(find / -type d -name "shell-lesson-data" 2>/dev/null | head -n 1) # Note: if you are working on your computer, this will look in every folder and keep only the first match. Be careful with this line!

#location where the backup will be stored
BACKUP_DIR="$HOME/Backup"

#used to create a unique name for each backup based on the current date and time
TIMESTAMP=$(date +"%Y-%m-%d_%H-%M-%S")

# name of the compressed backup file
ARCHIVE_NAME="backup_$TIMESTAMP.tar.gz"


# Create backup directory if it doesn't exist
mkdir -p "$BACKUP_DIR"

# Create a compressed archive of the source directory
tar -czf "$BACKUP_DIR/$ARCHIVE_NAME" -C "$SOURCE_DIR" .

# Output the result
echo "Backup of $SOURCE_DIR completed!"
echo "Archive created at $BACKUP_DIR/$ARCHIVE_NAME"
Python
import os
import subprocess
from datetime import datetime

# Variables
# Find the source directory (e.g., shell-lesson-data)
def find_source_dir():
    try:
        # Run the 'find' command to locate the directory
        result = subprocess.run(['find', '/', '-type', 'd', '-name', 'shell-lesson-data'],
                                stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True)
        # 'find' may report several matches, one per line; keep the first
        matches = result.stdout.splitlines()
        return matches[0] if matches else None
    except Exception as e:
        print(f"Error finding directory: {e}")
        return None

# Set the backup directory to a folder called Backup in the home directory
backup_dir = os.path.join(os.path.expanduser("~"), "Backup")

# Create a unique timestamp for the backup
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# Create the archive name with the timestamp
archive_name = f"backup_{timestamp}.tar.gz"

# Ensure backup directory exists
os.makedirs(backup_dir, exist_ok=True)

# Find source directory
source_dir = find_source_dir()

if source_dir:
    # Create the compressed archive using tar
    archive_path = os.path.join(backup_dir, archive_name)
    try:
        subprocess.run(['tar', '-czf', archive_path, '-C', source_dir, '.'], check=True)
        print(f"Backup of {source_dir} completed!")
        print(f"Archive created at {archive_path}")
    except subprocess.CalledProcessError as e:
        print(f"Error creating archive: {e}")
else:
    print("Source directory not found!")
R
# Load necessary libraries
library(lubridate)

# Variables
# Function to find the source directory (e.g., shell-lesson-data)
find_source_dir <- function() {
  result <- system("find / -type d -name 'shell-lesson-data' 2>/dev/null", intern = TRUE)
  if (length(result) > 0) {
    return(result[1])  # Return the first match, if any
  } else {
    return(NULL)
  }
}

# Backup directory
backup_dir <- file.path(Sys.getenv("HOME"), "Backup")

# Create a unique timestamp for the backup
timestamp <- format(now(), "%Y-%m-%d_%H-%M-%S")

# Name of the compressed archive
archive_name <- paste0("backup_", timestamp, ".tar.gz")

# Ensure backup directory exists
if (!dir.exists(backup_dir)) {
  dir.create(backup_dir, recursive = TRUE)
}

# Find the source directory
source_dir <- find_source_dir()

if (!is.null(source_dir)) {
  # Create the compressed archive using tar
  archive_path <- file.path(backup_dir, archive_name)
  tar_command <- paste("tar -czf", shQuote(archive_path), "-C", shQuote(source_dir), ".")

  # Run the tar command
  system(tar_command)

  cat("Backup of", source_dir, "completed!\n")
  cat("Archive created at", archive_path, "\n")
} else {
  cat("Source directory not found!\n")
}

Each language consists of base software (Python Standard Library or R Base Package) and MANY additional packages that can be downloaded and installed for increased capabilities.
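
One way to see the difference between base software and add-on packages is to ask Python whether a module can be imported. The sketch below uses `importlib.util.find_spec` from the standard library; the helper name `is_available` is made up for illustration, and `numpy` is only an example of an add-on package that may or may not be installed on your machine.

```python
import importlib.util


def is_available(module_name):
    """Return True if `module_name` can be imported in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# 'json' ships with the Python Standard Library, so it is always present.
print("json available:", is_available("json"))

# 'numpy' is an add-on package; it is found only if someone installed it.
print("numpy available:", is_available("numpy"))
```

Running the same two lines on a colleague's computer may print a different answer for the add-on package, which is exactly the kind of difference the rest of this lesson is about.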




Computing Environment

A computing environment is the combination of hardware, software, and network resources that provide the infrastructure for computing operations and user interactions.

  • Hardware: CPUs, GPUs, RAM
  • Operating system & version: many flavors of Linux, MacOS, Windows
  • Software versions: R, Python, etc.
  • Package versions: specific R or Python packages, which often depend on other packages
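
The items in this list can be inspected from inside a script. The following is a minimal sketch using only the Python Standard Library; the function name `environment_summary` and the choice of fields are assumptions made for illustration, not a complete description of an environment.

```python
import os
import platform


def environment_summary():
    """Collect a few facts about the hardware, OS, and Python in use."""
    return {
        "machine": platform.machine(),          # CPU architecture, e.g. x86_64 or arm64
        "cpu_count": os.cpu_count(),            # number of logical CPUs (may be None)
        "os": platform.system(),                # Linux, Darwin (macOS), or Windows
        "os_release": platform.release(),       # operating system release string
        "python_version": platform.python_version(),
    }


for key, value in environment_summary().items():
    print(f"{key}: {value}")
```

Saving this output next to your results is a lightweight way to record which environment produced them.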


Python Package Dependency



!!Very Important!!

The scripts you create:

  • Were designed to work in your specific computing environment
  • May not work on someone else's computer because their computing environment is different
  • May not work on your computer in the future, because your computing environment will probably change (e.g., updated software versions)


Software Dependency Hell

Sometimes, it can be nearly impossible to get your computing environment correct enough to run someone else's code.

This can be caused by incorrect software versions of the packages you are using or their dependencies.

Don't despair! There are solutions to avoid software dependency hell and ensure reproducibility from one computer to another.



Software Installation

When you download and install software on your computer, it is typically placed in one of a set of specific directories that we call the System Path.

System Path

In the context of computing, the system path, often referred to simply as PATH, is the set of directories in which the operating system looks for executable files when a command is issued.

When you go to launch an application by clicking on a desktop icon or with a CLI command, the computer will search for the application within the PATH directories. If it finds the executable, it will launch.

Find the PATH on your computer

In Linux and Mac Terminal

echo $PATH


In Windows Terminal (PowerShell)

$env:PATH




Nice and Short Video Describing the PATH.



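The PATH prefers one version of any given software. When several versions are installed, the computer searches the PATH directories in order and runs the first matching executable it finds.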
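
The search order can be imitated in a few lines of Python. This is a simplified sketch of the lookup, not the operating system's actual implementation; the names `find_on_path`, `demo_first_match_wins`, and `mytool` are hypothetical. The demo builds two temporary directories that each hold a file called `mytool` and shows that the directory listed first wins.

```python
import os
import tempfile


def find_on_path(command, path_string):
    """Return the first executable named `command` in a PATH-style string, or None."""
    for directory in path_string.split(os.pathsep):
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate  # stop at the first match; later copies are ignored
    return None


def demo_first_match_wins():
    """Create two fake installs of 'mytool' and report which directory is used."""
    with tempfile.TemporaryDirectory() as root:
        directories = []
        for name in ("first", "second"):
            directory = os.path.join(root, name)
            os.makedirs(directory)
            tool = os.path.join(directory, "mytool")
            with open(tool, "w") as handle:
                handle.write("#!/bin/sh\necho hello\n")
            os.chmod(tool, 0o755)  # mark the file as executable
            directories.append(directory)
        fake_path = os.pathsep.join(directories)
        found = find_on_path("mytool", fake_path)
        return os.path.basename(os.path.dirname(found))


print("Directory used:", demo_first_match_wins())
```

For the real PATH on your computer, the standard library function `shutil.which("python3")` performs a similar lookup and returns the full location of the executable that would run.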





Environment Managers

One solution to software dependency hell is to use an Environment Manager.

An environment manager allows you to create software installation directories (similar to PATH) that are isolated from your computer's PATH. You can create unique environments and install specific software versions to run specific scripts.


Conda - Open Source Environment Manager

Conda is a popular and open source environment manager tool that can be installed on any operating system (Windows, MacOS, Linux).

  • Users can create environments that have their own set of packages, dependencies, and even their own version of Python.
  • Projects can have their own specific requirements without interfering with each other
  • It allows for consistent and reproducible results across different systems and setups


Conceptual Graphic 1


Conceptual Graphic 2

Renv

  • R package that allows you to create unique environments for an R project




Sharing your Environment with Colleagues

Whether you are using Conda, Pip, or Renv, you should be able to share the specifications of your software environment so colleagues can reproduce the environment.

The general sharing workflow:

  1. Output an environment file that lists the software and versions of the environment

  2. Share the file with colleagues through a platform like Github

  3. Colleagues create an empty environment on their computer and populate it with the contents of the environment file
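
Step 1 of this workflow can be sketched in Python to show what an environment file contains. The example below lists every installed Python package with its version using `importlib.metadata` from the standard library. It is an illustration of the idea, not a replacement for the export commands shown in the following sections; the function names and the output file name are assumptions.

```python
from importlib import metadata


def pinned_packages():
    """Return sorted 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that lack a name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


def write_environment_file(path):
    """Write the pinned package list to `path` and return the number of packages."""
    lines = pinned_packages()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)


# Example usage (hypothetical file name):
# write_environment_file("my_environment.txt")
for line in pinned_packages()[:5]:
    print(line)
```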

Conda to Share Environment

Conda

  1. Export your Conda Environment

    conda env export > my_conda_env.yml
    

  2. Share the .yml file through Github

  3. Reproduce the Environment on a Different Computer

    conda env create --file my_conda_env.yml
    

Conda exports your Pip environment as well

Exporting your environment using Conda (conda env export > my_conda_env.yml) will ALSO export the packages you installed with Pip in that environment!

Pip to Share Environment

Python

  1. Export the Python libraries present in your environment

    pip3 freeze > requirements.txt 
    

  2. Share the requirements.txt on Github

  3. Reproduce the Environment on a Different Computer

    pip install -r requirements.txt
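
After reproducing an environment, it can be reassuring to confirm that the installed versions match the file. The sketch below is one possible check, assuming the requirements file contains only plain `name==version` lines (the format `pip freeze` usually writes); other forms such as URLs or environment markers are not handled. The function names are hypothetical.

```python
from importlib import metadata


def parse_requirements(text):
    """Parse 'name==version' lines; skip blanks, comments, and other forms."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (wanted, installed_or_None)} for every pin that is not satisfied."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is missing from this environment
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems


# Example usage with a requirements.txt in the current directory:
# with open("requirements.txt", encoding="utf-8") as handle:
#     print(find_mismatches(parse_requirements(handle.read())))
```

An empty result means every pinned package is present at the requested version.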
    

Renv to Share Environment

Renv

  1. Create an isolated environment

    renv::init()
    

  2. Export R packages to the renv.lock file

    renv::snapshot()
    

  3. Share the renv.lock, .Rprofile, renv/settings.json, and renv/activate.R files through Github

  4. Reproduce the Environment on a Different Computer

    renv::restore()
    



Package Managers

A package manager is a software tool that finds, downloads, and installs software packages into the PATH or into a virtual environment.

Conda


Software: Python, R, Django, Celery, PostgreSQL, nginx, Node.js, Java programs, C and C++, Perl, and command line tools

Repository: Conda-Forge.

Pip


Software: Python

Repository: PyPI

Note: Pip can be used together with Conda environment manager.

R


With the R language, a package manager is built directly into the R Base Package.

install.packages('ggplot2')

Repository: Comprehensive R Archive Network (CRAN)




Reproducibility Tutorial Using Conda

Set Up

OS of choice

To get everyone on the same page, we will do this exercise together using the Linux terminal in Github Codespaces.

However, if you'd like to use your own computer, feel free to! If you're on Mac or Linux, open your terminal; if you're on Windows, please use the Windows Subsystem for Linux (WSL) so you can follow along.

How to Scroll in Cyverse (Tmux) Cloud Shell

If you're using the Cyverse Cloud Shell, you can scroll up and down by pressing Ctrl + b and then [ to enter scroll mode. You can then use the arrow keys to scroll up and down. Press q to exit scroll mode.

The CLI in CyVerse is controlled with Tmux, software that lets you "window" the CLI. Here is a cheat sheet that will teach you more Tmux tricks!


Launch Github Codespaces

1. Go to this Github repository and Fork it (i.e., make a copy of it in your Github account).


2. Click on the green "Code" button and select "Create codespace on main"


3. After a few moments, you will be taken to a new browser window with a Linux terminal.



Installing Conda

If you are using Codespaces, Conda is already installed.

When you download and install Conda it comes in two different flavors:

Miniconda - a lightweight (500 MB) program that includes Conda, the environment and package manager, as well as a recent version of Python (including the Python Standard Library).

Anaconda - a larger (2.5 GB) program that includes Conda and many more Python libraries pre-installed (in the Conda base environment), as well as a graphical user interface, access to Jupyter notebooks, and support for easily integrating the R language.


Conda, Miniconda, and Anaconda.
Taken from Getting Started with Conda, Medium.

Installing Conda

For the appropriate installation package, visit https://docs.conda.io/en/latest/miniconda.html. ⚠ Note: If you are using the WSL, install the Linux version!!

# Download conda and add right permissions
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh     # Modify this to match the OS you're using.
chmod +x Miniconda3-py39_4.12.0-Linux-x86_64.sh

# Install conda silently (-b), update any existing installation (-u), and run conda init
./Miniconda3-py39_4.12.0-Linux-x86_64.sh -b -u
~/miniconda3/bin/conda init

# Restart bash so that conda is activated
source ~/.bashrc

You can tell that conda is active when (base) appears next to the shell prompt, such as

(base) user@machine

Conda should now be installed and can be used to install other necessary packages!

Tip: slow Conda? Try Mamba.

Conda can be slow when resolving and installing some software. One solution is Mamba, a reimplementation of Conda in C++ that offers quicker queries and installations. Mamba is invoked by typing mamba instead of conda (the options and the rest of the command syntax stay the same).

The quickest way to install mamba is with conda install -c conda-forge mamba, or follow the official installation documentation here.




Environment Management with Conda

When you start a Codespaces terminal, the prompt will look something like this:

@jeffgillan ➜ /workspaces/foss_conda_lesson (main) $

Type the following command to see the current conda environment.

conda info

Initialize conda by running the following commands.

conda init
exec $SHELL

View the list of conda environments. There should only be one environment called base.

conda env list

View the software installed in the base environment. Notice the version of Python.

conda list


Create our own custom environment (type y when prompted).

conda create --name myenv



Activate your new environment with

conda activate myenv

You will notice that the prompt changed to (myenv)


View the software that is installed in your new custom environment. It should be empty!

conda list




Package management with Conda

Within your new custom environment (i.e., myenv), download and install a specific version of Python. This may take a few minutes to complete.

conda install python=3.9


View the new software that has been installed. Notice that the version of Python is now 3.9, while the base environment has 3.12.

conda list
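
You can also confirm which environment and which Python a script is running in from inside Python itself. The sketch below relies on the `CONDA_DEFAULT_ENV` and `CONDA_PREFIX` environment variables, which `conda activate` sets; the function name `describe_active_environment` is hypothetical, and both values are simply `None` when no Conda environment is active.

```python
import os
import sys


def describe_active_environment(environ=None):
    """Report the active Conda environment (if any) and the running Python."""
    environ = os.environ if environ is None else environ
    return {
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),   # e.g. 'myenv' or 'base'
        "conda_prefix": environ.get("CONDA_PREFIX"),     # directory holding the environment
        "python_executable": sys.executable,              # which Python binary is running
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
    }


for key, value in describe_active_environment().items():
    print(f"{key}: {value}")
```

If the reported Python version is not the one you just installed, the script is probably being run by a Python from a different environment.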



Install Salmon (genomics software) using Conda

conda install -c bioconda salmon 


Conda channels

Conda operates through channels, specific repositories where packages are stored. A package may appear in multiple channels, so it is helpful to specify a channel with the -c flag.




Share and Reproduce a Conda Environment

Export all of the software in your custom environment to a file

conda env export --no-builds > myenv.yml

Let's view the contents of the .yml file. It should contain all the software you installed in the environment. This myenv.yml file can be shared with a colleague so they can reproduce the same environment on their computer.

nano myenv.yml
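
If you want the export to be part of a scripted workflow, the same command can be run from Python. This is a minimal sketch that wraps the `conda env export` command used above with `subprocess`; the function names are hypothetical, and it assumes Conda is installed and available on the PATH.

```python
import shutil
import subprocess


def build_export_command(env_name, no_builds=True):
    """Build the argument list for 'conda env export' for the named environment."""
    command = ["conda", "env", "export", "--name", env_name]
    if no_builds:
        command.append("--no-builds")  # leave out platform-specific build strings
    return command


def export_environment(env_name, output_path):
    """Run the export and save the YAML text to `output_path`."""
    if shutil.which("conda") is None:
        raise RuntimeError("conda was not found on PATH")
    result = subprocess.run(
        build_export_command(env_name),
        capture_output=True, text=True, check=True,
    )
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)
    return output_path


# Example usage, assuming an environment called 'myenv' exists:
# export_environment("myenv", "myenv.yml")
print(" ".join(build_export_command("myenv")))
```

Keeping the command in a script means the environment file can be regenerated every time the analysis is rerun.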

Reproduce someone else's environment with the mandelbrot.yml environment file located in the repository.

conda env create --file mandelbrot.yml


Activate the environment to use the software installed in the environment.

conda activate mandelbrot


Look at the software installed in the environment.

conda list


Run a python script that generates a Mandelbrot set

python3 mandelbrot.py