Reproducibility I: Software Environments¶
Learning Objectives
After this lesson, you should be able to:
- Understand the value of reproducible computing
- Know the challenges of reproducible computing
- Define a computing environment
- Share a software environment with a colleague
- Set up a software project with an environment
Reproducible Scientific Computing¶
Defining Reproducibility
"Reproducing the result of a computation means running the same software on the same input data and obtaining the same results." Rougier et al. 2016
"Getting someone else's code to run on my computer" - Anonymous
Interactive (ie, point-and-click) Computing¶
Definition
Manually navigating a mouse across a graphical user interface (GUI) and running commands by selecting from menu options.
Advantages¶
- Intuitive and easy to navigate a GUI and click buttons
Limitations¶
- It can be slow to sequence through hundreds of clicks to accomplish an analysis.
- Less reproducible - Cumbersome to write and follow a click-by-click tutorial
Scripted Computing¶
Definition
Removing the GUI and instead instructing the computer to run a series of custom commands using a scripting/coding language.
We are automating what used to take many manual clicks.
We can write scripts to install software, clean data, run analyses, and generate figures.
Advantages¶
- Much faster to run through commands
- The script runs identically every time, reducing the human element
- Easy for someone else to quickly reproduce the exact analysis and result
- Enables analysis tasks to scale up
Challenges¶
- Requires deeper computer knowledge
- More upfront effort to produce the script
Discussion Question
What are some tasks you have automated or want to automate?
- Have you ever successfully automated a task?
- Found a way to make something scale or take less time?
- What was the task, and how did you do it?
- Are there any things you wish you could automate?
- What are some barriers to automating them?
Scripting Languages¶
The most common open-source scripting languages (for science) are Python, R, and shell (Bash).
If you recall from lesson 4 How to Talk to Computers, we ran a shell script to back up and compress files. The following admonitions show the original shell script as well as the same instructions in Python and R. Scripting languages are simply different ways to instruct a computer.
Shell Script
#use Bash shell to run the following commands
#!/bin/bash
## Variables
#the directory you want to back up (e.g., shell-lesson-data)
SOURCE_DIR=$(find / -type d -name "shell-lesson-data" 2>/dev/null) # Note: if you are working on your computer, this will look in every folder. Be careful with this line!
#location where the backup will be stored
BACKUP_DIR="$HOME/Backup"
#used to create a unique name for each backup based on the current date and time
TIMESTAMP=$(date +"%Y-%m-%d_%H-%M-%S")
# name of the compressed backup file
ARCHIVE_NAME="backup_$TIMESTAMP.tar.gz"
# Create backup directory if it doesn't exist
mkdir -p "$BACKUP_DIR"
# Create a compressed archive of the source directory
tar -czf "$BACKUP_DIR/$ARCHIVE_NAME" -C "$SOURCE_DIR" .
# Output the result
echo "Backup of $SOURCE_DIR completed!"
echo "Archive created at $BACKUP_DIR/$ARCHIVE_NAME"
Python
import os
import subprocess
import shutil
from datetime import datetime
# Variables
# Find the source directory (e.g., shell-lesson-data)
def find_source_dir():
try:
# Run the 'find' command to locate the directory
result = subprocess.run(['find', '/', '-type', 'd', '-name', 'shell-lesson-data'],
stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True)
source_dir = result.stdout.strip()
return source_dir
except Exception as e:
print(f"Error finding directory: {e}")
return None
# Set the backup directory to a folder called Backup in the home directory
backup_dir = os.path.join(os.path.expanduser("~"), "Backup")
# Create a unique timestamp for the backup
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
# Create the archive name with the timestamp
archive_name = f"backup_{timestamp}.tar.gz"
# Ensure backup directory exists
os.makedirs(backup_dir, exist_ok=True)
# Find source directory
source_dir = find_source_dir()
if source_dir:
# Create the compressed archive using tar
archive_path = os.path.join(backup_dir, archive_name)
try:
subprocess.run(['tar', '-czf', archive_path, '-C', source_dir, '.'], check=True)
print(f"Backup of {source_dir} completed!")
print(f"Archive created at {archive_path}")
except subprocess.CalledProcessError as e:
print(f"Error creating archive: {e}")
else:
print("Source directory not found!")
R
# Load necessary libraries
library(lubridate)
# Variables
# Function to find the source directory (e.g., shell-lesson-data)
find_source_dir <- function() {
result <- system("find / -type d -name 'shell-lesson-data' 2>/dev/null", intern = TRUE)
if (length(result) > 0) {
return(result[1]) # Return the first match, if any
} else {
return(NULL)
}
}
# Backup directory
backup_dir <- file.path(Sys.getenv("HOME"), "Backup")
# Create a unique timestamp for the backup
timestamp <- format(now(), "%Y-%m-%d_%H-%M-%S")
# Name of the compressed archive
archive_name <- paste0("backup_", timestamp, ".tar.gz")
# Ensure backup directory exists
if (!dir.exists(backup_dir)) {
dir.create(backup_dir, recursive = TRUE)
}
# Find the source directory
source_dir <- find_source_dir()
if (!is.null(source_dir)) {
# Create the compressed archive using tar
archive_path <- file.path(backup_dir, archive_name)
tar_command <- paste("tar -czf", shQuote(archive_path), "-C", shQuote(source_dir), ".")
# Run the tar command
system(tar_command)
cat("Backup of", source_dir, "completed!\n")
cat("Archive created at", archive_path, "\n")
} else {
cat("Source directory not found!\n")
}
Each language consist of base software (Python Standard Library or R Base Package) and MANY additional packages that can be downloaded and installed for increased capabilities.
Computing Environment¶
A computing environment is the combination of hardware, software, and network resources that provide the infrastructure for computing operations and user interactions.
- Hardware: CPUs, GPUs, RAM
- Operating system & version: many flavors of Linux, MacOS, Windows
- Software versions: R, Python, etc.
- Package versions: specific R or Python packages, which often depend on other packages
!!Very Important!!
The scripts you create:¶
- Were designed to work in your specific computing environment
- May not work on someone else's computer because their computing environment is different
- May not work on your computer in the future, because your computing enviroment will probably change (eg., updated software versions)
Software Dependency Hell¶
Sometimes, it can be nearly impossible to get your computing environment correct enough to run someone else's code.
This can be caused by incorrect software versions of the packages you are using or their dependencies.
Don't Dispair! There are solutions to avoid software dependency hell and ensure reproducibility from one computer to another
Software Installation¶
When you download and install software onto your computer, it will typically install it in a set of specific directories that we call the System Path.
System Path¶
In the context of computing, the system path, often referred to simply as PATH, is the set of directories in which the operating system looks for executable files when a command is issued.
When you go to launch an application by clicking on a desktop icon or with a CLI command, the computer will search for the application within the PATH directories. If it finds the executable, it will launch.
Find the PATH on your computer
In Linux and Mac Terminal
echo $PATH
In Windows Terminal
$env:PATH
The PATH prefers one version of any given software.
Environment Managers¶
One solution to software dependency hell is to use an Environment Manager
An environment manager allows you to create software installation directories (similar to PATH) that are isolated from your computer's PATH. You can create unique environments and install specific software version to run specific scripts.
Conda - Open Source Environment Manager¶
Conda is a popular and open source environment manager tool that can be installed on any operating system (Windows, MacOS, Linux).
- Users can create environments that have their own set of packages, dependencies, and even their own version of Python.
- Projects can have their own specific requirements without interfering with each other
- It allows for consistent and reproducible results across different systems and setups
Renv¶
- R package that allows you to create unique environments for an R project
Sharing your Environment with Colleagues¶
Whether you are using Conda, Pip, or Renv, you should be able to share the specifications of your software environment so colleagues can reproduce the environment.
The general sharing workflow:
-
Output an environment file that lists the software and versions of the environment
-
Share the file with colleagues through a platform like Github
-
Colleagues create an empty environment on their computer and populate it with the contents of the environment file
Conda to Share Environment
Conda¶
-
Export your Conda Environment
conda env export > my_conda_env.yml
-
Share the .yml file through Github
-
Reproduce the Environment on a Different Computer
conda env create --file environment.yml
Conda exports your Pip environment as well
Exporting your environment using Conda (conda env export > my_conda_env.yml
) will ALSO export your pip environment!
Pip to Share Environment
Python¶
-
Export python libraries present in your environment
pip3 freeze > requirements.txt
-
Share the
requirements.txt
on Github -
Reproduce the Environment on a Different Computer
pip install -r requirements.txt
Renv to Share Environment
Renv¶
-
Create an isolated environment
renv::init()
-
Export R packages to the renv.lock file
renv:snapshot()
-
Share the
renv.lock
,.Rprofile
,renv/settings.json
andrenv/activate.R
files to Github -
Reproduce the Environment on a Different Computer
renv::restore()
Package Managers¶
A software tool to find, download, and install software packages to PATH or virtual environment
Conda
Conda¶
Software: Python, R, Django, Celery, PostgreSQL, nginx, Node.js, Java programs, C and C++, Perl, and command line tools
Repository: Conda-Forge.
Pip
Pip¶
Software: python
Repository: PyPi
Note: Pip can be used together with Conda environment manager.
R
R¶
With the R language, a package manager is built directly into the R Base Package.
install.packages('ggplot2')
Repository: R Comprehensive R Archive Network (CRAN)
Reproducibility Tutorial Using Conda¶
Set Up¶
OS of choice
To get everyone on the same page, we will do this exercise together using the Linux terminal in Github Codespaces.
However, if you'd like to use your own computer feel free to! If you're on Mac or Linux, open your terminal; If you're on Windows, please use the Windows Subsystem for Linux (WSL) so you can follow along.
How to Scroll in Cyverse (Tmux) Cloud Shell
If you're using the Cyverse Cloud Shell, you can scroll up and down by pressing Ctrl + b
and then [
to enter scroll mode. You can then use the arrow keys to scroll up and down. Press q
to exit scroll mode.
The CLI in CyVerse is controlled with Tmux, a software that allows to "window" the CLI; Here is a cheat sheet that will teach you more Tmux tricks!
Launch Github Codespaces¶
1 Go to this Github repository and Fork it (i.e., make a copy of it in your Github account).
2 Click on the green "Code" button and select "Create Codespaces on main"
3 After a few moments, you will be taken to a new browser window with a Linux terminal.
Installing Conda¶
If you are using Codespaces, Conda is already installed.
When you download and install Conda it comes in two different flavors:
Miniconda - lightweight (500 mb) program that includes Conda, the environment and package manager, as well as a recent version of the Standard Python Library.
Anaconda - a larger (2.5GB) program that includes Conda and many more python libraries pre-installed (in Conda base environment), as well as graphical user interface, acccess to jupyter notebooks, and support for easily integrating the R language.
Installing Conda
For the appropriate installation package, visit https://docs.conda.io/en/latest/miniconda.html. Note: If you are using the WSL, install the Linux version!!
# Download conda and add right permissions
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh # Modify this to match the OS you're using.
chmod +x Miniconda3-py39_4.12.0-Linux-x86_64.sh
# install conda silenty (-b) and update (-u) and initial conda run
./Miniconda3-py39_4.12.0-Linux-x86_64.sh -b -u
~/miniconda3/bin/conda init
# Restart bash so that conda is activated
source ~/.bashrc
You'll be able to tell when conda is active when next (base)
is present next to the to the shell prompt such as
(base) user@machine
Conda should now be installed and can be used to install other necessary packages!
Tip: slow Conda? Try Mamba.
Conda is known to take time processing some software installation. A solution is to use Mamba, a reimplementation of Conda in C++ for quicker queries and installations. Mamba is then invoked by using mamba
instead of conda
(whilst keeping options and the rest of the command synthax the same).
The quickest way to install mamba is with conda install -c conda-forge mamba
, or follow the official installation documentation here.
Environment Management with Conda¶
When you start a Codespaces terminal, the prompt will look something like this:
@jeffgillan ➜ /workspaces/foss_conda_lesson (main) $
Type the following command to see the current conda environment.
conda info
Initialize conda by running the following commands.
conda init
exec $SHELL
View the list of conda environments. There should only be one environment called base
.
conda env list
View the software installed in the base directory. Notice the version of Python.
conda list
Create our own custom environment (type y
when prompted).
conda create --name myenv
Activate your new environment with
conda activate myenv
You will notice that the prompt changed to
(myenv)
View the software that is installed in your new custom environment. It should be empty!
conda list
Package management with Conda¶
Within your new custom environment (ie, myenv) download and install a specific version of python. This may take a few minutes to complete.
conda install python=3.9
View the new software that has been installed. Notice the version of Python is now 3.9. while the base is 3.12
conda list
Install Salmon (genomics software) using Conda
conda install -c bioconda salmon
Conda channels
Conda operates through channels, specififc repositories where packages are stored. Specific packages sometimes may appear in multiple channels, however it is always helpful to specify a channel with the -c
flag.
Share and Reproduce a Conda Environment¶
Export all of the software in your custom environment to a file
conda env export --no-builds > myenv.yml
Let's view the contents of the .yml file. It should contain all the software you installed in the environment. This
myenv.yml
file can be shared with a colleague so they can reproduce the same environment on their computer.nano myenv.yml
Reproduce someone else's environment with mandelbrot.yml
environment file located in the repository.
conda env create --file mandelbrot.yml
Activate the environment to use the software installed in the environment.
conda activate mandelbrot
Look at the software installed in the environment.
conda list
Run a python script that generates a Mandelbrot set
python3 mandelbrot.py