After this lesson, you should be able to:
- Understand the basics of Git as a resource for reproducible programming
- Describe tools and approaches to creating your
- Describe best practices for maintaining GitHub Organizations and Repositories
- Maintain your own GitHub user profile and repositories
Version control refers to keeping track of the version of a file, set of files, or a whole project.
Some version control tools:
- Microsoft Office's Track Changes functionality
- Apple's Time Machine
- Google Docs' Version History
Version control is as much a philosophy as a set of tools; you don't need to master Git to utilize version control (though it is certainly a worthwhile tool for many researchers).
Git vs. GitHub
Git is a command-line program for version control of repositories. It keeps track of changes you make to files in your repository and stores those changes in a .git folder in that repository. These changes happen whenever you make a commit. Git stores the history of these commits in a "tree", so you can go back to any previous commit. By keeping track of the differences between commits, Git can be much more efficient than storing an entire copy of each version in a document's history.
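As a minimal sketch of the ideas above, assuming Git is installed: the throwaway directory, the file name notes.txt, and the user identity are placeholders chosen for illustration.

```shell
set -e
# Work in a throwaway directory so nothing else on your machine is touched.
repo="$(mktemp -d)"
cd "$repo"

git init -q                                # creates the hidden .git folder
git config user.name "Example User"        # placeholder identity, this repo only
git config user.email "user@example.com"

echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add first draft of notes"

echo "second draft" > notes.txt
git commit -q -am "Revise notes"

# Each commit is a point in the history that you can go back to.
commit_count="$(git rev-list --count HEAD)"
echo "commits so far: $commit_count"

# Show the file as it was one commit ago, without changing the working copy.
old_version="$(git show HEAD~1:notes.txt)"
echo "previous version: $old_version"
```

Everything Git knows about the project lives in the .git folder, so deleting that folder removes the history but leaves the current files.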
You could utilize Git completely on its own, on your local computer, and get a lot of benefits. You will have a history of the changes you made to a project, allowing you to go back to any old version of your work. However, where Git really shines is in collaborative work. In order to effectively collaborate with others on a project, you need two basic features: a way to allow people to work in parallel, and a way to host repositories somewhere where everyone can access them. The first feature is branching, which is part of Git, and the hosting part can be taken care of by platforms like GitHub, GitLab, or Bitbucket. We will focus on GitHub.
GitHub is a site that can remotely host your Git repositories. By putting your repository onto GitHub, you get a backup of the repository, a way to collaborate with others, and a lot of other features.
- Git: tool for version control.
- GitHub: web platform that hosts Git repositories and adds collaboration features.
Locations and directions:
- repo: short for repository
- local: on your personal computer.
- remote: somewhere other than your computer. GitHub can host remote repositories.
- upstream: primary or main branch of original repository.
- downstream: branch or fork of repository.
- clone: copy of a repository that lives locally on your computer. Pushing changes will affect the repository online.
- pull: get the latest changes from the remote repository onto your local machine.
- fetch: also downloads the latest changes, but you then need to merge them yourself; with pull, the merge happens automatically.
- branch: a history of changes to a repository. You can have parallel branches with separate histories, allowing you to keep a "main" version and development versions.
- fork: copy of someone else's repository stored under your own account on the hosting platform. From forks, you can make pull requests to the original repository.
- commit: finalize a change.
- push: add changes back to the remote repository.
- merge: takes changes from a branch or fork and applies them to another branch, typically main.
Many of these terms are also commands when paired with git. Using the syntax git <command>, one can trigger an action. An example is git pull, which will pull all of the latest changes from the remote repository.
pull request: proposed changes to/within a repository.
issue: suggestions or tasks needed for the repository. Allows you to track decisions, bugs with the repository, etc.
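The difference between fetch and pull described above can be sketched as follows. This assumes Git is installed; a local bare repository stands in for a remote such as GitHub, and the clone names alice and bob, the file data.txt, and the identity are hypothetical.

```shell
set -e
work="$(mktemp -d)"
cd "$work"
# Placeholder identity used for every commit in this sketch.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# A bare repository plays the role of the remote.
git init -q --bare remote.git
git -C remote.git symbolic-ref HEAD refs/heads/main

# First clone: create a commit and push it.
git clone -q remote.git alice 2>/dev/null   # hides the "empty repository" warning
echo "line 1" > alice/data.txt
git -C alice add data.txt
git -C alice commit -q -m "Add data file"
git -C alice branch -M main
git -C alice push -q -u origin main

# Second clone: a collaborator's local copy.
git clone -q remote.git bob

# Alice publishes a new commit.
echo "line 2" >> alice/data.txt
git -C alice commit -q -am "Add second line"
git -C alice push -q

# Option 1: fetch downloads the changes; merging is a separate, explicit step.
git -C bob fetch -q origin
behind="$(git -C bob rev-list --count HEAD..origin/main)"
echo "bob is behind by $behind commit(s)"
git -C bob merge -q origin/main

# Alice publishes again.
echo "line 3" >> alice/data.txt
git -C alice commit -q -am "Add third line"
git -C alice push -q

# Option 2: pull fetches and merges in one step.
git -C bob pull -q --ff-only

lines="$(wc -l < bob/data.txt | tr -d ' ')"
echo "bob now has $lines lines"
```

Fetching first is useful when you want to inspect incoming changes before they touch your working copy.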
Practical Git Techniques
The basic Git life cycle
When using Git for your version control, the usual life cycle is the following:
|git clone|Clones the target repository to your machine|
|git status|Shows the state of your working copy and whether your branch is ahead of or behind the remote (as of the last fetch)|
|git pull|Pulls any changes from the remote into your local repository|
|git add|Stages changes so they are included in the next commit|
|git commit|Creates the commit and adds a descriptive message|
|git push|Pushes the committed changes from local to the remote repository|
If there are no branches or external pull requests, the basic Git life cycle can be summarized like this:
graph LR
A[1. git clone] --> B[2. git status] --> C([differences from origin?]):::colorclass;
C -->|yes| D[3. git pull] --> E;
C -->|no| E[4. git add];
E --> F[5. git commit] --> G[6. git push];
G --> B;
classDef colorclass fill:#f96
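The six steps of the life cycle can be tried end to end without a GitHub account. The sketch below assumes Git is installed and uses a local bare repository in place of the remote; the directory names seed, remote.git, and project, the file notes.md, and the identity are made up for the example.

```shell
set -e
work="$(mktemp -d)"
cd "$work"
# Placeholder identity used for every commit in this sketch.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# Setup (not part of the life cycle): build a stand-in remote with one commit.
git init -q seed
echo "# demo" > seed/README.md
git -C seed add README.md
git -C seed commit -q -m "Initial commit"
git clone -q --bare seed remote.git

git clone -q remote.git project          # 1. clone the remote repository
cd project
git status -sb                           # 2. compare local branch and remote
git pull -q --ff-only                    # 3. pull any changes from the remote
echo "analysis notes" > notes.md
git add notes.md                         # 4. stage the change
git commit -q -m "Add analysis notes"    # 5. commit with a descriptive message
git push -q                              # 6. push the commit to the remote

# Confirm that the remote now contains the new commit.
remote_subject="$(git -C ../remote.git log -1 --format=%s)"
echo "latest commit on remote: $remote_subject"
```

With a real project, only the numbered steps apply, and the clone URL would point at the hosted repository.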
Once you have learned the basics of Git, for example with the Software Carpentry Git Lesson, a few further topics are worth digging into:
Using the Git log
- You can access it using git log
- It shows your commit history
- Useful for figuring out where you need to roll back to
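A short sketch of using the log to find and restore an earlier version, assuming Git is installed; the file log.txt, the commit messages, and the identity are placeholders.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
# Placeholder identity used for every commit in this sketch.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

git init -q
for n in 1 2 3; do
  echo "step $n" >> log.txt
  git add log.txt
  git commit -q -m "Record step $n"
done

# One line per commit: abbreviated hash followed by the message.
git log --oneline

# Pick the commit to roll back to (here: two commits before the latest).
target="$(git rev-parse HEAD~2)"
git log -1 --format='%h %s' "$target"

# Restore the file as it was in that commit; the history itself is untouched.
git checkout "$target" -- log.txt
restored="$(cat log.txt)"
echo "restored content: $restored"
```

The restored file shows up as a staged change, so it can be reviewed and committed like any other edit.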
- Branching is important to learn if you're going to be doing any sort of collaboration
- Here is a fantastic resource for learning how git branching really works: https://learngitbranching.js.org/
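To make branching concrete, here is a small sketch in which a feature branch and main each receive a commit and are then merged. It assumes Git is installed; the branch name feature, the file names, and the identity are hypothetical.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
# Placeholder identity used for every commit in this sketch.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

git init -q
echo "stable code" > main.txt
git add main.txt
git commit -q -m "Initial commit"
git branch -M main                       # make sure the branch is called main

# Create a branch and commit to it; main is not affected.
git checkout -q -b feature
echo "experimental code" > feature.txt
git add feature.txt
git commit -q -m "Add experimental feature"

# Meanwhile, work continues on main in parallel.
git checkout -q main
echo "hotfix" >> main.txt
git commit -q -am "Apply hotfix on main"

# Bring the feature work into main; the histories join in a merge commit.
git merge -q -m "Merge branch 'feature'" feature
git log --oneline --graph

# A merge commit has two parents: this prints the commit plus its parents.
parents="$(git rev-list --parents -n 1 HEAD | wc -w | tr -d ' ')"
echo "hashes on merge commit line: $parents"
```

Because the two branches changed different files, Git can combine them without any manual intervention.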
- You will probably have to deal with merge conflicts at some point
- Merge conflicts happen when two branches are being merged, but they have different changes to the same part of a file
- Perhaps you are working on a feature branch, and you change line 61 in file.R, but someone else made a change to the main branch at line 61 in file.R. When you try to merge the feature and main branches, Git won't know which changes to line 61 in file.R are correct, and you will need to manually decide.
- Here are some good resources:
- Resolving merge conflicts: resolving a merge conflict using the command line
- git - ours & theirs, a CLI resource to help with conflicts
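The file.R scenario above can be reproduced in a few commands. This is a sketch under the assumption that Git is installed; the file contents and identity are invented, and the conflict is resolved here by keeping the feature branch's version, which is only one of several possible choices.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
# Placeholder identity used for every commit in this sketch.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

git init -q
echo "threshold <- 1" > file.R
git add file.R
git commit -q -m "Add file.R"
git branch -M main

# The same line is changed differently on two branches.
git checkout -q -b feature
echo "threshold <- 5" > file.R
git commit -q -am "Raise threshold on feature"
git checkout -q main
echo "threshold <- 10" > file.R
git commit -q -am "Raise threshold on main"

# The merge stops with a non-zero exit status because of the conflict.
if git merge feature >/dev/null 2>&1; then
  merge_status="clean"
else
  merge_status="conflict"
fi
echo "merge result: $merge_status"

# List the files that still need a decision.
conflicted="$(git diff --name-only --diff-filter=U)"
echo "conflicted files: $conflicted"

# During a merge, "ours" is the current branch (main) and "theirs" is the
# branch being merged in (feature). Keep the feature version here.
git checkout --theirs -- file.R
git add file.R
git commit -q -m "Merge feature, keeping its threshold"
resolved="$(cat file.R)"
echo "resolved content: $resolved"
```

In practice you would often open the file and edit the conflict markers by hand instead of taking one side wholesale.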
- You often want Git to completely ignore certain files
- Generated files (like HTML files from Markdown docs)
- IDE-specific files like in .RStudio or .vscode folders
- Really big files, like data or images
- If you accidentally commit a really big file, GitHub might not let you push that commit
- If you have a huge file in Git, your repository size can get way too big
- This is a pain to solve, so set up the .gitignore file ahead of time. If you do need to fix it, here is a great resource:
- Removing Large Files From git Using BFG and a Local Repository
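A minimal .gitignore covering the three kinds of files listed above might look like the sketch below. It assumes Git is installed; the file and folder names are hypothetical, and the patterns should be adapted to your own project.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Example project contents.
mkdir -p .vscode data
echo '{}' > .vscode/settings.json
echo "x,y" > data/measurements.csv
echo "# Report" > report.md
echo "<h1>Report</h1>" > report.html
echo "print('hi')" > analysis.R

# Patterns are matched relative to the location of the .gitignore file.
cat > .gitignore <<'EOF'
# Generated files
*.html
# IDE-specific folders
.vscode/
# Large data files
data/
EOF

# "git add ." now skips everything that matches a pattern.
git add .
git ls-files
tracked_count="$(git ls-files | wc -l | tr -d ' ')"
echo "tracked files: $tracked_count"

# Ask Git which rule causes a given file to be ignored.
git check-ignore -v report.html
```

Note that .gitignore only affects untracked files; a file that has already been committed stays tracked until it is removed from the index.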
Git, GitHub and Data
Git and data don't always go hand in hand. GitHub blocks committed files larger than 100 MB and issues a warning for files larger than 50 MB. Additionally, GitHub recommends keeping repositories below 1 GB, which also allows for quicker cloning and sharing of the repository. If a large file has been uploaded by mistake and you wish to remove it, you can follow these instructions.
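One way to catch oversized files before committing is to search the working directory for anything above the warning threshold. The sketch below is an illustration only: the file names are made up, and find counts in mebibytes, so the 50 MB cut-off is approximate.

```shell
set -e
dir="$(mktemp -d)"
cd "$dir"

# Create one small file and one file of 51 MiB filled with zeros.
echo "small" > small.txt
head -c 53477376 /dev/zero > big.bin

# List regular files larger than 50 MiB, skipping Git's own .git folder.
large="$(find . -type f -size +50M ! -path './.git/*')"
echo "files above the warning threshold: $large"
```

Any file that shows up here is a candidate for the .gitignore file or for external storage.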
If you do have to work with large files and Git, here are some questions to ask yourself:
- Is this data shareable?
- Are there alternative file hosting platforms I can use?
- How will this data impact the shareability of this repository?
- Am I using a .gitignore?
GitHub now offers Git Large File Storage (Git LFS). The system stores a small pointer file in your repository in place of the large file, and stores the file itself elsewhere. When you clone the repository, the pointer file acts as a map showing how to obtain the original file.
Git LFS file size limits depend on your GitHub plan:
- 2 GB for GitHub free and GitHub Pro
- 4 GB for GitHub Team
- 5 GB for GitHub Enterprise Cloud
Useful GitHub Features
At its core, GitHub is just a place to host your Git repositories. However, it offers a lot of functionality that has less to do with Git and more to do with project management. We will walk through a few of these useful features.
- Issues let you plan out changes and suggestions to a repo
- Pull requests are a way to request merging code from one branch to another
- A typical workflow is for someone to fork a repo, then make a PR from the fork to the original repo
- Closing issues
- You can use Organizations to organize sets of repositories
- GitHub documentation:
Other neat things
- GitHub Classroom
- CSV and map rendering
- Code editor
Beyond Git and GitHub
Other platforms and tools provide version control with functionality similar to GitHub:
GitLab: An alternative to GitHub, GitLab offers both a cloud-hosted platform and a self-hosted option (GitLab CE/EE). It provides a comprehensive DevOps platform with built-in CI/CD, container registry, and more.
Bitbucket: Atlassian's Bitbucket is a Git repository hosting service (it previously also hosted Mercurial repositories, but that support was removed in 2020). It offers integration with Jira, Confluence, and other Atlassian products.
SourceForge: A platform that provides Git and Subversion hosting, as well as tools for project management, issue tracking, and collaboration.
AWS CodeCommit: Part of Amazon Web Services (AWS), CodeCommit is a managed Git service that integrates seamlessly with other AWS services.
Azure DevOps Services (formerly VSTS): Microsoft's Azure DevOps Services offers Git repository hosting along with a wide range of DevOps tools for planning, developing, testing, and deploying software.
Mercurial: Like Git, Mercurial is a distributed version control system, but with a different branching and merging model. It's an alternative to Git for version control.
True or False: Using Git requires a GitHub account
True or False: Using Git is easy
Git can be frustrating to even the most experienced users
When you find a new repository on GitHub that you think can help your research, what are the first things you should do?
Look at the README.md
Most GitHub repositories have a README.md file which explains what you're looking at.
Look at the LICENSE
Not all repositories are licensed the same way - be sure to check the LICENSE file to see whether the software is open source, or if it has specific requirements for reuse.