Versioning your Data and Scripts



Setting up for today’s class

For this quick hands-on session we will use a graphical user interface (GUI) to work with Git. To get set up:

  1. Download and install GitKraken.
  2. Create an account for yourself on GitHub. Please be sure to select the free/academic account, as this option has more long-term flexibility.
  3. Download the workshop sample files.

What is Version Control?

Version control keeps track of the successive versions of a piece of work, whether a single person is working on it or it is a shared document. It is designed to avoid a situation like the one shown below.

mydocument.txt
mydocument_v2.txt
mydocument_v3_rev-BHP.txt
mydocument_v8_Final?.txt

Some word processors handle this a little better, without creating a new file for every “save”. Examples include Microsoft Word’s “Track Changes” and Google Docs’ version history.

Version control systems start with a base version of the document and then save just the changes you made at each step of the way by taking a so-called “snapshot”. A snapshot records when it was taken and all the changes between the current document and the previous version. You decide when these snapshots are taken, and this allows you to “rewind” your file to an older version.
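
As a rough illustration of the snapshot idea described above, the following Python sketch uses the standard library's difflib module to record only the changes between two versions of a short document, and then to “rewind” to the earlier version. The document contents and variable names are made up for this example, and this is a toy model only: real version control systems such as Git store and organize their history in their own ways.

```python
import difflib

# Two versions of a small document, one list item per line (made-up content).
base = ["Introduction", "Methods", "Results"]
edited = ["Introduction", "Methods (revised)", "Results", "Discussion"]

# A "snapshot" in this toy model: just the lines that changed, plus context.
changes = list(
    difflib.unified_diff(
        base, edited, fromfile="snapshot_1", tofile="snapshot_2", lineterm=""
    )
)
print("\n".join(changes))

# ndiff keeps enough information to recover either version from the delta,
# so we can "rewind" to the older version (1) or replay the newer one (2).
delta = list(difflib.ndiff(base, edited))
rewound = list(difflib.restore(delta, 1))
replayed = list(difflib.restore(delta, 2))
print(rewound)
```

Lines starting with `-` in the printed diff were removed and lines starting with `+` were added; unchanged lines appear only as context.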

Two users can also make independent sets of changes based on the same document, producing two separate snapshots that document those changes.

If there aren’t conflicts, you can “merge” two sets of changes onto the same base document.
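
To make the idea of merging more concrete, here is a minimal Python sketch of combining two independent sets of changes onto the same base document. The function name `merge_lines` and the sample documents are hypothetical, and the sketch assumes that each user only edits existing lines in place (no lines added or removed). Real merge tools handle far more cases than this.

```python
def merge_lines(base, ours, theirs):
    """Toy line-by-line merge of two edited copies of the same base document.

    Assumption for this sketch: all three versions have the same number of
    lines, i.e. edits only modify existing lines.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this toy merge only handles in-place line edits")

    merged = []
    for b, o, t in zip(base, ours, theirs):
        if o == t:        # both copies agree (changed identically or not at all)
            merged.append(o)
        elif o == b:      # only the second user changed this line
            merged.append(t)
        elif t == b:      # only the first user changed this line
            merged.append(o)
        else:             # both changed the same line differently
            raise ValueError(f"conflict: {o!r} vs {t!r}")
    return merged


base = ["Title", "Methods", "Results"]
alice = ["Title: Versioning", "Methods", "Results"]
bob = ["Title", "Methods", "Results (final)"]

merged = merge_lines(base, alice, bob)
print(merged)
```

Because the two users changed different lines, their changes combine cleanly. If both had edited the same line in different ways, the function would report a conflict instead of guessing which version to keep.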

Version Control Systems and Hosts

Many different version control systems are available; in this class we will focus on Git. These systems let you track changes locally or remotely (which makes collaboration easier), and there are hosts available for remote management of your “repositories”.

GitHub is currently the most popular host of open source projects by number of projects and number of users. But other hosts exist, including SourceForge, Bitbucket, and GitLab, to name a few.

Why use Version Control?

The two main reasons to use version control are to:

Though version control was originally designed for dealing with code, there are many benefits to using it with text files too (.txt, .csv, .tsv). Some examples of projects that make use of version control systems and hosts like GitHub include writing manuscripts, books, or dissertations, and collaboratively developing and distributing teaching materials (like those for this class).

Note: Version control systems vary in how they handle non-text files. In most cases Word documents, graphics files, data objects from R or STATA, etc., can be included, but most tools have limited capabilities for these.

Version control is particularly useful for facilitating collaboration. One of the original motivations behind version control systems was to allow different people to work on large projects together, and in the case of Git, to manage the Linux kernel source code.

Benefits of collaborating with Version Control include:

Why Not use Dropbox or Google Drive?

Dropbox, Google Drive, and other services offer some form of version control in their systems. There are times when this may be sufficient for your needs. However, there are a number of advantages to using a version control system like Git:

