GitHub for Research: a quick and painless tutorial to improve your papers
If you do any sort of programming, you have probably already heard about GitHub and been encouraged to use it in some way.
That said, I’ve seen plenty of repositories online that lack the very basics of Git, which largely defeats its purpose.
In this tutorial, I will focus on the bare minimum needed to keep your projects organized with Git, so that you and other people can make proper use of them.
What is Git for?
The main use of Git is version control. If a later change breaks your code, or turns out to be unnecessary, version control lets you easily revert to a working version.
Git is also a nice way of sharing your code with others: people can work together, comment, and keep things up to date.
How do I get started?
There are many hosting platforms out there, such as GitHub, GitLab, and Bitbucket. I will show you how things work on GitHub; if you pick a different one, the Git commands stay the same and mostly only the interface changes.
The first thing we will do is create our repository. You can do this by clicking on the + sign in the top right corner, as shown in the image below:
Give it a name and set it to either public or private (some of these options might not be available depending on what type of account you have).
Click Create repository and you are done. You can also check the "Initialize this repository with a README" box, especially if you intend to share the repository with others; this lets you add a nice introduction and explanation to it.
The next step is to connect your Git repository to your computer, so you can send code from your machine to the online repository and keep track of your changes. In the figure below you can see how to clone your repository to a local folder (in my case, f/medium).
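If you prefer to see the clone step as commands, here is a minimal sketch. To keep it runnable offline, it assumes a local bare repository (remote.git, a made-up name) as a stand-in for the repository on GitHub; the commented URL only shows the general form, with <your-username> as a placeholder.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway working area so nothing touches your real folders.
sandbox="$(mktemp -d)"
cd "$sandbox"

# Stand-in for the repository created on GitHub: an empty bare repository on disk.
git init --quiet --bare remote.git

# Against GitHub you would pass the URL copied from the repository page instead, e.g.
#   git clone https://github.com/<your-username>/test_medium.git
git clone --quiet "$sandbox/remote.git" test_medium

cd test_medium
# "origin" is the name Git gives to the place you cloned from.
git remote -v
```

After cloning, the test_medium folder is an ordinary folder plus a hidden .git directory where Git keeps the history.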
Now it is time to add some code to your repository. We will add the following code, which computes the BMI from weight and height, to the folder where we cloned our repository (f/medium/test_medium).
With the code copied to the folder, we will use the Git Bash command line to add it to the repository. The image below shows the commands and their results in the Git Bash terminal. First, we move into the repository with <cd test_medium> (always type the commands without the <>). Then we stage the files we want to add with <git add .>; the dot adds everything in the folder, and you can replace it with the name of a single file, e.g. <git add main.py>.
Then we commit our code with <git commit -am 'First commit'>. You can change the text inside the quotes to something more meaningful. (The -a flag also stages changes to files Git already tracks; after <git add .>, -m alone is enough.) This saves the changes to our local repository, which means a version of the code is now stored on our machine.
Finally, we push the changes to the remote server with <git push origin master>, which means our code is now in the GitHub repository we created online and is available to others. (Repositories created on GitHub more recently name their default branch main instead of master; in that case, use main wherever this tutorial says master.)
If you check your GitHub repository, you will find your code there.
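The add, commit, and push cycle described above can be sketched end to end. This is only an illustration under a few assumptions: a local bare repository stands in for GitHub, the name and e-mail are made up, and the main.py written here is a hypothetical BMI function rather than the exact script used in this tutorial.

```shell
#!/usr/bin/env bash
set -euo pipefail

sandbox="$(mktemp -d)"
git init --quiet --bare "$sandbox/remote.git"   # local stand-in for GitHub
git clone --quiet "$sandbox/remote.git" "$sandbox/test_medium"
cd "$sandbox/test_medium"

# Identity for this repository only, so the commit works without a global Git config.
git config user.name "Example Author"
git config user.email "author@example.com"

# Hypothetical stand-in for the BMI script.
cat > main.py <<'EOF'
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2
EOF

git add .                              # stage every new or changed file in the folder
git commit --quiet -m 'First commit'   # -m is enough here: git add already staged the file
git branch -M master                   # make sure the branch is called master, as in this tutorial
git push --quiet origin master         # send the commit to the remote

git --no-pager log --oneline           # the local history now has one commit
```

The commit only exists on your machine until the push; the last two commands are where the local history and the remote come back in sync.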
Now we have the first version of our code in the repository. But let’s say we forgot to add something important, in this case printing the BMI for a given input, so we change our code to:
I would like this to be the latest version of the code online, since the previous one is incomplete. For that, we just repeat the procedure shown above: git add, git commit, git push.
If we check the repository now (image below), we will see that there’s a second commit, with the message "Second commit".
If we click History (on the right of the image above), we can see all the commits from this repository, together with the changes made in each one of them:
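The same history is also available from the command line. Here is a rough sketch in a throwaway local repository (no remote involved); the file contents and the identity are hypothetical and only there to give Git two versions to compare.

```shell
#!/usr/bin/env bash
set -euo pipefail

sandbox="$(mktemp -d)"
git init --quiet "$sandbox/test_medium"
cd "$sandbox/test_medium"
git config user.name "Example Author"
git config user.email "author@example.com"

# First version (hypothetical BMI function).
cat > main.py <<'EOF'
def bmi(weight_kg, height_m):
    return weight_kg / height_m ** 2
EOF
git add main.py
git commit --quiet -m 'First commit'

# Second version: also print the result for one input.
cat >> main.py <<'EOF'

print(bmi(70, 1.75))
EOF
git add main.py
git commit --quiet -m 'Second commit'

git --no-pager log --oneline       # one line per commit, newest first
git --no-pager show --stat HEAD    # which files the latest commit changed
```

<git log> lists the commits, and <git show> on any of them displays what that commit changed, much like clicking a commit on the History page.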
Maintaining your repository
Now, this is where a lot of people make a mistake. Let’s say you are done with the first part of this project and will work on something else that is still related to the topic. The code you committed is the latest version, and you would like to keep it that way: maybe other people will use it later, maybe it is referenced in one of your papers, or, most importantly, you don’t know how your new changes will affect the code and want to keep a stable version available.
This last version should remain easily available to other people, regardless of your new changes. In our case, let’s say we will change the height from meters to centimeters. This could greatly affect your code and hamper the reproducibility of your results, since the change is not documented yet. Besides, it might not integrate well with other functions.
For this new code, we will create a branch. A branch is basically an independent line of development within your repository, where you can make changes and later merge them into the main branch. A repository can have multiple branches.
We create a branch with the command <git checkout -b bmi_cm>, where bmi_cm is the name of our new branch, and repeat the series of commands to add, commit, and push our code, this time pushing to the new branch with <git push origin bmi_cm>.
If you succeeded, you will be able to see your new branch in your repository, and now there will be two versions of your code there.
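As a sketch of the branch step, the script below creates bmi_cm, commits a change on it, and pushes it. The assumptions are the same as before: a local bare repository plays the role of GitHub, and both versions of main.py are hypothetical. The -u flag is an optional extra that links the local branch to the remote one, so later pushes from this branch can be a plain <git push>.

```shell
#!/usr/bin/env bash
set -euo pipefail

sandbox="$(mktemp -d)"
git init --quiet --bare "$sandbox/remote.git"   # local stand-in for GitHub
git clone --quiet "$sandbox/remote.git" "$sandbox/test_medium"
cd "$sandbox/test_medium"
git config user.name "Example Author"
git config user.email "author@example.com"

# Stable version on master (hypothetical: height in meters).
printf 'def bmi(weight_kg, height_m):\n    return weight_kg / height_m ** 2\n' > main.py
git add main.py
git commit --quiet -m 'First commit'
git branch -M master
git push --quiet origin master

# New line of work on its own branch (hypothetical: height in centimeters).
git checkout --quiet -b bmi_cm
printf 'def bmi(weight_kg, height_cm):\n    return weight_kg / (height_cm / 100) ** 2\n' > main.py
git add main.py
git commit --quiet -m 'Use centimeters for height'
git push --quiet -u origin bmi_cm

git branch -a    # lists the local branches and the ones known on the remote
```

The master branch on the remote is untouched by all of this, which is exactly the point: the stable version stays where people expect it.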
Now, here comes the best part of the way we are structuring things. Let’s say you need to go back to the older version, where we used meters instead of centimeters, and make some changes in that code (maybe someone asked for a new function).
All you need to do is <git checkout master>, and you are back to the older version of your code. Most IDEs (like Spyder, PyCharm, and Visual Studio Code) will automatically update the code you have open (make sure you save and commit your changes before you switch branches).
You now have two working versions of your code that you can easily edit, without keeping multiple folders or overwriting the older version with a new one.
If at some point you no longer need to keep them separate, you can merge the branch into master: switch to master with <git checkout master> and then run <git merge bmi_cm>.
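To see switching and merging in action, here is a small local sketch. The two versions of main.py are hypothetical, and the repository lives in a temporary folder so it is safe to run.

```shell
#!/usr/bin/env bash
set -euo pipefail

sandbox="$(mktemp -d)"
git init --quiet "$sandbox/test_medium"
cd "$sandbox/test_medium"
git config user.name "Example Author"
git config user.email "author@example.com"

# master: hypothetical version that takes the height in meters.
printf 'def bmi(weight_kg, height_m):\n    return weight_kg / height_m ** 2\n' > main.py
git add main.py
git commit --quiet -m 'First commit'
git branch -M master

# bmi_cm: hypothetical version that takes the height in centimeters.
git checkout --quiet -b bmi_cm
printf 'def bmi(weight_kg, height_cm):\n    return weight_kg / (height_cm / 100) ** 2\n' > main.py
git commit --quiet -am 'Use centimeters for height'

git checkout --quiet master
head -n 1 main.py    # prints: def bmi(weight_kg, height_m):

git merge --quiet bmi_cm
head -n 1 main.py    # prints: def bmi(weight_kg, height_cm):
```

Note that the merge is run while on master: Git merges the named branch into whichever branch you currently have checked out.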
What else can you do?
There is still a lot more to Git. Here are some points you can explore:
Tags [Reader suggestion]: Tags are useful for creating code releases in the main branch. You have probably seen that most libraries have some sort of release number (e.g., numpy=1.15). You can create that for your code as well, to make it easier to find which version was used to obtain the results in your paper. It will make your code and repository a lot more organized.
Learn about forking, so you can work with other people’s code and submit changes.
If you don’t like the command prompt or Git Bash, you can try alternatives with a friendlier interface, like SourceTree.
Learn about resolving conflicts when merging.
Follow some advanced Git tutorials.
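For the tags item in the list above, a minimal sketch could look like this. The tag name v1.0, the tag message, and the file contents are made up for illustration, and the repository is a throwaway local one, so the push is shown only as a comment.

```shell
#!/usr/bin/env bash
set -euo pipefail

sandbox="$(mktemp -d)"
git init --quiet "$sandbox/test_medium"
cd "$sandbox/test_medium"
git config user.name "Example Author"
git config user.email "author@example.com"

printf 'def bmi(weight_kg, height_m):\n    return weight_kg / height_m ** 2\n' > main.py
git add main.py
git commit --quiet -m 'First commit'

# Annotated tag marking the exact commit used for the paper (v1.0 is a made-up name).
git tag -a v1.0 -m 'Version used for the paper results'
git tag --list

# With a remote configured, the tag is published much like a branch:
#   git push origin v1.0
# and anyone can return to that exact state later with:
#   git checkout v1.0
```

Unlike a branch, a tag does not move when you add new commits, which is what makes it a good fit for "the version behind the paper".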
Are you missing something? Would you like to learn more about a specific topic? Leave your opinion below or send me an e-mail (firstname.lastname@example.org).