With almost 30k commits and a history spanning more than ten years, Scala is a mature, general-purpose programming language that has recently gained prominence among data scientists.
Scala is also an open source project. Open source projects have the advantage that their entire development histories -- who made changes, what was changed, code reviews, etc. -- are publicly available.
I am going to read in, clean up, and visualize the real-world project repository of Scala, combining data from a version control system (Git) and a project hosting site (GitHub). I will find out who has had the most influence on its development and who the experts are.
The dataset I will use, which was previously mined and extracted from GitHub, consists of three files:
- pulls_2011-2013.csv contains the basic information about the pull requests, and spans from the end of 2011 up to (but not including) 2014.
- pulls_2014-2018.csv contains the same kind of information for later pull requests, and spans from 2014 up to 2018.
- pull_files.csv contains the files that were modified by each pull request.
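
As a rough sketch of how these three files might fit together, the snippet below stacks the two pull request files and attaches the list of modified files to each pull request. It uses only Python's standard library. The file names come from the list above, but the column names (`pid`, `user`, `date`, `file`) and the sample rows are assumptions made for illustration and may not match the real headers.

```python
import csv
import io
from pathlib import Path

# File names are taken from the dataset description.
PULLS_FILES = ["pulls_2011-2013.csv", "pulls_2014-2018.csv"]
PULL_FILES_FILE = "pull_files.csv"


def read_rows(handle):
    """Read one CSV stream into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


def attach_files(pulls, pull_files, key="pid", file_column="file"):
    """Group modified files by pull request id and attach them to each pull.

    The default column names are assumptions, not confirmed headers.
    """
    files_by_pull = {}
    for row in pull_files:
        files_by_pull.setdefault(row[key], []).append(row[file_column])
    return [{**pull, "files": files_by_pull.get(pull[key], [])} for pull in pulls]


def load_dataset(directory):
    """Load the three CSV files from a directory (defined here, not called)."""
    base = Path(directory)
    pulls = []
    for name in PULLS_FILES:
        with open(base / name, newline="", encoding="utf-8") as handle:
            pulls.extend(read_rows(handle))
    with open(base / PULL_FILES_FILE, newline="", encoding="utf-8") as handle:
        pull_files = read_rows(handle)
    return attach_files(pulls, pull_files)


# In-memory stand-ins for the real files; every value here is made up.
early = io.StringIO("pid,user,date\n1,alice,2012-01-05\n")
late = io.StringIO("pid,user,date\n2,bob,2015-03-09\n")
files = io.StringIO("pid,file\n1,src/A.scala\n1,src/B.scala\n2,README.md\n")

pulls = read_rows(early) + read_rows(late)
merged = attach_files(pulls, read_rows(files))
```

Stacking the two pull request files first means every later step can treat the full date range as one table, with the shared pull request id acting as the join key to the per-file records.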