A Software Engineering Statistics Study using the GitHub API.
Our team conducted a statistical survey of the language type preferences of large software projects, using a proprietary algorithm that queries the GitHub API. We found that 56% of the large software projects we sampled use explicitly typed languages rather than implicitly typed languages. Throughout this document, "explicitly typed" is used interchangeably with "statically typed", and "implicitly typed" with "dynamically typed".
Begin by installing the latest version of Python 3.
Clone the repo to the folder where you would like to run it.
Run the command: python main.py
Our goal is to determine whether large software projects prefer statically typed programming languages (like C, C++, Java) or dynamically typed languages (like JavaScript, Python, PHP). It is often said that as software systems grow in size, static type checking becomes a useful feature for eliminating entire classes of bugs, specifically type errors, which are caught at compile time instead of surfacing at run time.
Our interest is to test whether this feature makes statically typed languages a more common choice for large software projects.
We hypothesize that more than 50% of large open source software projects use statically typed programming languages as opposed to dynamically typed languages.
We consider a software project to be large if it contains over 1,000,000 bytes of code and has over 10 contributors. The size of a project refers to these two metrics.
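The size criteria above can be written as a small predicate. This is only a sketch, not the code in main.py: it assumes the byte count is the sum of the per-language byte counts that the GitHub REST API returns for a repository's languages, and the names `total_bytes` and `is_large_project` are hypothetical.

```python
from typing import Mapping

# Thresholds taken from the text: "over 1,000,000 bytes and over 10 contributors".
MIN_TOTAL_BYTES = 1_000_000
MIN_CONTRIBUTORS = 10


def total_bytes(language_bytes: Mapping[str, int]) -> int:
    """Sum the bytes of code across all languages in a project.

    Assumption: `language_bytes` has the shape {"Python": 12345, "C": 678},
    i.e. a mapping from language name to bytes of code.
    """
    return sum(language_bytes.values())


def is_large_project(language_bytes: Mapping[str, int], contributor_count: int) -> bool:
    """Return True only if both size criteria are strictly exceeded."""
    return (
        total_bytes(language_bytes) > MIN_TOTAL_BYTES
        and contributor_count > MIN_CONTRIBUTORS
    )
```

Both comparisons are strict, so a project with exactly 1,000,000 bytes or exactly 10 contributors is not counted as large.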
We assemble a hash table that maps the most common languages to either 'explicit' or 'implicit'. The list is intended to be comprehensive and can be found in the file language-types.json. Each language was researched individually to ensure that the data is correct. Language names were taken from linguist, provided by GitHub.
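As an illustration of how such a lookup table might be used, the sketch below loads the file and classifies a project by its most prominent language. The layout of language-types.json is an assumption here (a flat JSON object such as {"Java": "explicit", "Python": "implicit"}), and the function names are hypothetical rather than taken from main.py.

```python
import json
from typing import Mapping, Optional


def load_language_types(path: str = "language-types.json") -> dict:
    """Load the language table.

    Assumption: the file is a flat JSON object mapping a language name
    to the string 'explicit' or 'implicit'.
    """
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def dominant_language(language_bytes: Mapping[str, int]) -> Optional[str]:
    """Return the language with the most bytes of code, or None if empty."""
    if not language_bytes:
        return None
    return max(language_bytes, key=lambda name: language_bytes[name])


def classify(language: Optional[str], language_types: Mapping[str, str]) -> Optional[str]:
    """Return 'explicit' or 'implicit', or None for a language not in the table."""
    if language is None:
        return None
    return language_types.get(language)
```

Returning None for an unknown language lets the caller decide whether to skip the project or extend the table, instead of silently guessing a category.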
Our statistical population is the large open source software engineering projects found on GitHub.
Sample size: N = 50
Data is sampled using the public GitHub API.
- The script queries a random project and determines the size of the project. It verifies that the project ID is not already in the sample data set (to preserve independence).
- If the project meets the size criteria of a large software project, it is added to our sample data set.
- Continue until N projects are inserted into the sample data set.
- For each project in the sample data set, check its most prominent language.
- Take that language and check whether it is statically typed or dynamically typed.
- Insert the project information into its corresponding data set.
- The script calculates the sample mean, sample median, sample variance and sample standard deviation.
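The steps above can be sketched as a sampling loop followed by a summary. This is a hedged outline, not the proprietary algorithm itself: `fetch_random_project` and `is_large` are hypothetical callables supplied by the caller (so no network access is needed here), each project is assumed to be a dict with an "id" key, and the statistics are assumed to be computed over a 0/1 indicator where 1 means an explicitly typed project.

```python
import statistics
from typing import Callable, Dict, List, Sequence


def collect_sample(
    fetch_random_project: Callable[[], dict],
    is_large: Callable[[dict], bool],
    n: int = 50,
    max_attempts: int = 100_000,
) -> List[dict]:
    """Draw projects until n distinct large projects have been collected."""
    sample: Dict[object, dict] = {}
    for _ in range(max_attempts):
        if len(sample) >= n:
            break
        project = fetch_random_project()
        if project["id"] in sample:
            continue  # already sampled: skip to preserve independence
        if is_large(project):
            sample[project["id"]] = project
    return list(sample.values())


def summarize(indicators: Sequence[int]) -> Dict[str, float]:
    """Sample mean, median, variance and standard deviation.

    `statistics.variance` and `statistics.stdev` use the n - 1 denominator,
    so at least two data points are required.
    """
    return {
        "mean": statistics.mean(indicators),
        "median": statistics.median(indicators),
        "variance": statistics.variance(indicators),
        "stdev": statistics.stdev(indicators),
    }
```

The `max_attempts` cap is an addition for safety so the loop terminates even if too few large projects are found; with a 0/1 indicator, the sample mean is the sample proportion of explicitly typed projects.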
We found that 28 of the 50 projects inspected were using explicitly typed languages. For more information about the math that we used to determine this, please refer to https://drive.google.com/file/d/1lbfze5P0Y2RtJqXF_VUFQSkdVS3xtguR/view?usp=sharing
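The linked document contains our own calculations. As a separate, illustrative check of the "more than 50%" hypothesis, the sketch below runs a one-sided one-proportion z-test on the reported counts. The choice of test, the normal approximation, and the function name `one_sided_proportion_z_test` are assumptions made for this example and are not necessarily the method used in the linked document.

```python
import math
from typing import Tuple


def one_sided_proportion_z_test(successes: int, n: int, p0: float = 0.5) -> Tuple[float, float]:
    """Test H0: p <= p0 against H1: p > p0 using the normal approximation.

    Returns (z, p_value), where p_value is the upper-tail probability.
    """
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z, 1.0 - cdf


if __name__ == "__main__":
    # Counts reported above: 28 explicitly typed projects out of N = 50.
    z, p_value = one_sided_proportion_z_test(28, 50)
    print(f"z = {z:.3f}, one-sided p-value = {p_value:.3f}")
```

Under these assumptions, z is roughly 0.85 for 28 out of 50, so how strongly the sample supports the hypothesis depends on the significance level chosen.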