Optimise CrawlProjectsJob to use fewer GitHub core resources
#281
Labels: enhancement, refactoring, server
Although we substantially cut core resource usage from the GitHub API back in #152, I believe it can be optimized further. Since a repository can have an arbitrary number of labels, languages and topics[^1], we have so far relied solely on retrieving them through the REST API, as pagination ensures that they can all be retrieved, 100 results at a time. This means that for every repository mined, we also have to perform 3 additional REST calls, and that is only in the best case. However, if a repository has fewer than 100 of each, we could retrieve the complete result lists directly in GraphQL. This, in turn, would eliminate the need for any further REST requests, without increasing the overall cost of the GraphQL query we currently use.

To gauge the impact of switching to the methodology described above, I looked at the average and maximum number of children present in each relationship. I ran the following query:
Which produced the following result:
| relation | average | maximum |
| -------- | ------- | ------- |
| label    | ~9–10   | >1000   |
| language | ~3      | >350    |
| topic    | ~6–7    | 21      |
The results were in line with expectations. On average, each repository contains about 9 to 10 labels, which corresponds to the set generated by default when creating a new repository. At the same time, the number of labels used within a project can reach well into the thousands, reinforcing the need to keep REST as a fallback for when the preliminary GraphQL retrieval proves insufficient.
Languages follow a similar trend: the average number of languages present in a project is 3, with the maximum exceeding 350. Since the number of languages is directly tied to project complexity, as reflected by the repository's code contents, it makes sense that both the average and the maximum are lower than for a user-defined characteristic like labels. Still, we may need to fall back on the old REST method for particularly complex projects.
Finally, we have the topic counts. Although the average number of topics ranges from 6 to 7, the maximum of 21 came as a bit of a surprise: GitHub currently restricts the number of assignable topics to 20, which points to a possible issue in our current implementation. Discrepancy aside, this restriction implies that all of a repository's topics can be retrieved in a single REST request. One should nonetheless keep in mind that GitHub is a constantly evolving platform, so the paginated retrieval of topics through REST should be retained in case GitHub ever raises the topic upper bound.
Since the initial findings proved promising, I decided to dig deeper. Given that at most 100 labels/languages/topics can be retrieved in a single GraphQL page, I wanted to see the percentage of projects whose label and language counts exceed 100. Since topics currently cap at 20, they were not considered. Executing the following query:
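(Again, the original query is not preserved here. Under the same assumed schema, the share of repositories whose label count exceeds one GraphQL page could be computed along these lines:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repo_label (repo_id INTEGER, label TEXT)")
# Synthetic data: 999 repositories with a single label,
# plus one repository with 101 labels.
rows = [(i, "bug") for i in range(1, 1000)]
rows += [(1000, f"l{j}") for j in range(101)]
conn.executemany("INSERT INTO repo_label VALUES (?, ?)", rows)

# Percentage of repositories whose label count exceeds one page (100).
pct = conn.execute("""
    SELECT 100.0 * SUM(n > 100) / COUNT(*)
    FROM (SELECT COUNT(*) AS n FROM repo_label GROUP BY repo_id)
""").fetchone()[0]
print(pct)  # 0.1
```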
Produced the following output:
| relation | projects exceeding 100 (%) |
| -------- | -------------------------- |
| label    | ≤0.1                       |
| language | ≤0.1                       |
Note that the values displayed are percentages. In conclusion, 99.9% of GraphQL requests will contain all the information we need; in the remaining 0.1%, we fall back on the REST method we have used thus far. This optimization could lead to massive savings in core resources, opening the door to other enhancements, such as parallel miners that divide the mining load, or another background job that can freely use the APIs (proposed in #166).
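The GraphQL-first strategy with a REST fallback described above can be sketched as follows. The query string follows GitHub's public GraphQL schema (`labels`, `languages`, and `repositoryTopics` connections all expose `totalCount`); the helper function around it is illustrative, not the project's actual code:

```python
# Sketch of the GraphQL-first strategy with a REST fallback.
# The query mirrors GitHub's public GraphQL schema; the surrounding
# names are illustrative assumptions.
GRAPHQL_QUERY = """
query ($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    labels(first: 100)           { totalCount nodes { name } }
    languages(first: 100)        { totalCount nodes { name } }
    repositoryTopics(first: 100) { totalCount nodes { topic { name } } }
  }
}
"""

def needs_rest_fallback(connection: dict) -> bool:
    """True when a single GraphQL page did not cover the whole relation,
    i.e. the paginated REST retrieval is still required."""
    return connection["totalCount"] > len(connection["nodes"])

# Example response fragments: only 3 of 250 labels fit in the first page,
# while both languages were retrieved in full.
labels = {"totalCount": 250,
          "nodes": [{"name": "bug"}, {"name": "feature"}, {"name": "wontfix"}]}
languages = {"totalCount": 2,
             "nodes": [{"name": "Java"}, {"name": "Python"}]}

print(needs_rest_fallback(labels))     # True: paginate via REST
print(needs_rest_fallback(languages))  # False: GraphQL page was complete
```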
Footnotes

[^1]: Although GitHub currently imposes a double-digit hard limit, what if this limitation changes in the future?