The data in our database generally originates from somewhere in the Boost GitHub ecosystem.
This page will explain to Django developers how data is synced from GitHub to our database.
- Most code is in `libraries/github.py` and `libraries/tasks.py`
- Releases are also called "Versions."
- The model that saves Release/Version data is `versions/models.py::Version`
- We retrieve all the non-beta and non-release-candidate tags from the main Boost repo
Boost releases some tags as formal GitHub "releases," and these show up on the Releases tab.
Not all tags are official GitHub Releases, however, and this impacts where we get metadata about the tag.
To retrieve releases and tags, run:
./manage.py import_versions
Note that the command enqueues a Celery task rather than running synchronously. The task will:
- Delete existing Versions and LibraryVersions if you pass `--delete-versions` to the command
- Retrieve tags and releases from the Boost GitHub repo
- Create new Versions for each tag and release that is not a beta or rc release
- Create a new LibraryVersion for each Library (including for historical versions unless you pass `--new`)
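The tag-filtering step can be illustrated with a minimal Python sketch. The tag names, the regular expression, and the helper `is_final_release_tag` are assumptions made for illustration; the matching logic in the actual task may differ.

```python
import re

# Assumed convention: beta and release-candidate tags carry "beta" or "rc"
# in their name, for example "boost-1.81.0.beta1".
PRERELEASE_PATTERN = re.compile(r"(beta|rc)\d*", re.IGNORECASE)


def is_final_release_tag(tag_name: str) -> bool:
    """Return True when the tag does not look like a beta or rc tag."""
    return PRERELEASE_PATTERN.search(tag_name) is None


# Hypothetical tag names, used only to exercise the filter.
tags = ["boost-1.81.0", "boost-1.81.0.beta1", "boost-1.80.0-rc1", "boost-1.80.0"]
final_tags = [tag for tag in tags if is_final_release_tag(tag)]
print(final_tags)
```

Only the tags that pass this kind of check would go on to become Version records.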
- Once a month, the task `libraries/tasks/update_libraries()` runs.
- It cycles through all Boost libraries and updates data
- It only handles the most recent version of Boost and does not handle older versions yet.
- There are methods to download issues and PRs, but they are not currently called.
- A new GitHub API token needs to be generated through the CPPAlliance GitHub organization and added as a kube secret
- `self.skip_modules`: This exists in both `GitHubAPIClient` and `LibraryUpdater`, but it should probably only exist in `LibraryUpdater`, to keep `GitHubAPIClient` less tightly coupled to specific repos
- If we only want aggregate data for issues and PRs, do we need to save the data from them in models, or can we just save the aggregate data somewhere?
To make the code more readable to the Boost team, who will ultimately maintain the project, we tried to replicate their terminology as much as possible.
- Library: Boost “Libraries” correspond to GitHub repositories
- `.gitmodules`: The file in the main Boost project repo that contains the information on all the repos that are considered Boost libraries
- module and submodule: Other words for library that correspond more specifically to GitHub data
This is not a code walkthrough, but is a general overview of the objects and data that this class retrieves.
- The Celery task `libraries/tasks.py/update_libraries` runs `LibraryUpdater.update_libraries()`
- This class uses the `GitHubAPIClient` class to call the GitHub API
- It retrieves the list of libraries to update from the `.gitmodules` file in the main Boost repo: https://github.com/boostorg/boost/blob/master/.gitmodules
- From that list, it excludes any libraries in `self.skip_modules`. Those modules are not imported into the database.
- For each remaining library:
  - It uses the information from the `.gitmodules` file to call the GitHub API for that specific library
  - It downloads the `meta/libraries.json` file for that library and parses that data
  - It uses the parsed data to add or update the Library record in our database for that GitHub repo
  - It adds the library to the most recent Version object to create a LibraryVersion record, if needed
  - The library categories are updated
  - The maintainers are updated and stub Users are added for them if needed.
  - The authors are updated and stub Users are added for them if needed (updated second because maintainers are more likely to have email addresses, so matching is easier).
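The first steps of this flow (read `.gitmodules`, then drop skipped modules) can be sketched in plain Python. This is a minimal illustration and not the project's implementation: `parse_gitmodules`, `libraries_to_update`, the sample text, and the skipped name `example_tool` are all hypothetical.

```python
import re

# Matches section headers such as: [submodule "system"]
SECTION_RE = re.compile(r'^\[submodule "(?P<name>[^"]+)"\]$')


def parse_gitmodules(text: str) -> list:
    """Parse .gitmodules text into one dict per submodule."""
    modules = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        section = SECTION_RE.match(line)
        if section:
            current = {"module": section.group("name")}
            modules.append(current)
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return modules


def libraries_to_update(text: str, skip_modules: set) -> list:
    """Return the parsed submodules that are not in the skip list."""
    return [m for m in parse_gitmodules(text) if m["module"] not in skip_modules]


# Illustrative sample; "example_tool" is a made-up submodule name.
SAMPLE = """
[submodule "system"]
    path = libs/system
    url = ../system.git
[submodule "example_tool"]
    path = tools/example_tool
    url = ../example_tool.git
"""

remaining = libraries_to_update(SAMPLE, skip_modules={"example_tool"})
print([m["module"] for m in remaining])
```

Each remaining entry would then drive the per-library GitHub API calls described in the list.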
- This class controls the requests to and responses from the GitHub API. It is mostly a wrapper around `GhApi` that allows us to set some default values to make calling the methods easier, and allows us to retrieve some data that is very specific to the Boost repos
- Requires the environment variable `GITHUB_TOKEN` to be set
- Contains methods to retrieve the `.gitmodules` file, the `libraries.json` file, general repo data, repo issues, repo PRs, and the git tree.
- Contains methods to parse the data we retrieve from GitHub into more useful formats
- Contains methods to parse the `.gitmodules` file and the `libraries.json` file, and to extract the author and maintainer names and email addresses, if present.
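Extracting a name and an optional email address from an author or maintainer entry might look like the sketch below. The input shape (`Full Name <user -at- example.com>`) and the helper `extract_name_and_email` are assumptions for illustration; the parser in the codebase may handle more variations.

```python
import re

# Assumed input shape: "Full Name <user -at- example.com>" or just "Full Name".
CONTACT_RE = re.compile(r"^\s*(?P<name>[^<]*?)\s*(?:<(?P<email>[^>]*)>)?\s*$")


def extract_name_and_email(raw: str):
    """Return (name, email); email is None when no address is present."""
    match = CONTACT_RE.match(raw)
    if match is None:
        # Unexpected shape: keep the raw text as the name.
        return raw.strip(), None
    name = match.group("name").strip()
    email = match.group("email")
    if email:
        # Undo the assumed " -at- " obfuscation.
        email = email.replace(" -at- ", "@").strip()
    return name, email or None


print(extract_name_and_email("Jane Doe <jane -at- example.com>"))
print(extract_name_and_email("John Smith"))
```

Entries without an address return `None` for the email, which is the case where a stub user would need a placeholder address.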
Attributes

| Attribute | Description | Default |
|---|---|---|
| owner | GitHub repo owner | boostorg |
| ref | GitHub branch or tag to use on that repo | heads/master |
| repo_slug | GitHub repo slug | default |
- `self.skip_modules`: This is the list of modules/libraries from `.gitmodules` that we do not download
- Each Boost Library has a GitHub repo.
- Most of the time, one library has one repo. Other times, one GitHub repo is shared among multiple libraries (the “Algorithm” library is an example).
- The most important file for each Boost library is `meta/libraries.json`
The `.gitmodules` file is the most important file in the main Boost repository. It contains the GitHub information for all Libraries included in that tagged Boost version, and is what we use to identify which Libraries to download into our database.
- `submodule`: Corresponds to the `key` in `libraries.json`
  - Contains information for the top-level Library, but not other sub-libraries stored in the same repo
- `path`: the path to navigate to the Library repo from the main Boost repo
- `url`: the URL for the `.git` repo for the library, in relative terms (`../system.git`)
- `fetchRecurseSubmodules`: We don’t use this field
- `branch`: We don’t use this field
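Because `url` is relative, it has to be resolved against the main Boost repo before it can be used on its own. The sketch below shows one way to do that with the standard library; the base URL constant and both helper names are assumptions for illustration, not functions from the codebase.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin

# Assumption: the main Boost repo is cloned from this URL.
BOOST_REPO_URL = "https://github.com/boostorg/boost.git"


def repo_slug_from_url(relative_url: str) -> str:
    """Derive the repo slug from a relative URL such as "../system.git"."""
    return PurePosixPath(relative_url).stem


def absolute_repo_url(relative_url: str, base_url: str = BOOST_REPO_URL) -> str:
    """Resolve a relative submodule URL against the main repo URL."""
    # Git treats the superproject URL like a directory when resolving
    # relative submodule URLs, so add a trailing slash before joining.
    return urljoin(base_url.rstrip("/") + "/", relative_url)


print(repo_slug_from_url("../system.git"))
print(absolute_repo_url("../system.git"))
```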
The `meta/libraries.json` file is the most important file in the GitHub repo for a library. It is where we retrieve all the metadata about the Library. It is the source of truth.
- `key`: The GitHub slug, and the slug we use for our Library object
  - When the repo hosts a single Library, the `key` corresponds to the `submodule` in the main Boost repo’s `.gitmodules` file. Example: `"key": "asio"`
  - When the repo hosts multiple libraries, the first `key` corresponds to the `submodule`. Example: `"key": "algorithm"`. Then, the following keys in `libraries.json` will be prefixed with the original `key` before adding their own slug. Example: `"key": "algorithm/minimax"`
- `name`: What we save as the Library name
- `authors`: A list of names of original authors of the Library’s documentation.
  - Data is very unlikely to change
  - Data generally does not contain emails
  - Stub users are created for authors with fake email addresses, and users will be able to claim those accounts.
- `description`: What we save as the `Library` description
- `category`: A list of category names. We use this to attach Categories to the Libraries.
- `maintainers`: A list of names and emails of current maintainers of this Library
  - Data may change between versions
  - Data generally contains emails
  - Stub users are created for all maintainers. We use fake email addresses if an email address is not present
  - We try to be smart: if the same name shows up as an author and a maintainer, we won’t create two fake records. But it’s imperfect.
- `cxxstd`: The minimum C++ standard version that this Library requires
Example with a single library:
Example with multiple libraries:
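Both cases can be illustrated with a short Python sketch. The payloads below are placeholders built from the field names described above, not the contents of the real `asio` or `algorithm` files, and the sketch assumes that a single-library repo stores one JSON object while a multi-library repo stores a JSON array. `load_library_entries` and `split_key` are hypothetical helpers.

```python
import json


def load_library_entries(raw_json: str) -> list:
    """Return the entries of a libraries.json payload as a list of dicts."""
    data = json.loads(raw_json)
    # Assumption: one object for a single library, an array for several.
    return [data] if isinstance(data, dict) else list(data)


def split_key(key: str):
    """Split a key into (submodule, sub-library slug or None)."""
    submodule, _, sub_slug = key.partition("/")
    return submodule, sub_slug or None


# Placeholder payloads for illustration only.
SINGLE = """
{"key": "asio", "name": "Asio", "authors": ["Jane Doe"],
 "maintainers": ["Jane Doe <jane -at- example.com>"],
 "description": "Placeholder description.", "category": ["IO"], "cxxstd": "11"}
"""
MULTIPLE = """
[{"key": "algorithm", "name": "Algorithm"},
 {"key": "algorithm/minimax", "name": "Min-Max"}]
"""

for payload in (SINGLE, MULTIPLE):
    for entry in load_library_entries(payload):
        print(entry["name"], split_key(entry["key"]))
```

Normalizing both shapes into a list lets the rest of the update code treat single-library and multi-library repos the same way.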
- To add a new skipped submodule: add the name of the submodule to the list `self.skip_modules` and make a PR. This will not remove the library from the database, but it will stop refreshing data for that library.
- To remove a submodule that is currently being skipped: remove the name of the submodule from `self.skip_modules` and make a PR. The library will be added to the database the next time the update runs.
- Via the Admin. The Library update process does not delete any records.
- They will be automatically added as part of the download process as soon as they are added to a library's `libraries.json` file.
- Via the Admin.
- But if they are not also removed from the `libraries.json` file for the affected library, then they will be added back the next time the job runs.