Commit

Clean up README (teaxyz#13)
* Clean up README

* wip

* wip

* wip

* wip

* removed the pruning ur docker stuff

* removed the PKG_MANAGER reference

* +api

---------

Co-authored-by: Sanchit Ram Arvind <[email protected]>
mxcl and sanchitram1 authored Oct 23, 2024
1 parent fcc5ae3 commit 8643509
Showing 1 changed file with 47 additions and 63 deletions.
110 changes: 47 additions & 63 deletions README.md
@@ -1,47 +1,48 @@
# CHAI

This is an attempt at an open-source data pipeline for package managers. The goal is to
have a pipeline that can run on any package manager and provide a normalized data
source for different use cases.
CHAI is an attempt at an open-source data pipeline for package managers. The
goal is to have a pipeline that can use the data from any package manager and
provide a normalized data source for myriads of different use cases.

1. db: postgres to store package specific data
1. alembic: for running migrations
1. pipeline: which fetches and writes data
## Getting Started

## Requirements
Use [Docker](https://docker.com)

- docker
1. Run `docker compose build` to create the latest Docker images.
2. Then, run `docker compose up` to launch.

> [!TIP]
> [!NOTE]
> This will run CHAI for all package managers. As an example, crates by
> itself will take over an hour and consume >5GB of storage.
>
> For local development, all the requirements are listed in the [pkgx yaml](pkgx.yaml) file.
> To run only a specific backend, comment out the others in `docker-compose.yml`.
## Setup

1. Run `docker compose build` to create the latest Docker images.
2. Run `docker compose up` to launch.
<!-- we'd like to change the above to be more friendly to users trying to run a specific
pipeline -->

### Arguments

- `PKG_MANAGER`: which package manager the pipeline will be run for. Currently, the
supported values are:
- `crates`
- `FREQUENCY`: how frequently **(in hours)** the pipeline will run. Defaults to `24`
Specify these as environment variables, e.g. `FOO=bar docker compose up`:

- `FREQUENCY`: how frequently **(in hours)** the pipeline will run
(defaults to `24`)
- `FETCH`: whether the pipeline will fetch the data. Defaults to `true`
- `DEBUG`: whether the pipeline will run in debug mode. Defaults to `true`

These arguments are all configurable in the `docker-compose.yml` file.
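
As a rough illustration of how a pipeline process might consume these arguments, the sketch below reads `FREQUENCY`, `FETCH`, and `DEBUG` from the environment and applies the defaults listed above. `PipelineConfig` and `load_config` are hypothetical names, not CHAI's actual code, and the accepted boolean spellings are an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PipelineConfig:
    frequency_hours: int  # how often the pipeline runs, in hours
    fetch: bool           # whether to fetch fresh data
    debug: bool           # whether to run in debug mode


def _as_bool(value: str) -> bool:
    # Assumption: "true", "1", and "yes" (any case) mean enabled.
    return value.strip().lower() in {"true", "1", "yes"}


def load_config(env: Optional[Mapping[str, str]] = None) -> PipelineConfig:
    """Hypothetical helper: build a config from environment variables."""
    env = os.environ if env is None else env
    return PipelineConfig(
        frequency_hours=int(env.get("FREQUENCY", "24")),
        fetch=_as_bool(env.get("FETCH", "true")),
        debug=_as_bool(env.get("DEBUG", "true")),
    )


if __name__ == "__main__":
    print(load_config())
```

Passing a plain mapping instead of `os.environ` keeps the helper easy to exercise without touching the real environment.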

## Hard Reset
### Docker Services Overview

If you ever need to do a hard reset, here are the steps:
1. `db`: [PostgreSQL] database for the reduced package data
2. `alembic`: handles migrations
3. `pipeline`: fetches and writes data
4. `api`: a simple REST api for reading from the db

1. `rm -rf data`: removes all the data the fetcher has written
2. `docker system prune -a -f --volumes`: removes **everything** docker-related
### Hard Reset

> [!WARNING]
>
> Step 2 deletes everything Docker-related on your machine, so be careful.
Stuff happens. Start over:

`rm -rf ./data`: removes all the data the fetcher has written.

<!-- this is handled now that alembic/psycopg2 are in pkgx -->
<!--
@@ -51,45 +52,25 @@ If at all you need to do a hard reset, here's the steps
- vapor's migrations are written in swift
-->

## Usage
## Goals

Our goal is to build a data schema that looks like this:

![db/CHAI_ERD.png](db/CHAI_ERD.png)

Our specific application extracts the dependency graph to understand which pieces of
the open-source graph are critical. There are many other potential use cases for this data:

- license compatibility checker
- developer publications
- package popularity
- dependency analysis vulnerability tool (requires translating semver)

<!-- TODO: add these to the examples folder-->
Our specific application extracts the dependency graph to understand which
pieces of the open-source graph are critical. There are many other potential
use cases for this data:

### license compatibility checker
- License compatibility checker
- Developer publications
- Package popularity
- Dependency analysis vulnerability tool (requires translating semver)
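
To make the dependency-graph idea concrete, here is a minimal sketch that ranks packages by how many other packages depend on them directly. Counting direct dependents is only one simple proxy for "critical" and is an assumption here, not CHAI's actual method. The `rank_by_dependents` function and the sample edges are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_by_dependents(
    edges: Iterable[Tuple[str, str]], top: int = 3
) -> List[Tuple[str, int]]:
    """Count direct dependents per package.

    Each edge is a (package, dependency) pair.
    """
    # Drop duplicate edges, e.g. the same pair seen across several versions.
    unique_edges = set(edges)
    counts = Counter(dependency for _, dependency in unique_edges)
    # Highest count first; ties broken by name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top]


if __name__ == "__main__":
    # Made-up edges for illustration only.
    sample_edges = [
        ("app-a", "json-lib"),
        ("app-b", "json-lib"),
        ("app-b", "http-lib"),
        ("http-lib", "core-lib"),
        ("json-lib", "core-lib"),
        ("app-a", "core-lib"),
    ]
    print(rank_by_dependents(sample_edges))
```

In practice the edges would come from the normalized tables rather than a hard-coded list.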

> [!WARNING]
>
> it's probably better to start with a global list of licenses and then map each
> version's to the global list...but this isn't part of v1
```sql
SELECT DISTINCT
p.name,
l.name AS license,
dep.name AS dependency,
dep_l.name AS dependency_license
FROM packages p
JOIN versions v ON p.id = v.package_id
JOIN dependencies d ON v.id = d.version_id
JOIN packages dep ON d.dependency_id = dep.id
JOIN licenses l ON v.license_id = l.id
JOIN versions dep_v ON dep.id = dep_v.package_id
JOIN licenses dep_l ON dep_v.license_id = dep_l.id
```
> [!TIP]
> Help us add the above to the examples folder.
### package popularity
### Package Popularity

```sql
SELECT p.name, SUM(v.downloads) as total_downloads
@@ -100,7 +81,7 @@ ORDER BY total_downloads DESC
LIMIT 10;
```

### developer publications
### Developer Publications

```sql
SELECT u.username, p.name, COUNT(uv.id) as publications
@@ -112,15 +93,15 @@ GROUP BY u.username, p.name
ORDER BY p.name;
```

## FAQs / common issues
## FAQs / Common Issues

1. The database url is `postgresql://postgres:s3cr3t@localhost:5435/chai`, and is used
as `CHAI_DATABASE_URL` in the environment.
1. The database url is `postgresql://postgres:s3cr3t@localhost:5435/chai`, and
is used as `CHAI_DATABASE_URL` in the environment.
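
For scripts that need the individual connection parts, the following sketch reads `CHAI_DATABASE_URL` from the environment and splits it with the standard library. It falls back to the URL quoted above. The `database_settings` helper is a hypothetical name, not part of CHAI.

```python
import os
from urllib.parse import urlsplit

# Default taken from the FAQ entry above.
DEFAULT_URL = "postgresql://postgres:s3cr3t@localhost:5435/chai"


def database_settings(url=None):
    """Hypothetical helper: split the database URL into its parts."""
    url = url or os.environ.get("CHAI_DATABASE_URL", DEFAULT_URL)
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    settings = database_settings()
    # Leave the password out of anything printed or logged.
    print({k: v for k, v in settings.items() if k != "password"})
```

libpq-based clients such as psycopg2 generally accept the URL string as-is, so splitting it is only needed when a tool wants separate host, port, and database values.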

## tasks
## Tasks

These are tasks that can be run using xcfile.dev. If you have pkgx, just run `dev` to
inject them into your environment. If you don't, go get it.
These are tasks that can be run using [xcfile.dev]. If you use `pkgx`, typing
`dev` loads the environment. Alternatively, run them manually.

### reset

@@ -235,3 +216,6 @@ Refreshes table knowledge from the db.
```sh
docker-compose restart api
```

[PostgreSQL]: https://www.postgresql.org
[`pkgx`]: https://pkgx.sh
