Talk to the City

Talk to the City is an application that:

  • ingests unstructured natural language, e.g.:
    • citizen surveys / public deliberations
    • newsgroups
    • forums
    • discussion archives
  • uses LLMs to extract and classify:
    • atomic claims
    • topics and subtopics
  • generates interactive reports

Demo

Heal Michigan

The Heal Michigan report is a video-based survey and an in-depth look into the challenges and daily lives of the Michigan community.

Taiwan same-sex marriage

The Taiwan same-sex marriage report is a very large survey of the Taiwanese population, covering their views on same-sex marriage in Taiwan.

Mina protocol

The Mina protocol report features the results of a user survey carried out by Mina Zero Knowledge Protocol among its users.

Repo link: https://github.com/AIObjectives/talk-to-the-city-reports

Computational Graph

On the technology front, tttc uses a dependency-graph-based data and computation model made of nodes connected by directed edges. Together, the nodes and edges form a pipeline in which some nodes provide data while others provide computation steps. Running the pipeline involves a topological sort (the edges are directed), after which the output of each node is passed to the input of its downstream nodes. At each step, the node's "compute" function is invoked with the upstream input data, and so on until all nodes have been computed.

Computation has two modes: "run", used when the pipeline creator actively runs the pipeline, and "load", used when the resulting report page is loaded by a viewer.
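
As a rough illustration of the model described above, here is a minimal sketch of a topological sort followed by a compute pass. The `PipelineNode`, `Edge`, `topologicalSort` and `runPipeline` names are hypothetical and are not taken from the tttc codebase. The sketch also assumes a synchronous `compute` for brevity, whereas real nodes may well be asynchronous.

```typescript
// Hypothetical shapes for illustration; the real node and edge types may differ.
type Mode = "run" | "load";

interface PipelineNode {
  id: string;
  // Receives upstream outputs keyed by the id of the upstream node.
  compute: (inputs: Record<string, unknown>, mode: Mode) => unknown;
}

interface Edge {
  source: string;
  target: string;
}

// Kahn's algorithm: repeatedly emit nodes whose upstream nodes are all done.
// Assumes every edge refers to node ids that exist in `nodes`.
function topologicalSort(nodes: PipelineNode[], edges: Edge[]): PipelineNode[] {
  const inDegree = new Map<string, number>();
  for (const n of nodes) inDegree.set(n.id, 0);
  for (const e of edges) inDegree.set(e.target, (inDegree.get(e.target) ?? 0) + 1);

  const byId = new Map<string, PipelineNode>();
  for (const n of nodes) byId.set(n.id, n);

  const queue = nodes.filter((n) => inDegree.get(n.id) === 0);
  const sorted: PipelineNode[] = [];

  while (queue.length > 0) {
    const node = queue.shift()!;
    sorted.push(node);
    for (const e of edges) {
      if (e.source !== node.id) continue;
      const remaining = (inDegree.get(e.target) ?? 0) - 1;
      inDegree.set(e.target, remaining);
      if (remaining === 0) queue.push(byId.get(e.target)!);
    }
  }

  if (sorted.length !== nodes.length) {
    throw new Error("Pipeline graph contains a cycle");
  }
  return sorted;
}

// Computes every node in dependency order and returns outputs keyed by node id.
function runPipeline(nodes: PipelineNode[], edges: Edge[], mode: Mode): Map<string, unknown> {
  const outputs = new Map<string, unknown>();
  for (const node of topologicalSort(nodes, edges)) {
    const inputs: Record<string, unknown> = {};
    for (const e of edges) {
      if (e.target === node.id) inputs[e.source] = outputs.get(e.source);
    }
    outputs.set(node.id, node.compute(inputs, mode));
  }
  return outputs;
}
```

In this sketch the same `runPipeline` function serves both modes; each node decides what to do differently when it receives "run" or "load".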

Reusability with the MVC pattern

The graph is also used for the UI. Pipelines have two rendering modes: graph and standard. The graph view uses Svelteflow, whilst the standard view performs a topological sort and renders the nodes in a single column.

Nodes use the MVC pattern. The compute functions hold the Model and the Controller. The graph UI components hold the View.

Since the MC and V are decoupled, we can use different combinations of MC <-> V to yield many combinations of compute + UI entities whilst minimizing code and maximizing reusability.
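
To make the decoupling concrete, here is a small sketch in which compute classes (Model + Controller) and views are registered independently and then paired. Everything in it is an assumption made for illustration: the `nodeTypes` registry, the node classes and the type keys are hypothetical, and views are plain functions standing in for Svelte components.

```typescript
interface NodeData {
  label: string;
  output?: unknown;
}

// Model + Controller: holds the node's data and knows how to compute it.
interface ComputeNode {
  data: NodeData;
  compute(input: unknown): unknown;
}

// View: a plain function standing in for a Svelte component.
type View = (data: NodeData) => string;

const textView: View = (d) => `${d.label}: ${String(d.output ?? "")}`;
const jsonView: View = (d) => `${d.label}: ${JSON.stringify(d.output)}`;

// Two hypothetical compute nodes.
class UppercaseNode implements ComputeNode {
  data: NodeData = { label: "Uppercase" };
  compute(input: unknown): unknown {
    this.data.output = String(input).toUpperCase();
    return this.data.output;
  }
}

class WordCountNode implements ComputeNode {
  data: NodeData = { label: "Word count" };
  compute(input: unknown): unknown {
    this.data.output = String(input).split(/\s+/).filter(Boolean).length;
    return this.data.output;
  }
}

// Hypothetical registry: any compute class can be paired with any view.
const nodeTypes: Record<string, { create: () => ComputeNode; view: View }> = {
  uppercase: { create: () => new UppercaseNode(), view: textView },
  uppercase_json: { create: () => new UppercaseNode(), view: jsonView },
  word_count: { create: () => new WordCountNode(), view: textView },
};

function render(type: string, input: unknown): string {
  const entry = nodeTypes[type];
  if (!entry) throw new Error(`Unknown node type: ${type}`);
  const node = entry.create();
  node.compute(input);
  return entry.view(node.data);
}
```

Here `uppercase` and `uppercase_json` share one compute class but differ in view, while `uppercase` and `word_count` share one view but differ in compute class.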

Documentation

Our AI Pipeline Engineering Guide #1 takes the reader step by step through the process of creating a report pipeline.

Our user docs provide a very high-level overview of the application for non-technical users.

Cloning

$ git clone https://github.com/AIObjectives/talk-to-the-city-reports

Firebase

The application can be hosted anywhere, although the persistence layer is currently coupled with Firestore and Google Cloud Storage.

Setting up a firebase instance


Since the app uses Firebase, you'll need a dev / staging Firebase instance for local development and for deployment. You have two options:

  • setting up your own instance.
  • using AOI's dev instance.

Deploying and maintaining Google Cloud Platform resources is fairly simple and straightforward, although it requires the gcloud and gsutil CLI applications. Before getting started, make sure you have both correctly installed and authenticated.

https://cloud.google.com/sdk/docs/install

Setting up your own instance

To set up your own instance:

  • Head over to https://console.firebase.google.com/
  • Click "add project" and enter a project name
  • Disable google analytics
  • Click "create project" & continue
  • Under "Get started by adding Firebase to your app" click on the web </> icon
  • Add an app nickname (same as earlier)
  • Click "firebase hosting" if you intend to deploy the app
  • Click "register app"
  • Copy .env.example to .env in the turbo directory
  • Copy & paste the values of the variables.
  • Click next.
  • npm install -g firebase-tools
  • firebase login

Setting up authentication

  • In the project overview, click on "Authentication"
  • Click on "set up sign-in method"
  • Click 'Google'
  • Click 'enable'
  • Select a support email address
  • Click 'save'

Setting up firestore

  • In the project overview, in the left side panel, click on "build"
  • Click on "firestore database"
  • Click "Create Database"
  • Select your region / multi region
  • Click 'next'
  • Click 'Start in test mode'
  • Click 'enable'

N.B. Firestore rules are still being finalized. Please contact @lightningorb to find out more.

Setting up Google Cloud Storage

  • In the project overview, in the left side panel, click on "build"
  • Click on 'storage'
  • Click 'get started'
  • Click 'start in test mode'
  • Click next
  • Click done

Setting up CORS on GCS

  • Install and configure the gsutil application
  • Save the following in a temporary cors.json file
[
  {
    "origin": ["http://localhost:5173", "https://<optional_deployment_url>"],
    "method": ["GET", "HEAD", "DELETE"],
    "responseHeader": ["Content-Type"],
    "maxAgeSeconds": 3600
  }
]
  • Run the following:
gsutil cors set cors.json gs://<project-name>.appspot.com

Setting up the service account

Authenticated backend endpoints require the service account file:

  • in the console for the project, click on project settings (the cog icon)
  • click on "service accounts"
  • click on Manage service account permissions
  • look for the email address that matches the project id
    • click actions
    • click create key
  • save the json private key to turbo/src/lib/service-account-pk.json
  • add the environment variable to your shell: export GOOGLE_APPLICATION_CREDENTIALS="src/lib/service-account-pk.json"
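
Since the path in GOOGLE_APPLICATION_CREDENTIALS above is relative, it only resolves when commands are run from the turbo directory. The following is a minimal sanity check one could run under Node. It is not part of the project: `checkServiceAccount` is a hypothetical helper, and the only assumption about the key file is that a service account JSON key carries `"type": "service_account"`.

```typescript
import { existsSync, readFileSync } from "node:fs";

// Hypothetical helper: returns a human-readable status for the credentials variable.
function checkServiceAccount(env: Record<string, string | undefined>): string {
  const path = env["GOOGLE_APPLICATION_CREDENTIALS"];
  if (!path) return "GOOGLE_APPLICATION_CREDENTIALS is not set";
  // A relative path is resolved against the current working directory.
  if (!existsSync(path)) return `No file found at ${path}`;
  try {
    const key = JSON.parse(readFileSync(path, "utf8"));
    if (key.type !== "service_account") {
      return `File at ${path} does not look like a service account key`;
    }
    return "OK";
  } catch {
    return `File at ${path} is not valid JSON`;
  }
}

// Usage (from the turbo directory):
//   console.log(checkServiceAccount(process.env));
```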

Post fresh install steps

DB 'dataset' index

After launching the app for the first time, check your dev console: it will contain a link for creating an index for datasets.

Templates

Talk to the City turbo uses pipeline templates, so end users do not have to construct their own graphs.

You can manage templates via http://localhost:5173/templates or https://tttc-turbo.web.app/templates.

Admin UID

The .env file contains a VITE_ADMIN variable that should be filled in with your user id, which can be acquired from the Firestore database.

Using AOI's dev instance

  • Contact @brittneygallagher or @lightningorb for credentials files
  • save the provided .env in turbo/
  • optional steps for deployment:
    • save the provided service-account-pk.json in turbo/src/lib/
    • npm install -g firebase-tools
    • firebase login

Disclaimer: by using a shared dev instance, you acknowledge that the data is shared by nature, and therefore no privacy guarantees can be made for the data you choose to upload to the platform. For better privacy, consider setting up your own instance.

Deploying to firebase

Once you're done making your changes, you can deploy to firebase with:

$ firebase deploy

Multi-site deployments

Firebase makes it easy to deploy to multiple sites that use the same project resources.

To specify a different site:

  • modify .hosting.site in turbo/firebase.json
  • run firebase deploy --only hosting:<alt-site-name>

Running

Once you have set up a Firebase instance:

Node version tested: v18.0.0

$ cd talk-to-the-city-reports/turbo
$ npm install --legacy-peer-deps # or --force
$ npm run dev

Dev documentation

Adding new node types


To add pipeline computation nodes:

  • create the compute function in src/lib/compute/
  • look for a suitable UI component in src/components/
    • In the vast majority of cases, you should be able to simply use an existing UI component. If a UI component does not suit your needs, then feel free to create a new one.
  • bind the node's compute type to a component in src/lib/node_types.ts
  • add the node to src/lib/templates.ts
  • add node documentation to src/lib/docs
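
As a starting point for the first step, here is a sketch of what a simple compute class might look like. It is an assumption for illustration only: `MergeCsvNode`, its data shape and its `compute` signature are hypothetical, and the existing nodes in src/lib/compute/ should be treated as the reference. The behaviors it shows (concatenating inputs, clearing the dirty flag, not mutating inputs) mirror the test names listed under Test Results.

```typescript
type Row = Record<string, string>;

interface MergeNodeData {
  label: string;
  dirty: boolean;
  output: Row[];
}

// Hypothetical compute class; real nodes may use a different signature.
class MergeCsvNode {
  data: MergeNodeData = { label: "Merge CSV", dirty: true, output: [] };

  // inputs: one array of rows per upstream node, keyed by upstream node id.
  compute(inputs: Record<string, Row[]>): Row[] {
    const merged: Row[] = [];
    for (const rows of Object.values(inputs)) {
      // Copy each row so that upstream data is never mutated.
      for (const row of rows) merged.push({ ...row });
    }
    this.data.output = merged;
    this.data.dirty = false;
    return merged;
  }
}
```

Keeping the compute class free of UI concerns like this is what allows it to be bound to an existing UI component rather than requiring a new one.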

Node UI component hierarchy

The primary UI components displayed to users are called "nodes" as they are part of a dependency graph.

The docs that appear when the user presses the ? mark are stored in:

src/lib/docs

Adding text inside nodes:

The UI nodes are stored in ./turbo/src/components/graph/nodes.

DGNode is the 'base' node that all nodes reuse. DefaultNode is an empty generic node, used when a node doesn't have a specialized UI. DefaultNode is the generic file upload, which CSVNode and JSON reuse.

Nodes that need prompts to interact with GPTs, such as "Argument Extraction" and "Cluster Extraction", use the PromptNode.

Internationalization


src/lib/i18n/en.json
src/lib/zh-TW.json

Since we use internationalization, UI strings use:

<script lang='ts'>
    import { _ as __ } from 'svelte-i18n';
</script>


<p>{$__('this_is_a_string')}</p>

The localized strings are then added to their respective src/lib/<lang>.json files.
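
To show how a key such as `this_is_a_string` relates to the per-language files, here is a dependency-free sketch of a key lookup with a fallback locale. It is not svelte-i18n's implementation: the `translate` function, the inline dictionaries and their contents are assumptions standing in for the JSON files, and returning the key itself when no translation exists is a choice made for this sketch.

```typescript
type Messages = Record<string, string>;

// Stand-ins for the contents of the per-language JSON files (hypothetical values).
const en: Messages = { this_is_a_string: "This is a string" };
const zhTW: Messages = { this_is_a_string: "這是一個字串" };

const dictionaries: Record<string, Messages> = { en, "zh-TW": zhTW };

// Looks the key up in the requested locale, then in the fallback locale,
// and finally returns the key itself so that missing strings stay visible.
function translate(key: string, locale: string, fallback: string = "en"): string {
  return dictionaries[locale]?.[key] ?? dictionaries[fallback]?.[key] ?? key;
}
```

Adding a new UI string therefore means adding the same key to every language file, so that no locale has to rely on the fallback.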

Tests & TDD


The core functionality of the nodes is covered by tests. It is strongly recommended to run the tests and keep them running while you make changes (vitest uses a daemon with file watch).

$ npm run test-ui

Testing the live website

brew install xorg-server
pip install chromedriver-autoinstaller selenium pyvirtualdisplay
DISPLAY=:99 python src/test/test_selenium.py

Test Results

Metric | Count
Total Test Suites | 106
Passed Test Suites | 106
Failed Test Suites | 0
Pending Test Suites | 0
Total Tests | 215
Passed Tests | 215
Failed Tests | 0
Pending Tests | 0
Todo Tests | 0
Test Status Duration (ms)
testing vimeo claim passed
testing yt claim passed
testing yt link has si passed
testing yt link has timestamp passed
testing yt link has si and timestamp passed
testing no video passed
testing no claim throws error passed
Test Status Duration (ms)
should concatenate multiple CSV inputs into a single output array passed
should handle empty input arrays passed
should handle a single input array passed
should set dirty to false after compute passed
should return an empty array if no inputs are provided passed
should not mutate the input data passed
Test Status Duration (ms)
extract the given arguments passed
should not extract the arguments if no csv passed
should not extract the arguments if no open_ai_key and no GCS passed
should load from GCS if no open ai key passed
should not extract the arguments if no prompt and no system prompt passed
test GCS caching passed
Test Status Duration (ms)
extract the given arguments passed
extract the given arguments with missing rows in CSV passed
should not extract the arguments if no csv passed
should not extract the arguments if no open_ai_key and no GCS passed
should load from GCS if no open ai key passed
should not extract the arguments if no prompt and no system prompt passed
test GCS caching passed
Test Status Duration (ms)
should return the cached output if not dirty and output exists passed
should read audio from GCS and update size and mime_type if download is true passed
should create an empty audio file if download is false passed
should set dirty to false after compute passed
should return undefined if gcs_path is not set passed
Test Status Duration (ms)
compute should set output to messages and dirty to false passed
Test Status Duration (ms)
extract the cluster passed
should not extract the cluster if no csv passed
should not extract the cluster if no open_ai_key passed
should not extract the cluster if no prompt and no system prompt passed
test GCS caching passed
Test Status Duration (ms)
extract the cluster passed
should not extract the cluster if no csv passed
should not extract the cluster if no open_ai_key passed
should not extract the cluster if no prompt and no system prompt passed
test GCS caching passed
Test Status Duration (ms)
should concatenate comments until reaching 100 words, then start a new chunk passed
should start a new chunk when the interview field changes passed
should handle an empty input array passed
should not lose the last comment if it does not exceed 100 words passed
should correctly handle comments with exactly 100 words passed
Test Status Duration (ms)
should correctly count tokens in input data passed
should not count tokens if input data length matches and node is not dirty passed
should count tokens if the input data is a string passed
Test Status Duration (ms)
should process CSV data correctly from GCS passed
should handle empty CSV data from GCS passed
should handle rows with uneven columns from GCS passed
Test Status Duration (ms)
Find by compute type passed
Simple pipeline run test passed
Full pipeline run test passed
Test Status Duration (ms)
generates new columns passed
deletes columns passed
renames columns passed
returns undefined if input is undefined passed
handles multiple operations passed
does not modify input if no operations are specified passed
does not crash if input is empty passed
Test Status Duration (ms)
should filter CSV data inclusively based on provided filters passed
should filter CSV data exclusively based on provided filters passed
should return all data if no filters are set passed
should handle multiple filters correctly passed
should set dirty to false after compute passed
should not mutate the input data passed
Test Status Duration (ms)
should compute embeddings for input data passed
should not compute embeddings if no open_ai_key is provided passed
should load embeddings from GCS if data length matches and save_to_gcs is true passed
should handle no data input passed
Test Status Duration (ms)
general prompt passed
json prompt passed
json prompt with text passed
Test Status Duration (ms)
sets the output of the node to the input data passed
Test Status Duration (ms)
should process data correctly with JQ filter passed
should handle invalid JQ filter passed
Test Status Duration (ms)
should process data correctly with JQ filter passed
should handle invalid JQ filter passed
should return an empty array when no matches found passed
should process data correctly with a complex JQ filter passed
should return undefined if the input is null or undefined passed
Test Status Duration (ms)
should process JSON data correctly from GCS passed
should handle invalid JSON data from GCS passed
should update dirty state correctly passed
Test Status Duration (ms)
evaluates JSONata expressions passed
returns undefined if no expression is provided passed
catches errors when evaluating expressions passed
Test Status Duration (ms)
should let all data pass through if number is left blank passed
should limit the number of rows correctly, for an object passed
should return all rows if limit is greater than number of rows passed
should return an empty array if input is empty passed
should not mutate the input node passed
Test Status Duration (ms)
should set markdown data if input is a string passed
should combine multiple string inputs with separation passed
should wrap non-string inputs within code block passed
should handle an empty input object passed
should preserve the order of inputs when combining passed
should stringify and wrap arrays in code blocks passed
should throw an error if input data contains circular references passed
Test Status Duration (ms)
merges cluster_extraction and argument_extraction data passed
does not merge if cluster_extraction data is missing passed
does not merge if argument_extraction data is missing passed
does not merge if cluster_extraction data has no topics passed
sets node data output to the merged data and dirty to false after merge passed
Test Status Duration (ms)
merges cluster extraction data passed
does not merge if cluster extractions are missing passed
uses cached data if available and not dirty passed
does not merge if no open_ai_key is provided passed
Test Status Duration (ms)
should merge cluster extractions into a single output passed
should handle empty input data passed
should not process if no open_ai_key is provided passed
Test Status Duration (ms)
should return the cached output if not dirty and output exists passed
should read audio from GCS and update size and mime_type if download is true passed
should create empty audio files if download is false passed
Test Status Duration (ms)
should split CSV into chunks and process each chunk passed
should handle empty CSV input passed
should not process if no open_ai_key is provided passed
Test Status Duration (ms)
should process multiple prompts passed
should process multiple differing prompts passed
should join outputs if join_output is true passed
should not process if no open_ai_key is provided passed
Test Status Duration (ms)
should process multiple audio files passed
should handle empty audio input passed
should update node_info with results from WhisperNode computations passed
should remove entries from node_info that are not in the audio list passed
should mark node_info entry as dirty if WhisperNode output is null passed
Test Status Duration (ms)
should set the key in cookies if the UI key is valid passed
if ui key is set but invalid use local key passed
should set the node text to "Invalid key" if the UI key is not valid and there is no local key passed
should not mutate the node if the UI key and local key are both valid passed
Test Status Duration (ms)
filters participants based on the provided name passed
removes subtopics with no claims after filtering passed
removes topics with no subtopics after filtering passed
returns undefined if input data does not contain topics passed
does not filter claims if interview key is missing passed
Test Status Duration (ms)
should set the key in cookies if the UI key is provided passed
should use the local key from cookies if available passed
should return an empty string if no key is provided or available in cookies passed
Test Status Duration (ms)
should initialize Pinecone with the provided API key passed
should create a new index if it does not exist and upsert embeddings passed
should list Pinecone indexes passed
should provide tools for querying Pinecone index passed
Test Status Duration (ms)
should execute python script and return outputData passed
should be able to pass input to outputData passed
test passing in complex data from jsonapi passed
Test Status Duration (ms)
should execute python script and return outputData passed
should be able to pass input to outputData passed
should be able to make get requests to jsonapi passed
Test Status Duration (ms)
should execute python script and return output passed
should handle fetch errors gracefully passed
should handle invalid JSON response passed
should handle non-string JSON response passed
should update node data output with the response passed
Test Status Duration (ms)
test node registeration passed
Load all nodes passed
Test Status Duration (ms)
should set the output of the node to the input data passed
should handle empty input data passed
should not mutate the input node passed
Test Status Duration (ms)
sets the output of the node to the input data passed
handles translation passed
uploads data to GCS on run passed
reads data from GCS on load if gcs_path is set and input data is empty passed
clears gcs_path if readFileFromGCS throws an error passed
sets message if merge and csv data are present passed
sets message to empty string if merge or csv data are missing passed
does not mutate the input node passed
Test Status Duration (ms)
scores the relevance of arguments passed
uses cached data if available and not dirty passed
does not score if argument_extraction data is missing passed
does not score if open_ai_key is missing passed
does not score if prompts are missing passed
Test Status Duration (ms)
should set the key in cookies if the UI key is provided passed
should use the local key from cookies if available passed
should return an empty string if no key is provided or available in cookies passed
Test Status Duration (ms)
should process CSV data correctly from GCS passed
Test Status Duration (ms)
should correctly stringify input data passed
should return input if it cannot be stringified passed
should handle different types of input passed
should not mutate the input node passed
Test Status Duration (ms)
should generate summaries for topics and subtopics passed
should load summaries from GCS if data length matches passed
Test Status Duration (ms)
integer node passed
adder node passed
dataset run adder passed
dataset run multi input multi output passed
Test Status Duration (ms)
should convert a single text input to CSV format passed
should convert multiple text inputs to CSV format passed
should handle empty text input passed
should split text into chunks if it exceeds the number of tokens passed
Test Status Duration (ms)
translates the input data passed
loads translations from GCS if data has not changed passed
does not translate if required inputs are missing passed
Test Status Duration (ms)
should return unique values based on the specified property passed
should return an empty array if input is empty passed
should return undefined if no property is specified passed
should set dirty to false after compute passed
should not mutate the input data passed
Test Status Duration (ms)
Test secondsToHHMMSS passed
Test secondsToHHMMSS with string passed
Test HHMMSSToSeconds passed
Test Status Duration (ms)
should load from cache if data is not dirty and gcs_path is set passed
should load from GCS if data is not dirty, gcs_path is set, and output is empty and audio size matches passed
should transcribe audio and upload to GCS if data is dirty passed
should return undefined and set message if open_ai_key is missing passed
should convert transcription to internal format if response_format is custom passed
Test Status Duration (ms)
should load from cache if data is not dirty and gcs_path is set passed
should load from GCS if data is not dirty, gcs_path is set, and output is empty and audio size matches passed
should transcribe audio and upload to GCS if data is dirty passed
should return undefined and set message if open_ai_key is missing passed
should convert transcription to internal format if response_format is custom passed
Test Status Duration (ms)
should execute function in workerpool passed
should execute delayed function in workerpool passed