Feat/engt 1717 #979

Draft
wants to merge 108 commits into
base: develop
108 commits
fc84f48
initialized package
adhocmaster Aug 10, 2023
5c038c2
fixed paths
adhocmaster Aug 10, 2023
54b9e18
URLUtils to be tested
adhocmaster Aug 10, 2023
9cf0ccb
asyncified
adhocmaster Aug 10, 2023
b81bf5d
fixed compilation errors. brands cannot be used in enums
adhocmaster Aug 10, 2023
c11938e
why are tests not compiled?
adhocmaster Aug 10, 2023
fff3841
working on the classifier
adhocmaster Aug 10, 2023
17f6457
fix types
adhocmaster Aug 10, 2023
088221c
fixed typing
adhocmaster Aug 10, 2023
84a1ebe
url keywords are hard
adhocmaster Aug 10, 2023
ebe3092
match started
adhocmaster Aug 11, 2023
6f5fd39
prototype for detection is done
adhocmaster Aug 11, 2023
20bb043
refactored
adhocmaster Aug 15, 2023
049550b
most methods in URLUtils are synchronized calls. Sets instead of List…
adhocmaster Aug 15, 2023
d8a875c
Update KeywordRepository.ts
adhocmaster Aug 15, 2023
65ca28f
Merge branch 'develop' into feat/ENGT-1699
adhocmaster Aug 17, 2023
d7d476e
updated yarn lock
adhocmaster Aug 17, 2023
086ba04
interfaces
adhocmaster Aug 17, 2023
44566cf
next builder patterns for prompts
adhocmaster Aug 17, 2023
b7f93f8
Update README.md
adhocmaster Aug 17, 2023
20bfcab
builder
adhocmaster Aug 17, 2023
27f953c
Create ProductUtils.ts
adhocmaster Aug 17, 2023
b7c93bd
more primitives and first prompt builder
adhocmaster Aug 17, 2023
5edda16
working on the director
adhocmaster Aug 17, 2023
f05f178
rest of chalie's comments
adhocmaster Aug 22, 2023
d42c829
Merge branch 'develop' into feat/ENGT-1699
adhocmaster Aug 22, 2023
193d4a9
Merge branch 'feat/ENGT-1699' into feat/ENGT-1717
adhocmaster Aug 22, 2023
9e3abcc
refactoring, supporting mechanism for multiple provider structures
adhocmaster Aug 22, 2023
1bafd1f
now unit tests
adhocmaster Aug 22, 2023
4beb8d2
started tests
adhocmaster Aug 22, 2023
3712b5f
Update PurchaseHistoryPromptBuilder.test.ts
adhocmaster Aug 22, 2023
798b72e
Merge branch 'develop' into feat/ENGT-1717
adhocmaster Aug 22, 2023
a88618f
Update PurchaseHistoryPromptBuilder.test.ts
adhocmaster Aug 24, 2023
0060b94
new api works
adhocmaster Aug 24, 2023
8014175
moving to product persistence
adhocmaster Aug 24, 2023
6c05d55
shopping data package initiated
adhocmaster Aug 24, 2023
e00dec1
working on parsing
adhocmaster Aug 24, 2023
e7252e9
testing parser. updated prompt to have consistent prompt
adhocmaster Aug 24, 2023
8597321
fix
adhocmaster Aug 24, 2023
cb51dda
response parser works. Now the repository
adhocmaster Aug 24, 2023
2901106
provider and prompt director tested
adhocmaster Aug 25, 2023
928fe6d
scraper service almost done. Need the repository
adhocmaster Aug 25, 2023
d45cdca
Amazon utils for navigation
adhocmaster Aug 29, 2023
f50d182
Merge branch 'develop' into feat/ENGT-1717
adhocmaster Aug 29, 2023
6efe013
objects and common utils version changes
adhocmaster Aug 29, 2023
a388307
not compiling due to node utils
adhocmaster Aug 29, 2023
44006e6
text processor test
adhocmaster Aug 29, 2023
e9fb86d
text processor pass
adhocmaster Aug 29, 2023
3dca862
extracted Persistence Interface
adhocmaster Aug 29, 2023
e4434b8
shopping data purchase repository.
adhocmaster Aug 29, 2023
31a8d4c
circular circular
adhocmaster Aug 29, 2023
c5b6738
repository injected
adhocmaster Aug 29, 2023
d0b3298
Update LLMScraperService.ts
adhocmaster Aug 29, 2023
aad279c
now test the LLMScraper and you are done for the api for now
adhocmaster Aug 29, 2023
5b33258
having trouble with testdouble
adhocmaster Aug 31, 2023
02e0d80
one service for all
adhocmaster Sep 5, 2023
4f37abe
added language to purchases. writing purchase utils. updated persiste…
adhocmaster Sep 5, 2023
433b730
working on nlp package
adhocmaster Sep 6, 2023
71924ba
nlp nlp
adhocmaster Sep 6, 2023
dcb746f
close to hashing products
adhocmaster Sep 6, 2023
4fbcb56
Stemmer requires stopwords
adhocmaster Sep 6, 2023
a5725a1
product has working
adhocmaster Sep 6, 2023
442aad2
purchase utils close
adhocmaster Sep 6, 2023
2aca360
undefined tokens
adhocmaster Sep 6, 2023
ded0d25
Update PurchaseUtils.test.ts
adhocmaster Sep 6, 2023
35712ac
Update PurchaseRepository.ts
adhocmaster Sep 6, 2023
fe542c5
doc and linked nlp to core
adhocmaster Sep 12, 2023
f8cc930
Merge branch 'develop' into feat/ENGT-1717
adhocmaster Sep 14, 2023
2da8519
fixing merge 1. Master indexer?
adhocmaster Sep 14, 2023
e96b71a
total refactoring
adhocmaster Sep 14, 2023
97599f9
import errors
adhocmaster Sep 14, 2023
82714ad
tests runs after refactoring
adhocmaster Sep 14, 2023
976e08f
interfaces exported through core
adhocmaster Sep 14, 2023
8c24582
something is wrong with indexer container registration
adhocmaster Sep 14, 2023
8656a78
duplicate indexer registrations
adhocmaster Sep 14, 2023
60ab2ab
fixed modules and refactored
adhocmaster Sep 14, 2023
5f6dee4
plumbing wor4ks
adhocmaster Sep 14, 2023
f65ba6a
fixing tiktoken
adhocmaster Sep 21, 2023
3897476
Update package.json
adhocmaster Sep 21, 2023
82e212b
Update yarn.lock
adhocmaster Sep 21, 2023
92f4ddd
added missing injections
adhocmaster Sep 25, 2023
78e20ce
Merge branch 'develop' into feat/ENGT-1717
adhocmaster Sep 28, 2023
3a71998
missing modules
adhocmaster Oct 5, 2023
9edc057
Update yarn.lock
adhocmaster Oct 5, 2023
a893866
Update LLMPurchaseHistoryUtilsChatGPT.ts
adhocmaster Oct 11, 2023
c4b7e5e
handled non purchases and empties
adhocmaster Oct 11, 2023
da41053
print error when object store fails
adhocmaster Oct 17, 2023
51405d7
Update VolatileStorageSchemaProvider.ts
adhocmaster Oct 18, 2023
0cb2107
Update IndexedDB.ts
adhocmaster Oct 18, 2023
dde04a6
purchase id
adhocmaster Oct 18, 2023
38651cc
unknown category
adhocmaster Oct 18, 2023
c6a56eb
Update purchases.ts
adhocmaster Oct 18, 2023
9391996
Update IndexedDB.ts
adhocmaster Oct 19, 2023
4745b2d
Merge branch 'develop' into feat/ENGT-1717
adhocmaster Oct 19, 2023
aa174f4
working on product meta
adhocmaster Oct 19, 2023
44379d2
prompt builder almost done
adhocmaster Oct 19, 2023
2100d38
product meta utils test done
adhocmaster Oct 19, 2023
710475b
testing prompt director
adhocmaster Oct 19, 2023
01d96b3
Update LLMProductMetaUtilsChatGPT.test.ts
adhocmaster Oct 19, 2023
5271274
wiring up, firing up
adhocmaster Oct 19, 2023
b1ca605
gotta optimize by not querying categories for purchases if there is a…
adhocmaster Oct 20, 2023
075710b
cleaner code
adhocmaster Oct 23, 2023
1a2ebc7
purchase utils works
adhocmaster Oct 23, 2023
236ddb8
improved types
adhocmaster Oct 23, 2023
d4fea37
amazon content reduction. trim text as maximum tokens is 4096 includi…
adhocmaster Oct 24, 2023
bbefcc7
Update LLMScraperService.ts
adhocmaster Oct 24, 2023
a1d1b99
Update LLMProductMetaUtilsChatGPT.ts
adhocmaster Oct 24, 2023
f4f7665
Update LLMProductMetaUtilsChatGPT.ts
adhocmaster Oct 24, 2023
7 changes: 6 additions & 1 deletion .eslintrc
@@ -2,7 +2,12 @@
"parser": "@typescript-eslint/parser", // Specifies the ESLint parser
"parserOptions": {
"ecmaVersion": 2020, // Allows for the parsing of modern ECMAScript features
"sourceType": "module" // Allows for the use of imports
"sourceType": "module", // Allows for the use of imports
// "babelOptions": {
// "plugins": [
// "@babel/plugin-syntax-import-assertions"
// ],
// },
},
"extends": [
"plugin:import/errors",
69 changes: 66 additions & 3 deletions documentation/ai-scraper/README.md
@@ -78,7 +78,7 @@ participant SJR as Scraper Job Repository
participant LLM as LLM Scraper
participant PU as PromptUtils
participant LLMProvider as LLM Provider
participant T as Text Preprocessor
participant Pre as HTML Preprocessor
participant DTU as DomainTaskUtils

loop Periodically
@@ -95,8 +95,8 @@ loop Periodically
PU-->LLM: Prompt

loop for each job
LLM->>T: convert HTML to text and minimize
T-->>LLM: minimzed Text
LLM->>Pre: convert HTML to text and minimize
Pre-->>LLM: minimized text
LLM->>PU: add to Prompt if have token budget
end

@@ -118,3 +118,66 @@ end

1. Web-workers
2. IPFS to collect URLs?

### Prompt Builder

```mermaid
classDiagram

class PromptDirector {
+makePurchaseHistoryPrompt(data) Prompt
}

PromptDirector --> PromptBuilder

class PromptBuilder {
<<interface>>
+setExemplars(exemplars)
+setRole(role)
+setQuestion(question)
+setAnswerStructure(structure)
+setData(data)
+getPrompt() Prompt
}

class PurchaseHistoryPromptBuilder {

}
PromptBuilder <|-- PurchaseHistoryPromptBuilder

class ShoppingCartPromptBuilder {

}
PromptBuilder <|-- ShoppingCartPromptBuilder

%% Collection prompts

class CollectionPromptBuilder {

<<interface>>
}
PromptBuilder <|-- CollectionPromptBuilder

class ProductCollectionPromptBuilder {

}
CollectionPromptBuilder <|-- ProductCollectionPromptBuilder

class GameCollectionPromptBuilder {

}
CollectionPromptBuilder <|-- GameCollectionPromptBuilder


%% Single Item Prompts
class ItemDetailsPromptBuilder {

}
PromptBuilder <|-- ItemDetailsPromptBuilder

class OrderDetailsPromptBuilder {

}
ItemDetailsPromptBuilder <|-- OrderDetailsPromptBuilder

```
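
The relationship between `PromptDirector` and `PromptBuilder` in the diagram can be sketched as below. This is a minimal illustration, not the code in this PR: `Prompt` is modelled as a plain string, and the role, question, and answer-structure texts are hypothetical placeholders.

```typescript
// Assumption: Prompt is a plain string here; the real package may use a branded type.
type Prompt = string;

interface PromptBuilder {
  setExemplars(exemplars: string[]): void;
  setRole(role: string): void;
  setQuestion(question: string): void;
  setAnswerStructure(structure: string): void;
  setData(data: string): void;
  getPrompt(): Prompt;
}

// Hypothetical concrete builder: stores each part and joins them in a fixed order.
class PurchaseHistoryPromptBuilder implements PromptBuilder {
  private exemplars: string[] = [];
  private role = "";
  private question = "";
  private answerStructure = "";
  private data = "";

  setExemplars(exemplars: string[]): void { this.exemplars = exemplars; }
  setRole(role: string): void { this.role = role; }
  setQuestion(question: string): void { this.question = question; }
  setAnswerStructure(structure: string): void { this.answerStructure = structure; }
  setData(data: string): void { this.data = data; }

  getPrompt(): Prompt {
    // Empty parts are dropped so that unset fields leave no blank sections.
    return [this.role, ...this.exemplars, this.question, this.answerStructure, this.data]
      .filter((part) => part.length > 0)
      .join("\n\n");
  }
}

// The director knows which parts a purchase-history prompt needs;
// the builder knows how to assemble them.
class PromptDirector {
  private builder: PromptBuilder;

  constructor(builder: PromptBuilder) {
    this.builder = builder;
  }

  makePurchaseHistoryPrompt(data: string): Prompt {
    this.builder.setRole("You are an expert in understanding e-commerce."); // placeholder text
    this.builder.setQuestion("List the purchases found in the following content."); // placeholder text
    this.builder.setAnswerStructure("Answer as a JSON array of objects."); // placeholder text
    this.builder.setData(data);
    return this.builder.getPrompt();
  }
}

const director = new PromptDirector(new PurchaseHistoryPromptBuilder());
console.log(director.makePurchaseHistoryPrompt("Order #1: USB cable, $5.99"));
```

Keeping the assembly order inside the builder means a new prompt type (for example a shopping-cart prompt) only needs a new builder or a new director method, not changes to callers.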
29 changes: 17 additions & 12 deletions documentation/persistence layer/README.md
@@ -44,13 +44,18 @@ We add the migrator definition to the same file where we defined the Animal clas
}
}
```
3. Every entity requires a schema which is analogous to table definitions in SQL. We add the schema to [VolatileStorageSchema.ts](./../../packages/persistence/src/volatile/VolatileStorageSchema.ts) by adding an object of type [VolatileTableIndex](./../../packages/persistence/src/volatile/VolatileTableIndex.ts).
3. Every entity requires a schema which is analogous to table definitions in SQL. We add the schema to [VolatileStorageSchemaProvider.ts](./../../packages/persistence/src/volatile/VolatileStorageSchemaProvider.ts) by adding an object of type [VolatileTableIndex](./../../packages/persistence/src/volatile/VolatileTableIndex.ts).
```
new VolatileTableIndex(
ERecordKey.ANIMAL, // The name of our object store / table
"id", // primary key field.
false, // false disables the auto-increment key generator.
new AnimalMigrator(), // migrator that our database client will use to convert data into animal objects.

EBackupPriority.NORMAL, // Backup priority
3600 * 1000, // Backup Interval in milliseconds
config.backupChunkSizeTarget,
[], // Index
),

```
@@ -65,7 +70,12 @@ We add the migrator definition to the same file where we defined the Animal clas
"id", // primary key field.
false, // false disables the auto-increment key generator.
new AnimalMigrator(),
[['name', false], ['someOtherField', false], [['comp1', 'comp2'], true]

EBackupPriority.NORMAL, // Backup priority
3600 * 1000, // Backup Interval in milliseconds
config.backupChunkSizeTarget,

[['name', false], ['someOtherField', false], [['comp1', 'comp2'], true]]
),


@@ -78,25 +88,20 @@ In the examples, **this.persistence** is an instance of DataWalletPersistence. A
**Add an animal**: We add a new object to the store by wrapping it in a [VolatileStorageMetadata](./../../packages/objects/src/businessObjects/VolatileStorageMetadata.ts) object.
```
const myDog = new Animal("XX12", "Tom");
const metadata = new VolatileStorageMetadata<Animal>(
EBackupPriority.NORMAL,
myDog,
Animal.CURRENT_VERSION,
);
return this.persistence.updateRecord(ERecordKey.ANIMAL, metadata);
return this.persistence.updateRecord(ERecordKey.ANIMAL, myDog);
```

**Update an animal**: Update works exactly the same way as adding a new object. The engine will update an existing object if an object with the same primary key exists.
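
The add-or-update behaviour can be illustrated with a small in-memory stand-in. This is only a sketch of the semantics: `InMemoryStore` and `AnimalRecord` are hypothetical names, and nothing here reflects how DataWalletPersistence is actually implemented.

```typescript
// In-memory stand-in for an object store keyed by a primary-key field.
// Assumption: illustrates upsert semantics only; not the real persistence layer.
class InMemoryStore<T extends Record<string, unknown>> {
  private records = new Map<unknown, T>();
  private keyPath: keyof T;

  constructor(keyPath: keyof T) {
    this.keyPath = keyPath;
  }

  // Inserts the record, or replaces the stored one that has the same primary key.
  updateRecord(record: T): void {
    this.records.set(record[this.keyPath], record);
  }

  getAll(): T[] {
    return Array.from(this.records.values());
  }
}

type AnimalRecord = { id: string; name: string };

const store = new InMemoryStore<AnimalRecord>("id");
store.updateRecord({ id: "XX12", name: "Tom" });   // no record with this key yet: add
store.updateRecord({ id: "XX12", name: "Tommy" }); // same key: the record is replaced
console.log(store.getAll());
```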

**Delete an animal**: Currently we only support deleting by the primary key. The value of the primary key needs to be wrapped in a [VolatileStorageKey](./../../packages/objects/src/primitives/VolatileStorageKey.ts) object.

```
return this.persistence.deleteRecord(ERecordKey.ANIMAL, VolatileStorageKey("XX12"), EBackupPriority.NORMAL); // Deletes Tom from the animal store.
return this.persistence.deleteRecord(ERecordKey.ANIMAL, "XX12", EBackupPriority.NORMAL); // Deletes Tom from the animal store.
```

**Find an animal by primary key**:
```
return this.persistence.getObject(ERecordKey.ANIMAL, VolatileStorageKey("XX12"), EBackupPriority.NORMAL); // Deletes Tom from the animal store.
return this.persistence.getObject(ERecordKey.ANIMAL, "XX12", EBackupPriority.NORMAL); // Returns Tom from the animal store.
```

**Get all animals**:
@@ -105,7 +110,7 @@ In the examples, **this.persistence** is an instance of DataWalletPersistence. A
```
**Get all animals by an index**:
```
return this.persistence.getAllByIndex(ERecordKey.ANIMAL, "name", IDBValidKey("Tom")); // Analogous to the getCursor function. But it returns all the objects.
return this.persistence.getAllByIndex(ERecordKey.ANIMAL, "name", "Tom"); // Analogous to the getCursor function. But it returns all the objects.
```
**Get all primary keys**:
```
@@ -115,7 +120,7 @@ In the examples, **this.persistence** is an instance of DataWalletPersistence. A
**Get cursor**:
Cursors can return all the objects or a subset matching an index field. For details, please check the [IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API/Using_IndexedDB) documentation.
```
return this.persistence.getCursor(ERecordKey.ANIMAL, "name", IDBValidKey("Tom")); // will return a cursor with all the Toms.
return this.persistence.getCursor(ERecordKey.ANIMAL, "name", "Tom"); // will return a cursor with all the Toms.
```


1 change: 1 addition & 0 deletions documentation/shopping-data/README.md
@@ -0,0 +1 @@
# TODO
9 changes: 7 additions & 2 deletions packages/ai-scraper/package.json
@@ -1,7 +1,7 @@
{
"name": "@snickerdoodlelabs/ai-scraper",
"version": "0.0.17",
"description": "Utilities for parsing and understanding SDQL Queries",
"description": "Web scraper for Data Wallet browser extension",
"license": "MIT",
"repository": {
"type": "git",
@@ -41,9 +41,14 @@
"dependencies": {
"@snickerdoodlelabs/common-utils": "workspace:^",
"@snickerdoodlelabs/objects": "workspace:^",
"@snickerdoodlelabs/persistence": "workspace:^",
"@snickerdoodlelabs/shopping-data": "workspace:^",
"ethers": "^5.6.6",
"html-to-text": "^9.0.5",
"inversify": "^6.0.1",
"js-tiktoken": "^1.0.7",
"neverthrow": "^5.1.0",
"neverthrow-result-utils": "^2.0.2"
"neverthrow-result-utils": "^2.0.2",
"openai": "^4.0.1"
}
}
101 changes: 101 additions & 0 deletions packages/ai-scraper/src/implementations/business/ChatGPTProvider.ts
@@ -0,0 +1,101 @@
import { ILogUtilsType, ILogUtils } from "@snickerdoodlelabs/common-utils";
import { LLMError, LLMResponse, Prompt } from "@snickerdoodlelabs/objects";
import { inject, injectable } from "inversify";
import {
getEncoding,
encodingForModel,
TiktokenModel,
Tiktoken,
} from "js-tiktoken";
import { ResultAsync, okAsync, errAsync } from "neverthrow";
import OpenAI from "openai";
import {
ChatCompletion,
ChatCompletionCreateParamsNonStreaming,
CompletionCreateParamsNonStreaming,
} from "openai/resources/chat";

import {
ILLMProvider,
IOpenAIUtils,
IOpenAIUtilsType,
IScraperConfigProvider,
IScraperConfigProviderType,
} from "@ai-scraper/interfaces/index.js";

@injectable()
export class ChatGPTProvider implements ILLMProvider {
private chatModel: TiktokenModel = "gpt-3.5-turbo";
private temperature: number;
private chatEncoder: Tiktoken;
// private timeout = 5 * 60 * 1000; // 5 minutes

public constructor(
@inject(IScraperConfigProviderType)
private configProvider: IScraperConfigProvider,
@inject(ILogUtilsType)
private logUtils: ILogUtils,
@inject(IOpenAIUtilsType)
private openAIUtils: IOpenAIUtils,
) {
this.temperature = 0.1;
this.chatEncoder = encodingForModel(this.chatModel);
}

public defaultMaxTokens(): number {
return 4096;
}

public maxTokens(model: TiktokenModel): number {
switch (model) {
case "gpt-3.5-turbo":
return 4096;
case "gpt-4":
return 8192;
}
return 0;
}

public getPromptTokens(prompt: Prompt): ResultAsync<number, Error> {
const tokens = this.chatEncoder.encode(prompt); // This might take a while
return okAsync(tokens.length);
}

public executePrompt(prompt: Prompt): ResultAsync<LLMResponse, LLMError> {
const messages = [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: prompt },
];

return this.chatOnce(messages);
}

private getClient(): ResultAsync<OpenAI, LLMError> {
return this.configProvider.getConfig().andThen((config) => {
try {
const clientOptions = {
apiKey: config.scraper.OPENAI_API_KEY,
timeout: config.scraper.timeout,
};
// this.logUtils.debug("ChatGPTProvider", "constructor", clientOptions);
return okAsync(new OpenAI(clientOptions));
} catch (e) {
return errAsync(new LLMError((e as Error).message, e));
}
});
}

private chatOnce(messages): ResultAsync<LLMResponse, LLMError> {
return this.getClient().andThen((openai) => {
const params: ChatCompletionCreateParamsNonStreaming = {
model: this.chatModel,
messages: messages,
temperature: this.temperature,
};
const completionResult =
this.openAIUtils.createChatCompletionNonStreaming(openai, params);

return this.openAIUtils.parseCompletionResult(completionResult);
});
}
}
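
The "add to Prompt if have token budget" step from the sequence diagram could use `getPromptTokens` and `maxTokens` roughly as sketched below. This is a dependency-free illustration under stated assumptions: `countTokens` is a stand-in that counts whitespace-separated words rather than calling js-tiktoken, and `packPrompt` is a hypothetical helper that does not appear in this PR.

```typescript
// Stand-in tokenizer: counts whitespace-separated words.
// Assumption: the provider above uses js-tiktoken; real token counts will differ.
function countTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed.length === 0 ? 0 : trimmed.split(/\s+/).length;
}

// Appends job texts to the base prompt while the total stays within maxTokens.
// Texts that would exceed the budget are skipped and returned so they can be retried later.
function packPrompt(
  basePrompt: string,
  jobTexts: string[],
  maxTokens: number,
): { prompt: string; deferred: string[] } {
  let prompt = basePrompt;
  let used = countTokens(basePrompt);
  const deferred: string[] = [];

  for (const text of jobTexts) {
    const cost = countTokens(text);
    if (used + cost <= maxTokens) {
      prompt += "\n" + text;
      used += cost;
    } else {
      deferred.push(text);
    }
  }
  return { prompt, deferred };
}

const packed = packPrompt(
  "Extract purchases:",
  ["one two three", "four five six seven", "eight"],
  6,
);
console.log(packed.prompt);   // base prompt plus the first and third texts
console.log(packed.deferred); // the second text, which did not fit
```

In practice the budget passed in would likely be lower than the model's limit, because the context window is shared between the prompt and the generated response.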
@@ -0,0 +1,75 @@
import {
ELanguageCode,
HTMLString,
ScraperError,
} from "@snickerdoodlelabs/objects";
import { compile, convert } from "html-to-text";
import { injectable } from "inversify";
import { ResultAsync, okAsync } from "neverthrow";

import { IHTMLPreProcessor } from "@ai-scraper/interfaces/index.js";

type Converter = (html: string) => string;

@injectable()
export class HTMLPreProcessor implements IHTMLPreProcessor {
private converter: Converter;
private converterWithImages: Converter;
private converterWithLinks: Converter;
private headConverter: Converter;

public constructor() {
const options = {
selectors: [
{ selector: "a", options: { ignoreHref: true } },
{ selector: "img", format: "skip" },
],
};
this.converter = compile(options);

const optionsImages = {
baseElements: { selectors: ["body"] },
selectors: [{ selector: "a", options: { ignoreHref: true } }],
};

this.converterWithImages = compile(optionsImages);

const optionsLinks = {
baseElements: { selectors: ["body"] },
selectors: [{ selector: "img", format: "skip" }],
};
this.converterWithLinks = compile(optionsLinks);

const headOptions = {
baseElements: { selectors: ["head"] },
};
this.headConverter = compile(headOptions);
}

public getLanguage(
html: HTMLString,
): ResultAsync<ELanguageCode, ScraperError> {
return okAsync(ELanguageCode.English); // TODO: parse the <html> tag for the language; if not found, fall back to a third-party library (e.g. nlp.js) or a service such as Google Translate to detect it.
}

public htmlToText(
html: HTMLString,
options: unknown | null,
): ResultAsync<string, ScraperError> {
if (options == null) {
return okAsync(this.converter(html));
} else {
return okAsync(convert(html, options));
}
}
public htmlToTextWithImages(
html: HTMLString,
): ResultAsync<string, ScraperError> {
return okAsync(this.converterWithImages(html));
}
public htmlToTextWithLinks(
html: HTMLString,
): ResultAsync<string, ScraperError> {
return okAsync(this.converterWithLinks(html));
}
}
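
The first half of the TODO in `getLanguage` (reading the language from the `<html>` tag) might look like the sketch below. `readHtmlLang` is a hypothetical helper, not part of this PR; it returns a plain language subtag and leaves the mapping to `ELanguageCode` to the caller, since the enum's values are not shown here.

```typescript
// Hypothetical helper: reads the lang attribute of the <html> tag, if present.
// Returns the primary language subtag in lower case (for example "en" for "en-US"),
// or null when the tag or the attribute is missing.
function readHtmlLang(html: string): string | null {
  // Find the opening <html ...> tag only; lang attributes on other elements are ignored.
  const tag = html.match(/<html\b[^>]*>/i);
  if (tag === null) {
    return null;
  }
  // Require whitespace before "lang" so that attributes like xml:lang are not matched.
  const attr = tag[0].match(/\slang\s*=\s*["']?([A-Za-z]{2,3})(?:-[A-Za-z0-9-]+)?["']?/i);
  return attr === null ? null : attr[1].toLowerCase();
}

console.log(readHtmlLang('<!DOCTYPE html><html lang="en-US"><head></head></html>'));
console.log(readHtmlLang("<html><body>no language declared</body></html>"));
```

A regular expression is enough for this single attribute; a full HTML parser would be more robust against unusual markup, at the cost of an extra dependency.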