-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: adds scraper logic and token handling #190
base: development
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems generally fine changes not sure how can be QA'd
Was wondering if another approach makes sense such as only copying with high precision based on when an issue is closed as complete and only that is copied over for example. But if we are hunting down historical issues then obviously this isn't a solution
My concern is redundantly finding the same historical issues and doing the same redundant search and scrape work. I am wondering if there is a better approach to this problem.
@zugdev There are two locksfiles |
I'll create a demo video to walk through the flows. Right now, we store the timestamp of the last scrape for the user, I think it should be possible to use this timestamp as a filter to check for any newly closed issues marked as completed. If there are any, we can proceed with the scrape; otherwise, we can skip the process for that user. |
The bun.lockb one is related to Cloudflare Wrangler, the main one is yarn.lock. Use the later. |
Perhaps we should consolidate all to bun? @gentlementlegen rfc |
Ideally we should use bun everywhere for consistency, except if there is a reason not to. It matters mostly because when testing packages locally, |
Is there a way to add secrets in the final build without exposing them in |
Yes, you can store and pull them from the GitHub Action environment during deployment. |
You’ll still be able to view the values in the |
Why would it be in the final build? It should not be there. |
The scraper logic is executed when a user logs in on the client side. So, we will need voyage API key for embeddings. |
Ha yes I see now, I was missing the context sorry. We should definitely not expose that API key on the client side. Sadly, we don't use SSR so I don't know how we can handle this except having some API pour that we host ourselves somewhere, or storing it in cookies? |
Make a worker endpoint. We have a setup like this for pay.ubq.fi that's related to the cards. Check its code. @EresDev rfc |
Yes, it looks like VOYAGEAI part needs a backend. You can get started with pages functions. We store API keys in cloudflare worker secrets. However, one similar thing that ubiquity-os-kernel does differently is store the API keys in Github, but pushing it to cloudflare during worker deployment. This helps with managing the secrets in one place (GitHub) and we probably will do the same soon in |
We can store API keys in Cloudflare KV, but I don't think this is the best approach, as we could quickly hit the read limit. A better option might be Supabase Functions, which would allow us to read directly from the Supabase Vault. |
You can have environment secrets in cloud flare that do not count against any quota. We can upload then from Github environment secrets, like we do for most plug-ins. |
QA: Mean Stats:
@zugdev could you add the supabase service role key to the secrets under |
QA: https://scraperui.work-ubq-fi-50d.pages.dev/ You should be able to see the scraper function request, in the Network Tab of the Inspect Window. |
I'm mobile these days so I don't want to hold up the review somebody else please check on their computer for me. |
functions/issue-scraper.ts
Outdated
): Promise<void> { | ||
const { error } = await supabase.from("issues").upsert(issues); | ||
if (error) { | ||
throw new Error(`Error during batch upsert: ${error.message}`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can try an exponential backoff? I wonder if this is worth implementing to make this more robust although I imagine that errors are quite rare for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be possible to implement, but if a request fails and we retry multiple times, the entire request will eventually be terminated once it hits the maximum CPU wall time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retries should be handled from the client so no timeouts on the worker side should be relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the retries be handled in this task? I think the error for this function should be very rare.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I would implement if I was working on this code yeah
|
||
if (lastFetch) { | ||
const lastFetchTimestamp = Number(lastFetch); | ||
if (now - lastFetchTimestamp < 24 * 60 * 60 * 1000) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is a better way to implement this. For example, instead of by timing, we can do based on an event.
Although I can't think of how it would be possible in outside organization contexts.
Resolves #187