feat: adds scraper script #59

Merged (7 commits) on Dec 20, 2024

Conversation

Collaborator

@shiv810 shiv810 commented Dec 8, 2024

Resolves #56

  • Scrapes issues based on the username passed in.
  • Reads the token either from user input or from the CLI.
  • Updates issues in the repo, even when an issue with the same node_id already exists.
  • Issue dedup and matchmaking results.
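
The token handling and the update-on-existing-node_id behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `ScrapedIssue` shape, the `--token` flag name, the `askUser` callback, and the in-memory `Map` standing in for the database are all assumptions, and the GitHub fetch itself is left out.

```typescript
// Hypothetical shape of a scraped issue; only `node_id` is named in the PR description.
interface ScrapedIssue {
  node_id: string;
  title: string;
  body: string;
}

// Prefer a token passed as `--token <value>` (flag name is an assumption),
// otherwise fall back to asking the user via the supplied callback.
function resolveToken(argv: string[], askUser: () => string): string {
  const flagIndex = argv.indexOf("--token");
  const fromCli = flagIndex >= 0 ? argv[flagIndex + 1] : undefined;
  const token = fromCli ?? askUser();
  if (!token) {
    throw new Error("A GitHub token is required");
  }
  return token;
}

// Upsert keyed on node_id: an existing entry is overwritten rather than duplicated.
// The Map is a stand-in for the real database table.
function upsertIssues(
  store: Map<string, ScrapedIssue>,
  issues: ScrapedIssue[]
): { inserted: number; updated: number } {
  let inserted = 0;
  let updated = 0;
  for (const issue of issues) {
    if (store.has(issue.node_id)) {
      updated++;
    } else {
      inserted++;
    }
    store.set(issue.node_id, issue);
  }
  return { inserted, updated };
}
```

Keying the write on `node_id` means re-running the scraper for the same user refreshes stored issues instead of creating duplicates.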

Contributor

github-actions bot commented Dec 8, 2024

Unused files (1)

src/handlers/issue-scraper.ts

@sshivaditya2019, this task has been idle for a while. Please provide an update.

Collaborator Author

shiv810 commented Dec 12, 2024

@0x4007, this is the base scraper logic. Should I write a script that adds the issues for all the users listed in auth.users.json?

Member

0x4007 commented Dec 12, 2024

Yes, and please update the database with it. You can QA it against some task matchmaking scoring improvements and, as a second goal, some issue dedupe improvements, right?

@shiv810 shiv810 marked this pull request as ready for review December 12, 2024 21:06
@0x4007 0x4007 requested review from rndquu and whilefoo December 13, 2024 00:23
Member

0x4007 commented Dec 13, 2024

Is there some type of bias in the algorithm to make 75% the peak of the bell curve?

talent referrals

I looked through a few of the results and I think above 80% seems actually relevant. What are your thoughts?

It might make sense to exclude showing matches below 80%?

And then always recommend at least two contributors still.

issue deduplication

The markup seems very noisy, and based on my quick look, anything below 80% seems kind of irrelevant. What are your thoughts on this?

For near-term testing purposes I think we should leave all the markup on, but I can see us needing to reduce the noise and hide anything below a certain threshold, like that 80% again.

Collaborator Author

shiv810 commented Dec 13, 2024

> I looked through a few of the results and I think above 80% seems actually relevant. What are your thoughts?
>
> It might make sense to exclude showing matches below 80%?

I think we should include matches below 80%, as this would allow for a larger pool of contributors. We can always exclude them by removing alwaysRecommend and setting the jobMatchingThreshold to 0.8.
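
As a rough illustration of how the two settings named above could interact, here is a minimal sketch. It assumes `alwaysRecommend` is a minimum number of contributors to suggest and `jobMatchingThreshold` is a similarity cutoff between 0 and 1; the `Match` type and `selectMatches` function are hypothetical names, not the plugin's actual code.

```typescript
// Hypothetical match record: similarity is assumed to be a score in [0, 1].
interface Match {
  login: string;
  similarity: number;
}

// Keep matches at or above the threshold; if fewer than `alwaysRecommend`
// clear it, fall back to the top `alwaysRecommend` matches by similarity.
function selectMatches(
  matches: Match[],
  jobMatchingThreshold: number,
  alwaysRecommend: number
): Match[] {
  // Copy before sorting so the caller's array is not mutated.
  const ranked = [...matches].sort((a, b) => b.similarity - a.similarity);
  const aboveThreshold = ranked.filter((m) => m.similarity >= jobMatchingThreshold);
  return aboveThreshold.length >= alwaysRecommend
    ? aboveThreshold
    : ranked.slice(0, alwaysRecommend);
}
```

Under these assumptions, `selectMatches(matches, 0.8, 2)` hides matches below 80% but still returns at least two contributors when two or more candidates exist, while `selectMatches(matches, 0.8, 0)` applies the cutoff strictly.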

> Is there some type of bias in the algorithm to make 75% the peak of the bell curve?

The current similarity search uses a weighted sum of cosine distance (weight 0.8) and L2 distance (weight 0.2). Without this weighting, that is, using only the cosine distance [1], the results tend to cluster around 90% similarity. With the weighted sum, they are more likely to cluster around 75%. This helps make the results more varied and accurate.
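
The weighting described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the vectors are plain number arrays, and turning the combined distance into a percentage as `1 - distance` is an assumption rather than something stated in this thread.

```typescript
// Throws if the two embeddings do not have the same, non-zero length.
function assertSameLength(a: number[], b: number[]): void {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("Vectors must be non-empty and of equal length");
  }
}

// Cosine distance: 1 minus the cosine of the angle between a and b.
function cosineDistance(a: number[], b: number[]): number {
  assertSameLength(a, b);
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Euclidean (L2) distance between a and b.
function l2Distance(a: number[], b: number[]): number {
  assertSameLength(a, b);
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Weighted sum with the weights quoted in the comment (0.8 cosine, 0.2 L2).
function combinedDistance(a: number[], b: number[], wCosine = 0.8, wL2 = 0.2): number {
  return wCosine * cosineDistance(a, b) + wL2 * l2Distance(a, b);
}

// Assumed conversion from distance to a similarity score.
function combinedSimilarity(a: number[], b: number[]): number {
  return 1 - combinedDistance(a, b);
}
```

For unit-length embeddings the L2 distance equals the square root of twice the cosine distance, so under these assumptions a pair with cosine similarity 0.90 would score about 0.83 here. That is one plausible reason the weighted scores sit lower than the cosine-only ones.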

Footnotes

  1. https://docs.voyageai.com/discuss/660499a8c27dbb000f201a40

Member

0x4007 commented Dec 19, 2024

@sshivaditya @sshivaditya2019 can you merge now?

Collaborator Author

shiv810 commented Dec 19, 2024

I'll add the scraper files to the knip ignore list, then it should be good to go.

Collaborator Author

shiv810 commented Dec 19, 2024

@0x4007 Could you review this? Otherwise, this is good to merge.

Member

@0x4007 0x4007 left a comment

Passes the CI so it's fine

@0x4007 0x4007 merged commit dec3497 into ubiquity-os-marketplace:development Dec 20, 2024
2 checks passed
Development

Successfully merging this pull request may close these issues.

Scraper: Populate "Closed As Complete" Issue Specifications
3 participants