PubMed Download Pipeline #246

Draft: wants to merge 33 commits into base: main
star-nox (Member) commented Apr 2, 2024

  1. Extract article data from the XML files in this FTP folder: https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
  2. Store the data in Supabase.
  3. Extract the remaining data using the PubMed OA API.
  4. Download the articles and store them in self-hosted storage.
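
As a rough illustration of the first step, here is a minimal sketch of streaming article metadata out of a PubMed XML file. It assumes Python and the standard-library `xml.etree.ElementTree` parser; the function name `iter_articles` and the two fields it extracts are hypothetical choices for this example, not code from this PR.

```python
import io
import xml.etree.ElementTree as ET


def iter_articles(source):
    """Yield one metadata dict per <PubmedArticle>, parsing incrementally."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        title = elem.find("./MedlineCitation/Article/ArticleTitle")
        yield {
            "pmid": elem.findtext("./MedlineCitation/PMID"),
            # itertext() keeps titles that contain inline markup such as <i>.
            "title": "".join(title.itertext()) if title is not None else None,
        }
        elem.clear()  # release the subtree that was just processed


# Tiny in-memory sample standing in for a real baseline file.
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>1</PMID>
    <Article><ArticleTitle>First <i>title</i></ArticleTitle></Article>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>2</PMID>
    <Article><ArticleTitle>Second title</ArticleTitle></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

records = list(iter_articles(io.BytesIO(SAMPLE)))
```

The baseline files are gzip-compressed, so in practice a file object from `gzip.open(path, "rb")` could be passed as `source` in place of the in-memory sample.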

star-nox marked this pull request as draft on April 2, 2024 at 00:48
star-nox (Member, Author) commented Apr 8, 2024

The pipeline currently processes a single XML file, yielding metadata in batches of 100 articles. Getting one batch (100 articles plus their metadata) ready for upload took ~37 seconds. The XML file contains over 10k articles.
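
For reference, the batching described above can be sketched roughly as follows, together with a back-of-the-envelope estimate of the serial processing time per file. The helper name `batched` and the constants are assumptions for illustration; 10,000 is used as a lower bound for "over 10k".

```python
from itertools import islice


def batched(iterable, size=100):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch


# Rough serial estimate from the figures above: ~37 s per batch of 100,
# with 10,000 articles taken as a lower bound for "over 10k".
SECONDS_PER_BATCH = 37
ARTICLES_PER_FILE = 10_000
n_batches = -(-ARTICLES_PER_FILE // 100)  # ceiling division
estimated_minutes = n_batches * SECONDS_PER_BATCH / 60

# Example: 250 items split into batches of 100.
batches = list(batched(range(250), size=100))
```

Under these assumptions, a single file would take about an hour or more if batches are prepared one after another.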
TO-DO:

  1. Should data be uploaded to SQL and S3 in batches of 100 while a file is being processed, or all at once per file?
  2. If uploading in batches of 100, there will be conflicts over the shared folder: the first few functions are still adding files to it, whereas the last few functions upload all articles present in that folder.
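
One possible way around the folder conflict in item 2, sketched under assumptions: give each batch its own staging folder, so the upload step only ever sees files from the batch it belongs to. The names `process_batch`, `upload`, and `staging_root` are hypothetical, and the file-writing line is a placeholder for the real download step.

```python
import shutil
import tempfile
from pathlib import Path


def process_batch(batch, upload, staging_root):
    """Stage one batch in its own folder, upload that folder, then remove it."""
    batch_dir = Path(tempfile.mkdtemp(prefix="batch_", dir=staging_root))
    try:
        for article in batch:
            # Placeholder for the real download/extract step.
            (batch_dir / f"{article['pmid']}.txt").write_text(article["title"])
        upload(batch_dir)  # sees only this batch's files
    finally:
        shutil.rmtree(batch_dir, ignore_errors=True)


uploaded = []


def fake_upload(folder):
    # Stand-in for the real S3/SQL upload: record which files were visible.
    uploaded.append(sorted(p.name for p in folder.iterdir()))


with tempfile.TemporaryDirectory() as root:
    process_batch([{"pmid": "1", "title": "a"}, {"pmid": "2", "title": "b"}], fake_upload, root)
    process_batch([{"pmid": "3", "title": "c"}], fake_upload, root)
    leftover = list(Path(root).iterdir())
```

With per-batch folders, either upload granularity from item 1 would work, since no two batches share a directory.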
