PubMed Download Pipeline #246

star-nox · 2024-04-02T00:48:18Z

Extract article data from XML files present in this FTP folder - https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
Store the data in Supabase
Extract remaining data using PubMed OA API
Download and store the articles in self-hosted storage
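
As a rough illustration of the first step, here is a minimal sketch of streaming article metadata out of one PubMed XML file in fixed-size batches. It is not the PR's actual code: the function name `iter_article_batches` and the two extracted fields are assumptions chosen for illustration, and only the `PubmedArticle`, `MedlineCitation`, `PMID`, and `ArticleTitle` element names come from the PubMed XML format itself.

```python
import xml.etree.ElementTree as ET


def iter_article_batches(xml_source, batch_size=100):
    """Yield lists of up to `batch_size` metadata dicts.

    `xml_source` is a path or a binary file object. Hypothetical helper,
    not the function used in this PR.
    """
    batch = []
    # iterparse streams the document, so the whole file is never
    # built into one tree before processing starts.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        title = elem.find("./MedlineCitation/Article/ArticleTitle")
        batch.append({
            "pmid": elem.findtext("./MedlineCitation/PMID"),
            # ArticleTitle may contain inline markup, so join all text nodes.
            "title": "".join(title.itertext()) if title is not None else None,
        })
        # Drop this article's children to limit memory growth.
        elem.clear()
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Final partial batch.
        yield batch
```

The baseline files are distributed gzip-compressed, so a caller would likely pass something like `gzip.open(path, "rb")` as `xml_source`.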

lintrule-review · 2024-04-02T00:48:21Z

You need to set up a payment method to use Lintrule

You can fix that by putting in a card here.

star-nox · 2024-04-08T16:59:57Z

Currently processing a single XML file, yielding metadata in batches of 100. It took ~37 seconds to get 100 articles and their metadata ready for upload. The XML file has over 10k articles.
TO-DO:

Should I upload data to SQL and S3 in batches of 100 within a single file, or upload all data once per file?
If uploading in batches of 100, there will be conflicts in folder names: the first few functions are still adding files to the folder while the last few functions upload every article present in that folder.
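
One possible way around the folder-name conflict is to give every batch its own staging folder, so the upload step only ever sees files from a batch that has finished. The sketch below assumes that approach. The helper names `batch_dir` and `upload_and_clean` and the `upload_file` callback are hypothetical and stand in for whatever S3/MinIO upload call the pipeline uses.

```python
import shutil
from pathlib import Path


def batch_dir(root, xml_name, batch_index):
    """Create and return a staging folder unique to one batch of one XML file."""
    path = Path(root) / f"{Path(xml_name).stem}_batch{batch_index:04d}"
    # exist_ok=False makes an accidental name collision fail loudly.
    path.mkdir(parents=True, exist_ok=False)
    return path


def upload_and_clean(directory, upload_file):
    """Upload every file in `directory`, then delete only that folder.

    `upload_file` is a hypothetical callback taking a Path.
    Returns the names of the uploaded files.
    """
    uploaded = []
    for f in sorted(Path(directory).iterdir()):
        if f.is_file():
            upload_file(f)
            uploaded.append(f.name)
    # Other batches' folders are untouched, so downloads still in
    # progress for later batches are not affected.
    shutil.rmtree(directory)
    return uploaded
```

With this layout, uploading per batch of 100 and uploading per file differ only in how many folders exist at once, not in correctness.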

added functions for metadata extraction

aefeac1

star-nox marked this pull request as draft April 2, 2024 00:48

star-nox added 7 commits April 3, 2024 15:37

completed all download functions

b833524

added supabase upsert

8f14cf9

updated comments

6daa4df

added processpool to extractMetadataFromXML()

c8c9056

minor changes

f8a4d23

Merge branch 'main' into pubmed

84fd182

yielded metadata after collecting 100 articles

fa83ddf

star-nox added 19 commits April 9, 2024 10:17

Merge branch 'main' into pubmed

88a951b

storing metadata into csv and upserting per XML file

64a4142

parallelized metadata update

24425d4

added minio to requirements.txt

3ddec3b

parallelized download

e03fbf1

parallelized upload

8e5a1a0

minor changes

caac0bf

restricted upload parallelization to 10

fa60508

changed starting XML file

a61255c

changed starting XML file

4d86b85

changed start file

63a6cb6

minor comment

b17baea

Merge branch 'pubmed' of https://github.com/UIUC-Chatbot/ai-ta-backend into pubmed puling remote pubmed branch into hybrid :wq

8f608c5

minor changes in main loop

4cfc6b8

minor changes

8b533ba

parallelized processing

6dfe50b

added try-except in getArticleIds()

e98f3dc

deleted csv file

68057df

merged changes

a2a5f22

star-nox and others added 6 commits May 14, 2024 12:59

test comment

4338931

print test comment

048f41d

Commented out prints for speed

472814e

Commented out prints for speed

880ae97

Merge branch 'pubmed' of https://github.com/UIUC-Chatbot/ai-ta-backend into pubmed

d635fe8

parallelized main for loop and added xml filename column

304ec5d
