Parse Papers on Local Machines #104
I wonder whether the frontend could add a new feature that allows users to submit their own files. Once the backend receives a file, it parses it and returns the closest neighbors, just as it does with an input preprint.
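The lookup the backend would need can be sketched as a cosine-similarity nearest-neighbor search over precomputed document vectors. The corpus layout and function names here are assumptions for illustration, not the project's actual code:

```python
import math

def cosine_similarity(a, b):
    # standard cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def closest_neighbors(query_vec, corpus, k=2):
    # corpus: {paper_id: vector}; returns the k most similar paper ids
    scored = sorted(corpus.items(),
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [paper_id for paper_id, _ in scored[:k]]
```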
My only concern is that PDFs could contain malicious code that could potentially take down our server. We could implement the check for this mentioned in the blog, but one alternative would be to have users generate the vectors locally.
I think it also depends on how popular this feature could be. If it's just for a few users, asking them to submit the doc vectors is okay. If many users want it, we probably should figure out how to minimize their effort as much as possible (for example, allow users to submit a PDF file directly, but have either the frontend or backend do some checking/cleaning before parsing it).
What if we provided source code + an environment + an example notebook that would do the embeddings? Then folks could access a new API endpoint to get the results for that.
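For example, the notebook could package the locally computed vector as a JSON payload for the new endpoint. The endpoint name, payload shape, and rounding precision below are assumptions, just to illustrate the idea:

```python
import json

def build_payload(paper_id, vector):
    # round the vector entries to keep the upload small
    # (the precision the model actually needs is an assumption)
    return json.dumps({"id": paper_id,
                       "vector": [round(v, 6) for v in vector]})

# The notebook would then POST this to a hypothetical endpoint, e.g.:
# requests.post("https://<host>/api/neighbors_from_vector",
#               data=build_payload("my_preprint", vec),
#               headers={"Content-Type": "application/json"})
```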
Sure. We can use this approach for now. If many users like it and want a simpler procedure, we'll figure out how.
I just realized that this will make model versioning annoying. What if we created a notebook that calculates word counts, so users could upload those words + counts and we'd compute the vector from them? It just feels like plain text is going to be easier to handle, with less risk than accepting PDFs and running the parsing library.
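A sketch of what such a notebook cell might compute: plain word counts from the document text. The tokenization rule here is an assumption; the actual tokenizer would have to match whatever the backend's word model expects:

```python
import re
from collections import Counter

def word_counts(text):
    # lowercase, letters-only tokenization (an assumption, not the
    # project's actual preprocessing)
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)
```

Users would upload the resulting `{word: count}` mapping instead of the document itself.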
You mean the word modeling file that was used by the backend? How often is it supposed to be updated?
True. I was thinking earlier that we could require users to convert their documents from whatever format they're in before allowing them to submit.
Per our post-scrum chat about this: it sounds like a good plan is to have the frontend allow a PDF/txt/Word document upload, use a JS library to convert it to plain text, and then send that plain text to a new backend API endpoint. Once it reaches the backend, the process continues as normal, as if the text had come from the backend's Python PDF parser.

There are potential hazards to letting users upload PDFs. However, if we're parsing on the frontend, there's almost no risk: a user could crash their own browser, but that wouldn't affect anyone else. Doing PDF parsing on the backend is more perilous because it could mess up the server for others.

Also, since this probably isn't the way we want to recommend people use the tool, I think we should somewhat "hide" this feature in the frontend. A good way to do that is to let users drag files from their computer onto the search box to upload them. There would be no indication that this is possible unless they were instructed to do it.
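A minimal sketch of the backend side of this plan: accept already-converted plain text and validate it before handing it to the existing pipeline. The size cap, function name, and validation rules are assumptions, not the project's actual API:

```python
MAX_CHARS = 1_000_000  # guard against oversized uploads (limit is an assumption)

def handle_plaintext_upload(text):
    # Validate the plaintext sent by the frontend's JS converter.
    if not isinstance(text, str) or not text.strip():
        raise ValueError("expected non-empty plain text")
    if len(text) > MAX_CHARS:
        raise ValueError("document too large")
    # From here the request would follow the same path as text produced
    # by the backend's own PDF parser (that pipeline call is not shown).
    return text.strip()
```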
A user requested a feature that would allow them to parse their own papers without having to post on bioRxiv or medRxiv. Ideally, I was thinking of the following steps to accomplish this request: