Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change update functions & readme #88

Merged
merged 4 commits into from
Mar 6, 2018
Merged

Change update functions & readme #88

merged 4 commits into from
Mar 6, 2018

Conversation

eseiver
Copy link
Collaborator

@eseiver eseiver commented Feb 28, 2018

This closes #85.

  • deprecates allofplos.plos_corpus in favor of allofplos.update
  • updates README with new command (as well as other readme cleanup)
  • filename_to_doi no longer allows passing through a DOI (it was broken by use os.path.basename() in filename_to_doi #87, oops)
  • new download_xml function is based on DOI and not file, and is passed into download_updated_xml.

@eseiver eseiver requested review from sbassi and mpacer February 28, 2018 01:54
@@ -0,0 +1,4 @@
from .corpus.plos_corpus import main
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to do this, I'd prefer that the meat of this functionality actually be moved into this file and rely on the corpus module to provide the actual APIs it uses internally. This kind of redirection is getting to be a bit gratuitous. But if this would include a true refactor (making python -m allofplos.corpus.plos_corpus invalid, then I'd be for this.

Copy link
Collaborator

@mpacer mpacer Feb 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll note that I've been trying to push us to have a deeper package structure, and so introducing another top level module isn't something I'm jumping to do.

However, I think it's important to separate the applications from the APIs, and this is effectively an application style command, so that's why I'm in favour of adding this at the cost of a flatter package structure to gain the isolation of purpose.

@@ -94,15 +94,14 @@ def filename_to_doi(filename):
:param filename: relative path to local XML file in the get_corpus_dir() directory
:return: full unique identifier for a PLOS article
"""
filename = os.path.basename(filename)
if correction in filename and validate_filename(filename):
fn = os.path.basename(filename)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't you change the filenameparameter to filepath and then keep this as filename?

I always feel conflicted about fn since it can also be shorthand for function, especially in callback based code.

from .corpus.plos_corpus import main

if __name__ == "__main__":
warnings.simplefilter('always', DeprecationWarning)
warnings.warn("This update method is deprecated. use 'python -m allofplos.update'",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay deprecation warnings!

@@ -278,9 +278,16 @@ def compare_article_pubdate(doi, days=22, directory=None):
print("Pubdate error in {}".format(doi))


def download_xml(doi, tempdir=newarticledir):
Copy link
Collaborator

@mpacer mpacer Feb 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is really clever! I hadn't thought of using an article object as a downloader… why don't we just bake this into the Article class itself?

Edit: I actually had thought about it in the context of the async-await stuff (#46) but I think I chose not to implement it because I thought I'd get pushback and that it wasn't how we were planning on using the Article class. No matter what, this is super clever in a good, straightforward code way!

elif validate_doi(filename):
doi = filename
else:
doi = ''
return doi
Copy link
Collaborator

@mpacer mpacer Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should validate the doi before returning it (raising an error if that fails).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a custom doi validation error? If not… we should.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, you have the same problem with doi_to_path

https://github.com/PLOS/allofplos/pull/88/files?diff=unified#diff-ad065d41199b64627a6f15ef79720843R181

     if directory is None:
         directory = get_corpus_dir()
     if doi.startswith(ANNOTATION_DOI) and validate_doi(doi):
         article_file = os.path.join(directory, "plos.correction." + doi.split('/')[-1] + SUFFIX_LOWER)
     elif validate_doi(doi):
         article_file = os.path.join(directory, doi.lstrip(PREFIX) + SUFFIX_LOWER)
     # NOTE: The following check is weird, a DOI should never validate as a file name.
     elif validate_filename(doi):
         article_file = doi
     return article_file

it should have an else case.

@@ -496,10 +491,10 @@ def download_vor_updates(directory=None, tempdir=newarticledir,
if vor_updates_available is None:
vor_updates_available = check_for_vor_updates()
vor_updated_article_list = []
for article in tqdm(vor_updates_available, disable=None):
updated = download_updated_xml(article, vor_check=True)
for doi in tqdm(vor_updates_available, disable=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does vor_updates_available return a list of dois?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed it does

eseiver added 2 commits March 5, 2018 13:16
deprecates allofplos.plos_corpus, updates README with new command
`filename_to_doi` no longer allows passing through a DOI. new `download_xml` function is based on DOI and not file, and is passed into `download_updated_xml`.
point .plos_corpus.py at update.py
delete `main()` from corpus.plos_corpus.py
----------------------------------------------------------------------
Ran 6 tests in 3.327s
Ran 8 tests in 0.257s
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer accurate after #89, could you update it?

@@ -0,0 +1,52 @@
import os

from . import get_corpus_dir, newarticledir, uncorrected_proofs_text_list
Copy link
Collaborator

@mpacer mpacer Mar 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is uncorrect_proofs_text_list in __init__.py? Isn't it specific to the update function? Does anything else use it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the .txt file is stored in the top-level directory (which made sense when plos_corpus.py was there).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be wiped out when you reinstall allofplos

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, just like people who have installed the corpus directory in the default location. let's make obsolete in future PR

@mpacer
Copy link
Collaborator

mpacer commented Mar 6, 2018

There is now an issue with our entrypoints… I actually didn't know we had a console_script entrypoint. but you'll want to change that too… or remove it, because I don't know if it works right now.

elif validate_doi(filename):
doi = filename
else:
raise Exception("Invalid format for PLOS filename: {}".format(filename))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add the same validate_filename check at the beginning of the of this functiont hat you did below for the doi_to_url check? that way you could remove all of the other calls to validate filename later on and simplify the logic.

# NOTE: The following check is weird, a DOI should never validate as a file name.
elif validate_filename(doi):
article_file = doi
else:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the way you simplified the check with validate doi in filename_to_doi, could you use that method here as well? Checking for validation first that way you don't need to keep adding it as part of your conditions?

also include string that isn't validating
@mpacer
Copy link
Collaborator

mpacer commented Mar 6, 2018

LGTM, merging

@mpacer mpacer merged commit a6d6376 into PLOS:master Mar 6, 2018
@eseiver eseiver added this to the 0.11.0 milestone Mar 6, 2018
@eseiver eseiver deleted the update branch March 6, 2018 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot get corpus dir
2 participants