-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change update functions & readme #88
Conversation
allofplos/update.py
Outdated
@@ -0,0 +1,4 @@ | |||
from .corpus.plos_corpus import main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to do this, I'd prefer that the meat of this functionality actually be moved into this file and rely on the corpus
module to provide the actual APIs it uses internally. This kind of redirection is getting to be a bit gratuitous. But if this would include a true refactor (making python -m allofplos.corpus.plos_corpus
invalid, then I'd be for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll note that I've been trying to push us to have a deeper package structure, and so introducing another top level module isn't something I'm jumping to do.
However, I think it's important to separate the applications from the APIs, and this is effectively an application style command, so that's why I'm in favour of adding this at the cost of a flatter package structure to gain the isolation of purpose.
allofplos/transformations.py
Outdated
@@ -94,15 +94,14 @@ def filename_to_doi(filename): | |||
:param filename: relative path to local XML file in the get_corpus_dir() directory | |||
:return: full unique identifier for a PLOS article | |||
""" | |||
filename = os.path.basename(filename) | |||
if correction in filename and validate_filename(filename): | |||
fn = os.path.basename(filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couldn't you change the filename
parameter to filepath
and then keep this as filename
?
I always feel conflicted about fn
since it can also be shorthand for function
, especially in callback based code.
allofplos/plos_corpus.py
Outdated
from .corpus.plos_corpus import main | ||
|
||
if __name__ == "__main__": | ||
warnings.simplefilter('always', DeprecationWarning) | ||
warnings.warn("This update method is deprecated. use 'python -m allofplos.update'", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay deprecation warnings!
@@ -278,9 +278,16 @@ def compare_article_pubdate(doi, days=22, directory=None): | |||
print("Pubdate error in {}".format(doi)) | |||
|
|||
|
|||
def download_xml(doi, tempdir=newarticledir): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is really clever! I hadn't thought of using an article object as a downloader… why don't we just bake this into the Article
class itself?
Edit: I actually had thought about it in the context of the async-await stuff (#46) but I think I chose not to implement it because I thought I'd get pushback and that it wasn't how we were planning on using the Article class. No matter what, this is super clever in a good, straightforward code way!
allofplos/transformations.py
Outdated
elif validate_doi(filename): | ||
doi = filename | ||
else: | ||
doi = '' | ||
return doi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should validate the doi before returning it (raising an error if that fails).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a custom doi validation error? If not… we should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, you have the same problem with doi_to_path
if directory is None:
directory = get_corpus_dir()
if doi.startswith(ANNOTATION_DOI) and validate_doi(doi):
article_file = os.path.join(directory, "plos.correction." + doi.split('/')[-1] + SUFFIX_LOWER)
elif validate_doi(doi):
article_file = os.path.join(directory, doi.lstrip(PREFIX) + SUFFIX_LOWER)
# NOTE: The following check is weird, a DOI should never validate as a file name.
elif validate_filename(doi):
article_file = doi
return article_file
it should have an else case.
@@ -496,10 +491,10 @@ def download_vor_updates(directory=None, tempdir=newarticledir, | |||
if vor_updates_available is None: | |||
vor_updates_available = check_for_vor_updates() | |||
vor_updated_article_list = [] | |||
for article in tqdm(vor_updates_available, disable=None): | |||
updated = download_updated_xml(article, vor_check=True) | |||
for doi in tqdm(vor_updates_available, disable=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does vor_updates_available return a list of dois?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed it does
deprecates allofplos.plos_corpus, updates README with new command
`filename_to_doi` no longer allows passing through a DOI. new `download_xml` function is based on DOI and not file, and is passed into `download_updated_xml`.
point .plos_corpus.py at update.py delete `main()` from corpus.plos_corpus.py
---------------------------------------------------------------------- | ||
Ran 6 tests in 3.327s | ||
Ran 8 tests in 0.257s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer accurate after #89, could you update it?
@@ -0,0 +1,52 @@ | |||
import os | |||
|
|||
from . import get_corpus_dir, newarticledir, uncorrected_proofs_text_list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is uncorrect_proofs_text_list in __init__.py
? Isn't it specific to the update function? Does anything else use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because the .txt file is stored in the top-level directory (which made sense when plos_corpus.py
was there).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be wiped out when you reinstall allofplos
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, just like people who have installed the corpus directory in the default location. let's make obsolete in future PR
There is now an issue with our entrypoints… I actually didn't know we had a console_script entrypoint. but you'll want to change that too… or remove it, because I don't know if it works right now. |
allofplos/transformations.py
Outdated
elif validate_doi(filename): | ||
doi = filename | ||
else: | ||
raise Exception("Invalid format for PLOS filename: {}".format(filename)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add the same validate_filename check at the beginning of the of this functiont hat you did below for the doi_to_url
check? that way you could remove all of the other calls to validate filename later on and simplify the logic.
allofplos/transformations.py
Outdated
# NOTE: The following check is weird, a DOI should never validate as a file name. | ||
elif validate_filename(doi): | ||
article_file = doi | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the way you simplified the check with validate doi in filename_to_doi
, could you use that method here as well? Checking for validation first that way you don't need to keep adding it as part of your conditions?
also include string that isn't validating
LGTM, merging |
This closes #85.
allofplos.plos_corpus
in favor ofallofplos.update
filename_to_doi
no longer allows passing through a DOI (it was broken by use os.path.basename() in filename_to_doi #87, oops)download_xml
function is based on DOI and not file, and is passed intodownload_updated_xml
.