Code Cleanup #28

SomeoneInParticular · 2025-01-15T18:22:35Z

This PR is mainly to resolve a number of potential issues and code quality concerns in the current state of the repository. In particular, this PR aims to fix the following (among other minor fixes):

Over-reliance on the os package, which can be unsafe (or, at least, less safe than pathlib for what its being used for)
Output fails to work as expected, often crashing out in common use cases (i.e. relative output paths, see Designating where to place output fails in many cases #26)
Many hard-coded variables in the code itself, rather than in separate configuration files, making the code overly specific to only out use-case
Inefficient use of Pandas utilities (making code take much longer than it could, or being harder to debug/generalize)
Overzealous imports, which could result in global variable pollution (and by extension, the potential for nightmare bugs)

Once this pass is completed, the next step will be implementing automated testing to help ensure that further refactors/tests don't break existing functionality (which I am currently testing manually). Cheers!

…xtendable.

…ion types; it will now process all available types when this occurs.

…io.test.docx") and be placed in directories that do not initially exist at runtime, through the power of PathLib!

…for the NeuroPoly wiki) is now placed in the same place as everything else.

…ueries doesn't exist; it will now just notify the user with a warning instead + fix for user requesting 0 publication types

…ollution in preparation for future refactoring

…now you can mix both files and raw identifiers if you want! This is part of an overall CLI refactor, to preparation for automated testing.

…CV's contents to another.

…appropriate pre-run validations)

…tions can be called *without* needing a shell to call them!

…on used by the CCV

…ting file

…LI `label` parameter

… call

…(Google Doc or local) into its own CLI interface

…he old CLI hook to match)

SomeoneInParticular · 2025-01-21T21:49:26Z

Change Log:

New: Custom CLI added for formatting the contents of a bibliography.

This has NOT replaced the original CLI used previously, though it does replicate a large amount of its functionality. The old CLI will be deprecated (and eventually removes) once all of its functionality has been replicated in other CLI managers.
Accessible via calling bibeasy_format. See bibeasy_format -h for detailed use-case examples
Currently only outputs into Word or our custom NeuroPoly format, set up to be extended going forward though
Changes to input file handling:
- A Google Sheet cache is now (re)generated when you give the URL explicitly. For example, the following command will generate a cache using the Google Sheet data present at the provided URL:
  bibeasy_format -i https://docs.google.com/spreadsheets/d/1dEUBYf17hNM22dqV4zx1gsh3Q-d97STnRB4q7p9nQ54/edit?gid=566297787#gid=566297787
- Alternatively, you can specify an Excel file you have on your local desktop instead, if you have a local spreadsheet you want to use instead:
  bibeasy_format -i "path/to/excel/spreadsheet.xlsx"
- If no input is given, the program will try to use an existing cache. Otherwise it will error out with a message explaining why the program could not proceed.
Changes to output file designation:
- The user no longer needs to define an output file; instead, the user may optionally use either or both of the following to define where output is placed and how it is named:
  - -o/--output: A directory where output files will be placed. Defaults to the current directory if not specified.
  - -l.--label: A label which will be appended to the front of all output files within the same directory. Defaults to bibeasy_out if not specified.
    - In the case of multiple output files, a 'sub-label' is added after the main label to delineate between them, with a double underscore between the two (i.e. bibeasy_out__article)
Changes to output formatting arguments
- The (file) format of the output now MUST be specified by the user explicitly with the -f flag
  - Currently only docx and neuro_md (our custom format for the NeuroPoly website) formats are supported.
  - The format used is used to determine the extension the resulting output file(s), rather than the user specifying it.
- --combine has been replaced with --keep_separate
  - If the flag is used, the output files will be separated by the publication type (determined by the sheets used in the source spreadsheet)
    - This was changed as, given we want this to be a bibliography manager first, I suspect the majority of people (myself included) use the sub-sheets for organization, rather than to enforce stratification. This is trivial to reverse if I'm wrong, though
- The citation style (currently only used in docx formats) can be specified with the --style flag.
  - Default is now APA, rather than custom. This was done under the assumption that most people wanting a bibliography want a known standardized format, rather than our current custom one.
- The order (by year) can be made to be descending with the --reverse flag
- A path containing a list of organization labels (for use in the NeuroPoly MD website) can be provided using the --valid_labels flag
Changes to output filtering arguments:
- The --year flag can be used to specify the earliest year for which publications should be considered. This is inclusive (that is, if it is set to 2015, anything published on 2015 is included, but nothing before).
- The --type flag can be used to specify the types of publication which should be preserved and output. The 'type' of a publication is determined by the name of it's sub-sheet within the source Excel file.
- The --required_columns flag can be used to specify a set of columns which cannot be null in entries. Entries which lack a value in any of the provided columns will not be present in the resulting output file(s).
- The --filters flag can be used to define a list of custom (SQL-like) query filters you want applied to the dataset prior to the results being output. For example, adding --filter "IVADO17 == 'x'" would result in the output containing only entries in which the IVADO17 columns contained an x.
  - This uses pandas.DataFrame.query under the hood, so any query string which is valid in that context should work here as well.

Changed:

The methods to format a DataFrame into the requested output format have been extracted from csv_to_txt_pubtype into their own unique functions
- These are now mapped by formatting "labels" in the new CLI script and called directly. This should make the code much easier to extend with new formats going forward.
Parsing an Excel file into a unified DataFrame is now handled by its own utility function in the CLI (utils:read_excel) to better modularize it.
- Unlike its precursor (gsheet:gsheet_to_df), it will only filter by the 'type' of publication, with the remaining filters and re-formatting operations being done in the new CLI script.
- The 'type' exception exists solely because it reduces read-from-disk time (as we reduce the number of sub-sheets in the Excel spreadsheet being loaded based on the filter)

Remaining TODOS:

Allow you to (re)generate the Google Sheet cache without also defining an output format (and, by extension, generating new outputs)
Choose a default output format (i.e. docx which will be used if one is not defined by the user).
Allow for more customizable sorting order (i.e. by Title, or first-author then year)
When a labels file is specified with --valid_labels, allow the user to request that the program validates all entries in the provided Excel file fall under at least one of the label's provided.

SomeoneInParticular · 2025-01-21T21:51:10Z

@jcohenadad @namgo could one of you poke around with the new CLI and let me know if theres any features that were left out/broken by mistake? I don't really have a "feature list" to go off of here, so I'm just aiming to re-implement existing functionality in a new (cleaner) CLI format. Thank you in advance!

jcohenadad · 2025-01-21T23:16:11Z

@SomeoneInParticular this is great-- thank you for working on this-- i tested " bibeasy -o publi.docx -st custom -r -c" and it produces the same output as it used to--

however, glancing at #28 (comment), i see that some behaviours changed, which is not great-- it means that a command i used to do in the past won't work anymore or will produce different behaviours-- i do not have the time to dive into your lengthy description so could you pls just give me the TL;DR of what is different now-- thx

also: the lab website is relying on bibeasy, so here to we need to make sure you didn't break anything-- thx-- tagging @nullnik-0 @namgo so they are aware

SomeoneInParticular · 2025-01-22T00:18:22Z

@jcohenadad Yes; the original bibeasy CLI should work identically to how it did before (functions, bugs, edge cases, and everything else). The changes I detailed in the change log did not replace this "main" CLI, and we can keep it around for as long as we want to avoid breaking existing scripts.

TL:DR: theirs now a second CLI, bibeasy_format, which isolates the output formatting functionality to its own call. As it stands, if we proceed with the current setup, the bibeasy command is going to be nigh impossible to maintain as more and more arguments get added. Hence, I've done this to avoid bloating the "main" CLI with additional arguments, in anticipation for extended functionality later.

To use your command as an example (bibeasy -o publi.docx -st custom -r -c), the equivalent using the new CLI would be:

bibeasy_format -l publi -f docx --style custom --reverse

AGAIN: This new CLI has not replaced the old one, and we can change existing scripts over to this new CLI at our leisure without anything breaking. It (and any other CLIs we add later for other functionality, such as CCV parsing) exist solely to get ahead of the bloat problem before technical debt makes it too much a nightmare to untangle.

SomeoneInParticular · 2025-01-22T00:19:49Z

Note as well; the CLI changes are not set in stone, and were done mainly based on my intuition on what good "defaults" would be. Its easy to change them if I was wrong, or to make them "closer" to the original CLI. Just say the word

kalum.ost added 20 commits January 14, 2025 17:41

Refactored GoogleSheet caching and loading to be more versatile and e…

167d79c

…xtendable.

Fixed error at output when the user doesn't specify a set of publicat…

482c82a

…ion types; it will now process all available types when this occurs.

The output destination can now contain periods (i.e. "../testing/bibl…

0401c2a

…io.test.docx") and be placed in directories that do not initially exist at runtime, through the power of PathLib!

'labels_publication.html' (generated when a MarkDown file is created …

01472ed

…for the NeuroPoly wiki) is now placed in the same place as everything else.

Filtered input listing will no longer crash if one of the requested q…

52db16e

…ueries doesn't exist; it will now just notify the user with a warning instead + fix for user requesting 0 publication types

Refactored imports in 'bibeasy_cli.py' to reduce variable namespace p…

efd9393

…ollution in preparation for future refactoring

Updated the reference filter in the CLI to be much more generalized; …

23a0142

…now you can mix both files and raw identifiers if you want! This is part of an overall CLI refactor, to preparation for automated testing.

Clarified help for the updated (reference) input argument.

b663352

Fixed error caused my misnamed variable when trying to transfer one C…

8827ee1

…CV's contents to another.

Transitioned remaining CLI arguments to be a pathlib Path type (with …

d1dbc30

…appropriate pre-run validations)

File-type CLI arguments don't always return None now. Oops!

b75367b

Disentangled CLI parsing from the rest of the code base. Now all func…

e4386fe

…tions can be called *without* needing a shell to call them!

Added some "testing" files to make testing more straightforward

433b476

[Minor] Removed no-longer-used imports and constants in CLI

b15d9ac

[Minor] Renamed the testing 'input' file to match the naming conventi…

1893dfc

…on used by the CCV

Moved publication labels to testing, renaming it to function as a tes…

80e61c2

…ting file

More cleaning, fixed type-hinting for arguments associated with the C…

4083f49

…LI `label` parameter

Updated SetupTools entry points to correctly refer to the updated CLI…

54cd8d1

… call

[Part 1] Isolated CLI for reformatting the contents of a Spreadsheet …

633f975

…(Google Doc or local) into its own CLI interface

Updated SetupTools CLI hooks to register the new hook (and modified t…

6f2f4f0

…he old CLI hook to match)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Code Cleanup #28

Code Cleanup #28

SomeoneInParticular commented Jan 15, 2025

SomeoneInParticular commented Jan 21, 2025 •

edited

Loading

SomeoneInParticular commented Jan 21, 2025

jcohenadad commented Jan 21, 2025 •

edited

Loading

SomeoneInParticular commented Jan 22, 2025

SomeoneInParticular commented Jan 22, 2025 •

edited

Loading

Code Cleanup #28

Are you sure you want to change the base?

Code Cleanup #28

Conversation

SomeoneInParticular commented Jan 15, 2025

SomeoneInParticular commented Jan 21, 2025 • edited Loading

SomeoneInParticular commented Jan 21, 2025

jcohenadad commented Jan 21, 2025 • edited Loading

SomeoneInParticular commented Jan 22, 2025

SomeoneInParticular commented Jan 22, 2025 • edited Loading

SomeoneInParticular commented Jan 21, 2025 •

edited

Loading

jcohenadad commented Jan 21, 2025 •

edited

Loading

SomeoneInParticular commented Jan 22, 2025 •

edited

Loading