Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sheet_utils refactor to add csv functionality (C4-1088) #276

Open
wants to merge 17 commits into
base: kmp_sheet_utils
Choose a base branch
from

Conversation

netsettler
Copy link
Collaborator

@netsettler netsettler commented Aug 17, 2023

This is an improvement on Some tools for working with spreadsheets (PR #275) to generalize it a bit more and add support for .csv files.

From the CHANGELOG.rst...

  • New module sheet_utils for loading workbooks.

    • Important things of interest:

      • Class ItemManager for loading Item-style data from either .xlsx or .csv files.

      • Function load_items that does the same as ItemManager.load.

    • Various low-level implementation classes such as:

      • Classes XlsxManager and CsvManager for loading raw data from .xlsx and .csv files, respectively.

      • Classes ItemXlsxManager and ItemCsvManager for loading Item-style data from .xlsx and .csv files, respectively.

  • Contains a fix for a bug in ff_utils.get_schema_names (C4-1086).

… not the workbook level artifact. Better handling of init args.
… ItemManager.load to take a tab_name argument so that CSV files can perhaps infer a type name.
@netsettler netsettler changed the base branch from master to kmp_sheet_utils August 17, 2023 16:21
Copy link
Contributor

@drio18 drio18 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the organization here, and the consideration for csv files which can easily be adapted for tsv files.

dcicutils/sheet_utils.py Outdated Show resolved Hide resolved
Comment on lines 160 to 193
class AbstractTableSetManager:
"""
The TableSetManager is the spanning class of anything that wants to be able to load a table set,
regardless of what it wants to load it from. To do this, it must support a load method
that takes a filename and returns the file content in the form:
{
"Sheet1": [
{...representation of row1 as some kind of dict...},
{...representation of row2 as some kind of dict...}
],
"Sheet2": [...],
...,
}
Note that at this level of abstraction, we take no position on what form of representation is used
for the rows, as long as it is JSON data of some kind. It might be
{"col1": "val1", "col2", "val2", ...}
or it might be something more structured like
{"something": "val1", {"something_else": ["val2"]}}
Additionally, the values stored might be altered as well. In particular, the most likely alteration
is to turn "123" to 123 or "" to None, though the specifics of whether and how such transformations
happen is not constrained by this class.
"""

def __init__(self, **kwargs):
if kwargs:
raise ValueError(f"Got unexpected keywords: {kwargs}")

@classmethod
def load_workbook(cls, filename: str):
wb = cls(filename)
return wb.load_content()
def load(cls, filename: str) -> Dict[str, List[AnyJsonData]]:
"""
Reads a filename and returns a dictionary that maps sheet names to rows of dictionary data.
For more information, see documentation of AbstractTableSetManager.
"""
raise NotImplementedError(f".load(...) is not implemented for {cls.__name__}.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this should be an abstract base class?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I was quite clean enough. I know of such classes from Lisp, but hadn't been tracking Python's adoption of them. As I read the doc, having a method that does something isn't helpful because it's not in the MRO. Here I'm relying on __init__(self, **kwargs) for some error handling that must be in the MRO.

I also gotta say that although I do find it useful to name these things in the way that says yeah, it's abstract, I have not found metaclasses to be as useful as I expected them to be. Almost every time I've found a use for metaclasses, there's a just as easy way to make it with regular classes. (Actually, in Python I often use decorators.) Part of the issue is that the uses invariably thwart static analysis, so I end up finding static analysis more important than the other stuff. Maybe such tools would understand this class as a special case because it might have widespread use, but in general metaclasses just make it hard to understand. We have code in dcicutils.misc_utills for example that imlements classproperty but PyCharm fusses at me because it doesn't understand why I'm using cls instead of self as an argument.

So I'm gonna pass on that at this time. I'll add a note in the sources for future consideration. If you'd like to convince me otherwise, let's chat interactively. I'm not fixed on this. It's just my, uh, base posture.

Comment on lines +249 to +253
def _raw_row_generator_for_tabname(self, tabname: str) -> Iterable[SheetRow]:
"""
Given a tabname and a state (returned by _sheet_loader_state), return a generator for a set of row values.
"""
raise NotImplementedError(f"._rows_for_tabname(...) is not implemented for {self.__class__.__name__}.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can also consider making this class an abstract base class with these methods decorated to be abstractmethod.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment.

@netsettler
Copy link
Collaborator Author

I like the organization here, and the consideration for csv files which can easily be adapted for tsv files.

Yeah, I can probably trivially add tsv files. I'll look at that.

pyproject.toml Outdated
@@ -1,6 +1,6 @@
[tool.poetry]
name = "dcicutils"
version = "7.7.2.1b0" # to become "7.8.0"
version = "7.8.1.1b1" # to become "7.9.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it safe to update PyYML to ^6.0.1? I tried locally and tests pass, and I didn't see anything too scary in the release notes. It's just that everytime I build locally I have to manually update to this version due to Mac M1 issues with 5.4.1.

@netsettler netsettler changed the title sheet_utils refactor to add csv functionality sheet_utils refactor to add csv functionality (C4-1088) Aug 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants