-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated oxidation.py and its tests #366
base: develop
Are you sure you want to change the base?
Conversation
WalkthroughThis pull request encompasses changes across several files, primarily focusing on the configuration of pre-commit hooks and modifications to the Changes
Possibly related PRs
Suggested labels
Poem
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (2)
smact/utils/oxidation.py (1)
100-101
: Add validation for consensus_threshold parameter.Consider adding validation to ensure the threshold is non-negative, as negative values wouldn't make sense in this context.
species_occurrences_df = self.ox_states_df[ + (consensus_threshold >= 0) + and (self.ox_states_df.results_count >= consensus_threshold) ].reset_index(drop=True)get_species_occurences_df2.ipynb (1)
17-19
: Add missing parameter documentation.The docstring is missing documentation for the
threshold
parameter.Args: + threshold (int): Minimum number of occurrences required for inclusion. Default is 3. include_one_oxidation_state (bool): Include oxidation states +1 and -1 in the species or include as + and - signs. Default is False. sort_by_occurrences (bool): Sort the species list by occurrences. Default is True.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
.pre-commit-config.yaml
(1 hunks)get_species_occurences_df2.ipynb
(1 hunks)smact/utils/oxidation.py
(2 hunks)
🧰 Additional context used
🪛 GitHub Actions: SMACT CI
get_species_occurences_df2.ipynb
[error] 2-7: Notebook contains execution count metadata that should be stripped. The nbstripout pre-commit hook failed and modified the file to remove execution counts.
🔇 Additional comments (3)
smact/utils/oxidation.py (2)
85-87
: LGTM! Parameter addition looks good.The new
consensus_threshold
parameter is well-documented and its default value of 3 seems reasonable for filtering out spurious data.Also applies to: 94-94
Line range hint
85-101
: Add tests for the new consensus_threshold parameter.Please add unit tests to verify:
- Default threshold behaviour
- Custom threshold filtering
- Edge cases (0, very large values)
- Invalid inputs (negative values)
Would you like me to help generate comprehensive test cases for the new parameter?
.pre-commit-config.yaml (1)
55-55
: LGTM! Good improvement to the pre-commit workflow.Moving the codespell check to the pre-commit stage will help catch spelling issues earlier in the commit process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests need to be updated to reflect the changes.
See my additional review comments
species_occurrences_df = species_occurrences_df[["species", "results_count"]] | ||
species_occurrences_df = species_occurrences_df[["element", "species", "results_count"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to accept the change, but can this change also be documented in the Pull Request text. Would also be helpful to know why the addition of the element column is needed.
Additionally, the tests need to be updated to reflect the changes made to the code to ensure that they pass.
SMACT/smact/tests/test_utils.py
Lines 211 to 220 in a92e123
def test_oxidation_states_filter_species_occurrences(self): | |
species_occurrences_df = self.ox_filter.get_species_occurrences_df() | |
self.assertIsInstance(species_occurrences_df, pd.DataFrame) | |
self.assertEqual( | |
species_occurrences_df.columns.tolist(), | |
["species", "results_count"], | |
) | |
self.assertEqual(species_occurrences_df.shape, (490, 2)) | |
self.assertEqual(species_occurrences_df.iloc[0]["species"], "O2-") | |
self.assertEqual(species_occurrences_df.iloc[0]["results_count"], 116910) |
I'd say that consensus_threshold
should be relabeled to threshold
since there are functions which also take a threshold
argument in this Python file. So it's better to have consistency. On the point of consistency, all thresholds should either be >= or >, rather than having a mix of both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
smact/utils/oxidation.py (1)
Line range hint
130-154
: Reduce code duplication by reusing filter method logic.The filtering logic and species proportion calculation are duplicated from the
filter
method. Consider reusing the existing implementation.-if not include_zero: - species_occurrences_df = self.ox_states_df[ - self.ox_states_df["oxidation_state"] != 0 - ].reset_index(drop=True) -else: - species_occurrences_df = self.ox_states_df -species_occurrences_df = species_occurrences_df[ - (species_occurrences_df.results_count >= consensus) -].reset_index(drop=True) +# Reuse filter method with commonality="low" to maintain all proportions +species_occurrences_df = self.ox_states_df[ + self.ox_states_df["results_count"].isin( + self.filter(consensus=consensus, include_zero=include_zero, commonality="low")["results_count"] + ) +].reset_index(drop=True)
🧹 Nitpick comments (1)
smact/utils/oxidation.py (1)
196-198
: Enhance metadata in output file.Consider adding more detailed metadata about the filtering parameters used.
f.write( - f"#\n# Oxidation state set\n# Source: ICSD (2024), filtered for {commonality} commonality of reports\n#\n" + f"#\n# Oxidation state set\n" + f"# Source: ICSD (2024)\n" + f"# Filtering parameters:\n" + f"# - Minimum consensus: {consensus} reports\n" + f"# - Commonality threshold: {commonality} " + f"({commonality_map[commonality]}% minimum proportion)\n#\n" )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
smact/utils/oxidation.py
(5 hunks)
🔇 Additional comments (3)
smact/utils/oxidation.py (3)
24-26
: LGTM! Clean data loading implementation.The DataFrame loading implementation is correct and follows pandas best practices.
79-83
: LGTM! Consistent parameter updates.The changes to
get_species_list
maintain consistency with the filter method updates.Also applies to: 96-96
28-30
: Consider renaming 'consensus' to 'threshold' for consistency.As mentioned in previous reviews, other functions in this file use the term 'threshold'. For consistency, consider renaming the 'consensus' parameter to 'threshold'.
smact/utils/oxidation.py
Outdated
element_totals = filtered_df.groupby("element")["results_count"].transform( | ||
"sum" | ||
) | ||
filtered_df["species_proportion (%)"] = ( | ||
filtered_df["results_count"] / element_totals * 100 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add error handling for zero element totals in proportion calculation.
The species proportion calculation could raise a division by zero error if all results are filtered out for an element.
element_totals = filtered_df.groupby("element")["results_count"].transform(
"sum"
)
+if (element_totals == 0).any():
+ raise ValueError(
+ "No valid results remain for some elements after filtering. "
+ "Consider adjusting the consensus threshold."
+ )
filtered_df["species_proportion (%)"] = (
filtered_df["results_count"] / element_totals * 100
)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
element_totals = filtered_df.groupby("element")["results_count"].transform( | |
"sum" | |
) | |
filtered_df["species_proportion (%)"] = ( | |
filtered_df["results_count"] / element_totals * 100 | |
) | |
element_totals = filtered_df.groupby("element")["results_count"].transform( | |
"sum" | |
) | |
if (element_totals == 0).any(): | |
raise ValueError( | |
"No valid results remain for some elements after filtering. " | |
"Consider adjusting the consensus threshold." | |
) | |
filtered_df["species_proportion (%)"] = ( | |
filtered_df["results_count"] / element_totals * 100 | |
) |
smact/utils/oxidation.py
Outdated
commonality_map = {"low": 0, "medium": 10, "high": 50} | ||
commonality_threshold = commonality_map.get(commonality) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add input validation for commonality parameter.
The commonality parameter should be validated against allowed values to prevent silent failures with invalid inputs.
commonality_map = {"low": 0, "medium": 10, "high": 50}
+if commonality not in commonality_map:
+ raise ValueError(
+ f"Invalid commonality value: {commonality}. "
+ f"Must be one of: {', '.join(commonality_map.keys())}"
+ )
commonality_threshold = commonality_map.get(commonality)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
commonality_map = {"low": 0, "medium": 10, "high": 50} | |
commonality_threshold = commonality_map.get(commonality) | |
commonality_map = {"low": 0, "medium": 10, "high": 50} | |
if commonality not in commonality_map: | |
raise ValueError( | |
f"Invalid commonality value: {commonality}. " | |
f"Must be one of: {', '.join(commonality_map.keys())}" | |
) | |
commonality_threshold = commonality_map.get(commonality) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
smact/utils/oxidation.py
(5 hunks)
🔇 Additional comments (5)
smact/utils/oxidation.py (5)
58-75
: LGTM! Clear documentation and proper parameter propagationThe method signature changes and documentation updates are well-implemented.
Line range hint
135-169
: LGTM! Consistent parameter updates and clear documentationThe method changes maintain consistency with other methods and provide clear documentation of the new functionality.
179-179
: LGTM! Consistent threshold comparisonThe change from > to >= aligns with the consensus threshold usage throughout the class.
37-38
:⚠️ Potential issueAdd input validation for the commonality parameter
The commonality parameter should be validated to prevent silent failures with invalid inputs.
commonality_map = {"low": 0, "medium": 10, "high": 50} +if commonality not in commonality_map: + raise ValueError( + f"Invalid commonality value: {commonality}. " + f"Must be one of: {', '.join(commonality_map.keys())}" + ) commonality_threshold = commonality_map.get(commonality)Likely invalid or redundant comment.
46-48
:⚠️ Potential issueAdd error handling for zero element totals
The species proportion calculation could raise a division by zero error if all results are filtered out for an element.
element_totals = filtered_df.groupby("element")["results_count"].transform("sum") +if (element_totals == 0).any(): + raise ValueError( + "No valid results remain for some elements after filtering. " + "Consider adjusting the consensus threshold." + ) filtered_df["species_proportion (%)"] = filtered_df["results_count"] / element_totals * 100Likely invalid or redundant comment.
self, | ||
consensus: int = 3, | ||
include_one_oxidation_state: bool = False, | ||
sort_by_occurrences: bool = True, | ||
include_zero: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add commonality parameter for consistency
This method should include the commonality parameter for consistency with other methods in the class.
def get_species_occurrences_df(
self,
consensus: int = 3,
include_one_oxidation_state: bool = False,
sort_by_occurrences: bool = True,
- include_zero: bool = False,
+ include_zero: bool = False,
+ commonality: str = "low",
):
Committable suggestion skipped: line range outside the PR's diff.
@@ -142,7 +164,9 @@ def write( | |||
if not filename.endswith(".txt"): | |||
filename += ".txt" | |||
with open(filename, "w") as f: | |||
f.write(f"#\n# Oxidation state set\n# Source: ICSD (2024), filtered for > {threshold} reports\n#\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest that users should have the option to either write out lists based on either the threshold/consensus or commonality. This should preserve backwards compatibility, as well as provide flexibility to users to choose how to write the oxidation states lists
|
||
Returns: | ||
pd.DataFrame: The filtered oxidation states list as a DataFrame. | ||
""" | ||
commonality_map = {"low": 0, "medium": 10, "high": 50} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest that there's additional argument to the filter function for commonality_map
which can take your default arguments.
This provides flexibility to the user if they want to define their own thresholds
Thanks @AntObi ! I've introduced your suggestions and updated the tests accordingly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
smact/utils/oxidation.py (1)
55-61
:⚠️ Potential issueAdd error handling for zero element totals.
The species proportion calculation could raise a division by zero error if all results are filtered out for an element.
element_totals = filtered_df.groupby("element")["results_count"].transform("sum") + if (element_totals == 0).any(): + raise ValueError( + "No valid results remain for some elements after filtering. " + "Consider adjusting the consensus threshold." + ) filtered_df["species_proportion (%)"] = filtered_df["results_count"] / element_totals * 100
🧹 Nitpick comments (5)
smact/utils/oxidation.py (2)
26-41
: LGTM! Consider enhancing the docstring.The parameter renaming and addition of commonality filtering improve the API. The docstring is comprehensive but could be enhanced by documenting the return value's columns.
Returns: - pd.DataFrame: The filtered oxidation states list as a DataFrame. + pd.DataFrame: The filtered oxidation states list as a DataFrame with columns: + - element (str): Chemical element symbol + - oxidation_state (str): Space-separated oxidation states
179-181
: Enhance the file header with consensus threshold information.The header should mention both the commonality level and consensus threshold for complete filtering information.
f.write( - f"#\n# Oxidation state set\n# Source: ICSD (2024), filtered for {commonality} commonality of reports\n#\n" + f"#\n# Oxidation state set\n# Source: ICSD (2024), filtered for {commonality} commonality of reports (consensus ≥ {consensus})\n#\n" )smact/tests/test_utils.py (3)
181-181
: Enhance test coverage for commonality parameter.The test should verify the behaviour with different commonality values.
threshold = 10 - filtered_df = self.ox_filter.filter(consensus=threshold) + filtered_df = self.ox_filter.filter(consensus=threshold) + filtered_df_medium = self.ox_filter.filter(consensus=threshold, commonality="medium") + filtered_df_high = self.ox_filter.filter(consensus=threshold, commonality="high") self.assertIsInstance(filtered_df, pd.DataFrame) + self.assertIsInstance(filtered_df_medium, pd.DataFrame) + self.assertIsInstance(filtered_df_high, pd.DataFrame) + self.assertGreater(len(filtered_df), len(filtered_df_medium)) + self.assertGreater(len(filtered_df_medium), len(filtered_df_high))
187-227
: Enhance file content verification.The tests could be more thorough in verifying the structure and content of the written files.
with open(f"{filename}.txt") as f: content = f.read() # Check if the comment is included in the file self.assertIn(comment, content) + # Verify file structure + lines = content.split("\n") + self.assertTrue(any(line.startswith("# Source: ICSD (2024)") for line in lines)) + self.assertTrue(all(len(line.split()) <= 2 for line in lines if not line.startswith("#") and line)) + # Verify oxidation states format + for line in lines: + if line and not line.startswith("#"): + parts = line.split() + if len(parts) == 2: + self.assertTrue(all(x.isdigit() or x == "-" for x in parts[1].split()))
229-261
: Add tests for custom commonality thresholds.The tests should verify that custom commonality thresholds work correctly.
# Test with a specific consensus threshold species_list_threshold = self.ox_filter.get_species_list(consensus=5) self.assertIsInstance(species_list_threshold, list) self.assertGreater(len(species_list_threshold), 0) + + # Test with custom commonality threshold + species_list_custom = self.ox_filter.get_species_list( + consensus=3, + commonality_thresholds={"low": 0, "medium": 15, "high": 60} + ) + self.assertIsInstance(species_list_custom, list) + self.assertGreater(len(species_list_custom), 0)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
smact/tests/files/oxidation_states_icsd24_consensus.txt
(1 hunks)smact/tests/files/oxidation_states_icsd24_consensus_w_0.txt
(1 hunks)smact/tests/test_utils.py
(2 hunks)smact/utils/oxidation.py
(5 hunks)
✅ Files skipped from review due to trivial changes (2)
- smact/tests/files/oxidation_states_icsd24_consensus_w_0.txt
- smact/tests/files/oxidation_states_icsd24_consensus.txt
🔇 Additional comments (4)
smact/utils/oxidation.py (3)
42-50
: Make commonality thresholds configurable.As suggested in a previous review, the commonality thresholds should be configurable to provide more flexibility to users.
def filter( self, consensus: int = 3, include_zero: bool = False, commonality: str | float = "low", + commonality_thresholds: dict[str, float] | None = None, ): - commonality_map = {"low": 0, "medium": 10, "high": 50} + default_map = {"low": 0, "medium": 10, "high": 50} + commonality_map = commonality_thresholds or default_map
Line range hint
105-139
: Add commonality parameter for consistency.This method should include the commonality parameter for consistency with other methods in the class.
def get_species_occurrences_df( self, consensus: int = 3, include_one_oxidation_state: bool = False, sort_by_occurrences: bool = True, include_zero: bool = False, + commonality: str = "low", ):
191-191
: LGTM! Consistent threshold comparison.The change to use >= makes the filtering logic consistent across all methods.
smact/tests/test_utils.py (1)
166-167
: LGTM! Test paths updated correctly.The test file paths have been appropriately updated to reflect the consensus-based filtering approach.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #366 +/- ##
==========================================
+ Coverage 75.47% 75.58% +0.11%
==========================================
Files 31 31
Lines 2642 2679 +37
==========================================
+ Hits 1994 2025 +31
- Misses 648 654 +6 ☔ View full report in Codecov by Sentry. |
Pull Request Template
Description
Updated get_species_occurences_df by adding a consensus_threshold. Helps to avoid non-physical species. Default is 3.
Fixes # (issue)
Type of change
Please delete options that are not relevant.
Reviewers
@AntObi
Checklist
Summary by CodeRabbit
Release Notes
Chores
codespell
during pre-commit phase.New Features