Get patterns from regex match in ITIN recognizer #959

Closed

Contributor

@aperezfals aperezfals commented Dec 1, 2022

Get patterns from regex match in ITIN recognizer

Adds a new prop get_improved_pattern_fn to Pattern that returns a new Pattern based on the regex match info.

UsItinRecognizer is changed to pass an improve_itin_pattern function to its only pattern, reducing the number of regexes used from 3 to 1. The regex in UsItinRecognizer now uses named groups for the separators ("-" and space). The implementation of improve_itin_pattern in UsItinRecognizer uses the named groups to return different pattern names and scores.
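
The idea described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the `Pattern` dataclass, the simplified regex, the group names `sep1` and `sep2`, the pattern names, and the scores are all assumptions chosen for the example.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pattern:
    """Hypothetical stand-in for a recognizer pattern (not Presidio's class)."""

    name: str
    regex: str
    score: float


# Simplified ITIN-like regex (an assumption for illustration).
# Each optional separator is captured in its own named group.
ITIN_REGEX = r"\b9\d{2}(?P<sep1>[- ]?)[5-9]\d(?P<sep2>[- ]?)\d{4}\b"


def improve_itin_pattern(pattern: Pattern, match: re.Match) -> Pattern:
    """Return a new Pattern whose name and score depend on the separators found."""
    sep1, sep2 = match.group("sep1"), match.group("sep2")
    if sep1 and sep2:
        # Both separators present: strongest evidence in this sketch.
        return replace(pattern, name="Itin (medium)", score=0.5)
    if not sep1 and not sep2:
        # No separators at all: a bare nine-digit number.
        return replace(pattern, name="Itin (weak)", score=0.3)
    # Exactly one separator: least convincing shape.
    return replace(pattern, name="Itin (very weak)", score=0.05)


def analyze(text: str) -> list:
    """Run the single regex and let the improve function refine each match."""
    base = Pattern(name="Itin", regex=ITIN_REGEX, score=0.05)
    return [improve_itin_pattern(base, m) for m in re.finditer(base.regex, text)]
```

With one regex and one function, the separator handling lives in ordinary Python rather than in three near-duplicate expressions.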

Issue reference

#956

Future PRs will reduce the number of regexes in the recognizers listed in the issue.

Checklist

  • Add get_improved_pattern_fn prop to PatternRecognizer
  • Reduce the number of regexes in UsItinRecognizer by passing a function improve_itin_pattern to the pattern
  • Add more unit tests for space separators

Contributor

@omri374 omri374 left a comment

Thanks! This is a great addition. Added some comments with questions

@aperezfals
Contributor Author

@omri374 @SharonHart I updated the PR with the proposed changes. Since the proposal is incompatible with the existing approach, I implemented a new recognizer for it. We can migrate the rest of the recognizers to it step by step. I included the SSN recognizer in this PR as a good example of both pattern-level and recognizer-level score improvement. I also added a string sanitizer class to make the replacement list we use more extensible. The string sanitizer now uses Python translation tables, which are faster than str.replace().
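
For readers unfamiliar with translation tables, here is a small sketch of what such a sanitizer could look like. The class name `TranslateSanitizer` and its constructor signature are hypothetical and are not taken from the PR; only `str.maketrans` and `str.translate` are standard Python. The sketch shows the mechanism and does not measure performance.

```python
class TranslateSanitizer:
    """Hypothetical sanitizer built on str.translate (illustrative names only)."""

    def __init__(self, *trans_table: str) -> None:
        # The arguments are forwarded unchanged to str.maketrans.
        # For example, ("", "", "- ") builds a table that deletes '-' and ' '.
        self.trans_table = str.maketrans(*trans_table)

    def sanitize(self, text: str) -> str:
        # A single pass over the text applies every mapping in the table.
        return text.translate(self.trans_table)


# Usage: strip the separators before running a checksum or comparison.
remove_separators = TranslateSanitizer("", "", "- ")
cleaned = remove_separators.sanitize("912-70 1234")
```

A table handles any number of single-character replacements in one call, whereas chained `str.replace()` calls scan the string once per replacement.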

@omri374
Contributor

omri374 commented Jan 17, 2023

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@omri374
Contributor

omri374 commented Jan 17, 2023

@aperezfals, thanks for the update of the PR! I might be missing something, but could you elaborate on why this change cannot be added to the existing PatternRecognizer and is not compatible with the existing flow? Is it because it would then have all three validate, invalidate and improve approaches?

@omri374
Contributor

omri374 commented Jan 18, 2023

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@aperezfals
Contributor Author

> @aperezfals, thanks for the update of the PR! I might be missing something, but could you elaborate on why this change cannot be added to the existing PatternRecognizer and is not compatible with the existing flow? Is it because it would then have all three validate, invalidate and improve approaches?

@omri374 Mainly, yes. Having validate, invalidate and improve together could be confusing: which method should take precedence? I guess it could work if we implement improve directly on PatternRecognizer and make it override validate and invalidate (only when improve is implemented). But we should clarify this somehow, in the docs of the methods and/or by marking validate and invalidate as deprecated. What do you think, @omri374? In that case, improve would also go in the Pattern class.

@SharonHart
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@omri374
Contributor

omri374 commented Jan 19, 2023

Thanks @aperezfals. Some considerations I'm thinking of:

  1. We should handle backward compatibility. If a user has a custom recognizer with an implementation of validate, and we decide to deprecate it, their code would not be called.
  2. To me, validate and invalidate are deterministic methods that should return True or False. Checksum is one example of validation. This is different from improvement which is a way to enhance the score (like context enhancement). Having all three might be confusing, but I do see them serving different purposes. Happy to get your thoughts and @SharonHart's too.
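
One way the three mechanisms could coexist is sketched below. This is only an illustration of a possible precedence order, not something agreed in this thread or implemented in the PR: the function `resolve_score`, its parameters, and the choice to let the deterministic checks win over `improve` are all assumptions.

```python
from typing import Callable, Optional

MIN_SCORE, MAX_SCORE = 0.0, 1.0


def resolve_score(
    base_score: float,
    validated: Optional[bool],
    invalidated: Optional[bool],
    improve: Optional[Callable[[float], float]] = None,
) -> float:
    """Combine deterministic checks with an optional score improvement.

    validated / invalidated are None when the recognizer does not
    implement the corresponding check.
    """
    # 1. Invalidation is a hard veto.
    if invalidated:
        return MIN_SCORE
    # 2. Validation (e.g. a checksum) is deterministic: pass or fail.
    if validated is not None:
        return MAX_SCORE if validated else MIN_SCORE
    # 3. Only when no deterministic answer exists does improve adjust the score.
    if improve is not None:
        return min(MAX_SCORE, max(MIN_SCORE, improve(base_score)))
    return base_score
```

Under this ordering, an existing custom recognizer that only implements validate behaves the same as before, which addresses the backward-compatibility concern in point 1.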

Contributor

@omri374 omri374 left a comment

Adding my review on the code. Hopefully we can have this resolved (together with the previous comment on adding vs. modifying) and released in the next version!

:param name: the name of the pattern
:param regex: the regex pattern to detect
:param score: the pattern's strength (values varies 0-1)
:param get_improved_pattern_func: a function that improve the score of the analysis explanation
Contributor

Suggested change
:param get_improved_pattern_func: a function that improve the score of the analysis explanation
:param improve_score_fn: a function that improve the score of the analysis explanation


class RegexReplaceSanitizer(StringSanitizer):
    """
    Replace parts of a string using a regex to search the term to replace.
Contributor

please add the params (regex, replace) to the docstring

        self.replace = replace

    def sanitize(self, text: str) -> str:
        return re.sub(self.regex, self.replace, text)
Contributor

please add docstring



class StringSanitizer:
    """Cleans a string."""
Contributor

Could you please add a more detailed description? What type of cleaning does this do?

import regex as re


class StringSanitizer:
Contributor

Should we have this as an abstract class?


        See https://docs.python.org/3/library/stdtypes.html#str.maketrans
        """
        self.trans_table = str.maketrans(*trans_table)
Contributor

cool!

class ImprovablePatternRecognizer(LocalRecognizer):
    """
    PII entity recognizer using regular expressions or deny-lists.
    Analysis explanations can be improved by a pattern or by the recognizer.
Contributor

It is not just the explanations that could be improved, but also the score. Am I correct?

@iukea1

iukea1 commented Nov 26, 2023

This would be nice

@omri374 omri374 closed this Jan 16, 2025