Exception thrown since presidio analyzer is in bad state #147

Open
StefanKarlsson321 opened this issue Jun 3, 2024 · 5 comments

Comments

@StefanKarlsson321

StefanKarlsson321 commented Jun 3, 2024

Describe the bug
An InvalidParamException is thrown because the Presidio analyzer is in a bad state.

To Reproduce
Steps to reproduce the behavior:

  1. Install LLM Guard 0.3.13
  2. With Python 3.11, execute the following code:
from llm_guard import input_scanners, scan_prompt
import json

Secrets = input_scanners.Secrets()

def sanitize(prompt):
    sanitized_prompt = scan_prompt([Secrets], prompt)
    print(sanitized_prompt)

text='{"awsAccountId":"327878933619","digestStartTime":"2023-10-15T22:04:04Z","digestEndTime":"2023-10-15T23:04:04Z","digestS3Bucket":"paul-trail","digestS3Object":"AWSLogs\/327878933619\/CloudTrail-Digest\/ap-northeast-1\/2023\/10\/15\/327878933619_CloudTrail-Digest_ap-northeast-1_paul-trail_us-west-2_20231015T230404Z.json.gz","digestPublicKeyFingerprint":"be2f0b997552f44942837300ba1aba9d","digestSignatureAlgorithm":"SHA256withRSA","newestEventTime":"2023-10-15T22:58:17Z","oldestEventTime":"2023-10-15T22:04:51Z","previousDigestS3Bucket":"paul-trail","previousDigestS3Object":"AWSLogs\/327878933619\/CloudTrail-Digest\/ap-northeast-1\/2023\/10\/15\/327878933619_CloudTrail-Digest_ap-northeast-1_paul-trail_us-west-2_20231015T220404Z.json.gz","previousDigestHashValue":"8f953371d3e85eddb89b05ed6b9e680791055315c73e1025ab5dba7bb2aee189","previousDigestHashAlgorithm":"SHA-256","previousDigestSignature":"11c11e253f4929eaded49c9d826b257a5ab894ce002988bd07ed2bc6407f1b0ef74f48634c364c6884c6470c9416d73f0742f8758746fc8db4cf23b75c713304779bb6d181ccae4b6a78ae5106f1602ce49af3f9dea4e9ba92761fcaf3e02a5f3d64558d7f4b2eff85f0cc523a770a3b1092e0e37aa665f3c37b75ecc93c94a4640825e0ebe44b2b4fa48b7477040f08a83db2224b403c46476ca25a1b53b5b5db86be04e623fef2d9a2a8eba482239439d6d49cb5eb759a90184f72506a8788fb085f56830c46f51d6e216152bf9156b33cbbee3aeeb5b00540f333708f870d316291f37dd530491a7785ddafdb83543c327fa504e200efefbadd644fed9b9a","logFiles":[{"s3Bucket":"paul-trail","s3Object":"AWSLogs\/327878933619\/CloudTrail\/ap-northeast-1\/2023\/10\/15\/327878933619_CloudTrail_ap-northeast-1_20231015T2205Z_iRIoDMA9l9Q4kmFy.json.gz","hashValue":"4309c6161e37538de72ec6f679e86b7e45aebed71fa7e76af70c3019fef44e19","hashAlgorithm":"SHA-256","newestEventTime":"2023-10-15T22:04:51Z","oldestEventTime":"2023-10-15T22:04:51Z"},{"s3Bucket":"paul-trail","s3Object":"AWSLogs\/327878933619\/CloudTrail\/ap-northeast-1\/2023\/10\/15\/327878933619_CloudTrail_ap-northeast-1_20231015T2300Z_aDYIgZODwysx0Irn.json.gz","hashValue":"de90c3b55016bc5fad9c12378ccc6fc38180a15bd95879305415572a4472b1a9","hashAlgorithm":"SHA-256","newestEventTime":"2023-10-15T22:58:17Z","oldestEventTime":"2023-10-15T22:58:17Z"},{"s3Bucket":"paul-trail","s3Object":"AWSLogs\/327878933619\/CloudTrail\/ap-northeast-1\/2023\/10\/15\/327878933619_CloudTrail_ap-northeast-1_20231015T2300Z_9eJ8qdKnXIfFg2wM.json.gz","hashValue":"85e79f9b40d5a57be15fa6ac6f54d3ea1919611e37ca682c1e753287ac7b9bcb","hashAlgorithm":"SHA-256","newestEventTime":"2023-10-15T22:58:17Z","oldestEventTime":"2023-10-15T22:58:17Z"},{"s3Bucket":"paul-trail","s3Object":"AWSLogs\/327878933619\/CloudTrail\/ap-northeast-1\/2023\/10\/15\/327878933619_CloudTrail_ap-northeast-1_20231015T2225Z_OviGSSWadUI1W1r7.json.gz","hashValue":"58583ed7d52597e47e073db9b756f38815a8a5aff92911911710f18e65e1c44d","hashAlgorithm":"SHA-256","newestEventTime":"2023-10-15T22:20:34Z","oldestEventTime":"2023-10-15T22:10:12Z"},{"s3Bucket":"paul-trail","s3Object":"AWSLogs\/327878933619\/CloudTrail\/ap-northeast-1\/2023\/10\/15\/327878933619_CloudTrail_ap-northeast-1_20231015T2225Z_j5hj9VuYmchJHAkK.json.gz","hashValue":"c18c49161f97def10a14cffa5b5ab441c8fe8194af1cb1d79d470b6173f901c4","hashAlgorithm":"SHA-256","newestEventTime":"2023-10-15T22:20:34Z","oldestEventTime":"2023-10-15T22:20:34Z"}]}'

sanitize(text)
sanitize("Hello")

Expected behavior
No exception thrown and no secrets detected in the "Hello" prompt.

Screenshots
raise InvalidParamException(err_msg)
presidio_anonymizer.entities.invalid_exception.InvalidParamException: Invalid analyzer result, start: -1 and end: 63, while text length is only 5.
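
For context, here is a rough sketch of the kind of bounds check that would produce a message like this. It is an illustration only and not Presidio's actual code: `InvalidSpanError`, `validate_span`, and `span_error` are hypothetical names made up for this example.

```python
class InvalidSpanError(ValueError):
    """Hypothetical stand-in for presidio_anonymizer's InvalidParamException."""


def validate_span(text: str, start: int, end: int) -> None:
    """Raise if the half-open span [start, end) does not fit inside text."""
    if start < 0 or end > len(text) or start > end:
        raise InvalidSpanError(
            f"Invalid analyzer result, start: {start} and end: {end}, "
            f"while text length is only {len(text)}."
        )


def span_error(text: str, start: int, end: int):
    """Return the error message for a bad span, or None if the span is valid."""
    try:
        validate_span(text, start, end)
    except InvalidSpanError as exc:
        return str(exc)
    return None


# A span of (-1, 63) cannot belong to a 5-character prompt such as "Hello".
print(span_error("Hello", -1, 63))
# A span covering the whole prompt is fine.
print(span_error("Hello", 0, 5))
```

The point is that a span like (-1, 63) can only have been computed against a different, longer text than "Hello".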

Additional context
Inspiration for finding the bug: microsoft/presidio#1376
Although I have applied the fix for that bug (microsoft/presidio#1377), the issue still occurs.

@asofter
Collaborator

asofter commented Jun 3, 2024

Hey @StefanKarlsson321 ,
Thanks for submitting the bug. This scanner doesn't use Presidio. It relies on https://github.com/bridgecrewio/detect-secrets

I see that even this library is not doing a good job on your prompt.

@StefanKarlsson321
Author

Hmm, but I get this. It seems to point to Presidio:

Exception has occurred: InvalidParamException
Invalid analyzer result, start: -1 and end: 511, while text length is only 5.
  File "/home/***/***/test_llm_guard.py", line 7, in sanitize
    sanitized_prompt = scan_prompt([Secrets], prompt)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dhb/pythoncod/test_llm_guard.py", line 13, in <module>
    sanitize("Hello")
presidio_anonymizer.entities.invalid_exception.InvalidParamException: Invalid analyzer result, start: -1 and end: 511, while text length is only 5.

@asofter
Collaborator

asofter commented Jun 4, 2024

I see now, but this issue is a bit different. We rely on the text replacer from the Presidio library, and apparently we didn't refresh credentials each time we ran a scan. I fixed that issue in the latest commit.
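
A minimal sketch of how state carried over between scans could produce exactly the reported offsets. This is an illustration under assumptions, not LLM Guard's actual code: `StaleScanner`, `FreshScanner`, and the 64-hex-character regex (a crude stand-in for detect-secrets) are all hypothetical. The idea that the offsets come from a failed substring search is an inference from the numbers in the error messages, since `str.find` returns -1 for a missing substring and -1 + 64 = 63.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]

# Crude stand-in for a secrets detector: any run of exactly 64 hex characters.
HEX_64 = re.compile(r"\b[0-9a-f]{64}\b")


class StaleScanner:
    """Hypothetical scanner that keeps findings between calls (the buggy pattern)."""

    def __init__(self) -> None:
        self._secrets: List[str] = []

    def scan(self, text: str) -> List[Span]:
        # Findings are appended and never cleared, so they leak into later scans.
        self._secrets.extend(HEX_64.findall(text))
        spans = []
        for secret in self._secrets:
            start = text.find(secret)  # -1 when the secret is not in this text
            spans.append((start, start + len(secret)))
        return spans


class FreshScanner(StaleScanner):
    """Same scanner, but the findings are reset at the start of every scan."""

    def scan(self, text: str) -> List[Span]:
        self._secrets = []
        return super().scan(text)


secret = "a" * 64

stale = StaleScanner()
print(stale.scan("key=" + secret))  # [(4, 68)]
print(stale.scan("Hello"))          # [(-1, 63)]

fresh = FreshScanner()
print(fresh.scan("key=" + secret))  # [(4, 68)]
print(fresh.scan("Hello"))          # []
```

In the stale variant, the second call reports a span for a secret that only existed in the first prompt, which is the kind of span a length check on "Hello" would reject.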

@StefanKarlsson321
Author

I added an Anonymizer after the secrets scanner and, with otherwise identical code, got this from presidio_analyzer:

OverflowError: int too big to convert

I will check with an updated presidio_analyzer as well and report back.

@StefanKarlsson321
Author

When I update presidio_analyzer to the very latest origin/main, the issue seems to be gone.

Thanks a lot!
