'SwigPyObject' object has no attribute 'thisown' when trying to apply multiple redactions #2035
-
Describe the bug (mandatory)
I am trying to loop through a set of patterns to find sensitive text and redact it in a PDF. However, after the first time I use apply_redactions(), I can no longer use the page and get the following error:
To Reproduce (mandatory)
While I cannot share a sample file because it contains PHI, below is the code I am using. This is the function to find the sensitive data:
and this is the code to apply the redactions:
Interestingly, after the first redaction is applied, if I try to extract another page from the doc,
Your configuration (mandatory)
Let me know if you need any other info. Thanks!
Replies: 17 comments 13 replies
-
It is always a problem when there is no data to reproduce an error! What I did notice looking at your text:
-
For example, this snippet basically does the same thing: make multiple redactions on all pages and apply them. And it does work:
import fitz
doc = fitz.open("PyMuPDF.pdf")
for page in doc:
    rl = page.search_for("link")
    for r in rl:
        page.add_redact_annot(r)
    if rl != []:
        page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
'Redact' annotation on page 2 of PyMuPDF.pdf
True
'Redact' annotation on page 4 of PyMuPDF.pdf
True
'Redact' annotation on page 5 of PyMuPDF.pdf
'Redact' annotation on page 5 of PyMuPDF.pdf
'Redact' annotation on page 5 of PyMuPDF.pdf
'Redact' annotation on page 5 of PyMuPDF.pdf
'Redact' annotation on page 5 of PyMuPDF.pdf
True
'Redact' annotation on page 6 of PyMuPDF.pdf
'Redact' annotation on page 6 of PyMuPDF.pdf
True
... and so on ...
-
@JorjMcKie Thanks for the response! I agree, if you have any suggestions on generating a sample to send, I am happy to do so. As I said, all the files I have contain patient information, which is why I am trying to do the redaction. After your comment, I changed the main loop in the code to the following:
I am only working on the first page right now, the loop is to loop over multiple regex patterns for search conditions, so each iteration generates the areas that need to be redacted and they are appended to all_areas. The page.add_redact_annot statement is successful but when I try to apply redactions I get a similar error code:
If I just took the bit of my code that is doing what your code is doing it would look like:
Which works as expected. But, if I used your snippet, what I am trying to do is:
Where samples is a list of string patterns for which to search. Not sure how to apply the last if statement in this example, but hopefully the idea is clear enough. Thanks again!
-
Hopefully to simplify things a bit more, below is the part of the minimal code that is a problem:
I tested with manual iteration:
Which does not throw any error until I try to apply the redactions. I looked at the structure of the data before applying the redactions and saw no discernible difference among the added redactions, so I am not sure why the redactions succeed on the first iteration and fail on subsequent ones. Thanks
-
Well, again: I can't say anything without having a reproducing file at hand.
redact_rects = []
redact_rects.extend(page.search_for(some1))
redact_rects.extend(page.search_for(some2))
# ... etc., when finished:
for r in redact_rects:
    page.add_redact_annot(r, ...)
page.apply_redactions()
Please be aware that every execution of page.search_for() creates a new text page internally. So a variation of your code may be to create one single TextPage and reuse it for every search.
-
So the above snippet may look like this:
redact_rects = []
tp = page.get_textpage(flags=fitz.TEXTFLAGS_SEARCH)
redact_rects.extend(page.search_for(some1, textpage=tp))
redact_rects.extend(page.search_for(some2, textpage=tp))
# ... etc., when finished:
del tp
for r in redact_rects:
    page.add_redact_annot(r, ...)
page.apply_redactions()
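A minimal sketch of the collect-first, apply-once pattern from the two snippets above, wrapped in a function. The names redact_page and needles are hypothetical. The function only relies on page.search_for(), page.add_redact_annot() and page.apply_redactions(), so it accepts any object offering those methods. Whether this avoids the reported error is not confirmed anywhere in this thread.

```python
def redact_page(page, needles):
    """Collect hit rectangles for all needles, then apply redactions once.

    `page` is assumed to be a PyMuPDF page (or any object offering
    search_for, add_redact_annot and apply_redactions).
    Returns the number of redaction annotations added.
    """
    rects = []
    for needle in needles:
        # Search first, while the page content is still unmodified.
        rects.extend(page.search_for(needle))
    for rect in rects:
        page.add_redact_annot(rect)
    if rects:
        # A single call per page, after all annotations exist.
        page.apply_redactions()
    return len(rects)
```

With PyMuPDF this might be called as `for page in doc: redact_page(page, samples)`, where samples is the list of search strings mentioned earlier in the thread.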
-
@JorjMcKie of course! Your help is greatly appreciated! I tried your above suggestion but still get the same error. I also tried taking one of the loops out by using pipes in the regex pattern:
And, it seems that there is something happening at the document level because if I try to extract any page from the original document variable
Interestingly, when I do not use the OR operator in the regex string,
-
Let's move this issue to Discussions Q&A, OK?
-
I also do not understand why regex is necessary at all, nor why the page text is split into lines.
-
@JorjMcKie good question. There are multiple sensitive pieces of information contained within the document. Some have a consistent precursor (such as "Acct #: 12345") but some do not (such as the patient name on the document). So I am using regex to find all of the instances of sensitive information that match expected patterns, such as the name is always a character string, followed by a comma and a white space, followed by another character string, followed by another white space, then a single character and a period. Regex was the only way I could think of doing this type of thing. As for splitting the text into lines, it was the only way I could think of to obtain all of the occurrences of a sensitive piece of information on a page. If there is another/better way of approaching this, I am certainly open to suggestions!
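The two kinds of pattern described above could be written roughly as follows. This is only a sketch: the poster's real expressions are not shown in the thread, so NAME_RE, ACCT_RE and find_sensitive are hypothetical names, and the character classes assume plain ASCII letters.

```python
import re

# Hypothetical patterns; the poster's actual expressions are not shown.
# Name: letters, comma + space, letters, space, one letter, period.
NAME_RE = re.compile(r"\b[A-Za-z]+, [A-Za-z]+ [A-Za-z]\.")
# Account number with its fixed precursor, e.g. "Acct #: 12345".
ACCT_RE = re.compile(r"Acct #: \d+")


def find_sensitive(text):
    """Return all substrings matched by either pattern, in order of appearance."""
    hits = [m for rx in (NAME_RE, ACCT_RE) for m in rx.finditer(text)]
    return [m.group() for m in sorted(hits, key=lambda m: m.start())]
```

Each returned string could then be handed to page.search_for() to obtain its rectangles on the page.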
-
Maybe I am still getting it wrong: If this is true, then you really do not need regex. Instead, simply extract the text in a form that gives you line rectangles. This way you do not even need to search!
-
Could work like this:
redact_rects = []
for b in page.get_text("dict")["blocks"]:
    # image blocks have no "lines" key, so skip them via an empty default
    for l in b.get("lines", []):
        bbox = l["bbox"]
        text = " ".join([s["text"] for s in l["spans"]])
        # sensitive1, sensitive2, ... are placeholders for the critical line prefixes
        if text.startswith((sensitive1, sensitive2, ...)):
            redact_rects.append(bbox)
for r in redact_rects:
    page.add_redact_annot(r)
page.apply_redactions()
If the issue is that a line must be redacted if it contains critical text, then
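For the case where a line should be redacted whenever it merely contains critical text, one possible variation is to replace the startswith test with a regex search over each line's text. The sketch below assumes its input has the layout returned by page.get_text("dict"). The function name line_rects_matching and its parameters are hypothetical.

```python
import re


def line_rects_matching(page_dict, patterns):
    """Return bboxes of text lines whose text matches any regex in `patterns`.

    `page_dict` is assumed to follow the page.get_text("dict") layout:
    blocks -> lines -> spans, with "bbox" on each line and "text" on each span.
    """
    compiled = [re.compile(p) for p in patterns]
    rects = []
    for block in page_dict["blocks"]:
        # Image blocks carry no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            text = " ".join(span["text"] for span in line["spans"])
            if any(rx.search(text) for rx in compiled):
                rects.append(line["bbox"])
    return rects
```

The returned rectangles could be passed to page.add_redact_annot() one by one, followed by a single page.apply_redactions() call.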
-
@JorjMcKie Unfortunately still not working. I even tried creating a file then reloading it with fitz:
Then I iteratively load the newly created pdf and save the next iteration. This actually works perfectly IF there is only one pattern in the regex, i.e.
-
@JorjMcKie That is an excellent analogy. But I have some good news! I think I may be getting to the bottom of the problem. It seems to have something to do with the extracted areas for redaction. As another workaround, I made a bash script that passes the regex patterns as a bash array into a Python script that performs the redaction. Doing it this way, I got an extra bit of error message/information:
With the extra bit being
I have to stop for today but am investigating the cause and will report back tomorrow.
-
@JorjMcKie I got everything to work by iterating with a bash script as opposed to within Python. The script opens and saves the document, closes the Python session, then opens a new one to process the next regex pattern. I still couldn't track down the root of the problem, but it must have something to do with the regex drawing bounding boxes in areas that don't make sense for annotations/redactions. Unfortunately, due to time constraints, I will just have to go with this solution even though it is not optimal. I really appreciate your help!!
-
Hi @JorjMcKie: I am facing a similar issue and am able to reproduce the error. I have attached a screenshot as well as a PDF file.
Here is the PDF file:
test.pdf
It happens on Mac M1 12.6.2, with Python 3.8 and Python 3.11 too.
-
Just an update: Using the latest PyMuPDF (1.21.2) did not produce the issue. I had the issue with PyMuPDF 1.20.2.
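Since the last update reports the error with PyMuPDF 1.20.2 but not with 1.21.2, a script could check the installed version before redacting. The sketch below uses 1.21.2 as the threshold only because it is the version reported as working. Which release actually contains the fix is not stated in this thread, and the helper names are hypothetical.

```python
def version_tuple(version):
    """Turn a dotted version string such as "1.21.2" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# Assumption: threshold taken from the report above, not from a changelog.
FIRST_REPORTED_WORKING = (1, 21, 2)


def is_reported_working(version):
    """True if `version` is at least the version reported as working."""
    return version_tuple(version) >= FIRST_REPORTED_WORKING
```

With PyMuPDF installed, the version string to pass in is available as fitz.VersionBind.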