Hello,
While using the extract_text_to_fp function with the latest version of pdfminer.six, I've encountered an issue where CID characters (e.g., CID(123)) appear in the extracted text. These characters seem to be associated with the fonts used in the PDF, but there is no straightforward way to clean them or make them readable.
This makes it difficult to obtain clean, readable text from the PDF, because these CID characters are not converted into standard Unicode characters or otherwise intelligible text. Is there a recommended way to handle this with pdfminer.six, or could the library be improved to manage such cases more effectively?
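For context, the only stopgap I can think of is post-processing the extracted string. Here is a minimal sketch, assuming the placeholders show up as `(cid:123)` or in the `CID(123)` spelling used above. The names `CID_TOKEN` and `strip_cid_tokens` are hypothetical helpers of mine, not part of pdfminer.six. This only deletes the placeholders and does not recover the original characters.

```python
import re

# Assumption: unmapped glyphs appear as "(cid:123)" or "CID(123)" in the output.
CID_TOKEN = re.compile(r"\(cid:\d+\)|cid\(\d+\)", re.IGNORECASE)


def strip_cid_tokens(text: str, replacement: str = "") -> str:
    """Replace every CID placeholder in `text` with `replacement`."""
    return CID_TOKEN.sub(replacement, text)


if __name__ == "__main__":
    sample = "Invoice (cid:12)total(cid:3): 42"
    print(strip_cid_tokens(sample))
```

Passing a visible `replacement` such as `"?"` instead of the empty string keeps track of where characters were lost, which may be preferable when the missing glyphs carry meaning.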
Additional Information:
Version of pdfminer.six used: 20240706
Operating system: Linux
Example code used for extraction:
Thank you in advance for your answer and your time.
page10.pdf