pdfminer-six and pdfplumber version compatibility issue #507

LukasCY · 2025-01-21T13:27:49Z

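Problem description

I previously had pdfminer-six 20240706 installed. When I installed your software, pip reported an incompatibility and said that only 20231228 is compatible. I uninstalled the 2024 version, installed 20231228, and ran the installation of your software again. It automatically installed pdfminer 20240706 and then reported that pdfplumber requires pdfminer 20231228. This time, I thought, the 2024 version of pdfminer was not installed by me. So I uninstalled both your software and pdfminer and ran the installation once more. It still installed pdfminer 20240706 automatically and reported the same incompatibility between pdfplumber and pdfminer.
I am out of ideas now. Please advise.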

Test document

Output of pip install pdf2zh (last few lines), run after uninstalling both your software and pdfminer:
Requirement already satisfied: mdurl~=0.1 in c:\users\marco\appdata\local\packages\pythonsoftwarefoundation.python.3.10_qbz5n2kfra8p0\localcache\local-packages\python310\site-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer<1.0,>=0.12->gradio->pdf2zh) (0.1.2)
Using cached pdf2zh-1.8.9-py3-none-any.whl (54 kB)
Using cached pdfminer.six-20240706-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer-six, pdf2zh
WARNING: The script pdf2zh.exe is installed in 'C:\Users\marco\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdfplumber 0.11.5 requires pdfminer.six==20231228, but you have pdfminer-six 20240706 which is incompatible.
Successfully installed pdf2zh-1.8.9 pdfminer-six-20240706

Trying to run your software fails:
PS C:\Windows\system32> pdf2zh -i
pdf2zh : Die Benennung "pdf2zh" wurde nicht als Name eines Cmdlet, einer Funktion, einer Skriptdatei oder eines
ausführbaren Programms erkannt. Überprüfen Sie die Schreibweise des Namens, oder ob der Pfad korrekt ist (sofern
enthalten), und wiederholen Sie den Vorgang.
In Zeile:1 Zeichen:1
+ pdf2zh -i
  + CategoryInfo          : ObjectNotFound: (pdf2zh:String) [], CommandNotFoundException
  + FullyQualifiedErrorId : CommandNotFoundException

Byaidu · 2025-01-21T13:31:11Z

Please make sure the Python directory is on PATH. Otherwise the pip log contains the following warning and pdf2zh cannot be run:

Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.

LukasCY · 2025-01-21T13:38:27Z

But my Python is on PATH. I can run python --version without any problem.
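
Being able to run python --version only shows that the interpreter's own directory is on PATH. The pip warning quoted above concerns a different directory, the Scripts folder into which pdf2zh.exe was installed. The following is a minimal sketch, using only the standard library, that checks whether the script directories of the current interpreter are on PATH. The helper names is_on_path and scripts_dirs are made up for this example, and it assumes that the per-user scheme reported by sysconfig matches the location pip actually used.

```python
import os
import sysconfig


def is_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries of PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    target = os.path.normcase(os.path.normpath(directory))
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(os.path.normcase(os.path.normpath(entry)) == target for entry in entries)


def scripts_dirs():
    """Return the directories where console scripts for this interpreter may be installed."""
    dirs = [sysconfig.get_path("scripts")]
    # Per-user installs use a separate scheme ("nt_user" on Windows, "posix_user" on Linux).
    user_scheme = f"{os.name}_user"
    if user_scheme in sysconfig.get_scheme_names():
        dirs.append(sysconfig.get_path("scripts", user_scheme))
    return dirs


if __name__ == "__main__":
    for directory in scripts_dirs():
        status = "on PATH" if is_on_path(directory) else "NOT on PATH"
        print(f"{directory}: {status}")
```

If a directory is reported as not on PATH, scripts installed there (such as pdf2zh.exe) can only be started by their full path until the directory is added to PATH and a new terminal is opened.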

hellofinch · 2025-01-22T07:01:46Z

D:\Unix\pdf2zh_dist>Scripts\pip.exe list
Package                   Version
------------------------- -----------
aiofiles                  23.2.1
altgraph                  0.17.4
annotated-types           0.7.0
anyio                     4.7.0
argostranslate            1.9.6
azure-ai-translation-text 1.0.1
azure-core                1.32.0
certifi                   2024.12.14
cffi                      1.17.1
charset-normalizer        3.4.0
click                     8.1.7
colorama                  0.4.6
coloredlogs               15.0.1
cryptography              44.0.0
ctranslate2               4.5.0
deepl                     1.20.0
Deprecated                1.2.15
distro                    1.9.0
fastapi                   0.115.6
ffmpy                     0.4.0
filelock                  3.16.1
flatbuffers               24.3.25
fonttools                 4.55.3
fsspec                    2024.10.0
gradio                    5.12.0
gradio_client             1.5.4
gradio_pdf                0.0.21
h11                       0.14.0
httpcore                  1.0.7
httpx                     0.27.2
huggingface-hub           0.26.5
humanfriendly             10.0
idna                      3.10
isodate                   0.7.2
Jinja2                    3.1.4
jiter                     0.8.2
joblib                    1.4.2
lxml                      5.3.0
markdown-it-py            3.0.0
MarkupSafe                2.1.5
mdurl                     0.1.2
mpmath                    1.3.0
networkx                  3.4.2
Nuitka                    2.5.6
numpy                     2.2.0
ollama                    0.4.4
onnx                      1.16.0
onnxruntime               1.20.1
openai                    1.57.4
opencv-python-headless    4.10.0.84
ordered-set               4.1.0
orjson                    3.10.12
packaging                 24.2
pandas                    2.2.3
pdf2zh                    1.8.9
pdfminer.six              20240706
peewee                    3.17.8
pefile                    2023.2.7
pikepdf                   9.5.1
pillow                    11.0.0
pip                       24.3.1
protobuf                  5.29.1
pycparser                 2.22
pydantic                  2.10.3
pydantic_core             2.27.1
pydub                     0.25.1
Pygments                  2.18.0
pyinstaller               6.11.1
pyinstaller-hooks-contrib 2024.10
PyMuPDF                   1.25.1
pyreadline3               3.5.4
python-dateutil           2.9.0.post0
python-multipart          0.0.19
pytz                      2024.2
pywin32-ctypes            0.2.3
PyYAML                    6.0.2
regex                     2024.11.6
requests                  2.32.3
rich                      13.9.4
ruff                      0.8.3
sacremoses                0.0.53
safehttpx                 0.1.6
semantic-version          2.10.0
sentencepiece             0.2.0
setuptools                75.8.0
shellingham               1.5.4
six                       1.17.0
sniffio                   1.3.1
stanza                    1.1.1
starlette                 0.41.3
sympy                     1.13.1
tenacity                  9.0.0
tencentcloud-sdk-python   3.0.1282
tomlkit                   0.13.2
torch                     2.5.1
tqdm                      4.67.1
typer                     0.15.1
typing_extensions         4.12.2
tzdata                    2024.2
urllib3                   2.2.3
uvicorn                   0.34.0
websockets                14.1
wrapt                     1.17.2
xinference-client         1.2.0
zstandard                 0.23.0

These are the relevant packages. pdfplumber, which you mentioned, is not among them. I think your environment is polluted.
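
The pip error at the top of this thread names pdfplumber 0.11.5 as the package that pins pdfminer.six==20231228, while pdfplumber does not appear in the list above. One way to see which installed packages declare a dependency on pdfminer.six is to read the installed metadata with the standard-library importlib.metadata module. This is a rough sketch; the function names are invented for the example, and the requirement parsing is a simple regular expression rather than a full requirement parser.

```python
import re
from importlib import metadata


def canonical(name):
    """Normalize a project name the way PEP 503 does (pdfminer.six -> pdfminer-six)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement):
    """Extract the project name from a requirement string such as 'pdfminer.six==20231228'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return canonical(match.group(1)) if match else ""


def dependents_of(target):
    """Yield (distribution name, requirement) pairs for installed packages that depend on `target`."""
    wanted = canonical(target)
    for dist in metadata.distributions():
        for requirement in dist.requires or []:
            if requirement_name(requirement) == wanted:
                yield dist.metadata["Name"], requirement


if __name__ == "__main__":
    for name, requirement in dependents_of("pdfminer.six"):
        print(f"{name} requires {requirement}")
```

Run it with the same interpreter that pip installs into. Any package listed with a pin different from the installed pdfminer.six version is a candidate for the conflict.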

LukasCY · 2025-01-22T07:03:57Z

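Previously I had run pip uninstall pdfminer.six. This time I uninstalled with pip uninstall pdfminer, then pip uninstall pdfplumber, then uninstalled pdf2zh and reinstalled it. Everything went smoothly with no errors (the output still shows that the 2024 version of pdfminer was installed, but as long as there is no error, that is fine). I also added the long path shown in the warning at the end of the output to PATH.
However, running pdf2zh -i now produces a new problem: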
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\marco\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts\pdf2zh.exe\__main__.py", line 4, in <module>
  File "C:\Users\marco\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pdf2zh\__init__.py", line 2, in <module>
    from pdf2zh.high_level import translate, translate_stream
  File "C:\Users\marco\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pdf2zh\high_level.py", line 21, in <module>
    from pymupdf import Document, Font
ModuleNotFoundError: No module named 'pymupdf'

hellofinch · 2025-01-22T07:09:36Z

ModuleNotFoundError: No module named 'pymupdf'

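You are missing the pymupdf package.
I recommend setting things up in a clean environment. I think something else in your environment is interfering, and that interference makes the problem harder to pin down.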

LukasCY · 2025-01-22T08:23:10Z

I am a layperson. How do I get a "clean" environment?

hellofinch · 2025-01-22T08:27:58Z

You could consider the setup.bat installation method and use the webui that way.
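
For readers who want to stay with pip, the "clean environment" idea mentioned in this thread can also be approximated with Python's standard venv module, which creates an isolated site-packages directory that does not see leftovers such as an old pdfplumber. The sketch below is one possible way to do this, not the project's documented procedure. The function names are invented for the example, and the environment directory and package list are taken from the command line.

```python
import subprocess
import sys
import venv
from pathlib import Path


def venv_python(env_dir):
    """Return the path of the Python interpreter inside a virtual environment."""
    env_dir = Path(env_dir)
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def create_clean_env(env_dir, packages=()):
    """Create an isolated environment and optionally pip-install packages into it."""
    venv.create(env_dir, with_pip=True)
    python = venv_python(env_dir)
    if packages:
        # Using the environment's own interpreter keeps the install inside the environment.
        subprocess.run([str(python), "-m", "pip", "install", *packages], check=True)
    return python


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example call: python make_env.py pdf2zh-env pdf2zh
    # (make_env.py and pdf2zh-env are hypothetical names for this example.)
    python = create_clean_env(sys.argv[1], sys.argv[2:])
    print(f"Console scripts are installed next to {python}")
```

Scripts installed this way live in the environment's own Scripts (Windows) or bin directory, so they can be started from there without touching the system-wide PATH.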

LukasCY · 2025-01-22T09:13:02Z

I uninstalled my existing Python and then ran setup.bat, which also produced strange problems:
Der Befehl "powershell" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.
Der Befehl "powershell" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.
C:\Users\marco\Downloads\python.zip konnte nicht gefunden werden
Das System kann den angegebenen Pfad nicht finden.
Das System kann den angegebenen Pfad nicht finden.
Der Befehl "powershell" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.
Python konnte nicht gefunden werden. Führen Sie die Verknüpfung ohne Argumente aus, um sie über den Microsoft Store zu installieren, oder deaktivieren Sie diese Verknüpfung unter
Der Befehl "pip" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.
Der Befehl "pip" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.
Der Befehl "pdf2zh" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.
Drücken Sie eine beliebige Taste . . .
Could this be related to my Windows 10 being the German version?

hellofinch · 2025-01-22T11:00:20Z

Run setup.bat by double-clicking it. Do not execute it from the command line.

LukasCY · 2025-01-22T11:11:51Z

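I did run it by double-clicking...
Well, thank you for replying all along and for taking the trouble. I just wanted to see whether this program can translate philosophy articles smoothly, and in particular whether it translates coherently when a paragraph is cut by a PDF page break.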

awwaawwa · 2025-01-23T04:28:05Z

Paragraphs that are cut by a PDF page break cannot be translated coherently.

This problem is not handled for now.
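
To make the cross-page problem concrete, here is a naive rule-based sketch that glues the first block of a page onto the last block of the previous page when that block does not end with sentence-final punctuation. It is only an illustration of the difficulty and is not how pdf2zh works. It assumes each page has already been extracted as plain text with blank lines between paragraphs and with headers, footers, and footnotes removed. The name merge_pages and the punctuation rule are choices made for this example.

```python
import re

# Sentence-final punctuation (Latin or CJK), optionally followed by a closing bracket or quote.
SENTENCE_END = re.compile(r"[.!?。！？][)\]”’]*$")


def merge_pages(pages):
    """Join page texts, gluing a page onto the previous one when the break falls mid-sentence."""
    paragraphs = []
    for page in pages:
        blocks = [block.strip() for block in re.split(r"\n\s*\n", page) if block.strip()]
        for index, block in enumerate(blocks):
            continues = (
                index == 0
                and bool(paragraphs)
                and not SENTENCE_END.search(paragraphs[-1])
            )
            if not continues:
                paragraphs.append(block)
            elif paragraphs[-1].endswith("-"):
                # Treat a trailing hyphen as end-of-line hyphenation (wrong for real compounds).
                paragraphs[-1] = paragraphs[-1][:-1] + block
            else:
                paragraphs[-1] += " " + block
    return paragraphs


if __name__ == "__main__":
    pages = [
        "Kant argues that the categories",
        "apply only to appearances.\n\nA new paragraph starts here.",
    ]
    for paragraph in merge_pages(pages):
        print(paragraph)
```

A heuristic like this fails on headings without punctuation, on abbreviations at a page end, and on multi-column layouts, which is why simple rules generalize poorly to arbitrary documents.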

LukasCY · 2025-01-23T07:39:00Z

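Is there a program on GitHub that extracts the main text from a PDF typeset as an academic paper (with body text and footnotes) and turns it into continuous text?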

LukasCY · 2025-01-23T11:09:32Z

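With the help of Cursor I built a program that meets the requirements above: https://github.com/LukasCY/extract_pdf2docx. I am offering it as a starting point and hope someone more experienced will improve it further.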

awwaawwa · 2025-01-23T12:08:16Z

My planned approach is to generate a dataset using LaTeX+SyncTeX and then train a layout recognition model to identify such cross-page paragraphs directly. The typesetting module in YADT will then support paragraphs that span multiple available regions, typesetting them in order.

YADT is currently in its early stages, and many more important tasks remain, such as user documentation and developer documentation, PDF line support, and bug fixes. The typesetting engine also needs to be refactored to support more advanced typesetting operations, and I need to consider matters related to preserving typesetting during translation, so cross-page paragraph handling has a relatively low priority. Please be patient.

awwaawwa · 2025-01-23T12:10:09Z

@LukasCY Your program identifies headers and footers through simple regular-expression rules, which do not generalize well, so I cannot incorporate it into YADT (the new backend of this project). Nevertheless, thank you for your approach.

awwaawwa · 2025-01-23T12:12:58Z

In addition, paragraphs that span multiple columns in a multi-column PDF, for example, raise the same issue.

LukasCY · 2025-01-23T12:22:22Z

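Looking forward to your YADT. Please be sure to let me know once cross-page translation works. Multi-column pages are indeed common in academic journals, and I have also seen two-column sections inserted into an otherwise single-column page layout.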

LukasCY added the bug (Something isn't working) label on Jan 21, 2025
