pdfminer-six and pdfplumber version compatibility problem #507

Open
LukasCY opened this issue Jan 21, 2025 · 17 comments
Labels
bug Something isn't working

LukasCY commented Jan 21, 2025

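Problem description

I previously had pdfminer-six 20240706 installed. When I installed your software, pip reported an incompatibility and said that only 20231228 is compatible. I uninstalled the 2024 version, installed 20231228, and ran your software's installation again. It automatically installed pdfminer 20240706 and then reported that pdfplumber requires pdfminer 20231228. This time the 2024 version of pdfminer was not something I had installed myself. So I uninstalled both your software and pdfminer and ran your software's installation once more. It again installed pdfminer 20240706 automatically and again reported that pdfplumber is incompatible with pdfminer.
I am out of ideas now. Please advise.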

Test document

Output (last few lines) of running pip install pdf2zh after uninstalling both your software and pdfminer:
Requirement already satisfied: mdurl~=0.1 in c:\users\marco\appdata\local\packages\pythonsoftwarefoundation.python.3.10_qbz5n2kfra8p0\localcache\local-packages\python310\site-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer<1.0,>=0.12->gradio->pdf2zh) (0.1.2)
Using cached pdf2zh-1.8.9-py3-none-any.whl (54 kB)
Using cached pdfminer.six-20240706-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer-six, pdf2zh
WARNING: The script pdf2zh.exe is installed in 'C:\Users\marco\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdfplumber 0.11.5 requires pdfminer.six==20231228, but you have pdfminer-six 20240706 which is incompatible.
Successfully installed pdf2zh-1.8.9 pdfminer-six-20240706

Trying to run your software fails:
PS C:\Windows\system32> pdf2zh -i
pdf2zh : Die Benennung "pdf2zh" wurde nicht als Name eines Cmdlet, einer Funktion, einer Skriptdatei oder eines
ausführbaren Programms erkannt. Überprüfen Sie die Schreibweise des Namens, oder ob der Pfad korrekt ist (sofern
enthalten), und wiederholen Sie den Vorgang.
In Zeile:1 Zeichen:1
+ pdf2zh -i
    + CategoryInfo          : ObjectNotFound: (pdf2zh:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

LukasCY added the bug label Jan 21, 2025
Byaidu (Owner) commented Jan 21, 2025

Please make sure the Python directory is on your PATH. Otherwise the pip log contains the following warning, and pdf2zh cannot be run:

Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
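
A quick way to confirm whether the directory named in that pip warning is really on PATH is to compare it against the PATH entries directly. The following is a minimal sketch; the helper name `is_dir_on_path` is made up for illustration and is not part of pip or pdf2zh.

```python
import os
import shutil


def is_dir_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries of a PATH-style string.

    `is_dir_on_path` is a hypothetical helper written for this example.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    def norm(p):
        # normcase folds case and slashes on Windows; normpath drops trailing separators.
        return os.path.normcase(os.path.normpath(p))

    wanted = norm(directory)
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(norm(entry) == wanted for entry in entries)


if __name__ == "__main__":
    # shutil.which returns the full path of the launcher, or None if it cannot be found.
    print("pdf2zh resolves to:", shutil.which("pdf2zh"))
```

Calling `is_dir_on_path(r"...\Python310\Scripts")` with the Scripts directory printed in the warning shows whether that entry is missing, even when `python` itself is found through a different PATH entry.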

LukasCY (Author) commented Jan 21, 2025

But Python is on my PATH. I can even run python --version.

hellofinch (Contributor) commented

D:\Unix\pdf2zh_dist>Scripts\pip.exe list
Package                   Version
------------------------- -----------
aiofiles                  23.2.1
altgraph                  0.17.4
annotated-types           0.7.0
anyio                     4.7.0
argostranslate            1.9.6
azure-ai-translation-text 1.0.1
azure-core                1.32.0
certifi                   2024.12.14
cffi                      1.17.1
charset-normalizer        3.4.0
click                     8.1.7
colorama                  0.4.6
coloredlogs               15.0.1
cryptography              44.0.0
ctranslate2               4.5.0
deepl                     1.20.0
Deprecated                1.2.15
distro                    1.9.0
fastapi                   0.115.6
ffmpy                     0.4.0
filelock                  3.16.1
flatbuffers               24.3.25
fonttools                 4.55.3
fsspec                    2024.10.0
gradio                    5.12.0
gradio_client             1.5.4
gradio_pdf                0.0.21
h11                       0.14.0
httpcore                  1.0.7
httpx                     0.27.2
huggingface-hub           0.26.5
humanfriendly             10.0
idna                      3.10
isodate                   0.7.2
Jinja2                    3.1.4
jiter                     0.8.2
joblib                    1.4.2
lxml                      5.3.0
markdown-it-py            3.0.0
MarkupSafe                2.1.5
mdurl                     0.1.2
mpmath                    1.3.0
networkx                  3.4.2
Nuitka                    2.5.6
numpy                     2.2.0
ollama                    0.4.4
onnx                      1.16.0
onnxruntime               1.20.1
openai                    1.57.4
opencv-python-headless    4.10.0.84
ordered-set               4.1.0
orjson                    3.10.12
packaging                 24.2
pandas                    2.2.3
pdf2zh                    1.8.9
pdfminer.six              20240706
peewee                    3.17.8
pefile                    2023.2.7
pikepdf                   9.5.1
pillow                    11.0.0
pip                       24.3.1
protobuf                  5.29.1
pycparser                 2.22
pydantic                  2.10.3
pydantic_core             2.27.1
pydub                     0.25.1
Pygments                  2.18.0
pyinstaller               6.11.1
pyinstaller-hooks-contrib 2024.10
PyMuPDF                   1.25.1
pyreadline3               3.5.4
python-dateutil           2.9.0.post0
python-multipart          0.0.19
pytz                      2024.2
pywin32-ctypes            0.2.3
PyYAML                    6.0.2
regex                     2024.11.6
requests                  2.32.3
rich                      13.9.4
ruff                      0.8.3
sacremoses                0.0.53
safehttpx                 0.1.6
semantic-version          2.10.0
sentencepiece             0.2.0
setuptools                75.8.0
shellingham               1.5.4
six                       1.17.0
sniffio                   1.3.1
stanza                    1.1.1
starlette                 0.41.3
sympy                     1.13.1
tenacity                  9.0.0
tencentcloud-sdk-python   3.0.1282
tomlkit                   0.13.2
torch                     2.5.1
tqdm                      4.67.1
typer                     0.15.1
typing_extensions         4.12.2
tzdata                    2024.2
urllib3                   2.2.3
uvicorn                   0.34.0
websockets                14.1
wrapt                     1.17.2
xinference-client         1.2.0
zstandard                 0.23.0

These are the relevant packages. pdfplumber, which you mentioned, is not among them. I think your environment is polluted.
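
To see which installed packages pull in pdfminer.six, and with which version pin, the standard library's `importlib.metadata` can list the declared requirements of every installed distribution. This is a sketch under the assumption that a simple regular expression is enough to read the project name from a requirement string; the helper names are illustrative.

```python
import re
from importlib import metadata


def canonical(name):
    """Normalize a project name so that pdfminer.six and pdfminer-six compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement):
    """Extract the project name from a requirement string such as 'pdfminer.six==20231228'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return match.group(1) if match else ""


def find_dependents(target):
    """Return (distribution name, requirement string) pairs that depend on `target`."""
    wanted = canonical(target)
    hits = []
    for dist in metadata.distributions():
        # dist.requires is None when a distribution declares no requirements.
        for requirement in dist.requires or []:
            if canonical(requirement_name(requirement)) == wanted:
                hits.append((dist.metadata["Name"], requirement))
    return hits


if __name__ == "__main__":
    for name, requirement in find_dependents("pdfminer.six"):
        print(f"{name} requires {requirement}")
```

In an environment like the one in the original report, a leftover pdfplumber would be expected to show up here with its pin on pdfminer.six, which would explain the resolver message.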

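LukasCY (Author) commented Jan 22, 2025

Previously I had run pip uninstall pdfminer.six. This time I uninstalled with pip uninstall pdfminer, then pip uninstall pdfplumber, then uninstalled pdf2zh and reinstalled it. Everything went smoothly with no errors (the output still shows that the 2024 version of pdfminer was installed, but no error is good enough). I also added the long path shown at the end of the output to PATH.
However, running pdf2zh -i now produces a new problem: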
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\marco\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts\pdf2zh.exe\__main__.py", line 4, in <module>
  File "C:\Users\marco\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pdf2zh\__init__.py", line 2, in <module>
    from pdf2zh.high_level import translate, translate_stream
  File "C:\Users\marco\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pdf2zh\high_level.py", line 21, in <module>
    from pymupdf import Document, Font
ModuleNotFoundError: No module named 'pymupdf'

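hellofinch (Contributor) commented

ModuleNotFoundError: No module named 'pymupdf'

You are missing the pymupdf package.
I recommend setting things up in a clean environment. I think something else in your environment is interfering, and that interference makes the problem harder to pin down.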

LukasCY (Author) commented Jan 22, 2025

I am a layperson. How do I set up a "clean" environment...?

hellofinch (Contributor) commented

You could use the setup.bat installation method to run the web UI.
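
Apart from setup.bat, a common way to get a clean environment is a Python virtual environment, which starts without any previously installed packages. The sketch below uses the standard `venv` module; the directory name `pdf2zh-env` and the helper names are assumptions chosen for illustration, and this is not the project's documented installation procedure.

```python
import os
import subprocess
import venv
from pathlib import Path


def venv_python(env_dir):
    """Return the path of the interpreter inside a virtual environment."""
    env_dir = Path(env_dir)
    if os.name == "nt":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def create_clean_env(env_dir, package="pdf2zh"):
    """Create a fresh virtual environment and install `package` into it."""
    # with_pip=True bootstraps pip inside the new environment.
    venv.create(env_dir, with_pip=True)
    python = venv_python(env_dir)
    # Running pip through the environment's own interpreter keeps the
    # installation isolated from the system-wide site-packages.
    subprocess.run([str(python), "-m", "pip", "install", package], check=True)
    return python


# Example call (needs network access, so it is left commented out):
# create_clean_env(Path.home() / "pdf2zh-env")
```

After such an install, the `pdf2zh` launcher is placed in the environment's `Scripts` directory on Windows (`bin` elsewhere), so it can be started from there without touching PATH.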

LukasCY (Author) commented Jan 22, 2025

I uninstalled my existing Python and then ran setup.bat, which also produced strange problems:
Der Befehl "powershell" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.
Der Befehl "powershell" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.
C:\Users\marco\Downloads\python.zip konnte nicht gefunden werden
Das System kann den angegebenen Pfad nicht finden.
Das System kann den angegebenen Pfad nicht finden.
Der Befehl "powershell" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.
Python konnte nicht gefunden werden. Führen Sie die Verknüpfung ohne Argumente aus, um sie über den Microsoft Store zu installieren, oder deaktivieren Sie diese Verknüpfung unter
Der Befehl "pip" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.
Der Befehl "pip" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.
Der Befehl "pdf2zh" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.
Drücken Sie eine beliebige Taste . . .
Could this be related to my Windows 10 being the German version?

hellofinch (Contributor) commented

Run setup.bat by double-clicking it. Do not execute it from the command line.

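LukasCY (Author) commented Jan 22, 2025

I did run it by double-clicking...
Oh well, thank you for replying all along and for taking the trouble. I just wanted to see whether this program can help me translate philosophy articles smoothly, especially whether it can translate coherently when a paragraph is cut in two by a PDF page break.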

awwaawwa (Contributor) commented

Paragraphs cut by a PDF page break cannot be translated coherently.

This problem is not handled for now.
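
For readers who only need continuous text, a simple punctuation heuristic can rejoin a paragraph that was split by a page break. The sketch below is purely illustrative and is not part of pdf2zh or YADT. It assumes the text has already been extracted as one list of paragraph strings per page, and it treats a page's last paragraph as unfinished when it does not end in sentence-final punctuation.

```python
SENTENCE_END = (".", "!", "?", "。", "!", "?")
CLOSERS = "\"')]”’」』"


def looks_finished(paragraph):
    """Heuristic: a paragraph is finished if it ends in sentence-final punctuation."""
    return paragraph.rstrip().rstrip(CLOSERS).endswith(SENTENCE_END)


def merge_across_pages(pages):
    """Join the first paragraph of a page to the previous one if that looks unfinished.

    `pages` is a list of pages, each a list of paragraph strings.
    """
    merged = []
    for paragraphs in pages:
        first_on_page = True
        for paragraph in paragraphs:
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            if first_on_page and merged and not looks_finished(merged[-1]):
                # The previous page ended mid-sentence, so continue that paragraph.
                merged[-1] = merged[-1] + " " + paragraph
            else:
                merged.append(paragraph)
            first_on_page = False
    return merged
```

A rule like this fails on headings, list items, and footnotes that lack final punctuation, so it is no substitute for proper layout analysis.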

LukasCY (Author) commented Jan 23, 2025

Is there a program on GitHub that extracts the body text from PDFs typeset as academic papers (with body text and footnotes) and turns it into continuous text?

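LukasCY (Author) commented Jan 23, 2025

With the help of Cursor I made a program myself that meets the requirement above: https://github.com/LukasCY/extract_pdf2docx. I offer it as a rough starting point and hope someone more skilled will keep improving it.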

awwaawwa (Contributor) commented Jan 23, 2025

My implementation approach involves generating a dataset using LaTeX+SyncTeX, then training a layout recognition model to directly identify such cross-page paragraphs. Subsequently, the typesetting module in YADT will support multiple available space paragraphs, typesetting them in order.

YADT is currently in its early stages, and there are still many more important tasks to be done, such as user documentation and developer documentation, PDF line support, bug fixes, etc. The typesetting engine also needs to be refactored to support more advanced typesetting operations, and I need to consider preserving typesetting translation-related matters, so cross-page paragraph handling is placed in a relatively lower priority. Please be patient.

awwaawwa (Contributor) commented Jan 23, 2025

@LukasCY Your program identifies headers & footers through simple rules of regular expressions, which doesn't have strong generalization capabilities, so I cannot incorporate it into YADT (the new backend of this project). Nevertheless, thank you for your approach.
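
To illustrate the generalization concern, here is a minimal sketch of what a regex-rule approach to header and footer removal can look like. It is not taken from extract_pdf2docx; the rules and the function name are hypothetical and chosen only to show the limitation.

```python
import re

# Hypothetical rules: each one describes a line that is treated as header/footer.
HEADER_FOOTER_RULES = [
    re.compile(r"^\s*\d+\s*$"),                                      # bare page number
    re.compile(r"^\s*page\s+\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE),  # "Page 3 of 10"
]


def strip_headers_footers(lines, rules=HEADER_FOOTER_RULES):
    """Drop every line that matches at least one header/footer rule."""
    return [line for line in lines if not any(rule.match(line) for rule in rules)]


page = [
    "Journal of Examples 12 (2024)",  # running header in a format no rule anticipates
    "Body text continues here.",
    "17",
    "Page 3 of 10",
]
print(strip_headers_footers(page))
```

The bare page number and the "Page 3 of 10" line are removed, but the running header survives because no rule anticipates its format. Every new journal layout would need further hand-written rules.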

awwaawwa (Contributor) commented

In addition, for example, paragraphs spanning multiple columns in a multi-column PDF also need to address this issue.

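LukasCY (Author) commented Jan 23, 2025

Looking forward to your YADT. Please be sure to let me know once cross-page translation works. Multi-column pages are indeed common in academic journals, and I have even seen two-column passages inserted into an otherwise single-column page layout.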
