Version compatibility problem between pdfminer-six and pdfplumber #507
Comments
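Please make sure the Python directory is on PATH; otherwise the pip log will contain the following warning and pdf2zh cannot be run: Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. |
My Python is on PATH; I can even run python --version |
These are the relevant packages, and the pdfplumber you mentioned is not among them. I think your environment is polluted. |
Previously I ran pip uninstall pdfminer.six. This time I uninstalled with pip uninstall pdfminer, then pip uninstall pdfplumber, then uninstalled pdf2zh and reinstalled it. Everything went smoothly with no errors (the output still shows that the 2024 version of pdfminer was installed, but no error is good enough for me). I then added the long path shown at the end of the install output to PATH. |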
You are missing |
I'm a layperson. How do I set up a "clean" environment...? |
You could consider using |
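The tool named in the reply above is not preserved in this thread. As one possible way to get a "clean" environment, here is a minimal sketch that assumes the standard-library `venv` module is acceptable; the helper names `make_clean_env` and `env_python` are made up for illustration and are not part of pdf2zh.

```python
import os
import venv
from pathlib import Path


def make_clean_env(env_dir, with_pip=True):
    """Create an isolated virtual environment and return its directory.

    Assumption: a plain venv is "clean enough" here, i.e. it does not see
    packages such as pdfplumber installed in the global site-packages.
    """
    env_path = Path(env_dir)
    # clear=True wipes any previous contents so no stale packages survive.
    venv.create(env_path, clear=True, with_pip=with_pip)
    return env_path


def env_python(env_dir):
    """Return the path of the interpreter inside the environment."""
    sub = "Scripts" if os.name == "nt" else "bin"
    exe = "python.exe" if os.name == "nt" else "python"
    return Path(env_dir) / sub / exe
```

After creating the environment, running the interpreter returned by `env_python(...)` with `-m pip install pdf2zh` would install the package and its dependencies there, without touching the globally installed pdfplumber.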
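I uninstalled my existing Python and then ran setup.bat, and that also produced strange problems |
Run setup.bat by double-clicking it; do not execute it from the command line. |
I did run it by double-clicking... |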
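When a paragraph is cut by a PDF page break, it cannot be translated coherently. This problem is not handled for now. |
Is there a program on GitHub that extracts the body text from PDFs typeset as academic papers (with body text and footnotes) into continuous text? |
Using Cursor, I built a program myself that meets the requirements above: https://github.com/LukasCY/extract_pdf2docx I'm offering it as a rough starting point and hope someone more expert will keep improving it. |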
My implementation approach involves generating a dataset using LaTeX+SyncTeX, then training a layout recognition model to directly identify such cross-page paragraphs. Subsequently, the typesetting module in YADT will support multiple available space paragraphs, typesetting them in order. YADT is currently in its early stages, and there are still many more important tasks to be done, such as user documentation and developer documentation, PDF line support, bug fixes, etc. The typesetting engine also needs to be refactored to support more advanced typesetting operations, and I need to consider preserving typesetting translation-related matters, so cross-page paragraph handling is placed in a relatively lower priority. Please be patient. |
In addition, for example, paragraphs spanning multiple columns in a multi-column PDF also need to address this issue. |
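To make the cross-page problem concrete, here is a rough sketch of a much simpler technique than the model-based approach described above: a punctuation heuristic that joins the first paragraph of a page to the last paragraph of the previous page when the latter does not end a sentence. The function name `merge_cross_page` and the input shape (a list of pages, each a list of paragraph strings) are assumptions for illustration; this is not how YADT or pdf2zh works.

```python
# Characters that are taken to end a paragraph (an assumed, incomplete list).
TERMINAL = (".", "!", "?", ":", "。", "！", "？", "：", '"', "”")


def merge_cross_page(pages):
    """Flatten pages into paragraphs, joining paragraphs split by a page break."""
    merged = []
    for page in pages:
        first_on_page = True
        for para in page:
            text = para.strip()
            if not text:
                continue  # ignore empty paragraphs
            # Only the first paragraph of a page can continue the previous page.
            continues = (
                first_on_page
                and bool(merged)
                and not merged[-1].endswith(TERMINAL)
            )
            first_on_page = False
            if continues:
                prev = merged[-1]
                if prev.endswith("-"):
                    # Treat a trailing hyphen as a line-break hyphen.
                    # This wrongly joins genuine hyphenated compounds.
                    merged[-1] = prev[:-1] + text
                else:
                    merged[-1] = prev + " " + text
            else:
                merged.append(text)
    return merged
```

A heuristic like this ignores footnotes, headers, and multi-column layouts, which is part of why a layout recognition model is being considered instead.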
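Looking forward to your YADT; please be sure to let me know once cross-page translation works. In academic journals, multi-column pages are indeed common, and I have also seen a two-column layout inserted into an otherwise single-column page. |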
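Problem description
I had previously installed pdfminer-six-20240706. When I installed your software, it reported an incompatibility, saying it is only compatible with 20231228. I uninstalled the 2024 version and installed 20231228, then ran your software's installation again. It automatically installed pdfminer 20240706 and then reported that pdfplumber requires pdfminer version 20231228. This time, I thought, the 2024 pdfminer was not installed by me. So I uninstalled both your software and pdfminer and ran the installation again. It still automatically installed pdfminer 20240706 and again reported that pdfplumber is incompatible with pdfminer.
Now I'm out of ideas. Please advise.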
Test document
Output (last few lines) of running pip install pdf2zh after uninstalling both your software and pdfminer
Requirement already satisfied: mdurl~=0.1 in c:\users\marco\appdata\local\packages\pythonsoftwarefoundation.python.3.10_qbz5n2kfra8p0\localcache\local-packages\python310\site-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer<1.0,>=0.12->gradio->pdf2zh) (0.1.2)
Using cached pdf2zh-1.8.9-py3-none-any.whl (54 kB)
Using cached pdfminer.six-20240706-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer-six, pdf2zh
WARNING: The script pdf2zh.exe is installed in 'C:\Users\marco\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdfplumber 0.11.5 requires pdfminer.six==20231228, but you have pdfminer-six 20240706 which is incompatible.
Successfully installed pdf2zh-1.8.9 pdfminer-six-20240706
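The conflict pip reports above can also be inspected from Python. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper names (`normalize`, `pinned_version`, `check_pin`) are made up for illustration, and the regex only handles simple `==` pins, not the full requirement grammar.

```python
import re
from importlib.metadata import PackageNotFoundError, requires, version


def normalize(name):
    """Normalize a project name so 'pdfminer.six' equals 'pdfminer-six'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def pinned_version(requirement_strings, dependency):
    """Return the version pinned with '==' for `dependency`, or None."""
    wanted = normalize(dependency)
    pattern = r"\s*([A-Za-z0-9._-]+)\s*(?:\[[^\]]*\])?\s*\(?\s*==\s*([^\s,)]+)"
    for req in requirement_strings or []:
        spec = req.split(";", 1)[0]  # drop environment markers
        match = re.match(pattern, spec)
        if match and normalize(match.group(1)) == wanted:
            return match.group(2)
    return None


def check_pin(package, dependency):
    """Return (pin, installed, compatible), or None if either is not installed."""
    try:
        pin = pinned_version(requires(package), dependency)
        installed = version(dependency)
    except PackageNotFoundError:
        return None
    return pin, installed, (pin is None or pin == installed)
```

In an environment like the one in the log, `check_pin("pdfplumber", "pdfminer.six")` would be expected to report the pin next to the installed version, flagged as incompatible. Where pdfplumber is not installed at all, it returns `None`.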
Tried to run your software, but got an error
PS C:\Windows\system32> pdf2zh -i
pdf2zh : Die Benennung "pdf2zh" wurde nicht als Name eines Cmdlet, einer Funktion, einer Skriptdatei oder eines
ausführbaren Programms erkannt. Überprüfen Sie die Schreibweise des Namens, oder ob der Pfad korrekt ist (sofern
enthalten), und wiederholen Sie den Vorgang.
In Zeile:1 Zeichen:1
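The PowerShell message above says that "pdf2zh" was not recognized as a cmdlet, function, script file, or executable program, which is consistent with pip's earlier warning that the Scripts directory is not on PATH. Here is a small sketch for checking this from Python. The helper names are made up for illustration, and `user_scripts_dir` assumes a per-user install; the directory printed in pip's own warning remains the authoritative one.

```python
import os
import sysconfig


def is_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries of a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    target = os.path.normcase(os.path.normpath(directory))
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(
        os.path.normcase(os.path.normpath(entry)) == target for entry in entries
    )


def user_scripts_dir():
    """Scripts directory of the per-user install scheme ('nt_user' or 'posix_user')."""
    return sysconfig.get_path("scripts", f"{os.name}_user")


if __name__ == "__main__":
    scripts = user_scripts_dir()
    status = "on PATH" if is_on_path(scripts) else "NOT on PATH"
    print(f"{scripts}: {status}")
```

If the directory is reported as not on PATH, adding it (and opening a new terminal) should make the `pdf2zh` command resolvable.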