Download a book from Gallica and make a PDF

Install requirements

pip install -r requirements.txt

Edit the link in links.txt so that it points to the book you want to download

Then run

python main.py

OCR the PDF

Installing OCRmyPDF

sudo apt update
sudo apt install ocrmypdf

# Fonts for Chinese (arphic), Japanese (ipafont) and Korean (unfonts)
sudo apt-get install fonts-arphic-ukai fonts-arphic-uming fonts-ipafont-mincho fonts-ipafont-gothic fonts-unfonts-core

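# Tesseract 5 from the notesalexp.org repository
echo "deb https://notesalexp.org/tesseract-ocr5/$(lsb_release -cs)/ $(lsb_release -cs) main" \
| sudo tee /etc/apt/sources.list.d/notesalexp.list > /dev/null
# The repository's signing key is not installed at this point, hence --allow-insecure-repositories
sudo apt update --allow-insecure-repositories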

# https://ocrmypdf.readthedocs.io/en/latest/languages.html#lang-packs
# Display a list of all Tesseract language packs
# Choose the -best version
sudo apt-cache search tesseract-ocr
sudo apt-cache search tesseract-ocr | grep Chinese
# get tesseract-ocr-chi-tra-vert-best - tesseract-ocr language files for Chinese - Traditional (vertical) (best)
sudo apt-cache search tesseract-ocr | grep Viet
# get tesseract-ocr-vie-best - tesseract-ocr language files for Vietnamese (best)

# Tesseract Chinese traditional vertical
sudo apt-get install tesseract-ocr-chi-tra-vert-best tesseract-ocr-chi-tra-vert tesseract-ocr-chi-tra-best

# If pngquant is installed, OCRmyPDF uses it to quantize paletted images and reduce their size
sudo apt install pngquant

# Other packages
sudo apt install ghostscript \
  fonts-droid-fallback \
  jbig2dec \
  unpaper
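
Before the first OCR run it can help to confirm that the packages above actually ended up on the PATH and that the Tesseract language pack is visible. The following is only a sketch: `missing_tools` and `has_tesseract_lang` are hypothetical helper names, and the command names passed to them (`gs` for ghostscript, for example) are assumptions about what the packages install.

```shell
#!/bin/sh
# Sketch: print every given command that is not found on PATH.
# Returns non-zero if at least one is missing.
# The function body runs in a subshell, so its variables stay local.
missing_tools() (
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" > /dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
)

# Sketch: succeed if `tesseract --list-langs` prints the given language
# code on a line of its own.
has_tesseract_lang() {
  tesseract --list-langs 2>&1 | grep -qx "$1"
}

# Example usage (commented out so the definitions have no side effects):
# missing_tools ocrmypdf tesseract gs pngquant unpaper
# has_tesseract_lang chi_tra_vert || echo "chi_tra_vert is not installed"
```

If `has_tesseract_lang chi_tra_vert` fails, the language pack install step is the first thing to recheck.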

Run ocrmypdf example:

# Temporary files go to TMPDIR; make sure the directory exists
export TMPDIR=$HOME/tmpdir
mkdir -p "$TMPDIR"
export OPTIONS=' -l chi_tra_vert --jobs 2 --redo-ocr  --tesseract-timeout 600 --output-type pdf '

FILE="your_pdf_file.pdf"
ocrmypdf $OPTIONS "$FILE" "ocr_${FILE}"

# Or run this in a folder to OCR every PDF file in it
# -maxdepth 1 keeps find out of ocr/, so finished files are not processed again
mkdir -p ocr
find . -maxdepth 1 -name '*.pdf' -printf '%p\n' -exec ocrmypdf $OPTIONS '{}' ocr/'{}' \;
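
The folder-wide run above can also be written as a loop that skips PDFs which already have an output in ocr/, which is convenient when a long batch is interrupted and restarted. This is a sketch under assumptions, not part of this repository: `ocr_all_pdfs` is a hypothetical helper name, and it simply passes all of its arguments on to ocrmypdf.

```shell
#!/bin/sh
# Sketch: OCR every PDF in the current folder into ocr/, skipping files
# that already have an output there.
# Every argument given to the function is passed to ocrmypdf unchanged.
# The function body runs in a subshell, so its variables stay local.
ocr_all_pdfs() (
  mkdir -p ocr
  for file in ./*.pdf; do
    [ -e "$file" ] || continue   # no PDFs here: the glob stayed literal
    out="ocr/${file#./}"
    if [ -e "$out" ]; then
      echo "skip: $file"
      continue
    fi
    echo "ocr: $file"
    ocrmypdf "$@" "$file" "$out" || echo "failed: $file"
  done
)

# Example usage, from the folder that holds the PDFs
# ($OPTIONS is left unquoted on purpose so it splits into separate flags):
# ocr_all_pdfs $OPTIONS
```

Because file names are always quoted, names containing spaces are handled, and a failure on one file does not stop the rest of the batch.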

Optimize

Installing the JBIG2 encoder

https://ocrmypdf.readthedocs.io/en/latest/jbig2.html#jbig2-lossy

https://github.com/ocrmypdf/OCRmyPDF/blob/main/.docker/Dockerfile


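# jbig2enc is C++ code, so a C++ compiler and make are needed (build-essential provides both)
sudo apt install autotools-dev automake libtool libleptonica-dev build-essential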

git clone https://github.com/agl/jbig2enc
cd jbig2enc
./autogen.sh
./configure --host=x86_64-pc-linux-gnu
make
sudo make install

Run ocrmypdf example with optimization:

# Temporary files go to TMPDIR; make sure the directory exists
export TMPDIR=$HOME/tmpdir
mkdir -p "$TMPDIR"
export OPTIONS=' -l chi_tra_vert --jobs 2 --redo-ocr  --tesseract-timeout 600 --optimize 2 --jbig2-lossy --output-type pdf '

FILE="your_pdf_file.pdf"
ocrmypdf $OPTIONS "$FILE" "ocr_${FILE}"

# Or run this in a folder to OCR every PDF file in it
# -maxdepth 1 keeps find out of ocr/, so finished files are not processed again
mkdir -p ocr
find . -maxdepth 1 -name '*.pdf' -printf '%p\n' -exec ocrmypdf $OPTIONS '{}' ocr/'{}' \;
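
To see whether the optimization options pay off for a given book, the input and output sizes can be compared. The helper below is a minimal sketch; `size_saving` is a hypothetical name, and the percentage uses integer arithmetic, so it is truncated toward zero rather than rounded.

```shell
#!/bin/sh
# Sketch: print the byte sizes of an input and an output file and how much
# smaller the output is, in percent (integer arithmetic, truncated).
# The function body runs in a subshell, so its variables stay local.
size_saving() (
  before=$(wc -c < "$1") || exit 1
  after=$(wc -c < "$2") || exit 1
  if [ "$before" -le 0 ]; then
    echo "empty input: $1" >&2
    exit 1
  fi
  percent=$(( ($before - $after) * 100 / $before ))
  echo "$1: $before bytes -> $after bytes (${percent}% smaller)"
)

# Example usage:
# size_saving "your_pdf_file.pdf" "ocr_your_pdf_file.pdf"
```

A negative percentage means the output is larger than the input.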
