pdf2html

pdf2html converts PDF files to HTML or plain text using Apache Tika. It can also generate a thumbnail image of a PDF using Apache PDFBox.

Installation

via yarn:

yarn add pdf2html

via npm:

npm install --save pdf2html

A Java Runtime Environment (JRE) is required to run this module.
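Because the module depends on Java, it can help to check for a runtime before calling it. The following is a minimal sketch, assuming the `java` executable is expected on the `PATH`; `isJavaAvailable` is a hypothetical helper and not part of pdf2html.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: returns true when a `java` executable can be
// started from the current PATH.
function isJavaAvailable() {
  // `java -version` exits with status 0 when a runtime is installed.
  const result = spawnSync('java', ['-version'], { stdio: 'ignore' });
  // `result.error` is set when the executable could not be spawned.
  return !result.error && result.status === 0;
}

if (!isJavaAvailable()) {
  console.warn('No Java runtime found on PATH.');
}
```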

Usage

const pdf2html = require('pdf2html');

const html = await pdf2html.html('sample.pdf');
console.log(html);
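The snippets in this README use `await`, which in a CommonJS file (one that uses `require`) only works inside an `async` function. A minimal sketch of such a wrapper with basic error handling follows; `convertToHtml` is a hypothetical helper, and the converter is passed in as an argument so that the function can be tried with a stub.

```javascript
// Hypothetical helper: converts a PDF to HTML and returns null instead of
// throwing when the conversion fails.
// `converter` must expose an async html(path) method, as pdf2html does.
async function convertToHtml(converter, pdfPath) {
  try {
    return await converter.html(pdfPath);
  } catch (err) {
    console.error(`Failed to convert ${pdfPath}: ${err.message}`);
    return null;
  }
}

// Usage with the real module (requires pdf2html and a JRE):
// convertToHtml(require('pdf2html'), 'sample.pdf').then(console.log);
```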

Convert to text

const text = await pdf2html.text('sample.pdf');
console.log(text);

Convert as pages

const htmlPages = await pdf2html.pages('sample.pdf');
console.log(htmlPages);

const options = { text: true };
const textPages = await pdf2html.pages('sample.pdf', options);
console.log(textPages);

Extract metadata

const meta = await pdf2html.meta('sample.pdf');
console.log(meta);

Customize maximum buffer to be used

The maxBuffer option specifies the largest number of bytes allowed on stdout or stderr. If this value is exceeded, then the child process is terminated.

By default, the maximum buffer size is 2MB. You can customize it by passing the maxBuffer option.

await pdf2html.meta('sample.pdf', { maxBuffer: 1024 * 10000 }); // set maxBuffer to roughly 10MB (10,240,000 bytes)
await pdf2html.html('sample.pdf', { maxBuffer: 1024 * 10000 });
await pdf2html.text('sample.pdf', { maxBuffer: 1024 * 10000 });
await pdf2html.pages('sample.pdf', { maxBuffer: 1024 * 10000 });
await pdf2html.thumbnail('sample.pdf', { maxBuffer: 1024 * 10000 });
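Since maxBuffer is counted in bytes, the expression `1024 * 10000` used above evaluates to 10,240,000 bytes, slightly less than 10 MiB. If an exact size is preferred, a small conversion helper keeps the arithmetic readable. This is only a sketch; `mebibytes` is a hypothetical name and not part of pdf2html.

```javascript
// Hypothetical helper: converts a size in mebibytes to the byte count
// that the maxBuffer option expects.
function mebibytes(n) {
  return n * 1024 * 1024;
}

// Options object using the maxBuffer key shown in the examples above.
const options = { maxBuffer: mebibytes(10) };
console.log(options.maxBuffer); // 10485760
```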

Generate thumbnail

const thumbnailPath = await pdf2html.thumbnail('sample.pdf');
console.log(thumbnailPath);

const options = { page: 1, imageType: 'png', width: 160, height: 226 };
const thumbnailPath = await pdf2html.thumbnail('sample.pdf', options);
console.log(thumbnailPath);
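The 160 by 226 size in the example above is close to the 1:√2 aspect ratio of portrait ISO A-series pages such as A4. As a sketch under that assumption, the height for another width could be derived as follows; `portraitThumbnailOptions` is a hypothetical helper, and only the option keys (`page`, `imageType`, `width`, `height`) come from the example.

```javascript
// Hypothetical helper: builds thumbnail options for a given width,
// assuming a portrait ISO A-series page (height = width * sqrt(2)).
function portraitThumbnailOptions(width, page = 1) {
  return {
    page,
    imageType: 'png',
    width,
    height: Math.round(width * Math.SQRT2),
  };
}

console.log(portraitThumbnailOptions(160));
```

PDFs with other page sizes, such as US Letter or landscape pages, would need a different ratio.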

Manually download dependency files

Downloading the dependencies can be slow, or can fail outright, behind an HTTP proxy. Follow the steps below to download the files manually so that the automatic download is skipped.

cd node_modules/pdf2html/vendor
# These URLs come from https://github.com/shebinleo/pdf2html/blob/master/postinstall.js#L6-L7
wget https://archive.apache.org/dist/pdfbox/2.0.27/pdfbox-app-2.0.27.jar
wget https://archive.apache.org/dist/tika/2.6.0/tika-app-2.6.0.jar
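After a manual download, it may be useful to confirm that both jars ended up in the vendor directory. The following is a minimal sketch, assuming the file names from the wget commands above (the versions may differ for other releases of the module); `missingVendorJars` is a hypothetical helper.

```javascript
const fs = require('fs');
const path = require('path');

// Jar file names taken from the wget commands above; adjust them if the
// installed pdf2html version expects different versions.
const REQUIRED_JARS = ['pdfbox-app-2.0.27.jar', 'tika-app-2.6.0.jar'];

// Hypothetical helper: returns the names of required jars that are not
// present in vendorDir.
function missingVendorJars(vendorDir) {
  return REQUIRED_JARS.filter((name) => !fs.existsSync(path.join(vendorDir, name)));
}

const vendorDir = path.join('node_modules', 'pdf2html', 'vendor');
const missing = missingVendorJars(vendorDir);
console.log(missing.length === 0 ? 'All jars present' : `Missing: ${missing.join(', ')}`);
```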
