gutenberg-cleaner

A Python package for cleaning Project Gutenberg books and datasets.

Prerequisites

The nltk package.

Installing

[sudo] pip install gutenberg-cleaner

How to use it?

The package provides two functions, "simple_cleaner" and "super_cleaner".

from gutenberg_cleaner import simple_cleaner, super_cleaner

simple_cleaner:

Removes only the lines that are part of the Project Gutenberg header or footer. It does not go deeper into the text to remove other things such as titles or footnotes.

simple_cleaner(book: str) -> str
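Since simple_cleaner takes the whole book as one string and returns a string, a small file wrapper is often convenient. The following is a minimal sketch, not part of the package: the helper name clean_file, the file names, and the UTF-8 encoding are assumptions made for illustration. The cleaner is passed in as an argument so the helper works with either function.

```python
from pathlib import Path
from typing import Callable


def clean_file(src: Path, dst: Path, cleaner: Callable[[str], str]) -> int:
    """Read src, apply cleaner to its text, and write the result to dst.

    Hypothetical helper (not part of gutenberg-cleaner).
    Returns the number of characters the cleaner removed.
    """
    # Assumption: the book is stored as UTF-8 text.
    raw = src.read_text(encoding="utf-8")
    cleaned = cleaner(raw)
    dst.write_text(cleaned, encoding="utf-8")
    return len(raw) - len(cleaned)


# Intended use (requires gutenberg-cleaner to be installed):
#   from gutenberg_cleaner import simple_cleaner
#   clean_file(Path("book.txt"), Path("book.clean.txt"), simple_cleaner)
```

Passing the cleaner as a parameter keeps the helper independent of which cleaning function is chosen.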

super_cleaner:

Cleans the book more aggressively, removing titles, footnotes, images, book information, and so on. It may delete some good lines too.

super_cleaner(book: str, min_token: int = 5, max_token: int = 600) -> str

min_token: The minimum number of tokens in a paragraph that is not "dialog" or "quote". A value of -1 means the text is not tokenized, which is faster but cleans less thoroughly.

max_token: The maximum number of tokens in a paragraph.

It marks deleted paragraphs with: [deleted]
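If the [deleted] markers are not wanted in the final text, they can be filtered out afterwards. The sketch below is an illustration only: the helper name drop_deleted is hypothetical, and it assumes that paragraphs are separated by a blank line and that each marker stands alone as its own paragraph, which this README does not state explicitly.

```python
from typing import Tuple

DELETED_MARK = "[deleted]"


def drop_deleted(cleaned: str) -> Tuple[str, int]:
    """Remove marker paragraphs from text returned by super_cleaner.

    Hypothetical helper (not part of gutenberg-cleaner).
    Returns the filtered text and the number of markers dropped.
    """
    # Assumption: paragraphs are separated by one blank line.
    paragraphs = cleaned.split("\n\n")
    kept = [p for p in paragraphs if p.strip() != DELETED_MARK]
    return "\n\n".join(kept), len(paragraphs) - len(kept)


# Intended use (requires gutenberg-cleaner to be installed):
#   from gutenberg_cleaner import super_cleaner
#   text, dropped = drop_deleted(super_cleaner(book))
```

The returned count gives a rough idea of how aggressive the cleaning was for a given book, which can help when tuning min_token and max_token.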

Author

Peyman Mohseni kiasari

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

