a python package for cleaning Gutenberg books and dataset

kiasar/gutenberg_cleaner

Latest commit: 88ad760 · Apr 10, 2023

gutenberg-cleaner

A Python package for cleaning Project Gutenberg books and datasets.

Prerequisites

nltk package

Installing

[sudo] pip install gutenberg-cleaner

How to use it?

The package provides two functions, "simple_cleaner" and "super_cleaner":

from gutenberg_cleaner import simple_cleaner, super_cleaner

simple_cleaner:

Removes only the lines that belong to the Project Gutenberg header or footer. It does not look deeper into the text to remove other things such as titles or footnotes.

simple_cleaner(book: str) -> str
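As a minimal sketch of how simple_cleaner might be applied to a downloaded book file, the helper below reads a text file, passes its contents through a cleaner, and writes the result. The name clean_book_file, its parameters, and the UTF-8 encoding are assumptions made for illustration and are not part of the package; only the simple_cleaner import and its str -> str signature come from this README.

```python
from pathlib import Path
from typing import Callable, Optional


def clean_book_file(src: str, dst: str,
                    cleaner: Optional[Callable[[str], str]] = None) -> str:
    """Read src, run it through cleaner, write the result to dst, and return it.

    clean_book_file is a hypothetical helper, not part of gutenberg_cleaner.
    """
    if cleaner is None:
        # Imported lazily so the helper also works with any other
        # str -> str function (for example super_cleaner or a test stub).
        from gutenberg_cleaner import simple_cleaner
        cleaner = simple_cleaner
    # Assumption: the book file is UTF-8 encoded.
    book = Path(src).read_text(encoding="utf-8")
    cleaned = cleaner(book)
    Path(dst).write_text(cleaned, encoding="utf-8")
    return cleaned
```

With the package installed, calling clean_book_file("book.txt", "book_clean.txt") would use simple_cleaner by default; the file names here are placeholders.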

super_cleaner:

Cleans the book aggressively, removing titles, footnotes, images, book information, and similar material. It may also delete some good lines.

super_cleaner(book: str, min_token: int = 5, max_token: int = 600) -> str

min_token: The minimum number of tokens a paragraph must contain unless it is a "dialog" or "quote". Set it to -1 to skip tokenizing the text, which is faster but cleans less thoroughly.

max_token: The maximum number of tokens a paragraph may contain.

Deleted paragraphs are marked with: [deleted]
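Because removed paragraphs are left in place as [deleted] markers, a post-processing step may be wanted to drop them. The sketch below shows one way to do that. The function name drop_deleted is hypothetical, and it assumes that paragraphs are separated by blank lines and that a removed paragraph consists of the marker alone; neither detail is stated in this README, so check it against real super_cleaner output before relying on it.

```python
import re


def drop_deleted(cleaned: str, marker: str = "[deleted]") -> str:
    """Return the text without paragraphs that consist only of the marker.

    drop_deleted is a hypothetical helper, not part of gutenberg_cleaner.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", cleaned)
    # Keep non-empty paragraphs that are not just the deletion marker.
    kept = [p for p in paragraphs if p.strip() and p.strip() != marker]
    return "\n\n".join(kept)


# A hand-written string standing in for super_cleaner output.
sample = "First paragraph.\n\n[deleted]\n\nSecond paragraph."
print(drop_deleted(sample))
```

With the package installed, the input would come from a call such as super_cleaner(book) or, to skip tokenization, super_cleaner(book, min_token=-1).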

Author

  • Peyman Mohseni kiasari

License

This project is licensed under the MIT License. See the LICENSE.md file for details.
