Find duplicate sentences or sentence fragments in a large (e.g. book-length) text file.
Behaviour is primitive: text is split only on newlines and punctuation, and any fragment shorter than 20 characters is ignored. Nothing fancy is done (better-performing suffix trees are not used; I'm using simple lists), yet a 400-page test document is analysed in under a second.
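For illustration, here is a minimal sketch of that approach in Haskell. This is not the project's actual code; all names are made up, and it simply follows the description above: split on newlines and punctuation, drop fragments under 20 characters, then sort so duplicates become adjacent.

```haskell
import Data.Char (isSpace)
import Data.List (group, sort)

-- Break a string at every character matching the predicate.
splitWhen :: (Char -> Bool) -> String -> [String]
splitWhen p s = case break p s of
  (chunk, [])     -> [chunk]
  (chunk, _:rest) -> chunk : splitWhen p rest

-- Split on newlines and sentence punctuation, trim whitespace,
-- and drop anything shorter than the 20-character cutoff.
fragments :: String -> [String]
fragments = filter ((>= 20) . length)
          . map trim
          . splitWhen (`elem` "\n.!?;")
  where
    trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

-- Sort fragments so duplicates are adjacent, then keep groups of two or more.
duplicates :: String -> [(Int, String)]
duplicates = map (\g -> (length g, head g))
           . filter ((> 1) . length)
           . group . sort
           . fragments

main :: IO ()
main = readFile "input.txt" >>= mapM_ print . duplicates
```

Sorting keeps the whole pass at O(n log n) over the fragment list, which is plausibly why a simple list-based approach stays sub-second even on book-length input.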
Status: experimental; values such as the input file name are hardcoded.
- Haskell Stack, to compile this project.
- pandoc, since few books are written in plain text or Markdown and you will probably need to convert yours.
Create an input.txt file in the current directory, using pandoc to convert if necessary:

pandoc <input file> -o input.txt --wrap=none

The --wrap=none flag stops pandoc from inserting hard line breaks, which matters because the tool splits on newlines.
Compile with make build and analyse for duplicates with make run.
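The Makefile itself is not reproduced here; a minimal version providing those two targets might look like the following (hypothetical; it assumes a standard Stack project with a default executable):

```make
# Hypothetical Makefile; only the target names come from the text above.
build:
	stack build

run:
	stack run   # runs the default executable, which reads ./input.txt
```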