When authors write, they make unconscious but consistent stylistic choices, and these habits vary from author to author. The premise of this project is that those differences show up in features that can be computed from a text, and that they are large enough to tell authors apart. Given sample texts from every candidate author, the computed features can be used to identify the author of a text whose authorship is unknown.
This project attempts to guess the author of a text by calculating numerical features (such as how often the most common English words are used, the average sentence length, etc.; see the features directory) and passing them to classifiers (in the classifiers directory). To run it, run main.py
- Features are calculated for writings with known authorship
- Classifiers are trained with the calculated features to be able to guess which author wrote something
- The features of a text with unknown authorship are calculated
- These features are fed into the trained classifier, and the resulting author is the guessed author
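The four steps above can be sketched end to end as follows. This is only an illustration: the feature function is a toy, and the nearest-centroid classifier is a stand-in chosen because it needs no external libraries. The names `extract_features`, `train`, and `guess_author` are hypothetical and are not taken from the project's own features or classifiers directories.

```python
import math
from collections import defaultdict


def extract_features(text):
    """Toy feature vector (assumed): [average word length, commas per word]."""
    words = text.split()
    if not words:
        return [0.0, 0.0]
    avg_word_len = sum(len(w) for w in words) / len(words)
    commas_per_word = text.count(",") / len(words)
    return [avg_word_len, commas_per_word]


def train(known_texts):
    """Steps 1 and 2: compute features for known texts, average them per author."""
    grouped = defaultdict(list)
    for author, text in known_texts:
        grouped[author].append(extract_features(text))
    # One centroid (mean feature vector) per author.
    return {
        author: [sum(col) / len(vectors) for col in zip(*vectors)]
        for author, vectors in grouped.items()
    }


def guess_author(centroids, text):
    """Steps 3 and 4: compute features of the unknown text, pick the nearest author."""
    features = extract_features(text)
    return min(centroids, key=lambda a: math.dist(features, centroids[a]))


if __name__ == "__main__":
    model = train([
        ("A", "a, b, c, d, e"),
        ("B", "extraordinarily magnificent vocabulary"),
    ])
    print(guess_author(model, "x, y, z"))
```

Any classifier that accepts fixed-length numeric vectors could replace the centroid comparison without changing the overall flow.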
(In the features directory)
This feature calculates the frequency of the thirty most common English words throughout the text
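A minimal sketch of such a feature is shown here. It assumes a fixed list of thirty common English words and a simple regular-expression tokenizer; both `COMMON_WORDS` and `common_word_frequencies` are hypothetical, and the word list actually used in the features directory may differ.

```python
import re
from collections import Counter

# Placeholder list of thirty frequent English words (an assumption).
COMMON_WORDS = [
    "the", "be", "to", "of", "and", "a", "in", "that", "have", "i",
    "it", "for", "not", "on", "with", "he", "as", "you", "do", "at",
    "this", "but", "his", "by", "from", "they", "we", "say", "her", "she",
]


def common_word_frequencies(text):
    """Return each common word's share of all words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    return [counts[w] / total if total else 0.0 for w in COMMON_WORDS]
```

Dividing by the total word count keeps the values comparable between texts of different lengths.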
This feature calculates the frequency of a few basic punctuation marks in the text
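One way this could look is sketched below. The set of marks and the choice to normalize by the total number of characters are assumptions, as are the names `PUNCTUATION_MARKS` and `punctuation_frequencies`.

```python
# Marks to count (assumed; the project may track a different set).
PUNCTUATION_MARKS = [".", ",", ";", ":", "!", "?"]


def punctuation_frequencies(text):
    """Return each mark's count divided by the number of characters in the text."""
    total = len(text)
    return {
        mark: (text.count(mark) / total if total else 0.0)
        for mark in PUNCTUATION_MARKS
    }
```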
This feature calculates the average length, in characters, of sentences in the text
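As a rough sketch, sentences can be approximated by splitting on runs of `.`, `!`, and `?`. This naive rule is an assumption (it will mis-split abbreviations such as "Mr."), and `average_sentence_length` is a hypothetical name.

```python
import re


def average_sentence_length(text):
    """Average number of characters per sentence, using a naive splitter."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s) for s in sentences) / len(sentences)
```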
(In the texts directory)
Texts with known authors are located in sub-directories named for their authors
Texts with unknown authors are located in the Unknown sub-directory
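The layout described above could be read with a helper along these lines. It is a sketch that assumes every text is a .txt file and that AUTHORS.txt should be skipped when collecting texts; `load_texts` and its parameters are hypothetical.

```python
from pathlib import Path


def load_texts(root="texts", unknown_dir="Unknown"):
    """Return (known, unknown): known maps author -> list of texts,
    unknown maps file name -> text."""
    known, unknown = {}, {}
    for sub in sorted(Path(root).iterdir()):
        if not sub.is_dir():
            continue
        for path in sorted(sub.glob("*.txt")):
            if path.name == "AUTHORS.txt":
                continue  # metadata file, not a text to classify
            text = path.read_text(encoding="utf-8", errors="replace")
            if sub.name == unknown_dir:
                unknown[path.name] = text
            else:
                # The sub-directory name is the author's name.
                known.setdefault(sub.name, []).append(text)
    return known, unknown
```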
(texts/Unknown/AUTHORS.txt)
This file lists the unknown texts and their "known" authors. The file is not required, nor is an author needed for any unknown text. Its only use is to measure the accuracy of guesses after they have been made. When the program is applied to texts whose authors are genuinely unknown, no author is needed in AUTHORS.txt.
Format for AUTHORS.txt
File name and author are separated by a colon (:), with no whitespace except in the author's name
Example:
pickwick_papers.txt:Charles Dickens
return_of_sherlock_holmes.txt:Arthur Conan Doyle
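A file in this format could be parsed, and the guesses scored, roughly as follows. Both `parse_authors` and `accuracy` are hypothetical helpers; texts with no listed author are simply left out of the score, which is an assumption about how missing entries should be treated.

```python
def parse_authors(lines):
    """Map file name -> author from lines of the form 'file.txt:Author Name'."""
    authors = {}
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only; the author's name may contain spaces.
        filename, author = line.split(":", 1)
        if author:
            authors[filename] = author
    return authors


def accuracy(guesses, authors):
    """Fraction of correct guesses among texts that have a listed author."""
    scored = [name for name in guesses if name in authors]
    if not scored:
        return None  # nothing to measure against
    correct = sum(1 for name in scored if guesses[name] == authors[name])
    return correct / len(scored)


if __name__ == "__main__":
    listed = parse_authors([
        "pickwick_papers.txt:Charles Dickens",
        "return_of_sherlock_holmes.txt:Arthur Conan Doyle",
    ])
    print(accuracy({"pickwick_papers.txt": "Charles Dickens"}, listed))
```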