Supervisor: Boris Velichkov, FMI
- noise_data:
- Libraries:
- Articles:
- extract.py - generates a JSON dataset from the most common words in each book
- classify.py - trains on a dataset created by extract.py and then classifies the genre(s) of a given book
- reqs.txt - lists all the requirements for the scripts; install them with pip:
$ pip install -r reqs.txt
extract.py
- You must run the script from a directory that contains one subdirectory per genre, named after that genre
- The script looks for a txt directory inside each genre directory. This is useful if you also want to keep zip files of the books for each genre
- After you start the script, expect a wait that depends on how many books you have provided. You can also pass as a parameter how many words to extract from each book
- When the extraction finishes, the generated file is "/tmp/ops.json". A sketch of the assumed extraction logic follows the example below
$ ls
Ancient Classics Criminal Fantasy Horror Humour Love Science Sci-Fi Social
$ ls Ancient/
all_links Ancient.words txt zips
$ ls Ancient/txt/ | head
110-Herodot_-_Istoricheski_noveli.txt
141-William-Shakespeare_-_Soneti.txt
1492-Nikolaj_Kun_-_Starogrytski_legendi_i_mitove.txt
1642-Jean-Froissart_-_Hroniki.txt
1696-Sun_Dzy_-_Izkustvoto_na_vojnata.txt
1840-Jean-Pierre-Vernant_-_Starogrytski_mitove_-_Vsemiryt_bogovete_horata.txt
1984-Konfutsij_-_Dobrijat_pyt_-_Misli_na_velikija_kitajski_mydrets.txt
2074-Genro_-_Zheljaznata_flejta_sto_dzenski_koana_-_Slovata_na_dzenskite_mydretsi.txt
217-Starogrytska_lirika.txt
29-Giovanni-Boccaccio_-_Dekameron.txt
$ ~/Git/Book-Classifier/extract.py
Starting Ancient
Starting Classics
Starting Criminal
Starting Fantasy
Starting Horror
Starting Humour
Starting Love
Starting Science
Starting Sci-Fi
Starting Social
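For illustration, here is a minimal sketch of the extraction logic as referenced above. This is a hypothetical reconstruction, not the actual extract.py: the per-book record shape, the CLI parameter handling, and the default of 1250 words are all assumptions.

```python
#!/usr/bin/env python3
# Hypothetical sketch of extract.py, NOT the actual implementation.
import json
import os
import re
import sys
from collections import Counter

# Top words to keep per book; assumed to be the optional CLI parameter.
TOP_WORDS = int(sys.argv[1]) if len(sys.argv) > 1 else 1250

dataset = []  # assumed shape: one record per book with its genre label
for genre in sorted(os.listdir('.')):
    txt_dir = os.path.join(genre, 'txt')
    if not os.path.isdir(txt_dir):
        continue  # skip entries that are not genre directories
    print('Starting', genre)
    for book in os.listdir(txt_dir):
        with open(os.path.join(txt_dir, book), encoding='utf-8') as f:
            # Crude Unicode-aware tokenizer (the books are in Bulgarian).
            words = re.findall(r'\w+', f.read().lower())
        dataset.append({'genre': genre,
                        'book': book,
                        'words': dict(Counter(words).most_common(TOP_WORDS))})

with open('/tmp/ops.json', 'w', encoding='utf-8') as out:
    json.dump(dataset, out, ensure_ascii=False)
```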
classify.py
- You must call the script with 2 parameters: *path to the dataset* and *path to the book*
- The script returns the F1 score of the training (the harmonic mean of precision and recall) and the predicted genre(s) of the book; a sketch of the assumed logic follows the example below
$ ./classify.py /tmp/ops.json Omir_-_Iliada_-_6122-b.txt
0.6937984496124031
[('Ancient',)]
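A minimal sketch of what classify.py presumably does, assuming the dataset shape from the extraction sketch above and a scikit-learn Naive Bayes model. The classifier choice is an assumption, not the repository's actual code; the parameter values are the ones tuned in the tests below.

```python
#!/usr/bin/env python3
# Hypothetical sketch of classify.py, NOT the actual implementation.
import json
import re
import sys
from collections import Counter

from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

FEATURES, DATA_SPLIT, THRESHOLD = 7000, 0.1, 0.26  # best values from Test 4

dataset_path, book_path = sys.argv[1], sys.argv[2]
with open(dataset_path, encoding='utf-8') as f:
    records = json.load(f)  # shape assumed from the extraction sketch above

# Turn per-book word counts into a sparse feature matrix.
vec = DictVectorizer()
X = vec.fit_transform([r['words'] for r in records])
y = [r['genre'] for r in records]

# Hold out a fraction of the books (data_split) and report the F1 score.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=DATA_SPLIT, random_state=0)
model = MultinomialNB().fit(X_train, y_train)
print(f1_score(y_test, model.predict(X_test), average='micro'))

# Predict every genre whose probability clears the threshold,
# which is why the script can return more than one genre.
with open(book_path, encoding='utf-8') as f:
    counts = Counter(re.findall(r'\w+', f.read().lower()))
features = dict(counts.most_common(FEATURES))
probs = model.predict_proba(vec.transform([features]))[0]
print([tuple(g for g, p in zip(model.classes_, probs) if p >= THRESHOLD)])
```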
- Test 0
- The idea was to pick the top 3 of my datasets based on their F1 score. Since a single test on a single dataset took ~10 hours, I could run it only on some of the datasets. In each test below, the first set of ranges is what was searched, № tests is the number of runs, and the second set of values is the best-performing sub-range, which becomes the search range for the next test (see the sketch after Test 4)
- Test 1
- features: 2,000 - 20,000
- data_split: 0.1 - 0.45
- threshold: 0.1 - 0.45
- № tests: 20
- features: 5,000 - 14,000
- data_split: 0.1 - 0.25
- threshold: 0.25 - 0.3
- Test 2
- features: 5,000 - 14,000
- data_split: 0.1 - 0.25
- threshold: 0.25 - 0.3
- № tests: 25
- features: 5,000 - 10,000
- data_split: 0.1 - 0.15
- threshold: 0.25 - 0.3
- Test 3
- features: 5,000 - 10,000
- data_split: 0.1 - 0.15
- threshold: 0.25 - 0.3
- № tests: 35
- features: 7,000
- data_split: 0.1
- threshold: 0.26
- Test 4
- features: 7,000
- data_split: 0.1
- threshold: 0.26
- № tests: 50
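The narrowing procedure in Tests 1-4 amounts to a coarse-to-fine random search. Here is a sketch under that assumption; evaluate is a hypothetical helper that trains a model with the given parameters and returns its F1 score.

```python
import random

def search(features_rng, split_rng, thresh_rng, n_tests, evaluate, keep=5):
    """Run n_tests random trials over the given ranges and return the
    best-scoring ones; their spread suggests the next, narrower ranges."""
    trials = []
    for _ in range(n_tests):
        params = (random.randint(*features_rng),
                  random.uniform(*split_rng),
                  random.uniform(*thresh_rng))
        trials.append((evaluate(*params), params))
    trials.sort(reverse=True)
    return trials[:keep]

# Test 1 as described above (evaluate is hypothetical):
# best = search((2000, 20000), (0.10, 0.45), (0.10, 0.45), 20, evaluate)
```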
Note that this configuration overfits and is not the optimal solution to the task.
Tests with books that were not part of the training process
- Test Hannibal
- Tests with a 0.15 threshold were accurate, as were tests with more than 1000 top words and a 0.125 threshold
- Test Twilight
- Tests with threshold over 0.1 were accurate
- Test Game of Thrones
- Tests with threshold 0.1 and above were accurate
- Test Naked Sun
- Tests with a threshold of 0.1 or above and more than 950 top words were accurate
- Test Azazel
- Tests with a 0.125 threshold and between 1000 and 1250 top words were accurate
As a result of these tests, especially the one with Azazel, the datasets with the top 1125 and 1250 words seem the most accurate. Since the F1 score of the 1250-word dataset is a bit higher, that is the dataset I consider the winner.