Supervisor: Boris Velichkov, FMI
- noise_data:
- Libraries:
- Articles:
- extract.py - generates a JSON dataset from the most common words in each book
- classify.py - trains on a dataset created by extract.py and then classifies the genre(s) of a given book
- reqs.txt - lists all the requirements for the scripts; install them with pip:
$ pip install -r reqs.txt
extract.py
- You must run the script from a directory that contains one subdirectory per genre, named after that genre
- The script looks for a txt directory inside each genre directory. This is useful if you also want to keep zip files of the books for each genre
- After you start the script, expect a wait that depends on how many books you have provided. You can also pass as a parameter how many words to extract from each book
- When the extraction finishes, the generated file is "/tmp/ops.json". A sketch of the assumed extraction logic follows the example below
$ ls
Ancient Classics Criminal Fantasy Horror Humour Love Science Sci-Fi Social
$ ls Ancient/
all_links Ancient.words txt zips
$ ls Ancient/txt/ | head
110-Herodot_-_Istoricheski_noveli.txt
141-William-Shakespeare_-_Soneti.txt
1492-Nikolaj_Kun_-_Starogrytski_legendi_i_mitove.txt
1642-Jean-Froissart_-_Hroniki.txt
1696-Sun_Dzy_-_Izkustvoto_na_vojnata.txt
1840-Jean-Pierre-Vernant_-_Starogrytski_mitove_-_Vsemiryt_bogovete_horata.txt
1984-Konfutsij_-_Dobrijat_pyt_-_Misli_na_velikija_kitajski_mydrets.txt
2074-Genro_-_Zheljaznata_flejta_sto_dzenski_koana_-_Slovata_na_dzenskite_mydretsi.txt
217-Starogrytska_lirika.txt
29-Giovanni-Boccaccio_-_Dekameron.txt
$ ~/Git/Book-Classifier/extract.py
Starting Ancient
Starting Classics
Starting Criminal
Starting Fantasy
Starting Horror
Starting Humour
Starting Love
Starting Science
Starting Sci-Fi
Starting Social
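For illustration, here is a minimal sketch of the extraction logic as referenced above. This is a hypothetical reconstruction, not the actual extract.py: the per-book record shape, the CLI parameter handling, and the default of 1250 words are all assumptions.

```python
#!/usr/bin/env python3
# Hypothetical sketch of extract.py, NOT the actual implementation.
import json
import os
import re
import sys
from collections import Counter

# Top words to keep per book; assumed to be the optional CLI parameter.
TOP_WORDS = int(sys.argv[1]) if len(sys.argv) > 1 else 1250

dataset = []  # assumed shape: one record per book with its genre label
for genre in sorted(os.listdir('.')):
    txt_dir = os.path.join(genre, 'txt')
    if not os.path.isdir(txt_dir):
        continue  # skip entries that are not genre directories
    print('Starting', genre)
    for book in os.listdir(txt_dir):
        with open(os.path.join(txt_dir, book), encoding='utf-8') as f:
            # Crude Unicode-aware tokenizer (the books are in Bulgarian).
            words = re.findall(r'\w+', f.read().lower())
        dataset.append({'genre': genre,
                        'book': book,
                        'words': dict(Counter(words).most_common(TOP_WORDS))})

with open('/tmp/ops.json', 'w', encoding='utf-8') as out:
    json.dump(dataset, out, ensure_ascii=False)
```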
classify.py
- You must call the script with 2 parameters: *path to the dataset* and *path to the book*
- The script returns the F1 score of the training (the harmonic mean of precision and recall) and the predicted genre(s) of the book; a sketch of the assumed logic follows the example below
$ ./classify.py /tmp/ops.json Omir_-_Iliada_-_6122-b.txt
0.6937984496124031
[('Ancient',)]
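A minimal sketch of what classify.py presumably does, assuming the dataset shape from the extraction sketch above and a scikit-learn Naive Bayes model. The classifier choice is an assumption, not the repository's actual code; the parameter values are the ones tuned in the tests below.

```python
#!/usr/bin/env python3
# Hypothetical sketch of classify.py, NOT the actual implementation.
import json
import re
import sys
from collections import Counter

from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

FEATURES, DATA_SPLIT, THRESHOLD = 7000, 0.1, 0.26  # best values from Test 4

dataset_path, book_path = sys.argv[1], sys.argv[2]
with open(dataset_path, encoding='utf-8') as f:
    records = json.load(f)  # shape assumed from the extraction sketch above

# Turn per-book word counts into a sparse feature matrix.
vec = DictVectorizer()
X = vec.fit_transform([r['words'] for r in records])
y = [r['genre'] for r in records]

# Hold out a fraction of the books (data_split) and report the F1 score.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=DATA_SPLIT, random_state=0)
model = MultinomialNB().fit(X_train, y_train)
print(f1_score(y_test, model.predict(X_test), average='micro'))

# Predict every genre whose probability clears the threshold,
# which is why the script can return more than one genre.
with open(book_path, encoding='utf-8') as f:
    counts = Counter(re.findall(r'\w+', f.read().lower()))
features = dict(counts.most_common(FEATURES))
probs = model.predict_proba(vec.transform([features]))[0]
print([tuple(g for g, p in zip(model.classes_, probs) if p >= THRESHOLD)])
```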
- Test 0
- The idea was to pick the top 3 of my datasets based on their F1 score. Since a single test on a single dataset took ~10 hours, I could run it only on some of the datasets. In each test below, the first set of ranges is what was searched, № tests is the number of runs, and the second set of values is the best-performing sub-range, which becomes the search range for the next test (see the sketch after Test 4)
- Test 1
- features: 2,000 - 20,000
- data_split: 0.1 - 0.45
- threshold: 0.1 - 0.45
- № tests: 20
- features: 5,000 - 14,000
- data_split: 0.1 - 0.25
- threshold: 0.25 - 0.3
- Test 2
- features: 5,000 - 14,000
- data_split: 0.1 - 0.25
- threshold: 0.25 - 0.3
- № tests: 25
- features: 5,000 - 10,000
- data_split: 0.1 - 0.15
- threshold: 0.25 - 0.3
- Test 3
- features: 5,000 - 10,000
- data_split: 0.1 - 0.15
- threshold: 0.25 - 0.3
- № tests: 35
- features: 7,000
- data_split: 0.1
- threshold: 0.26
- Test 4
- features: 7,000
- data_split: 0.1
- threshold: 0.26
- № tests: 50
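The narrowing procedure in Tests 1-4 amounts to a coarse-to-fine random search. Here is a sketch under that assumption; evaluate is a hypothetical helper that trains a model with the given parameters and returns its F1 score.

```python
import random

def search(features_rng, split_rng, thresh_rng, n_tests, evaluate, keep=5):
    """Run n_tests random trials over the given ranges and return the
    best-scoring ones; their spread suggests the next, narrower ranges."""
    trials = []
    for _ in range(n_tests):
        params = (random.randint(*features_rng),
                  random.uniform(*split_rng),
                  random.uniform(*thresh_rng))
        trials.append((evaluate(*params), params))
    trials.sort(reverse=True)
    return trials[:keep]

# Test 1 as described above (evaluate is hypothetical):
# best = search((2000, 20000), (0.10, 0.45), (0.10, 0.45), 20, evaluate)
```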
Note that this configuration overfits and is not the optimal solution to the task.
Tests with books that were not part of the training process
- Test Hannibal
- Tests with a 0.15 threshold were accurate, as were tests with more than 1000 top words and a 0.125 threshold
- Test Twilight
- Tests with threshold over 0.1 were accurate
- Test Game of Thrones
- Tests with threshold 0.1 and above were accurate
- Test Naked Sun
- Tests with a threshold of 0.1 or above and more than 950 top words were accurate
- Test Azazel
- Tests with a 0.125 threshold and between 1000 and 1250 top words were accurate
As a result of these tests, especially the one with Azazel, the datasets with the top 1125 and 1250 words seem the most accurate. Since the F1 score of the 1250-word dataset is a bit higher, that is the dataset I consider the winner.