This project focuses on Arabic text summarization using transformers. The generated summary is then evaluated, together with its original text, through Arabic classification and clustering algorithms to check whether the meaning of the summary matches the original text.
The classification task is article classification across 5 topics.
The clustering task clusters the same Arabic articles into the same 5 topics.
- nltk
- scipy
- pickle
- transformers==4.19.2
- tensorflow-gpu==2.9.1
- numpy
- pandas
- re
- time
- PyQt5
- pyarabic
- farasapy
- functools
- operator
- emoji
- string
- sklearn
- plotly
- Arabic News Articles:
  - For classification and clustering.
  - 45,000 articles with 7 different topics.
- WikiLingua:
  - For summarization.
  - ~40,000 Arabic articles with their summaries (a loading sketch follows this list).
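A minimal sketch of loading the Arabic WikiLingua split, assuming the Hugging Face `datasets` package (which is not in the requirements list; the summarization notebook instead reads the manually downloaded files via `read_text`):

```python
# Hypothetical alternative to the manual download: pull the Arabic split of
# WikiLingua from the Hugging Face Hub. Requires `pip install datasets`.
from datasets import load_dataset

wiki = load_dataset("wiki_lingua", "arabic")  # assumed config name

example = wiki["train"][0]
print(example["article"]["document"][0])  # first article section
print(example["article"]["summary"][0])   # its reference summary
```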
The project folder contains the following files:
- summarization.ipynb
- inference.py
- class_clust.ipynb
- class_clust_infer.py
- MainWindow.py
- Arabic_stop_words.txt
- champion_models.pickle
- objects.pickle
It also contains two folders, one holding the WikiLingua datasets and one holding the training checkpoints. All of these files and folders are described below.

**summarization.ipynb**

This notebook contains all the steps of the summarization algorithm.
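A hedged sketch of the core summarization step, assuming a Hugging Face seq2seq model under TensorFlow; the checkpoint name below is a placeholder, not necessarily the model used in the notebook:

```python
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

MODEL_NAME = "UBC-NLP/AraT5-base"  # placeholder Arabic seq2seq checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = TFAutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, from_pt=True)

def summarize(text: str, max_len: int = 128) -> str:
    """Generate an abstractive summary for one Arabic article."""
    inputs = tokenizer(text, return_tensors="tf", truncation=True, max_length=512)
    ids = model.generate(inputs["input_ids"], max_length=max_len, num_beams=4)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```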
**class_clust.ipynb**

This notebook contains all the steps of building the classification and clustering models.
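A minimal sketch of the modelling steps, assuming scikit-learn estimators; the actual "champion" models selected in the notebook may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

def build_models(texts, labels, n_topics=5):
    """Fit a TF-IDF vectorizer, a topic classifier, and a topic clusterer."""
    vectorizer = TfidfVectorizer(max_features=50_000)
    X = vectorizer.fit_transform(texts)                              # document-term matrix
    classifier = LogisticRegression(max_iter=1000).fit(X, labels)    # supervised topics
    clusterer = KMeans(n_clusters=n_topics, random_state=42).fit(X)  # unsupervised topics
    return vectorizer, classifier, clusterer
```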
**inference.py**

This file contains the inference code for summarization; it returns the summary of the input text to the GUI.
**class_clust_infer.py**

This file contains the inference code for classification and clustering. It is imported by the GUI code file and returns the class and cluster names of the original and summarized texts, in addition to their similarity score.
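A sketch of what this module might expose, assuming the pickle stores the three objects as a tuple (the real layout may differ) and using cosine similarity over TF-IDF vectors for the similarity score:

```python
import pickle
from sklearn.metrics.pairwise import cosine_similarity

def load_champions(path="champion_models.pickle"):
    # Assumed layout: (vectorizer, classifier, clusterer).
    with open(path, "rb") as f:
        return pickle.load(f)

def evaluate_pair(vectorizer, classifier, clusterer, original, summary):
    """Predict class/cluster for both texts and score their similarity."""
    X = vectorizer.transform([original, summary])
    classes = classifier.predict(X)   # topic labels for [original, summary]
    clusters = clusterer.predict(X)   # cluster ids for [original, summary]
    similarity = cosine_similarity(X[0], X[1])[0, 0]
    return classes, clusters, similarity
```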
**MainWindow.py**

This is the GUI code file used for inference.
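For orientation, a toy PyQt5 window showing how an input box, a button, and an output label could be wired to the inference functions; MainWindow.py's actual layout and logic differ:

```python
import sys
from PyQt5.QtWidgets import (QApplication, QLabel, QPushButton,
                             QTextEdit, QVBoxLayout, QWidget)

class DemoWindow(QWidget):
    def __init__(self):
        super().__init__()
        self.input = QTextEdit()
        self.output = QLabel("Summary appears here")
        button = QPushButton("Summarize")
        button.clicked.connect(self.on_click)
        layout = QVBoxLayout(self)
        for widget in (self.input, button, self.output):
            layout.addWidget(widget)

    def on_click(self):
        # The real app would call the summarization/classification inference here.
        self.output.setText(self.input.toPlainText()[:100])

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = DemoWindow()
    window.show()
    sys.exit(app.exec_())
```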
**Arabic_stop_words.txt**

This text file is used during preprocessing in the summarization notebook.
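A small sketch of how such a stop-word list is typically applied, with simplified whitespace tokenization (the notebook's preprocessing is more involved):

```python
def load_stop_words(path="Arabic_stop_words.txt"):
    """Read one stop word per line into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def remove_stop_words(text, stop_words):
    """Drop stop words from a whitespace-tokenized text."""
    return " ".join(token for token in text.split() if token not in stop_words)
```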
**champion_models.pickle**

This pickle file contains the TF-IDF vectorizer, the champion classifier, and the champion clusterer, which are loaded in class_clust_infer.py. It is large, so you can click here to download it.
**objects.pickle**

This pickle file contains the trained tokenizers loaded in inference.py. It is large, so you can click here to download it.
**WikiLingua dataset folder**

This folder contains the WikiLingua Arabic datasets. You can click here to download it.
If you download it, you won't need to run `read_text(dir_path, fin)` in the summarization notebook.
**Checkpoints folder**

This folder is generated by the summarization notebook to save checkpoints during training, and it is also loaded in inference.py so the latest checkpoint can be used directly for inference. You can click here to download it.
If you download it, you won't need to run the Training steps section in the summarization notebook.
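A hedged sketch of restoring the latest checkpoint for inference, assuming the notebook saved with `tf.train.Checkpoint`; the directory name is a placeholder:

```python
import tensorflow as tf

def restore_latest(model, ckpt_dir="checkpoints/"):
    """Restore the most recent training checkpoint into `model`, if one exists."""
    ckpt = tf.train.Checkpoint(model=model)
    latest = tf.train.latest_checkpoint(ckpt_dir)  # None when no checkpoint found
    if latest:
        ckpt.restore(latest).expect_partial()
    return latest
```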