This project is a CLI Apllication that searches for recent news articles on a given topic and generates a summary of the top articles
Note: Unless stated specifically, most commands are UNIX specific. Windows Devices using powershell have a different suite of commands.
- Clone the repository:
- =>
git clone https://github.com/semibroiled/news-topic-aggregator.git
- Enter cloned directory
- =>
cd news-topic-aggregator
- Create Virtual Environment
- => Make sure you have Python installed. Python 3.12 is reccomennded
- =>
python3 -m venv .venv
- Activate the Virtual Environment
- =>
source .venv/bin/activate
- => on Windows Devices, this is
.venv\Scripts\activate
- Verify you are in the correct envrironment
- =>
which pip
andwhich python
to check if you're in .venv
- Install required packages
- =>
pip install -r requirements.txt
Rename the .env.example
file to .env
Enter your API Key for News API instead of the example listed.
See below on how to obtain your API Key
- Make an account at https://newsapi.org/
- Press Get API Key-> Button on Homepage
- You will be offered a free development API Key. Be sure to save it and use it
There are a couple different ways to start the application.
Run the application in your terminal. Make sure you're already in the project's root directory
python main.py
If you're getting Import Errors such as no module named 'src'; update PYTHONPATH and then run main.py
export PYTHONPATH="${PYTHONPATH}:/path/to/project_root/src"
python main.py
To run tests, from the project directory in terminal use command:
-
=>
pytest
-
or =>
pytest -v
for detailed output in verbose mode -
=>
pytest -s
to show print statements -
=>
pytest -x
to stop after first FAIL -
Run pytest for a specific file =>
pytest path/to/test_file.py
-
Run specific test within a file =>
pytest path/to/test_file.py::test_case_name
To Run in Docker, you need to have Docker Desktop installed and running in the background. From your repository, use
docker build -t news-topic-aggregator .
docker run -it --rm --env-file .env -v "$(pwd)/history:/app/history" news-topic-aggregator
- With the
--env-file flag
we specify our.env
file from which the API Key is read - With
-v
flag we set the output where the CSV files are written locally - With the
it
flag we run interactively on the terminal and accept user inputs, this means that you can runpytest
in the Terminal via Docker
On macOS and Windows
Docker Desktop provides a graphical interface to allocate more resources to Docker.
- Open Docker Desktop: • Click on the Docker icon in the taskbar to open Docker Desktop.
- Go to Settings/Preferences: • Click on the Settings or Preferences menu item. This is usually found in the Docker Desktop menu.
- Adjust Resources: • Navigate to the Resources section (on the left sidebar). • You can adjust the sliders for: • CPU: Increase the number of CPUs allocated. • Memory: Increase the amount of RAM allocated. • Swap: Adjust the swap space if needed. • Disk: Increase disk space allocation if needed.
- Apply and Restart: • After making adjustments, click Apply & Restart to apply the new settings and restart Docker.
Do this if your container keeps crashing after the first query.
Due to running HuggingFace locally, this may generate too much resources for a container to run without restrictions leading it to eventually crash.
This is a simple Shell script to run the application on UNIX devices in bash. You do need to make sure that the .env
file with API Key and .venv
is already correctly configued. This streamlines all the nitty gritty steps and reduces inconvenience.
You do not need to activate the virtual envrionment, as that is a specified step in the script itself.
chmod +x start.sh
to set the file as an executable./start.sh --run
or./start.sh --test
or./start.sh --docker
--run
flag starts the application--test
flag runs pytest suite before starting application--docker
flag builds docker image and runs container
Upon running the application, you will be prompted to enter a topic to search. You can also use the following commands:
- !help - Display usage instructions.
- !setlang - Change query language.
- !sethf - Change Model ID for HuggingFace.
- !exit or !quit - Close the application.
To change the query language, use the !setlang command. Type en for English or de for German.
Application Defaults to English.
To change the model used for summarization, use !sethf command.
Application Defaults to Barts CNN Model.
Type !exit or !quit to close the application.
• Exact Match: Use quotation marks for exact matches, e.g., "elon musk".
• Mandatory/Excluded Keywords: Use + and - to specify mandatory and excluded keywords, e.g., gamestop +stonks -sell.
• Boolean Operators: Use Boolean operators for complex queries, e.g., (crypto AND bitcoin) NOT ethereum.
• Title-Specific Search: Limit search to titles using InTitle="title search".
news-topic-aggregator/
├── .venv/
├── history/
│
├── src/
│ ├── utils/
│ │ ├── secure_input.py
│ │ ├── get_keys.py
│ │ ├── spinner.py
│ │ └── __init__.py
│ ├── cli_help.py
│ ├── search_news.py
│ └── summarize_content.py
├── test/
├── main.py
├── .env
├── .env.example
├── config.yaml
├── LICENSE
├── start.sh
├── .gitignore
├── README.md
└── requirements.txt
Search results from your topic search will be saved in the history subfolder. You can identify separate files by virtue of search term and date/time.
LangChain was chosen to allow potential switching between different services, ensuring the project isn’t restricted to one service.
Hugging Face was selected for its open-source nature and active community, providing access to state-of-the-art NLP models. Especially for our case, we use Bart trained on CNN News Articles.
Despite that, the performance is anything but optimal. Often only part of headlines verbatim or misconstrued repeatable inferences are outputted instead.
Update: Added option to type in other models.
Initially, API endpoints were used, but issues with internal kwargs calls led to a transition to pipelines. This approach leverages local resources for more reliable processing in terms of execution.
In terms of results, it seems to just fetch one or two headlines in full. At present, I do not know a optimzation strategy suited for this issue.
To improve maintainability, API calls are separated into utility functions, making the codebase cleaner and easier to manage.
Update: I've decided against this to reduce overhead.
To improve readability and usability, it would be nice to use ANSI Codes to make the terminal application pretty.
The main script is currently verbose but maintains clarity and functionality. Future refactoring will focus on modularizing the code. Especially using a terminal package instead of input and to handle all steps within try-except blocks where necessary.
Named Entity Recognition (NER) sometimes captures capitalization of normal words incorrectly. No work around has been found. Packages NLTK and SpaCy were both tested. Despite NLTKs false positives, it was chosen in favour of SpaCy due to it not capturing obvious NEs in testing phase.
This is a known limitation and its optimization is beyond my knowlegdge for now.
I took the 'at least one month' and took the liberty to interpret it as for the last month only for the sake of convenienve. Despite plenty of neat stuff there is no way to set that within CLI
Update: I see no reason to limit to one month in hindsight, removing date constraint so it fetches all the data it can
I didn't use argparse or really my favourites click/questionary to make the CLI. That was a hindsight. I am too used the past month to make CLIs in C++ and zoned myself in. It works, so I don't plant to fix that soon. Neat todo for later
The application is a bit bloated perhaps and hence slow. The first few dozen runs it wasn't really like that. It just sort of happened.
I suspect exporting PYTHONPATH temporarily so many times might have done something, or just that the runtime itself is slow due to many unoptimized elements.
Not to mention, I am running HF locally instead of calling via API. As I was coding trying to march past depreceated commands, I eventually ran into a problem where internal method calls I have no control over are hindering code execution. This is beyond the scope of this project, and thus I have decided to run it locally instead.
Another issue of contention is nltk or spacy. Since they are downloaded on memory, they may also be causing performance issues.
I wanted to let '!' still be the basis for my CLI Commands so I let it pass through sanitization process of inputs- there should be a better or more robust way to do this.
Its verbose and unnecessary, I had it in when I was coding and debugging, and kept it inside still. It would be a good idea to take them out for clarity in code.
Update: Removed from code