One-day workshop on understanding Docker, Web Scraping, Regular Expressions, PostgreSQL, and Git.
Use Ubuntu 20.04 LTS with the following packages installed
- Python 3.9 or above
- docker
- docker-compose
- pip3
- git (any recent version)
- Create an account on GitHub (only if you do not have an account)
- Fork the DataEngineering-Workshop1 repository. Refer to this guide to understand how to fork a repository.
- Clone the forked repo to your machine using an SSH key.
- Make sure you have set up an SSH key as per the documentation; create a new SSH key if you don't have one.
- Open your forked repo link in your browser.
- Click the green Code button.
- Select the SSH option and copy the link.
- Clone the repo (replace YOUR-GIT-ID with your GitHub username)
git clone [email protected]:<YOUR-GIT-ID>/DataEngineering-Workshop1.git
- To install Docker, go to your cloned repository and run the following command
sudo prerequisites/install_docker.sh
- Check that Git, Docker, and Docker Compose are installed on the system.
- Open the terminal and run the following commands to check the versions of the prerequisites
- Check Git version
git --version
- Check Docker version
docker --version
- Check Docker Compose version
docker-compose --version
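The version checks above can also be scripted. A minimal Python sketch (the tool names match the packages listed earlier; the exact version output format varies between tools):

```python
import shutil
import subprocess

def check_tool(name):
    """Return the tool's version string, or None if it is not on PATH."""
    if shutil.which(name) is None:
        return None
    # git, docker, and docker-compose all support a --version flag.
    result = subprocess.run([name, "--version"], capture_output=True, text=True)
    return result.stdout.strip() or result.stderr.strip()

for tool in ("git", "docker", "docker-compose"):
    version = check_tool(tool)
    print(f"{tool}: {version if version else 'NOT INSTALLED'}")
```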
Run docker-compose.yml with the following commands to start the PostgreSQL server
docker-compose up -d
docker exec -it psql-db bash
psql -U postgres
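The repo's docker-compose.yml defines the PostgreSQL service. A minimal sketch of what such a file might look like (the container name `psql-db` and user `postgres` are taken from the commands above; the image tag and password are assumptions):

```yaml
version: "3"
services:
  db:
    image: postgres:13            # image tag is an assumption
    container_name: psql-db      # matches the docker exec command above
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres  # placeholder; use the repo's actual value
    ports:
      - "5432:5432"
```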
- Create a database to store the scraped content.
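Inside the psql session, the database (and a table matching the columns listed at the end of this document) can be created. The database and table names here are assumptions; use whatever the workshop code expects:

```sql
-- Database name is an assumption
CREATE DATABASE scraped_content;
\c scraped_content
-- Columns follow the Date | Title | Content/BodyText | Author layout
CREATE TABLE articles (
    id      SERIAL PRIMARY KEY,
    date    DATE,
    title   TEXT,
    content TEXT,
    author  TEXT
);
```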
Build and run the Dockerfile with the following commands
docker build --no-cache --network=host ./ -t simple_python
docker run --network=host simple_python
The scraped content will be stored in a table with the following columns
- Date | Title | Content/BodyText | Author
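As a taste of the scraping and regular-expression topics, here is a minimal Python sketch that extracts one row in the Date | Title | Content/BodyText | Author shape from an HTML snippet. The HTML structure, tag names, and sample values are invented for illustration, not taken from the workshop's actual target site; real scrapers would typically use an HTML parser rather than regexes:

```python
import re

# A toy HTML snippet; real pages will differ in structure.
html = """
<article>
  <h1 class="title">Docker for Data Engineers</h1>
  <span class="author">Jane Doe</span>
  <time datetime="2021-06-01">June 1, 2021</time>
  <div class="body">Containers make environments reproducible.</div>
</article>
"""

def extract_row(page):
    """Pull a (date, title, content, author) tuple out of the page with regexes."""
    date    = re.search(r'datetime="([^"]+)"', page).group(1)
    title   = re.search(r'<h1 class="title">(.*?)</h1>', page).group(1)
    content = re.search(r'<div class="body">(.*?)</div>', page).group(1)
    author  = re.search(r'<span class="author">(.*?)</span>', page).group(1)
    return (date, title, content, author)

print(extract_row(html))
```

Each tuple returned this way maps directly onto one row of the table above.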