Preetheshlewis26/DataEngineering-Workshop1

 
 

Data Engineering Workshop

A one-day workshop on Docker, web scraping, regular expressions, PostgreSQL, and Git.

Prerequisites

Use Ubuntu 20.04 LTS with the following packages installed:

  • Python 3.9 or above
  • docker
  • docker-compose
  • pip3
  • git (any recent version)

GitHub account

  • Create an account on GitHub (only if you do not already have one).
  • Fork the DataEngineering-Workshop1 repository. Refer to the guide on forking a repository if you have not done this before.
  • Clone the forked repo to your machine using an SSH key.
    • Make sure you have set up an SSH key. If you don't have one, follow the documentation to create a new SSH key.
    • Open your forked repo in your browser.
    • Click the green Code button.
    • Select the SSH option and copy the link.
    • Clone the repo (replace <YOUR-GIT-ID> with your GitHub username)
         git clone git@github.com:<YOUR-GIT-ID>/DataEngineering-Workshop1.git
      

Docker

  • To install Docker, go to your cloned repository and run the following command:
       sudo prerequisites/install_docker.sh

Workshop environment setup

  • Check that Git, Docker, and Docker Compose are installed on the system.
  • Open a terminal and run the following commands to check the version of each prerequisite. The line after each command shows example output.
    • Check Git version
       git --version
      
      git version 2.25.1
    • Check Docker version
       docker --version
      
      Docker version 20.10.17, build 100c701
    • Check Docker Compose version
       docker-compose --version
      
      docker-compose version 1.25.0, build 0a186604
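
The three checks above can also be run in one go. The following is a minimal sketch, not part of the workshop repository: the helper names `tool_version` and `check_tools` are hypothetical, and it assumes each tool accepts `--version`, as the commands above do.

```python
import shutil
import subprocess

# Tools from the prerequisites list; each is queried with `<tool> --version`.
TOOLS = ["git", "docker", "docker-compose"]


def tool_version(tool):
    """Return the first line of `<tool> --version`, or None if unavailable."""
    path = shutil.which(tool)  # None when the tool is not on PATH
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some tools print their version to stderr, so fall back to it.
    output = result.stdout.strip() or result.stderr.strip()
    return output.splitlines()[0] if output else None


def check_tools(tools=TOOLS):
    """Map each tool name to its reported version line (or None)."""
    return {tool: tool_version(tool) for tool in tools}


if __name__ == "__main__":
    for name, version in check_tools().items():
        print(f"{name}: {version or 'NOT FOUND'}")
```

Any tool reported as NOT FOUND needs to be installed before continuing.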

Homework

  • Start the PostgreSQL server defined in docker-compose.yml, then open a psql shell inside the container

    docker-compose up -d 
    
    docker exec -it psql-db bash
    
    psql -U postgres
    
    • Create a database to store the scraped content
  • Build and run the image defined in the Dockerfile using the following commands

    docker build --no-cache --network=host ./ -t simple_python 
    
    docker run --network=host simple_python
    
  • The scraped content will be stored in a table with the following columns

    • Date | Title | Content/BodyText | Author
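
As an illustration of the table layout above, here is a minimal sketch of storing scraped rows. It is not the workshop's solution. The table name `articles`, the column names, and the helpers `store_articles` and `fetch_articles` are all hypothetical. It uses Python's built-in `sqlite3` as a stand-in so it runs without a database server. With a PostgreSQL driver such as psycopg2, the placeholders would be `%s` instead of `?`.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class Article:
    """One scraped item, matching the Date | Title | Content | Author layout."""
    date: str      # assumed to be an ISO date string such as "2022-08-01"
    title: str
    content: str
    author: str


# Hypothetical table and column names; the workshop does not prescribe them.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS articles (
    date    TEXT,
    title   TEXT,
    content TEXT,
    author  TEXT
)
"""
INSERT_SQL = (
    "INSERT INTO articles (date, title, content, author) VALUES (?, ?, ?, ?)"
)


def store_articles(conn, articles):
    """Create the table if needed and insert rows using bound parameters."""
    conn.execute(CREATE_SQL)
    conn.executemany(INSERT_SQL, [astuple(a) for a in articles])
    conn.commit()


def fetch_articles(conn):
    """Return all stored rows as Article objects, oldest first."""
    rows = conn.execute(
        "SELECT date, title, content, author FROM articles ORDER BY date"
    )
    return [Article(*row) for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory stand-in database
    store_articles(conn, [Article("2022-08-01", "Sample title", "Sample body", "Sample author")])
    for article in fetch_articles(conn):
        print(article)
```

Bound parameters keep scraped text, which may contain quotes or other special characters, from being interpreted as SQL.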
