GitHub - krishnabigdata/data-analysis: taxi-data-analysis

Yellow Taxi Data Processing and Analysis

Analysis of Yellow Taxi data. The project can run as a Docker container that loads the data into a database, scheduled as a job using Chronos on the Mesos platform or the Kubernetes scheduler.

The data download and load processes are tracked and managed through the table tbl_status.
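
As a rough illustration of how a status table can track the download and load steps, the sketch below uses Python's built-in sqlite3 in place of PostgreSQL so that it runs without a server. The real schema is defined in sql/V1.0_CREATE_TABLE.sql and is not reproduced here; the column names (file_name, step, status, updated_at), the helper functions, and the example file name are all assumptions made for this example.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: the project's real tbl_status columns may differ.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS tbl_status (
    file_name  TEXT NOT NULL,
    step       TEXT NOT NULL,      -- 'download' or 'load'
    status     TEXT NOT NULL,      -- 'started', 'done' or 'failed'
    updated_at TEXT NOT NULL,
    PRIMARY KEY (file_name, step)
)
"""


def set_status(conn, file_name, step, status):
    """Insert or overwrite the status row for one file and step."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO tbl_status (file_name, step, status, updated_at) "
        "VALUES (?, ?, ?, ?)",
        (file_name, step, status, now),
    )
    conn.commit()


def is_done(conn, file_name, step):
    """Return True if the given step already finished for this file."""
    row = conn.execute(
        "SELECT status FROM tbl_status WHERE file_name = ? AND step = ?",
        (file_name, step),
    ).fetchone()
    return row is not None and row[0] == "done"


# In-memory database used only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(CREATE_SQL)
set_status(conn, "yellow_tripdata_2019-01.csv", "download", "started")
set_status(conn, "yellow_tripdata_2019-01.csv", "download", "done")
```

A scheduled job could call is_done before each step and skip work that already finished, which keeps reruns from repeating a completed download or load.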

Requirements

Python 3.7+ and PostgreSQL

Project Structure

.. code-block::

   .
   |-- Dockerfile
   |-- MANIFEST.in
   |-- Makefile
   |-- README.md
   |-- dataset
   |   |-- 
   |-- docker-compose.yml
   |-- requirements.txt
   |-- setup.cfg
   |-- setup.py
   |-- sql
   |   `-- V1.0_CREATE_TABLE.sql
   |-- src
   |   `-- processing
   |       |-- __init__.py
   |       |-- cli.py
   |       |-- constant.py
   |       |-- transform.py
   |       `-- util.py
   |-- tests
   |   |-- __init__.py
   |   `-- test_analysis.py
   `-- yellow_taxi_analysis.ipynb

Output

yellow_taxi_analysis.ipynb: contains all the analysis outputs.

Installing

Steps:

git clone https://github.com/krishnabigdata/taxi-data-analysis.git
pip install -r taxi-data-analysis/requirements.txt
pip install --upgrade taxi-data-analysis
docker-compose up -d

Docker

Build the Docker image, push it, and run it:

make build -e VERSION=latest
make push -e VERSION=latest
docker run -t -i --network host -v ${PWD}:/taxi_data_analysis/dataset docker.io/krishnabigdata/taxi_data_analysis --action download --year 2019 --month 1 --color yellow

Usage

Options of the processing CLI:

.. code-block:: text

   usage: processing [-h] [--year YEAR] [--month {1,2,3,4,5,6}]
                     [--color {yellow}] --action
                     {all,download,load,avg_trip,avg_trip_local,rolling_avg_trip}
                     [--verbose VERBOSE]

   Taxi Data Analysis

   optional arguments:
     -h, --help            show this help message and exit
     --year YEAR           year of data to load (default: 2019)
     --month {1,2,3,4,5,6}
                           month of data to load (default: 1)
     --color {yellow}      color of data to load (default: yellow)
     --action {all,download,load,avg_trip,avg_trip_local,rolling_avg_trip}
                           action to be performed (default: all)
     --verbose VERBOSE     logging action to be performed (default: True)
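
The help text above is consistent with a parser built on Python's argparse. The following is a minimal reconstruction from that output alone, not the project's actual cli.py: build_parser and str_to_bool are hypothetical names, and the argument types and the handling of --verbose are assumptions.

```python
import argparse

ACTIONS = ["all", "download", "load", "avg_trip", "avg_trip_local", "rolling_avg_trip"]


def str_to_bool(value):
    """Parse common true/false spellings; plain bool() would treat 'False' as True."""
    return str(value).strip().lower() in {"1", "true", "yes", "y"}


def build_parser():
    """Build a parser whose options mirror the help output shown above."""
    parser = argparse.ArgumentParser(
        prog="processing",
        description="Taxi Data Analysis",
        # Appends "(default: ...)" to each help string.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("--year", type=int, default=2019,
                        help="year of data to load")
    parser.add_argument("--month", type=int, choices=range(1, 7), default=1,
                        help="month of data to load")
    parser.add_argument("--color", choices=["yellow"], default="yellow",
                        help="color of data to load")
    # The usage line shows --action without brackets, so it is marked required here.
    parser.add_argument("--action", required=True, choices=ACTIONS, default="all",
                        help="action to be performed")
    parser.add_argument("--verbose", type=str_to_bool, default=True,
                        help="logging action to be performed")
    return parser


args = build_parser().parse_args(
    ["--action", "download", "--year", "2019", "--month", "1", "--color", "yellow"]
)
```

Restricting --month and --action through choices lets argparse reject unsupported values before any download or database work starts.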

- all: performs all steps (download, load to DB, query the DB for the average and the rolling average).
- avg_trip_local: calculates the average trip distance by month by querying the locally downloaded file.
- avg_trip: calculates the average trip distance by month by querying the DB.
- rolling_avg_trip: calculates the 45-day rolling average of trip distance by querying the DB.
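
To make the two calculations concrete, here is a small standard-library sketch of a monthly average and a 45-day rolling average of trip distance. It is an illustration under assumptions, not the project's transform.py: it assumes CSV input with tpep_pickup_datetime and trip_distance columns, the function names are hypothetical, and the rolling average is interpreted as the mean over all trips picked up in the 45 days ending on each date.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timedelta


def read_trips(csv_file):
    """Yield (pickup_date, trip_distance) pairs from an open CSV file."""
    for row in csv.DictReader(csv_file):
        pickup = datetime.strptime(row["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        yield pickup.date(), float(row["trip_distance"])


def monthly_average(trips):
    """Return {(year, month): average trip distance}."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for day, distance in trips:
        bucket = totals[(day.year, day.month)]
        bucket[0] += distance
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def rolling_average(trips, window_days=45):
    """Return {date: average over trips in the window_days ending on that date}."""
    daily = defaultdict(lambda: [0.0, 0])  # date -> [sum, count]
    for day, distance in trips:
        daily[day][0] += distance
        daily[day][1] += 1
    result = {}
    for day in sorted(daily):
        total, count = 0.0, 0
        for offset in range(window_days):
            # .get avoids creating empty entries for dates without trips.
            day_total, day_count = daily.get(day - timedelta(days=offset), (0.0, 0))
            total += day_total
            count += day_count
        result[day] = total / count
    return result


# Made-up sample rows, used only to exercise the functions.
SAMPLE = """tpep_pickup_datetime,trip_distance
2019-01-01 00:10:00,1.0
2019-01-01 09:30:00,3.0
2019-02-10 12:00:00,5.0
2019-03-01 08:00:00,7.0
"""

trips = list(read_trips(io.StringIO(SAMPLE)))
by_month = monthly_average(trips)
rolling = rolling_average(trips)
```

The inner loop over the window is adequate for a few hundred distinct dates; a running sum would avoid recomputing overlapping windows on larger inputs.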

Scaling Up

The options below support distributed processing of data volumes that are too large for a single instance:

- PySpark: distributed processing.
- DB: parallel loading of files into the DB, with analysis done through SQL queries.
- Streaming: data is produced as events to Kafka and processed with Kafka Connect connectors or Spark Structured Streaming, or consumed from Kafka and loaded into the DB for SQL queries.
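
Whichever option is chosen, a distributed average rests on the same idea: each worker reduces its file or partition to a partial (sum, count) pair, and the pairs are merged at the end, because averaging the per-file averages would give small files too much weight. The sketch below shows this with a thread pool standing in for a cluster; the function names and the in-memory partitions are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(distances):
    """Worker step: reduce one partition to a (sum, count) pair."""
    return sum(distances), len(distances)


def combined_average(partitions, max_workers=4):
    """Aggregate partitions in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_aggregate, partitions))
    total = sum(part_sum for part_sum, _ in partials)
    count = sum(part_count for _, part_count in partials)
    # No trips at all: there is no meaningful average to report.
    return total / count if count else None


# Hypothetical partitions, e.g. trip distances read from two monthly files.
partitions = [[1.0, 2.0, 3.0], [10.0]]
average = combined_average(partitions)
```

Merging sums and counts gives 16.0 / 4 here, whereas averaging the two per-partition averages (2.0 and 10.0) would give a different and misleading figure.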
