Skip to content

This repository contains ETL file from Pentaho Data Integration. The ETL process cleaned applicant with empty values/data and dirty data. The dataset is taken from https://www.kaggle.com/rikdifos/credit-card-approval-prediction

Notifications You must be signed in to change notification settings

caesarmario/etl-credit-card-dataset-using-pentaho

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

14 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿ’ณ ETL Credit Card Data Set ๐Ÿ’ณ

using Pentaho Data Integration (PDI)/Kettle โš™

.: ๐Ÿ“„ Dataset taken from Kaggle :.


Star Badge Kaggle Projects


๐Ÿ“ƒ Table of Contents:


๐Ÿ–‹ About Project

  • This repository contains:

    • ETL file using Pentaho Data Integration (PDI).
    • CSV file that has gone through ETL process.
  • This project will also:

    • Clean and transform both data sets (application record and credit record),
    • Merge, clean, and transform data sets into one data set (in CSV format).

๐Ÿ“Œ Objectives

  • Perform ETL using PDI for both datasets.
  • Create time dimension using PDI.
  • Create fact table using PDI.

๐Ÿงพ Data Set Description

  • The dataset description can be seen here.

โš™ ETL Process

Main ETL Flow



๐Ÿ‘จโ€๐Ÿ’ผ Application Record

โ–ถ CSV file input

CSV file input - Application

  • Importing application record csv.

โ–ถ Sort rows

Sort rows - Application

  • Sort data based on ID (in ascending order).

โ–ถ Unique rows

Unique rows - Application

  • Filtering duplicate ID.

โ–ถ Replace in string

Replace in string - Application

  • Replacing 'Y' with 1, and 'N' with 0.

โ–ถ Add constants

Add contants - Application

  • Adding 'Current_Date' column.

โ–ถ Calculator

Calculator - Application

  • Calculating applicant age and how long applicant have been working (in years).

โ–ถ Filter rows

Filter rows - Application

  • Filtering applicant that is less than 21 y.o.
  • Filtering applicant with null/empty values



๐Ÿ’ถ Credit Record

โ–ถ CSV file input

CSV file input - Credit Record

  • Importing credit record csv.

โ–ถ Sort rows 2

Sort rows - Credit Record

  • Sort data based on ID (in ascending order).

โ–ถ Add constants 2

Add contants - Credit Record

  • Adding 'Current_Date' column.

โ–ถ Calculator 2

Calculator - Credit Record

  • Calculating month loan payment.
  • Creating copy of 'STATUS' column.

โ–ถ Replace in string 2

Replace in string - Credit Record

  • Replace C, X, 0 with 'Good Debt' (C: loan for that month is already paid; X: no loan for that month; 0: loan is 1 to 29 days overdue).
  • Replace 1, 2, 3, 4, 5 with 'Bad Debt' (1: loan is 30 to 59 days overdue; 2: loan is 60 to 89 days overdue; 3: loan is 90 to 119 days overdue; 4: loan is 120 to 149 days overdue; 5: loan is more than 150 days overdue).

โ–ถ Calculator 3

Calculator2 - Credit Record

  • Creating 2 copies of 'STATUS2' column (Good_Debt and Bad_Debt).

โ–ถ Replace in string 3

Replace in string2 - Credit Record

  • Good_Debt: Good Debt will be change to 1, while Bad Debt will be change to 0.
  • Bad_Debt: Good Debt will be change to 0, while Bad Debt will be change to 1.

โ–ถ Group by

Group by - Credit Record

  • Calculating total of Good Debt and Bad Debt from each applicant (similar to group by function in SQL).

โ–ถ Modified JavaScript value

JavaScript - Credit Record

  • If the total of Good Debt is higher than Bad Debt, then an applicant status will be eligible (1).
  • If the total of Bad Debt is higher than Good Debt, then an applicant status will be not eligible (0).



๐Ÿ“ฅ Output file

โ–ถ Stream lookup

Stream lookup - Output file

  • Bad_Debt_CNT, Good_Debt_CNT, and STATUS will be merged based on applicant ID.

โ–ถ Filter rows

Filter rows - Output file

  • Applicant with empty Bad_Debt_CNT, Good_Debt_CNT, and STATUS will be deleted.

โ–ถ Select values 2

Select values - Output file

  • Select columns that will extracted.

โ–ถ Text file output

Text file output1 - Output file Text file output2 - Output file

  • Exporting cleaned and transformed data set into CSV file.



๐Ÿ‘€ Preview Output File

Preview Output File 1 Preview Output File 2



๐Ÿ™Œ Support me!

๐Ÿ‘‰ If you find this project useful, please โญ this repository ๐Ÿ˜†!

๐ŸŽˆ Check out my work using AutoML/PyCaret with this processed data set here!


๐Ÿ‘‰ More about myself: here

About

This repository contains ETL file from Pentaho Data Integration. The ETL process cleaned applicant with empty values/data and dirty data. The dataset is taken from https://www.kaggle.com/rikdifos/credit-card-approval-prediction

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published