# Uber personal data

The goal of this project is to extract, transform, and load (ETL) personal data from Uber rides for data analysis.

## 1. Where can you find your own dataset?

Go to https://myprivacy.uber.com/privacy/exploreyourdata/download, log in, and click on "Request your data".

After a while (maybe a couple of days), you'll receive an e-mail from Uber saying that your data is ready to be downloaded. The download is a zip file with a few folders; we'll use `Rider/trips_data.csv`.


## 2. A few considerations

This project was built by me, for my own personal data, so the transformation step may not work for your analysis as-is. But if you know Python, it'll be simple to make your own adjustments!

I did not share my own data in the samples that are in the `dags` folder; they are sample datasets that do not correspond to a real user. Their purpose is to give the developer an idea of the raw data and the final outcome.

## 3. The code

### 3.1 Overview

So, let's finally understand what's been done here!

After extracting the files from the zip, we dump them into a `raw_data` table in Postgres. Then we read from this table, apply the transformations in Python (using Pandas), and save the result to another CSV file called `transformed_data.csv`. That done, we load it into a table called `structured_data`.

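For context, here is a minimal sketch of how such a pipeline could be wired up as an Airflow DAG. It is illustrative only: the `dag_id`, task ids, and the three placeholder callables are assumptions, not the repository's actual code.

```python
# Illustrative sketch of the extract -> transform -> load chain described above.
# Task ids and function names are assumptions; see the dags folder for the real code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_raw():
    ...  # unzip the export and dump trips_data.csv into the raw_data table


def transform():
    ...  # read raw_data, clean it with Pandas, write transformed_data.csv


def load_structured():
    ...  # load transformed_data.csv into the structured_data table


with DAG(
    dag_id="uber_data_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # run on demand
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_to_raw)
    clean = PythonOperator(task_id="transform", python_callable=transform)
    load = PythonOperator(task_id="load", python_callable=load_structured)

    extract >> clean >> load
```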

### 3.2 Extract step

After creating the tables in Postgres, we're ready to extract the data. But there's a tiny problem with the timestamp columns: they carry a trailing "+0000 UTC" that tells us nothing and isn't supported by Postgres's `timestamp` data type, like this:

```
Request time
2022-12-03 13:28:36 +0000 UTC
```

So we need to remove it before inserting the data into the database. The `remove_tmz` function does that for us; we just pass it a list of the columns that need to be transformed and the dataframe, and voilà!

```python
# Strip the "+0000 UTC" suffix from every listed timestamp column
df[cols] = df[cols].apply(lambda s: s.str.replace(r"\+0000 UTC", "", regex=True))
```
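Based on that description, a sketch of what the full helper might look like; the exact signature and the column names in the usage comment are assumptions:

```python
import pandas as pd


def remove_tmz(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Strip the trailing '+0000 UTC' suffix from the given timestamp columns."""
    df[cols] = df[cols].apply(
        lambda s: s.str.replace(r"\s*\+0000 UTC", "", regex=True)
    )
    return df


# Hypothetical usage; the column list depends on your own export:
# df = remove_tmz(df, ["Request Time", "Begin Trip Time", "Dropoff Time"])
```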

That done, the data can be loaded into the `raw_data` table! That's handled by the `send_data` function, which in turn uses Pandas's `to_sql` method:

```python
data.to_sql(
    name=table_name,
    con=create_engine(self.URI),
    schema=schema,
    if_exists="append",
    index=False,
    dtype={},
)
```
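For reference, a sketch of how such a `send_data` wrapper could be shaped; the class name, URI attribute, and default schema are assumptions, not the repository's actual code:

```python
import pandas as pd
from sqlalchemy import create_engine


class PostgresClient:
    """Hypothetical wrapper holding the connection details used by send_data."""

    def __init__(self, uri: str):
        self.URI = uri  # e.g. "postgresql://user:password@localhost:5432/uber"

    def send_data(self, data: pd.DataFrame, table_name: str, schema: str = "public"):
        # Append the dataframe to the target table, creating it if it doesn't exist
        data.to_sql(
            name=table_name,
            con=create_engine(self.URI),
            schema=schema,
            if_exists="append",
            index=False,
        )
```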

### 3.3 Transform step

Okay, we've loaded the raw data into Postgres. What now? Now we look for data quality problems and solve them! Looking at the data, I found a few issues:

1 - Rides that were not completed

It's not relevant for us to analyze rides that didn't occur, right? So let's remove them. The approach I took was to drop every row whose status is not `COMPLETED`:

```python
# Drop every row whose status is not COMPLETED
dataframe.drop(
    dataframe[dataframe["Trip or Order Status"] != "COMPLETED"].index,
    inplace=True,
)
```
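An equivalent, arguably more idiomatic Pandas formulation is a single boolean filter; this is a sketch, not the repository's code:

```python
# Keep only completed rides; same result without index bookkeeping
dataframe = dataframe[dataframe["Trip or Order Status"] == "COMPLETED"].copy()
```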

2 - The product type has a typo in some rows; UberX appears with two different spellings:

| Product Type         | Value Count |
|----------------------|-------------|
| UberX                | 337         |
| uberX                | 14          |
| UberEATS Marketplace | 4           |
| VIP                  | 3           |

The solution I used was to turn everything to uppercase. Now it looks like this:

| Product Type         | Value Count |
|----------------------|-------------|
| UBERX                | 351         |
| UBEREATS MARKETPLACE | 4           |
| VIP                  | 3           |

The `types_all_upper` method does this operation with the following code:

dataframe["Product Type"] = dataframe["Product Type"].apply( lambda s: str.upper(s))

3 - The City column has wrong data:

In my specific case, I had a problem with the City column. I'm from a city called Jundiaí, but for some reason City was set to Campinas for every single ride I took in my town. To overcome this, I searched for the word "Jundiaí" in the dropoff address column: if it appears in a row, the City value is changed to Jundiaí.

Previous data:

| City      | Value Count |
|-----------|-------------|
| Campinas  | 343         |
| São Paulo | 11          |
| Blumenau  | 2           |
| Santos    | 2           |

Then I apply the `check_city` function to fix this, as explained above:

```python
import re

for index in dataframe.index:
    # If the dropoff address mentions Jundiaí, correct the City column
    if re.search("Jundiaí", dataframe.loc[index, "Dropoff Address"]):
        dataframe.loc[index, "City"] = "Jundiaí"
```

Resulting data:

| City      | Value Count |
|-----------|-------------|
| Jundiaí   | 329         |
| Campinas  | 14          |
| São Paulo | 11          |
| Blumenau  | 2           |
| Santos    | 2           |

### 3.4 Load

And finally, load! After all these transformations, we export the result to a CSV file and then load that file into the `structured_data` table!

This is done directly in the tasks of the DAG.

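As an illustration, the load task could boil down to something like this; the file path and connection URI are assumptions:

```python
import pandas as pd
from sqlalchemy import create_engine

URI = "postgresql://user:password@localhost:5432/uber"  # illustrative URI

# Read the CSV produced by the transform step...
df = pd.read_csv("transformed_data.csv")

# ...and append it to the structured_data table
df.to_sql("structured_data", con=create_engine(URI), if_exists="append", index=False)
```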

## 4. Dive into your data!

Now, just connect to the `structured_data` table and create your own dashboard!

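If you prefer Pandas over a BI tool, here is a quick sketch of pulling the table for exploration; the connection URI and the `Fare Amount` column name are assumptions about your own export:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/uber")  # assumption

# Load the cleaned rides and look at total spend per city
trips = pd.read_sql("SELECT * FROM structured_data", con=engine)
print(trips.groupby("City")["Fare Amount"].sum().sort_values(ascending=False))
```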
