The objective of this project is to analyse, in PySpark, several datasets about the products and
ratings of the e-commerce website NewChic.com.
For this project there are three CSV files, one for each product type:
- Shoe products
- Jewelry products
- Accessories
Each file contains the following columns:
- Category (men, accessory, jewelry, etc.)
- Subcategory: type of product within the category
- Name of the product
- Current price: price listed on the website
- Raw price: price of the product before any discounts (i.e. the original price of the product)
- Discount
- Currency: currency in which the price is listed
- Likes count: popularity of the product
- isnew: tells whether the product is new, i.e. unused (Boolean value, true or false)
-
- Import the necessary packages and start your Spark session.
- Load the 3 datasets into 1 dataframe.
- Print the schema.
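
The loading steps above can be sketched as follows. This is a minimal sketch, not the project's actual code: the file names in `CSV_PATHS`, the application name, and the helper names `start_session` and `load_products` are all assumptions made for illustration, and it assumes the three files share the same header row.

```python
from functools import reduce

# Hypothetical file names; replace with the actual paths of the three CSV files.
CSV_PATHS = ["shoes.csv", "jewelry.csv", "accessories.csv"]


def start_session(app_name="newchic-analysis"):
    """Create (or reuse) a Spark session. The app name is an arbitrary choice."""
    # Imported inside the function so the helper below can be imported
    # on a machine where Spark is not installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def load_products(spark, paths):
    """Read each CSV file and stack the results into a single dataframe."""
    frames = [
        spark.read.csv(path, header=True, inferSchema=True) for path in paths
    ]
    # unionByName matches columns by name, so the column order may differ
    # between files as long as the column names are the same.
    return reduce(lambda left, right: left.unionByName(right), frames)
```

With these helpers, `df = load_products(start_session(), CSV_PATHS)` followed by `df.printSchema()` would cover the three steps. `inferSchema=True` asks Spark to guess column types from the data; an explicit schema is a reasonable alternative if the guessed types turn out wrong.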
-
Building more features to help analyse each product. For these tasks I will create more columns to get a better understanding of each product. First, create functions that take the dataframe as input and return a new dataframe containing the following columns:
- Average price of each category
- Average price of each subcategory
- Average discount of each category
- Average discount of each subcategory
- Average likes of each category
- Average likes of each subcategory
- Rank of the likes of each subcategory
- Dense rank of the likes of each subcategory
- Rank of the likes of each category
- Dense rank of the likes of each category
- Rank of the discount of each category
- Dense rank of the discounts of each category
- Sum of the unused products in each category
- Sum of the unused products in each subcategory
Call the functions I created to add the new columns, then print the schema of the dataframe to check that the columns were created.
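
One possible way to build the columns listed above is with Spark window functions. The sketch below rests on several assumptions: the column names (`category`, `subcategory`, `current_price`, `discount`, `likes_count`, `is_new`) are guesses at the CSV headers, the new column names are invented here, and "rank of the likes of each subcategory" is read as ranking each product by its likes within its subcategory, highest first.

```python
# Assumed column names; adjust them to match the actual CSV headers.
GROUPS = ["category", "subcategory"]
AVERAGES = ["current_price", "discount", "likes_count"]
RANKINGS = [
    ("likes_count", "subcategory"),
    ("likes_count", "category"),
    ("discount", "category"),
]


def feature_plan():
    """List (new_column, kind, value_column, group_column) for every feature."""
    plan = [(f"avg_{v}_by_{g}", "avg", v, g) for v in AVERAGES for g in GROUPS]
    for v, g in RANKINGS:
        plan.append((f"rank_{v}_by_{g}", "rank", v, g))
        plan.append((f"dense_rank_{v}_by_{g}", "dense_rank", v, g))
    plan += [(f"unused_count_by_{g}", "sum_new", "is_new", g) for g in GROUPS]
    return plan


def add_features(df):
    """Return a new dataframe with one extra column per entry in the plan."""
    from pyspark.sql import Window, functions as F

    for name, kind, value, group in feature_plan():
        by_group = Window.partitionBy(group)
        if kind == "avg":
            column = F.avg(value).over(by_group)
        elif kind == "rank":
            column = F.rank().over(by_group.orderBy(F.col(value).desc()))
        elif kind == "dense_rank":
            column = F.dense_rank().over(by_group.orderBy(F.col(value).desc()))
        else:
            # Count unused products: cast the flag to 0/1 and sum it per group.
            flag = F.col(value).cast("boolean").cast("int")
            column = F.sum(flag).over(by_group)
        df = df.withColumn(name, column)
    return df
```

Calling `add_features(df).printSchema()` would then show the 14 added columns. Window functions keep one row per product, which matches the requirement that each function returns the dataframe with extra columns rather than an aggregated summary.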
-
Utilising the newly created features to answer some questions. For this task, use the dataframe API to answer the following questions.
- Which type of product is the least popular with the users? (tip - find the feature which demonstrates the popularity of a product)
- Which type of product is the most popular with the users?
- Which subcategory product is the most popular with the users?
- Which subcategory product is the least popular with the users?
- In the accessory products, which subcategory was the most expensive?
- Which type of product had the highest discount offerings?
- Which type of product had the lowest discount offerings?
- Among the shoe products, which subcategory was the most common?
- Which type of product had the most unused products listed for sale?
- Which subcategory product had the most unused products listed for sale?
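
Most of the questions above follow the same pattern: optionally filter, group by category or subcategory, aggregate one column, and take the top or bottom row. A hedged sketch of that pattern with the dataframe API is shown below. It assumes that "type of product" means the `category` column, that popularity is measured by the average of `likes_count`, and that the category values are spelled `'accessories'` and `'shoes'`; the column names and the `answer` helper are illustrative, not taken from the project.

```python
# Assumed column names and category values; adjust them to the actual data.
QUESTIONS = {
    "least popular category": dict(
        group="category", value="likes_count", how="avg", ascending=True
    ),
    "most popular subcategory": dict(
        group="subcategory", value="likes_count", how="avg"
    ),
    "most expensive accessory subcategory": dict(
        group="subcategory", value="current_price", how="avg",
        where="category = 'accessories'",
    ),
    "most common shoe subcategory": dict(
        group="subcategory", value="name", how="count",
        where="category = 'shoes'",
    ),
    "category with most unused products": dict(
        group="category", value="is_new", how="sum_new"
    ),
}


def answer(df, group, value, how="avg", ascending=False, where=None):
    """Return the first row of the per-group summary, ordered by the metric."""
    from pyspark.sql import functions as F

    if where is not None:
        df = df.filter(where)  # SQL-style filter expression
    if how == "avg":
        metric = F.avg(value)
    elif how == "count":
        metric = F.count(value)
    else:
        # "sum_new": count rows whose flag is true.
        metric = F.sum(F.col(value).cast("boolean").cast("int"))
    summary = df.groupBy(group).agg(metric.alias("metric"))
    return summary.orderBy("metric", ascending=ascending).first()
```

For example, `answer(df, **QUESTIONS["least popular category"])` would return the category with the lowest average likes. The remaining questions can be expressed by adding entries with a different `group`, `value`, or `ascending` setting.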
-
Using a Docker image to set up a PostgreSQL DBMS environment to load the data.
- Create a new database.
- Subsequently, mount a new folder.
- Load your dataframes as 3 different tables in the database.
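
Once the PostgreSQL container is running, the last step can be done from PySpark over JDBC. The sketch below assumes the database is reachable on `localhost:5432`, that the PostgreSQL JDBC driver is available on Spark's classpath, and that the three per-file dataframes are kept in a dictionary keyed by table name. The database name `newchic`, the table names, and both helper functions are hypothetical.

```python
def jdbc_url(host="localhost", port=5432, database="newchic"):
    """Build a PostgreSQL JDBC URL. The defaults are placeholder values."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def write_tables(frames, url, user, password):
    """Write each dataframe in `frames` to the table named by its key."""
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    for table, frame in frames.items():
        # mode="overwrite" replaces the table if it already exists.
        frame.write.jdbc(url=url, table=table, mode="overwrite", properties=properties)
```

A call such as `write_tables({"shoes": shoes_df, "jewelry": jewelry_df, "accessories": accessories_df}, jdbc_url(), user, password)` would create the three tables, where the three dataframe variables and the credentials are whatever was set up earlier in the project.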