This GitHub repository holds the details of our capstone project, Walmart Strategic Sales Forecasting, at Drexel University.
In any supply chain, especially in retail, forecasting is essential for every party involved. For business owners, it provides the ability to make informed decisions and develop data-driven strategies. By using current and historical data, businesses can predict future trends and use those insights to plan resources, adjust business strategy, lower overall operating costs, and increase profits. For consumers, it helps reduce the business's overall costs and gets the products they want onto the shelf. For the business's supply partners, it allows them to be proactive instead of reactive: they can produce the right quantities ahead of time and reduce excess.
In this capstone project for the MS Data Science program at Drexel University, we used machine learning (ML) to solve a time-series forecasting problem: predicting sales at Walmart stores. Walmart is a retail business that sources products from its supply partners and delivers them to consumers. We used Walmart sales data from three U.S. states (California, Texas, and Wisconsin) spanning several years to predict future sales.
- Xi Chen: MS Data Science at Drexel University
- Emily Wang: MS Data Science at Drexel University
- Kriti Bartaria: MS Data Science at Drexel University
- Khiem Nguyen: MS Data Science at Drexel University
We used an existing dataset available on Kaggle.com. It includes item-level details, departments, product categories, and store details for stores in three U.S. states (California, Texas, and Wisconsin). It also includes explanatory variables such as price, promotions, day of the week, and special events. The dataset is organized hierarchically.
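To illustrate what the hierarchical arrangement means in practice, here is a minimal sketch of rolling item-level sales up to coarser levels. The field names (`state`, `store`, `dept`, `item`, `units`) and all values are hypothetical placeholders, not the dataset's actual columns or figures.

```python
from collections import defaultdict

# Hypothetical item-level records; field names and values are assumptions
# made for illustration, not the dataset's real schema.
rows = [
    {"state": "CA", "store": "store_1", "dept": "dept_A", "item": "item_1", "units": 3},
    {"state": "CA", "store": "store_1", "dept": "dept_A", "item": "item_2", "units": 1},
    {"state": "CA", "store": "store_2", "dept": "dept_B", "item": "item_3", "units": 2},
    {"state": "TX", "store": "store_3", "dept": "dept_A", "item": "item_1", "units": 4},
]


def roll_up(records, levels):
    """Sum unit sales over the given hierarchy levels, e.g. ("state", "store")."""
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[level] for level in levels)
        totals[key] += rec["units"]
    return dict(totals)


by_state = roll_up(rows, ("state",))
by_store = roll_up(rows, ("state", "store"))
by_dept = roll_up(rows, ("state", "store", "dept"))
```

Each coarser level is simply the sum of the level beneath it, so forecasts can be produced or checked at whichever level of the hierarchy is most useful.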
Google Trends data is available for download via its web interface, but because we needed to send multiple queries containing different keywords, geographical locations, and timeframes, we needed a more scalable approach. Therefore, we opted to use a Python package called Pytrends, an unofficial API (Application Programming Interface) for Google Trends. It allowed us to automate the process of querying and downloading reports from Google Trends with Python scripts. With the API, we pulled interest-over-time data for our desired timeframes, keywords, and geographic locations.
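As a rough sketch of how such a batch of queries could be organized, the snippet below first builds one query per keyword, location, and timeframe combination, and then passes each to Pytrends. The keywords and timeframes are illustrative assumptions, not the queries used in this project. `fetch_interest` requires network access and the `pytrends` package, so it is defined but not called here.

```python
from itertools import product

# All values below are illustrative assumptions, not the project's actual queries.
KEYWORDS = ["walmart", "grocery"]
GEOS = ["US-CA", "US-TX", "US-WI"]  # Google Trends region codes for the three states
TIMEFRAMES = ["2011-01-01 2013-12-31", "2014-01-01 2016-12-31"]


def build_queries(keywords, geos, timeframes):
    """Return one query spec per (keyword, geo, timeframe) combination."""
    return [
        {"kw_list": [kw], "geo": geo, "timeframe": tf}
        for kw, geo, tf in product(keywords, geos, timeframes)
    ]


def fetch_interest(queries):
    """Run each query through Pytrends (needs network and `pip install pytrends`)."""
    # Imported lazily so that build_queries can be used offline.
    from pytrends.request import TrendReq

    client = TrendReq(hl="en-US", tz=360)
    frames = {}
    for q in queries:
        client.build_payload(q["kw_list"], timeframe=q["timeframe"], geo=q["geo"])
        key = (q["kw_list"][0], q["geo"], q["timeframe"])
        frames[key] = client.interest_over_time()  # pandas DataFrame
    return frames


queries = build_queries(KEYWORDS, GEOS, TIMEFRAMES)
```

Keeping the query plan separate from the network calls makes it easy to inspect or resume a batch if Google Trends rate-limits a request partway through.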
For this project, we manually downloaded seasonally adjusted unemployment data covering January 2011 to December 2016 from the U.S. Bureau of Labor Statistics site. Seasonal adjustment means that periodic fluctuations associated with events such as weather, holidays, and the opening and closing of schools have been accounted for in the unemployment rate.
The Consumer Price Index (CPI) data, covering January 2011 to December 2016, was also manually downloaded from the U.S. Bureau of Labor Statistics site.
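One possible way to attach indicators such as the unemployment rate or CPI to daily sales records is to repeat each period's value for every day in that period. The sketch below assumes the indicator arrives as one value per month; the values shown are placeholders, not real BLS figures, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Placeholder monthly values keyed by (year, month); not real BLS figures.
monthly_rate = {(2011, 1): 1.0, (2011, 2): 2.0}


def daily_dates(start, end):
    """Yield every calendar date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def expand_to_daily(monthly, start, end):
    """Repeat each month's value for every day in it; None where a month is missing."""
    return {d: monthly.get((d.year, d.month)) for d in daily_dates(start, end)}


daily_rate = expand_to_daily(monthly_rate, date(2011, 1, 30), date(2011, 2, 2))
```

The resulting dictionary can then be joined to daily sales on the date key.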
Gas price data is available for download from the Kaggle project website. It consists of a single CSV file containing the prices of multiple gasoline types.
- Linear Regression
- Long short-term memory (LSTM)
- Autoregressive Integrated Moving Average (ARIMA)
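For intuition about the simplest of the models listed above, the following is a minimal sketch of a linear regression forecast that predicts each day's sales from the previous day's. It is not the model built in this repository: the single lag feature, the function names, and the toy data are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def lag_pairs(series, lag):
    """Pair each value with the value observed `lag` steps earlier."""
    return series[:-lag], series[lag:]


# Toy daily sales that grow by 2 units per day (made-up data).
sales = [10, 12, 14, 16, 18, 20, 22, 24]
lagged, target = lag_pairs(sales, lag=1)
intercept, slope = fit_line(lagged, target)

# One-step-ahead forecast from the last observed value.
next_day = intercept + slope * sales[-1]
```

A real forecasting setup would add more features (for example price, events, or the external indicators described above) and evaluate on held-out dates.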
- analysis: Python notebooks for the exploratory data analysis
- data-processing: Python code to process the data
- data: raw and processed data
- models: machine learning models for sales prediction
- notebooks: additional Python notebooks related to the project