Shaw Scraper

A web scraper that collects movie and seat-buying data from Shaw Theatres' website to understand movie-goers' behavioural patterns. The data was used to build interesting visualisations, which can be found on the PopcornData website. More details on how we obtained and cleaned the data can be found in this Medium article.

Data Collected

Raw Data

The complete raw data collected can be found here.

Cleaned Data

The processed data can be found here.

Built With

  • Python
  • Selenium
  • MongoDB Atlas
  • Heroku

Getting Started

The scraper was built to run on Heroku, and the following instructions describe how to deploy it there.

Prerequisites

  • Heroku

    • Account - Create a free account on Heroku
    • Heroku CLI - Follow these instructions to download and install the Heroku CLI
  • MongoDB Atlas account

    • Create a free MongoDB Atlas account
    • Create a database in MongoDB named "shaw_data" and a collection inside it called "movie_data". You can use different names for the database and collection, but you must update the Shaw_scraper.py file accordingly (see the connection sketch after this list).
    • Add 0.0.0.0/0 (i.e. all addresses) to your MongoDB Atlas IP whitelist
    • Get the database connection string, which has the format:
    mongodb://[username:password@]host1[:port1][,...hostN[:portN]][/[defaultauthdb][?options]]
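
A minimal sketch of how the scraper's database connection works, assuming pymongo and the MONGODB_URL config variable set during installation below (not necessarily the exact code in Shaw_scraper.py); the database and collection names are the defaults from this list:

import os
from pymongo import MongoClient

# Read the connection string from the environment (set in step 6 of the installation)
client = MongoClient(os.environ["MONGODB_URL"])

# Database and collection created above; if you chose different names,
# change them here and in Shaw_scraper.py
collection = client["shaw_data"]["movie_data"]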
    

Installation

  1. Clone the repo and navigate into the cloned folder
git clone https://github.com/PopcornData/shaw-scraper.git
cd shaw-scraper
  2. Log in to Heroku from the CLI
heroku login
  3. Create a new project on Heroku
heroku create <project-name>
  4. Add the remote
heroku git:remote -a <project-name>
  5. Add the buildpacks needed to run Selenium with headless Chrome
heroku buildpacks:add --index 1 https://github.com/heroku/heroku-buildpack-python.git

heroku buildpacks:add --index 2 https://github.com/heroku/heroku-buildpack-chromedriver

heroku buildpacks:add --index 3 https://github.com/heroku/heroku-buildpack-google-chrome
  6. Set the required environment variables in the Heroku configuration (a sketch showing how the Chrome variables are consumed follows this list)
heroku config:set GOOGLE_CHROME_BIN=/app/.apt/usr/bin/google-chrome

heroku config:set CHROMEDRIVER_PATH=/app/.chromedriver/bin/chromedriver

heroku config:set MONGODB_URL=<your-MongoDB-connection-string>
  7. Deploy to Heroku (make sure you are inside the cloned folder before deploying)
git push heroku master
  8. Start the scraper by scaling up the clock process declared in the repo's Procfile
heroku ps:scale clock=1
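
For reference, the two Chrome variables from step 6 are consumed when Selenium is pointed at the buildpack-installed binaries. A minimal sketch, assuming the Selenium 3 API of the project's era (not necessarily the exact code in Shaw_scraper.py):

import os
from selenium import webdriver

# Point Selenium at the Chrome binary and chromedriver installed by the buildpacks
options = webdriver.ChromeOptions()
options.binary_location = os.environ["GOOGLE_CHROME_BIN"]
options.add_argument("--headless")
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")

driver = webdriver.Chrome(
    executable_path=os.environ["CHROMEDRIVER_PATH"],
    options=options,
)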

Usage

The scraper has two functions, which run separately:

  1. get_movie_data() - Scrapes the movie details from all theatres for the given day and stores the JSON data in the database. The data has the following format:
{
 "theatre":"Nex",
 "hall":"nex Hall 5",
 "movie":"Jumanji: The Next Level",
 "date":"18 Jan 2020",
 "time":"1:00 PM+",
 "session_code":"P00000000000000000200104"
}
  2. get_seat_data() - Scrapes the seat details for each movie session, including which seats were bought and the time at which they were bought. It scrapes data from the previous day so that all the seat data (ticket sales) is final. It must be run only after get_movie_data(), as it updates the JSON documents in the database by adding the seat data to them (a sketch of the corresponding database writes follows the example). The updated data has the following format:
 {
     "theatre":"Nex",
     "hall":"nex Hall 5",
     "movie":"Jumanji: The Next Level",
     "date":"18 Jan 2020",
     "time":"1:00 PM+",
     "session_code":"P00000000000000000200104"
     "seats":[
         {   
           "seat_status":"AV",
           "last_update_time":"2020-01-20 14:34:53.704117",
           "seat_buy_time":"1900-01-01T00:00:00",
           "seat_number":"I15",
           "seat_sold_by":""
         },
         ...,
         {  
           "seat_status":"SO",
           "last_update_time":"2020-01-20 14:34:53.705116",
           "seat_buy_time":"2020-01-18T13:12:34.193",
           "seat_number":"F6",
           "seat_sold_by":""
         }
      ]
 }
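
A rough illustration of the two database writes, assuming pymongo and the collection handle from the prerequisites sketch; the actual logic lives in Shaw_scraper.py, and seat_list stands in for the scraped seat data:

# get_movie_data(): one document is inserted per session
collection.insert_one({
    "theatre": "Nex",
    "hall": "nex Hall 5",
    "movie": "Jumanji: The Next Level",
    "date": "18 Jan 2020",
    "time": "1:00 PM+",
    "session_code": "P00000000000000000200104",
})

# get_seat_data(): the matching document is updated in place,
# keyed on its session code, by attaching the scraped seat list
collection.update_one(
    {"session_code": "P00000000000000000200104"},
    {"$set": {"seats": seat_list}},
)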

A full sample updated document in the database can be viewed here.

The functions are scheduled to run daily at the times specified in clock.py; the timings and frequency of the scraper can be changed by editing that file. A rough sketch of such a clock process is shown below.
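
Heroku clock processes are commonly implemented with APScheduler; a minimal sketch of what clock.py might look like (the job names and times here are illustrative, not the repo's actual schedule):

from apscheduler.schedulers.blocking import BlockingScheduler
from Shaw_scraper import get_movie_data, get_seat_data

sched = BlockingScheduler()

# Illustrative timings only -- edit hour/minute to change when each job runs
@sched.scheduled_job("cron", hour=9)
def movie_job():
    get_movie_data()

# Seat data covers the previous day, so it runs after get_movie_data()
@sched.scheduled_job("cron", hour=11)
def seat_job():
    get_seat_data()

sched.start()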

License

Distributed under the MIT License. See LICENSE for more information.

Disclaimer

This scraper was made as a project to analyse cinema seating patterns. We are in no way affiliated with Shaw Theatres and are not responsible for the accuracy of the data scraped using this scraper. The scraper was developed to scrape data from the website in January 2020 and was functional as of June 2020; it may no longer work as expected, as the structure of the website may have changed since then.