A graphical Python application that automatically grabs vehicle data (e.g. price, specifications and features) from the Australian automotive website http://redbook.com.au. It can also grab vehicle images and page screenshots, and export the collected data to an Excel file.
Screenshot 1: https://raw.github.com/nickdademo/redbook-data-grabber/master/Screenshot1.png
This application was a personal project of mine, and these were some of my motivations for creating it:
- To learn to create graphical applications using PyQt.
- To learn more about automated web browsing and data grabbing, in particular, using multiple threads to speed up the process.
- To see if it was possible and to show others how!
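As a rough illustration of the multi-threading idea mentioned above, here is a minimal sketch that spreads page grabs over a pool of worker threads. It is not the application's own code: `grab_vehicle`, `grab_all` and the example URLs are hypothetical, and the worker is a placeholder that runs offline instead of driving a browser.

```python
from concurrent.futures import ThreadPoolExecutor


def grab_vehicle(url):
    """Hypothetical stand-in for the per-page work (load page, extract data)."""
    # A real worker would drive a browser here; this placeholder only
    # derives a record from the URL so the sketch runs offline.
    return {"url": url, "id": url.rstrip("/").rsplit("/", 1)[-1]}


def grab_all(urls, max_workers=4):
    """Run grab_vehicle over urls on a pool of worker threads.

    executor.map yields results in the same order as the input URLs,
    even though the pages are processed concurrently.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(grab_vehicle, urls))


if __name__ == "__main__":
    # Placeholder URLs, not real RedBook pages.
    pages = ["http://example.com/vehicle/%d" % n for n in range(1, 6)]
    for record in grab_all(pages):
        print(record)
```

Threads suit this kind of work because each grab spends most of its time waiting on the network, so several pages can be in flight at once.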
This application makes use of the following:
- PhantomJS, a headless browser: http://phantomjs.org/
- Selenium, a web automation framework: http://docs.seleniumhq.org/
- PyQt, a Python binding of the cross-platform GUI toolkit Qt: http://www.riverbankcomputing.com/software/pyqt/
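To give an idea of how these pieces can fit together, the sketch below loads a page in PhantomJS through Selenium and reads its title with BeautifulSoup. It is only an assumption about typical usage, not the application's actual code: `phantomjs_path`, `fetch_page_title` and the `page.png` file name are made up for illustration.

```python
import os
import sys


def phantomjs_path(script_dir, platform=sys.platform):
    """Return the expected location of the PhantomJS binary beside the script."""
    # On Windows the downloaded executable is named phantomjs.exe.
    name = "phantomjs.exe" if platform.startswith("win") else "phantomjs"
    return os.path.join(script_dir, name)


def fetch_page_title(url, script_dir="."):
    """Load url in headless PhantomJS and return the page's <title> text.

    Imports are local so the helper above can be used without the
    third-party packages installed.
    """
    from bs4 import BeautifulSoup
    from selenium import webdriver

    # webdriver.PhantomJS is available in Selenium 2.x (e.g. 2.44.0);
    # it was removed in Selenium 4.
    driver = webdriver.PhantomJS(executable_path=phantomjs_path(script_dir))
    try:
        driver.get(url)
        driver.save_screenshot("page.png")  # optional page screenshot
        soup = BeautifulSoup(driver.page_source, "html5lib")
        return soup.title.string if soup.title else None
    finally:
        driver.quit()  # always shut the headless browser down
```

Calling `fetch_page_title("http://example.com")` needs Selenium, BeautifulSoup, html5lib and a PhantomJS binary in the given folder; `phantomjs_path` on its own has no such requirements.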
The application can be run on both Windows and Linux.
This application has been created and published for EDUCATIONAL purposes only!
Before using this program, you must first agree to RedBook.com.au's Terms & Conditions: http://www.redbook.com.au/help/terms-conditions
According to #4: "Use of this website is for your personal and non-commercial use only. Except for the material held in your computer’s cache or a single permanent copy of the material for your personal use, you must not: ..."
Therefore, you must only use this application for PERSONAL USE! I am not responsible for any misuse of this application.
The following procedure (Linux) has been tested with:
Ubuntu 14.04 LTS (64-bit)
Python 3.4.0
Selenium 2.44.0
PhantomJS 1.9.7
PyQt4 4.10.4
BeautifulSoup4 4.3.2
html5lib 0.999
XlsxWriter 0.6.6
- Install pip:
  $ sudo apt-get install python3-pip
- Install Selenium Python bindings:
  $ sudo pip3 install selenium
- Install BeautifulSoup:
  $ sudo pip3 install beautifulsoup4
- Install html5lib (BeautifulSoup parser):
  $ sudo pip3 install html5lib
- Install PyQt4:
  $ sudo apt-get install python3-pyqt4
- Install XlsxWriter:
  $ sudo pip3 install XlsxWriter
- Download the latest version of PhantomJS from http://phantomjs.org/. Place the phantomjs binary executable in the same folder as the rbdg.py script.
- Run application:
  $ python3 rbdg.py
The following procedure (Windows) has been tested with:
Windows 8.1 Professional (64-bit)
Python 3.4.2
Selenium 2.44.0
PhantomJS 2.0.0
PyQt4 4.11.3 for Py3.4 (x64) (Qt 5.3.2)
BeautifulSoup4 4.3.2
html5lib 0.999
XlsxWriter 0.6.6
- Install Selenium Python bindings:
  $ pip install selenium
- Install BeautifulSoup:
  $ pip install beautifulsoup4
- Install html5lib (BeautifulSoup parser):
  $ pip install html5lib
- Download and install PyQt4 from http://www.riverbankcomputing.com/software/pyqt/download.
- Install XlsxWriter:
  $ pip install XlsxWriter
- Download the latest version of PhantomJS from http://phantomjs.org/. Place the phantomjs binary executable in the same folder as the rbdg.py script.
- Run application:
  $ python rbdg.py
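If the application fails to start after either procedure, a quick check of the installed pieces may help. The following is a small sketch, not part of the project: `missing_modules` and `has_phantomjs` are hypothetical helpers, and the script assumes it is run from the folder that contains rbdg.py.

```python
import importlib.util
import os

# Import names of the packages installed in the steps above.
REQUIRED_MODULES = ["selenium", "bs4", "html5lib", "PyQt4", "xlsxwriter"]


def missing_modules(names):
    """Return the module names from names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def has_phantomjs(folder):
    """Return True if a phantomjs binary sits directly in folder."""
    return any(
        os.path.isfile(os.path.join(folder, name))
        for name in ("phantomjs", "phantomjs.exe")
    )


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
    # Assumes the current working directory is the application folder.
    print("PhantomJS found:", has_phantomjs(os.getcwd()))
```

Note that the import names differ from some package names: beautifulsoup4 is imported as `bs4` and XlsxWriter as `xlsxwriter`.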