This project provides two Scrapy spiders for scraping movie data from IMDb and TMDb: a basic scraper and an advanced scraper with additional capabilities for concurrent and customizable scraping.
.
├── imdbscrapper
│ ├── spiders
│ │ ├── __init__.py
│ │ ├── advance_scrapper.py
│ │ ├── basic_scrapper.py
│ ├── __init__.py
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ └── settings.py
│
├── LICENSE.md
├── README.md
├── requirements.txt
├── scrapy.cfg
└── setup.py
- IMDb Search Pages: Scrapes movie details from IMDb search pages.
- TMDb Integration: Fetches additional data, such as posters and ratings, using the TMDb API.
- Pagination: Supports pagination through the IMDb search "Show More" option.
- Output: Saves scraped data as JSON and CSV files.
- Enhanced Features: Includes all functionalities of the basic scraper.
- Multi-threaded Scraping: Utilizes concurrent scraping for faster data collection.
- Robust Error Handling: Implements improved retry and error management.
- Data Enrichment: Collects extended movie metadata and applies data cleaning.
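The TMDb integration listed above can be sketched roughly as follows. This is a minimal illustration, not the code in the spiders: the helper names `build_search_params` and `extract_enrichment` and the output field names are hypothetical. It assumes the TMDb v3 movie search endpoint and the `w500` poster size.

```python
from typing import Optional

# TMDb v3 search endpoint and image base URL (w500 is one of several poster sizes).
TMDB_SEARCH_URL = "https://api.themoviedb.org/3/search/movie"
TMDB_IMAGE_BASE = "https://image.tmdb.org/t/p/w500"


def build_search_params(title: str, year: int, api_key: str) -> dict:
    """Build query parameters for a TMDb movie search (hypothetical helper)."""
    return {"api_key": api_key, "query": title, "year": year}


def extract_enrichment(payload: dict) -> Optional[dict]:
    """Pick the poster URL and rating from the first search result, if any."""
    results = payload.get("results") or []
    if not results:
        return None
    first = results[0]
    poster_path = first.get("poster_path")
    return {
        "tmdb_id": first.get("id"),
        "poster_url": f"{TMDB_IMAGE_BASE}{poster_path}" if poster_path else None,
        "tmdb_rating": first.get("vote_average"),
    }


# In a live run the payload would come from something like:
#   requests.get(TMDB_SEARCH_URL, params=build_search_params(...), timeout=10).json()
# Here a hand-written payload keeps the example offline.
sample = {"results": [{"id": 603, "poster_path": "/abc.jpg", "vote_average": 8.2}]}
enriched = extract_enrichment(sample)
print(enriched)
```

Keeping the parsing separate from the HTTP call makes the enrichment step easy to test without an API key.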
- Python 3.10+
- Scrapy
- Selenium (for JavaScript-heavy pages)
- Requests (for TMDb API requests)
- concurrent.futures (Python standard library, for parallel scraping)
Install dependencies with:
pip install -r requirements.txt
- Clone the repository:
  git clone https://github.com/WhoIsJayD/IMDB-Scrapper
  cd IMDB-Scrapper
- Set up the API key: Add your TMDb API key in each spider file or pass it as an argument.
- Set up Selenium (for advanced scraping):
  - Download the ChromeDriver version compatible with your Chrome version.
  - Add ChromeDriver to your system's PATH.
Run the basic scraper with:
scrapy crawl basic_scrapper
Run the advanced scraper with custom parameters:
scrapy crawl advance_scrapper -a tmdb_api_key="your_tmdb_api_key" -a start_year=2000 -a end_year=2023 -a num_instances=5
- start_year: Start year for the movie range.
- end_year: End year for the movie range.
- num_instances: Number of concurrent scraping instances.
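One way the year range could be divided among concurrent instances is sketched below. Scrapy passes `-a` arguments to the spider as strings, so the sketch converts them to integers first. The functions `split_year_range` and `scrape_years` are hypothetical and stand in for whatever the advanced spider actually does. The worker here only returns the years it was given instead of crawling anything.

```python
from concurrent.futures import ThreadPoolExecutor


def split_year_range(start_year, end_year, num_instances):
    """Split [start_year, end_year] into near-equal contiguous chunks."""
    start, end, n = int(start_year), int(end_year), int(num_instances)
    if end < start or n < 1:
        raise ValueError("need start_year <= end_year and num_instances >= 1")
    total = end - start + 1
    n = min(n, total)  # never create more chunks than there are years
    base, extra = divmod(total, n)
    chunks, cursor = [], start
    for i in range(n):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more year
        chunks.append((cursor, cursor + size - 1))
        cursor += size
    return chunks


def scrape_years(year_range):
    """Placeholder worker: a real one would crawl IMDb for these years."""
    first, last = year_range
    return list(range(first, last + 1))


# Spider arguments arrive as strings, e.g. from `-a start_year=2000`.
chunks = split_year_range("2000", "2023", "5")
with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
    results = list(pool.map(scrape_years, chunks))
print(chunks)
```

Contiguous chunks keep each instance's work easy to log and to re-run if one range fails.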
The scrapers produce the following files:
- movies.json: Contains movie data in JSON format.
- movies.csv: Contains movie data in CSV format.
- Modify custom_settings in each spider to configure scraping behavior.
- Adjust the clean_movie_data method in advance_scrapper.py to customize data cleaning.
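As an illustration of the kind of logic a cleaning step might contain, here is a standalone sketch. It is not the project's `clean_movie_data` method: the field names (`title`, `year`, `rating`) and the private helpers are assumptions made for the example.

```python
import re
from typing import Any, Optional


def _first_year(value: Any) -> Optional[int]:
    """Return the first four-digit number found in the value, or None."""
    if value is None:
        return None
    match = re.search(r"\d{4}", str(value))
    return int(match.group()) if match else None


def _to_float(value: Any) -> Optional[float]:
    """Convert a value to float, returning None when that is not possible."""
    try:
        return float(str(value).strip())
    except (TypeError, ValueError):
        return None


def clean_movie_data(movie: dict) -> dict:
    """Normalize whitespace and coerce numeric fields (hypothetical field names)."""
    cleaned = dict(movie)  # copy so the caller's dict is left untouched
    title = cleaned.get("title")
    if isinstance(title, str):
        cleaned["title"] = " ".join(title.split())
    cleaned["year"] = _first_year(cleaned.get("year"))
    cleaned["rating"] = _to_float(cleaned.get("rating"))
    return cleaned


raw = {"title": "  The   Matrix ", "year": "(1999)", "rating": " 8.7 "}
cleaned = clean_movie_data(raw)
print(cleaned)
```

Returning `None` for unparseable values, instead of raising, lets one malformed record pass through without stopping the crawl.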
- Legal Compliance: Ensure your usage complies with IMDb and TMDb terms of service.
- Rate Limiting: To avoid being blocked, set an appropriate download delay between requests and limit concurrency.
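A conservative set of throttling options for a spider's `custom_settings` might look like the sketch below. The setting names are standard Scrapy settings, but the values are illustrative assumptions and not the ones shipped with this project. The dict is named `CUSTOM_SETTINGS` only so the example runs without Scrapy installed.

```python
# Illustrative values only; tune them to the target site's tolerance.
CUSTOM_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # respect robots.txt rules
    "DOWNLOAD_DELAY": 1.0,                # seconds between requests to the same site
    "RANDOMIZE_DOWNLOAD_DELAY": True,     # jitter the delay (0.5x to 1.5x)
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,  # keep per-domain parallelism low
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to observed latency
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 3,                     # retries in addition to the first attempt
}

# Inside a spider class this would be assigned as:
#   custom_settings = CUSTOM_SETTINGS
for name, value in CUSTOM_SETTINGS.items():
    print(f"{name} = {value}")
```

With AutoThrottle enabled, `DOWNLOAD_DELAY` acts as the lower bound for the adaptive delay, so the two settings work together.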
This project is licensed under the MIT License. See the LICENSE.md file for details.
For any inquiries, please reach out:
- Name: Jaydeep Solanki
- Email: [email protected]
- LinkedIn: LinkedIn Profile
Special thanks to:
- IMDb for the movie data.
- TMDb for their API resources.
- The Scrapy and Selenium communities for their robust tools and documentation.