A web scraper built using the Scrapy framework in Python.
- The items file contains the containers that will be loaded with scraped data. They behave just like normal Python dicts. Modify them according to the data you need to scrape from a web page.
- Spiders are the classes used to scrape information from a domain (or a group of domains).
They define an initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract items.
Note: Remember to set DOWNLOAD_DELAY to at least 2 seconds, or your IP could be blacklisted.
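The delay is configured in the project's settings.py, for example:

```python
# settings.py (fragment)
# Wait at least 2 seconds between requests to the same site,
# to avoid hammering the server and getting your IP blacklisted.
DOWNLOAD_DELAY = 2
```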
How To Run: Go to the project's top-level directory and type in a terminal: scrapy crawl <spider_name>
Replace <spider_name> with the `name` attribute defined in your spider class.
Dependencies: Before running this project, or any Scrapy project, you must have the required packages installed (at minimum, Scrapy itself).
To Do: Use MongoDB for storing scraped data, and use Stack_spider for nested crawling.
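One way to approach the MongoDB to-do is a Scrapy item pipeline. The sketch below follows the usual pipeline interface; the URI, database, and collection names are placeholder assumptions, and it presumes a MongoDB instance is running:

```python
class MongoPipeline:
    """Sketch of a Scrapy item pipeline that stores items in MongoDB."""

    collection_name = "items"  # hypothetical collection name

    def __init__(self, mongo_uri="mongodb://localhost:27017", mongo_db="scrapy_db"):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection details from settings.py (MONGO_URI / MONGO_DATABASE
        # are assumed setting names), falling back to the defaults above.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "scrapy_db"),
        )

    def open_spider(self, spider):
        # Imported lazily so this sketch stays importable without pymongo installed.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Insert each scraped item as a plain dict.
        self.db[self.collection_name].insert_one(dict(item))
        return item
```

To enable it, the pipeline would also need to be registered under ITEM_PIPELINES in settings.py.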
Reference: https://realpython.com/blog/python/web-scraping-with-scrapy-and-mongodb/