A web scraper built using the Scrapy framework in Python.
- The items file contains the containers that will be loaded with scraped data. They behave just like normal Python dicts. Modify them according to the data you need to scrape from a web page.
- Spiders are the classes used to scrape information from a domain (or a group of domains).
They define an initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract items.
Note: Remember to set DOWNLOAD_DELAY to at least 2 seconds, or your IP could be blacklisted.
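The delay is configured in the project's settings.py, for example:

```python
# settings.py (fragment)
# Wait at least 2 seconds between requests to the same site,
# to avoid hammering the server and getting your IP blacklisted.
DOWNLOAD_DELAY = 2
```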
How To Run: Go to the project's top-level directory and type in a terminal: scrapy crawl <spider_name>
Replace <spider_name> with the `name` attribute defined in your spider class.
Dependencies: Before running this project, or any Scrapy project, you must have the required packages installed (at minimum, Scrapy itself).
To Do: Use MongoDB for storing scraped data, and use Stack_spider for nested crawling.
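One way to approach the MongoDB to-do is a Scrapy item pipeline. The sketch below follows the usual pipeline interface; the URI, database, and collection names are placeholder assumptions, and it presumes a MongoDB instance is running:

```python
class MongoPipeline:
    """Sketch of a Scrapy item pipeline that stores items in MongoDB."""

    collection_name = "items"  # hypothetical collection name

    def __init__(self, mongo_uri="mongodb://localhost:27017", mongo_db="scrapy_db"):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection details from settings.py (MONGO_URI / MONGO_DATABASE
        # are assumed setting names), falling back to the defaults above.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "scrapy_db"),
        )

    def open_spider(self, spider):
        # Imported lazily so this sketch stays importable without pymongo installed.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Insert each scraped item as a plain dict.
        self.db[self.collection_name].insert_one(dict(item))
        return item
```

To enable it, the pipeline would also need to be registered under ITEM_PIPELINES in settings.py.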
Reference: https://realpython.com/blog/python/web-scraping-with-scrapy-and-mongodb/