A web crawler written in Python to crawl a given website.
- Fast, multi-threaded crawling
- Ability to specify the number of threads used to crawl the given website
- Ability to use proxies to bypass IP restrictions
- Clear summary of all the URLs that were crawled. View the crawled.txt file to see the complete list of crawled links
- Ability to specify a delay between each HTTP request (see the request sketch after this list)
- Stop and resume the crawler whenever you need
- Gather all the URLs with their titles into a CSV, in case you are planning to create a search engine
- Search for specific text throughout the website
- Clear statistics about how many links ended up as files, timeout errors, or connection errors
- Crawl as deep as you need. You can specify up to what depth the crawler should crawl.
- Random browser user agents will be used while crawling.
- Gather AWS buckets, emails, phone numbers, etc. (see the regex sketch after this list)
- Download all images
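A minimal sketch of how the per-request behavior described above (random user agent, optional proxy, configurable delay) could be wired together with the Requests library. The user-agent list, proxy address, and `fetch` function are illustrative assumptions, not this project's actual API:

```python
import random
import time

import requests

# Illustrative values -- not this project's actual configuration.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]
PROXIES = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}

def fetch(url, delay=1.0, use_proxy=False):
    """Fetch a URL with a random browser user agent, an optional
    proxy, and a configurable delay before the request."""
    time.sleep(delay)  # delay between HTTP requests
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(
        url,
        headers=headers,
        proxies=PROXIES if use_proxy else None,
        timeout=10,
    )
```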
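And a rough sketch of how emails, phone numbers, and AWS S3 bucket URLs might be gathered from a page's text with regular expressions. These patterns are deliberately simple illustrations and will not match every real-world format:

```python
import re

# Deliberately simple patterns for illustration; real-world formats vary.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
S3_BUCKET_RE = re.compile(r"https?://[\w.-]+\.s3\.amazonaws\.com[^\s\"']*")

def gather(text):
    """Extract emails, phone numbers, and S3 bucket URLs from page text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "buckets": S3_BUCKET_RE.findall(text),
    }
```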
This tool uses a number of open source projects to work properly:
- BeautifulSoup - used to parse the HTML response of each request
- Requests - used to make GET requests to the URLs
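As a rough illustration of how these two libraries typically work together (and how the URL-plus-title CSV mentioned above could be produced), here is a minimal sketch; the `crawl_page` function and the urls.csv file name are assumptions for illustration, not this project's actual code:

```python
import csv

import requests
from bs4 import BeautifulSoup

def crawl_page(url, writer):
    """Fetch a page with Requests, parse it with BeautifulSoup,
    record its URL and title, and return the links it contains."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    writer.writerow([url, title])
    # Collect every absolute link on the page for the crawl queue.
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("http")]

with open("urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title"])
    links = crawl_page("https://example.com", writer)
```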
If you would like to see the list of supported features, simply run