Python Web Crawler

A multithreaded web crawler, written in Python, for crawling a given website.

Features!

  • Fast, multithreaded crawling
  • Ability to specify the number of threads used to crawl the given website (a minimal sketch of how these pieces fit together follows this list)
  • Ability to use proxies to bypass IP restrictions
  • Clear summary of all the URLs that were crawled; the complete list of crawled links is written to the crawled.txt file
  • Ability to specify a delay between each HTTP request
  • Stop and resume the crawler whenever you need
  • Gather all the URLs with their titles into a CSV file, in case you are planning to build a search engine
  • Search for specific text throughout the website
  • Clear statistics about how many links ended up as files, timeout errors, or connection errors
  • Crawl only as deep as you need: you can specify up to what level the crawler should crawl
  • Random browser user agents are used while crawling
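The repository does not reproduce its internals here, so the following is only a minimal sketch of how threads, a per-request delay, a depth limit, and random user agents can fit together. The constants (NUM_THREADS, DELAY, MAX_DEPTH) and the worker structure are illustrative assumptions, not PyWebCrawler's actual code:

```python
import queue
import random
import threading
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

NUM_THREADS = 4   # number of crawler threads (illustrative)
DELAY = 1.0       # seconds to wait before each HTTP request
MAX_DEPTH = 3     # how many levels deep to crawl
USER_AGENTS = [   # pool of browser user agents to pick from at random
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

frontier = queue.Queue()   # (url, depth) pairs waiting to be crawled
seen = set()               # URLs already queued, guarded by a lock
seen_lock = threading.Lock()

def worker():
    while True:
        try:
            url, depth = frontier.get(timeout=5)
        except queue.Empty:
            return  # no work left; let the thread exit
        time.sleep(DELAY)  # honour the configured delay between requests
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        try:
            response = requests.get(url, headers=headers, timeout=10)
        except requests.RequestException:
            frontier.task_done()
            continue
        if depth < MAX_DEPTH:  # only go deeper while under the limit
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                with seen_lock:
                    if link not in seen:
                        seen.add(link)
                        frontier.put((link, depth + 1))
        frontier.task_done()

frontier.put(("https://example.com", 0))
threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```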

Upcoming Features!

  • Gather AWS buckets, emails, phone numbers, etc.
  • Download all images

Dependencies

This tool uses a number of open source projects to work properly:

  • BeautifulSoup - parses the HTML response of each request made
  • Requests - makes the GET requests to the URLs
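As a small illustration of how the two dependencies cooperate (the target URL is a placeholder), Requests fetches a page and BeautifulSoup extracts its title and links:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url, timeout=10)            # Requests: GET the page
soup = BeautifulSoup(response.text, "html.parser")  # BeautifulSoup: parse it

title = soup.title.get_text() if soup.title else ""  # page title, if any
links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
print(title, len(links))
```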

Usage

If you would like to see the full list of supported features, see the usage demo:

(screenshot: Usage Demo)

Crawling only to a depth of 3 levels

(screenshot: Depth Crawl)
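The exact command is shown in the screenshot; as a language-level illustration only (not the project's CLI), a depth-limited breadth-first crawl might look like this:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

MAX_DEPTH = 3
start = "https://example.com"
frontier = [(start, 0)]   # (url, depth) pairs; the start page is depth 0
seen = {start}

while frontier:
    url, depth = frontier.pop(0)
    try:
        page = requests.get(url, timeout=10)
    except requests.RequestException:
        continue
    if depth >= MAX_DEPTH:
        continue          # reached the requested level; do not go deeper
    soup = BeautifulSoup(page.text, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if link not in seen:
            seen.add(link)
            frontier.append((link, depth + 1))
```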

Search for specific text throughout the website

(screenshot: Text Search)
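As an illustrative sketch (the function name and URL are assumptions, not the project's API), searching a page's visible text can be done with BeautifulSoup's get_text():

```python
import requests
from bs4 import BeautifulSoup

def page_contains(url, term):
    """Return True if the page's visible text contains the search term."""
    response = requests.get(url, timeout=10)
    text = BeautifulSoup(response.text, "html.parser").get_text()
    return term.lower() in text.lower()

print(page_contains("https://example.com", "domain"))
```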

Gather all the links, along with their titles, into a CSV file; the file is created after the crawl completes

(screenshot: Gather Titles)
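A sketch of the CSV output step using only Python's standard csv module; the results list and the output filename are stand-ins for whatever the crawler collected:

```python
import csv

results = [("https://example.com", "Example Domain")]  # (url, title) pairs

with open("crawled.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["URL", "Title"])  # header row
    writer.writerows(results)          # one row per crawled page
```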

Use proxies to crawl the site.

(screenshot: Use Proxies)
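Requests supports per-scheme proxies natively; as a sketch (the proxy address is a placeholder, not one shipped with the project), passing a proxies mapping routes every request through the proxy:

```python
import requests

proxies = {
    "http": "http://10.10.1.10:3128",   # placeholder proxy address
    "https": "http://10.10.1.10:3128",
}
response = requests.get("https://example.com", proxies=proxies, timeout=10)
print(response.status_code)
```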
