redfin-scraper

redfin-scraper is a proxy-based scraper that extracts properties from Redfin with filters. It is especially useful when you want to crawl all recently sold properties (e.g., properties sold in the past 3 years) in a given state or city.

Scraping Algorithm

Please refer to algorithm_sketch.md.

Prerequisites

SQLite is installed. macOS ships with it, so Mac users do not need to install anything.
Python 3.6 is installed.
You have a CSV file of proxies. You can buy proxies online or use a free tool such as proxybroker. The repo assumes proxies with user and password authentication. If your proxies do not need authentication, the CSV file can look like

ip,port
a.b.c.d,2345
e.f.g.h,1234
...

Otherwise, the CSV file should also include the credentials:

ip,port,user,password
a.b.c.d,2345,user1,pass1
e.f.g.h,1234,user2,pass2
...
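
As an illustration of how the two layouts above could be consumed, here is a minimal sketch that reads either CSV variant and builds one proxy URL per row. The function name `load_proxies` and the `http://` scheme are assumptions made for this example; they are not taken from the repo, and the loader in redfin_crawler.py may work differently.

```python
import csv
import io
from urllib.parse import quote


def load_proxies(csv_file):
    """Read proxy rows from an open CSV file and return proxy URLs.

    Hypothetical helper for illustration; not part of redfin-scraper.
    """
    proxies = []
    for row in csv.DictReader(csv_file):
        host = f"{row['ip']}:{row['port']}"
        # The user/password columns are absent in the no-auth layout.
        user = row.get("user")
        password = row.get("password")
        if user and password:
            # Percent-encode credentials so special characters stay valid in a URL.
            proxies.append(f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}")
        else:
            proxies.append(f"http://{host}")
    return proxies


# Example with an in-memory file; the address below is a placeholder.
sample = io.StringIO("ip,port,user,password\n10.0.0.1,2345,user1,pass1\n")
print(load_proxies(sample))
```

With a real file, the same function could be called as `load_proxies(open("proxy.csv", newline=""))`.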

Environment Setup

First create a Python virtual environment with python3.

python3.6 -m venv /path/to/venv

Activate the venv and install the dependencies.

source /path/to/venv/bin/activate

pip install -r requirements.txt

How to use

Once you have all the prerequisites ready and the Python environment set up, you can scrape Redfin data based on your needs. The examples below demonstrate redfin-scraper usage by scraping a small city called Belmont (https://www.redfin.com/city/1362/CA/Belmont).

Property Summary URLs Only

If you want to get all Redfin summary URLs in a given city, you can just run

python redfin_crawler.py proxy.csv https://www.redfin.com/city/1362/CA/Belmont \
    --property_prefix https://www.redfin.com/city/1362/CA/Belmont --type pages

Scraping Property Details

If you need the property details, run with type properties. This not only generates the summary URLs containing the properties, but also extracts the property metadata from those URLs.

python redfin_crawler.py good_proxies.csv https://www.redfin.com/city/1362/CA/Belmont \
    --property_prefix https://www.redfin.com/city/1362/CA/Belmont --type properties

Known Issues and Bugs

Fork safety issue on Mac

If a Mac user sees errors like

may have been in progress in another thread when fork() was called.
We cannot safely call it or ignore it in the fork() child process. Crashing instead

Try setting the following environment variable before running the program

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

Scraping with proxies returns 403 error code.

Most likely the proxy has been blocked by the website's detection algorithm. You can temporarily remove the proxy from your proxy pool.
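
One way to picture "temporarily removing" a proxy is a small in-memory pool that sets a proxy aside once it returns a 403. This is only a sketch under assumptions: the `ProxyPool` class and its methods are hypothetical and do not describe how redfin_crawler.py manages proxies.

```python
import random


class ProxyPool:
    """Hypothetical in-memory pool; not part of redfin-scraper."""

    def __init__(self, proxies):
        self.active = list(proxies)
        self.blocked = []

    def pick(self):
        """Return a random proxy that has not been set aside."""
        if not self.active:
            raise RuntimeError("no usable proxies left")
        return random.choice(self.active)

    def report(self, proxy, status_code):
        """Record the HTTP status seen through a proxy."""
        # Treat HTTP 403 as a sign the proxy is blocked and set it aside.
        if status_code == 403 and proxy in self.active:
            self.active.remove(proxy)
            self.blocked.append(proxy)


# Placeholder proxy URLs, used only to show the bookkeeping.
pool = ProxyPool(["http://a:1", "http://b:2"])
pool.report("http://a:1", 403)
print(pool.active)
```

Keeping blocked proxies in a separate list, rather than discarding them, makes it easy to retry them later, since a block may be temporary.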

But how do I know whether a proxy is good or not?

There is a proxy_checker.py script in the tools folder. You can use it to eliminate the proxies that are currently blocked by the target website. To use it, run

python tools/proxy_checker.py --proxy_csv_path proxy.csv

Disclaimer

Scraping websites can violate their terms of service. Use at your own risk.

TODO

Add free proxy integration so no external proxy file is needed.
Make it a package so users can easily install it with pip.
Add Docker environment.
