redfin-scraper is a proxy-based scraper to extract properties from Redfin with filters. It is especially useful when you want to crawl all recently sold properties (e.g., properties sold in past 3 years) in a given state or city.
Please refer to algorithm_sketch.md.
- Have sqlite installed. If you are using mac, you do not need to install.
- Your OS system has python 3.6
- You have a file of proxies. You can buy proxies online, or use a free service like proxybroker. The repo assumes the use of proxies with user and password authorization. If your proxies do not need authorization, you can just have the csv file like
ip,port
a.b.c.d,2345
e.f.g.h,1234
...
Otherwise, your csv proxy file can be
ip,port,user,password
a.b.c.d,2345,user1,pass1
e.f.g.h,1234,user2,pass2
...
- Create Python virtual environment first with python3.
python3.6 -m venv /path/to/venv
- Activate venv.
source /path/to/venv/bin/activate
pip install -r requirements.txt
Once you successfully have all the prerequisites ready and set up the Python environment, you can scrape the Redfin data based on your needs. In the following I will demonstrate redfin-scraper usage by scraping a small city called Belmont (https://www.redfin.com/city/1362/CA/Belmont).
If you want to get all Redfin summary URLs in a given city, you can just run
python redfin_crawler.py proxy.csv https://www.redfin.com/city/1362/CA/Belmont
--property_prefix https://www.redfin.com/city/1362/CA/Belmont --type pages
If you need to get the property details, you can just run with type properties. This will not only generate the summary URLs containing the properties, but extract the property metadata from those urls.
python redfin_crawler.py good_proxies.csv https://www.redfin.com/city/1362/CA/Belmont
--property_prefix https://www.redfin.com/city/1362/CA/Belmont --type properties
If Mac user experiences errors like
may have been in progress in another thread when fork() was called.
We cannot safely call it or ignore it in the fork() child process. Crashing instead
Try setting the following env before running the program
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
Most likely this proxy is blocked by the detection algorithm of the corresponding websites. You can temporarily remove the proxy out of your proxy pool.
I put a proxy_checker.py in the tools repo. You can use this script to eliminate the proxies that are currently blocked by external website. To use, run
python tools/proxy_checker.py --proxy_csv_path proxy.csv
Scraping websites can violate website term of service. Use at your own risk.
- Add free proxy integration so no external proxy file is needed.
- Make it a package so users can easily install it with pip.
- Add Docker environment.