S. Felix Wu
Fredrik Erlandsson
This crawler consists of two parts: agent.php, which does the actual crawling, and a controller (found in contoller/), which keeps track of the current crawling status.
The agent depends on the Facebook PHP SDK. To install it, run a submodule update:
git submodule update --init
Most of the time you only need to use the agent.
Create a Facebook application at https://developers.facebook.com/apps and make sure to fill in offline_access and read_stream under Permissions->Extended Permissions.
Copy config/config-dist.php to config/config.php and fill in APPID and APPSEC (from your Facebook application page) and the URL of a running controller.
Run php agent.php token=FACEBOOK_USER_TOKEN
or, as a web application, open http://example.com/agent.php?token=FACEBOOK_USER_TOKEN
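The configuration copy step above can be sketched in shell as follows. The helper name init_config is hypothetical and not part of this repository; the sketch assumes only the two paths named above and refuses to overwrite an existing config/config.php.

```shell
# Hypothetical helper (not part of the repository): copy the config template
# to config/config.php unless a config file already exists.
# Usage: init_config [repo_root]   (repo_root defaults to the current directory)
init_config() {
    local root="${1:-.}"
    local src="$root/config/config-dist.php"
    local dst="$root/config/config.php"

    # Fail early if the template is not where we expect it.
    if [ ! -f "$src" ]; then
        echo "missing template: $src" >&2
        return 1
    fi

    # Never clobber a config that has already been filled in.
    if [ -f "$dst" ]; then
        echo "keeping existing $dst" >&2
        return 0
    fi

    cp "$src" "$dst"
}
```

After calling init_config from the repository root, config/config.php still has to be edited by hand to set APPID, APPSEC and the controller URL.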
To run multiple instances of the agent in one environment (recommended), use
the script bgxgrp.sh as follows:
bash bgxgrp.sh <#-instances> php agent.php token=FACEBOOK_USER_TOKEN
where <#-instances> is the number of agent instances to run
(between 8 and 15 is reasonable to avoid hitting Facebook's 600/600 limit).
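The contents of bgxgrp.sh are not reproduced here. As a rough illustration of the general idea of running several copies of a command side by side, a minimal sketch might look like the following; the function name run_instances is hypothetical and the code is not taken from the script.

```shell
# Hypothetical sketch (not the actual bgxgrp.sh): start N background copies
# of a command and wait until all of them have exited.
# Usage: run_instances <count> <command> [args...]
run_instances() {
    local count="$1"
    shift

    # Reject anything that is not a plain non-negative integer.
    case "$count" in
        ''|*[!0-9]*)
            echo "count must be a non-negative integer" >&2
            return 2
            ;;
    esac

    local i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &              # launch one instance in the background
        i=$((i + 1))
    done

    # Block until every background job of this shell has finished.
    wait
}
```

Under these assumptions, run_instances 10 php agent.php token=FACEBOOK_USER_TOKEN would start ten agents and return once all of them have finished.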
The FACEBOOK_USER_TOKEN is generated via the Graph API Explorer page, https://developers.facebook.com/tools/explorer/, using a user whose stated age is over 18, so that all types of pages can be crawled.
Happy crawling!!