A simple distributed scrapper service that, given an initial set of URLs and a download depth, returns the HTML of each URL together with the group of dependencies needed to navigate it offline up to the specified depth.
To use the project, clone it or download it to your local computer.
Python v3.7.2 is required.
Dependencies:

- zmq
- scrapy
- requests
- BeautifulSoup
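
A minimal sketch of installing them with pip (note that zmq and BeautifulSoup are distributed on PyPI as pyzmq and beautifulsoup4):

```
# Install the project dependencies;
# pyzmq provides zmq, beautifulsoup4 provides BeautifulSoup
pip install pyzmq scrapy requests beautifulsoup4
```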
To run the project, open a console at the project root and start each of the node types, one or more instances of each.
Every node has a make rule to run it with the default values. The rules are:

- `seed` for the master node
- `worker` for the scrapper node
- `dispatcher` for the client node
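
For example, to start one node of each type with the default values:

```
# Each command starts one node of the given type with its defaults
make seed
make worker
make dispatcher
```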
If you want to use other parameters, skip make and execute the node's corresponding Python file directly:

```
python <node>.py <params>
```
To see how to pass parameters to each node type, use:

```
python <node>.py --help
```
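
For instance, assuming the node scripts are named after their make rules (an assumption; check the repository for the actual file names):

```
# Show the accepted parameters of each node type (script names assumed)
python seed.py --help
python worker.py --help
python dispatcher.py --help
```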
Inside the python-image folder there is a Dockerfile for the dependencies image, which must be built from the python:3.7-slim base image. From the CLI, run `make python-docker-image` to build this image, then run `make project-image` to build the image for the project. The project image is built with the scrapper tag by default; if you want to build it with another tag, run `make project-image image_tag=<tag_name>`.
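
For example, to build both images with a custom tag (my-scrapper is a placeholder):

```
# Build the dependencies image, then the project image with a custom tag
make python-docker-image
make project-image image_tag=my-scrapper
```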
To run the image, execute from the CLI:

```
docker run -it --rm --network host \
    -v <path>:/home/Distributed-Scrapper/downloads \
    -v <urls>:/home/Distributed-Scrapper/urls \
    2kodevs:<image_tag>
```
Where:

- `<path>` is the folder where the downloads should be stored
- `<urls>` is the path to the file containing the set of URLs to download
- `<image_tag>` is the tag used when building the image
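
For example, assuming a urls file and a downloads folder exist in the current directory and the image was built with the default scrapper tag:

```
# Mount a local downloads folder and URL file into the container
docker run -it --rm --network host \
    -v $(pwd)/downloads:/home/Distributed-Scrapper/downloads \
    -v $(pwd)/urls:/home/Distributed-Scrapper/urls \
    2kodevs:scrapper
```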
Another way is to run, from the project's root:

```
make run target=docker node="<node_name.py> <args>"
```

The `target=docker` option specifies that the node should be run using Docker; skip it to run without Docker.
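
For example, to run a scrapper node inside Docker (the script name worker.py is an assumption):

```
# Start a scrapper node inside Docker; worker.py is an assumed script name
make run target=docker node="worker.py --help"
```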
The file with the set of URLs to download must have the following form (it does not need a special extension):

```
[
    "<url_1>",
    "<url_2>",
    ...
    "<url_n>"
]
```
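
For instance, a valid URL file could look like this (the domains are placeholders):

```
[
    "https://example.com",
    "https://example.org",
    "https://example.net"
]
```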
This project is licensed under the MIT License; see the LICENSE.md file for details.