This project includes a Python-based web crawler designed to extract data from a list of URLs and save the crawled content in various formats. The crawler is built using Scrapy, a fast, high-level web crawling and web scraping framework.
- Python 3.6+
- pip (Python package installer)
First, clone this repository to your local machine:

```bash
git clone https://github.com/padas-lab-de/url-dataset-crawling.git
cd url-dataset-crawling
```
Create a virtual environment to manage the project's dependencies separately from other Python projects:

```bash
# For Unix or macOS
python3 -m venv env
source env/bin/activate

# For Windows
python -m venv env
.\env\Scripts\activate
```
Install the required Python packages specified in `requirements.txt`:

```bash
pip install -r requirements.txt
```
- **Input Data**: Place your list of URLs in a `.csv` or `.txt` file within the `data/inputs/` directory. Ensure the CSV file has a column named `url` containing the URLs.
- **Output Format**: Decide on the format for the output data. Options include:
  - CSV file (`csv`): Appends crawled content to a CSV file.
  - JSON Lines file (`jsonl`): Stores each item in a separate line in JSON format.
  - HTML files (`html`): Saves complete HTML content to individual files and indexes them in a CSV file.
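The first two output options can be illustrated with a short sketch. `save_item` and its arguments are hypothetical names for illustration, not this project's actual API; the `html` case is omitted for brevity:

```python
import csv
import json
from pathlib import Path

def save_item(item: dict, output_path: str, output_type: str) -> None:
    """Illustrative sketch: append one crawled item in the chosen format."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    if output_type == "csv":
        # Append a row; write the header only when the file is new.
        new_file = not path.exists()
        with path.open("a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(item))
            if new_file:
                writer.writeheader()
            writer.writerow(item)
    elif output_type == "jsonl":
        # One JSON object per line.
        with path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
    else:
        raise ValueError(f"unsupported output type: {output_type}")
```

Appending (rather than rewriting) matches the "appends crawled content" behavior described above, so repeated runs accumulate rows.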
Navigate to the project's root directory and run `main.py` using Python:

```bash
python main.py
```
You will be prompted to enter:
- The input type (`csv` or `txt`).
- The output type (`csv`, `jsonl`, or `html`).
- The path to the input file (relative to the project root).
- The path to the output file or directory (relative to the project root).
Example interaction:
```
Enter input type (csv/txt): csv
Enter output type (csv/jsonl/html): csv
Enter path to input file: data/inputs/OWS_URL_DS.csv
Enter path to output file/directory: data/outputs/output.csv
```
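The input file in the interaction above is an ordinary CSV with a `url` column; for reference, it could be loaded with Python's standard library (the function name here is illustrative):

```python
import csv

def load_urls(path: str) -> list[str]:
    """Read the 'url' column from a CSV input file, skipping empty cells."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row["url"] for row in csv.DictReader(f) if row.get("url")]
```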
Logs are saved in the `logs/` directory, which helps in troubleshooting and understanding the crawler's behavior.
- `ModuleNotFoundError`: Ensure all scripts are run from the project's root directory and that the virtual environment is activated.
- `IOError` or `PermissionError`: Check the permissions of the directory where you are trying to write the output files, and ensure the directory exists and is writable.
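For the second point, creating the output directory up front avoids most missing-directory errors; a minimal sketch (the path is an example):

```python
import os
from pathlib import Path

output_dir = Path("data/outputs")  # example path, relative to the project root
# Create the directory if it is missing, then verify it is writable.
output_dir.mkdir(parents=True, exist_ok=True)
assert os.access(output_dir, os.W_OK), f"{output_dir} is not writable"
```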