This is a very thin layer on top of Colly that allows configuration from a JSON file. The output is JSONL, ready to be imported into Typesense (an illustrative record is shown after the feature list below).
- Scrape HTML & PDF documents based on the configured selectors
- Selectors can be plain CSS selectors or template-based ones with Sprig functions available.
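For a rough idea of the output, each scraped document becomes one JSON object per line. The field names below are assumptions for illustration only; the actual fields depend on the selectors you configure:

{"url": "https://example.com/about", "title": "About us", "body": "Text extracted by the configured selectors"}
{"url": "https://example.com/careers", "title": "Careers", "body": "More extracted text"}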
See the example configuration; many of its options map directly to their Colly equivalents (a hypothetical sketch follows these references):
- http://go-colly.org/docs/introduction/configuration/
- https://pkg.go.dev/github.com/gocolly/colly?utm_source=godoc#Collector
- https://pkg.go.dev/github.com/gocolly/colly?utm_source=godoc#LimitRule
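As a non-authoritative sketch only, a config might pair crawler settings with a set of named selectors. The key names below are invented for illustration; the real schema is whatever the example configuration in this repo uses:

{
  "start_url": "https://gotripod.com",
  "allowed_domains": ["gotripod.com"],
  "user_agent": "ssscraper",
  "limit": { "parallelism": 2, "delay": "1s" },
  "selectors": {
    "title": "h1",
    "content": "main p"
  }
}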
We have an image on DockerHub, so after installing Docker and jq, something like this will work:
docker run -it -v `pwd`:/go/src/app -e "CONFIG=$(cat ./path/to/your/config.json | jq -r tostring)" gotripod/ssscraper:main
The manual method is:
docker build -t ssscraper .
docker run -v `pwd`:/go/src/app -it --rm --name ssscraper-ahoy ssscraper
# you're now in the docker container
cd src/app
go build
./ssscraper
ssscraper can be called with the --testUrl flag:
./ssscraper --testUrl=https://gotripod.com
It will scrape that URL without following its links, and output results only for that page.
Using VS Code, clone the repo and open the directory with the Containers extension installed.
- Nested selectors, i.e. selecting each item from a list on each page
- Webhook support - POST the output to a URL on completion
- Different output formats
- Custom weighting for selectors
- Extract the selector/template logic to a common function
- Add Word doc support
Built by Go Tripod, making the web as easy as one, two, three. Go Tripod builds bespoke software solutions; if you need a custom version of SS Scraper, please get in touch.