Website scraping tool for mallam-ai
- Install Go from https://go.dev
- Execute go get ./... to install dependencies
go run ./cmd/mallam-scrape "https://www.marxists.org/archive/marx/"
This will scrape all URLs and save the pages to the out/www.marxists.org/../.. directory
go run ./cmd/mallam-extract-text-marx
This will read all HTML files in out/www.marxists.org/archive/marx/works
and save plain text to out/text-marx.txt
Internal Logic
- Iterate over the subdirectories of archive/marx/works whose names start with a 4-digit prefix
- Ignore index.htm files
- Collect <p> elements without a class attribute
- Combine all text together
MALLAM Developers, MIT License