mallam-ai/mallam-scrape

mallam-scrape

Website scraping tool for mallam-ai.

Pre-requisites

  • Install Go from https://go.dev
  • Execute go get ./... to install dependencies

Tool mallam-scrape

go run ./cmd/mallam-scrape "https://www.marxists.org/archive/marx/"

This scrapes all URLs and saves the pages to the out/www.marxists.org/../.. directory.
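
As an illustration of how a scraped URL could map to a file under out/, here is a minimal sketch. The localPath helper, the index.htm fallback for directory-style URLs, and the example page name are assumptions made for this example; the actual tool may lay out files differently.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// localPath maps a page URL to a slash-separated file path under outDir.
// Hypothetical helper: the real tool's layout rules may differ.
func localPath(outDir, rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("URL %q has no host", rawURL)
	}
	// Remember whether the URL points at a directory before cleaning,
	// because path.Clean drops the trailing slash.
	isDir := u.Path == "" || strings.HasSuffix(u.Path, "/")
	// Rooting the path before cleaning keeps ".." segments from escaping outDir.
	p := path.Clean("/" + u.Path)
	if isDir {
		// Assumption: directory-style URLs are stored as index.htm.
		p = path.Join(p, "index.htm")
	}
	return path.Join(outDir, u.Host, p), nil
}

func main() {
	urls := []string{
		"https://www.marxists.org/archive/marx/",
		// Hypothetical page name, used only for illustration.
		"https://www.marxists.org/archive/marx/works/1848/example/ch01.htm",
	}
	for _, raw := range urls {
		p, err := localPath("out", raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(raw, "->", p)
	}
}
```

The sketch returns slash-separated paths; on Windows they could be converted with filepath.FromSlash before writing.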

Tool mallam-extract-text-marx

go run ./cmd/mallam-extract-text-marx

This reads all HTML files in out/www.marxists.org/archive/marx/works and saves the plain text to out/text-marx.txt.

Internal Logic

  1. Iterate over the subdirectories of archive/marx/works whose names start with four digits
  2. Ignore index.htm files
  3. Collect <p> elements that have no class attribute
  4. Concatenate all collected text
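
The four steps above can be sketched with the Go standard library alone. This is an illustration under stated assumptions, not the code in cmd/mallam-extract-text-marx. All function names are hypothetical. It treats .htm and .html files as HTML, walks each year directory recursively, and uses regular expressions in place of a real HTML parser, so it misses paragraphs without a closing </p> tag. A parser such as golang.org/x/net/html would be more robust.

```go
package main

import (
	"fmt"
	"html"
	"io/fs"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

var (
	yearPrefix = regexp.MustCompile(`^[0-9]{4}`)                   // step 1: four-digit prefix
	paragraph  = regexp.MustCompile(`(?is)<p(\s[^>]*)?>(.*?)</p>`) // step 3: <p ...>...</p>
	classAttr  = regexp.MustCompile(`(?i)\sclass\s*=`)
	anyTag     = regexp.MustCompile(`<[^>]*>`)
)

// isYearDir reports whether a directory name starts with four digits.
func isYearDir(name string) bool { return yearPrefix.MatchString(name) }

// isContentFile accepts .htm/.html files except index.htm (step 2).
func isContentFile(name string) bool {
	lower := strings.ToLower(name)
	if lower == "index.htm" {
		return false
	}
	return strings.HasSuffix(lower, ".htm") || strings.HasSuffix(lower, ".html")
}

// collectFiles walks every year-prefixed subdirectory of root.
// Assumption: files may sit at any depth below the year directory.
func collectFiles(root string) ([]string, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil, err
	}
	var files []string
	for _, e := range entries {
		if !e.IsDir() || !isYearDir(e.Name()) {
			continue
		}
		walkErr := filepath.WalkDir(filepath.Join(root, e.Name()),
			func(p string, d fs.DirEntry, err error) error {
				if err != nil {
					return err
				}
				if !d.IsDir() && isContentFile(d.Name()) {
					files = append(files, p)
				}
				return nil
			})
		if walkErr != nil {
			return nil, walkErr
		}
	}
	return files, nil
}

// extractParagraphs returns the text of each <p> that has no class attribute.
func extractParagraphs(doc string) []string {
	var out []string
	for _, m := range paragraph.FindAllStringSubmatch(doc, -1) {
		if classAttr.MatchString(m[1]) {
			continue
		}
		text := html.UnescapeString(anyTag.ReplaceAllString(m[2], ""))
		text = strings.Join(strings.Fields(text), " ") // collapse whitespace
		if text != "" {
			out = append(out, text)
		}
	}
	return out
}

func main() {
	files, err := collectFiles("out/www.marxists.org/archive/marx/works")
	if err != nil {
		fmt.Println("cannot read input directory:", err)
		return
	}
	var all []string
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			fmt.Println("skipping", f, "->", err)
			continue
		}
		all = append(all, extractParagraphs(string(data))...)
	}
	// Step 4: combine everything; printed here instead of written to a file.
	fmt.Println(strings.Join(all, "\n"))
}
```

The sketch prints the combined text to standard output; writing it to out/text-marx.txt would only need an os.WriteFile call in place of the final print.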

Credits

MALLAM Developers, MIT License
