computerphysicslab/goCrawler

goCrawler

Text-mining topic relevant sources

About goCrawler

A Go crawler restricted to topic-relevant, curated URLs. It includes token frequency analysis and NLP n-gram detection.
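To make "n-gram detection" concrete, here is a minimal sketch of counting word bigrams in a text and keeping the ones that repeat. It is not taken from crawler.go: the function names (`tokenize`, `countNGrams`, `frequentNGrams`), the tokenization rule and the `minCount` threshold are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// tokenize lowercases the text and splits it on every rune that is not a
// letter or a digit. This is a simplification: sentence boundaries are
// ignored, so an n-gram may span two sentences.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// countNGrams counts every run of n consecutive tokens (hypothetical helper).
func countNGrams(tokens []string, n int) map[string]int {
	counts := make(map[string]int)
	if n < 1 {
		return counts
	}
	for i := 0; i+n <= len(tokens); i++ {
		counts[strings.Join(tokens[i:i+n], " ")]++
	}
	return counts
}

// frequentNGrams returns the n-grams seen at least minCount times,
// sorted alphabetically so the result is deterministic.
func frequentNGrams(counts map[string]int, minCount int) []string {
	var out []string
	for gram, c := range counts {
		if c >= minCount {
			out = append(out, gram)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	text := "The freedom convoy reached Ottawa. Police met the freedom convoy downtown."
	bigrams := countNGrams(tokenize(text), 2)
	fmt.Println(frequentNGrams(bigrams, 2)) // prints [freedom convoy the freedom]
}
```

A real detector would probably also filter out n-grams made only of common words (such as "the freedom" here) and score candidates against a neutral corpus; the sketch stops at raw counts.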

Watch goLang goCrawler running at https://youtu.be/7IDkNYcMXxU

Install

Because the project uses go.mod, the required libraries are downloaded automatically on first run. goCrawler also needs a running Redis service; see the Redis Install appendix at the end of this document.

Run

Create myTopic.yaml, using covid-19.yaml or blockchain.yaml as a template, then pass the profile name (without the .yaml extension) to the crawler:

    $ go run crawler.go <myTopic>

neutral.yaml is a topic-neutral crawling profile for building a general English corpus:

    $ go run crawler.go neutral

Output

From time to time the crawler prints an on-screen summary of the relevant keywords found so far, like this:

    numLinksProcessed: 150

    Corpus frequencies:  [{the 5191} {note 637} {convoy 503} {ottawa 426} {canada 395} {police 336} {covid 327} {canadian 325} {protest 277} {truckers 244} {protesters 243} {vaccine 231} {freedom 190} {government 183} {protests 181} {border 172} {mandates 170} {trudeau 134} {health 130} {vaccinated 127} {right 115} {city 114} {trucks 113} {ontario 111} {restrictions 109} {vehicles 108} {minister 106} {truck 105} {retrieved 100} {toronto 99} {media 97} {last 96} {gofundme 95} {country 91} {pandemic 89} {organizers 88} {end 88} {downtown 88} {vaccination 86} {federal 85} {canadians 83} {far 82} {parliament 81} {called 80} {global 78} {united 76} {party 76} {drivers 75} {cross 75} {several 74} {cbc 74} {service 73} {states 72} {measures 72} {mandate 72} {statement 71} {prime 70} {conservative 70} {act 70} {supporters 69} {justin 69} {movement 66} {groups 64} {hill 63} {street 62} {law 62} {traffic 61} {friday 60} {down 60} {trucking 59} {president 59} {reported 58} {funds 58} {fully 58} {bridge 58} {capital 57} {sunday 56} {saturday 56} {demonstrators 56} {trucker 55} {political 55} {king 55} {highway 55} {press 54} {monday 54} {jan 54} {began 54} {occupation 53} {trump 52} {court 52} {area 52} {provincial 51} {saying 50} {members 50} {emergency 50} {workers 49} {security 49} {off 49} {chief 49} {american 49}]


    Corpus frequencies w/o Eng.:  [{convoy 499} {ottawa 414} {canadian 294} {canada 258} {protest 253} {truckers 244} {protesters 238} {note 219} {vaccine 216} {mandates 168} {protests 160} {trudeau 134} {vaccinated 127} {covid 116} {trucks 107} {freedom 103} {gofundme 95} {retrieved 92} {truck 89} {vaccination 83} {ontario 81} {organizers 80} {canadians 80} {downtown 79} {vehicles 76} {cbc 74} {police 70} {mandate 68} {trucking 59} {trucker 55} {toronto 53} {justin 52} {demonstrators 50} {highway 46} {cent 41} {unvaccinated 40} {rcmp 40} {blockade 40} {windsor 37} {supporters 36} {organizer 36} {ctv 36} {postmedia 35} {lich 35} {givesendgo 34} {emergencies 34} {blockades 33} {provincial 31} {occupation 31} {alberta 31} {sloly 30} {protestors 30} {crowdfunding 30} {protesting 29} {conservative 29} {jan 26} {vancouver 23} {vaccines 23} {tamara 23} {feb 23} {ukraine 22} {omicron 22} {crossings 22} {tweeted 21} {provinces 21} {horns 21} {toole 20} {rally 20} {nova 19} {donations 19} {scotia 18} {pressprogress 18} {ops 18} {crossing 18} {convoys 18} {confederate 18} {passports 17} {ndp 17} {bergen 17} {variant 16} {saskatchewan 16} {opp 16} {occupiers 16} {mou 16} {invoked 16} {fundraiser 16} {demonstrations 16} {coutts 16} {arrests 16} {peaceful 15} {zello 14} {watson 14} {invocation 14} {insurrection 14} {hospitalized 14} {honking 14} {freeland 14} {extremists 14} {drivers 14} {bloor 14}]

You can find the topic-relevant text in logs/corpusCuratedText.log

Other interesting results produced by computerphysicslab/goCrawler:

About the author

Find me at https://www.linkedin.com/in/semanticwebarchitect/ or [email protected]

I am very interested in text mining, natural language processing, named entity recognition, and building textual corpora and search apps for the health, biomedical and clinical domains.

Redis Install

Appendix: How to install the Redis service on Ubuntu Linux

goCrawler uses a local Redis NoSQL database to cache visited URLs and avoid unnecessary network load, so you need to install Redis before running the crawler.
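The idea of a visited-URL cache can be sketched as a small "mark and report whether it was new" operation. The code below is an assumption-laden illustration, not the crawler's actual code: `VisitedStore`, `memoryStore` and `filterUnvisited` are hypothetical names, and an in-memory map stands in for Redis so the example runs without a server. A Redis-backed implementation could satisfy the same interface with a set-if-absent command such as `SET key value NX`.

```go
package main

import (
	"fmt"
	"sync"
)

// VisitedStore is a hypothetical abstraction over the visited-URL cache.
type VisitedStore interface {
	// MarkVisited records url and reports whether it had not been seen before.
	MarkVisited(url string) (bool, error)
}

// memoryStore is an in-memory stand-in for Redis, used only for this sketch.
type memoryStore struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newMemoryStore() *memoryStore {
	return &memoryStore{seen: make(map[string]struct{})}
}

func (m *memoryStore) MarkVisited(url string) (bool, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.seen[url]; ok {
		return false, nil
	}
	m.seen[url] = struct{}{}
	return true, nil
}

// filterUnvisited keeps only the URLs that were not already in the store,
// marking each of them as visited along the way.
func filterUnvisited(store VisitedStore, urls []string) ([]string, error) {
	var fresh []string
	for _, u := range urls {
		isNew, err := store.MarkVisited(u)
		if err != nil {
			return nil, err
		}
		if isNew {
			fresh = append(fresh, u)
		}
	}
	return fresh, nil
}

func main() {
	store := newMemoryStore()
	queue := []string{"https://example.org/a", "https://example.org/b", "https://example.org/a"}
	fresh, err := filterUnvisited(store, queue)
	if err != nil {
		fmt.Println("cache error:", err)
		return
	}
	fmt.Println(fresh) // prints [https://example.org/a https://example.org/b]
}
```

Unlike the in-memory map, a Redis-backed store survives restarts, so a crawl that is stopped and resumed does not fetch the same pages again.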

"Redis is very fast and can perform about 110000 SETs per second, about 81000 GETs per second."

https://www.tutorialspoint.com/redis/redis_overview.htm

#redis #nosql #performance #benchmarks

sudo apt-get update
sudo apt-get install redis-server

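Start the server manually. If it reports "Address already in use", as in the run below, a Redis instance is already listening on port 6379 (on Ubuntu the package normally starts the service on install) and nothing more is needed:

redis-server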
	770156:C 03 Sep 2020 08:11:43.794 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
	770156:C 03 Sep 2020 08:11:43.794 # Redis version=5.0.7, bits=64, commit=00000000, modified=0, pid=770156, just started
	770156:C 03 Sep 2020 08:11:43.794 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
	770156:M 03 Sep 2020 08:11:43.795 * Increased maximum number of open files to 10032 (it was originally set to 1024).
	770156:M 03 Sep 2020 08:11:43.796 # Could not create server TCP listening socket *:6379: bind: Address already in use

redis-cli
redis 127.0.0.1:6379>
redis 127.0.0.1:6379> ping
	PONG

sudo /etc/init.d/redis-server restart
sudo /etc/init.d/redis-server stop
sudo /etc/init.d/redis-server start

	772459:C 03 Sep 2020 08:22:37.635 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
	772459:C 03 Sep 2020 08:22:37.635 # Redis version=5.0.7, bits=64, commit=00000000, modified=0, pid=772459, just started
	772459:C 03 Sep 2020 08:22:37.635 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
	772459:M 03 Sep 2020 08:22:37.637 * Increased maximum number of open files to 10032 (it was originally set to 1024).
					_._                                                  
			_.-``__ ''-._                                             
		_.-``    `.  `_.  ''-._           Redis 5.0.7 (00000000/0) 64 bit
	.-`` .-```.  ```\/    _.,_ ''-._                                   
	(    '      ,       .-`  | `,    )     Running in standalone mode
	|`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
	|    `-._   `._    /     _.-'    |     PID: 772459
	`-._    `-._  `-./  _.-'    _.-'                                   
	|`-._`-._    `-.__.-'    _.-'_.-'|                                  
	|    `-._`-._        _.-'_.-'    |           http://redis.io        
	`-._    `-._`-.__.-'_.-'    _.-'                                   
	|`-._`-._    `-.__.-'    _.-'_.-'|                                  
	|    `-._`-._        _.-'_.-'    |                                  
	`-._    `-._`-.__.-'_.-'    _.-'                                   
		`-._    `-.__.-'    _.-'                                       
			`-._        _.-'                                           
				`-.__.-'                                               

	772459:M 03 Sep 2020 08:22:37.639 # Server initialized
    ...
	772459:M 03 Sep 2020 08:22:37.639 * Ready to accept connections

Check that the Redis server is running as a system process:

ps aux | grep redis
	redis     653361  0.1  0.0  55332  5048 ?        Ssl  18:11   0:00 /usr/bin/redis-server 127.0.0.1:6379