This application uses the New York Times Newswire API to collect the latest news items.
A cron job is set up using the ‘whenever’ gem, it runs:
rake newswire:collect
every 20 minutes.
The captured news items are represented in the following Rails models:
-
NewsItem
-
MediaItem
-
MediaMetadataItem
-
RelatedUrl
One potential use of pre-categorized texts is as training data for building models for machine learning/text mining tools like LIBSVM (the most popular C++ implementation of Support Vector Machines).
The Ruby interface to the LIBSVM C++ library is available at:
http://github.com/tomz/libsvm-ruby-swig/tree/master
The power of Convention over Configuration is demonstrated here, there are less than 80 lines of Ruby code in lib/task/newswire.rake including the setup code, comments, and blank lines. To achieve the same in Java would require code that is both mind-boggling and ugly.
-
apply for an API key from the NYTimes Developer Network site at:
http://developer.nytimes.com/docs/times_newswire_api/
-
get the code from github:
git clone git://github.com/tomz/nytimes_newswire_api_app.git
-
install the required gem(s):
sudo gem install json
-
modify database.yml for your db environment
-
create the database and tables:
rake db:create && rake db:migrate
-
modify lib/task/newswire.rake to put in your NYTimes Newswire API key
-
install the ‘whenever’ gem:
gem sources -a http://gems.github.com #only if you haven't run this before sudo gem install javan-whenever
-
run whenever in the rails app directory:
whenever
-
the newswire:collect task will run every 20 minutes to grab the latest NYTimes news, you can change the time interval in config/schedule.rb:
every 20.minutes do rake "newswire:collect", :environment => "development" end
-
and you can optionally run:
rake newswire:test
to test without the API key
-
and for integration test, run:
rake RAILS_ENV=test db:create && rake RAILS_ENV=test db:migrate cucumber features -n
-
now relax and watch your tables get filled up with New York Times news items:
tail -100f log/collect.log
-
to clear up the collected items, run:
rake newswire:empty_all_tables
Tom Zeng
tom.z.zeng at gmail dot com