This is an Apache Spark-based project for analyzing crawls generated by Apache Nutch. The project is still in incubation and currently offers a single feature, the CDRv2 dump.
The vision is to continue developing analytical features for Nutch using Spark. This work will also intersect with areas such as machine learning and natural language processing.
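Build the project with Maven:
mvn clean install
Then run the CDRv2 dump. The Spark master value is quoted so that shells such as zsh do not try to expand it as a glob pattern:
java -cp analytics-1.0.jar gov.nasa.jpl.analytics.dump.Cdrv2Dump -m 'local[*]' -s PATH_TO_SEGMENT_FOLDER -o OUTPUT_FILE -l PATH_TO_LINK_DB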
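For repeated runs, the dump command can be wrapped in a small shell function. The following is only a sketch: `cdrv2_dump` and the `DRY_RUN` variable are hypothetical names that are not part of this project, and the function assumes that `analytics-1.0.jar` is in the current directory and that `-m` takes the Spark master URL, as the `local[*]` value suggests.

```shell
# Hypothetical helper, not shipped with this project: wraps the Cdrv2Dump call.
# Set DRY_RUN=1 to print the command instead of running it.
cdrv2_dump() {
  segment_dir=$1            # Nutch segment folder, passed as -s
  output_file=$2            # output file, passed as -o
  link_db=$3                # Nutch link DB, passed as -l
  master=${4:-'local[*]'}   # assumed Spark master URL, passed as -m

  # Rebuild the positional parameters as the full command line so that
  # each argument stays a single word even if it contains spaces.
  set -- java -cp analytics-1.0.jar gov.nasa.jpl.analytics.dump.Cdrv2Dump \
    -m "$master" -s "$segment_dir" -o "$output_file" -l "$link_db"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

Calling `cdrv2_dump PATH_TO_SEGMENT_FOLDER OUTPUT_FILE PATH_TO_LINK_DB` runs the dump with the default master; a fourth argument overrides it.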
If you have any questions or suggestions, please send them to [email protected]
Website: http://irds.usc.edu