To run RDFStats via spark-submit, first clone the repository and compile the project.
Before computing statistics, download the datasets and upload them to HDFS. The following steps should be taken:
- Download DBpedia and extract it into a single .nt file
- DBpedia en

```shell
wget -r -np -nd -nc -A'*.nt.bz2' http://downloads.dbpedia.org/3.9/en/
cat *.nt.bz2 > Dbpedia_en.nt.bz2
bzip2 -d Dbpedia_en.nt.bz2
hadoop fs -put Dbpedia_en.nt /<pathToHDFS>/
```
- DBpedia de

```shell
wget -r -np -nd -nc -A'*.nt.bz2' http://downloads.dbpedia.org/3.9/de/
cat *.nt.bz2 > Dbpedia_de.nt.bz2
bzip2 -d Dbpedia_de.nt.bz2
hadoop fs -put Dbpedia_de.nt /<pathToHDFS>/
```
- DBpedia fr

```shell
wget -r -np -nd -nc -A'*.nt.bz2' http://downloads.dbpedia.org/3.9/fr/
cat *.nt.bz2 > Dbpedia_fr.nt.bz2
bzip2 -d Dbpedia_fr.nt.bz2
hadoop fs -put Dbpedia_fr.nt /<pathToHDFS>/
```
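The three DBpedia downloads follow the same download/merge/decompress/upload pattern, so they can be driven from a single loop. A minimal sketch that prints the pipeline for each language code (commands are echoed rather than executed, so nothing here touches the network):

```shell
# Print the DBpedia download pipeline for each language code (en, de, fr).
# Commands are echoed, not executed; remove `echo` to run them for real.
for lang in en de fr; do
  echo "wget -r -np -nd -nc -A '*.nt.bz2' http://downloads.dbpedia.org/3.9/$lang/"
  echo "cat *.nt.bz2 > Dbpedia_${lang}.nt.bz2"
  echo "bzip2 -d Dbpedia_${lang}.nt.bz2"
  echo "hadoop fs -put Dbpedia_${lang}.nt /<pathToHDFS>/"
done
```

If you run the languages from one directory, clean up the downloaded `*.nt.bz2` archives between iterations so the `cat` step does not pick up the previous language's files.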
- Download LinkedGeoData and extract it into a single .nt file

```shell
wget -r -np -nd -nc -A'*.nt.bz2' http://downloads.linkedgeodata.org/releases/2015-11-02/
cat *.nt.bz2 > LinkedGeoData.nt.bz2
bzip2 -d LinkedGeoData.nt.bz2
hadoop fs -put LinkedGeoData.nt /<pathToHDFS>/
```
- Generate BSBM datasets

We generated datasets of several sizes:

```shell
wget http://downloads.sourceforge.net/project/bsbmtools/bsbmtools/bsbmtools-0.2/bsbmtools-v0.2.zip
unzip bsbmtools-v0.2.zip
cd bsbmtools-0.2/
./generate -fc -s nt -fn BSBM_2GB -pc 23336
./generate -fc -s nt -fn BSBM_20GB -pc 233368
./generate -fc -s nt -fn BSBM_50GB -pc 583420
./generate -fc -s nt -fn BSBM_100GB -pc 1166840
./generate -fc -s nt -fn BSBM_200GB -pc 2333682
hadoop fs -put BSBM_XGB.nt /<pathToHDFS>/   # replace X with the generated size
```
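The `-pc` (product count) values above scale with the target size at roughly 11,668 products per GB (sizes taken from the file names). A quick arithmetic check of that ratio:

```shell
# Sanity-check the products-per-GB ratio implied by the -pc values above.
for pair in 2:23336 20:233368 50:583420 100:1166840 200:2333682; do
  gb=${pair%%:*}
  pc=${pair##*:}
  echo "$gb GB -> $((pc / gb)) products/GB"   # every line prints 11668
done
```

The constant ratio shows the generator's product count was chosen to grow linearly with the target dataset size.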
- Distributed Processing on Large-Scale Datasets
Run DistLODStats against the datasets to generate the statistics. Use the following commands:
- For cluster mode:

```shell
./run_stats.sh Dbpedia_en Iter1
```

- For local mode:

```shell
./run_stats-local.sh Dbpedia_en Iter1
```
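To cover every uploaded dataset in one pass, the cluster-mode script can be driven from a loop. A sketch, with the dataset names taken from the download steps above (commands are echoed so the pattern is visible without a cluster):

```shell
# Print the cluster-mode invocation for each uploaded dataset.
# Remove `echo` to actually launch the runs.
for ds in Dbpedia_en Dbpedia_de Dbpedia_fr LinkedGeoData BSBM_2GB; do
  echo "./run_stats.sh $ds Iter1"
done
```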
- Scalability
- Size-up scalability: to measure the size-up scalability of our approach, we ran experiments on three different dataset sizes.
- Node scalability: to measure node scalability, we varied the number of workers in our cluster.