Data Distribution.

This repository contains several analyses of how data is distributed in the OpenStreetMap (OSM) dataset.

Spark locally

If you don't have access to a Spark cluster, you can run the analysis locally. A laptop with 16 GB of memory and 8 cores should be enough. In my case, I'm using a desktop with 16 cores and 32 GB of RAM. Full specs at the very bottom.

To start a standalone Spark master and worker on your machine, download and uncompress Spark, then run from the Spark folder:

sbin/start-all.sh

To access the Spark master UI, open http://localhost:8080/

To stop the local Spark master and worker:

sbin/stop-all.sh

Extract blocks

To be able to parallelize the analysis, first extract all the blocks from the .osm.pbf file. Extracting the full planet takes around 4 minutes:

spark-submit \
  --class com.simplexportal.simplexspatial.analysis.Driver \
  --master "local[*]" \
  target/scala-2.11/simplexspatial-data-distribution-analysis-assembly-0.1.jar \
  extract \
  -i file:///home/angelcc/Downloads/osm/planet/planet-200309.osm.pbf \
  -o file:///home/angelcc/Downloads/osm/planet/blobs

Node IDs distribution

The following example generates the report for 100 "partitions" on the local machine, using 5 cores and 4 GB of executor memory. It takes around 30 minutes.

/home/angelcc/apps/spark-2.4.5-bin-hadoop2.7/bin/spark-submit \
  --class com.simplexportal.simplexspatial.analysis.Driver \
  --master "spark://angelcc-B450-AORUS-ELITE:7077" \
  --deploy-mode cluster \
  --executor-memory 4G \
  --total-executor-cores 5 \
  --num-executors 1 \
  target/scala-2.11/simplexspatial-data-distribution-analysis-assembly-0.1.jar \
  mod \
  -p 100 \
  -i file:///home/angelcc/Downloads/osm/planet/blobs \
  -o file:///home/angelcc/Downloads/osm/planet/distribution/nodeId/100
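
To make the idea of "partitions" concrete, here is a minimal sketch of modulo bucketing in plain shell. It assumes that the mod report assigns each node to the bucket given by its ID modulo the number of partitions; this README does not confirm that detail. partition_of and count_per_partition are hypothetical helpers for illustration and are not part of the analysis jar.

```shell
# Hypothetical helper: bucket of a node ID, ASSUMING partition = id mod partitions.
partition_of() {
  # $1 = node ID, $2 = number of partitions
  echo $(( $1 % $2 ))
}

# Hypothetical helper: count how many of the given node IDs land in each bucket.
# Prints one "bucket: count" line per non-empty bucket, in ascending bucket order.
count_per_partition() {
  partitions=$1
  shift
  for id in "$@"; do
    partition_of "$id" "$partitions"
  done | sort -n | uniq -c | awk '{ print $2 ": " $1 }'
}

# Five sample IDs spread over 100 partitions.
count_per_partition 100 1 101 201 42 4242
# prints:
# 1: 3
# 42: 2
```

A skewed count per bucket in such a report would suggest that plain modulo partitioning spreads the nodes unevenly.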

Tile distribution

The following example generates the distribution report for a grid of 10000x10000 tiles on the local machine, using 5 cores and 4 GB of executor memory. It takes around 30 minutes.

/home/angelcc/apps/spark-2.4.5-bin-hadoop2.7/bin/spark-submit \
  --class com.simplexportal.simplexspatial.analysis.Driver \
  --master "spark://angelcc-B450-AORUS-ELITE:7077" \
  --deploy-mode cluster \
  --executor-memory 4G \
  --total-executor-cores 5 \
  --num-executors 1 \
  target/scala-2.11/simplexspatial-data-distribution-analysis-assembly-0.1.jar \
  tile \
  --latPartitions 10000 \
  --lonPartitions 10000 \
  -i file:///home/angelcc/Downloads/osm/planet/blobs \
  -o file:///home/angelcc/Downloads/osm/planet/distribution/tile/10000x10000
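
With 10000 partitions per axis, each tile spans 0.018 degrees of latitude (180 / 10000) and 0.036 degrees of longitude (360 / 10000). The sketch below shows one way a coordinate could be mapped to such a tile. It assumes a regular grid over latitude [-90, 90] and longitude [-180, 180], which this README does not confirm. tile_of is a hypothetical helper and is not part of the analysis jar.

```shell
# Hypothetical helper: map a coordinate to "latTile,lonTile",
# ASSUMING a regular grid over [-90, 90] x [-180, 180].
# $1 = latitude, $2 = longitude, $3 = latPartitions, $4 = lonPartitions
tile_of() {
  awk -v lat="$1" -v lon="$2" -v latp="$3" -v lonp="$4" 'BEGIN {
    lat_idx = int((lat + 90) / 180 * latp)
    lon_idx = int((lon + 180) / 360 * lonp)
    # Clamp the upper edge (lat = 90 or lon = 180) into the last tile.
    if (lat_idx >= latp) lat_idx = latp - 1
    if (lon_idx >= lonp) lon_idx = lonp - 1
    print lat_idx "," lon_idx
  }'
}

tile_of 0 0 10000 10000   # prints 5000,5000
```

Counting nodes per tile index then gives the same kind of histogram as the node ID report, but keyed by location instead of by ID.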

Zeppelin

To start the Zeppelin notebook from a temporary folder (the UI is published on host port 8081, that is http://localhost:8081/):

mkdir logs notebook
docker run -p 8081:8080 --rm \
   -v "$PWD/logs":/logs \
   -v "$PWD/notebook":/notebook \
   -v /home/angelcc/Downloads/osm/planet/distribution/nodeId/100:/zeppelin/data/nodeId \
   -v /home/angelcc/Downloads/osm/planet/distribution/tile/10000x10000:/zeppelin/data/tile \
   -e ZEPPELIN_LOG_DIR='/logs' \
   -e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
   --name zeppelin \
   apache/zeppelin:0.9.0
