Find 'influencers' on a Twitter graph
None
20 mins
$ ~/spark/bin/spark-shell
Perform all of the following steps by entering the commands in the Spark Scala shell.
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
Vertex data structure: Twitter handle, number of followers, and gender of the tweeter.
val vertexArray = Array(
(1L, ("@markkerzner", 309, "M")), // (Name, # followers, gender)
(2L, ("@mjbrender", 3101, "M")),
(3L, ("@dridisahar1", 27, "F")),
(4L, ("@dez_blanchfield", 38600, "M")),
(5L, ("@ch_doig", 519, "F")),
(6L, ("@Sunitha_Packt", 332, "F")),
(7L, ("@WibiData", 2477, "N")) // company, so gender neutral
)
In this step, all of the other accounts are my followers, so each of them is connected to me (vertex 1L).
val edgeArray = Array(
Edge(1L, 2L, 7), // src, dest, # retweets
Edge(1L, 3L, 2),
Edge(1L, 4L, 4),
Edge(1L, 5L, 3),
Edge(1L, 6L, 1),
Edge(1L, 7L, 2)
)
We are using data from a real Twitter account; if you want, you can use your own.
val vertexRDD = sc.parallelize(vertexArray)
// vertexRDD: RDD[(Long, (String, Int, String))]
val edgeRDD = sc.parallelize(???)
// edgeRDD: RDD[Edge[Int]]
val graph = Graph(???, ???)
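If the argument order of `Graph(...)` is unclear, the following minimal sketch may help. It uses hypothetical toy data (`toyVertices`, `toyEdges`, and `toyGraph` are illustrative names, not part of the lab) and assumes it is pasted into the Spark shell, where `sc` already exists.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Hypothetical toy data, not the lab's Twitter data.
// `sc` is the SparkContext that spark-shell creates for you.
val toyVertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val toyEdges: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 10), Edge(1L, 3L, 20)))

// Graph(vertices, edges): the vertex RDD comes first, then the edge RDD.
val toyGraph: Graph[String, Int] = Graph(toyVertices, toyEdges)

// Quick sanity check on the size of the graph.
println(s"vertices = ${toyGraph.numVertices}, edges = ${toyGraph.numEdges}")
```

Checking `numVertices` and `numEdges` right after construction is a cheap way to confirm that both RDDs were passed in the intended positions.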
graph.vertices.collect.foreach { case (id, (name, nFollow, gender)) =>
println(s"Tweeter $name has $nFollow followers and is $gender") }
graph.vertices.filter { case (id, (name, followers, gender)) => gender != "M" }.collect.
foreach { case (id, (name, followers, gender)) =>
println(s"$name should be a $gender with $followers followers") }
graph.vertices.filter { case (id, (name, nFollow, gender)) => nFollow > 1000 }.collect
graph.edges.filter { case (edge) => edge.attr > 5 }.collect
graph.edges.filter { case (edge) => edge.attr > 5 }.count
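Filtering `graph.edges` returns only vertex IDs and the edge attribute. To see the handles on both ends of an edge, one option is the `triplets` view, which joins each edge with its source and destination vertex attributes. The sketch below uses a hypothetical toy graph (`demo` and `heavy` are illustrative names); the same pattern should apply to the lab's `graph` once it is built.

```scala
import org.apache.spark.graphx._

// Hypothetical toy graph: (handle, # followers) vertices, retweet-count edges.
// `sc` is the SparkContext that spark-shell provides.
val demo = Graph(
  sc.parallelize(Seq((1L, ("@a", 10)), (2L, ("@b", 20)), (3L, ("@c", 30)))),
  sc.parallelize(Seq(Edge(1L, 2L, 7), Edge(1L, 3L, 2))))

// Each triplet exposes srcAttr, dstAttr, and the edge attr together.
val heavy = demo.triplets
  .filter(t => t.attr > 5)                 // keep edges with more than 5 retweets
  .map(t => s"${t.srcAttr._1} -> ${t.dstAttr._1}: ${t.attr} retweets")
  .collect()

heavy.foreach(println)
```

Mapping each triplet to a plain string before `collect()` keeps the result independent of the triplet objects themselves.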
// Note: these counts include every vertex, so the account owner (vertex 1L, gender "M") is counted too
val maleFollowerCount = graph.vertices.filter { case (id, (name, nFollow, gender)) => gender == "M" }.count
val femaleFollowerCount = graph.vertices.filter { case (id, (name, nFollow, gender)) => gender == "F" }.count
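Two variations on the counting step, sketched on hypothetical toy data (`demoUsers`, `demoEdges`, `demoGraph`, `byGender`, and `top2` are illustrative names, not part of the lab). The first counts every gender label in one pass instead of one filter per label. The second ranks accounts by follower count, which is one simple way to read 'influencer' here. Other definitions are possible.

```scala
import org.apache.spark.graphx._

// Hypothetical toy data in the lab's vertex shape: (handle, # followers, gender).
// `sc` is the SparkContext that spark-shell provides.
val demoUsers = sc.parallelize(Seq(
  (1L, ("@a", 300, "M")),
  (2L, ("@b", 3000, "M")),
  (3L, ("@c", 30, "F"))))
val demoEdges = sc.parallelize(Seq(Edge(1L, 2L, 7), Edge(1L, 3L, 2)))
val demoGraph = Graph(demoUsers, demoEdges)

// One pass over the vertices: count how many fall under each gender label.
val byGender = demoGraph.vertices
  .map { case (_, (_, _, gender)) => gender }
  .countByValue()

// Rank vertices by follower count, highest first, and keep the top two.
val top2 = demoGraph.vertices
  .sortBy({ case (_, (_, nFollow, _)) => nFollow }, ascending = false)
  .take(2)

top2.foreach { case (_, (name, nFollow, _)) => println(s"$name: $nFollow followers") }
```

`countByValue()` returns a local `Map` to the driver, so it suits a small set of distinct labels such as the gender field used here.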