You can view this document @ http://bit.ly/himym-graph
In this session we’ll be working with a graph of How I met your mother scrapped from the HIMYM wiki.
We’re going to write some queries to explore the dataset and learn Neo4j’s cypher language while doing so. If we have time we can even look at how to merge some other data sets in as well.
First things first, let’s get Neo4j running on your machines.
-
Download Neo4j from http://www.neo4j.com/download
-
Windows users: Install desktop application & then click the 'start' button
-
Mac/Linux: Unpack archive & then
./bin/neo4j start
-
Browse to http://localhost:7474. You should see the following:
insert image here
And we’re ready to go!
The first thing we need to do is load the episodes into Neo4j. I’ve already done the scrapping of the data into a CSV file which is hosted on github.
We’ll start by exploring that using Cypher LOAD CSV
command.
The LOAD CSV
command allows us to process CSV files hosted on a web server or our local file system.
The syntax looks like this:
LOAD CSV [WITH HEADERS] FROM "http or file URI" AS row
[FIELDTERMINATOR ","]
[MATCH MERGE RETURN WITH]
Let’s give it a try:
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/mneedham/neo4j-himym/master/data/import/episodes_full.csv" AS row
RETURN row
LIMIT 1
If you run that query you should see the following output:
We can see that an episode has lots of different attributes - a title, date aired, season, ranking etc.
Now let’s create an episode node for each row of the CSV file. We’ll do this by combining the LOAD CSV
command with the MERGE
command.
The MERGE
command does a MATCH
or CREATE
i.e. if the node already exists then we’ll return it and if it doesn’t then we’ll create it.
It’s quite useful in this context as we may want to run our import script multiple times but don’t want to end up with duplicate episodes.
The syntax of MERGE
is as follows:
MERGE (:Label1:Label2 {id: "primaryKeyValue"})
---
[source, cypher]
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/mneedham/neo4j-himym/master/data/import/episodes_full.csv" AS row MERGE (episode:Episode {id: TOINT(row.NumberOverall)}) ON CREATE SET episode.title = row.Title, episode.viewers = toFloat(row.Viewers), episode.dateAired = row.DateAired, episode.timestamp = TOINT(row.Timestamp), episode.number = TOINT(row.NumberInSeason), episode.season = TOINT(row.Season), episode.ranking = TOINT(row.Rating)
If everything has run properly you should now have 208 nodes in your database. Let's write a query to check that: [source, cypher]
MATCH (:Episode) RETURN COUNT(*) AS numberOfEpisodes
[table] The `COUNT` function does an implicit group by using any non aggregate fields as its key. In this case there are no other fields so it just counts all the rows returned by the `MATCH` clause. === Finding specific episodes By now you should have all the episodes loaded so let's try and find some individual episodes. We'll start by finding an episode by its `id` which in this case is its overall position in the series: [source, cypher]
MATCH (episode:Episode) WHERE episode.id = 5 RETURN episode
or we can use a bit of syntactic sugar: [source, cypher]
MATCH (episode:Episode {id: 5}) RETURN episode
[table] ==== Exercises * Find an episode by title rather than by id * Only return the `title` and `dateAired` properties rather than all the properties * Which episode had the highest viewership? === What's happening under the covers? We have two ways of exploring what Neo4j is doing when we execute a query. Prefixing a query with `EXPLAIN` will return the query plan based on statistics the database has. It won't run the query. Prefixing a query with `PROFILE` will run the query and return the actual query plan that was executed. Give those a try with the queries we just ran. What do you notice? === Index all the things Let's add an index on a few properties of an episode so that we can find them more quickly in future without having to scan through all the episodes [source, cypher]
CREATE INDEX ON :Episode(id)
[source, cypher]
CREATE INDEX ON :Episode(title)
Now let's import some more data to make it more of a graph and less of a key/value store. === Seasons If you explored the episodes data carefully you'll remember that we actually stored a `season` property on the episode nodes. This is fine if we only intend to use the season as meta data but if we want to write queries that start with a season it would be better off represented as a node. We can refactor the graph to introduce season as a first class citizen with the following query: [source, cypher]
MATCH (episode:Episode) MERGE (season:Season {number: episode.season}) MERGE (episode)-[:IN_SEASON]→(season)
This query has three steps: * Look up all the episodes * Create `season` nodes using the `MERGE` command so we don't get duplicates * Create a relationship from the episode to its season ==== Exercises * How many episodes were there in each season? === Finding next/previous episodes Let's say we've looked up an episode and we want to get the next episode. How would we do that? [source, cypher]
MATCH (episode:Episode {id: 6})
WITH episode MATCH (nextEpisode:Episode {id: episode.id + 1}) RETURN episode, nextEpisode
or we can feed in the episode numbers from outside [source, cypher]
MATCH (episode:Episode) WHERE episode.id IN [6,7] RETURN episode
Alternatively we can introduce an explicit relationship between the episodes to make this type of query easier. We want to connect every pair of adjacent episode nodes with a `NEXT_EPISODE` relationship [source, query]
MATCH (episode:Episode)
WITH episode ORDER BY episode.id
WITH COLLECT(episode) AS episodes FOREACH(i in RANGE(0, length(episodes)-2) | FOREACH(n1 in [episodes[i]] | FOREACH(n2 in [episodes[i+1]] | MERGE (n1)-[:NEXT_EPISODE]→(n2))))
First we get all the episodes sorted in `id` order and collect them into an (ordered) array using the `COLLECT` function. We use the outer `FOREACH` function to iterate over all the events and ideally we'd just connect `episodes[i]` and `episodes[i+1]` directly. Unfortunately this isn't valid Cypher so instead we create two arrays of length 1 and `FOREACH` over those and execute the `MERGE` command. Now let's rewrite our earlier query which found episode 6 and the next episode: [source, cypher]
MATCH (episode:Episode {id: 6})-[:NEXT_EPISODE]→(nextEpisode) RETURN episode, nextEpisode
Not bad. If we want to get the next 5 episodes we'd write the following query: [source, cypher]
MATCH (episode:Episode {id: 6})-[:NEXT_EPISODE*1..5]→(nextEpisode) RETURN episode, nextEpisode
[table] There's a bit of duplication here in that episode 6 is repeated on every line. We can fix that with a slight tweak of the query: [source, cypher]
MATCH (:Episode {id: 6})-[:NEXT_EPISODE*0..5]→(episode) RETURN episode
[table] ==== Exercises * Create a linked list of the seasons === References The wiki also contains information about the references that episodes make to each other. Let's load this into the graph: [source, cypher]
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/mneedham/neo4j-himym/master/data/import/references.csv" AS row MERGE (:Reference {text: row.ReferenceText})
[source, cypher]
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/mneedham/neo4j-himym/master/data/import/references.csv" AS row MATCH (reference:Reference {text: row.ReferenceText}) MATCH (referenced:Episode {id: TOINT(row.ReferencedEpisodeId)}) MERGE (referenced)←[:HAPPENED_IN]-(reference)
[source, cypher]
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/mneedham/neo4j-himym/master/data/import/references.csv" AS row MATCH (reference:Reference {text: row.ReferenceText}) MATCH (referencing:Episode {id: TOINT(row.ReferencingEpisodeId)}) MERGE (referencing)-[:REFERENCES]→(reference)
Let's have a quick check that we've got that imported correctly: [source, cypher]
MATCH (episode:Episode)←[:HAPPENED_IN]-(reference)←[:REFERENCES]-(otherEpisode) RETURN episode, reference, otherEpisode LIMIT 10
==== Exercises * Which episode is referenced most often? * For episode 'z' which ones should I have seen to understand all the references made? -> Connecting the episodes together -> What sort of queries can we do with a linked list of episodes? == Topics graph I ran the transcript for each of the episodes through Prismatic's Topic Graph and stored the results into a CSV file. We can add this to the graph as well: [source, cypher]
Can we do some sort of episode similarity here? Jaccard similarity? Cosine similarity? == Building the query for an episode's page Now we'll put together everything we've learned and build a query for an episode's profile page. These are the bits of data we want to return: * Episode title * Characters * Writers * Similar episodes * Next & previous episodes Multi query vs All in one