Here’s how we turned this complex saga into a collection of interaction networks:
- Link two characters each time their names (or nicknames) appear within 15 words of one another.
Each link corresponds to an interaction between the two characters. Note that this interaction could be direct or indirect. Here are some of the types of interactions that our method picks up.
- Two characters appearing together in the same location
- Two characters in conversation
- One character talking about another character
- One character listening to a third character talk about a second character
- A third character talking about two other characters
- And so on…
If two characters appear together, then clearly they are linked. If one character mentions another character, then that signifies some relationship between them. Again, they should be linked. Likewise, if two characters are spoken of together, they are linked in the mind of a third character (and also in the mind of the listener). For simplicity, we decided to make these links “undirected,” meaning that the links are mutual, even when one character references another character (which could instead be seen as a one-way link).
Admittedly, this is a rather naive approach to creating a network. However, we found that the resulting network was quite robust. We tried various thresholds for interaction (10 words, 15 words, 20 words), and we found that tuning to 15 words produced the most reasonable network for George R. R. Martin’s novels. (Note: we also automatically removed any edges of weight 1 or 2, judging these connections to be incidental.)
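To make the linking rule concrete, here is a minimal sketch in Python. It is not the code we ran. The function name `build_edges`, the `aliases` lookup table, and the simple tokenizer are assumptions for illustration, and it only handles single-word names. It also reads “within 15 words” as “token positions differ by at most 15,” which is one of several reasonable interpretations.

```python
import re
from collections import Counter


def build_edges(text, aliases, window=15, min_weight=3):
    """Count undirected co-occurrence links between characters.

    aliases: hypothetical dict mapping a lowercase one-word name or
             nickname to a canonical character name.
    window:  maximum distance, in words, between two mentions.
    min_weight: edges lighter than this are dropped (3 removes
             the edges of weight 1 or 2).
    """
    # Crude tokenizer: runs of letters, apostrophes and underscores.
    tokens = re.findall(r"[A-Za-z_']+", text)

    # Positions of every token that names a known character.
    mentions = [
        (i, aliases[t.lower()])
        for i, t in enumerate(tokens)
        if t.lower() in aliases
    ]

    weights = Counter()
    for a, (pos_a, char_a) in enumerate(mentions):
        for pos_b, char_b in mentions[a + 1:]:
            if pos_b - pos_a > window:
                break  # mentions are in order, so later ones are farther
            if char_a != char_b:
                # Sort the pair so (A, B) and (B, A) are the same edge.
                weights[tuple(sorted((char_a, char_b)))] += 1

    return {pair: w for pair, w in weights.items() if w >= min_weight}
```

Changing `window` to 10 or 20 reproduces the kind of threshold comparison described above. Note that this sketch counts every pair of nearby mentions, so a dense passage can add several units of weight to the same edge.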
Of course, determining when two characters have interacted is easier said than done. The more effort you put into identifying interactions, the better your network. Here are some more details.
The Cast of Characters
First, we compiled an exhaustive list of characters and their nicknames. We focused on three main resources:
- The Books. Our most naive effort was to parse the books, looking for character names by keeping track of capitalized words. This was a messy process, but did yield a reasonable start.
- Web Scraping. Using data science techniques, we processed A Wiki of Ice and Fire, a fan-created site for the books and the TV series. This was our best source of truth, since character articles contain lists of aliases and titles (some capitalized, some not). This site also includes a list of books that reference the character, but we found this information was not very reliable for the minor characters.
- First Appearance Spreadsheet. Leo King’s A Song of Ice and Fire Character Spreadsheet lists the first appearance of each character. This was helpful for us, but not quite as useful as the fan-created wiki.
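The capitalized-word pass over the books can be sketched roughly as follows. This is an illustrative heuristic, not our actual parser: the function name `candidate_names` and the `min_count` cutoff are assumptions, and the sentence splitting is deliberately crude. It skips the first word of each sentence, since that word is capitalized no matter what it is, which also means it misses names that only ever open a sentence.

```python
import re
from collections import Counter


def candidate_names(text, min_count=2):
    """Return capitalized words that recur mid-sentence, most frequent first."""
    counts = Counter()
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"[.!?]\s+", text):
        words = sentence.split()
        # Skip the sentence-initial word: its capital letter tells us nothing.
        for word in words[1:]:
            cleaned = word.strip(".,;:!?\"'()“”‘’")
            if cleaned.isalpha() and cleaned[0].isupper():
                counts[cleaned] += 1
    return [name for name, n in counts.most_common() if n >= min_count]
```

The output still needs manual cleanup, because place names, houses, and titles are capitalized as well. That is the messiness mentioned in the first bullet.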
Another problem to solve was ambiguous references. Many characters share the same names (Jon, Walder, Brandon) and titles (king, queen, maester). Does a given appearance of the name “Jon” refer to Jon Snow or Jon Arryn? When someone references “the king,” the reference could resolve to any number of people (Aegon, Robert, Joffrey, Robb, Stannis…), depending on who is speaking and the context in which they are speaking. Likewise, “dwarf” usually refers to Tyrion, but there are instances where it does not.
Disambiguation was a labor-intensive manual process. Ultimately, we decided to modify the source text to disambiguate the word, for example by replacing “king” with “king_Joffrey” in the appropriate places (so that we could still see the original phrasing of the reference, but also resolve it to a character).
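The tagging convention can be illustrated with a short sketch. The decisions themselves were made by hand. The code below only shows one way to apply them and read them back. Both function names and the `decisions` mapping (occurrence number to character) are hypothetical.

```python
import re


def tag_occurrences(text, word, decisions):
    """Rewrite chosen occurrences of `word` as `word_<Character>`.

    decisions: hand-made dict mapping the 0-based occurrence number
               of `word` to the character it refers to. Occurrences
               without an entry are left untouched.
    """
    seen = -1

    def replace(match):
        nonlocal seen
        seen += 1
        who = decisions.get(seen)
        return f"{match.group(0)}_{who}" if who else match.group(0)

    # \b keeps already tagged tokens such as king_Joffrey from matching again.
    return re.sub(rf"\b{re.escape(word)}\b", replace, text)


def split_tag(token):
    """Split a token into (original word, character or None)."""
    surface, sep, who = token.partition("_")
    return (surface, who if sep else None)
```

For example, `tag_occurrences("The king is dead. Long live the king.", "king", {0: "Robert", 1: "Joffrey"})` yields "The king_Robert is dead. Long live the king_Joffrey." The original wording stays visible in front of the underscore, and `split_tag` recovers the character behind it.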
Creating and Analyzing the Network
We generated a network for each book, using the disambiguated text and the list of characters and nicknames. We then loaded the network into Gephi, an easy-to-use network analysis application. There are more fully featured network analysis packages out there, but Gephi has enough features to create visualizations and perform basic analysis.
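One simple way to hand a network to Gephi is an edge-list CSV, which its spreadsheet importer can read. The sketch below assumes the edges are held in a dict keyed by character pairs. The function name `write_gephi_edges` is hypothetical, and the column headers follow the names Gephi's importer conventionally recognizes (Source, Target, Weight, Type), which is worth checking against the Gephi version you use.

```python
import csv
import io


def write_gephi_edges(edges, handle):
    """Write {(char_a, char_b): weight} as an undirected edge list."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight", "Type"])
    # Sort for a stable, diff-friendly file.
    for (char_a, char_b), weight in sorted(edges.items()):
        writer.writerow([char_a, char_b, weight, "Undirected"])


# Usage: write to an in-memory buffer; pass an open file in real use.
buffer = io.StringIO()
write_gephi_edges({("Jon Snow", "Samwell Tarly"): 3}, buffer)
```

Writing one file per book keeps the networks separate, matching the one-network-per-book setup described above.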
There are plenty of additional computations that we could run on our networks, so be sure to check back occasionally.