For my talk on High Performance Computing in R (which I had to reschedule due to a nasty stomach bug), I used Wikipedia linking data, an adjacency list of articles and the articles to which they link. This data was linked from DataWrangling and was originally created by Henry Haselgrove. The dataset is small on disk, but I needed a dataset that was huge, very huge. So, without a simple option off the top of my head, I took this data and expanded a subset of it into an incidence matrix, occupying 2GB in RAM. Perfect!
The subsetting was a bit of a problem because I had to maintain the links within the subgraph induced by the same. This required me to search dictionary objects for keys. This is where things went awry. Usually, I try to be efficient as possible. Since I was just producing a little example, and would never ever run this code otherwise, I wasn’t as careful.
The data were presented as a follows
1. First, I looked at from, and if from was in the chosen subset, keep it and proceed to 2, otherwise, throw it out. 2. Then, take the to nodes […]