Rio+20 Environmental Resolutions Timeline
Digitizing an organization's documents not only saves trees; it can also help us explore and understand them. The global conversation on the environment goes back to December 1968, when General Assembly resolution 2398 (XXIII) brought attention to the "relationship between man and his environment."
This data visualization gives some historical context to the current discussion in Rio by showing the environmental resolutions and how they relate to each other.
Data Munging 40 Years of Resolutions
The dataset was a hand-selected list of General Assembly resolutions related to the environment. The Official Document System (ODS) provided PDFs of the resolutions, which included text data scanned with OCR.
As usual, the bulk of the work was getting a clean dataset. Even while working in the same physical office, it turned out that scraping the ODS search results was the best way to download full-text PDFs. ID discrepancies, referrer tracking, and server-side state management made this tricky, but not impossible.
The text from each PDF was extracted with the help of PDF Miner. The pdf2txt tool does some pretty impressive work to figure out how to translate the character soup into words and paragraphs based on the distance of each letter from its neighbours. It can take some parameter tweaking and it's never perfect but you can get close enough.
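To make the distance idea concrete, here is a toy sketch in Python. It is not PDF Miner's actual algorithm, just an illustration of the principle: characters on one line are grouped into words whenever the gap to the previous character exceeds a threshold. The function name, the (character, x_start, x_end) tuple layout, and the max_gap value are all assumptions made up for this example; max_gap stands in for the kind of parameter you end up tweaking.

```python
def group_into_words(chars, max_gap=2.0):
    """Group characters on one text line into words.

    chars: list of (character, x_start, x_end), sorted left to right.
    max_gap: hypothetical threshold; a larger horizontal gap starts a new word.
    """
    words = []
    current = ""
    prev_end = None
    for ch, x_start, x_end in chars:
        # A wide gap since the previous character means a word boundary.
        if prev_end is not None and x_start - prev_end > max_gap:
            words.append(current)
            current = ""
        current += ch
        prev_end = x_end
    if current:
        words.append(current)
    return words


# Made-up coordinates: "UN" and "ODS" separated by a wide gap.
line = [("U", 0, 6), ("N", 7, 13), ("O", 20, 26), ("D", 27, 33), ("S", 34, 40)]
words = group_into_words(line)
print(words)  # ['UN', 'ODS']
```

Set the threshold too low and words break apart; set it too high and they run together, which is why the output is never quite perfect.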
I originally wanted to extract resolution references based on a generic pattern, like "A/RES/###". Then I would have the resolution in an explicit format and it would be easy to match it with the other documents. However, references were more typically made without the "A/RES/" prefix so that didn't work. You often find things like this:
Recalling its resolutions 48/169 and 48/170 of 21 December 1993, 49/102 of 19 December 1994 and 51/168 of 16 December 1996...
In the end I had to settle for a top-down approach which used the list of environmental resolutions to search for references. That worked very well for this smaller dataset but wouldn't be as effective for larger ones. I'm just lucky the resolution numbers I was concerned with didn't overlap with the years.
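A minimal sketch of that top-down search might look like the following. The function name and the exact regular expressions are my own assumptions, not the original script. The lookarounds stop a short number such as "48/16" from matching inside "48/169", but nothing here protects a bare number from colliding with a year, which is exactly the luck mentioned above.

```python
import re


def find_references(text, known_ids):
    """Return the known resolution numbers that appear in text.

    known_ids: the hand-selected list of resolution numbers to look for.
    """
    found = []
    for res_id in known_ids:
        # Reject matches that are part of a longer number, e.g. "48/16" in "48/169".
        pattern = r"(?<![\d/])" + re.escape(res_id) + r"(?![\d/])"
        if re.search(pattern, text):
            found.append(res_id)
    return found


text = ("Recalling its resolutions 48/169 and 48/170 of 21 December 1993, "
        "49/102 of 19 December 1994 and 51/168 of 16 December 1996")

# The generic bottom-up pattern finds nothing, because the prefix is missing.
generic_hits = re.findall(r"A/RES/\d+", text)

# The top-down search only reports numbers from the known list.
known_hits = find_references(text, ["48/169", "49/102", "48/16", "50/1"])
print(generic_hits, known_hits)  # [] ['48/169', '49/102']
```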
Using the NetworkX Python library, I was able to generate a graph data structure based on these references and output it to GraphML and to JSON. The GraphML I threw into Gephi to do quick tests and preliminary exploration. The JSON output was used as the data source for the visualization.
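Roughly, that step could be sketched like this. The resolution labels and citation links below are placeholders invented for illustration, not real data, and the JSON layout (a "nodes" list and a "links" list) is an assumption rather than the exact format the visualization consumed.

```python
import json

import networkx as nx

# Placeholder data: each key cites the resolutions in its list.
references = {
    "C": ["A", "B"],
    "D": ["A", "C"],
}

# Directed graph: an edge points from the citing resolution to the cited one.
graph = nx.DiGraph()
for citing, cited_list in references.items():
    for cited in cited_list:
        graph.add_edge(citing, cited)

# GraphML as a string; nx.write_graphml(graph, "some_file.graphml") would
# write a file that Gephi can open.
graphml_text = "\n".join(nx.generate_graphml(graph))

# Hand-rolled JSON so the structure is explicit and easy to load in a browser.
graph_json = json.dumps({
    "nodes": [{"id": node} for node in graph.nodes()],
    "links": [{"source": u, "target": v} for u, v in graph.edges()],
})
```

The in-degree of a node (graph.in_degree("A")) then tells you how many other resolutions in the set cite it.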
Presenting the Data Visually
As much as I love the typical network visualization, with lines connecting lots of little bubbles, it really wasn't appropriate here. Those visualizations work best when chains of connection carry significant meaning: A is connected to B, which is connected to C. In social networks, a friend of a friend is more likely to be trusted, share common interests, and so on. However, for citations between General Assembly resolutions that are already filtered by the topic of the environment, it's not a very important metric.
The typical graph visualization puts a lot of emphasis on communicating these friend-of-a-friend relationships because it encodes the information in each node's position, the most important visual element. The proximity of two nodes is a good indication of how closely connected they are.
For this visualization the sequence is more important, so resolutions are represented as a series of boxes stacked by year along the horizontal axis. So here, position encodes time.
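As a rough sketch of that layout rule (in Python for consistency with the data-munging side, not the actual frontend code), each box gets an x position from its year and a y position from how many boxes are already stacked in that year. The function name, box sizes, and gap are arbitrary assumptions, and the years attached to the resolution numbers come from the quoted passage earlier.

```python
from collections import defaultdict


def layout_boxes(resolutions, box_w=12, box_h=8, gap=2):
    """Compute a top-left (x, y) for each resolution box.

    resolutions: list of (res_id, year) pairs.
    x encodes the year; y is the box's slot in that year's stack.
    """
    first_year = min(year for _, year in resolutions)
    stack_height = defaultdict(int)  # boxes placed so far, per year
    positions = {}
    for res_id, year in sorted(resolutions, key=lambda r: r[1]):
        x = (year - first_year) * (box_w + gap)
        y = stack_height[year] * (box_h + gap)
        positions[res_id] = (x, y)
        stack_height[year] += 1
    return positions


positions = layout_boxes([("48/169", 1993), ("48/170", 1993), ("49/102", 1994)])
print(positions)  # {'48/169': (0, 0), '48/170': (0, 10), '49/102': (14, 0)}
```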
This visualization had to work on IE7 so I had to forgo some of the convenience of d3.js and use Raphael.js for all the frontend drawing. It's a great little library that completely shelters you from dealing with differences between SVG and IE's VML.
One thing that's definitely missing here is a better drill-down capability. I wish you could see the text around each citation when you select an arc. Unfortunately, the quality of the output from the OCR scanning / PDF Miner combo wasn't good enough. So for now, the list of documents is printed at the bottom of the screen and you can click a link to access the document from UN ODS.