Using basic data tools for history research #mystory #lovedata18

It’s the big day for Love Data Week! Today I am featuring a few of my favorite tools using my own qualitative dataset.

I am working on my PhD in History and, although I am still early in my program, I began research for my dissertation this year. I am currently examining a set of petitions sent to the U.S. Congress calling for action during the Armenian massacres of the 1890s. Instead of just reading through and taking notes in Word, I decided to collect information systematically so that I can use the data for a potential digital humanities project.

As I read through, I collected the dates, locations, types of meetings, and representatives, along with notes on the language used in each petition. I originally entered the information in Excel because I needed to finish a short paper, but I am switching to a Qualtrics form for the final analysis. I still need to test it and make sure it will work for my questions. I mocked it up in half an hour at most, the day before I went to NARA. If you have suggestions, let me know.

Qualtrics is easy to use and a more reliable way to input data because you can control the types of values entered in each field. For example, I am interested in changes over time, so I can constrain both the date fields and the congressional information and avoid input errors. Moreover, the data can be exported easily and analyzed in any software. Because I have a large set of petitions, a system like this is essential, and it is more reliable for counts and a broad overview than simply taking notes on the petitions.
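To give a sense of what that field control buys you, here is a minimal sketch in Python of the kind of validation a form tool enforces automatically. The field names and record shape are hypothetical, just for illustration; the Congresses listed are the ones that sat during the 1890s.

```python
from datetime import datetime

# Congresses in session during the 1890s (illustrative whitelist).
VALID_CONGRESSES = {"52nd", "53rd", "54th", "55th"}

def validate_petition(record):
    """Return a list of input errors for one hypothetical petition record."""
    errors = []
    # A controlled date field rejects free-text dates like "Feb 1896".
    try:
        datetime.strptime(record["date"], "%Y-%m-%d")
    except ValueError:
        errors.append(f"bad date: {record['date']}")
    # A controlled choice field rejects values outside the whitelist.
    if record["congress"] not in VALID_CONGRESSES:
        errors.append(f"unknown congress: {record['congress']}")
    return errors

print(validate_petition({"date": "1896-02-14", "congress": "54th"}))  # []
print(validate_petition({"date": "Feb 1896", "congress": "60th"}))
```

The point is simply that errors get caught at entry time rather than discovered months later during analysis.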


Petition from Boonton, NJ

Once I had a starting set, I used OpenRefine to clean up some fields. OpenRefine is a much easier and more reliable tool for cleaning data than Excel. For example, in my spreadsheet I was collecting information about the specific representative to whom each petition was sent. Again, I want to know about this issue over time, so I’d like to see whether individuals received specific petitions multiple times and whether they got petitions on other issues. But as you can see from this picture, the handwriting is often not legible, and this is actually one of the easiest to read. Because I was inputting data under a deadline, some of the names of legislators are inconsistent (see the image below). I can use OpenRefine to clean those fields quickly: all the fields with Fletcher, Minn can quickly become Fletcher, MN. Also, if I have trouble making out someone’s name on one petition but can see their state, I can use OpenRefine to get clues about the name from other petitions.
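For readers curious what OpenRefine is doing under the hood when it groups variant spellings, here is a minimal sketch of the idea in Python. The fingerprint function is a simplified take on OpenRefine's "fingerprint" key-collision method (lowercase, strip punctuation, sort the unique tokens); once variants collide on the same key, you can merge them into one canonical value such as Fletcher, MN.

```python
from collections import defaultdict

def fingerprint(value):
    """Simplified key-collision fingerprint: lowercase, drop punctuation,
    sort the unique tokens so variant spellings produce the same key."""
    tokens = value.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(sorted(set(tokens)))

# Illustrative variants of the same legislator, as might appear
# after hurried data entry from hard-to-read handwriting.
names = ["Fletcher, Minn", "Fletcher, Minn.", "fletcher minn"]

clusters = defaultdict(list)
for name in names:
    clusters[fingerprint(name)].append(name)

for key, variants in clusters.items():
    print(key, "->", variants)
```

All three variants land in one cluster, which is exactly the grouping OpenRefine presents for a one-click merge.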

OpenRefine dataset

My dataset in OpenRefine


The tool can do much more than this; it is definitely worth checking out for spreadsheet cleaning!

After I cleaned up my spreadsheet, I mapped the petitions’ locations of origin just for fun. I used Google Fusion Tables because we have GAFE at UNCG and I wanted to compare it with ArcGIS Online. For simple mapping it is quite helpful and easy to use: as long as it can recognize location fields (City, State, for example), it will try to map the data. I used this to get a sense of where the petitions were coming from. Although I thought I knew, having read each one, it was difficult to keep a sense of the geography after more than 200 petitions. I assumed most would be from the Northeast, but I was surprised by the actual geographic spread. Of course, a lot more could be done with this; for example, the size of the bubbles could be scaled by the number of petitions from each location.

Map of locations of origin for petitions

Next, I did some basic text visualization using Voyant to explore the themes. I uploaded the parts of the petitions that referred to the role of the state (my research question) just to explore. You can do several things with Voyant for basic text mining, but the word clouds are always fun. You can also see the most frequently used words; in this case they were government (24), Turkish (20), people (19), humanity (18), and right (17). Again, this was a subset of my petitions (only one column), but it gives you an idea of what you can do just starting out.
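The frequency count behind that word list is simple enough to sketch yourself. Here is a minimal Python version with a hypothetical sample sentence and a toy stopword list (Voyant's actual stopword handling is more sophisticated); the real counts above came from my petition subset, not from this sample.

```python
from collections import Counter
import re

# Toy stopword list for illustration only.
STOPWORDS = {"the", "of", "a", "to", "for", "and", "in", "must", "have"}

def top_words(text, n=5):
    """Return the n most frequent non-stopword words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(n)

# Made-up sample text echoing the petition themes.
sample = ("The government of the people must act for humanity. "
          "The people have a right to ask the government to act.")
print(top_words(sample))
```

A tool like Voyant layers visualization, trend lines, and corpus comparison on top of exactly this kind of count.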

I’ve also played around some with Atlas.ti for my documents that are not petitions, especially my primary-source newspapers. I haven’t done much with it yet, but I love the iPad version of Atlas.ti and can see it being useful later, when I have a larger set of primary documents to work with. It is mostly used for qualitative data work in the social sciences, but if you are a historian working with Atlas.ti, NVivo, or Dedoose, PLEASE get in touch with me. I would love to have some use-case scenarios.

Eventually I would also like to do some basic network analysis using Gephi, but my data is not ready for that. And of course I could never go anywhere without my Zotero library, which provides immediate access to all of my secondary sources and some primary documents. Yes, that is data too!

Historians wanting to do more systematic examinations of their “data” sets of primary documents should check out the Programming Historian. You can do a lot of analysis now that you couldn’t do even five years ago, but you need to start by collecting and managing the information systematically. If you would like tips on doing this, please contact me and we can talk through your project!


A dream of a “fact-based worldview”: The passing of Hans Rosling


Have a few hours to kill? Gapminder is the thing for you!

For the past seven years I have used Hans Rosling’s video “200 countries, 200 years, 4 minutes” as an introduction to economic development. I teach a general education course, so the students have varying levels of comfort with data and economics. The video has always generated lively discussions about development. The students talk about how we conceive of “poverty” and whether a standard-of-living measure is sufficient. On their own, they usually develop an understanding of poverty that aligns with the Human Development Index, and they raise questions about access to adequate food, public schools, and free education. We also discuss inequality within societies, which Rosling nicely demonstrates with his China example in the video.

Rosling passed away yesterday, and it is a great loss for many communities of practice, including the government information world. He worked to promote a fact-based worldview through his TED talks and the development of Gapminder. His aim in all of these endeavors was to make data accessible to everyone as a protection against ignorance. His videos provide an entrée into a world where data is used to ask questions and test hypotheses rather than simply to support opinions.

If you aren’t familiar with Rosling, I encourage you to watch his videos, especially his most famous TED talk (below). He was a visionary and, in this world of alternative facts, he will be missed.

The Ghost Map

The Ghost Map: The Story of London’s Most Terrifying Epidemic–and How It Changed Science, Cities, and the Modern World

Finished: Jan. 17
Rating: A-

I have to admit I sometimes enjoy a good epidemiological history. Johnson’s account of the 1854 cholera outbreak in Victorian London is engaging, but the focus isn’t so much cholera as the larger challenges of urban living. In the vein of epidemiological history, I preferred Marilyn Chase’s The Barbary Plague, in which she describes the bubonic plague outbreak in San Francisco from 1900 to 1909. Of course, she had a larger event to describe, so the focus stayed solidly on the plague. Johnson’s account seems a bit more scattered, but it reads well. His descriptions of the city and its…um…untidiness are strong enough that you feel like you could be standing (and smelling) in the middle of Soho. He is able to balance the narrative elements against his description of the cholera bacterium and the medical aspects of the story.

For someone who enjoys the beauty of data visualization, this book is a good look at the gathering of real-life data and the illumination of a problem through its visualization. Unfortunately, the idea of the ghost map felt a bit perfunctory, as if it were tacked on to the end of the story. And his epilogue, on the difficulties and triumphs of modern urban living in relation to sustainability, disease, and nuclear weapons, while a fascinating read, seems to pull focus away from the original story (and, again, from the ghost map). Overall it is a good read, especially if you like medical histories.