Tuesday, June 17, 2014

Studying Ted Talks, Anscombe's Quartet, and Modern Languages Enrollment

While the feed from a newer github/jekyll blogging platform (patilv.github.io) is registered with blog aggregators, here are snippets of three posts that were recently published at the new site. Please click on the titles to visit the corresponding page.

A recent article on openculture.com by Dan Colman mentioned a list of 1756 Ted Talks maintained by “someone” as a spreadsheet. A link to this sheet can also be found on this page on Wikipedia. It was titled “Ted Talks as of 5/23/2014”. I downloaded that spreadsheet on 6/12/2014 from this link and saved it as a CSV file. It turned out to list 1755 talks, not 1756. Here, I make a wordcloud of the titles of these talks and a few ggplots, colored with Karthik Ram’s Wes Anderson palettes for R, to identify speakers with 3 or more appearances. The code and data for this post can be found on my github site at this link.
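
The counting behind the wordcloud and the speaker plots can be sketched roughly as follows. The post's own code is in R; this Python version is only an illustration. The field names `speaker` and `title`, the stopword list, and the sample rows are all assumptions made up for the sketch, not values from the actual spreadsheet.

```python
import re
from collections import Counter

# Hypothetical placeholder rows, NOT taken from the real spreadsheet.
talks = [
    {"speaker": "Speaker A", "title": "The power of ideas"},
    {"speaker": "Speaker B", "title": "Why ideas spread"},
    {"speaker": "Speaker A", "title": "How cities grow"},
    {"speaker": "Speaker A", "title": "The future of cities"},
    {"speaker": "Speaker C", "title": "A map of the brain"},
]

# A tiny, assumed stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "of", "a", "and", "to", "in", "for", "is", "how", "why"}


def frequent_speakers(rows, min_talks=3):
    """Return speakers with at least `min_talks` appearances."""
    counts = Counter(row["speaker"] for row in rows)
    return {name: n for name, n in counts.items() if n >= min_talks}


def title_word_counts(rows):
    """Count lowercase title words, skipping stopwords."""
    words = []
    for row in rows:
        for word in re.findall(r"[a-z']+", row["title"].lower()):
            if word not in STOPWORDS:
                words.append(word)
    return Counter(words)


speakers = frequent_speakers(talks)
word_counts = title_word_counts(talks)
print(speakers)
print(word_counts.most_common(3))
```

The word counts are what a wordcloud scales its text by, and the filtered speaker counts are what a bar chart of repeat speakers would plot.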

Anscombe’s quartet is a set of four datasets, each with two variables (x and y) and 11 observations. It has been used to demonstrate the importance of graphically displaying data. It has appeared not only in books (for example, on the first page of the first chapter of Tufte’s seminal work, The Visual Display of Quantitative Information), but also in scholarly papers (for example, see Healy and Moody, 2014) and blog posts (for example, see Hirst). Here, I use ggvis in the shiny environment to play with the quartet. The code for the post and the accompanying shiny app can be found on my github site.
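
Why the quartet makes the case for plotting can be seen from its summary statistics alone. The sketch below is not the ggvis/shiny code from the post; it is a standard-library Python illustration that hard-codes Anscombe's published values and computes the mean, sample variance, and Pearson correlation for each set. The helper names `pearson` and `summarize` are my own.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's published values; sets I-III share the same x column.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def summarize(xs, ys):
    """Mean and sample variance of x and y, plus their correlation."""
    return {
        "mean_x": mean(xs),
        "var_x": variance(xs),
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "corr": pearson(xs, ys),
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 2) for key, value in s.items()})
```

The four rows of output agree closely with one another, even though scatterplots of the four sets look nothing alike.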

Published first at KDnuggets. This was an extension of my earlier post on Modern Languages Enrollments in the US. In that post, I used data from MLA surveys of enrollments in institutions of US higher education between 1983 and 2009 and found that enrollments in Indian languages were low compared with enrollments in 10 other languages besides English. These 10 languages were French, German, Italian, Japanese, Spanish, Arabic, Chinese, Korean, Portuguese, and Russian. In this extension, I used data from 22 survey years since 1958, the first year for which the modern languages enrollment database provides data, to study the pattern and number of students enrolling in these 11 languages.