Thursday, April 25, 2013

Interacting, on demand, with 2012 best cities data and plotting different graphs ---- Experiments with ggplot2 on shiny server

UPDATE: THE BLOG/SITE HAS MOVED TO GITHUB. THE NEW LINK FOR THE BLOG/SITE IS patilv.github.io and THE LINK TO THIS POST IS:
http://bit.ly/1vRPV8h .  PLEASE UPDATE ANY BOOKMARKS YOU MAY HAVE.

Data from slides on Bloomberg Businessweek's site: http://images.businessweek.com/slideshows/2012-09-26/americas-50-best-cities

In this post, I show how one can interact with a dataset and generate graphs on the fly (technically, I've predetermined which graphs to create, but it should not be difficult to build a menu of possible graphs). This required setting up a Shiny server running on Ubuntu. That server is hosted by Gonzaga, and making it public is a work in progress; big thanks to Bob Toshack for his help in trying to get it there. In the meantime, I am using the hosting service provided by RStudio. Thanks a ton, RStudio, for this opportunity and the wonderful set of tools you have. I am also getting tired of this best cities dataset, so these are the last few apps that use it. Please click on the pictures below to go to the site and play with each app. Remember to come back here for the next set of interactive graphs.

1. Generating bar plots and/or dot plots of any variable you choose (from the preloaded data).



2. Comparing distributions of two different variables (can be extended to more than 2 variables) using their bar graphs (can use other graphs as well).





3. Exploring the relationship between two different variables for cities of different ranks. Scatter plots of rank of cities, by two variables.


4. Extending the above scatter plot to simultaneously see two scatter plots and four variables.




Saturday, April 13, 2013

Percentage with Graduate Degree in Best Cities of 2012


Data from Bloomberg Businessweek's slideshow. Please click image to interact with it.


Number of Colleges in best cities of 2012


Data from Bloomberg Businessweek's slideshow. Please click image to interact with it.


Number of Bars (yes, for drinking) in the best cities of 2012

Data from Bloomberg Businessweek's slideshow. Please click on the image to interact with it.


Motivating Students

UPDATE: THE BLOG/SITE HAS MOVED TO GITHUB. THE NEW LINK FOR THE BLOG/SITE IS patilv.github.io and THE LINK TO THIS POST IS:
http://bit.ly/SrQK8k .  PLEASE UPDATE ANY BOOKMARKS YOU MAY HAVE.

Figure shown to students in a particular class to show the effect of slacking after the mid-semester grades are received.

Background information: A course has multiple components: exams, projects, quizzes, assignments, etc. The objective of this set of plots was twofold: (a) to show students that their mid-term grade is not the same as their grade on the exam and that it includes several other components, and (b) to show students that their mid-term grades could change depending on how well or poorly they performed during the rest of the semester. The left column of plots shows a histogram (top) and a dot plot (bottom) of student scores on Exam 1 of the course, with color denoting the grade each student received at mid-term. In the right column, color shows the final grades of the same students. Student positions in all four plots are fixed by their Exam 1 score. One marked change from the left column to the right is the brown "B" bleeding into the red/pink "A" region, suggesting that many students who received an A at mid-term ended up with a B at the end of the semester because of their subsequent lackluster performance. A third objective of this figure, in hindsight, is to show how ggplot automatically picks the sequence of levels for its legend. Yes, B+ should appear above B in the legend, and corrections of that sort need to be made manually. To that extent, ggplot isn't a panacea for all problems.
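The legend-ordering issue can be illustrated outside of R. The sketch below is in Python rather than ggplot, and the grade labels and the `GRADE_ORDER` list are assumptions made up for illustration. It shows how a plain alphabetical sort places B ahead of B+, and how an explicit, hand-written order fixes that. In ggplot, the analogous fix is to set the factor levels by hand before plotting.

```python
# Hypothetical letter grades, in the order they might appear in a class roster.
grades = ["B", "A-", "B+", "A", "C+", "B-", "C"]

# Default alphabetical ordering: "B" sorts before "B+" because a shorter
# string sorts ahead of any longer string that starts with it.
alphabetical = sorted(set(grades))

# Explicit ordering chosen by hand (an assumption about the grading scale).
GRADE_ORDER = ["A", "A-", "B+", "B", "B-", "C+", "C"]
rank = {grade: position for position, grade in enumerate(GRADE_ORDER)}
manual = sorted(set(grades), key=rank.__getitem__)

print(alphabetical)  # alphabetical order, with "B" ahead of "B+"
print(manual)        # hand-specified order, with "B+" ahead of "B"
```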

Data Analysis Course

UPDATE: THE BLOG/SITE HAS MOVED TO GITHUB. THE NEW LINK FOR THE BLOG/SITE IS patilv.github.io and THE LINK TO THIS POST IS:
http://bit.ly/1gWyfUT .  PLEASE UPDATE ANY BOOKMARKS YOU MAY HAVE.

Although my initial motive for signing up for the Coursera MOOC titled "Data Analysis" in Jan 2013 was to get a sense of how MOOCs work and how quickly they might make my day job extinct, I must admit I now feel safer that my job is likely to continue to exist (the MOOC model has its benefits and limitations, which I will share in a later post). More importantly, this fun course, taught by Jeff Leek from Johns Hopkins, got me acquainted with R and its related packages. It required two papers (my submissions are attached to the hyperlinks in the rest of the sentence): the first on exploratory analysis and the second on building a predictive model. The first two figures below were created for the two papers.

Figure created for the first paper on exploratory analysis (created using ggplot).
 

Panel A shows that the distribution of interest rates (and the mean rate) differs across the purposes for which people apply for a loan. Panel B shows that the distribution of interest rates (and the mean rate) differs across applicants' states. Panel C shows that there is, on average, a negative relationship between FICO score and interest rate: the higher an applicant's FICO score, the lower the interest rate they paid. Panel D shows that applicants who applied for a 60-month loan generally paid a higher interest rate than those who applied for a 36-month loan. Panel E shows the interaction between loan length and the amount requested (36-month loans are shown in red and 60-month loans in blue). Specifically, when the amount requested is low, interest rates do not differ between 36-month and 60-month loans, but as the amount requested increases, interest rates rise for 60-month loans and not for 36-month loans. Created using ggplot.

Figure for the predictive model paper (Panel A was created using ggplot).

Panel A shows results that help determine the number of components to retain in the principal components analysis of the training data. Using parallel analysis, it compares eigenvalues generated from the training data with eigenvalues from randomly generated datasets of the same sample size and number of variables. Please note that, for scaling purposes, the graph shows only the first 60 components (not all 561) and does not plot the eigenvalues of the first two components from the training data (283.39 and 36.56). According to the results, the first 42 components from the training data have higher eigenvalues than the corresponding eigenvalues from the randomly generated datasets. Components 43 through 561 explain no more variance than would be expected by chance. Hence, only 42 components are retained.
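The retention rule described for Panel A can be sketched in a few lines. This is a minimal Python illustration, not the analysis used in the paper: the function name `components_to_retain` is hypothetical, the eigenvalues are invented for a six-variable toy case, and it assumes the random-data reference eigenvalues have already been computed (the post does not say whether their mean or an upper percentile was used).

```python
def components_to_retain(observed, random_reference):
    """Count the leading components whose observed eigenvalue exceeds the
    eigenvalue from randomly generated data at the same position."""
    retained = 0
    for obs, rand in zip(observed, random_reference):
        if obs <= rand:
            break  # stop at the first component that does no better than chance
        retained += 1
    return retained


# Made-up eigenvalues for a six-variable toy example (not the course data).
observed_eigenvalues = [2.90, 1.40, 1.15, 0.30, 0.15, 0.10]
random_eigenvalues = [1.30, 1.18, 1.05, 0.95, 0.85, 0.67]

print(components_to_retain(observed_eigenvalues, random_eigenvalues))
```

With these toy numbers, the first three observed eigenvalues exceed their random counterparts and the fourth does not, so three components would be kept.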


Panel B is a heat map of the confusion matrix showing the predictive ability of the support vector model on the testing data. The overlap between predicted and actual activity values is shown in the heat map. The purple regions show higher degrees of overlap (accurate classification), and the aqua blue regions show low to no misclassification. Different shades of aqua blue denote the overlap between the predicted activity of standing and the actual activities of laying and sitting. The kappa measure of this confusion matrix was 0.89, suggesting that the prediction was almost perfect.
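For readers unfamiliar with the kappa measure, here is a minimal Python sketch of how Cohen's kappa is computed from a confusion matrix: the observed agreement (the share of cases on the diagonal) is compared with the agreement expected by chance from the row and column totals. The function name `cohens_kappa` and the counts in `toy_matrix` are invented for illustration; they are not the matrix from the paper.

```python
def cohens_kappa(matrix):
    """Cohen's kappa for a square confusion matrix
    (rows: actual class, columns: predicted class)."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: proportion of cases on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement: from the products of matching row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Made-up counts for three activities (illustrative only).
toy_matrix = [
    [45, 5, 0],
    [4, 46, 0],
    [0, 1, 49],
]

print(round(cohens_kappa(toy_matrix), 3))
```

A kappa of 1 means perfect agreement between predicted and actual classes, and a kappa of 0 means agreement no better than chance.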