Wednesday, June 26, 2013

Analyses of the Best Undergraduate (US-based) Business Schools of 2013


Link to the code for the analysis program is given at the very end.
I teach in a business school and am always fascinated by rankings published by different outlets. As much as one would like to think that they are meaningless, they seem to influence how many business schools actively try to manage their reputation. One such ranking of undergraduate business schools was recently published by Bloomberg Businessweek. I thought I'd take a closer look at the numbers and try to present my analyses using only graphs and charts. The Businessweek article presented its data in tables on its website. I have recreated that table below. (You can click on any column header to sort the table based on that column.)

For this analysis, we ignore the 2012 Rank column. A description of the variables and the methodology is provided in the article here and here. In all, 124 schools have been ranked. One thing to note: the 2013 rank is based on an index number they computed, with 100 being the highest score (for the first-ranked school, Notre Dame) and 20.35 being the lowest (for California-Riverside, which was ranked 124th). So, when it comes to a school's rank, lower is better; for index numbers, higher is better. This is something to keep in mind when interpreting the results.

How many schools are ranked in different States?
Below is the US map, color coded by the number of ranked schools in each state. Since Washington DC is too small to show up on the larger map, that region gets its own smaller map. (You may have to scroll to the right.)

A few observations
1. Ohio has the highest number of schools represented in this ranking.
2. A few states are not represented in the ranking at all.
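The counts behind a map like this come from a simple tally of schools per state. Here is a minimal sketch of that tally in Python (the analysis itself was done in R); the school names and rows are made-up placeholders, not the actual Businessweek data.

```python
from collections import Counter

# Hypothetical (school, state) rows; the real table has 124 schools.
schools = [
    ("School A", "OH"),
    ("School B", "OH"),
    ("School C", "IN"),
    ("School D", "DC"),
]


def schools_per_state(rows):
    """Count how many ranked schools fall in each state."""
    return Counter(state for _, state in rows)


counts = schools_per_state(schools)
for state, n in counts.most_common():
    print(state, n)
```

States that never appear in the rows simply have no entry in the tally, which is what shows up as an unrepresented state on the map.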

Distribution of Index Numbers by State

Remember, a school with a higher index number was better ranked. Here, we look at how index numbers were distributed across different states. To do this, we use boxplots, which show the minimum, maximum, 25th percentile, 50th percentile (median), and the 75th percentile of the distribution. The blue dots in the middle of the box are the mean index scores.

1. Average scores are increasing from Vermont to Indiana.
2. The spread of the distribution (to some extent, measured by how long each box is) appears to vary across states. For states with only one school, the box collapses to a single point; for states with multiple schools, a longer box indicates higher variance in the index scores across schools in that state.
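For readers who want the numbers behind a boxplot, the following is a rough Python sketch of the summary described above (minimum, quartiles, maximum, and the mean shown as a dot). The scores are invented for illustration, and the quartile method is one of several conventions, so values may differ slightly from what R's plotting functions report.

```python
import statistics


def box_summary(scores):
    """Return the values a boxplot with a mean marker displays."""
    ordered = sorted(scores)
    if len(ordered) == 1:
        # A state with a single school collapses to one point.
        q1 = median = q3 = ordered[0]
    else:
        q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
    }


# Hypothetical index scores for the schools of one state.
summary = box_summary([10, 20, 30, 40, 50])
print(summary)
```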

Distribution of Index Numbers by Type of School, Teaching Quality Grade, Facilities and Services Grade, and Job Placement Grade
1. As one should expect, job placement grade and teaching quality grade matter. Facilities and services grade also has an effect.
2. Private schools appear to have a higher index number than public schools.

Relationships between Different Ranks, Index Numbers, and a Whole Host of Other Variables
Here, we will look at a correlation analysis. (I've clubbed rank-ordered variables with ratio-scaled ones, but come on.) Below is a graph with correlation coefficients color coded based on their value. Positive coefficients are coded in blue and negative ones in red. The intensity of the color indicates the magnitude of the coefficient.

1. 2013 Rank is, as expected, positively correlated with how schools are ranked by students and employers (not too dark a blue, is it?), with the MBA feeder school rank, and with the Academic quality rank.
2. 2013 Rank is perfectly negatively related to Index Number (because of the way it is computed).
3. Index Number is marginally positively correlated with Annual Tuition (not a good thing, is it?), but has a positive relationship with median starting salary of students. The latter relationship is a good thing.
4. Index number is negatively related to student faculty ratio, which is also a good thing, right? How does this play into the MOOCs thing?
5. And yes, the average SAT scores are strongly related to the index number.
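To make the rank-versus-index relationship concrete, here is a small Python sketch of a Pearson correlation coefficient. The five index scores are made-up illustrative values (only the endpoints 100 and 20.35 echo the range mentioned earlier), and the function is a textbook formula, not the code used for the chart.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical index scores, listed from best to worst school.
index_scores = [100.0, 91.5, 80.2, 55.0, 20.35]
ranks = list(range(1, len(index_scores) + 1))  # rank 1 = highest index

r = pearson(ranks, index_scores)
print(round(r, 3))
```

On data like this the coefficient comes out strongly negative, since rank 1 goes to the highest index. It reaches exactly -1 only when the index falls in equal steps from one rank to the next.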

Schools positioned on any set of two variables

Now, this gets interesting. Since there are many combinations possible and everyone has their own pet set of issues, the following interactive chart will let you play around until .... forever, if you are really ticked about these issues.
This chart can be played around with in the following ways.
1. The two small tabs on the top right show either bubble charts or bar graphs, depending on what's selected.
2. The horizontal and vertical axes can be changed to other variables by clicking on the existing axis labels (the arrow mark).
3. You can hover over the dots to know the school and value of the variables being examined.
4. Color is used to identify whether a school is a public school (Green) or a private one (Blue).
5. The size of a dot is based on the index number: a LARGER dot is a better ranked school with a higher index number.

Observations: I will let you make them. Have fun.

The R code for this analysis can be accessed by clicking this link.

Sunday, June 23, 2013

Revisualizing the best cities in the US in 2012 - Shiny + googleVis = Incredibly powerful


This is the last time I will talk about visualizing the best cities of 2012 based on Bloomberg Businessweek's rankings. In an earlier post on this topic, interactive applications to plot bar graphs and histograms for different characteristics defining different cities were discussed. They used RStudio's shiny server and ggplot2. The reason I am still clinging on to this data is how pathetically painful (you can imagine the pain) it was to find out details about the best city in the original article. [This is not uncommon. Rankings published by many media outlets suffer from this problem. I hope someone out there is computing the trade-offs between advertising dollars and frustration to customers.] With one city per page, and beginning their discussion with the 50th best city, they actually expected us to go through about 50 pages to find out more about the best city.

What I did was have this data collected and organized into a spreadsheet by a student. I stripped out the textual description of each city and looked at different ways the data could be presented. But something was missing there. It, in my opinion, didn't provide a coherent story. This is my last attempt at doing so, and I try this using the shiny server and the googleVis package. Using them, I bring in interactivity in plotting these cities on a map as well as study how any two characteristics are related to cities and their rankings. (Scatter plots, again, but using googleVis.) If you went through the source code, you'd realize that the ratio of functionality to number of lines of code is incredibly high. Very powerful tools. Please click on the following image to play with the application. The source code link is provided in the application.

Wednesday, June 12, 2013

Twitter Twitter on the Web, Who is the Most Popular of All? Interactively Determining Popularity of Two Entities on Twitter


Code updated based on feedback (see list of changes at the very end)

Okay, that was a take on the mirror mirror on the wall quote from Snow White. This continues my saga of learning from the superb work done by the R-community and building on their ideas. My first post on twitter-related analysis relied on data downloaded at a particular time for interactively analyzing tweets from 5 different universities. In this second post on twitter-related analysis, let's advance that idea this time by retrieving tweets on the fly. As I mentioned in my earlier post, it is possible that twitter might prevent the application from retrieving tweets after a few attempts. So, I'm hoping that it doesn't happen. If you run into trouble, then you, of course, have access to the entire code used for this post. (GitHub link, should work if I'm doing it remotely correctly; else, an htm file with the code for the server and user interface files can be downloaded.)

In this application, I attempted to build on ideas and functions from a chapter titled "Popularity Contest" in Jeffrey Stanton's book, Introduction to Data Science, and bring interactivity to many of those functions. Here, a user of the application can compare the popularity (on twitter) of any two people/objects/entities. I again use the shiny server framework advanced by RStudio for web-based interactivity. This is super cool stuff. Thanks.

How is popularity being measured?

This is being done in two ways. The first is by determining the probability of a new tweet regarding an entity occurring within a certain time period. It could be argued that the higher this probability, the higher the frequency of discussion of this entity on twitter, suggesting relatively higher popularity.

A second approach involves comparing the proportion of all retrieved tweets for each entity that occurred within a certain time frame. For example, one can ask what proportion of retrieved tweets about entity 1 and entity 2 occurred within 30 seconds of each other?

Fortunately for me, both these functions come from Stanton's book and it took little tweaking to get them working under the shiny server framework. The first function was, in fact, used in my earlier post as well.
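As a rough illustration of the first measure, the sketch below estimates the probability of a new tweet arriving within t seconds as the share of observed gaps between consecutive tweets that are at most t seconds long. It is written in Python for illustration and is my own simplified reading of the idea, not Stanton's function; the function names and timestamps are hypothetical.

```python
def delays_between(timestamps):
    """Seconds between consecutive tweets, given timestamps in seconds."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def arrival_probability(timestamps, increment, max_time):
    """Share of delays at or below each threshold from increment to max_time."""
    delays = delays_between(timestamps)
    if not delays or increment <= 0 or increment > max_time:
        return []
    thresholds = range(increment, max_time + 1, increment)
    return [(t, sum(d <= t for d in delays) / len(delays)) for t in thresholds]


# Hypothetical arrival times (in seconds) of five tweets about one entity.
curve = arrival_probability([0, 10, 25, 30, 90], increment=10, max_time=30)
print(curve)  # [(10, 0.5), (20, 0.75), (30, 0.75)]
```

Reading a single point off this curve (say, t = 30) gives the proportion used in the second measure, so the two entities can be compared at the time threshold the user picks.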

Lastly, we need to check if all this discussion on twitter about the entities is positive and not negative. As my parents always told me (and as I am hoping to pass on to my 3-year-old when she's ready to listen to me), when people talk frequently about us, it had better be in a positive light. It might be better to not be talked about than to be unpopular. To do this, I rely on the sentiment analysis approach Jeffrey Breen uses in his twitter mining application with airlines. Yet another illustration of this approach can be found at Gaston Sanchez's excellent twitter mining project. In this approach, words used in tweets are matched against lists of terms deemed positive or negative, based on previous research. Then a score is generated for each tweet, reflecting the number of positive and negative terms used. More information on this approach can be obtained from Jeffrey Breen's site. We'll use this to get a general sense of the sentiment people are expressing about our entities. We'll also develop word clouds of terms being used in these tweets.
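The scoring idea can be sketched in a few lines. The Python example below counts positive terms minus negative terms per tweet; the two tiny word lists are hypothetical stand-ins for the much larger research-based lexicons the approach relies on, and the function is an illustration rather than Breen's actual code.

```python
import re

# Hypothetical miniature lexicons; real ones contain thousands of terms.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "awful", "hate", "sad"}


def sentiment_score(tweet, positive=POSITIVE, negative=NEGATIVE):
    """Number of positive terms minus number of negative terms in a tweet."""
    words = re.findall(r"[a-z']+", tweet.lower())
    pos_hits = sum(word in positive for word in words)
    neg_hits = sum(word in negative for word in words)
    return pos_hits - neg_hits


print(sentiment_score("I love this great school"))   # 2
print(sentiment_score("awful, just awful and sad"))  # -3
```

A score above zero suggests a mostly positive tweet and a score below zero a mostly negative one; a simple word match like this misses sarcasm and negation, which is why it only gives a general sense of sentiment.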

What does the application do?

User Inputs  

Inputs 1 and 2: The application can take inputs for two entities (Entity 1 and Entity 2). (This can, of course, be scaled to more than 2 names.) When you click the application link below, the default inputs are "#Michael" and "#Mary" for Entity 1 and Entity 2, respectively. Why did I pick Michael and Mary? According to the Social Security Administration, over the last 100 years (1913-2012), "Michael" was the most popular name for a male child for 44 years, while "Mary" was the most popular name for a female child for 43 years. So I thought I'd give them a try. You can change those inputs to anything you like.

Input 3: Number of tweets to retrieve. Although there are several ways of retrieving data through twitter using R (for example, see here), I use the twitteR package. Please also remember that this input only specifies how many tweets to request; how many ultimately get retrieved may be fewer than the number requested. In order to make sure Twitter didn't block this application, I've restricted the maximum number of tweets to about 50. This can be increased by modifying just one number in the user input code file. I've set the minimum number to 5, which can again be modified.

Input 4: Time (in seconds). Remember, we will compute the proportion of tweets for both entities that arrived within a particular time period. It is this time period that we are asking the user to input. 

(Please click on the image below to interact with the application.)

Application Outputs

Outputs are presented in a series of 7 tabs.

Tab 1: Gives you information on how many tweets were retrieved and plots the probability of a new tweet arriving within a particular time, t, for both entities.

Tab 2: Gives us three graphs to visualize the distribution of delay times between tweets for both entities. The first gives box plots and the mean of the distribution of delay times for both entities; the second and third graphs are a histogram and a kernel density plot. The box plots give information on the minimum, maximum, and 25th, 50th (median), and 75th percentiles of the distribution. Overall, the lower the delay time between tweets, the higher the frequency of tweets, suggesting higher popularity.

Tab 3: This tab has 3 parts.

Part 1 gives us a bar graph of the estimated proportion of tweets retrieved for both entities which had delay times <= the user-specified time (see Input 4 above). It also plots the 95% confidence interval for each of these proportions. If one entity is clearly more popular than the other, these confidence intervals should not overlap.

Part 2 presents the table used to generate the bar graph of part 1.

Part 3 presents the results of a Poisson test of the two proportions. Here, the idea is that if one entity is clearly more popular than the other, the rates (or proportions computed previously) should be different from each other and their ratio should be statistically different from 1. This test presents some numbers to corroborate what we might find in part 1 of this tab. Check the following numbers. The rate ratio should not be 1 if one entity's proportion is sufficiently different from the other entity's. The 95% confidence interval should not contain 1 for one entity to be more popular than the other. The p-value would be <= .05 in such cases.
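To give a feel for what such a test computes, here is a Python sketch of one standard way to compare two Poisson rates: conditional on the total count, the first count follows a binomial distribution under the hypothesis of equal rates. I am assuming this mirrors the general idea of the test used in the app; the function name, arguments, and example counts are hypothetical, and the sketch reports only the rate ratio and a two-sided p-value, not the confidence interval.

```python
import math


def rate_ratio_test(count1, count2, exposure1=1.0, exposure2=1.0):
    """Exact conditional test that two Poisson rates are equal.

    count1, count2: tweets with delay <= t for each entity.
    exposure1, exposure2: number of tweets examined for each entity.
    Returns (rate ratio, two-sided p-value).
    """
    total = count1 + count2
    if total == 0:
        raise ValueError("at least one event is needed")
    # Under equal rates, count1 | total ~ Binomial(total, p_null).
    p_null = exposure1 / (exposure1 + exposure2)

    def pmf(k):
        return math.comb(total, k) * p_null**k * (1 - p_null) ** (total - k)

    observed = pmf(count1)
    # Sum the probabilities of all outcomes no more likely than the observed one.
    p_value = sum(
        pmf(k) for k in range(total + 1) if pmf(k) <= observed * (1 + 1e-7)
    )
    ratio = (
        (count1 / exposure1) / (count2 / exposure2) if count2 > 0 else math.inf
    )
    return ratio, min(1.0, p_value)


# Hypothetical: 40 of 50 tweets vs 20 of 50 tweets arrived within t seconds.
ratio, p_value = rate_ratio_test(40, 20, exposure1=50, exposure2=50)
print(ratio, p_value)
```

A ratio far from 1 together with a small p-value points the same way as non-overlapping confidence intervals in part 1.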

Tab 4: It is here that we determine whether all the talk on twitter about the entities was positive or negative. We see the distribution of sentiment scores (a higher score is better) for retrieved tweets for both entities using box plots. The mean is also computed. To see how well the algorithm did, a few tweets for both entities are shown along with their sentiment scores. It may not be perfect, but it gives an idea of the sentiment. Please note that tweets were cleaned prior to the generation of sentiment scores. Cleaning involved removing redundant spaces, getting rid of URLs, and taking out retweet headers, hashtags, and references to other screen names.
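The cleaning steps listed above can be approximated with a handful of regular expressions. The Python sketch below is an assumption about what each step might look like (for instance, it drops a hashtag entirely rather than keeping the word after the #), not a transcription of the R code used in the app.

```python
import re


def clean_tweet(text):
    """Strip URLs, retweet headers, hashtags, screen names, and extra spaces."""
    text = re.sub(r"https?://\S+", " ", text)         # URLs
    text = re.sub(r"\bRT\b\s*(@\w+:?)?", " ", text)   # retweet headers
    text = re.sub(r"[@#]\w+", " ", text)              # screen names, hashtags
    text = re.sub(r"\s+", " ", text)                  # redundant spaces
    return text.strip()


raw = "RT @GonzagaU: Go   Zags! #zagup http://t.co/abc"
print(clean_tweet(raw))  # Go Zags!
```

URLs are removed first so that a # or @ inside a link is not mistaken for a hashtag or a screen name.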

Tab 5: Presents word clouds of terms used in tweets for both entities. They are regular word clouds, which were described in my earlier post. Please note that tweets were cleaned prior to the generation of these word clouds. As mentioned in the discussion of tab 4, cleaning involved removing redundant spaces, getting rid of URLs, taking out retweet headers, hashtags, and reference to other screen names. 

Finally, tabs 6 and 7 present the raw tweets retrieved for the two entities and used in the outputs generated in the earlier 5 tabs.

Concluding Thoughts

This has been really fun. When and what next, no clue.

App has been updated. 

a) Replaced references to entity 1 and 2 in tables and graphs with their values
b) Time slider increases every 30 secs rather than 1 second.
c) Modified the function to count number of tweets for the output in tab 1.
d) Application address on glimmer changed (....popularity instead of ...Popsenti)

Saturday, June 1, 2013

Tweetanalytics - Interactively analyzing tweets from accounts of 5 universities


This is an attempt at learning text mining and interactively displaying a few results from twitter data. Interactivity is implemented using RStudio's shiny server. Their documentation of demo scripts came in very handy. As a non-user of twitter, I had to open an account to get access to tweets. My first major source for information/functions/understanding was Jeffrey Stanton's free and easy-to-read book, Introduction to Data Science. Then, there were several R bloggers who had documented their text-mining projects extensively. Almost all of the tasks shown in this first part come from Gaston Sanchez's excellent twitter mining project.

There are several ways of retrieving data through twitter using R. For example, see here. I used the twitteR package, which provides access to the Twitter API from within R. While playing around with this and retrieving a few hundred tweets every few seconds, I noticed that twitter had, at one point, blocked my access to tweets. This required me to wait for a bit before I could get data again. So, what I did was to retrieve tweets and store them for my analyses. This also sped up the processing of the data for this demo by eliminating the data retrieval phase.

Stage 1: Retrieve, clean, and store tweets from the official twitter handles of 5 universities, including mine; these universities were Gonzaga University (GonzagaU), Eastern Washington University (EWUEagles), Washington State University (WSUPullman), Seattle University (Seattleu), and University of Washington (UW). Clean tweets were stored in a new column in each university's data file. Although 500 tweets were requested for each university, far fewer were actually collected. The exact numbers can be obtained from the interactive app below. Cleaning involved removing redundant spaces, getting rid of URLs, and taking out retweet headers, hashtags, and references to other screen names. Note that cleaned tweets were stored separately and they didn't contaminate the original tweets. Data were downloaded on 31 May, 2013.

Code and files for all stages can be found here. [This is my first time using GitHub. So, if it doesn't work, please check this page for all the code in a single htm file.]

Stage 2: Use stored data to generate basic results - Raw tweets were used for this stage

In this stage, users can select any two (of five) universities, and the tweet data files from these two universities are used to generate 9 different results in 10 different tabs. 8 of these follow Gaston Sanchez's work with ice cream. These include comparing the two universities based on 1) number of tweets retrieved, 2) characters per tweet, 3) words per tweet, 4) length of words per tweet, 5) number of unique words per tweet, 6) number of hash (#) tags per tweet, 7) number of @ (at) signs per tweet, and 8) number of web links per tweet. The 9th tab combines results for both universities on all 8 previously mentioned variables. Finally, the 10th tab graphically displays the probability of a new tweet from either of these universities arriving within a particular time frame (relying on the data we've gathered). This gives us an idea of which university is a more active tweeter. This uses a function from Jeffrey Stanton's book.
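Most of the per-tweet measures listed above are simple counts on the raw text. The Python sketch below shows one plausible way to compute them for a single tweet; the function name, the dictionary keys, and the regular expressions are my own illustrative choices, not the definitions used in the app.

```python
import re


def tweet_features(tweet):
    """Compute simple per-tweet measures from the raw tweet text."""
    words = tweet.split()
    return {
        "chars": len(tweet),
        "words": len(words),
        "mean_word_length": (
            sum(len(w) for w in words) / len(words) if words else 0.0
        ),
        "unique_words": len({w.lower() for w in words}),
        "hashtags": len(re.findall(r"#\w+", tweet)),
        "mentions": len(re.findall(r"@\w+", tweet)),
        "links": len(re.findall(r"https?://\S+", tweet)),
    }


# Hypothetical raw tweet.
features = tweet_features("Go Zags go #zagup @GonzagaU http://t.co/abc")
print(features)
```

Applying this to every stored tweet of a university and averaging each measure gives the kind of side-by-side comparison shown in the tabs.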

PLEASE CLICK THIS IMAGE to interact with this Stage 2 application. (Don't forget Stage 3 below, for discussion of word clouds.)

Stage 3: Tag clouds or word clouds can be useful to visually represent textual data by emphasizing terms used more frequently. In this particular instance, I use the stored data to get an idea of terms used in tweets from different universities using three different word clouds - a regular word cloud for each university selected, and a comparative word cloud (a cloud comparing word frequencies across both universities) and a commonality word cloud (a cloud of words shared by the two universities). My understanding of word clouds and the functions used for coding were greatly influenced by the official documentation for the word cloud package for R, Jeffrey Stanton's book, and Gaston Sanchez's discussion on this topic. Of course, I also benefited from a number of other examples from R-bloggers. For these word clouds, cleaned tweets were used.
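Underneath any word cloud is a table of word frequencies. The Python sketch below builds such tables for two sets of cleaned tweets and separates the shared words from the ones unique to each side. This is a simplification of what commonality and comparative clouds convey, not the algorithm of the R word cloud package, and the tweets are invented.

```python
from collections import Counter


def word_counts(tweets):
    """Frequency of each lowercased word across a list of cleaned tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts


def shared_and_distinct(counts_a, counts_b):
    """Split word frequencies into shared words and words unique to each side."""
    shared = {
        w: min(counts_a[w], counts_b[w])
        for w in counts_a.keys() & counts_b.keys()
    }
    only_a = {w: c for w, c in counts_a.items() if w not in counts_b}
    only_b = {w: c for w, c in counts_b.items() if w not in counts_a}
    return shared, only_a, only_b


# Hypothetical cleaned tweets from two universities.
counts_a = word_counts(["go zags go", "campus news"])
counts_b = word_counts(["go dawgs", "campus"])
shared, only_a, only_b = shared_and_distinct(counts_a, counts_b)
print(shared, only_a, only_b)
```

In a cloud, the frequencies would drive the font size of each word.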

PLEASE CLICK THIS IMAGE to interact with this Stage 3 application.

Concluding thoughts: This was fun. More later.