Tuesday, August 27, 2013

Visualizing the Forbes-CCAP University Rankings using ggplot2, rCharts, googleVis, and the shiny server

UPDATE: THE BLOG/SITE HAS MOVED TO GITHUB. THE NEW LINK FOR THE BLOG/SITE IS patilv.github.io and THE LINK TO THIS POST IS: http://bit.ly/1mcryNv. PLEASE UPDATE ANY BOOKMARKS YOU MAY HAVE.

President Obama is pushing for higher education reform, and the development of a rating system for universities is a critical component of it. These ratings are likely to be based on several measures, such as graduation rates, earnings of graduates, and affordability of education. Discussions of these plans can be found in leading national news outlets; see here, here, here, and here for some of them.

In July, Forbes, in partnership with the Center for College Affordability and Productivity (CCAP), produced a ranking of 650 schools in the country. Caroline Howard's article in Forbes indicated that this ranking system emphasized outputs of colleges (e.g., graduation rates) rather than inputs (e.g., SAT scores of incoming students), which is similar to what President Obama is pushing for. In this post, I take a look at Forbes' rankings of schools using interactive visualizations. They suggest that several challenges lie ahead for any meaningful reform.

Tools used for generating these visualizations include R, ggplot2, googleVis, rCharts, and a shiny-server application. The relevant code and data files can be found here on github. I would appreciate your contacting me if you find any errors in the code or have any comments on this project. 

Forbes-CCAP Rankings. Data for this post were collected from two documents on CCAP's site. Document 1 provided the list of America's 100 Best College Buys, and document 2 - Component Rankings - provided the complete listing of all 650 schools, their overall rank, and the other ranking criteria. Details of the methodology used by the ranking system can be found here at Forbes and here at CCAP's site. Links to their discussions of results can be found here at CCAP's site and here at Forbes'. These rankings are based on variables such as post-graduate success, student debt, graduation rate, and student satisfaction, among others. Two summary rankings were produced: an overall rank and a ranking of universities providing the best value. All 650 schools received an overall rank, and 100 of them were ranked as providing the best value.

Procedure Followed.
1. Convert both pdf files (Component Rankings and Best College Buys) to Excel format (using Able2Extract).
2. Merge both files into one data file.
3. Combine the State with the name of the university into one column --- there were a few universities with the same name but from different states.
4. For many components, universities ranked between 600 and 650 were assigned a generic rank of "600-650" by Forbes-CCAP. All such schools were assigned a rank of 625 for those specific components (a minimal R sketch of steps 2 through 4 follows this list).
5. Please remember that a lower rank value indicates that the school is rated higher on that attribute.
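The sketch below uses placeholder file and column names (componentrankings.csv, bestbuys.csv, University, State, and two illustrative component columns); the actual data files and names are in the github repository linked above.


componentrankings=read.csv("componentrankings.csv",stringsAsFactors=FALSE) # step 2: read the two converted files
bestbuys=read.csv("bestbuys.csv",stringsAsFactors=FALSE)
rankings=merge(componentrankings,bestbuys,by=c("University","State"),all.x=TRUE) # merge into one data file

rankings$University=paste(rankings$University,rankings$State,sep=", ") # step 3: combine State with the university name

componentcols=c("Student.Awards","Payscale.Salary.Rank") # step 4: an illustrative subset of component columns
for (col in componentcols){rankings[,col]=as.numeric(gsub("600-650","625",rankings[,col]))}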

The table below provides information used for subsequent analysis. The table can be sorted by any column.
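One way to produce a sortable table like this is googleVis's gvisTable; below is a minimal sketch that assumes the merged data frame from the steps above is named rankings (the table actually embedded here may have been generated differently).


library(googleVis)
rankingstable=gvisTable(rankings,options=list(page='enable',pageSize=25)) # paginated, sortable table
plot(rankingstable) # opens the table in a browser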



Where are the 100 best value schools in the country? 

Color of dots: Rank, from 1 through 100
Size of dots: Total Score, which determines the overall rank (a higher score means a lower rank number, which indicates a better overall school). Please note that the overall ranking method and the best value ranking method are different.

(You can hover over the dots to find the name of the school.)
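A map of this kind can be drawn with googleVis's gvisGeoChart. The sketch below is only an illustration: the column names (Best.Value.Rank, Total.Score, University, City) are assumptions, and the chart embedded above may use different settings.


bestvalue=rankings[!is.na(rankings$Best.Value.Rank),] # the 100 best value schools (placeholder column name)
valuemap=gvisGeoChart(bestvalue,locationvar="City",colorvar="Best.Value.Rank",sizevar="Total.Score",
                      hovervar="University",options=list(region="US",displayMode="markers"))
plot(valuemap)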


 Relationship between Best Value Rankings and Overall Rankings of Schools 
(Please click on the image below to enlarge it.)
The above graph shows the overall ranks of all 100 universities identified as providing the best value. Universities are listed in order of their best value rank (better ranked at the bottom, lower ranked at the top). Notice that only about 7 universities have an overall rank better than or similar to their best value rank. The overall ranks of most of the best value universities are over 100, and a few are over 500.

This is also confirmed in the interactive graph below, which shows that the overall ranks of the 100 best value universities range from 7 through 599. Of these 100 schools, only 3 are private; the remaining 97 are public. (You can hover over the dots to identify the school.)
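These numbers are easy to verify from the merged data (again assuming the placeholder column names Best.Value.Rank, Rank, and Public.Private):


bestvalue=rankings[!is.na(rankings$Best.Value.Rank),]
range(bestvalue$Rank) # overall ranks of the 100 best value schools
table(bestvalue$Public.Private) # public versus private split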



How are different criteria related to each other? 

Given the large number of criteria the Forbes-CCAP rankings provide data on, it is perhaps more appropriate to let you interact with the data to see how different variables are related to each other. The interactive application below will let you do that.



Clearly, a few variables have a positive relationship. For example, universities with a better overall rank also appear to have better ranks on Federal.Student.Debt, Actual.Retention.Rate.Rank, Student.Awards, and Payscale.Salary.Rank. However, not all variables show that kind of relationship; see, for example, the relationship between Rate.My.Professor rank and Payscale.Salary.Rank, or between Rate.My.Professor rank and American.Leaders.Rank. Interestingly, private schools appear to have an edge over public schools on the Rate.My.Professor rank.
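As a single static example of the kind of comparison the application supports, the overall rank can be plotted against one component rank (a ggplot2 sketch, assuming the merged data frame rankings has columns named Rank and Payscale.Salary.Rank):


library(ggplot2)
ggplot(rankings,aes(x=Rank,y=Payscale.Salary.Rank))+
  geom_point()+geom_smooth(method="lm")+
  ggtitle("Overall Rank versus Payscale Salary Rank")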

Where does a particular school stack up, versus others? 

The interactive application below provides additional information beyond what we just saw. Here, you can select specific schools and see how they compare with others. The application can be used in the following ways (a rough sketch of one way such a chart can be generated follows this list).
1. The two small tabs on the top right switch between bubble charts and bar graphs, depending on what's selected.
2. The horizontal and vertical axes can be changed to other variables by clicking on the existing axis labels (the arrow mark).
3. Please change the Size parameter to "Same size" (the topmost option) to ease interpretation.
4. You can hover over the dots to identify the school.
5. Coloring can be modified --- but I would leave it at its default setting.
6. Specific schools (one or many) can be selected, and changes in their positions with respect to different variables can be tracked.
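These controls (chart-type tabs, clickable axis labels, a Size setting, coloring, and entity selection) are characteristic of a googleVis motion chart, so here is a hypothetical sketch of one way such a chart could be generated. The column names (University, Rank, Payscale.Salary.Rank, Public.Private, Total.Score) and the constant Year added as the required time variable are assumptions, and the actual application may have been built differently.


rankings$Year=2013 # motion charts need a time variable; the rankings cover a single year
rankingschart=gvisMotionChart(rankings,idvar="University",timevar="Year",xvar="Rank",
                              yvar="Payscale.Salary.Rank",colorvar="Public.Private",sizevar="Total.Score")
plot(rankingschart)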


Concluding Remark and Acknowledgements

It is likely to be a long and arduous road ahead for education reform. Should it be done? Definitely. How? Wiser people need to figure that out.

Thanks to Ramnath Vaidyanathan, Joe Cheng, and other members of the R community for their immediate responses to queries and excellent resources they have created for all to use.

Monday, August 12, 2013

Short tales of two NCAA basketball conferences (Big 12 and West Coast) using graphs

UPDATE: THE BLOG/SITE HAS MOVED TO GITHUB. THE NEW LINK FOR THE BLOG/SITE IS patilv.github.io and THE LINK TO THIS POST IS: http://bit.ly/1kvathJ. PLEASE UPDATE ANY BOOKMARKS YOU MAY HAVE.

Having been a student at the University of Kansas (Kansas Jayhawks) and now working at Gonzaga University (Gonzaga Bulldogs), I find discussions about college basketball inescapable. This post uses R, ggmap, ggplot2, and the shiny server to graphically visualize relationships between a few basketball-related variables for two NCAA basketball conferences - the Big 12 and the West Coast Conference. The code used to generate the graphs in this post and for the shiny server application referred to in the post can be found here.

A visit to espn.com's page at http://espn.go.com/mens-college-basketball/teams shows a listing of universities in different conferences. The image below presents one such example.


A click on the "Stats" link for a school will take you to a page like this.

The page provides two tables: the first gives game statistics - total games played and per game statistics for the season for all players on that team - and the second gives cumulative numbers for the entire season. Tables in this format are useful, but they do not show relationships between variables, between teams, or between conferences. So, the objective of this post is to visualize a few of these relationships. To do this, we follow the steps below.

Steps
1. Find a way to extract data from the website. Remember, each page for a college provides information on only one season, but information on 12 seasons is available.  This is done using a function developed to scrape the site.
2. Clean and prepare the data.
3. Prepare preliminary graphs, and lastly,
4. Make the graph generation process interactive for people to focus on their favorite teams or variables.

Scope of this post
1. We focus only on the first of the two tables of statistics provided by espn.com (see the previous image) --- total games played and per game statistics. Please note that these are player level data.
2. This post does not look at individual players, by name, on different teams -- so it does not identify the best players on any variable -- but it uses player level data to study teams and conferences.
3. We look at only 20 teams, 10 each in the West Coast Conference and the Big 12 --- but with some extra time, interest, and patience (TIP), the approach can be extended to the remaining 331 Division 1 teams in 30 different basketball conferences.
4. Not all questions pertaining to the variables studied are answered, and not all graphical formats are explored. For example, we do not have data on win/loss records or on the competition teams faced.

The fun begins...

Step 1: Extracting Data
The typical statistics page from espn shown previously has certain distinct elements.
1. Each page has two tables, the first about game statistics and the second about the season (cumulative numbers for the season); the headers of each table are the same from page to page.
2. The URL for a page has three different parts.
(a) Part 1 has the team id, Part 2 the year, and Part 3 the team name.
(b) The second part, year, represents a season --- so when Part 2 is 2012, the page describes the 2011-12 season. This is something to remember so the necessary change can be made in the data.
3. The basic idea is to create a function that does the following: (i) takes in Part 1 and Part 3, along with any other variables we deem appropriate, as inputs; (ii) loops through the pages representing different seasons for a team; (iii) from each page, extracts the two tables and updates two master data frames with their data --- two because we have game stats and season stats (the latter is not used for analysis in this post, but the data are available for anyone wishing to look at them). This function is then used to extract data for the 20 universities.
First, the libraries required for all of this code:


library(XML) # readHTMLTable, for scraping the espn pages
library(ggmap) # maps and geocoding
library(ggplot2) # graphs
library(plyr) # ddply, for summarizing the data

Initialize two empty master data frames, which we'll call Gamestatistics and Seasonstatistics, and provide column names from the two tables. 

Gamestatistics=as.data.frame(matrix(ncol=16))
names(Gamestatistics) = c("Player", "GP", "MIN", "PPG", "RPG", "APG", "SPG", "BPG", "TPG", "FG%", "FT%", "3P%","Year","Team","City","Conference")

Seasonstatistics=as.data.frame(matrix(ncol=20))
names(Seasonstatistics) = c("Player", "MIN", "FGM", "FGA", "FTM", "FTA", "3PM", "3PA", "PTS", "OFFR", "DEFR", "REB", "AST",  "TO", "STL", "BLK","Year","Team","City","Conference")

Note that we have created columns for Year, Team, City, and Conference --- variables beyond those provided by the two tables, which will either have to be calculated or specified manually. For each of the 20 colleges of interest from the 2 conferences, we prepare something like the following.

URLpart1="http://espn.go.com/mens-college-basketball/team/stats/_/id/2250/year/"
URLpart3 ="/gonzaga-bulldogs"
Team="Gonzaga Bulldogs"
City="Spokane, WA"
Conference="West Coast"

The city information was obtained from Wikipedia for the schools I didn't know, and the other information is available on the espn page for each university. Once we have prepared the parameters required by our function (getData), we call it. Remember that the function should return two tables; we collect them in a list named gameandseasonstats and extract the updated master tables - Gamestatistics and Seasonstatistics - from that list.


gameandseasonstats=getData(URLpart1,URLpart3,Team,City,Conference)
Gamestatistics=gameandseasonstats[[1]]
Seasonstatistics=gameandseasonstats[[2]]
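
Rather than repeating this block by hand for all 20 schools, the calls can be driven from a small table of teams. The sketch below shows the idea with two illustrative rows; the espn team id and URL slug for each school (other than Gonzaga's, taken from above) are placeholders that have to be looked up on the site. getData itself is defined next.


teams=data.frame(id=c("2250","2305"), # hypothetical driver table; look up each id and slug on espn.com
                 slug=c("/gonzaga-bulldogs","/kansas-jayhawks"),
                 Team=c("Gonzaga Bulldogs","Kansas Jayhawks"),
                 City=c("Spokane, WA","Lawrence, KS"),
                 Conference=c("West Coast","Big 12"),stringsAsFactors=FALSE)

baseurl="http://espn.go.com/mens-college-basketball/team/stats/_/id/"
for (i in 1:nrow(teams)){
  gameandseasonstats=getData(paste(baseurl,teams$id[i],"/year/",sep=""),teams$slug[i],
                             teams$Team[i],teams$City[i],teams$Conference[i])
  Gamestatistics=gameandseasonstats[[1]] # reassign the master frames so the next call appends to them
  Seasonstatistics=gameandseasonstats[[2]]
}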

Now the function.


getData=function(URLpart1,URLpart3,Team,City,Conference){
  for (i in 2002:2013){
 
  URL=paste(paste(URLpart1,as.character(i),sep=""),URLpart3,sep="") # assemble the page URL for year i
  tablesfromURL = readHTMLTable(URL) # read all tables on the page into a list
 
  gamestat=tablesfromURL[[1]]
  names(gamestat) = c("Player", "GP", "MIN", "PPG", "RPG", "APG", "SPG", "BPG", "TPG", "FG%", "FT%", "3P%")
  gamestat$Year=i
  gamestat$Team=Team
  gamestat$City=City
  gamestat$Conference=Conference
  Gamestatistics=rbind(gamestat,Gamestatistics)

  seasonstat=tablesfromURL[[2]]
  names(seasonstat) = c("Player", "MIN", "FGM", "FGA", "FTM", "FTA", "3PM", "3PA", "PTS", "OFFR", "DEFR", "REB", "AST",  "TO", "STL", "BLK")
  seasonstat$Year=i
  seasonstat$Team=Team
  seasonstat$City=City
  seasonstat$Conference=Conference
  Seasonstatistics=rbind(seasonstat,Seasonstatistics)
  }
return(list(Gamestatistics,Seasonstatistics))
}

What this function does is the following.
(a) Receives the parameters we send to getData.
(b) For every year's page, from 2002 to 2013, assembles the complete URL of the page by pasting URLpart1, the year, and URLpart3 together.
(c) Reads the two tables and stores them in two temporary data frames --- gamestat and seasonstat.
(d) Adds new columns for year, team, city, and conference.
(e) Appends these rows to the respective master tables - Gamestatistics and Seasonstatistics - and returns them in a list, which we retrieve outside the function and reuse for the next school.
We have now collected data from 2 tables for 20 schools from 2 different conferences, across 240 different pages on espn.com. On to the next stage.

Step 2: Clean and Prepare the Data

The data frames have three types of rows that need to be removed: rows where the value of Player is "Player", "Totals", or NA (remember, we initialized empty data frames). These come from the way the tables were read by our function. That's easy.


Gamestatistics=Gamestatistics[which(Gamestatistics$Player!="NA"),]
Gamestatistics=Gamestatistics[which(Gamestatistics$Player!="Player"),]
Gamestatistics=Gamestatistics[which(Gamestatistics$Player!="Totals"),]

Converting a few variables to factors and a few to numbers was accomplished by these two lines.


for (i in 2:12){Gamestatistics[, i] = as.numeric(as.character(Gamestatistics[,i]))} # columns 2 through 12: the game statistics
for(i in 14:16){Gamestatistics[,i]=as.factor(Gamestatistics[,i])} # columns 14 through 16: Team, City, and Conference

Then, columns were renamed to describe each variable completely, so PPG became Points.Per.Game. I could have done this at the very beginning, but I didn't have the foresight.


names(Gamestatistics) = c("Player", "Games.Played", "Minutes", "Points.Per.Game", "Rebounds.Per.Game", "Assists.Per.Game", "Steals.Per.Game",
"Blocks.Per.Game", "Turnovers.Per.Game", "Field.Goal.Percent",
"Free.Throw.Percent", "Three.Point.FieldGoal.Percent", "Year", "Team", "City","Conference")

  
And the last thing left was converting years back to seasons --- so 2002 became "2001-2002", accomplished using a line like this for each year between 2002 and 2013.



Gamestatistics$Year<-gsub("2002", "2001-2002", Gamestatistics$Year)
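
Rather than writing twelve separate gsub() lines, the same conversion can be done in a short loop (a sketch; the code on github spells out each year).


for (yr in 2002:2013){Gamestatistics$Year=gsub(as.character(yr),paste(yr-1,yr,sep="-"),Gamestatistics$Year)}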

Remember, these changes are required for the second table as well --- Seasonstatistics, which we don't use in this post. However, both cleaned data frames are available for anyone to play with; they are with the code at the github link provided previously.

Step 3: Prepare preliminary graphs

Which 20 schools did we collect data on?


ggplot(Gamestatistics,aes(x=Conference,y=City,color=Conference))+ 
geom_text(data=Gamestatistics,aes(label=Team))+
  theme(axis.text.x = element_text(color="black",size=12))+
  theme(axis.text.y = element_text(color="black",size=12))+theme(legend.position="none")+labs(y="",x="")


Let's plot these cities on a map.

First, get the US map from osm, get the list of cities from our data, get their latitudes and longitudes, and add this information to our data.


location=c(-125,24.207,-70,50) # It took a bit to figure these coordinates out - zoom to the appropriate location and level using openstreetmap.org
# and find coordinates from the export link

map=get_map(location=location,maptype="roadmap",source="osm")
usmap=ggmap(map)

locs=geocode(as.character(unique(Gamestatistics$City))) # find the 20 cities from the data and identify their latitude and longitude; combine City information
locs$City=unique(Gamestatistics$City)
Gamestatistics$lat=locs$lat[ match(Gamestatistics$City,locs$City)]# bring latitude and longitude information to main data frame
Gamestatistics$lon=locs$lon[ match(Gamestatistics$City,locs$City)]


And, the plot.


usmap+geom_point(data=Gamestatistics,aes(x=lon,y=lat,color=Conference),size=7)+ ggtitle("Location of WCC and Big 12 Schools")

1. BYU is in Utah, which isn't a coastal state, unlike the other schools from the WCC.
2. WVU doesn't exactly fit the region of the other Big 12 schools.
3. BYU and WVU recently joined these conferences.

Just a thought: on what basis are conferences named? The Big 12 has 10 teams, but the Big Ten has 12? Clearly, there appears to be a branding issue. The PAC-10 did change to PAC-12 after the recent addition of two new teams. Hmm.

Let's plot histograms of some variable, say Points.Per.Game, for all teams.


ggplot(Gamestatistics,aes(x=Points.Per.Game, fill=Team))+
  geom_histogram()+ggtitle("Histogram of Points.Per.Game for All Teams - Data Collapsed Across All Years")+ facet_wrap(~Team,ncol=4) + theme(legend.position="none")

We could also compare the distributions of two different schools on one variable --- let's take a look at Gonzaga Bulldogs and Kansas Jayhawks on say, Points.Per.Game.

ggplot(subset(Gamestatistics,Team %in% c("Gonzaga Bulldogs","Kansas Jayhawks")),aes(x=Points.Per.Game, fill=Team))+  geom_density(alpha=.3)+ ggtitle("Kernel Density Plots of Points.Per.Game for Gonzaga Bulldogs and Kansas Jayhawks for all Years")+ facet_wrap(~Year,ncol=4)






We might also be interested in seeing how the mean points per game of team players change over years for different teams.



# Mean calculation of Points.Per.Game of Team players for a season
ppgmean=ddply(Gamestatistics,.(Team,Year),summarize,Mean.Points.Per.Game=mean(Points.Per.Game))

#Plot
ggplot(ppgmean,aes(x=Year,y=Mean.Points.Per.Game,color=Team,group=Team))+
  geom_point()+geom_line()+facet_wrap(~Team,ncol=4)+theme(legend.position="none")+
  theme(axis.text.x = element_text(angle=-90))+ggtitle("Mean Points Per Game of Players of Different Teams in Different Seasons")


Alternatively, we might be interested in how the mean points per game of team players changed for two teams across different seasons.



# Mean points per game comparison for two teams, say, Gonzaga and Kansas, over years


ggplot(subset(ppgmean,Team %in% c("Gonzaga Bulldogs","Kansas Jayhawks")),aes(x=Year,y=Mean.Points.Per.Game,color=Team,group=Team))+
  geom_point()+geom_line()+ggtitle("Mean Points Per Game of Players of Gonzaga Bulldogs and Kansas Jayhawks in Different Seasons")




We could also look at the relationship between two variables (Points.Per.Game and Assists.Per.Game) for each team, with data collapsed across years, and add in a LOESS curve.


ggplot(Gamestatistics,aes(x=Points.Per.Game, y=Assists.Per.Game, color=Team))+
  geom_jitter()+ geom_smooth(method='loess',level=0,size=1,aes(color=Team))+
  ggtitle("Scatter Plots with LOESS smoothing of Points.Per.Game and Assists for All Teams -- Data Collapsed Across All Years")+ facet_wrap(~Team,ncol=4) +
  theme(legend.position="none") 


We could also compare the relationship between two variables - points per game and assists per game - for two or more schools across seasons.

ggplot(subset(Gamestatistics,Team %in% c("Gonzaga Bulldogs","Kansas Jayhawks")),aes(x=Points.Per.Game, y=Assists.Per.Game, color=Team))+
  geom_jitter()+ geom_smooth(method='loess',level=0,size=1,aes(color=Team))+ facet_wrap(~Year,ncol=4)+
  ggtitle("Scatter Plots with LOESS smoothing of Points.Per.Game and Assists for Gonzaga Bulldogs and Kansas Jayhawks -- Data Collapsed Across All Years")




Step 4: Interactively generate graphs - for any combination of variables and teams

Of course, the graphs shown previously could also be generated to compare the two conferences, and there may be other variables of interest on which we might want to compare schools or conferences. It would be unwieldy to present graphs of all possibilities here; introducing interactivity, by letting you play around with variables and schools, helps. For this, we rely on the shiny server platform from RStudio. Our shiny server application uses three preloaded data files - Gamestatisticscleaned.rda and files of team-wise and conference-wise means of all variables for all years (meansteams.rda and meansconferences.rda, respectively) - and presents 9 different graphs. The code to generate these additional data files is given below.



meansteams=ddply(Gamestatistics,.(Team,Year),summarize,
                 Points.Per.Game=mean(Points.Per.Game),
                 Games.Played=mean(Games.Played),
                 Minutes = mean(Minutes),
                 Rebounds.Per.Game=mean(Rebounds.Per.Game),
                 Assists.Per.Game=mean(Assists.Per.Game),
                 Steals.Per.Game=mean(Steals.Per.Game),
                 Blocks.Per.Game=mean(Blocks.Per.Game),
                 Turnovers.Per.Game=mean(Turnovers.Per.Game),
                 Field.Goal.Percent=mean(Field.Goal.Percent),
                 Free.Throw.Percent=mean(Free.Throw.Percent),
      Three.Point.FieldGoal.Percent=mean(Three.Point.FieldGoal.Percent)
                 )
meansconferences=ddply(Gamestatistics,.(Conference,Year),summarize,
                       Points.Per.Game=mean(Points.Per.Game),
                       Games.Played=mean(Games.Played),
                       Minutes = mean(Minutes),
                       Rebounds.Per.Game=mean(Rebounds.Per.Game),
                       Assists.Per.Game=mean(Assists.Per.Game),
                       Steals.Per.Game=mean(Steals.Per.Game),
                       Blocks.Per.Game=mean(Blocks.Per.Game),
                       Turnovers.Per.Game=mean(Turnovers.Per.Game),
                       Field.Goal.Percent=mean(Field.Goal.Percent),
                       Free.Throw.Percent=mean(Free.Throw.Percent),
      Three.Point.FieldGoal.Percent=mean(Three.Point.FieldGoal.Percent)
                )
save(meansteams,file="meansteams.rda")

save(meansconferences,file="meansconferences.rda")
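
The cleaned game statistics data frame itself also needs to be saved for the application (a one-liner, assuming the object saved is Gamestatistics and the file name is the Gamestatisticscleaned.rda mentioned above).


save(Gamestatistics,file="Gamestatisticscleaned.rda")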




The code for the shiny server user interface (ui.r) and server (server.r) is available through github. Please click on the screenshot below to access the application.
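
For readers new to shiny, here is a stripped-down, single-file sketch of this kind of application. It is only an illustration --- the real application uses separate ui.r and server.r files, three data files, and 9 graphs --- and the input names here are made up.


library(shiny)
library(ggplot2)

load("Gamestatisticscleaned.rda") # loads the cleaned Gamestatistics data frame

ui=fluidPage(
  sidebarLayout(
    sidebarPanel(
      selectInput("teams","Teams",choices=sort(unique(as.character(Gamestatistics$Team))),
                  selected=c("Gonzaga Bulldogs","Kansas Jayhawks"),multiple=TRUE),
      selectInput("variable","Variable",choices=names(Gamestatistics)[2:12])
    ),
    mainPanel(plotOutput("densityplot"))
  )
)

server=function(input,output){
  output$densityplot=renderPlot({
    ggplot(subset(Gamestatistics,Team %in% input$teams),
           aes_string(x=input$variable,fill="Team"))+geom_density(alpha=.3)
  })
}

shinyApp(ui=ui,server=server)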

Concluding Remarks
1. The objective of this post was to visualize a few relationships among the game statistics presented for teams by espn.com for two NCAA basketball conferences - the West Coast and the Big 12.
2. Data were scraped from 240 different pages on espn.com.
3. Data on win/loss history or on the competition teams faced were not available on the page and hence were not used.
4. Only one of the two tables --- the one concerning game statistics --- was used for the analyses; the data from the second table, season statistics (cumulative numbers for the entire season), are available for download from the previously provided link.
5. R code was formatted for the blog using the Highlight 3.14 software.
6. I would appreciate any comments/suggestions you might have on the application or on the code. Thanks.
--------------------------