Saturday, December 14, 2013

Visualization of 2012 Crime Rates of Different States in the US using rCharts

UPDATE: THE BLOG/SITE HAS MOVED TO GITHUB. THE NEW LINK FOR THE BLOG/SITE IS patilv.github.io and THE LINK TO THIS POST IS: http://bit.ly/1pi6mGo. PLEASE UPDATE ANY BOOKMARKS YOU MAY HAVE.

In this post, I look at 2012 crime rates (per 100,000 people) for different crimes in different states. Data from the FBI's Uniform Crime Reports site were used for this analysis; the methodology the FBI uses and the caveats associated with the data can be found on their website. Although this analysis does not get into the causal factors behind the numbers, it is important to note that there are many possible factors at play, and the FBI's site itself urges extreme caution in using these numbers. The original spreadsheet, downloaded on 13 Dec 2013, can be found here. It was cleaned to retain only crime-rate information (per 100,000 people) for the 50 US states; information on the District of Columbia and Puerto Rico was removed. Cleaning was done outside of the R environment, in the native format of the original data, and the cleaned version can be found here. All documents from the FBI's site referred to here, along with other analysis-related files, are available here on github. Crime rates (per 100,000 people) for 9 different types of crimes are present for each of the 50 US states. The rest of this document was generated using slidify. Acknowledgements are due to the developers and maintainers of the different R packages used here, as well as to the wonderful user community, which has contributed plenty of web-based resources. All interactive charts were generated using rCharts. (Thanks, Ramnath, for your responsiveness.)

Thursday, November 7, 2013

Replication of few graphs/charts in base R, ggplot2, and rCharts

UPDATE: THE BLOG/SITE HAS MOVED TO GITHUB. THE NEW LINK FOR THE BLOG/SITE IS patilv.github.io and THE LINK TO THIS POST IS: http://bit.ly/1jJ6f7v. PLEASE UPDATE ANY BOOKMARKS YOU MAY HAVE.

In this post, I use a simulated dataset (7 variables, 3 factor and 4 numeric, with a sample size of 50) to create graphs/charts using base R, and then replicate them using ggplot2 and rCharts. This is not an attempt to create an exhaustive database of graphs/charts of all possible combinations, but an exercise in generating some of the common ones (in my view). These include dot plots, histograms, box plots, bar charts, scatter plots, density curves, line graphs, and a few more. I am sure the code can be further optimized, and it could use finishing touches in many places (legends, axis labels, and color), but at its core, I think it does its job. Pages of the 3 versions of graphs/charts were generated using slidify. Thanks to Ramnath Vaidyanathan for having answers to all questions and to the kind rCharts, ggplot2, and R community for the free knowledge base available on the Internet. The code for these pages can be found on github. Please click on this link for the analysis: http://bit.ly/18aWx3e
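For anyone who wants to reproduce the setup without the original files, here is a minimal sketch of how such a dataset might be simulated; the variable names and distributions are my assumptions, not necessarily the ones used in the post.

# Hypothetical simulation of a comparable dataset: 3 factors, 4 numeric, n = 50
set.seed(42)
n <- 50
simdata <- data.frame(
  group  = factor(sample(c("A", "B", "C"), n, replace = TRUE)),   # factor 1
  gender = factor(sample(c("M", "F"), n, replace = TRUE)),        # factor 2
  region = factor(sample(c("East", "West"), n, replace = TRUE)),  # factor 3
  x1 = rnorm(n, mean = 50, sd = 10),  # numeric 1
  x2 = rnorm(n, mean = 0, sd = 1),    # numeric 2
  x3 = runif(n, 0, 100),              # numeric 3
  x4 = rpois(n, lambda = 5)           # numeric 4
)
str(simdata)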

Friday, October 18, 2013

Subsidies, Revenues, and Expenses of NCAA Division I Public Schools Athletic Departments – Visualizations using rCharts and animint generated interactive ggplots

UPDATE: THE BLOG/SITE HAS MOVED TO GITHUB. THE NEW LINK FOR THE BLOG/SITE IS patilv.github.io and THE LINK TO THIS POST IS: http://bit.ly/1obVGLS. PLEASE UPDATE ANY BOOKMARKS YOU MAY HAVE.

Two interesting articles in recent issues of The Atlantic linked sports to taxpayer dollars. The first few lines of Gregg Easterbrook’s article, titled How the NFL Fleeces Taxpayers, were: “Taxpayers fund the stadiums, antitrust law doesn't apply to broadcast deals, the league enjoys nonprofit status, and Commissioner Roger Goodell makes $30 million a year. It's time to stop the public giveaways to America's richest sports league—and to the feudal lords who own its teams.” Amanda Ripley’s article was titled The Case Against High-School Sports. Its first few lines were: “The United States routinely spends more tax dollars per high-school athlete than per high-school math student—unlike most countries worldwide. And we wonder why we lag in international education rankings?” A USA Today Sports article by Steve Berkowitz, Jodi Upton and Erik Brady addressed a third level of sports, collegiate athletics, and linked it to taxpayer dollars. The article, titled “Most NCAA Division I athletic departments take subsidies”, noted that only 16 of 228 public-school athletic departments were self-sufficient and asked whether the subsidies the rest receive were necessary. In this post, I use the data provided in the USA Today article to further study the expenses, revenues, and subsidies of the athletic departments of the 228 public schools.

All charts used in this post permit interactivity of some sort. These were created using rCharts and animint. These in turn rely on a set of other libraries such as ggplot2 and polycharts JS. Thanks to Toby Dylan Hocking, @timelyportfolio and Ramnath Vaidyanathan for their prompt assistance on charting questions. Code, data, and other files relevant for a replication of this analysis are available at github. 

Data on total revenues, total expenses, and subsidies for 228 athletic departments of public universities were provided for the year 2012, along with an overall rank of each athletic department's finances. The methodology and category explanations USA Today used are available here. Data were collected by USA Today in partnership with Indiana University’s National Sports Journalism Center. In the data analyzed below, Total Revenues for a department include the subsidies it received. An additional variable, Revenues less Expenses, was calculated, which is simply Total Revenues minus Total Expenses (a rough sketch of this step follows). The complete data have been tabulated below; the columns can be sorted by value.
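As a rough sketch of that derived-variable step (the file and column names here are hypothetical placeholders, not the actual headers of the USA Today data):

finances <- read.csv("finances.csv", stringsAsFactors = FALSE)  # hypothetical file name
# Revenues less Expenses is simply the difference of the two totals
finances$Revenues.Less.Expenses <- finances$Total.Revenues - finances$Total.Expenses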


Now, let’s visualize the data using a heat map. Along the columns are different schools (you won’t see all the labels, but if you hover over the rectangular tiles, you should see each school identified). Values for expenses, revenues, and the other variables are rescaled to a 0-to-1 range, which makes it possible to visualize all the variables on one color scale: the higher the dollar amount, the more intense the red and the closer the scaled value is to 1; the lower the dollar amount, the fainter the red, fading to white when the amount is negligible. To verify, hover over the most intense red part of the Total Subsidy row and you should find Nevada-Las Vegas with a scaled value of 1, because it receives the highest subsidy.
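The 0-to-1 scaling itself can be reproduced with the rescale function from the scales package; a sketch, continuing with the assumed column names:

library(scales)
# Rescale each dollar column to [0, 1] so all rows of the heat map share one color scale
money.cols <- c("Total.Subsidy", "Total.Revenues", "Total.Expenses", "Revenues.Less.Expenses")
scaled <- finances
scaled[money.cols] <- lapply(scaled[money.cols], rescale)  # min maps to 0, max maps to 1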



The heat map clearly shows that most schools receive a subsidy to some degree. The Total Revenue row indicates that very few schools generate very high amounts of revenue, and as a result, the Revenues less Expenses row is a faded shade of red for most schools, save for a handful with very high values.

Relationship between Total Subsidy, Total Revenue, Total Expenses and Revenue Less Expenses
The correlation matrix given below provides this information. It tells you how related two variables are to each other. The correlation coefficient, a measure of the relationship, can vary from –1 through +1. A negative coefficient (in red or a shade of red) indicates a negative relationship between the two variables, whereas a positive coefficient (in blue or a shade of blue) denotes a positive relationship (higher values of one variable are associated with higher values of the other). You should be able to hover over the tiles to get more information.
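The underlying numbers are a one-liner in base R; a sketch, again with the assumed column names (the Rank column is also an assumption):

# Pairwise Pearson correlations among the financial variables and the overall rank
round(cor(finances[, c("Total.Subsidy", "Total.Revenues", "Total.Expenses",
                       "Revenues.Less.Expenses", "Rank")]), 2)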


There's a perfect positive correlation between Total Revenue and Total Expenses, and both variables are negatively correlated with the overall rank (the higher the revenue, the better the rank, with a lower number denoting better performance). No surprises there. Total Subsidy is not positively related to any other variable.

Groups/Clusters of Athletic Departments

In order to classify the athletic departments into groups, a statistical technique called k-means clustering was used. Based on a few statistical measures, which can be ascertained from the code provided, it was decided to group all 228 departments into 4 clusters. The grouping is done such that schools with similar characteristics (with respect to total subsidy, total revenue, total expenses, and revenues less expenses) end up together. Group 1 had 19 schools, Group 2 had 61 schools, Group 3 had 33 schools, and the remaining 115 schools belonged to Group 4. Schools are tabulated by group below, after a minimal sketch of the clustering step.
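The sketch assumes the four financial columns above are standardized before clustering; the post's exact pre-processing and random seed may differ.

set.seed(123)                     # k-means depends on random starting centers
km <- kmeans(scale(finances[, c("Total.Subsidy", "Total.Revenues",
                                "Total.Expenses", "Revenues.Less.Expenses")]),
             centers = 4, nstart = 25)
finances$Group <- factor(km$cluster)
table(finances$Group)             # sizes of the 4 groups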

Characteristics of these groups/clusters. The chart below can be used to determine the profile of each of the 4 groups. You can select a group (draw a box around a group number) and click to find a few lines highlighted. This chart gives the mean scores on the variables of interest for the 4 groups: the lower a group's mean score on a variable, the lower that group's dollar amount.
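The group means behind this chart can be computed with ddply from the plyr package; a sketch, continuing from the clustering sketch above with the same assumed names:

library(plyr)
# Mean of each financial variable within each of the 4 clusters
groupmeans <- ddply(finances, .(Group), summarize,
                    Total.Subsidy          = mean(Total.Subsidy),
                    Total.Revenues         = mean(Total.Revenues),
                    Total.Expenses         = mean(Total.Expenses),
                    Revenues.Less.Expenses = mean(Revenues.Less.Expenses))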


Group 1 (a cluster of 19 schools) has the highest mean Total Revenues (approximately $111.48 million), the highest mean Total Expenses (approximately $101 million), and the highest Revenues less Expenses (over $10 million). It takes the least subsidy (approximately $2.85 million). Contrast Group 1 with Group 4, a cluster of 115 schools, the largest of the four groups: Group 4 has the lowest mean Total Revenues, the lowest mean Total Expenses, an average subsidy of approximately $8.3 million, and the lowest Revenues less Expenses, at $17,550. Group 3 schools, a cluster of 33, take in an average subsidy slightly less than Group 4's (approximately $7.9 million) but have the second-highest means for Total Revenues (approximately $70.34 million), Total Expenses (approximately $69.3 million), and Revenues less Expenses (a little over $1 million). Lastly, Group 2 schools, which receive the highest average subsidy (approximately $16.9 million), have total revenues of approximately $29 million and expenses of approximately $28.99 million, for a Revenues less Expenses amount of about $37,000.

The following set of plots provides the potential to explore further the relationships between the different variables, the subsidies, and the 4 groups. You should be able to hover over the bars of the bar charts or the points of the scatter plots to get information about the school in question, and to click on a point or bar to see where that school is positioned on the other charts. The points are colored by the school's membership in the 4 groups.



Conclusion

There clearly is a lot of money in NCAA athletics, even though the athletes themselves are not paid a salary. The analysis shows that subsidies are provided to most athletic departments of public schools. Given limited taxpayer dollars, and the picture that emerges after juxtaposing the other articles on the use of taxpayer funding for sports, a question that begs an answer is whether taxpayer dollars are going to the right places.

Saturday, October 5, 2013

Crime Against Women in India - Addressing 8 Questions Using rCharts, googleVis, and shiny

UPDATE: THE BLOG/SITE HAS MOVED TO GITHUB. THE NEW LINK FOR THE BLOG/SITE IS patilv.github.io and THE LINK TO THIS POST IS: http://bit.ly/1lHtVon. PLEASE UPDATE ANY BOOKMARKS YOU MAY HAVE.

Recent crimes against women, specifically the 2012 gang rape of a 23-year-old woman in New Delhi, have pushed this issue to the forefront for Indians to deal with. In this post, I try to address 8 different questions regarding crime against women in India. (1) How have the numbers and percentages of different types of crimes changed over the past few years, for the country as a whole and for individual States/Union Territories? This is answered using data available for 12 years (2001-2012); for the rest of the questions, I focus on data from 2012. (2) At the State/Union Territory level, how are the different types of crimes related to each other, in terms of their correlations? (3) Which States/UTs have a higher incidence of different types of crimes? (4) How can the States/UTs be classified based on the nature and extent of the crime occurring in them? In questions 5 through 7, I ask questions 2 through 4 again, but for different cities. Question 8: can the States, Union Territories, and Cities be mapped along with their incidence levels of different crimes in 2012?

Tools Used: Analysis has been done using R and visualizations have been developed using rCharts and googleVis. Interactivity has also been facilitated by the shiny server environment, thanks to RStudio. Other R libraries used include reshape2, plyr, and scales. Special thanks to Ramnath Vaidyanathan and @timelyportfolio for their super quick assistance with questions on rCharts. All datasets used for this post and the code for generating charts and shiny applications are available at github.

Given the length of the post and the number of charts, I decided to avoid posting it completely here and have created a separate page on blogger. For the entire post, please visit: http://bit.ly/1937EN9 

Wednesday, September 11, 2013

Why use R for Data Analysis?

UPDATE: THE BLOG/SITE HAS MOVED TO GITHUB. THE NEW LINK FOR THE BLOG/SITE IS patilv.github.io and THE LINK TO THIS POST IS: http://bit.ly/1k0oHT5. PLEASE UPDATE ANY BOOKMARKS YOU MAY HAVE.

Update: Ramnath, thanks for the fix for the error from rawgithub.
This was a presentation made to a few of my colleagues. The idea was to provide a brief introduction to R, give a few reasons for my switching to it, and show some samples of graphics/interactive visualizations that can be created with minimal coding. Slidify (with the impressjs framework) was used to put the presentation together (the coding is imperfect), and the graphs/charts used the ggplot2, rCharts, corrplot, healthvis, and PerformanceAnalytics libraries.

Relevant code: https://github.com/patilv/UseR
Link to the presentation (for a larger view) : https://dl.dropboxusercontent.com/u/56796392/WhyR/index.html OR http://patilv.github.io/UseR2/
(The presentation may not show up with the impressjs jazz if the browser is incompatible with it. Apologies in advance.)

Tuesday, August 27, 2013

Visualizing the Forbes-CCAP University Rankings using ggplot2, rCharts, googleVis, and the shiny server

UPDATE: THE BLOG/SITE HAS MOVED TO GITHUB. THE NEW LINK FOR THE BLOG/SITE IS patilv.github.io and THE LINK TO THIS POST IS: http://bit.ly/1mcryNv. PLEASE UPDATE ANY BOOKMARKS YOU MAY HAVE.

President Obama is pushing for higher education reform, and the development of a rating system for universities is a critical component of it. These ratings are likely to be based on several measures, such as graduation rates, earnings of graduates, and affordability of education. Discussions of these plans can be found in leading national news outlets; see here, here, here, and here for some of those discussions.

In July, Forbes, in partnership with the Center for College Affordability and Productivity (CCAP), produced a ranking of 650 schools in the country. Caroline Howard's article in Forbes indicated that this ranking system emphasized the outputs of colleges (e.g., graduation rates) rather than inputs (e.g., SAT scores of incoming students), which is similar to what President Obama is pushing for. In this post, I take a look at Forbes' rankings of schools using interactive visualizations. These suggest that several challenges lie ahead for any meaningful reform.

Tools used for generating these visualizations include R, ggplot2, googleVis, rCharts, and a shiny-server application. The relevant code and data files can be found here on github. I would appreciate your contacting me if you find any errors in the code or have any comments on this project. 

Forbes-CCAP Rankings. Data for this post were collected from two documents on CCAP's site. Document 1 provided a list of America's 100 Best College Buys, and document 2, Component Rankings, provided the complete listing of all 650 schools, their overall rank, and the other ranking criteria. Details of the methodology used by the ranking system can be found here at Forbes and here on CCAP's site; links to their discussions of the results can be found here at CCAP's site and here at Forbes'. These rankings are based on variables such as post-graduate success, student debt, graduation rate, and student satisfaction, among others. Two summary rankings were produced: one an overall rank, and the second a ranking of the universities providing the best value. 650 schools were ranked in the overall system, and among them, 100 were ranked as providing the best value.

Procedure Followed.
1. Convert both pdf files (Component Rankings and Best College Buys) to excel format (using Able2Extract).
2. Merge both files into one datafile.
3. Combine the State with the name of the university into one column --- there were a few universities with the same name but from different states.
4. For many components in the component rankings, universities ranked between 600 and 650 were assigned a generic rank of "600-650" by Forbes-CCAP; all such schools were assigned a rank of 625 for those specific components. (A sketch of steps 2 through 4 follows this list.)
5. Please remember that a lower rank value indicates that the school is rated higher on that attribute.
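A sketch of steps 2 through 4; the file names, column names, and merge key are hypothetical placeholders, not the actual ones used.

# Step 1 output: two hypothetical csv files converted from the pdfs
components <- read.csv("component_rankings.csv", stringsAsFactors = FALSE)
bestbuys   <- read.csv("best_college_buys.csv",  stringsAsFactors = FALSE)

# Step 2: merge into one datafile, keeping all 650 schools
rankings <- merge(components, bestbuys, by = "University", all.x = TRUE)

# Step 3: append the state so same-named schools in different states stay distinct
rankings$University <- paste(rankings$University, rankings$State, sep = ", ")

# Step 4: replace the generic "600-650" component rank with 625
recode625 <- function(x) as.numeric(gsub("600-650", "625", x))
rankings$Student.Awards <- recode625(rankings$Student.Awards)  # repeat for each component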

The table below provides information used for subsequent analysis. The table can be sorted by any column.



Where are the 100 best value schools in the country? 

Color of dots: Rank, from 1 through 100.
Size of dots: Total Score, which determines the Overall rank (a higher score denotes a lower rank number, which indicates a better overall school). Please note that the overall ranking method and the ranking for best value are different.

(You can hover over the dots to find the name of the school.)


 Relationship between Best Value Rankings and Overall Rankings of Schools 
(Please click on the image below to enlarge it.)
The above graph shows the overall ranks of all 100 universities identified as providing the best value. Universities are listed in the order of their best-value rank (better ranked below, lesser ranked above). Notice that there are only about 7 universities whose overall rank is better than or similar to their best-value rank. The overall ranks of most of the best-value universities are over 100, and a few are over 500.

This is also confirmed in the interactive graph given below, which shows that the overall ranks for the 100 best-value universities range from 7 through 599. Of the 100 schools, only 3 are private; the remaining 97 are public. (You can hover over the dots to identify the school.)



How are different criteria related to each other? 

Given the large number of criteria the Forbes-CCAP rankings provide data on, it is perhaps more appropriate to let you interact with the data to see how different variables are related to each other. The interactive application below will let you do that.



Clearly, a few variables have a positive relationship. For example, universities with a better overall rank also tend to have better ranks on Federal.Student.Debt, Actual.Retention.Rate.Rank, Student.Awards, and Payscale.Salary.Rank. However, not all variables show that kind of relationship; see, for example, the relationship between the Rate.My.Professor rank and Payscale.Salary.Rank, or between the Rate.My.Professor rank and American.Leaders.Rank. Interestingly, private schools appear to have an edge over public schools on the Rate.My.Professor rank.

Where does a particular school stack up, versus others? 

The interactive application below provides additional information over what we just saw. Here, you can select specific schools and figure out how they compare to others. The application can be played with in the following ways.
1. The two small tabs on the top right switch between bubble charts and bar graphs, depending on what's selected.
2. The horizontal and vertical axes can be changed to other variables by clicking on the existing axis labels (the arrow mark).
3. Change the Size parameter to "Same size" (the topmost option) to ease interpretation.
4. You can hover over the dots to identify the school.
5. Coloring can be modified, though I would leave it at its default state.
6. Specific schools (one or many) can be selected, and changes in their position with respect to different variables can be monitored.


Concluding Remark and Acknowledgements

It is likely to be a long and arduous road ahead for education reform. Should it be done? Definitely. How? Wiser people need to figure this out.

Thanks to Ramnath Vaidyanathan, Joe Cheng, and other members of the R community for their immediate responses to queries and excellent resources they have created for all to use.

Monday, August 12, 2013

Short tales of two NCAA basketball conferences (Big 12 and West Coast) using graphs

UPDATE: THE BLOG/SITE HAS MOVED TO GITHUB. THE NEW LINK FOR THE BLOG/SITE IS patilv.github.io and THE LINK TO THIS POST IS: http://bit.ly/1kvathJ. PLEASE UPDATE ANY BOOKMARKS YOU MAY HAVE.

Having been at the University of Kansas (Kansas Jayhawks) as a student and now working at Gonzaga University (Gonzaga Bulldogs), I find discussions about college basketball inescapable. This post uses R, ggmap, ggplot2, and the shiny server to graphically visualize relationships between a few basketball-related variables for two NCAA basketball conferences: the Big 12 and the West Coast Conference. The code used to generate the graphs in this post, and for the shiny server application referred to in the post, can be found here.

A visit to espn.com's page at http://espn.go.com/mens-college-basketball/teams shows a listing of universities in different conferences. The image below presents one such example.


A click on the "Stats" link for a school will take you to a page like this.

The page provides two tables: the first holds game statistics (total games played and per-game statistics for the season) for all players from that team, and the second holds cumulative numbers for the entire season. Tables in this format are useful, but they do not provide information on relationships between variables, between teams, or between conferences. So the objective of this post is to visualize a few of these relationships. In order to do this, we do the following.

Steps
1. Find a way to extract data from the website. Remember, each page for a college provides information on only one season, but information on 12 seasons is available.  This is done using a function developed to scrape the site.
2. Clean and prepare the data.
3. Prepare preliminary graphs, and lastly,
4. Make the graph generation process interactive for people to focus on their favorite teams or variables.

Scope of this post
1. We focus only on the first of the two tables of statistics provided by espn.com (see the previous image) --- total games played and per-game statistics. Please note that these are player-level data.
2. This post does not look at individual players, by name, on different teams -- so it does not identify the best players on any variable -- but it uses player-level data to study teams and conferences.
3. We look at only 20 teams, 10 each in the West Coast Conference and the Big 12 --- but with some extra time, interest, and patience (TIP), the approach can be used for the remaining 331 Division I teams in 30 different basketball conferences.
4. Not all questions pertaining to the variables studied are answered, and not all graphical formats are explored. For example, we do not have data on the win/loss record or on the competition teams faced.

The fun begins...

Step 1: Extracting Data
The image previously presented of a typical statistics page from espn has certain distinct elements.
1. Each page has two tables, the first with game statistics and the second with season totals (cumulative numbers for the season); the headers are the same on every season's page.
2. The URL for a page has three different parts.
(a) Part 1 has the team id, Part 2 the year, and Part 3 the team name.
(b) The second part, the year, represents a season --- so when Part 2 is 2012, the page describes the 2011-12 season. This is something to remember so we can make the necessary change in the data.
3. The basic idea is to create a function which does the following: (i) take in Part 1 and Part 3, along with any other variables we deem appropriate, as inputs; (ii) loop through the pages representing different seasons for a team; (iii) from each page, extract the two tables and update two master data frames with their data --- two because we have game stats and season stats (the latter is not used for the analysis in this post, but the data are available for anyone wishing to look at them). This function is then used to extract data for the 20 universities.
First, the libraries required for all of this code:


library(XML)
library(ggmap)
library(ggplot2)
library(plyr)

Initialize two empty master data frames, which we'll call Gamestatistics and Seasonstatistics, and provide column names from the two tables. 

Gamestatistics=as.data.frame(matrix(ncol=16))
names(Gamestatistics) = c("Player", "GP", "MIN", "PPG", "RPG", "APG", "SPG", "BPG", "TPG", "FG%", "FT%", "3P%","Year","Team","City","Conference")

Seasonstatistics=as.data.frame(matrix(ncol=20))
names(Seasonstatistics) = c("Player", "MIN", "FGM", "FGA", "FTM", "FTA", "3PM", "3PA", "PTS", "OFFR", "DEFR", "REB", "AST",  "TO", "STL", "BLK","Year","Team","City","Conference")

Note that we have created columns for Year, Team, City, and Conference --- variables beyond the columns provided by the two tables, which will either have to be calculated or manually determined. For each of the 20 colleges of interest from the 2 conferences, we can prepare something like the following.

URLpart1="http://espn.go.com/mens-college-basketball/team/stats/_/id/2250/year/"
URLpart3 ="/gonzaga-bulldogs"
Team="Gonzaga Bulldogs"
City="Spokane, WA"
Conference="West Coast"

The city information was obtained from Wikipedia for the schools I didn't know; the other information was available on the espn page for each university. Once we have prepared the parameters required by our function (getData), we call it. Remember that the function returns the two tables; we collect them in a list named gameandseasonstats and extract the updated master tables, Gamestatistics and Seasonstatistics, from the list.


gameandseasonstats=getData(URLpart1,URLpart3,Team,City,Conference)
Gamestatistics=gameandseasonstats[[1]]
Seasonstatistics=gameandseasonstats[[2]]

Now the function.


getData=function(URLpart1,URLpart3,Team,City,Conference){
  # Reads the master data frames (Gamestatistics, Seasonstatistics) from the
  # calling environment and returns updated copies of both in a list
  for (i in 2002:2013){

    # Assemble the complete URL for season i and read every HTML table on the page
    URL=paste(paste(URLpart1,as.character(i),sep=""),URLpart3,sep="")
    tablesfromURL = readHTMLTable(URL)

    # First table: per-game statistics; tag its rows with year, team, city, conference
    gamestat=tablesfromURL[[1]]
    names(gamestat) = c("Player", "GP", "MIN", "PPG", "RPG", "APG", "SPG", "BPG", "TPG", "FG%", "FT%", "3P%")
    gamestat$Year=i
    gamestat$Team=Team
    gamestat$City=City
    gamestat$Conference=Conference
    Gamestatistics=rbind(gamestat,Gamestatistics)

    # Second table: cumulative season statistics; tagged the same way
    seasonstat=tablesfromURL[[2]]
    names(seasonstat) = c("Player", "MIN", "FGM", "FGA", "FTM", "FTA", "3PM", "3PA", "PTS", "OFFR", "DEFR", "REB", "AST", "TO", "STL", "BLK")
    seasonstat$Year=i
    seasonstat$Team=Team
    seasonstat$City=City
    seasonstat$Conference=Conference
    Seasonstatistics=rbind(seasonstat,Seasonstatistics)
  }
  return(list(Gamestatistics,Seasonstatistics))
}

What this does is the following.
(a) Receives the parameters we send to the function, getData.
(b) For every year's page, from 2002 to 2013, assembles the complete URL of the page by joining URLpart1, the year, and URLpart3.
(c) Gets the two tables and stores them in two temporary tables, gamestat and seasonstat.
(d) Adds new columns for year, team, city, and conference.
(e) Adds these rows to the respective master tables, Gamestatistics and Seasonstatistics, and returns them in a list, which we retrieve outside the function and reuse for the next school.
With that, we have collected the 2 tables for 20 schools from 2 conferences, from 240 different pages on espn.com. On to the next stage.

Step 2 Clean and Prepare the Data

The dataframes have three types of rows which need to be removed: instances where the value of Player is "Player" (repeated header rows), "Totals", or NA (remember, we initialized empty dataframes). These come from the way the tables were read by our function. That's easy.


Gamestatistics=Gamestatistics[which(Gamestatistics$Player!="NA"),]
Gamestatistics=Gamestatistics[which(Gamestatistics$Player!="Player"),]
Gamestatistics=Gamestatistics[which(Gamestatistics$Player!="Totals"),]

Converting a few variables to numbers and a few to factors was accomplished by these two lines.


for (i in 2:12){Gamestatistics[, i] = as.numeric(as.character(Gamestatistics[,i]))}
for(i in 14:16){Gamestatistics[,i]=as.factor(Gamestatistics[,i])}

Then, columns were renamed to describe each variable completely, so PPG became Points.Per.Game. I could have done this at the very beginning, but I didn't have the foresight.


names(Gamestatistics) = c("Player", "Games.Played", "Minutes", "Points.Per.Game", "Rebounds.Per.Game", "Assists.Per.Game", "Steals.Per.Game",
"Blocks.Per.Game", "Turnovers.Per.Game", "Field.Goal.Percent",
"Free.Throw.Percent", "Three.Point.FieldGoal.Percent", "Year", "Team", "City","Conference")

  
And the last thing left was converting years back to seasons --- so 2002 became "2001-2002" --- accomplished using the following for each year between 2002 and 2013.



Gamestatistics$Year<-gsub("2002", "2001-2002", Gamestatistics$Year)
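One gsub call per year works; a loop over all twelve years is more compact (a sketch equivalent to repeating the line above for 2002 through 2013):

# Ascending order matters: labels produced so far contain only years already processed,
# so later patterns never match inside an existing "YYYY-YYYY" label
for (yr in 2002:2013){
  Gamestatistics$Year <- gsub(as.character(yr), paste(yr-1, yr, sep="-"), Gamestatistics$Year)
}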

Remember, these changes are required for the second table as well --- Seasonstatistics, which we don't use in this post. However, both cleaned dataframes are available for anyone to play with; they are with the code at the github link provided previously.

Step 3: Prepare preliminary graphs

Which 20 schools did we collect data on?


ggplot(Gamestatistics,aes(x=Conference,y=City,color=Conference))+ 
geom_text(data=Gamestatistics,aes(label=Team))+
  theme(axis.text.x = element_text(color="black",size=12))+
  theme(axis.text.y = element_text(color="black",size=12))+theme(legend.position="none")+labs(y="",x="")


Let's plot these cities on a map.

First, get the US map from osm, get the list of cities from our data, get their latitudes and longitudes, and add this information to our data.


location=c(-125,24.207,-70,50) # It took a bit to figure these coordinates out - zoom to the appropriate location and level using openstreetmap.org
# and find coordinates from the export link

map=get_map(location=location,maptype="roadmap",source="osm")
usmap=ggmap(map)

locs=geocode(as.character(unique(Gamestatistics$City))) # find the 20 cities from the data and identify their latitude and longitude; combine City information
locs$City=unique(Gamestatistics$City)
Gamestatistics$lat=locs$lat[ match(Gamestatistics$City,locs$City)]# bring latitude and longitude information to main data frame
Gamestatistics$lon=locs$lon[ match(Gamestatistics$City,locs$City)]


And, the plot.


usmap+geom_point(data=Gamestatistics,aes(x=lon,y=lat,color=Conference),size=7)+ ggtitle("Location of WCC and Big 12 Schools")

1. BYU is in Utah, which isn't a coastal state, unlike the other schools from the WCC.
2. WVU doesn't exactly fit the region of the other Big 12 schools.
3. BYU and WVU recently joined these conferences.

Just a thought: on what basis are conferences named? The Big 12 has 10 teams, but the Big Ten has 12? Clearly, there appears to be a branding issue. The PAC-10 did change to PAC-12 after the recent addition of two new teams. Hmm.

Let's plot histograms of some variable, say Points.Per.Game, for all teams.


ggplot(Gamestatistics,aes(x=Points.Per.Game, fill=Team))+
  geom_histogram()+ggtitle("Histogram of Points.Per.Game for All Teams - Data Collapsed Across All Years")+ facet_wrap(~Team,ncol=4) + theme(legend.position="none")

We could also compare the distributions of two different schools on one variable --- let's take a look at the Gonzaga Bulldogs and Kansas Jayhawks on, say, Points.Per.Game.

ggplot(subset(Gamestatistics,Team %in% c("Gonzaga Bulldogs","Kansas Jayhawks")),aes(x=Points.Per.Game, fill=Team))+  geom_density(alpha=.3)+ ggtitle("Kernel Density Plots of Points.Per.Game for Gonzaga Bulldogs and Kansas Jayhawks for all Years")+ facet_wrap(~Year,ncol=4)

We might also be interested in seeing how the mean points per game of team players change over years for different teams.



# Mean calculation of Points.Per.Game of Team players for a season
ppgmean=ddply(Gamestatistics,.(Team,Year),summarize,Mean.Points.Per.Game=mean(Points.Per.Game))

#Plot
ggplot(ppgmean,aes(x=Year,y=Mean.Points.Per.Game,color=Team,group=Team))+
  geom_point()+geom_line()+facet_wrap(~Team,ncol=4)+theme(legend.position="none")+
  theme(axis.text.x = element_text(angle=-90))+ggtitle("Mean Points Per Game of Players of Different Teams in Different Seasons")


Alternatively, we might be interested in how the mean points per game of team players changed for two teams across different seasons.



# Mean points per game comparison for two teams, say, Gonzaga and Kansas, over years


ggplot(subset(ppgmean,Team %in% c("Gonzaga Bulldogs","Kansas Jayhawks")),aes(x=Year,y=Mean.Points.Per.Game,color=Team,group=Team))+
  geom_point()+geom_line()+ggtitle("Mean Points Per Game of Players of Gonzaga Bulldogs and Kansas Jayhawks in Different Seasons")




We could also look at relationships between two variables (Points per game and Assists.Per.Game) in teams across different years and add in a LOESS curve.


ggplot(Gamestatistics,aes(x=Points.Per.Game, y=Assists.Per.Game, color=Team))+
  geom_jitter()+ geom_smooth(method='loess',level=0,size=1,aes(color=Team))+
  ggtitle("Scatter Plots with LOESS smoothing of Points.Per.Game and Assists for All Teams -- Data Collapsed Across All Years")+ facet_wrap(~Team,ncol=4) +
  theme(legend.position="none") 


We could also compare the relationship of two variables - points per game and assists per game, for two or more schools.

ggplot(subset(Gamestatistics,Team %in% c("Gonzaga Bulldogs","Kansas Jayhawks")),aes(x=Points.Per.Game, y=Assists.Per.Game, color=Team))+
  geom_jitter()+ geom_smooth(method='loess',level=0,size=1,aes(color=Team))+ facet_wrap(~Year,ncol=4)+
  ggtitle("Scatter Plots with LOESS smoothing of Points.Per.Game and Assists for Gonzaga Bulldogs and Kansas Jayhawks -- Data Collapsed Across All Years")




Step 4: Interactively generate graphs - for any combination of variables and teams

Of course, the graphs shown previously could be generated to compare both conferences, and there may be other variables of interest on which we might want to compare different schools or conferences. It would be unwieldy to present graphs of all possibilities here; introducing interactivity, by letting you play around with variables and schools, helps. For this, we rely on the shiny server platform from RStudio. Our shiny server application uses three preloaded data files --- Gamestatisticscleaned.rda, plus files of team-wise and conference-wise means of all variables for all years (meansteams.rda and meansconferences.rda, respectively) --- and presents 9 different graphs. The code to generate these two additional data files is given below.



meansteams=ddply(Gamestatistics,.(Team,Year),summarize,
                 Points.Per.Game=mean(Points.Per.Game),
                 Games.Played=mean(Games.Played),
                 Minutes = mean(Minutes),
                 Rebounds.Per.Game=mean(Rebounds.Per.Game),
                 Assists.Per.Game=mean(Assists.Per.Game),
                 Steals.Per.Game=mean(Steals.Per.Game),
                 Blocks.Per.Game=mean(Blocks.Per.Game),
                 Turnovers.Per.Game=mean(Turnovers.Per.Game),
                 Field.Goal.Percent=mean(Field.Goal.Percent),
                 Free.Throw.Percent=mean(Free.Throw.Percent),
      Three.Point.FieldGoal.Percent=mean(Three.Point.FieldGoal.Percent)
                 )
meansconferences=ddply(Gamestatistics,.(Conference,Year),summarize,
                       Points.Per.Game=mean(Points.Per.Game),
                       Games.Played=mean(Games.Played),
                       Minutes = mean(Minutes),
                       Rebounds.Per.Game=mean(Rebounds.Per.Game),
                       Assists.Per.Game=mean(Assists.Per.Game),
                       Steals.Per.Game=mean(Steals.Per.Game),
                       Blocks.Per.Game=mean(Blocks.Per.Game),
                       Turnovers.Per.Game=mean(Turnovers.Per.Game),
                       Field.Goal.Percent=mean(Field.Goal.Percent),
                       Free.Throw.Percent=mean(Free.Throw.Percent),
      Three.Point.FieldGoal.Percent=mean(Three.Point.FieldGoal.Percent)
                )
save(meansteams,file="meansteams.rda")
save(meansconferences,file="meansconferences.rda")




The code for the shiny server user interface (ui.r) and server (server.r) is available through github. Please click on the screenshot below to access the application.
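For orientation, a skeletal ui.r/server.r pair might look like the sketch below; the input names and the single plot are illustrative assumptions, and the actual application on github offers 9 graphs and more controls.

# ui.r (skeletal sketch; not the application's actual interface)
library(shiny)
shinyUI(pageWithSidebar(
  headerPanel("WCC and Big 12 Game Statistics"),
  sidebarPanel(
    selectInput("variable","Variable:",choices=c("Points.Per.Game","Assists.Per.Game")),
    selectInput("team","Team:",choices=c("Gonzaga Bulldogs","Kansas Jayhawks"))
  ),
  mainPanel(plotOutput("trend"))
))

# server.r (skeletal sketch)
library(shiny)
library(ggplot2)
load("meansteams.rda")  # team-wise means created above
shinyServer(function(input, output){
  output$trend <- renderPlot({
    ggplot(subset(meansteams, Team==input$team),
           aes_string(x="Year", y=input$variable, group="Team"))+
      geom_point()+geom_line()
  })
})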

Concluding Remarks
1. The objective of this post was to visualize a few relationships between the game statistics presented by espn.com for teams in two NCAA basketball conferences: the West Coast and the Big 12.
2. Data were scraped from 240 different pages on espn.com.
3. Data on the win/loss history or on the competition teams faced were not available on the pages and hence were not used.
4. Only one of the two tables --- the one concerning game statistics --- was used for the analyses; the data from the second table, season statistics (cumulative numbers for the entire season), are available for download from the previously provided link.
5. R code was formatted for the blog using the Highlight 3.14 software.
6. I would appreciate any comments/suggestions you might have on the application or on the code. Thanks.
--------------------------