Friday, October 18, 2013

Subsidies, Revenues, and Expenses of NCAA Division I Public Schools Athletic Departments – Visualizations using rCharts and animint generated interactive ggplots


Two interesting articles in recent issues of The Atlantic linked sports to tax-payer dollars. The first few lines of Gregg Easterbrook’s article, titled How the NFL Fleeces Taxpayers, were: “Taxpayers fund the stadiums, antitrust law doesn't apply to broadcast deals, the league enjoys nonprofit status, and Commissioner Roger Goodell makes $30 million a year. It's time to stop the public giveaways to America's richest sports league—and to the feudal lords who own its teams.” Amanda Ripley’s article was titled The Case Against High-School Sports. Its first few lines were: “The United States routinely spends more tax dollars per high-school athlete than per high-school math student—unlike most countries worldwide. And we wonder why we lag in international education rankings?” An USA Today Sports article by Steve Berkowitz, Jodi Upton and Erik Brady addressed a third level of sports- collegiate athletics, and linked it to tax-payer dollars. The article, titled “Most NCAA Division I athletic departments take subsidies” focused on the issue of providing subsidies to athletic departments of public schools that were self-sufficient (16 out of 228 schools) and asked if that was necessary. In this post, I use data provided in the USA Today article to further study the expenses, revenues, and subsidies of the athletic departments of the 228 public schools.

All charts used in this post permit interactivity of some sort. These were created using rCharts and animint. These in turn rely on a set of other libraries such as ggplot2 and polycharts JS. Thanks to Toby Dylan Hocking, @timelyportfolio and Ramnath Vaidyanathan for their prompt assistance on charting questions. Code, data, and other files relevant for a replication of this analysis are available at github. 

Data on total revenues, total expenses, and subsidies for 228 athletic departments of public universities were provided for the year 2012. An overall rank was also provided to different athletic departments on their finances. The methodology and category explanations USA Today used are available here.  Data were collected by USA Today in partnership with Indiana University’s National Sports Journalism Center. In the data analyzed below, Total Revenues for a department include the subsidies it received. An additional variable, Revenues less Expenses was calculated, which was nothing but the Total Revenues less Total Expenses. The complete data have been tabulated below. The columns can be sorted based on the value.

Now, let’s visualize the data using a heat map. Along the columns are different schools (you don’t see all labels but if you hover over the rectangle tiles, you should see the school identified). Values for expenses, revenues and other variables are scaled to a 0 to 1 scale, thereby permitting a visualization of all variables in a better manner. Higher the dollar amount, higher the intensity of red and the scaled value will be closer to 1. Lower the dollar value, lower the intensity of the red--- or white if dollar amount is negligible. To verify, you can hover over the most intense red parts of the Total Subsidy row and you should find Nevada-Las Vegas with a scaled value of 1. It is because it gets the highest subsidy.

The heat map clearly shows that most schools receive subsidy to some degree; the row for Total Revenue indicates that that there are very few schools that generate very high amounts of revenue and as a result, the Revenue less Expenses row is generally a shade of faded red for most schools, save for a handful of those with very high values.

Relationship between Total Subsidy, Total Revenue, Total Expenses and Revenue Less Expenses
The correlation matrix given below provides this information. It tells you how related two variables are to each other. The correlation coefficient, a measure of the relationship, can vary from –1 through +1. A negative coefficient indicates indicates a negative relationship (in red or a shade of red) between the two variables, whereas a positive coefficient (in blue or a shade of blue) denotes a positive relationship (higher values of one variable are associated with higher values of the other variable). You should be able to hover over the tiles to get more information.

There's a perfect positive correlation between Total Revenue and Total Expenses, and these variables are negatively correlated to the overall rank (higher the revenue, better the rank [with a lower number denoting better performance]). No surprises there. Total Subsidy is not positively related to any other variable.

Groups/Clusters of Athletic Departments

In order to classify different athletic departments into different groups, a statistical technique called k-means clustering was used. Based on few statistical measures, which can be ascertained from the code provided, it was decided that all 228 departments be grouped into 4 different groups. This grouping is done such that Schools with similar characteristics (with respect to total subsidy, total revenue, total expenses and revenue less expenses) are grouped together. Group 1 had 19 schools, Group 2 had 61 schools, Group 3 had 33 schools and the remaining 115 schools belonged to Group 4. Schools are tabulated based on the 4 groups below.

Characteristics of these groups/clusters. The chart below can be used to determine the profile of each of the 4 groups. You can select (box around a group number) and click to find few lines highlighted. This chart gives the mean scores on variables of interest for the 4 groups. Lower the mean score in the group, lesser the dollar amount for that group.

Group 1 (cluster of 19 schools) has the highest mean Total Revenues (approximately $111.48 million), highest mean Total Expenses (approximately $101 million), and highest Revenues less Expenses (over $10 million). It takes the least amount of subsidy (approximately $2.85 million). Contrast Group 1 with Group 4, a cluster of 115 schools, the largest among the four groups. Group 4 has the lowest mean Total Revenue, lowest mean Total Expenses, a subsidy average of approximately $8.3 million dollars, and the lowest Revenue less Expense of $17,550. Group 3 schools, a cluster of 33, taken in an average subsidy amount which is slightly less than Group 4 schools (approximately $7.9 million) but has the second highest mean levels with respect to Total Revenues (approximately 70.34 million), Total Expenses (approximately 69.3 million) , and Revenues less Expenses (little over $1 million) . Lastly, Group 2 schools, who receive the highest average subsidy (approximately $16.9 million) have total revenues of approximately $29 million, expenses of approximately $28.99 million for a Revenue less Expenses amount of about $37,000.

The following set of plots provide the potential to explore relationships between different variables and subsidies and the 4 different groups further. You should be able to hover over either the bars of the bar charts or the points of the scatter plots to get information of the school being hovered over. You should also be able to click on a point or bar to see where that school is positioned on the other charts. The points have been colored based on the membership of the corresponding school in the 4 different groups.


There clearly is a lot of money in NCAA athletics, even though the athletes themselves are not paid a salary. The analysis shows that subsidies are provided to most athletic departments of public schools. Given limited tax-payer dollars and the picture that emerges after juxtaposing other articles on the use of tax-payer funding for sports, a question that begs answers is whether tax-payer dollars are going to the right places.

Saturday, October 5, 2013

Crime Against Women in India - Addressing 8 Questions Using rCharts, googleVis, and shiny


 Recent crimes against women, specifically the 2012 gang rape in New Delhi of a 23 year old lady, have pushed this issue as a substantially significant one for Indians to deal with. In this post, I try to address 8 different questions regarding crime against women in India. (1). How have numbers in different types of crimes and percentage of different crimes changed over the past few years for the country as a whole and for individual States/Union Territories? This is answered using data available for 12 years (2001-2012). For rest of the questions, I focus on data from 2012. (2) At the States/Union territories level, how are different types of crimes related to each other, in terms of their correlations? (3) Which States/UTs have a higher incidence of different types of crimes? (4) How can different States/UTs be classified based on the nature and extent of crime occurring in them? In question 5 through 7,  I ask questions 2 through 4 again, but for different cities.  Question 8 – Can the States, Union Territories, and Cities be mapped along with their incidence levels of different crimes in 2012? 

Tools Used: Analysis has been done using R and visualizations have been developed using rCharts and googleVis. Interactivity has also been facilitated by the shiny server environment, thanks to RStudio. Other R libraries used include reshape2, plyr, and scales. Special thanks to Ramnath Vaidyanathan and @timelyportfolio for their super quick assistance with questions on rCharts. All datasets used for this post and the code for generating charts and shiny applications are available at github.

Given the length of the post and the number of charts, I decided to avoid posting it completely here and have created a separate page on blogger. For the entire post, please visit: