Monday, August 12, 2013

Short tales of two NCAA basketball conferences (Big 12 and West Coast) using graphs

UPDATE: THE BLOG/SITE HAS MOVED TO GITHUB. THE NEW LINK FOR THE BLOG/SITE IS patilv.github.io and THE LINK TO THIS POST IS: http://bit.ly/1kvathJ. PLEASE UPDATE ANY BOOKMARKS YOU MAY HAVE.

Having been at the University of Kansas (Kansas Jayhawks) as a student and now working at Gonzaga University (Gonzaga Bulldogs), I find discussions about college basketball inescapable. This post uses R, ggmap, ggplot2 and the shiny server to graphically visualize relationships between a few basketball-related variables for two NCAA basketball conferences - Big 12 and the West Coast Conference. Code used to generate graphs in this post and for the shiny server application referred to in the post can be found here.

A visit to espn.com's page at http://espn.go.com/mens-college-basketball/teams shows a listing of universities in different conferences. The image below presents one such example.


A click on the "Stats" link for a school will take you to a page like this.

The page provides two tables: the first has game statistics (total games played and per game statistics for the season, for all players on that team) and the second has cumulative numbers for the entire season. Tables in this format are useful, but they do not provide information on relationships between variables, between teams, or between conferences. So, the objective of this post is to visualize a few of these relationships. In order to do this, we do the following.

Steps
1. Find a way to extract data from the website. Remember, each page for a college provides information on only one season, but information on 12 seasons is available. The extraction is done using a function developed to scrape the site.
2. Clean and prepare the data.
3. Prepare preliminary graphs, and lastly,
4. Make the graph generation process interactive for people to focus on their favorite teams or variables.

Scope of this post
1. We focus only on the first of the two tables of statistics provided by espn.com (see the previous image) --- total games played and per game statistics. Please note that these are player level data.
2. This post does not look at individual players, by name, in different teams -- so it does not identify best players on any variable -- but it uses player level data to study teams and conferences.
3. We look at only 20 teams, 10 each in the West Coast Conference and Big 12  ---- but with some extra time, interest, and patience (TIP), it can be used for the remaining 331 Division 1 teams in 30 different basketball conferences.
4. Not all questions pertaining to the variables studied are answered and not all graphical formats are explored. For example, we do not have data on the win/loss record or on the competition teams faced.

The fun begins...

Step 1: Extracting Data
The image previously presented of a typical statistics page from espn has certain distinct elements.
1. Each page has two tables, the first about game statistics and the second about the season (cumulative numbers for the season), they have the same headers.
2. The URL for a page has three different parts.
(a) Part 1 has the team id, Part 2 the year, and Part 3 the team name.
(b) The second part, year, represents a season, so when Part 2 is 2012, the page describes the 2011-12 season. This is something to remember, so that the necessary change can be made in the data.
3. The basic idea is to create a function that does the following: (i) takes in Part 1 and Part 3, along with any other variables we deem appropriate, as inputs; (ii) loops through the pages representing different seasons for a team; (iii) extracts the two tables from each page and updates two master data frames with their data. There are two master data frames because we have game stats and season stats (the latter is not used for analysis in this post, but the data is available for anyone wishing to look at them). This function is then used to extract data for the 20 universities.
First, the libraries required for all of this code:


library(XML)
library(ggmap)
library(ggplot2)
library(plyr)

Initialize two empty master data frames, which we'll call Gamestatistics and Seasonstatistics, and provide column names from the two tables. 

Gamestatistics=as.data.frame(matrix(ncol=16))
names(Gamestatistics) = c("Player", "GP", "MIN", "PPG", "RPG", "APG", "SPG", "BPG", "TPG", "FG%", "FT%", "3P%","Year","Team","City","Conference")

Seasonstatistics=as.data.frame(matrix(ncol=20))
names(Seasonstatistics) = c("Player", "MIN", "FGM", "FGA", "FTM", "FTA", "3PM", "3PA", "PTS", "OFFR", "DEFR", "REB", "AST",  "TO", "STL", "BLK","Year","Team","City","Conference")

Note that we have created columns for Year, Team, City, and Conference. These variables go beyond the columns provided by the two tables and will have to be either calculated or manually determined. For each of the 20 colleges of interest from the 2 conferences, we can prepare something like the following.

URLpart1="http://espn.go.com/mens-college-basketball/team/stats/_/id/2250/year/"
URLpart3 ="/gonzaga-bulldogs"
Team="Gonzaga Bulldogs"
City="Spokane, WA"
Conference="West Coast"

The city information was obtained from Wikipedia for those I didn't know, and the other information was available on the espn page for that university. Once we have prepared the parameters required by our function (getData), we call it. Remember that the function should return two tables. We collect these in a list named gameandseasonstats and extract the updated master tables, Gamestatistics and Seasonstatistics, from the list.


gameandseasonstats=getData(URLpart1,URLpart3,Team,City,Conference)
Gamestatistics=gameandseasonstats[[1]]
Seasonstatistics=gameandseasonstats[[2]]

Now the function.


getData=function(URLpart1,URLpart3,Team,City,Conference){
  # Gamestatistics and Seasonstatistics are first read from the global environment;
  # the assignments below create local copies, which are returned at the end.
  for (i in 2002:2013){
    # Assemble the full URL for season i: URLpart1 + year + URLpart3
    URL=paste(paste(URLpart1,as.character(i),sep=""),URLpart3,sep="")
    tablesfromURL = readHTMLTable(URL)

    # First table on the page: per game statistics
    gamestat=tablesfromURL[[1]]
    names(gamestat) = c("Player", "GP", "MIN", "PPG", "RPG", "APG", "SPG", "BPG", "TPG", "FG%", "FT%", "3P%")
    gamestat$Year=i
    gamestat$Team=Team
    gamestat$City=City
    gamestat$Conference=Conference
    Gamestatistics=rbind(gamestat,Gamestatistics)

    # Second table on the page: cumulative season statistics
    seasonstat=tablesfromURL[[2]]
    names(seasonstat) = c("Player", "MIN", "FGM", "FGA", "FTM", "FTA", "3PM", "3PA", "PTS", "OFFR", "DEFR", "REB", "AST",  "TO", "STL", "BLK")
    seasonstat$Year=i
    seasonstat$Team=Team
    seasonstat$City=City
    seasonstat$Conference=Conference
    Seasonstatistics=rbind(seasonstat,Seasonstatistics)
  }
  # Return both updated master data frames in one list
  return(list(Gamestatistics,Seasonstatistics))
}

What this does is the following.
(a) Receives the parameters we send to the function - getData
(b) For every year's page, from 2002-2013, assembles the complete URL of the page by adding URLpart1, the year, and URLpart3 together.
(c) Gets the two tables and stores them in two temporary tables, gamestat and seasonstat
(d) Adds new columns for year, team, city, and conference.
(e) Adds these rows to the respective master tables, Gamestatistics and Seasonstatistics, and returns them in a list, which we retrieve outside the function and reuse for the next school.
Now, we have collected data on 2 tables for 20 schools from 2 different conferences from 240 different pages from espn.com. On to the next stage.
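
To make step (b) concrete, here is a minimal sketch of the URL assembly for all 12 seasons of one team. It only builds the strings and does not request any pages. seasonURLs is a name introduced here for illustration, and the two URL parts are the Gonzaga values shown earlier.

```r
# URL parts for Gonzaga, as given earlier in the post
URLpart1 <- "http://espn.go.com/mens-college-basketball/team/stats/_/id/2250/year/"
URLpart3 <- "/gonzaga-bulldogs"

# paste0() is vectorized, so one call builds the URL for every season
# (seasonURLs is a hypothetical name, not part of the original code)
seasonURLs <- paste0(URLpart1, 2002:2013, URLpart3)

length(seasonURLs)  # one URL per season
seasonURLs[1]       # URL of the page for the 2001-02 season
```

With 12 such URLs per team and 20 teams, this accounts for the 240 pages mentioned above.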

Step 2: Clean and Prepare the Data

The data frames have three types of rows which need to be removed. These are rows where the value of "Player" is "Player", "Totals", or NA (remember, we initialized the master data frames with a single row of NAs). They result from the way the tables were read by our function. Removing them is easy.


Gamestatistics=Gamestatistics[which(Gamestatistics$Player!="NA"),]
Gamestatistics=Gamestatistics[which(Gamestatistics$Player!="Player"),]
Gamestatistics=Gamestatistics[which(Gamestatistics$Player!="Totals"),]

Converting a few variables to factors and a few to numbers was accomplished by these two lines.


for (i in 2:12){Gamestatistics[, i] = as.numeric(as.character(Gamestatistics[,i]))}
for(i in 14:16){Gamestatistics[,i]=as.factor(Gamestatistics[,i])}
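
Why the as.numeric(as.character(...)) detour? Columns read from the HTML tables may arrive as factors, and calling as.numeric() directly on a factor returns its internal level codes, not the displayed values. The sketch below illustrates this, together with the row filtering, on a made-up table (the players and numbers are hypothetical, not ESPN data).

```r
# Toy table mimicking what the scraper returns: an NA row from the
# initialization, a repeated header row, a Totals row, and two players
toy <- data.frame(
  Player = c(NA, "Player", "A. Smith", "Totals", "B. Jones"),
  PPG = factor(c(NA, "PPG", "12.5", "70.1", "3.0")),
  stringsAsFactors = FALSE
)

# Flag the three kinds of unwanted rows and keep everything else
drop_rows <- is.na(toy$Player) | toy$Player %in% c("Player", "Totals")
clean <- toy[!drop_rows, ]

codes  <- as.numeric(clean$PPG)                # factor level codes, not the statistics
values <- as.numeric(as.character(clean$PPG))  # the numbers as displayed: 12.5 and 3

print(clean)
print(values)
```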

Then, columns were renamed to explain each variable completely, so PPG became Points.Per.Game. I could have done this at the very beginning, but I didn't have the foresight.


names(Gamestatistics) = c("Player", "Games.Played", "Minutes", "Points.Per.Game", "Rebounds.Per.Game", "Assists.Per.Game", "Steals.Per.Game",
"Blocks.Per.Game", "Turnovers.Per.Game", "Field.Goal.Percent",
"Free.Throw.Percent", "Three.Point.FieldGoal.Percent", "Year", "Team", "City","Conference")

  
And the last thing left was converting years back to seasons, so 2002 became "2001-2002". This was accomplished with a line like the following for each year between 2002 and 2013.



Gamestatistics$Year<-gsub("2002", "2001-2002", Gamestatistics$Year)
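
Repeating that gsub() line for each of the 12 years works, but the same mapping can be computed in one step. Here is a minimal sketch, assuming Year still holds the four-digit ending year of the season. yearToSeason is a helper name made up for this illustration.

```r
# Convert an ending year (e.g. 2002) to a season label (e.g. "2001-2002")
# yearToSeason is a hypothetical helper, not part of the original code
yearToSeason <- function(year) {
  year <- as.integer(as.character(year))  # also handles factor or character input
  paste(year - 1, year, sep = "-")
}

yearToSeason(c(2002, 2013))
```

It would be applied as Gamestatistics$Year <- yearToSeason(Gamestatistics$Year), before any year has been converted.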

Remember, these changes are required for the second table as well --- Seasonstatistics, which we don't use for the post. However, both these cleaned dataframes are available for anyone to play with. They are with the code at the github link provided previously.

Step 3: Prepare preliminary graphs

Which 20 schools did we collect data on?


ggplot(Gamestatistics,aes(x=Conference,y=City,color=Conference))+ 
geom_text(data=Gamestatistics,aes(label=Team))+
  theme(axis.text.x = element_text(color="black",size=12))+
  theme(axis.text.y = element_text(color="black",size=12))+theme(legend.position="none")+labs(y="",x="")


Let's plot these cities on a map.

First, get the US map from osm, get the list of cities from our data, get their latitudes and longitudes, and add this information to our data.


location=c(-125,24.207,-70,50) # It took a bit to figure these coordinates out - zoom to the appropriate location and level using openstreetmap.org
# and find coordinates from the export link

map=get_map(location=location,maptype="roadmap",source="osm")
usmap=ggmap(map)

locs=geocode(as.character(unique(Gamestatistics$City))) # find the 20 cities from the data and identify their latitude and longitude; combine City information
locs$City=unique(Gamestatistics$City)
Gamestatistics$lat=locs$lat[ match(Gamestatistics$City,locs$City)]# bring latitude and longitude information to main data frame
Gamestatistics$lon=locs$lon[ match(Gamestatistics$City,locs$City)]
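
The match() lookup in the last two lines may be easier to follow on a small example. In this sketch the coordinates are approximate values typed in by hand instead of being returned by geocode(), and players and locs are toy data frames.

```r
# Toy player-level data: several rows can share a city
players <- data.frame(
  City = c("Spokane, WA", "Lawrence, KS", "Spokane, WA"),
  stringsAsFactors = FALSE
)

# One row per unique city (approximate coordinates, entered manually for illustration)
locs <- data.frame(
  City = c("Spokane, WA", "Lawrence, KS"),
  lat = c(47.66, 38.97),
  lon = c(-117.43, -95.24),
  stringsAsFactors = FALSE
)

# match() gives, for each player row, the row of locs that holds its city
idx <- match(players$City, locs$City)
players$lat <- locs$lat[idx]
players$lon <- locs$lon[idx]

print(players)
```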


And, the plot.


usmap+geom_point(data=Gamestatistics,aes(x=lon,y=lat,color=Conference),size=7)+ ggtitle("Location of WCC and Big 12 Schools")

1. BYU is in Utah, which isn't a coastal state, unlike the other schools from the WCC.
2. WVU doesn't exactly fit the region of the other Big 12 schools.
3. BYU and WVU recently joined these conferences.

Just a thought: On what bases are conferences named? Big 12 has 10 teams, but Big Ten has 12? Clearly, there appears to be a branding issue. PAC-10 did change to PAC-12 after the recent addition of two new teams. Hmm.

Let's plot histograms of some variable, say Points.Per.Game, for all teams.


ggplot(Gamestatistics,aes(x=Points.Per.Game, fill=Team))+
  geom_histogram()+ggtitle("Histogram of Points.Per.Game for All Teams - Data Collapsed Across All Years")+ facet_wrap(~Team,ncol=4) + theme(legend.position="none")

We could also compare the distributions of two different schools on one variable --- let's take a look at Gonzaga Bulldogs and Kansas Jayhawks on say, Points.Per.Game.

ggplot(subset(Gamestatistics,Team %in% c("Gonzaga Bulldogs","Kansas Jayhawks")),aes(x=Points.Per.Game, fill=Team))+  geom_density(alpha=.3)+ ggtitle("Kernel Density Plots of Points.Per.Game for Gonzaga Bulldogs and Kansas Jayhawks for all Years")+ facet_wrap(~Year,ncol=4)






We might also be interested in seeing how the mean points per game of team players changes over the years for different teams.



# Mean calculation of Points.Per.Game of Team players for a season
ppgmean=ddply(Gamestatistics,.(Team,Year),summarize,Mean.Points.Per.Game=mean(Points.Per.Game))

#Plot
ggplot(ppgmean,aes(x=Year,y=Mean.Points.Per.Game,color=Team,group=Team))+
  geom_point()+geom_line()+facet_wrap(~Team,ncol=4)+theme(legend.position="none")+
  theme(axis.text.x = element_text(angle=-90))+ggtitle("Mean Points Per Game of Players of Different Teams in Different Seasons")


Alternately, we might be interested in how the mean points per game of team players changed for two teams, across different seasons.



# Mean points per game comparison for two teams, say, Gonzaga and Kansas, over years


ggplot(subset(ppgmean,Team %in% c("Gonzaga Bulldogs","Kansas Jayhawks")),aes(x=Year,y=Mean.Points.Per.Game,color=Team,group=Team))+
  geom_point()+geom_line()+ggtitle("Mean Points Per Game of Players of Gonzaga Bulldogs and Kansas Jayhawks in Different Seasons")




We could also look at relationships between two variables (Points per game and Assists.Per.Game) in teams across different years and add in a LOESS curve.


ggplot(Gamestatistics,aes(x=Points.Per.Game, y=Assists.Per.Game, color=Team))+
  geom_jitter()+ geom_smooth(method='loess',level=0,size=1,aes(color=Team))+
  ggtitle("Scatter Plots with LOESS smoothing of Points.Per.Game and Assists for All Teams -- Data Collapsed Across All Years")+ facet_wrap(~Team,ncol=4) +
  theme(legend.position="none") 


We could also compare the relationship of two variables - points per game and assists per game, for two or more schools.

ggplot(subset(Gamestatistics,Team %in% c("Gonzaga Bulldogs","Kansas Jayhawks")),aes(x=Points.Per.Game, y=Assists.Per.Game, color=Team))+
  geom_jitter()+ geom_smooth(method='loess',level=0,size=1,aes(color=Team))+ facet_wrap(~Year,ncol=4)+
  ggtitle("Scatter Plots with LOESS smoothing of Points.Per.Game and Assists for Gonzaga Bulldogs and Kansas Jayhawks -- By Season")




Step 4: Interactively generate graphs - for any combination of variable and teams

Of course, previously shown graphs could be generated for comparing both conferences and we could also have other variables of interest that we might want to compare different schools or conferences on. It would be unwieldy to present graphs of all possibilities here. Introducing interactivity by letting you play around with variables and schools will help. For this, we rely on the shiny server platform from RStudio.  Our shiny server application uses three preloaded data files - Gamestatisticscleaned.rda, and files for team-wise and conference-wise means (labeled meansteams.rda and meansconferences.rda, respectively) of all variables for all years and presents 9 different graphs. The code to generate these additional data files is given below.



meansteams=ddply(Gamestatistics,.(Team,Year),summarize,
                 Points.Per.Game=mean(Points.Per.Game),
                 Games.Played=mean(Games.Played),
                 Minutes = mean(Minutes),
                 Rebounds.Per.Game=mean(Rebounds.Per.Game),
                 Assists.Per.Game=mean(Assists.Per.Game),
                 Steals.Per.Game=mean(Steals.Per.Game),
                 Blocks.Per.Game=mean(Blocks.Per.Game),
                 Turnovers.Per.Game=mean(Turnovers.Per.Game),
                 Field.Goal.Percent=mean(Field.Goal.Percent),
                 Free.Throw.Percent=mean(Free.Throw.Percent),
      Three.Point.FieldGoal.Percent=mean(Three.Point.FieldGoal.Percent)
                 )
meansconferences=ddply(Gamestatistics,.(Conference,Year),summarize,
                       Points.Per.Game=mean(Points.Per.Game),
                       Games.Played=mean(Games.Played),
                       Minutes = mean(Minutes),
                       Rebounds.Per.Game=mean(Rebounds.Per.Game),
                       Assists.Per.Game=mean(Assists.Per.Game),
                       Steals.Per.Game=mean(Steals.Per.Game),
                       Blocks.Per.Game=mean(Blocks.Per.Game),
                       Turnovers.Per.Game=mean(Turnovers.Per.Game),
                       Field.Goal.Percent=mean(Field.Goal.Percent),
                       Free.Throw.Percent=mean(Free.Throw.Percent),
      Three.Point.FieldGoal.Percent=mean(Three.Point.FieldGoal.Percent)
                )
save(meansteams,file="meansteams.rda")
save(meansconferences,file="meansconferences.rda")
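
The two ddply() calls list every variable by hand. As an alternative sketch (not the code used for the application), base R's aggregate() can average a whole set of columns at once. The data below are made up, and statCols and toyMeans are names introduced only for this example.

```r
# Made-up player-level data for two teams in one season
toy <- data.frame(
  Team = c("A", "A", "B", "B"),
  Year = c("2001-2002", "2001-2002", "2001-2002", "2001-2002"),
  Points.Per.Game = c(10, 20, 5, 7),
  Assists.Per.Game = c(2, 4, 1, 3),
  stringsAsFactors = FALSE
)

# Columns to average (hypothetical name; extend with the other statistics)
statCols <- c("Points.Per.Game", "Assists.Per.Game")

# One row per Team/Year combination, with the mean of every column in statCols
toyMeans <- aggregate(toy[statCols],
                      by = list(Team = toy$Team, Year = toy$Year),
                      FUN = mean)

print(toyMeans)
```

Replacing Team with Conference in the by list would give the conference-wise means in the same way.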




The code for the shiny server user interface (ui.r) and server (server.r) is available through github. Please click on the screenshot below to access the application.

Concluding Remarks
1. The objective of this post was to visualize a few relationships between game statistics presented for teams by espn.com for two NCAA basketball conferences - West Coast and Big 12.
2. Data were scraped from 240 different pages on espn.com.
3. Data on win/loss history or the competition teams faced were not available on the page and hence were not used.
4. Only one of the two tables, the one with game statistics, was used for the analyses; data from the second table, season statistics (cumulative numbers for the entire season), is available for download from the previously provided link.
5. R code was formatted for the blog purpose using the Highlight 3.14 software.
6. I would appreciate any comments/suggestions you might have on the application or on the code. Thanks.