March 14, 2016

Adding a subtitle to ggplot2

A couple of days ago (2016-03-12), a short blog post by Bob Rudis, "Subtitles in ggplot2", appeared. I was intrigued by the idea and what it could mean for my own plotting efforts, and it turned out to be very simple to apply. (Note that Bob's post originally appeared on his own blog.)
In order to see if I could create a plot with a subtitle, I went back to some of my own code drawing on the Lahman database package. The code below summarizes the data using dplyr, and creates a ggplot2 plot showing the annual average number of runs scored by each team in every season from 1901 through 2014, including a trend line using the loess smoothing method.
This is an update to my series of blog posts, most recently 2015-01-06, visualizing run scoring trends in Major League Baseball.

# load the packages into R; the 'Teams' data table is in the Lahman package
library(Lahman)
library(dplyr)
library(ggplot2)
# ====================
# create a new dataframe that
# - filters from 1901 [the establishment of the American League] to the most recent year,
# - filters out the Federal League
# - summarizes the total number of runs scored, runs allowed, and games played
# - calculates the league runs and runs allowed per game 

MLB_RPG <- Teams %>%
  filter(yearID > 1900, lgID != "FL") %>%
  group_by(yearID) %>%
  summarise(R=sum(R), RA=sum(RA), G=sum(G)) %>%
  mutate(leagueRPG=R/G, leagueRAPG=RA/G)

Plot the MLB runs per game trend

Below is the code to create the plot, including the formatting. Note the hjust=0 (horizontal justification = left) in the plot.title line; the default is for the title to be centred, while the subtitle will be justified to the left.

MLBRPGplot <- ggplot(MLB_RPG, aes(x=yearID, y=leagueRPG)) +
  geom_point() +
  theme_bw() +
  theme(panel.grid.minor = element_line(colour="gray95")) +
  scale_x_continuous(breaks = seq(1900, 2015, by = 20)) +
  scale_y_continuous(limits = c(3, 6), breaks = seq(3, 6, by = 1)) +
  xlab("year") +
  ylab("team runs per game") +
  geom_smooth(span = 0.25) +
  ggtitle("MLB run scoring, 1901-2014") +
  theme(plot.title = element_text(hjust=0, size=16))


MLB run scoring, 1901-2014

Adding a subtitle: the function

So now we have a nice-looking dot plot showing the average number of runs scored per game for the years 1901-2014.

But a popular feature of charts--particularly in magazines--is a subtitle that summarizes what the chart shows and/or what the author wants to emphasize.

In this case, we could legitimately say something like any of the following:
  • The peak of run scoring in the 2000 season has been followed by a steady drop
  • Teams scored 20% fewer runs in 2015 than in 2000
  • Team run scoring has fallen to just over 4 runs per game from the 2000 peak of 5 runs
  • Run scoring has been falling for 15 years, reversing a 30 year upward trend
I like this last one, drawing attention not only to the recent decline but also to the longer upward trend that started with the low-scoring environment of 1968.

How can we add a subtitle to our chart that does that?

The function Bob Rudis created makes adding a subtitle quick and easy. The code below relies on two additional packages, grid and gtable; other than those package loads, it is a straight copy/paste from Bob's blog post.


# the function relies on the grid and gtable packages
library(grid)
library(gtable)

ggplot_with_subtitle <- function(gg,
                                 label = "",
                                 fontfamily = NULL,
                                 fontsize = 10,
                                 hjust = 0, vjust = 0,
                                 bottom_margin = 5.5,
                                 newpage = is.null(vp),
                                 vp = NULL,
                                 ...) {
  if (is.null(fontfamily)) {
    gpr <- gpar(fontsize = fontsize, ...)
  } else {
    gpr <- gpar(fontfamily = fontfamily, fontsize = fontsize, ...)
  }
  subtitle <- textGrob(label, x = unit(hjust, "npc"), y = unit(hjust, "npc"),
                       hjust = hjust, vjust = vjust, gp = gpr)
  data <- ggplot_build(gg)
  gt <- ggplot_gtable(data)
  gt <- gtable_add_rows(gt, grobHeight(subtitle), 2)
  gt <- gtable_add_grob(gt, subtitle, 3, 4, 3, 4, 8, "off", "subtitle")
  gt <- gtable_add_rows(gt, grid::unit(bottom_margin, "pt"), 3)
  if (newpage) grid.newpage()
  if (is.null(vp)) {
    grid.draw(gt)
  } else {
    if (is.character(vp)) seekViewport(vp) else pushViewport(vp)
    grid.draw(gt)
    upViewport()
  }
  invisible(data)
}

Adding a subtitle

Now that we've got the function loaded into our R workspace, the steps are easy:
  • Rename the active plot object gg (simply because that's what Bob's code uses)
  • Define the text that we want to be in the subtitle
  • Call the function

# set the name of the current plot object to `gg`
gg <- MLBRPGplot

# define the subtitle text
subtitle <- 
  "Run scoring has been falling for 15 years, reversing a 30 year upward trend"
ggplot_with_subtitle(gg, subtitle,
                     bottom_margin=20, lineheight=0.9)

MLB run scoring, 1901-2014 with a subtitle

Wasn't that easy? Thanks, Bob!

And it's going to get easier; in the few days since his blog post, Bob has taken this into the ggplot2 development environment, working on the code necessary to add this as a simple extension to the package's already extensive functionality. And Jan Schulz has chimed in, adding the ability to add a text annotation (e.g. the data source) under the plot. It's early days, but it's looking great. (See ggplot2 Pull request #1582.) Thanks, Bob and Jan!
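For what it's worth, the in-progress interface (still subject to change while the pull request is open; this is my reading of the proposal, not a released API) would reduce the whole exercise to something like:

# hedged sketch of the proposed ggplot2-native approach:
# subtitle and caption become ordinary labels, with no gtable surgery
MLBRPGplot +
  labs(title = "MLB run scoring, 1901-2014",
       subtitle = "Run scoring has been falling for 15 years, reversing a 30 year upward trend",
       caption = "Data: the Lahman database")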

And thanks also to the rest of the ggplot2 developers, for making it possible for those of us who use the package to create good-looking and effective data visualizations. Ain't open development great?

The code for this post (as an R markdown file) can be found in my Bayesball github repo.


March 6, 2016

Book review: Storytelling With Data

by Cole Nussbaumer Knaflic (2015, Wiley)

The Sabermetric bookshelf, #4

One of the great strengths of R is that there are some robust (and always improving) packages that facilitate great data visualization and tabular summaries. Beyond the capabilities built into the base version of R, packages such as ggplot2 (my favourite), lattice, and vcd and vcdExtra extend the possibilities for rendering charts and graphs, and a similar variety exists for reproducing tables. And accompanying these packages have been a variety of fine instruction manuals that delineate the code necessary to produce high-quality and reproducible outputs. (You can’t go wrong by starting with Winston Chang’s R Graphics Cookbook, and the R Graph Catalog based on Naomi Robbins’s Creating More Effective Graphs, created and maintained by Joanna Zhao and Jennifer Bryan at the University of British Columbia.)

Let’s call these the “how” resources; once you’ve determined you want a Cleveland plot (which is sometimes called a “lollipop plot”—please, just stop it), these sources provide the code for that style of chart, including the myriad options available to you.

Elsewhere, there has been a similar explosion in the number of books that build on research and examples as to what makes a good graphic. These are the “what” books; the authors include the aforementioned William Cleveland and Naomi Robbins, and also include Stephen Few and Edward R. Tufte. Also making an appearance are books that codify the “why”, written by the likes of Alberto Cairo and Nathan Yau.

The recently published Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Nussbaumer Knaflic falls into the latter category, and it’s one of the best I’ve seen to date. Although the subtitle indicates the intended audience, I believe that anyone involved in creating data-driven visualizations would benefit from reading and learning from it.

The book is relatively software agnostic, although Nussbaumer Knaflic recognizes the ubiquity of Excel and has used it to produce the charts in the book. She also provides some sidebar commentary and tools via her website specifically using Excel. For R users, this shouldn’t pose a particular challenge or barrier; the worst-case scenario is that it provides an opportunity to learn how to use R to replicate the book’s examples.

One of the strengths of the book is that Nussbaumer Knaflic takes the approach of starting with a chart (often a real-life example published elsewhere), and then iterates through one or more options before arriving at the finished product. One instance is the step-by-step decluttering of a line graph, which becomes substantially improved through a six-step process. This example re-appears later, first in the chapter on the use of preattentive attributes and then again in the chapter titled “Think like a designer”. This approach reinforces the second of Nussbaumer Knaflic’s tips that close the book: “iterate and seek feedback”.

Nussbaumer Knaflic also introduces the Gestalt Principles of Visual Perception, and provides vivid examples of how these principles play out in data visualizations.

All of the discussion of graphics is wrapped in the context of storytelling. That is to say, the data visualization is always in the service of making a point about what the data tell us. In the context of business, this then translates into influencing decisions. The chapter “Lessons in storytelling” falls almost exactly in the middle of the book; after we’ve been introduced to the principles of making good data visualizations, Nussbaumer Knaflic gives us a way to think about the purpose of the visualization. With all of the pieces in place, the remainder of the book is focussed on the applications of storytelling with data.

The book is supported by Nussbaumer Knaflic’s site, which includes her blog. Check out her blog entry/discussion with Stephen Few, “Is there a single right answer?”, and some makeovers (in the Gallery) where she redraws some problematic charts that have appeared in the public domain.

All in all, Cole Nussbaumer Knaflic’s Storytelling with Data is a succinct and focussed book, one that clearly adopts and demonstrates the enormous value of the principles that it espouses. Highly recommended to anyone, of any skill level, who is interested in making effective data visualizations, and the effective use of those visualizations.

Cole Nussbaumer Knaflic’s Storytelling with Data: A Data Visualization Guide for Business Professionals was published in 2015 by Wiley.

Cross-posted at


Note: the authors of the following books have all published additional books on these topics; I’ve simply selected the ones that most closely fit with the context of this review. All are recommended.

  • Alberto Cairo, The Functional Art: An Introduction to Information Graphics and Visualization (2013) New Riders.
  • Winston Chang, R Graphics Cookbook (2012) O’Reilly.
  • William S. Cleveland, Visualizing Data (1993) Hobart Press.
  • Stephen Few, Signal (2015) Analytics Press.
  • Michael Friendly and David Meyer, Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data (2016) CRC Press.
  • Naomi B. Robbins, Creating More Effective Graphs (2004) Reprinted 2013 by Chart House.
  • Edward R. Tufte, The Visual Display of Quantitative Information (2nd edition, 2001) Graphics Press.
  • Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis (2nd edition, 2016) Springer.
  • Nathan Yau, Visualize This: The FlowingData Guide to Design, Visualization, and Statistics (2011) Wiley.

January 6, 2015

Run scoring trends: using Shiny to create dynamic charts and tables in R

Or, Retracing my steps

As I’ve been learning the functionality of Shiny, the web application framework for R, I have used the helpful tutorials available from the developers at RStudio. At some point, though, one needs to break out and develop one’s own application. My Shiny app, "MLB run scoring trends", can be found online.

Note: this app is a work in progress! If you have any thoughts on how it might be improved, please leave a comment.
All of the files associated with this app, including the code, can be found on GitHub, at MonkmanMH/MLBrunscoring_shiny.

This Shiny app is a return to my earlier analysis of run scoring trends in Major League Baseball, last seen in my blog post “Major League Baseball run scoring trends with R’s Lahman package” (see the “References” tab in the Shiny app for more). This project gave me the opportunity to update the underlying data, as well as to introduce some of the coding improvements I’ve learned along the way (notably the packages ggplot2 and dplyr).

Some notable changes in the code:
  • In the original version (starting here), I treated each league separately, starting by subsetting (now, with dplyr, filtering) the Lahman “Teams” table on the lgID variable into two separate data frames, which were then used to generate the two charts separately. Now, with ggplot2, I have used faceting to plot the two leagues, and given the reader the option of making that split or not. This is both more flexible from the reader’s point of view and more efficient code.
  • In my original approach, the trend lines were generated using the loess function, stored in a separate object, and then added to the plot as a separately plotted line. With ggplot2, a LOESS trend line can be added directly to the plot call with the stat_smooth() option, a much more efficient approach.
  • stat_smooth() makes it possible to adjust the degree of smoothing of the trend line through changes to the span specification. Originally this was hard-coded, but it is now dynamic, controlled in the Shiny app through a slider widget (see the sketch after this list).
  • The stat_smooth() also includes the option of showing a confidence interval. This is achieved through the level specification. For this, I used a set of radio buttons in the Shiny user interface. (I had initially tried a slider, but was not able to specify a set of pre-defined points for the confidence intervals.)
  • The start and end dates of the league plots are also user-controlled through a slider widget. You will notice that the date in the chart title changes along with the range of the plot.
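To make that wiring concrete, here is a minimal sketch of the pattern described above; the input IDs, labels, and the data summary are illustrative assumptions on my part, not the app's actual code:

library(shiny)
library(ggplot2)
library(dplyr)
library(Lahman)

ui <- fluidPage(
  sliderInput("span", "Degree of smoothing (span):",
              min = 0.05, max = 1, value = 0.25),
  radioButtons("ci", "Confidence interval:",
               choices = c("0.90", "0.95", "0.99"), selected = "0.95"),
  sliderInput("years", "Year range:",
              min = 1901, max = 2014, value = c(1901, 2014)),
  plotOutput("rpgPlot")
)

server <- function(input, output) {
  output$rpgPlot <- renderPlot({
    # filter to the user-selected date range, then calculate runs per game
    rpg <- Teams %>%
      filter(yearID >= input$years[1], yearID <= input$years[2]) %>%
      group_by(yearID) %>%
      summarise(leagueRPG = sum(R) / sum(G))
    ggplot(rpg, aes(x = yearID, y = leagueRPG)) +
      geom_point() +
      # radio button values arrive as text, so coerce to numeric before use
      stat_smooth(span = input$span, level = as.numeric(input$ci)) +
      # the chart title tracks the user-selected date range
      ggtitle(paste0("MLB run scoring, ", input$years[1], "-", input$years[2]))
  })
}

shinyApp(ui, server)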
Other things I learned:
  • Radio buttons return factors, even if they look numeric in the ui.r code. In order to get the values that are input by the user to work in the stat_smooth(), I wrapped them in as.numeric().
  • Tables made with dplyr don't render properly in the Shiny environment; the numbers are all there but the sort function generates an error. My solution to this was to wrap the table created by renderDataTable in the server file with
  • I already knew that I was struggling to keep up with the changes in the R coding environment, but this exercise opened my eyes to even more potential opportunities. The latest version of Shiny (as of 2015-01-06) has added a lot of new functionality, but I hadn’t realized the degree of integration with other visualization tools. This recent blog entry, “Goodbye static graphs, hello shiny, ggvis, rmarkdown” by Simon Jackman, gives some hints as to where an integrated analytic & reporting environment might go. Exciting stuff, indeed.


July 26, 2014

Roger Angell and the Baseball Hall of Fame

The great baseball writer Roger Angell is the recipient of the 2014 J.G. Taylor Spink Award, the first time a non-BBWAA member has been given the award.  Angell will be presented the award at the Baseball Hall of Fame at Cooperstown, during the induction weekend July 25 - 28, 2014.

Angell is, for my money, the best writer about baseball.  His accounts of the game are from a fan's perspective, rather than the typical listing of the game's dramatic moments. Indeed, many of his greatest observations are about the fans and their experiences.  One such essay is "The Interior Stadium", required reading for anyone interested in sports and people's responses to the games.

Much of Angell's writing was published by his employer, the New Yorker, which has recently compiled two collections of his writing.  The first was offered by David Remnick, whose piece "Roger Angell Heads to Cooperstown" was published when the Spink award was announced in December 2013 and has links to a variety of Angell's best. More recently, "Hall of Fame Weekend: Roger Angell's Baseball Writing" (by Sky Dylan-Robbins) provides a different list of great essays.

One of the best things the New Yorker has made available is Angell's scorecard for Game 6 of the 2011 World Series, when the Cardinals were down to their final strike twice, but managed to come back and win the game (and then, in an anti-climactic game 7, the Series).


July 23, 2014

Left-handed catchers

Benny Distefano – 1985 Donruss #166
We are approaching the twenty-fifth anniversary of the last time a left-handed throwing catcher appeared behind the plate in a Major League Baseball game; on August 18, 1989 Benny Distefano made his third and final appearance as a catcher for the Pirates. Distefano’s accomplishment was celebrated five years ago, in Alan Schwarz’s “Left-Handed and Left Out” (New York Times, 2009-08-15).

Jack Moore, writing on the site Sports on Earth in 2013 (“Why no left-handed catchers?”), points out that the lack of left-handed catchers goes back a long way. One interesting piece of evidence is a 1948 Ripley’s “Believe It Or Not” item featuring left-handed catcher Dick Bernard (you can read more about Bernard’s signing in the July 1, 1948 edition of the Tuscaloosa News). Bernard didn’t make the majors, and doesn’t appear in any of the minor league records that are available on-line either.

Dick Bernard in Ripley’s “Believe It or Not”, 1948-12-30

There are a variety of hypotheses why there are no left-handed catchers, all of which are summarized in John Walsh’s “Top 10 Left-Handed Catchers for 2006” (a tongue-in-cheek title if ever there were) at The Hardball Times. A compelling explanation, and one supported by both Bill James and J.C. Bradbury (in his book The Baseball Economist) is natural selection; a left-handed little league player who can throw well will be groomed as a pitcher.

Throwing hand by fielding position as an example of a categorical variable

I was looking for some examples of categorical variables to display visually, and the lack of left-handed throwing catchers, compared to other positions, came to mind. The following uses R, and the Lahman database package.

The analysis requires merging the Master and Fielding tables in the Lahman database – the Master table gives each player's name and throwing hand, and the Fielding table tells us how many games he played at each position. For the purpose of this analysis, we’ll look at the seasons from 1954 (the first year in the Lahman database with the outfield positions split into left, centre, and right) through 2012.

You may note that for the merging of the two tables, I used the new dplyr package. I tested the system.time of the base version of “merge” against the “inner_join” function in dplyr for combining the two tables. The latter is substantially faster: my aging computer ran “merge” in about 5.5 seconds, compared to 0.17 seconds for “inner_join”.

# load the required packages
library(Lahman)
library(dplyr)
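As a rough sketch of that timing comparison (the object names here are mine, and exact timings will vary by machine and package version):

# time base R's merge() against dplyr's inner_join() on the same tables
system.time(MF_merge <- merge(Fielding, Master, by = "playerID"))
system.time(MF_join <- inner_join(Fielding, Master, by = "playerID"))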

The first step is to create a new data table that merges the Fielding and Master tables, based on the common variable “playerID”. This new table has one row for each player, by position and season; we use the dim function to show the dimensions of the table.

Then, select only those seasons since 1954 and omit the records that are Designated Hitter (DH) and the summary of outfield positions (OF) (i.e. leave the RF, CF, and LF).

MasterFielding <- inner_join(Fielding, Master, by="playerID")
dim(MasterFielding)
## [1] 164903     52
MasterFielding <- filter(MasterFielding, POS != "OF" & POS != "DH" & yearID > "1953")
dim(MasterFielding)
## [1] 91214    52

This table needs to be summarized one step further – a single row for each player, counting how many games played at each position.

Player_games <- MasterFielding %.%
  group_by(playerID, nameFirst, nameLast, POS, throws) %.%
  summarise(gamecount = sum(G)) %.%
  arrange(desc(gamecount))
# note: %.% was dplyr's chaining operator at the time; %>% is the modern equivalent
dim(Player_games)
## [1] 19501     6
head(Player_games)
## Source: local data frame [6 x 6]
## Groups: playerID, nameFirst, nameLast, POS
##    playerID nameFirst nameLast POS throws gamecount
## 1 robinbr01    Brooks Robinson  3B      R      2870
## 2 bondsba01     Barry    Bonds  LF      L      2715
## 3 vizquom01      Omar  Vizquel  SS      R      2709
## 4  mayswi01    Willie     Mays  CF      R      2677
## 5 aparilu01      Luis Aparicio  SS      R      2583
## 6 jeterde01     Derek    Jeter  SS      R      2531

This table shows the career records for the most games played at each position (for 1954-2012). We see that Brooks Robinson leads the way with 2,870 games played at third base, and that Derek Jeter, at the end of the 2012 season, was closing in on Omar Vizquel’s career record for games played as a shortstop.

Cross-tab Tables

The next step is to prepare a simple cross-tab table (also known as a contingency or pivot table) showing the number of players cross-tabulated by position (POS) and throwing hand (throws).

Here, I’ll demonstrate two ways to do this: first with dplyr’s “group_by” and “summarise” (with a bit of help from reshape2), and then with base R’s “table” function, followed by a more elaborate table from the gmodels package.

# first method - dplyr
Player_POS <- Player_games %.%
  group_by(POS, throws) %.%
  summarise(playercount = length(gamecount))
Player_POS
## Source: local data frame [17 x 3]
## Groups: POS
##    POS throws playercount
## 1   1B      L         411
## 2   1B      R        1515
## 3   2B      L           4
## 4   2B      R        1560
## 5   3B      L           4
## 6   3B      R        1889
## 7    C      L           4
## 8    C      R         980
## 9   CF      L         393
## 10  CF      R        1252
## 11  LF      L         544
## 12  LF      R        2161
## 13   P      L        1452
## 14   P      R        3623
## 15  RF      L         520
## 16  RF      R        1893
## 17  SS      R        1296

To transform this long-form table into a traditional cross-tab shape we can use the “dcast” function in reshape2.

require(reshape2)
## Loading required package: reshape2
dcast(Player_POS, POS ~ throws, value.var = "playercount")
##   POS    L    R
## 1  1B  411 1515
## 2  2B    4 1560
## 3  3B    4 1889
## 4   C    4  980
## 5  CF  393 1252
## 6  LF  544 2161
## 7   P 1452 3623
## 8  RF  520 1893
## 9  SS   NA 1296

A second method to get the same result is base R’s “table” function.

require(gmodels)
## Loading required package: gmodels
throwPOS <- with(Player_games, table(POS, throws))
throwPOS
##     throws
## POS     L    R
##   1B  411 1515
##   2B    4 1560
##   3B    4 1889
##   C     4  980
##   CF  393 1252
##   LF  544 2161
##   P  1452 3623
##   RF  520 1893
##   SS    0 1296

A more elaborate table can be created using the gmodels package. In this case, we’ll use the CrossTable function to generate a table with row percentages. You’ll note that the format is set to SPSS, so the table output resembles that software’s display style.

CrossTable(Player_games$POS, Player_games$throws, 
           digits=2, format="SPSS",
           prop.r=TRUE, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE,  # keeping the row proportions
           chisq=TRUE)                                                 # adding the ChiSquare statistic
##    Cell Contents
## |-------------------------|
## |                   Count |
## |             Row Percent |
## |-------------------------|
## Total Observations in Table:  19501 
##                  | Player_games$throws 
## Player_games$POS |        L  |        R  | Row Total | 
## -----------------|-----------|-----------|-----------|
##               1B |      411  |     1515  |     1926  | 
##                  |    21.34% |    78.66% |     9.88% | 
## -----------------|-----------|-----------|-----------|
##               2B |        4  |     1560  |     1564  | 
##                  |     0.26% |    99.74% |     8.02% | 
## -----------------|-----------|-----------|-----------|
##               3B |        4  |     1889  |     1893  | 
##                  |     0.21% |    99.79% |     9.71% | 
## -----------------|-----------|-----------|-----------|
##                C |        4  |      980  |      984  | 
##                  |     0.41% |    99.59% |     5.05% | 
## -----------------|-----------|-----------|-----------|
##               CF |      393  |     1252  |     1645  | 
##                  |    23.89% |    76.11% |     8.44% | 
## -----------------|-----------|-----------|-----------|
##               LF |      544  |     2161  |     2705  | 
##                  |    20.11% |    79.89% |    13.87% | 
## -----------------|-----------|-----------|-----------|
##                P |     1452  |     3623  |     5075  | 
##                  |    28.61% |    71.39% |    26.02% | 
## -----------------|-----------|-----------|-----------|
##               RF |      520  |     1893  |     2413  | 
##                  |    21.55% |    78.45% |    12.37% | 
## -----------------|-----------|-----------|-----------|
##               SS |        0  |     1296  |     1296  | 
##                  |     0.00% |   100.00% |     6.65% | 
## -----------------|-----------|-----------|-----------|
##     Column Total |     3332  |    16169  |    19501  | 
## -----------------|-----------|-----------|-----------|
## Statistics for All Table Factors
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  1759     d.f. =  8     p =  0 
##        Minimum expected frequency: 168.1

Mosaic Plot

A mosaic plot is an effective way to represent the contents of the summary tables graphically. Note that the total width of each bar is constant, so the left-to-right split compares the proportions of left- and right-handed throwers, while the height of each bar (top to bottom) varies depending on the absolute number of cases at that position. The mosaic plot function is in the vcd package.

require(vcd)
## Loading required package: vcd
## Loading required package: grid
mosaic(throwPOS, highlighting = "throws", highlighting_fill = c("darkgrey", "white"))


The clear result is that it’s not just catchers that are overwhelmingly right-handed throwers, it’s also infielders (except first base). There have been very few southpaws playing second and third base – and there have been absolutely no left-handed throwing shortstops in this period.

As J.G. Preston puts it in the blog post “Left-handed throwing second basemen, shortstops and third basemen”:

“While right-handed throwers can be found at any of the nine positions on a baseball field, left-handers are, in practice, restricted to five of them.”

So who are these left-handed oddities? Using the filter function, it’s easy to find out:

# catchers
filter(Player_games, POS == "C", throws == "L")
## Source: local data frame [4 x 6]
## Groups: playerID, nameFirst, nameLast, POS
##    playerID nameFirst  nameLast POS throws gamecount
## 1 distebe01     Benny Distefano   C      L         3
## 2  longda02      Dale      Long   C      L         2
## 3 squirmi01      Mike   Squires   C      L         2
## 4 shortch02     Chris     Short   C      L         1

# second base
filter(Player_games, POS == "2B", throws == "L")
## Source: local data frame [4 x 6]
## Groups: playerID, nameFirst, nameLast, POS
##    playerID nameFirst  nameLast POS throws gamecount
## 1 marqugo01   Gonzalo   Marquez  2B      L         2
## 2 crowege01    George     Crowe  2B      L         1
## 3 mattido01       Don Mattingly  2B      L         1
## 4 mcdowsa01       Sam  McDowell  2B      L         1

# third base
filter(Player_games, POS == "3B", throws == "L")
## Source: local data frame [4 x 6]
## Groups: playerID, nameFirst, nameLast, POS
##    playerID nameFirst  nameLast POS throws gamecount
## 1 squirmi01      Mike   Squires  3B      L        14
## 2 mattido01       Don Mattingly  3B      L         3
## 3 francte01     Terry  Francona  3B      L         1
## 4 valdema02     Mario    Valdez  3B      L         1

My github file for this entry in Markdown is here: []


December 17, 2013

Book Review: Analyzing Baseball Data with R

by Max Marchi and Jim Albert (2014, CRC Press)

The Sabermetric bookshelf, #3

Here we have the perfect book for anyone who stumbles across this blog--the intersection of R and baseball data. The open source statistical programming environment of R is a great tool for anyone analyzing baseball data, from the robust analytic functions to the great visualization packages. The particular readership niche might be small, but as both R and interest in sabermetrics expand, it's a natural fit.

And one would be hard pressed to find better qualified authors, writers who have feet firmly planted in both worlds.  Max Marchi is a writer for Baseball Prospectus, and it's clear from the ggplot2 charts in his blog entries (such as this entry on left-handed catchers) that he's an avid R user.

Jim Albert is a Professor in the Department of Mathematics and Statistics at Bowling Green State University; three of his previous books sit on my bookshelf. Curve Ball, written with Jay Bennett, is pure sabermetrics, and one of the best books ever written on the topic (and winner of SABR's Baseball Research Award in 2002).  Albert's two R-focussed books, the introductory R by Example (co-authored with Maria Rizzo) and the more advanced Bayesian Computation with R, are intended as supplementary texts for students learning statistical methods. Both employ plenty of baseball examples in their explanations of statistical analysis using R.

In Analyzing Baseball Data with R Marchi and Albert consolidate this joint expertise, and have produced a book that is simultaneously interesting and useful.

The authors take a very logical approach to the subject at hand. The first chapter concerns the three sources of baseball data that are referenced throughout the book:
- the annual summaries contained with the Lahman database,
- the play-by-play data at Retrosheet, and
- the pitch-by-pitch PITCHf/x data.
The chapter doesn't delve into R, but summarizes the contents of the three data sets, and takes a quick look at the types of questions that can be answered with each.

The reader first encounters R in the second and third chapters, titled "Introduction to R" and "Traditional Graphics". These two chapters cover many of the basic topics that a new R user needs to know, starting with installing R and RStudio, then moving on to data structures like vectors and data frames, objects, functions, and data plots. Some of the key R packages are also covered in these chapters, both functional packages like plyr and data packages, notably Lahman, the data package containing the Lahman database.

The material covered in these early chapters is what I learned early on in my own R experience; but whereas I relied on multiple sources and an unstructured, ad hoc approach, in Analyzing Baseball Data with R a newcomer to R will find the basics laid out in a straightforward and logical progression. These chapters will most certainly help them climb the steep learning curve faced by every neophyte R user.  (It is worth noting that the "Introduction to R" chapter relies heavily on a fourth source of baseball data -- the 1965 Warren Spahn Topps card, the last season of his storied career. Not all valuable data are big data.)

From that point on, the book tackles some of the core concepts of sabermetrics: the relationship between runs and wins, run expectancy, career trajectories, and streaky performances.  As the authors work through these and other topics, they weave in information about additional R functions and packages, along with statistical and analytic concepts.  For example, one chapter introduces Markov Chains in the context of using R to simulate half-inning, season, and post-season outcomes.

The chapter "Exploring Streaky Performances" provides the opportunity to take a closer look at how Analyzing Baseball Data with R compares to Albert's earlier work.  In this case, the chapter uses moving average and simulation methodologies, providing the data code to examine recent examples (Ichiro and Raul Ibanez).  This is methodologically similar to what is described in Curve Ball, but with the addition of "here's the data and the code so you can replicate the analysis yourself".  This approach differs substantially from the much more mathematical content in Albert's text Bayesian Computation with R, where the example of streaky hitters is used to explore beta functions and the laplace R function.

Woven among these later chapters are also ones that put R first, and use baseball data as the examples. A chapter devoted to the advanced graphics capabilities of R splits its time between the lattice and ggplot2 packages. The examples used in this chapter include visualizations that analyze variations in Justin Verlander's pitch speed.

Each chapter of the book also includes "Further Reading" and "Exercises" sections, which provide readers with the chance to dig deeper into the topic just covered and to apply their new-found skills. The exercises are consistently interesting and often draw on previous sabermetric research.  Here are a couple of examples:
  • "By drawing a contour plot, compare the umpire's strike zone for left-handed and right-handed batters. Use only the rows of the data frame where the pitch type is a four-seam fastball." (Chapter 7)
  • "Using [Bill] James' similarity score measure ..., find the five hitters with hitting statistics most similar to Willie Mays." (Chapter 8)
The closing pages of the book are devoted to technical arcana regarding the data sources, and how-to instructions on obtaining those data.

The authors have established a companion blog, which has an expansion of the analytics presented in the book.  For example, the entry from December 12, 2013 goes deeper into ggplot2's capabilities to enhance and refine charts that were described in the book.

Analyzing Baseball Data with R provides readers with an excellent introduction to both R and sabermetrics, using examples that provide nuggets of insight into baseball player and team performance. The examples are clear, the R code is well explained and easy to follow, and I found the examples consistently interesting. All told, Analyzing Baseball Data with R will be an extremely valuable addition to the practicing sabermetrician's library, and is most highly recommended.

Additional Resources

Jim Albert and Jay Bennett (2003), Curve Ball: Baseball, Statistics, and the Role of Chance in the Game (revised edition), Copernicus Books.

Jim Albert and Maria Rizzo (2011), R by Example, Springer.

Jim Albert (2009), Bayesian Computation with R (2nd edition), Springer.

An interview with Max Marchi, originally posted at MilanoRnet and also available through R-bloggers


December 1, 2013

A few random things

The creation of random numbers, or the random selection of elements in a set (or population), is an important part of statistics and data science. From simulating coin tosses to selecting potential respondents for a survey, we have a heavy reliance on random number generation.

R offers us a variety of solutions for random number generation; here's a quick overview of some of the options.

runif, rbinom, rnorm

One simple solution is to use the runif function, which generates a stated number of values between two end points (but not the end points themselves). The function uses the continuous uniform distribution, meaning that every value between the two end points has an equal probability of being sampled.

Here's the code to produce 100 values between 1 and 100, and then print them.

RandomNumbers <- runif(100, 1, 100)
RandomNumbers
##   [1] 22.290 33.655 89.835 38.535 24.601 11.431  7.047 94.958 83.703 76.847
##  [11] 58.429 20.667 25.796 91.821  8.741 65.696 24.262  8.077 51.399 19.652
##  [21] 64.883 33.258 55.488  6.828 14.925 11.480 72.783  2.549 78.706 49.563
##  [31] 10.829 27.653 70.304 96.759 12.614 66.610 82.467  8.506 71.719 86.586
##  [41] 69.519 11.538 72.321 63.126 42.754 60.139 44.854 71.088 15.165 67.818
##  [51] 83.342  9.894 64.497 96.620 64.286 20.162 16.343 53.800 31.380 24.418
##  [61] 13.740 47.458 80.037 13.189 45.496 20.697 28.240 60.003 84.350 14.888
##  [71] 20.084  3.003  1.191 28.748  4.528 40.568 90.963 82.640 15.885 95.029
##  [81] 54.166 17.315 43.355  9.762 74.012 64.537 74.131 24.758 41.922 65.458
##  [91] 11.423 41.084 22.514 77.329 76.879 43.954 78.471 24.727 69.357 60.118

R helpfully has random generators from a plethora of distributions (see the R documentation under the heading “Random Number Generators”). For example, the equivalent function to pull random numbers from the binomial distribution is rbinom. In the following example, the code generates 100 iterations of a single trial where there's a 0.5 (50/50) probability – as you would get with one hundred coin tosses. So let's call the object OneHundredCoinTosses. The table function then gives us a count of the zeros and ones in the object.

OneHundredCoinTosses <- rbinom(100, 1, 0.5)
OneHundredCoinTosses
##   [1] 1 0 0 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 1 1 1 0 0
##  [36] 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 0 1 1 1 0 1 0 0 1
##  [71] 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 1 1 1 0 0 1 1 0 1 0 1
table(OneHundredCoinTosses)
## OneHundredCoinTosses
##  0  1 
## 48 52

In this variant, we'll toss our coin again, but this time it will be 100 iterations of 10 trials. R will generate the number of successes per trial. A plot of the histogram would show how with enough iterations, we'd get something that looks very much like a normal distribution curve.

OneHundredCoinTrials <- rbinom(100, 10, 0.5)
OneHundredCoinTrials
##   [1]  5  4  7  7  4  2  3  3  6  6  6  5  4  6  5  6  4  7  3  3  6  5  5
##  [24]  7  4  5  6  3  7  4 10  5  3  6  6  5  7  6  3  6  4  7  7  4  3  6
##  [47]  6  2  2  7  5  6  5  8  4  3  5  5  2  7  5  4  4  1  4  5  7  5  6
##  [70]  9  6  3  7  4  4  2  3  6  3  5  6  6  5  8  5  5  7  5  7  4  2  3
##  [93]  5  5  7  5  4  6  0  6
table(OneHundredCoinTrials)
## OneHundredCoinTrials
##  0  1  2  3  4  5  6  7  8  9 10 
##  1  1  6 13 16 23 21 15  2  1  1
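To see that shape emerge, here's a quick histogram (a sketch; the bin placement is my own choice):

# one bin per possible outcome (0 through 10 successes);
# with more iterations this approaches the normal bell curve
hist(OneHundredCoinTrials,
     breaks = seq(-0.5, 10.5, by = 1),
     main = "100 iterations of 10 coin tosses",
     xlab = "successes per 10 tosses")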

And there's rnorm for the normal distribution. In this case, the second number in the function is the mean and the third is the standard deviation. With this example, the code generates 100 values from a normal distribution with a mean of 50 and a standard deviation of 12.5.

RandomNormal <- rnorm(100, 50, 12.5)
RandomNormal
##   [1] 52.67 36.55 57.48 63.00 53.65 52.05 65.39 36.16 53.03 53.22 53.61
##  [12] 65.66 47.08 41.87 80.60 67.34 49.56 54.09 50.51 48.35 33.72 47.31
##  [23] 51.22 51.01 56.76 51.01 79.84 37.27 33.67 41.73 49.34 62.04 61.48
##  [34] 37.49 56.54 63.87 49.13 36.11 47.14 34.67 57.88 34.45 50.46 48.68
##  [45] 66.36 45.32 51.72 36.64 41.35 48.09 35.50 55.30 42.13 26.29 30.68
##  [56] 53.08 60.53 40.96 35.04 60.64 74.57 49.00 62.41 37.19 52.64 39.80
##  [67] 31.66 49.13 36.05 49.98 55.00 72.42 56.64 53.44 50.50 54.02 60.74
##  [78] 39.32 53.72 75.50 46.87 12.10 45.09 70.40 53.11 37.36 58.97 67.09
##  [89] 45.63 55.44 46.66 31.69 36.68 59.28 55.09 53.25 49.52 59.87 59.16
## [100] 80.30


sample

Another approach to randomization is the sample function, which pulls elements from an object (such as a vector) of defined values or, alternatively, can be specified to select cases from a string of integers. The function also has the option of specifying whether sampling is done with or without replacement.

In the first example of sample, we'll generate 100 values (the second value specified in the function) from the integers between 1 and 99 (the first value specified), with replacement – so there's a possibility of duplicates. The code adds the sort function so that we can easily spot the duplicates.

RandomSample <- sort(sample(99, 100, replace = TRUE))
RandomSample
##   [1]  1  1  2  2  2  5  5  5  7  8 11 11 12 12 12 13 15 15 16 16 19 19 21
##  [24] 23 23 23 23 23 25 26 27 30 30 31 32 33 34 35 35 35 35 37 38 40 41 42
##  [47] 42 42 42 45 46 46 47 48 48 50 52 52 54 54 54 54 54 57 58 61 62 63 63
##  [70] 64 66 66 67 67 69 70 71 71 71 73 73 74 78 78 80 80 82 83 84 84 85 86
##  [93] 86 91 93 94 96 96 98 98

In a second example, we'll generate 5 values (the second value specified) from a list of 13 names that we predefine, without replacement. Note that the default setting in sample is “without replacement”, so there should be no duplicates.

# the list of party-goers
dwarves <- c("Fíli", "Kíli", "Balin", "Dwalin", "Óin", "Glóin", "Bifur", "Bofur", 
    "Bombur", "Ori", "Nori", "Dori", "Thorin")
# draw a sorted sample of 5 names, without replacement
Party <- sort(sample(dwarves, 5))
# print the names
Party
## [1] "Dwalin" "Fíli"   "Nori"   "Ori"    "Thorin"

There is also the sample.int variant, which insists on integers for the values. Here's the code to randomly select 6 numbers between 1 and 49, without replacement.

six49numbers <- sort(sample.int(49, 6, replace = FALSE))
six49numbers
## [1]  3 15 16 24 35 44

Controlling random number generation: set.seed and RNGkind

It sounds like an oxymoron – how can you control something that is random? The answer is that in many computer programs and programming languages, R included, many of the functions that are dubbed random number generation really aren't. I won't get into the arcana, but runif (and its ilk) and sample all rely on pseudo-random approaches, methods that are close enough to being truly random for most purposes. (If you want to investigate this further in the context of R, I suggest starting with John Ramey's post on the topic.)

With the set.seed command, an integer is used to start a random number generation, allowing the same sequence of “random” numbers to be selected repeatedly. In this example, we'll use the code written earlier to sample 6 numbers between 1 and 49, and repeat it three times.

The first time through, set.seed will define the starting seed as 1, then for the second time through, the seed will be set to 13, leading to a different set of 6 numbers. The third iteration will reset the starting seed to 1, and the third sample set of 6 numbers will be the same as the first sample.

set.seed(1)
six49numbers <- sort(sample.int(49, 6))
six49numbers
## [1] 10 14 18 27 40 42
set.seed(13)
six49numbers <- sort(sample.int(49, 6))
six49numbers
## [1]  1  5 12 19 35 44
set.seed(1)
six49numbers <- sort(sample.int(49, 6))
six49numbers
## [1] 10 14 18 27 40 42

The first and third draws contain the same 6 integers.

Another control over random number generation is RNGkind. This command defines the random number generation method, from an extensive list of methodologies. The default is the Mersenne Twister, and a variety of others are available.

The R documentation page on Random{}, with both set.seed and RNGkind, can be found here:
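A minimal illustration (my own example, not from the documentation): check the current generator, switch to another, then restore the default.

# report the current generators (Mersenne-Twister by default)
RNGkind()
# switch to the older Wichmann-Hill generator and draw from it
RNGkind("Wichmann-Hill")
runif(3)
# restore the default generator
RNGkind("Mersenne-Twister")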


While the methods above are pseudo-random, there are services available that generate truly random numbers. One is the service provided by RANDOM.ORG.

The R package random uses the RANDOM.ORG service to generate random numbers and return them into an R object. The functions in the package can return random integers, randomized sequences, and random strings, and have the flexibility to define the shape of the returned matrix (i.e. the number of columns).

It's worth noting that free users of the service are confronted by daily limits on the volume of calls they can make (paying customers don't have these limits).

Here's an example that generates 20 random numbers from RANDOM.ORG, defined as being between 100 and 999 (that is to say, three-digit numbers), and presents them in two columns.

# load the random package, installing it first if necessary
if (!require(random)) {
  install.packages("random")
  library(random)
}
## Loading required package: random
twentytruerandom <- randomNumbers(n = 20, min = 100, max = 999, col = 2, check = TRUE)
# note: the 'check=' argument sets whether the quota at the server should be checked first
twentytruerandom
##        V1  V2
##  [1,] 531 402
##  [2,] 559 367
##  [3,] 616 789
##  [4,] 830 853
##  [5,] 382 436
##  [6,] 336 737
##  [7,] 769 548
##  [8,] 293 818
##  [9,] 746 609
## [10,] 108 331


Paul Teetor's R Cookbook (O'Reilly, 2011) has a chapter on probability (Chapter 8) that includes good examples of various kinds of random number generation in R.

Jim Albert & Maria Rizzo, R by Example (Springer, 2012), Chapter 11 “Simulation Experiments” and Chapter 13 “Monte Carlo Methods”, contain a variety of applications of random number generation using sample and rbinom to approximate and understand probability experiments.

For an in-depth look at random sampling in the context of survey design, see Thomas Lumley's Complex Surveys: A Guide to Analysis Using R (Wiley, 2010).

If you're interested in testing a random number generator, check out

Joseph Rickert's blog entry gives a good rundown of the applications and approaches for parallel random number generation.