February 2, 2013

Comparing individual team run production

Or, The 2010 Mariners: How Bad Were They?

In earlier posts, I used the statistical software R to plot the trends in league average run scoring since 1901. This was the first step to answering other questions I had on my mind:
  1. How poor was the offensive performance of the 2010 Seattle Mariners?
  2. Are they showing any signs of improvement?
  3. And how can I use R to tabulate the data to answer these questions?
So, to answer Question #1: it is well established that the 2010 Mariners were not very good, at least offensively. (For fans of the team, the well-deserved Cy Young Award won by Felix Hernandez is surely the highlight of the season.) But I wanted a relative measure that would be comparable across time, to accommodate the fluctuations in run scoring that were the subject of those earlier posts.

As I started into this, the first decision was where to draw a line in the historical record. I opted to use the eras described in Bill James' "Dividing Baseball History into Eras" article (behind a paywall, but chances are if you're reading my blog, you're already a Bill James subscriber):
  • Era 1 (The Pioneer Era), 1871-1892 
  • Era 2 (The Spitball Era), 1893-1919 
  • Era 3 (The Landis Era), 1920-1946 
  • Era 4 (The Baby Boomers Era), 1947-1968 
  • Era 5 (The Artificial Turf Era), 1969-1992 
  • Era 6 (The Camden Yards Era), 1993-2012
Based on these groupings, I opted to use the range of seasons 1947-2012 inclusive. This yields 1,580 team seasons of National League and American League baseball.

The second step was to calculate runs per game (RPG) for each team, by year. This corrects for the longer regular season in the post-expansion period and for the strike-shortened seasons, and gives us a common denominator for comparing seasons up to and including 2012.

To do this, I accessed the 2012 edition of the Lahman database. Once I had downloaded and extracted the comma-delimited version of the files, I read the "teams" file into R.
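
For readers who don't want to open the Gist, here is a minimal sketch of the read-and-subset step. It is an illustration, not the original code: the temp file stands in for the Lahman "teams" file, its three rows are made up, and the object names `teams_path` and `Teams_all` are my own.

```r
# Stand-in for the Lahman "teams" CSV: three made-up rows written to a temp
# file so the example runs anywhere. Point `teams_path` at the real file instead.
teams_path <- tempfile(fileext = ".csv")
writeLines(c(
  "yearID,lgID,franchID,G,R,RA",
  "1946,AL,AAA,154,600,650",
  "1947,AL,AAA,154,700,640",
  "2012,NL,BBB,162,650,700"
), teams_path)

# Read the comma-delimited file into a data frame
Teams_all <- read.csv(teams_path, stringsAsFactors = FALSE)

# Keep only the seasons from 1947 through 2012, inclusive
Teams <- subset(Teams_all, yearID >= 1947 & yearID <= 2012)
```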



Sidebar: My original intention was to incorporate chunks of the R code into this post, but Blogspot seems to be going out of its way to make formatting that code a nightmare. A post at Getting Genetics Done pointed me to the Gist feature at GitHub. I have to admit to being a total GitHub neophyte, but I have managed to create a public Gist that will allow you to access all of the R code I used here.

With the 1947-2012 data frame constructed, we can calculate the average number of runs each team scored per game. And while we are at it, we can calculate the runs allowed, too (and save them for another day). The variables are R (runs), RA (runs allowed), and G (games).
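
A sketch of the per-team calculation, under some assumptions: the new column names `R_G` and `RA_G` are hypothetical, and only the first row's R and G (513 runs in 162 games) come from this post. Every other number is invented for illustration.

```r
# Toy data frame using the Lahman column names R, RA and G.
# Row 1 uses R = 513 and G = 162 from the post; all other values are made up.
Teams <- data.frame(
  yearID   = c(2010, 2010),
  lgID     = c("AL", "AL"),
  franchID = c("SEA", "XXX"),
  R        = c(513, 800),
  RA       = c(700, 650),
  G        = c(162, 162)
)

# Runs scored and runs allowed per game, one value per team season
Teams$R_G  <- Teams$R  / Teams$G
Teams$RA_G <- Teams$RA / Teams$G
```

Because the division is vectorized, the same two lines work unchanged on the full 1,580-row table.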

The next step is to calculate the league averages. First, I used "aggregate", which summed the number of runs (variable R) by year (yearID) and league (lgID). Each "aggregate" call creates a stand-alone summary table with three variables: the year, the league, and the sum of the variable.
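
The three summary tables could be built along these lines. This sketch uses the formula interface to "aggregate", which may not be how the Gist does it, and the four team rows are invented.

```r
# Toy team-season rows (made-up values), two teams per league
Teams <- data.frame(
  yearID = c(2010, 2010, 2010, 2010),
  lgID   = c("AL", "AL", "NL", "NL"),
  R      = c(500, 800, 600, 700),
  RA     = c(700, 600, 650, 650),
  G      = c(162, 162, 162, 162)
)

# One summary table per variable: the sum within each year/league combination
RunsLG  <- aggregate(R  ~ yearID + lgID, data = Teams, FUN = sum)
RunsALG <- aggregate(RA ~ yearID + lgID, data = Teams, FUN = sum)
GamesLG <- aggregate(G  ~ yearID + lgID, data = Teams, FUN = sum)
```

With the formula interface the grouping columns keep their plain names (yearID, lgID), so the resulting column names differ from the ones described in the text.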

Now I have three new data frames -- "RunsLG", "RunsALG", and "GamesLG". Each has only three variables, and they share "Teams.yearID" and "Teams.lgID". The self-explanatory "merge" command builds a single table, LG_RPG, holding all three of the newly created variables.
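
A sketch of the merge step, assuming the key columns are named yearID and lgID rather than the "Teams."-prefixed names above. The league totals are made up.

```r
# Toy league totals (made-up values), one row per year/league
RunsLG  <- data.frame(yearID = c(2010, 2010), lgID = c("AL", "NL"), R  = c(1300, 1350))
RunsALG <- data.frame(yearID = c(2010, 2010), lgID = c("AL", "NL"), RA = c(1300, 1350))
GamesLG <- data.frame(yearID = c(2010, 2010), lgID = c("AL", "NL"), G  = c(324, 324))

# merge() joins two tables at a time, so three tables take two calls
key <- c("yearID", "lgID")
LG_RPG <- merge(RunsLG, RunsALG, by = key)
LG_RPG <- merge(LG_RPG, GamesLG, by = key)
```

Naming the key columns explicitly in `by` guards against an accidental match on some other shared column.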

Now that we've got the totals of runs, runs allowed, and games for each league season, the next step is to use those values to calculate the average runs and runs allowed per game, followed by a bit of variable name maintenance.
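
Roughly, the averages and the renaming might look like this. The names `lgRPG`, `lgRAPG`, `lgR`, `lgRA` and `lgG` are hypothetical, and the totals are invented.

```r
# Toy league totals (made-up values)
LG_RPG <- data.frame(
  yearID = c(2010, 2010),
  lgID   = c("AL", "NL"),
  R      = c(1300, 1350),
  RA     = c(1300, 1350),
  G      = c(324, 324)
)

# League-wide runs scored and runs allowed per team game
LG_RPG$lgRPG  <- LG_RPG$R  / LG_RPG$G
LG_RPG$lgRAPG <- LG_RPG$RA / LG_RPG$G

# Rename the totals so they cannot be confused with the team-level R, RA and G
names(LG_RPG)[names(LG_RPG) == "R"]  <- "lgR"
names(LG_RPG)[names(LG_RPG) == "RA"] <- "lgRA"
names(LG_RPG)[names(LG_RPG) == "G"]  <- "lgG"
```

The renaming matters for the next step: if the league table still had columns called R, RA and G, a default "merge" with the team table would try to match on those as well.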

Although it's possible to calculate the values I'm looking for from the values in the two separate data frames, I decided to make a single table (ultimately, I will be writing this table as a flat file for later use). There's probably a more elegant way to accomplish this, but it works. [Note to self: this should perhaps be the motto for my coding.] A single line of code is all that's required, since the default for "merge" is to merge based on the columns with shared variable names.
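
As a sketch of that single line, with made-up rows and the hypothetical league-average column names `lgRPG` and `lgRAPG`:

```r
# Toy team table (made-up values)
Teams <- data.frame(
  yearID   = c(2010, 2010, 2010),
  lgID     = c("AL", "AL", "NL"),
  franchID = c("AAA", "BBB", "CCC"),
  R        = c(500, 800, 650),
  RA       = c(700, 600, 650),
  G        = c(162, 162, 162)
)

# Toy league averages (made-up values), one row per year/league
LG_RPG <- data.frame(
  yearID = c(2010, 2010),
  lgID   = c("AL", "NL"),
  lgRPG  = c(4.0, 4.2),
  lgRAPG = c(4.0, 4.2)
)

# With no `by` argument, merge() joins on every column name the tables share,
# which here is yearID and lgID
Teams.merge <- merge(Teams, LG_RPG)
```

Each team row picks up the averages for its own league and season, so the row count stays the same as in the team table.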

The table "Teams.merge" is the truncated 1947-2012 version of the Lahman database table "Teams" that was first read into R, with the corresponding league averages for runs and runs allowed added for each team.

The next step in the process is to compare the individual team's runs scored with the league average for that year, by creating an index value where 100 is equal to the league average, and the individual team index is measured relative to this. Thus an index score of 110 indicates that a team scored runs at a rate 10% higher than the league average, or allowed runs at a rate 10% higher than the league average.
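
A sketch of the index calculation. R_index is the column name that appears in the output tables further down; `RA_index`, `lgRPG` and `lgRAPG` are names I have assumed, and the two teams are invented so that the arithmetic is easy to follow.

```r
# Toy merged table (made-up values): both teams play in a league that
# averages 5 runs scored and 5 runs allowed per game
Teams.merge <- data.frame(
  franchID = c("AAA", "BBB"),
  R        = c(891, 729),   # 5.5 and 4.5 runs per game over 162 games
  RA       = c(729, 891),
  G        = c(162, 162),
  lgRPG    = c(5, 5),
  lgRAPG   = c(5, 5)
)

# Index: team rate as a percentage of the league rate (100 = league average)
Teams.merge$R_index  <- 100 * (Teams.merge$R  / Teams.merge$G) / Teams.merge$lgRPG
Teams.merge$RA_index <- 100 * (Teams.merge$RA / Teams.merge$G) / Teams.merge$lgRAPG
```

Team AAA scores 5.5 runs per game against a league rate of 5, so its R_index works out to about 110. Team BBB, at 4.5 runs per game, comes in at about 90.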

I was also curious about the distribution of the index values, so I calculated the minimum, maximum, and standard deviation, and then plotted the distribution.
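
The summary statistics and the binning behind a histogram could be sketched as follows. The vector is illustrative only: its two extremes are the rounded values from the tables in this post, and the rest are made up. The bin edges are my own choice.

```r
# Illustrative index values, not the real 1,580 team seasons
R_index <- c(71.1, 85, 92, 100, 104, 110, 118, 132.9)

# Spread of the index
idx_min <- min(R_index)
idx_max <- max(R_index)
idx_sd  <- sd(R_index)

# Bin the values in steps of 10 without drawing anything;
# plot(h) would then draw the histogram
h <- hist(R_index, breaks = seq(70, 140, by = 10), plot = FALSE)
```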

So let's take a look at the extremes of the distribution -- those offensive juggernauts that managed to score runs at 120% or more of the league average, and whatever the opposite of a juggernaut might be, with run production below 80% of the league rate. Here, I used two different tools -- the "rank" command, and a sorting function, "order".
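
A sketch of how "rank" and "order" can work together here. The teams and index values are invented, and the column `R_index_rank_low` and the object names `top` and `low` are hypothetical.

```r
# Toy index values (made up)
Teams.merge <- data.frame(
  franchID = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  R_index  = c(95, 125, 78, 121, 74)
)

# rank() gives 1 to the smallest value, so negate the index to rank from the top
Teams.merge$R_index_rank     <- rank(-Teams.merge$R_index)
Teams.merge$R_index_rank_low <- rank(Teams.merge$R_index)

# Juggernauts: 120 or more, sorted best first
top <- Teams.merge[Teams.merge$R_index >= 120, ]
top <- top[order(top$R_index_rank), ]

# The opposite: below 80, sorted worst first
low <- Teams.merge[Teams.merge$R_index < 80, ]
low <- low[order(low$R_index_rank_low), ]
```

By default "rank" gives tied values the average of their ranks, which is worth keeping in mind if two teams ever share an index value.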

First, the juggernauts.



ROW   yearID  lgID  franchID  R_index      R_index_rank
565   1976    NL    CIN       132.8853857  1
49    1950    AL    BOS       132.2461322  2
105   1953    NL    LAD       129.6017105  3
1092  1996    NL    COL       126.6497223  4
314   1965    NL    CIN       126.2664769  5
541   1975    NL    CIN       125.650482   6
451   1971    NL    PIT       124.4046836  7
41    1949    NL    LAD       124.0612662  8
118   1954    AL    NYY       123.974382   9
17    1948    AL    BOS       123.8245771  10
1120  1997    NL    COL       123.7739464  11
33    1949    AL    BOS       123.6994013  12
5     1947    AL    NYY       123.6724566  13
297   1964    NL    ATL       123.5204413  14
375   1968    NL    CIN       123.4188181  15
137   1955    NL    LAD       122.9114378  16
713   1982    AL    MIL       122.0939182  17
1283  2003    AL    BOS       122.0507949  18
1409  2007    AL    NYY       121.9362966  19
431   1971    AL    BAL       121.4275066  20
1296  2003    NL    ATL       121.3964208  21
73    1951    NL    LAD       121.2494984  22
323   1966    AL    BAL       121.2018005  23
1239  2001    NL    COL       121.1882488  24
310   1965    AL    MIN       121.1646838  25
1522  2011    AL    BOS       121.0833251  26
282   1963    NL    STL       121.0034335  27
397   1969    NL    CIN       120.7483263  28
1014  1993    NL    PHI       120.5969299  29
344   1967    AL    BOS       120.493992   30
1165  1999    AL    CLE       120.31825    31
368   1968    AL    DET       120.1109289  32



Both the 1976 Reds and the 1950 Red Sox scored runs at a rate more than 30% higher than the league average of the time. These two clubs were clearly part of ongoing offensive powerhouses -- the '75 Reds and both the 1948 and '49 Red Sox also make the list of teams with an index score greater than 120.

And now the equivalent for the low-scoring teams.


ROW   yearID  lgID  franchID  R_index      R_index_rank
1501  2010    AL    SEA       71.13003863  1
404   1969    NL    SDP       71.25193635  2
1256  2002    AL    DET       74.23534967  3
318   1965    NL    NYM       74.83598509  4
113   1954    AL    BAL       74.86764629  5
1286  2003    AL    DET       75.05933378  6
275   1963    NL    HOU       75.16143658  7
637   1979    AL    OAK       75.80085072  8
295   1964    NL    HOU       76.1427378   9
692   1981    AL    TOR       76.17245382  10
1142  1998    AL    TBD       76.37483502  11
18    1948    AL    CHW       76.81081117  12
1302  2003    NL    LAD       76.82640084  13
742   1983    AL    SEA       76.82901532  14
1531  2011    AL    SEA       76.93980429  15
452   1971    NL    SDP       77.20331012  16
8     1947    AL    MIN       77.75801025  17
129   1955    AL    BAL       77.9177115   18
611   1978    AL    OAK       78.11858551  19
861   1988    AL    BAL       78.3863785   20
385   1969    AL    ANA       79.19104726  21
127   1954    NL    PIT       79.23186344  22
943   1991    AL    CLE       79.27644514  23
24    1948    AL    MIN       79.42155431  24
50    1950    AL    CHW       79.44904395  25
95    1952    NL    PIT       79.61825664  26
145   1956    AL    BAL       79.61833163  27
1009  1993    NL    FLA       79.8937472   28




There at the top of this list stand the 2010 Seattle Mariners. They plated 513 runs (3.17 per game), which turns out to be more than the 463 (2.86 per game) scored by the White Sox in a 162-game season in 1968. But the Sox, and the other 19 teams that had lower runs-per-game values than the 2010 Mariners, were playing in seasons with very low run scoring.

But by the index measure, the 2010 Seattle Mariners were unprecedented in their inability to score runs. With an index score of 71.1, the 2010 Mariners produced fewer runs relative to the league average than any of the other 1,579 teams that played in the period 1947-2012. They scored nearly 30% fewer runs than the league average, and their Z score of -2.96 indicates that, if the index values were normally distributed, this would be roughly a 1-in-650 event. (OK, for those of you with a bent for precision, the lower-tail probability is about 0.0015.)
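
For anyone who wants to check the tail arithmetic, here is a small sketch. It assumes the index values are approximately normally distributed, which is a simplification, and the object names are my own.

```r
# Z score quoted in the post
z <- -2.96

# Probability of a value this far below the mean under a standard normal curve
p_lower <- pnorm(z)

# The same probability expressed as "1 in N"
one_in <- 1 / p_lower
```

The Z score itself is the team's index minus the mean index, divided by the standard deviation of all the index values.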

It's important to note that 2010 wasn't a one-off fluke of bad luck for the Mariners; it just happens to be the nadir of their run-scoring performance. The 2011 Mariners were better than the 2010 team, but not by a whole lot. They produced runs at 76.9% of the American League rate that season -- the 15th poorest in the 1947-2012 period.

For my next post, I'll look at the historic trend for the Mariners (you may have noticed other Mariner teams showing up in the above list, although not the 2012 edition of the team) and then move on to the pitching side of the equation -- runs allowed.

-30-

3 comments:

  1. Excellent post! The 1969 Padres weren't that far behind the 2010 Mariners in offensive ineptitude -- 71.13 vs 71.25. At least the Padres had an excuse -- they were an expansion team.

  2. On the R code,

    You asked for a more elegant way. Lines 25 - 41 of your code could be replaced with simply:

    LG_RPG <- aggregate(cbind(R, RA, G) ~ yearID + lgID, data = Teams, sum)

    And then you don't even have to clean up the variable names!

    1. Peter, thanks -- very elegant indeed. I'll edit the Gist to reflect this improvement.
