Examining Potential Umpire Bias in Major League Baseball
Our inspiration for this project came from research by Christopher A. Parsons, Johan Sulaeman, Michael C. Yates, and Daniel S. Hamermesh on potential racial bias exhibited by umpires in Major League Baseball during the 2004 to 2006 seasons. In short, they found that when the race of the umpire matched the race of the starting pitcher, the umpire was more likely to call a strike. However, this effect appeared only when there was "little scrutiny of umpires' behavior" -- that is, for example, in stadiums with no computerized system monitoring calls. In 2004-2006, such systems were installed in just 11 of 30 MLB stadiums, which accounted for approximately 35 percent of games played in those seasons.
We took inspiration from their methods to examine potential racial bias by umpires in the 2013 to 2015 seasons. Although we found no significant effect overall, in a couple of special cases there was a subtle but notable effect on the probability that an umpire calls a strike depending on whether or not he matches the pitcher on race.
Using our own (necessarily subjective) racial classification of umpires and pitchers in Major League Baseball during the 2013 to 2015 seasons, we examined umpires' potential bias in an attempt to corroborate or contradict the findings of Parsons et al. Our hypothesis states that when the home-plate umpire and the pitcher match on race, the umpire is more likely to call a strike on a called pitch.
The null hypothesis, then, is that an umpire-pitcher race match has no effect on the probability of a called strike.
We tested this hypothesis for several cases: for all umpires together, and separately for white, black, Hispanic, and nonwhite umpires.
We scraped every pitch of every game of the 2013, 2014, and 2015 MLB seasons from Baseball-Reference.com.
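As a rough illustration of this step (a sketch only, not our exact code), each game's Play by Play table could be pulled with requests and pandas; the box-score URL and column name below are assumptions.

```python
import pandas as pd
import requests

def fetch_play_by_play(game_url):
    """Download one Baseball-Reference box score and return its play-by-play table."""
    html = requests.get(game_url).text
    # Baseball-Reference wraps some tables in HTML comments, so strip the
    # comment markers before letting pandas parse the page.
    html = html.replace("<!--", "").replace("-->", "")
    tables = pd.read_html(html)
    # Keep the table that looks like a play-by-play log (column name assumed).
    for table in tables:
        if "Play Description" in table.columns:
            return table
    return None

# Hypothetical game URL following the usual box-score pattern.
pbp = fetch_play_by_play("https://www.baseball-reference.com/boxes/BOS/BOS201504130.shtml")
```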
Based on the structure of each game's Play by Play table, we wrote code to retrieve from each game:
Once we scraped this data from Baseball-Reference, we transformed it to:
With each relevant pitch on its own row, we then added the race of each pitcher and umpire included in the 2013-15 game data. We compiled this data by starting with an Excel file of an MLB "player census" done by BestTickets.com. This Excel file only included players on each team's Opening Day roster for the 2014 season, so we then manually identified any pitchers not in the initial player census using a Google Image Search. Fortunately, the BestTickets file classified players into four groups: white, black, Hispanic, or Asian. These groups matched those used in the Parsons et al. study. For umpires, we again used a Google Image Search to determine each umpire's race (sorted into the same four categories as players). For pitchers and umpires whose race was not immediately obvious to the initial evaluator after a search, other group members were asked to provide a second opinion. Batter data helped separate each plate appearance (and therefore helped determine the ball-strike count for each pitch), but it was dropped once it was no longer needed.
When we finally had race data for all pitchers and umpires, we merged it with the game data. From there, we converted the data into indicator variables (with the exception of run_diff, which is a quantitative term) classified into the following columns (a brief sketch of this encoding follows the list):
strike_given_called
: Whether the pitch was called a strike

upm
: Whether the umpire and pitcher race matched for that pitch

home_pitcher
: Whether the pitcher was pitching for the home team

run_diff
: The run difference for the pitcher's team at that point in the game; for example, if the pitcher's team led by 5 runs, this was 5, and if the team trailed by 4 runs, this was -4. If the game was tied, this was 0.

count_b-s
: Where b is the number of balls in the count and s is the number of strikes in the count.

inning_i
: Where i is the inning in which the pitch was thrown. If a pitch was thrown in extra innings, it was placed in the inning_9+ column.

At this point, the data was ready to be tested in models.
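As a minimal sketch of that encoding, assuming a per-pitch DataFrame with illustrative column names (call, pitcher_race, umpire_race, balls, strikes, inning), which are not our actual schema:

```python
import pandas as pd

def encode_pitches(df):
    """Build the indicator columns described above from one row per called pitch.
    Column names of the input df are assumptions, not our actual schema."""
    out = pd.DataFrame(index=df.index)
    out["strike_given_called"] = (df["call"] == "strike").astype(int)
    out["upm"] = (df["pitcher_race"] == df["umpire_race"]).astype(int)
    out["home_pitcher"] = df["home_pitcher"].astype(int)
    out["run_diff"] = df["run_diff"]  # quantitative, kept as-is

    # One-hot encode the ball-strike count, e.g. count_3-2
    counts = df["balls"].astype(str) + "-" + df["strikes"].astype(str)
    out = out.join(pd.get_dummies(counts, prefix="count", prefix_sep="_"))

    # One-hot encode the inning, pooling the ninth and extra innings into inning_9+
    innings = df["inning"].clip(upper=9).astype(str).replace("9", "9+")
    out = out.join(pd.get_dummies(innings, prefix="inning", prefix_sep="_"))
    return out
```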
Before we jump into the analysis, here are some brief facts about our data set. Readers eager to see the results may jump ahead to the next two sections.
The ratio of ball vs. strike calls is 67:33.
Count of pitches by pitcher race.
Count of pitches by umpire race.
Based on the visualization, whether the umpire and pitcher are the same race does not appear to be a strong predictor of called strikes.
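For reference, the comparison behind that visualization amounts to a simple groupby on the encoded data; pitches here stands for the DataFrame produced by the encoding sketch above.

```python
# Called-strike rate with and without an umpire-pitcher race match.
strike_rate_by_upm = pitches.groupby("upm")["strike_given_called"].mean()
print(strike_rate_by_upm)
```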
We applied a logistic regression model to evaluate the potential effects of umpire and pitcher race on whether an umpire would be more likely to call a ball or a strike. That is, we modeled the probability of an umpire calling a strike on a called pitch, $\hat{\pi}$, in terms of an indicator variable $x_{upm}$, which equals 1 if umpire and pitcher match on a racial category (UPM) and 0 otherwise, and a vector of control variables $\mathbf{x}_{controls}$ described in the section on data cleaning:

$$\log\frac{\hat{\pi}}{1-\hat{\pi}} = \hat{\beta}_0 + \hat{\beta}_{upm}\,x_{upm} + \hat{\boldsymbol{\beta}}_{controls}^{\top}\,\mathbf{x}_{controls}$$

Since the coefficients of a logistic model cannot be interpreted directly, we quantified the effect of UPM on the probability by taking the partial derivative of the probability estimator with respect to the indicator variable...

$$\frac{\partial \hat{\pi}}{\partial x_{upm}} = \hat{\beta}_{upm}\,\hat{\pi}\,(1-\hat{\pi})$$

...and evaluating it with the indicator variable set at its baseline ($x_{upm} = 0$) and all control variables set at their means.
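A sketch of this fit and the marginal-effect calculation, assuming the encoded DataFrame is named pitches (statsmodels is shown here as one reasonable tool, not necessarily the one we used):

```python
import statsmodels.api as sm

# pitches is assumed to be the encoded per-pitch DataFrame described above.
y = pitches["strike_given_called"]
# Drop one reference category per dummy group to avoid perfect collinearity.
X = pitches.drop(columns=["strike_given_called", "count_0-0", "inning_1"], errors="ignore")
X = sm.add_constant(X).astype(float)

model = sm.Logit(y, X).fit()

# Marginal effect of upm at baseline (upm = 0), with controls held at their means.
x_bar = X.mean()
x_bar["upm"] = 0.0
pi_hat = float(model.predict(x_bar.to_frame().T)[0])
upm_effect = model.params["upm"] * pi_hat * (1 - pi_hat)
print(model.params["upm"], model.pvalues["upm"], upm_effect)
```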
After some testing with the model, we found that the significance of upm (umpire-pitcher match) is most prominent when all of the features (upm, home_pitcher, run_diff, count_b-s, and inning_i) are included. This was an interesting finding: it seemed to indicate that our model was indeed picking up on a significant, if subtle, effect in certain cases, and that the control variables were serving a purpose by filtering out noise that would otherwise prevent the upm feature from expressing itself fully.
First, for each case we examined, we applied scikit-learn's recursive feature elimination to the appropriate subset of data to rank the features in order of significance. Then, we calculated the coefficient of upm and its p-value as control features were eliminated from the model in order of their ranking. The graphs below illustrate that the magnitude of the upm coefficient increases and the p-value decreases as features are added. This shows that the effect of upm, subtle though it may be in some cases, is significant once we control for other variables.
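A sketch of that ranking-and-refitting loop, assuming the encoded DataFrame is named pitches and using scikit-learn's LogisticRegression for the coefficients (the p-values would come from a statsmodels refit like the one above):

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# pitches is assumed to be the encoded per-pitch DataFrame described earlier.
y = pitches["strike_given_called"]
X = pitches.drop(columns=["strike_given_called"])

# Rank every feature by recursively eliminating the least important one.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1).fit(X, y)
ranking = sorted(zip(rfe.ranking_, X.columns))  # rank 1 = most important

# Refit on progressively larger feature sets, tracking the upm coefficient.
for k in range(1, len(ranking) + 1):
    kept = [name for rank, name in ranking[:k]]
    if "upm" not in kept:
        continue
    model = LogisticRegression(max_iter=1000).fit(X[kept], y)
    print(k, model.coef_[0][kept.index("upm")])
```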
| Umpire Race | UPM Coefficient | UPM Effect | p-value |
|---|---|---|---|
| Black | -0.0675 | -1.43% | 0.049 |
| Nonwhite | -0.0237 | -0.50% | 0.088 |
| All | 0.0051 | +0.11% | 0.239 |
| White | 0.0053 | +0.11% | 0.275 |
| Hispanic | -0.0042 | -0.09% | 0.860 |
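To connect the two columns: the effect figures follow from the marginal-effect formula above. Taking the black-umpire row as an illustration and using the overall called-strike rate of roughly 0.33 as a stand-in for $\hat{\pi}$ (the reported figure uses that subset's own baseline and control means, so it differs slightly):

$$\frac{\partial \hat{\pi}}{\partial x_{upm}} \approx -0.0675 \times 0.33 \times 0.67 \approx -0.015,$$

i.e., an umpire-pitcher race match in that case lowers the called-strike probability by about 1.5 percentage points, in line with the -1.43% shown in the table.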
UPM coefficient with feature elimination effects overall
UPM p-value with feature elimination effects overall
UPM coefficient with feature elimination effects when umpire is white
UPM p-value with feature elimination effects when umpire is white
UPM coefficient with feature elimination effects when umpire is black
UPM p-value with feature elimination effects when umpire is black
UPM coefficient with feature elimination effects when umpire is Hispanic
UPM p-value with feature elimination effects when umpire is Hispanic
UPM coefficient with feature elimination effects when umpire is nonwhite
UPM p-value with feature elimination effects when umpire is nonwhite
When considering the overall data, our results were similar to those put forth by Parsons et al. They also did not find a significant difference in the frequency of called strikes when the race of the umpire and pitcher matched, although, using linear probability models, they reported a lower p-value (p = 0.12).
However, we found the opposite of the original research when looking at black and nonwhite matches between umpire and pitcher. Their results showed that when umpire and pitcher matched in race, umpires were more likely to call a strike, whereas we found black and nonwhite umpire-pitcher matches to be less likely to result in a called strike.
There were some marked differences between the Parsons et al. analysis and ours. We performed a logistic regression, whereas they used a linear probability model. We also used a different dataset covering different years (2013-2015 rather than 2004-2006). Changes in racial diversity, game rules, and technology -- electronic monitoring of calls is now in place in every stadium for every game -- along with any number of unknown factors we did not control for, could account for the differences in our findings.