Saturday, February 15, 2014

Estimating LOB%

Luck is the usual explanation whenever a pitcher posts an ERA significantly lower than his FIP. Luck plays a huge role in two statistics in particular: BABIP and LOB%. Using Steve Staude’s pitching stat correlation tool, we can see that, for pitchers with a minimum of 30 innings pitched from 2007 to 2013, BABIP has a correlation of only 0.156 from one season to the next, while LOB% has a correlation of 0.205. These numbers are much lower than the year-to-year correlations of K% or BB%, suggesting that a large portion of BABIP and LOB% is subject to random variation and independent of a pitcher’s skill. However, the correlation is not 0. These stats are not completely random, and a pitcher can still play a small role in controlling his BABIP and LOB%.

Many writers, including Steve, have tackled the issue of BABIP using batted ball data. In this article, I will be estimating a pitcher’s LOB% for the current season with a stat I call xLOB%. This is not supposed to be a predictive stat, but a descriptive one. Think of it as FIP: while FIP estimates the pitcher’s ERA using strikeouts, walks and home runs, xLOB% estimates the pitcher’s LOB% given his other pitching statistics for the same season. In the next article, I will introduce pLOB%, which attempts to project a pitcher’s LOB% for the following season.


First, take a look at which statistics correlate most closely with LOB%. Again, I am using Steve’s pitching stat correlation tool and setting the minimum innings pitched at 30 from 2007 to 2013.


| Stat | Correlation with current-year LOB% | Correlation with next-year LOB% |
|---|---|---|
| BABIP | -0.452 | -0.127 |
| GB% | -0.050 | -0.047 |
| FB% | 0.103 | 0.059 |
| LD% | -0.135 | -0.030 |
| PU% (Popup%) | 0.166 | 0.106 |
| HR/FB | -0.131 | -0.135 |
| HR/TBF | -0.138 | -0.157 |
| K% | 0.421 | 0.348 |
| BB% | -0.037 | 0.052 |
| HBP% | -0.034 | 0.013 |
| O-Swing% | 0.246 | 0.169 |
| Z-Swing% | -0.040 | -0.057 |
| Swing% | 0.146 | 0.077 |
| O-Contact% | -0.163 | -0.165 |
| Z-Contact% | -0.332 | -0.311 |
| Contact% | -0.331 | -0.307 |
| Zone% | -0.046 | -0.034 |
| SwStr% | 0.345 | 0.302 |
| Foul% | 0.311 | 0.256 |
| rSB | 0.062 | 0.009 |
| rPM | 0.045 | 0.001 |
| LOB% | 1.000 | 0.205 |

Looking at the first column, a few stats stand out as strongly correlated with LOB%. BABIP has the strongest correlation with LOB%, at -0.452. This makes perfect sense, as a pitcher who gives up a lot of hits will have more of his base runners score. K% comes next at 0.421. This also makes sense, as a strikeout does not advance the runner, and high-strikeout pitchers should be able to strand more runners without subjecting themselves to the whims of BABIP.

Next comes a series of stats that are highly correlated with K%, namely SwStr%, Z-Contact%, Contact% and O-Swing%. Foul%, which has a correlation of 0.311 with LOB%, initially caught me by surprise. However, a deeper look reveals that it has a correlation of 0.708 with K%, so it does not add much additional information. Both HR/FB and HR/TBF have a negative association with LOB% (-0.131 and -0.138), which is to be expected, as a home run scores every runner on base.

What surprises me the most is BB%, which has a correlation of only -0.037 with LOB%. I did not know exactly what to expect before the study, but I probably expected a stronger association, either positive or negative. Now that I think about it, a walk can be positively associated with LOB% because it produces the least dangerous kind of base runner, compared to a single or an extra-base hit. It does not advance the runners already on base as much as a hit does, and the batter only reaches first base. A walk can also be negatively associated with LOB% because it can still advance the base runners and makes it easier for them to score afterwards. The two effects seem to cancel each other out, leaving BB% without a strong association with LOB%.

I also tested the fielding statistics, but they do not appear to have strong associations with LOB%.
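The reasoning above about hits, strikeouts and home runs is easier to follow with the LOB% formula in view. Below is a minimal sketch of the formula as FanGraphs publishes it, LOB% = (H + BB + HBP - R) / (H + BB + HBP - 1.4 × HR). The function name `lob_pct` and the season line are made up for illustration and are not taken from the article's data.

```python
def lob_pct(h, bb, hbp, r, hr):
    """Left-on-base percentage, following the published FanGraphs formula.

    LOB% = (H + BB + HBP - R) / (H + BB + HBP - 1.4 * HR)
    """
    baserunners = h + bb + hbp
    denominator = baserunners - 1.4 * hr
    if denominator <= 0:
        raise ValueError("not enough base runners to compute LOB%")
    return (baserunners - r) / denominator


# Hypothetical season line, for illustration only.
example = lob_pct(h=180, bb=50, hbp=5, r=80, hr=20)
print(f"LOB% = {example:.3f}")
```

Strikeouts do not appear in the formula at all, so they can only help LOB% indirectly, by keeping base runners from scoring.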

Using multiple regression, my model is xLOB% = 0.87 - 0.76 × BABIP + 0.42 × K%, with BABIP and K% entered as decimals (for example, 0.300 and 0.200). The R-squared value is 31.8%. The standard error is 0.0574, suggesting that xLOB% typically differs from LOB% by about 5.74 percentage points. O-Swing%, rSB, FB% and HR/TBF are all significant variables in the model at α = 0.05. However, none of these variables adds more than 1% to the R-squared value, so I decided to omit them from the model to maintain its simplicity.
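As a quick worked example, the regression equation above can be written as a small function. This is a sketch that assumes BABIP and K% are entered as decimals (0.300 and 0.20, not 300 or 20). That assumption is mine, but it is the reading under which the model returns values in the usual LOB% range. The name `xlob_pct` is only illustrative.

```python
# Coefficients as reported in the article's regression.
INTERCEPT = 0.87
BABIP_COEF = -0.76
K_COEF = 0.42


def xlob_pct(babip, k_pct):
    """Estimate same-season LOB% from BABIP and K% (both as decimals)."""
    return INTERCEPT + BABIP_COEF * babip + K_COEF * k_pct


# 0.87 - 0.76 * 0.300 + 0.42 * 0.20 = 0.87 - 0.228 + 0.084 = 0.726
average = xlob_pct(0.300, 0.20)

# Same BABIP, but a 30% strikeout rate: 0.87 - 0.228 + 0.126 = 0.768
high_k = xlob_pct(0.300, 0.30)

print(f"{average:.3f} {high_k:.3f}")
```

In this model, every additional 10 percentage points of K% adds about 4.2 points of xLOB%, while every 10 points of BABIP (say .300 to .310) removes about 0.76 points.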

Testing out of sample, using data from 2002-2006 with a minimum of 30 innings pitched, xLOB% has a correlation of 0.573 with LOB%. This is very close to the correlation coefficient of 0.564 between xLOB% and LOB% in the data from 2007-2013, suggesting that the relationship between LOB% and the combination of BABIP and K% is not a quirk of the 2007-2013 data.
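The out-of-sample check described above amounts to applying the fitted equation, unchanged, to seasons that were not used to fit it, and then correlating the estimates with the actual values. The sketch below shows the mechanics with a plain-Python Pearson correlation. The `holdout` rows are invented for illustration, so the resulting number has no meaning, and the helper names are my own.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def xlob_pct(babip, k_pct):
    """The article's model, with inputs assumed to be decimals."""
    return 0.87 - 0.76 * babip + 0.42 * k_pct


# Hypothetical hold-out rows: (BABIP, K%, actual LOB%). Not real data.
holdout = [
    (0.280, 0.25, 0.78),
    (0.310, 0.18, 0.70),
    (0.295, 0.20, 0.73),
    (0.330, 0.15, 0.66),
    (0.270, 0.22, 0.75),
]

predicted = [xlob_pct(babip, k) for babip, k, _ in holdout]
actual = [lob for _, _, lob in holdout]
r = pearson_r(predicted, actual)
print(f"out-of-sample r = {r:.3f}")
```

As a consistency check on the reported figures, the in-sample correlation of 0.564 squares to about 0.318, which matches the 31.8% R-squared of the regression.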

How does xLOB% perform as a predictor? Not so well. Using data from 2007-2013, xLOB% has a correlation of 0.299 with the following season’s LOB%. This is lower than the correlation that K% alone has with the following season’s LOB% (0.348). The reason xLOB% is relatively useless as a predictor is that BABIP is very inconsistent from year to year. xLOB% itself has a year-to-year correlation of only 0.463, which is similar to the year-to-year correlation of PU%, but much lower than that of K% or BB%. How can LOB% be predicted? That will be the topic of my next article.
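The year-to-year comparisons above rely on pairing each pitcher season with the same pitcher's following season. Here is a minimal sketch of that pairing step, assuming a dictionary keyed by (pitcher, year). The data layout, the function name and the rule that both seasons must meet the 30-inning minimum are my assumptions, not details confirmed by the article or by Steve's tool, and the rows are made up.

```python
def next_season_pairs(seasons, predictor, target, min_ip=30.0):
    """Pair a stat in year N with another stat in year N+1 for each pitcher.

    seasons maps (pitcher, year) -> dict of stats, including "IP".
    Both seasons must reach min_ip innings (an assumed rule).
    """
    pairs = []
    for (pitcher, year), row in seasons.items():
        following = seasons.get((pitcher, year + 1))
        if following is None:
            continue  # no next season to compare against
        if row["IP"] < min_ip or following["IP"] < min_ip:
            continue  # one of the two seasons is too short
        pairs.append((row[predictor], following[target]))
    return pairs


# Made-up rows for illustration only.
seasons = {
    ("Pitcher A", 2011): {"IP": 180.0, "K%": 0.22, "LOB%": 0.74},
    ("Pitcher A", 2012): {"IP": 175.0, "K%": 0.23, "LOB%": 0.71},
    ("Pitcher A", 2013): {"IP": 20.0, "K%": 0.21, "LOB%": 0.80},
    ("Pitcher B", 2012): {"IP": 60.0, "K%": 0.18, "LOB%": 0.69},
}

pairs = next_season_pairs(seasons, "K%", "LOB%")
print(pairs)  # [(0.22, 0.71)]
```

Feeding the two columns of `pairs` into any Pearson correlation routine gives the kind of next-season correlation quoted above. Using the same stat for `predictor` and `target` gives a stat's year-to-year correlation with itself.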

All statistics courtesy of FanGraphs.
