In my article last week, I developed xLOB% as a descriptive
statistic to estimate a pitcher’s LOB%. In this article, I will attempt to
predict LOB% of a pitcher using his statistics from the previous season.
The resulting statistic, pLOB%, has fairly weak predictive power, explaining
only 12.7% of the variation in a pitcher’s LOB% in the following season, but it
still does better than Steamer’s projection and kLOB%.
After talking with fellow Batting Leadoff blogger Morris
Greenberg, who also wrote about LOB% in his recent article, I took his
advice to separate starters from relievers. Relievers’ LOB% can be significantly
dependent on factors outside their control if they are frequently pitching less
than a full inning. As a result, I have restricted my sample to starting
pitchers with a minimum of 50 innings pitched from 2007 to 2013. I am only
interested in their performance as starters, so their statistics when they
pitch in relief are not factored in.
Using the new sample, I made a substantial change to my formula
for xLOB%. Although rSB was a significant variable in the old model, I omitted
it there because it added less than 1% to the R² value. With the new sample,
however, rSB/150 (150 IP is an arbitrary baseline to make this a rate stat)
adds around 3% to the R² value. As a result, I decided to include rSB/150 in
addition to BABIP and K%.
The new formula is xLOB% = 0.9046 - 0.9 × BABIP + 0.435 × K% + 0.00607 ×
rSB/150. The R² value is 38.21% and the standard error is 4.20%.
Testing out of sample using starting pitchers with a minimum of 50 IP from
2003-2006 (2003 is the first season with data on rSB), xLOB% has a correlation
of 0.585 with LOB%, compared to the correlation of 0.618 from 2007-2013. xLOB%
still does not perform well as a predictor. It has a correlation of only 0.291
with LOB% of the following season using data from 2007-2013. This is a lower
correlation coefficient than K% alone has with LOB% of the following season
(0.331).
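As a quick illustration, the xLOB% formula above can be written as a small function. This is only a sketch: the function and parameter names are mine, all rates are assumed to be expressed as fractions (0.300 rather than 30%), and I am assuming rSB/150 means a pitcher's rSB scaled to 150 innings. The sample pitcher is made up.

```python
def xlob(babip: float, k_pct: float, rsb: float, ip: float) -> float:
    """Descriptive xLOB% using the coefficients quoted in the article.

    babip and k_pct are fractions (e.g. 0.300, 0.20); the result is a fraction.
    """
    # Assumption: rSB/150 is rSB prorated to a 150 IP baseline.
    rsb_per_150 = rsb * 150.0 / ip
    return 0.9046 - 0.9 * babip + 0.435 * k_pct + 0.00607 * rsb_per_150


# Hypothetical starter: .300 BABIP, 20% K%, 1.2 rSB in 180 IP.
print(f"xLOB% = {xlob(0.300, 0.20, 1.2, 180.0):.1%}")
```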
As with xLOB%, I ran a multiple regression to develop pLOB%.
The dependent variable is the LOB% of the pitcher for the following season and
the sample includes all starting pitchers with at least 50 innings in
back-to-back seasons from 2007-2013. K%, LOB%, F-Strike% and LD% are all
significant variables at α = 0.05. However, both F-Strike% and LD% add less than
1% in R² to the model. In fact, LOB% also adds less than 2% to the
model. The model with K% and LOB% explains 12.68% of the variation in LOB% the
following season, while the model with K% alone achieves an R² of
10.93%. I want to compare both models in predicting LOB%, so I named the model
with K% and LOB% pLOB%, and the model with only K% kLOB%.
pLOB% = 0.553 + 0.328 × K% + 0.146 × LOB%
kLOB% = 0.6478 + 0.388 × K%
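For anyone who wants to try the two predictive formulas, here is a minimal sketch. The function names are mine, the inputs are the pitcher's previous-season K% and LOB% as fractions, and the example pitcher is hypothetical.

```python
def plob(k_pct: float, lob_pct: float) -> float:
    """pLOB%: next-season LOB% predicted from last season's K% and LOB%."""
    return 0.553 + 0.328 * k_pct + 0.146 * lob_pct


def klob(k_pct: float) -> float:
    """kLOB%: next-season LOB% predicted from last season's K% alone."""
    return 0.6478 + 0.388 * k_pct


# Hypothetical starter who posted a 20% K% and a 72% LOB% last season.
print(f"pLOB% = {plob(0.20, 0.72):.1%}")
print(f"kLOB% = {klob(0.20):.1%}")
```

Note how small the LOB% coefficient is. Ten extra points of LOB% last season move the pLOB% projection by only about a point and a half.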
Going forward, I will compare both models in predicting LOB%, using xLOB% and
LOB% of the previous season as references. I also wanted to compare
them to established projection systems such as ZiPS and Steamer, but the only
data I could find that allows me to calculate projected LOB% is the Steamer
projection for 2013. As a result, I will start by comparing how well the
above statistics from 2012 predict pitchers’ LOB% in 2013.
Steamer performs very poorly in predicting LOB% in 2013. It
trails the previous season’s LOB% in correlation (r) and has significantly
larger errors than all three of the developed statistics. By r, xLOB% is the
strongest predictor
followed by pLOB% and kLOB%. By MAE (mean absolute error) and RMSE (root mean
squared error), pLOB% and kLOB% are neck and neck. The standard deviation of
each statistic shows how risky each statistic is in predicting LOB%. The higher
the standard deviation, the riskier the projection. The riskier the projection,
the more likely it will have large errors and hence the higher RMSE. It is
surprising that Steamer fares so poorly compared to the other developed
statistics despite being rather conservative, especially considering that
Steamer takes into account multiple years of data while all the versions of
LOB% estimators consider only the previous season.
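The three measures used in these comparisons (r, MAE and RMSE), plus the standard deviation of the projections, can be computed with a few lines of code. This is a sketch of the standard definitions rather than the exact spreadsheet work behind the tables; the function names are mine, the numbers in the example are made up, and I am assuming the sample standard deviation is the one reported.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def mae(predicted, actual):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error; penalizes large misses more than MAE does."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )


# Made-up projected and actual LOB% for three pitchers.
projected = [0.70, 0.72, 0.74]
actual = [0.71, 0.72, 0.76]
print(pearson_r(projected, actual), mae(projected, actual), rmse(projected, actual))
print(statistics.stdev(projected))  # spread of the projections themselves
```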
2007-2012 (data used to develop the models)
By correlation, pLOB% performs the best, followed by kLOB%
and xLOB%. By MAE and RMSE, pLOB% is slightly ahead of kLOB% with xLOB% pretty
far behind. This result is not unexpected, as the models are derived from this
exact dataset. pLOB% includes one more variable than kLOB%, so it should be
expected to produce stronger results in this dataset.
2003-2006
This is where pLOB% and kLOB% should be compared, as this
dataset represents out-of-sample testing. pLOB% slightly edges kLOB% in terms
of performance using all three measures. It also has a larger standard
deviation, showing that it has the riskier projection of the two. pLOB%, in
this dataset, is clearly the winner in projecting LOB%. We can also see that
xLOB% does not have strong predictive power, as its correlation with LOB% of
the following season is essentially the same as that of the previous season’s
LOB%. xLOB% is meant to be descriptive, and this result confirms its weak
predictive power.
pLOB% has proven to be the most predictive of the statistics
developed, as it should be given the intention behind the development of the
statistic. It also fares surprisingly well against Steamer in one year of data.
Despite its relatively strong performance, pLOB% is still rather weak in its
predictive power. It accounts for only 12.7% of the variation in LOB% from
2007-2012 and 7.7% from 2003-2006. This reinforces what we have always known,
that a large part of LOB% is subject to random variation. pLOB% is simply the best
statistic we can use in predicting LOB%.
All statistics courtesy of Fangraphs.