Thursday, March 6, 2014

Predicting LOB%

In my article last week, I developed xLOB% as a descriptive statistic to estimate a pitcher’s LOB%. In this article, I attempt to predict a pitcher’s LOB% using his statistics from the previous season. The resulting statistic, pLOB%, is a fairly weak predictor, but it explains 12.7% of the variation in a pitcher’s LOB% in the following season, better than both Steamer’s projection and kLOB%, a simpler model based on K% alone.


After talking with fellow Batting Leadoff blogger Morris Greenberg, who also discussed LOB% in his recent article, I took his advice to separate starters from relievers. Relievers’ LOB% can depend heavily on factors outside their control if they frequently pitch less than a full inning. As a result, I have restricted my sample to starting pitchers with a minimum of 50 innings pitched from 2007 to 2013. I am only interested in their performance as starters, so their statistics from relief appearances are not factored in.

Using the new sample, I made a substantial change to my formula for xLOB%. While rSB was a significant variable in the old model, I omitted it because it added less than 1% to the R2 value. With the new sample, however, rSB/150 (150 IP is an arbitrary baseline to make this a rate stat) adds around 3% to the R2 value. As a result, I decided to include rSB/150 in addition to BABIP and K%.
The new formula for xLOB% is 0.9046 - 0.9 BABIP + 0.435 K% + 0.00607 rSB/150. The R2 value is 38.21% and the standard error is 4.20%. Tested out of sample on starting pitchers with a minimum of 50 IP from 2003-2006 (2003 is the first season with data on rSB), xLOB% has a correlation of 0.585 with LOB%, compared to a correlation of 0.618 from 2007-2013. xLOB% still does not perform well as a predictor. Using data from 2007-2013, it has a correlation of only 0.291 with LOB% of the following season. That is lower than the correlation K% alone has with LOB% of the following season (0.331).
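
As a rough illustration, the xLOB% formula above can be written as a small Python function. This is only a sketch under stated assumptions: the function and parameter names are illustrative, the rates are assumed to be entered as fractions (0.20 for a 20% strikeout rate), rSB/150 is assumed to mean rSB scaled to 150 innings, and the example pitcher is hypothetical rather than a real player.

```python
def rsb_per_150(rsb: float, innings: float) -> float:
    """Scale rSB to a 150-inning baseline (assumed reading of rSB/150)."""
    return rsb * 150.0 / innings


def xlob(babip: float, k_rate: float, rsb150: float) -> float:
    """xLOB% from the coefficients quoted above, with rates as fractions."""
    return 0.9046 - 0.9 * babip + 0.435 * k_rate + 0.00607 * rsb150


# Hypothetical starter: .300 BABIP, 20% K%, 1 rSB over 180 innings.
estimate = xlob(0.300, 0.20, rsb_per_150(1.0, 180.0))
print(f"xLOB% = {estimate:.1%}")  # prints: xLOB% = 72.7%
```

With these inputs the BABIP and K% terms do almost all of the work; the rSB/150 term moves the estimate by only about half a percentage point.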

As with xLOB%, I ran a multiple regression to develop pLOB%. The dependent variable is the pitcher’s LOB% in the following season, and the sample includes all starting pitchers with at least 50 innings in back-to-back seasons from 2007-2013. K%, LOB%, F-Strike% and LD% turn out to be significant variables at α = 0.05. However, F-Strike% and LD% each add less than 1% in R2 to the model. In fact, LOB% itself adds less than 2%. The model with K% and LOB% explains 12.68% of the variation in LOB% the following season, while the model with K% alone achieves an R2 of 10.93%. I want to compare both models in predicting LOB%, so I name the model with K% and LOB% pLOB% and the model with only K% kLOB%.
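
To make the regression step concrete, here is a minimal sketch of an ordinary least squares fit in plain Python. It is a simplified single-predictor version (K% only), not the multiple regression described above, and the helper names are illustrative. The inputs are placeholders rather than real pitcher seasons: the next-season LOB% values are generated exactly from the K%-only line reported in this article, so the fit should simply hand those coefficients back.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variation in ys explained by the fitted line."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Placeholder inputs (not real data): K% as fractions, and next-season LOB%
# built without noise from the K%-only coefficients in this article.
k_rates = [0.12, 0.16, 0.20, 0.24, 0.28]
next_lob = [0.6478 + 0.388 * k for k in k_rates]

intercept, slope = fit_line(k_rates, next_lob)
print(round(intercept, 4), round(slope, 3))
print(round(r_squared(k_rates, next_lob, intercept, slope), 4))
```

Because the placeholder data contain no noise, the R2 here is essentially 1. Real pitcher seasons are far noisier, which is why the actual models explain only a small share of the variation.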

pLOB% has the formula 0.553 + 0.328 K% + 0.146 LOB%, where K% and LOB% are from the previous season.
kLOB% has the formula 0.6478 + 0.388 K%.
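
The two formulas translate directly into code. The sketch below assumes K% and LOB% are entered as fractions and uses two hypothetical starters with the same strikeout rate but different prior-season LOB%; the function names are illustrative.

```python
def plob(k_rate: float, lob: float) -> float:
    """pLOB%: projection from previous-season K% and LOB% (as fractions)."""
    return 0.553 + 0.328 * k_rate + 0.146 * lob


def klob(k_rate: float) -> float:
    """kLOB%: projection from previous-season K% alone (as a fraction)."""
    return 0.6478 + 0.388 * k_rate


# Hypothetical starters: both strike out 20% of batters, but one stranded
# 70% of runners last season and the other 80%.
for prior_lob in (0.70, 0.80):
    print(f"prior LOB% {prior_lob:.0%}: "
          f"pLOB% = {plob(0.20, prior_lob):.1%}, kLOB% = {klob(0.20):.1%}")
```

A 10-point gap in prior LOB% moves the pLOB% projection by only about 1.5 points (0.146 × 10), which shows how heavily the model regresses the previous season’s LOB% toward the middle.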

Going forward, I will compare both models in predicting LOB%, with xLOB% and LOB% of the previous season as references. I also wanted to compare them to established projection systems such as ZiPS and Steamer, but the only data I could find that allows me to calculate a projected LOB% is the Steamer projection for 2013. As a result, I will start by comparing how well the statistics above, calculated from pitchers’ 2012 seasons, predict their LOB% for 2013.

Steamer performs very poorly in predicting LOB% in 2013. It trails the previous season’s LOB% in correlation (r) and has significantly larger errors than all three of the other developed statistics. By r, xLOB% is the strongest predictor, followed by pLOB% and kLOB%. By MAE (mean absolute error) and RMSE (root mean squared error), pLOB% and kLOB% are neck and neck. The standard deviation of each statistic shows how risky it is as a predictor of LOB%: the higher the standard deviation, the more widely its projections are spread, the more likely it is to produce large errors, and hence the higher its RMSE tends to be. It is surprising that Steamer fares so poorly compared to the other developed statistics despite being rather conservative, especially considering that Steamer takes multiple years of data into account while all of the LOB% estimators consider only the previous season.
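
For readers who want to reproduce this kind of comparison, the four measures can be computed as in the sketch below. The helper names are illustrative, the standard deviation is assumed to be the population version, and the projected and actual values are placeholders, not figures from any of the datasets in this article.

```python
import math


def mae(pred, actual):
    """Mean absolute error between projections and actual values."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)


def rmse(pred, actual):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))


def std_dev(values):
    """Population standard deviation: the spread of a set of projections."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder projections and outcomes for three hypothetical pitchers.
projected = [0.70, 0.72, 0.74]
actual = [0.69, 0.72, 0.78]

print(f"r = {pearson_r(projected, actual):.3f}")
print(f"MAE = {mae(projected, actual):.4f}")
print(f"RMSE = {rmse(projected, actual):.4f}")
print(f"SD of projections = {std_dev(projected):.4f}")
```

RMSE can never be smaller than MAE on the same data, and the gap between them widens when a few projections miss badly.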

2007-2012 (data used to develop the models)

By correlation, pLOB% performs the best, followed by kLOB% and xLOB%. By MAE and RMSE, pLOB% is slightly ahead of kLOB% with xLOB% pretty far behind. This result is not unexpected, as the models are derived from this exact dataset. pLOB% includes one more variable than kLOB%, so it should be expected to produce stronger results in this dataset.

2003-2006

This is where pLOB% and kLOB% should be compared, as this dataset represents out-of-sample testing. pLOB% slightly edges kLOB% on all three measures. It also has a larger standard deviation, showing that it is the riskier projection of the two. In this dataset, pLOB% comes out ahead in projecting LOB%. We can also see that xLOB% does not have strong predictive power, as its correlation with LOB% of the following season is essentially the same as that of the previous season’s LOB% itself. xLOB% is meant to be descriptive, and this result confirms its weak predictive power.


pLOB% has proven to be the most predictive of the statistics developed, as it should be given the intention behind its development. It also fares surprisingly well against Steamer in one year of data. Despite its relatively strong performance, pLOB% is still rather weak in its predictive power. It accounts for only 12.7% of the variation in LOB% from 2007-2012 and 7.7% from 2003-2006. This reinforces what we have always known: a large part of LOB% is subject to random variation. Of the statistics tested here, pLOB% is simply the best one we can use to predict LOB%.

All statistics courtesy of Fangraphs.
