This was the final project for my STAT 512 class, Applied Regression Analysis, taught by Professor Bruce Craig. Since I eventually want to submit a research paper to the Sloan Statistics Conference before my senior year, I shared this goal with my team, and they agreed that we could use a sports topic for our project.
In this project I did all of the SAS (Statistical Analysis System) coding, my teammate Eli Stevenson did most of the data collection, and Aaron Michael Johnson did most of the data analysis.
Here is our report. The code we used is shown in the appendix:
Background and Description of the Problem
Statistics have become more prominent in sports, not only for fans who bet on teams but also for analysts who try to optimize the game by looking for correlations in the data and identifying the factors behind a winning team. The problem addressed in this report is to determine which of the following NBA factors, chosen by our team, best predict the number of regular season wins.
The following is the list of factors:
-Number of All-Star Players (allstars)
-Average Age of Starting Lineup (age)
-Percentage of wins in the preseason (preseason)
-Number of wins from last season (lastwins)
-Average salary of players on the team (salary)
-Points per game (PPG)
-3-Pointers taken per game (PAG3)
-Opponent points per game (oppoints)
-Back-to-back games in a season (backtoback)
-Personal Fouls per game (pf)
-Team Value in billions of dollars (tv)
-Revenue per game in millions of dollars (rev)
Description of the Analysis Methods
The factors chosen to be included in the analysis are continuous variables which allows for the data set to be analyzed using multiple linear regression as opposed to analysis of variance. Additionally, the question of the problem involves predicting a team’s winning potential which would require a fitted regression line. The data collected for this analysis is from each team in the NBA from the 2015-16 season and is shown in Figure 1. A summary of the statistics for the response and 12 potential explanatory variables across the 30 teams (aka observations) is shown in Figure 2.
A common concern with multiple linear regression is highly correlated explanatory variables (multicollinearity), which inflates the standard errors of the parameter estimates and can produce “false negatives” in the significance tests of individual parameters. To assess this concern, Figure 3 shows the Pearson correlations between all of the variables, including the response variable. For visual confirmation of the existence or lack of correlations between variables, Figure 4 shows a subset of the scatter plot matrix, since SAS would not create the full 13×13 grid of scatter plots. The highest Pearson correlation coefficient is 0.92, between team value and average revenue per game. If multicollinearity appears to be an issue in the parameter estimates of the regression results, one of these two variables would most likely be eliminated to move the X’X matrix further from being singular. The absolute values of all other correlations are less than 0.6, which raises little concern about high correlations among the other explanatory variables.
Figure 1: Data set for each NBA team for the 2015-16 season
Figure 2: Summary statistics for the response and explanatory variable sample data sets
Figure 3: Pearson Correlation Coefficients
Figure 4: Scatter Plot Matrix
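Pairwise correlations can miss collinearity that involves more than two variables at once, so a complementary check is to request variance inflation factors from PROC REG. The sketch below was not part of the original analysis; it assumes the data step in Appendix A has already created the data set project. A commonly cited rule of thumb treats a VIF above 10 as a sign of serious multicollinearity, so tv and rev would be the first values to inspect.

```sas
/* Assumes the data step in Appendix A has already created the data set project. */
proc reg data=project;
  /* VIF and TOL print variance inflation factors and tolerances for each
     parameter; COLLIN adds eigenvalue-based collinearity diagnostics. */
  model wins = age lastwins preseason allstars salary PPG PAG3
               oppoints backtoback pf tv rev / vif tol collin;
run;
quit;
```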
Diagnostic plots of the residuals for the full model are shown in Figure 5 with the residuals plotted against each of the explanatory variables in Figure 6.
Figure 5: Residual & Fit Diagnostic Plots
The QQ plot is shown in the fit diagnostics plot. It suggests the residuals are approximately Normal because the fit is relatively linear, even though there is a slight S-shape in the QQ plot.
The histogram of residuals is shown in the bottom left panel of the fit diagnostics plot. The residuals of the wins look relatively Normal.
There do not appear to be violations of assumptions in any of the residual plots.
Figure 6: Residuals plotted against each explanatory variable
In each of the residual plots, the variance appears to be constant, although it is difficult to judge the variance where only one data point is present at the extreme of an explanatory variable. The residuals appear to be reasonably Normal, and no real deviations can be found in these plots.
From the leverage plot in the upper right of Figure 5, there are three teams with unusual leverages.
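The teams with unusual leverage can also be identified numerically. The following is a minimal sketch that was not part of the original analysis; it assumes the data set project from Appendix A exists, and the names diag, leverage, cooksd, and rstud are illustrative. The cutoff 2p/n, with p = 13 parameters (including the intercept) and n = 30 teams, is one common rule of thumb rather than a strict test.

```sas
/* Assumes the data step in Appendix A has already created the data set project. */
proc reg data=project plots(only)=(rstudentbyleverage cooksd);
  /* INFLUENCE and R print leverage, studentized residuals, Cook's D, and DFFITS. */
  model wins = age lastwins preseason allstars salary PPG PAG3
               oppoints backtoback pf tv rev / influence r;
  /* Save the diagnostics so high-leverage teams can be listed afterwards. */
  output out=diag h=leverage cookd=cooksd rstudent=rstud;
run;
quit;

/* List observations whose leverage exceeds 2p/n = 2*13/30 (about 0.87). */
proc print data=diag;
  where leverage > 2*13/30;
  var wins leverage cooksd rstud;
run;
```

A high-leverage team is only a concern if it also has a large Cook's D or studentized residual, which is why all three are printed together.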
After checking the assumptions and finding no blatant violations, the data was then used to determine the results of the multiple linear regression.
The results of the multiple linear regression with the full model are shown below in Figure 7. The ANOVA table shows that the explanatory variables in the model are jointly useful for predicting the number of wins, as shown by the P-value < 0.0001. The table below it displays model statistics such as the adjusted R-square, which indicates that this model explains about 96% of the variability in the number of wins a team has in a season after adjusting for the number of explanatory variables. Additionally, the estimated standard deviation (Root MSE) is only 2.84 wins.
Figure 7: ANOVA table for first order linear model (top), model statistics (middle), and parameter estimates (bottom)
The last table in Figure 7 shows the estimated coefficients for the twelve explanatory variables in the linear model. Looking at the P-values for the individual significance tests (t-tests), most of the factors do not contribute significantly to modeling a team’s performance over the whole season once all other variables are included. The two exceptions are points per game and opponent points per game.
Below is the fitted equation for the full model to predict the number of wins by an NBA team.
X1 = age; X2 = lastwins; X3 = preseason; X4 = allstars; X5 = salary; X6 = PPG; X7 = PAG3; X8 = oppoints; X9 = backtoback; X10 = pf; X11 = tv; X12 = rev
Y(hat) = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6 + β7X7 + β8X8 + β9X9 + β10X10 + β11X11 + β12X12
Y(hat) = 15.60488 + 0.24139X1 + 0.12403X2 + 1.25479X3 – 0.13042X4 – 0.00799X5 + 2.48723X6 – 0.17278X7 – 2.34036X8 + 0.37148X9 – 0.22553X10 + 0.37280X11 + 0.00393X12
R^2 = 0.9754
H0 : β1=β2=β3=β4=β5=β6=β7=β8=β9=β10=β11=β12=0
Ha: at least one βk ≠ 0 (k = 1, ..., 12)
F* = 56.13
Df = 12 and 17
P-value < 0.0001
This P-value means that we should reject the null hypothesis. Rejecting the null hypothesis means that at least one of these variables is useful in a linear model of that season’s wins. By looking at the t-tests for each of the individual coefficients, points per game and opponent points per game are found significant even after fitting the other variables first.
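As a quick consistency check on the overall F-test, the statistic can be recovered from the reported R^2 with n = 30 teams and p = 12 explanatory variables. This is standard regression arithmetic rather than additional SAS output, and the small gap from the reported 56.13 is within what rounding R^2 to four decimals would produce:

```latex
F^{*} = \frac{R^{2}/p}{(1-R^{2})/(n-p-1)}
      = \frac{0.9754/12}{(1-0.9754)/17}
      \approx \frac{0.08128}{0.001447}
      \approx 56.2
```

The denominator degrees of freedom, n - p - 1 = 30 - 12 - 1 = 17, match the Df reported above.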
Because so many variables were being used and many of them clearly did not contribute to the model, we looked for a reduced model to better explain the data. The best three models based on their Adjusted R-Square are shown in Figure 8, and the reduced model chosen is the one with the highest Adjusted R-Square. Adjusted R-Square was used as the selection criterion because, unlike R-Square, it penalizes the statistic for the number of explanatory variables in the model.
Figure 8: Best three models based on Adjusted R-Square
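For reference, the adjustment works as follows, where n is the number of observations and p the number of explanatory variables. Applying the standard formula to the full model (R^2 = 0.9754, n = 30, p = 12) reproduces the roughly 96% quoted for it earlier; this is a worked illustration, not extra output from SAS:

```latex
R^{2}_{adj} = 1 - (1 - R^{2})\,\frac{n-1}{n-p-1}
            = 1 - (1 - 0.9754)\,\frac{29}{17}
            \approx 1 - 0.0420
            \approx 0.958
```

Because the factor (n-1)/(n-p-1) grows with p, adding a variable raises the adjusted value only if it reduces the unexplained variation by enough to offset the lost degree of freedom.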
The best model by Adjusted R-Square is the model with just age, last year’s wins, points per game, 3-pointers taken per game, opponent points per game, and number of back-to-back games.
Comparing the estimated standard deviations between the two models, we see that the reduced model has a Root MSE of 2.49 while that of the full model is 2.85, so the variation about the fitted line decreases when using the reduced model. Figure 9 shows these statistics as well as the parameter estimates for the six explanatory variables in this reduced model. Figures 10 and 11 show plots for diagnostics of the regression and the residuals. The normal quantile (QQ) plot and the histogram of the residuals reveal a very Normal residual distribution. However, in Figure 11, the residuals plotted against the explanatory variables show some signs of nonconstant variance: the spread of the residuals tends to decrease as the PPG statistic increases.
Figure 9: ANOVA table (top), regression statistics (middle), and parameter estimates (bottom) for the reduced model to predict NBA regular season wins.
Figure 10: Diagnostic plots for the reduced model
Figure 11: Residuals vs. explanatory variables for reduced model
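The sign of nonconstant variance against PPG noted for the reduced model can be checked informally beyond the plots. One simple option, sketched here as an addition to the original analysis, is the Spearman rank correlation between the absolute residuals and PPG (or the fitted values); a clearly negative correlation with PPG would agree with the pattern seen in the plot. The sketch assumes the data set project from Appendix A, and the names red_diag, resid, pred, and absres are illustrative. With only 30 teams, this should be read as a rough indication rather than a formal test.

```sas
/* Assumes the data step in Appendix A has already created the data set project. */
proc reg data=project noprint;
  model wins = age lastwins PPG PAG3 oppoints backtoback;
  /* Save residuals and fitted values from the reduced model. */
  output out=red_diag r=resid p=pred;
run;
quit;

/* The absolute residual is a simple proxy for the spread at each observation. */
data red_diag;
  set red_diag;
  absres = abs(resid);
run;

/* Rank correlation of the spread with PPG and with the fitted values. */
proc corr data=red_diag spearman;
  var absres;
  with PPG pred;
run;
```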
After defining the problem to investigate, which was determining what factors contributed the most to the number of NBA wins in the 2015-2016 regular season, the data for twelve variables were run through a multiple linear regression. Once the full model was run, the reduced model was created by selecting the model with the highest Adjusted R-Square. The best model by this criterion consisted of age, last year’s wins, points per game, 3-pointers taken per game, opponent points per game, and back-to-back games. This model had an Adjusted R-Square of 0.9678, which is indicative of a strong model.
The variables selected in the reduced model also make sense in a real-world context. The model suggests that teams with older players, with more wins in the previous year, who scored more points, who attempted fewer 3-pointers, and who conceded fewer points were the teams that won the most games in the regular season. One interesting result from the reduced model was a positive relationship between back-to-back games and wins. On the other hand, variables such as number of all-stars, preseason win percentage, team salary, personal fouls, team value, and revenue were not as significant. For some of these factors, such as preseason win percentage and personal fouls, the lack of importance is intuitive. Unexpectedly, all-stars and team salary are not in the best reduced model. This suggests that a team cannot simply buy wins and that a more balanced team could have an advantage over teams with all-stars.
-----------------------
Appendix A: SAS Code

data project;
  input wins age lastwins preseason allstars salary PPG PAG3 oppoints backtoback pf tv rev;
  datalines;
57 28 53 0.1428571429 1 107 104.3 29.6 98.3 19 20.2 0.915 149
56 28.2 49 0.7142857143 2 72.3 102.7 23.4 98.2 17 20.1 0.92 151
48 30 37 0.5 2 84.7 100 18 98.4 17 18.7 1.175 188
48 29.2 60 0.5714285714 2 71.4 102.8 28.4 99.2 19 19.2 0.825 133
48 25.4 40 0.8571428571 1 77.2 105.7 26.1 102.5 19 21.7 1.7 173
48 25.4 33 0.875 0 78.5 103.4 29.4 100.7 16 18.1 0.725 130
45 26.4 38 0.7142857143 1 72.4 102.2 23 100.5 17 20.2 0.83 139
44 24.6 32 0.375 1 69.3 102 26.2 101.4 20 19 0.81 144
42 28.4 50 0.5 2 88.5 101.6 21.4 103.1 17 18.8 2 201
41 27.4 46 0.5714285714 1 85.2 104.1 24.2 104.6 19 20.8 0.9 143
35 23 25 0.75 0 65.5 102.1 22.2 103.7 19 20.7 0.875 143
33 22.8 41 0.3333333333 0 70.6 99 15.6 103.2 20 20.7 0.6 110
32 28.4 17 0.6666666667 1 73.4 98.4 21.5 101.1 17 19.7 2.5 278
21 28.4 38 0.3333333333 0 80.5 98.6 18.4 106 15 18 1.5 212
10 22.8 18 0.2857142857 0 65.5 97.4 27.5 107.6 19 21.7 0.7 125
73 26.2 67 0.4285714286 3 95.8 114.9 31.6 104.1 20 21 1.3 168
67 30.8 55 0.3333333333 2 87.6 103.5 18.5 92.9 17 17.6 1 172
55 25.2 45 0.8333333333 2 99 110.2 23.7 102.9 16 20.6 0.93 152
53 31 56 0.5 1 95.7 104.5 26.7 100.2 20 21.2 1.6 146
44 23.8 51 0.4285714286 0 66.6 105.1 28.5 104.3 19 21.9 0.94 153
42 31 50 0 0 73.1 102.3 28.6 102.6 17 19.7 1.15 168
42 32.4 55 0.8571428571 0 83.4 99.1 18.5 101.3 18 21.7 0.75 135
41 26.8 56 0.375 1 88.3 106.5 30.9 106.4 20 22 1.25 175
40 23.6 38 0.4285714286 0 64.8 97.7 23.8 95.9 18 20.2 0.85 142
33 25.4 29 0.8333333333 1 72.7 106.6 22.4 109.1 19 20.4 0.8 125
33 22.6 30 0.5714285714 0 71.4 101.9 23.7 105 16 21 0.855 136
30 26.8 45 0.4285714286 1 81.4 102.7 23.8 106.5 17 20.9 0.65 131
29 25.2 16 0.2857142857 0 70.7 102.4 16.4 106 14 20.7 0.625 128
23 25.6 39 0.6666666667 0 73.4 100.9 25.8 107.5 14 22.7 0.91 145
17 25.8 21 0.375 1 72 97.3 24.6 106.9 18 20.3 2.6 293
;

proc print data=project;
run;

proc means data=project maxdec=2;
  var wins age lastwins preseason allstars salary PPG PAG3 oppoints backtoback pf tv rev;
run;

proc univariate data=project;
  var wins age lastwins preseason allstars salary PPG PAG3 oppoints backtoback pf tv rev;
  histogram wins age lastwins preseason allstars salary PPG PAG3 oppoints backtoback pf tv rev / normal;
run;

proc corr data=project;
  var wins age lastwins preseason allstars salary PPG PAG3 oppoints backtoback pf tv rev;
run;

proc corr data=project;
  var age lastwins preseason allstars salary PPG PAG3 oppoints backtoback pf tv rev;
  with wins;
run;

proc corr data=project plots=matrix;
  var wins age lastwins preseason allstars salary PPG PAG3 oppoints backtoback pf tv rev;
run;

proc reg data=project;
  model wins = age lastwins preseason allstars salary PPG PAG3 oppoints backtoback pf tv rev;
run;

Here is the data analysis without the last season wins and the preseason percentage:

proc corr data=project;
  var wins age allstars salary PPG PAG3 oppoints backtoback pf tv rev;
run;

proc corr data=project;
  var age allstars salary PPG PAG3 oppoints backtoback pf tv rev;
  with wins;
run;

proc corr data=project plots=matrix;
  var wins age allstars salary PPG PAG3 oppoints backtoback pf tv rev;
run;

proc reg data=project;
  model wins = age allstars salary PPG PAG3 oppoints backtoback pf tv rev;
run;

proc reg data=project;
  model wins = age lastwins preseason allstars salary PPG PAG3 oppoints backtoback pf tv rev / selection = ADJRSQ best = 3;
run;

proc reg data=project;
  model wins = age lastwins PPG PAG3 oppoints backtoback;
run;