Predicting NCAA Division 1 Basketball Teams to Make the NCAA March Madness Tournament¶
By: Tyler Gordon
CMSC320 - Introduction to Data Science Final Project
Instructor: Maksym Morawski
University of Maryland Fall 2022
from PIL import Image
img = Image.open('img.jpg')
img = img.resize(size = (700, 500))
img.show()
Overview¶
College basketball is without a doubt one of my favorite things. There's nothing quite like being in the middle of the student section during a Maryland Terrapins home basketball game, or the excitement of the March Madness opening rounds when high-level games are on all day. A big part of the college basketball spectacle is the March Madness tournament, and making a bracket to predict the winner of the 64-team field is integral to enjoying it. 'Bracketology,' as it is often called, is the process by which experts and media pundits craft their predictions for the tournament throughout the season. I have always thought it would be pretty neat to try and replicate those predictions on my own, and now I have the capabilities! While I'm not going to predict seeds, matchups or winners, I am going to try and predict which teams will make the March Madness tournament in the first place. I also want to see if I can pinpoint certain team statistics that are more indicative of whether a team will make the tournament. Let's give it a go!
The Dataset¶
For this project, I have chosen to use Andrew Sundberg's College Basketball Dataset from Kaggle. This dataset contains data from the 2013-2019 and 2021 basketball seasons for all of the Division I basketball teams. I chose not to use the 2020 data because the postseason was cancelled due to the Covid-19 pandemic. The dataset has 24 columns with categorical and statistical data. The following are some of the more important columns we use throughout this project.
ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense)
ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)
EFG_O: Effective Field Goal Percentage Shot
EFG_D: Effective Field Goal Percentage Allowed
ADJ_T: Adjusted Tempo (An estimate of the tempo (possessions per 40 minutes) a team would have against the team that wants to play at an average Division I tempo)
3P_O: Three-Point Shooting Percentage
3P_D: Three-Point Shooting Percentage Allowed
The rest of the columns are described on the dataset's Kaggle page (linked above) and include games played, wins, conference and more statistics. The data was scraped from http://barttorvik.com/trank.php# and the creator of the dataset, Andrew Sundberg, cleaned it and added the POSTSEASON, SEED and YEAR columns.
Curating the Data¶
First we import all the packages needed later. A lot of our visualizations will use matplotlib and seaborn, and our models and testing will use the various sklearn packages below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree, preprocessing, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from scipy.stats import chi2_contingency
import warnings
warnings.filterwarnings("ignore")
cbb_13to21 = pd.read_csv('cbb.csv')
We want to predict tournament qualifiers, so the first step is to label every team in order to distinguish the teams that qualified for the tournament from those that didn't. We can check how many teams made the tournament with value counts: any team that didn't make the tournament has a NaN for the POSTSEASON value. We also print the sum of the value counts so we know exactly how many tournament teams are in the dataset.
cbb_13to21['POSTSEASON'].value_counts()
print(cbb_13to21['POSTSEASON'].value_counts().sum())
476
Now we want to flag the teams that didn't qualify. Since we know POSTSEASON NaNs are teams that didn't make the tourney, we can use np.where and notnull to create a new column where teams with a NaN for POSTSEASON get a value of False and all other teams get True. Viewing the output from the following cell we can see all the teams that did not qualify for the tournament across the seasons in the dataset, 1,979 rows in total.
The new TOURNEY column is added all the way at the end with the corresponding True or False value for each row.
cbb_13to21['TOURNEY'] = np.where(cbb_13to21['POSTSEASON'].notnull(), True, False)
cbb_13to21[cbb_13to21['TOURNEY'] == False]
| | TEAM | CONF | G | W | ADJOE | ADJDE | BARTHAG | EFG_O | EFG_D | TOR | ... | 2P_O | 2P_D | 3P_O | 3P_D | ADJ_T | WAB | POSTSEASON | SEED | YEAR | TOURNEY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | Duquesne | A10 | 30 | 11 | 107.0 | 111.7 | 0.3790 | 51.2 | 51.7 | 18.3 | ... | 49.5 | 47.7 | 36.2 | 38.5 | 67.6 | -11.3 | NaN | NaN | 2015 | False |
| 57 | Fordham | A10 | 30 | 9 | 101.0 | 103.0 | 0.4450 | 46.7 | 50.2 | 22.2 | ... | 47.8 | 49.6 | 29.8 | 34.1 | 65.9 | -12.3 | NaN | NaN | 2015 | False |
| 58 | George Mason | A10 | 30 | 8 | 101.2 | 103.8 | 0.4276 | 45.5 | 50.0 | 21.9 | ... | 44.9 | 48.4 | 31.6 | 35.3 | 65.0 | -12.6 | NaN | NaN | 2015 | False |
| 59 | George Washington | A10 | 35 | 22 | 107.2 | 96.2 | 0.7755 | 48.9 | 45.9 | 18.7 | ... | 47.3 | 44.9 | 35.2 | 31.9 | 62.7 | -2.3 | NaN | NaN | 2015 | False |
| 60 | La Salle | A10 | 33 | 17 | 98.9 | 92.9 | 0.6734 | 46.7 | 45.8 | 19.9 | ... | 46.1 | 45.1 | 32.1 | 31.6 | 64.8 | -6.3 | NaN | NaN | 2015 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2030 | Prairie View A&M | SWAC | 31 | 12 | 87.1 | 103.5 | 0.1209 | 41.3 | 46.9 | 21.6 | ... | 42.0 | 46.9 | 26.0 | 31.3 | 67.0 | -15.1 | NaN | NaN | 2013 | False |
| 2031 | Buffalo | MAC | 32 | 12 | 102.5 | 103.4 | 0.4761 | 50.4 | 47.2 | 23.7 | ... | 50.0 | 46.1 | 34.2 | 33.2 | 65.4 | -13.6 | NaN | NaN | 2013 | False |
| 2032 | Louisiana Lafayette | SB | 32 | 12 | 98.4 | 105.3 | 0.3135 | 48.6 | 48.8 | 19.6 | ... | 46.9 | 47.2 | 34.5 | 34.9 | 70.2 | -13.9 | NaN | NaN | 2013 | False |
| 2033 | Bethune Cookman | MEAC | 32 | 12 | 96.0 | 107.5 | 0.2134 | 45.9 | 51.0 | 18.4 | ... | 45.6 | 50.9 | 31.0 | 34.1 | 65.9 | -15.0 | NaN | NaN | 2013 | False |
| 2034 | Troy | SB | 33 | 12 | 97.3 | 107.5 | 0.2416 | 45.3 | 51.0 | 16.5 | ... | 44.5 | 49.5 | 31.2 | 35.8 | 61.1 | -16.6 | NaN | NaN | 2013 | False |
1979 rows × 25 columns
Now we can call describe on our dataframe to see some summary statistics for the dataset. One thing we can see is that there are 2,455 total teams. Using our new column above, we know we have 1,979 teams that didn't qualify for the tournament and 476 teams that did. These values add up to 2,455, so all the teams are accounted for in terms of the outcome we are looking for, which is an important step for curating our data before exploring and analyzing it.
We can also see that the mean games played is about 31.5 and the mean wins per team is about 16.3, along with standard deviation, min/max and quartile data for every column. This lets us begin to get some general ideas about the dataset as a whole, with all outcomes combined.
cbb_13to21.describe()
| | G | W | ADJOE | ADJDE | BARTHAG | EFG_O | EFG_D | TOR | TORD | ORB | ... | FTR | FTRD | 2P_O | 2P_D | 3P_O | 3P_D | ADJ_T | WAB | SEED | YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2455.000000 | 2455.000000 | 2455.000000 | 2455.000000 | 2455.000000 | 2455.000000 | 2455.000000 | 2455.000000 | 2455.000000 | 2455.000000 | ... | 2455.000000 | 2455.00000 | 2455.000000 | 2455.000000 | 2455.000000 | 2455.000000 | 2455.000000 | 2455.000000 | 476.000000 | 2455.000000 |
| mean | 31.492464 | 16.284318 | 103.304481 | 103.304603 | 0.493957 | 49.805132 | 50.001385 | 18.763055 | 18.689572 | 29.875642 | ... | 35.989776 | 36.26998 | 48.802281 | 48.976660 | 34.406395 | 34.598737 | 67.812301 | -7.802485 | 8.802521 | 2016.007332 |
| std | 2.657401 | 6.610960 | 7.376981 | 6.605318 | 0.256244 | 3.143061 | 2.939602 | 2.090595 | 2.201749 | 4.134332 | ... | 5.247820 | 6.24590 | 3.384468 | 3.340546 | 2.789434 | 2.415766 | 3.277622 | 6.965736 | 4.676354 | 1.999375 |
| min | 15.000000 | 0.000000 | 76.600000 | 84.000000 | 0.005000 | 39.200000 | 39.600000 | 11.900000 | 10.200000 | 15.000000 | ... | 21.600000 | 21.80000 | 37.700000 | 37.700000 | 24.900000 | 27.100000 | 57.200000 | -25.200000 | 1.000000 | 2013.000000 |
| 25% | 30.000000 | 11.000000 | 98.300000 | 98.500000 | 0.282200 | 47.750000 | 48.000000 | 17.300000 | 17.200000 | 27.100000 | ... | 32.400000 | 31.90000 | 46.500000 | 46.700000 | 32.500000 | 33.000000 | 65.700000 | -13.000000 | 5.000000 | 2014.000000 |
| 50% | 31.000000 | 16.000000 | 103.000000 | 103.500000 | 0.475000 | 49.700000 | 50.000000 | 18.700000 | 18.600000 | 29.900000 | ... | 35.800000 | 35.80000 | 48.700000 | 49.000000 | 34.400000 | 34.600000 | 67.800000 | -8.300000 | 9.000000 | 2016.000000 |
| 75% | 33.000000 | 21.000000 | 108.000000 | 107.900000 | 0.712200 | 51.900000 | 52.000000 | 20.100000 | 20.100000 | 32.600000 | ... | 39.500000 | 40.20000 | 51.000000 | 51.300000 | 36.300000 | 36.200000 | 70.000000 | -3.150000 | 13.000000 | 2018.000000 |
| max | 40.000000 | 38.000000 | 129.100000 | 124.000000 | 0.984200 | 59.800000 | 59.500000 | 27.100000 | 28.500000 | 43.600000 | ... | 58.600000 | 60.70000 | 62.600000 | 61.200000 | 44.100000 | 43.100000 | 83.400000 | 13.100000 | 16.000000 | 2019.000000 |
8 rows × 21 columns
Below I double check the number of null values in each column. We can verify that the 1,979 null values in POSTSEASON and SEED match up with the teams that did not make the tournament. SEED is essentially the same as POSTSEASON in this case, since teams that did not make the tournament do not have a seed and thus would be null. Later we will have to decide whether we want to include these columns in our analysis; if we do, we will need to remove or replace the null values. For the time being, however, the TOURNEY column will do fine for handling this.
cbb_13to21.isnull().sum()
TEAM          0
CONF          0
G             0
W             0
ADJOE         0
ADJDE         0
BARTHAG       0
EFG_O         0
EFG_D         0
TOR           0
TORD          0
ORB           0
DRB           0
FTR           0
FTRD          0
2P_O          0
2P_D          0
3P_O          0
3P_D          0
ADJ_T         0
WAB           0
POSTSEASON 1979
SEED       1979
YEAR          0
TOURNEY       0
dtype: int64
Exploring our Dataset¶
Now that we have made some adjustments to the dataset, we want to start exploring it for patterns, outliers and any general information we can learn from it. A good way of doing this is with visualizations. First we are going to make a heatmap using seaborn to view the correlations between the features (columns) in our dataset. Correlation describes the dependence between two variables, and a heatmap lets us see the magnitude of the relationship for every pair at once. Darker colors below indicate stronger correlation and lighter colors indicate weaker correlation.
We can see that games, wins and ADJOE all seem to be highly related. Two-point and three-point shooting percentage are also strongly related to offensive efficiency, which makes sense intuitively: the more shots a team makes, the more they score and the more productive they are offensively. WAB and BARTHAG are also highly related. These represent a team's wins above the bubble and their power rating respectively. Wins above bubble refers to the number of wins a team was above the level needed to get into the tournament; the 'bubble' usually refers to teams projected to be very close to making or missing the tournament. Power rating here refers to the chance of beating an average Division I opponent. It is intuitive that these are correlated: the further above the bubble a team is, the more likely it is to beat an average team.
sns.set(rc={'figure.figsize':(10,10)})  # set the figure size before plotting so it applies to this heatmap
sns.heatmap(cbb_13to21.corr(), cmap="YlGnBu")
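As a quick numerical complement to the heatmap above, we could also list the most strongly correlated feature pairs directly. This is a small sketch that was not part of the original analysis; it takes the absolute correlation matrix, keeps the upper triangle so each pair appears only once, and prints the ten largest values.
# Sketch: list the ten most strongly correlated feature pairs numerically
corr = cbb_13to21.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # upper triangle, excluding the diagonal
print(corr.where(mask).unstack().dropna().sort_values(ascending=False).head(10))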
Now I am going to use a pairplot. Pairplots plot the relationships between every pair of features in the dataset, here as scatterplots. I decided to leave several variables out of the pairplot. The team, conference, postseason and seed features are categorical, and I wanted to focus on the relationships between the statistical categories. I also didn't include games, wins, BARTHAG and WAB. My reasoning was that teams that made the tournament were likely going to have more games, and more wins, just from getting to play in the tournament, while BARTHAG and WAB are statistics already built for projecting team strength and tournament chances. I left these out so they don't introduce bias from another projection model or algorithm.
df = cbb_13to21.drop(columns=['TEAM', 'CONF', 'POSTSEASON', 'SEED', 'YEAR', 'G', 'W', 'BARTHAG', 'WAB'])
sns.pairplot(df, kind='reg', hue='TOURNEY')
<seaborn.axisgrid.PairGrid at 0x7fdc029dc550>
After looking at the heatmap and pairplot I wanted to look at some relationships a little closer. First I created two filtered dataframes: cbb_other holds the teams that didn't make the tourney and cbb_tourney holds the teams that did.
cbb_other = cbb_13to21[cbb_13to21['TOURNEY'] == False]
cbb_tourney = cbb_13to21[cbb_13to21['TOURNEY'] == True]
The first feature relationship I wanted to look at was the relationship between adjusted offensive efficiency and effective field goal percentage. These two seemed to have a strong positive correlation in both previous charts. As we can see below, the tournament teams (blue) definitely appear to have higher values for both of these variables, although the difference between tourney and non-tourney teams is not as drastic as one might expect. The correlation itself makes sense: make more shots and you have a better offense, and score more points and a team is likely to win more games. However, we cannot determine whether this relationship is significant quite yet.
fig, ax = plt.subplots()
ax.scatter(cbb_other['ADJOE'], cbb_other['EFG_O'], c='red', alpha = .5, label='Not Tourney Team')
ax.scatter(cbb_tourney['ADJOE'], cbb_tourney['EFG_O'], c='blue', label='Tourney Team')
ax.legend(loc='best')
plt.xlabel("Adjusted Offensive Efficiency")
plt.ylabel("Effective Field Goal Percentage")
plt.title("Adjusted Offensive Efficiency vs Effective Field Goal %")
plt.show()
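As a rough numerical check on the relationship plotted above (a quick sketch, not part of the original analysis), we could compute the Pearson correlation between ADJOE and EFG_O. Note that the p-value here only tests whether the correlation itself is nonzero; it says nothing yet about making the tournament.
from scipy.stats import pearsonr
# Sketch: Pearson correlation between offensive efficiency and effective field goal percentage
r, p = pearsonr(cbb_13to21['ADJOE'], cbb_13to21['EFG_O'])
print(f"Pearson r = {r:.3f}, p-value = {p:.3g}")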
Now let's take a look at adjusted defensive efficiency and effective field goal percentage allowed, two defensive statistics. Again it seems tourney teams have lower values for both of these categories, indicating stronger defenses. This also makes sense: keeping the opponent to a lower field goal percentage means fewer points allowed, which indicates a strong defense. Just like the previous scatterplot, there are some differences between tourney and non-tourney teams, but nothing drastic.
fig, ax = plt.subplots()
ax.scatter(cbb_other['ADJDE'], cbb_other['EFG_D'], c='red', alpha=.5, label='Not Tourney Team')
ax.scatter(cbb_tourney['ADJDE'], cbb_tourney['EFG_D'], c='blue', label='Tourney Team')
ax.legend(loc='best')
plt.xlabel("Adjusted Defensive Efficiency")
plt.ylabel("Effective Field Goal Percentage Allowed")
plt.title("Adjusted Defensive Efficiency vs Effective Field Goal % Allowed")
plt.show()
The next two scatterplots are very similar to the previous two. Now we are looking at three-point shooting for offense and defense compared to offensive and defensive efficiency. We can again see that there is probably a relationship between each pair of features, and the tournament teams again seem to do better in these categories for the most part.
fig, ax = plt.subplots()
ax.scatter(cbb_other['3P_O'], cbb_other['ADJOE'], c='red', alpha=.5, label='Not Tourney Team')
ax.scatter(cbb_tourney['3P_O'], cbb_tourney['ADJOE'], c='blue', label='Tourney Team')
ax.legend(loc='best')
plt.xlabel("Three-Point Shooting Percentage")
plt.ylabel("Adjusted Offensive Efficiency")
plt.title("Three-Point Percentage Shooting vs Adj Offensive Efficiency")
plt.show()
fig, ax = plt.subplots()
ax.scatter(cbb_other['3P_D'], cbb_other['ADJDE'], c='red', alpha=.5, label='Not Tourney Team')
ax.scatter(cbb_tourney['3P_D'], cbb_tourney['ADJDE'], c='blue', label='Tourney Team')
ax.legend(loc='best')
plt.xlabel("Three-Point Shooting Percentage Allowed")
plt.ylabel("Adjusted Defensive Efficiency")
plt.title("Three-Point Percentage Shooting Allowed vs Adj Defensive Efficiency")
plt.show()
Next I wanted to see the adjusted offensive and defensive efficiency plotted together in more detail. It would make sense to assume tournament teams have both a better offense and a better defense, allowing them to win games and make the tournament. Again, the scatterplot suggests the tourney teams mostly have better values in both areas (low defensive values on the y-axis paired with high offensive values on the x-axis).
fig, ax = plt.subplots()
ax.scatter(cbb_other['ADJOE'], cbb_other['ADJDE'], c='red', alpha=.5, label='Not Tourney Team')
ax.scatter(cbb_tourney['ADJOE'], cbb_tourney['ADJDE'], c='blue', label='Tourney Team')
ax.legend(loc='best')
plt.xlabel("Adjusted Offensive Efficiency")
plt.ylabel("Adjusted Defensive Efficiency")
plt.title("Adjusted Offensive Efficiency vs Adjusted Defensive Efficiency")
plt.show()
The next visualization I wanted to create was a violin plot. Violin plots let us see the probability density of the data at different values. I thought it would be useful to compare the distributions of tourney and non-tourney teams for the features we saw above, especially since there are about four times as many non-tourney teams as tourney teams. Hopefully we can get a better sense of the relationships and expand on what we learned above.
Below is the initial violin plot I created. It compares the adjusted offensive efficiency distributions. It is definitely apparent that tourney teams (labeled True) have somewhat higher values. The dotted lines represent the quartiles, similar to a box-and-whisker plot; the center dotted line in each violin is the median, and the tourney teams have a noticeably higher median. The median for non-tourney teams comes in right about where the bottom tail for the tourney teams starts. This suggests the bulk of the non-tourney teams have an ADJOE around that of the lower end of the tourney teams.
fig, ax = plt.subplots()
sns.violinplot(x = 'TOURNEY', y='ADJOE', data=cbb_13to21, inner='quartiles')
plt.xlabel("Made Final Four")
plt.ylabel("Adjusted Offensive Efficiency")
plt.title("Comparing Spread of Offensive Efficiency")
plt.show()
visual_df = cbb_13to21.drop(columns=['TEAM', 'CONF', 'POSTSEASON', 'SEED', 'YEAR', 'TOURNEY'])
dict2 = {
'G':'Games',
'W':'Wins',
'ADJOE':'Adjusted Offensive Efficiency',
'ADJDE':'Adjusted Defensive Efficiency',
'BARTHAG':'Chance of Beating an Average DI Team',
'EFG_O':'Effective Field Goal Percentage Shot',
'EFG_D':'Effective Field Goal Percentage Allowed',
'TOR':'Turnover Rate',
'TORD':'Steal Rate',
'ORB':'Offensive Rebound Rate',
'DRB':'Offensive Rebound Rate Allowed',
'FTR':'Free Throw Rate',
'FTRD':'Free Throw Rate Allowed',
'2P_O':'Two-Point Shooting Percentage',
'2P_D':'Two-Point Shooting Percentage Allowed',
'3P_O':'Three-Point Shooting Percentage',
'3P_D':'Three-Point Shooting Percentage Allowed',
'ADJ_T':'Adjusted Tempo',
'WAB':'Wins Above Bubble'
}
Now I want to make a violin plot for each of the remaining features. Above I dropped some of the non-statistical columns we discussed into a new dataframe for plotting, and I also created the dictionary above to supply titles for the violin plots, more of a cosmetic piece.
Below I iterate through each remaining column and create a violin plot, all on one figure for us to examine.
fig = plt.figure(figsize=(30,30))
cols = visual_df.columns
# Use a 5x4 grid so there are enough axes for all 19 statistical features
grid = fig.add_gridspec(5, 4)
subplot_list = []
for x in range(0, 5):
    for y in range(0, 4):
        subplot_list.append(fig.add_subplot(grid[x, y]))
for col, subplot in zip(cols, subplot_list):
    sns.violinplot(x='TOURNEY', y=col, data=cbb_13to21, inner='quartiles', ax=subplot)
    subplot.set_title(dict2[col])
for extra in subplot_list[len(cols):]:
    fig.delaxes(extra)  # remove any unused axes
fig.tight_layout()
plt.show()
Using the violin plots we can examine the distributions. Upon first glance a few features look to be similar for the tourney and non-tourney teams. The adjusted tempo, free throw rate, offensive rebound rate allowed and steal rate have very similar distributions, especially when looking at the mean and main density areas of the violin plots.
As we have seen, the efficiency and shooting statistics show distinct differences between tourney and non-tourney teams, with tourney teams having notably better distributions. Games, wins and BARTHAG all have interesting distributions. BARTHAG has a very odd shape, and as I talked about earlier this was somewhat expected: it is almost uniform for non-tourney teams, which makes sense, since BARTHAG measures the chance of beating an average team and the non-tourney teams span a wide range of strength around that average.
Games and wins still have the issue I mentioned earlier, with tournament teams naturally getting more of both just by playing tournament games. That's why I decided to drop games and wins, along with BARTHAG and wins above bubble, as I don't think they will help train an accurate model later.
Next I wanted to replot the correlations with the heatmap, now that we have more or less officially narrowed down some of the features, to get a closer look at our remaining features and their relationships to one another.
cbb = cbb_13to21.drop(columns=['TEAM', 'CONF', 'POSTSEASON', 'SEED', 'YEAR', 'G', 'W', 'BARTHAG', 'WAB'])
sns.set(rc={'figure.figsize':(10,10)})  # apply the figure size before drawing the heatmap
sns.heatmap(cbb.corr(), cmap="YlGnBu")
We see that, aside from the obvious pairs of shooting and shooting-efficiency statistics, most of the remaining features are not highly correlated with one another. This is much better than before: highly correlated predictors increase our error risk, and now that we have cut them down we can hopefully train an accurate model later.
Hypothesis Testing¶
My null hypothesis is going to be that none of the features we are looking at contributes to a team making the NCAA tournament. As a reminder, our remaining features are
- Adjusted Offensive Efficiency
- Adjusted Defensive Efficiency
- Effective Field Goal Percentage
- Effective Field Goal Percentage Allowed
- Turnover Rate
- Steal Rate
- Offensive Rebound Rate
- Offensive Rebound Rate Allowed
- Free Throw Rate
- Free Throw Rate Allowed
- Two-Point Shooting Percentage
- Two-Point Shooting Percentage Allowed
- Three-Point Shooting Percentage
- Three-Point Shooting Percentage Allowed
- Adjusted Tempo
We are going to carry out a Chi-Squared test on each column to get a p-value for how statistically significant that column's relationship with making the tournament is. P-values less than 0.05 will cause us to reject the null hypothesis for that column, indicating it does have an impact on a team making the NCAA tournament.
I pulled the feature columns into one dataframe and the outcome (TOURNEY) column into its own series. I then iterated through each feature, calculated its p-value, built lists to hold the dependent and independent variables, and printed output to make the results easy to decipher.
cbb_features = cbb.iloc[:,:-1]
cbb_label = cbb['TOURNEY']
ind = []
dep = []
pval = []
for col in cbb_features:
csq = chi2_contingency(pd.crosstab(cbb[col], cbb['TOURNEY']))
pval.append(csq[1])
if csq[1] > .05:
ind.append(col)
else:
dep.append(col)
print(str(col) + " p-value: " + str(csq[1]))
print("Independent Features: " + str(ind))
print("Dependent Features: " + str(dep))
ADJOE p-value: 3.4764572966789143e-81
ADJDE p-value: 6.743592839012311e-69
EFG_O p-value: 1.3221470306479153e-33
EFG_D p-value: 1.00238239688931e-38
TOR p-value: 2.785675155726798e-22
TORD p-value: 0.034092821524116176
ORB p-value: 4.9122732805854736e-05
DRB p-value: 0.006646091135500911
FTR p-value: 0.21717855012036538
FTRD p-value: 1.6709203644346735e-06
2P_O p-value: 5.875844364327951e-23
2P_D p-value: 2.826525380057135e-29
3P_O p-value: 2.5798046940770557e-13
3P_D p-value: 2.359711828614138e-15
ADJ_T p-value: 0.08200786766686213
Independent Features: ['FTR', 'ADJ_T']
Dependent Features: ['ADJOE', 'ADJDE', 'EFG_O', 'EFG_D', 'TOR', 'TORD', 'ORB', 'DRB', 'FTRD', '2P_O', '2P_D', '3P_O', '3P_D']
As we can see, free throw rate and adjusted tempo resulted in p-values above 0.05, so we cannot reject the null hypothesis for them; they appear independent of the tournament outcome. For all the other features we rejected the null hypothesis, so they are related to the tourney outcome. I decided to drop free throw rate and adjusted tempo after getting these results. For one, they weren't among the features we had seen as important or likely to have an impact. Additionally, when thinking about basketball, teams that play at a fast tempo aren't always the best: it may help to shoot faster and get more shots up, but if a team has a bad shooting night the other team can slow the game down and limit their possessions. Free throw rate is a feature you would think might help a team, but perhaps the impact just is not significant enough.
Creating and Analyzing Classification Models¶
Now I want to use classification models from sklearn to try and model the dataset to make predictions and learn more. There are two main goals I have. First I want to try and identify what features are the most important in the models in predicting what teams will make the tournament. Secondly I want to determine what classification model works the best on this dataset.
First I dropped the columns indicated as independent by our hypothesis testing. Then I split the current dataframe into two new objects: one holds all of the features and the other holds the labels, True or False for tourney qualification. The features are essentially the X values in the dataset and the labels are the y values, or results. We can see that both have 2,455 teams, with 13 features in the X set.
cbb = cbb.drop(columns=ind)
cbb_features = cbb.iloc[:,:-1]
cbb_label = cbb['TOURNEY']
X = cbb_features.values
y = cbb_label.values
print(X.shape)
print(y.shape)
(2455, 13) (2455,)
Now we are going to use train_test_split to split our data into a training set and a test set. I chose a split of 75% of the data for training and 25% for testing. This will let us train the model and then check its performance on the test data. I also set the stratify parameter to the labels of our data, which helps ensure an even distribution of the two labels in both sets, since there are quite a few more non-tourney teams than tourney teams.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, stratify=y, random_state=13)
We also want to normalize our data, which puts the features on a common scale. Using StandardScaler we standardize our X train and test features by removing the mean and scaling to unit variance, which helps make sure our machine learning algorithms behave correctly. I referred to Understanding Feature Importance and How to Implement it in Python for multiple sections throughout the machine learning portion of this project, as well as other resources that I will include as we continue.
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
# Transform the test set with the scaler fit on the training data so no test-set statistics leak into preprocessing
X_test = scaler.transform(X_test)
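As a quick sanity check (a small sketch, not in the original notebook), we can confirm that the scaled training features now have roughly zero mean and unit variance:
# Sketch: the scaled training features should have means near 0 and standard deviations near 1
print(X_train.mean(axis=0).round(2))
print(X_train.std(axis=0).round(2))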
Next we initialize our three models: Random Forest, Decision Tree and Support Vector Classifier (SVC), all for classification of course. We use a random state of 13 so we get the same randomness every run, which helps with comparisons. We also preset the kernel to linear for SVC, as it lets us get coefficients for the features, which we will discuss later. Beyond these inputs, the models start with default hyperparameters; hyperparameters are settings that control the learning process of a machine learning model and are not learned from the data itself.
forest = RandomForestClassifier(random_state=13)
dt = tree.DecisionTreeClassifier(random_state=13)
svc = svm.SVC(random_state=13, kernel='linear')
Here's a breakdown of each of our model types.
Random Forest:
A large number of individual decision trees that operate together. Each tree in the random forest comes up with a prediction and the prediction with the most votes is the one that wins.
Decision Tree:
A model that repeatedly splits the dataset on feature values, trying to separate the classes into groups as cleanly as possible.
SVC:
Tries to find a hyperplane in an N-dimensional space (N = number of features) that best separates the classes of data points.
Further reading can be found below.
SVM
Random Forests and Decision Trees
forest.fit(X_train, y_train)
dt.fit(X_train, y_train)
svc.fit(X_train, y_train)
SVC(kernel='linear', random_state=13)
After we fit each default-parameter model to the training data we can begin to work on the first goal of finding the important features. I decided to compare the feature importance of each model with default parameters and then with tuned hyperparameters to see if there was any difference. Below I graphed each of our initial models' feature importances on a horizontal bar graph.
The first graph shows the random forest feature importances, in blue. We can pull the feature importances directly from the model, which is very convenient. Adjusted offensive efficiency is easily the most impactful feature, with adjusted defensive efficiency next. The rest of the features have similar importance, with defensive and offensive effective field goal percentage third and fourth most important respectively.
forest_sort = forest.feature_importances_.argsort()
plt.barh(cbb_features.columns[forest_sort], forest.feature_importances_[forest_sort], color=['blue'])
plt.xlabel("Feature Importance")
plt.title("Feature Importance for Random Forest")
Text(0.5, 1.0, 'Feature Importance for Random Forest')
Next we can look at the decision tree's feature importances, shown in red. We can again pull them straight out of the model. This time ADJOE is a very strong winner, with ADJDE at about half the importance. The rest of the features are again around a similar value, with turnover percentage allowed and offensive rebounding rate coming in third and fourth.
dt_sort = dt.feature_importances_.argsort()
plt.barh(cbb_features.columns[dt_sort], dt.feature_importances_[dt_sort], color=['red'])
plt.xlabel("Feature Importance")
plt.title("Feature Importance for Decision Tree Model")
Text(0.5, 1.0, 'Feature Importance for Decision Tree Model')
Finally we have the SVC model, colored in green. The SVC model does not have built-in feature importances, but we can use the absolute value of each feature's coefficient and compare those instead. This gives a different value range than the random forest and decision tree, but the overall meaning is the same. Again ADJOE and ADJDE are the top two features, with effective field goal defense and turnover percentage committed in third and fourth. One thing to notice about this graph is that the values decline more evenly, whereas the other graphs had a sharp fall-off after the first two features, implying the SVC model relies on the top features less and the bottom features more.
svc_sort = abs(svc.coef_[0]).argsort()
plt.barh(cbb_features.columns[svc_sort], abs(svc.coef_[0])[svc_sort], color=['green'])
plt.xlabel("Model Coefficients")
plt.title("Model Coefficients for SVC Model")
Text(0.5, 1.0, 'Model Coefficients for SVC Model')
Now we can tune hyperparameters for our models and then run them on our testing data. For this section I followed a lot of the logic and advice found in Tuning the Hyperparameters of your Machine Learning Model using GridSearchCV. GridSearchCV is an sklearn utility that takes a grid of candidate hyperparameter values and tries every combination to find the best set for the model. It finds the best hyperparameters for our specific dataset and model, not just the model type overall. GridSearchCV uses k-fold cross validation while searching: the training data is split into k folds, and in each iteration one fold is held out for validation while the remaining k-1 folds are used for training. The metrics from each iteration are recorded and averaged at the end, giving each combination of hyperparameters a more reliable benchmark and resulting in a better fit for the model.
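To make the cross-validation idea concrete, here is a small sketch (not part of the original pipeline) that scores a single, arbitrarily chosen hyperparameter combination with 5-fold cross validation using cross_val_score. GridSearchCV essentially repeats this for every combination in the grid and keeps the one with the best average score.
from sklearn.model_selection import cross_val_score
# Sketch: 5-fold cross-validation score for one candidate random forest configuration
candidate = RandomForestClassifier(n_estimators=300, max_depth=7, random_state=13)
scores = cross_val_score(candidate, X_train, y_train, cv=5)
print(scores, scores.mean())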
I ended up using the GridSearchCV default of 5-fold cross validation (the training set split into 5 even segments). Below are the parameters for the random forest in a dictionary; for all the models I used the documentation as well as examples online to come up with the set of hyperparameters to search.
forest_params = {
'n_estimators' : [100, 300, 500],
'criterion' : ['gini', 'entropy'],
'max_depth' : [4, 5, 6, 7, 8],
'max_features' : ['auto', 'sqrt', 'log2']
}
Below we run GridSearchCV on the above parameters and then fit it to the training data. We can then retrieve the best parameters which I decided to print out below.
CV_forest = GridSearchCV(estimator=forest, param_grid=forest_params)
CV_forest.fit(X_train, y_train)
params = CV_forest.best_params_
print(params)
{'criterion': 'entropy', 'max_depth': 7, 'max_features': 'auto', 'n_estimators': 300}
Now we can feed these parameters straight into a new random forest model.
forest_model = RandomForestClassifier(criterion=params['criterion'], max_depth=params['max_depth'],
max_features=params['max_features'], n_estimators=params['n_estimators'], random_state=13)
Now we fit the model to the training data and use the trained model to predict the results for our test X set. Using the classification report from sklearn.metrics we get a statistical summary of how well the model performed. We are focusing on the precision, recall and f1-score for the False and True values. Precision measures the model's ability to not label a negative sample as positive; in our case, precision is the model's ability to not label a non-tourney team as a tourney team. Recall measures the model's ability to correctly label all positive samples; for us, recall is the model's ability to label all the tournament teams correctly. The f1-score is the harmonic mean of precision and recall and ranges from 0 (worst) to 1 (best). As we can see from the output below, the random forest scored the following for the True class:
- Precision: 0.81
- Out of all the teams the model predicted would make the tournament, 81% did.
- Recall: 0.66
- Out of all the teams that did make the tournament, the model predicted this correctly for 66% of those teams.
- F1-Score: 0.73
forest_model.fit(X_train, y_train)
forest_pred = forest_model.predict(X_test)
print(classification_report(y_test, forest_pred))
precision recall f1-score support
False 0.92 0.96 0.94 495
True 0.81 0.66 0.73 119
accuracy 0.90 614
macro avg 0.87 0.81 0.83 614
weighted avg 0.90 0.90 0.90 614
I also decided to make a confusion matrix. Confusion matrices are great for comparing the predicted category labels to the true labels. Confusion Matrix Visualization is a source I used for styling the confusion matrix to show the count and percentage of values in each category. The four categories are:
- True Negative
- False Positive
- False Negative
- True Positive
Ideally you would have all of your data from the model fall into true negative or true positive, meaning the predicted labels match the actual labels.
Using the predictions we can create the confusion matrix quite easily. Following that we can zip up the names, counts and percentages to format the labels for each box and plot them in a heatmap for the nice colorful output we see below.
forest_cm = confusion_matrix(y_test, forest_pred)
group_name = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
forest_cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
forest_cm.flatten()/np.sum(forest_cm)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_name, group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(forest_cm, annot=labels, fmt='', cmap='Blues')
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
Text(0.5, 66.5, 'Predicted Label')
We can see that 77.69% of our predictions were true negatives and 12.70% were true positives, which adds up to around 90% correct predictions. However, there are a lot more negatives than positives in the data. Of the teams that actually made the tournament, 41 were labelled as missing it, about 34% of the actual positives (matching the recall of 0.66). Additionally, 18 of the teams predicted to make the tournament did not actually qualify, about 19% of the predicted positives (matching the precision of 0.81). These numbers let us compare models; you can compare multiple versions of the same model or different models altogether, which is what we will be doing.
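To tie the confusion matrix back to the classification report, we can recompute precision and recall for the True (tourney) class directly from forest_cm. This is a short sketch; sklearn lays out a binary confusion matrix as [[TN, FP], [FN, TP]].
# Sketch: recover precision and recall for the tourney class from the confusion matrix
tn, fp, fn, tp = forest_cm.ravel()
print(f"precision = {tp / (tp + fp):.2f}, recall = {tp / (tp + fn):.2f}")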
Now we can plot the feature importances from our model with hyperparameters. I again used a horizontal bar chart and just doubled the columns for comparison.
ind = np.arange(len(cbb_features.columns))
width = .35
fig = plt.figure()
ax = fig.add_subplot(111)
rects1 = ax.barh(ind, forest.feature_importances_[forest_sort], width, color='blue', label='No Hyperparameters')
rects2 = ax.barh(ind + width, forest_model.feature_importances_[forest_sort], width, color='royalblue', label='Hyperparameters Tuned')
ax.set_yticks(ind + width / 2)
ax.set_yticklabels(cbb_features.columns[forest_sort])
plt.legend(loc='best')
plt.xlabel("Feature Importance")
plt.title("Comparing Feature Importance for Random Forest")
Text(0.5, 1.0, 'Comparing Feature Importance for Random Forest')
It seems our new model has placed more importance on ADJOE, ADJDE and EFG_D. Most of the other features had a reduced importance in our tuned model. The biggest jump by far is ADJOE which continues to line up with what we saw in the other models previously.
Now let's do the same process with the decision tree model and examine the output.
dt_params = {
'ccp_alpha' : [0.1, .01, .001],
'criterion' : ['gini', 'entropy'],
'max_depth' : [4, 5, 6, 7, 8],
'max_features' : ['auto', 'sqrt', 'log2']
}
CV_dt = GridSearchCV(estimator=dt, param_grid=dt_params)
CV_dt.fit(X_train, y_train)
params = CV_dt.best_params_
print(params)
{'ccp_alpha': 0.01, 'criterion': 'entropy', 'max_depth': 7, 'max_features': 'auto'}
Now we can put the parameters from GridSearchCV into our decision tree and run the model just as we did before.
dt_model = tree.DecisionTreeClassifier(criterion=params['criterion'], max_depth=params['max_depth'],
max_features=params['max_features'], ccp_alpha=params['ccp_alpha'], random_state=13)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
print(classification_report(y_test, dt_pred))
precision recall f1-score support
False 0.90 0.97 0.93 495
True 0.83 0.54 0.65 119
accuracy 0.89 614
macro avg 0.86 0.76 0.79 614
weighted avg 0.88 0.89 0.88 614
Let's examine the classification report again, this time for the decision tree model.
- Precision: 0.83
- Out of all the teams the model predicted would make the tournament, 83% did.
- Recall: 0.54
- Out of all the teams that did make the tournament, the model predicted this correctly for 54% of those teams.
- F1-Score: 0.65
Let's take a look at the confusion matrix for this data. We will compare the models with one another as we repeat this process for each.
dt_cm = confusion_matrix(y_test, dt_pred)
group_name = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
dt_cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
dt_cm.flatten()/np.sum(dt_cm)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_name, group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(dt_cm, annot=labels, fmt='', cmap='Reds')
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
Text(0.5, 66.5, 'Predicted Label')
The confusion matrix has similar results to the random forest. The true negative is quite similar. The false positives are also the same. The false negatives are 2% more and the true positives are 2% less. This basically means the decision tree model is labeling more of the actual true values as false. This lines up with the recall being less than our random forest model. Next lets look at the feature importances.
ind = np.arange(len(cbb_features.columns))
width = .35
fig = plt.figure()
ax = fig.add_subplot(111)
rects1 = ax.barh(ind, dt.feature_importances_[dt_sort], width, color='red', label='No Hyperparameters')
rects2 = ax.barh(ind + width, dt_model.feature_importances_[dt_sort], width, color='lightcoral', label='Hyperparameters Tuned')
ax.set_yticks(ind + width / 2)
ax.set_yticklabels(cbb_features.columns[dt_sort])
plt.legend(loc='best')
plt.xlabel("Feature Importance")
plt.title("Comparing Feature Importance for Decision Tree")
Text(0.5, 1.0, 'Comparing Feature Importance for Decision Tree')
This is quite an interesting result: the decision tree model with tuned hyperparameters dropped every feature except ADJOE and ADJDE. We haven't seen anything like this yet, although ADJOE and ADJDE have consistently been the most important features, so perhaps it isn't too surprising. It is interesting that the precision got better with this model while the recall got worse; it's possible that relying only on ADJOE and ADJDE led to this outcome. One way to examine this further is sketched below, and after that let's move on to SVC.
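As a quick sketch of that follow-up (not part of the original analysis), we could draw the pruned tree itself and confirm that it only ever splits on ADJOE and ADJDE; plot_tree lives in sklearn.tree, which we imported at the top.
# Sketch: visualize the tuned decision tree to see which features it actually splits on
plt.figure(figsize=(12, 8))
tree.plot_tree(dt_model, feature_names=cbb_features.columns.tolist(),
               class_names=['No Tourney', 'Tourney'], filled=True)
plt.show()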
svc_params = {
'C' : [0.1, 1, 10, 100],
'gamma' : ['scale', 'auto'],
}
CV_svc = GridSearchCV(estimator=svc, param_grid=svc_params)
CV_svc.fit(X_train, y_train)
params = CV_svc.best_params_
print(params)
{'C': 0.1, 'gamma': 'scale'}
We again create the model with the GridSearchCV hyperparameters.
svc_model = svm.SVC(C=params['C'],
gamma=params['gamma'], kernel='linear', random_state=13)
svc_model.fit(X_train, y_train)
svc_pred = svc_model.predict(X_test)
print(classification_report(y_test, svc_pred))
precision recall f1-score support
False 0.92 0.95 0.94 495
True 0.76 0.67 0.71 119
accuracy 0.90 614
macro avg 0.84 0.81 0.83 614
weighted avg 0.89 0.90 0.89 614
Let's examine the classification report again, this time for the SVC model.
- Precision: 0.76
- Out of all the teams the model predicted would make the tournament, 76% did.
- Recall: 0.67
- Out of all the teams that did make the tournament, the model predicted this correctly for 67% of those teams.
- F1-Score: 0.71
Let's take a look at the confusion matrix for this data.
svc_cm = confusion_matrix(y_test, svc_pred)
group_name = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
svc_cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
svc_cm.flatten()/np.sum(svc_cm)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_name, group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(svc_cm, annot=labels, fmt='', cmap='Greens')
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
Text(0.5, 66.5, 'Predicted Label')
We get some slightly different results from this confusion matrix. The true negative percentage is the lowest of the three confusion matrices, with the other three cells having slightly larger values. The conclusion could be drawn that this model gives up a little ability to predict true negatives and gains a small amount of ability to predict true positives.
Let's again look at the model coefficients, this time for the tuned SVC model.
ind = np.arange(len(cbb_features.columns))
width = .35
fig = plt.figure()
ax = fig.add_subplot(111)
rects1 = ax.barh(ind, abs(svc.coef_[0])[svc_sort], width, color='green', label='No Hyperparameters')
rects2 = ax.barh(ind + width, abs(svc_model.coef_[0])[svc_sort], width, color='springgreen', label='Hyperparameters Tuned')
ax.set_yticks(ind + width / 2)
ax.set_yticklabels(cbb_features.columns[svc_sort])
plt.legend(loc='best')
plt.xlabel("Model Coefficients")
plt.title("Comparing Model Coefficients for SVC")
Text(0.5, 1.0, 'Comparing Model Coefficients for SVC')
The coefficients for this model behave differently from the two previous models. This time the ADJOE coefficient decreases after tuning, as do ADJDE and EFG_D, while most of the other features increase in importance.
Analyzing Results¶
Let's first take a look at feature importance. I'm going to focus on the feature importances from the hyperparameter-tuned models, since tuning improved our models.
cols = cbb_features.columns.to_list()
f_vals= forest_model.feature_importances_
dt_vals = dt_model.feature_importances_
svc_vals = abs(svc_model.coef_[0])
df = pd.DataFrame(columns=cols)
f_vals = f_vals.tolist()
dt_vals = dt_vals.tolist()
svc_vals = svc_vals.tolist()
df.loc[len(df)] = f_vals
df.loc[len(df)] = dt_vals
df.loc[len(df)] = svc_vals
df
| | ADJOE | ADJDE | EFG_O | EFG_D | TOR | TORD | ORB | DRB | FTRD | 2P_O | 2P_D | 3P_O | 3P_D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.255142 | 0.181372 | 0.073021 | 0.095008 | 0.060202 | 0.030454 | 0.041029 | 0.024765 | 0.031431 | 0.066991 | 0.060501 | 0.039798 | 0.040286 |
| 1 | 0.703690 | 0.296310 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.699239 | 0.544134 | 0.183282 | 0.196365 | 0.287311 | 0.325153 | 0.264477 | 0.096713 | 0.114548 | 0.154782 | 0.122363 | 0.215492 | 0.058801 |
df = df.transpose()
ranks = df.rank(ascending=False)
ranks['avg'] = ranks.mean(numeric_only=True, axis=1)
ranks.sort_values(by=['avg'])
| | 0 | 1 | 2 | avg |
|---|---|---|---|---|
| ADJOE | 1.0 | 1.0 | 1.0 | 1.000000 |
| ADJDE | 2.0 | 2.0 | 2.0 | 2.000000 |
| EFG_D | 3.0 | 8.0 | 7.0 | 6.000000 |
| TOR | 7.0 | 8.0 | 4.0 | 6.333333 |
| EFG_O | 4.0 | 8.0 | 8.0 | 6.666667 |
| ORB | 8.0 | 8.0 | 5.0 | 7.000000 |
| 2P_O | 5.0 | 8.0 | 9.0 | 7.333333 |
| TORD | 12.0 | 8.0 | 3.0 | 7.666667 |
| 2P_D | 6.0 | 8.0 | 10.0 | 8.000000 |
| 3P_O | 10.0 | 8.0 | 6.0 | 8.000000 |
| FTRD | 11.0 | 8.0 | 11.0 | 10.000000 |
| 3P_D | 9.0 | 8.0 | 13.0 | 10.000000 |
| DRB | 13.0 | 8.0 | 12.0 | 11.000000 |
Above I created a new dataframe with the importances from each tuned model, ranked the features within each model, and then averaged the ranks across the three models, sorting by that average. From the dataframe above we can see the order of importance amongst the three tuned models is the following.
- ADJOE
- ADJDE
- EFG_D
- TOR
- EFG_O
- ORB
- 2P_O
- TORD
- 2P_D
- 3P_O
- FTRD
- 3P_D
- DRB
I am a little surprised by how low the three-point statistics are, mainly 3P_O, three-point field goal percentage. Most good teams today shoot well from three-point range, and it seems almost necessary to do so. Perhaps it is less important during the regular season, and overall offensive efficiency is a better indicator than shooting well from any one range. In general, seeing the efficiencies high on the list makes sense to me; they are advanced stats used constantly to determine how good a team is, so it makes sense they are good indicators in the model.
Next I want to look at which model was the best for our data. Below are the scores of each of the models from above.
Random Forest¶
- Precision: 0.81
- Out of all the teams the model predicted would make the tournament, 81% did.
- Recall: 0.66
- Out of all the teams that did make the tournament, the model predicted this correctly for 66% of those teams.
- F1-Score: 0.73
Decision Tree¶
- Precision: 0.83
- Out of all the teams the model predicted would make the tournament, 83% did.
- Recall: 0.54
- Out of all the teams that did make the tournament, the model predicted this correctly for 54% of those teams.
- F1-Score: 0.65
SVC¶
- Precision: 0.76
- Out of all the teams the model predicted would make the tournament, 76% did.
- Recall: 0.67
- Out of all the teams that did make the tournament, the model predicted this correctly for 67% of those teams.
- F1-Score: 0.71
Here's how they rank for each category.
Precision
- 0.83 Decision Tree
- 0.81 Random Forest
- 0.76 SVC
Recall
- 0.67 SVC
- 0.66 Random Forest
- 0.54 Decision Tree
F1 Score
- 0.73 Random Forest
- 0.71 SVC
- 0.65 Decision Tree
Random Forest seems to be the best model, narrowly beating out SVC, with the decision tree not too far behind them. All of the models performed decently but none really excelled. This could be due to not having enough features or data: we only have seven seasons of data, and only 64 teams (technically 68 with the play-in games) out of around 351 make the tournament each year. That leaves a lot of non-tourney teams and relatively few tourney teams for the models to learn from.
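One thing we could try to compensate for that imbalance (a sketch only, not something done above) is to re-weight the classes. sklearn's random forest accepts a class_weight parameter, so a balanced version of our tuned forest might look like the following; the exact scores would likely shift a bit from the report above.
# Sketch: random forest with balanced class weights to offset the roughly 4:1 class imbalance
balanced_forest = RandomForestClassifier(criterion='entropy', max_depth=7, n_estimators=300,
                                         class_weight='balanced', random_state=13)
balanced_forest.fit(X_train, y_train)
print(classification_report(y_test, balanced_forest.predict(X_test)))
Whether this actually improves recall for the tourney class is something we would have to verify, since balanced weights often trade some precision for it.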
Now we can use our top performing model, random forest, on this current 2022-2023 NCAA men's D1 basketball season to predict the teams that are going to make March Madness!
Applying Model to Current Season¶
First we are going to scrape https://barttorvik.com/trank.php# for the current NCAA basketball data (12/16/2022 7:30 PM).
First I downloaded the HTML inside the table tag from the website and saved it into an HTML file. Next I read the HTML in with BeautifulSoup and searched for the table and row tags to begin pulling in the data.
The next step was to go through each row and find all of its td tags. From those td tags we indexed the data we needed and put it into a data list, which we then appended to a list of teams. Finally we made a dataframe with our columns and added each team's row to it, giving us the top 200 teams (at the time of the download) in our new table.
from bs4 import BeautifulSoup
with open('cbb121622.html', 'r') as f:
page = f.read()
soup = BeautifulSoup(page, 'html.parser')
table = soup.find("table")
rows = soup.findAll("tr")
teams = []
for r in rows[2:]:
cols = r.findAll("td")
try:
if cols[1].text != "":
team_name = cols[1].get("id")
data_list = [team_name, float(cols[5].text), float(cols[6].text), float(cols[7].text), float(cols[8].text[:2]),
float(cols[9].text[:2]), float(cols[10].text[:2]), float(cols[11].text[:2]),
float(cols[12].text[:2]), float(cols[13].text[:2]), float(cols[14].text[:2]),
float(cols[15].text[:2]), float(cols[16].text[:2]), float(cols[17].text[:2]),
float(cols[18].text[:2]), float(cols[19].text[:2]), float(cols[20].text[:2]),
float(cols[21].text)]
teams.append(data_list)
except:
pass
col_names = ['TEAM',
'ADJOE',
'ADJDE',
'BARTHAG',
'EFG_O',
'EFG_D',
'TOR',
'TORD',
'ORB',
'DRB',
'FTR',
'FTRD',
'2P_O',
'2P_D',
'3P_O',
'3P_D',
'ADJ_T',
'WAB']
# Build the dataframe directly from the scraped rows (DataFrame.append is deprecated in newer pandas)
df = pd.DataFrame(teams, columns=col_names)
Now we will modify the table for input to our model. We drop the TEAM column, saving it for later, along with the other features that we did not include in the previous models. Calling head below shows us that we now have the 13 features we need.
teams = df['TEAM']
cbb_now = df.drop(columns=['BARTHAG', 'ADJ_T', 'WAB', 'FTR', 'TEAM'])
cbb_now.head()
| | ADJOE | ADJDE | EFG_O | EFG_D | TOR | TORD | ORB | DRB | FTRD | 2P_O | 2P_D | 3P_O | 3P_D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 112.321 | 81.41 | 51.0 | 37.0 | 16.0 | 25.0 | 38.0 | 29.0 | 32.0 | 53.0 | 38.0 | 32.0 | 23.0 |
| 1 | 107.371 | 81.72 | 47.0 | 36.0 | 20.0 | 27.0 | 40.0 | 28.0 | 31.0 | 46.0 | 41.0 | 32.0 | 20.0 |
| 2 | 118.42 | 90.416 | 55.0 | 49.0 | 13.0 | 25.0 | 31.0 | 26.0 | 25.0 | 55.0 | 51.0 | 36.0 | 31.0 |
| 3 | 115.77 | 89.51 | 57.0 | 41.0 | 18.0 | 23.0 | 36.0 | 22.0 | 41.0 | 58.0 | 41.0 | 37.0 | 27.0 |
| 4 | 118.23 | 92.935 | 52.0 | 44.0 | 15.0 | 14.0 | 37.0 | 24.0 | 16.0 | 54.0 | 46.0 | 33.0 | 27.0 |
Now we can predict labels. First we scale the new data, then we call our top performing model, the tuned random forest, on the current season's data.
# Scale the current-season features with the scaler that was fit on the training data
x_predict = scaler.transform(cbb_now)
forest_predict = forest_model.predict(x_predict)
df['POSTSEASON'] = forest_predict
df2 = df.drop(columns=['ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D', 'TOR', 'TORD',
'ORB', 'DRB', 'FTR', 'FTRD', '2P_O', '2P_D', '3P_O', '3P_D', 'ADJ_T', 'WAB'])
march_madness = df2[df2["POSTSEASON"] == True]
After predicting the outcomes we can drop all the columns that aren't needed for output and display all the teams predicted to make March Madness.
print("Teams that will make the March Madness Tournament:")
march_madness.head(55)
Teams that will make the March Madness Tournament:
| | TEAM | POSTSEASON |
|---|---|---|
| 0 | Houston | True |
| 1 | Tennessee | True |
| 2 | UCLA | True |
| 3 | Connecticut | True |
| 4 | Purdue | True |
| 5 | Saint_Mary_s | True |
| 6 | Arizona | True |
| 7 | Kansas | True |
| 8 | West_Virginia | True |
| 9 | Duke | True |
| 10 | Texas | True |
| 11 | Alabama | True |
| 12 | Gonzaga | True |
| 13 | Memphis | True |
| 14 | Kentucky | True |
| 15 | Baylor | True |
| 16 | Indiana | True |
| 17 | Virginia_Tech | True |
| 18 | Virginia | True |
| 19 | Iowa | True |
| 20 | Marquette | True |
| 21 | Illinois | True |
| 22 | San_Diego_St_ | True |
| 23 | Xavier | True |
| 24 | Arkansas | True |
| 25 | Auburn | True |
| 26 | Ohio_St_ | True |
| 27 | North_Carolina | True |
| 28 | Creighton | True |
| 29 | Maryland | True |
| 30 | Utah_St_ | True |
| 31 | Rutgers | True |
| 32 | North_Carolina_St_ | True |
| 34 | Oklahoma_St_ | True |
| 35 | Mississippi_St_ | True |
| 38 | Florida | True |
| 39 | Penn_St_ | True |
| 40 | Utah | True |
| 41 | Texas_Tech | True |
| 42 | Oregon | True |
| 43 | Oklahoma | True |
| 44 | Kent_St_ | True |
| 45 | Kansas_St_ | True |
| 46 | Florida_Atlantic | True |
| 47 | UAB | True |
| 48 | Boise_St_ | True |
| 51 | Miami_FL | True |
| 52 | Butler | True |
| 54 | Iona | True |
| 55 | TCU | True |
| 56 | Colorado | True |
| 57 | New_Mexico | True |
| 61 | Texas_A_M | True |
| 62 | Michigan | True |
| 65 | Missouri | True |
Our model predicted 55 teams to make the March Madness tournament this year. Obviously this data is only from today, in December, and with three months to go before the tournament a lot can change, so I wouldn't expect the list to stay the same. Another note is that 32 of the spots go to conference champions, which gives mid-major (smaller) schools a chance to play; the remaining 36 at-large spots (some of which are decided through play-in games) go to the highest ranked remaining teams from any conference. This model does not take that into account, which is why fewer smaller schools show up in its selections. The automatic bids also create some randomness: if a weaker team wins its conference, it may make the tournament when it was not expected to.
Adjusting for these situations to better mirror the selection process is one improvement we could make. In the future we could also have the current data update live, giving real-time predictions. There are many ways this could be expanded upon, and I might just have to explore the possibilities.
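As a rough sketch of how the at-large portion of that adjustment could work (one possible approach, not something implemented above), we could rank teams by the model's predicted probability of making the tournament and keep only the top 36 as at-large picks, leaving the remaining spots for conference champions.
# Sketch: rank teams by predicted tournament probability and keep the top 36 as at-large selections
proba = forest_model.predict_proba(x_predict)
true_col = list(forest_model.classes_).index(True)  # column holding the probability of making the tournament
at_large = (pd.DataFrame({'TEAM': teams.values, 'TOURNEY_PROB': proba[:, true_col]})
              .sort_values('TOURNEY_PROB', ascending=False)
              .head(36))
print(at_large)
Combined with automatic bids for the 32 conference tournament champions, a ranking like this would bring the predicted field closer in size and shape to the real 68-team bracket.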