
Project 3: 2015-2018 MLB Linear Regression

Introduction to the Problem and the Dataset

In this project, I am reusing the dataset from Projects 1 and 2: MLB statistics for the 2015-2018 seasons. For this project I use the games file. I plan to use linear regression to predict attendance at MLB games from other attributes in the dataset, such as the away team's final score, the home team's final score, the elapsed time, and, for delayed games, the length of the delay.

What is Regression and How Does It Work?

Regression is a statistical technique that relates a dependent variable to one or more independent variables. It is used to uncover associations between variables observed in data. Linear regression predicts the value of an unknown variable from another related, known variable. In the simplest case it fits a straight line through the data for two variables, x and y: the independent variable x is plotted along the horizontal axis and the dependent variable y along the vertical axis. Linear regression relies on four assumptions: a linear relationship, independence of the residuals, normality of the residuals, and homoscedasticity (constant variance of the residuals). The data may need to be transformed to meet them.

image.png
image.png
image.png
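As a minimal sketch of the line-fitting idea described above, the following computes the least-squares slope and intercept for one predictor in plain Python. The function name fit_line and the sample points are made up for illustration and are not taken from the MLB data.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels out)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # prediction for a new x value
print(slope, intercept, predicted)
```

With several predictors, as in this project, the same idea extends to fitting one coefficient per feature.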

Experiment 1

Data Understanding

I started by loading my dataset and looking at all of the entries and columns. I used games_df.info() to learn more about the data type of each feature. Then, to see the variation within the dataset, I used games_df.describe().

 
Screenshot 2023-10-19 173514.png
Screenshot 2023-10-19 174000.png
Screenshot 2023-10-19 174045.png
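The inspection step can be sketched roughly as follows. The small DataFrame is a hypothetical stand-in for the games file: its column names and values are invented, and only info() and describe() come from the text above.

```python
import pandas as pd

# Hypothetical stand-in for the games file; names and values are invented.
games_df = pd.DataFrame({
    "attendance": [35000, 41000, 28000, 45000],
    "away_final_score": [3, 5, 2, 7],
    "home_final_score": [4, 1, 6, 2],
    "elapsed_time": [175, 190, 160, 205],
    "delay": [0, 15, 0, 0],
})

games_df.info()                # prints each column's dtype and non-null count
summary = games_df.describe()  # count, mean, std, min, quartiles, max per column
print(summary)
```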

Pre-Processing

The goal of the pre-processing phase is to clean the data so that it can answer the questions I've posed, and to remove attributes that would otherwise clutter the dataset. Here I dropped all the columns that didn't relate to the questions I wanted to answer.

Screenshot 2023-10-19 174456.png
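Dropping unneeded columns might look like the sketch below. The columns venue_name and wind are hypothetical examples of attributes to remove; the actual columns dropped in this project are the ones shown in the screenshot.

```python
import pandas as pd

# Hypothetical miniature of the games table; names and values are invented.
games_df = pd.DataFrame({
    "attendance": [35000, 41000],
    "elapsed_time": [175, 190],
    "venue_name": ["Park A", "Park B"],  # assumed not needed for the questions
    "wind": ["5 mph", "10 mph"],         # assumed not needed for the questions
})

columns_to_drop = ["venue_name", "wind"]
games_df = games_df.drop(columns=columns_to_drop)

remaining = list(games_df.columns)
print(remaining)
```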

I created a heatmap to show each feature's importance.

image.png
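One common way to build such a heatmap is from the pairwise correlation matrix, which is what this sketch assumes; the data is invented. The plotting call is left as a comment so the example runs without a plotting library.

```python
import pandas as pd

# Hypothetical numeric columns; names and values are invented for illustration.
games_df = pd.DataFrame({
    "attendance": [35000, 41000, 28000, 45000, 39000],
    "away_final_score": [3, 5, 2, 7, 4],
    "home_final_score": [4, 1, 6, 2, 3],
    "elapsed_time": [175, 190, 160, 205, 182],
    "delay": [0, 15, 0, 0, 30],
})

# Pearson correlation between every pair of columns
corr_matrix = games_df.corr()

# How strongly each feature correlates with attendance, strongest first
attendance_corr = (
    corr_matrix["attendance"].drop("attendance").sort_values(ascending=False)
)
print(attendance_corr)

# With seaborn installed, the matrix could be drawn as a heatmap:
# import seaborn as sns
# sns.heatmap(corr_matrix, annot=True)
```

Correlation only measures linear association with attendance, so it is a rough guide to which features might matter.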

Modeling

For my first experiment, I created a linear regression model that uses every variable in my dataset except attendance as a predictor. The plot shows that most points line up with the red line, but there are some outliers where the predicted attendance is higher than the actual attendance.

Screenshot 2023-10-19 175301.png
image.png
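A minimal sketch of this modeling step, assuming scikit-learn and a held-out test split. The data here is synthetic, and the split size, coefficients, and noise level are invented, so the numbers it produces say nothing about the real MLB results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_games = 500

# Synthetic stand-ins for away score, home score, elapsed time, and delay
X = np.column_stack([
    rng.integers(0, 12, n_games),
    rng.integers(0, 12, n_games),
    rng.normal(185, 20, n_games),
    rng.exponential(5, n_games),
])
# Invented relationship between the features and attendance, plus noise
y = 30000 + 300 * X[:, 1] + 40 * X[:, 2] + rng.normal(0, 3000, n_games)

# Hold out 20% of the games to check predictions on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(model.coef_, model.intercept_)
```

Plotting y_test against y_pred with a diagonal reference line gives the kind of actual-versus-predicted chart described above.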

Evaluation

For the evaluation in Experiment 1, I assessed the accuracy of the linear regression model I created. The score was 54.83%, which I consider good because it is above 50%. I also calculated the MSRE.

Screenshot 2023-10-19 180411.png
image.png
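To make the evaluation numbers concrete, here is a sketch of two common regression metrics in plain Python. It assumes the score is the coefficient of determination (R squared), which is what scikit-learn regressors report from score(), and that the error metric is a root mean squared error. Both are assumptions, and the sample values are invented.

```python
import math


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


# Invented attendance figures and predictions
actual = [30000, 35000, 40000, 45000]
predicted = [32000, 34000, 41000, 43000]

r2 = r_squared(actual, predicted)
error = rmse(actual, predicted)
print(r2, error)
```

Under this reading, a score of 54.83% would mean the model explains a little over half of the variation in attendance.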

Experiment 2

For experiment two, I decided to use the random forest regressor (RandomForestRegressor). I set the random state to 42. It produced an accuracy of 50.77%, about 4 percentage points lower than the linear regression model, so it did not perform as well.

image.png
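A rough sketch of this experiment, assuming scikit-learn's RandomForestRegressor with its default settings apart from random_state=42. The data is synthetic and the split is an assumption, so the resulting score is not comparable to the 50.77% reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features and target; the relationship is invented for illustration
X = rng.normal(size=(400, 4))
y = 35000 + 2000 * X[:, 0] - 1500 * X[:, 1] + rng.normal(0, 1000, 400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fixing random_state makes the forest reproducible from run to run
forest = RandomForestRegressor(random_state=42)
forest.fit(X_train, y_train)

forest_pred = forest.predict(X_test)
forest_score = forest.score(X_test, y_test)  # R squared on the held-out games
print(forest_score)
```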

Experiment 3

For experiment three, I decided to use the K-nearest neighbors regression model. I set n_neighbors to 200. It produced an abysmal score of 1.5%. So why was the score so low? K-nearest neighbors regression does not work well with high-dimensional inputs, and the data we have is high-dimensional.

image.png
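As a side experiment on synthetic data, the sketch below compares a few values of n_neighbors using scikit-learn's KNeighborsRegressor. It does not reproduce the run above or explain its score. It only illustrates that the setting matters, because averaging over many neighbors pulls each prediction toward the overall mean. The data, split, and candidate values of k are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic one-feature problem with a clear linear trend plus noise
X = rng.uniform(0, 10, size=(400, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, 400)

# 300 training rows, so n_neighbors=200 is still a valid setting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scores = {}
for k in (5, 50, 200):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)  # R squared on held-out data

for k, score in scores.items():
    print(k, round(score, 3))
```

Trying several values of n_neighbors in this way would be one low-cost check before concluding that K-nearest neighbors does not suit the dataset.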

Impact

Based on my results, it's difficult to gauge how much of an impact these attendance predictions could have. If I had to speculate, the positive would be that predicted attendance could help teams estimate their general income, and the negative would be that teams could rely too heavily on the predictions and fall short when actual attendance differs.

Conclusion

I've learned a lot from this project about how to navigate these regression algorithms. One of the most surprising things I learned was just how differently all of these regression algorithms work. I learned that not every regression algorithm is right for every problem, because I saw that the KNN algorithm did not mesh well with my dataset, while linear regression and the random forest regressor were a better fit.
