Introduction

As in the previous project, I use the data set from the 2015-2018 MLB seasons. In this data visualization and classification project, I plan to predict which games are the most important, based on the home team's score in games the home team won.

The problems I want to solve are:

How important is the home team's score in games the home team won?

What is the relationship between how long a game took and the home team's final score?

What is the distribution of elapsed time across games?

Introducing the Data

Project 2: 2015-2018 MLB Pitching Classification Dataset

The dataset used for this project is from Kaggle and is taken directly from the MLB game database. It contains 17 variables, mainly the names and final scores of the home and away teams. Each row is one game, and the dataset holds thousands of them. The column names are attendance, away_final_score, away_team, date, elapsed_time, g_id, home_final_score, home_team, start_time, umpire_1B, umpire_2B, umpire_3B, umpire_HP, venue_name, weather, wind, delay.

Pre-Processing the Data

The goal of the pre-processing phase is to clean the data so that it can answer the questions I've posed, and to remove attributes that would otherwise clutter the dataset.

First, I dropped all the columns that were not relevant to the questions I wanted to answer.

Next, I added a new column called "Home Team Won?", which is true if the home team won and false if the home team lost.
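
The two pre-processing steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: it assumes pandas, a tiny made-up sample in place of the real file, and a guess at which columns were kept (the KEEP list is hypothetical). It also assumes a home win means home_final_score is greater than away_final_score.

```python
import pandas as pd

# Tiny made-up sample using a few of the dataset's column names.
# The values are illustrative only, not real games.
games = pd.DataFrame({
    "g_id": [1, 2, 3],
    "venue_name": ["Park A", "Park B", "Park C"],
    "home_final_score": [5, 2, 0],
    "away_final_score": [3, 4, 1],
    "elapsed_time": [185, 172, 201],
})

# Assumption: these are the columns that line up with the three questions.
KEEP = ["home_final_score", "away_final_score", "elapsed_time"]


def preprocess(df):
    """Drop unneeded columns and add the "Home Team Won?" flag."""
    to_drop = [col for col in df.columns if col not in KEEP]
    out = df.drop(columns=to_drop)
    # Assumption: the home team won when it scored more runs than the away team.
    out["Home Team Won?"] = out["home_final_score"] > out["away_final_score"]
    return out


clean = preprocess(games)
print(clean)
```

Deriving the flag from the two score columns keeps it consistent with the data, so it never has to be entered by hand.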

Visualization and Analysis

To address the first, second, and third questions, I created three plots: a bar plot that shows the importance of each score for home teams that won the game, a scatterplot that shows the home teams' scores against how long each game took, and a histogram that shows the distribution of elapsed time across games. My initial difficulty was deciding which questions to ask and working out how they could relate to strikeouts and hits.
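
As a rough illustration of the scatterplot and histogram described above, here is a minimal sketch. It assumes matplotlib, uses made-up values, and treats elapsed_time as minutes (an assumption). The function name plot_games is hypothetical. The bar plot is left out because the text does not say how the importance of each score was computed.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up values, for illustration only.
home_final_score = [5, 2, 0, 7, 3, 4]
elapsed_time = [185, 172, 201, 210, 165, 190]  # assumed to be minutes


def plot_games(scores, minutes):
    """Draw score vs. elapsed time (scatter) and the elapsed-time histogram."""
    fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

    ax_scatter.scatter(minutes, scores)
    ax_scatter.set_xlabel("elapsed_time")
    ax_scatter.set_ylabel("home_final_score")
    ax_scatter.set_title("Home score vs. game length")

    # hist returns the per-bin counts, the bin edges, and the drawn patches.
    counts, _bin_edges, _patches = ax_hist.hist(minutes, bins=5)
    ax_hist.set_xlabel("elapsed_time")
    ax_hist.set_ylabel("number of games")
    ax_hist.set_title("Distribution of elapsed time")

    fig.tight_layout()
    return fig, counts


fig, counts = plot_games(home_final_score, elapsed_time)
```

Calling fig.savefig("games_overview.png") afterwards would write the figure to disk.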

Modeling the Data

I used the Random Forest classifier to solve my problem. I chose this model because it combines many decision tree classifiers, which makes it more effective than a single tree. Random forest is a supervised learning algorithm that uses ensemble learning: it combines many models so that the combined model is stronger than any one of them. It works by drawing random samples from the training data, building a decision tree on each sample, collecting a prediction from every tree, and choosing the class that receives the most votes. The pros are that it reduces overfitting compared with a single decision tree and often improves accuracy, and it works with both categorical and continuous values. The con is that a large number of trees can make the algorithm too slow for real-time predictions. I also trained a Decision Tree classifier alongside the Random Forest classifier, and its results were similar.
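
The comparison between the two classifiers can be sketched as below. This is an illustration under stated assumptions, not the project's code: it assumes scikit-learn, and it uses synthetic data from make_classification in place of the MLB features and target, which are not reproduced here. The 20% test split matches the one used in this project; n_estimators and random_state are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 1000 samples, 3 numeric features,
# and a binary label. It does not represent the MLB dataset.
X, y = make_classification(
    n_samples=1000,
    n_features=3,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# Hold out 20% of the samples as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    # Each tree is fit on a bootstrap sample; the forest combines their votes.
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {scores[name]:.2f}")
```

Fixing random_state makes the split and the trees reproducible, so the two accuracy values can be compared across runs.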

Evaluation

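The models perform quite well: the Random Forest classifier reaches 74% accuracy and the Decision Tree classifier reaches 75%. Alongside accuracy, I report precision, recall, and f1-score, together with the support (the number of true samples in each class). I used them because they show how often the model produces false positives and false negatives, which accuracy alone does not reveal.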
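
To make these metrics concrete, here is a small worked sketch in plain Python. The labels are made up and the function name classification_metrics is hypothetical. Precision is TP / (TP + FP), recall is TP / (TP + FN), f1 is the harmonic mean of the two, and support is the number of true samples of the class.

```python
def classification_metrics(y_true, y_pred, positive=True):
    """Compute precision, recall, f1 and support for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    support = sum(1 for t in y_true if t == positive)
    return {"precision": precision, "recall": recall, "f1": f1, "support": support}


# Made-up labels, for illustration only.
y_true = [True, True, True, False, False, True, False, True]
y_pred = [True, False, True, True, False, True, False, True]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example there are 4 true positives, 1 false positive, and 1 false negative, so precision and recall are both 4 / 5.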

Storytelling

The four most important characteristics are "Home Team Won?", "away_final_score", "elapsed_time", and "home_final_score". We see that the home team's final score matters most when it is low, in the range 0-3. With 20% of the data held out as a test sample, the Random Forest classifier reached 74% accuracy and the Decision Tree classifier reached 75%. After this analysis, I was able to answer the initial questions I had posed for the dataset.

Impact

This project could be helpful for anyone who wants to know the importance of certain scores in baseball games, and the approach could extend beyond baseball to other sports. Seeing how game scores relate to elapsed time could help people decide when to leave the stadium to beat traffic. It could be harmful if people rely on these results and the outcome is the opposite of what they expected, because baseball involves chance and nothing is ever guaranteed.

References

https://www.kaggle.com/datasets/pschale/mlb-pitch-data-20152018

Code

Project 2