Beren Brande
Project 1
Analyzing 2015-2018 MLB Pitching Statistics
Introduction
When I was a kid, my parents took me to all kinds of sporting events, but until a couple of years ago I never showed much interest in sports. Now I follow sports closely, and baseball has become a favorite. If you are not familiar with how baseball works, here is a short explanation.
Two teams face off over nine innings. Each inning has two halves: the visiting team bats in the top half, and once the pitching team has recorded three outs, the home team bats in the bottom half. This repeats through the ninth inning. If the home team is already ahead after the top of the ninth, the game ends there, because the home team has no reason to bat again. If the score is tied after nine innings, the game continues into extra innings until one team leads after both teams have had their turn to bat.
During an at-bat, the pitcher's goal is to get three strikes on the batter, which counts as one out for the inning. A pitch that lands outside the strike zone (the zone is set for each player depending on their height) and is not swung at is called a ball. If the pitcher throws four balls, the batter automatically advances to first base, and a runner already on first base advances to the next base.
In baseball, you can learn a lot just by looking at the statistics. For example, a player's batting average shows how often their past at-bats ended in a hit, which tells you whether they hit well or struggle at the plate. A good rule of thumb is that a batting average over 0.250 is good and over 0.300 is very good, because a 0.300 average means the player got a hit in 30% of their at-bats. The questions I want to answer are:
Out of all the pitches, which ones resulted in a strikeout?
Out of all the pitches, which ones resulted in a hit?
Which pitch number of the at-bat was most likely to result in a strikeout?
Which pitch number of the at-bat was most likely to result in a hit?
What kind of event was most likely to happen during any given inning?
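The rule of thumb above (0.250 is good, 0.300 is very good) comes down to dividing hits by at-bats. Here is a minimal sketch of that arithmetic; the function name and the season numbers are made up for illustration and are not taken from the dataset.

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Return hits divided by at-bats, rounded to three decimals."""
    if at_bats == 0:
        # A player with no at-bats has no meaningful average; use 0.0 here.
        return 0.0
    return round(hits / at_bats, 3)


# Hypothetical season line: 150 hits in 500 at-bats.
avg = batting_average(150, 500)
print(f"{avg:.3f}")  # 0.300
```

By this rule of thumb, the hypothetical player above would count as a very good hitter.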
Introducing the Data
The dataset used for this project comes from Kaggle and is taken directly from the MLB pitch database. Its files contain between 11 and 40 variables each, mainly covering the pitch speed, the coordinates where the pitch ended up, and the type of pitch thrown on that play. Together the files hold millions of recorded values. The data covers the years 2015 to 2018, so my analysis is limited to those few seasons.
Pre-Processing the Data


The dataset required some preprocessing: I trimmed the columns down to only the ones I need so the files are much smaller to work with. The data I mainly use comes from two files, atbats.csv and pitches.csv. Whenever I need to use both of them together, I can combine them with a merge, because each file has an identifier that links the two together. I also created two new datasets: pitches_strikes (which contains only the pitches that resulted in strikeouts) and pitches_hit (which contains only the pitches that resulted in hits).
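The preprocessing steps described above can be sketched in plain Python on a few toy rows. This is only an illustration under stated assumptions: the column names (ab_id, event, pitch_num, start_speed) and the event labels are guesses at what the files contain, and the rows are invented rather than read from atbats.csv or pitches.csv.

```python
# Toy stand-ins for atbats.csv and pitches.csv. The column names
# (ab_id, event, pitch_num, start_speed) and event labels are assumptions.
atbats = [
    {"ab_id": 1, "event": "Strikeout"},
    {"ab_id": 2, "event": "Single"},
    {"ab_id": 3, "event": "Groundout"},
]
pitches = [
    {"ab_id": 1, "pitch_num": 1, "start_speed": 92.1},
    {"ab_id": 1, "pitch_num": 2, "start_speed": 84.5},
    {"ab_id": 1, "pitch_num": 3, "start_speed": 93.0},
    {"ab_id": 2, "pitch_num": 1, "start_speed": 90.2},
    {"ab_id": 2, "pitch_num": 2, "start_speed": 88.7},
    {"ab_id": 3, "pitch_num": 1, "start_speed": 91.4},
]

# Step 1: keep only the columns needed for the analysis.
KEEP = ("ab_id", "pitch_num")
trimmed = [{key: row[key] for key in KEEP} for row in pitches]

# Step 2: merge on the shared identifier (an inner join on ab_id).
atbat_by_id = {row["ab_id"]: row for row in atbats}
merged = [
    {**pitch, "event": atbat_by_id[pitch["ab_id"]]["event"]}
    for pitch in trimmed
    if pitch["ab_id"] in atbat_by_id
]

# Step 3: split out the pitches from strikeout and hit at-bats.
HIT_EVENTS = {"Single", "Double", "Triple", "Home Run"}  # assumed labels
pitches_strikes = [row for row in merged if row["event"] == "Strikeout"]
pitches_hit = [row for row in merged if row["event"] in HIT_EVENTS]

print(len(merged), len(pitches_strikes), len(pitches_hit))  # 6 3 2
```

Note that filtering on the at-bat's event keeps every pitch of that at-bat, not only the final one, so a later step may still need to pick out the pitch that actually ended the at-bat.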
Visualization and Analysis
To address the third and fourth questions, I created a box plot showing how far into an at-bat a hit or a strikeout was most likely to happen. My initial challenges were deciding which questions to build the project around and figuring out how they relate to strikeouts and hits.
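A box plot is drawn from five numbers: the minimum, the first quartile, the median, the third quartile, and the maximum. The sketch below computes them with Python's statistics module for two small lists of pitch numbers. The lists are invented for illustration and are not the project's results, and five_number_summary is a hypothetical helper name.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers behind a box plot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical pitch numbers on which at-bats ended (made-up toy data).
hit_pitch_nums = [1, 2, 2, 3, 3, 4, 4, 5, 6]
strikeout_pitch_nums = [3, 4, 4, 5, 5, 5, 6, 6, 7]

print("hits:      ", five_number_summary(hit_pitch_nums))
print("strikeouts:", five_number_summary(strikeout_pitch_nums))
```

The box spans the first to third quartile, so comparing the two summaries side by side shows which outcome tends to come later in an at-bat.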


Storytelling


The data suggests that during an at-bat, a batter typically gets a hit around the fourth pitch and strikes out around the fifth pitch. A strikeout cannot happen before the third pitch, because it takes three strikes to strike out, whereas a hit can come on any pitch of the at-bat. There are some outliers for at-bats that last 10 or more pitches, but apart from those, hits generally fall within 2-5 pitches and strikeouts within 4-6 pitches.
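One common way to decide which at-bats count as outliers is the box-plot convention of flagging anything more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to a made-up list of pitch numbers; whether the plot in this project used the same rule is an assumption, and find_outliers is a hypothetical helper name.

```python
from statistics import quantiles


def find_outliers(values, whisker=1.5):
    """Return the values lying beyond whisker * IQR from the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical strikeout pitch numbers, including one very long at-bat.
strikeout_pitch_nums = [3, 4, 4, 5, 5, 5, 6, 6, 7, 12]

print(find_outliers(strikeout_pitch_nums))  # [12]
```

In this toy list only the 12-pitch at-bat is flagged, which matches the idea that very long at-bats sit outside the usual range.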
Impact
