One of my personal projects is about League of Legends. Everything is on my GitHub. I love video games, not League of Legends specifically, but video games in general. Kaggle hosts a huge dataset of League of Legends match data, so I decided to download it and see if I could find a way to predict match outcomes.
The first thing I did was check for missing values. Fortunately, there weren't any, which makes everything much easier. Next, I checked for columns with near-zero variance; these provide almost no useful information for modeling. There turned out to be many such columns, so I removed them.
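The post doesn't show the filtering code (the workflow reads like R's caret package and its nearZeroVar function), but the idea translates directly. Here's a minimal Python sketch of the same rule, assuming the data sits in a pandas DataFrame; the cutoff values mirror caret's defaults and are an assumption, not the post's actual settings:

```python
import pandas as pd

def drop_near_zero_variance(df, freq_ratio_cutoff=19.0, unique_pct_cutoff=10.0):
    """Drop near-zero-variance columns (caret-style heuristic, assumed here).

    A column is flagged when the ratio of its most frequent value to its
    second most frequent exceeds freq_ratio_cutoff AND the percentage of
    unique values is below unique_pct_cutoff. Constant columns always go.
    """
    keep = []
    for col in df.columns:
        counts = df[col].value_counts()
        if len(counts) < 2:
            continue  # constant column: no information, drop it
        freq_ratio = counts.iloc[0] / counts.iloc[1]
        unique_pct = 100.0 * df[col].nunique() / len(df)
        if freq_ratio > freq_ratio_cutoff and unique_pct < unique_pct_cutoff:
            continue  # dominated by one value: near-zero variance, drop it
        keep.append(col)
    return df[keep]
```

Running this on the raw match table would leave only the columns that actually vary from game to game.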
That got the dataset down to about 19 columns. The next task was to cut the data down to a much smaller sample: the full dataset has about 1,000,000 observations, which is too many for my laptop, so I sampled it down to about 2,000. This saves a lot of time. I also dropped the "id" column, since it carries no predictive information either.
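The downsampling step can be sketched in a couple of lines. This is a hypothetical Python version (the column name "id" comes from the post; the sample size and seed are assumptions):

```python
import pandas as pd

def downsample(df, n=2000, seed=42):
    """Randomly sample n rows without replacement and drop the id column.

    Sampling randomly (rather than taking the first n rows) avoids
    accidentally keeping only one slice of the data.
    """
    small = df.sample(n=min(n, len(df)), random_state=seed)
    return small.drop(columns=["id"], errors="ignore")
```

Fixing the random seed makes the subsample reproducible, which matters once you start comparing models on it.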
That left me with about 2,000 observations and 18 columns. I randomized the rows to reduce ordering bias, then split the data into a training set and a test set. I set up a train control object to do 10-fold cross-validation, which also helps reduce the influence of outliers and other sources of bias. I then tried the standard classification algorithms for this kind of data: logistic regression, linear discriminant analysis, decision trees, Naive Bayes, k-nearest neighbors, learning vector quantization, support vector machines, and random forests.
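The shuffle-then-fold step is easy to get subtly wrong, so here is a small self-contained sketch of what 10-fold cross-validation indices look like. This is a hand-rolled Python illustration, not the post's actual setup (which, given the "train control object" wording, was likely caret's trainControl in R):

```python
import numpy as np

def kfold_indices(n, k=10, seed=42):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Rows are shuffled first (reducing ordering bias), then split into k
    folds; each fold serves as the test set exactly once.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx
```

In practice you would fit each of the eight algorithms on every train split and average its accuracy across the ten test folds, which is exactly what caret (or scikit-learn's cross_val_score) automates.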
The best model out of these ended up being the random forest, at about 76% accuracy. The most important predictor variable happened to be the number of deaths! No surprise, really: the more you die, the more you lose!
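The post doesn't say how variable importance was measured (caret's varImp is the usual route for a random forest). A model-agnostic alternative is permutation importance: shuffle one column at a time and see how much accuracy drops. Here's a hedged Python sketch; the `predict` function stands in for any fitted classifier:

```python
import numpy as np

def permutation_importance(predict, X, y, seed=0):
    """Accuracy drop when each feature column is shuffled independently.

    `predict` maps an (n, p) feature array to predicted labels. A large
    drop means the model leaned heavily on that column (e.g. deaths).
    """
    rng = np.random.default_rng(seed)
    base_acc = np.mean(predict(X) == y)
    importances = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])  # destroy this column's signal, keep the rest
        importances.append(base_acc - np.mean(predict(Xp) == y))
    return np.array(importances)
```

A column whose shuffle barely changes accuracy contributes little; for this dataset, shuffling the deaths column would presumably produce the biggest drop.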
Data Science & Gaming. What a perfect combination.