I am graduating this semester. This means I need to get a job by January. This gives me such a rush of excitement and stress. I am very ready to leave cold Southeastern Idaho and to start my career. I really connected with data science. I know that this is what I want to do. Data science classes are the most interesting classes I’ve had in my college career, the second being advanced math and statistics classes. Data science is a way that I can apply everything I’ve learned while also using programming. Programming is such a universal, powerful tool. I am eager to find my place in the workforce and to start applying my knowledge and to keep learning more.
I’ve read so much information about data science jobs. Some say to start as an analyst, some say to get an internship, and some say to just go for it and land that data science job. They all talk about portfolios and skills that you should keep up to par. I wish I didn’t have to worry about my other classes and could just work on my portfolio. There are so many datasets out there that I would love to analyze and visualize. You can do so much with data science; the possibilities are endless. I want to start working on a new project each week. My current projects are: my consulting class project, where I am using machine learning algorithms to predict customer attrition for a Fortune 500 company; a data cleaning and linear regression project to reduce product waste for a company in Idaho (this one is through the data science society); my senior project, where I analyze fitness surveys and data to predict the best health plans for an individual; and a League of Legends project that I just came up with to investigate game data and predict match outcomes. With these projects, my experimental design class, my part-time teaching assistant job for Social Science statistics, and job applications, I will have my hands full. I will try my hardest to keep this blog and GitHub updated.
I found some cool job data for New York City. My wife and I have always wanted to live in New York! This is a very cool dataset. I needed to clean this data up because there were a lot of columns that didn’t provide any useful information. First, I took the "salary range from" and "salary range to" columns and averaged them into a new column. Next, I removed those two columns and a lot of other columns that held things like job descriptions, processing dates, etc. I am left with about 27 columns of information.
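Here is a minimal sketch of that cleaning step in plain Python. It is an illustration, not the original analysis code, and the column names ("Salary Range From", "Salary Range To", "Job Description", "Process Date", "Average Salary") are assumptions chosen for the example.

```python
def add_average_salary(rows,
                       low_key="Salary Range From",   # assumed column name
                       high_key="Salary Range To",    # assumed column name
                       drop=("Job Description", "Process Date")):  # assumed
    """Average the two salary columns into one and drop unneeded columns."""
    cleaned = []
    for row in rows:
        # Keep every column except the dropped ones and the two salary bounds.
        new_row = {k: v for k, v in row.items()
                   if k not in drop and k not in (low_key, high_key)}
        # Midpoint of the posted salary range.
        new_row["Average Salary"] = (float(row[low_key]) + float(row[high_key])) / 2
        cleaned.append(new_row)
    return cleaned


jobs = [{"Business Title": "Data Analyst",
         "Salary Range From": "60000",
         "Salary Range To": "80000",
         "Job Description": "Analyze city data."}]
cleaned_jobs = add_average_salary(jobs)
```

Each row is just a dictionary here, so the same idea carries over to a data frame in whatever tool you prefer.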
I filtered using string detect for keywords in the job titles. My wife and I have different career paths so this helps to filter out all the jobs that don’t fit with our needs. Next, I filtered to get jobs that have only been posted in the last two months.
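A rough sketch of those two filters in plain Python, again as an illustration rather than the code actually used. The column names ("Business Title", "Posting Date") and the choice to treat "two months" as 60 days are assumptions.

```python
from datetime import date, timedelta


def filter_jobs(rows, keywords, today, days=60,
                title_key="Business Title",   # assumed column name
                date_key="Posting Date"):     # assumed column name
    """Keep rows whose title contains any keyword and that were posted recently."""
    cutoff = today - timedelta(days=days)      # "two months" approximated as 60 days
    lowered = [k.lower() for k in keywords]
    return [row for row in rows
            if any(k in row[title_key].lower() for k in lowered)  # keyword match
            and row[date_key] >= cutoff]                          # recent posting


postings = [
    {"Business Title": "Data Scientist", "Posting Date": date(2018, 9, 15)},
    {"Business Title": "Senior Data Analyst", "Posting Date": date(2018, 5, 1)},
    {"Business Title": "Plumber", "Posting Date": date(2018, 9, 20)},
]
recent_data_jobs = filter_jobs(postings, ["data"], today=date(2018, 10, 1))
```

Passing a different keyword list for each person gives each of us our own filtered set from the same data.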
These are a few useful plots that will help in our job search (Sadie's plot is in the middle, mine are on the top and bottom):
One of my personal projects that I am working on is called League of Legends. Everything is on my GitHub. I love video games, not specifically League of Legends, but video games in general. Kaggle has a huge dataset of League of Legends data so I decided to download it and see if I can find a way to predict match outcomes.
The first thing I did was check for missing values. Fortunately, there weren’t any! This makes everything much easier. The next thing I did was check for columns with near-zero variance. These columns take almost the same value in every row, so they provide no useful information for modeling. There were many of them, so I removed them.
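To make the two checks concrete, here is a minimal sketch in plain Python. It is not the code from the project. The near-zero variance rule (flag a column when the most common value is far more frequent than the second most common and there are few distinct values) and its thresholds are assumptions picked for illustration.

```python
from collections import Counter


def count_missing(rows):
    """Count cells that are None across all rows."""
    return sum(1 for row in rows for value in row.values() if value is None)


def near_zero_variance_columns(rows, freq_cut=19.0, unique_cut=10.0):
    """Flag columns that are constant or dominated by a single value.

    freq_cut and unique_cut are illustrative thresholds, not fixed rules.
    """
    flagged = []
    n = len(rows)
    for col in rows[0]:
        counts = Counter(row[col] for row in rows).most_common()
        if len(counts) == 1:            # constant column: zero variance
            flagged.append(col)
            continue
        freq_ratio = counts[0][1] / counts[1][1]    # top value vs. runner-up
        percent_unique = 100.0 * len(counts) / n    # share of distinct values
        if freq_ratio > freq_cut and percent_unique < unique_cut:
            flagged.append(col)
    return flagged


matches = [{"rare_flag": 1 if i == 0 else 0, "winner": i % 2, "season": 8}
           for i in range(100)]
missing = count_missing(matches)
to_drop = near_zero_variance_columns(matches)
```

In this toy data, "rare_flag" is almost always 0 and "season" never changes, so both get flagged, while "winner" is evenly split and stays.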
I got the dataset down to about 19 columns. The next task was to sample the data down to a much smaller set. The dataset had about 1,000,000 observations, which is too many for my laptop, so I sampled it down to about 2,000 observations. This will save a lot of time. I also took out the “id” column because it provides no information either.
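A small sketch of the downsampling step in plain Python, assuming simple random sampling without replacement and a fixed seed so the sample is reproducible. Both choices, and the "kills" column in the toy data, are assumptions for illustration.

```python
import random


def sample_rows(rows, n, seed=42, drop=("id",)):
    """Draw n rows at random (without replacement) and drop identifier columns."""
    rng = random.Random(seed)           # fixed seed: same sample every run
    picked = rng.sample(rows, n)
    return [{k: v for k, v in row.items() if k not in drop} for row in picked]


all_matches = [{"id": i, "kills": i % 7} for i in range(1000)]
small = sample_rows(all_matches, 20)
```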
I’m down to about 2,000 observations with 18 columns. I shuffled the rows to reduce any bias from their ordering, then split the data into a training set and a test set. I set up a train control object to do 10-fold cross-validation. This also helps reduce the influence of outliers or other quirks in any single split. I tried the most common classification algorithms on this data: logistic regression, linear discriminant analysis, decision trees, naive Bayes, k-nearest neighbors, learning vector quantization, support vector machines, and random forests.
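The shuffle, split, and fold setup can be sketched in plain Python like this. It only shows how the data gets divided and is not the train control object itself. The 80/20 split and the seed are assumptions, since the exact proportions are not stated above.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows, then split into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)   # assumed 20% held out
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n, k=10):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]   # k disjoint folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


train, test = train_test_split(list(range(2000)))
splits = list(k_fold_indices(len(train), k=10))
```

Each of the 10 folds takes a turn as the validation set while the model trains on the other nine, and the test set stays untouched until the end.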
The best model out of these ended up being the random forest model with 76% accuracy. The most important predictor variable happened to be number of deaths! Wow. No surprise, really. You die more, you lose more!
I did it. I found the career that will keep me entertained. After many experiments with different majors, I found the topic that will bring me joy and interest. Data Science.
My college career began with criminal justice. I always wanted to be an FBI agent. As I got older and more mature, I realized that wasn’t what I wanted to do. I have a gift for mathematics. I learned of this gift at a very young age. I knew that I needed to make full use of it and reach my full potential in that realm. I changed my major to physics. I liked physics a lot, but I began to have a curiosity for space, so I signed up for aerospace engineering. I wanted to be a rocket scientist. This plan stayed put until I transferred to a new school where aerospace engineering wasn’t a major. I had to switch to mechanical engineering. My schedule was filled with engineering and math classes. I noticed a pattern. I enjoyed the math classes, but I just tolerated the engineering. This led me to believe I should switch my major to math. After my first statistics class, I chose statistics as my degree emphasis. I was all set. I researched careers and decided that being an actuary was a good fit.
In my very last semester (now), I began to doubt the actuary career. I knew it was the right direction but not the right specialization. I wanted to be able to work in all kinds of industries. I wanted to be able to work with all kinds of data. Data Science had the best chance of offering me this. I quickly began to love data science. I would get excited about new datasets and learning programming. It was very addicting to keep learning skills that I could directly apply to data RIGHT NOW. No waiting for a job or anything like that. You can just download a dataset and play with it right away. Start analyzing it and making predictions. This could help in all aspects of my life! I have so many plans for using my skills to help my life go more smoothly. Anything is possible. It can be anything from analyzing video game data to see what helps me win more, to looking at sports data to try to predict winners each week, to predicting cancer or stock rates. It opened a whole world for me. I feel a bit behind due to the late switch, but I’m just glad I arrived. Every day I learn new things from my classes, articles, DataCamp, and all kinds of other online tools. I have many projects in progress right now and I’m so excited to see what will come from them. This will start my data science blog.
Data Science & Fitness. What a perfect combination.