Boozehound Cocktail Recommender

This project was a labor of love for me since it combines two of my favorite things: data and cocktails. Ever since my wife signed us up for a drink-mixing class a few years ago, I’ve been stirring up various concoctions, both established recipes and original creations. My goal was to create a simple app that would let anyone discover new drink recipes based on their current favorites or just a few descriptive words. Recipe books can be fantastic resources, but I often want something that tastes like another drink but different, or a certain combination of flavors built on a specific spirit, and that’s exactly where a table of contents fails. I never thought of Boozehound as a replacement for my favorite recipe books, but I hoped it could serve as an effective alternative when I don’t have the patience to thumb through dozens of recipes to find what I want to make.

Photo of a book and index cards containing cocktail recipes

Data Collection

Because I wanted Boozehound to work with descriptions and not just cocktail and spirit names, I knew I would be relying on natural language processing and would need a fair amount of descriptive text to work with. I also wanted the app to look good, so I needed to get my recipes from a source that includes an image for each drink. I started by scraping the well-designed Liquor.com, which has a ton of great recipes and excellent photos. Unfortunately, the site’s write-ups are extremely inconsistent: some are paragraphs long, while others run only a sentence or two. I wanted longer, more consistent drink descriptions, and I found them at The Spruce Eats, which I scraped using BeautifulSoup in the ‘scrape_spruce_eats’ notebook in my project GitHub repo. The Spruce Eats doesn’t have the largest list of recipes, but I was still able to collect roughly 980 separate drink entries from the site, each with an ingredient list, a description, and an image URL.
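
Stripped to its essence, the scraping logic looks something like the sketch below; the URL handling and CSS selectors here are hypothetical placeholders rather than The Spruce Eats’ actual markup:

```python
import requests
from bs4 import BeautifulSoup

def scrape_recipe(url):
    # Parse one recipe page; the selectors are illustrative stand-ins
    # for the site's real markup.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return {
        "name": soup.find("h1").get_text(strip=True),
        "description": " ".join(
            p.get_text(strip=True) for p in soup.select("div.recipe-description p")
        ),
        "ingredients": [li.get_text(strip=True) for li in soup.select("ul.ingredients li")],
        "image_url": soup.find("img")["src"],
    }
```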

Text Pre-Processing

After getting all of my cocktail recipe data into a pandas DataFrame, I still needed to format my corpus to prepare it for modeling. I used the spaCy library to lemmatize words and keep only the nouns and adjectives. spaCy is both fast and easy to use, which made it ideal for my relatively simple pre-processing. I then used scikit-learn’s TF-IDF implementation to create a matrix of word frequency vectors, one per recipe. I chose TF-IDF over other vectorizers because it accounts for disparities in document length: some of my drink descriptions are twice as long as others, and user search strings, which the app runs through the same pipeline, are almost certain to be shorter than any cocktail description. My pre-processing and modeling work is stored in the ‘model_spruce_eats’ notebook.
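
A minimal sketch of that pipeline, assuming spaCy’s small English model (the notebook may load a different one); the sample descriptions are stand-ins for the ~980 scraped entries:

```python
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")

def keep_nouns_adjs(text):
    # Lemmatize and keep only nouns and adjectives.
    doc = nlp(text)
    return " ".join(tok.lemma_ for tok in doc if tok.pos_ in {"NOUN", "ADJ"})

# Stand-in descriptions; the real corpus has roughly 980 scraped entries.
descriptions = [
    "A refreshing tequila drink with bright lime and a salted rim.",
    "A smoky mezcal cocktail balanced by sweet agave and bitter orange.",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(keep_nouns_adjs(d) for d in descriptions)
```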

Models

Building my initial model was a fairly simple process of dimensionality reduction and distance calculations. Since I had a TF-IDF matrix with significantly more columns than rows, I needed some way of condensing those features. I tried a few approaches, and Non-negative Matrix Factorization (NMF) gave me the most sensible groupings. From there I just calculated pairwise Euclidean distances between the NMF description vectors and a similarly vectorized search string. There was just one problem: my model was such a good recommender that it was way too boring. It relied far too heavily on the frequency of cocktail and spirit names in determining similarity, so a search for ‘margarita’ would just return ten different variations on the margarita. To make the recommendations more interesting, I created a second model that can only use description words that are not the names of drinks or spirits. The two models are then blended together, a mix the user controls through the Safe-Weird slider in the Boozehound app.
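
Building on the pre-processing sketch above, the core of the recommender might look roughly like the following; the tiny drink-word list, the topic counts, and the blending function are illustrative assumptions rather than the repo’s actual code:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

DRINK_WORDS = {"margarita", "tequila", "mezcal", "whiskey"}  # tiny stand-in list

def strip_drink_words(text):
    return " ".join(w for w in text.split() if w not in DRINK_WORDS)

# The "Safe" model sees everything; the "Weird" model never sees drink
# or spirit names, so it must connect recipes through other words.
weird_vectorizer = TfidfVectorizer(stop_words="english")
weird_matrix = weird_vectorizer.fit_transform(
    strip_drink_words(keep_nouns_adjs(d)) for d in descriptions
)

safe_nmf = NMF(n_components=2, random_state=0)   # topic count is illustrative
weird_nmf = NMF(n_components=2, random_state=0)
safe_vecs = safe_nmf.fit_transform(tfidf_matrix)
weird_vecs = weird_nmf.fit_transform(weird_matrix)

def recommend(query, weirdness=0.5, n=10):
    # Run the query through the same pipelines, then blend the two
    # distance vectors according to the Safe-Weird slider position.
    cleaned = keep_nouns_adjs(query)
    safe_d = pairwise_distances(
        safe_nmf.transform(vectorizer.transform([cleaned])), safe_vecs
    )[0]
    weird_d = pairwise_distances(
        weird_nmf.transform(weird_vectorizer.transform([strip_drink_words(cleaned)])),
        weird_vecs,
    )[0]
    blended = (1 - weirdness) * safe_d + weirdness * weird_d
    return np.argsort(blended)[:n]  # indices of the n closest recipes
```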

The image below gives an example of how both models work. Drink and spirit names dominate the descriptions, so the Safe model in the top half has a good chance of connecting words like ‘tequila.’ The Weird model on the bottom has to make connections using other words, in this case ‘refreshing.’ Because it has less data to work with, the Weird model tends to make less relevant recommendations, but they’re often more interesting.

Chart showing an example of Safe and Weird model word similarities

The App

I built the app in Flask and wanted a simple, clean aesthetic reminiscent of a classy cocktail bar. As a result, I spent just as much time styling content with CSS as I did getting the app to work properly. Luckily I enjoy design and layout, so it was a real pleasure seeing everything slowly come together. My Metis instructors would often bring up the idea of completing the story, and to me this project would have been incomplete without a concise, visually pleasing presentation of the recipe recommendations.
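
For completeness, a hypothetical Flask endpoint wired to the recommender sketched earlier could look like this; the template and form-field names are guesses, not the repo’s actual code:

```python
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    # Hypothetical route: run the search string and slider value through
    # the recommender, then hand the matches to the template.
    recipes = []
    if request.method == "POST":
        query = request.form.get("search", "")
        weirdness = float(request.form.get("weirdness", 0.5))  # Safe-Weird slider
        recipes = [descriptions[i] for i in recommend(query, weirdness, n=10)]
    return render_template("index.html", recipes=recipes)
```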

Picture of the Boozehound app

Conclusions

While I am pleased with the final product I presented in class at Metis, there’s still a lot I want to address in the Boozehound app. Some of the searches I tested returned odd results that could be improved upon. I’d also like to make some UX improvements, like helpful feature descriptions and initial search suggestions for users who don’t know what they want. Another planned feature is a one-click search button that lets the user find recipes similar to a recommended drink without typing its name into the search bar. Boozehound is all about exploration, and I want to make rifling through a bunch of new recipes as easy as possible.


Check out the full project on my GitHub

Kaggle Instacart Classification

I built models to classify whether or not items in a user’s order history will appear in their most recent order, essentially recreating the Kaggle Instacart Market Basket Analysis competition. Because the full dataset was too large to work with on my older MacBook, I loaded the data into a SQL database on an AWS EC2 instance. This setup allowed me to easily query subsets of the data for all of my preliminary development. I then cleaned up my work and consolidated it into a script called ‘build_models.py’ that can be run from a notebook or the command line. Once I was ready to scale up to the full dataset, I simply ran the build_models script on a 2XL EC2 instance and brought the resulting models back into my ‘kaggle_instacart’ notebook for test set evaluation.
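
The preliminary workflow amounted to pulling manageable slices out of the database, roughly like this sketch; the connection string and table names are placeholders, not the actual schema:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; the real database lived on an AWS EC2 instance.
engine = create_engine("postgresql://user:password@ec2-host:5432/instacart")

# Grab a deterministic ~1% sample of users for fast preliminary development.
sample = pd.read_sql(
    """
    SELECT o.user_id, o.order_id, o.order_number, op.product_id
    FROM orders o
    JOIN order_products op ON o.order_id = op.order_id
    WHERE o.user_id % 100 = 0
    """,
    engine,
)
```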

Feature Engineering

I spent the majority of my time on this project engineering features from the basic dataset. After creating several features, I tested different combinations of them on a small subset of the data in order to eliminate any that seemed to have no effect on model predictions. After paring down the candidates, I trained and tested my final models on the following predictors (a pandas sketch of two of these appears after the list):

  • percent_in_user_orders: Percent of a user’s orders in which an item appears
  • percent_in_all_orders: Percent of all orders in which an item appears
  • in_last_cart: 1 if an item appears in a user’s most recent prior order, 0 if not
  • in_last_five: Number of orders in a user’s five most recent prior orders in which an item appears
  • total_user_orders: Total number of previous orders placed by a user
  • mean_orders_between: Average number of orders between appearances of an item in a user’s order history
  • mean_days_between: Average number of days between appearances of an item in a user’s order history
  • orders_since_newest: Number of orders between the last user order containing an item and the most recent order
  • days_since_newest: Number of days between the last user order containing an item and the most recent order
  • product_reorder_proba: Probability that any user reorders an item
  • user_reorder_proba: Probability that a user reorders any item
  • mean_cart_size: Average user cart (aka order) size
  • mean_cart_percentile: Average percentile position at which an item is added to a user’s cart
  • mean_hour_of_week: Average hour of the week that a user orders an item (168 hours in a week)
  • newest_cart_size: Number of items in the most recent cart
  • newest_hour_of_week: Hour of the week that the most recent order was placed
  • cart_size_difference: Absolute value of the difference between the average size of the orders containing an item and the size of the most recent order
  • hour_of_week_difference: Absolute value of the difference between the average hour of the week in which a user purchases an item and the hour of the week of the most recent order
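
As an example of the derivations above, here is roughly how the first and third predictors could be computed with pandas; the input column names are assumptions about the schema:

```python
import pandas as pd

def build_features(prior: pd.DataFrame) -> pd.DataFrame:
    # prior: one row per (user_id, order_number, product_id) across all
    # past orders; column names are assumptions.
    user_orders = prior.groupby("user_id")["order_number"].nunique().rename("total_user_orders")
    item_orders = (
        prior.groupby(["user_id", "product_id"])["order_number"]
        .nunique()
        .rename("orders_with_item")
    )
    feats = item_orders.reset_index().merge(user_orders.reset_index(), on="user_id")

    # percent_in_user_orders: share of a user's orders containing the item
    feats["percent_in_user_orders"] = feats["orders_with_item"] / feats["total_user_orders"]

    # in_last_cart: 1 if the item appeared in the user's most recent prior order
    last = prior["order_number"] == prior.groupby("user_id")["order_number"].transform("max")
    in_last = prior.loc[last, ["user_id", "product_id"]].drop_duplicates()
    in_last["in_last_cart"] = 1
    feats = feats.merge(in_last, on=["user_id", "product_id"], how="left")
    feats["in_last_cart"] = feats["in_last_cart"].fillna(0).astype(int)
    return feats
```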

Models

In my preliminary tests using subsets of the Instacart data, I trained a number of different models: logistic regression, gradient boosting decision trees, random forest, and KNN. After several rounds of testing, I took the two that performed best, logistic regression and gradient boosting trees, and trained them on the full dataset, minus a holdout test set. I used F1 score as my evaluation metric because I wanted the models to balance precision and recall in predicting which previously ordered items would appear in the newest orders. To account for the large class imbalance (most previously ordered items are not in the most recent order), I also computed F1 scores at an adjusted probability threshold. The scores below treat each DataFrame row, which represents an item ordered by a specific user, as a separate, equally weighted entity. Both models performed similarly, with the gradient boosting trees classifier achieving slightly higher scores:

Model                      Raw F1 Score    Adjusted F1 Score
Logistic Regression        0.313           0.447
Gradient Boosting Trees    0.338           0.461
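
The adjusted scores come from sweeping the classification threshold instead of using the default 0.5; a minimal version of that search could look like:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold_f1(y_true, y_proba):
    # Sweep candidate probability thresholds and keep the one that
    # maximizes F1 on the heavily imbalanced positive class.
    thresholds = np.linspace(0.05, 0.95, 19)
    scores = [f1_score(y_true, (y_proba >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]
```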

I also calculated mean per-user F1 scores, which more closely match the metric of the original Kaggle contest. If either model were incorporated into a recommendation engine, this user-based metric would better represent its performance. By this measure, the two models are virtually identical:

Model                      Per-User F1 Score
Logistic Regression        0.367
Gradient Boosting Trees    0.368
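
The per-user scores can be computed by scoring each user’s predicted basket separately and then averaging, along these lines (column names are assumptions):

```python
import pandas as pd
from sklearn.metrics import f1_score

def mean_per_user_f1(results: pd.DataFrame) -> float:
    # results: one row per (user_id, product_id) with y_true and y_pred
    # columns. Score each user's basket on its own, then average,
    # matching the spirit of the Kaggle metric.
    per_user = results.groupby("user_id").apply(
        lambda g: f1_score(g["y_true"], g["y_pred"], zero_division=0)
    )
    return per_user.mean()
```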

Results

The charts below show the most influential predictors for each model: coefficient values for the logistic regression and feature importances for the gradient boosting trees. The logistic regression model relies heavily upon information about the size of the most recent cart, while the gradient boosting model gives far more weight to the contents of a user’s previous orders. If information about the most recent cart were not available, the gradient boosting model would most likely outperform the logistic regression model.

Logistic regression model coefficients
Gradient boosting trees model feature importances

Conclusions

This project was all about feature creation: the more features I engineered, the better my models performed. At Metis I had a pretty tight deadline to get everything done, and as a result I did not incorporate all of the predictors I wanted to. I plan to eventually circle back and add more, including implementing some ideas from the Kaggle contest winners.


Check out the full project on my GitHub

Predicting NHL Injuries

I’m a huge fan of both hockey and its associated analytics movement, so I wanted to work with NHL data while finding a relatively uncovered angle. I also wanted to put some of the conventional wisdom of the league to the test:

  • Do smaller and lighter players get injured more?
  • Is being injury-prone a real thing?
  • Are older players more likely to get hurt?
  • Are players from certain countries more or less likely to get injured? (European players were once considered "soft" and Alex Ovechkin once said, "Russian machine never breaks")

To help answer these questions, I built a model to predict how many games an NHL player will miss in a season due to injury. I then examined the model’s coefficient values to see if I could gain any insights into my injury questions.

Data Collection

Counting stats for sports are easy to come by, and hockey is no different: I was able to download CSVs of all player stats and biometrics for the last ten NHL seasons from Natural Stat Trick. I combined the separate Natural Stat Trick datasets in the ‘player_nst_data’ notebook in my project GitHub repo. Unfortunately, reliable player injury histories are much more difficult to find. I was able to scrape injury lists from individual player profiles on the Canadian sports site TSN using Selenium and BeautifulSoup. All injury scraping and parsing is contained in the ‘player_injury_data’ notebook. While I don’t believe the TSN data is an exhaustive list of player injuries, it was the best source I could find, so that’s what I used.
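
In rough strokes, the scraping setup looks like the sketch below; the table selector is an illustrative guess rather than TSN’s actual markup:

```python
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()

def scrape_injuries(player_url):
    # TSN builds player pages with JavaScript, hence Selenium instead of
    # plain requests; the selector below is a placeholder.
    driver.get(player_url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    rows = soup.select("table.injuries tr")
    return [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows]
```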

An example of a TSN player profile page containing injury data

Feature Selection/Engineering

Thanks to the amazing number of counting stats Natural Stat Trick aggregates, I had an abundance of potential features for my models. I relied on my domain knowledge as a longtime hockey fan to whittle the list down to anything I thought could plausibly correlate with injury rates, plus predictors that test the conventional wisdom about what causes players to get hurt. I removed goaltenders from my dataset because their counting stats are completely different from those of skaters. I also used scikit-learn’s PolynomialFeatures to account for feature interactions; a sketch of that expansion follows the feature list. Here is a partial list of individual features and my logic for choosing them:

  • Games Missed Due to Injury: what I’m trying to predict
  • Height/Weight: to see if smaller/lighter players get injured more often
  • Position: defensemen play more minutes and are more likely to block shots than other positions, while wingers and defensemen are generally more likely than centers to engage in physical play along the boards
  • Penalties Drawn: penalties are often assessed when a player is injured as the result of a dangerous and illegal play
  • Hits Delivered/Hits Taken: an indicator of physical play that could lead to more injuries
  • Shots Blocked: players sometimes suffer contusions and broken bones while blocking shots
  • Age: to see if older players get injured more often
  • Major Penalties: majors are often assessed for fighting
  • Being European/Being Russian: to see if either correlates with increased injury rates
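
A minimal sketch of the interaction expansion mentioned above; the degree and interaction_only settings are my best guesses at a sensible configuration, not necessarily the notebook’s exact ones:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # stand-in for the real feature matrix

# Add pairwise interaction terms (e.g. age x previous games missed)
# without also adding squared columns.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out(["age", "hits_taken", "prev_games_missed"]))
```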

Additional Feature Engineering

I started off with some simple models because I wanted to evaluate which formulation of my data would work best before spending time tweaking hyperparameters. I fit an ordinary least squares regression on each of the following datasets:

  • Each entry contains a player’s total counting stats and games missed due to injury over the last ten seasons.
  • Each entry contains a player’s counting stats, games missed due to injury, and games missed the previous season, all for a single season. In this and the following format, players can have multiple rows of data.
  • Similar to the previous format, except it also includes rolling averages of counting stats and games missed across all previous seasons.

I created each of these data formulations in the ‘data_merge’ notebook. Unsurprisingly, the last and most robust dataset produced the lowest test MSE, so I used that formulation for the final modeling. All modeling lives in the ‘nhl_injuries’ notebook.
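
The rolling averages in that last formulation can be built along these lines (column names are assumptions); shifting by one season keeps each row’s features limited to strictly earlier seasons, so the current season’s outcome never leaks into its own features:

```python
import pandas as pd

def add_rolling_features(seasons: pd.DataFrame, stat_cols) -> pd.DataFrame:
    # seasons: one row per (player, season).
    seasons = seasons.sort_values(["player", "season"]).copy()
    for col in stat_cols:
        seasons[f"avg_prev_{col}"] = (
            seasons.groupby("player")[col]
            .transform(lambda s: s.shift(1).expanding().mean())
        )
    return seasons
```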

EDA

Exploratory Data Analysis confirmed some of my assumptions going in – mainly that injuries are largely random and unpredictable. The heatmap below shows very low correlation between any of my predictors and the response.

Predictor/Response Correlation Heatmap

I also created a histogram of games missed due to injury per season, which shows that most players miss little or no time. As a result, the distribution of games missed has an extreme right skew.

Games Missed Due to Injury Histogram

Models

Before training my final models, I limited the dataset to players who had played at least 50 career games, which mainly removed career minor-leaguers with spotty NHL statistics. I standard-scaled all values, created a train/test split of my dataset of 7,851 player-seasons, and trained a Lasso linear regression model with polynomial features alongside a random forest regressor. The test R² values for both models were low, indicating that the predictors did not explain much of the variation in the response in either model.

Model            Test R²
Lasso            0.075
Random Forest    0.107
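
A sketch of the Lasso side of that setup with synthetic stand-in data; LassoCV is an assumption, since the post doesn’t say how the regularization strength was chosen:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-ins for the real player-season features and games missed.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = rng.poisson(3, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale, expand interactions, then fit a cross-validated Lasso.
lasso = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LassoCV(cv=5),
)
lasso.fit(X_train, y_train)
print("test R^2:", lasso.score(X_test, y_test))
```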

Results

Since the Lasso model performed similarly to the random forest while remaining far more interpretable, I used it for all subsequent analysis. I graphed the ten most important predictors for the Lasso model along with their coefficient values. Although the model didn’t have great predictive value, I still wanted to determine which inputs had the most influence.
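
Continuing the pipeline sketch from the previous section, the top ten coefficients can be pulled out like so:

```python
import numpy as np

# make_pipeline names its steps after their classes, so the fitted
# transformers and estimator can be retrieved by those keys.
poly = lasso.named_steps["polynomialfeatures"]
coefs = lasso.named_steps["lassocv"].coef_
names = poly.get_feature_names_out()
top10 = np.argsort(np.abs(coefs))[::-1][:10]
for i in top10:
    print(f"{names[i]}: {coefs[i]:+.3f}")
```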

Predictor Coefficients

Using the chart above I was able to get some answers to my initial questions:

  • Height and weight don’t appear anywhere near the top of the predictor list, meaning my model finds little evidence that smaller or lighter players get hurt more
  • Average Previous Games Missed and Last Games Missed appear frequently in the top ten, lending credence to the idea that some players are injury-prone
  • Age times Average Previous Games Missed has the fourth-highest coefficient, providing some very weak evidence that older players are more likely to get injured
  • Nationality appears to have almost no importance to the model, save for Russian defensemen having a small positive correlation with games missed due to injury

Conclusions

In many ways the results of this project confirmed my pre-existing beliefs, especially that NHL injuries are largely random events that are difficult to neatly capture in a model. Because I started off assuming a model would not have great predictive quality, I felt obligated to create the best model I could in order to counteract that bias. I also expected to find little to no evidence of players being injury-prone, but my model presents evidence that players injured in the past are more likely to get hurt in the future. While I still believe injuries are too random to be reliably predicted by a model, I do think my model can be improved with more and better data. More complete NHL and minor league injury data would fill a lot of gaps in the model, especially for players who are not NHL stalwarts. New NHL player tracking data could potentially provide inputs with far more predictive value than traditional counting stats. I’m excited to see what insights can be gleaned as we continue to collect increasingly detailed sports data.


Check out the full project on my GitHub