Metis Bootcamp Weeks 2-12

This is the third post in a series on my experiences attending the Metis immersive Data Science course in New York City. My previous posts covered the application process and pre-bootcamp work through Week 1. While I cannot guarantee anyone else’s bootcamp will be the same as mine, I hope you find these posts as useful as I found other blogs when I was researching data science programs.


Each week of bootcamp brings with it a host of new models, evaluation metrics, and visualization tools but maintains the consistent structure established in Week 1: pair programming followed by lectures and independent project time. There are four individual projects in addition to the first week’s group effort, each one concentrating on particular skills and analyses. Lectures are timed to introduce you to the knowledge and tools you’ll need to immediately apply to project work. Weekly seminars from the careers department add some variety to the normal lecture schedule.

A Typical Day at Metis
9 am: Pair Programming
10:30 am: Instructor Lecture
12 pm: Lunch
1 pm: Lecture or Careers Seminar
3 pm: Project Work
5 pm: End of Day

Pair Programming

The typical day at Metis starts at 9am with a pair programming challenge focused on coding skills, a logic problem, or breaking down a machine learning algorithm. Problems can seem impossible, very easy, or anywhere in between, and the variety in the challenges means you’ll likely vacillate between feeling really smart and really stumped. Metis makes a big deal of the diversity of student backgrounds, which I certainly found to be true of my cohort. Pair problems cover enough different topics that you’re unlikely to be an expert in all of them, meaning you may encounter some you can’t solve alone. They aren’t checked or graded, so there’s no real pressure to solve them, but they’re certainly worth spending time on. Many of the problems are similar to (if not identical to) questions you’ll encounter on take-home tests and in whiteboard challenges during the job interview process.

Lectures

The number and length of lectures vary from day to day, but you can expect at least one between pair programming and the noon lunch break. The standard lecture consists of an instructor walking the class through a Jupyter Notebook or slide deck to introduce or further explain a particular concept or technology. Examples from Week 2 include a PowerPoint presentation on the assumptions of linear regression and a notebook showing how to train linear regression models using the scikit-learn and statsmodels libraries. Lectures coincide with the focus of each project, so Weeks 2 and 3 are heavy on linear regression and data visualization to prepare for the regression modeling project due at the end of Week 3. Here’s a quick summary of lecture and project topics, but keep in mind the current curriculum may be different:

Weeks | Topics | Project Duration (in weeks)
1 | Exploratory Data Analysis, Git, Pandas | 1
2-3 | Linear Regression, Web Scraping | 2
4-6 | Classification Algorithms, Data Visualization, SQL, AWS | 2.5
6-8 | Natural Language Processing, Clustering Algorithms, NoSQL | 2.5
9-12 | Passion Project, Big Data, Neural Networks | 4

Lectures tend to be front-loaded within each project cycle and the course as a whole in order to quickly provide the knowledge necessary to complete each project. Towards the end of the course many lectures are optional, as they may not apply directly to your chosen final project. Information will likely come at you too quickly for you to absorb it all, let alone all of the provided supplemental lessons. A hurdle many of my classmates and I had to overcome early on in bootcamp was abandoning the idea of ‘doing it all.’ There isn’t enough time to dig deep into every concept introduced by the instructors, so eventually everyone has to pick and choose where to spend their time and effort. Prioritizing the specific topics and skills you want to learn is essential to making the intense coursework manageable. For every data science skill you rigorously study and apply, you’ll likely have to settle for cursory knowledge of other topics, which is especially true for project work.

Projects

Everything at Metis revolves around the five projects at the core of the program, and for good reason – that work will eventually make up the portfolio you’ll use to get your first data science job. I’ve already written an in-depth post about the first EDA project; the remaining four projects cover regression, classification, natural language processing, and any topic, covered or uncovered in class, that you want to explore further. Students deliver 4-5 minute presentations on their projects to complete each assignment, with earlier work serving as practice for the final presentations. Although most projects have specific requirements, students choose their own topics and acquire their own data, which leads to a broad diversity of work. Of course, coming up with five ideas you like for which you can find sufficient data is easier said than done. Since you can only afford so much time for ideation, I recommend a strategy of giving yourself a hard cutoff day for when you’ll just commit to a project from your list of ideas. I was fortunate enough to find topics I liked for three of my individual projects, and those were a pleasure to work on; putting in extra hours is always easier if you have genuine passion for what you’re researching. The one solo project for which I failed to find a great topic is easily my least favorite, but I’m still happy with the results. After using up my allotted ideation time I still didn’t like any of my project proposals, so I ended up picking the one I disliked the least. I could have burned a few more days of work time trying to come up with a better idea, but instead was able to comfortably deliver a finished project and learn quite a bit in the process.

Since the Metis curriculum is available online and may change in the future, I won’t spend many words rehashing it. Instead I’ll quickly describe each project, link to what I did for the assignment, and provide a bit of insight into the particular challenges associated with it. The first project is the only one of the five that is a mandated collaborative effort, and you can read more about it in this blog post.

The second project required us to scrape our own data and use it to fit a linear regression model. This project is the first real gauge of skill and understanding amongst the class, which can make presentation day intimidating. The early weeks of the course heavily favor students who enter the program with a lot of coding experience, and if that’s not you it can be easy to feel behind. As both a student and teaching assistant, I’ve seen a number of novice programmers struggle through the second project only to show marked improvement on the next one.

For our third project, my class was tasked with utilizing a cloud computing service (AWS, Google Cloud, etc.) to host a large amount of tabular data in a SQL database and then using it to fit a classification model. Whereas the regression project offers limited variety in applicable algorithms and software libraries, the classification project starts to open the data science floodgates. I really enjoyed the increase in creativity and diversity of topics and approaches in student presentations from projects two to three, and that trend continued steadily through the rest of the course.

The fourth project is very open-ended and focuses on Natural Language Processing (NLP for short). The two previous solo projects are both exercises in supervised learning, or training models using actual values you want them to accurately predict. The NLP project marks students’ first foray into unsupervised learning, meaning output consists of things like document groupings and word relationships instead of a set of predictions. This assignment can seem directionless and the results difficult to interpret, which can be frustrating for students who are used to clear goals and structure. The minimal requirements and sundry applicable techniques and technologies make the NLP presentations quite diverse. Students at this stage also tend to start applying more advanced visualizations using software like Tableau and deploying models in app prototypes with Flask.

Of course, the cornerstone of most Metis portfolios is the final project, which is what students present to a crowd of employers and alumni on Career Day. This project tends to be more ambitious and polished than its predecessors because students have amassed more knowledge by the end of the course and they have a full four weeks to work on it instead of two. Topics and technology used tend to be more diverse than in any other project, but many students gravitate towards training neural networks since they are introduced in the final weeks of the course. If you’re interested in a software library or algorithm not covered in bootcamp, the final project is an ideal opportunity to learn and apply it. Career Day presentations are pretty much like those for any other project, except there’s an unfamiliar audience and students spend more time practicing and fine-tuning their presentations beforehand.
Showing your work to a group of people interested in hiring data scientists can be intimidating, but it’s important to remember that while Career Day is a great opportunity to connect with employers, it’s not your only route to a job. Ultimately the event is a platform to show off what you’ve learned and to celebrate that progress with your classmates. I’ll write more about the job hunting process coming out of bootcamp in a future post.

Career Seminars

Your most important interactions with the Metis careers department will likely occur during your post-graduation job search, but advisors also check in regularly during the bootcamp. Various seminars aim to deliver tips on effective networking, optimizing LinkedIn profiles, and writing resumes to attract data science recruiters. Guest speakers who are working data scientists describe their professional responsibilities and experiences navigating the field. Some even give insights into what recruiters and hiring managers are looking for in new hires at their companies. Much of the information provided by Metis career advisors is useful even if you are a seasoned job seeker, but it can often come at inopportune times, when students are focused on finishing projects. It can be easy to ignore careers seminars in favor of more pressing concerns, but paying attention and taking notes is still worth your while, since only some of the presentation materials are available online. Once you leave bootcamp for full-time job hunting, you’ll appreciate all the leads and guidance you’ve collected.

Closing Thoughts

The Metis Data Science bootcamp was exactly what I hoped it would be: an intense, collaborative learning experience designed to push you to create the best work you can in a short period of time. It was well worth the cost of tuition ($17,000 when I attended) and helped me get hired as a data scientist within three months of completing the course. I enjoyed learning from knowledgeable instructors and teaching assistants as well as fellow classmates from my close-knit cohort. Above all I appreciated how professional Metis was at every step of my bootcamp experience. Having said all that, I don’t think Metis, or bootcamps in general, is right for everyone. I was able to make the most of the program due in large part to my financial stability and lack of competing time commitments going into it. You can certainly succeed with a less advantageous situation, but the class will be even more of a challenge. The trite adage ‘you get out what you put in’ also applies to Metis, making time management and focus paramount if you want to get the most out of the short, fast-paced curriculum. I recommend visiting Metis and other bootcamps for a private tour or open house event to get a better feel for how you might fit in there. I also found the personal experiences detailed in alumni blogs very helpful in deciding on the bootcamp route; I hope that you get as much out of my posts as I did reading those of others.


TL;DR

  • You won’t have time to learn everything, so you’ll want to prioritize what you’re most interested in learning
  • Don’t spend so much time coming up with the perfect project idea that you don’t have enough time to execute it
  • Students with prior programming experience tend to have an easier time early in the program, but the rest of the class catches up fairly quickly
  • Career Day is a good opportunity, but not your only route to a data science job

Easily Deploy Simple Flask Apps to Heroku

Have a working Flask app you want to deploy to Heroku? Read on to learn how…

Flask and Heroku logos with a gradient between them

A few weeks ago I had a couple of Flask apps that I wanted to deploy to Heroku. My apps ran fine locally so I expected the process to be quite simple with Heroku’s ability to deploy directly from a GitHub repository. While I eventually got both of my apps up and running, I thought the resources I found were a bit too dense for the relatively straightforward process I used to get everything set up. I’m hoping this blog post helps someone in a similar position get their work deployed more quickly than I did. You can follow along as I explain each step or just scroll down for the abridged version at the bottom of this post.

1. Push Your App to a GitHub Repository

This step isn’t completely necessary since you can deploy using a combination of Git and the Heroku CLI, but I would definitely recommend just linking a GitHub repo because it’s much easier. If you aren’t sure how to get your local files into a GitHub repo, check out this guide by Karl Broman. Once you have your app in its own GitHub repository and you’ve confirmed that it runs locally you should be ready for the next step.

2. Set Up a Virtual Environment

Before we start creating the files Heroku needs to properly deploy your app, we’re going to set up a virtual environment. You don’t actually have to follow this step, but I highly recommend doing it to avoid the messy alternative. That said, if you don’t want to use a virtual environment, feel free to skip ahead to section 3.

So why a virtual environment?

  • A virtual environment allows you to keep specific versions of libraries installed for app development and testing while still allowing you to update everything to the newest version outside of that environment. This is especially handy if you’re using a package management system like Conda that encourages you to update all packages simultaneously.
  • Having a virtual environment for your app makes it easy to get a list of the packages it requires to run, which is exactly what we’ll do in the next step.

Follow this guide from Real Python to set up your virtual environment and make sure it’s running for the remaining steps. If you try to run your app it should fail because your newly-created virtual environment doesn’t have Flask or any of the other Python libraries your app is using installed yet.

ModuleNotFoundError: No module named 'flask'

So in order to make your app functional again, you’ll have to install all necessary packages in the virtual environment using pip. For Flask the shell command looks like this:

(env) $ pip install flask

Once you can launch your app locally from the virtual environment you should be ready for the next step.

3. Install and Test Gunicorn

Regardless of whether or not you created a virtual environment, you’ll need to use Gunicorn to get your app to run properly on Heroku. Flask’s built-in web server cannot handle concurrent requests, but Gunicorn can. For a much better, in-depth explanation check out this Heroku article. What you want to do now is install Gunicorn and make sure you can run your app with it. Let’s first install it:

(env) $ pip install gunicorn

And now navigate to the directory containing your Flask app and let’s make sure it works with Gunicorn. The command below uses Gunicorn to run the application object named app from the file app.py in the current working directory. You may need to change the command slightly based on what you named things; just follow this format: gunicorn (your module):(your app object).

(env) $ gunicorn app:app

You should see a message telling you that Gunicorn is running and a local address you can open in a web browser. All app functionality should be the same as if you ran it through Flask’s built-in web server.
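
For reference, here’s a minimal sketch of the kind of Flask script these commands assume, with the module named app.py and the application object named app (if your names differ, adjust the Gunicorn command accordingly):

# app.py
# 'gunicorn app:app' means: load the object named 'app' from the module 'app'
from flask import Flask

app = Flask(__name__)  # this is the object Gunicorn serves

@app.route("/")
def index():
    return "Hello from Gunicorn!"

if __name__ == "__main__":
    # Flask's built-in development server, used when running 'python app.py'
    app.run(debug=True)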

If your Flask app script is not in the root directory of your GitHub repository, you’ll also want to test a modified version of this command. We’ll be telling Heroku to use Gunicorn to run your app, but it does that from the root directory. So let’s navigate to the repo root directory and test Gunicorn again (if your app script is in the root directory you can just skip this part). The command below uses the --pythonpath flag so Gunicorn can find my app file in a subdirectory. Follow this format for your own file hierarchy: gunicorn --pythonpath (app directory) (your module):(your app object).

(env) $ gunicorn --pythonpath flask app:app

If everything works properly, you’re ready to create the three files Heroku needs!

4. Add ‘Procfile’ to Your Repository

The Procfile is what Heroku uses to launch your app and should be easy to create because we just figured out what needs to be in it in the last step. All you have to do now is put web: in front of the Gunicorn command you ran from the root directory of your repository and save that in a plain text file called Procfile in your repo root. My Procfile looks like this:

web: gunicorn --pythonpath flask app:app

That’s it. Just follow the pattern web: gunicorn --pythonpath (app directory) (your module):(your app object) and save it in Procfile.

5. Add ‘runtime.txt’ to Your Repository

Next we’ll create runtime.txt, which tells Heroku what version of Python you want to use to run your app. You can get your Python version through the command line by running this line:

(env) $ python -V

You should see output that looks something like this:

Python 3.7.3

Now all you have to do is properly format the Python version and save it in a plain text file called runtime.txt in the root directory of the repo. For the version above the file contents should look like this:

python-3.7.3

Before deploying you should also check the Heroku Dev Center to confirm that you’re specifying a supported version of Python.

6. Add ‘requirements.txt’ to Your Repository

The last file we need to create is requirements.txt, which tells Heroku exactly what Python libraries your app is dependent upon. Did you set up that virtual environment in Step 2? If you did, congratulations, this step is going to be really simple. Make sure your virtual environment is still activated and run the command below to write a list of the libraries your code needs to run into requirements.txt.

(env) $ pip freeze > requirements.txt

You can also run pip freeze by itself to get the list of libraries without creating the requirements.txt file. If you set up a virtual environment, the list should be pretty short because that environment contains only the libraries your app needs to run. If you didn’t use a virtual environment your list will contain every Python library pip has installed on your machine, which is likely way more than you need just to run your app. You can whittle down that list by manually deleting items from requirements.txt, but that is definitely not a best practice. It’ll work, but I highly recommend using a virtual environment instead.

7. Push Your New Files to GitHub

Once you have Procfile, runtime.txt, and requirements.txt saved locally in your app’s root directory you’ll want to commit and push them to your GitHub repository. Since you’ll be connecting the GitHub repo for the app to Heroku for deployment, all necessary files need to be in your online repository.

8. Connect Your GitHub Repo to a Heroku Project

You’ll need to first create a Heroku account if you don’t already have one and then create a new app through the Heroku web interface. I won’t detail how to do either of those things here because Heroku’s interface makes it very simple. Once you’ve created your app you should see a section called Deployment Method under the Deploy tab; click Connect to GitHub.

Screenshot of Deployment Method Menu

Once you give Heroku permission to access your GitHub account you’ll be able to search for your app repository and then connect to it.

Screenshot of GitHub connection interface

9. Deploy Your App

With your repository connected you should now be able to deploy it under the Manual Deploy section. Just select what branch you want to use and click Deploy Branch. There’s also an option for automatic deployment which will deploy a new version of your app on Heroku every time you push to the deployed branch on GitHub.

Screenshot of Manual Deploy interface

You can check the results of your build under the Activity tab: if everything worked you should see Build succeeded after your latest build attempt. If you see Build failed check the build log to see what went wrong. I’ve found googling the exact error message from the logs to be very helpful when I’m not sure how to fix a failed build.

Screenshot of Activity Feed

You may also run into a situation where your build succeeds but does not run properly or at all. Because the build was successful, you won’t find any error messages in the build log and will instead have to view the Application Logs. Click the More dropdown menu in the upper right corner of the Heroku dashboard and select View Logs to access them.

Screenshot of Activity Logs button

Googling error messages will again come in handy when troubleshooting activity logs. I was having trouble getting an app that uses OpenCV to run despite a successful build and was able to quickly resolve my problem thanks to a search turning up this StackOverflow thread.

Thanks for Reading!

Hopefully you’ve now successfully deployed your Flask app to Heroku. If it didn’t work for you, please let me know what or where things went wrong so I can make improvements to this post. If you want to dig deeper into Heroku, a good place to start is the well-organized Heroku Dev Center. Thanks for reading and happy Heroku deploying!


TL;DR

Here’s how to deploy a Flask app on Heroku:

  1. Push your working Flask app to a GitHub repository
  2. Set up a virtual environment and install all library dependencies
  3. Install and test Gunicorn locally
    gunicorn --pythonpath (app directory) (module name):(app object)
  4. Add ‘Procfile’
    web: gunicorn --pythonpath (app directory) (module name):(app object)
  5. Add ‘runtime.txt’
    python -V
  6. Add ‘requirements.txt’
    pip freeze > requirements.txt
  7. Push new files to the GitHub repository
  8. Connect GitHub repository to Heroku
  9. Deploy your app

ShotPlot Archery App

I’ve long been fascinated by computer vision and had been thinking about using it to develop an automatic archery scoring system for a while. A few years ago, I found out about an archery range (shout out to Gotham Archery!) that had just opened in my neighborhood and decided to check it out. I was hooked after the introductory class and have been shooting there regularly ever since. As I continued to work on improving my form, I found self-evaluation to be somewhat difficult and wanted to come up with a quick and simple way to calculate my scores and shot distributions. While developing my skills at the Metis data science bootcamp, I started to get a clearer vision of how exactly I could build such a tool. My initial app idea involved live object tracking running on a mobile device, which I quickly realized might be too ambitious for a computer vision neophyte. I eventually settled on a plan to analyze a single target photo to derive shot positions and an average shot score for the session.

Data Collection

Before gathering my initial data, I set some restrictions on what each of those images would require. I wanted images to have all four corners of the target sheet visible so I could remove perspective skew and uniformly frame each one. Photos also needed to have enough contrast to pick out the target sheet and shot holes from the background. In order to keep the scope of the project manageable, I only used a single type of target: the traditional single-spot, ten ring variety. With those parameters in mind, I collected target data in two ways over several trips to the aforementioned Gotham Archery; I used my iPhone to photograph my target after each round of shooting at the range and also collected several used targets others had shot from the range’s discard bin. I set up a small home studio to quickly shoot the gathered targets but did not use any special lighting, camera equipment, or a tripod because I wanted the images to represent what an app user could easily produce themselves. I ended up collecting around 40 usable targets (some were too creased or torn) and set aside 11 of those to use as a test set to evaluate the app’s performance.

Images of used targets shot in different locations
Examples of suitable target images

Choosing an Algorithm

With my data in hand I was ready to start writing some code to process images into quantitative values, which meant choosing between a couple of diverging approaches. Either training a Convolutional Neural Network or a more manual image processing approach would work to calculate scores, but both options come with benefits and important limitations:

Algorithm | Pros | Cons
CNN | Probably less coding; high personal interest | Might need more data; only good for score data
Manual Processing | Needs less data; good for scores and positional data | Probably more coding; less sexy

Going with a neural network may have been difficult due to the small number of targets I had collected. Even though I could have bolstered the dataset by taking multiple photographs of each target from different angles and orientations, I’m still not sure I would have had enough to train a quality model. However, the real dealbreaker for me was that a CNN would not be able to provide me with shot coordinates, which I really wanted in order to help break down an archer’s inconsistencies. Heavily processing images with OpenCV was simply the better solution for my problem, no matter how much I would have liked to work with neural networks on this project.

Image Processing with OpenCV

OpenCV has a vast selection of image processing tools that can be intimidating at first glance, and I spent the first few days working with the library just learning what commands might prove useful. Between my own exploration and reading a few blogs, like the incredibly helpful PyImageSearch, I was able to put together a rough plan for deriving shot positions from targets. I needed to do the following:

  • Remove perspective skew to flatten target image
  • Standardize position and orientation of targets
  • Use blob detection to find shot holes

From that outline, I broke down the required work into several smaller steps:

  1. Import image and set color space (OpenCV imports color channels as BGR instead of RGB)
  2. Find target sheet corners
  3. Use corners to remove perspective skew, flattening target sheet
  4. Find scoring region circles on target
  5. Resize image into a square with circles at the center
  6. Partition image by background color
  7. Balance values of each partition to make holes stand out from the background
  8. Recombine image partitions into a single image
  9. Obscure logos at bottom of target sheet to hide them from blob detection
  10. Use blob detection to find holes
  11. Split up large blobs that are actually clusters of shots
  12. Calculate shot scores based on distance from center of target

Sample images from processing steps 1, 3, 5, and 9
Sample target at various stages of processing

You can check out larger versions of the images in the ‘shotplot.ipynb’ notebook in the project GitHub repo, which runs through the entire shot identification process. The actual code for the OpenCV processing lives in the script ‘target_reader.py’ so that it can be easily imported into both a notebook and the Flask app script.
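
To give a sense of what a couple of those steps look like in code, here is a heavily simplified sketch of the perspective flattening and blob detection stages. The function names, corner inputs, and detector thresholds are placeholders for illustration, not the actual values in ‘target_reader.py’:

import cv2
import numpy as np

def flatten_target(image, corners, size=800):
    # Remove perspective skew given the four target-sheet corners,
    # ordered top-left, top-right, bottom-right, bottom-left
    src = np.array(corners, dtype="float32")
    dst = np.array([[0, 0], [size - 1, 0],
                    [size - 1, size - 1], [0, size - 1]], dtype="float32")
    matrix = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, matrix, (size, size))

def find_shot_holes(gray):
    # Blob detection on a preprocessed grayscale image; area thresholds are
    # placeholders that would need tuning to the flattened image size
    params = cv2.SimpleBlobDetector_Params()
    params.filterByArea = True
    params.minArea = 20
    params.maxArea = 600
    detector = cv2.SimpleBlobDetector_create(params)
    keypoints = detector.detect(gray)
    return [kp.pt for kp in keypoints]  # (x, y) centers of detected holes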

Sample image of target with identified shots circled
Sample target with identified shots circled

Algorithm Performance

In order to evaluate the performance of my image processing code, I manually identified shots on my eleven-target test set and compared the results to what my algorithm found. I then compiled a confusion matrix and recall and precision values for the over 550 shots in the test set:

Metric | Score
Test Set Recall | .955
Test Set Precision | .983

Confusion Matrix | Not Labeled a Shot | Labeled a Shot
Not Actually a Shot | N/A* | 9
Actually a Shot | 25 | 530

* Too many to count

In practice, I was constantly testing the performance of my code against different targets in my ‘training’ set and making adjustments when necessary. I certainly became more diligent about testing after an early mishap resulted in my algorithm ‘overfitting’ a specific kind of target image and performing significantly worse against others. Another issue I encountered is the subjectivity of shot identification: determining how many shots created some holes is difficult if not impossible. Fortunately, manually identifying most shots is straightforward, so I do not think the evaluation statistics would change significantly based on another person’s shot identifications.

Detail images of shots that are difficult to identify
Examples of shots that are difficult to identify

The App

I built the app in Flask and relied heavily upon the D3.js library for visualization. This project was my first foray into D3 and I greatly valued the flexibility and customizability it offers. Other visualization software and libraries like Tableau and Matplotlib had less startup cost but couldn’t faithfully reproduce the clear vision I had in mind for the app. Using D3 also leaves open the possibility of adding interactive features to the charts themselves in future development.

Image of the ShotPlot app
Screenshot of the ShotPlot app

Conclusions

Overall I’m pleased with the results of my first attempt at implementing both computer vision and D3 visualization into a completed project. Although ShotPlot successfully identifies the vast majority of all shots, it does tend to miss a few that are clearly visible, which I’d like to address in future updates. I also removed some information by obscuring the logos at the bottom of the targets because parts of them were getting misidentified as shots. Ideally I’d like to find a better solution that counts shots in those areas and will be testing some alternatives like using template matching to isolate abnormalities that could be identified as shots. Along with performance improvements, I’m aiming to get the app working properly on smartphones since that is the platform on which it’s most likely to be used. I’d also like to expand the visualizations to really take advantage of D3’s ability to create interactive charts. My long-term goals for ShotPlot include adding analyses across multiple sessions and support for multiple target types and shooting styles.


Check out the full project on my GitHub

Boozehound Cocktail Recommender

This project was a labor of love for me since it combines two of my favorite things: data and cocktails. Ever since my wife signed us up for a drink-mixing class a few years ago I’ve been stirring up various concoctions, both established recipes and original creations. My goal was to create a simple app that would let anyone discover new drink recipes based upon their current favorites or just some descriptive words. Recipe books can be fantastic resources, but I often just want something that tastes like another drink but with a twist, or that combines certain flavors with a specific spirit, and that’s where a table of contents fails. While I never thought of Boozehound as a replacement for my favorite recipe books, I was hoping it could serve as an effective alternative when I just don’t have the patience to thumb through dozens of recipes to find what I want to make.

Photo of a book and index cards containing cocktail recipes

Data Collection

Because I wanted Boozehound to work with descriptions and not just cocktail and spirit names, I knew I would be relying upon Natural Language Processing and would need a fair amount of descriptive text with which to work. I also wanted the app to look good, so I needed to get my recipes from a resource that also has images for each drink. I started by scraping the well-designed Liquor.com, which has a ton of great recipes and excellent photos. Unfortunately the site has extremely inconsistent write-ups on each cocktail: some are paragraphs long and others only a sentence or two. I wanted more consistent, longer drink descriptions, and I found them at The Spruce Eats, which I scraped using BeautifulSoup in the ‘scrape_spruce_eats’ notebook on my project GitHub repo. The Spruce Eats doesn’t have the greatest list of recipes, but I was still able to collect roughly 980 separate drink entries from the site, each with an ingredient list, description, and an image URL.

Text Pre-Processing

After getting all of my cocktail recipe data into a Pandas DataFrame, I still needed to format my corpus to prepare it for modeling. I used the SpaCy library to lemmatize words and keep only the nouns and adjectives. SpaCy is both fast and easy to use, which made it ideal for my relatively simple pre-processing. I then used scikit-learn’s TF-IDF implementation to create a matrix of each recipe’s word frequency vector. I chose TF-IDF over other vectorizers because it accounts for word count disparities and some of my drink descriptions are twice as long as others. The app also runs user search strings through the same process and those should certainly be shorter than any cocktail description. My pre-processing and modeling work is stored in the ‘model_spruce_eats’ notebook.
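
Here’s a rough sketch of that pre-processing flow, using a couple of made-up descriptions in place of the scraped corpus (the actual notebook’s variable names and vectorizer settings may differ):

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def lemmatize_nouns_adjs(text):
    # Keep only noun and adjective lemmas from a drink description
    doc = nlp(text)
    return " ".join(tok.lemma_ for tok in doc if tok.pos_ in ("NOUN", "ADJ"))

descriptions = [
    "A refreshing tequila cocktail with fresh lime and a touch of agave.",
    "A smoky mezcal riff on the classic margarita.",
]
cleaned = [lemmatize_nouns_adjs(d) for d in descriptions]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(cleaned)  # one row per recipe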

Models

Building my initial model was a fairly simple process of dimension reduction and distance calculations. Since I had a TF-IDF matrix with significantly more columns than rows, I needed some way of condensing those features. I tried a few solutions, and Non-Negative Matrix Factorization (NMF) gave me the most sensible groupings. From there I just calculated pairwise Euclidean distances between the NMF description vectors and a similarly vectorized search string. There was just one problem: my model was such a good recommender that it was way too boring. It relied far too heavily on the frequency of cocktail and spirit names in determining similarity, so a search for margarita would just return ten different variations of margaritas. To make my model more interesting I created a second model that is only able to use words from the description that are not the names of drinks or spirits. Both models are then blended together, which the user can control through the Safe-Weird slider in the Boozehound app.

The image below gives an example of how both models work. Drink and spirit names dominate the descriptions, so the Safe model in the top half has a good chance of connecting words like tequila. The Weird model on the bottom has to make connections using other words, in this case refreshing. Because it has less data with which to work, the Weird model tends to make less relevant recommendations, but they’re often more interesting.

Chart showing an example of Safe and Weird model word similarities
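
Building on the sketch above, the recommendation step might look roughly like this; the number of latent topics, the distance metric, and the blending are all simplified stand-ins for what’s in the ‘model_spruce_eats’ notebook:

from sklearn.decomposition import NMF
from sklearn.metrics import pairwise_distances

# Reduce the TF-IDF matrix to a few latent 'flavor' topics
# (only 2 here to fit the toy corpus; a real model would use more)
nmf = NMF(n_components=2, init="random", random_state=42)
recipe_topics = nmf.fit_transform(tfidf_matrix)  # one topic vector per recipe

def recommend(query, top_n=10):
    # Return recipe indices closest to a free-text search string
    query_vec = vectorizer.transform([lemmatize_nouns_adjs(query)])
    query_topics = nmf.transform(query_vec)
    dists = pairwise_distances(query_topics, recipe_topics, metric="euclidean")
    return dists.ravel().argsort()[:top_n]

print(recommend("refreshing tequila drink"))

# The 'Weird' model repeats this with drink and spirit names stripped from the
# vocabulary; the Safe-Weird slider blends the two distance rankings.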

The App

I built the app in Flask and wanted a simple, clean aesthetic reminiscent of a classy cocktail bar. As a result I spent just as much time using CSS to stylize content as I did just getting the app to work properly. Luckily I enjoy design and layout so it was a real pleasure seeing everything slowly come together. My Metis instructors would often bring up the idea of completing the story and to me this project would have been incomplete without a concise and visually-pleasing presentation of recipe recommendations.

Picture of the Boozehound app

Conclusions

While I am pleased with the final product I presented in class at Metis, there’s a lot I still want to address with the Boozehound app. Some of the searches I tested returned odd results that could be improved upon. I’d also like to add some UX improvements like helpful feature descriptions and some initial search suggestions for users who don’t know what they want. Another planned feature is a one-click search button to allow the user to find recipes similar to a drink that shows up as a recommendation without having to type it into the search bar. Boozehound is all about exploration and I want to make rifling through a bunch of new recipes as easy as possible.


Check out the full project on my GitHub

Kaggle Instacart Classification

I built models to classify whether or not items in a user’s order history will be in their most recent order, basically recreating the Kaggle Instacart Market Basket Analysis Competition. Because the full dataset was too large to work with on my older Macbook, I loaded the data into a SQL database on an AWS EC2 instance. This setup allowed me to easily query subsets of the data in order to do all of my preliminary development. I then cleaned up my work and wrote it into a script called ‘build_models.py’ that can be easily run through a notebook or the command line. Once I was ready to scale up to the full dataset, I simply ran the build_models script on a 2XL EC2 instance and brought the resulting models back into my ‘kaggle_instacart’ notebook for test set evaluation.
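
As a rough illustration of that workflow (the connection string, table names, and sampling rule below are placeholders, not my actual schema):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for a PostgreSQL database on an EC2 instance
engine = create_engine("postgresql://user:password@ec2-host:5432/instacart")

# Pull a small, repeatable subset of users for local development
sample = pd.read_sql(
    """
    SELECT o.user_id, o.order_number, op.product_id
    FROM orders o
    JOIN order_products_prior op ON o.order_id = op.order_id
    WHERE o.user_id % 100 = 0   -- roughly a 1% sample of users
    """,
    engine,
)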

Feature Engineering

I spent the majority of my time on this project engineering features from the basic dataset. After creating several features, I tested different combinations of them on a small subset of the data in order to eliminate any that seemed to have no effect on model prediction. After paring down features, I ended up training and testing my final models on the following predictors (a rough sketch of computing a couple of them follows the list):

  • percent_in_user_orders: Percent of a user’s orders in which an item appears
  • percent_in_all_orders: Percent of all orders in which an item appears
  • in_last_cart: 1 if an item appears in a user’s most recent prior order, 0 if not
  • in_last_five: Number of orders in a user’s five most recent prior orders in which an item appears
  • total_user_orders: Total number of previous orders placed by a user
  • mean_orders_between: Average number of orders between appearances of an item in a user’s orders
  • mean_days_between: Average number of days between appearances of an item in a user’s orders
  • orders_since_newest: Number of orders between the last user order containing an item and the most recent order
  • days_since_newest: Number of days between the last user order containing an item and the most recent order
  • product_reorder_proba: Probability that any user reorders an item
  • user_reorder_proba: Probability that a user reorders any item
  • mean_cart_size: Average user cart (aka order) size
  • mean_cart_percentile: Average percentile of user cart add order for an item
  • mean_hour_of_week: Average hour of the week that a user orders an item (168 hours in a week)
  • newest_cart_size: Number of items in the most recent cart
  • newest_hour_of_week: Hour of the week that the most recent order was placed
  • cart_size_difference: Absolute value of the difference between the average size of the orders containing an item and the size of the most recent order
  • hour_of_week_difference: Absolute value of the difference between the average hour of the week in which a user purchases an item and the hour of the week of the most recent order
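
Here’s a rough sketch of how a couple of the simpler features above could be computed with pandas, assuming a prior-orders table with user_id, order_number, and product_id columns (the real ‘build_models.py’ covers many more cases):

import pandas as pd

def user_product_features(priors):
    # priors: one row per (user, order, product) from the prior-orders table
    user_orders = (priors.groupby("user_id")["order_number"].max()
                         .rename("total_user_orders").reset_index())

    up = (priors.groupby(["user_id", "product_id"])["order_number"]
                .agg(times_ordered="count", last_ordered="max")
                .reset_index()
                .merge(user_orders, on="user_id"))

    # percent_in_user_orders: share of a user's orders that contain the item
    up["percent_in_user_orders"] = up["times_ordered"] / up["total_user_orders"]

    # in_last_cart: 1 if the item appeared in the user's most recent prior order
    up["in_last_cart"] = (up["last_ordered"] == up["total_user_orders"]).astype(int)
    return up

priors = pd.DataFrame({
    "user_id":      [1, 1, 1, 2, 2],
    "order_number": [1, 2, 2, 1, 1],
    "product_id":   [10, 10, 20, 10, 30],
})
print(user_product_features(priors))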

Models

In my preliminary tests using subsets of the Instacart data, I trained a number of different models: logistic regression, gradient boosting decision trees, random forest, and KNN. After several rounds of testing, I took the two that performed best, logistic regression and gradient boosting trees, and trained them on the full data set, minus a holdout test set. I used F1 score as my evaluation metric because I wanted the models to balance precision and recall in predicting which previously ordered items would appear in the newest orders. To account for the large class imbalance caused by the majority of previously ordered items not being in the most recent orders, I created adjusted probability threshold F1 scores as well. The scores below treat each dataframe row, which represents an item ordered by a specific user, as a separate, equally-weighted entity. Both models performed similarly, with the gradient boosting trees classifier achieving slightly higher scores:

Model | Raw F1 Score | Adjusted F1 Score
Logistic Regression | 0.313 | 0.447
Gradient Boosting Trees | 0.338 | 0.461
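
For reference, here’s a minimal sketch of the threshold adjustment described above; the search grid and toy data are illustrative only:

import numpy as np
from sklearn.metrics import f1_score

def best_threshold_f1(y_true, y_proba, thresholds=np.arange(0.05, 0.95, 0.05)):
    # Try a range of probability cutoffs and return (best F1, best threshold).
    # Lowering the cutoff below 0.5 helps when the positive class
    # (reordered items) is heavily outnumbered.
    scores = [(f1_score(y_true, (y_proba >= t).astype(int)), t) for t in thresholds]
    return max(scores)

y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0])
y_proba = np.array([0.10, 0.40, 0.20, 0.48, 0.05, 0.70, 0.30, 0.20])
print(best_threshold_f1(y_true, y_proba))  # the best cutoff here lands below 0.5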

I also calculated mean per-user F1 scores that more closely match the metric of the original Kaggle contest. If either model were incorporated into a recommendation engine the user-based metric would better represent its performance. In these F1 scores, model performance is virtually identical:

Model | Per-User F1 Score
Logistic Regression | 0.367
Gradient Boosting Trees | 0.368

Results

The charts below show the most influential predictors and their respective coefficient values for each model. The logistic regression model relies heavily upon information about the size of the most recent cart, while the gradient boosting decision trees model gives far more weight to the contents of a user’s previous orders. If information about the most recent cart were not available, the gradient boosting model would most likely outperform the logistic regression model.

logistic regression model coefficients
gradient boosting decision trees model coefficients

Conclusions

This project was all about feature creation – the more features I engineered the better my models performed. At Metis I had a pretty tight deadline to get everything done and as a result did not incorporate all of the predictors I wanted to. I plan to eventually circle back and add more, including implementing some ideas from the Kaggle contest winners.


Check out the full project on my GitHub

Predicting NHL Injuries

I’m a huge fan of both hockey and its associated analytics movement, so I wanted to work with NHL data while finding a relatively uncovered angle. I also wanted to put some of the conventional wisdom of the league to the test:

  • Do smaller and lighter players get injured more?
  • Is being injury-prone a real thing?
  • Are older players more likely to get hurt?
  • Are players from certain countries more or less likely to get injured? (European players were once considered "soft" and Alex Ovechkin once said, "Russian machine never breaks")

To help answer these questions, I built a model to predict how many games an NHL player will miss in a season due to injury. I then examined the model’s coefficient values to see if I could gain any insights into my injury questions.

Data Collection

Counting stats for sports are easy to come by and hockey is no different – I was able to download CSVs of all player stats and biometrics for the last ten NHL seasons from Natural Stat Trick. I combined the separate Natural Stat Trick datasets in the ‘player_nst_data’ notebook in my project Github repo. Unfortunately reliable player injury histories are much more difficult to come by. I was able to scrape lists of injuries from individual player profiles from the Canadian sports site TSN using Selenium and BeautifulSoup. All injury scraping and parsing is contained in the ‘player_injury_data’ notebook. While I don’t believe the TSN data is an exhaustive list of player injuries it is the best I could find so that’s what I used.

Example image of TSN player profile
An example of a TSN player profile page containing injury data

Feature Selection/Engineering

Mostly due to the amazing amount of counting stats Natural Stat Trick aggregates, I had an abundance of potential features for my models. I relied on my domain knowledge as a longtime hockey fan to whittle down the list to anything I thought could correlate with injury rates, as well as predictors to test the conventional wisdom of what causes players to get hurt. I removed goaltenders from my data set because their counting stats are completely different from those of skaters. I also utilized sklearn’s polynomial features to account for feature interactions (a brief sketch of this step follows the list). Here is a partial list of individual features and my logic for choosing them:

  • Games Missed Due to Injury: what I’m trying to predict
  • Height/Weight: to see if smaller/lighter players get injured more often
  • Position: defensemen play more minutes and are more likely to block shots than other positions, wingers and defensemen are generally more likely than centers to engage in physical play along the boards
  • Penalties Drawn: penalties are often assessed when a player is injured as the result of a dangerous and illegal play
  • Hits Delivered/Hits Taken: an indicator of physical play that could lead to more injuries
  • Shots Blocked: players sometimes suffer contusions and break bones blocking shots
  • Age: to see if older players get injured more often
  • Major Penalties: majors are often assessed for fighting
  • Being European/Being Russian: to see if either correlates with increased injury rates
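
Here’s a quick sketch of the interaction-term step mentioned above, using a tiny made-up feature table; whether the real model kept squared terms as well as interactions is a detail I’m glossing over:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Toy stand-in for the per-season skater features
X = pd.DataFrame({
    "age": [24, 31, 28],
    "hits_taken": [120, 85, 160],
    "avg_prev_games_missed": [2.0, 10.5, 0.0],
})

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = pd.DataFrame(
    poly.fit_transform(X),
    columns=poly.get_feature_names_out(X.columns),  # e.g. 'age avg_prev_games_missed'
    index=X.index,
)
print(X_poly.head())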

Additional Feature Engineering

I started off with some simple models because I wanted to evaluate what formulation of my data would work best before spending my time tweaking hyperparameters. I fit an Ordinary Least Squares regression on each of the following data sets:

  • Each entry contains the total counting stats and games missed due to injury for the last ten seasons.
  • Each entry contains the counting stats, games missed due to injury, and games missed last season for a single season for a player. In this and the following format, players can have multiple rows of data.
  • Similar to the last format except it includes rolling averages of counting stats and games missed for all previous seasons.

I created each of these data formulations in the ‘data_merge’ notebook. Unsurprisingly, the last and most robust data set resulted in the lowest test MSE so I used that formulation of my data for the final modeling. All modeling lives in the ‘nhl_injuries’ notebook.

EDA

Exploratory Data Analysis confirmed some of my assumptions going in – mainly that injuries are largely random and unpredictable. The heatmap below shows very low correlation between any of my predictors and the response.

Predictor/Response Correlation Heatmap

I also created a histogram of games missed due to injury per season that shows most players miss little or no time. As a result the distribution of games missed has an extreme right skew.

Games Missed Due to Injury Histogram

Models

Before training my final models I limited the data set to only players who had played at least 50 career games, which mainly removed career minor league players who have spotty NHL statistics. I standard-scaled all values, created a train/test split of my data set of 7,851 player-seasons, and trained a Lasso linear regression model with polynomial features and a random forest regressor model. The test R2 values for both models were low, indicating that the predictors did not explain much of the variation in the response in either model.

Model | R2
Lasso | 0.075
Random Forest | 0.107
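
For completeness, here’s a condensed sketch of that modeling setup on synthetic stand-in data; LassoCV stands in for however the regularization strength was actually chosen, and the hyperparameters are illustrative:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))   # stand-in for the polynomial feature matrix
y_demo = rng.poisson(3, size=200)     # stand-in for games missed due to injury

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

lasso = make_pipeline(StandardScaler(), LassoCV(cv=5))
lasso.fit(X_train, y_train)
print("Lasso test R2:", lasso.score(X_test, y_test))

forest = RandomForestRegressor(n_estimators=300, random_state=42)
forest.fit(X_train, y_train)
print("Random forest test R2:", forest.score(X_test, y_test))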

Results

Since the Lasso model performed similarly to the random forest model while maintaining greater interpretability, I used it for all subsequent analyses. I graphed the ten most important predictors for the Lasso model along with their coefficient values. Although the model didn’t have great predictive value I still wanted to determine which inputs had the most influence.

Predictor Coefficients

Using the chart above I was able to get some answers to my initial questions:

  • Height and weight don’t appear anywhere near the top of the predictor list, meaning my model finds little evidence that smaller or lighter players get hurt more
  • Average Previous Games Missed and Last Games Missed appear frequently in the top 10, lending credence to some players being injury-prone
  • Age times Average Previous Games Missed is the fourth-highest predictor coefficient, providing some very weak evidence of older players being more likely to get injured
  • Nationality appears to have almost no importance to the model, save for Russian defensemen having a small positive correlation with games missed due to injury

Conclusions

In many ways the results of this project confirmed my pre-existing beliefs, especially that NHL injuries are largely random events that are difficult to neatly capture in a model. Because I started off assuming a model would not have great predictive quality, I felt obligated to create the best model I could to account for my bias. I also expected to find little to no evidence of players being injury prone, but my model presents evidence that players injured in the past are more likely to get hurt in the future. While I still believe injuries are too random to be reliably predicted by a model, I do think my model can be improved with more and better data. More complete NHL and minor league injury data would fill a lot of gaps in the model, especially for players who are not NHL stalwarts. New NHL player tracking data could potentially provide input values with far more predictive value than traditional counting stats. I’m excited to see what future insights can be gleaned as we continue to collect increasingly more detailed sports data.


Check out the full project on my GitHub

Metis Pre-Bootcamp and Week 1

This is the second post in a series on my experiences attending the Metis immersive Data Science course in New York City. My previous post covered the application process. My subsequent post covers Weeks 2-12. While I cannot guarantee anyone else’s bootcamp will be the same as mine, I hope you find these posts as useful as I found other blogs when I was researching data science programs.


Although Week 1 is the official start of bootcamp, most programs, including Metis, assign a plethora of readings and assignments to be completed before the first day. The content of this material seems fairly similar across different programs: schools want students to be capable of writing basic Python scripts and understanding foundational concepts of linear algebra, calculus, statistics, and probability. I can’t speak for all bootcamps, but if you show up for your first day at Metis able to do all of the following, you should be fine (a toy snippet follows the list):

  • Write a working Python function
  • Multiply matrices
  • Find a derivative and integral for a simple function
  • Calculate summary statistics like variance and standard deviation
  • Work through simple conditional probability problems
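
To give a rough sense of the level involved, something like the snippet below (a toy illustration, not actual pre-work material) should feel comfortable by Day 1:

import numpy as np

def standardize(values):
    # Return z-scores for a list of numbers
    arr = np.array(values, dtype=float)
    return (arr - arr.mean()) / arr.std()

# Matrix multiplication with NumPy
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A @ B)

print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))  # this list has variance 4, std 2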

Even if you aren’t able to cover all of these topics you’ll probably be all right, as most of them are reviewed in lectures, albeit briefly. In practice, software takes care of much of the rote calculation you’ll need to perform for any of the math you learn, but being able to properly interpret results will always require some conceptual understanding. The single most important skill you’ll utilize at bootcamp is Python coding; at Metis it is the backbone of every project and most assignments. The more comfortable you are in your coding skills, the more comfortable you’ll be with the bootcamp workload.

The exact format of Metis’ pre-course work may well change by the time you read this, but for the Spring 2019 cohort it was primarily composed of numerous short lessons contained in a GitHub repository, assigned readings from Allen Downey’s ‘Think Stats,’ and a few dozen math and coding challenges on HackerRank. The Metis repository also contained plenty of extra work that delved deeper into a variety of subjects but was not required to complete any challenges. Metis sends pre-bootcamp work to students a little over two months before class starts and their staff claims the required portions take about 80 hours to complete, which I found to be a reasonable estimate. Work is submitted and checked, but as far as I know you won’t get ousted from the program if you don’t complete it. That said, I recommend doing all pre-bootcamp work and then some to arrive fully prepared for the rigors of the course. Bootcamps, by the nature of their brevity, do not afford instructors the time to cover most topics in depth. You will also not have the time to comprehensively study all of the material on your own, especially if you’ll be seeing much of it for the first time. Investing your time in studying before beginning a bootcamp can help you get the most out of it by allowing you to spend lectures clarifying and contextualizing concepts instead of constantly grappling with completely unfamiliar ideas. While simultaneously managing the demands of the pre-course work and full-time employment is certainly feasible, I recommend leaving your job at least a week before starting bootcamp if possible.

As soon as you start the Metis bootcamp, you’ll be immediately thrust into a new routine of lectures, short assignments, and project work. I devoted much of this post to the importance of the pre-course work because there is no adjustment period at Metis: the first project is assigned on Day 1 and due at the end of the week. You’ll work in a team, but completing and presenting exploratory data analysis when you’re relatively new to data science can still be a daunting task. Lectures and assignments during Week 1 focus on developing all the skills necessary to finish your first project, and in that way the first week serves as a microcosm of the course as a whole. At Metis you’ll almost always be working on a project; instructors assign new projects the day after the cohort presents their current ones. Lectures leading up to project delivery days are geared towards giving you the skills to complete the current project and tend to front-load vital information so you’ll have enough time to actually use it. For my cohort the group project required using MTA turnstile entry/exit data to solve a hypothetical business problem. My group combined the requisite data set with NYC liquor license information to determine the optimal areas of the city in which to open a new bar, preferably an area with high evening and weekend foot traffic but relatively few existing competitors. Cobbling together a project with people you just met is a great way to instill camaraderie and make everyone feel like they are doing the work of a data scientist. Looking back I feel like we barely accomplished anything on our first project, but that’s more of a testament to how much more material we’ve covered in the weeks since.

Bar Map
High traffic areas with high bar density (Left) and low bar density (Right)

After my first week at Metis I felt comfortable with the daily routine and project cycles, largely due to working up to that first presentation at such a fast pace. More than anything else that first assignment represented a complete project we were able to credibly present to others, establishing a standard that defined expectations for all subsequent projects. Metis instructors and staff will often say that the first week of bootcamp is the most difficult, and I believe that to be accurate in the sense that establishing a routine is more difficult than maintaining it. Metis throws you into data science work right away and by the time you finish that first project you’re that much more ready to face the challenges each subsequent week has in store.


TL;DR

  • The more you prepare before bootcamp the better you’ll be able to keep up with the pace of the course
  • Metis ramps up very quickly: the first project is due at the end of Week 1

Data Science Bootcamp Applications

This is the first post in a series on my experiences attending the Metis immersive Data Science course in New York City. My subsequent posts cover pre-bootcamp and Week 1 and Weeks 2-12. While I cannot guarantee anyone else’s bootcamp will be the same as mine, I hope you find these posts as useful as I found other blogs when I was researching data science programs.


I distinctly remember that filling out my first bootcamp application was the moment in my journey towards data science when I thought, this is really happening. Although I had been planning to leave my job for a little while and had been learning some data science fundamentals through MOOCs beforehand, applying felt like my first serious step in changing careers. In determining to which immersive programs I wanted to apply, I found the personal testimonials of blogs to be among the most helpful resources, more candid than program sales pitches and far more in-depth than most reviews on SwitchUp and Course Report. Given what a gigantic leap applying felt like to me, I am a bit surprised I did not find more posts about the admissions process, and that is what motivated me to write about my experience.

For the sake of brevity I won’t delve much into the details of test questions, interviews, or programming challenges, especially since most of that information is readily available online. If a particular data science program doesn’t describe their admissions process on their website you should be able to get a full rundown of entrance requirements by emailing them. I will also add the caveat that my admissions experiences are limited to just two in-person, immersive programs in New York City, Galvanize and Metis, which represent only a fraction of the full breadth of data science classes. I won’t get into why I picked Metis here, but will likely write about my decision in a future post. While I researched as many options as I could find, I eventually settled on applying to two schools, a limit I recommend not exceeding if you will be applying simultaneously and are working full-time. Data science bootcamps generally have a limited number of spots per session and use the admissions process to find the most suitable candidates. It’s important to know that programs vary significantly in the difficulty and length of their admissions processes; some programs only require a couple of online interviews while others, like Galvanize and Metis, employ far more extensive screening.

Initial applications tend to be fairly uniform, consisting of an online form asking who you are, where you’re from, why you want to study data science, and when you want to start, all (hopefully) easy questions that shouldn’t require more than an hour or two to answer. Submitting an application puts into motion a time-sensitive process that involves completing programming challenges, taking timed online quizzes, and interviewing with school staff through video chat. While you may have some flexibility in scheduling when you undertake these tests, you probably won’t have the time to familiarize yourself with a topic that is completely new to you. Fortunately, schools will usually tell you exactly what they want you to know before you apply and some, like Metis, even have practice assessments you can use to gauge your readiness for admissions. I highly recommend studying topics like linear algebra, calculus, statistics, and scripting (generally coding challenges are in Python) and evaluating your progress with practice quizzes before you apply. Thanks to my preparation, I found both the Galvanize and Metis admissions tests to be challenging but well within my understanding of the material. While I had plenty of prior experience with Python, I had not studied any of the math topics in many years and would have fared far worse on those sections had I not spent a considerable amount of time reacquainting myself with them. If you also feel the need to review the math or statistics fundamentals, I recommend the sites Math is Fun and Mathopolis, which work in tandem to provide simple lessons and corresponding challenges to test your knowledge. For a more advanced understanding of math topics, check out 3Blue1Brown on YouTube, which includes entire series on linear algebra and calculus. A great book for beginners to coding is Learn Python the Hard Way, which should more than prepare you for admissions coding challenges. I utilized all of these resources and found them to be extremely helpful, but by no means is the list exhaustive.

Even though I felt prepared for admissions, I was surprised by the speed and intensity of the whole process. While I was successful in finishing the bulk of both applications on weeknights after work, in retrospect, waiting for the weekend would have been a smarter approach. Trying to complete a 48-hour coding challenge in the free time a full-time job affords is certainly not impossible, but it can add a good deal of unnecessary stress. Saving assessments for the weekend would have allowed more uninterrupted work time and fewer lost hours of sleep. For both Galvanize and Metis, successfully negotiating the online tests and coding challenges ushers in the final round of admissions testing: one or more online interviews with data scientists affiliated with the school. The interviewers are there to serve as a hybrid of test administrator and benevolent guide if you get stuck. I found the interviews to be friendly and casual, but simultaneously nerve-wracking, and recommend studying any suggested materials and more beforehand, particularly if you don’t like being asked to solve problems on the spot. I spent an estimated ten hours working through each program’s admission process from start to finish, not counting time spent studying.

Once you complete all phases of the application process you may have to wait up to a week to hear back from the school’s admissions office with your results. If you are admitted you’ll have to sign an agreement and make a deposit, usually within a week, in order to secure your seat in a cohort. I paid my deposit a little less than four weeks after filling out my first application, but that timeline could certainly have been made shorter if I hadn’t been simultaneously navigating two admissions processes. An advantage to applying to multiple schools at the same time is that if you get admitted to more than one you’ll have the luxury of being able to pick whichever you think is best. Alternatively, if you apply early enough and don’t get into your first choice of program you can try to get into a different school that has a cohort starting around the same time. I have also heard from admissions officers at various schools that early applicants have a better chance of being accepted due to class sessions having more open spots and fewer candidates farther out from their start dates. Some programs require several hours of work to be completed before the first day of class, and early admission affords you a head start on any pre-bootcamp assignments or extra studying you might want to undertake.

Hopefully this post sheds some light on the data science bootcamp admissions process and gives you an idea of what to expect if you decide to apply to one. If you want more details or have any other questions, feel free to email me at lukaswadya@gmail.com. Thanks for reading!


TL;DR

  • Admissions can take a while, consider limiting your number of applications if you’re working full-time
  • Study before applying because you may not have enough time to do so once you start the process
  • Applying early gives you more flexibility and may make getting admitted to a competitive program easier