Monthly ML: Data Science & Machine Learning Wrap-Up — January 2020
Thanks for checking out my Data Science and Machine Learning blog series. Every month, I’ll be highlighting new topics that I’ve explored, projects I’ve completed and helpful resources that I’ve used along the way. I’ll also be talking about pitfalls, including diversions, wrong turns and struggles. By sharing my experiences, I hope to inspire and help you on your path to becoming a better data scientist.
Let’s get into it.
With busy season over at work, I was determined to dive right back into my data science and machine learning quest.
I’d spent most of November and December working through UC San Diego’s Probability and Statistics course and I was eager to transition to something practical. I remembered the Titanic challenge on Kaggle, the data science community’s equivalent of “FizzBuzz”, and I decided to try it out.
What I learned from Titanic surprised me at first. After initial EDA, I was feeling pretty smug. “This problem seems relatively easy, I’ll just build a couple models, tune them, ensemble them and call it a day.” How wrong I was.
I tried numerous variations of this approach with ever-increasing complexity, and my Kaggle leaderboard results actually got worse. Increasingly frustrated, I stopped coding and started reading. It was at this point that I found an awesome Kaggle notebook purporting to score an impressive 82% using only one feature from the dataset. What? Digging in, I discovered that the author had applied domain knowledge to create a simple heuristic algorithm that put my sophisticated ML attempts to shame.
So I decided to implement a similar heuristic model in Python (the original notebook was written in R) and attempt to improve upon it. In doing so, I came to appreciate how important it is to understand the problem at hand, and how domain knowledge can trump hours of hyperparameter tuning.
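To give a flavor of what a domain-knowledge baseline looks like, here is a minimal sketch in the same spirit (not the notebook’s exact logic): the crudest version of “women and children first” is to predict survival from the Sex column alone. The file names assume Kaggle’s Titanic data.

```python
import pandas as pd

# Crude domain-knowledge baseline: predict survival from Sex alone.
# Assumes Kaggle's Titanic test.csv is in the working directory.
test = pd.read_csv("test.csv")

submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),  # 1 = survived
})
submission.to_csv("gender_baseline.csv", index=False)
```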
Moral of the Titanic story? Complexity does not guarantee success. When progress slows, take a step back and re-evaluate your approach. And if you haven’t worked with the Titanic dataset, I definitely recommend it.
By mid-January I was looking for a new project, and I decided to pivot into the field of natural language processing. Up to this point, my machine learning experience was purely classification or regression using numerical and categorical features. The vanilla stuff. NLP seemed…exotic? Almost like something out of a science fiction book (I know, it’s not nearly that glamorous). Still, say what you will, creating a model that can parse ambiguous human language is pretty amazing. Conveniently, I’d just received an email announcing the launch of a beginner’s Kaggle competition to make predictions using tweet content. Bingo. I had my project.
Disaster Tweets took me down an interesting path. I learned about different EDA methods that can be applied to textual content. I also learned about the basic pipeline for converting text into vector representations that can be passed into a model. Context is king in NLP, as my early (naïve) “bag of words” attempts proved. Ultimately, I figured out how to use the incredible open-source model BERT with a single-layer neural network on top to produce decent leaderboard results. It was also my first time running up against machine performance issues: despite my best efforts to configure Keras to run on my desktop GPU, I ended up running BERT within a GPU-enabled Kaggle kernel.
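For anyone curious what the naïve starting point looks like, here is a rough sketch of a bag-of-words baseline using scikit-learn. The column names and F1 scoring assume the competition’s train.csv; this is illustrative rather than my actual notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Each tweet becomes a sparse vector of token counts --
# no notion of word order or context, hence "bag of words".
train = pd.read_csv("train.csv")  # assumes 'text' and 'target' columns

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(train["text"])
y = train["target"]

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5, scoring="f1").mean())
```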
Getting bogged down in jargon was the biggest struggle I had when working on the Disaster Tweets challenge. I highly recommend checking out Jay Alammar’s article if you are trying to get into NLP; it breaks down all the lingo.
In summary, I really enjoyed working on NLP, but I found that my limited knowledge of neural networks was preventing me from fully understanding more advanced models and research papers. This is a topic that I will return to in the future, once I have more NN experience under my belt.
Toward the end of the month, I re-discovered web development with Flask, motivated by an idea that I’ve been developing for a content-sharing platform. This was my first time working with Flask, and I was stunned by how quickly I could get something up and running. What would have taken hours of configuration using the familiar PHP/MySQL/Bootstrap stack was painless with SQLAlchemy/WTForms/Flask-Login. I am still building out this project with the goal of launching a working prototype. After that, I’ll see whether the creative juices keep flowing.
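As an illustration of that “up and running quickly” feeling, here is a minimal sketch of a Flask app backed by Flask-SQLAlchemy. The model and route are placeholders, not the actual content-sharing project.

```python
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

# Minimal Flask + Flask-SQLAlchemy setup; the Post model is a stand-in.
app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///app.db"
db = SQLAlchemy(app)

class Post(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(120), nullable=False)

@app.route("/")
def index():
    posts = Post.query.all()
    return ", ".join(p.title for p in posts) or "No posts yet."

if __name__ == "__main__":
    with app.app_context():
        db.create_all()  # creates app.db with the Post table
    app.run(debug=True)
```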
During my Flask experimentation period, I relied heavily on Pretty Printed and HackersAndSlackers, both of which provided invaluable tips for beginners. In my opinion, Flask’s simplicity is both its biggest draw and its biggest challenge. With so many YouTube tutorials doing basically the same thing in slightly different ways, it was difficult to know how best to structure the application I wanted to build.
Other new topics learned during this time:
· Bootstrap 4 — A big step up from version 3, which I worked with as a freelancer. This is an incredible framework and I highly recommend building a quick site to play around with the out-of-the-box functionality.
· Virtualenv — Package management is hard. In fact, I’ve avoided sinking my teeth into projects simply because I struggled to get all of the dependencies straight. Virtualenv gives you an isolated environment per project with a single command, making dependency dread a thing of the past.
· SQLAlchemy — WOW. I had worked on projects that used SQLAlchemy in the past, but never appreciated how beautifully simple it is to interface with SQLite databases via Python classes (see the short sketch after this list).
· Heroku — Painless deployment from the command line.
· Git — Well, this one isn’t new, but I learned a few advanced commands that I hadn’t had to use in my typical development pipeline. Still not fully “comfortable” rolling back changes, but that’s why Stack Overflow exists.
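Here is the SQLAlchemy sketch promised above: a plain Python class mapped to a SQLite table, with a query as a simple method chain. The table and data are made up for illustration.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, Session

# A plain Python class becomes a SQLite table.
Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)

engine = create_engine("sqlite:///demo.db")
Base.metadata.create_all(engine)

# Insert a row and query it back.
with Session(engine) as session:
    session.add(User(name="Ada"))
    session.commit()
    print(session.query(User).filter_by(name="Ada").first().name)
```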
Even though I was still playing around with Flask, I wanted to stay active on Kaggle and continue to build a portfolio of high-quality kernels. I decided to join the House Prices: Advanced Regression Techniques competition, which uses the famous Ames, Iowa housing dataset. This is a “grown-up” machine learning challenge that builds upon foundational topics without doing the hard stuff for you: there are missing values galore, duplicative features and outliers. I was instantly drawn to it because I’m a sucker for some good EDA. I look forward to working with this dataset and will publish my EDA to Kaggle sometime in early February.
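A first pass at that EDA will probably start with a missing-value audit along these lines (a sketch, assuming the competition’s train.csv):

```python
import pandas as pd

# Count and rank missing values per column, with percentages.
train = pd.read_csv("train.csv")

missing = (
    train.isnull().sum()
    .loc[lambda s: s > 0]                 # keep only columns with gaps
    .sort_values(ascending=False)
    .to_frame("n_missing")
    .assign(pct=lambda d: 100 * d["n_missing"] / len(train))
)
print(missing)
```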
With that, January is a wrap! Thanks again for reading and, until next time, happy coding.