Monthly ML: Data Science & Machine Learning Wrap-Up — February 2020
Thanks for checking out my Data Science and Machine Learning blog series. Every month, I’ll be highlighting new topics that I’ve explored, projects I’ve completed and helpful resources that I’ve used along the way. I’ll also be talking about pitfalls, including diversions, wrong turns and struggles. By sharing my experiences, I hope to inspire and help you on your path to become a better data scientist.
Hello everyone, I hope you had a productive February. Before I jump in, I wanted to invite you to follow my #100DaysOfCode journey on Twitter. Inspired by others, I have pledged to commit time every day to improving my data science and programming skills. I can’t say for sure where this adventure will take me, but the support so far has been overwhelming. If you’re on Twitter, come say hi! Or even better, join me 🙂
With that, let’s get into it.
Having spent a good part of January working on classification problems (first on Titanic survival predictions and then on Disaster Tweet categorization), I was eager to try my hand at regression. The Kaggle House Prices challenge stood out to me because of the complexity of the dataset — with 80 features — relative to the small number of training examples — fewer than 1,500. Without care, this combination can lead to overfitting, where models are too closely geared to predicting training data and are unable to generalize.
As with my previous Kaggle entries, I wanted to carry out initial exploratory data analysis (EDA) and publish my notebook for others to use. I’ve found that narrating my code helps to focus my train of thought while cementing new ideas. If you’re learning, I highly recommend documenting your progress in the form of Jupyter notebooks or Kaggle kernels. At the end of the day, you have something that looks and sounds professional that can also serve as reference material and proof of concept.
Anyway, I learned a ton from House Prices EDA — click here to view my notebook — including:
- Assessment of multicollinearity and null values
- Outlier identification
- Log-transformations for skewed variables
- Practical skills for spotting relationships between variables (a short sketch of these checks follows this list)
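To make those steps concrete, here is a minimal sketch of the kind of checks involved (not the exact code from my notebook), assuming the Kaggle train.csv file with its SalePrice target:

```python
import numpy as np
import pandas as pd

# Kaggle's House Prices training set (path assumed)
train = pd.read_csv("train.csv")

# Null values: which columns are most incomplete?
null_counts = train.isnull().sum().sort_values(ascending=False)
print(null_counts[null_counts > 0].head(10))

# Multicollinearity: correlations among the numeric features and with the target
corr = train.select_dtypes(include=[np.number]).corr()
print(corr["SalePrice"].sort_values(ascending=False).head(10))

# Outliers: a quick scatter of living area vs. price flags a few extreme houses
train.plot.scatter(x="GrLivArea", y="SalePrice")

# Skewness: SalePrice is right-skewed, so a log-transform makes it more normal
print("skew before log-transform:", train["SalePrice"].skew())
print("skew after log-transform: ", np.log1p(train["SalePrice"]).skew())
```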
With EDA done, I turned to data preprocessing: dealing with missing values, addressing outliers, converting between datatypes, label/one-hot encoding and log-transforming. Phew! Challenging, but oh so satisfying, as vague comprehension was replaced with practical implementation. These were all data science topics that I had done in isolation, but never in concert, let alone on a real dataset!
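A stripped-down version of that preprocessing looks something like the sketch below. The column names are just a few of the dataset's 80 features, and the imputation choices are illustrative rather than exactly what I used:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")

# A few illustrative columns (the real dataset has ~80 features)
num_cols = ["LotFrontage", "GarageArea"]
cat_cols = ["Neighborhood", "MSZoning"]

# Missing values: median for numeric columns, most frequent value for categoricals
for col in num_cols:
    train[col] = train[col].fillna(train[col].median())
for col in cat_cols:
    train[col] = train[col].fillna(train[col].mode()[0])

# One-hot encode the categorical columns
train = pd.get_dummies(train, columns=cat_cols)

# Log-transform the skewed target so errors are penalized on a relative scale
train["SalePrice"] = np.log1p(train["SalePrice"])
```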
Once I had performed the minimal amount of preprocessing needed to satisfy sklearn model requirements, I implemented basic feature engineering. Constructing additional features in this way allows the data scientist (ahem, me) to utilize domain knowledge to bootstrap model learning. Then I turned to Google for a refresher on the basic regularized linear models: Ridge and Lasso. Why regularized, you may ask?
Regularization introduces additional constraints that penalize model complexity. Since the House Prices dataset contains a large number of features (not to mention the additional features that I engineered), overfitting is inevitable unless models are primed to select only the most impactful data characteristics.
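To illustrate the idea, here is a small sketch on synthetic data (not the actual House Prices features): Ridge shrinks coefficients toward zero, while Lasso can push many of them to exactly zero, effectively selecting a subset of features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: many features, only a handful of which are informative
X, y = make_regression(n_samples=300, n_features=80, n_informative=10,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Ridge shrinks coefficients toward zero; Lasso can set them exactly to zero,
# which acts as an implicit form of feature selection
ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("non-zero Ridge coefficients:", np.sum(ridge.coef_ != 0))  # typically all 80
print("non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0))  # usually far fewer
```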
Below are two visual examples of how the regularization parameter alpha can be tuned to produce the lowest cross-validated root mean squared error (RMSE) for Ridge and Lasso models.
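Those plots come from sweeping a grid of alpha values and scoring each with cross-validation. A sketch of that loop (again on synthetic data, with an alpha grid that is illustrative rather than my exact one):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=80, n_informative=10,
                       noise=10.0, random_state=0)

def cv_rmse(model, X, y, folds=5):
    """Cross-validated RMSE (sklearn returns negated MSE by convention)."""
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=folds)
    return np.sqrt(-scores).mean()

# Sweep a grid of alphas and keep the one with the lowest cross-validated RMSE
alphas = [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 1.0, 10.0]
ridge_scores = {a: cv_rmse(Ridge(alpha=a), X, y) for a in alphas}
lasso_scores = {a: cv_rmse(Lasso(alpha=a), X, y) for a in alphas}

print("best Ridge alpha:", min(ridge_scores, key=ridge_scores.get))
print("best Lasso alpha:", min(lasso_scores, key=lasso_scores.get))
```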
My initial leaderboard scores for a simple Lasso model were reasonably good, almost surprisingly so. Nevertheless, I was determined to improve my score, so I decided to experiment with a variety of other models and average the results to create a final set of predictions.
This led to a lot of hyperparameter tuning, as I attempted to strike a balance between training and validation scores. It was very interesting to see where the biggest performance gains were made, especially with the inclusion of RobustScaler vs. StandardScaler in my modeling pipeline.
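The comparison boils down to swapping the scaler inside an otherwise identical pipeline. A rough sketch, reusing the cv_rmse helper and synthetic X, y from the alpha sweep above (the alpha value is a placeholder):

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# RobustScaler centers on the median and scales by the IQR, so it is much less
# sensitive to outliers than StandardScaler, which uses the mean and variance
standard_pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.001))
robust_pipe = make_pipeline(RobustScaler(), Lasso(alpha=0.001))

# cv_rmse, X and y are defined in the alpha-sweep sketch above
print("StandardScaler CV RMSE:", cv_rmse(standard_pipe, X, y))
print("RobustScaler CV RMSE:  ", cv_rmse(robust_pipe, X, y))
```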
In the end, I got the best results by averaging predictions from Lasso, SVR and XGBoost models. Why did this produce the highest leaderboard score? Difficult to say, exactly. From what I understand, the overestimation of one model may have happily “canceled out” the underestimation of another model. Another idea is that the patterns gleaned from the data differ from model to model; between them, they capture the majority of important (and predictive) inferences.
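Mechanically, the blend is just a plain average of each model's test-set predictions. The sketch below shows the idea; the hyperparameters are placeholders rather than my tuned values, and X_train, y_train and X_test stand for the fully preprocessed (and log-transformed) data:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR
from xgboost import XGBRegressor

# Three tuned models (hyperparameters here are placeholders, not my final values)
models = [
    make_pipeline(RobustScaler(), Lasso(alpha=0.0005)),
    make_pipeline(RobustScaler(), SVR(C=20, epsilon=0.01, gamma=0.0003)),
    XGBRegressor(n_estimators=1000, learning_rate=0.05, max_depth=3),
]

# X_train, y_train and X_test stand for the fully preprocessed data
for model in models:
    model.fit(X_train, y_train)

# Simple unweighted average of the three prediction vectors; since the target
# was log-transformed during preprocessing, invert it before submitting
blended = np.mean([model.predict(X_test) for model in models], axis=0)
submission_preds = np.expm1(blended)
```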
While this post makes it seem like the House Prices modeling process was straightforward and linear, that was most certainly not the case. Rather, I performed multiple rounds of data preprocessing, feature engineering and hyperparameter tuning. I scoured fellow Kagglers’ notebooks for ideas and read about the dataset in the original author’s introductory paper. For three days, I puzzled over the large difference between my training and leaderboard scores, only to finally realize (while out for a walk) that I had forgotten to scale my test data! I am very pleased with my top-scoring model, which earned me a spot in the top 12% of participants (527 out of 4607).
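For anyone who wants to avoid my three-day detour: fit the scaler on the training data only, then apply that same fitted scaler to the test data (or put the scaler inside a Pipeline so it happens automatically at predict time). A tiny sketch, with X_train and X_test standing for the preprocessed feature matrices:

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn the centering/scaling from train only
X_test_scaled = scaler.transform(X_test)        # apply the same fitted transform to test
```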
In summary, I really enjoyed working with the House Prices dataset and could go on and on. But if you want to learn more about my strategy and modeling, please check out my full notebook.
As a brief interlude, before I dive into my next project, I wanted to mention how much I have been enjoying Jon Krohn’s Deep Learning Illustrated. The colorful content is incredibly readable and, importantly, memorable. Not to mention the footnotes, which are an amusing treasure trove of detailed insights and references to the most widely known deep learning papers. As someone who likes to know where things come from, I’ve loved reading about the origin of artificial neural networks and then, a few chapters later, implementing one of my own!
Inspired by Deep Learning Illustrated, I spent the remainder of February experimenting with neural networks and the MNIST handwritten digit dataset on Kaggle.
Like NLP, deep learning is rife with terminology — including batch size, learning rate and activation functions, to name just a few. I wanted to perform a detailed review of the important hyperparameters for a neural network and visualize the impact of changing said hyperparameters on model training. For example, below are the learning curves for a collection of deep learning models, each with a different learning rate.
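Those curves came from training the same small network several times, changing only the learning rate. The sketch below shows the general setup (a deliberately tiny architecture and short training run, not my actual experiment):

```python
from tensorflow import keras

# Flatten the 28x28 MNIST images into 784-dimensional vectors in [0, 1]
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

def build_model(learning_rate):
    """A small dense network for MNIST; only the learning rate varies."""
    model = keras.Sequential([
        keras.layers.Input(shape=(784,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Train an identical model at several learning rates and compare learning curves
histories = {}
for lr in [0.001, 0.01, 0.1, 1.0]:
    model = build_model(lr)
    histories[lr] = model.fit(x_train, y_train, epochs=5, batch_size=128,
                              validation_split=0.1, verbose=0)

for lr, history in histories.items():
    print(f"learning rate {lr}: final val accuracy {history.history['val_accuracy'][-1]:.3f}")
```

Plotting each history’s loss and validation accuracy against the epoch number gives learning curves like the ones pictured.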
This exploratory project served several purposes:
- I implemented concepts from Deep Learning Illustrated, helping me to learn and remember the material
- I got comfortable with Keras and practiced OOP by creating a custom wrapper class for neural networks (a rough sketch of the idea follows below)
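For reference, the wrapper idea is roughly along these lines (an illustrative class, not my actual implementation):

```python
from tensorflow import keras

class SimpleClassifier:
    """A thin OOP wrapper around a Keras Sequential model (illustrative only)."""

    def __init__(self, input_dim, n_classes, hidden_units=64, learning_rate=0.01):
        self.model = keras.Sequential([
            keras.layers.Input(shape=(input_dim,)),
            keras.layers.Dense(hidden_units, activation="relu"),
            keras.layers.Dense(n_classes, activation="softmax"),
        ])
        self.model.compile(
            optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"],
        )
        self.history = None

    def fit(self, X, y, **kwargs):
        # Keep the training history around so learning curves are easy to plot later
        self.history = self.model.fit(X, y, **kwargs)
        return self

    def predict(self, X):
        # Return class labels rather than raw softmax probabilities
        return self.model.predict(X).argmax(axis=1)
```

Something like SimpleClassifier(784, 10).fit(x_train, y_train, epochs=5) keeps all the compile-and-fit boilerplate in one place, which made the hyperparameter experiments much less repetitive.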
My hope is that the full notebook, with its collection of interesting graphs and reusable code, will be helpful to other beginners who are interested in the fascinating world of neural networks.
In other news, I am extremely excited to announce that I am participating in this year’s Kaggle Days in San Francisco. As I’m sure is very clear by now, I love Kaggle, and I can’t wait to learn from (and compete with) other community members. I am also looking forward to the Open Data Science Conference (ODSC) East event, in Boston, the following week. April is going to be an awesome month!
Last but not least, my favorite resources/videos from this month include:
- 3Blue1Brown: Essence of Linear Algebra [YouTube]
- Introduction to GANs, NIPS 2016 | Ian Goodfellow, OpenAI [YouTube]
Looking ahead to March, my goal is to continue delving into neural networks in the context of the MNIST dataset. I also want to learn more about Big Data technologies, including Spark, Hadoop, Hive and Pig.
With that, February is a wrap! Thanks again for reading and, until next time, happy coding.