Sklearn review and Ensemble models

Intro

The scikit-learn package (imported as sklearn) is a full-featured Python library for data analysis and predictive modeling. In the pcda class, we spent one session at the end of the semester introducing this library and doing some basic statistical/ML modeling. We’ll start by reviewing the basics of using sklearn to build statistical and machine learning models, and then learn about ensemble models.

Readings and review activities

As a review, first take a look through the following sections (and notebooks) in PDSH. We covered all of this back in the pcda class in our final session. Going through the notebooks will get you back up to speed with sklearn and ML basics.

  • PDSH - Ch 5: Scikit-Learn

    • 05.00-Machine-Learning.ipynb

    • 05.01-What-Is-Machine-Learning.ipynb

    • 05.02-Introducing-Scikit-Learn.ipynb

    • 05.03-Hyperparameters-and-Model-Validation.ipynb

Downloads and other resources

This downloads file will be used throughout all of the Module 2 activities.

Activities

We’ll start with a review of sklearn with a focus on the standard estimator API that makes it pretty easy to quickly try out different types of predictive models. In addition, we’ll explore a class of models known as ensemble models.
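As a quick reminder of what the estimator API looks like, here is a minimal sketch. It assumes the small iris dataset bundled with sklearn rather than the leaf data used in the notebook, and the three model types and their settings are illustrative choices only. The point is that every estimator exposes the same fit, predict, and score methods, so swapping models takes very little code.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for the course's leaf dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Illustrative model choices; any sklearn classifier would fit this pattern.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # same call for every estimator
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out data
    print(f"{name}: {scores[name]:.3f}")
```

Because the loop only relies on fit and score, adding another model type is a one-line change to the dictionary.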

Ensemble models are just what they sound like: a collection of models that, hopefully, performs better as an aggregated whole than any of its individual members. Modern weather forecasting relies on ensemble models, and you’ll see that many winning Kaggle entries use ensembles of models. Individual models can be combined by doing things like averaging individual predictions (for regression) or using voting (for classification). Here’s an interesting blog post on using human regression ensembles vs. various ML techniques.
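The two ways of combining models mentioned above can be sketched roughly as follows. The regression predictions are made-up numbers used only to show the averaging step, and the classification part assumes the iris dataset and three arbitrary model types rather than anything from the leaf notebook. sklearn's VotingClassifier handles the majority vote when voting="hard".

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Regression: average the predictions of the individual models.
# Hypothetical predictions from three models (rows) for four samples (columns).
reg_preds = np.array([
    [2.0, 3.0, 5.0, 8.0],
    [2.4, 2.8, 5.5, 7.0],
    [1.6, 3.2, 4.5, 9.0],
])
avg_pred = reg_preds.mean(axis=0)  # one averaged prediction per sample
print(avg_pred)

# Classification: each model votes for a class and the majority wins.
# Assumption: iris stands in for the course's leaf dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",  # majority vote over predicted class labels
)
ensemble.fit(X_train, y_train)
ensemble_score = ensemble.score(X_test, y_test)
print(f"voting ensemble accuracy: {ensemble_score:.3f}")
```

An ensemble is not guaranteed to beat its best member; it tends to help most when the individual models make different kinds of errors.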

We’ll use one of the Kaggle practice competitions, in which the goal is to classify leaves based on simple images of them.

  • You can find the notebook sklearn_gettingstarted_leaf_classification_aap.ipynb in the sklearn_ensemble_leaf folder within the Downloads file.

  • By working through it, we will:
    • review sklearn, numpy and a little pandas

    • build, train, test models in sklearn

    • combine different types of models into ensemble models

Here are screencasts to help guide you through the notebook:

When you are done with this, move on to the next submodule, Using cookiecutter templates for project structure.

Explore