{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Pipelines in Sklearn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Main Idea

\n", "

Scikit-learn pipelines bundle preprocessing steps and a model into a single object that can be fit and evaluated as a unit.

\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# This code spits out lots of warnings. We turn them off for the purposes of this tutorial. \n", "# This is bad practice in general. Only use if you already know your code is correct. \n", "import warnings\n", "import os\n", "warnings.filterwarnings('ignore')\n", "warnings.simplefilter('ignore')\n", "os.environ['PYTHONWARNINGS']='ignore'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "UJetBHbbmMaS" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy.stats import t" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Defaulting to user installation because normal site-packages is not writeable\n", "Requirement already satisfied: scikit-learn in /home/dane2/.local/lib/python3.9/site-packages (1.2.2)\n", "Requirement already satisfied: joblib>=1.1.1 in /home/dane2/.local/lib/python3.9/site-packages (from scikit-learn) (1.2.0)\n", "Requirement already satisfied: numpy>=1.17.3 in /software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/site-packages (from scikit-learn) (1.21.5)\n", "Requirement already satisfied: threadpoolctl>=2.0.0 in /software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/site-packages (from scikit-learn) (2.2.0)\n", "Requirement already satisfied: scipy>=1.3.2 in /software/spackages/linux-rocky8-x86_64/gcc-9.5.0/anaconda3-2022.05-zyrazrj6uvrtukupqzhaslr63w7hj6in/lib/python3.9/site-packages (from scikit-learn) (1.7.3)\n" ] } ], "source": [ "!pip install -U scikit-learn" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import sklearn\n", "assert sklearn.__version__ > '1.2'" ] }, { "cell_type": "code", 
"execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn import set_config\n", "set_config(transform_output = \"pandas\")" ] }, { "cell_type": "markdown", "metadata": { "id": "tSUqc72ZIH7F" }, "source": [ "## The Data" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 408 }, "id": "P0brlJTfnl7g", "outputId": "b207ec53-1a38-4397-a770-357adae6217a" }, "outputs": [ { "data": { "text/plain": [ "(1460, 81)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('/zfs/citi/workshop_data/python_ml/ames_train.csv')\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " variable type \\\n", "0 SalePrice numeric \n", "1 MSSubClass categorical \n", "2 MSZoning categorical \n", "3 LotFrontage numeric \n", "4 LotArea numeric \n", ".. ... ... \n", "75 MiscVal numeric \n", "76 MoSold numeric \n", "77 YrSold numeric \n", "78 SaleType categorical \n", "79 SaleCondition categorical \n", "\n", " description \n", "0 the property's sale price in dollars. This is ... \n", "1 The building class \n", "2 The general zoning classification \n", "3 Linear feet of street connected to property \n", "4 Lot size in square feet \n", ".. ... \n", "75 $Value of miscellaneous feature \n", "76 Month Sold \n", "77 Year Sold \n", "78 Type of sale \n", "79 Condition of sale \n", "\n", "[80 rows x 3 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "features = pd.read_csv('/zfs/citi/workshop_data/python_ml/ames_features.csv')\n", "features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory Analysis" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[]], dtype=object)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXcAAAEICAYAAACktLTqAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAVUUlEQVR4nO3df5DcdX3H8efbRAFZhFDwmoaUwxa1QCwlV5RB24u2lR9WsFPbMNSBihOnxVGnmbZBZ1o6TqaxLa061B/xsNJBiSlCZUSqNCWj1iISREOIqVFODGDij/DjqKUmvvvHfkM2x97t5m43990Pz8fMzn73+/N1d8lrv/fZ7+5FZiJJKsuz5jqAJKn3LHdJKpDlLkkFstwlqUCWuyQVyHKXpAJZ7nrGiYjxiPiNPuz3FRGxrdf7lWbCctfAioiXR8SXIuLRiPhRRPxnRPxqD/c/HBEZERPVbTwiVk21fmZ+ITNf1KvjS7Mxf64DSDMREc8DPg38EbAeeA7wCuDJPhzumMzcExFnARsi4p7M/LdJeeZn5p4+HFuaEc/cNaheCJCZ12fm3sz8cWZ+LjO/HhG/EBH/ERE/jIgfRMTHIuKYdjuJiGdFxKqI+Fa1/vqIOLbdupn5X8AW4LSIGI2IHRHx5xHxPeCf9s1r2ffiiLgxIr5f7fvqlmVvjIitEbE7Ij4bESf28psjWe4aVP8N7I2IayPi3IhY0LIsgL8Gfg74JWAxcOUU+3krcCHw69X6u4F/nLxSNJ0NnAp8tZr9s8CxwInAiknrz6P5m8V3gGFgEbCuWnYh8A7gd4DjgS8A13f5dUvdyUxv3gbyRrO4PwrsAPYANwNDbda7EPhqy+Nx4Deq6a3Aq1qWLQR+QnPIchhI4BGapb8VeGu13ijwf8DhLduOAjuq6bOA7wPz2+S5Fbis5fGzgP8BTpzr76m3cm6OuWtgZeZW4FKAiHgxcB3wnoh4G/A+mmPwR9Esz91T7OZE4KaI+GnLvL3AUMvj47L9ePr3M/N/p9jvYuA7U2x3IvDeiLiqZV7QPLv/zhT7kw6KwzIqQmZ+g+ZZ/Gk0h2QSeElmPg/4A5rl2c53gXMz85iW2+GZ+WA3h51m2XeBn4+IdidQ3wXePOmYR2Tml7o4ptQVy10DKSJeHBErI+KE6vFi4CLgDppn6xPAIxGxCPjTaXb1QWD1vhc0I+L4iLigBxHvBB4G1kTEkRFxeDVmv++YV0TEqdUxj46I1/fgmNJTLHcNqseBlwJfjognaJb6vcBK4K+AM4BHgVuAG6fZz3tpjtV/LiIer/bz0tmGy8y9wG8Dvwg8QPN1gd+vlt0EvBtYFxGPVbnPne0xpVaR6R/rkKTSeOYuSQWy3CWpQJa7JBXIcpekAtXiTUzHHXdcHn/88Rx55JFzHaWjJ554wpw9NihZzdlbg5IT6pt106ZNP8jM49sunOu3yGYmS5cuzdtvvz0HgTl7b1CymrO3BiVnZn2zAnflFL3qsIwkFchyl6QCWe6SVCDLXZIKZLlLUoEsd0kqkOUuSQWy3CWpQJa7JBWoFh8/ULrhVbd0td74mvP7nETSM4Vn7pJUIMtdkgpkuUtSgSx3SSqQ5S5JBbLcJalAlrskFchyl6QCWe6SVCDLXZIKZLlLUoEsd0kqkOUuSQXqWO4RsTgibo+IrRGxJSLeVs2/MiIejIh7qtt5LdtcERHbI2JbRLy6n1+AJOnpuvnI3z3Aysy8OyKOAjZFxG3Vsn/IzL9rXTkiTgGWA6cCPwf8e0S8MDP39jK4JGlqHc/cM/PhzLy7mn4c2AosmmaTC4B1mflkZt4PbAfO7EVYSVJ3IjO7XzliGPg8cBrwJ8ClwGPAXTTP7ndHxNXAHZl5XbXNNcCtmXnDpH2tAFYADA0NLR0bG6PRaMz6C+q3iYmJg865+cFHu1pvyaKjZxKprZnknCuDktWcvTUoOaG+WZctW7YpM0faLev6LzFFRAP4JPD2zHwsIj4AvAvI6v4q4I1
AtNn8ac8gmbkWWAswMjKSjUaD0dHRbuPMmY0bNx50zku7/UtMFx/cfqczk5xzZVCymrO3BiUnDFbWfbq6WiYink2z2D+WmTcCZObOzNybmT8FPsz+oZcdwOKWzU8AHupdZElSJ91cLRPANcDWzPz7lvkLW1Z7HXBvNX0zsDwiDouIk4CTgTt7F1mS1Ek3wzJnA28ANkfEPdW8dwAXRcTpNIdcxoE3A2TmlohYD9xH80qby71SRpIOrY7lnplfpP04+mem2WY1sHoWuSRJs+A7VCWpQJa7JBXIcpekAlnuklQgy12SCmS5S1KBLHdJKpDlLkkF6vqDw/R0w11+IJgkHWqeuUtSgSx3SSqQ5S5JBbLcJalAlrskFchyl6QCWe6SVCDLXZIKZLlLUoF8h2qNdPuO1/E15/c5iaRB55m7JBXIcpekAlnuklQgy12SCmS5S1KBLHdJKpDlLkkFstwlqUCWuyQVyHKXpAJ1LPeIWBwRt0fE1ojYEhFvq+YfGxG3RcQ3q/sFLdtcERHbI2JbRLy6n1+AJOnpujlz3wOszMxfAl4GXB4RpwCrgA2ZeTKwoXpMtWw5cCpwDvD+iJjXj/CSpPY6lntmPpyZd1fTjwNbgUXABcC11WrXAhdW0xcA6zLzycy8H9gOnNnj3JKkaURmdr9yxDDweeA04IHMPKZl2e7MXBARVwN3ZOZ11fxrgFsz84ZJ+1oBrAAYGhpaOjY2RqPRmOWX038TExNP5dz84KNzkmHJoqM7rtOas+4GJas5e2tQckJ9sy5btmxTZo60W9b1R/5GRAP4JPD2zHwsIqZctc28pz2DZOZaYC3AyMhINhoNRkdHu40zZzZu3PhUzku7/IjeXhu/eLTjOq05625QspqztwYlJwxW1n26ulomIp5Ns9g/lpk3VrN3RsTCavlCYFc1fwewuGXzE4CHehNXktSNbq6WCeAaYGtm/n3LopuBS6rpS4BPtcxfHhGHRcRJwMnAnb2LLEnqpJthmbOBNwCbI+Keat47gDXA+oi4DHgAeD1AZm6JiPXAfTSvtLk8M/f2OrgkaWodyz0zv0j7cXSAV02xzWpg9SxySZJmwXeoSlKBLHdJKpDlLkkFstwlqUCWuyQVyHKXpAJZ7pJUIMtdkgpkuUtSgSx3SSqQ5S5JBbLcJalAXf+xDtXHcBd/JGTlkj1cuuoWxtecfwgSSaobz9wlqUCWuyQVyHKXpAJZ7pJUIMtdkgpkuUtSgSx3SSqQ5S5JBbLcJalAlrskFchyl6QCWe6SVCDLXZIKZLlLUoEsd0kqkOUuSQXqWO4R8ZGI2BUR97bMuzIiHoyIe6rbeS3LroiI7RGxLSJe3a/gkqSpdXPm/lHgnDbz/yEzT69unwGIiFOA5cCp1Tbvj4h5vQorSepOx3LPzM8DP+pyfxcA6zLzycy8H9gOnDmLfJKkGYjM7LxSxDDw6cw8rXp8JXAp8BhwF7AyM3dHxNXAHZl5XbXeNcCtmXlDm32uAFYADA0NLR0bG6PRaPTia+qriYmJp3JufvDROU4ztaEjYOePYcmio+c6Sket39M6M2dvDUpOqG/WZcuWbcrMkXbLZvoHsj8AvAvI6v4q4I1AtFm37bNHZq4F1gKMjIxko9FgdHR0hnEOnY0bNz6V89Iu/lD1XFm5ZA9XbZ7P+MWjcx2lo9bvaZ2Zs7cGJScMVtZ9ZnS1TGbuzMy9mflT4MPsH3rZASxuWfUE4KHZRZQkHawZlXtELGx5+Dpg35U0NwPLI+KwiDgJOBm4c3YRJUkHq+OwTERcD4wCx0XEDuAvgdGIOJ3mkMs48GaAzNwSEeuB+4A9wOWZubcvySVJU+pY7pl5UZvZ10yz/mpg9WxCSZJmx3eoSlKBLHdJKpDlLkkFstwlqUCWuyQVyHKXpAJZ7pJUIMtdkgpkuUtSgSx3SSqQ5S5JBbLcJalAlrskFchyl6QCWe6SVCDLXZIKZLlLUoEsd0kqkOUuSQWy3CWpQJa7JBX
IcpekAlnuklQgy12SCmS5S1KBLHdJKpDlLkkFstwlqUCWuyQVaH6nFSLiI8BrgF2ZeVo171jgE8AwMA78XmburpZdAVwG7AXempmf7UtydWV41S1drTe+5vw+J5F0KHVz5v5R4JxJ81YBGzLzZGBD9ZiIOAVYDpxabfP+iJjXs7SSpK50LPfM/Dzwo0mzLwCuraavBS5smb8uM5/MzPuB7cCZvYkqSepWZGbnlSKGgU+3DMs8kpnHtCzfnZkLIuJq4I7MvK6afw1wa2be0GafK4AVAENDQ0vHxsZoNBo9+JL6a2Ji4qmcmx98dI7TTG3oCNj54+7XX7Lo6P6F6aD1e1pn5uytQckJ9c26bNmyTZk50m5ZxzH3gxRt5rV99sjMtcBagJGRkWw0GoyOjvY4zsxMN069cslervriE9WjXn/7emflkj1ctbn7fOMXj/YvTAcbN26szc9+OubsrUHJCYOVdZ+ZXi2zMyIWAlT3u6r5O4DFLeudADw083iSpJmYabnfDFxSTV8CfKpl/vKIOCwiTgJOBu6cXURJ0sHq5lLI64FR4LiI2AH8JbAGWB8RlwEPAK8HyMwtEbEeuA/YA1yemXv7lF2SNIWO5Z6ZF02x6FVTrL8aWD2bUDr0vB5eKovvUJWkAlnuklQgy12SCmS5S1KBLHdJKpDlLkkFstwlqUCWuyQVyHKXpAJZ7pJUIMtdkgpkuUtSgSx3SSqQ5S5JBbLcJalAlrskFchyl6QCWe6SVCDLXZIKZLlLUoEsd0kqkOUuSQWy3CWpQJa7JBXIcpekAlnuklSg+XMdQINleNUtXa03vub8PieRNB3P3CWpQJa7JBVoVsMyETEOPA7sBfZk5khEHAt8AhgGxoHfy8zds4spSToYvThzX5aZp2fmSPV4FbAhM08GNlSPJUmHUD+GZS4Arq2mrwUu7MMxJEnTiMyc+cYR9wO7gQQ+lJlrI+KRzDymZZ3dmbmgzbYrgBUAQ0NDS8fGxmg0GjPO0kubH3x0ymVDR8DOHx/CMDM01zmXLDq663UnJiZq87Ofjjl7a1ByQn2zLlu2bFPLqMkBZnsp5NmZ+VBEPB+4LSK+0e2GmbkWWAswMjKSjUaD0dHRWcbpjUunudxv5ZI9XLW5/leQznXO8YtHu15348aNtfnZT8ecvTUoOWGwsu4zq2GZzHyout8F3AScCeyMiIUA1f2u2YaUJB2cGZd7RBwZEUftmwZ+C7gXuBm4pFrtEuBTsw0pSTo4s/m9fQi4KSL27efjmflvEfEVYH1EXAY8ALx+9jElSQdjxuWemd8GfrnN/B8Cr5pNKEnS7NT/lUENJD+DRppbfvyAJBXIcpekAj2jhmW6HSqQpEHnmbskFegZdeau+hledQsrl+yZ9l3B4Auv0sHyzF2SCmS5S1KBLHdJKpDlLkkFstwlqUCWuyQVyHKXpAJZ7pJUIN/EpIHgp0xKB8czd0kqkOUuSQWy3CWpQJa7JBXIF1T1jOWLtCqZZ+6SVCDLXZIKZLlLUoGKGHP3b6NK0oGKKHdpH5/opSaHZSSpQJ65Sx3s+22gmz/k3Q0vrdSh4Jm7JBXIcpekAvVtWCYizgHeC8wDxjJzTb+OJQ0S3xmrQ6Ev5R4R84B/BH4T2AF8JSJuzsz7+nE8qURz+SQwV8cu6Ylvrr+Wfp25nwlsz8xvA0TEOuACwHKXemxyifTqhd+ZHHsqg1DGpYnM7P1OI34XOCcz31Q9fgPw0sx8S8s6K4AV1cMXAT8EftDzML13HObstUHJas7eGpScUN+sJ2bm8e0W9OvMPdrMO+BZJDPXAmuf2iDirswc6VOenjFn7w1KVnP21qDkhMHKuk+/rpbZASxueXwC8FCfjiVJmqRf5f4V4OSIOCkingMsB27u07EkSZP0ZVgmM/dExFuAz9K8FPIjmbmlw2ZrOyyvC3P23qBkNWdvDUpOGKysQJ9eUJUkzS3foSpJBbLcJalEmTmnN+AcYBuwHVjVx+N8BNg
F3Nsy71jgNuCb1f2ClmVXVJm2Aa9umb8U2Fwtex/7h7YOAz5Rzf8yMNyyzSXVMb4JXNIh52LgdmArsAV4Wx2zAocDdwJfq3L+VR1ztqw/D/gq8Oma5xyvjnEPcFddswLHADcA36D5b/WsuuWk+f6Ze1pujwFvr1vOft0O6cGm+A/3LeAFwHNoFsUpfTrWrwFncGC5/w3VEwqwCnh3NX1KleUw4KQq47xq2Z3VP+QAbgXOreb/MfDBano58ImW/5jfru4XVNMLpsm5EDijmj4K+O8qT62yVvtsVNPPrv5hv6xuOVvy/gnwcfaXe11zjgPHTZpXu6zAtcCbqunn0Cz72uWc1DXfA06sc86edt6hPFibb/hZwGdbHl8BXNHH4w1zYLlvAxZW0wuBbe1y0Lzq56xqnW+0zL8I+FDrOtX0fJrvZovWdaplHwIuOojMn6L5GT21zQo8F7gbeGkdc9J8n8UG4JXsL/fa5azWGefp5V6rrMDzgPupzl7rmnNStt8C/rPuOXt5m+sx90XAd1se76jmHSpDmfkwQHX//A65FlXTk+cfsE1m7gEeBX5mmn11FBHDwK/QPCuuXdaImBcR99Ac7rotM2uZE3gP8GfAT1vm1TEnNN/J/bmI2FR9REcds74A+D7wTxHx1YgYi4gja5iz1XLg+mq6zjl7Zq7LvePHFMyRqXJNl3cm20wdIKIBfBJ4e2Y+Nt2qMzhuT7Jm5t7MPJ3mmfGZEXFa3XJGxGuAXZm5aZpsB2wyg2P28md/dmaeAZwLXB4RvzbNunOVdT7NIc4PZOavAE/QHN6oW87mjppvpHwt8C/TrTfDY/b0/30vzXW5z/XHFOyMiIUA1f2uDrl2VNOT5x+wTUTMB44GfjTNvqYUEc+mWewfy8wb65wVIDMfATbSfHG8bjnPBl4bEePAOuCVEXFdDXMCkJkPVfe7gJtofsJq3bLuAHZUv6lB84XVM2qYc59zgbszc2f1uK45e+tQjgG1GQebT/OFhpPY/4LqqX083jAHjrn/LQe+sPI31fSpHPjCyrfZ/8LKV2i+cLjvhZXzqvmXc+ALK+ur6WNpjk8uqG73A8dOkzGAfwbeM2l+rbICxwPHVNNHAF8AXlO3nJMyj7J/zL12OYEjgaNapr9E8wmzjlm/ALyomr6yyli7nNU264A/rOv/pb713aE82BTf+PNoXhHyLeCdfTzO9cDDwE9oPqteRnNsbAPNS5U2tH7zgXdWmbZRvTJezR8B7q2WXc3+S6IOp/lr33aar6y/oGWbN1bzt7f+I5si58tp/vr2dfZfwnVe3bICL6F5aeHXq2P8RTW/VjknZR5lf7nXLifNseyvsf/y0nfWOOvpwF3Vz/9faRZYHXM+l+bHiR/dMq92Oftx8+MHJKlAcz3mLknqA8tdkgpkuUtSgSx3SSqQ5S5JBbLcJalAlrskFej/ARkLLQmo8opuAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# it's always a good idea to look at your response variable\n", "df.hist('SalePrice', bins=30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It wouldn't be a bad idea to apply Box-Cox if using a linear model. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we should make sure that no SalePrice values are missing\n", "df.SalePrice.isna().any()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PoolQC 0.995205\n", "MiscFeature 0.963014\n", "Alley 0.937671\n", "Fence 0.807534\n", "FireplaceQu 0.472603\n", "LotFrontage 0.177397\n", "GarageYrBlt 0.055479\n", "GarageCond 0.055479\n", "GarageType 0.055479\n", "GarageFinish 0.055479\n", "GarageQual 0.055479\n", "BsmtFinType2 0.026027\n", "BsmtExposure 0.026027\n", "BsmtQual 0.025342\n", "BsmtCond 0.025342\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# it's also important to look at the degree of missingness in each of your features\n", "column_missingness = df.isna().sum().sort_values(ascending=False) / len(df)\n", "column_missingness.head(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's go ahead and drop `PoolQC`, `MiscFeature`, and `Alley` as these are almost always missing. Note that some of these should be treated with more nuance. Take `Fence` for instance. I expect `Fence` takes the value NA when no fence is present. 
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((1460, 78), (77, 3))" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "features_to_drop = ['PoolQC', 'MiscFeature', 'Alley']\n", "df = df.drop(features_to_drop, axis=1)\n", "features = features[~features.variable.isin(features_to_drop)]\n", "df.shape, features.shape" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "39 0.153846\n", "533 0.153846\n", "520 0.153846\n", "1011 0.153846\n", "1218 0.153846\n", " ... \n", "860 0.000000\n", "51 0.000000\n", "1170 0.000000\n", "642 0.000000\n", "810 0.000000\n", "Length: 1460, dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's see if any rows are missing a large portion of their data\n", "row_missingness = df.isna().sum(axis=1).sort_values(ascending=False) / len(df.columns)\n", "row_missingness" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BsmtQual NaN\n", "BsmtCond NaN\n", "BsmtExposure NaN\n", "BsmtFinType1 NaN\n", "BsmtFinType2 NaN\n", "FireplaceQu NaN\n", "GarageType NaN\n", "GarageYrBlt NaN\n", "GarageFinish NaN\n", "GarageQual NaN\n", "GarageCond NaN\n", "Fence NaN\n", "Name: 39, dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's take a closer look at some of these\n", "index=39\n", "df.loc[index, df.loc[index].isna()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nothing too concerning here. Most of these features are very niche." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(42, 34)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get lists of categorical and numeric features\n", "categorical_features = features.variable[features.type=='categorical'].tolist()\n", "numeric_features = features.variable[features.type=='numeric'].tolist()[1:]\n", "len(categorical_features), len(numeric_features)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['MSSubClass',\n", " 'MSZoning',\n", " 'Street',\n", " 'LotShape',\n", " 'LandContour',\n", " 'Utilities',\n", " 'LotConfig',\n", " 'LandSlope',\n", " 'Neighborhood',\n", " 'Condition1',\n", " 'Condition2',\n", " 'BldgType',\n", " 'HouseStyle',\n", " 'RoofStyle',\n", " 'RoofMatl',\n", " 'Exterior1st',\n", " 'Exterior2nd',\n", " 'MasVnrType',\n", " 'ExterQual',\n", " 'ExterCond',\n", " 'Foundation',\n", " 'BsmtQual',\n", " 'BsmtCond',\n", " 'BsmtExposure',\n", " 'BsmtFinType1',\n", " 'BsmtFinType2',\n", " 'Heating',\n", " 'HeatingQC',\n", " 'CentralAir',\n", " 'Electrical',\n", " 'KitchenQual',\n", " 'Functional',\n", " 'Fireplaces',\n", " 'FireplaceQu',\n", " 'GarageType',\n", " 'GarageFinish',\n", " 'GarageQual',\n", " 'GarageCond',\n", " 'PavedDrive',\n", " 'Fence',\n", " 'SaleType',\n", " 'SaleCondition']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's take a look at our categorical features\n", "categorical_features" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Let's look at the distribution of features\n", "plot_features = ['Neighborhood', 'HouseStyle', 'GarageType', 'SaleType']\n", "fig, axes = plt.subplots(2,2)\n", "fig.set_size_inches(10,6)\n", "axes = axes.flatten()\n", "for ix, feature in enumerate(plot_features):\n", " df[feature].value_counts().plot(kind='bar', ax=axes[ix])\n", " axes[ix].set_title(feature)\n", "plt.tight_layout()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['LotFrontage',\n", " 'LotArea',\n", " 'OverallQual',\n", " 'OverallCond',\n", " 'YearBuilt',\n", " 'YearRemodAdd',\n", " 'MasVnrArea',\n", " 'BsmtFinSF1',\n", " 'BsmtFinSF2',\n", " 'BsmtUnfSF',\n", " 'TotalBsmtSF',\n", " '1stFlrSF',\n", " '2ndFlrSF',\n", " 'LowQualFinSF',\n", " 'GrLivArea',\n", " 'BsmtFullBath',\n", " 'BsmtHalfBath',\n", " 'FullBath',\n", " 'HalfBath',\n", " 'BedroomAbvGr',\n", " 'KitchenAbvGr',\n", " 'TotRmsAbvGrd',\n", " 'GarageYrBlt',\n", " 'GarageCars',\n", " 'GarageArea',\n", " 'WoodDeckSF',\n", " 'OpenPorchSF',\n", " 'EnclosedPorch',\n", " '3SsnPorch',\n", " 'ScreenPorch',\n", " 'PoolArea',\n", " 'MiscVal',\n", " 'MoSold',\n", " 'YrSold']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's take a look at our categorical features\n", "numeric_features" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " LotFrontage LotArea OverallQual OverallCond YearBuilt \\\n", "count 1201.000000 1460.000000 1460.000000 1460.000000 1460.000000 \n", "mean 70.049958 10516.828082 6.099315 5.575342 1971.267808 \n", "std 24.284752 9981.264932 1.382997 1.112799 30.202904 \n", "min 21.000000 1300.000000 1.000000 1.000000 1872.000000 \n", "25% 59.000000 7553.500000 5.000000 5.000000 1954.000000 \n", "50% 69.000000 9478.500000 6.000000 5.000000 1973.000000 \n", "75% 80.000000 11601.500000 7.000000 6.000000 2000.000000 \n", "max 313.000000 215245.000000 10.000000 9.000000 2010.000000 \n", "\n", " YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF ... \\\n", "count 1460.000000 1452.000000 1460.000000 1460.000000 1460.000000 ... \n", "mean 1984.865753 103.685262 443.639726 46.549315 567.240411 ... \n", "std 20.645407 181.066207 456.098091 161.319273 441.866955 ... \n", "min 1950.000000 0.000000 0.000000 0.000000 0.000000 ... \n", "25% 1967.000000 0.000000 0.000000 0.000000 223.000000 ... \n", "50% 1994.000000 0.000000 383.500000 0.000000 477.500000 ... \n", "75% 2004.000000 166.000000 712.250000 0.000000 808.000000 ... \n", "max 2010.000000 1600.000000 5644.000000 1474.000000 2336.000000 ... 
\n", "\n", " GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch \\\n", "count 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 \n", "mean 472.980137 94.244521 46.660274 21.954110 3.409589 \n", "std 213.804841 125.338794 66.256028 61.119149 29.317331 \n", "min 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "25% 334.500000 0.000000 0.000000 0.000000 0.000000 \n", "50% 480.000000 0.000000 25.000000 0.000000 0.000000 \n", "75% 576.000000 168.000000 68.000000 0.000000 0.000000 \n", "max 1418.000000 857.000000 547.000000 552.000000 508.000000 \n", "\n", " ScreenPorch PoolArea MiscVal MoSold YrSold \n", "count 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 \n", "mean 15.060959 2.758904 43.489041 6.321918 2007.815753 \n", "std 55.757415 40.177307 496.123024 2.703626 1.328095 \n", "min 0.000000 0.000000 0.000000 1.000000 2006.000000 \n", "25% 0.000000 0.000000 0.000000 5.000000 2007.000000 \n", "50% 0.000000 0.000000 0.000000 6.000000 2008.000000 \n", "75% 0.000000 0.000000 0.000000 8.000000 2009.000000 \n", "max 480.000000 738.000000 15500.000000 12.000000 2010.000000 \n", "\n", "[8 rows x 34 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# since these are numeric, we can compute some basic stats:\n", "df[numeric_features].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Probably, most of these should be rescaled with standard scaling. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Let's look at the distribution of features\n", "plot_features = ['LotFrontage', 'BsmtFinSF1', '1stFlrSF', 'ScreenPorch']\n", "fig, axes = plt.subplots(2,2)\n", "fig.set_size_inches(10,6)\n", "axes = axes.flatten()\n", "for ix, feature in enumerate(plot_features):\n", " df[feature].value_counts().plot(kind='hist', ax=axes[ix], bins=20, logy=True)\n", " axes[ix].set_title(feature)\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of these features have very low variance. They should probably be removed. Others should possibly be transformed using Box-Cox or similar. " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Let's check to see if any of our numeric features are highly correlated with one another\n", "import seaborn as sns\n", "plt.gcf().set_size_inches(10,8)\n", "sns.clustermap(df[numeric_features].corr())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overall, our features are relatively uncorrelated. Some features are highly correlated. For instance: \n", "* TotalBsmtSF, 1stFlrSF\n", "* GarageCars, GarageArea\n", "* BsmtFnSF1, BsmtFullBath\n", "\n", "We could consider removing some of these near duplicate features. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First pass\n", "It's usually a good idea to first perform a quick and dirty analysis ignoring the subtleties that we discussed above. We can then try to incorporate these ideas to improve our results. " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((1460, 76), (1460,))" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# define input/target\n", "inputs = features.variable.iloc[1:].tolist()\n", "X, y = df[inputs], df['SalePrice']\n", "X.shape, y.shape" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((1168, 76), (292, 76))" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# split the data\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preprocessing numeric and categorical features\n", "We will use scikit-learn's pipeline feature to streamline the model fitting and evaluation process. 
" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "ColumnTransformer(transformers=[('numeric',\n", " Pipeline(steps=[('fill_na',\n", " SimpleImputer(strategy='median')),\n", " ('scale', StandardScaler())]),\n", " ['LotFrontage', 'LotArea', 'OverallQual',\n", " 'OverallCond', 'YearBuilt', 'YearRemodAdd',\n", " 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',\n", " 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',\n", " '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',\n", " 'BsmtFullBath', 'BsmtHal...\n", " 'LotShape', 'LandContour', 'Utilities',\n", " 'LotConfig', 'LandSlope', 'Neighborhood',\n", " 'Condition1', 'Condition2', 'BldgType',\n", " 'HouseStyle', 'RoofStyle', 'RoofMatl',\n", " 'Exterior1st', 'Exterior2nd', 'MasVnrType',\n", " 'ExterQual', 'ExterCond', 'Foundation',\n", " 'BsmtQual', 'BsmtCond', 'BsmtExposure',\n", " 'BsmtFinType1', 'BsmtFinType2', 'Heating',\n", " 'HeatingQC', 'CentralAir', 'Electrical', ...])],\n", " verbose_feature_names_out=False)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.compose import make_column_transformer, ColumnTransformer\n", "\n", "# preprocessing pipeline for numeric quantities\n", "preproc_numeric = Pipeline([\n", " ('fill_na', SimpleImputer(strategy='median')),\n", " ('scale', StandardScaler())\n", "])\n", "\n", "# preprocessing pipeline for categorical quantities\n", "preproc_categorical = Pipeline([\n", " ('fill_na', SimpleImputer(strategy='most_frequent')),\n", " ('encode', OneHotEncoder(sparse_output=False, drop='first', min_frequency=5, handle_unknown='infrequent_if_exist'))\n", "])\n", "\n", "# full preprocessing pipeline\n", "preproc = ColumnTransformer(\n", " transformers=[\n", " ('numeric', preproc_numeric, numeric_features),\n", " ('categorical', preproc_categorical, categorical_features)\n", "], verbose_feature_names_out=False)\n", "\n", 
"preproc.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
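<table>
<thead>
<tr><th></th><th>LotFrontage</th><th>LotArea</th><th>OverallQual</th><th>OverallCond</th><th>YearBuilt</th><th>YearRemodAdd</th><th>MasVnrArea</th><th>BsmtFinSF1</th><th>BsmtFinSF2</th><th>BsmtUnfSF</th><th>...</th><th>Fence_MnWw</th><th>SaleType_ConLD</th><th>SaleType_New</th><th>SaleType_WD</th><th>SaleType_infrequent_sklearn</th><th>SaleCondition_Alloca</th><th>SaleCondition_Family</th><th>SaleCondition_Normal</th><th>SaleCondition_Partial</th><th>SaleCondition_infrequent_sklearn</th></tr>
</thead>
<tbody>
<tr><th>892</th><td>-0.012468</td><td>-0.211594</td><td>-0.088934</td><td>2.165000</td><td>-0.259789</td><td>0.873470</td><td>-0.597889</td><td>0.472844</td><td>-0.285504</td><td>-0.391317</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>1105</th><td>1.234520</td><td>0.145643</td><td>1.374088</td><td>-0.524174</td><td>0.751222</td><td>0.487465</td><td>1.498567</td><td>1.276986</td><td>-0.285504</td><td>-0.312872</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>413</th><td>-0.635963</td><td>-0.160826</td><td>-0.820445</td><td>0.372217</td><td>-1.433867</td><td>-1.683818</td><td>-0.597889</td><td>-0.971996</td><td>-0.285504</td><td>0.980347</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>522</th><td>-0.903175</td><td>-0.529035</td><td>-0.088934</td><td>1.268609</td><td>-0.781602</td><td>-1.683818</td><td>-0.597889</td><td>-0.102477</td><td>-0.285504</td><td>0.077111</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>1036</th><td>0.833703</td><td>0.205338</td><td>2.105599</td><td>-0.524174</td><td>1.175195</td><td>1.114724</td><td>-0.192497</td><td>1.255193</td><td>-0.285504</td><td>0.061422</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
<tr><th>479</th><td>-0.903175</td><td>-0.443026</td><td>-1.551955</td><td>1.268609</td><td>-1.107734</td><td>0.728718</td><td>1.921333</td><td>-0.605882</td><td>-0.285504</td><td>0.377443</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>1361</th><td>2.392439</td><td>0.508459</td><td>0.642577</td><td>-0.524174</td><td>1.109968</td><td>0.969972</td><td>-0.505228</td><td>1.804363</td><td>-0.285504</td><td>-0.705096</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>802</th><td>-0.324216</td><td>-0.231585</td><td>0.642577</td><td>-0.524174</td><td>1.109968</td><td>0.969972</td><td>-0.597889</td><td>0.440155</td><td>-0.285504</td><td>-1.099561</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>651</th><td>-0.457822</td><td>-0.149296</td><td>-1.551955</td><td>-0.524174</td><td>-1.009895</td><td>-1.683818</td><td>-0.597889</td><td>-0.971996</td><td>-0.285504</td><td>0.413303</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>722</th><td>-0.012468</td><td>-0.238931</td><td>-1.551955</td><td>1.268609</td><td>-0.031496</td><td>-0.718804</td><td>-0.597889</td><td>-0.555760</td><td>-0.285504</td><td>0.229518</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>
</tbody>
</table>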
\n", "

292 rows × 226 columns

\n", "
" ], "text/plain": [ " LotFrontage LotArea OverallQual OverallCond YearBuilt \\\n", "892 -0.012468 -0.211594 -0.088934 2.165000 -0.259789 \n", "1105 1.234520 0.145643 1.374088 -0.524174 0.751222 \n", "413 -0.635963 -0.160826 -0.820445 0.372217 -1.433867 \n", "522 -0.903175 -0.529035 -0.088934 1.268609 -0.781602 \n", "1036 0.833703 0.205338 2.105599 -0.524174 1.175195 \n", "... ... ... ... ... ... \n", "479 -0.903175 -0.443026 -1.551955 1.268609 -1.107734 \n", "1361 2.392439 0.508459 0.642577 -0.524174 1.109968 \n", "802 -0.324216 -0.231585 0.642577 -0.524174 1.109968 \n", "651 -0.457822 -0.149296 -1.551955 -0.524174 -1.009895 \n", "722 -0.012468 -0.238931 -1.551955 1.268609 -0.031496 \n", "\n", " YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF ... \\\n", "892 0.873470 -0.597889 0.472844 -0.285504 -0.391317 ... \n", "1105 0.487465 1.498567 1.276986 -0.285504 -0.312872 ... \n", "413 -1.683818 -0.597889 -0.971996 -0.285504 0.980347 ... \n", "522 -1.683818 -0.597889 -0.102477 -0.285504 0.077111 ... \n", "1036 1.114724 -0.192497 1.255193 -0.285504 0.061422 ... \n", "... ... ... ... ... ... ... \n", "479 0.728718 1.921333 -0.605882 -0.285504 0.377443 ... \n", "1361 0.969972 -0.505228 1.804363 -0.285504 -0.705096 ... \n", "802 0.969972 -0.597889 0.440155 -0.285504 -1.099561 ... \n", "651 -1.683818 -0.597889 -0.971996 -0.285504 0.413303 ... \n", "722 -0.718804 -0.597889 -0.555760 -0.285504 0.229518 ... \n", "\n", " Fence_MnWw SaleType_ConLD SaleType_New SaleType_WD \\\n", "892 0.0 0.0 0.0 1.0 \n", "1105 0.0 0.0 0.0 1.0 \n", "413 0.0 0.0 0.0 1.0 \n", "522 0.0 0.0 0.0 1.0 \n", "1036 0.0 0.0 0.0 1.0 \n", "... ... ... ... ... \n", "479 0.0 0.0 0.0 1.0 \n", "1361 0.0 0.0 0.0 1.0 \n", "802 0.0 0.0 0.0 1.0 \n", "651 0.0 0.0 0.0 1.0 \n", "722 0.0 0.0 0.0 1.0 \n", "\n", " SaleType_infrequent_sklearn SaleCondition_Alloca SaleCondition_Family \\\n", "892 0.0 0.0 0.0 \n", "1105 0.0 0.0 0.0 \n", "413 0.0 0.0 0.0 \n", "522 0.0 0.0 0.0 \n", "1036 0.0 0.0 0.0 \n", "... ... ... ... 
\n", "479 0.0 1.0 0.0 \n", "1361 0.0 0.0 0.0 \n", "802 0.0 0.0 0.0 \n", "651 0.0 0.0 0.0 \n", "722 0.0 0.0 0.0 \n", "\n", " SaleCondition_Normal SaleCondition_Partial \\\n", "892 1.0 0.0 \n", "1105 1.0 0.0 \n", "413 1.0 0.0 \n", "522 1.0 0.0 \n", "1036 1.0 0.0 \n", "... ... ... \n", "479 0.0 0.0 \n", "1361 1.0 0.0 \n", "802 1.0 0.0 \n", "651 1.0 0.0 \n", "722 1.0 0.0 \n", "\n", " SaleCondition_infrequent_sklearn \n", "892 0.0 \n", "1105 0.0 \n", "413 0.0 \n", "522 0.0 \n", "1036 0.0 \n", "... ... \n", "479 0.0 \n", "1361 0.0 \n", "802 0.0 \n", "651 0.0 \n", "722 0.0 \n", "\n", "[292 rows x 226 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preproc.transform(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model fitting" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preproc',\n",
       "                 ColumnTransformer(transformers=[('numeric',\n",
       "                                                  Pipeline(steps=[('fill_na',\n",
       "                                                                   SimpleImputer(strategy='median')),\n",
       "                                                                  ('scale',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['LotFrontage', 'LotArea',\n",
       "                                                   'OverallQual', 'OverallCond',\n",
       "                                                   'YearBuilt', 'YearRemodAdd',\n",
       "                                                   'MasVnrArea', 'BsmtFinSF1',\n",
       "                                                   'BsmtFinSF2', 'BsmtUnfSF',\n",
       "                                                   'TotalBsmtSF', '1stFlrSF',\n",
       "                                                   '2ndFlrSF', 'LowQualFinSF',\n",
       "                                                   'GrLivAr...\n",
       "                                                   'Neighborhood', 'Condition1',\n",
       "                                                   'Condition2', 'BldgType',\n",
       "                                                   'HouseStyle', 'RoofStyle',\n",
       "                                                   'RoofMatl', 'Exterior1st',\n",
       "                                                   'Exterior2nd', 'MasVnrType',\n",
       "                                                   'ExterQual', 'ExterCond',\n",
       "                                                   'Foundation', 'BsmtQual',\n",
       "                                                   'BsmtCond', 'BsmtExposure',\n",
       "                                                   'BsmtFinType1',\n",
       "                                                   'BsmtFinType2', 'Heating',\n",
       "                                                   'HeatingQC', 'CentralAir',\n",
       "                                                   'Electrical', ...])],\n",
       "                                   verbose_feature_names_out=False)),\n",
       "                ('estimator', Ridge(alpha=1))])
" ], "text/plain": [ "Pipeline(steps=[('preproc',\n", " ColumnTransformer(transformers=[('numeric',\n", " Pipeline(steps=[('fill_na',\n", " SimpleImputer(strategy='median')),\n", " ('scale',\n", " StandardScaler())]),\n", " ['LotFrontage', 'LotArea',\n", " 'OverallQual', 'OverallCond',\n", " 'YearBuilt', 'YearRemodAdd',\n", " 'MasVnrArea', 'BsmtFinSF1',\n", " 'BsmtFinSF2', 'BsmtUnfSF',\n", " 'TotalBsmtSF', '1stFlrSF',\n", " '2ndFlrSF', 'LowQualFinSF',\n", " 'GrLivAr...\n", " 'Neighborhood', 'Condition1',\n", " 'Condition2', 'BldgType',\n", " 'HouseStyle', 'RoofStyle',\n", " 'RoofMatl', 'Exterior1st',\n", " 'Exterior2nd', 'MasVnrType',\n", " 'ExterQual', 'ExterCond',\n", " 'Foundation', 'BsmtQual',\n", " 'BsmtCond', 'BsmtExposure',\n", " 'BsmtFinType1',\n", " 'BsmtFinType2', 'Heating',\n", " 'HeatingQC', 'CentralAir',\n", " 'Electrical', ...])],\n", " verbose_feature_names_out=False)),\n", " ('estimator', Ridge(alpha=1))])" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create the model pipeline\n", "from sklearn.linear_model import Ridge\n", "\n", "pipe = Pipeline([\n", "    ('preproc', preproc),\n", "    ('estimator', Ridge(alpha=1))\n", "])\n", "\n", "# fit\n", "pipe.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model evaluation" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train R^2: 0.9028881797596011\n", "Test R^2: 0.8714648323729995\n" ] } ], "source": [ "# evaluate\n", "# For regressors such as Ridge, score returns the R^2 value\n", "print(\"Train R^2:\", pipe.score(X_train, y_train))\n", "print(\"Test R^2:\", pipe.score(X_test, y_test))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train MAE: 15288.339516024189\n", "Test MAE: 20544.401602914677\n" ] } ], "source": [ "# MAE is a more intuitive 
metric for price estimation\n", "from sklearn.metrics import mean_absolute_error\n", "y_train_pred_1stpass = pipe.predict(X_train)\n", "y_test_pred_1stpass = pipe.predict(X_test)\n", "print(\"Train MAE:\", mean_absolute_error(y_train, y_train_pred_1stpass))\n", "print(\"Test MAE:\", mean_absolute_error(y_test, y_test_pred_1stpass))" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'fit_time': array([0.16545057, 0.14652419, 0.1494379 , 0.14467359, 0.12987018]),\n", " 'score_time': array([0.03685045, 0.03611279, 0.03650355, 0.03635812, 0.03367758]),\n", " 'test_score': array([-18480.22849655, -20702.55584883, -20390.70685371, -17442.23199009,\n", " -19762.00113844]),\n", " 'train_score': array([-15920.87447649, -14995.07168281, -15245.2950543 , -15810.8980323 ,\n", " -13704.36598872])}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# more robust evaluation with cross-validation\n", "from sklearn.model_selection import KFold, cross_validate\n", "cv = KFold(n_splits=5, shuffle=True, random_state=42)\n", "cv_scores = cross_validate(pipe, X, y, n_jobs=4, return_train_score=True, scoring='neg_mean_absolute_error')\n", "cv_scores" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(19355.544865524786, (17657.804810833954, 21053.284920215618))" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import scipy.stats\n", "\n", "# mean and t-based confidence interval of a 1-D sample\n", "def mean_confidence_interval(data, confidence=0.95):\n", "    a = 1.0 * np.array(data)\n", "    n = len(a)\n", "    m, se = np.mean(a), scipy.stats.sem(a)\n", "    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)\n", "    return m, (m-h, m+h)\n", "\n", "mean_confidence_interval(-cv_scores['test_score'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optimizing the pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 
many things we can try to improve the fit of our model. For example, we could experiment with the following: \n", "* apply some of the observations from our exploratory data analysis\n", "* try different imputation strategies\n", "* add a dimension reduction step like PCA\n", "* try different estimators\n", "* tune the hyperparameters of each estimator\n", "\n", "How do we start testing various approaches in scikit-learn? One way is to manually change the pipeline we defined above, fit/evaluate the model, and compare with our results above. This can be a tedious and error-prone process. Let's see how to do this efficiently using scikit-learn's APIs. \n", "\n", "To introduce the ideas, let's consider the alpha regularization parameter used in Ridge regression." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning estimator parameters" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preproc',\n",
       "                 ColumnTransformer(transformers=[('numeric',\n",
       "                                                  Pipeline(steps=[('fill_na',\n",
       "                                                                   SimpleImputer(strategy='median')),\n",
       "                                                                  ('scale',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['LotFrontage', 'LotArea',\n",
       "                                                   'OverallQual', 'OverallCond',\n",
       "                                                   'YearBuilt', 'YearRemodAdd',\n",
       "                                                   'MasVnrArea', 'BsmtFinSF1',\n",
       "                                                   'BsmtFinSF2', 'BsmtUnfSF',\n",
       "                                                   'TotalBsmtSF', '1stFlrSF',\n",
       "                                                   '2ndFlrSF', 'LowQualFinSF',\n",
       "                                                   'GrLivAr...\n",
       "                                                   'Neighborhood', 'Condition1',\n",
       "                                                   'Condition2', 'BldgType',\n",
       "                                                   'HouseStyle', 'RoofStyle',\n",
       "                                                   'RoofMatl', 'Exterior1st',\n",
       "                                                   'Exterior2nd', 'MasVnrType',\n",
       "                                                   'ExterQual', 'ExterCond',\n",
       "                                                   'Foundation', 'BsmtQual',\n",
       "                                                   'BsmtCond', 'BsmtExposure',\n",
       "                                                   'BsmtFinType1',\n",
       "                                                   'BsmtFinType2', 'Heating',\n",
       "                                                   'HeatingQC', 'CentralAir',\n",
       "                                                   'Electrical', ...])],\n",
       "                                   verbose_feature_names_out=False)),\n",
       "                ('estimator', Ridge(alpha=1))])
" ], "text/plain": [ "Pipeline(steps=[('preproc',\n", " ColumnTransformer(transformers=[('numeric',\n", " Pipeline(steps=[('fill_na',\n", " SimpleImputer(strategy='median')),\n", " ('scale',\n", " StandardScaler())]),\n", " ['LotFrontage', 'LotArea',\n", " 'OverallQual', 'OverallCond',\n", " 'YearBuilt', 'YearRemodAdd',\n", " 'MasVnrArea', 'BsmtFinSF1',\n", " 'BsmtFinSF2', 'BsmtUnfSF',\n", " 'TotalBsmtSF', '1stFlrSF',\n", " '2ndFlrSF', 'LowQualFinSF',\n", " 'GrLivAr...\n", " 'Neighborhood', 'Condition1',\n", " 'Condition2', 'BldgType',\n", " 'HouseStyle', 'RoofStyle',\n", " 'RoofMatl', 'Exterior1st',\n", " 'Exterior2nd', 'MasVnrType',\n", " 'ExterQual', 'ExterCond',\n", " 'Foundation', 'BsmtQual',\n", " 'BsmtCond', 'BsmtExposure',\n", " 'BsmtFinType1',\n", " 'BsmtFinType2', 'Heating',\n", " 'HeatingQC', 'CentralAir',\n", " 'Electrical', ...])],\n", " verbose_feature_names_out=False)),\n", " ('estimator', Ridge(alpha=1))])" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preproc',\n",
       "                 ColumnTransformer(transformers=[('numeric',\n",
       "                                                  Pipeline(steps=[('fill_na',\n",
       "                                                                   SimpleImputer(strategy='median')),\n",
       "                                                                  ('scale',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['LotFrontage', 'LotArea',\n",
       "                                                   'OverallQual', 'OverallCond',\n",
       "                                                   'YearBuilt', 'YearRemodAdd',\n",
       "                                                   'MasVnrArea', 'BsmtFinSF1',\n",
       "                                                   'BsmtFinSF2', 'BsmtUnfSF',\n",
       "                                                   'TotalBsmtSF', '1stFlrSF',\n",
       "                                                   '2ndFlrSF', 'LowQualFinSF',\n",
       "                                                   'GrLivAr...\n",
       "                                                   'Neighborhood', 'Condition1',\n",
       "                                                   'Condition2', 'BldgType',\n",
       "                                                   'HouseStyle', 'RoofStyle',\n",
       "                                                   'RoofMatl', 'Exterior1st',\n",
       "                                                   'Exterior2nd', 'MasVnrType',\n",
       "                                                   'ExterQual', 'ExterCond',\n",
       "                                                   'Foundation', 'BsmtQual',\n",
       "                                                   'BsmtCond', 'BsmtExposure',\n",
       "                                                   'BsmtFinType1',\n",
       "                                                   'BsmtFinType2', 'Heating',\n",
       "                                                   'HeatingQC', 'CentralAir',\n",
       "                                                   'Electrical', ...])],\n",
       "                                   verbose_feature_names_out=False)),\n",
       "                ('estimator', Ridge(alpha=21.544346900318846))])
" ], "text/plain": [ "Pipeline(steps=[('preproc',\n", " ColumnTransformer(transformers=[('numeric',\n", " Pipeline(steps=[('fill_na',\n", " SimpleImputer(strategy='median')),\n", " ('scale',\n", " StandardScaler())]),\n", " ['LotFrontage', 'LotArea',\n", " 'OverallQual', 'OverallCond',\n", " 'YearBuilt', 'YearRemodAdd',\n", " 'MasVnrArea', 'BsmtFinSF1',\n", " 'BsmtFinSF2', 'BsmtUnfSF',\n", " 'TotalBsmtSF', '1stFlrSF',\n", " '2ndFlrSF', 'LowQualFinSF',\n", " 'GrLivAr...\n", " 'Neighborhood', 'Condition1',\n", " 'Condition2', 'BldgType',\n", " 'HouseStyle', 'RoofStyle',\n", " 'RoofMatl', 'Exterior1st',\n", " 'Exterior2nd', 'MasVnrType',\n", " 'ExterQual', 'ExterCond',\n", " 'Foundation', 'BsmtQual',\n", " 'BsmtCond', 'BsmtExposure',\n", " 'BsmtFinType1',\n", " 'BsmtFinType2', 'Heating',\n", " 'HeatingQC', 'CentralAir',\n", " 'Electrical', ...])],\n", " verbose_feature_names_out=False)),\n", " ('estimator', Ridge(alpha=21.544346900318846))])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "alphas = np.logspace(-2,3,10)\n", "param_grid = {\n", " 'estimator__alpha': alphas\n", "}\n", "cv = KFold(10, shuffle=True)\n", "gs = GridSearchCV(pipe, param_grid, cv=cv, n_jobs=10, scoring='neg_mean_absolute_error', verbose=False)\n", "\n", "gs.fit(X_train, y_train)\n", "gs.best_estimator_" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mean_fit_time': array([0.21524832, 0.12929206, 0.14147673, 0.13001094, 0.12876217,\n", " 0.12753346, 0.12873785, 0.12855723, 0.12965353, 0.1284827 ]),\n", " 'std_fit_time': array([0.03227405, 0.00410048, 0.0115528 , 0.00307046, 0.00154043,\n", " 0.00228009, 0.00147471, 0.00239597, 0.00322923, 0.00289472]),\n", " 'mean_score_time': array([0.03430345, 0.03392034, 0.03475215, 0.03282845, 0.03321393,\n", " 0.03305686, 0.03315287, 0.03283639, 0.03680825, 0.03171299]),\n", " 'std_score_time': 
array([0.00137619, 0.00102129, 0.00133493, 0.00081006, 0.00101256,\n", " 0.00059905, 0.00092702, 0.00072003, 0.00701796, 0.00133118]),\n", " 'param_estimator__alpha': masked_array(data=[0.01, 0.03593813663804628, 0.1291549665014884,\n", " 0.464158883361278, 1.6681005372000592,\n", " 5.994842503189409, 21.544346900318846,\n", " 77.42636826811278, 278.2559402207126, 1000.0],\n", " mask=[False, False, False, False, False, False, False, False,\n", " False, False],\n", " fill_value='?',\n", " dtype=object),\n", " 'params': [{'estimator__alpha': 0.01},\n", " {'estimator__alpha': 0.03593813663804628},\n", " {'estimator__alpha': 0.1291549665014884},\n", " {'estimator__alpha': 0.464158883361278},\n", " {'estimator__alpha': 1.6681005372000592},\n", " {'estimator__alpha': 5.994842503189409},\n", " {'estimator__alpha': 21.544346900318846},\n", " {'estimator__alpha': 77.42636826811278},\n", " {'estimator__alpha': 278.2559402207126},\n", " {'estimator__alpha': 1000.0}],\n", " 'split0_test_score': array([-20423.13827829, -20300.08239561, -20006.86387591, -19474.39388573,\n", " -18550.24891054, -17352.5317766 , -16262.56092082, -15829.96244455,\n", " -16998.77404589, -20154.59224233]),\n", " 'split1_test_score': array([-21134.44446939, -21041.43167909, -20813.88308617, -20383.39950686,\n", " -19741.47199681, -19035.29561157, -18429.03015027, -18519.18961014,\n", " -18802.3504061 , -19750.8974826 ]),\n", " 'split2_test_score': array([-21359.14792024, -21318.79576546, -21181.76938699, -20759.19222816,\n", " -20017.03123303, -19265.01444707, -18866.57488933, -18898.58752783,\n", " -19840.54997481, -22240.82282471]),\n", " 'split3_test_score': array([-16215.82422144, -16165.71300515, -16009.92907957, -15550.51162626,\n", " -14902.56891045, -14224.68764814, -14080.82716967, -14585.37627633,\n", " -16005.19557452, -18755.50445943]),\n", " 'split4_test_score': array([-20572.76015233, -20554.29563496, -20448.60425832, -20150.60752661,\n", " -19523.02605861, -18717.88201158, 
-18341.84138676, -19199.52470833,\n", " -20694.08099187, -22360.81927817]),\n", " 'split5_test_score': array([-22422.67103868, -22194.15170299, -21789.64016355, -21122.89650665,\n", " -20545.06228844, -19923.82378969, -19456.43283809, -19705.32368485,\n", " -20822.52847941, -22857.00430144]),\n", " 'split6_test_score': array([-20644.3925863 , -20509.2302733 , -20175.79674405, -19487.9008511 ,\n", " -18828.08109322, -18199.96520091, -17669.65230276, -17849.52831649,\n", " -18778.54675768, -21355.37693723]),\n", " 'split7_test_score': array([-21914.69307171, -21907.41248813, -21830.14252256, -21500.09999228,\n", " -20885.81482072, -20067.96059518, -19213.33300651, -18493.30959239,\n", " -18566.22147926, -19910.78227227]),\n", " 'split8_test_score': array([-21055.49087205, -21005.26491197, -20866.78602588, -20607.30434511,\n", " -20285.52475287, -19972.25539463, -19994.69035467, -20819.55284724,\n", " -21696.29477714, -23211.57963544]),\n", " 'split9_test_score': array([-24802.52242674, -24697.42185984, -24372.73862675, -23594.76068099,\n", " -22330.42876448, -21144.65673797, -20988.27076951, -21941.85250029,\n", " -22279.78880423, -21872.73786844]),\n", " 'mean_test_score': array([-21054.50850372, -20969.37997165, -20749.61537698, -20263.10671498,\n", " -19560.92588292, -18790.40732133, -18330.32137884, -18584.22075084,\n", " -19448.43312909, -21247.01173021]),\n", " 'std_test_score': array([2024.11365696, 2007.83580036, 1975.2723976 , 1928.5376014 ,\n", " 1855.26088174, 1827.79694374, 1869.50307669, 2052.64854701,\n", " 1901.68797941, 1432.0601566 ]),\n", " 'rank_test_score': array([ 9, 8, 7, 6, 5, 3, 1, 2, 4, 10], dtype=int32)}" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# look at the grid search outcome\n", "gs.cv_results_" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preproc',\n",
       "                 ColumnTransformer(transformers=[('numeric',\n",
       "                                                  Pipeline(steps=[('fill_na',\n",
       "                                                                   SimpleImputer(strategy='median')),\n",
       "                                                                  ('scale',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['LotFrontage', 'LotArea',\n",
       "                                                   'OverallQual', 'OverallCond',\n",
       "                                                   'YearBuilt', 'YearRemodAdd',\n",
       "                                                   'MasVnrArea', 'BsmtFinSF1',\n",
       "                                                   'BsmtFinSF2', 'BsmtUnfSF',\n",
       "                                                   'TotalBsmtSF', '1stFlrSF',\n",
       "                                                   '2ndFlrSF', 'LowQualFinSF',\n",
       "                                                   'GrLivAr...\n",
       "                                                   'Neighborhood', 'Condition1',\n",
       "                                                   'Condition2', 'BldgType',\n",
       "                                                   'HouseStyle', 'RoofStyle',\n",
       "                                                   'RoofMatl', 'Exterior1st',\n",
       "                                                   'Exterior2nd', 'MasVnrType',\n",
       "                                                   'ExterQual', 'ExterCond',\n",
       "                                                   'Foundation', 'BsmtQual',\n",
       "                                                   'BsmtCond', 'BsmtExposure',\n",
       "                                                   'BsmtFinType1',\n",
       "                                                   'BsmtFinType2', 'Heating',\n",
       "                                                   'HeatingQC', 'CentralAir',\n",
       "                                                   'Electrical', ...])],\n",
       "                                   verbose_feature_names_out=False)),\n",
       "                ('estimator', Ridge(alpha=21.544346900318846))])
" ], "text/plain": [ "Pipeline(steps=[('preproc',\n", " ColumnTransformer(transformers=[('numeric',\n", " Pipeline(steps=[('fill_na',\n", " SimpleImputer(strategy='median')),\n", " ('scale',\n", " StandardScaler())]),\n", " ['LotFrontage', 'LotArea',\n", " 'OverallQual', 'OverallCond',\n", " 'YearBuilt', 'YearRemodAdd',\n", " 'MasVnrArea', 'BsmtFinSF1',\n", " 'BsmtFinSF2', 'BsmtUnfSF',\n", " 'TotalBsmtSF', '1stFlrSF',\n", " '2ndFlrSF', 'LowQualFinSF',\n", " 'GrLivAr...\n", " 'Neighborhood', 'Condition1',\n", " 'Condition2', 'BldgType',\n", " 'HouseStyle', 'RoofStyle',\n", " 'RoofMatl', 'Exterior1st',\n", " 'Exterior2nd', 'MasVnrType',\n", " 'ExterQual', 'ExterCond',\n", " 'Foundation', 'BsmtQual',\n", " 'BsmtCond', 'BsmtExposure',\n", " 'BsmtFinType1',\n", " 'BsmtFinType2', 'Heating',\n", " 'HeatingQC', 'CentralAir',\n", " 'Electrical', ...])],\n", " verbose_feature_names_out=False)),\n", " ('estimator', Ridge(alpha=21.544346900318846))])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# visualize the search grid\n", "def plot_cv_results(res, confidence=0.95):\n", "    n = len(res['mean_test_score'])\n", "    # number of CV folds, counted from the split*_test_score keys\n", "    k = sum(key.startswith('split') and key.endswith('_test_score') for key in res)\n", "    # per-group standard error across folds (std_test_score is a population std, ddof=0)\n", "    sem = res['std_test_score'] / np.sqrt(k - 1)\n", "    t = scipy.stats.t.ppf((1 + confidence) / 2., k-1)\n", "    plt.errorbar(x=range(n), y=-res['mean_test_score'], yerr=t*sem)\n", "    plt.xlabel('param group')\n", "    plt.ylabel('MAE')\n", "\n", "plot_cv_results(gs.cv_results_)\n", "gs.best_estimator_" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parameter group 0: {'estimator__alpha': 0.01}\n", "\tMean score: 21054.51\n", "\t95% CI: (19528.22, 22580.80)\n", "------------------------------\n", "Parameter group 1: {'estimator__alpha': 0.03593813663804628}\n", "\tMean score: 20969.38\n", "\t95% CI: (19455.37, 22483.39)\n", "------------------------------\n", "Parameter group 2: {'estimator__alpha': 0.1291549665014884}\n", "\tMean score: 20749.62\n", "\t95% CI: (19260.16, 22239.07)\n", "------------------------------\n", "Parameter group 3: {'estimator__alpha': 0.464158883361278}\n", "\tMean score: 20263.11\n", "\t95% CI: (18808.89, 21717.33)\n", "------------------------------\n", "Parameter group 4: {'estimator__alpha': 1.6681005372000592}\n", "\tMean score: 19560.93\n", "\t95% CI: (18161.96, 20959.89)\n", "------------------------------\n", "Parameter group 5: {'estimator__alpha': 5.994842503189409}\n", "\tMean score: 18790.41\n", "\t95% CI: (17412.15, 20168.66)\n", "------------------------------\n", "Parameter group 6: {'estimator__alpha': 21.544346900318846}\n", "\tMean score: 18330.32\n", "\t95% CI: (16920.62, 19740.02)\n", "------------------------------\n", "Parameter group 7: {'estimator__alpha': 77.42636826811278}\n", "\tMean score: 18584.22\n", "\t95% CI: (17036.42, 20132.03)\n", "------------------------------\n", "Parameter group 8: 
{'estimator__alpha': 278.2559402207126}\n", "\tMean score: 19448.43\n", "\t95% CI: (18014.46, 20882.41)\n", "------------------------------\n", "Parameter group 9: {'estimator__alpha': 1000.0}\n", "\tMean score: 21247.01\n", "\t95% CI: (20167.16, 22326.86)\n" ] } ], "source": [ "# print stats for each parameter group\n", "def analyze_cv_results(cv_results):\n", "    # one row per CV fold, one column per parameter group\n", "    split_results = np.vstack([val for key, val in cv_results.items() if 'split' in key])\n", "    \n", "    for ix, p in enumerate(cv_results['params']): \n", "        if ix>0:\n", "            print('-'*30)\n", "        print(f\"Parameter group {ix}:\", p)\n", "        # scores are negated MAE, so flip the sign before summarizing\n", "        mu, (lo, hi) = mean_confidence_interval(-split_results[:,ix])\n", "        print(f\"\\tMean score: {mu:0.2f}\")\n", "        print(f\"\\t95% CI: ({lo:0.2f}, {hi:0.2f})\")\n", "\n", "analyze_cv_results(gs.cv_results_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get the best mean performance (lowest MAE) for parameter group 6 with alpha=21.54, although its confidence interval overlaps those of the neighboring groups." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Searching over multiple parameters simultaneously" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often we will want to search over multiple parameters at the same time. This is easy: add one entry to `param_grid` per parameter, and `GridSearchCV` evaluates every combination of the listed values." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 15 candidates, totalling 150 fits\n" ] }, { "data": { "text/html": [ "
Pipeline(steps=[('preproc',\n",
       "                 ColumnTransformer(transformers=[('numeric',\n",
       "                                                  Pipeline(steps=[('fill_na',\n",
       "                                                                   SimpleImputer(strategy='most_frequent')),\n",
       "                                                                  ('scale',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['LotFrontage', 'LotArea',\n",
       "                                                   'OverallQual', 'OverallCond',\n",
       "                                                   'YearBuilt', 'YearRemodAdd',\n",
       "                                                   'MasVnrArea', 'BsmtFinSF1',\n",
       "                                                   'BsmtFinSF2', 'BsmtUnfSF',\n",
       "                                                   'TotalBsmtSF', '1stFlrSF',\n",
       "                                                   '2ndFlrSF', 'LowQualFinSF',\n",
       "                                                   '...\n",
       "                                                   'Neighborhood', 'Condition1',\n",
       "                                                   'Condition2', 'BldgType',\n",
       "                                                   'HouseStyle', 'RoofStyle',\n",
       "                                                   'RoofMatl', 'Exterior1st',\n",
       "                                                   'Exterior2nd', 'MasVnrType',\n",
       "                                                   'ExterQual', 'ExterCond',\n",
       "                                                   'Foundation', 'BsmtQual',\n",
       "                                                   'BsmtCond', 'BsmtExposure',\n",
       "                                                   'BsmtFinType1',\n",
       "                                                   'BsmtFinType2', 'Heating',\n",
       "                                                   'HeatingQC', 'CentralAir',\n",
       "                                                   'Electrical', ...])],\n",
       "                                   verbose_feature_names_out=False)),\n",
       "                ('estimator', Ridge(alpha=21.54))])
" ], "text/plain": [ "Pipeline(steps=[('preproc',\n", " ColumnTransformer(transformers=[('numeric',\n", " Pipeline(steps=[('fill_na',\n", " SimpleImputer(strategy='most_frequent')),\n", " ('scale',\n", " StandardScaler())]),\n", " ['LotFrontage', 'LotArea',\n", " 'OverallQual', 'OverallCond',\n", " 'YearBuilt', 'YearRemodAdd',\n", " 'MasVnrArea', 'BsmtFinSF1',\n", " 'BsmtFinSF2', 'BsmtUnfSF',\n", " 'TotalBsmtSF', '1stFlrSF',\n", " '2ndFlrSF', 'LowQualFinSF',\n", " '...\n", " 'Neighborhood', 'Condition1',\n", " 'Condition2', 'BldgType',\n", " 'HouseStyle', 'RoofStyle',\n", " 'RoofMatl', 'Exterior1st',\n", " 'Exterior2nd', 'MasVnrType',\n", " 'ExterQual', 'ExterCond',\n", " 'Foundation', 'BsmtQual',\n", " 'BsmtCond', 'BsmtExposure',\n", " 'BsmtFinType1',\n", " 'BsmtFinType2', 'Heating',\n", " 'HeatingQC', 'CentralAir',\n", " 'Electrical', ...])],\n", " verbose_feature_names_out=False)),\n", " ('estimator', Ridge(alpha=21.54))])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "param_grid = {\n", " 'preproc__numeric__fill_na__strategy': ['mean', 'median', 'most_frequent'],\n", " 'preproc__categorical__encode__min_frequency': [1, 2, 3, 4, 5],\n", " 'estimator__alpha': [21.54]\n", "}\n", "cv = KFold(10, shuffle=True)\n", "gs = GridSearchCV(pipe, param_grid, cv=cv, n_jobs=10, scoring='neg_mean_absolute_error', verbose=True)\n", "gs.fit(X_train, y_train)\n", "gs.best_estimator_" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_cv_results(gs.cv_results_)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parameter group 0: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 1, 'preproc__numeric__fill_na__strategy': 'mean'}\n", "\tMean score: 18340.20\n", "\t95% CI: (16399.35, 20281.04)\n", "------------------------------\n", "Parameter group 1: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 1, 'preproc__numeric__fill_na__strategy': 'median'}\n", "\tMean score: 18326.99\n", "\t95% CI: (16385.40, 20268.58)\n", "------------------------------\n", "Parameter group 2: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 1, 'preproc__numeric__fill_na__strategy': 'most_frequent'}\n", "\tMean score: 18295.73\n", "\t95% CI: (16332.00, 20259.46)\n", "------------------------------\n", "Parameter group 3: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 2, 'preproc__numeric__fill_na__strategy': 'mean'}\n", "\tMean score: 18296.74\n", "\t95% CI: (16344.75, 20248.73)\n", "------------------------------\n", "Parameter group 4: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 2, 'preproc__numeric__fill_na__strategy': 'median'}\n", "\tMean score: 18283.10\n", "\t95% CI: (16330.26, 20235.94)\n", "------------------------------\n", "Parameter group 5: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 2, 'preproc__numeric__fill_na__strategy': 'most_frequent'}\n", "\tMean score: 18253.03\n", "\t95% CI: (16278.90, 20227.17)\n", "------------------------------\n", "Parameter group 6: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 3, 'preproc__numeric__fill_na__strategy': 'mean'}\n", "\tMean score: 18321.84\n", "\t95% CI: (16400.77, 20242.91)\n", 
"------------------------------\n", "Parameter group 7: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 3, 'preproc__numeric__fill_na__strategy': 'median'}\n", "\tMean score: 18310.43\n", "\t95% CI: (16388.60, 20232.26)\n", "------------------------------\n", "Parameter group 8: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 3, 'preproc__numeric__fill_na__strategy': 'most_frequent'}\n", "\tMean score: 18284.16\n", "\t95% CI: (16345.99, 20222.33)\n", "------------------------------\n", "Parameter group 9: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 4, 'preproc__numeric__fill_na__strategy': 'mean'}\n", "\tMean score: 18341.43\n", "\t95% CI: (16408.15, 20274.72)\n", "------------------------------\n", "Parameter group 10: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 4, 'preproc__numeric__fill_na__strategy': 'median'}\n", "\tMean score: 18329.76\n", "\t95% CI: (16396.56, 20262.96)\n", "------------------------------\n", "Parameter group 11: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 4, 'preproc__numeric__fill_na__strategy': 'most_frequent'}\n", "\tMean score: 18301.13\n", "\t95% CI: (16352.43, 20249.83)\n", "------------------------------\n", "Parameter group 12: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 5, 'preproc__numeric__fill_na__strategy': 'mean'}\n", "\tMean score: 18360.32\n", "\t95% CI: (16429.25, 20291.39)\n", "------------------------------\n", "Parameter group 13: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 5, 'preproc__numeric__fill_na__strategy': 'median'}\n", "\tMean score: 18348.10\n", "\t95% CI: (16416.62, 20279.58)\n", "------------------------------\n", "Parameter group 14: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 5, 'preproc__numeric__fill_na__strategy': 'most_frequent'}\n", "\tMean score: 18320.57\n", 
"\t95% CI: (16373.19, 20267.95)\n" ] } ], "source": [ "analyze_cv_results(gs.cv_results_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like the model prefers 'most_frequent' imputation with a OneHotEncoder min frequency of 2. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Searching over different transformers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try something a little more drastic. What if we were to replace the imputation approach altogether with a the KNNImputer that we saw earlier in the workshop?\n", "\n", "Before, we used GridSearchCV to scan possible values for a parameter in one of our transforms. We can also scan over different transform objects. " ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preproc',\n",
       "                 ColumnTransformer(transformers=[('numeric',\n",
       "                                                  Pipeline(steps=[('fill_na',\n",
       "                                                                   SimpleImputer(strategy='median')),\n",
       "                                                                  ('scale',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['LotFrontage', 'LotArea',\n",
       "                                                   'OverallQual', 'OverallCond',\n",
       "                                                   'YearBuilt', 'YearRemodAdd',\n",
       "                                                   'MasVnrArea', 'BsmtFinSF1',\n",
       "                                                   'BsmtFinSF2', 'BsmtUnfSF',\n",
       "                                                   'TotalBsmtSF', '1stFlrSF',\n",
       "                                                   '2ndFlrSF', 'LowQualFinSF',\n",
       "                                                   'GrLivAr...\n",
       "                                                   'Neighborhood', 'Condition1',\n",
       "                                                   'Condition2', 'BldgType',\n",
       "                                                   'HouseStyle', 'RoofStyle',\n",
       "                                                   'RoofMatl', 'Exterior1st',\n",
       "                                                   'Exterior2nd', 'MasVnrType',\n",
       "                                                   'ExterQual', 'ExterCond',\n",
       "                                                   'Foundation', 'BsmtQual',\n",
       "                                                   'BsmtCond', 'BsmtExposure',\n",
       "                                                   'BsmtFinType1',\n",
       "                                                   'BsmtFinType2', 'Heating',\n",
       "                                                   'HeatingQC', 'CentralAir',\n",
       "                                                   'Electrical', ...])],\n",
       "                                   verbose_feature_names_out=False)),\n",
       "                ('estimator', Ridge(alpha=1))])
" ], "text/plain": [ "Pipeline(steps=[('preproc',\n", " ColumnTransformer(transformers=[('numeric',\n", " Pipeline(steps=[('fill_na',\n", " SimpleImputer(strategy='median')),\n", " ('scale',\n", " StandardScaler())]),\n", " ['LotFrontage', 'LotArea',\n", " 'OverallQual', 'OverallCond',\n", " 'YearBuilt', 'YearRemodAdd',\n", " 'MasVnrArea', 'BsmtFinSF1',\n", " 'BsmtFinSF2', 'BsmtUnfSF',\n", " 'TotalBsmtSF', '1stFlrSF',\n", " '2ndFlrSF', 'LowQualFinSF',\n", " 'GrLivAr...\n", " 'Neighborhood', 'Condition1',\n", " 'Condition2', 'BldgType',\n", " 'HouseStyle', 'RoofStyle',\n", " 'RoofMatl', 'Exterior1st',\n", " 'Exterior2nd', 'MasVnrType',\n", " 'ExterQual', 'ExterCond',\n", " 'Foundation', 'BsmtQual',\n", " 'BsmtCond', 'BsmtExposure',\n", " 'BsmtFinType1',\n", " 'BsmtFinType2', 'Heating',\n", " 'HeatingQC', 'CentralAir',\n", " 'Electrical', ...])],\n", " verbose_feature_names_out=False)),\n", " ('estimator', Ridge(alpha=1))])" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 2 candidates, totalling 20 fits\n" ] }, { "data": { "text/html": [ "
Pipeline(steps=[('preproc',\n",
       "                 ColumnTransformer(transformers=[('numeric',\n",
       "                                                  Pipeline(steps=[('fill_na',\n",
       "                                                                   SimpleImputer(strategy='most_frequent')),\n",
       "                                                                  ('scale',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['LotFrontage', 'LotArea',\n",
       "                                                   'OverallQual', 'OverallCond',\n",
       "                                                   'YearBuilt', 'YearRemodAdd',\n",
       "                                                   'MasVnrArea', 'BsmtFinSF1',\n",
       "                                                   'BsmtFinSF2', 'BsmtUnfSF',\n",
       "                                                   'TotalBsmtSF', '1stFlrSF',\n",
       "                                                   '2ndFlrSF', 'LowQualFinSF',\n",
       "                                                   '...\n",
       "                                                   'Neighborhood', 'Condition1',\n",
       "                                                   'Condition2', 'BldgType',\n",
       "                                                   'HouseStyle', 'RoofStyle',\n",
       "                                                   'RoofMatl', 'Exterior1st',\n",
       "                                                   'Exterior2nd', 'MasVnrType',\n",
       "                                                   'ExterQual', 'ExterCond',\n",
       "                                                   'Foundation', 'BsmtQual',\n",
       "                                                   'BsmtCond', 'BsmtExposure',\n",
       "                                                   'BsmtFinType1',\n",
       "                                                   'BsmtFinType2', 'Heating',\n",
       "                                                   'HeatingQC', 'CentralAir',\n",
       "                                                   'Electrical', ...])],\n",
       "                                   verbose_feature_names_out=False)),\n",
       "                ('estimator', Ridge(alpha=21.54))])
" ], "text/plain": [ "Pipeline(steps=[('preproc',\n", " ColumnTransformer(transformers=[('numeric',\n", " Pipeline(steps=[('fill_na',\n", " SimpleImputer(strategy='most_frequent')),\n", " ('scale',\n", " StandardScaler())]),\n", " ['LotFrontage', 'LotArea',\n", " 'OverallQual', 'OverallCond',\n", " 'YearBuilt', 'YearRemodAdd',\n", " 'MasVnrArea', 'BsmtFinSF1',\n", " 'BsmtFinSF2', 'BsmtUnfSF',\n", " 'TotalBsmtSF', '1stFlrSF',\n", " '2ndFlrSF', 'LowQualFinSF',\n", " '...\n", " 'Neighborhood', 'Condition1',\n", " 'Condition2', 'BldgType',\n", " 'HouseStyle', 'RoofStyle',\n", " 'RoofMatl', 'Exterior1st',\n", " 'Exterior2nd', 'MasVnrType',\n", " 'ExterQual', 'ExterCond',\n", " 'Foundation', 'BsmtQual',\n", " 'BsmtCond', 'BsmtExposure',\n", " 'BsmtFinType1',\n", " 'BsmtFinType2', 'Heating',\n", " 'HeatingQC', 'CentralAir',\n", " 'Electrical', ...])],\n", " verbose_feature_names_out=False)),\n", " ('estimator', Ridge(alpha=21.54))])" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.impute import KNNImputer\n", "param_grid = {\n", " # before: 'preproc__numeric__fill_na__strategy': ['mean', 'median', 'most_frequent'],\n", " 'preproc__numeric__fill_na': [SimpleImputer(strategy='most_frequent'), KNNImputer(n_neighbors=20, weights=\"uniform\")],\n", " 'preproc__categorical__encode__min_frequency': [2],\n", " 'estimator__alpha': [21.54]\n", "}\n", "\n", "gs = GridSearchCV(pipe, param_grid, cv=cv, n_jobs=10, scoring='neg_mean_absolute_error', verbose=True)\n", "gs.fit(X_train, y_train)\n", "gs.best_estimator_" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_cv_results(gs.cv_results_)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parameter group 0: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 2, 'preproc__numeric__fill_na': SimpleImputer(strategy='most_frequent')}\n", "\tMean score: 18322.53\n", "\t95% CI: (15369.93, 21275.13)\n", "------------------------------\n", "Parameter group 1: {'estimator__alpha': 21.54, 'preproc__categorical__encode__min_frequency': 2, 'preproc__numeric__fill_na': KNNImputer(n_neighbors=20)}\n", "\tMean score: 18340.86\n", "\t95% CI: (15379.45, 21302.27)\n" ] } ], "source": [ "analyze_cv_results(gs.cv_results_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can't distinguish between these two approaches. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Searching different estimators\n", "Just like we could search over different transforms, we can search over different estimators. " ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 3 candidates, totalling 30 fits\n" ] }, { "data": { "text/html": [ "
Pipeline(steps=[('preproc',\n",
       "                 ColumnTransformer(transformers=[('numeric',\n",
       "                                                  Pipeline(steps=[('fill_na',\n",
       "                                                                   SimpleImputer(strategy='most_frequent')),\n",
       "                                                                  ('scale',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['LotFrontage', 'LotArea',\n",
       "                                                   'OverallQual', 'OverallCond',\n",
       "                                                   'YearBuilt', 'YearRemodAdd',\n",
       "                                                   'MasVnrArea', 'BsmtFinSF1',\n",
       "                                                   'BsmtFinSF2', 'BsmtUnfSF',\n",
       "                                                   'TotalBsmtSF', '1stFlrSF',\n",
       "                                                   '2ndFlrSF', 'LowQualFinSF',\n",
       "                                                   '...\n",
       "                                                   'Neighborhood', 'Condition1',\n",
       "                                                   'Condition2', 'BldgType',\n",
       "                                                   'HouseStyle', 'RoofStyle',\n",
       "                                                   'RoofMatl', 'Exterior1st',\n",
       "                                                   'Exterior2nd', 'MasVnrType',\n",
       "                                                   'ExterQual', 'ExterCond',\n",
       "                                                   'Foundation', 'BsmtQual',\n",
       "                                                   'BsmtCond', 'BsmtExposure',\n",
       "                                                   'BsmtFinType1',\n",
       "                                                   'BsmtFinType2', 'Heating',\n",
       "                                                   'HeatingQC', 'CentralAir',\n",
       "                                                   'Electrical', ...])],\n",
       "                                   verbose_feature_names_out=False)),\n",
       "                ('estimator', GradientBoostingRegressor())])
" ], "text/plain": [ "Pipeline(steps=[('preproc',\n", " ColumnTransformer(transformers=[('numeric',\n", " Pipeline(steps=[('fill_na',\n", " SimpleImputer(strategy='most_frequent')),\n", " ('scale',\n", " StandardScaler())]),\n", " ['LotFrontage', 'LotArea',\n", " 'OverallQual', 'OverallCond',\n", " 'YearBuilt', 'YearRemodAdd',\n", " 'MasVnrArea', 'BsmtFinSF1',\n", " 'BsmtFinSF2', 'BsmtUnfSF',\n", " 'TotalBsmtSF', '1stFlrSF',\n", " '2ndFlrSF', 'LowQualFinSF',\n", " '...\n", " 'Neighborhood', 'Condition1',\n", " 'Condition2', 'BldgType',\n", " 'HouseStyle', 'RoofStyle',\n", " 'RoofMatl', 'Exterior1st',\n", " 'Exterior2nd', 'MasVnrType',\n", " 'ExterQual', 'ExterCond',\n", " 'Foundation', 'BsmtQual',\n", " 'BsmtCond', 'BsmtExposure',\n", " 'BsmtFinType1',\n", " 'BsmtFinType2', 'Heating',\n", " 'HeatingQC', 'CentralAir',\n", " 'Electrical', ...])],\n", " verbose_feature_names_out=False)),\n", " ('estimator', GradientBoostingRegressor())])" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n", " \n", "param_grid = {\n", " 'preproc__numeric__fill_na': [SimpleImputer(strategy='most_frequent')],\n", " 'preproc__categorical__encode__min_frequency': [2],\n", " 'estimator': [Ridge(alpha=21.54), RandomForestRegressor(), GradientBoostingRegressor()]\n", "}\n", "\n", "gs = GridSearchCV(pipe, param_grid, cv=cv, n_jobs=10, scoring='neg_mean_absolute_error', verbose=True)\n", "gs.fit(X_train, y_train)\n", "gs.best_estimator_" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_cv_results(gs.cv_results_)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parameter group 0: {'estimator': Ridge(alpha=21.54), 'preproc__categorical__encode__min_frequency': 2, 'preproc__numeric__fill_na': SimpleImputer(strategy='most_frequent')}\n", "\tMean score: 17885.12\n", "\t95% CI: (16272.11, 19498.13)\n", "------------------------------\n", "Parameter group 1: {'estimator': RandomForestRegressor(), 'preproc__categorical__encode__min_frequency': 2, 'preproc__numeric__fill_na': SimpleImputer(strategy='most_frequent')}\n", "\tMean score: 17662.02\n", "\t95% CI: (16542.67, 18781.37)\n", "------------------------------\n", "Parameter group 2: {'estimator': GradientBoostingRegressor(), 'preproc__categorical__encode__min_frequency': 2, 'preproc__numeric__fill_na': SimpleImputer(strategy='most_frequent')}\n", "\tMean score: 16056.59\n", "\t95% CI: (15043.17, 17070.01)\n" ] } ], "source": [ "analyze_cv_results(gs.cv_results_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Random Forest and Gradient boosting might have an edge, but there's a lot of overlap in the CIs. At this point, one might want to focus on tuning the hyperparameters for the tree-based models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For now, though, let's see how to implement dimension reduction in our pipeline. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dimension reduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can insert a PCA step into our pipeline. " ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preproc',\n",
       "                 ColumnTransformer(transformers=[('numeric',\n",
       "                                                  Pipeline(steps=[('fill_na',\n",
       "                                                                   SimpleImputer(strategy='median')),\n",
       "                                                                  ('scale',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['LotFrontage', 'LotArea',\n",
       "                                                   'OverallQual', 'OverallCond',\n",
       "                                                   'YearBuilt', 'YearRemodAdd',\n",
       "                                                   'MasVnrArea', 'BsmtFinSF1',\n",
       "                                                   'BsmtFinSF2', 'BsmtUnfSF',\n",
       "                                                   'TotalBsmtSF', '1stFlrSF',\n",
       "                                                   '2ndFlrSF', 'LowQualFinSF',\n",
       "                                                   'GrLivAr...\n",
       "                                                   'Condition2', 'BldgType',\n",
       "                                                   'HouseStyle', 'RoofStyle',\n",
       "                                                   'RoofMatl', 'Exterior1st',\n",
       "                                                   'Exterior2nd', 'MasVnrType',\n",
       "                                                   'ExterQual', 'ExterCond',\n",
       "                                                   'Foundation', 'BsmtQual',\n",
       "                                                   'BsmtCond', 'BsmtExposure',\n",
       "                                                   'BsmtFinType1',\n",
       "                                                   'BsmtFinType2', 'Heating',\n",
       "                                                   'HeatingQC', 'CentralAir',\n",
       "                                                   'Electrical', ...])],\n",
       "                                   verbose_feature_names_out=False)),\n",
       "                ('feature_selection', PCA(n_components=10)),\n",
       "                ('estimator', Ridge(alpha=1))])
" ], "text/plain": [ "Pipeline(steps=[('preproc',\n", " ColumnTransformer(transformers=[('numeric',\n", " Pipeline(steps=[('fill_na',\n", " SimpleImputer(strategy='median')),\n", " ('scale',\n", " StandardScaler())]),\n", " ['LotFrontage', 'LotArea',\n", " 'OverallQual', 'OverallCond',\n", " 'YearBuilt', 'YearRemodAdd',\n", " 'MasVnrArea', 'BsmtFinSF1',\n", " 'BsmtFinSF2', 'BsmtUnfSF',\n", " 'TotalBsmtSF', '1stFlrSF',\n", " '2ndFlrSF', 'LowQualFinSF',\n", " 'GrLivAr...\n", " 'Condition2', 'BldgType',\n", " 'HouseStyle', 'RoofStyle',\n", " 'RoofMatl', 'Exterior1st',\n", " 'Exterior2nd', 'MasVnrType',\n", " 'ExterQual', 'ExterCond',\n", " 'Foundation', 'BsmtQual',\n", " 'BsmtCond', 'BsmtExposure',\n", " 'BsmtFinType1',\n", " 'BsmtFinType2', 'Heating',\n", " 'HeatingQC', 'CentralAir',\n", " 'Electrical', ...])],\n", " verbose_feature_names_out=False)),\n", " ('feature_selection', PCA(n_components=10)),\n", " ('estimator', Ridge(alpha=1))])" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.decomposition import PCA\n", "\n", "pipe = Pipeline([\n", " ('preproc', preproc),\n", " ('feature_selection', PCA(n_components=10)),\n", " ('estimator', Ridge(alpha=1))\n", "])\n", "\n", "pipe" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 11 candidates, totalling 110 fits\n" ] }, { "data": { "text/html": [ "
Pipeline(steps=[('preproc',\n",
       "                 ColumnTransformer(transformers=[('numeric',\n",
       "                                                  Pipeline(steps=[('fill_na',\n",
       "                                                                   SimpleImputer(strategy='most_frequent')),\n",
       "                                                                  ('scale',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['LotFrontage', 'LotArea',\n",
       "                                                   'OverallQual', 'OverallCond',\n",
       "                                                   'YearBuilt', 'YearRemodAdd',\n",
       "                                                   'MasVnrArea', 'BsmtFinSF1',\n",
       "                                                   'BsmtFinSF2', 'BsmtUnfSF',\n",
       "                                                   'TotalBsmtSF', '1stFlrSF',\n",
       "                                                   '2ndFlrSF', 'LowQualFinSF',\n",
       "                                                   '...\n",
       "                                                   'Condition2', 'BldgType',\n",
       "                                                   'HouseStyle', 'RoofStyle',\n",
       "                                                   'RoofMatl', 'Exterior1st',\n",
       "                                                   'Exterior2nd', 'MasVnrType',\n",
       "                                                   'ExterQual', 'ExterCond',\n",
       "                                                   'Foundation', 'BsmtQual',\n",
       "                                                   'BsmtCond', 'BsmtExposure',\n",
       "                                                   'BsmtFinType1',\n",
       "                                                   'BsmtFinType2', 'Heating',\n",
       "                                                   'HeatingQC', 'CentralAir',\n",
       "                                                   'Electrical', ...])],\n",
       "                                   verbose_feature_names_out=False)),\n",
       "                ('feature_selection', 'passthrough'),\n",
       "                ('estimator', Ridge(alpha=21.54))])
" ], "text/plain": [ "Pipeline(steps=[('preproc',\n", " ColumnTransformer(transformers=[('numeric',\n", " Pipeline(steps=[('fill_na',\n", " SimpleImputer(strategy='most_frequent')),\n", " ('scale',\n", " StandardScaler())]),\n", " ['LotFrontage', 'LotArea',\n", " 'OverallQual', 'OverallCond',\n", " 'YearBuilt', 'YearRemodAdd',\n", " 'MasVnrArea', 'BsmtFinSF1',\n", " 'BsmtFinSF2', 'BsmtUnfSF',\n", " 'TotalBsmtSF', '1stFlrSF',\n", " '2ndFlrSF', 'LowQualFinSF',\n", " '...\n", " 'Condition2', 'BldgType',\n", " 'HouseStyle', 'RoofStyle',\n", " 'RoofMatl', 'Exterior1st',\n", " 'Exterior2nd', 'MasVnrType',\n", " 'ExterQual', 'ExterCond',\n", " 'Foundation', 'BsmtQual',\n", " 'BsmtCond', 'BsmtExposure',\n", " 'BsmtFinType1',\n", " 'BsmtFinType2', 'Heating',\n", " 'HeatingQC', 'CentralAir',\n", " 'Electrical', ...])],\n", " verbose_feature_names_out=False)),\n", " ('feature_selection', 'passthrough'),\n", " ('estimator', Ridge(alpha=21.54))])" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now let's fit the new pipeline and see how it does.\n", "# We add \"passthrough\" to the parameter search so that the no-reduction baseline competes with the PCA variants.\n", "param_grid = {\n", " 'preproc__numeric__fill_na': [SimpleImputer(strategy='most_frequent')],\n", " 'preproc__categorical__encode__min_frequency': [2],\n", " 'feature_selection': [\"passthrough\"] + [PCA(n) for n in range(1,100,10)],\n", " 'estimator__alpha': [21.54]\n", "}\n", "\n", "gs = GridSearchCV(pipe, param_grid, cv=cv, n_jobs=10, scoring='neg_mean_absolute_error', verbose=True)\n", "gs.fit(X_train, y_train)\n", "gs.best_estimator_" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_cv_results(gs.cv_results_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The passthrough variant outperforms all of the PCA variants. Let's try a different feature selection approach: univariate selection with `SelectKBest`, which keeps the `k` features that score highest on a statistical test against the target. " ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_selection import SelectKBest\n", "# Note: f_classif is an ANOVA F-test designed for classification targets.\n", "# For a continuous target such as price, f_regression is the usual choice;\n", "# the results recorded below were produced with f_classif.\n", "from sklearn.feature_selection import f_classif" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 6 candidates, totalling 60 fits\n" ] }, { "data": { "text/html": [ "
Pipeline(steps=[('preproc',\n",
       "                 ColumnTransformer(transformers=[('numeric',\n",
       "                                                  Pipeline(steps=[('fill_na',\n",
       "                                                                   SimpleImputer(strategy='most_frequent')),\n",
       "                                                                  ('scale',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['LotFrontage', 'LotArea',\n",
       "                                                   'OverallQual', 'OverallCond',\n",
       "                                                   'YearBuilt', 'YearRemodAdd',\n",
       "                                                   'MasVnrArea', 'BsmtFinSF1',\n",
       "                                                   'BsmtFinSF2', 'BsmtUnfSF',\n",
       "                                                   'TotalBsmtSF', '1stFlrSF',\n",
       "                                                   '2ndFlrSF', 'LowQualFinSF',\n",
       "                                                   '...\n",
       "                                                   'Condition2', 'BldgType',\n",
       "                                                   'HouseStyle', 'RoofStyle',\n",
       "                                                   'RoofMatl', 'Exterior1st',\n",
       "                                                   'Exterior2nd', 'MasVnrType',\n",
       "                                                   'ExterQual', 'ExterCond',\n",
       "                                                   'Foundation', 'BsmtQual',\n",
       "                                                   'BsmtCond', 'BsmtExposure',\n",
       "                                                   'BsmtFinType1',\n",
       "                                                   'BsmtFinType2', 'Heating',\n",
       "                                                   'HeatingQC', 'CentralAir',\n",
       "                                                   'Electrical', ...])],\n",
       "                                   verbose_feature_names_out=False)),\n",
       "                ('feature_selection', 'passthrough'),\n",
       "                ('estimator', Ridge(alpha=21.54))])
" ], "text/plain": [ "Pipeline(steps=[('preproc',\n", " ColumnTransformer(transformers=[('numeric',\n", " Pipeline(steps=[('fill_na',\n", " SimpleImputer(strategy='most_frequent')),\n", " ('scale',\n", " StandardScaler())]),\n", " ['LotFrontage', 'LotArea',\n", " 'OverallQual', 'OverallCond',\n", " 'YearBuilt', 'YearRemodAdd',\n", " 'MasVnrArea', 'BsmtFinSF1',\n", " 'BsmtFinSF2', 'BsmtUnfSF',\n", " 'TotalBsmtSF', '1stFlrSF',\n", " '2ndFlrSF', 'LowQualFinSF',\n", " '...\n", " 'Condition2', 'BldgType',\n", " 'HouseStyle', 'RoofStyle',\n", " 'RoofMatl', 'Exterior1st',\n", " 'Exterior2nd', 'MasVnrType',\n", " 'ExterQual', 'ExterCond',\n", " 'Foundation', 'BsmtQual',\n", " 'BsmtCond', 'BsmtExposure',\n", " 'BsmtFinType1',\n", " 'BsmtFinType2', 'Heating',\n", " 'HeatingQC', 'CentralAir',\n", " 'Electrical', ...])],\n", " verbose_feature_names_out=False)),\n", " ('feature_selection', 'passthrough'),\n", " ('estimator', Ridge(alpha=21.54))])" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now let's fit the pipeline with univariate feature selection and see how it does.\n", "# \"passthrough\" stays in the search as a baseline against the SelectKBest variants.\n", "param_grid = {\n", " 'preproc__numeric__fill_na': [SimpleImputer(strategy='most_frequent')],\n", " 'preproc__categorical__encode__min_frequency': [2],\n", " 'feature_selection': [\"passthrough\"] + [SelectKBest(f_classif, k=k) for k in range(1, 200, 40)],\n", " 'estimator__alpha': [21.54]\n", "}\n", "\n", "gs = GridSearchCV(pipe, param_grid, cv=cv, n_jobs=10, scoring='neg_mean_absolute_error', verbose=True)\n", "gs.fit(X_train, y_train)\n", "gs.best_estimator_" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parameter group 0: {'estimator__alpha': 21.54, 'feature_selection': 'passthrough', 'preproc__categorical__encode__min_frequency': 2, 'preproc__numeric__fill_na': SimpleImputer(strategy='most_frequent')}\n", "\tMean score: 18313.82\n", "\t95% CI: 
(15610.21, 21017.43)\n", "------------------------------\n", "Parameter group 1: {'estimator__alpha': 21.54, 'feature_selection': SelectKBest(k=1), 'preproc__categorical__encode__min_frequency': 2, 'preproc__numeric__fill_na': SimpleImputer(strategy='most_frequent')}\n", "\tMean score: 56349.52\n", "\t95% CI: (52760.24, 59938.79)\n", "------------------------------\n", "Parameter group 2: {'estimator__alpha': 21.54, 'feature_selection': SelectKBest(k=41), 'preproc__categorical__encode__min_frequency': 2, 'preproc__numeric__fill_na': SimpleImputer(strategy='most_frequent')}\n", "\tMean score: 21874.57\n", "\t95% CI: (19566.05, 24183.08)\n", "------------------------------\n", "Parameter group 3: {'estimator__alpha': 21.54, 'feature_selection': SelectKBest(k=81), 'preproc__categorical__encode__min_frequency': 2, 'preproc__numeric__fill_na': SimpleImputer(strategy='most_frequent')}\n", "\tMean score: 19809.51\n", "\t95% CI: (17204.10, 22414.92)\n", "------------------------------\n", "Parameter group 4: {'estimator__alpha': 21.54, 'feature_selection': SelectKBest(k=121), 'preproc__categorical__encode__min_frequency': 2, 'preproc__numeric__fill_na': SimpleImputer(strategy='most_frequent')}\n", "\tMean score: 19531.81\n", "\t95% CI: (17051.55, 22012.07)\n", "------------------------------\n", "Parameter group 5: {'estimator__alpha': 21.54, 'feature_selection': SelectKBest(k=161), 'preproc__categorical__encode__min_frequency': 2, 'preproc__numeric__fill_na': SimpleImputer(strategy='most_frequent')}\n", "\tMean score: 19106.24\n", "\t95% CI: (16654.09, 21558.40)\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_cv_results(gs.cv_results_)\n", "analyze_cv_results(gs.cv_results_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"passthrough\" wins again, with a mean cross-validated MAE of about 18,314. `SelectKBest` with `k=41` reaches about 21,875, so we retain most of the performance using only 41 features, and the gap narrows further as `k` grows. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final Evaluation\n", "We can view the best estimator from the search and evaluate it on the test set. This estimator was selected using cross-validation and then refit on the entire training set, which is the default `refit=True` behavior of `GridSearchCV`. " ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preproc',\n",
       "                 ColumnTransformer(transformers=[('numeric',\n",
       "                                                  Pipeline(steps=[('fill_na',\n",
       "                                                                   SimpleImputer(strategy='most_frequent')),\n",
       "                                                                  ('scale',\n",
       "                                                                   StandardScaler())]),\n",
       "                                                  ['LotFrontage', 'LotArea',\n",
       "                                                   'OverallQual', 'OverallCond',\n",
       "                                                   'YearBuilt', 'YearRemodAdd',\n",
       "                                                   'MasVnrArea', 'BsmtFinSF1',\n",
       "                                                   'BsmtFinSF2', 'BsmtUnfSF',\n",
       "                                                   'TotalBsmtSF', '1stFlrSF',\n",
       "                                                   '2ndFlrSF', 'LowQualFinSF',\n",
       "                                                   '...\n",
       "                                                   'Condition2', 'BldgType',\n",
       "                                                   'HouseStyle', 'RoofStyle',\n",
       "                                                   'RoofMatl', 'Exterior1st',\n",
       "                                                   'Exterior2nd', 'MasVnrType',\n",
       "                                                   'ExterQual', 'ExterCond',\n",
       "                                                   'Foundation', 'BsmtQual',\n",
       "                                                   'BsmtCond', 'BsmtExposure',\n",
       "                                                   'BsmtFinType1',\n",
       "                                                   'BsmtFinType2', 'Heating',\n",
       "                                                   'HeatingQC', 'CentralAir',\n",
       "                                                   'Electrical', ...])],\n",
       "                                   verbose_feature_names_out=False)),\n",
       "                ('feature_selection', 'passthrough'),\n",
       "                ('estimator', Ridge(alpha=21.54))])
" ], "text/plain": [ "Pipeline(steps=[('preproc',\n", " ColumnTransformer(transformers=[('numeric',\n", " Pipeline(steps=[('fill_na',\n", " SimpleImputer(strategy='most_frequent')),\n", " ('scale',\n", " StandardScaler())]),\n", " ['LotFrontage', 'LotArea',\n", " 'OverallQual', 'OverallCond',\n", " 'YearBuilt', 'YearRemodAdd',\n", " 'MasVnrArea', 'BsmtFinSF1',\n", " 'BsmtFinSF2', 'BsmtUnfSF',\n", " 'TotalBsmtSF', '1stFlrSF',\n", " '2ndFlrSF', 'LowQualFinSF',\n", " '...\n", " 'Condition2', 'BldgType',\n", " 'HouseStyle', 'RoofStyle',\n", " 'RoofMatl', 'Exterior1st',\n", " 'Exterior2nd', 'MasVnrType',\n", " 'ExterQual', 'ExterCond',\n", " 'Foundation', 'BsmtQual',\n", " 'BsmtCond', 'BsmtExposure',\n", " 'BsmtFinType1',\n", " 'BsmtFinType2', 'Heating',\n", " 'HeatingQC', 'CentralAir',\n", " 'Electrical', ...])],\n", " verbose_feature_names_out=False)),\n", " ('feature_selection', 'passthrough'),\n", " ('estimator', Ridge(alpha=21.54))])" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gs.best_estimator_" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First-pass performance:\n", "Train MAE: 15288.339516024189\n", "Test MAE: 20544.401602914677\n", "Train R^2: 0.9028881797596011\n", "Test R^2: 0.8714648323729995\n", "\n", "Fine tuned performance:\n", "Train MAE: 15485.129766251588\n", "Test MAE: 18672.381852407823\n", "Train R^2: 0.8875868898505703\n", "Test R^2: 0.8765177538936448\n" ] } ], "source": [ "# MAE is a more intuitive metric for price estimation\n", "from sklearn.metrics import mean_absolute_error, r2_score\n", "\n", "print(\"First-pass performance:\")\n", "print(\"Train MAE:\", mean_absolute_error(y_train, y_train_pred_1stpass))\n", "print(\"Test MAE:\", mean_absolute_error(y_test, y_test_pred_1stpass))\n", "print(\"Train R^2:\", r2_score(y_train, y_train_pred_1stpass))\n", "print(\"Test R^2:\", r2_score(y_test, 
y_test_pred_1stpass))\n", "\n", "y_train_pred = gs.best_estimator_.predict(X_train)\n", "y_test_pred = gs.best_estimator_.predict(X_test)\n", "print(\"\\nFine tuned performance:\")\n", "print(\"Train MAE:\", mean_absolute_error(y_train, y_train_pred))\n", "print(\"Test MAE:\", mean_absolute_error(y_test, y_test_pred))\n", "print(\"Train R^2:\", r2_score(y_train, y_train_pred))\n", "print(\"Test R^2:\", r2_score(y_test, y_test_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assignment\n", "If you haven't already, apply the methodology described above to your own dataset. Explore additional ways to improve the model. For instance, \n", "we saw that [GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) had an edge over Ridge regression without any hyperparameter tuning. Explore the hyperparameters for GradientBoostingRegressor in the sklearn documentation. Define a hyperparameter search using GridSearchCV and use our evaluation methodology to compare the two methods. " ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }