"
],
"text/plain": [
" Ozone Solar.R Wind Temp Month Day\n",
"0 41.00000 190.000000 7.4 67 5 1\n",
"1 36.00000 118.000000 8.0 72 5 2\n",
"2 12.00000 149.000000 12.6 74 5 3\n",
"3 18.00000 313.000000 11.5 62 5 4\n",
"4 42.12931 185.931507 14.3 56 5 5"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data2 = data_df.copy()\n",
"data2.fillna(data2.mean(), inplace=True)\n",
"data2.head()"
]
},
{
"cell_type": "markdown",
"id": "bb5660e0-5e97-4c09-8677-49fef7477bca",
"metadata": {},
"source": [
"- Method 3: Use `Impute` to handle missing values\n",
"\n",
"In statistics, imputation is the process of replacing missing data with substituted values. \n",
"Because missing data can create problems for analyzing data, imputation is seen as a way to \n",
"avoid pitfalls involved with listwise deletion of cases that have missing values. That is to \n",
"say, when one or more values are missing for a case, most statistical packages default to discarding \n",
"any case that has a missing value, which may introduce bias or affect the representativeness of the \n",
"results. Imputation preserves all cases by replacing missing data with an estimated value based on \n",
"other available information. Once all missing values have been imputed, the data set can then be \n",
"analysed using standard techniques for complete data. There have been many theories embraced by \n",
"scientists to account for missing data but the majority of them introduce bias. A few of the well \n",
"known attempts to deal with missing data include: \n",
"- hot deck and cold deck imputation; \n",
"- listwise and pairwise deletion; \n",
"- mean imputation; \n",
"- non-negative matrix factorization; \n",
"- regression imputation; \n",
"- last observation carried forward; \n",
"- stochastic imputation; \n",
"- and multiple imputation."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "0925abd1-0d07-4696-acb7-c3f6873c79f1",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" 0 1 2 3 4 5\n",
"0 41.0 190.0 7.4 67.0 5.0 1.0\n",
"1 36.0 118.0 8.0 72.0 5.0 2.0\n",
"2 12.0 149.0 12.6 74.0 5.0 3.0\n",
"3 18.0 313.0 11.5 62.0 5.0 4.0\n",
"4 18.5 206.0 14.3 56.0 5.0 5.0"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.impute import KNNImputer\n",
"imputer = KNNImputer(n_neighbors=2, weights=\"uniform\")\n",
"data_knnimpute = pd.DataFrame(imputer.fit_transform(data_df))\n",
"data_knnimpute.head()"
]
},
{
"cell_type": "markdown",
"id": "75ab8133-4b41-4484-b350-78636e43fd03",
"metadata": {
"tags": []
},
"source": [
"- In addition to KNNImputer, there are **IterativeImputer** (Multivariate imputer that estimates each feature from all the others) and **MissingIndicator**(Binary indicators for missing values)\n",
"- More information on sklearn.impute can be found [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute)"
]
},
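{
"cell_type": "markdown",
"id": "sketch-iterative-imputer",
"metadata": {},
"source": [
"The two other tools mentioned above can be tried on a small array. This is a minimal sketch, not part of the original analysis: `X_toy` is made-up data, and `IterativeImputer` is still experimental in scikit-learn, so it has to be enabled through `sklearn.experimental` before it can be imported.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.experimental import enable_iterative_imputer  # noqa: F401, must come before the next import\n",
"from sklearn.impute import IterativeImputer, MissingIndicator\n",
"\n",
"# Toy data (made up for illustration): the second column is about twice the first\n",
"X_toy = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0], [np.nan, 10.0]])\n",
"\n",
"# Boolean mask of missing entries, one column per feature that contains missing values\n",
"mask = MissingIndicator().fit_transform(X_toy)\n",
"\n",
"# Estimate each missing value from the other feature\n",
"X_filled = IterativeImputer(random_state=0).fit_transform(X_toy)\n",
"\n",
"print(mask)\n",
"print(X_filled)\n",
"```\n",
"\n",
"The indicator matrix can be kept alongside the imputed values so that a downstream model still knows which entries were originally missing."
]
},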
{
"cell_type": "markdown",
"id": "656ecb3b-5945-4f96-8db3-1e77350d3beb",
"metadata": {
"tags": []
},
"source": [
"## Standardizing data\n",
"\n",
"\n",
"- Standardization comes into picture when features of input data set have large differences between their ranges, or simply when they are measured in different measurement units for example: rainfall (0-1000mm), temperature (-10 to 40oC), humidity (0-100%), etc.\n",
"- Standardition Convert all independent variables into the same scale (mean=0, std=1) \n",
"- These differences in the ranges of initial features causes trouble to many machine learning models. For example, for the models that are based on distance computation, if one of the features has a broad range of values, the distance will be governed by this particular feature.\n",
"- The example below use data from above:"
]
},
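{
"cell_type": "markdown",
"id": "sketch-standard-scaler",
"metadata": {},
"source": [
"As a minimal sketch of the idea, the snippet below standardizes a small made-up array with scikit-learn's `StandardScaler`. The array `X_toy` and its values are assumptions chosen only to mimic features on very different scales; they are not the air-quality data used in this notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Made-up values on very different scales: rainfall (mm) and temperature (deg C)\n",
"X_toy = np.array([[0.0, -10.0], [250.0, 5.0], [500.0, 20.0], [1000.0, 40.0]])\n",
"\n",
"scaler = StandardScaler()\n",
"X_std = scaler.fit_transform(X_toy)  # (x - column mean) / column std\n",
"\n",
"print(X_std.mean(axis=0))  # approximately 0 for every column\n",
"print(X_std.std(axis=0))   # approximately 1 for every column\n",
"```\n",
"\n",
"In a modelling workflow the scaler would usually be fitted on the training split only and then reused to transform the test split."
]
},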
{
"cell_type": "code",
"execution_count": 8,
"id": "3b589291-4aa4-4a6a-b205-367573683481",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"plt.hist(data_df['Temp'], bins=20)\n",
"plt.title(\"Original\")\n",
"plt.show()\n",
"\n",
"plt.hist(data_std['Temp'], bins=20)\n",
"plt.title(\"Standardized\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "e0515d17-624c-49d9-a82a-56539bae83a4",
"metadata": {},
"source": [
"#### 2.3.2.2 Using scaling with predefine range\n",
"Transform features by scaling each feature to a given range.\n",
"This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.\n",
"Formulation for this is:\n",
"\n",
"```\n",
"X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))\n",
"X_scaled = X_std * (max - min) + min\n",
"```"
]
},
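{
"cell_type": "markdown",
"id": "sketch-minmax-scaler",
"metadata": {},
"source": [
"The formula above can be checked by hand against scikit-learn's `MinMaxScaler`. This is a sketch under stated assumptions: `X_toy` is made-up data, and `lo`, `hi` are illustrative names for the target range passed as `feature_range`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"X_toy = np.array([[1.0, 10.0], [2.0, 30.0], [4.0, 50.0]])  # made-up values\n",
"lo, hi = -1.0, 1.0  # target range (feature_range)\n",
"\n",
"# Formula from the text, applied by hand\n",
"X_unit = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))\n",
"X_manual = X_unit * (hi - lo) + lo\n",
"\n",
"# Same transformation via scikit-learn\n",
"X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X_toy)\n",
"\n",
"print(np.allclose(X_manual, X_sklearn))\n",
"```\n",
"\n",
"Because the minimum and maximum are learned per column, a single extreme outlier can compress the remaining values into a narrow band."
]
},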
{
"cell_type": "code",
"execution_count": 10,
"id": "586d9d86-5c5b-493a-b434-f0374ca40a87",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.hist(data_df['Temp'], bins=20)\n",
"plt.title(\"Original\")\n",
"plt.show()\n",
"\n",
"plt.hist(data_scaler['Temp'], bins=20)\n",
"plt.title(\"MinMax Scaler\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "5047339c-8c86-4513-91b3-684462cf3acb",
"metadata": {},
"source": [
"#### 2.3.2.3 Using Box-Cox Transformation\n",
"- A [Box Cox](https://rss.onlinelibrary.wiley.com/doi/10.1111/j.2517-6161.1964.tb00553.x) transformation is a transformation of a non-normal dependent variables into a normal shape. \n",
"- Normality is an important assumption for many statistical techniques; if your data isn’t normal, applying a Box-Cox means that you are able to run a broader number of tests.\n",
"- The Box Cox transformation is named after statisticians George Box and Sir David Roxbee Cox who collaborated on a 1964 paper and developed the technique.\n",
"- BoxCox can only be applied to stricly positive values"
]
},
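{
"cell_type": "markdown",
"id": "sketch-box-cox",
"metadata": {},
"source": [
"A minimal sketch of a Box-Cox transformation using `scipy.stats.boxcox`. The sample `x` is synthetic (log-normal, hence strictly positive and right-skewed) and is an assumption made only for illustration; it is not the air-quality data.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"rng = np.random.default_rng(0)\n",
"x = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # synthetic, strictly positive, right-skewed\n",
"\n",
"# With no lambda given, boxcox estimates it by maximum likelihood\n",
"x_bc, fitted_lambda = stats.boxcox(x)\n",
"\n",
"print('skewness before:', stats.skew(x))\n",
"print('skewness after: ', stats.skew(x_bc))\n",
"print('fitted lambda:  ', fitted_lambda)\n",
"```\n",
"\n",
"If a feature contains zeros or negative values, `boxcox` raises an error, so such data would need to be shifted or handled with a different transformation first."
]
},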
{
"cell_type": "code",
"execution_count": 12,
"id": "cc8af290-49fe-4930-9cee-d9429326a709",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" Solar.R Wind Temp Month Day Solar.R / Temp\n",
"0 190.000000 7.4 67.0 5.0 1.0 2.835821\n",
"1 118.000000 8.0 72.0 5.0 2.0 1.638889\n",
"2 149.000000 12.6 74.0 5.0 3.0 2.013514\n",
"3 313.000000 11.5 62.0 5.0 4.0 5.048387\n",
"4 185.931507 14.3 56.0 5.0 5.0 3.320205\n",
".. ... ... ... ... ... ...\n",
"148 193.000000 6.9 70.0 9.0 26.0 2.757143\n",
"149 145.000000 13.2 77.0 9.0 27.0 1.883117\n",
"150 191.000000 14.3 75.0 9.0 28.0 2.546667\n",
"151 131.000000 8.0 76.0 9.0 29.0 1.723684\n",
"152 223.000000 11.5 68.0 9.0 30.0 3.279412\n",
"\n",
"[153 rows x 6 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's add a NEW feature - a ratio of two of the measurements\n",
"X['Solar.R / Temp'] = X['Solar.R'] / X['Temp']\n",
"X"
]
},
{
"cell_type": "markdown",
"id": "43b17605-bbf5-45ff-b4f8-9d1d9e2b0355",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"### [Feature selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection) "
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "3b732174-d39a-42c9-84f0-e9dc5b1a4c04",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Top k features: ['Wind', 'Temp']\n"
]
}
],
"source": [
"# SelectKBest for selecting top-scoring features\n",
"from sklearn.feature_selection import SelectKBest, f_regression\n",
"\n",
"# select the best 3 features for regression\n",
"dim_red = SelectKBest(f_regression, k = 2)\n",
"dim_red.fit(X, y)\n",
"X_t = dim_red.transform(X)\n",
"\n",
"# Get back the selected columns\n",
"selected = dim_red.get_support() # boolean values\n",
"selected_names = X.columns[selected]\n",
"\n",
"print('Top k features: ', list(selected_names))"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "066b99f0-7c92-4435-90f0-8f86c176e8e2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scores: [1.52612031e+01 5.92750302e+01 8.88983834e+01 3.43229390e+00\n",
" 1.94731074e-02 3.35729306e+00]\n",
"New shape: (153, 2)\n"
]
}
],
"source": [
"# Show scores, features selected and new shape\n",
"print('Scores:', dim_red.scores_)\n",
"print('New shape:', X_t.shape)"
]
},
{
"cell_type": "markdown",
"id": "c6c2bce3-1cbe-4f17-b596-9e914ae3ad00",
"metadata": {},
"source": [
"**Note on scoring function selection in `SelectKBest` tranformations:**\n",
"* For regression - [f_regression](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression)\n",
"* For classification - [chi2](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2), [f_classif](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif)\n"
]
},
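{
"cell_type": "markdown",
"id": "sketch-selectkbest-classif",
"metadata": {},
"source": [
"For comparison with the regression example, here is a minimal sketch of `SelectKBest` with a classification scoring function. It assumes the iris dataset bundled with scikit-learn, which is not otherwise used in this section; the variable names are illustrative.\n",
"\n",
"```python\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.feature_selection import SelectKBest, f_classif\n",
"\n",
"iris = load_iris()\n",
"X_iris, y_iris = iris.data, iris.target\n",
"\n",
"# Keep the 2 features with the highest ANOVA F-scores\n",
"selector = SelectKBest(f_classif, k=2)\n",
"X_iris_t = selector.fit_transform(X_iris, y_iris)\n",
"\n",
"selected = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]\n",
"print('Selected features:', selected)\n",
"print('New shape:', X_iris_t.shape)\n",
"```\n",
"\n",
"`chi2` can be swapped in for `f_classif` here because all iris features are non-negative, which `chi2` requires."
]
},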
{
"cell_type": "markdown",
"id": "0f2c0803-d5e3-4c5d-b1b4-04c71300e4c6",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"### Principal component analysis (aka PCA)\n",
"* Reduces dimensions (number of features), based on what information explains the most variance (or signal)\n",
"* Considered unsupervised learning\n",
"* Useful for very large feature space (e.g. say the botanist in charge of the iris dataset measured 100 more parts of the flower and thus there were 104 columns instead of 4)\n",
"* More about PCA on wikipedia [here](https://en.wikipedia.org/wiki/Principal_component_analysis)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "510ce77f-d34b-4bd6-b83a-6e8107d351ea",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# PCA for dimensionality reduction\n",
"\n",
"from sklearn import decomposition\n",
"from sklearn import datasets\n",
"\n",
"digits = datasets.load_digits()\n",
"\n",
"X, y = digits.data, digits.target\n",
"\n",
"# perform principal component analysis\n",
"pca = decomposition.PCA(.95)\n",
"pca.fit(X)\n",
"X_t = pca.transform(X)\n",
"(X_t[:, 0])\n",
"\n",
"# import numpy and matplotlib for plotting (and set some stuff)\n",
"import numpy as np\n",
"np.set_printoptions(suppress=True)\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"# let's separate out data based on first two principle components\n",
"x1, x2 = X_t[:, 0], X_t[:, 1]\n",
"\n",
"\n",
"# please don't worry about details of the plotting below \n",
"c1 = np.array(list('rbg')) # colors\n",
"classes = digits.target_names[y] \n",
"for (i, cla) in enumerate(set(classes)):\n",
" xc = [p for (j, p) in enumerate(x1) if classes[j] == cla]\n",
" yc = [p for (j, p) in enumerate(x2) if classes[j] == cla]\n",
" plt.scatter(xc, yc, label = cla)\n",
" plt.ylabel('Principal Component 2')\n",
" plt.xlabel('Principal Component 1')\n",
"plt.legend(loc = 4)"
]
},
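{
"cell_type": "markdown",
"id": "sketch-pca-explained-variance",
"metadata": {},
"source": [
"The cell above passes a float to `PCA`, which asks it to keep enough components to reach that fraction of explained variance. The following sketch, with illustrative variable names, shows one way to inspect how many components were kept and how much variance they cover.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn import datasets, decomposition\n",
"\n",
"X_digits = datasets.load_digits().data  # 64 pixel features per image\n",
"\n",
"# A float in (0, 1) asks PCA to keep enough components to reach that variance fraction\n",
"pca_95 = decomposition.PCA(n_components=0.95)\n",
"X_reduced = pca_95.fit_transform(X_digits)\n",
"\n",
"print('components kept:', pca_95.n_components_)\n",
"print('variance explained:', pca_95.explained_variance_ratio_.sum())\n",
"print('cumulative ratio (first 5):', np.cumsum(pca_95.explained_variance_ratio_)[:5])\n",
"```\n",
"\n",
"Plotting the cumulative ratio against the number of components is a common way to choose a cut-off when a fixed variance target is not given in advance."
]
},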
{
"cell_type": "markdown",
"id": "8f86dd33-fe78-4137-899d-3459ca17950b",
"metadata": {},
"source": [
"See scikit-learn's excellent tutorial on feature selection [here](http://scikit-learn.org/stable/modules/feature_selection.html)"
]
},
{
"cell_type": "markdown",
"id": "d01843e2-bdb3-41d4-aa25-36d0e4531103",
"metadata": {},
"source": [
"### One Hot Encoding\n",
"* It's an operation on feature labels - a method of dummying variable\n",
"* Expands the feature space by nature of transform - later this can be processed further with a dimensionality reduction (the dummied variables are now their own features)\n",
"* FYI: One hot encoding variables is needed for python ML module `tenorflow`\n",
"* Can do this with `pandas` method or a `sklearn` one-hot-encoder system"
]
},
{
"cell_type": "markdown",
"id": "1d932024-a1f2-4b7c-8f5c-5fc27b8578a2",
"metadata": {},
"source": [
"#### `pandas` method"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "f535768c-a9fa-4c30-836a-6353de72499e",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"