TL;DR#
Installation#
Scikit-learn is included in the base conda environment available on Palmetto. To install it into a custom environment instead, follow the official instructions: https://scikit-learn.org/stable/install.html
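To confirm which version is active in the current environment, a quick check along these lines may help. This is a minimal sketch that only assumes scikit-learn is importable:

```python
import sklearn

# Print the installed scikit-learn version to confirm the environment is set up.
print("scikit-learn version:", sklearn.__version__)
```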
Dataset#
from sklearn import datasets
digits = datasets.load_digits()
digits.keys()
dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])
digits['data'].shape, digits['target'].shape
((1797, 64), (1797,))
X = digits['data']
y = digits['target']
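The 64 columns of the feature matrix are the pixels of 8x8 grayscale images. The sketch below illustrates how the 'images' and 'data' entries relate; the variable names `flattened` and `same` are only illustrative:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# 'images' holds the same pixels as 'data', arranged as 8x8 grids.
print(digits["images"].shape)  # (1797, 8, 8)

# Flattening each image row by row reproduces the feature matrix.
flattened = digits["images"].reshape(len(digits["images"]), -1)
same = np.array_equal(flattened, digits["data"])
print("Flattened images equal data:", same)
```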
We will see examples of real-world datasets later in the series.
Partition#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((1437, 64), (360, 64), (1437,), (360,))
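The split above is shuffled differently on every run, so scores will vary slightly between runs. If reproducibility matters, one option is to pass `random_state`, and `stratify=y` keeps the class proportions similar in both parts. The seed value 0 below is an arbitrary choice, not something prescribed by the workshop:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)

# random_state fixes the shuffle; stratify=y preserves the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)  # (1437, 64) (360, 64)
```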
Session 1 Assignment: Choose a dataset#
Before returning for Session 2 of the sklearn workshop, select a dataset that is suitable for machine learning. The dataset should have the following properties:
tabular: it should be a row-column dataset whose columns are numeric or categorical variables. It should not be a text or image dataset.
target and features: it should have a well-defined target variable and a set of feature variables.
size: it should have at least a few hundred rows but fewer than 100,000 rows.
Consider using a dataset from your own research if possible. Otherwise, browse Kaggle datasets for options.
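As a rough way to screen a candidate dataset against the properties above, a sketch like the following could be used. It assumes pandas is installed and uses scikit-learn's built-in breast cancer data as a stand-in for your own table. The lower bound of 200 rows is an assumed reading of "a few hundred", and the variable names are illustrative:

```python
from sklearn.datasets import load_breast_cancer

# Stand-in dataset; replace with your own pandas DataFrame.
df = load_breast_cancer(as_frame=True).frame
target = "target"  # name of the target column in this stand-in dataset

n_rows, n_cols = df.shape
has_target = target in df.columns

# Assumed bounds: "a few hundred" is read here as at least 200 rows.
size_ok = 200 <= n_rows < 100_000

# Feature columns that are neither numeric nor categorical dtype
# (for example free text stored as objects) deserve a closer look.
features = df.drop(columns=[target])
needs_review = features.select_dtypes(exclude=["number", "category"]).columns.tolist()

print("rows:", n_rows, "columns:", n_cols)
print("has target:", has_target, "| size ok:", size_ok)
print("columns to review:", needs_review)
```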
Model Fitting#
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
RandomForestClassifier()
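A random forest is itself randomized, so two fits on the same data can differ slightly. The sketch below shows one way to make the fit repeatable and to inspect the fitted model. The values `n_estimators=100` and `random_state=0` are illustrative choices:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fix the number of trees and the seed so repeated fits give the same model.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Fitted attributes end with an underscore.
print("Number of trees:", len(clf.estimators_))
print("Classes seen during fit:", clf.classes_)
```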
Model Evaluation#
print("Train score:", clf.score(X_train, y_train))
print("Test score:", clf.score(X_test, y_test))
Train score: 1.0
Test score: 0.9805555555555555
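For a classifier, `score` reports mean accuracy, and the gap between the train and test scores shows why evaluation should rely on held-out data. To see which digits get confused with each other, a confusion matrix can complement the single accuracy number. This is a minimal sketch with an arbitrary seed, so its numbers will not match the outputs above exactly:

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
print(cm)

# The diagonal counts correct predictions, so this matches clf.score.
accuracy = np.trace(cm) / cm.sum()
print("Accuracy from confusion matrix:", accuracy)
```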
Inference#
# predict the class
ix = 42
print("Predicted:", clf.predict(X_test[[ix]]))
print("True:", y_test[ix])
Predicted: [1]
True: 1
# class probabilities
clf.predict_proba(X_test[[ix]])
array([[0. , 0.83, 0.05, 0.02, 0. , 0.01, 0. , 0. , 0.09, 0. ]])
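The predicted class is the one with the highest probability, and the columns of `predict_proba` follow the order of `clf.classes_`. The sketch below checks that relationship on the whole test set. The seed is arbitrary and the variable names are illustrative:

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1.
proba = clf.predict_proba(X_test)
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# Picking the most probable class per row reproduces predict().
from_proba = clf.classes_[np.argmax(proba, axis=1)]
matches_predict = np.array_equal(from_proba, clf.predict(X_test))

print("Rows sum to one:", rows_sum_to_one)
print("Argmax matches predict:", matches_predict)
```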