TL;DR

Installation

Scikit-learn is included in the base conda environment available on Palmetto. To install it into a custom environment instead, follow the official instructions: https://scikit-learn.org/stable/install.html

Dataset

from sklearn import datasets

# Load the handwritten digits dataset bundled with scikit-learn
digits = datasets.load_digits()
digits.keys()
# Output: dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])
digits['data'].shape, digits['target'].shape
# Output: ((1797, 64), (1797,))

# Features (X) and labels (y)
X = digits['data']
y = digits['target']

We will see examples of real-world datasets later in the series.
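To make the shapes above concrete, here is a minimal sketch that relates the `data` and `images` entries and counts the samples per class. The variable names (`first_image`, `first_row`, `same_pixels`) are illustrative choices, not part of the workshop material.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Each sample is an 8x8 grayscale image; `data` holds the same pixels
# flattened into a row of 64 features.
first_image = digits["images"][0]   # shape (8, 8)
first_row = digits["data"][0]       # shape (64,)
same_pixels = np.array_equal(first_image.ravel(), first_row)

# Count how many samples belong to each digit class.
classes, counts = np.unique(digits["target"], return_counts=True)

print(first_image.shape, first_row.shape, same_pixels)
print(dict(zip(classes.tolist(), counts.tolist())))
```

This is why the feature matrix has 64 columns: every column is one pixel of the original 8x8 image.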

Partition

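from sklearn.model_selection import train_test_split

# Hold out 20% of the samples for testing.
# The split is random, so the rows in each partition change between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# Output: ((1437, 64), (360, 64), (1437,), (360,))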
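Because the split above is random, scores reported later can differ from run to run. The following sketch shows one possible way to make the partition repeatable and class-balanced using the `random_state` and `stratify` parameters of `train_test_split`. The seed value `0` is an arbitrary assumption; any integer would do.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)

# random_state fixes the shuffle, so the same rows land in each partition every run.
# stratify=y keeps the class proportions similar in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

The partition sizes are the same as in the unseeded split; only the choice of rows becomes deterministic.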

Session 1 Assignment: Choose a dataset

Before returning for Session 2 of the sklearn workshop, select a dataset that is suitable for machine learning. The dataset should have the following properties:

  • tabular: it should be a row-column dataset with numeric or categorical variables for columns. It should not be a text or image dataset.

  • target and features: it should have a well-defined target variable and feature variables

  • size: it should have at least a few hundred rows but fewer than 100,000 rows.

Consider using a dataset from your research if possible. Otherwise, browse Kaggle datasets for options.
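The criteria above can be checked quickly once a candidate dataset is loaded into a pandas DataFrame. The sketch below is only an illustration: `check_dataset` is a hypothetical helper, the 300-row lower bound is an assumed reading of "a few hundred", and the breast cancer dataset bundled with scikit-learn stands in for your own data.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer


def check_dataset(df: pd.DataFrame, target: str,
                  min_rows: int = 300, max_rows: int = 100_000) -> dict:
    """Hypothetical helper: summarize how a DataFrame fits the assignment criteria."""
    if target not in df.columns:
        raise KeyError(f"target column {target!r} not found")
    features = df.drop(columns=[target])
    n_rows = len(df)
    return {
        "n_rows": n_rows,
        "n_features": features.shape[1],
        # Size criterion: at least min_rows and fewer than max_rows.
        "size_ok": min_rows <= n_rows < max_rows,
        # Non-numeric columns are categorical candidates that need encoding later.
        "n_numeric_features": features.select_dtypes(include="number").shape[1],
        "n_missing_values": int(df.isna().sum().sum()),
    }


# Stand-in dataset; replace with pd.read_csv(...) on your own file.
df = load_breast_cancer(as_frame=True).frame
report = check_dataset(df, target="target")
print(report)
```

A report with `size_ok` set to `False`, or with no clear target column, suggests looking for a different dataset.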

Model Fitting

from sklearn.ensemble import RandomForestClassifier

# Create a random forest with default hyperparameters and fit it to the training data
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
# Output: RandomForestClassifier()
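After fitting, the estimator exposes attributes that describe what it learned. The sketch below inspects a few of them. It writes out `n_estimators=100`, which is the library default, and fixes `random_state=0` as an arbitrary assumption so the fitted forest is repeatable.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# n_estimators=100 is the default number of trees, written out for visibility.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print("Number of trees:", len(clf.estimators_))
print("Classes seen during fit:", clf.classes_)
print("Number of input features:", clf.n_features_in_)
```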

Model Evaluation

# For classifiers, score() returns the mean accuracy on the given data
print("Train score:", clf.score(X_train, y_train))
print("Test score:", clf.score(X_test, y_test))
# Output:
# Train score: 1.0
# Test score: 0.9805555555555555
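A single train/test split gives one accuracy estimate, and that estimate depends on which rows happened to land in the test set. As a complementary sketch, `cross_val_score` repeats the fit-and-score cycle over several folds. The choice of `cv=5` and `random_state=0` here is an assumption made for illustration, not a workshop recommendation.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = datasets.load_digits(return_X_y=True)
clf = RandomForestClassifier(random_state=0)

# Fit and score the model five times, each time holding out a different fold.
scores = cross_val_score(clf, X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
print("Standard deviation:", scores.std())
```

The spread across folds gives a rough sense of how much a single test score could vary.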

Inference

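# Predict the class of a single test sample.
# The double brackets keep the input 2D, with shape (1, 64), as predict() expects.
ix = 42
print("Predicted:", clf.predict(X_test[[ix]]))
print("True:", y_test[ix])
# Output:
# Predicted: [1]
# True: 1

# Class probabilities for the same sample, one column per digit class
clf.predict_proba(X_test[[ix]])
# Output: array([[0.  , 0.83, 0.05, 0.02, 0.  , 0.01, 0.  , 0.  , 0.09, 0.  ]])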
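The relationship between `predict` and `predict_proba` can be checked directly: the predicted class is the one with the highest probability, and the probability columns follow the order of `clf.classes_`. The sketch below assumes a freshly fitted forest with an arbitrary `random_state=0`, so its numbers will not match the output shown above.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Probabilities for the first five test samples: one row per sample,
# one column per class, in the order given by clf.classes_.
proba = clf.predict_proba(X_test[:5])

# Picking the most probable class per row reproduces predict().
pred_from_proba = clf.classes_[np.argmax(proba, axis=1)]
pred = clf.predict(X_test[:5])

print(proba.sum(axis=1))        # each row sums to 1
print(pred_from_proba, pred)
```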