Introduction to Python for Machine Learning#

Python has become the de facto language for machine learning due to its simplicity, readability, and extensive ecosystem of libraries. Its flexibility allows for rapid prototyping and development, while its powerful libraries like NumPy, Pandas, and Scikit-learn provide efficient tools for data manipulation and model building.

1. Essential Python Concepts Review#

For a broad introduction to Python, check out the Python Programming Guide. Here are some key features of Python that are particularly useful for machine learning:

  • List Comprehensions: A concise way to create lists in Python.

  • Lambda Functions: Anonymous functions that can be defined in a single line.

  • Error Handling: Using try, except, and finally blocks to handle exceptions.

  • Generators: Functions that return an iterator, allowing for lazy evaluation (meaning they don’t store all values in memory at once).

  • Dynamic Typing: Variables in Python are dynamically typed, meaning you don’t need to specify the type of a variable when you declare it.

  • Runtime Compilation: Python code is compiled to bytecode, which is then interpreted by the Python interpreter. This allows for dynamic execution of code.

from utils import create_answer_box
# Code snippets demonstrating key Python concepts
# List comprehension example
squares = [x**2 for x in range(10)]
squares
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
# Lambda function example
multiply = lambda x, y: x * y
multiply(5, 10)
50
# Error handling example
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero")
    
print("This line still runs")
Cannot divide by zero
This line still runs
# Generators example
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b

fib = fibonacci(int(1e18)) # Large number to demonstrate generator memory efficiency, the first 1e18 (i.e. 1 quintillion) Fibonacci numbers
# Print the next Fibonacci number
next(fib)
0
# How much memory does the fib object consume?
import sys
sys.getsizeof(fib)
232

If we were to run list(fib), we would get a list of the first quintillion Fibonacci numbers. But we don’t want to do that, because it would take up all the memory on our machine!

2. Introduction to NumPy#

NumPy is a powerful library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. NumPy is the foundation for many other libraries in the Python data science ecosystem, such as Pandas, Scikit-learn, PyTorch, and TensorFlow. The reason NumPy is so fast is that it is implemented in C, which is a much faster language than Python.

# Basic NumPy operations and array manipulations
import numpy as np

# Create an array
arr = np.array([1, 2, 3, 4, 5])

# Array operations
print("Array operations:")
print("Array:")
print(arr)
print("Array x 2:")
print(arr * 2)
print("Array summed:")
print(np.sum(arr))

# Broadcasting example
matrix = np.array([[1, 2, 3, 4 , 5], [6, 7, 8, 9, 10]])
print("\nBroadcasting example:")
print("Matrix:")
print(matrix)
print("Array:")
print(arr)
print("Matrix + Array:")
print(matrix + arr)
Array operations:
Array:
[1 2 3 4 5]
Array x 2:
[ 2  4  6  8 10]
Array summed:
15

Broadcasting example:
Matrix:
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
Array:
[1 2 3 4 5]
Matrix + Array:
[[ 2  4  6  8 10]
 [ 7  9 11 13 15]]

3. Intro to Pandas#

Pandas is a powerful data manipulation library for Python. It is built on top of NumPy and provides data structures and functions for efficiently manipulating large datasets. Pandas is widely used in data science and machine learning for data cleaning, exploration, and preparation.

Pandas isn’t the best choice for truly massive datasets, since it loads the entire dataset into memory. But even libraries that are better suited for massive datasets, like Dask, tend to conform to the Pandas API, so learning Pandas is a good foundation for working with other libraries.

Pandas is great for organizing and exploring data. We’ll spend more time on data exploration in Day 2, but let’s take a quick look at how to use Pandas dataframes to organize data. We’ll use the California Housing dataset, which is a dataset containing information about housing prices. The dataset contains 20,640 samples and 9 features. The goal is to predict the house value, given a set of features about the property and its district.

# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset
california = fetch_california_housing()

See if you can put the california housing data into a Pandas dataframe. california is an sklearn Bunch object, just like we saw in the previous notebook. You can run the below code cell to remind yourself about what are the different things that are available as part of this Bunch object.

california.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

And you can make your Pandas dataframe using the pd.DataFrame class. You can check the docstring for this by running the below code cell. It will tell you the arguments that a call to pd.DataFrame can include, along with examples of its use.

pd.DataFrame?
Init signature:
pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
) -> 'None'
Docstring:     
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, dataclass or list-like objects. If
    data is a dict, column order follows insertion-order. If a dict contains Series
    which have an index defined, it is aligned by its index. This alignment also
    occurs if data is a Series or a DataFrame itself. Alignment is done on
    Series/DataFrame inputs.

    If data is a list of dicts, column order follows insertion-order.

index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if
    no indexing information part of input data and no index provided.
columns : Index or array-like
    Column labels to use for resulting frame when data does not have them,
    defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels,
    will perform column selection instead.
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool or None, default None
    Copy data from inputs.
    For dict data, the default of None behaves like ``copy=True``.  For DataFrame
    or 2d ndarray input, the default of None behaves like ``copy=False``.
    If data is a dict containing one or more Series (possibly of different dtypes),
    ``copy=False`` will ensure that these inputs are not copied.

    .. versionchanged:: 1.3.0

See Also
--------
DataFrame.from_records : Constructor from tuples, also record arrays.
DataFrame.from_dict : From dicts of Series, arrays, or dicts.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_table : Read general delimited file into DataFrame.
read_clipboard : Read text from clipboard into DataFrame.

Notes
-----
Please reference the :ref:`User Guide <basics.dataframe>` for more information.

Examples
--------
Constructing DataFrame from a dictionary.

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4

Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

To enforce a single dtype:

>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

Constructing DataFrame from a dictionary including Series:

>>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
>>> pd.DataFrame(data=d, index=[0, 1, 2, 3])
   col1  col2
0     0   NaN
1     1   NaN
2     2   2.0
3     3   3.0

Constructing DataFrame from numpy ndarray:

>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
...                    columns=['a', 'b', 'c'])
>>> df2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Constructing DataFrame from a numpy ndarray that has labeled columns:

>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
...                 dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
>>> df3 = pd.DataFrame(data, columns=['c', 'a'])
...
>>> df3
   c  a
0  3  1
1  6  4
2  9  7

Constructing DataFrame from dataclass:

>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
   x  y
0  0  0
1  0  3
2  2  3

Constructing DataFrame from Series/DataFrame:

>>> ser = pd.Series([1, 2, 3], index=["a", "b", "c"])
>>> df = pd.DataFrame(data=ser, index=["a", "c"])
>>> df
   0
a  1
c  3

>>> df1 = pd.DataFrame([1, 2, 3], index=["a", "b", "c"], columns=["x"])
>>> df2 = pd.DataFrame(data=df1, index=["a", "c"])
>>> df2
   x
a  1
c  3
File:           /software/slurm/spackages/linux-rocky8-x86_64/gcc-12.2.0/anaconda3-2023.09-0-3mhml42fa64byxqyd5fig5tbih625dp2/lib/python3.11/site-packages/pandas/core/frame.py
Type:           type
Subclasses:     SubclassedDataFrame

Now, see if you can make a dataframe with the California housing data. We don’t need to worry about the datatype or the index of this dataframe, but it should at least have the independent variables (aka the data) from the california object, as well as labels for the columns. You can try things out in the below code cell. When you have a solution that you think works, please copy/paste it into the text field and submit your answer!

# Try out your code here. Feel free to make additional code cells if you like.
create_answer_box("Please enter your code to produce a Pandas df of the California housing data.", "01-01")

Please enter your code to produce a Pandas df of the California housing data.

# Display the first few rows of the DataFrame
print("First few rows of the California Housing Dataset:")
print(df.head())
First few rows of the California Housing Dataset:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  Price  
0    -122.23  4.526  
1    -122.22  3.585  
2    -122.24  3.521  
3    -122.25  3.413  
4    -122.25  3.422  

Pandas includes some methods for quickly summarizing the data in a dataframe.

# Basic data exploration
print("DataFrame Info:")
df.info()
print("Basic Statistics:")
print(df.describe())

Pandas also makes it easy to filter data.

# Filtering data
print("Houses with more than 4 rooms on average:")
print(df[df['AveRooms'] > 4].head())
print("\nHouses with more than 4 rooms on average and a price above the median:")
print(df[(df['AveRooms'] > 4) & (df['Price'] > df['Price'].median())].head())

We can also sort the data easily, and add new columns.

# Sorting data
print("\nTop 5 most expensive areas:")
print(df.sort_values('Price', ascending=False).head())
# Adding a new column
df['PriceCategory'] = pd.cut(df['Price'], bins=[0, 1.25, 2.5, 3.75, np.inf], labels=['Low', 'Medium', 'High', 'Very High'])
print("\nDataFrame with new PriceCategory column:")
print(df.head())

A particularly useful feature of Pandas is the ability to group data by a particular column and then apply a function to each group. This is similar to the SQL GROUP BY clause, or to Excel’s pivot tables.

# Group by operations
print("\nAverage house age by price category:")
print(df.groupby('PriceCategory', observed=False)['HouseAge'].mean())

Challenge: Find the mean age of houses in the upper quartile (i.e. 75th percentile, or 0.75 quantile) of prices. I.e., among the 25% of houses that are the most expensive, what is their average age? For this, you might find it helpful to know that just as there is a median method (shown above) for Pandas Series, there is also a quantile method, in which you can specify whatever quantile you would like it to return.

# Try out your code here. Submit your final answer below.
df[df['Price'] > df['Price'].quantile(0.75)]['HouseAge'].mean()
30.58895348837209
create_answer_box("Please enter your code.", "01-02")

Please enter your code.

Pandas also includes some plotting functionality, which is built on top of the Matplotlib library. Very useful for quickly visualizing data.

# Basic data visualization
plt.figure(figsize=(4, 3))
df.plot(x='MedInc', y='Price', kind='scatter', alpha=0.5)
plt.title('Median Income vs House Price')
plt.xlabel('Median Income')
plt.ylabel('House Price')
plt.show()
# Correlation heatmap (excluding categorical column)
plt.figure(figsize=(4, 3))
correlation_matrix = df.drop('PriceCategory', axis=1).corr()
plt.imshow(correlation_matrix, cmap='coolwarm', aspect='auto')
plt.colorbar()
plt.xticks(range(len(correlation_matrix.columns)), correlation_matrix.columns, rotation=90)
plt.yticks(range(len(correlation_matrix.columns)), correlation_matrix.columns)
plt.title('Correlation Heatmap of California Housing Features')
plt.tight_layout()
plt.show()
# Handling categorical data
print("\nCount of houses in each price category:")
print(df['PriceCategory'].value_counts())
# Visualizing categorical data
plt.figure(figsize=(6, 4))
df['PriceCategory'].value_counts().plot(kind='bar')
plt.title('Distribution of House Prices by Category')
plt.xlabel('Price Category')
plt.ylabel('Number of Houses')
plt.show()