Introduction to Python for Machine Learning
Python has become the de facto language for machine learning due to its simplicity, readability, and extensive ecosystem of libraries. Its flexibility allows for rapid prototyping and development, while its powerful libraries like NumPy, Pandas, and Scikit-learn provide efficient tools for data manipulation and model building.
1. Essential Python Concepts Review
For a broad introduction to Python, check out the Python Programming Guide. Here are some key features of Python that are particularly useful for machine learning:
List Comprehensions: A concise way to create lists in Python.
Lambda Functions: Anonymous functions that can be defined in a single line.
Error Handling: Using try, except, and finally blocks to handle exceptions.
Generators: Functions that return an iterator, allowing for lazy evaluation (meaning they don’t store all values in memory at once).
Dynamic Typing: Variables in Python are dynamically typed, meaning you don’t need to specify the type of a variable when you declare it.
Runtime Compilation: Python code is compiled to bytecode, which is then interpreted by the Python interpreter. This allows for dynamic execution of code.
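The last two items in the list can be observed directly in the interpreter. The following is a minimal sketch using only the standard library; the variable names and the tiny expression being compiled are illustrative assumptions, not part of the original notebook.

```python
import dis

# Dynamic typing: a name is just a label, so it can be rebound to any type.
value = 42
first_type = type(value).__name__    # "int"
value = "forty-two"
second_type = type(value).__name__   # "str"
print(first_type, second_type)

# Runtime compilation: compile() turns source text into a bytecode object
# while the program is running, and eval() executes that object.
code_obj = compile("x + 1", "<example>", "eval")
answer = eval(code_obj, {"x": 41})
print(answer)

# Show the bytecode instructions (the exact listing varies by Python version).
dis.dis(code_obj)
```

The same name holding an int and then a str is what "dynamically typed" means in practice: the type belongs to the object, not to the variable.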
from utils import create_answer_box
# Code snippets demonstrating key Python concepts
# List comprehension example
squares = [x**2 for x in range(10)]
squares
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
# Lambda function example
multiply = lambda x, y: x * y
multiply(5, 10)
50
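Lambdas are most often used as short throwaway arguments to other functions rather than being assigned to a name. Here is a small sketch of that pattern; the model names and scores are hypothetical values made up for illustration.

```python
# Hypothetical (name, score) pairs used only for illustration.
results = [("model_a", 0.81), ("model_b", 0.93), ("model_c", 0.77)]

# Sort by score, highest first, using a lambda as the key function.
ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
best_name = ranked[0][0]

# Apply a lambda to every element with map().
percentages = list(map(lambda pair: round(pair[1] * 100), results))
print(best_name, percentages)
```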
# Error handling example
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero")
print("This line still runs")
Cannot divide by zero
This line still runs
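The concept list mentions finally, which the example above does not exercise. The sketch below shows how try, except, else, and finally fit together; safe_divide and its log list are hypothetical helpers introduced only to make the order of execution visible.

```python
def safe_divide(numerator, denominator):
    """Return (result, log), where log records which blocks ran."""
    log = []
    try:
        result = numerator / denominator
    except ZeroDivisionError:
        # Runs only when the division raises ZeroDivisionError.
        result = None
        log.append("except")
    else:
        # Runs only when the try block raised nothing.
        log.append("else")
    finally:
        # Runs in both cases, which makes it the place for cleanup code.
        log.append("finally")
    return result, log


print(safe_divide(10, 2))
print(safe_divide(10, 0))
```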
# Generators example
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b
# A very large n (1e18, i.e. 1 quintillion) to demonstrate generator memory efficiency
fib = fibonacci(int(1e18))
# Print the next Fibonacci number
next(fib)
0
# How much memory does the fib object consume?
import sys
sys.getsizeof(fib)
232
If we were to run list(fib), we would get a list of the first quintillion Fibonacci numbers. But we don’t want to do that, because it would take up all the memory on our machine!
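To look at just a few values from a huge generator without materializing it, one option is itertools.islice from the standard library. The sketch below redefines fibonacci so that it is self-contained, and also compares the size of a list with that of an equivalent generator expression; the variable names are illustrative.

```python
import sys
from itertools import islice


def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b


# Take only the first ten values; the remaining ones are never computed.
first_ten = list(islice(fibonacci(10**18), 10))
print(first_ten)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

# A list stores every element, while a generator stores only its current state.
small_list = [x**2 for x in range(1000)]
lazy_squares = (x**2 for x in range(1000))
print(sys.getsizeof(small_list), sys.getsizeof(lazy_squares))
```

Note that sys.getsizeof reports only the container's own size, so the exact numbers vary by Python version; the point is that the generator's size does not grow with the number of values it can produce.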
2. Introduction to NumPy
NumPy is a powerful library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. NumPy is the foundation for many other libraries in the Python data science ecosystem, such as Pandas, Scikit-learn, PyTorch, and TensorFlow. NumPy is fast largely because its core routines are implemented in compiled C and operate on whole arrays of a single data type at once, which avoids the per-element overhead of Python loops.
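A rough way to see the speed difference is to time the same computation written as a Python loop and as a single NumPy call. This is only a sketch: the function names and array size are arbitrary choices, and the measured times depend on the machine, so no particular speedup is promised.

```python
import timeit

import numpy as np

values = np.arange(100_000, dtype=np.float64)


def python_sum_of_squares(data):
    # Element-by-element loop executed by the Python interpreter.
    total = 0.0
    for v in data:
        total += v * v
    return float(total)


def numpy_sum_of_squares(data):
    # One vectorized call; the loop runs inside compiled code.
    return float(np.dot(data, data))


loop_result = python_sum_of_squares(values)
vec_result = numpy_sum_of_squares(values)

loop_time = timeit.timeit(lambda: python_sum_of_squares(values), number=3)
vec_time = timeit.timeit(lambda: numpy_sum_of_squares(values), number=3)
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```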
# Basic NumPy operations and array manipulations
import numpy as np
# Create an array
arr = np.array([1, 2, 3, 4, 5])
# Array operations
print("Array operations:")
print("Array:")
print(arr)
print("Array x 2:")
print(arr * 2)
print("Array summed:")
print(np.sum(arr))
# Broadcasting example
matrix = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print("\nBroadcasting example:")
print("Matrix:")
print(matrix)
print("Array:")
print(arr)
print("Matrix + Array:")
print(matrix + arr)
Array operations:
Array:
[1 2 3 4 5]
Array x 2:
[ 2  4  6  8 10]
Array summed:
15

Broadcasting example:
Matrix:
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
Array:
[1 2 3 4 5]
Matrix + Array:
[[ 2  4  6  8 10]
 [ 7  9 11 13 15]]
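In the broadcasting example above, the one-dimensional array of shape (5,) is stretched across both rows of the (2, 5) matrix. The sketch below extends that idea to a column vector and to a shape that cannot be broadcast; the names row, col, and bad are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])  # shape (2, 5)
row = np.array([1, 2, 3, 4, 5])                          # shape (5,)
col = np.array([[100], [200]])                           # shape (2, 1)

# Shapes are compared from the right: (2, 5) with (5,) gives (2, 5).
row_sum = matrix + row
print(row_sum.shape)

# A (2, 1) column is stretched across the 5 columns of the matrix.
col_sum = matrix + col
print(col_sum)

# Trailing dimensions 5 and 3 differ and neither is 1, so this fails.
bad = np.array([1, 2, 3])
try:
    matrix + bad
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("Broadcasting failed:", err)
```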
3. Intro to Pandas
Pandas is a powerful data manipulation library for Python. It is built on top of NumPy and provides data structures and functions for efficiently manipulating large datasets. Pandas is widely used in data science and machine learning for data cleaning, exploration, and preparation.
Pandas isn’t the best choice for truly massive datasets, since it loads the entire dataset into memory. But even libraries that are better suited for massive datasets, like Dask, tend to conform to the Pandas API, so learning Pandas is a good foundation for working with other libraries.
Pandas is great for organizing and exploring data. We’ll spend more time on data exploration in Day 2, but let’s take a quick look at how to use Pandas dataframes to organize data. We’ll use the California Housing dataset, which contains information about housing prices. The dataset contains 20,640 samples, each with 8 features plus the target (the median house value, in units of $100,000). The goal is to predict the house value, given a set of features about the property and its district.
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_california_housing
# Load the California housing dataset
california = fetch_california_housing()
See if you can put the California housing data into a Pandas dataframe. california is an sklearn Bunch object, just like we saw in the previous notebook. You can run the code cell below to remind yourself of what is available in this Bunch object.
california.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])
You can make your Pandas dataframe using the pd.DataFrame class. To check its docstring, run the code cell below. It lists the arguments that a call to pd.DataFrame can take, along with examples of its use.
pd.DataFrame?
Init signature:
pd.DataFrame(
data=None,
index: 'Axes | None' = None,
columns: 'Axes | None' = None,
dtype: 'Dtype | None' = None,
copy: 'bool | None' = None,
) -> 'None'
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.
Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, dataclass or list-like objects. If
data is a dict, column order follows insertion-order. If a dict contains Series
which have an index defined, it is aligned by its index. This alignment also
occurs if data is a Series or a DataFrame itself. Alignment is done on
Series/DataFrame inputs.
If data is a list of dicts, column order follows insertion-order.
index : Index or array-like
Index to use for resulting frame. Will default to RangeIndex if
no indexing information part of input data and no index provided.
columns : Index or array-like
Column labels to use for resulting frame when data does not have them,
defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels,
will perform column selection instead.
dtype : dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool or None, default None
Copy data from inputs.
For dict data, the default of None behaves like ``copy=True``. For DataFrame
or 2d ndarray input, the default of None behaves like ``copy=False``.
If data is a dict containing one or more Series (possibly of different dtypes),
``copy=False`` will ensure that these inputs are not copied.
.. versionchanged:: 1.3.0
See Also
--------
DataFrame.from_records : Constructor from tuples, also record arrays.
DataFrame.from_dict : From dicts of Series, arrays, or dicts.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_table : Read general delimited file into DataFrame.
read_clipboard : Read text from clipboard into DataFrame.
Notes
-----
Please reference the :ref:`User Guide <basics.dataframe>` for more information.
Examples
--------
Constructing DataFrame from a dictionary.
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2
0 1 3
1 2 4
Notice that the inferred dtype is int64.
>>> df.dtypes
col1 int64
col2 int64
dtype: object
To enforce a single dtype:
>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1 int8
col2 int8
dtype: object
Constructing DataFrame from a dictionary including Series:
>>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
>>> pd.DataFrame(data=d, index=[0, 1, 2, 3])
col1 col2
0 0 NaN
1 1 NaN
2 2 2.0
3 3 3.0
Constructing DataFrame from numpy ndarray:
>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
... columns=['a', 'b', 'c'])
>>> df2
a b c
0 1 2 3
1 4 5 6
2 7 8 9
Constructing DataFrame from a numpy ndarray that has labeled columns:
>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
... dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
>>> df3 = pd.DataFrame(data, columns=['c', 'a'])
...
>>> df3
c a
0 3 1
1 6 4
2 9 7
Constructing DataFrame from dataclass:
>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
x y
0 0 0
1 0 3
2 2 3
Constructing DataFrame from Series/DataFrame:
>>> ser = pd.Series([1, 2, 3], index=["a", "b", "c"])
>>> df = pd.DataFrame(data=ser, index=["a", "c"])
>>> df
0
a 1
c 3
>>> df1 = pd.DataFrame([1, 2, 3], index=["a", "b", "c"], columns=["x"])
>>> df2 = pd.DataFrame(data=df1, index=["a", "c"])
>>> df2
x
a 1
c 3
File: /software/slurm/spackages/linux-rocky8-x86_64/gcc-12.2.0/anaconda3-2023.09-0-3mhml42fa64byxqyd5fig5tbih625dp2/lib/python3.11/site-packages/pandas/core/frame.py
Type: type
Subclasses: SubclassedDataFrame
Now, see if you can make a dataframe with the California housing data. We don’t need to worry about the datatype or the index of this dataframe, but it should at least have the independent variables (aka the data) from the california object, as well as labels for the columns. You can try things out in the code cell below. When you have a solution that you think works, please copy/paste it into the text field and submit your answer!
# Try out your code here. Feel free to make additional code cells if you like.
create_answer_box("Please enter your code to produce a Pandas df of the California housing data.", "01-01")
Please enter your code to produce a Pandas df of the California housing data.
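The cells that follow rely on a dataframe named df with a Price column holding the target. One possible way to build such a frame is sketched below. To keep the sketch runnable without downloading the dataset, it uses a hypothetical stand-in object (california_stub) that mimics the data, target, and feature_names attributes of the california Bunch and contains only the first three rows from the printed output; with the real data, the same two steps would be applied to california itself.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd

# Hypothetical stand-in for the `california` Bunch (assumption: same
# attribute names), holding only three rows so no download is needed.
california_stub = SimpleNamespace(
    data=np.array([
        [8.3252, 41.0, 6.984127, 1.023810, 322.0, 2.555556, 37.88, -122.23],
        [8.3014, 21.0, 6.238137, 0.971880, 2401.0, 2.109842, 37.86, -122.22],
        [7.2574, 52.0, 8.288136, 1.073446, 496.0, 2.802260, 37.85, -122.24],
    ]),
    target=np.array([4.526, 3.585, 3.521]),
    feature_names=["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                   "Population", "AveOccup", "Latitude", "Longitude"],
)

# Step 1: the feature matrix becomes the rows, feature_names the column labels.
df_stub = pd.DataFrame(california_stub.data, columns=california_stub.feature_names)

# Step 2: add the target under the name 'Price', which the later cells use.
df_stub["Price"] = california_stub.target
print(df_stub.head())
```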
# Display the first few rows of the DataFrame
print("First few rows of the California Housing Dataset:")
print(df.head())
First few rows of the California Housing Dataset:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85
Longitude Price
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
Pandas includes some methods for quickly summarizing the data in a dataframe.
# Basic data exploration
print("DataFrame Info:")
df.info()
print("Basic Statistics:")
print(df.describe())
Pandas also makes it easy to filter data.
# Filtering data
print("Houses with more than 4 rooms on average:")
print(df[df['AveRooms'] > 4].head())
print("\nHouses with more than 4 rooms on average and a price above the median:")
print(df[(df['AveRooms'] > 4) & (df['Price'] > df['Price'].median())].head())
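The filters above work by building a boolean mask, one True or False per row, and using it to index the dataframe. The sketch below shows the mask explicitly on a tiny made-up dataframe (toy is a hypothetical stand-in, not the housing data). The parentheses around each comparison matter, because & binds more tightly than > in Python.

```python
import pandas as pd

# Hypothetical four-row frame used only to make the mask visible.
toy = pd.DataFrame({
    "AveRooms": [3.5, 4.2, 6.1, 5.0],
    "Price": [1.0, 2.0, 3.0, 4.0],
})

# Each comparison yields a boolean Series; & combines them row by row.
mask = (toy["AveRooms"] > 4) & (toy["Price"] > toy["Price"].median())
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
filtered = toy[mask]

# .loc accepts the same mask and can select columns at the same time.
prices_only = toy.loc[mask, "Price"]
print(filtered)
```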
We can also sort the data easily, and add new columns.
# Sorting data
print("\nTop 5 most expensive areas:")
print(df.sort_values('Price', ascending=False).head())
# Adding a new column
df['PriceCategory'] = pd.cut(df['Price'], bins=[0, 1.25, 2.5, 3.75, np.inf], labels=['Low', 'Medium', 'High', 'Very High'])
print("\nDataFrame with new PriceCategory column:")
print(df.head())
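It may help to see how pd.cut treats values that fall exactly on a bin edge. By default each bin includes its right edge and excludes its left edge, so with the bins used above a price of exactly 1.25 lands in Low rather than Medium. The values in this sketch are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up prices, including two that sit exactly on bin edges.
prices = pd.Series([0.5, 1.25, 2.0, 3.75, 5.0])

categories = pd.cut(
    prices,
    bins=[0, 1.25, 2.5, 3.75, np.inf],
    labels=["Low", "Medium", "High", "Very High"],
)
print(categories.tolist())  # ['Low', 'Low', 'Medium', 'High', 'Very High']
```

Because the left edge is excluded by default, a value of exactly 0 would not fall in any bin and would become NaN.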
A particularly useful feature of Pandas is the ability to group data by a particular column and then apply a function to each group. This is similar to the SQL GROUP BY clause, or to Excel’s pivot tables.
# Group by operations
print("\nAverage house age by price category:")
print(df.groupby('PriceCategory', observed=False)['HouseAge'].mean())
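The split, apply, and combine steps behind groupby are easier to follow on a handful of rows. The sketch below uses a made-up five-row dataframe (toy is hypothetical) and also shows agg, which computes several statistics per group in one call.

```python
import pandas as pd

# Hypothetical five-row frame used only for illustration.
toy = pd.DataFrame({
    "PriceCategory": ["Low", "Low", "High", "High", "High"],
    "HouseAge": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Split rows by category, then take the mean of HouseAge within each group.
mean_age = toy.groupby("PriceCategory")["HouseAge"].mean()
print(mean_age)

# agg() applies several functions per group and returns one column for each.
summary = toy.groupby("PriceCategory")["HouseAge"].agg(["mean", "count"])
print(summary)
```

The observed argument used in the cell above only matters when the grouping column is categorical, as the PriceCategory column produced by pd.cut is; with plain strings, as here, it can be left out.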
Challenge: Find the mean age of houses in the upper quartile of prices, i.e. those priced above the 75th percentile (the 0.75 quantile). In other words, among the 25% of houses that are the most expensive, what is the average age? For this, you might find it helpful to know that just as there is a median method (shown above) for Pandas Series, there is also a quantile method, in which you can specify whatever quantile you would like it to return.
# Try out your code here. Submit your final answer below.
df[df['Price'] > df['Price'].quantile(0.75)]['HouseAge'].mean()
30.58895348837209
create_answer_box("Please enter your code.", "01-02")
Please enter your code.
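To check the logic of a quantile-based filter by hand, it helps to run it on a few made-up numbers first. In this sketch the two Series are hypothetical and share the same index, so a mask built from one can be used to index the other.

```python
import pandas as pd

# Made-up values: eight prices and the matching house ages.
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
ages = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0])

# With the default linear interpolation, the 0.75 quantile falls between
# the 6th and 7th values: 6 + 0.25 * (7 - 6) = 6.25.
threshold = prices.quantile(0.75)
print(threshold)

# Only the prices 7.0 and 8.0 exceed the threshold, so the mean age is 75.0.
top_mean_age = ages[prices > threshold].mean()
print(top_mean_age)
```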
Pandas also includes some plotting functionality, which is built on top of the Matplotlib library. This is very useful for quickly visualizing data.
# Basic data visualization
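# DataFrame.plot creates its own figure, so the size is passed to it directly
# (a separate plt.figure() call here would only leave an empty figure behind)
df.plot(x='MedInc', y='Price', kind='scatter', alpha=0.5, figsize=(4, 3))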
plt.title('Median Income vs House Price')
plt.xlabel('Median Income')
plt.ylabel('House Price')
plt.show()
# Correlation heatmap (excluding categorical column)
plt.figure(figsize=(4, 3))
correlation_matrix = df.drop('PriceCategory', axis=1).corr()
plt.imshow(correlation_matrix, cmap='coolwarm', aspect='auto')
plt.colorbar()
plt.xticks(range(len(correlation_matrix.columns)), correlation_matrix.columns, rotation=90)
plt.yticks(range(len(correlation_matrix.columns)), correlation_matrix.columns)
plt.title('Correlation Heatmap of California Housing Features')
plt.tight_layout()
plt.show()
# Handling categorical data
print("\nCount of houses in each price category:")
print(df['PriceCategory'].value_counts())
# Visualizing categorical data
plt.figure(figsize=(6, 4))
df['PriceCategory'].value_counts().plot(kind='bar')
plt.title('Distribution of House Prices by Category')
plt.xlabel('Price Category')
plt.ylabel('Number of Houses')
plt.show()