Introduction to Python for Machine Learning#

Python has become the de facto language for machine learning due to its simplicity, readability, and extensive ecosystem of libraries. Its flexibility allows for rapid prototyping and development, while its powerful libraries like NumPy, Pandas, and Scikit-learn provide efficient tools for data manipulation and model building.

1. Essential Python Concepts Review#

For a broad introduction to Python, check out the Python Programming Guide. Here are some key features of Python that are particularly useful for machine learning:

  • List Comprehensions: A concise way to create lists in Python.

  • Lambda Functions: Anonymous functions that can be defined in a single line.

  • Error Handling: Using try, except, and finally blocks to handle exceptions.

  • Generators: Functions that return an iterator, allowing for lazy evaluation (meaning they don’t store all values in memory at once).

  • Dynamic Typing: Variables in Python are dynamically typed, meaning you don’t need to specify the type of a variable when you declare it.

  • Runtime Compilation: Python source is compiled to bytecode at runtime, and that bytecode is then executed by the interpreter; there is no separate ahead-of-time compilation step, which is what makes dynamic execution of code possible.

from utils import create_answer_box
create_answer_box("Are you familiar with the above-described concepts, or would you benefit from our spending some time on them now?", "01-01")
# Code snippets demonstrating key Python concepts
# List comprehension example
squares = [x**2 for x in range(10)]
squares
# Lambda function example
multiply = lambda x, y: x * y
multiply(5, 10)
# Error handling example
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero")
    
print("This line still runs")
# Generators example
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b

fib = fibonacci(int(1e18))  # A generator over the first 1e18 (i.e. 1 quintillion) Fibonacci numbers -- demonstrates memory efficiency
# Print the next Fibonacci number
next(fib)
# How much memory does the fib object consume?
import sys
sys.getsizeof(fib)

If we were to run list(fib), we would get a list of the first quintillion Fibonacci numbers. But we don’t want to do that, because it would take up all the memory on our machine!
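The code above covers list comprehensions, lambdas, error handling, and generators, but not the last two bullets. Here is a minimal sketch of dynamic typing (rebinding a name to values of different types) and runtime compilation (inspecting the bytecode that CPython compiles a function to, using the standard-library dis module):

# Dynamic typing: the same name can be rebound to values of different types
x = 42
print(type(x))  # <class 'int'>
x = "now a string"
print(type(x))  # <class 'str'>

# Runtime compilation: CPython compiles functions to bytecode, which we can inspect
import dis

def add_one(n):
    return n + 1

dis.dis(add_one)  # prints the bytecode instructions for add_one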

2. Introduction to NumPy#

NumPy is a powerful library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. NumPy is the foundation for many other libraries in the Python data science ecosystem, such as Pandas, Scikit-learn, PyTorch, and TensorFlow. NumPy is fast because its core routines are implemented in C, so operations on whole arrays run in compiled code rather than in the Python interpreter.

# Basic NumPy operations and array manipulations
import numpy as np

# Create an array
arr = np.array([1, 2, 3, 4, 5])

# Array operations
print("Array operations:")
print("Array:")
print(arr)
print("Array x 2:")
print(arr * 2)
print("Array summed:")
print(np.sum(arr))

# Broadcasting example
matrix = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print("\nBroadcasting example:")
print("Matrix:")
print(matrix)
print("Array:")
print(arr)
print("Matrix + Array:")
print(matrix + arr)
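To make the speed claim above concrete, here is a rough timing sketch (exact numbers will vary by machine) comparing a pure-Python loop over a list with the equivalent vectorized NumPy call:

# Rough timing comparison: pure-Python sum vs. vectorized NumPy sum
import time
import numpy as np

n = 10_000_000
py_list = list(range(n))
np_arr = np.arange(n)

start = time.perf_counter()
py_total = sum(py_list)   # Python-level loop over list elements
py_time = time.perf_counter() - start

start = time.perf_counter()
np_total = np_arr.sum()   # a single call into compiled C code
np_time = time.perf_counter() - start

print(f"Python sum: {py_time:.4f} s, NumPy sum: {np_time:.4f} s")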

3. Introduction to Pandas#

Pandas is a powerful data manipulation library for Python. It is built on top of NumPy and provides data structures and functions for efficiently manipulating large datasets. Pandas is widely used in data science and machine learning for data cleaning, exploration, and preparation.

Pandas isn’t the best choice for truly massive datasets, since it loads the entire dataset into memory. But even libraries that are better suited for massive datasets, like Dask, tend to conform to the Pandas API, so learning Pandas is a good foundation for working with other libraries.

Pandas is great for organizing and exploring data. We’ll spend more time on data exploration in Day 2, but let’s take a quick look at how to use Pandas dataframes to organize data. We’ll use the California Housing dataset, which contains information about housing prices. The dataset has 20,640 samples, each with 8 features, plus the target: the median house value for the district. The goal is to predict the median house value, given a set of features about the property and its district.

# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset
california = fetch_california_housing()

See if you can put the California housing data into a Pandas dataframe. california is an sklearn Bunch object, just like we saw in the previous notebook. You can run the code cell below to remind yourself what is available as part of this Bunch object.

california.keys()

And you can make your Pandas dataframe using the pd.DataFrame class. You can check its docstring by running the code cell below. It will tell you the arguments that a call to pd.DataFrame can include, along with examples of its use.

pd.DataFrame?

Now, see if you can make a dataframe with the California housing data. We don’t need to worry about the datatype or the index of this dataframe, but it should at least have the independent variables (aka the data) from the california object, as well as labels for the columns. You can try things out in the code cell below. When you have a solution that you think works, please copy/paste it into the text field and submit your answer!

# Try out your code here. Feel free to make additional code cells if you like.
create_answer_box("Please copy/paste your code to produce a Pandas df of the California housing data.", "01-02")
# Display the first few rows of the DataFrame
print("First few rows of the California Housing Dataset:")
print(df.head())

Pandas includes some methods for quickly summarizing the data in a dataframe.

# Basic data exploration
print("DataFrame Info:")
df.info()
print("Basic Statistics:")
print(df.describe())

Pandas also makes it easy to filter data.

# Filtering data
print("Houses with more than 4 rooms on average:")
print(df[df['AveRooms'] > 4].head())
print("\nHouses with more than 4 rooms on average and a price above the median:")
print(df[(df['AveRooms'] > 4) & (df['Price'] > df['Price'].median())].head())

We can also sort the data easily, and add new columns.

# Sorting data
print("\nTop 5 most expensive areas:")
print(df.sort_values('Price', ascending=False).head())
# Adding a new column
df['PriceCategory'] = pd.cut(df['Price'], bins=[0, 1.25, 2.5, 3.75, np.inf], labels=['Low', 'Medium', 'High', 'Very High'])
print("\nDataFrame with new PriceCategory column:")
print(df.head())

A particularly useful feature of Pandas is the ability to group data by a particular column and then apply a function to each group. This is similar to the SQL GROUP BY clause, or to Excel’s pivot tables.

# Group by operations
print("\nAverage house age by price category:")
print(df.groupby('PriceCategory', observed=False)['HouseAge'].mean())

Challenge: Find the mean age of houses in the upper quartile (i.e. 75th percentile, or 0.75 quantile) of prices. That is, among the 25% of areas that are the most expensive, what is their average house age? For this, you might find it helpful to know that just as there is a median method (shown above) for Pandas Series, there is also a quantile method, to which you can pass whatever quantile you would like it to return.
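As a hint (not the solution itself), here is the quantile method on a small toy Series; the challenge applies the same idea to the Price column:

# Hint: quantile on a toy Series -- not the challenge solution
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
print(toy.median())        # 4.5
print(toy.quantile(0.75))  # 6.25, the 75th percentile (upper quartile)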

# Try out your code here. Submit your final answer below.
create_answer_box("Please enter your code.", "01-03")

Pandas also includes some plotting functionality, built on top of the Matplotlib library, which is very useful for quickly visualizing data.

# Basic data visualization
df.plot(x='MedInc', y='Price', kind='scatter', alpha=0.5, figsize=(4, 3))
plt.title('Median Income vs House Price')
plt.xlabel('Median Income')
plt.ylabel('House Price')
plt.show()
# Correlation heatmap (excluding categorical column)
plt.figure(figsize=(4, 3))
correlation_matrix = df.drop('PriceCategory', axis=1).corr()
plt.imshow(correlation_matrix, cmap='coolwarm', aspect='auto')
plt.colorbar()
plt.xticks(range(len(correlation_matrix.columns)), correlation_matrix.columns, rotation=90)
plt.yticks(range(len(correlation_matrix.columns)), correlation_matrix.columns)
plt.title('Correlation Heatmap of California Housing Features')
plt.tight_layout()
plt.show()
# Handling categorical data
print("\nCount of houses in each price category:")
print(df['PriceCategory'].value_counts())
# Visualizing categorical data
df['PriceCategory'].value_counts().plot(kind='bar', figsize=(6, 4))
plt.title('Distribution of House Prices by Category')
plt.xlabel('Price Category')
plt.ylabel('Number of Houses')
plt.show()