Introduction to Python for Machine Learning

Python has become the de facto language for machine learning due to its simplicity, readability, and extensive ecosystem of libraries. Its flexibility allows for rapid prototyping and development, while its powerful libraries like NumPy, Pandas, and Scikit-learn provide efficient tools for data manipulation and model building.

1. Essential Python Concepts Review

For a broad introduction to Python, check out the Python Programming Guide. Here are some key features of Python that are particularly useful for machine learning:

  • List Comprehensions: A concise way to create lists in Python.

  • Lambda Functions: Anonymous functions that can be defined in a single line.

  • Error Handling: Using try, except, and finally blocks to handle exceptions.

  • Generators: Functions that return an iterator, allowing for lazy evaluation (meaning they don’t store all values in memory at once).

  • Dynamic Typing: Variables in Python are dynamically typed, meaning you don’t need to specify the type of a variable when you declare it.

  • Runtime Compilation: Python code is compiled to bytecode, which is then interpreted by the Python interpreter. This allows for dynamic execution of code.
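Three of the items above (the finally block, dynamic typing, and bytecode compilation) are easy to see in a few lines. The following is a minimal sketch; the function names `safe_divide` and `add_one` are made up for illustration, and the bytecode listing printed by `dis` is specific to CPython and varies between versions.

```python
import dis

# Dynamic typing: the same name can be rebound to objects of different types.
value = 3
print(type(value).__name__)   # int
value = "three"
print(type(value).__name__)   # str


# try / except / finally: the finally block runs whether or not an error occurred.
def safe_divide(a, b):
    """Hypothetical helper: return a / b, or None when b is zero."""
    try:
        return a / b
    except ZeroDivisionError:
        return None
    finally:
        print(f"finished dividing {a} by {b}")


print(safe_divide(10, 2))
print(safe_divide(10, 0))


# Bytecode: dis shows the instructions CPython compiled the function into.
def add_one(x):
    return x + 1


dis.dis(add_one)
```

The `finally` message is printed for both calls, including the one that hits the `ZeroDivisionError`.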

# Code snippets demonstrating key Python concepts
# List comprehension example
squares = [x**2 for x in range(10)]
squares
# Lambda function example
multiply = lambda x, y: x * y
multiply(5, 10)
# Error handling example
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero")
    
print("This line still runs")
# Generators example
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b

fib = fibonacci(int(1e18)) # Large number to demonstrate generator memory efficiency, the first 1e18 (i.e. 1 quintillion) Fibonacci numbers
# Print the next Fibonacci number
next(fib)
# How much memory does the fib object consume?
import sys
sys.getsizeof(fib)

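If we were to run list(fib), Python would try to build a list of the remaining Fibonacci numbers, close to a quintillion of them. We don’t want to do that: it would exhaust the memory on our machine long before it finished. The generator object itself stays small because it only stores its current state, not the values it will produce.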
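To make the memory point concrete, here is a small sketch that takes only the first few values from the huge generator and compares the size of a materialized list with the size of a generator object. It redefines `fibonacci` so the block runs on its own. Note that `sys.getsizeof` on a list measures only the list's array of references, not the integers it points to, so the real gap is larger than the numbers suggest.

```python
import sys
from itertools import islice


def fibonacci(n):
    """Yield the first n Fibonacci numbers, one at a time."""
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b


# islice pulls only the first ten values; the rest are never computed.
first_ten = list(islice(fibonacci(int(1e18)), 10))
print(first_ten)

# Compare container sizes: a materialized list versus a generator object.
as_list = list(fibonacci(1000))
as_generator = fibonacci(1000)
print("list bytes:", sys.getsizeof(as_list))
print("generator bytes:", sys.getsizeof(as_generator))
```

`itertools.islice` is a convenient way to take a bounded slice of a generator without ever calling `list` on the whole thing.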

2. Introduction to NumPy

NumPy is a powerful library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. NumPy is the foundation for many other libraries in the Python data science ecosystem, such as Pandas, Scikit-learn, PyTorch, and TensorFlow. NumPy is fast because its arrays store elements of a single type in contiguous memory and its core operations are implemented in compiled C code, so an operation on a whole array avoids the per-element overhead of a Python loop.
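A rough way to see the speed difference is to time the same computation written as a Python loop and as a single NumPy call. This is only a sketch: the array size and repeat count are arbitrary choices, and the measured times depend on the machine, so no particular speedup is claimed here.

```python
import timeit

import numpy as np

n = 100_000
values = np.arange(n, dtype=np.int64)   # 0, 1, ..., n - 1 as 64-bit integers
values_list = values.tolist()           # the same numbers as plain Python ints


def python_sum_of_squares():
    """Sum of squares with an explicit Python loop."""
    total = 0
    for v in values_list:
        total += v * v
    return total


def numpy_sum_of_squares():
    """Sum of squares as one vectorized dot product."""
    return int(np.dot(values, values))


# Both approaches must agree before the timings mean anything.
assert python_sum_of_squares() == numpy_sum_of_squares()

loop_time = timeit.timeit(python_sum_of_squares, number=10)
numpy_time = timeit.timeit(numpy_sum_of_squares, number=10)
print(f"Python loop: {loop_time:.4f} s, NumPy: {numpy_time:.4f} s")
```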

# Basic NumPy operations and array manipulations
import numpy as np

# Create an array
arr = np.array([1, 2, 3, 4, 5])

# Array operations
print("Array operations:")
print("Array:")
print(arr)
print("Array x 2:")
print(arr * 2)
print("Array summed:")
print(np.sum(arr))

# Broadcasting example
matrix = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print("\nBroadcasting example:")
print("Matrix:")
print(matrix)
print("Array:")
print(arr)
print("Matrix + Array:")
print(matrix + arr)
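The addition above works because NumPy compares shapes from the trailing axis backwards, and two axes are compatible when they are equal or one of them is 1. The sketch below repeats the (2, 5) + (5,) case, adds a column-vector case, and shows a shape pair that fails. The `column` array is an illustrative addition, not part of the example above.

```python
import numpy as np

matrix = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])   # shape (2, 5)
row = np.array([1, 2, 3, 4, 5])                           # shape (5,)
column = np.array([[100], [200]])                         # shape (2, 1)

# (2, 5) + (5,): the row is added to every row of the matrix.
row_sum = matrix + row
print(row_sum.shape)

# (2, 5) + (2, 1): the column is added to every column of the matrix.
column_sum = matrix + column
print(column_sum)

# Shapes (2, 5) and (2,) disagree on the trailing axis (5 vs 2), so NumPy raises.
try:
    matrix + np.array([1, 2])
except ValueError as err:
    print("Broadcasting failed:", err)
```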

3. Intro to Pandas

Pandas is a powerful data manipulation library for Python. It is built on top of NumPy and provides data structures and functions for efficiently manipulating large datasets. Pandas is widely used in data science and machine learning for data cleaning, exploration, and preparation.

Pandas isn’t the best choice for truly massive datasets, since it loads the entire dataset into memory. But even libraries that are better suited for massive datasets, like Dask, tend to conform to the Pandas API, so learning Pandas is a good foundation for working with other libraries.
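When a file is too large to load at once, one option that stays within Pandas is to read it in chunks and combine partial results. The sketch below assumes a CSV with a single `value` column and uses an in-memory string as a stand-in for a large file. Both the data and the chunk size are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file too large to load at once.
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 101))

total = 0
count = 0
# With chunksize set, read_csv returns an iterator of DataFrames of 25 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=25):
    total += chunk["value"].sum()
    count += len(chunk)

mean_value = total / count
print(mean_value)
```

Only one chunk is held in memory at a time, at the cost of having to express the computation as something that can be accumulated across chunks.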

Pandas is great for organizing and exploring data. We’ll spend more time on data exploration in Day 2, but let’s take a quick look at how to use Pandas dataframes to organize data. We’ll use the California Housing dataset, in which each row describes one California district (a census block group). The dataset contains 20,640 samples and 8 features, plus the target: the median house value of the district, expressed in units of $100,000. The goal is to predict the house value, given a set of features about the district and its housing.

# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset
california = fetch_california_housing()

# Create a DataFrame
df = pd.DataFrame(california.data, columns=california.feature_names)
# The above created a dataframe with the features, but it doesn't have the target variable. It's easy to add new columns to a Pandas DataFrame:
df['Price'] = california.target

# Display the first few rows of the DataFrame
print("First few rows of the California Housing Dataset:")
print(df.head())

Pandas includes some methods for quickly summarizing the data in a dataframe.

# Basic data exploration
print("DataFrame Info:")
df.info()
print("Basic Statistics:")
print(df.describe())

Pandas also makes it easy to filter data.

# Filtering data
print("Houses with more than 4 rooms on average:")
print(df[df['AveRooms'] > 4].head())
print("\nHouses with more than 4 rooms on average and a price above the median:")
print(df[(df['AveRooms'] > 4) & (df['Price'] > df['Price'].median())].head())
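The parentheses around each condition matter because `&` binds more tightly than `>` in Python. The sketch below shows three equivalent ways to write the same filter, using a small made-up DataFrame that borrows the `AveRooms` and `Price` column names so it runs without downloading the dataset.

```python
import pandas as pd

# Toy data with the same column names as the housing DataFrame (values invented).
toy = pd.DataFrame({
    "AveRooms": [3.5, 4.2, 5.1, 6.0],
    "Price": [1.0, 3.0, 2.0, 4.0],
})

median_price = toy["Price"].median()

# 1. Boolean mask, as in the example above.
mask = (toy["AveRooms"] > 4) & (toy["Price"] > median_price)
by_mask = toy[mask]

# 2. .loc with the same mask, which also lets you choose columns.
by_loc = toy.loc[mask, ["AveRooms", "Price"]]

# 3. .query with a string expression; @ refers to a Python variable.
by_query = toy.query("AveRooms > 4 and Price > @median_price")

print(by_query)
```

Which form to use is mostly a matter of taste; `.query` tends to read more cleanly once there are several conditions.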

We can also sort the data easily, and add new columns.

# Sorting data
print("\nTop 5 most expensive areas:")
print(df.sort_values('Price', ascending=False).head())
# Adding a new column
df['PriceCategory'] = pd.cut(df['Price'], bins=[0, 1.25, 2.5, 3.75, np.inf], labels=['Low', 'Medium', 'High', 'Very High'])
print("\nDataFrame with new PriceCategory column:")
print(df.head())
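One detail of `pd.cut` worth knowing is that, by default, each bin includes its right edge but not its left edge. The sketch below applies the same bins and labels to a few invented prices to show where boundary values land, and what `include_lowest=True` changes for a value equal to the first edge.

```python
import numpy as np
import pandas as pd

bins = [0, 1.25, 2.5, 3.75, np.inf]
labels = ["Low", "Medium", "High", "Very High"]

# Invented prices, including values that sit exactly on bin edges.
prices = pd.Series([0.0, 0.5, 1.25, 2.0, 5.0])

# Default: intervals are (0, 1.25], (1.25, 2.5], ... so 0.0 falls in no bin.
default_cut = pd.cut(prices, bins=bins, labels=labels)
print(default_cut.tolist())

# include_lowest=True makes the first interval include its left edge.
inclusive_cut = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)
print(inclusive_cut.tolist())
```

With the default settings the price of exactly 0.0 becomes a missing value, while 1.25 is labeled Low rather than Medium.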

A particularly useful feature of Pandas is the ability to group data by a particular column and then apply a function to each group. This is similar to the SQL GROUP BY clause, or to Excel’s pivot tables.

# Group by operations
print("\nAverage house age by price category:")
print(df.groupby('PriceCategory', observed=False)['HouseAge'].mean())
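Grouping is not limited to a single statistic. As a sketch, the block below computes several aggregates per group with `agg`, then produces the same group means with `pivot_table`, which is the closer analogue of an Excel pivot table. The data is a small invented table that reuses the `PriceCategory` and `HouseAge` column names.

```python
import pandas as pd

# Toy data with invented values, reusing column names from the housing DataFrame.
toy = pd.DataFrame({
    "PriceCategory": ["Low", "Low", "High", "High", "High"],
    "HouseAge": [10, 20, 30, 40, 50],
})

# Several statistics per group in one call.
summary = toy.groupby("PriceCategory")["HouseAge"].agg(["mean", "min", "max", "count"])
print(summary)

# The same group means expressed as a pivot table.
pivot = toy.pivot_table(values="HouseAge", index="PriceCategory", aggfunc="mean")
print(pivot)
```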

Pandas also includes some plotting functionality, which is built on top of the Matplotlib library. It is very useful for quickly visualizing data.

# Basic data visualization
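# DataFrame.plot creates its own figure, so the size is passed via figsize;
# a separate plt.figure() call here would only leave an extra empty figure.
df.plot(x='MedInc', y='Price', kind='scatter', alpha=0.5, figsize=(4, 3))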
plt.title('Median Income vs House Price')
plt.xlabel('Median Income')
plt.ylabel('House Price')
plt.show()
# Correlation heatmap (excluding categorical column)
plt.figure(figsize=(4, 3))
correlation_matrix = df.drop('PriceCategory', axis=1).corr()
plt.imshow(correlation_matrix, cmap='coolwarm', aspect='auto')
plt.colorbar()
plt.xticks(range(len(correlation_matrix.columns)), correlation_matrix.columns, rotation=90)
plt.yticks(range(len(correlation_matrix.columns)), correlation_matrix.columns)
plt.title('Correlation Heatmap of California Housing Features')
plt.tight_layout()
plt.show()
# Handling categorical data
print("\nCount of houses in each price category:")
print(df['PriceCategory'].value_counts())
# Visualizing categorical data
plt.figure(figsize=(6, 4))
df['PriceCategory'].value_counts().plot(kind='bar')
plt.title('Distribution of House Prices by Category')
plt.xlabel('Price Category')
plt.ylabel('Number of Houses')
plt.show()
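By default, `value_counts` orders its result by count, so the bars in a chart like the one above appear from most to least frequent rather than from Low to Very High. A small sketch with invented labels shows how `sort_index` restores the category order before plotting. The ordered categorical here mimics what `pd.cut` produces.

```python
import pandas as pd

labels = ["Low", "Medium", "High", "Very High"]

# Invented category values, stored as an ordered categorical like pd.cut returns.
categories = pd.Series(
    pd.Categorical(
        ["High", "Low", "High", "Medium", "High", "Very High", "Low"],
        categories=labels,
        ordered=True,
    )
)

# Sorted by frequency: the most common category comes first.
by_count = categories.value_counts()
print(by_count)

# Sorted by category order: Low, Medium, High, Very High.
by_category = categories.value_counts().sort_index()
print(by_category)
```

Calling `.plot(kind='bar')` on `by_category` instead of `by_count` would then draw the bars in their natural order.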