Statistical Machine Learning Problem Using a Jupyter Notebook
For this HW, we will pretend that we only have 1000 images of the MNIST dataset.
We begin with the setup from the videos:
#commonly used imports
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
#get data from sklearn
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
#get the attributes and labels
X, y = mnist["data"], mnist["target"]
#convert labels from characters to numbers
y = y.astype(np.uint8)
#split into training and test sets
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
Suppose we decide on a Decision Tree Classifier (covered later in Chapter 6). There are two hyperparameters we will fine-tune for now: max_leaf_nodes and min_samples_split. Let’s fine-tune these using the following code.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
#setup the decision tree
dt_clf = DecisionTreeClassifier(random_state=30)
#the hyperparameters to search through
params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [3, 4, 5]}
#initialize the GridSearch with 3-fold cross validation
grid_search_cv = GridSearchCV(dt_clf, params, verbose=1, cv=3)
#do the search (but only on the first 1000 images)
grid_search_cv.fit(X_train[:1000], y_train[:1000])
What are the best hyperparameters?
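If you are not sure where to look, the fitted grid search object exposes them directly; a minimal sketch (printing is just one way to display them):
#the winning hyperparameter combination found by the search
print(grid_search_cv.best_params_)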
Now, let’s fit this model to the training data (again pretending to only have the first 1000 images).
dt_clf = grid_search_cv.best_estimator_
dt_clf.fit(X_train[:1000], y_train[:1000])
Let’s now obtain cross-validation accuracy scores with this fine-tuned model.
from sklearn.model_selection import cross_val_score
cross_val_score(dt_clf, X_train[:1000], y_train[:1000], cv=3, scoring="accuracy")
How accurate are your three folds?
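To answer this, it can help to store the returned array and summarize it; a minimal sketch (the variable name scores is our own choice):
#store the per-fold accuracies and report their mean
scores = cross_val_score(dt_clf, X_train[:1000], y_train[:1000], cv=3, scoring="accuracy")
print(scores, scores.mean())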
If we only had 1000 images and we wanted more to train on, we could do what is known as data augmentation. The augmentation we will do here is to shift the images we do have. For each of the 1000 images, we create four shifted copies: one pixel up, one pixel down, one pixel left, and one pixel right. This gives us four additional images per original to add to our training set. Data augmentation such as this is a very useful technique when there are not enough training instances. If you run the following code, you will see an example of the first image shifted.
from scipy.ndimage import shift
#shift an image dx pixels horizontally and dy pixels vertically
def shift_image(image, dx, dy):
    image = image.reshape((28, 28))
    shifted_image = shift(image, [dy, dx], cval=0, mode="constant")
    return shifted_image.reshape([-1])
image = X_train[0]
shifted_image_down = shift_image(image, 0, 5)
shifted_image_left = shift_image(image, -5, 0)
plt.figure(figsize=(12,3))
plt.subplot(131)
plt.title("Original", fontsize=14)
plt.imshow(image.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(132)
plt.title("Shifted down", fontsize=14)
plt.imshow(shifted_image_down.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(133)
plt.title("Shifted left", fontsize=14)
plt.imshow(shifted_image_left.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.show()
To shift all of our images and add them to a training data set, we can use the following code:
X_train_augmented = [image for image in X_train[:1000]]
y_train_augmented = [label for label in y_train[:1000]]
for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    for image, label in zip(X_train[:1000], y_train[:1000]):
        X_train_augmented.append(shift_image(image, dx, dy))
        y_train_augmented.append(label)
X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)
You now have 5000 images in the X_train_augmented and y_train_augmented sets (the 1000 original images plus four shifted copies of each: up, down, left, and right).
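One optional step, not required by the assignment: the augmented set is ordered (all originals first, then each block of shifted copies), so you may wish to shuffle it before training; a minimal sketch using NumPy:
#optional: shuffle so the shifted copies are not grouped in blocks
shuffle_idx = np.random.permutation(len(X_train_augmented))
X_train_augmented = X_train_augmented[shuffle_idx]
y_train_augmented = y_train_augmented[shuffle_idx]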
Let’s now fine-tune and fit a decision tree classifier using this augmented training data.
dt_clf = DecisionTreeClassifier(random_state=30)
params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [3, 4, 5]}
grid_search_cv = GridSearchCV(dt_clf, params, verbose=1, cv=3)
#This will take around 10 minutes to run
grid_search_cv.fit(X_train_augmented, y_train_augmented)
grid_search_cv.best_estimator_
Now, what are the best hyperparameters?
Finally, obtain cross-validation accuracy scores with this model trained on the augmented data.
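This mirrors the earlier cross_val_score call, only with the augmented arrays; a minimal sketch (the variable name dt_clf_aug is our own):
#cross-validation accuracy for the model fine-tuned on the augmented data
dt_clf_aug = grid_search_cv.best_estimator_
cross_val_score(dt_clf_aug, X_train_augmented, y_train_augmented, cv=3, scoring="accuracy")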
What are the accuracy scores now?
Did the augmented data help?
Think of one downside to using augmented data and comment.
- Submit your code in a Jupyter notebook. Include all code: the code examples above and what you write.
- Put your answers to the questions above into markdown cells.
- Use a single hash mark (#) heading to label the problem.