Computer Science / Software Engineering Notes Network

Machine Learning Technologies

Matthew Barnes

About the module

Theory

Introduction to ML Technologies (Chapter 1)

Classification (Chapter 3)

Regression (Chapter 4)

Linear regression

Polynomial regression

Regularisation

Gradient Descent

Softmax regression

SVMs (Chapter 5)

Decision trees (Chapter 6)

Ensemble learning (Chapter 7)

Voting

Bagging and pasting

Random

Boosting

Stacking

Dimensionality reduction (Chapter 8)

Curse of dimensionality

Projection

Manifold learning

Principal Component Analysis (PCA)

Locally Linear Embedding (LLE)

Artificial Neural Nets (Chapter 10)

Deep Neural Nets (Chapter 11)

Vanishing gradients

Exploding gradients

Slow training with gradient descent techniques

Overfitting issues in large networks

Convolutional Neural Nets (Chapter 13)

Recurrent Neural Nets (Chapter 14)

Implementation

SVMs

Linear SVM regression

Non-linear SVM regression

Polynomial features

Kernel trick

RBF kernel

Decision trees

Classification

Regression

Ensemble learning

Bagging and pasting

Random forest

Gradient boosting

Dimensionality reduction

Principal Component Analysis (PCA)

Incremental PCA

Kernel PCA

Locally Linear Embedding (LLE)

Deep Neural Nets

Setting up

Setting up a layer

Setting up a full DNN

Setting up the loss function

Setting up the training part

Evaluating performance of model

Initialise model parameters

Full training code for DNNs

Using DNNs

Convolutional Neural Nets

Stacking feature maps

Pooling layers

Recurrent Neural Nets

TL;DR

Practical Use Cases

Introduction to practical use cases

Coursework intro

Task and Data

Design and Implementation

Evaluation

FAQ

Marking Scheme

About the module

Theory

Introduction to ML Technologies (Chapter 1)

Classical software implementation process: a system designer hand-crafts the rules.

Machine learning process: the rules are learned from data.

ML copes with adaptive (changing) environments.

ML can generate new insights from data.

Classification (Chapter 3)

Precision: the ratio of correct predictions among the “yes” predictions.

“Did we only predict the ‘yes’ that we should have predicted?”

Recall: the ratio of actual “yes” instances that were successfully predicted.

“Did we predict all the ‘yes’ that we should have predicted?”
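As a quick numeric illustration (my own sketch, not from the notes; the labels are made up), both ratios can be read off a confusion matrix:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical binary labels: 1 = "yes", 0 = "no"
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision =", tp / (tp + fp))   # correct ones among the "yes" predictions
print("recall    =", tp / (tp + fn))   # actual "yes" instances we caught

# Same values via scikit-learn's helpers
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))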

Regression (Chapter 4)

Linear regression

Polynomial regression

Regularisation

Gradient Descent

  1. We pick a random point on the curve
  2. We get the gradient at that point
  3. We travel down the gradient using one of the gradient descent update equations, with the step size set by the learning rate (see the sketch after this list)
  4. Find the new point on the curve and repeat until we’re at the bottom

A convex function, as opposed to a non-convex one, has only one global minimum, so gradient descent is guaranteed to reach it; a non-convex function can trap gradient descent in a local minimum.
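A minimal sketch of the idea (my own, not from the notes), minimising the convex function f(theta) = (theta - 3)**2 with plain gradient descent; the learning rate and starting point are arbitrary:

import numpy as np

def gradient(theta):
    # derivative of f(theta) = (theta - 3)**2
    return 2 * (theta - 3)

theta = np.random.uniform(-10, 10)  # 1. pick a random starting point
eta = 0.1                           # learning rate (step size)
for _ in range(100):
    grad = gradient(theta)          # 2. gradient at the current point
    theta = theta - eta * grad      # 3. step down the gradient
print(theta)                        # 4. converges towards the minimum at 3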

Softmax regression

  1. For every instance x, compute a softmax score sk(x) for each category k
  2. Estimate the probability of x belonging to each class k by applying the softmax function to the scores
  3. Pick the class with the highest probability for x
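For illustration (my own sketch, not from the notes), a minimal numpy version of those three steps, with made-up scores:

import numpy as np

scores = np.array([2.0, 1.0, 0.1])    # 1. softmax scores s_k(x) for each class k
exp = np.exp(scores - scores.max())   # subtract the max for numerical stability
probs = exp / exp.sum()               # 2. softmax turns scores into probabilities
prediction = np.argmax(probs)         # 3. pick the most probable class
print(probs, prediction)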

SVMs (Chapter 5)

where n is the number of features we currently have

  1. You set some points as landmarks and position a similarity function on each landmark
  2. You then evaluate the similarity functions on every data point
  3. You use those values as the new feature values (instead of using the original values)
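A small illustration of this idea (my own sketch, not from the notes), using a Gaussian RBF similarity function and two arbitrary landmarks; the function name and data are made up:

import numpy as np

def rbf_similarity(x, landmark, gamma=0.3):
    # Gaussian RBF: exp(-gamma * ||x - landmark||^2)
    return np.exp(-gamma * np.sum((x - landmark) ** 2))

X = np.array([[-4.0], [-1.0], [0.0], [2.0], [4.0]])  # original 1-D data
landmarks = np.array([[-2.0], [1.0]])                # 1. chosen landmarks

# 2.-3. evaluate the similarity to each landmark and use the results as new features
X_new = np.array([[rbf_similarity(x, l) for l in landmarks] for x in X])
print(X_new.shape)   # (5, 2): one new feature per landmark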

Decision trees (Chapter 6)

Ensemble learning (Chapter 7)

A lonely classifier, all on its own.

Whoa! It has loads of friends now!

All of its friends work together by forming their own results and then aggregating them.

Voting

Bagging and pasting

Random

Boosting

AdaBoost:

  1. Assign a weight to each data point in the training set
  2. Train the first classifier on some features of the whole data set
  3. Identify the wrongly classified data points and adjust their weights
  4. Train the following (second) classifier on other features, taking the weights into account
  5. Repeat steps 3 and 4 until you have exhausted the feature space, used the desired number of predictors, or are happy with the final output (see the sketch after these lists)

Gradient boosting:

  1. Train a model (e.g. a regressor) on the training set
  2. Use the model to predict the training set and calculate the error
  3. Train the next model on the error of the previous model
  4. Calculate the new error between the prediction and the old error
  5. Repeat steps 3 and 4 until you have used the desired number of predictors or are happy with the final result
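As an illustration of the first procedure (my own sketch, not from the notes; the moons dataset and the hyperparameters are arbitrary), scikit-learn's AdaBoostClassifier automates the instance reweighting:

from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

# 200 shallow trees trained sequentially; each one pays more attention to the
# instances its predecessors got wrong (via the instance weights)
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    learning_rate=0.5)
ada_clf.fit(X, y)
print(ada_clf.predict(X[:5]))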

Stacking

Dimensionality reduction (Chapter 8)

Curse of dimensionality

Projection

Manifold learning

Principal Component Analysis (PCA)

Locally Linear Embedding (LLE)

Artificial Neural Nets (Chapter 10)

How a single artificial neuron works:

  1. Aggregate the inputs
  2. If the result is above a threshold, fire a signal
  3. If it is not above the threshold, do not fire a signal

Training with backpropagation:

  1. Build a loss function L
  2. Employ calculus to calculate the partial derivative of L with respect to each weight
  3. Use a differentiable activation function (usually sigmoid)
  4. We now know which way we need to “nudge” each weight for a given training sample (see the sketch after this list)
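A tiny numpy sketch of that weight “nudge” for a single sigmoid neuron (my own illustration, not from the notes; the sample, target and learning rate are made up):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2, 0.8])   # one training sample
target = 1.0
w = np.zeros(3)
b = 0.0
eta = 0.1                        # learning rate

for _ in range(100):
    z = np.dot(w, x) + b         # aggregate the inputs
    y = sigmoid(z)               # differentiable activation
    loss = 0.5 * (y - target) ** 2
    # chain rule: dL/dw = (y - target) * sigmoid'(z) * x
    grad_z = (y - target) * y * (1 - y)
    w -= eta * grad_z * x        # nudge each weight downhill
    b -= eta * grad_z
print(sigmoid(np.dot(w, x) + b)) # much closer to the target of 1.0 than the initial 0.5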

Deep Neural Nets (Chapter 11)

Vanishing gradients

Exploding gradients

Slow training with gradient descent techniques

Overfitting issues in large networks

Convolutional Neural Nets (Chapter 13)

Recurrent Neural Nets (Chapter 14)

Implementation

SVMs

Linear SVM regression

from sklearn.svm import LinearSVR, LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Without a pipeline: linear SVM regression
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)

# With a pipeline (note: this variant is the linear SVM classifier)
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)

Parameters (LinearSVR):

C: Regularisation parameter. The strength of the regularisation is inversely proportional to C.

epsilon: Epsilon parameter for the epsilon-insensitive loss function.

loss: Specifies the loss function.

Non-linear SVM regression

from sklearn.svm import SVR

svm_poly_reg = SVR(
    kernel="poly",
    degree=2,
    C=100,
    epsilon=0.1)
svm_poly_reg.fit(X, y)

Parameters (SVR):

C: Regularisation parameter. The strength of the regularisation is inversely proportional to C.

epsilon: Epsilon parameter for the epsilon-SVR model.

kernel: Specifies the kernel type.

degree: Degree of the polynomial kernel function.

Polynomial features

from sklearn.preprocessing import PolynomialFeatures

poly_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge"))
])
poly_svm_clf.fit(X, y)

Parameters (PolynomialFeatures):

degree: The degree of the polynomial features.

include_bias: If true, includes a bias column of all ones, where all polynomial powers are zero.

Kernel trick

from sklearn.svm import SVC

poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(
        kernel="poly",  # include this to use the kernel trick
        degree=3,
        coef0=1,
        C=5)),
])

Parameters (SVC, given that kernel="poly"):

coef0: Independent term in the kernel function.

RBF kernel

rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(
        kernel="rbf",
        gamma=5,
        C=0.001)),
])

Parameters (SVC, given that kernel="rbf"):

gamma: Kernel coefficient for ‘rbf’; can be ‘scale’, ‘auto’ or a float.

Decision trees

Classification

from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

Parameters (DecisionTreeClassifier):

max_depth: Maximum depth of the tree.

criterion: The function to measure the quality of a split; can either be “gini” or “entropy”.

Regression

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)

Parameters (DecisionTreeRegressor):

max_depth: Maximum depth of the tree.

Ensemble learning

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[
        ('lr', log_clf),
        ('rf', rnd_clf),
        ('svc', svm_clf)
    ],
    voting='hard'
)
voting_clf.fit(X_train, y_train)

Parameters (VotingClassifier):

estimators: List of (name, classifier) pairs to ensemble.

voting: The type of voting, either ‘hard’ (majority vote) or ‘soft’ (average of the predicted class probabilities).
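As a side note (my addition, not from the original notes): soft voting averages predicted class probabilities, so every estimator must expose predict_proba; for SVC that means setting probability=True. A minimal variant of the ensemble above:

soft_voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('rf', RandomForestClassifier()),
        ('svc', SVC(probability=True))   # needed so SVC can output probabilities
    ],
    voting='soft'
)
soft_voting_clf.fit(X_train, y_train)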

Bagging and pasting

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    n_jobs=-1
)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

Parameters (BaggingClassifier):

base_estimator: The base estimator to fit on random subsets of the dataset.

n_estimators: The number of base estimators in the ensemble.

max_samples: The number of samples to draw from X to train each base estimator.

bootstrap: If true, bagging (sampling with replacement); if false, pasting (sampling without replacement).

n_jobs: The number of jobs to run in parallel; -1 means using all processors.

Random forest

# Code v1: optimised for decision trees
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(
    n_estimators=500,
    max_leaf_nodes=16,
    n_jobs=-1)
rnd_clf.fit(X_train, y_train)

# Code v2: generic solution
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(
        splitter="random",
        max_leaf_nodes=16),
    n_estimators=500,
    max_samples=1.0,
    bootstrap=True,
    n_jobs=-1
)

Parameters (RandomForestClassifier):

n_estimators: The number of trees in the forest.

max_leaf_nodes: Grow trees with at most max_leaf_nodes, in best-first fashion.

n_jobs: The number of jobs to run in parallel; -1 means using all processors.

Gradient boosting

# Code v1: manual gradient boosting
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

# Next predictor, trained on the residual errors of the first
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

# Next predictor, trained on the residual errors of the second
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

# Prediction: sum the predictions of all the trees
y_pred = sum(tree.predict(X_new)
             for tree in (tree_reg1, tree_reg2, tree_reg3))

# Code v2: automatic
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=3,
    learning_rate=1.0)
gbrt.fit(X, y)

Parameters (GradientBoostingRegressor):

max_depth: Maximum depth of the individual regressors.

n_estimators: Number of boosting stages to perform.

learning_rate: Shrinks the contribution of each tree by learning_rate.

Dimensionality reduction

Principal Component Analysis (PCA)

from sklearn.decomposition import PCA

# Explicitly state the number of dimensions
pca = PCA(n_components=2)

# Or choose the number of dimensions that adds up to a
# sufficiently large portion of the variance (in this case, 95%)
pca = PCA(n_components=0.95)

X2D = pca.fit_transform(X)

Parameters (PCA):

n_components: Number of components to keep.

Incremental PCA

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)

# Split the data into batches and fit them one at a time
for X_batch in np.array_split(X_mnist, n_batches):
    inc_pca.partial_fit(X_batch)

# Perform the transformation in one go
X_mnist_reduced = inc_pca.transform(X_mnist)

Parameters (IncrementalPCA):

n_components: Number of components to keep.

Kernel PCA

from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(
    n_components=2,
    kernel="rbf",
    gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)

Parameters (KernelPCA):

n_components: Number of components to keep.

kernel: Type of kernel.

gamma: Kernel coefficient.

Locally Linear Embedding (LLE)

from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(
    n_components=2,
    n_neighbors=10)
X_reduced = lle.fit_transform(X)

Parameters (LocallyLinearEmbedding):

n_components: Number of coordinates for the manifold.

n_neighbors: Number of neighbours to consider for each point.

Deep Neural Nets

Setting up

import tensorflow as tf

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

Setting up a layer

import numpy as np

def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="weights")
        b = tf.Variable(tf.zeros([n_neurons]), name="biases")
        z = tf.matmul(X, W) + b
        if activation == "relu":
            return tf.nn.relu(z)
        else:
            return z

Setting up a full DNN

with tf.name_scope("dnn"):
 hidden1 = neuron_layer(X, n_hidden1,
"hidden1", activation="relu")
 hidden2 = neuron_layer(hidden1, n_hidden2,
"hidden2", activation="relu")
 logits = neuron_layer(hidden2, n_outputs,
"outputs")

Setting up the loss function

with tf.name_scope("loss"):
 xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
 loss = tf.reduce_mean(xentropy, name=
"loss")

Setting up the training part

learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

Evaluating performance of model

with tf.name_scope("eval"):
 correct = tf.nn.in_top_k(logits, y,
1)
 accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

Initialise model parameters

init = tf.global_variables_initializer()
saver = tf.train.Saver()

Full training code for DNNs

# n_epochs, batch_size and the mnist dataset object are assumed to be defined elsewhere
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

    save_path = saver.save(sess, "./my_model_final.ckpt")

Using DNNs

with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    X_new_scaled = [...]  # some new images (scaled from 0 to 1)
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred = np.argmax(Z, axis=1)

Convolutional Neural Nets

Stacking feature maps

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.datasets import load_sample_images

# Load sample images
dataset = np.array(load_sample_images().images, dtype=np.float32)
batch_size, height, width, channels = dataset.shape

# Create 2 filters
filters_test = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32)
filters_test[:, 3, :, 0] = 1  # Vertical line
filters_test[3, :, :, 1] = 1  # Horizontal line

# Create a graph with input X plus a convolutional layer applying the 2 filters
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
convolution = tf.nn.conv2d(X, filters_test, strides=[1, 2, 2, 1], padding="SAME")

with tf.Session() as sess:
    output = sess.run(convolution, feed_dict={X: dataset})

plt.imshow(output[0, :, :, 1])  # Plot the 1st image's 2nd feature map
plt.show()

Parameters (tf.nn.conv2d):

X: the input mini-batch (a 4D tensor).

filters: the set of filters to apply (also a 4D tensor).

strides: a 4-element 1D array; the two central elements are the vertical and horizontal strides. The first and last elements must currently be equal to 1 (they may one day be used to specify a batch stride, to skip some instances, and a channel stride, to skip some of the previous layer's feature maps or channels).

padding: must be "VALID" (no zero padding; some rows and columns at the edges may be ignored) or "SAME" (zero padding is added where necessary).

Pooling layers

[...]  # load the image dataset, just like above

# Create a graph with input X plus a max pooling layer
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
max_pool = tf.nn.max_pool(X, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")

with tf.Session() as sess:
    output = sess.run(max_pool, feed_dict={X: dataset})

plt.imshow(output[0].astype(np.uint8))  # plot the output for the 1st image
plt.show()

Recurrent Neural Nets

# n_steps, n_inputs and n_neurons are assumed to be defined elsewhere
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

# A basic RNN cell with n_neurons recurrent units, unrolled over the n_steps time steps
basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

TL;DR

Practical Use Cases

Introduction to practical use cases

  1. Problem: Look at the big picture
  2. Data: Get the data
  3. Data, Algorithm Design: Discover and visualise to gain insights
  4. Algorithm Design: Prepare the data for machine learning algorithms
  5. Algorithm Design & Implementation: Select a model and train it
  6. Evaluation, Iteration: Fine-tune your model
  7. Release: Present your solution
  8. Maintain: Launch, monitor and maintain your system

  1. Problem characterization
  2. Data characterization
  3. Data analysis & hypothesis formulation
  4. Algorithm choices
  5. Algorithm implementation
  6. Evaluation setup & analysis
  7. Iterative improvement
  8. Release
  9. Maintain

Coursework intro

Task and Data

Design and Implementation

Evaluation

FAQ

Marking Scheme