Computer Vision
Joshua Gregory

Notes intro 10
Introduction 10
Lecture 1: Eye and Human
Vision 12
The human eye 12
Optics 13
Spectral responses 14
Mach bands 15
Neural processing 16
Lecture 2: Image Formation 16
Decomposition 16
Resolution 17
Fourier Transform 17
What the Fourier Transform actually
does 18
A pulse and its Fourier
transform 19
Reconstructing signal from its Fourier
Transform 20
Magnitude and phase of the Fourier transform of
a pulse 21
Lecture 3: Image Sampling 21
Aliasing in Sampled Imagery 21
Aliasing 22
Sampling Signals 22
Wheels Motion 22
Sampling Theory 23
Transform Pair from Sampled
Pulse 23
2D Fourier Transform 24
Reconstruction (non examinable) 26
Shift Invariance 26
Rotation Invariance 27
Filtering 27
Other Transforms 28
Applications of 2D Fourier
Transform 29
Lecture 4: Point Operators 29
Image Histograms 29
Brightening an Image 29
Intensity Mappings 30
Exponential/Logarithmic Point Operators (non
examinable) 30
Intensity normalisation and histogram
equalisation 31
Histogram Equalisation 31
Applying intensity normalisation and histogram
equalisation 32
Thresholding an eye image 32
Thresholding: Manual vs
Automatic 33
Lecture 5: Group Operators 33
Template Convolution 34
3x3 template and weighting
coefficients 35
3x3 averaging operator 36
Illustrating the effect of window
size 37
Template convolution via the Fourier
transform 37
2D Gaussian function 37
2D Gaussian template 39
Applying Gaussian averaging 39
Finding the median from a 3x3
template 39
Newer stuff (non-examinable) 40
Applying non local means 41
Even newer stuff: Image Ray
Transform 41
Applying Image Ray Transform 41
Comparing operators 43
Lecture 6: Edge Detection 43
Edge detection 43
First order edge detection 43
Edge detection maths 45
Templates for improved first order
difference 45
Edge Detection in Vector Format 45
Templates for Prewitt operator 46
Applying the Prewitt Operator 46
Templates for Sobel operator 47
Applying Sobel operator 47
Generalising Sobel 47
Generalised Sobel (non
examinable) 48
Lecture 7: Further Edge
Detection 48
Canny edge detection operator 48
Interpolation in non-maximum
suppression 49
Hysteresis thresholding transfer
function 50
Action of non-maximum suppression and hysteresis
thresholding 51
Hysteresis thresholding vs uniform
thresholding 51
Canny vs Sobel 51
First and second order edge
detection 53
Edge detection via the Laplacian
operator 53
Mathbelts on… 54
Shape of Laplacian of Gaussian
operator 54
Zero crossing detection 55
Marr-Hildreth edge detection 55
Comparison of edge detection
operators 55
Newer stuff - interest detections
(non-examinable) 56
Newer stuff - saliency 56
Lecture 8: Finding Shapes 57
Feature extraction by
thresholding 57
Template Matching 57
Template matching in: 58
In Noisy images 58
In Occluded Images 58
Encore, Monsieur Fourier! (???) (non
examinable) 59
Applying Template Matching (non
examinable) 59
Applying SIFT in ear biometrics (non
examinable) 60
Hough Transform 60
Applying the Hough transform for
lines 61
Hough Transform for Lines …
problems 61
Images and accumulator space of polar Hough
Transform 62
Applying Hough Transform 62
Lecture 9: Finding More
Shapes 63
Hough Transform for Circles 63
Circle Voting and Accumulator
Space 64
Speeding it up 64
Applying the HT for circles 64
Integrodifferential operator? (non
examinable) 64
Arbitrary Shapes 65
R-table Construction 65
Active Contours (non
examinable) 66
Geometric active contours (non
examinable) 67
Parts-based shape modelling (non
examinable) 67
Symmetry yrtemmyS (non
examinable) 68
Lecture 10 Applications/Deep
Learning 68
Where is computer vision used? 68
Deep Learning 68
Conclusions 68
Lecture 1: Building machines that
see 71
Key terms in designing Computer Vision
systems 71
Robustness 71
Repeatability 71
Invariance 71
Constraints 71
Constraints in Industrial
Vision 72
Software Constraints 72
Colour-Spaces (non examinable?) 72
RGB Colour-space 72
HSV Colour-space 72
Physical Constraints 73
Vision in the wild 73
Lecture 2: Machine learning for pattern
recognition 73
Feature Spaces 73
Key terminology 74
Density and Similarity 74
Distance in featurespace 74
Euclidean distance (L2 distance) 75
Manhattan/Taxicab distance (L1
distance) 76
Cosine Similarity 76
Choosing good featurevector representations for machine learning 77
Supervised Machine Learning:
Classification 77
Linear Classifiers 77
Non-linear binary classifiers 78
Multiclass classifiers: KNN 80
KNN Problems (non examinable?) 80
Unsupervised Machine Learning:
Clustering 81
K-Means Clustering 81
Lecture 3: Covariance and Principal
Components 82
Random Variables and Expected
Values 82
Variance 82
Covariance 82
Covariance Matrix 83
Mean Centring 83
Covariance matrix again 84
Principal axes of variation 85
Basis 85
The first principal axis 85
The second principal axis 85
The third principal axis 85
Eigenvectors and Eigenvalues 86
Important Equation 86
Properties 86
Finding Values 87
Eigendecomposition 87
Summary 87
Ordering 87
Principal Component Analysis 87
Linear Transform 87
Linear Transforms 88
PCA 88
PCA Algorithm 89
Eigenfaces 89
Making Invariant 89
Problems 90
Potential Solution… Apply
PCA 90
Lecture 4: Types of image feature and
segmentation 91
Image Feature Morphology 91
Global Features 91
Grid or Block-based Features 91
Region-based Features 92
Local Features 92
Global Features 92
Image Histograms 92
Joint-colour histogram 93
Image Segmentation 94
What is segmentation? 94
Global Binary Thresholding 94
Otsu’s thresholding method 94
Adaptive / local thresholding 95
Mean adaptive thresholding 95
Segmentation with K-Means 95
Advanced segmentation techniques 96
Connected Components 96
Pixel Connectivity 96
Connected Component 96
Connected Component Labelling 96
The two-pass algorithm 97
Lecture 5: Shape description and
modelling 98
Extracting features from shapes represented by
connected components 98
Borders 98
Inner Border 98
Outer Border 98
Two ways to describe shape 99
Region Description: Simple Scalar Shape
Features 100
Area and Perimeter 100
Compactness 100
Centre of Mass 100
Irregularity / Dispersion 101
Moments 102
Standard Moments 102
Central Moments 102
Normalised Central Moments 103
Boundary Description 103
Chain Codes 103
Chain Code Invariance 103
Chain Code Advantages and
Limitations 104
Fourier Descriptors 104
Region Adjacency Graphs 105
Active Shape Models and Constrained Local
Models 105
Lecture 6: Local interest
points 105
What makes a good interest
point? 105
How to find interest points 106
The Harris and Stephens corner
detector 107
Basic Idea 107
Harris & Stephens:
Mathematics 107
Structure Tensor 108
Eigenvalues of the Structure
Tensor 109
Harris & Stephens Response
Function 109
Harris & Stephens Detector 110
Scale in Computer Vision 111
The problem of scale 111
Scale space theory 111
The Gaussian Scale Space 112
Nyquist-Shannon Sampling theorem 112
Gaussian Pyramid 113
Multi-scale Harris &
Stephens 113
Blob Detection Finally 113
Laplacian of Gaussian 113
Scale space LoG 114
Scale space DoG 115
DoG Pyramid 115
Lecture 7: Local features and
matching 116
Local features and matching
basics 116
Local Features 116
Why extract local features? 116
Example: Building a panorama 116
Problem 1: 116
Problem 2: 117
Two distinct types of matching
problem 117
Narrow-baseline stereo 117
Wide-baseline stereo 117
Two distinct types of matching
problem 118
Robust local description 118
Descriptor Requirements 118
Matching by correlation (template
matching) 118
(Narrow baseline) template
matching 118
Problems with wider baselines 120
Local Intensity Histograms 120
Use local histograms instead of pixel
patches 120
Local histograms 120
Overcoming localisation
sensitivity 121
Overcoming lack of illumination
invariance 121
Local Gradient Histograms 121
Gradient Magnitudes and
Directions 121
Gradient Histograms 121
Building gradient histograms 121
Rotation Invariance 121
The SIFT feature 122
Adding spatial awareness 122
SIFT Construction: sampling 123
SIFT Construction: weighting 123
SIFT Construction: binning 123
Matching SIFT features 124
Euclidean Matching 124
Improving matching performance 124
Lecture 8: Consistent
matching 124
Feature distinctiveness 124
Constrained matching 125
Geometric Mappings 125
What are geometric transforms? 125
Point Transforms 125
The Affine Transform 125
Translation 126
Translation and Rotation 126
Scaling 126
Aspect Ratio 126
Shear 126
Degrees of Freedom 126
Affine Transform 129
Similarity Transform 130
More degrees of freedom 130
Homogeneous coordinates 131
The Planar Homography (Projective
Transformation) 131
Recovering a geometric mapping 131
Simultaneous equations 131
Least-squares 132
Robust Estimation 132
Problem: Noisy data 133
Robust estimation techniques 133
RANSAC: RANdom SAmple Consensus 133
Further applications of robust local
matching 134
Object recognition & AR 134
3D reconstruction 134
Problems with direct local feature
matching 134
Local feature matching is slow! 134
Efficient Nearest Neighbour
Search 134
K-D Trees 134
K-D Tree problems 135
Hashing 137
Sketching 138
Lecture 9: Image search and Bags of Visual
Words 138
Text Information Retrieval 138
The bag data structure 138
Bag of Words 138
Text processing (feature
extraction) 139
The Vector-Space Model 139
Bag of Words Vectors 140
The Vector-space Model 140
Searching the VSM 141
Recap: Cosine Similarity 141
Inverted Indexes 141
Computing the Cosine Similarity 142
Weighting the vectors 142
Possible weighting schemes 143
Vector Quantisation 143
Learning a Vector Quantiser 143
Vector Quantisation 143
Visual Words 144
SIFT Visual Words 144
Bags of Visual Words 144
Histograms of Bags of Visual
Words 145
Visualising Visual Words 146
The effect of codebook size 146
Content-based Image Retrieval 147
BoVW Retrieval 147
Optimal codebook size 147
Problems with big codebooks 147
Overall process for building a BoVW retrieval
system 148
Lecture 10: Image classification and
auto-annotation 148
Multilabel classification 148
Object Detection / Localisation 149
Slide summary: Challenges in Computer
Vision 149
Aside: Optimal codebook size 150
Another slide summary: Stuff 150
Dense Local Image Patches 150
Dense SIFT 150
Pyramid Dense SIFT 151
Spatial Pyramids 152
Developing and benchmarking a BoVW scene
classifier 152
Evaluation Dataset 152
Building the BoVW 153
Training classifiers 153
Classifying the test set 153
Evaluating Performance 153
Lecture 11: Towards 3D
vision 153
Summary Summary 154
Programming for computer vision & other
musings related to the coursework 155
Writing code for computer
vision 155
Most vision algorithms are
continuous 155
Always work with floating point
pixels 155
Guidelines for writing vision
code 155
Convolution 155
Aside: phase and magnitude 156
Aside: Displaying FFTs 156
Template Convolution 157
What if you don’t flip the
kernel? 158
Ideal Low-Pass filter 158
Ideal Low-Pass filter -
problems 159
Gaussian filters - why 160
Building Gaussian Filters 160
High-pass filters 161
Note - Don’t do this! 161
High-pass filters have a mixture of negative
and positive coefficients 162
Building hybrid images 162
…is really simple 162
TL;DR 163
Part 1: Mark 164
Part 2: Jon 164
Lec 1 164
Lec 2 164
Lec 3 165
Lec 4 167
Lec 5 169
Lec 6 172
Lec 7 174
Lec 8 176
Lec 9 178
Lec 10 180
TL;DR By Mark Towers 182
Mark (edited by Joshua Gregory) 182
Jon 187
Notes intro
- Basically this is a summary of the slides, plus some extra
explanation.
- Text coloured dark yellow, “like this”, is content from the slides that I have decided to include because I think it is probably important/interesting, even though there was no hand-with-bow marker for it.

Introduction
Images consist of picture elements known as “pixels”.

- (x,y) - location
- z - depth
- colour
- T - time
2D Images are matrices of numbers.
- Grey level image
- 3D view
- Corresponding Matrix
Point Operations:
- Modify Brightness
- Find Intensity
Group Operations:
- Image filtering
- Edge detection
Feature Extraction
- Roads in remotely-sensed image
- Artery in ultrasound image
Applications of Computer Vision
- Image coding (MPEG/JPEG)
- Product inspection
- Robotics
- Modern cameras/phones
- Medical imaging
- Demography
- Biometrics
Part 1: Mark
Lecture 1: Eye and Human Vision
Is human vision a good model for computer vision?
The human eye

Diagram of the human eye. Light comes in through the lens and is focused onto the
retina. These light impulses are then sent through the optic nerve.
Cones are photopic (colour vision) and rods are scotopic (black-and-white, low-light vision). (Nixon made a mistake on his notes) - Thanks Tim :)
Optics

The image formed on the retina is inverted, just like in a camera. Our brain works this out and we perceive the scene the right way up.
Spectral responses

The eye is most sensitive to green light and least sensitive to blue, but the point is that different colours produce different response levels.
Mach bands

Mach bands are an optical illusion whereby the contrast between edges of slightly
differing shades of grey is exaggerated when they are touching.

[Source: Wikipedia]
This and many other optical illusions (edges, static, Benham’s disk... you can look at them in the slides) demonstrate that our eyes are actually not such a reliable vision system.
Neural processing

Lecture 2: Image Formation
What is inside an image?
Decomposition
You can decompose an image into its bits.

- For a greyscale image, each pixel in the image is usually represented by 8 bits; a byte.
- The pixel value ranges from 0 to 255, representing all the grey levels from black to white.
- When you decompose an image, you look at a single bit of each pixel, which is either 0 or 1, so either black or white.
- So for bit 0, only the least significant bit of each pixel is shown in the image. As you can see, this bit carries the least information and is very noisy.
- As you look at the images for increasing bit positions, you can see more information being represented.
- The main point here is that the image gets clearer as you go up to the most significant bit (bit 7). (A quick sketch of bit-plane extraction is given below.)
Resolution
Images are also (obviously) less detailed at lower resolutions.

Fourier Transform
Any periodic function is the result of adding up sine and cosine waves of different
frequencies.
[This is a good visual intuitive: https://www.youtube.com/watch?v=spUNpyF58BY]

F(ω) = ℱ[f(t)] = ∫ f(t) e^(−jωt) dt  (integral over all time t)

- F(ω) is the Fourier transform
- ℱ denotes the Fourier transform process
- ω is the angular frequency, measured in radians/s (where frequency is the reciprocal of time, ω = 2πf = 2π/T)
- j is the complex variable (usually √−1)
- f(t) is a continuous signal (varying continuously with time)
- F(ω) gives the frequency components in f(t)
What the Fourier Transform actually does

(To do: explain how a function that deals with (sound) waves can be used on images
with pixels. If anyone reading can explain that’d be great)
A: The Fourier transform is just a technique for mapping signals that vary in one dimension to the frequency domain of that dimension. In the case of audio this means transforming air pressure from the time domain to the (temporal) frequency domain. If we consider the intensity of an image to be our signal, which varies in space (let's say along the x axis), then the FT will transform the signal into spatial frequency in the x dimension.
Now, we can generalise the Fourier transform to take signals that are defined in more than one mutually orthogonal dimension, e.g. an image represented by pixel intensity defined in x and y. Now the FT transforms the image into the (spatial) frequency domain, defined in two dimensions (spatial frequency in the x direction, and spatial frequency in the y direction). Hope that helps? - James
- Pulse

- Use Fourier

- Evaluate integral

- And get result

A pulse and its Fourier transform

Reconstructing signal from its Fourier Transform

Magnitude and phase of the Fourier transform of a
pulse

Re() gives the real part and Im() gives the imaginary part, so for a complex number z = a + jb, Re(z) = a and Im(z) = b. In the complex plane, you can imagine the imaginary numbers as another number line at right angles to the real one. The magnitude is then Pythagoras’ theorem applied to the two parts, |z| = √(Re(z)² + Im(z)²), while the phase is the angle between them, arg(z) = tan⁻¹(Im(z)/Re(z)). For more information, look up complex numbers. (Thanks Mark ;) )
The magnitude is the scalar amount of each frequency component and the phase is its offset (from a regular sine wave). (-Alex) (A quick numerical check is sketched below.)
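As a rough numerical illustration (my own sketch, not from the slides), you can sample a pulse and inspect the magnitude and phase of its discrete Fourier transform with NumPy; the magnitude has the familiar sinc-like shape:

```python
import numpy as np

# A simple pulse: 1 over a short interval, 0 elsewhere
t = np.linspace(-1.0, 1.0, 256, endpoint=False)
pulse = np.where(np.abs(t) < 0.1, 1.0, 0.0)

# Discrete Fourier transform of the sampled pulse
F = np.fft.fftshift(np.fft.fft(pulse))

magnitude = np.abs(F)      # sqrt(Re^2 + Im^2), i.e. Pythagoras
phase = np.angle(F)        # atan2(Im, Re)

print(magnitude[120:136].round(2))   # sinc-like peak around zero frequency
print(phase[120:136].round(2))
```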
Lecture 3: Image Sampling
How is an image sampled and what does it
imply?
Aliasing in Sampled Imagery

- As you can see, the left image looks clearer because it is at a higher resolution, and therefore literally contains more information than the lower-resolution version.
- The low-resolution image on the right is aliased, and appears blurry compared to the left image.
- This is a result of the lower sampling rate.
Aliasing
Aliasing is an effect that causes different signals to become indistinguishable
when sampled. It also often refers to the distortion or artifact that results when a signal reconstructed
from samples is different from the original continuous signal. [-Wiki]
Sampling Signals

- The original signal is a continuous function. We want to sample it, and have to do so digitally.
- If we sample at a good (high) rate, then we can capture a good representation of the signal.
- If we sample at a bad (low) rate, then the captured signal becomes aliased, and is not a good representation.
- As you can see, the reconstructed wave is a completely wrong representation of the original signal.
Wheels Motion
- Have you ever seen a wheel (or propeller) turn sooooooooooo fast in
a video (or even real life) that it looks like it’s not turning at all!?
- Or even turning the wrong way!?!!??!??
- This is due to the sampling rate the wheel is captured at.

- The wheel is spinning at a rate close to (a multiple of) the sampling rate.
- This can cause the illusion of it standing still or turning in reverse.
- Look up some videos or something to see this in action.
- The point is that the sampling rate matters in vision.
Sampling Theory
- Nyquist’s sampling theorem is 1D: sample at at least twice the highest frequency present.
- E.g. speech with a 6 kHz bandwidth is sampled at 12 kHz.
- Video bandwidth (CCIR) is 5 MHz.
- Sampling at 10 MHz gave 576x576 images.
- Guideline: “two pixels for every pixel of interest”. (A quick numerical demonstration of aliasing is sketched below.)
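Here is that quick demonstration, a minimal NumPy sketch (my own, with made-up frequencies): a 5 kHz tone sampled at 12 kHz is fine, while an 8 kHz tone aliases down to 4 kHz:

```python
import numpy as np

fs = 12_000                      # sampling rate (Hz): fine for signals below 6 kHz
t = np.arange(0, 1.0, 1.0 / fs)  # one second of samples

def apparent_frequency(f_signal):
    """Sample a sine of frequency f_signal at fs and report where its energy appears."""
    x = np.sin(2 * np.pi * f_signal * t)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

print(apparent_frequency(5_000))  # below Nyquist (6 kHz): appears at 5000 Hz
print(apparent_frequency(8_000))  # above Nyquist: aliases to 12000 - 8000 = 4000 Hz
```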
Transform Pair from Sampled Pulse


Some stuff to do with the Fourier Transform…
(If anyone could explain what these graphs are demonstrating that’d be
epic)
A: It’s showing the effect of reconstructing the sampled signal from its frequency components. However, the images for (b) and (c) are the wrong way around. Essentially it’s trying to demonstrate that as you add more frequency components from the transform of the sampled signal, the reconstruction becomes a better approximation of the sampled signal, until eventually (f) all the frequency components are present and the reconstruction is perfect [(a) and (f) are identical]. - James
2D Fourier Transform
- The Fourier Transform has a forward and an inverse function.
- The forward transform:

FP(u,v) = (1/N) Σx Σy P(x,y) e^(−j2π(ux + vy)/N)

- Two dimensions of space, x and y
- Two dimensions of frequency, u and v
- Image of N×N pixels P(x,y)
- The inverse transform:

P(x,y) = (1/N) Σu Σv FP(u,v) e^(j2π(ux + vy)/N)
Reconstruction (non examinable)

Shift Invariance
- Images can be shifted.
- What happens to the magnitude and phase of the Fourier transform of a shifted image?

- This shows that the magnitude is not affected by shifting.
- But the phase is affected by shifting.
- What about for rotation?
Rotation Invariance

- As you can see, the transform is just rotated too.
Filtering
- Fourier gives access to frequency components.

Other Transforms

Applications of 2D Fourier Transform
- Understanding and analysis
- Speeding up algorithms
- Representation (invariance)
- Coding
- Recognition / understanding (e.g. texture)
Lecture 4: Point Operators
How many different operators are there which
operate on image points?
Image Histograms

- A histogram is a graph of frequency.
- In this context it is showing how many pixels have a certain
brightness level.
- This data is a global feature.
Brightening an Image
- An image can be brightened by using this formula:

N(x,y) = k × O(x,y) + l

- Where:
- N - new image
- O - old image
- k - gain
- l - level
- x,y - coordinates
Intensity Mappings

Exponential/Logarithmic Point Operators (non examinable)
Intensity Normalisation
- N - new image
- O - old image
- x,y - coordinates
- Nmax - maximum level of the new (output) image
- Nmin - minimum level of the new (output) image
- Omax - maximum level of the old (input) image
- Omin - minimum level of the old (input) image
- The normalised image is N(x,y) = (Nmax − Nmin) × (O(x,y) − Omin) / (Omax − Omin) + Nmin
- Avoids the need for parameter choice
Intensity normalisation and histogram equalisation

Histogram Equalisation
- N² points in the image; the sum of points per level is equal

- The cumulative histogram up to level p should be transformed to cover up to the level q

- Number of points per level in the output picture

- Cumulative histogram of the output picture

- Mapping for the output pixels at level q (a sketch implementing this mapping is given below)
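A minimal sketch (my own, in Python/NumPy) of that mapping via the cumulative histogram; the exact rounding/normalisation used here is an assumption, not lifted from the slides:

```python
import numpy as np

def equalise(image, levels=256):
    """Map grey levels so the cumulative histogram becomes (roughly) linear."""
    hist = np.bincount(image.ravel(), minlength=levels)
    cum = np.cumsum(hist)                          # cumulative histogram up to each level p
    # Scale the cumulative counts to cover the full output range of levels
    mapping = np.round((levels - 1) * cum / cum[-1]).astype(np.uint8)
    return mapping[image]                          # output pixel at level q = mapping[p]

image = np.random.randint(50, 120, size=(64, 64)).astype(np.uint8)  # low-contrast test image
print(image.min(), image.max())                        # narrow input range
print(equalise(image).min(), equalise(image).max())    # stretched towards 0..255
```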

Applying intensity normalisation and histogram
equalisation

Thresholding an eye image

Thresholding: Manual vs Automatic


Other thresholding methods include entropic thresholding and optimal thresholding.
Lecture 5: Group Operators
How do we combine points to make a new point in a
new image?
Template Convolution
I’m sure everyone knows about this by now…

This is when you take an original image and then apply a convolution template over
every pixel to create a resultant pixel value.

(to do: more explanation…)
- Convolution is a system response
- Template convolution includes coordinate inversion (flipping) in x and in y
- Inversion is not needed if the template is symmetric (a sketch of template convolution is given below)
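The sketch mentioned above, assuming Python/NumPy; it ignores border pixels, and because the averaging template is symmetric the coordinate inversion makes no difference:

```python
import numpy as np

def template_convolve(image, template):
    """Naive template convolution; border pixels are left at zero."""
    th, tw = template.shape
    flipped = template[::-1, ::-1]          # coordinate inversion (no-op for symmetric templates)
    out = np.zeros_like(image, dtype=float)
    for y in range(th // 2, image.shape[0] - th // 2):
        for x in range(tw // 2, image.shape[1] - tw // 2):
            window = image[y - th // 2:y + th // 2 + 1, x - tw // 2:x + tw // 2 + 1]
            out[y, x] = np.sum(window * flipped)
    return out

averaging = np.ones((3, 3)) / 9.0           # 3x3 averaging operator
image = np.random.rand(32, 32)
print(template_convolve(image, averaging)[16, 16])
```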
3x3 template and weighting coefficients

Where w(i,j) are the weights and x(i), y(j) denote the position of the point that matches the weighting coefficient position.
3x3 averaging operator

Illustrating the effect of window size

Larger windows lead to more blur.
Template convolution via the Fourier transform
Allows for fast computation for template sizes >= 7x7

- Template convolution is denoted *
- Take the Fourier transform of the picture
- Take the Fourier transform of the template
- Multiply them point by point (.×) and take the inverse transform: P * T = ℱ⁻¹( ℱ(P) .× ℱ(T) )
Note: it’s point-by-point multiplication, not matrix multiplication! The equation is written wrongly in various places.

2D Gaussian function

- Used to calculate template values
- Note compromise between variance σ² and window
size
- Common choices
- 5x5, 1.0
- 7x7, 1.2
- 9x9, 1.4 (a sketch for building such a template is given below)
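The sketch mentioned above: building a normalised 2D Gaussian template from a window size and σ (my own code; normalising the coefficients to sum to 1 is an assumption that keeps the overall brightness unchanged):

```python
import numpy as np

def gaussian_template(size, sigma):
    """Return a size x size template sampled from the 2D Gaussian, normalised to sum to 1."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return g / g.sum()

print(gaussian_template(5, 1.0).round(3))   # centre weight largest, falling off towards the edges
```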
2D Gaussian template

Applying Gaussian averaging

Finding the median from a 3x3 template

- Preserves edges
- Removes salt and pepper noise

Newer stuff (non-examinable)
Averaging which preserves regions

Applying non local means

Even newer stuff: Image Ray Transform
Use analogy to light to find shapes, removing the remainder

Applying Image Ray Transform

Comparing operators

Lecture 6: Edge Detection
What are edges and how do we find them?
Edge detection
- Images have edges in them.
- It’s a very useful thing to find these edges in computer
vision.
- There are a few different operators for edge detection of an
image.
- It involves maths.
First order edge detection
- You can detect vertical and horizontal separately.

- Vertical edges, Ex

- Horizontal edges, Ey

- Vertical and horizontal edges:

- Template:

- Code:

Edge detection maths
- Taylor expansion for

- By rearrangement


- Expand

Templates for improved first order difference

Edge Detection in Vector Format

Templates for Prewitt operator

Applying the Prewitt Operator

Templates for Sobel operator

- The two kernels used for the 3×3 Sobel operator (a sketch applying them is given below).
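The sketch mentioned above, assuming NumPy and SciPy's convolve2d; treat it as an illustration rather than the course's reference implementation:

```python
import numpy as np
from scipy.signal import convolve2d

# The two 3x3 Sobel templates (horizontal and vertical derivatives)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

def sobel_magnitude(image):
    gx = convolve2d(image, sobel_x, mode='same', boundary='symm')
    gy = convolve2d(image, sobel_y, mode='same', boundary='symm')
    return np.hypot(gx, gy)               # edge strength = sqrt(gx^2 + gy^2)

image = np.zeros((16, 16))
image[:, 8:] = 1.0                        # a vertical step edge
print(sobel_magnitude(image)[8, 6:10].round(2))  # strong response around the step
```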
Applying Sobel operator

Generalising Sobel
- Averaging

- Differencing

Generalised Sobel (non examinable)

Lecture 7: Further Edge Detection
What better ways are there to detect edges?
Canny edge detection operator

Formulated with three main objectives:
- Optimal detection with no spurious responses
- Good localisation with minimal distance between detected and true
edge position
- Single response to eliminate multiple responses to a single
edge.
Approximation
- Use Gaussian smoothing
- Use the Sobel operator (could combine with 1?)
- Use non-maximal suppression
- Threshold with hysteresis to connect edge points
Interpolation in non-maximum suppression

- Sobel edge detection is first order (equivalent to differentiation)
so gives us the change in x and the change in y, which we turn into a vector and use to find the
gradient direction (the dotted line in the diagram above), which will intuitively be at right angles to
the direction of the line (the non-dotted line above).
- In non-maximum suppression, we set all points that aren’t a
maximum along the line to zero (thereby thinning and sharpening the line).
- This is done by taking a line of points along the gradient
(dotted line) and setting all the points to zero, except those at a maximum.
- Interpolation is used to figure out the values of points
between pixels.
- Alternatively, we can round the direction to the nearest
horizontal/vertical/diagonal line
(Thanks Samuel Collins)
Hysteresis thresholding transfer function

To help better understand the function, here's an example:
- The level is currently at black
- Brightness is increased until it reaches and passes the “upper
switching threshold”
- The level now changes to white
- If the brightness is decreased and goes below the “upper
switching threshold” it will not switch to black
- The brightness will have to decrease until it is below the “lower switching threshold”, then it will switch back to black. (A sketch of hysteresis thresholding applied to edge strengths is given below.)
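The sketch mentioned above (my own, using SciPy's connected-component labelling): pixels above the upper threshold are definitely edges, and weaker pixels are kept only if they connect to a strong one:

```python
import numpy as np
from scipy import ndimage

def hysteresis_threshold(strength, low, high):
    """Keep pixels above `high`, plus pixels above `low` connected to them."""
    weak = strength > low
    strong = strength > high
    labels, n = ndimage.label(weak)               # connected regions of 'weak or better' pixels
    keep = np.zeros(n + 1, dtype=bool)
    keep[np.unique(labels[strong])] = True        # regions containing at least one strong pixel
    keep[0] = False                               # background label
    return keep[labels]

strength = np.array([[0.1, 0.4, 0.9, 0.4, 0.1],
                     [0.0, 0.1, 0.4, 0.1, 0.0],
                     [0.0, 0.0, 0.0, 0.0, 0.5]])
print(hysteresis_threshold(strength, low=0.3, high=0.8).astype(int))
# the isolated 0.5 is dropped; the weak pixels touching the 0.9 are kept
```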
Action of non-maximum suppression and hysteresis
thresholding

Hysteresis thresholding vs uniform thresholding

As you can see, hysteresis thresholding is arguably better.
Canny vs Sobel

First and second order edge detection

Edge detection via the Laplacian operator

Mathbelts on…

(I think this is essentially just rearranging an equation)
Shape of Laplacian of Gaussian operator

Zero crossing detection
- Basic - straight comparison
- Advanced:

- You could compare every point to try and find where the sign flips (where a 1 switches to a 0 in the sign image).
- So you average the 4 points in the corners
- Then you get 4 summations
- If one of those is positive and another is negative, then there
is a zero crossing.
Marr-Hildreth edge detection
- Application of LoG (or DoG) and zero-crossing detection.

Comparison of edge detection operators

Newer stuff - interest detections (non-examinable)

Newer stuff - saliency

Lecture 8: Finding Shapes
How can we group points to find shapes?
Feature extraction by thresholding
- Let’s try to extract features from this image using thresholding.

- Low threshold doesn't look that good.
- Neither does the high threshold.
- In conclusion: we need to identify shape!
Template Matching
- This is a technique for finding small parts of an image which match
a template image.
- Intuitively simple
- Correlation and convolution
- Implementation via Fourier
- Relationship with matched filter, viz: optimality

- The template is the Silverstone sign.
- This can be found in the image with template matching
- The accumulator space holds the match score for each possible template position; the peak gives the best match location
Template matching in:
In Noisy images

- Noise is an issue for template matching
- You want to create a matcher which is robust to noise
In Occluded Images

Encore, Monsieur Fourier! (???) (non examinable)

Applying Template Matching (non examinable)
Here, have this pretty low res image of a reaaaaally realistic weapon identification
system, which can detect weapons like guns and knives. (aka, this is an application of template
matching.)

Windows XP anyone?
Applying SIFT in ear biometrics (non examinable)
- Have you ever wanted to identify people by their ears?
- Well now you can!
- Do this stuff with the arrows and circles and stuff

Hough Transform
- This is another feature extraction technique
- It can find imperfect instances of objects within a certain class of
shapes by a voting procedure
- The voting procedure is carried out in a parameter space, from which object
candidates are obtained as local maxima in a so-called accumulator space.
- [-Wiki]
- Performance is equivalent to template matching, but faster
- A line through points (x, y) has gradient m and intercept c: y = mx + c
- You can rearrange to get c = y − mx, so each image point votes for a line in (m, c) parameter space
- In maths it’s the principle of duality

Go and read the following article about the Hough Transform: http://aishack.in/tutorials/hough-transform-basics/
It’s really good.
Applying the Hough transform for lines

Hough Transform for Lines … problems
- m and c tend to infinity for near-vertical lines
- Change the parameterisation
- Use the foot of the normal

- Gives the polar HT for lines: ρ = x·cos(θ) + y·sin(θ) (a sketch is given below)
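The sketch mentioned above (my own): each edge pixel votes for every (ρ, θ) pair it could lie on via ρ = x·cos(θ) + y·sin(θ), and peaks in the accumulator correspond to lines:

```python
import numpy as np

def hough_lines(edges, n_theta=180):
    """Vote in (rho, theta) space for every edge pixel. `edges` is a boolean image."""
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.deg2rad(np.arange(n_theta))
    accumulator = np.zeros((2 * diag, n_theta), dtype=int)   # rho in [-diag, diag)
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        accumulator[rhos + diag, np.arange(n_theta)] += 1
    return accumulator, thetas, diag

edges = np.zeros((50, 50), dtype=bool)
edges[np.arange(50), np.arange(50)] = True      # a perfect diagonal line y = x
acc, thetas, diag = hough_lines(edges)
rho_idx, theta_idx = np.unravel_index(acc.argmax(), acc.shape)
print(rho_idx - diag, np.rad2deg(thetas[theta_idx]))   # rho ~ 0, theta ~ 135 degrees
```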

Images and accumulator space of polar Hough
Transform

Applying Hough Transform

Lecture 9: Finding More Shapes
How can we go from conic sections to general
shapes?
Hough Transform for Circles
- Again, it’s duality:
- Equation of a circle: (x − x0)² + (y − y0)² = r²

Points | Parameters | Radius
x, y | x0, y0 | r
x0, y0 | x, y | r
Circle Voting and Accumulator Space

Like before, but this time for circles.
Speeding it up
- Now it’s a 3D accumulator, fast algorithms are
available
- E.g. by differentiation:

- So the edge gradient direction can be used, giving e.g. a 2D accumulator by:

Applying the HT for circles

This can be used for detecting the shapes of an eye and iris.
Integrodifferential operator? (non examinable)

https://stackoverflow.com/questions/27058057/comparing-irises-images-with-opencv
Looks cool???
Arbitrary Shapes
- Use Generalised HT
- Form (discrete) look-up-table (R-table)
- Vote via look-up-table
- Orientation? Rotate R-table voting
- Scale? Scale R-table voting
- Inherent problems with discretisation (process of transferring
continuous functions to discrete)
R-table Construction

Pick a reference point in the image.
For each boundary point x, compute Φ(x), the gradient direction.
r is the distance from the reference point to the boundary point (the radial distance), and α is the angle.
The table then stores each (r, α) pair indexed by the corresponding Φ.
(Thanks Bradley Garrod)
Active Contours (non examinable)
- For unknown arbitrary shapes: extract by evolution
- Elastic band analogy
- Balloon analogy
- Discrete vs. continuous
- Volcanoes? 🌋

Geometric active contours (non examinable)

Couple’a hippos 🦛🦛
Parts-based shape modelling (non examinable)

This guy don’t look so good
Symmetry yrtemmyS (non examinable)

SPooky
Lecture 10 Applications/Deep Learning
Where is feature extraction used these
days?
This lecture is mostly extra (non examinable) stuff, so I’m only gonna cover
the hand with bow slides. If you want a look then jump to the slides: http://comp3204.ecs.soton.ac.uk/mark/Lecture%2010.pdf
Where is computer vision used?
What you see depends on the viewpoint you take


Deep Learning

Conclusions
- Computer vision is changing the way we live
- Computer vision uses modern hardware and modern cameras to achieve
what we understand by “sight”
- No technique is a panacea (a solution for all difficulties): many
alternatives exist
- Computer vision is larger than this course
Part 2: Jon
Note: Jon has done some really good handout summaries which have basically done my
job for me, they can be found here: http://comp3204.ecs.soton.ac.uk/part2.html
Highly recommend reading these probably.
But I will still go through the slides and make notes of hand with bow
slides :)
Lecture 1: Building machines that see
Key terms in designing Computer Vision systems
- Robust
- Repeatable
- Invariant
- Constraints
- You want your system to be robust and repeatable.
- You design your system to be invariant.
- You apply constraints to
make it work.
Robustness
- The vision system must be robust to changes in its environment
- i.e. changes in lighting; angle or position of the camera;
etc
Repeatability
- Repeatability is a measure
of robustness
- Repeatability means that the system must work the same over and
over, regardless of environmental changes
Invariance
- Invariance to environmental factors helps achieve robustness and repeatability
- Hardware and software can be designed to be invariant to certain
environmental changes
- e.g. you could design an algorithm to be invariant to
illumination changes…
Constraints
- Constraints are what you apply to the
hardware, software and wetware (human brains in the system) to make sure your computer vision system
works in a repeatable, robust fashion.
- e.g. you constrain the system by putting it in a box so there
can’t be any illumination changes
Constraints in Industrial Vision
Software Constraints
- Really simple, but incredibly
fast algorithms
- Hough Transform is popular, but note that it isn’t all
that robust without physical constraints
- Actually, the same is true of most algorithms/techniques used in
industrial vision
- Intelligent use of colour…
Colour-Spaces (non examinable?)
Even though there is no hand with bow for these slides, they seem
useful.
- There are many different ways of numerically representing colour
- A single representation of all possible colours is called a
colour-space
- It’s generally possible to convert from one colour-space to another by applying a mapping (in the form of a set of equations or an algorithm)
RGB Colour-space
- Most Physical image sensors capture RGB
- By far the most widely known space
- RGB “couples” brightness (luminance) with each channel, meaning that illumination invariance is difficult to achieve
HSV Colour-space
- Hue, Saturation, Value is another colour-space
- Hue encodes the pure colour as an angle
- Saturation is how vibrant the colour is
- And the Value encodes brightness
- A simple way of achieving invariance to lighting is to use just the H or H & S components
LAB Colour-space
Lab color space is more perceptually linear than other color spaces. Perceptually linear
means that a change of the same amount in a color value should produce a change of about the same visual
importance. It is important especially when you try to measure the perceptual difference of two colors.
(Source)
- Can also be made invariant to lighting by only using
the AB components.
Physical Constraints
- Industrial vision is usually solved by applying simple computer
vision algorithms, and lots of physical constraints:
- Environment: lighting, enclosure, mounting
- Acquisition hardware: expensive camera, optics, filters
Vision in the wild
- So, what about vision systems in the wild, like ANPR (Automatic
Number-Plate Recognition) cameras, or recognition apps for mobile phones?
- Apply as many hardware and wetware constraints as possible, and
let the software take up the slack
- Colour information often less important than luminance
Lecture 2: Machine learning for pattern recognition
Feature Spaces
Many computer vision applications involving machine learning take the following
form

- This is where cool image processing happens
- Feature extractors make feature vectors from images
- Machine learning system uses featurevectors to make intelligent decisions
Key terminology
- Featurevector: a mathematical vector
- Just a list of (usually Real) numbers
- Has a fixed number of elements in it
- The number of elements is the dimensionality
of the vector
- Represents a point in a featurespace or equally a direction in the featurespace
- The dimensionality of a featurespace is the dimensionality of every vector within it
- Vectors of differing dimensionality can’t exist in the
same featurespace
Density and Similarity
Distance in featurespace
- Feature extractors are often defined so that they produce vectors that are
close together for similar inputs
- Closeness of two vectors can be computed in the feature space by
measuring the distance between the vectors.

Cats are a close distance apart, and are further away from the cluster of dogs in
this feature space.
Euclidean distance (L2 distance)
- L2 distance is the most intuitive distance…
- The straight-line distance between two points
- Computed via an extension of Pythagoras theorem to n dimensions:

Equation: d(p, q) = √( Σi (pi − qi)² )

The straight-line; “Euclidean distance”
Manhattan/Taxicab distance (L1 distance)
- L1 distance is computed along paths parallel to the axes of the
space:

Equation: d(p, q) = Σi |pi − qi|

Essentially just like taking a taxi cab around the grid-like streets of Manhattan.
Cosine Similarity
- Cosine similarity measures the cosine of the angle between two
vectors
- Useful if you don’t care about the relative length of the
vectors

Equation: cos(θ) = (p · q) / (‖p‖ ‖q‖)

The angle between the two vectors, measured from the origin. (A sketch comparing these three measures is given below.)
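The sketch mentioned above, computing all three measures on a pair of toy feature vectors (my own example values):

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))            # L2: straight-line distance
manhattan = np.sum(np.abs(p - q))                    # L1: distance along the axes
cosine_sim = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))  # angle only, ignores length

print(euclidean, manhattan, cosine_sim)   # q is p scaled by 2, so cosine similarity is 1.0
```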
Choosing good featurevector representations for machine learning
- Choose features which allow you to distinguish the objects or classes of interest
- Similar within classes
- Different between classes
- Keep the number of features small
- Machine learning can get more difficult as the dimensionality of the featurespace gets large
Supervised Machine Learning: Classification
- Classification is the process of assigning a
class label to an object (typically represented by a
vector in a feature space).
- A supervised machine-learning algorithm
uses a set of pre-labelled training data to
learn how to assign class labels to vectors (and the corresponding objects).
- A binary classifier
only has two classes
- A multiclass classifier has
many classes
Linear Classifiers
Linear classifiers try to learn a hyperplane that
separates two classes in featurespace with minimum error

There can be lots of hyperplanes to choose from; differing classification algorithms
apply differing constraints when learning the classifier.
To classify a new image, you just need to check what side of the hyperplane it is
on.
Non-linear binary classifiers
Linear classifiers work best when the data is linearly separable.
Like this:

But what if the data is like this:

There is no hope for a linear classifier! 😭
Non-linear binary classifiers, such as Kernel Support Vector Machines, learn non-linear decision boundaries (basically, a curved boundary separates the data instead of a straight one).

However, you have to be careful, you might lose generality by overfitting:

What class would the blue question mark actually belong to?
Multiclass classifiers: KNN
Assign class of unknown point based on majority class of closest K
neighbours in featurespace.

KNN Problems (non examinable?)
- Computationally expensive if there are:
- Lots of training examples
- Many dimensions
Unsupervised Machine Learning: Clustering
- Clustering aims to group data without any prior knowledge of what
the groups should look like or contain.
- In terms of featurevectors, items with
similar vectors should be grouped together by a clustering operation.
- Some clustering operations create overlapping groups; for now we’re only
interested in disjoint clustering methods that assign an item to a single group.
K-Means Clustering
StatQuest:
K-means clustering (good video to explain this!!!!!!!!!!!!!!!!)
- K-Means is a classic featurespace clustering algorithm for grouping data in
K groups with each group represented by a centroid:
- Pseudo code:
- The value of K is chosen
- K initial cluster centres are chosen
- The following process is performed iteratively until the
centroids don’t move between iterations:
- Each point is assigned to its closest centroid
- Each centroid is recomputed as the mean of all the points assigned to it. If a centroid has no points assigned, it is randomly re-initialised to a new point.
- The final clusters are created by assigning all points to their nearest centroid. (A sketch of this algorithm is given below.)
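The sketch mentioned above: a minimal NumPy implementation of the pseudo-code (my own; the random initialisation from the data points is an assumption, real implementations often use k-means++):

```python
import numpy as np

def kmeans(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]   # K initial centres
    for _ in range(iterations):
        # assign each point to its closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            points[assignment == j].mean(axis=0) if np.any(assignment == j)
            else points[rng.integers(len(points))]          # re-initialise empty clusters
            for j in range(k)])
        if np.allclose(new_centroids, centroids):           # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, assignment

points = np.vstack([np.random.randn(50, 2) + [5, 5], np.random.randn(50, 2) - [5, 5]])
centroids, assignment = kmeans(points, k=2)
print(centroids.round(1))    # roughly (5, 5) and (-5, -5)
```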
Lecture 3: Covariance and Principal Components
Random Variables and Expected Values
- Variable that takes on different values due to chance
- The expected value (denoted E[X]) is the average value the random variable takes in the long run.
Variance
- Variance (σ²) is the mean squared difference from the mean (μ).
- It’s a measure of how spread-out the data is.

Equation
Technically it’s E[(X − E[X])²]
Covariance
- Covariance (σ(x,y)) measures how two variables change
together
- The variance is the covariance when the two variables are the same (σ(x,x) = σ²(x))
- A covariance of 0 means the variables are uncorrelated
(Covariance is related to Correlation though…)

Equation
Technically it’s E[ (x - E[x]) (y - E[y]) ]
Covariance Matrix
- A covariance matrix encodes how all possible pairs of dimensions in
an n-dimensional dataset vary together.

The covariance matrix is a square symmetric matrix, as you
can see by the symmetry in the example above.
Mean Centring
- Mean Centring is the process of computing the mean (across each
dimension independently) of a set of vectors, and then subtracting the mean vector from every
vector in the set.
- All the vectors will be translated so their average positions is the
origin.

From top image to bottom image; mean centered around the origin.
Covariance matrix again

So then this means that

The covariance matrix is directly proportional to Z transposed, multiplied by Z,
where Z is the matrix formed by the mean-centred vectors (each row of matrix Z is one mean-centred vector) -
Hope this helps :) - Lorena
Principal axes of variation
Basis
- A basis is a set of n linearly independent (remember all the way back to first year
Foundations guys!) vectors in an n dimensional space
- The vectors are orthogonal (all right-angles to each other)
- They form a “coordinate system”
- There are an infinite number of possible bases

The first principal axis
- For a given set of n dimensional
data, the first principal axis (or just principal axis) is the vector that describes the direction of
greatest variance.
- (Big turquoise arrow pointing
up and to the right in the image below)
The second principal axis
- The second principal axis is a vector
in the direction of the greatest variance orthogonal (perpendicular) to the first major
axis.
- (Small lilac arrow pointing up
to the left in the image below)

The third principal axis
- In a space with 3 or more dimensions, the third principal axis is
the direction of greatest variance orthogonal to both the first and second principal axes.
- The fourth… and so on…
- The set of n principal axes of an n dimensional space are a basis.
Eigenvectors and Eigenvalues
Important Equation

- The equation is A υ = λ υ, where:
- A = an n × n square matrix
- υ = an n-dimensional vector, known as an eigenvector
- λ = a scalar value, known as an eigenvalue
Properties
- There are at most n eigenvector-eigenvalue pairs.
- If A is symmetric, then the set of eigenvectors is orthogonal
- If A is a covariance matrix, then the eigenvectors are the principal axes
- The eigenvalues are proportional to the variance of the data along
each eigenvector
- The eigenvector corresponding to the largest
eigenvalue is the first principal component.
Finding Values
- For small matrices (n≤4) there are algebraic solutions to finding all the eigenvector-eigenvalues pairs
- For larger matrices, numerical solutions to the Eigendecomposition must be sought.
Eigendecomposition
Eigendecomposition takes a matrix A and represents it in terms of its eigenvalues and eigenvectors: A = QΛQ⁻¹

- Columns of Q are the
eigenvectors
- Diagonal eigenvalue matrix (Λii = λi)
- If A is real symmetric (i.e. a covariance matrix), then Q⁻¹ = Qᵀ (i.e. the eigenvectors are orthogonal), so: A = QΛQᵀ

Summary
The Eigendecomposition of a covariance matrix A:

Gives you the principal axes and their relative magnitudes.
Ordering
- Standard Eigendecomposition implementations will order the
eigenvectors (columns of Q) such that the
eigenvalues (in the diagonal of Λ) are sorted
in order of decreasing value.
- Some solvers are optimised to only find the top k eigenvalues and corresponding eigenvectors, rather than
all of them.
Principal Component Analysis
Linear Transform
- A linear transform W projects data from one space into another: T = ZW
- The original data is stored in the rows of Z
- T can have fewer dimensions (columns) than Z

Linear Transforms
- The effects of a linear transform can be reversed if W is invertible:
- A lossy process if the dimensionality of the spaces is
different

PCA
- PCA is an Orthogonal Linear Transform that maps data from its original space to a space defined by the principal axes of the
data.
- The transform matrix W is just the
eigenvector matrix Q from the Eigendecomposition of
the covariance matrix of the data.
- Dimensionality reduction can be achieved by removing the eigenvectors with
low eigenvalues from Q (i.e. keeping the first L columns of Q assuming the
eigenvectors are sorted by decreasing eigenvalue).
PCA Algorithm
- Mean-centre the data vectors
- Form the vectors into a matrix Z, such that each row corresponds to a vector
- Perform the Eigendecomposition of the matrix ZTZ, to recover the eigen matrix Q and diagonal eigenvalue matrix Λ: ZTZ = QΛQT
- Sort the columns of Q and corresponding diagonal values of Λ
so that the eigenvalues are decreasing
- Select the L largest eigenvectors of Q (the first L columns) to create the transform matrix
QL
- Project the original vectors into a lower dimensional space,
TL: TL = ZQL
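The sketch mentioned above, using NumPy's symmetric eigendecomposition np.linalg.eigh (my own code; using eigh rather than an SVD is just an implementation choice):

```python
import numpy as np

def pca(data, L):
    """Project `data` (one row per vector) onto its L principal axes."""
    mean = data.mean(axis=0)
    Z = data - mean                              # mean-centre the vectors
    eigvals, Q = np.linalg.eigh(Z.T @ Z)         # eigendecomposition of Z^T Z (symmetric)
    order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
    Q_L = Q[:, order[:L]]                        # keep the L largest eigenvectors
    return Z @ Q_L, Q_L, mean                    # T_L = Z Q_L

data = np.random.randn(200, 5) @ np.random.randn(5, 5)   # correlated 5-D data
T, Q_L, mean = pca(data, L=2)
print(T.shape)    # (200, 2): each vector reduced to 2 dimensions
```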
Eigenfaces

Spooky 0_0
Making Invariant
- Require (almost) the same object pose across images (i.e. full
frontal faces)
- Align (rotate, scale and translate) the images so that a common
feature is in the same place (i.e. the eyes in a set of face images)
- Make all the aligned images the same size
- (optional) Normalise (or perhaps histogram equalise) the images so
they are invariant to global intensity changes
Problems
- If the images are 100x200 pixels, the vector has 20000
dimensions
- That’s not really practical…
- Also, the vectors are still highly susceptible to imaging noise
and variations due to slight misalignments.
Potential Solution… Apply PCA
- PCA can be used to reduce the dimensionality
- Smaller number of dimensions allows greater robustness to noise
and mis-alignment
- There are fewer degrees of freedom, so noise/misalignment has
much less effect
- And the dominant features are captured
- Fewer dimensions makes applying machine-learning much more
tractable
Lecture 4: Types of image feature and segmentation
Image Feature Morphology
- There are 4 main ways of extracting features from an image:
- Global
- Grid/Block-based
- Region based
- Local
Global Features
- A Global Feature is extracted
from the contents of an entire image.

Grid or Block-based Features
- Multiple features are extracted; one per block

Region-based Features
- Multiple features are extracted; one per region

Local Features
- Multiple features are extracted; one per local interest point

Global Features
Image Histograms
- Simple global features can be computed from the averages of the colour bands of the image.
- This wasn’t particularly robust, and couldn’t deal well with multiple colours in the image.
- A more common approach to computing a global image description is to
compute a histogram of the pixel values.
Joint-colour histogram
- A joint colour histogram measures the number of times each colour
appears in an image.
- These are different to the histograms in image editing programs, which compute separate histograms for each channel.
- The colour space is quantised into bins, and we accumulate the number of pixels in each bin.
- Technically, it’s a multidimensional histogram, but we
flatten it (unwrap) to make it a feature vector.
- Normalisation (i.e. by the number of pixels) allows the histogram
to be invariant to image size.
- Choice of colour-space can make it invariant to uniform lighting
changes (e.g. H-S histogram)
- Invariant to rotation
- But vastly different images can have the same histogram! Like these
two below:


- Cool, right? Even though it’s, uh, bad; not invariant
to this kinda problem lol
Image Segmentation
What is segmentation?
- The first part in the process of creating region-based
descriptions…
- The process of partitioning the image into sets of pixels often called segments.
- Pixels within a segment typically share certain visual
characteristics.
Global Binary Thresholding
- Thresholding is the simplest form of segmentation
- Turns grey level images into binary (2 segments) by assigning
all pixels with a value less than a predetermined threshold to one segment, and all other pixels to the
other.
- Really fast
- Requires a manually set static threshold
- Not robust to lighting changes
- Can work well in applications with lots of physical constraints (lighting control and/or high-contrast objects)
Otsu’s thresholding method
- Otsu’s method (named after Nobuyuki Otsu) provides a way to
automatically find the threshold.
- Assume there are two classes (i.e. foreground &
background)
- The histogram must have two peaks
- Exhaustively search for the threshold that maximises interclass
variance.

There's a more detailed look over at Wikipedia! (A sketch of the exhaustive threshold search is given below.)
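The sketch mentioned above (my own): for every candidate threshold, compute the interclass (between-class) variance and keep the threshold that maximises it:

```python
import numpy as np

def otsu_threshold(image, levels=256):
    hist = np.bincount(image.ravel(), minlength=levels).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, levels):
        w0, w1 = prob[:t].sum(), prob[t:].sum()          # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0       # class means
        mu1 = (np.arange(t, levels) * prob[t:]).sum() / w1
        between = w0 * w1 * (mu0 - mu1) ** 2             # interclass variance
        if between > best_var:
            best_t, best_var = t, between
    return best_t

# Bimodal test image: a dark background and a bright square
image = np.full((64, 64), 40, dtype=np.uint8)
image[16:48, 16:48] = 200
print(otsu_threshold(image))   # a threshold between the two modes
```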
Adaptive / local thresholding
- Local (or Adaptive) thresholding operators compute a different
threshold value for every pixel in an image based on the surrounding pixels.
- Usually a square or rectangular window around the current pixel
is used to define the neighbours
Mean adaptive thresholding
Set the current pixel to 0 if its value is less than the mean of its neighbours plus a constant value; otherwise set to 1.
- Size of window
- Constant offset value
- Good invariance to uneven lighting / contrast
- But…
- Computationally expensive (at least compared to global
methods)
- Can be difficult to choose the window size
- If the object being imaged can appear at different distances from the camera then it could break…
Segmentation with K-Means
- K-Means clustering also provides a simple method for performing
segmentation:
- Cluster the colour vectors (i.e. [r, g, b]) of all the pixels, and then assign each pixel to a segment based
on the closest cluster centroid.
- Works best if the colour-space and distance function are
compatible
- E.g. Lab colour-space is designed so that Euclidean distances
are proportional to perceptual colour differences
- Naïve approach to segmentation using k-means doesn’t
attempt to preserve continuity of segments
- Might end up with single pixels assigned to a segment, far away
from other pixels in that segment.
- Can also encode spatial position in the vectors being clustered:
[r, g, b, x, y]
- Normalise x and y by the width and height of the image to take
away the effect of different images sizes
- Scale x and y so they have more or less effect than the colour
components
Advanced segmentation techniques
- Lots of ongoing research into better segmentation techniques:
- Techniques that can automatically determine the number of
segments
- “Semantic segmentation” techniques that try to
create segments that fit the objects in the scene based on training examples
Connected Components
Pixel Connectivity
- A pixel is said to be connected with another if they are spatially
adjacent to each other.
- Two standard ways of defining this adjacency:
- 4-connectivity
- 8-connectivity (like minesweeper!)

Connected Component
A connected component is a set of pixels in which every pixel is connected either
directly or through any connected path of pixels from the set.

Connected Component Labelling
- Connected Component Labelling is the process of finding all the
connected components within a binary (segmented) image.
- Each connected segment is identified as a connected
component.
- Lots of different algorithms to perform connected component
labelling
- Different performance tradeoffs (memory verses time)
The two-pass algorithm
- On the first pass:
- Iterate through each element of the data by column, then by
row (Raster Scanning)
- If the element is not the background
- Get the neighbouring elements of the current element
- If there are no neighbours, uniquely label the current
element and continue
- Otherwise, find the neighbour with the smallest label and
assign it to the current element
- Store the equivalence between neighbouring labels
- On the second pass:
- Iterate through each element of the data by column, then by
row
- If the element is not the background
- Relabel the element with the lowest equivalent label (a sketch of this algorithm is given below)
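The sketch mentioned above (my own): the two-pass algorithm with 4-connectivity, using a small union-find structure to store the label equivalences:

```python
import numpy as np

def two_pass_label(binary):
    parent = {}                                   # union-find for label equivalences
    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    labels = np.zeros(binary.shape, dtype=int)
    next_label = 1
    h, w = binary.shape
    for y in range(h):                            # first pass: provisional labels
        for x in range(w):
            if not binary[y, x]:
                continue
            neighbours = [labels[y - 1, x] if y else 0, labels[y, x - 1] if x else 0]
            neighbours = [n for n in neighbours if n]
            if not neighbours:                    # no labelled neighbours: new label
                labels[y, x] = next_label
                parent[next_label] = next_label
                next_label += 1
            else:                                 # take the smallest neighbouring label
                smallest = min(neighbours)
                labels[y, x] = smallest
                for n in neighbours:              # record equivalences
                    parent[find(n)] = find(smallest)
    for y in range(h):                            # second pass: resolve equivalences
        for x in range(w):
            if labels[y, x]:
                labels[y, x] = find(labels[y, x])
    return labels

binary = np.array([[1, 1, 0, 1],
                   [0, 1, 0, 1],
                   [0, 0, 0, 1]])
print(two_pass_label(binary))    # two connected components
```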
Lecture 5: Shape description and modelling
Extracting features from shapes represented by connected
components
Borders
- There are 2 types of pixel borders:
inner and outer.
Say you have this pixel shape:

Inner Border

A border made up of only pixels from the shape; the outermost pixels within the
shape.
Outer Border

A border where the outline of pixels outside the shape make up the border.
Two ways to describe shape
- Region Description
- Boundary Description

Region Description: Simple Scalar Shape Features
Area and Perimeter

Compactness
- Compactness measures how tightly packed the pixels in the component
are.
- It’s often computed as the weighted ratio of area to perimeter
squared:


Centre of Mass

Irregularity / Dispersion
A measure of how “spread-out” the shape is

Moments
Standard Moments
- Moments describe the distribution of pixels in a shape.
- Moments can be computed for any grey-level image. For the
purposes of describing shape, we’ll just focus on moments of a connected component
- The standard two-dimensional Cartesian moment of an image, with order p and q and pixel intensity I(x, y), is defined as: mpq = Σx Σy x^p y^q I(x, y)

- In the case of a connected component (intensity 1 inside the component, 0 outside), this simplifies to: mpq = Σ(x,y) in component x^p y^q

- The zero-order moment m00 of a connected component is just the area of the component. The centre of mass (centroid) is: x̄ = m10 / m00, ȳ = m01 / m00
Central Moments
- Standard 2D moments can be used as shape descriptors
- But, they’re not invariant to translation, rotation and
scaling
- Central Moments are computed about the centroid of the shape, and are thus translation invariant: μpq = Σx Σy (x − x̄)^p (y − ȳ)^q I(x, y)

- Note: μ01 and
μ10 are always 0, so have no descriptive
power
Normalised Central Moments
- Normalised Central Moments are both scale and translation invariant: ηpq = μpq / μ00^((p+q)/2 + 1) (a sketch computing these moments is given below)
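The sketch mentioned above, computing standard, central and normalised central moments of a connected component given as a binary mask (my own code):

```python
import numpy as np

def moment(mask, p, q):
    ys, xs = np.nonzero(mask)
    return np.sum((xs ** p) * (ys ** q))                  # m_pq for a binary component

def central_moment(mask, p, q):
    ys, xs = np.nonzero(mask)
    x_bar, y_bar = xs.mean(), ys.mean()                   # centroid = (m10/m00, m01/m00)
    return np.sum(((xs - x_bar) ** p) * ((ys - y_bar) ** q))

def normalised_central_moment(mask, p, q):
    gamma = (p + q) / 2 + 1
    return central_moment(mask, p, q) / (moment(mask, 0, 0) ** gamma)

mask = np.zeros((20, 20), dtype=bool)
mask[5:15, 8:12] = True                                   # a 10x4 rectangle
print(moment(mask, 0, 0))                                 # area = 40
print(central_moment(mask, 1, 0), central_moment(mask, 0, 1))  # both ~0
print(normalised_central_moment(mask, 2, 0))              # scale-invariant spread in x
```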

Boundary Description
Chain Codes
- Simple way of encoding a boundary.
- Walk around the boundary and encode the direction you take on
each step as a number.
- Some direction examples are shown below left.
- Then cyclically shift the code so it forms the smallest possible
integer value (making it invariant to the starting point)


Chain Code Invariance
- Can be made rotation invariant:
- Encode the differences in direction rather than absolute
values.
- Can be made scale invariant:
- Resample the component to a fixed size
- Doesn’t work well in practice
Chain Code Advantages and Limitations
- Can be used for computing perimeter, area, moments, etc.
- The perimeter for an 8-connected chain code is N(even numbers in code) + √2 × N(odd numbers in code)
- Practically speaking, not so good for shape matching
- Problems with noise, resampling effects, etc
- Difficult to find good similarity/distance measures
Fourier Descriptors
- The Fourier transform can be used to encode shape information by
decomposing the boundary into a (small) set of frequency components.
- There are two main steps to consider:
- Defining a representation of a curve (the boundary)
- Expanding the representation using Fourier theory
- By choosing these steps carefully it is possible to create
rotation, translation and scale invariant boundary descriptions that can be used for recognition,
etc.
Region Adjacency Graphs
- Build a graph from a set of connected components
- Each node corresponds to a component
- Nodes connected if they share a border
- Can easily detect patterns in the graph
- E.g. “a node with one child with four
children”
- Invariant to non-linear distortions, but not to occlusion.

There’s some more stuff in the slides which isn’t hand-with-bow-ed, go have a look:
Slides 40-47: http://comp3204.ecs.soton.ac.uk/lectures/pdf/L5-shapedescription.pdf
Active Shape Models and Constrained Local Models
- ASMs/CLMs extend a PDM (Point Distribution
Model) by also learning local appearance around each point
- Typically just as an image template.
- Using a constrained optimisation algorithm, the shape can be
optimally fitted to an image
- Plausible shape
- Good template matching
Lecture 6: Local interest points
What makes a good interest point?
- Invariance to brightness change (local changes as well as global
ones)
- Sufficient texture variation in the local neighbourhood
- Invariance to changes between the angle / position of the scene to
the camera

How to find interest points
- There are lots of different types of interest point types to choose
from
- We’ll focus on two specific types and look in detail at
common detection algorithms:
- Corner detection - Harris and Stephens

- Blob Detection - Difference-of-Gaussian
Extrema

The Harris and Stephens corner detector
Basic Idea
- Search for corners by looking through a small window
- Shifting that window by a small amount in any direction
should give a large change in
intensity


Harris & Stephens: Mathematics
Weighted average change in intensity between a window and a shifted version [by
(Δx,Δy)] of that window:

- The Taylor expansion allows us to approximate the shifted
intensity.
- Taking the first order terms we get this:

- Substituting and simplifying gives:

- Bruh.
Structure Tensor
- The square symmetric matrix
M is called the Structure
Tensor or the Second Moment matrix

- It concisely encodes how the local intensity function of the window changes with small shifts
Eigenvalues of the Structure Tensor
- Think back to covariance matrices…
- As with the 2D covariance matrix, the structure tensor describes an ellipse: xᵀMx = c (this is a quadratic form)
- The eigenvalues and vectors tell us the rates of change and
their respective directions

Harris & Stephens Response Function
- Rather than compute the eigenvalues directly, Harris and Stephens defined a corner response function in terms of the determinant and trace of M: R = det(M) − κ·trace(M)²
- Smart.

Harris & Stephens Detector
- Take all points with the response value above a threshold
- Keep only the points that are local maxima (i.e. where the
current response is bigger than the 8 neighbouring pixels)
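A rough Java sketch of the whole detector (my own, simplified: it uses a plain 3x3 window instead of the usual Gaussian weighting, and assumes the image derivatives ix/iy have already been computed, e.g. with Sobel):

// Harris & Stephens: R = det(M) - k*trace(M)^2, then threshold + local maxima.
// k is typically 0.04-0.06. Border pixels are ignored for simplicity.
static boolean[][] harrisCorners(float[][] ix, float[][] iy, float k, float threshold) {
    int h = ix.length, w = ix[0].length;
    float[][] r = new float[h][w];
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            float sxx = 0, sxy = 0, syy = 0; // entries of the structure tensor M
            for (int j = -1; j <= 1; j++) {
                for (int i = -1; i <= 1; i++) {
                    float gx = ix[y + j][x + i], gy = iy[y + j][x + i];
                    sxx += gx * gx;
                    sxy += gx * gy;
                    syy += gy * gy;
                }
            }
            float det = sxx * syy - sxy * sxy;
            float trace = sxx + syy;
            r[y][x] = det - k * trace * trace; // corner response
        }
    }
    boolean[][] corners = new boolean[h][w];
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            if (r[y][x] <= threshold) continue; // must be above the threshold
            boolean isMax = true;               // ...and bigger than the 8 neighbours
            for (int j = -1; j <= 1 && isMax; j++)
                for (int i = -1; i <= 1; i++)
                    if (!(i == 0 && j == 0) && r[y + j][x + i] >= r[y][x]) {
                        isMax = false;
                        break;
                    }
            corners[y][x] = isMax;
        }
    }
    return corners;
}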
Scale in Computer Vision
The problem of scale
- As an object moves closer to the camera it gets larger with more
detail… as it moves further away it gets smaller and loses detail…
- If you’re using a technique that uses a fixed size processing
window (e.g. Harris corners, or indeed anything that involves a fixed kernel) then this is a bit of a
problem!
Scale space theory
- Scale space theory is a formal framework for handling the scale
problem.
- Represents the image by a series of increasingly smoothed / blurred images
parameterised by a scale parameter t.
- t represents the amount of
smoothing.
- Key notion: Image structures smaller than
√t have been smoothed away at scale t.
The Gaussian Scale Space
- Many types of scale space are possible (depending on the
smoothing function), but only the Gaussian function has the desired properties for image
representation.
- These provable properties are called the “scale space axioms”.
Formally, Gaussian scale space is defined as:

Where t ≥ 0 and,

Normally, only a fixed set of values of t are used - it’s common to use integer powers of 2 or √2

Nyquist-Shannon Sampling theorem
If a function x(t) contains no frequencies higher than B hertz, it is completely
determined by giving its ordinates at a series of points spaced 1/(2B) seconds apart.
...so, if you filter the signal with a low-pass filter that
halves the frequency content, you can also halve the sampling rate without loss of information…
Gaussian Pyramid
- Every time you double t in
scale space, you can halve the image size without loss of information!
- Leads to a much more efficient representation
- Faster processing
- Less memory
Multi-scale Harris & Stephens
- Extending the Harris and Stephens detector to work across scales is
easy…
- We define a Gaussian scale space with a fixed set of scales and
compute the corner response function at every pixel of each scale and keep only those with a response
above a certain threshold.
Blob Detection Finally
Laplacian of Gaussian
- Recall that the LoG is the Laplacian (second derivative) of a Gaussian
- Used in the Marr-Hildreth edge detector
- Zero crossing of LoG convolution

- By finding local minima or maxima, you get a blob detector!
Scale space LoG
- Normalised scale space LoG defined as:

- By finding extrema of this function in scale space, you can find blobs at their representative scale (~√(2t))
- Just need to look at the neighbouring pixels!
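A sketch of that extrema search (mine; it assumes the normalised LoG (or DoG, see below) responses are stored in a stack responses[scale][y][x], all at the same resolution, and uses java.util):

// A blob is a point that is bigger (or smaller) than all 26 neighbours in
// x, y and scale. Returns {x, y, scaleIndex} triples.
static List<int[]> scaleSpaceExtrema(float[][][] responses) {
    List<int[]> blobs = new ArrayList<>();
    for (int s = 1; s < responses.length - 1; s++) {
        int h = responses[s].length, w = responses[s][0].length;
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                float v = responses[s][y][x];
                boolean isMax = true, isMin = true;
                for (int ds = -1; ds <= 1; ds++)
                    for (int dy = -1; dy <= 1; dy++)
                        for (int dx = -1; dx <= 1; dx++) {
                            if (ds == 0 && dy == 0 && dx == 0) continue;
                            float n = responses[s + ds][y + dy][x + dx];
                            if (n >= v) isMax = false;
                            if (n <= v) isMin = false;
                        }
                if (isMax || isMin) blobs.add(new int[] { x, y, s });
            }
        }
    }
    return blobs;
}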

Very useful property: if a blob is detected at (x0, y0; t0) in an image, then
under a scaling of that image by a factor s, the same blob would be detected at (sx0, sy0; s²t0) in the scaled image.
Scale space DoG
- In practice it’s computationally expensive to build a LoG
scale space.
- But, the following approximation can be made:

- This is called a Difference-of-Gaussians (DoG)
- Implies that the LoG scale space can be built from subtracting
adjacent scales of a Gaussian scale space
DoG Pyramid
- Of course, for efficiency you can also build a DoG pyramid
- An oversampled pyramid
as there are multiple images between a doubling of scale.
- Images between a doubling of scale are an octave.

Lecture 7: Local features and matching
Local features and matching basics
Local Features

Multiple features are extracted; one per local interest point
Why extract local features?
- Feature points are used for:
- Image alignment
- Camera pose estimate & Camera calibration
- 3D reconstruction
- Motion tracking
- Object recognition
- Indexing and database retrieval
- Robot navigation
- …
Example: Building a panorama
- You need to match and align the images
- Detect feature points in both images
- Find corresponding pairs
- Use the pairs to align the images
Problem 1:
- Detect the same points independently
in both images
- We need a repeatable detector
Problem 2:
- For each point correctly recognise the corresponding one
- We need an invariant, robust
and distinctive descriptor
Two distinct types of matching problem
- In stereo vision (for 3D reconstruction) there are two important
concepts related to matching:
- Narrow-baseline stereo
- Wide-baseline stereo
Narrow-baseline stereo
- This is where the two images are very similar - the local features
have only moved by a few pixels.
- Typically the images are from similar points in time

(Notice how the red-circled background object only appears in the second image; so
these two images have a slight difference to them)
Wide-baseline stereo
- This is where the difference in views is much bigger.

(I don’t think I need to circle the whole image lol)
Two distinct types of matching problem
- These concepts extend to general matching:
- The techniques for narrow-baseline stereo are applicable to
tracking where the object doesn’t move too much between frames
- The techniques for wide-baseline stereo are applicable to
generic matching tasks (object recognition, panoramas, etc.).
Robust local description
Descriptor Requirements
- For narrow-baseline matching:
- Robustness to rotation and lighting is not so important.
- Descriptiveness can be reduced as the search is over a smaller
area.
- For wider baselines the descriptor:
- Needs to be robust to intensity change and invariant to
rotation.
- Needs to be highly descriptive to avoid mismatches (but not so
distinctive that you can’t find any matches!)
- Needs to be robust to small localisation errors of the interest point
- The descriptor should not change too much if we move it by a few
pixels, but should change more rapidly once we move further away.
Matching by correlation (template matching)
(Narrow baseline) template matching

- Interest points in two images with a slight change in
position

- Local search windows, based on the interest point in the first
image

- The template can then be matched against target interest points in
the second image
Problems with wider baselines
- Not robust to rotation
- Sensitive to localisation of interest point
- (although not such a problem with a small search window)
- With wider baselines you can’t assume a search area
- Need to consider all the interest points in the second
image
- More likely to mismatch :(
Local Intensity Histograms
Use local histograms instead of pixel patches

- Describe the region around each interest point with a pixel
histogram
- Match each interest point in the first image to the most similar
point in the second image (i.e. in terms of Euclidean distance [or other measure] between the
histograms)
Local histograms
- Not necessarily very distinctive
- Many interest points likely to have similar distribution of
grey-values
- Not rotation invariant if the sampling window is square or
rectangular
- Can be overcome using a circular window
- Not invariant to illumination changes
- Sensitive to interest point localisation
Overcoming localisation sensitivity
- Want to allow the window around the interest
point to move a few pixels in any direction without changing the
descriptor
- Apply a weighting so that pixels near the edge of the sampling
patch have less of an effect, and those near the interest point have a greater effect
- Common to use Gaussian weighting centred on the interest point for this.
Overcoming lack of illumination invariance
- Illumination invariance potentially achievable by normalising or
equalising the pixel patches before constructing the histogram
- ...but there is another alternative!................. I’m
not going to tell you……. Joking…….
Local Gradient Histograms
Gradient Magnitudes and Directions
- From the partial derivatives of an image (e.g. from applying
convolution with Sobel), it is easy to compute the gradient orientations / directions and
magnitudes

Gradient Histograms
- Instead of building histograms of the raw pixel values we could
instead build histograms that encode the gradient magnitude and direction for each pixel in the sampling
patch.
- Gradient magnitudes (and directions) are invariant
to brightness change!
- The gradient magnitude and direction histogram is also more distinctive.
Building gradient histograms
- Quantise the directions (0°-360°) into a number of
bins
- For each pixel in the sampling patch, accumulate the gradient
magnitude of that pixel in the respective orientation bin
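A minimal Java sketch of this (mine; dx/dy are the patch’s partial derivatives, e.g. from Sobel, and the Gaussian weighting of the magnitudes is left out for brevity):

// Accumulate gradient magnitude into orientation bins covering 0-360 degrees.
static float[] gradientHistogram(float[][] dx, float[][] dy, int numBins) {
    float[] histogram = new float[numBins];
    for (int y = 0; y < dx.length; y++) {
        for (int x = 0; x < dx[0].length; x++) {
            double magnitude = Math.sqrt(dx[y][x] * dx[y][x] + dy[y][x] * dy[y][x]);
            double direction = Math.atan2(dy[y][x], dx[y][x]);         // -pi..pi
            double normalised = (direction + Math.PI) / (2 * Math.PI); // 0..1
            int bin = Math.min((int) (normalised * numBins), numBins - 1);
            histogram[bin] += magnitude;
        }
    }
    return histogram;
}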

Rotation Invariance
- Gradient histograms are not naturally rotation invariant
- But, can be made invariant by finding the dominant orientation
and cyclically shifting the histogram so the dominant orientation is in the first bin.

The SIFT feature
Adding spatial awareness
- The SIFT (Scale Invariant Feature Transform) feature is widely
used
- Builds on the idea of a local gradient histogram by incorporating
spatial binning, which in essence creates multiple gradient
histograms about the interest point and appends them all together into a longer feature.
- Standard SIFT geometry appends a spatial 4x4 grid of histograms
with 8 orientations
- Leading to a 128-dimensional feature which is highly discriminative and robust!
SIFT Construction: sampling

SIFT Construction: weighting

SIFT Construction: binning

Matching SIFT features
Euclidean Matching
- Simplest way to match SIFT features is to take each feature in turn
from the first image and find the most similar in the second image
- Threshold can be used to reject poor matches
- Unfortunately, doesn't work that well and results in lots of
mismatches.
Improving matching performance
- A better solution is to take each feature from the first image, and
find the two closest features in the second image.
- Only form a match if the ratio of distances between the closest
and second closest matches is less than a threshold - see the sketch below.
- Typically set at 0.8, meaning that the distance to the closest
feature must be no more than 80% of the distance to the second closest.
- This leads to a much more robust matching strategy.
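A Java sketch of that matching strategy (mine; descriptors are float[128] arrays, distances are Euclidean, and it uses java.util - a brute-force search like this is exactly what the later lectures try to speed up):

// Ratio-test matching: accept a match only if the nearest neighbour is much
// closer than the second nearest (distance ratio < ratio, typically 0.8).
static List<int[]> matchDescriptors(float[][] first, float[][] second, double ratio) {
    List<int[]> matches = new ArrayList<>(); // pairs {indexInFirst, indexInSecond}
    for (int i = 0; i < first.length; i++) {
        int best = -1;
        double bestDist = Double.MAX_VALUE, secondDist = Double.MAX_VALUE;
        for (int j = 0; j < second.length; j++) {
            double d = 0;
            for (int k = 0; k < first[i].length; k++) {
                double diff = first[i][k] - second[j][k];
                d += diff * diff;
            }
            d = Math.sqrt(d);
            if (d < bestDist) {
                secondDist = bestDist;
                bestDist = d;
                best = j;
            } else if (d < secondDist) {
                secondDist = d;
            }
        }
        if (best >= 0 && bestDist < ratio * secondDist)
            matches.add(new int[] { i, best });
    }
    return matches;
}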
Lecture 8: Consistent matching
Feature distinctiveness
- Even the most advanced local features can be prone to being
mismatched.
- There is always a tradeoff in feature distinctiveness.
- If it’s too distinctive it will not match subtle
variations due to noise or imaging conditions.
- If it’s not distinctive enough it will match
everything.
Constrained matching
- Assume we are given a number of correspondences between the interest
points in a pair of images
- Is it possible to estimate which of those correspondences
are inliers (correct) or outliers (incorrect/mismatches)?
- What assumptions do we have to make?
- By assuming a geometric mapping between the two scenes, can we
recover that mapping and eliminate the mismatches?
Geometric Mappings
What are geometric transforms?
- In general, a geometric mapping can be thought of as a transform
function that maps the x, y coordinates of points in one image to another.
Go have a look at some on Wikipedia here: https://en.wikipedia.org/wiki/Geometric_transformation
Point Transforms
We’re interested in transforms that take the following form: x’=Tx

The Affine Transform
The affine transform is defined as

It’s more convenient to write this as a single transform matrix by adding an
extra dimension to each vector:

Translation

Translation and Rotation

Scaling

Aspect Ratio

Shear

Degrees of Freedom
Affine Transform

6 DoF: translation + rotation + scale + aspect ratio + shear
Similarity Transform

4 DoF: translation+rotation+scale
More degrees of freedom

Normalise by w so that the transformed
vector is [•,•,1]
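A tiny Java sketch (mine) of applying a 3x3 transform to a point in homogeneous coordinates; for an affine transform w is always 1, while for a homography (see below) the division by w does the normalisation just described:

// x' = Tx in homogeneous coordinates: [wx', wy', w] = T [x, y, 1], then divide by w.
static double[] transformPoint(double[][] T, double x, double y) {
    double xp = T[0][0] * x + T[0][1] * y + T[0][2];
    double yp = T[1][0] * x + T[1][1] * y + T[1][2];
    double w  = T[2][0] * x + T[2][1] * y + T[2][2];
    return new double[] { xp / w, yp / w }; // normalise so the vector is [x', y', 1]
}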

Homogeneous coordinates

The Planar Homography (Projective Transformation)

Recovering a geometric mapping
Simultaneous equations
- It is possible to estimate a transform matrix from a set of point
matches by solving a set of simultaneous equations
- Need at least 4 point matches to solve a Homography or 3 to
solve an affine transform
- The actual solution technique isn’t important…
- It is important to note that in the presence of noise, and with potentially
more matches than required, we have to solve an overdetermined
system
- We need to seek the minimum error or least-squares solution
Least-squares


Robust Estimation
Problem: Noisy data
- Need a way to deal with estimating a model (i.e. a transform matrix)
in the presence of high amounts of noise (i.e. mis-matches)
- Least-squares will be highly suboptimal, and probably find a
very bad solution.
- Ideally, we want to identify the correct data (the inliers) and
the bad data (the outliers)
- Then estimate the model using only the good data.
Robust estimation techniques
- The problem of learning a model in the presence of inliers and
outliers comes under an area of mathematics called robust
estimation or robust model fitting
- There are a number of different possible techniques
RANSAC: RANdom SAmple Consensus
This is an iterative method to estimate parameters of a mathematical model for a set of observed
data that contains outliers.
Assume:
M data items required to estimate model T
N data items in total
Algorithm:
- Select M data items at random
- Estimate model T
- Find how many of the N data items fit T within tolerance tol,
call this K (i.e. compute how many times the absolute residual is less than tol). The points that have
an absolute residual less than tol are the inliers; the other points are the outliers.
- If K is large enough, either accept T, or compute the
least-squares estimate using all inliers, and exit with success.
- Repeat steps 1..4 nIterations times
- Fail - no good T fit of data
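A Java sketch of that loop (mine; estimateModel and residual are made-up placeholders since they depend on the transform being fitted, and it uses java.util):

// RANSAC: repeatedly fit a model to a random minimal sample and keep it if
// enough of the data agrees with it.
static double[][] ransac(List<double[]> data, int M, double tol, int minInliers, int nIterations) {
    java.util.Random random = new java.util.Random();
    for (int iteration = 0; iteration < nIterations; iteration++) {
        // 1. Select M data items at random
        List<double[]> sample = new ArrayList<>();
        while (sample.size() < M) {
            double[] item = data.get(random.nextInt(data.size()));
            if (!sample.contains(item)) sample.add(item);
        }
        // 2. Estimate model T from the sample (placeholder, e.g. least-squares)
        double[][] T = estimateModel(sample);
        // 3. K = number of items whose absolute residual is within tol (the inliers)
        List<double[]> inliers = new ArrayList<>();
        for (double[] item : data)
            if (Math.abs(residual(T, item)) < tol) inliers.add(item);
        // 4. If K is large enough, re-estimate from all inliers and exit with success
        if (inliers.size() >= minInliers) return estimateModel(inliers);
    }
    return null; // fail - no good T fit of the data
}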
Further applications of robust local matching
Object recognition & AR
- Image of object is matched against scene, and recognised if
there is a consistent match
- Data can be added to a scene on the basis of a match
3D reconstruction
- It’s possible to estimate depth, and ultimately build a
complete 3d scene from sets of point correspondences formed from matching local features
Problems with direct local feature matching
Local feature matching is slow!
- Typical image (800x600) might have ~2000 DoG Interest points/SIFT
descriptors
- Each SIFT descriptor is 128 dimensions
- Now assume you want to match a query image against a database of
images…
- The distance between every query feature and every other feature
needs to be calculated
- Can this be optimised somehow?
Efficient Nearest Neighbour Search
- How can we quickly find
the nearest neighbour to a query point in a high dimensional space?
- Index the points in some kind of tree structure?
- Hash the points?
- Quantise the space?
K-D Trees
- Binary tree structure that partitions the space along axis-aligned
hyperplanes
- Typically takes each dimension in turn and splits on the median
of the points in the enclosing partition.
- Stop after a certain depth, or when the number of points in a
leaf is less than a threshold


- Search by walking down the tree until a leaf is hit, and then
brute-force search to find the best in the leaf.
- This is not guaranteed to be the best though…
- You have to walk back up the tree to see if there are any better
matches, and only stop once the root is reached.
- (note you don’t have to check a subtree if it’s
clear that all points in that subtree are further than the current best.)






K-D Tree problems
- Doesn’t scale well to high dimensions
- You tend to end up needing to search most of the tree
- There are approximate versions that won’t necessarily return the exact answer that do scale (if you
don’t mind the potential for mismatch)
Hashing
- Locality Sensitive Hashing (LSH) creates hash codes for vectors such
that similar vectors have similar hash codes!
Sketching
- A technique called sketching concatenates binary hashes into a bit
string.
- With the correct LSH function, the Hamming distance between a
pair of sketches is proportional to the Euclidean distance between the original vectors
- Can easily compress SIFT features to 128 bits
- Hamming distance computation is cheap
- Lookup tables and bitwise operations
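For example (my sketch), if each SIFT feature has been compressed to a 128-bit sketch stored as two longs, comparing two sketches is just:

// Hamming distance between two bit-string sketches: XOR then count the set bits.
static int hammingDistance(long[] a, long[] b) {
    int distance = 0;
    for (int i = 0; i < a.length; i++)
        distance += Long.bitCount(a[i] ^ b[i]);
    return distance;
}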
Lecture 9: Image search and Bags of Visual Words
Text Information Retrieval
The bag data
structure
- A bag is an unordered data structure like a
set, but which unlike a set allows elements to be inserted multiple times.
- Sometimes called a multiset or a
counted set
Bag of Words
- Say you have a document with this content:
- “the quick brown fox jumped over the
lazy dog”
- A bag of words describing this document would be: {the: 2, quick: 1, brown: 1, fox: 1, jumped: 1, over: 1, lazy: 1, dog: 1}
Text processing (feature extraction)

The Vector-Space Model
- Model each document by a vector
- Model each query by a vector
- Assumption: documents that are “close together” in
space are similar in meaning.
- Use standard similarity measures to rank each document to a
query in terms of decreasing similarity
Bag of Words Vectors
- The lexicon or vocabulary is the set of all (processed) words across all documents known to the system.
- We can create vectors for each document with as many dimensions as
there are words in the lexicon
- Each word in the document’s bag of words contributes a
count to the corresponding element of the vector for that word.
- In essence, each vector is a histogram of the word occurrences
in the respective document.
- Vectors will have a very high number of dimensions, but will be
very sparse.
The Vector-space Model

Searching the VSM

Recap: Cosine
Similarity


Inverted Indexes
- A map of words to lists of postings
Aardvark | [doc3:4]
Astronomy | [doc1:2]
Diet | [doc2:9; doc3:8]
... | ...
Movie | [doc2:10]
Star | [doc:13; doc2:4]
Telescope | [doc1:15]
- A posting is a pair formed by a
document ID and the number of times
the specific word appeared in that document
- So for the first entry: Aardvark appeared
4 times in document 3.
Computing the Cosine Similarity
- For each word in the query, lookup the relevant posting list and
accumulate similarities for only the documents seen in those postings lists
- Much more efficient than fully comparing vectors…
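A Java sketch of that scoring loop (mine; the index maps each word to a postings map of documentId → count, it uses java.util, and the scores here are unnormalised dot products - divide by the vector norms to get the cosine similarity):

// Only documents that share at least one word with the query ever get scored.
static Map<String, Double> score(Map<String, Map<String, Integer>> index, Map<String, Integer> queryBag) {
    Map<String, Double> scores = new HashMap<>(); // documentId -> accumulated score
    for (Map.Entry<String, Integer> queryTerm : queryBag.entrySet()) {
        Map<String, Integer> postings = index.get(queryTerm.getKey());
        if (postings == null) continue; // word not in the lexicon
        for (Map.Entry<String, Integer> posting : postings.entrySet())
            scores.merge(posting.getKey(),
                    (double) posting.getValue() * queryTerm.getValue(), Double::sum);
    }
    return scores;
}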

Weighting the vectors
- The number of times a word occurs in a document reflects the
importance of that word in the document.
- Intuitions:
- A term that appears in many documents is not important: e.g. the, going, come, ...
- If a term is frequent in a document and rare across other
documents, it is probably important in that document.
Possible weighting schemes
- Only presence (1) or absence (0) of a term recorded in
vector.
- Frequency of occurrence of term in document included in
vector.
- Term frequency is the frequency count of a term in a
document.
- Inverse document frequency (idf) provides high values for rare
words and low values for common words.
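One common formulation (my sketch, not necessarily the exact weighting used in the lectures):

// tf-idf: term frequency in the document times log(N / document frequency),
// so terms that are frequent here but rare elsewhere get the biggest weights.
static double tfidf(int termFrequency, int documentFrequency, int numDocuments) {
    double idf = Math.log((double) numDocuments / documentFrequency);
    return termFrequency * idf;
}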
Vector Quantisation
Learning a Vector Quantiser
- Vector quantisation is a lossy data compression technique.
- Given a set of vectors, a technique like K-Means clustering can be
used to learn a fixed size set of representative vectors.
- The representatives are the mean vector of each cluster in
k-means.
- The set of representative vectors is called a codebook
Vector Quantisation
- Vector quantisation is achieved by representing a vector by another
approximate vector, which is drawn from a pool of representative vectors.
- Each input vector is assigned to the “closest”
vector from the pool.
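A Java sketch of the assignment step (mine), plus the occurrence-histogram building that the Bag of Visual Words below relies on:

// Assign a feature to the closest codebook vector (by Euclidean distance).
static int quantise(float[] feature, float[][] codebook) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < codebook.length; i++) {
        double d = 0;
        for (int k = 0; k < feature.length; k++) {
            double diff = feature[k] - codebook[i][k];
            d += diff * diff;
        }
        if (d < bestDist) {
            bestDist = d;
            best = i;
        }
    }
    return best;
}

// Quantise every local feature and count how often each visual word occurs.
static int[] bagOfVisualWordsHistogram(float[][] features, float[][] codebook) {
    int[] histogram = new int[codebook.length];
    for (float[] feature : features)
        histogram[quantise(feature, codebook)]++;
    return histogram;
}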

Visual Words
SIFT Visual Words
- We can vector quantise SIFT descriptors (or any other local
feature)
- Each descriptor is replaced by a representative vector known as a
visual word
- In essence the visual word describes a small image patch with a certain pattern of pixels
- In many ways the process of applying vector quantisation to
local features is analogous to the process of stemming words.
- The codebook is the visual equivalent of a lexicon or
vocabulary.
Bags of Visual Words
- Once we’ve quantised the local features into visual words,
they can be put into a bag.
- This is a Bag of Visual Words (BoVW)
- We’re basically ignoring where in the image the local
features came from (including ignoring scale)
So why are we doing all this word stuff for a module about
vision?....
Histograms of Bags of Visual Words
- Like in the case of text, once we have a BoVW and knowledge of the
complete vocabulary (the codebook) we can build histograms of visual word occurrences!
- This is nice… it gives us a way of aggregating a variable
number of local descriptors into a fixed length vector.
- Useful for machine learning
- But also allows us to apply techniques for text retrieval to images

Visualising Visual Words

The effect of codebook size
- There is one key parameter in
building visual word representations - the size of the vocabulary.
- Too small, and all vectors look the same
- Too big, and the same visual words might never appear across
images
Content-based Image Retrieval

BoVW Retrieval
- With the visual word representation, everything used for text
retrieval can be applied directly to images
- Vector space model
- Cosine similarity
- Weighting schemes
- Inverted index
Optimal codebook size
- Inverted index only gives a performance gain if the vectors are
sparse (you don’t want to end up explicitly scoring all documents)
- Visual words also need to be sufficiently distinctive to minimise
mismatching
- Implies a very big codebook
- Modern research systems often use 1 million or more visual words
for SIFT vectors
Problems with big codebooks
- There’s a slight problem…
- Need to use k-means to learn 1 million clusters in 128
dimensions from 10’s of millions of features
- Non-trivial!
- Vector quantisation has the same problems
- Have to use approximate methods, like approximate k-d
trees
Overall process for building a BoVW retrieval
system
- Collect the corpus of images that are to be indexed and made
searchable
- Extract local features from each image
- Learn a large codebook
from (a sample of) the features
- Vector quantise the features, and build BoVW representations for
each image
- Construct an inverted index with the BoVW representations
Lecture 10: Image classification and
auto-annotation
Multilabel classification
- Oh my gosh, this picture has a cat in it, so it must be
classified as cat.
- But wait, it has a dog in it as well, so it’s
classified as a dog.
- AAHHHH there is both a cat and a dog in the image
AAAAA….
- Wait…
- Why can’t we just classify it as both?

- In the context of images often called Automatic
Annotation
Object Detection / Localisation

Slide summary: Challenges in Computer Vision
This is a summary of slides 7-16, go look in the slides if
you’re interested
http://comp3204.ecs.soton.ac.uk/lectures/pdf/L10-classification.pdf
- Object Recognition in natural scenes
- Scene/Activity Classification
- Automatic Annotation (it’s not that good sometimes)
- The fundamental problem of computer vision: The
Semantic Gap
- In an image for computer vision there are:
- Semantics
- Object Labels
- Objects
- Descriptors
- Raw media
- Upside-down cars lol
- History:
- Very powerful, but computationally demanding
- 2001 - Cascades of Haar-like features
- Very popular for face detection
- Combined ideas from SIFT and the integral images used for
computing Haar-like features
- Interest in auto-annotation grew from the late 90s; Bags of
“Visual Words” were rather important!
Aside: Optimal codebook size
- The codebook vocabulary needs to be much smaller than for doing image
search
- In general, machine-learning techniques need much smaller
vectors (for both performance and effectiveness)
- The visual words can be allowed to be less distinctive, allowing
a little more variation between matching features.
- Typically, the number of visual words might be as small as a few
hundred, and up to a few thousand.
Another slide summary: Stuff
Slides 18-26
- Machine Translation (2002)
- Identifying areas in an image, using visual words!
- Singular Value Decomposition
- Probabilistic Latent Factor Models
- Non-negative Matrix Factorisation
- Visual words again!
- Research focus shifted a little to use of bigger datasets in the
mid-late 2000s.
- Interest in simpler (but more scalable) classifiers grew
- Classifying with BoVW
- BoVW histogram representations are incredibly useful for image
classification and object detection
- Commonly used with fast linear classifiers and SVMs
- Over time the features used to create BoVW representations have
improved
- Early global colour visual terms
- Consider each pixel as a visual word based on the quantisation
of its colour to a discrete set of values.
- The BoVW Histogram is just a joint colour histogram that we saw
earlier
- Sunset colours :o
- Visual words from regions/segments
- Visual words from interest points
- Salient region detection
- Local Descriptors
- Vector Quantisation
- Vocabulary of visual terms learnt through hierarchical
k-means
- Word Occurrence Vectors
- Local features extracted around interest points work okay for
classification, but there are more recent strategies that can work better…
Dense Local Image Patches

Dense SIFT
- Rather than extracting your SIFT features at DoG interest points,
you could extract them across a dense grid - this gives much more coverage of the entire image.

Pyramid Dense SIFT
- For even better performance and coverage, you can sample in a
Gaussian pyramid
- Note that the sampling region is a fixed size, so at higher
scales you sample more content

Egyptian Rhinoceros
Spatial Pyramids

Developing and benchmarking a BoVW scene classifier
Evaluation Dataset
- Common for academic research to use standardised datasets for
developing scene classifiers and comparing results
- Datasets are usually split into labelled “training”
and “test” sets.
- Only the training set can be used to train the classifier
- Sometimes the test set labels are withheld completely to ensure there is no cheating
Building the BoVW
- Firstly the raw features need to be extracted from the training
images
- Then (if necessary) learn a codebook from these features
- i.e. using k-means on the raw features
- Might be a uniform random sample of all the features rather than all of them
- Apply (vector) quantisation to the raw features and count the number
of occurrences to build histograms for each image
Training classifiers
- Classifiers can be trained using the histograms.
- e.g. OvR linear classifiers with a kernel map.
- You might train on a subset of the training data
- And use the remaining data to “validate” and
optimise parameters.
- Once you’ve chosen the optimal parameters you can then
re-train using the optimal values.
Classifying the test set
- You’re now in a position to apply the classifiers to the test
data:
- Extract the features
- Quantise the features (using the codebook developed from the
training set!)
- Compute the occurrence histograms
- Use the classifiers to find the most likely class
Evaluating Performance
- Lots of ways to evaluate the performance of classification on the
test (and validation) set.
- Conceptually the simplest summary measure is probably average precision
- This is literally the proportion of correct
classifications to the total number of predictions
Phew, I think we’ve covered all the examinable content. I’ll
summarise the bonus slides just for completeness.
Lecture 11: Towards 3D vision
Look at the slides for pictures :) http://comp3204.ecs.soton.ac.uk/lectures/pdf/L11-towards3d.pdf
- Architecture
- Urban Planning
- Virtual Tourism
- Clothing & body measurement
- Art
- SLAM (Simultaneous localization and mapping)
- Cultural Heritage
- Forensics
- Surveillance
- Motion Capture (Films & Games)
- Camera Geometry
- Camera Calibration
- Measuring Depth
- Narrow Baseline Stereo
- Stereo Camera
- Epipolar geometry
- Dense narrow-baseline stereo
- Multiple images can be used to jointly infer 3D structure, and
the camera pose and intrinsics of each camera
- Point matches (i.e. SIFT) are used as the basis for
triangulating 3D points from the 2D images
- Reconstructing Venice
- Monocular Vision
- Shadow Scanner
- Structured Light Imaging
- LIDAR
- PrimeSense (Kinect)
Summary Summary
- 3D computer vision has lots of practical applications
- Camera models give a mathematical description of how a pixel in a 2D
image is related to a point in a 3D scene
- Camera calibration can be used to find the parameters of a
camera
- Multiple views of a scene can be used to infer depth
- There are lots of other techniques for capturing depth that only
require a single sensor
Programming for computer vision & other musings
related to the coursework
Writing code for computer vision
- Images usually stored as arrays of integers
- Typically 8-bits per pixel per channel
- 12-16 bit increasingly common (e.g. HDR imaging)
- Uses unsigned pixel values
- Compressed using a variety of techniques
Most vision algorithms are continuous
- E.g. convolution with a continuous function (i.e. Gaussian)
- If we were writing the next Adobe Photoshop, it would be important
that we kept our images in a similar format (integer pixels, same number of bits)
- We would essentially round pixel values to the closest integer
and clip those out of range
- For vision applications we don’t want to do this as
we’ll lose precision
Always work with floating point pixels
- Unless they’ve been specifically optimised for integer math,
all vision algorithms should use floating point pixel values
- Ensure the best possible discretisation from operations
involving continuous functions
- Higher effective bit depth (32/64 bits per pixel per
band)
- Ability to deal with negative values
- Turns out to be very important for convolution!
- Ability to deal with numbers outside of the normal range
- Just because a pixel has a grey level of 1.1 doesn’t mean
it’s invalid, just that it’s too bright to be displayed in the normal colour gamut.
Guidelines for writing vision code
- Convert any images to float types immediately once you’ve read
them
- Don’t convert them back to integer types until you need to
(i.e. for display or saving)
- Be mindful that a meaningful conversion might not just involve
rounding if you want to preserve the data.
Convolution
We probably do need to know how to do convolution so...
- Convolution is an element-wise multiplication in the Fourier domain
(c.f. Convolution Theorem)
- f﹡g = ifft(fft(f) . fft(g))
- Whilst f and g might only contain real numbers, the FFTs are complex
(real + imagj)
- Need to do complex multiplication!
Aside: phase and magnitude
- Given a complex number (n = real + imagj) from an FFT we can
compute its phase and magnitude
- phase = atan2(imag, real)
- magnitude = sqrt(real*real + imag*imag)
- We might perform this transformation to display the FFT as it
conceptually helps us understand what the FFT is doing
- We can’t use this representation to perform convolution
however (need to transform back to complex form first)
Aside: Displaying FFTs
- FFTs are often re-ordered so that the DC component (0- frequency)
component is in the centre:

Template Convolution
- In the time domain, convolution is:

- Notice that the image or kernel is “flipped” in time
- Also notice that there is no normalisation or similar

// Template convolution in the spatial domain. Note the kernel indices (jj, ii)
// run backwards - this is the "flip" - and there is no normalisation. Border
// pixels, where the kernel doesn't fully fit, are left at zero.
int kh = kernel.height;
int kw = kernel.width;
int hh = kh / 2; // half kernel height
int hw = kw / 2; // half kernel width
Image clone = new Image(image.width, image.height);
for (int y = hh; y < image.height - (kh - hh); y++) {
    for (int x = hw; x < image.width - (kw - hw); x++) {
        float sum = 0;
        for (int j = 0, jj = kh - 1; j < kh; j++, jj--) {
            for (int i = 0, ii = kw - 1; i < kw; i++, ii--) {
                int rx = x + i - hw;
                int ry = y + j - hh;
                sum += image.pixels[ry][rx] * kernel.pixels[jj][ii];
            }
        }
        clone.pixels[y][x] = sum;
    }
}
What if you don’t flip the kernel?
- Obviously if the kernel is symmetric, there is no difference
- However, you’re actually not computing convolution, but
another operation called cross-correlation

- * represents the complex conjugate
- (you can compute this with the multiplication of the FFTs just like
convolution: iFFT(FFT(f)* . FFT(g)))

Ideal Low-Pass filter
- “Ideal” low pass filter removes all frequencies above a
cutoff

Ideal Low-Pass filter - problems

Gaussian filters - why

Building Gaussian Filters

High-pass filters
- “To obtain a high-pass filtered image, subtract a lowpass
filtered image from the image itself”

- ILP = I﹡G
- IHP = I-ILP
- IHP = I - I﹡G
- IHP = I﹡δ - I﹡G
- IHP = I﹡(δ - G)
Note - Don’t do this!
- IHP = I﹡(δ - G) is not the same as IHP = I﹡(1 - G)
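A small Java sketch of building the δ - G kernel correctly (mine; gaussian is assumed to be a normalised Gaussian kernel with odd dimensions):

// High-pass kernel: negate the Gaussian and add 1 at the centre only (the delta).
// Adding 1 everywhere would be the (wrong) 1 - G kernel warned about above.
static float[][] highPassKernel(float[][] gaussian) {
    int h = gaussian.length, w = gaussian[0].length;
    float[][] kernel = new float[h][w];
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            kernel[y][x] = -gaussian[y][x];
    kernel[h / 2][w / 2] += 1; // the delta
    return kernel;
}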

High-pass filters have a mixture of negative and positive
coefficients
- …that means the resultant image will also have positive and
negative pixels
- this is important - for example it can tell us about the
direction of edges:
- (remember convolution means kernel flipped)
- +values in the output image mean edge from right to left
- -values in output image mean edge from left to right
- Convolution implementation MUST NOT:
- Normalise
- result in unsigned types
Building hybrid images
…is really simple
- Add the low pass and high-pass images together
- Don’t:
- average the two images
- do a weighted combination of the two images
- just add them (and clip if necessary)
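As a sketch (mine; float images assumed to be in the 0..1 range, clipping only at the very end for display):

// Hybrid image: low-pass of one image plus high-pass of the other - just added.
static float[][] hybrid(float[][] lowPass, float[][] highPass) {
    int h = lowPass.length, w = lowPass[0].length;
    float[][] result = new float[h][w];
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            float v = lowPass[y][x] + highPass[y][x]; // no averaging or weighting
            result[y][x] = Math.max(0, Math.min(1, v)); // clip for display if necessary
        }
    return result;
}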
Happy Revising!

(not my hybrid, add a suggestion if you know whose it is)
(- Thanks Matthew Barnes, god of notes (this is not a real
hybrid lol))
Go look at the bonus lecture yourself: http://comp3204.ecs.soton.ac.uk/lectures/pdf/VisionRetrospective.pdf
TL;DR
The TL;DRs are TL;DRs themselves; so much content
The TL;DRs are actually pretty long, there’s just a lot of important content.
Part 1: Mark
See collab TL;DR below Jon’s part.
Mark (formatted by Joshua
Gregory)
Part 2: Jon
Summaries from end of slides and hand-out notes.
Lec 1
A computer vision system is not just software: it is any hardware, software and wetware (i.e. humans) that make up the complete system. To engineer a computer
vision system, you need to think about how you can achieve the required level of robustness by constraining the problem at hand
and incorporating sufficient invariance to potentially changing
environmental factors.
- Robust and repeatable
computer vision is achieved through engineered invariance
and applied constraints.
- RGB
- HSV (Hue, Saturation, Value)
Lec 2
Machine learning is a fundamental part of high-level computer vision (interpreting what is seen, versus the low-level, which is all about processing the image). The standard computer vision
pipeline goes from the image, through a process of feature extraction, to a pattern recognition
component that makes inferences. Machine learning is a standard way of training the pattern recognition system.
- Extracting features is a key part of
computer vision
- Typically, these are numerical vectors that can be used with machine-learning techniques.
- Feature vectors can be compared by measuring
distance.
- Classification learns what class to assign a feature to.
- Clustering groups similar
features.
- Sometimes you’ll need to figure out the Euclidean distance
between 2 points in the feature space.
- Point p=(p1, p2, …, pn) and q=(q1, q2, …, qn)

- Cosine similarity is another measure for vectors. It is not a distance
measure.
- Similarity = 1 if vectors same direction; decreases as the angle
increases.
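A quick Java sketch of both measures (mine):

// Euclidean distance between feature vectors p and q.
static double euclideanDistance(double[] p, double[] q) {
    double sum = 0;
    for (int i = 0; i < p.length; i++)
        sum += (p[i] - q[i]) * (p[i] - q[i]);
    return Math.sqrt(sum);
}

// Cosine similarity: 1 when the vectors point the same way, unaffected by magnitude.
static double cosineSimilarity(double[] p, double[] q) {
    double dot = 0, normP = 0, normQ = 0;
    for (int i = 0; i < p.length; i++) {
        dot += p[i] * q[i];
        normP += p[i] * p[i];
        normQ += q[i] * q[i];
    }
    return dot / (Math.sqrt(normP) * Math.sqrt(normQ));
}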

- Classification: assign class labels,
binary or multiclass.
- ML algorithm learns on pre-labelled training
data.
- Hyperplane through the feature space, separates the classes.
- Linear classifiers: think of the hyperplane as a straight line through the vector space.
- Linear classifiers are more efficient than KNN: they don’t need to keep the training data around, and only need to check which side of the hyperplane an
unlabelled point falls on.
- Non-linear classifier: Support Vector
Machine (SVM)
- K-nearest-neighbours (KNN)
- Find class of unlabelled vector by finding majority class of closest K training points.
- Can become computationally expensive when lots of training example or large dimensions.
- K-Means clustering - algorithm to cluster data
in feature space (a small sketch follows this list).
- Clusters represented by centroids.
- Algorithm:
- K value set before
- Randomly choose initial centroids
- Each point assigned to its closest centroid
- Centroid recomputed as mean
- If Centroid has no points assigned, randomly
re-initialise
- This is done iteratively until the centroids don’t
move.
- K-means always converges, but not necessarily to most
optimal solution.
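A Java sketch of that loop (mine; points are double[] of any dimensionality, and the random re-initialisation of empty clusters is left out to keep it short):

// k-means: assign each point to its nearest centroid, recompute centroids as
// means, repeat until the assignments (and hence centroids) stop changing.
static double[][] kMeans(double[][] points, int k, int maxIterations) {
    java.util.Random random = new java.util.Random();
    double[][] centroids = new double[k][];
    for (int i = 0; i < k; i++)
        centroids[i] = points[random.nextInt(points.length)].clone(); // random initial centroids

    int[] assignment = new int[points.length];
    java.util.Arrays.fill(assignment, -1);
    for (int iteration = 0; iteration < maxIterations; iteration++) {
        boolean changed = false;
        for (int p = 0; p < points.length; p++) {           // assign to the closest centroid
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double d = 0;
                for (int dim = 0; dim < points[p].length; dim++)
                    d += (points[p][dim] - centroids[c][dim]) * (points[p][dim] - centroids[c][dim]);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            if (assignment[p] != best) { assignment[p] = best; changed = true; }
        }
        if (!changed) break;                                 // converged

        double[][] sums = new double[k][points[0].length];   // recompute centroids as means
        int[] counts = new int[k];
        for (int p = 0; p < points.length; p++) {
            counts[assignment[p]]++;
            for (int dim = 0; dim < points[p].length; dim++)
                sums[assignment[p]][dim] += points[p][dim];
        }
        for (int c = 0; c < k; c++)
            if (counts[c] > 0)
                for (int dim = 0; dim < sums[c].length; dim++)
                    centroids[c][dim] = sums[c][dim] / counts[c];
    }
    return centroids;
}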
Lec 3
Understanding the shape of data in a feature space is
important to effectively using it. In addition, by understanding
the distribution of really highly dimensional data, it is possible to
determine the most important modes of variation of that data, and thus
represent the data in a space with many fewer dimensions.
- Covariance measures the “shape”
of data by measuring how different dimensions change together.
- The principal axes are a basis, aligned such that they describe the directions of greatest variance.
- The Eigendecomposition of the covariance
matrix produces pairs of eigenvectors (corresponding to the
principal axes) and eigenvalues (proportional to the
variance along the respective axis).
- PCA (Principal Component Analysis) aligns data with its
principal axes, and allows dimensionality reduction by discounting axes with low variance.
- Eigenfaces applies PCA to vectors made from
pixel values to make robust low-dimensional image descriptors.
- Variance - how “spread-out”
the data is from the mean

- n data points: xi ...
- μ mean
- Covariance - how 2 variables change
together.

- Variance = covariance when 2 variables the same.
- Variables are uncorrelated when
covariance = 0.
- Covariance matrix - square symmetric matrix
containing all the covariance values.

- Principal axes of variation - linearly
independent vectors, orthogonal, act as basis
- Major axis data most spread
- Minor axis perpendicular to major
- Eigendecomposition of covariance matrix

- A is square matrix
- v is non-zero eigenvector
- λ is eigenvalue
- If matrix A is covariance matrix,
then eigenvectors are principal components.
- Vector with largest eigenvalue is first principal axis and so
on…
- The Eigendecomposition is thus a way of finding the principal
axes
- PCA projects data in an original space to a new space defined by the
basis of principal axes.
- Reduce the dimensionality of the data, get
rid of axes with small variance.
PCA steps summary:
- Mean-centre the data vectors
- Form vectors in matrix Z, so
each row corresponds to vector
- Do Eigendecomposition of matrix ZᵀZ, to get eigenvector matrix Q and diagonal eigenvalue matrix Λ:

- Sort columns of Q and values of
Λ to decreasing order.
- Select L largest eigenvectors of
Q (first L columns) to make
transform matrix QL.
- Project original vectors into lower dimensional space, TL:

- Eigenfaces - face recognition, PCA to
features, represent images far fewer dimensions.
- All images need to be same size, aligned
- Also generative model
Overall approach
Training images:
- Images flattened to vectors
- Mean vector computed and stored
- Vectors mean centred
- PCA applied, project to lower dimensional space. Transform matrix
stored.
- Low dimensional vectors used as training data for classifier
- KNN with distance threshold
Testing image:
- Image flattened to vector, mean vector subtracted
- Vector projected by PCA basis (transform matrix) to lower
dimensional space
- Vector given to classifier, generates class label
Lec 4
Features that can be extracted from images fall into four main categories. Global features are
extracted from the entire image. Region-based
features are extracted from regions of the image called
connected components. Detecting connected components requires that you
first apply a process called segmentation to partition the image into
multiple segments.
Image features can be categorised into one of four main categories:
- Global
- Grid-based (block-based)
- Region-based
- Local, “interest points”
- A common type of global feature is a global colour
histogram.
- Region-based methods need regions to be
detected - this process is called segmentation... many
different approaches
- Connected components are segments in which the pixels are all reachable by a connected path.
- Feature morphology refers to the form or shape of a feature
- Joint colour histograms - (global)
accumulate binned pixel values.
- Normalisation of histogram for image of
different sizes.
- Colour space important.
- Segmentation - (region) set of pixels share
certain visual characteristics.
- Thresholding simplest form, grey image
→ binary image.
- Work well when constrained, designed to stand out (QR
codes)
- Manual thresholding value, or ...
- Otsu’s thresholding method - assumes 2
classes, bi-modal histogram.
- Combined spread of 2 classes minimal (intra-class
variance)
- Adaptive thresholding: e.g. sets a pixel to bg
(background) if its value is less than the mean of its neighbours plus an offset, fg (foreground) otherwise.
- K-Means clustering for
segmentation.
- Connected Components - segment where all
pixels reachable from other (adjacently)
- Connected-component labelling - binary
image → set of connected components
Two-pass algorithm:
First pass
- Raster scan data
- If element not bg, then get neighbour elements
- No neighbours? Uniquely label current element
- Otherwise find neighbour with smallest label, assign current
element
- Store equivalence between neighbouring labels
Second pass
- Raster scan again
- If element not bg, then relabel element with lowest equivalent
label
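A Java sketch of the two-pass algorithm (mine; binary image, 4-connectivity, and a simple union-find array to store the label equivalences):

// First pass: label pixels and record equivalences; second pass: relabel each
// pixel with the smallest equivalent label. Label 0 is background.
static int[][] connectedComponents(boolean[][] image) {
    int h = image.length, w = image[0].length;
    int[][] labels = new int[h][w];
    int[] parent = new int[h * w + 1]; // union-find parents for labels 1..n
    int nextLabel = 1;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            if (!image[y][x]) continue; // background
            int above = (y > 0) ? labels[y - 1][x] : 0;
            int left = (x > 0) ? labels[y][x - 1] : 0;
            if (above == 0 && left == 0) {
                parent[nextLabel] = nextLabel; // no labelled neighbours: new label
                labels[y][x] = nextLabel++;
            } else if (above == 0 || left == 0) {
                labels[y][x] = Math.max(above, left); // only one labelled neighbour
            } else {
                labels[y][x] = Math.min(above, left); // take the smallest label...
                union(parent, above, left);           // ...and store the equivalence
            }
        }
    }

    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            if (labels[y][x] != 0)
                labels[y][x] = find(parent, labels[y][x]);
    return labels;
}

static int find(int[] parent, int label) {
    while (parent[label] != label) label = parent[label];
    return label;
}

static void union(int[] parent, int a, int b) {
    int rootA = find(parent, a), rootB = find(parent, b);
    if (rootA < rootB) parent[rootB] = rootA;
    else parent[rootA] = rootB;
}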
Lec 5
Basic shape description involves extracting
characteristic features that describe the shape of a connected component. Shape descriptors can be used together with machine learning, or manually defined
models to classify shapes into different classes; a
common example of this is classifying shapes into letters of the alphabet
in order to achieve optical character recognition. Statistical shape models describe how the shape
of an object (or set of objects) can change (through internal movement and/or through changes in pose). In
addition to describing shapes, statistical shape models can be used to find instances of a shape in an
image.
- Many different ways to describe the
shape of connected components, the
choice depends on required invariance
- Multiple shapes can efficiently be represented by
a RAG, very robust
- Point distribution models apply PCA to x-y
coordinate pairs across multiple images to produce a low-dimensional parametric
model
- ASMs/CLMs also model local
appearance, and can use this to optimise the fit of the model parameter to match an image.
- Scalar features (specifically of connected
components)
- Inner border
- Outer border
- Approximated by:

- Compactness - ratio of area to perimeter
squared:
- Irregularity (dispersion) - ratio of major chord length to area.

- x and y bar are coordinates of centroid.
- Moments - describe distribution of pixels in
a shape (connected component)
- 2D Cartesian moment, order p and
q:
- Connected component simplifies to:
- Zero order moment of connected component
m00 is just
the area.
- Centre of mass is (centroid):
- Central Moments - translation invariant,
compute about the centroid
- μ10 and
μ01 both
equal 0, so give no information, but higher order ones do.
- Normalised central moment both translation
and scale invariant:
- Chain codes - encode boundary of object,
step around object and note down direction of each step, either 4 or 8 directions.
- Rotating the sequence so it is the smallest integer, makes it start position invariant.
- Rotation invariance achieved by
storing difference between consecutive numbers.
- Scale invariance theoretically by
resampling shape, usually doesn’t work well in practice.
- Used to compute area, perimeter and moments directly!
- Not good for shape matching due to
- Noise, resampling, problem generating good similarity/distance
metrics.
- Fourier descriptors - encode shape
information by decomposing boundary into set of frequency components.
- Two main steps, choose carefully to make very invariant boundary descriptors:
- Define representation of curve (boundary)
- Expanding representation using Fourier theory
- Region Adjacency Graphs (RAG) - (trees)
describe layout of connected components relative to
each other.
- Nodes correspond to components, and are connected if the shapes share a
border
- Invariant to: distortion (including rotation, scale, translation,
non-linear transformations)
- Not invariant to occlusion
- Point Distribution Models (PDM) - Like
eigenfaces, applies similar process to set of points representing a shape
- Corresponding 2D points manually created from set of N training face images
- Number of points fixed at M
- Iterative process called Generalized Procrustes Analysis, align
points
- Mean shape created
- Shape matrix created
- Each column stores information on how the x or y ordinate of a
point on a face can change.
- Active Shape Models (ASM) / Constrained Local Models (CLM)
- take PDM further, incorporate what image should look like around each
point.
- Small pixel patch about each point - is template.
- Addition of data allows model to be better fitted to unseen
image
- Each point tried to move to local optimum while PDM contains all
points
- Active Appearance Models - same thing as
ASMs/CLMs, but instead of local features, try to jointly optimise global appearance of face against
PDM.
Lec 6
The detection of local interest points that are
stable under varying imaging conditions has a huge number of
applications in computer vision. Research in this area goes back as
far as the 1960s and 70s. The Harris and Stephens corner detection technique
developed in 1988 is a classic example of a detection technique with impressive robustness. A related problem to the detection of interest points is the
problem of scale (the size at which an object appears in an image).
Scale space theory allows interest point techniques to be developed
that are invariant to changes in scale (i.e. the object moving further away). The Difference-of-Gaussian blob detector is an example of such a scale-space blob detection technique.
- Interest points have loads of
applications in computer vision.
- They need to be robustly detected, and invariant to rotation, lighting change etc. (sufficient but not too much texture
variation in local neighbourhood)
- There are 2 types: corners and
blobs
- Harris & Stephens is common corner detector
- Finding extrema in a multi scale DoG pyramid provides robust blob
detector
- Scale space theory allows us to find
features (corners and blobs) of different sizes
- Harris & Stephens Corner Detector -
classic algorithm.
- Considers the brightness of a small patch of image.
- Then when you slightly shift that patch then
- If brightness same, original patch not stable point
- If brightness changes a lot, original patch stable
- Computing the weighted sum-squared difference of window and shifted
window:
- Point (x,y) and shift (∆x, ∆y):

- This equation can be “processed” to make:

- Square symmetric matrix M concisely
describes the shape of the local weighted difference
function.

- Basically encoding of image derivatives in the x,y and xy
directions.
- Called the second moment matrix aka
structure tensor.
- You can use the absolute values of
the eigenvalues directly (explained somewhere
else)
- But Harris & Stephens came up with a scheme that avoids explicitly computing the Eigendecomposition by formulating a
corner response function (R(x, y) in terms of the determinant and
trace of M:

- K usually 0.04 - 0.06
- Value of R:
- R > 0: Corner
- R < 0: edge
- | R | = small: flat
- Actually find corners, compute R for each pixel, keep only ones over threshold.
- Filter out points not local maxima of R within small window (8
neighbouring pixels)
- Scale space theory - formal framework for handling
images at different scales by represent image as family of smoothed
images parameterized by the size of the smoothing kernel used for suppressing fine detail.
- Scale parameter: t, image structure of spatial size smaller than
~sqrt(t) mostly smoothed away in scale-space level at scale t.
- Gaussian Scale Space - smoothing
function is Gaussian kernel
- Formally the Gaussian scale space of an image f(x, y) is the
family of derived signals L(x, y; t) defined by the convolution of the image with the 2D Gaussian
kernel:

- Semicolon implies convolution only performed over variables x and
y.
- Definition valid for all t ≥ 0
- Gaussian Pyramids.
- Multiscale Harris & Stephens - define
Gaussian scale space with fixed set of scales and compute corner response function at every pixel of
each scale, keep only those with response function above threshold.
- Local extrema in DoG Scale Space
- Laplacian of Gaussian (LoG) is the 2nd differential of a Gaussian convolved
with an image (the kernel is shaped like a Mexican hat).
- Zero-crossings of function, get Marr-Hildreth edge detector.
- By finding local max and min, you get blob detector.
- Very useful property: if a blob is detected at (x0, y0; t0) then in the scaled image (scaled by a factor s) the same blob is detected at (sx0, sy0; s²t0)
- In practice, DoG approximation used instead of LoG.
- Build gaussian scale space, subtract
adjacent scales to produce DoG scale space to then search for
extrema.
- Every time you double scale, you can half image size, pyramid can be
constructed.
Lec 7
How to extract local features from these interest points. A number of techniques have been proposed in the past in order to
extract robust local descriptors, starting from simple pixel
histograms, through to advanced features like the SIFT
descriptor, which encode weighted local image gradients.
Once these descriptors have been extracted, they can be used for image
matching.
- Features extracted around interest points have lots of practical
uses in computer vision.
- Matching scenarios basically fall into two
categories:
- Narrow-baseline where the
difference in image is slight
- Local image templates are often suitable
descriptors
- Wide-baseline where there are bigger
differences in pose
- Gradient histogram descriptors are
good here
- Local feature matching basics - find all
local features in one image that have correspondences in another image.
- Applications: image retrieval, 3D reconstruction, panoramas,
tracking….
- Stereo vision, 2 concepts:
- Narrow-baseline stereo is where 2
images very similar - local features moved by few pixels.
- Wide-baseline stereo where differences
much bigger.
- Robust local description - type of descriptor
dependent on task.
- Narrow-baseline, rotation not issue,
descriptiveness not too important, lighting not changed much.
- Wider baselines, local descriptors with
attributes:
- Robust to uniform intensity
changes in pixel values around interest point.
- Invariance to orientation, features not change as image rotated.
- Robustness to placing of interest point
by few pixels
- Descriptors for visually differing local regions be unique and
far apart in feature space.
- Matching by correlation (template matching) - rectangle region around interest point, use pixel values directly.
- Sum-Squared-Difference perform basic
matching.
- Work well with small differences, but not in
wide-baseline.
- Local Intensity Histograms - histogram of
pixel intensities of local region.
- Circular window, (mostly) rotation invariant.
- Can use Gaussian weighting.
- Not invariant to illumination changes.
- Histograms not very distinctive :(
- Local Gradient Histograms - encode
gradient directions within a window
- Easy to compute in x and y directions.

- Magnitude of gradient:

- Gaussian weighting can be applied
- Invariant to uniform intensity changes.
- Not rotation invariant!
- Have to robustly compute “dominant orientation”
subtract from pixels orientation to be rotation invariant.
- The SIFT Feature - builds spatial array of orientation-magnitude
histograms about the interest point.
- Pixel contributions linearly
interpolated across the nearest spatial bins to avoid
discontinuities
- Magnitudes weighted by Gaussian centred on interest point
- Typical: 4x4 spatial grid, 8 orientation bins; 128 dimensional
feature vector.
- Very descriptive and discriminative feature
- Matching SIFT Features - often compared
using Euclidean distance
- Take each feature from first image and find closest in second
image.
- Threshold to stop poor matches
- However, tends to result in lots of incorrect matches, not
robust to big viewpoint changes.
- Better: take each feature from the first image, find 2 closest in second
image, only form match if ratio of distances between closest and second closest is less than threshold
(typically 0.8)
Lec 8
When comparing two images by matching local feature, we need to eliminate mismatches. By applying geometric
constraints to the problem of finding corresponding interest points between pairs of
images, it is possible to both reduce the number of mismatches and potentially learn
a geometric mapping between the two images. Amongst many other applications,
these transforms can be used to build panoramas from multiple images. The presence of such a transform is a
good indicator for a match between the two images, and is commonly used in object recognition and image
retrieval applications.
- Inconsistent local feature
matches can be removed by assuming some form of constraint
holds between the two images
- This is usually a geometric mapping
- Affine transform or Homography
- Can be estimated by finding the least-squares solution of a set
of simultaneous equations
- Robust methods such as RANSAC allow
inliers and outliers to
be determined whilst learning the mapping
- Interest point matching is slow…
- K-D Trees and Hashing
can help
- Applying constraints to matching - even best
local features can be mismatched
- Tradeoffs in distinctiveness:
- Can be too distinctive or not enough, not match with subtle
variations or not match at all
- You can estimate which correspondences are inliers
or outliers
- Geometric mappings - assume object is flat,
then can search for geometric mapping satisfied by correct corresponding points.
- It is a transform function that maps x,y coords from one image to the other

- T is transform matrix
- x,y column vectors for coordinates
- Affine Transform - combination of
translation, scaling, aspect ratio, rotation and skewing.
- Affine transforms always preserve parallel
lines
- Standard 2D transform:

- A is 2x2 transform matrix encodes
scale, rotation, skew, and b is translation vector.
- Single matrix multiplication (extra dimension of fixed value
1):

- 3x3 matrix:

- Said to have 6 degrees of freedom.
- Doesn’t allow for perspective effects, as it preserves
parallel lines
- 2D projective transform or (Planar) Homography:

- Normalization by w because non-linear.
- The vector [wx’,wy’,w]T is called a homogeneous coordinate.
- Deal with transform as matrix
- a-f are affine parameters.
- g-h define keystone distortions, make originally parallel lines
come together after transform.
- Solve set of homogeneous linear equations for T
- Common to compute least-squares estimate of transform matrix.
- It minimises the sum-squared error in prediction.
- Error of single point called residual; difference between predicted and observed value.
- Robust estimation - least squares problem
with noise.
- Mismatches can throw off estimated transform
- Need to determine the inliers; we need them for a better transform
estimate
- Number of algorithms one is RANSAC
- Random Sample Consensus
- Algorithm:
- Assume: M data items required to
estimate model T, N data items in total
- Select M data items at random
- Estimate model T
- Find how many of the N data items fit T within tolerance tol, call
this K. points that have an absolute residual less than tol are the inliers; the other points are
outliers.
- If K is large enough, either accept T, or compute least-squares
estimate using all inliers, and exit with success.
- Repeat steps 1..4 nIterations times
- Fail - no good T fit of data
- RANSAC picks some pairs of points randomly, estimates
transform.
- If enough inliers, algorithm stops, transform
re-estimated……
- Improving matching speed - biggest problem
is speed.
- Brute-force of 128-D SIFT takes long.
- Ways to speed it up: K-D Trees
- Binary tree structure
- Each node splits specific dimension of space in two
- Leaf nodes store the (small number of) points that have
made it down the tree to that leaf
- Fast nearest neighbour search done…….
- Hashing
- Locality Sensitive hashing Functions, vectors that are spatially
similar, similar hash codes.
- Sketching - binary string encodings for features,
compared more efficiently
Lec 9
Content-based image retrieval (CBIR) are systems that can search for images
with an image as the query. Research on CBIR systems started in the
early 90’s, but it is only more recently with ubiquitous mobile computing and applications like Google
Goggles that the technology has matured. We’ve seen how (local)
descriptors can be used to find matching objects
within images, but we’ve also seen that the matching process is rather computationally expensive.
For CBIR applications we need to be able to search datasets of millions
of images almost instantaneously. In the field of textual document search, techniques to efficiently index and efficiently search
massive datasets of text documents are well understood. One of the biggest advances in CBIR has been to
apply these textual indexing techniques to the image domain by extracting bags of
visual words from images and indexing these.
- Effective and efficient
text search can be achieved with bags of words, the
vector-space model and inverted
indexes.
- Vector-quantisation can be applied to local features,
making them into visual words.
- Then you can apply all the same
techniques used for text to make efficient retrieval systems
- This is a good way of making highly scalable, effective and efficient
content-based image retrieval systems
- Most text-search systems represent text as a bag of
words
- A bag is an unordered data structure
like a set, but can have elements multiple times.
- Create bag from text document:
- Break into words (tokenisation)
- Process words to reduce variability (eg get rid of
“ing” at end)
- Remove common words (like “the”)
- Vector-space model - text documents
represented by vectors
- Vectors contain counts of frequency of words in the lexicon (the set of all possible words)
- Histograms of word counts, and they are
highly sparse.
- Query turned into vector form, ranked
by similarity.
- Cosine similarity often used, less
affected by magnitude.
- Many documents have 0 similarity.
- Often the cosine similarity is weighted
- Words that appear in many documents should carry less weight
- Common weighting scheme: Frequency-inverse document
frequency (tf-idf)
- In practice vectors never made, BoW just indexed directly as inverted index.
- Map of words to postings lists.
- A posting is a pair containing a document identifier and a word count.
- A posting is only created if the word occurs in that document at least once.
- Can quickly find which documents a word occurs in, and how many times.
- Really efficient to compute (see the sketch below).
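A minimal sketch of an inverted index with tf-idf weighted scoring, assuming plain Python; tokenisation is reduced to a whitespace split and document-length normalisation, stemming and stop-word removal are omitted for brevity.

```python
import math
from collections import Counter, defaultdict

documents = {1: "the cat sat on the mat", 2: "the dog sat", 3: "cats chase dogs"}

# Build the inverted index: word -> postings list of (document id, word count).
index = defaultdict(list)
for doc_id, text in documents.items():
    for word, count in Counter(text.split()).items():   # bag of words
        index[word].append((doc_id, count))

def search(query):
    """Score documents against the query using tf-idf weights and rank them."""
    n_docs = len(documents)
    scores = defaultdict(float)
    for word, q_count in Counter(query.split()).items():
        postings = index.get(word, [])
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))           # rare words weigh more
        for doc_id, count in postings:
            scores[doc_id] += (q_count * idf) * (count * idf)
    return sorted(scores.items(), key=lambda item: -item[1])

print(search("cat sat"))                                 # ranked (doc_id, score) pairs
```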
- Vector-quantisation - lossy data compression technique
- K-Means clustering used to learn fixed size set of representative vectors.
- Represent a vector by another approximate vector drawn from a pool of vectors. Each input vector is assigned to the “closest” vector from the pool.
- Bag of Visual Words (BoVW) - apply text
techniques to computer vision
- A visual word is a local
descriptor vector (e.g. SIFT vector) that has been vector quantised.
- Set of representative vectors is the visual
equivalent of the lexicon (codebook)
- SIFT - each visual word represents a prototypical
pattern of local image gradients
- Set of local descriptors can be transformed to fixed
dimensionality histogram by counting the number of occurrences of each representative vector.
- BoVW Retrieval - use same techniques
for text retrieval.
- Visual words indexed directly, searched by cosine
- Important parameter: size of codebook.
- Indexing is only efficient if the vectors are sparse
- Visual words must be distinctive to minimise mismatching
- Implies you need a large codebook… ~1 million words typically
- Learning that codebook takes a long time.
- And is expensive.
- Find interest points + extract local features from all
images
- Learn codebook from sample of features
- Perform vector quantisation to assign each feature to a
representative visual word
- Construct an inverted index (a sketch of the codebook and histogram steps follows)
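A rough sketch of those indexing steps in Python, assuming SIFT-like 128-dimensional local descriptors and using scikit-learn's MiniBatchKMeans to learn the codebook; the codebook size shown is illustrative, not the ~1 million mentioned above.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(descriptor_sample, n_words=10000, seed=0):
    """Learn the visual-word codebook by clustering a sample of local descriptors."""
    return MiniBatchKMeans(n_clusters=n_words, random_state=seed).fit(descriptor_sample)

def bovw_histogram(codebook, descriptors):
    """Quantise each descriptor to its nearest visual word and count occurrences."""
    words = codebook.predict(descriptors)                 # vector quantisation
    return np.bincount(words, minlength=codebook.n_clusters)  # sparse histogram to index
```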
Lec 10
We’ve looked at how features can be extracted from images, and how supervised machine learning
techniques like linear
classifiers and k-nearest neighbours can be
used to train a computer vision system to predict a class for a particular feature input. Current research
is looking at how we might make computers able to see in the human
sense, fully understanding the content of a visual scene. The choice of feature for image
classification is very important. We saw how a Bag of Visual
Words (BoVW) representation was a powerful technique for image search. It turns out that BoVW representations are also very useful for high-performance image classification.
- Object recognition, scene
classification and automatic annotation are all important tasks in computer vision.
- Researchers are striving to narrow the “semantic gap” between what computers and humans can perceive
- The BoVW approach lends itself
to high-performance image classification
- Performance is increased if the local features are sampled densely
- Typical system takes in image, passes through feature
extractor and eventually feeds features to machine learning system to make
decisions.
- Supervised ML algorithm uses pre-labelled training
data for assigning class labels to vectors
- Binary classifier - 2 classes
- Multiclass classifier - multiple
classes
- Multilabel classifier can predict multiple
labels or classes.
- With probabilities/confidences.
- Often called automatic image annotation
or auto-annotation
- Doesn’t determine where in an image a thing is, just looks for presence
of the thing.
- Object recognition attempts to localise an
object and determine class.
- Current research challenges
- Unconstrained object recognition in natural scenes
- Global classification of images into scene
categories/events/topics.
- Automatic annotation of large sets of imagery.
- Fundamental problem to solve: make computers
see images like humans.
- “Semantic understanding” - overall meaning of
image
- Historically BoVW has been important.
- The histograms it creates are also powerful global descriptors and can drive object detectors
- Auto-annotation can be treated like language translation via a shared semantic space.
- Raw features can be SIFT descriptors, quantised descriptors of segments, or even the colours of pixels.
- Codebook needs to be much smaller for ML, for performance and effectiveness
- Visual words can be less distinctive, tolerating a little more variation
- Number of visual words few hundred, up to a few
thousand.
- Classification improved by sampling image at
greater rate
- Dense SIFT: a densely sampled grid, rather than sampling at interest points
- Pyramid Dense SIFT, Gaussian Scale
Space
- BoVW representations augmented with spatial
pyramid; sub-histograms of visual word occurrences for overlapping windows in
image
- Improves performance, as learns where in image objects likely appear.
- Pyramid Histogram of Words (PHOW)
- Developing / Benchmarking BoVW scene classifier
- Use standardised datasets when
developing/comparing results of classifiers.
- Training and test
data sets
- Only training set used to train classifier
- Test set withheld.
- Raw features extracted from training images.
- Learn a codebook of features using k-means, on a uniform random sample of the features rather than all of them, for speed.
- Apply vector quantisation to raw
features, count the number of occurrences to build histograms of visual words for each image.
- Classifiers trained using histograms
- Might train on subset of training data to “validate” and “optimise” parameters.
- Re-train with all training data.
- Can now apply classifiers to test data:
- Extract features.
- Quantise the features (using codebook)
- Compute occurrence histograms
- Use classifiers to find most likely class
- There are lots of ways to evaluate classification performance on the test set.
- The simplest is accuracy: the proportion of correct classifications to the total number of predictions (see the sketch below).
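A compact sketch of the train-and-evaluate step, assuming the BoVW histograms have already been computed as above. LinearSVC from scikit-learn stands in for whichever classifier the course uses; the accuracy line implements the proportion just described.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_and_evaluate(train_hists, train_labels, test_hists, test_labels):
    """Train a linear classifier on BoVW histograms and report test accuracy."""
    clf = LinearSVC().fit(train_hists, train_labels)
    predictions = clf.predict(test_hists)
    return np.mean(predictions == test_labels)   # correct predictions / total predictions
```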
TL;DR By Mark Towers
Mark (edited by Joshua Gregory)
“These notes may miss something, please say if I have.
Thanks” - Mark
- Lecture 1: Eye and Human vision
- The human eye is a complex machine, and cameras borrow several of its ideas, such as a lens and a retina-like sensor. The image entering the eye is flipped upside down by the lens; the brain flips it back automatically, without conscious effort.
- The retina (the light-sensitive part of the eye) contains rods for low-light, black-and-white vision and cones for colour vision. Cones are most sensitive to green, then red, then blue.
- Mach bands are an optical
illusion whereby the contrast between edges of slightly differing
shades of grey is exaggerated when they are touching.
- Neural networks are an attempt to
replicate the way that the brain works for vision
- Lecture 2: Image formation
- When decomposing an image into bit planes, the most significant bits have the largest influence on the image: the higher the bit position, the more of the image structure it carries.
- Resolution is the number of pixels for the
width and the height of
the image
- Fourier transform: any periodic signal can be decomposed into sine and cosine waves of different frequencies, and the signal can be reconstructed from its Fourier transform. The magnitude is obtained via Pythagoras’ theorem on the real and imaginary parts, and the phase is the angle of the hypotenuse in the complex plane:
|F(u)| = sqrt(Re(F(u))² + Im(F(u))²), arg F(u) = arctan(Im(F(u)) / Re(F(u)))
- Lecture 3: Image sampling
- Aliasing is an effect that causes different signals
to become indistinguishable when sampled. It also refers to the
distortion or artifacts that results when a signal reconstructed from samples is different from the original continuous signal.
- Sampling signals: if an original signal is a
continuous function, then digitally sampling it requires a good sampling
frequency. So the higher the sample rate, the
better representation of the signal can be captured.
- Wheel motion is an example of where sampling
rate matters in vision as if the sample rate and the wheel rotations per second are
synchronous then it can make it seem like the wheel is moving in
reverse.
- Sampling theory, known as Nyquist’s sampling theorem in one dimension, says the sampling frequency must be at least twice the highest frequency present in the signal (pixel or musical sample alike).
- It is possible to reconstruct a signal from its transformed components using the Fourier transform. The transform therefore has a forward direction (from the spatial domain to the frequency domain) and a backward direction called the inverse. Here x and y index space, u and v index frequency, and Px,y is the pixel at (x, y):
- Forward: F(u, v) ∝ Σx Σy Px,y e^(−j2π(ux + vy)/N)
- Inverse: Px,y ∝ Σu Σv F(u, v) e^(j2π(ux + vy)/N) (up to a normalisation constant)
- Shifting an image doesn’t affect the
magnitude of the fourier transform of the image, but it does affect the
phase.
- Rotating the image does affect the
fourier transform and rotates it in the same way.
- Filtering: taking the Fourier transform of an image gives access to its frequency components. A high-pass filter keeps the high-frequency, fine-detail parts of the image; a low-pass filter keeps the low-frequency, smooth content (see the sketch below).
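A small NumPy sketch of frequency-domain filtering as just described; the circular mask and its radius are illustrative choices.

```python
import numpy as np

def low_pass(image, radius=30):
    """Zero out frequency components further than `radius` from the centre of the
    shifted 2D Fourier transform, then transform back to the image domain."""
    F = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    y, x = np.ogrid[:h, :w]
    mask = (y - h / 2) ** 2 + (x - w / 2) ** 2 <= radius ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

# A high-pass filter is the complement: use (1 - mask) to keep only fine detail.
```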
- Additional transform types: Fourier, discrete cosine and Hartley.
- Applications of the 2D fourier transformation:
understanding and analysis of
images, speeding up algorithms through the use of the function,
representation invariant properties (rotation and shift), coding of
images by magnitude and phase
and recognition and understanding of textures.
- Lecture 4: Point operators
- Image histograms are a graph of the
frequency of brightness for an image (global feature)
- An image can be brightened using Nx,y = k × Ox,y + l, where N is the new image, O is the old image, k is the gain, l is the level and x, y are the coordinates.
- Intensity mapping (normalisation) changes the output brightness so that the full output range is used, limited to the minimum and maximum values. With Nmax and Nmin the limits of the new range, and Omax and Omin the maximum and minimum values found in the old image:
Nx,y = (Nmax − Nmin) × (Ox,y − Omin) / (Omax − Omin) + Nmin
- Histogram equalisation improves contrast
in images.
- Equation (mapping via the cumulative histogram, where p(l) is the histogram count for level l and there are N×N pixels):
Ex,y = (Nmax − Nmin) / N² × Σ(l = 0 .. Ox,y) p(l) + Nmin
- Thresholding is a way of turning a
greyscale image into a black-white image by checking if a pixel value is
greater than a threshold value. If true then the pixel
is set to 255 otherwise set to 0.
- Lecture 5: Group operators
- Template convolution applies a kernel matrix turning a group of pixels to a single
value. Template convolution includes coordinate inversion
in the x axis and y axis, however this is not needed if the template is symmetric.
- It is possible to apply template convolution via the Fourier transform: P ∗ T = F⁻¹( F(P) × F(T) ), where F(P) is the Fourier transform of the picture, F(T) is the Fourier transform of the template and × is point-by-point multiplication.
- The Gaussian function is used to calculate the template values; choosing them is a compromise between the variance and the window size. Gaussian averaging then convolves the image with a template whose elements are samples of the Gaussian (see the sketch below).
- There are more advanced ways of doing gaussian function like
non-local means and image ray transform.
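A small sketch of building the Gaussian template and applying Gaussian averaging, assuming NumPy and scipy.ndimage; the 5x5 size and sigma of 1.0 are illustrative.

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_template(size=5, sigma=1.0):
    """Sample the 2D Gaussian at each template element and normalise the
    coefficients so that they sum to one."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def gaussian_average(image, size=5, sigma=1.0):
    """Gaussian averaging: convolve the image with the Gaussian template."""
    return convolve(image.astype(float), gaussian_template(size, sigma))
```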
- Lecture 6: Edge detection
- It is very useful to find the edge of an image.
- First order edge detection: there are three basic differences, one for vertical edges, Exx,y = |Px,y − Px+1,y|, one for horizontal edges, Eyx,y = |Px,y − Px,y+1|, and one combining both, Ex,y = |2Px,y − Px+1,y − Px,y+1|, where Px,y is the pixel at (x, y). A Taylor expansion shows that these differences approximate the first derivatives.
- Edge detection in vector format: the magnitude is M = sqrt(Mx² + My²) and the direction is θ = arctan(My / Mx). Examples are the Prewitt operator and the Sobel operator.
- Lecture 7: Further edge detection
- The canny edge detection
algorithm can be broken down to 5 different steps
- Apply Gaussian filter to smooth the image in order to
remove the noise
- Find the intensity gradients
of the image
- Apply non-maximum
suppression to get rid of spurious
response to edge detection
- Apply double threshold to determine potential edges
- Track edge by hysteresis: Finalize the detection of edges by suppressing all the other edges that are weak and not
connected to strong edges.

- It is formulated with three main objectives.
- Optimal detection with no spurious
responses
- Good localisation with
minimal distance between detected and true edge position
- Single response to eliminate multiple
responses to a single edge.
- Interpolation in non-maximum suppression: interpolate the gradient magnitude along the gradient direction (a small distance either side of the pixel) and keep the centre only if it is larger than both interpolated neighbours.
- Hysteresis thresholding transfer function. There
are 2 threshold values. As a continuous process, when the
brightness goes below the lower threshold, it is set to black, but
it will only be set to white if the brightness goes above the upper
threshold.
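For reference, OpenCV bundles all five Canny steps behind a single call; a possible usage sketch follows, where the file name and the two hysteresis thresholds are illustrative.

```python
import cv2

# Gaussian smoothing, Sobel gradients, non-maximum suppression and
# hysteresis thresholding are all performed inside cv2.Canny.
image = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(image, threshold1=50, threshold2=150)  # lower / upper hysteresis thresholds
cv2.imwrite("edges.png", edges)
```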

- The Laplacian operator is a second-order operator; combined with Gaussian smoothing it gives the Laplacian of Gaussian (LoG), a “Mexican hat” shaped operator with a fairly involved mathematical form. (I really hope you don't have to remember this function - Mark)
- Zero-crossing detection - average and sum the 4
corners around a pixel. If the signs change between summations, you have a detection.
- Marr-Hildreth edge detection - application of
Laplacian of Gaussian (or Difference of Gaussian) and zero-crossing
detection.
- Lecture 8: Finding shapes
- It is important to be able to extract features from an image. One way is to threshold between a low and a high value; this identifies high-contrast parts of the image but does not by itself identify shapes.
- Template matching: a technique for finding small parts of an image that match a template image. It uses correlation or convolution, and can be implemented via the Fourier transform.
- Template matching doesn’t work with noisy
images or for occluded images (obstructions).
- Hough transform is a feature extraction
technique that can find imperfect instances of objects within a certain class of shapes by
a voting procedure. The voting procedure is carried out in a
parameter space from which object
candidates are obtained as a local maxima in
a so-called accumulator space. The Hough transform has performance equivalent to template matching but is faster to compute. Basically, straight lines in the image are represented as points in the m,c plane and vice versa.
- Lecture 9: Finding more shapes
- Hough transform can be used for not only lines
but for circles or arbitrary
shapes. To do so it uses a generalised Hough transform, which forms a discrete look-up table called the R-table and votes via that table. Orientation is handled by rotating the R-table vectors during voting, and scale by scaling them. This has inherent problems with discretisation (the process of transferring the continuous functions to the discrete voting table).
- Lecture 10: Applications/Deep learning
- Computer vision is used in industry
for quality control or in academia
for face recognition. Nowadays, computer vision uses
modern hardware and modern cameras
to achieve what we understand by “sight”.
Currently no single technique works for all problems, and this remains an active area of research in both industry and academia.
Jon
-
Lecture 1: Building machines that
see
- Key terms in designing computer vision systems: you want the system to be robust and repeatable, the design to be invariant to environmental factors, and constraints to be applied so that it works.
- Robust - Changes to the environment will not affect the accuracy of the system
- Repeatable - A measure of robustness,
such that the system must work the same over and over regardless of the environmental changes
- Invariant - A property of the system whereby its output does not change with some environmental factor; this helps achieve robustness and repeatability. Hardware and software can be designed to be invariant to certain environmental changes, such as illumination changes.
- Constraints - Are applied to the
hardware, software and wetware (humans in the system) to make sure the vision system works in a
repeatable and robust fashion. An example is putting the system in a box so there can’t be any
illumination changes.
- For industry vision, there are constraints
applied
- Software constraints: simple, fast algorithms are the most popular, as they have a smaller chance of failure than deep learning approaches. The Hough transform is therefore a popular algorithm, but it requires physical constraints to make it robust. The choice of colour-space is also an important software decision, as some colour-spaces do not cover the whole of human colour vision or do not encode the luminance of a colour.
- Physical constraints - In industry
environments, it is possible to constrain the physical environment e.g. lighting, enclosure, mounting,
camera specs, optics and filters.
- However, in the wild many of these constraints cannot be guaranteed, so as many hardware and wetware constraints as possible are applied and the software takes up the slack, e.g. colour information is often less important than luminance.
-
Lecture 2: Machine learning for pattern
recognition
- Computer vision applications normally take the form image → feature extractor → machine learning → results. The feature extractor converts the raw image into a feature vector, which is much easier for a machine learning algorithm to learn from than the raw image.
- Feature vector are just a mathematical vector with a fixed
number of elements in it where the number of elements is the dimensionality of the vector.
Each element represents a point in a featurespace or
equally a direction in the featurespace. Therefore the dimensionality of a featurespace is equal
to the dimensionality of every feature vector in it.
- An aim of feature extractor is that they produce vectors that are similar
to each other for similar images. This similarity is measured by finding the
distance between the two feature vectors. There are different measures of
distance
- Euclidean distance (L2 distance) - The straight-line distance between two points, computed via an extension of Pythagoras’ theorem to n dimensions: d(a, b) = sqrt(Σi (ai − bi)²)
- Manhattan/Taxicab distance (L1 distance) - The distance along paths parallel to the axes of the space: d(a, b) = Σi |ai − bi|
- Cosine similarity - Not a distance but the angle between the two vectors: cos θ = (a · b) / (‖a‖ ‖b‖). Useful if the relative lengths of the vectors don’t matter. (A sketch of all three follows.)
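A minimal NumPy sketch of the three measures above.

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))        # L2: straight-line distance

def manhattan(a, b):
    return np.sum(np.abs(a - b))                # L1: distance along the axes

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based, not a distance
```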

- Choosing a good feature vector representation matters: better feature vectors make it easier to distinguish the objects or classes of interest. Machine learning also becomes harder as the number of features (the dimensionality) grows, so the smaller the number of features the better.
- Classification is a supervised machine learning task in which a computer “learns” to assign class labels to an object (normally the feature vector of an image). This requires a pre-labelled dataset of inputs and their correct outputs. Binary classifiers have only two classes, while a multiclass classifier has many.
- Linear classifiers are a type of binary classifier that tries to learn a hyperplane separating the two classes in a featurespace with minimum error. (Do Foundations of Machine Learning if you want to learn the specifics and cry a lot). To classify a new image, place its feature vector in the featurespace and check which side of the hyperplane it falls on; that determines its class.
- Some data is not linearly separable, so non-linear binary classifiers such as kernel support vector machines are used. These can learn non-linear decision boundaries through the kernel trick (Foundations of Machine Learning has the mathematical details). The risk is that the algorithm loses generality by overfitting, since the decision boundary is no longer limited to a straight line.
- KNN (k-nearest neighbours) is a multiclass classifier that assigns an unknown point the majority class of its K closest neighbours in the featurespace: for a new point, the K nearest training points are found and the majority class among them determines the new point’s class. This is computationally expensive if there are lots of training examples or many dimensions.
- K-means clustering is an unsupervised machine learning algorithm that aims to group data into classes without any prior knowledge (or labelled data). In terms of feature vectors, items with similar vectors should be grouped together by the clustering operation. Some clustering operations can produce a probabilistic assignment instead of making a discrete choice. It follows the algorithm below (a sketch follows the list):
- The value of K (the number of clusters) is chosen
- The initial cluster centres (called the centroids) are chosen
- Until the centroids stop moving between iterations:
- Each point is assigned to its closest centroids
- The centroids are recomputed as the mean of all of the points
assigned to it. If the centroid has no points assigned it to, it is randomly re-initialised to a new
point.
- The final clusters are created by assigning all of the points to
their nearest centroids.
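A minimal NumPy sketch of the algorithm just listed; the seeding and the convergence test are simplifications.

```python
import numpy as np

def k_means(points, k, n_iterations=100, seed=0):
    """Assign each point to its closest centroid, recompute each centroid as the
    mean of its assigned points, and stop once the centroids settle."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iterations):
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)               # nearest centroid per point
        new_centroids = centroids.copy()
        for i in range(k):
            members = points[assignments == i]
            if len(members):
                new_centroids[i] = members.mean(axis=0)
            else:                                            # empty cluster: re-initialise randomly
                new_centroids[i] = points[rng.integers(len(points))]
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assignments
```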
-
Lecture 3: Covariance and Principal
components
- Variance is the mean squared difference from the mean, a measure of how spread out the data is. It can be written as Var(x) = E[(x − E[x])²], where E[·] is the expectation (basically the mean if all outcomes are equally probable).
- Covariance measures how two variables change together: Cov(x, y) = E[(x − E[x])(y − E[y])]. Variance is the covariance of a variable with itself, Var(x) = Cov(x, x).
- The covariance matrix encodes how all possible pairs of dimensions in an n-dimensional dataset covary; it is a square, symmetric n×n matrix whose (i, j) entry is Cov(xi, xj).
- Mean centring computes the mean vector of a set of vectors (the mean of each dimension independently) and subtracts it from every vector in the set, translating the data so that its mean position is the origin.
- Principal axes of variation - This
can be computed using this algorithm:
- Basis is a set of n linearly independent vectors in an n-dimensional space. This means that all of the vectors are orthogonal forming a
coordinate system.
- The first principal axis is the vector that
describes the direction of greatest variance for a given set of
n-dimensional data.
- All of the following principal axes are orthogonal (perpendicular) to all of the previous axes, e.g. the second principal axis is orthogonal to the first, and the third is orthogonal to the first and second.
- Eigenvectors and eigenvalues are found by solving the equation Av = λv, where A is an n×n square matrix, v is an n-dimensional vector (the eigenvector) and λ is a scalar value (the eigenvalue). Eigenvectors and eigenvalues have the following properties:
- At most n eigenvector-value pairs
- If A is symmetric, the set of eigenvectors is
orthogonal
- If A is a covariance matrix, the eigenvector are the
principal axes
- The eigenvalues are proportional to the variance of the data
along each eigenvector.
- If all of the eigenvectors are independent then the sorted list
of eigenvectors is equal to PCA axes.
- Eigendecomposition (A = QΛQ⁻¹) factorises a matrix into a matrix of eigenvectors Q and a diagonal matrix Λ of eigenvalues. If A is real and symmetric (e.g. a covariance matrix) then A = QΛQᵀ, i.e. the eigenvectors are orthogonal. This allows the principal components to be found.
- A linear transform projects data from one space into another, possibly with fewer dimensions than the original (e.g. y = Wx). The process can be reversed if W is invertible, which is not possible if the dimensionality has been reduced.
- Principal component analysis (PCA) is an orthogonal linear
transform that maps data from its original space to a space defined by the principal axes of the
data.
- The transform matrix W is just the eigenvector
matrix Q from the eigendecomposition of the covariance matrix of the data.
- Dimensionality reduction can be achieved by removing the eigenvectors with low eigenvalues from Q (i.e. keeping the first L columns of Q, assuming the eigenvectors are sorted by decreasing eigenvalue).
- The PCA algorithm reduces the dimensionality of the featurespace, making the data easier for machine learning algorithms to learn from. Fewer dimensions also gives greater robustness to noise and misalignment, while the dominant features are captured. (A sketch follows the steps.)
- Mean-centre the data vectors
- Form the vectors into a matrix Z, such that each row corresponds to a vector
- Perform the eigendecomposition of the covariance matrix of Z to recover the eigenvector matrix Q and the diagonal eigenvalue matrix Λ
- Sort the columns of Q and the corresponding diagonal values of Λ so that the eigenvalues are decreasing
- Select the L largest eigenvectors (the first L columns of Q) to create the transform matrix W
- Project the original vectors into the lower-dimensional space: Y = ZW
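A minimal NumPy sketch of those steps; np.linalg.eigh is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca(data, n_components):
    """Mean-centre, eigendecompose the covariance matrix, keep the L eigenvectors
    with the largest eigenvalues, and project the data onto them."""
    mean = data.mean(axis=0)
    Z = data - mean                               # one mean-centred vector per row
    eigenvalues, Q = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigenvalues)[::-1]         # sort by decreasing eigenvalue
    W = Q[:, order[:n_components]]                # transform matrix: first L columns
    return Z @ W, W, mean                         # projected data, transform, mean
```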
-
Lecture 4: Types of image feature and
segmentation
- There are four main ways of extracting
features from an image
- Global - features are extracted from the contents
of an entire image through image histograms and joint-colour
histogram.
- Grid/block-based - The entire image split into blocks with a feature extracted from each block. This allows for multiple feature vectors to be extracted.
- Region-based - The entire image is split into a number of regions and a feature is extracted from
each allowing for multiple feature vectors to be extracted.
- Local - The interest points
of the image are detected with a feature vector extracted from the surrounding pixels of each interest
points.
- Image histograms - A simple global feature that builds a histogram of pixel intensities (by averaging the colour band values). However, this isn’t particularly robust and can’t deal well with multiple colours in the image.

- Joint-colour histogram - The colour space is quantised into bins and we accumulate the number of pixels in each bin. This is technically a multidimensional histogram (one dimension per colour band) which can be flattened (unwrapped) into a feature vector.
- Image segmentation - The process of creating region-based descriptions, basically
creating groups (segments) of pixels that are “similar”. This is normally pixels that share
certain visual characteristics (colour, lighting, etc).
- Global binary thresholding - Takes a
greyscale image and assigns all pixel with a value
lower than a predetermined threshold to one segment with all other pixels to the other
segment. This is really fast but requires a manually set static threshold, making it not robust to lighting changes. It works well in applications with lots of physical constraints.
- Otsu’s thresholding method - Provides a
way to automatically find the threshold assuming that there are only two
classes (i.e. foreground and background), so the histogram must have two peaks. An exhaustive search is performed for the threshold that maximises the inter-class variance.
- Adaptive/local thresholding - computes a different threshold value for every pixel in an image based on the surrounding
pixels. This is usually a square or rectangular window around the current pixel
to define the neighbours.
- Mean adaptive thresholding - Set the current pixel to zero if its value is less than the mean of its neighbours plus a constant offset, otherwise set it to 1. This relies on two parameters: the size of the window and the constant offset. It deals well with uneven lighting/contrast but is computationally expensive (compared to global methods), and the window size can be difficult to choose, since a change in the scale of an object can break it (a sketch follows).
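A small sketch of mean adaptive thresholding, using a uniform (box) filter from scipy.ndimage for the local mean; the window size and offset are the two parameters mentioned above and the values here are illustrative.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def mean_adaptive_threshold(image, window=15, offset=5):
    """Pixel -> 0 if darker than the local mean plus a constant offset, else 1."""
    local_mean = uniform_filter(image.astype(float), size=window)
    return (image >= local_mean + offset).astype(np.uint8)
```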
- Segmentation with K-means - Cluster the colour vectors [r, g, b] of all the pixels, assigning each pixel to a segment based on the closest centroid. This works best if the colour-space and distance function are compatible (e.g. the Lab colour-space is designed so that Euclidean distances are proportional to perceptual colour differences). It doesn’t preserve continuity of segments, so a single pixel might end up alone. Also encoding the pixel position is therefore helpful, but position should be normalised by the width and height of the image to remove the effect of different image sizes, and x and y can be scaled so they have more or less effect than the colour components.
- Pixel connectivity - A pixel is connected to another if they are spatially adjacent. There are two standard definitions: 4-connectivity (the four cardinal neighbours) and 8-connectivity (cardinal plus diagonal neighbours).
- Connected component labelling - The process of
detecting connected regions within a binary (segmented) image. There are a lot of different algorithms that have a variety of performance tradeoffs
(memory vs time). An example algorithm is the two-pass algorithm:
- On the first pass, iterate through each element of the data by column then by row (raster scanning)
- If the element is not the background
- Get the neighbouring elements of the current element
- If there are no neighbours, uniquely label the current
element and continue
- Otherwise, find the neighbour with the smallest label and
assign it to the current element
- Store the equivalence between neighbouring labels
- On the second pass, iterate over the data in the same way
- If the element is not the background
- Relabel the element with the lowest equivalent label
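In practice a library routine is usually used rather than hand-writing the two-pass algorithm; a sketch with scipy.ndimage.label, which returns a label image and the number of components, follows. The tiny test array is illustrative.

```python
import numpy as np
from scipy.ndimage import label

binary = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [0, 0, 1]])

labels4, n4 = label(binary)                              # 4-connectivity (default)
labels8, n8 = label(binary, structure=np.ones((3, 3)))   # 8-connectivity
print(n4, n8)                                            # 2 components vs 1
```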
-
Lecture 5: Shape description and
modelling
- It is possible to extract features from shapes
by connected pixels. There are two types of pixel borders: inner
and outer.
- The inner border is the set of outermost pixels within the shape
- The outer border is the set of pixels forming the outline just outside the shape.
- There are two ways of describing a shape: Region description and
boundary description.
- Region description: A simple scalar shape features
- Perimeter = Length around the outside of
the component
- Area = Number of pixels in
component
- Compactness = How tightly packed the pixels are, often computed as the weighted ratio of area to perimeter squared: C = 4πA / P².
- Centre of mass = The mean of x and
y positions of all the pixels in the component
- Irregularity/Dispersion = How “spread out” the shape is. This is computed in two different ways: one is the ratio between the distance of the furthest boundary point from the centre of mass and the area of the shape; the other is the ratio between the distances of the furthest and the closest boundary points from the centre of mass.

- Moment = Describes the distribution of the pixels of a shape in a greyscale image, mpq = Σx Σy x^p y^q I(x, y), with order p and q and pixel intensity I(x, y). For connected components the intensity term is simply 1. These equations can be used to calculate the centre of mass: x̄ = m10/m00, ȳ = m01/m00. Basic moments are not invariant to translation, rotation or scaling, but other moments exist:
- Central moments = Translation invariant, as they use the centre of mass: μpq = Σx Σy (x − x̄)^p (y − ȳ)^q I(x, y)
- Normalised central moments = Both scale and translation invariant: ηpq = μpq / μ00^(1 + (p+q)/2)
- It is important to encode the boundary of a
component, this is done in two ways: chain codes and chain code invariance.
- Chain codes: walk around the boundary and encode the direction taken at each step as a number. The code is cyclically shifted so that it forms the smallest possible integer value, making it invariant to the starting point.
- Chain code invariance: This can be made
rotation invariant by encoding the differences in direction rather than absolute values. It can also be
made scale invariant by resampling the components to a fixed size (however this doesn’t work well
in practice).


- Region adjacency graph - Builds a graph from a set of connected components where each node corresponds to a component and an edge (connection) between nodes exists if the components share a border. This can be used to detect patterns in the graph, making it invariant to non-linear distortions but not to occlusions.
- Active shape models and constrained local models
(ASM/CLM) extends Point Distribution Model (PDM) by learning the local
appearance around each point which is typically just an image template. Using a constrained optimisation
algorithm, the shape can be optimally fitted to an image by using constraints.
-
Lecture 6: Local interest points
- Interest points - Points within an
image that are “important”. Good interest points have:
- Invariance to brightness changes
- Sufficient texture variation in the local neighbourhood
- Invariance to changes between the angle / position of the
scene to the camera.
- There are lots of different types of interest point types to
choose from.
- Corner detection - Using a small window, the idea is that at a corner there will be a significant change in all directions, whereas at an edge there is no change along the edge direction, and in a flat region there is no change in any direction. This depends on measuring the change in intensity between a window and a shifted version of it. TODO maths
- Blob detection (also called Difference-of-Gaussian Extrema) - TODO
-
Lecture 7: Local features and
matching
- Local features are used for: image alignment, object
recognition, index and database retrieval, motion tracking, 3D reconstruction.
- Local features are found by identifying all of the interest
points in an image with a feature vector being extracted from interest points from the surrounding
pixels.
- There are two distinct types of matching
problem: narrow-baseline stereo and wide-baseline stereo.
- Narrow-baseline stereo - When the local features in two images have only moved by a few pixels. Because of this, the matcher doesn’t need robustness to rotation or lighting changes, and the search can be optimised to a small window area, so the detector doesn’t need to identify a large number of points (low descriptiveness is acceptable).
- Wide-baseline stereo - When the local features in two images have moved by large amounts (the feature has moved, rotated, etc. in the image). Because of this, the descriptor needs to be robust to intensity changes and invariant to rotation. It also needs to be highly distinctive, since the detector identifies a large number of interest points in the image and mismatches must be avoided (but not so distinctive that you can’t find any matches). It must also be robust to small localisation errors of the interest points: the descriptor should not change much if the point moves by a few pixels, but should change more rapidly once it moves further away.
- There are a number of problems with wider baselines: not
robust to rotation, sensitive to localisation of interest points (small search window helps), wider
baseline can't assume a search area (therefore requires considering all of the interest points
between the two images).
- Instead of using pixel patches to identify interest points,
local histograms can be used instead. To match each interest point in the first image
to the most similar point in the second image (i.e. in terms of euclidean distance between the
histograms). But this has the following problems:
- Not necessarily very distinctive, as many interest points are likely to have a similar distribution of grey values.
- It is not rotation invariant if the sampling window is square or
rectangular but can be overcome by using a circular window.
- It is not invariant to illumination
changes
- It is sensitive to interest point
localisation, the point where the interest point is identified.
- In order to overcome localisation sensitivity:
- Allow the window to move a few pixels in any direction without
changing the descriptor.
- By applying a weighting so that the pixels near the
edge of the window have a smaller effect than the points in the centre. Commonly
this is done using the gaussian weighting centred on the interest point for this.
- In order to overcome the lack of illumination invariance, the pixel patches can potentially be normalised or equalised before constructing the histogram.
- Local gradient histograms is an alternative to pixel histograms
by finding the partial derivatives of an image (e.g. applying convolution with Sobel).
So instead of building histograms of the raw pixel values we could instead build histograms that encode
the gradient magnitude and direction for each pixel in the sampling patch. As gradient magnitudes (and directions) are invariant to brightness change plus gradient
magnitude and direction histograms are more distinctive than local pixel histograms. This
is built by quantising the direction (0 to 2π radians) into a number of bins (usually 8). For each pixel in the sample patch,
accumulate the gradient magnitude of that pixel in the respective orientation bin.
- Gradient histograms are not naturally rotation invariant however they can
be made invariant by finding the dominant (bin with the greatest value)
orientation and cyclically shifting the histogram so the dominant orientation is the first
bin.
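A rough NumPy/scipy sketch of a magnitude-weighted gradient-direction histogram with the dominant-orientation shift just described; the 8 bins and Sobel derivatives are common choices, but the details here are illustrative.

```python
import numpy as np
from scipy.ndimage import sobel

def gradient_histogram(patch, n_bins=8):
    """Histogram of gradient directions (weighted by magnitude), cyclically
    shifted so that the dominant orientation falls in the first bin."""
    dx = sobel(patch.astype(float), axis=1)
    dy = sobel(patch.astype(float), axis=0)
    magnitude = np.hypot(dx, dy)
    direction = np.arctan2(dy, dx) % (2 * np.pi)                  # 0 .. 2*pi
    bins = (direction / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)
    return np.roll(hist, -int(hist.argmax()))                     # rotation invariance
```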
- SIFT (Scale Invariant Feature Transform) is a widely used feature that
builds on the idea of local gradient histogram by incorporating spatial binning which create multiple
gradient histograms about the interest point and appends them all together into a longer feature.
Standard SIFT geometry appends a spatial 4x4 grid of histograms with 8 orientations. This leads to a
128-dimensional feature that is highly discriminative and robust. In order to match SIFT features:
- Take the SIFT feature vector of an interest point in the first image and compare it against all of the feature vectors of the second image; the Euclidean distance is an easy way of finding the closest feature vectors. A distance threshold can be used to reject poor matches, but this doesn’t work well and results in lots of mismatches.
- An improved matching method is to take each feature from the first image and find the two closest features in the second image. Only form a match if the ratio of distances between the closest and second-closest matches is less than a threshold (usually 0.8, meaning the distance to the closest feature must be at most 80% of the distance to the second closest). This leads to a more robust matching strategy (see the sketch below).
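A minimal sketch of that ratio test with NumPy, assuming features1 and features2 are arrays of 128-dimensional SIFT descriptors (one row per interest point).

```python
import numpy as np

def match_sift(features1, features2, ratio=0.8):
    """Keep a match only if the closest descriptor in the second image is at most
    `ratio` times the distance of the second closest (the ratio test)."""
    matches = []
    for i, f in enumerate(features1):
        distances = np.linalg.norm(features2 - f, axis=1)
        nearest, second = np.argsort(distances)[:2]
        if distances[nearest] < ratio * distances[second]:
            matches.append((i, int(nearest)))
    return matches
```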
-
Lecture 8: Consistent matching
- Even the most advanced local features can be prone to mismatches, as there are always tradeoffs in feature distinctiveness: if a feature is too distinctive it will not match under subtle variations due to noise or imaging conditions, but if it’s not distinctive enough it will match everything.
- Geometric mapping can be thought of as a
transform function that maps the x, y coordinates of points in one image to
another.
- Point transforms - take the form x' = Tx, where x' is the transformed coordinate, T is the transform matrix and x is the original coordinate.
- Affine transform - takes the form x' = Ax + b, allowing translation, rotation, scaling, changes of aspect ratio and shearing. It has 6 degrees of freedom, corresponding to the number of parameters.
- Projective transformation (planar homography) - Uses a 3x3 matrix as the transformation, with the coordinate vector being [x, y, 1].
- It is possible to recover the transform matrix from a transformed image and the original image by finding a set of point matches. A homography needs at least 4 point matches and an affine transform needs 3. The transform is solved by minimising the least-squares difference between the predicted point positions and the actual ones.
- RANdom SAmple Consensus (RANSAC) is an
iterative method to estimate parameters of a mathematical model for a set of
observed data that contains outliers. The algorithm works in two stages: the first stage generates a candidate model from a random minimal sample of the data, and the second stage measures how far each data point is from that model. If a data point’s residual is greater than a threshold value, the point is an outlier, otherwise it is an inlier. This is repeated a set number of times, keeping the model with the most inliers.
- However there are problems with direct local
feature matching, as the algorithm is slow. A typical image (800x600) has ~2000 difference-of-Gaussian interest points / SIFT descriptors, each with 128 dimensions, so matching a query feature against an image requires on the order of 2000 × 128 operations.
- K-D Trees - A binary
tree structure that partitions the space along axis-aligned hyperplanes. This typically
takes each dimension in turn and splits on the median of the points in the enclosing partitions but
stops after a certain depth or when the number of points in a leaf is less than a threshold. To search,
just walk down the tree until a leaf is hit and brute-force search to find the best in the leaf. However
this doesn’t scale well to high dimensions, as you tend to end up needing to search most of the tree. There are approximate versions that do scale, although they won’t necessarily return the exact answer (see the sketch below).
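A usage sketch with scipy.spatial.cKDTree; the random descriptors stand in for real SIFT vectors, and the eps argument enables the approximate search mentioned above.

```python
import numpy as np
from scipy.spatial import cKDTree

descriptors = np.random.rand(2000, 128)            # stand-in for one image's SIFT descriptors
tree = cKDTree(descriptors)

query = np.random.rand(128)
dist, idx = tree.query(query, k=1)                 # exact nearest neighbour
dist_a, idx_a = tree.query(query, k=1, eps=1.0)    # approximate: trades accuracy for speed
```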
- Locality Sensitive Hashing (LSH) -
Creates hash codes for vectors such that similar vectors have similar hash codes
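A minimal random-hyperplane LSH sketch in NumPy: each bit of the code records which side of a random hyperplane a vector lies on, so nearby vectors tend to share codes. The number of bits is an illustrative choice.

```python
import numpy as np

def lsh_codes(vectors, n_bits=32, seed=0):
    """Binary hash codes: one bit per random hyperplane, one code per vector."""
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((vectors.shape[1], n_bits))
    return (vectors @ hyperplanes > 0).astype(np.uint8)
```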
-
Lecture 9: Image search and Bags of
Visual words
- Bag data structure - An unordered
data set but allows elements to be inserted multiple times (sometimes called a multiset or counted set).
- Text processing is a way of doing feature extraction of
text by splitting a sentence into words (tokenization)
with optional stop-words removal and stemming/lemmatization. The tokenized words are packaged into a bag called a
bag of words with a count of each word. Conceptually, this lets language be handled with vector-space models: documents can be modelled as vectors, queried, and compared to evaluate how “similar” two documents are.

- Bag of words vectors uses a lexicon or vocabulary as the set of all (processed) words across all documents.
A vector is created for each document with as many dimensions as there are words in the lexicon. Each document vector is a histogram of the word occurrences in the document; it has a very high number of dimensions but is very sparse.
- Inverted indexes are a way of mapping words to
documents. A posting is a pair formed by a document
ID and the number of times the specific
word appeared in that document.
- A more efficient way of computing the similarity of two documents is to compute the cosine similarity using the inverted index; the smaller the angle, the more similar the documents.
- However doing this assumes that every word in the
document has equal importance which is normally not true as if a term is frequent in a document but rare
across other documents, it is probably more important for that document. There are
multiple weighting schemes
- Binary weights - Only presence (1) or
absence (0) of a term recorded in vector
- Raw frequency - The frequency of
occurrence of term in document included in vector
- TF/IDF (Term frequency / Inverse document frequency) - Provides high values for rare words and low values for common words by the term
frequency being the frequency count of a term in the document and the inverse document frequency being
the amount of information the word provides.
- Vector quantisation is a lossy data compression
technique. Given a set of vectors, a technique like K-Means clustering can be used to learn
a fixed size set of representative vectors where the set of representation vectors is called a codebook.
Vector quantisation achieves this by representing a vector by another
approximate vector that is drawn from a pool of representative vectors. Each
input vector is assigned to the closest vector from a pool.
- SIFT visual words use vector quantisation as a way of replacing
each descriptor (interest point) by a representative vector known as a visual word. This is a way of simplifying a patch of pixels to a feature vector of a cluster.
- It is possible to create a histogram of bag of
visual words by generating a codebook of all the visual words in the image. This is a way
of aggregating a variable number of local descriptors into a fixed length
vector and allows applying techniques for text retrieval to images.
- An important parameter in building a visual word representation is the size of the codebook: if it is too small, all vectors look the same (not distinctive enough); if it is too big, the same visual word might never appear across images (too distinctive).