Registers that programmers can freely use like variables in Assembly
They usually have names like R1, R2, R3 ...
General purpose registers are a subset of user visible registers. They store things like the accumulator (the result of a mathematical operation).

There are typically between 8, 16, 32 or 128 GP registers.
They take up processor silicon area, so we can’t have too many of them.

Condition code registers are also a subset of user visible registers.

Each bit of a condition code register is a flag, e.g. one bit could mean last operation was zero, another bit could mean overflow etc.
It can be read, but it can’t be written to by programs.

Control registers

Registers that are not accessible to the programmer. An example would be the instruction register (it holds the current instruction) or the memory address register (stores addresses fetched / to be stored).
The only control register that’s visible to assembly code is the program counter.

Assembler, C -> ARM

C	ARM Assembly
main() { int a,b,c[50]; b = 2; for( a= 0; a < 50; a++) c[a] = a * b; }	mov r3, #2 str r3, [fp, #-16] mov r3, #0 str r3, [fp, #-20] b .L2 .L3: ldr r1, [fp, #-20] ldr r2, [fp, #-20] ldr r3, [fp, #-16] mul r0, r3, r2 mvn r2, #207 mov r3, r1, asl #2 sub r1, fp, #12 add r3, r3, r1 add r3, r3, r2 str r0, [r3, #0] ldr r3, [fp, #-20] add r3, r3, #1 str r3, [fp, #-20] .L2: ldr r3, [fp, #-20] cmp r3, #49 ble .L3 sub sp, fp, #12 ldmfd sp, {fp, sp, pc}

Data flow

Fetch:

Program counter (PC) contains address of next instruction
Address moved to Memory Address Register (MAR)
Address placed on address bus
Control unit requests memory read
Result placed on data bus, copied to MBR, then to IR
Meanwhile PC incremented by 1

Execute:

Depends on the instruction being executed, but it may include:

Memory read/write
Input/Output
Register transfers
ALU operations

What happens during an interrupt?

Registers like PC are loaded onto a stack
Stack pointer is loaded to MAR and MBR written to memory
PC is redirected to the interrupt handling routine
Once interrupt has been handled, registers are loaded back from the stack
Processing continues like normal

Computer interfaces

Types of input

Sound
Temperature
Touch
Motion (IR etc.)
Light, images
Magnetism
Acceleration

Analogue inputs

Analogue inputs are things like light, sound, temperature etc.
They can be mostly expressed like a wave
Computers are purely digital; they cannot read pure analogue
Therefore, to get analogue input, we must convert analogue to digital

Converting analogue to digital

Transducer: converts one form of energy into another, e.g. light into electrical
Converter: converts analogue voltage into digital input
Analogue-to-digital converter is referred to as ADC
Digital-to-analogue converter is referred to as DAC

Quantisation

To ‘quantise’ something means to break it up into ‘steps’.
Example:

This red sine wave is of analogue format. However, the blue wave is a ‘quantised’ version of the red sine wave because it’s broken up into steps, or ‘quantities’.
The left blue wave is more accurate than the right blue wave because the steps are smaller. The smaller the steps, the more accurate the conversion is.
This ‘quantised’ wave can be represented digitally. This is how analogue waves are converted into digital formats.

How does ADC work?

All ADCs use one or more comparators, which take in analogue inputs and produces a digital output depending on the input
Example:

If voltage 0v to 1v is found, comparator 1 produces output
If voltage 1v to 2v is found, comparator 2 produces output
If voltage 2v to 3v is found, comparator 3 produces output etc.
By doing this, depending on which comparator is outputting, the converter will know the current range of the analogue input, and can store it digitally.

DAC

Converts digital signals to analogue by estimating curves given certain heights.
Quantises the same way that ADC does, except it goes the other way round.
Example:

DAC could quantise 5v to about 20 mV

CISC and RISC

Architectures

Von Neumann Architecture

Programs and data are stored together in the same space

Harvard Architecture

Programs and data are stored in separate memory locations

RISC

Reduced Instruction Set Computer

Instructions have a fixed length, each executed in a single clock cycle
Therefore, can do pipelining to achieve one-instruction-per-one-clock-cycle
Simpler instructions, increase clock speed, no microcode
More minimalistic
Operations are performed on internal registers only.

Only LOAD and STORE instructions access external memory

Families that use RISC:

MIPS
SPARC
DEC Alpha
Power PC (IBM)
PA-RISC
Itanium i860/i960
88000 (Motorola)
ARM

CISC

Complex Instruction Set Computer

Old binary code can run on newer versions
Complex instructions, can do more in one instruction
More advanced
Uses microcode

Microcode - hardware-level instructions that implement higher-level instructions
one program instruction takes multiple cycles

Variable-length instructions to save program memory
Small internal register sets compared to RISC
Complex addressing modes, operands can exist in external memory or internal registers

Pipelining and branch prediction

Pipelining

You can overlap the following operations:

Fetch instruction
Decode instruction
Calculate operands
Fetch operands
Execute instructions
Write output (result)

Like this:
However, if there is a branch in the pipeline (like an if statement), the pattern will have to break, similar to this:
RISC only has instruction fetch, instruction decode, execute and memory access, so pipeline stalls on RISC are less drastic.

Dealing with branches

Multiple Streams

You could have two pipelines, and prefetch each branch into a separate pipeline.
So when you reach that if statement, you have two paths (pipelines) to go through.
Advantage:

It doesn’t break the pattern of pipelining

Disadvantage:

It uses up a lot more memory
If there are branches in those pipelines, you’ll need even more pipelines

Prefetch Branch Target

Fetches where the program will branch to ahead of time (called the target of branch)
Keeps target until branch is executed

Loop buffer

Stores the previous few instructions in very fast memory buffer (possibly cache)
Maintained by fetch stage of pipeline
Check buffer before fetching from memory
This is used for executing small loops very quickly

Branch prediction

There are 4 ways of performing branch prediction:
1) Predicts never taken, assumes that jump will not happen
2) Predicts always taken, assumes that jump will happen
3) Predict by opcode

Some instructions are more likely to jump than others
Is up to 75% accurate

4) Taken / Not taken switch

Based on previous history
Good for loops
2-bit predictor:

Remembers how often jumps are made
Makes future choices based on the past

Delay slot

If a program branches elsewhere, the pipelines may still continue going on the wrong path, executing instructions called ‘delay slots’, while the other pipelines are busy fetching and decoding the correct branch.
This is to avoid branch hazards. Delay slots exist because the pipelines must be full of instructions at all times. This is a side-effect of pipelined architectures.

SSE

Flynn’s classifications

SISD - Single Instruction Single Data

Performing one instruction on one register
Example: pipeline processors

SIMD - Single Instruction Multiple Data

Performing one instruction on lots of data values
Example: GPUs are optimised for SIMD techniques

MISD - Multiple Instruction Single Data

Performing multiple instructions on one register
Example: Systolic array, which is a network which takes in multiple inputs and uses processor nodes to achieve a desired result, and works similar to a human brain.

MIMD - Multiple Instruction Multiple Data

Performing multiple instructions on lots of data values
Example: Multiple CPU cores

SSE

SSE stands for Streaming SIMD Extensions
It is an extension to the x86 architecture and was designed by Intel.
It adds around 70 new SIMD instructions, used for things like:

image processing
video processing
array/vector processing
floating-point maths
text processing
general speed-up (depends on application)

How SSE works

You put smaller values into one big register, then perform the SSE instruction. Then, it will perform a certain calculation on the big register, and in turn, will perform the same calculation on multiple smaller values.
Example:

You put 8 x 16 bit values into one 128 bit register
You put another 8 x 16 bit values into another 128 bit register
Perform the ‘add’ instruction
The result will be another 128 bit register, but with all the added 16 bit values inside it.
You’ll now have 8 x 16 bit results in that 128 bit result!
This way, you don’t need 8 add instructions; you only need 1.

Problems with SSE:

Overflow can wrap around

255 + 2 = 1

Overflow can limit

255 + 2 = 255

Superscalar

What is superscalar?

These instructions do not depend on each other (they share no registers):

(1) R0 = R1 + R2
(2) R3 = R4 + R5

Yet, traditionally, they are executed like this:

Execute (1), then execute (2)

However, since they do not depend on each other, you could execute them like this:

Execute (1) and (2) at the same time

This is the main gimmick of a superscalar processor.

Limitations

Instruction level parallelism

When instructions are independent and can be overlapped. It is a property of the program.

Compiler based optimisation

It’s up to the compiler to decide where superscalar techniques are possible.

Hardware techniques

If there are not enough execution resources on the CPU to perform multiple instructions at one time, it physically cannot happen.

Limited by:

True data dependency (Read-After-Write)

When multiple instructions use the same registers:
R0 = R1 + R2
R3 = R0

Procedural dependency

Similar to true data dependency, but a branch determines what the shared register will be:
IF R0 = 1 THEN

R1 = 5

R2 = R2 + R1

Resource conflicts

When two instructions require the same resources at the same time.
Example: if two instructions require the ALU

Output dependency (Write-After-Write)

(1) R3 = R3 + R5
(2) R3 = 1
If (2) is completed before (1), then (1) will have the incorrect output. This is output dependency. This can be fixed with register renaming.

Anti-dependency (Write-After-Read)

(1) R3 = R3 + R5
(2) R4 = R3 + 1
(3) R3 = R5 + 1
(3) cannot be executed before (2) because (2) needs the old value of R3 and (3) overwrites R3. This is anti-dependency. This can be fixed with register renaming.

Instruction issues

There are 3 ways superscalar processors can decode and execute instructions:
In-Order Issue In-Order Completion

All instructions are decoded and executed sequentially. This is the least efficient, but it gives the other 2 a frame of reference in terms of performance.

In-Order Issue Out-of-Order Completion

All instructions are decoded in order, but they can be executed at the same time. Output dependency is a problem here.

Out-of-Order Issue Out-of-Order completion

Continues to fetch and decode as much as it can until decode pipeline is full. Then, when an execution resource is available, it’ll execute the next stored instruction. This is typically the most optimised way to decode and execute instructions.

Superscalar requirements

To support superscalar techniques, a processor must:

simultaneously fetch multiple instructions
be able to determine true dependencies with registers
be able to execute multiple instructions in parallel
have the mechanisms to process in the right order

Interconnection structure & bus design

What is a bus

All the units in a computer need to be connected somehow.
These connections are called ‘buses’.
They’re like special paths on which data travels on from one unit to another.
They don’t carry 1 bit, like a copper wire; they’re typically a few words long, so they’re more like a ribbon cable. Each path the bus has is called a ‘line’. Buses have around 50+ parallel lines.
Different types of connections are needed for different types of units:

Memory
I/O
CPU

Single bus problems

If a device uses only one bus, it could lead to problems, such as:

Long paths could mean that coordination of bus transfer could affect performance
Certain data transfer could exceed bus capacity
Different devices work at different speeds, so they will be out of sync.

Therefore devices tend to use multiple buses.

Bus layouts for different units

Unit	Buses
Memory
I/O module (mouse, CD drive etc.)
CPU connection

Types of buses

Control / Address / Data bus

These types of buses are common to all units.
Control bus

Controls access to the data and address lines. Each line has a different use:

Memory read/write signal
I/O Port read/write signal
Transfer Acknowledgement
Bus request/grant
Interrupt request/acknowledgement
Clock signals
Reset

Typically, each line (or each bit) would represent one of these flags.
This is usually hidden from the programmer.

Address bus

Carries the address of the related data. This is especially prominent in RAM and disks. The number of lines this bus has is a big factor for spatial capacity, because you can refer to bigger addresses.
Modern CPUs can have 40-bit addresses (1TB).

Data bus

Carries data from one unit to another. The number of lines in this bus is a big factor for performance, since it would be able to transfer more data per clock cycle.

Peripheral Component Interconnect Express (PCIe)

PCI:

PCI was a type of bus that connected the processor, the LAN adapter, all the peripherals, everything.
It’s not used so much anymore.

PCIe:

Now, we have PCIe.
It is a serial bus with 0.25 - 1 Gbyte/s channels
Version 4 bumped it up to 2GByte/s in 2017
Version 6 bumped it up to 8GByte/s in 2019
There are 2 - 32 channels per bus
You’d typically use this for a graphics card (16 = 5.6GB/s)

Universal Serial Bus (USB)

Ideal for low-speed I/O devices
Simple configuration, simple design
Speeds

usb2: 480 Mbit/s
usb3: 4.8 Gbit/s
usb4: 10-40 Gbit/s

Hardware:

The USB system forms a sort of ‘tree’ structure.
You can add USB hubs and extend the tree structure.
USB cable contains four wires

2 data lines
1 power (+5 volts) & 1 ground

Data transmitted as:

“0” is transmitted as a voltage transition
“1” as the absence of a transition
Thus, a sequence of “0”s forms a regular pulse stream, like this: “0101010101010101010”
A sequence of “1”s would form something that looks like this: “0000000000” or this: “1111111111111”

Signals:

Every 1.00 ± 0.05 ms, the root hub broadcasts a new frame. This is how the root hub ‘communicates’ with the node devices.
If no data is to be sent, just a “Start of Frame” packet is sent, which does nothing
Data associated with one addressed device, either to or from hub
Four kinds of frames:
Control

Used to configure devices & inquire status

Isochronous

For real-time devices where data should be sent/received at precise intervals

Bulk

For large data transfers

Interrupt

USB does not support “interrupts”, so this frame is used for regular polling of devices, e.g. polling a keyboard every 50 ms.

Hard disk buses

Parallel ATA (PATA or IDE)

PATA is a way of connecting devices to a bus controller.
Only has 2 channels per controller (often 2) hence 4 device limit.
2, 11, 16, 33, 100, 133 MByte/s bandwidth
Now mainly seen in old PCs for old optical/CD/DVDs

Serial ATA (SATA)

A type of serial computer bus interface that connects a host adapter to things like hard disk drives, optical drives and solid state drives.

SCSI

A set of standards for physically connecting and transferring data between computers and peripheral devices.
Was mainly used in servers throughout the 80’s and 90’s
1 to 15 devices per channel
10, 20, 40, 80, 160, 320 MB/s
10 – 640 MB/s
SCSI disks still available – usually high speed
Newer iSCSI also common now

Uses tcp/ip and, typically, Ethernet cables

Newer SAS – serial attached SCSI

Very common now
Fast serial SCSI, compatible with latest SATA, 3-6Gbit/s

Fibre channel

Fibre-channel SCSI can use fibre!
2Gbit/s ie aprox. 200MByte/s bandwidth
allows distant disk arrays

I2C

Simple two wire bus for small slow peripherals like sensors, fans etc
Single data line and one clock line

The single data line is used to send and receive

Mass storage

Magnetic disk

Metal/glass/plastic disk coated with magnetizable material (like iron oxide)
Different types of magnetic disk:

Floppy

A small, square/rectangular type of magnetic storage
(old: 8”, 5.25”), 3.5”
Small capacity: 1.44Mbyte
Slow but cheap

Hard disk

Properties:

Universal
Cheap
Fast external storage
Especially in RAID
Getting larger
Multiple Terabyte now usual

Hard disks have multiple discs inside it, each with tracks running through it.
Each track is split up into sectors. A sector is the smallest unit of storage a disc has.
These sectors contain data.
Data is read by a ‘head’, which has a magnetic sensor that reads the sectors.
Speed:

Seek time

Moving head to correct track

(Rotational) latency (eg 4ms)

Waiting for data to rotate under head

Access time = Seek + Latency

On-disk cache:

Modern disks have 8 - 128 MB on-board RAM buffer/cache
Used to store whole tracks and cache r/w
acts as a buffer between disk and external I/O

Fast disks:

Current disk range
6 - 14TB are largest disks now
10-15 k rpm fastest spins
Sustained data rates 150-170 MBytes/s
average access time 3.6 – 8 ms
8-12W power
15k rpm are normally only 600GB

Small disks:

2.5” typically 5400rpm and slower access: 11ms
Some have onboard Flash cache (eg 256MB)
1” Eg IBM Microdrive PCMCIA (historic)
Around 1GB
3600rpm, 12ms access, around 6MB/s.

Mean time between failures (MTBF) = 114 yrs
Chance that fast disk will fail after 1 year = 0.5%

Solid State Drives

Properties:

non volatile NAND logic (fast) or Flash based
expensive
but fast access speeds, say 0.1ms
Near "zero" latency compared to HDs
sequential speed: 500-3000 MB/s
Transactions/s (IOPS) often quoted (eg 90000)
max about 1-4TB at the moment
more shock resistant, silent, lower power

Characteristics:

can only write approx Millions of times

wear levelling – spreads the writes around so one area is not worn out
overprovision - spare storage

file systems consider SSD issues:

TRIM command to tell SSD which blocks are not needed and can be "erased"
erase normally only works on blocks

they need up to date firmware and drivers

Optical storage and CD-ROM

Originally for audio
650Mbytes giving over 70 minutes audio
Polycarbonate coated with highly reflective coat, usually aluminum
Data stored as pits
Read by reflecting laser
Constant packing density
Constant linear velocity
Format:

Mode 0=blank data field
Mode 1=2048 byte data+error correction
Mode 2=2336 byte data

Random Access is difficult on a CD because it’s designed for sequential access.
Therefore, it’s more efficient to package everything and read one big file (e.g. zip or tar.gz).

DVD

Multi-layer
Very high capacity (4.7G per layer)
Full length movie on single disk
Using MPEG2 compression
Movies carry regional coding and copyright
Players only play correct region films
CD vs DVD:
Single-layer vs Dual-layer
At first, writable DVDs had trouble with standards.
Now various standards generally coexist.

Blu-ray

Use 405nm laser – hence smaller features
15-30GB still useful for offline storage/transfer
128GB 4 layer versions exist
Can read at 70MB/s

Magnetic tape

Serial access
Slow - speed often quoted in GB/hour
Very cheap
Backup and archive

Storage Area Network (SAN)

Like disks – block accessed

ie not like NAS which is a file server

Fast and low latency
Starting to use 16Gbit/s fibre channel
Can use 12 TB drives or even SSD

RAID

What is RAID

RAID stood for Redundant Array of Inexpensive Disks, but now it stands for Redundant Array of Independent Disks.
They’re ways of distributing data across multiple physical drives

RAID 0

Different ‘strips’ are distributed across all hard disks
Size is N * disk size

RAID 1

Two disks are mirrored
If one is faulty, just swap it and remirror it
The size is N * disk size / 2 (since 50% of the total storage is being used)

RAID 0+1

You have two RAID 0 systems, and you mirror them together using RAID 1
Size: N / 2

RAID 1+0

You mirror every drive with RAID 1, and you use a RAID 0 system on them all as if they’re individual drives
Size: N / 2

RAID 4

Kind of like RAID 0, but with blocks instead of stripes
Allocates 1 parity drive
Not used now

RAID 5

Like RAID 4, but the parity blocks are dynamically assigned to the data drives through round robin
It uses the recovery mechanism of XOR for parity (in the XOR truth table, if a column is lost, you can XOR the other two remaining bits to get the missing one)
Even though parity is distributed over multiple disks, the size is still N - 1

RAID 6

Like RAID 5, but has a double parity write, so it can tolerate two disk failures
The size is N - 2

RAID 5+0

Multiple RAID 5 systems, all encapsulated under a RAID 0 system.

RAID 6+0

Multiple RAID 6 systems, all encapsulated under a RAID 0 system.

Software RAID

You can do RAID through software as well, but it takes more CPU power.
However, this is available to most OSs.

Hot spares and hot swap

Hot Spare:

A hot spare is a disk which is not used by the RAID system at first. In case of a failure, the system recovers the data on the failed disk (using something like XOR for RAID 5), and fills it in the hot spare. In other words, the disk sits there doing nothing until one of the other disks fail.
Typically, the broken disk gets replaced with a new one which takes over the role of the hot spare.
In a RAID system that cannot recover data, like RAID 0, a hot spare would be useless.

Hot Swap:

You can pull out a faulty disk and replace it while the system is running.

When drives go faulty, the RAID system needs time to rebuild. This is where hot swapping and hot sparing can prevent downtime.

Stripe sizes

Decreasing Stripe Size:

Files are spread over more disks
Decreases I/O rate performance
Increases transfer rate

Increasing Stripe Size:

Files tend to be on fewer/one disk/s
Better for I/O rates
Lower transfer rates

RAID controllers

Carry out SCSI channel and cache management
Many have >64MB of cache!
Take the load off the CPU and speed-up writes
Often have battery-backed-up cache
Can migrate from one level to another
RAID systems often have dual power supplies

Microcontrollers

What is a microcontroller?

A microcontroller is a self-contained computer on a chip.
A microcontroller has:

CPU, memory, clock, I/O
often 8 or 16 bit word sizes - some now 32 bit word sizes
slower clock eg: 32kHz – 100MHz
small RAM
very low power

They are cheap (billions sold each year) and make up 55% of all CPUs sold
Their memory is very limited:

registers – eg 64
SRAM: as low as 1k
Flash/EEPROM: 16-64k for programs

They are designed to run on about 1mA

Embedded systems

Microcontrollers are usually embedded into other systems and are preprogrammed, like washing machines, printers, cars and even clothes.

Input / Output

GPIO

General Purpose Input Output
Useful for things like switches

Analogue to digital inputs

Converts analogue signals into numbers
Used for sound, light etc.

Different types of microcontrollers

Arduino

Atmel ATMega series chips e.g ATMega1280, ATMega328P
easily accessible I/O
8k RAM, 16MHz, 128k Flash
Uses a variation of C
Arduino Uno:

ATMega328P cpu
2KB SRAM
32KB flash
Very popular

Microchip’s PIC

from 8 bit, 4MHz
to PIC32: 32 bit 80MHz, 32k SRAM, 512k Flash
used in PicAxe – for hobby electronics – which runs Basic!

TI MSP430

16 bit
von Neumann architecture
CC430 includes onboard radio

ARM

RISC family started in 1983
families: cortex-M, R and A (application)
more powerful: clock/RAM/32bit/cache
A series can address 4GB and has 3-5 stage pipeline
large memory means it can run OS like Linux or Windows Mobile if needed
very common in mobile phones
approx 0.5mW/MHz

Cortex M

M0+ very low power, slower

48-96MHz typical
1-96 kB RAM
1-256kB flash

M4 fast, powerful (eg FPU, DSP etc)

50-180 MHz
16-256kB RAM
32k to 2MB flash

Raspberry Pi 3 details

The Raspberry Pi is more of a single board computer than a microcontroller.
1.2GHz 64 bit quad-core A53 SoC (quoted as 2760 DMIPS each core)
1GB RAM
VideoCore IV graphics core
802.11n WiFi, Bluetooth 4.1/le
100 Mbit/s Ethernet, USB 2
HDMI video out
Micro-SD
Audio i/o
Camera interface
Display interface
40 GPIO pins

Memory and RAM

Location

Internal

In the CPU, mainly cache

External

RAM and ROM etc.

Capacity

The capacity of memory is expressed in words, or bytes.

Unit of transfer

Internal

Usually governed by data bus width

External

Usually a block which is much larger than a word

Addressable unit

Smallest location which can be uniquely addressed
Word internally

Access methods

Sequential

Start at the beginning and keep reading forward by 1 until you reach the desired address.
Speed is dependent on location of memory
Not used nowadays
Example: tapes

Direct

Individual blocks have a unique address, then a further address in that block points to the desired data.
Example: graphics cards

Random

Individual addresses are located exactly; no blocks
Access time is independent of location or previous access
Example: RAM

Associative

Compares input search data with a table that contains pointers to the addresses of interest
Example: cache

Memory hierarchy

Performance

Access time

Time between presenting the address and actually getting the data

Memory cycle time

Time for the memory to ‘recover’ after retrieving data
Cycle time = access + recovery

Transfer rate

Rate at which data can be moved

Physical types

Semiconductor

e.g. RAM

Magnetic, bubble etc.

Still under development

Optical

Future tech

Semiconductor memory

Memory that uses semiconductor-based integrated circuit
RAM:

Read/Write
Volatile – loses all data on power off
Temporary storage
Static or dynamic types

Dynamic RAM (DRAM)

In DRAM, each bit is stored as a charge in a capacitor.
However, since capacitors lose their charge, they have to be ‘refreshed’ using refresh circuits. This slows down performance.
If the capacitors are not “refreshed”, data can be lost. Therefore, when power is no longer supplied to DRAM, all the data is lost.
Simpler construction
Smaller per bit
Less expensive
Slower (6-60ns)
Main memory

Static RAM (SRAM)

Each bit is stored as an on/off switch, so there’s no need to refresh.
Even though there’s no need to refresh, data is still lost when SRAM loses power.
More complex construction
Larger per bit
More expensive
Faster (<1ns)
Good for Cache

Read-only Memory (ROM)

Read-only memory is a type of memory that isn’t meant to be changed.
It’s perfect for permanent storage and hardware libraries.
Types of ROM:
Programmable Read-only Memory (PROM):

You need special equipment to program it

Electrically Erasable Programmable Read-only Memory (EEPROM):

Takes a lot longer to write than read
You can erase it and program it again if you need to

Flash memory (Not really ROM):

Erase whole memory electrically
Can be used as slow RAM

Double data rate RAM (DDR RAM)

Double data rate means that data is transferred on the rise AND fall edges of the clock cycle.
This means double the data rate than normal!
When RAM uses this technique, it’s called DDR RAM.
DDR3 is faster than DDR2 because it has wider paths

Error correction

DRAM can lose data

Eg 25k failures per Mbit per billion hours

Hard Failure

Permanent defect – the most common

Soft Error

Random, non-destructive
No permanent damage to memory

Detected (and fixed) using error correcting code

Simple parity bit

An extra bit is added to the end of a byte to show if the number of 1’s is even or not.
The extra bit is 1 if they are even, and 0 if they are odd.
This can spot errors, but it can’t fix them and it can only detect an even number of wrong bits.
Example:

010110011 <- this is fine
011010000 <- this is fine
010011100 <- this is not fine

Flash RAM solid state disk

You can use non volatile flash RAM – typically very fast random access (0.01ms)

Parallel Processing

Amdahl’s law

Where
Tp = Total effort (time) required to complete a task
Ws = Bits that absolutely have to be done sequentially
Wp = Bits that can be done in parallel
P = Number of cores
Intuitively, you’d think the more cores you have, the faster your computer is.
In reality, due to Amdahl’s law, your processing capabilities limits as you increase the number of cores:
Amdahl’s law suggests that it’s more efficient to apply small speedups on a bigger part of the CPU than it is to apply a big speedup to smaller parts of the CPU.

Symmetric multiprocessors

Symmetric multiprocessing is a MIMD system in which multiple processors share the same memory and have access to all I/O devices controlled by a single operating system instance.
Hardware manages contention
Increases performance especially multiuser/thread
Reasonably scalable until bus is saturated (full)

Computer cluster

A computer cluster is a group of computers all connected together working on the same problem.
It is controlled by software.
It is easy to build (press x to doubt), very fast, has very large storage, but it uses up a lot of power and it is expensive.

Process based message-parsing

Messages can be sent to and from different processes
However there are problems:

Deadlock - if one process is waiting for the message to be acknowledged and the other process is waiting for the message, they will end up waiting forever
Buffer overflow - if the buffer of the processes are full, it can cause problems

Thread based message-parsing

Similar to process based message-parsing, but two threads from the same program communicate with each other.
This is useful because both threads can see all process resources.
However, there is massive potential for data contention and collision.
Avoiding this is the responsibility of the system architect.
Multi-threaded: the notion of "you are here" (the PC) becomes multi-valued (it’s harder to track the progression of execution with multi-threading)

Instruction level parallelism

This is a type of parallelism that relies on software (more specifically, the compiler), and is invisible to the user.
This requires special hardware.
Instruction level parallelism is similar to the concepts of SSE.
If you had this piece of code:

a = b + c

x = y + z

Normally you would need two instructions that require the ALU, right? Because you’re adding twice.
However, with instruction level parallelism, the compiler could decide to append the values of ‘b’ and ‘c’ together, and ‘y’ and z’ together, then add the two amalgamations, then split it afterwards.
By doing that, only one instruction needs the ALU. The rest is just bit manipulation, which would generally be faster.

Main dataflow

A naive compiler will compile a program, and have every instruction execute one at a time, one after the other.
This can be represented as a ‘datapath’ graph, which is like a timeline of instructions. It represents a program’s execution.
This means the program will work perfectly, but it’s not very optimised.
This is why compilers optimise code through ‘code generation’.

Code generation

The compiler will compress the datapath diagram as much as possible:
Now the compiler can see what types of instructions it needs.

All the fundamental instructions like ‘multiply’, ‘minus’, ‘plus’ is no problem.
However, at control step 5, the program will need to divide two values at the same time. Does the system have an instruction like that?
Additionally, the program will have to multiply 3 times and subtract at the same time at control step 0. Does the system have an instruction like that?
The same can be applied to control step 4.

If the compiler doesn’t have those, the compiler will spread out the datapath nodes, so that the instructions are a little more simple:

At control step 1, the program must multiply twice. Does the system have an instruction like that?
The same thing can be applied to control steps 5, 4 and 3.

The compiler keeps reducing the datapath until all of the required instructions are present in the system. This is one of the ways compilers optimise your code.

PC architectures

Chipset

A chipset is a set of electronic components on an integrated circuit that controls various components, like the processor, memory, peripherals etc.
It exists on the motherboard.
There are two parts to the chipset:

North:

Controls things like the processor, memory, high-speed PCIe slots (mainly for graphics cards) etc.
Fast

South:

Controls things like PCIe slots (mainly for things like network cards or sound cards), BIOS, USB, Ethernet, CMOS (responsible for keeping the date as well as some important hardware settings) etc.

Server motherboards

More memory slots (eg 1TB RAM!)
ECC RAM – error correcting
Onboard RAID
Dual/quad sockets for many CPUs
Web accessible - 10GB/s Ethernet

Intel QuickPath Interconnect (QPI)

QPI is a type of point-to-point processor interconnect developed by Intel.
This means that this technology connects processors together.
Typically ~20 data lines/lanes
80 bit “flit” (packet) transferred in 2 clocks
Includes error detect/correct & 64bits data
2.4 to 4.8GHz clock
Allows multiple CPU chips on motherboard

Onboard SSD

6Gbit/s sata3 is about 600MBytes/s – this is a real bottleneck for solid state drives!
Other bus types, like:

M.2 is the form factor (connector), NVMe is the protocol. But NVMe drives use the PCIe bus, whether they are in M.2 form factor or on traditional PCIe expansion cards. There are also SSDs in M.2 form factor that use a SATA connection rather than NVMe.
PCIe

all attempt to make new SSD interface fast enough (>1GB/s!)
PCs don’t even need SATA now!

Basic Input/Output System (BIOS)

The BIOS is a piece of firmware on the motherboard that initialises hardware and performs start-up tests. It is the first thing that runs when the system is turned on.
The BIOS is stored in flash so that it can be updated.
After the BIOS initialises the hardware and has completed all the start-up tests, it searches the storage devices for a ‘boot loader’.
A boot loader is a piece of software that tells the BIOS how to load up the operating systems. An example would be GRUB (commonly used for dual-booting Linux).

Unified Extensible Firmware Interface (UEFI)

UEFI replaces BIOS.
CPU-independent, meaning it can run on ARM, Intel, AMD etc.
Supports 64 bit systems
Boot services and Runtime services
It stores things like date, time, NVRAM (non-volatile RAM, so UEFI can store things)
GOP – graphics output protocol, which gives the UEFI the ability to display a GUI, as opposed to BIOS which relied on a coloured terminal.
Does not rely on boot sector (uses NVRAM data to boot OS)

Cache

Definition

Cache is a small amount of memory used to make memory retrieval quicker.
Sometimes, fetching things from RAM isn’t fast enough. This is why cache is used.
Cache stores often-used pieces of data so that the CPU can fetch it much quicker.
It is located between RAM and the CPU, sometimes even on the chip itself.

Operation

CPU first requests data from an address
If the cache has the data from that address, then the cache gives it to the CPU and is known as a ‘hit’.
If the cache does not have it, then the cache gets it from the RAM (slower) and is known as a ‘miss’.
The cache then stores that data, and gives the data to the CPU.
There are three levels of cache: L1, L2 and L3.
L1 is the fastest but the smallest, L3 is the slowest but the biggest and L2 is in between.

Size

Cost

More cache is expensive

Speed

More cache is faster
In complex caches, looking up data takes up time

Instruction / Data split

First-level cache is split up into instruction-level cache and data-level cache. Instruction-level cache is read-only.

Three level caches

Third-level cache is shared between all cores, whereas each core will have their own level 1 and 2 cache.

Mapping function

Cache is a lot smaller than RAM. This is why you cannot have a one-to-one relation between the records in RAM and the records in cache; there’s just too many slots in RAM compared to slots in cache.
Because of this, we need to ‘map’ slots in cache to slots in RAM. There are two ways to do this.

Direct mapping

Direct mapping uses a ‘mod’ operation to point slots in RAM to slots in cache.
(This image is used in the following examples)
If the system needed to cache index ‘0’ in RAM, it would put it in ‘0’ in cache.
If the system needed to cache index ‘4’ in RAM, it would put it in ‘0’ in cache.
If the system needed to cache index ‘5’ in RAM, it would put it in ‘1’ in cache etc.
But how would you know which address the cache came from?
To know which address the cache value came from, it includes a ‘tag’ partition in its data. This ‘tag’ partition stores the quotient of the ‘mod’ operation.
Example:

Main memory 0 = Cache memory 0, tag 0
Main memory 1 = Cache memory 1, tag 0
Main memory 2 = Cache memory 2, tag 0
Main memory 3 = Cache memory 3, tag 0
Main memory 4 = Cache memory 0, tag 1
Main memory 5 = Cache memory 1, tag 1 etc.

Full associative mapping

This is a pretty basic mapping technique.
This splits the memory address into two parts: a tag and an offset. The most significant bits make up the tag and the least significant bits make up the offset.
To place a block in the cache:

The system scrolls through each slot in cache, looking for a non-valid (vacant) slot. If it’s there, it will put the data in there.

To search a word in the cache:

The tag field of the memory address is compared with the tag bits in the cache slots. If it exists somewhere in the whole cache, the offset is compared and the data is taken from cache. If not, it is fetched from RAM using an offset.

Set associative mapping

Similar to direct mapping (as it still uses mod), but instead of mapping RAM slots to individual cache slots, it maps RAM slots to a partition of cache memory. All partitions are the same size, and are called ‘sets’.
(This image is used in the following examples)
If we needed to cache RAM index 0, we can put it in either cache index 0 or 1.
If we needed to cache RAM index 1, we can put it in either cache index 2 or 3.
If we needed to cache RAM index 4, we can put it in either cache index 0 or 1 etc.
How would you know which address the cache came from?
Like direct mapping, the cache will store a ‘tag’ partition, which stores the quotient of the mod operation.
By looking at the tag and what cache partition it is, you can decide what cache slot points to what RAM slot.
Example:

Tag 0, cache partition 0 = RAM slot 0
Tag 0, cache partition 1 = RAM slot 1
Tag 1, cache partition 0 = RAM slot 2
Tag 1, cache partition 1 = RAM slot 3
Tag 0, cache partition 2 = RAM slot 4

The image above is an example of two-way set associative mapping (because each set is of size 2), but most commonly, 8-way is used.

Replacement algorithms

A replacement algorithm is an algorithm that decides whether or not a piece of data replaces another piece of data in the cache.
Direct mapping:

There is only one replacement algorithm here.
Since each line in RAM belongs to one line in cache, the system will have to replace the line in cache if it needs to.

Full / Set associative mapping:

Least recently used (LRU)

Stores the data in cache as well as a timestamp.
If a piece of data is hardly ever used, it is replaced.

First in first out (FIFO)

Replace the block that has been in the cache the longest
Like a queue

Least frequently used

Take out the block that has had the fewest hits

Random

Write policies

A block from RAM is stored in cache. However, the data in that block needs to be updated to a new value. What should the system do?
Additionally, an I/O device may interact with RAM directly, leaving cache to store old values.
The rules surrounding these problems are called ‘write policies’. Cache can deal with this in two ways.

Write-through

All writes go to RAM and the cache as well.
Multiple CPUs can monitor RAM and cache to make sure it’s up to date.
This fixes the problem, but it’s slow and has high traffic.

Write-back

Updates initially go to the cache only, and flags them for update.
Then, when the cache blocks are modified/replaced, the update flag will tell the system to update the RAM.
For this to work, I/O must access cache first, not RAM.

Cache coherence

Let’s just say there are two caches between two processors, with one main memory.
The CPU wants to fetch the value X, which has the number 5. Cache 1 now has X = 5, and cache 2 now has X = 5.
The CPU then wants to add 3 to X. The first processor executes this, and with the first cache, gets the value 8. Cache 1 now has X = 8.
The CPU then wants to add 7 to X. The second processor executes this, and with the second cache, gets the value 12. Cache 2 now has X = 12.
Cache 1 and cache 2 now have inconsistent values of X. This is the problem which cache coherence prevents.
There exists MESI protocols like "Snooping" which are usually built-in to enforce cache coherence.

Intel’s latest CPUs

Multi-core CPU

Multiple processors, called ‘cores’, can be put into one CPU, making it a multi-core CPU.
Multi-core is efficient as the integration is all in one chip.

Non uniform memory access (NUMA)

Separate blocks of RAM
Local Ram is fast
Remote Ram is accessible but slower
Mechanisms include ccNUMA (cache coherency)
Helps reduce bottlenecks
Useful with large number of cores or clusters

Loop detector

Loops are detected in a buffer so that pipelining can be used.
Now, it looks at micro-operations
Sometimes, the loop detector unit can optimise.

Hyperthreading

A core can look like two cores, and take twice as much traffic
Separate threads can run on the same core
Hyperthreading fills execution units with tasks from another thread as well

More efficient use of units
It is not the same as having a second core! Even if OS reports double the cores
One physical core, multiple logical cores
There are limited execution units

Hyperthreading is used on i7 processors.

Integrated memory controller

This controls the flow of memory between processors. By using a QPI bus, memory connected to a processor can reach another processor.
Used to be on motherboard chipset
Makes scalability easier as each CPU can have its own bank of RAM
Remote access is 60% slower

Power management

Separate microcontroller looks after power

Can shut down cores completely
Can boost clock when needed and if it’s cool enough

L1 & L2 caches use more transistors each bit to save power

“Turbo boost”

Basically clock speed increased for short bursts or if only one core is in use

Depends on temperature/power

Used on i7 and i5 CPUs
e.g. Skylake Core i7-6920 2.9GHz:

1 core active 3.8 GHz
2 cores active 3.6 GHz
3 or 4 active 3.4 GHz

Summary of i3, i5, i7 and Xeon

GPUs

Definition

Stands for Graphics Processing Unit

Why we need GPUs

Your typical screen resolution: 1920x1080 = 2 Million pixels.. 6MB
3D shading and drawing requires lots of maths per pixel (especially matrix operations)
Typically, all the operations are the same. It’s just that there’s a lot of pixels to do them on.
CPUs are not very efficient at this repetitive task which can be done in parallel

3D rendering pipeline

3D primitives

Model transformation
Lighting
Viewing transformation
Clipping
Projection transformation
Scan conversion

Images

Typical transformation maths

This is a perspective transformation (so objects vanish into the distance etc.)
When you break this down into elementary operations, this turns into a lot of multiply, add, sin/cos etc. maths for each set of coordinates.

Raster operations

More pixel operations than vector/3D
e.g. blends, stencils, per-pixel shading

Comparison to supercomputers

Early GPUs had parallel graphics pipelines, but now they’ve moved to be more programmable. Now, they do hundreds of billions of operations per second.
While clock speeds are only around 1 GHz…
Number of compute units is thousands
Bus width 256-768 bits wide or more
May need 400W of power!
Nvidia GTX 1080 => 4 teraflops
AMD radeon pro WX9100 => 12 teraflops
That’s a lot of floating point operations every second!
Because of this processing power, GPUs are often used in supercomputers.

GPUs with no graphical output

Nvidia tesla:
Kepler architecture designed more for number crunching (3-9 Tflops)
Performance of nvidia tesla models (green) compared with CPU (blue)

Replacing the pipeline model

The GeForce 7800 GTX had 24 pixel shaders. However, the raster operations were heavily used sometimes, leaving the pixel shaders doing nothing.
That’s a waste, which is why general-purpose units evolved.

Onboard Intel GPUs

About 1/10th performance of a separate GPU
New generation has reached the same level as mid-range separate GPUs (i7’s HD650 is about 1Tflops)
Good for lower power desktops/laptops

Programming GPUs

3D – use OpenGL, DirectX etc.
For maths/compute - OpenCL is cross platform or CUDA for Nvidia only
Very useful for general number crunching

Uses of GPUs

Video compression
Image processing
Modelling
Autonomous vehicles
Consoles
Machine learning algorithms

Many repeated computes
Over many inputs
Ideal for GPUs!

Logical

Intro to digital electronics

Truth table

A truth table is a way to show the input, outputs and patterns of a boolean formula.
Example with AND:

Logic notation

+ and . are used for OR/AND

A + B is A or B
A . B is A and B

¬ can be used as ‘NOT’, as well as a dash above the letter, a bar above a letter or statement can also indicate ‘NOT’.

¬A is NOT A
A’ is NOT A
Ā is NOT A

⊕ can be used as ‘XOR’ also known as EXOR (exclusive OR).

A ⊕ B is A exclusive or B

Logic’s relation to electronics

0 -> low voltage
1 -> high voltage
Everything is either a 1 or a 0 (on or off, high or low).

Logic gates

Gate	Explanation
AND
OR
NAND
NOR
XOR

Boolean algebra

Valid for OR, AND, and XOR:

Associative

A + (B + C) = (A + B) + C
A . (B . C) = (A . B) . C

Distributive

A . (B + C) = A . B + A . C
A + (B . C) = A + B . A + C

Commutative

A + B = B + A

Demorgan’s theorems:

¬(A + B) = ¬A . ¬B
¬(A . B) = ¬A + ¬B
Break the line, change the sign! (The line being the ‘NOT’, but you can’t see it here, because I’m using a different notation. You can see it in the image below, using a different notation).

You can make ANY gate out of NAND / NOR gates.

Transistors

Transistors are small semiconductor devices that amplify or switch electronic signals. They have 3 main uses:

as an electronic switch within a circuit
to switch on another part of a circuit when a change in resistance of a sensor device is detected
as an interface device, to receive signals from low current devices (such as ICs) and use these to turn on high current devices (such as motors)

In the CPU, they are used to make logic gates.

FET transistor

Field-effect transistors (FETs) are digital switches that respond to an input voltage to allow an increase in either voltage or current.

MOSFET transistor

MOSFET transistor stands for metal-oxide semiconductor field-effect transistor

Flip flops and registers

Flip flops

A flip flop is a circuit that stores 1 bit of data.
There are 3 types of flip flops.

Reset-set flip flop (RS)

A type of flip flop that uses ‘reset’ and a ‘set’ inputs. It can use NAND or NOR gates.
If you have NORs, S = R = 0 is undefined, so don’t do it!
If you have NANDs, S = R = 1 is undefined, so don’t do it!
If a signal is sent through ‘R’, then that sets the output ‘Q’ to always a ‘high’ and the ‘NOT Q’ to a low.
If a signal is sent through ‘S’, then that sets the output ‘NOT Q’ to always a ‘high’ and the ‘Q’ to a low.

Clocked reset-set flip flop

The same as an RS flip-flop, but there is an extra ‘CLK’ (clock) input. If CLK is 0, S and R will have no effect. If CLK is 1, S and R will work normally. This is typically used to delay the output by one clock cycle.

D-type flip flop

Like clocked RS flip-flop circuits, but instead of S and R, there’s just D, which stands for ‘data/delay’. The rising edge of ‘D’ determines what ‘Q’ will be, and ‘clock’ is the same as CLK.
(Data/delay at the top, clock below it)
Most D-type flip flops have the capability to be force setted and resetted by using the ‘R’ and ‘S’ inputs, similar to an RS flip flop. These are the kinds used in registers:

Registers

Registers are just groups of flip flops.
Registers are commonly used as temporary storage in a processor, because:

They are faster and more convenient than main memory.
More registers can help speed up complex calculations.

Reg4

This is a design of register using 4 D-type flip flops, a clock and a clear input:
However, the inputs of D are being outputted by Q on every clock cycle. Instead of this, we could add an LD (load) input:
With this, when LD = 0, the flip-flop C inputs are held at 1. This means, no matter what the CLK changes to, there is no change to the ‘C’ input. Therefore, there is no positive clock edge, so the flip-flops keep their current values.
When LD = 1, the CLK input passes through the OR gate, so the flip-flops can receive a positive clock edge and can load a new value from the D3-D0 inputs.
But this is bad; if you wanted to set the register’s value, LD must be kept at ‘1’ for the right amount of time (one clock cycle) and no longer. This is called clock gating.
To rectify this, we can modify the flip flop inputs and not the CLK:
When LD = 0, the flip-flop inputs are the Q outputs, so each flip-flop just keeps its current value.
When LD = 1, the flip-flop inputs are D3, D2, D1 and D0, and this new value is “loaded” into the register.

Shift registers

A shift register ‘shifts’ its output once every clock cycle.
‘SI’ is an input, telling the shift register what to shift into the register.
This will ‘shift right’ and add the SI bit to the left side.
Example:

SI = 0, register = 0011 -> register = 0001
SI = 1, register = 0001 -> register = 1000
SI = 1, register = 1000-> register = 1100
SI = 0, register = 1100 -> register = 0110
SI = 1, register = 0110 -> register = 1011

We can add a parallel load, just like we did for regular registers.
When LD = 0, the flip-flop inputs will be SI, Q0, Q1, Q2, so the register shifts on the next positive clock edge.
When LD = 1, the flip-flop inputs are D0-D3, and a new value is loaded into the shift register, on the next positive clock edge.
In summary, with a shift register parallel load, put LD to 0 if you want to shift the contents of the register, and put LD to 1 if you want to put a new value into the register. This isn’t really used for storing a value, like a normal register.

Serial data transfer

Receiving serial data

The end of the serial device goes into the ‘SI’ input of a shift register
The serial device then sends bits, one by one, into the shift register
The shift register then shifts all the bits into one word for the computer to look at and compare.

Sending serial data

The computer sets a word in the shift register, but in reverse
The shift register’s Q3 output is connected to the serial device.
The shift will then keep sending the Q3 output (the most significant bit that the shift register is holding) and shifting the stored word (so that the next bit goes into Q3) until all the bits has been sent to the serial device.

Registers in modern hardware

Registers store data in the CPU

Used to supply values to the ALU.
Used to store the results.

If we can use registers, why bother with RAM?
Answer: Registers are expensive!

Registers occupy the most expensive space on a chip – the core.
L1 and L2 are very fast RAM – but not as fast as registers.

Number systems

Definition

A number system is a way of representing numbers.
We all use decimal (base 10), e.g. fifteen is 15, seven is 7 etc.
Computers use binary (base 2), e.g. 310 is 112, 1910 is 100112 etc.
We can shorten binary using hex (base 16), e.g. 1010 is A16, 2610 is 1A16 etc. Hex is often represented as 0x[something] so you know it’s hex.
Octal (base 8) isn’t used that much anymore
Binary coded decimal (BCD, base 10) takes every decimal digit and just converts each of them to binary, e.g. 13 = 0001 0011 (with 0001 being ‘1’ and 0011 being ‘3’)

Decimal	Binary	Hexadecimal	Octal	BCD
1	00001	0x1	1	0001
2	00010	0x2	2	0010
3	00011	0x3	3	0011
4	00100	0x4	4	0100
5	00101	0x5	5	0101
6	00110	0x6	6	0110
7	00111	0x7	7	0111
8	01000	0x8	10	1000
9	01001	0x9	11	1001
10	01010	0xA	12	0001 0000
11	01011	0xB	13	0001 0001
12	01100	0xC	14	0001 0010
13	01101	0xD	15	0001 0011
14	01110	0xE	16	0001 0100
15	01111	0xF	17	0001 0101
16	10000	0x10	20	0001 0110

Convert decimal to binary

Keep dividing by two and use the remainders until you reach quotient 0
Example:

Convert 46 to binary
46 / 2 = 23 remainder 0
23 / 2 = 11 remainder 1
11 / 2 = 5 remainder 1
5 / 2 = 2 remainder 1
2 / 2 = 1 remainder 0
1 / 2 = 0 remainder 1
so 46 in binary is 1 0 1 1 1 0

Convert binary to decimal

Split the bits up, and assign ascending square numbers, going from the least significant bit up to the most significant bit.
If the bit is a 1, add that square number to the sum. If the bit is a 0, don’t add the square number to the sum.
Example:

Convert 100110 to decimal
Assign ascending square numbers to the bits, from the least significant bit to the most significant bit:

Bits	1	0	0	1	1	0
Square numbers (in decimal)	32	16	8	4	2	1

If a bit is a 1, add up that square number. If the bit is a 0, don’t add it up and move on.
Looking at the table, there are ‘1’ bits under ‘32’, ‘4’ and ‘2’. So, we sum them up:
32 + 4 + 2 = 38
So 100110 in decimal is 38.

Convert hexadecimal to binary

Take each digit of hexadecimal and put it in a table.
Remember that each digit of hexadecimal translates to a nibble in binary (a number with 4 bits; half a byte).
You know that A, B, C, D ... in hex is 10, 11, 12, 13... in decimal.
Just convert each digit in hex to their nibble equivalents, then concatenate them all.
If it’s easier, you can convert each hex digit to their decimal equivalent, then to their binary equivalents.
Example:

Convert 0x3F1 to binary
Put the digits into a table, and convert each digit separately

Hex	3	F	1
Decimal	3	15	1
Binary	0011	1111	0001

Remember that the binary ones HAVE to be 4 bits long!
Now, simply concatenate the binary values.
0011, 1111, 0001 → 001111110001
So we now know that 0x3F1 is 1111110001 in binary.

Convert octal to binary

This is pretty much the same as hex to binary, except it’s with octal and the binary values for each octal digit are 3 bits long.
Example:

Convert 226 from octal to binary.
Put the digits into a table, and convert each digit separately

Octal	2	2	6
Binary	010	010	110

You don’t even need a decimal row, because there are no letters in octal; each digit in octal is its own decimal equivalent.
Remember that binary values here are 3 bits long!
Now, concatenate the binary values.
010, 010, 110 → 010010110
So now we know that 226 in octal is 10010110 in binary.

Convert binary coded decimal (BCD) to decimal

Binary coded decimal is basically decimal, but each digit in decimal has been converted to binary and appended.
To convert BCD to decimal, take each nibble and convert them to decimal.
Example:

Convert 1010111 from BCD to decimal
Split it up into 4’s: 101 - 0111
Oh no! It doesn’t divide evenly into 4’s! That’s OK, just add zero padding to the last one, like this: 0101 - 0111
Now, convert each binary nibble to its decimal equivalents:
0101 - 0111 → 5 - 7
Now append the decimal digits
5 - 7 → 57
So now we know that 1010111 in BCD is 57 in decimal.

Remember: each nibble in BCD cannot have a decimal value over 9. That wouldn’t make sense!

Arithmetic

Unsigned integers

An unsigned integer is a normal integer that cannot be negative.
Example:

11011 as an unsigned integer is 27 in decimal.

Signed integers

A signed integer is an integer that can be negative. The range of values it can be is usually sacrificed to allow negative numbers, compared to unsigned integers.
Example:

An unsigned 8-bit integer can hold values between 0 to 255.
A signed 8-bit integer could hold values between -128 to 127.

There are different ways of creating signed integers.

Sign and magnitude

The most significant bit works like a ‘flag’, saying if the number is negative or positive. It can be ‘0’ if it’s positive and ‘1’ if it’s negative.
Example:

11011, using sign and magnitude, is -11 in decimal.
01011, using sign and magnitude, is 11 in decimal.

This is a nice and simple way to have signed integers, but doing proper maths with it is practically impossible, as the most significant bit isn’t a numerical value (it’s more like a flag).

2’s complement

Numbers are defined such that X + (-X) = 2n
+3 0011
-3 1101
=====
10000
If there is overflow, it is either discarded or used as an overflow flag.
Maths works really well with this method.
The most significant bit can still be used as a flag for positive or negative, similar to sign and magnitude
How to convert a number to its negative counterpart:
NOT all the bits, then add one.
Example:

Convert 001001 to its negative counterpart.
NOT all the bits: 110110
Then add 1: 110111

How to add two 2’s complement numbers together:
Add them together normally, and if there is an overflow bit, just discard it.
To convert this to decimal, do the same square number method as before, but make the biggest square number negative (e.g. if you had 32, 16, 8, 4, 2, 1 as the table’s header rows, make it -32, 16, 8, 4, 2, 1).

1’s complement

To get the negative number from a positive number, just invert all the bits.
Example:

Convert 001001 to its negative counterpart.
NOT all the bits: 110110

Similar to 2’s complement

Biased offset

Biased offset is where you define a particular binary number to be the ‘0’, and everything below that is negative and everything above it is positive.
Example:

Let’s have a look at biased offset (6)

Binary	Decimal (unsigned int)	Decimal (biased offset 6)
0000	0	-6
0001	1	-5
0010	2	-4
0011	3	-3
0100	4	-2
0101	5	-1
0110	6	0
0111	7	1
1000	8	2
1001	9	3
1010	10	4
1011	11	5
1100	12	6

As you can see, where the binary number is supposed to be ‘6’ (if you looked at it in an unsigned way), it’s actually 0 (in a biased offset 6 way) and everything below it are the negative numbers and everything above it are the positive numbers.
To convert biased offset to decimal, convert the binary normally to decimal first, then subtract the bias off.
Example:
Convert 1001 to decimal with a biased offset of 6
Convert it normal to decimal: 1 + 8 = 9
Then take away the biased offset: 9 - 6 = 3
So 1001 in decimal with biased offset of 6 is 3.

Floating point

Floating points are binary numbers that represent fractions. The binary number is split up into three sections:

Sign (just 1 bit)
Exponent (in biased offset, determines how big numbers are)
Fraction (also called mantissa, determines the accuracy of the numbers)

It’s very similar to scientific notation; standard form (stuff like 1.5 x 103)
There are four formats of the IEEE standard: 754-1985 for floating point numbers:

But the general concept is the same.
Example: convert this to decimal
The decimal equivalent is 0.15625. This will explain, step-by-step, how to reach that answer:
The ‘sign’ determines if it’s positive or negative. It’s a ‘0’, so we know the number must be positive.
The ‘exponent’ determines how big or small our number is. This number is of the single format, so it has a bias of +127.
We can calculate the exponent by finding the normal decimal equivalent, which is 124. Then, we subtract the bias, so it’s 124 - 127, which is -3. The exponent is -3.
The ‘fraction’ works a bit differently. The most significant digit is ‘1’. As you go down, the digits actually represent fractions, like in this table:

Binary digits	1	0	1	0	0	0
Decimal representation	1	1/2	1/4	1/8	1/16	1/32

As you can see, it’s similar to a normal unsigned integer, except the square numbers are their reciprocals (one over something).
You may be confused as to why the binary digits in the table above are ‘101000...’ and not ‘01000...’ like in the picture. This is because IEEE 754-1985 has an ‘implied 1’ at the beginning of their fraction. It’s not visible or stored physically, but when doing calculations, it has to be taken into account. So just remember: when doing floating point arithmetic with IEEE 754-1985, append the ‘implied 1’ to the end of the fraction part.
You can think of this like a decimal point being immediately after the most significant bit:
1010000...
1.010000...
Normally, we would add up the decimal representations with a ‘1’ bit above them, but there’s one thing we have to do first.
We must apply the exponent we found earlier, -3.
Since the exponent is negative, we need this fraction to get smaller. Therefore, we shift the bits 3 places to the right. This will make the fraction 3 places smaller.

Binary digits	0	0	0	1	0	1
Decimal representation	1	1/2	1/4	1/8	1/16	1/32

If the exponent was positive, we would shift the bits 3 places left, because a positive exponent means we would want a bigger fraction. By shifting left, you would have to introduce the integer square numbers again, like normal.
Now we can add up the decimal representations that have a ‘1’ bit above them.
This would be 1/8 + 1/32, which is 0.15625; the original answer.
This format represents zero like this:

sign = 0 for positive zero, 1 for negative zero
biased exponent = 0
fraction = 0

Exponent field = 0

If your exponent field is 0, then that means your number is not normalised. That’s bad; your numbers should always be normalised, unless you’re trying to represent 0 in floating point.
Additionally, if your exponent field is 0, the ‘implied 1’ rule stops working, meaning you can represent 0’s and have denormalised numbers (not that you would).

Rounding modes

IEEE 754-1985 has 4 different rounding modes:

Round to Nearest – rounds to the nearest value; if the number falls midway it is rounded to the nearest value with an even (zero) least significant bit, which occurs 50% of the time (like normal rounding in maths)
Round toward 0 – directed rounding towards zero (rounds down if positive, rounds up if negative)
Round toward +∞ – directed rounding towards positive infinity (everything rounds up, like ‘ceil’ function)
Round toward −∞ – directed rounding towards negative infinity. (everything rounds down, like ‘floor’ function)

Addition / Subtraction

Check for zero operands (saves time)

Subtract exponents : |Ea - Eb| = d
Align the fractions : right shift the fraction of the smaller exponent by d bits (efficient)

Simply make the floating point numbers have the same exponent however you like (less efficient)

Add or subtract the fraction parts
Normalise : reset the alignment of the fraction part
Adjust the exponent accordingly
Sort out the rounding

Multiplication / Division

Handle NaN propagation
Check for zero operands (saves time)
Evaluate product sign
Fixed point unsigned multiplication/division on the fractions (Basically, normal multiplication/division. You could just convert it to decimal, do the operation, then convert it back to binary again)
Add exponents (or subtract, if division)
Normalise fractions and adjust exponent
Round the result

Variable types to bitwidths and ranges

Gates

Gate sizes

The AND gate is bigger than the NAND gate because there’s an extra inversion stage
Therefore, it’s more efficient to use NANDs than ANDs.

Gate speed (delays)

Gate speed depends on many things:

The time it takes to put a signal into a gate compared to the time the gate will give an output will be different (gate delay)
Depends on the value of other inputs (if inputs are not synchronised it could cause delay)
Depends on load (gates in serial take longer than gates in parallel)
Depends on function (NAND is faster than AND)

Glitch generator

You’d think that A . ¬A will always evaluate to false, right? Because when A is true, ¬A will be false etc.
But actually, in practice, this will evaluate to true for a split second upon changing the input A.
This is because it takes a while for an inverter to take in input and release an output signal.
This ‘split second’ is the delay.

Inertial delay

Depending on how fast your signal is, a gate might not even be fast enough to process an output.

Karnaugh maps

Karnaugh maps are a convenient way to represent boolean functions with <=4 variables.
They are very helpful in simplifying boolean formulas.
It uses gray code, which is a sequence of binary numbers starting from 0 and ascending, but only 1 bit changes at a time, e.g: 000, 001, 011, 010, 110, 111, 101, 100 etc.
Example:

Let’s just say we have a boolean formula: (A . B) + (A + B) . (B . C)
We want to simplify this using a karnaugh map.
We could put this into a karnaugh map. Remember, the bits at the top and the bits on the left must be in gray code (only 1 bit changes at a time):
Now, use the A, B and C inputs to fill in the karnaugh map, like a truth table:
Now, you have to group all of the 1’s together into ‘blocks’. A block can only be in the shape of a rectangle/square, and there can only be a power of 2 (square number) amount of elements in a block (2, 4, 8 etc.)
Here, I have grouped all of the 1’s using two blocks, both of size ‘2’. One is in red, and one is in blue. Notice how the blocks are rectangular shaped only.
Now comes the part where you simplify the original boolean formula.
Each block represents a small boolean formula. You must find all these boolean formulas and OR them together to get the simplified boolean formula:
[a formula in here] + [another formula in here]
The red block encapsulates the outputs where B and C are both true. If B and C are true, then the output will be in the red block. Therefore, the formula that the red block represents is B . C
The blue block encapsulates the outputs where A and B are both true. If A and B are true, then the output will be in the blue block. Therefore, the formula that the blue block represents is A . B
Now we have the subformulas, we just need to OR them all together, and we’ll have our simplified expression: (A . B) + (B . C)
We can even go one step further and use the distributive rule, if we wanted to:
B . (A + C)
Now we know that (A . B) + (A + B) . (B . C) can be simplified to B . (A + C)

Addition

Full adder

A logic unit that adds three bits together: input 1, input 2 and a carry digit from before, and returns sum and carry digit. If you imagine adding up two binary numbers, you’re basically XOR-ing to get the sums and AND-ing to get the carries. This is what a full adder does.

Ripple adder

If you had lots of full adders together, you could add a whole byte. This is what a ripple adder does; it puts a whole bunch of full adders together so that it can add things like words and bytes.
The size is really good; it’s simple and compact. But since it’s sequential, the delay on all the gates will add up, making it slow.
If only there was some way to add up all the bits at the same time...

Carry-lookahead

Carry-lookahead does just as the name suggests; it uses carry bits to ‘look ahead’ of the calculation and determine the value of all the bits at the same time.
The ‘generate’ function G(A, B) is basically an AND function
The ‘propogate’ function P(A, B) is basically an XOR function
You can see here that you can mathematically get all of the carry bits using just the input bytes:
This is what a carry-lookahead does. It calculates all the caries ahead of time, then calculates the addition in one fell swoop.
This is all combinatorial, so it hardly has any delay.
However, the size gets increasingly big. If you had a carry-lookahead adder of size 4, and you had another one of size 32, the one with size 32 would be 64x bigger than the one of size 4.

Multicycling

Instead of just using 1 8-bit adder sequentially to add up all the bits, why not use 2 8-bit adders and split the byte into two parts? That way you’ll be done twice as fast.

Binary multiplication

You can use the ADD & SHIFT method:

Booth multiplication algorithm

Let’s just say you want to multiply 6 and 2.
Let 6 = A = 0110
Let 2 = B = 0010
Draw a table with n + 1 rows (n being the number of bits on both inputs)
Put a P column, a B column and a B-1 column.
Fill P with a zero, B with the value of B and B-1 with 1 zero.

P	B	B-1
0000	0010	0

You compare the least significant bit in B and the value of B-1. They are 0 and 0, so we shift everything to the right.

P	B	B-1
0000	0010	0
0000	0001	0

Now the two bits are 1 and 0. When this happens, we set the P to P - A.
0000 - 0110 is 1010, so we replace P with that now:

P	B	B-1
0000	0010	0
0000	0001	0
1010	0001	0

Immediately after that, we do another shift:

P	B	B-1
0000	0010	0
0000	0001	0
1010 1101	0001 0000	0 1

The two numbers are 0 and 1 now. This means P = P + A.
1101 + 0110 is 0011, so we set that to P and immediately do another shift:

P	B	B-1
0000	0010	0
0000	0001	0
1010 1101	0001 0000	0 1
0011 0001	0000 1000	1 0

The two bits are 0 and 0 now, so we just do a normal shift:

P	B	B-1
0000	0010	0
0000	0001	0
1010 1101	0001 0000	0 1
0011 0001	0000 1000	1 0
0000	1100	0

Halt! The number of rows (not including the title row) is 5. The number of bits we’re working with is 4. When we reach n + 1 rows, we stop. 4 + 1 = 5, so we’re going to stop now.
Concatenate P and B together to get the answer.
0000, 1100 → 00001100 → 1100
1100 is 12
6 * 2 is 12, so this algorithm works!
Computers perform this algorithm with signed and unsigned multipliers.

ARM and Assembly

ARM instruction set

32-bit instructions in the full ARM instruction set
Also the Thumb instruction set is a subset of the full ARM instruction set, stored in 16 bits
The instructions operate on a restricted view of the ARM registers (just r0 to r7, sp, lr, and pc)
Optimized for higher code density from C code (~65% of ARM code size so useful on small sys)
Processor can mix ARM/Thumb code!

Thumb2

Thumb-2 is an extension to the Thumb Instruction Set Architecture (ISA)
retains the complete 16-bit Thumb instruction set
Adds 32-bit instructions to implement almost all of the ARM ISA functionality
25% faster than Thumb

ARM registers

16 32-bit registers R0-R15
3 ‘special’ registers:

R15 – Program Counter (PC)
R14 – Link Register (LR)
R13 – Stack Pointer (SP)

Current Program Status Register (CPSR)

Simple instruction examples

mov r2, r1
mov r4, #7
add r3, r1, r2
subs r3, r1, r2
B <label>
cmp r1, #7
ble <label>

r2 = r1
r4 = 7
r3 = r1 + r2
r3 = r1 – r2
Branch to <label>
Compare r1 with 7
Branch if less or equal

Status register

sub R1,R2,R3 ← subtract R2 and R3 and puts it in R1
subs R1,R2,R3 ← does the same, but sets the status register flag
You can use that status register flag for branches, so subs is better

Memory access

Table

MOV r1, #Table
LDR r0, [r1]

Puts the address of Table into r0 then…

Loads word from address stored in r1 into r0

This is like r0 = Table[0]

Note we have #Table, not a fixed address (like 23000)

Offsetting

LDR r0, [r1,8]

Adds 8 to the address r1 then reads the word
So this is like r0 = Table[8]

In this case r1 contains a memory address

Looping

MOV R0,#20 @load a counter R0 with 20
next: @body of loop – this is a label
...
...
SUBS R0,R0,#1 @decrement loop counter
BNE next @repeat until loop count = zero

Stack

A stack is a last-in first-out (LIFO) data structure.
To put data onto the stack you ‘push’ it. To get data from the stack, you ‘pop’ it.
The CPU needs to know where the stack is. It can do so by using a ‘stack pointer’.
Register R13 is used to point to the active stack – it is called the Stack Pointer SP
Some instructions rely on the stack
Especially pop and push
Code should not manipulate the SP
Normally called sp, not r13, in assembler

Subroutine return address

If you jump to code with a branch (or jump) – you may need to return to the next instruction in the calling code on completion of your subroutine
Achieved by storing contents of PC (before it is overwritten by address of subroutine) – this is the return address
To return from the subroutine, the return address is copied into the PC.

Subroutine call with interrupt

Very similar, main difference is that an interrupt is triggered by some external event (i.e. outside the CPU) – typically I/O
In contrast, a subroutine call is always under the control of the program running
An interrupt needs to have the address of the interrupt routine specified somehow: still save RA and copy back into the PC at the end of the interrupt

Return address using stack

push RA onto stack
on return, pop RA from stack and overwrite PC
works well and is widely used
supports nested calls, recursion, re-entrant code and read-only program store
any use of stack by subroutine must be completed before return from subroutine (else will pop off incorrect RA!)
stack manipulation introduces some delay

Return address using LR register

When code jumps to a subroutine it needs to know where to jump back to when complete
Store Return Address in a special register
Used in ARM: R14, Link Register
Return from routine, just copy R14 to R15
Need to push R14 onto stack before another routine called (and pop on return)
Compromise between speed and complexity – widely used
(There are other mechanisms, not covered here)

LR

Branching with BL or BLX the return address is automatically stored in LR
So a subroutine can return with:
BX LR

Or the subroutine code could store LR on the stack (push LR) then at the end:
pop PC
To put the LR address into the PC from the stack

Networking

What is a network

A network is one or more computers that are connected together and can share data.
The Internet is a network of networks.
LAN: Local Area Network. Communications inside a building or a campus
WAN: Wide Area Network
Note – the Internet != the World Wide Web
The WWW is essentially the web pages you can access on a browser, which uses the Internet.

Layered network models

Two main models:

OSI model
TCP/IP model

TCP/IP Model

Application Layer
Transport Layer (a.k.a. Host-to-Host Layer)
Internet Layer (a.k.a. Network Layer)
Network Access Layer (a.k.a. Link Layer)

Layer encapsulation

Each layer in the TCP/IP stack adds its own header.
This becomes part of the payload for the layer below it

Physical & Link layer

The physical/link layer refers to the protocols concerned with physically transferring the data from one system to another.
Examples:

LAN defined by the IEEE 802 family of standards

Ethernet (IEEE 802.3)
Wireless LAN (IEEE 802.11 a/b/g/n/ac)
Wireless Personal Area Networks (IEEE 802.15)

Bluetooth (IEEE 802.15.1)
Low-Rate WPAN (IEEE 802.15.4)

ADSL (ITU G.992 and ANSI T1.413)
VDSL (a.k.a. FTTC or Infinity) (ITU G.993)
DOCSIS (ITU-T Rec. J.112, J.122 and J.222)
Mobile 3G, LTE (a.k.a 4G)

Network access layer

The network access layer is pretty much an alias for the physical / link layer.
Deals with the local link that a host is connected to.
Each host has a link-unique address (48-bit MAC address)
e.g. encapsulation of a packet to an Ethernet frame:

Internet layer

This layer is in charge of transporting data packets between hosts.
Three basic functions:

Handles next-hop routing.
Passes a received packet’s payload to the correct transport layer (the next layer up).
Provides unique addressing.

Ability to connect different types of network
The Internet layer only provides a “best-effort” packet delivery. This is sometimes known as unreliable.

Five main protocols operate at the Internet Layer:

IPv4
IPv6
ICMP
ICMPv6
IPSEC

IPv4

Internet Protocol version 4
Each node has a unique 32-bit IP address
Written in octet-grouped dotted-decimal notation
e.g. 152.78.65.112

There are too many network nodes; not enough IPv4 addresses to pass around!
32-bit address space = about 4.3 billion addresses
Exhaustion has been anticipated since the 1980s
IANA allocated their last blocks to RIRs on 3/02/2011.
Two “solutions”:

Network Address Translation (NAT) (1993)
IPv6 (1998)

NAT & NAPT

A way to share one IPv4 address between multiple computers.

e.g. a home router shares one public IPv4 address with multiple devices with private IPv4 addresses.

RFC1918 “private” address space:

192.168.0.0/16, 10.0.0.0/8 and 172.16.0.0/16

Not a real solution:

NAPT breaks the end-to-end principle & some protocols
Even with NAT, we still have IPv4 exhaustion
Gives a false sense of security – NAT is not a replacement for a properly configured firewall

IPv6

Internet Protocol version 6
Each node has a unique 128-bit IP address
Written in colon-delimited hexadecimal

e.g. 2001:630:d0:f111:e07a:b1fa:68a1:80eb

Not many home ISPs support IPv6
You can use a tunnelbroker to get working IPv6 on an IPv4 connection

Hurricane Electric
SixXS

Tunnels IPv6 over IPv4
Higher latency than a native IPv6 connection.

ICMP / ICMPv6

Used for diagnostic and control purposes or generated in response to errors in IP operations.
Contained within a standard IP packet
ICMP works with IPv4
ICMPv6 works with IPv6
Used by Ping and Traceroute
Ping send simple packets to a server to test speeds
Traceroute records the path a packet takes to reach its destination server.

Transport layer

Provides different kinds of host-to-host communication services.
Two main protocols:

TCP = Transmission Control Protocol
UDP = User Datagram Protocol

Some other specialised protocols:

Point to Point Tunnelling Protocol
Cubesat Space Protocol

TCP and UDP

TCP relies on a ‘handshake’ mechanism. It’s similar to how a normal conversation would go on, with one person saying something, the other person saying something in response, the other person saying another thing etc.
UDP is different; packets are just sent to the recipient and responses are not expected, hence the term “fire and forget”.

TCP flow / congestion control

Flow Control (a):

Prevents a fast sender overwhelming a slow receiver.

Congestion Control (b):

Reduces send rate to cope with network congestion.

Application layer

Software that uses the network.
Generally, programmers will use pre-made libraries BUT need to choose appropriate modes.
Python Socket Example:

#create an INET, STREAMing socket
s = socket.socket( socket.AF_INET, socket.SOCK_STREAM)

#now connect to the web server on port 80
s.connect("www.myself.me", 80)

Naming and addressing

Each layer uses a different type of addressing.
Using IP addresses in applications can be problematic – what if you change host?

DNS & ARP

DNS

Domain Name Service provides a way to map symbolic domain names to an IPv4 or IPv6 address.
Highly-reliable and resilient distributed service
Operates at the Application Layer

Different types of record:

A Record IPv4 address record.
AAAA Record IPv6 address record
CNAME Alias of one name to another
MX Record Message Transfer Agent address record (E-Mail)

ARP / NDP

Address Resolution Protocol (IPv4)
Neighbour Discovery Protocol (IPv6)
Operates at the Link Layer
Translates IP addresses to MAC addresses
You can find a MAC address for a given IP with the arp command:

$ arp -a 152.78.65.254
iam-router.core.ecs.soton.ac.uk (152.78.65.254) at 00:14:1b:3d:2c:00 [ether] on eth0

Routing

When you go on a website, the packets don’t go directly from your computer to their server; they follow a route of various other nodes before they reach the server. This is ‘routing’.
Occurs where there is a change in IP address spaces.
Routing occurs at the Internet layer.
Each router has an IP address in each address space it routes between.
Routers can also have other functions: Firewalling, DNS, DHCP, etc.

Multiple paths

There can be multiple possible paths between two hosts.
This is for redundancy, improved performance of commercial reasons.
Internet: managed by BGP
“Local”: usually RIP or OSPF

Switches

A switch is a device that inspects and forwards packets towards their destination.
Used to connect multiple devices on one network segment.
“Simple” switches operates at the link layer
Switches only forward traffic between the ports it needs to.

Other concerns

It’s more than just the technology:

Privacy
Anonymity
Ethics
“Net Neutrality”
Snooping
Filtering
Security
Sustainable Internet

Monitoring

Everywhere you go, the Internet makes a log of what you do.
Network owners need/want to know what is happening on their network.
Governments want to know what people are doing.
In the UK, ISPs can be required to store certain connection information. New legislation is being considered to expand the extent of this logging.
To stay anonymous, you could:

Use a VPN

Still creates logs somewhere
VPN exit point can be monitored

Use something like Tor, “The Onion Router”

Routes traffic through a (random) series of hops
Traffic is encrypted up to the exit node
Reduces performance
Not foolproof: Vulnerable to traffic analysis & exit node monitoring
Contains other exploitable vulnerabilities

The Internet of things

Everything can be connected to the Internet nowadays, from:

Light bulbs,
Thermostats,
Fridges,
Health monitoring,
Energy/utility monitoring
Wireless Sensor Networks

These objects need their own networking technologies and their own security and privacy concerns.

Instruction sets

Interpretation of binary data

The binary number 0100 0010 can be interpreted in different ways:

As a character
As a memory address
As a machine instruction
As an ‘enumerated type’ in a programming language e.g. enum colour {red, blue, green}
As anything; it depends on the context.

Instruction set

An instruction set is a complete collection of instructions that are understood by the CPU.
Each number represents a certain instruction
These instructions are usually used in assembly through mnemonics (names for the instructions, e.g. instead of saying 0110 you can say ‘ADD’)
One line of assembly typically equates to one machine instruction.

Elements of an instruction

Operation code (op code or opcode)

do this

Source operand reference

use this item of data

Result reference

put the answer here

Next Instruction Reference (usually implicit – next instruction in program memory)

when you have finished, do this...

Not every instruction does all of this

Instruction types

Arithmetic instructions

Add, Subtract, Multiply, Divide
Signed Integer
Floating point
May include

Increment (x = x + 1)
Decrement (x = x – 1)
Negate (x = -x)

Shift and rotate operations

Shifting binary values
E.g. 0110 → 1100 → 1001 → 0011 → 0110

Logical

AND, OR, NOT, XOR

Transfer of control

Branch (may be conditional)

e.g. branch to x if result is zero

Skip

e.g. increment and skip if zero
ISZ Register1
Branch xxxx
ADD A

Subroutine call (AKA routine, function, procedure)

cf. interrupt

Input / Output

Communicating with input / output devices
May be specific instructions
May be done using data movement instructions (memory mapped)
May be done by a separate controller (DMA – Direct Memory Access)
Much faster, data routed directly (not through ALU) and no need to fetch I/O instruction.
Typically used for Disk or Network transfers

Types of operand

Operand - like an argument in high-level programming, but in assembly
Types of operand:

Addresses
Numbers

Integer/floating point

Characters

ASCII

Logical Data

Bits or flags

Instruction set architectures (ISA)

Accumulator ISA

Stores results in an accumulator, a special register for storing results of mathematical operations.
Example:

You do ADD 5,6
5 + 6 = 11
11 is now stored in the accumulator

Stack based ISA

Uses a stack to perform operations
Example:

You push ‘5’ onto the stack.
You push ‘7’ onto the stack.
You perform the ‘SUB’ operation:

‘7’ is popped off of the stack.
‘5’ is popped off of the stack.
7 - 5 = 2.
‘2’ is pushed onto the stack.

Register-memory ISA

One operand comes from memory, one comes from a register.
Example:

You could do ADD R1, R3, B
Where ‘B’ is a location in main memory and R1 and R3 are registers.

It’s simple to generate code, but the code is non-commutative, so source operands can be destroyed.
Systems that use this: IBM 360/370, TMS320, Intel x86, M68000

Register-register ISA

Both operands are registers
Example:

You could do ADD R1, R2, R3
Where R1, R2 and R3 are all registers.

It’s fast and instruction encoding is simple, but since the instructions are simple, they result in bigger programs.
Systems that use this: Alpha, SPARC, ARM, PowerPC, Trimedia TM5200

Memory-memory ISA

Both operands are from memory.
Example:

You could do ADD R1, A, B
Where A and B are locations in memory and R1 is a register.

The code is compact, but speed could be bottlenecked by memory fetching.
This is not used today.

Alignment

Native types (ints, chars etc.) have a fixed size.
Objects (structs, classes) do not have a fixed size.
Alignment is where an object of arbitrary size fits nicely in the addresses with no overlap.
Basically, with address size A and type size s, this is true: A mod s == 0

Why insist on alignment?

Memory is typically organised on a multiple of a word boundary
Misaligned memory access requires multiple aligned memory accesses

Endedness

Refers to the dilemma of:

How bytes in a word are ordered
How bits in a byte are ordered

Example:

We have the binary number 1011
Is this number 11 (1 + 2 + 8, LSB on the right) or is it 13 (1 + 4 + 8, LSB on the left)?

Another example:

We want to store 0x12345678 into memory. We have the starting address 0x900.
Should we store it like this (big endian)?

0x900	0x901	0x902	0x903
0x12	0x34	0x56	0x78

or like this (little endian)?

0x900	0x901	0x902	0x903
0x78	0x56	0x34	0x12

It really doesn’t matter all too much.
The systems that use big endian are:

IBM 370/390
Motorola M680x0
Sun SPARC
Most RISC machines
The internet

The systems that use little endian are:

Intel 80x86
Pentium
VAX
Alpha
ARM (originally)

These systems support both (bi endian):

PowerPC supports both modes, and demands only that a given process within the machine is internally consistent
ARM now includes a switch to toggle between endian modes

Address modes

Immediate addressing

Where a constant is hard-coded into the instruction
Example:

ADD R1, #4, #5
R1 will have the value of 9.

Direct addressing

Where you use the address of the operand in the instruction
Example:

ADD R1, 4, 0x56
Adds 4 to the value stored in address 0x56 and puts it in R1.

Indirect addressing

Where you use a register’s value as an address, and you find the value within that address.
Example:

ADD R1, 4, (0x56)
Adds 4 to the value that address 0x56 is pointing at and puts it in R1.

Multilevel indirect addressing

Like indirect addressing, but there’s lots of address layers.
Example:

ADD R1, 4, (((0x56)))
Adds 4 to the value of the address of the address that address 0x56 is pointing at and puts it in R1.

Register addressing

Simply loading a register.
Example:

ADD R1, R2, R3
Adds contents of R2 and R3 and puts it in R1.

Register indirect addressing

Like indirect addressing, but with registers.
Example:

ADD R1, 4, (R2)
Adds 4 to the value that R2 is pointing at and puts it in R1.

Displacement addressing

You supply an address and a register. The value in the register is added to the address, like an offset.
Example:

R1 = 2
0x56, R1
0x56 + 2
0x58 ← final address

Four types:

Relative displacement: Effective address = A + (PC)
Base register displacement: Effective address = A + (R)
Indexing displacement: Effective address = (A) + R
Scaled

Scaled displacement addressing

You supply two registers and a displacement. The first register is the initial position, the second register is the scale, and the displacement is added on.
This is mostly used for arrays of large objects.
Example:

The initial position is 40, the scale is 2 and the displacement is 1.
Taking a word to be 4 units, we must go up by 2 (scale) * 4 units, which is 8 units, so we’re now at position 48.
The displacement is 1, so we add that, making our position 49.

Orthogonality

If every opcode uses the same (combination of) address modes then the instruction set is orthogonal.
Example: an orthogonal instruction set could use only direct addressing for every instruction.

Completeness

If the same set of arithmetic operations is provided for each native type (int, real and so on), then the instruction set is complete.
Example: There is an ADD for ints, chars, reals etc., there is a SUB for ints, chars, reals etc., there is a MUL for ints, chars, reals etc. and so on.

Operating Systems & Virtual Machines

About operating systems

The operating system is a program that protects the programmer and the user from details of the hardware.
It provides an interface to the hardware, so that other programs can make effective use of the computer
This works much better if the hardware provides support for the OS
Operating systems and hardware have developed together
Two families:

Microsoft Windows
Unix-like

Layers and views of a computer system

Kernel services

Memory management

Allocate memory to programs
Manage virtual memory
Protect against programming errors & malware

Task management

Launch processes (process = program + state of program)
Maintain process table: list of running processes
Carry out time slicing and context switching
Handle interrupts
Set program privilege levels

File management

Respond to program requests to open, read, write and close files
Set and check permissions
Handle buffering

Device management

Use device drivers to respond to program requests to open, read, write and close devices such as printers, cameras, displays, keyboard

Essential hardware features for an OS

Memory protection

To protect the OS
To allow the OS to protect programs from each other

Timer

To prevent a job monopolizing the system

Privileged instructions

Only executed by OS, e.g. I/O

Interrupts

Allow for relinquishing and regaining control

Scheduling

Scheduling: sharing the CPU usage of the processes in an optimised manner.
Making effective use of the processor
Central facts:

Memory access is very slow compared with the processor speed
I/O devices are extremely slow compared with the processor
Multiple tasks need to share the machine
A processor can only do one thing at a time

While one task is waiting for an I/O device (or the user), another task should be able to use the processor

Context switching

Where a process is saved and stored so that another process can run.

Process control block

The data structure used for saving a process.
Includes:

Identifier
State
Volatile environment

Program Counter
Memory pointers
Context data

Priority
I/O status
Accounting information

Memory management

Memory management is responsible for the management of memory (obviously).
It exists to make the user think there are an arbitrary number of processes running at the same time
It also makes every process think it has infinite memory; virtual memory does just this.

Swapping

The operating system stores a queue of processes wanting to execute

Processes on the disk, queue in the kernel

As space becomes available, the processes are loaded from the disk
When a process finishes, it is removed from memory
A process may also be swapped out if it is stalled (for example, waiting for I/O)

Partitioning

Where memory is split up into variable sized chunks, e.g. one partition could be 2MB, another could be 16MB.
Memory is allocated as required
Fragmentation makes it harder to load incoming processes

Physical and logical addresses

Physical address:

The actual address in memory

Logical address:

The address relative to the beginning of the program

Paging

Fixed and variable size partitions are very inefficient, so paging will divide the physical memory into small, equal chunks called ‘frames’.
Then, the process will be divided into chunks (pages) of the same size as memory frames.
The process of mapping pages → frames is efficient (in terms of memory)

Virtual memory

Sometimes, we run out of memory.
Therefore, we can swap pages in and out using virtual memory.
Virtual memory is a block of dedicated disk space used for storing pages.
Seldom-used pages can be stored in virtual memory using a ‘page table’ until they are needed again. This saves memory.
Each process have their own page table.
Logical addresses now refer to a page, and the page table translates it to a physical address using a base address.
The whole point of virtual memory is to pre-emptively get the required data into the most accessible place
Virtual memory and paging is invisible to the user

Page table sizes

Question:

You have a 4GB virtual memory address space and a page size of 4 kB. How many entries would you need in the page table?

First of all, you should know that the page table requires one entry per page.
To answer this, you can use the following equation:

It would also help to convert the numbers to powers of two (so you only need to add/subtract exponents):

4GB = 232
4kB = 212

Which you can do by using logarithms (e.g. log2(4 * 1024 4kB in bytes) = 12)
Then you can put them into the equation:

From that, you will learn that the number of entries required is 220.
For more info about these kinds of questions, check out this citation:
http://www.cs.cornell.edu/courses/cs4410/2015su/lectures/lec14-pagetables.html

Translation lookaside buffer

Most VM systems have a dedicated cache for the page table - the translation lookaside buffer or TLB.

Demand paging

If the process tries to fetch a page that’s not in virtual memory, then a ‘page fault’ occurs. It’s not a real ‘fault’, like an error, it just means that the page table needs to be updated from the disk.
If pages are demanded too much, this can cause ‘page thrashing’. This is where the system is too busy fetching pages instead of actually processing anything.

Segmentation

Segmentation, unlike paging, is where main memory is split up into variable sizes.
Segmentation is visible to the programmer
Segmentation allows the programmer to view the memory as a set of independent address spaces (virtual memories)
Each segment can have different access rights and privileges
Fragments of a process can be modified (recompiled) independently
Data/cache sharing: multiple processes can be allowed qualified access to a segment
Advantages:

Simplifies handling of growing data structures
Allows programs to be altered and recompiled independently, without re-linking and re-loading
Lends itself to protection
Code segments – “pure code” (no self-modification allowed)

Don’t need to write back to disk
Re-entrant code possible (multiple tasks or processes) (i.e. sharing among processes)

Some systems combine segmentation and VM

Amdahl’s Law

The performance improvement to be gained from using some faster mechanism is limited by the fraction of time the faster mode can be used

The principle of locality

"Real" programs tend to re-use data and instructions they have used recently
"Real" programs spend 90% of their execution time in 10% of the code

Temporal locality: Recently accessed items are likely to be accessed again soon
Spatial locality: Code/data that is close in memory tend to be referenced closely in time

Disk caches

Disks are extremely complex subsystems:
They have their own processor and memory to communicate with the bus
A layered cache subsystem provides

Copies of recently accessed parts of files on special dedicated 'easy to get at' areas of magnetic real estate
Copies of recently accessed parts of files in their own RAM caches