Computer Science / Software Engineering Notes Network

Advanced Databases Part 1

Matthew Barnes

Parts
Data types
  Temporal data
  Spatial data
  Multimedia data
DBMS Architecture
Data Storage
  Storage Organisation
  Secondary Storage
  Buffer Management
  The Five Minute Rule
Disk Organisation
  Data Items
  Records
  Blocks
  Insertion and Deletion
Access Structures
  Index basics
  Sequential files
  Dense indexes
  Sparse indexes
  Multi-level indexes
  Duplicates
  Deletions
  Insertions
  Secondary indexes
  Indirection
  B+ trees
    Insertion
    Deletion
  Hash tables
    Extensible hashing
    Linear hashing
  Indexing vs Hashing
Multidimensional Access Structures
  Conventional indexes
  Hash-like
    Grid files
    Partitioned hashing
  Hierarchical indexes
    kd-trees
    Quad trees
    R-trees
    UB-trees
  Bitmap indexes


Parts

Part 1

✅ Data types

✅ DBMS Architecture

✅ Data Storage

✅ Access Structures

✅ Multidimensional Access Structures

Part 2

 Relational Algebra

 Problems and Complexity in Databases

 Conjunctive Queries - Containment and Minimisation

 Query Processing

 Data Integration

 Ontology Based Data Integration

Part 3

 Transactions and Concurrency

 Advanced Transactions

 Logging and Recovery

 Parallel Databases

 Distributed Databases

 Data Warehousing

 Information Retrieval

 Non-Relational Databases

Stream Processing (not examined)

Peer to Peer Systems (not examined)

Data types

Temporal data

Spatial data

Multimedia data

DBMS Architecture

DBMS Interfaces (what is exposed to us)

DBMS Components (how it all works)

Data Storage

Storage Organisation

  1. Cache: volatile, very fast, very expensive, limited capacity
  2. Main Memory: volatile, fast, affordable, medium capacity (RAM)
  3. Secondary Storage: non-volatile, slow, cheap, large capacity (HDD, solid state)
  4. Tertiary Storage: non-volatile, very slow, very cheap, very large capacity (tape storage, backups)

Secondary Storage

[Figure: layout of a sector on a track: gap, sync, mark, data, ecc.]

Rotational speed (rpm) | Average delay (ms)
 4,200                 | 7.14
 5,400                 | 5.56
 7,200                 | 4.17
10,000                 | 3.00
15,000                 | 2.00
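The delays above are just the time for half a revolution; a quick sketch of the calculation (the function name is mine):

```python
def avg_rotational_delay_ms(rpm):
    """Average rotational delay: time for half a revolution, in milliseconds."""
    return 60_000 / rpm / 2   # 60,000 ms per minute, divided by rpm, halved

# Reproduces the table above
for rpm in (4_200, 5_400, 7_200, 10_000, 15_000):
    print(f"{rpm:>6} rpm -> {avg_rotational_delay_ms(rpm):.2f} ms")
```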

  1. Read block
  2. Modify in memory
  3. Write block
  4. Verify block (optional)

                  | HDD          | SSD
Random Read IOPS  | 125-150 IOPS | ~50,000 IOPS
Random Write IOPS | 125-150 IOPS | ~40,000 IOPS

Buffer Management

Single buffering:

  1. Read B1 → Buffer
  2. Process data in buffer
  3. Read B2 → Buffer
  4. Process data in buffer

Double buffering:

  1. Read B1 → Buffer 1
  2. Process data in buffer 1 && Read B2 → Buffer 2
  3. Process data in buffer 2 && Read B3 → Buffer 1
  4. Process data in buffer 1 && Read B4 → Buffer 2
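The double-buffering schedule above can be sketched with a single prefetch worker that reads the next block while the current one is processed (read_block and process are placeholder names):

```python
from concurrent.futures import ThreadPoolExecutor

def read_block(i):
    # stands in for a slow disk read
    return f"block-{i}"

def process(buf):
    # stands in for CPU work on the buffered block
    return buf.upper()

def double_buffered(n_blocks):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(read_block, 0)        # fill the first buffer
        for i in range(n_blocks):
            buf = future.result()                  # wait for the pending read
            if i + 1 < n_blocks:
                # prefetch the next block while we process the current one
                future = pool.submit(read_block, i + 1)
            results.append(process(buf))
    return results
```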

The Five Minute Rule
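As a sketch of the rule itself (the formula and the 1997-era numbers are the classic Gray and Graefe ones, not taken from these notes):

```python
def break_even_seconds(pages_per_mb, disk_price, ios_per_sec, ram_price_per_mb):
    """Interval below which keeping a page cached in RAM is cheaper
    than re-reading it from disk every time it is needed."""
    return (pages_per_mb * disk_price) / (ios_per_sec * ram_price_per_mb)

# 1997 figures: 8 KB pages (128 per MB), a $2,000 disk doing 64 IOPS,
# RAM at $15 per MB -> roughly 266 seconds, i.e. about five minutes
interval = break_even_seconds(128, 2000, 64, 15)
```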

Disk Organisation

Data Items

Records

  1. e#, 2 byte integer
  2. name, 10 byte char
  3. dept, 2 byte code

Example: two fixed-length records stored one after another (names padded to 10 bytes):

e# = 55 | name = "smith" | dept = 02
e# = 83 | name = "jones" | dept = 01
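The fixed-length layout above can be sketched with Python's struct module (the format string is my assumption: a 2-byte int for e#, a 10-byte name field, a 2-byte dept code):

```python
import struct

RECORD = struct.Struct("<h10s2s")   # 2 + 10 + 2 = 14 bytes per record

def pack(e, name, dept):
    # pad the name with spaces to its fixed 10-byte width
    return RECORD.pack(e, name.ljust(10).encode(), dept.encode())

def unpack(buf):
    e, name, dept = RECORD.unpack(buf)
    return e, name.decode().rstrip(), dept.decode()

# Two records stored back to back: record k starts at offset 14 * k
block = pack(55, "smith", "02") + pack(83, "jones", "01")
assert unpack(block[0:14]) == (55, "smith", "02")
assert unpack(block[14:28]) == (83, "jones", "01")
```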

{
  {
    name: "Jonathan Joestar",
    power: "hamon",
    enemy: "Dio Brando"
  },
  {
    name: "Johnny Joestar",
    power: "stand",
    stand_name: "Tusk"
  }
}

Blocks

  1. Use fixed length records - no need to separate (if a record is size 10 bytes and you read through 10 bytes, you know the next record will follow)
  2. Use a special marker to indicate record end (like a null terminating character but for records)
  3. Give record lengths (or offsets)
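A minimal sketch of option 3 (length prefixes), with names of my choosing: each record is preceded by a 2-byte length, so a reader can skip from one record to the next without separators.

```python
import struct

def pack_records(records):
    out = b""
    for r in records:
        out += struct.pack("<H", len(r)) + r   # 2-byte length, then the bytes
    return out

def unpack_records(block):
    records, pos = [], 0
    while pos < len(block):
        (n,) = struct.unpack_from("<H", block, pos)
        records.append(block[pos + 2 : pos + 2 + n])
        pos += 2 + n                           # jump straight to the next record
    return records

assert unpack_records(pack_records([b"smith", b"jones"])) == [b"smith", b"jones"]
```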

Insertion and Deletion

  1. Look for the next empty slot

  1. Just stick it at the end

  1. Find space on a “nearby” block

  1. Create an overflow block

  1. If we’re using a map, such as an offset table or logical ID → physical ID, we make the tombstone a null pointer in place of the record we’re deleting.

Logical ID | Physical ID
56         | 5AF8
57         | (null: tombstone)
58         | 2DE0

  1. If we’re using physical addresses, we place the bit that serves as a tombstone at the beginning of the record.

[Figure: a block of memory spanning physical addresses 5AF0 to 5B18; logical ID 56 maps to physical ID 5AF8, where the deleted record's tombstone bit is set.]
5B00 and above can be reused, but 5AF8 can never be reused.
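A sketch of the map-based tombstone (the physical IDs 5AF8 and 2DE0 come from the example above; 2DD8 for record 57 is made up):

```python
# logical id -> physical id (hex strings stand in for real addresses)
id_map = {56: "5AF8", 57: "2DD8", 58: "2DE0"}

def delete(logical_id):
    # tombstone: the map entry stays, but the pointer is nulled out,
    # so dangling references to the logical id are detected, not misread
    id_map[logical_id] = None

def lookup(logical_id):
    phys = id_map.get(logical_id)
    if phys is None:
        raise KeyError(f"record {logical_id} was deleted")
    return phys

delete(57)
assert lookup(56) == "5AF8"
```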

Access Structures

Index basics

Sequential files

Dense indexes

Sparse indexes

Multi-level indexes

Duplicates

Deletions

[Figures: four before/after examples of deletion from the index.]

Insertions

[Figures: four before/after examples of insertion into the index.]

Secondary indexes

Indirection

B+ trees

  1. All leaves same distance from root (balanced tree)
  2. Pointers in leaves point to records except for “sequence pointers”
  3. That whole minimum thing I said just above

Insertion

  1. Space available in leaf
  2. Leaf overflow
  3. Non-leaf overflow
  4. New root
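The four cases can be traced in a toy B+ tree sketch (at most two keys per node; records and the leaves' sequence pointers are omitted, and all names are mine):

```python
MAX_KEYS = 2                        # a node splits when it exceeds this

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []
        self.children = []          # child nodes (internal nodes only)

class BPlusTree:
    def __init__(self):
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:                   # case 4: the old root overflowed -> new root
            mid, right = split
            root = Node(leaf=False)
            root.keys, root.children = [mid], [self.root, right]
            self.root = root

    def _insert(self, node, key):
        if node.leaf:
            node.keys.append(key)   # case 1: space available (kept sorted)
            node.keys.sort()
        else:
            i = sum(k <= key for k in node.keys)   # which child to descend into
            split = self._insert(node.children[i], key)
            if split:               # a child split: absorb the pushed-up key
                mid, right = split
                node.keys.insert(i, mid)
                node.children.insert(i + 1, right)
        if len(node.keys) <= MAX_KEYS:
            return None
        # cases 2 and 3: leaf or non-leaf overflow -> split this node in half
        right, half = Node(leaf=node.leaf), len(node.keys) // 2
        if node.leaf:
            right.keys, node.keys = node.keys[half:], node.keys[:half]
            mid = right.keys[0]     # a leaf split *copies* the separator key up
        else:
            mid = node.keys[half]   # an internal split *pushes* the separator up
            right.keys, right.children = node.keys[half + 1:], node.children[half + 1:]
            node.keys, node.children = node.keys[:half], node.children[:half + 1]
        return mid, right

def leaf_keys(node):                # in-order scan of the leaves
    if node.leaf:
        return list(node.keys)
    return [k for c in node.children for k in leaf_keys(c)]
```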

Deletion

  1. Simple case
  2. Coalesce with sibling
  3. Re-distribute keys
  4. Cases 2. or 3. at non-leaf

Hash tables

  1. The hash function calculates the block pointer directly, or as an offset from the first block.

  1. The hash function calculates the offset in an array of block pointers.

Worked example (operation, then explanation):

Nothing

We haven’t done anything so far.

Ah alright; if you really want to be pedantic about it, we’ve initialised our hash table with 4 buckets, containing 2 records per bucket.

Insert a, b, c, d

h(a) = 1

h(b) = 2

h(c) = 1

h(d) = 0

For each key, we hash it using the values above and we “append” that key (or record) to the associated bucket.

For example, when we hash ‘a’ into ‘1’, we look up the bucket labelled ‘1’ and insert our key ‘a’ there.

Insert e

h(e) = 1

Like above, we try to insert ‘e’ into bucket ‘1’ according to the hash function.

However, bucket ‘1’ is full, therefore we add an overflow bucket to extend the capacity of bucket ‘1’. We then put ‘e’ into that overflow bucket.

Delete b

When we delete a key, we can just remove it from the bucket. If there were any keys after it, we can just move them up the bucket.

Delete c

By deleting ‘c’, we’re freeing space for ‘e’ in the bucket it’s leading from, so we can move ‘e’ from the overflow bucket to the original bucket and remove the overflow bucket.
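The walkthrough above can be mimicked with a tiny sketch (hash values are passed in explicitly, matching the h(...) values given; slots beyond the bucket capacity stand in for the overflow bucket):

```python
class StaticHash:
    def __init__(self, n_buckets=4, cap=2):
        self.cap = cap
        self.buckets = [[] for _ in range(n_buckets)]  # primary slots + overflow

    def insert(self, key, h):
        self.buckets[h].append(key)     # entries past `cap` model the overflow bucket

    def delete(self, key, h):
        self.buckets[h].remove(key)     # later keys shift up, freeing overflow space

    def overflow(self, h):
        return self.buckets[h][self.cap:]

t = StaticHash()                         # 4 buckets, 2 records per bucket
for key, h in [("a", 1), ("b", 2), ("c", 1), ("d", 0), ("e", 1)]:
    t.insert(key, h)
assert t.overflow(1) == ["e"]            # bucket 1 was full, so 'e' overflows
t.delete("c", 1)
assert t.overflow(1) == []               # 'e' has moved back into the main bucket
```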

Extensible hashing

Worked example (operation, then explanation):

Nothing

The variable i is just 1, and we have three keys.

Increase i to 2

Partition “10” and “11” into different buckets

Add “1010” to the “10” bucket

Insert 1010

To insert 1010, we would put it into the bucket that directory entry ‘1’ points to. However, that bucket is full.

Therefore we need to increase i from 1 to 2.

Then we partition “10” and “11” into different buckets, leaving us with enough space to put “1010” into the “10” bucket.

Add “0111” to the “01” bucket

Insert 0111

Since the “01” bucket has enough space for one more key, we can place “0111” into the “01” bucket.

It just so happens that “00” and “01” share the same bucket. However, if we need to add one more key they’ll need their own buckets.

Partition “00” and “01” into different buckets

Add “0000” to the “00” bucket

Insert 0000

Speak of the devil! There isn’t enough space in the “00” bucket, so we’ll need to split “00” and “01” into their own buckets.

Then we’ll have enough space to put 0000 into the “00” bucket.

Increase i to 3

Partitioning “100” and “101” into different blocks

Add “1001” to the “100” bucket

Insert 1001

Don’t worry, this is the last operation!

If we look at bucket “10”, we’ll see it’s full again.

Like last time, we need to increment i to 3.

“100” and “101” will then share a bucket. We will split that into two, so “100” and “101” will have their own buckets.

We will then have enough space in bucket “100” to put 1001 in.

Duplicate keys are allowed, in case you were wondering.
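A minimal sketch of extensible hashing on bit-string keys (the class and field names are mine; an overflow that one split can't fix is handled by recursing):

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth              # local depth: bits this bucket distinguishes
        self.keys = []

class ExtensibleHash:
    def __init__(self, cap=2):
        self.cap = cap                  # records per bucket
        self.i = 1                      # global depth: directory has 2**i entries
        self.dir = [Bucket(1), Bucket(1)]

    def insert(self, key):              # key is a bit string like "1010"
        b = self.dir[int(key[: self.i], 2)]   # first i bits pick a directory entry
        if len(b.keys) < self.cap:
            b.keys.append(key)
            return
        if b.depth == self.i:           # no spare bit: double the directory
            self.i += 1
            self.dir = [self.dir[j >> 1] for j in range(2 ** self.i)]
        depth = b.depth + 1             # partition b on one more bit
        b0, b1 = Bucket(depth), Bucket(depth)
        for j in range(2 ** self.i):    # repoint the entries that shared b
            if self.dir[j] is b:
                self.dir[j] = b1 if (j >> (self.i - depth)) & 1 else b0
        for k in b.keys:                # redistribute, then retry the insert
            self.insert(k)
        self.insert(key)
```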

Linear hashing

Worked example (operation, then explanation):

Nothing

Here we have a linear hash where i = 2, and the highest-numbered bucket we have is 01 (so n, the current number of buckets, is 2).

The bold numbers underneath the blue buckets are the bucket values.

If the value m equals a bucket value, or a number shown underneath a bucket value, the key is added to that bucket.

As you can see on the left, for example, 11 is underneath bucket value 01, so 0101 will go to bucket 01, and 1111 will also be redirected to bucket 01 (see the case for n ≤ m < 2^i above).

In a way, the pairs 00, 10 and 01, 11 “share” buckets, a bit like how extensible hashing shared buckets.

Increment n

If our buckets get too crowded, we can add a new bucket. This is similar to that partitioning / splitting thing we did with extensible hashing.

Here we give “10” its own bucket, so now we can move all the keys where m = 10 over to the new 10 bucket.

Insert 0101

If the key K = 0101 and i = 2, then m = 01. This should go into bucket 01.

Since it’s full, we can use an overflow block.

Increment n

If we increment n, we’re giving 11 their own bucket. Therefore we can move the key 1111 to the new bucket 11.

With the new space available in bucket 01, we can move 0101 out of the overflow block and into the main block.

However, now we’ve reached the maximum number of buckets we can create where i = 2. If we want more, we need to increase i.

Increment i

Incrementing i doesn’t change the capacity per se, but it changes the potential for capacity increase.

As you can see, the buckets are shared again; notice that there are m = 101 keys in the 001 bucket or m = 111 keys in the 011 bucket, for example.

Increment n and insert 100

Now, there is space in the 000 bucket, so we could just insert 100 in there, but let’s try to spread out our keys into different buckets more.

When we add the 100 bucket, m = 000 and m = 100 no longer share the same bucket.

Therefore we can now add 100 into its own 100 bucket.
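A minimal linear-hashing sketch (names are mine; overflow is modelled by letting a bucket's list simply grow, rather than chaining explicit overflow blocks):

```python
class LinearHash:
    def __init__(self, i=2, n=2, cap=2):
        self.i, self.n, self.cap = i, n, cap
        self.buckets = [[] for _ in range(n)]

    def _addr(self, key):
        m = int(key[-self.i:], 2)       # m = last i bits of the key
        # if n <= m < 2^i, the bucket doesn't exist yet: drop the top bit
        return m if m < self.n else m - 2 ** (self.i - 1)

    def insert(self, key):
        self.buckets[self._addr(key)].append(key)

    def grow(self):                     # "increment n": add bucket n
        if self.n == 2 ** self.i:       # all 2^i addresses in use: increment i
            self.i += 1
        self.n += 1
        partner = self.n - 1 - 2 ** (self.i - 1)   # bucket that held the new keys
        self.buckets.append([])
        old, self.buckets[partner] = self.buckets[partner], []
        for k in old:                   # re-address the partner's keys
            self.insert(k)
```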

Indexing vs Hashing

SELECT ...
FROM R
WHERE R.A = 5

SELECT ...
FROM R
WHERE R.A > 5

A hash table handles the first (equality) query well, but it cannot answer the second (range) query; a B+ tree index supports both.

Multidimensional Access Structures

Conventional indexes

  1. Get all the matching records using an index on attribute “department”
  2. Check values of “salary” attribute on those records

  1. Use index on attribute “department” and “salary” separately for the two predicates department=sales and salary>£40,000 and get two sets of records
  2. Take the intersection of the two record sets (find records that are inside both sets)

  1. Use the index of “department” to point us to a suitable index on the “salary” attribute.
  2. Get all matching records of that “salary” attribute index we were pointed to.
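The second strategy (intersecting two record-id sets) as a minimal sketch, with made-up indexes and record ids:

```python
# Hypothetical single-attribute indexes, each mapping a predicate to record ids
dept_index = {"sales": {101, 102, 103, 107}}                 # department -> record ids
salary_index = [(38000, 105), (41000, 102), (52000, 107)]    # sorted (salary, record id)

def query(dept, min_salary):
    by_dept = dept_index.get(dept, set())
    by_salary = {rid for sal, rid in salary_index if sal > min_salary}
    return by_dept & by_salary          # step 2: intersect the two record sets

# records in 'sales' earning over 40,000
print(query("sales", 40000))
```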

Hash-like

Grid files

Partitioned hashing

Hierarchical indexes

kd-trees

Quad trees

R-trees

UB-trees

  1. Mapping n-dimensional space onto a 1-dimensional line using a fractal space-filling curve (Hilbert curve, Moore curve, Z-order curve etc.)

  2. Partitioning the ranges (called Z-regions) and indexes (called Z-indexes) of the curve using a B+ tree. The leaves of the B+ tree represent these Z-regions, and point to the records that lie within that Z-region.

  3. Turning the query into a query rectangle (a rectangle covering several Z-indexes; a record satisfies the query if its Z-index is inside the query rectangle)

  4. Looking for Z-regions that intersect the query rectangle

This requires:

  1. A mapping of a multi-dimensional point (record) to a Z-index
  2. A way to determine which Z-regions intersect the query rectangle
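For the Z-order curve in two dimensions, the point-to-Z-index mapping is plain bit interleaving (a sketch; the function name is mine):

```python
def z_index(x, y, bits=4):
    """Interleave the bits of (x, y) into a Z-order (Morton) code."""
    z = 0
    for b in range(bits):
        z |= ((y >> b) & 1) << (2 * b)        # y bits take the even positions
        z |= ((x >> b) & 1) << (2 * b + 1)    # x bits take the odd positions
    return z

# Nearby 2-D points tend to get nearby codes, so a B+ tree over z_index
# values can serve rectangle queries by scanning intersecting Z-regions.
```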

Bitmap indexes