Advanced Databases Part 3
Matthew Barnes
Emilia Szynkowska
Transactions and Concurrency
Granularity and Concurrency
Save Points in Transactions
Distributed DBMS Principles
Implementation of IR Systems
The lost update problem (T2 reads X before T1 writes its update, so T1's change to X is overwritten and lost). In these tables, XT1 and YT1 are T1's local copies of X and Y, XT2 and YT2 are T2's, and Xd and Yd are the values stored in the database.

| T1 | T2 | XT1 | YT1 | XT2 | YT2 | Xd | Yd |
|---|---|---|---|---|---|---|---|
| read(X) | | 20 | | | | 20 | 50 |
| X := X - 10 | | 10 | | | | 20 | 50 |
| | read(X) | 10 | | 20 | | 20 | 50 |
| | X := X + 5 | 10 | | 25 | | 20 | 50 |
| write(X) | | 10 | | 25 | | 10 | 50 |
| read(Y) | | 10 | 50 | 25 | | 10 | 50 |
| | write(X) | 10 | 50 | 25 | | 25 | 50 |
| Y := Y + 10 | | 10 | 60 | 25 | | 25 | 50 |
| write(Y) | | 10 | 60 | 25 | | 25 | 60 |
The dirty read (temporary update) problem (T2 reads the uncommitted value of X written by T1, which then crashes and rolls back).

| T1 | T2 | XT1 | YT1 | XT2 | YT2 | Xd | Yd |
|---|---|---|---|---|---|---|---|
| read(X) | | 20 | | | | 20 | 50 |
| X := X - 10 | | 10 | | | | 20 | 50 |
| write(X) | | 10 | | | | 10 | 50 |
| | read(X) | 10 | | 10 | | 10 | 50 |
| | X := X + 5 | 10 | | 15 | | 10 | 50 |
| | write(X) | 10 | | 15 | | 15 | 50 |
| read(Y) | | 10 | 50 | 15 | | 15 | 50 |
| CRASH :O | | | | | | | |
| rollback | | | | | | 20 | 50 |
The incorrect summary problem (T1 sums X and Y while T2 moves 10 from X to Y, so T1 computes S = 60 instead of the correct 70).

| T1 | T2 | XT1 | YT1 | S | XT2 | YT2 | Xd | Yd |
|---|---|---|---|---|---|---|---|---|
| S := 0 | | | | 0 | | | 20 | 50 |
| | read(X) | | | 0 | 20 | | 20 | 50 |
| | X := X - 10 | | | 0 | 10 | | 20 | 50 |
| | write(X) | | | 0 | 10 | | 10 | 50 |
| read(X) | | 10 | | 0 | 10 | | 10 | 50 |
| S := S + X | | 10 | | 10 | 10 | | 10 | 50 |
| read(Y) | | 10 | 50 | 10 | 10 | | 10 | 50 |
| S := S + Y | | 10 | 50 | 60 | 10 | | 10 | 50 |
| | read(Y) | 10 | 50 | 60 | 10 | 50 | 10 | 50 |
| | Y := Y + 10 | 10 | 50 | 60 | 10 | 60 | 10 | 50 |
| | write(Y) | 10 | 50 | 60 | 10 | 60 | 10 | 60 |
The unrepeatable read problem (T1 reads X twice and gets different values, 20 and then 10, because T2 updates X in between).

| T1 | T2 | XT1 | YT1 | XT2 | YT2 | Xd | Yd |
|---|---|---|---|---|---|---|---|
| read(X) | | 20 | | | | 20 | 50 |
| | read(X) | 20 | | 20 | | 20 | 50 |
| | X := X - 10 | 20 | | 10 | | 20 | 50 |
| | write(X) | 20 | | 10 | | 10 | 50 |
| read(X) | | 10 | | 10 | | 10 | 50 |
Serial schedule T1 ; T2
| T1 | T2 | XT1 | YT1 | XT2 | YT2 | Xd | Yd |
|---|---|---|---|---|---|---|---|
| read(X) | | 20 | | | | 20 | 50 |
| X := X - 10 | | 10 | | | | 20 | 50 |
| write(X) | | 10 | | | | 10 | 50 |
| read(Y) | | 10 | 50 | | | 10 | 50 |
| Y := Y + 10 | | 10 | 60 | | | 10 | 50 |
| write(Y) | | 10 | 60 | | | 10 | 60 |
| | read(X) | 10 | 60 | 10 | | 10 | 60 |
| | X := X + 5 | 10 | 60 | 15 | | 10 | 60 |
| | write(X) | 10 | 60 | 15 | | 15 | 60 |
Serial schedule T2 ; T1
| T1 | T2 | XT1 | YT1 | XT2 | YT2 | Xd | Yd |
|---|---|---|---|---|---|---|---|
| | read(X) | | | 20 | | 20 | 50 |
| | X := X + 5 | | | 25 | | 20 | 50 |
| | write(X) | | | 25 | | 25 | 50 |
| read(X) | | 25 | | 25 | | 25 | 50 |
| X := X - 10 | | 15 | | 25 | | 25 | 50 |
| write(X) | | 15 | | 25 | | 15 | 50 |
| read(Y) | | 15 | 50 | 25 | | 15 | 50 |
| Y := Y + 10 | | 15 | 60 | 25 | | 15 | 50 |
| write(Y) | | 15 | 60 | 25 | | 15 | 60 |
Non-Serial and Non-Serialisable Schedule
| T1 | T2 | XT1 | YT1 | XT2 | YT2 | Xd | Yd |
|---|---|---|---|---|---|---|---|
| read(X) | | 20 | | | | 20 | 50 |
| X := X - 10 | | 10 | | | | 20 | 50 |
| | read(X) | 10 | | 20 | | 20 | 50 |
| | X := X + 5 | 10 | | 25 | | 20 | 50 |
| write(X) | | 10 | | 25 | | 10 | 50 |
| read(Y) | | 10 | 50 | 25 | | 10 | 50 |
| | write(X) | 10 | 50 | 25 | | 25 | 50 |
| Y := Y + 10 | | 10 | 60 | 25 | | 25 | 50 |
| write(Y) | | 10 | 60 | 25 | | 25 | 60 |
Non-Serial but Serialisable Schedule
| T1 | T2 | XT1 | YT1 | XT2 | YT2 | Xd | Yd |
|---|---|---|---|---|---|---|---|
| read(X) | | 20 | | | | 20 | 50 |
| X := X - 10 | | 10 | | | | 20 | 50 |
| write(X) | | 10 | | | | 10 | 50 |
| | read(X) | 10 | | 10 | | 10 | 50 |
| | X := X + 5 | 10 | | 15 | | 10 | 50 |
| | write(X) | 10 | | 15 | | 15 | 50 |
| read(Y) | | 10 | 50 | 15 | | 15 | 50 |
| Y := Y + 10 | | 10 | 60 | 15 | | 15 | 50 |
| write(Y) | | 10 | 60 | 15 | | 15 | 60 |
|
| Lock already held by someone else | Shared lock requested | Exclusive lock requested |
|---|---|---|
| None | Grant | Grant |
| Shared | Grant | Wait |
| Exclusive | Wait | Wait |
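A minimal Python sketch of how a lock manager might apply this compatibility matrix (the `LockMode` enum and `can_grant` helper are illustrative names, not taken from any real DBMS):

```python
from enum import Enum

class LockMode(Enum):
    SHARED = "S"
    EXCLUSIVE = "X"

# Compatibility matrix: (lock already held, lock requested) -> grant immediately?
COMPATIBLE = {
    (None, LockMode.SHARED): True,
    (None, LockMode.EXCLUSIVE): True,
    (LockMode.SHARED, LockMode.SHARED): True,
    (LockMode.SHARED, LockMode.EXCLUSIVE): False,
    (LockMode.EXCLUSIVE, LockMode.SHARED): False,
    (LockMode.EXCLUSIVE, LockMode.EXCLUSIVE): False,
}

def can_grant(held, requested):
    """Return True if the requested lock is granted immediately,
    False if the requester must wait."""
    return COMPATIBLE[(held, requested)]

assert can_grant(None, LockMode.EXCLUSIVE)                 # Grant
assert can_grant(LockMode.SHARED, LockMode.SHARED)         # Grant
assert not can_grant(LockMode.SHARED, LockMode.EXCLUSIVE)  # Wait
```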
A timestamp TS(T) is a unique identifier assigned to each transaction. The timestamp ordering algorithm orders transactions by their timestamps: an ordered non-serial schedule is always serialisable, because the equivalent serial schedule is simply the transactions in timestamp order. The algorithm associates each database item X with two values: read_TS(X), the largest timestamp of any transaction that has read X, and write_TS(X), the largest timestamp of any transaction that has written X.
The basic timestamp ordering algorithm compares the timestamp of a transaction T with read_TS(X) and write_TS(X) to ensure the timestamp order is not violated. If it is violated, T is aborted and resubmitted to the system with a new timestamp, and any transaction that has used a value written by T must also be rolled back. This leads to cascading rollback: the system keeps rolling back changes because many transactions depend on each other.
Thomas's write rule is a modification of basic timestamp ordering that does not enforce conflict serialisability and rejects fewer write operations: the check on write operations is relaxed so that obsolete writes are simply ignored. If a write operation on an item X arrives too late (a transaction with a larger timestamp has already written X), the write is not executed but processing continues, because the newer value would have overwritten it anyway.
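A rough sketch of these checks, assuming timestamps are plain integers and the in-memory tables and `Abort` exception are purely illustrative:

```python
# Basic timestamp ordering: each item X keeps read_TS(X) and write_TS(X).
read_TS = {}   # item -> largest timestamp of any transaction that read it
write_TS = {}  # item -> largest timestamp of any transaction that wrote it

class Abort(Exception):
    """Raised when a transaction violates timestamp order and must restart."""

def read_item(ts, item):
    # A younger transaction has already written X, so this read is too late.
    if ts < write_TS.get(item, 0):
        raise Abort(f"T{ts} reads {item} too late")
    read_TS[item] = max(read_TS.get(item, 0), ts)

def write_item(ts, item, thomas_rule=False):
    # A younger transaction has already read X expecting the older value.
    if ts < read_TS.get(item, 0):
        raise Abort(f"T{ts} writes {item} too late")
    if ts < write_TS.get(item, 0):
        if thomas_rule:
            return  # obsolete write: skip it and keep going
        raise Abort(f"T{ts} writes {item} too late")
    write_TS[item] = ts
```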
A save point is an identifiable point in a transaction representing a partially consistent state, which can be used to restart a transaction.
A chained transaction is a transaction broken into sub-transactions, which are executed serially.
| Save Point | Chained Transaction |
|---|---|
| Allows a transaction to be broken into sub-transactions | Allows a transaction to be broken into sub-transactions |
| Database context is preserved | Database context is preserved |
| Rolls back to an arbitrary save point | Rolls back only to the previous save point |
| Does not free unwanted locks | Frees unwanted locks |
In a nested transaction, a transaction forms a hierarchy of sub-transactions. Here, sub-transactions may abort without aborting their parent transaction. There are three rules for nested transactions: the commit rule (a sub-transaction's commit is only provisional and becomes permanent when its ancestors commit), the rollback rule (if a transaction rolls back, all of its sub-transactions roll back with it), and the visibility rule (changes made by a committed sub-transaction become visible to its parent, while objects held by a parent can be made accessible to its sub-transactions).
A saga is a collection of transactions that form a long-duration transaction. A saga is specified as a directed graph whose nodes are either actions or the terminal nodes abort and complete.
In a saga, each action A has a compensating action A⁻¹. If A is an action and α is a sequence of actions, then A α A⁻¹ ≡ α.
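A minimal sketch of how compensation might be driven, assuming each action is paired with its compensating action (the booking example is invented for illustration):

```python
def run_saga(steps):
    """Each step is a (action, compensating_action) pair of callables.
    On failure, roll back completed steps by compensating in reverse order."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()  # A followed by A^-1 leaves the state unchanged
        raise

# Hypothetical booking saga: reserve a flight, then a hotel.
run_saga([
    (lambda: print("reserve flight"), lambda: print("cancel flight")),
    (lambda: print("reserve hotel"),  lambda: print("cancel hotel")),
])
```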
Durability: Once a database is changed and committed, changes should not be lost due to failure
Whenever a transaction is submitted, it will either be committed or aborted. If a transaction fails while executing, any executed operations must be reversed. Failures are classified as transaction, system, or media failures:
To be able to recover from failures that affect transactions, the system maintains a log to store transaction operations and information for recovery. The system log is a sequential, append-only file that is kept on disk so it is not affected by failures. The main memory contains a log buffer, which collects log data and appends it to the log file.
A transaction T reaches its commit point when all its access operations have been executed successfully and the effect of all operations has been recorded in the system log. Beyond the commit point, the effect of the transaction is permanently stored in the database. If a system failure occurs, any transactions which have not been committed are rolled back to undo their effect on the database.
Data that needs to be updated is stored in main memory buffers. It is updated in main memory before being written back to disk. A collection of in-memory buffers called the DBMS cache holds these buffers. When the DBMS requests an item, it first checks the cache to determine whether the item is stored. If not, the item is located on disk and copied into the cache. It may be necessary to flush some of the buffers to make space available for new data.
The log contains three main types of record, marking the stages of a transaction: <start T>, <commit T>, and <abort T>.
Undo Logging repairs a database following a system crash by undoing the effects of transactions that were incomplete at the time of the crash. It introduces a new record type <T, X, old>, showing that T has changed X from its old value old.
Rules for Undo Logging: the update record <T, X, old> must be written to the log on disk before the new value of X is written to disk, and <commit T> is written only after all of T's changed items have been written to disk.
Recovery with Undo Logging: scan the log backwards; for every <T, X, old> written by a transaction with no <commit T> record, restore X to its old value, then write an <abort T> record for each such transaction.
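A rough Python sketch of that backward scan, assuming a simple in-memory log of tuples (the record format is invented for illustration):

```python
def undo_recovery(log, database):
    """log is a list of records, oldest first, e.g.
    ("start", "T1"), ("update", "T1", "X", 20), ("commit", "T1").
    Restores old values for every transaction that never committed."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    undone = []
    for rec in reversed(log):                 # scan backwards
        if rec[0] == "update":
            _, t, item, old_value = rec
            if t not in committed:
                database[item] = old_value    # undo the uncommitted change
                if t not in undone:
                    undone.append(t)
    log.extend(("abort", t) for t in undone)  # record the rollbacks
    return database
```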
In checkpointing, a periodic checkpoint record is written to the system log. All transactions that started before the checkpoint have either committed or aborted by the time it is written, so a recovery algorithm never needs to scan entries before the most recent checkpoint. A new record type <ckpt> represents a checkpoint.
Rules for checkpointing:
In nonquiescent checkpointing, we allow transactions to continue during checkpointing. This introduces new record types <start ckpt (T1…Tn)> representing a checkpoint starting with active transactions T1…Tn which have not yet been committed, and <end ckpt>, which represents the end of a checkpoint.
Rules for nonquiescent checkpointing:
Recovery with Checkpointed Undo Logging:
Undo Logging writes <commit T> log records only after all database items have been written to disk. This can potentially cause more disk I/O operations.
Redo Logging ignores incomplete transactions and repeats changes made by committed transactions to overwrite incorrect values. It writes a <commit T> log record to disk before any values are written to disk, and introduces a new record type <T, X, new> showing that T has changed X to a new value.
Rule for Redo Logging: all log records for a transaction, including its <commit T> record, must be flushed to the log on disk before any of the values changed by the transaction are written to disk.
Recovery with Redo Logging: scan the log forwards; for every <T, X, new> written by a transaction that has a <commit T> record, set X to its new value, and write an <abort T> record for each uncommitted transaction.
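A matching sketch of the forward scan for redo logging, using the same invented log format as the undo example above:

```python
def redo_recovery(log, database):
    """log records: ("start", T), ("update", T, item, new_value), ("commit", T)."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:                           # scan forwards
        if rec[0] == "update":
            _, t, item, new_value = rec
            if t in committed:
                database[item] = new_value    # re-apply the committed change
    return database
```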
Checkpointing with Redo Logging:
Recovery with Checkpoint Redo Logging:
Undo/Redo Logging combines the two approaches to address the drawbacks of each: undo logging forces changed items to disk before commit, while redo logging forces changed items to be held in buffers until after commit.
Undo/Redo Logging introduces a new record type <T, X, old, new>, showing that T has updated X from an old value to a new value.
Rules for Undo/Redo Logging:
Checkpointing with Undo/Redo Logging:
Shared Memory
Shared Disc
Shared Nothing
Intra-operator parallelism
Inter-operator parallelism
Bushy parallelism
Enquiry
Collocated Join
Directed Join
Broadcast Join
Repartitioned Join
When things go right
| Principle | Explanation |
|---|---|
| Local autonomy | Each site should be autonomous, i.e. independent of the other sites (low coupling), managing its own data and operations. Local operations on a site only affect the local resources of that site, and shouldn't affect any other site on the network. This keeps all the sites modular, and stops the operations of the network becoming spaghettified. |
| No reliance on a central site | The system should not rely on a central site, which would be a single point of failure and a bottleneck. |
| Continuous operation | The system should never require downtime. There should be on-line backup and recovery, fast enough to be performed online without noticeable performance drops. |
| Location independence | Applications shouldn't even be aware of where the data is physically stored. |
| Fragmentation independence | Relations can be divided into fragments and stored at different sites. |
| Replication independence | Copies of relations and fragments can be stored on different sites. This should be under the hood; applications shouldn't know that duplicates are being maintained and synchronised. |
| Distributed query processing | Queries are broken down into component transactions to be executed at the distributed sites (see Parallel Databases for more info). |
| Distributed transaction management | The system should support atomic transactions (either a transaction happens in full or it doesn't happen at all). |
| Hardware independence | It shouldn't matter what hardware each site uses; one could be on x86, another could be on ARM, etc. |
| Operating system independence | It shouldn't matter what OS each site uses; one could be using Linux, another could use Windows, etc. |
| Network independence | It shouldn't matter what communication protocols and network topologies are used to connect the sites. |
| DBMS independence | It shouldn't matter what DBMS each site uses; one could be using MySQL, another could use PostgreSQL, etc. |
Query → Localised query → Optimised query
πp(S) projects the attributes used in predicate p over tuples in S
R ⋉p S ≡ πR(R ⨝p πp(S))
R ⨝p S ≡ (R ⋉p S) ⨝p S
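As a toy illustration of the semijoin reduction above, assuming an equality predicate R.id = S.id and that R and S live at different sites (the relations and attribute names are invented):

```python
# R lives at site 1, S at site 2; the join predicate p is R.id = S.id.
R = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
S = [{"id": 2, "city": "x"}, {"id": 3, "city": "y"}, {"id": 4, "city": "z"}]

# Step 1 (site 2 -> site 1): ship only the join attribute, i.e. pi_p(S).
s_ids = {s["id"] for s in S}

# Step 2 (site 1): the semijoin R ⋉ S keeps only the R-tuples that will join.
r_semi = [r for r in R if r["id"] in s_ids]

# Step 3 (site 1 -> site 2): ship the reduced R and finish the join there.
result = [{**r, **s} for r in r_semi for s in S if r["id"] == s["id"]]
print(result)  # [{'id': 2, 'name': 'b', 'city': 'x'}, {'id': 3, 'name': 'c', 'city': 'y'}]
```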
Distributed Two-Phase Locking (2PL)
Deadlock
Online Transaction Processing (OLTP) is characterised by a large number of short transactions (INSERT, UPDATE, DELETE) with an emphasis on fast query processing and data integrity.
Online Analytical Processing (OLAP) is characterised by a small number of complex transactions often involving aggregation and multidimensional schemas. Typically, data is taken from a data warehouse. OLAP is ideal for data mining, complex analytical calculations, and business reporting functions such as financial analysis.
Data mining is the process of discovering hidden patterns and relationships in large databases using advanced analytical techniques.
Codd's 12 rules for OLAP:
| Rule | Explanation |
|---|---|
| Multidimensional conceptual view | A multidimensional data model is provided that is intuitively analytical and easy to use. The model decides how users perceive business problems, e.g. by time, location, or product. |
| Transparency | The technology, underlying data repository, computing architecture and the diverse nature of source data are transparent to users. |
| Accessibility | Access is provided only to the data that is needed to perform the specific analysis, presenting a single, coherent and consistent view to the users. |
| Consistent reporting performance | Users do not experience any significant reduction in reporting performance as the number of dimensions or the size of the database increases. Users perceive consistent runtime, response time, and machine utilization every time a query is run. |
| Client-server architecture | The system conforms to the principles of client-server architecture for optimum performance, flexibility, adaptability and interoperability. |
| Generic dimensionality | Every data dimension is logically equivalent in both structure and operational capabilities. |
| Dynamic sparse matrix handling | The schema adapts to the specific analytical model being used to optimize sparse matrix handling. |
| Multi-user support | Support is provided for users to work concurrently with either the same analytical model or to create different models from the same data. |
| Unrestricted cross-dimensional operations | The system recognizes dimensions and automatically performs roll-up and drill-down operations within a dimension or across dimensions. |
| Intuitive data manipulation | Consolidation path reorientation, drill-down, roll-up and other manipulations are accomplished intuitively and are enabled directly via point-and-click actions. |
| Flexible reporting | Business users are provided with capabilities to arrange columns, rows and cells in a manner that allows easy manipulation, analysis and synthesis of information. |
| Unlimited dimensions and aggregation levels | There can be at least 15 to 20 data dimensions within a common analytical model. |
A data warehouse is a large store of data accumulated from a wide range of sources within a company and used to guide management decisions. A data warehouse is:
Why use a separate data warehouse?
A data warehouse may be stored using:
Data may be accessed using:
The OLAP cube is an array-based multidimensional database that makes it possible to process and analyse data in multiple dimensions much more efficiently than a traditional relational database. A relational database stores data in a two-dimensional, row-by-column format; the OLAP cube extends this with additional layers, adding further dimensions to the database. These are often arranged in a hierarchy, e.g. country, state, city, store.
A star schema consists of one fact table connected to a collection of dimension tables. It gets its name from its resemblance to a star shape with the fact table at its center and dimension tables surrounding it.
The primary key of the fact table is a combination of foreign keys from the dimension tables. Referential integrity requires that each foreign key value in the fact table matches an existing primary key value in the corresponding dimension table.
Advantages of star schemas:
Information Retrieval (IR) is the retrieval of media based on the information a user needs. The user provides a query to the retrieval system, which returns a set of relevant documents from a database. Documents include web pages, scientific papers, news articles, or paragraphs.
An IR model consists of:
IR systems either:
During tokenization, the system identifies distinct words in the document, by separating on whitespace to obtain a list of tokens. It then cleans and prepares the data:
Zipf's law states that the frequency of a word is inversely proportional to its rank in the frequency table: fₐ ∝ 1/rₐ, where rₐ is the rank of word a (so frequency × rank is roughly constant). The most useful index terms lie in the middle of the rank–frequency curve, and words at either extreme can be ignored. Stopword removal is the process of removing very common words which have little semantic value, such as the, a, at, and, also.
The lexicon is a data structure listing all terms that appear in a collection of documents, typically implemented as a hash table for fast lookup. The inverted index is a data structure that records, for each term in the lexicon, information such as its number of occurrences within documents and pointers to those occurrences (postings).
To calculate the rank of each term:
The boolean model involves querying terms in the lexicon and inverted index, and applying boolean set operators to posting lists to identify relevant documents:
Given the document collection:
d1 = “Three quarks for Master Mark”
d2 = “The strange history of quark cheese”
d3 = “Strange quark plasmas”
d4 = “Strange Quark XPress problem”
The lexicon and inverted index:
three → {d1}
quark → {d1, d2, d3, d4}
master → {d1}
mark → {d1}
strange → {d2, d3, d4}
history → {d2}
cheese → {d2}
plasma → {d3}
express → {d4}
problem → {d4}
The query:
strange AND quark AND NOT cheese
The result set is:
{d2, d3, d4} ∩ {d1, d2, d3, d4} ∩ {d1, d3, d4}
= {d3, d4}
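The same result can be reproduced with a short sketch that builds the inverted index and answers the query with set operations (this skips the stemming used above, so e.g. "quarks" in d1 is not normalised to "quark", but the answer to this query is unchanged):

```python
docs = {
    "d1": "Three quarks for Master Mark",
    "d2": "The strange history of quark cheese",
    "d3": "Strange quark plasmas",
    "d4": "Strange Quark XPress problem",
}

# Build the inverted index: term -> set of documents containing it.
index = {}
for doc_id, text in docs.items():
    for token in text.lower().split():
        index.setdefault(token, set()).add(doc_id)

all_docs = set(docs)

# strange AND quark AND NOT cheese
result = index["strange"] & index["quark"] & (all_docs - index["cheese"])
print(sorted(result))  # ['d3', 'd4']
```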
The binary vector model represents documents and queries as vectors of term-based features, with one feature per term t for each document dⱼ and query q.
Features are binary: a feature is 1 if the term occurs in the document (or query) and 0 otherwise.
To measure the similarity between a document dⱼ and a query q, we can use cosine similarity, which produces a value between 0 and 1: the dot product dⱼ · q is normalised by the magnitude of each vector, sim(dⱼ, q) = (dⱼ · q) / (|dⱼ| |q|).
TF-IDF (Term Frequency-Inverse Document Frequency) evaluates the relevance of a term to a document: a term is weighted highly if it occurs often in that document but rarely across the collection.
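A commonly used formulation (a standard textbook weighting; the exact variant used in the lectures may differ) is:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
```

where tf(t, d) is the number of occurrences of term t in document d, df(t) is the number of documents containing t, and N is the total number of documents in the collection.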
Typical queries are ambiguous and can be refined to provide better search results.
The Rocchio method implements query refinement through relevance feedback. The query is first executed to give the user an initial set of results. The user tags each result as relevant or non-relevant, allowing the model to adjust the term weights in the query: terms from relevant documents are boosted and terms from non-relevant documents are down-weighted. The improved query is then re-executed, and the process repeats until the results are satisfactory. Typical starting weights are α = 1, β = 0.75, γ = 0.25.
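For reference, the standard Rocchio update (the textbook formulation, assumed here rather than taken from the lecture slides) modifies the query vector as:

```latex
\vec{q}_{new} = \alpha\,\vec{q}_0
  + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r} \vec{d}_j
  - \frac{\gamma}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}} \vec{d}_j
```

where D_r is the set of documents the user marked relevant and D_nr the set marked non-relevant; α, β and γ control how much weight is given to the original query, the relevant documents, and the non-relevant documents respectively.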
Relevance is the key to determining the effectiveness of an IR system. Precision and recall are two common statistics used to determine the effectiveness of a system. F1 combines precision and recall by taking the harmonic mean between them.
TP (True Positive): "yes" data is correctly predicted as "yes"
TN (True Negative): "no" data is correctly predicted as "no"
FP (False Positive): "no" data is incorrectly predicted as "yes"
FN (False Negative): "yes" data is incorrectly predicted as "no"
Accuracy = (TP + TN) / (TP + TN + FP + FN):
the proportion of all predictions that are correct
Precision = TP / (TP + FP):
the proportion of positive predictions that are correct
i.e. how many of the returned results were actually relevant?
Recall = TP / (TP + FN):
the proportion of actual positives that are correctly predicted
i.e. how many of the relevant results were found?
F1 = 2 × (Precision × Recall) / (Precision + Recall), or equivalently 2TP / (2TP + FP + FN):
the harmonic mean of precision and recall
ROC Curve (Receiver Operating Characteristic): plots true positive rate (recall) against false positive rate
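A tiny helper that computes these metrics from confusion-matrix counts (the example numbers are made up):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 8 relevant documents retrieved, 2 irrelevant retrieved, 4 relevant missed
print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)
```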
The first hierarchical database was the IBM Information Management System, developed in 1966 to support the Apollo programme. Hierarchical databases are built as trees of related record types connected by parent-child relationships.
An occurrence of a parent-child relationship (PCR) consists of:
A database may contain many hierarchical occurrence trees:
Disadvantages of hierarchical databases:
Network databases were standardised by the Conference on Data Systems Languages committee in 1969. A network database allows 1:N (one-to-many) relationships between records and files.
An occurrence of a set consists of:
Set types can only represent one-to-many relationships. To create many-to-many relationships, a dummy record can be used to join two record types.
Advantages of network databases:
Disadvantages of network databases:
A native XML database allows data to be specified and stored in XML format.
Object-relational impedance mismatch is the set of conceptual and technical difficulties encountered when a relational database is served by an application program written in an object-oriented language, since objects and classes must be mapped onto database tables defined by the relational schema. Impedance mismatch can be addressed by using a NoSQL database.
NoSQL is an approach to database design that provides flexible schemas for the storage and retrieval of data beyond the traditional table structures found in relational databases. NoSQL databases are:
NoSQL data types: