Computer Science / Software Engineering Notes Network

Advanced Databases Part 1

Matthew Barnes

Parts
Data types
  Temporal data
  Spatial data
  Multimedia data
DBMS Architecture
Data Storage
  Storage Organisation
  Secondary Storage
  Buffer Management
  The Five Minute Rule
Disk Organisation
  Data Items
  Records
  Blocks
  Insertion and Deletion
Access Structures
  Index basics
  Sequential files
  Dense indexes
  Sparse indexes
  Multi-level indexes
  Duplicates
  Deletions
  Insertions
  Secondary indexes
  Indirection
  B+ trees
    Insertion
    Deletion
  Hash tables
    Extensible hashing
    Linear hashing
  Indexing vs Hashing
Multidimensional Access Structures
  Conventional indexes
  Hash-like
    Grid files
    Partitioned hashing
  Hierarchical indexes
    kd-trees
    Quad trees
    R-trees
    UB-trees
  Bitmap indexes


Parts

Part 1

✅ Data types

✅ DBMS Architecture

✅ Data Storage

✅ Access Structures

✅ Multidimensional Access Structures

Part 2

 Relational Algebra

 Problems and Complexity in Databases

 Conjunctive Queries - Containment and Minimisation

 Query Processing

 Data Integration

 Ontology Based Data Integration

Part 3

 Transactions and Concurrency

 Advanced Transactions

 Logging and Recovery

 Parallel Databases

 Distributed Databases

 Data Warehousing

 Information Retrieval

 Non-Relational Databases

Stream Processing (not examined)

Peer to Peer Systems (not examined)

Data types

Temporal data

Spatial data

Multimedia data

DBMS Architecture

DBMS Interfaces (what is exposed to us)

DBMS Components (how it all works)

Data Storage

Storage Organisation

  1. Cache: volatile, very fast, very expensive, limited capacity
  2. Main Memory: volatile, fast, affordable, medium capacity (RAM)
  3. Secondary Storage: non-volatile, slow, cheap, large capacity (HDD, solid state)
  4. Tertiary Storage: non-volatile, very slow, very cheap, very large capacity (tape storage, backups)

Secondary Storage

[Figure: layout of a sector on a track: gap, sync, mark, data, ecc.]

Rotational speed (rpm) | Average delay (ms)
 4,200                 | 7.14
 5,400                 | 5.56
 7,200                 | 4.17
10,000                 | 3.00
15,000                 | 2.00
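The delays above are just the time for half a revolution; a quick sketch of the calculation (the function name is mine):

```python
def avg_rotational_delay_ms(rpm):
    """Average rotational delay: time for half a revolution, in milliseconds."""
    return 60_000 / rpm / 2   # 60,000 ms per minute, divided by rpm, halved

# Reproduces the table above
for rpm in (4_200, 5_400, 7_200, 10_000, 15_000):
    print(f"{rpm:>6} rpm -> {avg_rotational_delay_ms(rpm):.2f} ms")
```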

  1. Read block
  2. Modify in memory
  3. Write block
  4. Verify block (optional)

                  | HDD          | SSD
Random Read IOPS  | 125-150 IOPS | ~50,000 IOPS
Random Write IOPS | 125-150 IOPS | ~40,000 IOPS

Buffer Management

Single buffering:

  1. Read B1 → Buffer
  2. Process data in buffer
  3. Read B2 → Buffer
  4. Process data in buffer

Double buffering:

  1. Read B1 → Buffer 1
  2. Process data in buffer 1 && Read B2 → Buffer 2
  3. Process data in buffer 2 && Read B3 → Buffer 1
  4. Process data in buffer 1 && Read B4 → Buffer 2
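The double-buffering schedule above can be sketched with a single prefetch worker that reads the next block while the current one is processed (read_block and process are placeholder names):

```python
from concurrent.futures import ThreadPoolExecutor

def read_block(i):
    # stands in for a slow disk read
    return f"block-{i}"

def process(buf):
    # stands in for CPU work on the buffered block
    return buf.upper()

def double_buffered(n_blocks):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(read_block, 0)        # fill the first buffer
        for i in range(n_blocks):
            buf = future.result()                  # wait for the pending read
            if i + 1 < n_blocks:
                # prefetch the next block while we process the current one
                future = pool.submit(read_block, i + 1)
            results.append(process(buf))
    return results
```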

The Five Minute Rule
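As a sketch of the rule itself (the formula and the 1997-era numbers are the classic Gray and Graefe ones, not taken from these notes):

```python
def break_even_seconds(pages_per_mb, disk_price, ios_per_sec, ram_price_per_mb):
    """Interval below which keeping a page cached in RAM is cheaper
    than re-reading it from disk every time it is needed."""
    return (pages_per_mb * disk_price) / (ios_per_sec * ram_price_per_mb)

# 1997 figures: 8 KB pages (128 per MB), a $2,000 disk doing 64 IOPS,
# RAM at $15 per MB -> roughly 266 seconds, i.e. about five minutes
interval = break_even_seconds(128, 2000, 64, 15)
```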

Disk Organisation

Data Items

Records

  1. e#, 2 byte integer
  2. name, 10 byte char
  3. dept, 2 byte code

Example: two fixed-length records stored one after another (names padded to 10 bytes):

e# = 55 | name = "smith" | dept = 02
e# = 83 | name = "jones" | dept = 01
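The fixed-length layout above can be sketched with Python's struct module (the format string is my assumption: a 2-byte int for e#, a 10-byte name field, a 2-byte dept code):

```python
import struct

RECORD = struct.Struct("<h10s2s")   # 2 + 10 + 2 = 14 bytes per record

def pack(e, name, dept):
    # pad the name with spaces to its fixed 10-byte width
    return RECORD.pack(e, name.ljust(10).encode(), dept.encode())

def unpack(buf):
    e, name, dept = RECORD.unpack(buf)
    return e, name.decode().rstrip(), dept.decode()

# Two records stored back to back: record k starts at offset 14 * k
block = pack(55, "smith", "02") + pack(83, "jones", "01")
assert unpack(block[0:14]) == (55, "smith", "02")
assert unpack(block[14:28]) == (83, "jones", "01")
```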

{
  {
    name: "Jonathan Joestar",
    power: "hamon",
    enemy: "Dio Brando"
  },
  {
    name: "Johnny Joestar",
    power: "stand",
    stand_name: "Tusk"
  }
}

Blocks

  1. Use fixed length records - no need to separate (if a record is size 10 bytes and you read through 10 bytes, you know the next record will follow)
  2. Use a special marker to indicate record end (like a null terminating character but for records)
  3. Give record lengths (or offsets)
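A minimal sketch of option 3 (length prefixes), with names of my choosing: each record is preceded by a 2-byte length, so a reader can skip from one record to the next without separators.

```python
import struct

def pack_records(records):
    out = b""
    for r in records:
        out += struct.pack("<H", len(r)) + r   # 2-byte length, then the bytes
    return out

def unpack_records(block):
    records, pos = [], 0
    while pos < len(block):
        (n,) = struct.unpack_from("<H", block, pos)
        records.append(block[pos + 2 : pos + 2 + n])
        pos += 2 + n                           # jump straight to the next record
    return records

assert unpack_records(pack_records([b"smith", b"jones"])) == [b"smith", b"jones"]
```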

Insertion and Deletion

  1. Look for the next empty slot

  1. Just stick it at the end

  1. Find space on a “nearby” block

  1. Create an overflow block

  1. If we’re using a map, such as an offset table or logical ID → physical ID, we make the tombstone a null pointer in place of the record we’re deleting.

Logical ID | Physical ID
56         | 5AF8
57         | (null: tombstone)
58         | 2DE0

  1. If we’re using physical addresses, we place the bit that serves as a tombstone at the beginning of the record.

[Figure: a block of memory spanning physical addresses 5AF0 to 5B18; logical ID 56 maps to physical ID 5AF8, where the deleted record's tombstone bit is set.]
5B00 and above can be reused, but 5AF8 can never be reused.
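A sketch of the map-based tombstone (the physical IDs 5AF8 and 2DE0 come from the example above; 2DD8 for record 57 is made up):

```python
# logical id -> physical id (hex strings stand in for real addresses)
id_map = {56: "5AF8", 57: "2DD8", 58: "2DE0"}

def delete(logical_id):
    # tombstone: the map entry stays, but the pointer is nulled out,
    # so dangling references to the logical id are detected, not misread
    id_map[logical_id] = None

def lookup(logical_id):
    phys = id_map.get(logical_id)
    if phys is None:
        raise KeyError(f"record {logical_id} was deleted")
    return phys

delete(57)
assert lookup(56) == "5AF8"
```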

Access Structures

Index basics

Sequential files

Dense indexes

Sparse indexes

Multi-level indexes

Duplicates

Deletions

[Figures: four before/after examples of deletion from the index.]

Insertions

[Figures: four before/after examples of insertion into the index.]

Secondary indexes

Indirection

B+ trees

  1. All leaves same distance from root (balanced tree)
  2. Pointers in leaves point to records except for “sequence pointers”
  3. That whole minimum thing I said just above

Insertion

  1. Space available in leaf
  2. Leaf overflow
  3. Non-leaf overflow
  4. New root
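The four cases can be traced in a toy B+ tree sketch (at most two keys per node; records and the leaves' sequence pointers are omitted, and all names are mine):

```python
MAX_KEYS = 2                        # a node splits when it exceeds this

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []
        self.children = []          # child nodes (internal nodes only)

class BPlusTree:
    def __init__(self):
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:                   # case 4: the old root overflowed -> new root
            mid, right = split
            root = Node(leaf=False)
            root.keys, root.children = [mid], [self.root, right]
            self.root = root

    def _insert(self, node, key):
        if node.leaf:
            node.keys.append(key)   # case 1: space available (kept sorted)
            node.keys.sort()
        else:
            i = sum(k <= key for k in node.keys)   # which child to descend into
            split = self._insert(node.children[i], key)
            if split:               # a child split: absorb the pushed-up key
                mid, right = split
                node.keys.insert(i, mid)
                node.children.insert(i + 1, right)
        if len(node.keys) <= MAX_KEYS:
            return None
        # cases 2 and 3: leaf or non-leaf overflow -> split this node in half
        right, half = Node(leaf=node.leaf), len(node.keys) // 2
        if node.leaf:
            right.keys, node.keys = node.keys[half:], node.keys[:half]
            mid = right.keys[0]     # a leaf split *copies* the separator key up
        else:
            mid = node.keys[half]   # an internal split *pushes* the separator up
            right.keys, right.children = node.keys[half + 1:], node.children[half + 1:]
            node.keys, node.children = node.keys[:half], node.children[:half + 1]
        return mid, right

def leaf_keys(node):                # in-order scan of the leaves
    if node.leaf:
        return list(node.keys)
    return [k for c in node.children for k in leaf_keys(c)]
```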

Deletion

  1. Simple case
  2. Coalesce with sibling
  3. Re-distribute keys
  4. Cases 2. or 3. at non-leaf

Hash tables

  1. The hash function calculates the block pointer directly, or as an offset from the first block.

  1. The hash function calculates the offset in an array of block pointers.

Worked example (operation, then explanation):

Nothing

We haven’t done anything so far.

Ah alright; if you really want to be pedantic about it, we’ve initialised our hash table with 4 buckets, containing 2 records per bucket.

Insert a, b, c, d

h(a) = 1

h(b) = 2

h(c) = 1

h(d) = 0

For each key, we hash it using the values above and we “append” that key (or record) to the associated bucket.

For example, when we hash ‘a’ into ‘1’, we look up the bucket labelled ‘1’ and insert our key ‘a’ there.

Insert e

h(e) = 1

Like above, we try to insert ‘e’ into bucket ‘1’ according to the hash function.

However, bucket ‘1’ is full, therefore we add an overflow bucket to extend the capacity of bucket ‘1’. We then put ‘e’ into that overflow bucket.

Delete b

When we delete a key, we can just remove it from the bucket. If there were any keys after it, we can just move them up the bucket.

Delete c

By deleting ‘c’, we’re freeing space for ‘e’ in the bucket it’s leading from, so we can move ‘e’ from the overflow bucket to the original bucket and remove the overflow bucket.
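The walkthrough above can be mimicked with a tiny sketch (hash values are passed in explicitly, matching the h(...) values given; slots beyond the bucket capacity stand in for the overflow bucket):

```python
class StaticHash:
    def __init__(self, n_buckets=4, cap=2):
        self.cap = cap
        self.buckets = [[] for _ in range(n_buckets)]  # primary slots + overflow

    def insert(self, key, h):
        self.buckets[h].append(key)     # entries past `cap` model the overflow bucket

    def delete(self, key, h):
        self.buckets[h].remove(key)     # later keys shift up, freeing overflow space

    def overflow(self, h):
        return self.buckets[h][self.cap:]

t = StaticHash()                         # 4 buckets, 2 records per bucket
for key, h in [("a", 1), ("b", 2), ("c", 1), ("d", 0), ("e", 1)]:
    t.insert(key, h)
assert t.overflow(1) == ["e"]            # bucket 1 was full, so 'e' overflows
t.delete("c", 1)
assert t.overflow(1) == []               # 'e' has moved back into the main bucket
```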

Extensible hashing

Worked example (operation, then explanation):

Nothing

The variable i is just 1, and we have three keys.

Increase i to 2

Partition “10” and “11” into different buckets

Add “1010” to the “10” bucket

Insert 1010

To insert 1010, we would put it into the bucket that directory entry ‘1’ points to. However, that bucket is full.

Therefore we need to increase i from 1 to 2.

Then we partition “10” and “11” into different buckets, leaving us with enough space to put “1010” into the “10” bucket.

Add “0111” to the “01” bucket

Insert 0111

Since the “01” bucket has enough space for one more key, we can place “0111” into the “01” bucket.

It just so happens that “00” and “01” share the same bucket. However, if we need to add one more key they’ll need their own buckets.

Partition “00” and “01” into different buckets

Add “0000” to the “00” bucket

Insert 0000

Speak of the devil! There isn’t enough space in the “00” bucket, so we’ll need to split “00” and “01” into their own buckets.

Then we’ll have enough space to put 0000 into the “00” bucket.

Increase i to 3

Partitioning “100” and “101” into different blocks

Add “1001” to the “100” bucket

Insert 1001

Don’t worry, this is the last operation!

If we look at bucket “10”, we’ll see it’s full again.

Like last time, we need to increment i to 3.

“100” and “101” will then share a bucket. We will split that into two, so “100” and “101” will have their own buckets.

We will then have enough space in bucket “100” to put 1001 in.

Duplicate keys are allowed, in case you were wondering.
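A minimal sketch of extensible hashing on bit-string keys (the class and field names are mine; an overflow that one split can't fix is handled by recursing):

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth              # local depth: bits this bucket distinguishes
        self.keys = []

class ExtensibleHash:
    def __init__(self, cap=2):
        self.cap = cap                  # records per bucket
        self.i = 1                      # global depth: directory has 2**i entries
        self.dir = [Bucket(1), Bucket(1)]

    def insert(self, key):              # key is a bit string like "1010"
        b = self.dir[int(key[: self.i], 2)]   # first i bits pick a directory entry
        if len(b.keys) < self.cap:
            b.keys.append(key)
            return
        if b.depth == self.i:           # no spare bit: double the directory
            self.i += 1
            self.dir = [self.dir[j >> 1] for j in range(2 ** self.i)]
        depth = b.depth + 1             # partition b on one more bit
        b0, b1 = Bucket(depth), Bucket(depth)
        for j in range(2 ** self.i):    # repoint the entries that shared b
            if self.dir[j] is b:
                self.dir[j] = b1 if (j >> (self.i - depth)) & 1 else b0
        for k in b.keys:                # redistribute, then retry the insert
            self.insert(k)
        self.insert(key)
```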

Linear hashing

Worked example (operation, then explanation):

Nothing

Here we have a linear hash where i = 2, and the highest-numbered bucket we have is 01 (so n, the current number of buckets, is 2).

The bold numbers underneath the blue buckets are the bucket values.

If the value m equals a bucket value, or a number shown underneath a bucket value, the key is added to that bucket.

As you can see on the left, for example, 11 is underneath bucket value 01, so 0101 will go to bucket 01, and 1111 will also be redirected to bucket 01 (see the case for n ≤ m < 2^i above).

In a way, the pairs 00, 10 and 01, 11 “share” buckets, a bit like how extensible hashing shared buckets.

Increment n

If our buckets get too crowded, we can add a new bucket. This is similar to that partitioning / splitting thing we did with extensible hashing.

Here we give “10” its own bucket, so now we can move all the keys where m = 10 over to the new 10 bucket.

Insert 0101

If the key K = 0101 and i = 2, then m = 01. This should go into bucket 01.

Since it’s full, we can use an overflow block.

Increment n

If we increment n, we’re giving 11 their own bucket. Therefore we can move the key 1111 to the new bucket 11.

With the new space available in bucket 01, we can move 0101 out of the overflow block and into the main block.

However, now we’ve reached the maximum number of buckets we can create where i = 2. If we want more, we need to increase i.

Increment i

Incrementing i doesn’t change the capacity per se, but it changes the potential for capacity increase.

As you can see, the buckets are shared again; notice that there are m = 101 keys in the 001 bucket or m = 111 keys in the 011 bucket, for example.

Increment n and insert 100

Now, there is space in the 000 bucket, so we could just insert 100 in there, but let’s try to spread out our keys into different buckets more.

When we add the 100 bucket, m = 000 and m = 100 no longer share the same bucket.

Therefore we can now add 100 into its own 100 bucket.
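A minimal linear-hashing sketch (names are mine; overflow is modelled by letting a bucket's list simply grow, rather than chaining explicit overflow blocks):

```python
class LinearHash:
    def __init__(self, i=2, n=2, cap=2):
        self.i, self.n, self.cap = i, n, cap
        self.buckets = [[] for _ in range(n)]

    def _addr(self, key):
        m = int(key[-self.i:], 2)       # m = last i bits of the key
        # if n <= m < 2^i, the bucket doesn't exist yet: drop the top bit
        return m if m < self.n else m - 2 ** (self.i - 1)

    def insert(self, key):
        self.buckets[self._addr(key)].append(key)

    def grow(self):                     # "increment n": add bucket n
        if self.n == 2 ** self.i:       # all 2^i addresses in use: increment i
            self.i += 1
        self.n += 1
        partner = self.n - 1 - 2 ** (self.i - 1)   # bucket that held the new keys
        self.buckets.append([])
        old, self.buckets[partner] = self.buckets[partner], []
        for k in old:                   # re-address the partner's keys
            self.insert(k)
```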

Indexing vs Hashing

SELECT ...
FROM R
WHERE R.A = 5

SELECT ...
FROM R
WHERE R.A > 5

A hash table handles the first (equality) query well, but it cannot answer the second (range) query; a B+ tree index supports both.

Multidimensional Access Structures

Conventional indexes

  1. Get all the matching records using an index on attribute “department”
  2. Check values of “salary” attribute on those records

  1. Use index on attribute “department” and “salary” separately for the two predicates department=sales and salary>£40,000 and get two sets of records
  2. Take the intersection of the two record sets (find records that are inside both sets)

  1. Use the index of “department” to point us to a suitable index on the “salary” attribute.
  2. Get all matching records of that “salary” attribute index we were pointed to.
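The second strategy (intersecting two record-id sets) as a minimal sketch, with made-up indexes and record ids:

```python
# Hypothetical single-attribute indexes, each mapping a predicate to record ids
dept_index = {"sales": {101, 102, 103, 107}}                 # department -> record ids
salary_index = [(38000, 105), (41000, 102), (52000, 107)]    # sorted (salary, record id)

def query(dept, min_salary):
    by_dept = dept_index.get(dept, set())
    by_salary = {rid for sal, rid in salary_index if sal > min_salary}
    return by_dept & by_salary          # step 2: intersect the two record sets

# records in 'sales' earning over 40,000
print(query("sales", 40000))
```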

Hash-like

Grid files

Partitioned hashing

Hierarchical indexes

kd-trees

Quad trees

R-trees

UB-trees

  1. Mapping n-dimensional space onto a 1-dimensional line using a fractal space-filling curve (Hilbert curve, Moore curve, Z-order curve etc.)

  2. Partitioning the ranges (called Z-regions) and indexes (called Z-indexes) of the curve using a B+ tree. The leaves of the B+ tree represent these Z-regions, and point to the records that lie within that Z-region.

  3. Turning the query into a query rectangle (a rectangle covering several Z-indexes; a record satisfies the query if its Z-index is inside the query rectangle)

  4. Looking for Z-regions that intersect the query rectangle

This requires:

  1. A mapping of a multi-dimensional point (record) to a Z-index
  2. A way to determine which Z-regions intersect the query rectangle
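For the Z-order curve in two dimensions, the point-to-Z-index mapping is plain bit interleaving (a sketch; the function name is mine):

```python
def z_index(x, y, bits=4):
    """Interleave the bits of (x, y) into a Z-order (Morton) code."""
    z = 0
    for b in range(bits):
        z |= ((y >> b) & 1) << (2 * b)        # y bits take the even positions
        z |= ((x >> b) & 1) << (2 * b + 1)    # x bits take the odd positions
    return z

# Nearby 2-D points tend to get nearby codes, so a B+ tree over z_index
# values can serve rectangle queries by scanning intersecting Z-regions.
```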

Bitmap indexes