Computer Science / Software Engineering Notes Network

Advanced Database Notes Part 1

Advanced Databases Part 2

Matthew Barnes

Relational Algebra        2

Relational Transformations        5

Relational Algebra and SQL        6

Problems and Complexity in Databases        7

Relational calculus        8

List all the airlines        9

List the codes of the airports in London        10

List the airlines that fly directly from London to Glasgow        11

Active domain        11

Conjunctive Query - Containment and Minimisation        12

Conjunctive queries        12

List all the airlines        13

List the codes of the airports        13

List the airlines that fly directly from London to Glasgow        13

Homomorphism        14

Semantics of Conjunctive Queries        16

CQ problems        17

Homomorphism Theorem        18

Minimisation        21

Query Processing        23

Logical query plans        24

Cost Estimation        24

Query Optimisation        26

Physical query plans        29

Scanning        30

One-Pass Algorithms        30

Nested-Loop Joins        32

Two-Pass Algorithms        35

Index-based Algorithms        36

Data Integration        37

Query Rewriting        38

Global as View        40

Local as View        40

The Bucket Algorithm        42

Ontology Based Data Integration        44

Ontology-Based Query Answering (OBQA)        44

Universal Models        44

The Chase        44

Relational Algebra

  1. Each row represents a k-tuple of R
  2. The values of an attribute are all from the same domain
  3. Each attribute of each tuple in a relation contains a single atomic value
  4. The ordering of rows is immaterial (relations are just sets, so row order doesn’t matter)
  5. All rows are distinct (relations are sets, so there can be no duplicates)
  1. Each column is identified by its name
  2. The ordering of the attributes (columns) do not matter
  1. The ordering of the attributes matter (so we can reference attributes by order, not by name)

Intension

Extension

Relation Schema

Instance of a relation schema (Relation)

Relational Database Schema

Relational Database Instance (Database)

Characters

Character

Stand ID

Jotaro Kujo

1

Polnareff

2

Jonathan Joestar

null

Gyro Zeppeli

4

Stands

Stand ID

Stand

1

Star Platinum

2

Silver Chariot

3

The World

Semijoin

Character

Stand ID

Jotaro Kujo

1

Polnareff

2

Relational Transformations

Relational Algebra and SQL

Problems and Complexity in Databases

Relational calculus

{x1, ..., xk | φ}

Flight

Origin

Destination

Airline

VIE

LHR

BA

LGW

GLA

U2

LCA

VIE

OS

Airport

Code

City

VIE

Vienna

LHR

London

LGW

London

LCA

Larnaca

GLA

Glasgow

EDI

Edinburgh

List all the airlines

Flight

Origin

Destination

Airline

VIE

LHR

BA

LGW

GLA

U2

LCA

VIE

OS

Airport

Code

City

VIE

Vienna

LHR

London

LGW

London

LCA

Larnaca

GLA

Glasgow

EDI

Edinburgh

List the codes of the airports in London

Flight

Origin

Destination

Airline

VIE

LHR

BA

LGW

GLA

U2

LCA

VIE

OS

Airport

Code

City

VIE

Vienna

LHR

London

LGW

London

LCA

Larnaca

GLA

Glasgow

EDI

Edinburgh

List the airlines that fly directly from London to Glasgow

Flight

Origin

Destination

Airline

VIE

LHR

BA

LGW

GLA

U2

LCA

VIE

OS

Airport

Code

City

VIE

Vienna

LHR

London

LGW

London

LCA

Larnaca

GLA

Glasgow

EDI

Edinburgh

   u = London ⋀

   Airport(y, v) ⋀

   v = Glasgow ⋀

   Flight(x,y,z)

}

Active domain

Conjunctive Query - Containment and Minimisation

Conjunctive queries

Q(x) := ∃y (R1(v1) ∧ ... ∧ Rm(vm))

Q(x) :- R1(v1) ∧ ... ∧ Rm(vm)

List all the airlines

List the codes of the airports

List the airlines that fly directly from London to Glasgow

   u = London ⋀

   Airport(y, v) ⋀

   v = Glasgow ⋀

   Flight(x,y,z)

}

 Airport(y,Glasgow),

 Flight(x,y,z)

Homomorphism

= {P(h(x),h(y)),         P(h(y),h(z)),        P(h(z),h(x))}

= {P(x,y),                P(y,z),                P(z,x)}

Semantics of Conjunctive Queries

CQ problems

Homomorphism Theorem

  1. Q1 ⊆ Q2 ⇒ Homomorphism from Q2 to Q1
  1. Homomorphism from Q2 to Q1 ⇒ Q1 ⊆ Q2

Minimisation

  1. Q1 ≡ Q2
  2. Q2 has fewer body atoms than Q1

Minimization(Q(x) :- body)
Repeat until no change

  choose an atom α ∈ body
 
if there is a containment mapping from Q(x) :- body to Q(x) :- body ∖ {α}
 
then body := body ∖ {α}
Return Q(x) :- body

Query Processing

Logical query plans

Cost Estimation

Scan

T(scan(R)) = T(R)

For all A in R, V(scan(R), A) = V(R, A)

Product

T(R ✕ S) = T(R)T(S)

For all A in R, V(R ✕ S, A) = V(R,A)

For all B in S, V(R ✕ S, B) = V(S,B)

Projection

T(πA(R)) = T(R)

For all A in R and πA(R), V(πA(R),A) = V(R,A)

This assumes that the projection does not remove duplicate tuples (so value count remains the same)

Selection attr=value

T(σA=c(R)) = T(R) / V(R,A)

V(σA=c(R),A) = 1

Assumes all values of A appear with equal frequency

Selection

attr=attr

T(σA=B(R)) = T(R) / max(V(R,A), V(R,B))

V(σA=B(R),A) = V(σA=B(R),B) = min(V(R,A), V(R,B))

Assumes that all values of A and B appear with equal frequency

Note: for all other attributes X of R, you can reduce it to V(R, X) because it’s not being acted upon

Selection

Inequality

T(σA<c(R)) = T(R) / 3 (as a rule of thumb)

May be different if range of values and distribution is known

Selection

Conjunction / Disjunction

T(σA=c1∧B=c2(R)) = (and)

T(σA=c1∨B=c2(R)) =+ (or)

Join

T(R⋈A=BS) = T(R)T(S) / max(V(R,A), V(S,B))

T(R⋈R1=S1∧R2=S2S) =

T(R)T(S) / max(V(R,R1), V(S,S1))

V(R⋈A=BS,A) = V(R⋈A=BS,B) = min(V(R,A), V(S,B))

Range

[1,10]

[11,20]

[21,30]

[31,40]

[41,50]

Tuples in range

50

2,000

2,000

3,000

2,950

Query Optimisation

  1. Start with canonical form

  1. Move σ operators down the tree

  1. Reorder subtrees to put most restrictive σ first

  1. Combine × and σ to create ⨝

  1. Move π operators down the tree

Physical query plans

  1. Clustered File (non-clustered relations)
  1. Clustered Relation (also called contiguous)

Scanning

  1. Table scan
  1. Index scan

One-Pass Algorithms

  1. Unary, tuple at a time

foreach block of R:
        
copy block to input buffer
        perform operation (
select, project) on each tuple in block
        move selected/projected tuples
to output buffer

  1. Unary, full-relation

foreach block of R:
        
copy block to input buffer
        foreach
tuple in block
                
if tuple is not in accumulator
                        
copy to accumulator
                        
copy to output buffer

  1. Binary, full-relation

foreach block of S:
        
read block
        
add tuples to search structure
foreach
block of R
        
copy block to input buffer
        foreach
tuple in block
                find
matching tuples in search structure
                construct
new tuples and copy to output

Nested-Loop Joins

foreach tuple r1 in R1
        
foreach tuple r2 in R2
                
if r1.C = r2.C then output r1,r2 pair

     = T(R1) * (1 + T(R2))

                                = 10,000 * (1 + 5,000)

     = 50,010,000 disk accesses

     = 1,000 disk accesses

  = T(R1) / 1,000 * 6,000 = 60,000 disk accesses

     = 1,000 disk accesses

            = 1,000 + 500 = 1,500 disk accesses

Two-Pass Algorithms

  1. Reading each chunk, sorting each individual chunk, then writing the chunk back to disk
  2. Taking all the chunks, sorting the chunks using merge sort, writing back to disk

  = 6,000 + 1,500

  = 7,500 disk accesses

  = 60,000 + 15,000

  = 75,000 disk accesses

  = 4,500 disk accesses

Index-based Algorithms

foreach R1 tuple:
 check if R1.C is in index
 
if match, read R2 tuple: 1 disk access

Data Integration

Query Rewriting

Global as View

Local as View

GAV

LAV

  • Not modular
  • Addition of new sources changes the domain schema
  • Can be awkward to write mediated schema without loss of information
  • Query reformulation is easy
  • Reduces to view unfolding (polynomial)
  • Can build hierarchies of mediated schemas
  • Best when
  • Few, stable, data sources
  • Well-known to the mediator (e.g. corporate integration)
  • Garlic, TSIMMIS, HERMES
  • Modular - adding new sources is easy
  • Very flexible - power of the entire query language available to describe sources
  • Reformulation is hard
  • Involves answering queries only using views (can be intractable)
  • Best when
  • Many, relatively unknown data sources
  • Possibility of addition / deletion of sources
  • Information Manifold, InfoMaster, Emerac

The Bucket Algorithm

  1. For every relation g in query Q:
  1. Combine source relations from each bucket into a conjunctive query Q’ and check for containment.

Relation

Selected views

Enrolled(S,D)

V2(student,dept,course) → Registered(student,course,year), Enrolled(student,dept)

V4(student,course,number) → Registered(student,course,year), Course(course,number), Enrolled(student,dept), number ≤ 100

Registered(S,C,Y)

V1(student,number,year) → Registered(student,course,year), Course(course,number), number ≥ 500, year ≥ 1992

V2(student,dept,course) → Registered(student,course,year), Enrolled(student,dept)

V3(student,course) → Registered(student,course,year), year ≤ 1990

The year range doesn’t fit our query, so we can’t consider V3 for Registered

V4(student,course,number) → Registered(student,course,year), Course(course,number), Enrolled(student,dept), number ≤ 100

Course(C,N)

V1(student,number,year) → Registered(student,course,year), Course(course,number), number ≥ 500, year ≥ 1992

Enrolled(S,D)

Registered(S,C,Y)

Course(C,N)

V2(S,D,C’)

V4(S,C’,N’)

V1(S,N’,Y’)

V2(S,D’,C)

V4(S,C,N’)

V1(S’,N,Y’)

Ontology Based Data Integration

Ontology-Based Query Answering (OBQA)

certain-answers(Q, <D,>) = ⋂J ∈ models(D) Q(J)

∀x∀y (hasFather(x,y) → Person(x) ^ Person(y))}

  1. Q(x,y) :- hasFather(x,y)
  1. Q(x) :- hasFather(x,y)
  1. Q(x) :- hasFather(x,y), hasFather(y,z), hasFather(z,w)
  1. Q(x,w) :- hasFather(x,y), hasFather(y,z), hasFather(z,w)

Universal Models

  1. U is a model of D ∧ Σ
  2. ∀J ∈ models(D ∧ Σ), there exists a homomorphism hJ such that hJ(U) ⊆ J

The Chase