Advanced Databases Part 2 
        
        
            Matthew Barnes
        
        
            Relational Algebra        2
        
        
            Relational Transformations        5
        
        
            Relational Algebra and SQL        6
        
        
            Problems and Complexity in Databases        7
        
        
            Relational calculus        8
        
        
            List all the airlines        9
        
        
            List the codes of the airports in London        10
        
        
            List the airlines that fly directly from London to
                    Glasgow        11
        
        
            Active domain        11
        
        
            Conjunctive Query - Containment and Minimisation        12
        
        
            Conjunctive queries        12
        
        
            List all the airlines        13
        
        
            List the codes of the airports        13
        
        
            List the airlines that fly directly from London to
                    Glasgow        13
        
        
            Homomorphism        14
        
        
            Semantics of Conjunctive Queries        16
        
        
            CQ problems        17
        
        
            Homomorphism Theorem        18
        
        
            Minimisation        21
        
        
            Query Processing        23
        
        
            Logical query plans        24
        
        
            Cost Estimation        24
        
        
            Query Optimisation        26
        
        
            Physical query plans        29
        
        
            Scanning        30
        
        
            One-Pass Algorithms        30
        
        
            Nested-Loop Joins        32
        
        
            Two-Pass Algorithms        35
        
        
            Index-based Algorithms        36
        
        
            Data Integration        37
        
        
            Query Rewriting        38
        
        
            Global as View        40
        
        
            Local as View        40
        
        
            The Bucket Algorithm        42
        
        
            Ontology Based Data Integration        44
        
        
            Ontology-Based Query Answering (OBQA)        44
        
        
            Universal Models        44
        
        
            The Chase        44
        
        
            Relational Algebra
        
        
            - 
                A lot of the stuff in here just crosses over with Data Management.
            
- 
                To uphold brevity, I’m only going to note the stuff
                    that is new and can’t be found in Data
                    Management.
            
- 
                If you want a full refresher, read up on Data
                    Management’s topics:
            
            - 
                “Relational Model > Relational Algebra”
                    and
            
- 
                “Data languages > Relational Algebra vs
                    SQL”
            
- 
                “Database Systems > Modelling and SQL basics >
                    Joins”
            
            - 
                ... and then come back. If you already know all that stuff,
                    read ahead!
            
        
            - 
                Just quickly: relations are subsets of cartesian products of sets, and
                    records are represented as tuples.
            
- 
                Properties of relations:
            
            - 
                Each row represents a k-tuple of R
            
- 
                The values of an attribute are all from the same
                    domain
            
- 
                Each attribute of each tuple in a relation contains a
                    single atomic value
            
- 
                The ordering of rows is immaterial (relations are just sets, so row order doesn’t
                    matter)
            
- 
                All rows are distinct (relations are sets, so there can be no duplicates)
            
            - 
                Now, relations can come in two flavours: named perspective, or unnamed perspective.
            
- 
                Extra properties of named perspective:
            
            - 
                Each column is identified by its name
            
- 
                The ordering of the attributes (columns) do not
                    matter
            
            - 
                Extra properties of unnamed perspective:
            
            - 
                The ordering of the attributes matter (so we can reference attributes by order, not by
                    name)
            
            - 
                For an easy way to remember, just think like this: named perspective names the columns; unnamed perspective does not so it has
                    to use order.
            
        
            - 
                There are k-ary relation schemas, that look like R(A1,A2,A3, ... AK)
            
            - 
                R is the relation name
            
- 
                Ai are names of attributes
            
            - 
                In a DBMS, they give types to the attributes too, like int
                    or string.
            
- 
                There are also relational database schemas that look like Ri(A1, A2, A3 ... Ak)
            
            - 
                Ri are the names of the relations
            
- 
                Ai are names of attributes in Ri
            
- 
                Notice now it’s Ri and not R. That means there’s multiple relation schemas,
                    because it’s a relational database schema!
            
        
            - 
                These schemas can be instantiated.
            
- 
                Instantiated relation schemas and relational database schemas are called relations and databases, respectively.
            
- 
                These concepts have names: intension and extension.
            
        
        
            
                
                    | Intension | Extension | 
                
                    | 
                            Relation Schema
                         | 
                            Instance of a relation schema (Relation)
                         | 
                
                    | 
                            Relational Database Schema
                         | 
                            Relational Database Instance (Database)
                         | 
            
        
        
        
            - 
                Intensions (schemas) do not change very much because
                    they’re like the blueprints; unless you’re
                    changing the business model, there’s no need to edit
                    it.
            
- 
                Extensions (instantiations) are very dynamic because
                    that’s the actual data; millions of inserts and
                    deletes are happening every second.
            
        
            - 
                In relational algebra, we can use $ to distinguish
                    positions, for example in the relation R(A, B, C) then $1 is
                    A, $2 is B and $3 is C.
            
        
            - 
                Relational Algebra operators come in groups.
            
- 
                Group 1: (three binary operations from set theory)
            
            - Union
- 
                Difference
            
- 
                Cartesian Product
            
            - 
                Group 2: (two unary operations on relations)
            
            - 
                Depending on the variant of relational algebra we might need:
            
        
            - 
                There’s also an operator for assigning names to
                    intermediary operations.
            
- 
                This can abstract huge operations using names.
            
- 
                These abstractions are called views.
            
- 
                It looks like this: ← 
            
- 
                Here’s an example:
            
            - 
                πfname,lname(σsalary>£40,000(Staff))
            
            - 
                Here’s the same operation but broken down:
            
            - 
                ExpensiveStaff
                            ←
                    σsalary>£40,000(Staff)
            
- 
                Result
                                    ←
                    πfname,lname(ExpensiveStaff)
            
        
        
        
            - 
                ✅ Commutativity: R ⋃ S = S ⋃ R
            
- 
                ✅ Associativity: R ⋃ (S ⋃ T) = (R ⋃ S) ⋃ T
            
            - 
                ❌ Commutativity: R - S ≠ S - R
            
- 
                ❌ Associativity: R - (S - T) ≠ (R - S) - T (proof)
            
            - 
                ❌ Commutativity: R ⨉ S ≠ S ⨉ R
            
- 
                ✅ Associativity: R ⨉ (S ⨉ T) = (R ⨉ S) ⨉ T
            
- 
                ✅ Distributivity: R ⨉ (S ⋃ T) = (R ⨉ S) ⋃ (R ⨉ T)
            
            - 
                ✅ Commutativity: σΘ1(σΘ2(R)) = σΘ2(σΘ1(R))
            
- 
                σΘ1(σΘ2(R)) = σΘ1⋀Θ2(R)
            
- 
                σΘ(R ⨉ S) = σΘ(R) ⨉ S (if Θ mentions only attributes in R)
            
            - 
                These rules can be used in query processing and
                    optimisation.
            
        
            - 
                We can derive the intersection operation using
                    difference:
            
- 
                R ∩ S = R – (R – S) = S – (S – R)
            
        
            - 
                Here’s some symbols for operations you already
                    knew:
            
             
        
        
        
             
        
        
        
             
        
        
        
            - 
                A semijoin finds the records of the first relation that are within the
                    join of the first and second relations.
            
- 
                Alright, that’s a weird sentence to read, but once
                    you look at this example it’ll make sense:
            
            
                
                    | 
 
                            
                                
                                    | 
                                            Characters
                                         |  
                                    | 
                                            Character
                                         | 
                                            Stand ID
                                         |  
                                    | 
                                            Jotaro Kujo
                                         | 
                                            1
                                         |  
                                    | 
                                            Polnareff
                                         | 
                                            2
                                         |  
                                    | 
                                            Jonathan Joestar
                                         | 
                                            null
                                         |  
                                    | 
                                            Gyro Zeppeli
                                         | 
                                            4
                                         |  
 | 
 
                            
                                
                                    | 
                                            Stands
                                         |  
                                    | 
                                            Stand ID
                                         | 
                                            Stand
                                         |  
                                    | 
                                            1
                                         | 
                                            Star Platinum
                                         |  
                                    | 
                                            2
                                         | 
                                            Silver Chariot
                                         |  
                                    | 
                                            3
                                         | 
                                            The World
                                         |  
 | 
 
                            
                                
                                    | 
                                            Semijoin
                                         |  
                                    | 
                                            Character
                                         | 
                                            Stand ID
                                         |  
                                    | 
                                            Jotaro Kujo
                                         | 
                                            1
                                         |  
                                    | 
                                            Polnareff
                                         | 
                                            2
                                         |  
 | 
            
        
        
            - 
                Here, the only records that come from the semijoin are
                    Jotaro and Polnareff because they’re the only ones in
                    the relation that have recorded stands.
            
- 
                If this were a join, Jotaro would’ve been paired up
                    with Star Platinum and Polnareff would’ve been paired
                    up with Silver Chariot. Since Jotaro and Polnareff
                    would’ve been in that join, they alone (and not their
                    stands) are returned in the resulting semijoin
                    relation.
            
            - 
                A semijoin is defined like this:
            
             
        
        
            Relational Transformations
        
        
            - 
                Here are a bunch of transformations you can do in
                    relational algebra:
            
            - 
                Conjunctive selections can cascade into individual
                    selections
            
             
        
        
        
            - 
                Only the last in a sequence of projections is
                    required
            
             
        
        
        
            - 
                Selection and projection are commutative (only if selection attributes are projected)
            
             
        
        
        
            - 
                Cartesian product and theta join are commutative (only with named perspective)
            
             
        
        
             
        
        
        
            - 
                Selection distributes over theta join (if predicate only involves attributes being joined)
            
             
        
        
             
        
        
        
            - 
                Projection distributed over theta join (if relations have attributes that are being projected, and
                    join condition only uses those attributes)
            
             
        
        
             
        
        
        
            - 
                Selection distributes over set operations
            
             
        
        
             
        
        
             
        
        
        
            - 
                Projection distributes over set union
            
             
        
        
        
            - 
                Associativity of theta join and cartesian product
            
             
        
        
             
        
        
        
            - 
                Associativity of set union and intersection
            
             
        
        
             
        
        
            Relational Algebra and SQL
        
        
            - 
                This topic draws parallels between relational algebra and
                    SQL.
            
- 
                It talks about self-joins, aliases, multisets etc.
            
- 
                Again, if you don’t remember, refer to the Data Management Notes.
            
            Problems and Complexity in Databases
        
        
            - 
                To gauge how efficient our database operations are, we need
                    to calculate their complexities.
            
- 
                We can use complexity theory for this (you know, Big O and stuff)
            
        
            - 
                In this chapter we will tackle two problems:
            
            - 
                Formally: Given a database D, a query Q in L, and a tuple of constants t, is t Q(D)? Q(D)?
- 
                Informally: Let’s say I give you a record from a database, and
                    a query acting upon that database. If I got the output of
                    that query, would the given record be in that output?
            
- 
                Example: Say our database is {(‘a’, 1), (‘b’, 2)}, your given record is (‘a’, 1) and the query is SELECT * FROM database WHERE $1 = ‘a’. The answer to this QOT problem would be true, because we
                    would get our record (‘a’, 1) in that query.
            
            - 
                Boolean Query Evaluation (BQE)
            
            - 
                Formally: Given a database D, a query Q in L, and a tuple of constants t, is Q(D) ≠ Ø?
            
- 
                Informally: Let’s say you have a database and a query acting on
                    that database. If you got the output of that query, would it
                    have any tuples?
            
- 
                Example: Say our database is {(‘a’, 1), (‘b’, 2)}, and the query is SELECT * FROM database WHERE $2 = 3. The answer to this BQE problem would be false, because
                    there are no tuples (records) where the second element is
                    3.
            
            - 
                There is a theorem that states QOT(RA) ≡L BQE(RA)
            
            - 
                (≡L means logspace-equivalent)
            
        
            - 
                There’s three ways to measure complexity in
                    databases:
            
            - 
                Combined complexity: both D and Q are part of the input (you’re checking the complexity of both)
            
- 
                Query complexity: fixed D, input Q (you’re only checking the complexity of the
                    query)
            
- 
                Data complexity: input D, fixed Q (you’re only checking the complexity of the
                    database)
            
        
            - 
                BQE is PSPACE-complete with combined complexity
            
- 
                BQE is PSPACE-complete when database is fixed (query
                    complexity)
            
- 
                BQE is LOGSPACE when query is fixed (data complexity)
            
- 
                When a problem is LOGSPACE, it is polynomial time as well
                    because there’s a polynomial number of
                    configurations.
            
- 
                So why are the BQE and QOT problems so easy when the query
                    is fixed, but not when the database is fixed?
            
            - 
                The complexities of these problems are actually
                    exponential.
            
- 
                But the size of the query is the exponent (the small number
                    on top; the power)
            
- 
                So if the size of the query is fixed, the complexity
                    changes to polynomial.
            
        
            - 
                There are some more problems that relate to
                    databases:
            
            - 
                Formally: Given a query Q, is there a finite database D such that Q(D) ≠ Ø?
            
- 
                Informally: If you have a query, is it possible to find a database
                    where this query’s output will not be empty?
            
- 
                If the answer is no, then the query makes no sense and can
                    trivially be calculated as empty.
            
            - 
                Formally: Given two queries Q1 and Q2, Q1(D) ≡ Q2(D)? (for every finite database D)
            
- 
                Informally: If you have two queries, are their outputs exactly the
                    same for every database?
            
- 
                If the answer is yes, then we can simply use the query that
                    is the easiest to calculate, even if it’s
                    syntactically a different query than the one we got.
            
            - 
                Formally: Given two queries Q1 and Q2, Q1(D) ⊆ Q2(D)? (for every finite database D)
            
- 
                Informally: Given two queries, can all the outputs of the first query
                    be found in the outputs in the second query for all
                    databases?
            
- 
                If the answer is yes, we can approximate a query by picking
                    the other query that is a subset.
            
- 
                This sacrifices completeness with complexity.
            
        
            - 
                They’re all undecidable, unfortunately...
            
- 
                ... for relational algebra, that is!
            
- 
                There’s other relational query languages, such as
                    relational calculus.
            
- 
                Sublanguages of relational algebra can also make these
                    problems decidable, like only having select, project and
                    equijoin (like our SJDB coursework).
            
- 
                But for now we’ll look at relational calculus.
            
            Relational calculus
        
        
            - 
                Here’s a thought: if relations are just sets, and
                    queries return relations as output, why can’t we use
                    set comprehension as queries?
            
- 
                Relational calculus is like that: it’s very similar to set
                    comprehension in set theory.
            
- 
                Or, if you’ve ever done list comprehension or set
                    comprehension in Python, it’s like that.
            
- 
                The syntax is like this:
            
            {x1, ..., xk | φ}
        
        
        
            - 
                x1, ... xk are the values in our output tuples
            
- 
                φ is a formula that our values abide by
            
        
            - 
                Makes no sense? Don’t worry, these examples will
                    clear things up.
            
- 
                Here’s a database we’re going to work
                    with:
            
            
                
                    | 
 
                            
                                
                                    | 
                                            Flight
                                         |  
                                    | 
                                            Origin
                                         | 
                                            Destination
                                         | 
                                            Airline
                                         |  
                                    | 
                                            VIE
                                         | 
                                            LHR
                                         | 
                                            BA
                                         |  
                                    | 
                                            LGW
                                         | 
                                            GLA
                                         | 
                                            U2
                                         |  
                                    | 
                                            LCA
                                         | 
                                            VIE
                                         | 
                                            OS
                                         |  
 | 
 
                            
                                
                                    | 
                                            Airport
                                         |  
                                    | 
                                            Code
                                         | 
                                            City
                                         |  
                                    | 
                                            VIE
                                         | 
                                            Vienna
                                         |  
                                    | 
                                            LHR
                                         | 
                                            London
                                         |  
                                    | 
                                            LGW
                                         | 
                                            London
                                         |  
                                    | 
                                            LCA
                                         | 
                                            Larnaca
                                         |  
                                    | 
                                            GLA
                                         | 
                                            Glasgow
                                         |  
                                    | 
                                            EDI
                                         | 
                                            Edinburgh
                                         |  
 | 
            
        
        
            - 
                If you don’t know what a plane is, these tables
                    represent flights that are going on in different airports.
                    Different airlines offer different flights between
                    airports.
            
- ✈️
            List all the airlines
        
        
        
            
                
                    | 
 
                            
                                
                                    | 
                                            Flight
                                         |  
                                    | 
                                            Origin
                                         | 
                                            Destination
                                         | 
                                            Airline
                                         |  
                                    | 
                                            VIE
                                         | 
                                            LHR
                                         | 
                                            BA
                                         |  
                                    | 
                                            LGW
                                         | 
                                            GLA
                                         | 
                                            U2
                                         |  
                                    | 
                                            LCA
                                         | 
                                            VIE
                                         | 
                                            OS
                                         |  
 | 
 
                            
                                
                                    | 
                                            Airport
                                         |  
                                    | 
                                            Code
                                         | 
                                            City
                                         |  
                                    | 
                                            VIE
                                         | 
                                            Vienna
                                         |  
                                    | 
                                            LHR
                                         | 
                                            London
                                         |  
                                    | 
                                            LGW
                                         | 
                                            London
                                         |  
                                    | 
                                            LCA
                                         | 
                                            Larnaca
                                         |  
                                    | 
                                            GLA
                                         | 
                                            Glasgow
                                         |  
                                    | 
                                            EDI
                                         | 
                                            Edinburgh
                                         |  
 | 
            
        
        
            - 
                Query: {z | ∃x∃y Flight(x,y,z)}
            
- 
                English, please:
            
            - 
                Let’s break this down.
            
- 
                The bit on the left of the bar, {z | ∃x∃y Flight(x,y,z)}, is what we want in our output. It is defined by the part
                    to the right of the bar. Basically, we want our output to be
                    a set of all the different values z could be.
            
- 
                The bit on the right of the bar, {z | ∃x∃y Flight(x,y,z)}, simply introduces some new variables for us to reason
                    with. It means that, within this formula, there exists
                    values for these variables that satisfy this query. If we
                    need new variables to use, but don’t want them in our
                    output, then “declare” them like this.
            
- 
                The bit on the far right, {z | ∃x∃y Flight(x,y,z)}, says that the values of variables x, y and z must be the
                    values of the tuples within Flight. For example, for one of
                    the possibilities, x = VIE, y = LHR and z = BA.
            
- 
                The output says we want all values that z satisfy, and
                    since the formula says z is the Airline column in Flight,
                    this query gets all the airlines.
            
            - 
                This works the same as a projection.
            
- 
                πairline Flight
            
            List the codes of the airports in London
        
        
        
            
                
                    | 
 
                            
                                
                                    | 
                                            Flight
                                         |  
                                    | 
                                            Origin
                                         | 
                                            Destination
                                         | 
                                            Airline
                                         |  
                                    | 
                                            VIE
                                         | 
                                            LHR
                                         | 
                                            BA
                                         |  
                                    | 
                                            LGW
                                         | 
                                            GLA
                                         | 
                                            U2
                                         |  
                                    | 
                                            LCA
                                         | 
                                            VIE
                                         | 
                                            OS
                                         |  
 | 
 
                            
                                
                                    | 
                                            Airport
                                         |  
                                    | 
                                            Code
                                         | 
                                            City
                                         |  
                                    | 
                                            VIE
                                         | 
                                            Vienna
                                         |  
                                    | 
                                            LHR
                                         | 
                                            London
                                         |  
                                    | 
                                            LGW
                                         | 
                                            London
                                         |  
                                    | 
                                            LCA
                                         | 
                                            Larnaca
                                         |  
                                    | 
                                            GLA
                                         | 
                                            Glasgow
                                         |  
                                    | 
                                            EDI
                                         | 
                                            Edinburgh
                                         |  
 | 
            
        
        
            - 
                Query: {x | ∃y Airport(x,y) ⋀ y = London}
            
- 
                English, please:
            
            - 
                Airport(x,y) means that the variables x,y relate to the attributes
                    in the tuples of Airport, meaning x are values of Code and y
                    are values of City.
            
- 
                y = London means that y must have the value of London. Since the
                    formula before said that y must have values of City, this
                    limits our output relation to only having the city of
                    London.
            
- 
                Since the output is x, and the first formula defined x to
                    be the codes of the Airport, those codes will be our output.
                    Since y is limited to just the value of London, the output
                    will be the airport codes within London.
            
            - 
                This works the same as a projection and selection.
            
- 
                πcode (σcity=’London’ Airport)
            
            List the airlines that fly directly from London to
                Glasgow
        
        
        
            
                
                    | 
 
                            
                                
                                    | 
                                            Flight
                                         |  
                                    | 
                                            Origin
                                         | 
                                            Destination
                                         | 
                                            Airline
                                         |  
                                    | 
                                            VIE
                                         | 
                                            LHR
                                         | 
                                            BA
                                         |  
                                    | 
                                            LGW
                                         | 
                                            GLA
                                         | 
                                            U2
                                         |  
                                    | 
                                            LCA
                                         | 
                                            VIE
                                         | 
                                            OS
                                         |  
 | 
 
                            
                                
                                    | 
                                            Airport
                                         |  
                                    | 
                                            Code
                                         | 
                                            City
                                         |  
                                    | 
                                            VIE
                                         | 
                                            Vienna
                                         |  
                                    | 
                                            LHR
                                         | 
                                            London
                                         |  
                                    | 
                                            LGW
                                         | 
                                            London
                                         |  
                                    | 
                                            LCA
                                         | 
                                            Larnaca
                                         |  
                                    | 
                                            GLA
                                         | 
                                            Glasgow
                                         |  
                                    | 
                                            EDI
                                         | 
                                            Edinburgh
                                         |  
 | 
            
        
        
            - 
                Query: {z | ∃x∃y∃u∃v Airport(x,u)
                    ⋀
            
               u = London ⋀
        
        
               Airport(y, v) ⋀
        
        
               v = Glasgow ⋀
        
        
               Flight(x,y,z)
        
        }
        
        
            - 
                The first two, Airport(x,u) ⋀ u = London, defines that x and u are codes and cities, and that the
                    only cities that u should inhibit is London.
            
- 
                The next two, Airport(y,v) ⋀ v = Glasgow, defines that y and v are codes and cities, and that the
                    only cities that v should inhibit is Glasgow. Keep in mind
                    that this is completely separate to what we did with x and
                    u.
            
- 
                The last one, Flight(x,y,z), says that we’re only focusing on the flights from x
                    to y, and using z to store the airlines. Since u has to be
                    London and v has to be Glasgow, going from airports x to y
                    must be going from London to Glasgow.
            
- 
                Since the output is z, we are getting the airlines of
                    flights that go from airports located in London to
                    Glasgow.
            
            - 
                This works the same as a join.
            
- 
                πairline ((Flight ⋈origin=code (σcity=’London’ Airport)) ⋈destination=code ((σcity=’Glasgow’ Airport))
            
            Active domain
        
        
            - 
                Relational calculus is cool and all, but there are some
                    problematic queries.
            
- 
                We’ve been using “there exists”, but what
                    about “for all”?
            
- 
                {x | ∀y R(x,y)}
            
- 
                How is this calculated?
            
- 
                This means “get all values of x such that they are
                    connected to all values of y”
            
- 
                As opposed to “get all values of x such that they are
                    connected to some values of y” (for the “there exists” operator)
            
- 
                What does it mean “all values” of y?
            
- 
                It means all values in the active domain.
            
- 
                The active domain is the set of all values in the relation
                    at that current point in time. No outside values; no
                    possible values we can think of. This means the domain is
                    always closed.
            
- 
                If we look at an example, it’ll make more
                    sense.
            
        
            - 
                dom = {1,2,3}
            
- 
                D = {R(1,1), R(1,2)}
            
- 
                {x | ∀y R(x,y)} → {}
            
        
            - 
                Here, the query has no output.
            
- 
                1 can map to 1 and 2, but it can’t map to 3.
            
- 
                2 doesn’t map to anything, and 3 doesn’t map to
                    anything.
            
- 
                So there can be no value of x that maps to every value of y.
            
- 
                Here’s another example:
            
        
            - 
                dom = {1,2}
            
- 
                D = {R(1,1), R(1,2)}
            
- 
                {x | ∀y R(x,y)} → {1}
            
        
            - 
                Here, the domain has changed to just 1 and 2.
            
- 
                2 still doesn’t map to anything, but 1 now maps to 1
                    and 2.
            
- 
                1 can map to any value in the domain!
            
- 
                That means 1 is a possible value for x, since it can map to
                    any y value.
            
- 
                By changing the domain like this, we can make or break
                    these queries. This is why we use the active domain.
            
        
            - 
                A theorem states that relational algebra (RA), domain relational calculus (DRC) and tuple relational calculus (TRC) are all equally expressive.
            
- 
                What is tuple relational calculus?
            
- 
                It’s a declarative language introduced by Codd. It
                    predates domain relational calculus.
            
            Conjunctive Query - Containment and Minimisation
        
        
            Conjunctive queries
        
        
            - 
                Conjunctive Queries (CQ) are yet another query language that we can use.
            
- 
                It’s called that because it is expressed as a set of
                    atoms.
            
- 
                In practice, you can think of it as syntactic sugar for
                    relational calculus.
            
- 
                However, it’s more powerful, even though it’s
                    less expressive.
            
- 
                It’s syntax goes like this:
            
            Q(x) := ∃y (R1(v1) ∧ ... ∧ Rm(vm))
        
        
            - 
                Ri (1 ≤ i ≤ m) are relations
            
- 
                x, y, v1, ... vm are tuples of variables
            
- 
                Each variable mentioned in vi (1 <= i <= m) appears in either x or y
            
- 
                The variables in x, which are free, are called distinguished variables
            
            - 
                Conjunctive queries have a form that use rule-based queries
                    that look like this:
            
            Q(x) :- R1(v1) ∧ ... ∧ Rm(vm)
        
        
            - 
                This is very similar to before, but less verbose and
                    doesn’t require existential quantifications (like
                    “for all” and “there exists”).
            
- 
                The body of Q can be seen as a set of atoms.
            
- 
                The (x) bit in Q(x) can also be called the head.
            
        
            - 
                Let’s convert the relational calculus examples into
                    conjunctive queries for our examples!
            
            List all the airlines
        
        
            - 
                Relational calculus: {z | ∃x∃y Flight(x,y,z)}
            
- 
                Conjunctive queries: Q(z) :- Flight(x,y,z)
            
- 
                Explanation:
            
            - 
                Here, instead of putting our output variables to the left
                    of the bar, we put it as a parameter of query Q. In this
                    case, that’s z.
            
- 
                In the rule-based query form, we don’t need the
                    “there exists” quantification.
            
- 
                Everything else is much of the same.
            
            List the codes of the airports
        
        
            - 
                Relational calculus: {x | ∃y Airport(x,y) ⋀ y = London}
            
- 
                Conjunctive queries: Q(x) :- Airport(x,y), y = London
            
- 
                Explanation:
            
            - 
                It’s pretty much the same as the last example, except
                    we can also have invariants for the variables, like with y =
                    London.
            
- 
                However, what if I told you we don’t even need
                    that?
            
- Behold!
            - 
                Conjunctive queries: Q(x) :- Airport(x,London)
            
            - 
                With conjunctive queries, we don’t need to create a
                    variable and define it just to limit an attribute in
                    Airport; we can just state the value.
            
            List the airlines that fly directly from London to
                Glasgow
        
        
            - 
                Relational calculus: {z | ∃x∃y∃u∃v Airport(x,u)
                    ⋀
            
               u = London ⋀
        
        
               Airport(y, v) ⋀
        
        
               v = Glasgow ⋀
        
        
               Flight(x,y,z)
        
        }
        
            - 
                Conjunctive queries: Q(z) :- Airport(x,London),
            
             Airport(y,Glasgow),
        
        
             Flight(x,y,z)
        
        
        
            - 
                Much like before, we don’t need to define variables
                    just to limit attributes.
            
- 
                The conjunctive query is a lot more succinct here: the
                    first atom says that x is an airport based in London, the
                    second atom says that y is an airport based in Glasgow, and
                    the third atom says that z is an airline that has a flight
                    from x to y.
            
            Homomorphism
        
        
            - 
                A homomorphism is a substitution that converts a set of atoms from one
                    conjunctive query into a set of atoms from another
                    conjunctive query, while preserving structure.
            
- 
                It is a substitution from a set of atoms A to a set of
                    atoms B, h : terms(A) → terms(B) such that:
            
            - 
                t is a constant ⇒ h(t) = t
            
- 
                R(t1, ..., tk) ∈ A ⇒ h(R(t1, ..., tk))
            
- 
                h(R(t1, ..., tk)) = R(h(t1), ..., h(tk))
            
- 
                R(h(t1), ..., h(tk)) ∈ B
            
            - 
                Where terms(A) = {t | t is a variable or constant that occurs in A}
            
        
            - 
                If it’s still confusing, let me give you an
                    example:
            
- 
                S1 = {P(x,y), P(y,z), P(z,w)}
            
             
        
        
            - 
                S2 = {P(x,y), P(y,z), P(z,x)}
            
             
        
        
            - 
                We can define a homomorphism ‘h’ from S1 to S2 as {x → x, y → y, z → z, w → x}
            
- 
                h(S1) = {h(P(x,y)),         h(P(y,z)),                h(P(z,x))}
            
            = {P(h(x),h(y)),         P(h(y),h(z)),        P(h(z),h(x))}
        
        
            = {P(x,y),                P(y,z),                P(z,x)}
        
        
            - 
                All the elements of h(S1) also belong to S2. Therefore h is a homomorphism from S1 to S2.
            
- 
                Note: In these notes, when I define a homomorphism, I’ll
                    often omit the direction of the mapping for the sake of
                    making the explanation easier (e.g. S1 to S2). However, if you’re writing them out in an exam or
                    something, you must explicitly state what the homomorphism is mapping to and
                    from where. Otherwise, it can be ambiguous, and probably
                    won’t even be a proper homomorphism.
            
        
            - 
                Let’s try a more trivial one.
            
- 
                S1 = {P(x,y), P(y,x), P(y,y)}
            
             
        
        
        
             
        
        
            - 
                We can define h as {x → x, y → x}
            
- 
                h(S1) = {P(x,x), P(x,x), P(x,x)} = {P(x,x)}
            
- 
                Again, all elements of h(S1) also belong to S2, so h is a homomorphism from S1 to S2.
            
        
            - 
                If you want to think of it a bit more graphically, try it
                    this way:
            
            - 
                You have a graph. Your objective is to make your graph look
                    like the other graph, or at least be a “subset”
                    of it.
            
- 
                You can change the nodes to other values
            
- 
                If you change a node to another node value that already
                    exists, you can “merge” it with that node (if you change w to x, you can “merge” the w
                    node into the x node)
            
- 
                You can also “split” nodes if a transition is
                    bidirectional, but you can’t leave any duplicate
                    nodes
            
            - 
                Wanna try an example, but graphically?
            
- 
                S1 = {P(x,y), P(y,z), P(z,x)}
            
             
        
        
            - 
                S2 = {P(x,y), P(y,x), P(y,y)}
            
             
        
        
            - 
                First of all, we can see that there is a loop in y in the
                    second graph. In the first graph, y goes to z. This means we
                    should change z to y, so that y can loop.
            
             
        
        
            - 
                Since we have two y’s, we can “merge”
                    them together. The first y points to the other y, so
                    that’ll create a loop, and the second y points to x.
                    Since x also points to the first y, that’ll create a
                    bidirectional transition.
            
             
        
        
            - 
                I’d say that looks like our second graph! Therefore a
                    homomorphism exists.
            
- 
                All we did was change z to y, so our only mapping will be z
                    → y. Everything else will be the identity function,
                    making our h = {x → x, y → y, z → y}
            
- 
                This method is only really for visualising a homomorphism;
                    for actually calculating one, it’s probably best if
                    you just write it. But I don’t know. Maybe you like
                    doing things this way. Whatever gets you that 1st.
            
            Semantics of Conjunctive Queries
        
        
            - 
                A match is a homomorphism from a conjunctive query to a
                    database.
            
- 
                It maps the variables in the body of the conjunctive query
                    to values in the database.
            
- 
                Think of it like one of the outputs of the query.
            
- 
                Formally, it goes like this:
            
            - 
                Q(x1, ..., xk) :- body
            
- 
                Let there be database D
            
- 
                Match is homomorphism h such that h(body) ⊆ D
            
            - 
                Say we have a database with some records:
            
- 
                School(Student(Name, Age, ClassID), Class(ClassID, ClassName))
            
            - 
                Student(“Matthew”, 21, 13)
            
- 
                Student(“John”, 24, 14)
            
- 
                Student(“Giorno”, 15, 14)
            
- 
                Class(13, “Computer Science”)
            
- 
                Class(14, “Biology”)
            
            - 
                We also have a conjunctive query:
            
- 
                Q(Name) :- Student(Name, BioID), Class(BioID,
                    “Biology”)
            
- 
                We can match the variables of the conjunctive query to
                    values in the database to create a match:
            
- 
                {Name → “Giorno”, BioID → 14}
            
- 
                That’s one possible match!
            
- 
                There is one other possible match, you know.
            
- 
                {Name → “John”, BioID → 14}
            
- 
                So each individual query result is a match, but what do we
                    call all the possible matches?
            
- 
                I thought you’d never ask...
            
        
            - 
                The answer to a conjunctive query over a database is the set of
                    k-tuples, where each tuple is the result of a match mapping
                    each conjunctive query atom to its database
                    counterpart.
            
- 
                Formally, it goes like this:
            
            - 
                Q(D) := {(h(x1),…, h(xk)) | h is a match of Q in D}
            
            - 
                Here’s an example, following that school one I
                    defined before:
            
            - 
                This is the query in question:
            
- 
                Q(Name, Age) :- Student(Name, BioID), Class(BioID,
                    “Biology”)
            
- 
                These are the two matches that this query has in the
                    database:
            
- 
                h1 = {Name → “Giorno”, Age → 15,
                    BioID → 14}
            
- 
                h2 = {Name → “John”, Age → 24,
                    BioID → 14}
            
- 
                So we just run the matches on the atoms in the query:
            
- 
                { (h1(Name), h1(Age)), (h2(Name), h2(Age)) }
            
- 
                { (“Giorno”, 15), (“John”, 24) }
            
            CQ problems
        
        
            - 
                For CQs, the problems are a little bit better:
            
            - 
                BQE(CQ) is NP-complete (combined complexity)
            
- 
                BQE[D](CQ) is NP-complete, for fixed database (query complexity)
            
- 
                BQE[Q](CQ) is LOGSPACE, for a fixed query (data complexity)
            
        
            - 
                Why BQE and BQE[D] are NP:
            
            - 
                To solve, go through every possible substitution of h : terms(body) → terms(D) and check if it’s a match of Q in D
            
- 
                Basically, use brute force
            
        
        
            - 
                Why BQE and BQE[D] are NP-hard:
            
            - 
                Reduction from 3-colourability
            
- 
                Let’s say we have an undirected graph G with edges
                    and nodes.
            
- 
                Let’s say the query QG has a match if D satisfies 3-colourability.
            
- 
                If the graph G is 3-colourable, there must be a
                    homomorphism from this graph to a cycle that just loops red
                    → blue → green.
            
- 
                In other words, there must be a match of QG in D = {E(r,g), E(g,b), E(b,r)}
            
- 
                It’s hard to think of it in just words: let’s
                    have a visual example!
            
- 
                Here’s our graph G:
            
             
        
        
            - 
                That looks pretty 3-colourable to me. In fact, it’s
                    already 3-coloured!
            
- 
                So according to this proof, there must be a homomorphism
                    from this graph to a loop that goes red → blue →
                    green (or just cycles any pattern of those three
                    colours).
            
             
        
        
            - 
                Why, yes there is!
            
- 
                {1 → a, 2 → b, 3 → c, 4 → a, 5 →
                    c, 6 → a}
            
- 
                They all just map to their colour equivalents.
            
- 
                All the reds map to a, all the greens map to b and all the
                    blues map to c.
            
- 
                Therefore, if we can find a homomorphism in polynomial time
                    (find if a match exists for QG in D through a BQE algorithm), then we can solve
                    3-colourability in polynomial time.
            
        
        
            - 
                BQE[Q] with CQ is LOGSPACE because BQE[Q] with domain relational calculus is also LOGSPACE.
            
        
            - 
                A canonical database of a query Q is a database of which the homomorphism of Q on the
                    database is the identity function.
            
- 
                It’s syntax is like this: D[Q] (this says “D, the canonical database of
                    Q”)
            
- 
                You can create one from a query Q by taking every variable x and mapping it to a new constant like x’
            
- 
                This process is called “freezing”.
            
- Example:
            - 
                Given Q(x,y) :- R(x,y), P(y,z,w), R(z,x)
            
- 
                Then D[Q] = {R(x’, y’),
                    P(y’,z’,w’), R(z’, x’)}
            
            - 
                Note: the mapping of variables to new constants is a bijection,
                    e.g. for all x → x’, then you can do x’ → x
            
- 
                This also means if you have a canonical database, you can
                    get the query it was defined from too!
            
- 
                So why did I tell you this?
            
- 
                The concept of canonical databases solves one of the
                    problems we’ve seen!
            
        
            - 
                Satisfiability (SAT) of CQs can be solved using canonical databases.
            
- 
                If you have a conjunctive query Q, there will always be a
                    database that satisfies it because every conjunctive query
                    has a canonical database.
            
- 
                Therefore satisfiability of CQs is constant time, because
                    it’s always true!
            
        
            - 
                Equivalence (EQUIV) and Containment (CONT) of CQs:
            
- 
                Keep in mind these rules only work for conjunctive
                    queries!
            
            - 
                We can turn an equivalence problem into a containment
                    problem by:
            
            - 
                Q1 ≡ Q2
            
- → 
- 
                Q1 ⊆ Q2 and Q2 ⊆ Q1
            
            - 
                We can turn a containment problem into an equivalence
                    problem by:
            
            - 
                Q1 ⊆ Q2
            
- → 
- 
                Q1 ≡ (Q1 ∧ Q2) (this means if Q1 is equivalent to itself and Q1 conjuncted with Q2)
            
            - 
                Because of this, we only need to worry about CONT, and
                    EQUIV can follow.
            
            Homomorphism Theorem
        
        
            - 
                A containment mapping or query homomorphism is simply a homomorphism from query to query.
            
- Formally:
            - 
                Q1(x1, ..., xk) :- body1
            
- 
                Q2(y1, ..., yk) :- body2
            
- 
                h : terms(body1) → terms(body2)
            
- 
                Such that:
            
            - 
                h is homomorphism from body1 to body2
            
- 
                (h(x1), ..., h(xk)) = (y1, ..., yk)
            
        
            - 
                The homomorphism theory states that for any two conjunctive queries Q1 and Q2, if Q1 ⊆ Q2, there exists a containment mapping from Q2 to Q1, and vice versa.
            
        
            - 
                Here’s a little example to help with
                    visualisation:
            
            - 
                Q1(x,y) :- R(x,z), S(z,z), R(z,y)
            
             
        
        
            - 
                Q2(a,b) :- R(a,c), S(c,d), R(d,b)
            
             
        
        
            - 
                We know that Q1 ⊆ Q2 because there is a homomorphism from Q2 to Q1:
            
            - 
                h = {a → x, b → y, c → z, d → z}
            
            - 
                There is no homomorphism from Q1 to Q2, so Q1 ⊂ Q2
            
        
            - 
                Is there a proof of this?
            
- 
                Why, yes there is! Two, in fact, that go both ways.
            
- 
                Note that I’m only going to explain the high-level
                    ideas; if you’re really a masochist for mathematical
                    syntax, check out the slides; it’s full of it.
            
            - 
                Q1 ⊆ Q2 ⇒ Homomorphism from Q2 to Q1
            
            - 
                We can get the canonical database of Q1 by writing D[Q1]
            
- 
                There is an answer to Q1 over D[Q1] (trivially)
            
- 
                Since Q1 is a subset of Q2, that means that same answer will work for Q2 upon D[Q1]
            
- 
                If Q2 has an answer for D[Q1], that must mean there is a homomorphism from Q2 to D[Q1]
            
- 
                But since D[Q1] is essentially a copy-paste of Q1, then Q2 also has a homomorphism to Q1.
            
            - 
                Homomorphism from Q2 to Q1 ⇒ Q1 ⊆ Q2
            
            - 
                Let’s say we have a database D, and there is a tuple t such that t ∈ Q1(D) (tuple t is one of the matches that Q1 has over D)
            
- 
                To complete this proof, we need to show that t ∈ Q2(D) (tuple t is also one of the matches that Q2 has over D)
            
- 
                Since there is a homomorphism h from Q2 to Q1, we can convert the body of Q2 to the body of Q1.
            
- 
                Since there is a match from Q1 to D, that means there is a homomorphism from Q1 to D (because that’s what a match basically is:
                    a homomorphism). We’ll call it g
            
- 
                Now we can use h to get from the body of Q2 to the body of Q1, and then we can use g to get to the body of Q1 to the database D.
            
- 
                That implies there is at least one tuple t that is from the body of Q2 and the body of Q1 which can be mapped to database D.
            
- 
                If tuple t is from the body of Q2 and Q1, that means t ∈ Q2(D) and t ∈ Q1(D), respectively.
            
        
            - 
                That’s amazing! So if we find a homomorphism from
                    Q2 to Q1, we can basically solve CONT, and in turn, EQUIV.
            
- 
                One small problem though: deciding whether a homomorphism
                    exists is NP-complete.
            
            - 
                Why it’s NP: iterate through all possible substitutions, looking for a
                    valid homomorphism
            
- 
                Why it’s NP-hard: reduction from BQE(CQ)
            
            - 
                You have queries Q1 and Q2
            
- 
                If you have an algorithm that can find homomorphisms in
                    polynomial time, you can find a homomorphism from Q2 to Q1 in polynomial time.
            
- 
                You can also find a homomorphism from Q2 to D[Q1] in polynomial time, or Q1 to D[Q2] in polynomial time, so you could effectively use that to
                    solve BQE(CQ) with Q upon D, where D is the canonical database of some
                    query.
            
            - 
                Because of the homomorphism theory, if finding a
                    homomorphism for CQs is NP-complete, that means EQUIV(CQ) and CONT(CQ) are also NP-complete.
            
        
            - 
                Here’s a graph of all the problems we’ve seen
                    so far in complexity groups:
            
             
        
        
        
            - 
                How do we minimise a conjunctive query?
            
- 
                A conjunctive query Q1 is minimal if there is no other CQ Q2 such that:
            
            - 
                Q1 ≡ Q2
            
- 
                Q2 has fewer body atoms than Q1
            
            - 
                So when we minimise a query, we’re computing the
                    minimal queries that are still equivalent to the original
                    query.
            
        
            - 
                We can exploit the homomorphism theory a little to help
                    us:
            
            - 
                Let’s say we have a CQ Q1(x1, ..., xk) :- body1
            
- 
                Q1 is equivalent to Q2(y1, ..., yk) :- body2 where |body2| < |body1|
            
- 
                Because they’re equivalent, there’s a
                    homomorphism from Q1 to Q2.
            
- 
                But equivalency works both ways: there’s also a
                    homomorphism from Q2 to Q1!
            
- 
                But Q2 has less atoms than Q1. If we go from Q1 to Q2 to Q1 again, we’ll end up with less atoms than
                    before.
            
- 
                In other words, this means there exists a query Q1(x1, ..., xk) :- body3 ...
            
- 
                ... such that body3 ⊆ body1
            
            - 
                This theorem, in a nutshell, states that all we need to do
                    is remove atoms from the body. No need for any weird
                    substitution!
            
        
            - 
                Here’s some pseudocode for minimisation:
            
            
                
                    | 
                            Minimization(Q(x) :- body)Repeat until no change
 
                              choose an atom α ∈ bodyif there is a containment mapping from Q(x) :- body to Q(x) :- body ∖ {α}
 then body := body ∖ {α}
 Return Q(x) :- body
 | 
            
        
        
        
            - 
                Note: if there is a query homomorphism from Q(x) :- body to
                    Q(x) :- body \ {α}, then the two queries are
                    equivalent because there is a trivial containment mapping
                    from the latter to the former query (it’s simply the identity function).
            
        
            - 
                What was that? You wanna see this in action?
            
- 
                Here you go:
            
             
        
        
            - 
                You might think you can reduce it further by turning R(x,b)
                    to R(a,b) through x → a, but that’s not valid
                    because x is a distinguished variable (it’s in the output).
            
        
            - 
                Here’s a question: does the order in which we remove
                    the atoms matter?
            
            - No.
- 
                No matter what path you’re on, if you’re not at
                    a minimal query, there will always be a more minimal query
                    below you that you’re equivalent to.
            
- 
                Because of the theorem above, that “more
                    minimal” query is just the query you’re on, but
                    with lesser atoms.
            
- 
                So it’s impossible to hit a “dead end”
                    when going through this minimisation tree (unless you’re at a minimal query).
            
        
            - 
                Here’s another theorem to think about:
            
            - 
                Let’s say you have a conjunctive query Q.
            
- 
                You have two more, Q1 and Q2, that are minimal conjunctive queries.
            
- 
                If Q1 ≡ Q and Q2 ≡ Q, then Q1 and Q2 are isomorphic.
            
- 
                That means they’re identical, and that they only
                    differ by variable renaming.
            
- 
                So semantically, Q1 and Q2 are exactly the same.
            
            - 
                Because of this theorem, every query has one and only one
                    minimal conjunctive query (semantically speaking).
            
- 
                This is called the core of Q.
            
            Query Processing
        
        
            - 
                If you’ve already done the coursework, the first half
                    of this topic should be straightforward.
            
- 
                When you perform a query on a database, it goes through
                    different levels:
            
             
        
        
            - 
                When the query is parsed, a logical query plan is created.
            
- 
                This is a plan of the relational algebra operators to use
                    to perform your query.
            
- 
                It’s abstract and high-level.
            
- 
                Once that is made, a physical query plan is created.
            
- 
                This is a plan of all the algorithms the RDMS is going to
                    use to perform all those logical operations. It’s
                    low-level.
            
- 
                Optimisation takes place in both of these plans; they both
                    need optimising, but they are optimised in different
                    ways.
            
             
        
        
            Logical query plans
        
        
            - 
                In order to optimise, we need to begin with an inefficient
                    “starting point” that we need to improve
                    upon.
            
- 
                This is our canonical form, which is a standard way of representing a query as a
                    logical query plan.
            
             
        
        
            - 
                In a canonical query plan, all the projects sit at the top,
                    the selects are all daisy chained using “and”
                    operators, and all the relations are connected left-deep
                    using products.
            
            Cost Estimation
        
        
            - 
                We can estimate the cost of a plan by giving a cost to each
                    operator in terms of the size, and the number of unique
                    values, of the relations on which it operates.
            
- 
                Then, we can pick query plans that minimise this
                    cost.
            
- 
                We calculate cost using:
            
            - 
                T(R)                -
                    Number of tuples in relation R (cardinality of R)
            
- 
                V(R,
                    A)        - Number of distinct values for attribute A in relation
                    R
            
            - 
                Note: for any relation R, V(R, A) < T(R) for all attributes A on R
            
- 
                Here’s how we can estimate the cost of each
                    operator:
            
        
        
            
                
                    | Scan | 
                            T(scan(R)) = T(R)
                         
 
                            For all A in R, V(scan(R), A) = V(R, A)
                         | 
                
                    | Product | 
                            T(R ✕ S) = T(R)T(S)
                         
 
                            For all A in R, V(R ✕ S, A) = V(R,A)
                         
                            For all B in S, V(R ✕ S, B) = V(S,B)
                         | 
                
                    | Projection | 
                            T(πA(R)) = T(R)
                         
 
                            For all A in R and πA(R), V(πA(R),A) = V(R,A)
                         
 
                            This assumes that the projection does not
                                remove duplicate tuples (so value count remains the same)
                         | 
                
                    | 
                            Selection attr=value
                         | 
                            T(σA=c(R)) = T(R) / V(R,A)
                         
 
                            V(σA=c(R),A) = 1
                         
 
                            Assumes all values of A appear with equal
                                frequency
                         | 
                
                    | Selection attr=attr | 
                            T(σA=B(R)) = T(R) / max(V(R,A), V(R,B))
                         
 
                            V(σA=B(R),A) = V(σA=B(R),B) = min(V(R,A), V(R,B))
                         
 
                            Assumes that all values of A and B appear with
                                equal frequency
                         
 
                            Note: for all other attributes X of R, you can
                                reduce it to V(R, X) because it’s not being acted
                                upon
                         | 
                
                    | Selection Inequality | 
                            T(σA<c(R)) = T(R) / 3 (as a rule of thumb)
                         
 
                            May be different if range of values and
                                distribution is known
                         | 
                
                    | Selection 
                            Conjunction / Disjunction
                         | 
                            T(σA=c1∧B=c2(R)) = (and) 
 
                            T(σA=c1∨B=c2(R)) = +  (or) | 
                
                    | Join | 
                            T(R⋈A=BS) = T(R)T(S) / max(V(R,A), V(S,B))
                         
 
                            T(R⋈R1=S1∧R2=S2S) =  
 
                            T(R)T(S) / max(V(R,R1), V(S,S1))
                         
 
                            V(R⋈A=BS,A) = V(R⋈A=BS,B) = min(V(R,A), V(S,B))
                         | 
            
        
        
        
            - 
                Throughout these formulae, we’ve assumed that all
                    values have equal frequency.
            
- 
                However this is unrealistic and can give us inaccurate
                    estimates.
            
- 
                We can calculate cost based on some other approaches based
                    on histograms:
            
            - 
                Equal-width: divide the attribute domain into equal parts, give tuple
                    counts for each (e.g. split integer attribute A into categories 1-10,
                    11-20, 21-30, give tuple counts for each)
            
             
        
        
            - 
                Equal-height: sort tuples by attribute, divide into equal-sized sets of
                    tuples and give maximum value for each set (e.g. integer attribute A is split into categories 1-5,
                    6-12, 13-30, where each category has the same number of
                    tuples)
            
             
        
        
            - 
                Most-frequent values: give tuple counts for top-n most frequent values
            
        
            - 
                Here’s an example of equal-width:
            
            - 
                We have a relation R(A, B, C) with 10,000 tuples.
            
- 
                The following is an equal-width histogram on A:
            
            
                
                    | Range | [1,10] | [11,20] | [21,30] | [31,40] | [41,50] | 
                
                    | 
                            Tuples in range
                         | 50 | 2,000 | 2,000 | 3,000 | 2,950 | 
            
        
        
        
            - 
                What is T(σA=10(R))?
            
- 
                We need to find how many tuples are equal to 10 in the
                    range [1,10], which we can do by doing 50 * (1 / 10)
            
- 
                Therefore the answer will be 5 tuples.
            
            Query Optimisation
        
        
            - 
                How do we optimise a query?
            
- 
                We go through these steps:
            
        
            - 
                Start with canonical form
            
            - 
                Let’s start with a database:
            
             
        
        
            - 
                Here is our canonical form tree of a query:
            
             
        
        
            - 
                In case you haven’t noticed, everything is going to
                    be from the slides.
            
            - 
                Move σ operators down the tree
            
            - 
                We move the σattr=val operations just above the relations that contain the
                    attribute.
            
- 
                We move the σattr=attr operations just above the product that introduces
                    those two attributes to the relation.
            
             
        
        
            - 
                Reorder subtrees to put most restrictive σ
                    first
            
            - 
                Reorder the subtrees to put the most restrictive relations
                    (fewest tuples) first
            
- 
                For the sake of simplicity, we can keep it left-deep
            
             
        
        
            - 
                Combine × and σ to create ⨝
            
            - 
                If there is a product and a select above it, we can convert
                    it into a join.
            
             
        
        
            - 
                Move π operators down the tree
            
            - 
                We can use projections to only project the attributes we
                    need.
            
- 
                By reducing the degree (or arity) of the relations, we can
                    use fewer buffer frames.
            
             
        
        
        
            - 
                After all that, we’ve successfully optimised our
                    logical query plan!
            
            Physical query plans
        
        
            - 
                So now we have an optimised logical query plan.
            
- 
                How do we execute these operations?
            
- 
                There’s a wide range of algorithms to choose from:
                    what are they and which do we pick?
            
- 
                It depends on stuff like:
            
            - 
                Structure of relations
            
- 
                Size of relations
            
- 
                Presence of indexes and hashes
            
        
            - 
                To pick the best algorithm, we need to quantise its
                    effectiveness.
            
- 
                Therefore we need a “cost”.
            
- 
                The cost can be the number of disk accesses the algorithm
                    makes.
            
- 
                We make the assumption that the arguments of the operator
                    (all the relations and stuff) are on disk, and the results
                    are on main memory.
            
- 
                We won’t care about memory accesses because
                    they’re so much faster than disk accesses, it’s
                    practically negligible.
            
        
            - 
                Here are the cost parameters we will be using:
            
            - 
                M         -
                    Main memory available for buffers
            
- 
                S(R)        - Size of a
                    tuple of relation R
            
- 
                B(R)        - Blocks
                    used to store relation R
            
- 
                T(R)        - Number of tuples in relation R (cardinality of R)
            
- 
                V(R,a)        - Number of distinct values for attribute a in relation R
            
        
            - 
                On disk, we don’t just store a huge bank of tuples in
                    one giant blob.
            
- 
                We store tuples into blocks, and then we can fetch those
                    blocks instead of individual tuples.
            
- 
                There’s two ways of doing this:
            
            - 
                Clustered File (non-clustered relations)
            
            - 
                Tuples from different relations, that are joined on
                    particular attribute values, are put together into the same
                    blocks.
            
- 
                [R1 R2 S1 S2] [R3 R4 S3 S4]
            
- 
                They might not be joined; the point is, all the tuples in a
                    block don’t have to be from the same relation.
            
            - 
                Clustered Relation (also called contiguous)
            
            - 
                Tuples from the same relation are also put into the same
                    blocks.
            
- 
                [R1 R2 R3 R4] [S1 S2 S3 S4]
            
- 
                The point is, all tuples in a block are from the same
                    relation.
            
        
            - 
                We can also get something called a clustering index, which allows tuples to be read in an order that
                    corresponds to physical order.
            
- 
                This index is stored on disk, and simply points to the
                    blocks that contain the tuples we want to get.
            
             
        
        
            Scanning
        
        
            - 
                Scanning is exactly what you think it is: an iteration over all
                    tuples in a relation R, checking for tuples that satisfy
                    some predicate.
            
- 
                You could use scanning for certain binary queries.
            
- 
                It can also be used by the later algorithms to fetch blocks
                    that hold needed tuples.
            
- 
                There’s two kinds of scanning:
            
            - 
                Table scan
            
            - 
                All the tuples we want to scan are in blocks.
            
- 
                We already know those blocks (perhaps they’re stored in some kind of table), but we need to load them from disk one at a time.
            
- 
                The number of disk accesses is B(R), because we need to
                    load all the blocks that have relation R records.
            
- 
                However, if R is not clustered, there is a worst-case
                    scenario.
            
- 
                What if our blocks are like this?
            
            - 
                [R1, S1, S2, S3]
            
- 
                [R2, S4, S5, S6]
            
- 
                [R3, S7, S8, S9]
            
            - 
                There’s only one relation from R in every
                    block!
            
- 
                Therefore the number of disk accesses will be T(R), because
                    we’ll need to read one block for every tuple.
            
            - 
                Index scan
            
            - 
                We have an index on some attribute of R.
            
- 
                Therefore, we can use that index to find all blocks holding
                    R.
            
- 
                The I/O cost is a little higher: it’s B(R) +
                    B(IR), if R is clustered.
            
- 
                However, since the number of blocks holding R is way
                    greater than the number of blocks holding the index, we can
                    treat it as just B(R).
            
- 
                By the same logic as table scan, the I/O cost can be T(R)
                    if R is not clustered.
            
            One-Pass Algorithms
        
        
            - 
                One-Pass algorithms are algorithms that only read from disk
                    once.
            
- 
                It uses stuff like buffers and accumulators.
            
- 
                There’s three main categories:
            
            - 
                Unary, tuple at a time
            
            - 
                This is used for selects and projects.
            
- 
                For every block of R, the block is read into the input
                    buffer.
            
- 
                The operation (select or project) is then performed on
                    every tuple in that block.
            
- 
                The block is then moved to the output buffer (which will
                    become our result).
            
        
        
            
                
                    | 
                            foreach block of R:copy block to input buffer
 perform
                                operation (select, project) on each tuple in block
 move
                                selected/projected tuples to output buffer
 | 
            
        
        
        
             
        
        
            - 
                B(R) and T(R) disk access depends on clustering.
            
- 
                If we have a select operator that compares an attribute to
                    a constant, and there is an index for that attribute,
                    there’s going to be way less disk access.
            
- 
                The good thing about this algorithm is that you only really
                    need one block of memory, and that’s for the input
                    operation.
            
- 
                Well, that’s not including the output buffer, but
                    still...
            
            - 
                Unary, full-relation
            
            - 
                This is used for duplicate elimination and grouping.
            
- 
                This is like unary tuple at a time, but now we have an
                    accumulator!
            
- 
                The accumulator can store tuples, or integer values, or
                    anything, really.
            
- 
                For every block R, we copy the block to the input
                    buffer.
            
- 
                Depending on the tuples in the block, we update the
                    accumulator and move tuples to the output buffer.
            
        
             
        
        
            - 
                To do duplicate elimination, store all the
                    “seen” tuples in the accumulator. If we find a
                    tuple that’s already in the accumulator, don’t
                    send it into the output. Otherwise, put it in the output
                    buffer.
            
        
        
            
                
                    | 
                            foreach block of R:copy block to input buffer
 foreach tuple in block
 if tuple is not in accumulator
 copy to accumulator
 copy to output buffer
 | 
            
        
        
        
            - 
                We require at least B(δ(R)) + 1 blocks of memory:
            
            - 
                1 block of memory for the input buffer
            
- 
                B(δ(R)) for all the “seen” tuples (delta means distinct tuples)
            
            - 
                The accumulator can be implemented as a data structure,
                    like a tree or a hash.
            
- 
                If we have less memory than that, we’ll have to
                    thrash (dip into the disk space a bit).
            
- 
                The cost is still B(R).
            
        
            - 
                We can also do things like grouping, so min, max, sum,
                    count, avg
            
- 
                Our accumulator can just be a number or something.
            
            - 
                Binary, full-relation
            
            - 
                This is like an “applied” version of unary,
                    full-relation.
            
- 
                It works by reading one relation and loading it into an
                    index, then reading another relation with access to that
                    index.
            
- 
                Let’s say we have two relations: R and S, and S is
                    smaller than R.
            
- 
                Our accumulator is now our index!
            
- 
                We read all blocks in S, and store them in our index.
            
- 
                Then we read all blocks in R, using our index to construct
                    new tuples for the output.
            
        
        
            
                
                    | 
                            foreach block of S:read block
 add tuples to search structure
 foreach block of R
 copy block to input buffer
 foreach tuple in block
 find matching tuples in search structure
 construct new tuples and copy to output
 | 
            
        
        
        
            - 
                We can use this for joins: R(X,Y) and S(Y,Z).
            
- 
                The search index will be keyed on Y, and for every tuple in
                    R, we search for a matching tuple in the search index (a tuple belonging to S).
            
            Nested-Loop Joins
        
        
            - 
                A nested-loop join uses a nested loop to iterate through
                    all the tuples in two relations.
            
- 
                This is also known as an iteration join.
            
- 
                For this example, we’ll say we want to join relations
                    R1 and R2 on attribute C:
            
        
        
            
                
                    | 
                            foreach tuple r1 in R1foreach tuple r2 in R2
 if r1.C = r2.C then output r1,r2 pair
 | 
            
        
        
        
            - 
                These factors could affect cost:
            
            - 
                Tuples of relation stored physically together
                    (clustered)
            
- 
                Relations sorted by join attribute
            
- 
                Indexes exist
            
        
            - 
                We’re going to be doing a bunch of examples, so
                    here’s some stats to work with:
            
            - 
                T(R1) = 10,000 (number of tuples)
            
- 
                T(R2) = 5,000
            
- 
                S(R1) = S(R2) = 1/10 block (size of tuple in
                    relation)
            
- 
                M = 101 blocks (memory)
            
        
            - 
                Attempt #1: Tuple-based nested loop join
            
            - 
                What’s the total cost if we used the nested-loop
                    join?
            
- 
                The relations aren’t contiguous right now, so
                    there’s one disk access per tuple.
            
- 
                Cost to read each tuple in R1 = cost to read tuple + cost
                    to read all of R2
            
- 
                Total cost = all tuples in R1 * cost to read one tuple in R1
            
                 = T(R1) * (1 + T(R2))
        
        
             
                                         
                    = 10,000 * (1 + 5,000)
        
        
                 = 50,010,000 disk accesses
        
        
        
            - 
                Phew! That’s a lot of disk accesses!
            
- 
                Can we reduce it? Can we make it better?
            
- 
                Yes, we can.
            
- 
                Right now, we’re wasting so much space.
            
- 
                We only need two blocks of memory to perform this
                    algorithm, yet we have 101 blocks at our disposal.
            
- 
                We can read a chunk of 100 blocks of R1, iterate through
                    that, and then read the next chunk of 100 blocks and so
                    on.
            
        
            - 
                Attempt #2: Block-based nested loop join
            
            - 
                Cost to read one 100-block chunk of R1 = 100 / S(R1)
            
                 = 1,000 disk accesses
        
        
            - 
                Remember, we’re not contiguous, so it’s 1 disk
                    access per tuple.
            
- 
                But we do want to fill our memory blocks with just R1
                    tuples, so we fetch tuples until we fill our blocks.
            
- 
                Cost to process each chunk = 1,000 + T(R2) = 6,000
            
- 
                Total cost = all chunks to read * cost to process each chunk
            
              = T(R1) / 1,000 * 6,000 = 60,000 disk accesses
        
        
        
            - 
                Well, it is a huge improvement.
            
- 
                Did you know that we can do even better?
            
- 
                If we reverse the join order, so replace R1 with R2 and
                    vice versa, we can improve.
            
        
            - 
                Attempt #3: Join reordering
            
            - 
                Cost to read one 100-block chunk of R2 = 100 / S(R2)
            
                 = 1,000 disk accesses
        
        
            - 
                Cost to process each chunk = 1,000 + T(R1) = 11,000
            
- 
                Total cost = T(R2) / 1,000 * 11,000 = 55,000 disk accesses
            
        
            - 
                Alright, it’s somewhat better.
            
- 
                But what if I told you...
            
- 
                ... we can do even better?!
            
- 
                If our relations are contiguous (clustered), we can do
                    things even faster.
            
        
            - 
                Attempt #4: Contiguous relations
            
            - 
                B(R1) = 1,000 (number of blocks in R1)
            
- 
                B(R2) = 500 (number of blocks in R2)
            
- 
                Cost to read one 100-block chunk of R2 = 100 disk accesses (now it’s contiguous, so one block = one disk
                    access)
            
- 
                Cost to process each chunk = 100 + B(R1) = 1,100
            
- 
                Total cost = B(R2) / 100 * 1,100 = 5,500 disk accesses
            
        
            - 
                It’s way better when it’s contiguous!
            
- 
                Wait... no way...
            
- 
                We can do even better?!
            
- 
                That’s right!
            
- 
                If both relations are contiguous and sorted by C, we can do a little trick.
            
- 
                Just a small question, though: when was the last time you
                    implemented merge sort?
            
        
        
            - 
                We have two pointers to the start of R1 records and to the
                    start of R2 records.
            
- 
                If the referenced R1 has a lower C value, we move on to the
                    next R1 record.
            
- 
                If the referenced R2 has a lower C value, we move on to the
                    next R2 record.
            
- 
                If both referenced R1 and R2 records have the same C value,
                    we have a match.
            
- 
                If you understand merge sort, it’s very similar to
                    merging.
            
- 
                In terms of the blocks, we can just read the next one as we
                    go.
            
- 
                Total cost = B(R1) + B(R2)
            
                        = 1,000 + 500 = 1,500 disk accesses
        
        
        
            - 
                Wow! Using merging to compute a join; who would’ve
                    thought? There can’t possibly be any algorithm faster
                    than this...
            
- 
                ... right?
            
- 
                N-No... no way...! We can do even BETTER?!
            
            Two-Pass Algorithms
        
        
            - 
                Well, not necessarily “better”, but more suited
                    for general data.
            
- 
                Not all relations are going to be sorted by a specific
                    attribute.
            
- 
                So if we want to use our merge join, we need to sort it,
                    right?
            
- 
                But if we waste time sorting and then merging, will we
                    actually save time as opposed to just using a nested loop
                    like before? Let’s see...
            
        
            - 
                A two-pass algorithm is an algorithm that reads the disk twice.
            
- 
                Why do we need one right now?
            
- 
                Sorting all the chunks using merge sort requires two
                    passes:
            
            - 
                Reading each chunk, sorting each individual chunk, then
                    writing the chunk back to disk
            
- 
                Taking all the chunks, sorting the chunks using merge sort,
                    writing back to disk
            
            - 
                Then, obviously, we do the join at the end.
            
        
            - 
                Attempt #6: Merge join with sort
            
            - 
                Sort cost of R1 = 4 * 1,000 = 4,000 disk accesses
            
- 
                Sort cost of R2 = 4 * 500   = 2,000 disk accesses
            
- 
                Total cost = sort cost + join cost
            
              = 6,000 + 1,500
        
        
              = 7,500 disk accesses
        
        
        
            - 
                Hmm... it was better last time. It seems it’s not
                    worth it.
            
- 
                But wait! Let’s try a bigger example...
            
            - 
                B(R1) = 10,000 blocks
            
- 
                B(R2) = 5,000 blocks
            
- 
                Nested loop cost:
            
            - 
                Total cost = (5,000 / 100) * (100 + 10,000)
            
- 
                505,000 disk accesses
            
            - 
                Sort cost of R1 = 4 * 10,000  = 40,000 disk accesses
            
- 
                Sort cost of R2 = 4 * 5,000   = 20,000 disk accesses
            
- 
                Total cost = sort cost + join cost
            
              = 60,000 + 15,000
        
        
              = 75,000 disk accesses
        
        
        
            - 
                It seems with bigger examples, it works out better.
            
- 
                One way to figure out which algorithm to use is to
                    calculate their cost, and pick the most efficient one.
            
- 
                Heh... you called it. There’s a way we can do even
                    better.
            
        
            - 
                We don’t have to fully sort our relations.
            
- 
                We can write the records into runs (sorted blocks), and then join with just the runs.
            
        
            - 
                Attempt #7: Improved merge join
            
            - 
                Read R1 + write R1 into runs = 2 * 1,000 = 2,000
            
- 
                Read R2 + write R2 into runs = 2 * 500 = 1,000
            
- 
                Total cost = 2,000 + 1,000 + (1,000 + 500)
            
              = 4,500 disk accesses
        
        
        
            - 
                If you’re wondering how to perform merge join without
                    a fully sorted file of records, one could use heaps to store
                    the smallest join attribute of each block: one heap for R1
                    and one heap for R2.
            
- 
                The join can use the heaps to get the next blocks of R1 and
                    R2 as the join goes along. That way, the fully sorted file
                    isn’t written to disk. You would need a bit more
                    memory to store the heaps, but it would reduce disk
                    accesses.
            
        
            - 
                There’s another kind of join called Hash-Join that uses a hash index to write records to buckets
                    using the join attribute.
            
- 
                All records in the same bucket are all matches with each
                    other.
            
- 
                It works by reading all of R1 and writing them to buckets,
                    reading all of R2 and writing them into buckets, and then
                    joining R1 and R2.
            
- 
                The complexity calculation is the same as the improved
                    merge join.
            
- 
                If a bucket gets full in memory, the disk is used.
            
             
        
        
            Index-based Algorithms
        
        
            - 
                You can also use an index-based algorithm.
            
- 
                However, this assumes you already have an index created and
                    stored in memory.
            
- 
                There are three assumptions:
            
            - 
                R2.C index exists
            
- 
                R1 is contiguous, but unordered
            
- 
                R2.C index fits in memory
            
            - 
                If those three criteria are fulfilled, the index-based
                    algorithm goes like this:
            
        
        
            
                
                    | 
                            foreach R1 tuple:check if R1.C is in index
 if match, read R2 tuple: 1 disk access
 | 
            
        
        
        
            - 
                Fairly simple, right? It just goes through all the R1
                    tuples, looking for matches in the index.
            
            Data Integration
        
        
            - 
                You can fetch data from lots of different sources.
            
- 
                However, each source is going to have its own schema.
            
- 
                Do you really want to query every source using their own
                    schemas?
            
- 
                No! We want to have one universal schema that we convert
                    all our sources to, so we only need to use one schema to
                    query all the data from the sources.
            
- 
                This universal (or global) schema is a part of the domain (or global or target) language, and it encompasses:
            
            - 
                Database schema
            
- 
                Ontologies (OWL 2) (basically constraints about our data)
            
- 
                Description Logics
            
             
        
        
        
            - 
                There’s four parts to this we’re going to be
                    looking at:
            
            - 
                Query language - what we will use to query our domain
            
            - 
                e.g. conjunctive queries, q(z) ← Employee(“John”, x) ∧
                    HasManager(x, z)
            
            - 
                Domain language - relational schemas that represent what our domain
                    will look like
            
            - 
                e.g. Employee(name, ID), HasManager(employeeID, managerID),
                    Manager(name, ID), DirectsProject(managerID, projectID),
                    Project(name, ID)
            
            - 
                Schema mapping language - mapping the data sources to our domain (global
                    schema)
            
- 
                There’s three kinds:
            
            - 
                Global-Local as View (multiple sources to multiple domain):
            
- 
                S1(x, y1) ∧ S2(y1, y2) → Manager(x, z1) ∧
                    Employee(z2, z1)
            
        
            - 
                Local as View (one source to multiple domains):
            
- 
                S1(x, y1) → Manager(x, z1) ∧ Employee(z2,
                    z1)
            
        
            - 
                Global as View (multiple sources to one domain):
            
- 
                S1(x, y1) ∧ S2(y1, y2) → Manager(x, z1)
            
            - 
                Data Sources language - databases of different relational schemas
                    representing the different sources
            
            - 
                e.g. S1(a, b), S2(c, d, e) etc.
            
        
            - 
                So, how do we query the data from the sources using our
                    domain?
            
- 
                There’s two ways to do this.
            
- 
                The first way is called data warehousing. It’s the simple option.
            
- 
                We construct a centralised database for our domain, and
                    then we periodically update that database using our
                    sources.
            
             
        
        
            - 
                This is simple, but we need to host a database, and we need
                    to keep updating it.
            
            Query Rewriting
        
        
            - 
                The other way is virtual data integration (VDI).
            
- 
                Instead of having a centralised database, we convert our
                    domain query on the fly to queries over the sources.
            
             
        
        
            - 
                So we have some conjunctive query Q, which only uses
                    relations from the domain.
            
- 
                We want to convert this query into Q’ such that
                    Q’ only uses relations from the sources.
            
- We want:
            - 
                Q’ ⊧ Q (i.e. answers in Q’ are correct, Q’ ⊆
                    Q)
            
- 
                Q’ to provide all possible answers to Q given the
                    sources (we want all the answers we can get)
            
            - 
                To sum those two criteria up, we want all correct answers.
            
        
            - 
                We can get two kinds of rewritings.
            
- 
                Given a query Q and view definitions V={V1, ..., Vn} (these views simply map the domain and sources
                    together)
            
            - 
                Equivalent Rewriting: our new query Q’ is semantically exactly the same
                    as Q, except it uses the source relations instead of the
                    domain relations.
            
- 
                Formally:
            
            - 
                Q’ refers only to views in V
                
 
- 
                Q’ ⊧ Q 
            
            - 
                Maximally-Contained Rewriting: our new query Q’ isn’t exactly the same as Q,
                    but it’s the closest subset we’re going to get
                    with these sources.
            
- 
                Formally:
            
            - 
                Q’ refers only to views in V
            
- 
                Q’ ⊆ Q
            
- 
                There is no rewriting Q’’ such that Q’
                    ⊆ Q’’ ⊆ Q and Q’’ ≠
                    Q
            
        
            - 
                So how do we rewrite our queries?
            
- 
                It depends on whether we’re using Global as View or
                    Local as View.
            
            Global as View
        
        
            - 
                Remember, Global-as-View maps multiple source relations to
                    one domain relation.
            
- 
                In more detail, it means each global (domain) relation is
                    defined as a view over source relations.
            
- 
                Here’s an example:
            
            - 
                Movie(title, year) ← DB1(title, director, year)
            
- 
                Movie(title, year) ← DB2(title, year, director,
                    actor)
            
- 
                Review(title, year, review) ← DB1(title, director,
                    year) ∧ DB3(title, review)
            
            - 
                So we have 3 source relations, DB1, DB2 and DB3, and 2
                    domain relations, Movie and Review.
            
        
            - 
                Let’s say we want to query all the reviews for movies
                    released after 1997.
            
- 
                The query would look like this:
            
            - 
                Q(title, review) :- Movie(title, year) ∧ year>1997 ∧ Review(title, year, review)
            
            - 
                Now, we just substitute the domain relations for the source
                    relations in accordance to the view definitions above:
            
            - 
                Q(title, review) :- DB1(title, director, year) ∧ year>1997 ∧ DB1(title, director, year) ∧ DB3(title, review)
            
            - 
                Now we tidy up; you can see we have two DB1’s in this
                    query, which we don’t need, so we can get rid of
                    one:
            
            - 
                Q(title, review) :- DB1(title, director, year) ∧ year>1997 ∧ DB3(title, review)
            
            - 
                ... and there’s our rewritten query!
            
        
            - 
                There’s two mappings to Movie, so we could’ve
                    used DB2 instead:
            
            - 
                Q(title, review) :- DB2(title, year, director, actor) ∧ year>1997 ∧ DB3(title, review)
            
            - 
                But this is just a redundant rewriting of the query that
                    uses DB1. We’d only need DB2 if we need the actor of
                    the film as well.
            
            Local as View
        
        
            - 
                Remember, Local-As-View maps one source relation to
                    multiple domain relations.
            
- 
                In more detail, it means each source relation is defined as
                    a view over mediator (domain) relations.
            
- 
                Say we have DB1(title, year, director) and DB2(title,
                    review).
            
- 
                We could define views on those source relations:
            
            - 
                V1(title, year, director) → Movie(title, year,
                    director, genre) ∧ American(director) ∧ year ≥
                    1960 ∧ genre = ‘Comedy’
            
- 
                V2(title, review) → Movie(title, year, director,
                    genre) ∧ year ≥ 1990 ∧ MovieReview(title,
                    review)
            
        
            - 
                Let’s say we want to rewrite the query for the
                    reviews of comedies produced after 1950.
            
- 
                The query would look like this:
            
            - 
                Q(title, review) :- Movie(title, year, director,
                    ‘Comedy’), year ≥ 1950, MovieReview(title,
                    review)
            
            - 
                The reformulated query would look like this:
            
            - 
                Q’(title, review) :- V1(title, year, director), V2(title, review)
            
            - 
                It’s clear to see that we can get our title from
                    either V1 or V2, and we get our review from V2, but what
                    about the year being greater or equal to 1950? It just
                    vanished from the query!
            
- 
                That ‘year’ predicate has been
                    “overwritten” by V1’s year predicate of
                    year ≥ 1960.
            
- 
                Basically, by calling V1, we’re already doing a
                    filter by year.
            
- 
                But since V1’s year filter only captures films above
                    1960 and not 1950 like we wanted to in the first place, that
                    makes this query a maximally-contained rewriting.
            
- 
                That means Q’ is a subset of Q.
            
- 
                We can’t really help it in this case, because our
                    sources don’t support movies over 1950; only over
                    1960.
            
        
            - 
                Here are some pros and cons of Global-as-View vs
                    Local-as-View:
            
            
                
                    | GAV | LAV | 
                
                    | 
                            
                                Addition of new sources changes the domain
                                    schema
                             
                            
                                Can be awkward to write mediated schema
                                    without loss of information
                            
                                Query reformulation is easy
                             
                            
                                Reduces to view unfolding
                                    (polynomial)
                            
                                Can build hierarchies of mediated
                                    schemas
                             
                            
                                Few, stable, data sources
                            
                                Well-known to the mediator (e.g. corporate
                                    integration)
                             | 
                            
                                Modular - adding new sources is easy
                            
                                Very flexible - power of the entire query
                                    language available to describe sources
                            
                                Reformulation is hard
                             
                            
                                Involves answering queries only using views
                                    (can be intractable)
                             
                            
                                Many, relatively unknown data sources
                            
                                Possibility of addition / deletion of
                                    sources
                             
                            
                                Information Manifold, InfoMaster,
                                    Emerac
                             | 
            
        
        
            The Bucket Algorithm
        
        
            - 
                I showed you a formulated query in LAV, but I didn’t
                    show you how I got to it.
            
- 
                Formulating queries with LAV is a bit more complicated than
                    GAV.
            
- 
                You need to use an algorithm, like the bucket algorithm.
            
- 
                Say you have a query Q and source descriptions (views)
                    V1, ..., Vi
            
- 
                It works like this:
            
            - 
                For every relation g in query Q:
            
            - 
                Find views that contains relation g
            
- 
                Check that constraints in those views are compatible with
                    Q
            
- 
                Add the views that satisfy the constraints to a
                    bucket
            
            - 
                Combine source relations from each bucket into a
                    conjunctive query Q’ and check for containment.
            
            - 
                I bet that didn’t make a lot of sense, so
                    here’s an example.
            
        
            - 
                We have the following views:
            
            - 
                V1(student,number,year) →
                    Registered(student,course,year), Course(course,number),
                    number ≥ 500, year ≥ 1992
            
- 
                V2(student,dept,course) →
                    Registered(student,course,year),
                    Enrolled(student,dept)
            
- 
                V3(student,course) → Registered(student,course,year),
                    year ≤ 1990
            
- 
                V4(student,course,number) →
                    Registered(student,course,year), Course(course,number),
                    Enrolled(student,dept), number ≤ 100
            
            - 
                Here we have a query we want to rewrite:
            
            - 
                Q(S,D) :- Enrolled(S,D), Registered(S,C,Y), Course(C,N), N
                    ≥ 300, Y ≥ 1995
            
            - 
                Now we go through every relation, selecting views:
            
        
        
            
                
                    | Relation | 
                            Selected views
                         | 
                
                    | 
                            Enrolled(S,D)
                         | 
                            V2(student,dept,course) →
                                Registered(student,course,year), Enrolled(student,dept)
                         
 
                            V4(student,course,number) →
                                Registered(student,course,year),
                                Course(course,number), Enrolled(student,dept), number ≤ 100
                         | 
                
                    | 
                            Registered(S,C,Y)
                         | 
                            V1(student,number,year) → Registered(student,course,year), Course(course,number),
                                number ≥ 500, year ≥ 1992
                         
 
                            V2(student,dept,course) → Registered(student,course,year),
                                Enrolled(student,dept)
                         
 
                            V3(student,course) → Registered(student,course,year), year ≤ 1990
                         
                            The year range doesn’t fit our query, so
                                we can’t consider V3 for Registered
                         
 
                            V4(student,course,number) → Registered(student,course,year), Course(course,number),
                                Enrolled(student,dept), number ≤ 100
                         | 
                
                    | 
                            Course(C,N)
                         | 
                            V1(student,number,year) →
                                Registered(student,course,year), Course(course,number), number ≥ 500, year ≥
                                1992
                         | 
            
        
        
        
            - 
                Now we have our buckets filled in as follows:
            
        
        
            
                
                    | 
                            Enrolled(S,D)
                         | 
                            Registered(S,C,Y)
                         | 
                            Course(C,N)
                         | 
                
                    | 
                            V2(S,D,C’)
                         
                            V4(S,C’,N’)
                         | 
                            V1(S,N’,Y’)
                         
                            V2(S,D’,C)
                         
                            V4(S,C,N’)
                         | 
                            V1(S’,N,Y’)
                         | 
            
        
        
        
            - 
                You may be wondering why there’s apostrophes
                    here.
            
- 
                That’s just to show the attribute in that relation
                    does not fit with the domain’s head.
            
- 
                For example, in Enrolled(S,D), the C in V2(S,D,C’) has an apostrophe because it’s not one of the
                    distinguished variables S or D.
            
- 
                You can think of the normal attributes like the ones we
                    “need” and the ones with apostrophes like the
                    ones we don’t need, or don’t need yet.
            
        
            - 
                On this part, we simply do a cross product on all these
                    sets of views (Enrolled x Registered x Course) and then pick combinations and check if they’re
                    contained in our query.
            
- 
                For starters, let’s pick V2 for Enrolled, V1 for
                    Registered and V1 for Course.
            
- 
                This’ll turn our query into this:
            
            - 
                Q’(S,D) :- V2(S,D,C’), V1(S,N’,Y),
                    V1(S’,N,Y’), N≥300, Y≥1995
            
            - 
                We can minimise this query a bit, since we have two
                    V1’s and V1 already has an N comparison:
            
            - 
                Q’(S,D) :- V2(S,D,C’), V1(S,N,Y),
                    Y≥1995
            
            - 
                That’s our new query!
            
- 
                Now we need to check if Q’ is contained in Q.
            
        
            - 
                We can do this by going the opposite direction of
                    substituting the views for domain relations:
            
            - 
                Q’(S,D) :- V2(S,D,C’), V1(S,N,Y), Y≥1995
            
- 
                Q’(S,D) :- Registered(S,C’,Y) ∧ Enrolled(S,D) ∧ Registered(S,C’’,Y) ∧
                    Course(C’’,N) ∧ N ≥ 500 ∧ Y ≥
                    1992 ∧ Y≥1995
            
- 
                Q’(S,D) :- Enrolled(S,D) ∧ Registered(S,C,Y) ∧ Course(C,N) ∧ N ≥ 500 ∧
                    Y ≥ 1992 ∧ Y≥1995
            
- 
                Q’(S,D) :- Enrolled(S,D) ∧ Registered(S,C,Y) ∧ Course(C,N) ∧ N ≥ 500 ∧ Y ≥ 1995
            
            - 
                Compare that with the original query:
            
            - 
                Q(S,D) :- Enrolled(S,D) ∧ Registered(S,C,Y) ∧ Course(C,N) ∧ N ≥ 300 ∧ Y ≥ 1995
            
            - 
                As you can see, our new query only differs by checking for
                    N’s greater than 500 as opposed to greater than 300.
                    That’s simply because our source does not support
                    N’s lower than 500.
            
- 
                But because of this, it is reflected in our query, so
                    Q’ ⊆ Q.
            
        
            - 
                To find the maximally-contained rewriting of Q, you would
                    have to go through each combination of views in that cross
                    product we did earlier to find the rewritten query that is
                    closest to Q.
            
            Ontology Based Data Integration
        
        
            - 
                Remember when I mentioned ontologies in Data
                    Integration?
            
- 
                An ontology is a set of rules that data must follow.
            
- 
                How do we represent these rules?
            
- 
                Using a language called Source-to-target Tuple Generating Dependencies (s-t TGDs).
            
- 
                They go like this:
            
            - 
                φ(x,y) conjunction of atoms, defining an antecedent.
            
- 
                ψ(x,z) conjunction of atoms that must hold true / exist if
                    LHS is true (a consequent).
            
            - 
                Just think of it like an “implies” rule, except
                    with a bunch of mathematical jargon shrouding it.
            
        
            - 
                An example of such a rule is this:
            
            - 
                ∀a Person(a) → ∃b hasFather(a,b)
            
            - 
                This rule says that every person has a father.
            
- 
                If you remember Software Modelling and Design, these rules or “invariants” will look
                    familiar to you.
            
- 
                If you’re afraid of mathematical syntax and how the
                    quantifiers look (don’t worry, I am too), you can use syntactic sugar to make it look nicer:
            
            - 
                Person(a) → hasFather(a,b)
            
            - 
                We’ve actually been using this syntactic sugar the
                    whole time, back in query rewriting with global-as-view and
                    local-as-view; it uses the same language!
            
- 
                I’ll use these two syntaxes interchangeably, so just
                    keep that in mind so you don’t get confused.
            
        
            - 
                Let’s go over some more definitions before we dive
                    into the heavy stuff (because trust me, it gets
                    heavy).
            
- 
                An instance is an instantiation of a database.
            
- 
                Remember back in Relational Algebra when we called that a
                    Relational Database Instance? Here, we’ll just say
                    ‘instance’.
            
        
            - 
                An instance J is a model of the rule σ = ∀x,y φ(x,y) → ∃z ψ(x,z)
            
- 
                ... written as J ⊧ σ if the following holds:
            
            - 
                There is a homomorphism h such that h(φ(x,y)) ⊆ J
            
- 
                There exists a h|x ⊆ g such that g(ψ(x,z)) ⊆ J
            
            - 
                Whoa whoa whoa... there’s a lot of jargon here, can
                    we dial it back a bit?
            
- 
                What this is trying to say is that a database instance J is
                    a model of the rule σ if the records of J satisfy that
                    rule.
            
- 
                Let’s say we have a database D={A(x), B(x)} and we have a rule σ = ∀x A(x) → ∃y B(x,y) (basically the rule says for every x in A, that x must be
                    in B too, with some other value y).
            
- 
                The instance {A(1), B(1,0)} would model σ, but the instance {A(1), B(2,3)} would not model σ.
            
        
            - 
                Now let’s explain a bit of the syntax.
            
- 
                What is h|x? It simply means to restrict the domain of h to just
                    x.
            
- 
                So if h was {x → a, y → b, z → c}, h|x would just be {x → a}.
            
- 
                An instance J is a model of σ if you can find a
                    homomorphism h from the antecedent to J’s relations,
                    and if you can find a similar homomorphism g from the
                    consequent to J’s relations, and if g is a subset of h
                    whose domain is restricted to the variables shared between
                    the antecedent and the consequent.
            
- 
                What? That was just jargon, but in English form?
            
- 
                If we go over an example, it’ll make much more
                    sense.
            
        
            - 
                Let’s check if J={A(1), B(1,0)} is a model of that rule we looked at earlier.
            
- 
                We need a homomorphism h that maps from A(x) to {A(1), B(1,0)}.
            
- 
                That’s easy: h = {x → 1}.
            
- 
                Now we need to look at what variables are shared by the
                    antecedent and the consequent.
            
- 
                The rule is A(x) → B(x,y), so what variables are shared between A(x) and B(x,y)?
            
- 
                It’s x! Therefore, our second homomorphism g needs to
                    have x → 1 in it (the x→ 1 comes from the homomorphism h).
            
- 
                How can we construct g such that it maps B(x,y) to {A(1), B(1,0)}?
            
- 
                Again, that’s not so hard: g = {x → 1, y → 0}
            
- 
                There we go! We’ve found the two homomorphisms: J is
                    a model of σ!
            
        
            - 
                It’d be a bit dumb if we could only have one rule,
                    right?
            
- 
                We can have a set of these rules called ∑.
            
- 
                As defined before, this is an ontology.
            
- 
                J is a model of the ontology ∑, written as J ⊧
                    ∑, if for each σ in ∑, J ⊧
                    σ.
            
- 
                Basically, J models the ontology if it can model every rule
                    in it.
            
            Ontology-Based Query Answering (OBQA)
        
        
            - 
                A certain answer is a tuple that satisfies a query in accordance with the
                    ontology.
            
             
        
        
        
            certain-answers(Q, <D,∑>) = ⋂J ∈ models(D∧∑) Q(J)
        
        
            - 
                Where models(D∧∑) = {J | J ⊇ D and J ⊧ ∑}
            
        
            - 
                To sort of English-ify this definition, let’s say you
                    have an instance. This instance can be a superset of your
                    database D, and it also models the ontology. If you find a
                    tuple in this superset (J ∈ models(D∧∑), and in the query result (Q(J)), and also appears in every other instance (⋂), it is a certain answer.
            
- 
                It’s alright if that didn’t make sense; this is
                    a hard topic to explain. Let’s go over an
                    example.
            
        
            - 
                We have a database instance D = {Person(john), Person(bob), Person(tom),
                    hasFather(john,bob), hasFather(bob,tom)}
            
- 
                We have an ontology ∑ = {∀x (Person(x) → ∃y
                    hasFather(x,y)),
            
            ∀x∀y (hasFather(x,y) → Person(x) ^
                Person(y))}
        
        
            - 
                Let’s go over some queries!
            
- 
                What are the certain answers of:
            
            - 
                Q(x,y) :- hasFather(x,y)
            
            - 
                This query asks for son and father pairs.
            
- 
                Two certain answers straight away are (john,bob) and (bob,tom).
            
- 
                What about Tom? He must have a father, because the ontology
                    rules state that everyone must have a father.
            
- 
                It is true that Tom has a father (not in D, but in supersets that satisfy the
                    ontology), but the pair of (tom,z) (where z is a placeholder) is not a certain answer.
            
- 
                Why? Because when we do the intersection of all answers,
                    this tuple will get filtered out.
            
- 
                Let’s say we have instances A and B, that are
                    supersets of D (and satisfy the ontology), except A has (tom,jack) and B has (tom,jojo).
            
            - 
                A: (john,bob), (bob,tom), (tom,jack), ...
            
- 
                B: (john,bob), (bob,tom), (tom,jojo), ...
            
            - 
                It’s clear now that when we intersect all answers, (tom,z) will get filtered out, meaning it is not a certain
                    answer.
            
- 
                Therefore the only certain answers are (john,bob) and (bob,tom).
            
- 
                In practice, you can think of it like this: no placeholder
                    variables in certain answers are allowed!
            
            - 
                Q(x) :- hasFather(x,y)
            
            - 
                This query just asks for sons.
            
- 
                Two immediate answers are John and Bob.
            
- 
                But what about Tom? Like we said before, he must be a son
                    of someone. So Tom must be a certain answer, right?
            
- 
                Right! Tom is also a certain answer.
            
- 
                That’s because no matter what D superset (that
                    satisfies the ontology) you can pick, Tom will always be a
                    son.
            
- 
                Also, there’s no placeholder in just (tom), so it’s not going to get filtered out in the
                    intersection.
            
- 
                Therefore the certain answers are {(john), (bob), (tom)}
            
            - 
                Q(x) :- hasFather(x,y), hasFather(y,z),
                    hasFather(z,w)
            
            - 
                This query asks for all great grandsons.
            
- 
                In the instance D, it’s clear that there are no great
                    grandsons. Heck, there’s only one grandfather.
            
- 
                But instance D doesn’t satisfy the ontology, does
                    it?
            
- 
                Therefore we only want to focus on supersets of D that do
                    satisfy the ontology.
            
- 
                If we had an instance superset J that satisfies the
                    ontology, every person in J would be a son of someone, a
                    grandson of someone, a great grandson of someone etc.
            
- 
                That includes John, Bob and Tom!
            
- 
                However, everyone else would be a placeholder, or simply
                    someone who doesn’t belong in D. The same logic
                    applies in the first example, and those people would be
                    filtered out in the intersection.
            
- 
                Therefore, the certain answers are {(john), (bob), (tom)}
            
            - 
                Q(x,w) :- hasFather(x,y), hasFather(y,z),
                    hasFather(z,w)
            
            - 
                This query asks for all great grandsons and great
                    grandfathers.
            
- 
                In instance D, there are no great grandfathers.
            
- 
                That means for every superset of D, all great grandfathers
                    will be placeholders.
            
- 
                If all great grandfathers are placeholders, that means
                    every tuple containing a great grandfather will be filtered
                    out in the intersection.
            
- 
                So the certain answers here are... nothing. There are
                    none.
            
        
            - 
                Here’s a formal definition from the slides. To be
                    honest, typing everything out is tiring and the slides show
                    this quite well anyway, so here you go.
            
- 
                By the way, () ∈ Q(x) simply means “if this boolean CQ returns
                    true”, where () is used as the “true”
                    constant
            
             
        
        
            Universal Models
        
        
            - 
                It seems like a real pain to imagine every kind of superset
                    model when calculating queries...
            
- 
                It would be really handy if we had just one model that we
                    can reason with, and if that model satisfies a query, every
                    possible model will satisfy it too...
            
- 
                Were you wishing for that? You were, weren’t
                    you?
            
- 
                You’re in luck, because that’s called the Universal Model (also called the Canonical Model).
            
- 
                Despite the name, it doesn’t model the universe;
                    instead, it’s a model that can be mapped to every
                    other possible model (well, subsets of them), but back to that in a second.
            
        
            - 
                Let’s say we have a knowledge base of D = {P(c)} and ∑ = {P(x) → R(x,y) ∧ P(y)}
            
- 
                This is virtually the same as the son father example.
            
- 
                Let’s come up with all kinds of models for this
                    knowledge base:
            
             
        
        
        
            - 
                Yikes... it seems to go on forever.
            
- 
                The key idea here is to focus on a representative model;
                    one that is as general as possible.
            
- 
                That would appear to be the model on the furthest right;
                    it’s the only one that acknowledges that it is
                    infinite.
            
- 
                The model on the far right is the universal model here. You
                    can map (using a homomorphism, of course) that universal model to all the other models on the
                    left.
            
- 
                For example, {z1 → c} is a homomorphism from the right-most model to the
                    left-most model.
            
- 
                So any queries that work on that model will also work on
                    the others too.
            
        
            - 
                So, to put all this more formally, an instance U is a
                    universal model of D ∧ Σ if the following
                    holds:
            
            - 
                U is a model of D ∧ Σ
            
- 
                ∀J ∈ models(D ∧ Σ), there exists a
                    homomorphism hJ such that hJ(U) ⊆ J
            
            - 
                This is basically what I said before: a universal model is,
                    of course, a model of the database and the ontology, but it
                    can also be mapped to any other model of that same database
                    and ontology.
            
        
            - 
                There is a theorem that states D ∧ Σ ⊨ Q,
                    iff U ⊨ Q where U is a universal model of D ∧
                    Σ
            
- 
                Intuitively, this theorem is true, because if a query works
                    with a universal model, it’ll work for every model
                    because the universal model has homomorphisms to the subset
                    of every model in existence.
            
             
        
        
            The Chase
        
        
        
        
            - 
                The Chase is an iterative algorithm used to generate a
                    universal model of a knowledge base.
            
- 
                In The Chase, we use our rules as “generators”
                    that create new relations.
            
        
        
            - 
                D = {Person(john)}
            
- 
                Σ = {∀x Person(x) → ∃y
                    hasParent(x,y) ∧ Person(y)}
            
            - 
                Let’s use The Chase to generate our universal
                    model!
            
- 
                In The Chase, we instantiate the right side using values
                    from the left.
            
- 
                In this rule, a Person must exist / be chosen before we can
                    instantiate the hasParent and Person on the right.
            
- 
                We’ll pick “john” as the person.
            
- 
                Now that we have our argument, we can begin instantiating
                    some new records.
            
- 
                Since x = “john”, the right side will become:
            
            - 
                hasParent(“john”, z1), Person(z1)
            
            - 
                You may be wondering where the z1 comes from.
            
- 
                In The Chase, the “there exists” variables on
                    the right side are replaced with placeholders.
            
- 
                The placeholder ID gets incremented each time a new one is
                    made, but it can be used like a proper variable now.
            
- 
                We can do this step again, this time with x = z1.
            
- 
                ... and then again, with x = z2.
            
- 
                ... and then x = z3.
            
- 
                You get the point; it’ll go on forever.
            
- 
                That’s because the universal model for this example
                    is infinite, so we’ll be generating it forever.
            
- 
                For examples with universal models that are finite, The
                    Chase algorithm will eventually end.
            
             
        
        
            - 
                Let’s formally describe this algorithm.
            
- 
                A rule σ = ∀x∀y (φ(x,y) → ∃z ψ(x,z)) is applicable to instance J if:
            
            - 
                There exists a homomorphism h such that h(φ(x,y)) ⊆ J
            
- 
                There does not exist a h|x ⊆ g such that g(ψ(x,z)) ⊆ J
            
            - 
                Looking at the image below, on the left, if we’re
                    trying to do The Chase with that model J and that rule, we
                    wouldn’t be able to because there exists a
                    homomorphism ‘g’.
            
- 
                However, on the right, the rule is applicable and we can
                    create new records from that rule because a homomorphism
                    ‘g’ cannot exist.
            
             
        
        
            - 
                Let J+ = J ∪{g(ψ(x,z))}, where g ⊇ h|z  and g(z) are “fresh” nulls not in J
            
- 
                The result of applying σ to J is J+, denoted J<σ,h>J+  - a single chase step
            
- 
                If you do two chase steps... would that be J++?
            
        
            - 
                A finite chase of D with respect to Σ is a finite sequence
            
            - 
                D<σ1,h1>J1<σ2,h2>J2<σ3,h3>J3 ... <σn,hn>Jn
            
            - 
                and chase(D,Σ) is defined as the instance Jn
            
- 
                Basically, it’s a chase that has an end.
            
        
            - 
                An infinite chase of D with respect to Σ is a fair finite sequence (by fair, we mean all applicable rules will eventually be
                    applied, so it won’t just be one rule being spam
                    applied like a depth-first search tree or something)
            
            - 
                D<σ1,h1>J1<σ2,h2>J2<σ3,h3>J3 ... <σn,hn>Jn ...
            
            - 
                and chase(D,Σ) is defined as the instance ⋃k≥0 Jk (where J0 = D)
            
- 
                Basically, it’s a chase that has no end.
            
        
            - 
                Keep in mind that The Chase will not give you a unique
                    model!
            
- 
                Depending on the order of rule application, you might get
                    different models.
            
- 
                For example:
            
            - 
                D = {P(a)}
            
- 
                σ1 = ∀x (P(x) → ∃y R(y))
            
- 
                σ2 = ∀x (P(x) → R(x))
            
- 
                Result1 = {P(a), R(z), R(a)} (if you do σ1 then σ2)
            
- 
                Result2 = {P(a), R(a)} (if you do σ2 then σ1)
            
            - 
                However, they are unique up to homomorphic
                    equivalence.
            
- 
                In order words, they’re all homomorphic to each
                    other.
            
- 
                So semantically, there really is only one universal
                    model.
            
        
            - 
                One last thing, just a quick theorem:
            
- If:
            - 
                D ∧ Σ ⊨ Q iff U ⊨ Q where U is a universal model of D ∧ Σ
            
            - 
                chase(D, Σ) is a universal model of D ∧ Σ
            
            - 
                D ∧ Σ ⊨ Q iff chase(D,Σ) ⊨ Q
            
            - 
                Basically, a knowledge base can satisfy a query if the
                    result of The Chase can satisfy it (because the result is just the universal model).
            
        
            - 
                But what happens if The Chase is infinite? Our algorithm
                    will go on forever! I need these questions answered! I
                    can’t sleep if they’re not answered!
                    Please!!
            
- 
                Sorry, bucko, but it seems like your 30-day time trial has
                    come to a close.
            
- 
                This is where the free demo comes to an end.
            
- 
                Pay me £100 for your activation code for the full
                    version of these notes.
            
- 
                Just kidding!
            
- 
                This is still an active research area, so either nobody
                    knows or it’s too bleeding edge to put into a normal
                    Computer Science degree.
            
- 
                Why don’t you become a researcher and find out
                    yourself?
            
- 🙂