COMP3225 - Natural Language Processing
Natural Language Processing: study of interactions between computers and human language, in particular how to program computers to process and analyse large amounts of natural language data
Sentence | Unit of written language
Utterance | Unit of spoken language
Word Form | Inflected form as it appears in the corpus e.g. said
Lemma | Abstract root form of the word e.g. say
Function Word | Word indicating a grammatical relationship e.g. by
Types | Number of distinct words in a corpus
Tokens | Total number of words in a corpus
Corpus: collection of written texts having a specific author, language, time, place, and function
Token: symbol which determines how grammar and semantics can be understood
Morpheme: unit of language that cannot be divided further
Regular Expressions
Operators
Operator | Meaning | Example
/ / | delimiters marking the start and end of an expression | /woodchuck/
[ ] | disjunction | /[abc]/
( ) | precedence | /dogg(y|ies)/
{ } | repetition | /abc{3}/ or /abc{1,3}/
? | optional | /colou?r/
[^ ] | negation (inside brackets) | /[^a]/
. | wildcard | /.txt/
| | disjunction | /cat|dog/
* | zero or many | /[0-9]*/
+ | one or many | /[0-9]+/
*? | non-greedy zero or many | /[0-9]*?/
+? | non-greedy one or many | /[0-9]+?/
( ) \1 | capture group and back-reference | /I have a (\w+), do you want a \1?/
Lookahead Assertions
Operator | Meaning | Example
?: | non-capturing group | (?:word)
?= | lookahead | (?=word)
?<= | lookbehind | (?<=word)
?! | negative lookahead | (?!word)
?<! | negative lookbehind | (?<!word)
Anchors
Expression | Meaning
^ | start of line
$ | end of line
\b | word boundary
\B | non-word boundary
Aliases
Expression | Meaning | Expansion
\d | digit | [0-9]
\w | alphanumeric or underscore | [A-Za-z0-9_]
\s | whitespace | [␣\r\t\n\f]
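A minimal Python sketch (using the standard re module) illustrating a few of the operators, anchors, and aliases above; the example strings are made up for illustration:
import re

# optional character: matches both spellings
print(re.findall(r"colou?r", "color colour"))          # ['color', 'colour']
# disjunction with precedence
print(re.findall(r"dogg(?:y|ies)", "doggy doggies"))    # ['doggy', 'doggies']
# negation inside brackets: any character that is not a vowel or space
print(re.findall(r"[^aeiou ]", "woodchuck"))            # consonants only
# anchors and aliases: a whole line consisting of digits
print(bool(re.match(r"^\d+$", "2024")))                 # True
# word boundary
print(re.findall(r"\bthe\b", "the theatre then the"))   # ['the', 'the']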
Text Normalisation
Normalisation: words or tokens are put into standard format. For example, uh-huh becomes uh huh
Case Folding: all characters are mapped to lowercase
Lemmatisation: reducing words to their root. For example, saying becomes say
Sentence Segmentation: separation of text into sentences using punctuation, markers, and symbols
UNIX Commands
Command | Meaning | Example
tr | translate characters (used here to tokenise) | tr -sc 'A-Za-z' '\n' replaces each run of non-letter characters with a newline, giving one word per line
sort | sort lines alphabetically | tr -sc 'A-Za-z' '\n' | sort
uniq | merge repeated adjacent lines (list unique elements) | tr -sc 'A-Za-z' '\n' | sort | uniq
Patterns
Expression | Meaning
([A-Z]\.)+ | abbreviations
\w+(-\w+)* | hyphenated words
\£?\d+(\.\d+)? | currency
\.\.\. | ellipsis
BPE (Byte Pair Encoding): form of data compression where the most common pair of consecutive bytes is replaced with a new byte
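A minimal sketch of the core BPE merge loop: count the most frequent adjacent symbol pair in a toy vocabulary and merge it. The toy corpus and the number of merges are illustrative assumptions:
from collections import Counter

def get_pair_counts(vocab):
    # vocab maps a space-separated symbol sequence (a word) to its corpus frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace every occurrence of the pair with the merged symbol
    old = " ".join(pair)
    new = "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

vocab = {"l o w _": 5, "l o w e s t _": 2, "n e w e r _": 6, "w i d e r _": 3}
for _ in range(4):                      # the number of merges is a hyperparameter
    pair = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(pair, vocab)
    print("merged", pair)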
Minimum Edit Distance
Minimum edit distance measures the similarity between strings, defined as the minimum number of editing operations, such as insertion, deletion, and substitution, needed to transform one string into another.
The Levenshtein distance between two sequences is the simplest weighting factor in which each operation has a cost of 1.
D(i,j) = min( D(i-1,j) + 1, D(i,j-1) + 1, D(i-1,j-1) + (0 if source[i] = target[j] else 1) )
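A short dynamic-programming sketch of minimum edit distance with unit costs (the Levenshtein setting described above):
def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # D[i][j] = distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # deletions
    for j in range(1, m + 1):
        D[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,          # deletion
                          D[i][j - 1] + 1,          # insertion
                          D[i - 1][j - 1] + sub)    # substitution
    return D[n][m]

print(min_edit_distance("intention", "execution"))   # 5 with unit costs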
Evaluation
Evaluation Techniques
Extrinsic Evaluation: evaluates the performance of an NLP component by embedding it in an application and measuring its performance
Intrinsic Evaluation: measures the quality of an NLP component independently of its application
Steps in training and evaluation: split the corpus into training, development, and test sets; train the model on the training set; tune hyperparameters on the development set; and evaluate on the unseen test set.
The size of the corpus affects the quality of the trained model.
K-Fold Cross-Validation: statistical procedure used to evaluate the skill of a model on limited data
Random subsampling: evaluation technique similar to cross-validation, which selects subsets of data as the test set randomly rather than using K fixed-size groups. This improves randomisation, but may not be representative of all data
Evaluation Metrics
TP (True Positive): "yes" data is correctly predicted as "yes"
TN (True Negative): "no" data is correctly predicted as "no"
FP (False Positive): "no" data is incorrectly predicted as "yes"
FN (False Negative): "yes" data is incorrectly predicted as "no"
Accuracy = (TP + TN) / (TP + TN + FP + FN)
proportion of correct predictions
Precision = TP / (TP + FP)
proportion of correct positive predictions out of all positive predictions
i.e. how many positive predictions were correct?
Recall = TP / (TP + FN)
proportion of correct positive predictions out of all actual positives
i.e. how many of the actual positive cases were found?
F1 = 2 × Precision × Recall / (Precision + Recall), the harmonic mean of precision and recall
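A small sketch computing these metrics directly from the confusion-matrix counts (the counts are illustrative):
def evaluation_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(evaluation_metrics(tp=40, tn=45, fp=10, fn=5))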
ROUGE: recall-based metric used to evaluate text summarisation
BLEU: precision-based metric used to evaluate text translations
BLEU = BP × (∏ n=1..N pn)^(1/N), the geometric mean of the modified n-gram precisions pn multiplied by a brevity penalty BP
BP = min(1, exp(1 - LF/LS)), where LF is the reference length and LS is the system output length
Perplexity: statistical metric used to evaluate language models; the perplexity of a model on a test set is the inverse probability of the test set normalised by the number of words, PP(W) = P(w1…wN)^(-1/N)
N-grams
Probabilities are useful for identifying words where the input is noisy, ambiguous, or broken up.
N-Gram: contiguous sequence of n items. The n-gram model calculates the probability of the next word in some text, given the previous n-1 words
N-gram Probabilities
Assume P(w|h) is the probability of word w given the history h. Suppose the history is "its water is so transparent that", w is "the", and C means count:
P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
Assume P(w1:n) is the probability of a sequence w1, …, wn. We can use joint probability to predict this value:
P(w1) = P(w1)
P(w1,w2) = P(w2|w1) * P(w1)
P(w1,w2,w3) = P(w3|w1,w2) * P(w1,w2)
P(w1,w2,w3,w4) = P(w4|w1,w2,w3) * P(w1,w2,w3)
=> P(w1, …, wn) = P(wn|w1, …, wn-1) * P(w1, …, wn-1)
Applying the chain rule of probability:
P(w1:n) = P(w1) P(w2|w1) P(w3|w1:2) … P(wn|w1:n-1) = ∏ k=1..n P(wk|w1:k-1)
The bigram model approximates the probability of a word wn given all previous words by using only the probability of the previous word: P(wn|w1:n-1) ≈ P(wn|wn-1).
A Markov model is a probabilistic model which assumes it can predict the probability of some future event from its current state.
P(wn|w1:n-1) ≈ P(wn|wn-N+1:n-1)
Given the bigram assumption for the probability of an individual word, we can calculate the probability of a sequence of words w1:n using substitution:
P(w1:n) ≈ ∏ k=1..n P(wk|wk-1)
MLE (Maximum Likelihood Estimation): estimating the n-gram probabilities that maximise the likelihood of the training corpus, by taking counts from the corpus and normalising them to lie between 0 and 1
To calculate the bigram probability of a word y given the previous word x, we can take the count of the bigram C(xy) and normalise by the sum of all bigrams starting with x:
P(wn|wn-1) = C(wn-1 wn) / Σ w C(wn-1 w)
We can simplify this equation by replacing the denominator with the unigram count for wn-1, since the counts of all bigrams starting with wn-1 sum to the count of wn-1:
P(wn|wn-1) = C(wn-1 wn) / C(wn-1)
For the general case of MLE n-gram parameter estimation:
P(wn|wn-N+1:n-1) = C(wn-N+1:n-1 wn) / C(wn-N+1:n-1)
Large n-gram probabilities must be stored as log probabilities to avoid numerical underflow. Adding in log space is equivalent to multiplying in linear space, so log probabilities are combined by addition. The original values can be restored by taking the exponential:
p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4)
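A minimal sketch of MLE bigram estimation and log-space scoring; the toy corpus is an illustrative assumption:
import math
from collections import Counter

corpus = "<s> i am sam </s> <s> sam i am </s> <s> i do not like green eggs and ham </s>".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # P(word | prev) = C(prev word) / C(prev)
    return bigrams[(prev, word)] / unigrams[prev]

sentence = "<s> i am sam </s>".split()
log_prob = sum(math.log(bigram_prob(p, w)) for p, w in zip(sentence, sentence[1:]))
print(log_prob, math.exp(log_prob))   # multiply in linear space by adding in log space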
Limitations
Unknown Words
Zero Probabilities
Sparse Data
Smoothing
Smoothing prevents language models from assigning a probability of zero to unseen n-grams, by taking some probability from frequent n-grams and giving it to unseen n-grams.
In Laplace smoothing, one is added to all bigram counts before they are converted to probabilities, removing any zero values. The MLE for a unigram of word wi is its count ci normalised by the total number of tokens N:
P(wi) = ci / N
Laplace smoothing adds one to each word count and normalises by the vocabulary size V:
PLaplace(wi) = (ci + 1) / (N + V)
It is more convenient to use the adjusted count c* to allow easier comparison with other MLE values:
ci* = (ci + 1) N / (N + V)
The level of smoothing is defined by the relative discount dc, the ratio of adjusted counts to original counts:
dc = c* / c
For bigrams, the unigram count in the denominator must be augmented by the vocabulary size V:
PLaplace(wn|wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
In add-k smoothing, a fractional amount k is added to all bigram counts before they are converted to probabilities:
PAdd-k(wn|wn-1) = (C(wn-1 wn) + k) / (C(wn-1) + kV)
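A sketch of add-k (and, with k = 1, Laplace) smoothed bigram probabilities over a toy corpus; the corpus and variable names are illustrative:
from collections import Counter

corpus = "<s> i am sam </s> <s> sam i am </s>".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)                         # vocabulary size

def addk_bigram_prob(prev, word, k=1.0):
    # P_add-k(word | prev) = (C(prev word) + k) / (C(prev) + k*V)
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

print(addk_bigram_prob("i", "am"))        # seen bigram
print(addk_bigram_prob("i", "sam"))       # unseen bigram gets a non-zero probability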
Interpolation combines several n-gram estimators to give a weighted value. In linear interpolation, n-grams are combined by applying weights λi that sum to 1:
P̂(wn|wn-2 wn-1) = λ1 P(wn) + λ2 P(wn|wn-1) + λ3 P(wn|wn-2 wn-1)
Backoff estimates the probability of an n-gram which has no appearances by using a subset of the n-gram. For example, a trigram wn-2wn-1wn is replaced with the bigram wn-1wn or the unigram wn. If an n-gram has zero occurrences, the algorithm backs off to the (n-1)-gram until an occurrence is found.
Katz Backoff uses a discounted probability P* and a parameter α to distribute the leftover probability mass correctly:
PBO(wn|wn-N+1:n-1) = P*(wn|wn-N+1:n-1) if C(wn-N+1:n) > 0, otherwise α(wn-N+1:n-1) PBO(wn|wn-N+2:n-1)
(BO = backoff, N = n-gram length)
Kneser-Ney smoothing is a commonly used n-gram smoothing method. It subtracts an absolute discount d from each count and redistributes the saved probability mass to lower-order n-grams. For the interpolated bigram case:
PKN(wi|wi-1) = max(C(wi-1 wi) - d, 0) / C(wi-1) + λ(wi-1) PCONTINUATION(wi)
Kneser-Ney smoothing uses the continuation probability, which estimates how likely a word w is to appear as a novel continuation. It is calculated as the number of distinct contexts in which w has appeared, divided by the total number of distinct bigram types:
PCONTINUATION(w) = |{v : C(v w) > 0}| / |{(u', w') : C(u' w') > 0}|
For example, "Britain" has a low continuation probability because it mostly appears in the single context "Great Britain".
Stupid Backoff is a simpler solution used for large-scale text corpora, which is inexpensive to train on large datasets and approaches the accuracy of Kneser-Ney smoothing as the size of the training set increases. Stupid Backoff does not discount higher-order probabilities. If a higher-order n-gram has no occurrences, the algorithm backs off to a lower-order n-gram weighted by a fixed weight λ, producing a score S rather than a true probability:
S(wi|wi-N+1:i-1) = count(wi-N+1:i) / count(wi-N+1:i-1) if count(wi-N+1:i) > 0, otherwise λ S(wi|wi-N+2:i-1)
Parts of Speech
POS Tagging
A part of speech is a category to which a word is assigned in accordance with its syntactic functions. In English the main parts of speech are noun, pronoun, adjective, determiner, verb, adverb, preposition, conjunction, and interjection.
Parts of speech fall into two categories. Closed classes have a fixed membership where specific words belong. These are typically function words such as of, it, and, you. Open classes such as nouns and verbs are flexible and are continually growing to accommodate new words.
POS Tag: label assigned to tokens in a text corpus to indicate the part of speech and often other grammatical categories such as tense, number, and case
POS tagging: process of assigning a part of speech to each word in text. An input sequence x1, …, xn of tokenised words is converted into an output sequence y1, …, yn of tags. Tagging is a disambiguation task because words are ambiguous and may have various possible tags. POS tagging resolves ambiguities and selects the correct POS tag for each word.
Hidden Markov Models
A hidden Markov model (HMM) is a probabilistic sequence model. Given a sequence of tokens, it calculates the probability distribution over possible sequences of tags and selects the best sequence.
A Markov chain is a model showing the probabilities of transitions between random variables called states. A Markov chain makes an assumption that any future sequence only depends on the current state.
The hidden Markov model makes two assumptions. The Markov assumption assumes that the probability of a future state qi depends only on the previous state. The output independence assumption assumes that an output observation oi depends only on its state.
Formally, a Markov chain is defined by the following components:
Q = q1 q2 … qN | set of N states
A = a11 a12 … aNN | transition probability matrix A, where each aij represents the probability of moving from state i to state j
π = π1, π2, …, πN | initial probability distribution over states, where πi is the probability the Markov chain will start in state i
A hidden Markov model contains observed events (input tokens) and hidden events (POS tags). A HMM is defined by the following components:
Q = q1 q2 … qN | set of N states
A = a11 a12 … aNN | transition probability matrix, where each aij represents the probability of moving from state i to state j
O = o1 o2 … oT | sequence of T observations drawn from the vocabulary V
B = bi(ot) | emission probability matrix, expressing the probability of an observation ot being generated from state qi
π = π1, π2, …, πN | initial probability distribution over states, where πi is the probability the Markov chain will start in state i
HMM Taggers
A HMM uses two probabilities. A is the transition probability matrix representing the probability of a tag occurring given the previous tag. The model calculates the MLE of these probabilities by counting tag-pair occurrences over previous-tag occurrences: P(ti|ti-1) = C(ti-1, ti) / C(ti-1).
B is the emission probability matrix representing the probability of a word being associated with a given tag. The model calculates the MLE of these probabilities by counting word-tag pair occurrences over all occurrences of the tag: P(wi|ti) = C(ti, wi) / C(ti).
Decoding is the process of determining the values of the hidden variables (tags). Given an input λ = (A,B) and a sequence of observations O = o1 … oT, find the most probable sequence of states q1 … qT.
For POS tagging, the goal of decoding is to find the tag sequence t1…tn that is most likely given an observation sequence of words w1…wn:
t̂1:n = argmax t1:n P(t1:n | w1:n)
Using Bayes rule:
t̂1:n = argmax t1:n P(w1:n | t1:n) P(t1:n) / P(w1:n)
Simplifying by dropping the denominator:
t̂1:n = argmax t1:n P(w1:n | t1:n) P(t1:n)
A HMM tagger makes two assumptions. The first assumption states that the probability of a word occurring depends only on its tag:
P(w1:n | t1:n) ≈ ∏ i=1..n P(wi | ti)
Second, the bigram assumption states that the probability of a tag occurring depends only on the previous tag:
P(t1:n) ≈ ∏ i=1..n P(ti | ti-1)
Using these assumptions, the following equation gives the most probable tag sequence for a bigram tagger, as the product of the emission and transition probabilities:
t̂1:n = argmax t1:n ∏ i=1..n P(wi | ti) P(ti | ti-1)
Viterbi Algorithm
The Viterbi algorithm is an efficient algorithm for HMM decoding using dynamic programming.
vt-1(i) | the previous Viterbi path probability from the previous time step
aij | the transition probability from previous state qi to current state qj
bj(ot) | the observation likelihood of ot given the current state j
function VITERBI(observations of len T, state-graph of len N) returns best-path, path-prob
  create a path probability matrix viterbi[N,T] and a backpointer matrix backpointer[N,T]
  for each state s from 1 to N do                                  ; initialisation step
    viterbi[s,1] ← πs × bs(o1)
    backpointer[s,1] ← 0
  for each time step t from 2 to T do                              ; recursion step
    for each state s from 1 to N do
      viterbi[s,t] ← max over s' of viterbi[s',t-1] × as',s × bs(ot)
      backpointer[s,t] ← argmax over s' of viterbi[s',t-1] × as',s × bs(ot)
  bestpathprob ← max over s of viterbi[s,T]                        ; termination step
  bestpathpointer ← argmax over s of viterbi[s,T]
  bestpath ← the path starting at state bestpathpointer, following backpointers back through time
  return bestpath, bestpathprob
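A compact NumPy version of the Viterbi algorithm above; the tiny transition, emission, and initial distributions are made-up values for illustration:
import numpy as np

def viterbi(pi, A, B, observations):
    # pi[N], A[N,N], B[N,vocab]; observations is a list of observation indices
    N, T = A.shape[0], len(observations)
    v = np.zeros((N, T))
    backpointer = np.zeros((N, T), dtype=int)
    v[:, 0] = pi * B[:, observations[0]]                     # initialisation
    for t in range(1, T):                                    # recursion
        for s in range(N):
            scores = v[:, t - 1] * A[:, s] * B[s, observations[t]]
            v[s, t] = scores.max()
            backpointer[s, t] = scores.argmax()
    best = [int(v[:, T - 1].argmax())]                       # termination
    for t in range(T - 1, 0, -1):                            # follow backpointers
        best.append(int(backpointer[best[-1], t]))
    return list(reversed(best)), v[:, T - 1].max()

pi = np.array([0.8, 0.2])                                    # e.g. two tags: NOUN, VERB
A = np.array([[0.6, 0.4], [0.7, 0.3]])                       # A[i, j] = P(tag_j | tag_i)
B = np.array([[0.5, 0.1, 0.4], [0.1, 0.6, 0.3]])             # B[i, o] = P(word_o | tag_i)
print(viterbi(pi, A, B, [0, 1, 2]))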
Named Entity Recognition
NER Tagging
Named Entity: word or phrase referred to with a proper name such as a person, location, or organisation
Named entity recognition (NER) aims to find spans of text that constitute proper names and tag them accordingly. BIO tagging is a method that allows us to treat NER as a sequence labelling task by capturing the boundaries of the sequence. I and O represent words inside and outside the entity, B and E represent the beginning and end, and S represents a single entity.
Variant | Tags | Meaning | Example
IO | I O | Inside, Outside | Jane of United Airlines Holding discussed the Chicago route → I-PER O I-ORG I-ORG I-ORG O O I-LOC O
BIO | B I O | Begin, Inside, Outside | Jane of United Airlines Holding discussed the Chicago route → B-PER O B-ORG I-ORG I-ORG O O B-LOC O
BIOES | B I O E S | Begin, Inside, Outside, End, Single | Jane of United Airlines Holding discussed the Chicago route → S-PER O B-ORG I-ORG E-ORG O O S-LOC O
Conditional Random Fields
Conditional Random Field (CRF): statistical modelling technique used to predict the most probable sequence of NER tags for a given sequence of words
Linear Chain CRF: log-linear model that assigns a probability to an output tag sequence Y, out of all possible output sequences 𝒴, given an input word sequence X. Assuming a word sequence X = x1,…,xn, a linear chain CRF produces a tag sequence Y = y1,…,yn by maximising the probability P(Y|X):
P(Y|X) = exp( Σ k=1..K wk Fk(X,Y) ) / Σ Y'∈𝒴 exp( Σ k=1..K wk Fk(X,Y') )
The K functions Fk(X,Y) are the global features for the sequences X and Y, calculated by taking the sum of local features at each position i in Y. Each local feature fk uses the current tag yi, previous tag yi-1, input sequence X, and position i:
Fk(X,Y) = Σ i=1..n fk(yi-1, yi, X, i)
Features
Feature templates populate features with specific values. The template ⟨yi, xi⟩, ⟨yi, yi-1⟩, ⟨yi, xi-1, xi+2⟩ uses information from yi-1, yi, X, and i. Word shape features represent the abstract letter pattern of words by mapping lowercase letters to x, uppercase letters to X, and numbers to d. For example, DC2021 → XXdddd.
A gazetteer is a list of place names, often providing millions of entries for locations with detailed geographical and political information. Name lists containing organisations and products can also be used.
Typical features for a feature-based NER tagger
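A small sketch of a word-shape function and a per-token feature dictionary of the kind a feature-based NER tagger might use; the feature names and the tiny gazetteer are illustrative assumptions, not a fixed standard:
import re

def word_shape(token):
    # map uppercase letters to X, lowercase to x, digits to d, e.g. DC2021 -> XXdddd
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)

def token_features(tokens, i, gazetteer=frozenset({"Chicago", "London"})):
    t = tokens[i]
    return {
        "word": t.lower(),
        "shape": word_shape(t),
        "prefix3": t[:3],
        "suffix3": t[-3:],
        "is_capitalised": t[0].isupper(),
        "in_gazetteer": t in gazetteer,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
    }

tokens = "Jane flew to Chicago in DC2021".split()
print(token_features(tokens, 3))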
Inference and Training
The best tag sequence for a given input X is given by:
Ŷ = argmax Y∈𝒴 P(Y|X)
To decode the optimal tag sequence we use the Viterbi algorithm, which works because a linear chain CRF depends only on the previous token at each time step. The algorithm fills in an N × T array with appropriate values while maintaining backpointers, then follows the pointers to retrieve the optimal tag sequence.
Lexical and Vector Semantics
Lexical Semantics
Lemma: root of a word e.g. mouse
Word Form: variant of a lemma e.g. mice
Word Sense: meaning of a lemma given its context, e.g. mouse as an animal or mouse as a device
Word Similarity: measure of similarity between the meanings of two words. Can be calculated by human judgement or by word co-occurrence
Word Relatedness: measure of association between words e.g. cup and coffee
Sentiment: positive or negative emotions suggested by words e.g. happy, sad
Semantic Field: set of words which cover a semantic domain and share structured relations. For example, words in the semantic field of foods include bread, butter, cake, donuts
Semantic Frame: set of words which denote perspectives or participants in an event. For example, the semantic frame for a commercial transaction includes buy, sell, pay, money
Vector Semantics
Vector semantics aim to represent a given word as a point in a multidimensional space using the distributions of neighbouring words. Vectors for representing words are called embeddings.
In a term-document matrix, each row represents a word and each column represents a document. The matrix has |V| rows and D columns. Information retrieval (IR) is the task of finding a document d from D documents according to some query q represented by a vector |V|.
A term-term matrix is a |V| × |V| matrix encoding the occurrences of a target word against several contexts. Each row represents a word and each column represents a possible context.
To measure the similarity between two words v and w, we can use cosine similarity to produce a value between 0 and 1. The dot product v·w is normalised by the magnitude of each vector:
cosine(v, w) = (v · w) / (|v| |w|)
TF-IDF (Term Frequency-Inverse Document Frequency) evaluates the relevance of a term to a document. The term frequency is tf(t,d) = log10(count(t,d) + 1), the inverse document frequency is idf(t) = log10(N / df(t)) where N is the number of documents and df(t) the number of documents containing t, and the weight is w(t,d) = tf(t,d) × idf(t).
Pointwise Mutual Information (PMI) measures the association between two words by counting their co-occurrences. Given two words w and c, we measure how often they occur together compared to how often they occur independently: PMI(w,c) = log2( P(w,c) / (P(w) P(c)) ). Positive Pointwise Mutual Information (PPMI) replaces negative values with 0.
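A NumPy sketch computing PPMI from a small term-term count matrix and the cosine similarity between two of the resulting word vectors; the counts are illustrative:
import numpy as np

# toy term-term counts: rows = target words, columns = context words
counts = np.array([[2.0, 8.0, 1.0],
                   [6.0, 1.0, 5.0],
                   [1.0, 7.0, 2.0]])
total = counts.sum()
p_wc = counts / total
p_w = p_wc.sum(axis=1, keepdims=True)       # P(w)
p_c = p_wc.sum(axis=0, keepdims=True)       # P(c)

with np.errstate(divide="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)                   # PPMI replaces negative values with 0

def cosine(v, w):
    return v.dot(w) / (np.linalg.norm(v) * np.linalg.norm(w))

print(ppmi)
print(cosine(ppmi[0], ppmi[2]))             # similarity between word 0 and word 2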
Word2Vec
Embeddings and Skip-Grams
A word embedding is a learned representation for text where words that have the same meaning have a similar representation
Skip-grams are a generalisation of n-grams in which terms do not need to be consecutive, but can be separated by a number of terms. Formally, a k-skip-n-gram is a subsequence of length n where terms occur at most k terms away from each other.
Word2Vec
Word2Vec: neural network model that learns word associations from a given text corpus. After training, it can detect synonymous words using vector embeddings
Given an input of a target word w and a window size L to find context words c, the classifier will return the probability that c is a context word for w:
…lemon a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 w c4
P(+|w,c) = probability c is a context word for w
P(-|w,c) = probability c is not a context word for w
The similarity between words is calculated from the dot product of their embeddings, which the logistic (sigmoid) function maps to a probability between 0 and 1:
Similarity(w,c) ≈ c · w
P(+|w,c) = σ(c · w) = 1 / (1 + exp(-c · w))
Assuming that all context words are independent, the probability for all context words c1:L in the window L can be calculated by taking the product:
P(+|w, c1:L) = ∏ i=1..L σ(ci · w)
In summary, Word2Vec trains a probabilistic classifier that takes a target word w and its context window of L words c1:L and assigns a probability based on the similarity between the context window and target word. This is based on the logistic function and the dot product of word embeddings. Word2Vec keeps embeddings for every target word w and every context word c.
Learning Skip-Gram Embeddings
Skip-gram embeddings are learned by starting with random vectors and iteratively shifting the values for each word w to become more like the embeddings of associated words. Positive examples are context words associated with the target word. Negative examples are noise words found using random sampling.
…lemon a [tablespoon of apricot jam, a] pinch…
Random words are chosen according to their weighted unigram frequency Pα(w), where α is a weight. This is set to α = 0.75 by default, as Pα(w) > P(w) for rare words:
Pα(w) = count(w)^α / Σ w' count(w')^α
Given the set of positive and negative training instances, the goal of the algorithm is to adjust these embeddings to maximise the similarity of the target word with the embeddings of positive context words, and to minimise the similarity of the target word with the embeddings of negative noise words.
Given a positive word-context pair (w, c+) and k noise words c-1…c-k, these goals can be expressed as the loss function L to be minimised. The first term expresses that the positive context word c+ should have a high probability of being a neighbour, while the second term expresses that the negative noise words c-i should have a high probability of being non-neighbours:
LCE = -log[ σ(c+ · w) ∏ i=1..k σ(-c-i · w) ] = -[ log σ(c+ · w) + Σ i=1..k log σ(-c-i · w) ]
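A NumPy sketch of one stochastic gradient step on the skip-gram negative-sampling loss for a single (target, positive, negatives) example; the embedding size, vocabulary size, and learning rate are illustrative:
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 50, 1000
W = rng.normal(scale=0.1, size=(vocab, d))   # target-word embeddings
C = rng.normal(scale=0.1, size=(vocab, d))   # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w_id, pos_id, neg_ids, lr=0.05):
    w = W[w_id]
    # loss = -log sigmoid(c_pos . w) - sum_i log sigmoid(-c_neg_i . w)
    grad_w = (sigmoid(C[pos_id] @ w) - 1.0) * C[pos_id]
    C[pos_id] -= lr * (sigmoid(C[pos_id] @ w) - 1.0) * w
    for n in neg_ids:
        grad_w += sigmoid(C[n] @ w) * C[n]
        C[n] -= lr * sigmoid(C[n] @ w) * w
    W[w_id] -= lr * grad_w

sgns_step(w_id=3, pos_id=17, neg_ids=[101, 245])
print(W[3][:5])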
Visualising Embeddings
Bias in Embeddings
Embeddings encode the bias within training sets used to compute them. A historical corpus is likely to encode the bias of its time.
Debiasing is the process of manipulating embeddings to remove harmful bias.
Recurrent Neural Networks
Neural Units
A neural unit takes a set of real numbers and a bias term as input, computes the weighted sum of inputs, and outputs the result. Given a set of inputs x1…xn, a unit has a set of weights w1…wn and a bias term b, which produces a weighted sum z:
z = b + Σ i wi xi
Using vector notation:
z = w · x + b
Instead of using the linear function z directly, neural units apply a non-linear function f to z. The output of this function is the activation value a:
a = f(z)
Using the sigmoid function, the output of a neural unit is:
y = σ(z) = 1 / (1 + exp(-(w · x + b)))
Schematic of a basic neural unit:
Neural Networks
An artificial neural network (ANN) is a computing system based on biological neural networks.
A multi-layer perceptron (MLP) consists of an input layer, hidden layers, and an output layer. Each layer is fully connected to the next layer. The input layer is composed of input neurons, while the hidden and output layers are composed of TLUs. Bias neurons are also included in the network. When an ANN contains many hidden layers it is called a deep neural network (DNN).
Feed-Forward Neural Networks
A feed-forward network is a multilayer neural network where connections are acyclic (no cycles). The outputs from units in each layer are passed to units in the next layer. No outputs are passed back to previous layers. In a fully connected feedforward network, information only travels forwards and each layer is fully connected to the next.
A fully connected feed-forward network takes inputs x1…xn and a bias term b. It has a hidden layer with units h1…hn and an output layer with units y1…yn. The hidden-layer weights and bias are combined into a weight matrix W and bias vector b. The sigmoid function applied to the weighted inputs produces the hidden layer output h:
h = σ(Wx + b)
The output layer has a weight matrix U, which is multiplied by the hidden layer to produce an intermediate output z:
z = Uh
The softmax function normalises the output into a probability distribution where all values lie between 0 and 1 and sum to 1:
softmax(zi) = exp(zi) / Σ j=1..d exp(zj)
The final output y of the network is the normalised vector z:
y = softmax(z)
To calculate the output of the Nth layer in a neural network:
a[N] = g[N]( W[N] a[N-1] + b[N] )
where square bracket notation [i] represents a layer number
Each layer of the network is represented in the following form:
z[i] = W[i] a[i-1] + b[i]
a[i] = g[i]( z[i] )
The algorithm for computing the forward step in an n-layer feed-forward network, given the input vector a[0], is:
for i in 1 to n do
  z[i] ← W[i] a[i-1] + b[i]
  a[i] ← g[i]( z[i] )
ŷ ← a[n]
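A NumPy sketch of the forward pass for a 2-layer fully connected network with a sigmoid hidden layer and softmax output; the layer sizes and random weights are illustrative:
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())            # subtract max for numerical stability
    return e / e.sum()

# layer sizes: 4 inputs, 5 hidden units, 3 output classes
W = rng.normal(size=(5, 4))
b = rng.normal(size=5)
U = rng.normal(size=(3, 5))

x = np.array([1.0, 0.5, -0.2, 0.3])
h = sigmoid(W @ x + b)                 # hidden layer: h = sigma(Wx + b)
z = U @ h                              # output layer pre-activation: z = Uh
y = softmax(z)                         # normalised probability distribution
print(y, y.sum())                      # sums to 1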
Training Neural Networks
A feed-forward neural network is trained as a supervised learning task using an observation x and target y at each step. The system produces an estimate ŷ of the true output value y. The training procedure learns the parameters W[i] and b[i] for each layer i to make ŷ closer to y.
The loss function measures the prediction error by calculating the difference between true values and predicted values. The cross-entropy loss function for logistic regression is:
LCE(ŷ, y) = -[ y log ŷ + (1 - y) log(1 - ŷ) ]
For a neural network whose output is a probability distribution over C classes, the cross-entropy loss is:
LCE(ŷ, y) = - Σ k=1..C yk log ŷk
Neural Language Models
Neural language models use a moving window over the text. Assuming the embeddings have already been pretrained, each of the N previous words is represented as a one-hot vector (a vector with a single element equal to 1 and all other elements set to 0). The embedding matrix E is a dictionary of embeddings containing a d-dimensional column vector for each word in the vocabulary. At each time step t, each of the N one-hot vectors is multiplied by E to give a d-dimensional embedding, and the N embeddings are concatenated. The concatenated vector is multiplied by the weight matrix W and passed through an activation function to produce the hidden layer h. This is multiplied by the output weight matrix U and passed through the softmax function to produce the final output.
This network predicts the probability that the next word wt will be a given vocabulary word Vi and uses the window size N=3.
To train a neural language model, the parameters W, U, E, b must be set using gradient descent. Backpropagation is performed on the network to compute the gradient and set W, U and E. Training proceeds by starting with random weights and iteratively adjusting them as the network scans the input text. The cross-entropy loss for each word wt is the negative log probability the model assigns to the correct word:
LCE = - log ŷ[wt]
Recurrent Neural Networks
A recurrent neural network is a neural network that contains a cycle within its connections. Recurrent units are dependent on their own output as an input.
An Elman network or simple recurrent network is a single unit containing an input layer xt, hidden layer ht, and output layer yt. The hidden layer contains a recurrent connection where the previous output becomes the current input. W is the weight matrix for the input layer. U is the matrix of hidden weights, which connects the hidden layer from the previous time step to the current time step. V is the matrix of output weights.
The activation of the hidden layer ht is used to compute an output yt for input xt. The hidden layer is calculated by passing the weighted hidden values of the previous time step and the weighted input of the current time step into the activation function g. The output layer is calculated by passing the weighted values from the hidden layer into some activation function f:
ht = g( U ht-1 + W xt )
yt = f( V ht )
For soft classification (using class probabilities) the softmax function can be used:
yt = softmax( V ht )
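A NumPy sketch of a single Elman (simple RNN) time step implementing the equations above; the dimensions and random weights are illustrative:
import numpy as np

rng = np.random.default_rng(2)
d_in, d_h, d_out = 4, 6, 3
W = rng.normal(size=(d_h, d_in))       # input weights
U = rng.normal(size=(d_h, d_h))        # recurrent (hidden-to-hidden) weights
V = rng.normal(size=(d_out, d_h))      # output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev):
    h_t = np.tanh(U @ h_prev + W @ x_t)    # h_t = g(U h_{t-1} + W x_t)
    y_t = softmax(V @ h_t)                 # y_t = softmax(V h_t)
    return h_t, y_t

h = np.zeros(d_h)
for x in rng.normal(size=(3, d_in)):       # a toy sequence of 3 input vectors
    h, y = rnn_step(x, h)
print(y)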
The algorithm for computing the forward step in a simple recurrent network is given by:
function FORWARDRNN(x, network) returns output sequence y
  h0 ← 0
  for i ← 1 to LENGTH(x) do
    hi ← g( U hi-1 + W xi )
    yi ← f( V hi )
  return y
Modern deep learning models unroll the recurrent neural network in time to compute loss. The learning template specifies the network structure, parameters, weight matrices, activation and output functions. Given an input sequence, the model generates an unrolled feed-forward network over time. Regular inference or backpropagation can now be performed on the network.
RNN Language Models
RNN language models process inputs one word at a time. They predict the next word in the sequence using the current word and the hidden state. The input x consists of word embeddings represented as one-hot vectors, and the output y is a prediction over the vocabulary. At each step, the model uses the embedding matrix E to retrieve the embedding for the current word and combines it with the previous hidden layer to compute the new hidden layer. This passes through the output layer and softmax function:
et = E xt
ht = g( U ht-1 + W et )
yt = softmax( V ht )
Given the output yt, the probability of a word i occurring next is its corresponding component of yt:
P(wt+1 = i | w1:t) = yt[i]
In teacher forcing, an RNN language model is trained using the cross-entropy loss function:
LCE(ŷt, yt) = - Σ w∈V yt[w] log ŷt[w]
For language modelling, the correct distribution is a one-hot vector for the next word, so the cross-entropy loss is determined by the probability the model assigns to the next word:
LCE = - log ŷt[wt+1]
Applications of RNN Language Models
Stacked and Bidirectional RNNs
In a stacked RNN, the output of one RNN is used as the input to another. Adding layers boosts the performance of the stack and allows abstraction of inputs.
In a simple recurrent network, the hidden state at time t represents all the information in the network about the sequence so far, as the result of a function f applied to the inputs up to t:
htf = f(x1, …, xt)
If the network is trained in reverse, the hidden state at time t represents the information to the right of the current input:
htb = b(xt, …, xn)
A bidirectional RNN combines the forward and backward networks into a single representation that captures the left and right context of an input at each point in time, by concatenating the two hidden states:
ht = htf ⊕ htb
Sequence Processing
Memory Gates
Attention
Self-attention allows a network to extract and use information from a given context without using recurrent connections. To compute an attention score, it compares an item to a collection of other items to calculate its relevance in the current context. This uses a dot-product comparison:
score(xi, xj) = xi · xj
Once all the scores are calculated, they are normalised with the softmax function to create a vector of weights indicating the proportional relevance of each input j:
αij = softmax( score(xi, xj) ) = exp( score(xi, xj) ) / Σ k≤i exp( score(xi, xk) )
Given the weight vector α, it generates an output yi by calculating the sum of the weighted inputs (multiplying the inputs by their relevance values):
yi = Σ j≤i αij xj
Transformers
Query: current focus of attention that is being compared to all other previous inputs
Key: preceding input that is being compared to the query
Value: embedding used to compute the output for the current focus of attention
Transformers use three sets of weights, WQ, WK, and WV, which are multiplied with each input xi:
qi = WQ xi, ki = WK xi, vi = WV xi
The self-attention score between the current input xi and some element xj is the dot product of the query and key vectors, normalised by the square root of the number of dimensions dk:
score(xi, xj) = (qi · kj) / √dk
The final softmax calculation remains the same, but the output is based on the weighted sum of the value vectors v:
yi = Σ j≤i αij vj
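A NumPy sketch of causal scaled dot-product self-attention over a toy input sequence, implementing the query/key/value equations above; the dimensions and random weights are illustrative:
import numpy as np

rng = np.random.default_rng(3)
n, d, d_k = 4, 8, 8                       # sequence length, model dim, key dim
X = rng.normal(size=(n, d))               # input embeddings x_1 ... x_n
WQ, WK, WV = (rng.normal(size=(d, d_k)) for _ in range(3))

Q, K, V = X @ WQ, X @ WK, X @ WV          # q_i = WQ x_i, etc.
scores = Q @ K.T / np.sqrt(d_k)           # score(x_i, x_j) = q_i . k_j / sqrt(d_k)
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf                    # causal mask: ignore positions j > i
alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)            # softmax over each row
Y = alpha @ V                             # y_i = sum_j alpha_ij v_j
print(Y.shape)                            # (4, 8)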
Self-Attention Networks
A transformer is an encoder-decoder architecture that handles long-range sequences by using attention to compute the relevance of each input to produce an output. Transformers map sequences of input vectors x1…xn to output sequences y1…yn of the same length. They are made up of stacks of network layers consisting of simple layers, self-attention layers, feed-forward networks, and connections.
Multi-head self-attention layers are groups of self-attention layers called heads which work in parallel at the same depth in a model, each with their own set of parameters. Given these distinct parameters, each head can learn different aspects of the text at the same level of abstraction.
Positional Embeddings
Transformer inputs are combined with positional embeddings so that the network can keep track of each element and its location.
Machine Translation
Language Divergence and Typology
Some aspects of language are universal and hold true for every language. These include greetings, pronouns, politeness, questions, nouns and verbs. However, languages also differ in many ways. The study of linguistic similarities and differences is called linguistic typology.
Word order typology is the difference in the order of verbs, subjects, and objects in different languages. Languages such as English, French, and German are SVO (Subject-Verb-Object), Hindi and Japanese are SOV (Subject-Object-Verb), and Irish and Arabic are VSO (Verb-Subject-Object).
Referential density refers to the frequencies of omission in different languages. Cold languages can omit pronouns and require the reader to infer the subject of the sentence, while hot languages are explicit.
The Encoder-Decoder Model
Encoder-decoder networks are sequence-to-sequence networks capable of generating translated output sequences. The encoder takes the input sequence and creates a contextualised representation of it using word embeddings. The context is passed to the decoder, which generates a translated output sequence.
Using the source text x and target y, the encoder-decoder model predicts the next target word in the sequence y based on the previous target words and the source sequence x:
p(y|x) = ∏ t=1..m p(yt | y1, …, yt-1, x)
This encoder-decoder network translates from English to Spanish. To translate the source text, we run it through the encoder to generate hidden states and the context vector. The decoder uses the context and the previous hidden state to generate each successive word in the translation.
The encoder takes an input sequence x and generates a contextualised representation for each time step. The final hidden state is the context c, which is passed to the decoder. The decoder uses c as its first hidden state h0d and generates a sequence of outputs y1, …, ym until an end-of-sequence marker is generated. Each decoder hidden state htd is computed from the previous output yt-1, the previous hidden state ht-1d, and the context c:
htd = g( ht-1d, yt-1, c )
The output at each time step is passed through a softmax function over the set of all possible outputs, and the most likely output is selected using the argmax:
yt = softmax( htd )
ŷt = argmax w∈V yt[w]
Training the Encoder-Decoder Model
Encoder-decoder networks are trained using source-target pairs as training data. In training, it is common to use teacher forcing in the decoder: the system is forced to use the correct target word as the next input xt+1, rather than relying on its (possibly erroneous) predicted output ŷt.
Attention
The context c may act as a bottleneck for the encoder, as it must equally represent all data in the source text. The attention mechanism is a more accurate method of implementing long-term memory using all states in the encoder.
The context c is a fixed-length vector that is a function of the encoder hidden states, i.e. c = f(h1e, …, hne). The attention function calculates a weighted sum of all the encoder's hidden states. The context is dynamically derived at each point during decoding: a new vector ci is generated at each decoding step i and takes all of the encoder hidden states into account.
Attention also looks at the relevance of each encoder hidden state hje to the previous decoder hidden state hi-1d. The relevance is captured by computing the dot-product attention score:
score(hi-1d, hje) = hi-1d · hje
(hi-1d = decoder hidden state, hje = encoder hidden state)
The vector of scores across all encoder hidden states gives the relevance of each encoder state to the current step of the decoder. These scores are normalised with the softmax function to create a weight vector expressing the proportional relevance of each encoder hidden state hje to the previous decoder hidden state hi-1d:
αij = softmax( score(hi-1d, hje) )
Given the distribution α, the fixed-length context vector ci can be computed by taking the weighted average over all encoder hidden states:
ci = Σ j αij hje
Beam Search
Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. Beam search uses breadth-first search to build its search tree. At each level, it generates the successors of each state, sorts them by heuristic cost, and keeps only the k best states to expand, where k is the beam width. Rather than performing greedy search, which only selects the single most promising node, beam search keeps k possible nodes, called hypotheses, at each step. In the first step of decoding, it computes a softmax distribution over the vocabulary, assigning a probability to each word, and selects the k best options as the initial hypotheses.
Each hypothesis is scored by P(yi|x, y<i), which is the probability of the current word multiplied by the probability of the path leading to it. The beam prunes the hypotheses down to the k best until an end token </s> is generated, indicating a complete output. The completed hypothesis is removed from the beam and beam search continues, reducing the beam width each time, until the beam width is reduced to 0.
We can use the chain rule of probability to break down P(y|x) into the product of each word's probability given its context, which we can convert into a sum of logs:
score(y) = log P(y|x) = Σ i=1..T log P(yi | y1, …, yi-1, x)
This can be normalised to account for length by dividing by the number of words T:
score(y) = (1/T) Σ i=1..T log P(yi | y1, …, yi-1, x)
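A small sketch of beam search over a toy next-token distribution; next_token_probs is a stand-in assumption for the decoder's softmax output, and the length-normalised log-probability score follows the equations above:
import math

def next_token_probs(prefix, vocab):
    # stand-in for the decoder: a toy distribution favouring earlier vocabulary items
    weights = [1.0 / (i + 1) for i in range(len(vocab))]
    total = sum(weights)
    return {w: p / total for w, p in zip(vocab, weights)}

def beam_search(vocab, k=3, max_len=5):
    beams = [(["<s>"], 0.0)]                        # (hypothesis, log probability)
    completed = []
    while beams and len(completed) < k:
        candidates = []
        for seq, logp in beams:
            for word, p in next_token_probs(seq, vocab).items():
                candidates.append((seq + [word], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, logp in candidates[:k]:            # keep the k best hypotheses
            if seq[-1] == "</s>" or len(seq) >= max_len:
                completed.append((seq, logp / len(seq)))   # length-normalised score
            else:
                beams.append((seq, logp))
    return sorted(completed, key=lambda c: c[1], reverse=True)

print(beam_search(["a", "b", "</s>"]))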
Constituency Grammars
Constituency
Syntactic constituency is the idea that groups of words can behave as single units called constituents. Words are grouped into phrases such as noun phrases or verb phrases and are generated using grammars.
A context-free grammar consists of a set of symbols and a set of rules or productions of the form A → α, where A is a non-terminal and α is a sequence of terminal and/or non-terminal symbols. Productions are written in BNF:
NP → Det Nominal
NP → ProperNoun
Nominal → Noun | Nominal Noun
Derivations of context-free grammars can be represented as a parse tree. The root node dominates nodes below it. A parse tree often begins with the start symbol S.
Grammar Rules
S → NP VP I prefer a morning flight
VP → Verb NP prefer a morning flight
VP → Verb NP PP leave Boston in the morning
VP → Verb PP leave on Thursday
PP → Preposition NP from England
Nominal → Noun flight
Nominal → Nominal Noun the flight
Nominal → Num Nominal 6 flights
S → NP VP The flight leaves at 6pm
S → VP Show me the flight at 6pm
S → Aux NP VP Are any flights available today?
S → Wh-NP VP What flights leave at 6pm?
S → Wh-NP Aux NP VP What flights do you have at 6pm?
A wh-non-subject-question contains a long-distance dependency because the Wh-NP constituent is far away from the predicate VP.
A clause represents a 'complete thought' and is made up of two or more of the following components:
Treebanks
A treebank is a syntactically annotated corpus. In a treebank, every sentence in the collection is paired with a corresponding parse tree.
A -NONE- token represents a long-distance dependency or a syntactic movement. For example a speech phrase such as "Let's go," he said separates the speech from the non-spoken phrase.
Treebanks tend to implicitly encode a grammar, as rules are repeated across the data.
A lexical head is the most grammatically important word in a phrase. For example, Noun is the head of NP and Verb is the head of VP. Each context-free rule is associated with a lexical head which appears in the parse tree.
Combinatory Categorial Grammar is a generative grammar that allows both left-to-right and word-by-word composition. It consists of a set of categories (atomic and complex) and a set of rules or productions for combining them.
The notation (X / Y) denotes a function that takes Y as an argument to the right and returns X. The notation (X \ Y) is the same but takes its argument from the left. The following rules apply:
(X / Y) Y → X
Y (X \ Y) → X
A combination of more complex rules allows derivations to be formed:
Constituency Parsing
Syntactic Parsing: technique by which tokenised POS-tagged text is assigned a structure that reveals relationships between tokens governed by syntax rules
Structural ambiguity occurs when a grammar can assign more than one structure to a sentence.
CKY Parsing
The CKY algorithm is a dynamic programming technique operating on context-free grammars in CNF (Chomsky Normal Form) to derive the most probable parse trees. It fills in a table of sub-problems to solve the original problem as a whole. CKY indexes sentences using fenceposts between tokens, e.g. 0 Book 1 that 2 flight 3. The parse table is filled in left-to-right and bottom-up. Each cell on the superdiagonal contains the POS tags for the corresponding word. The other cells are filled by analysing phrases within the sentence.
The outermost loop iterates over the columns and the next loop iterates over the rows. The innermost loop ranges over the positions at which a substring may be split. For a cell [i,j], the algorithm considers whether the contents of cells [i,k] and [k,j] can be combined via a grammar rule; if a rule A → B C exists with B in [i,k] and C in [k,j], the left-hand non-terminal A is entered into cell [i,j].
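A compact CKY recogniser sketch over a tiny CNF grammar; the grammar and sentence are illustrative, and table[i][j] holds the set of non-terminals that can derive words i..j:
# grammar in CNF: lexical rules (A -> word) and binary rules (A -> B C)
lexical = {"book": {"V", "N"}, "that": {"Det"}, "flight": {"N"}}
binary = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP", "S"}, ("N", "N"): {"Nominal"}}

def cky(words):
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for j in range(1, n + 1):
        table[j - 1][j] = set(lexical.get(words[j - 1], set()))   # superdiagonal: POS tags
        for i in range(j - 2, -1, -1):                            # fill column j bottom-up
            for k in range(i + 1, j):                             # possible split points
                for B in table[i][k]:
                    for C in table[k][j]:
                        table[i][j] |= binary.get((B, C), set())
    return table

table = cky("book that flight".split())
print("S" in table[0][3])        # True: the sentence is grammatical under the toy grammar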
Span-Based Neural Constituency Parsing
Span-based neural constituency parsing, or neural CKY, disambiguates among the possible parse trees to select the most probable tree. It trains a neural classifier to assign scores to constituents, then uses CKY to find the best-scoring parse tree.
Neural CKY architecture:
A parse tree T is a set of |T| spans, where the tth span starts at position it and ends at position jt with label lt. Given a span [i,j] with label l, the classifier assigns a score s(i,j,l) to the span.
The parser uses these scores to compute the score for the whole parse tree s(T) by summing over the scores of its constituent spans:
s(T) = Σ (i,j,l)∈T s(i,j,l)
The most probable parse tree is the one with the maximum score: T̂ = argmax T s(T). A greedy approach simply selects the highest-scoring label for each span; this finds a valid tree around 95% of the time.
The variant CKY algorithm finds the full parse tree. It defines sbest(i,j) as the score of the best subtree spanning (i,j).
For spans of length 1:
sbest(i, i+1) = max l s(i, i+1, l)
For other spans (i,j):
sbest(i, j) = max l s(i, j, l) + max k [ sbest(i, k) + sbest(k, j) ]
Dependency Parsing
Dependency Grammars & Relations
Dependency grammars are defined by a set of directed binary grammatical relations that hold among words in a sentence. They are useful for languages that are morphologically rich and have free word order. They can also be used in coreference resolution, question answering, and information extraction.
Relations are represented by labelled directed arcs from heads to dependents. This is a typed dependency structure as labels are drawn from a fixed set of relations. The root node marks the head of the tree.
Dependency relations define labels for head and dependent word groups. Clausal relations involve a predicate (often a verb) and modifiers attached to it. Nominal relations involve modifiers for nominal (often noun) words.
A dependency graph is a directed graph G = (V, A) where V is the set of vertices and A is the set of ordered pairs of vertices, called arcs. Arcs capture head-dependent and grammatical relationships between vertices. A dependency tree is a directed graph satisfying the following constraints: there is a single designated root node with no incoming arcs; apart from the root, each vertex has exactly one incoming arc; and there is a unique path from the root to each vertex.
An arc from a head to a dependent is projective if there is a path from the head to each word between the head and dependent. A dependency tree is projective if all the arcs inside it are also projective.
Transition-Based Dependency Parsing
Shift-reduce dependency parsing uses a context-free grammar, a stack, and a list of input tokens. Input tokens are successively shifted onto the stack and the top two elements are matched against the grammar rules. When a match is found, a dependency relation is added to the parse tree.
The algorithm for shift-reduce dependency parsing reduces the stack by creating dependency relations until only the root node is left. The oracle analyses the top two items on the stack to determine whether the parser should create a new relation or shift a new token onto the stack. The oracle has access to the current configuration (the stack, the input buffer, and the set of relations created so far) and, during training, to the gold reference parse for the sentence.
Given this information, the oracle selects a left arc if it produces a correct head-dependent relation, a right arc if it produces a correct head-dependent relation and all the dependents at the top of the stack have already been assigned, or a shift otherwise.
Features such as word forms, lemmas, and POS tags are used to train the oracle. Feature templates are abstract sets of predefined feature types to use during training. Features are denoted as location.property where s is the stack, b is the buffer, and r is the set of relations. w is a word form, l is a lemma, t is a POS tag, and op is the transition operator.
Graph-Based Dependency Parsing
Graph-based dependency parsing searches through a space of possible parse trees to maximise their score. The search space is represented as a directed graph which employs graph theory to search for optimal solutions. Given a sentence S, the parser searches for the best tree in the set of all possible trees GS. Graph-based methods are useful as they can produce non-projective trees and have higher accuracy with long-distance dependencies.
Given an input sentence, the algorithm constructs a fully connected weighted directed graph where the vertices are the input words and the edges represent all possible head-dependent assignments. A root node is included with outgoing edges to all other vertices. The weight for each edge represents its score.
For each vertex in the graph, the algorithm selects the incoming edge with the highest score. If the resulting subgraph T = (V, F) is a spanning tree (has no cycles and each vertex except the root has one incoming edge) the tree is returned. Otherwise, the selection process continues. During cleanup, all weights are adjusted by subtracting the score of the maximum edge from the score of all other edges entering that vertex. If any edges are equal to zero, the cycle is collapsed into a single node.
The score for each edge can be reduced to a weighted sum of features. Common features include word forms, lemmas, POS tags, NER tags, embeddings, relations, dependencies, and dependency distances.
Word Senses and WordNet
Word Senses
Words are ambiguous and can have different meanings. A word sense is a representation of one aspect of the meaning of a word.
Relations between senses include synonymy, antonymy, taxonomic relations (hyponymy and hypernymy), and meronymy (part-whole relations).
WordNet
WordNet is a lexical database containing semantic relations between words. Each word is described by a gloss (dictionary-style definition), a list of synonyms for the sense, and usage examples. The set of synonyms for a sense is called a synset. The lexicographic category for a synset is called a supersense.
Word Sense Disambiguation (WSD) is a classification task which predicts word senses in sentences. A WSD algorithm is typically trained using a semantic concordance, a corpus in which each word is labelled with its word sense. Given each noun, verb, adverb, or adjective in the test set, the algorithm will predict the correct sense from the possible senses provided in WordNet.
WSD algorithms are usually evaluated using the F1 score. A strong baseline involves selecting the most frequent sense for each word. Another baseline called one-sense-per-discourse involves using the same sense each time a word appears in text.
The most accurate form of WSD uses the nearest neighbour algorithm. During training, it passes each sentence of a labelled dataset through an encoder to build a contextual embedding for each token. For each token ci of each sense c, it averages the embeddings to produce a sense embedding vs for c:
vs = (1/n) Σ i=1..n ci
The algorithm computes an embedding t for each word in the test set, then finds its nearest neighbour (cosine distance) from the training set.
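A small sketch of the nearest-neighbour disambiguation step: sense embeddings are averaged from labelled token embeddings, and a test token is assigned the sense whose embedding is closest by cosine similarity. The vectors here are made-up stand-ins for contextual embeddings:
import numpy as np

def cosine(v, w):
    return v.dot(w) / (np.linalg.norm(v) * np.linalg.norm(w))

# labelled training tokens: sense -> list of contextual embeddings for that sense
labelled = {
    "mouse_animal": [np.array([0.9, 0.1, 0.0]), np.array([0.8, 0.2, 0.1])],
    "mouse_device": [np.array([0.1, 0.9, 0.3]), np.array([0.2, 0.8, 0.4])],
}
# sense embedding = average of its token embeddings
sense_vectors = {s: np.mean(vs, axis=0) for s, vs in labelled.items()}

test_token = np.array([0.15, 0.85, 0.35])       # embedding of the word in its test context
best_sense = max(sense_vectors, key=lambda s: cosine(test_token, sense_vectors[s]))
print(best_sense)                               # mouse_device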
For out-of-vocabulary (OOV) words, the algorithm can infer the missing embeddings using supersenses. It calculates the embedding for any higher-level node in the WordNet taxonomy by averaging the embeddings of its children. For a sense missing from the training data, the synset, hypernym, and supersense embeddings are used in turn, taking the lowest level at which an embedding can be computed.
The embedding for a synset is the average of its sense embeddings.
The embedding for a hypernym is the average of its synset embeddings.
The embedding for a supersense is the average of its synset embeddings.
Information extraction: process of converting unstructured text into structured data, such as a relational database or set of extracted tuples
Relation extraction: process of finding semantic relations among text entities, such as parent-child, part-whole or geospatial relations
Relation Extraction
A relation consists of a set of ordered tuples over named entities. The ACE (Automatic Content Extraction) dataset defines 17 relation types and 7 entity types.
Wikipedia provides a database of relations contained in infoboxes. Relations appear as RDF (Resource Description Framework) triples appearing as subject-predicate-object expressions. DBpedia is an ontology derived from Wikipedia containing RDF triples.
Subject | Predicate | Object
Golden Gate Park | location | San Francisco
WordNet offers ontological relations between words and concepts such as hypernyms (is-a) and hyponyms (instance-of). The TACRED dataset contains relation triples associated with people and organisations, represented as labelled sentences.
Hearst patterns are hand-crafted lexico-syntactic patterns expressed using regular expressions; for example, the pattern "NP0 such as NP1" suggests that NP1 is a hyponym of NP0. These have high precision but low recall.
Supervised Learning
Supervised learning approaches are trained on an annotated corpus containing relations and entities, and the annotated texts are used to train classifiers to annotate unseen text. The supervised learning algorithm finds pairs of named entities in a sentence and classifies each pair.
Feature-based classifiers typically use word features such as embeddings, one-hot vectors, n-grams, lemmas, named entity features such as NER tags and entity levels, and syntactical features such as dependencies or dependency distances.
Neural supervised relation classifiers perform supervised classification on words in a sentence. A typical transformer-encoder algorithm uses a pretrained encoder such as BERT to encode sentences with labels. The encoder replaces subject and object entities with NER tags to avoid overfitting. Neural classifiers perform well if the test set is similar to the training set. However, supervised models do not generalise well to different types of text.
Semi-Supervised Learning
Bootstrapping uses a small set of high quality seed tuples of the form (r, e1, e2) to find sentences containing both entities. It extracts and generalises the context around the entities to learn new patterns.
The bootstrapping algorithm also assigns confidence values to new tuples to avoid semantic drift. In semantic drift, an incorrect pattern leads to the introduction and propagation of incorrect tuples, causing the meaning of the relation to 'drift'.
Distant Supervision
Distant supervision combines the advantages of bootstrapping and supervised learning: it uses a large knowledge base such as DBpedia to gather very large numbers of seed examples, then trains a supervised classifier on the resulting (noisy) training data. As distant supervision doesn't use an iteratively labelled training corpus, it isn't affected by semantic drift.
Unsupervised Learning
Unsupervised relation extraction finds relations without using labelled training data. This task is often called Open Information Extraction (Open IE). The ReVerb algorithm extracts relations from a sentence s in 4 steps:
1. Run a POS tagger and entity chunker over s
2. For each verb in s, find the longest phrase starting with the verb that satisfies the syntactic and lexical constraints below
3. For each such phrase, find the nearest noun phrase to the left that is not a relative pronoun, wh-word, or existential "there", and the nearest noun phrase to the right
4. Assign a confidence value to the resulting relation using a trained confidence classifier
A relation is only accepted if it meets syntactic and lexical constraints. These ensure that it is a verb-initial sequence that may also include nouns. The constraints are based on a dictionary D that is used to prune irrelevant strings. This eliminates relations that do not occur often enough to form distinct examples.
V | VP | VW*P
V = verb particle? + adverb?
W = noun | pronoun | adjective | adverb | determiner
P = prep | particle | marker
Temporal Expression Extraction
Temporal expression extraction finds expressions containing dates and times. Relative expressions map to particular times through some reference point. Durations denote spans of time at varying levels of granularity.
Rule-based systems can extract these expressions using lexical triggers such as nouns, verbs, and adjectives. Sequence labelling approaches use IOB tagging to label named entities in sentences. Features are extracted from tokens and their contexts, and a sequence labelling model is trained on the data. Typical features include word shape, POS tags, NER tags, context, and lexical triggers.
Temporal normalisation is the process of mapping a temporal expression to a specific point in time or duration. Most approaches for this task are rule-based and match regular expressions such as YYYY-MM-DD and HH:MM:SS. The temporal anchor is the central point in time to which all other temporal expressions are compared.
Event extraction is used to identify events in texts. An event is an expression denoting an event or state that can be assigned to a temporal expression. Events are classified as actions, states, reporting events (say, tell, explain), perception events (see, observe), and more. Event extraction is usually performed using supervised learning. Sequence labelling models can use IOB tagging to classify event classes and attributes. Feature-based models use information such as POS tags, tense, and word forms.
Template Filling
A script is an abstract sequence of events, participants, and roles. Scripts can be represented as templates consisting of fixed sets of slots that take inputs called slot-fillers. The task of template filling involves finding documents that contain scripts and filling slots with slot-fillers extracted from the text.
Semantic roles are relations between arguments and predicates, representing abstract concepts such as causation, experience, and effect. Proto-agent arguments display agent-like properties such as being volitionally involved in an event and causing an event or change of state. Proto-patient arguments display patient-like properties such as undergoing a change of state and being affected by other participants.
PropBank
PropBank is a database of sentences annotated with semantic roles. Roles are defined separately for each word sense. Each verb sense has a specific set of roles of the form Arg0, Arg1, Arg2, …
PropBank provides additional arguments called ArgMs which represent modifications or adjunct meanings.
Modifier | Type | Meaning | Example
TMP | Temporal | when an action takes place | yesterday evening, now
LOC | Locative | where an action takes place | at the museum, in San Francisco
MNR | Manner | how an action is performed | clearly, with much enthusiasm
DIR | Directional | motion from source to goal | down to London
CAU | Cause | reason for an action | as a result of the interview
REC | Reciprocal | reflexives and reciprocals | himself, themselves, together
ADV | Adverbial | miscellaneous modifier | happily, she attended
FrameNet
FrameNet is a semantic role-labelling database that stores relations as frames. A frame is a background knowledge structure that connects a set of words using semantic roles called frame elements. For example, the change_position_on_a_scale frame consists of words that indicate a change of position on a scale (the Attribute) from a starting point (Initial_value) to an end point (Final_value).
Semantic Role Labelling
Semantic Role Labelling (SRL) is the task of extracting semantic roles for arguments and predicates in a sentence. A feature-based SRL algorithm extracts features from labelled words in a parse tree. A classifier is then trained to predict a semantic role for each constituent using the features.
The classification task can also be broken down into multiple steps: pruning constituents that are unlikely to be arguments, identifying which of the remaining constituents are arguments, and classifying each argument with its semantic role.
Most SRL systems use a generalisation of core feature templates. For the sentence "The San Francisco Examiner issued a special edition around noon yesterday", features for the constituent "The San Francisco Examiner" could include the governing predicate (issued), the phrase type (NP), the headword and its POS tag (Examiner, NNP), the path in the parse tree from the constituent to the predicate, the voice (active), the linear position relative to the predicate (before), and the named entity type (ORG).
Neural Networks for Semantic Role Labelling
A neural sequence labelling algorithm aims to compute the most probable tag sequence ŷ given an input sequence w:
ŷ1:n = argmax y1:n P(y1:n | w1:n)
Each input word is mapped to pretrained BERT embeddings and associated with a flag indicating whether it is a predicate. The concatenated embeddings are passed through an LSTM network. The output embeddings and predicate embeddings are concatenated and passed through an MLP with a softmax function which outputs a probability for each tag.
Information Retrieval (IR) is the retrieval of media based on the information a user needs. The user provides a query to the retrieval system, which returns a set of relevant documents from a database. Documents include web pages, scientific papers, news articles, or paragraphs.
IR-Based Factoid QA
IR-Based QA/Open Domain QA answers questions by finding short text segments from documents such as web pages. Factoid questions always have short answers such as noun phrases and verb phrases.
Datasets for IR-based QA are typically created by developing reading comprehension datasets containing tuples of the form (passage, question, answer). This type of data can be used to train a reader to predict a span in the passage as the answer, given a passage and question.
IR-based QA uses the retrieve and read model. In the retrieval stage, it finds relevant passages from a text database. In the read stage, a neural reading comprehension algorithm passes over each passage to find spans of text that are likely to answer the question.
Retrieve and read is typically performed by span labelling. The algorithm identifies a text span in a passage that constitutes an answer. A neural algorithm is given a question q containing n tokens q1…qn and a passage p of m tokens p1…pm. It then computes the probability P(a|q,p) that each possible text span a is the answer. If each span starts at position as and ends at position ae, the probability can be estimated as P(a|q,p) = Pstart(as|q,p) Pend(ae|q,p). For each token pi in the passage, the algorithm computes the start and end probabilities pstart and pend.
A standard baseline algorithm passes the question and passage to an encoder such as BERT as strings separated by the [SEP] token. It then trains a network layer to predict the start and end positions of the answer span, represented by the vector embeddings S and E. For each output token pi, the algorithm computes the dot product between S and pi (or E and pi), then applies a softmax over all tokens to obtain its start (or end) probability.
The score for a span from position i to j is S · pi + E · pj. The span with the highest probability is selected as the final model prediction. The model is trained using log-loss with respect to the gold start and end positions of each answer span.
Entity Linking
Entity linking is the task of associating a mention in text with a representation of a real-world entity. TAGME is a simple baseline algorithm which uses data from Wikipedia pages. It first creates a catalog of Wikipedia pages and indexes them in a standard IR engine. For each Wikipedia page e, the algorithm computes an in-link count, which is the total number of links from other pages that point to e. Finally, the algorithm requires an anchor dictionary. The anchor dictionary contains the anchor texts for each page. These are hyperlinked text spans from other pages that point to the current page e.
Given an input question, TAGME detects mentions by querying the anchor dictionary for each token sequence. If a mention is ambiguous (points to more than one page), the algorithm disambiguates it by calculating a score based on two metrics. The prior probability P(e|a) is the probability that the span refers to the entity: for each page, the probability that an anchor a points to e is the number of links to e with anchor text a divided by the total number of occurrences of a as an anchor:
p(e|a) = count(a links to e) / count(a)
Given a question q, the relatedness for an entity e and anchor a is the weighted average relatedness between e and all other entities in q, where in(x) is the set of Wikipedia pages pointing to x and W is the set of all Wikipedia pages in the collection. To calculate the relatedness between two entities A and B:
rel(A,B) = ( log(max(|in(A)|, |in(B)|)) - log(|in(A) ∩ in(B)|) ) / ( log(|W|) - log(min(|in(A)|, |in(B)|)) )
To compute a final score for each entity, TAGME combines relatedness and prior probability: it selects the entities with high relatedness scores, then selects the entity with the highest prior probability from this set.
Neural Entity Linking
A neural graph-based linking algorithm encodes a candidate span, encodes an entity, and computes the dot product between the encodings. This allows embeddings for all the entities in the knowledge base to be pre-computed and cached. Given a question q and a set of candidate mentions from Wikipedia text, the algorithm outputs tuples of the form (e, ms, me) where e = entity, ms = mention start, me = mention end.
To get a h-dimensional embedding for each question token, the algorithm runs the question through BERT.
It then computes the probability of each span [i,j] in q being an entity mention. First it computes scores for a token being the start or end of a mention:
sstart(i) = wstart · hi, send(j) = wend · hj
where hi is the BERT embedding of token i, and wstart and wend are vectors learned during training. Another trained embedding wmention is used to compute a score for each token being part of a mention:
smention(t) = wmention · ht
The three scores are combined to create the mention probability:
p([i,j]) = σ( sstart(i) + send(j) + Σ t=i..j smention(t) )
The algorithm links mentions to entities by computing the dot-product similarity between the embedding of each entity e and an encoding of the span [i,j]:
s(e, [i,j]) = xe · y[i,j]
Finally, it takes a softmax distribution over the candidate entities for each span:
p(e | [i,j]) = softmax( s(e, [i,j]) ) over the set of candidate entities
Evaluation of Factoid Answers
The mean reciprocal rank (MRR) compares a set of gold answers against a ranked list of system answers by taking the reciprocal of the rank of the first correct answer (scoring 0 if no correct answer is returned). The score of a system is the average of the reciprocal ranks across all questions in the set:
MRR = (1/N) Σ i=1..N 1/ranki
Reading comprehension systems are often evaluated using two metrics: exact match, the percentage of predicted answers that exactly match the gold answer, and F1, the average token overlap between predicted and gold answers.
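A small sketch of the mean reciprocal rank computation described above; each system output is a ranked list of answers and gold is the set of acceptable answers (illustrative data):
def mean_reciprocal_rank(ranked_answers, gold_answers):
    total = 0.0
    for ranked, gold in zip(ranked_answers, gold_answers):
        rr = 0.0                                # 0 if no correct answer is returned
        for rank, answer in enumerate(ranked, start=1):
            if answer in gold:
                rr = 1.0 / rank                 # reciprocal rank of the first correct answer
                break
        total += rr
    return total / len(ranked_answers)

ranked = [["Paris", "Lyon"], ["1947", "1950", "1948"], ["blue"]]
gold = [{"Paris"}, {"1948"}, {"red"}]
print(mean_reciprocal_rank(ranked, gold))       # (1 + 1/3 + 0) / 3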