Social Computing Techniques (COMP3208) Notes Document
Introductory Lecture
Module Overview
Coursework
Additional Links and Resources
Recommender Systems
Where to Find What
What is Social Computing?
What is Crowdsourcing?
Crowdsourcing Examples
CAPTCHA
NoiseTube
Amazon Mechanical Turk
Some other examples
Recommender Systems
Online Advertising Auctions
Summary
Past & Present
Terminology
History of Social Computing
1714 - Longitude Prize
1791 - Gaspard de Prony’s Logarithmic Tables
Wisdom of the Crowds
1906 - Sir Francis Galton
1986 - Space Shuttle Challenger Disaster
Early Social Computing Systems
Commercialisation
Web 2.0
Rise of Crowdsourcing
Crowdsourcing for Social Good
Crowdsourcing and Machine Learning
Why Do We Still Use Social Computing?
Features of Computing Systems
Recommender Systems
Learning Outcomes
Resources
What are Recommender Systems?
Examples of Recommender Systems
Why use Recommender Systems?
Potential Issues
Paradigms of Recommender Systems
Content-based
Knowledge-based
Collaborative Filtering
Hybrid
Pros and Cons of Different Approaches
Similarity Coefficients
“Cleaning” Features
Calculating Similarity
Exercise
Self-Test Question Answers
Question 1
Question 2
Question 3
User-Based Collaborative Filtering
Similarity Measure (for Ratings)
Pearson Coefficient Example
Neighbourhood Selection
Prediction
Improving Prediction
Produce Recommendation
Item-Based Collaborative Filtering
Similarity Measure (for Ratings)
Adjusted Cosine Similarity Example
Prediction
Exercise 2
Alternative Approaches
Simple Approach: Slope One
Types of Ratings
Example: Items Bought Together
Evaluating Recommender Systems
Correct and Incorrect Classifications
Metrics
Precision
Recall
Accuracy
Rank Score
Error Measures
Online Auctions
Where Are Auctions Used and Why?
Sponsored Search
Display Advertising
Why Use Auctions?
Auction Design
Auction Types
English Auction
Dutch Auction
Sealed-Bid First Price Auction
Sealed-Bid Second Price Auction (Vickrey Auction)
Dominant Strategy
Formal Model and Utility
Computing Expected Utilities
Reputation Systems
Trust Issues…
Terminology
Cherries and Lemons
The Lemon Problem
Moral Hazard
Trust on the Web
The Solution?
Reputation vs. Recommender Systems
Design Considerations
Existing Systems
Reputation Value
Measuring Confidence
Probabilistic Approach
Using the Beta Distribution
Evaluating Reputation Systems
Issues
Financial Incentives
Experimental Design
Human Computation
Modern Recommendation Systems
Incentives in Crowd
Rank Aggregation
Experts Crowdsourcing
Revision Lectures
tl;dr
Introductory Lecture
Past & Present
Recommender Systems
Online Auctions
Reputation Systems
Additional Lectures
Introduction to Coursework
Assignments
Useful Resources
Forming Groups (of 2 or 3)
Task Datasets
Submitting Computed Ratings
Written Report
Academic Integrity
Coursework Surgery Sessions
Coursework “Surgery” 1
Coursework “Surgery” 2
Coursework “Surgery” 3
Coursework “Surgery” 4
Guest Lecture: Davide Zilli from Mind Foundry
Guest Lecture: Mikhail Fain from Cookpad
Introductory Lecture
Module Overview
The primary module aim is to investigate how online collective intelligence and “crowd
wisdom” is used in various business and research areas, and the underlying technologies/algorithms used in practice.
Main topics:
-
Crowdsourcing
-
Recommender Systems
-
Reputation Systems
-
Experimental Design and Web Analytics
-
Online Advertising Auctions
-
Rank Aggregation
-
There’ll also be some guest lectures from
industry/academia discussing applications; confirmed for
2020-21 (if you’re reading this later, RIP):
-
Mikhail Fain (Cookpad: Food tech company)
-
Davide Zilli (Mind Foundry: Oxford’s spin-off
company)
Learning outcomes:
-
demonstrate knowledge of social computing technology
applications, e.g. crowdsourcing, recommender systems, online
auctions
-
describe the most important techniques and issues in designing, building
and modelling social computing systems
-
set up social computing experiments and analyse the results with a scientific approach
-
tl;dr social computing
There is a strong applied component to the module (i.e. learning how shit is
actually used). Note that it’s not about social networks (that’s for COMP6250: Social Media and Network
Science).
Assessment:
-
40% coursework
-
60% final exam (hence I’m making these notes, bruh I’m not
getting paid for this)
Lecturers:
-
Dr. Sarah Hewitt
-
Dr. Stuart Middleton
-
Dr. Jie Zhang
Coursework
-
Done in pairs (email sem03@soton.ac.uk your team choices,
subject 'COMP3208 coursework teams')
-
Aim is to design, implement, and evaluate a recommender system (to predict ratings for a combination of user and
items for which no ratings are provided)
-
Medium size dataset will be provided consisting of user ratings for items
(~20 million entries)
Additional Links and Resources
Recommender Systems
-
relevant chapters: 1, 2, 4 and 7 (and possibly 3 for the
coursework)
Where to Find What
-
MS Teams: live lectures, interactive sessions,
Q&As
-
Module page: general module info, schedule, slides, other
materials
-
Blackboard: self-test exercises, final assessment
What is Social Computing?
It’s a widely used concept with no single definition, but
according to HP Research Labs:
"Social Computing Research focuses on methods for harvesting the collective intelligence of groups of people in order to realize greater value from the interaction between users and information”
(basically, harvest info from users to learn about them and
their interactions with #content)
ECS has done various research in social computing, e.g.
enhancing crowdsourcing techniques for sustainable disaster
relief; here’s a flex:
What is Crowdsourcing?
Dictionary definition of crowdsource: “obtain (information or input into a particular task or project)
by enlisting the services of a number of people, either paid or unpaid, typically via the Internet”.
(Basically, get people online to do shit for you)
Topics to be investigated include:
-
What type of problems are suitable for crowdsourcing?
-
How to divide a complex task into many simple tasks that
can be done with no expertise and no/little need for
additional technology?
-
How to design appropriate (e.g. monetary) incentives?
-
How to ensure the quality, e.g. using reputation and/or
voting systems?
Crowdsourcing Examples
CAPTCHA

-
We’ve all done it, we all hate it
-
"Completely Automated Public Turing test to tell
Computers and Humans Apart" - a type of challenge-response test used in computing to determine whether or not the
user’s human
-
Most common type of CAPTCHA (displayed as Version 1.0)
invented in 1997 (oldskool); it requires the user to correctly evaluate and enter a sequence of letters or
numbers perceptible in a distorted image displayed on the
screen
-
As it’s administered by a computer, in contrast to
the standard Turing test that is administered by a human, a
CAPTCHA is sometimes described as a reverse Turing test
-
However it’s received many criticisms, especially from people with
disabilities, but also from other people who feel that their everyday
work and shitposting is slowed down by distorted words that
are difficult to read (it takes the average person ~10
seconds to solve a typical CAPTCHA - you’re too
slow!)
NoiseTube
-
An example of citizen sensing
-
A research project started in 2008 at the Sony Computer
Science Laboratory in Paris in collaboration with Vrije
Universiteit Brussel
-
Goal of the project is to turn smartphones into mobile
noise level meters to enable citizens to measure their exposure to noise in their
everyday environment, and participate in the collective noise mapping of their city or neighborhood (so people can snitch
on your free parties)
-
Software platform that consists of a mobile application
(available for Android and iOS ) and a website (visit
www.noisetube.net and register an account) where users can
submit their noise measurements to visualise and share them
with others
Amazon Mechanical Turk
-
An example of human computation (so Jeff Bezos can make even more money)
-
A crowdsourcing website for businesses to hire remotely located
"crowdworkers" to perform discrete on-demand tasks that computers
are currently unable to do
-
can include anything from conducting simple data validation
and research, to more subjective tasks like survey
participation and content moderation
-
On the Amazon website, “there are still many things
that human beings can do much more effectively than
computers” (but robots will take over someday)
Some other examples
-
Citizen science: e.g. Zooniverse, EteRNA
-
Citizen journalism: e.g. Ushahidi
-
Others: e.g. Red balloon challenge
Recommender Systems

-
A subclass of information filtering system that seeks to predict the "rating" or
"preference" a user would give to an item
-
They are primarily used in commercial applications
-
Collaborative filtering: use collective tastes of many
users to produce personalised predictions about the interests of
individual users
-
e.g. "people who bought these items also
bought.." or "people who listened to these songs
also listened to..." or “people who read these
notes also lost the will to live... ” etc.
-
Sites like Amazon and Netflix use recommender systems for
their content, using item-to-item collaborative filtering,
which scales to massive data sets and produces
“high-quality” (I don’t think
they’re very good tbh) recommendations in real
time
Topics to be investigated include:
-
What are the technologies for collaborative
filtering?
-
How to implement a recommender system using medium size
data sets?
-
How to evaluate and compare the effectiveness of a
recommender system?
Online Advertising Auctions
-
Many websites rely on advertising for their income,
especially those offering free content (RIP GeoCities)
-
These ads are determined through very fast auctions,
where advertisers bid on users the moment they enter a
website. Instead of having a complex algorithm for matching
advertisers to users, an auction-approach is used
-
There are two main types of online advertising:
-
sponsored search/keyword advertising
-
search ads only appear in search engine results. Advertisers select a list of keywords that are relevant
to their company or products, and can bid to have their ad
show when someone searches for those words or phrases
-
in sponsored search, advertisers bid on keywords entered by the user
-
display ads are the banners or text you see on websites that sell space to advertisers. These ads used to be
the result of direct relationships between sites and
advertisers (and in some cases, they still are)
-
In display advertising, advertisers bid based on user profile (determined by tracking users through cookies)
-
Remember, you can never escape the corporate machine
Topics to be investigated include:
-
What type of companies play a role in online
advertising
-
What are the different auctions and how do they work?
-
What are the characteristics of the auctions used in online
advertising, and their advantages and disadvantages
-
How should advertisers bid in the auctions.
-
Who learns the user profiles, and how do they do it?
Summary
-
The module focuses on social computing techniques (duh), with 6 main topics (not including social networks, so get off TikTok and
revise)
-
The common theme is the use of collective intelligence (a.k.a the wisdom of the crowds) for research and business applications
-
Remaining topics to look at:
-
Reputation Systems
-
Experimental Design
-
Rank Aggregation
Past & Present
Terminology
Before diving into the lecture, some useful terms:
-
Collective Intelligence - “intelligence that emerges from the
collaboration, collective efforts, and competition of many
individuals”
-
Social Computing - “methods for building computational systems
that harness collective intelligence”
-
Human Computation - “outsourcing computational steps (microtasks)
that are difficult for machines to humans”
-
Crowdsourcing -“obtaining services, ideas or content from a large
group of people through an open call (typically to an online
community) rather than from traditional employees or
suppliers”
History of Social Computing
1714 - Longitude Prize


-
Back in 1714, the British Government offered, by Act of
Parliament, £20k (over £2m today) for a solution
which could find longitude to within half a degree
(equivalent to 2 minutes of time)
-
A group later known as the Board of Longitude was set up to
assess submissions and offer rewards
-
This was won by John Harrison, who invented the revolutionary marine chronometer
-
The clock in the pic above is displayed in the Science
Museum in London (after lockdown you can go check it out if
you want)
-
So why are we talking about clocks? Because this case is
seen by some as the first example of crowdsourcing
1791 - Gaspard de Prony’s Logarithmic Tables


-
In 1791, de Prony worked on producing logarithmic and trigonometric tables for the French Cadastre, as the French National
Assembly (after the French Revolution) wanted to bring uniformity to measurements and standards used throughout the nation.
-
Inspired by Adam Smith's Wealth of Nations, de Prony divided up labor into three levels, bragging that he "could manufacture logarithms as
easily as one manufactures pins." (basically he wanted
to run a logarithm sweatshop)
-
The three levels were:
-
First level: 5 or 6 high-ranking mathematicians with sophisticated
analytical skills (to choose analytical formulas, specify
the number of decimals and numerical range the tables were
to cover, etc.)
-
Second level: 7 or 8 lesser mathematicians (to calculate pivotal values
with the formulas, sets of starting differences, prepare
templates for the human computers, etc.)
-
Third level: 60-90 “human computers” (the first code monkeys, if you will), many of whom
were out-of-work hairdressers (blame the aristocrats) with
rudimentary arithmetic knowledge, to carry out the most
laborious and repetitive tasks
-
This is actually argued to be the origin of the word
“computer”, fun fact
Wisdom of the Crowds
As mentioned before, the wisdom of the crowd is the collective opinion of a group of individuals rather than that of a single expert.
1906 - Sir Francis Galton

-
This guy was a Victorian statistician, progressive, polymath, sociologist, psychologist,
anthropologist, eugenicist, tropical explorer, geographer,
inventor, meteorologist, proto-geneticist, and
psychometrician (tl;dr he did a lot of shit)
-
He pioneered or devised things like:
-
The first weather map
-
Scientific fingerprint classification
-
Studies of the efficacy of prayer
-
The phrase “nature versus nurture”
-
Core to statistical analysis is the idea that measurements vary (no shit); they have a central tendency (i.e. mean), and a
spread around this central value (i.e. variance). Galton
came up with a measure to quantify normal variation: standard deviation
-
So why is there a picture of an ox above? Well, in 1906,
Galton visited a livestock fair where an ox was on display, and
the villagers were invited to guess what the animal’s
weight would be after it was slaughtered and dressed (the real dressed weight was 1,198 lbs). Nearly 800 people
guessed, and Galton studied their individual entries after
the event.
-
Galton stated "the middlemost estimate expresses the
vox populi (i.e. voice of the people), every other estimate
being condemned as too low or too high by a majority of the
voters". He reported that the median was 1,207 pounds; to his surprise, this was within 0.8% of the real weight measured by the judges
-
Soon afterwards, in response to an enquiry, he reported the mean (of the guesses) was 1,197 pounds, but didn’t comment on its improved accuracy. Recent
archival research has found some slips in transmitting
Galton's calculations to the original article in the
journal “Nature”: the median was actually 1,208
pounds, and the dressed weight of the ox 1,197 pounds, so the mean estimate had zero error
-
Thus, had Galton known the true result, his conclusion on
the wisdom of the crowd would’ve been more strongly
expressed
-
The same year, Galton suggested in a letter to Nature a
better method of cutting a round cake by avoiding making
radial incisions… but that’s another bedtime
story
1986 - Space Shuttle Challenger Disaster
You may be aware of the tragic Challenger disaster in 1986,
where the space shuttle exploded 73 seconds after lift-off: https://www.youtube.com/watch?v=j4JOjcDFtBE
As you can imagine, the disaster had a big impact on the stock market at the time, and particularly the stock prices for NASA’s suppliers, which can be seen in the graph below:
NASA’s 4 main suppliers were:
-
Rockwell International: built the shuttle and its main engines
-
Lockheed: managed ground support
-
Martin Marietta: manufactured the ship's external fuel tank
-
Morton Thiokol: built the solid-fuel booster rocket
The point of the example: on the day of the disaster, Morton Thiokol’s stock was hit far harder than that of the other three suppliers, and months later the official investigation traced the cause to the O-ring seals in Thiokol’s solid rocket boosters - in other words, the “crowd” of traders had seemingly singled out the culpable company within hours.
Early Social Computing Systems
-
The 1980s saw the rise of the internet, as systems expanded worldwide and became more accessible to
people
-
1981: Access to the ARPANET was expanded when the National Science
Foundation (NSF) funded the Computer Science Network
(CSNET)
-
1982: Internet Protocol Suite (TCP/IP) was standardized, allowing worldwide proliferation of
interconnected networks

-
1986: TCP/IP network access expanded again when the National
Science Foundation Network (NSFNet) provided access to supercomputer sites in the US for
researchers (first at speeds of 56 kbit/s, later at 1.5
Mbit/s and 45 Mbit/s)
-
1988–89: NSFNet expanded into academic and research organizations
in Europe, Australia, New Zealand and Japan
-
Although other network protocols such as UUCP had global reach well before this time, this marked the beginning of the Internet as an
intercontinental network
-
1989: Commercial Internet service providers (ISPs) emerged in the US and Australia (then ARPANET was
decommissioned in 1990, RIP)

-
Then in the 90s, some other pretty revolutionary social computing systems came into existence:
-
In 1992, Xerox developed Tapestry, an experimental collaborative email filtering system, based on the belief that information filtering could be
more effective with humans involved in the filtering
process
-
It was designed to support both content-based filtering and collaborative filtering; people could record their reactions
(called “annotations”, accessible by other users’ filters) to
documents they read
-
It was intended to serve both as a mail filter and
information repository, comprising an indexer, document
store, annotation store, filterer, little box, remailer,
appraiser and reader/browser


-
Also in 1992, the GroupLens lab at the University of Minnesota was set up, one of
the first to study automated recommender systems; they built:
-
the GroupLens recommender, a USENET article recommendation engine (USENET was an early internet discussion network)

-
MovieLens (1997), a popular movie recommendation site used to
study recommendation engines, tagging systems, and user
interfaces
-
They also gained notability for work studying open content
communities such as Cyclopath, a geo-wiki used in the local area to help plan the
regional cycling system

Commercialisation
Over time, such systems became more commercial, and if you
haven’t been living under a rock you probably know:
-
eBay (1995), which uses a public reputation system to help users be able to trust sellers when buying
stuff
-
eBay uses a feedback and ratings policy, and on a product page, an information box gives relevant history about the user selling the
product

-
Amazon (1997), which uses a collaborative filtering service called BookMatcher, which makes book recommendations to patrons
-
Products are recommended to a user based on the ratings given by previous users, and Amazon’s patent includes a technique to automatically update the database of recommended
items based on the underlying database of books available.
It also uses statistical analysis to validate recommendations before
offering them

-
Google (1998) - PageRank is built on collective intelligence; the major difference from earlier search algorithms is that PageRank evaluates links. When a page is linked to by
another page with “authority”, this gives more relevance to the page itself; a page with
qualitatively valuable information typically needs months to
acquire some “authority”.
-
Since 2000, Google has also run AdWords, which uses auctions to rank ads. The position of an
ad in the results for a particular search, called its
“Ad Rank”, is calculated as the bid multiplied by a quality score.
-
Here’s a flex of Google’s advertising revenue, if you
care (the percentage being roughly the share of total revenue that came from ads):
-
$59.06 bn in 2014, ~90%
-
$67.39 bn in 2015, ~90%
-
$79.38 bn in 2016, ~89%
-
$95.38 bn in 2017, ~86%
-
$116.32 bn in 2018, ~85%
Web 2.0
Web 2.0 (also known as Participative/Participatory and Social
Web), coined in 1999, refers to sites that emphasize:
-
user-generated content
- ease of use
-
participatory culture
-
interoperability (i.e., compatible with other products, systems, and
devices)
Here are a bunch of sites (that you’ve probably heard of
unless you’ve just immigrated from Pluto) that boomed in
the 00s built on a crowd-generated content model:
-
2000: Threadless - an e-commerce site where designs are created by and
chosen by an online community of artists (in some way like
an OG RedBubble). Each week, about 1,000 designs are
submitted online and are put to a public vote.
-
2000: iStockPhoto - a royalty free, international micro stock
photography provider, which offers millions of photos,
illustrations, clip art, videos and audio tracks. Artists,
designers and photographers worldwide contribute their work
(like Hide The Pain Harold, does anyone remember that meme?)
to iStock collections in return for royalties
-
2001: Wikipedia - a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia
Foundation (and where I’m stealing most of the stuff
in this document from)


-
2004: Flickr - an image hosting and video hosting service, as well as an
online community. It is (or was, does anyone still use it?)
a popular way for amateur and professional photographers to
host high-resolution photos
-
2005: YouTube
-
Social Media (Myspace, Facebook, Twitter)
-
Do I even need to explain the last two points, you know
what this shit is




Rise of Crowdsourcing
As mentioned before (doesn’t hurt to recap), coined in
2006 by Jeff Howe in Wired Magazine, crowdsourcing is to
“obtain (information or input into a particular task or project)
by enlisting the services of a number of people, either paid or unpaid, typically via the Internet”.
(Basically, get people online to do shit for you)

Here are some notable examples of crowdsourcing
initiatives:
-
2004: ESP Game - this was a human-based computation game developed to address the problem of creating
difficult metadata. The idea was to use the computational power of humans to perform a task that computers can’t
(originally, image recognition) by packaging the task as a
game
-
It was originally conceived by Luis von Ahn of Carnegie
Mellon University, but Google bought a licence to create
their own version of the game in 2006, Google Image Labeler, to return better search results for its online
images
-
2005: Amazon Mechanical Turk - I talked about this shit here, as of 2011 it had half a million users
Crowdsourcing for Social Good
-
2007, 2010: Ushahidi - created in 2007, Ushahidi (meaning
“testimony” in Swahili), is both the name of a
Kenyan not-for-profit civic tech company and of a
crowdsourcing platform allowing users to submit violence reports and mapping of
events.
-
It was notably used during the 2007–08 Kenyan crisis, and the 2010 Haiti earthquake; here’s a video of a visualisation of the response
to the earthquake by the OpenStreetMap community: https://www.youtube.com/watch?v=BwMM_vsA3aY

-
Also, a 4636 emergency shortcode was set up to run on the Digicel
and the Comcel networks in Haiti, and works as follows:
-
People in Haiti text location, name and requests for aid /
reports etc. to 4636.
-
The data is streamed from the different mobile networks to a server
hosted by Ushahidi.
-
100s of Kreyol-speaking volunteers translate, categorize and plot the
geocoords of the location if possible.
-
The structured data is streamed to different
organizations on the ground like Red Cross and InSTEDD, who act on
it / pass it to appropriate people there.
-
It is also streamed to the main Ushahidi database (the
publicly viewable one) where it is combined with other data,
further annotated etc. and made available to other
organizations.
Crowdsourcing and Machine Learning
-
There are also examples of crowdsourcing in machine learning applications
-
You guys remember good ol’ Wizard of Oz prototyping from Interaction Design? Stuff like that can be done
in the domain of AI “essentially prototyping the AI with human beings”.
-
In some cases, humans are used to train the AI system and
improve its accuracy. A company called Scale offers a bank of human workers to
provide training data for self-driving cars and other
AI-powered systems.
-
“Scalers” will, for example, look at camera or
sensor feeds and label cars, pedestrians and cyclists in the
frame. With enough of this human calibration, the AI will
learn to recognise these objects itself (tl;dr using humans
to train AI)
-
In other cases, companies “fake it until they make
it”, telling investors and users they have developed a
scalable AI technology while secretly relying on human
intelligence (basically bullshitting the customer)

-
As a specific example, a company called Figure Eight, a human-in-the-loop (i.e. getting people to help with shit) machine
learning and artificial intelligence company, uses human intelligence to do simple tasks (such as transcribing text or annotating images) to train machine learning algorithms
-
Figure Eight's software automates tasks for machine
learning algorithms, which can be used to improve catalog search results, approve photos, support
customers etc.
-
The technology can be used in the development of
self-driving cars, intelligent personal assistants and other
technology that uses machine learning, and Figure Eight
works with a lot of big companies
Why Do We Still Use Social Computing?
There’s still vast commercial potential (i.e. companies
can make $$$) from social computing:
-
Google’s built on collective intelligence
-
Facebook uses a newsfeed
-
Sales are driven via engagement through relevant recommendations and adverts
-
Crowdsourcing offers access to a worldwide labour market

-
Crowdsourcing can also lead to greater user engagement, and possibilities such as:
-
Large-scale digitisation (reCaptcha)
-
Translation through language learning (Duolingo)
-
Word processor with a crowd inside (Soylent)


-
Social computing can also assist with contributions to science (no, really, it can), providing us with knowledge and
insights on:
-
Human behaviour
-
Economic incentives
-
Social science research
-
Citizen science
-
Unprecedented access to data (participatory sensing) and human computation (e.g. Galaxy Zoo)
-
Galaxy Zoo is a crowdsourced astronomy project which invites people to assist in the morphological
classification of large numbers of galaxies. It’s also
an example of citizen science as it enlists the help of
members of the public to help in scientific research.

Features of Computing Systems
One can also think of computing features as being either
relevant to “traditional” or “social”
computing. Some of the most important characteristics of
social computing can be summarized as user-created content
where:
-
users can control the data
-
unique sharing of content or media
-
ability to tap into the intelligence (unless you’re on something like TikTok) of
other users collectively
-
unique communication and collaborative environment
-
major emphasis on social networks
-
interactive, user-friendly interface
When analysing a social computing system, one can ask the
questions (common sense really):
-
What is its purpose (besides providing material for this module)?
-
What problem(s) does it solve?
-
What were possibly the key (social computing)
challenges in implementing the system?
-
How are users incentivised to participate?
-
How could the system be improved?
-
Is it successful?
Here’s a soulless, unimaginative corporate graphic for
your viewing pleasure:
Recommender Systems
Learning Outcomes
-
Identify and compare key recommender systems, understanding which ones are appropriate for what types
of applications
-
Apply recommender systems techniques to specific settings
-
Evaluate recommender systems
Resources
Some of the slides I’m ripping off are based on
tutorials by Dietmar Jannach and Gerhard Friedrich, and there are two textbooks for this part of the
module:
-
(the intro) Recommender systems: an introduction (Dietmar Jannach, Markus Zanker, Alexander Felfernig,
Gerhard Friedrich)
-
(the real deal) Recommender Systems: The Textbook (Charu C. Aggarwal, 2016 - ISBN 978-3319296579, published
by Springer)

-
Furthermore, some useful websites:
What are Recommender Systems?
Recommender systems are a type of information filtering system which filters items based on a personal profile. (tl;dr they recommend you stuff)
They’re different from many other information filtering
systems (e.g. standard search engines and reputation systems) in
that the recommendations are personal to the specific user or group
of users (with the theory that we aren’t as unique as we
think we are, you’re not special mate).
Recommender systems are given:
-
a user model which entails ratings, preferences, demographics, and situation
context (things like how old we are, where we live :o how
much we earn, what kind of job we do, what gender we are
etc.).
-
Recommender systems can infer quite a lot of info about the
kind of person we are from this combination of
characteristics, and thus the kind of things we’re
more likely to like
-
items, the things that we select - these items may or may not have any additional characteristics associated with them, but are important in
themselves
The recommender system can then calculate a relevance score used to rank the options, and can finally choose the set of options to recommend to us.
Examples of Recommender Systems
-
Amazon is of course one of the biggest online retail spaces;
because it’s so huge with so many products, good
recommender systems are invaluable
-
Netflix (though there are lots of rivals), with their own
[Original Content](do not steal) and a huge library of other
films, shows etc. Obviously it’d take forever to trawl
through it all, so recommender systems can help
(the 2000s called, they want their OS back)
-
News recommendations; after viewing a few news stories these sites often
recommend other articles you may be interested in
-
Google Play (recommending apps one might be interested in based on
what they’ve already downloaded)
-
Job sites (“jobs you may be interested in”)
-
Facebook (“groups you may like”)
-
Picasa (recommending similar photos)
-
Targeted advertising (e.g. annoying banners at the top of the page, using a
recommender in addition to simply cookies and search history
etc.)
-
A web app called Spydify was developed by students at the uni, which produces
visualizations such as graphs and infographics to show
users’ preferred artists/songs, shows other metrics
(such as how “danceable” the music they listen
to is), and helps recommend new music based on one’s
tastes; check it out!
Why use Recommender Systems?
Recommender systems provide benefits for both users and providers:
-
Reduce information overload (browsing things online can be more difficult and
overwhelming vs. IRL; “we can only see what fits on
our screen”)
-
Find things that are interesting
-
Narrow down the set of choices (we can see things we’ve bought before, and
promotional items)
-
Help one explore the space of options
-
Discover new things
-
Entertainment
-
tl;dr they help people find shit they like
-
If it makes things easier for the user, the user is more likely to return to the site
-
Additional and probably uniquely personalised service for the customer
-
Increase trust and customer loyalty
-
Increase sales, click-through rates (CTRs),
conversions (i.e. successful engagements with recommended items,
i.e. the user takes the bait) etc.
-
Opportunities for promotion and persuasion
-
Obtain more knowledge about customers
-
tl;dr they help companies make more $$$ from people
engaging with shit
Potential Issues
However, naturally some challenges exist around developing and using recommenders, such
as:
-
Lack of data (from which to learn and make recommendations) - this
is known as the “cold start problem”
-
Privacy - despite being anonymised, data is effectively collected
about users to, in most applications, make a profit, and
this data is often sold between companies (and the moment
one starts swapping databases it can become much easier to
identify individuals)
-
Lots of similar items - there can be an issue with lack of variety in
recommendations, and the scope thereof becoming too
narrow/niche
-
From a user perspective, shared accounts - if someone is sharing an account with someone else,
particularly if they have very different tastes, the
algorithm can get “confused” and make
some… interesting recommendations
-
One could argue being recommended things can limit people pursuing other options - if you simply follow recommendations, you may never see
some of the different options and discover new things you
might actually like
Paradigms of Recommender Systems
These are some different forms/models of recommender system,
including:
-
Content-based
-
Knowledge-based
-
Collaborative filtering
- Hybrid
Looking at the first example:
Content-based
-
Build a profile of a user (i.e. user model), which is automatically generated and updated (e.g. learn model based on previously bought
items)
-
Identify a set of features from items bought/looked at; for example:
-
Books/movies: title, genre, writers, actors, and many other
keywords
-
Music: using tags/attributes (e.g. in the Music Genome Project, used by Pandora)
-
Web pages/documents: metadata (basically tags), or actual
contents
-
Find item similarities based on features (i.e. content)
-
Finally, produce recommendation based on user profile and
similarity
-
e.g. items similar to items bought, or albums similar to
albums listened to
Knowledge-based
-
Makes use of (often) manually generated information about both users and items
-
User information generated, e.g. using preference elicitation (ask user sets of questions to determine their likes
or demographic information, and build a profile)
-
In practice, the system’s “knowledge” can
be updated and fed back into recommendations periodically,
e.g. run overnight
-
Item similarity/recommendation determined by “experts”
-
System mimics behaviour of an experienced sales assistant (bookshops being a good example; e.g. you might tell an
assistant what you last read, and they can use their
knowledge of books and what’s in the store to
recommend something new)
-
Useful when recommending to first-time users (e.g. for one-time purchases) and novel items
-
In short, knowledge-based systems take the user profile and contextual parameters, but also product features and knowledge models they already have to build a recommendation list
Collaborative Filtering
-
The most prominent approach to generate recommendations
-
used by large, commercial e-commerce sites
-
well-understood; been around for a long time, and various algorithms and
variations exist
-
applicable in many domains (books, films, DVDs, etc.), but doesn’t need
info about the items themselves
-
Uses the wisdom of the crowds to recommend items, where users rate catalog items
implicitly (e.g. watching a film multiple times,
demonstrates we probably liked it) or explicitly (e.g.
actually rating it highly)
-
The basic idea is that customers who had similar tastes in
the past, will have similar tastes in the future; in this
sense it exploits correlation between people with similar
interests
-
Ultimately, it’s all about the user profile, contextual parameters and community data
-
There are two main types of collaborative filtering:
-
User-based - considers similarities between users, and produces
recommendations based on similar interests
-
Item-based - considers similarity between items: “users who liked item x, also liked item
y”
Hybrid
-
As the name suggests, hybrid approaches use a combination of techniques; a recommender system can become more powerful when
different things are put together
Pros and Cons of Different Approaches
Collaborative
-
Pros: no knowledge-engineering effort; well understood; works well in some domains; can produce “surprising” recommendations
-
Cons: requires some form of rating feedback; “cold start” for new users and items; sparsity problems; no integration of other knowledge sources
Content-based
-
Pros: no community required; comparison between items possible
-
Cons: content descriptions necessary; “cold start” for new users; no “surprises”
Knowledge-based
-
Pros: deterministic recommendations (i.e. a given set of inputs will always return the same output); guaranteed quality; no “cold start”; can resemble a sales dialogue
-
Cons: knowledge engineering effort to bootstrap (the start-up cost is high in terms of the data needed to make recommendations), hence it can be very computationally expensive; basically static; doesn’t react to short-term trends
Similarity Coefficients
(FYI, a coefficient is a number in front of a variable or term, e.g. in 6a + 3b + 2x²y, the term coefficients are 6,
3 and 2 respectively)
So how do we mathematically evaluate the similarity between
documents?
Below are a couple of simple approaches, computing document similarity based on keyword overlap:
(x and y are two books, and |X| indicates the size of a set X -
remember cardinality in Foundations?)
Dice’s coefficient: sim(x, y) = 2·|X ∩ Y| / (|X| + |Y|), where X and Y are the keyword sets of x and y
Jaccard’s coefficient: sim(x, y) = |X ∩ Y| / |X ∪ Y|
The above two coefficients effectively tackle the problem in
the same way, just via slightly different calculations. One
tends to choose the coefficient that’s most suitable for the dataset at hand/problem being solved.
At the two extremes:
-
Identical sets of keywords = a similarity of 1
-
Completely different keywords = a similarity of 0
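To make the keyword-overlap idea concrete, here’s a minimal Python sketch of both coefficients (the book keyword sets are invented for illustration, not taken from the slides):

```python
def dice(x: set, y: set) -> float:
    """Dice's coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x: set, y: set) -> float:
    """Jaccard's coefficient: |X ∩ Y| / |X ∪ Y|."""
    return len(x & y) / len(x | y)


# Hypothetical keyword sets for two books
book_x = {"space", "opera", "war", "empire"}
book_y = {"space", "empire", "politics"}

print(dice(book_x, book_y))     # 2*2 / (4+3) ≈ 0.571
print(jaccard(book_x, book_y))  # 2 / 5 = 0.4
```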
“Cleaning” Features
(FYI for below, an array is a horizontal list of numbers,
whereas a matrix is a 2D grid)
There are some potential shortcomings to using keywords; e.g.
they:
-
(Generally speaking) Need to be transformed to numbers (we simply care about frequency of occurrence)
-
Contain “noise” words such as “and”, “the”,
“but” etc. (these don’t actually tell us
anything about the item)
-
Generate a sparse matrix (i.e. a matrix which contains very few non-zero elements,
as there are a lot of possible keywords; big matrices can
waste lots of memory space)
One mechanism to combat this is to clean the features, basically optimizing the set of keywords for the task at
hand, using techniques such as (if you did a machine learning
module, here’s some déjà vu):
-
Stop-word removal - filtering out words that aren’t semantically
meaningful, like “a”, “an”,
“the” etc.
-
Stemming - reducing inflection in words to their root forms,
which still capture the same meaning; e.g. “connected”, “connecting” and “connection” all stem to “connect”
-
Phrase extraction - extracting words that occur together and have a
specific meaning, e.g. “hot dog”, as a single
term (note, it doesn’t have to be two words, it can be
any set that occur together; one has to be mindful of
computational expense however)
-
Term-frequency inverse-document frequency (TF-IDF) - basically assigning less weight to common words
(e.g. stop words); mathematically:
-
Inverse-document frequency: IDF(t) = log(N / nₜ), where nₜ = the number of documents in which term t appears, and N = the total number of documents
-
Term-frequency inverse-document frequency: TF-IDF(t) = tfₜ × IDF(t), where tfₜ = a normalised and/or dampened frequency of term t in the document
-
In short, the reason we use this is because a word that
doesn't occur very often in a set of documents may
actually be an important word (i.e. the less frequently a word is used, the more important it
might be). It may be more indicative of what the document is about
than any of the other words.
-
It’s also weighted against the document length, so
e.g. if a word only occurs once in a short document, the
weighting’s going to be much lower vs. if it appears
just a few times in a long document
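As a rough illustration of how the cleaning steps above can fit together, here’s a small Python sketch (the stop-word list, tokenisation and TF-IDF weighting are simplified assumptions, not the exact formulation from the slides):

```python
import math
from collections import Counter

STOP_WORDS = {"a", "an", "the", "and", "but", "or"}  # tiny illustrative list


def tokenize(doc):
    # Lower-case, split on whitespace, drop stop words
    return [w for w in doc.lower().split() if w not in STOP_WORDS]


def tf_idf(docs):
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: how many documents contain each term
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    weights = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Normalised term frequency multiplied by inverse document frequency
        weights.append({t: (count / len(tokens)) * math.log(n_docs / df[t])
                        for t, count in tf.items()})
    return weights


docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs bark"]
for w in tf_idf(docs):
    print(w)
```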
Calculating Similarity
-
Given a feature vector (which can represent all the words
in a document, for example), there are many ways to
determine the similarity between two vectors (so e.g. two
documents).
-
The most commonly used is cosine similarity, simply the cosine of the angle between two vectors; the smaller the angle, the closer the two
vectors/documents/sets of words are to each other (one can do cosine similarity on words or entire
documents)
-
If two vectors are at 90 degrees apart, they’re as
dissimilar as they can possibly be; if they're less than
90 degrees, they become more similar.
-
Here’s the actual equation to calculate it:
cos(x̄, ȳ) = (x̄ · ȳ) / (‖x̄‖ ‖ȳ‖) = Σᵢ xᵢyᵢ / ( √(Σᵢ xᵢ²) · √(Σᵢ yᵢ²) )
-
As with the aforementioned similarity coefficients, at the
two extremes:
-
Identical vectors/documents
= a similarity of 1
-
Completely different vectors/documents = a similarity of 0
Exercise
(This exercise taken from the slides, originally from
StackOverflow apparently)
Suppose we want to compare the similarity between two
documents, in this case simple sentences (but we can apply the
same approach to books or websites):
-
Document x: “Julie loves me more than Linda loves
me”
-
Document y: “Jane likes me more than Julie loves
me”
Step 1: Get the frequency of each term (order does not matter) to
produce vectors x̄ and ȳ.
x̄ = (Jane: 0, Julie: 1, likes: 0, Linda: 1, loves: 2, me:
2, more: 1, than: 1)
ȳ = (Jane: 1, Julie: 1, likes: 1, Linda: 0, loves: 1, me:
2, more: 1, than: 1)
Though to simplify we can just write them like:
x̄ = (0,1,0,1,2,2,1,1)
ȳ = (1,1,1,0,1,2,1,1)
Step 2: Calculate the similarity, using the Dice, Jaccard and Cosine
approaches.
Dice’s coefficient = (2*5) / (6+7) = 10/13
Jaccard’s coefficient = 5/8
Cosine similarity = (0*1 + 1*1 + 0*1 + 1*0 + 2*1 + 2*2 + 1*1 + 1*1) /
(√(0²+1²+0²+1²+2²+2²+1²+1²)
*
√(1²+1²+1²+0²+1²+2²+1²+1²))
= 9 / (√12 * √10) = (3√30) / 20
Bear in mind that for Dice and Jaccard, we don’t care
about the frequency, only if the words appear or not (i.e. the
cardinality of the sets, i.e. number of unique elements).
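Here’s a quick Python check of the exercise numbers above (treating each sentence as a bag of words, exactly as in the worked answer):

```python
import math
from collections import Counter

x = "Julie loves me more than Linda loves me".lower().split()
y = "Jane likes me more than Julie loves me".lower().split()

set_x, set_y = set(x), set(y)
dice = 2 * len(set_x & set_y) / (len(set_x) + len(set_y))  # 10/13 ≈ 0.769
jaccard = len(set_x & set_y) / len(set_x | set_y)          # 5/8 = 0.625

vocab = sorted(set_x | set_y)
vec_x = [Counter(x)[t] for t in vocab]
vec_y = [Counter(y)[t] for t in vocab]
dot = sum(a * b for a, b in zip(vec_x, vec_y))
cosine = dot / (math.sqrt(sum(a * a for a in vec_x)) *
                math.sqrt(sum(b * b for b in vec_y)))      # 3√30/20 ≈ 0.822

print(dice, jaccard, cosine)
```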
Self-Test Question Answers
If it isn’t obvious, green = correct, red = incorrect (if you’re colourblind have no fear, I’ll
also add a tick ✓ and cross ✗).
Question 1
A definition of a recommender system is:
-
✓ A type of information filtering system which
filters items based on a personal profile
-
✓ A type of information filtering system which
filters items based on a personal profile. It is different
from many other information filtering systems (such as
standard search engines and reputation systems), in that the
recommendations are personal to the specific user or group
of users
-
✗ An information filtering system (presumably too
general a definition - it doesn’t mention filtering based on a personal profile)
-
✗ A special kind of search engine
Question 2
Examples of providers that use recommender systems are:
-
✗ Amazon and Netflix (again I don’t understand
why this isn’t technically correct but we move)
-
✓ Amazon, Netflix and Spotify
-
✓ Amazon, Netflix, Spotify and many large
supermarkets
-
✗ Amazon, Netflix, Spotify and Google search
Question 3
There are many advantages to using recommender systems for both
customers and providers. Some of these for the customer are
reducing information overload, and helping them discover new
things. For the provider, they increase trust and customer
loyalty which can in turn increase sales revenue.
User-Based Collaborative Filtering
We touched on collaborative filtering in the previous lecture;
focusing more on user-based collaborative filtering...
-
The basic assumption is that people who agreed in the past are likely to agree
again (i.e. we’re not quite as unique as we think we
are, and there are others out there with similar likes to
us)
-
We try to find users who are similar to each other based on their overlap in interests and
opinions (e.g. using jointly rated items)
-
If ratings for the same items are similar, it is assumed
that users have similar interests and so will have similar
interest for future items
-
Then the system can recommend items from similar users which are rated
positively (and which the user has not yet
rated/seen/bought)
In terms of requirements, systems like Amazon have millions of items and tens of
millions of users, so clearly algorithms need to have good
performance independent of these colossal numbers. The
algorithms should ideally have:
-
constant time performance
-
response time of <0.5 seconds
-
still able to give new users good recommendations
So what steps does one need to consider for the system?
-
Representation - what will be used as input? e.g. items we or other
users have purchased, items browsed, ratings users have left
behind, etc.
-
Similarity measure - how do we calculate the similarity between 2
users?
-
Neighbourhood selection - do we select all users (have to think about
efficiency/scalability; if you have millions of users
it’s infeasible to do similarity calculations between
them all!) or a subset of users? On what basis do we
choose?
-
Predict/estimate rating - need to calculate a predicted rating (i.e. what we
predict the user might rate the items) for a user’s
unrated items
-
Generation of recommendation - select the “top M” recommendations ranked by
predicted rating
Naturally however, some issues exist with user-based
collaborative filtering such as:
-
User cold-start problem - not enough info known about a new user to decide
who is similar, and possibly a lack of other users
-
Sparsity - recommending from a large items set where users may
have only rated a few, it can be hard to find users
who’ve rated the same items
-
Scalability - with many users calculations can become
computationally expensive (O(N²) complexity)
-
To combat this, representative groups of similar users can
be created to infer similar items (ringing a bell from
machine learning if you did that?)
-
ML algorithms like clustering can be used to learn and
create these groups, but these are time-consuming in their
own right, so in practice tend to be run periodically, e.g.
once at night
Similarity Measure (for Ratings)
-
This only considers users who’ve rated some of the
same items
-
The Pearson coefficient has been shown to be the most effective for user-based collaborative filtering
-
Used in statistics to calculate the correlation between 2 random variables; for users a and b, over the items i they have both rated, it is:
sim(a, b) = Σ_i (r_a,i − r̄_a)(r_b,i − r̄_b) / ( √Σ_i (r_a,i − r̄_a)² · √Σ_i (r_b,i − r̄_b)² ), where r̄_a is user a’s average rating
-
This formula returns values between:
-
+1 (perfectly correlated)
-
-1 (perfectly negatively correlated)
-
It looks at the similarity in variations over items, not their absolute values
-
This is because it corrects for the average value, so e.g. users who consistently rate high can still be
similar to users who consistently rate low, as long as the correlation is the same!
-
This way, inherent/unconscious bias one user may have
shouldn’t unfairly skew the results
-
Note that, while used frequently in recommender systems,
you wouldn’t want to use correlation in all
situations (or may at least want to use it with caution), as correlation does not equal causation
-
So e.g. if you were looking for symptoms for a particular
illness, looking for correlation between variables,
you’d have to be very careful about the results and
conclusion drawn
Pearson Coefficient Example
Say we have 6 users as shown in the table below, but users 1
and 2 have not rated some items:
For users 1 and 2, we can calculate the mean average user rating from the items they’ve both rated so far:
Then we can deduct user 1’s average from each of user 1’s
ratings, and deduct user 2’s average from each of user
2’s ratings, so we end up with:
Finally we can plug the values into the Pearson coefficient formula (given above) like so, to calculate the similarity
between user 1 and 2’s ratings (it looks more complicated
than it is!):
As the resulting value of -0.89 is close to -1, we can see that
their ratings are strongly negatively/inversely correlated.
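A small Python sketch of this calculation over co-rated items (the ratings below are made up for illustration, since the lecture’s table isn’t reproduced in these notes):

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "user1": {"item1": 5, "item2": 3, "item4": 4},
    "user2": {"item1": 1, "item2": 5, "item4": 2},
}


def pearson(u, v):
    # Only items rated by both users count towards the similarity
    common = ratings[u].keys() & ratings[v].keys()
    if not common:
        return 0.0
    mean_u = sum(ratings[u][i] for i in common) / len(common)
    mean_v = sum(ratings[v][i] for i in common) / len(common)
    num = sum((ratings[u][i] - mean_u) * (ratings[v][i] - mean_v) for i in common)
    den = (sqrt(sum((ratings[u][i] - mean_u) ** 2 for i in common)) *
           sqrt(sum((ratings[v][i] - mean_v) ** 2 for i in common)))
    return num / den if den else 0.0


print(pearson("user1", "user2"))  # close to -1 -> strongly inversely correlated tastes
```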
Neighbourhood Selection
After calculating the similarity between all users, how do we
choose which to base our recommendation on? We could
use different constraints/selection criteria for neighbours, for
example:
-
Have to share at least one rating (though maybe more)?
-
Only those with a positive correlation?
-
Either threshold (by e.g. a minimum positive rating) or fix
the maximum number of neighbours, and choose n most correlated neighbours (ideally we can use a combination of thresholding and
max)
Prediction
We can then use the given user ratings and similarity
calculations to predict unrated items.
-
This function can be used to predict the rating of user u for item i, given neighbourhood N (where N = the set of users in the selected
neighbourhood); one common form is:
pred(u, i) = r̄_u + Σ_{n ∈ N} sim(u, n)·(r_n,i − r̄_n) / Σ_{n ∈ N} |sim(u, n)|
-
This formula adds/subtracts the neighbours’ bias from
the given user’s average
For example, say user 2 and user 3 have rated item 3, but user
1 hasn’t:
We can make a prediction of what user 1 will rate item 3, based on user 2 and 3’s ratings of item 3 and those
users’ similarity to user 1.
So, given the following user rating average and user similarity
results:
...we can simply plug the values into the prediction formula to calculate user 1’s predicted item 3 rating like
so:
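And here’s a sketch of the prediction step in Python (again with invented ratings and similarity values; the form assumed here divides by the sum of absolute similarities, which is one common convention):

```python
# Hypothetical ratings: user -> {item: rating}
ratings = {
    "user1": {"item1": 5, "item2": 3},
    "user2": {"item1": 4, "item2": 2, "item3": 5},
    "user3": {"item1": 2, "item2": 5, "item3": 1},
}


def predict(u, item, neighbours):
    """Predict user u's rating for item, given a dict of neighbour -> similarity."""
    mean_u = sum(ratings[u].values()) / len(ratings[u])
    num, den = 0.0, 0.0
    for n, sim in neighbours.items():
        if item in ratings[n]:
            mean_n = sum(ratings[n].values()) / len(ratings[n])
            num += sim * (ratings[n][item] - mean_n)  # neighbour's bias-corrected rating
            den += abs(sim)
    return mean_u + (num / den if den else 0.0)


# Similarities of user1 to its neighbours (made up, e.g. from a Pearson calculation)
print(predict("user1", "item3", {"user2": 0.9, "user3": -0.8}))
```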
Improving Prediction
Some neighbour ratings may be more “valuable” than
others; agreement on controversial items can be more informative than on common items.
Some things that can be done to possibly improve the prediction
function include:
-
Give more weight to higher variance items
-
Significance weighting - when the number of co-rated items between users is
low, linearly reduce the weight (and vice versa)
-
Formally, from the textbook: “When the two users have
only a small number of ratings in common, the similarity
function should be reduced with a discount factor to
de-emphasize the importance of that user pair. This method
is referred to as significance weighting.” (see the sketch after this list for one common formulation of this and case amplification)
-
Case amplification - give more weight to “very similar”
neighbours (i.e. that have a similarity value close to
1)
-
Neighbourhood selection - as touched upon above, use a similarity threshold, or a
fixed number of neighbours
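A tiny Python sketch of the two weighting tweaks mentioned above, using common textbook formulations (the threshold of 50 co-rated items and the amplification exponent 2.5 are just typical illustrative choices, not values from the lecture):

```python
def significance_weighting(sim, n_common, threshold=50):
    # Linearly discount the similarity when the users share few co-rated items
    return sim * min(n_common, threshold) / threshold


def case_amplification(sim, rho=2.5):
    # Push similarities towards 0 unless they are already close to +/-1
    return sim * abs(sim) ** (rho - 1)


print(significance_weighting(0.9, n_common=5))            # 0.09: heavy discount
print(case_amplification(0.9), case_amplification(0.4))   # ~0.77 vs ~0.10
```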
Produce Recommendation
This is the final step (kinda obvious) for the user-based
collaborative filtering recommender to take; it can:
-
Recommend M items with the highest predicted rating
-
M can be fixed or flexible (based on the confidence, or a
given threshold)
-
Rank the recommendations in order of highest predicted ratings
-
In practice, other factors are also taken into account
while generating the recommendations, such as:
-
Item popularity
-
Item novelty
-
Variation of recommended items
-
Other (e.g. business) reasons for promoting certain
items
Item-Based Collaborative Filtering
We can also do collaborative filtering based on items!
-
The assumption is that a given user is more likely to have the same opinion for similar
items
-
Therefore, one can use ratings to e.g. derive the similarity between items (rather than users)
-
Compared to user-based collaborative filtering, some
advantages include:
-
No user “cold start” problem
-
Greater scalability with more users
-
Can be pre-computed “offline” (as one only
needs to select the top items from a list)
-
Results can be equally as good
As with user-based, one needs to consider the main steps for
the system, which are similar but of course in terms of items
rather than users:
-
Representation - what will be used as input? e.g. items we or other
users have purchased, items browsed, ratings users have left
behind, etc.
-
Similarity measure - how do we calculate the similarity between 2 items?
-
Neighbourhood selection - do we select all items (have to think about efficiency/scalability; if you
have millions of items it’s infeasible to do
similarity calculations between them all!) or a subset of items? On what basis do we choose?
-
Predict/estimate rating - need to calculate a predicted rating (i.e. what we
predict the user might rate the items) for a user’s
unrated items
-
Generation of recommendation - select the “top M” recommendations ranked by
predicted rating
Naturally however, some issues exist with item-based
collaborative filtering; pre-processing techniques can be used
to mitigate these.
-
We might want to compute a similarity matrix offline with the items (at most (n*(n-1))/2 entries with n
items), which can be reduced further
-
An item similarity matrix changes less often than a user similarity matrix
-
With optimization, one can compute recommendations very fast
Similarity Measure (for Ratings)
Looking again at the table, say this time we want to find the
similarity between two items (rather than users):
-
One can technically just use Pearson’s coefficient
again, swapping users and items, taking an item average
instead of a user average
-
However, this isn’t adjusted/weighted for
users’ average, which could result in unfair
bias
-
Thus a better similarity function to use for instance is adjusted cosine similarity; this calculates the angle (so difference) between two
item vectors, but adjusted for user average:
sim(i, j) = Σ_{u ∈ U} (r_u,i − r̄_u)(r_u,j − r̄_u) / ( √Σ_{u ∈ U} (r_u,i − r̄_u)² · √Σ_{u ∈ U} (r_u,j − r̄_u)² ), where U is the set of users who rated both items i and j, and r̄_u is user u’s average rating
-
As before, this formula returns values between:
-
+1 (perfectly correlated)
-
-1 (perfectly negatively correlated)
Adjusted Cosine Similarity Example
So for example, say we have user ratings for items 3 and 4 like so (as you can see user 1 hasn’t rated both
items, so we just ignore this row):
...and we calculate each user’s mean average rating (for all the items each one’s rated) like so:
...we can simply plug these values into the similarity function (summing the products of the respective weighted item
ratings, and dividing this result by the product of the two
items’ weighted square sums - sounds a lot more
complicated than it is, just look at the formula) and simplify
the expression for the final result:
As the resulting value of -0.816 is close to -1 (similar to the
previous example), we can see that their ratings are strongly negatively/inversely correlated.
Prediction
Similarly to the user-based collaborative filtering system, we
can predict unrated items.
-
This function can be used to predict the rating of user u for item i, given neighbourhood N (where N = the set of items in the selected neighbourhood); one common form is:
pred(u, i) = Σ_{j ∈ N} sim(i, j)·r_u,j / Σ_{j ∈ N} sim(i, j)
-
Note how this formula is slightly simpler than the previous
one, as we need not subtract user biases, due to only
comparing similarity between the items themselves (there’s a sketch of this below)
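Here’s a combined Python sketch of adjusted cosine similarity and the item-based prediction above, keeping only positively correlated items in the neighbourhood (the ratings are invented; the lecture’s table isn’t reproduced here):

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}; user1 hasn't rated item5 yet
ratings = {
    "user1": {"item3": 2, "item4": 3},
    "user2": {"item3": 1, "item4": 2, "item5": 1},
    "user3": {"item3": 5, "item4": 4, "item5": 5},
    "user4": {"item3": 4, "item4": 5, "item5": 4},
}


def user_mean(u):
    return sum(ratings[u].values()) / len(ratings[u])


def adjusted_cosine(i, j):
    # Only users who rated both items; each rating is centred on that user's mean
    users = [u for u in ratings if i in ratings[u] and j in ratings[u]]
    num = sum((ratings[u][i] - user_mean(u)) * (ratings[u][j] - user_mean(u)) for u in users)
    den = (sqrt(sum((ratings[u][i] - user_mean(u)) ** 2 for u in users)) *
           sqrt(sum((ratings[u][j] - user_mean(u)) ** 2 for u in users)))
    return num / den if den else 0.0


def predict(u, i):
    # Weighted average of u's own ratings for similar items (positive similarities only)
    sims = {j: adjusted_cosine(i, j) for j in ratings[u] if j != i}
    sims = {j: s for j, s in sims.items() if s > 0}
    den = sum(sims.values())
    return sum(s * ratings[u][j] for j, s in sims.items()) / den if den else 0.0


print(adjusted_cosine("item3", "item5"))  # strongly positive for this made-up data
print(predict("user1", "item5"))          # based on user1's own rating of item3
```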
Exercise 2
(From Blackboard)
-
Compute the similarity matrices for user- and item-based
recommender systems for the example provided.
-
Compute the predictions using the two different approaches,
with a neighbourhood containing only positive similarity
values (i.e. ignoring any negative values)
Alternative Approaches
Of course, there are other ways of predicting ratings beyond
similarity matrices and whatnot…
Simple Approach: Slope One
This approach focuses on computing the average differences between pairs of items; say for instance we have the following table of user item
ratings:
...the (mean) average difference between items 1 and 2 is simply (ignoring user 1 and
2’s ratings, who haven’t rated item 2):
We can also calculate a predicted rating for item 2, taking the mean of the average differences
between item 2 and every other item that all of the users have rated (so 1, 4 and 5):
Optionally we can adapt the prediction formula to weight each difference based on the number of jointly rated items, like so:
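Since the table and the formulas above were on the slides, here’s a small Python sketch of the (weighted) Slope One idea with made-up ratings, weighting each pairwise average difference by the number of users who rated both items:

```python
from collections import defaultdict

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"item1": 5, "item2": 3, "item3": 2},
    "bob":   {"item1": 3, "item2": 4},
    "carol": {"item1": 2, "item3": 5},
}


def average_differences(target):
    """For each other item j: mean of (rating(target) - rating(j)) over users who
    rated both, plus the number of such users."""
    totals = defaultdict(lambda: [0.0, 0])
    for user_ratings in ratings.values():
        if target not in user_ratings:
            continue
        for j, r in user_ratings.items():
            if j != target:
                totals[j][0] += user_ratings[target] - r
                totals[j][1] += 1
    return {j: (total / count, count) for j, (total, count) in totals.items()}


def slope_one_predict(user, target):
    # Weighted Slope One: weight each item's average difference by its support count
    diffs = average_differences(target)
    num = sum((ratings[user][j] + avg) * count
              for j, (avg, count) in diffs.items() if j in ratings[user])
    den = sum(count for j, (avg, count) in diffs.items() if j in ratings[user])
    return num / den if den else 0.0


print(slope_one_predict("bob", "item3"))  # 3.0 for this made-up data
```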
Types of Ratings
Pure collaborative filtering-based systems, as seen above, only
rely on the rating matrix; the question is, what kind of
ratings can one use? For example:
-
The most common that we’re familiar with is a Likert
scale, typically 1-5 or 1-7
-
However one has to consider the optimal granularity of the scale; for example, research has shown that a 1-10 scale, which offers more granularity, is
better accepted for rating films
-
One can also have multidimensional ratings (e.g. multiple ratings per movie, as a
user’s opinion may change with multiple viewings; this
is effectively a human aspect of rating)
-
There are challenges, such as users not always willing to rate many items, resulting in sparse matrices, so one has to consider how to incentivise users to rate
more items (e.g. paying users, having a point system, or
“nudge” users, e.g. a notification saying
“5 of your friends have already rated x thing, would
you like to do that?”)
-
Implicit ratings can also be used, e.g. clicks, page views, time spent on a page, demo downloads etc.
-
So the user isn’t explicitly rating something, but
their activity shows they're likely interested in it
-
These can of course be used in addition to explicit
ratings
-
The question is how do we interpret it “correctly”? (e.g. if a user spends longer browsing a page, does that
actually indicate they’re more interested in an
item?)
Example: Items Bought Together
Say for instance we have the following matrix of items that
people have bought together:
...It’s apparent for example that item 2 is quite
frequently bought with item 3 (so we can infer that if a user
buys one of the items, they’re more likely to buy the
other too).
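A minimal sketch of how such a “bought together” matrix could be built from transaction baskets (hypothetical data, not the matrix from the slides):

```python
# Minimal sketch of co-purchase counts from transaction baskets.
from itertools import combinations
from collections import Counter

baskets = [
    {"item2", "item3"},
    {"item2", "item3", "item5"},
    {"item1", "item4"},
]

co_bought = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_bought[(a, b)] += 1

# Items most frequently bought together; a simple recommendation would be
# "users who bought item2 also bought item3".
print(co_bought.most_common(3))
```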
Evaluating Recommender Systems
What makes for a “good” recommendation?
Of course it can vary between companies, depending on e.g. what they’re trying to persuade people to buy, what behaviour they’re trying to encourage, etc. These things will have an impact on good ways of evaluating the recommender system, going back to the “human factor” (remembering that people are mostly predictable).
So what kinds of measures are used in practice?
-
Total sales numbers
-
Promotion of certain items
-
Click-through rates
-
Interactivity on the platform
-
Customer return rates (if an item gets returned a lot, that can be a
negative; it could have been recommended in the wrong
place/to the wrong people)
-
Customer satisfaction and loyalty
How can we measure the performance (again considering the
“human factor”)?
-
Online tests with real users
-
For instance, A/B testing (also known as split testing or bucket testing), a
binary method comparing two versions to assess which one
performs better.
-
Example measures (indicating good performance)
include:
-
sales increase
-
higher click-through rates
-
For instance, carrying out controlled experiments
-
An example measure is satisfaction with the system, which can be qualitatively assessed, e.g. via questionnaires
-
There are issues associated with these however; e.g. people
tend to be much more positive in a satisfaction survey when
the person/entity that wants to know is present in the same
space (so people can be less honest/more afraid of saying
harsh things)
-
Offline experiments - based on historical data we have (rather than new incoming data)
-
Example measures include:
-
Prediction accuracy
-
Coverage
For offline experiments, a common protocol is:
-
We can hide some items where we know there’s a ground
truth
-
e.g. if we already know that people who buy item 2 usually
buy item 3
-
We can also hide some information from domain experts
-
If somebody already knows a lot about the thing we’re
asking them to evaluate, they’re perhaps not the right
person to ask due to already having too much contextual
information
-
We can also rank items or predict hidden ratings, using a metric to evaluate this
-
After conducting some experiment, we effectively have a
test set which we can use to test the accuracy of prediction
classification
Correct and Incorrect Classifications
How can we label correct and incorrect classifications that the
system makes? The following four test responses are often
used:
-
True positives (TP) - the recommended item is relevant to the user
-
True negatives (TN) - an item which is not relevant to the user has not been recommended
-
False positives (FP) - an item which is not relevant to the user has been recommended
-
False negatives (FN) - an item is relevant, but this has not been
recommended
-
True positives and false negatives both concern items that are actually relevant, i.e. they fall in the same (“relevant”) class; the difference is whether the system picked them up
-
sometimes this distinction doesn’t matter much, but there are certain circumstances where it does (e.g. for covid testing! it’s important that people who have covid receive a true positive result, and the worst outcome is a false negative, i.e. someone’s told they don’t have it when in reality they do)
-
A false positive is easier to deal with than a false
negative, but ideally we still want to minimize these.
Metrics
Various metrics can be used to attempt to measure the performance of a classification algorithm, such as:
Precision
-
This is a measure of exactness which determines the fraction of relevant items
retrieved out of all items retrieved (e.g. the % of movie recommendations that are
actually good for the user).
-
precision = tp / (tp + fp), where tp = true positives, fp = false positives (the fewer false positives we have, the better the precision)
Recall
-
This is a measure of completeness, which determines the fraction of relevant items retrieved out of all relevant items (e.g. the % of all good movies for the user, that
were actually recommended).
-
recall = tp / (tp + fn), where tp = true positives, fn = false negatives (the fewer false negatives we have, the better the recall)
Plotting precision against recall (and e.g. calculating the area under the curve) can give us a visual representation of the trade-off between the two.
Accuracy
This simple metric considers the proportion of all classifications that are correct (i.e. the true positives and the true negatives):
accuracy = (tp + tn) / (tp + tn + fp + fn)
This is arguably the metric that’s of most interest for recommender systems, as it quickly lets us see how effective the system is at making predictions.
Rank Score
In a recommender system, the rank score/position of an item really matters; we want to make sure that ground truths are recommended more often (and ranked higher) than things where the certainty is lower.
For example, looking at the following table excerpt:
...if someone has bought item 2, we want to know that:
-
they’ve been recommended it
-
it was pretty high up on the list of recommendations
-
Rank score extends precision and recall, to take the positions of correct items in a ranked list into
account.
-
Lower-ranked items can be overlooked by users, so this
metric is especially important in recommender systems.
-
We can try to optimize models for such measures (e.g. AUC, “area under the curve”)
Error Measures
There are different ways of measuring the error rate of a
recommender; the two most commonly used are probably:
-
Mean absolute error (MAE)
-
this computes the average absolute deviation between predicted and actual ratings
-
Root mean square error (RMSE)
-
this is similar to MAE, but places more emphasis on larger deviations (because the errors are squared before averaging)
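A minimal sketch of both error measures over some hypothetical predicted vs. actual ratings:

```python
# Minimal sketch of MAE and RMSE (hypothetical data).
from math import sqrt

predicted = [3.5, 4.0, 2.0, 5.0]
actual    = [4.0, 4.0, 1.0, 3.0]

errors = [p - a for p, a in zip(predicted, actual)]
mae  = sum(abs(e) for e in errors) / len(errors)
rmse = sqrt(sum(e * e for e in errors) / len(errors))

print(mae, rmse)   # RMSE >= MAE, as squaring penalises the large (2-point) error more
```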
Online Auctions
Where Are Auctions Used and Why?
One of the first things that may come to mind when people talk about auctions is eBay auctions. There are two main ways that you can buy things: buy it now, and auctions! (You’re probably familiar with the system, but notice the current price, number of bids, time remaining etc.)
eBay charges a percentage commission fee on items sold on its platform.
Going into some stuff mentioned in the first lecture...
Sponsored Search
Sponsored search is the basic model that most search engines use to generate revenue (at least 80% of revenue by big companies like Google is
generated by selling advertisements on the internet).
e.g. when searching for a product on Google, the ordinary search results with green links are ranked by the page-ranking algorithm, while others are advertisement results. Note how some of them are marked as “Ad”; these advertisers pay Google to put their results at the top for those keywords.
Yahoo for instance used to call sponsored search auctions “computational advertising”; it’s the same idea.
Display Advertising
Another type of auction is display advertising, a type of graphic advertising on websites (shown via web
browsers), usually in the form of some media e.g. an image
banner or a video ad.
These days as systems are more sophisticated, they often use
cookies from your browsing and purchase history to better target
ads to you. (So e.g. say you buy some plane tickets and then scroll through Facebook, you might see more airline/holiday advertisements pop up.)
Why Use Auctions?
So why use auctions?
-
It’s a type of dynamic pricing based on competition
-
Essentially a price discovery mechanism (e.g. in an eBay auction, the final selling price for an item is “discovered” through people bidding)
-
Resources are distributed and allocated to the most
relevant person (e.g. most relevant ad is shown)
-
Computation is done by the “crowd”, which in this case consists of the bidders - going back
to the “wisdom of the crowds”
Auction Design
There are of course different types of auctions, each with
advantages and disadvantages. There are equivalences between
them, and they're more/less suited to different scenarios.
The auction design affects how people behave during an
auction - so one must consider:
-
What are the desired objectives and properties in a use case?
-
How can auctions be designed so that users are incentivized to behave “correctly”?
The most typically desired auction property is for the bidders to bid truthfully; i.e. no matter how valuable the item, we ideally want bidders to bid what it is actually worth to them. This supports the price discovery process (of course this isn’t always easily achievable in practice).
Auction Types
There are four classical auction types…
English Auction
This type has a very long history and is the most commonly known type of auction, entailing (for a single item):
-
An opening bid (usually a fairly low price is set so that bidders are willing to increase it and get the ball rolling)
-
Ascending bid increments
Low-res pic of Sotheby’s (fancy place):
Pretty simple process:
-
Each bidder is free to raise the current highest bid
-
When no bidder is willing to raise anymore, the auction
ends
-
The highest bidder wins the item at the price of their
bid
English auctions are most commonly used (as you’d expect)
for selling goods, e.g. antiques and artworks.
-
Bidders in an auction can be assumed to have a
“private value”, i.e. what they’re ultimately willing to pay
for the item, independent of what the other bidders may
believe the item is worth.
-
However, their bid may be influenced by what they believe
others will bid
-
The bidder may speculate about the bids of others, and shade their own bid
-
One may make as few bids as possible to avoid raising the
price more than necessary
-
A bidder’s first bids are often lower than their
“private value”
-
The concept of individual rationality indicates that bidders will typically not bid more
than they are willing to pay
-
A bidder may also refrain from bidding until the bidding
slows down, in the hopes that it’ll stop below their
maximum bid (and then they can place one)
Dutch Auction
This is an example of an open-bid descending auction
(effectively the opposite way to the English auction).
-
The auction starts at a high price (artificially high so that no one is realistically willing
to pay it), announced to the public
-
The auctioneer then lowers the price until a bidder makes a bid equal to the current offer
price
-
The good is then allocated to the bidder that made the offer, and the agent pays the price they bid
This type of auction is famously used for selling flowers in
the Netherlands, fun fact (flowers have a limited life cycle so
need to be sold off quickly before they die).
-
If a bidder bids as soon as the price lowers to their private value, they’ll win the item but may be paying significantly more than they needed to.
-
Hence a strategy might be to hold out, and try to bid just
before the price lowers to that of the second-highest
private value (attempting to win the item at the lowest
possible price).
-
Obviously though, the longer one waits, the greater the
risk they lose out on the item, so there's a trade off
Sealed-Bid First Price Auction
This auction is similar to the English auction in that the highest bidded price wins the auction (and the winner then pays how much they bid).
However the difference is that, whereas the English auction is open bid, this type is sealed bid (so bidders don’t see other bidders’
prices).
In summary:
-
First-price sealed-bid auction
-
Single round
-
Bidders submit offers in a sealed envelope (doesn’t have to be a literal envelope, but the
same idea); all bidders submit simultaneously
-
Good is allocated to the bidder with the highest
price
-
Winner pays the price they bid
This type of auction is commonly used for:
-
procurement by companies and organizations
-
government project contracts
-
How would you bid in this type of auction?
In this type of auction there’s a trade-off between bidding a lower price than your private value and risking losing, or bidding a higher price and risking paying more than might’ve been necessary.
Sealed-Bid Second Price Auction (Vickrey Auction)
Last but not least, looking at the second price auction…
The idea is simple: goods are awarded to the bidder that made
the highest bid, but they pay the price of the second highest bid.
-
An example with 3 bidders:
-
Bidder 1 bids £100
-
Bidder 2 bids £200
-
Bidder 3 bids £300
-
Bidder 3 wins the auction and pays £200
Despite only being a small change to the sealed-bid first price
auction model, this auction type arguably has preferable
properties, and bidders are better incentivised to bid
truthfully.
Of course, one could take a gamble and bid higher than their
private value, increasing the chances of winning but also
increasing the chance of them having to pay higher than their
private value (which arguably goes beyond individual rationality), so again it’s a trade-off.
Dominant Strategy
-
A strategy is called a dominant strategy if it is better than the other strategies for one player, no matter how that player's opponents may play.
-
Thus, in the context of an auction, a bidder’s
strategy is dominant if - regardless of the prices the other
bidders bid - their bidding price will always generate the best possible outcome for them.
-
Going back to the example with 3 bidders, bid increment
(the minimum amount bids need to be increased by, needed for
open-bid auctions) is 1:
-
Bidder 1 has private value £100
-
Bidder 2 has private value £200
-
Bidder 3 has private value £300
-
For the English and Dutch auctions (and the sealed-bid
first price auction, strategically equivalent to Dutch),
strictly speaking, there is no dominant strategy for
bidders, as the best/safest strategy to use is always dependent on what the other bidders bid
-
However, for the sealed-bid second price (i.e. Vickrey)
auction, there is a dominant strategy, to simply always bid your private value.
-
This is because, if you win, you’ll only have to pay
the second highest value, so regardless of what the other
bid values actually are, you couldn’t possibly have a
better outcome
-
There’s also no benefit in overbidding your private
value:
-
If you have the highest private value and win the auction,
you’ll pay the same amount even if you overbid
-
If you have the second highest private value and
overbidding wins you the auction… you have to pay
more than your private value, not desirable
-
Similarly, bidding lower than your private value
doesn’t benefit you (only decreases your chances of
winning, or makes no difference)
-
Due to its dominant strategy and lack of need to consider
other bidders’ bids, sealed-bid second price (i.e. Vickrey) auction can be
preferable for software applications and analysis (though the other auction types can have other
advantages).
-
Also, more truthful bidding makes life easier for:
-
the bidders - don’t need to worry so much about what the other bidders might do
-
the auctioneer - the outcome of how much the product will
be sold for is more predictable
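A minimal sketch illustrating the claim above: for a bidder with a private value of 200 in a second-price auction, truthful bidding is never worse than over- or under-bidding, whatever the highest competing bid turns out to be (hypothetical numbers):

```python
# Minimal sketch: truthful bidding weakly dominates in a second-price auction.
def utility(own_bid, value, highest_other_bid):
    # win if you have the strictly highest bid; pay the second-highest bid
    return value - highest_other_bid if own_bid > highest_other_bid else 0

value = 200
for other in [100, 150, 199, 201, 250, 300]:     # possible highest competing bids
    truthful = utility(200, value, other)
    overbid  = utility(250, value, other)        # can win at a loss (e.g. other = 201)
    underbid = utility(150, value, other)        # can miss a profitable win (e.g. other = 199)
    print(other, truthful, overbid, underbid)    # truthful is never the worst of the three
```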
Formal Model and Utility
In analysis, one can use utility theory to better capture
individual preferences, attaching numerical values to
outcomes.
-
Assuming private valuations, the utility for bidder i when winning the auction would be uᵢ = vᵢ - pᵢ (where vᵢ is their willingness to pay and pᵢ is the payment), effectively the surplus: the difference between what they were willing to pay and what they actually paid
-
When vᵢ = pᵢ, the user is indifferent between winning and losing the auction (as the amount they have to pay is equal
to what they’re winning; uᵢ = 0 and no utility is gained)
-
uᵢ = 0 also when the bidder loses the auction (they don’t
have to pay anything but they don’t gain
anything)
Also, one must consider tie-breaking (for sealed-bid auctions), as in some cases 2 or more
bidders bid exactly the same.
-
The typical tie breaking rule is simply a random coin toss (so if 2 bidders are tied, each bidder has a 50% chance of winning)
-
This is more of a matter in practice; in theoretical
analysis, ties often can be resolved by introducing an
ɛ term
Revisiting the English auction example (I’m just gonna
screenshot the slide cause cba formatting the symbols):
Note how the minimum increment is more of a practical matter;
in theoretical analysis, it can be resolved by using an
arbitrary small ɛ (epsilon).
Computing Expected Utilities
An expected utility (which we want to maximize) is simply the utility gained
if you win in a given scenario, multiplied by the probability of
winning:
E[u] = Pr[win]*(v - p)
Let’s look at a first-price sealed bid auction example:
-
In this kind of auction, if a bidder bids their private
value and wins, as they have to pay this value, they’d
get no utility.
-
Thus they’d want to strategically bid lower if possible
-
The bidder will try and speculate about bids of others, and shade their bid
-
They of course don’t know bids of others, but may
have probabilistic information
-
So suppose that you are bidder 3 with v₃ = 300. There is a 40% chance that the highest bid
(besides you) will be 150, a 40% chance that it will be 210,
and a 20% chance that it will be 240. What is your best
strategy?
-
We can calculate the expected utility with respect to each of the highest possible bids,
via the aforementioned formula E[u] = Pr[win]*(v - p):
-
If you bid <150, you lose
-
If you bid 150, your expected utility is 40% * (300 - 150)
= 60
-
If you bid 210, your expected utility is 80% * (300 - 210)
= 72
-
If you bid 240, your expected utility is 100% * (300 - 240)
= 60
-
Your best strategy is to bid 210 (as expected utility 72 > 60)
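A minimal sketch reproducing the calculation above (bidder 3 with v₃ = 300 and the stated probabilities over the highest competing bid):

```python
# Minimal sketch of expected utility in a first-price sealed-bid auction.
value = 300
highest_other = {150: 0.4, 210: 0.4, 240: 0.2}   # highest competing bid: probability

for bid in [150, 210, 240]:
    p_win = sum(p for other, p in highest_other.items() if bid >= other)
    expected_utility = p_win * (value - bid)     # E[u] = Pr[win] * (v - p)
    print(bid, expected_utility)                 # 150 -> 60.0, 210 -> 72.0, 240 -> 60.0

# Best strategy: bid 210 (expected utility 72).
```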
Sponsored Search Auctions
-
(As touched upon before) This is the way that most search
engines generate revenue
-
advertisers effectively “bid” for slots
to have their sites displayed in search results
-
as multiple slots can be sold at a time, these are multi-item auctions
-
There are two main types of internet search:
-
Ordinary (organic) search: ranking based only on the ranking algorithm
-
Sponsored search: paid & biased; ranking based on an auction mechanism
-
Google's advertising revenue (including display ads)
was $42.5 Billion in 2012 (97% of total revenue)
e.g. the results enclosed in the grey box are
“ads”, paid advertisements by companies:
Payment Models
There are three main types of payment method:
-
Pay-per-1000 impressions (pay-per-mille or PPM/PPI)
-
An "impression" is when the ad is displayed
-
Based on traditional advertising in printed media
-
Used by search engines in early days (90s)
-
Still widely used in display advertising
-
Pay-per-click (this is the dominant type of method)
-
Only pay when clicked (so don’t need to pay for each
impression)
-
Nowadays used by all search engines
-
Pay-per-purchase (a.k.a. pay-per-transaction)
-
Only pay when an actual purchase is made
-
Purchases need to be tracked (e.g. with web spiders/crawlers), and there is an attribution problem
-
More often used for large-value items, e.g. selling a house
or car
-
One has to watch out for fraud
Pay-Per-Click
-
Different keywords attract different interest
-
Some keywords are very expensive, whilst others are
not
-
Goal of search engine is to maximise (price)*(probability of click)
(where CTR = click through rate, i.e. the ratio of users who
click on a specific link to the number of total users who view a
page, email, or ad)
So how do you design an auction that maximises the
profit?
-
Advertisers can bid on (combinations of) keywords
-
Based on the bids received, how do you determine the
allocation of advertisers to slots?
-
How to determine the pay-per-click payment?
-
A simple allocation approach is to just sort ads based on the bids (with highest bids in the first slots).
-
Note that the higher slots tend to have a higher CTR, and literature shows that people are much less likely to click on each successive lower slot (in accordance with a power law).
-
The search engine effectively wants to maximize a tradeoff
between the bidding price from the bidders, and the
probability that a user will click on their slots.
-
The issue with this approach, however, is that if the quality of an ad is bad, then even if a company bids lots of money, few users will click on it, so (under pay-per-click) the search engine won’t make much return.
Google’s Quality Score
Google has a way to estimate the “quality” of the
bidders and the advertisements, computing an ad's quality
score; this is based on (amongst other things):
-
Historic click through rate (as they have a lot of historical data and
statistics)
-
Relevance of the ad to a particular query
-
Quality of the landing page
-
The exact formula is kept secret, but it is effectively an estimation of the click probability
Of course, other search engines can have their own slightly
different ”quality score” formulas.
Generalized Second Price Auction (GSP)
A lot of search engines compute the ad allocation based on the product of the bid and the quality score. Higher slots are more valuable, and bidders with a higher product value get a higher slot.
Formally: let bi and qi denote the bidding price and ad quality score respectively. The allocation is given by sorting the ads in descending order according to: bi * qi
In Google’s “generalised second price auction (GSP)”:
-
The bidder with the highest product b1*q1 gets the first slot
-
The bidder with the second highest product b2*q2 gets the second slot, etc.
-
The bidder in the first slot pays the second-highest effective price, i.e. p1*q1 = b2*q2 => p1 = (b2*q2)/q1 (this is smaller than b1), and so on
-
This is so that one pays the smallest amount they could have bid while still
retaining the same position
...Very similar to second price auctions, right? (Hence the
name)
A simple example:
-
Suppose that the quality scores are that q1 = 0.1, q2 = 0.2
and q3 = 0.15
-
Suppose that the bids are that b1 = £4, b2 =
£3, b3 = £2
-
Therefore b1*q1 = 0.4, b2*q2 = 0.6, b3*q3 = 0.3
-
...So if there are two slots available, b2 wins the first
slot, and b1 wins the second slot
-
b2 will need to pay (b1*q1)/q2 = 0.4/0.2 = £2 (which
is less than b2’s bidding price of £3)
-
b1 will need to pay (b3*q3)/q1 = 0.3/0.1 = £3 (which is less than b1’s bidding price of £4)
Expected Value
If qi represents the click-through rate, X the set of bidders allocated, and the bid bi is the true value generated by a click, then the expected value is the sum over the allocated bidders: Σi∈X qi * bi.
It can be shown that sorting according to the product bi * qi maximises this total expected value (a.k.a. the social welfare of the allocation, i.e. the sum of the utilities of the bidders, assuming that bi is the true value).
Alternative Models
We have assumed that the click probability is independent of the position, which is a simplification; in reality, higher positions have a higher click probability (which is why they are more attractive).
Thus more refined alternative models for accurately estimating
the (often separable) CTR can be used, such as:
-
A fixed factor λj for position j (referring to the quality of the slot j)
-
Thus, the click probability becomes λj * qi (where qi refers to the quality of the bid i)
-
A “cascade model”, where click probability depends on whether or not the previous ad
was clicked (or whether or not some competitive companies
displayed on the same search results, as there can be
dependencies between the parameters)
Reputation Systems
As buyers and sellers, we’ve always wanted to know that
we’re dealing with someone we can rely on; with online
systems:
-
How do we know that we are dealing with a reliable buyer or
seller?
-
How do we know that quality of product/service is as
described?
-
A traditional solution is using a brand as a trust/reputation device (companies, e.g. car
manufacturers, often spend a lot of time and money building
up their reputation for reliability)
-
But what about one-time interactions, e.g. P2P file sharing, eBay buyers/sellers, unknown shops/brands, or second-hand items?
-
A solution is to use wisdom of the crowds
Trust Issues…
Terminology
-
There's no consensus in the scientific community on
terminology, but our working definition:
-
Trust - the subjective probability by which an individual, A, expects that another
individual, B, performs a given action on which its welfare depends.
-
Reputation - the collective opinion from multiple individuals about a single individual. In other words, reputation is the aggregation of personal
experiences about a certain individual.
Cherries and Lemons
-
There is a market with “cherries” (high quality products) and “lemons” (low quality products)
-
Buyer cannot distinguish between cherries and lemons (i.e. observe quality)
before purchase (e.g. used car);
-
There is asymmetric information between the buyers and sellers (with sellers having
more information)
The Lemon Problem
-
As a buyer, you will only want to pay the price for “average” quality, even if the product is a cherry (since you cannot observe this)
-
Therefore, cherry sellers choose a different market where they can get a higher price
-
At the same time, lemon sellers will try to sell their lemons in the cherry
market (since they can get a better price there)
-
Therefore the average product quality declines, and therefore so does the price
-
In the long-term, only lemons remain (as per Gresham’s law)
-
In summary, the lemon problem states that asymmetric information (seller knows more about quality of product than buyer) leads to adverse selection (lemons are more likely to remain in the
market)
-
This result is known as market failure, since incentives lead to an undesirable outcome
Moral Hazard
-
Once the transaction has taken place, there is no incentive for the parties to stick to the bargain (e.g. send the item, complete the payment, provide a follow-up service);
-
This is called a moral hazard
-
particularly a problem with online selling - how do we know
for sure the seller is actually going to sell us what
we’ve paid for? how does the seller know we’re
going to send them the money?
-
This is a situation in which one party gets involved in a risky event, knowing that it is protected against the risk and the other party will incur the cost
-
Typical example is insurance: if you are insured, there is less incentive to be
careful. Similarly, if a buyer has paid, there is less incentive
for the seller to fulfil their part of the bargain.
-
Whereas adverse selection is due to asymmetry before the
transaction, moral hazard is about what happens after the
transaction
Trust on the Web
-
The Internet poses an additional problem for trust and
reputation: anonymity - it is easy to fake/assume a new identity
-
There is no physical presence, so it does not allow for traditional (perception of)
trustworthiness through human interaction
-
There is a low cost of setting up a good looking web presence, compared to bricks-and-mortar
-
Therefore, the asymmetry of information and control thereof, could potentially be
even greater
The Solution?
-
As a solution, we rely on trust and reputation: personal experience, reviews, word-of-mouth, etc;
-
Trust and reputation systems leverage the web to mitigate the information asymmetry that leads to market failure and moral hazard.
Reputation vs. Recommender Systems
So what’s the difference?
-
Recommender systems - rely on people having different preferences, but assumes
that some users share similar interests
-
Recommendations are personalised – different users
get different recommendations
-
Reputation systems - based on assumption of consensus; reputation is ideally an
objective measure
-
E.g. transaction succeeded/failed, quality high/low
-
designed to address asymmetric information
Design Considerations
What should one consider when designing a reputation system?
-
What information should be collected?
-
Rate sellers, buyers, or both?
-
Binary ratings, multiple categories, text, etc.
-
How to compute an aggregate score.
-
Is the system robust against attacks from strategic (or
even malicious) users?
-
What information should be displayed?
-
Typically a reputation system should not only show the
average rating, but also the confidence in the rating. E.g.
the number of ratings.
-
Does a user need to register to provide
ratings?
Existing Systems
Probably one of the most well-known reputation systems is the
one that eBay uses for its sellers and buyers, trying to tackle
the issue of asymmetric information.
-
This is a two-sided reputation system, as both buyers and sellers are rated by each other
-
Can provide basic rating for: negative, neutral, or positive
-
In addition, buyers can provide detailed seller ratings;
was the item as described, how was communication, etc.
-
Can include “free text” feedback
-
Displays:
-
Total points where positive is +1 point, and negative is -1
-
Percentage of positive scores
-
The system used to be more open, where everything was
visible to everybody, though this could make people more
afraid to leave negative comments and leave themselves open
to abuse by the other parties
-
Amazon is another big example, where buyers can rate products and sellers
-
Products can be given a rating from 1-5 stars
-
Sellers can be rated negative, neutral, or positive, and users can leave text reviews
-
The people who leave ratings can also be rated
themselves
-
“unhelpful”, “helpful”, “very
helpful”, etc.
-
“40 out of 50 customers found this review
helpful”
-
Other systems include Facebook with “Like”
reacts, and Epinions:
-
Reviewers get monetary compensation (points) for reviews (through E-royalties
Credits)
-
Amount depends on usefulness of review, not on whether was positive or negative
Reputation Value
-
The simplest way to calculate reputation values is to use average ratings
-
One can correct this for user bias, i.e. the average user
rating with respect to the population average
-
Weighted ratings (similar to those in recommender systems) can also be used, weighting individual users based on their past ratings/trustworthiness (remember that, as humans, we’re subject to bias; some people will rate more generously than others).
-
This can be done in various ways, e.g.:
-
Number of ratings performed
(“experience”)
-
Agreements of past ratings with population average
-
Weight ratings based on time (older ratings receive a lower
weight; if it’s been a long time since a buyer bought
a product, that weighting may be less relevant than more
recent ones)
-
Maximum likelihood (a probabilistic approach) can also be used
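As one possible illustration of weighting, here’s a minimal sketch of a time-weighted average rating (the decay scheme and numbers are hypothetical, just one way of giving older ratings a lower weight):

```python
# Minimal sketch of a reputation value as a time-weighted average of ratings.
ratings = [  # (rating on a 1-5 scale, age of the rating in days) - hypothetical
    (5, 2), (4, 10), (1, 200), (5, 400),
]

half_life = 90.0   # assumed: a rating's weight halves every 90 days
weights = [0.5 ** (age / half_life) for _, age in ratings]
reputation = sum(w * r for (r, _), w in zip(ratings, weights)) / sum(weights)

print(round(reputation, 2))   # recent ratings dominate the score
```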
Measuring Confidence
-
How do we know whether we can rely on the calculated
reputation value?
-
Show confidence, e.g.:
-
Number of ratings (e.g. if someone shops a lot on eBay and leaves a lot of ratings, it shows they have an interest in that and may be more reliable)
-
Statistical tests, e.g. standard deviation/variance of the distribution,
p-values (when using probabilistic approach)
-
basically, does someone’s rating fit reasonably
within a bell curve, or is it an outlier?)
Probabilistic Approach
-
Fit the ratings to some distribution (e.g. Gaussian, also known as the normal distribution
or bell curve)
-
The reputation value can then be an average
-
The confidence can be described by variance, or precision (which is 1 / variance)
For binary ratings we would use the beta distribution (effectively a probability of probabilities), which has parameters α (number of successes, i.e. a positive rating) and β (number of failures, i.e. a negative rating).
(The slide shows the Gaussian density formula here, where μ = population mean and σ = population standard deviation.)
Using the Beta Distribution
The beta distribution is a family of continuous probability distributions defined on the interval [0, 1]. It is parameterized by two positive shape parameters, denoted by α and β. These appear as exponents of the random variable and control the shape of the distribution.
-
Initialise α = 1 and β = 1 (prior) - as you can
see there’s no curve, just a straight line
-
Average = 0.5, Variance = 0.083, Precision = 12
-
We observe “success” (α=2 and β=1) -
the additional success produces a straight line
-
Average = 0.67, Variance = 0.056, Precision = 18
-
We observe 3 “successes” and 1
“failure” (α=4 and β=2) - now
with 3 successes and a failure, we start to get a curve
(remember this is effectively an average of averages, i.e.
probability of probabilities)
-
Average = 0.67, Variance = 0.032, Precision = 31.5
-
We observe 12 “successes” and 4
“failures” (α=13 and β=5) -
eventually with enough data in, we’ll start to see a
bell curve, i.e. normal/Gaussian distribution
(which’ll still change as time progresses, but this is
the general shape we’re looking for)
-
Average = 0.72, Variance = 0.011, Precision = 94.7
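A minimal sketch reproducing the numbers above from the standard Beta-distribution formulas for the mean and variance:

```python
# Minimal sketch: mean, variance and precision of Beta(alpha, beta).
def beta_summary(alpha, beta):
    mean = alpha / (alpha + beta)
    variance = (alpha * beta) / ((alpha + beta) ** 2 * (alpha + beta + 1))
    precision = 1 / variance
    return mean, variance, precision

# start from the uniform prior and add (successes, failures) as positive/negative ratings
for successes, failures in [(0, 0), (1, 0), (3, 1), (12, 4)]:
    alpha, beta = 1 + successes, 1 + failures
    mean, var, prec = beta_summary(alpha, beta)
    print(f"a={alpha}, b={beta}: mean={mean:.2f}, variance={var:.3f}, precision={prec:.1f}")
# a=1, b=1:  mean=0.50, variance=0.083, precision=12.0
# a=13, b=5: mean=0.72, variance=0.011, precision=94.7
```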
Evaluating Reputation Systems
This can (unfortunately) be a much harder task than evaluating
recommender systems...
Evaluation methods:
-
Theoretical models: e.g. using game theory
-
Field experiments
-
Induce “true scores” i.e. sell different qualities of goods, and see how
the reputation correlates to these qualities (though not
always so practical in real life)
-
Measure the objective quality of items being sold (e.g. valuable playing cards – assessed by an expert) and relate this to the reputation value (however one has to bear in mind that experts can differ on valuing items)
-
Reputation values are also often used to generate rankings
-
PageRank algorithm for websites: a site’s reputation is calculated from the number and the reputation of its incoming links (so this value is calculated recursively); people used to abuse this and try to hack around it, but e.g. Google has worked hard to mitigate it!
-
TripAdvisor and similar websites rank hotels, restaurants,
etc.
Issues
Of course, reputation systems have a number of issues associated with them. One issue is ballot stuffing – when a single user can easily place multiple ratings.
To mitigate this:
-
Ensure there is some cost either monetary or in terms of effort
-
Only allow members to place reviews (e.g. Amazon)
-
Only allow users who actually completed transaction to
place review (e.g. eBay)
-
Rating the raters is usually allowed by non-members
vulnerable to attacks
-
This is usually limited by placing cookies, but that is easily circumvented
-
One has to weigh up a system’s usability vs. effectiveness
Another common issue is that of people leaving false/unfair ratings, as people can be biased; if they received an item they think is inferior or that they’ve overpaid for (or if they’re having technical issues with…) they’re far more likely to leave a negative review, often a very harsh one
There’s also the issue of slander, where a competitor places a false rating to discredit the
victim. There’s also the issue of self-promotion, where a seller may want to promote their own product; one has to watch
out for “farms”, groups of people who are employed to leave fake positive reviews on items to artificially boost their reputation.
Some solutions to the aforementioned issues include:
-
Require authentication and make it difficult to produce multiple identities (to prevent sybil attacks) - e.g. Trustpilot requiring unique usernames
-
Require proof of successful transactions
-
Rate the raters
-
Trust inference techniques: compute trust based on history (so buyers/sellers
who’ve been performing well for a long time can be
more trustworthy)
-
Ensure “smoothness” – make sure a single rating does not influence
overall score significantly
-
eBay for instance has tried to address this with the star
ratings, and e.g. if the postage is free you can’t
make a rating on that
-
Reputation systems also don’t always accurately assess the user’s
interest, so asking the right questions in terms of the review is really important
Furthermore, there’s the issue of whitewashing, where sellers can just change their identity to
“reset” their reputation if they get a bad rep. To
combat this, sites can:
-
Ensure identity change is costly or requires effort: e.g. link with email or phone number
-
This can discourage users to re-register if it requires too
much effort/cost is too high
-
Start with low reputation
-
Sellers may find it difficult to start
-
Display confidence (if a seller is coming across as doing all the right
things, that can be a problem for the buyer, who
doesn’t really know what’s going on behind the
scenes... )
Another problem is reciprocation and retaliation, resulting in bias towards positive ratings; eBay used to have
99% positive ratings overall, and there was a high correlation
between buyer and seller ratings (indicative of
reciprocation/retaliation fears).
Some possible solutions:
-
Since 2008, eBay no longer allows negative ratings of buyers by sellers (part of the motive for this was
wanting to stop people “ganging up” on other
buyers/sellers in retaliation)
-
One-sided ratings
-
“double blind feedback” (though this can be difficult due to text and
timing)
-
Simultaneous disclosure of buyer and seller ratings/reviews
To briefly touch on a few other issues (yeah, there are a
lot)...
-
Individual bias; some raters are just naturally inclined to be more
positive or negative
-
To counteract this, one can scale/weight based on past
history.
-
Quality variations over time - a famous eBay con sellers use is to build up a good reputation over time by selling low-value items, then suddenly exploit this (often selling new items that are completely out of kilter with what they previously sold)
-
A solution is to thus weight the rating based on
transaction price, giving higher weight to more recent
transactions
-
Incentive to provide ratings (sometimes there isn’t any!)
-
To provide incentive, sites can offer non-monetary rewards,
such as status/karma (which could possibly be exchanged for
other things), as well as monetary rewards (like prize
draws)
Financial Incentives
In terms of financial incentives for reputation systems...
-
Experimental studies have shown that, for eBay, a good reputation score makes about 8% difference
in profit (selling the same item)
-
Source: The Value of Reputation on eBay: A Controlled
Experiment (Resnick et al.)
-
For websites and other ranking sites, the stakes are
probably much higher
-
e.g. about 18% of digital marketing budget is spent on search
engine optimisation (SEO)
Experimental Design and Web Analytics
What is Experimental Design?
-
In many scientific contexts, one needs to deploy some experiments to test a hypothesis about a
system (and often to discover the best combination of design
factors to maximize e.g. user interactions)
-
e.g. suppose we want to design a crowdsourcing system - how
can we try out different approaches, and see what system is
most effective?
-
e.g. if we are designing a recommender system - how do we
compare which one works best in practice?
-
e.g. when developing a website - how can we find out which
of the various designs is best?
-
An experiment is a "test or series of runs in which purposeful changes are made to the input variables of a process or
system so that we may observe and identify the reasons for changes that may be observed in the output
response"
-
Experimental design is about setting up experiments and analysing them
using a sound scientific approach
-
The purpose is to tune a system and find the optimal design
Where is experimental design used in the real world? Some
example applications include:
-
manufacturing process, e.g. optimise temperature and other variables to optimise
yield from a chemical process
-
product development: beverages, paints, pharmaceuticals, etc.
-
Sociology/psychology - user studies, e.g. see what factors influence
behaviour
-
Behavioural economics
-
Computer science (e.g. landing page optimisation, and also in the field of human-computer interaction)
-
Pretty much every field of science has experiments of one
sort or another
Example - Web design/Landing Page Optimisation
-
A landing page is one where users arrive on your site following an online search (typically via a search engine like Google), or in response to an advertising campaign
-
this may or may not be the home page (could be for e.g. a
specific product, like a car)
-
Landing page optimization (LPO) - improving the conversion rate or effectiveness of a page, and of the pages it links to
-
This is done in order to increase the volume of business,
sales figures, or proportion of users who commit
-
LPO tries to improve the effectiveness of the display, and increase the proportion of customers progressing to the
next stage, or the conversion rate (usually the proportion of customers who progressed to make
a purchase)
-
Not to be confused with search engine optimisation
(SEO)! (Which is more of a business of the search engines,
rather than the company optimizing their own site)
-
e.g. Google/Bing use web “spiders” to crawl web
pages and track links between different sites that people
follow, scraping information from these pages.
-
They typically use specific page ranking algorithms to
determine (after searching with different keywords) which
companies/sites are more “important” in relation
to others, to help optimise the search rank ordering
-
LPO also uses web analytics
-
This comprises quantitative techniques for collecting and analysing web traffic (info like number of visitors to a website, number of
page views etc.)
-
Helps companies to analyse and then estimate how traffic to a website changes after e.g. launching something new or updating the
site
-
It’s being increasingly used for both SEO and LPO
-
Some of the data web analytics tries to gain
includes:
-
IP address/geographical location
-
How many (unique) visitors, repeat visits, etc. (not
completely reliable)
-
Which pages viewed and for how long
-
Browser and plugins used
-
Keywords that led to page
-
Referring sites
-
Clickstream analysis - "paths" through
website
-
Conversion rates, Return on Investment (ROI)
-
Above is a screenshot from Google Analytics, where you can have your own account to analyse your
website, and various metrics are visually displayed
Experimental Approach
-
...So as we’ve seen, experimental design can be used
to help e.g. optimize the structure of a landing
page… so how do we do this?
-
Firstly one needs to consider what the appropriate measures for the site are, and what factors you want to experiment with (testing different combinations of variables/structures against the objective of what we’re trying to optimise)
-
Once you have decided the appropriate measure(s) for your
site, you can experiment with different page and site
layouts in order to improve these metrics
-
A unique combination of variables is called a recipe or treatment, for example:
-
layout of a website
-
colour scheme of the text
-
Experimental methods vary from simple A versus B comparison, to more complex multifactorial methods
-
these compare multiple factors simultaneously, in an
efficient and effective way
A-B Split Testing
-
This approach is very intuitive: randomly assign one of two web pages (e.g. the landing page) to new
visitors
-
A is the "original" (base recipe) and B is the
"challenger"
-
Typically only a few designs are used, though sometimes up to 10 different designs are compared at the same time
-
Essentially, different versions of the same webpage are
shown to different segments of visitors, and they can then
be compared to see which variants lead to more conversions
-
In practice, it tends to be used when you have a clearer
picture of what structures/factors you want to compare
-
Can do more tests sequentially/back-to-back
-
Less reliable since changes may be due to other factors,
e.g. season, change in marketing campaign, etc.
-
Easy to test, implement, and analyse
-
Flexibility in defining variable/factor values
-
can do combination of evolutionary and revolutionary
designs
-
Cannot test all possible combinations, or capture interdependence between variables/factors
-
Therefore, unlikely to find optimal webpage
-
Not possible to discover which elements contributed the
most, because too many things change at the same time
-
Inefficient form of data collection
Multivariate Testing
-
This is a more complicated model than A-B testing; as the
name implies, we simultaneously gather information about multiple
factors/variables
-
e.g. font size, colour, layout, page size (e.g. of the
landing page)
-
as different users have different browsing behaviours/habits, we want to determine what design choices are most appropriate for your site (so this can also depend on the demographic)
-
e.g. if the site’s browsed by a younger audience,
having more pictures and less blocks of text may be more
effective
-
One-factor-at-a-time (OFAT) is often used, and can give one a clearer mind of the
effectiveness of different values of one particular factor;
however, it is often inefficient in practice
-
Full factorial design looks at all combinations of variables
-
This is a more refined comparison, allowing for the testing
of more structures
-
The number of combinations increases exponentially with the number of variables/branching factor
-
Fractional factorial design only looks at a subset of all possible “recipes”
-
This is essentially somewhere in between OFAT and full
factorial
-
It usually assumes some sort of independence between
variables (so you don’t have to be concerned with the
connection between variables)
One-factor-at-a-time (OFAT)
Let’s look at an example (from the slides of
course)!
-
Suppose you have 3 factors: page layout (PL), colour scheme (CS), page content
(PC)
-
You have 3 values for each, e.g. PL1, PL2, PL3
-
We need to choose a base recipe (e.g. the current page), suppose this is PL1, PC1,
CS1
-
Now we need to test 1+2*3 = 7 different recipes:
-
As in the title, we’re importantly only testing the
difference of effectiveness of one factor at a time
-
e.g. in the left-hand column above, factors PC1 and CS1 are
fixed, and we’re only varying the value of PL (1, 2
and 3), where the first entry is our “baseline”
recipe
-
In the second column, the second factor’s being
varied, and in the third column the third factor’s
being varied
Fractional Factorial Design
-
An issue with OFAT is that, very often it’s not possible or practical to test all combinations of factors as they increase exponentially
-
Also in practice, some combinations of factors could be redundant, where the business is pretty confident that some
combinations won’t be optimal/liked by the users (so
testing these is inefficient).
-
Thus, sometimes only using fractional factorial design is more suitable
-
As aforementioned, these are principled approaches to just
select a subset of the recipes
-
An example (at least, it can be interpreted as one) of
fractional factorial design is Latin Squares:
-
First described by Euler in 1782 and used for a number of
applications, including Sudoku puzzles
-
Prevents bias from interaction effects by ensuring that:
-
each value is tested equally often
-
with different combinations of other variables
-
However, can only be used to measure main effect and some interaction effects
-
A latin squares example (with labelled rows and
columns):
-
each square represents a combination/”recipe”
to test
-
notice how, similarly to sudoku, each possible PC value only appears once in each
row/column (there are no repeats)
-
it’s considered fractional factorial because
we’re not testing all possible combinations, just a
subset
-
e.g. in this case there are 3*3*3 = 27 possible recipes,
but we’re only testing 9
-
Trying all 9 combinations allows you to find the best of each factor independently
-
Here’s a possible realization of the above latin squares test, where the conversion rate after some trials of each respective combination is denoted (e.g. 0.3 denotes a user conversion rate of 30%).
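A minimal sketch of how a 3x3 Latin square design could be generated (a cyclic construction; the actual square on the slides may assign the CS values differently):

```python
# Minimal sketch of a 3x3 Latin square design: assign a colour scheme (CS) to each
# (page layout, page content) combination so every CS value appears exactly once
# in each row and each column.
levels = [1, 2, 3]

recipes = []
for row, pl in enumerate(levels):           # rows: page layout values
    for col, pc in enumerate(levels):       # columns: page content values
        cs = levels[(row + col) % 3]        # cyclic shift gives the Latin-square property
        recipes.append((f"PL{pl}", f"PC{pc}", f"CS{cs}"))

for r in recipes:
    print(r)                                # only 9 of the 27 possible recipes are tested
```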
Main vs. Interaction Effects
As aforementioned, latin squares for instance can only be used
to measure main effect and some interaction effects (whereas full factorial design can measure the main effects and all interaction effects).
...So what’s the difference between these effects? In short, a “main” effect is the effect of an individual factor averaged over the other factors, whereas an “interaction” effect arises when the impact of one factor depends on the level of another.
-
In many cases, variables interact with each other – e.g. with 2-variable interactions, the best
colour scheme depends on the page content, or the best page
content depends on the page layout
-
The main effects look at individual factors, and average over all other factors
-
e.g. say you’re studying the effectiveness of
exercise on losing weight; however losing weight also
depends on other factors, e.g. how much food you eat, the
type of food, your bedtime etc.
-
However by testing a main effect, we just focus on
measuring the effectiveness of a single factor (e.g. doing
exercise in this case), taking an average thereof over the
other factors.
-
Variables are said to be independent when there are no
interactions
-
Interactions occur where the impact of a variable is dependent on the setting of a second variable
-
Looking at another simple example, suppose we have the
following conversion rates for recipes:
-
We can make some inferences about the variables’
effects and dependencies; for example:
-
using PL3 leads to the highest conversion rates (at least
in these instances)
-
Also comparing rows 3 and 5, we see there could be some
dependency between the PL and PC factors (as CS1 remains the
same, but the conversion rate significantly drops)
Determining the Optimal Design
There are different ways we can try to determine which design
is optimal; one simple approach that can be used:
-
We can independently (i.e. consider each separately) find
the:
-
average conversion rate for each factor
-
the optimal value for each factor
-
e.g. say we have a factor PL, and (conversion rates) PL1 =
0.1, PL2 = 0.2 and PL3 = 0.3, we can choose value 3 for PL,
as this yields the highest conversion rate (common sense
really)
-
The optimal design is the combination of each optimal
value (e.g. recipe PL3, PC2 and CS1)
-
Final check: do an A/B test to check that the
“optimal” (i.e. the challenger) design is indeed
better than the base case
-
There are many other things one can take into consideration
in determining the optimal design, e.g. allow the
parameters/variables to be estimated without bias and with minimum variance.
-
e.g. in the latin squares method, the estimation is
unbiased as each factor is tested uniformly at random
-
so e.g. in the earlier example, the number of recipes that
could include CS1 is equal to the number that could include
CS2, so they’re equally likely to be picked; the same
applies to the other factors (we can say that they are balanced)
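A minimal sketch of this “best level per factor” idea: estimate each factor’s main effect by averaging conversion rates over the recipes containing each level, then combine the per-factor optima (the observations are hypothetical, not the table from the slides):

```python
# Minimal sketch of choosing the optimal design from per-factor average conversion rates.
from collections import defaultdict

observations = [  # (PL, PC, CS, conversion rate) - hypothetical Latin-square results
    (1, 1, 1, 0.10), (1, 2, 2, 0.20), (1, 3, 3, 0.15),
    (2, 1, 2, 0.25), (2, 2, 3, 0.30), (2, 3, 1, 0.20),
    (3, 1, 3, 0.30), (3, 2, 1, 0.35), (3, 3, 2, 0.25),
]

rates = {"PL": defaultdict(list), "PC": defaultdict(list), "CS": defaultdict(list)}
for pl, pc, cs, rate in observations:
    rates["PL"][pl].append(rate)
    rates["PC"][pc].append(rate)
    rates["CS"][cs].append(rate)

best = {}
for factor, by_level in rates.items():
    averages = {level: sum(v) / len(v) for level, v in by_level.items()}
    best[factor] = max(averages, key=averages.get)

print(best)   # combination of per-factor optima; confirm with a final A/B test vs. the base
```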
User Trials
So how do we decide how many user trials we need to run, to minimize the variance in the statistics?
-
We need to conduct multiple trials to obtain statistically significant results
-
For example, a t-test can be performed to determine which of a pair of designs is better and whether the difference is statistically significant
-
more specifically, a t-test can be used to determine
whether the means of two sets of data are significantly different from each other (also taking into account the
ranges/variances as well as the raw means, to determine how
different the data sets are)
-
e.g. say we have two data sets with means 10 and 10.1
respectively, we might initially think they’re very
similar; however say the first has range 1-100 whereas the
second only has range 8-12, the spread of data is obviously
likely to be pretty different
-
the bigger the t value, the more different the datasets
are
-
No matter what method we use, we have to consider: how many trials do we need for each recipe (to achieve meaningful statistical results)?
-
e.g. going back to the one factor at a time (OFAT) example given earlier: we have 7 recipes, so if we wanted
to do 100 trials per recipe to ensure significant results for pairwise
comparisons, we’d need 100*7 = 700 trials total.
-
5 of these 7 recipes contain PL1, so we’d be doing 500 trials using PL1.
-
...But can we use another approach that requires fewer trials to still achieve the same significant result?
-
Say we use a full factorial design approach instead, so
have 3*3*3 = 27 recipes:
-
If we do 20 trials per recipe, that’s 20*27 = 540 total trials.
-
9 of these recipes contain PL1 (as we’re looking at
all factor combinations), but as we’re only doing 20
trials per recipe, we’d be doing just 9*20 = 180 trials using PL1.
-
If we wanted to increase this to at least 500 trials using
PL1 (as in the OFAT example), we only need to increase the trials per recipe to 56 (which is less than 100 as in the OFAT
example).
-
...So the bottom line is: although the number of recipes is greater in the full factorial design approach, fewer trials per recipe are required to achieve the same statistical significance, so there’s effectively a trade-off.
-
...We’re almost done! Let’s finally look at the
latin squares approach too:
-
Three of the recipes (the left-hand column) contain PL1, so
we’d need 60 trials per recipe to have 60*3 = 180 for
PL1, and 167 trials per recipe to have over 500 for PL1 - clearly this is much more than in the previous
methods.
-
To reiterate, there’s a trade-off between the number of recipes,
and the number of trials per recipe (that’s all you need to know tbh, sorry if this
was long-winded)
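A minimal sketch reproducing the trials arithmetic above (how many trials per recipe, and in total, each design needs so that PL1 appears in at least 500 trials):

```python
# Minimal sketch of the recipes-vs-trials-per-recipe trade-off described above.
from math import ceil

target = 500
designs = {           # design: (number of recipes, recipes containing PL1)
    "OFAT":           (7, 5),
    "full factorial": (27, 9),
    "Latin squares":  (9, 3),
}

for name, (recipes, with_pl1) in designs.items():
    per_recipe = ceil(target / with_pl1)
    print(f"{name}: {per_recipe} trials/recipe, {per_recipe * recipes} trials in total")
# OFAT: 100/recipe (700 total); full factorial: 56/recipe (1512 total);
# Latin squares: 167/recipe (1503 total)
```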
Further Reading
For more info on experimental design, check out:
-
Design and Analysis of Experiments, Douglas C. Montgomery (chapters 4/5 cover what’s in
this lecture)
-
Landing Page Optimisation, Tim Ash
-
Evaluating Recommendation Systems, Guy Shani and Asela Gunawardana (see link on
website)
Human Computation
Modern Recommendation Systems
Incentives in Crowd
Rank Aggregation
Experts Crowdsourcing
Revision Lectures
tl;dr
In Matthew Barnes’ notes spirit, I’m going to try
to sum up the module concisely, a COMP3208 Any% speedrun if you
will. So smash that like button and let’s go:
Introductory Lecture
"Social Computing Research focuses on methods for harvesting the collective intelligence of groups of people in order to realize greater value from the interaction between users and information”
Dictionary definition of crowdsource: “obtain (information or input into a particular task or project)
by enlisting the services of a number of people, either paid or unpaid, typically via the
Internet”.
-
CAPTCHA - a type of challenge-response test used in computing to determine whether or not the
user’s human
-
NoiseTube - a citizen sensing project; people use their phones as noise level meters to measure
exposure to noise in their environment, participating in
collective noise mapping
-
Amazon Mechanical Turk - example of human computation, a crowdsourcing site for businesses to hire remotely
located "crowdworkers"
Many sites rely on ads for income, which are determined through
fast auctions; advertisers bid on users the moment they enter a website.
Two main types of online advertising:
-
sponsored search/keyword advertising (search engine results ads), where advertisers bid on
keywords entered by the user
-
display advertising (banners/text on sites), where advertisers bid based on
user profile (determined via cookies)
Past & Present
Terminology:
-
Collective Intelligence - “intelligence that emerges from the
collaboration, collective efforts, and competition of many
individuals”
-
Social Computing - “methods for building computational systems
that harness collective intelligence”
-
Human Computation - “outsourcing computational steps (microtasks)
that are difficult for machines to humans”
-
Crowdsourcing -“obtaining services, ideas or content from a large
group of people through an open call (typically to an online
community) rather than from traditional employees or
suppliers”
History:
-
1714 - Longitude Prize - first example of crowdsourcing, British Government
offered reward for solution to longitude measuring problem,
John Harrison invented the marine chronometer
-
1791 - Gaspard de Prony’s Logarithmic Tables - de Prony divided labourers into three levels
to develop logarithmic tables for the French Cadastre; the
bottom tier did menial calculations and were “human
computers”
-
1906 - Sir Francis Galton - invented standard deviation, and famous Ox story
shows that perceived wisdom of the crowds can be delicately
affected by how results are measured/calculated (I
think)
-
1984 - Space Shuttle Challenger Disaster - tragic shuttle explosion, had big impact on the
stock market, particularly NASA’s suppliers’
stock prices
Early social computing systems:
-
1980s - rise of the internet - systems expanded worldwide and became more
accessible
-
‘81 - ARPANET expanded as the National Science Foundation
(NSF)
-
‘82 - Internet Protocol Suite TCP/IP standardized, more global
interconnection
-
‘86 - TCP/IP network access expanded, researcher access to
supercomputers
-
‘88-89 - NSFNet expanded into academic/research organizations,
marked the start of the Internet as an intercontinental
network
-
‘89 - ISPs emerged (and ARPANET decommissioned)
-
1990s - other pioneering systems developed
-
Xerox’s “Tapestry” (‘92) - experimental collaborative email filtering system,
getting humans involved in the filtering process
-
GroupLens Lab (established in ‘92) - developed recommenders like the GroupLens
recommender, a USENET article recommendation engine,
MovieLens, and studied Cyclopath
Commercialisation:
-
1995 - eBay - uses a public reputation system to help users trust sellers, and has a feedback and
ratings policy
-
1997 - Amazon - uses a collaborative filtering service called BookMatcher, which makes book recommendations (automatically updates a
database and uses statistical analysis to validate items)
-
1998 - Google - uses PageRank, built on collective intelligence, which evaluates links to give pages authority
-
Since 2000, uses AdWords - auctions to rank ads, where Ad Rank = bid *
quality score
Web 2.0 (i.e. Participative/Participatory and Social Web)
refers to sites emphasizing:
-
user-generated content
- ease of use
-
participatory culture
-
interoperability
Sites with crowd-generated content:
-
2000 - Threadless - e-commerce site; designs created and chosen by an artist
community, each week designs are put to a public vote
-
2000 - iStockPhoto - royalty free, international micro stock photography
provider; photographers contribute work in return for royalties
-
2001 - Wikipedia - free online encyclopedia, created and edited by volunteers
-
2004: Flickr - image/video hosting service and community, apt for
high-res photos
-
2005: YouTube
-
Social Media (Myspace, Facebook, Twitter)
Notable crowdsourcing initiatives:
-
2004 - ESP Game - human-based computation game; uses the computational power of humans to perform tasks that computers can’t
(originally, image recognition), packaging the task as a
game
-
Google bought a licence for Google Image Labeler
-
2005 - Amazon Mechanical Turk
Crowdsourcing for social good:
-
2007, 2010 - Ushahidi - crowdsourcing platform allowing users to submit reports of violence and map
events
-
notably used during the 2007–08 Kenyan crisis, and
the 2010 Haiti earthquake
-
Also, a 4636 emergency shortcode was set up to run on the Digicel and the Comcel networks in
Haiti
Crowdsourcing and AI:
-
Wizard of Oz prototyping - “essentially prototyping the AI with human
beings”.
-
In some cases, humans are used to train the AI system and improve its accuracy
-
In other cases, companies “fake it until they make it”
-
e.g. Figure Eight, a human-in-the-loop AI company, uses human intelligence to do simple tasks (e.g. transcribing text or annotating images) to
train ML algorithms
-
Their software automates tasks and can be used to improve
catalog search results, approve photos, support customers
etc.
Why use social computing now?
-
Google’s built on collective intelligence
-
Facebook uses a newsfeed
-
Sales driven via engagement through relevant recommendations/adverts
-
Crowdsourcing offers access to a worldwide labour market
-
Crowdsourcing can lead to greater user engagement and possibilities
-
Can assist with contributions to science, providing insights on:
-
Human behaviour
-
Economic incentives
-
Social science research
-
Citizen science
-
Unprecedented access to data (participatory sensing) and
human computation (e.g. Galaxy Zoo)
Some important characteristics of social computing can be
summarized as user-created content where:
-
users can control the data
-
unique sharing of content or media
-
ability to tap into the intelligence of other users
collectively
-
unique communication and collaborative environment
-
big emphasis on social networks
-
interactive, user-friendly interface
When analysing a social computing system:
-
What is its purpose?
-
What problem(s) does it solve?
-
What were possibly the key (social computing) challenges in
implementing it?
-
How are users incentivised to participate?
-
How could it be improved?
-
Is it successful?
Recommender Systems
Recommender systems are a type of information filtering system which filters items based on a personal profile. Recommendations are personal to the specific user/group of users.
Recommender systems are given:
-
a user model which entails ratings, preferences, demographics, and situation
context (can infer lots of info from these
characteristics)
-
items, the things that we select (these may or may not have additional
characteristics)
-
The system can then calculate a relevance score used to rank options, and choose options to recommend
Examples of recommender systems:
-
Amazon - huge online retail space, so good recommenders are
invaluable
-
Netflix, with original content and huge library of other
content
-
News recommendations; news sites often recommend similar articles
-
Social web, e.g.:
-
Google Play (recommending apps based on what you’ve
downloaded)
-
Job sites (“jobs you may be interested in”)
-
Facebook (“groups you may like”)
-
Picasa (recommending similar photos)
-
Targeted advertising (e.g. page banners, using a recommender in addition to
cookies, search history etc.)
Recommender systems can provide benefits such as:
-
Reduce information overload
-
Find interesting things
-
Narrow down set of choices
-
Help one explore the space of options
-
Discover new things
-
Entertainment
-
Make things easier, so the user is more likely to return
-
Additional, probably uniquely personalised customer
service
-
Increase trust and customer loyalty
-
Increase sales, click-through rates (CTRs),
conversions
-
Opportunities for promotion and persuasion
-
Obtain more knowledge about customers
However, potential issues:
-
Lack of data - the “cold start problem”
-
Privacy - companies collect and sell users’ data, anonymity
can be compromised
-
Lots of similar items (lack of variety in recommendations)
-
Shared accounts (from a user perspective) can “confuse” the algorithm
-
Possibly limits people pursuing other options
Paradigms of recommender systems:
-
Content based
-
Knowledge based
-
Collaborative filtering
- Hybrid
Content-based:
-
Build a profile of a user (i.e. user model), which is automatically generated
and updated
-
Identify a set of features from items bought/looked at
-
Find item similarities based on features (i.e. content) using information retrieval techniques
-
Produce recommendation based on user profile and
similarity
Knowledge-based:
-
Makes use of (often) manually generated information about users and items
-
User information generated, e.g. using preference elicitation
-
“knowledge” can be updated and fed back
periodically, e.g. run overnight
-
Item similarity/recommendation determined by “experts”
-
System mimics behaviour of an experienced sales assistant
-
Useful when recommending to first-time users (e.g. for one-time purchases) and novel items
Collaborative filtering:
-
Most prominent approach to generate recommendations
-
used by large, commercial e-commerce sites
-
well-understood (been around a long time)
-
applicable in many domains (books, films, DVDs, etc.)
-
Doesn’t need info about the items themselves
-
Uses the wisdom of the crowds to recommend items; users rate catalog items
implicitly or explicitly
-
Exploits correlation between people with similar
interests
-
User-based - considers similarities between users
-
Item-based - considers similarity between items
Hybrid:
-
As the name suggests, hybrid approaches use a combination of techniques; a recommender system can become more powerful when
different things are put together
Similarity coefficients:
-
Some simple approaches computing document similarity based on keyword
overlap:
-
Dice’s coefficient: sim(A, B) = 2·|A ∩ B| / (|A| + |B|)
-
Jaccard’s coefficient: sim(A, B) = |A ∩ B| / |A ∪ B|
(where A and B are the two documents’ keyword sets)
-
Identical sets of keywords = a similarity of 1
-
Completely different keywords = a similarity of 0
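As a quick illustration (my own sketch, not lecture code), both coefficients over made-up keyword sets in Python:
```python
def dice(a: set, b: set) -> float:
    """Dice's coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a: set, b: set) -> float:
    """Jaccard's coefficient: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical keyword sets for two documents
doc1 = {"recommender", "system", "collaborative", "filtering"}
doc2 = {"recommender", "system", "content", "based"}
print(dice(doc1, doc2))     # 0.5
print(jaccard(doc1, doc2))  # 0.333...
```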
Shortcomings to using keywords:
-
Need to be transformed to numbers
-
Contain “noise” words such as “and”, “the”,
“but” etc.
-
Generate a sparse matrix (i.e. a matrix with very few non-zero elements)
To help, one can clean the features, with techniques like:
-
Stop-word removal - filtering out words like “a”,
“an”, “the” etc.
-
Stemming - reducing inflected words to their root forms,
e.g. “connection”, “connected”, “connecting” → “connect”
-
Phrase extraction - extracting words that occur together and have a
specific meaning, e.g. “hot dog”
-
Term-frequency inverse-document frequency (TF-IDF) - assigning less weight to common words (e.g. stop
words)
-
Inverse-document frequency (IDF): idfₜ = log(N / nₜ)
(nₜ = number of documents in which term t appears, N =
total number of documents)
-
Term-frequency inverse-document frequency (TF-IDF): tf-idfₜ = tfₜ * log(N / nₜ)
(tfₜ = normalised and/or dampened frequency of term t in the
document)
-
The less frequently a word is used across documents, the more important it
might be (term frequency is also weighted against the document
length)
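A minimal TF-IDF sketch in Python (my own illustration; the toy corpus is made up and assumed to be already stop-word-removed and stemmed):
```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """TF-IDF weight for every term in every document.
    tf is normalised by document length; idf = log(N / n_t)."""
    N = len(docs)
    # n_t: number of documents containing term t
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            t: (c / len(doc)) * math.log(N / df[t])
            for t, c in counts.items()
        })
    return weights

# Hypothetical corpus of three tiny "documents"
corpus = [["film", "action", "hero"],
          ["film", "romance"],
          ["film", "action", "hero", "sequel"]]
# "film" appears in every document, so its weight is 0; "hero" gets a positive weight
print(tf_idf(corpus)[0])
```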
Calculating similarity:
-
Most commonly used is cosine similarity, simply the angle between two feature vectors; the smaller the angle, the closer the two
vectors/documents/sets of words are to each other
-
90 degrees apart = completely dissimilar
-
less than 90 degrees => become more similar
-
Equation to calculate it: sim(a, b) = (a · b) / (|a| * |b|), i.e. the dot product of the two vectors divided by the product of their lengths
-
At the two extremes:
-
Identical vectors/documents
= a similarity of 1
-
Completely different vectors/documents = a similarity of 0
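A minimal cosine similarity sketch in Python (my own illustration; the feature vectors are made up):
```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF vectors over the same vocabulary
v1 = [0.0, 0.3, 0.5, 0.2]
v2 = [0.1, 0.2, 0.4, 0.0]
print(cosine_similarity(v1, v2))          # ≈ 0.92, i.e. quite similar
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 => orthogonal, completely dissimilar
```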
User-based collaborative filtering:
-
Basic assumption is that people who agreed in the past are likely to agree
again (i.e. we’re not quite as unique as we think we
are)
-
We try to find users who are similar to each other based on their overlap in interests and
opinions (e.g. using jointly rated items)
-
Then the system can recommend items from similar users which are rated positively
Steps for the system to consider:
-
Representation - what will be used as input?
-
Similarity measure - how do we calculate the similarity between 2
users?
-
Neighbourhood selection - do we select all users or a subset? On what basis
do we choose?
-
Predict/estimate rating - need to calculate a predicted rating for a
user’s unrated items
-
Generation of recommendation - select the “top M” recommendations ranked by
predicted rating
Similarity measure for ratings:
-
Only considers users who’ve rated some of the same
items
-
Pearson coefficient has been shown to be the most effective for user-based collaborative filtering
-
Used in statistics to calculate the correlation between 2 random variables; for users a and b over their co-rated items i:
sim(a, b) = Σᵢ (r_a,i − mean_a)(r_b,i − mean_b) / ( sqrt(Σᵢ (r_a,i − mean_a)²) * sqrt(Σᵢ (r_b,i − mean_b)²) )
This formula returns values between:
-
+1 (perfectly correlated)
-
-1 (perfectly negatively correlated)
-
Looks at the similarity in variations over items, not their absolute values, as it corrects for the average value
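A rough Python sketch of the Pearson similarity over co-rated items (my own illustration; the ratings are made up, and taking the means over the co-rated items only is an assumption):
```python
import math

def pearson(ratings_a: dict, ratings_b: dict) -> float:
    """Pearson correlation between two users over their co-rated items.
    ratings_* map item id -> rating."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0  # not enough overlap to compute a correlation
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den = (math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common))
           * math.sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common)))
    return num / den if den else 0.0

# Hypothetical users: similar variations over items => high correlation
alice = {"item1": 5, "item2": 3, "item3": 4}
bob   = {"item1": 4, "item2": 2, "item3": 3}
print(pearson(alice, bob))  # 1.0 (perfectly correlated variations)
```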
Possible neighbourhood selection criteria:
-
Must they share at least one co-rated item?
-
Only those with a positive correlation?
-
Either threshold (e.g. by a minimum similarity value) or fix
the maximum number of neighbours, and choose the n most correlated neighbours?
Prediction function:
-
Predict the rating of user u for item i, given neighbourhood N (where N = the set of users in the selected neighbourhood):
pred(u, i) = mean_u + Σ_{n ∈ N} sim(u, n) * (r_n,i − mean_n) / Σ_{n ∈ N} |sim(u, n)|
-
This formula adds/subtracts the neighbours’ bias from
the given user’s average
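A minimal Python sketch of this prediction formula (my own illustration; the neighbourhood values are made up):
```python
def predict(user_mean: float, neighbours: list[tuple[float, float, float]]) -> float:
    """Predict a rating for one item for user u.
    neighbours: (similarity to u, neighbour's rating of the item, neighbour's mean rating)."""
    num = sum(sim * (rating - mean) for sim, rating, mean in neighbours)
    den = sum(abs(sim) for sim, _, _ in neighbours)
    return user_mean if den == 0 else user_mean + num / den

# Hypothetical neighbourhood for one (user, item) pair
neighbours = [(0.9, 4.0, 3.5),   # very similar neighbour, rated the item above their own average
              (0.4, 2.0, 3.0)]   # less similar neighbour, rated it below their average
print(predict(3.8, neighbours))  # 3.8 + (0.9*0.5 + 0.4*(-1.0)) / 1.3 ≈ 3.84
```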
Ways to improve prediction function:
-
Give more weight to higher variance items
-
Significance weighting - when the number of co-rated items between users is
low, linearly reduce the weight
-
Case amplification - give more weight to “very similar”
neighbours
-
Neighbourhood selection - use a similarity threshold/fixed number of
neighbours
Lastly, making a recommendation:
-
Recommend M items with the highest predicted rating (M can be fixed or flexible)
-
Rank the recommendations in order of highest predicted ratings
-
Other possible recommendation factors:
-
Item popularity
-
Item novelty
-
Variation of recommended items
-
Other reasons for promoting certain items
Item-based collaborative filtering:
-
Assumption is that a given user is more likely to have the same opinion for similar
items
-
One can use ratings to e.g. derive the similarity between items
-
Advantages over user-based:
-
No user “cold start” problem
-
Greater scalability with more users
-
Can be pre-computed “offline”
-
Results can be equally good
Some issues exist with item-based; pre-processing techniques
can be used to mitigate these.
-
Might want to compute a similarity matrix offline with the items (at most (n*(n-1))/2 entries with n
items), which can be reduced further
-
An item similarity matrix changes less often than a user similarity matrix
-
With optimization, one can compute recommendations very fast
Similarity measure for ratings:
-
One can use Pearson’s coefficient, swapping users and
items, but it isn’t adjusted/weighted for users’
average, which could result in unfair bias
-
Better similarity function for item-based is adjusted cosine similarity; calculates the angle between two item vectors adjusted
for the user averages, over the users u who rated both items i and j:
sim(i, j) = Σᵤ (r_u,i − mean_u)(r_u,j − mean_u) / ( sqrt(Σᵤ (r_u,i − mean_u)²) * sqrt(Σᵤ (r_u,j − mean_u)²) )
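A rough Python sketch of adjusted cosine similarity (my own illustration; the ratings and user means are made up):
```python
import math

def adjusted_cosine(item_i: dict, item_j: dict, user_means: dict) -> float:
    """Adjusted cosine similarity between two items.
    item_* map user id -> rating of that item; user_means maps user id -> that user's mean rating."""
    common = set(item_i) & set(item_j)
    if not common:
        return 0.0
    num = sum((item_i[u] - user_means[u]) * (item_j[u] - user_means[u]) for u in common)
    den = (math.sqrt(sum((item_i[u] - user_means[u]) ** 2 for u in common))
           * math.sqrt(sum((item_j[u] - user_means[u]) ** 2 for u in common)))
    return num / den if den else 0.0

# Hypothetical ratings of two items by the same three users
user_means = {"u1": 3.0, "u2": 4.0, "u3": 2.5}
item_a = {"u1": 4, "u2": 5, "u3": 3}
item_b = {"u1": 4, "u2": 4, "u3": 3}
print(adjusted_cosine(item_a, item_b, user_means))  # ≈ 0.75
```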
Prediction function:
-
Predict the rating of user u for item i, given neighbourhood N (where N = the set of items in the selected neighbourhood):
pred(u, i) = Σ_{j ∈ N} sim(i, j) * r_u,j / Σ_{j ∈ N} |sim(i, j)|
-
Simpler formula than user-based, as we need not subtract
user biases
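A minimal Python sketch of the item-based prediction (my own illustration; the neighbourhood values are made up):
```python
def predict_item_based(sims_and_ratings: list[tuple[float, float]]) -> float:
    """Item-based prediction: similarity-weighted average of the user's
    ratings for the neighbouring items. sims_and_ratings: (sim(i, j), r_u,j)."""
    num = sum(sim * r for sim, r in sims_and_ratings)
    den = sum(abs(sim) for sim, _ in sims_and_ratings)
    return num / den if den else 0.0

# Hypothetical neighbourhood of the target item
print(predict_item_based([(0.8, 4.0), (0.5, 3.0)]))  # ≈ 3.6
```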
An alternative simple approach is Slope One, which focuses on computing the average rating differences between pairs of items:
-
Average difference between items j and i: dev(j, i) = Σ_{u ∈ S(j,i)} (r_u,j − r_u,i) / |S(j,i)|, where S(j,i) is the set of users who rated both items
-
Predicted rating for item j from a known rating of item i: pred(u, j) = r_u,i + dev(j, i)
-
Adapted prediction weights each difference by the number of users who jointly rated the pair: pred(u, j) = Σᵢ (dev(j, i) + r_u,i) * |S(j,i)| / Σᵢ |S(j,i)|
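A small Slope One sketch in Python (my own illustration; the toy ratings table is made up, since the lecture’s example data isn’t reproduced here):
```python
from collections import defaultdict

def slope_one_deviations(ratings: dict) -> dict:
    """ratings: user -> {item: rating}.
    Returns (j, i) -> (average of r_u,j - r_u,i, number of co-rating users)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for user_ratings in ratings.values():
        for j in user_ratings:
            for i in user_ratings:
                if i != j:
                    sums[(j, i)] += user_ratings[j] - user_ratings[i]
                    counts[(j, i)] += 1
    return {pair: (sums[pair] / counts[pair], counts[pair]) for pair in sums}

def slope_one_predict(user_ratings: dict, target: str, devs: dict) -> float:
    """Weighted Slope One: weight each deviation by the number of co-rating users."""
    num = den = 0.0
    for i, r in user_ratings.items():
        if (target, i) in devs:
            dev, count = devs[(target, i)]
            num += (r + dev) * count
            den += count
    return num / den if den else 0.0

# Hypothetical toy data
ratings = {"alice": {"item1": 5, "item2": 3},
           "bob":   {"item1": 3, "item2": 4},
           "carol": {"item1": 4}}
devs = slope_one_deviations(ratings)
print(slope_one_predict({"item1": 4}, "item2", devs))  # carol's predicted rating of item2: 3.5
```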
Types of ratings:
-
Explicit ratings - e.g. a Likert scale, typically 1-5 or 1-7
-
Must consider the optimal granularity of scale (e.g. 1-10 is better for films)
-
One can also have multidimensional ratings (e.g. multiple ratings per movie)
-
There are challenges, e.g. users not always willing to rate many items, resulting in sparse matrices
-
one has to consider how to incentivise users (e.g. paying
users, having a point system, or “nudging”
users)
-
Implicit ratings - clicks, page views, time spent on a page, demo downloads
etc.
-
These can be used in addition to explicit ratings
-
The question is how to interpret them “correctly”?
Evaluating recommender systems:
-
Remember the “human factor”, and that people are mostly predictable
-
Measures used in practice include:
-
Total sales numbers
-
Promotion of certain items
-
Click-through rates
-
Interactivity on the platform
-
Customer return rates
-
Customer satisfaction and loyalty
Measuring their performance:
-
Online tests with real users
-
e.g. A/B testing (also known as split testing or bucket testing),
comparing performance of two versions
-
Example measures:
-
sales increase
-
higher click-through rates
-
e.g. controlled experiments
-
Example measure: satisfaction with the system, which can be
qualitatively assessed, e.g. via questionnaires (however people tend to be unnaturally positive in
surveys where the interviewer is present)
-
Based on historical data
-
Example measures:
-
Prediction accuracy
- Coverage
For offline experiments:
-
Can hide some items where we know there’s a ground
truth
-
Can hide some information from domain experts
-
Can rank items or predict hidden ratings (using a metric to evaluate this)
Labelling classifications:
-
True positives (TP) - the recommended item is relevant to the user
-
True negatives (TN) - an item which is not relevant has not been recommended
-
False positives (FP) - an item which is not relevant has been
recommended
-
False negatives (FN) - an item is relevant, but this has not been
recommended
To note:
-
True positives and false negatives together cover all the relevant items
(the relevant set splits into items recommended and items missed)
-
A false positive is easier to deal with than a false
negative
Classification performance metrics:
-
Precision - a measure of exactness which determines the fraction of relevant items
retrieved out of all items retrieved
-
Recall - a measure of completeness which determines the fraction of relevant items
retrieved out of all relevant items
-
Accuracy - considers the accuracy of all correct classifications (i.e. true positives and true negatives)
-
Rank score - extends precision and recall, to take the positions of correct items in a ranked list into
account
-
Lower-ranked items can be overlooked by users (so this
metric’s especially important for recommenders)
-
Can try to optimize models for such measures (e.g. AUC)
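A minimal Python sketch of these three metrics (my own illustration; the confusion counts are made up):
```python
def precision(tp: int, fp: int) -> float:
    """Fraction of recommended items that were actually relevant."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of relevant items that were actually recommended."""
    return tp / (tp + fn)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all classifications that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion counts from an offline evaluation
tp, tn, fp, fn = 30, 50, 10, 10
print(precision(tp, fp))         # 0.75
print(recall(tp, fn))            # 0.75
print(accuracy(tp, tn, fp, fn))  # 0.8
```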
Error measures:
-
Mean absolute error (MAE) - the average absolute deviation between predicted and actual
ratings: MAE = (1/n) * Σ |predᵢ − rᵢ|
-
Root mean square error (RMSE) - places more emphasis on greater deviations: RMSE = sqrt( (1/n) * Σ (predᵢ − rᵢ)² )
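A minimal Python sketch of both error measures (my own illustration; the predicted and actual ratings are made up):
```python
import math

def mae(predicted: list[float], actual: list[float]) -> float:
    """Mean absolute error between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted: list[float], actual: list[float]) -> float:
    """Root mean square error; penalises large deviations more heavily."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical predictions against held-out ratings
predicted = [3.5, 4.0, 2.0, 5.0]
actual    = [3.0, 4.5, 2.0, 3.0]
print(mae(predicted, actual))   # 0.75
print(rmse(predicted, actual))  # ≈ 1.06
```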
Online Auctions
-
Two ways to buy things on eBay for example: buy it now, and auctions
-
(eBay charges a % of commission fees for items sold on the platform)
-
Sponsored search is the basic model that most search engines use
to generate revenue
-
e.g. searching for a product on Google with keywords; sponsored advertisement results appear alongside the organic, page-ranked results. (Yahoo used to call sponsored search auctions
“computational advertising”)
-
Another auction type is display advertising - usually in the form of some media on websites, e.g.
an image banner or a video ad.
-
These days, often use cookies from browsing and purchase
history to better target ads
Why use auctions:
-
Type of dynamic pricing based on competition
-
Essentially a price discovery mechanism
-
most relevant ad is shown
-
Computation is done by the “crowd” (e.g.
bidders - think “wisdom of the crowds”)
Auction design:
-
affects how people behave during an auction - must consider:
-
What are the auction’s desired objectives and properties?
-
How can they be designed so users are incentivized to behave “correctly”?
-
The most commonly desired property is for bidders to bid truthfully, i.e. reveal their true valuations
-
This makes the auction a price discovery process (of course this isn’t always easily
feasible in practice)
Four classical auction types:
-
English Auction - an open-bid ascending auction; most commonly used for selling goods, e.g. antiques and artworks
-
bidders can be assumed to have a “private value”, i.e. what they’re ultimately willing to pay
for the item
-
bidders may speculate about the bids of others, and shade their own bid
-
one may make as few bids as possible to avoid raising
price
-
a bidder’s first bids are often lower than their
“private value”
-
concept of individual rationality indicates that bidders will typically not bid more
than they are willing to pay
-
bidder may also refrain from bidding until the bidding
slows down
-
Dutch Auction - an example of an open-bid descending auction (effectively
the opposite of the English auction)
-
famously used for selling flowers in the Netherlands
:o
-
auction starts at an artificially high price
-
auctioneer lowers the price until a bidder makes a bid equal to current offer
price
-
the good is allocated to the bidder that made the offer
-
If a bidder bids when price lowers to their private value,
they’ll win, but may pay much more than they
needed
-
Thus a strategy might be to hold out, and try to bid just
before the price lowers to the second-highest
private value
-
The longer one waits, the greater risk they lose out -
trade-off!
-
Sealed-Bid First Price Auction
-
similar to the English auction in that the highest bid wins; however this type is sealed-bid (bidders don’t see others’ prices)
-
normally, used for procurement by companies and
organizations, and government project contracts
-
in summary:
-
Single round
-
Bidders submit offers in a sealed envelope; all bidders submit simultaneously
-
Good is allocated to bidder with highest price
-
Winner pays the price they bid
-
Sealed-Bid Second Price (Vickrey) Auction
-
goods are awarded to bidder that made the highest bid, but
they pay the price of the second highest bid
Dominant strategy:
-
A strategy that’s better than the other strategies for one player, no matter how that player's opponents may play.
-
In the context of an auction, a bidder’s strategy is
dominant if - regardless of the prices the other bidders bid
- their bidding price will always generate the best possible outcome for them.
-
For English and Dutch auctions (and the sealed-bid first
price auction, strategically equivalent to Dutch), strictly
speaking, there is no dominant strategy for bidders, as the
best/safest strategy to use is always dependent on what the other bidders bid
-
However, for the sealed-bid second price (i.e. Vickrey)
auction, there is a dominant strategy, to simply always bid your private value
-
This is because, if you win, you’ll only have to pay
the second highest value, so regardless of what the other
bid values actually are, you couldn’t possibly have a
better outcome
-
Due to its dominant strategy, sealed-bid second price (i.e. Vickrey) auction can be
preferable for software applications and analysis
Formal model and utility:
-
utility for bidder i when winning the auction would be uᵢ = vᵢ - pᵢ (where vᵢ is the willingness to pay and pᵢ is the payment) - effectively the surplus between
what they actually paid and what they were willing to
pay
-
When vᵢ = pᵢ, user is indifferent between winning and losing the auction
-
uᵢ = 0 also when the bidder loses the auction
-
in terms of tie-breaking (for sealed-bid auctions), the typical tie breaking
rule is a random coin toss (so if 2 bidders are tied, each bidder has a 50% chance of winning)
-
An expected utility is the utility gained if you win in a given scenario,
multiplied by the probability of winning: E[u] = Pr[win]*(v - p)
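A small Python sketch of computing expected utility in a sealed-bid second-price auction, including the coin-toss tie-break (my own illustration; the bid values are made up):
```python
def expected_utility(my_bid: float, my_value: float, other_bids: list[float]) -> float:
    """E[u] = Pr[win] * (v - p) in a sealed-bid second-price auction.
    Ties with the highest other bid are broken by a fair coin toss."""
    highest_other = max(other_bids)
    if my_bid > highest_other:
        pr_win = 1.0
    elif my_bid == highest_other:
        pr_win = 0.5
    else:
        pr_win = 0.0
    payment = highest_other  # the winner pays the second-highest bid
    return pr_win * (my_value - payment)

# Hypothetical scenario: my private value is 10
print(expected_utility(10, 10, [8, 6]))   # win for sure, pay 8 -> expected utility 2
print(expected_utility(10, 10, [10, 6]))  # tie at 10, 50% chance of winning -> 0.5 * (10 - 10) = 0
```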
Reputation Systems
How do we know we’re dealing with a reliable buyer or
seller, and that the quality of product/service is as
described?
-
A traditional solution is using a brand as a trust/reputation device
-
What about one-time interactions (e.g. P2P file sharing,
eBay)?
-
Solution is to use wisdom of the crowds
Terminology:
-
Trust - the subjective probability by which an individual, A, expects that another
individual, B, performs a given action on which its welfare depends.
-
Reputation - the collective opinion from multiple individuals about a single individual. In
other words, reputation is the aggregation of personal
experiences about a certain individual.
Trust issues:
-
There is a market with “cherries” (high quality products) and “lemons” (low quality products)
-
Buyer cannot distinguish between cherries and lemons before purchase
-
asymmetric information between the buyers and sellers (sellers have more
information)
-
as a buyer, you will only want to pay the price for “average” quality, even if the product is a cherry
-
cherry sellers move to a different market
-
lemon sellers will try to sell their lemons in the cherry
market
-
average product quality declines; in the long term only
lemons remain (Gresham’s law)
-
Once the transaction has taken place, there’s no incentive for the parties to stick to the bargain
-
e.g. if you are insured, there is less incentive to be
careful
-
anonymity - easy to fake/assume a new identity
-
no physical presence
-
low cost of setting up a good looking web presence
-
this can increase info asymmetry
-
trust and reputation: personal experience, reviews, word-of-mouth
-
Trust and reputation systems leverage the web to mitigate information asymmetry that leads to market failure and moral hazard
Reputation systems are based on an assumption of consensus; reputation is ideally an
objective measure; they are designed to address asymmetric information.
Design considerations:
-
What information should be collected?
-
Rate sellers, buyers, or both?
-
Binary ratings, multiple categories, text, etc.
-
How should the aggregate score be computed?
-
Is the system robust against attacks?
-
What information should be displayed?
-
Should usually also show the confidence in the rating
-
Does user need to register to provide ratings?
e.g. eBay:
-
two-sided reputation system (buyers and sellers rate each other)
-
negative, neutral, or positive basic ratings
-
Can include “free text” feedback
- Displays:
-
Total points where positive is +1 point, and negative is -1
-
Percentage of positive scores
e.g. Amazon:
-
buyers rate products and sellers
-
Sellers can be rated negative, neutral, or positive
-
users can leave text reviews
-
people who leave ratings can be rated themselves
-
“unhelpful”, “helpful”, “very
helpful”, etc.
-
“40 out of 50 customers found this review
helpful”
Reputation values:
-
simplest way -> use average ratings
- Can correct user bias
-
Weighted ratings can also be used (weighting users based on their past
ratings/trustworthiness), e.g.
-
Number of ratings performed
(“experience”)
-
Agreements of past ratings with population average
-
Weight ratings based on time
-
Maximum likelihood (a probabilistic approach) can be used
-
Confidence measures include:
-
Number of ratings
-
Statistical tests (e.g. standard deviation/variance,
p-values)
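A minimal sketch of a weighted-average reputation value in Python (my own illustration; the (rating, weight) pairs are made up and the weighting scheme, e.g. experience- or time-based, is left abstract):
```python
def weighted_reputation(ratings: list[tuple[float, float]]) -> float:
    """Weighted average of ratings; each rating carries a weight,
    e.g. derived from the rater's experience or the age of the rating."""
    total_weight = sum(w for _, w in ratings)
    return sum(r * w for r, w in ratings) / total_weight

# Hypothetical (rating, weight) pairs: older / less trusted ratings get lower weights
ratings = [(5.0, 1.0), (4.0, 0.8), (1.0, 0.2)]
print(weighted_reputation(ratings))  # 4.2, the low-weight outlier has little influence
```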
Probabilistic approach:
-
Fit the ratings to some distribution (e.g. a Gaussian with population mean μ and standard deviation σ)
-
The reputation value can then be the mean
-
Confidence can be described by the variance, or precision
-
For binary ratings use the beta distribution, Beta(α + 1, β + 1), where:
-
α = number of successes (i.e. positive ratings)
-
β = number of failures (i.e. negative ratings)
-
The mean of the distribution, (α + 1) / (α + β + 2), gives the reputation value, and its variance gives a confidence measure
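A small Python sketch of the beta-distribution approach (my own illustration; the feedback counts are made up):
```python
def beta_reputation(positives: int, negatives: int) -> tuple[float, float]:
    """Reputation from binary feedback using Beta(alpha + 1, beta + 1).
    Returns (expected value, variance); a lower variance means higher confidence."""
    a, b = positives + 1, negatives + 1
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, variance

# Two sellers with the same ratio of positive feedback but different volumes
print(beta_reputation(8, 2))    # (0.75, ~0.014)   -> some confidence
print(beta_reputation(80, 20))  # (~0.79, ~0.0016) -> similar score, much higher confidence
```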
Evaluating reputation systems:
-
Theoretical models: e.g. using game theory
-
Field experiments
-
Induce “true scores” i.e. sell different qualities of goods, see how
reputation correlates
-
Measure the objective quality of items being sold, relate this to reputation value
-
Reputation values are often used to generate rankings
-
PageRank algorithm for websites
-
reputation calculated by the number of and the reputation
of the incoming links to that site
-
TripAdvisor and similar websites rank hotels, restaurants,
etc.
Issues with reputation systems:
-
Ballot stuffing - when a single user can easily place multiple
ratings; to mitigate:
-
Ensure there is some cost either monetary or in terms of
effort
-
Rate the raters (though if this is also allowed for non-members, it is
vulnerable to attacks)
-
weigh up usability vs. effectiveness
-
People leaving false/unfair ratings (and slander, self-promotion and “farms”); to
mitigate:
-
Require authentications (make it difficult to produce multiple identities)
-
Require proof of successful transactions
-
Rate the raters
-
Trust inference techniques (can compute trust based on history)
-
Ensure “smoothness” (make sure a single rating
doesn’t influence the overall score significantly)
-
Ask the right questions in user reviews
-
Whitewashing - where sellers change identity to
“reset” their reputation; to combat:
-
Ensure identity change is costly or requires effort (e.g.
link with email or phone number)
-
Start with low reputation (sellers may find it difficult to
start)
-
Display confidence
-
Reciprocation and retaliation - resulting in bias towards positive ratings, and
sometimes ganging up; to combat:
-
Don’t allow negative ratings
-
One-sided ratings
-
“double blind feedback” (though this can be
difficult due to text and timing)
-
Simultaneous disclosure of buyer and seller
ratings/reviews
-
Individual bias - some raters are just naturally inclined to be more
positive or negative
-
Solution: scale/weight based on past history
-
Quality variations over time - e.g. build up a good reputation over time by
selling low-value items, then suddenly exploit this
-
Solution: weight the rating based on transaction
price
-
Incentive to provide ratings (sometimes there isn’t any!)
-
solution: non-monetary rewards (e.g. status/karma),
monetary rewards
Additional Lectures
Introduction to Coursework
There are going to be 4 coursework “surgery”
sessions, scheduled throughout the duration of the module. These
are interactive drop-in sessions to ask any questions you might
have.
Assignments
Task:
-
Build and evaluate a recommender system
-
Sarah’s lectures guide you through some of the techniques
that can be used
-
Stuart will give a lecture on advanced recommender systems,
talking about matrix factorization, which you can optionally
use for the coursework
-
This may potentially give better results than e.g. network
vector similarity approaches, though it can be worth
experimenting with different things to see what works best
for your implementation
-
Students work in groups of 2 or 3
-
Train a recommender system model to predict scores for a
testset
-
We are given a training and test set of data, where the
training set contains the actual ratings and the test set
does not
Deliverables:
-
(Optional) Submit CSV files with predicted review
scores
-
Submit final CSV files with predicted review scores (10
attempts total)
-
The first 9 submissions are optional, but the final one is
the one that’s actually scored and marked (just giving
you the opportunity to tweak and improve things if
needed)
-
Source code (final attempt)
-
PDF report (describing how it works and your
evaluations)
Learning Outcomes (which are practical in nature):
-
You will understand the use of recommender
technologies
-
You will understand important techniques in designing and
building models (particularly when the rating sets get
large)
-
You will be able to analyze the results using a scientific
approach
Useful Resources
-
Example package (with datasets and example source files for
how to work with databases, using Python and SQLite, and
some Java examples) >> Module Wiki
-
Assignment PDF >> Module Wiki
-
Marking scheme PDF >> Module Wiki
-
Submission site >> ECS handin (open week 8)
-
Recommender Systems >> lectures week 2 (basic), week
7 (advanced)
-
Coursework Introduction >> lecture week 3 (this
lecture)
-
Coursework Surgery >> sessions on weeks 5, 7, 8,
9
Forming Groups (of 2 or 3)
-
Decide your teams amongst yourselves
-
Email sem03@soton.ac.uk your team by 15th March 2021
-
If you are not part of a team after this, you will be
assigned a RANDOM team (and informed by email)
-
Team of 2 = one implementation
-
Team of 3 = two implementations (best one submitted)
Task Datasets
-
Given on the module wiki page
-
Movie recommendations
-
Small = 10M ratings, 20k items (better for loading directly
into memory)
-
note that if you only predict for the small dataset,
you’ll only have access to ~70% of the total
assignment marks (as not all the users in the large dataset appear
in the small one)
-
Large = 26M ratings, 54k items (will probably need a
database rather than naively loading to RAM)
-
The automated tests will give you a score for both datasets
respectively
-
Small dataset is easier to manage (less memory,
faster)
-
But, when predicting using the small dataset at most 70%
total mark can be achieved.
-
CSV formatted file (of ratings)
-
Comma delimited files (submitted on the handin
system)
-
Test set: userid, itemid, timestamp (seconds from 1970)
-
no ratings here, you have to compute them yourself!
-
Training set: userid, itemid, rating (float), timestamp (seconds from
1970)
-
Ratings (taken from the original dataset) are from 0.5 to
5.0 (in increments of 0.5)
-
Module Wiki >> Assignment >> Example package
(data sets are in .zips; these contain very large .csv
files):
-
The test example submission .csv has the format that your
program needs to produce
-
The other three files are the example database files (in
Python and Java)
-
It’s suggested to load the data into a database and
work with it there, keeping the memory footprint to
something e.g. a laptop can handle easily
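A minimal sketch of loading a ratings CSV into SQLite with Python’s standard library (my own illustration, not the provided example code; the file name “train.csv” and the column order are assumptions based on the format above, and it assumes no header row):
```python
import csv
import sqlite3

# Assumed file name and column order: userid, itemid, rating, timestamp (no header row)
conn = sqlite3.connect("ratings.db")
conn.execute("""CREATE TABLE IF NOT EXISTS ratings (
                    userid INTEGER, itemid INTEGER,
                    rating REAL, timestamp INTEGER)""")
with open("train.csv", newline="") as f:
    rows = ((int(u), int(i), float(r), int(t)) for u, i, r, t in csv.reader(f))
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", rows)
conn.execute("CREATE INDEX IF NOT EXISTS idx_user ON ratings(userid)")
conn.commit()

# Example query: one user's mean rating, without loading everything into memory
print(conn.execute("SELECT AVG(rating) FROM ratings WHERE userid = ?", (1,)).fetchone())
conn.close()
```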
< walkthrough of data files >
Submitting Computed Ratings
Module Wiki >> Assignment >> Submit (pretty
standard m8)
...When you submit you’ll get a standard handin receipt,
shortly followed by another email.
This confirms the test harness has queued the submission, and
tells you how many submissions you have remaining for your
group (it doesn’t matter who submits from the group):
...Then after your code is evaluated, you’ll receive
another email with the mean square error results for both the
small and large datasets (bear in mind though it’s only
the large one that counts for marking!):
Written Report
-
Note you can use any programming language you like; Java,
Python, R, whatever
-
The only thing that'll be auto-marked upon submission
is the generated .csv file
-
Report explaining how your recommender system works
-
PDF report
-
Academic paper style
-
Introduction, Method, Experiments, Discussions, Conclusion,
References
-
10 page max (including everything) - no more
-
Teams of 2 >> report on your recommender system
-
Teams of 3 >> code an additional recommender system
and compare this to the one you submitted results for
- Experiments
-
Compare your recommender system (in a rigorous way)
-
Ideally, don’t just report the mean squared error,
but rather conduct a set of experiments, maybe comparing the
metrics to some sensible baselines, e.g. the average or
random ratings from the test set (and if you’re in a
team of 3, you have 2 systems you can compare against)
-
Also you don’t have to limit yourself to mean squared
error; ideally explore a range of options
-
Compare using metrics you think are suitable for comparing
recommender systems
As for the marking scheme, it’s linked on the Wiki (on
ECS intranet), go check it out!
-
The .csv file itself is worth 37.5% of the marks (15
marks), where the predicted ratings are based on the MSE
scores produced
-
The report is worth 62.5% of the marks (25 marks)
Academic Integrity
-
Normal rules for late submissions
-
Handin submission date (of final submission) is the date
that matters
-
Your code will be manually inspected (for cheating
attempts)
-
It’s OK to re-use the provided database example code
-
Any other code must be clearly marked and acknowledged
(e.g. StackOverflow snippets)
-
Using standard libraries (e.g. numpy) is OK, but
recommender system specific libraries is not
-
Your submitted code and report must be your group’s own
original work
-
In case of doubt, feel free to ask (at coursework surgeries,
not by email)!
-
Violations, deliberate or otherwise, will be reported to
the Academic Integrity Officer.
-
You are allowed to take code snippets from things like
StackOverflow, as long as it’s clearly commented
within the code, giving credit to the source
-
You’re also allowed to use standard libraries, e.g.
NumPy and SciKit, though you’re not allowed to use
libraries/parts of libraries that are recommender
system-specific (i.e. you still need to implement the
recommender yourself)
Coursework Surgery Sessions
-
Drop in sessions for Q&A about coursework
-
Ratings submission questions
-
Coursework report questions
-
Recommender system questions
-
Academic integrity clarifications
-
Java or Python database code
-
But ... each group's code must be written only by that
group
-
Surgery sessions will NOT write your code for you
-
Surgery sessions will NOT debug your code for you
-
Surgery sessions will NOT design your code for you
-
Surgery sessions will offer general advice, that's
all
Coursework “Surgery” 1
Coursework “Surgery” 2
Coursework “Surgery” 3
Coursework “Surgery” 4
Guest Lecture: Davide Zilli from Mind Foundry
Guest Lecture: Mikhail Fain from Cookpad