sem1/comp: add information_retrieval.{md,tex}

Callum R. Renwick 7 months ago
@@ -0,0 +1,45 @@
# Information Retrieval
## What is information retrieval?
**Information retrieval** is the study of how to retrieve and evaluate information, both through automated methods at scale and manually on a smaller scale. Naturally, much of it focuses on how to effectively retrieve information from the largest information repository we know of -- the internet (to be specific, usually the world wide web).
## Deciding what to retrieve
What rules do we use to decide whether a query matches a particular item from a set, and how well it matches?
We must take care when obtaining information from the internet because its quality varies greatly.
Some information on the internet is also not accessible without special authorisation: for example, Facebook pages can only be accessed if you are logged in with a Facebook account.
## Measuring the effectiveness of a search
Classically, there have been two main metrics of search effectiveness:
* Precision: what proportion of returned results are relevant
* Recall: what proportion of relevant results are in the return set
(These may look like mirror images of one another, but they are independent: a search can have high precision and low recall, or vice versa.)
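As a minimal sketch of the two metrics (the document ID sets below are hypothetical examples, not from the text):

```python
# Precision and recall computed over sets of document IDs.

def precision(returned: set, relevant: set) -> float:
    """Proportion of returned results that are relevant."""
    return len(returned & relevant) / len(returned)

def recall(returned: set, relevant: set) -> float:
    """Proportion of relevant results that were returned."""
    return len(returned & relevant) / len(relevant)

relevant = {1, 2, 3, 4}   # the items that actually match the query
returned = {3, 4, 5}      # the items the search returned

print(precision(returned, relevant))  # 2 of 3 returned are relevant
print(recall(returned, relevant))     # 2 of 4 relevant were returned
```

Note that improving one metric often worsens the other: returning everything maximises recall but ruins precision.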
These two metrics are needed to classify the effectiveness of a search method with regard to its ability to avoid the two main kinds of search error:
* Errors of omission: relevant items are not included in the return set
* Errors of commission: irrelevant items are included in the return set
In the internet age, there is a new problem: ranking results. Even if the level of commission is relatively low (say 90% relevant results), there are often hundreds of millions of results for a single query. So if any of the 10% of irrelevant results are ranked as highly relevant, the searcher will have to sift through many irrelevant results before they can find relevant ones.
## Keyword-based search models
The standard keyword search method is to return documents that contain the keyword. Documents that contain the keyword *more times* are ranked *more highly*.
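This ranking rule can be sketched as follows (the documents and query are hypothetical examples):

```python
# Keyword-count ranking: documents containing the keyword more times
# rank more highly; documents without the keyword are excluded.

def rank_by_keyword(docs, keyword):
    scored = [(doc.lower().split().count(keyword.lower()), doc) for doc in docs]
    # Keep only documents that contain the keyword, highest count first.
    return [doc for count, doc in sorted(scored, reverse=True) if count > 0]

docs = [
    "reindeer pull the sleigh",
    "rudolph the reindeer is a red-nosed reindeer",
    "a document about something else",
]
print(rank_by_keyword(docs, "reindeer"))  # two-occurrence document first
```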
### Boolean model
This can be improved upon by understanding certain words (e.g. "AND", "OR") as Boolean operators. This allows one to make a more complex search, such as "rudolph AND reindeer" for results containing both "rudolph" and "reindeer". Under this model, the technique of **stemming** is usually applied to the input keywords -- this means including all words that use the same linguistic stem as one of the keywords; for example, if the user input "snatching", the search would return documents that also match any of "snatch", "snatched", or "snatcher".
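A toy sketch of a Boolean AND query with stemming follows. The suffix-stripping stemmer here is a deliberately naive assumption for illustration, not a real algorithm such as Porter's:

```python
# Boolean AND matching with naive suffix-stripping stemming (toy example).

def stem(word):
    """Strip a common suffix so related word forms share one stem."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def matches_all(doc, keywords):
    """True if the document contains a stem match for every keyword (AND)."""
    doc_stems = {stem(w) for w in doc.lower().split()}
    return all(stem(k.lower()) in doc_stems for k in keywords)

docs = ["he snatched the prize", "the snatcher fled", "nothing relevant"]
query = ["snatching"]
print([d for d in docs if matches_all(d, query)])  # first two documents match
```

"snatching", "snatched", and "snatcher" all reduce to the stem "snatch", so a search for any one of them matches the others.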
The Boolean model allows more precise searches, but has a significant weakness: when searching the web, imprecise or "fuzzy" matching is often more useful than matching criteria precisely. Specifying criteria too precisely often risks excluding information that is highly relevant but does not quite match the query.
### Vector model (cosine similarity)
The vector model is another improvement on keyword search. This model represents each document in the search set as a vector. Each value in the vector is the frequency of occurrence of one of the keywords in the search query. For example, if the search was for the keywords "phanerozoic mesozoic cenozoic", and some document contained the word "phanerozoic" zero times, the word "mesozoic" twice and the word "cenozoic" three times, the vector for that document would be `[0, 2, 3]`.
Now suppose that another document, with a vector calculated using the same method, has the vector `[1, 0, 0]`. The similarity between those documents can be calculated using the **cosine similarity** formula (see [info_retrieval.tex](info_retrieval.tex)).
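Applying the cosine similarity formula to the two example vectors above, a minimal sketch:

```python
# Cosine similarity of two keyword-frequency vectors.
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||) for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc_a = [0, 2, 3]  # "phanerozoic" x0, "mesozoic" x2, "cenozoic" x3
doc_b = [1, 0, 0]
print(cosine_similarity(doc_a, doc_b))  # 0.0 -- no query keywords in common
```

The two example documents share no query keywords, so their vectors are orthogonal and their cosine similarity is zero.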


@@ -0,0 +1,44 @@
\title{Information Retrieval -- Cosine Similarity Function}
\section{Cosine Similarity Formula}
Given the vector dot product formula,
\[ A \cdot B = \lVert A \rVert \lVert B \rVert \cos \theta \]
we can find $\cos \theta$ by rearranging:
\[ \cos \theta = \frac{A \cdot B}{\lVert A \rVert \lVert B \rVert} \]
\textbf{Cosine similarity} is a measure of similarity between two non-zero
vectors. It is defined to be equal to the cosine of the angle between the
vectors. $\theta$ in the equation above is defined as the angle between
the vectors $A$ and $B$, so $\cos \theta$ is precisely the cosine
similarity we seek.
The dot product of two vectors can be calculated as the sum of the products
of their corresponding components, thus:
\[ A \cdot B = A_1 B_1 + A_2 B_2 + \cdots \]
And the magnitude of a vector can be calculated as the square root of the
sum of the squares of its components:
\[ \lVert A \rVert = \sqrt{{A_1}^2 + {A_2}^2 + \cdots} \]
So the formula for the cosine similarity of two vectors $A$ and $B$ becomes:
\[ \cos \theta = \frac{A_1 B_1 + A_2 B_2 + \cdots}{\sqrt{{A_1}^2 + {A_2}^2 + \cdots}\,\sqrt{{B_1}^2 + {B_2}^2 + \cdots}} \]
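As a worked check, using the example vectors $A = (0, 2, 3)$ and $B = (1, 0, 0)$ from the accompanying markdown notes:
\[ \cos \theta = \frac{0 \cdot 1 + 2 \cdot 0 + 3 \cdot 0}{\sqrt{0^2 + 2^2 + 3^2}\,\sqrt{1^2 + 0^2 + 0^2}} = \frac{0}{\sqrt{13} \cdot 1} = 0 \]
The vectors share no non-zero components, so they are orthogonal and their cosine similarity is zero.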