What Makes a Word a "Keyword"?
In the field of Natural Language Processing, one of the most fundamental (and fascinating) tools you learn in your first few weeks of an NLP class is the notion of "tf-idf weighting". When I've led classes/workshops on NLP, one of my favorite things to do is to lay out the following high-level pieces of the tf-idf "puzzle", to nudge students towards developing their own tf-idf equation from scratch, drawing solely on their intuition rather than on an equation tossed at them out of nowhere.
The overarching goal is an algorithm that a computer could use to detect "key" words within a text. For example, consider the first sentence in the Wikipedia article for "basketball":
Basketball is a team sport in which two teams compete with the primary objective of shooting a basketball through the defender's hoop, while preventing the opposing team from shooting through their own hoop.^[I've simplified the sentence a lot for ease of reading here!]
The challenge is, how can we program a computer to derive information about the relative importance of individual words in that sentence with respect to the concept of basketball? When you read that sentence, your brain is likely able to pick out:
- Words like "sport", "teams", and "hoop" as being directly pertinent to the semantic meaning of basketball,
- Words like "is", "of", or "through" as being more generally useful across many different topics, and
- Words like "objective" and "preventing" as being somewhat in-between these two extremes: pertinent to a wider range of concepts than just basketball or sports (in contrast with e.g. "hoop"), but not as abstract/general as syntactic connectors like "is" or "of".
This vague categorization of words within a sentence is "intuitive" for most English speakers (it's a task we perform many times per day, with little need for conscious attention most of the time, whenever we read), yet it turns out to be far from trivial to take this semi-conscious process and turn it into an explicit algorithm that a computer can carry out!
Two Halfway-There Attempts
To get a sense for this, consider two approaches that might initially come to mind, but which turn out to be not so helpful upon further reflection:
- Attempt 1: A word which appears a bunch of times in the "Basketball" article must be very relevant to the concept of basketball, so let's draw on that intuition and just use the frequency of a word $w$ in a document $d$ as our measure of the importance of $w$ with respect to $d$.
  - Thus, for example, if 20% of all the words in $d$ are $w_1$, but only 1% of the words in $d$ are $w_2$, then we'll estimate that $w_1$ is 20 times more central to the concept of $d$ than $w_2$.
The problem with Attempt 1 is that, since the document is written in natural-language English, the most frequent words are likely to be words like "the", "but", or "and", so that Attempt 1 will likely just say that these three words are the most important words for understanding e.g. basketball, vacuum cleaners, King Louis XVI, and everything else…
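To see this failure mode concretely, here is a minimal Python sketch of Attempt 1 (the toy `doc` string is just a shortened stand-in for the Basketball article, and the names are illustrative): it scores each word by its relative frequency within the document, and the glue word "the" promptly floats to the top.

```python
from collections import Counter

# Attempt 1 (sketch): score each word by its relative frequency within a
# single document. `doc` is a toy stand-in for the Basketball article.
doc = (
    "basketball is a team sport in which two teams compete with the "
    "primary objective of shooting a basketball through the defender's "
    "hoop while preventing the opposing team from shooting through "
    "their own hoop"
)

tokens = doc.split()
counts = Counter(tokens)

# Importance of word w with respect to this document = tf(w, d)
tf = {w: c / len(tokens) for w, c in counts.items()}

# The top of the ranking is led by "the", with topical words like
# "basketball" and "hoop" mixed in among other glue words.
for w, score in sorted(tf.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{w:12s} {score:.3f}")
```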
So, it seems like we're going to have to "zoom out" from just the Basketball article, and take into account the distribution of words across English more broadly. One way we might try to achieve this is to place high weight on words which are unique to the Basketball article:
- Attempt 2: To avoid placing high importance on words which appear frequently across English in general, let's instead construct a measure of word $w$'s importance with respect to document $d$ that diminishes as we see $w$ appearing in more and more other documents besides $d$: let's try $\frac{1}{|\{d' : w \in d'\}|}$, where $\{d' : w \in d'\}$ is the set of all documents which contain $w$ at least once, and $|\cdot|$ is the cardinality operator, so that this becomes one over the total number of documents containing $w$.
  - For example, if the word "hoop" appears only in the articles for "Basketball" and "Hoop Rolling", its importance score will be $\frac{1}{2}$, whereas if the word "and" appears in 1000 documents, its importance score will be much lower at $\frac{1}{1000}$.
The problems with Attempt 2 are… perhaps more subtle than the problems with Attempt 1. But one that you may notice immediately is that this approach will view rare typos within a document as immensely important for understanding that document! For example, if "and" appears in 1000 documents, but the typo "annd" appears only in the "Basketball" article, this approach will assign "annd" the highest possible importance score relative to the concept of basketball.
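Here is the matching sketch of Attempt 2 (again with a purely illustrative three-document "corpus" and a made-up helper name, `attempt2_score`): it scores a word as one over the number of documents containing it, and the typo "annd" duly receives the maximum possible score.

```python
# Attempt 2 (sketch): score word w as 1 / (number of documents containing w).
# The three tiny "documents" below are purely illustrative.
corpus = {
    "Basketball":   "basketball is a team sport played with a hoop and a ball annd two teams",
    "Hoop Rolling": "hoop rolling is a game played with a hoop and a stick",
    "Tea":          "tea is a drink made by steeping leaves and hot water",
}

def attempt2_score(word, corpus):
    """1 / |{d : d contains word}|, i.e. one over the document frequency."""
    df = sum(1 for text in corpus.values() if word in text.split())
    return 1.0 / df if df else 0.0

# "hoop" appears in 2 documents (score 0.5), "and" and "is" appear in all
# 3 (score 0.333), while the typo "annd" appears only in the Basketball
# document and so gets the highest possible score of 1.0.
for word in ["hoop", "and", "is", "annd"]:
    print(f"{word:6s} {attempt2_score(word, corpus):.3f}")
```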
Synthesizing Two Halfway-Theres into a Wholeway-There!
tf-idf roughly boils down to just term frequency / document frequency, where we wrap a log around the document-frequency half to take into account how the collection of all possible words may be much larger (and more skewed in distribution!) than the collection of documents in a corpus.
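In the standard formulation (variants differ in normalization and smoothing details), this works out to $\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \log\frac{N}{|\{d' : w \in d'\}|}$, where $N$ is the total number of documents in the corpus. The sketch below (reusing the same kind of toy corpus and illustrative names as the snippets above) combines the two halfway-there attempts into that weighting:

```python
import math

# Synthesis (sketch): tfidf(w, d) = tf(w, d) * log(N / df(w)), where N is
# the number of documents in the corpus and df(w) is the number of
# documents containing w. The corpus is again purely illustrative.
corpus = {
    "Basketball":   "basketball is a team sport played with a hoop and a ball",
    "Hoop Rolling": "hoop rolling is a game played with a hoop and a stick",
    "Tea":          "tea is a drink made by steeping leaves in hot water",
}

def tf(word, text):
    """Relative frequency of `word` within a single document (Attempt 1)."""
    tokens = text.split()
    return tokens.count(word) / len(tokens)

def idf(word, corpus):
    """The 'zoomed-out' half: log(N / df), which punishes corpus-wide words."""
    df = sum(1 for text in corpus.values() if word in text.split())
    return math.log(len(corpus) / df) if df else 0.0

def tfidf(word, doc_name, corpus):
    return tf(word, corpus[doc_name]) * idf(word, corpus)

# Within the Basketball document, "basketball" and "sport" now outrank the
# glue words "is" and "a", whose idf is log(3/3) = 0.
for word in ["basketball", "sport", "hoop", "is", "a"]:
    print(f"{word:12s} {tfidf(word, 'Basketball', corpus):.3f}")
```

One way to read the log: going from appearing in 1 document to 2 documents cuts a word's idf far more than going from 999 to 1000 does, which dampens the long, skewed tail of rare words without letting corpus-wide words dominate.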