TF-IDF

intermediate
Techniques
Last updated: 2025-01-15
Also known as: Term Frequency-Inverse Document Frequency

What is TF-IDF?


TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates how important a term is to a document within a collection of documents. It combines two factors: term frequency (how often a term appears in a document) and inverse document frequency (how rare the term is across all documents). Terms that appear frequently in a document but rarely across the collection receive high TF-IDF scores; such terms are distinctive of that document and useful for characterizing its content.


The calculation multiplies term frequency (TF) by inverse document frequency (IDF). TF measures how often a term appears in a document, often normalized by document length. IDF measures term rarity, calculated as the logarithm of the total number of documents divided by the number of documents containing the term. The product gives higher weights to terms that are common in the specific document but rare in the overall collection, helping identify the document's distinctive content.
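The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it uses raw-count TF normalized by document length and the plain log(N / df) form of IDF described here; real systems often add smoothing to avoid division by zero for unseen terms, and the corpus and function names are invented for the example.

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF of a term in one document, given the whole corpus.

    doc is a list of tokens; corpus is a list of such documents.
    Assumes the term occurs in at least one document (no smoothing).
    """
    # Term frequency: raw count normalized by document length.
    tf = doc.count(term) / len(doc)
    # Document frequency: how many documents contain the term.
    df = sum(1 for d in corpus if term in d)
    # Inverse document frequency: log of total docs over docs with the term.
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "and", "the", "dog"],
]
# "the" appears in every document, so its IDF (and TF-IDF) is zero.
print(tf_idf("the", corpus[0], corpus))  # 0.0
# "sat" is unique to the first document, so it scores highest there.
print(tf_idf("sat", corpus[0], corpus))  # ≈ 0.366
```

Note how the common word "the" is weighted to zero while the rare word "sat" scores highest, which is exactly the behavior the product of TF and IDF is designed to produce.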


TF-IDF has been a foundational technique in information retrieval for decades, used for document ranking, keyword extraction, and document similarity computation. While more sophisticated methods like BM25 and neural embeddings have largely superseded TF-IDF for retrieval tasks, it remains useful for tasks like identifying important keywords in documents, computing simple document similarity, and as a component in hybrid retrieval systems. Understanding TF-IDF provides insight into the principles behind more advanced sparse retrieval methods.
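One of the uses mentioned above, simple document similarity, can be sketched by representing each document as a TF-IDF vector over a shared vocabulary and comparing vectors with cosine similarity. The corpus, vocabulary construction, and helper names below are illustrative assumptions; a zero-IDF guard is added so terms absent from the corpus do not divide by zero.

```python
import math

def tfidf_vector(doc, corpus, vocab):
    """TF-IDF vector for one document over a fixed vocabulary."""
    n = len(corpus)
    vec = []
    for term in vocab:
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in corpus if term in d)
        # Guard against terms that appear in no document.
        vec.append(tf * math.log(n / df) if df else 0.0)
    return vec

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

corpus = [
    ["cats", "chase", "mice"],
    ["cats", "chase", "birds"],
    ["stocks", "rose", "sharply"],
]
vocab = sorted({t for d in corpus for t in d})
vectors = [tfidf_vector(d, corpus, vocab) for d in corpus]
print(cosine(vectors[0], vectors[1]))  # positive: shared terms
print(cosine(vectors[0], vectors[2]))  # 0.0: no terms in common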


Related Terms