TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It’s widely used in information retrieval, text mining, and SEO to evaluate the significance of a term relative to a set of documents. It down-weights commonly used words (e.g., “the,” “and”) and surfaces terms that are more relevant or distinctive to a particular document.

TF-IDF Formula
The formula combines two components:
- Term Frequency (TF): This measures how often a term appears in a document, normalized by the document’s length:

  TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

- Inverse Document Frequency (IDF): This measures how important a term is across all documents in the corpus. The idea is to give less weight to common words that appear in many documents and more weight to rarer terms:

  IDF(t) = log(total number of documents / number of documents containing term t)

  The logarithmic scale is used to dampen the effect of very common terms.

- TF-IDF Score: Multiplying TF and IDF gives the TF-IDF score for a word in a document, indicating how important the word is to that document relative to the corpus:

  TF-IDF(t, d) = TF(t, d) × IDF(t)
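The formulas above can be sketched in a few lines of Python (a minimal illustration; the function names are my own, and the base-10 logarithm matches the worked examples in this article):

```python
import math

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term / total tokens in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # corpus is a list of token lists; assumes the term appears in at least one document
    containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)
```

A real system would also handle tokenization, case folding, and terms that appear in no document; this sketch covers only the arithmetic.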

How TF-IDF Works: Step-by-Step
- Term Frequency (TF) Calculation:
- Count how often the term occurs in the document and divide by the document’s length. For instance, if the word “SEO” appears 10 times in a document with 100 words, its term frequency is:

- TF(SEO) = 10/100 = 0.1
- This means “SEO” constitutes 10% of the words in that document.
- Inverse Document Frequency (IDF) Calculation:
- Next, IDF reflects how common or rare the word is across a set of documents. For example, if “SEO” appears in 100 out of 1000 documents, the IDF (using a base-10 logarithm here) is:

- IDF(SEO) = log(1000/100) = log(10) = 1
- This gives a weight that reflects the relative importance of the word across the entire corpus.
- Calculating TF-IDF:
- Multiplying the two values gives the final TF-IDF score:

- TF-IDF(SEO, doc) = 0.1 × 1 = 0.1
- This score represents the importance of the word “SEO” in this specific document relative to the rest of the corpus.
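These steps can be checked directly in Python (note the base-10 logarithm, which is what makes log(10) equal 1):

```python
import math

tf_seo = 10 / 100                 # "SEO" appears 10 times in a 100-word document
idf_seo = math.log10(1000 / 100)  # "SEO" appears in 100 of 1000 documents
score = tf_seo * idf_seo
print(tf_seo, idf_seo, score)     # 0.1 1.0 0.1
```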
Example of TF-IDF in Action
Imagine we have three documents:
- Document 1: “SEO is important for digital marketing.”
- Document 2: “SEO strategies help improve search engine rankings.”
- Document 3: “Digital marketing includes SEO and content marketing.”
Now let’s calculate TF-IDF for the term “SEO” in these three documents.
Step 1: Calculate TF (Term Frequency)
- Document 1: “SEO” appears once in a document of 6 words. TF = 1/6 ≈ 0.167.
- Document 2: “SEO” appears once in a document of 7 words. TF = 1/7 ≈ 0.143.
- Document 3: “SEO” appears once in a document of 7 words. TF = 1/7 ≈ 0.143.
Step 2: Calculate IDF (Inverse Document Frequency)
- “SEO” appears in all 3 documents.

- Therefore, IDF is:
- IDF(SEO) = log(3/3) = log(1) = 0
Since the term appears in every document, its IDF score is zero, meaning it does not help differentiate between documents in this corpus.
Step 3: Calculate TF-IDF
- Document 1: TF-IDF(SEO) = 0.167 × 0 = 0
- Document 2: TF-IDF(SEO) = 0.143 × 0 = 0
- Document 3: TF-IDF(SEO) = 0.143 × 0 = 0
In this case, the term “SEO” carries no weight at all because it appears in every document. Other terms, such as “content” or “rankings,” would have nonzero TF-IDF scores because they appear in fewer documents across the corpus.
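The whole worked example fits in a short script (tokenizing naively by lowercasing and keeping alphabetic words; this is an illustration, not a full tokenizer):

```python
import math
import re

docs = [
    "SEO is important for digital marketing.",
    "SEO strategies help improve search engine rankings.",
    "Digital marketing includes SEO and content marketing.",
]

# Naive tokenization: lowercase, keep alphabetic runs only
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

df = sum(1 for doc in tokenized if "seo" in doc)  # document frequency of "seo"
idf_seo = math.log10(len(tokenized) / df)         # log10(3/3) = 0.0

for i, doc in enumerate(tokenized, start=1):
    tf_seo = doc.count("seo") / len(doc)
    print(f"Document {i}: TF-IDF(seo) = {tf_seo * idf_seo}")  # 0.0 for every document
```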
Why TF-IDF is Useful
- Identifying Important Keywords: TF-IDF helps surface important but less common keywords in documents, avoiding generic terms like “the” or “and.”
- Improving SEO: In SEO, TF-IDF can help discover keywords that competitors are using effectively but are underrepresented in your content.
- Document Comparison: It helps quantify how similar or different documents are based on their key terms.
- Text Summarization: TF-IDF can be used to extract the most important sentences or terms in a document, which is useful for generating summaries.
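Document comparison, for example, is commonly done by representing each document as a vector of TF-IDF scores over a shared vocabulary and measuring the cosine similarity between vectors. A minimal sketch with hand-rolled helpers (the names are my own; libraries such as scikit-learn provide optimized equivalents):

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus, vocab):
    # One TF-IDF weight per vocabulary term, in vocab order
    counts = Counter(doc_tokens)
    vec = []
    for term in vocab:
        tf = counts[term] / len(doc_tokens)
        df = sum(1 for d in corpus if term in d)
        idf = math.log10(len(corpus) / df) if df else 0.0
        vec.append(tf * idf)
    return vec

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Documents that share many distinctive terms score near 1; documents that share only ubiquitous terms score near 0, because those terms have an IDF of zero and contribute nothing to the vectors.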
Limitations of TF-IDF
- Context Sensitivity: TF-IDF doesn’t consider the meaning or context of words. For example, it treats homonyms (words with the same form but different meanings) as a single term.
- Handling of Synonyms: Synonyms aren’t accounted for in TF-IDF, which means related terms might not be weighted together.
- Static Weighting: TF-IDF works well for static documents but doesn’t adapt dynamically to new content unless recalculated.
Conclusion
TF-IDF is a powerful tool for determining the relevance of a term within a document relative to a larger corpus. It balances common terms against rare ones, helping to highlight words that carry more meaning in a specific context. Despite its limitations, TF-IDF is widely used in information retrieval, SEO analysis, and various natural language processing applications to help improve the focus and relevance of content.