TF-IDF

 TF-IDF stands for Term Frequency–Inverse Document Frequency. It’s a numerical statistic used in text mining and natural language processing (NLP) to measure how important a word is in a document relative to a collection of documents (corpus).

Think of it as a way to weigh words: common words (“the”, “and”) are less important, while rare but meaningful words get more weight.


1. Components

  1. Term Frequency (TF)
    Measures how often a word appears in a document.

    TF(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
  2. Inverse Document Frequency (IDF)
    Measures how unique or rare a word is across all documents.

    IDF(t) = \log \frac{\text{Total number of documents}}{\text{Number of documents containing term } t}
  3. TF-IDF Score
    Multiply TF and IDF to get the weight:

    TF\text{-}IDF(t,d) = TF(t,d) \times IDF(t)
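To make the formulas concrete, here is a minimal pure-Python sketch of the three components. The function names tf, idf, and tf_idf are just illustrative, not from any library, and this uses the unsmoothed definitions above.

import math

def tf(term, doc_tokens):
    # Term frequency: count of the term divided by the document length
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # Inverse document frequency: log of (total docs / docs containing the term)
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    # TF-IDF weight: the product of the two components
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

corpus = [doc.split() for doc in [
    "patient billing paid on time",
    "billing delayed patient follow-up",
    "patient insurance claims",
]]
print(tf_idf("claims", corpus[2], corpus))   # high weight: rare across the corpus
print(tf_idf("patient", corpus[2], corpus))  # zero weight: appears in every document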

2. Intuition

  • Words that appear frequently in a document but rarely across all documents are more important.

  • Example:

    • Corpus: 3 documents

      • Doc1: “patient billing paid on time”

      • Doc2: “billing delayed patient follow-up”

      • Doc3: “patient insurance claims”

    • Word “patient” appears in all docs → low IDF → low weight

    • Word “claims” appears in only one doc → high IDF → high weight
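Plugging the example into the IDF formula above (using the natural logarithm and the unsmoothed definition):

    IDF(\text{patient}) = \log \frac{3}{3} = 0
    IDF(\text{claims}) = \log \frac{3}{1} \approx 1.10

So “patient” contributes nothing to a document’s weight, while “claims” strongly distinguishes Doc3.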


3. Use Cases

  • Information retrieval / search engines → rank documents based on query relevance (see the sketch after this list)

  • Text classification / NLP tasks → convert text into numerical features for ML models

  • Keyword extraction → find the most important words in a document
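As a rough sketch of the first use case, the example corpus can be ranked against a query by cosine similarity of TF-IDF vectors. The query string "insurance claims" and the variable names are only for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["patient billing paid on time",
        "billing delayed patient follow-up",
        "patient insurance claims"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)                # one TF-IDF vector per document
query_vector = vectorizer.transform(["insurance claims"])   # reuse the fitted vocabulary

# Rank documents by cosine similarity to the query (highest first)
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")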


4. Implementation

In Python, you can use TfidfVectorizer from scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["patient billing paid on time",
        "billing delayed patient follow-up",
        "patient insurance claims"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)   # learn the vocabulary and compute TF-IDF weights

print(tfidf_matrix.toarray())                   # dense matrix of TF-IDF scores
print(vectorizer.get_feature_names_out())       # the word corresponding to each column

This will give you a matrix of TF-IDF scores, where each row is a document and each column is a word.
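Continuing from the snippet above, one convenient way to see which score belongs to which word is to wrap the matrix in a pandas DataFrame (assuming pandas is installed):

import pandas as pd

# Rows are documents, columns are vocabulary words
df = pd.DataFrame(tfidf_matrix.toarray(),
                  columns=vectorizer.get_feature_names_out())
print(df.round(3))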
