TF-IDF
TF-IDF stands for Term Frequency–Inverse Document Frequency. It’s a numerical statistic used in text mining and natural language processing (NLP) to measure how important a word is in a document relative to a collection of documents (corpus).
Think of it as a way to weigh words: common words (“the”, “and”) are less important, while rare but meaningful words get more weight.
1. Components
- Term Frequency (TF)
  TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
  Measures how often a word appears in a document.
- Inverse Document Frequency (IDF)
  IDF(t) = log(total number of documents / number of documents containing term t)
  Measures how unique or rare a word is across all documents.
- TF-IDF Score
  TF-IDF(t, d) = TF(t, d) × IDF(t)
  Multiply TF and IDF to get the final weight.
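To make the formulas concrete, here is a minimal plain-Python sketch (the helper names tf, idf, and tf_idf are my own, and it uses the raw formulas above with no smoothing):

import math

def tf(term, doc_tokens):
    # times term appears in the document / total terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, tokenized_docs):
    # log(total documents / documents containing the term);
    # assumes the term occurs in at least one document
    containing = sum(1 for doc in tokenized_docs if term in doc)
    return math.log(len(tokenized_docs) / containing)

def tf_idf(term, doc_tokens, tokenized_docs):
    # TF-IDF(t, d) = TF(t, d) * IDF(t)
    return tf(term, doc_tokens) * idf(term, tokenized_docs)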
2. Intuition
- Words that appear frequently in a document but rarely across all documents are more important.
- Example:
  - Corpus: 3 documents
    - Doc1: “patient billing paid on time”
    - Doc2: “billing delayed patient follow-up”
    - Doc3: “patient insurance claims”
  - The word “patient” appears in all docs → low IDF → low weight
  - The word “claims” appears in only one doc → high IDF → high weight
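Plugging the numbers into the sketch from section 1 (natural log, as in the raw formula):

docs = ["patient billing paid on time",
        "billing delayed patient follow-up",
        "patient insurance claims"]
tokenized = [d.split() for d in docs]

print(idf("patient", tokenized))                  # log(3/3) = 0.0   -> zero weight
print(idf("claims", tokenized))                   # log(3/1) ~ 1.099 -> high weight
print(tf_idf("claims", tokenized[2], tokenized))  # (1/3) * 1.099 ~ 0.366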
3. Use Cases
- Information retrieval / search engines → rank documents by relevance to a query
- Text classification / NLP tasks → convert text into numerical features for ML models
- Keyword extraction → find the most important words in a document (see the example at the end of section 4)
4. Implementation
In Python, you can use TfidfVectorizer from scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["patient billing paid on time",
        "billing delayed patient follow-up",
        "patient insurance claims"]

# Learn the vocabulary and compute the TF-IDF weight of each word in each document
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

print(tfidf_matrix.toarray())               # one row per document, one column per word
print(vectorizer.get_feature_names_out())   # the word behind each column
This will give you a matrix of TF-IDF scores, where each row is a document and each column is a word.
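As a sketch of the keyword-extraction use case from section 3: the column with the highest score in each row is that document’s most distinctive word. (Note that scikit-learn’s TfidfVectorizer applies smoothed IDF and L2 normalization by default, so its scores differ slightly from the raw formulas in section 1.)

import numpy as np

feature_names = vectorizer.get_feature_names_out()
for i, row in enumerate(tfidf_matrix.toarray()):
    # highest TF-IDF score in this row = top keyword for this document
    print(f"Doc{i + 1} top keyword: {feature_names[np.argmax(row)]}")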