TF-IDF
TF-IDF stands for Term Frequency–Inverse Document Frequency. It’s a numerical statistic used in text mining and natural language processing (NLP) to measure how important a word is in a document relative to a collection of documents (corpus).
Think of it as a way to weigh words: common words (“the”, “and”) are less important, while rare but meaningful words get more weight.
1. Components
- Term Frequency (TF)
  TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
  Measures how often a word appears in a document.
- Inverse Document Frequency (IDF)
  IDF(t) = log(total number of documents / number of documents containing term t)
  Measures how unique or rare a word is across all documents.
- TF-IDF Score
  TF-IDF(t, d) = TF(t, d) × IDF(t)
  Multiply TF and IDF to get the final weight.
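To make the formulas concrete, here is a minimal plain-Python sketch (the helper names tf, idf, and tf_idf are my own, and it uses the raw formulas above with no smoothing):

import math

def tf(term, doc_tokens):
    # times term appears in the document / total terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, tokenized_docs):
    # log(total documents / documents containing the term);
    # assumes the term occurs in at least one document
    containing = sum(1 for doc in tokenized_docs if term in doc)
    return math.log(len(tokenized_docs) / containing)

def tf_idf(term, doc_tokens, tokenized_docs):
    # TF-IDF(t, d) = TF(t, d) * IDF(t)
    return tf(term, doc_tokens) * idf(term, tokenized_docs)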
2. Intuition
- Words that appear frequently in a document but rarely across all documents are more important.
- Example:
  - Corpus: 3 documents
    - Doc1: “patient billing paid on time”
    - Doc2: “billing delayed patient follow-up”
    - Doc3: “patient insurance claims”
  - The word “patient” appears in all docs → low IDF → low weight
  - The word “claims” appears in only one doc → high IDF → high weight
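Plugging the numbers into the sketch from section 1 (natural log, as in the raw formula):

docs = ["patient billing paid on time",
        "billing delayed patient follow-up",
        "patient insurance claims"]
tokenized = [d.split() for d in docs]

print(idf("patient", tokenized))                  # log(3/3) = 0.0   -> zero weight
print(idf("claims", tokenized))                   # log(3/1) ~ 1.099 -> high weight
print(tf_idf("claims", tokenized[2], tokenized))  # (1/3) * 1.099 ~ 0.366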
3. Use Cases
- Information retrieval / search engines → rank documents by relevance to a query
- Text classification / NLP tasks → convert text into numerical features for ML models
- Keyword extraction → find the most important words in a document (see the example at the end of section 4)
4. Implementation
In Python, you can use TfidfVectorizer from scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["patient billing paid on time",
        "billing delayed patient follow-up",
        "patient insurance claims"]

# Learn the vocabulary and compute the TF-IDF weight of each word in each document
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

print(tfidf_matrix.toarray())               # one row per document, one column per word
print(vectorizer.get_feature_names_out())   # the word behind each column
This will give you a matrix of TF-IDF scores, where each row is a document and each column is a word.
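As a sketch of the keyword-extraction use case from section 3: the column with the highest score in each row is that document’s most distinctive word. (Note that scikit-learn’s TfidfVectorizer applies smoothed IDF and L2 normalization by default, so its scores differ slightly from the raw formulas in section 1.)

import numpy as np

feature_names = vectorizer.get_feature_names_out()
for i, row in enumerate(tfidf_matrix.toarray()):
    # highest TF-IDF score in this row = top keyword for this document
    print(f"Doc{i + 1} top keyword: {feature_names[np.argmax(row)]}")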