2 Comments
Yograj thakur

I liked the post and read it line by line, but I didn't get the formula. It would be nice if you could explain it in depth.

Subhrajyoty Roy

There are 3 steps.

Step 1: Calculate the term frequency (tf) for a word. The term frequency of a word is the number of times the term (word) appears in the text being scored. In this example the count is taken over the whole dataset, so tf("like") = 34 means that all the documents together contain the word "like" 34 times.

Step 2: Calculate the document frequency (df) for a word. The document frequency of a word is the number of documents that contain that word. So, df("like") = 5 means that 5 documents in the dataset contain the word "like".

Step 3: Put everything into the tf-idf formula.

tf-idf(word) = tf(word) * log(N / df(word)).

So if there are N = 100 documents in total, then for the above example,

tf-idf("like") = 34 * log(100 / 5) = 34 * log(20) ≈ 44.24 (taking log as base 10; log(20) ≈ 1.301)
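
If it helps, here is a minimal Python sketch of the three steps. The function names `tf_idf_from_counts` and `tf_idf`, the whitespace tokenization, and the toy documents are illustrative assumptions, not anything from the post. It also assumes a base-10 logarithm, since that is the base that matches the worked number above, and it counts tf over all documents together as in Step 1.

```python
import math


def tf_idf_from_counts(tf, df, n_docs):
    """Step 3: tf * log10(N / df), assuming a base-10 logarithm."""
    if df == 0:
        # The word appears in no document, so there is nothing to score.
        return 0.0
    return tf * math.log10(n_docs / df)


def tf_idf(word, documents):
    """Score `word` over a list of document strings (hypothetical helper)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    # Step 1: total number of occurrences across all documents.
    tf = sum(tokens.count(word) for tokens in tokenized)
    # Step 2: number of documents that contain the word at least once.
    df = sum(1 for tokens in tokenized if word in tokens)
    return tf_idf_from_counts(tf, df, len(tokenized))


# Reproduce the worked example: tf = 34, df = 5, N = 100.
print(round(tf_idf_from_counts(34, 5, 100), 2))  # 44.24

# A toy dataset: "like" appears 3 times, in 2 of the 4 documents.
docs = ["i like tea", "you like like coffee", "we drink water", "they read books"]
print(tf_idf("like", docs))
```

Note that many libraries use the natural logarithm and a per-document term frequency instead, so their numbers will differ from the one above.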
