What does Term Frequency-Inverse Document Frequency (TF-IDF) mean in SEO?
TF-IDF stands for term frequency-inverse document frequency, and I can already hear you tuning out. Yes, it is a very mathematical thing, and if you want to get mathematical then Wikipedia has all the equations.
I'm going to talk about how you can use the concept in everyday search engine optimization, and the basics are this: for any given search term, who is ranking on Google's first page? Look at their content and see what words they are using.
Let's take the phrase "best places to visit in new york".
At the time I wrote this, one of the top-ranking articles was from U.S. News: "Best Things to do in New York".
OK, so straight off that didn't match my search exactly. I wanted places to visit and Google gave me things to do. The words 'places' and 'visit' aren't even in the title of the U.S. News article.
However on the article page:
- 'places' is used 7 times - a term frequency of 7
- 'visit' is used 51 times - a term frequency of 51
There are two things going on here. First, Latent Semantic Indexing - Google understands that 'places to visit' can mean the same thing as 'things to do'. Second, the words I used in my search are mentioned frequently in the article; that's term frequency.
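Now if you've been working in SEO for a while you'll be thinking "Isn't this just keyword density with a fancy title?" and you would be on to something. Term frequency on its own is keyword density by another name: how often a word is used compared to all the other words on the page. The inverse document frequency part is the extra bit. It looks at how common that word is across lots of other documents, so a word that turns up everywhere counts for less than a word that is distinctive to the page.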
For our U.S. News article the word visit makes up nearly 1% of the text. 'New York' makes up over 1% with its 71 mentions.
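If you want to run these numbers yourself, here is a minimal Python sketch. It assumes you have already copied the page text into a plain string. The function names (`tokenize`, `term_frequency`, `keyword_density`) and the sample text are my own, made up for illustration and not taken from the U.S. News article. Density here is simply mentions divided by total words, which is how the percentages above work out for an article of around 6,000 words.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def term_frequency(text, phrase):
    """Count how many times a word or multi-word phrase appears in the text."""
    words = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return 0
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)

def keyword_density(text, phrase):
    """Mentions of the phrase divided by total words, as a percentage."""
    words = tokenize(text)
    if not words:
        return 0.0
    return 100.0 * term_frequency(text, phrase) / len(words)

# Made-up sample text, 16 words long.
sample = ("Visit New York. The best places to visit in New York "
          "are the places locals visit.")

print(term_frequency(sample, "visit"))     # 3
print(term_frequency(sample, "New York"))  # 2
print(keyword_density(sample, "visit"))    # 18.75

# The article figures: 51 and 71 mentions in roughly 6,000 words.
print(round(100 * 51 / 6000, 2))  # 0.85
print(round(100 * 71 / 6000, 2))  # 1.18
```

A 16-word sample gives a silly density, of course. On a real page you would paste in the full article text.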
Isn't keyword density dead?
No, it's very much alive, just more complicated than it used to be, which I'll explain in a moment. Term Frequency-Inverse Document Frequency is very much a new label that makes something everyone was dismissing as old school more respectable and acceptable - similar thing, new label, but with a bit more math and a little more polish.
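It's like keyword density after it went to school, so TF-IDF is better at not getting distracted by stop words like 'the', 'a' and 'and'. Those words appear in almost every document, so the inverse document frequency part marks them down.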
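To see why stop words drop out, here is a small sketch of the textbook calculation in Python. It is an illustration under my own assumptions, not what Google runs: the three "pages" are made-up one-liners, the `tf_idf` function is my own, and real tools often use smoothed variants of the formula.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = times the term appears in the document / words in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Count how many documents each term appears in.
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    scores = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Three made-up "pages" standing in for a document collection.
pages = [
    "the best places to visit in new york",
    "the best pizza in chicago",
    "the history of the london underground",
]

scores = tf_idf(pages)
for term, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term:8s} {score:.3f}")
```

On the first page, 'the' scores exactly zero because it appears in all three documents, while 'visit' and 'york' score highest because no other page uses them.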
But call it keyword density or TF-IDF, it is only giving you half the story, and it is definitely not a simple matter of achieving a magic percentage and hitting the rankings jackpot. There are other things at play:
- Latent Semantic Analysis - what other words are in use that mean the same as 'visit'? 'go' is mentioned 17 times. Using alternatives is a signal to search engines that a piece of content is well written.
- Entity Analysis - what words should be on a page about visiting things in New York and are they there? 'Empire State Building', 'Statue of Liberty', etc. These words tell search engines that this article is really about the entity (New York) that it claims to be. 'tickets', 'queue', etc. tell search engines that another entity is 'visiting'.
- Article length - at around 6,000 words if all the other boxes are ticked (well written, other keywords are in use) this is a sign that we're looking at a comprehensive bit of content that is highly likely to please people.
Then of course there are things like how many links the article gets from other websites but I'm just concentrating here on how a piece of content stacks up on its own.
How can you use TF-IDF in SEO?
I approach any piece of content this way:
- Run a basic keyword density check - basically because many writers don't use their central keyword or keyword phrase enough. A low figure is also a good sign that there may be waffle or that the text has gone off topic at certain points. It surprises me sometimes how close writers get to a keyword density of zero without even realizing it.
- Check a service like thesaurus.com to get ideas for alternative words. This will not only make your text a better read but give you extra brownie points from the search engines.
- Look at what related words the current top ranking pages use - and make sure you are using them as well to strengthen the entity signals to search engines.
- If your page has been out there a while check Google Search Console to see what words you are currently ranking for but not getting clicks - could you add these to the content or even use them to help extend and expand your content even further? This is low hanging fruit - Google is saying, "You're nearly there" and more importantly "I don't have much else to show for this search term so help me out".
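The "what related words are the top pages using" step can be roughed out in a few lines of Python. This is only a sketch under my own assumptions: you have pasted your text and the competitor texts in as plain strings, the stop-word list is a tiny illustrative one, and `content_words` and `missing_terms` are names I made up. The sample texts are invented too.

```python
import re
from collections import Counter

# A short, illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "to", "in", "of", "is", "for", "on", "at"}

def content_words(text):
    """Lowercase, split into words and drop the stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def missing_terms(my_text, competitor_texts, min_pages=2):
    """Words used on at least `min_pages` competitor pages but not on mine."""
    mine = set(content_words(my_text))
    pages_using = Counter()
    for text in competitor_texts:
        # set() so each page counts a word once, however often it uses it.
        pages_using.update(set(content_words(text)))
    gaps = [(term, n) for term, n in pages_using.items()
            if n >= min_pages and term not in mine]
    # Most widely used first, then alphabetical.
    return sorted(gaps, key=lambda item: (-item[1], item[0]))

# Made-up texts standing in for my page and three top-ranking pages.
my_page = "Places to visit in New York: Central Park and Times Square."
competitors = [
    "Visit the Empire State Building and buy tickets for the Statue of Liberty.",
    "Tickets for the Empire State Building sell out, so visit Central Park first.",
    "The Statue of Liberty ferry leaves from Battery Park.",
]

for term, n in missing_terms(my_page, competitors):
    print(f"{term}: used on {n} competitor pages, missing from mine")
```

With these samples the gaps are 'building', 'empire', 'liberty', 'state', 'statue' and 'tickets'. It works on single words only, so you still have to read the list with a human eye and spot that three of them add up to 'Empire State Building'.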
All this is going to give you great content for search engines because you have produced a great read for humans. Write better, more comprehensive content than what is already out there and don't be surprised if you can outrank pages with better backlinks and authority because Google can see you are a whole lot more relevant.