Rand Fishkin, along with Mike King, may have published one of the biggest data leaks, outside of what the Department of Justice revealed, around Google Search and its internal ranking features and signals. The document came from an anonymous source (no longer anonymous, see below), was verified by Rand Fishkin, and contains a ton of details on how Google Search reportedly works.

More importantly, it seems to contradict a number of statements made over the past two decades by numerous Google Search employees, as I have covered here in the past.

I have not gone through it all yet, but I felt it was important for you all to read this yourself. You can see the details at these headlines:

Rand wrote, “Many of their claims directly contradict public statements made by Googlers over the years, in particular the company’s repeated denial that click-centric user signals are employed, denial that subdomains are considered separately in rankings, denials of a sandbox for newer websites, denials that a domain’s age is collected or considered, and more.”

Mike King wrote, “I have reviewed the API reference docs and contextualized them with some other previous Google leaks and the DOJ antitrust testimony. I’m combining that with the extensive patent and whitepaper research done for my upcoming book, The Science of SEO. While there is no detail about Google’s scoring functions in the documentation I’ve reviewed, there is a wealth of information about data stored for content, links, and user interactions. There are also varying degrees of descriptions (ranging from disappointingly sparse to surprisingly revealing) of the features being manipulated and stored. You’d be tempted to broadly call these “ranking factors,” but that would be imprecise.”

Aleyda Solis posted a quick summary on X covering part of the leak:

  • There are 14K ranking features and more in the docs
  • Google has a feature they compute called “siteAuthority”
  • Navboost has a specific module entirely focused on click signals, representing users as voters, with their clicks stored as their votes
  • Google stores which result has the longest click during the session
  • Google has an attribute called hostAge that is used specifically “to sandbox fresh spam in serving time”
  • One of the modules related to page quality scores features a site-level measure of views from Chrome

I have not had time to go through everything yet; I will do that over the next several days.

I have also not seen any Googler publicly comment on this yet – I know it is new and I don’t know if we will see any Googler comment on this.

This reminds me a bit of the Yandex search ranking leak.

Update: Google has confirmed with me that the data leak is real but urged caution when making assumptions on how and if Google uses what is in these documents. Google told me:

“We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information. We’ve shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation.”

Here are some posts on social about this. Again, this has only been out for a few hours, and no one but Rand and Mike has had any real time to process this in great detail.

I am looking forward to really digging in on this.

Update: I briefly went through those two stories and dug a bit into the actual API documentation and honestly, based on everything I’ve followed over the past 20+ years around Google Search, these really look legit. Some of the specifics in these docs I have heard, both on and off the record, are real ranking features; some are no longer used, from what I understand; and for some, I do not know how they are used (for example, directly for ranking or for after-the-fact ranking validation). It is worth digging through these docs in detail, in my opinion.

Update 2: The source of the leak has spoken out – Erfan Azimi emailed me this video:

Forum discussion at X.