How we fought Search spam on Google in 2020

Thursday, April 29, 2021

Google Search is a powerful tool to help you find useful information on the open web. Unfortunately, not all web pages are created with good intent. Many of them are explicitly created to deceive people, and that is something we fight against every day. To ensure your safety and protect your search experience against disruptive content and malicious behaviors, Search invested in many innovations in 2020.

Fighting spam smarter

While we have been fighting spam since the early days of Search, recent advances in Artificial Intelligence (AI) offer unprecedented potential to revolutionize our approach.

By combining our deep knowledge of spam with AI, last year we were able to build our very own spam-fighting AI that is incredibly effective at catching both known and new spam trends. For example, we have reduced sites with auto-generated and scraped content by more than 80% compared to a couple of years ago.

Hacked spam was still rampant in 2020 because the number of vulnerable websites remained quite large, although we improved our detection capability by more than 50% and removed most of the hacked spam from search results.

This is a problem that we cannot solve alone. Even if we could detect and protect against all spam, hackers would not stop exploiting loopholes until they’re all closed. Website owners can protect their sites by practicing good security hygiene: it is easier to prevent a site from getting hacked than to recover from a hack. Google offers resources to help you understand the most common ways websites get hacked and how to use Search Console to check whether your site has been hacked. Please do take a look and let's keep the web safer together!

With major events last year, including a global pandemic, we devoted significant effort to extending protection to the billions of searches we received on such important topics. If you're looking for a COVID testing site near you, you shouldn't have to worry about landing on gibberish spam that may redirect you to phishing sites. Besides eliminating spam content, we worked with several other Search teams to make sure you receive the most up-to-date and highest-quality information when and where it matters the most.

Preventing spam from reaching you

Before we deliver a set of search results on Google, there's a lot that happens behind the scenes. Every day, we're discovering, crawling, and indexing billions of web pages. Among those pages is a lot of spam: every day, we discover 40 billion spammy pages. Here’s how we work to keep that spam from getting in the way of your search for helpful, useful information.

Figure: A conceptual diagram of how we defend against spam at every step.

First, we have systems that can detect spam when we crawl pages or other content. Crawling is when our automatic systems visit content and consider it for inclusion in the index we use to provide search results. Some content detected as spam isn't added to the index.

These systems also work for content we discover through sitemaps and Search Console. For example, Search Console has a Request Indexing feature so creators can let us know about new pages that should be added quickly. We observed spammers hacking into vulnerable sites, pretending to be the owners of these sites, verifying themselves in Search Console, and using the tool to ask Google to crawl and index the many spammy pages they created. Using AI, we were able to pinpoint suspicious verifications and prevent spam URLs from getting into our index this way.

Next, we have systems that analyze the content that is included in our index. When you issue a search, they double-check whether the content that matches might be spam. If so, that content won’t appear in the top search results. We also use this information to improve our systems so that such spam is not included in the index at all.

The result is that very little spam actually makes it into the top results anyone sees for a search, thanks to our automated systems that are aided by AI. We estimated that these automated systems help keep more than 99% of visits from Search completely spam-free. As for the tiny percentage left, our teams take manual action and use what they learn from it to further improve our automated systems.
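To make the layered idea concrete, here is a toy sketch of a two-stage defense: one check when a page is crawled and a second check when results are served. It is only an illustration and not how Google's systems work. Every name in it (`ToyIndex`, `CRAWL_MARKERS`, `SERVE_MARKERS`, the `example.com` URLs) is hypothetical, and simple keyword matching stands in for the AI-based detection described above.

```python
from dataclasses import dataclass, field

# Hypothetical marker lists: the serve-time list is assumed to be newer and
# broader than the crawl-time list, so it can catch spam that slipped through.
CRAWL_MARKERS = ("buy cheap pills",)
SERVE_MARKERS = CRAWL_MARKERS + ("click here to win",)


def looks_spammy(text: str, markers: tuple) -> bool:
    """Toy classifier: flag text that contains any known spam marker."""
    lowered = text.lower()
    return any(marker in lowered for marker in markers)


@dataclass
class Page:
    url: str
    text: str


@dataclass
class ToyIndex:
    pages: list = field(default_factory=list)

    def crawl(self, page: Page) -> bool:
        """Stage 1: pages flagged as spam at crawl time are not indexed."""
        if looks_spammy(page.text, CRAWL_MARKERS):
            return False
        self.pages.append(page)
        return True

    def search(self, query: str) -> list:
        """Stage 2: re-check matching pages before showing them."""
        q = query.lower()
        return [
            p.url
            for p in self.pages
            if q in p.text.lower() and not looks_spammy(p.text, SERVE_MARKERS)
        ]


index = ToyIndex()
index.crawl(Page("https://example.com/a", "COVID testing site hours and locations"))
index.crawl(Page("https://example.com/b", "COVID testing - buy cheap pills now"))
index.crawl(Page("https://example.com/c", "COVID testing - click here to win"))

results = index.search("covid testing")
print(results)
```

In this sketch, page `b` is rejected at crawl time, while page `c` reaches the index but is filtered out at serving time, so only page `a` is returned.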

Prot