
What You Should Know About Google's New SMITH Algorithm

mohamed_hassan / Pixabay.com


Google’s new SMITH algorithm is similar to BERT in many ways, just better.

Several natural language processing and information retrieval problems involve semantic matching. It’s a technique that’s used to identify semantically related information.

For example, such a model could detect that a document labeled “car” is equivalent to another labeled “automobile.”
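If you want to picture what that looks like under the hood, here is a minimal sketch of semantic matching with text embeddings and cosine similarity. The vectors below are made up purely for illustration; in a real system they would come from a trained encoder such as BERT, not from this snippet.

```python
import numpy as np

# Toy vectors, hand-made for illustration only. In practice, a trained
# encoder (word2vec, BERT, SMITH, ...) produces these embeddings.
toy_vectors = {
    "car":        np.array([0.90, 0.80, 0.10]),
    "automobile": np.array([0.88, 0.79, 0.12]),
    "banana":     np.array([0.10, 0.05, 0.95]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: values near 1.0 mean 'semantically close'."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "car" and "automobile" land close together in the vector space, so a
# semantic matcher treats them as related; "banana" does not.
print(cosine(toy_vectors["car"], toy_vectors["automobile"]))  # high, ~1.0
print(cosine(toy_vectors["car"], toy_vectors["banana"]))      # much lower
```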

Sounds simple enough, right? However, semantic matching applications extend beyond merely identifying words.

Today, search components built on the Transformer architecture, such as BERT, rely on this technique to understand the nuances and contexts of words. It's why Google uses BERT to organize Top Stories and Featured Snippets.

However, BERT is far from perfect.

According to Google, the NLP model focuses primarily on matching short pieces of text, such as a few sentences or a single paragraph. As a result, it may struggle with long-form documents, even though long-form document matching has several essential applications.

These include:

  • News recommendation
  • Related article recommendation
  • Document clustering

To address this problem, the search giant published a research paper proposing a new model for long-form content matching. It's called the Siamese Multi-depth Transformer-based Hierarchical Encoder, or SMITH for short.

The paper reads:

“In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching.”

So, how does SMITH compare to BERT?

Google’s SMITH Algorithm vs. BERT: A Basic Comparison

As noted earlier, BERT is trained to understand words within the context of a sentence. SMITH, on the other hand, can capture sentence-level semantic relations across an entire document.

In other words, Google trained the new model to match passages within the context of the entire content. But how?
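Part of the answer is the hierarchical structure itself: a document is encoded block by block, and the block representations are then combined into a single document representation. The sketch below is only a rough schematic of that two-level idea, with made-up stand-in encoders (simple averaging) where SMITH uses Transformer layers; none of this is code from the paper.

```python
import numpy as np

def split_into_blocks(document: str, block_size: int = 2) -> list[list[str]]:
    """Naive splitter: break a document into sentences, then group the
    sentences into fixed-size blocks."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return [sentences[i:i + block_size]
            for i in range(0, len(sentences), block_size)]

def encode_block(block: list[str], dim: int = 8) -> np.ndarray:
    """Stand-in sentence-block encoder: a deterministic hashed bag-of-words
    average. SMITH uses a sentence-level Transformer at this stage."""
    vec = np.zeros(dim)
    words = " ".join(block).lower().split()
    for w in words:
        rng = np.random.default_rng(sum(ord(c) for c in w))
        vec += rng.standard_normal(dim)
    return vec / max(len(words), 1)

def encode_document(document: str) -> np.ndarray:
    """Stand-in document-level stage: combine block vectors into one
    document vector. SMITH uses another Transformer for this step."""
    block_vecs = [encode_block(b) for b in split_into_blocks(document)]
    return np.mean(block_vecs, axis=0)

# Matching two long documents then reduces to comparing their vectors.
doc_a = "The car would not start this morning. The engine only clicked."
doc_b = "My automobile refused to start. All I heard was a clicking engine."
vec_a, vec_b = encode_document(doc_a), encode_document(doc_b)
similarity = float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
print(round(similarity, 3))
```

The other part of the answer is how the model was trained.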

First, the search giant pre-trained SMITH with the masked word language modeling task also used by BERT. That way, it learned to predict randomly masked words within the context of a sentence.

However, pre-training the model with a novel masked sentence-block language modeling task made all the difference. With that, SMITH learned to predict entire masked blocks of sentences within a long-form document.
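To get a feel for what a masked sentence-block task looks like, here is a small data-preparation sketch: it groups sentences into blocks, hides one block behind a mask token, and keeps that block aside as the prediction target. This is only a simplified illustration of the idea under our own assumptions, not the paper's exact training objective.

```python
import random

def make_masked_block_example(sentences: list[str], block_size: int = 2,
                              seed: int = 0):
    """Group sentences into blocks, hide one block behind a mask token, and
    return (masked_document, target_block). During pre-training, the model's
    job would be to recover the hidden block."""
    rng = random.Random(seed)
    blocks = [sentences[i:i + block_size]
              for i in range(0, len(sentences), block_size)]
    masked_idx = rng.randrange(len(blocks))
    target_block = blocks[masked_idx]
    masked_document = [["[MASKED_BLOCK]"] if i == masked_idx else block
                       for i, block in enumerate(blocks)]
    return masked_document, target_block

sentences = [
    "SMITH is a hierarchical encoder.",
    "It splits long documents into sentence blocks.",
    "Each block gets its own representation.",
    "Those representations are combined into a document representation.",
]
masked_doc, target = make_masked_block_example(sentences)
print(masked_doc)  # one block replaced by [MASKED_BLOCK]
print(target)      # the block the model should recover
```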

In several benchmark tests for long-form document matching, Google noted that the new model outperforms previous ones, including BERT.

The document reads:

“Comparing to BERT based baselines, our model is able to increase maximum input text length from 512 to 2048.”

Indeed, the idea of SMITH outperforming previous state-of-the-art models such as BERT is intriguing. However, it’s unlikely that the new model would replace the old one.

Instead, Google could use SMITH alongside BERT to understand both long and short queries and documents.

Read the original research paper here.

Sumbo Bello

Sumbo Bello is a creative writer who enjoys creating data-driven content for news sites. In his spare time, he plays basketball and listens to Coldplay.
