Improve search quality

Search quality refers to the quality of search results in terms of ranking and recall as perceived by the user making the search query.

Ranking refers to the ordering of items and recall refers to the number of relevant items retrieved. An item (also referred to as a document) is any piece of digital content that Google Cloud Search can index. Types of items include Microsoft Office documents, PDF files, a row in a database, unique URLs, and so on. An item consists of:

Structured metadata
Indexable content
ACLs

Cloud Search uses a variety of signals to retrieve and rank search query results (the items resulting from a search query). You can influence Cloud Search’s signals through settings in the schema, the item's content and metadata (during indexing), and the search application. The goal of this document is to help you improve search quality by modifying these signal influencers.

For a summary of recommended and optional settings, refer to Summary of recommended and optional search quality settings.

Influence topicality score

Topicality refers to the relevance of a search result to the original query terms. Topicality of an item is calculated based on the following criteria:

The importance of each query term.
The number of hits (the number of times a query term appears in the item’s content or metadata).
The type of match that each query term, and its variants, has with an item indexed in Cloud Search.

To influence a text property's topicality score, define the RetrievalImportance on the text property in your schema. A match on a property with high RetrievalImportance results in a higher score compared to a match on a property with low RetrievalImportance.

For example, suppose you have a data source with the following characteristics:

The data source is used to store history for software bugs.
Each bug has a name, description, and priority.

Most users would query this data source using the bug name, so you would set the RetrievalImportance on the name to HIGHEST in the schema.

Conversely, most users may not query this data source using the description of the bug, so you would set the RetrievalImportance on the description to DEFAULT. The following sample schema contains RetrievalImportance settings:

{
  "objectDefinitions": [
    {
      "name": "issues",
      "propertyDefinitions": [
        {
          "name": "summary",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "HIGHEST"
            }
          }
        },
        {
          "name": "description",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "DEFAULT"
            }
          }
        },
        {
          "name": "label",
          "isRepeatable": true,
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "DEFAULT"
            }
          }
        },
        {
          "name": "comments",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "DEFAULT"
            }
          }
        },
        {
          "name": "project",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "HIGH"
            }
          }
        },
        {
          "name": "duedate",
          "datePropertyOptions": {
          }
        },
        ...
      ]
    }
  ]
}
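
As a rough illustration, property definitions like these could also be generated in code rather than written by hand. The following Python sketch is not part of any Cloud Search client library: the helper name text_property is hypothetical, and the set of allowed importance values is an assumption (HIGHEST, HIGH, and DEFAULT appear in the sample schema; LOW and NONE are assumed).

```python
import json

# HIGHEST, HIGH, and DEFAULT appear in the sample schema.
# LOW and NONE are assumptions made for this sketch.
ALLOWED_IMPORTANCE = {"HIGHEST", "HIGH", "DEFAULT", "LOW", "NONE"}


def text_property(name, importance="DEFAULT", repeatable=False):
    """Build one text property definition (hypothetical helper)."""
    if importance not in ALLOWED_IMPORTANCE:
        raise ValueError(f"unknown importance: {importance}")
    definition = {
        "name": name,
        "textPropertyOptions": {
            "retrievalImportance": {"importance": importance}
        },
    }
    if repeatable:
        definition["isRepeatable"] = True
    return definition


schema = {
    "objectDefinitions": [
        {
            "name": "issues",
            "propertyDefinitions": [
                text_property("summary", "HIGHEST"),
                text_property("description"),
                text_property("label", repeatable=True),
                text_property("project", "HIGH"),
            ],
        }
    ]
}

# Enum values are emitted as quoted JSON strings.
print(json.dumps(schema, indent=2))
```

Validating the importance value up front catches typos before the schema is submitted.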

In the case of HTML documents, tags such as <title> and <h1>, along with formatting settings such as font size and bolding, are used for determining the importance of various terms. If the ContentFormat is TEXT, ItemContent has DEFAULT retrieval importance; if it is HTML, its retrieval importance is determined on the basis of HTML properties.

Influence freshness

Freshness measures how recently an item has been modified and is determined by the createTime and updateTime properties in the ItemMetadata. Older items are demoted in the search results.

It is possible to influence how freshness is computed for an object by adjusting the freshnessProperty and freshnessDuration of FreshnessOptions in the schema.

The freshnessProperty allows you to use a date or timestamp property for computing freshness instead of the default updateTime.

In our previous example of a software bug tracking system, the due date could be used as a freshnessProperty such that items with a due date closest to the current date are considered “fresher” and obtain a ranking boost. The following sample schema contains freshnessProperty settings:

{
  "objectDefinitions": [
    {
      "name": "issues",
      "options": {
        "freshnessOptions": {
          "freshnessProperty": "duedate"
        }
      },
      "propertyDefinitions": [
        {
          "name": "summary",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "HIGHEST"
            }
          }
        },
        {
          "name": "duedate",
          "datePropertyOptions": {
          }
        },
        ...
      ]
    }
  ]
}

Use the freshnessDuration to identify when an item is considered out-of-date. For example, you may have a data source that is not indexed regularly or for which you do not want freshness to influence the ranking. You can achieve this goal by specifying a high value for freshnessDuration.

Suppose you have a data source with employee profile information. In this scenario, you might want a high freshnessDuration because changes to employee information are often not relevant to the ranking of the employee. The following sample schema contains a freshnessDuration setting:

{
  "objectDefinitions": [
    {
      "name": "people",
      "options": {
        "freshnessOptions": {
          "freshnessDuration": "315360000s" # 10 years
        }
      }
    }
  ]
}

You can also set freshnessDuration to a very small value for data sources whose content changes rapidly, such as a data source containing news articles. In this scenario, the most recently created or modified documents are most relevant. The following sample schema contains a freshnessDuration setting for a data source containing rapidly changing content:

{
  "objectDefinitions": [
    {
      "name": "news",
      "options": {
        "freshnessOptions": {
          "freshnessDuration": "259200s" # 3 days
        }
      }
    }
  ]
}
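
Because freshnessDuration is written as a number of seconds followed by "s", it is easy to get the arithmetic wrong by hand. The following Python sketch derives the string from a more readable time span. The helper name freshness_duration is hypothetical, and the check for a positive value is an assumption of this sketch rather than a documented rule.

```python
from datetime import timedelta


def freshness_duration(**span):
    """Format a time span as a seconds string, for example "259200s".

    Accepts the same keyword arguments as datetime.timedelta
    (days, hours, and so on).
    """
    seconds = int(timedelta(**span).total_seconds())
    if seconds <= 0:
        # Assumption for this sketch: only positive durations make sense.
        raise ValueError("duration must be positive")
    return f"{seconds}s"


# Rapidly changing content: items count as fresh for 3 days.
news_options = {
    "freshnessOptions": {"freshnessDuration": freshness_duration(days=3)}
}

# Slowly changing content: 3650 days, roughly 10 years.
people_options = {
    "freshnessOptions": {"freshnessDuration": freshness_duration(days=3650)}
}

print(news_options)
print(people_options)
```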

Influence quality

Quality is a measurement of the accuracy and usefulness of an item. A data source can contain multiple semantically similar documents, each with a different level of quality. You can specify a quality value between 0 and 1 using SearchQualityMetadata. Items with higher values receive a ranking boost relative to items with lower values. Use this setting only if you need to influence or boost the quality of an item outside of the information provided to Cloud Search.

For example, suppose you have a data source containing employee benefits documents. You might use SearchQualityMetadata to boost the ranking of documents authored by Human Resources employees over documents authored by other employees.

The following sample items contain SearchQualityMetadata settings for issues in a bug tracking system:

{
  "name": "datasources/.../items/issue1",
  "acl": {
    ...
  },
  "metadata": {
    "title": "Issue 1",
    "objectType": "issues"
  },
  ...
}

{
  "name": "datasources/.../items/issue2",
  "acl": {
    ...
  },
  "metadata": {
    "title": "Issue 2",
    "objectType": "issues",
    "searchQualityMetadata": {
      "quality": 0.5
    }
  },
  ...
}

{
  "name": "datasources/.../items/issue3",
  "acl": {
    ...
  },
  "metadata": {
    "title": "Issue 3",
    "objectType": "issues",
    "searchQualityMetadata": {
      "quality": 1
    }
  },
  ...
}

Given these items, when a user searches using the search term “issue,” Issue 3 (quality of 1) is ranked higher than Issue 2 (quality of 0.5) and Issue 1 (if nothing is specified, the default quality is 0).
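
The employee benefits example can be sketched in Python as follows. Both helper names and the specific quality values (1.0 for Human Resources authors, 0.5 for everyone else) are hypothetical choices for illustration. The only constraint taken from this document is that quality lies between 0 and 1.

```python
def with_quality(item, quality):
    """Return a copy of an item with searchQualityMetadata set."""
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be between 0 and 1")
    metadata = dict(item.get("metadata", {}))
    metadata["searchQualityMetadata"] = {"quality": quality}
    return {**item, "metadata": metadata}


def benefits_quality(author_department):
    """Pick a quality value from the author's department (hypothetical rule)."""
    return 1.0 if author_department == "Human Resources" else 0.5


item = {"metadata": {"title": "Benefits overview", "objectType": "benefits"}}
boosted = with_quality(item, benefits_quality("Human Resources"))
print(boosted["metadata"]["searchQualityMetadata"])
```

Returning a copy keeps the original item unchanged, which makes it easier to compare the item before and after the quality value is applied.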

Influence using field type

Cloud Search allows you to influence ranking based on the value of enum or integer properties. For each integer or enum property, an OrderedRanking can be specified. This setting has the following values:

NO_ORDER (default): The property does not affect ranking.
ASCENDING: Items with higher values of this integer or enum property receive a ranking boost compared to items with lower values.
DESCENDING: Items with lower values of the integer or enum property receive a ranking boost compared to items with higher values.

For example, suppose each bug in a bug tracking system has an enum property for storing the priority of the bug as either HIGH (1), MEDIUM (2), or LOW (3). In this scenario, setting an OrderedRanking of DESCENDING provides a ranking boost to HIGH priority bugs in comparison to LOW priority bugs. The following sample schema contains OrderedRanking settings for issues in a bug tracking system:

{
  "objectDefinitions": [
    {
      "name": "issues",
      "options": {
        "freshnessOptions": {
          "freshnessProperty": "duedate"
        }
      },
      "propertyDefinitions": [
        {
          "name": "summary",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "HIGHEST"
            }
          }
        },
        {
          "name": "duedate",
          "datePropertyOptions": {
          }
        },
        {
          "name": "priority",
          "enumPropertyOptions": {
            "possibleValues": [
              {
                "stringValue": "HIGH",
                "integerValue": 1
              },
              {
                "stringValue": "MEDIUM",
                "integerValue": 2
              },
              {
                "stringValue": "LOW",
                "integerValue": 3
              }
            ],
            "orderedRanking": "DESCENDING"
          }
        },

        ...
      ]
    }
  ]
}

A bug tracking system could also have an integer property called votes used to gather feedback from users on the relative importance of a bug. You could use the votes property to influence ranking by giving higher importance to the bugs with the most votes. In this case, you could specify OrderedRanking as ASCENDING for the votes property so that issues with the most votes receive a ranking boost. The following sample schema contains OrderedRanking settings for issues in a bug tracking system:

{
  "objectDefinitions": [
    {
      "name": "issues",
      "propertyDefinitions": [
        {
          "name": "summary",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "HIGHEST"
            }
          }
        },
        {
          "name": "description",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "DEFAULT"
            }
          }
        },
        {
          "name": "votes",
          "integerPropertyOptions": {
            "orderedRanking": "ASCENDING",
            "minimumValue": 0,
            "maximumValue": 1000
          }
        },

        ...
      ]
    }
  ]
}
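
The direction of ASCENDING and DESCENDING is easy to confuse, so the following Python sketch shows only which items each setting favors. It is an illustration under simplifying assumptions, not Cloud Search's ranking algorithm: real ranking combines many signals, and the function name favored_first is hypothetical.

```python
def favored_first(items, prop, ordered_ranking):
    """Order items so that those favored by the setting come first."""
    if ordered_ranking == "NO_ORDER":
        return list(items)  # the property has no effect
    if ordered_ranking == "ASCENDING":
        # Higher values are boosted, so the largest values come first.
        return sorted(items, key=lambda item: item[prop], reverse=True)
    if ordered_ranking == "DESCENDING":
        # Lower values are boosted, so the smallest values come first.
        return sorted(items, key=lambda item: item[prop])
    raise ValueError(f"unknown OrderedRanking: {ordered_ranking}")


bugs = [
    {"title": "Bug A", "priority": 3, "votes": 12},   # LOW priority
    {"title": "Bug B", "priority": 1, "votes": 4},    # HIGH priority
    {"title": "Bug C", "priority": 2, "votes": 250},  # MEDIUM priority
]

by_priority = favored_first(bugs, "priority", "DESCENDING")
by_votes = favored_first(bugs, "votes", "ASCENDING")
print([bug["title"] for bug in by_priority])
print([bug["title"] for bug in by_votes])
```

With DESCENDING on priority, the HIGH priority bug (integer value 1) is favored. With ASCENDING on votes, the bug with the most votes is favored.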

Influence ranking through query expansion

Query expansion refers to expanding the terms in the query, using synonyms and spelling, to retrieve better results.

Use synonyms to influence search results

Cloud Search utilizes synonyms inferred from public web content to expand the query terms. You can also define custom synonyms to capture organization-specific terminology, such as common acronyms used within an organization or industry-specific terminology.

Custom synonyms can be defined within a data source or as a separate data source. By default, synonyms are applied to all data sources across all search applications. However, you can group synonyms by data source and search application. For information on defining custom synonyms including grouping by search application, refer to Define synonyms.

Use spelling to influence search results

Cloud Search provides spelling suggestions based on models built using the public Google Search data. If Cloud Search detects a misspelling in the context of a query, it returns the suggested query in the SpellResult. The suggested spelling can be displayed to the user as a suggestion. For example, the user might misspell the query term “employe” and could receive the suggestion “Did you mean employee?”

Cloud Search also uses spell corrections as synonyms to help retrieve documents that may otherwise be missed due to a spelling error.
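
A custom search interface might turn the suggestion into a prompt along these lines. This Python sketch assumes the search response has already been parsed into a dictionary, and that suggestions arrive in a spellResults list whose entries carry a suggestedQuery string. Treat both field names as assumptions to verify against the API reference. The function name did_you_mean is hypothetical.

```python
def did_you_mean(response):
    """Return a "Did you mean ...?" prompt, or None if nothing is suggested."""
    # Field names are assumptions for this sketch.
    for result in response.get("spellResults", []):
        suggestion = result.get("suggestedQuery")
        if suggestion:
            return f"Did you mean {suggestion}?"
    return None


# Hand-written sample response, not real API output.
sample_response = {"spellResults": [{"suggestedQuery": "employee"}]}
print(did_you_mean(sample_response))
print(did_you_mean({}))
```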

Influencing ranking through search application settings

As mentioned in the Introduction to Google Cloud Search, a Search Application is a group of settings that, when associated with a search interface, provide contextual information about searches. The following configurations allow you to influence ranking through the search application:

Scoring configuration
Source configuration

The following two sections explain how these configurations are useful in influencing ranking.

Adjust the scoring configuration

For each search application, you can specify a ScoringConfig used for controlling the application of some signals during ranking. Currently, you can disable freshness and personalization.

If freshness is disabled, it is disabled for all data sources listed in the search application, regardless of the freshness options specified in the schema for the data source. Similarly, if personalization is disabled, owner boost and interaction boost don’t affect the ranking.

For step-by-step instructions on configuring this setting, refer to Customize the search experience in Cloud Search.
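
As a minimal sketch, the scoring configuration portion of a search application could be assembled like this in Python. The field names disableFreshness and disablePersonalization are assumptions based on the two signals named above, so check them against the API reference. The helper name scoring_config is hypothetical.

```python
import json


def scoring_config(disable_freshness=False, disable_personalization=False):
    """Build the scoring portion of a search application (sketch)."""
    return {
        "scoringConfig": {
            # Assumed field names; verify before use.
            "disableFreshness": disable_freshness,
            "disablePersonalization": disable_personalization,
        }
    }


# Turn off personalization only; freshness stays enabled.
app_settings = scoring_config(disable_personalization=True)
print(json.dumps(app_settings, indent=2))
```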

Adjust the source configuration

The source configuration allows you to specify data source-level settings in a search application. The following settings are supported:

Source importance
Crowding

Set source importance

Source importance refers to the relative importance of a data source within a search application. This setting can be specified in the SourceImportance field inside SourceScoringConfig. Items from a data source with HIGH source importance receive a ranking boost compared to items from a data source with a DEFAULT or a LOW source importance. Use this setting to influence ranking when you believe users would prefer results from certain data sources.

For example, suppose you have a product support portal containing external and internal troubleshooting data. In this scenario, you might want to configure your search application to prioritize results from the internal data source.

For step-by-step instructions on configuring this setting, refer to Customize the search experience in Cloud Search.
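
The product support example might translate into per-source settings like the following Python sketch. The data source names are hypothetical placeholders, the helper name source_config is hypothetical, and the surrounding field names (source, scoringConfig, sourceImportance) are assumptions to verify against the API reference.

```python
SOURCE_IMPORTANCE = {"HIGH", "DEFAULT", "LOW"}


def source_config(data_source, importance="DEFAULT"):
    """Build one data source entry for a search application (sketch)."""
    if importance not in SOURCE_IMPORTANCE:
        raise ValueError(f"unknown source importance: {importance}")
    return {
        "source": {"name": data_source},
        "scoringConfig": {"sourceImportance": importance},
    }


# Hypothetical data source names used only for illustration.
source_configs = [
    source_config("datasources/internal-troubleshooting", "HIGH"),
    source_config("datasources/external-troubleshooting"),
]
print(source_configs)
```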

Set crowding

Crowding refers to the maximum number of results that can be returned from a data source in a search application. You control this value using the numResults field in SourceCrowdingConfig. The value defaults to 3, which means that after Cloud Search shows 3 results from a data source, it starts presenting results from other data sources. Items from the first data source are reconsidered only if all data sources have reached their crowding limit or there are no more results from other data sources.

This setting is helpful in ensuring diversity of the search results and preventing one data source from dominating the search result page.

For step-by-step instructions on configuring this setting, refer to Customize the search experience in Cloud Search.
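
To make the effect of crowding concrete, the following Python sketch applies a per-source limit to an already ranked list. It is a simplified model of the behavior described above, not Cloud Search's implementation: results over the limit are deferred and appended after everything else. The function name apply_crowding and the sample data are hypothetical.

```python
from collections import Counter


def apply_crowding(ranked, num_results=3):
    """Reorder (item, source) pairs so no source exceeds the limit up front."""
    shown = Counter()
    page, deferred = [], []
    for item, source in ranked:
        if shown[source] < num_results:
            shown[source] += 1
            page.append((item, source))
        else:
            # Over the limit: reconsider only after all other results.
            deferred.append((item, source))
    return page + deferred


ranked = [
    ("a1", "A"), ("a2", "A"), ("a3", "A"), ("a4", "A"), ("a5", "A"),
    ("b1", "B"),
]
crowded = apply_crowding(ranked)
print([item for item, _ in crowded])
```

In this example, the fourth and fifth results from source A move behind the single result from source B, which would otherwise appear last.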

Influencing ranking through personalization

Personalization refers to the presentation of personalized search results based on the individual user accessing the result. You can influence ranking by prioritizing items based on the following criteria:

Item ownership
Item interaction
User clicks
Item language

The following sections address how to influence search quality based on these criteria.

Influence ranking based on item ownership

Item ownership refers to providing a ranking boost to items owned by the user performing the search query. Each item has an ItemAcl with an owners field. If the user executing a query is the owner of an item, then, by default, that item receives a ranking boost. To remove this boost, turn off personalization in the search application.

Increase ranking based on item interaction

Item interaction refers to providing a ranking boost to items that the search query user interacted with (viewed, commented on, edited, and so on).

Item interaction signals are automatically obtained for Google Workspace products such as Drive and Gmail. For other products, you can provide item-level interaction data, including the type of interaction (view, edit), the timestamp of the interaction, and the principal (user who interacted with the item). Note that items with recent interactions obtain a higher ranking boost.
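
For a non-Workspace data source, the three pieces of interaction data named above (type, timestamp, and principal) might be assembled as in this Python sketch. The exact field names (type, principal, gsuitePrincipal, gsuiteUserEmail, interactionTime) are assumptions to verify against the API reference. The helper name interaction and the email address are hypothetical.

```python
from datetime import datetime, timezone

INTERACTION_TYPES = {"VIEW", "EDIT"}


def interaction(kind, user_email, when):
    """Build one interaction record (sketch). `when` must be timezone-aware."""
    if kind not in INTERACTION_TYPES:
        raise ValueError(f"unknown interaction type: {kind}")
    if when.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    # Render as a UTC timestamp ending in "Z".
    stamp = when.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")
    return {
        "type": kind,
        "principal": {"gsuitePrincipal": {"gsuiteUserEmail": user_email}},
        "interactionTime": stamp,
    }


viewed = interaction(
    "VIEW",
    "user@example.com",
    datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(viewed)
```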

Increase ranking based on user clicks

Cloud Search collects the clicks on current search results and uses them to improve ranking for future searches by boosting items clicked previously by the same user.

Influence ranking through query interpretation

Cloud Search’s query interpretation feature automatically interprets the operators and filters in a user’s query, and converts those elements into a structured, operator-based query. Query interpretation uses operators defined in the schema, together with the indexed documents, to deduce what the user's query means. This feature allows a user to search with minimal keywords, yet still obtain precise results. For further information, refer to Structure a schema for optimal query interpretation.

Increase ranking based on item language

Item language refers to demoting items whose language does not match the language of the query. The following factors affect the ranking of items based on language:

The query language. The auto-detected language of the search query, or the languageCode specified in the RequestOptions.

If you build a custom search interface, you should set the languageCode to the user's interface language or language preference (for example, the language of the web browser or the search interface page). The auto-detected query language takes precedence over the languageCode, so that search quality is not compromised when a user types a query in a language that differs from their interface.
The item language. The contentLanguage set in ItemMetadata at index time, or the content language automatically detected by Cloud Search.

If a document's contentLanguage is left empty at index time, and the ItemContent is populated, Cloud Search attempts to detect the language used in the ItemContent and stores it internally. The auto-detected language is not added to the contentLanguage field.

If the language of the query and item match, no language demotion is applied. If these settings do not match, then the item is demoted. Language demotion is not applied to documents where contentLanguage is empty and Cloud Search could not automatically detect the language. As a result, the ranking of a document is not impacted if Cloud Search can't detect its language.
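
The language rules above can be summarized as a small decision procedure. The following Python sketch models only the rules stated in this section. The function names are hypothetical, the languages are compared as exact strings (an assumption), and the case where the query language is unknown is treated as "no demotion" (also an assumption, since this document does not cover it).

```python
def effective_query_language(detected, language_code):
    """The auto-detected query language takes precedence over languageCode."""
    return detected or language_code


def effective_item_language(content_language, detected):
    """Use contentLanguage if set at index time, else the detected language."""
    return content_language or detected


def is_demoted(query_language, item_language):
    """Return True when language demotion would apply to the item."""
    if not item_language:
        return False  # language unknown: ranking is not impacted
    if not query_language:
        return False  # assumption: nothing to compare against
    return query_language != item_language


query = effective_query_language(detected="fr", language_code="en")
print(is_demoted(query, effective_item_language("en", None)))
print(is_demoted(query, effective_item_language(None, "fr")))
print(is_demoted(query, effective_item_language(None, None)))
```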

Increase ranking based on item context

You can increase the ranking for items which are more relevant to the context of a search query. The context (contextAttributes) is a set of named attributes that you can specify during indexing, and in the search request, to provide context for a specific search query.

For example, suppose an item, such as an employee benefit document, is more relevant in the context of a Location and Department, such as a city (San Francisco), state (California), country (USA), and a Department (Engineering). In this case, you could index the item with the following named attributes:

{
  ...
  "metadata": {
    "contextAttributes": [
      {
        "name": "Location",
        "values": [
          "San Francisco",
          "California",
          "USA"
        ]
      },
      {
        "name": "Department",
        "values": [
          "Engineering"
        ]
      }
    ]
  },
  ...
}

When the user enters a search query of "benefits" into the search interface, you might include the user's location information and department in the search request. For example, here's a search request containing location and department information for an Engineer in Chicago:

{
  ...
  "contextAttributes": [
    {
      "name": "Location",
      "values": [
        "Chicago",
        "Illinois",
        "USA"
      ]
    },
    {
      "name": "Department",
      "values": [
        "Engineering"
      ]
    }
  ],
  ...
}

Because both the indexed item and the search request contain the attributes of "Department=Engineering" and "Location=USA," the indexed item (an employee benefit document) appears higher in the search results.

Now suppose another user, an Engineer in India, enters a search query of "benefits" into the search interface. Here's a search request containing their location and department information:

{
  ...
  "contextAttributes": [
    {
      "name": "Location",
      "values": [
        "Bengaluru",
        "Karnataka",
        "India"
      ]
    },
    {
      "name": "Department",
      "values": [
        "Engineering"
      ]
    }
  ],
  ...
}

Because the indexed item and the search request share only the attribute of "Department=Engineering," the indexed item appears only slightly higher in the search results (when compared to the first search query of "benefits" entered by an Engineer located in Chicago, Illinois, USA).
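
The difference between the two requests comes down to how many name and value pairs each one shares with the indexed item. The following Python sketch counts that overlap for the examples above. It does not reproduce how Cloud Search turns the overlap into a ranking boost, and the function name shared_context is hypothetical.

```python
def shared_context(item_attributes, request_attributes):
    """Return the (name, value) pairs present in both attribute lists."""
    def pairs(attributes):
        return {
            (attribute["name"], value)
            for attribute in attributes
            for value in attribute["values"]
        }
    return pairs(item_attributes) & pairs(request_attributes)


item = [
    {"name": "Location", "values": ["San Francisco", "California", "USA"]},
    {"name": "Department", "values": ["Engineering"]},
]
chicago = [
    {"name": "Location", "values": ["Chicago", "Illinois", "USA"]},
    {"name": "Department", "values": ["Engineering"]},
]
bengaluru = [
    {"name": "Location", "values": ["Bengaluru", "Karnataka", "India"]},
    {"name": "Department", "values": ["Engineering"]},
]

print(sorted(shared_context(item, chicago)))
print(sorted(shared_context(item, bengaluru)))
```

The Chicago request shares two pairs with the item (Department=Engineering and Location=USA), while the Bengaluru request shares only one (Department=Engineering).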

Following are some example contexts you might want to use to increase ranking:

Location: Items can be more relevant to users in a particular location, such as a building, a city, a country, or a region.
Job role: Items can be more relevant to users in a particular job role, such as Technical Writer or Engineer.
Department: Items can be more relevant to certain departments, such as Sales or Marketing.
Job level: Items can be more relevant to certain job levels, such as Director or CEO.
Employee type: Items can be more relevant to certain types of employees, such as part-time and full-time employees.
Tenure: Items can be more relevant to an employee's tenure, such as a new hire.

Influencing ranking through item popularity

Cloud Search boosts popular items in ranking; that is, it boosts those items which have received clicks in recent search queries.

Influencing ranking through clickboost

Cloud Search collects the clicks on current search results and uses them to improve ranking for future searches by boosting popular items for a particular search query.

Summary of recommended and optional search quality settings

The following table lists all of the recommended and optional search quality settings. These recommendations should help you achieve the most benefit from Cloud Search's ranking models.

Setting	Location	Recommended/optional	Details
Schema settings
`ItemContent` field	`ItemContent`	Recommended	When creating or updating your schema, populate the unstructured content of an item. This field is used for generating snippets.
`RetrievalImportance` field	`RetrievalImportance`	Recommended	When creating or updating a schema, set for text properties which are clearly important or topical.
`FreshnessOptions`	`FreshnessOptions`	Optional	When creating or updating a schema, set to ensure that items aren't demoted because of incorrect data or cases when data is missing.
Indexing settings
`createTime`/`updateTime`	`ItemMetadata`	Recommended	Populate during indexing of an item.
`contentLanguage`	`ItemMetadata`	Recommended	Populate during indexing of an item. If absent, Cloud Search attempts to detect the language used in the `ItemContent`.
`owners` field	`ItemAcl`	Recommended	Populate during indexing of an item.
Custom synonyms	`_dictionaryEntry` schema	Recommended	Define at data source-level or as separate data source during indexing.
`quality` field	`SearchQualityMetadata`	Optional	To provide a base quality boost compared to other semantically similar items, set quality during indexing. Setting this field for all items in a data source nullifies its effect.
item-level interaction data	`interaction`	Optional	If the data source records and provides access to user's interactions, populate the interactions for each item during indexing.
integer/enum properties	`OrderedRanking`	Optional	When order of items is relevant, specify the ordered ranking for integer and enum properties during indexing.
Search application settings
`Personalization=false`	`ScoringConfig` or using the Cloud Search admin UI	Recommended	When creating or updating the search application. Ensure you provide the correct owner information as described in Influencing ranking through personalization.
`SourceImportance` field	`SourceScoringConfig`	Optional	To bias the results from certain data sources, set this field.
`numResults` field	`SourceCrowdingConfig`	Optional	To control the diversity of results, set this field.

Next Steps

Here are a few next steps you might take:

Structure a schema for optimal query interpretation.
Learn how to leverage the _dictionaryEntry schema to define synonyms for terms commonly used in your company. To use the _dictionaryEntry schema, refer to Define synonyms.