Every connector has an associated configuration file containing parameters used by the connector,
such as the ID for your repository. Parameters are defined as key-value pairs, such as api.sourceId=1234567890abcdef.
The Google Cloud Search SDK contains several Google-supplied configuration parameters used by different connectors. Of the Google-supplied configuration parameters, only the Data source access parameters are required to be defined in your configuration file. You do not need to redefine the Google-supplied parameters in your configuration file unless you want to override their default values.
This reference describes the Google-supplied configuration parameters.
Configuration file example
The following example shows an identity connector configuration file with parameter key-value pairs.
    #
    # Configuration file sample
    #
    api.sourceId=1234567890abcdef
    api.identitySourceId=0987654321lmnopq
    api.serviceAccountPrivateKeyFile=./PrivateKey.json

    #
    # Traversal schedules
    #
    schedule.traversalIntervalSecs=7200
    schedule.incrementalTraversalIntervalSecs=600

    #
    # Default ACLs
    #
    defaultAcl.mode=fallback
    defaultAcl.public=true
Commonly set parameters
This section lists required and optional commonly set configuration parameters. If you do not change values for the optional parameters, the connector uses the default values provided by the SDK.
Data source access
The following table lists all of the parameters that are required to appear in a configuration file. The parameters you use depend on the type of connector you are building (content connector or identity connector).
Setting | Parameter |
---|---|
Data source id | api.sourceId=1234567890abcdef
This parameter is required by a connector to identify the location of your repository. You obtained this value when you added a data source to search. This parameter must be in connector configuration files. |
Identity source id | api.identitySourceId=0987654321lmnopq
This parameter is required by identity connectors to identify the location of an external identity source. You obtained this value when you mapped user identities in Cloud Search. This parameter must be in all identity connector configuration files. |
Service account private key file | api.serviceAccountPrivateKeyFile=./PrivateKey.json
This parameter contains the private key needed to access the repository. You obtained this value when you configured access to the Google Cloud Search REST API. This parameter must be in all configuration files. |
Service account ID | api.serviceAccountId=123abcdef4567890
This parameter specifies the service account ID. The default empty string value is only allowable when the configuration file specifies a private key file parameter. This parameter is required if your private key file is not a JSON key. |
Google Workspace Account ID | api.customerId=123abcdef4567890
This parameter specifies the account ID for the enterprise's Google Workspace account. You obtained this value when you mapped user identities in Cloud Search. This parameter is required when syncing users using an identity connector. |
Root URL | api.rootUrl=baseURLPath
This parameter specifies the indexing service base URL path. The default value for this parameter is an empty string which is converted to
|
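As an illustration of the service account parameters above, the following sketch shows a configuration that uses a private key file that is not a JSON key, in which case api.serviceAccountId must also be set. The key file name is a hypothetical placeholder, and the ID values are the sample values from the table.

```properties
# Repository to index (sample value from the table above)
api.sourceId=1234567890abcdef

# Hypothetical non-JSON key file, so the service account ID is set explicitly
api.serviceAccountPrivateKeyFile=./PrivateKey.p12
api.serviceAccountId=123abcdef4567890
```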
Traversal schedules
The scheduling parameters determine how long the connector waits between traversals.
Setting | Parameter |
---|---|
Full traversal at connector startup | schedule.performTraversalOnStart=true|false
The connector performs a full traversal at connector startup, rather than
waiting for the first interval to expire. The default value is |
Full traversal after an interval | schedule.traversalIntervalSecs=intervalInSeconds
The connector performs a full traversal after a specified interval. Specify the
interval between traversals in seconds. The default value is |
Exit after a single traversal | connector.runOnce=true|false
The connector runs a full traversal once, then exits. This parameter should only
be set to |
Incremental traversal after an interval | schedule.incrementalTraversalIntervalSecs=intervalInSeconds
The connector performs an incremental traversal after a specified interval.
Specify the interval between traversals in seconds. The default value is
|
Scheduled poll queue intervals | schedule.pollQueueIntervalSecs=intervalInSeconds
The interval between scheduled polls of the queue (in seconds). This is used
only by a listing traversal connector. The default value is |
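A minimal sketch of a traversal schedule, using the interval values from the sample configuration file. The values are illustrative choices, not the SDK defaults.

```properties
# Run a full traversal as soon as the connector starts
schedule.performTraversalOnStart=true

# Full traversal every 2 hours (7200 seconds)
schedule.traversalIntervalSecs=7200

# Incremental traversal every 10 minutes (600 seconds)
schedule.incrementalTraversalIntervalSecs=600
```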
Access control lists
The connector controls access to items by using ACLs. Several parameters let you restrict user access to indexed records with ACLs.
If your repository has individual ACL information associated with each item, upload all ACL information to control item access within Cloud Search. If your repository provides partial or no ACL information, you can supply default ACL information in the following parameters, which the SDK provides to the connector.
Setting | Parameter |
---|---|
ACL mode | defaultAcl.mode=mode
Determines when to apply the default ACL. Valid values:
The default mode is |
Default public ACL | defaultAcl.public=true|false
The default ACL used for the entire repository is set to public domain access.
The default value is |
Common ACL group readers | defaultAcl.readers.groups=google:group1@mydomain.com, group2 |
Common ACL readers | defaultAcl.readers.users=user1, user2, google:user3@mydomain.com |
Common ACL denied group readers | defaultAcl.denied.groups=group3 |
Common ACL denied readers | defaultAcl.denied.users=user4, user5 |
Entire domain access | To specify that every indexed record be publicly accessible by every user
in the domain, set both of the following parameters with values:
|
Common defined ACL | To specify one ACL for each record of the data repository, set all of the
following parameter values:
|
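A minimal sketch of a default ACL that grants access to named readers rather than to the whole domain. The fallback mode value is taken from the sample configuration file, and the user and group names are the placeholder names from the table. Whether this exact combination suits your repository is an assumption.

```properties
defaultAcl.mode=fallback
defaultAcl.public=false

# Users and groups allowed to read every indexed record
defaultAcl.readers.users=user1, user2, google:user3@mydomain.com
defaultAcl.readers.groups=google:group1@mydomain.com, group2

# Users and groups explicitly denied access
defaultAcl.denied.users=user4, user5
defaultAcl.denied.groups=group3
```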
Metadata configuration parameters
Some of the item metadata is configurable. Connectors can set configurable metadata fields during indexing. If the connector doesn't set a field, the parameters in your configuration file are used to set the field.
The configuration file has a series of named metadata configuration parameters indicated by
a .field suffix, such as itemMetadata.title.field=movieTitle. If there is a value for these
parameters, it is used to configure the metadata field. If there is no value for the
named metadata parameter, the metadata is configured using a parameter with the
.defaultValue suffix.
The following table shows metadata configuration parameters.
Setting | Parameter |
---|---|
Title | itemMetadata.title.field=movieTitle
itemMetadata.title.defaultValue=
The item title. If the title.field isn't set to a value, the value for
title.defaultValue is used.
|
Source repository URL | itemMetadata.sourceRepositoryUrl.field=url
itemMetadata.sourceRepositoryUrl.defaultValue=https://www.imdb.com/title/tt0031381/
The item URL used in search results. You can set only the defaultValue to hold a
URL for the entire repository, for example when your repository is a CSV file and there is only one
URL for every item. If the sourceRepositoryUrl.field isn't set
to a value, the value for sourceRepositoryUrl.defaultValue is used.
|
Container name | itemMetadata.containerName.field=containerName
itemMetadata.containerName.defaultValue=myDefaultContainerName
The name of the item's container, such as the name of a file system directory or folder. If the containerName.field isn't set to a value, the value for
containerName.defaultValue is used.
|
Object type | itemMetadata.objectType.field=type
itemMetadata.objectType.defaultValue=
The object type used by the connector, as defined in the schema. The connector won't index any structured data if this property is not specified. If the objectType.field isn't set to a value, the value for
objectType.defaultValue is used.
|
Create time | itemMetadata.createTime.field=releaseDate
itemMetadata.createTime.defaultValue=1940-01-17
The document creation timestamp. If the createTime.field isn't set to a value, the
value for createTime.defaultValue is used.
|
Update time | itemMetadata.updateTime.field=releaseDate
itemMetadata.updateTime.defaultValue=1940-01-17
The last modification timestamp for the item. If the updateTime.field isn't set to
a value, the value for updateTime.defaultValue is used.
|
Content language | itemMetadata.contentLanguage.field=languageCode
itemMetadata.contentLanguage.defaultValue=
The content language for documents being indexed. If the contentLanguage.field
isn't set to a value, the value for contentLanguage.defaultValue is used.
|
MIME type | itemMetadata.mimeType.field=mimeType
itemMetadata.mimeType.defaultValue=
The original MIME type of ItemContent.content in the source repository. The maximum length is 256 characters. If the mimeType.field isn't set to a value, the value for
mimeType.defaultValue is used.
|
Search quality metadata | itemMetadata.searchQualityMetadata.quality.field=quality
itemMetadata.searchQualityMetadata.quality.defaultValue=
An indication of the quality of the item, used to influence search quality. Value should be between 0.0 (lowest quality) and 1.0 (highest quality). The default value is 0.0. If the quality.field isn't set to a value, the value for
quality.defaultValue is used.
|
Hash | itemMetadata.hash.field=hash
itemMetadata.hash.defaultValue=f0fda58630310a6dd91a7d8f0a4ceda2
Hashing value provided by the API caller. This can be used with the items.push method to calculate modified state. The maximum length is 2048
characters. If the hash.field isn't set to a value, the value for
hash.defaultValue is used.
|
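To illustrate how the .field and .defaultValue parameters pair up, the sketch below reuses the sample values from the table for a hypothetical movie repository. The repository field names (movieTitle, url, releaseDate) are examples, not names the SDK requires.

```properties
# Take the title from the repository's movieTitle field
itemMetadata.title.field=movieTitle

# Take the URL from the url field; the defaultValue is used if the field isn't set to a value
itemMetadata.sourceRepositoryUrl.field=url
itemMetadata.sourceRepositoryUrl.defaultValue=https://www.imdb.com/title/tt0031381/

# Use releaseDate for the creation time, with a fixed default date
itemMetadata.createTime.field=releaseDate
itemMetadata.createTime.defaultValue=1940-01-17
```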
Datetime formats
Datetime formats specify the formats expected in metadata attributes. If the configuration file does not contain this parameter, default values are used. The following table shows this parameter.
Setting | Parameter |
---|---|
Additional datetime formats | structuredData.dateTimePatterns=MM/dd/uuuu HH:mm:ssXXX
A semicolon-separated list of additional java.time.format.DateTimeFormatter
patterns. The patterns are used when parsing string values for any date or date-time fields
in the metadata or schema. The default value is an empty list, but RFC 3339 and RFC 1123
formats are always supported.
|
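As a sketch of the semicolon-separated list described above, the following line registers two additional patterns. The first is the sample pattern from the table. The second (dd.MM.uuuu) is an illustrative assumption, chosen only to show the separator.

```properties
# Two additional DateTimeFormatter patterns, separated by a semicolon
structuredData.dateTimePatterns=MM/dd/uuuu HH:mm:ssXXX;dd.MM.uuuu
```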
Structured data
The Cloud Search Indexing API provides a schema service that you can use to customize how Cloud Search indexes and serves your data. If you are using a local repository schema, you must specify the structured data local schema name.
Setting | Parameter |
---|---|
Local schema name | structuredData.localSchema=mySchemaName
The schema name is read from the data source and used for repository structured data. The default is an empty string. |
Content and search quality
For repositories that contain record or field-based content (such as a CRM, CSV, or database), the SDK allows automatic HTML formatting for data fields. Your connector defines the data fields at the beginning of connector execution, and then uses a content template to format each data record before uploading it to Cloud Search.
The content template defines the importance of each field value for searching.
The HTML <title> field is required and defined as the highest priority. You can
designate search quality importance levels for all the other content fields:
high, medium, or low. Any content field not defined in a specific category
defaults to low priority.
Setting | Parameter |
---|---|
Content HTML title | contentTemplate.templateName.title=myTitleField
The content HTML title and highest search quality field. This parameter is required only if you are using an HTML content template. The default value is an empty string. |
High search quality for content fields | contentTemplate.templateName.quality.high=hField1,hField2
Content fields given a high search priority. The default is an empty string. |
Medium search quality for content fields | contentTemplate.templateName.quality.medium=mField1,mField2
Content fields given a medium search priority. The default is an empty string. |
Low search quality for content fields | contentTemplate.templateName.quality.low=lField1,lField2
Content fields given a low search priority. The default is an empty string. |
Unspecified content fields | contentTemplate.templateName.unmappedColumnsMode=value
How the connector handles unspecified content fields. Valid values are:
|
Include field names in HTML template | contentTemplate.templateName.includeFieldName=true|false
Specifies whether to include the field names along with the field data in the HTML
template. The default is |
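Taken together, a content template configuration might look like the sketch below. The templateName segment and all field names are kept exactly as written in the table, so treat them as placeholders for your own template and field names.

```properties
# Field used for the HTML <title>, the highest search priority
contentTemplate.templateName.title=myTitleField

# Remaining fields grouped by search quality level
contentTemplate.templateName.quality.high=hField1,hField2
contentTemplate.templateName.quality.medium=mField1,mField2
contentTemplate.templateName.quality.low=lField1,lField2

# Show field names next to field data in the generated HTML
contentTemplate.templateName.includeFieldName=true
```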
Uncommonly set parameters
You rarely need to set the parameters listed in this section. The parameters' defaults are set for optimal performance. Google does not recommend setting these parameters to values different from their defaults unless your repository has specific requirements.
Proxy configuration
The SDK allows you to configure your connector to use a proxy for outgoing connections.
The transport.proxy.hostname and transport.proxy.port parameters are
required to enable transport through a proxy. The other parameters may be required
if your proxy requires authentication or operates over the SOCKS protocol instead of HTTP. If
transport.proxy.hostname is not set, the SDK does not use a proxy.
Setting | Parameter |
---|---|
Hostname | transport.proxy.hostname=hostname
The hostname for the proxy server. This parameter is required when using a proxy. |
Port | transport.proxy.port=port
The port number for the proxy server. This parameter is required when using a proxy. |
Proxy type | transport.proxy.type=type
The type of proxy. Valid values are:
The default value is |
Username | transport.proxy.username=username
The username to use when constructing a proxy authorization token. This parameter is optional, and should only be set if your proxy requires authentication. |
Password | transport.proxy.password=password
The password to use when constructing a proxy authorization token. This parameter is optional, and should only be set if your proxy requires authentication. |
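A minimal sketch of a proxy configuration with authentication. The host name, port, and credentials are hypothetical placeholders. The proxy type is omitted here, so the connector would use whatever default the SDK defines.

```properties
# Required to enable the proxy (hypothetical host and port)
transport.proxy.hostname=proxy.example.com
transport.proxy.port=3128

# Only needed if the proxy requires authentication (hypothetical credentials)
transport.proxy.username=proxyUser
transport.proxy.password=proxyPassword
```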
Traversers
The SDK enables you to specify multiple individual traversers to allow for parallel traversals of a data repository. The SDK template connectors use this feature.
Setting | Parameter |
---|---|
Thread pool size | traverse.threadPoolSize=size
Number of threads the connector creates to allow for parallel processing. A single iterator fetches operations serially (typically RepositoryDoc objects), but the API calls are processed in parallel using this number of threads. The default value is |
Partition size | traverse.partitionSize=batchSize
Number of The default value is |
Traverser poll requests
The core of the Cloud Search indexing queue is a priority queue containing an entry for each item known to exist. A listing connector can request to poll items from the indexing API. A poll request gets the highest priority entries from the indexing queue.
The following parameters are used by the SDK listing connector template to define polling parameters.
Setting | Parameter |
---|---|
Repository traverser | repository.traversers=t1, t2, t3, ...
Creates one or more individual traversers where t1, t2, t3,
... is the unique name of each. Each named traverser has its own set of settings
which are identified using the traverser's unique name, such as
|
Queue to be polled | traverser.pollRequest.queue=mySpecialQueue
Queue names that this traverser polls. The default is an empty string (implies "default"). |
traverser.t1.pollRequest.queue=mySpecialQueue
When you have multiple traversers, set the queue for each traverser (where t1 represents a specific traverser). |
|
Polling behavior | traverser.pollRequest.limit=maxItems
Maximum number of items to return from a polling request.
The default value is |
traverser.t1.pollRequest.limit=limit
When you have multiple traversers, set the polling limit for each traverser (where t1 represents a specific traverser). |
|
Item status | traverser.pollRequest.statuses=statuses
The specific item's statuses that this traverser polls, where statuses can be
any combination of |
traverser.t1.pollRequest.statuses=statusesForThisTraverser
When you have multiple traversers, set the item statuses for each traverser (where t1 represents a specific traverser). | |
Host load | traverser.hostload=threads
Maximum number of active parallel threads available for polling. The default
value is |
traverser.t1.hostload=threadsForThisTraverser
When you have multiple traversers, set the host load for each traverser (where t1 represents a specific traverser). |
|
Timeout | traverser.timeout=timeout
Timeout value for interrupting this traverser poll attempt. The default value is |
traverser.t1.timeout=timeoutForThisTraverser
When you have multiple traversers, set the timeout for each traverser (where t1 represents a specific traverser). |
|
traverser.timeunit=timeoutUnit
The timeout units. Valid values are |
|
traverser.t1.timeunit=timeoutUnit
When you have multiple traversers, set the timeout units for each traverser (where t1 represents a specific traverser). |
In most cases, a connector using the SDK listing connector template only requires a single set of parameters for polling. In some cases, you may need to define more than one set of polling criteria, for example if your traversal algorithm requires separating item processing across different queues.
In this case, you have the option of defining multiple sets of polling
parameters. Begin by specifying the names of the parameter sets using
repository.traversers. For each defined traverser name, supply the
configuration file with the parameters in the table above, replacing
t1 with the traverser name. This creates a set of polling
parameters for each defined traverser.
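The steps above can be sketched as follows for two traversers named t1 and t2. The second queue name and all numeric values are hypothetical and are not the SDK defaults.

```properties
# Declare two named traversers
repository.traversers=t1, t2

# Polling parameters for traverser t1
traverser.t1.pollRequest.queue=mySpecialQueue
traverser.t1.pollRequest.limit=50
traverser.t1.hostload=4

# Polling parameters for traverser t2
traverser.t2.pollRequest.queue=myOtherQueue
traverser.t2.pollRequest.limit=20
traverser.t2.hostload=2
```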
Checkpoints
A checkpoint is useful for tracking the state of an incremental traversal.
Setting | Parameter |
---|---|
Checkpoint directory | connector.checkpointDirectory=/path/to/checkpoint
Specifies the path to the local directory to use for the incremental and full traversal checkpoints. |
Content uploads
Item content is uploaded to Cloud Search with the item when the content's size does not exceed the specified threshold. If the content's size exceeds the threshold, the content is uploaded separately from the item's metadata and structured data.
Setting | Parameter |
---|---|
Content threshold | api.contentUploadThresholdBytes=bytes
The threshold for content that determines whether it is uploaded "in-line" with the item versus using a separate upload. The default value is |
Containers
The full connector template uses an algorithm involving the concept of a temporary data source queue toggle for detecting deleted records in the database. This means that upon each full traversal, the fetched records, which are in a new queue, replace all the existing Cloud Search records indexed from the previous traversal, which are in an old queue.
Setting | Parameter |
---|---|
Container name tag | traverse.queueTag=instance
To run multiple instances of the connector in parallel (whether on different data repositories or on separate parts of a common data repository) without interfering with each other, assign a unique container name tag to each run of the connector. A unique name tag prevents a connector instance from deleting another's records. The name tag is appended to the Full Traversal Connector toggle queue ID. |
Disable delete detection | traverse.useQueues=true|false
Indicates if connector uses queue toggle logic for delete detection. The default value is Note: This configuration parameter is only applicable to connectors
implementing the |
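As a sketch of the container name tag, two connector instances indexing separate parts of the same repository could each use their own configuration file with a distinct tag. The tag values instanceA and instanceB are hypothetical.

```properties
# In the configuration file of the first connector instance:
traverse.queueTag=instanceA

# In the configuration file of the second connector instance (shown as a comment):
# traverse.queueTag=instanceB
```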
Batch policy
The SDK supports a batch policy that enables you to perform the following actions:
- Batch requests
- Specify the number of requests in a batch queue
- Manage concurrently executing batches
- Flush batched requests
The SDK batches the connector's requests together to improve throughput during uploads. The SDK uploads a batch when either the batch size or the batch delay is reached, whichever comes first. For example, the upload is triggered if the batch delay expires before the batch size is reached, or if the batch size is reached before the delay expires.
Setting | Parameter |
---|---|
Batch requests | batch.batchSize=batchSize
Batch requests together. The default value is |
Number of requests in a batch queue | batch.maxQueueLength=maxQueueLength
Maximum number of requests in a batch queue for execution.
The default value is |
Concurrently executing batches | batch.maxActiveBatches=maxActiveBatches
Number of allowable concurrently executing batches.
The default value is |
Flush batched requests automatically | batch.maxBatchDelaySeconds=maxBatchDelay
Number of seconds to wait before batched requests are
flushed automatically. The
default value is |
Flush batched requests on shutdown | batch.flushOnShutdown=true|false
Flush batched requests during service shutdown.
The default value is |
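A minimal sketch of a batch policy. All values are illustrative assumptions rather than the SDK defaults. With these settings, a batch would be uploaded once 10 requests accumulate or 5 seconds pass, whichever comes first.

```properties
# Upload a batch after 10 requests...
batch.batchSize=10
# ...or after 5 seconds, whichever comes first
batch.maxBatchDelaySeconds=5

# Limits on queued requests and concurrently executing batches
batch.maxQueueLength=1000
batch.maxActiveBatches=20

# Send any remaining batched requests when the service shuts down
batch.flushOnShutdown=true
```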
Exception handlers
The exception handler parameters determine how the traverser proceeds after it encounters an exception.
Setting | Parameter |
---|---|
Traverser instruction in case of error | traverse.exceptionHandler=exceptions
How the traverser should proceed after an exception is thrown. Valid values are:
|
Wait time between exceptions | abortExceptionHander.backoffMilliSeconds=backoff
Backoff time in milliseconds to wait between detected handler exceptions
(typically used when traversing a repository). The default value is |