Tune connector settings

The Google Cloud Search SDK contains several Google-supplied configuration parameters used by all connectors. Knowing how to tune these settings can greatly streamline the indexing of data. This guide lists several issues that can surface during indexing and the settings used to resolve them.

Indexing throughput is low for FullTraversalConnector

The following table lists configuration settings to improve throughput for a FullTraversalConnector:

| Setting | Description | Default | Configuration change to try |
| --- | --- | --- | --- |
| traverse.partitionSize | The number of ApiOperation() objects to process in a batch before fetching additional ApiOperation() objects. The SDK waits for the current partition to be processed before fetching additional items. This setting depends on the amount of memory available. Smaller partition sizes, such as 50 or 100, require less memory but more waiting by the SDK. | 50 | If you have a lot of memory available, try increasing partitionSize to 1000 or more. |
| batch.batchSize | The number of requests batched together. At the end of partitioning, the SDK waits for all batched requests from the partition to be processed. Larger batches require a longer wait. | 10 | Try lowering the batch size. |
| batch.maxActiveBatches | The number of batches allowed to execute concurrently. | 20 | If you lower batchSize, raise maxActiveBatches according to this formula: maxActiveBatches = (partitionSize / batchSize) + 50. For example, if your partitionSize is 1000 and your batchSize is 5, your maxActiveBatches should be 250. The extra 50 is a buffer for retry requests. This increase allows the connector to batch all requests without blocking. |
| traverse.threadPoolSize | The number of threads the connector creates to allow for parallel processing. A single iterator fetches operations (typically RepositoryDoc objects) serially, but the API calls are processed in parallel using threadPoolSize threads. Each thread processes one item at a time. With the default of 50, at most 50 items are processed simultaneously, and each item takes approximately 4 seconds to process (including the indexing request). | 50 | Try increasing threadPoolSize by a multiple of 10. |
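The maxActiveBatches formula and the thread pool arithmetic above can be checked with a short calculation. The following Python sketch is illustrative only and is not part of the SDK: the function names are hypothetical, rounding up a partial batch is an assumption, and the throughput figure is a rough ceiling derived from the approximate 4 seconds per item quoted in the table.

```python
import math


def max_active_batches(partition_size: int, batch_size: int, retry_buffer: int = 50) -> int:
    """Apply maxActiveBatches = (partitionSize / batchSize) + 50."""
    if partition_size <= 0 or batch_size <= 0:
        raise ValueError("partition_size and batch_size must be positive")
    # Assumption: round up so that a final, partially filled batch still gets a slot.
    return math.ceil(partition_size / batch_size) + retry_buffer


def estimated_items_per_second(thread_pool_size: int, seconds_per_item: float = 4.0) -> float:
    """Rough upper bound on throughput: each thread processes one item at a time."""
    if thread_pool_size <= 0 or seconds_per_item <= 0:
        raise ValueError("thread_pool_size and seconds_per_item must be positive")
    return thread_pool_size / seconds_per_item


if __name__ == "__main__":
    # partitionSize = 1000 and batchSize = 5, as in the table's example.
    print(max_active_batches(1000, 5))
    # Default threadPoolSize of 50 at roughly 4 seconds per item.
    print(estimated_items_per_second(50))
```

With the values from the table's example, the first call reproduces the suggested maxActiveBatches of 250. The second call suggests that the default thread pool tops out at about 12.5 items per second, which is why increasing threadPoolSize is worth trying.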

Finally, consider using the setRequestMode() method to change the API request mode (either ASYNCHRONOUS or SYNCHRONOUS).

For additional information on configuration file parameters, refer to Google-supplied configuration parameters.

Indexing throughput is low for ListTraversalConnector

By default, a connector that implements ListTraversalConnector uses a single traverser to index your items. To increase indexing throughput, you can create multiple traversers, each with its own configuration that focuses on specific item statuses (NEW_ITEM, MODIFIED, and so on). The following table lists configuration settings to improve throughput:

| Setting | Description | Default | Configuration change to try |
| --- | --- | --- | --- |
| repository.traversers = t1, t2, t3, ... | Creates one or more individual traversers, where t1, t2, t3, ... is the unique name of each. Each named traverser has its own set of settings, which are identified using the traverser's unique name, such as traversers.t1.hostload and traversers.t2.hostload. | One traverser | Use this setting to add additional traversers. |
| traversers.t1.hostload = n | Identifies the number of threads, n, to use to simultaneously index items. | 5 | Experiment with tuning n based on how much load you want to put on your repository. Start with values of 10 or above. |
| schedule.pollQueueIntervalSecs = s | Identifies the number of seconds, s, to wait before re-polling. The content connector continues to poll items as long as the API returns items in the poll response. When the poll response is empty, the connector waits for s seconds before trying again. This setting is only used by the ListingConnector. | 10 | Try lowering to 1. |
| traverser.t1.pollRequest.statuses = status1, status2, … | Specifies the statuses, status1, status2, …, of the items to index. For example, setting status1 to NEW_ITEM and status2 to MODIFIED instructs traverser t1 to index only items with those statuses. | One traverser checks for all statuses | Experiment with having different traversers poll for different statuses. |
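As an illustration of how these settings combine, the following Python sketch renders configuration lines for several named traversers. It is a hypothetical helper, not part of the SDK. The key spellings (including the traversers. and traverser. prefixes) are copied exactly as they appear in the table above and should be verified against the Google-supplied configuration parameters reference before use.

```python
def traverser_properties(traversers: dict, poll_interval_secs: int = 1) -> str:
    """Render configuration lines for named traversers.

    `traversers` maps each traverser name to a (hostload, statuses) pair,
    where `statuses` is a list of item statuses that traverser polls for.
    """
    # Declare every traverser name on a single line.
    lines = ["repository.traversers = " + ", ".join(traversers)]
    for name, (hostload, statuses) in traversers.items():
        # Key spellings follow the table above; verify them against the SDK reference.
        lines.append(f"traversers.{name}.hostload = {hostload}")
        lines.append(f"traverser.{name}.pollRequest.statuses = " + ", ".join(statuses))
    # A shorter poll interval reduces idle time after an empty poll response.
    lines.append(f"schedule.pollQueueIntervalSecs = {poll_interval_secs}")
    return "\n".join(lines)


if __name__ == "__main__":
    # One traverser for new items and one for modified items, 10 threads each.
    print(traverser_properties({
        "t1": (10, ["NEW_ITEM"]),
        "t2": (10, ["MODIFIED"]),
    }))
```

Splitting statuses across traversers in this way lets new and modified items be indexed in parallel instead of competing for a single traverser's threads.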

For additional information on configuration file parameters, refer to Google-supplied configuration parameters.

SDK timeouts or interrupts while uploading large files

If you experience SDK timeouts or interrupts while uploading large files, specify a larger timeout using traverser.timeout=s (where s is the number of seconds). This value identifies how long worker threads have to process an item. The default timeout in the SDK is 60 seconds for traverser threads. Additionally, if individual API requests time out, use the following parameters to increase request timeout values:

| Request timeout parameter | Description | Default |
| --- | --- | --- |
| indexingService.connectTimeoutSeconds | Connect timeout for indexing API requests. | 120 seconds. |
| indexingService.readTimeoutSeconds | Read timeout for indexing API requests. | 120 seconds. |
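One way to choose a value for traverser.timeout is to estimate how long the largest file takes to upload. The following Python sketch is a rough heuristic and not SDK behavior: the function names, the safety factor of 2, and the bandwidth figure are all assumptions introduced for illustration. Only the parameter names and the 60-second and 120-second defaults come from this guide.

```python
import math


def upload_timeout_seconds(file_size_mb: float, upload_mb_per_sec: float,
                           safety_factor: float = 2.0, minimum: int = 60) -> int:
    """Estimate a traverser timeout for the largest expected file."""
    if file_size_mb < 0 or upload_mb_per_sec <= 0:
        raise ValueError("file size must be non-negative and bandwidth positive")
    # Assumption: transfer time scaled by a safety factor approximates processing time.
    estimate = math.ceil(file_size_mb / upload_mb_per_sec * safety_factor)
    # Never go below the SDK default of 60 seconds for traverser threads.
    return max(minimum, estimate)


def timeout_properties(traverser_timeout: int, connect_timeout: int = 120,
                       read_timeout: int = 120) -> str:
    """Render the timeout settings as configuration lines."""
    return "\n".join([
        f"traverser.timeout={traverser_timeout}",
        f"indexingService.connectTimeoutSeconds={connect_timeout}",
        f"indexingService.readTimeoutSeconds={read_timeout}",
    ])


if __name__ == "__main__":
    # Hypothetical example: 500 MB files over a 2 MB/s link.
    timeout = upload_timeout_seconds(500, 2)
    print(timeout_properties(timeout, read_timeout=300))
```

Measure actual upload times in your environment before settling on values, because repository latency and content conversion also contribute to per-item processing time.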