The Google Cloud Search SDK contains several Google-supplied configuration parameters used by all connectors. Knowing how to tune these settings can greatly streamline the indexing of data. This guide lists several issues that can surface during indexing and the settings used to resolve them.
Indexing throughput is low for FullTraversalConnector
The following table lists configuration settings to improve throughput for a FullTraversalConnector:
Setting | Description | Default | Configuration change to try |
---|---|---|---|
traverse.partitionSize | The number of ApiOperation() objects processed in a batch before the SDK fetches additional ApiOperation() objects. The SDK waits for the current partition to be processed before fetching additional items. This setting depends on the amount of memory available. Smaller partition sizes, such as 50 or 100, require less memory but more waiting on behalf of the SDK. | 50 | If you have a lot of memory available, try increasing partitionSize to 1000 or more. |
batch.batchSize | The number of requests batched together. At the end of each partition, the SDK waits for all batched requests from that partition to process. Larger batches require a longer wait. | 10 | Try lowering the batch size. |
batch.maxActiveBatches | The number of batches allowed to execute concurrently. | 20 | If you lower batchSize, raise maxActiveBatches according to this formula: maxActiveBatches = (partitionSize / batchSize) + 50. For example, if your partitionSize is 1000 and your batchSize is 5, your maxActiveBatches should be 250. The extra 50 is a buffer for retry requests. This increase allows the connector to batch all requests without blocking. |
traverse.threadPoolSize | The number of threads the connector creates to allow for parallel processing. A single iterator fetches operations (typically RepositoryDoc objects) serially, but the API calls process in parallel using threadPoolSize threads. Each thread processes one item at a time. With the default of 50, the connector processes at most 50 items simultaneously, and an individual item takes approximately 4 seconds to process (including the indexing request). | 50 | Try increasing threadPoolSize by a multiple of 10. |
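
The sizing arithmetic in the table can be sketched in a few lines of Java. This is an illustrative helper, not part of the SDK: the class and method names are hypothetical, the division is rounded up (an assumption for partition sizes that are not a multiple of the batch size), and the throughput figure is only a rough ceiling derived from the "approximately 4 seconds per item" estimate above.

```java
/** Illustrative sizing helpers; hypothetical names, not part of the Cloud Search SDK. */
final class BatchSizing {
    /** Headroom for retry requests, taken from the formula in the table. */
    static final int RETRY_BUFFER = 50;

    /** maxActiveBatches = (partitionSize / batchSize) + 50, with the division rounded up. */
    static int maxActiveBatches(int partitionSize, int batchSize) {
        if (partitionSize <= 0 || batchSize <= 0) {
            throw new IllegalArgumentException("partitionSize and batchSize must be positive");
        }
        int batchesPerPartition = (partitionSize + batchSize - 1) / batchSize;
        return batchesPerPartition + RETRY_BUFFER;
    }

    /** Rough upper bound on items per second when each thread handles one item at a time. */
    static double maxItemsPerSecond(int threadPoolSize, double secondsPerItem) {
        if (threadPoolSize <= 0 || secondsPerItem <= 0) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        return threadPoolSize / secondsPerItem;
    }

    public static void main(String[] args) {
        // The table's example: partitionSize 1000, batchSize 5.
        System.out.println(maxActiveBatches(1000, 5));  // prints 250
        // Default thread pool of 50 at roughly 4 seconds per item.
        System.out.println(maxItemsPerSecond(50, 4.0)); // prints 12.5
    }
}
```

With the defaults, the ceiling works out to about 12.5 items per second, which is why raising threadPoolSize is one of the first changes to try when memory and repository load allow it.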
Finally, consider using the setRequestMode() method to change the API request mode (either ASYNCHRONOUS or SYNCHRONOUS).
For additional information on configuration file parameters, refer to Google-supplied configuration parameters.
Indexing throughput is low for ListTraversalConnector
By default, a connector that implements the ListTraversalConnector uses a
single traverser to index your items. To increase indexing throughput, you can
create multiple traversers, each with its own configuration focusing on specific
item statuses (NEW_ITEM, MODIFIED, and so on). The following table lists
configuration settings to improve throughput:
Setting | Description | Default | Configuration change to try |
---|---|---|---|
repository.traversers = t1, t2, t3, ... | Creates one or more individual traversers where t1, t2, t3, ... is the unique name of each. Each named traverser has its own set of settings which are identified using the traverser's unique name, such as traversers.t1.hostload and traversers.t2.hostload | One traverser | Use this setting to add additional traversers |
traversers.t1.hostload = n | Identifies the number of threads, n, to use to simultaneously index items. | 5 | Experiment with tuning n based on how much load you want to put on your repository. Start with values of 10 or above. |
schedule.pollQueueIntervalSecs = s | Identifies the number of seconds, s, to wait before re-polling. The content connector continues to poll items as long as the API returns items in the poll response. When the poll response is empty, the connector waits for s seconds before trying again. This setting is only used by the ListingConnector. | 10 | Try lowering to 1. |
traverser.t1.pollRequest.statuses = status1, status2, … | Specifies the statuses, status1, status2, …, of the items to index. For example, setting status1 to NEW_ITEM and status2 to MODIFIED instructs traverser t1 to index only items with those statuses. | One traverser checks for all statuses | Experiment with having different traversers poll for different statuses. |
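
As a hedged illustration of how these settings fit together, the sketch below parses a small configuration with two traversers, each polling for a different status. The traverser names t1 and t2, the values, and the TraverserConfigExample class are hypothetical; the configuration text is embedded in a string and read with the standard java.util.Properties class only so that the example is self-contained, whereas a real connector reads these keys from its configuration file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative only: shows the shape of a multi-traverser configuration. */
final class TraverserConfigExample {
    // Hypothetical values; t1 and t2 are made-up traverser names.
    static final String CONFIG =
        "repository.traversers = t1, t2\n"
        + "traverser.t1.pollRequest.statuses = NEW_ITEM\n"
        + "traverser.t2.pollRequest.statuses = MODIFIED\n"
        + "schedule.pollQueueIntervalSecs = 1\n";

    /** Parses the embedded configuration text into a Properties object. */
    static Properties load() {
        Properties props = new Properties();
        try {
            props.load(new StringReader(CONFIG));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load();
        // List each named traverser with the statuses it polls for.
        for (String name : props.getProperty("repository.traversers").split("\\s*,\\s*")) {
            String statuses = props.getProperty("traverser." + name + ".pollRequest.statuses");
            System.out.println(name + " polls for " + statuses);
        }
    }
}
```

Splitting statuses across traversers in this way keeps newly added items from queuing behind a large backlog of modified ones; whether that helps depends on the mix of items in your repository.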
For additional information on configuration file parameters, refer to Google-supplied configuration parameters.
SDK timeouts or interrupts while uploading large files
If you experience SDK timeouts or interrupts while uploading large files,
specify a larger timeout using traverser.timeout=s (where s = number of
seconds). This value identifies how long worker threads have to process an
item. The default timeout in the SDK is 60 seconds for traverser threads.
Additionally, if you experience individual API requests timing out, use the
following parameters to increase request timeout values:
Request timeout parameter | Description | Default |
---|---|---|
indexingService.connectTimeoutSeconds | Connect timeout for indexing API requests. | 120 seconds. |
indexingService.readTimeoutSeconds | Read timeout for indexing API requests. | 120 seconds. |