This page of the Cloud Search tutorial shows how to set up a data source and content connector for indexing data. To start from the beginning of this tutorial, refer to the Cloud Search getting started tutorial.
Build the connector
Change your working directory to the cloud-search-samples/end-to-end/connector
directory and run this command:
mvn package -DskipTests
The command downloads the dependencies needed to build the content connector and compiles the code.
Create service account credentials
The connector requires service account credentials to call the Cloud Search APIs. To create the credentials:
- Return to the Google Cloud console.
- In the left navigation, click Credentials. The "Credentials" page appears.
- Click the + CREATE CREDENTIALS drop-down list and select Service account. The "Create service account" page appears.
- In the Service account name field, enter "tutorial".
- Note the Service account ID value (right after the Service account name). This value is used later.
- Click CREATE. The "Service account permissions (optional)" dialog appears.
- Click CONTINUE. The "Grant users access to this service account (optional)" dialog appears.
- Click DONE. The "Credentials" screen appears.
- Under Service Accounts, click the service account email. The "Service account details" page appears.
- Under Keys, click the ADD KEY drop-down list and select Create new key. The "Create private key" dialog appears.
- Click CREATE.
- (Optional) If the "Do you want to allow downloads on console.cloud.google.com?" dialog appears, click Allow.
- A private key file is saved to your computer. Note the location of the downloaded file. This file is used to configure the content connector so it can authenticate itself when calling the Google Cloud Search APIs.
Initialize third-party support
Before you can call any other Cloud Search APIs, you must initialize third-party support for Google Cloud Search.
To initialize third-party support for Cloud Search:
1. Your Cloud Search platform project contains service account credentials. However, to initialize third-party support, you must create web application credentials. For instructions on how to create web application credentials, refer to Create credentials. After completing this step, you should have a client ID and client secret file.
2. Use Google's OAuth 2 playground to obtain an access token:
   - Click settings and check Use your own OAuth credentials.
   - Enter the client ID and client secret from step 1.
   - Click Close.
   - In the scopes field, type https://www.googleapis.com/auth/cloud_search.settings and click Authorize. The OAuth 2 playground returns an authorization code.
   - Click Exchange authorization code for tokens. A token is returned.
3. To initialize third-party support for Cloud Search, use the following curl command. Be sure to replace [YOUR_ACCESS_TOKEN] with the token obtained in step 2.
   curl --request POST \
     'https://cloudsearch.googleapis.com/v1:initializeCustomer' \
     --header 'Authorization: Bearer [YOUR_ACCESS_TOKEN]' \
     --header 'Accept: application/json' \
     --header 'Content-Type: application/json' \
     --data '{}' \
     --compressed
   If successful, the response body contains an instance of operation. For example:
   {
     name: "operations/customers/01b3fqdm/lro/AOIL6eBv7fEfiZ_hUSpm8KQDt1Mnd6dj5Ru3MXf-jri4xK6Pyb2-Lwfn8vQKg74pgxlxjrY"
   }
   If unsuccessful, contact Cloud Search support.
4. Use operations.get to verify that third-party support is initialized:
   curl \
     'https://cloudsearch.googleapis.com/v1/operations/customers/01b3fqdm/lro/AOIL6eBv7fEfiZ_hUSpm8KQDt1Mnd6dj5Ru3MXf-jri4xK6Pyb2-Lwfn8vQKg74pgxlxjrY?key=[YOUR_API_KEY]' \
     --header 'Authorization: Bearer [YOUR_ACCESS_TOKEN]' \
     --header 'Accept: application/json' \
     --compressed
   When the third-party initialization is complete, the response contains the field done set to true. For example:
   {
     name: "operations/customers/01b3fqdm/lro/AOIL6eBv7fEfiZ_hUSpm8KQDt1Mnd6dj5Ru3MXf-jri4xK6Pyb2-Lwfn8vQKg74pgxlxjrY"
     done: true
   }
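If the operation is not finished on the first check, the operations.get request can be repeated until the done field appears. The following is a minimal Java sketch of that polling loop, not part of the tutorial's sample code. The class name OperationPoller and the fetchOperation supplier are hypothetical; the supplier stands in for whatever code issues the operations.get request and returns the response body. A regular expression is used only to keep the sketch dependency-free; a real client would parse the JSON.

```java
import java.util.function.Supplier;
import java.util.regex.Pattern;

class OperationPoller {
    // Matches `done: true` and `"done": true`. Assumption: a regex is good enough
    // for a sketch; production code should parse the response as JSON.
    private static final Pattern DONE = Pattern.compile("\"?\\bdone\"?\\s*:\\s*true\\b");

    /** Returns true when the operation response body reports done = true. */
    static boolean isDone(String responseBody) {
        return DONE.matcher(responseBody).find();
    }

    /**
     * Calls fetchOperation (a hypothetical stand-in for the operations.get request)
     * until the response is done or maxAttempts is reached.
     */
    static boolean waitUntilDone(Supplier<String> fetchOperation, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (isDone(fetchOperation.get())) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and stop polling.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Canned responses replace real HTTP calls so the sketch runs offline.
        String[] responses = {
            "{ name: \"operations/example\" }",
            "{ name: \"operations/example\" done: true }"
        };
        int[] calls = {0};
        boolean done = waitUntilDone(() -> responses[Math.min(calls[0]++, 1)], 5, 10);
        System.out.println("initialized: " + done);
    }
}
```

Capping the number of attempts keeps the loop from running forever if initialization fails; in that case the tutorial's advice to contact Cloud Search support still applies.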
Create the data source
Next, create a data source in the Google Admin console. The data source provides a namespace for indexing content using the connector.
- Open the Google Admin console.
- Click the Apps icon. The "Apps administration" page appears.
- Click Google Workspace. The "Apps Google Workspace administration" page appears.
- Scroll down and click Cloud Search. The "Settings for Google Workspace" page appears.
- Click Third-party data sources. The "Data Sources" page appears.
- Click the round yellow +. The "Add new data source" dialog appears.
- In the Display name field, type "tutorial".
- In the Service account email addresses field, enter the email address of the service account you created in the previous section. If you do not know the email address of the service account, look up the value in the service accounts page.
- Click ADD. The "Successfully created data source" dialog appears.
- Click OK. Note the Source ID for the newly created data source. The Source ID is used to configure the content connector.
Generate a personal access token for the GitHub API
The connector requires authenticated access to the GitHub API to have sufficient quota. For simplicity, the connector uses personal access tokens instead of OAuth. Like OAuth, a personal access token lets the connector authenticate as a user with a limited set of permissions.
- Log in to GitHub.
- In the upper-right corner, click on your profile picture. A drop-down menu appears.
- Click Settings.
- Click Developer settings.
- Click Personal access tokens.
- Click Generate personal access token.
- In the Note field, enter "Cloud Search tutorial".
- Check the public_repo scope.
- Click Generate token.
- Note the generated token. It is used by the connector to call the GitHub APIs and provides API quota to perform the indexing.
Configure the connector
After creating the credentials and data source, update the connector configuration to include these values:
- From the command line, change directory to cloud-search-samples/end-to-end/connector/.
- Open the sample-config.properties file with a text editor.
- Set the api.serviceAccountPrivateKeyFile parameter to the file path of the service credentials you previously downloaded.
- Set the api.sourceId parameter to the ID of the data source you previously created.
- Set the github.user parameter to your GitHub username.
- Set the github.token parameter to the access token you previously created.
- Save the file.
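As an illustration, the four settings above might look like the following once filled in. The parameter names come from the steps above; every value is a placeholder to replace with your own, and the file may contain other settings that should be left unchanged.

```properties
# Placeholder values only; substitute your own.
# Path to the private key file downloaded for the service account
api.serviceAccountPrivateKeyFile=/path/to/your-service-account-key.json
# Source ID noted when the data source was created
api.sourceId=YOUR_SOURCE_ID
# GitHub username and personal access token
github.user=your-github-username
github.token=YOUR_GITHUB_TOKEN
```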
Update the schema
The connector indexes both structured and unstructured content. Before indexing data, you must update the schema for the data source. Run the following command to update the schema:
mvn exec:java -Dexec.mainClass=com.google.cloudsearch.tutorial.SchemaTool \
-Dexec.args="-Dconfig=sample-config.properties"
Run the connector
To run the connector and begin indexing, run the command:
mvn exec:java -Dexec.mainClass=com.google.cloudsearch.tutorial.GithubConnector \
-Dexec.args="-Dconfig=sample-config.properties"
The default configuration for the connector is to index a single repository
in the googleworkspace organization. Indexing the repository takes about 1 minute.
After initial indexing, the connector continues to poll for changes to the
repository that need to be reflected in the Cloud Search index.
Reviewing the code
The remaining sections examine how the connector is built.
Starting the application
The entry point to the connector is the GithubConnector class. The main
method instantiates the SDK's IndexingApplication and starts it.
The ListingConnector provided by the SDK implements a traversal strategy
that leverages Cloud Search queues for tracking the state of items in the
index. It delegates to GithubRepository, implemented by the sample connector,
for accessing content from GitHub.
Traversing the GitHub repositories
During full traversals, the getIds() method is called to push items that may
need to be indexed into the queue.
The connector can index multiple repositories or organizations. To minimize the
impact of a failure, one GitHub repository is traversed at a time. A checkpoint
is returned with the results of the traversal, containing the list of
repositories to be indexed in subsequent calls to getIds(). If an error
occurs, indexing is resumed at the current repository instead of starting
from the beginning.
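The checkpoint idea can be illustrated with a small, self-contained Java sketch. It is not the sample connector's code and does not use the SDK: the class CheckpointSketch, the Step type, and the comma-separated checkpoint format are all assumptions made for illustration. Each call handles one repository and returns a checkpoint naming the repositories that remain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CheckpointSketch {
    /** Outcome of one traversal step. */
    static final class Step {
        final String traversed;   // repository handled in this call
        final String checkpoint;  // comma-separated repositories still pending; empty when finished

        Step(String traversed, String checkpoint) {
            this.traversed = traversed;
            this.checkpoint = checkpoint;
        }
    }

    /** Traverses only the first repository named in the checkpoint and returns the rest. */
    static Step traverseNext(String checkpoint) {
        if (checkpoint.isEmpty()) {
            throw new IllegalArgumentException("nothing left to traverse");
        }
        List<String> pending = new ArrayList<>(Arrays.asList(checkpoint.split(",")));
        String current = pending.remove(0);
        // A real connector would list the repository's items here. If that fails,
        // the exception propagates and the caller keeps the old checkpoint,
        // so the next call starts again at the same repository.
        return new Step(current, String.join(",", pending));
    }

    public static void main(String[] args) {
        // Hypothetical repository names.
        String checkpoint = "example-org/repo-a,example-org/repo-b";
        while (!checkpoint.isEmpty()) {
            Step step = traverseNext(checkpoint);
            System.out.println("traversed " + step.traversed);
            checkpoint = step.checkpoint;
        }
    }
}
```

Because the checkpoint only advances after a repository has been handled, a failure costs at most one repository's worth of repeated work.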
The method collectRepositoryItems()
handles the traversal of a single
GitHub repo. This method returns a collection of ApiOperations
representing the items to be pushed into the queue. Items are pushed as a
resource name and a hash value representing the current state of the item.
The hash value is used in subsequent traversals of the GitHub repositories. This value provides a lightweight check to determine if the content has changed without having to upload additional content. The connector blindly queues all items. If the item is new or the hash value has changed, it is made available for polling in the queue. Otherwise the item is considered unmodified.
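To make the hash comparison concrete, here is a minimal in-memory Java sketch. In Cloud Search the queue and the comparison live on the service side; the HashQueueSketch class below is a hypothetical stand-in that only mimics the rule described above. Computing the hash as a SHA-256 digest of the content is also an assumption, since the text does not say how the sample connector derives its hash values.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

class HashQueueSketch {
    // Last hash seen for each resource name (stand-in for state kept by the queue).
    private final Map<String, String> lastSeenHash = new HashMap<>();

    /**
     * Records the pushed item and returns true if it is new or its hash changed,
     * meaning it should become available for polling. Returns false if unmodified.
     */
    boolean push(String resourceName, String hash) {
        String previous = lastSeenHash.put(resourceName, hash);
        return !hash.equals(previous);
    }

    /** Assumed hashing scheme for the sketch: hex-encoded SHA-256 of the content. */
    static String sha256Hex(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        HashQueueSketch queue = new HashQueueSketch();
        String name = "example-org/repo-a/README.md"; // hypothetical resource name
        System.out.println(queue.push(name, sha256Hex("first version")));  // new item
        System.out.println(queue.push(name, sha256Hex("first version")));  // unchanged
        System.out.println(queue.push(name, sha256Hex("second version"))); // modified
    }
}
```

The connector side stays simple under this scheme: it pushes every item on every traversal and lets the comparison decide which ones need to be fetched and indexed again.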
Processing the queue
After the full traversal completes, the connector begins polling the
queue for items that need to be indexed. The getDoc()
method is called for each item pulled from the queue. The method reads
the item from GitHub and converts it into the proper representation
for indexing.
As the connector is running against live data that may be changed at any
time, getDoc()
also verifies that the item in the queue is still valid
and deletes any items from the index that no longer exist.
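The index-or-delete decision can be sketched in plain Java as follows. This is an illustration only and does not use the SDK's types: GetDocSketch, Operation, Action, and the readFromSource function are hypothetical names, with readFromSource standing in for the call that reads the item from GitHub.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

class GetDocSketch {
    enum Action { INDEX, DELETE }

    /** What should happen to a queued item (hypothetical type for this sketch). */
    static final class Operation {
        final Action action;
        final String resourceName;
        final String content; // null for DELETE

        Operation(Action action, String resourceName, String content) {
            this.action = action;
            this.resourceName = resourceName;
            this.content = content;
        }
    }

    /**
     * Re-reads the queued item from the source. If it still exists, it is indexed
     * with its current content; if it is gone, it is removed from the index.
     */
    static Operation getDoc(String resourceName,
                            Function<String, Optional<String>> readFromSource) {
        Optional<String> content = readFromSource.apply(resourceName);
        if (content.isPresent()) {
            return new Operation(Action.INDEX, resourceName, content.get());
        }
        return new Operation(Action.DELETE, resourceName, null);
    }

    public static void main(String[] args) {
        // A map stands in for the live GitHub data; names are hypothetical.
        Map<String, String> source = Map.of("example-org/repo-a/README.md", "# Hello");
        Function<String, Optional<String>> read = name -> Optional.ofNullable(source.get(name));
        System.out.println(getDoc("example-org/repo-a/README.md", read).action);
        System.out.println(getDoc("example-org/repo-a/removed.md", read).action);
    }
}
```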
For each of the GitHub objects the connector indexes, the corresponding
indexItem()
method handles building the item representation for
Cloud Search. For example, to build the representation for content items:
Next, deploy the search interface.