This guide is intended for Google Cloud Search CSV (comma-separated values) connector administrators, that is, anyone who is responsible for downloading, configuring, running, and monitoring the connector.
This guide includes instructions for performing key tasks related to CSV connector deployment:
- Download the Google Cloud Search CSV connector software
- Configure the connector for use with a specific CSV data source
- Deploy and run the connector
To understand the concepts in this document, you should be familiar with the fundamentals of Google Workspace, CSV files, and Access Control Lists (ACLs).
Overview of the Google Cloud Search CSV connector
The Cloud Search CSV connector works with any comma-separated values (CSV) text file. A CSV file stores tabular data, and each line of the file is a data record.
Google Cloud Search's CSV Connector extracts individual rows from a CSV file and indexes them into Cloud Search via Cloud Search's Indexing API. Once successfully indexed, individual rows from CSV files are searchable through Cloud Search's clients or Cloud Search's Query API. The CSV connector also supports controlling users' access to content in the search results, by using ACLs.
The Google Cloud Search CSV connector can be installed on Linux or Windows. Before you deploy the connector, ensure that you have the following required components:
- Java JRE 1.8 installed on a computer that runs the Google Cloud Search CSV connector
Google Workspace information required to establish relationships between Google Cloud Search and the data source:
- Google Workspace private key (which contains the service account ID)
- Google Workspace data source ID
Typically, the Google Workspace administrator for the domain can supply these credentials for you.
Deployment steps
To deploy the Google Cloud Search CSV connector, follow these steps:
- Install the Google Cloud Search CSV connector software
- Specify the CSV connector configuration
- Configure access to the Google Cloud Search data source
- Configure CSV file access
- Specify column names to index, unique key columns, and datetime columns
- Specify columns to use in clickable search result URLs
- Specify metadata information, column formats
- Schedule data traversal
- Specify Access Control List (ACL) options
1. Install the SDK
Install the SDK into your local Maven repository.
Clone the SDK repository from GitHub.
$ git clone https://github.com/google-cloudsearch/connector-sdk.git
$ cd connector-sdk/csv
Check out the desired version of the SDK:
$ git checkout tags/v1-0.0.3
Build the connector:
$ mvn package
Copy the connector zip file to your local installation directory:
$ cp target/google-cloudsearch-csv-connector-v1-0.0.3.zip installation-dir
$ cd installation-dir
$ unzip google-cloudsearch-csv-connector-v1-0.0.3.zip
$ cd google-cloudsearch-csv-connector-v1-0.0.3
2. Specify the CSV connector configuration
As the connector administrator, you control the CSV connector's behavior and attributes by defining parameters in the connector's configuration file. Configurable parameters include:
- Access to a data source
- Location of the CSV file
- CSV column definitions
- Column(s) that define a unique id
- Traversal options
- ACL options to restrict data access
For the connector to properly access a CSV file and index the relevant content, you must first create its configuration file.
To create a configuration file:
- Open a text editor of your choice and name the configuration file.
- Add key=value pairs to the file contents as described in the following sections.
- Save and name the configuration file.
Google recommends that you name the configuration file connector-config.properties so no additional command line parameters are required to run the connector.
Because you can specify the configuration file path on the command line, a standard file location is not necessary. However, keep the configuration file in the same directory as the connector to simplify tracking and running the connector.
To ensure the connector recognizes your configuration file, specify its path on the command line. Otherwise, the connector uses connector-config.properties in your local directory as the default file name. For information about specifying the configuration path on the command line, see Run the Cloud Search CSV connector.
3. Configure access to the Google Cloud Search data source
The first parameters every configuration file must specify are the ones necessary to access the Cloud Search data source, as shown in the following table. Typically, you need the data source ID, the service account ID, and the path to the service account's private key file to configure the connector's access to Cloud Search. The steps required to set up a data source are described in Manage third-party data sources.
Setting | Parameter |
Data source ID | api.sourceId=1234567890abcdef
Required. The Google Cloud Search source ID set up by the Google Workspace administrator, as described in Manage third-party data sources. |
Path to the service account private key file | api.serviceAccountPrivateKeyFile=./PrivateKey.json
Required. The service account key file that the CSV connector uses to access Google Cloud Search. |
Identity source ID | api.identitySourceId=x0987654321
Required if using external users and groups. The Google Cloud Search identity source ID set up by the Google Workspace administrator. |
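As a rough illustration of how these access parameters fit together, the following Python sketch parses key=value text and reports which required keys are missing. The helper names (parse_properties, missing_access_keys) are hypothetical, and the parser handles only plain key=value lines and # comments. It is not the connector's own configuration loader.

```python
REQUIRED_ACCESS_KEYS = ("api.sourceId", "api.serviceAccountPrivateKeyFile")


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def missing_access_keys(props, uses_external_identities=False):
    """Return the access parameters that are absent or empty."""
    required = list(REQUIRED_ACCESS_KEYS)
    if uses_external_identities:
        # Needed only when ACLs refer to external users and groups.
        required.append("api.identitySourceId")
    return [key for key in required if not props.get(key)]


example = """
# data source access
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json
"""
props = parse_properties(example)
print(missing_access_keys(props))
print(missing_access_keys(props, uses_external_identities=True))
```

A check like this can run before the connector starts, so that a missing source ID or key file path surfaces as a clear message rather than as a failed API call.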
4. Configure CSV file parameters
Before the connector can traverse a CSV file and extract data from it for indexing, you must identify the path to the file. You can also specify the file format and type of file encoding. Add the following parameters to specify the CSV file properties in the configuration file.
Setting | Parameter |
Path to the CSV file | csv.filePath=./movie_content.csv
Required. The path to the CSV file from which the connector extracts content for indexing. |
File format | csv.format=DEFAULT
The format of the file. Possible values are from the Apache Commons CSV CSVFormat class. Format values include: |
File format modifier | csv.format.withMethod=value
A modification to how Cloud Search handles the file. Possible methods are from the Apache Commons CSV CSVFormat class and include those that take a single character, string, or boolean value. For example, to specify a semicolon as a delimiter, use |
File encoding type | csv.fileEncoding=UTF-8
The Java character set to use when Cloud Search reads the file. If unspecified, Cloud Search uses the platform default character set. |
5. Specify column names to index and unique key columns
For the connector to access and index CSV files, you must provide information about column definitions in the configuration file. If the configuration file does not contain the parameters that specify the column names to index and unique key columns, default values are used.
Setting | Parameter |
Columns to index | csv.csvColumns=movieId,movieTitle,description,actors,releaseDate,year,userratings...
The column names to be indexed from the CSV file. If |
Unique key columns | csv.uniqueKeyColumns=movieId
The CSV column(s) whose values are used to generate each record's unique ID. If not specified, the connector uses the hash of the CSV record as its unique key. |
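The difference between a key built from columns and a key built from a record hash can be sketched in Python as follows. This is only an illustration: the function name record_unique_id, the "||" separator, and the use of SHA-256 are assumptions, and the connector's actual ID format may differ.

```python
import hashlib


def record_unique_id(record, unique_key_columns=()):
    """Build an ID from the key columns, or hash the whole record if none are set."""
    if unique_key_columns:
        # Assumption: key values are joined with "||"; the real separator may differ.
        return "||".join(record[column] for column in unique_key_columns)
    # Assumption: SHA-256 over all column values stands in for the record hash.
    joined = "\x1f".join(f"{name}={record[name]}" for name in sorted(record))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


row = {"movieId": "tt0031381", "movieTitle": "Gone with the Wind"}
print(record_unique_id(row, ["movieId"]))
print(record_unique_id(row))
```

Under these assumptions, a hash-based ID changes whenever any column value changes, so an edited row would look like a new record. A stable key column such as movieId avoids that.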
6. Specify columns to use in clickable search result URLs
When a user searches using Google Cloud Search, it responds by showing a results page that includes a clickable URL for each result. To enable this feature, you must add the parameters shown in the following table to the configuration file.
Setting | Parameter |
Search result URL format | url.format=https://mymoviesite.com/movies/{0}
Required. The format used to construct the view URL for CSV content. |
Search result URL parameters | url.columns=movieId
Required. The CSV column names whose values are used to generate the record's view URL. |
Search result URL parameters to escape | url.columnsToEscape=movieId
Optional. The CSV column names whose values are URL-escaped to generate a valid view URL. |
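The way these three parameters combine can be sketched in Python. The sketch assumes that {0}, {1}, and so on are filled in the order given by url.columns, and it uses Python's str.format and urllib.parse.quote as stand-ins. The function name build_view_url is hypothetical, and the connector's own escaping rules may differ in detail.

```python
from urllib.parse import quote


def build_view_url(url_format, url_columns, columns_to_escape, record):
    """Fill {0}, {1}, ... in url_format with the values of url_columns."""
    values = []
    for column in url_columns:
        value = record[column]
        if column in columns_to_escape:
            # Percent-encode characters such as spaces and slashes.
            value = quote(value, safe="")
        values.append(value)
    return url_format.format(*values)


record = {"movieId": "tt 0031381", "movieTitle": "Gone with the Wind"}
print(build_view_url(
    "https://mymoviesite.com/movies/{0}", ["movieId"], ["movieId"], record))
```

Without url.columnsToEscape, a value that contains a space or a slash would be copied into the URL unchanged, which can produce a link that does not resolve.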
7. Specify metadata information, column formats, search quality
You can add parameters to the configuration file that specify:
- Metadata configuration parameters
- Datetime formats
- Column formats
- Search quality
Metadata Configuration Parameters
Metadata configuration parameters describe the CSV columns used for populating item metadata. If the configuration file does not contain these parameters, default values are used. The following table shows these parameters.
Setting | Parameter |
Title | itemMetadata.title.field=movieTitle
itemMetadata.title.defaultValue=Gone with the Wind
The metadata attribute that contains the value corresponding to the document title. The default value is an empty string. |
URL | itemMetadata.sourceRepositoryUrl.field=url
itemMetadata.sourceRepositoryUrl.defaultValue=https://www.imdb.com/title/tt0031381/
The metadata attribute that contains the value for the document URL for search results. |
Created timestamp | itemMetadata.createTime.field=releaseDate
itemMetadata.createTime.defaultValue=1940-01-17
The metadata attribute that contains the value for the document creation timestamp. |
Last modified time | itemMetadata.updateTime.field=releaseDate
itemMetadata.updateTime.defaultValue=1940-01-17
The metadata attribute that contains the value for the last modification timestamp for the document. |
Document language | itemMetadata.contentLanguage.field=languageCode
itemMetadata.contentLanguage.defaultValue=en-US
The content language for documents being indexed. |
Schema object type | itemMetadata.objectType.field=type
itemMetadata.objectType.defaultValue=movie
The object type used by the connector, as defined in the schema. The connector won't index any structured data if this property is not specified. |
Datetime formats
Datetime formats specify the formats expected in metadata attributes. If the configuration file does not contain this parameter, default values are used. The following table shows this parameter.
Setting | Parameter |
Additional datetime formats | structuredData.dateTimePatterns=MM/dd/uuuu HH:mm:ssXXX
A semicolon-separated list of additional java.time.format.DateTimeFormatter patterns. The patterns are used when parsing string values for any date or date-time fields in the metadata or schema. The default value is an empty list, but RFC 3339 and RFC 1123 formats are always supported. |
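The idea of trying a list of patterns in order can be sketched in Python. Note that this is only an analogy: the connector takes java.time.format.DateTimeFormatter patterns, while the sketch uses Python strptime directives that roughly correspond to them. The function name parse_with_patterns and the second, date-only pattern are assumptions made for the example.

```python
from datetime import datetime


def parse_with_patterns(value, patterns):
    """Return the first successful parse of value, trying patterns in order."""
    for pattern in patterns:
        try:
            return datetime.strptime(value, pattern)
        except ValueError:
            continue
    raise ValueError(f"no configured pattern matches {value!r}")


# Rough strptime analogue of "MM/dd/uuuu HH:mm:ssXXX", plus a date-only pattern.
configured = "%m/%d/%Y %H:%M:%S%z;%Y-%m-%d"
patterns = configured.split(";")
print(parse_with_patterns("01/17/1940 08:30:00+00:00", patterns))
print(parse_with_patterns("1940-01-17", patterns))
```

Because the list is semicolon-separated, a pattern that itself needs a literal semicolon cannot be expressed this way in the sketch.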
Column formats
Column formats specify information about the column(s) that should be a part of the searchable content. If the configuration file does not contain these parameters, default values are used. The following table shows these parameters.
Setting | Parameter |
Skip header | csv.skipHeaderRecord=true
Boolean. Ignore the header record (first line) in the CSV file. If you have set |
Multi-value columns | csv.multiValueColumns=genre,actors
The column names in the CSV file that have multiple values. The default value is an empty string. |
Delimiter for multi-value columns | csv.multiValue.genre=;
The delimiter for the multi-value columns. The default delimiter is a comma. |
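How the multi-value settings might act on a single record can be sketched in Python. The function name split_multi_value_columns is hypothetical, and trimming whitespace and dropping empty values are assumptions rather than documented connector behavior.

```python
def split_multi_value_columns(record, multi_value_columns, delimiters=None):
    """Turn delimited strings in the listed columns into lists of values."""
    delimiters = delimiters or {}
    result = dict(record)
    for column in multi_value_columns:
        delimiter = delimiters.get(column, ",")  # the default delimiter is a comma
        raw = record.get(column, "")
        # Assumption: surrounding whitespace is trimmed and empty values dropped.
        result[column] = [part.strip() for part in raw.split(delimiter) if part.strip()]
    return result


record = {"movieTitle": "Gone with the Wind",
          "genre": "Drama;Romance",
          "actors": "Clark Gable, Vivien Leigh"}
# Mirrors csv.multiValueColumns=genre,actors and csv.multiValue.genre=;
print(split_multi_value_columns(record, ["genre", "actors"], {"genre": ";"}))
```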
Search quality
The Cloud Search CSV connector allows automatic HTML formatting for data fields. Your connector defines the data fields at the beginning of connector execution, and then uses a content template to format each data record before uploading it to Cloud Search.
The content template defines the importance of each field value for searching. The title field is required and is defined as the highest priority. You can designate search quality importance levels for all the other content fields: high, medium or low. Any content field not defined in a specific category defaults to low priority. The following table shows these parameters.
Setting | Parameter |
Content title | contentTemplate.csv.title=movieTitle
The content title is the highest search quality field. |
High search quality for content fields | contentTemplate.csv.quality.high=actors
Content fields given a high search quality value. The default is an empty string. |
Low search quality for content fields | contentTemplate.csv.quality.low=genre
Content fields given a low search quality value. The default is an empty string. |
Medium search quality for content fields | contentTemplate.csv.quality.medium=description
Content fields given a medium search quality value. The default is an empty string. |
Unspecified content fields | contentTemplate.csv.unmappedColumnsMode=IGNORE
How the connector handles unspecified content fields. Valid values are:
|
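The priority rules described in this section can be sketched in Python. The sketch follows the rule that the title is the highest priority and that any field not assigned elsewhere defaults to low. It does not model contentTemplate.csv.unmappedColumnsMode, and the function name field_priorities is hypothetical.

```python
def field_priorities(columns, title, high=(), medium=()):
    """Map each content column to its search quality level."""
    priorities = {}
    for column in columns:
        if column == title:
            priorities[column] = "highest"
        elif column in high:
            priorities[column] = "high"
        elif column in medium:
            priorities[column] = "medium"
        else:
            # Columns listed as low, or not listed at all, resolve to low.
            priorities[column] = "low"
    return priorities


columns = ["movieTitle", "actors", "description", "genre", "year"]
print(field_priorities(columns, "movieTitle",
                       high=["actors"], medium=["description"]))
```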
8. Schedule data traversal
Traversal is the connector's process for discovering content from the data source, in this case, a CSV file. As the CSV connector runs, it will traverse the rows of a CSV file, and index each row to Cloud Search via the Indexing API.
Full traversal indexes all rows in the file. Incremental traversal only indexes rows that are added or modified since the previous traversal. The CSV connector only performs full traversals. It does not perform incremental traversals.
The scheduling parameters determine how long the connector waits between traversals. If the configuration file does not contain scheduling parameters, default values are used. The following table shows these parameters.
Setting | Parameter |
Full traversal after an interval | schedule.traversalIntervalSecs=7200
The connector performs a full traversal after a specified interval. Specify the interval between traversals in seconds. The default value is 86400 (number of seconds in one day). |
Full traversal at connector startup | schedule.performTraversalOnStart=false
The connector performs a full traversal at connector startup, rather than waiting for the first interval to expire. The default value is true. |
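The interaction between the two scheduling parameters can be sketched in Python. The sketch assumes the interval is measured from the start of one traversal to the start of the next, which this guide does not state. The function name traversal_start_offsets is hypothetical.

```python
def traversal_start_offsets(interval_secs=86400, perform_on_start=True, count=3):
    """Return when the first `count` full traversals start, in seconds after startup."""
    first = 0 if perform_on_start else interval_secs
    return [first + n * interval_secs for n in range(count)]


# Defaults: traverse at startup, then once a day.
print(traversal_start_offsets())
# schedule.traversalIntervalSecs=7200 and schedule.performTraversalOnStart=false
print(traversal_start_offsets(7200, perform_on_start=False))
```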
9. Specify Access Control List (ACL) options
The Google Cloud Search CSV connector supports permissions through ACLs to control access to the content of the CSV file in search results. Multiple ACL options are available to restrict user access to indexed records.
If your repository has individual ACL information associated with each document, upload all ACL information to control document access within Cloud Search. If your repository provides partial or no ACL information, you can supply default ACL information in the following parameters, which the SDK provides to the connector.
The connector relies on default ACLs being enabled in the configuration file. To enable default ACLs, set defaultAcl.mode to any mode other than none and configure it with defaultAcl.*
Setting | Parameter |
ACL mode | defaultAcl.mode=fallback
Required. The CSV connector relies on default ACL functionality and supports only fallback mode. |
Default ACL Name | defaultAcl.name=VIRTUAL_CONTAINER_FOR_CONNECTOR_1
Optional. Overrides the virtual container name that the connector uses to set up default ACLs. The default value is "DEFAULT_ACL_VIRTUAL_CONTAINER". You may want to override this value if multiple connectors index content in the same data source. |
Default public ACL | defaultAcl.public=true
The default ACL used for the entire repository is set to public domain access. The default value is false. |
Common ACL group readers | defaultAcl.readers.groups=google:group1, group2 |
Common ACL readers | defaultAcl.readers.users=user1, user2, google:user3 |
Common ACL denied group readers | defaultAcl.denied.groups=group3 |
Common ACL denied readers | defaultAcl.denied.users=user4, user5 |
Entire domain access | To specify that every indexed record be publicly accessible by every user in the domain, set both of the following options with values:
|
Common defined ACL | To specify one ACL for each record of the data repository, set all of the following parameter values:
|
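One way to picture how the defaultAcl.* settings group together is the following Python sketch. It only reads the parameters listed in the table above. The function names are hypothetical, and treating a missing defaultAcl.mode as none is an assumption.

```python
def parse_principals(value):
    """Split a comma-separated principal list and trim whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


def default_acl(props):
    """Collect the defaultAcl.* settings into one dictionary."""
    # Assumption: an absent mode behaves like "none".
    mode = props.get("defaultAcl.mode", "none")
    if mode == "none":
        raise ValueError("the CSV connector needs default ACLs to be enabled")
    return {
        "mode": mode,
        "public": props.get("defaultAcl.public", "false").lower() == "true",
        "readers": {
            "users": parse_principals(props.get("defaultAcl.readers.users", "")),
            "groups": parse_principals(props.get("defaultAcl.readers.groups", "")),
        },
        "denied": {
            "users": parse_principals(props.get("defaultAcl.denied.users", "")),
            "groups": parse_principals(props.get("defaultAcl.denied.groups", "")),
        },
    }


props = {
    "defaultAcl.mode": "fallback",
    "defaultAcl.readers.users": "user1, user2, google:user3",
    "defaultAcl.denied.groups": "group3",
}
print(default_acl(props))
```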
Schema Definition
Cloud Search allows indexing and serving of structured and unstructured content. To support structured data queries on your data, you need to set up a schema for your data source.
Once the schema is defined, the CSV connector can refer to it to build indexing requests. As an illustrative example, consider a CSV file containing information about movies.
Assume the input CSV file has the following columns:
- movieId
- movieTitle
- description
- year
- releaseDate
- actors (multiple values separated by comma (,))
- genre (multiple values)
- ratings
Based on this data structure, you can define a schema for the data source under which you want to index data from the CSV file.
{
  "objectDefinitions": [
    {
      "name": "movie",
      "propertyDefinitions": [
        {
          "name": "actors",
          "isReturnable": true,
          "isRepeatable": true,
          "isFacetable": true,
          "textPropertyOptions": {
            "operatorOptions": {
              "operatorName": "actor"
            }
          }
        },
        {
          "name": "releaseDate",
          "isReturnable": true,
          "isRepeatable": false,
          "isFacetable": false,
          "datePropertyOptions": {
            "operatorOptions": {
              "operatorName": "released",
              "lessThanOperatorName": "releasedbefore",
              "greaterThanOperatorName": "releasedafter"
            }
          }
        },
        {
          "name": "movieTitle",
          "isReturnable": true,
          "isRepeatable": false,
          "isFacetable": false,
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "HIGHEST"
            },
            "operatorOptions": {
              "operatorName": "title"
            }
          }
        },
        {
          "name": "genre",
          "isReturnable": true,
          "isRepeatable": true,
          "isFacetable": true,
          "enumPropertyOptions": {
            "operatorOptions": {
              "operatorName": "genre"
            },
            "possibleValues": [
              {
                "stringValue": "Action"
              },
              {
                "stringValue": "Documentary"
              },
              {
                "stringValue": "Drama"
              },
              {
                "stringValue": "Crime"
              },
              {
                "stringValue": "Sci-fi"
              }
            ]
          }
        },
        {
          "name": "userRating",
          "isReturnable": true,
          "isRepeatable": false,
          "isFacetable": true,
          "integerPropertyOptions": {
            "orderedRanking": "ASCENDING",
            "maximumValue": "10",
            "operatorOptions": {
              "operatorName": "score",
              "lessThanOperatorName": "scorebelow",
              "greaterThanOperatorName": "scoreabove"
            }
          }
        }
      ]
    }
  ]
}
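Before indexing, it can help to compare the schema's property names with the CSV column names. The following Python sketch does that for a trimmed copy of the schema above, assuming that structured data properties are matched to CSV columns by name. The function names are hypothetical.

```python
import json


def schema_property_names(schema_json, object_name):
    """Return the property names defined for one object type in a schema."""
    schema = json.loads(schema_json)
    for definition in schema["objectDefinitions"]:
        if definition["name"] == object_name:
            return [prop["name"] for prop in definition["propertyDefinitions"]]
    raise KeyError(object_name)


def properties_without_column(schema_json, object_name, csv_columns):
    """List schema properties that have no CSV column of the same name."""
    columns = set(csv_columns)
    return [name for name in schema_property_names(schema_json, object_name)
            if name not in columns]


# Property names only; the options from the full schema are omitted here.
schema_json = """
{"objectDefinitions": [{"name": "movie", "propertyDefinitions": [
  {"name": "actors"}, {"name": "releaseDate"}, {"name": "movieTitle"},
  {"name": "genre"}, {"name": "userRating"}]}]}
"""
csv_columns = ["movieId", "movieTitle", "description", "year",
               "releaseDate", "actors", "genre", "ratings"]
print(properties_without_column(schema_json, "movie", csv_columns))
```

With the column list given earlier in this section, the check reports userRating, because the CSV column is named ratings. Under the name-matching assumption, either the column or the property would need to be renamed for that value to be indexed as structured data.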
Example configuration file
The following example configuration file shows the parameter key=value pairs that define an example connector's behavior.
# data source access
api.sourceId=1234567890abcd
api.serviceAccountPrivateKeyFile=./PrivateKey.json
# CSV data structure
csv.filePath=./movie_content.csv
csv.csvColumns=movieId,movieTitle,description,releaseYear,genre,actors,ratings,releaseDate
csv.skipHeaderRecord=true
url.format=https://mymoviesite.com/movies/{0}
url.columns=movieId
csv.datetimeFormat.releaseDate=yyyy-mm-dd
csv.multiValueColumns=genre,actors
csv.multiValue.genre=;
contentTemplate.csv.title=movieTitle
# metadata structured data and content
itemMetadata.title.field=movieTitle
itemMetadata.createTime.field=releaseDate
itemMetadata.contentLanguage.defaultValue=en-US
itemMetadata.objectType.defaultValue=movie
contentTemplate.csv.quality.medium=description
contentTemplate.csv.unmappedColumnsMode=IGNORE
#ACLs
defaultAcl.mode=fallback
defaultAcl.public=true
For detailed descriptions of each parameter, see the Configuration parameters reference.
Run the Cloud Search CSV connector
To run the connector from the command line, type the following command:
$ java -jar google-cloudsearch-csv-connector-v1-0.0.3.jar -Dconfig=my.config
By default, connector logs are available on standard output. You can log to files by specifying logging.properties.