This guide is intended for Google Cloud Search Norconex HTTP Collector indexer plugin administrators, that is, anyone who is responsible for downloading, deploying, configuring, and maintaining the indexer plugin. The guide assumes that you are familiar with Linux operating systems, the fundamentals of web crawling, XML, and Norconex HTTP Collector.
This guide includes instructions for performing key tasks related to indexer plugin deployment:
- Download the indexer plugin software
- Configure Google Cloud Search
- Configure Norconex HTTP Collector and web crawling
- Start the web crawl and upload content
Information about the tasks that the Google Workspace administrator must perform to map Google Cloud Search to the Norconex HTTP Collector indexer plugin does not appear in this guide. For information on those tasks, see Manage third-party data sources.
Overview of the Cloud Search Norconex HTTP Collector indexer plugin
By default, Cloud Search can discover, index, and serve content from Google Workspace products, such as Google Docs and Gmail. You can extend the reach of Google Cloud Search to include serving web content to your users by deploying the indexer plugin for Norconex HTTP Collector, an open source enterprise web crawler.
Configuration properties files
To enable the indexer plugin to perform web crawls and upload content to the indexing API, you, as the indexer plugin administrator, provide specific information during the configuration steps described in the Deployment steps section of this document.
To use the indexer plugin, you must set properties in two configuration files:
- gcs-crawl-config.xml: contains settings for Norconex HTTP Collector.
- sdk-configuration.properties: contains settings for Google Cloud Search.
Properties in each file enable the Google Cloud Search indexer plugin and Norconex HTTP Collector to communicate with each other.
Web crawl and content upload
After you have populated the configuration files, you have the necessary settings to start the web crawl. Norconex HTTP Collector crawls the web, discovers document content that pertains to its configuration, and uploads the original binary (or text) versions of that content to the Cloud Search indexing API, where it is indexed and ultimately served to your users.
Supported operating system
The Google Cloud Search Norconex HTTP Collector indexer plugin must be installed on Linux.
Supported Norconex HTTP Collector version
The Google Cloud Search Norconex HTTP Collector indexer plugin supports version 2.8.0.
ACL support
The indexer plugin supports controlling access to documents in the Google Workspace domain by using Access Control Lists (ACLs).
If default ACLs are enabled in the Google Cloud Search plugin configuration
(defaultAcl.mode is set to a value other than none and configured with defaultAcl.*),
the indexer plugin first tries to create and apply a default ACL.
If default ACLs are not enabled, the plugin falls back to giving read permission to the entire Google Workspace domain.
For detailed descriptions of ACL configuration parameters, see Google-supplied connector parameters.
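As a rough illustration of the default ACL settings described above, the following shell sketch appends a default ACL block to a Cloud Search configuration file. The defaultAcl.mode and defaultAcl.public parameter names come from the Google-supplied connector parameters as best understood here. The value fallback, the CONFIG_DIR variable, and the scratch-directory default are assumptions made for this example, so confirm them against the parameter reference before relying on them.

```shell
#!/bin/sh
# Sketch: append a default ACL block to the Cloud Search configuration file.
# CONFIG_DIR is a hypothetical variable; it defaults to a scratch directory
# so that the example never overwrites a real configuration.
CONFIG_DIR="${CONFIG_DIR:-$(mktemp -d)}"
SDK_CONFIG="$CONFIG_DIR/sdk-configuration.properties"

cat >> "$SDK_CONFIG" <<'EOF'
# Default ACL (assumed example values; see Google-supplied connector parameters)
defaultAcl.mode=fallback
defaultAcl.public=true
EOF

# Report the mode that is now configured.
grep '^defaultAcl\.mode=' "$SDK_CONFIG"
```

Setting CONFIG_DIR to the Norconex directory before running the sketch would append the block to the real configuration file instead of a scratch copy.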
Prerequisites
Before you deploy the indexer plugin, ensure that you have the following required components:
- Java JRE 1.8 installed on the computer that runs the indexer plugin
- Google Workspace information required to establish relationships between Cloud Search and Norconex HTTP Collector:
  - Google Workspace private key (which contains the service account ID)
  - Google Workspace data source ID

  Typically, the Google Workspace administrator for the domain can supply these credentials for you.
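The prerequisites above can be checked before deployment with a short pre-flight script. The following is a minimal sketch, not part of the plugin: the function names is_java_8 and has_private_key are hypothetical, and the version check assumes that java -version reports a 1.8 runtime with a quoted string such as "1.8.0_292".

```shell
#!/bin/sh
# Sketch: pre-flight checks for the prerequisites listed above.

# Succeeds when a `java -version` line reports a 1.8 runtime,
# for example: openjdk version "1.8.0_292".
is_java_8() {
  case "$1" in
    *'"1.8.'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds when the service account key file exists and is readable.
# The path is whatever your Google Workspace administrator supplied.
has_private_key() {
  [ -r "$1" ]
}

if command -v java >/dev/null 2>&1; then
  # java prints its version on stderr, so redirect it into the pipeline.
  version_line=$(java -version 2>&1 | grep 'version' | head -n 1)
  if is_java_8 "$version_line"; then
    echo "Java 1.8 found: $version_line"
  else
    echo "Unexpected Java version: $version_line"
  fi
else
  echo "java not found on PATH"
fi
```

A call such as has_private_key ./PrivateKey.json can be added once you know where the key file is stored.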
Deployment steps
To deploy the indexer plugin, follow these steps:
- Install Norconex HTTP Collector and the indexer plugin software
- Configure Google Cloud Search
- Configure Norconex HTTP Collector
- Configure web crawl
- Start a web crawl and content upload
Step 1: Install Norconex HTTP Collector and the indexer plugin software
- Download the Norconex committer software from this page.
- Unzip the downloaded software to the ~/norconex/ folder.
- Clone the committer plugin from GitHub and change to its directory:
 $ git clone https://github.com/google-cloudsearch/norconex-committer-plugin.git
 $ cd norconex-committer-plugin
- Check out the desired version of the committer plugin and build the ZIP file:
 $ git checkout tags/v1-0.0.3
 $ mvn package
 (To skip the tests when building the connector, use mvn package -DskipTests.)
- Change to the build output directory:
 $ cd target
- Copy the built plugin jar file into the Norconex lib directory:
 $ cp google-cloudsearch-norconex-committer-plugin-v1-0.0.3.jar ~/norconex/norconex-collector-http-{version}/lib
- Extract the ZIP file that you just built:
 $ unzip google-cloudsearch-norconex-committer-plugin-v1-0.0.3.zip
- Execute the install script to copy the plugin's .jar file and all the required libraries into the HTTP Collector's directory:
  - Change to the extracted committer plugin directory:
   $ cd google-cloudsearch-norconex-committer-plugin-v1-0.0.3
  - Run sh install.sh and provide the full path to norconex/norconex-collector-http-{version}/lib as the target directory when prompted.
  - If duplicate jar files are found, select option 1 (Copy source Jar only if greater or same version as target Jar after renaming target Jar).
Step 2: Configure Google Cloud Search
For the indexer plugin to connect to Norconex HTTP Collector and index the
relevant content, you must create the Cloud Search configuration file in the
Norconex directory where Norconex HTTP Collector is installed. Google recommends
that you name the Cloud Search configuration file
sdk-configuration.properties.
This configuration file must contain key/value pairs, each of which defines a parameter. At a minimum, the configuration file must specify the following parameters, which are necessary to access the Cloud Search data source.
| Setting | Parameter | Description |
| Data source ID | api.sourceId = 1234567890abcdef | Required. The Cloud Search source ID set up by the Google Workspace administrator. |
| Service account | api.serviceAccountPrivateKeyFile = ./PrivateKey.json | Required. The Cloud Search service account key file that was created by the Google Workspace administrator for indexer plugin accessibility. |
The following example shows an sdk-configuration.properties file.
#
# data source access
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json
#
The configuration file can also contain Google-supplied configuration parameters.
These parameters can affect how this plugin pushes data into the Google Cloud Search API. For example, the batch.* set of parameters
identifies how the connector combines requests.
If you do not define a parameter in the configuration file, the default value, if available, is used. For detailed descriptions of each parameter, see Google-supplied connector parameters.
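One way to catch a missing required parameter before starting a crawl is to check the file from the shell. The following is a minimal sketch under stated assumptions: CONFIG_DIR and check_required_keys are hypothetical names, the file is written to a scratch directory by default, and the parameter values are the placeholders used in this guide rather than real credentials.

```shell
#!/bin/sh
# Sketch: create a minimal sdk-configuration.properties and confirm that the
# two required parameters are present.
# CONFIG_DIR is a hypothetical variable; it defaults to a scratch directory.
CONFIG_DIR="${CONFIG_DIR:-$(mktemp -d)}"
SDK_CONFIG="$CONFIG_DIR/sdk-configuration.properties"

cat > "$SDK_CONFIG" <<'EOF'
# data source access (placeholder values from this guide)
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json
EOF

# Print each required key that is missing from the file given as $1.
# Returns non-zero if at least one key is missing.
check_required_keys() {
  missing=0
  for key in api.sourceId api.serviceAccountPrivateKeyFile; do
    if ! grep -q "^${key}[[:space:]]*=" "$1"; then
      echo "missing required parameter: $key"
      missing=1
    fi
  done
  return "$missing"
}

check_required_keys "$SDK_CONFIG" && echo "required parameters present"
```

The check only confirms that the keys exist. It does not verify that the source ID is valid or that the key file can authenticate.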
You can configure the indexer plugin to populate metadata and structured data for the content being indexed. The values for these fields can be extracted from meta tags in the HTML content being indexed, or you can specify default values in the configuration file.
| Setting | Parameter | Description |
| Title | itemMetadata.title.field=movieTitle | By default, the plugin uses the HTML title as the title of the document being indexed. If the title is missing, you can either refer to the metadata attribute that contains the value corresponding to the document title or set a default value. |
| | itemMetadata.title.defaultValue=Gone with the Wind | |
| Created timestamp | itemMetadata.createTime.field=releaseDate | The metadata attribute that contains the value for the document creation timestamp. |
| | itemMetadata.createTime.defaultValue=1940-01-17 | |
| Last modified time | itemMetadata.updateTime.field=releaseDate | The metadata attribute that contains the value for the last modification timestamp for the document. |
| | itemMetadata.updateTime.defaultValue=1940-01-17 | |
| Document language | itemMetadata.contentLanguage.field=languageCode | The content language for documents being indexed. |
| | itemMetadata.contentLanguage.defaultValue=en-US | |
| Schema object type | itemMetadata.objectType=movie | The object type used by the site, as defined in the data source schema object definitions. The connector won't index any structured data if this property is not specified. Note: This configuration property points to a value rather than a metadata attribute. |
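To show how the settings in the table above fit together in one file, the following shell sketch appends a metadata mapping to the Cloud Search configuration file. It reuses the example values from the table. The CONFIG_DIR variable and the scratch-directory default are assumptions made for this example, and the attribute names (movieTitle, releaseDate) only make sense if the crawled pages actually expose metadata with those names.

```shell
#!/bin/sh
# Sketch: append a metadata mapping to the Cloud Search configuration file.
# CONFIG_DIR is a hypothetical variable; it defaults to a scratch directory.
CONFIG_DIR="${CONFIG_DIR:-$(mktemp -d)}"
SDK_CONFIG="$CONFIG_DIR/sdk-configuration.properties"

cat >> "$SDK_CONFIG" <<'EOF'
# Metadata mapping (example values from this guide)
# *.field names a metadata attribute; *.defaultValue is used when it is absent.
itemMetadata.title.field=movieTitle
itemMetadata.title.defaultValue=Gone with the Wind
itemMetadata.createTime.field=releaseDate
itemMetadata.updateTime.field=releaseDate
itemMetadata.contentLanguage.defaultValue=en-US
# objectType is a literal value, not a metadata attribute name.
itemMetadata.objectType=movie
EOF

# List the metadata settings that are now in the file.
grep '^itemMetadata\.' "$SDK_CONFIG"
```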
Datetime formats
Datetime formats specify the formats expected in metadata attributes. If the configuration file does not contain this parameter, default values are used. The following table shows this parameter.
| Setting | Parameter | Description |
| Additional datetime patterns | structuredData.dateTimePatterns=MM/dd/uuuu HH:mm:ssXXX | A semicolon-separated list of additional java.time.format.DateTimeFormatter patterns. The patterns are used when parsing string values for any date or date-time fields in the metadata or schema. The default value is an empty list, but RFC 3339 and RFC 1123 formats are always supported. |
Step 3: Configure Norconex HTTP Collector
The zip archive norconex-committer-google-cloud-search-{version}.zip includes a
sample configuration file, minimum-config.xml.
Google recommends that you begin the configuration by copying the sample file:
- Change to the Norconex HTTP Collector directory: 
 $ cd ~/norconex/norconex-collector-http-{version}/
- Copy the configuration file: 
 $ cp examples/minimum/minimum-config.xml gcs-crawl-config.xml
- Edit the newly created file (in this example, gcs-crawl-config.xml) and add or replace the existing <committer> and <tagger> nodes as described in the following table.
| Setting | Parameter | Description |
| <committer> node | <committer class="com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter"> | Required. To enable the plugin, you must add a <committer> node as a child of the root <httpcollector> node. |
| <uploadFormat> | <uploadFormat>raw</uploadFormat> | Optional. The format in which the indexer plugin pushes document content to the Google Cloud Search indexer API. The default value is raw. |
| BinaryContentTagger <tagger> node | <tagger class="com.norconex.committer.googlecloudsearch.BinaryContentTagger"/> | Required if the value of <uploadFormat> is raw. In this case, the indexer plugin needs the binary content field of the document to be available. You must add the BinaryContentTagger <tagger> node as a child element of the <importer> / <preParseHandlers> node. |
The following example shows the required
modification to
gcs-crawl-config.xml.
<committer class="com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter">
    <!-- Full path to the Cloud Search configuration file created in Step 2 -->
    <configFilePath>/full/path/to/sdk-configuration.properties</configFilePath>
    <!-- Optional; raw is the default -->
    <uploadFormat>raw</uploadFormat>
</committer>
<importer>
  <preParseHandlers>
    <!-- Required when uploadFormat is raw -->
    <tagger class="com.norconex.committer.googlecloudsearch.BinaryContentTagger"/>
  </preParseHandlers>
</importer>
Step 4: Configure web crawl
Before starting a web crawl, you must configure the crawl so that it only
includes information that your organization wants to make available in search
results. The most important settings for web crawl are part of the <crawler>
node(s) and can include:
- Start URLs
- Maximum depth of the crawl
- Number of threads
Change these configuration values according to your needs. For more detailed information about setting up a web crawl, as well as a full list of available configuration parameters, see the HTTP Collector's Configuration page.
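The three settings listed above can be sketched as a <crawler> fragment. The following shell example writes such a fragment to a scratch file so that you can review it before merging it into gcs-crawl-config.xml. The element names (startURLs, maxDepth, numThreads) follow the Norconex HTTP Collector 2.x configuration format as understood here. The crawler id, the URL, the numeric values, and the CONFIG_DIR variable are assumptions made for illustration, so verify everything against the HTTP Collector's Configuration page.

```shell
#!/bin/sh
# Sketch: write a <crawler> fragment with the three settings listed above to a
# scratch file for review before merging it into gcs-crawl-config.xml.
# CONFIG_DIR, the crawler id, the URL, and the numeric values are assumptions.
CONFIG_DIR="${CONFIG_DIR:-$(mktemp -d)}"
CRAWLER_FRAGMENT="$CONFIG_DIR/crawler-fragment.xml"

cat > "$CRAWLER_FRAGMENT" <<'EOF'
<crawler id="example-crawler">
  <!-- Start URLs: where the crawl begins -->
  <startURLs stayOnDomain="true">
    <url>https://www.example.com/</url>
  </startURLs>
  <!-- Maximum depth of the crawl, counted in links from a start URL -->
  <maxDepth>2</maxDepth>
  <!-- Number of threads used to crawl in parallel -->
  <numThreads>2</numThreads>
</crawler>
EOF

echo "wrote $CRAWLER_FRAGMENT"
```

Starting with a small depth and a low thread count keeps a first crawl short and limits the load on the target site. Both values can be raised once the indexed results look correct.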
Step 5: Start a web crawl and content upload
After you have installed and set up the indexer plugin, you can run it on its own in local mode.
The following example assumes that the required components are located in the local directory on a Linux system. Run the following command:
$ ./collector-http.sh -a start -c gcs-crawl-config.xml
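If you start crawls from a script, it can help to guard the command above with a few basic checks. The following is a minimal sketch, not part of Norconex HTTP Collector or the plugin: the start_crawl function and the DRY_RUN variable are hypothetical names, and the only launcher options used are the -a start and -c options shown in this guide.

```shell
#!/bin/sh
# Sketch: guard the start command with basic checks.
# start_crawl and DRY_RUN are hypothetical names used only in this example.
#   $1: directory that contains collector-http.sh
#   $2: path to the crawl configuration file
start_crawl() {
  collector_home=$1
  config=$2
  launcher="$collector_home/collector-http.sh"

  if [ ! -f "$config" ]; then
    echo "configuration file not found: $config" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "$launcher -a start -c $config"
    return 0
  fi
  if [ ! -x "$launcher" ]; then
    echo "launcher not found or not executable: $launcher" >&2
    return 1
  fi
  "$launcher" -a start -c "$config"
}
```

For example, setting DRY_RUN=1 and then calling start_crawl ~/norconex/norconex-collector-http-{version} gcs-crawl-config.xml prints the command that would be run without starting a crawl.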
Monitor the crawler with JEF Monitor
Norconex JEF (Job Execution Framework) Monitor is a graphical tool for monitoring the progress of the Norconex Web Crawler (HTTP Collector) processes and jobs. For a complete tutorial on how to set up this utility, visit Monitor your crawler's progress with JEF Monitor.