You can set up Google Cloud Search to serve web content to your users by deploying the Google Cloud Search indexer plugin for Apache Nutch, an open source web crawler.
When you start the web crawl, Apache Nutch crawls the web and uses the indexer plugin to upload original binary (or text) versions of document content to the Google Cloud Search indexing API. The indexing API indexes the content and serves the results to your users.
Important considerations
System requirements
| System requirements | |
|---|---|
| Operating system | Linux only |
| Software | Apache Nutch version 1.15; Apache Maven (to build the indexer plugin) |
| Apache Tika document types | Apache Tika 1.18 supported document formats |
Deploy the indexer plugin
The following steps describe how to install the indexer plugin and configure its components to crawl the specified URLs and return the results to Cloud Search.
Prerequisites
Before you deploy the Cloud Search Apache Nutch indexer plugin, gather the information required to connect Google Cloud Search and the data source:
- Google Workspace private key (which contains the service account ID). For information on obtaining a private key, go to Configure access to the Google Cloud Search API.
- Google Workspace data source ID. For information on obtaining a data source ID, go to Add a data source to search.
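Before you continue, you can sanity-check that the downloaded key file is a valid service account JSON key; the service account ID appears in its client_email field. The file name PrivateKey.json matches the examples later in this guide, and the key contents below are a placeholder for illustration only, not a real key.

```shell
# Placeholder standing in for the real key downloaded from the Google Cloud
# console (never commit a real key to version control).
cat > PrivateKey.json <<'EOF'
{
  "type": "service_account",
  "client_email": "indexer-plugin@example-project.iam.gserviceaccount.com"
}
EOF

# The service account ID is the client_email field of the JSON key.
grep -o '"client_email": *"[^"]*"' PrivateKey.json
```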
Step 1: Build and install the plugin software and Apache Nutch
1. Clone the indexer plugin repository from GitHub:

   $ git clone https://github.com/google-cloudsearch/apache-nutch-indexer-plugin.git
   $ cd apache-nutch-indexer-plugin

2. Check out the desired version of the indexer plugin:

   $ git checkout tags/v1-0.0.5

3. Build the indexer plugin:

   $ mvn package

   To skip the tests when building the indexer plugin, use mvn package -DskipTests.

4. Download Apache Nutch 1.15 and follow the Apache Nutch installation instructions.

5. Extract target/google-cloudsearch-apache-nutch-indexer-plugin-v1.0.0.5.zip (built in step 3) to a folder. Copy the plugins/indexer-google-cloudsearch folder to the Apache Nutch install's plugins folder (apache-nutch-1.15/plugins).
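The extract-and-copy step can be sketched as shell commands. The extracted/ directory below is a placeholder for wherever you unzip the build output; only apache-nutch-1.15/plugins and plugins/indexer-google-cloudsearch come from the steps above.

```shell
# Placeholder layout standing in for the extracted zip contents; in practice,
# run unzip on the built google-cloudsearch-apache-nutch-indexer-plugin zip.
mkdir -p extracted/plugins/indexer-google-cloudsearch
touch extracted/plugins/indexer-google-cloudsearch/plugin.xml

# Copy the plugin folder into the Nutch install's plugins directory.
mkdir -p apache-nutch-1.15/plugins
cp -r extracted/plugins/indexer-google-cloudsearch apache-nutch-1.15/plugins/
ls apache-nutch-1.15/plugins
```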
Step 2: Configure the indexer plugin
To configure the Apache Nutch indexer plugin, create a file named plugin-configuration.properties.
The configuration file must specify the following parameters, which are necessary to access the Google Cloud Search data source.
| Setting | Parameter |
|---|---|
| Data source ID | api.sourceId = 1234567890abcdef<br>Required. The Google Cloud Search source ID that the Google Workspace admin set up for the indexer plugin. |
| Service account | api.serviceAccountPrivateKeyFile = ./PrivateKey.json<br>Required. The Google Cloud Search service account key file that the Google Workspace admin created so that the indexer plugin can access the API. |
The following example shows a sample configuration file with the required parameters.
#
# data source access
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json
#
The configuration file can also contain other parameters that control indexer plugin behavior. For example, the defaultAcl.* parameters control the default ACLs that the plugin applies to indexed items, and the batch.* parameters control how the plugin batches requests when it pushes data to the Cloud Search API. You can also configure how the indexer plugin populates metadata and structured data.
For descriptions of these parameters, go to Google-supplied connector parameters.
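As a hedged illustration, a configuration file that also sets default-ACL and batching behavior might look like the following sketch. The specific values (the group address and batch size) are assumptions made for this example; consult the Google-supplied connector parameters page for the authoritative parameter list and defaults.

```
# data source access
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json

# default ACL: make indexed documents visible to one assumed group
defaultAcl.mode=fallback
defaultAcl.public=false
defaultAcl.readers.groups=crawled-content-readers@example.com

# batching of indexing requests
batch.batchSize=10
```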
Step 3: Configure Apache Nutch
Open conf/nutch-site.xml and add the following parameters:

| Setting | Parameter |
|---|---|
| Plugin includes | plugin.includes = text<br>Required. The list of plugins to use, which must include at least index-basic, index-more, and indexer-google-cloudsearch. conf/nutch-default.xml provides a default value for this property, but you must also manually add indexer-google-cloudsearch to it. |
| Metatags names | metatags.names = text<br>Optional. Comma-separated list of tags that map to properties in the corresponding data source's schema. To learn more about how to set up Apache Nutch for metatags, go to Nutch-parse metatags. |
The following example shows the required modification to nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more|metadata)|query-(basic|site|url|lang)|indexer-google-cloudsearch|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf|metatags)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
Open conf/index-writers.xml and add the following section:

<writer id="indexer_google_cloud_search_1" class="org.apache.nutch.indexwriter.gcs.GoogleCloudSearchIndexWriter">
  <parameters>
    <param name="gcs.config.file" value="path/to/sdk-configuration.properties"/>
  </parameters>
  <mapping>
    <copy />
    <rename />
    <remove />
  </mapping>
</writer>
The <writer> section contains the following parameters:
| Setting | Parameter |
|---|---|
| Path to Google Cloud Search configuration file | gcs.config.file = path<br>Required. The full (absolute) path to the Google Cloud Search configuration file. |
| Upload format | gcs.uploadFormat = text<br>Optional. The format in which the indexer plugin pushes document content to the Google Cloud Search indexing API. Valid values are raw (the plugin pushes original, unconverted document content) and text (the plugin pushes extracted textual content). The default value is raw. |
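Putting both parameters together, a writer section that explicitly requests raw content upload might look like the following sketch; the absolute path is a placeholder for wherever you saved your configuration file.

```
<writer id="indexer_google_cloud_search_1" class="org.apache.nutch.indexwriter.gcs.GoogleCloudSearchIndexWriter">
  <parameters>
    <param name="gcs.config.file" value="/home/user/nutch/plugin-configuration.properties"/>
    <param name="gcs.uploadFormat" value="raw"/>
  </parameters>
  <mapping>
    <copy />
    <rename />
    <remove />
  </mapping>
</writer>
```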
Step 4: Configure web crawl
Before you start a web crawl, configure the crawl so that it only includes information that your organization wants to make available in search results. This section provides an overview; for more information about how to set up a web crawl, go to the Nutch tutorial.
Set up start URLs.
Start URLs control where the Apache Nutch web crawler begins crawling your content. The start URLs should enable the web crawler to reach all content that you want to include in a particular crawl by following the links. Start URLs are required.
To set up start URLs:
1. Change the working directory to the Nutch installation directory:

   $ cd ~/nutch/apache-nutch-X.Y/

2. Create a directory for URLs:

   $ mkdir urls

3. In the urls directory, create a file named seed.txt and list URLs in it, one URL per line.
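The steps above can be sketched as follows; the example.com URLs are placeholders for your own start URLs.

```shell
# Create the urls directory and a seed list with one URL per line.
mkdir -p urls
cat > urls/seed.txt <<'EOF'
https://www.example.com/
https://support.example.com/
EOF

# Each non-empty line is one start URL.
wc -l < urls/seed.txt
```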
Set up follow and do-not-follow rules.
Follow URL rules control which URLs are crawled and included in the Google Cloud Search index. The web crawler checks URLs against the follow URL rules. Only URLs that match these rules are crawled and indexed.
Do-not-follow rules exclude URLs from being crawled and included in the Google Cloud Search index. If a URL matches a do-not-follow pattern, the web crawler does not crawl it.
To set up follow and do-not-follow URL rules:
1. Change the working directory to the Nutch installation directory:

   $ cd ~/nutch/apache-nutch-X.Y/

2. Edit conf/regex-urlfilter.txt to change the follow and do-not-follow rules:

   $ nano conf/regex-urlfilter.txt

3. Enter regular expressions with a "+" or "-" prefix to follow or not follow URL patterns and extensions, as shown in the following example. Open-ended expressions are allowed.

   # skip file extensions
   -\.(gif|GIF|jpg|JPG|png|PNG|ico)

   # skip protocols (file: ftp: and mailto:)
   -^(file|ftp|mailto):

   # allow urls starting with https://support.google.com/gsa/
   +^https://support.google.com/gsa/

   # accept anything else
   # (commented out due to the single url-prefix allowed above)
   #+.
Edit the crawl script.

If the gcs.uploadFormat parameter is missing or set to raw, you must add the -addBinaryContent -base64 arguments to the nutch index command. These arguments tell the Nutch Indexer module to include binary content in Base64 when it invokes the indexer plugin. The ./bin/crawl script doesn't have these arguments by default.

1. Open the crawl script in apache-nutch-1.15/bin.

2. Add the -addBinaryContent -base64 options to the script, as in the following example:

   if $INDEXFLAG; then
       echo "Indexing $SEGMENT to index"
       __bin_nutch index $JAVA_PROPERTIES "$CRAWL_PATH"/crawldb -addBinaryContent -base64 -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT

       echo "Cleaning up index if possible"
       __bin_nutch clean $JAVA_PROPERTIES "$CRAWL_PATH"/crawldb
   else
       echo "Skipping indexing ..."
Step 5: Start a web crawl and content upload
After you install and set up the indexer plugin, you can run it on its own in local mode. Use the scripts in ./bin to execute a crawling job or individual Nutch commands.

The following example assumes that the required components are located in the local directory. Run Nutch with the following command from the apache-nutch-1.15 directory:

$ bin/crawl -i -s urls/ crawl-test/ 5

In this command, -i tells the crawl script to index the crawled content, -s urls/ points to the seed URL directory, crawl-test/ is the crawl directory, and 5 is the number of crawl rounds.

Crawl logs are available on standard output (in the terminal) or in the logs/ directory. To redirect the logging output, or for more verbose logging, edit conf/log4j.properties.
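As a hedged example of more verbose logging: Nutch 1.15 uses log4j 1.x, so adding a line like the following to conf/log4j.properties raises the Nutch loggers to DEBUG level. This assumes the cmdstdout console appender defined in the default Nutch log4j.properties; check your file for the appender names it actually defines.

```
# Log Nutch classes at DEBUG to the console appender
log4j.logger.org.apache.nutch=DEBUG,cmdstdout
```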