This page describes how to define the coverage of your search engine using an XML annotations file.
Overview
Managing a large collection of sites one at a time can be tedious if you're building a large search engine. Instead, you can add and manage many sites at once by listing them in an annotations file and uploading it. In addition, annotations files give you far greater control over the ranking of search results.
An annotations file is simply a list of annotations. Each annotation has two components: the site and its associated labels. The label tells Programmable Search Engine how to handle a site; that is, whether a site should be included, excluded, promoted, or demoted. In the context file, you define labels; in the annotations file, you tag sites with the appropriate labels.
When you start editing your annotations file, start out with a small number of annotations. It's easier to test and troubleshoot your search engine with a handful of annotations. When you get the results that you expect, incrementally add more annotations.
You can upload the annotations file to the Control Panel. For details about file limits, see the Annotations Limits section.
Using the Programmable Search XML Format
If you want to take advantage of all the features available in the Programmable Search Engine configuration file, XML is the way to go.
XML Annotations
The following is an example of XML annotations. This annotations file tells Programmable Search Engine to include everything under www.webmd.com/hw/* but exclude everything under www.webmd.com/hw/cancer/*.
<Annotations>
  <Annotation about="www.cancer.gov/cancertopics/types/liver/*">
    <Label name="_include_"/>
    <Comment>government site</Comment>
  </Annotation>
  <Annotation about="www.medicinenet.com/liver_cancer/">
    <Label name="_exclude_"/>
    <Comment>site on symptoms</Comment>
  </Annotation>
  <Annotation about="www.webmd.com/hw/*">
    <Label name="_include_"/>
    <Comment>great sites for patients!</Comment>
  </Annotation>
  <Annotation about="www.webmd.com/hw/cancer/*">
    <Label name="_exclude_"/>
    <Comment>great sites for patients!</Comment>
  </Annotation>
  <Annotation about="www.oncologychannel.com/*/treatment">
    <Label name="_exclude_"/>
  </Annotation>
</Annotations>
The annotations file has four elements in the following hierarchy:
- Annotations (root element)
  - Annotation
    - Label
    - Comment (optional)
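To see the hierarchy in practice, the following minimal sketch reads an annotations file with Python's standard xml.etree.ElementTree module and lists each URL pattern with its labels and optional comment. The function name read_annotations and the SAMPLE string are illustrative assumptions and are not part of Programmable Search Engine.

```python
import xml.etree.ElementTree as ET

# Small sample that follows the hierarchy described above.
SAMPLE = """<Annotations>
  <Annotation about="www.webmd.com/hw/*">
    <Label name="_include_"/>
    <Comment>great sites for patients!</Comment>
  </Annotation>
  <Annotation about="www.webmd.com/hw/cancer/*">
    <Label name="_exclude_"/>
  </Annotation>
</Annotations>"""


def read_annotations(xml_text):
    """Return a list of (url_pattern, [label names], comment or None)."""
    root = ET.fromstring(xml_text)
    if root.tag != "Annotations":
        raise ValueError("root element must be <Annotations>")
    rows = []
    for annotation in root.findall("Annotation"):
        labels = [label.get("name") for label in annotation.findall("Label")]
        comment = annotation.findtext("Comment")  # None when <Comment> is absent
        rows.append((annotation.get("about"), labels, comment))
    return rows


for pattern, labels, comment in read_annotations(SAMPLE):
    print(pattern, labels, comment)
```

A quick listing like this can help confirm that every annotation carries the label you intended before you upload the file.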
Creating External Annotations
To list sites you want your search engine to cover, do the following:
- Start the file with the <Annotations></Annotations> root element.
- Create an annotation by adding the <Annotation></Annotation> tags, and then define the about attribute with the URL pattern of the site.
  <Annotations>
    <Annotation about="www.webmd.com/hw/cancer/*">
    </Annotation>
  </Annotations>
- Associate the site with the search engine by using the <Label name=" "/> tag, and specify how that site should be treated by the search engine. You can get the labels for your search engine from the context file of the search engine. You'll find two labels: one for adding sites to your Programmable Search Engine and one for excluding sites from it. If you have not changed the name of the search engine label in the context file, the label for including sites is in the form of _include_, and the label for excluding sites is in the form of _exclude_. To avoid errors, copy and paste these labels instead of typing them by hand.
  <Annotations>
    <Annotation about="http://www.solarenergy.org/*">
      <Label name="_include_"/>
    </Annotation>
  </Annotations>
  A single site can have multiple labels associated with it.
  If you have changed the name of the label in the context file, remember to update the Label name values in your annotations file.
- To add more sites, create and define another Annotation element.
- Save the XML file.
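If you maintain the site list elsewhere (for example, in a spreadsheet), the steps above can be scripted. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name build_annotations is an illustrative assumption, and the _include_ and _exclude_ labels are the placeholder forms used on this page; substitute the labels from your own context file.

```python
import xml.etree.ElementTree as ET


def build_annotations(sites):
    """Build annotations XML from (url_pattern, label_name) pairs.

    The label names are assumed to match the labels in your context file.
    """
    root = ET.Element("Annotations")  # root element
    for pattern, label in sites:
        # One <Annotation> per site, with the URL pattern in "about".
        annotation = ET.SubElement(root, "Annotation", {"about": pattern})
        # The <Label> tells the search engine how to treat the site.
        ET.SubElement(annotation, "Label", {"name": label})
    return ET.tostring(root, encoding="unicode")


xml_text = build_annotations([
    ("http://www.solarenergy.org/*", "_include_"),
    ("www.webmd.com/hw/cancer/*", "_exclude_"),
])
print(xml_text)

# To save the result, write xml_text to a file such as annotations.xml
# (the file name here is only an example).
```

Generating the file from a single source list avoids hand-typed label names, which is the same concern behind the copy-and-paste advice above.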
Improving Search Coverage
Programmable Search Engine is built on top of the Google index. This means that webpages that are in the Google index are available to your search engine; conversely, webpages that have not been crawled by Google will not show up in your search results. If you want your Programmable Search Engine to include sites that are not currently in the Google index, submit a Sitemap to Google Search Console.
A Sitemap includes a list of pages in your site, as well as information about the update frequency of the webpages and their importance relative to each other. Submitting a Sitemap helps Google discover your webpages and improve the crawling schedule. To learn more about Sitemaps, see the Webmaster Help Center and Using the Sitemap Protocol. If you are interested in building fancier Sitemaps, see http://www.sitemaps.org/protocol.php.
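As a rough illustration of what a Sitemap contains, the following sketch builds a small one with Python's standard library. The element names (urlset, url, loc, lastmod, changefreq, priority) and the namespace come from the Sitemap protocol at sitemaps.org; the function name build_sitemap and the www.example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build a Sitemap from dicts with 'loc' and optional protocol fields."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]  # page address (required)
        # Optional hints: last change date, update frequency, relative importance.
        for field in ("lastmod", "changefreq", "priority"):
            if field in page:
                ET.SubElement(url, field).text = str(page[field])
    return ET.tostring(urlset, encoding="unicode")


sitemap = build_sitemap([
    {"loc": "https://www.example.com/", "changefreq": "daily", "priority": 1.0},
    {"loc": "https://www.example.com/archive/2019.html", "changefreq": "yearly", "priority": 0.3},
])
print(sitemap)
```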
Submitting Sitemaps is particularly helpful if your site has the following:
- Dynamic content
- Webpages that aren't easily discovered by Googlebot (Google's web crawler), such as pages with rich AJAX or Flash features
- Few websites linking to it
  Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it is hard for the crawler to discover it. If your website is new, probably not many websites are pointing to your site.
- A large archive of content pages that does not have a strong network of cross-linking
Google can index only pages it can access. So, if you use a robots.txt file or robots meta tags on your site, make sure they don't block crawlers from the pages you want in your search engine.
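One way to spot-check robots.txt rules is with Python's standard urllib.robotparser module, as in the minimal sketch below. The rules, URLs, and the function name is_crawlable are hypothetical, and the module's matching is only an approximation of how Googlebot interprets robots.txt, so treat the result as a first check rather than a guarantee. It does not examine robots meta tags.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for www.example.com.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A page outside /private/ is allowed; a page inside it is blocked.
print(is_crawlable(ROBOTS_TXT, "https://www.example.com/articles/solar.html"))
print(is_crawlable(ROBOTS_TXT, "https://www.example.com/private/draft.html"))
```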
Improved coverage is not instantaneous, as it takes some time for the pages to be crawled and indexed. But once your webpages are in the index, they could appear in both Google search and your Programmable Search Engine.
Annotations Limits
The following table lists the limits for annotations files that are uploaded to Programmable Search Engine:
Note: Follow the limits closely; if you exceed them, your search engine might not show results.
Aspect | Limit |
---|---|
File size (context or annotations files) | 30 KB |
Maximum number of annotations per search engine | 5,000 |
Tip: If you find your search engine outgrowing the large 5,000-site limit, consider consolidating individual URLs into URL patterns.
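Before uploading, you could check a file against these limits with a short script such as the sketch below. It assumes that 30 KB means 30 * 1024 bytes and it counts annotations in a single file only, whereas the 5,000 limit applies to the whole search engine. The function name check_limits is illustrative.

```python
import xml.etree.ElementTree as ET

MAX_FILE_BYTES = 30 * 1024  # assumption: "30 KB" read as 30 * 1024 bytes
MAX_ANNOTATIONS = 5000      # per search engine; this sketch checks one file


def check_limits(xml_text):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    size = len(xml_text.encode("utf-8"))
    if size > MAX_FILE_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_FILE_BYTES}-byte limit")
    count = len(ET.fromstring(xml_text).findall("Annotation"))
    if count > MAX_ANNOTATIONS:
        problems.append(f"{count} annotations, over the limit of {MAX_ANNOTATIONS}")
    return problems


small = (
    '<Annotations><Annotation about="www.webmd.com/hw/*">'
    '<Label name="_include_"/></Annotation></Annotations>'
)
print(check_limits(small))
```

If the check reports too many annotations, replacing groups of individual URLs with a single URL pattern, as the tip suggests, reduces both the count and the file size.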