This page describes how to define the coverage of your search engine using a TSV file or XML annotations file.
- Overview
- Choosing the Right Format
- Using the OPML Format
- Using the TSV Format
- Using the Programmable Search XML Format
- Improving Search Coverage
- Annotations Limits
Overview
Adding sites individually using the Programmable Search Engine Control Panel can be tedious if you're building a large search engine. In addition, managing a large collection of sites in the Control Panel isn't fun either. Instead, you can add and manage a lot of sites by listing them in an annotations file and uploading it. In addition, annotations files—particularly XML ones—give you far greater control over the ranking of search results.
An annotations file is simply a list of annotations. Each annotation has two components: the site and its associated labels. The label tells Programmable Search Engine how to handle a site; that is, whether a site should be included, excluded, promoted, or demoted. In the context file, you define labels; in the annotations file, you tag sites with the appropriate labels.
Annotations files can be in any of the following formats:
- Outline Processor Markup Language (OPML)
- Text files with tab-separated values (TSV)
- Programmable Search XML
When you start editing your annotations file, start out with a small number of annotations, and then test some search queries in the Preview tab of the Control Panel. It's easier to test and troubleshoot your search engine with a handful of annotations. When you get the results that you expect, incrementally add more annotations.
You can upload the annotations file to the Control Panel. For details about file limits, see the Annotations Limits section.
Choosing the Right Format
Before you start creating annotations, determine which file format best suits your needs. If your search engine increases in complexity, you can consider using multiple annotations files, even files of different formats. For example, you can upload OPML annotations files generated by other sites and XML annotations files you created. Programmable Search Engine combines all the annotations files in all your search engines into a single XML annotations file.
Use the following table to pick the appropriate format:
To create | Use | Because | Limitations | More information |
---|---|---|---|---|
A search engine with an existing OPML file (feed-based search engine) | OPML format | You do not need to recreate annotations if you already have OPML files with URL patterns. You can upload the existing file directly to the Control Panel. | You cannot directly fine-tune the ranking of search results. | Using the OPML format |
A search engine that does not need all the advanced features | TSV format | You can create and manage the annotations in a more readable format. You can use a spreadsheet editor. You can take advantage of many advanced features, such as applying labels, associating scores, and adding comments. You can create your own attributes. However, they are mostly for your own use; Programmable Search Engine does not do anything with them. | You cannot refer to another Programmable Search Engine file, and this is not the best option for programmatically created search engines. | Using the TSV format |
A complex and heavily customized search engine | Programmable Search XML format | It's the most powerful format. It is appropriate for developers who want to create advanced search engines with bells and whistles. It gives you more flexibility and greater control over the ranking of your search results. | It is the most complex format. | Using the XML format |
Using the OPML Format
OPML is a type of XML format that was originally developed for defining ordered lists of elements or outlines, but it is now also commonly used for web feeds. For details, see the OPML specification.
If you have OPML files from feed aggregators, you can upload them without typing each site by hand. Programmable Search Engine grabs the value of the OPML attribute htmlUrl and adds it to the list of sites to search. You can upload multiple OPML files for each of your search engines.
Here's an example of an OPML file:
<opml version="1.0">
  <head>
    <title>Bicycles</title>
    <dateCreated>Fri Mar 14 23:21:11 PDT 2008</dateCreated>
    <dateModified>Fri Mar 14 23:21:11 PDT 2008</dateModified>
  </head>
  <body>
    <outline type="rss" text="Road Bikes" xmlUrl="http://www.google.com/exampleurl.opml" htmlUrl="http://www.google.com/sampleurl1.opml"/>
    <outline type="rss" text="Mountain Bikes" xmlUrl="http://www.google.com/exampleurl2.opml" htmlUrl="http://www.google.com/sampleurl2.opml"/>
  </body>
</opml>
When you upload an OPML file in the Control Panel, Programmable Search Engine automatically converts OPML to Programmable Search XML. It adds search engine labels (<Label name="_cse_example"/>) and scores (score="1"). For more information about scores, see the Ranking Search Results page.
The following is an example of an OPML file that has been converted to Programmable Search XML:
<GoogleCustomizations>
  <Annotations>
    <Annotation about="www.google.com/exampleurl1.opml" score="1">
      <Label name="_cse_example"/>
    </Annotation>
    <Annotation about="www.google.com/exampleurl2.opml" score="1">
      <Label name="_cse_example"/>
    </Annotation>
  </Annotations>
</GoogleCustomizations>
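To make the conversion concrete, here is a minimal Python sketch that reads the htmlUrl attribute of each OPML outline and emits annotations in the shape shown above. It only illustrates the idea and is not the converter that Programmable Search Engine runs. The function name opml_to_annotations, the default label, and the choice to drop the URL scheme are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def opml_to_annotations(opml_text: str, label: str = "_cse_example") -> str:
    """Collect htmlUrl values from an OPML document and emit annotations XML."""
    root = ET.fromstring(opml_text)
    customizations = ET.Element("GoogleCustomizations")
    annotations = ET.SubElement(customizations, "Annotations")
    # iter() also visits nested outlines, not only direct children of <body>.
    for outline in root.iter("outline"):
        html_url = outline.get("htmlUrl")
        if not html_url:
            continue  # an outline without htmlUrl contributes no site
        # Assumption: drop the scheme so the pattern resembles the example above.
        pattern = html_url.split("://", 1)[-1]
        annotation = ET.SubElement(annotations, "Annotation", about=pattern, score="1")
        ET.SubElement(annotation, "Label", name=label)
    return ET.tostring(customizations, encoding="unicode")

sample_opml = """<opml version="1.0">
  <head><title>Bicycles</title></head>
  <body>
    <outline type="rss" text="Road Bikes"
             xmlUrl="http://www.google.com/exampleurl.opml"
             htmlUrl="http://www.google.com/sampleurl1.opml"/>
    <outline type="rss" text="Mountain Bikes"
             xmlUrl="http://www.google.com/exampleurl2.opml"
             htmlUrl="http://www.google.com/sampleurl2.opml"/>
  </body>
</opml>"""

print(opml_to_annotations(sample_opml))
```

Because the Control Panel performs this conversion for you on upload, a script like this is mainly useful for previewing which sites an OPML file would contribute.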
Using the TSV Format
You can create annotations using a text file with tab-separated values (TSV).
You can use a plain text editor or a spreadsheet editor to create the file. It does not matter what you name the file, so long as you save it with the file extension .tsv (for example, cse_bicycles.tsv). If you are using a plain text editor, separate each element by a single tab character. Do not try to prettify and align the lines with multiple tab characters. If you are using a spreadsheet editor, allocate a column for each of the fields.
Each line of text in your TSV file can list a site and its associated labels.
Elements of a Programmable Search Engine TSV
Your TSV files must begin with a heading that enumerates the fields that you will be using in the subsequent annotation lines. The headings are case-sensitive, so follow the capitalization in this guide. The order of the heading elements doesn't really matter, but the annotation lines that follow the heading must follow the order of the headings. When you create the headings, you are essentially creating columns of data, so you can't just plug the annotation data any which way.
A heading has the following fields:
- URL - The URL pattern of the site.
- Label - The search engine label or refinement label that should be applied to the site. You can get the labels for your search engine from the Context section of the Advanced tab in the Control Panel. You'll find at least two search engine or background labels: one for adding sites to your Programmable Search Engine and one for excluding sites from it. If you have not changed the search engine label, the label for including sites is in the form of _cse_xxxxxxxxxxx, where x is a character, and the label for excluding sites is in the form of _cse_exclude_xxxxxxxxxxx. To avoid errors, copy and paste these labels instead of typing them by hand.
- Comment - Optional. Notes about each annotation.
- Score - Optional. Discussed in detail in the Ranking Search Results page.
- Custom Field - Optional. Your own attributes. To create an attribute, just prefix it with "A=". For example, to create a date attribute, use "A=Date". Programmable Search Engine does not process these fields.
Each subsequent line corresponds to an annotation. It provides the values for the fields that were defined in the headings.
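If you generate the file from a script rather than an editor, the heading-then-rows layout can be sketched in Python as follows. This is only an illustration: the helper name write_annotations_tsv, the www.example.com patterns, and the placeholder label are assumptions, and the csv module is used simply because it guarantees exactly one tab between fields.

```python
import csv
import io

def write_annotations_tsv(rows, fieldnames=("URL", "Label")):
    """Return TSV text: one heading line, then one line per annotation."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(fieldnames),
        delimiter="\t",        # exactly one tab between fields
        lineterminator="\n",
        restval="",            # optional fields missing from a row stay empty
    )
    writer.writeheader()       # heading line, in the order given by fieldnames
    writer.writerows(rows)     # each row follows the heading order
    return buffer.getvalue()

tsv_text = write_annotations_tsv(
    [
        {"URL": "www.example.com/docs/*", "Label": "_cse_abcdefghijk", "Comment": "docs"},
        {"URL": "www.example.com/blog/*", "Label": "_cse_abcdefghijk"},
    ],
    fieldnames=("URL", "Label", "Comment"),
)
print(tsv_text)
```

Writing rows as dictionaries keyed by heading name keeps every value in the column its heading defines, whatever order you choose for the headings.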
TSV Example
Let's look at an example of a basic TSV file.
URL	Label
www.webmd.com/hw/*	_cse_Ansi-stoubiq
www.webmd.com/hw/cancer/*	_cse_exclude_Ansi-stoubiq
The example has a heading with the two required fields: URL and Label. The two annotation lines supply the values for the fields. The label in the first annotation line, _cse_Ansi-stoubiq, adds the site, www.webmd.com/hw/*, to the search engine. The other label, _cse_exclude_Ansi-stoubiq, excludes the site, www.webmd.com/hw/cancer/*, from the search engine.
You can add more fields to your TSV annotations. Here's an example that includes a Comment field and a custom field, A=Date.
URL	Label	Comment	A=Date
www.cancer.gov/cancertopics/types/liver/*	_cse_Ansi-stoubiq	government site	20060504
www.medicinenet.com/liver_cancer/*	_cse_Ansi-stoubiq	site on symptoms	20060504
www.webmd.com/hw/cancer/*	_cse_Ansi-stoubiq	great site for patients!	20060504
www.oncologychannel.com/*/treatment	_cse_Ansi-stoubiq		20060504
Even though you added new fields in the header, you are not obligated to supply values for all of them, which is why it's fine for the last line to not have a comment. But that's not the case for URL and Label, which are required fields.
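Before uploading, you might want to confirm that every line supplies the two required fields. The following Python sketch shows one way to do that. The function name check_annotations_tsv and the wording of its messages are assumptions for this example; it checks only the rules described in this section (case-sensitive URL and Label headings, and a value for each on every annotation line).

```python
import csv
import io

REQUIRED_FIELDS = ("URL", "Label")  # headings are case-sensitive

def check_annotations_tsv(tsv_text):
    """Return a list of problems found in TSV annotations text (empty if none)."""
    # QUOTE_NONE: treat quote characters in comments as ordinary text.
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE))
    if not rows or not rows[0]:
        return ["file has no heading line"]
    heading = rows[0]
    problems = [f"heading is missing {name}" for name in REQUIRED_FIELDS if name not in heading]
    if problems:
        return problems
    for line_number, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # skip blank lines
        values = dict(zip(heading, row))  # pair each value with its heading
        for name in REQUIRED_FIELDS:
            if not values.get(name, "").strip():
                problems.append(f"line {line_number}: no value for {name}")
    return problems

good = "URL\tLabel\tComment\nwww.webmd.com/hw/*\t_cse_Ansi-stoubiq\t\n"
bad = "URL\tLabel\n\t_cse_Ansi-stoubiq\nwww.webmd.com/hw/*\n"
print(check_annotations_tsv(good))
print(check_annotations_tsv(bad))
```

An empty Comment passes the check, while a missing URL or Label is reported with its line number.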
Using the Programmable Search XML Format
If you want to take advantage of all the features available in the Custom Search JSON API, XML is the way to go.
XML Annotations
The following is an example of XML annotations. It is roughly the XML version of the TSV example in the previous section. It includes the same elements, except for custom attributes, which are available only in the TSV format. This annotations file tells Programmable Search Engine to include everything under www.webmd.com/hw/* but exclude everything under www.webmd.com/hw/cancer/*.
<Annotations>
  <Annotation about="www.cancer.gov/cancertopics/types/liver/*">
    <Label name="_cse_Ansi-stoubiq"/>
    <Comment>government site</Comment>
  </Annotation>
  <Annotation about="www.medicinenet.com/liver_cancer/">
    <Label name="_cse_exclude_Ansi-stoubiq"/>
    <Comment>site on symptoms</Comment>
  </Annotation>
  <Annotation about="www.webmd.com/hw/*">
    <Label name="_cse_Ansi-stoubiq"/>
    <Comment>great sites for patients!</Comment>
  </Annotation>
  <Annotation about="www.webmd.com/hw/cancer/*">
    <Label name="_cse_exclude_Ansi-stoubiq"/>
    <Comment>great sites for patients!</Comment>
  </Annotation>
  <Annotation about="www.oncologychannel.com/*/treatment">
    <Label name="_cse_exclude_Ansi-stoubiq"/>
  </Annotation>
</Annotations>
The annotations file has four elements in the following hierarchy:
- Annotations (root element)
  - Annotation
    - Label
    - Comment (optional)
Creating External Annotations
To list sites you want your search engine to cover, do the following:
- Start the file with the <Annotations></Annotations> root element.
- Create an annotation by adding the <Annotation></Annotation> tags, and then define the about attribute with the URL pattern of the site.

  <Annotations>
    <Annotation about="www.webmd.com/hw/cancer/*">
    </Annotation>
  </Annotations>

- Associate the site with the search engine by using the <Label name=" "/> tag, and specify how that site should be treated by the search engine. You can get the labels for your search engine from the Context section of the Advanced tab in the Control Panel. You'll find two labels: one for adding sites to your Programmable Search Engine and one for excluding sites from it. If you have not changed the name of the search engine label in the context file, the label for including sites is in the form of _cse_xxxxxxxxxxx, where x is a character, and the label for excluding sites is in the form of _cse_exclude_xxxxxxxxxxx. To avoid errors, copy and paste these labels instead of typing them by hand.

  <Annotations>
    <Annotation about="http://www.solarenergy.org/*">
      <Label name="_cse_abcdefghijk"/>
    </Annotation>
  </Annotations>

  A single site can have multiple labels associated with it.

  If you have changed the name of the label in the context file, remember to update the Label name values in your annotations file.
- To add more sites, create and define another Annotation element.
- Save the XML file.
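For a long list of sites, the steps above can be scripted. The following Python sketch builds the same structure with the standard library. The function name build_annotations and the placeholder labels are assumptions for this example; substitute the labels copied from your own Control Panel.

```python
import xml.etree.ElementTree as ET

def build_annotations(sites):
    """Build annotations XML from (url_pattern, [label, ...]) pairs."""
    root = ET.Element("Annotations")                  # root element
    for url_pattern, labels in sites:
        # One Annotation per site, with the URL pattern in the about attribute.
        annotation = ET.SubElement(root, "Annotation", about=url_pattern)
        for label in labels:                          # a site can carry several labels
            ET.SubElement(annotation, "Label", name=label)
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")

xml_text = build_annotations([
    ("www.webmd.com/hw/*", ["_cse_abcdefghijk"]),
    ("www.webmd.com/hw/cancer/*", ["_cse_exclude_abcdefghijk"]),
])
print(xml_text)

# To save the file, write xml_text to a path of your choice, for example:
# pathlib.Path("annotations.xml").write_text(xml_text, encoding="utf-8")
```

Building the tree with a library rather than string concatenation ensures that special characters in URL patterns, such as an ampersand, are escaped correctly.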
Improving Search Coverage
Programmable Search Engine is built on top of the Google index. This means that webpages that are in the Google index are available to your search engine; conversely, webpages that have not been crawled by Google will not show up in your search results. If you want your Programmable Search Engine to include sites that are not currently in the Google index, submit a Sitemap to Google Search Console.
A Sitemap includes a list of pages in your site, as well as information about the update frequency of the webpages and their importance relative to each other. Submitting a Sitemap helps Google discover your webpages and improve the crawling schedule. To learn more about Sitemaps, see the Webmaster Help Center and Using the Sitemap Protocol. If you are interested in building fancier Sitemaps, see http://www.sitemaps.org/protocol.php.
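As an illustration of what such a file contains, the following Python sketch generates a small Sitemap with a location, an update-frequency hint, and a relative priority for each page. The element names and namespace come from the Sitemap protocol; the function name build_sitemap and the www.example.com pages are assumptions for this example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build a Sitemap from (loc, changefreq, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc                     # page URL
        ET.SubElement(url, "changefreq").text = changefreq       # update frequency hint
        ET.SubElement(url, "priority").text = f"{priority:.1f}"  # importance relative to other pages
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

sitemap_text = build_sitemap([
    ("https://www.example.com/", "daily", 1.0),
    ("https://www.example.com/archive/2008.html", "yearly", 0.3),
])
print(sitemap_text)
```

The generated file still has to be submitted through Google Search Console, as described above; creating it does not by itself cause pages to be crawled.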
Submitting Sitemaps is particularly helpful if your site has the following:
- Dynamic content
- Webpages that aren't easily discovered by Googlebot (Google's web crawler), such as pages with rich AJAX or Flash features
- Few websites linking to it. Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it is hard for the crawler to discover it. A new website typically has few other sites linking to it.
- A large archive of content pages that does not have a strong network of cross-linking
Google can index only the pages it can access. If you use a robots.txt file or robots meta tags, make sure they don't block crawlers from the pages you want in your search engine.
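One quick way to spot an accidental block is to test your robots.txt rules locally. The Python sketch below uses the standard library's urllib.robotparser for that. The function name googlebot_can_fetch and the sample rules are assumptions for this example, and the standard-library parser implements the basic robots.txt rules only, so treat its answer as an approximation of how Googlebot interprets the file.

```python
from urllib.robotparser import RobotFileParser

def googlebot_can_fetch(robots_txt: str, url: str) -> bool:
    """Check a URL against robots.txt rules for the Googlebot user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse rules from text, no network access
    return parser.can_fetch("Googlebot", url)

robots_txt = """\
User-agent: *
Disallow: /private/
"""

print(googlebot_can_fetch(robots_txt, "https://www.example.com/docs/page.html"))
print(googlebot_can_fetch(robots_txt, "https://www.example.com/private/page.html"))
```

This checks robots.txt only; robots meta tags live in each page's HTML and need to be reviewed separately.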
Improved coverage is not instantaneous, as it takes some time for the pages to be crawled and indexed. But once your webpages are in the index, they could appear in both Google search and your Programmable Search Engine.
Annotations Limits
The following table lists the limits for annotations files that are uploaded to Programmable Search Engine:
Note: Follow the limits closely; if you exceed them, your search engine might not show results.
Aspect | Limit |
---|---|
File size (context or annotations files) | 30KB |
Number of annotations per file | 2,000 |
Maximum number of annotations per search engine | 5,000. Tip: If you find your search engine outgrowing the 5,000-annotation limit, consider consolidating individual URLs into URL patterns. |
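A pre-upload check against the per-file limits in the table can be sketched in Python as follows. The function name check_annotations_limits is an assumption for this example, as is reading "30KB" as 30 × 1024 bytes; the per-search-engine limit is not checked because it spans all of your files.

```python
import xml.etree.ElementTree as ET

MAX_FILE_BYTES = 30 * 1024          # assumption: "30KB" read as 30 * 1024 bytes
MAX_ANNOTATIONS_PER_FILE = 2000

def check_annotations_limits(xml_text):
    """Return a list of per-file limit violations (empty if none)."""
    problems = []
    size = len(xml_text.encode("utf-8"))   # size in bytes, not characters
    if size > MAX_FILE_BYTES:
        problems.append(f"file is {size} bytes; limit is {MAX_FILE_BYTES}")
    count = sum(1 for _ in ET.fromstring(xml_text).iter("Annotation"))
    if count > MAX_ANNOTATIONS_PER_FILE:
        problems.append(f"file has {count} annotations; limit is {MAX_ANNOTATIONS_PER_FILE}")
    return problems

small = '<Annotations><Annotation about="www.example.com/*"/></Annotations>'
print(check_annotations_limits(small))
```

If a file fails the check, splitting it into several files or replacing individual URLs with broader URL patterns are the two remedies this page suggests.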