Feeding your Google Search Appliance

You can feed content and document locations (URLs) to your search appliance. Since the Google Search Appliance can already spider content, you may ask why you would want to do this. Here are a few reasons:

  1. content is not reachable by the normal spidering process
  2. you want to add metadata to the content
  3. you want to prompt the GSA to index new content in a timely manner

For the Swiss Television and Radio project we were confronted by all three of these problems.

Wysistats was used to record the most popular news stories, videos and audio clips. Metadata for the top 10 was then fed into the Google Search Appliance so that it could be accessed from the search page.

Swiss Television and Radio have a lot of information in video and audio formats. The Google Search Appliance cannot index this information. Instead, a feed was used to push metadata for pages containing these formats, so that the search results can show either an embedded object or an image.

Journalists are constantly adding articles to the website and want to see that information indexed immediately. This is particularly important for fast moving current events.

Enabling Feeds

You will need to allow the IP address of your host to post feed data. This is done in the GSA Admin tool, on the Crawl and Index > Feeds page. On the same page you will also need to give the format of any URLs to be indexed from the feed.

Feed URL

The Google Search Appliance feed URL is found at the following address:

http://myappliance:19900/xmlfeed

Replace the myappliance part with the hostname of your GSA. Feed data is sent to this URL with an HTTP POST.

Feed Protocol

You need to supply the following data:

  • Datasource – name of the feed
  • Feedtype – feed type
    • metadata-and-url
    • full
    • incremental
  • Data – feed data in XML format
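
The three fields above map onto a multipart/form-data POST. The following is a minimal Python sketch using only the standard library. The function names encode_feed_form and post_feed, the file name feed.xml and the boundary handling are illustrative assumptions, not part of any Google API.

```python
import urllib.request
import uuid

# Feed types listed in the Feed Protocol section.
FEED_TYPES = {"metadata-and-url", "full", "incremental"}


def encode_feed_form(datasource, feedtype, feed_xml, boundary=None):
    """Return (body_bytes, content_type) for a multipart/form-data feed POST.

    feed_xml must be bytes (the raw contents of the feed file).
    """
    if feedtype not in FEED_TYPES:
        raise ValueError(f"unknown feedtype: {feedtype!r}")
    boundary = boundary or uuid.uuid4().hex
    parts = []
    # Plain text fields: datasource and feedtype.
    for name, value in (("datasource", datasource), ("feedtype", feedtype)):
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # The feed itself is sent as a file part named "data".
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="data"; filename="feed.xml"\r\n'
        f'Content-Type: text/xml\r\n\r\n'.encode("utf-8")
        + feed_xml
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def post_feed(feed_url, datasource, feedtype, feed_xml):
    """POST the feed to the appliance and return its response text."""
    body, content_type = encode_feed_form(datasource, feedtype, feed_xml)
    request = urllib.request.Request(
        feed_url, data=body, headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", "replace")
```

A call would look like post_feed("http://myappliance:19900/xmlfeed", "sample_feed", "metadata-and-url", open("feed.xml", "rb").read()). The encoding step is kept separate from the network call so that it can be checked without an appliance.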

Feeding URLs

There is a special datasource name called “web”, which is just a feed of URLs to be indexed. In the case of updates to our Content Management System we took the RSS (Really Simple Syndication) output, extracted the URLs and pushed them into the Google Search Appliance.

info:http://tsr.preview.ecedev.tsr.ch/video/emissions/abe/1026320-whiskys-abe-veille-au-grain.html
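
The RSS-to-feed step could be sketched in Python roughly as follows. This is not the code used on the project: the function names are hypothetical, the RSS is assumed to be plain RSS 2.0 with one link element per item, and the feed type metadata-and-url and the mimetype text/html are assumptions for a URL-only feed.

```python
import xml.etree.ElementTree as ET

DOCTYPE = '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">'


def urls_from_rss(rss_text):
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        link.text.strip()
        for link in root.iterfind("./channel/item/link")
        if link.text and link.text.strip()
    ]


def build_web_feed(urls):
    """Build a URL-only feed for the special "web" datasource."""
    gsafeed = ET.Element("gsafeed")
    header = ET.SubElement(gsafeed, "header")
    ET.SubElement(header, "datasource").text = "web"
    # Assumption: URL-only feeds use the metadata-and-url feed type.
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(gsafeed, "group")
    for url in urls:
        # Assumption: every RSS link points at an HTML page.
        ET.SubElement(group, "record", url=url, action="add", mimetype="text/html")
    body = ET.tostring(gsafeed, encoding="unicode")
    return f'<?xml version="1.0" encoding="UTF-8"?>\n{DOCTYPE}\n{body}'
```

The returned string declares UTF-8, so it should be encoded as UTF-8 before being posted to the feed URL.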

Feeding Metadata

Image and Video Indexing

In this example we want to provide a thumbnail image and a video link which should be included in search results for any page that contains video. We could do this via metadata but the client wanted to separate this process.

The following Feed XML shows two bits of metadata: an image and a video. These are extracted from the Content Management System. Note that the feed type is metadata-and-url and we've used the name sample_feed.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
 <header>
  <datasource>sample_feed</datasource>
  <feedtype>metadata-and-url</feedtype>
 </header>
 <group>
  <record url="http://tsr.preview.ecedev.tsr.ch/video/emissions/abe/1026320-whiskys-abe-veille-au-grain.html" action="add" mimetype="text/html">
   <metadata>
     <meta name="image" content="http://www.tsr.ch/xobix_media/images/tsr/2007/a_bon_entendeur_20071023_8345356_1.jpg" />
     <meta name="video" content="http://media.tsr.ch/xobix_media/tsr/abe/2007/abe_10232007-450k.wmv" />
   </metadata>
  </record>
 </group>
</gsafeed>
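
When the metadata comes out of a Content Management System, the feed is usually generated rather than written by hand. A minimal Python sketch of that step, assuming the metadata is available as a dictionary per page; the function name build_metadata_feed and its arguments are hypothetical, and the sketch declares UTF-8 rather than the ISO-8859-1 used in the example above.

```python
import xml.etree.ElementTree as ET


def build_metadata_feed(datasource, pages):
    """Build a metadata-and-url feed.

    pages maps each page URL to a dict of {meta name: meta content}.
    """
    gsafeed = ET.Element("gsafeed")
    header = ET.SubElement(gsafeed, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(gsafeed, "group")
    for url, metadata in pages.items():
        record = ET.SubElement(
            group, "record", url=url, action="add", mimetype="text/html"
        )
        meta_parent = ET.SubElement(record, "metadata")
        for name, content in metadata.items():
            # ElementTree escapes &, < and quotes in attribute values,
            # so URLs with query strings stay well-formed.
            ET.SubElement(meta_parent, "meta", name=name, content=content)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">\n'
        + ET.tostring(gsafeed, encoding="unicode")
    )
```

Building the document with an XML library rather than string concatenation avoids malformed feeds when a URL or a metadata value contains characters such as & or ".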

cURL

We can submit this data directly to the feed URL using the cURL utility, which is pretty simple. We should get back a Success response. Note that the feed name is taken from the datasource given on the command line, not the one in the XML file. If you don't include the datasource and feedtype on the command line as shown below, the command will fail with a syntax error.

$ curl -F "datasource=samplefeed" -F "feedtype=metadata-and-url" -F "data=@feed.xml;type=text/xml" \
    http://search.mpc.tsr.ch:19900/xmlfeed

Success

Search Results

To show this metadata in the Search Results we will need to create a front end and edit its style sheet. Include the following lines above the Result Header section:

<xsl:for-each select="MT">
    <xsl:if test="@N='image' and @V!=''">
        <a href="{$protocol}://{$escaped_url}"><img align="left" height="60" width="80" src="{@V}"/></a>
    </xsl:if>
</xsl:for-each>
<!-- *** Result Header *** -->

For each result we loop over the meta tags (MT), testing for our image meta tag and making sure its content (@V) is not empty. We then output a link to the result, using the image as the content of the anchor element.

To include a video snippet after the result snippet we can use similar code, this time checking for the video meta tag.

<xsl:for-each select="MT">
  <xsl:if test="@N='video' and @V!=''">
    <p><object width="90" height="90" data="{@V}" type="video/x-ms-wmv">
    <param name="src" value="{@V}" />
    <param name="autostart" value="true" />
    <param name="controller" value="false" />
    </object></p>
  </xsl:if>
</xsl:for-each>
<!-- *** Result Footer *** -->
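
Before editing the style sheet it can help to confirm that the fed metadata actually appears in the raw XML results. The sketch below applies the same test as the XSLT (an MT element with a non-empty V attribute) in Python. It assumes each result is an R element whose URL is in a U child element; only the MT element and its N and V attributes are taken from the style sheet above, and the function name result_metadata is hypothetical.

```python
import xml.etree.ElementTree as ET


def result_metadata(results_xml):
    """Map each result URL to its non-empty MT name/value pairs."""
    root = ET.fromstring(results_xml)
    found = {}
    # Assumption: results are <R> elements with the URL in a <U> child.
    for result in root.iter("R"):
        url = result.findtext("U", default="")
        found[url] = {
            mt.get("N"): mt.get("V")
            for mt in result.findall("MT")
            if mt.get("V")  # same check as @V != '' in the style sheet
        }
    return found
```

A result that comes back with an empty dictionary has no usable metadata, in which case the style sheet will render neither the thumbnail nor the video object for it.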

Feed Backlog

To find out how many feed files remain to be processed, run the following command:

$ curl http://search.mpc.tsr.ch:19900/getbacklogcount
1
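
A script that pushes several feeds may want to wait until the backlog has drained. A minimal Python sketch, assuming the getbacklogcount URL returns just an integer as in the output above; the function names, the polling interval and the poll limit are illustrative choices.

```python
import time
import urllib.request


def fetch_backlog_count(url):
    """Return the integer printed by the getbacklogcount URL."""
    with urllib.request.urlopen(url) as response:
        return int(response.read().decode("ascii").strip())


def wait_for_empty_backlog(url, fetch=fetch_backlog_count, interval=5.0, max_polls=60):
    """Poll until the backlog is 0.

    Returns True once the backlog is empty, False if max_polls is reached.
    """
    for attempt in range(max_polls):
        if fetch(url) == 0:
            return True
        # No need to sleep after the final poll.
        if attempt < max_polls - 1:
            time.sleep(interval)
    return False
```

The fetch argument exists so that the polling logic can be exercised with a stand-in function instead of a live appliance.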

tech/search/feeding-your-google-search-appliance.txt · Last modified: 2010/03/08 16:17 by davidof