These are some random tips and tricks to get more out of your Google Search Appliance (GSA). They were developed on version 6.2.
XML is the internal format generated by the Google Search Appliance. Normally this is processed by a Front End, that is, a piece of XSLT that produces the formatted results. However, you can ask for the XML directly, without passing it through this style sheet. For example:
http://gsa.yoursite.com/search?site=your_collection&client&q=federer
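If you want to build this kind of request from a script, a minimal sketch in Python follows. The host, collection and front end names are placeholders. The `output=xml_no_dtd` parameter is my assumption about how to ask for unstyled XML, so check it against the search protocol documentation for your version.

```python
from urllib.parse import urlencode


def build_search_url(host, query, collection, frontend, **extra):
    """Build a GSA search URL that asks for XML rather than styled HTML."""
    params = {
        "site": collection,      # collection to search
        "client": frontend,      # front end name
        "q": query,              # the search terms
        "output": "xml_no_dtd",  # assumption: request XML output
        "ie": "UTF-8",           # input encoding
    }
    params.update(extra)         # optional extras such as lr or sort
    return f"http://{host}/search?{urlencode(params)}"


url = build_search_url(
    "gsa.yoursite.com", "federer", "your_collection", "your_frontend",
    lr="lang_fr", sort="date:D:L:d1",
)
print(url)
```

Note that `urlencode` escapes the colons in the sort value as `%3A`, which is the same escaped form the appliance echoes back in the `original_value` attribute of its response.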
The metadata consists of a version number (3.2), the time the search took (0.072854 seconds), the query (federer) and then all the parameters that were passed to the Search Appliance by the search form.
<GSP VER="3.2">
  <TM>0.072854</TM>
  <Q>federer</Q>
  <PARAM name="site" value="articles" original_value="articles"/>
  <PARAM name="client" value="tf1" original_value="tf1"/>
  <PARAM name="q" value="federer" original_value="federer"/>
  <PARAM name="ie" value="UTF-8" original_value="UTF-8"/>
  <PARAM name="ip" value="146.159.4.201" original_value="146.159.4.201"/>
  <PARAM name="access" value="p" original_value="p"/>
  <PARAM name="sort" value="date:D:L:d1" original_value="date%3AD%3AL%3Ad1"/>
  <PARAM name="entqr" value="3" original_value="3"/>
  <PARAM name="lr" value="lang_fr" original_value="lang_fr"/>
This page holds ten results, numbered 1 to 10 (the SN and EN attributes), so we are on the first page of results.
<RES SN="1" EN="10">
  <M>81</M>
  <FI/>
  <NB>
    <NU>/search?q=federer&site=my_site&lr=lang_fr&ie=UTF-8&access=p&sort=date:D:L:d1&start=10&sa=N</NU>
  </NB>
Here is the first result:
  <R N="1">
    <U>http://myhost.com/sport/tennis/messieurs/1343076-aus-open-federer-convaincant.html</U>
    <UE>http://myhost.com/sport/tennis/messieurs/1343076-aus-open-federer-convaincant.html</UE>
    <T>AUS Open: <b>Federer</b> <b>...</b></T>
    <RK>10</RK>
    <ENT_SOURCE>T2-GJMCBQS9EQSAB</ENT_SOURCE>
    <FS NAME="date" VALUE=""/>
    <S>AUS Open: <b>Federer</b> convaincant.<br> tf1.tv. Tout le sport en vidéo. <b>...</b> AUS Open: <b>Federer</b> convaincant. <b>...</b></S>
    <LANG>fr</LANG>
    <HAS>
      <L/>
      <C SZ="48k" CID="w1hHNssA6b8J" ENC="UTF-8"/>
    </HAS>
  </R>
  <R N="2" L="2">
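If you consume the XML yourself instead of going through a Front End, the fields described above are easy to pull out with a standard XML parser. This is only a sketch: the sample document is a cut-down, well-formed version of the fragments shown here, the title markup in it is entity-escaped (an assumption about how the appliance serialises it), and reading M as the estimated total number of matches is also my assumption.

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed sample based on the fragments in this article.
SAMPLE = """<GSP VER="3.2">
  <TM>0.072854</TM>
  <Q>federer</Q>
  <PARAM name="site" value="articles" original_value="articles"/>
  <PARAM name="q" value="federer" original_value="federer"/>
  <RES SN="1" EN="10">
    <M>81</M>
    <R N="1">
      <U>http://myhost.com/sport/tennis/messieurs/1343076-aus-open-federer-convaincant.html</U>
      <T>AUS Open: &lt;b&gt;Federer&lt;/b&gt; &lt;b&gt;...&lt;/b&gt;</T>
      <FS NAME="date" VALUE=""/>
      <LANG>fr</LANG>
    </R>
  </RES>
</GSP>"""


def summarize(xml_text):
    """Return the search metadata and a list of results as plain dicts."""
    root = ET.fromstring(xml_text)
    res = root.find("RES")  # absent when the search returns nothing
    results = []
    if res is not None:
        for r in res.findall("R"):
            fs = r.find("FS[@NAME='date']")
            results.append({
                "rank": int(r.get("N")),
                "url": (r.findtext("U") or "").strip(),
                "title": (r.findtext("T") or "").strip(),
                "date": fs.get("VALUE") if fs is not None else "",
            })
    return {
        "version": root.get("VER"),
        "seconds": float(root.findtext("TM")),
        "query": root.findtext("Q"),
        "params": {p.get("name"): p.get("value") for p in root.findall("PARAM")},
        "first": int(res.get("SN")) if res is not None else 0,
        "last": int(res.get("EN")) if res is not None else 0,
        "estimated_total": int(res.findtext("M")) if res is not None else 0,
        "results": results,
    }


summary = summarize(SAMPLE)
print(summary["query"], summary["first"], summary["last"], summary["estimated_total"])
```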
<!-- Added for RTS -->
<xsl:variable name="referrer_host">
  <xsl:for-each select="/GSP/PARAM[@name = 'referrer']"><xsl:value-of select="@original_value"/>
  </xsl:for-each>
</xsl:variable>
...
<!-- **********************************************************************
 Spelling suggestions in result page (do not customize)
     ********************************************************************** -->
<xsl:template name="spelling">
  <xsl:if test="/GSP/Spelling/Suggestion">
    <p><span class="p"><font color="{$spelling_text_color}">
      <xsl:value-of select="$spelling_text"/>
      <xsl:call-template name="nbsp"/>
    </font></span>
    <!-- ampersands inside an attribute value must be written as &amp; in the stylesheet -->
    <a href="http://{$referrer_host}/services/recherche/?q={/GSP/Spelling/Suggestion[1]/@qe}&amp;spell=1&amp;{$base_url}" target="_parent">
      <xsl:value-of disable-output-escaping="yes" select="/GSP/Spelling/Suggestion[1]"/>
    </a>
    </p>
  </xsl:if>
</xsl:template>
You may want to change the URLs output by the Google Search Appliance to some other value. One reason I can think of is that you are developing before the launch of the website: you are indexing a staging server but want the results to point to the production server at launch time. You can do this with the following bit of code:
<xsl:variable name="rewrite_from" select="'http://staging.myhost.com'"/>
<xsl:variable name="rewrite_to" select="'http://production.myhost.com'"/>
<xsl:variable name="rewritten_url">
  <xsl:choose>
    <!-- rewrite only when UE starts with the staging prefix -->
    <xsl:when test="(substring-before(UE, $rewrite_from) = '') and contains(UE, $rewrite_from)">
      <xsl:value-of select="concat($rewrite_to, substring-after(UE, $rewrite_from))"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="UE"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:variable>
<!-- strip the protocol -->
<xsl:variable name="display_url3" select="substring-after($rewritten_url, '//')"/>
...
<a href="{$protocol}://{$display_url3}" target="_parent">...</a>
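To convince yourself of what the test in the stylesheet does, here is the same logic as a small Python sketch. The function name is mine and the two host names are the placeholders used above. The XSLT condition (nothing before the prefix, and the prefix is present) amounts to a "starts with" check, so a URL that merely contains the staging address somewhere later on is left alone.

```python
def rewrite_url(url,
                rewrite_from="http://staging.myhost.com",
                rewrite_to="http://production.myhost.com"):
    """Swap the staging prefix for the production one, only at the start."""
    if url.startswith(rewrite_from):
        return rewrite_to + url[len(rewrite_from):]
    return url


print(rewrite_url("http://staging.myhost.com/sport/tennis/index.html"))
print(rewrite_url("http://elsewhere.example/?u=http://staging.myhost.com/"))
```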
You may want to search your documents by date or include document dates in the search results. By default the Google Search Appliance uses the HTTP Last-Modified header as the document date.
$ curl -I http://www.tf1.tv/sport/tennis/dames/1921182-patty-et-timea-depassees-par-les-williams.html
HTTP/1.1 200 OK
Content-Type: text/html;charset=UTF-8
Content-Language: fr
Cache-Control: max-age=150
Date: Tue, 04 May 2010 16:21:13 GMT
Connection: keep-alive
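If you need to run this check over many pages, the header test is simple to script. The sketch below works on saved `curl -I` output rather than making requests itself; the helper name is mine, and the sample text is the response shown above.

```python
HEAD_RESPONSE = """HTTP/1.1 200 OK
Content-Type: text/html;charset=UTF-8
Content-Language: fr
Cache-Control: max-age=150
Date: Tue, 04 May 2010 16:21:13 GMT
Connection: keep-alive"""


def parse_headers(raw):
    """Split a raw HEAD response into its status line and a header dict."""
    lines = raw.strip().splitlines()
    status = lines[0]
    headers = {}
    for line in lines[1:]:
        if ":" in line:
            # Split on the first colon only: values such as dates contain colons.
            name, value = line.split(":", 1)
            headers[name.strip().lower()] = value.strip()
    return status, headers


status, headers = parse_headers(HEAD_RESPONSE)
has_last_modified = "last-modified" in headers
print(status, "| Last-Modified present:", has_last_modified)
```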
A quick check of the website shows that this header is not being set. The result is that there are no document dates, so you can't order by date, give preference to newer documents or show dates in the results. However, the menu item Crawl and Index → Document Dates can help. In the form you can specify a URL pattern, a location (Body, URL, Title…) and a date format. For example, you may have a date in the first H2 of the page in the format: two-digit day (02, 23), month name (June, Juin), year in digits (2009).
You can tell the Google Search Appliance to look for this kind of date in the Body of the document. We then wanted to output the date in DD-MM-YYYY format in the Title (the span with CSS class 'l'). The date is stored in YYYY-MM-DD format internally, so it is just a case of separating the string into its component parts and re-outputting them in reverse order.
<xsl:variable name="date" select="FS[@NAME='date']/@VALUE"/>
<br/>
<xsl:variable name="dayDate" select="substring-after($date, '-')"/>
<xsl:value-of select="substring-after($dayDate, '-')"/>-<xsl:value-of select="substring-before($dayDate, '-')"/>-<xsl:value-of select="substring-before($date, '-')"/>
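The stylesheet simply emits the three dash-separated parts of the stored value in reverse order. A rough Python equivalent follows, assuming the value looks like 2010-05-04 (year first), which is what that ordering implies if the output is to be DD-MM-YYYY. The function name is mine. Unlike the XSLT, this version returns an empty string when there is no date, instead of printing stray dashes.

```python
def to_day_month_year(stored_date):
    """Turn 'YYYY-MM-DD' into 'DD-MM-YYYY'; return '' when no date is set."""
    if not stored_date:
        return ""
    year, month, day = stored_date.split("-")
    return f"{day}-{month}-{year}"


print(to_day_month_year("2010-05-04"))
```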
The GSA provides a lot of information for diagnosing problems. Under the Status and Reports menu you can export all the crawled URLs. Here we found a problem with canonical URLs: a single page of content could be reached through hundreds of different URLs by adding a &page=xx query string. The only differences between these pages were one paragraph of content and the query string. These duplicates can be kept out of the index using a rule in the website's robots.txt file (this is good as it works for all external search engines too), using the nofollow attribute or, maybe, with the canonical attribute. Alternatively, go to the Crawl and Index menu and click on the Crawl URLs form. In the 'Do not crawl URLs with the following patterns' box add the following type of rule:
contains:?page=
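Before adding a rule like this it helps to see how many crawled URLs collapse onto the same page once the paging parameter is removed. The sketch below groups a list of URLs that way. The function names and sample URLs are mine, and `matches_rule` treats the `contains:` rule as a plain substring test, which is my simplified reading of the pattern syntax.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_param(url, name="page"):
    """Return the URL with one query parameter removed."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != name]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))


def group_paginated(urls, name="page"):
    """Map each base URL to its variants, keeping only real duplicates."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_param(url, name)].append(url)
    return {base: found for base, found in groups.items() if len(found) > 1}


def matches_rule(url, needle="?page="):
    """Assumption: a contains: rule is a literal substring match."""
    return needle in url


crawled = [
    "http://myhost.com/news/article.html?page=1",
    "http://myhost.com/news/article.html?page=2",
    "http://myhost.com/news/list.html?id=3&page=2",
    "http://myhost.com/other.html",
]
duplicates = group_paginated(crawled)
print(duplicates)
print([u for u in crawled if matches_rule(u)])
```

If that substring reading is right, a URL where page is not the first parameter (`...?id=3&page=2`) would slip past a `?page=` rule, so it is worth checking which form your site actually produces.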
If you sort the exported list of crawled URLs, it can be useful for spotting areas of the website that are not being explored. You may even compare it with the URLs in the server logs or in a sitemap generated by your CMS to see if pages are getting missed for some reason (no inbound link, for example).
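The comparison with a sitemap is a simple set difference. This sketch assumes the export has been reduced to one URL per line (I have not reproduced the real export format here) and that the sitemap follows the standard sitemaps.org schema. The sample data and function names are made up.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://myhost.com/a.html</loc></url>
  <url><loc>http://myhost.com/b.html</loc></url>
  <url><loc>http://myhost.com/c.html</loc></url>
</urlset>"""


def sitemap_urls(sitemap_xml):
    """Collect every <loc> value from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return {loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text}


def missing_from_crawl(crawled_lines, sitemap_xml):
    """Return sitemap URLs that never show up in the crawled list, sorted."""
    crawled = {line.strip() for line in crawled_lines if line.strip()}
    return sorted(sitemap_urls(sitemap_xml) - crawled)


crawled_lines = ["http://myhost.com/a.html\n", "http://myhost.com/c.html\n", "\n"]
missing = missing_from_crawl(crawled_lines, SAMPLE_SITEMAP)
print(missing)
```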
Often the first place you will notice
Under Status and Reports go to the Crawl Diagnostics page, where you will see the following kind of information.
On the form you can zoom in on the kind of error and on certain URLs. Notice the large number of crawl errors and excluded URLs. Examining the excluded URLs, it seems that someone had blocked the GSA in the robots.txt.
In this case fix the robots.txt (it is under the document root). You can then schedule the directory for a recrawl from the same page or go to the Crawl and Index → Freshness Tuning page and scroll to the bottom of the page. Enter the URLs/domains to recrawl in the 'Recrawl these URLs' box and click the button.
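Before waiting for a recrawl, you can test a robots.txt offline with Python's standard robot parser. Both robots.txt samples below are invented for illustration, and the user agent name `gsa-crawler` is an assumption: use whatever name your appliance is configured to send.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="gsa-crawler"):
    """Check whether the given user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical file that shuts the crawler out of the whole site.
BLOCKING = """User-agent: gsa-crawler
Disallow: /
"""

# Hypothetical fix: an empty Disallow lets the crawler in everywhere.
FIXED = """User-agent: gsa-crawler
Disallow:

User-agent: *
Disallow: /private/
"""

page = "http://myhost.com/sport/tennis/index.html"
print("before:", is_allowed(BLOCKING, page))
print("after: ", is_allowed(FIXED, page))
```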
On the real-time diagnostics page you can get the GSA's view of the content it is indexing by fetching a specific URL.
This gives the HTTP return code (200 is good), any redirects (the fewer the better) and the HTTP headers.
You can also examine URLs that are currently being processed by the crawl and indexing pipeline.
This confirms that the GSA is indexing and shows what the results are for each page. You can focus on certain sections of the site if necessary. Remember that this has a performance impact. At a lower level you can even get a packet dump (up to 1GB of data), which you can export to a machine with a tool such as Wireshark for offline analysis.