Solr 1.4+ and Liferay 6.0+ Integration

Objectives

  • Install and configure Solr from the SVN Trunk
  • Integrate Solr with Liferay running under the same Tomcat web application server
  • Configure Solr to index data stored in a database and files stored on a file system using a JDBC import handler and Tika

Introduction

By default Liferay comes with Lucene as its search engine. Liferay uses Lucene for full-text indexing of its content, including events and web content.

Replacing the default Liferay Lucene indexer with Solr brings a number of benefits. You get all the Solr configuration goodness, including stop words, localization and synonyms, as well as better support in clustered environments. Solr can run outside the Liferay web container, even on a separate host.

This guide outlines installing Solr to run under the Liferay Tomcat instance, but with a few obvious changes Solr could run under its own web application server such as Jetty.

Solr Install

Download Solr and configure a solr-instance directory under, say, /opt (on Linux). The instance directory should have the following layout:

solr-instance/
 lib/
 conf/
  solrconfig.xml
  data-config.xml
  schema.xml
 contrib/
 data/
  index/
  spellchecker/
 solr.war
 solr-web.war
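
The directory skeleton above can be created in one step. The following is a minimal sketch, assuming a POSIX shell; make_solr_instance is a hypothetical helper name, not part of Solr. It only creates the directories you populate by hand and leaves data/index and data/spellchecker alone, on the assumption that Solr creates them itself when it first builds the index.

```shell
#!/bin/sh
# Sketch: create the empty skeleton of a Solr instance directory.
# make_solr_instance is a hypothetical helper name, not part of Solr.
make_solr_instance() {
  root="$1"
  # Only the directories that are populated by hand are created here.
  # data/index and data/spellchecker are left for Solr to create.
  mkdir -p "$root/lib" "$root/conf" "$root/contrib" "$root/data"
}
```

A call such as make_solr_instance /opt/solr-instance would prepare the directory; solr.war, solr-web.war and the files under conf/ still have to be copied in afterwards.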

In conf/solrconfig.xml make sure the contrib/extraction/lib directory is in the Solr classpath. Add other directories as appropriate.

<lib dir="/opt/solr-instance/contrib/extraction/lib" />

Add the Solr instance directory to the Tomcat environment by editing catalina.sh:

$ vi bin/catalina.sh

and adding the following option to JAVA_OPTS:

-Dsolr.solr.home=/opt/solr-instance

Create a Tomcat context that points to the solr.war file in /opt/solr-instance:

conf/Catalina/localhost/solr.xml

<?xml version="1.0" encoding="utf-8"?>
<Context docBase="/opt/solr-instance/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/opt/solr-instance" override="true"/>
</Context>

This will deploy Solr in the Tomcat webapps directory on startup.

Solr Liferay Plugin

Add the Solr plugin, currently at version 6.0.1.1. Download it to your solr-instance directory and rename it to solr-web.war:

$ wget http://liferay.cignex.com/palm_tree/book/0387/chapter12/solr-web-6.0.1.1.war
$ mv solr-web-6.0.1.1.war solr-web.war

Add the following context to Tomcat in conf/Catalina/localhost/solr-web.xml:

<?xml version="1.0" encoding="utf-8"?>
<Context docBase="/opt/solr-instance/solr-web.war" debug="0" crossContext="true">
</Context>

This will deploy the plugin on Tomcat startup.

Solr Plugin Configuration

The XML file solr-spring.xml in solr-web/WEB-INF/classes/META-INF describes how the plugin integrates Solr with the portal. You will need to tell it where to find the Solr instance (by default on localhost, port 8080):

<bean id="solrServer" class="com.liferay.portal.search.solr.server.BasicAuthSolrServer">
  <constructor-arg type="java.lang.String"
    value="http://${solr.host.domain}:${solr.port.number}/solr"
  />
</bean>

The XML file also describes the index searcher, index writer, search engine, and so on via a set of beans:

<bean id="indexSearcher.solr" class="com.liferay.portal.search.solr.SolrIndexSearcherImpl">
  <property name="solrServer" ref="solrServer" />
</bean>

These let Liferay add, update and delete content in Solr.

Default Schema

The schema.xml in solr-web/WEB-INF/conf describes how fields are indexed in Solr. It should form the basis of your schema even if you add other data sources. The schema describes types, fields, the default search field, and the operation of the Solr query parser.

JAR files

The following jars are included in WEB-INF/lib of solr.war and so are available to the application:

apache-solr-noggit-r944541.jar
commons-codec-1.3.jar
commons-csv-1.0-SNAPSHOT-r609327.jar
commons-fileupload-1.2.1.jar
commons-httpclient-3.1.jar
commons-io-1.4.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
google-collect-1.0.jar
jcl-over-slf4j-1.5.5.jar
slf4j-api-1.5.5.jar
slf4j-jdk14-1.5.5.jar
wstx-asl-3.2.7.jar
lucene-analyzers-common-4.0-dev.jar
lucene-core-4.0-dev.jar
lucene-highlighter-4.0-dev.jar
lucene-memory-4.0-dev.jar
lucene-misc-4.0-dev.jar
lucene-queries-4.0-dev.jar
lucene-spatial-4.0-dev.jar
lucene-spellchecker-4.0-dev.jar
apache-solr-core-4.0-dev.jar
apache-solr-solrj-4.0-dev.jar
apache-solr-dataimporthandler-4.0-dev.jar

If you want to index documents using Tika, add the following additional jars to solr-instance/lib, which is included in the Solr classpath by default. The ojdbc14 jar is included for an Oracle data source.

apache-solr-dataimporthandler-extras-4.0-dev.jar
tika-core-0.6.jar
tika-parsers-0.6.jar
ojdbc14.jar

You will also need some of the jars in contrib/extraction/lib (specify the location in solrconfig.xml):

asm-3.1.jar                           pdfbox-1.1.0.jar
bcmail-jdk15-1.45.jar                 poi-3.6.jar
bcprov-jdk15-1.45.jar                 poi-ooxml-3.6.jar
commons-compress-1.0.jar              poi-ooxml-schemas-3.6.jar
commons-logging-1.1.1.jar             poi-scratchpad-3.6.jar
dom4j-1.6.1.jar                       tagsoup-1.2.jar
fontbox-1.1.0.jar                     tika-core-0.8-SNAPSHOT.jar
geronimo-stax-api_1.0_spec-1.0.1.jar  tika-parsers-0.8-SNAPSHOT.jar
icu4j-4_2_1.jar                       xercesImpl-2.8.1.jar
jempbox-1.1.0.jar                     xml-apis-1.0.b2.jar
log4j-1.2.14.jar                      xmlbeans-2.3.0.jar
metadata-extractor-2.4.0-beta-1.jar

Dealing with Accented Characters

You may want to treat accented characters the same as their non-accented Latin equivalents, so that searches for Crêtet, crétet and cretet all yield the same results. Upper case is already reduced to lower case during indexing and searching, so Cretet and cretet yield the same results.

In schema.xml, find the solr.LowerCaseFilterFactory filter in the index analyzer of the text field type and add the following line above it:

<filter class="solr.ASCIIFoldingFilterFactory" />

This removes accents before the text is indexed. Repeat this for any other field type used by fields that contain accented characters.

Do the same for the query analyzer so that accents are also removed from the query string.
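
One way to check the folding after reindexing is to run the three spellings as queries and compare the hit counts. The following is a minimal sketch, assuming Solr answers at http://localhost:8080/solr (the default mentioned earlier), that the /select handler returns XML, and that your curl supports -G and --data-urlencode. The function names are hypothetical.

```shell
#!/bin/sh
# Assumed base URL; adjust host and port to your Tomcat instance.
SOLR_URL="${SOLR_URL:-http://localhost:8080/solr}"

# Extract the numFound attribute from a Solr XML response read on stdin.
num_found() {
  sed -n 's/.*numFound="\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Print the hit count for one query term (rows=0: counts only, no documents).
count_hits() {
  curl -s -G "$SOLR_URL/select" \
    --data-urlencode "q=$1" \
    --data-urlencode "rows=0" | num_found
}

# Print term and hit count for the three spellings used in the text.
check_folding() {
  for term in "Crêtet" "crétet" "cretet"; do
    printf '%s\t%s\n' "$term" "$(count_hits "$term")"
  done
}
```

Calling check_folding after a full reindex should print the same count on all three lines if both analyzers fold accents; differing counts suggest that one analyzer is still missing the filter.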

Adding External Sources

You may wish to index some external sources and have these available in the index. The following line in solrconfig.xml specifies a data import handler configured in the file conf/data-config.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

Here is a snippet from data-config.xml. It configures two data sources: “bin” is for reading files to be parsed by Tika, and “oracle” is an external Oracle product database. We select all the products and, for each product, retrieve the associated document metadata, including the document's path on the local disk. We then invoke Tika to extract and index the document's text.

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin" />
  <dataSource type="JdbcDataSource" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@172.16.32.1:1521:schema" user="user" password="pass" name="oracle"/>
 
  <document name="product">
    <entity name="product" transformer="HTMLStripTransformer" query="select * from products" dataSource="oracle">
      <field column="PRODUCT_REFERENCE" name="reference" />
      <field column="ID" name="uid" />
      <field column="OBJECT_TYPE" name="object_type" />
 
      <entity name="documents" query="select * from products_documents where product_id='${product.ID}'" dataSource="oracle">
        <field column="MEDIA_ID" name="documentID" />
        <field column="MEDIA_TITLE" name="documentTitle" />
        <field column="MEDIA_PATH" name="documentPath" />
      </entity>
 
      <entity name="tika" processor="TikaEntityProcessor" url="${documents.MEDIA_PATH}" format="text" dataSource="bin">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
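
Once the handler is configured, an import can be started by hand and watched until it finishes. The following is a minimal sketch, assuming Solr answers at http://localhost:8080/solr and the handler is registered as /dataimport as shown earlier. The function names are hypothetical; full-import and status are DataImportHandler commands, and the status response is assumed to report either busy or idle.

```shell
#!/bin/sh
# Assumed base URL; adjust host and port to your Tomcat instance.
SOLR_URL="${SOLR_URL:-http://localhost:8080/solr}"

# Build a DataImportHandler URL for a given command (full-import, status).
dih_url() {
  printf '%s/dataimport?command=%s' "$SOLR_URL" "$1"
}

# Read a DIH XML response on stdin and print its status (idle or busy).
dih_status() {
  sed -n 's/.*<str name="status">\([a-z]*\)<\/str>.*/\1/p' | head -n 1
}

# Start a full import, then poll until the handler no longer reports busy.
run_full_import() {
  curl -s "$(dih_url full-import)&clean=true&commit=true" > /dev/null || return 1
  while [ "$(curl -s "$(dih_url status)" | dih_status)" = "busy" ]; do
    sleep 5
  done
}
```

Running run_full_import from an interactive shell is a convenient way to test changes to data-config.xml before scheduling the import.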

If the Tika Entity processor encounters an exception it will stop indexing. I had to make two fixes to TikaEntityProcessor to work around this problem.

From the Solr post-1.4 SVN trunk, edit the file:

~/src/solr-svn/trunk/solr/contrib/dataimporthandler/src/extras/main/java/org/apache/solr/handler/dataimport/TikaEntityProcessor.java

First, if a file is not found on disk (a synchronization problem between the database and the file system) we want to continue indexing. At the top of nextRow() add:

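// Needs java.io.File imported at the top of the class.
// Skip this entity if the file referenced in the database is missing on disk.
File f = new File(context.getResolvedEntityAttribute(URL));
if (!f.exists()) {
  return null;
}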

Secondly, if the document parser throws an error (for example, certain PDF revisions can cause the PDFBox parser to fail), we trap the exception and continue:

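try {
  tikaParser.parse(is, contentHandler, metadata, new ParseContext());
} catch (Exception e) {
  // Skip documents the parser cannot handle instead of aborting the import.
  return null;
} finally {
  // Always release the input stream, whether or not parsing succeeded.
  IOUtils.closeQuietly(is);
}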

We also close the input stream with IOUtils.closeQuietly() in the finally block, which the original code does not do. Build the extras jar and deploy it in the solr-instance/lib directory.

Crontab

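Use the cron daemon to periodically reindex external sources. The entry below, in system crontab format with a user field, runs a full import at 30 minutes past every hour. The URL is quoted so that the shell does not interpret the & characters, and the % is escaped because cron treats an unescaped % as a newline:

30 * * * * root curl "http://172.16.32.190:8080/solr/select?qt=\%2Fdataimport&verbose=true&clean=true&commit=true&command=full-import"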
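
An alternative is to keep the URL in a small wrapper script and let cron call the script, which avoids cron's special handling of the % character altogether. The following is a minimal sketch; the script name, its location and the "run" argument are assumptions for illustration, while the host, port and parameters are the ones from the cron line above.

```shell
#!/bin/sh
# Hypothetical wrapper, for example saved as /usr/local/bin/solr-reindex.sh.
# Host, port and parameters are taken from the cron line in the text.
IMPORT_URL='http://172.16.32.190:8080/solr/select?qt=%2Fdataimport&verbose=true&clean=true&commit=true&command=full-import'

reindex() {
  # -s silences progress output; -f makes curl exit non-zero on HTTP errors.
  curl -sf "$IMPORT_URL" > /dev/null
}

# Only act when called with the argument "run", so that the file can
# also be sourced from another script without side effects.
if [ "${1:-}" = "run" ]; then
  reindex || echo "Solr reindex failed: $IMPORT_URL" >&2
fi
```

The crontab entry then becomes 30 * * * * root /usr/local/bin/solr-reindex.sh run, and any error message the script prints is handled by cron like other job output, typically by mailing it to the owner.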

tech/search/solr-and-liferay-integration.txt · Last modified: 2010/06/10 13:28 by davidof