By default Liferay comes with Lucene as its search engine. Liferay uses Lucene for full-text indexing of its content, including events and web content.
Replacing the default Liferay Lucene indexer with Solr brings a number of benefits. You get all the Solr configuration goodness, including stop words, localization and synonyms, as well as better support in clustered environments. Solr can run outside of the Liferay web container, even on a separate host.
This guide outlines installing Solr to run under the Liferay Tomcat instance, but with a few obvious changes Solr could be run under its own web application server such as Jetty.
Solr Installation. Download Solr and configure a solr-instance directory under, say, /opt (on Linux). The install directory should have the following layout:
solr-instance/
  lib/
  conf/
    solrconfig.xml
    data-config.xml
    schema.xml
  contrib/
  data/
    index/
    spellchecker/
  solr.war
  solr-web.war
In conf/solrconfig.xml make sure the contrib/extraction/lib directory is in the Solr classpath. Add other directories as appropriate.
<lib dir="/opt/solr-instance/contrib/extraction/lib" />
Add the Solr instance directory to the Tomcat environment. Edit catalina.sh:
$ vi bin/catalina.sh
and add the following option to JAVA_OPTS:
-Dsolr.solr.home=/opt/solr-instance/data
Create a Tomcat context that points to the solr.war file in /opt/solr-instance:
conf/Catalina/localhost/solr.xml
<?xml version="1.0" encoding="utf-8"?>
<Context docBase="/opt/solr-instance/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/opt/solr-instance" override="true"/>
</Context>
This will deploy Solr in the Tomcat webapps directory on startup.
Add the Solr plugin. Currently this is in version 6.0.1.1. Download it to your solr-instance directory and rename to solr-web.war:
$ wget http://liferay.cignex.com/palm_tree/book/0387/chapter12/solr-web-6.0.1.1.war
$ mv solr-web-6.0.1.1.war solr-web.war
Add the following context to Tomcat in conf/Catalina/localhost/solr-web.xml:
<?xml version="1.0" encoding="utf-8"?>
<Context docBase="/opt/solr-instance/solr-web.war" debug="0" crossContext="true">
</Context>
This will deploy the plugin on Tomcat startup.
The XML file solr-spring.xml in the folder solr-web/WEB-INF/classes/META-INF describes how the plugin integrates Solr into the portal. You will need to tell it where to find the Solr instance (by default on localhost, port 8080):
<bean id="solrServer" class="com.liferay.portal.search.solr.server.BasicAuthSolrServer">
  <constructor-arg type="java.lang.String" value="http://${solr.host.domain}:${solr.port.number}/solr" />
</bean>
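To make the placeholder syntax concrete, here is a minimal Java sketch of how a template such as the constructor argument above expands once the two properties are set. It is only an illustration: in the plugin the substitution is handled by Spring, the class name PlaceholderSketch is hypothetical, and the values localhost and 8080 are simply the defaults mentioned above.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderSketch {

    // Matches ${some.property.name}
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${key} with its value from props; unknown keys are left untouched.
    static String resolve(String template, Properties props) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.getProperty(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("solr.host.domain", "localhost"); // assumed default host
        props.setProperty("solr.port.number", "8080");      // assumed default port
        System.out.println(resolve("http://${solr.host.domain}:${solr.port.number}/solr", props));
    }
}
```

If Solr runs on a separate host, only the two property values change; the bean definition itself stays the same.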
The XML file also describes the index searcher, index writer, search engine, and so on via a set of beans:
<bean id="indexSearcher.solr" class="com.liferay.portal.search.solr.SolrIndexSearcherImpl">
  <property name="solrServer" ref="solrServer" />
</bean>
These let Liferay add/update and delete content in Solr.
The schema.xml under the folder solr-web/WEB-INF/conf describes how fields are added to the Solr index. This should form the basis of your schema even if you add other data sources. The schema describes types, fields, the default search field, and the operation of the Solr query parser.
The following jars are included in solr.war under WEB-INF/lib and so are available to the application:
apache-solr-noggit-r944541.jar commons-codec-1.3.jar commons-csv-1.0-SNAPSHOT-r609327.jar commons-fileupload-1.2.1.jar commons-httpclient-3.1.jar commons-io-1.4.jar geronimo-stax-api_1.0_spec-1.0.1.jar google-collect-1.0.jar jcl-over-slf4j-1.5.5.jar slf4j-api-1.5.5.jar slf4j-jdk14-1.5.5.jar wstx-asl-3.2.7.jar lucene-analyzers-common-4.0-dev.jar lucene-core-4.0-dev.jar lucene-highlighter-4.0-dev.jar lucene-memory-4.0-dev.jar lucene-misc-4.0-dev.jar lucene-queries-4.0-dev.jar lucene-spatial-4.0-dev.jar lucene-spellchecker-4.0-dev.jar apache-solr-core-4.0-dev.jar apache-solr-solrj-4.0-dev.jar apache-solr-dataimporthandler-4.0-dev.jar
If you want to index documents using Tika, add the following additional jars to solr-instance/lib, which is included in the Solr classpath by default. The ojdbc14 jar is included for an Oracle data source.
apache-solr-dataimporthandler-extras-4.0-dev.jar tika-core-0.6.jar tika-parsers-0.6.jar ojdbc14.jar
You will also need some of the jars in contrib/extraction/lib (specify the location in solrconfig.xml):
asm-3.1.jar pdfbox-1.1.0.jar bcmail-jdk15-1.45.jar poi-3.6.jar bcprov-jdk15-1.45.jar poi-ooxml-3.6.jar commons-compress-1.0.jar poi-ooxml-schemas-3.6.jar commons-logging-1.1.1.jar poi-scratchpad-3.6.jar dom4j-1.6.1.jar tagsoup-1.2.jar fontbox-1.1.0.jar tika-core-0.8-SNAPSHOT.jar geronimo-stax-api_1.0_spec-1.0.1.jar tika-parsers-0.8-SNAPSHOT.jar icu4j-4_2_1.jar xercesImpl-2.8.1.jar jempbox-1.1.0.jar xml-apis-1.0.b2.jar log4j-1.2.14.jar xmlbeans-2.3.0.jar metadata-extractor-2.4.0-beta-1.jar
You may want to treat accented characters the same as their non-accented Latin equivalents, so that searches for Crêtet, crétet and cretet all yield the same results. We already reduce upper case to lower case during indexing and searching, so Cretet and cretet yield the same thing.
In schema.xml, find the solr.LowerCaseFilterFactory filter in the text field type and add:
<filter class="solr.ASCIIFoldingFilterFactory" />
above that line. This will remove accents before indexing the text. We need to do this for any other field type where there may be accented characters.
Do the same for the Query analyzer so that we remove accents from the query string.
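To see why folding plus lower-casing makes the three spellings match, here is a small stand-alone Java sketch. It only approximates the effect of the filter chain using java.text.Normalizer; it is not Solr's ASCIIFoldingFilter, which also handles characters that have no combining-mark decomposition. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

public class AccentFoldingSketch {

    // Decomposes accented characters, drops the combining marks, then lower-cases.
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Cr\u00eatet" is Crêtet, "cr\u00e9tet" is crétet
        String[] terms = {"Cr\u00eatet", "cr\u00e9tet", "cretet"};
        for (String term : terms) {
            System.out.println(term + " -> " + fold(term));
        }
    }
}
```

All three inputs reduce to the same term, cretet. This is why the filter has to be present in both the index and the query analyzer: if only one side is folded, the indexed term and the query term no longer agree.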
You may wish to index some external sources and have these available in the index. The following entry in solrconfig.xml specifies a data import handler configured in the file conf/data-config.xml:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
Here is a snippet from data-config.xml. It configures two data sources: "bin" is for importing files to be indexed with Tika, and "oracle" is an external Oracle product database. We select all the products and, for each product, retrieve the associated document meta information, including the document's path on the local hard disk. We then invoke Tika to index this document.
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin" />
  <dataSource type="JdbcDataSource" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@172.16.32.1:1521:schema" user="user" password="pass" name="oracle"/>
  <document name="product">
    <entity name="product" transformer="HTMLStripTransformer" query="select * from products" dataSource="oracle">
      <field column="PRODUCT_REFERENCE" name="reference" />
      <field column="ID" name="uid" />
      <field column="OBJECT_TYPE" name="object_type" />
      <entity name="documents" query="select * from products_documents where product_id='${product.ID}'" dataSource="oracle">
        <field column="MEDIA_ID" name="documentID" />
        <field column="MEDIA_TITLE" name="documentTitle" />
        <field column="MEDIA_PATH" name="documentPath" />
      </entity>
      <entity name="tika" processor="TikaEntityProcessor" url="${documents.MEDIA_PATH}" format="text" dataSource="bin">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
If the Tika Entity processor encounters an exception it will stop indexing. I had to make two fixes to TikaEntityProcessor to work around this problem.
In the Solr post-1.4 SVN trunk, edit the file:
~/src/solr-svn/trunk/solr/contrib/dataimporthandler/src/extras/main/java/org/apache/solr/handler/dataimport/TikaEntityProcessor.java
First of all, if a file is not found on the disk (a sync problem between the database and the filesystem), we want to continue indexing. At the top of nextRow() add:
// Skip documents whose file is missing on disk (needs java.io.File imported)
File f = new File(context.getResolvedEntityAttribute(URL));
if (!f.exists()) {
    return null;
}
Secondly, if the document parser throws an error (for example, certain PDF revisions can cause the PDFBox parser to barf), we will trap the exception and continue:
try {
    tikaParser.parse(is, contentHandler, metadata, new ParseContext());
} catch (Exception e) {
    return null;
} finally {
    IOUtils.closeQuietly(is);
}
We also close the input stream with IOUtils.closeQuietly() in the finally block, which is not done in the original code. Build the extras jar and deploy it in the solr-instance/lib directory.
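The control flow of the two fixes can be tried out in isolation with the following self-contained Java sketch. It is not the Solr or Tika code: TextExtractor is a hypothetical stand-in for the Tika parser call, and nextRow here is a plain static method that returns null for any document that should be skipped.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

public class TolerantExtractionSketch {

    /** Hypothetical stand-in for the parser call; not part of the Tika API. */
    interface TextExtractor {
        String extract(InputStream in) throws Exception;
    }

    // Returns a row holding the extracted text, or null if the document must be skipped.
    static Map<String, Object> nextRow(String path, TextExtractor extractor) {
        File f = new File(path);
        if (!f.exists()) {
            return null; // fix 1: file missing on disk, skip it
        }
        // try-with-resources closes the stream on success and on failure
        try (InputStream is = new FileInputStream(f)) {
            Map<String, Object> row = new HashMap<>();
            row.put("text", extractor.extract(is));
            return row;
        } catch (Exception e) {
            return null; // fix 2: parser failed, skip this document
        }
    }

    public static void main(String[] args) {
        // A path that does not exist yields null instead of an exception.
        System.out.println(nextRow("/no/such/file.pdf", in -> "unused"));
    }
}
```

Swallowing every exception keeps the import running, at the cost of silently dropping documents; logging the path before returning null would make the skipped files easier to track down.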
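Use the cron daemon to periodically reindex external sources. The following system crontab entry (note the user field) runs a full import at 30 minutes past every hour. The URL is quoted so that the shell does not interpret the & characters, and the % is escaped because cron treats an unescaped % as a newline:
30 * * * * root curl "http://172.16.32.190:8080/solr/select?qt=\%2Fdataimport&verbose=true&clean=true&commit=true&command=full-import"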
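The request URL is easier to read when assembled from its parameters. The Java sketch below builds the same full-import URL from a base URL and shows where the %2F comes from: it is simply the URL-encoded slash in the handler name /dataimport. The class and method names are hypothetical, and the sketch only builds the string; it does not contact Solr.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FullImportUrlSketch {

    // Builds the full-import request URL for a Solr base URL such as http://host:8080/solr
    static String fullImportUrl(String solrBaseUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("qt", "/dataimport");       // request handler name from solrconfig.xml
        params.put("verbose", "true");
        params.put("clean", "true");           // remove existing documents first
        params.put("commit", "true");          // commit when the import finishes
        params.put("command", "full-import");

        StringBuilder url = new StringBuilder(solrBaseUrl).append("/select?");
        boolean first = true;
        for (Map.Entry<String, String> p : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(encode(p.getKey())).append('=').append(encode(p.getValue()));
            first = false;
        }
        return url.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(fullImportUrl("http://172.16.32.190:8080/solr"));
    }
}
```

Note that clean=true empties the index before importing, so a full import of this kind is only appropriate when everything in that index can be rebuilt from the configured sources.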