Apache Tika is a toolkit for detecting document types and extracting the metadata and content in a format that can, amongst other things, be fed into the Solr/Lucene search engine combo. Tika support has been included in Solr since version 1.4, released at the end of 2009.
Smiley and Pugh's book, Solr 1.4 Enterprise Search Server, covers using Tika with Solr Cell. However, actually spidering document sets is left up to the end user. The DataImportHandler seems to offer a solution for spidering documents either in a database or on the local file system. However, documentation on integrating it with Tika, despite a wiki page, is relatively sparse.
The first thing to do is download Tika. Use the 0.6 version (there is a bug in version 0.7, fixed in SVN and due to be released with 0.8, which stops it working with Solr). Tika is supplied as source and you will have to build it using the Maven tool, which is relatively painless. On the Tika website click on the Download menu and look in the Archives; you will need this version: http://archive.apache.org/dist/lucene/tika/apache-tika-0.6-src.zip. Unpack it on your local drive.
After setting up Maven (which may be as simple as an apt-get install of the maven2 package on Ubuntu), change to your Tika source root directory and type:
$ mvn install
This will download all the dependencies to your local maven repository, run the unit tests and build the Tika jars. You should end up with the following jars:
tika-core/target/tika-core-0.6.jar
tika-parsers/target/tika-parsers-0.6.jar
Assuming Solr is already installed (see below) copy these to your solr/lib directory, e.g. ./solr-svn/trunk/solr/example/solr/lib/
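As an alternative to copying the jars, solrconfig.xml accepts lib directives that point Solr at a directory of jars. The fragment below is only a sketch: the tika-jars directory is a hypothetical location chosen for illustration, and the dir path is resolved relative to the Solr instance directory.

```xml
<!-- Fragment of solr/conf/solrconfig.xml, placed directly inside the config element. -->
<!-- "../../tika-jars" is a hypothetical directory holding the two jars built above. -->
<lib dir="../../tika-jars" regex="tika-(core|parsers)-0\.6\.jar" />
```

Copying the jars into solr/lib, as described above, remains the simpler route and is the one used in the rest of this post.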
If, like me, you access the Internet via a proxy, you may need to configure this for Maven. This is done in the settings.xml file:
$ vi /etc/maven2/settings.xml
<proxies>
  <proxy>
    <id>optional</id>
    <active>true</active>
    <protocol>http</protocol>
    <host>172.16.32.254</host>
    <port>8080</port>
    <nonProxyHosts>localhost</nonProxyHosts>
  </proxy>
</proxies>
Using its plugin parser architecture, Tika can currently index a wide variety of document formats. If you have a format that is unknown to Tika you can even write your own parser. If you look in the tika-parsers source you can see the document types that Tika can index:
$ ls tika-parsers/src/main/java/org/apache/tika/parser/
asm    epub  image  mbox       mp3  opendocument  pkg  txt    xml
audio  html  jpeg   microsoft  odf  pdf           rtf  video
That includes the common Microsoft Office formats and PDF, not bad for starters.
You can run Tika standalone, which is useful for analyzing the output you will get from various document formats. In the first example we will analyze a document in the solr-svn source directory and look at the metadata (-m option). When we run with Solr we will principally extract the metadata and text for indexing.
$ java -jar tika-app/target/tika-app-0.6.jar -m ../solr-svn/trunk/solr/contrib/extraction/src/test/resources/solr-word.pdf
Author: Grant Ingersoll
Content-Type: application/pdf
Keywords: solr, word, pdf
Last-Modified: Thu Nov 13 14:35:51 CET 2008
created: Thu Nov 13 14:35:51 CET 2008
creator: Microsoft Word
producer: Mac OS X 10.5.5 Quartz PDFContext
resourceName: solr-word.pdf
subject: solr word
title: solr-word
Amongst other fields the metadata includes the author, created date, title and keyword information. This could all be interesting information to index in Lucene.
Here we extract the text (-t option).
$ java -jar tika-app/target/tika-app-0.6.jar -t ../solr-svn/trunk/solr/contrib/extraction/src/test/resources/solr-word.pdf
This is a test of PDF and Word extraction in Solr, it is only a test. Do not panic.
We would most likely just put this into a field on its own in Lucene.
I initially worked with Tika 0.7. Although the document was getting indexed by Solr, none of the metadata or content was getting extracted. After some analysis it appeared Tika was just using the default parser. The Solr TikaEntityProcessor, which is the interface between Tika and Solr, uses the Tika AutoDetectParser by default. I forced it to use the Tika PDF Parser and it indexed the document as expected.
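One way to force a specific parser is the parser attribute on the TikaEntityProcessor entity, which takes a fully qualified parser class name in place of the default AutoDetectParser. The fragment below is a sketch of that approach and not necessarily the exact change made here; the entity, data source and field names mirror the data config shown at the end of this post.

```xml
<!-- Fragment of a DataImportHandler config: the inner entity only. -->
<!-- parser overrides the default org.apache.tika.parser.AutoDetectParser. -->
<entity name="tika-test" processor="TikaEntityProcessor"
        url="${sd.fileAbsolutePath}" format="text" dataSource="bin"
        parser="org.apache.tika.parser.pdf.PDFParser">
  <field column="text" name="all_text"/>
</entity>
```

This is only useful as a diagnostic or for a single-format document set, since every file is then handed to the PDF parser regardless of its actual type.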
After tracing the code through the Tika AutoDetectParser ← CompositeParser ← Parser chain, I found that the TikaConfig method getParsers() was returning an empty list. All the parsers were present on the Solr classpath. TikaEntityProcessor uses a default TikaConfig, which has the following constructor:
public TikaConfig() throws MimeTypeException, IOException {
    ParseContext context = new ParseContext();
    Iterator<Parser> iterator = ServiceRegistry.lookupProviders(Parser.class);
    while (iterator.hasNext()) {
        Parser parser = iterator.next();
        for (MediaType type : parser.getSupportedTypes(context)) {
            parsers.put(type.toString(), parser);
        }
    }
    mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
}
Solr runs as a web application, and web applications usually have more than one class loader. The call to ServiceRegistry.lookupProviders() should get a list of Parser classes from the jar's META-INF/services/org.apache.tika.parser.Parser file; however, it seems that the parsers jar is not accessible from the context of the TikaEntityProcessor. The problem is documented in SOLR-1920 and TIKA-419.
The solution is to use either the SVN branch of Tika, which has a fix, or the 0.6 release of Tika. The main changes between 0.6 and 0.7 concern MP3 processing, which we are not using, and a change to the PDF extraction library for better text extraction. We can probably live without this in the short term.
First of all we want to use the DataImportHandler in the following configuration:
FileListEntityProcessor → BinFileDataSource → TikaEntityProcessor
Just one problem: the BinFileDataSource was introduced after the last release, Solr 1.4. So we have to work with a snapshot from the svn trunk.
Assuming svn is installed we can check out the Solr source using the following command:
$ svn co http://svn.apache.org/repos/asf/lucene/dev/ solr-svn
Again, if you have a proxy, configure this in /etc/subversion/servers:
[global]
http-proxy-host=172.16.32.79
http-proxy-port=8080
Solr uses Ant to build, so install the latest version. Go to the Solr home directory and type:
$ cd solr-svn/trunk/solr
$ ant
$ ant
$ ant
We will work with the example directory.
In solr/conf/solrconfig.xml add a requestHandler for data importing:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">tika-data-config.xml</str>
  </lst>
</requestHandler>
Then create the tika-data-config.xml file that the handler refers to, alongside solrconfig.xml in the conf directory:
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin" />
  <document>
    <entity name="sd" processor="FileListEntityProcessor" newerThan="'NOW-30DAYS'"
            fileName=".*pdf$" baseDir="../site" recursive="true" rootEntity="false"
            transformer="DateFormatTransformer" >
      <entity name="tika-test" processor="TikaEntityProcessor"
              url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="Author" name="author" meta="true"/>
        <field column="Content-Type" name="title" meta="true"/>
        <!-- field column="title" name="title" meta="true"/ -->
        <field column="text" name="all_text"/>
      </entity>
      <!-- field column="fileLastModified" name="date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" / -->
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>
  </document>
</dataConfig>
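The name attributes in the data config refer to Solr schema fields, so those fields need to be defined in schema.xml (or match a dynamic field) for the values to be indexed. The fragment below is a sketch of what the definitions might look like. The type names and the indexed/stored choices are assumptions that should be adapted to your own schema, and author and title are left out on the assumption that the example schema already defines them; check for existing definitions before adding any of these.

```xml
<!-- Fragment of solr/conf/schema.xml, placed inside the fields element. -->
<!-- The type names (text, long, string) are assumed to exist in the types section of your schema. -->
<field name="all_text" type="text"   indexed="true" stored="false"/>
<field name="size"     type="long"   indexed="true" stored="true"/>
<field name="filename" type="string" indexed="true" stored="true"/>
```

Leaving all_text unstored keeps the index smaller when the extracted body is only needed for searching; set stored="true" if you want to display or highlight it.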