TikaEntityProcessor dissected

(from Tika 0.7)

The TikaEntityProcessor class is the interface between Solr and the Tika document parser. Its job is to take a document and output a row of data consisting of the document's metadata and text. In this respect it functions much like a database import handler. It has two principal methods: an initialization routine and a document processing routine.

The initialization routine, firstInit(), creates a TikaConfig. By default this uses the default Tika configuration. The list of available parsers is obtained by a call to ServiceRegistry.lookupProviders(Parser.class), which reads the parser class names from the META-INF/services/org.apache.tika.parser.Parser file in the parsers jar. An alternative configuration can be specified with the tikaConfig attribute of the TikaEntityProcessor entity. An output format can also be specified with the format attribute; the choices are xml, html, text and none. Normally we will work with text. An alternative parser can be given using the parser attribute.
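To make the service lookup less abstract, here is a minimal sketch of how a file in the META-INF/services format can be read: one fully qualified class name per line, with '#' starting a comment. ServicesFileSketch and parseProviderNames are hypothetical names, not part of Tika or Solr, and the sample contents in main are illustrative only rather than a copy of the real file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or Solr: extracts provider class
// names from text written in the META-INF/services file format.
class ServicesFileSketch {

    static List<String> parseProviderNames(String fileContents) {
        List<String> names = new ArrayList<String>();
        for (String line : fileContents.split("\\r?\\n")) {
            int hash = line.indexOf('#');          // '#' starts a comment
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            // One class name per line; blank lines and duplicates are skipped
            if (!line.isEmpty() && !names.contains(line)) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Illustrative contents only, not the actual file shipped with Tika
        String sample =
              "# parsers available to the service lookup\n"
            + "org.apache.tika.parser.pdf.PDFParser\n"
            + "org.apache.tika.parser.html.HtmlParser  # trailing comment\n"
            + "\n";
        for (String name : parseProviderNames(sample)) {
            System.out.println(name);
        }
    }
}
```

Each name found this way is a class that the service lookup can instantiate, which is why a new parser only becomes visible once its class name is listed in that file.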

Most of the time the default AutoDetectParser is the correct parser to use. It uses the document content and file extension to determine the type of document (PDF, Word etc.) and then invokes the matching parser. It relies on a detector to achieve this: the one returned by the TikaConfig's getMimeRepository method. The detector is configured by the tika-mimetypes.xml file, which is included in the parsers jar and loaded from the classpath by TikaConfig. If new parsers are to be used, the document MIME type must be configured in this file and the parsers must be registered in the META-INF/services file.
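The detect-then-dispatch idea can be modelled in a few lines of plain Java. This is a deliberately simplified sketch and not Tika's detector: the real rules come from tika-mimetypes.xml, whereas here a single magic-byte check and two extension checks are hard-coded. DetectSketch, detect and parserFor are hypothetical names, and the parser labels in the registry are placeholder strings.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Simplified model of "detect the type, then pick a parser".
// Hypothetical names; the real rules live in tika-mimetypes.xml.
class DetectSketch {

    // Step 1: guess a MIME type from the leading bytes, then from the file name.
    static String detect(byte[] head, String fileName) {
        if (startsWith(head, "%PDF-")) {           // PDF files begin with this marker
            return "application/pdf";
        }
        String lower = fileName == null ? "" : fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".html") || lower.endsWith(".htm")) {
            return "text/html";
        }
        if (lower.endsWith(".txt")) {
            return "text/plain";
        }
        return "application/octet-stream";         // unknown type
    }

    static boolean startsWith(byte[] data, String asciiPrefix) {
        byte[] prefix = asciiPrefix.getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Step 2: look up the parser registered for the detected type.
    static String parserFor(String mimeType, Map<String, String> registry) {
        String parser = registry.get(mimeType);
        return parser != null ? parser : "(no parser registered)";
    }

    public static void main(String[] args) {
        Map<String, String> registry = new HashMap<String, String>();
        registry.put("application/pdf", "pdf-parser");   // placeholder labels
        registry.put("text/html", "html-parser");

        byte[] head = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        String type = detect(head, "report.pdf");
        System.out.println(type + " -> " + parserFor(type, registry));
    }
}
```

The two steps mirror the two pieces of configuration mentioned above: a type that is never detected cannot reach its parser, and a detected type with no registered parser has nothing to dispatch to.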

For each document encountered, the nextRow() method is called. It returns a Map of column names to column data. It opens the URL passed in the url attribute (for example from a FileListEntityProcessor) using the data source type configured in the data import handler config XML file:

<dataSource type="BinFileDataSource" name="bin" />
<document>
  <entity name="sd"
          processor="FileListEntityProcessor"
          ...
          >
    <entity name="tika-test" processor="TikaEntityProcessor" url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
      ...
    </entity>
  </entity>
</document>

A content handler matching the format attribute is created to capture the document contents. The metadata and contents are then parsed in a single call to the Tika API. Afterwards the method loops over the entity's fields and adds each requested metadata column that Tika returned to the Map. If the format is not none, a special column “text” containing the document content is also added (if the format is xml or html the content will be wrapped in the appropriate elements). This column will normally be indexed in a single Lucene field representing the document contents.

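// Buffer that collects the extracted document text
StringWriter sw = new StringWriter();
...
// Handler used for format="text"; it writes into sw
contentHandler = getTextContentHandler(sw);
...
// One call extracts both the content (via the handler) and the metadata
tikaParser.parse(is, contentHandler, metadata, new ParseContext());

// Copy only the metadata columns requested with meta="true"
for (Map<String, String> field : context.getAllEntityFields()) {
  if (!"true".equals(field.get("meta")))
    continue;
  String col = field.get(COLUMN);
  String s = metadata.get(col);
  if (s != null) row.put(col, s);
}

// Unless format="none", add the document content as the "text" column
if (!"none".equals(format)) row.put("text", sw.toString());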
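The excerpt above depends on Solr and Tika classes, so here is a self-contained stand-in for the row-building step that can be run on its own. It is a sketch under stated assumptions: plain Maps replace Tika's Metadata object and the DIH context, the literal "column" stands in for the COLUMN constant, and NextRowSketch, buildRow and field are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the row-building part of nextRow(); hypothetical names,
// plain Maps replace the Tika Metadata object and the DIH context.
class NextRowSketch {

    static Map<String, Object> buildRow(List<Map<String, String>> entityFields,
                                        Map<String, String> metadata,
                                        String format,
                                        String extractedText) {
        Map<String, Object> row = new LinkedHashMap<String, Object>();
        for (Map<String, String> field : entityFields) {
            if (!"true".equals(field.get("meta"))) {
                continue;                          // not a metadata column
            }
            String col = field.get("column");      // assumed value of COLUMN
            String value = metadata.get(col);
            if (value != null) {
                row.put(col, value);               // keyed by the Tika column name
            }
        }
        if (!"none".equals(format)) {
            row.put("text", extractedText);        // the special content column
        }
        return row;
    }

    // Builds one <field .../> definition as a Map of its attributes.
    static Map<String, String> field(String column, String meta) {
        Map<String, String> f = new HashMap<String, String>();
        f.put("column", column);
        if (meta != null) {
            f.put("meta", meta);
        }
        return f;
    }

    public static void main(String[] args) {
        List<Map<String, String>> fields = new ArrayList<Map<String, String>>();
        fields.add(field("Author", "true"));
        fields.add(field("title", "true"));
        fields.add(field("text", null));

        Map<String, String> metadata = new HashMap<String, String>();
        metadata.put("Author", "Jane Doe");        // made-up sample value; no title present

        System.out.println(buildRow(fields, metadata, "text", "Hello world"));
    }
}
```

Note that a requested metadata column which the parser did not supply (title in this run) is simply absent from the row rather than present with an empty value.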

Metadata columns are requested in the TikaEntityProcessor XML configuration by setting meta="true" on a field. Each field also maps the Tika column name onto the index field name configured in the Solr schema:

<entity name="tika-test" processor="TikaEntityProcessor" ...>
  <field column="Author" name="author" meta="true"/>
  <field column="title" name="title" meta="true"/>
  <field column="text" name="all_text"/>
</entity>
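The renaming from column to index field can be pictured as one more pass over the row. The following is a rough model, assuming the rename is driven only by the column and name attributes of each field. FieldMappingSketch, toIndexFields and field are hypothetical names, and the sketch simply skips columns that have no field entry, whereas the real data import handler has further rules.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Rough model of mapping row columns onto index field names.
// Hypothetical names; not the data import handler's own code.
class FieldMappingSketch {

    static Map<String, Object> toIndexFields(Map<String, Object> row,
                                             List<Map<String, String>> entityFields) {
        Map<String, Object> doc = new LinkedHashMap<String, Object>();
        for (Map<String, String> field : entityFields) {
            String column = field.get("column");
            String name = field.get("name");
            Object value = row.get(column);
            if (value == null) {
                continue;                          // column missing from this row
            }
            // Fall back to the column name when no name attribute is given
            doc.put(name != null ? name : column, value);
        }
        return doc;
    }

    // Builds one <field column="..." name="..."/> definition as a Map.
    static Map<String, String> field(String column, String name) {
        Map<String, String> f = new HashMap<String, String>();
        f.put("column", column);
        if (name != null) {
            f.put("name", name);
        }
        return f;
    }

    public static void main(String[] args) {
        List<Map<String, String>> fields = new ArrayList<Map<String, String>>();
        fields.add(field("Author", "author"));
        fields.add(field("title", "title"));
        fields.add(field("text", "all_text"));

        Map<String, Object> row = new LinkedHashMap<String, Object>();
        row.put("Author", "Jane Doe");             // made-up sample values
        row.put("text", "Hello world");

        System.out.println(toIndexFields(row, fields));
    }
}
```

With the configuration shown above, the Tika column Author would end up in the index field author and the extracted content in all_text.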

Voila. Hopefully that gives you a better idea of what is going on “under the covers” and will be useful for understanding the output during debugging.

tech/search/tikaentityprocessor-dissected.txt · Last modified: 2010/05/28 14:21 by davidof