
Indexing Special Terms Using Solr

Problem

If you use Solr for any technical corpus, you will soon need to know how to index special terms. Solr supplies some really convenient field types in its default schema.xml.  If you have used text_general to index documents containing special terms (words), you have probably been frustrated by hyphenated terms going missing.  Special terms such as the names of computer skills (for example, C++ or C#) are not indexed correctly either, because the tokenizer strips their punctuation.

Requirement

Index documents such as resumes and job descriptions, which contain terms that include punctuation.  Preserve the punctuation on the terms that need it, but remove similar punctuation elsewhere so that the other terms are also indexed properly.

Solution

Before we look at the solution, let’s consider how Solr’s fieldType element works.  As you would expect, I am going to point you to the official Solr Wiki page on Analyzers, Tokenizers and TokenFilters. Although you can put the charFilter, tokenizer and filter elements inside an analyzer in any order you choose, Solr will always process them in this order:

  1. charFilter
  2. tokenizer
  3. filter
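
To make the processing order concrete, here is a minimal sketch of a field type written in the same order Solr applies it. It is only an illustration, not the solution this post builds toward: the field type name text_special_terms and the punctuation pattern are assumptions made up for this example, while the factory classes are standard ones that ship with Solr.

```xml
<!-- Hypothetical field type; the name and the regex are illustrative assumptions. -->
<fieldType name="text_special_terms" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- 1. charFilter: runs first, on the raw character stream.
         Here it replaces some punctuation with a space before any tokens exist. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[,;:!?()]" replacement=" "/>

    <!-- 2. tokenizer: runs second and splits the filtered text into tokens.
         Splitting on whitespace only leaves the remaining punctuation attached to each term. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>

    <!-- 3. filter: runs last, on each token, in the order the filters are listed. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that while the three kinds of element are applied in a fixed sequence, multiple filter elements are applied in the order you list them, so their relative order still matters.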
