Apache Solr – Search Suggestions

July 26, 2019

Apache Solr offers several ways to help web application users search for terms and phrases. Some work best with single terms, others with phrases. They are generally used to correct spelling, suggest alternative words, suggest a list of search phrases, and suggest search queries. The basic approach is to define multiple request handlers in solrconfig.xml and then choose in the UI which handlers' results to show.

Request

Spellcheck

Shows a list of term suggestions with spelling and word-break corrections, where the suggested terms begin with the characters typed in the query.


  • See Solr In Action
  • Operates on isolated terms, starting at the prefix or beginning of a term in the index. Has the complete query term from which to generate suggestions. Works with or without stopwords. Can pull terms from other fields such as the article text field to add more terms to your spell-checking dictionary.
  • Uses the Levenshtein algorithm to order suggestions by edit distance between the query term and index terms that begin with the characters typed.
  • At query time, the /spell request handler combines DirectSolrSpellChecker and WordBreakSolrSpellChecker.
  • Summary
    • Good for single terms only. Not for phrases or fields such as titles.
    • Spell-checking support can also be built into the /select request handler, because it should be a core feature enabled for all queries by default.
    • If using spellcheck.collate=true, spellcheck must be listed as a last-component, because generating the collation query requires the query component to have already executed.
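
To make the edit-distance ordering concrete, here is a minimal Python sketch. It uses plain Levenshtein distance and a shared-prefix filter; the names suggest, index_terms, min_prefix and max_edits are made up for illustration, and this is not Lucene's implementation (its internal measure may, for example, treat transposed letters differently).

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def suggest(term, dictionary, min_prefix=1, max_edits=2):
    """Return dictionary terms that share a prefix with `term`, closest first."""
    prefix = term[:min_prefix]
    scored = [
        (levenshtein(term, candidate), candidate)
        for candidate in dictionary
        if candidate.startswith(prefix) and candidate != term
    ]
    # Sort by (distance, term) and drop anything beyond the edit budget.
    return [candidate for distance, candidate in sorted(scored) if distance <= max_edits]


# Hypothetical index terms, only for demonstration.
index_terms = ["current", "currant", "curtain", "torrent", "atlantic"]
print(suggest("curent", index_terms))  # ['current', 'currant']
```

"current" is one edit away and "currant" two, so they come back in that order; "curtain" shares the prefix but needs too many edits, and the remaining terms fail the prefix check.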

managed-schema.xml

Add new text_suggest field type

<fieldType class="solr.TextField" name="text_suggest" positionIncrementGap="100">
	<analyzer>
		<tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
		<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
		<filter class="solr.LowerCaseFilterFactory"/>
		<filter class="solr.ASCIIFoldingFilterFactory"/>
		<filter class="solr.EnglishPossessiveFilterFactory"/>
	</analyzer>
</fieldType>
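
To see roughly what this analyzer chain does to a title before it lands in the spell-checking dictionary, here is a Python approximation. It is only a sketch: the regex tokenizer is a crude stand-in for UAX29URLEmailTokenizerFactory (it does not keep URLs or email addresses whole), and STOPWORDS is a hypothetical substitute for the contents of stopwords.txt.

```python
import re
import unicodedata

# Hypothetical stand-in for stopwords.txt
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


def fold_to_ascii(text):
    """Rough ASCII folding: straighten curly apostrophes and strip diacritics."""
    text = text.replace("\u2019", "'")
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    # Tokenizer: runs of word characters, allowing apostrophes inside a word.
    tokens = re.findall(r"\w+(?:['\u2019]\w+)*", text)
    # StopFilterFactory with ignoreCase="true"
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    # LowerCaseFilterFactory
    tokens = [t.lower() for t in tokens]
    # ASCIIFoldingFilterFactory
    tokens = [fold_to_ascii(t) for t in tokens]
    # EnglishPossessiveFilterFactory: drop a trailing 's
    tokens = [t[:-2] if t.endswith("'s") else t for t in tokens]
    return tokens


print(analyze("The North-Atlantic Current's Café"))
# ['north', 'atlantic', 'current', 'cafe']
```

Note that nothing here stems the words, so the dictionary keeps the surface forms users are likely to type.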

Add new suggest field

<field name="suggest" type="text_suggest" indexed="true" stored="false"/>

Copy fields to the new suggest field (title, etc.)

<copyField source="title" dest="suggest"/>

solrconfig.xml

See http://wiki.apache.org/solr/SpellCheckComponent. The spell check component returns a list of alternative spelling suggestions. Multiple “Spell Checkers” can be declared and used by this component to build a composite dictionary:

  • Default – uses indexed words from the “suggest” field and can safely handle stop words, URLs, ASCII folding, lowercasing and English possessives
  • Wordbreak – fixes run-together words like “northatlantic curent”
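
The word-break idea can be sketched in a few lines of Python. This is an illustration only, not how WordBreakSolrSpellChecker is implemented (the real checker also weighs term frequencies and limits the number of changes); break_word, combine_words and known_terms are hypothetical names.

```python
def break_word(term, dictionary):
    """Return every two-way split of `term` whose parts are both known words."""
    return [
        (term[:i], term[i:])
        for i in range(1, len(term))
        if term[:i] in dictionary and term[i:] in dictionary
    ]


def combine_words(first, second, dictionary):
    """Return the joined word if the dictionary knows it, otherwise None."""
    joined = first + second
    return joined if joined in dictionary else None


# Hypothetical indexed terms, only for demonstration.
known_terms = {"north", "atlantic", "current", "northern", "database"}

print(break_word("northatlantic", known_terms))    # [('north', 'atlantic')]
print(combine_words("data", "base", known_terms))  # database
```

Once “northatlantic” is split, the remaining misspelling “curent” is left for the default spellchecker to fix.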

Add searchComponent

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
	<str name="queryAnalyzerFieldType">text_general</str>

	<!-- a spellchecker built from a field of the main index -->
	<lst name="spellchecker">
		<str name="name">default</str>
		<str name="field">suggest</str>
		<str name="classname">solr.DirectSolrSpellChecker</str>
		<!-- the spellcheck distance measure used, the default is the internal levenshtein -->
		<str name="distanceMeasure">internal</str>
		<!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
		<float name="accuracy">0.5</float>
		<!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
		<int name="maxEdits">2</int>
		<!-- the minimum shared prefix when enumerating terms -->
		<int name="minPrefix">1</int>
		<!-- maximum number of inspections per result. -->
		<int name="maxInspections">5</int>
		<!-- minimum length of a query term to be considered for correction -->
		<int name="minQueryLength">4</int>
		<!-- maximum fraction of documents a query term can appear in and still be considered for correction -->
		<float name="maxQueryFrequency">0.01</float>
		<!-- uncomment this to require suggestions to occur in 1% of the documents
		<float name="thresholdTokenFrequency">.01</float>
		-->
	</lst>

	<!-- a spellchecker that can break or combine words.  See "/spell" handler below for usage -->
	<lst name="spellchecker">
		<str name="name">wordbreak</str>
		<str name="classname">solr.WordBreakSolrSpellChecker</str>
		<str name="field">suggest</str>
		<str name="combineWords">true</str>
		<str name="breakWords">true</str>
		<int name="maxChanges">10</int>
	</lst>
</searchComponent>
  • The accuracy parameter is a float between 0 and 1 that sets the minimum similarity a candidate must have to the query term to be returned as a suggestion. Higher values return fewer, closer matches.

Add requestHandler

The component is either associated with a /spell requestHandler activated by a “spellcheck=true” parameter in the query string, or it is added to the /select requestHandler defaults configuration.

<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
	<lst name="defaults">
		<str name="spellcheck">on</str>
		<str name="spellcheck.dictionary">wordbreak</str>
		<str name="spellcheck.dictionary">default</str>
		<str name="spellcheck.extendedResults">false</str>
		<str name="spellcheck.count">5</str>
		<str name="spellcheck.alternativeTermCount">2</str>
		<str name="spellcheck.maxResultsForSuggest">5</str>
		<str name="spellcheck.collate">true</str>
		<str name="spellcheck.collateExtendedResults">true</str>
		<str name="spellcheck.maxCollationTries">5</str>
		<str name="spellcheck.maxCollations">3</str>
	</lst>
	<arr name="last-components">
		<str>spellcheck</str>
	</arr>
</requestHandler>
  • Solr combines suggestions from the ‘default’ and ‘wordbreak’ spellcheckers. List the ‘wordbreak’ dictionary before the ‘default’ dictionary so that words are split before they are analyzed for spelling.
  • Set spellcheck.collate=true so that collations (rewritten queries) include a combination of corrections from both spellcheckers.
  • Configure spellcheck as a last-component in the request handler.
  • Set distanceMeasure to “internal” for the Levenshtein algorithm.
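
For reference, a request to this handler can be assembled as below. This is a sketch with assumptions: the host, port and the core name “articles” are hypothetical, and spell_url is just a helper name. The spellcheck parameters repeat the handler defaults above to show how they would be overridden per request.

```python
from urllib.parse import urlencode


def spell_url(base_url, query):
    """Build a /spell request URL; a list of pairs keeps the repeated parameter in order."""
    params = [
        ("q", query),
        ("spellcheck", "true"),
        ("spellcheck.dictionary", "wordbreak"),  # split or join words first
        ("spellcheck.dictionary", "default"),    # then correct spelling
        ("spellcheck.collate", "true"),
        ("wt", "json"),
    ]
    return f"{base_url}/spell?{urlencode(params)}"


# Hypothetical host and core name.
print(spell_url("http://localhost:8983/solr/articles", "northatlantic curent"))
```

Because /spell already sets these values as defaults, a UI could send only q and still get suggestions; the explicit form is useful when experimenting with different dictionaries.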
Research
  • Solr Spellchecker internals (now with tests!) – 2011, Emmanuel Espina: The Solr spellchecker isn’t much more than a pattern similarity algorithm.
  • Creating a spellchecker with Solr – 2011, Emmanuel Espina
  • Getting started Spell Checking with Apache Lucene and Solr – 2010, Grant Ingersoll: The spell checker does a decent job out of the box, but not great, so you should be prepared to spend some time tuning it. First off, make sure you are doing effective analysis of the source content. See http://wiki.apache.org/solr/SpellCheckComponent#Spell_Checking_Analysis for more info. The primary take away is your spell check field shouldn’t do things like stemming, etc. You may also wish to use a word-based n-gram (called Shingles in Lucene/Solr parlance) so that you can not only give single word suggestions, but also phrase suggestions. Next up, take the time to work with the onlyMorePopular, accuracy, custom comparators, String Distance Measures and other items to get better results. Also consider how you can incorporate log analysis and other user feedback into your spell checker. While Solr doesn’t have anything that directly does the analysis, it can support them through file-based spell checking dictionaries that can include weights.

Term Suggester = /suggest

Terms Component = /terms

Topic Suggester

Suggest Component

Behavior-Driven Suggester





References
