Complete graph query support in Lucene: candidate implementation

As described in the previous post, enhancements to various Lucene components have increasingly invalidated some assumptions in the SpanNearQuery graph query implementation, leading to buggy and/or unpredictable query behavior. This post presents a high-level overview of the approach used to implement a candidate fix for this bug. The fixes/enhancements described have been running in production for several months as the backend to the University of Pennsylvania Libraries’ main catalog interface.

Lazy matching will not work

As Christoph Goller correctly points out, lazy iteration (moving subSpans forward only, without the ability to backtrack) is incapable of guaranteeing accuracy.

The starting point for this bugfix candidate was to wrap each subSpans in a backtracking-capable Spans implementation. In place of the int nextStartPosition() method, callers interact with the wrapper through an interface int nextMatch(int hardMinStart, int softMinStart, int startCeiling, int minEnd), which abstracts all the information necessary to determine when positions must be stored for possible future backtracking, and when they may be discarded:

hardMinStart: forever discard spans with start < this parameter
softMinStart: skip (for purpose of returning nextMatch, but do not discard) spans with start < this parameter
startCeiling: disregard (for purpose of returning nextMatch, but do not discard) spans with start >= this parameter
minEnd: when non-negative, defines a minimum threshold for span endPositions. Spans with endPosition < this value should be discarded forever.

To provide effcient navigation and resetting of stored spans (for backtracking) the nextMatch() wrapper is backed by a PositionDeque – storage consisting of a sorted linked deque, implemented as a circular buffer over a dynamically resizable backing array (which, being sorted, supports O(log n) random-access seek; the deque is sorted as a natural consequence of being populated serially by input from backing Spans iteration that conforms to the specified ordering for Spans iteration). Support for mid-deque Node removal can lead the deque to become sparse; so to enable efficient iteration over the deque, nodes in each deque are also doubly-linked (LinkedArrayList uses a similar approach).

Build matches across adjacent phrase positions

Storage nodes in the PositionDeque (referred to hereafter as Nodes) serve a dual purpose. They store information about stored/buffered positions and previous/subsequent Nodes for a given backing Spans at their associated PositionDeque’s phrase index, but they also store references to “reachable” (within slop constraints) Nodes in the PositionDeque of previous and subsequent phrase indexes, and to “reachable” Nodes in the PositionDeque of the last phrase index.

These references are used to build a “lattice” of Nodes consisting of (at any point in time): an ordered list of valid Nodes at a given phrase position, a directed acyclic graph of valid paths from phrase start to phrase end, and a directed acyclic graph of valid paths from phrase end to phrase start.

The lattice is built by using the nextMatch() wrapper to drive a depth-first search to discover (and build/cache) edges in the dynamic graph represented by valid Nodes at a point in time. The graph is “dynamic” in the sense that subsequent calls to nextMatch() can invalidate extant Nodes, requiring special handling (e.g., removing or updating match-graph edges).

Managing GC inherent in a heavily-linked approach

To form links/edges between Nodes, this approach requires many simple linking nodes (similar to those required to “link” a linked list). To avoid excessive GC that might accompany such a scale of object creation, Nodes are pooled and reused per-PositionDeque (and thus, per phrase index); simple linking nodes are also pooled per-PositionDeque, and are returned to the pool upon release/reuse of their currently associated Node. On a moderate-sized production index, experience has shown that this object pooling approach leads to consistent performance. With object pooling disabled over the same index, hundreds of millions of objects were being created per request, and query response time was anywhere from 2x to 10x worse than with object pooling enabled. (The highly variable response time is in keeping with the intermittent nature of long GC-related pauses).

New types of information in the Lucene index

This development exposed new possiblities that make certain additional indexed information useful at query-time. As proof-of-concept all have been implemented using Payloads.

We can do something useful with `positionLength` now!

Currently, absent a query implementation that can do something useful with positionLength, Lucene simply ignores the PositionLengthAttribute at index time and implicitly assumes a positionLength of 1 for all index terms at query time. (See brief discussion under LUCENE-4312)

Now that we have a query implementation that respects positionLength, we can take the advice in the second comment on LUCENE-4312, and write it to the Payload. This is done using the PositionLengthOrderTokenFilter, which in addition to simply storing the positionLength in the Payload, must reorder tokens to conform to the ordering prescribed by the Spans specification. (Recall that there would previously have been no point to such reordering of tokens, since PositionLengthAttribute was ignored anyway at index time). In addition to VInt-encoded positionLength, PositionLengthOrderTokenFilter records a “magic” prefix in the Payload to prevent confusion among fields that use Payload differently. (The proof-of-concept implementation of indexed positionLength, etc., thus cannot be leveraged in conjunction with other uses of Payloads, but existing uses of Payload not in conjunction with PositionLengthOrderTokenFilter is supported).

Index lookahead (don’t buffer positions if you don’t have to!)

In order to perform exhaustive matching, we must progress through all possible valid matches for a given subSpan, buffering as we go. But with atomic TermSpans in stock Lucene (and consequently for higher-level Spans – e.g., Or, Near, etc.), lazy iteration means that the only way to know whether we’ve seen all relevant positions is to iterate past all relevant positions, to the first irrelevant position. This has the nasty consequence that we are always forced to buffer every relevant position.

To prevent unnecessary buffering, as a proof-of-concept we record per-term startPosition-lookahead directly in the index, leveraging Payloads. For ease of adoption and backward compatibility, this is triggered by the presence of a “magic” value that may be placed in the Payload by the PositionLengthOrderTokenFilter.

Spans may now determine at query-time a floor for the value that would be returned by the next call to nextStartPosition(), and prevent progressing beyond the last relevant match. Each Node thus begins its life as a “provisional” Node, which initially represents an unbuffered layer over the backing Spans. As a result, in many common cases (e.g., no-slop phrase searching of simple terms), buffering of position data is often entirely avoided.

subSpans positionLength edge cases, new matches exposed

In some cases, a new match possibility may be exposed by a decrease in the endPosition of a subSpans, or by a decrease in slop caused by an increase in the match length of a subSpans. In order to determine when it’s worth checking for these cases at the atomic TermSpans level, it would be helpful to know the following at query time for a given Term in a given Doc:

Does the endPosition ever decrease? (this could happen in the event of increased startPosition and decreased positionLength)
What is the maximum positionLength for an instance of this Term in this Doc?

To make this information available at query time, we will at index time allocate 4 bytes in the Payload of the first instance of a Term in each Doc. When the doc is finished processing, as part of flushing the doc, the maximum positionLength for each term is written to the allocated space; endPositions are tracked, and if we have seen a decrease in endPosition for a term in this doc, the value written will be (-maxPositionLength) - 1. This information may be parsed and exposed by the atomic TermSpans implementation at query time.

shingle-based pre-filtering of `ConjunctionSpans`

For accuracy and consistency, we would like to use true graph queries for all phrase searching. Keeping in mind that DismaxQueryParser boosts scores by running phrase searches over the pf fields for all user queries (not only for explicit phrase queries), it would be desirable for graph SpanNearQuery performance to be on par with performance of the extant default PhraseQuery implementation.

This is particularly challenging for queries consisting mainly of common terms, especially because stop words are incompatible with SpanNearQuery without significant sacrifices in functionality (see documentation for ComplexPhraseQParser).

CommonGramsFilter provides a possible solution, but only by sacrificing a considerable amount of flexibility (consider that with stopword the, we might reasonably want "the caterpillar"~2 to match the very hungry caterpillar).

To address this challenge, we have implemented an index-time ShingleFilterUpdateRequestProcessor, which behaves similar to the CommonGramsFilter, but instead of emitting shingle tokens inline on the source token stream, it buffers shingles (of configurable slop limit for a configurable list of stopwords) as it evaluates the entire source token stream, then writes these unique shingles to a separate field that indexes term frequency, but not positions. We then hijack the destination field’s TermFrequency attribute (which, after all, is just a positive integer) to record (for each unique shingle/token) the minimum slop (between shingle components) <= the configured slop limit.

At query time, a corresponding filter on the ExtendedDismaxQParser examines the parsed graph query for SpanNearQuerys containing shingled stopwords, and constructs TwoPhaseIterators for the shingles. These TwoPhaseIterators are passed as pre-filters (respecting configured phrase slop) into the ConjunctionSpans approximation for the associated SpanNearQuery, significantly reducing the overhead of setting up Spans for each document in the intersection of query terms.

For our use cases, this approach has resulted in worst-case SpanNearQuery performance nearly identical to extant worst-case PhraseQuery performance, and we are now running full graph queries in production for every user query. (Incidentally, this approach of shingle-based filtering should also be applicable (in modified form) to other ordered slop-dependent queries (such as PhraseQuery).

Match modes

There are various modes supported for combinatoric matching; three of the most useful are described below:

Greedy

Once a valid match is found for a given startPosition, greedy match mode shortcircuits without attempting to find other valid paths for that startPosition. The subsequent call to nextStartPosition() advances the first subSpans to the next startPosition > extant startPosition.

Although this is the mode whose behavior is most similar to stock behavior, it is in some ways misleading to compare it to the stock implementation, because when used at the top-level, this mode is guaranteed to match all valid start positions and all valid documents (while there is no such guarantee for the stock implementation). This mode minimizes buffering of positions, at the expense of missing some redundant matches. This mode may score differently, and if used in a subclause, could result in some missed matches (as a result of skipping some valid endPositions for a given startPosition).

PerEndPosition

This mode will continue exploring possible match paths until it is determined that all valid endPositions for a given startPosition have been discovered. This mode is always guaranteed to match all valid startPosition-endPosition combinations (and consequently will match all valid docs), whether used as top-level or subclause. It may however skip some redundant matches, potentially resulting in slightly different scoring behavior.

PerPosition

This mode will continue exploring possible match paths until it is determined that all valid positions for all subclauses have been reported as a match. Match completeness is on par with PerEndPosition, but this mode includes redundant matches in SpanCollection, and is thus better for complete scoring, at some cost to performance.

Configuration, evaluation, feedback

The candidate fix branch consists mostly of additions and modifications to Lucene code. But to facilitate evaluation and feedback, it also includes a modified version of Solr’s EDismaxQParser that supports configuration of various query behaviors. The stock NearSpansOrdered implementation (from master) has been retained, but refactored to LegacyNearSpansOrdered. It may be invoked using the useLegacyImplementation boolean configuration (see below). The new implementation is different enough that in some ways it might have made more sense to give the new implementation a new name (rather than renaming the extant implementation and replacing it). The main reason for the decision to rename and replace was on this initial candidate branch was to be able to use the new implementation as a drop-in replacement, using existing tests for the extant implementation and helping demonstrate backward compatiblity.

This branch also includes some enhancements that, while not directly related to SpanNearQuery, add features that support significant performance improvement for the new SpanNearQuery implementation.

There are also one or two simple bug fixes that simply address unrelated upstream bugs – but rather than adjust tests to work around the bugs, the bug fixes are simply bundled in here. One example from this category is the fix for PRESERVE_ORIGINAL in WordDelimiterGraphFilter. (I just noticed that this was recently fixed in LUCENE-8395).

Please try it out!

In addition to running tests, it should be straightforward to try using this in a full Solr deployment. In order to fully use this feature, the following configuration should suffice:

Field type

As in the below example, include the PositionLengthOrderTokenFilterFactory as the last element in the index-time analyzer. (N.b., the token reordering performed by PositionLengthOrderTokenFilterFactory should make FlattenGraphFilterFactory redundant)

<fieldType name="search" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="query">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1" stemEnglishPossessive="0" protected="protwords.txt"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
    <analyzer type="index">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1" stemEnglishPossessive="0" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <!--<filter class="solr.FlattenGraphFilterFactory"/>-->
        <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
        <filter class="solr.PositionLengthOrderTokenFilterFactory" indexLookahead="true"/>
    </analyzer>
</fieldType>

Query parser configuration

ExtendedDismaxQParserPlugin enhanced to support filtering/modification of queries. Most of the new configuration options are modified via a configurable GraphQueryFilter; configurations may be overridden at query time using params or local params, which are passed along to the GraphQueryFilter

<queryParser name="perEndPosition" class="solr.ExtendedDismaxQParserPlugin">
  <!-- below allow to configure whether different classes of queries default to
       graph query implementation (e.g., as opposed to default PhraseQuery for phrases);
       distinction is made between explicit phrase searches and implicit (i.e.,
       configured via "ps" param) -->
  <bool name="phraseAsGraphQuery">true</bool>
  <bool name="explicitPhraseAsGraphQuery">true</bool>
  <bool name="multiphraseAsGraphQuery">true</bool>
  <bool name="explicitMultiphraseAsGraphQuery">true</bool>
  
  <!-- post-processing of initial parsed query allows for special handling -->
  <lst name="graphQueryFilter">
    <str name="class">solr.GraphQueryDensityFilter</str>
    
    <!-- uses old NearSpansOrdered implementation, renamed as LegacyNearSpansOrdered;
         this is as close as we can come to comparing apples to apples -->
    <bool name="useLegacyImplementation">false</bool>
    
    <!-- match type greedy for top level SpanNearQuery -->
    <bool name="makeOuterGreedy">true</bool>
    
    <!-- default true; otherwise use recursive implementation -->
    <bool name="useLoopImplementation">true</bool>
    
    <!-- queries that consist exclusively of terms in the top n terms in a given field
         will failsafe to GREEDY match type, to prevent resource monopolization -->
    <lst name="denseFields">
      <int name="field1">20</int>
      <int name="field2_search">10</int>
      <int name="field3_search">10</int>
    </lst>
    
    <!-- like stopwords, but used only to pre-filter domain for phrase queries where 
         slop &lt;= maxSlop; requires corresponding configuration of updateProcessor
         (see below) -->
    <lst name="shingleWords">
      <lst name="field1">
        <str name="words">shingleWords.txt</str>
        <int name="maxSlop">3</int>
      </lst>
      <lst name="*_search">
        <str name="words">shingleWords.txt</str>
        <int name="maxSlop">3</int>
      </lst>
    </lst>
  </lst>
</queryParser>

Update processor to build complementary “shingle” fields

These complementary fields are used to pre-filter the domain of graph queries containing common terms (Cf. common-grams, stopwords), in the common case of relatively small slop.

<updateProcessor class="solr.ShingleFilterUpdateRequestProcessorFactory" name="shingles">
  <lst name="field1">
    <str name="words">shingleWords.txt</str>
    <int name="maxSlop">3</int>
  </lst>
  <lst name="*_search">
    <str name="words">shingleWords.txt</str>
    <int name="maxSlop">3</int>
  </lst>
</updateProcessor>