Complete graph query support in Lucene: candidate implementation
As described in the previous post, enhancements
to various Lucene components have increasingly invalidated some assumptions in the
SpanNearQuery
graph query implementation, leading to buggy and/or unpredictable
query behavior. This post presents a high-level overview of the approach used to
implement a candidate fix for this bug. The fixes/enhancements described have been
running in production for several months as the backend to the
University of Pennsylvania Libraries’ main catalog interface.
Lazy matching will not work
As Christoph Goller correctly points out,
lazy iteration (moving subSpans
forward only, without the ability to
backtrack) is incapable of guaranteeing accuracy.
The starting point for this bugfix candidate was to wrap each subSpans
in a backtracking-capable Spans
implementation. In place of the int nextStartPosition() method, callers
interact with the wrapper through an interface,
int nextMatch(int hardMinStart, int softMinStart, int startCeiling, int minEnd),
which abstracts all the information necessary to determine when
positions must be stored for possible future backtracking, and
when they may be discarded:
- hardMinStart: forever discard spans with start < this parameter
- softMinStart: skip (for the purpose of returning nextMatch, but do not discard) spans with start < this parameter
- startCeiling: disregard (for the purpose of returning nextMatch, but do not discard) spans with start >= this parameter
- minEnd: when non-negative, defines a minimum threshold for span endPositions; Spans with endPosition < this value should be discarded forever
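The contract above can be sketched in plain Java. This is a hypothetical toy over a fixed list of (start, end) positions rather than a real Spans, but it illustrates how the four parameters govern what is discarded forever versus merely skipped and retained for backtracking:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the nextMatch() contract described above, over a
// fixed, pre-sorted list of (start, end) positions instead of a real Spans.
class BacktrackingSpans {
    private final List<int[]> positions = new ArrayList<>(); // {start, end}, sorted by start

    void add(int start, int end) { positions.add(new int[] {start, end}); }

    /**
     * Returns the start of the first retained position with
     * start >= softMinStart, start < startCeiling and (minEnd < 0 || end >= minEnd),
     * or -1 if none. Positions with start < hardMinStart, or with
     * end < minEnd (when minEnd >= 0), are discarded forever.
     */
    int nextMatch(int hardMinStart, int softMinStart, int startCeiling, int minEnd) {
        positions.removeIf(p -> p[0] < hardMinStart || (minEnd >= 0 && p[1] < minEnd));
        for (int[] p : positions) {
            if (p[0] >= softMinStart && p[0] < startCeiling) {
                return p[0]; // skipped-over positions remain buffered for backtracking
            }
        }
        return -1;
    }
}
```

Note how a caller can raise softMinStart to look further ahead, then lower it again on a later call: the skipped positions are still buffered, which is exactly the backtracking capability lazy iteration lacks.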
To provide efficient navigation and resetting of stored spans (for
backtracking), the nextMatch()
wrapper is backed by a PositionDeque
–
storage consisting of a sorted linked deque, implemented as a
circular buffer over a dynamically resizable backing array (which,
being sorted, supports O(log n) random-access seek; the deque is sorted
as a natural consequence of being populated serially by input from backing
Spans
iteration that conforms to the specified ordering for Spans
iteration). Support for mid-deque Node
removal can lead the deque to
become sparse; so to enable efficient iteration over the deque, nodes in
each deque are also doubly-linked (LinkedArrayList
uses a similar approach).
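The circular-buffer idea can be sketched as follows. This is illustrative only (over bare ints, without the node removal, doubly-linked traversal, or pooling described above); the O(log n) seek is valid because entries are appended in sorted order:

```java
// Illustrative sketch: a circular buffer over a dynamically resizable array,
// with O(log n) seek by binary search over the logical (sorted) contents.
class IntCircularDeque {
    private int[] buf = new int[8];
    private int head = 0; // index of the first (oldest) element
    private int size = 0;

    void addLast(int v) {
        if (size == buf.length) { // grow: unwrap into a larger array
            int[] next = new int[buf.length * 2];
            for (int i = 0; i < size; i++) next[i] = buf[(head + i) % buf.length];
            buf = next; head = 0;
        }
        buf[(head + size++) % buf.length] = v;
    }

    int removeFirst() {
        int v = buf[head];
        head = (head + 1) % buf.length; size--;
        return v;
    }

    /** Index of the first value >= target, or size() if none (binary search). */
    int seek(int target) {
        int lo = 0, hi = size;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (buf[(head + mid) % buf.length] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    int get(int i) { return buf[(head + i) % buf.length]; }
    int size() { return size; }
}
```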
Build matches across adjacent phrase positions
Storage nodes in the PositionDeque
(referred to hereafter as Node
s)
serve a dual purpose. They store information about stored/buffered positions
and previous/subsequent Node
s for a given backing Spans
at their associated
PositionDeque
’s phrase index, but they also store references to “reachable”
(within slop constraints) Node
s in the PositionDeque
of previous and
subsequent phrase indexes, and to “reachable” Node
s in the PositionDeque
of the last phrase index.
These references are used to build a “lattice” of Node
s consisting of (at
any point in time): an ordered list of valid Node
s at a given phrase position,
a directed acyclic graph of valid paths from phrase start to phrase end,
and a directed acyclic graph of valid paths from phrase end to phrase start.
The lattice is built by using the nextMatch()
wrapper to drive a
depth-first search to discover (and build/cache) edges in the dynamic
graph represented by valid Node
s at a point in time. The graph is “dynamic”
in the sense that subsequent calls to nextMatch()
can invalidate
extant Node
s, requiring special handling (e.g., removing or updating
match-graph edges).
Managing GC inherent in a heavily-linked approach
To form links/edges between Node
s, this approach requires many simple
linking nodes (similar to those required to “link” a linked list). To avoid
excessive GC that might accompany such a scale of object creation, Node
s
are pooled and reused per-PositionDeque
(and thus, per phrase
index); simple linking nodes are also pooled per-PositionDeque
, and
are returned to the pool upon release/reuse of their currently associated Node
.
On a moderate-sized production index, experience has shown that this object
pooling approach leads to consistent performance. With object pooling disabled
over the same index, hundreds of millions of objects were being created per
request, and query response time was anywhere from 2x to 10x worse than with
object pooling enabled. (The highly variable response time is in keeping with
the intermittent nature of long GC-related pauses).
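The pooling idea reduces to a free list: released nodes are kept and reset on reuse instead of reallocated. A hypothetical simplification (the actual implementation pools per-PositionDeque, and pools the simple linking nodes as well):

```java
import java.util.ArrayDeque;

// Hypothetical simplification of the per-PositionDeque node pool: released
// nodes go onto a free list and are reset on acquire instead of reallocated,
// avoiding the GC pressure of hundreds of millions of short-lived objects.
class NodePool {
    static class Node { int start, end; Node prev, next; }

    private final ArrayDeque<Node> free = new ArrayDeque<>();
    int allocations = 0; // for illustration: counts actual `new` calls

    Node acquire(int start, int end) {
        Node n = free.poll();
        if (n == null) { n = new Node(); allocations++; }
        n.start = start; n.end = end; n.prev = n.next = null; // reset state
        return n;
    }

    void release(Node n) { free.push(n); }
}
```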
New types of information in the Lucene index
This development exposed new possibilities that make certain additional
indexed information useful at query time. As a proof of concept,
all have been implemented using Payloads.
We can do something useful with positionLength
now!
Currently, absent a query implementation that can do something useful
with positionLength
, Lucene simply ignores the PositionLengthAttribute
at index time and implicitly assumes a positionLength
of 1
for all
index terms at query time. (See brief discussion under LUCENE-4312)
Now that we have a query implementation that respects positionLength
,
we can take the advice in the second comment on LUCENE-4312,
and write it to the Payload
. This is done using the PositionLengthOrderTokenFilter
,
which in addition to simply storing the positionLength
in the Payload
,
must reorder tokens to conform to the ordering prescribed by the Spans
specification. (Recall that there would previously have been no point to
such reordering of tokens, since PositionLengthAttribute
was ignored
anyway at index time). In addition to VInt-encoded positionLength
,
PositionLengthOrderTokenFilter
records a “magic” prefix in the
Payload
to prevent confusion among fields that use Payload
differently.
(The proof-of-concept implementation of indexed positionLength
, etc.,
thus cannot be leveraged in conjunction with other uses of Payload
s,
but existing uses of Payload
not in conjunction with
PositionLengthOrderTokenFilter
are supported).
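The payload layout described above might look roughly like this. The magic-prefix value here is hypothetical; the VInt itself is the standard low-order-first varint (7 bits per byte, high bit as continuation flag) that Lucene uses elsewhere:

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of the payload layout: a "magic" prefix identifying
// payloads written by PositionLengthOrderTokenFilter, followed by the
// positionLength as a VInt (7 bits per byte, high bit = continuation).
class PositionLengthPayload {
    static final byte[] MAGIC = { (byte) 0xC5, (byte) 0x7E }; // hypothetical value

    static byte[] encode(int positionLength) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MAGIC[0]); out.write(MAGIC[1]);
        int v = positionLength;
        while ((v & ~0x7F) != 0) { out.write((v & 0x7F) | 0x80); v >>>= 7; }
        out.write(v);
        return out.toByteArray();
    }

    /** Returns the decoded positionLength, or -1 if the magic prefix is absent. */
    static int decode(byte[] payload) {
        if (payload.length < 3 || payload[0] != MAGIC[0] || payload[1] != MAGIC[1]) return -1;
        int v = 0, shift = 0, i = 2;
        byte b;
        do { b = payload[i++]; v |= (b & 0x7F) << shift; shift += 7; } while ((b & 0x80) != 0);
        return v;
    }
}
```

The magic-prefix check is what lets fields that use Payloads differently coexist in the same index without being misread.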
Index lookahead (don’t buffer positions if you don’t have to!)
In order to perform exhaustive matching, we must progress through all possible
valid matches for a given subSpan
, buffering as we go. But with atomic
TermSpans
in stock Lucene (and consequently for higher-level Spans
–
e.g., Or, Near, etc.), lazy iteration means that the only way to
know whether we’ve seen all relevant positions is to iterate past
all relevant positions, to the first irrelevant position. This has
the nasty consequence that we are always forced to buffer every
relevant position.
To prevent unnecessary buffering, as a proof-of-concept
we record per-term startPosition
-lookahead directly in the index, leveraging
Payload
s. For ease of adoption and backward compatibility, this is triggered
by the presence of a “magic” value that may be placed in the Payload
by the PositionLengthOrderTokenFilter
.
Spans may now determine at query time a floor for the value that would be
returned by the next call to nextStartPosition()
, and prevent progressing
beyond the last relevant match. Each Node
thus begins its life as a
“provisional” Node
, which initially represents an unbuffered layer over
the backing Spans
. As a result, in many common cases (e.g., no-slop phrase
searching of simple terms), buffering of position data is often entirely avoided.
subSpans positionLength edge cases, new matches exposed
In some cases, a new match possibility may be exposed by a decrease in the
endPosition
of a subSpans
, or by a decrease in slop caused by an increase
in the match length of a subSpans
. In order to determine when it’s worth
checking for these cases at the atomic TermSpans
level, it would be helpful to
know the following at query time for a given Term in a given Doc:
- Does the endPosition ever decrease? (This could happen in the event of increased startPosition and decreased positionLength.)
- What is the maximum positionLength for an instance of this Term in this Doc?
To make this information available at query time, we will at index time
allocate 4 bytes in the Payload
of the first instance of a Term in each
Doc. When the doc is finished processing, as part of flushing the doc, the
maximum positionLength
for each term is written to the allocated space;
endPosition
s are tracked, and if we have seen a decrease in endPosition
for
a term in this doc, the value written will be (-maxPositionLength) - 1
.
This information may be parsed and exposed by the atomic TermSpans
implementation at query time.
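The sign convention described above can be sketched directly (the exact byte layout of the allocated payload space is an assumption; the sign trick is as described: a negative stored value flags that an endPosition decrease was observed):

```java
// Sketch of the 4-byte per-term-per-doc value described above: the stored
// int is maxPositionLength, or (-maxPositionLength) - 1 when an endPosition
// decrease was observed for this term in this doc.
class TermDocStats {
    static int encode(int maxPositionLength, boolean endPositionEverDecreases) {
        return endPositionEverDecreases ? -maxPositionLength - 1 : maxPositionLength;
    }

    static int maxPositionLength(int stored) {
        return stored < 0 ? -(stored + 1) : stored;
    }

    static boolean endPositionEverDecreases(int stored) {
        return stored < 0;
    }
}
```

The "- 1" offset keeps the encoding unambiguous when maxPositionLength is 0, since -0 would otherwise collide with 0.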
Shingle-based pre-filtering of ConjunctionSpans
For accuracy and consistency, we would like to use true graph queries
for all phrase searching. Keeping in mind that DismaxQueryParser
boosts
scores by running phrase searches over the pf
fields for all user queries
(not only for explicit phrase queries), it would be desirable for graph
SpanNearQuery
performance to be on par with performance of the extant default
PhraseQuery
implementation.
This is particularly challenging for queries consisting mainly of common terms,
especially because stop words are incompatible with SpanNearQuery
without
significant sacrifices in functionality
(see documentation for ComplexPhraseQParser).
CommonGramsFilter
provides a possible solution, but only by sacrificing a
considerable amount of flexibility (consider that with stopword the
, we might
reasonably want "the caterpillar"~2
to match the very hungry caterpillar
).
To address this challenge, we have implemented an index-time
ShingleFilterUpdateRequestProcessor
, which behaves similarly to the
CommonGramsFilter
, but instead of emitting shingle tokens inline on the source
token stream, it buffers shingles (of configurable slop limit for a configurable
list of stopwords) as it evaluates the entire source token stream, then writes
these unique shingles to a separate field that indexes term frequency, but
not positions. We then hijack the destination field’s TermFrequency
attribute
(which, after all, is just a positive integer) to record (for each unique
shingle/token) the minimum slop (between shingle components) <= the configured
slop limit.
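The frequency-hijacking idea can be illustrated with a toy shingle collector (the key format and position bookkeeping here are hypothetical simplifications; the real implementation operates on the analyzed token stream and writes the minimum slop in place of the shingle's term frequency):

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of recording, per unique stopword shingle, the minimum
// slop (tokens between the shingle components) seen in a document; at index
// time this minimum is stored where the term frequency would normally go.
class ShingleMinSlop {
    // key: "word1|word2", value: minimum observed gap between the two words
    final Map<String, Integer> minSlop = new HashMap<>();

    void observe(String w1, int pos1, String w2, int pos2) {
        int gap = pos2 - pos1 - 1; // tokens between the shingle components
        minSlop.merge(w1 + "|" + w2, gap, Math::min);
    }

    /** Query-time check: does this shingle occur with slop <= the query's slop? */
    boolean matches(String w1, String w2, int querySlop) {
        Integer s = minSlop.get(w1 + "|" + w2);
        return s != null && s <= querySlop;
    }
}
```

This is what makes the "the very hungry caterpillar" example above work: the shingle the|caterpillar is recorded with minimum gap 2, so "the caterpillar"~2 passes the pre-filter while a slop of 1 does not.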
At query time, a corresponding filter on the ExtendedDismaxQParser
examines
the parsed graph query for SpanNearQuery
s containing shingled stopwords, and
constructs TwoPhaseIterator
s for the shingles. These TwoPhaseIterator
s are
passed as pre-filters (respecting configured phrase slop) into the ConjunctionSpans
approximation
for the associated SpanNearQuery
, significantly reducing the
overhead of setting up Spans
for each document in the intersection of query terms.
For our use cases, this approach has resulted in worst-case SpanNearQuery
performance
nearly identical to extant worst-case PhraseQuery
performance, and we are now running
full graph queries in production for every user query. (Incidentally, this approach
of shingle-based filtering should also be applicable, in modified form, to other
ordered slop-dependent queries, such as PhraseQuery.)
Match modes
There are various modes supported for combinatoric matching; three of the most useful are described below:
Greedy
Once a valid match is found for a given startPosition, greedy match mode
short-circuits without attempting to find other valid paths for that startPosition.
The subsequent call to nextStartPosition()
advances the first subSpans to the
next startPosition > extant startPosition.
Although this is the mode whose behavior is most similar to stock behavior, it is in some ways misleading to compare it to the stock implementation, because when used at the top-level, this mode is guaranteed to match all valid start positions and all valid documents (while there is no such guarantee for the stock implementation). This mode minimizes buffering of positions, at the expense of missing some redundant matches. This mode may score differently, and if used in a subclause, could result in some missed matches (as a result of skipping some valid endPositions for a given startPosition).
PerEndPosition
This mode will continue exploring possible match paths until it is determined that all valid endPositions for a given startPosition have been discovered. This mode is always guaranteed to match all valid startPosition-endPosition combinations (and consequently will match all valid docs), whether used as top-level or subclause. It may however skip some redundant matches, potentially resulting in slightly different scoring behavior.
PerPosition
This mode will continue exploring possible match paths until it is determined that all valid positions for all subclauses have been reported as a match. Match completeness is on par with PerEndPosition, but this mode includes redundant matches in SpanCollection, and is thus better for complete scoring, at some cost to performance.
Configuration, evaluation, feedback
The candidate fix branch consists mostly of additions and modifications to Lucene
code. But to facilitate evaluation and feedback, it also includes a modified version
of Solr’s EDismaxQParser
that supports configuration of various query behaviors.
The stock NearSpansOrdered
implementation (from master) has been retained, but
refactored to LegacyNearSpansOrdered
. It may be invoked using the useLegacyImplementation
boolean configuration (see below). The new implementation is different enough that in
some ways it might have made more sense to give the new implementation a new name
(rather than renaming the extant implementation and replacing it). The main reason for
the decision to rename and replace on this initial candidate branch was to be able
to use the new implementation as a drop-in replacement, reusing existing tests for the extant
implementation and helping demonstrate backward compatibility.
This branch also includes some enhancements that, while not directly related to SpanNearQuery, add features that support significant performance improvement for the new SpanNearQuery implementation.
There are also one or two simple bug fixes that simply address unrelated upstream
bugs – but rather than adjust tests to work around the bugs, the bug fixes are
simply bundled in here. One example from this category is the fix for PRESERVE_ORIGINAL
in WordDelimiterGraphFilter
. (I just noticed that this was recently fixed in LUCENE-8395).
Please try it out!
In addition to running tests, it should be straightforward to try using this in a full Solr deployment. In order to fully use this feature, the following configuration should suffice:
Field type
As in the below example, include the PositionLengthOrderTokenFilterFactory
as the last
element in the index-time analyzer. (N.b., the token reordering performed by
PositionLengthOrderTokenFilterFactory
should make FlattenGraphFilterFactory
redundant)
<fieldType name="search" class="solr.TextField" positionIncrementGap="100">
<analyzer type="query">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1" stemEnglishPossessive="0" protected="protwords.txt"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1" stemEnglishPossessive="0" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<!--<filter class="solr.FlattenGraphFilterFactory"/>-->
<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
<filter class="solr.PositionLengthOrderTokenFilterFactory" indexLookahead="true"/>
</analyzer>
</fieldType>
Query parser configuration
ExtendedDismaxQParserPlugin
has been enhanced to support filtering/modification of queries.
Most of the new configuration options are modified via a configurable GraphQueryFilter;
configurations may be overridden at query time using params or local params, which are
passed along to the GraphQueryFilter.
<queryParser name="perEndPosition" class="solr.ExtendedDismaxQParserPlugin">
<!-- the options below configure whether different classes of queries default to the
graph query implementation (e.g., as opposed to default PhraseQuery for phrases);
distinction is made between explicit phrase searches and implicit (i.e.,
configured via "ps" param) -->
<bool name="phraseAsGraphQuery">true</bool>
<bool name="explicitPhraseAsGraphQuery">true</bool>
<bool name="multiphraseAsGraphQuery">true</bool>
<bool name="explicitMultiphraseAsGraphQuery">true</bool>
<!-- post-processing of initial parsed query allows for special handling -->
<lst name="graphQueryFilter">
<str name="class">solr.GraphQueryDensityFilter</str>
<!-- uses old NearSpansOrdered implementation, renamed as LegacyNearSpansOrdered;
this is as close as we can come to comparing apples to apples -->
<bool name="useLegacyImplementation">false</bool>
<!-- match type greedy for top level SpanNearQuery -->
<bool name="makeOuterGreedy">true</bool>
<!-- default true; otherwise use recursive implementation -->
<bool name="useLoopImplementation">true</bool>
<!-- queries that consist exclusively of terms in the top n terms in a given field
will failsafe to GREEDY match type, to prevent resource monopolization -->
<lst name="denseFields">
<int name="field1">20</int>
<int name="field2_search">10</int>
<int name="field3_search">10</int>
</lst>
<!-- like stopwords, but used only to pre-filter domain for phrase queries where
slop <= maxSlop; requires corresponding configuration of updateProcessor
(see below) -->
<lst name="shingleWords">
<lst name="field1">
<str name="words">shingleWords.txt</str>
<int name="maxSlop">3</int>
</lst>
<lst name="*_search">
<str name="words">shingleWords.txt</str>
<int name="maxSlop">3</int>
</lst>
</lst>
</lst>
</queryParser>
Update processor to build complementary “shingle” fields
These complementary fields are used to pre-filter the domain of graph queries containing common terms (Cf. common-grams, stopwords), in the common case of relatively small slop.
<updateProcessor class="solr.ShingleFilterUpdateRequestProcessorFactory" name="shingles">
<lst name="field1">
<str name="words">shingleWords.txt</str>
<int name="maxSlop">3</int>
</lst>
<lst name="*_search">
<str name="words">shingleWords.txt</str>
<int name="maxSlop">3</int>
</lst>
</updateProcessor>