Lucene graph query support is incomplete
SpanNearQuery implementation of graph queries is incomplete
Lucene SpanNearQuery
identifies valid paths through token-graphs that
represent the content of each document. As the component concerned with
discovering the “edges” linking query subclause “nodes”, SpanNearQuery
is arguably the essential component of graph query in Lucene.
But SpanNearQuery
is not a complete graph query implementation;
accurate matching depends on restrictive assumptions about the match-length
variability of individual subclauses.
For many common use cases (e.g., simple tokenization schemes, flat queries, minimal phrase slop), the required match-length variability assumptions are largely valid, and the effect of incomplete matching on search result quality is difficult to perceive. In these cases, incomplete matching may be (and has been!) ignored, to no ill effect. However, there remain many common (and potential) use cases for which the match-length variability assumptions are not valid; these cases can result in the skipping of enough legitimate matches (positions and documents) to render queries unpredictable or even useless.
The problem is well summarized and diagnosed in Christoph Goller’s comment for the relevant Lucene Jira issue, LUCENE-7398.
Emergence of the problem, and attempts to address it
According to Goller’s above-linked comment, ever since the initial introduction of
SpanNearQuery
(in Lucene 1.4 (2004)), “there never was an implementation that
compared all possible combinations of subspan matches for SpanNearQuery
in Lucene. So SpanNearQuery
always missed some matches.”
The historic state of available Lucene features has effectively masked the
shortcomings of SpanNearQuery
. Although the implementation of SpanNearQuery
has changed relatively little, changes in related components have caused some
of its longstanding issues to manifest more prominently.
Per-token position length
Per-token position length (via
PositionLengthAttribute
) was introduced in analysis in Lucene 3.6.0 (2012). Prior
to this introduction, the concept of per-token position length was unavailable at index-time
and query-time. Even after the introduction of the PositionLengthAttribute
,
position length has continued to be discarded at index time (i.e., not recorded
to the index). This decision makes sense in the absence of query implementations
capable of using indexed position length (see LUCENE-4312).
In the absence of variable per-token position length, the
only way for SpanNearQuery
sub-clauses to have variable length
is in nested queries (e.g., sub-clauses consisting of SpanOrQuery
s or
other SpanNearQuery
s). This type of nested SpanQuery
(the first true
graph queries), have until recently been chiefly the domain of
specialized query parsers. This changed in 2017 with the release of
Lucene 6.4, which saw the introduction of graph queries in
org.apache.lucene.util.QueryBuilder
(LUCENE-7603),
which in turn incorporated graph queries in many more commonly used query
parsers, including Solr’s ExtendedDismaxQParser
. (As a side-note,
ExtendedDismaxQParser
works only with explicit phrase queries – it has
not been updated to expect queries of type SpanNearQuery
, so implicit
(pf
) graph phrase queries are dropped entirely,
bypassing phrase boosting as if no pf
param had been specified at all)
The best practical solution to date (described by Mike McCandless here)
depends on employing graph-producing analysis components exclusively at query time.
McCandless mentions some tradeoffs of this approach; but another fundamental tradeoff
is that this approach limits you to tokenization based on information available at query time.
This is fine for straightforward, symmetrical synonym transformations, but not, e.g.,
for contextual/asymmetrical synonym schemes, or index-time WordDelimiterGraphFilter
that
one might use to index i'll
as (i ll) | ill | i'll
. The latter use case is especially
important to support user input that elides punctuation.
So, LUCENE-7398!
In summation, I think there’s a strong case for the continued relevance and high priority of the unresolved issue LUCENE-7398. In fact, advances in other aspects of Lucene’s architecture have raised (and will likely continue to raise) the profile of this issue. The next post introduces and describes a candidate implementation that addresses this issue.