Lucene graph query support is incomplete
SpanNearQuery implementation of graph queries is incomplete
SpanNearQuery identifies valid paths through token-graphs that
represent the content of each document. As the component concerned with
discovering the “edges” linking query subclause “nodes”,
is arguably the essential component of graph query in Lucene.
SpanNearQuery is not a complete graph query implementation;
accurate matching depends on restrictive assumptions about the match-length
variability of individual subclauses.
For many common use cases (e.g., simple tokenization schemes, flat queries, minimal phrase slop), the required match-length variability assumptions are largely valid, and the effect of incomplete matching on search result quality is difficult to perceive. In these cases, incomplete matching may be (and has been!) ignored, to no ill effect. However, there remain many common (and potential) use cases for which the match-length variability assumptions are not valid; these cases can result in the skipping of enough legitimate matches (positions and documents) to render queries unpredictable or even useless.
The problem is well summarized and diagnosed in Christoph Goller’s comment for the relevant Lucene Jira issue, LUCENE-7398.
Emergence of the problem, and attempts to address it
According to Goller’s above-linked comment, ever since the initial introduction of
SpanNearQuery (in Lucene 1.4 (2004)), “there never was an implementation that
compared all possible combinations of subspan matches for
in Lucene. So
SpanNearQuery always missed some matches.”
The historic state of available Lucene features has effectively masked the
SpanNearQuery. Although the implementation of
has changed relatively little, changes in related components have caused some
of its longstanding issues to manifest more prominently.
Per-token position length
Per-token position length (via
PositionLengthAttribute) was introduced in analysis in Lucene 3.6.0 (2012). Prior
to this introduction, the concept of per-token position length was unavailable at index-time
and query-time. Even after the introduction of the
position length has continued to be discarded at index time (i.e., not recorded
to the index). This decision makes sense in the absence of query implementations
capable of using indexed position length (see LUCENE-4312).
In the absence of variable per-token position length, the
only way for
SpanNearQuery sub-clauses to have variable length
is in nested queries (e.g., sub-clauses consisting of
SpanNearQuerys). This type of nested
SpanQuery (the first true
graph queries), have until recently been chiefly the domain of
specialized query parsers. This changed in 2017 with the release of
Lucene 6.4, which saw the introduction of graph queries in
which in turn incorporated graph queries in many more commonly used query
parsers, including Solr’s
ExtendedDismaxQParser. (As a side-note,
ExtendedDismaxQParser works only with explicit phrase queries – it has
not been updated to expect queries of type
SpanNearQuery, so implicit
pf) graph phrase queries are dropped entirely,
bypassing phrase boosting as if no
pf param had been specified at all)
The best practical solution to date (described by Mike McCandless here)
depends on employing graph-producing analysis components exclusively at query time.
McCandless mentions some tradeoffs of this approach; but another fundamental tradeoff
is that this approach limits you to tokenization based on information available at query time.
This is fine for straightforward, symmetrical synonym transformations, but not, e.g.,
for contextual/asymmetrical synonym schemes, or index-time
one might use to index
(i ll) | ill | i'll. The latter use case is especially
important to support user input that elides punctuation.
In summation, I think there’s a strong case for the continued relevance and high priority of the unresolved issue LUCENE-7398. In fact, advances in other aspects of Lucene’s architecture have raised (and will likely continue to raise) the profile of this issue. The next post introduces and describes a candidate implementation that addresses this issue.