SpanNearQuery implementation of graph queries is incomplete

Lucene SpanNearQuery identifies valid paths through token-graphs that represent the content of each document. As the component concerned with discovering the “edges” linking query subclause “nodes”, SpanNearQuery is arguably the essential component of graph query in Lucene. But SpanNearQuery is not a complete graph query implementation; accurate matching depends on restrictive assumptions about the match-length variability of individual subclauses.

For many common use cases (e.g., simple tokenization schemes, flat queries, minimal phrase slop), the required match-length variability assumptions are largely valid, and the effect of incomplete matching on search result quality is difficult to perceive. In these cases, incomplete matching may be (and has been!) ignored, to no ill effect. However, there remain many common (and potential) use cases for which the match-length variability assumptions are not valid; these cases can result in the skipping of enough legitimate matches (positions and documents) to render queries unpredictable or even useless.

The problem is well summarized and diagnosed in Christoph Goller’s comment for the relevant Lucene Jira issue, LUCENE-7398.

Emergence of the problem, and attempts to address it

According to Goller’s above-linked comment, ever since the initial introduction of SpanNearQuery (in Lucene 1.4 (2004)), “there never was an implementation that compared all possible combinations of subspan matches for SpanNearQuery in Lucene. So SpanNearQuery always missed some matches.”

The historic state of available Lucene features has effectively masked the shortcomings of SpanNearQuery. Although the implementation of SpanNearQuery has changed relatively little, changes in related components have caused some of its longstanding issues to manifest more prominently.

Per-token position length

Per-token position length (via PositionLengthAttribute) was introduced in analysis in Lucene 3.6.0 (2012). Prior to this introduction, the concept of per-token position length was unavailable at index-time and query-time. Even after the introduction of the PositionLengthAttribute, position length has continued to be discarded at index time (i.e., not recorded to the index). This decision makes sense in the absence of query implementations capable of using indexed position length (see LUCENE-4312).

In the absence of variable per-token position length, the only way for SpanNearQuery sub-clauses to have variable length is in nested queries (e.g., sub-clauses consisting of SpanOrQuerys or other SpanNearQuerys). This type of nested SpanQuery (the first true graph queries), have until recently been chiefly the domain of specialized query parsers. This changed in 2017 with the release of Lucene 6.4, which saw the introduction of graph queries in org.apache.lucene.util.QueryBuilder (LUCENE-7603), which in turn incorporated graph queries in many more commonly used query parsers, including Solr’s ExtendedDismaxQParser. (As a side-note, ExtendedDismaxQParser works only with explicit phrase queries – it has not been updated to expect queries of type SpanNearQuery, so implicit (pf) graph phrase queries are dropped entirely, bypassing phrase boosting as if no pf param had been specified at all)

The best practical solution to date (described by Mike McCandless here) depends on employing graph-producing analysis components exclusively at query time. McCandless mentions some tradeoffs of this approach; but another fundamental tradeoff is that this approach limits you to tokenization based on information available at query time. This is fine for straightforward, symmetrical synonym transformations, but not, e.g., for contextual/asymmetrical synonym schemes, or index-time WordDelimiterGraphFilter that one might use to index i'll as (i ll) | ill | i'll. The latter use case is especially important to support user input that elides punctuation.

So, LUCENE-7398!

In summation, I think there’s a strong case for the continued relevance and high priority of the unresolved issue LUCENE-7398. In fact, advances in other aspects of Lucene’s architecture have raised (and will likely continue to raise) the profile of this issue. The next post introduces and describes a candidate implementation that addresses this issue.