The previous two posts describe a longstanding limitation of Lucene graph queries and an enhancement that attempts to address it. This post enumerates some of the downstream possibilities that the enhancement could open up.

Restores functionality for index-time multi-term synonyms

Restoring index-time multi-term synonyms in turn allows synonym generation to incorporate contextual analysis. Query-time synonym expansion, by contrast, generally has much less context available.
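To make the idea of an indexed multi-term synonym concrete, here is a minimal sketch in plain Python. It is a toy model rather than Lucene code: the `TOKENS` list, the `phrase_matches` helper, and the "dns" / "domain name system" example are all assumptions introduced for illustration. Each token carries a start position and a position length, so the single-token synonym spans the same three positions as the phrase it stands for.

```python
# Toy model of an indexed token graph (not the Lucene API).
# Each token is (term, start_position, position_length);
# its end position is start_position + position_length.
TOKENS = [
    ("dns", 0, 3),      # hypothetical index-time synonym spanning three positions
    ("domain", 0, 1),
    ("name", 1, 1),
    ("system", 2, 1),
    ("server", 3, 1),
]


def phrase_matches(tokens, phrase):
    """Return every (start, end) span where the phrase terms follow one
    another along a path through the token graph."""
    by_term = {}
    for term, start, length in tokens:
        by_term.setdefault(term, []).append((start, start + length))

    # Seed with every occurrence of the first term, then extend each partial
    # match with occurrences of the next term that begin where it ends.
    matches = list(by_term.get(phrase[0], []))
    for term in phrase[1:]:
        matches = [
            (start, next_end)
            for (start, end) in matches
            for (next_start, next_end) in by_term.get(term, [])
            if next_start == end
        ]
    return matches


print(phrase_matches(TOKENS, ["dns", "server"]))
print(phrase_matches(TOKENS, ["domain", "name", "system", "server"]))
print(phrase_matches(TOKENS, ["dns", "name"]))
```

In this sketch both the short form and the expanded form match the same span, while a phrase that jumps between the two paths ("dns name") does not match at all.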

CJK text and orthographic variation

CJK text indexing in particular could stand to benefit from support for indexed graphs and complete graph query support. The high degree of orthographic variation, usefulness of context in synonym expansion, and common use of ngram/shingle indexing strategies could produce quite complex indexed token graphs. A complete graph query implementation should be able to accurately query over such graphs despite their complexity.

More thorough scoring

The ability to find redundant matches opens up possibilities for more thorough scoring. For instance, one could score based on matches of both normalized and non-normalized terms. One could imagine other scoring applications (even without synonyms or variant forms of normalization) where match density would be inadequately represented without exploring redundant and/or overlapping matches.

Expands the potential usefulness of complex queries

Complex queries like nested SpanQuerys, complex graph queries, sloppy queries, etc. would previously have behaved unpredictably in some cases, and/or carried limitations and caveats based on assumptions in the underlying query implementation. In particular, ComplexPhraseQParser will likely behave more predictably, since it tends to construct the type of complex queries that are more likely to benefit from complete matching.

Potential non-text use cases

There are non-text use cases that can be well represented through the Spans API. Consider travel scheduling, for example: one could represent each transportation leg (a discrete flight, train, bus, etc.) as a composite term, composed of the departure and arrival locations. Departure time could be represented as startPosition and arrival time as endPosition; an entire multi-leg trip would be a “phrase”, and overlapping time windows would be “documents”.

The token graph for such a use case would be quite complex, with many overlapping tokens and large position lengths. Some interesting (and practical!) graph queries could be run over such an index, but complete matching (returning all valid matches, including redundant terms/positions) would be essential to the usefulness of the results.
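As a rough illustration of why complete matching matters here, the following plain-Python sketch models legs as tokens and enumerates every valid trip. Nothing in it is Lucene code: the `Leg` type, the sample schedule, the `all_trips` helper, and the `max_layover` parameter (standing in loosely for slop) are all hypothetical. It also assumes that every leg arrives strictly after it departs, which is what guarantees that the search terminates.

```python
from typing import NamedTuple


class Leg(NamedTuple):
    """One transportation leg, modeled like a token in a graph."""
    origin: str
    destination: str
    depart: int   # plays the role of startPosition (hour of day)
    arrive: int   # plays the role of endPosition (hour of day)


# Hypothetical schedule with overlapping legs and long "position lengths".
LEGS = [
    Leg("BOS", "JFK", 8, 9),
    Leg("BOS", "JFK", 9, 10),
    Leg("JFK", "LHR", 10, 17),
    Leg("JFK", "LHR", 11, 18),
    Leg("BOS", "LHR", 9, 16),
]


def all_trips(legs, origin, destination, max_layover):
    """Enumerate every chain of legs from origin to destination in which each
    leg departs from the previous arrival point within max_layover hours."""
    results = []

    def extend(path):
        last = path[-1]
        if last.destination == destination:
            results.append(path)
            return
        for leg in legs:
            gap = leg.depart - last.arrive
            if leg.origin == last.destination and 0 <= gap <= max_layover:
                extend(path + [leg])

    for leg in legs:
        if leg.origin == origin:
            extend([leg])
    return results


for trip in all_trips(LEGS, "BOS", "LHR", max_layover=2):
    print(" -> ".join(f"{l.origin}-{l.destination}@{l.depart}" for l in trip))
```

A matcher that stopped at the first connection found for each starting leg would report only some of these itineraries, even though the others are equally valid trips that a traveler might prefer.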

There are surely other scheduling problems, and indeed probably entire other classes of problems, that could be fruitfully represented through the Lucene index, Spans API, and graph queries.