Sort by SKG/relatedness over high-cardinality fields
One of the most intriguing uses for the Solr JSON Facet API’s SKG/`relatedness` function is sorting facet buckets by relatedness; when buckets are sorted by relatedness, the `FacetFieldProcessor` must calculate relatedness for every bucket.

The current `FacetFieldProcessorByArray` implementation (as of Solr 7.6) uses a standard uninverted approach (either docValues or `UnInvertedField`) to calculate facet counts over the domain base docSet, and then uses that initial pass as a pre-filter for a second, inverted pass that fetches a docSet for each relevant term (i.e., count > minCount) and calculates the intersection size of each of those docSets with the domain base docSet. Over high-cardinality domains and fields, the overhead of per-term docSet creation and set intersection drives request latency to the point where sort-by-relatedness is impractical for many use cases.
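For concreteness, here is a minimal SolrJ sketch of the kind of request in question: a terms facet whose buckets are sorted by a `relatedness($fore,$back)` aggregation. The collection URL, the `concepts` field, and the foreground query are hypothetical placeholders, not details from the use case described below.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RelatednessSortExample {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build()) {
      SolrQuery q = new SolrQuery("*:*");     // domain base docSet
      q.setRows(0);
      q.add("fore", "category:medicine");     // foreground set (placeholder query)
      q.add("back", "*:*");                   // background set
      // Terms facet over a (hypothetical) high-cardinality "concepts" field,
      // with buckets sorted by the SKG relatedness aggregation.
      q.add("json.facet", "{"
          + " concepts: {"
          + "   type: terms,"
          + "   field: concepts,"
          + "   limit: 10,"
          + "   sort: { r: desc },"
          + "   facet: { r: \"relatedness($fore,$back)\" }"
          + " }"
          + "}");
      QueryResponse rsp = client.query(q);
      System.out.println(rsp.getResponse().get("facets"));
    }
  }
}
```

Because the primary sort is the `r` aggregation, relatedness must be computed for every bucket of `concepts` before the top 10 can be selected, which is what makes high-cardinality fields expensive here.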
Initial experiments: fix filterCache thrashing
Initial experiments with sort-by-relatedness in a high-cardinality context revealed that the current implementation leans heavily on the `filterCache`, effectively blowing it out over even modestly high-cardinality fields. SOLR-13108 aims to address this thrashing issue by respecting `cacheDf`.
Single sweep facet counts over domain, foreground, and background sets
Even with the above `filterCache` thrashing patch (SOLR-13108) applied, the current implementation continues to rely on per-term docSet creation and set intersection, and performance for my use case was not good enough for production: for a field with ~220k unique terms per core, QTimes over high-cardinality domain docSets were, e.g., 9000ms at domain cardinality 1816684 and 18000ms at cardinality 5032902.
In considering how to more efficiently determine docFreq (to evaluate against the `cacheDf` setting and so decide whether to consult the `filterCache`), I realized that it would be faster to simply piggyback on the existing field-type-optimized `FacetFieldProcessorByArray` code to calculate facet counts simultaneously over the facet domain, foreground, and background sets. The resulting patch is proposed at SOLR-13132, and improves performance in the above-mentioned use case by as much as 10x-60x (to consistently somewhere in the range of 250-300ms per request).
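As a rough, self-contained illustration of the single-sweep idea (a simplified sketch, not the SOLR-13132 patch, which works against Lucene’s segment-level ordinal structures): one pass over the documents of interest increments per-term counters for whichever of the domain, foreground, and background sets each document belongs to, with no per-term docSet creation or set intersection.

```java
import java.util.BitSet;

public class SweepCountSketch {
  /**
   * Single-sweep facet counting over three sets at once.
   *
   * @param termOrds termOrds[doc] = term ordinal for each doc (single-valued field; -1 = no value)
   * @param domain   docs in the facet domain
   * @param fore     docs in the foreground set
   * @param back     docs in the background set (typically a superset, e.g. everything)
   * @param numTerms number of unique terms in the field
   * @return {domainCounts, foreCounts, backCounts}, each indexed by term ordinal
   */
  static int[][] sweepCounts(int[] termOrds, BitSet domain, BitSet fore, BitSet back, int numTerms) {
    int[] domainCounts = new int[numTerms];
    int[] foreCounts = new int[numTerms];
    int[] backCounts = new int[numTerms];
    // One pass over the union of the three sets; each doc bumps the counter
    // of every set it belongs to. No per-term docSets, no set intersections.
    BitSet union = (BitSet) domain.clone();
    union.or(fore);
    union.or(back);
    for (int doc = union.nextSetBit(0); doc >= 0; doc = union.nextSetBit(doc + 1)) {
      int ord = termOrds[doc];
      if (ord < 0) continue;                 // doc has no value for this field
      if (domain.get(doc)) domainCounts[ord]++;
      if (fore.get(doc)) foreCounts[ord]++;
      if (back.get(doc)) backCounts[ord]++;
    }
    return new int[][] {domainCounts, foreCounts, backCounts};
  }
}
```

The per-term counts for the three sets are then available for computing relatedness for every bucket in a single pass over the counter arrays.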
Why just for FacetFieldProcessorByArray?
Sweep counts are only relevant where relatedness is required for primary sort, and thus must be calculated for all buckets.
Of the four concrete implementations of `FacetFieldProcessor` (`FacetFieldProcessorByArrayDV`, `FacetFieldProcessorByArrayUIF`, `FacetFieldProcessorByEnumTermsStream`, and `FacetFieldProcessorByHashDV`), only the first two (each a subclass of `FacetFieldProcessorByArray`) are supported.
Because `FacetFieldProcessorByEnumTermsStream` is sorted strictly in “index” order, sort-by-relatedness would not be applicable, so sweep count collection would be pointless and is not supported.

`FacetFieldProcessorByHashDV` is used in high-cardinality field contexts; however, it is only beneficial when domain cardinality is low. Because sweep collection effectively induces a composite domain that is a union with the background set (which is normally high-cardinality), sweep count collection would in most cases be an antipattern, and is not currently supported.
How is this good?
Aside from being an order of magnitude faster, why is this a good thing? One alternative might be to simply avoid sorting by `relatedness` over high-cardinality domains and/or fields. For heavily normalized/stemmed plain-text fields, one might end up with a manageable number of unique terms, and could probably achieve reasonable sort-by-relatedness performance. But if what we’re after is meaningful semantic connections, it could be desirable to sort by `relatedness` over terms that represent semantic concepts (topics, names, etc.), as opposed to simple plain-text tokens, and it would be nice not to be unnecessarily limited in the cardinality of our semantic concepts.
Further opportunities: termFacetCache!
Something that has been in the back of my mind (and actually running in production for over a year, for `DocValuesFacets`) is a cache for term facet counts.

The current array-based docValues and `UnInvertedField` term faceting implementations recalculate facet counts for every request. This means that if you’re faceting over a field with a million unique values per core, at a minimum the JVM must allocate a new `int[1000000]` for every request and populate it by iterating over all term values for all documents in the facet domain. It is a testament to Solr and Lucene that this can be done in a relatively performant manner; but even so, it would seem to be a prime candidate to benefit from even a very small cache, to catch repeated variant requests and very common top-level requests over common high-cardinality domains.
Such a cache would be particularly beneficial to `relatedness`, because with the new approach described here, we’re calculating facets over the “background” set for every request. The background set is usually relatively high-cardinality (e.g., `*:*`), and thus (with respect to the number of docs visited, and corresponding terms) every facet request has the performance characteristics of a facet request over the high-cardinality background set.
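To make the idea concrete, here is a minimal sketch of such a cache, assuming a hypothetical key of (field, domain): an access-ordered LRU map from that key to the per-term count array. This is not the actual termFacetCache implementation, nor Solr’s cache API, just the general shape of the thing.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Supplier;

public class TermFacetCountCache {

  /** Cache key: the faceted field plus some stable representation of the domain
   *  (e.g. the query/filter that produced the domain docSet). Hypothetical. */
  static final class Key {
    final String field;
    final String domainKey;
    Key(String field, String domainKey) {
      this.field = field;
      this.domainKey = domainKey;
    }
    @Override public boolean equals(Object o) {
      if (!(o instanceof Key)) return false;
      Key k = (Key) o;
      return field.equals(k.field) && domainKey.equals(k.domainKey);
    }
    @Override public int hashCode() {
      return Objects.hash(field, domainKey);
    }
  }

  private final Map<Key, int[]> cache;

  TermFacetCountCache(final int maxEntries) {
    // Access-ordered LinkedHashMap gives simple LRU eviction; even a very small
    // maxEntries catches the repeated background-set counts described above.
    this.cache = new LinkedHashMap<Key, int[]>(16, 0.75f, true) {
      @Override protected boolean removeEldestEntry(Map.Entry<Key, int[]> eldest) {
        return size() > maxEntries;
      }
    };
  }

  /** Return cached per-term counts for (field, domain), computing them once if absent. */
  synchronized int[] getOrCompute(Key key, Supplier<int[]> countAllTerms) {
    return cache.computeIfAbsent(key, k -> countAllTerms.get());
  }
}
```

Keyed this way, counts for the background set (whose key does not change from request to request) are computed once and then reused, and even a handful of entries is enough to catch repeated variant requests.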
Rough benchmarks for a functional implementation of such a cache running in production (cache-compatible and shared across JSON facet `FacetFieldProcessorByArray` and simple `DocValuesFacets`) are summarized in the table below. For these benchmarks, the index has ~1.5 million docs per core, and we are faceting over a field with ~220k unique values per core. For a given “recall” (the percentage of documents returned by a simple query, not “recall” in the IR sense), we give the approximate best-case QTime (in milliseconds) for a facet response over the test field, sorted by relatedness. The foreground is the same as the base domain, and the background set is `*:*`. Each query was run ~20 times across a static, non-optimized six-node cluster, enough to get a sense of the best-case response time; QTimes converged pretty tightly and quickly, fwiw. For these benchmarks we explicitly wanted to run identical commands in quick succession (2 seconds apart), because we’re looking for best-case latency and so want to take advantage of all configured caches. QTimes are listed alongside a rough latency-reduction factor (with respect to the extant-implementation baseline).
- “recall”: approximate percentage of docs returned by the domain query
- “extant”: the existing implementation (with the `filterCache` patch from SOLR-13108 applied)
- “sweep”: the proposed modification, collecting facet counts for the domain, foreground, and background sets in a single sweep, with no facet count caching
- “cache bg”: “sweep” with the facet count cache added (configured to cache counts for the background set only)
- “cache all”: “sweep” with the facet count cache added (configured to cache counts for the domain, foreground, and background sets)
| recall | extant (ms) | sweep (ms) | cache bg (ms) | cache all (ms) |
|---|---|---|---|---|
| 50% | 16000 | 250 (64x) | 135 (118x) | 35 (457x) |
| 20% | 8000 | 230 (35x) | 85 (94x) | 30 (267x) |
| 1% | 1300 | 230 (5x) | 25 (52x) | 20 (65x) |
The improvement for high-recall queries is particularly significant, because high-recall “worst-case” queries are likely to be a determining factor in evaluating the suitability of the `relatedness` feature for production deployment.
Inline collection of “missing” bucket in FacetFieldProcessorByArray
In order to make the `termFacetCache` compatible for use by both `DocValuesFacets` and `FacetFieldProcessorByArray`, I found it convenient to implement inline collection of the “missing” bucket in `FacetFieldProcessorByArray`, which had previously been handled by a separate “fieldMissingQuery”.
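In terms of the simplified ordinal-array model used in the sweep sketch above, “inline” collection of the missing bucket just means tallying docs with no value for the field during the same counting pass, rather than counting them with a separate query afterwards. A hedged sketch:

```java
import java.util.BitSet;

public class InlineMissingSketch {
  /**
   * Count per-term facet values over the domain and, in the same pass, tally
   * the "missing" bucket (docs with no value for the field), instead of
   * counting missing docs with a separate query afterwards.
   *
   * @param termOrds  termOrds[doc] = term ordinal for each doc; -1 = no value
   * @param domain    docs in the facet domain
   * @param countsOut per-term counts, sized to the number of unique terms
   * @return the size of the "missing" bucket
   */
  static long countWithMissing(int[] termOrds, BitSet domain, int[] countsOut) {
    long missing = 0;
    for (int doc = domain.nextSetBit(0); doc >= 0; doc = domain.nextSetBit(doc + 1)) {
      int ord = termOrds[doc];
      if (ord < 0) {
        missing++;          // collected inline: no separate "fieldMissingQuery" needed
      } else {
        countsOut[ord]++;
      }
    }
    return missing;
  }
}
```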
EDIT (June 28, 2019): updated terminology to refer to “sweep” facet count collection. This was previously referred to as “parallel” collection, which was misleading in that “parallel” was usually (and understandably) interpreted in the “concurrent” sense.