Sort by SKG/relatedness over high-cardinality fields
One of the most intriguing uses for the Solr JSON Facet API’s SKG/`relatedness` function is sorting facet buckets by relatedness; when buckets are sorted by relatedness, the `FacetFieldProcessor` must calculate relatedness for every bucket.

The current `FacetFieldProcessorByArray` implementation (as of Solr 7.6) uses a standard uninverted approach (either docValues or `UnInvertedField`) to calculate facet counts over the domain base docSet, and then uses that initial pass as a pre-filter for a second, inverted pass that fetches a docSet for each relevant term (i.e., count > minCount) and calculates the intersection size of each of those docSets with the domain base docSet. Over high-cardinality domains and fields, the overhead of per-term docSet creation and set intersection drives request latency to the point where sort-by-relatedness is impractical for many use cases.
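For concreteness, here is a minimal SolrJ sketch of the kind of request in question: a terms facet whose buckets are sorted by a `relatedness($fore,$back)` aggregation. The collection URL, the `concepts` field, and the foreground query are hypothetical placeholders, not details from the use case described below.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RelatednessSortExample {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build()) {
      SolrQuery q = new SolrQuery("*:*");     // domain base docSet
      q.setRows(0);
      q.add("fore", "category:medicine");     // foreground set (placeholder query)
      q.add("back", "*:*");                   // background set
      // Terms facet over a (hypothetical) high-cardinality "concepts" field,
      // with buckets sorted by the SKG relatedness aggregation.
      q.add("json.facet", "{"
          + " concepts: {"
          + "   type: terms,"
          + "   field: concepts,"
          + "   limit: 10,"
          + "   sort: { r: desc },"
          + "   facet: { r: \"relatedness($fore,$back)\" }"
          + " }"
          + "}");
      QueryResponse rsp = client.query(q);
      System.out.println(rsp.getResponse().get("facets"));
    }
  }
}
```

Because the primary sort is the `r` aggregation, relatedness must be computed for every bucket of `concepts` before the top 10 can be selected, which is what makes high-cardinality fields expensive here.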
Initial experiments: fix filterCache thrashing
Initial experiments with sort-by-relatedness in a high-cardinality context revealed that the current implementation leans heavily on the `filterCache`, effectively blowing it out over even modestly high-cardinality fields. SOLR-13108 aims to address this thrashing issue by respecting `cacheDf`.
Single sweep facet counts over domain, foreground, and background sets
Even with the above `filterCache` thrashing patch (SOLR-13108) applied, the current implementation continues to rely on per-term docSet creation and set intersection, and performance for my use case was not good enough for production: for a field with ~220k unique terms per core, QTimes over high-cardinality domain docSets were, e.g., 9000ms at domain cardinality 1816684 and 18000ms at cardinality 5032902.
In considering how to more efficiently determine docFreq (to evaluate against the `cacheDf` setting and so decide whether to consult the `filterCache`), I realized that it would be faster to simply piggyback on the existing field-type-optimized `FacetFieldProcessorByArray` code to calculate facet counts simultaneously over the facet domain, foreground, and background sets. The resulting patch is proposed at SOLR-13132, and improves performance in the above-mentioned use case by as much as 10x-60x (to consistently somewhere in the range of 250-300ms per request).
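As a rough, self-contained illustration of the single-sweep idea (a simplified sketch, not the SOLR-13132 patch, which works against Lucene’s segment-level ordinal structures): one pass over the documents of interest increments per-term counters for whichever of the domain, foreground, and background sets each document belongs to, with no per-term docSet creation or set intersection.

```java
import java.util.BitSet;

public class SweepCountSketch {
  /**
   * Single-sweep facet counting over three sets at once.
   *
   * @param termOrds termOrds[doc] = term ordinal for each doc (single-valued field; -1 = no value)
   * @param domain   docs in the facet domain
   * @param fore     docs in the foreground set
   * @param back     docs in the background set (typically a superset, e.g. everything)
   * @param numTerms number of unique terms in the field
   * @return {domainCounts, foreCounts, backCounts}, each indexed by term ordinal
   */
  static int[][] sweepCounts(int[] termOrds, BitSet domain, BitSet fore, BitSet back, int numTerms) {
    int[] domainCounts = new int[numTerms];
    int[] foreCounts = new int[numTerms];
    int[] backCounts = new int[numTerms];
    // One pass over the union of the three sets; each doc bumps the counter
    // of every set it belongs to. No per-term docSets, no set intersections.
    BitSet union = (BitSet) domain.clone();
    union.or(fore);
    union.or(back);
    for (int doc = union.nextSetBit(0); doc >= 0; doc = union.nextSetBit(doc + 1)) {
      int ord = termOrds[doc];
      if (ord < 0) continue;                 // doc has no value for this field
      if (domain.get(doc)) domainCounts[ord]++;
      if (fore.get(doc)) foreCounts[ord]++;
      if (back.get(doc)) backCounts[ord]++;
    }
    return new int[][] {domainCounts, foreCounts, backCounts};
  }
}
```

The per-term counts for the three sets are then available for computing relatedness for every bucket in a single pass over the counter arrays.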
Why just for FacetFieldProcessorByArray?
Sweep counts are only relevant where relatedness is required for primary sort, and thus must be calculated for all buckets.
Of the four concrete implementations of `FacetFieldProcessor` (`FacetFieldProcessorByArrayDV`, `FacetFieldProcessorByArrayUIF`, `FacetFieldProcessorByEnumTermsStream`, and `FacetFieldProcessorByHashDV`), only the first two (each a subclass of `FacetFieldProcessorByArray`) are supported.
Because `FacetFieldProcessorByEnumTermsStream` is sorted strictly in “index” order, sort-by-relatedness would not be applicable, so sweep count collection would be pointless and is not supported.

`FacetFieldProcessorByHashDV` is used in high-cardinality field contexts; however, it is only beneficial when domain cardinality is low. Because sweep collection effectively induces a composite domain that is a union with the background set (which is normally high-cardinality), sweep count collection would in most cases be an antipattern, and is not currently supported.
How is this good?
Aside from being an order of magnitude faster, why is this a good thing? One alternative might be to simply avoid sorting by `relatedness` over high-cardinality domains and/or fields. For heavily normalized/stemmed plain-text fields, one might end up with a manageable number of unique terms, and could probably achieve reasonable sort-by-relatedness performance. But if what we’re after is meaningful semantic connections, it could be desirable to sort by `relatedness` over terms that represent semantic concepts (topics, names, etc.), as opposed to simple plain-text tokens, and it would be nice not to be unnecessarily limited in the cardinality of our semantic concepts.
Further opportunities: termFacetCache!
Something that has been in the back of my mind (and actually running in production for over a year, for `DocValuesFacets`) is a cache for term facet counts.

The current array-based docValues and `UnInvertedField` term faceting implementations recalculate facet counts for every request. This means that if you’re faceting over a field with a million unique values per core, at a minimum the JVM must allocate a new `int[1000000]` for every request and populate it by iterating over all term values for all documents in the facet domain. It is a testament to Solr and Lucene that this can be done in a relatively performant manner; but even so, it would seem to be a prime candidate to benefit from even a very small cache, to catch repeated variant requests and very common top-level requests over common high-cardinality domains.
Such a cache would be particularly beneficial to `relatedness`, because with the new approach described here, we’re calculating facets over the “background” set for every request. The background set is usually relatively high-cardinality (e.g., `*:*`), and thus (with respect to the number of docs visited, and corresponding terms) every facet request has the performance characteristics of a facet request over the high-cardinality background set.
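To make the idea concrete, here is a minimal sketch of such a cache, assuming a hypothetical key of (field, domain): an access-ordered LRU map from that key to the per-term count array. This is not the actual termFacetCache implementation, nor Solr’s cache API, just the general shape of the thing.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Supplier;

public class TermFacetCountCache {

  /** Cache key: the faceted field plus some stable representation of the domain
   *  (e.g. the query/filter that produced the domain docSet). Hypothetical. */
  static final class Key {
    final String field;
    final String domainKey;
    Key(String field, String domainKey) {
      this.field = field;
      this.domainKey = domainKey;
    }
    @Override public boolean equals(Object o) {
      if (!(o instanceof Key)) return false;
      Key k = (Key) o;
      return field.equals(k.field) && domainKey.equals(k.domainKey);
    }
    @Override public int hashCode() {
      return Objects.hash(field, domainKey);
    }
  }

  private final Map<Key, int[]> cache;

  TermFacetCountCache(final int maxEntries) {
    // Access-ordered LinkedHashMap gives simple LRU eviction; even a very small
    // maxEntries catches the repeated background-set counts described above.
    this.cache = new LinkedHashMap<Key, int[]>(16, 0.75f, true) {
      @Override protected boolean removeEldestEntry(Map.Entry<Key, int[]> eldest) {
        return size() > maxEntries;
      }
    };
  }

  /** Return cached per-term counts for (field, domain), computing them once if absent. */
  synchronized int[] getOrCompute(Key key, Supplier<int[]> countAllTerms) {
    return cache.computeIfAbsent(key, k -> countAllTerms.get());
  }
}
```

Keyed this way, counts for the background set (whose key does not change from request to request) are computed once and then reused, and even a handful of entries is enough to catch repeated variant requests.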
Rough benchmarks for a functional implementation of such a cache running in production (cache-compatible and shared across JSON facet `FacetFieldProcessorByArray` and simple `DocValuesFacets`) are summarized in the table below. For these benchmarks, the index has ~1.5 million docs per core, and we are faceting over a field with ~220k unique values per core. For a given “recall” (the percentage of documents returned by a simple query, not “recall” in the IR sense), we give the approximate best-case QTime (in milliseconds) for a facet response over the test field, sorted by relatedness. The foreground is the same as the base domain, and the background set is `*:*`. Each query was run ~20 times across a static, non-optimized six-node cluster, enough to get a sense of the best-case response time; QTimes converged pretty tightly and quickly, fwiw. For these benchmarks we explicitly wanted to run identical commands in quick succession (2 seconds apart), because we’re looking for best-case latency and so want to take advantage of all configured caches. QTimes are listed alongside a rough latency-reduction factor (with respect to the extant-implementation baseline).
- “recall”: approximate percentage of docs returned by the domain query
- “extant”: the existing implementation (with the `filterCache` patch from SOLR-13108 applied)
- “sweep”: the proposed modification, collecting facet counts for the domain, foreground, and background sets in a single sweep, with no facet count caching
- “cache bg”: “sweep” with the facet count cache added (configured to cache counts for the background set only)
- “cache all”: “sweep” with the facet count cache added (configured to cache counts for the domain, foreground, and background sets)
| recall | extant (ms) | sweep (ms) | cache bg (ms) | cache all (ms) |
|---|---|---|---|---|
| 50% | 16000 | 250 (64x) | 135 (118x) | 35 (457x) |
| 20% | 8000 | 230 (35x) | 85 (94x) | 30 (267x) |
| 1% | 1300 | 230 (5x) | 25 (52x) | 20 (65x) |
The improvement for high-recall queries is particularly significant, because high-recall “worst-case” queries are likely to be a determining factor in evaluating the suitability of the `relatedness` feature for production deployment.
Inline collection of “missing” bucket in FacetFieldProcessorByArray
In order to make the `termFacetCache` compatible for use by both `DocValuesFacets` and `FacetFieldProcessorByArray`, I found it convenient to implement inline collection of the “missing” bucket in `FacetFieldProcessorByArray`, which had previously been handled by a separate “fieldMissingQuery”.
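In terms of the simplified ordinal-array model used in the sweep sketch above, “inline” collection of the missing bucket just means tallying docs with no value for the field during the same counting pass, rather than counting them with a separate query afterwards. A hedged sketch:

```java
import java.util.BitSet;

public class InlineMissingSketch {
  /**
   * Count per-term facet values over the domain and, in the same pass, tally
   * the "missing" bucket (docs with no value for the field), instead of
   * counting missing docs with a separate query afterwards.
   *
   * @param termOrds  termOrds[doc] = term ordinal for each doc; -1 = no value
   * @param domain    docs in the facet domain
   * @param countsOut per-term counts, sized to the number of unique terms
   * @return the size of the "missing" bucket
   */
  static long countWithMissing(int[] termOrds, BitSet domain, int[] countsOut) {
    long missing = 0;
    for (int doc = domain.nextSetBit(0); doc >= 0; doc = domain.nextSetBit(doc + 1)) {
      int ord = termOrds[doc];
      if (ord < 0) {
        missing++;          // collected inline: no separate "fieldMissingQuery" needed
      } else {
        countsOut[ord]++;
      }
    }
    return missing;
  }
}
```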
EDIT (June 28, 2019): updated terminology to refer to “sweep” facet count collection. This was previously referred to as “parallel” collection, which was misleading in that “parallel” was usually (and understandably) interpreted in the “concurrent” sense.