Apache Lucene 8 was released a few weeks ago with lots of exciting new features and improvements.

When executing a search in Lucene 7, the scoring code will visit every document that matches the query, yielding both the top k highest scoring hits and an accurate count of the number of documents that matched. In many circumstances, an accurate count is unnecessary, and for queries that match a large number of documents, significant time is spent counting and scoring documents that will not end up in the top hits. Lucene 8 introduces a new API that allows you to opt out of this counting, returning instead a lower bound of the number of documents that match. This allows the introduction of a number of shortcuts, speeding up query execution.

The idea that kicked off all these query speedups was first proposed back in 2012, and involves adding new information to the index, making it possible to calculate maximum scores for blocks of documents. In general, the values that contribute to a document's score for any given query can be split into global factors (things like the total term frequency or average document length), and per-document per-term factors, known as impacts. These take the form of a pair of numbers: the length of the document (compressed down into a single byte, known as a 'norm'), and the frequency of the term in that document.

Lucene already divides indexing information for any given term into blocks, and builds a parallel structure called a skip list to allow queries to efficiently jump over documents that we know won't match a query. By adding a summary of the highest impacts in a block to that skip list, it's possible to calculate the largest score that could be produced by that block, and to skip over it entirely if the score is not competitive. Skip lists are much smaller and more efficient to decode than the postings lists they refer to, so the ability to avoid reading blocks altogether can yield enormous speedups for queries that touch a lot of documents. More details can be found in this blog post about faster retrieval of top hits.

The standard indexing chain stores term frequencies in the impacts list, but an impact is just a pair of numbers, and we can put any information we like in there. Lucene 8 provides a new field type called a FeatureField that uses term frequencies to encode numerical data, and exposes special queries that use this information for scoring. These queries can then implement the same skipping shortcuts as described above, resulting in very efficient custom-scoring queries. Elasticsearch makes these available via the rank_feature and rank_features fields, as described in this relevance tuning blog.

As well as simple boosts, you can also score by recency or proximity using distance feature queries. These skip non-competitive hits in a slightly different way: we can convert a minimum competitive score into a bounding box that excludes documents which are too far from the origin to make it into the top k hits. Elasticsearch will provide these via the new distance feature query, due to be included in version 7.1.

Lucene provides a data structure called a docvalue that allows efficient per-document lookup, used for things like sorting or faceting. When this was first added to Lucene back in version 4.0, it was implemented as a straight look-up table with a fixed size for every entry, which allows for very fast access but has a large footprint on disk. It also makes it difficult to distinguish between documents that have an empty value and documents that have no value at all. To deal with this, Lucene 7 changed the doc values API to use iterators, which allows fields where few documents have values to be stored in much less space. The trade-off here, however, is that access to a particular document's value can be slower, as it's no longer a simple case of looking for the value at a particular address.

To help improve this access speed, Lucene 8 adds jump tables to the default codec, similar to the skip lists described above. This makes looking up a value for a particular document much faster, as rather than decoding blocks of values in sequence until the correct document is reached, we can now jump directly to the correct block.

Reading large index files can use a lot of memory. Lucene tries to delegate as much of this memory management to the operating system as it can by loading index data using memory mapping.
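To make the idea of skipping non-competitive blocks a little more concrete, here is a minimal Python sketch. It is a toy model and not Lucene's codec or scorer: the scoring function is a simplified BM25-style formula, the constants `K1`, `B` and `AVG_LEN` are illustrative, and the names `build_blocks` and `top_k` are hypothetical. Each block's summary is reduced to a single (highest frequency, shortest document length) pair, which is an assumption made for brevity. Because the score grows with term frequency and shrinks with document length, that pair gives an upper bound on any score in the block.

```python
import heapq

# Illustrative constants for a simplified BM25-style score (assumptions, not Lucene defaults).
K1, B, AVG_LEN = 1.2, 0.75, 100.0


def score(freq, doc_len):
    """Toy per-term score: increases with freq, decreases with doc_len."""
    return freq / (freq + K1 * (1 - B + B * doc_len / AVG_LEN))


def build_blocks(postings, block_size):
    """Split (doc_id, freq, doc_len) postings into blocks, each with an impact summary."""
    blocks = []
    for i in range(0, len(postings), block_size):
        chunk = postings[i:i + block_size]
        # Summary: highest frequency and shortest length seen in the block.
        summary = (max(f for _, f, _ in chunk), min(n for _, _, n in chunk))
        blocks.append((summary, chunk))
    return blocks


def top_k(blocks, k):
    """Return (hits, counted, skipped); counted is only a lower bound on matches."""
    heap = []  # min-heap of (score, doc_id); heap[0] is the weakest hit kept so far
    counted = skipped = 0
    for (max_freq, min_len), chunk in blocks:
        # If even the best possible score in this block cannot beat the
        # current k-th best hit, skip the block without reading its postings.
        if len(heap) == k and score(max_freq, min_len) <= heap[0][0]:
            skipped += 1
            continue
        for doc_id, freq, doc_len in chunk:
            counted += 1
            s = score(freq, doc_len)
            if len(heap) < k:
                heapq.heappush(heap, (s, doc_id))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, doc_id))
    hits = sorted(heap, key=lambda h: (-h[0], h[1]))
    return hits, counted, skipped


# Hypothetical postings for one term: (doc_id, term frequency, document length).
POSTINGS = [
    (0, 9, 50), (1, 8, 60), (2, 7, 80),
    (3, 1, 200), (4, 1, 300), (5, 2, 250),
    (6, 1, 400), (7, 1, 150), (8, 6, 70),
    (9, 1, 500), (10, 1, 350), (11, 2, 450),
]

hits, counted, skipped = top_k(build_blocks(POSTINGS, block_size=3), k=2)
print([doc_id for _, doc_id in hits], counted, skipped)
```

With this sample data, the first block fills the top 2 and the remaining three blocks are skipped, so only 3 of the 12 matching documents are ever counted. That is why a search that takes these shortcuts can only report a lower bound on the number of matches rather than an exact total.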