Using Machine Learning with Elasticsearch for Smarter Search

It’s no secret that Machine Learning is revolutionizing many industries. This is equally true in search, where companies exhaust themselves capturing nuance through manually tuned search relevance. Mature search organizations want to get past the “good enough” of manual tuning to build smarter, self-learning search systems.

That’s why we’re excited to release the Elasticsearch Learning to Rank Plugin. What is Learning to Rank? With Learning to Rank, a team trains a Machine Learning model to learn what users deem relevant.


When implementing Learning to Rank, you need to:

  • Measure what users deem relevant through analytics to build a judgment list grading documents as exactly relevant, moderately relevant, or not relevant for queries.

  • Hypothesize which features might help predict relevance, such as the TF*IDF of specific field matches, recency, personalization for the searching user, etc.

  • Train a model that can accurately map features to a relevance score.

  • Deploy the model to your search infrastructure, using it to rank search results in production.

Don’t fool yourself. Underneath each of these steps lie complex technical and non-technical problems. There’s still no silver bullet. As we mention in Relevant Search, manual tuning of search results comes with many of the same challenges as a good learning to rank solution. We’ll have more to say about the many infrastructure, technical, and non-technical challenges of mature learning to rank solutions in future blog posts.


In this blog post, I want to tell you about our work to integrate learning to rank within Elasticsearch. Clients ask us in nearly every relevance consulting engagement whether or not this technology can help them. However, while there’s a clear path in Solr thanks to Bloomberg, there hasn’t been one in Elasticsearch. Many clients want the modern affordances of Elasticsearch, but find this a crucial missing piece to selecting the technology for their search stack.

Indeed, Elasticsearch’s Query DSL can rank results with tremendous power and sophistication. A skilled relevance engineer can use the query DSL to compute a broad variety of query-time features that might signal relevance, giving quantitative answers to questions like:

  1. How much is the search term mentioned in the title?

  2. How long ago was the article/movie/etc. published?

  3. How does the document relate to the user’s browsing behaviors?

  4. How expensive is this product relative to a buyer’s expectations?

  5. How conceptually related is the user’s search term to the subject of the article?
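
As an illustration, here is a rough sketch, not taken from the plugin, of how the first two questions might be phrased as query-time features using the Query DSL. They’re written as Python dicts, since the example project drives Elasticsearch from Python, and the field names (title, release_date) are assumptions about a movie index:

# Feature 1: how much is the search term mentioned in the title?
# The _score of this match query is a TF*IDF-style measure of that.
title_feature = {
    "query": {
        "match": {"title": "rambo"}
    }
}

# Feature 2: how long ago was the movie published?
# A gauss decay on the release date turns recency into a score.
recency_feature = {
    "query": {
        "function_score": {
            "query": {"match_all": {}},
            "gauss": {
                "release_date": {"origin": "now", "scale": "365d"}
            }
        }
    }
}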


Many of these features aren’t static properties of the documents in the search engine. Instead, they are query-dependent, meaning that they measure some relationship between the user or their query and a document. To readers of Relevant Search, this is what we term signals in that book.

So, the question becomes, how can we marry the power of machine learning with the existing power of the Elasticsearch Query DSL? That’s exactly what our plugin does: use Elasticsearch Query DSL queries as feature inputs to a Machine Learning model.


How Does It Work?

The plugin integrates RankLib and Elasticsearch. RankLib takes as input a file with judgments, lets you train a model either programmatically or via the command line, and outputs the model in its own native, human-readable format. Once you have a model, the Elasticsearch plugin contains the following:

  • A custom Elasticsearch script language called ranklib that can accept ranklib-generated models as Elasticsearch scripts.

  • A custom ltr query that takes a list of Query DSL queries (the features) and a model name (the model uploaded as a ranklib script above) and scores results.

As learning to rank models can be expensive to execute, you almost never want to use the ltr query directly. Rather, you would rescore the top N results, like so:

{
  "query": { /* a simple base query goes here */ },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "ltr": {
          "model": {
            "stored": "dummy"
          },
          "features": [{
            "match": {
              "title": "< users keyword search >"
            }
          } /* ... one Query DSL query per feature ... */ ]
        }
      }
    }
  }
}
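
As a usage sketch (not from the plugin’s docs; it assumes a local cluster with the plugin installed, the example’s tmdb index, and a stored model named dummy), the same rescore search could be issued from Python with the elasticsearch client that the example scripts already use:

from elasticsearch import Elasticsearch

es = Elasticsearch()   # assumes Elasticsearch is running on localhost:9200

keywords = "rambo"     # the user's keyword search
search = {
    "query": {"match": {"title": keywords}},   # simple base query selects the top N
    "rescore": {
        "window_size": 100,
        "query": {
            "rescore_query": {
                "ltr": {
                    "model": {"stored": "dummy"},
                    "features": [
                        {"match": {"title": keywords}}
                        # ... one Query DSL query per feature, in training order
                    ]
                }
            }
        }
    }
}

response = es.search(index="tmdb", body=search)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))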

You can dig into a fully functioning example in the scripts directory of the project. It’s a canned example, using hand-created judgments of movies from TMDB. I use an Elasticsearch index with TMDB to execute queries corresponding to features, augment a judgment file with the relevance scores of those queries and features, and train a Ranklib model at the command line. I store the model in Elasticsearch and provide a script to search using the model.

Don’t be fooled by the simplicity of this example. A real learning to rank solution involves a tremendous amount of work, including studying users, processing analytics, data engineering, and feature engineering. I say that not to dissuade you, because the payoff can be worth it; just know what you’re getting into. Smaller organizations might still do better with the ROI of hand-tuned results.


Training and Loading the Learning to Rank Model

Let’s start with the hand-created, minimal judgment list I’ve provided to show how our example trains a model.

Ranklib judgment lists come in a fairly standard format. The first column contains the judgment (0-4) for a document. The next column is a query id, such as “qid:1.” The subsequent columns contain the values of the features associated with that query-document pair. On the left-hand side of each colon is the 1-based index of the feature; to the right is the value for that feature. The example in the Ranklib README is:

3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A

Notice also the comment (# 1A, etc.). That comment is the document identifier for this judgment. The document identifier isn’t needed by Ranklib, but it’s fairly handy for human readers. As we’ll see, it’s useful for us as well when we gather features via Elasticsearch queries.
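
To make the format concrete, here’s a minimal parsing sketch in Python (not part of Ranklib or the plugin):

def parse_judgment_line(line):
    # "3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A" -> (3, 1, {1: 1.0, ...}, "1A")
    data, _, comment = line.partition("#")
    tokens = data.split()
    grade = int(tokens[0])
    qid = int(tokens[1].split(":")[1])
    features = {int(idx): float(val)
                for idx, val in (t.split(":") for t in tokens[2:])}
    return grade, qid, features, comment.strip()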


Our example starts with a minimal version of the above file (seen here). We need to start with a trimmed-down version of the judgment file that simply has a grade, query id, and document id tuple. Like so:

4 qid:1 # 7555
3 qid:1 # 1370
3 qid:1 # 1369
3 qid:1 # 1368
0 qid:1 # 136278
...

As above, we provide the Elasticsearch _id for the graded document as the comment on each line.

We need to enhance this a bit further. We must map each query id (qid:1) to an actual keyword query (“Rambo”) so we can use the keyword to generate feature values. We provide this mapping in the header which the example code will pull out:

# Add your keyword strings below, the feature script will
# use them to populate your query templates
#
# qid:1: rambo
# qid:2: rocky
# qid:3: bullwinkle
#
# https://sourceforge.net/p/lemur/wiki/RankLib%20File%20Format/
#
4 qid:1 # 7555
3 qid:1 # 1370
3 qid:1 # 1369
3 qid:1 # 1368
0 qid:1 # 136278
...
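
Here’s a rough sketch of how that header could be pulled out; the real parsing lives in the example’s scripts, and the file name below is just a placeholder:

import re

def parse_keywords(judgments_file):
    # Collect the "# qid:1: rambo" header lines into {1: "rambo", ...}
    keywords = {}
    header = re.compile(r"#\s*qid:(\d+):\s*(.+)")
    with open(judgments_file) as f:
        for line in f:
            match = header.match(line.strip())
            if match:
                keywords[int(match.group(1))] = match.group(2).strip()
    return keywords

print(parse_keywords("sample_judgments.txt"))   # hypothetical file name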

To help clear up some confusion, I’m going to start talking about ranklib “queries” (the qid:1, etc.) as “keywords” to differentiate them from Elasticsearch Query DSL “queries,” which are Elasticsearch-specific constructs used to generate feature values.

What’s above isn’t a complete Ranklib judgment list. It’s just a minimal sample of relevance grades for given documents for a given keyword search. To be a fully-fledged training set, it needs to include the feature values shown earlier, the 1:0 2:1 … entries included on each line of the first judgment list shown.


To generate those feature values, we also need to have proposed features that might correspond to relevance for movies. These, as we said, are Elasticsearch queries. The scores of these Elasticsearch queries will finish filling out the judgment list above. In the example, we do this using a jinja template corresponding to each feature number. For example, the file 1.json.jinja is the following Query DSL query:

{ "query": { "match": { "title": "" } } }

In other words, we’ve decided that feature 1 for our movie search system ought to be the TF*IDF relevance score for the user’s keywords when matched against the title field. There’s also 2.json.jinja, which performs a more complex search across multiple text fields:

{ "query": { "multi_match": { "query": "", "type": "cross_fields", "fields": ["overview", "genres.name", "title", "tagline", "belongs_to_collection.name", "cast.name", "directors.name"], "tie_breaker": 1.0 } } }

Part of the fun of learning to rank is hypothesizing what features might correlate with relevance. In the example, you can change features 1 and 2 to any Elasticsearch query. You can also experiment by adding additional features, 3 through however many you like. There are problems with too many features, though, as you’ll want enough representative training samples to cover all reasonable feature values. We’ll discuss more about training and testing learning to rank models in a future blog post.
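
To connect these templates back to the judgment list, here’s a rough sketch, loosely in the spirit of what the example’s train.py does, of rendering a feature template and logging its score for one graded keyword/document pair. The template variable name keywords and the helper function are assumptions of this sketch:

import json
from elasticsearch import Elasticsearch
from jinja2 import Template

es = Elasticsearch()

def feature_value(index, doc_id, keywords, template_file):
    # Render the feature's Query DSL template for this keyword search...
    rendered = json.loads(Template(open(template_file).read()).render(keywords=keywords))
    # ...then restrict it to the single graded document, so the returned
    # _score is the feature value for this keyword/document pair.
    body = {
        "query": {
            "bool": {
                "must": rendered["query"],
                "filter": {"ids": {"values": [doc_id]}}
            }
        }
    }
    hits = es.search(index=index, body=body)["hits"]["hits"]
    return hits[0]["_score"] if hits else 0.0

# Feature 1 for keyword "rambo" and graded document 7555 (from the judgment list above)
print(feature_value("tmdb", "7555", "rambo", "1.json.jinja"))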


With these two ingredients, the minimal judgment list and a set of proposed Query DSL queries/features, we need to generate a fully fleshed-out judgment list for Ranklib and load the Ranklib-generated model into Elasticsearch to be used. This means:

  1. Getting relevance scores for features for each keyword/document pair, that is, issuing queries to Elasticsearch to log relevance scores.

  2. Outputting a full judgment file not only with grades and keyword query ids but also with the feature values from step 1.

  3. Running Ranklib to train the model.

  4. Loading the model into Elasticsearch for use at search time.

The code to do this is all bundled up in train.py, which I encourage you to take apart. To run it, you’ll need:

  • RankLib.jar downloaded to the scripts folder.

  • The Python packages elasticsearch and Jinja2 installed (there’s a requirements.txt if you’re familiar with that workflow).
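
To give a feel for steps 3 and 4, here’s a heavily hedged sketch. The RankLib flags are standard (-ranker 6 selects LambdaMART), but the stored-script endpoint below assumes the Elasticsearch 5.x API this post targets, the file names are placeholders, and the model name dummy simply matches the rescore example earlier; consult the project’s README for the exact calls:

import subprocess
import requests

# Step 3: train a LambdaMART model with RankLib from the judgment file
# that now contains feature values (file names here are placeholders).
subprocess.run([
    "java", "-jar", "RankLib.jar",
    "-train", "judgments_with_features.txt",
    "-ranker", "6",              # 6 = LambdaMART
    "-save", "model.txt",
])

# Step 4: store the trained model in Elasticsearch as a "ranklib" script,
# so the ltr query can reference it by name at search time.
model = open("model.txt").read()
requests.post(
    "http://localhost:9200/_scripts/ranklib/dummy",
    json={"script": model},
)

Once the model is stored, the ltr rescore query shown earlier can reference it by name.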
