

news 2025/12/8 21:36:45

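Author: Priscilla Parodi

In this blog post, you will explore various approaches to retrieving information using Elasticsearch, focusing specifically on text: lexical and semantic search.

Lexical and semantic search with Elasticsearch

Search is the process of locating the most relevant information based on your search query or combined queries, and relevant search results are the documents that best match these queries. Although there are several challenges and methods associated with search, the ultimate goal remains the same: to find the best possible answer to your question.

With this goal in mind, in this blog post we will explore different approaches to retrieving information using Elasticsearch, with a specific focus on text search: lexical and semantic search.

Prerequisites

To accomplish this, we will provide Python examples that demonstrate various search scenarios on a dataset generated to simulate e-commerce product information. The dataset contains over 2,500 products, each with a description. These products are divided into 76 distinct product categories, with each category containing a varying number of products, as shown below:

[Figure: treemap visualization of category.keyword, top 22 values of the product categories]

For the setup, you will need:

- Python 3.6 or later
- The Elastic Python client
- An Elastic 8.8 deployment or later, with an 8 GB memory machine learning node
- The Elastic Learned Sparse EncodeR model, which comes preloaded into Elastic, installed and started on your deployment

We will be using Elastic Cloud; a free trial is available.

Besides the search queries provided in this blog post, a Python notebook will guide you through the following processes:

- Establish a connection to our Elastic deployment using the Python client
- Load a text embedding model into the Elasticsearch cluster
- Create an index with mappings for indexing feature vectors and dense vectors
- Create an ingest pipeline with inference processors for text embedding and text expansion

Lexical search - sparse retrieval

The classic way Elasticsearch ranks documents by relevance to a text query uses the Lucene implementation of the BM25 model, a sparse model for lexical search. This method follows the traditional approach to text search, looking for exact term matches.

To make this search possible, Elasticsearch converts text field data into a searchable format by performing text analysis. Text analysis is performed by an analyzer, a set of rules that govern the process of extracting relevant tokens for searching. An analyzer must have exactly one tokenizer. The tokenizer receives a stream of characters and breaks it up into individual tokens (usually individual words), as in the example below.

String tokenization for lexical search

```python
# Performs text analysis on a string and returns the resulting tokens.
# `client` is the Elasticsearch Python client connection created in the notebook.

# Define the text to be analyzed
text = "Comfortable furniture for a large balcony"

# Define the analyze request
request_body = {
    "analyzer": "standard",
    "text": text
}

# Perform the analyze request
response = client.indices.analyze(analyzer=request_body["analyzer"], text=request_body["text"])

# Extract and display the analyzed tokens
tokens = [token["token"] for token in response["tokens"]]
print("Analyzed Tokens:", tokens)
```

The code above outputs:

```
Analyzed Tokens: ['comfortable', 'furniture', 'for', 'a', 'large', 'balcony']
```

In this example, we use the default analyzer, the standard analyzer, which works well for most use cases because it provides English grammar based tokenization. Tokenization enables matching on individual terms, but each token is still matched literally.

If you want to personalize your search experience, you can choose a different built-in analyzer. For example, if you update the code to use the stop analyzer, the text is broken into tokens at any non-letter character, with support for removing stop words.

```python
...
# Define the analyze request
request_body = {
    "analyzer": "stop",
    "text": text
}
...
```

The output is:

```
Analyzed Tokens: ['comfortable', 'furniture', 'large', 'balcony']
```

When the built-in analyzers do not fulfill your needs, you can create a custom analyzer, which uses the appropriate combination of zero or more character filters, a tokenizer, and zero or more token filters.

```
"analyzer": {
    "my_analyzer": {
        "type": "custom",                   # For custom analyzers, use a type of custom or omit the type parameter.
        "tokenizer": "standard",            # Built-in or customized tokenizer
        "filter": ["lowercase", "synonym"]  # Built-in or customized token filters
    }
}
```

In the example above, which combines a tokenizer and token filters, the text is lowercased by the lowercase filter before being processed by the synonym token filter. To learn more about analyzers, see the "分词器介绍" (introduction to tokenizers) section of the article "Elastic开发者上手指南".

Lexical matching

BM25 measures the relevance of documents to a given search query based on the frequency of terms and their importance. The code below performs a match query, searching for up to two documents, considering the "description" field values from the "ecommerce-search" index and the search query "Comfortable furniture for a large balcony". Refining the criteria for a document to be considered a match for this query can improve precision. However, more specific results come at the cost of lower tolerance for variations.

```python
# BM25
response = client.search(
    size=2,
    index="ecommerce-search",
    query={
        "match": {
            "description": {
                "query": "Comfortable furniture for a large balcony",
                "analyzer": "stop"
            }
        }
    }
)

hits = response["hits"]["hits"]

if not hits:
    print("No matches found")
else:
    for hit in hits:
        score = hit["_score"]
        product = hit["_source"]["product"]
        category = hit["_source"]["category"]
        description = hit["_source"]["description"]
        print(f"\nScore: {score}\nProduct: {product}\nCategory: {category}\nDescription: {description}\n")
```

The output is:

```
Score: 15.607948
Product: Barbie Dreamhouse
Category: Toys
Description: is a classic Barbie playset with multiple rooms, furniture, a large balcony, a pool, and accessories. It allows kids to create their dream Barbie world.

Score: 9.137739
Product: Comfortable Rocking Chair
Category: Indoor Furniture
Description: enjoy relaxing moments with this comfortable rocking chair. Its smooth motion and cushioned seat make it an ideal piece of furniture for unwinding.
```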
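To make the BM25 scoring behind the match query more concrete, here is a minimal sketch of the textbook BM25 formula in plain Python. It is not Lucene's implementation, whose details may differ. The `tokenize` helper, the tiny `STOP_WORDS` list, the parameter defaults `k1=1.2` and `b=0.75`, and the three toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter

STOP_WORDS = {"for", "a"}  # hypothetical, tiny stop list for this sketch


def tokenize(text):
    """Lowercase, split at non-letter characters, drop stop words."""
    tokens, current = [], []
    for ch in text.lower():
        if ch.isalpha():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return [t for t in tokens if t not in STOP_WORDS]


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one textbook BM25 score per document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    "A playset with furniture, a large balcony and a pool.",
    "A comfortable rocking chair, an ideal piece of furniture.",
    "A waterproof tent for camping.",
]
toy_scores = bm25_scores("Comfortable furniture for a large balcony", toy_docs)
```

In this toy corpus the first document matches three query terms, two of them rare, so it scores highest. The second matches two terms, and the third matches none and scores zero. The real result set above shows the same pattern.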
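Analyzing the BM25 output, the most relevant result is the "Barbie Dreamhouse" product in the "Toys" category. Its description is highly relevant because it includes the terms "furniture", "large" and "balcony". It is the only product whose description contains three terms matching the search query, and also the only product whose description contains the term "balcony". The second most relevant product is the "Comfortable Rocking Chair", categorized as "Indoor Furniture", whose description includes the terms "comfortable" and "furniture". Only 3 products in the dataset match at least 2 terms of this search query, and this product is one of them. "Comfortable" appears in the descriptions of 105 products, and "furniture" appears in the descriptions of 4 products in 4 different categories: "Toys", "Indoor Furniture", "Outdoor Furniture" and "Cat Supplies Toys".

As you can see, the most relevant product for this query is a toy and the second most relevant is indoor furniture. If you want detailed information about the score computation, to understand why these documents match, you can set the explain query parameter to true.

Both results are the most relevant ones given the number of documents and the occurrences of terms in this dataset. However, the intention behind the query "Comfortable furniture for a large balcony" is to search for furniture for an actual large balcony, which excludes toys and indoor furniture.

Lexical search is relatively simple and fast, but it has limitations: without knowing the user's intention and queries in advance, it is not always possible to anticipate all the possible terms and synonyms. A common phenomenon in the use of natural language is vocabulary mismatch. Research shows that, on average, 80% of the time different people (experts in the same field) will name the same thing differently.

These limitations motivate us to look for other scoring models that incorporate semantic knowledge. Transformer-based models, which excel at processing sequential input tokens such as natural language, capture the underlying meaning of a search by considering mathematical representations of both documents and queries. This allows for a dense, context-aware vector representation of text, powering semantic search, a refined way to find relevant content.

Semantic search - dense retrieval

In this context, after your data has been converted into meaningful vector values, the k-nearest neighbor (kNN) search algorithm is used to find the vector representations in the dataset that are most similar to the query vector. Elasticsearch supports two methods for kNN search: exact brute-force kNN and approximate kNN, also known as ANN. Brute-force kNN guarantees accurate results but does not scale well to large datasets. Approximate kNN efficiently finds approximate nearest neighbors by sacrificing some accuracy for improved performance.

With Lucene's support for kNN search and dense vector indexes, Elasticsearch takes advantage of the Hierarchical Navigable Small World (HNSW) algorithm, which demonstrates strong search performance across a variety of ANN benchmark datasets. An approximate kNN search can be performed in Python using the example code below.

Semantic search with approximate kNN

```python
# KNN - approximate kNN
response = client.search(
    index="ecommerce-search",
    size=2,
    knn={
        "field": "description_vector.predicted_value",
        "k": 50,  # Number of nearest neighbors to return as top hits.
        # The optimal value of k is dependent on the data. It can vary in different scenarios.
        "num_candidates": 500,  # Number of nearest neighbor candidates to consider per shard.
        # Increasing num_candidates tends to improve the accuracy of the final k results.
        "query_vector_builder": {  # Object indicating how to build a query_vector.
            # kNN search enables you to perform semantic search by using a previously deployed
            # text embedding model; the steps for this process are demonstrated in the Python notebook.
            "text_embedding": {
                "model_id": "sentence-transformers__all-mpnet-base-v2",  # Text embedding model id
                "model_text": "Comfortable furniture for a large balcony"  # Query
            }
        }
    }
)

for hit in response["hits"]["hits"]:
    score = hit["_score"]
    product = hit["_source"]["product"]
    category = hit["_source"]["category"]
    description = hit["_source"]["description"]
    print(f"\nScore: {score}\nProduct: {product}\nCategory: {category}\nDescription: {description}\n")
```

Based on the embeddings of the "description" field in the product dataset, this code block uses Elasticsearch's kNN to return up to two products whose descriptions are similar to the vectorized query (query_vector_builder) "Comfortable furniture for a large balcony".

The product embeddings were previously generated in an ingest pipeline, with an inference processor containing the "all-mpnet-base-v2" text embedding model to infer against the data being ingested in the pipeline.

This model was chosen based on an evaluation of pretrained models using "sentence_transformers.evaluation", where different classes are used to assess a model during training. According to the Sentence-Transformers ranking, the "all-mpnet-base-v2" model demonstrated the best average performance and also secured a favorable position on the Massive Text Embedding Benchmark (MTEB) leaderboard. The model was built on the pretrained microsoft/mpnet-base model and fine-tuned on a dataset of 1B sentence pairs; it maps sentences to a 768-dimensional dense vector space.

Alternatively, there are many other models available, especially those fine-tuned on your domain-specific data.

The output of the code above is:

```
Score: 0.79207325
Product: Patio Sofa Set with Ottoman
Category: Outdoor Furniture
Description: is a versatile and comfortable patio sofa set, including a sofa, ottoman, and coffee table, great for outdoor lounging.

Score: 0.7836937
Product: Patio Sofa Set with Canopy
Category: Outdoor Furniture
Description: is a luxurious and comfortable patio sofa set with a canopy, providing shade and style for outdoor lounging.
```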
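For intuition about what the exact (brute-force) variant of kNN computes, the following is a minimal sketch in plain Python that compares a query vector with every document vector using cosine similarity and keeps the top k. The 3-dimensional vectors and product names are made up for illustration; a model such as all-mpnet-base-v2 produces 768-dimensional vectors, and Elasticsearch's own scoring and HNSW-based approximate search are not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def brute_force_knn(query_vector, doc_vectors, k):
    """Score every document against the query and keep the k most similar."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings, hand-written for this sketch.
toy_vectors = {
    "patio sofa set": [0.9, 0.1, 0.2],
    "rocking chair": [0.5, 0.7, 0.1],
    "barbie dreamhouse": [0.1, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

top_two = brute_force_knn(toy_query, toy_vectors, k=2)
for doc_id, similarity in top_two:
    print(f"{doc_id}: {similarity:.3f}")
```

Because every document is scored, the cost grows linearly with the size of the dataset. This is the scaling limitation that approximate kNN trades some accuracy to avoid.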
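The output may vary based on the chosen model, filters and approximate kNN tuning. The kNN search results are both in the "Outdoor Furniture" category, even though the word "outdoor" was not explicitly mentioned in the query, which highlights the importance of semantic understanding in context.

Dense vector search offers several advantages:

- Enables semantic search
- Scalability to handle very large datasets
- Flexibility to handle a wide range of data types

However, dense vector search also comes with its own challenges:

- Selecting the right embedding model for your use case
- Once a model is selected, fine-tuning it may be necessary to optimize performance on a domain-specific dataset, a process that requires the involvement of domain experts
- Additionally, indexing high-dimensional vectors can be computationally expensive

Semantic search - learned sparse retrieval

Let's explore an alternative approach: learned sparse retrieval, another way to perform semantic search. As a sparse model, it uses Elasticsearch's Lucene-based inverted index, which benefits from decades of optimizations. However, this approach goes beyond simply adding synonyms with lexical scoring functions like BM25. Instead, it incorporates learned associations, using deeper language-scale knowledge to optimize for relevance.

By expanding search queries to include relevant terms that are not present in the original query, the Elastic Learned Sparse Encoder improves sparse vector embeddings, as shown in the example below.

Sparse vector search with Elastic Learned Sparse Encoder

```python
# Elastic Learned Sparse Encoder
response = client.search(
    index="ecommerce-search",
    size=2,
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id": "elser_model",
                "model_text": "Comfortable furniture for a large balcony"
            }
        }
    }
)

for hit in response["hits"]["hits"]:
    score = hit["_score"]
    product = hit["_source"]["product"]
    category = hit["_source"]["category"]
    description = hit["_source"]["description"]
    print(f"\nScore: {score}\nProduct: {product}\nCategory: {category}\nDescription: {description}\n")
```

Output:

```
Score: 14.405318
Product: Garden Lounge Set with Side Table
Category: Garden Furniture
Description: is a comfortable and stylish garden lounge set, including a sofa, chairs, and a side table for outdoor relaxation.

Score: 14.281318
Product: Rattan Patio Conversation Set
Category: Outdoor Furniture
Description: is a stylish and comfortable outdoor furniture set, including a sofa, two chairs, and a coffee table, all made of durable rattan material.
```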
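The idea of query expansion over weighted tokens can be illustrated with a small sketch. It assumes that both the query and each document are represented as sparse maps from token to weight, and that relevance is roughly the sum of weight products over shared tokens. All tokens and weights below are invented for illustration and are not actual output of the Elastic Learned Sparse Encoder.

```python
def sparse_dot(query_tokens, doc_tokens):
    """Sum the weight products over tokens present in both sparse vectors."""
    return sum(
        weight * doc_tokens[token]
        for token, weight in query_tokens.items()
        if token in doc_tokens
    )


def rank(query_tokens, docs):
    """Return document names ordered by descending sparse score."""
    return sorted(docs, key=lambda name: sparse_dot(query_tokens, docs[name]), reverse=True)


# Hypothetical weighted tokens stored per document.
toy_docs = {
    "garden lounge set": {"garden": 1.3, "sofa": 1.0, "outdoor": 0.9, "relax": 0.8, "comfortable": 0.7},
    "barbie dreamhouse": {"barbie": 1.5, "toy": 1.2, "balcony": 0.6, "furniture": 0.5},
}

# Only the literal query terms, as in plain lexical matching.
literal_query = {"comfortable": 1.4, "furniture": 1.2, "balcony": 1.1}

# The same query after a hypothetical expansion with related terms.
expanded_query = dict(literal_query, relax=0.6, sofa=0.5, outdoor=0.7)

literal_ranking = rank(literal_query, toy_docs)
expanded_ranking = rank(expanded_query, toy_docs)
```

With only the literal terms, the toy wins because it shares "balcony" and "furniture" with the query. Once related terms such as "sofa" and "outdoor" carry weight, the garden furniture moves to the top. This mirrors the shift from the BM25 results to the learned sparse results in this post.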
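The results in this case include the "Garden Furniture" category, which offers products quite similar to "Outdoor Furniture". By analyzing "ml.tokens", the "rank_features" field containing the tokens generated by learned sparse retrieval, it becomes apparent that among the various tokens generated there are terms that, while not part of the search query, are still relevant in meaning, such as "relax" (comfortable), "sofa" (furniture) and "outdoor" (balcony).

The image below highlights some of these terms alongside the query, both with and without term expansion.

[Figure: query terms with and without term expansion]

As observed, this model provides a context-aware search and helps mitigate the vocabulary mismatch problem while providing more interpretable results. It can even outperform dense vector models when no domain-specific retraining is applied.

Hybrid search: relevant results by combining lexical and semantic search

When it comes to search, there is no universal solution. Each of these retrieval methods has its strengths but also its challenges. Depending on the use case, the best option may change. Often the best results from different retrieval methods are complementary. Hence, to improve relevance, we will look at combining the strengths of each method.

There are multiple ways to implement a hybrid search, including linear combination, giving a weight to each score, and reciprocal rank fusion (RRF), where specifying a weight is not necessary.

Elasticsearch: best of both worlds with lexical and semantic search

```python
# BM25 + Elastic Learned Sparse Encoder (Linear Combination)
response = client.search(
    index="ecommerce-search",
    size=2,
    query={
        "bool": {
            "should": [
                {
                    "match": {
                        "description": {
                            "query": "A dining table and comfortable chairs for a large balcony",
                            "boost": 1
                        }
                    }
                },
                {
                    "text_expansion": {
                        "ml.tokens": {
                            "model_id": "elser_model",
                            "model_text": "A dining table and comfortable chairs for a large balcony",
                            "boost": 1
                        }
                    }
                }
            ]
        }
    }
)
# The boost value is 1 for the text expansion and match query. This means that the relevance
# score of the results of these queries are not boosted. You can specify a boost value to give
# a weight to each score in the sum. The scores will be calculated as:
# score = boost value * match_score + boost value * text_expansion_score

for hit in response["hits"]["hits"]:
    score = hit["_score"]
    product = hit["_source"]["product"]
    category = hit["_source"]["category"]
    description = hit["_source"]["description"]
    print(f"\nScore: {score}\nProduct: {product}\nCategory: {category}\nDescription: {description}\n")
```

In this code, we perform a hybrid search with two queries that both have the value "A dining table and comfortable chairs for a large balcony". Instead of using "furniture" as a search term, we specify what we are looking for, and both searches consider the same field values, "description". The ranking is determined by a linear combination with equal weight for the BM25 and ELSER scores.

Output:

```
Score: 31.628141
Product: Garden Dining Set with Swivel Rockers
Category: Garden Furniture
Description: is a functional and comfortable garden dining set, including a table and chairs with swivel rockers for easy movement.

Score: 31.334227
Product: Garden Dining Set with Swivel Chairs
Category: Garden Furniture
Description: is a functional and comfortable garden dining set, including a table and chairs with swivel seats for convenience.
```

In the code below, we use the same value for the query, but combine the scores from BM25 (query parameter) and kNN (knn parameter) using the reciprocal rank fusion method to combine and rank the documents.

```python
# BM25 + KNN (RRF)
response = client.search(
    index="ecommerce-search",
    size=2,
    query={
        "bool": {
            "should": [
                {
                    "match": {
                        "description": {
                            "query": "A dining table and comfortable chairs for a large balcony"
                        }
                    }
                }
            ]
        }
    },
    knn={
        "field": "description_vector.predicted_value",
        "k": 50,
        "num_candidates": 500,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "sentence-transformers__all-mpnet-base-v2",
                "model_text": "A dining table and comfortable chairs for a large balcony"
            }
        }
    },
    rank={
        "rrf": {  # Reciprocal rank fusion
            "window_size": 50,   # This value determines the size of the individual result sets per query.
            "rank_constant": 20  # This value determines how much influence documents in individual result sets per query have over the final ranked result set.
        }
    }
)

for hit in response["hits"]["hits"]:
    rank = hit["_rank"]
    category = hit["_source"]["category"]
    product = hit["_source"]["product"]
    description = hit["_source"]["description"]
    print(f"\nRank: {rank}\nProduct: {product}\nCategory: {category}\nDescription: {description}\n")
```

The RRF functionality is in technical preview. The syntax may change before general availability.

Output:

```
Rank: 1
Product: Patio Dining Set with Bench
Category: Outdoor Furniture
Description: is a spacious and functional patio dining set, including a dining table, chairs, and a bench for additional seating.

Rank: 2
Product: Garden Dining Set with Swivel Chairs
Category: Garden Furniture
Description: is a functional and comfortable garden dining set, including a table and chairs with swivel seats for convenience.
```

Here we could also use different fields and values; some of these examples are available in the Python notebook.

As you can see, with Elasticsearch you have the best of both worlds: traditional lexical search and vector search, whether sparse or dense, to reach your goal and find the best possible answer to your question.

If you want to continue learning about the approaches mentioned here, these blogs can be useful:

- Improving information retrieval in the Elastic Stack: Hybrid retrieval
- Vector search in Elasticsearch: The rationale behind the design
- How to get the best of lexical and AI-powered search with Elastic's vector database
- Introducing Elastic Learned Sparse Encoder: Elastic's AI model for semantic search
- Improving information retrieval in the Elastic Stack: Introducing Elastic Learned Sparse Encoder, our new retrieval model

Elasticsearch provides a vector database, along with all the tools you need to build vector search:

- Elasticsearch vector database
- Vector search use cases with Elastic

Conclusion

In this blog post, we explored various approaches to retrieving information using Elasticsearch, focusing specifically on text: lexical and semantic search. To demonstrate this, we provided Python examples showcasing different search scenarios on a dataset containing e-commerce product information.

We reviewed the classic lexical search with BM25 and discussed its benefits and challenges, such as vocabulary mismatch. We emphasized the importance of incorporating semantic knowledge to overcome this issue. Additionally, we discussed dense vector search, which enables semantic search, and covered the challenges associated with this retrieval method, including the computational cost of indexing high-dimensional vectors.

On the other hand, we mentioned that sparse vectors compress exceptionally well. Thus, we discussed Elastic's Learned Sparse Encoder, which expands search queries to include relevant terms that are not present in the original query.

There is no one-size-fits-all solution when it comes to search. Each retrieval method has its strengths and challenges. Therefore, we also discussed the concept of hybrid search.

As you can see, with Elasticsearch you can have the best of both worlds: traditional lexical search and vector search.

Ready to get started? Check the available Python notebook and begin a free trial of Elastic Cloud.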
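As a supplementary note on the hybrid search section: reciprocal rank fusion can be sketched in a few lines of plain Python. The sketch assumes the commonly described RRF formula, in which each document receives 1 / (rank_constant + rank) from every result list it appears in, with ranks starting at 1. The two input rankings are invented for illustration, and the `window_size` truncation from the example request is not modeled.

```python
def reciprocal_rank_fusion(result_lists, rank_constant=20):
    """Fuse several ranked lists of document ids into a single ranking."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            # Each appearance contributes more the closer it is to the top.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings from a lexical query and a kNN query.
bm25_ranking = ["garden dining set", "patio dining set", "rocking chair"]
knn_ranking = ["patio dining set", "patio sofa set", "garden dining set"]

fused = reciprocal_rank_fusion([bm25_ranking, knn_ranking])
```

Only ranks enter the formula, not raw scores. This is why RRF does not need per-query weights, and why BM25 scores and kNN similarities can be combined even though they are on different scales.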


http://www.sadfv.cn/news/110371/