Text - Huawei Cloud

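Data Warehouse Service GaussDB(DWS) - Text Search Operators: ||

||

Description: Combines two tsquery values with a logical OR. When both operands are tsvector values, as in the second example, || concatenates them, and the positions of the right-hand operand are shifted past the last position of the left-hand operand.

Examples:

    SELECT 'fat | rat'::tsquery || 'cat'::tsquery AS RESULT;
              result
    ---------------------------
     ( 'fat' | 'rat' ) | 'cat'
    (1 row)

    SELECT 'a:1 b:2'::tsvector || 'c:1 d:2 b:3'::tsvector AS RESULT;
              result
    ---------------------------
     'a':1 'b':2,5 'c':3 'd':4
    (1 row)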
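The position shift in the tsvector example can be reproduced with a small model. The following is only an illustrative sketch, not GaussDB(DWS) code: the class `TsVectorConcatSketch` and its methods are hypothetical names, and the rule it applies (add the largest left-hand position to every right-hand position) is inferred from the example output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of tsvector || tsvector; all names here are hypothetical.
class TsVectorConcatSketch {

    // Merges two lexeme -> positions maps. Positions from the right operand
    // are shifted by the largest position found in the left operand.
    static TreeMap<String, List<Integer>> concat(Map<String, List<Integer>> left,
                                                 Map<String, List<Integer>> right) {
        int offset = 0;
        for (List<Integer> positions : left.values()) {
            for (int p : positions) {
                offset = Math.max(offset, p);
            }
        }
        TreeMap<String, List<Integer>> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : left.entrySet()) {
            result.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        for (Map.Entry<String, List<Integer>> e : right.entrySet()) {
            List<Integer> merged = result.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
            for (int p : e.getValue()) {
                merged.add(p + offset);
            }
        }
        return result;
    }

    // Renders the map in the 'lexeme':pos,pos style used by the SQL output.
    static String format(Map<String, List<Integer>> vector) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<Integer>> e : vector.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('\'').append(e.getKey()).append("':");
            for (int i = 0; i < e.getValue().size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(e.getValue().get(i));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> left = Map.of("a", List.of(1), "b", List.of(2));
        Map<String, List<Integer>> right = Map.of("c", List.of(1), "d", List.of(2), "b", List.of(3));
        System.out.println(format(concat(left, right)));
    }
}
```

With the operands from the SQL example, the sketch prints 'a':1 'b':2,5 'c':3 'd':4, which agrees with the result shown above.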


Data Warehouse Service GaussDB(DWS) - Testing a Configuration

Testing a Configuration

The function ts_debug allows easy testing of a text search configuration.

    ts_debug([ config regconfig, ] document text,
             OUT alias text,
             OUT description text,
             OUT token text,
             OUT dictionaries regdictionary[],
             OUT dictionary regdictionary,
             OUT lexemes text[])
             returns setof record

ts_debug displays information about every token of document, as produced by the parser and processed by the configured dictionaries. It uses the configuration specified by config, or the one specified by default_text_search_config if that argument is omitted.

ts_debug returns one row for each token identified in the text by the parser. The columns are:

alias: text. Short name of the token type.
description: text. Description of the token type.
token: text. Text of the token.
dictionaries: regdictionary[]. The dictionaries selected by the configuration for this token type.
dictionary: regdictionary. The dictionary that recognized the token, or NULL if none did.
lexemes: text[]. The lexemes produced by the dictionary that recognized the token, or NULL if none did. An empty array ({}) means the token was recognized as a stop word.

A simple example:

    SELECT * FROM ts_debug('english','a fat cat sat on a mat - it ate a fat rats');
       alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
    -----------+-----------------+-------+----------------+--------------+---------
     asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | cat   | {english_stem} | english_stem | {cat}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | sat   | {english_stem} | english_stem | {sat}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | on    | {english_stem} | english_stem | {}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | mat   | {english_stem} | english_stem | {mat}
     blank     | Space symbols   |       | {}             |              |
     blank     | Space symbols   | -     | {}             |              |
     asciiword | Word, all ASCII | it    | {english_stem} | english_stem | {}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | ate   | {english_stem} | english_stem | {ate}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | rats  | {english_stem} | english_stem | {rat}
    (24 rows)


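Data Warehouse Service GaussDB(DWS) - Parsing Documents

Parsing Documents

GaussDB(DWS) provides the function to_tsvector for converting a document to the tsvector data type.

    to_tsvector([ config regconfig, ] document text) returns tsvector

to_tsvector parses a text document into tokens, reduces the tokens to lexemes, and returns a tsvector that lists the lexemes together with their positions in the document. The document is processed according to the specified or default text search configuration. A simple example:

    SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
                         to_tsvector
    -----------------------------------------------------
     'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4

In this example, the resulting tsvector does not contain the words a, on, or it, the word rats became rat, and the punctuation sign - was ignored.

Internally, to_tsvector calls a parser that breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries is consulted, and the list can vary depending on the token type. The first dictionary that recognizes the token emits one or more normalized lexemes to represent it. For example, rats became rat because one of the dictionaries recognized that rats is a plural form of rat. Some words are recognized as stop words (see Stop Words) and are ignored because they occur too frequently to be useful in searching. In the example these are a, on, and it. If no dictionary in the list recognizes the token, it is also ignored. In the example this happened to the punctuation sign -, because no dictionaries are assigned to its token type (Space symbols), which means space tokens are never indexed.

The parser, the dictionaries, and the token types to index are determined by the selected text search configuration. A database can contain many different configurations, and predefined configurations are available for various languages. The example uses the default configuration english.

The function setweight labels the entries of a tsvector with a weight, which is one of the letters A, B, C, or D. This is typically used to mark entries that come from different parts of a document, such as the title and the body. The weights can later be used for ranking search results.

Because to_tsvector(NULL) returns NULL, use coalesce whenever a field might be NULL. The following shows how to create a tsvector from a structured document (the tsearch schema must already exist):

    CREATE TABLE tsearch.tt (id int, title text, keyword text, abstract text, body text, ti tsvector);

    INSERT INTO tsearch.tt(id, title, keyword, abstract, body) VALUES (1, 'book', 'literature', 'Ancient poetry', 'Tang poem Song iambic verse');

    UPDATE tsearch.tt SET ti =
        setweight(to_tsvector(coalesce(title,'')), 'A')    ||
        setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
        setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
        setweight(to_tsvector(coalesce(body,'')), 'D');

    DROP TABLE tsearch.tt;

The example uses setweight to label the source of each lexeme in the finished tsvector, and then merges the labeled tsvector values with the tsvector concatenation operator ||. The section Manipulating tsvector describes these operations in detail.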
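To make the parse, stop-word, and normalization steps concrete, here is a deliberately simplified model. It is a sketch under stated assumptions and not how the english configuration is implemented: the class `ToyTsVector`, its three-word stop list, and its "strip a trailing s" stemmer are all hypothetical stand-ins chosen only to reproduce the sample sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model of the to_tsvector pipeline; all names here are hypothetical.
class ToyTsVector {

    // Assumption: a tiny stop list, just enough for the sample sentence.
    static final Set<String> STOP_WORDS = Set.of("a", "on", "it");

    // Assumption: a trivial stemmer that only strips a plural "s".
    // A real stemming dictionary applies far more rules.
    static String stem(String word) {
        return word.length() > 3 && word.endsWith("s")
                ? word.substring(0, word.length() - 1)
                : word;
    }

    static String toTsVector(String document) {
        TreeMap<String, List<Integer>> lexemes = new TreeMap<>();
        int position = 0;
        // Splitting on non-letters drops spaces and punctuation, mirroring the
        // fact that space tokens are never indexed.
        for (String token : document.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue;
            }
            position++; // stop words still consume a position
            if (STOP_WORDS.contains(token)) {
                continue;
            }
            lexemes.computeIfAbsent(stem(token), k -> new ArrayList<>()).add(position);
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<Integer>> e : lexemes.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('\'').append(e.getKey()).append("':");
            for (int i = 0; i < e.getValue().size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(e.getValue().get(i));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toTsVector("a fat cat sat on a mat - it ate a fat rats"));
    }
}
```

For the sample sentence the sketch prints 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4. The point it illustrates is that stop words are dropped from the result but still advance the position counter, which is why fat appears at positions 2 and 11.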


Data Warehouse Service GaussDB(DWS) - Searching a Table

Searching a Table

This section describes how to use text search operators to search a database table.

A simple query: print each row whose body field contains science.

    DROP SCHEMA IF EXISTS tsearch CASCADE;

    CREATE SCHEMA tsearch;

    CREATE TABLE tsearch.pgweb(id int, body text, title text, last_mod_date date);

    INSERT INTO tsearch.pgweb VALUES(1, 'Philology is the study of words, especially the history and development of the words in a particular language or group of languages.', 'Philology', '2010-1-1');
    INSERT INTO tsearch.pgweb VALUES(2, 'Mathematics is the science that deals with the logic of shape, quantity and arrangement.', 'Mathematics', '2010-1-1');
    INSERT INTO tsearch.pgweb VALUES(3, 'Computer science is the study of processes that interact with data and that can be represented as data in the form of programs.', 'Computer science', '2010-1-1');
    INSERT INTO tsearch.pgweb VALUES(4, 'Chemistry is the scientific discipline involved with elements and compounds composed of atoms, molecules and ions.', 'Chemistry', '2010-1-1');
    INSERT INTO tsearch.pgweb VALUES(5, 'Geography is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets.', 'Geography', '2010-1-1');
    INSERT INTO tsearch.pgweb VALUES(6, 'History is a subject studied in schools, colleges, and universities that deals with events that have happened in the past.', 'History', '2010-1-1');
    INSERT INTO tsearch.pgweb VALUES(7, 'Medical science is the science of dealing with the maintenance of health and the prevention and treatment of disease.', 'Medical science', '2010-1-1');
    INSERT INTO tsearch.pgweb VALUES(8, 'Physics is one of the most fundamental scientific disciplines, and its main goal is to understand how the universe behaves.', 'Physics', '2010-1-1');

    SELECT id, body, title FROM tsearch.pgweb WHERE to_tsvector('english', body) @@ to_tsquery('english', 'science');
     id | body | title
    ----+------+-------
      2 | Mathematics is the science that deals with the logic of shape, quantity and arrangement. | Mathematics
      3 | Computer science is the study of processes that interact with data and that can be represented as data in the form of programs. | Computer science
      5 | Geography is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets. | Geography
      7 | Medical science is the science of dealing with the maintenance of health and the prevention and treatment of disease. | Medical science
    (4 rows)

Related forms of science, such as sciences, are also found, because they are all reduced to the same normalized lexeme.

The query above specifies that the english configuration is to be used to parse and normalize the strings. The configuration argument can also be omitted, in which case the configuration set by default_text_search_config is used:

    SHOW default_text_search_config;
     default_text_search_config
    ----------------------------
     pg_catalog.english
    (1 row)

    SELECT id, body, title FROM tsearch.pgweb WHERE to_tsvector(body) @@ to_tsquery('science');
     id | body | title
    ----+------+-------
      2 | Mathematics is the science that deals with the logic of shape, quantity and arrangement. | Mathematics
      3 | Computer science is the study of processes that interact with data and that can be represented as data in the form of programs. | Computer science
      5 | Geography is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets. | Geography
      7 | Medical science is the science of dealing with the maintenance of health and the prevention and treatment of disease. | Medical science
    (4 rows)

A more complex query: retrieve the 10 most recent documents that contain treatment and science in the title or body field.

    SELECT title FROM tsearch.pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('treatment & science') ORDER BY last_mod_date DESC LIMIT 10;
          title
    -----------------
     Medical science
    (1 row)

For clarity, the example omits the coalesce calls that would be needed to find rows containing NULL in one of the two fields.

All of the examples above run without an index. For most applications this approach is too slow. Except for occasional ad hoc searches, practical use of text search usually requires creating an index.


Content Moderation - Text Moderation (V3): Response Example

Response Example

Status code: 200

Successful response example

    {
      "request_id" : "58e7d9c7-3456-4ba1-80df-6f25506bc4df",
      "result" : {
        "suggestion" : "block",
        "label" : "customized",
        "details" : [ {
          "suggestion" : "block",
          "label" : "customized",
          "confidence" : 1,
          "segments" : [ {
            "segment" : "xxx",
            "glossary_name" : "zzz"
          } ]
        } ]
      }
    }

Status code: 400

Failed response example

    {
      "error_code" : "AIS.0011",
      "error_msg" : "Lack the request parameter, or the request parameter is empty."
    }


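Content Moderation - Text Moderation (V3): Request Example

Request Example

"endpoint" is the request address for calling the API. It differs by service and region; see Endpoints for details. For example, the endpoint of the service deployed in the CN North-Beijing4 region is "moderation.cn-north-4.myhuaweicloud.com", and the request URL is "https://moderation.cn-north-4.myhuaweicloud.com/v3/{project_id}/moderation/text". "project_id" is the project ID; see Obtaining a Project ID.

Check whether text contains sensitive content. The event type is comment, the custom blacklist glossary used for detection is custom_xxx, the custom whitelist glossary used for detection is custom_xxx, and the text to check is asdfasdfasdf.

    POST https://{endpoint}/v3/{project_id}/moderation/text

    {
      "event_type" : "comment",
      "glossary_names" : [ "custom_xxx" ],
      "white_glossary_names" : [ "custom_xxx" ],
      "data" : {
        "text" : "asdfasdfasdf"
      }
    }

Call using biz_type:

    POST https://{endpoint}/v3/{project_id}/moderation/text

    {
      "biz_type" : "my_custom_type",
      "data" : {
        "text" : "asdfasdfasdf"
      }
    }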
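The request above can also be assembled with the JDK's own HTTP classes instead of the SDK. The following is a minimal sketch assuming Java 11 or later; the class `TextModerationRequestSketch` and its helper methods are hypothetical, the endpoint, project ID, and token values are placeholders, and the sketch only builds the request without sending it. The X-Auth-Token header name is taken from the request parameters of this API.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper that assembles the text moderation request by hand.
class TextModerationRequestSketch {

    // Builds the request URL from the endpoint and project ID.
    static String buildUrl(String endpoint, String projectId) {
        return "https://" + endpoint + "/v3/" + projectId + "/moderation/text";
    }

    // Minimal JSON string escaping so user text cannot break the body.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters as a four-digit hex escape.
                        sb.append('\\').append('u').append(String.format("%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds the smallest body from the example: event_type plus data.text.
    static String buildBody(String eventType, String text) {
        return "{\"event_type\":\"" + escapeJson(eventType)
                + "\",\"data\":{\"text\":\"" + escapeJson(text) + "\"}}";
    }

    // Assembles (but does not send) the POST request.
    static HttpRequest buildRequest(String endpoint, String projectId, String token, String body) {
        return HttpRequest.newBuilder()
                .uri(URI.create(buildUrl(endpoint, projectId)))
                .header("Content-Type", "application/json")
                .header("X-Auth-Token", token)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        String body = buildBody("comment", "asdfasdfasdf");
        // "my-project-id" and "my-token" are placeholders, not real credentials.
        HttpRequest request = buildRequest(
                "moderation.cn-north-4.myhuaweicloud.com", "my-project-id", "my-token", body);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body);
    }
}
```

Sending the request would additionally need an `HttpClient` and a valid token; in practice the official SDK handles signing and serialization and is likely the simpler choice.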


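Content Moderation - Text Moderation (V3): Response Parameters

Response Parameters

Status code: 200

Table 5 Response body parameters

    Parameter  | Type                       | Description
    -----------+----------------------------+------------
    request_id | String                     | Unique identifier of the request, used for troubleshooting. Saving it is recommended. Minimum length: 2. Maximum length: 64.
    result     | TextDetectionResult object | Result of the call when it succeeds. Not returned when the call fails.

Table 6 TextDetectionResult

    Parameter  | Type                                        | Description
    -----------+---------------------------------------------+------------
    suggestion | String                                      | Whether the text passes moderation. block: contains sensitive information, rejected. pass: contains no sensitive information, passed. review: manual review required.
    label      | String                                      | Label of the detection result. Supported labels: terrorism (terrorism-related content), porn (pornography), ban (prohibited content), abuse (abusive language), ad (advertisement), customized (a keyword in a custom glossary was hit).
    details    | Array of TextDetectionResultDetail objects  | Detection details.

Table 7 TextDetectionResultDetail

    Parameter  | Type                     | Description
    -----------+--------------------------+------------
    suggestion | String                   | Whether the text passes moderation. block: contains sensitive information, rejected. pass: contains no sensitive information, passed. review: manual review required.
    label      | String                   | Label of the detection result. Supported labels: terrorism (terrorism-related content), porn (pornography), ban (prohibited content), abuse (abusive language), ad (advertisement), customized (a keyword in a custom glossary was hit).
    confidence | Float                    | Confidence, in the range 0 to 1. A larger value indicates higher confidence.
    segments   | Array of Segment objects | Information about the risky segments that were hit. If the hit came from a semantic algorithm model, an empty list is returned.

Table 8 Segment

    Parameter     | Type              | Description
    --------------+-------------------+------------
    segment       | String            | The risky segment that was hit.
    glossary_name | String            | Name of the custom glossary that was hit. Returned only when a custom glossary is hit.
    position      | Array of integers | Position of the risky segment in the text. Positions start from 0.

Status code: 400

Table 9 Response body parameters

    Parameter  | Type   | Description
    -----------+--------+------------
    error_code | String | Error code when the call fails; see Error Codes. Not returned when the call succeeds.
    error_msg  | String | Error message when the call fails. Not returned when the call succeeds.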
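A caller typically turns the suggestion value into an application decision. The sketch below shows one possible mapping; the class `SuggestionSketch`, the `Action` enum, and its constant names are hypothetical and only illustrate the three documented values block, pass, and review.

```java
// Hypothetical mapping from the documented suggestion values to app actions.
class SuggestionSketch {

    // Assumption: these action names are invented for illustration.
    enum Action { PUBLISH, REJECT, MANUAL_REVIEW }

    static Action actionFor(String suggestion) {
        switch (suggestion) {
            case "pass":
                return Action.PUBLISH;        // no sensitive information found
            case "block":
                return Action.REJECT;         // sensitive information found
            case "review":
                return Action.MANUAL_REVIEW;  // needs a human decision
            default:
                // Fail loudly on values this sketch does not know about.
                throw new IllegalArgumentException("Unknown suggestion: " + suggestion);
        }
    }

    public static void main(String[] args) {
        System.out.println(actionFor("block"));
    }
}
```

Throwing on unknown values is a design choice of this sketch; an application might prefer to route unexpected values to manual review instead.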


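Content Moderation - Text Moderation (V3): Function

Function

This API analyzes uploaded text, identifies whether it contains sensitive content, and returns the result to you. Compared with V2, V3 provides stronger moderation capabilities.

Currently only Chinese text is supported. Text in other languages cannot be moderated.

By default, the maximum concurrency of text moderation API calls is 50. To raise the limit, contact Huawei engineers.

You can configure custom glossaries to filter and detect specific text. For how to create and use them, see Configuring Custom Glossaries (V3).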
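One way a client could stay within the default concurrency of 50 is to gate calls with a semaphore. This is only a client-side sketch, not part of the service or SDK: the class `ModerationCallLimiter` is hypothetical, and the `Supplier` passed to it stands in for whatever code actually performs the API call.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical client-side limiter for concurrent API calls.
class ModerationCallLimiter {

    private final Semaphore permits;

    ModerationCallLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the call once a permit is free and always returns the permit.
    <T> T call(Supplier<T> apiCall) {
        permits.acquireUninterruptibly();
        try {
            return apiCall.get();
        } finally {
            permits.release();
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        // 50 is the default concurrency stated in the function description.
        ModerationCallLimiter limiter = new ModerationCallLimiter(50);
        String result = limiter.call(() -> "placeholder result");
        System.out.println(result + ", free permits: " + limiter.availablePermits());
    }
}
```

Callers from many threads share one limiter instance; a thread that cannot get a permit waits until another call finishes.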


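Content Moderation - Text Moderation (V3): Request Parameters

Request Parameters

Table 2 Request header parameters

    Parameter             | Mandatory | Type   | Description
    ----------------------+-----------+--------+------------
    X-Auth-Token          | Yes       | String | User token, used to obtain permission to call the API. To obtain it, see the API for obtaining a token; the token is the value of X-Subject-Token in the response header.
    Enterprise-Project-Id | No        | String | Enterprise project ID. Moderation supports Enterprise Project Management (EPS) for splitting the bills of resources used by different user groups and users. To obtain the ID, go to the Enterprise Project Management page, click the enterprise project name, and find the Enterprise-Project-Id on the enterprise project details page. For how to create an enterprise project, see the user guide.

Note: After an enterprise project is created, there are three cases when passing this parameter:

A correct ID is passed. The Moderation service works normally, and the bill is assigned to the enterprise project that matches the ID.
A well-formed ID that does not exist is passed. The Moderation service works normally, and the bill shows the nonexistent enterprise project ID.
No ID is passed, or the ID is malformed (for example, it contains special characters). The Moderation service works normally, and the bill is assigned to "default".

Table 3 Request body parameters

    Parameter            | Mandatory | Type                        | Description
    ---------------------+-----------+-----------------------------+------------
    event_type           | No        | String                      | Event type. Options: nickname (nickname), title (title), article (post), comment (comment), barrage (bullet comment), search (search bar), profile (personal profile).
    glossary_names       | No        | Array of strings            | Custom blacklist glossaries used for detection. For how to create and use them, see Configuring Custom Glossaries (V3).
    white_glossary_names | No        | Array of strings            | Custom whitelist glossaries used for detection. For how to create and use them, see Configuring Custom Glossaries (V3).
    categories           | No        | Array of strings            | Text moderation scenarios. Options: terrorism (terrorism-related content), porn (pornography), ban (prohibited content), abuse (abusive language), ad (advertisement). If categories is empty, all scenarios are checked.
    data                 | Yes       | TextDetectionDataReq object | Data to check.
    biz_type             | No        | String                      | Name of a custom moderation policy created on the console. If biz_type is passed, it takes precedence: event_type and categories do not take effect, and the moderation policy is determined by the biz_type settings. If biz_type is not passed, event_type is mandatory.

Table 4 TextDetectionDataReq

    Parameter | Mandatory | Type   | Description
    ----------+-----------+--------+------------
    text      | Yes       | String | Text to check, encoded in UTF-8, up to 1500 characters. If the text is longer than 1500 characters, only the first 1500 characters are checked. Minimum length: 1. Maximum length: 1500.
    language  | No        | String | Language of the text to check. The only option is zh (Chinese). If this parameter is not passed, zh is used.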
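Two of the rules in the tables lend themselves to a client-side check before a request is sent: text must be non-empty and only its first 1500 characters are examined, and event_type is required whenever biz_type is absent. The sketch below is one way to express them; the class `TextModerationInputSketch` and its methods are hypothetical, and counting "characters" as Unicode code points is an assumption, since the source does not say how the service counts them.

```java
// Hypothetical pre-flight checks for the text moderation request body.
class TextModerationInputSketch {

    // Limit taken from the description of data.text.
    static final int MAX_CHARS = 1500;

    // Returns the part of the text the service would examine.
    // Assumption: one character means one Unicode code point.
    static String firstChars(String text) {
        int count = text.codePointCount(0, text.length());
        if (count <= MAX_CHARS) {
            return text;
        }
        return text.substring(0, text.offsetByCodePoints(0, MAX_CHARS));
    }

    // Enforces: text is mandatory; event_type is mandatory without biz_type.
    static void validate(String bizType, String eventType, String text) {
        if (text == null || text.isEmpty()) {
            throw new IllegalArgumentException("data.text is required");
        }
        boolean hasBizType = bizType != null && !bizType.isEmpty();
        boolean hasEventType = eventType != null && !eventType.isEmpty();
        if (!hasBizType && !hasEventType) {
            throw new IllegalArgumentException("event_type is required when biz_type is not set");
        }
    }

    public static void main(String[] args) {
        validate(null, "comment", "asdfasdfasdf");
        String longText = "a".repeat(2000);
        System.out.println(firstChars(longText).length()); // 1500
    }
}
```

Splitting long input into several requests of at most 1500 characters each is a possible follow-up, but whether that suits a given application depends on how results for the pieces should be combined.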


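DataArts Insight - Table: Style

Style

Size and position
  W: width of the chart, in px.
  H: height of the chart, in px.
  X: position of the chart on the canvas along the X axis, in px.
  Y: position of the chart on the canvas along the Y axis, in px.
  Opacity: opacity of the chart on the canvas. Set it with the slider or enter a percentage. The higher the percentage, the less transparent the chart.

Figure 1 Size and position

Global style
  Table layout: adjusts the proportions of the table. Options are automatic adjustment and proportional distribution.
  Pagination: sets the number of table rows, font type, font color, font size, and font weight.
  Filter: when Filter is selected, a filter control appears in the header so that data can be filtered.
  Sort: sorts the data.
    Sorting by dimension: drag a dimension field to the sort slot. The field is automatically added to the dimension slot and displayed in the chart.
    Sorting by measure: measure fields in the sort slot are not displayed in the chart by default. To display one, drag the measure field to the measure slot as well.
  Border: sets the thickness and color of the table border.

Header
  Show/hide header: use the check box to the right of "Header" to show or hide the header.
  Header row height: enter a value to set the header row height.
  Background color: click the color editor to set the header background color.
  Alignment: use the drop-down list to set the alignment of header text. Options are centered, left, and right.
  Font: use the drop-down list to set the font type of header text.
  Font color: use the drop-down list to set the font color of header text.
  Font size: use the drop-down list to set the font size of header text.
  Font weight: use the drop-down list to set the font weight of header text.

Header groups: the table can display multi-row headers. At most 50 groups are supported.
  Add a group: select header groups, click the icon next to the option to open the header group editing page, and click + to the left of Add Group.
  Name a group: click the icon to the right of the group and enter a name.
  Set group content: drag columns into the group.
  Delete a group: click the icon to the right of the group to delete it. Columns in the group are not deleted.
  Show/hide header groups: use the check box to the right of "Header" to show or hide header groups.
  Change group order: drag groups on the header group page to reorder them.

Row settings
  Row height: sets the row height. The value cannot be less than 45.
  Odd row background color: click the color editor to set the background color of odd rows.
  Even row background color: click the color editor to set the background color of even rows.
  Selected background color: color of the row that contains the selected linkage field. The color can be customized.
  Alignment: options are centered, left, and right.
  Auto wrap: wraps long text within a cell.
  Text: sets the font, font color, font size, and font weight of row text.
  Row divider
    Style: sets the divider style. Solid, dashed, and dash-dot lines are supported.
    Thickness: sets the thickness of the row divider.
    Color: sets the color of the row divider.

Series settings: the header and the content of each column can be aligned independently.
  Select series: select the column to configure.
  Natural alignment: header alignment and content alignment can be set only when this option is selected.
  Header alignment: sets the alignment of the header. Options are automatic, left, center, and right.
  Content alignment: sets the alignment of the content. Options are automatic, left, center, and right.

Conditional formatting

Table 2 Conditional formatting parameters

Select series: choose the field to configure from the fields displayed in the chart.
Quick style: choose a quick-style icon from the existing styles.
Invert colors: when enabled, the colors of the quick-style icons are swapped. When disabled, the colors are restored.
Custom style:
  Text
    Condition type: compare with a fixed value or compare with a dynamic value.
    Filter condition: greater than, greater than or equal to, equal to, less than or equal to, less than, not equal to, greater than A and less than or equal to B, greater than or equal to A and less than B, greater than A and less than B, greater than or equal to A and less than or equal to B. Fixed comparison values are user-defined. Dynamic comparison fields are chosen from those offered by the system.
    Color: after setting the filter condition, click the color button to customize the color.
    Add rule: click "+" to add a filter condition.
    Delete: click "-" to delete a filter condition.
  Icon
    Condition type: compare with a fixed value or compare with a dynamic value.
    Icon style: preselected icons. If they do not meet your needs, set an icon separately after each filter condition.
    Filter condition: greater than, greater than or equal to, equal to, less than or equal to, less than, not equal to, greater than A and less than or equal to B, greater than or equal to A and less than B, greater than A and less than B, greater than or equal to A and less than or equal to B. Fixed comparison values are user-defined. Dynamic comparison fields are chosen from those offered by the system.
    Add rule: click "+" to add a filter condition.
    Delete: click "-" to delete a filter condition.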


Content Moderation - Text Moderation (V3)

Text Moderation (V3)

This section shows how to call text moderation through the SDK using AK/SK authentication.

In the sample code, use the withText method of the TextDetectionDataReq class to set the text to check, and then run the program.

For the regions where the service is available and their endpoints, see Regions and Endpoints.

    package com.huaweicloud.sdk.test;

    import com.huaweicloud.sdk.core.auth.ICredential;
    import com.huaweicloud.sdk.core.auth.BasicCredentials;
    import com.huaweicloud.sdk.core.exception.ConnectionException;
    import com.huaweicloud.sdk.core.exception.RequestTimeoutException;
    import com.huaweicloud.sdk.core.exception.ServiceResponseException;
    import com.huaweicloud.sdk.moderation.v3.region.ModerationRegion;
    import com.huaweicloud.sdk.moderation.v3.*;
    import com.huaweicloud.sdk.moderation.v3.model.*;

    public class RunTextModerationSolution {

        public static void main(String[] args) {
            // Hard-coding the AK and SK or storing them in plaintext is a serious security risk.
            // Store them encrypted in a configuration file or in environment variables and decrypt them when used.
            // This sample reads the AK and SK from environment variables. Before running it,
            // set the environment variables HUAWEICLOUD_SDK_AK and HUAWEICLOUD_SDK_SK.
            String ak = System.getenv("HUAWEICLOUD_SDK_AK");
            String sk = System.getenv("HUAWEICLOUD_SDK_SK");

            ICredential auth = new BasicCredentials()
                    .withAk(ak)
                    .withSk(sk);

            ModerationClient client = ModerationClient.newBuilder()
                    .withCredential(auth)
                    // Replace xxx with the region where the service is deployed, for example cn-north-4 for Beijing4.
                    .withRegion(ModerationRegion.valueOf("xxx"))
                    .build();

            // Build the request: the text to check and the event type.
            RunTextModerationRequest request = new RunTextModerationRequest();
            TextDetectionReq body = new TextDetectionReq();
            TextDetectionDataReq databody = new TextDetectionDataReq();
            databody.withText("test");
            body.withData(databody);
            body.withEventType("comment");
            request.withBody(body);

            try {
                RunTextModerationResponse response = client.runTextModeration(request);
                System.out.println(response.toString());
            } catch (ConnectionException e) {
                e.printStackTrace();
            } catch (RequestTimeoutException e) {
                e.printStackTrace();
            } catch (ServiceResponseException e) {
                e.printStackTrace();
                // Print the HTTP status code and the service error details.
                System.out.println(e.getHttpStatusCode());
                System.out.println(e.getErrorCode());
                System.out.println(e.getErrorMsg());
            }
        }
    }

A status code of 200 indicates that the call succeeded. The text moderation result is printed to the console:

    class RunTextModerationResponse {
        requestId: 308b6ad2740e51de73597da9fdc94ee1
        result: class TextDetectionResult {
            suggestion: pass
            label: normal
            details: []
        }
    }


