云服务器内容精选
-
搜索表 在不使用索引的情况下也可以进行全文检索。 一个简单查询:将body字段中包含america的每一行打印出来。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 openGauss=# DROP SCHEMA IF EXISTS tsearch CASCADE; openGauss=# CREATE SCHEMA tsearch; openGauss=# CREATE TABLE tsearch.pgweb(id int, body text, title text, last_mod_date date); openGauss=# INSERT INTO tsearch.pgweb VALUES(1, 'China, officially the People''s Republic of China (PRC), located in Asia, is the world''s most populous state.', 'China', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(2, 'America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley.', 'America', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(3, 'England is a country that is part of the United Kingdom. It shares land borders with Scotland to the north and Wales to the west.', 'England', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(4, 'Australia, officially the Commonwealth of Australia, is a country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands.', 'Australia', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(6, 'Japan is an island country in East Asia.', 'Japan', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(7, 'Germany, officially the Federal Republic of Germany, is a sovereign state and federal parliamentary republic in central-western Europe.', 'Germany', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(8, 'France, is a sovereign state comprising territory in western Europe and several overseas regions and territories.', 'France', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(9, 'Italy officially the Italian Republic, is a unitary parliamentary republic in Europe.', 'Italy', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(10, 'India, officially the Republic of India, is a country in South Asia.', 'India', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(11, 'Brazil, officially the Federative Republic of Brazil, is the largest country in both South America and Latin America.', 'Brazil', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(12, 'Canada is a country in the northern half of North America.', 'Canada', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(13, 'Mexico, officially the United Mexican States, is a federal republic in the southern part of North America.', 'Mexico', '2010-1-1'); openGauss=# SELECT id, body, title FROM tsearch.pgweb WHERE to_tsvector('english', body) @@ to_tsquery('english', 'america'); id | body | title ----+-------------------------------------------------------------------------------------------------------------------------+--------- 2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley. | America 11 | Brazil, officially the Federative Republic of Brazil, is the largest country in both South America and Latin America. | Brazil 12 | Canada is a country in the northern half of North America. | Canada 13 | Mexico, officially the United Mexican States, is a federal republic in the southern part of North America. | Mexico (4 rows) 像America这样的相关词也会被找到,因为这些词都被处理成了相同标准的词条。 上面的查询指定english配置来解析和规范化字符串。当然也可以省略此配置,通过default_text_search_config进行配置设置: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 openGauss=# SHOW default_text_search_config; default_text_search_config ---------------------------- pg_catalog.english (1 row) openGauss=# SELECT id, body, title FROM tsearch.pgweb WHERE to_tsvector(body) @@ to_tsquery('america'); id | body | title ----+-------------------------------------------------------------------------------------------------------------------------+--------- 2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley. | America 11 | Brazil, officially the Federative Republic of Brazil, is the largest country in both South America and Latin America. | Brazil 12 | Canada is a country in the northern half of North America. | Canada 13 | Mexico, officially the United Mexican States, is a federal republic in the southern part of North America. | Mexico (4 rows) 一个复杂查询:检索出在title或者body字段中包含north和america的最近10篇文档: 1 2 3 4 5 6 openGauss=# SELECT title FROM tsearch.pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('north & america') ORDER BY last_mod_date DESC LIMIT 10; title -------- Canada Mexico (2 rows) 为了清晰,举例中没有调用coalesce函数在两个字段中查找包含NULL的行。 以上例子均在没有索引的情况下进行查询。对于大多数应用程序来说,这个方法很慢。因此除了偶尔的特定搜索,文本搜索在实际使用中通常需要创建索引。 父主题: 表和索引
-
搜索表 在不使用索引的情况下也可以进行全文检索。 一个简单查询:将body字段中包含america的每一行打印出来。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 openGauss=# DROP SCHEMA IF EXISTS tsearch CASCADE; openGauss=# CREATE SCHEMA tsearch; openGauss=# CREATE TABLE tsearch.pgweb(id int, body text, title text, last_mod_date date); openGauss=# INSERT INTO tsearch.pgweb VALUES(1, 'China, officially the People''s Republic of China (PRC), located in Asia, is the world''s most populous state.', 'China', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(2, 'America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley.', 'America', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(3, 'England is a country that is part of the United Kingdom. It shares land borders with Scotland to the north and Wales to the west.', 'England', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(4, 'Australia, officially the Commonwealth of Australia, is a country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands.', 'Australia', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(6, 'Japan is an island country in East Asia.', 'Japan', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(7, 'Germany, officially the Federal Republic of Germany, is a sovereign state and federal parliamentary republic in central-western Europe.', 'Germany', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(8, 'France, is a sovereign state comprising territory in western Europe and several overseas regions and territories.', 'France', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(9, 'Italy officially the Italian Republic, is a unitary parliamentary republic in Europe.', 'Italy', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(10, 'India, officially the Republic of India, is a country in South Asia.', 'India', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(11, 'Brazil, officially the Federative Republic of Brazil, is the largest country in both South America and Latin America.', 'Brazil', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(12, 'Canada is a country in the northern half of North America.', 'Canada', '2010-1-1'); openGauss=# INSERT INTO tsearch.pgweb VALUES(13, 'Mexico, officially the United Mexican States, is a federal republic in the southern part of North America.', 'Mexico', '2010-1-1'); openGauss=# SELECT id, body, title FROM tsearch.pgweb WHERE to_tsvector('english', body) @@ to_tsquery('english', 'america'); id | body | title ----+-------------------------------------------------------------------------------------------------------------------------+--------- 2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley. | America 12 | Canada is a country in the northern half of North America. | Canada 13 | Mexico, officially the United Mexican States, is a federal republic in the southern part of North America. | Mexico 11 | Brazil, officially the Federative Republic of Brazil, is the largest country in both South America and Latin America. | Brazil (4 rows) 像America这样的相关词也会被找到,因为这些词都被处理成了相同标准的词条。 上面的查询指定english配置来解析和规范化字符串。当然也可以省略此配置,通过default_text_search_config进行配置设置: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 openGauss=# SHOW default_text_search_config; default_text_search_config ---------------------------- pg_catalog.english (1 row) openGauss=# SELECT id, body, title FROM tsearch.pgweb WHERE to_tsvector(body) @@ to_tsquery('america'); id | body | title ----+-------------------------------------------------------------------------------------------------------------------------+--------- 11 | Brazil, officially the Federative Republic of Brazil, is the largest country in both South America and Latin America. | Brazil 2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley. | America 12 | Canada is a country in the northern half of North America. | Canada 13 | Mexico, officially the United Mexican States, is a federal republic in the southern part of North America. | Mexico (4 rows) 一个复杂查询:检索出在title或者body字段中包含north和america的最近10篇文档: 1 2 3 4 5 6 openGauss=# SELECT title FROM tsearch.pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('north & america') ORDER BY last_mod_date DESC LIMIT 10; title -------- Mexico Canada (2 rows) 为了清晰,举例中没有调用coalesce函数在两个字段中查找包含NULL的行。 以上例子均在没有索引的情况下进行查询。对于大多数应用程序来说,这个方法很慢。因此除了偶尔的特定搜索,文本搜索在实际使用中通常需要创建索引。 父主题: 表和索引
-
搜索表 在不使用索引的情况下也可以进行全文检索。 一个简单查询:将body字段中包含science的每一行打印出来。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 DROP SCHEMA IF EXISTS tsearch CASCADE; CREATE SCHEMA tsearch; CREATE TABLE tsearch.pgweb(id int, body text, title text, last_mod_date date); INSERT INTO tsearch.pgweb VALUES(1, 'Philology is the study of words, especially the history and development of the words in a particular language or group of languages.', 'Philology', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(2, 'Mathematics is the science that deals with the logic of shape, quantity and arrangement.', 'Mathematics', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(3, 'Computer science is the study of processes that interact with data and that can be represented as data in the form of programs.', 'Computer science', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(4, 'Chemistry is the scientific discipline involved with elements and compounds composed of atoms, molecules and ions.', 'Chemistry', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(5, 'Geography is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets.', 'Geography', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(6, 'History is a subject studied in schools, colleges, and universities that deals with events that have happened in the past.', 'History', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(7, 'Medical science is the science of dealing with the maintenance of health and the prevention and treatment of disease.', 'Medical science', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(8, 'Physics is one of the most fundamental scientific disciplines, and its main goal is to understand how the universe behaves.', 'Physics', '2010-1-1'); SELECT id, body, title FROM tsearch.pgweb WHERE to_tsvector('english', body) @@ to_tsquery('english', 'science'); id | body | title ----+-------------------------------------------------------------------------------------------------------------------------+--------- 2 | Mathematics is the science that deals with the logic of shape, quantity and arrangement. | Mathematics 3 | Computer science is the study of processes that interact with data and that can be represented as data in the form of programs. | Computer science 5 | Geography is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets. | Geography 7 | Medical science is the science of dealing with the maintenance of health and the prevention and treatment of disease. | Medical science (4 rows) 像science这样的相关词也会被找到,因为这些词都被处理成了相同标准的词条。 上面的查询指定english配置来解析和规范化字符串。也可以省略此配置,通过default_text_search_config进行配置设置: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 SHOW default_text_search_config; default_text_search_config ---------------------------- pg_catalog.english (1 row) SELECT id, body, title FROM tsearch.pgweb WHERE to_tsvector(body) @@ to_tsquery('science'); id | body | title ----+-------------------------------------------------------------------------------------------------------------------------+--------- 2 | Mathematics is the science that deals with the logic of shape, quantity and arrangement. | Mathematics 3 | Computer science is the study of processes that interact with data and that can be represented as data in the form of programs. | Computer science 5 | Geography is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets. | Geography 7 | Medical science is the science of dealing with the maintenance of health and the prevention and treatment of disease. | Medical science (4 rows) 一个复杂查询:检索出在title或者body字段中包含treatment和science的最近10篇文档: 1 2 3 4 5 6 7 SELECT title FROM tsearch.pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('treatment & science') ORDER BY last_mod_date DESC LIMIT 10; title -------- Medical science (1 rows) 为了清晰,举例中没有调用coalesce函数在两个字段中查找包含NULL的行。 以上例子均在没有索引的情况下进行查询。对于大多数应用程序来说,这个方法很慢。因此除了偶尔的特定搜索,文本搜索在实际使用中通常需要创建索引。 父主题: 表和索引
-
搜索表 在不使用索引的情况下也可以进行全文检索。 一个简单查询:将body字段中包含science的每一行打印出来。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 DROP SCHEMA IF EXISTS tsearch CASCADE; CREATE SCHEMA tsearch; CREATE TABLE tsearch.pgweb(id int, body text, title text, last_mod_date date); INSERT INTO tsearch.pgweb VALUES(1, 'Philology is the study of words, especially the history and development of the words in a particular language or group of languages.', 'Philology', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(2, 'Mathematics is the science that deals with the logic of shape, quantity and arrangement.', 'Mathematics', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(3, 'Computer science is the study of processes that interact with data and that can be represented as data in the form of programs.', 'Computer science', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(4, 'Chemistry is the scientific discipline involved with elements and compounds composed of atoms, molecules and ions.', 'Chemistry', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(5, 'Geography is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets.', 'Geography', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(6, 'History is a subject studied in schools, colleges, and universities that deals with events that have happened in the past.', 'History', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(7, 'Medical science is the science of dealing with the maintenance of health and the prevention and treatment of disease.', 'Medical science', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(8, 'Physics is one of the most fundamental scientific disciplines, and its main goal is to understand how the universe behaves.', 'Physics', '2010-1-1'); SELECT id, body, title FROM tsearch.pgweb WHERE to_tsvector('english', body) @@ to_tsquery('english', 'science'); id | body | title ----+-------------------------------------------------------------------------------------------------------------------------+--------- 2 | Mathematics is the science that deals with the logic of shape, quantity and arrangement. | Mathematics 3 | Computer science is the study of processes that interact with data and that can be represented as data in the form of programs. | Computer science 5 | Geography is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets. | Geography 7 | Medical science is the science of dealing with the maintenance of health and the prevention and treatment of disease. | Medical science (4 rows) 像science这样的相关词也会被找到,因为这些词都被处理成了相同标准的词条。 上面的查询指定english配置来解析和规范化字符串。也可以省略此配置,通过default_text_search_config进行配置设置: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 SHOW default_text_search_config; default_text_search_config ---------------------------- pg_catalog.english (1 row) SELECT id, body, title FROM tsearch.pgweb WHERE to_tsvector(body) @@ to_tsquery('science'); id | body | title ----+-------------------------------------------------------------------------------------------------------------------------+--------- 2 | Mathematics is the science that deals with the logic of shape, quantity and arrangement. | Mathematics 3 | Computer science is the study of processes that interact with data and that can be represented as data in the form of programs. | Computer science 5 | Geography is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets. | Geography 7 | Medical science is the science of dealing with the maintenance of health and the prevention and treatment of disease. | Medical science (4 rows) 一个复杂查询:检索出在title或者body字段中包含treatment和science的最近10篇文档: 1 2 3 4 5 6 7 SELECT title FROM tsearch.pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('treatment & science') ORDER BY last_mod_date DESC LIMIT 10; title -------- Medical science (1 rows) 为了清晰,举例中没有调用coalesce函数在两个字段中查找包含NULL的行。 以上例子均在没有索引的情况下进行查询。对于大多数应用程序来说,这个方法很慢。因此除了偶尔的特定搜索,文本搜索在实际使用中通常需要创建索引。 父主题: 表和索引
-
创建索引 为了加速文本搜索,可以创建GIN索引。 1 CREATE INDEX pgweb_idx_1 ON tsearch.pgweb USING gin(to_tsvector('english', body)); to_tsvector()函数有两个版本,只输一个参数的版本和输两个参数的版本。 只输一个参数时,系统默认采用default_text_search_config所指定的分词器。 创建索引时必须使用to_tsvector的两参数版本,否则索引内容可能不一致。只有指定了分词器名称的全文检索函数才可以在索引表达式中使用。因为索引的内容不受default_text_search_config的影响。由于default_text_search_config的值可以随时调整,从而导致不同条目生成的tsvector采用了不同的分词器,并且无法区分究竟使用了哪个分词器。正确地转储和恢复这样的索引也是不支持的。 在上述创建索引中to_tsvector使用了两个参数,只有当查询时也使用了两个参数,且参数值与索引中相同时,才会使用该索引。例如,WHERE to_tsvector('english', body) @@ 'a & b' 可以使用索引,但WHERE to_tsvector(body) @@ 'a & b'不能使用索引。这确保索引各条目是使用相同的分词器创建的。 索引中的分词器名称由另一列指定时可以建立更复杂的表达式索引。例如: 1 CREATE INDEX pgweb_idx_2 ON tsearch.pgweb USING gin(to_tsvector('zhparser', body)); 本示例中zhparser仅支持UTF8/GBK的数据库编码格式,在Encoding为SQL_ASCII下会报错。 其中body是pgweb表中的一列。当对索引的各条目使用了哪个分词器进行记录时,允许在同一索引中存在混合分词器。在某些场景下这将是有用的。例如,文档集合中包含不同语言的文档时。再次强调,打算使用索引的查询必须措辞匹配,例如,WHERE to_tsvector(config_name, body) @@ 'a & b'与索引中的to_tsvector措辞匹配。 索引甚至可以连接列: 1 CREATE INDEX pgweb_idx_3 ON tsearch.pgweb USING gin(to_tsvector('english', title || ' ' || body)); 另一个方法是创建一个单独的tsvector列控制to_tsvector的输出。下面的例子是title和body的连接, 当其它是NULL的时候,使用coalesce确保一个字段仍然会被索引: 1 2 ALTER TABLE tsearch.pgweb ADD COLUMN textsearchable_index_col tsvector; UPDATE tsearch.pgweb SET textsearchable_index_col = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,'')); 然后为加速搜索创建一个GIN索引: 1 CREATE INDEX textsearch_idx_4 ON tsearch.pgweb USING gin(textsearchable_index_col); 现在,就可以执行一个快速全文搜索了: 1 2 3 4 5 6 7 8 9 10 11 SELECT title FROM tsearch.pgweb WHERE textsearchable_index_col @@ to_tsquery('science & Computer') ORDER BY last_mod_date DESC LIMIT 10; title -------- Computer science (1 rows) 相比于一个表达式索引,单独列方法的一个优势是:它没有必要在查询时明确指定分词器以便能使用索引。正如上面例子所示,查询可以依赖于default_text_search_config。另一个优势是搜索比较快速,因为它没有必要重新利用to_tsvector调用来验证索引匹配。表达式索引方法更容易建立,且它需要较少的磁盘空间,因为tsvector形式没有明确存储。 父主题: 表和索引
-
搜索表 在不使用索引的情况下也可以进行全文检索。 一个简单查询:将body字段中包含science的每一行打印出来。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 DROP SCHEMA IF EXISTS tsearch CASCADE; CREATE SCHEMA tsearch; CREATE TABLE tsearch.pgweb(id int, body text, title text, last_mod_date date); INSERT INTO tsearch.pgweb VALUES(1, 'Philology is the study of words, especially the history and development of the words in a particular language or group of languages.', 'Philology', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(2, 'Mathematics is the science that deals with the logic of shape, quantity and arrangement.', 'Mathematics', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(3, 'Computer science is the study of processes that interact with data and that can be represented as data in the form of programs.', 'Computer science', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(4, 'Chemistry is the scientific discipline involved with elements and compounds composed of atoms, molecules and ions.', 'Chemistry', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(5, 'Geography is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets.', 'Geography', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(6, 'History is a subject studied in schools, colleges, and universities that deals with events that have happened in the past.', 'History', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(7, 'Medical science is the science of dealing with the maintenance of health and the prevention and treatment of disease.', 'Medical science', '2010-1-1'); INSERT INTO tsearch.pgweb VALUES(8, 'Physics is one of the most fundamental scientific disciplines, and its main goal is to understand how the universe behaves.', 'Physics', '2010-1-1'); SELECT id, body, title FROM tsearch.pgweb WHERE to_tsvector('english', body) @@ to_tsquery('english', 'science'); id | body | title ----+-------------------------------------------------------------------------------------------------------------------------+--------- 2 | Mathematics is the science that deals with the logic of shape, quantity and arrangement. | Mathematics 3 | Computer science is the study of processes that interact with data and that can be represented as data in the form of programs. | Computer science 5 | Geography is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets. | Geography 7 | Medical science is the science of dealing with the maintenance of health and the prevention and treatment of disease. | Medical science (4 rows) 像science这样的相关词也会被找到,因为这些词都被处理成了相同标准的词条。 上面的查询指定english配置来解析和规范化字符串。也可以省略此配置,通过default_text_search_config进行配置设置: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 SHOW default_text_search_config; default_text_search_config ---------------------------- pg_catalog.english (1 row) SELECT id, body, title FROM tsearch.pgweb WHERE to_tsvector(body) @@ to_tsquery('science'); id | body | title ----+-------------------------------------------------------------------------------------------------------------------------+--------- 2 | Mathematics is the science that deals with the logic of shape, quantity and arrangement. | Mathematics 3 | Computer science is the study of processes that interact with data and that can be represented as data in the form of programs. | Computer science 5 | Geography is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets. | Geography 7 | Medical science is the science of dealing with the maintenance of health and the prevention and treatment of disease. | Medical science (4 rows) 一个复杂查询:检索出在title或者body字段中包含treatment和science的最近10篇文档: 1 2 3 4 5 6 7 SELECT title FROM tsearch.pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('treatment & science') ORDER BY last_mod_date DESC LIMIT 10; title -------- Medical science (1 rows) 为了清晰,举例中没有调用coalesce函数在两个字段中查找包含NULL的行。 以上例子均在没有索引的情况下进行查询。对于大多数应用程序来说,这个方法很慢。因此除了偶尔的特定搜索,文本搜索在实际使用中通常需要创建索引。 父主题: 表和索引
-
创建索引 为了加速文本搜索,可以创建GIN索引。 1 CREATE INDEX pgweb_idx_1 ON tsearch.pgweb USING gin(to_tsvector('english', body)); to_tsvector()函数有两个版本,只输一个参数的版本和输两个参数的版本。 只输一个参数时,系统默认采用default_text_search_config所指定的分词器。 创建索引时必须使用to_tsvector的两参数版本,否则索引内容可能不一致。只有指定了分词器名称的全文检索函数才可以在索引表达式中使用。因为索引的内容不受default_text_search_config的影响。由于default_text_search_config的值可以随时调整,从而导致不同条目生成的tsvector采用了不同的分词器,并且无法区分究竟使用了哪个分词器。正确地转储和恢复这样的索引也是不支持的。 在上述创建索引中to_tsvector使用了两个参数,只有当查询时也使用了两个参数,且参数值与索引中相同时,才会使用该索引。例如,WHERE to_tsvector('english', body) @@ 'a & b' 可以使用索引,但WHERE to_tsvector(body) @@ 'a & b'不能使用索引。这确保索引各条目是使用相同的分词器创建的。 索引中的分词器名称由另一列指定时可以建立更复杂的表达式索引。例如: 1 CREATE INDEX pgweb_idx_2 ON tsearch.pgweb USING gin(to_tsvector('zhparser', body)); 本示例中zhparser仅支持UTF8/GBK的数据库编码格式,在Encoding为SQL_ASCII下会报错。 其中body是pgweb表中的一列。当对索引的各条目使用了哪个分词器进行记录时,允许在同一索引中存在混合分词器。在某些场景下这将是有用的。例如,文档集合中包含不同语言的文档时。再次强调,打算使用索引的查询必须措辞匹配,例如,WHERE to_tsvector(config_name, body) @@ 'a & b'与索引中的to_tsvector措辞匹配。 索引甚至可以连接列: 1 CREATE INDEX pgweb_idx_3 ON tsearch.pgweb USING gin(to_tsvector('english', title || ' ' || body)); 另一个方法是创建一个单独的tsvector列控制to_tsvector的输出。下面的例子是title和body的连接, 当其它是NULL的时候,使用coalesce确保一个字段仍然会被索引: 1 2 ALTER TABLE tsearch.pgweb ADD COLUMN textsearchable_index_col tsvector; UPDATE tsearch.pgweb SET textsearchable_index_col = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,'')); 然后为加速搜索创建一个GIN索引: 1 CREATE INDEX textsearch_idx_4 ON tsearch.pgweb USING gin(textsearchable_index_col); 现在,就可以执行一个快速全文搜索了: 1 2 3 4 5 6 7 8 9 10 11 SELECT title FROM tsearch.pgweb WHERE textsearchable_index_col @@ to_tsquery('science & Computer') ORDER BY last_mod_date DESC LIMIT 10; title -------- Computer science (1 rows) 相比于一个表达式索引,单独列方法的一个优势是:它没有必要在查询时明确指定分词器以便能使用索引。正如上面例子所示,查询可以依赖于default_text_search_config。另一个优势是搜索比较快速,因为它没有必要重新利用to_tsvector调用来验证索引匹配。表达式索引方法更容易建立,且它需要较少的磁盘空间,因为tsvector形式没有明确存储。 父主题: 表和索引
-
创建索引 为了加速文本搜索,可以创建GIN索引。 1 gaussdb=# CREATE INDEX pgweb_idx_1 ON tsearch.pgweb USING gin(to_tsvector('english', body)); to_tsvector()函数有两个版本。只输一个参数的版本和输两个参数的版本。只输一个参数时,系统默认采用default_text_search_config所指定的分词器。 请注意:创建索引时必须使用to_tsvector的两参数版本。只有指定了分词器名称的全文检索函数才可以在索引表达式中使用。这是因为索引的内容必须不受default_text_search_config的影响,否则索引内容可能不一致。由于default_text_search_config的值可以随时调整,从而导致不同条目生成的tsvector采用了不同的分词器,并且没有办法区分究竟使用了哪个分词器。正确地转储和恢复这样的索引也是不可能的。 因为在上述创建索引中to_tsvector使用了两个参数,只有当查询时也使用了两个参数,且参数值与索引中相同时,才会使用该索引。也就是说,WHERE to_tsvector('english', body) @@ 'a & b' 可以使用索引,但WHERE to_tsvector(body) @@ 'a & b'不能使用索引。这确保只使用这样的索引——索引各条目是使用相同的分词器创建的。 索引中的分词器名称由另一列指定时可以建立更复杂的表达式索引。例如: 1 gaussdb=# CREATE INDEX pgweb_idx_2 ON tsearch.pgweb USING gin(to_tsvector('ngram', body)); 其中body是pgweb表中的一列。当对索引的各条目使用了哪个分词器进行记录时,允许在同一索引中存在混合分词器。在某些场景下这将是有用的。例如,文档集合中包含不同语言的文档时。再次强调,打算使用索引的查询必须措辞匹配,例如,WHERE to_tsvector(config_name, body) @@ 'a & b'与索引中的to_tsvector措辞匹配。 索引甚至可以连接列: 1 gaussdb=# CREATE INDEX pgweb_idx_3 ON tsearch.pgweb USING gin(to_tsvector('english', title || ' ' || body)); 另一个方法是创建一个单独的tsvector列控制to_tsvector的输出。下面的例子是title和body的连接, 当其它是NULL的时候,使用coalesce确保一个字段仍然会被索引: 1 2 gaussdb=# ALTER TABLE tsearch.pgweb ADD COLUMN textsearchable_index_col tsvector; gaussdb=# UPDATE tsearch.pgweb SET textsearchable_index_col = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,'')); 然后为加速搜索创建一个GIN索引: 1 gaussdb=# CREATE INDEX textsearch_idx_4 ON tsearch.pgweb USING gin(textsearchable_index_col); 现在,就可以执行一个快速全文搜索了: 1 2 3 4 5 6 7 8 9 10 11 gaussdb=# SELECT title FROM tsearch.pgweb WHERE textsearchable_index_col @@ to_tsquery('north & america') ORDER BY last_mod_date DESC LIMIT 10; title -------- Canada Mexico (2 rows) 相比于一个表达式索引,单独列方法的一个优势是:它没有必要在查询时明确指定分词器以便能使用索引。正如上面例子所示,查询可以依赖于default_text_search_config。另一个优势是搜索比较快速,因为它没有必要重新利用to_tsvector调用来验证索引匹配。表达式索引方法更容易建立,且它需要较少的磁盘空间,因为tsvector形式没有明确存储。 父主题: 表和索引
-
搜索表 在不使用索引的情况下也可以进行全文检索。 一个简单查询:将body字段中包含america的每一行打印出来。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 gaussdb=# DROP SCHEMA IF EXISTS tsearch CASCADE; gaussdb=# CREATE SCHEMA tsearch; gaussdb=# CREATE TABLE tsearch.pgweb(id int, body text, title text, last_mod_date date); gaussdb=# INSERT INTO tsearch.pgweb VALUES(1, 'China, officially the People''s Republic of China (PRC), located in Asia, is the world''s most populous state.', 'China', '2010-1-1'); gaussdb=# INSERT INTO tsearch.pgweb VALUES(2, 'America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley.', 'America', '2010-1-1'); gaussdb=# INSERT INTO tsearch.pgweb VALUES(3, 'England is a country that is part of the United Kingdom. It shares land borders with Scotland to the north and Wales to the west.', 'England', '2010-1-1'); gaussdb=# INSERT INTO tsearch.pgweb VALUES(4, 'Australia, officially the Commonwealth of Australia, is a country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands.', 'Australia', '2010-1-1'); gaussdb=# INSERT INTO tsearch.pgweb VALUES(6, 'Japan is an island country in East Asia.', 'Japan', '2010-1-1'); gaussdb=# INSERT INTO tsearch.pgweb VALUES(7, 'Germany, officially the Federal Republic of Germany, is a sovereign state and federal parliamentary republic in central-western Europe.', 'Germany', '2010-1-1'); gaussdb=# INSERT INTO tsearch.pgweb VALUES(8, 'France, is a sovereign state comprising territory in western Europe and several overseas regions and territories.', 'France', '2010-1-1'); gaussdb=# INSERT INTO tsearch.pgweb VALUES(9, 'Italy officially the Italian Republic, is a unitary parliamentary republic in Europe.', 'Italy', '2010-1-1'); gaussdb=# INSERT INTO tsearch.pgweb VALUES(10, 'India, officially the Republic of India, is a country in South Asia.', 'India', '2010-1-1'); gaussdb=# INSERT INTO tsearch.pgweb VALUES(11, 'Brazil, officially the Federative Republic of Brazil, is the largest country in both South America and Latin America.', 'Brazil', '2010-1-1'); gaussdb=# INSERT INTO tsearch.pgweb VALUES(12, 'Canada is a country in the northern half of North America.', 'Canada', '2010-1-1'); gaussdb=# INSERT INTO tsearch.pgweb VALUES(13, 'Mexico, officially the United Mexican States, is a federal republic in the southern part of North America.', 'Mexico', '2010-1-1'); gaussdb=# SELECT id, body, title FROM tsearch.pgweb WHERE to_tsvector('english', body) @@ to_tsquery('english', 'america'); id | body | title ----+-------------------------------------------------------------------------------------------------------------------------+--------- 2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley. | America 12 | Canada is a country in the northern half of North America. | Canada 13 | Mexico, officially the United Mexican States, is a federal republic in the southern part of North America. | Mexico 11 | Brazil, officially the Federative Republic of Brazil, is the largest country in both South America and Latin America. | Brazil (4 rows) 像America这样的相关词也会被找到,因为这些词都被处理成了相同标准的词条。 上面的查询指定english配置来解析和规范化字符串。当然也可以省略此配置,通过default_text_search_config进行配置设置: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 gaussdb=# SHOW default_text_search_config; default_text_search_config ---------------------------- pg_catalog.english (1 row) gaussdb=# SELECT id, body, title FROM tsearch.pgweb WHERE to_tsvector(body) @@ to_tsquery('america'); id | body | title ----+-------------------------------------------------------------------------------------------------------------------------+--------- 11 | Brazil, officially the Federative Republic of Brazil, is the largest country in both South America and Latin America. | Brazil 2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley. | America 12 | Canada is a country in the northern half of North America. | Canada 13 | Mexico, officially the United Mexican States, is a federal republic in the southern part of North America. | Mexico (4 rows) 一个复杂查询:检索出在title或者body字段中包含north和america的最近10篇文档: 1 2 3 4 5 6 gaussdb=# SELECT title FROM tsearch.pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('north & america') ORDER BY last_mod_date DESC LIMIT 10; title -------- Mexico Canada (2 rows) 为了清晰,举例中没有调用coalesce函数在两个字段中查找包含NULL的行。 以上例子均在没有索引的情况下进行查询。对于大多数应用程序来说,这个方法很慢。因此除了偶尔的特定搜索,文本搜索在实际使用中通常需要创建索引。 父主题: 表和索引
-
创建索引 为了加速文本搜索,可以创建GIN索引。 1 gaussdb=# CREATE INDEX pgweb_idx_1 ON tsearch.pgweb USING gin(to_tsvector('english', body)); to_tsvector()函数有两个版本。只输一个参数的版本和输两个参数的版本。只输一个参数时,系统默认采用default_text_search_config所指定的分词器。 请注意:创建索引时必须使用to_tsvector的两参数版本。只有指定了分词器名称的全文检索函数才可以在索引表达式中使用。这是因为索引的内容必须不受default_text_search_config的影响,否则索引内容可能不一致。由于default_text_search_config的值可以随时调整,从而导致不同条目生成的tsvector采用了不同的分词器,并且没有办法区分究竟使用了哪个分词器。正确地转储和恢复这样的索引也是不可能的。 因为在上述创建索引中to_tsvector使用了两个参数,只有当查询时也使用了两个参数,且参数值与索引中相同时,才会使用该索引。也就是说,WHERE to_tsvector('english', body) @@ 'a & b' 可以使用索引,但WHERE to_tsvector(body) @@ 'a & b'不能使用索引。这确保只使用这样的索引——索引各条目是使用相同的分词器创建的。 索引中的分词器名称由另一列指定时可以建立更复杂的表达式索引。例如: 1 gaussdb=# CREATE INDEX pgweb_idx_2 ON tsearch.pgweb USING gin(to_tsvector('ngram', body)); 其中body是pgweb表中的一列。当对索引的各条目使用了哪个分词器进行记录时,允许在同一索引中存在混合分词器。在某些场景下这将是有用的。例如,文档集合中包含不同语言的文档时。再次强调,打算使用索引的查询必须措辞匹配,例如,WHERE to_tsvector(config_name, body) @@ 'a & b'与索引中的to_tsvector措辞匹配。 索引甚至可以连接列: 1 gaussdb=# CREATE INDEX pgweb_idx_3 ON tsearch.pgweb USING gin(to_tsvector('english', title || ' ' || body)); 另一个方法是创建一个单独的tsvector列控制to_tsvector的输出。下面的例子是title和body的连接, 当其它是NULL的时候,使用coalesce确保一个字段仍然会被索引: 1 2 gaussdb=# ALTER TABLE tsearch.pgweb ADD COLUMN textsearchable_index_col tsvector; gaussdb=# UPDATE tsearch.pgweb SET textsearchable_index_col = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,'')); 然后为加速搜索创建一个GIN索引: 1 gaussdb=# CREATE INDEX textsearch_idx_4 ON tsearch.pgweb USING gin(textsearchable_index_col); 现在,就可以执行一个快速全文搜索了: 1 2 3 4 5 6 7 8 9 10 11 gaussdb=# SELECT title FROM tsearch.pgweb WHERE textsearchable_index_col @@ to_tsquery('north & america') ORDER BY last_mod_date DESC LIMIT 10; title -------- Canada Mexico (2 rows) 相比于一个表达式索引,单独列方法的一个优势是:它没有必要在查询时明确指定分词器以便能使用索引。正如上面例子所示,查询可以依赖于default_text_search_config。另一个优势是搜索比较快速,因为它没有必要重新利用to_tsvector调用来验证索引匹配。表达式索引方法更容易建立,且它需要较少的磁盘空间,因为tsvector形式没有明确存储。 父主题: 表和索引
-
创建索引 为了加速文本搜索,可以创建GIN索引。 1 postgres=# CREATE INDEX pgweb_idx_1 ON tsearch.pgweb USING gin(to_tsvector('english', body)); to_tsvector()函数有两个版本。只输一个参数的版本和输两个参数的版本。只输一个参数时,系统默认采用default_text_search_config所指定的分词器。 请注意:创建索引时必须使用to_tsvector的两参数版本。只有指定了分词器名称的全文检索函数才可以在索引表达式中使用。这是因为索引的内容必须不受default_text_search_config的影响,否则索引内容可能不一致。由于default_text_search_config的值可以随时调整,从而导致不同条目生成的tsvector采用了不同的分词器,并且没有办法区分究竟使用了哪个分词器。正确地转储和恢复这样的索引也是不可能的。 因为在上述创建索引中to_tsvector使用了两个参数,只有当查询时也使用了两个参数,且参数值与索引中相同时,才会使用该索引。也就是说,WHERE to_tsvector('english', body) @@ 'a & b' 可以使用索引,但WHERE to_tsvector(body) @@ 'a & b'不能使用索引。这确保只使用这样的索引——索引各条目是使用相同的分词器创建的。 索引中的分词器名称由另一列指定时可以建立更复杂的表达式索引。例如: 1 postgres=# CREATE INDEX pgweb_idx_2 ON tsearch.pgweb USING gin(to_tsvector('ngram', body)); 其中body是pgweb表中的一列。当对索引的各条目使用了哪个分词器进行记录时,允许在同一索引中存在混合分词器。在某些场景下这将是有用的。例如,文档集合中包含不同语言的文档时。再次强调,打算使用索引的查询必须措辞匹配,例如,WHERE to_tsvector(config_name, body) @@ 'a & b'与索引中的to_tsvector措辞匹配。 索引甚至可以连接列: 1 postgres=# CREATE INDEX pgweb_idx_3 ON tsearch.pgweb USING gin(to_tsvector('english', title || ' ' || body)); 另一个方法是创建一个单独的tsvector列控制to_tsvector的输出。下面的例子是title和body的连接, 当其它是NULL的时候,使用coalesce确保一个字段仍然会被索引: 1 2 postgres=# ALTER TABLE tsearch.pgweb ADD COLUMN textsearchable_index_col tsvector; postgres=# UPDATE tsearch.pgweb SET textsearchable_index_col = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,'')); 然后为加速搜索创建一个GIN索引: 1 postgres=# CREATE INDEX textsearch_idx_4 ON tsearch.pgweb USING gin(textsearchable_index_col); 现在,就可以执行一个快速全文搜索了: 1 2 3 4 5 6 7 8 9 10 11 postgres=# SELECT title FROM tsearch.pgweb WHERE textsearchable_index_col @@ to_tsquery('north & america') ORDER BY last_mod_date DESC LIMIT 10; title -------- Canada Mexico (2 rows) 相比于一个表达式索引,单独列方法的一个优势是:它没有必要在查询时明确指定分词器以便能使用索引。正如上面例子所示,查询可以依赖于default_text_search_config。另一个优势是搜索比较快速,因为它没有必要重新利用to_tsvector调用来验证索引匹配。表达式索引方法更容易建立,且它需要较少的磁盘空间,因为tsvector形式没有明确存储。 父主题: 表和索引
-
创建索引 为了加速文本搜索,可以创建GIN索引。 1 CREATE INDEX pgweb_idx_1 ON tsearch.pgweb USING gin(to_tsvector('english', body)); to_tsvector()函数有两个版本,只输一个参数的版本和输两个参数的版本。 只输一个参数时,系统默认采用default_text_search_config所指定的分词器。 创建索引时必须使用to_tsvector的两参数版本,否则索引内容可能不一致。只有指定了分词器名称的全文检索函数才可以在索引表达式中使用。因为索引的内容不受default_text_search_config的影响。由于default_text_search_config的值可以随时调整,从而导致不同条目生成的tsvector采用了不同的分词器,并且无法区分究竟使用了哪个分词器。正确地转储和恢复这样的索引也是不支持的。 在上述创建索引中to_tsvector使用了两个参数,只有当查询时也使用了两个参数,且参数值与索引中相同时,才会使用该索引。例如,WHERE to_tsvector('english', body) @@ 'a & b' 可以使用索引,但WHERE to_tsvector(body) @@ 'a & b'不能使用索引。这确保索引各条目是使用相同的分词器创建的。 索引中的分词器名称由另一列指定时可以建立更复杂的表达式索引。例如: 1 CREATE INDEX pgweb_idx_2 ON tsearch.pgweb USING gin(to_tsvector('zhparser', body)); 本示例中zhparser仅支持UTF8/GBK的数据库编码格式,在Encoding为SQL_ASCII下会报错。 其中body是pgweb表中的一列。当对索引的各条目使用了哪个分词器进行记录时,允许在同一索引中存在混合分词器。在某些场景下这将是有用的。例如,文档集合中包含不同语言的文档时。再次强调,打算使用索引的查询必须措辞匹配,例如,WHERE to_tsvector(config_name, body) @@ 'a & b'与索引中的to_tsvector措辞匹配。 索引甚至可以连接列: 1 CREATE INDEX pgweb_idx_3 ON tsearch.pgweb USING gin(to_tsvector('english', title || ' ' || body)); 另一个方法是创建一个单独的tsvector列控制to_tsvector的输出。下面的例子是title和body的连接, 当其它是NULL的时候,使用coalesce确保一个字段仍然会被索引: 12 ALTER TABLE tsearch.pgweb ADD COLUMN textsearchable_index_col tsvector;UPDATE tsearch.pgweb SET textsearchable_index_col = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,'')); 然后为加速搜索创建一个GIN索引: 1 CREATE INDEX textsearch_idx_4 ON tsearch.pgweb USING gin(textsearchable_index_col); 现在,就可以执行一个快速全文搜索了: 1 2 3 4 5 6 7 8 91011 SELECT title FROM tsearch.pgweb WHERE textsearchable_index_col @@ to_tsquery('science & Computer') ORDER BY last_mod_date DESC LIMIT 10; title -------- Computer science(1 rows) 相比于一个表达式索引,单独列方法的一个优势是:它没有必要在查询时明确指定分词器以便能使用索引。正如上面例子所示,查询可以依赖于default_text_search_config。另一个优势是搜索比较快速,因为它没有必要重新利用to_tsvector调用来验证索引匹配。表达式索引方法更容易建立,且它需要较少的磁盘空间,因为tsvector形式没有明确存储。 父主题: 表和索引
更多精彩内容
CDN加速
GaussDB
文字转换成语音
免费的服务器
如何创建网站
域名网站购买
私有云桌面
云主机哪个好
域名怎么备案
手机云电脑
SSL证书申请
云点播服务器
免费OCR是什么
电脑云桌面
域名备案怎么弄
语音转文字
文字图片识别
云桌面是什么
网址安全检测
网站建设搭建
国外CDN加速
SSL免费证书申请
短信批量发送
图片OCR识别
云数据库MySQL
个人域名购买
录音转文字
扫描图片识别文字
OCR图片识别
行驶证识别
虚拟电话号码
电话呼叫中心软件
怎么制作一个网站
Email注册网站
华为VNC
图像文字识别
企业网站制作
个人网站搭建
华为云计算
免费租用云托管
云桌面云服务器
ocr文字识别免费版
HTTPS证书申请
图片文字识别转换
国外域名注册商
使用免费虚拟主机
云电脑主机多少钱
鲲鹏云手机
短信验证码平台
OCR图片文字识别
SSL证书是什么
申请企业邮箱步骤
免费的企业用邮箱
云免流搭建教程
域名价格