SQL Feeds - All your SQL feeds in one place.

Sponsors

Thursday, May 20, 2010

Full-text query in local languages

by Kun Cheng via Microsoft SQL Server Development Customer Advisory Team on 5/20/2010 11:46:00 PM

Author: Kun Cheng
Reviewer: Alexei Khalyako

Last week, I was presented with an interesting Full-text search problem, which I like to share with you, especially the ones who need to support international markets with SQL Server Full-text solution.

The issue was that the customer application not able to find target strings as part of their routine test cycle. Note this customer is planning to enter Asia market with support of local languages. The unit test of Chinese language failed because of the issue. To simplify the scenario, here is simplified repro table schema and Full-text query (See appendix for complete repro script):

Table:

CREATE TABLE FTtable( id int NOT NULL, content nvarchar(255) COLLATE Chinese_Simplified_Pinyin_100_CI_AS );

Query:

SELECT * FROM FTtable WHERE CONTAINS([content], N'????',language 2052 );

Language 2052 is Simplified Chinese. The test machine is running Chinese locale Windows and SQL Server 2008. By directly selecting rows from the table, I can tell there are matching rows with the searched target string '????'. Now the next question is maybe the Full-text catalog doesn’t have these keywords when it’s populated by Full-text indexing. In SQL Server 2008 there is a new DMV to display keywords of Full-text index, which is really helpful to identify what’s indexed and what’s not.

select * from sys.dm_fts_index_keywords(DB_ID('FTdb'), object_id('FTtable'))

Surprisingly, it DOES return the target key words:

keyword

display_term

column_id

document_count

0x3583358B359D35AD

????

2

1

0x7ED36838

??

2

1

0xFF

END OF FILE

2

2

 

Now there is another new DMF introduced in SQL Server 2008, which takes an input string and generates the same result as SQL Full-text engine. Let’s give it a try to see if it’s because of the way SQL Full-text engine processes the query string.

SELECT * FROM sys.dm_fts_parser

(N'????', 2052, null, 0)

keyword

group_id

phrase_id

occurrence

special_term

display_term

expansion_type

source_term

0x3583

1

0

1

Exact Match

?

0

????

0x358B

1

0

2

Exact Match

?

0

????

0x359D

1

0

3

Exact Match

?

0

????

0x35AD

1

0

4

Exact Match

?

0

????

 

Now it makes more sense. The Full-text index contains the whole string vs the Full-text engine breaks the input string into smaller words, thus no match. Let’s try English as language clause (1033) instead:

SELECT * FROM sys.dm_fts_parser

(N'????', 1033, null, 0)

keyword

group_id

phrase_id

occurrence

special_term

display_term

expansion_type

source_term

0x3583358B359D35AD

1

0

1

Exact Match

????

0

????

 

As it turned out, what this customer did and forgot was creating full-index for the table on a machine using default language of English. Then it’s restored to the Chinese test machine. So the Full-text index was populated using English word breaker and now the query is using Chinese language to search for it. In most cases, it actually works out fine for Unicode. However for strings like '????', which is called surrogate (supplementary) Unicode and not supported by Full-text engine, it might or might not work. For definition of surrogate characters, check out http://msdn.microsoft.com/en-us/library/ms180942.aspx. As best practice, use same language to create Full-text index and issue Full-text query if all possible. You may use Full-text search for other languages but only for exact matches. For more details of multi-language support of Full-text, I recommend read of the article: http://msdn.microsoft.com/en-us/library/ms142507.aspx.

 

 

Appendix:

CREATE DATABASE FTdb

GO

USE FTdb

GO

CREATE FULLTEXT CATALOG FTcatalog AS DEFAULT;

GO

 

CREATE TABLE FTtable( id int NOT NULL, content nvarchar(255) COLLATE Chinese_Simplified_Pinyin_100_CI_AS );

CREATE UNIQUE CLUSTERED INDEX FTtable_C_INDEX1 ON FTtable(id)

GO

INSERT INTO FTtable VALUES (1, N'??')

INSERT INTO FTtable VALUES (2, N'????')

GO

 

-- English word breaker is used

CREATE FULLTEXT INDEX ON FTtable(content LANGUAGE 1033) KEY INDEX FTtable_C_INDEX1;

-- Chinese word breaker is used

-- CREATE FULLTEXT INDEX ON FTtable(content LANGUAGE 2052) KEY INDEX FTtable_C_INDEX1;

GO

 

-- regular unicode string search works fine

SELECT * FROM FTtable WHERE CONTAINS([content], N'??',language 1033 );

SELECT * FROM FTtable WHERE CONTAINS([content], N'??',language 2052 );

 

-- surrogate unicode string search works for English only

SELECT * FROM FTtable WHERE CONTAINS([content], N'????',language 2052 );

SELECT * FROM FTtable WHERE CONTAINS([content], N'????',language 1033 );

 

email it!bookmark it!digg it!

Original Post: Full-text query in local languages

Legal Note

The content of the postings is owned by the respective author. SQL Feeds is not responsible for the contents of the postings. This site is automatically generated and cannot be reviewed for abusive content. If you find abusive content on SQL Feeds, please contact us. Designated trademarks and brands are the property of their respective owners. All rights reserved.

Advertise with us