by Kun Cheng via Microsoft SQL Server Development Customer Advisory Team on 5/20/2010 11:46:00 PM
Author: Kun ChengReviewer: Alexei Khalyako
Last week, I was presented with an interesting Full-text search problem, which I like to share with you, especially the ones who need to support international markets with SQL Server Full-text solution.
The issue was that the customer application not able to find target strings as part of their routine test cycle. Note this customer is planning to enter Asia market with support of local languages. The unit test of Chinese language failed because of the issue. To simplify the scenario, here is simplified repro table schema and Full-text query (See appendix for complete repro script):
Table:
CREATE TABLE FTtable( id int NOT NULL, content nvarchar(255) COLLATE Chinese_Simplified_Pinyin_100_CI_AS );
Query:
SELECT * FROM FTtable WHERE CONTAINS([content], N'????',language 2052 );
Language 2052 is Simplified Chinese. The test machine is running Chinese locale Windows and SQL Server 2008. By directly selecting rows from the table, I can tell there are matching rows with the searched target string '????'. Now the next question is maybe the Full-text catalog doesn’t have these keywords when it’s populated by Full-text indexing. In SQL Server 2008 there is a new DMV to display keywords of Full-text index, which is really helpful to identify what’s indexed and what’s not.
select * from sys.dm_fts_index_keywords(DB_ID('FTdb'), object_id('FTtable'))
Surprisingly, it DOES return the target key words:
keyword
display_term
column_id
document_count
0x3583358B359D35AD
????
2
1
0x7ED36838
??
0xFF
END OF FILE
Now there is another new DMF introduced in SQL Server 2008, which takes an input string and generates the same result as SQL Full-text engine. Let’s give it a try to see if it’s because of the way SQL Full-text engine processes the query string.
SELECT * FROM sys.dm_fts_parser
(N'????', 2052, null, 0)
group_id
phrase_id
occurrence
special_term
expansion_type
source_term
0x3583
0
Exact Match
?
0x358B
0x359D
3
0x35AD
4
Now it makes more sense. The Full-text index contains the whole string vs the Full-text engine breaks the input string into smaller words, thus no match. Let’s try English as language clause (1033) instead:
(N'????', 1033, null, 0)
As it turned out, what this customer did and forgot was creating full-index for the table on a machine using default language of English. Then it’s restored to the Chinese test machine. So the Full-text index was populated using English word breaker and now the query is using Chinese language to search for it. In most cases, it actually works out fine for Unicode. However for strings like '????', which is called surrogate (supplementary) Unicode and not supported by Full-text engine, it might or might not work. For definition of surrogate characters, check out http://msdn.microsoft.com/en-us/library/ms180942.aspx. As best practice, use same language to create Full-text index and issue Full-text query if all possible. You may use Full-text search for other languages but only for exact matches. For more details of multi-language support of Full-text, I recommend read of the article: http://msdn.microsoft.com/en-us/library/ms142507.aspx.
Appendix:
CREATE DATABASE FTdb
GO
USE FTdb
CREATE FULLTEXT CATALOG FTcatalog AS DEFAULT;
CREATE UNIQUE CLUSTERED INDEX FTtable_C_INDEX1 ON FTtable(id)
INSERT INTO FTtable VALUES (1, N'??')
INSERT INTO FTtable VALUES (2, N'????')
-- English word breaker is used
CREATE FULLTEXT INDEX ON FTtable(content LANGUAGE 1033) KEY INDEX FTtable_C_INDEX1;
-- Chinese word breaker is used
-- CREATE FULLTEXT INDEX ON FTtable(content LANGUAGE 2052) KEY INDEX FTtable_C_INDEX1;
-- regular unicode string search works fine
SELECT * FROM FTtable WHERE CONTAINS([content], N'??',language 1033 );
SELECT * FROM FTtable WHERE CONTAINS([content], N'??',language 2052 );
-- surrogate unicode string search works for English only
SELECT * FROM FTtable WHERE CONTAINS([content], N'????',language 1033 );
Original Post: Full-text query in local languages
The content of the postings is owned by the respective author. SQL Feeds is not responsible for the contents of the postings. This site is automatically generated and cannot be reviewed for abusive content. If you find abusive content on SQL Feeds, please contact us. Designated trademarks and brands are the property of their respective owners. All rights reserved.