by Hilary Cotter via Hilary Cotter on 3/8/2007 1:11:00 PM
On the newsgroups recently I have encountered some posters who were perplexed by the results of their full-text searches.
The first poster was puzzled as to why searches on f.b.i. would not find F.B.I. in his content. SQL FTS and the Microsoft Search products index upper-cased tokens that contain periods differently from other tokens, whether lower or upper case. During indexing, most language word breakers index an upper-cased token with periods twice: once as the token itself, and once as the token with all the periods stripped out. So F.B.I. is indexed as F.B.I. and FBI. All tokens are stored upper-cased in the index, with the exception of path names (for cosmetic reasons: in Indexing Service, for example, a path like c:\TeMp is stored as c:\temp because lowercase was considered to look better than uppercase).
There is no length cutoff as far as I am aware. For example, C.R.E.E.P. (Committee to Re-Elect the President, the unfortunate and, we can only hope, unintentional acronym for President Richard Nixon's second election campaign) is indexed as CREEP and C.R.E.E.P. I've tried combinations of up to 10 letters.
Here is an example:
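As a rough illustration of the rule described above, here is a toy model in Python. It is not SQL Server code and does not call the real word breakers. The function name `dotted_acronym_entries` and the regular expression are my own assumptions about what counts as an "upper-cased token with periods"; tokens that do not match are simply reported as not covered by the rule.

```python
import re

# Assumption: a "dotted acronym" is two or more single upper-case letters,
# each followed by a period, e.g. "F.B.I." The real matching rules inside
# the word breakers may differ.
DOTTED_UPPER = re.compile(r"(?:[A-Z]\.){2,}")


def dotted_acronym_entries(token: str) -> list[str]:
    """Return the index entries the described rule would produce.

    An empty list means the rule does not apply to this token; how such
    tokens are actually indexed is not modelled here.
    """
    if DOTTED_UPPER.fullmatch(token):
        # Stored once as written and once with the periods stripped out.
        return [token, token.replace(".", "")]
    return []


print(dotted_acronym_entries("F.B.I."))      # ['F.B.I.', 'FBI']
print(dotted_acronym_entries("C.R.E.E.P."))  # ['C.R.E.E.P.', 'CREEP']
print(dotted_acronym_entries("f.b.i."))      # []
```

In this sketch the lower-case f.b.i. does not get the special treatment, which is consistent with the upper-case form being handled differently from other tokens.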
Note that this works for most languages but does NOT work for the neutral, Chinese (both types), Japanese, and Thai word breakers. They will index F.B.I. as F, B, and I, in other words as three separate letters. The Korean word breaker will index F.B.I. as F.B.I.
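The language differences can be summarised in a small Python sketch. Again this is only a toy model of the behaviour described here, not the real word breakers. The function name `break_acronym` and the language labels are illustrative strings I made up (they are not LCIDs or real word-breaker identifiers), and I am assuming "both types" of Chinese means Simplified and Traditional.

```python
# Word breakers that, per the description, split a dotted acronym into letters.
# The labels are hypothetical names chosen for this sketch.
LETTER_SPLITTING = {
    "neutral",
    "chinese-simplified",
    "chinese-traditional",
    "japanese",
    "thai",
}


def break_acronym(token: str, language: str) -> list[str]:
    """Return the tokens a toy index would hold for a dotted acronym."""
    letters = [part for part in token.split(".") if part]
    if language in LETTER_SPLITTING:
        # Each letter becomes its own token: F, B, I.
        return letters
    if language == "korean":
        # Kept as a single token, periods included.
        return [token]
    # Most other languages: the token as written plus the stripped form.
    return [token, "".join(letters)]


for lang in ("english", "neutral", "korean"):
    print(lang, break_acronym("F.B.I.", lang))
```

Any language label that is not listed falls through to the "most languages" branch, which is a simplification made for this sketch.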
The next post covers some of the unusual behaviors when searching on C#, C+, C++, F#, and similar terms.
Original Post: SQL Full-text search anomolies