In this post we will see how Cerebrata can be used to create and manage character (char) filters in an Azure Search service index. Azure Search supports character filters, which preprocess a stream of characters before it is passed to the tokenizer.
Please note that a char filter works only when it is part of a custom analyzer, and that analyzer must be assigned to at least one field of the index.
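To make that relationship concrete, here is a minimal sketch of how a char filter, a custom analyzer and an index field reference each other in an index definition. It is written as a Python dictionary that mirrors the JSON shape of the Azure Search REST API as we understand it. The index, field, filter and analyzer names are hypothetical, and the choice of the standard_v2 tokenizer is an assumption made only for illustration.

```python
import json

# Hypothetical index definition: all names below are illustrative only.
index_definition = {
    "name": "products",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {
            "name": "description",
            "type": "Edm.String",
            "searchable": True,
            # The field points at the custom analyzer by name.
            "analyzer": "my_custom_analyzer",
        },
    ],
    "charFilters": [
        {
            "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
            "name": "my_mapping_filter",
            "mappings": ["a=>b"],
        }
    ],
    "analyzers": [
        {
            "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
            "name": "my_custom_analyzer",
            "tokenizer": "standard_v2",  # assumed tokenizer choice
            # Char filters run in the order listed, before the tokenizer.
            "charFilters": ["html_strip", "my_mapping_filter"],
        }
    ],
}

print(json.dumps(index_definition, indent=2))
```

The char filter is never attached to a field directly: the field names the analyzer, and the analyzer names the char filters.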
A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For instance, a character filter could be used to convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or to strip HTML elements like <b> from the stream. The following types of character filters are available:
- HTML Strip Character Filter
- Mapping Character Filter
- Pattern Replace Character Filter
HTML Strip Character Filter
The HTML strip character filter strips HTML elements from the text and replaces HTML entities with their decoded value (e.g. replacing “&amp;” with “&”). This filter removes the HTML tags while preserving the other content.
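The effect can be approximated in a few lines of Python. This is only a sketch of the idea using the standard library's HTML parser, not the filter that the search service runs, and the function name strip_html is hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text content and ignores tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into &.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text: str) -> str:
    """Return text with HTML tags removed and entities decoded."""
    collector = _TextCollector()
    collector.feed(text)
    collector.close()
    return "".join(collector.parts)


print(strip_html("<p>Fish &amp; <b>chips</b></p>"))  # prints: Fish & chips
```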
For this filter to take effect, it must be included in a custom analyzer, and that analyzer must be assigned to an index field. The HTML strip character filter is a predefined filter that can be selected while creating the custom analyzer. To use it, navigate to Search service index > Index Settings > Analyzers > Custom Analyzer and choose it from the char filters option.
Mapping Character Filter
The Mapping Character Filter accepts a map of keys and values. Whenever it encounters a string of characters that is the same as a key, it replaces it with the value associated with that key. Matching is greedy, i.e. the longest pattern matching at a given point wins. Replacements are allowed to be the empty string. For example, if the character ‘a’ is mapped to the character ‘b’, the content “age” is indexed as “bge”, so searching the index for “bge” finds the document whose content is “age”.
To create a mapping character filter, navigate to Search service index > Index Settings > Char Filters tab > New Char Filter > Mapping Char Filter option > New Mapping option. Enter the characters to look for in the “string to find” input box and the string they should be mapped to in the “replacement string” input box.
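The greedy, longest-match behaviour of a mapping can be sketched in plain Python as follows. This is an illustration of the matching rule described above, not the service's implementation, and apply_mappings is a hypothetical helper name.

```python
from typing import Mapping


def apply_mappings(text: str, mappings: Mapping[str, str]) -> str:
    """Replace every key found in text with its value, longest key first."""
    # Try longer keys before shorter ones so the longest match wins.
    keys = sorted((k for k in mappings if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mappings[key])
                i += len(key)
                break
        else:
            # No key matches at this position: keep the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


print(apply_mappings("age", {"a": "b"}))                   # prints: bge
print(apply_mappings("abc ab", {"ab": "X", "abc": "Y"}))   # prints: Y X
print(apply_mappings("145-326", {"-": ""}))                # prints: 145326
```

The last call shows a replacement with the empty string, which simply removes the matched characters.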
Pattern Replace Character Filter
The Pattern Replace Character Filter uses a regular expression to match the characters that should be replaced with the specified replacement string. The regular expression identifies the character sequences to preserve, and the replacement pattern identifies the characters to replace. The replacement string can reference capture groups in the regular expression using the $1..$9 syntax.
For example, to replace each underscore (_) in the incoming string (145_326_897) with “#”, enter the regular expression (“(\\d+)_(?=\\d)”) in the char filter pattern input textbox. It matches a run of digits followed by an underscore, but only when another digit comes after the underscore. Then enter “$1#” in the char filter replacement input textbox, which keeps the captured digits and writes “#” in place of the underscore. The filtered string becomes (145#326#897). While searching the index documents using Cerebrata you can search for (“145#326#897”) and the results will still show the stored value (“145_326_897”), because the character filter changes only the text passed to the tokenizer, not the stored field content.
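The same replacement can be tried out locally with Python's re module. This is a sketch under two assumptions: that the doubled backslashes in the pattern above are JSON-style escaping for a single backslash, and that translating the $1 group reference into Python's \g<1> form gives equivalent behaviour for this simple pattern. The helper names are hypothetical.

```python
import re


def to_python_replacement(replacement: str) -> str:
    """Translate $1..$9 group references into Python's \\g<n> syntax."""
    return re.sub(r"\$(\d)", r"\\g<\1>", replacement)


def pattern_replace(text: str, pattern: str, replacement: str) -> str:
    """Apply a pattern-replace style substitution to text."""
    return re.sub(pattern, to_python_replacement(replacement), text)


# Digits followed by an underscore, only when another digit comes next.
pattern = r"(\d+)_(?=\d)"

print(pattern_replace("145_326_897", pattern, "$1#"))  # prints: 145#326#897
```

Because of the lookahead (?=\d), a trailing underscore with no digit after it is left untouched.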
To create a pattern replace filter in Cerebrata, navigate to Search service index > Index Settings > Char Filters tab > New Char Filter > select the “Pattern Replace” char filter option > assign a friendly name to the filter. Then enter the char filter pattern and the char filter replacement string in their respective input boxes.
Summary
In this post we learnt about character filters in the Azure Cognitive Search service and how they can be managed using Cerebrata. Cerebrata provides a comprehensive set of features for managing indexes, data sources, indexers, synonym maps, and data in your Azure Cognitive Search service. You can learn more about it on our website at https://cerebrata.com/features/azure-search.
Other than that, Cerebrata has best-of-breed management features for Azure Storage, Cosmos DB, Service Bus, Redis Cache and more. You can learn more about the available features on our website at https://www.cerebrata.com/.