Monday, September 8, 2014

Removing diacritics (accents on letters) from strings in X++

If you want to remove diacritics (accents on letters) from a string like "ÁÂÃÄÅÇÈÉàáâãäåèéêëìíîïòóôõ£ALEX" to get the friendlier string "AAAAACEEaaaaaaeeeeiiiioooo£ALEX", you can use this block of code:

static void AlexRemoveDiacritics(Args _args)
{
    str strInput = 'ÁÂÃÄÅÇÈÉàáâãäåèéêëìíîïòóôõ£ALEX';
    System.String input = strInput;
    str retVal;
    int i;

    System.Char c;
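    // Form D decomposes each accented letter into its base letter
    // followed by separate combining marks.
    System.Text.NormalizationForm FormD = System.Text.NormalizationForm::FormD;
    str normalizedString = input.Normalize(FormD);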
    System.Text.StringBuilder stringBuilder = new System.Text.StringBuilder();

    for (i = 1; i <= strLen(normalizedString); i++)
    {
        c = System.Char::Parse(subStr(normalizedString, i, 1));

        // Keep every character except the combining marks (NonSpacingMark).
        if (System.Globalization.CharUnicodeInfo::GetUnicodeCategory(c) != System.Globalization.UnicodeCategory::NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    input = stringBuilder.ToString();
    // Recompose whatever is left; Normalize() without arguments uses Form C.
    input = input.Normalize();
    retVal = input;

    info(strFmt("Before: '%1'", strInput));
    info(strFmt("After: '%1'", retVal));
}



I did not come up with this code; I merely adapted it from http://www.codeproject.com/Tips/410074/Removing-Diacritics-from-Strings, whose author adapted it from another person who has since deleted their blog.  It is still very useful.
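
If you need this in more than one place, the same logic could be wrapped in a reusable static method. The following is only a sketch: the class name AlexStringUtil, the method name removeDiacritics, and the demo job AlexDiacriticSearchDemo are hypothetical names I made up for illustration, and the loop body is just the job above with the input passed in as a parameter.

```xpp
// Hypothetical static method on an assumed utility class named AlexStringUtil.
public static str removeDiacritics(str _input)
{
    System.String                   input = _input;
    System.Text.NormalizationForm   formD = System.Text.NormalizationForm::FormD;
    str                             normalizedString = input.Normalize(formD);
    System.Text.StringBuilder       stringBuilder = new System.Text.StringBuilder();
    System.Char                     c;
    str                             retVal;
    int                             i;

    // X++ strings are 1-based, so the loop starts at 1.
    for (i = 1; i <= strLen(normalizedString); i++)
    {
        c = System.Char::Parse(subStr(normalizedString, i, 1));

        // Drop the combining marks, keep everything else.
        if (System.Globalization.CharUnicodeInfo::GetUnicodeCategory(c) != System.Globalization.UnicodeCategory::NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    input = stringBuilder.ToString();
    input = input.Normalize();
    retVal = input;

    return retVal;
}

// Hypothetical job: strip diacritics from both sides before comparing,
// so a search typed without accents can still find the stored value.
static void AlexDiacriticSearchDemo(Args _args)
{
    str stored = 'résumé';
    str typed  = 'resume';

    if (AlexStringUtil::removeDiacritics(stored) == AlexStringUtil::removeDiacritics(typed))
    {
        info(strFmt("'%1' matches '%2'", typed, stored));
    }
}
```

Stripping both the stored value and the search term means neither side needs to know how the other was typed. The stored data itself stays untouched.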

8 comments:

  1. Why stop at the diacritics? Let's start using the $ symbol instead of S or P instead of B as well! I think it looks way friendlier.

    Replies
    1. Not sure if you're joking or not...but diacritics are not the same as similar looking symbols. Diacritics are symbols written above or below a letter to indicate a difference in pronunciation. If you notice in the sample I used £ near the end in the before string, and it remained in the after string.

      An example of when this could be needed is if a user was searching for a string and you wanted to remove diacritics for string matching...perhaps they were using a standard English keyboard. Another is integrations between disparate systems. The Unicode normalization allows for all languages to be represented too.

      H0p3 th1$ h31ps!

    2. A good example I just thought of would be if a user was searching for the word "resume" in a dictionary, and the dictionary actually has the word "résumé" stored in the database.

    3. Yes, exactly. Resume is not the same as résumé, so why strip the diacritics at all? The days of 7-bit US-ASCII should finally be laid to rest.

    4. The point of the code is how to do it if you need to.

      When I go to Google to search for résumé templates...my keyboard doesn't have a way to type é symbols easily. I have to spend a few minutes figuring out just how to type that letter. I instead just search for "resume templates" and Google can return results for both. Search is the best example...does that not make perfect sense? If I have a user looking for data...I'd like to return the best results.

  2. Thanks for this snippet.
    Please note that the loop should start at 1 not 0.

    Replies
    1. You are right! Made the update. I would think AX would notify of an out-of-bounds exception, but instead it just adjusts to 1. This could cause duplication for the first character. Good catch and I made the edit.

  3. great, man! Thank you for the time you saved
