Unicode surrogate pairs and LDAP?

I came across Unicode strings containing characters outside plane 0, i.e. with code points above U+FFFF. As an example, one string contained the character U+1F4AA ("Flexed Biceps"). This character was received as a UTF-8 string and stored in an eDirectory (case ignore string) attribute value through IDM.

Now, eDirectory seems to store string values internally using UTF-16 LE. Using iMonitor, I could see that the value was stored as follows:

0x3D 0xD8 0xAA 0xDC

This is the correct representation of the given Unicode character as a UTF-16 LE surrogate pair (U+D83D U+DCAA).
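
For reference, here is a small sketch of the surrogate pair arithmetic in Python 3. The variable names are just for illustration; the only input is the code point U+1F4AA from above.

```python
# Split a supplementary code point into its UTF-16 surrogate pair.
cp = 0x1F4AA
offset = cp - 0x10000              # 20-bit value carried by the pair
high = 0xD800 + (offset >> 10)     # high (lead) surrogate: top 10 bits
low = 0xDC00 + (offset & 0x3FF)    # low (trail) surrogate: bottom 10 bits

# Python's own UTF-16 LE encoder should produce the same four bytes
# that iMonitor showed.
utf16le = chr(cp).encode('utf-16-le')

print(hex(high), hex(low), utf16le.hex(' '))
```

The two 16-bit units are written low byte first in UTF-16 LE, which is why 0xD83D shows up as 0x3D 0xD8.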

When reading the attribute value through LDAP (python 3.6.8 / python-ldap 3.4.4), the byte buffer for the value is as follows:

b'\xed\xa0\xbd\xed\xb2\xaa'

Investigating this a little, I found out that this is actually the UTF-8 encoded representation of the two Unicode surrogates:

\xed \xa0 \xbd UTF-8 decoded = 0xD83D

\xed \xb2 \xaa UTF-8 decoded = 0xDCAA
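
This can be checked in Python 3 with the `surrogatepass` error handler, which lets the UTF-8 codec accept encoded surrogates. A quick sketch, using the byte buffer from above:

```python
raw = b'\xed\xa0\xbd\xed\xb2\xaa'

# A strict UTF-8 decoder rejects encoded surrogates outright.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogatepass, each 3-byte sequence becomes one surrogate code point.
text = raw.decode('utf-8', errors='surrogatepass')
code_units = [hex(ord(ch)) for ch in text]

print(strict_ok, code_units)
```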

So, to get a usable string out of the returned LDAP query result, one first needs to decode the response as UTF-8 and then combine any UTF-16 surrogate pairs in the result. In Python 3, something like this might do:

value.decode('utf-8', errors='surrogatepass').encode('utf-16', errors='surrogatepass').decode('utf-16')
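
Wrapped into a helper, the same idea could look like the sketch below. The function name `decode_ldap_text` is made up for this post, and it assumes the value is either ordinary UTF-8 or UTF-8 with the surrogates encoded separately (a layout that, as far as I can tell, matches what Unicode Technical Report #26 calls CESU-8).

```python
def decode_ldap_text(value: bytes) -> str:
    """Decode an LDAP attribute value, merging UTF-8 encoded surrogate pairs.

    Hypothetical helper, not part of python-ldap.
    """
    # Step 1: accept surrogates that were encoded as 3-byte UTF-8 sequences.
    text = value.decode('utf-8', errors='surrogatepass')
    # Step 2: round-trip through UTF-16 so that each high/low surrogate
    # pair is combined into a single supplementary code point.
    # An unpaired surrogate makes the final decode raise UnicodeDecodeError.
    return text.encode('utf-16', errors='surrogatepass').decode('utf-16')


fixed = decode_ldap_text(b'\xed\xa0\xbd\xed\xb2\xaa')
proper_utf8 = fixed.encode('utf-8')
print(hex(ord(fixed)), proper_utf8)
```

Values that are already well-formed UTF-8 pass through unchanged, so the helper should be safe to apply to every string attribute.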

As the LDAP RFCs specify that strings should be sent and received as UTF-8, this seems like a bug to me. The received response is indeed UTF-8 formatted, but I don't see any reason why it would contain embedded UTF-16 surrogate pairs. Instead, the response should be a plain UTF-8 string, with higher plane Unicode characters encoded directly as UTF-8, right?

  • 0  

    Can you write a UTF-8 string containing "Flexed Biceps" directly to eDirectory via LDAP? If you retrieve that attribute again, do you also get the UTF-8 encoded representation of the two Unicode surrogates?

  • 0 in reply to   

    Just tried the LDAP writes. And yes, the same problem is there as well. I wasn't able to write that Unicode character to eDirectory using either UTF-8 or UTF-16 (LE or BE) encoding. The only syntax that worked was the UTF-16 surrogate pair further encoded as UTF-8. Here are the LDAP attribute modify byte buffer values I used and the operation results:

    1. Plain UTF-8 (as per LDAP v3 RFC)
      • b'\xF0\x9F\x92\xAA'
      • NDS error: syntax violation (-613)
    2. Plain UTF-16 BE
      • b'\xD8\x3D\xDC\xAA'
      • Write succeeds, but reading the attribute value afterwards results in different result: b'\xd6\xbd\xdc\xaa'
    3. Plain UTF-16 LE
      • b'\x3D\xD8\xAA\xDC'
      • NDS error: buffer too small (-119)
    4. UTF-16 surrogate pair further encoded as UTF-8
      • b'\xED\xA0\xBD\xED\xB2\xAA'
      • Write succeeds and the read result is the same.
      • This double encoded version is the only one that "works", but LDAP strings shouldn't be double encoded
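
    For anyone who wants to reproduce the four buffers, here is a sketch of how they can be generated in Python 3. The helper name `encode_surrogates_as_utf8` is made up for this post; it is only a workaround encoder for case 4, not something python-ldap provides.

```python
import struct


def encode_surrogates_as_utf8(text: str) -> bytes:
    """Encode text as UTF-8, but with each UTF-16 code unit encoded separately.

    Hypothetical helper producing the double encoded form from case 4.
    """
    utf16 = text.encode('utf-16-le')
    # Split the UTF-16 data into its individual 16-bit code units.
    units = struct.unpack('<%dH' % (len(utf16) // 2), utf16)
    # Encode every unit on its own; surrogates need surrogatepass.
    return b''.join(
        chr(u).encode('utf-8', errors='surrogatepass') for u in units
    )


ch = '\U0001F4AA'
buffers = {
    'utf-8': ch.encode('utf-8'),
    'utf-16-be': ch.encode('utf-16-be'),
    'utf-16-le': ch.encode('utf-16-le'),
    'surrogates-as-utf-8': encode_surrogates_as_utf8(ch),
}
for name, buf in buffers.items():
    print(name, buf.hex(' '))
```

    Characters inside plane 0 come out identical to normal UTF-8, so only the higher plane characters are affected.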

    I have also raised a support case about this, trying to get through the first line now...

  • 0   in reply to 

    I was wondering if python-ldap (which is not new) could be the problem, not handling these strings (data) correctly. In general, eDirectory LDAP likes to handle "unknown" data as base64.

    What does the data look like if you use ldapsearch?

  • 0 in reply to   

    python-ldap handles LDAP data purely as byte buffers, encoding/decoding is left to the caller. In the tests, I also used pure byte buffers to make sure the byte representation of the data was according to the UTF-8/16 formats.

    The LDAP protocol, on the other hand, does not know anything about base64. LDIF is another thing of course, where base64 is used for values which are binary or contain "exotic" characters.

    I also tried other tools, like ldapsearch and Apache Directory Studio, with similar results. ldapsearch returns the value double encoded (a UTF-8 string which contains further UTF-16 surrogate pairs). Here's a sample result from ldapsearch (base64 value):
    7aC97bKq
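
    Decoding that base64 value gives the same six bytes that python-ldap returned. A quick check in Python 3:

```python
import base64

# base64 value as printed by ldapsearch
raw = base64.b64decode('7aC97bKq')
print(raw)
```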

    Apache Directory Studio shows the returned surrogate pair characters as Unicode replacement characters, since those characters shouldn't be there. That also breaks Directory Studio functionality, since it cannot delete the attribute value (it tries the delete with the replacement characters --> the value does not exist in the directory).
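
    I have not looked at how Directory Studio decodes values internally (it is a Java application), but the effect can be imitated in Python 3 as a rough analogy: once a lenient decoder has substituted replacement characters, the original bytes cannot be recovered, so a delete built from the displayed value no longer matches.

```python
raw = b'\xed\xa0\xbd\xed\xb2\xaa'

# Lenient decoding substitutes U+FFFD for the invalid sequences.
shown = raw.decode('utf-8', errors='replace')

# Re-encoding what was displayed does not give back the stored bytes.
round_trip = shown.encode('utf-8')
print(round_trip == raw)
```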

    NetIQ IDM seems to have worked around the issue, since it is able to store higher plane Unicode characters into eDirectory and even retrieve them later on. And I think IDM is actually using LDAP nowadays, instead of NCP (actually, I might be wrong here: looking at the connections of the ndsd process, there are both LDAPS and NCP connections).

  • 0   in reply to 

    Humm, I've had a couple of surprises with python-ldap in the past, which is why I was wondering.

    The LDAP connections are made by tools and ldapsearch; everything else should be using NCP. It would be surprising if they changed the engine to use LDAP, as it would add a layer of abstraction (and make operations slower).

    Unicode is pretty cool and solves many problems we had with codepages, but it introduces just as many problems, as we still do not have a proper way to handle all of them (just my humble opinion).

    With python-ldap I've used normalize to handle extended characters, though I doubt it would be helpful for you.

    stackoverflow.com/.../how-does-unicodedata-normalizeform-unistr-work

  • 0 in reply to   

    Yes, I have come across Unicode combining characters and normalization before, but that's another story. :)

    I think it was just the IDM Designer which was changed to use LDAP instead of NCP at some version. But the IDM engine's internal functionality and directory communication may of course be something else. Glad to see that Unicode strings seem to work there, at least.

    IMO, Unicode is a great thing, but it also brings some challenges. But it's actually pretty great to be able to enter all kinds of "characters" into strings. 💪

  • 0   in reply to 

    See if you can get a bug opened, at least to get engineering to look at it - and comment on this (and hope that you'll get an answer).

    I do suspect that you might get a "working as designed".