Unicode surrogate pairs and LDAP?

I came across Unicode strings containing characters outside plane 0, i.e. with a code point value > U+FFFF. As an example, one string contained the character U+1F4AA ("Flexed Biceps"). This character was received as a UTF-8 string and stored in an eDirectory (Case Ignore String) attribute value through IDM.

Now, eDirectory seems to store string values internally using UTF-16 LE. Using iMonitor, I could see that the value was stored as follows:

0x3D 0xD8 0xAA 0xDC

This is the correct representation of the given Unicode character as a UTF-16 LE surrogate pair (U+D83D U+DCAA).
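
For reference, the surrogate pair can be derived by hand from the code point. A minimal sketch in Python (the variable names are just for illustration):

```python
# Split a supplementary-plane code point into its UTF-16 surrogate pair.
cp = 0x1F4AA                       # "Flexed Biceps"
offset = cp - 0x10000              # 20-bit value to split across two units
high = 0xD800 + (offset >> 10)     # high (lead) surrogate: top 10 bits
low = 0xDC00 + (offset & 0x3FF)    # low (trail) surrogate: bottom 10 bits

# Serialize both 16-bit units in little-endian byte order.
pair_le = high.to_bytes(2, "little") + low.to_bytes(2, "little")

print(hex(high), hex(low), pair_le.hex(" ") if hasattr(pair_le, "hex") and False else pair_le)
```

The resulting bytes match what Python's own "utf-16-le" codec produces for the same character, and what iMonitor shows.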

When reading the attribute value through LDAP (python 3.6.8 / python-ldap 3.4.4), the byte buffer for the value is as follows:

b'\xed\xa0\xbd\xed\xb2\xaa'

Investigating this a little, I found out that this is actually the UTF-8 encoded representation of the two Unicode surrogates:

\xed \xa0 \xbd UTF-8 decoded = 0xD83D

\xed \xb2 \xaa UTF-8 decoded = 0xDCAA
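
The two decoded values can be checked by unpacking the 3-byte UTF-8 sequences manually. This is only a sketch; decode_3byte is a made-up helper name, and it does no validation of the lead or continuation bytes:

```python
def decode_3byte(seq: bytes) -> int:
    """Decode one 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) to a code point."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

raw = b'\xed\xa0\xbd\xed\xb2\xaa'
high = decode_3byte(raw[0:3])   # first three bytes
low = decode_3byte(raw[3:6])    # last three bytes
print(hex(high), hex(low))
```

Both results fall in the surrogate range (U+D800 to U+DFFF), which a strict UTF-8 decoder rejects.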

So, in order to get a proper Unicode string out of the returned LDAP query result, one first needs to decode the response as UTF-8 (letting the surrogates through) and then combine the UTF-16 surrogate pairs in the result. In Python 3, something like this might do:

value.decode('utf-8', errors='surrogatepass').encode('utf-16', errors='surrogatepass').decode('utf-16')
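
The one-liner above can be wrapped in a small helper. This is only a sketch: the function name decode_edir_value is made up here, and it assumes every surrogate in the value is part of a valid pair (an unpaired surrogate would raise UnicodeDecodeError in the last step). It uses the explicit "utf-16-le" codec so that no byte order mark is involved. As far as I can tell, this "surrogates encoded as UTF-8" form is what is usually called CESU-8.

```python
def decode_edir_value(raw: bytes) -> str:
    """Decode bytes that may contain UTF-16 surrogates encoded as 3-byte UTF-8."""
    # Step 1: decode as UTF-8, but let surrogate code points through.
    text = raw.decode("utf-8", errors="surrogatepass")
    # Step 2: round-trip through UTF-16 so that surrogate pairs are combined.
    return text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")

raw = b'\xed\xa0\xbd\xed\xb2\xaa'

# A strict UTF-8 decode of the same buffer fails.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

fixed = decode_edir_value(raw)
print(strict_ok, hex(ord(fixed)))
```

Values that are already plain UTF-8 pass through the helper unchanged, so it should be safe to apply to every string attribute.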

As the LDAP RFCs specify that strings should be sent and received as UTF-8, this seems like a bug to me. The received response is UTF-8 in form, but I don't see any reason why it should contain embedded UTF-16 surrogates. Instead, the response should be just a plain UTF-8 string, with higher-plane Unicode characters encoded directly as four-byte UTF-8 sequences, right?

  • 0  

    Can you write a UTF-8 string containing "Flexed Biceps" directly to eDirectory via LDAP? If you retrieve that attribute again, do you also get the UTF-8 encoded representation of the two Unicode surrogates?

  • 0 in reply to   

    Just tried the LDAP writes. And yes, the same problem is there as well. I wasn't able to write that Unicode character to eDirectory using either UTF-8 or UTF-16 (LE or BE) encoding. The only syntax that worked was the UTF-16 surrogate pair further encoded as UTF-8. Here are the byte buffer values used in the LDAP attribute modify operations, and the results:

    1. Plain UTF-8 (as per LDAP v3 RFC)
      • b'\xF0\x9F\x92\xAA'
      • NDS error: syntax violation (-613)
    2. Plain UTF-16 BE
      • b'\xD8\x3D\xDC\xAA'
      • Write succeeds, but reading the attribute value afterwards returns a different value: b'\xd6\xbd\xdc\xaa'
    3. Plain UTF-16 LE
      • b'\x3D\xD8\xAA\xDC'
      • NDS error: buffer too small (-119)
    4. UTF-16 surrogate pair further encoded as UTF-8
      • b'\xED\xA0\xBD\xED\xB2\xAA'
      • Write succeeds and the read result is the same.
      • This double encoded version is the only one that "works", but LDAP strings shouldn't be double encoded
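
The four byte buffers above can be reproduced in Python as follows. This is a sketch only: to_surrogate_utf8 is a hypothetical helper written for this post, not part of python-ldap, and it assumes the input string contains no unpaired surrogates.

```python
def to_surrogate_utf8(text: str) -> bytes:
    """Encode text so that each character above U+FFFF becomes two 3-byte sequences."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair, then UTF-8 encode each half.
            offset = cp - 0x10000
            units = (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF))
            for unit in units:
                out += chr(unit).encode("utf-8", errors="surrogatepass")
        else:
            # Plane 0 characters are encoded as ordinary UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)

biceps = "\U0001F4AA"
buffers = {
    "utf-8": biceps.encode("utf-8"),                   # case 1
    "utf-16-be": biceps.encode("utf-16-be"),           # case 2
    "utf-16-le": biceps.encode("utf-16-le"),           # case 3
    "surrogates-as-utf-8": to_surrogate_utf8(biceps),  # case 4
}
for name, value in buffers.items():
    print(name, value)
```

Only the last buffer was accepted and read back unchanged in my tests.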

    I have also raised a support case about this, trying to get through the first line now...
