I came across Unicode strings containing characters outside plane 0, i.e. with code point values above U+FFFF. For example, one string contained the character U+1F4AA ("Flexed Biceps"). The character was received as a UTF-8 string and stored in an eDirectory (Case Ignore String) attribute value through IDM.
Now, eDirectory seems to store string values internally using UTF-16 LE. Using iMonitor, I could see that the value was stored as follows:
0x3D 0xD8 0xAA 0xDC
This is the correct representation of the given Unicode character as a UTF-16 LE surrogate pair (U+D83D U+DCAA).
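For reference, the same byte sequence can be reproduced in a Python shell (struct is only used here to show the two UTF-16 code units):

>>> import struct
>>> [hex(b) for b in '\U0001F4AA'.encode('utf-16-le')]
['0x3d', '0xd8', '0xaa', '0xdc']
>>> [hex(u) for u in struct.unpack('<2H', '\U0001F4AA'.encode('utf-16-le'))]
['0xd83d', '0xdcaa']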
When reading the attribute value back through LDAP (Python 3.6.8 / python-ldap 3.4.4), the byte buffer for the value is as follows:
b'\xed\xa0\xbd\xed\xb2\xaa'
Investigating this a little, I found out that this is actually the UTF-8 encoded representation of the two UTF-16 surrogates (apparently the scheme known as CESU-8):
\xed \xa0 \xbd UTF-8 decoded = 0xD83D
\xed \xb2 \xaa UTF-8 decoded = 0xDCAA
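Each three-byte group is the UTF-8 encoding of a single surrogate code point; Python rejects these as strict UTF-8 but decodes them with errors='surrogatepass':

>>> b'\xed\xa0\xbd'.decode('utf-8', errors='surrogatepass')
'\ud83d'
>>> b'\xed\xb2\xaa'.decode('utf-8', errors='surrogatepass')
'\udcaa'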
So, in order to get a proper Unicode string out of the returned LDAP query result, one first needs to decode the response as UTF-8 (allowing surrogates) and then recombine the UTF-16 surrogate pairs from the result. In Python 3, something like this might do:
value.decode('utf-8', errors='surrogatepass').encode('utf-16', errors='surrogatepass').decode('utf-16')
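Applied to the example value, this decodes the bytes with surrogates allowed, re-encodes them as UTF-16 (where the two code units become a legal surrogate pair), and decodes back into the intended character:

>>> value = b'\xed\xa0\xbd\xed\xb2\xaa'
>>> fixed = value.decode('utf-8', errors='surrogatepass').encode('utf-16', errors='surrogatepass').decode('utf-16')
>>> hex(ord(fixed))
'0x1f4aa'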
As the LDAP RFCs specify that strings should be sent and received as UTF-8, this seems like a bug to me. The received response is indeed UTF-8 formatted, but I don't see any reason why it should contain embedded UTF-16 surrogate pairs; surrogate code points are not even legal in well-formed UTF-8. Instead, the response should be a plain UTF-8 string, with higher-plane Unicode characters encoded directly as UTF-8, right?
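For comparison, this is the plain UTF-8 encoding of the character, which is what I would expect the server to return:

>>> '\U0001F4AA'.encode('utf-8')
b'\xf0\x9f\x92\xaa'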