Strings, bytes and Unicode conversions
######################################

.. note::

    This section discusses string handling in terms of Python 3 strings. For
    Python 2.7, replace all occurrences of ``str`` with ``unicode`` and
    ``bytes`` with ``str``. Python 2.7 users may find it best to use ``from
    __future__ import unicode_literals`` to avoid unintentionally using ``str``
    instead of ``unicode``.

Passing Python strings to C++
=============================

When a Python ``str`` is passed from Python to a C++ function that accepts
``std::string`` or ``char *`` as arguments, pybind11 will encode the Python
string to UTF-8. All Python ``str`` can be encoded in UTF-8, so this operation
does not fail.

The C++ language is encoding agnostic. It is the responsibility of the
programmer to track encodings. It's often easiest to simply `use UTF-8
everywhere <http://utf8everywhere.org/>`_.

.. code-block:: c++

    m.def("utf8_test",
        [](const std::string &s) {
            cout << "utf-8 is icing on the cake.\n";
            cout << s;
        }
    );
    m.def("utf8_charptr",
        [](const char *s) {
            cout << "My favorite food is\n";
            cout << s;
        }
    );

.. code-block:: python

    >>> utf8_test('🎂')
    utf-8 is icing on the cake.
    🎂

    >>> utf8_charptr('🍕')
    My favorite food is
    🍕

.. note::

    Some terminal emulators do not support UTF-8 or emoji fonts and may not
    display the example above correctly.

The results are the same whether the C++ function accepts arguments by value or
reference, and whether or not ``const`` is used.

Passing bytes to C++
--------------------

A Python ``bytes`` object will be passed to C++ functions that accept
``std::string`` or ``char*`` *without* conversion.
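The two argument rules above can be mimicked in pure Python, without pybind11,
to see which bytes a ``std::string`` parameter would end up holding. This is
only a sketch of the rule as described in this section; the helper name
``bytes_seen_by_cpp`` is hypothetical and is not part of pybind11.

.. code-block:: python

    def bytes_seen_by_cpp(arg):
        """Hypothetical helper: mimic the argument rules described above."""
        if isinstance(arg, str):
            # A Python str is encoded to UTF-8 before the call.
            return arg.encode("utf-8")
        if isinstance(arg, bytes):
            # A Python bytes object is passed through without conversion.
            return arg
        raise TypeError("expected str or bytes")

    print(bytes_seen_by_cpp("é"))          # b'\xc3\xa9'
    print(bytes_seen_by_cpp(b"\xba\xd0"))  # b'\xba\xd0'

The second call shows that byte sequences which are not valid UTF-8 still reach
C++ unchanged, so the C++ side cannot assume that every ``std::string`` it
receives holds UTF-8 text.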


Returning C++ strings to Python
===============================

When a C++ function returns a ``std::string`` or ``char*`` to a Python caller,
**pybind11 will assume that the string is valid UTF-8** and will decode it to a
native Python ``str``, using the same API as Python uses to perform
``bytes.decode('utf-8')``. If this implicit conversion fails, pybind11 will
raise a ``UnicodeDecodeError``.

.. code-block:: c++

    m.def("std_string_return",
        []() {
            return std::string("This string needs to be UTF-8 encoded");
        }
    );

.. code-block:: python

    >>> isinstance(example.std_string_return(), str)
    True


Because UTF-8 is inclusive of pure ASCII, there is never any issue with
returning a pure ASCII string to Python. If there is any possibility that the
string is not pure ASCII, it is necessary to ensure the encoding is valid
UTF-8.

.. warning::

    Implicit conversion assumes that a returned ``char *`` is null-terminated.
    If there is no null terminator a buffer overrun will occur.

Explicit conversions
--------------------

If some C++ code constructs a ``std::string`` that is not a UTF-8 string, one
can perform an explicit conversion and return a ``py::str`` object. Explicit
conversion has the same overhead as implicit conversion.

.. code-block:: c++

    // This uses the Python C API to convert Latin-1 to Unicode
    m.def("str_output",
        []() {
            std::string s = "Send your r\xe9sum\xe9 to Alice in HR"; // Latin-1
            // PyUnicode_DecodeLatin1 takes (data, size, errors) and returns a new reference
            py::handle py_s = PyUnicode_DecodeLatin1(s.data(), s.length(), nullptr);
            if (!py_s) {
                throw py::error_already_set();
            }
            return py::reinterpret_steal<py::str>(py_s);
        }
    );

.. code-block:: python

    >>> str_output()
    'Send your résumé to Alice in HR'

The `Python C API
<https://docs.python.org/3/c-api/unicode.html#built-in-codecs>`_ provides
several built-in codecs.


One could also use a third party encoding library such as libiconv to transcode
to UTF-8.
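To see why the Latin-1 string above needs an explicit conversion, the same
bytes can be examined with Python's built-in codecs. This is a pure-Python
sketch that does not involve pybind11; the variable names are illustrative
only.

.. code-block:: python

    # The same Latin-1 bytes as in the C++ example above.
    latin1_data = b"Send your r\xe9sum\xe9 to Alice in HR"

    # Decoding as UTF-8 fails, which is the error an implicit conversion hits.
    try:
        latin1_data.decode("utf-8")
        decode_failed = False
    except UnicodeDecodeError:
        decode_failed = True

    # Decoding with the correct codec succeeds.
    text = latin1_data.decode("latin-1")

    # Re-encoding yields UTF-8 bytes, which would decode cleanly.
    utf8_data = text.encode("utf-8")

    print(decode_failed)  # True
    print(text)           # Send your résumé to Alice in HR

Transcoding to UTF-8 on the C++ side before returning, for example with a
library such as libiconv, plays the same role as the final ``encode`` step
here.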

Return C++ strings without conversion
-------------------------------------

If the data in a C++ ``std::string`` does not represent text and should be
returned to Python as ``bytes``, then one can return the data as a
``py::bytes`` object.

.. code-block:: c++

    m.def("return_bytes",
        []() {
            std::string s("\xba\xd0\xba\xd0");  // Not valid UTF-8
            return py::bytes(s);  // Return the data without transcoding
        }
    );

.. code-block:: python

    >>> example.return_bytes()
    b'\xba\xd0\xba\xd0'


Note the asymmetry: pybind11 will convert ``bytes`` to ``std::string`` without
encoding, but cannot convert ``std::string`` back to ``bytes`` implicitly.

.. code-block:: c++

    m.def("asymmetry",
        [](std::string s) {  // Accepts str or bytes from Python
            return s;  // Looks harmless, but implicitly converts to str
        }
    );

.. code-block:: python

    >>> isinstance(example.asymmetry(b"have some bytes"), str)
    True

    >>> example.asymmetry(b"\xba\xd0\xba\xd0")  # invalid utf-8 as bytes
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte


Wide character strings
======================

When a Python ``str`` is passed to a C++ function expecting ``std::wstring``,
``wchar_t*``, ``std::u16string`` or ``std::u32string``, the ``str`` will be
encoded to UTF-16 or UTF-32 depending on how the C++ compiler implements each
type, in the platform's native endianness. When strings of these types are
returned, they are assumed to contain valid UTF-16 or UTF-32, and will be
decoded to Python ``str``.

.. code-block:: c++

    #define UNICODE
    #include <windows.h>
    #include <memory>  // std::make_unique
    #include <string>  // std::wstring

    m.def("set_window_text",
        [](HWND hwnd, std::wstring s) {
            // Call SetWindowText with null-terminated UTF-16 string
            ::SetWindowText(hwnd, s.c_str());
        }
    );
    m.def("get_window_text",
        [](HWND hwnd) {
            const int buffer_size = ::GetWindowTextLength(hwnd) + 1;
            auto buffer = std::make_unique< wchar_t[] >(buffer_size);

            // std::unique_ptr<T[]> exposes the raw pointer through get()
            ::GetWindowText(hwnd, buffer.get(), buffer_size);

            std::wstring text(buffer.get());

            // wstring will be converted to Python str
            return text;
        }
    );

.. warning::

    Wide character strings may not work as described on Python 2.7 or Python
    3.3 compiled with ``--enable-unicode=ucs2``.

Strings in multibyte encodings such as Shift-JIS must be transcoded to
UTF-8/16/32 before being returned to Python.


Character literals
==================

C++ functions that accept character literals as input will receive the first
character of a Python ``str`` as their input. If the string is longer than one
Unicode character, trailing characters will be ignored.

When a character literal is returned from C++ (such as a ``char`` or a
``wchar_t``), it will be converted to a ``str`` that represents the single
character.

.. code-block:: c++

    m.def("pass_char", [](char c) { return c; });
    m.def("pass_wchar", [](wchar_t w) { return w; });

.. code-block:: python

    >>> example.pass_char('A')
    'A'

While C++ will cast integers to character types (``char c = 0x65;``), pybind11
does not convert Python integers to characters implicitly. The Python function
``chr()`` can be used to convert integers to characters.

.. code-block:: python

    >>> example.pass_char(0x65)
    TypeError

    >>> example.pass_char(chr(0x65))
    'e'

If the desire is to work with an 8-bit integer, use ``int8_t`` or ``uint8_t``
as the argument type.
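On the Python side, ``chr()`` and ``ord()`` convert between integers and
one-character strings, which makes the distinction between "a character" and
"a small integer" explicit before calling into C++. The following is a
pure-Python sketch; the helpers ``fits_uint8`` and ``fits_int8`` are
hypothetical and only illustrate the value ranges of the two C++ types.

.. code-block:: python

    code_point = 0x65

    as_char = chr(code_point)   # one-character str, suitable for a char parameter
    back_to_int = ord(as_char)  # the integer again

    def fits_uint8(value):
        """Hypothetical check: is value within the range of a C++ uint8_t?"""
        return 0 <= value <= 0xFF

    def fits_int8(value):
        """Hypothetical check: is value within the range of a C++ int8_t?"""
        return -0x80 <= value <= 0x7F

    print(as_char)          # e
    print(back_to_int)      # 101
    print(fits_uint8(200))  # True
    print(fits_int8(200))   # False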

Grapheme clusters
-----------------

A single grapheme may be represented by two or more Unicode characters. For
example, 'é' is usually represented as U+00E9 but can also be expressed as the
combining character sequence U+0065 U+0301 (that is, the letter 'e' followed by
a combining acute accent). The combining character will be lost if the
two-character sequence is passed as an argument, even though it renders as a
single grapheme.

.. code-block:: python

    >>> example.pass_wchar('é')
    'é'

    >>> combining_e_acute = 'e' + '\u0301'

    >>> combining_e_acute
    'é'

    >>> combining_e_acute == 'é'
    False

    >>> example.pass_wchar(combining_e_acute)
    'e'

Normalizing combining characters before passing the character literal to C++
may resolve *some* of these issues:

.. code-block:: python

    >>> import unicodedata

    >>> example.pass_wchar(unicodedata.normalize('NFC', combining_e_acute))
    'é'

In some languages (Thai for example), there are `graphemes that cannot be
expressed as a single Unicode code point
<http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>`_, so there is
no way to capture them in a C++ character type.


C++17 string views
==================

C++17 string views are automatically supported when compiling in C++17 mode.
They follow the same rules for encoding and decoding as the corresponding STL
string type (for example, a ``std::u16string_view`` argument will be passed
UTF-16-encoded data, and a returned ``std::string_view`` will be decoded as
UTF-8).

References
==========

* `The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
  <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
* `C++ - Using STL Strings at Win32 API Boundaries
  <https://msdn.microsoft.com/en-ca/magazine/mt238407.aspx>`_