| 1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
|---|
| 2 | <html> |
|---|
| 3 | <head> |
|---|
| 4 | <meta name="generator" content= |
|---|
| 5 | "HTML Tidy for Linux/x86 (vers 1st November 2002), see www.w3.org"> |
|---|
| 6 | <meta name="description" content= |
|---|
| 7 | "A simple, portable and lightweight C++ library for easy handling of UTF-8 encoded strings"> |
|---|
| 8 | <meta name="keywords" content="UTF-8 C++ portable utf8 unicode generic templates"> |
|---|
| 9 | <meta name="author" content="Nemanja Trifunovic"> |
|---|
| 10 | <title> |
|---|
| 11 | UTF8-CPP: UTF-8 with C++ in a Portable Way |
|---|
| 12 | </title> |
|---|
| 13 | <style type="text/css"> |
|---|
| 14 | <!-- |
|---|
| 15 | span.return_value { |
|---|
| 16 | color: brown; |
|---|
| 17 | } |
|---|
| 18 | span.keyword { |
|---|
| 19 | color: blue; |
|---|
| 20 | } |
|---|
| 21 | span.preprocessor { |
|---|
| 22 | color: navy; |
|---|
| 23 | } |
|---|
| 24 | span.literal { |
|---|
| 25 | color: olive; |
|---|
| 26 | } |
|---|
| 27 | span.comment { |
|---|
| 28 | color: green; |
|---|
| 29 | } |
|---|
| 30 | code { |
|---|
| 31 | font-weight: bold; |
|---|
| 32 | } |
|---|
| 33 | ul.toc { |
|---|
| 34 | list-style-type: none; |
|---|
| 35 | } |
|---|
| 36 | p.version { |
|---|
| 37 | font-size: small; |
|---|
| 38 | font-style: italic; |
|---|
| 39 | } |
|---|
| 40 | --> |
|---|
| 41 | </style> |
|---|
| 42 | </head> |
|---|
| 43 | <body> |
|---|
| 44 | <h1> |
|---|
| 45 | UTF8-CPP: UTF-8 with C++ in a Portable Way |
|---|
| 46 | </h1> |
|---|
| 47 | <p> |
|---|
| 48 | <a href="https://sourceforge.net/projects/utfcpp">The Sourceforge project page</a> |
|---|
| 49 | </p> |
|---|
| 50 | <div id="toc"> |
|---|
| 51 | <h2> |
|---|
| 52 | Table of Contents |
|---|
| 53 | </h2> |
|---|
| 54 | <ul class="toc"> |
|---|
| 55 | <li> |
|---|
| 56 | <a href="#introduction">Introduction</a> |
|---|
| 57 | </li> |
|---|
| 58 | <li> |
|---|
| 59 | <a href="#examples">Examples of Use</a> |
|---|
| 60 | </li> |
|---|
| 61 | <li> |
|---|
| 62 | <a href="#reference">Reference</a> |
|---|
| 63 | <ul class="toc"> |
|---|
| 64 | <li> |
|---|
| 65 | <a href="#funutf8">Functions From utf8 Namespace </a> |
|---|
| 66 | </li> |
|---|
| 67 | <li> |
|---|
| 68 | <a href="#typesutf8">Types From utf8 Namespace </a> |
|---|
| 69 | </li> |
|---|
| 70 | <li> |
|---|
| 71 | <a href="#fununchecked">Functions From utf8::unchecked Namespace </a> |
|---|
| 72 | </li> |
|---|
| 73 | <li> |
|---|
| 74 | <a href="#typesunchecked">Types From utf8::unchecked Namespace </a> |
|---|
| 75 | </li> |
|---|
| 76 | </ul> |
|---|
| 77 | </li> |
|---|
| 78 | <li> |
|---|
| 79 | <a href="#points">Points of Interest</a> |
|---|
| 80 | </li> |
|---|
| 81 | <li> |
|---|
| 82 | <a href="#conclusion">Conclusion</a> |
|---|
| 83 | </li> |
|---|
| 84 | <li> |
|---|
| 85 | <a href="#links">Links</a> |
|---|
| 86 | </li> |
|---|
| 87 | </ul> |
|---|
| 88 | </div> |
|---|
| 89 | <h2 id="introduction"> |
|---|
| 90 | Introduction |
|---|
| 91 | </h2> |
|---|
| 92 | <p> |
|---|
| 93 | Many C++ developers miss an easy and portable way of handling Unicode encoded |
|---|
| 94 | strings. The C++ Standard is currently Unicode agnostic, and while some work is being |
|---|
| 95 | done to introduce Unicode to the next incarnation called C++0x, for the moment |
|---|
| 96 | nothing of the sort is available. In the meantime, developers use 3rd party |
|---|
| 97 | libraries like ICU, OS specific capabilities, or simply roll out their own |
|---|
| 98 | solutions. |
|---|
| 99 | </p> |
|---|
| 100 | <p> |
|---|
| 101 | In order to easily handle UTF-8 encoded Unicode strings, I have come up with a small |
|---|
| 102 | generic library. For anybody used to working with STL algorithms and iterators, it should be |
|---|
| 103 | easy and natural to use. The code is freely available for any purpose - check out |
|---|
| 104 | the license at the beginning of the utf8.h file. If you run into |
|---|
| 105 | bugs or performance issues, please let me know and I'll do my best to address them. |
|---|
| 106 | </p> |
|---|
| 107 | <p> |
|---|
| 108 | The purpose of this article is not to offer an introduction to Unicode in general, |
|---|
| 109 | and UTF-8 in particular. If you are not familiar with Unicode, be sure to check out |
|---|
| 110 | <a href="http://www.unicode.org/">Unicode Home Page</a> or some other source of |
|---|
| 111 | information for Unicode. Also, it is not my aim to advocate the use of UTF-8 |
|---|
| 112 | encoded strings in C++ programs; if you want to handle UTF-8 encoded strings from |
|---|
| 113 | C++, I am sure you have good reasons for it. |
|---|
| 114 | </p> |
|---|
| 115 | <h2 id="examples"> |
|---|
| 116 | Examples of Use |
|---|
| 117 | </h2> |
|---|
| 118 | <p> |
|---|
| 119 | To illustrate the use of this utf8 library, we shall open a file containing UTF-8 |
|---|
| 120 | encoded text, check whether it starts with a byte order mark, read each line into a |
|---|
| 121 | <code>std::string</code>, check it for validity, convert the text to UTF-16, and |
|---|
| 122 | back to UTF-8: |
|---|
| 123 | </p> |
|---|
| 124 | <pre> |
|---|
| 125 | <span class="preprocessor">#include <fstream></span> |
|---|
| 126 | <span class="preprocessor">#include <iostream></span> |
|---|
| 127 | <span class="preprocessor">#include <string></span> |
|---|
| 128 | <span class="preprocessor">#include <vector></span> |
|---|
| 129 | <span class="preprocessor">#include "utf8.h"</span> |
|---|
| 130 | <span class="keyword">using namespace</span> std; |
|---|
| 131 | <span class="keyword">int</span> main(<span class="keyword">int</span> argc, <span class="keyword">char</span>** argv) |
|---|
| 132 | { |
|---|
| 133 | <span class="keyword">if</span> (argc != <span class="literal">2</span>) { |
|---|
| 134 | cout << <span class="literal">"\nUsage: docsample filename\n"</span>; |
|---|
| 135 | <span class="keyword">return</span> <span class="literal">0</span>; |
|---|
| 136 | } |
|---|
| 137 | <span class="keyword">const char</span>* test_file_path = argv[1]; |
|---|
| 138 | <span class="comment">// Open the test file (must be UTF-8 encoded)</span> |
|---|
| 139 | ifstream fs8(test_file_path); |
|---|
| 140 | <span class="keyword">if</span> (!fs8.is_open()) { |
|---|
| 141 | cout << <span class= |
|---|
| 142 | "literal">"Could not open "</span> << test_file_path << endl; |
|---|
| 143 | <span class="keyword">return</span> <span class="literal">0</span>; |
|---|
| 144 | } |
|---|
| 145 | <span class="comment">// Read the first line of the file</span> |
|---|
| 146 | <span class="keyword">unsigned</span> line_count = <span class="literal">1</span>; |
|---|
| 147 | string line; |
|---|
| 148 | <span class="keyword">if</span> (!getline(fs8, line)) |
|---|
| 149 | <span class="keyword">return</span> <span class="literal">0</span>; |
|---|
| 150 | <span class="comment">// Look for utf-8 byte-order mark at the beginning</span> |
|---|
| 151 | <span class="keyword">if</span> (line.size() > <span class="literal">2</span>) { |
|---|
| 152 | <span class="keyword">if</span> (utf8::is_bom(line.c_str())) |
|---|
| 153 | cout << <span class= |
|---|
| 154 | "literal">"There is a byte order mark at the beginning of the file\n"</span>; |
|---|
| 155 | } |
|---|
| 156 | <span class="comment">// Play with all the lines in the file</span> |
|---|
| 157 | <span class="keyword">do</span> { |
|---|
| 158 | <span class="comment">// check for invalid utf-8 (for a simple yes/no check, there is also utf8::is_valid function)</span> |
|---|
| 159 | string::iterator end_it = utf8::find_invalid(line.begin(), line.end()); |
|---|
| 160 | <span class="keyword">if</span> (end_it != line.end()) { |
|---|
| 161 | cout << <span class= |
|---|
| 162 | "literal">"Invalid UTF-8 encoding detected at line "</span> << line_count << <span |
|---|
| 163 | class="literal">"\n"</span>; |
|---|
| 164 | cout << <span class= |
|---|
| 165 | "literal">"This part is fine: "</span> << string(line.begin(), end_it) << <span |
|---|
| 166 | class="literal">"\n"</span>; |
|---|
| 167 | } |
|---|
| 168 | <span class="comment">// Get the line length (at least for the valid part)</span> |
|---|
| 169 | <span class="keyword">int</span> length = utf8::distance(line.begin(), end_it); |
|---|
| 170 | cout << <span class= |
|---|
| 171 | "literal">"Length of line "</span> << line_count << <span class= |
|---|
| 172 | "literal">" is "</span> << length << <span class="literal">"\n"</span>; |
|---|
| 173 | <span class="comment">// Convert it to utf-16</span> |
|---|
| 174 | vector<unsigned short> utf16line; |
|---|
| 175 | utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line)); |
|---|
| 176 | <span class="comment">// And back to utf-8</span> |
|---|
| 177 | string utf8line; |
|---|
| 178 | utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8line)); |
|---|
| 179 | <span class="comment">// Confirm that the conversion went OK:</span> |
|---|
| 180 | <span class="keyword">if</span> (utf8line != string(line.begin(), end_it)) |
|---|
| 181 | cout << <span class= |
|---|
| 182 | "literal">"Error in UTF-16 conversion at line: "</span> << line_count << <span |
|---|
| 183 | class="literal">"\n"</span>; |
|---|
| 184 | getline(fs8, line); |
|---|
| 185 | line_count++; |
|---|
| 186 | } <span class="keyword">while</span> (!fs8.eof()); |
|---|
| 187 | <span class="keyword">return</span> <span class="literal">0</span>; |
|---|
| 188 | } |
|---|
| 189 | </pre> |
|---|
| 190 | <p> |
|---|
| 191 | In the previous code sample, we have seen the use of the following functions from |
|---|
| 192 | <code>utf8</code> namespace: first we used <code>is_bom</code> function to detect |
|---|
| 193 | UTF-8 byte order mark at the beginning of the file; then for each line we performed |
|---|
| 194 | a detection of invalid UTF-8 sequences with <code>find_invalid</code>; the number |
|---|
| 195 | of characters (more precisely - the number of Unicode code points) in each line was |
|---|
| 196 | determined with a use of <code>utf8::distance</code>; finally, we have converted |
|---|
| 197 | each line to UTF-16 encoding with <code>utf8to16</code> and back to UTF-8 with |
|---|
| 198 | <code>utf16to8</code>. |
|---|
| 199 | </p> |
|---|
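<p>
To make the byte order mark check more concrete, here is a minimal sketch of the idea that does not use the library: a UTF-8 encoded file with a BOM starts with the three octets EF BB BF. The names <code>starts_with_utf8_bom</code>, <code>strip_utf8_bom</code> and <code>check_bom_sketch</code> are hypothetical helpers introduced only for this illustration; they are not part of <code>utf8.h</code>.
</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of the library: reports whether a string
// begins with the three octets EF BB BF (U+FEFF encoded as UTF-8).
inline bool starts_with_utf8_bom(const std::string& s)
{
    return s.size() >= 3
        && static_cast<unsigned char>(s[0]) == 0xEF
        && static_cast<unsigned char>(s[1]) == 0xBB
        && static_cast<unsigned char>(s[2]) == 0xBF;
}

// Returns the text with a leading BOM removed, if there is one.
inline std::string strip_utf8_bom(const std::string& s)
{
    return starts_with_utf8_bom(s) ? s.substr(3) : s;
}

// Quick self-check of the sketch; call it from main() to try it out.
inline void check_bom_sketch()
{
    const std::string with_bom = std::string("\xef\xbb\xbf") + "text";
    assert(starts_with_utf8_bom(with_bom));
    assert(!starts_with_utf8_bom("text"));
    assert(strip_utf8_bom(with_bom) == "text");
}
```

<p>
The size test plays the same role as the <code>line.size()</code> check in the sample above: it avoids reading past the end of a very short first line.
</p>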
| 200 | <h2 id="reference"> |
|---|
| 201 | Reference |
|---|
| 202 | </h2> |
|---|
| 203 | <h3 id="funutf8"> |
|---|
| 204 | Functions From utf8 Namespace |
|---|
| 205 | </h3> |
|---|
| 206 | <h4> |
|---|
| 207 | utf8::append |
|---|
| 208 | </h4> |
|---|
| 209 | <p class="version"> |
|---|
| 210 | Available in version 1.0 and later. |
|---|
| 211 | </p> |
|---|
| 212 | <p> |
|---|
| 213 | Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence |
|---|
| 214 | to a UTF-8 string. |
|---|
| 215 | </p> |
|---|
| 216 | <pre> |
|---|
| 217 | <span class="keyword">template</span> <<span class= |
|---|
| 218 | "keyword">typename</span> octet_iterator> |
|---|
| 219 | octet_iterator append(uint32_t cp, octet_iterator result); |
|---|
| 220 | |
|---|
| 221 | </pre> |
|---|
| 222 | <p> |
|---|
| 223 | <code>cp</code>: A 32 bit integer representing a code point to append to the |
|---|
| 224 | sequence.<br> |
|---|
| 225 | <code>result</code>: An output iterator to the place in the sequence where to |
|---|
| 226 | append the code point.<br> |
|---|
| 227 | <span class="return_value">Return value</span>: An iterator pointing to the place |
|---|
| 228 | after the newly appended sequence. |
|---|
| 229 | </p> |
|---|
| 230 | <p> |
|---|
| 231 | Example of use: |
|---|
| 232 | </p> |
|---|
| 233 | <pre> |
|---|
| 234 | <span class="keyword">unsigned char</span> u[<span class="literal">5</span>] = {<span |
|---|
| 235 | class="literal">0</span>,<span class="literal">0</span>,<span class= |
|---|
| 236 | "literal">0</span>,<span class="literal">0</span>,<span class="literal">0</span>}; |
|---|
| 237 | <span class="keyword">unsigned char</span>* end = append(<span class= |
|---|
| 238 | "literal">0x0448</span>, u); |
|---|
| 239 | assert (u[<span class="literal">0</span>] == <span class= |
|---|
| 240 | "literal">0xd1</span> && u[<span class="literal">1</span>] == <span class= |
|---|
| 241 | "literal">0x88</span> && u[<span class="literal">2</span>] == <span class= |
|---|
| 242 | "literal">0</span> && u[<span class="literal">3</span>] == <span class= |
|---|
| 243 | "literal">0</span> && u[<span class="literal">4</span>] == <span class= |
|---|
| 244 | "literal">0</span>); |
|---|
| 245 | </pre> |
|---|
| 246 | <p> |
|---|
| 247 | Note that <code>append</code> does not allocate any memory - it is the burden of |
|---|
| 248 | the caller to make sure there is enough memory allocated for the operation. To make |
|---|
| 249 | things more interesting, <code>append</code> can add anywhere between 1 and 4 |
|---|
| 250 | octets to the sequence. In practice, you would most often want to use |
|---|
| 251 | <code>std::back_inserter</code> to ensure that the necessary memory is allocated. |
|---|
| 252 | </p> |
|---|
| 253 | <p> |
|---|
| 254 | In case of an invalid code point, a <code>utf8::invalid_code_point</code> exception |
|---|
| 255 | is thrown. |
|---|
| 256 | </p> |
|---|
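<p>
To see why the number of appended octets varies between 1 and 4, the following sketch encodes a code point by hand. It is an illustration under stated assumptions, not the library's implementation: <code>encode_utf8</code> and <code>encode_one</code> are hypothetical names, and <code>std::invalid_argument</code> stands in for the library's own exception type.
</p>

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of the library: writes the UTF-8 encoding of
// one code point through an output iterator and returns the advanced iterator.
template <typename OutputIt>
OutputIt encode_utf8(std::uint32_t cp, OutputIt out)
{
    // Surrogates and values above U+10FFFF are not valid code points.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("invalid code point");

    if (cp < 0x80) {                  // 1 octet:  0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {          // 2 octets: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {        // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {                          // 4 octets: 11110xxx followed by three trail octets
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encodes a single code point into a std::string. The back_inserter makes
// the string grow as needed, so no buffer has to be sized in advance.
inline std::string encode_one(std::uint32_t cp)
{
    std::string s;
    encode_utf8(cp, std::back_inserter(s));
    assert(!s.empty() && s.size() <= 4);
    return s;
}
```

<p>
Writing through <code>std::back_inserter</code>, as <code>encode_one</code> does, sidesteps the question of how large the destination buffer must be.
</p>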
| 257 | <h4> |
|---|
| 258 | utf8::next |
|---|
| 259 | </h4> |
|---|
| 260 | <p class="version"> |
|---|
| 261 | Available in version 1.0 and later. |
|---|
| 262 | </p> |
|---|
| 263 | <p> |
|---|
| 264 | Given the iterator to the beginning of the UTF-8 sequence, it returns the code |
|---|
| 265 | point and moves the iterator to the next position. |
|---|
| 266 | </p> |
|---|
| 267 | <pre> |
|---|
| 268 | <span class="keyword">template</span> <<span class= |
|---|
| 269 | "keyword">typename</span> octet_iterator> |
|---|
| 270 | uint32_t next(octet_iterator& it, octet_iterator end); |
|---|
| 271 | |
|---|
| 272 | </pre> |
|---|
| 273 | <p> |
|---|
| 274 | <code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8 |
|---|
| 275 | encoded code point. After the function returns, it is incremented to point to the |
|---|
| 276 | beginning of the next code point.<br> |
|---|
| 277 | <code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code> |
|---|
| 278 | gets equal to <code>end</code> during the extraction of a code point, an |
|---|
| 279 | <code>utf8::not_enough_room</code> exception is thrown.<br> |
|---|
| 280 | <span class="return_value">Return value</span>: the 32 bit representation of the |
|---|
| 281 | processed UTF-8 code point. |
|---|
| 282 | </p> |
|---|
| 283 | <p> |
|---|
| 284 | Example of use: |
|---|
| 285 | </p> |
|---|
| 286 | <pre> |
|---|
| 287 | <span class="keyword">const char</span>* twochars = <span class= |
|---|
| 288 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 289 | <span class="keyword">const char</span>* w = twochars; |
|---|
| 290 | <span class="keyword">int</span> cp = next(w, twochars + <span class="literal">6</span>); |
|---|
| 291 | assert (cp == <span class="literal">0x65e5</span>); |
|---|
| 292 | assert (w == twochars + <span class="literal">3</span>); |
|---|
| 293 | </pre> |
|---|
| 294 | <p> |
|---|
| 295 | This function is typically used to iterate through a UTF-8 encoded string. |
|---|
| 296 | </p> |
|---|
| 297 | <p> |
|---|
| 298 | In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is |
|---|
| 299 | thrown. |
|---|
| 300 | </p> |
|---|
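<p>
The decoding step can be sketched by hand as follows. This is a simplified illustration, not the library's code: <code>decode_next</code> is a hypothetical name, standard exceptions stand in for the library's exception types, and for brevity the sketch does not reject overlong forms or encoded surrogates.
</p>

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper, not the library's code: decodes the code point that
// starts at `it`, advances `it` past it, and returns the decoded value.
template <typename It>
std::uint32_t decode_next(It& it, It end)
{
    if (it == end)
        throw std::out_of_range("empty input");

    // The lead octet tells us how many trail octets follow.
    const std::uint32_t lead = static_cast<unsigned char>(*it++);
    std::uint32_t cp;
    int trail;
    if (lead < 0x80)              { cp = lead;        trail = 0; }
    else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; trail = 1; }
    else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; trail = 2; }
    else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; trail = 3; }
    else throw std::invalid_argument("not a lead octet");

    // Each trail octet has the form 10xxxxxx and contributes six bits.
    for (; trail > 0; --trail) {
        if (it == end)
            throw std::out_of_range("sequence is cut short");
        const std::uint32_t octet = static_cast<unsigned char>(*it);
        if ((octet >> 6) != 0x02)
            throw std::invalid_argument("expected a trail octet");
        cp = (cp << 6) | (octet & 0x3F);
        ++it;
    }
    return cp;
}

// Quick self-check of the sketch; call it from main() to try it out.
inline void check_decode_sketch()
{
    const std::string s = "\xd1\x88";
    std::string::const_iterator it = s.begin();
    assert(decode_next(it, s.end()) == 0x0448);
    assert(it == s.end());
}
```

<p>
Calling such a function in a loop until the iterator reaches the end is the usual way to walk through a UTF-8 encoded string one code point at a time.
</p>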
| 301 | <h4> |
|---|
| 302 | utf8::peek_next |
|---|
| 303 | </h4> |
|---|
| 304 | <p class="version"> |
|---|
| 305 | Available in version 2.1 and later. |
|---|
| 306 | </p> |
|---|
| 307 | <p> |
|---|
| 308 | Given the iterator to the beginning of the UTF-8 sequence, it returns the code |
|---|
| 309 | point for the following sequence without changing the value of the iterator. |
|---|
| 310 | </p> |
|---|
| 311 | <pre> |
|---|
| 312 | <span class="keyword">template</span> <<span class= |
|---|
| 313 | "keyword">typename</span> octet_iterator> |
|---|
| 314 | uint32_t peek_next(octet_iterator it, octet_iterator end); |
|---|
| 315 | |
|---|
| 316 | </pre> |
|---|
| 317 | <p> |
|---|
| 318 | <code>it</code>: an iterator pointing to the beginning of a UTF-8 |
|---|
| 319 | encoded code point.<br> |
|---|
| 320 | <code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code> |
|---|
| 321 | gets equal to <code>end</code> during the extraction of a code point, an |
|---|
| 322 | <code>utf8::not_enough_room</code> exception is thrown.<br> |
|---|
| 323 | <span class="return_value">Return value</span>: the 32 bit representation of the |
|---|
| 324 | processed UTF-8 code point. |
|---|
| 325 | </p> |
|---|
| 326 | <p> |
|---|
| 327 | Example of use: |
|---|
| 328 | </p> |
|---|
| 329 | <pre> |
|---|
| 330 | <span class="keyword">const char</span>* twochars = <span class= |
|---|
| 331 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 332 | <span class="keyword">const char</span>* w = twochars; |
|---|
| 333 | <span class="keyword">int</span> cp = peek_next(w, twochars + <span class="literal">6</span>); |
|---|
| 334 | assert (cp == <span class="literal">0x65e5</span>); |
|---|
| 335 | assert (w == twochars); |
|---|
| 336 | </pre> |
|---|
| 337 | <p> |
|---|
| 338 | In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is |
|---|
| 339 | thrown. |
|---|
| 340 | </p> |
|---|
| 341 | <h4> |
|---|
| 342 | utf8::prior |
|---|
| 343 | </h4> |
|---|
| 344 | <p class="version"> |
|---|
| 345 | Available in version 1.02 and later. |
|---|
| 346 | </p> |
|---|
| 347 | <p> |
|---|
| 348 | Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it |
|---|
| 349 | decreases the iterator until it hits the beginning of the previous UTF-8 encoded |
|---|
| 350 | code point and returns the 32 bit representation of the code point. |
|---|
| 351 | </p> |
|---|
| 352 | <pre> |
|---|
| 353 | <span class="keyword">template</span> <<span class= |
|---|
| 354 | "keyword">typename</span> octet_iterator> |
|---|
| 355 | uint32_t prior(octet_iterator& it, octet_iterator start); |
|---|
| 356 | |
|---|
| 357 | </pre> |
|---|
| 358 | <p> |
|---|
| 359 | <code>it</code>: a reference pointing to an octet within a UTF-8 encoded string. |
|---|
| 360 | After the function returns, it is decremented to point to the beginning of the |
|---|
| 361 | previous code point.<br> |
|---|
| 362 | <code>start</code>: an iterator to the beginning of the sequence where the search |
|---|
| 363 | for the beginning of a code point is performed. It is a |
|---|
| 364 | safety measure to prevent passing the beginning of the string in the search for a |
|---|
| 365 | UTF-8 lead octet.<br> |
|---|
| 366 | <span class="return_value">Return value</span>: the 32 bit representation of the |
|---|
| 367 | previous code point. |
|---|
| 368 | </p> |
|---|
| 369 | <p> |
|---|
| 370 | Example of use: |
|---|
| 371 | </p> |
|---|
| 372 | <pre> |
|---|
| 373 | <span class="keyword">const char</span>* twochars = <span class= |
|---|
| 374 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 375 | <span class="keyword">const char</span>* w = twochars + <span class= |
|---|
| 376 | "literal">3</span>; |
|---|
| 377 | <span class="keyword">int</span> cp = prior (w, twochars); |
|---|
| 378 | assert (cp == <span class="literal">0x65e5</span>); |
|---|
| 379 | assert (w == twochars); |
|---|
| 380 | </pre> |
|---|
| 381 | <p> |
|---|
| 382 | This function has two purposes: one is to iterate backwards through a UTF-8 |
|---|
| 383 | encoded string. Note that it is usually a better idea to iterate forward instead, |
|---|
| 384 | since <code>utf8::next</code> is faster. The second purpose is to find the beginning |
|---|
| 385 | of a UTF-8 sequence when we start from an arbitrary position within a string. |
|---|
| 386 | </p> |
|---|
| 387 | <p> |
|---|
| 388 | <code>it</code> will typically point to the beginning of |
|---|
| 389 | a code point, and <code>start</code> will point to the |
|---|
| 390 | beginning of the string to ensure we don't go backwards too far. <code>it</code> is |
|---|
| 391 | decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence |
|---|
| 392 | beginning with that octet is decoded to a 32 bit representation and returned. |
|---|
| 393 | </p> |
|---|
| 394 | <p> |
|---|
| 395 | In case <code>start</code> is reached before a UTF-8 lead octet is hit, or if an |
|---|
| 396 | invalid UTF-8 sequence is started by the lead octet, an <code>invalid_utf8</code> |
|---|
| 397 | exception is thrown. |
|---|
| 398 | </p> |
|---|
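<p>
The backward search for a lead octet relies on the fact that trail octets always have the bit pattern 10xxxxxx. The sketch below shows only that search, under the assumption that the input is a bidirectional range of <code>char</code>; <code>is_trail_octet</code> and <code>step_back</code> are hypothetical names, and standard exceptions stand in for the library's own. It does not decode the code point it lands on.
</p>

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// True for octets of the form 10xxxxxx, which never start a code point.
inline bool is_trail_octet(char c)
{
    return (static_cast<unsigned char>(c) >> 6) == 0x02;
}

// Hypothetical helper, not part of the library: returns an iterator to the
// lead octet of the code point that precedes `it`, never going before `start`.
template <typename It>
It step_back(It it, It start)
{
    if (it == start)
        throw std::out_of_range("already at the beginning");
    do {
        --it;                       // skip backwards over trail octets
    } while (is_trail_octet(*it) && it != start);
    if (is_trail_octet(*it))        // reached `start` without finding a lead octet
        throw std::invalid_argument("no lead octet found");
    return it;
}

// Quick self-check of the sketch; call it from main() to try it out.
inline void check_step_back_sketch()
{
    const std::string s = "\xe6\x97\xa5\xd1\x88";
    assert(step_back(s.end(), s.begin()) == s.begin() + 3);
    assert(step_back(s.begin() + 3, s.begin()) == s.begin());
}
```

<p>
Starting from the middle of a sequence works as well: from the second trail octet of the first character, the search stops at the beginning of the string.
</p>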
| 399 | <h4> |
|---|
| 400 | utf8::previous |
|---|
| 401 | </h4> |
|---|
| 402 | <p class="version"> |
|---|
| 403 | Deprecated in version 1.02 and later. |
|---|
| 404 | </p> |
|---|
| 405 | <p> |
|---|
| 406 | Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it |
|---|
| 407 | decreases the iterator until it hits the beginning of the previous UTF-8 encoded |
|---|
| 408 | code point and returns the 32 bit representation of the code point. |
|---|
| 409 | </p> |
|---|
| 410 | <pre> |
|---|
| 411 | <span class="keyword">template</span> <<span class= |
|---|
| 412 | "keyword">typename</span> octet_iterator> |
|---|
| 413 | uint32_t previous(octet_iterator& it, octet_iterator pass_start); |
|---|
| 414 | |
|---|
| 415 | </pre> |
|---|
| 416 | <p> |
|---|
| 417 | <code>it</code>: a reference pointing to an octet within a UTF-8 encoded string. |
|---|
| 418 | After the function returns, it is decremented to point to the beginning of the |
|---|
| 419 | previous code point.<br> |
|---|
| 420 | <code>pass_start</code>: an iterator to the point in the sequence where the search |
|---|
| 421 | for the beginning of a code point is aborted if no result was reached. It is a |
|---|
| 422 | safety measure to prevent passing the beginning of the string in the search for a |
|---|
| 423 | UTF-8 lead octet.<br> |
|---|
| 424 | <span class="return_value">Return value</span>: the 32 bit representation of the |
|---|
| 425 | previous code point. |
|---|
| 426 | </p> |
|---|
| 427 | <p> |
|---|
| 428 | Example of use: |
|---|
| 429 | </p> |
|---|
| 430 | <pre> |
|---|
| 431 | <span class="keyword">const char</span>* twochars = <span class= |
|---|
| 432 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 433 | <span class="keyword">const char</span>* w = twochars + <span class= |
|---|
| 434 | "literal">3</span>; |
|---|
| 435 | <span class="keyword">int</span> cp = previous (w, twochars - <span class= |
|---|
| 436 | "literal">1</span>); |
|---|
| 437 | assert (cp == <span class="literal">0x65e5</span>); |
|---|
| 438 | assert (w == twochars); |
|---|
| 439 | </pre> |
|---|
| 440 | <p> |
|---|
| 441 | <code>utf8::previous</code> is deprecated, and <code>utf8::prior</code> should |
|---|
| 442 | be used instead, although the existing code can continue using this function. |
|---|
| 443 | The problem is the parameter <code>pass_start</code> that points to the position |
|---|
| 444 | just before the beginning of the sequence. Standard containers don't have the |
|---|
| 445 | concept of "pass start" and the function cannot be used with their iterators. |
|---|
| 446 | </p> |
|---|
| 447 | <p> |
|---|
| 448 | <code>it</code> will typically point to the beginning of |
|---|
| 449 | a code point, and <code>pass_start</code> will point to the octet just before the |
|---|
| 450 | beginning of the string to ensure we don't go backwards too far. <code>it</code> is |
|---|
| 451 | decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence |
|---|
| 452 | beginning with that octet is decoded to a 32 bit representation and returned. |
|---|
| 453 | </p> |
|---|
| 454 | <p> |
|---|
| 455 | In case <code>pass_start</code> is reached before a UTF-8 lead octet is hit, or if an |
|---|
| 456 | invalid UTF-8 sequence is started by the lead octet, an <code>invalid_utf8</code> |
|---|
| 457 | exception is thrown. |
|---|
| 458 | </p> |
|---|
| 459 | <h4> |
|---|
| 460 | utf8::advance |
|---|
| 461 | </h4> |
|---|
| 462 | <p class="version"> |
|---|
| 463 | Available in version 1.0 and later. |
|---|
| 464 | </p> |
|---|
| 465 | <p> |
|---|
| 466 | Advances an iterator by the specified number of code points within a UTF-8 |
|---|
| 467 | sequence. |
|---|
| 468 | </p> |
|---|
| 469 | <pre> |
|---|
| 470 | <span class="keyword">template</span> <<span class= |
|---|
| 471 | "keyword">typename</span> octet_iterator, typename distance_type> |
|---|
| 472 | <span class= |
|---|
| 473 | "keyword">void</span> advance (octet_iterator& it, distance_type n, octet_iterator end); |
|---|
| 474 | |
|---|
| 475 | </pre> |
|---|
| 476 | <p> |
|---|
| 477 | <code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8 |
|---|
| 478 | encoded code point. After the function returns, it is incremented to point to the |
|---|
| 479 | nth following code point.<br> |
|---|
| 480 | <code>n</code>: a positive integer that shows how many code points we want to |
|---|
| 481 | advance.<br> |
|---|
| 482 | <code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code> |
|---|
| 483 | gets equal to <code>end</code> during the extraction of a code point, an |
|---|
| 484 | <code>utf8::not_enough_room</code> exception is thrown.<br> |
|---|
| 485 | </p> |
|---|
| 486 | <p> |
|---|
| 487 | Example of use: |
|---|
| 488 | </p> |
|---|
| 489 | <pre> |
|---|
| 490 | <span class="keyword">const char</span>* twochars = <span class= |
|---|
| 491 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 492 | <span class="keyword">const char</span>* w = twochars; |
|---|
| 493 | advance (w, <span class="literal">2</span>, twochars + <span class="literal">6</span>); |
|---|
| 494 | assert (w == twochars + <span class="literal">5</span>); |
|---|
| 495 | </pre> |
|---|
| 496 | <p> |
|---|
| 497 | This function works only "forward". In case of a negative <code>n</code>, there is |
|---|
| 498 | no effect. |
|---|
| 499 | </p> |
|---|
| 500 | <p> |
|---|
| 501 | In case of an invalid code point, a <code>utf8::invalid_code_point</code> exception |
|---|
| 502 | is thrown. |
|---|
| 503 | </p> |
|---|
| 504 | <h4> |
|---|
| 505 | utf8::distance |
|---|
| 506 | </h4> |
|---|
| 507 | <p class="version"> |
|---|
| 508 | Available in version 1.0 and later. |
|---|
| 509 | </p> |
|---|
| 510 | <p> |
|---|
| 511 | Given the iterators to two UTF-8 encoded code points in a sequence, returns the |
|---|
| 512 | number of code points between them. |
|---|
| 513 | </p> |
|---|
| 514 | <pre> |
|---|
| 515 | <span class="keyword">template</span> <<span class= |
|---|
| 516 | "keyword">typename</span> octet_iterator> |
|---|
| 517 | <span class= |
|---|
| 518 | "keyword">typename</span> std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last); |
|---|
| 519 | |
|---|
| 520 | </pre> |
|---|
| 521 | <p> |
|---|
| 522 | <code>first</code>: an iterator to a beginning of a UTF-8 encoded code point.<br> |
|---|
| 523 | <code>last</code>: an iterator to the "post-end" of the last UTF-8 encoded code |
|---|
| 524 | point in the sequence whose length we are trying to determine. It can be the |
|---|
| 525 | beginning of a new code point, or not.<br> |
|---|
| 526 | <span class="return_value">Return value</span>: the distance between the iterators, |
|---|
| 527 | in code points. |
|---|
| 528 | </p> |
|---|
| 529 | <p> |
|---|
| 530 | Example of use: |
|---|
| 531 | </p> |
|---|
| 532 | <pre> |
|---|
| 533 | <span class="keyword">const char</span>* twochars = <span class= |
|---|
| 534 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 535 | size_t dist = utf8::distance(twochars, twochars + <span class="literal">5</span>); |
|---|
| 536 | assert (dist == <span class="literal">2</span>); |
|---|
| 537 | </pre> |
|---|
| 538 | <p> |
|---|
| 539 | This function is used to find the length (in code points) of a UTF-8 encoded |
|---|
| 540 | string. The reason it is called <em>distance</em>, rather than, say, |
|---|
| 541 | <em>length</em> is mainly because developers expect <em>length</em> to be an |
|---|
| 542 | O(1) operation. Computing the length of a UTF-8 string is a linear operation, and |
|---|
| 543 | it looked better to model it after <code>std::distance</code> algorithm. |
|---|
| 544 | </p> |
|---|
| 545 | <p> |
|---|
| 546 | In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is |
|---|
| 547 | thrown. If <code>last</code> does not point to the past-the-end of a UTF-8 sequence, |
|---|
| 548 | a <code>utf8::not_enough_room</code> exception is thrown. |
|---|
| 549 | </p> |
|---|
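<p>
The linear cost is easy to see in a hand-written sketch. Assuming the input is already known to be valid UTF-8, the number of code points equals the number of octets that are not trail octets, so every octet has to be visited once. <code>count_code_points</code> is a hypothetical helper for illustration; unlike the library function, it performs no validation and throws no exceptions.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the library: counts code points in a
// string that is assumed to hold valid UTF-8. Every octet is inspected,
// which is why the operation is O(n) in the number of octets.
inline std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        // Trail octets look like 10xxxxxx; anything else starts a code point.
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// Quick self-check of the sketch; call it from main() to try it out.
inline void check_count_sketch()
{
    const std::string twochars = "\xe6\x97\xa5\xd1\x88";
    assert(twochars.size() == 5);               // five octets...
    assert(count_code_points(twochars) == 2);   // ...but two code points
}
```

<p>
The contrast with <code>std::string::size</code>, which reports octets in constant time, is the reason a name other than <em>length</em> was chosen.
</p>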
| 550 | <h4> |
|---|
| 551 | utf8::utf16to8 |
|---|
| 552 | </h4> |
|---|
| 553 | <p class="version"> |
|---|
| 554 | Available in version 1.0 and later. |
|---|
| 555 | </p> |
|---|
| 556 | <p> |
|---|
| 557 | Converts a UTF-16 encoded string to UTF-8. |
|---|
| 558 | </p> |
|---|
| 559 | <pre> |
|---|
| 560 | <span class="keyword">template</span> <<span class= |
|---|
| 561 | "keyword">typename</span> u16bit_iterator, <span class= |
|---|
| 562 | "keyword">typename</span> octet_iterator> |
|---|
| 563 | octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result); |
|---|
| 564 | |
|---|
| 565 | </pre> |
|---|
| 566 | <p> |
|---|
| 567 | <code>start</code>: an iterator pointing to the beginning of the UTF-16 encoded |
|---|
| 568 | string to convert.<br> |
|---|
| 569 | <code>end</code>: an iterator pointing to past-the-end of the UTF-16 encoded |
|---|
| 570 | string to convert.<br> |
|---|
| 571 | <code>result</code>: an output iterator to the place in the UTF-8 string where to |
|---|
| 572 | append the result of conversion.<br> |
|---|
| 573 | <span class="return_value">Return value</span>: An iterator pointing to the place |
|---|
| 574 | after the appended UTF-8 string. |
|---|
| 575 | </p> |
|---|
| 576 | <p> |
|---|
| 577 | Example of use: |
|---|
| 578 | </p> |
|---|
| 579 | <pre> |
|---|
| 580 | <span class="keyword">unsigned short</span> utf16string[] = {<span class= |
|---|
| 581 | "literal">0x41</span>, <span class="literal">0x0448</span>, <span class= |
|---|
| 582 | "literal">0x65e5</span>, <span class="literal">0xd834</span>, <span class= |
|---|
| 583 | "literal">0xdd1e</span>}; |
|---|
| 584 | vector<<span class="keyword">unsigned char</span>> utf8result; |
|---|
| 585 | utf16to8(utf16string, utf16string + <span class= |
|---|
| 586 | "literal">5</span>, back_inserter(utf8result)); |
|---|
| 587 | assert (utf8result.size() == <span class="literal">10</span>); |
|---|
| 588 | </pre> |
|---|
| 589 | <p> |
|---|
| 590 | In case of invalid UTF-16 sequence, a <code>utf8::invalid_utf16</code> exception is |
|---|
| 591 | thrown. |
|---|
| 592 | </p> |
|---|
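<p>
The last two elements of <code>utf16string</code> in the example form a surrogate pair, which is why five 16 bit units produce only four code points. The arithmetic that turns such a pair into a single code point can be sketched as below. The function names are hypothetical and the sketch is not taken from the library; it only illustrates the standard UTF-16 formula.
</p>

```cpp
#include <cassert>
#include <cstdint>

// Lead (high) surrogates occupy D800..DBFF, trail (low) surrogates DC00..DFFF.
inline bool is_lead_surrogate(std::uint32_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_trail_surrogate(std::uint32_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper, not part of the library: combines a surrogate pair
// into the code point it represents (always above U+FFFF).
inline std::uint32_t combine_surrogates(std::uint32_t lead, std::uint32_t trail)
{
    assert(is_lead_surrogate(lead) && is_trail_surrogate(trail));
    // Each surrogate carries ten bits of the value (cp - 0x10000).
    return 0x10000u + ((lead - 0xD800u) << 10) + (trail - 0xDC00u);
}
```

<p>
A lead surrogate that is not followed by a trail surrogate, or a trail surrogate on its own, is what makes a UTF-16 sequence invalid.
</p>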
| 593 | <h4> |
|---|
| 594 | utf8::utf8to16 |
|---|
| 595 | </h4> |
|---|
| 596 | <p class="version"> |
|---|
| 597 | Available in version 1.0 and later. |
|---|
| 598 | </p> |
|---|
| 599 | <p> |
|---|
| 600 | Converts a UTF-8 encoded string to UTF-16. |
|---|
| 601 | </p> |
|---|
| 602 | <pre> |
|---|
| 603 | <span class="keyword">template</span> <<span class= |
|---|
| 604 | "keyword">typename</span> u16bit_iterator, typename octet_iterator> |
|---|
| 605 | u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result); |
|---|
| 606 | |
|---|
| 607 | </pre> |
|---|
| 608 | <p> |
|---|
| 609 | <code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded |
|---|
| 610 | string to convert.<br> <code>end</code>: an iterator pointing to |
|---|
| 611 | past-the-end of the UTF-8 encoded string to convert.<br> |
|---|
| 612 | <code>result</code>: an output iterator to the place in the UTF-16 string where to |
|---|
| 613 | append the result of conversion.<br> |
|---|
| 614 | <span class="return_value">Return value</span>: An iterator pointing to the place |
|---|
| 615 | after the appended UTF-16 string. |
|---|
| 616 | </p> |
|---|
| 617 | <p> |
|---|
| 618 | Example of use: |
|---|
| 619 | </p> |
|---|
| 620 | <pre> |
|---|
| 621 | <span class="keyword">char</span> utf8_with_surrogates[] = <span class= |
|---|
| 622 | "literal">"\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"</span>; |
|---|
| 623 | vector <<span class="keyword">unsigned short</span>> utf16result; |
|---|
| 624 | utf8to16(utf8_with_surrogates, utf8_with_surrogates + <span class= |
|---|
| 625 | "literal">9</span>, back_inserter(utf16result)); |
|---|
| 626 | assert (utf16result.size() == <span class="literal">4</span>); |
|---|
| 627 | assert (utf16result[<span class="literal">2</span>] == <span class= |
|---|
| 628 | "literal">0xd834</span>); |
|---|
| 629 | assert (utf16result[<span class="literal">3</span>] == <span class= |
|---|
| 630 | "literal">0xdd1e</span>); |
|---|
| 631 | </pre> |
|---|
| 632 | <p> |
|---|
| 633 | In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is |
|---|
| 634 | thrown. If <code>end</code> does not point past the end of a complete UTF-8 sequence, a |
|---|
| 635 | <code>utf8::not_enough_room</code> exception is thrown. |
|---|
| 636 | </p> |
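<p>
The last two assertions in the example check a surrogate pair. As a rough sketch of where those values come from (standalone code, with <code>to_surrogates</code> as a hypothetical helper name rather than a library function), a code point above <code>0xffff</code> can be split like this:
</p>

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Split a supplementary-plane code point into a UTF-16 surrogate pair.
// Hypothetical helper; only meaningful for 0x10000 <= cp <= 0x10ffff.
inline std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10ffff);
    const std::uint32_t v = cp - 0x10000u;  // 20 significant bits remain
    return std::make_pair(
        static_cast<std::uint16_t>(0xd800u + (v >> 10)),      // lead: top 10 bits
        static_cast<std::uint16_t>(0xdc00u + (v & 0x3ffu)));  // trail: low 10 bits
}
```

<p>
The octets <code>f0 9d 84 9e</code> in the example encode the code point <code>0x1d11e</code>, which this split turns into <code>0xd834</code> and <code>0xdd1e</code>. That is why three code points yield four UTF-16 units.
</p>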
|---|
| 637 | <h4> |
|---|
| 638 | utf8::utf32to8 |
|---|
| 639 | </h4> |
|---|
| 640 | <p class="version"> |
|---|
| 641 | Available in version 1.0 and later. |
|---|
| 642 | </p> |
|---|
| 643 | <p> |
|---|
| 644 | Converts a UTF-32 encoded string to UTF-8. |
|---|
| 645 | </p> |
|---|
| 646 | <pre> |
|---|
| 647 | <span class="keyword">template</span> <<span class= |
|---|
| 648 | "keyword">typename</span> octet_iterator, typename u32bit_iterator> |
|---|
| 649 | octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result); |
|---|
| 650 | |
|---|
| 651 | </pre> |
|---|
| 652 | <p> |
|---|
| 653 | <code>start</code>: an iterator pointing to the beginning of the UTF-32 encoded |
|---|
| 654 | string to convert.<br> |
|---|
| 655 | <code>end</code>: an iterator pointing to past-the-end of the UTF-32 encoded |
|---|
| 656 | string to convert.<br> |
|---|
| 657 | <code>result</code>: an output iterator to the place in the UTF-8 string where to |
|---|
| 658 | append the result of conversion.<br> |
|---|
| 659 | <span class="return_value">Return value</span>: An iterator pointing to the place |
|---|
| 660 | after the appended UTF-8 string. |
|---|
| 661 | </p> |
|---|
| 662 | <p> |
|---|
| 663 | Example of use: |
|---|
| 664 | </p> |
|---|
| 665 | <pre> |
|---|
| 666 | <span class="keyword">int</span> utf32string[] = {<span class= |
|---|
| 667 | "literal">0x448</span>, <span class="literal">0x65E5</span>, <span class= |
|---|
| 668 | "literal">0x10346</span>, <span class="literal">0</span>}; |
|---|
| 669 | vector<<span class="keyword">unsigned char</span>> utf8result; |
|---|
| 670 | utf32to8(utf32string, utf32string + <span class= |
|---|
| 671 | "literal">3</span>, back_inserter(utf8result)); |
|---|
| 672 | assert (utf8result.size() == <span class="literal">9</span>); |
|---|
| 673 | </pre> |
|---|
| 674 | <p> |
|---|
| 675 | In case of an invalid UTF-32 string, a <code>utf8::invalid_code_point</code> exception |
|---|
| 676 | is thrown. |
|---|
| 677 | </p> |
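<p>
To make the encoding step concrete, here is a minimal standalone sketch of writing one code point as UTF-8 through an output iterator. It is not the library's implementation: <code>encode_utf8_sketch</code> and <code>encode_all_sketch</code> are hypothetical names, and <code>std::invalid_argument</code> merely stands in for the library's own exception type.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <vector>

// Encode one code point as UTF-8 and write the octets through `out`.
// Rejects values above 0x10ffff and surrogates (0xd800..0xdfff).
template <typename OutputIt>
OutputIt encode_utf8_sketch(std::uint32_t cp, OutputIt out) {
    if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
        throw std::invalid_argument("not a valid Unicode scalar value");
    if (cp < 0x80) {                       // 0xxxxxxx
        *out++ = static_cast<unsigned char>(cp);
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        *out++ = static_cast<unsigned char>(0xc0 | (cp >> 6));
        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<unsigned char>(0xe0 | (cp >> 12));
        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3f));
        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3f));
    } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<unsigned char>(0xf0 | (cp >> 18));
        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3f));
        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3f));
        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3f));
    }
    return out;
}

// Convenience wrapper: encode a whole sequence of code points.
inline std::vector<unsigned char>
encode_all_sketch(const std::vector<std::uint32_t>& cps) {
    std::vector<unsigned char> result;
    for (std::size_t i = 0; i < cps.size(); ++i)
        encode_utf8_sketch(cps[i], std::back_inserter(result));
    return result;
}
```

<p>
For the three code points in the example, the sequence lengths are 2, 3 and 4 octets, which gives the asserted size of 9.
</p>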
|---|
| 678 | <h4> |
|---|
| 679 | utf8::utf8to32 |
|---|
| 680 | </h4> |
|---|
| 681 | <p class="version"> |
|---|
| 682 | Available in version 1.0 and later. |
|---|
| 683 | </p> |
|---|
| 684 | <p> |
|---|
| 685 | Converts a UTF-8 encoded string to UTF-32. |
|---|
| 686 | </p> |
|---|
| 687 | <pre> |
|---|
| 688 | <span class="keyword">template</span> <<span class= |
|---|
| 689 | "keyword">typename</span> octet_iterator, <span class= |
|---|
| 690 | "keyword">typename</span> u32bit_iterator> |
|---|
| 691 | u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result); |
|---|
| 692 | |
|---|
| 693 | </pre> |
|---|
| 694 | <p> |
|---|
| 695 | <code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded |
|---|
| 696 | string to convert.<br> |
|---|
| 697 | <code>end</code>: an iterator pointing to past-the-end of the UTF-8 encoded string |
|---|
| 698 | to convert.<br> |
|---|
| 699 | <code>result</code>: an output iterator to the place in the UTF-32 string where to |
|---|
| 700 | append the result of conversion.<br> |
|---|
| 701 | <span class="return_value">Return value</span>: An iterator pointing to the place |
|---|
| 702 | after the appended UTF-32 string. |
|---|
| 703 | </p> |
|---|
| 704 | <p> |
|---|
| 705 | Example of use: |
|---|
| 706 | </p> |
|---|
| 707 | <pre> |
|---|
| 708 | <span class="keyword">char</span>* twochars = <span class= |
|---|
| 709 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 710 | vector<<span class="keyword">int</span>> utf32result; |
|---|
| 711 | utf8to32(twochars, twochars + <span class= |
|---|
| 712 | "literal">5</span>, back_inserter(utf32result)); |
|---|
| 713 | assert (utf32result.size() == <span class="literal">2</span>); |
|---|
| 714 | </pre> |
|---|
| 715 | <p> |
|---|
| 716 | In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is |
|---|
| 717 | thrown. If <code>end</code> does not point past the end of a complete UTF-8 sequence, a |
|---|
| 718 | <code>utf8::not_enough_room</code> exception is thrown. |
|---|
| 719 | </p> |
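<p>
The two failure modes described above (a malformed sequence versus input that stops in the middle of a sequence) can be illustrated with a simplified standalone decoder. This is only a sketch under stated assumptions: <code>decode_utf8_sketch</code> is a hypothetical name, it does not check for overlong forms or surrogates, and the standard exceptions stand in for <code>utf8::invalid_utf8</code> and <code>utf8::not_enough_room</code>.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Decode UTF-8 octets into code points (simplified).
// std::runtime_error plays the role of "invalid sequence";
// std::out_of_range plays the role of "input ended mid-sequence".
inline std::vector<std::uint32_t> decode_utf8_sketch(const std::string& in) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len;
        std::uint32_t cp;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xe0) == 0xc0) { len = 2; cp = lead & 0x1f; }
        else if ((lead & 0xf0) == 0xe0) { len = 3; cp = lead & 0x0f; }
        else if ((lead & 0xf8) == 0xf0) { len = 4; cp = lead & 0x07; }
        else throw std::runtime_error("invalid lead octet");
        if (in.size() - i < len)
            throw std::out_of_range("sequence truncated at end of input");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xc0) != 0x80)
                throw std::runtime_error("invalid continuation octet");
            cp = (cp << 6) | (c & 0x3f);  // append six payload bits
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

<p>
Decoding the five octets from the example yields the two code points <code>0x65e5</code> and <code>0x0448</code>; passing only the first two octets triggers the truncation error instead.
</p>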
|---|
| 720 | <h4> |
|---|
| 721 | utf8::find_invalid |
|---|
| 722 | </h4> |
|---|
| 723 | <p class="version"> |
|---|
| 724 | Available in version 1.0 and later. |
|---|
| 725 | </p> |
|---|
| 726 | <p> |
|---|
| 727 | Detects an invalid sequence within a UTF-8 string. |
|---|
| 728 | </p> |
|---|
| 729 | <pre> |
|---|
| 730 | <span class="keyword">template</span> <<span class= |
|---|
| 731 | "keyword">typename</span> octet_iterator> |
|---|
| 732 | octet_iterator find_invalid(octet_iterator start, octet_iterator end); |
|---|
| 733 | </pre> |
|---|
| 734 | <p> |
|---|
| 735 | <code>start</code>: an iterator pointing to the beginning of the UTF-8 string to |
|---|
| 736 | test for validity.<br> |
|---|
| 737 | <code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to test |
|---|
| 738 | for validity.<br> |
|---|
| 739 | <span class="return_value">Return value</span>: an iterator pointing to the first |
|---|
| 740 | invalid octet in the UTF-8 string. In case none were found, equals |
|---|
| 741 | <code>end</code>. |
|---|
| 742 | </p> |
|---|
| 743 | <p> |
|---|
| 744 | Example of use: |
|---|
| 745 | </p> |
|---|
| 746 | <pre> |
|---|
| 747 | <span class="keyword">char</span> utf_invalid[] = <span class= |
|---|
| 748 | "literal">"\xe6\x97\xa5\xd1\x88\xfa"</span>; |
|---|
| 749 | <span class= |
|---|
| 750 | "keyword">char</span>* invalid = find_invalid(utf_invalid, utf_invalid + <span class= |
|---|
| 751 | "literal">6</span>); |
|---|
| 752 | assert (invalid == utf_invalid + <span class="literal">5</span>); |
|---|
| 753 | </pre> |
|---|
| 754 | <p> |
|---|
| 755 | This function is typically used to make sure a UTF-8 string is valid before |
|---|
| 756 | processing it with other functions. It is especially important to call it before |
|---|
| 757 | doing any of the <em>unchecked</em> operations on it. |
|---|
| 758 | </p> |
|---|
| 759 | <h4> |
|---|
| 760 | utf8::is_valid |
|---|
| 761 | </h4> |
|---|
| 762 | <p class="version"> |
|---|
| 763 | Available in version 1.0 and later. |
|---|
| 764 | </p> |
|---|
| 765 | <p> |
|---|
| 766 | Checks whether a sequence of octets is a valid UTF-8 string. |
|---|
| 767 | </p> |
|---|
| 768 | <pre> |
|---|
| 769 | <span class="keyword">template</span> <<span class= |
|---|
| 770 | "keyword">typename</span> octet_iterator> |
|---|
| 771 | <span class="keyword">bool</span> is_valid(octet_iterator start, octet_iterator end); |
|---|
| 772 | |
|---|
| 773 | </pre> |
|---|
| 774 | <p> |
|---|
| 775 | <code>start</code>: an iterator pointing to the beginning of the UTF-8 string to |
|---|
| 776 | test for validity.<br> |
|---|
| 777 | <code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to test |
|---|
| 778 | for validity.<br> |
|---|
| 779 | <span class="return_value">Return value</span>: <code>true</code> if the sequence |
|---|
| 780 | is a valid UTF-8 string; <code>false</code> if not. |
|---|
| 781 | </p> |
|---|
| 782 | <p>Example of use:</p> |
|---|
| 783 | <pre> |
|---|
| 784 | <span class="keyword">char</span> utf_invalid[] = <span class= |
|---|
| 785 | "literal">"\xe6\x97\xa5\xd1\x88\xfa"</span>; |
|---|
| 786 | <span class="keyword">bool</span> bvalid = is_valid(utf_invalid, utf_invalid + <span |
|---|
| 787 | class="literal">6</span>); |
|---|
| 788 | assert (bvalid == false); |
|---|
| 789 | </pre> |
|---|
| 790 | <p> |
|---|
| 791 | <code>is_valid</code> is a shorthand for <code>find_invalid(start, end) == |
|---|
| 792 | end;</code>. You may want to use it to make sure that a byte sequence is a valid |
|---|
| 793 | UTF-8 string without the need to know where it fails if it is not valid. |
|---|
| 794 | </p> |
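<p>
The relationship between the two validation functions can be sketched in standalone code. The names <code>find_invalid_sketch</code> and <code>is_valid_sketch</code> are hypothetical, and the checks shown (lead and continuation octets, truncation, overlong forms, surrogates, range) are an assumption about what validation involves, not a copy of the library's logic. In this sketch the reported offset is the start of the offending sequence.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return the offset of the first invalid sequence, or s.size() if none.
inline std::size_t find_invalid_sketch(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp, min_cp;  // min_cp: smallest value allowed at this length
        if (lead < 0x80)                { len = 1; cp = lead;        min_cp = 0; }
        else if ((lead & 0xe0) == 0xc0) { len = 2; cp = lead & 0x1f; min_cp = 0x80; }
        else if ((lead & 0xf0) == 0xe0) { len = 3; cp = lead & 0x0f; min_cp = 0x800; }
        else if ((lead & 0xf8) == 0xf0) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
        else return i;                     // stray continuation or illegal lead octet
        if (s.size() - i < len) return i;  // sequence cut off by end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xc0) != 0x80) return i;
            cp = (cp << 6) | (c & 0x3f);
        }
        // Reject overlong encodings, out-of-range values and surrogates.
        if (cp < min_cp || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) return i;
        i += len;
    }
    return s.size();
}

// Validity is simply "no invalid position was found".
inline bool is_valid_sketch(const std::string& s) {
    return find_invalid_sketch(s) == s.size();
}
```

<p>
For the six-octet string used in the examples, the sketch reports offset 5, the position of the stray <code>0xfa</code> octet.
</p>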
|---|
| 795 | <h4> |
|---|
| 796 | utf8::replace_invalid |
|---|
| 797 | </h4> |
|---|
| 798 | <p class="version"> |
|---|
| 799 | Available in version 2.0 and later. |
|---|
| 800 | </p> |
|---|
| 801 | <p> |
|---|
| 802 | Replaces all invalid UTF-8 sequences within a string with a replacement marker. |
|---|
| 803 | </p> |
|---|
| 804 | <pre> |
|---|
| 805 | <span class="keyword">template</span> <<span class= |
|---|
| 806 | "keyword">typename</span> octet_iterator, <span class= |
|---|
| 807 | "keyword">typename</span> output_iterator> |
|---|
| 808 | output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, uint32_t replacement); |
|---|
| 809 | <span class="keyword">template</span> <<span class= |
|---|
| 810 | "keyword">typename</span> octet_iterator, <span class= |
|---|
| 811 | "keyword">typename</span> output_iterator> |
|---|
| 812 | output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out); |
|---|
| 813 | |
|---|
| 814 | </pre> |
|---|
| 815 | <p> |
|---|
| 816 | <code>start</code>: an iterator pointing to the beginning of the UTF-8 string to |
|---|
| 817 | look for invalid UTF-8 sequences.<br> |
|---|
| 818 | <code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to look |
|---|
| 819 | for invalid UTF-8 sequences.<br> |
|---|
| 820 | <code>out</code>: An output iterator to the range where the result of replacement |
|---|
| 821 | is stored.<br> |
|---|
| 822 | <code>replacement</code>: A Unicode code point for the replacement marker. The |
|---|
| 823 | version without this parameter assumes the value <code>0xfffd</code>.<br> |
|---|
| 824 | <span class="return_value">Return value</span>: An iterator pointing to the place |
|---|
| 825 | after the UTF-8 string with replaced invalid sequences. |
|---|
| 826 | </p> |
|---|
| 827 | <p> |
|---|
| 828 | Example of use: |
|---|
| 829 | </p> |
|---|
| 830 | <pre> |
|---|
| 831 | <span class="keyword">char</span> invalid_sequence[] = <span class= |
|---|
| 832 | "literal">"a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z"</span>; |
|---|
| 833 | vector<<span class="keyword">char</span>> replace_invalid_result; |
|---|
| 834 | replace_invalid (invalid_sequence, invalid_sequence + sizeof(invalid_sequence), back_inserter(replace_invalid_result), <span |
|---|
| 835 | class="literal">'?'</span>); |
|---|
| 836 | bvalid = is_valid(replace_invalid_result.begin(), replace_invalid_result.end()); |
|---|
| 837 | assert (bvalid); |
|---|
| 838 | <span class="keyword">char</span>* fixed_invalid_sequence = <span class= |
|---|
| 839 | "literal">"a????z"</span>; |
|---|
| 840 | assert (std::equal(replace_invalid_result.begin(), replace_invalid_result.end(), fixed_invalid_sequence)); |
|---|
| 841 | </pre> |
|---|
| 842 | <p> |
|---|
| 843 | <code>replace_invalid</code> does not perform in-place replacement of invalid |
|---|
| 844 | sequences. Rather, it produces a copy of the original string with the invalid |
|---|
| 845 | sequences replaced with a replacement marker. Therefore, <code>out</code> must not |
|---|
| 846 | be in the <code>[start, end]</code> range. |
|---|
| 847 | </p> |
|---|
| 848 | <p> |
|---|
| 849 | If <code>end</code> does not point past the end of a complete UTF-8 sequence, a |
|---|
| 850 | <code>utf8::not_enough_room</code> exception is thrown. |
|---|
| 851 | </p> |
|---|
| 852 | <h4> |
|---|
| 853 | utf8::is_bom |
|---|
| 854 | </h4> |
|---|
| 855 | <p class="version"> |
|---|
| 856 | Available in version 1.0 and later. |
|---|
| 857 | </p> |
|---|
| 858 | <p> |
|---|
| 859 | Checks whether a sequence of three octets is a UTF-8 byte order mark (BOM). |
|---|
| 860 | </p> |
|---|
| 861 | <pre> |
|---|
| 862 | <span class="keyword">template</span> <<span class= |
|---|
| 863 | "keyword">typename</span> octet_iterator> |
|---|
| 864 | <span class="keyword">bool</span> is_bom (octet_iterator it); |
|---|
| 865 | </pre> |
|---|
| 866 | <p> |
|---|
| 867 | <code>it</code>: beginning of the 3-octet sequence to check<br> |
|---|
| 868 | <span class="return_value">Return value</span>: <code>true</code> if the sequence |
|---|
| 869 | is a UTF-8 byte order mark; <code>false</code> if not. |
|---|
| 870 | </p> |
|---|
| 871 | <p> |
|---|
| 872 | Example of use: |
|---|
| 873 | </p> |
|---|
| 874 | <pre> |
|---|
| 875 | <span class="keyword">unsigned char</span> byte_order_mark[] = {<span class= |
|---|
| 876 | "literal">0xef</span>, <span class="literal">0xbb</span>, <span class= |
|---|
| 877 | "literal">0xbf</span>}; |
|---|
| 878 | <span class="keyword">bool</span> bbom = is_bom(byte_order_mark); |
|---|
| 879 | assert (bbom == <span class="literal">true</span>); |
|---|
| 880 | </pre> |
|---|
| 881 | <p> |
|---|
| 882 | The typical use of this function is to check the first three bytes of a file. If |
|---|
| 883 | they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8 |
|---|
| 884 | encoded text. |
|---|
| 885 | </p> |
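<p>
The "skip the BOM" step described above might look like the following standalone sketch. <code>text_start_offset</code> is a hypothetical helper, not a library function. Since <code>is_bom</code> takes only a single iterator, it seems reasonable to assume the caller must first make sure at least three octets are available, and the sketch performs that length check explicitly.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return the offset at which the actual text starts: 3 if the buffer begins
// with the UTF-8 BOM (EF BB BF), 0 otherwise.
inline std::size_t text_start_offset(const std::string& buf) {
    const bool has_bom = buf.size() >= 3 &&
        static_cast<unsigned char>(buf[0]) == 0xef &&
        static_cast<unsigned char>(buf[1]) == 0xbb &&
        static_cast<unsigned char>(buf[2]) == 0xbf;
    return has_bom ? 3 : 0;
}
```

<p>
A caller would then process <code>buf.begin() + text_start_offset(buf)</code> up to <code>buf.end()</code> as the UTF-8 encoded text.
</p>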
|---|
| 886 | <h3 id="typesutf8"> |
|---|
| 887 | Types From utf8 Namespace |
|---|
| 888 | </h3> |
|---|
| 889 | <h4> |
|---|
| 890 | utf8::iterator |
|---|
| 891 | </h4> |
|---|
| 892 | <p class="version"> |
|---|
| 893 | Available in version 2.0 and later. |
|---|
| 894 | </p> |
|---|
| 895 | <p> |
|---|
| 896 | Adapts the underlying octet iterator to iterate over the sequence of code points, |
|---|
| 897 | rather than raw octets. |
|---|
| 898 | </p> |
|---|
| 899 | <pre> |
|---|
| 900 | <span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator> |
|---|
| 901 | <span class="keyword">class</span> iterator; |
|---|
| 902 | </pre> |
|---|
| 903 | |
|---|
| 904 | <h5>Member functions</h5> |
|---|
| 905 | <dl> |
|---|
| 906 | <dt><code>iterator();</code> <dd> the default constructor; the underlying <code>octet_iterator</code> is |
|---|
| 907 | constructed with its default constructor. |
|---|
| 908 | <dt><code><span class="keyword">explicit</span> iterator (const octet_iterator& octet_it, |
|---|
| 909 | const octet_iterator& range_start, |
|---|
| 910 | const octet_iterator& range_end);</code> <dd> a constructor |
|---|
| 911 | that initializes the underlying <code>octet_iterator</code> with <code>octet_it</code> |
|---|
| 912 | and sets the range in which the iterator is considered valid. |
|---|
| 913 | <dt><code>octet_iterator base () <span class="keyword">const</span>;</code> <dd> returns the |
|---|
| 914 | underlying <code>octet_iterator</code>. |
|---|
| 915 | <dt><code>uint32_t operator * () <span class="keyword">const</span>;</code> <dd> decodes the UTF-8 sequence |
|---|
| 916 | the underlying <code>octet_iterator</code> is pointing to and returns the code point. |
|---|
| 917 | <dt><code><span class="keyword">bool operator</span> == (const iterator& rhs) |
|---|
| 918 | <span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span> |
|---|
| 919 | if the two underlying iterators are equal. |
|---|
| 920 | <dt><code><span class="keyword">bool operator</span> != (const iterator& rhs) |
|---|
| 921 | <span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span> |
|---|
| 922 | if the two underlying iterators are not equal. |
|---|
| 923 | <dt><code>iterator& <span class="keyword">operator</span> ++ (); </code> <dd> the prefix increment - moves |
|---|
| 924 | the iterator to the next UTF-8 encoded code point. |
|---|
| 925 | <dt><code>iterator <span class="keyword">operator</span> ++ (<span class="keyword">int</span>); </code> <dd> |
|---|
| 926 | the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one. |
|---|
| 927 | <dt><code>iterator& <span class="keyword">operator</span> -- (); </code> <dd> the prefix decrement - moves |
|---|
| 928 | the iterator to the previous UTF-8 encoded code point. |
|---|
| 929 | <dt><code>iterator <span class="keyword">operator</span> -- (<span class="keyword">int</span>); </code> <dd> |
|---|
| 930 | the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one. |
|---|
| 931 | </dl> |
|---|
| 932 | <p> |
|---|
| 933 | Example of use: |
|---|
| 934 | </p> |
|---|
| 935 | <pre> |
|---|
| 936 | <span class="keyword">char</span>* threechars = <span class="literal">"\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 937 | utf8::iterator<<span class="keyword">char</span>*> it(threechars, threechars, threechars + <span class="literal">9</span>); |
|---|
| 938 | utf8::iterator<<span class="keyword">char</span>*> it2 = it; |
|---|
| 939 | assert (it2 == it); |
|---|
| 940 | assert (*it == <span class="literal">0x10346</span>); |
|---|
| 941 | assert (*(++it) == <span class="literal">0x65e5</span>); |
|---|
| 942 | assert ((*it++) == <span class="literal">0x65e5</span>); |
|---|
| 943 | assert (*it == <span class="literal">0x0448</span>); |
|---|
| 944 | assert (it != it2); |
|---|
| 945 | utf8::iterator<<span class="keyword">char</span>*> endit (threechars + <span class="literal">9</span>, threechars, threechars + <span class="literal">9</span>); |
|---|
| 946 | assert (++it == endit); |
|---|
| 947 | assert (*(--it) == <span class="literal">0x0448</span>); |
|---|
| 948 | assert ((*it--) == <span class="literal">0x0448</span>); |
|---|
| 949 | assert (*it == <span class="literal">0x65e5</span>); |
|---|
| 950 | assert (--it == utf8::iterator<<span class="keyword">char</span>*>(threechars, threechars, threechars + <span class="literal">9</span>)); |
|---|
| 951 | assert (*it == <span class="literal">0x10346</span>); |
|---|
| 952 | </pre> |
|---|
| 953 | <p> |
|---|
| 954 | The purpose of <code>utf8::iterator</code> adapter is to enable easy iteration as well as the use of STL |
|---|
| 955 | algorithms with UTF-8 encoded strings. Increment and decrement operators are implemented in terms of |
|---|
| 956 | <code>utf8::next()</code> and <code>utf8::prior()</code> functions. |
|---|
| 957 | </p> |
|---|
| 958 | <p> |
|---|
| 959 | Note that <code>utf8::iterator</code> adapter is a checked iterator. It operates on the range specified in |
|---|
| 960 | the constructor; any attempt to go out of that range will result in an exception. Even the comparison operators |
|---|
| 961 | require both iterator objects to be constructed against the same range - otherwise an exception is thrown. Typically, |
|---|
| 962 | the range will be determined by sequence container functions <code>begin</code> and <code>end</code>, i.e.: |
|---|
| 963 | </p> |
|---|
| 964 | <pre> |
|---|
| 965 | std::string s = <span class="literal">"example"</span>; |
|---|
| 966 | utf8::iterator<std::string::iterator> i (s.begin(), s.begin(), s.end()); |
|---|
| 967 | </pre> |
|---|
| 968 | <h3 id="fununchecked"> |
|---|
| 969 | Functions From utf8::unchecked Namespace |
|---|
| 970 | </h3> |
|---|
| 971 | <h4> |
|---|
| 972 | utf8::unchecked::append |
|---|
| 973 | </h4> |
|---|
| 974 | <p class="version"> |
|---|
| 975 | Available in version 1.0 and later. |
|---|
| 976 | </p> |
|---|
| 977 | <p> |
|---|
| 978 | Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence |
|---|
| 979 | to a UTF-8 string. |
|---|
| 980 | </p> |
|---|
| 981 | <pre> |
|---|
| 982 | <span class="keyword">template</span> <<span class= |
|---|
| 983 | "keyword">typename</span> octet_iterator> |
|---|
| 984 | octet_iterator append(uint32_t cp, octet_iterator result); |
|---|
| 985 | |
|---|
| 986 | </pre> |
|---|
| 987 | <p> |
|---|
| 988 | <code>cp</code>: A 32 bit integer representing a code point to append to the |
|---|
| 989 | sequence.<br> |
|---|
| 990 | <code>result</code>: An output iterator to the place in the sequence where to |
|---|
| 991 | append the code point.<br> |
|---|
| 992 | <span class="return_value">Return value</span>: An iterator pointing to the place |
|---|
| 993 | after the newly appended sequence. |
|---|
| 994 | </p> |
|---|
| 995 | <p> |
|---|
| 996 | Example of use: |
|---|
| 997 | </p> |
|---|
| 998 | <pre> |
|---|
| 999 | <span class="keyword">unsigned char</span> u[<span class="literal">5</span>] = {<span |
|---|
| 1000 | class="literal">0</span>,<span class="literal">0</span>,<span class= |
|---|
| 1001 | "literal">0</span>,<span class="literal">0</span>,<span class="literal">0</span>}; |
|---|
| 1002 | <span class="keyword">unsigned char</span>* end = unchecked::append(<span class= |
|---|
| 1003 | "literal">0x0448</span>, u); |
|---|
| 1004 | assert (u[<span class="literal">0</span>] == <span class= |
|---|
| 1005 | "literal">0xd1</span> && u[<span class="literal">1</span>] == <span class= |
|---|
| 1006 | "literal">0x88</span> && u[<span class="literal">2</span>] == <span class= |
|---|
| 1007 | "literal">0</span> && u[<span class="literal">3</span>] == <span class= |
|---|
| 1008 | "literal">0</span> && u[<span class="literal">4</span>] == <span class= |
|---|
| 1009 | "literal">0</span>); |
|---|
| 1010 | </pre> |
|---|
| 1011 | <p> |
|---|
| 1012 | This is a faster but less safe version of <code>utf8::append</code>. It does not |
|---|
| 1013 | check for validity of the supplied code point, and may produce an invalid UTF-8 |
|---|
| 1014 | sequence. |
|---|
| 1015 | </p> |
|---|
| 1016 | <h4> |
|---|
| 1017 | utf8::unchecked::next |
|---|
| 1018 | </h4> |
|---|
| 1019 | <p class="version"> |
|---|
| 1020 | Available in version 1.0 and later. |
|---|
| 1021 | </p> |
|---|
| 1022 | <p> |
|---|
| 1023 | Given the iterator to the beginning of a UTF-8 sequence, it returns the code point |
|---|
| 1024 | and moves the iterator to the next position. |
|---|
| 1025 | </p> |
|---|
| 1026 | <pre> |
|---|
| 1027 | <span class="keyword">template</span> <<span class= |
|---|
| 1028 | "keyword">typename</span> octet_iterator> |
|---|
| 1029 | uint32_t next(octet_iterator& it); |
|---|
| 1030 | |
|---|
| 1031 | </pre> |
|---|
| 1032 | <p> |
|---|
| 1033 | <code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8 |
|---|
| 1034 | encoded code point. After the function returns, it is incremented to point to the |
|---|
| 1035 | beginning of the next code point.<br> |
|---|
| 1036 | <span class="return_value">Return value</span>: the 32 bit representation of the |
|---|
| 1037 | processed UTF-8 code point. |
|---|
| 1038 | </p> |
|---|
| 1039 | <p> |
|---|
| 1040 | Example of use: |
|---|
| 1041 | </p> |
|---|
| 1042 | <pre> |
|---|
| 1043 | <span class="keyword">char</span>* twochars = <span class= |
|---|
| 1044 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 1045 | <span class="keyword">char</span>* w = twochars; |
|---|
| 1046 | <span class="keyword">int</span> cp = unchecked::next(w); |
|---|
| 1047 | assert (cp == <span class="literal">0x65e5</span>); |
|---|
| 1048 | assert (w == twochars + <span class="literal">3</span>); |
|---|
| 1049 | </pre> |
|---|
| 1050 | <p> |
|---|
| 1051 | This is a faster but less safe version of <code>utf8::next</code>. It does not |
|---|
| 1052 | check for validity of the supplied UTF-8 sequence. |
|---|
| 1053 | </p> |
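<p>
To show what "no validity check" means in practice, here is a standalone sketch of decoding one code point while trusting the input completely. <code>next_unchecked_sketch</code> is a hypothetical name and the code is not taken from the library; it assumes <code>it</code> points at the first octet of a well-formed sequence.
</p>

```cpp
#include <cassert>
#include <cstdint>

// Decode the code point at `it` and advance `it` past it.
// Nothing is validated: malformed input gives a meaningless result and may
// read past the end of the buffer.
inline std::uint32_t next_unchecked_sketch(const char*& it) {
    const unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;  // single-octet (ASCII) case
    int trailing;
    std::uint32_t cp;
    if (lead < 0xe0)      { trailing = 1; cp = lead & 0x1f; }  // 110xxxxx
    else if (lead < 0xf0) { trailing = 2; cp = lead & 0x0f; }  // 1110xxxx
    else                  { trailing = 3; cp = lead & 0x07; }  // 11110xxx
    while (trailing-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3f);
    return cp;
}
```

<p>
The absence of range and continuation checks is what makes this style fast, and also why the input should be validated beforehand.
</p>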
|---|
| 1054 | <h4> |
|---|
| 1055 | utf8::unchecked::peek_next |
|---|
| 1056 | </h4> |
|---|
| 1057 | <p class="version"> |
|---|
| 1058 | Available in version 2.1 and later. |
|---|
| 1059 | </p> |
|---|
| 1060 | <p> |
|---|
| 1061 | Given the iterator to the beginning of a UTF-8 sequence, it returns the code point. |
|---|
| 1062 | </p> |
|---|
| 1063 | <pre> |
|---|
| 1064 | <span class="keyword">template</span> <<span class= |
|---|
| 1065 | "keyword">typename</span> octet_iterator> |
|---|
| 1066 | uint32_t peek_next(octet_iterator it); |
|---|
| 1067 | |
|---|
| 1068 | </pre> |
|---|
| 1069 | <p> |
|---|
| 1070 | <code>it</code>: an iterator pointing to the beginning of a UTF-8 |
|---|
| 1071 | encoded code point.<br> |
|---|
| 1072 | <span class="return_value">Return value</span>: the 32 bit representation of the |
|---|
| 1073 | processed UTF-8 code point. |
|---|
| 1074 | </p> |
|---|
| 1075 | <p> |
|---|
| 1076 | Example of use: |
|---|
| 1077 | </p> |
|---|
| 1078 | <pre> |
|---|
| 1079 | <span class="keyword">char</span>* twochars = <span class= |
|---|
| 1080 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 1081 | <span class="keyword">char</span>* w = twochars; |
|---|
| 1082 | <span class="keyword">int</span> cp = unchecked::peek_next(w); |
|---|
| 1083 | assert (cp == <span class="literal">0x65e5</span>); |
|---|
| 1084 | assert (w == twochars); |
|---|
| 1085 | </pre> |
|---|
| 1086 | <p> |
|---|
| 1087 | This is a faster but less safe version of <code>utf8::peek_next</code>. It does not |
|---|
| 1088 | check for validity of the supplied UTF-8 sequence. |
|---|
| 1089 | </p> |
|---|
| 1090 | <h4> |
|---|
| 1091 | utf8::unchecked::prior |
|---|
| 1092 | </h4> |
|---|
| 1093 | <p class="version"> |
|---|
| 1094 | Available in version 1.02 and later. |
|---|
| 1095 | </p> |
|---|
| 1096 | <p> |
|---|
| 1097 | Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it |
|---|
| 1098 | decreases the iterator until it hits the beginning of the previous UTF-8 encoded |
|---|
| 1099 | code point and returns the 32 bit representation of the code point. |
|---|
| 1100 | </p> |
|---|
| 1101 | <pre> |
|---|
| 1102 | <span class="keyword">template</span> <<span class= |
|---|
| 1103 | "keyword">typename</span> octet_iterator> |
|---|
| 1104 | uint32_t prior(octet_iterator& it); |
|---|
| 1105 | |
|---|
| 1106 | </pre> |
|---|
| 1107 | <p> |
|---|
| 1108 | <code>it</code>: a reference pointing to an octet within a UTF-8 encoded string. |
|---|
| 1109 | After the function returns, it is decremented to point to the beginning of the |
|---|
| 1110 | previous code point.<br> |
|---|
| 1111 | <span class="return_value">Return value</span>: the 32 bit representation of the |
|---|
| 1112 | previous code point. |
|---|
| 1113 | </p> |
|---|
| 1114 | <p> |
|---|
| 1115 | Example of use: |
|---|
| 1116 | </p> |
|---|
| 1117 | <pre> |
|---|
| 1118 | <span class="keyword">char</span>* twochars = <span class= |
|---|
| 1119 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 1120 | <span class="keyword">char</span>* w = twochars + <span class="literal">3</span>; |
|---|
| 1121 | <span class="keyword">int</span> cp = unchecked::prior (w); |
|---|
| 1122 | assert (cp == <span class="literal">0x65e5</span>); |
|---|
| 1123 | assert (w == twochars); |
|---|
| 1124 | </pre> |
|---|
| 1125 | <p> |
|---|
| 1126 | This is a faster but less safe version of <code>utf8::prior</code>. It does not |
|---|
| 1127 | check for validity of the supplied UTF-8 sequence and offers no boundary checking. |
|---|
| 1128 | </p> |
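<p>
Stepping backwards relies on the fact that UTF-8 continuation octets always have the bit pattern <code>10xxxxxx</code>. The standalone sketch below (with the hypothetical name <code>step_back_sketch</code>, not library code) shows only the "move to the previous lead octet" part and, like the unchecked function, performs no boundary checking.
</p>

```cpp
#include <cassert>

// Move back to the first octet of the previous code point.
// No lower bound is checked: the caller must guarantee that a lead octet
// exists somewhere before `it` inside the buffer.
inline const char* step_back_sketch(const char* it) {
    do {
        --it;
    } while ((static_cast<unsigned char>(*it) & 0xc0) == 0x80);  // skip 10xxxxxx
    return it;
}
```

<p>
Once the lead octet is found, the code point can be decoded from that position in the usual forward direction.
</p>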
|---|
| 1129 | <h4> |
|---|
| 1130 | utf8::unchecked::previous (deprecated, see utf8::unchecked::prior) |
|---|
| 1131 | </h4> |
|---|
| 1132 | <p class="version"> |
|---|
| 1133 | Deprecated in version 1.02 and later. |
|---|
| 1134 | </p> |
|---|
| 1135 | <p> |
|---|
| 1136 | Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it |
|---|
| 1137 | decreases the iterator until it hits the beginning of the previous UTF-8 encoded |
|---|
| 1138 | code point and returns the 32 bit representation of the code point. |
|---|
| 1139 | </p> |
|---|
| 1140 | <pre> |
|---|
| 1141 | <span class="keyword">template</span> <<span class= |
|---|
| 1142 | "keyword">typename</span> octet_iterator> |
|---|
| 1143 | uint32_t previous(octet_iterator& it); |
|---|
| 1144 | |
|---|
| 1145 | </pre> |
|---|
| 1146 | <p> |
|---|
| 1147 | <code>it</code>: a reference pointing to an octet within a UTF-8 encoded string. |
|---|
| 1148 | After the function returns, it is decremented to point to the beginning of the |
|---|
| 1149 | previous code point.<br> |
|---|
| 1150 | <span class="return_value">Return value</span>: the 32 bit representation of the |
|---|
| 1151 | previous code point. |
|---|
| 1152 | </p> |
|---|
| 1153 | <p> |
|---|
| 1154 | Example of use: |
|---|
| 1155 | </p> |
|---|
| 1156 | <pre> |
|---|
| 1157 | <span class="keyword">char</span>* twochars = <span class= |
|---|
| 1158 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 1159 | <span class="keyword">char</span>* w = twochars + <span class="literal">3</span>; |
|---|
| 1160 | <span class="keyword">int</span> cp = unchecked::previous (w); |
|---|
| 1161 | assert (cp == <span class="literal">0x65e5</span>); |
|---|
| 1162 | assert (w == twochars); |
|---|
| 1163 | </pre> |
|---|
| 1164 | <p> |
|---|
| 1165 | This function is deprecated only for consistency with the "checked" |
|---|
| 1166 | versions, where <code>prior</code> should be used instead of <code>previous</code>. |
|---|
| 1167 | In fact, <code>unchecked::previous</code> behaves exactly the same as |
|---|
| 1168 | <code>unchecked::prior</code>. |
|---|
| 1169 | </p> |
|---|
| 1170 | <p> |
|---|
| 1171 | This is a faster but less safe version of <code>utf8::previous</code>. It does not |
|---|
| 1172 | check for validity of the supplied UTF-8 sequence and offers no boundary checking. |
|---|
| 1173 | </p> |
|---|
| 1174 | <h4> |
|---|
| 1175 | utf8::unchecked::advance |
|---|
| 1176 | </h4> |
|---|
| 1177 | <p class="version"> |
|---|
| 1178 | Available in version 1.0 and later. |
|---|
| 1179 | </p> |
|---|
| 1180 | <p> |
|---|
| 1181 | Advances an iterator by the specified number of code points within a UTF-8 |
|---|
| 1182 | sequence. |
|---|
| 1183 | </p> |
|---|
| 1184 | <pre> |
|---|
| 1185 | <span class="keyword">template</span> <<span class= |
|---|
| 1186 | "keyword">typename</span> octet_iterator, typename distance_type> |
|---|
| 1187 | <span class="keyword">void</span> advance (octet_iterator& it, distance_type n); |
|---|
| 1188 | |
|---|
| 1189 | </pre> |
|---|
| 1190 | <p> |
|---|
| 1191 | <code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8 |
|---|
| 1192 | encoded code point. After the function returns, it is incremented to point to the |
|---|
| 1193 | nth following code point.<br> |
|---|
| 1194 | <code>n</code>: a positive integer that shows how many code points we want to |
|---|
| 1195 | advance.<br> |
|---|
| 1196 | </p> |
|---|
| 1197 | <p> |
|---|
| 1198 | Example of use: |
|---|
| 1199 | </p> |
|---|
| 1200 | <pre> |
|---|
| 1201 | <span class="keyword">char</span>* twochars = <span class= |
|---|
| 1202 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 1203 | <span class="keyword">char</span>* w = twochars; |
|---|
| 1204 | unchecked::advance (w, <span class="literal">2</span>); |
|---|
| 1205 | assert (w == twochars + <span class="literal">5</span>); |
|---|
| 1206 | </pre> |
|---|
| 1207 | <p> |
|---|
| 1208 | This function works only "forward". In case of a negative <code>n</code>, there is |
|---|
| 1209 | no effect. |
|---|
| 1210 | </p> |
|---|
| 1211 | <p> |
|---|
| 1212 | This is a faster but less safe version of <code>utf8::advance</code>. It does not |
|---|
| 1213 | check for validity of the supplied UTF-8 sequence and offers no boundary checking. |
|---|
| 1214 | </p> |
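<p>
Advancing by code points without decoding them only requires the sequence length, which can be read from the lead octet. The following standalone sketch illustrates that idea; <code>sequence_length_sketch</code> and <code>advance_sketch</code> are hypothetical names, and the code assumes well-formed input, just as the unchecked function does.
</p>

```cpp
#include <cassert>
#include <cstddef>

// Length in octets of the sequence introduced by a lead octet.
// Assumes `lead` really is a lead octet of well-formed UTF-8.
inline std::size_t sequence_length_sketch(unsigned char lead) {
    assert((lead & 0xc0) != 0x80);      // must not be a continuation octet
    if (lead < 0x80) return 1;          // 0xxxxxxx
    if ((lead >> 5) == 0x6) return 2;   // 110xxxxx
    if ((lead >> 4) == 0xe) return 3;   // 1110xxxx
    return 4;                           // 11110xxx (assumed)
}

// Move forward by n code points; there is no check against the end of the buffer.
inline const char* advance_sketch(const char* it, std::size_t n) {
    for (; n > 0; --n)
        it += sequence_length_sketch(static_cast<unsigned char>(*it));
    return it;
}
```

<p>
With the five-octet string from the example, advancing by two code points skips 3 + 2 octets.
</p>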
|---|
| 1215 | <h4> |
|---|
| 1216 | utf8::unchecked::distance |
|---|
| 1217 | </h4> |
|---|
| 1218 | <p class="version"> |
|---|
| 1219 | Available in version 1.0 and later. |
|---|
| 1220 | </p> |
|---|
| 1221 | <p> |
|---|
| 1222 | Given the iterators to two UTF-8 encoded code points in a sequence, returns the |
|---|
| 1223 | number of code points between them. |
|---|
| 1224 | </p> |
|---|
| 1225 | <pre> |
|---|
| 1226 | <span class="keyword">template</span> <<span class= |
|---|
| 1227 | "keyword">typename</span> octet_iterator> |
|---|
| 1228 | <span class= |
|---|
| 1229 | "keyword">typename</span> std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last); |
|---|
| 1230 | </pre> |
|---|
| 1231 | <p> |
|---|
| 1232 | <code>first</code>: an iterator to the beginning of a UTF-8 encoded code point.<br> |
|---|
| 1233 | <code>last</code>: an iterator to the "post-end" of the last UTF-8 encoded code |
|---|
| 1234 | point in the sequence whose length we are trying to determine. It can be the |
|---|
| 1235 | beginning of a new code point, or not.<br> |
|---|
| 1236 | <span class="return_value">Return value</span>: the distance between the iterators, |
|---|
| 1237 | in code points. |
|---|
| 1238 | </p> |
|---|
| 1239 | <p> |
|---|
| 1240 | Example of use: |
|---|
| 1241 | </p> |
|---|
| 1242 | <pre> |
|---|
| 1243 | <span class="keyword">char</span>* twochars = <span class= |
|---|
| 1244 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 1245 | size_t dist = utf8::unchecked::distance(twochars, twochars + <span class= |
|---|
| 1246 | "literal">5</span>); |
|---|
| 1247 | assert (dist == <span class="literal">2</span>); |
|---|
| 1248 | </pre> |
|---|
| 1249 | <p> |
|---|
| 1250 | This is a faster but less safe version of <code>utf8::distance</code>. It does not |
|---|
| 1251 | check for validity of the supplied UTF-8 sequence. |
|---|
| 1252 | </p> |
|---|
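<p>
For valid input, the number of code points in a range equals the number of octets that are not continuation octets (those of the form <code>10xxxxxx</code>). The sketch below counts them directly. It is an alternative technique shown for illustration, not necessarily how the library computes the distance, and <code>sketch_distance</code> is a hypothetical name.
</p>

```cpp
#include <cassert>
#include <iterator>

// Hypothetical, unchecked code point count: every octet that is not a
// continuation octet (10xxxxxx) starts a new code point. Valid UTF-8 is
// assumed; nothing is verified.
template <typename octet_iterator>
typename std::iterator_traits<octet_iterator>::difference_type
sketch_distance(octet_iterator first, octet_iterator last)
{
    typename std::iterator_traits<octet_iterator>::difference_type count = 0;
    for (; first != last; ++first) {
        unsigned char octet = static_cast<unsigned char>(*first);
        if ((octet & 0xc0) != 0x80)
            ++count;
    }
    return count;
}
```

<p>
Called on the five octets of <code>twochars</code> from the example above, the sketch finds two lead octets and so returns 2.
</p>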
| 1253 | <h4> |
|---|
| 1254 | utf8::unchecked::utf16to8 |
|---|
| 1255 | </h4> |
|---|
| 1256 | <p class="version"> |
|---|
| 1257 | Available in version 1.0 and later. |
|---|
| 1258 | </p> |
|---|
| 1259 | <p> |
|---|
| 1260 | Converts a UTF-16 encoded string to UTF-8. |
|---|
| 1261 | </p> |
|---|
| 1262 | <pre> |
|---|
| 1263 | <span class="keyword">template</span> <<span class= |
|---|
| 1264 | "keyword">typename</span> u16bit_iterator, <span class= |
|---|
| 1265 | "keyword">typename</span> octet_iterator> |
|---|
| 1266 | octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result); |
|---|
| 1267 | |
|---|
| 1268 | </pre> |
|---|
| 1269 | <p> |
|---|
| 1270 | <code>start</code>: an iterator pointing to the beginning of the UTF-16 encoded |
|---|
| 1271 | string to convert.<br> |
|---|
| 1272 | <code>end</code>: an iterator pointing to past-the-end of the UTF-16 encoded |
|---|
| 1273 | string to convert.<br> |
|---|
| 1274 | <code>result</code>: an output iterator to the place in the UTF-8 string where |
|---|
| 1275 | the result of the conversion is appended.<br> |
|---|
| 1276 | <span class="return_value">Return value</span>: An iterator pointing to the place |
|---|
| 1277 | after the appended UTF-8 string. |
|---|
| 1278 | </p> |
|---|
| 1279 | <p> |
|---|
| 1280 | Example of use: |
|---|
| 1281 | </p> |
|---|
| 1282 | <pre> |
|---|
| 1283 | <span class="keyword">unsigned short</span> utf16string[] = {<span class= |
|---|
| 1284 | "literal">0x41</span>, <span class="literal">0x0448</span>, <span class= |
|---|
| 1285 | "literal">0x65e5</span>, <span class="literal">0xd834</span>, <span class= |
|---|
| 1286 | "literal">0xdd1e</span>}; |
|---|
| 1287 | vector<<span class="keyword">unsigned char</span>> utf8result; |
|---|
| 1288 | unchecked::utf16to8(utf16string, utf16string + <span class= |
|---|
| 1289 | "literal">5</span>, back_inserter(utf8result)); |
|---|
| 1290 | assert (utf8result.size() == <span class="literal">10</span>); |
|---|
| 1291 | </pre> |
|---|
| 1292 | <p> |
|---|
| 1293 | This is a faster but less safe version of <code>utf8::utf16to8</code>. It does not |
|---|
| 1294 | check for validity of the supplied UTF-16 sequence. |
|---|
| 1295 | </p> |
|---|
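<p>
The last two UTF-16 units in the example form a surrogate pair, which is why five units produce only four code points. The sketch below shows the standard arithmetic for combining a pair into one code point. The function names are hypothetical and, like the unchecked functions, the sketch assumes the pair is well formed.
</p>

```cpp
#include <cassert>

// Hypothetical helper: true for UTF-16 units in the lead (high)
// surrogate range 0xD800..0xDBFF.
inline bool sketch_is_lead_surrogate(unsigned int unit)
{
    return unit >= 0xd800 && unit <= 0xdbff;
}

// Hypothetical helper: combines a lead and a trail surrogate into one
// code point. No check is made that the arguments really are surrogates.
inline unsigned long sketch_combine_surrogates(unsigned int lead, unsigned int trail)
{
    return 0x10000ul
         + ((static_cast<unsigned long>(lead) - 0xd800ul) << 10)
         + (static_cast<unsigned long>(trail) - 0xdc00ul);
}
```

<p>
For the pair <code>0xd834</code>, <code>0xdd1e</code> this yields U+1D11E, which takes four octets in UTF-8. Together with one, two and three octets for the first three units, that gives the ten octets asserted in the example.
</p>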
| 1296 | <h4> |
|---|
| 1297 | utf8::unchecked::utf8to16 |
|---|
| 1298 | </h4> |
|---|
| 1299 | <p class="version"> |
|---|
| 1300 | Available in version 1.0 and later. |
|---|
| 1301 | </p> |
|---|
| 1302 | <p> |
|---|
| 1303 | Converts a UTF-8 encoded string to UTF-16. |
|---|
| 1304 | </p> |
|---|
| 1305 | <pre> |
|---|
| 1306 | <span class="keyword">template</span> <<span class= |
|---|
| 1307 | "keyword">typename</span> u16bit_iterator, typename octet_iterator> |
|---|
| 1308 | u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result); |
|---|
| 1309 | |
|---|
| 1310 | </pre> |
|---|
| 1311 | <p> |
|---|
| 1312 | <code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded |
|---|
| 1313 | string to convert.<br> <code>end</code>: an iterator pointing to |
|---|
| 1314 | past-the-end of the UTF-8 encoded string to convert.<br> |
|---|
| 1315 | <code>result</code>: an output iterator to the place in the UTF-16 string where |
|---|
| 1316 | the result of the conversion is appended.<br> |
|---|
| 1317 | <span class="return_value">Return value</span>: An iterator pointing to the place |
|---|
| 1318 | after the appended UTF-16 string. |
|---|
| 1319 | </p> |
|---|
| 1320 | <p> |
|---|
| 1321 | Example of use: |
|---|
| 1322 | </p> |
|---|
| 1323 | <pre> |
|---|
| 1324 | <span class="keyword">char</span> utf8_with_surrogates[] = <span class= |
|---|
| 1325 | "literal">"\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"</span>; |
|---|
| 1326 | vector <<span class="keyword">unsigned short</span>> utf16result; |
|---|
| 1327 | unchecked::utf8to16(utf8_with_surrogates, utf8_with_surrogates + <span class= |
|---|
| 1328 | "literal">9</span>, back_inserter(utf16result)); |
|---|
| 1329 | assert (utf16result.size() == <span class="literal">4</span>); |
|---|
| 1330 | assert (utf16result[<span class="literal">2</span>] == <span class= |
|---|
| 1331 | "literal">0xd834</span>); |
|---|
| 1332 | assert (utf16result[<span class="literal">3</span>] == <span class= |
|---|
| 1333 | "literal">0xdd1e</span>); |
|---|
| 1334 | </pre> |
|---|
| 1335 | <p> |
|---|
| 1336 | This is a faster but less safe version of <code>utf8::utf8to16</code>. It does not |
|---|
| 1337 | check for validity of the supplied UTF-8 sequence. |
|---|
| 1338 | </p> |
|---|
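<p>
Going the other way, a code point above U+FFFF has to be written as two UTF-16 units. A minimal sketch of that split follows. <code>sketch_split_to_surrogates</code> is a hypothetical name, and the caller is assumed to pass a code point in the range U+10000 to U+10FFFF.
</p>

```cpp
#include <cassert>

// Hypothetical helper: splits a supplementary code point into a UTF-16
// surrogate pair. The range of cp is assumed, not checked.
inline void sketch_split_to_surrogates(unsigned long cp,
                                       unsigned short& lead,
                                       unsigned short& trail)
{
    unsigned long v = cp - 0x10000ul;  // 20 significant bits remain
    lead  = static_cast<unsigned short>(0xd800ul + (v >> 10));     // upper 10 bits
    trail = static_cast<unsigned short>(0xdc00ul + (v & 0x3fful)); // lower 10 bits
}
```

<p>
For U+1D11E, the code point encoded by the last four octets of <code>utf8_with_surrogates</code>, the sketch produces <code>0xd834</code> and <code>0xdd1e</code>, the same values the example asserts.
</p>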
| 1339 | <h4> |
|---|
| 1340 | utf8::unchecked::utf32to8 |
|---|
| 1341 | </h4> |
|---|
| 1342 | <p class="version"> |
|---|
| 1343 | Available in version 1.0 and later. |
|---|
| 1344 | </p> |
|---|
| 1345 | <p> |
|---|
| 1346 | Converts a UTF-32 encoded string to UTF-8. |
|---|
| 1347 | </p> |
|---|
| 1348 | <pre> |
|---|
| 1349 | <span class="keyword">template</span> <<span class= |
|---|
| 1350 | "keyword">typename</span> octet_iterator, <span class= |
|---|
| 1351 | "keyword">typename</span> u32bit_iterator> |
|---|
| 1352 | octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result); |
|---|
| 1353 | |
|---|
| 1354 | </pre> |
|---|
| 1355 | <p> |
|---|
| 1356 | <code>start</code>: an iterator pointing to the beginning of the UTF-32 encoded |
|---|
| 1357 | string to convert.<br> |
|---|
| 1358 | <code>end</code>: an iterator pointing to past-the-end of the UTF-32 encoded |
|---|
| 1359 | string to convert.<br> |
|---|
| 1360 | <code>result</code>: an output iterator to the place in the UTF-8 string where |
|---|
| 1361 | the result of the conversion is appended.<br> |
|---|
| 1362 | <span class="return_value">Return value</span>: An iterator pointing to the place |
|---|
| 1363 | after the appended UTF-8 string. |
|---|
| 1364 | </p> |
|---|
| 1365 | <p> |
|---|
| 1366 | Example of use: |
|---|
| 1367 | </p> |
|---|
| 1368 | <pre> |
|---|
| 1369 | <span class="keyword">int</span> utf32string[] = {<span class= |
|---|
| 1370 | "literal">0x448</span>, <span class="literal">0x65e5</span>, <span class= |
|---|
| 1371 | "literal">0x10346</span>, <span class="literal">0</span>}; |
|---|
| 1372 | vector<<span class="keyword">unsigned char</span>> utf8result; |
|---|
| 1373 | unchecked::utf32to8(utf32string, utf32string + <span class= |
|---|
| 1374 | "literal">3</span>, back_inserter(utf8result)); |
|---|
| 1375 | assert (utf8result.size() == <span class="literal">9</span>); |
|---|
| 1376 | </pre> |
|---|
| 1377 | <p> |
|---|
| 1378 | This is a faster but less safe version of <code>utf8::utf32to8</code>. It does not |
|---|
| 1379 | check for validity of the supplied UTF-32 sequence. |
|---|
| 1380 | </p> |
|---|
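<p>
Converting from UTF-32 amounts to encoding each code point in turn. The sketch below shows the bit layout for one code point. It is an illustration under the assumption that the value is a valid code point, and <code>sketch_append</code> is a hypothetical name rather than the library's function.
</p>

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical, unchecked encoder: writes the UTF-8 form of cp through
// the output iterator and returns the advanced iterator. The value is
// assumed to be a valid code point.
template <typename octet_iterator>
octet_iterator sketch_append(unsigned long cp, octet_iterator result)
{
    if (cp < 0x80) {                    // one octet
        *(result++) = static_cast<unsigned char>(cp);
    } else if (cp < 0x800) {            // two octets
        *(result++) = static_cast<unsigned char>((cp >> 6) | 0xc0);
        *(result++) = static_cast<unsigned char>((cp & 0x3f) | 0x80);
    } else if (cp < 0x10000) {          // three octets
        *(result++) = static_cast<unsigned char>((cp >> 12) | 0xe0);
        *(result++) = static_cast<unsigned char>(((cp >> 6) & 0x3f) | 0x80);
        *(result++) = static_cast<unsigned char>((cp & 0x3f) | 0x80);
    } else {                            // four octets
        *(result++) = static_cast<unsigned char>((cp >> 18) | 0xf0);
        *(result++) = static_cast<unsigned char>(((cp >> 12) & 0x3f) | 0x80);
        *(result++) = static_cast<unsigned char>(((cp >> 6) & 0x3f) | 0x80);
        *(result++) = static_cast<unsigned char>((cp & 0x3f) | 0x80);
    }
    return result;
}
```

<p>
Applied to the three code points of <code>utf32string</code>, the sketch emits two, three and four octets respectively, which accounts for the nine octets in the example.
</p>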
| 1381 | <h4> |
|---|
| 1382 | utf8::unchecked::utf8to32 |
|---|
| 1383 | </h4> |
|---|
| 1384 | <p class="version"> |
|---|
| 1385 | Available in version 1.0 and later. |
|---|
| 1386 | </p> |
|---|
| 1387 | <p> |
|---|
| 1388 | Converts a UTF-8 encoded string to UTF-32. |
|---|
| 1389 | </p> |
|---|
| 1390 | <pre> |
|---|
| 1391 | <span class="keyword">template</span> <<span class= |
|---|
| 1392 | "keyword">typename</span> octet_iterator, typename u32bit_iterator> |
|---|
| 1393 | u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result); |
|---|
| 1394 | |
|---|
| 1395 | </pre> |
|---|
| 1396 | <p> |
|---|
| 1397 | <code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded |
|---|
| 1398 | string to convert.<br> |
|---|
| 1399 | <code>end</code>: an iterator pointing to past-the-end of the UTF-8 encoded string |
|---|
| 1400 | to convert.<br> |
|---|
| 1401 | <code>result</code>: an output iterator to the place in the UTF-32 string where |
|---|
| 1402 | the result of the conversion is appended.<br> |
|---|
| 1403 | <span class="return_value">Return value</span>: An iterator pointing to the place |
|---|
| 1404 | after the appended UTF-32 string. |
|---|
| 1405 | </p> |
|---|
| 1406 | <p> |
|---|
| 1407 | Example of use: |
|---|
| 1408 | </p> |
|---|
| 1409 | <pre> |
|---|
| 1410 | <span class="keyword">char</span>* twochars = <span class= |
|---|
| 1411 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 1412 | vector<<span class="keyword">int</span>> utf32result; |
|---|
| 1413 | unchecked::utf8to32(twochars, twochars + <span class= |
|---|
| 1414 | "literal">5</span>, back_inserter(utf32result)); |
|---|
| 1415 | assert (utf32result.size() == <span class="literal">2</span>); |
|---|
| 1416 | </pre> |
|---|
| 1417 | <p> |
|---|
| 1418 | This is a faster but less safe version of <code>utf8::utf8to32</code>. It does not |
|---|
| 1419 | check for validity of the supplied UTF-8 sequence. |
|---|
| 1420 | </p> |
|---|
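<p>
Decoding is the mirror image: the lead octet determines how many continuation octets follow, and their payload bits are shifted into place. The following is a minimal sketch assuming valid input. <code>sketch_payload</code> and <code>sketch_next</code> are hypothetical names, not the library's own code.
</p>

```cpp
#include <cassert>

// Hypothetical helper: the six payload bits of a continuation octet.
inline unsigned long sketch_payload(unsigned char octet)
{
    return octet & 0x3fu;
}

// Hypothetical, unchecked decoder: returns the code point that starts at
// `it` and leaves `it` at the start of the next one. Valid UTF-8 is
// assumed, so neither the octets nor the end of the buffer are checked.
template <typename octet_iterator>
unsigned long sketch_next(octet_iterator& it)
{
    unsigned long cp = static_cast<unsigned char>(*it);
    if (cp < 0x80) {
        // one octet: the value is the code point itself
    } else if ((cp >> 5) == 0x06) {     // two octets
        cp = (cp & 0x1f) << 6;
        cp |= sketch_payload(*++it);
    } else if ((cp >> 4) == 0x0e) {     // three octets
        cp = (cp & 0x0f) << 12;
        cp |= sketch_payload(*++it) << 6;
        cp |= sketch_payload(*++it);
    } else {                            // four octets assumed
        cp = (cp & 0x07) << 18;
        cp |= sketch_payload(*++it) << 12;
        cp |= sketch_payload(*++it) << 6;
        cp |= sketch_payload(*++it);
    }
    ++it;
    return cp;
}
```

<p>
Run over <code>twochars</code>, the sketch yields U+65E5 and then U+0448, the two values the conversion in the example stores in <code>utf32result</code>.
</p>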
| 1421 | <h3 id="typesunchecked"> |
|---|
| 1422 | Types From utf8::unchecked Namespace |
|---|
| 1423 | </h3> |
|---|
| 1424 | <h4> |
|---|
| 1425 | utf8::unchecked::iterator |
|---|
| 1426 | </h4> |
|---|
| 1427 | <p class="version"> |
|---|
| 1428 | Available in version 2.0 and later. |
|---|
| 1429 | </p> |
|---|
| 1430 | <p> |
|---|
| 1431 | Adapts the underlying octet iterator to iterate over the sequence of code points, |
|---|
| 1432 | rather than raw octets. |
|---|
| 1433 | </p> |
|---|
| 1434 | <pre> |
|---|
| 1435 | <span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator> |
|---|
| 1436 | <span class="keyword">class</span> iterator; |
|---|
| 1437 | </pre> |
|---|
| 1438 | |
|---|
| 1439 | <h5>Member functions</h5> |
|---|
| 1440 | <dl> |
|---|
| 1441 | <dt><code>iterator();</code> <dd> the default constructor; the underlying <code>octet_iterator</code> is |
|---|
| 1442 | constructed with its default constructor. |
|---|
| 1443 | <dt><code><span class="keyword">explicit</span> iterator (const octet_iterator& octet_it); |
|---|
| 1444 | </code> <dd> a constructor |
|---|
| 1445 | that initializes the underlying <code>octet_iterator</code> with <code>octet_it</code>. |
|---|
| 1446 | <dt><code>octet_iterator base () <span class="keyword">const</span>;</code> <dd> returns the |
|---|
| 1447 | underlying <code>octet_iterator</code>. |
|---|
| 1448 | <dt><code>uint32_t operator * () <span class="keyword">const</span>;</code> <dd> decodes the UTF-8 sequence |
|---|
| 1449 | the underlying <code>octet_iterator</code> is pointing to and returns the code point. |
|---|
| 1450 | <dt><code><span class="keyword">bool operator</span> == (const iterator& rhs) |
|---|
| 1451 | <span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span> |
|---|
| 1452 | if the two underlying iterators are equal. |
|---|
| 1453 | <dt><code><span class="keyword">bool operator</span> != (const iterator& rhs) |
|---|
| 1454 | <span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span> |
|---|
| 1455 | if the two underlying iterators are not equal. |
|---|
| 1456 | <dt><code>iterator& <span class="keyword">operator</span> ++ (); </code> <dd> the prefix increment - moves |
|---|
| 1457 | the iterator to the next UTF-8 encoded code point. |
|---|
| 1458 | <dt><code>iterator <span class="keyword">operator</span> ++ (<span class="keyword">int</span>); </code> <dd> |
|---|
| 1459 | the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one. |
|---|
| 1460 | <dt><code>iterator& <span class="keyword">operator</span> -- (); </code> <dd> the prefix decrement - moves |
|---|
| 1461 | the iterator to the previous UTF-8 encoded code point. |
|---|
| 1462 | <dt><code>iterator <span class="keyword">operator</span> -- (<span class="keyword">int</span>); </code> <dd> |
|---|
| 1463 | the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one. |
|---|
| 1464 | </dl> |
|---|
| 1465 | <p> |
|---|
| 1466 | Example of use: |
|---|
| 1467 | </p> |
|---|
| 1468 | <pre> |
|---|
| 1469 | <span class="keyword">char</span>* threechars = <span class="literal">"\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88"</span>; |
|---|
| 1470 | utf8::unchecked::iterator<<span class="keyword">char</span>*> un_it(threechars); |
|---|
| 1471 | utf8::unchecked::iterator<<span class="keyword">char</span>*> un_it2 = un_it; |
|---|
| 1472 | assert (un_it2 == un_it); |
|---|
| 1473 | assert (*un_it == <span class="literal">0x10346</span>); |
|---|
| 1474 | assert (*(++un_it) == <span class="literal">0x65e5</span>); |
|---|
| 1475 | assert ((*un_it++) == <span class="literal">0x65e5</span>); |
|---|
| 1476 | assert (*un_it == <span class="literal">0x0448</span>); |
|---|
| 1477 | assert (un_it != un_it2); |
|---|
| 1478 | utf8::unchecked::iterator<<span class="keyword">char</span>*> un_endit (threechars + <span class="literal">9</span>); |
|---|
| 1479 | assert (++un_it == un_endit); |
|---|
| 1480 | assert (*(--un_it) == <span class="literal">0x0448</span>); |
|---|
| 1481 | assert ((*un_it--) == <span class="literal">0x0448</span>); |
|---|
| 1482 | assert (*un_it == <span class="literal">0x65e5</span>); |
|---|
| 1483 | assert (--un_it == utf8::unchecked::iterator<<span class="keyword">char</span>*>(threechars)); |
|---|
| 1484 | assert (*un_it == <span class="literal">0x10346</span>); |
|---|
| 1485 | </pre> |
|---|
| 1486 | <p> |
|---|
| 1487 | This is an unchecked version of <code>utf8::iterator</code>. It is faster in many cases, but offers |
|---|
| 1488 | no validity or range checks. |
|---|
| 1489 | </p> |
|---|
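<p>
The decrement operators are possible because UTF-8 is self-synchronizing: continuation octets always have the form <code>10xxxxxx</code>, so the start of the previous code point can be found by stepping back until a different kind of octet appears. The sketch below shows only that backward step. <code>sketch_step_back</code> is a hypothetical name, and the caller is assumed to guarantee that a complete code point precedes the iterator.
</p>

```cpp
#include <cassert>

// Hypothetical, unchecked backward step: moves `it` to the first octet
// of the previous code point by skipping continuation octets. There is
// no check against the beginning of the buffer.
template <typename octet_iterator>
void sketch_step_back(octet_iterator& it)
{
    do {
        --it;
    } while ((static_cast<unsigned char>(*it) & 0xc0) == 0x80);
}
```

<p>
Starting from the end of <code>threechars</code>, three such steps land on offsets 7, 4 and 0, the positions of the three code points in the example.
</p>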
| 1490 | <h2 id="points"> |
|---|
| 1491 | Points of interest |
|---|
| 1492 | </h2> |
|---|
| 1493 | <h4> |
|---|
| 1494 | Design goals and decisions |
|---|
| 1495 | </h4> |
|---|
| 1496 | <p> |
|---|
| 1497 | The library was designed to be: |
|---|
| 1498 | </p> |
|---|
| 1499 | <ol> |
|---|
| 1500 | <li> |
|---|
| 1501 | Generic: for better or worse, there are many C++ string classes out there, and |
|---|
| 1502 | the library should work with as many of them as possible. |
|---|
| 1503 | </li> |
|---|
| 1504 | <li> |
|---|
| 1505 | Portable: the library should be portable across different platforms and |
|---|
| 1506 | compilers. The only non-portable code is a small section that declares unsigned |
|---|
| 1507 | integers of different sizes: three typedefs. They can be changed by the users of |
|---|
| 1508 | the library if they don't match their platform. The default setting should work |
|---|
| 1509 | for Windows (both 32 and 64 bit), and most 32 bit and 64 bit Unix derivatives. |
|---|
| 1510 | </li> |
|---|
| 1511 | <li> |
|---|
| 1512 | Lightweight: follow the "pay only for what you use" guideline. |
|---|
| 1513 | </li> |
|---|
| 1514 | <li> |
|---|
| 1515 | Unintrusive: avoid forcing any particular design or even programming style on the |
|---|
| 1516 | user. This is a library, not a framework. |
|---|
| 1517 | </li> |
|---|
| 1518 | </ol> |
|---|
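<p>
As an illustration of the portability point, the three typedefs could look something like the sketch below. The namespace name and the underlying types are assumptions chosen for common platforms, not a copy of the library's header. On a platform where they do not match, these are the lines a user would adjust.
</p>

```cpp
#include <cassert>
#include <climits>

// Hypothetical sketch of the platform-dependent section: unsigned
// integer types meant to hold 8, 16 and 32 bits. The underlying types
// are assumptions that hold on most 32 bit and 64 bit platforms.
namespace sketch {
    typedef unsigned char   uint8_t;
    typedef unsigned short  uint16_t;
    typedef unsigned int    uint32_t;
}
```

<p>
A quick way to verify the choice on a given platform is to compare <code>sizeof(T) * CHAR_BIT</code> with the expected width for each type.
</p>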
| 1519 | <h4> |
|---|
| 1520 | Alternatives |
|---|
| 1521 | </h4> |
|---|
| 1522 | <p> |
|---|
| 1523 | In case you want to look into other means of working with UTF-8 strings from C++, |
|---|
| 1524 | here is the list of solutions I am aware of: |
|---|
| 1525 | </p> |
|---|
| 1526 | <ol> |
|---|
| 1527 | <li> |
|---|
| 1528 | <a href="http://icu.sourceforge.net/">ICU Library</a>. It is very powerful, |
|---|
| 1529 | complete, feature-rich, mature, and widely used. Also big, intrusive, |
|---|
| 1530 | non-generic, and doesn't play well with the Standard Library. I definitely |
|---|
| 1531 | recommend looking at ICU even if you don't plan to use it. |
|---|
| 1532 | </li> |
|---|
| 1533 | <li> |
|---|
| 1534 | <a href= |
|---|
| 1535 | "http://www.gtkmm.org/gtkmm2/docs/tutorial/html/ch03s04.html">Glib::ustring</a>. |
|---|
| 1536 | A class specifically made to work with UTF-8 strings, and also feel like |
|---|
| 1537 | <code>std::string</code>. If you prefer to have yet another string class in your |
|---|
| 1538 | code, it may be worth a look. Be aware of the licensing issues, though. |
|---|
| 1539 | </li> |
|---|
| 1540 | <li> |
|---|
| 1541 | Platform dependent solutions: Windows and POSIX have functions to convert strings |
|---|
| 1542 | from one encoding to another. That is only a subset of what my library offers, |
|---|
| 1543 | but if that is all you need it may be good enough, especially given the fact that |
|---|
| 1544 | these functions are mature and tested in production. |
|---|
| 1545 | </li> |
|---|
| 1546 | </ol> |
|---|
| 1547 | <h2 id="conclusion"> |
|---|
| 1548 | Conclusion |
|---|
| 1549 | </h2> |
|---|
| 1550 | <p> |
|---|
| 1551 | Until Unicode becomes officially recognized by the C++ Standard Library, we need to |
|---|
| 1552 | use other means to work with UTF-8 strings. Template functions I describe in this |
|---|
| 1553 | article may be a good step in this direction. |
|---|
| 1554 | </p> |
|---|
| 1555 | <h2 id="links"> |
|---|
| 1556 | Links |
|---|
| 1557 | </h2> |
|---|
| 1558 | <ol> |
|---|
| 1559 | <li> |
|---|
| 1560 | <a href="http://www.unicode.org/">The Unicode Consortium</a>. |
|---|
| 1561 | </li> |
|---|
| 1562 | <li> |
|---|
| 1563 | <a href="http://icu.sourceforge.net/">ICU Library</a>. |
|---|
| 1564 | </li> |
|---|
| 1565 | <li> |
|---|
| 1566 | <a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8 at Wikipedia</a> |
|---|
| 1567 | </li> |
|---|
| 1568 | <li> |
|---|
| 1569 | <a href="http://www.cl.cam.ac.uk/~mgk25/unicode.html">UTF-8 and Unicode FAQ for |
|---|
| 1570 | Unix/Linux</a> |
|---|
| 1571 | </li> |
|---|
| 1572 | </ol> |
|---|
| 1573 | </body> |
|---|
| 1574 | </html> |
|---|