1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
---|
2 | <html> |
---|
3 | <head> |
---|
4 | <meta name="generator" content= |
---|
5 | "HTML Tidy for Linux/x86 (vers 1st November 2002), see www.w3.org"> |
---|
6 | <meta name="description" content= |
---|
7 | "A simple, portable and lightweight C++ library for easy handling of UTF-8 encoded strings"> |
---|
8 | <meta name="keywords" content="UTF-8 C++ portable utf8 unicode generic templates"> |
---|
9 | <meta name="author" content="Nemanja Trifunovic"> |
---|
10 | <title> |
---|
11 | UTF8-CPP: UTF-8 with C++ in a Portable Way |
---|
12 | </title> |
---|
13 | <style type="text/css"> |
---|
14 | <!-- |
---|
15 | span.return_value { |
---|
16 | color: brown; |
---|
17 | } |
---|
18 | span.keyword { |
---|
19 | color: blue; |
---|
20 | } |
---|
21 | span.preprocessor { |
---|
22 | color: navy; |
---|
23 | } |
---|
24 | span.literal { |
---|
25 | color: olive; |
---|
26 | } |
---|
27 | span.comment { |
---|
28 | color: green; |
---|
29 | } |
---|
30 | code { |
---|
31 | font-weight: bold; |
---|
32 | } |
---|
33 | ul.toc { |
---|
34 | list-style-type: none; |
---|
35 | } |
---|
36 | p.version { |
---|
37 | font-size: small; |
---|
38 | font-style: italic; |
---|
39 | } |
---|
40 | --> |
---|
41 | </style> |
---|
42 | </head> |
---|
43 | <body> |
---|
44 | <h1> |
---|
45 | UTF8-CPP: UTF-8 with C++ in a Portable Way |
---|
46 | </h1> |
---|
47 | <p> |
---|
48 | <a href="https://sourceforge.net/projects/utfcpp">The Sourceforge project page</a> |
---|
49 | </p> |
---|
50 | <div id="toc"> |
---|
51 | <h2> |
---|
52 | Table of Contents |
---|
53 | </h2> |
---|
54 | <ul class="toc"> |
---|
55 | <li> |
---|
56 | <a href="#introduction">Introduction</a> |
---|
57 | </li> |
---|
58 | <li> |
---|
59 | <a href="#examples">Examples of Use</a> |
---|
60 | </li> |
---|
61 | <li> |
---|
62 | <a href="#reference">Reference</a> |
---|
63 | <ul class="toc"> |
---|
64 | <li> |
---|
65 | <a href="#funutf8">Functions From utf8 Namespace </a> |
---|
66 | </li> |
---|
67 | <li> |
---|
68 | <a href="#typesutf8">Types From utf8 Namespace </a> |
---|
69 | </li> |
---|
70 | <li> |
---|
71 | <a href="#fununchecked">Functions From utf8::unchecked Namespace </a> |
---|
72 | </li> |
---|
73 | <li> |
---|
74 | <a href="#typesunchecked">Types From utf8::unchecked Namespace </a> |
---|
75 | </li> |
---|
76 | </ul> |
---|
77 | </li> |
---|
78 | <li> |
---|
79 | <a href="#points">Points of Interest</a> |
---|
80 | </li> |
---|
81 | <li> |
---|
82 | <a href="#conclusion">Conclusion</a> |
---|
83 | </li> |
---|
84 | <li> |
---|
85 | <a href="#links">Links</a> |
---|
86 | </li> |
---|
87 | </ul> |
---|
88 | </div> |
---|
89 | <h2 id="introduction"> |
---|
90 | Introduction |
---|
91 | </h2> |
---|
92 | <p> |
---|
93 | Many C++ developers miss an easy and portable way of handling Unicode encoded |
---|
94 | strings. The C++ Standard is currently Unicode agnostic, and while some work is being |
---|
95 | done to introduce Unicode to the next incarnation called C++0x, for the moment |
---|
96 | nothing of the sort is available. In the meantime, developers use 3rd party |
---|
97 | libraries like ICU, OS specific capabilities, or simply roll out their own |
---|
98 | solutions. |
---|
99 | </p> |
---|
100 | <p> |
---|
101 | In order to easily handle UTF-8 encoded Unicode strings, I have come up with a small |
---|
102 | generic library. For anybody used to working with STL algorithms and iterators, it should be |
---|
103 | easy and natural to use. The code is freely available for any purpose - check out |
---|
104 | the license at the beginning of the utf8.h file. If you run into |
---|
105 | bugs or performance issues, please let me know and I'll do my best to address them. |
---|
106 | </p> |
---|
107 | <p> |
---|
108 | The purpose of this article is not to offer an introduction to Unicode in general, |
---|
109 | and UTF-8 in particular. If you are not familiar with Unicode, be sure to check out |
---|
110 | <a href="http://www.unicode.org/">Unicode Home Page</a> or some other source of |
---|
111 | information for Unicode. Also, it is not my aim to advocate the use of UTF-8 |
---|
112 | encoded strings in C++ programs; if you want to handle UTF-8 encoded strings from |
---|
113 | C++, I am sure you have good reasons for it. |
---|
114 | </p> |
---|
115 | <h2 id="examples"> |
---|
116 | Examples of Use |
---|
117 | </h2> |
---|
118 | <p> |
---|
119 | To illustrate the use of this utf8 library, we shall open a file containing UTF-8 |
---|
120 | encoded text, check whether it starts with a byte order mark, read each line into a |
---|
121 | <code>std::string</code>, check it for validity, convert the text to UTF-16, and |
---|
122 | back to UTF-8: |
---|
123 | </p> |
---|
124 | <pre> |
---|
125 | <span class="preprocessor">#include <fstream></span> |
---|
126 | <span class="preprocessor">#include <iostream></span> |
---|
127 | <span class="preprocessor">#include <string></span> |
---|
128 | <span class="preprocessor">#include <vector></span> |
---|
129 | <span class="preprocessor">#include "utf8.h"</span> |
---|
130 | <span class="keyword">using namespace</span> std; |
---|
131 | <span class="keyword">int</span> main(<span class="keyword">int</span> argc, <span class="keyword">char</span>** argv) |
---|
132 | { |
---|
133 | <span class="keyword">if</span> (argc != <span class="literal">2</span>) { |
---|
134 | cout << <span class="literal">"\nUsage: docsample filename\n"</span>; |
---|
135 | <span class="keyword">return</span> <span class="literal">0</span>; |
---|
136 | } |
---|
137 | <span class="keyword">const char</span>* test_file_path = argv[1]; |
---|
138 | <span class="comment">// Open the test file (must be UTF-8 encoded)</span> |
---|
139 | ifstream fs8(test_file_path); |
---|
140 | <span class="keyword">if</span> (!fs8.is_open()) { |
---|
141 | cout << <span class= |
---|
142 | "literal">"Could not open "</span> << test_file_path << endl; |
---|
143 | <span class="keyword">return</span> <span class="literal">0</span>; |
---|
144 | } |
---|
145 | <span class="comment">// Read the first line of the file</span> |
---|
146 | <span class="keyword">unsigned</span> line_count = <span class="literal">1</span>; |
---|
147 | string line; |
---|
148 | <span class="keyword">if</span> (!getline(fs8, line)) |
---|
149 | <span class="keyword">return</span> <span class="literal">0</span>; |
---|
150 | <span class="comment">// Look for utf-8 byte-order mark at the beginning</span> |
---|
151 | <span class="keyword">if</span> (line.size() > <span class="literal">2</span>) { |
---|
152 | <span class="keyword">if</span> (utf8::is_bom(line.c_str())) |
---|
153 | cout << <span class= |
---|
154 | "literal">"There is a byte order mark at the beginning of the file\n"</span>; |
---|
155 | } |
---|
156 | <span class="comment">// Play with all the lines in the file</span> |
---|
157 | <span class="keyword">do</span> { |
---|
158 | <span class="comment">// check for invalid utf-8 (for a simple yes/no check, there is also utf8::is_valid function)</span> |
---|
159 | string::iterator end_it = utf8::find_invalid(line.begin(), line.end()); |
---|
160 | <span class="keyword">if</span> (end_it != line.end()) { |
---|
161 | cout << <span class= |
---|
162 | "literal">"Invalid UTF-8 encoding detected at line "</span> << line_count << <span |
---|
163 | class="literal">"\n"</span>; |
---|
164 | cout << <span class= |
---|
165 | "literal">"This part is fine: "</span> << string(line.begin(), end_it) << <span |
---|
166 | class="literal">"\n"</span>; |
---|
167 | } |
---|
168 | <span class="comment">// Get the line length (at least for the valid part)</span> |
---|
169 | <span class="keyword">int</span> length = utf8::distance(line.begin(), end_it); |
---|
170 | cout << <span class= |
---|
171 | "literal">"Length of line "</span> << line_count << <span class= |
---|
172 | "literal">" is "</span> << length << <span class="literal">"\n"</span>; |
---|
173 | <span class="comment">// Convert it to utf-16</span> |
---|
174 | vector<unsigned short> utf16line; |
---|
175 | utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line)); |
---|
176 | <span class="comment">// And back to utf-8</span> |
---|
177 | string utf8line; |
---|
178 | utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8line)); |
---|
179 | <span class="comment">// Confirm that the conversion went OK:</span> |
---|
180 | <span class="keyword">if</span> (utf8line != string(line.begin(), end_it)) |
---|
181 | cout << <span class= |
---|
182 | "literal">"Error in UTF-16 conversion at line: "</span> << line_count << <span |
---|
183 | class="literal">"\n"</span>; |
---|
184 | line_count++; |
---|
185 | <span class="comment">// Read the next line; the loop ends when no more lines can be read</span> |
---|
186 | } <span class="keyword">while</span> (getline(fs8, line)); |
---|
187 | <span class="keyword">return</span> <span class="literal">0</span>; |
---|
188 | } |
---|
189 | </pre> |
---|
190 | <p> |
---|
191 | The previous code sample uses the following functions from the |
---|
192 | <code>utf8</code> namespace: first, <code>is_bom</code> detects a |
---|
193 | UTF-8 byte order mark at the beginning of the file; then, for each line, |
---|
194 | <code>find_invalid</code> looks for invalid UTF-8 sequences; the number |
---|
195 | of characters (more precisely, the number of Unicode code points) in each line is |
---|
196 | determined with <code>utf8::distance</code>; finally, each line is converted |
---|
197 | to UTF-16 with <code>utf8to16</code> and back to UTF-8 with |
---|
198 | <code>utf16to8</code>. |
---|
199 | </p> |
---|
200 | <h2 id="reference"> |
---|
201 | Reference |
---|
202 | </h2> |
---|
203 | <h3 id="funutf8"> |
---|
204 | Functions From utf8 Namespace |
---|
205 | </h3> |
---|
206 | <h4> |
---|
207 | utf8::append |
---|
208 | </h4> |
---|
209 | <p class="version"> |
---|
210 | Available in version 1.0 and later. |
---|
211 | </p> |
---|
212 | <p> |
---|
213 | Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence |
---|
214 | to a UTF-8 string. |
---|
215 | </p> |
---|
216 | <pre> |
---|
217 | <span class="keyword">template</span> <<span class= |
---|
218 | "keyword">typename</span> octet_iterator> |
---|
219 | octet_iterator append(uint32_t cp, octet_iterator result); |
---|
220 | |
---|
221 | </pre> |
---|
222 | <p> |
---|
223 | <code>cp</code>: A 32 bit integer representing a code point to append to the |
---|
224 | sequence.<br> |
---|
225 | <code>result</code>: An output iterator to the place in the sequence where to |
---|
226 | append the code point.<br> |
---|
227 | <span class="return_value">Return value</span>: An iterator pointing to the place |
---|
228 | after the newly appended sequence. |
---|
229 | </p> |
---|
230 | <p> |
---|
231 | Example of use: |
---|
232 | </p> |
---|
233 | <pre> |
---|
234 | <span class="keyword">unsigned char</span> u[<span class="literal">5</span>] = {<span |
---|
235 | class="literal">0</span>,<span class="literal">0</span>,<span class= |
---|
236 | "literal">0</span>,<span class="literal">0</span>,<span class="literal">0</span>}; |
---|
237 | <span class="keyword">unsigned char</span>* end = append(<span class= |
---|
238 | "literal">0x0448</span>, u); |
---|
239 | assert (u[<span class="literal">0</span>] == <span class= |
---|
240 | "literal">0xd1</span> && u[<span class="literal">1</span>] == <span class= |
---|
241 | "literal">0x88</span> && u[<span class="literal">2</span>] == <span class= |
---|
242 | "literal">0</span> && u[<span class="literal">3</span>] == <span class= |
---|
243 | "literal">0</span> && u[<span class="literal">4</span>] == <span class= |
---|
244 | "literal">0</span>); |
---|
245 | </pre> |
---|
246 | <p> |
---|
247 | Note that <code>append</code> does not allocate any memory - it is the burden of |
---|
248 | the caller to make sure there is enough memory allocated for the operation. To make |
---|
249 | things more interesting, <code>append</code> can add anywhere between 1 and 4 |
---|
250 | octets to the sequence. In practice, you would most often want to use |
---|
251 | <code>std::back_inserter</code> to ensure that the necessary memory is allocated. |
---|
252 | </p> |
---|
253 | <p> |
---|
254 | In case of an invalid code point, a <code>utf8::invalid_code_point</code> exception |
---|
255 | is thrown. |
---|
256 | </p> |
---|
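<p>
To make the "between 1 and 4 octets" remark concrete, here is a minimal, self-contained sketch of the encoding step. It is not the code from utf8.h: the names <code>encode_code_point</code> and <code>encode_to_string</code> are hypothetical, and <code>std::invalid_argument</code> stands in for the library's own exception type.
</p>

```cpp
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of utf8.h): encodes one code point as UTF-8
// and writes the resulting octets through an output iterator.
template <typename OutputIt>
OutputIt encode_code_point(std::uint32_t cp, OutputIt out)
{
    // Surrogates and values above U+10FFFF are not valid code points.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("invalid code point");

    if (cp < 0x80) {                 // 1 octet:  0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 octets: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 octets: 11110xxx followed by three trail octets
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Convenience wrapper: std::back_inserter grows the string as needed,
// so the caller does not have to reserve 1 to 4 octets in advance.
inline std::string encode_to_string(std::uint32_t cp)
{
    std::string s;
    encode_code_point(cp, std::back_inserter(s));
    return s;
}
```

<p>
With an output iterator such as <code>std::back_inserter</code>, the destination grows by exactly as many octets as the code point needs, which is why that pattern is usually preferable to writing into a fixed-size buffer.
</p>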
257 | <h4> |
---|
258 | utf8::next |
---|
259 | </h4> |
---|
260 | <p class="version"> |
---|
261 | Available in version 1.0 and later. |
---|
262 | </p> |
---|
263 | <p> |
---|
264 | Given the iterator to the beginning of the UTF-8 sequence, it returns the code |
---|
265 | point and moves the iterator to the next position. |
---|
266 | </p> |
---|
267 | <pre> |
---|
268 | <span class="keyword">template</span> <<span class= |
---|
269 | "keyword">typename</span> octet_iterator> |
---|
270 | uint32_t next(octet_iterator& it, octet_iterator end); |
---|
271 | |
---|
272 | </pre> |
---|
273 | <p> |
---|
274 | <code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8 |
---|
275 | encoded code point. After the function returns, it is incremented to point to the |
---|
276 | beginning of the next code point.<br> |
---|
277 | <code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code> |
---|
278 | becomes equal to <code>end</code> during the extraction of a code point, a |
---|
279 | <code>utf8::not_enough_room</code> exception is thrown.<br> |
---|
280 | <span class="return_value">Return value</span>: the 32 bit representation of the |
---|
281 | processed UTF-8 code point. |
---|
282 | </p> |
---|
283 | <p> |
---|
284 | Example of use: |
---|
285 | </p> |
---|
286 | <pre> |
---|
287 | <span class="keyword">char</span>* twochars = <span class= |
---|
288 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
---|
289 | <span class="keyword">char</span>* w = twochars; |
---|
290 | <span class="keyword">int</span> cp = next(w, twochars + <span class="literal">6</span>); |
---|
291 | assert (cp == <span class="literal">0x65e5</span>); |
---|
292 | assert (w == twochars + <span class="literal">3</span>); |
---|
293 | </pre> |
---|
294 | <p> |
---|
295 | This function is typically used to iterate through a UTF-8 encoded string. |
---|
296 | </p> |
---|
297 | <p> |
---|
298 | In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is |
---|
299 | thrown. |
---|
300 | </p> |
---|
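<p>
The decoding step behind a function like <code>next</code> can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: <code>decode_next</code> is a hypothetical name, standard exceptions replace the <code>utf8</code> exception types, and the sketch does not reject overlong forms or encoded surrogates, which a full validator would.
</p>

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper (not the library's code): decodes the code point that
// starts at `it`, advances `it` past it, and returns the decoded value.
template <typename OctetIt>
std::uint32_t decode_next(OctetIt& it, OctetIt end)
{
    if (it == end)
        throw std::out_of_range("no octets left");

    // The lead octet tells us how many trail octets follow.
    const std::uint32_t lead = static_cast<unsigned char>(*it++);
    std::uint32_t cp;
    int trail_count;
    if (lead < 0x80)              { cp = lead;        trail_count = 0; }
    else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; trail_count = 1; }
    else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; trail_count = 2; }
    else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; trail_count = 3; }
    else throw std::runtime_error("invalid lead octet");

    // Each trail octet has the form 10xxxxxx and contributes six bits.
    for (int i = 0; i < trail_count; ++i) {
        if (it == end)
            throw std::out_of_range("sequence is truncated");
        const std::uint32_t trail = static_cast<unsigned char>(*it);
        if ((trail >> 6) != 0x02)
            throw std::runtime_error("invalid trail octet");
        cp = (cp << 6) | (trail & 0x3F);
        ++it;
    }
    return cp;
}
```

<p>
Calling such a function in a loop until the iterator reaches <code>end</code> visits every code point of the string in order.
</p>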
301 | <h4> |
---|
302 | utf8::peek_next |
---|
303 | </h4> |
---|
304 | <p class="version"> |
---|
305 | Available in version 2.1 and later. |
---|
306 | </p> |
---|
307 | <p> |
---|
308 | Given the iterator to the beginning of the UTF-8 sequence, it returns the code |
---|
309 | point for the following sequence without changing the value of the iterator. |
---|
310 | </p> |
---|
311 | <pre> |
---|
312 | <span class="keyword">template</span> <<span class= |
---|
313 | "keyword">typename</span> octet_iterator> |
---|
314 | uint32_t peek_next(octet_iterator it, octet_iterator end); |
---|
315 | |
---|
316 | </pre> |
---|
317 | <p> |
---|
318 | <code>it</code>: an iterator pointing to the beginning of a UTF-8 |
---|
319 | encoded code point.<br> |
---|
320 | <code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code> |
---|
321 | becomes equal to <code>end</code> during the extraction of a code point, a |
---|
322 | <code>utf8::not_enough_room</code> exception is thrown.<br> |
---|
323 | <span class="return_value">Return value</span>: the 32 bit representation of the |
---|
324 | processed UTF-8 code point. |
---|
325 | </p> |
---|
326 | <p> |
---|
327 | Example of use: |
---|
328 | </p> |
---|
329 | <pre> |
---|
330 | <span class="keyword">char</span>* twochars = <span class= |
---|
331 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
---|
332 | <span class="keyword">char</span>* w = twochars; |
---|
333 | <span class="keyword">int</span> cp = peek_next(w, twochars + <span class="literal">6</span>); |
---|
334 | assert (cp == <span class="literal">0x65e5</span>); |
---|
335 | assert (w == twochars); |
---|
336 | </pre> |
---|
337 | <p> |
---|
338 | In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is |
---|
339 | thrown. |
---|
340 | </p> |
---|
341 | <h4> |
---|
342 | utf8::prior |
---|
343 | </h4> |
---|
344 | <p class="version"> |
---|
345 | Available in version 1.02 and later. |
---|
346 | </p> |
---|
347 | <p> |
---|
348 | Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it |
---|
349 | decreases the iterator until it hits the beginning of the previous UTF-8 encoded |
---|
350 | code point and returns the 32 bit representation of the code point. |
---|
351 | </p> |
---|
352 | <pre> |
---|
353 | <span class="keyword">template</span> <<span class= |
---|
354 | "keyword">typename</span> octet_iterator> |
---|
355 | uint32_t prior(octet_iterator& it, octet_iterator start); |
---|
356 | |
---|
357 | </pre> |
---|
358 | <p> |
---|
359 | <code>it</code>: a reference to an iterator pointing to an octet within a UTF-8 encoded string. |
---|
360 | After the function returns, it is decremented to point to the beginning of the |
---|
361 | previous code point.<br> |
---|
362 | <code>start</code>: an iterator to the beginning of the sequence where the search |
---|
363 | for the beginning of a code point is performed. It is a |
---|
364 | safety measure to prevent passing the beginning of the string in the search for a |
---|
365 | UTF-8 lead octet.<br> |
---|
366 | <span class="return_value">Return value</span>: the 32 bit representation of the |
---|
367 | previous code point. |
---|
368 | </p> |
---|
369 | <p> |
---|
370 | Example of use: |
---|
371 | </p> |
---|
372 | <pre> |
---|
373 | <span class="keyword">char</span>* twochars = <span class= |
---|
374 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
---|
375 | <span class="keyword">char</span>* w = twochars + <span class= |
---|
376 | "literal">3</span>; |
---|
377 | <span class="keyword">int</span> cp = prior (w, twochars); |
---|
378 | assert (cp == <span class="literal">0x65e5</span>); |
---|
379 | assert (w == twochars); |
---|
380 | </pre> |
---|
381 | <p> |
---|
382 | This function has two purposes: one is to iterate backwards through a UTF-8 |
---|
383 | encoded string. Note that it is usually a better idea to iterate forward instead, |
---|
384 | since <code>utf8::next</code> is faster. The second purpose is to find a beginning |
---|
385 | of a UTF-8 sequence if we have a random position within a string. |
---|
386 | </p> |
---|
387 | <p> |
---|
388 | <code>it</code> will typically point to the beginning of |
---|
389 | a code point, and <code>start</code> will point to the |
---|
390 | beginning of the string to ensure we don't go backwards too far. <code>it</code> is |
---|
391 | decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence |
---|
392 | beginning with that octet is decoded to a 32 bit representation and returned. |
---|
393 | </p> |
---|
394 | <p> |
---|
395 | In case <code>start</code> is reached before a UTF-8 lead octet is hit, or if an |
---|
396 | invalid UTF-8 sequence is started by the lead octet, an <code>invalid_utf8</code> |
---|
397 | exception is thrown. |
---|
398 | </p> |
---|
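<p>
The backward search for a lead octet relies on the fact that UTF-8 trail octets always have the bit pattern <code>10xxxxxx</code>. The sketch below shows only that search; <code>is_trail_octet</code> and <code>step_back</code> are hypothetical names, standard exceptions are used in place of the library's, and the sequence found is not decoded or validated here.
</p>

```cpp
#include <stdexcept>

// True for octets of the form 10xxxxxx, which never start a code point.
inline bool is_trail_octet(char c)
{
    return (static_cast<unsigned char>(c) >> 6) == 0x02;
}

// Hypothetical helper (not part of utf8.h): returns an iterator to the lead
// octet of the code point that precedes `it`, never moving before `start`.
template <typename OctetIt>
OctetIt step_back(OctetIt it, OctetIt start)
{
    if (it == start)
        throw std::out_of_range("already at the beginning");
    --it;
    // Skip trail octets until a lead octet is found.
    while (is_trail_octet(*it)) {
        if (it == start)
            throw std::runtime_error("no lead octet found");
        --it;
    }
    return it;
}
```

<p>
Because the search stops at <code>start</code> itself rather than one position before it, the same approach works with iterators of standard containers.
</p>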
399 | <h4> |
---|
400 | utf8::previous |
---|
401 | </h4> |
---|
402 | <p class="version"> |
---|
403 | Deprecated in version 1.02 and later. |
---|
404 | </p> |
---|
405 | <p> |
---|
406 | Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it |
---|
407 | decreases the iterator until it hits the beginning of the previous UTF-8 encoded |
---|
408 | code point and returns the 32 bit representation of the code point. |
---|
409 | </p> |
---|
410 | <pre> |
---|
411 | <span class="keyword">template</span> <<span class= |
---|
412 | "keyword">typename</span> octet_iterator> |
---|
413 | uint32_t previous(octet_iterator& it, octet_iterator pass_start); |
---|
414 | |
---|
415 | </pre> |
---|
416 | <p> |
---|
417 | <code>it</code>: a reference to an iterator pointing to an octet within a UTF-8 encoded string. |
---|
418 | After the function returns, it is decremented to point to the beginning of the |
---|
419 | previous code point.<br> |
---|
420 | <code>pass_start</code>: an iterator to the point in the sequence where the search |
---|
421 | for the beginning of a code point is aborted if no result was reached. It is a |
---|
422 | safety measure to prevent passing the beginning of the string in the search for a |
---|
423 | UTF-8 lead octet.<br> |
---|
424 | <span class="return_value">Return value</span>: the 32 bit representation of the |
---|
425 | previous code point. |
---|
426 | </p> |
---|
427 | <p> |
---|
428 | Example of use: |
---|
429 | </p> |
---|
430 | <pre> |
---|
431 | <span class="keyword">char</span>* twochars = <span class= |
---|
432 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
---|
433 | <span class="keyword">char</span>* w = twochars + <span class= |
---|
434 | "literal">3</span>; |
---|
435 | <span class="keyword">int</span> cp = previous (w, twochars - <span class= |
---|
436 | "literal">1</span>); |
---|
437 | assert (cp == <span class="literal">0x65e5</span>); |
---|
438 | assert (w == twochars); |
---|
439 | </pre> |
---|
440 | <p> |
---|
441 | <code>utf8::previous</code> is deprecated, and <code>utf8::prior</code> should |
---|
442 | be used instead, although the existing code can continue using this function. |
---|
443 | The problem is the parameter <code>pass_start</code> that points to the position |
---|
444 | just before the beginning of the sequence. Standard containers don't have the |
---|
445 | concept of "pass start" and the function cannot be used with their iterators. |
---|
446 | </p> |
---|
447 | <p> |
---|
448 | <code>it</code> will typically point to the beginning of |
---|
449 | a code point, and <code>pass_start</code> will point to the octet just before the |
---|
450 | beginning of the string to ensure we don't go backwards too far. <code>it</code> is |
---|
451 | decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence |
---|
452 | beginning with that octet is decoded to a 32 bit representation and returned. |
---|
453 | </p> |
---|
454 | <p> |
---|
455 | In case <code>pass_start</code> is reached before a UTF-8 lead octet is hit, or if an |
---|
456 | invalid UTF-8 sequence is started by the lead octet, an <code>invalid_utf8</code> |
---|
457 | exception is thrown. |
---|
458 | </p> |
---|
459 | <h4> |
---|
460 | utf8::advance |
---|
461 | </h4> |
---|
462 | <p class="version"> |
---|
463 | Available in version 1.0 and later. |
---|
464 | </p> |
---|
465 | <p> |
---|
466 | Advances an iterator by the specified number of code points within a UTF-8 |
---|
467 | sequence. |
---|
468 | </p> |
---|
469 | <pre> |
---|
470 | <span class="keyword">template</span> <<span class= |
---|
471 | "keyword">typename</span> octet_iterator, typename distance_type> |
---|
472 | <span class= |
---|
473 | "keyword">void</span> advance (octet_iterator& it, distance_type n, octet_iterator end); |
---|
474 | |
---|
475 | </pre> |
---|
476 | <p> |
---|
477 | <code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8 |
---|
478 | encoded code point. After the function returns, it is incremented to point to the |
---|
479 | nth following code point.<br> |
---|
480 | <code>n</code>: a positive integer that shows how many code points we want to |
---|
481 | advance.<br> |
---|
482 | <code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code> |
---|
483 | becomes equal to <code>end</code> during the extraction of a code point, a |
---|
484 | <code>utf8::not_enough_room</code> exception is thrown.<br> |
---|
485 | </p> |
---|
486 | <p> |
---|
487 | Example of use: |
---|
488 | </p> |
---|
489 | <pre> |
---|
490 | <span class="keyword">char</span>* twochars = <span class= |
---|
491 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
---|
492 | <span class="keyword">char</span>* w = twochars; |
---|
493 | advance (w, <span class="literal">2</span>, twochars + <span class="literal">6</span>); |
---|
494 | assert (w == twochars + <span class="literal">5</span>); |
---|
495 | </pre> |
---|
496 | <p> |
---|
497 | This function works only "forward". In case of a negative <code>n</code>, there is |
---|
498 | no effect. |
---|
499 | </p> |
---|
500 | <p> |
---|
501 | In case of an invalid code point, a <code>utf8::invalid_code_point</code> exception |
---|
502 | is thrown. |
---|
503 | </p> |
---|
504 | <h4> |
---|
505 | utf8::distance |
---|
506 | </h4> |
---|
507 | <p class="version"> |
---|
508 | Available in version 1.0 and later. |
---|
509 | </p> |
---|
510 | <p> |
---|
511 | Given the iterators to two UTF-8 encoded code points in a sequence, returns the |
---|
512 | number of code points between them. |
---|
513 | </p> |
---|
514 | <pre> |
---|
515 | <span class="keyword">template</span> <<span class= |
---|
516 | "keyword">typename</span> octet_iterator> |
---|
517 | <span class= |
---|
518 | "keyword">typename</span> std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last); |
---|
519 | |
---|
520 | </pre> |
---|
521 | <p> |
---|
522 | <code>first</code>: an iterator to the beginning of a UTF-8 encoded code point.<br> |
---|
523 | <code>last</code>: an iterator to the position just past the end of the last UTF-8 encoded code |
---|
524 | point in the sequence whose length we are trying to determine. It can be the |
---|
525 | beginning of a new code point, or not.<br> |
---|
526 | <span class="return_value">Return value</span>: the distance between the iterators, |
---|
527 | in code points. |
---|
528 | </p> |
---|
529 | <p> |
---|
530 | Example of use: |
---|
531 | </p> |
---|
532 | <pre> |
---|
533 | <span class="keyword">char</span>* twochars = <span class= |
---|
534 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
---|
535 | size_t dist = utf8::distance(twochars, twochars + <span class="literal">5</span>); |
---|
536 | assert (dist == <span class="literal">2</span>); |
---|
537 | </pre> |
---|
538 | <p> |
---|
539 | This function is used to find the length (in code points) of a UTF-8 encoded |
---|
540 | string. The reason it is called <em>distance</em>, rather than, say, |
---|
541 | <em>length</em>, is mainly that developers expect <em>length</em> to be an |
---|
542 | O(1) function. Computing the length of a UTF-8 string is a linear operation, and |
---|
543 | it looked better to model it after the <code>std::distance</code> algorithm. |
---|
544 | </p> |
---|
545 | <p> |
---|
546 | In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is |
---|
547 | thrown. If <code>last</code> does not point past the end of a UTF-8 sequence, |
---|
548 | a <code>utf8::not_enough_room</code> exception is thrown. |
---|
549 | </p> |
---|
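<p>
To see why the length computation is linear, consider this minimal sketch. It assumes the input is already valid UTF-8 and simply counts every octet that is not a trail octet (<code>10xxxxxx</code>); the name <code>count_code_points</code> is hypothetical, and unlike <code>utf8::distance</code> the sketch performs no validation and throws no exceptions.
</p>

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of utf8.h): counts the code points in a
// string that is assumed to hold valid UTF-8. Every code point contributes
// exactly one octet that is not of the form 10xxxxxx, so counting those
// octets gives the length. The whole string must be visited: O(n).
inline std::size_t count_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it) {
        const unsigned char octet = static_cast<unsigned char>(*it);
        if ((octet >> 6) != 0x02)
            ++count;
    }
    return count;
}
```

<p>
For the two-character string used in the examples above, the five octets yield a count of 2, while <code>std::string::size</code> would report 5.
</p>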
550 | <h4> |
---|
551 | utf8::utf16to8 |
---|
552 | </h4> |
---|
553 | <p class="version"> |
---|
554 | Available in version 1.0 and later. |
---|
555 | </p> |
---|
556 | <p> |
---|
557 | Converts a UTF-16 encoded string to UTF-8. |
---|
558 | </p> |
---|
559 | <pre> |
---|
560 | <span class="keyword">template</span> <<span class= |
---|
561 | "keyword">typename</span> u16bit_iterator, <span class= |
---|
562 | "keyword">typename</span> octet_iterator> |
---|
563 | octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result); |
---|
564 | |
---|
565 | </pre> |
---|
566 | <p> |
---|
567 | <code>start</code>: an iterator pointing to the beginning of the UTF-16 encoded |
---|
568 | string to convert.<br> |
---|
569 | <code>end</code>: an iterator pointing past the end of the UTF-16 encoded |
---|
570 | string to convert.<br> |
---|
571 | <code>result</code>: an output iterator to the place in the UTF-8 string where to |
---|
572 | append the result of conversion.<br> |
---|
573 | <span class="return_value">Return value</span>: An iterator pointing to the place |
---|
574 | after the appended UTF-8 string. |
---|
575 | </p> |
---|
576 | <p> |
---|
577 | Example of use: |
---|
578 | </p> |
---|
579 | <pre> |
---|
580 | <span class="keyword">unsigned short</span> utf16string[] = {<span class= |
---|
581 | "literal">0x41</span>, <span class="literal">0x0448</span>, <span class= |
---|
582 | "literal">0x65e5</span>, <span class="literal">0xd834</span>, <span class= |
---|
583 | "literal">0xdd1e</span>}; |
---|
584 | vector<<span class="keyword">unsigned char</span>> utf8result; |
---|
585 | utf16to8(utf16string, utf16string + <span class= |
---|
586 | "literal">5</span>, back_inserter(utf8result)); |
---|
587 | assert (utf8result.size() == <span class="literal">10</span>); |
---|
588 | </pre> |
---|
589 | <p> |
---|
590 | In case of an invalid UTF-16 sequence, a <code>utf8::invalid_utf16</code> exception is |
---|
591 | thrown. |
---|
592 | </p> |
---|
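<p>
The last two values in the example above, <code>0xd834</code> and <code>0xdd1e</code>, form a surrogate pair that stands for a single code point, which is why five UTF-16 units produce only four code points and ten octets. The arithmetic for combining such a pair can be sketched as follows; the function names are hypothetical, and <code>std::invalid_argument</code> stands in for <code>utf8::invalid_utf16</code>.
</p>

```cpp
#include <cstdint>
#include <stdexcept>

// Lead (high) surrogates occupy 0xD800..0xDBFF, trail (low) surrogates 0xDC00..0xDFFF.
inline bool is_lead_surrogate(std::uint16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_trail_surrogate(std::uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper (not part of utf8.h): combines a UTF-16 surrogate pair
// into the code point it represents.
inline std::uint32_t combine_surrogates(std::uint16_t lead, std::uint16_t trail)
{
    if (!is_lead_surrogate(lead) || !is_trail_surrogate(trail))
        throw std::invalid_argument("not a surrogate pair");
    // Each surrogate carries ten bits; the result is offset by 0x10000.
    return 0x10000u
         + ((static_cast<std::uint32_t>(lead) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(trail) - 0xDC00u);
}
```

<p>
The resulting code point is then encoded as UTF-8 like any other, here as a four-octet sequence.
</p>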
593 | <h4> |
---|
594 | utf8::utf8to16 |
---|
595 | </h4> |
---|
596 | <p class="version"> |
---|
597 | Available in version 1.0 and later. |
---|
598 | </p> |
---|
599 | <p> |
---|
600 | Converts a UTF-8 encoded string to UTF-16. |
---|
601 | </p> |
---|
602 | <pre> |
---|
603 | <span class="keyword">template</span> <<span class= |
---|
604 | "keyword">typename</span> u16bit_iterator, typename octet_iterator> |
---|
605 | u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result); |
---|
606 | |
---|
607 | </pre> |
---|
608 | <p> |
---|
609 | <code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded |
---|
610 | string to convert.<br> <code>end</code>: an iterator pointing |
---|
611 | past the end of the UTF-8 encoded string to convert.<br> |
---|
612 | <code>result</code>: an output iterator to the place in the UTF-16 string where to |
---|
613 | append the result of conversion.<br> |
---|
614 | <span class="return_value">Return value</span>: An iterator pointing to the place |
---|
615 | after the appended UTF-16 string. |
---|
616 | </p> |
---|
617 | <p> |
---|
618 | Example of use: |
---|
619 | </p> |
---|
620 | <pre> |
---|
621 | <span class="keyword">char</span> utf8_with_surrogates[] = <span class= |
---|
622 | "literal">"\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"</span>; |
---|
623 | vector <<span class="keyword">unsigned short</span>> utf16result; |
---|
624 | utf8to16(utf8_with_surrogates, utf8_with_surrogates + <span class= |
---|
625 | "literal">9</span>, back_inserter(utf16result)); |
---|
626 | assert (utf16result.size() == <span class="literal">4</span>); |
---|
627 | assert (utf16result[<span class="literal">2</span>] == <span class= |
---|
628 | "literal">0xd834</span>); |
---|
629 | assert (utf16result[<span class="literal">3</span>] == <span class= |
---|
630 | "literal">0xdd1e</span>); |
---|
631 | </pre> |
---|
632 | <p> |
---|
633 | In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is |
---|
634 | thrown. If <code>end</code> does not point past the end of a UTF-8 sequence, a |
---|
635 | <code>utf8::not_enough_room</code> exception is thrown. |
---|
636 | </p> |
---|
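<p>
In the example above, the four-octet sequence <code>\xf0\x9d\x84\x9e</code> becomes the two UTF-16 units <code>0xd834</code> and <code>0xdd1e</code>. The split into a surrogate pair can be sketched like this; <code>append_utf16</code> is a hypothetical name, and the sketch assumes the code point passed in is already valid.
</p>

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of utf8.h): appends the UTF-16 form of one
// code point to `out`. The code point is assumed to be valid, that is, not a
// surrogate and not above U+10FFFF.
inline void append_utf16(std::uint32_t cp, std::vector<std::uint16_t>& out)
{
    if (cp > 0xFFFF) {
        // Code points beyond the Basic Multilingual Plane need two units:
        // subtract 0x10000 and split the remaining 20 bits into two halves.
        const std::uint32_t v = cp - 0x10000u;
        out.push_back(static_cast<std::uint16_t>(0xD800u + (v >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00u + (v & 0x3FFu)));
    } else {
        out.push_back(static_cast<std::uint16_t>(cp));
    }
}
```

<p>
This is why the length of the UTF-16 result, counted in 16 bit units, can be larger than the number of code points in the source string.
</p>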
637 | <h4> |
---|
638 | utf8::utf32to8 |
---|
639 | </h4> |
---|
640 | <p class="version"> |
---|
641 | Available in version 1.0 and later. |
---|
642 | </p> |
---|
643 | <p> |
---|
644 | Converts a UTF-32 encoded string to UTF-8. |
---|
645 | </p> |
---|
646 | <pre> |
---|
647 | <span class="keyword">template</span> <<span class= |
---|
648 | "keyword">typename</span> octet_iterator, typename u32bit_iterator> |
---|
649 | octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result); |
---|
650 | |
---|
651 | </pre> |
---|
652 | <p> |
---|
653 | <code>start</code>: an iterator pointing to the beginning of the UTF-32 encoded |
---|
654 | string to convert.<br> |
---|
655 | <code>end</code>: an iterator pointing past the end of the UTF-32 encoded |
---|
656 | string to convert.<br> |
---|
657 | <code>result</code>: an output iterator to the place in the UTF-8 string where to |
---|
658 | append the result of conversion.<br> |
---|
659 | <span class="return_value">Return value</span>: An iterator pointing to the place |
---|
660 | after the appended UTF-8 string. |
---|
661 | </p> |
---|
662 | <p> |
---|
663 | Example of use: |
---|
664 | </p> |
---|
665 | <pre> |
---|
666 | <span class="keyword">int</span> utf32string[] = {<span class= |
---|
667 | "literal">0x448</span>, <span class="literal">0x65E5</span>, <span class= |
---|
668 | "literal">0x10346</span>, <span class="literal">0</span>}; |
---|
669 | vector<<span class="keyword">unsigned char</span>> utf8result; |
---|
670 | utf32to8(utf32string, utf32string + <span class= |
---|
671 | "literal">3</span>, back_inserter(utf8result)); |
---|
672 | assert (utf8result.size() == <span class="literal">9</span>); |
---|
673 | </pre> |
---|
674 | <p> |
---|
675 | In case of an invalid UTF-32 string, a <code>utf8::invalid_code_point</code> exception |
---|
676 | is thrown. |
---|
677 | </p> |
---|
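<p>
The size of 9 asserted in the example follows from the number of octets UTF-8 needs for each code point (2 + 3 + 4). A minimal sketch of that calculation, independent of the library, might look as follows. The names <code>utf8_length</code> and <code>utf8_total_length</code> are hypothetical and used only for illustration.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of the library): number of octets UTF-8
// needs to encode a single code point.
std::size_t utf8_length(std::uint32_t cp)
{
    assert(cp <= 0x10ffff);       // precondition: within the Unicode range
    if (cp < 0x80)    return 1;   // U+0000..U+007F
    if (cp < 0x800)   return 2;   // U+0080..U+07FF
    if (cp < 0x10000) return 3;   // U+0800..U+FFFF
    return 4;                     // U+10000..U+10FFFF
}

// Hypothetical helper: sums the encoded lengths over a range of code points.
template <typename u32bit_iterator>
std::size_t utf8_total_length(u32bit_iterator first, u32bit_iterator last)
{
    std::size_t total = 0;
    for (; first != last; ++first)
        total += utf8_length(static_cast<std::uint32_t>(*first));
    return total;
}
```

<p>
Applied to the first three elements of <code>utf32string</code> from the example, <code>utf8_total_length</code> returns 9. Such a count can be useful for reserving space in the output container before converting.
</p>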
678 | <h4> |
---|
679 | utf8::utf8to32 |
---|
680 | </h4> |
---|
681 | <p class="version"> |
---|
682 | Available in version 1.0 and later. |
---|
683 | </p> |
---|
684 | <p> |
---|
685 | Converts a UTF-8 encoded string to UTF-32. |
---|
686 | </p> |
---|
687 | <pre> |
---|
688 | <span class="keyword">template</span> <<span class= |
---|
689 | "keyword">typename</span> octet_iterator, <span class= |
---|
690 | "keyword">typename</span> u32bit_iterator> |
---|
691 | u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result); |
---|
692 | |
---|
693 | </pre> |
---|
694 | <p> |
---|
695 | <code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded |
---|
696 | string to convert.<br> |
---|
697 | <code>end</code>: an iterator pointing to past-the-end of the UTF-8 encoded string |
---|
698 | to convert.<br> |
---|
699 | <code>result</code>: an output iterator to the place in the UTF-32 string where to |
---|
700 | append the result of conversion.<br> |
---|
701 | <span class="return_value">Return value</span>: An iterator pointing to the place |
---|
702 | after the appended UTF-32 string. |
---|
703 | </p> |
---|
704 | <p> |
---|
705 | Example of use: |
---|
706 | </p> |
---|
707 | <pre> |
---|
708 | <span class="keyword">char</span>* twochars = <span class= |
---|
709 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
---|
710 | vector<<span class="keyword">int</span>> utf32result; |
---|
711 | utf8to32(twochars, twochars + <span class= |
---|
712 | "literal">5</span>, back_inserter(utf32result)); |
---|
713 | assert (utf32result.size() == <span class="literal">2</span>); |
---|
714 | </pre> |
---|
715 | <p> |
---|
716 | In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is |
---|
717 | thrown. If <code>end</code> does not point one past the end of a UTF-8 sequence, a |
---|
718 | <code>utf8::not_enough_room</code> exception is thrown. |
---|
719 | </p> |
---|
720 | <h4> |
---|
721 | utf8::find_invalid |
---|
722 | </h4> |
---|
723 | <p class="version"> |
---|
724 | Available in version 1.0 and later. |
---|
725 | </p> |
---|
726 | <p> |
---|
727 | Detects an invalid sequence within a UTF-8 string. |
---|
728 | </p> |
---|
729 | <pre> |
---|
730 | <span class="keyword">template</span> <<span class= |
---|
731 | "keyword">typename</span> octet_iterator> |
---|
732 | octet_iterator find_invalid(octet_iterator start, octet_iterator end); |
---|
733 | </pre> |
---|
734 | <p> |
---|
735 | <code>start</code>: an iterator pointing to the beginning of the UTF-8 string to |
---|
736 | test for validity.<br> |
---|
737 | <code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to test |
---|
738 | for validity.<br> |
---|
739 | <span class="return_value">Return value</span>: an iterator pointing to the first |
---|
740 | invalid octet in the UTF-8 string. In case none were found, equals |
---|
741 | <code>end</code>. |
---|
742 | </p> |
---|
743 | <p> |
---|
744 | Example of use: |
---|
745 | </p> |
---|
746 | <pre> |
---|
747 | <span class="keyword">char</span> utf_invalid[] = <span class= |
---|
748 | "literal">"\xe6\x97\xa5\xd1\x88\xfa"</span>; |
---|
749 | <span class= |
---|
750 | "keyword">char</span>* invalid = find_invalid(utf_invalid, utf_invalid + <span class= |
---|
751 | "literal">6</span>); |
---|
752 | assert (invalid == utf_invalid + <span class="literal">5</span>); |
---|
753 | </pre> |
---|
754 | <p> |
---|
755 | This function is typically used to make sure a UTF-8 string is valid before |
---|
756 | processing it with other functions. It is especially important to call it before |
---|
757 | doing any of the <em>unchecked</em> operations on it. |
---|
758 | </p> |
---|
759 | <h4> |
---|
760 | utf8::is_valid |
---|
761 | </h4> |
---|
762 | <p class="version"> |
---|
763 | Available in version 1.0 and later. |
---|
764 | </p> |
---|
765 | <p> |
---|
766 | Checks whether a sequence of octets is a valid UTF-8 string. |
---|
767 | </p> |
---|
768 | <pre> |
---|
769 | <span class="keyword">template</span> <<span class= |
---|
770 | "keyword">typename</span> octet_iterator> |
---|
771 | <span class="keyword">bool</span> is_valid(octet_iterator start, octet_iterator end); |
---|
772 | |
---|
773 | </pre> |
---|
774 | <p> |
---|
775 | <code>start</code>: an iterator pointing to the beginning of the UTF-8 string to |
---|
776 | test for validity.<br> |
---|
777 | <code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to test |
---|
778 | for validity.<br> |
---|
779 | <span class="return_value">Return value</span>: <code>true</code> if the sequence |
---|
780 | is a valid UTF-8 string; <code>false</code> if not. |
---|
781 | </p> |
---|
782 | <p>Example of use:</p> |
---|
783 | <pre> |
---|
784 | <span class="keyword">char</span> utf_invalid[] = <span class= |
---|
785 | "literal">"\xe6\x97\xa5\xd1\x88\xfa"</span>; |
---|
786 | <span class="keyword">bool</span> bvalid = is_valid(utf_invalid, utf_invalid + <span |
---|
787 | class="literal">6</span>); |
---|
788 | assert (bvalid == false); |
---|
789 | </pre> |
---|
790 | <p> |
---|
791 | <code>is_valid</code> is a shorthand for <code>find_invalid(start, end) == |
---|
792 | end</code>. You may want to use it to make sure that a byte sequence is a valid |
---|
793 | UTF-8 string without the need to know where it fails if it is not valid. |
---|
794 | </p> |
---|
795 | <h4> |
---|
796 | utf8::replace_invalid |
---|
797 | </h4> |
---|
798 | <p class="version"> |
---|
799 | Available in version 2.0 and later. |
---|
800 | </p> |
---|
801 | <p> |
---|
802 | Replaces all invalid UTF-8 sequences within a string with a replacement marker. |
---|
803 | </p> |
---|
804 | <pre> |
---|
805 | <span class="keyword">template</span> <<span class= |
---|
806 | "keyword">typename</span> octet_iterator, <span class= |
---|
807 | "keyword">typename</span> output_iterator> |
---|
808 | output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, uint32_t replacement); |
---|
809 | <span class="keyword">template</span> <<span class= |
---|
810 | "keyword">typename</span> octet_iterator, <span class= |
---|
811 | "keyword">typename</span> output_iterator> |
---|
812 | output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out); |
---|
813 | |
---|
814 | </pre> |
---|
815 | <p> |
---|
816 | <code>start</code>: an iterator pointing to the beginning of the UTF-8 string to |
---|
817 | look for invalid UTF-8 sequences.<br> |
---|
818 | <code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to look |
---|
819 | for invalid UTF-8 sequences.<br> |
---|
820 | <code>out</code>: An output iterator to the range where the result of replacement |
---|
821 | is stored.<br> |
---|
822 | <code>replacement</code>: A Unicode code point for the replacement marker. The |
---|
823 | version without this parameter assumes the value <code>0xfffd</code>.<br> |
---|
824 | <span class="return_value">Return value</span>: An iterator pointing to the place |
---|
825 | after the UTF-8 string with replaced invalid sequences. |
---|
826 | </p> |
---|
827 | <p> |
---|
828 | Example of use: |
---|
829 | </p> |
---|
830 | <pre> |
---|
831 | <span class="keyword">char</span> invalid_sequence[] = <span class= |
---|
832 | "literal">"a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z"</span>; |
---|
833 | vector<<span class="keyword">char</span>> replace_invalid_result; |
---|
834 | replace_invalid (invalid_sequence, invalid_sequence + sizeof(invalid_sequence), back_inserter(replace_invalid_result), <span |
---|
835 | class="literal">'?'</span>); |
---|
836 | bvalid = is_valid(replace_invalid_result.begin(), replace_invalid_result.end()); |
---|
837 | assert (bvalid); |
---|
838 | <span class="keyword">char</span>* fixed_invalid_sequence = <span class= |
---|
839 | "literal">"a????z"</span>; |
---|
840 | assert (std::equal(replace_invalid_result.begin(), replace_invalid_result.end(), fixed_invalid_sequence)); |
---|
841 | </pre> |
---|
842 | <p> |
---|
843 | <code>replace_invalid</code> does not perform in-place replacement of invalid |
---|
844 | sequences. Rather, it produces a copy of the original string with the invalid |
---|
845 | sequences replaced with a replacement marker. Therefore, <code>out</code> must not |
---|
846 | be in the <code>[start, end]</code> range. |
---|
847 | </p> |
---|
848 | <p> |
---|
849 | If <code>end</code> does not point one past the end of a UTF-8 sequence, a |
---|
850 | <code>utf8::not_enough_room</code> exception is thrown. |
---|
851 | </p> |
---|
852 | <h4> |
---|
853 | utf8::is_bom |
---|
854 | </h4> |
---|
855 | <p class="version"> |
---|
856 | Available in version 1.0 and later. |
---|
857 | </p> |
---|
858 | <p> |
---|
859 | Checks whether a sequence of three octets is a UTF-8 byte order mark (BOM). |
---|
860 | </p> |
---|
861 | <pre> |
---|
862 | <span class="keyword">template</span> <<span class= |
---|
863 | "keyword">typename</span> octet_iterator> |
---|
864 | <span class="keyword">bool</span> is_bom (octet_iterator it); |
---|
865 | </pre> |
---|
866 | <p> |
---|
867 | <code>it</code>: beginning of the 3-octet sequence to check<br> |
---|
868 | <span class="return_value">Return value</span>: <code>true</code> if the sequence |
---|
869 | is the UTF-8 byte order mark; <code>false</code> if not. |
---|
870 | </p> |
---|
871 | <p> |
---|
872 | Example of use: |
---|
873 | </p> |
---|
874 | <pre> |
---|
875 | <span class="keyword">unsigned char</span> byte_order_mark[] = {<span class= |
---|
876 | "literal">0xef</span>, <span class="literal">0xbb</span>, <span class= |
---|
877 | "literal">0xbf</span>}; |
---|
878 | <span class="keyword">bool</span> bbom = is_bom(byte_order_mark); |
---|
879 | assert (bbom == <span class="literal">true</span>); |
---|
880 | </pre> |
---|
881 | <p> |
---|
882 | The typical use of this function is to check the first three bytes of a file. If |
---|
883 | they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8 |
---|
884 | encoded text. |
---|
885 | </p> |
---|
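<p>
Because <code>is_bom</code> as documented here takes only a start iterator, the caller has to make sure at least three octets are available before calling it. The following library-independent sketch shows one way the "check and skip" step could be written for a buffer held in a <code>std::string</code>. The helper name <code>bom_offset</code> is an assumption made for illustration.
</p>

```cpp
#include <cassert>   // for the assertions mentioned in the usage note
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the library): returns the offset of the
// first octet after a leading UTF-8 BOM (EF BB BF), or 0 if there is none.
std::size_t bom_offset(const std::string& text)
{
    // Check the length first so that no octet past the end is read.
    const bool has_bom = text.size() >= 3 &&
        static_cast<unsigned char>(text[0]) == 0xef &&
        static_cast<unsigned char>(text[1]) == 0xbb &&
        static_cast<unsigned char>(text[2]) == 0xbf;
    return has_bom ? 3 : 0;
}
```

<p>
Processing of the actual text can then start at <code>text.begin() + bom_offset(text)</code>. For a buffer shorter than three octets, or one that does not start with the BOM, the offset is 0.
</p>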
886 | <h3 id="typesutf8"> |
---|
887 | Types From utf8 Namespace |
---|
888 | </h3> |
---|
889 | <h4> |
---|
890 | utf8::iterator |
---|
891 | </h4> |
---|
892 | <p class="version"> |
---|
893 | Available in version 2.0 and later. |
---|
894 | </p> |
---|
895 | <p> |
---|
896 | Adapts the underlying octet iterator to iterate over the sequence of code points, |
---|
897 | rather than raw octets. |
---|
898 | </p> |
---|
899 | <pre> |
---|
900 | <span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator> |
---|
901 | <span class="keyword">class</span> iterator; |
---|
902 | </pre> |
---|
903 | |
---|
904 | <h5>Member functions</h5> |
---|
905 | <dl> |
---|
906 | <dt><code>iterator();</code> <dd> the default constructor; the underlying <code>octet_iterator</code> is |
---|
907 | constructed with its default constructor. |
---|
908 | <dt><code><span class="keyword">explicit</span> iterator (const octet_iterator& octet_it, |
---|
909 | const octet_iterator& range_start, |
---|
910 | const octet_iterator& range_end);</code> <dd> a constructor |
---|
911 | that initializes the underlying <code>octet_iterator</code> with <code>octet_it</code> |
---|
912 | and sets the range in which the iterator is considered valid. |
---|
913 | <dt><code>octet_iterator base () <span class="keyword">const</span>;</code> <dd> returns the |
---|
914 | underlying <code>octet_iterator</code>. |
---|
915 | <dt><code>uint32_t operator * () <span class="keyword">const</span>;</code> <dd> decodes the UTF-8 sequence |
---|
916 | the underlying <code>octet_iterator</code> is pointing to and returns the code point. |
---|
917 | <dt><code><span class="keyword">bool operator</span> == (const iterator& rhs) |
---|
918 | <span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span> |
---|
919 | if the two underlying iterators are equal. |
---|
920 | <dt><code><span class="keyword">bool operator</span> != (const iterator& rhs) |
---|
921 | <span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span> |
---|
922 | if the two underlying iterators are not equal. |
---|
923 | <dt><code>iterator& <span class="keyword">operator</span> ++ (); </code> <dd> the prefix increment - moves |
---|
924 | the iterator to the next UTF-8 encoded code point. |
---|
925 | <dt><code>iterator <span class="keyword">operator</span> ++ (<span class="keyword">int</span>); </code> <dd> |
---|
926 | the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one. |
---|
927 | <dt><code>iterator& <span class="keyword">operator</span> -- (); </code> <dd> the prefix decrement - moves |
---|
928 | the iterator to the previous UTF-8 encoded code point. |
---|
929 | <dt><code>iterator <span class="keyword">operator</span> -- (<span class="keyword">int</span>); </code> <dd> |
---|
930 | the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one. |
---|
931 | </dl> |
---|
932 | <p> |
---|
933 | Example of use: |
---|
934 | </p> |
---|
935 | <pre> |
---|
936 | <span class="keyword">char</span>* threechars = <span class="literal">"\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88"</span>; |
---|
937 | utf8::iterator<<span class="keyword">char</span>*> it(threechars, threechars, threechars + <span class="literal">9</span>); |
---|
938 | utf8::iterator<<span class="keyword">char</span>*> it2 = it; |
---|
939 | assert (it2 == it); |
---|
940 | assert (*it == <span class="literal">0x10346</span>); |
---|
941 | assert (*(++it) == <span class="literal">0x65e5</span>); |
---|
942 | assert ((*it++) == <span class="literal">0x65e5</span>); |
---|
943 | assert (*it == <span class="literal">0x0448</span>); |
---|
944 | assert (it != it2); |
---|
945 | utf8::iterator<<span class="keyword">char</span>*> endit (threechars + <span class="literal">9</span>, threechars, threechars + <span class="literal">9</span>); |
---|
946 | assert (++it == endit); |
---|
947 | assert (*(--it) == <span class="literal">0x0448</span>); |
---|
948 | assert ((*it--) == <span class="literal">0x0448</span>); |
---|
949 | assert (*it == <span class="literal">0x65e5</span>); |
---|
950 | assert (--it == utf8::iterator<<span class="keyword">char</span>*>(threechars, threechars, threechars + <span class="literal">9</span>)); |
---|
951 | assert (*it == <span class="literal">0x10346</span>); |
---|
952 | </pre> |
---|
953 | <p> |
---|
954 | The purpose of the <code>utf8::iterator</code> adapter is to enable easy iteration as well as the use of STL |
---|
955 | algorithms with UTF-8 encoded strings. Increment and decrement operators are implemented in terms of |
---|
956 | <code>utf8::next()</code> and <code>utf8::prior()</code> functions. |
---|
957 | </p> |
---|
958 | <p> |
---|
959 | Note that the <code>utf8::iterator</code> adapter is a checked iterator. It operates on the range specified in |
---|
960 | the constructor; any attempt to go out of that range will result in an exception. Even the comparison operators |
---|
961 | require both iterator objects to be constructed against the same range - otherwise an exception is thrown. Typically, |
---|
962 | the range will be determined by sequence container functions <code>begin</code> and <code>end</code>, i.e.: |
---|
963 | </p> |
---|
964 | <pre> |
---|
965 | std::string s = <span class="literal">"example"</span>; |
---|
966 | utf8::iterator<std::string::iterator> i (s.begin(), s.begin(), s.end()); |
---|
967 | </pre> |
---|
968 | <h3 id="fununchecked"> |
---|
969 | Functions From utf8::unchecked Namespace |
---|
970 | </h3> |
---|
971 | <h4> |
---|
972 | utf8::unchecked::append |
---|
973 | </h4> |
---|
974 | <p class="version"> |
---|
975 | Available in version 1.0 and later. |
---|
976 | </p> |
---|
977 | <p> |
---|
978 | Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence |
---|
979 | to a UTF-8 string. |
---|
980 | </p> |
---|
981 | <pre> |
---|
982 | <span class="keyword">template</span> <<span class= |
---|
983 | "keyword">typename</span> octet_iterator> |
---|
984 | octet_iterator append(uint32_t cp, octet_iterator result); |
---|
985 | |
---|
986 | </pre> |
---|
987 | <p> |
---|
988 | <code>cp</code>: A 32 bit integer representing a code point to append to the |
---|
989 | sequence.<br> |
---|
990 | <code>result</code>: An output iterator to the place in the sequence where to |
---|
991 | append the code point.<br> |
---|
992 | <span class="return_value">Return value</span>: An iterator pointing to the place |
---|
993 | after the newly appended sequence. |
---|
994 | </p> |
---|
995 | <p> |
---|
996 | Example of use: |
---|
997 | </p> |
---|
998 | <pre> |
---|
999 | <span class="keyword">unsigned char</span> u[<span class="literal">5</span>] = {<span |
---|
1000 | class="literal">0</span>,<span class="literal">0</span>,<span class= |
---|
1001 | "literal">0</span>,<span class="literal">0</span>,<span class="literal">0</span>}; |
---|
1002 | <span class="keyword">unsigned char</span>* end = unchecked::append(<span class= |
---|
1003 | "literal">0x0448</span>, u); |
---|
1004 | assert (u[<span class="literal">0</span>] == <span class= |
---|
1005 | "literal">0xd1</span> && u[<span class="literal">1</span>] == <span class= |
---|
1006 | "literal">0x88</span> && u[<span class="literal">2</span>] == <span class= |
---|
1007 | "literal">0</span> && u[<span class="literal">3</span>] == <span class= |
---|
1008 | "literal">0</span> && u[<span class="literal">4</span>] == <span class= |
---|
1009 | "literal">0</span>); |
---|
1010 | </pre> |
---|
1011 | <p> |
---|
1012 | This is a faster but less safe version of <code>utf8::append</code>. It does not |
---|
1013 | check for validity of the supplied code point, and may produce an invalid UTF-8 |
---|
1014 | sequence. |
---|
1015 | </p> |
---|
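<p>
Since <code>unchecked::append</code> does not validate the code point, a caller who cannot otherwise guarantee valid input may want to test it first. The following sketch shows such a test under the assumption that "valid" means a Unicode scalar value (within range and not a surrogate). The checked <code>utf8::append</code> may apply different or stricter rules. The name <code>is_scalar_value</code> is hypothetical.
</p>

```cpp
#include <cassert>   // for the assertions mentioned in the usage note
#include <cstdint>

// Hypothetical helper (not part of the library): true if cp is a Unicode
// scalar value, i.e. at most U+10FFFF and outside the surrogate range.
bool is_scalar_value(std::uint32_t cp)
{
    const bool in_range  = cp <= 0x10ffff;
    const bool surrogate = cp >= 0xd800 && cp <= 0xdfff;
    return in_range && !surrogate;
}
```

<p>
With this definition, <code>0x0448</code> passes the test, while the surrogate <code>0xd834</code> and the out-of-range value <code>0x110000</code> do not.
</p>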
1016 | <h4> |
---|
1017 | utf8::unchecked::next |
---|
1018 | </h4> |
---|
1019 | <p class="version"> |
---|
1020 | Available in version 1.0 and later. |
---|
1021 | </p> |
---|
1022 | <p> |
---|
1023 | Given the iterator to the beginning of a UTF-8 sequence, it returns the code point |
---|
1024 | and moves the iterator to the next position. |
---|
1025 | </p> |
---|
1026 | <pre> |
---|
1027 | <span class="keyword">template</span> <<span class= |
---|
1028 | "keyword">typename</span> octet_iterator> |
---|
1029 | uint32_t next(octet_iterator& it); |
---|
1030 | |
---|
1031 | </pre> |
---|
1032 | <p> |
---|
1033 | <code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8 |
---|
1034 | encoded code point. After the function returns, it is incremented to point to the |
---|
1035 | beginning of the next code point.<br> |
---|
1036 | <span class="return_value">Return value</span>: the 32 bit representation of the |
---|
1037 | processed UTF-8 code point. |
---|
1038 | </p> |
---|
1039 | <p> |
---|
1040 | Example of use: |
---|
1041 | </p> |
---|
1042 | <pre> |
---|
1043 | <span class="keyword">char</span>* twochars = <span class= |
---|
1044 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
---|
1045 | <span class="keyword">char</span>* w = twochars; |
---|
1046 | <span class="keyword">int</span> cp = unchecked::next(w); |
---|
1047 | assert (cp == <span class="literal">0x65e5</span>); |
---|
1048 | assert (w == twochars + <span class="literal">3</span>); |
---|
1049 | </pre> |
---|
1050 | <p> |
---|
1051 | This is a faster but less safe version of <code>utf8::next</code>. It does not |
---|
1052 | check for validity of the supplied UTF-8 sequence. |
---|
1053 | </p> |
---|
1054 | <h4> |
---|
1055 | utf8::unchecked::peek_next |
---|
1056 | </h4> |
---|
1057 | <p class="version"> |
---|
1058 | Available in version 2.1 and later. |
---|
1059 | </p> |
---|
1060 | <p> |
---|
1061 | Given the iterator to the beginning of a UTF-8 sequence, it returns the code point. |
---|
1062 | </p> |
---|
1063 | <pre> |
---|
1064 | <span class="keyword">template</span> <<span class= |
---|
1065 | "keyword">typename</span> octet_iterator> |
---|
1066 | uint32_t peek_next(octet_iterator it); |
---|
1067 | |
---|
1068 | </pre> |
---|
1069 | <p> |
---|
1070 | <code>it</code>: an iterator pointing to the beginning of a UTF-8 |
---|
1071 | encoded code point.<br> |
---|
1072 | <span class="return_value">Return value</span>: the 32 bit representation of the |
---|
1073 | processed UTF-8 code point. |
---|
1074 | </p> |
---|
1075 | <p> |
---|
1076 | Example of use: |
---|
1077 | </p> |
---|
1078 | <pre> |
---|
1079 | <span class="keyword">char</span>* twochars = <span class= |
---|
1080 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
---|
1081 | <span class="keyword">char</span>* w = twochars; |
---|
1082 | <span class="keyword">int</span> cp = unchecked::peek_next(w); |
---|
1083 | assert (cp == <span class="literal">0x65e5</span>); |
---|
1084 | assert (w == twochars); |
---|
1085 | </pre> |
---|
1086 | <p> |
---|
1087 | This is a faster but less safe version of <code>utf8::peek_next</code>. It does not |
---|
1088 | check for validity of the supplied UTF-8 sequence. |
---|
1089 | </p> |
---|
1090 | <h4> |
---|
1091 | utf8::unchecked::prior |
---|
1092 | </h4> |
---|
1093 | <p class="version"> |
---|
1094 | Available in version 1.02 and later. |
---|
1095 | </p> |
---|
1096 | <p> |
---|
1097 | Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it |
---|
1098 | decreases the iterator until it hits the beginning of the previous UTF-8 encoded |
---|
1099 | code point and returns the 32 bit representation of the code point. |
---|
1100 | </p> |
---|
1101 | <pre> |
---|
1102 | <span class="keyword">template</span> <<span class= |
---|
1103 | "keyword">typename</span> octet_iterator> |
---|
1104 | uint32_t prior(octet_iterator& it); |
---|
1105 | |
---|
1106 | </pre> |
---|
1107 | <p> |
---|
1108 | <code>it</code>: a reference to an iterator pointing to an octet within a UTF-8 encoded string. |
---|
1109 | After the function returns, it is decremented to point to the beginning of the |
---|
1110 | previous code point.<br> |
---|
1111 | <span class="return_value">Return value</span>: the 32 bit representation of the |
---|
1112 | previous code point. |
---|
1113 | </p> |
---|
1114 | <p> |
---|
1115 | Example of use: |
---|
1116 | </p> |
---|
1117 | <pre> |
---|
1118 | <span class="keyword">char</span>* twochars = <span class= |
---|
1119 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
---|
1120 | <span class="keyword">char</span>* w = twochars + <span class="literal">3</span>; |
---|
1121 | <span class="keyword">int</span> cp = unchecked::prior (w); |
---|
1122 | assert (cp == <span class="literal">0x65e5</span>); |
---|
1123 | assert (w == twochars); |
---|
1124 | </pre> |
---|
1125 | <p> |
---|
1126 | This is a faster but less safe version of <code>utf8::prior</code>. It does not |
---|
1127 | check for validity of the supplied UTF-8 sequence and offers no boundary checking. |
---|
1128 | </p> |
---|
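<p>
To illustrate why the lack of boundary checking matters, here is a minimal sketch of a backward scan over UTF-8 continuation octets. It is one way such a scan can be written and is not taken from the library's source. The name <code>step_back</code> is hypothetical. Nothing in the loop stops it at the start of the buffer, so the caller must guarantee that a lead octet precedes the starting position.
</p>

```cpp
#include <cassert>   // for the assertions mentioned in the usage note

// Hypothetical helper (not part of the library): moves p back to the lead
// octet of the previous UTF-8 sequence. Continuation octets have the bit
// pattern 10xxxxxx, so the scan skips every octet matching that pattern.
// There is no lower bound check; the input is assumed to be valid UTF-8
// with at least one complete sequence before p.
const char* step_back(const char* p)
{
    do {
        --p;
    } while ((static_cast<unsigned char>(*p) & 0xc0) == 0x80);
    return p;
}
```

<p>
For the string <code>"\xe6\x97\xa5\xd1\x88"</code> used in the example, stepping back from offset 5 lands on offset 3, and stepping back from offset 3 lands on offset 0.
</p>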
1129 | <h4> |
---|
1130 | utf8::unchecked::previous (deprecated, see utf8::unchecked::prior) |
---|
1131 | </h4> |
---|
1132 | <p class="version"> |
---|
1133 | Deprecated in version 1.02 and later. |
---|
1134 | </p> |
---|
1135 | <p> |
---|
1136 | Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it |
---|
1137 | decreases the iterator until it hits the beginning of the previous UTF-8 encoded |
---|
1138 | code point and returns the 32 bit representation of the code point. |
---|
1139 | </p> |
---|
1140 | <pre> |
---|
1141 | <span class="keyword">template</span> <<span class= |
---|
1142 | "keyword">typename</span> octet_iterator> |
---|
1143 | uint32_t previous(octet_iterator& it); |
---|
1144 | |
---|
1145 | </pre> |
---|
1146 | <p> |
---|
1147 | <code>it</code>: a reference to an iterator pointing to an octet within a UTF-8 encoded string. |
---|
1148 | After the function returns, it is decremented to point to the beginning of the |
---|
1149 | previous code point.<br> |
---|
1150 | <span class="return_value">Return value</span>: the 32 bit representation of the |
---|
1151 | previous code point. |
---|
1152 | </p> |
---|
1153 | <p> |
---|
1154 | Example of use: |
---|
1155 | </p> |
---|
1156 | <pre> |
---|
1157 | <span class="keyword">char</span>* twochars = <span class= |
---|
1158 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
---|
1159 | <span class="keyword">char</span>* w = twochars + <span class="literal">3</span>; |
---|
1160 | <span class="keyword">int</span> cp = unchecked::previous (w); |
---|
1161 | assert (cp == <span class="literal">0x65e5</span>); |
---|
1162 | assert (w == twochars); |
---|
1163 | </pre> |
---|
1164 | <p> |
---|
1165 | This function is deprecated only for consistency with the "checked" |
---|
1166 | versions, where <code>prior</code> should be used instead of <code>previous</code>. |
---|
1167 | In fact, <code>unchecked::previous</code> behaves exactly the same as <code> |
---|
1168 | unchecked::prior</code>. |
---|
1169 | </p> |
---|
1170 | <p> |
---|
1171 | This is a faster but less safe version of <code>utf8::previous</code>. It does not |
---|
1172 | check for validity of the supplied UTF-8 sequence and offers no boundary checking. |
---|
1173 | </p> |
---|
1174 | <h4> |
---|
1175 | utf8::unchecked::advance |
---|
1176 | </h4> |
---|
1177 | <p class="version"> |
---|
1178 | Available in version 1.0 and later. |
---|
1179 | </p> |
---|
1180 | <p> |
---|
1181 | Advances an iterator by the specified number of code points within a UTF-8 |
---|
1182 | sequence. |
---|
1183 | </p> |
---|
1184 | <pre> |
---|
1185 | <span class="keyword">template</span> <<span class= |
---|
1186 | "keyword">typename</span> octet_iterator, typename distance_type> |
---|
1187 | <span class="keyword">void</span> advance (octet_iterator& it, distance_type n); |
---|
1188 | |
---|
1189 | </pre> |
---|
1190 | <p> |
---|
1191 | <code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8 |
---|
1192 | encoded code point. After the function returns, it is incremented to point to the |
---|
1193 | nth following code point.<br> |
---|
1194 | <code>n</code>: a positive integer that shows how many code points we want to |
---|
1195 | advance.<br> |
---|
1196 | </p> |
---|
1197 | <p> |
---|
1198 | Example of use: |
---|
1199 | </p> |
---|
1200 | <pre> |
---|
1201 | <span class="keyword">char</span>* twochars = <span class= |
---|
1202 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
---|
1203 | <span class="keyword">char</span>* w = twochars; |
---|
1204 | unchecked::advance (w, <span class="literal">2</span>); |
---|
1205 | assert (w == twochars + <span class="literal">5</span>); |
---|
1206 | </pre> |
---|
1207 | <p> |
---|
1208 | This function works only "forward". In case of a negative <code>n</code>, there is |
---|
1209 | no effect. |
---|
1210 | </p> |
---|
1211 | <p> |
---|
1212 | This is a faster but less safe version of <code>utf8::advance</code>. It does not |
---|
1213 | check for validity of the supplied UTF-8 sequence and offers no boundary checking. |
---|
1214 | </p> |
---|
1215 | <h4> |
---|
1216 | utf8::unchecked::distance |
---|
1217 | </h4> |
---|
1218 | <p class="version"> |
---|
1219 | Available in version 1.0 and later. |
---|
1220 | </p> |
---|
1221 | <p> |
---|
1222 | Given the iterators to two UTF-8 encoded code points in a sequence, returns the |
---|
1223 | number of code points between them. |
---|
1224 | </p> |
---|
1225 | <pre> |
---|
1226 | <span class="keyword">template</span> <<span class= |
---|
1227 | "keyword">typename</span> octet_iterator> |
---|
1228 | <span class= |
---|
1229 | "keyword">typename</span> std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last); |
---|
1230 | </pre> |
---|
1231 | <p> |
---|
1232 | <code>first</code>: an iterator to the beginning of a UTF-8 encoded code point.<br> |
---|
1233 | <code>last</code>: an iterator one past the end of the last UTF-8 encoded code |
---|
1234 | point in the sequence whose length we want to determine. It can be the |
---|
1235 | beginning of a new code point, or not.<br> |
---|
1236 | <span class="return_value">Return value</span>: the distance between the iterators, |
---|
1237 | in code points. |
---|
1238 | </p> |
---|
1239 | <p> |
---|
1240 | Example of use: |
---|
1241 | </p> |
---|
1242 | <pre> |
---|
1243 | <span class="keyword">char</span>* twochars = <span class= |
---|
1244 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
---|
1245 | size_t dist = utf8::unchecked::distance(twochars, twochars + <span class= |
---|
1246 | "literal">5</span>); |
---|
1247 | assert (dist == <span class="literal">2</span>); |
---|
1248 | </pre> |
---|
1249 | <p> |
---|
1250 | This is a faster but less safe version of <code>utf8::distance</code>. It does not |
---|
1251 | check for validity of the supplied UTF-8 sequence. |
---|
1252 | </p> |
---|
1253 | <h4> |
---|
1254 | utf8::unchecked::utf16to8 |
---|
1255 | </h4> |
---|
1256 | <p class="version"> |
---|
1257 | Available in version 1.0 and later. |
---|
1258 | </p> |
---|
1259 | <p> |
---|
1260 | Converts a UTF-16 encoded string to UTF-8. |
---|
1261 | </p> |
---|
1262 | <pre> |
---|
1263 | <span class="keyword">template</span> <<span class= |
---|
1264 | "keyword">typename</span> u16bit_iterator, <span class= |
---|
1265 | "keyword">typename</span> octet_iterator> |
---|
1266 | octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result); |
---|
1267 | |
---|
1268 | </pre> |
---|
1269 | <p> |
---|
1270 | <code>start</code>: an iterator pointing to the beginning of the UTF-16 encoded |
---|
1271 | string to convert.<br> |
---|
1272 | <code>end</code>: an iterator pointing to past-the-end of the UTF-16 encoded |
---|
1273 | string to convert.<br> |
---|
1274 | <code>result</code>: an output iterator to the place in the UTF-8 string where to |
---|
1275 | append the result of conversion.<br> |
---|
1276 | <span class="return_value">Return value</span>: An iterator pointing to the place |
---|
1277 | after the appended UTF-8 string. |
---|
1278 | </p> |
---|
1279 | <p> |
---|
1280 | Example of use: |
---|
1281 | </p> |
---|
1282 | <pre> |
---|
1283 | <span class="keyword">unsigned short</span> utf16string[] = {<span class= |
---|
1284 | "literal">0x41</span>, <span class="literal">0x0448</span>, <span class= |
---|
1285 | "literal">0x65e5</span>, <span class="literal">0xd834</span>, <span class= |
---|
1286 | "literal">0xdd1e</span>}; |
---|
1287 | vector<<span class="keyword">unsigned char</span>> utf8result; |
---|
1288 | unchecked::utf16to8(utf16string, utf16string + <span class= |
---|
1289 | "literal">5</span>, back_inserter(utf8result)); |
---|
1290 | assert (utf8result.size() == <span class="literal">10</span>); |
---|
1291 | </pre> |
---|
1292 | <p> |
---|
1293 | This is a faster but less safe version of <code>utf8::utf16to8</code>. It does not |
---|
1294 | check for validity of the supplied UTF-16 sequence. |
---|
1295 | </p> |
---|
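<p>
To make "does not check for validity" concrete, here is a minimal sketch of an unchecked UTF-16 to UTF-8 conversion. It is not the library's source: the names <code>sketch_append</code>, <code>sketch_utf16to8</code> and <code>sketch_utf16to8_demo</code> are hypothetical, and the code simply assumes that every lead surrogate is followed by a trail surrogate.
</p>

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper: writes one code point as 1 to 4 UTF-8 octets.
template <typename OctetIterator>
OctetIterator sketch_append(std::uint32_t cp, OctetIterator out) {
    if (cp < 0x80) {
        *out++ = static_cast<std::uint8_t>(cp);
    } else if (cp < 0x800) {
        *out++ = static_cast<std::uint8_t>((cp >> 6) | 0xc0);
        *out++ = static_cast<std::uint8_t>((cp & 0x3f) | 0x80);
    } else if (cp < 0x10000) {
        *out++ = static_cast<std::uint8_t>((cp >> 12) | 0xe0);
        *out++ = static_cast<std::uint8_t>(((cp >> 6) & 0x3f) | 0x80);
        *out++ = static_cast<std::uint8_t>((cp & 0x3f) | 0x80);
    } else {
        *out++ = static_cast<std::uint8_t>((cp >> 18) | 0xf0);
        *out++ = static_cast<std::uint8_t>(((cp >> 12) & 0x3f) | 0x80);
        *out++ = static_cast<std::uint8_t>(((cp >> 6) & 0x3f) | 0x80);
        *out++ = static_cast<std::uint8_t>((cp & 0x3f) | 0x80);
    }
    return out;
}

// Hypothetical unchecked conversion: no range or surrogate validation.
template <typename U16Iterator, typename OctetIterator>
OctetIterator sketch_utf16to8(U16Iterator start, U16Iterator end, OctetIterator out) {
    while (start != end) {
        std::uint32_t cp = static_cast<std::uint16_t>(*start++);
        if (cp >= 0xd800 && cp <= 0xdbff) {
            // Lead surrogate: the next unit is trusted to be a trail surrogate.
            std::uint32_t trail = static_cast<std::uint16_t>(*start++);
            cp = 0x10000 + ((cp - 0xd800) << 10) + (trail - 0xdc00);
        }
        out = sketch_append(cp, out);
    }
    return out;
}

// Same input as the example above: 1 + 2 + 3 + 4 octets.
inline void sketch_utf16to8_demo() {
    unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
    std::vector<unsigned char> utf8result;
    sketch_utf16to8(utf16string, utf16string + 5, std::back_inserter(utf8result));
    assert(utf8result.size() == 10);
}
```

<p>
With a lone lead surrogate at the end of the input, this sketch would read past <code>end</code>. That is the kind of risk the unchecked functions trade for speed.
</p>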
1296 | <h4> |
---|
1297 | utf8::unchecked::utf8to16 |
---|
1298 | </h4> |
---|
1299 | <p class="version"> |
---|
1300 | Available in version 1.0 and later. |
---|
1301 | </p> |
---|
1302 | <p> |
---|
1303 | Converts a UTF-8 encoded string to UTF-16. |
---|
1304 | </p> |
---|
1305 | <pre> |
---|
1306 | <span class="keyword">template</span> <<span class= |
---|
1307 | "keyword">typename</span> u16bit_iterator, <span class="keyword">typename</span> octet_iterator> |
---|
1308 | u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result); |
---|
1309 | |
---|
1310 | </pre> |
---|
1311 | <p> |
---|
1312 | <code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded |
---|
1313 | string to convert.<br> |
---|
1314 | <code>end</code>: an iterator pointing to past-the-end of the UTF-8 encoded string to convert.<br> |
---|
1315 | <code>result</code>: an output iterator to the place in the UTF-16 string where to |
---|
1316 | append the result of conversion.<br> |
---|
1317 | <span class="return_value">Return value</span>: An iterator pointing to the place |
---|
1318 | after the appended UTF-16 string. |
---|
1319 | </p> |
---|
1320 | <p> |
---|
1321 | Example of use: |
---|
1322 | </p> |
---|
1323 | <pre> |
---|
1324 | <span class="keyword">char</span> utf8_with_surrogates[] = <span class= |
---|
1325 | "literal">"\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"</span>; |
---|
1326 | vector <<span class="keyword">unsigned short</span>> utf16result; |
---|
1327 | unchecked::utf8to16(utf8_with_surrogates, utf8_with_surrogates + <span class= |
---|
1328 | "literal">9</span>, back_inserter(utf16result)); |
---|
1329 | assert (utf16result.size() == <span class="literal">4</span>); |
---|
1330 | assert (utf16result[<span class="literal">2</span>] == <span class= |
---|
1331 | "literal">0xd834</span>); |
---|
1332 | assert (utf16result[<span class="literal">3</span>] == <span class= |
---|
1333 | "literal">0xdd1e</span>); |
---|
1334 | </pre> |
---|
1335 | <p> |
---|
1336 | This is a faster but less safe version of <code>utf8::utf8to16</code>. It does not |
---|
1337 | check for validity of the supplied UTF-8 sequence. |
---|
1338 | </p> |
---|
1339 | <h4> |
---|
1340 | utf8::unchecked::utf32to8 |
---|
1341 | </h4> |
---|
1342 | <p class="version"> |
---|
1343 | Available in version 1.0 and later. |
---|
1344 | </p> |
---|
1345 | <p> |
---|
1346 | Converts a UTF-32 encoded string to UTF-8. |
---|
1347 | </p> |
---|
1348 | <pre> |
---|
1349 | <span class="keyword">template</span> <<span class= |
---|
1350 | "keyword">typename</span> octet_iterator, <span class= |
---|
1351 | "keyword">typename</span> u32bit_iterator> |
---|
1352 | octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result); |
---|
1353 | |
---|
1354 | </pre> |
---|
1355 | <p> |
---|
1356 | <code>start</code>: an iterator pointing to the beginning of the UTF-32 encoded |
---|
1357 | string to convert.<br> |
---|
1358 | <code>end</code>: an iterator pointing to past-the-end of the UTF-32 encoded |
---|
1359 | string to convert.<br> |
---|
1360 | <code>result</code>: an output iterator to the place in the UTF-8 string where to |
---|
1361 | append the result of conversion.<br> |
---|
1362 | <span class="return_value">Return value</span>: An iterator pointing to the place |
---|
1363 | after the appended UTF-8 string. |
---|
1364 | </p> |
---|
1365 | <p> |
---|
1366 | Example of use: |
---|
1367 | </p> |
---|
1368 | <pre> |
---|
1369 | <span class="keyword">int</span> utf32string[] = {<span class= |
---|
1370 | "literal">0x448</span>, <span class="literal">0x65e5</span>, <span class= |
---|
1371 | "literal">0x10346</span>, <span class="literal">0</span>}; |
---|
1372 | vector<<span class="keyword">unsigned char</span>> utf8result; |
---|
1373 | unchecked::utf32to8(utf32string, utf32string + <span class= |
---|
1374 | "literal">3</span>, back_inserter(utf8result)); |
---|
1375 | assert (utf8result.size() == <span class="literal">9</span>); |
---|
1376 | </pre> |
---|
1377 | <p> |
---|
1378 | This is a faster but less safe version of <code>utf8::utf32to8</code>. It does not |
---|
1379 | check for validity of the supplied UTF-32 sequence. |
---|
1380 | </p> |
---|
1381 | <h4> |
---|
1382 | utf8::unchecked::utf8to32 |
---|
1383 | </h4> |
---|
1384 | <p class="version"> |
---|
1385 | Available in version 1.0 and later. |
---|
1386 | </p> |
---|
1387 | <p> |
---|
1388 | Converts a UTF-8 encoded string to UTF-32. |
---|
1389 | </p> |
---|
1390 | <pre> |
---|
1391 | <span class="keyword">template</span> <<span class= |
---|
1392 | "keyword">typename</span> octet_iterator, <span class="keyword">typename</span> u32bit_iterator> |
---|
1393 | u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result); |
---|
1394 | |
---|
1395 | </pre> |
---|
1396 | <p> |
---|
1397 | <code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded |
---|
1398 | string to convert.<br> |
---|
1399 | <code>end</code>: an iterator pointing to past-the-end of the UTF-8 encoded string |
---|
1400 | to convert.<br> |
---|
1401 | <code>result</code>: an output iterator to the place in the UTF-32 string where to |
---|
1402 | append the result of conversion.<br> |
---|
1403 | <span class="return_value">Return value</span>: An iterator pointing to the place |
---|
1404 | after the appended UTF-32 string. |
---|
1405 | </p> |
---|
1406 | <p> |
---|
1407 | Example of use: |
---|
1408 | </p> |
---|
1409 | <pre> |
---|
1410 | <span class="keyword">const char</span>* twochars = <span class= |
---|
1411 | "literal">"\xe6\x97\xa5\xd1\x88"</span>; |
---|
1412 | vector<<span class="keyword">int</span>> utf32result; |
---|
1413 | unchecked::utf8to32(twochars, twochars + <span class= |
---|
1414 | "literal">5</span>, back_inserter(utf32result)); |
---|
1415 | assert (utf32result.size() == <span class="literal">2</span>); |
---|
1416 | </pre> |
---|
1417 | <p> |
---|
1418 | This is a faster but less safe version of <code>utf8::utf8to32</code>. It does not |
---|
1419 | check for validity of the supplied UTF-8 sequence. |
---|
1420 | </p> |
---|
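<p>
The following sketch shows what decoding without validation can look like. It is an illustration under stated assumptions, not the library's implementation: <code>sketch_next</code>, <code>sketch_utf8to32</code> and <code>sketch_utf8to32_demo</code> are hypothetical names, and the input is assumed to be well-formed UTF-8.
</p>

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper: decodes one code point and advances the iterator.
// The lead octet alone decides how many trail octets are read.
template <typename OctetIterator>
std::uint32_t sketch_next(OctetIterator& it) {
    std::uint32_t cp = static_cast<std::uint8_t>(*it++);
    if (cp < 0x80) {
        return cp;                       // 0xxxxxxx: single octet
    }
    int trail;
    if ((cp >> 5) == 0x6) {              // 110xxxxx: two octets
        cp &= 0x1f;
        trail = 1;
    } else if ((cp >> 4) == 0xe) {       // 1110xxxx: three octets
        cp &= 0x0f;
        trail = 2;
    } else {                             // assumed 11110xxx: four octets
        cp &= 0x07;
        trail = 3;
    }
    for (; trail > 0; --trail) {
        // Each trail octet contributes its low six bits; nothing is verified.
        cp = (cp << 6) | (static_cast<std::uint8_t>(*it++) & 0x3f);
    }
    return cp;
}

// Hypothetical unchecked conversion built on the helper above.
template <typename OctetIterator, typename U32Iterator>
U32Iterator sketch_utf8to32(OctetIterator start, OctetIterator end, U32Iterator out) {
    while (start != end) {
        *out++ = sketch_next(start);
    }
    return out;
}

// Same input as the example above: two code points in five octets.
inline void sketch_utf8to32_demo() {
    const char* twochars = "\xe6\x97\xa5\xd1\x88";
    std::vector<std::uint32_t> utf32result;
    sketch_utf8to32(twochars, twochars + 5, std::back_inserter(utf32result));
    assert(utf32result.size() == 2);
    assert(utf32result[0] == 0x65e5);
    assert(utf32result[1] == 0x0448);
}
```

<p>
Because the trail octets are never inspected, truncated or malformed input makes a decoder like this read past <code>end</code> or produce meaningless code points.
</p>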
1421 | <h3 id="typesunchecked"> |
---|
1422 | Types From utf8::unchecked Namespace |
---|
1423 | </h3> |
---|
1424 | <h4> |
---|
1425 | utf8::unchecked::iterator |
---|
1426 | </h4> |
---|
1427 | <p class="version"> |
---|
1428 | Available in version 2.0 and later. |
---|
1429 | </p> |
---|
1430 | <p> |
---|
1431 | Adapts the underlying octet iterator to iterate over the sequence of code points, |
---|
1432 | rather than raw octets. |
---|
1433 | </p> |
---|
1434 | <pre> |
---|
1435 | <span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator> |
---|
1436 | <span class="keyword">class</span> iterator; |
---|
1437 | </pre> |
---|
1438 | |
---|
1439 | <h5>Member functions</h5> |
---|
1440 | <dl> |
---|
1441 | <dt><code>iterator();</code> <dd> the default constructor; the underlying <code>octet_iterator</code> is |
---|
1442 | constructed with its default constructor. |
---|
1443 | <dt><code><span class="keyword">explicit</span> iterator (const octet_iterator& octet_it); |
---|
1444 | </code> <dd> a constructor |
---|
1445 | that initializes the underlying <code>octet_iterator</code> with <code>octet_it</code> |
---|
1446 | <dt><code>octet_iterator base () <span class="keyword">const</span>;</code> <dd> returns the |
---|
1447 | underlying <code>octet_iterator</code>. |
---|
1448 | <dt><code>uint32_t operator * () <span class="keyword">const</span>;</code> <dd> decodes the UTF-8 sequence |
---|
1449 | the underlying <code>octet_iterator</code> is pointing to and returns the code point. |
---|
1450 | <dt><code><span class="keyword">bool operator</span> == (const iterator& rhs) |
---|
1451 | <span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span> |
---|
1452 | if the two underlying iterators are equal. |
---|
1453 | <dt><code><span class="keyword">bool operator</span> != (const iterator& rhs) |
---|
1454 | <span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span> |
---|
1455 | if the two underlying iterators are not equal. |
---|
1456 | <dt><code>iterator& <span class="keyword">operator</span> ++ (); </code> <dd> the prefix increment - moves |
---|
1457 | the iterator to the next UTF-8 encoded code point. |
---|
1458 | <dt><code>iterator <span class="keyword">operator</span> ++ (<span class="keyword">int</span>); </code> <dd> |
---|
1459 | the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one. |
---|
1460 | <dt><code>iterator& <span class="keyword">operator</span> -- (); </code> <dd> the prefix decrement - moves |
---|
1461 | the iterator to the previous UTF-8 encoded code point. |
---|
1462 | <dt><code>iterator <span class="keyword">operator</span> -- (<span class="keyword">int</span>); </code> <dd> |
---|
1463 | the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one. |
---|
1464 | </dl> |
---|
1465 | <p> |
---|
1466 | Example of use: |
---|
1467 | </p> |
---|
1468 | <pre> |
---|
1469 | <span class="keyword">char</span>* threechars = <span class="literal">"\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88"</span>; |
---|
1470 | utf8::unchecked::iterator<<span class="keyword">char</span>*> un_it(threechars); |
---|
1471 | utf8::unchecked::iterator<<span class="keyword">char</span>*> un_it2 = un_it; |
---|
1472 | assert (un_it2 == un_it); |
---|
1473 | assert (*un_it == <span class="literal">0x10346</span>); |
---|
1474 | assert (*(++un_it) == <span class="literal">0x65e5</span>); |
---|
1475 | assert ((*un_it++) == <span class="literal">0x65e5</span>); |
---|
1476 | assert (*un_it == <span class="literal">0x0448</span>); |
---|
1477 | assert (un_it != un_it2); |
---|
1478 | utf8::unchecked::iterator<<span class="keyword">char</span>*> un_endit (threechars + <span class="literal">9</span>); |
---|
1479 | assert (++un_it == un_endit); |
---|
1480 | assert (*(--un_it) == <span class="literal">0x0448</span>); |
---|
1481 | assert ((*un_it--) == <span class="literal">0x0448</span>); |
---|
1482 | assert (*un_it == <span class="literal">0x65e5</span>); |
---|
1483 | assert (--un_it == utf8::unchecked::iterator<<span class="keyword">char</span>*>(threechars)); |
---|
1484 | assert (*un_it == <span class="literal">0x10346</span>); |
---|
1485 | </pre> |
---|
1486 | <p> |
---|
1487 | This is an unchecked version of <code>utf8::iterator</code>. It is faster in many cases, but offers |
---|
1488 | no validity or range checks. |
---|
1489 | </p> |
---|
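<p>
As a rough illustration of how increment and decrement can move by whole code points over raw octets, here is a minimal sketch. The function names <code>sketch_advance</code>, <code>sketch_retreat</code> and <code>sketch_iteration_demo</code> are hypothetical and are not part of the library; the code assumes well-formed UTF-8 and performs no range checks.
</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: moves forward by one code point.
// The length of the sequence is read from the lead octet only.
template <typename OctetIterator>
void sketch_advance(OctetIterator& it) {
    const std::uint8_t lead = static_cast<std::uint8_t>(*it);
    int length;
    if (lead < 0x80) {
        length = 1;                  // 0xxxxxxx
    } else if ((lead >> 5) == 0x6) {
        length = 2;                  // 110xxxxx
    } else if ((lead >> 4) == 0xe) {
        length = 3;                  // 1110xxxx
    } else {
        length = 4;                  // assumed 11110xxx; nothing is checked
    }
    for (; length > 0; --length) {
        ++it;
    }
}

// Hypothetical helper: moves back by one code point.
// Trail octets have the bit pattern 10xxxxxx, so step back until the
// current octet is not a trail octet.
template <typename OctetIterator>
void sketch_retreat(OctetIterator& it) {
    do {
        --it;
    } while ((static_cast<std::uint8_t>(*it) >> 6) == 0x2);
}

// Walks the same nine octets as the example above, forward then backward.
inline void sketch_iteration_demo() {
    const char* threechars = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
    const char* it = threechars;
    sketch_advance(it);
    assert(it == threechars + 4);    // past the 4-octet sequence
    sketch_advance(it);
    assert(it == threechars + 7);    // past the 3-octet sequence
    sketch_advance(it);
    assert(it == threechars + 9);    // past the 2-octet sequence (the end)
    sketch_retreat(it);
    assert(it == threechars + 7);
    sketch_retreat(it);
    assert(it == threechars + 4);
    sketch_retreat(it);
    assert(it == threechars);
}
```

<p>
Note that stepping back from the first code point would walk before the start of the buffer, since an unchecked iterator has no knowledge of the range it is moving over.
</p>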
1490 | <h2 id="points"> |
---|
1491 | Points of interest |
---|
1492 | </h2> |
---|
1493 | <h4> |
---|
1494 | Design goals and decisions |
---|
1495 | </h4> |
---|
1496 | <p> |
---|
1497 | The library was designed to be: |
---|
1498 | </p> |
---|
1499 | <ol> |
---|
1500 | <li> |
---|
1501 | Generic: for better or worse, there are many C++ string classes out there, and |
---|
1502 | the library should work with as many of them as possible. |
---|
1503 | </li> |
---|
1504 | <li> |
---|
1505 | Portable: the library should be portable both across different platforms and |
---|
1506 | compilers. The only non-portable code is a small section that declares unsigned |
---|
1507 | integers of different sizes: three typedefs. They can be changed by the users of |
---|
1508 | the library if they don't match their platform. The default setting should work |
---|
1509 | for Windows (both 32 and 64 bit), and most 32 bit and 64 bit Unix derivatives. |
---|
1510 | </li> |
---|
1511 | <li> |
---|
1512 | Lightweight: follow the "pay only for what you use" guideline. |
---|
1513 | </li> |
---|
1514 | <li> |
---|
1515 | Unintrusive: avoid forcing any particular design or even programming style on the |
---|
1516 | user. This is a library, not a framework. |
---|
1517 | </li> |
---|
1518 | </ol> |
---|
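<p>
The "three typedefs" mentioned under portability could look roughly like the sketch below. The namespace <code>sketch</code> and the exact underlying types are assumptions made for illustration; the compile-time checks are an addition that requires C++11 and are there only to show when the typedefs would need adjusting for a platform.
</p>

```cpp
#include <climits>

namespace sketch {
    // Assumed mapping that holds on common 32-bit and 64-bit platforms.
    typedef unsigned char  uint8_t;
    typedef unsigned short uint16_t;
    typedef unsigned int   uint32_t;

    // If any of these fail, the typedefs above must be changed for the platform.
    static_assert(sizeof(uint8_t)  * CHAR_BIT == 8,  "uint8_t must be 8 bits wide");
    static_assert(sizeof(uint16_t) * CHAR_BIT == 16, "uint16_t must be 16 bits wide");
    static_assert(sizeof(uint32_t) * CHAR_BIT == 32, "uint32_t must be 32 bits wide");
}
```

<p>
On a platform where, for example, <code>unsigned int</code> is 16 bits wide, the last check fails at compile time and the third typedef would have to name a wider type.
</p>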
1519 | <h4> |
---|
1520 | Alternatives |
---|
1521 | </h4> |
---|
1522 | <p> |
---|
1523 | In case you want to look into other means of working with UTF-8 strings from C++, |
---|
1524 | here is the list of solutions I am aware of: |
---|
1525 | </p> |
---|
1526 | <ol> |
---|
1527 | <li> |
---|
1528 | <a href="http://icu.sourceforge.net/">ICU Library</a>. It is very powerful, |
---|
1529 | complete, feature-rich, mature, and widely used. Also big, intrusive, |
---|
1530 | non-generic, and doesn't play well with the Standard Library. I definitely |
---|
1531 | recommend looking at ICU even if you don't plan to use it. |
---|
1532 | </li> |
---|
1533 | <li> |
---|
1534 | <a href= |
---|
1535 | "http://www.gtkmm.org/gtkmm2/docs/tutorial/html/ch03s04.html">Glib::ustring</a>. |
---|
1536 | A class specifically made to work with UTF-8 strings, and also feel like |
---|
1537 | <code>std::string</code>. If you prefer to have yet another string class in your |
---|
1538 | code, it may be worth a look. Be aware of the licensing issues, though. |
---|
1539 | </li> |
---|
1540 | <li> |
---|
1541 | Platform dependent solutions: Windows and POSIX have functions to convert strings |
---|
1542 | from one encoding to another. That is only a subset of what my library offers, |
---|
1543 | but if that is all you need it may be good enough, especially given the fact that |
---|
1544 | these functions are mature and tested in production. |
---|
1545 | </li> |
---|
1546 | </ol> |
---|
1547 | <h2 id="conclusion"> |
---|
1548 | Conclusion |
---|
1549 | </h2> |
---|
1550 | <p> |
---|
1551 | Until Unicode becomes officially recognized by the C++ Standard Library, we need to |
---|
1552 | use other means to work with UTF-8 strings. The template functions I describe in this |
---|
1553 | article may be a good step in this direction. |
---|
1554 | </p> |
---|
1555 | <h2 id="links"> |
---|
1556 | Links |
---|
1557 | </h2> |
---|
1558 | <ol> |
---|
1559 | <li> |
---|
1560 | <a href="http://www.unicode.org/">The Unicode Consortium</a>. |
---|
1561 | </li> |
---|
1562 | <li> |
---|
1563 | <a href="http://icu.sourceforge.net/">ICU Library</a>. |
---|
1564 | </li> |
---|
1565 | <li> |
---|
1566 | <a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8 at Wikipedia</a> |
---|
1567 | </li> |
---|
1568 | <li> |
---|
1569 | <a href="http://www.cl.cam.ac.uk/~mgk25/unicode.html">UTF-8 and Unicode FAQ for |
---|
1570 | Unix/Linux</a> |
---|
1571 | </li> |
---|
1572 | </ol> |
---|
1573 | </body> |
---|
1574 | </html> |
---|