Original author: Neo2003
Date: 2008-10-02 16:23:55-05:00

1<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2<html>
3  <head>
4    <meta name="generator" content=
5    "HTML Tidy for Linux/x86 (vers 1st November 2002), see www.w3.org">
6    <meta name="description" content=
    "A simple, portable and lightweight C++ library for easy handling of UTF-8 encoded strings">
8    <meta name="keywords" content="UTF-8 C++ portable utf8 unicode generic templates">
9    <meta name="author" content="Nemanja Trifunovic">
10    <title>
11      UTF8-CPP: UTF-8 with C++ in a Portable Way
12    </title>
13    <style type="text/css">
14    <!--
15    span.return_value {
16      color: brown;
17    }
18    span.keyword {
19      color: blue;
20    }
21    span.preprocessor {
22      color: navy;
23    }
24    span.literal {
25      color: olive;
26    }
27    span.comment {
28      color: green;
29    }
30    code {
31      font-weight: bold; 
32    }
33    ul.toc {
34      list-style-type: none;
35    }
36    p.version {
37      font-size: small;
38      font-style: italic;
39    }
40    -->
41        </style>
42  </head>
43  <body>
44    <h1>
45      UTF8-CPP: UTF-8 with C++ in a Portable Way
46    </h1>
47    <p>
48      <a href="https://sourceforge.net/projects/utfcpp">The Sourceforge project page</a>
49    </p>
50    <div id="toc">
51      <h2>
52        Table of Contents
53      </h2>
54      <ul class="toc">
55        <li>
56          <a href="#introduction">Introduction</a>
57        </li>
58        <li>
59          <a href="#examples">Examples of Use</a>
60        </li>
61        <li>
62          <a href="#reference">Reference</a>
63          <ul class="toc">
64            <li>
65              <a href="#funutf8">Functions From utf8 Namespace </a>
66            </li>
67            <li>
68              <a href="#typesutf8">Types From utf8 Namespace </a>
69            </li>
70            <li>
71              <a href="#fununchecked">Functions From utf8::unchecked Namespace </a>
72            </li>
73            <li>
74              <a href="#typesunchecked">Types From utf8::unchecked Namespace </a>
75            </li>
76          </ul>
77        </li>
78        <li>
79          <a href="#points">Points of Interest</a>
80        </li>
81        <li>
82          <a href="#conclusion">Conclusion</a>
83        </li>
84        <li>
85          <a href="#links">Links</a>
86        </li>
87      </ul>
88    </div>
89    <h2 id="introduction">
90      Introduction
91    </h2>
92    <p>
      Many C++ developers miss an easy and portable way of handling Unicode encoded
      strings. The C++ Standard is currently Unicode agnostic, and while some work is
      being done to introduce Unicode support in the next revision, called C++0x, nothing
      of the sort is available for the moment. In the meantime, developers use third-party
      libraries like ICU, OS-specific capabilities, or simply roll their own
      solutions.
99    </p>
100    <p>
      In order to easily handle UTF-8 encoded Unicode strings, I have come up with a small
      generic library. For anybody used to working with STL algorithms and iterators, it
      should be easy and natural to use. The code is freely available for any purpose;
      check out the license at the beginning of the utf8.h file. If you run into
      bugs or performance issues, please let me know and I'll do my best to address them.
106    </p>
107    <p>
108      The purpose of this article is not to offer an introduction to Unicode in general,
109      and UTF-8 in particular. If you are not familiar with Unicode, be sure to check out
110      <a href="http://www.unicode.org/">Unicode Home Page</a> or some other source of
111      information for Unicode. Also, it is not my aim to advocate the use of UTF-8
112      encoded strings in C++ programs; if you want to handle UTF-8 encoded strings from
113      C++, I am sure you have good reasons for it.
114    </p>
115    <h2 id="examples">
      Examples of Use
117    </h2>
118    <p>
119      To illustrate the use of this utf8 library, we shall open a file containing UTF-8
120      encoded text, check whether it starts with a byte order mark, read each line into a
121      <code>std::string</code>, check it for validity, convert the text to UTF-16, and
122      back to UTF-8:
123    </p>
<pre>
<span class="preprocessor">#include &lt;fstream&gt;</span>
<span class="preprocessor">#include &lt;iostream&gt;</span>
<span class="preprocessor">#include &lt;iterator&gt;</span>
<span class="preprocessor">#include &lt;string&gt;</span>
<span class="preprocessor">#include &lt;vector&gt;</span>
<span class="preprocessor">#include "utf8.h"</span>
<span class="keyword">using namespace</span> std;
<span class="keyword">int</span> main(<span class="keyword">int</span> argc, <span class="keyword">char</span>** argv)
{
    <span class="keyword">if</span> (argc != <span class="literal">2</span>) {
        cout &lt;&lt; <span class="literal">"\nUsage: docsample filename\n"</span>;
        <span class="keyword">return</span> <span class="literal">0</span>;
    }
    <span class="keyword">const char</span>* test_file_path = argv[1];
    <span class="comment">// Open the test file (must be UTF-8 encoded)</span>
    ifstream fs8(test_file_path);
    <span class="keyword">if</span> (!fs8.is_open()) {
        cout &lt;&lt; <span class="literal">"Could not open "</span> &lt;&lt; test_file_path &lt;&lt; endl;
        <span class="keyword">return</span> <span class="literal">0</span>;
    }
    <span class="comment">// Read the first line of the file</span>
    <span class="keyword">unsigned</span> line_count = <span class="literal">1</span>;
    string line;
    <span class="keyword">if</span> (!getline(fs8, line))
        <span class="keyword">return</span> <span class="literal">0</span>;
    <span class="comment">// Look for utf-8 byte-order mark at the beginning</span>
    <span class="keyword">if</span> (line.size() &gt; <span class="literal">2</span>) {
        <span class="keyword">if</span> (utf8::is_bom(line.c_str()))
            cout &lt;&lt; <span class="literal">"There is a byte order mark at the beginning of the file\n"</span>;
    }
    <span class="comment">// Play with all the lines in the file</span>
    <span class="keyword">do</span> {
        <span class="comment">// check for invalid utf-8 (for a simple yes/no check, there is also utf8::is_valid function)</span>
        string::iterator end_it = utf8::find_invalid(line.begin(), line.end());
        <span class="keyword">if</span> (end_it != line.end()) {
            cout &lt;&lt; <span class="literal">"Invalid UTF-8 encoding detected at line "</span> &lt;&lt; line_count &lt;&lt; <span class="literal">"\n"</span>;
            cout &lt;&lt; <span class="literal">"This part is fine: "</span> &lt;&lt; string(line.begin(), end_it) &lt;&lt; <span class="literal">"\n"</span>;
        }
        <span class="comment">// Get the line length (at least for the valid part)</span>
        <span class="keyword">int</span> length = utf8::distance(line.begin(), end_it);
        cout &lt;&lt; <span class="literal">"Length of line "</span> &lt;&lt; line_count &lt;&lt; <span class="literal">" is "</span> &lt;&lt; length &lt;&lt; <span class="literal">"\n"</span>;
        <span class="comment">// Convert it to utf-16</span>
        vector&lt;unsigned short&gt; utf16line;
        utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));
        <span class="comment">// And back to utf-8</span>
        string utf8line;
        utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8line));
        <span class="comment">// Confirm that the conversion went OK:</span>
        <span class="keyword">if</span> (utf8line != string(line.begin(), end_it))
            cout &lt;&lt; <span class="literal">"Error in UTF-16 conversion at line: "</span> &lt;&lt; line_count &lt;&lt; <span class="literal">"\n"</span>;
        line_count++;
        <span class="comment">// Testing getline itself also processes a last line that has no trailing newline</span>
    } <span class="keyword">while</span> (getline(fs8, line));
    <span class="keyword">return</span> <span class="literal">0</span>;
}
</pre>
190    <p>
      In the previous code sample, we used the following functions from the
      <code>utf8</code> namespace: first, <code>is_bom</code> detected a UTF-8 byte order
      mark at the beginning of the file; then, for each line, <code>find_invalid</code>
      looked for invalid UTF-8 sequences; the number of characters (more precisely, the
      number of Unicode code points) in each line was determined with
      <code>utf8::distance</code>; finally, each line was converted to UTF-16 with
      <code>utf8to16</code> and back to UTF-8 with <code>utf16to8</code>.
199    </p>
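    <p>
      As a rough illustration of the first step in the sample, a byte order mark check
      can be thought of as a comparison against the three octets EF BB BF, which is how
      U+FEFF is encoded in UTF-8. The sketch below is self-contained and does not use the
      library; <code>sketch_starts_with_bom</code> and <code>sketch_bom_demo</code> are
      hypothetical names, and the library's own <code>is_bom</code> may be written
      differently.
    </p>

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not the library's is_bom: reports whether the text
// starts with the UTF-8 encoding of U+FEFF (octets EF BB BF).
inline bool sketch_starts_with_bom(const std::string& s)
{
    return s.size() >= 3
        && static_cast<unsigned char>(s[0]) == 0xEF
        && static_cast<unsigned char>(s[1]) == 0xBB
        && static_cast<unsigned char>(s[2]) == 0xBF;
}

// Small self-check showing the intended use.
inline void sketch_bom_demo()
{
    const std::string with_bom = std::string("\xef\xbb\xbf") + "first line";
    assert(sketch_starts_with_bom(with_bom));
    assert(!sketch_starts_with_bom("first line"));
    assert(!sketch_starts_with_bom("\xef\xbb"));  // too short to hold a mark
}
```

    <p>
      The size test comes first for the same reason the sample checks
      <code>line.size() &gt; 2</code>: the three octets must exist before they are read.
    </p>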
200    <h2 id="reference">
201      Reference
202    </h2>
203    <h3 id="funutf8">
204      Functions From utf8 Namespace
205    </h3>
206    <h4>
207      utf8::append
208    </h4>
209    <p class="version">
210    Available in version 1.0 and later.
211    </p>
212    <p>
213      Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence
214      to a UTF-8 string.
215    </p>
216<pre>
217<span class="keyword">template</span> &lt;<span class=
218"keyword">typename</span> octet_iterator&gt;
219octet_iterator append(uint32_t cp, octet_iterator result);
220   
221</pre>
222    <p>
223      <code>cp</code>: A 32 bit integer representing a code point to append to the
224      sequence.<br>
225       <code>result</code>: An output iterator to the place in the sequence where to
226      append the code point.<br>
227       <span class="return_value">Return value</span>: An iterator pointing to the place
228      after the newly appended sequence.
229    </p>
230    <p>
231      Example of use:
232    </p>
233<pre>
234<span class="keyword">unsigned char</span> u[<span class="literal">5</span>] = {<span
235class="literal">0</span>,<span class="literal">0</span>,<span class=
236"literal">0</span>,<span class="literal">0</span>,<span class="literal">0</span>};
237<span class="keyword">unsigned char</span>* end = append(<span class=
238"literal">0x0448</span>, u);
239assert (u[<span class="literal">0</span>] == <span class=
240"literal">0xd1</span> &amp;&amp; u[<span class="literal">1</span>] == <span class=
241"literal">0x88</span> &amp;&amp; u[<span class="literal">2</span>] == <span class=
242"literal">0</span> &amp;&amp; u[<span class="literal">3</span>] == <span class=
243"literal">0</span> &amp;&amp; u[<span class="literal">4</span>] == <span class=
244"literal">0</span>);
245</pre>
246    <p>
247      Note that <code>append</code> does not allocate any memory - it is the burden of
248      the caller to make sure there is enough memory allocated for the operation. To make
249      things more interesting, <code>append</code> can add anywhere between 1 and 4
250      octets to the sequence. In practice, you would most often want to use
251      <code>std::back_inserter</code> to ensure that the necessary memory is allocated.
252    </p>
253    <p>
254      In case of an invalid code point, a <code>utf8::invalid_code_point</code> exception
255      is thrown.
256    </p>
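    <p>
      To make the "between 1 and 4 octets" remark concrete, here is a minimal sketch of
      how a code point maps to UTF-8 octets when written through an output iterator. It
      is not the library's code: <code>sketch_append</code> and
      <code>sketch_append_demo</code> are hypothetical names, and the sketch assumes the
      code point is already known to be valid, so it performs none of the checks that make
      <code>utf8::append</code> throw.
    </p>

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical helper: writes the UTF-8 encoding of cp through an output
// iterator and returns the advanced iterator.
// Assumes cp is a valid code point (not a surrogate, at most 0x10FFFF).
template <typename OutputIt>
OutputIt sketch_append(std::uint32_t cp, OutputIt out)
{
    if (cp < 0x80) {                 // 1 octet:  0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 octets: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 octets: 11110xxx followed by three 10xxxxxx
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// With back_inserter the string grows as needed, so no buffer sizing is required.
inline void sketch_append_demo()
{
    std::string s;
    sketch_append(0x0448, std::back_inserter(s));  // two octets: d1 88
    sketch_append(0x65E5, std::back_inserter(s));  // three octets: e6 97 a5
    assert(s == std::string("\xd1\x88\xe6\x97\xa5"));
}
```

    <p>
      Writing into a fixed array, as in the example above, works the same way, but the
      caller must then reserve up to four octets per code point.
    </p>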
257    <h4>
258      utf8::next
259    </h4>
260    <p class="version">
261    Available in version 1.0 and later.
262    </p>
263    <p>
264      Given the iterator to the beginning of the UTF-8 sequence, it returns the code
265      point and moves the iterator to the next position.
266    </p>
267<pre>
268<span class="keyword">template</span> &lt;<span class=
269"keyword">typename</span> octet_iterator&gt; 
270uint32_t next(octet_iterator&amp; it, octet_iterator end);
271   
272</pre>
273    <p>
      <code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
      encoded code point. After the function returns, it is incremented to point to the
      beginning of the next code point.<br>
       <code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code>
      becomes equal to <code>end</code> during the extraction of a code point, a
      <code>utf8::not_enough_room</code> exception is thrown.<br>
       <span class="return_value">Return value</span>: the 32 bit representation of the
      processed UTF-8 code point.
282    </p>
283    <p>
284      Example of use:
285    </p>
286<pre>
<span class="keyword">const char</span>* twochars = <span class="literal">"\xe6\x97\xa5\xd1\x88"</span>;
<span class="keyword">const char</span>* w = twochars;
<span class="keyword">int</span> cp = next(w, twochars + <span class="literal">6</span>);
assert (cp == <span class="literal">0x65e5</span>);
assert (w == twochars + <span class="literal">3</span>);
293</pre>
294    <p>
295      This function is typically used to iterate through a UTF-8 encoded string.
296    </p>
297    <p>
      In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
      thrown.
300    </p>
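    <p>
      The decoding step can be sketched without the library as follows. This is a
      simplified illustration, not the library's implementation:
      <code>sketch_next</code> and <code>sketch_code_points</code> are hypothetical names,
      standard exceptions stand in for the library's exception types, and overlong forms
      and surrogates are not rejected as a full validator would.
    </p>

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decodes the code point starting at it and advances it
// to the beginning of the next code point.
template <typename It>
std::uint32_t sketch_next(It& it, It end)
{
    if (it == end) throw std::out_of_range("empty range");
    const unsigned char lead = static_cast<unsigned char>(*it++);
    int trail = 0;
    std::uint32_t cp = 0;
    if (lead < 0x80)                { cp = lead;        trail = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trail = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trail = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trail = 3; }
    else throw std::runtime_error("invalid lead octet");
    for (; trail > 0; --trail) {
        if (it == end) throw std::out_of_range("truncated sequence");
        const unsigned char c = static_cast<unsigned char>(*it++);
        if ((c & 0xC0) != 0x80) throw std::runtime_error("invalid trail octet");
        cp = (cp << 6) | (c & 0x3F);  // each trail octet contributes six bits
    }
    return cp;
}

// Typical use: walk a whole string and collect its code points.
inline std::vector<std::uint32_t> sketch_code_points(const std::string& s)
{
    std::vector<std::uint32_t> out;
    std::string::const_iterator it = s.begin();
    while (it != s.end())
        out.push_back(sketch_next(it, s.end()));
    return out;
}
```

    <p>
      Because the iterator is passed by reference and advanced by the call, the loop needs
      no separate increment.
    </p>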
301    <h4>
302      utf8::peek_next
303    </h4>
304    <p class="version">
305    Available in version 2.1 and later.
306    </p>
307    <p>
308      Given the iterator to the beginning of the UTF-8 sequence, it returns the code
309      point for the following sequence without changing the value of the iterator.
310    </p>
311<pre>
312<span class="keyword">template</span> &lt;<span class=
313"keyword">typename</span> octet_iterator&gt; 
314uint32_t peek_next(octet_iterator it, octet_iterator end);
315   
316</pre>
317    <p>
      <code>it</code>: an iterator pointing to the beginning of a UTF-8
      encoded code point.<br>
       <code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code>
      becomes equal to <code>end</code> during the extraction of a code point, a
      <code>utf8::not_enough_room</code> exception is thrown.<br>
       <span class="return_value">Return value</span>: the 32 bit representation of the
      processed UTF-8 code point.
325    </p>
326    <p>
327      Example of use:
328    </p>
329<pre>
<span class="keyword">const char</span>* twochars = <span class="literal">"\xe6\x97\xa5\xd1\x88"</span>;
<span class="keyword">const char</span>* w = twochars;
<span class="keyword">int</span> cp = peek_next(w, twochars + <span class="literal">6</span>);
assert (cp == <span class="literal">0x65e5</span>);
assert (w == twochars);
336</pre>
337    <p>
      In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
      thrown.
340    </p>
341    <h4>
342      utf8::prior
343    </h4>
344    <p class="version">
345    Available in version 1.02 and later.
346    </p>
347    <p>
      Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
      decreases the iterator until it hits the beginning of the previous UTF-8 encoded
      code point and returns the 32 bit representation of the code point.
351    </p>
352<pre>
353<span class="keyword">template</span> &lt;<span class=
354"keyword">typename</span> octet_iterator&gt; 
355uint32_t prior(octet_iterator&amp; it, octet_iterator start);
356   
357</pre>
358    <p>
      <code>it</code>: a reference to an iterator pointing to an octet within a UTF-8
      encoded string. After the function returns, it is decremented to point to the
      beginning of the previous code point.<br>
       <code>start</code>: an iterator to the beginning of the sequence where the search
      for the beginning of a code point is performed. It is a
      safety measure to prevent moving past the beginning of the string in the search for
      a UTF-8 lead octet.<br>
       <span class="return_value">Return value</span>: the 32 bit representation of the
      previous code point.
368    </p>
369    <p>
370      Example of use:
371    </p>
372<pre>
<span class="keyword">const char</span>* twochars = <span class="literal">"\xe6\x97\xa5\xd1\x88"</span>;
<span class="keyword">const char</span>* w = twochars + <span class="literal">3</span>;
<span class="keyword">int</span> cp = prior (w, twochars);
assert (cp == <span class="literal">0x65e5</span>);
assert (w == twochars);
380</pre>
381    <p> 
      This function has two purposes: one is to iterate backwards through a UTF-8
      encoded string. Note that it is usually a better idea to iterate forward instead,
      since <code>utf8::next</code> is faster. The second purpose is to find the beginning
      of a UTF-8 sequence if we have a random position within a string.
386    </p> 
387    <p>
388      <code>it</code> will typically point to the beginning of
389      a code point, and <code>start</code> will point to the
390      beginning of the string to ensure we don't go backwards too far. <code>it</code> is
391      decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence
392      beginning with that octet is decoded to a 32 bit representation and returned.
393    </p>
394    <p>
      In case <code>start</code> is reached before a UTF-8 lead octet is hit, or if an
      invalid UTF-8 sequence is started by the lead octet, an <code>invalid_utf8</code>
      exception is thrown.
398    </p>
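    <p>
      The backward search described above relies on the fact that trail octets always have
      the bit pattern 10xxxxxx. A minimal sketch of just that search, without the decoding
      step, might look like this. <code>sketch_step_back</code> and
      <code>sketch_offset_of_previous</code> are hypothetical names, and standard
      exceptions are used in place of the library's own.
    </p>

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: moves it back to the lead octet of the previous code
// point, never stepping before start. It only locates the boundary; decoding
// the sequence found there is left out of this sketch.
template <typename It>
void sketch_step_back(It& it, It start)
{
    if (it == start) throw std::out_of_range("already at the start");
    // Skip trail octets (10xxxxxx) while moving backwards.
    do {
        --it;
    } while (it != start && (static_cast<unsigned char>(*it) & 0xC0) == 0x80);
    // Stopping on a trail octet means start was reached without a lead octet.
    if ((static_cast<unsigned char>(*it) & 0xC0) == 0x80)
        throw std::runtime_error("no lead octet found before the start");
}

// Returns the offset of the code point boundary that precedes position pos.
inline std::size_t sketch_offset_of_previous(const std::string& s, std::size_t pos)
{
    std::string::const_iterator it = s.begin() + static_cast<std::ptrdiff_t>(pos);
    sketch_step_back(it, s.begin());
    return static_cast<std::size_t>(it - s.begin());
}
```

    <p>
      Starting from a position in the middle of a sequence also works, which corresponds
      to the second purpose mentioned above: finding the beginning of a sequence from a
      random position.
    </p>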
399    <h4>
400      utf8::previous
401    </h4>
402    <p class="version">
403    Deprecated in version 1.02 and later.
404    </p>
405    <p>
      Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
      decreases the iterator until it hits the beginning of the previous UTF-8 encoded
      code point and returns the 32 bit representation of the code point.
409    </p>
410<pre>
411<span class="keyword">template</span> &lt;<span class=
412"keyword">typename</span> octet_iterator&gt; 
413uint32_t previous(octet_iterator&amp; it, octet_iterator pass_start);
414   
415</pre>
416    <p>
      <code>it</code>: a reference to an iterator pointing to an octet within a UTF-8
      encoded string. After the function returns, it is decremented to point to the
      beginning of the previous code point.<br>
420       <code>pass_start</code>: an iterator to the point in the sequence where the search
421      for the beginning of a code point is aborted if no result was reached. It is a
422      safety measure to prevent passing the beginning of the string in the search for a
423      UTF-8 lead octet.<br>
424       <span class="return_value">Return value</span>: the 32 bit representation of the
425      previous code point.
426    </p>
427    <p>
428      Example of use:
429    </p>
430<pre>
<span class="keyword">const char</span>* twochars = <span class="literal">"\xe6\x97\xa5\xd1\x88"</span>;
<span class="keyword">const char</span>* w = twochars + <span class="literal">3</span>;
<span class="keyword">int</span> cp = previous (w, twochars - <span class="literal">1</span>);
assert (cp == <span class="literal">0x65e5</span>);
assert (w == twochars);
439</pre>
440    <p>
      <code>utf8::previous</code> is deprecated, and <code>utf8::prior</code> should
      be used instead, although existing code can continue using this function.
      The problem is the parameter <code>pass_start</code>, which points to the position
      just before the beginning of the sequence. Standard containers don't have the
      concept of "pass start", so the function cannot be used with their iterators.
446    </p>
447    <p>
448      <code>it</code> will typically point to the beginning of
449      a code point, and <code>pass_start</code> will point to the octet just before the
450      beginning of the string to ensure we don't go backwards too far. <code>it</code> is
451      decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence
452      beginning with that octet is decoded to a 32 bit representation and returned.
453    </p>
454    <p>
      In case <code>pass_start</code> is reached before a UTF-8 lead octet is hit, or if an
      invalid UTF-8 sequence is started by the lead octet, an <code>invalid_utf8</code>
      exception is thrown.
458    </p>
459    <h4>
460      utf8::advance
461    </h4>
462    <p class="version">
463    Available in version 1.0 and later.
464    </p>
465    <p>
      Advances an iterator by the specified number of code points within a UTF-8
      sequence.
468    </p>
469<pre>
470<span class="keyword">template</span> &lt;<span class=
471"keyword">typename</span> octet_iterator, typename distance_type&gt; 
472<span class=
473"keyword">void</span> advance (octet_iterator&amp; it, distance_type n, octet_iterator end);
474   
475</pre>
476    <p>
      <code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
      encoded code point. After the function returns, it is incremented to point to the
      nth following code point.<br>
       <code>n</code>: a positive integer that shows how many code points we want to
      advance.<br>
       <code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code>
      becomes equal to <code>end</code> during the extraction of a code point, a
      <code>utf8::not_enough_room</code> exception is thrown.
485    </p>
486    <p>
487      Example of use:
488    </p>
489<pre>
<span class="keyword">const char</span>* twochars = <span class="literal">"\xe6\x97\xa5\xd1\x88"</span>;
<span class="keyword">const char</span>* w = twochars;
advance (w, <span class="literal">2</span>, twochars + <span class="literal">6</span>);
assert (w == twochars + <span class="literal">5</span>);
495</pre>
496    <p>
497      This function works only "forward". In case of a negative <code>n</code>, there is
498      no effect.
499    </p>
500    <p>
501      In case of an invalid code point, a <code>utf8::invalid_code_point</code> exception
502      is thrown.
503    </p>
504    <h4>
505      utf8::distance
506    </h4>
507    <p class="version">
508    Available in version 1.0 and later.
509    </p>
510    <p>
      Given the iterators to two UTF-8 encoded code points in a sequence, returns the
      number of code points between them.
513    </p>
514<pre>
515<span class="keyword">template</span> &lt;<span class=
516"keyword">typename</span> octet_iterator&gt; 
517<span class=
518"keyword">typename</span> std::iterator_traits&lt;octet_iterator&gt;::difference_type distance (octet_iterator first, octet_iterator last);
519   
520</pre>
521    <p>
      <code>first</code>: an iterator to the beginning of a UTF-8 encoded code point.<br>
       <code>last</code>: an iterator to the position just past the last UTF-8 encoded
      code point in the sequence whose length we are trying to determine. It can be the
      beginning of a new code point, or not.<br>
       <span class="return_value">Return value</span>: the distance between the iterators,
      in code points.
528    </p>
529    <p>
530      Example of use:
531    </p>
532<pre>
<span class="keyword">const char</span>* twochars = <span class="literal">"\xe6\x97\xa5\xd1\x88"</span>;
size_t dist = utf8::distance(twochars, twochars + <span class="literal">5</span>);
assert (dist == <span class="literal">2</span>);
537</pre>
538    <p>
      This function is used to find the length (in code points) of a UTF-8 encoded
      string. The reason it is called <em>distance</em>, rather than, say,
      <em>length</em>, is mainly that developers are used to <em>length</em> being an
      O(1) operation. Computing the length of a UTF-8 string is a linear operation, and
      it looked better to model it after the <code>std::distance</code> algorithm.
544    </p>
545    <p>
      In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
      thrown. If <code>last</code> does not point to the past-the-end of a UTF-8 sequence,
      a <code>utf8::not_enough_room</code> exception is thrown.
549    </p>
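    <p>
      The linear cost is easy to see in a sketch: every octet has to be visited. The
      version below is an assumption-laden simplification rather than the library's
      method. It presumes the range already holds valid UTF-8 and simply counts the octets
      that are not trail octets, so unlike <code>utf8::distance</code> it never throws.
      <code>sketch_distance</code> and <code>sketch_distance_demo</code> are hypothetical
      names.
    </p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in [first, last), assuming valid
// UTF-8. Each code point has exactly one octet that is not a trail octet
// (10xxxxxx), so counting those octets gives the number of code points.
template <typename It>
std::size_t sketch_distance(It first, It last)
{
    std::size_t n = 0;
    for (; first != last; ++first)
        if ((static_cast<unsigned char>(*first) & 0xC0) != 0x80)
            ++n;
    return n;
}

// The octet count and the code point count differ for non-ASCII text.
inline void sketch_distance_demo()
{
    const std::string twochars("\xe6\x97\xa5\xd1\x88");
    assert(twochars.size() == 5);                                    // octets
    assert(sketch_distance(twochars.begin(), twochars.end()) == 2);  // code points
}
```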
550    <h4>
551      utf8::utf16to8
552    </h4>
553    <p class="version">
554    Available in version 1.0 and later.
555    </p>
556    <p>
557      Converts a UTF-16 encoded string to UTF-8.
558    </p>
559<pre>
560<span class="keyword">template</span> &lt;<span class=
561"keyword">typename</span> u16bit_iterator, <span class=
562"keyword">typename</span> octet_iterator&gt;
563octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);
564   
565</pre>
566    <p>
567      <code>start</code>: an iterator pointing to the beginning of the UTF-16 encoded
568      string to convert.<br>
       <code>end</code>: an iterator pointing to past-the-end of the UTF-16 encoded
570      string to convert.<br>
571       <code>result</code>: an output iterator to the place in the UTF-8 string where to
572      append the result of conversion.<br>
573       <span class="return_value">Return value</span>: An iterator pointing to the place
574      after the appended UTF-8 string.
575    </p>
576    <p>
577      Example of use:
578    </p>
579<pre>
580<span class="keyword">unsigned short</span> utf16string[] = {<span class=
581"literal">0x41</span>, <span class="literal">0x0448</span>, <span class=
582"literal">0x65e5</span>, <span class="literal">0xd834</span>, <span class=
583"literal">0xdd1e</span>};
584vector&lt;<span class="keyword">unsigned char</span>&gt; utf8result;
585utf16to8(utf16string, utf16string + <span class=
586"literal">5</span>, back_inserter(utf8result));
587assert (utf8result.size() == <span class="literal">10</span>);   
588</pre>
589    <p>
      In case of an invalid UTF-16 sequence, a <code>utf8::invalid_utf16</code> exception
      is thrown.
592    </p>
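    <p>
      The result size of 10 in the example follows from two facts: the last two UTF-16
      units form a surrogate pair that stands for a single code point, and each code point
      then takes between 1 and 4 octets. The sketch below works through that arithmetic
      on its own, without the library. The helper names are hypothetical, and the inputs
      are assumed to be well formed.
    </p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: combines a lead surrogate (0xD800 to 0xDBFF) and a
// trail surrogate (0xDC00 to 0xDFFF) into the code point they encode.
inline std::uint32_t sketch_combine_surrogates(std::uint16_t lead, std::uint16_t trail)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(lead) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(trail) - 0xDC00u);
}

// Hypothetical helper: number of UTF-8 octets needed for a valid code point.
inline int sketch_utf8_length(std::uint32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Reproduces the size from the example: 0x41, 0x0448, 0x65e5, and the pair
// 0xd834 0xdd1e need 1 + 2 + 3 + 4 octets.
inline void sketch_utf16to8_demo()
{
    const std::uint32_t last = sketch_combine_surrogates(0xD834, 0xDD1E);
    assert(last == 0x1D11E);
    const int total = sketch_utf8_length(0x41) + sketch_utf8_length(0x0448)
                    + sketch_utf8_length(0x65E5) + sketch_utf8_length(last);
    assert(total == 10);
}
```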
593    <h4>
594      utf8::utf8to16
595    </h4>
596    <p class="version">
597    Available in version 1.0 and later.
598    </p>
599    <p>
      Converts a UTF-8 encoded string to UTF-16.
601    </p>
602<pre>
603<span class="keyword">template</span> &lt;<span class=
604"keyword">typename</span> u16bit_iterator, typename octet_iterator&gt;
605u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);
606   
607</pre>
608    <p>
      <code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
      string to convert.<br>
       <code>end</code>: an iterator pointing to past-the-end of the UTF-8 encoded
      string to convert.<br>
       <code>result</code>: an output iterator to the place in the UTF-16 string where to
      append the result of conversion.<br>
       <span class="return_value">Return value</span>: An iterator pointing to the place
      after the appended UTF-16 string.
616    </p>
617    <p>
618      Example of use:
619    </p>
620<pre>
621<span class="keyword">char</span> utf8_with_surrogates[] = <span class=
622"literal">"\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"</span>;
623vector &lt;<span class="keyword">unsigned short</span>&gt; utf16result;
624utf8to16(utf8_with_surrogates, utf8_with_surrogates + <span class=
625"literal">9</span>, back_inserter(utf16result));
626assert (utf16result.size() == <span class="literal">4</span>);
627assert (utf16result[<span class="literal">2</span>] == <span class=
628"literal">0xd834</span>);
629assert (utf16result[<span class="literal">3</span>] == <span class=
630"literal">0xdd1e</span>);
631</pre>
632    <p>
      In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
      thrown. If <code>end</code> does not point to the past-the-end of a UTF-8 sequence, a
      <code>utf8::not_enough_room</code> exception is thrown.
636    </p>
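    <p>
      In the example, three code points produce four UTF-16 units because the last one
      lies above U+FFFF and has to be split into a surrogate pair. A minimal sketch of
      that split, independent of the library, is shown below.
      <code>sketch_append_utf16</code> and <code>sketch_utf8to16_demo</code> are
      hypothetical names, and the code point is assumed to be valid.
    </p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends the UTF-16 encoding of a valid code point.
// Code points up to 0xFFFF take one unit; the rest take a surrogate pair.
inline void sketch_append_utf16(std::uint32_t cp, std::vector<std::uint16_t>& out)
{
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        const std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
        out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high 10 bits
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low 10 bits
    }
}

// The three code points from the example yield four UTF-16 units.
inline void sketch_utf8to16_demo()
{
    std::vector<std::uint16_t> utf16;
    sketch_append_utf16(0x65E5, utf16);
    sketch_append_utf16(0x0448, utf16);
    sketch_append_utf16(0x1D11E, utf16);
    assert(utf16.size() == 4);
    assert(utf16[2] == 0xD834);
    assert(utf16[3] == 0xDD1E);
}
```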
637    <h4>
638      utf8::utf32to8
639    </h4>
640    <p class="version">
641    Available in version 1.0 and later.
642    </p>
643    <p>
644      Converts a UTF-32 encoded string to UTF-8.
645    </p>
646<pre>
647<span class="keyword">template</span> &lt;<span class=
648"keyword">typename</span> octet_iterator, typename u32bit_iterator&gt;
649octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);
650   
651</pre>
652    <p>
653      <code>start</code>: an iterator pointing to the beginning of the UTF-32 encoded
654      string to convert.<br>
       <code>end</code>: an iterator pointing to past-the-end of the UTF-32 encoded
656      string to convert.<br>
657       <code>result</code>: an output iterator to the place in the UTF-8 string where to
658      append the result of conversion.<br>
659       <span class="return_value">Return value</span>: An iterator pointing to the place
660      after the appended UTF-8 string.
661    </p>
662    <p>
663      Example of use:
664    </p>
665<pre>
666<span class="keyword">int</span> utf32string[] = {<span class=
667"literal">0x448</span>, <span class="literal">0x65E5</span>, <span class=
668"literal">0x10346</span>, <span class="literal">0</span>};
669vector&lt;<span class="keyword">unsigned char</span>&gt; utf8result;
670utf32to8(utf32string, utf32string + <span class=
671"literal">3</span>, back_inserter(utf8result));
672assert (utf8result.size() == <span class="literal">9</span>);
673</pre>
674    <p>
      In case of an invalid UTF-32 string, a <code>utf8::invalid_code_point</code>
      exception is thrown.
677    </p>
678    <h4>
679      utf8::utf8to32
680    </h4>
681    <p class="version">
682    Available in version 1.0 and later.
683    </p>
684    <p>
685      Converts a UTF-8 encoded string to UTF-32.
686    </p>
687<pre>
688<span class="keyword">template</span> &lt;<span class=
689"keyword">typename</span> octet_iterator, <span class=
690"keyword">typename</span> u32bit_iterator&gt;
691u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);
692   
693</pre>
694    <p>
695      <code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
696      string to convert.<br>
       <code>end</code>: an iterator pointing to past-the-end of the UTF-8 encoded string
698      to convert.<br>
699       <code>result</code>: an output iterator to the place in the UTF-32 string where to
700      append the result of conversion.<br>
701       <span class="return_value">Return value</span>: An iterator pointing to the place
702      after the appended UTF-32 string.
703    </p>
704    <p>
705      Example of use:
706    </p>
707<pre>
<span class="keyword">const char</span>* twochars = <span class="literal">"\xe6\x97\xa5\xd1\x88"</span>;
710vector&lt;<span class="keyword">int</span>&gt; utf32result;
711utf8to32(twochars, twochars + <span class=
712"literal">5</span>, back_inserter(utf32result));
713assert (utf32result.size() == <span class="literal">2</span>);
714</pre>
715    <p>
      In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
      thrown. If <code>end</code> does not point to the past-the-end of a UTF-8 sequence, a
      <code>utf8::not_enough_room</code> exception is thrown.
719    </p>
720    <h4>
721      utf8::find_invalid
722    </h4>
723    <p class="version">
724    Available in version 1.0 and later.
725    </p>
726    <p>
727      Detects an invalid sequence within a UTF-8 string.
728    </p>
729<pre>
730<span class="keyword">template</span> &lt;<span class=
731"keyword">typename</span> octet_iterator&gt; 
732octet_iterator find_invalid(octet_iterator start, octet_iterator end);
733</pre>
734    <p>
735      <code>start</code>: an iterator pointing to the beginning of the UTF-8 string to
736      test for validity.<br>
       <code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to test
738      for validity.<br>
739       <span class="return_value">Return value</span>: an iterator pointing to the first
740      invalid octet in the UTF-8 string. In case none were found, equals
741      <code>end</code>.
742    </p>
743    <p>
744      Example of use:
745    </p>
746<pre>
747<span class="keyword">char</span> utf_invalid[] = <span class=
748"literal">"\xe6\x97\xa5\xd1\x88\xfa"</span>;
749<span class=
750"keyword">char</span>* invalid = find_invalid(utf_invalid, utf_invalid + <span class=
751"literal">6</span>);
752assert (invalid == utf_invalid + <span class="literal">5</span>);
753</pre>
    <p>
      This function is typically used to make sure a UTF-8 string is valid before
      processing it with other functions. It is especially important to call it before
      performing any of the <em>unchecked</em> operations on the string.
    </p>
759    <h4>
760      utf8::is_valid
761    </h4>
762    <p class="version">
763    Available in version 1.0 and later.
764    </p>
765    <p>
766      Checks whether a sequence of octets is a valid UTF-8 string.
767    </p>
768<pre>
769<span class="keyword">template</span> &lt;<span class=
770"keyword">typename</span> octet_iterator&gt; 
771<span class="keyword">bool</span> is_valid(octet_iterator start, octet_iterator end);
772   
773</pre>
774    <p>
775      <code>start</code>: an iterator pointing to the beginning of the UTF-8 string to
776      test for validity.<br>
       <code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to test
778      for validity.<br>
779       <span class="return_value">Return value</span>: <code>true</code> if the sequence
780      is a valid UTF-8 string; <code>false</code> if not.
781    </p>
    <p>
      Example of use:
    </p>
783<pre>
784<span class="keyword">char</span> utf_invalid[] = <span class=
785"literal">"\xe6\x97\xa5\xd1\x88\xfa"</span>;
786<span class="keyword">bool</span> bvalid = is_valid(utf_invalid, utf_invalid + <span
787class="literal">6</span>);
788assert (bvalid == false);
789</pre>
    <p>
      <code>is_valid</code> is a shorthand for <code>find_invalid(start, end) ==
      end;</code>. You may want to use it to make sure that a byte sequence is a valid
      UTF-8 string without the need to know where it fails if it is not valid.
    </p>
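    <p>
      The relationship between the two functions can be sketched without the library.
      The helpers <code>find_invalid_sketch</code> and <code>is_valid_sketch</code> below
      are hypothetical names, written only for illustration and restricted to pointers to
      <code>unsigned char</code>; they are not the library's implementation. The sketch
      assumes that an octet is reported as invalid when it starts a sequence that is
      truncated, overlong, a UTF-16 surrogate, or beyond <code>0x10ffff</code>.
    </p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of the library: returns a pointer to the first
// octet that does not start a well-formed UTF-8 sequence, or `end` if none.
inline const unsigned char* find_invalid_sketch(const unsigned char* it,
                                                const unsigned char* end) {
    while (it != end) {
        const unsigned char lead = *it;
        std::size_t length;    // total octets in this sequence
        std::uint32_t cp;      // decoded code point
        std::uint32_t min_cp;  // smallest value that needs `length` octets
        if (lead < 0x80)                { length = 1; cp = lead;        min_cp = 0x0; }
        else if ((lead & 0xe0) == 0xc0) { length = 2; cp = lead & 0x1f; min_cp = 0x80; }
        else if ((lead & 0xf0) == 0xe0) { length = 3; cp = lead & 0x0f; min_cp = 0x800; }
        else if ((lead & 0xf8) == 0xf0) { length = 4; cp = lead & 0x07; min_cp = 0x10000; }
        else return it;  // stray continuation octet or illegal lead octet

        if (static_cast<std::size_t>(end - it) < length) return it;  // truncated
        for (std::size_t i = 1; i < length; ++i) {
            if ((it[i] & 0xc0) != 0x80) return it;  // not a continuation octet
            cp = (cp << 6) | (it[i] & 0x3f);
        }
        if (cp < min_cp) return it;                   // overlong encoding
        if (cp >= 0xd800 && cp <= 0xdfff) return it;  // UTF-16 surrogate
        if (cp > 0x10ffff) return it;                 // beyond the Unicode range
        it += length;
    }
    return end;
}

// The shorthand: the string is valid when no invalid octet is found.
inline bool is_valid_sketch(const unsigned char* start, const unsigned char* end) {
    return find_invalid_sketch(start, end) == end;
}
```

    <p>
      Calling <code>is_valid_sketch</code> first and using the position returned by
      <code>find_invalid_sketch</code> only when it fails keeps the common, valid case
      to a single boolean test.
    </p>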
795    <h4>
796      utf8::replace_invalid
797    </h4>
798    <p class="version">
799    Available in version 2.0 and later.
800    </p>
801    <p>
802      Replaces all invalid UTF-8 sequences within a string with a replacement marker.
803    </p>
804<pre>
805<span class="keyword">template</span> &lt;<span class=
806"keyword">typename</span> octet_iterator, <span class=
807"keyword">typename</span> output_iterator&gt;
808output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, uint32_t replacement);
809<span class="keyword">template</span> &lt;<span class=
810"keyword">typename</span> octet_iterator, <span class=
811"keyword">typename</span> output_iterator&gt;
812output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out);
813   
814</pre>
815    <p>
816      <code>start</code>: an iterator pointing to the beginning of the UTF-8 string to
817      look for invalid UTF-8 sequences.<br>
       <code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to look
      for invalid UTF-8 sequences.<br>
       <code>out</code>: An output iterator to the range where the result of replacement
      is stored.<br>
       <code>replacement</code>: A Unicode code point for the replacement marker. The
      version without this parameter assumes the value <code>0xfffd</code>.<br>
824       <span class="return_value">Return value</span>: An iterator pointing to the place
825      after the UTF-8 string with replaced invalid sequences.
826    </p>
827    <p>
828      Example of use:
829    </p>
830<pre>
831<span class="keyword">char</span> invalid_sequence[] = <span class=
832"literal">"a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z"</span>;
833vector&lt;<span class="keyword">char</span>&gt; replace_invalid_result;
834replace_invalid (invalid_sequence, invalid_sequence + sizeof(invalid_sequence), back_inserter(replace_invalid_result), <span
835 class="literal">'?'</span>);
<span class="keyword">bool</span> bvalid = is_valid(replace_invalid_result.begin(), replace_invalid_result.end());
837assert (bvalid);
838<span class="keyword">char</span>* fixed_invalid_sequence = <span class=
839"literal">"a????z"</span>;
840assert (std::equal(replace_invalid_result.begin(), replace_invalid_result.end(), fixed_invalid_sequence));
841</pre>
842    <p>
843      <code>replace_invalid</code> does not perform in-place replacement of invalid
844      sequences. Rather, it produces a copy of the original string with the invalid
845      sequences replaced with a replacement marker. Therefore, <code>out</code> must not
846      be in the <code>[start, end]</code> range.
847    </p>
    <p>
      If <code>end</code> does not point to the past-the-end of a UTF-8 sequence, a
      <code>utf8::not_enough_room</code> exception is thrown.
    </p>
852    <h4>
853      utf8::is_bom
854    </h4>
855    <p class="version">
856    Available in version 1.0 and later.
857    </p>
858    <p>
      Checks whether a sequence of three octets is a UTF-8 byte order mark (BOM).
860    </p>
861<pre>
862<span class="keyword">template</span> &lt;<span class=
863"keyword">typename</span> octet_iterator&gt; 
864<span class="keyword">bool</span> is_bom (octet_iterator it);
865</pre>
866    <p>
867      <code>it</code>: beginning of the 3-octet sequence to check<br>
868       <span class="return_value">Return value</span>: <code>true</code> if the sequence
      is the UTF-8 byte order mark; <code>false</code> if not.
870    </p>
871    <p>
872      Example of use:
873    </p>
874<pre>
875<span class="keyword">unsigned char</span> byte_order_mark[] = {<span class=
876"literal">0xef</span>, <span class="literal">0xbb</span>, <span class=
877"literal">0xbf</span>};
878<span class="keyword">bool</span> bbom = is_bom(byte_order_mark);
879assert (bbom == <span class="literal">true</span>);
880</pre>
881    <p>
882      The typical use of this function is to check the first three bytes of a file. If
883      they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8
884      encoded text.
885    </p>
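    <p>
      A minimal, library-free sketch of that typical use follows. Both
      <code>is_bom_sketch</code> and <code>skip_bom_sketch</code> are hypothetical names
      introduced here, and the file contents are assumed to be already loaded into a
      <code>std::string</code>. Because the check reads three octets unconditionally, the
      caller tests the length first.
    </p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if the three octets starting at `it` are EF BB BF.
template <typename octet_iterator>
bool is_bom_sketch(octet_iterator it) {
    return static_cast<unsigned char>(*it++) == 0xef &&
           static_cast<unsigned char>(*it++) == 0xbb &&
           static_cast<unsigned char>(*it)   == 0xbf;
}

// Hypothetical helper: returns the offset at which the actual text starts.
inline std::size_t skip_bom_sketch(const std::string& data) {
    // Guard against inputs shorter than three octets before looking at them.
    if (data.size() >= 3 && is_bom_sketch(data.begin())) return 3;
    return 0;
}
```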
886    <h3 id="typesutf8">
887      Types From utf8 Namespace
888    </h3>
889    <h4>
890      utf8::iterator
891    </h4>
892    <p class="version">
893    Available in version 2.0 and later.
894    </p>
895    <p>
896      Adapts the underlying octet iterator to iterate over the sequence of code points,
897      rather than raw octets.
898    </p>
899<pre>
900<span class="keyword">template</span> &lt;<span class="keyword">typename</span> octet_iterator&gt;
901<span class="keyword">class</span> iterator;
902</pre>
903   
904    <h5>Member functions</h5>
905      <dl>
      <dt><code>iterator();</code> <dd> the default constructor; the underlying <code>octet_iterator</code> is
907      constructed with its default constructor.
908      <dt><code><span class="keyword">explicit</span> iterator (const octet_iterator&amp; octet_it,
909                         const octet_iterator&amp; range_start,
910                         const octet_iterator&amp; range_end);</code> <dd> a constructor
911      that initializes the underlying <code>octet_iterator</code> with <code>octet_it</code>
912      and sets the range in which the iterator is considered valid.
913      <dt><code>octet_iterator base () <span class="keyword">const</span>;</code> <dd> returns the
914      underlying <code>octet_iterator</code>.
      <dt><code>uint32_t operator * () <span class="keyword">const</span>;</code> <dd> decodes the UTF-8 sequence
      the underlying <code>octet_iterator</code> is pointing to and returns the code point.
      <dt><code><span class="keyword">bool operator</span> == (const iterator&amp; rhs)
      <span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span>
      if the two underlying iterators are equal.
      <dt><code><span class="keyword">bool operator</span> != (const iterator&amp; rhs)
      <span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span>
      if the two underlying iterators are not equal.
923      <dt><code>iterator&amp; <span class="keyword">operator</span> ++ (); </code> <dd> the prefix increment - moves
924      the iterator to the next UTF-8 encoded code point.
925      <dt><code>iterator <span class="keyword">operator</span> ++ (<span class="keyword">int</span>); </code> <dd>
926      the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one.
927      <dt><code>iterator&amp; <span class="keyword">operator</span> -- (); </code> <dd> the prefix decrement - moves
928      the iterator to the previous UTF-8 encoded code point.
929      <dt><code>iterator <span class="keyword">operator</span> -- (<span class="keyword">int</span>); </code> <dd>
930      the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one.
931      </dl>
932      <p>
933      Example of use:
934      </p>
935<pre>
936<span class="keyword">char</span>* threechars = <span class="literal">"\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88"</span>;
937utf8::iterator&lt;<span class="keyword">char</span>*&gt; it(threechars, threechars, threechars + <span class="literal">9</span>);
938utf8::iterator&lt;<span class="keyword">char</span>*&gt; it2 = it;
939assert (it2 == it);
940assert (*it == <span class="literal">0x10346</span>);
941assert (*(++it) == <span class="literal">0x65e5</span>);
942assert ((*it++) == <span class="literal">0x65e5</span>);
943assert (*it == <span class="literal">0x0448</span>);
944assert (it != it2);
945utf8::iterator&lt;<span class="keyword">char</span>*&gt; endit (threechars + <span class="literal">9</span>, threechars, threechars + <span class="literal">9</span>); 
946assert (++it == endit);
947assert (*(--it) == <span class="literal">0x0448</span>);
948assert ((*it--) == <span class="literal">0x0448</span>);
949assert (*it == <span class="literal">0x65e5</span>);
950assert (--it == utf8::iterator&lt;<span class="keyword">char</span>*&gt;(threechars, threechars, threechars + <span class="literal">9</span>));
951assert (*it == <span class="literal">0x10346</span>);
952</pre>
953      <p>
      The purpose of the <code>utf8::iterator</code> adapter is to enable easy iteration as well as the use of STL
955      algorithms with UTF-8 encoded strings. Increment and decrement operators are implemented in terms of
956      <code>utf8::next()</code> and <code>utf8::prior()</code> functions.
957      </p>
958      <p>
      Note that the <code>utf8::iterator</code> adapter is a checked iterator. It operates on the range specified in
      the constructor; any attempt to go out of that range will result in an exception. Even the comparison operators
      require both iterator objects to be constructed against the same range; otherwise an exception is thrown. Typically,
962      the range will be determined by sequence container functions <code>begin</code> and <code>end</code>, i.e.:
963      </p>
964<pre>
965std::string s = <span class="literal">"example"</span>;
utf8::iterator&lt;std::string::iterator&gt; i (s.begin(), s.begin(), s.end());
967</pre>
968    <h3 id="fununchecked">
969      Functions From utf8::unchecked Namespace
970    </h3>
971    <h4>
972      utf8::unchecked::append
973    </h4>
974    <p class="version">
975    Available in version 1.0 and later.
976    </p>
977    <p>
978      Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence
979      to a UTF-8 string.
980    </p>
981<pre>
982<span class="keyword">template</span> &lt;<span class=
983"keyword">typename</span> octet_iterator&gt;
984octet_iterator append(uint32_t cp, octet_iterator result);
985   
986</pre>
987    <p>
988      <code>cp</code>: A 32 bit integer representing a code point to append to the
989      sequence.<br>
990       <code>result</code>: An output iterator to the place in the sequence where to
991      append the code point.<br>
992       <span class="return_value">Return value</span>: An iterator pointing to the place
993      after the newly appended sequence.
994    </p>
995    <p>
996      Example of use:
997    </p>
998<pre>
999<span class="keyword">unsigned char</span> u[<span class="literal">5</span>] = {<span
1000class="literal">0</span>,<span class="literal">0</span>,<span class=
1001"literal">0</span>,<span class="literal">0</span>,<span class="literal">0</span>};
1002<span class="keyword">unsigned char</span>* end = unchecked::append(<span class=
1003"literal">0x0448</span>, u);
1004assert (u[<span class="literal">0</span>] == <span class=
1005"literal">0xd1</span> &amp;&amp; u[<span class="literal">1</span>] == <span class=
1006"literal">0x88</span> &amp;&amp; u[<span class="literal">2</span>] == <span class=
1007"literal">0</span> &amp;&amp; u[<span class="literal">3</span>] == <span class=
1008"literal">0</span> &amp;&amp; u[<span class="literal">4</span>] == <span class=
1009"literal">0</span>);
1010</pre>
1011    <p>
1012      This is a faster but less safe version of <code>utf8::append</code>. It does not
1013      check for validity of the supplied code point, and may produce an invalid UTF-8
1014      sequence.
1015    </p>
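    <p>
      The bit layout behind this encoding step can be sketched as follows.
      <code>append_sketch</code> is a hypothetical stand-in written for illustration, not
      the library's source. Like the unchecked function it performs no validity checks,
      so a surrogate or a value above <code>0x10ffff</code> produces ill-formed output.
    </p>

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper: writes the UTF-8 octets for `cp` through `result` and
// returns the advanced iterator. The code point is not validated.
template <typename octet_iterator>
octet_iterator append_sketch(std::uint32_t cp, octet_iterator result) {
    if (cp < 0x80) {              // 1 octet:  0xxxxxxx
        *result++ = static_cast<unsigned char>(cp);
    } else if (cp < 0x800) {      // 2 octets: 110xxxxx 10xxxxxx
        *result++ = static_cast<unsigned char>((cp >> 6) | 0xc0);
        *result++ = static_cast<unsigned char>((cp & 0x3f) | 0x80);
    } else if (cp < 0x10000) {    // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        *result++ = static_cast<unsigned char>((cp >> 12) | 0xe0);
        *result++ = static_cast<unsigned char>(((cp >> 6) & 0x3f) | 0x80);
        *result++ = static_cast<unsigned char>((cp & 0x3f) | 0x80);
    } else {                      // 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        *result++ = static_cast<unsigned char>((cp >> 18) | 0xf0);
        *result++ = static_cast<unsigned char>(((cp >> 12) & 0x3f) | 0x80);
        *result++ = static_cast<unsigned char>(((cp >> 6) & 0x3f) | 0x80);
        *result++ = static_cast<unsigned char>((cp & 0x3f) | 0x80);
    }
    return result;
}
```

    <p>
      The sketch accepts a raw pointer into a buffer as well as an output iterator such
      as <code>std::back_inserter</code> over a <code>std::vector&lt;unsigned char&gt;</code>.
    </p>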
1016    <h4>
1017      utf8::unchecked::next
1018    </h4>
1019    <p class="version">
1020    Available in version 1.0 and later.
1021    </p>
1022    <p>
1023      Given the iterator to the beginning of a UTF-8 sequence, it returns the code point
1024      and moves the iterator to the next position.
1025    </p>
1026<pre>
1027<span class="keyword">template</span> &lt;<span class=
1028"keyword">typename</span> octet_iterator&gt;
1029uint32_t next(octet_iterator&amp; it);
1030   
1031</pre>
1032    <p>
      <code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
1034      encoded code point. After the function returns, it is incremented to point to the
1035      beginning of the next code point.<br>
1036       <span class="return_value">Return value</span>: the 32 bit representation of the
1037      processed UTF-8 code point.
1038    </p>
1039    <p>
1040      Example of use:
1041    </p>
1042<pre>
1043<span class="keyword">char</span>* twochars = <span class=
1044"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1045<span class="keyword">char</span>* w = twochars;
1046<span class="keyword">int</span> cp = unchecked::next(w);
1047assert (cp == <span class="literal">0x65e5</span>);
1048assert (w == twochars + <span class="literal">3</span>);
1049</pre>
1050    <p>
1051      This is a faster but less safe version of <code>utf8::next</code>. It does not
1052      check for validity of the supplied UTF-8 sequence.
1053    </p>
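    <p>
      To make the missing checks concrete, here is a minimal sketch of an unchecked
      decoding step. <code>next_sketch</code> is a hypothetical name and the code is not
      taken from the library. It trusts the lead octet to announce the sequence length
      and reads the continuation octets without inspecting them, so it must only be given
      input that is already known to be valid.
    </p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decodes the code point starting at `it` and advances
// `it` past it. Neither the octets nor any range boundary are checked.
template <typename octet_iterator>
std::uint32_t next_sketch(octet_iterator& it) {
    const std::uint32_t lead = static_cast<unsigned char>(*it++);
    int trailing;      // number of continuation octets that follow
    std::uint32_t cp;  // payload bits collected so far
    if (lead < 0x80)      { trailing = 0; cp = lead; }
    else if (lead < 0xe0) { trailing = 1; cp = lead & 0x1f; }
    else if (lead < 0xf0) { trailing = 2; cp = lead & 0x0f; }
    else                  { trailing = 3; cp = lead & 0x07; }
    for (; trailing > 0; --trailing) {
        // Each continuation octet contributes its low six bits.
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3f);
    }
    return cp;
}
```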
1054    <h4>
1055      utf8::unchecked::peek_next
1056    </h4>
1057    <p class="version">
1058    Available in version 2.1 and later.
1059    </p>
1060    <p>
1061      Given the iterator to the beginning of a UTF-8 sequence, it returns the code point.
1062    </p>
1063<pre>
1064<span class="keyword">template</span> &lt;<span class=
1065"keyword">typename</span> octet_iterator&gt;
1066uint32_t peek_next(octet_iterator it);
1067   
1068</pre>
1069    <p>
      <code>it</code>: an iterator pointing to the beginning of a UTF-8
1071      encoded code point.<br>
1072       <span class="return_value">Return value</span>: the 32 bit representation of the
1073      processed UTF-8 code point.
1074    </p>
1075    <p>
1076      Example of use:
1077    </p>
1078<pre>
1079<span class="keyword">char</span>* twochars = <span class=
1080"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1081<span class="keyword">char</span>* w = twochars;
1082<span class="keyword">int</span> cp = unchecked::peek_next(w);
1083assert (cp == <span class="literal">0x65e5</span>);
1084assert (w == twochars);
1085</pre>
1086    <p>
1087      This is a faster but less safe version of <code>utf8::peek_next</code>. It does not
1088      check for validity of the supplied UTF-8 sequence.
1089    </p>
1090    <h4>
1091      utf8::unchecked::prior
1092    </h4>
1093    <p class="version">
1094    Available in version 1.02 and later.
1095    </p>
1096    <p>
      Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
      decreases the iterator until it hits the beginning of the previous UTF-8 encoded
      code point and returns the 32-bit representation of the code point.
1100    </p>
1101<pre>
1102<span class="keyword">template</span> &lt;<span class=
1103"keyword">typename</span> octet_iterator&gt;
1104uint32_t prior(octet_iterator&amp; it);
1105   
1106</pre>
1107    <p>
      <code>it</code>: a reference to an iterator pointing to an octet within a UTF-8 encoded string.
1109      After the function returns, it is decremented to point to the beginning of the
1110      previous code point.<br>
1111       <span class="return_value">Return value</span>: the 32 bit representation of the
1112      previous code point.
1113    </p>
1114    <p>
1115      Example of use:
1116    </p>
1117<pre>
1118<span class="keyword">char</span>* twochars = <span class=
1119"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1120<span class="keyword">char</span>* w = twochars + <span class="literal">3</span>;
1121<span class="keyword">int</span> cp = unchecked::prior (w);
1122assert (cp == <span class="literal">0x65e5</span>);
1123assert (w == twochars);
1124</pre>
1125    <p>
1126      This is a faster but less safe version of <code>utf8::prior</code>. It does not
1127      check for validity of the supplied UTF-8 sequence and offers no boundary checking.
1128    </p>
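    <p>
      The backward step can be sketched as follows. The names
      <code>decode_at_sketch</code> and <code>prior_sketch</code> are hypothetical and the
      code is an illustration, not the library's source. The iterator is moved back over
      continuation octets (bit pattern <code>10xxxxxx</code>) until a lead octet is
      reached. Nothing stops the loop at the start of the buffer, which is why the caller
      must guarantee that a lead octet precedes the starting position.
    </p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decodes the code point at `it`. The iterator is taken
// by value, so the caller's copy does not move.
template <typename octet_iterator>
std::uint32_t decode_at_sketch(octet_iterator it) {
    const std::uint32_t lead = static_cast<unsigned char>(*it++);
    int trailing;
    std::uint32_t cp;
    if (lead < 0x80)      { trailing = 0; cp = lead; }
    else if (lead < 0xe0) { trailing = 1; cp = lead & 0x1f; }
    else if (lead < 0xf0) { trailing = 2; cp = lead & 0x0f; }
    else                  { trailing = 3; cp = lead & 0x07; }
    for (; trailing > 0; --trailing) {
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3f);
    }
    return cp;
}

// Hypothetical helper: moves `it` back to the previous lead octet and returns
// the code point that starts there. No boundary or validity checks are made.
template <typename octet_iterator>
std::uint32_t prior_sketch(octet_iterator& it) {
    do {
        --it;
    } while ((static_cast<unsigned char>(*it) & 0xc0) == 0x80);
    return decode_at_sketch(it);
}
```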
1129    <h4>
1130      utf8::unchecked::previous (deprecated, see utf8::unchecked::prior)
1131    </h4>
1132    <p class="version">
1133    Deprecated in version 1.02 and later.
1134    </p>
1135    <p>
      Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
      decreases the iterator until it hits the beginning of the previous UTF-8 encoded
      code point and returns the 32-bit representation of the code point.
1139    </p>
1140<pre>
1141<span class="keyword">template</span> &lt;<span class=
1142"keyword">typename</span> octet_iterator&gt;
1143uint32_t previous(octet_iterator&amp; it);
1144   
1145</pre>
1146    <p>
      <code>it</code>: a reference to an iterator pointing to an octet within a UTF-8 encoded string.
1148      After the function returns, it is decremented to point to the beginning of the
1149      previous code point.<br>
1150       <span class="return_value">Return value</span>: the 32 bit representation of the
1151      previous code point.
1152    </p>
1153    <p>
1154      Example of use:
1155    </p>
1156<pre>
1157<span class="keyword">char</span>* twochars = <span class=
1158"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1159<span class="keyword">char</span>* w = twochars + <span class="literal">3</span>;
1160<span class="keyword">int</span> cp = unchecked::previous (w);
1161assert (cp == <span class="literal">0x65e5</span>);
1162assert (w == twochars);
1163</pre>
    <p>
     This function is deprecated only for consistency with the "checked"
     versions, where <code>prior</code> should be used instead of <code>previous</code>.
     In fact, <code>unchecked::previous</code> behaves exactly the same as
     <code>unchecked::prior</code>.
    </p>
1170    <p>
1171      This is a faster but less safe version of <code>utf8::previous</code>. It does not
1172      check for validity of the supplied UTF-8 sequence and offers no boundary checking.
1173    </p>
1174    <h4>
1175      utf8::unchecked::advance
1176    </h4>
1177    <p class="version">
1178    Available in version 1.0 and later.
1179    </p>
1180    <p>
      Advances an iterator by the specified number of code points within a UTF-8
1182      sequence.
1183    </p>
1184<pre>
1185<span class="keyword">template</span> &lt;<span class=
1186"keyword">typename</span> octet_iterator, typename distance_type&gt;
1187<span class="keyword">void</span> advance (octet_iterator&amp; it, distance_type n);
1188   
1189</pre>
1190    <p>
      <code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
1192      encoded code point. After the function returns, it is incremented to point to the
1193      nth following code point.<br>
1194       <code>n</code>: a positive integer that shows how many code points we want to
      advance.
1196    </p>
1197    <p>
1198      Example of use:
1199    </p>
1200<pre>
1201<span class="keyword">char</span>* twochars = <span class=
1202"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1203<span class="keyword">char</span>* w = twochars;
1204unchecked::advance (w, <span class="literal">2</span>);
1205assert (w == twochars + <span class="literal">5</span>);
1206</pre>
1207    <p>
1208      This function works only "forward". In case of a negative <code>n</code>, there is
1209      no effect.
1210    </p>
1211    <p>
1212      This is a faster but less safe version of <code>utf8::advance</code>. It does not
1213      check for validity of the supplied UTF-8 sequence and offers no boundary checking.
1214    </p>
1215    <h4>
1216      utf8::unchecked::distance
1217    </h4>
1218    <p class="version">
1219    Available in version 1.0 and later.
1220    </p>
1221    <p>
      Given the iterators to two UTF-8 encoded code points in a sequence, returns the
1223      number of code points between them.
1224    </p>
1225<pre>
1226<span class="keyword">template</span> &lt;<span class=
1227"keyword">typename</span> octet_iterator&gt;
1228<span class=
1229"keyword">typename</span> std::iterator_traits&lt;octet_iterator&gt;::difference_type distance (octet_iterator first, octet_iterator last);
1230</pre>
1231    <p>
      <code>first</code>: an iterator to the beginning of a UTF-8 encoded code point.<br>
       <code>last</code>: an iterator to the past-the-end position of the last UTF-8 encoded
      code point in the sequence whose length we are trying to determine. It can be the
      beginning of a new code point, or not.<br>
       <span class="return_value">Return value</span>: the distance between the iterators,
      in code points.
1238    </p>
1239    <p>
1240      Example of use:
1241    </p>
1242<pre>
1243<span class="keyword">char</span>* twochars = <span class=
1244"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1245size_t dist = utf8::unchecked::distance(twochars, twochars + <span class=
1246"literal">5</span>);
1247assert (dist == <span class="literal">2</span>);
1248</pre>
1249    <p>
1250      This is a faster but less safe version of <code>utf8::distance</code>. It does not
1251      check for validity of the supplied UTF-8 sequence.
1252    </p>
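    <p>
      For input that is already known to be valid, one simple way to obtain the same
      count is to tally the octets that are not continuation octets, since every code
      point has exactly one lead octet. The sketch below uses that alternative technique;
      <code>distance_sketch</code> is a hypothetical name, and this is not necessarily how
      the library computes the result.
    </p>

```cpp
#include <cassert>
#include <iterator>

// Hypothetical helper: counts code points in [first, last) by counting the
// octets that are not continuation octets (10xxxxxx). Assumes valid UTF-8.
template <typename octet_iterator>
typename std::iterator_traits<octet_iterator>::difference_type
distance_sketch(octet_iterator first, octet_iterator last) {
    typename std::iterator_traits<octet_iterator>::difference_type count = 0;
    for (; first != last; ++first) {
        if ((static_cast<unsigned char>(*first) & 0xc0) != 0x80) ++count;
    }
    return count;
}
```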
1253    <h4>
1254      utf8::unchecked::utf16to8
1255    </h4>
1256    <p class="version">
1257    Available in version 1.0 and later.
1258    </p>
1259    <p>
1260      Converts a UTF-16 encoded string to UTF-8.
1261    </p>
1262<pre>
1263<span class="keyword">template</span> &lt;<span class=
1264"keyword">typename</span> u16bit_iterator, <span class=
1265"keyword">typename</span> octet_iterator&gt;
1266octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);
1267   
1268</pre>
1269    <p>
1270      <code>start</code>: an iterator pointing to the beginning of the UTF-16 encoded
1271      string to convert.<br>
       <code>end</code>: an iterator pointing to past-the-end of the UTF-16 encoded
1273      string to convert.<br>
1274       <code>result</code>: an output iterator to the place in the UTF-8 string where to
1275      append the result of conversion.<br>
1276       <span class="return_value">Return value</span>: An iterator pointing to the place
1277      after the appended UTF-8 string.
1278    </p>
1279    <p>
1280      Example of use:
1281    </p>
1282<pre>
1283<span class="keyword">unsigned short</span> utf16string[] = {<span class=
1284"literal">0x41</span>, <span class="literal">0x0448</span>, <span class=
1285"literal">0x65e5</span>, <span class="literal">0xd834</span>, <span class=
1286"literal">0xdd1e</span>};
1287vector&lt;<span class="keyword">unsigned char</span>&gt; utf8result;
1288unchecked::utf16to8(utf16string, utf16string + <span class=
1289"literal">5</span>, back_inserter(utf8result));
1290assert (utf8result.size() == <span class="literal">10</span>);   
1291</pre>
1292    <p>
1293      This is a faster but less safe version of <code>utf8::utf16to8</code>. It does not
1294      check for validity of the supplied UTF-16 sequence.
1295    </p>
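    <p>
      The arithmetic behind the example above, where five UTF-16 units become ten octets,
      can be sketched in two steps: surrogate pairs are combined into one code point, and
      each code point then needs one to four octets. The helpers
      <code>utf16_to_code_points_sketch</code> and <code>utf8_length_sketch</code> are
      hypothetical and only illustrate the idea. As in the unchecked function, a lead
      surrogate is assumed to be followed by a trail surrogate.
    </p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: decodes UTF-16 units into code points without checking
// that surrogates are correctly paired.
template <typename u16bit_iterator>
std::vector<std::uint32_t> utf16_to_code_points_sketch(u16bit_iterator start,
                                                       u16bit_iterator end) {
    std::vector<std::uint32_t> code_points;
    while (start != end) {
        std::uint32_t cp = static_cast<std::uint16_t>(*start++);
        if (cp >= 0xd800 && cp <= 0xdbff) {
            // Lead surrogate: combine it with the trail surrogate that follows.
            const std::uint32_t trail = static_cast<std::uint16_t>(*start++);
            cp = 0x10000 + ((cp - 0xd800) << 10) + (trail - 0xdc00);
        }
        code_points.push_back(cp);
    }
    return code_points;
}

// Hypothetical helper: number of octets the UTF-8 encoding of `cp` occupies.
inline std::size_t utf8_length_sketch(std::uint32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}
```

    <p>
      For the five units in the example, the pair <code>0xd834 0xdd1e</code> combines to
      <code>0x1d11e</code>, and the four code points need 1 + 2 + 3 + 4 = 10 octets.
    </p>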
1296    <h4>
1297      utf8::unchecked::utf8to16
1298    </h4>
1299    <p class="version">
1300    Available in version 1.0 and later.
1301    </p>
1302    <p>
      Converts a UTF-8 encoded string to UTF-16.
1304    </p>
1305<pre>
1306<span class="keyword">template</span> &lt;<span class=
1307"keyword">typename</span> u16bit_iterator, typename octet_iterator&gt;
1308u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);
1309   
1310</pre>
1311    <p>
      <code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
      string to convert.<br>
       <code>end</code>: an iterator pointing to past-the-end of the UTF-8 encoded string
      to convert.<br>
1315       <code>result</code>: an output iterator to the place in the UTF-16 string where to
1316      append the result of conversion.<br>
1317       <span class="return_value">Return value</span>: An iterator pointing to the place
1318      after the appended UTF-16 string.
1319    </p>
1320    <p>
1321      Example of use:
1322    </p>
1323<pre>
1324<span class="keyword">char</span> utf8_with_surrogates[] = <span class=
1325"literal">"\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"</span>;
1326vector &lt;<span class="keyword">unsigned short</span>&gt; utf16result;
1327unchecked::utf8to16(utf8_with_surrogates, utf8_with_surrogates + <span class=
1328"literal">9</span>, back_inserter(utf16result));
1329assert (utf16result.size() == <span class="literal">4</span>);
1330assert (utf16result[<span class="literal">2</span>] == <span class=
1331"literal">0xd834</span>);
1332assert (utf16result[<span class="literal">3</span>] == <span class=
1333"literal">0xdd1e</span>);
1334</pre>
1335    <p>
1336      This is a faster but less safe version of <code>utf8::utf8to16</code>. It does not
1337      check for validity of the supplied UTF-8 sequence.
1338    </p>
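    <p>
      The surrogate pair checked in the example above comes from splitting a code point
      above <code>0xffff</code> into two 16-bit units. A minimal sketch of that step
      follows; <code>append_utf16_sketch</code> is a hypothetical name and is not part of
      the library. No validation is performed on the code point.
    </p>

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper: writes `cp` as one or two UTF-16 units through `result`
// and returns the advanced iterator.
template <typename u16bit_iterator>
u16bit_iterator append_utf16_sketch(std::uint32_t cp, u16bit_iterator result) {
    if (cp > 0xffff) {
        cp -= 0x10000;  // 20 bits remain: high 10 go to the lead, low 10 to the trail
        *result++ = static_cast<std::uint16_t>(0xd800 + (cp >> 10));
        *result++ = static_cast<std::uint16_t>(0xdc00 + (cp & 0x3ff));
    } else {
        *result++ = static_cast<std::uint16_t>(cp);
    }
    return result;
}
```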
1339    <h4>
1340      utf8::unchecked::utf32to8
1341    </h4>
1342    <p class="version">
1343    Available in version 1.0 and later.
1344    </p>
1345    <p>
1346      Converts a UTF-32 encoded string to UTF-8.
1347    </p>
1348<pre>
1349<span class="keyword">template</span> &lt;<span class=
1350"keyword">typename</span> octet_iterator, <span class=
1351"keyword">typename</span> u32bit_iterator&gt;
1352octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);
1353   
1354</pre>
1355    <p>
1356      <code>start</code>: an iterator pointing to the beginning of the UTF-32 encoded
1357      string to convert.<br>
       <code>end</code>: an iterator pointing to past-the-end of the UTF-32 encoded
1359      string to convert.<br>
1360       <code>result</code>: an output iterator to the place in the UTF-8 string where to
1361      append the result of conversion.<br>
1362       <span class="return_value">Return value</span>: An iterator pointing to the place
1363      after the appended UTF-8 string.
1364    </p>
1365    <p>
1366      Example of use:
1367    </p>
1368<pre>
1369<span class="keyword">int</span> utf32string[] = {<span class=
1370"literal">0x448</span>, <span class="literal">0x65e5</span>, <span class=
1371"literal">0x10346</span>, <span class="literal">0</span>};
1372vector&lt;<span class="keyword">unsigned char</span>&gt; utf8result;
unchecked::utf32to8(utf32string, utf32string + <span class=
1374"literal">3</span>, back_inserter(utf8result));
1375assert (utf8result.size() == <span class="literal">9</span>);
1376</pre>
1377    <p>
1378      This is a faster but less safe version of <code>utf8::utf32to8</code>. It does not
1379      check for validity of the supplied UTF-32 sequence.
1380    </p>
1381    <h4>
1382      utf8::unchecked::utf8to32
1383    </h4>
1384    <p class="version">
1385    Available in version 1.0 and later.
1386    </p>
1387    <p>
1388      Converts a UTF-8 encoded string to UTF-32.
1389    </p>
1390<pre>
1391<span class="keyword">template</span> &lt;<span class=
1392"keyword">typename</span> octet_iterator, typename u32bit_iterator&gt;
1393u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);
1394   
1395</pre>
1396    <p>
1397      <code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
1398      string to convert.<br>
       <code>end</code>: an iterator pointing to past-the-end of the UTF-8 encoded string
1400      to convert.<br>
1401       <code>result</code>: an output iterator to the place in the UTF-32 string where to
1402      append the result of conversion.<br>
1403       <span class="return_value">Return value</span>: An iterator pointing to the place
1404      after the appended UTF-32 string.
1405    </p>
1406    <p>
1407      Example of use:
1408    </p>
1409<pre>
1410<span class="keyword">char</span>* twochars = <span class=
1411"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1412vector&lt;<span class="keyword">int</span>&gt; utf32result;
1413unchecked::utf8to32(twochars, twochars + <span class=
1414"literal">5</span>, back_inserter(utf32result));
1415assert (utf32result.size() == <span class="literal">2</span>);
1416</pre>
1417    <p>
1418      This is a faster but less safe version of <code>utf8::utf8to32</code>. It does not
1419      check for validity of the supplied UTF-8 sequence.
1420    </p>
1421    <h3 id="typesunchecked">
1422      Types From utf8::unchecked Namespace
1423    </h3>
1424    <h4>
      utf8::unchecked::iterator
1426    </h4>
1427    <p class="version">
1428    Available in version 2.0 and later.
1429    </p>
1430    <p>
1431      Adapts the underlying octet iterator to iterate over the sequence of code points,
1432      rather than raw octets.
1433    </p>
1434<pre>
1435<span class="keyword">template</span> &lt;<span class="keyword">typename</span> octet_iterator&gt;
1436<span class="keyword">class</span> iterator;
1437</pre>
1438   
1439    <h5>Member functions</h5>
1440      <dl>
      <dt><code>iterator();</code> <dd> the default constructor; the underlying <code>octet_iterator</code> is
      constructed with its default constructor.
      <dt><code><span class="keyword">explicit</span> iterator (const octet_iterator&amp; octet_it);
                         </code> <dd> a constructor
      that initializes the underlying <code>octet_iterator</code> with <code>octet_it</code>.
      <dt><code>octet_iterator base () <span class="keyword">const</span>;</code> <dd> returns the
      underlying <code>octet_iterator</code>.
      <dt><code>uint32_t operator * () <span class="keyword">const</span>;</code> <dd> decodes the UTF-8 sequence
      the underlying <code>octet_iterator</code> is pointing to and returns the code point.
      <dt><code><span class="keyword">bool operator</span> == (const iterator&amp; rhs)
      <span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span>
      if the two underlying iterators are equal.
      <dt><code><span class="keyword">bool operator</span> != (const iterator&amp; rhs)
      <span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span>
      if the two underlying iterators are not equal.
1456      <dt><code>iterator&amp; <span class="keyword">operator</span> ++ (); </code> <dd> the prefix increment - moves
1457      the iterator to the next UTF-8 encoded code point.
1458      <dt><code>iterator <span class="keyword">operator</span> ++ (<span class="keyword">int</span>); </code> <dd>
1459      the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns a copy of the iterator at its previous position.
1460      <dt><code>iterator&amp; <span class="keyword">operator</span> -- (); </code> <dd> the prefix decrement - moves
1461      the iterator to the previous UTF-8 encoded code point.
1462      <dt><code>iterator <span class="keyword">operator</span> -- (<span class="keyword">int</span>); </code> <dd>
1463      the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns a copy of the iterator at its previous position.
1464      </dl>
1465      <p>
1466      Example of use:
1467      </p>
1468<pre>
1469<span class="keyword">const char</span>* threechars = <span class="literal">"\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88"</span>; <span class="comment">// U+10346, U+65E5, U+0448</span>
1470utf8::unchecked::iterator&lt;<span class="keyword">const char</span>*&gt; un_it(threechars);
1471utf8::unchecked::iterator&lt;<span class="keyword">const char</span>*&gt; un_it2 = un_it;
1472assert (un_it2 == un_it);
1473assert (*un_it == <span class="literal">0x10346</span>);
1474assert (*(++un_it) == <span class="literal">0x65e5</span>);
1475assert ((*un_it++) == <span class="literal">0x65e5</span>);
1476assert (*un_it == <span class="literal">0x0448</span>);
1477assert (un_it != un_it2);
1478utf8::unchecked::iterator&lt;<span class="keyword">const char</span>*&gt; un_endit (threechars + <span class="literal">9</span>);
1479assert (++un_it == un_endit);
1480assert (*(--un_it) == <span class="literal">0x0448</span>);
1481assert ((*un_it--) == <span class="literal">0x0448</span>);
1482assert (*un_it == <span class="literal">0x65e5</span>);
1483assert (--un_it == utf8::unchecked::iterator&lt;<span class="keyword">const char</span>*&gt;(threechars));
1484assert (*un_it == <span class="literal">0x10346</span>);
1485</pre>
1486      <p>
1487      This is an unchecked version of <code>utf8::iterator</code>. It is faster in many cases, but offers
1488      no validity or range checks.
1489      </p>
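      <p>
      To illustrate what the absence of validity and range checks means in practice, here is a
      minimal, self-contained sketch of unchecked decoding. It is not the library's implementation:
      <code>decode_next_unchecked</code> and <code>count_code_points</code> are hypothetical names
      introduced only for this illustration, and the code assumes the input is valid, complete UTF-8.
      </p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of the library): decodes the UTF-8 sequence
// that starts at `it`, advances `it` past it, and returns the code point.
// Like an unchecked iterator, it trusts the input: it never looks for the
// end of the buffer and never verifies that trail octets are well formed.
std::uint32_t decode_next_unchecked(const char*& it) {
    const unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) {
        return lead;                      // single octet (ASCII)
    }

    int trailing;                         // number of trail octets to read
    std::uint32_t cp;                     // payload bits of the lead octet
    if ((lead >> 5) == 0x6) {             // 110xxxxx: two-octet sequence
        trailing = 1;
        cp = lead & 0x1F;
    } else if ((lead >> 4) == 0xE) {      // 1110xxxx: three-octet sequence
        trailing = 2;
        cp = lead & 0x0F;
    } else {                              // assumed 11110xxx: four octets
        trailing = 3;
        cp = lead & 0x07;
    }

    for (int i = 0; i < trailing; ++i) {
        // Each trail octet contributes its low six bits.
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
    }
    return cp;
}

// Hypothetical helper: counts the code points in [first, last).
// The range must contain only whole, valid sequences; otherwise the
// loop can run past `last`, which is exactly the risk of skipping checks.
std::size_t count_code_points(const char* first, const char* last) {
    assert(first <= last);
    std::size_t count = 0;
    while (first != last) {
        decode_next_unchecked(first);
        ++count;
    }
    return count;
}
```

      <p>
      With the nine-octet string from the example above, three calls to the sketched decoder would
      yield 0x10346, 0x65e5 and 0x0448 in turn. Truncated or malformed input leads to undefined
      behaviour here, which is why unchecked code should only be given data that has already been validated.
      </p>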
1490    <h2 id="points">
1491      Points of interest
1492    </h2>
1493    <h4>
1494      Design goals and decisions
1495    </h4>
1496    <p>
1497      The library was designed to be:
1498    </p>
1499    <ol>
1500      <li>
1501        Generic: for better or worse, there are many C++ string classes out there, and
1502        the library should work with as many of them as possible.
1503      </li>
1504      <li>
1505        Portable: the library should be portable across both platforms and
1506        compilers. The only non-portable code is a small section that declares unsigned
1507        integers of different sizes: three typedefs. Users of the library can change
1508        them if they don't match their platform. The default setting should work
1509        for Windows (both 32-bit and 64-bit) and most 32-bit and 64-bit Unix derivatives.
1510      </li>
1511      <li>
1512        Lightweight: follow the "pay only for what you use" guideline.
1513      </li>
1514      <li>
1515        Unintrusive: avoid forcing any particular design or even programming style on the
1516        user. This is a library, not a framework.
1517      </li>
1518    </ol>
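    <p>
      As an illustration of the portability point above, the three typedefs could look roughly like
      the sketch below. This is an assumption-laden example rather than a copy of the library's
      source: the namespace <code>portable_sketch</code> is hypothetical, the chosen built-in types are
      only typical defaults, and the compile-time checks require C++11 or later.
    </p>

```cpp
#include <climits>

// Hypothetical namespace used only for this sketch.
namespace portable_sketch {
    // Typical choices for common 32-bit and 64-bit platforms. On a platform
    // where these built-in types have other widths, only these three lines
    // would need to change.
    typedef unsigned char  uint8_t;
    typedef unsigned short uint16_t;
    typedef unsigned int   uint32_t;
}

// Compile-time checks: the build fails if a typedef has the wrong width,
// instead of silently mis-decoding text at run time.
static_assert(sizeof(portable_sketch::uint8_t)  * CHAR_BIT == 8,
              "uint8_t must be exactly 8 bits wide");
static_assert(sizeof(portable_sketch::uint16_t) * CHAR_BIT == 16,
              "uint16_t must be exactly 16 bits wide");
static_assert(sizeof(portable_sketch::uint32_t) * CHAR_BIT == 32,
              "uint32_t must be exactly 32 bits wide");
```

    <p>
      Keeping the platform-specific part this small means that porting to an unusual platform is a
      matter of editing three lines, and the checks report a mismatch at compile time.
    </p>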
1519    <h4>
1520      Alternatives
1521    </h4>
1522    <p>
1523      In case you want to look into other means of working with UTF-8 strings from C++,
1524      here is the list of solutions I am aware of:
1525    </p>
1526    <ol>
1527      <li>
1528        <a href="http://icu.sourceforge.net/">ICU Library</a>. It is very powerful,
1529        complete, feature-rich, mature, and widely used. It is also big, intrusive, and
1530        non-generic, and it doesn't play well with the Standard Library. I definitely
1531        recommend looking at ICU even if you don't plan to use it.
1532      </li>
1533      <li>
1534        <a href=
1535        "http://www.gtkmm.org/gtkmm2/docs/tutorial/html/ch03s04.html">Glib::ustring</a>.
1536        A class specifically made to work with UTF-8 strings, and also to feel like
1537        <code>std::string</code>. If you prefer to have yet another string class in your
1538        code, it may be worth a look. Be aware of the licensing issues, though.
1539      </li>
1540      <li>
1541        Platform dependent solutions: Windows and POSIX have functions to convert strings
1542        from one encoding to another. That is only a subset of what my library offers,
1543        but if that is all you need it may be good enough, especially given the fact that
1544        these functions are mature and tested in production.
1545      </li>
1546    </ol>
1547    <h2 id="conclusion">
1548      Conclusion
1549    </h2>
1550    <p>
1551      Until Unicode becomes officially recognized by the C++ Standard Library, we need to
1552      use other means to work with UTF-8 strings. The template functions described in this
1553      article may be a good step in this direction.
1554    </p>
1555    <h2 id="links">
1556      Links
1557    </h2>
1558    <ol>
1559      <li>
1560        <a href="http://www.unicode.org/">The Unicode Consortium</a>.
1561      </li>
1562      <li>
1563        <a href="http://icu.sourceforge.net/">ICU Library</a>.
1564      </li>
1565      <li>
1566        <a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8 at Wikipedia</a>.
1567      </li>
1568      <li>
1569        <a href="http://www.cl.cam.ac.uk/~mgk25/unicode.html">UTF-8 and Unicode FAQ for
1570        Unix/Linux</a>.
1571      </li>
1572    </ol>
1573  </body>
1574</html>