utf8cpp.html @ 199

Revision 2, 62.9 kB (checked in by yumileroy, 17 years ago)

[svn] * Proper SVN structure

Original author: Neo2003
Date: 2008-10-02 16:23:55-05:00

1	<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2	<html>
3	<head>
4	<meta name="generator" content=
5	"HTML Tidy for Linux/x86 (vers 1st November 2002), see www.w3.org">
6	<meta name="description" content=
7	"A simple, portable and lightweight C++ library for easy handling of UTF-8 encoded strings">
8	<meta name="keywords" content="UTF-8 C++ portable utf8 unicode generic templates">
9	<meta name="author" content="Nemanja Trifunovic">
10	<title>
11	UTF8-CPP: UTF-8 with C++ in a Portable Way
12	</title>
13	<style type="text/css">
14	<!--
15	span.return_value {
16	color: brown;
17	}
18	span.keyword {
19	color: blue;
20	}
21	span.preprocessor {
22	color: navy;
23	}
24	span.literal {
25	color: olive;
26	}
27	span.comment {
28	color: green;
29	}
30	code {
31	font-weight: bold;
32	}
33	ul.toc {
34	list-style-type: none;
35	}
36	p.version {
37	font-size: small;
38	font-style: italic;
39	}
40	-->
41	</style>
42	</head>
43	<body>
44	<h1>
45	UTF8-CPP: UTF-8 with C++ in a Portable Way
46	</h1>
47	<p>
48	<a href="https://sourceforge.net/projects/utfcpp">The Sourceforge project page</a>
49	</p>
50	<div id="toc">
51	<h2>
52	Table of Contents
53	</h2>
54	<ul class="toc">
55	<li>
56	<a href="#introduction">Introduction</a>
57	</li>
58	<li>
59	<a href="#examples">Examples of Use</a>
60	</li>
61	<li>
62	<a href="#reference">Reference</a>
63	<ul class="toc">
64	<li>
65	<a href="#funutf8">Functions From utf8 Namespace </a>
66	</li>
67	<li>
68	<a href="#typesutf8">Types From utf8 Namespace </a>
69	</li>
70	<li>
71	<a href="#fununchecked">Functions From utf8::unchecked Namespace </a>
72	</li>
73	<li>
74	<a href="#typesunchecked">Types From utf8::unchecked Namespace </a>
75	</li>
76	</ul>
77	</li>
78	<li>
79	<a href="#points">Points of Interest</a>
80	</li>
81	<li>
82	<a href="#conclusion">Conclusion</a>
83	</li>
84	<li>
85	<a href="#links">Links</a>
86	</li>
87	</ul>
88	</div>
89	<h2 id="introduction">
90	Introduction
91	</h2>
92	<p>
93	Many C++ developers miss an easy and portable way of handling Unicode encoded
94	strings. C++ Standard is currently Unicode agnostic, and while some work is being
95	done to introduce Unicode to the next incarnation called C++0x, for the moment
96	nothing of the sort is available. In the meantime, developers use 3rd party
97	libraries like ICU, OS specific capabilities, or simply roll out their own
98	solutions.
99	</p>
100	<p>
101	In order to easily handle UTF-8 encoded Unicode strings, I have come up with a small
102	generic library. For anybody used to working with STL algorithms and iterators, it should be
103	easy and natural to use. The code is freely available for any purpose - check out
104	the license at the beginning of the utf8.h file. If you run into
105	bugs or performance issues, please let me know and I'll do my best to address them.
106	</p>
107	<p>
108	The purpose of this article is not to offer an introduction to Unicode in general,
109	and UTF-8 in particular. If you are not familiar with Unicode, be sure to check out
110	<a href="http://www.unicode.org/">Unicode Home Page</a> or some other source of
111	information for Unicode. Also, it is not my aim to advocate the use of UTF-8
112	encoded strings in C++ programs; if you want to handle UTF-8 encoded strings from
113	C++, I am sure you have good reasons for it.
114	</p>
115	<h2 id="examples">
116	Examples of Use
117	</h2>
118	<p>
119	To illustrate the use of this utf8 library, we shall open a file containing UTF-8
120	encoded text, check whether it starts with a byte order mark, read each line into a
121	<code>std::string</code>, check it for validity, convert the text to UTF-16, and
122	back to UTF-8:
123	</p>
124	<pre>
125	<span class="preprocessor">#include <fstream></span>
126	<span class="preprocessor">#include <iostream></span>
127	<span class="preprocessor">#include <string></span>
128	<span class="preprocessor">#include <vector></span>
129	<span class="preprocessor">#include "utf8.h"</span>
130	<span class="keyword">using namespace</span> std;
131	<span class="keyword">int</span> main(<span class="keyword">int</span> argc, <span class="keyword">char</span>** argv)
132	{
133	<span class="keyword">if</span> (argc != <span class="literal">2</span>) {
134	cout << <span class="literal">"\nUsage: docsample filename\n"</span>;
135	<span class="keyword">return</span> <span class="literal">0</span>;
136	}
137	<span class="keyword">const char</span>* test_file_path = argv[1];
138	<span class="comment">// Open the test file (must be UTF-8 encoded)</span>
139	ifstream fs8(test_file_path);
140	<span class="keyword">if</span> (!fs8.is_open()) {
141	cout << <span class=
142	"literal">"Could not open "</span> << test_file_path << endl;
143	<span class="keyword">return</span> <span class="literal">0</span>;
144	}
145	<span class="comment">// Read the first line of the file</span>
146	<span class="keyword">unsigned</span> line_count = <span class="literal">1</span>;
147	string line;
148	<span class="keyword">if</span> (!getline(fs8, line))
149	<span class="keyword">return</span> <span class="literal">0</span>;
150	<span class="comment">// Look for utf-8 byte-order mark at the beginning</span>
151	<span class="keyword">if</span> (line.size() > <span class="literal">2</span>) {
152	<span class="keyword">if</span> (utf8::is_bom(line.c_str()))
153	cout << <span class=
154	"literal">"There is a byte order mark at the beginning of the file\n"</span>;
155	}
156	<span class="comment">// Play with all the lines in the file</span>
157	<span class="keyword">do</span> {
158	<span class="comment">// check for invalid utf-8 (for a simple yes/no check, there is also utf8::is_valid function)</span>
159	string::iterator end_it = utf8::find_invalid(line.begin(), line.end());
160	<span class="keyword">if</span> (end_it != line.end()) {
161	cout << <span class=
162	"literal">"Invalid UTF-8 encoding detected at line "</span> << line_count << <span
163	class="literal">"\n"</span>;
164	cout << <span class=
165	"literal">"This part is fine: "</span> << string(line.begin(), end_it) << <span
166	class="literal">"\n"</span>;
167	}
168	<span class="comment">// Get the line length (at least for the valid part)</span>
169	<span class="keyword">int</span> length = utf8::distance(line.begin(), end_it);
170	cout << <span class=
171	"literal">"Length of line "</span> << line_count << <span class=
172	"literal">" is "</span> << length << <span class="literal">"\n"</span>;
173	<span class="comment">// Convert it to utf-16</span>
174	vector<unsigned short> utf16line;
175	utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));
176	<span class="comment">// And back to utf-8</span>
177	string utf8line;
178	utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8line));
179	<span class="comment">// Confirm that the conversion went OK:</span>
180	<span class="keyword">if</span> (utf8line != string(line.begin(), end_it))
181	cout << <span class=
182	"literal">"Error in UTF-16 conversion at line: "</span> << line_count << <span
183	class="literal">"\n"</span>;
184	getline(fs8, line);
185	line_count++;
186	} <span class="keyword">while</span> (!fs8.eof());
187	<span class="keyword">return</span> <span class="literal">0</span>;
188	}
189	</pre>
190	<p>
191	In the previous code sample, we used the following functions from the
192	<code>utf8</code> namespace: first, we used <code>is_bom</code> to detect a
193	UTF-8 byte order mark at the beginning of the file; then, for each line, we
194	looked for invalid UTF-8 sequences with <code>find_invalid</code>; the number
195	of characters (more precisely, the number of Unicode code points) in each line was
196	determined with <code>utf8::distance</code>; finally, we converted
197	each line to UTF-16 with <code>utf8to16</code> and back to UTF-8 with
198	<code>utf16to8</code>.
199	</p>
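<p>
As a rough illustration of what a byte order mark check involves, the sketch below tests whether a string begins with the three octets EF BB BF. The name <code>starts_with_utf8_bom</code> is hypothetical and is not part of utf8.h; it only shows the idea behind the <code>is_bom</code> call in the sample above.
</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of utf8.h: reports whether a byte string
// starts with the three-octet UTF-8 byte order mark EF BB BF.
bool starts_with_utf8_bom(const std::string& s)
{
    return s.size() >= 3
        && static_cast<unsigned char>(s[0]) == 0xEF
        && static_cast<unsigned char>(s[1]) == 0xBB
        && static_cast<unsigned char>(s[2]) == 0xBF;
}

// Small self-check that exercises the helper.
void bom_example()
{
    assert(starts_with_utf8_bom("\xEF\xBB\xBFhello"));
    assert(!starts_with_utf8_bom("hello"));
    assert(!starts_with_utf8_bom("\xEF\xBB"));   // too short to hold a BOM
}
```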
200	<h2 id="reference">
201	Reference
202	</h2>
203	<h3 id="funutf8">
204	Functions From utf8 Namespace
205	</h3>
206	<h4>
207	utf8::append
208	</h4>
209	<p class="version">
210	Available in version 1.0 and later.
211	</p>
212	<p>
213	Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence
214	to a UTF-8 string.
215	</p>
216	<pre>
217	<span class="keyword">template</span> <<span class=
218	"keyword">typename</span> octet_iterator>
219	octet_iterator append(uint32_t cp, octet_iterator result);
220
221	</pre>
222	<p>
223	<code>cp</code>: A 32 bit integer representing a code point to append to the
224	sequence.<br>
225	<code>result</code>: An output iterator to the place in the sequence where to
226	append the code point.<br>
227	<span class="return_value">Return value</span>: An iterator pointing to the place
228	after the newly appended sequence.
229	</p>
230	<p>
231	Example of use:
232	</p>
233	<pre>
234	<span class="keyword">unsigned char</span> u[<span class="literal">5</span>] = {<span
235	class="literal">0</span>,<span class="literal">0</span>,<span class=
236	"literal">0</span>,<span class="literal">0</span>,<span class="literal">0</span>};
237	<span class="keyword">unsigned char</span>* end = append(<span class=
238	"literal">0x0448</span>, u);
239	assert (u[<span class="literal">0</span>] == <span class=
240	"literal">0xd1</span> && u[<span class="literal">1</span>] == <span class=
241	"literal">0x88</span> && u[<span class="literal">2</span>] == <span class=
242	"literal">0</span> && u[<span class="literal">3</span>] == <span class=
243	"literal">0</span> && u[<span class="literal">4</span>] == <span class=
244	"literal">0</span>);
245	</pre>
246	<p>
247	Note that <code>append</code> does not allocate any memory - it is the burden of
248	the caller to make sure there is enough memory allocated for the operation. To make
249	things more interesting, <code>append</code> can add anywhere between 1 and 4
250	octets to the sequence. In practice, you would most often want to use
251	<code>std::back_inserter</code> to ensure that the necessary memory is allocated.
252	</p>
253	<p>
254	In case of an invalid code point, a <code>utf8::invalid_code_point</code> exception
255	is thrown.
256	</p>
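<p>
To make the "between 1 and 4 octets" remark concrete, here is a minimal sketch of UTF-8 encoding through an output iterator. It is not the library's implementation: <code>encode_utf8</code> and <code>encode_to_string</code> are hypothetical names, and the sketch assumes the code point is valid instead of throwing as <code>append</code> does. Writing through <code>std::back_inserter</code> lets the destination grow as needed.
</p>

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical helper, not the library's code: writes the UTF-8 encoding of
// cp through an output iterator and returns the advanced iterator.
// Assumes cp is a valid code point (not a surrogate, at most 0x10FFFF).
template <typename OutputIt>
OutputIt encode_utf8(std::uint32_t cp, OutputIt out)
{
    if (cp < 0x80) {                       // 1 octet:  0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 octets: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 octets: 1110xxxx + 2 continuations
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 octets: 11110xxx + 3 continuations
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Convenience wrapper: the string grows through back_inserter, so the
// caller does not have to reserve room in advance.
std::string encode_to_string(std::uint32_t cp)
{
    std::string s;
    encode_utf8(cp, std::back_inserter(s));
    return s;
}

// Self-check using the code point from the append example above.
void encode_example()
{
    assert(encode_to_string(0x0448) == "\xd1\x88");
}
```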
257	<h4>
258	utf8::next
259	</h4>
260	<p class="version">
261	Available in version 1.0 and later.
262	</p>
263	<p>
264	Given the iterator to the beginning of the UTF-8 sequence, it returns the code
265	point and moves the iterator to the next position.
266	</p>
267	<pre>
268	<span class="keyword">template</span> <<span class=
269	"keyword">typename</span> octet_iterator>
270	uint32_t next(octet_iterator& it, octet_iterator end);
271
272	</pre>
273	<p>
274	<code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
275	encoded code point. After the function returns, it is incremented to point to the
276	beginning of the next code point.<br>
277	<code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code>
278	becomes equal to <code>end</code> during the extraction of a code point, a
279	<code>utf8::not_enough_room</code> exception is thrown.<br>
280	<span class="return_value">Return value</span>: the 32 bit representation of the
281	processed UTF-8 code point.
282	</p>
283	<p>
284	Example of use:
285	</p>
286	<pre>
287	<span class="keyword">char</span>* twochars = <span class=
288	"literal">"\xe6\x97\xa5\xd1\x88"</span>;
289	<span class="keyword">char</span>* w = twochars;
290	<span class="keyword">int</span> cp = next(w, twochars + <span class="literal">6</span>);
291	assert (cp == <span class="literal">0x65e5</span>);
292	assert (w == twochars + <span class="literal">3</span>);
293	</pre>
294	<p>
295	This function is typically used to iterate through a UTF-8 encoded string.
296	</p>
297	<p>
298	In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
299	thrown.
300	</p>
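<p>
The following sketch shows, under simplifying assumptions, how one code point can be extracted while the iterator is moved forward. <code>decode_next</code> is a hypothetical name and not the library's code: it assumes well-formed UTF-8 and reports only one error, running out of input, whereas <code>utf8::next</code> also validates the sequence.
</p>

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper, not the library's code. Assumes well-formed UTF-8;
// the only error it reports is running out of input.
template <typename OctetIt>
std::uint32_t decode_next(OctetIt& it, OctetIt end)
{
    if (it == end) throw std::out_of_range("no input");
    const std::uint32_t lead = static_cast<unsigned char>(*it++);
    int trailing = 0;            // number of continuation octets to read
    std::uint32_t cp = lead;     // a single-octet sequence is its own value
    if (lead >= 0xF0)      { trailing = 3; cp = lead & 0x07; }
    else if (lead >= 0xE0) { trailing = 2; cp = lead & 0x0F; }
    else if (lead >= 0xC0) { trailing = 1; cp = lead & 0x1F; }
    for (; trailing > 0; --trailing) {
        if (it == end) throw std::out_of_range("truncated sequence");
        // Each continuation octet contributes its low six bits.
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
    }
    return cp;
}

// Iterates over the two code points used in the examples of this section.
void decode_example()
{
    const char text[] = "\xe6\x97\xa5\xd1\x88";
    const char* it = text;
    const char* end = text + 5;
    assert(decode_next(it, end) == 0x65e5);
    assert(it == text + 3);
    assert(decode_next(it, end) == 0x0448);
    assert(it == end);
}
```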
301	<h4>
302	utf8::peek_next
303	</h4>
304	<p class="version">
305	Available in version 2.1 and later.
306	</p>
307	<p>
308	Given the iterator to the beginning of the UTF-8 sequence, it returns the code
309	point for the following sequence without changing the value of the iterator.
310	</p>
311	<pre>
312	<span class="keyword">template</span> <<span class=
313	"keyword">typename</span> octet_iterator>
314	uint32_t peek_next(octet_iterator it, octet_iterator end);
315
316	</pre>
317	<p>
318	<code>it</code>: an iterator pointing to the beginning of a UTF-8
319	encoded code point.<br>
320	<code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code>
321	becomes equal to <code>end</code> during the extraction of a code point, a
322	<code>utf8::not_enough_room</code> exception is thrown.<br>
323	<span class="return_value">Return value</span>: the 32 bit representation of the
324	processed UTF-8 code point.
325	</p>
326	<p>
327	Example of use:
328	</p>
329	<pre>
330	<span class="keyword">char</span>* twochars = <span class=
331	"literal">"\xe6\x97\xa5\xd1\x88"</span>;
332	<span class="keyword">char</span>* w = twochars;
333	<span class="keyword">int</span> cp = peek_next(w, twochars + <span class="literal">6</span>);
334	assert (cp == <span class="literal">0x65e5</span>);
335	assert (w == twochars);
336	</pre>
337	<p>
338	In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
339	thrown.
340	</p>
341	<h4>
342	utf8::prior
343	</h4>
344	<p class="version">
345	Available in version 1.02 and later.
346	</p>
347	<p>
348	Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
349	decreases the iterator until it hits the beginning of the previous UTF-8 encoded
350	code point and returns the 32 bit representation of the code point.
351	</p>
352	<pre>
353	<span class="keyword">template</span> <<span class=
354	"keyword">typename</span> octet_iterator>
355	uint32_t prior(octet_iterator& it, octet_iterator start);
356
357	</pre>
358	<p>
359	<code>it</code>: a reference to an iterator pointing to an octet within a UTF-8 encoded string.
360	After the function returns, it is decremented to point to the beginning of the
361	previous code point.<br>
362	<code>start</code>: an iterator to the beginning of the sequence where the search
363	for the beginning of a code point is performed. It is a
364	safety measure to prevent passing the beginning of the string in the search for a
365	UTF-8 lead octet.<br>
366	<span class="return_value">Return value</span>: the 32 bit representation of the
367	previous code point.
368	</p>
369	<p>
370	Example of use:
371	</p>
372	<pre>
373	<span class="keyword">char</span>* twochars = <span class=
374	"literal">"\xe6\x97\xa5\xd1\x88"</span>;
375	<span class="keyword">char</span>* w = twochars + <span class=
376	"literal">3</span>;
377	<span class="keyword">int</span> cp = prior (w, twochars);
378	assert (cp == <span class="literal">0x65e5</span>);
379	assert (w == twochars);
380	</pre>
381	<p>
382	This function has two purposes: one is to iterate backwards through a UTF-8
383	encoded string. Note that it is usually a better idea to iterate forward instead,
384	since <code>utf8::next</code> is faster. The second purpose is to find the beginning
385	of a UTF-8 sequence if we have a random position within a string.
386	</p>
387	<p>
388	<code>it</code> will typically point to the beginning of
389	a code point, and <code>start</code> will point to the
390	beginning of the string to ensure we don't go backwards too far. <code>it</code> is
391	decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence
392	beginning with that octet is decoded to a 32 bit representation and returned.
393	</p>
394	<p>
395	In case <code>start</code> is reached before a UTF-8 lead octet is hit, or if an
396	invalid UTF-8 sequence is started by the lead octet, an <code>invalid_utf8</code>
397	exception is thrown.
398	</p>
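<p>
The backward search for a lead octet can be sketched as follows. <code>step_back_to_lead</code> is a hypothetical name, not the library's code: it only moves the iterator and reports success with a <code>bool</code>, while <code>utf8::prior</code> additionally decodes the sequence and throws on failure. The sketch relies on the fact that UTF-8 continuation octets always have the bit pattern 10xxxxxx.
</p>

```cpp
#include <cassert>

// Hypothetical helper, not the library's code: moves `it` backwards until it
// points to an octet that is not a continuation octet (10xxxxxx).
// Returns false if `start` is reached without finding such an octet.
template <typename OctetIt>
bool step_back_to_lead(OctetIt& it, OctetIt start)
{
    while (it != start) {
        --it;
        if ((static_cast<unsigned char>(*it) & 0xC0) != 0x80)
            return true;   // found a lead octet or an ASCII octet
    }
    return false;          // never went past the beginning of the string
}

// Walks backwards over the two code points used in the examples above.
void step_back_example()
{
    const char text[] = "\xe6\x97\xa5\xd1\x88";
    const char* it = text + 5;
    assert(step_back_to_lead(it, text) && it == text + 3);
    assert(step_back_to_lead(it, text) && it == text);
    assert(!step_back_to_lead(it, text));   // already at the beginning
}
```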
399	<h4>
400	utf8::previous
401	</h4>
402	<p class="version">
403	Deprecated in version 1.02 and later.
404	</p>
405	<p>
406	Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
407	decreases the iterator until it hits the beginning of the previous UTF-8 encoded
408	code point and returns the 32 bit representation of the code point.
409	</p>
410	<pre>
411	<span class="keyword">template</span> <<span class=
412	"keyword">typename</span> octet_iterator>
413	uint32_t previous(octet_iterator& it, octet_iterator pass_start);
414
415	</pre>
416	<p>
417	<code>it</code>: a reference to an iterator pointing to an octet within a UTF-8 encoded string.
418	After the function returns, it is decremented to point to the beginning of the
419	previous code point.<br>
420	<code>pass_start</code>: an iterator to the point in the sequence where the search
421	for the beginning of a code point is aborted if no result was reached. It is a
422	safety measure to prevent passing the beginning of the string in the search for a
423	UTF-8 lead octet.<br>
424	<span class="return_value">Return value</span>: the 32 bit representation of the
425	previous code point.
426	</p>
427	<p>
428	Example of use:
429	</p>
430	<pre>
431	<span class="keyword">char</span>* twochars = <span class=
432	"literal">"\xe6\x97\xa5\xd1\x88"</span>;
433	<span class="keyword">char</span>* w = twochars + <span class=
434	"literal">3</span>;
435	<span class="keyword">int</span> cp = previous (w, twochars - <span class=
436	"literal">1</span>);
437	assert (cp == <span class="literal">0x65e5</span>);
438	assert (w == twochars);
439	</pre>
440	<p>
441	<code>utf8::previous</code> is deprecated, and <code>utf8::prior</code> should
442	be used instead, although the existing code can continue using this function.
443	The problem is the parameter <code>pass_start</code> that points to the position
444	just before the beginning of the sequence. Standard containers don't have the
445	concept of "pass start" and the function cannot be used with their iterators.
446	</p>
447	<p>
448	<code>it</code> will typically point to the beginning of
449	a code point, and <code>pass_start</code> will point to the octet just before the
450	beginning of the string to ensure we don't go backwards too far. <code>it</code> is
451	decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence
452	beginning with that octet is decoded to a 32 bit representation and returned.
453	</p>
454	<p>
455	In case <code>pass_start</code> is reached before a UTF-8 lead octet is hit, or if an
456	invalid UTF-8 sequence is started by the lead octet, an <code>invalid_utf8</code>
457	exception is thrown.
458	</p>
459	<h4>
460	utf8::advance
461	</h4>
462	<p class="version">
463	Available in version 1.0 and later.
464	</p>
465	<p>
466	Advances an iterator by the specified number of code points within a UTF-8
467	sequence.
468	</p>
469	<pre>
470	<span class="keyword">template</span> <<span class=
471	"keyword">typename</span> octet_iterator, typename distance_type>
472	<span class=
473	"keyword">void</span> advance (octet_iterator& it, distance_type n, octet_iterator end);
474
475	</pre>
476	<p>
477	<code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
478	encoded code point. After the function returns, it is incremented to point to the
479	nth following code point.<br>
480	<code>n</code>: a positive integer that shows how many code points we want to
481	advance.<br>
482	<code>end</code>: end of the UTF-8 sequence to be processed. If <code>it</code>
483	becomes equal to <code>end</code> during the extraction of a code point, a
484	<code>utf8::not_enough_room</code> exception is thrown.<br>
485	</p>
486	<p>
487	Example of use:
488	</p>
489	<pre>
490	<span class="keyword">char</span>* twochars = <span class=
491	"literal">"\xe6\x97\xa5\xd1\x88"</span>;
492	<span class="keyword">char</span>* w = twochars;
493	advance (w, <span class="literal">2</span>, twochars + <span class="literal">6</span>);
494	assert (w == twochars + <span class="literal">5</span>);
495	</pre>
496	<p>
497	This function works only "forward". In case of a negative <code>n</code>, there is
498	no effect.
499	</p>
500	<p>
501	In case of an invalid code point, a <code>utf8::invalid_code_point</code> exception
502	is thrown.
503	</p>
504	<h4>
505	utf8::distance
506	</h4>
507	<p class="version">
508	Available in version 1.0 and later.
509	</p>
510	<p>
511	Given the iterators to two UTF-8 encoded code points in a sequence, returns the
512	number of code points between them.
513	</p>
514	<pre>
515	<span class="keyword">template</span> <<span class=
516	"keyword">typename</span> octet_iterator>
517	<span class=
518	"keyword">typename</span> std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last);
519
520	</pre>
521	<p>
522	<code>first</code>: an iterator to a beginning of a UTF-8 encoded code point.<br>
523	<code>last</code>: an iterator to a "post-end" of the last UTF-8 encoded code
524	point in the sequence whose length we are trying to determine. It can be the
525	beginning of a new code point, or not.<br>
526	<span class="return_value">Return value</span>: the distance between the iterators,
527	in code points.
528	</p>
529	<p>
530	Example of use:
531	</p>
532	<pre>
533	<span class="keyword">char</span>* twochars = <span class=
534	"literal">"\xe6\x97\xa5\xd1\x88"</span>;
535	size_t dist = utf8::distance(twochars, twochars + <span class="literal">5</span>);
536	assert (dist == <span class="literal">2</span>);
537	</pre>
538	<p>
539	This function is used to find the length (in code points) of a UTF-8 encoded
540	string. The reason it is called <em>distance</em>, rather than, say,
541	<em>length</em> is mainly that developers are used to <em>length</em> being an
542	O(1) function. Computing the length of a UTF-8 string is a linear operation, and
543	it looked better to model it after the <code>std::distance</code> algorithm.
544	</p>
545	<p>
546	In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
547	thrown. If <code>last</code> does not point to the past-of-end of a UTF-8 sequence,
548	a <code>utf8::not_enough_room</code> exception is thrown.
549	</p>
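<p>
To see why the length computation is linear, consider this minimal sketch, which visits every octet once and counts those that are not continuation octets. <code>count_code_points</code> is a hypothetical name and the sketch assumes the input is already valid UTF-8; <code>utf8::distance</code> itself reports invalid input with exceptions, as described above.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not the library's code: counts code points in a
// string that is assumed to be valid UTF-8. Every code point has exactly one
// octet that is not a continuation octet (10xxxxxx), so counting those
// octets gives the number of code points. The loop touches each octet once.
std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// Self-check with the five-octet, two-code-point string used above.
void count_example()
{
    assert(count_code_points("\xe6\x97\xa5\xd1\x88") == 2);
}
```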
550	<h4>
551	utf8::utf16to8
552	</h4>
553	<p class="version">
554	Available in version 1.0 and later.
555	</p>
556	<p>
557	Converts a UTF-16 encoded string to UTF-8.
558	</p>
559	<pre>
560	<span class="keyword">template</span> <<span class=
561	"keyword">typename</span> u16bit_iterator, <span class=
562	"keyword">typename</span> octet_iterator>
563	octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);
564
565	</pre>
566	<p>
567	<code>start</code>: an iterator pointing to the beginning of the UTF-16 encoded
568	string to convert.<br>
569	<code>end</code>: an iterator pointing to past-the-end of the UTF-16 encoded
570	string to convert.<br>
571	<code>result</code>: an output iterator to the place in the UTF-8 string where to
572	append the result of conversion.<br>
573	<span class="return_value">Return value</span>: An iterator pointing to the place
574	after the appended UTF-8 string.
575	</p>
576	<p>
577	Example of use:
578	</p>
579	<pre>
580	<span class="keyword">unsigned short</span> utf16string[] = {<span class=
581	"literal">0x41</span>, <span class="literal">0x0448</span>, <span class=
582	"literal">0x65e5</span>, <span class="literal">0xd834</span>, <span class=
583	"literal">0xdd1e</span>};
584	vector<<span class="keyword">unsigned char</span>> utf8result;
585	utf16to8(utf16string, utf16string + <span class=
586	"literal">5</span>, back_inserter(utf8result));
587	assert (utf8result.size() == <span class="literal">10</span>);
588	</pre>
589	<p>
590	In case of invalid UTF-16 sequence, a <code>utf8::invalid_utf16</code> exception is
591	thrown.
592	</p>
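<p>
The last two values in the example above, <code>0xd834</code> and <code>0xdd1e</code>, form a surrogate pair. As a sketch of the arithmetic involved, the hypothetical helper below combines a lead and a trail surrogate into one code point; it is not the library's code and assumes both inputs are already known to be a valid pair.
</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not the library's code: combines a lead surrogate
// (0xD800..0xDBFF) and a trail surrogate (0xDC00..0xDFFF) into a code point.
// Assumes the caller has already checked that the pair is valid.
std::uint32_t combine_surrogates(std::uint16_t lead, std::uint16_t trail)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(lead) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(trail) - 0xDC00u);
}

// Self-check with the surrogate pair from the utf16to8 example.
void combine_example()
{
    assert(combine_surrogates(0xd834, 0xdd1e) == 0x1d11e);
}
```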
593	<h4>
594	utf8::utf8to16
595	</h4>
596	<p class="version">
597	Available in version 1.0 and later.
598	</p>
599	<p>
600	Converts a UTF-8 encoded string to UTF-16.
601	</p>
602	<pre>
603	<span class="keyword">template</span> <<span class=
604	"keyword">typename</span> u16bit_iterator, typename octet_iterator>
605	u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);
606
607	</pre>
608	<p>
609	<code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
610	string to convert.<br> <code>end</code>: an iterator pointing to
611	past-the-end of the UTF-8 encoded string to convert.<br>
612	<code>result</code>: an output iterator to the place in the UTF-16 string where to
613	append the result of conversion.<br>
614	<span class="return_value">Return value</span>: An iterator pointing to the place
615	after the appended UTF-16 string.
616	</p>
617	<p>
618	Example of use:
619	</p>
620	<pre>
621	<span class="keyword">char</span> utf8_with_surrogates[] = <span class=
622	"literal">"\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"</span>;
623	vector <<span class="keyword">unsigned short</span>> utf16result;
624	utf8to16(utf8_with_surrogates, utf8_with_surrogates + <span class=
625	"literal">9</span>, back_inserter(utf16result));
626	assert (utf16result.size() == <span class="literal">4</span>);
627	assert (utf16result[<span class="literal">2</span>] == <span class=
628	"literal">0xd834</span>);
629	assert (utf16result[<span class="literal">3</span>] == <span class=
630	"literal">0xdd1e</span>);
631	</pre>
632	<p>
633	In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
634	thrown. If <code>end</code> does not point to the past-of-end of a UTF-8 sequence, a
635	<code>utf8::not_enough_room</code> exception is thrown.
636	</p>
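<p>
The example above yields four 16 bit units because the last code point, U+1D11E, lies outside the Basic Multilingual Plane and needs two of them. A minimal sketch of that split follows; <code>split_into_surrogates</code> is a hypothetical name, not the library's code, and it assumes a code point in the range 0x10000 to 0x10FFFF.
</p>

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper, not the library's code: splits a code point in the
// range 0x10000..0x10FFFF into a (lead, trail) surrogate pair.
std::pair<std::uint16_t, std::uint16_t> split_into_surrogates(std::uint32_t cp)
{
    const std::uint32_t v = cp - 0x10000u;   // 20 significant bits remain
    return std::make_pair(
        static_cast<std::uint16_t>(0xD800u + (v >> 10)),      // high 10 bits
        static_cast<std::uint16_t>(0xDC00u + (v & 0x3FFu)));  // low 10 bits
}

// Self-check with the values asserted in the utf8to16 example.
void split_example()
{
    const std::pair<std::uint16_t, std::uint16_t> p = split_into_surrogates(0x1d11e);
    assert(p.first == 0xd834);
    assert(p.second == 0xdd1e);
}
```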
637	<h4>
638	utf8::utf32to8
639	</h4>
640	<p class="version">
641	Available in version 1.0 and later.
642	</p>
643	<p>
644	Converts a UTF-32 encoded string to UTF-8.
645	</p>
646	<pre>
647	<span class="keyword">template</span> <<span class=
648	"keyword">typename</span> octet_iterator, typename u32bit_iterator>
649	octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);
650
651	</pre>
652	<p>
653	<code>start</code>: an iterator pointing to the beginning of the UTF-32 encoded
654	string to convert.<br>
655	<code>end</code>: an iterator pointing to past-the-end of the UTF-32 encoded
656	string to convert.<br>
657	<code>result</code>: an output iterator to the place in the UTF-8 string where to
658	append the result of conversion.<br>
659	<span class="return_value">Return value</span>: An iterator pointing to the place
660	after the appended UTF-8 string.
661	</p>
662	<p>
663	Example of use:
664	</p>
665	<pre>
666	<span class="keyword">int</span> utf32string[] = {<span class=
667	"literal">0x448</span>, <span class="literal">0x65E5</span>, <span class=
668	"literal">0x10346</span>, <span class="literal">0</span>};
669	vector<<span class="keyword">unsigned char</span>> utf8result;
670	utf32to8(utf32string, utf32string + <span class=
671	"literal">3</span>, back_inserter(utf8result));
672	assert (utf8result.size() == <span class="literal">9</span>);
673	</pre>
674	<p>
675	In case of invalid UTF-32 string, a <code>utf8::invalid_code_point</code> exception
676	is thrown.
677	</p>
678	<h4>
679	utf8::utf8to32
680	</h4>
681	<p class="version">
682	Available in version 1.0 and later.
683	</p>
684	<p>
685	Converts a UTF-8 encoded string to UTF-32.
686	</p>
687	<pre>
688	<span class="keyword">template</span> <<span class=
689	"keyword">typename</span> octet_iterator, <span class=
690	"keyword">typename</span> u32bit_iterator>
691	u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);
692
693	</pre>
694	<p>
695	<code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
696	string to convert.<br>
697	<code>end</code>: an iterator pointing to past-the-end of the UTF-8 encoded string
698	to convert.<br>
699	<code>result</code>: an output iterator to the place in the UTF-32 string where to
700	append the result of conversion.<br>
701	<span class="return_value">Return value</span>: An iterator pointing to the place
702	after the appended UTF-32 string.
703	</p>
704	<p>
705	Example of use:
706	</p>
707	<pre>
708	<span class="keyword">char</span>* twochars = <span class=
709	"literal">"\xe6\x97\xa5\xd1\x88"</span>;
710	vector<<span class="keyword">int</span>> utf32result;
711	utf8to32(twochars, twochars + <span class=
712	"literal">5</span>, back_inserter(utf32result));
713	assert (utf32result.size() == <span class="literal">2</span>);
714	</pre>
715	<p>
716	In case of an invalid UTF-8 sequence, a <code>utf8::invalid_utf8</code> exception is
717	thrown. If <code>end</code> does not point to the past-of-end of a UTF-8 sequence, a
718	<code>utf8::not_enough_room</code> exception is thrown.
719	</p>
720	<h4>
721	utf8::find_invalid
722	</h4>
723	<p class="version">
724	Available in version 1.0 and later.
725	</p>
726	<p>
727	Detects an invalid sequence within a UTF-8 string.
728	</p>
729	<pre>
730	<span class="keyword">template</span> <<span class=
731	"keyword">typename</span> octet_iterator>
732	octet_iterator find_invalid(octet_iterator start, octet_iterator end);
733	</pre>
734	<p>
735	<code>start</code>: an iterator pointing to the beginning of the UTF-8 string to
736	test for validity.<br>
737	<code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to test
738	for validity.<br>
739	<span class="return_value">Return value</span>: an iterator pointing to the first
740	invalid octet in the UTF-8 string. In case none were found, equals
741	<code>end</code>.
742	</p>
743	<p>
744	Example of use:
745	</p>
746	<pre>
747	<span class="keyword">char</span> utf_invalid[] = <span class=
748	"literal">"\xe6\x97\xa5\xd1\x88\xfa"</span>;
749	<span class=
750	"keyword">char</span>* invalid = find_invalid(utf_invalid, utf_invalid + <span class=
751	"literal">6</span>);
752	assert (invalid == utf_invalid + <span class="literal">5</span>);
753	</pre>
754	<p>
755	This function is typically used to make sure a UTF-8 string is valid before
756	processing it with other functions. It is especially important to call it before
757	doing any of the <em>unchecked</em> operations on it.
758	</p>
759	<h4>
760	utf8::is_valid
761	</h4>
762	<p class="version">
763	Available in version 1.0 and later.
764	</p>
765	<p>
766	Checks whether a sequence of octets is a valid UTF-8 string.
767	</p>
768	<pre>
769	<span class="keyword">template</span> <<span class=
770	"keyword">typename</span> octet_iterator>
771	<span class="keyword">bool</span> is_valid(octet_iterator start, octet_iterator end);
772
773	</pre>
774	<p>
775	<code>start</code>: an iterator pointing to the beginning of the UTF-8 string to
776	test for validity.<br>
777	<code>end</code>: an iterator pointing to past-the-end of the UTF-8 string to test
778	for validity.<br>
779	<span class="return_value">Return value</span>: <code>true</code> if the sequence
780	is a valid UTF-8 string; <code>false</code> if not.
781	</p>
782	<p>Example of use:</p>
783	<pre>
784	<span class="keyword">char</span> utf_invalid[] = <span class=
785	"literal">"\xe6\x97\xa5\xd1\x88\xfa"</span>;
786	<span class="keyword">bool</span> bvalid = is_valid(utf_invalid, utf_invalid + <span
787	class="literal">6</span>);
788	assert (bvalid == false);
789	</pre>
790	<p>
791	<code>is_valid</code> is a shorthand for <code>find_invalid(start, end) ==
792	end;</code>. You may want to use it to make sure that a byte sequence is a valid
793	UTF-8 string without the need to know where it fails if it is not valid.
794	</p>
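<p>
The relation between the two functions can be sketched with a simplified validator. The names <code>first_invalid_offset</code> and <code>is_well_formed</code> are hypothetical and the code is not the library's implementation; it checks lead octets, continuation octets, truncation, overlong forms, surrogates and the 0x10FFFF limit, and the exact position the library reports for a given malformed input may differ.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not the library's code: returns the offset of the
// first octet that starts an ill-formed sequence, or s.size() if none.
std::size_t first_invalid_offset(const std::string& s)
{
    // Smallest code point that may legally use a sequence of each length.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return i;                          // stray continuation or bad lead
        if (s.size() - i < len) return i;       // sequence is cut short
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return i;   // not a continuation octet
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len]) return i;                 // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return i;     // surrogate code point
        if (cp > 0x10FFFF) return i;                    // beyond Unicode range
        i += len;
    }
    return s.size();
}

// The yes/no check is just a comparison against the end of the string.
bool is_well_formed(const std::string& s)
{
    return first_invalid_offset(s) == s.size();
}

// Self-check with the invalid string used in the examples above.
void validity_example()
{
    const std::string bad("\xe6\x97\xa5\xd1\x88\xfa");
    assert(first_invalid_offset(bad) == 5);
    assert(!is_well_formed(bad));
}
```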
795	<h4>
796	utf8::replace_invalid
797	</h4>
798	<p class="version">
799	Available in version 2.0 and later.
800	</p>
801	<p>
802	Replaces all invalid UTF-8 sequences within a string with a replacement marker.
803	</p>
804	<pre>
805	<span class="keyword">template</span> <<span class=
806	"keyword">typename</span> octet_iterator, <span class=
807	"keyword">typename</span> output_iterator>
808	output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, uint32_t replacement);
809	<span class="keyword">template</span> <<span class=
810	"keyword">typename</span> octet_iterator, <span class=
811	"keyword">typename</span> output_iterator>
812	output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out);
813
814	</pre>
815	<p>
816	<code>start</code>: an iterator pointing to the beginning of the UTF-8 string to
817	look for invalid UTF-8 sequences.<br>
<code>end</code>: an iterator pointing past the end of the UTF-8 string to look
for invalid UTF-8 sequences.<br>
<code>out</code>: an output iterator to the range where the result of replacement
is stored.<br>
<code>replacement</code>: a Unicode code point for the replacement marker. The
version without this parameter assumes the value <code>0xfffd</code>.<br>
824	<span class="return_value">Return value</span>: An iterator pointing to the place
825	after the UTF-8 string with replaced invalid sequences.
826	</p>
827	<p>
828	Example of use:
829	</p>
830	<pre>
831	<span class="keyword">char</span> invalid_sequence[] = <span class=
832	"literal">"a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z"</span>;
833	vector<<span class="keyword">char</span>> replace_invalid_result;
834	replace_invalid (invalid_sequence, invalid_sequence + sizeof(invalid_sequence), back_inserter(replace_invalid_result), <span
835	class="literal">'?'</span>);
836	bvalid = is_valid(replace_invalid_result.begin(), replace_invalid_result.end());
837	assert (bvalid);
838	<span class="keyword">char</span>* fixed_invalid_sequence = <span class=
839	"literal">"a????z"</span>;
840	assert (std::equal(replace_invalid_result.begin(), replace_invalid_result.end(), fixed_invalid_sequence));
841	</pre>
842	<p>
843	<code>replace_invalid</code> does not perform in-place replacement of invalid
844	sequences. Rather, it produces a copy of the original string with the invalid
845	sequences replaced with a replacement marker. Therefore, <code>out</code> must not
846	be in the <code>[start, end]</code> range.
847	</p>
848	<p>
If <code>end</code> does not point past the end of a UTF-8 sequence, a
<code>utf8::not_enough_room</code> exception is thrown.
851	</p>
852	<h4>
853	utf8::is_bom
854	</h4>
855	<p class="version">
856	Available in version 1.0 and later.
857	</p>
858	<p>
Checks whether a sequence of three octets is a UTF-8 byte order mark (BOM).
860	</p>
861	<pre>
862	<span class="keyword">template</span> <<span class=
863	"keyword">typename</span> octet_iterator>
864	<span class="keyword">bool</span> is_bom (octet_iterator it);
865	</pre>
866	<p>
<code>it</code>: beginning of the 3-octet sequence to check.<br>
<span class="return_value">Return value</span>: <code>true</code> if the sequence
is the UTF-8 byte order mark; <code>false</code> if not.
870	</p>
871	<p>
872	Example of use:
873	</p>
874	<pre>
875	<span class="keyword">unsigned char</span> byte_order_mark[] = {<span class=
876	"literal">0xef</span>, <span class="literal">0xbb</span>, <span class=
877	"literal">0xbf</span>};
878	<span class="keyword">bool</span> bbom = is_bom(byte_order_mark);
879	assert (bbom == <span class="literal">true</span>);
880	</pre>
881	<p>
882	The typical use of this function is to check the first three bytes of a file. If
883	they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8
884	encoded text.
885	</p>
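<p>
The "skip the BOM" step can be sketched as follows. This is a standalone
illustration that does not call the library; <code>bom_length_sketch</code> and
<code>strip_bom_sketch</code> are hypothetical names, and the file contents are
assumed to be already loaded into a <code>std::string</code>.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns 3 if the buffer starts with the UTF-8 BOM
// (EF BB BF), otherwise 0. Buffers shorter than three octets never match.
inline std::size_t bom_length_sketch(const std::string& bytes) {
    const bool has_bom = bytes.size() >= 3 &&
                         static_cast<unsigned char>(bytes[0]) == 0xEF &&
                         static_cast<unsigned char>(bytes[1]) == 0xBB &&
                         static_cast<unsigned char>(bytes[2]) == 0xBF;
    return has_bom ? 3 : 0;
}

// Returns a copy of the text with a leading BOM, if any, removed.
inline std::string strip_bom_sketch(const std::string& bytes) {
    return bytes.substr(bom_length_sketch(bytes));
}
```

<p>
Checking the size first matters: a file shorter than three bytes must not be
read past its end.
</p>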
886	<h3 id="typesutf8">
887	Types From utf8 Namespace
888	</h3>
889	<h4>
890	utf8::iterator
891	</h4>
892	<p class="version">
893	Available in version 2.0 and later.
894	</p>
895	<p>
896	Adapts the underlying octet iterator to iterate over the sequence of code points,
897	rather than raw octets.
898	</p>
899	<pre>
900	<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator>
901	<span class="keyword">class</span> iterator;
902	</pre>
903
904	<h5>Member functions</h5>
905	<dl>
<dt><code>iterator();</code> <dd> the default constructor; the underlying <code>octet_iterator</code> is
constructed with its default constructor.
908	<dt><code><span class="keyword">explicit</span> iterator (const octet_iterator& octet_it,
909	const octet_iterator& range_start,
910	const octet_iterator& range_end);</code> <dd> a constructor
911	that initializes the underlying <code>octet_iterator</code> with <code>octet_it</code>
912	and sets the range in which the iterator is considered valid.
913	<dt><code>octet_iterator base () <span class="keyword">const</span>;</code> <dd> returns the
914	underlying <code>octet_iterator</code>.
<dt><code>uint32_t operator * () <span class="keyword">const</span>;</code> <dd> decodes the UTF-8 sequence
the underlying <code>octet_iterator</code> is pointing to and returns the code point.
<dt><code><span class="keyword">bool operator</span> == (const iterator& rhs)
<span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span>
if the two underlying iterators are equal.
<dt><code><span class="keyword">bool operator</span> != (const iterator& rhs)
<span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span>
if the two underlying iterators are not equal.
923	<dt><code>iterator& <span class="keyword">operator</span> ++ (); </code> <dd> the prefix increment - moves
924	the iterator to the next UTF-8 encoded code point.
925	<dt><code>iterator <span class="keyword">operator</span> ++ (<span class="keyword">int</span>); </code> <dd>
926	the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one.
927	<dt><code>iterator& <span class="keyword">operator</span> -- (); </code> <dd> the prefix decrement - moves
928	the iterator to the previous UTF-8 encoded code point.
929	<dt><code>iterator <span class="keyword">operator</span> -- (<span class="keyword">int</span>); </code> <dd>
930	the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one.
931	</dl>
932	<p>
933	Example of use:
934	</p>
935	<pre>
936	<span class="keyword">char</span>* threechars = <span class="literal">"\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88"</span>;
937	utf8::iterator<<span class="keyword">char</span>*> it(threechars, threechars, threechars + <span class="literal">9</span>);
938	utf8::iterator<<span class="keyword">char</span>*> it2 = it;
939	assert (it2 == it);
940	assert (*it == <span class="literal">0x10346</span>);
941	assert (*(++it) == <span class="literal">0x65e5</span>);
942	assert ((*it++) == <span class="literal">0x65e5</span>);
943	assert (*it == <span class="literal">0x0448</span>);
944	assert (it != it2);
945	utf8::iterator<<span class="keyword">char</span>*> endit (threechars + <span class="literal">9</span>, threechars, threechars + <span class="literal">9</span>);
946	assert (++it == endit);
947	assert (*(--it) == <span class="literal">0x0448</span>);
948	assert ((*it--) == <span class="literal">0x0448</span>);
949	assert (*it == <span class="literal">0x65e5</span>);
950	assert (--it == utf8::iterator<<span class="keyword">char</span>*>(threechars, threechars, threechars + <span class="literal">9</span>));
951	assert (*it == <span class="literal">0x10346</span>);
952	</pre>
953	<p>
The purpose of the <code>utf8::iterator</code> adapter is to enable easy iteration as well as the use of STL
algorithms with UTF-8 encoded strings. Increment and decrement operators are implemented in terms of
the <code>utf8::next()</code> and <code>utf8::prior()</code> functions.
957	</p>
958	<p>
Note that the <code>utf8::iterator</code> adapter is a checked iterator. It operates on the range specified in
the constructor; any attempt to go out of that range will result in an exception. Even the comparison operators
require both iterator objects to be constructed against the same range; otherwise an exception is thrown. Typically,
the range will be determined by the sequence container functions <code>begin</code> and <code>end</code>, for example:
963	</p>
964	<pre>
965	std::string s = <span class="literal">"example"</span>;
utf8::iterator<std::string::iterator> i (s.begin(), s.begin(), s.end());
967	</pre>
968	<h3 id="fununchecked">
969	Functions From utf8::unchecked Namespace
970	</h3>
971	<h4>
972	utf8::unchecked::append
973	</h4>
974	<p class="version">
975	Available in version 1.0 and later.
976	</p>
977	<p>
978	Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence
979	to a UTF-8 string.
980	</p>
981	<pre>
982	<span class="keyword">template</span> <<span class=
983	"keyword">typename</span> octet_iterator>
984	octet_iterator append(uint32_t cp, octet_iterator result);
985
986	</pre>
987	<p>
988	<code>cp</code>: A 32 bit integer representing a code point to append to the
989	sequence.<br>
990	<code>result</code>: An output iterator to the place in the sequence where to
991	append the code point.<br>
992	<span class="return_value">Return value</span>: An iterator pointing to the place
993	after the newly appended sequence.
994	</p>
995	<p>
996	Example of use:
997	</p>
998	<pre>
999	<span class="keyword">unsigned char</span> u[<span class="literal">5</span>] = {<span
1000	class="literal">0</span>,<span class="literal">0</span>,<span class=
1001	"literal">0</span>,<span class="literal">0</span>,<span class="literal">0</span>};
1002	<span class="keyword">unsigned char</span>* end = unchecked::append(<span class=
1003	"literal">0x0448</span>, u);
1004	assert (u[<span class="literal">0</span>] == <span class=
1005	"literal">0xd1</span> && u[<span class="literal">1</span>] == <span class=
1006	"literal">0x88</span> && u[<span class="literal">2</span>] == <span class=
1007	"literal">0</span> && u[<span class="literal">3</span>] == <span class=
1008	"literal">0</span> && u[<span class="literal">4</span>] == <span class=
1009	"literal">0</span>);
1010	</pre>
1011	<p>
1012	This is a faster but less safe version of <code>utf8::append</code>. It does not
1013	check for validity of the supplied code point, and may produce an invalid UTF-8
1014	sequence.
1015	</p>
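<p>
To make "does not check" concrete, here is a minimal sketch of unchecked encoding.
It is an illustration under stated assumptions, not the library's source:
<code>append_sketch</code> and <code>encode_sketch</code> are hypothetical names.
</p>

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical sketch of unchecked encoding: writes cp as 1 to 4 octets and
// returns the advanced output iterator. Nothing is validated, so a surrogate
// or a value above 0x10FFFF yields octets that are not valid UTF-8.
template <typename octet_iterator>
octet_iterator append_sketch(std::uint32_t cp, octet_iterator result) {
    if (cp < 0x80) {                 // one octet: 0xxxxxxx
        *result++ = static_cast<char>(cp);
    } else if (cp < 0x800) {         // two octets: 110xxxxx 10xxxxxx
        *result++ = static_cast<char>((cp >> 6) | 0xC0);
        *result++ = static_cast<char>((cp & 0x3F) | 0x80);
    } else if (cp < 0x10000) {       // three octets: 1110xxxx 10xxxxxx 10xxxxxx
        *result++ = static_cast<char>((cp >> 12) | 0xE0);
        *result++ = static_cast<char>(((cp >> 6) & 0x3F) | 0x80);
        *result++ = static_cast<char>((cp & 0x3F) | 0x80);
    } else {                         // four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        *result++ = static_cast<char>((cp >> 18) | 0xF0);
        *result++ = static_cast<char>(((cp >> 12) & 0x3F) | 0x80);
        *result++ = static_cast<char>(((cp >> 6) & 0x3F) | 0x80);
        *result++ = static_cast<char>((cp & 0x3F) | 0x80);
    }
    return result;
}

// Convenience wrapper for experiments: encodes one code point into a string.
inline std::string encode_sketch(std::uint32_t cp) {
    std::string out;
    append_sketch(cp, std::back_inserter(out));
    return out;
}
```

<p>
Passing the surrogate <code>0xd800</code> to this sketch produces the octets
<code>ed a0 80</code>, which a validating function would reject.
</p>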
1016	<h4>
1017	utf8::unchecked::next
1018	</h4>
1019	<p class="version">
1020	Available in version 1.0 and later.
1021	</p>
1022	<p>
1023	Given the iterator to the beginning of a UTF-8 sequence, it returns the code point
1024	and moves the iterator to the next position.
1025	</p>
1026	<pre>
1027	<span class="keyword">template</span> <<span class=
1028	"keyword">typename</span> octet_iterator>
1029	uint32_t next(octet_iterator& it);
1030
1031	</pre>
1032	<p>
<code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
encoded code point. After the function returns, it is incremented to point to the
beginning of the next code point.<br>
1036	<span class="return_value">Return value</span>: the 32 bit representation of the
1037	processed UTF-8 code point.
1038	</p>
1039	<p>
1040	Example of use:
1041	</p>
1042	<pre>
1043	<span class="keyword">char</span>* twochars = <span class=
1044	"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1045	<span class="keyword">char</span>* w = twochars;
1046	<span class="keyword">int</span> cp = unchecked::next(w);
1047	assert (cp == <span class="literal">0x65e5</span>);
1048	assert (w == twochars + <span class="literal">3</span>);
1049	</pre>
1050	<p>
1051	This is a faster but less safe version of <code>utf8::next</code>. It does not
1052	check for validity of the supplied UTF-8 sequence.
1053	</p>
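<p>
A minimal sketch of what unchecked decoding involves is shown below. It is an
assumption-laden illustration rather than the library's code, and
<code>next_sketch</code> is a hypothetical name. The lead octet alone decides how
many trail octets are read.
</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of unchecked decoding: trusts the lead octet to give the
// sequence length and never checks trail octets or the end of the buffer.
template <typename octet_iterator>
std::uint32_t next_sketch(octet_iterator& it) {
    std::uint32_t cp = static_cast<unsigned char>(*it);
    int trail = 0;  // number of trail octets still to read
    if (cp >= 0xF0)      { cp &= 0x07; trail = 3; }
    else if (cp >= 0xE0) { cp &= 0x0F; trail = 2; }
    else if (cp >= 0xC0) { cp &= 0x1F; trail = 1; }
    ++it;
    for (; trail > 0; --trail, ++it) {
        // Each trail octet contributes its low six bits.
        cp = (cp << 6) | (static_cast<unsigned char>(*it) & 0x3F);
    }
    return cp;
}
```

<p>
On malformed input this sketch can read past the end of the buffer, which is
exactly the risk the unchecked functions trade for speed.
</p>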
1054	<h4>
1055	utf8::unchecked::peek_next
1056	</h4>
1057	<p class="version">
1058	Available in version 2.1 and later.
1059	</p>
1060	<p>
1061	Given the iterator to the beginning of a UTF-8 sequence, it returns the code point.
1062	</p>
1063	<pre>
1064	<span class="keyword">template</span> <<span class=
1065	"keyword">typename</span> octet_iterator>
1066	uint32_t peek_next(octet_iterator it);
1067
1068	</pre>
1069	<p>
<code>it</code>: an iterator pointing to the beginning of a UTF-8
encoded code point.<br>
1072	<span class="return_value">Return value</span>: the 32 bit representation of the
1073	processed UTF-8 code point.
1074	</p>
1075	<p>
1076	Example of use:
1077	</p>
1078	<pre>
1079	<span class="keyword">char</span>* twochars = <span class=
1080	"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1081	<span class="keyword">char</span>* w = twochars;
1082	<span class="keyword">int</span> cp = unchecked::peek_next(w);
1083	assert (cp == <span class="literal">0x65e5</span>);
1084	assert (w == twochars);
1085	</pre>
1086	<p>
1087	This is a faster but less safe version of <code>utf8::peek_next</code>. It does not
1088	check for validity of the supplied UTF-8 sequence.
1089	</p>
1090	<h4>
1091	utf8::unchecked::prior
1092	</h4>
1093	<p class="version">
1094	Available in version 1.02 and later.
1095	</p>
1096	<p>
Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
decreases the iterator until it hits the beginning of the previous UTF-8 encoded
code point and returns the 32 bit representation of the code point.
1100	</p>
1101	<pre>
1102	<span class="keyword">template</span> <<span class=
1103	"keyword">typename</span> octet_iterator>
1104	uint32_t prior(octet_iterator& it);
1105
1106	</pre>
1107	<p>
<code>it</code>: a reference to an iterator pointing to an octet within a UTF-8 encoded string.
After the function returns, it is decremented to point to the beginning of the
previous code point.<br>
1111	<span class="return_value">Return value</span>: the 32 bit representation of the
1112	previous code point.
1113	</p>
1114	<p>
1115	Example of use:
1116	</p>
1117	<pre>
1118	<span class="keyword">char</span>* twochars = <span class=
1119	"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1120	<span class="keyword">char</span>* w = twochars + <span class="literal">3</span>;
1121	<span class="keyword">int</span> cp = unchecked::prior (w);
1122	assert (cp == <span class="literal">0x65e5</span>);
1123	assert (w == twochars);
1124	</pre>
1125	<p>
1126	This is a faster but less safe version of <code>utf8::prior</code>. It does not
1127	check for validity of the supplied UTF-8 sequence and offers no boundary checking.
1128	</p>
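<p>
Stepping backwards works because trail octets are recognizable by their top two
bits (<code>10xxxxxx</code>). The sketch below illustrates the idea on raw
pointers; <code>prior_sketch</code> is a hypothetical name and the code is not
taken from the library.
</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: move back over trail octets to the lead octet, then
// decode forward from a copy so that `it` is left on the lead octet.
// There is no check against the start of the buffer.
inline std::uint32_t prior_sketch(const unsigned char*& it) {
    do {
        --it;
    } while ((*it & 0xC0) == 0x80);  // 10xxxxxx marks a trail octet

    const unsigned char* p = it;
    std::uint32_t cp = *p++;
    int trail = 0;
    if (cp >= 0xF0)      { cp &= 0x07; trail = 3; }
    else if (cp >= 0xE0) { cp &= 0x0F; trail = 2; }
    else if (cp >= 0xC0) { cp &= 0x1F; trail = 1; }
    for (; trail > 0; --trail) {
        cp = (cp << 6) | (*p++ & 0x3F);
    }
    return cp;
}
```

<p>
If the buffer begins with trail octets, the loop walks off the front of it; the
checked <code>utf8::prior</code> exists for input that cannot be trusted.
</p>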
1129	<h4>
1130	utf8::unchecked::previous (deprecated, see utf8::unchecked::prior)
1131	</h4>
1132	<p class="version">
1133	Deprecated in version 1.02 and later.
1134	</p>
1135	<p>
Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it
decreases the iterator until it hits the beginning of the previous UTF-8 encoded
code point and returns the 32 bit representation of the code point.
1139	</p>
1140	<pre>
1141	<span class="keyword">template</span> <<span class=
1142	"keyword">typename</span> octet_iterator>
1143	uint32_t previous(octet_iterator& it);
1144
1145	</pre>
1146	<p>
<code>it</code>: a reference to an iterator pointing to an octet within a UTF-8 encoded string.
After the function returns, it is decremented to point to the beginning of the
previous code point.<br>
1150	<span class="return_value">Return value</span>: the 32 bit representation of the
1151	previous code point.
1152	</p>
1153	<p>
1154	Example of use:
1155	</p>
1156	<pre>
1157	<span class="keyword">char</span>* twochars = <span class=
1158	"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1159	<span class="keyword">char</span>* w = twochars + <span class="literal">3</span>;
1160	<span class="keyword">int</span> cp = unchecked::previous (w);
1161	assert (cp == <span class="literal">0x65e5</span>);
1162	assert (w == twochars);
1163	</pre>
1164	<p>
This function is deprecated only for consistency with the "checked"
versions, where <code>prior</code> should be used instead of <code>previous</code>.
In fact, <code>unchecked::previous</code> behaves exactly the same as
<code>unchecked::prior</code>.
1169	</p>
1170	<p>
1171	This is a faster but less safe version of <code>utf8::previous</code>. It does not
1172	check for validity of the supplied UTF-8 sequence and offers no boundary checking.
1173	</p>
1174	<h4>
1175	utf8::unchecked::advance
1176	</h4>
1177	<p class="version">
1178	Available in version 1.0 and later.
1179	</p>
1180	<p>
Advances an iterator by the specified number of code points within a UTF-8
sequence.
1183	</p>
1184	<pre>
1185	<span class="keyword">template</span> <<span class=
1186	"keyword">typename</span> octet_iterator, typename distance_type>
1187	<span class="keyword">void</span> advance (octet_iterator& it, distance_type n);
1188
1189	</pre>
1190	<p>
<code>it</code>: a reference to an iterator pointing to the beginning of a UTF-8
encoded code point. After the function returns, it is incremented to point to the
nth following code point.<br>
1194	<code>n</code>: a positive integer that shows how many code points we want to
1195	advance.<br>
1196	</p>
1197	<p>
1198	Example of use:
1199	</p>
1200	<pre>
1201	<span class="keyword">char</span>* twochars = <span class=
1202	"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1203	<span class="keyword">char</span>* w = twochars;
1204	unchecked::advance (w, <span class="literal">2</span>);
1205	assert (w == twochars + <span class="literal">5</span>);
1206	</pre>
1207	<p>
1208	This function works only "forward". In case of a negative <code>n</code>, there is
1209	no effect.
1210	</p>
1211	<p>
1212	This is a faster but less safe version of <code>utf8::advance</code>. It does not
1213	check for validity of the supplied UTF-8 sequence and offers no boundary checking.
1214	</p>
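<p>
The forward-only behaviour falls out naturally if advancing is written as a
counted loop. The following sketch assumes raw <code>char</code> pointers and a
hypothetical name, <code>advance_sketch</code>; it is one plausible way to do it,
not necessarily how the library does.
</p>

```cpp
#include <cassert>

// Hypothetical sketch: skip n code points by reading only lead octets.
// The loop body never runs for n <= 0, so a negative n has no effect.
inline void advance_sketch(const char*& it, int n) {
    for (; n > 0; --n) {
        const unsigned char lead = static_cast<unsigned char>(*it);
        // Sequence length implied by the lead octet; the octet is trusted.
        it += lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    }
}
```

<p>
Because only lead octets are inspected, a truncated final sequence moves the
pointer beyond the buffer, which is why no boundary checking is promised.
</p>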
1215	<h4>
1216	utf8::unchecked::distance
1217	</h4>
1218	<p class="version">
1219	Available in version 1.0 and later.
1220	</p>
1221	<p>
Given the iterators to two UTF-8 encoded code points in a sequence, returns the
number of code points between them.
1224	</p>
1225	<pre>
1226	<span class="keyword">template</span> <<span class=
1227	"keyword">typename</span> octet_iterator>
1228	<span class=
1229	"keyword">typename</span> std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last);
1230	</pre>
1231	<p>
<code>first</code>: an iterator to the beginning of a UTF-8 encoded code point.<br>
<code>last</code>: an iterator to the "post-end" of the last UTF-8 encoded code
point in the sequence whose length we are trying to determine. It can be the
beginning of a new code point, or not.<br>
<span class="return_value">Return value</span>: the distance between the iterators,
in code points.
1238	</p>
1239	<p>
1240	Example of use:
1241	</p>
1242	<pre>
1243	<span class="keyword">char</span>* twochars = <span class=
1244	"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1245	size_t dist = utf8::unchecked::distance(twochars, twochars + <span class=
1246	"literal">5</span>);
1247	assert (dist == <span class="literal">2</span>);
1248	</pre>
1249	<p>
1250	This is a faster but less safe version of <code>utf8::distance</code>. It does not
1251	check for validity of the supplied UTF-8 sequence.
1252	</p>
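<p>
For well-formed input, the number of code points equals the number of octets that
are not trail octets. The sketch below uses that observation; it is a common
counting technique shown for illustration, the library itself may count
differently, and <code>distance_sketch</code> is a hypothetical name.
</p>

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: in well-formed UTF-8 every code point has exactly one
// octet that is not a trail octet (10xxxxxx), so counting those octets counts
// code points without decoding anything.
inline std::ptrdiff_t distance_sketch(const char* first, const char* last) {
    std::ptrdiff_t count = 0;
    for (; first != last; ++first) {
        if ((static_cast<unsigned char>(*first) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

<p>
On malformed input the count is still produced but no longer means anything,
which mirrors the "less safe" caveat above.
</p>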
1253	<h4>
1254	utf8::unchecked::utf16to8
1255	</h4>
1256	<p class="version">
1257	Available in version 1.0 and later.
1258	</p>
1259	<p>
1260	Converts a UTF-16 encoded string to UTF-8.
1261	</p>
1262	<pre>
1263	<span class="keyword">template</span> <<span class=
1264	"keyword">typename</span> u16bit_iterator, <span class=
1265	"keyword">typename</span> octet_iterator>
1266	octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);
1267
1268	</pre>
1269	<p>
1270	<code>start</code>: an iterator pointing to the beginning of the UTF-16 encoded
1271	string to convert.<br>
<code>end</code>: an iterator pointing past the end of the UTF-16 encoded
string to convert.<br>
1274	<code>result</code>: an output iterator to the place in the UTF-8 string where to
1275	append the result of conversion.<br>
1276	<span class="return_value">Return value</span>: An iterator pointing to the place
1277	after the appended UTF-8 string.
1278	</p>
1279	<p>
1280	Example of use:
1281	</p>
1282	<pre>
1283	<span class="keyword">unsigned short</span> utf16string[] = {<span class=
1284	"literal">0x41</span>, <span class="literal">0x0448</span>, <span class=
1285	"literal">0x65e5</span>, <span class="literal">0xd834</span>, <span class=
1286	"literal">0xdd1e</span>};
1287	vector<<span class="keyword">unsigned char</span>> utf8result;
1288	unchecked::utf16to8(utf16string, utf16string + <span class=
1289	"literal">5</span>, back_inserter(utf8result));
1290	assert (utf8result.size() == <span class="literal">10</span>);
1291	</pre>
1292	<p>
1293	This is a faster but less safe version of <code>utf8::utf16to8</code>. It does not
1294	check for validity of the supplied UTF-16 sequence.
1295	</p>
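<p>
The size of 10 in the example follows from how surrogate pairs are joined before
encoding. The sketch below reproduces that arithmetic on its own; the function
names are hypothetical and the code assumes well-formed UTF-16, as the unchecked
functions do.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: decode UTF-16 code units into code points, joining a
// lead surrogate (0xD800..0xDBFF) with the trail surrogate that follows it.
inline std::vector<std::uint32_t> utf16_to_code_points_sketch(
        const std::vector<std::uint16_t>& units) {
    std::vector<std::uint32_t> cps;
    for (std::size_t i = 0; i < units.size(); ++i) {
        std::uint32_t cp = units[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < units.size()) {
            const std::uint32_t trail = units[++i];  // trusted to be 0xDC00..0xDFFF
            cp = 0x10000 + ((cp - 0xD800) << 10) + (trail - 0xDC00);
        }
        cps.push_back(cp);
    }
    return cps;
}

// Number of octets the UTF-8 encoding of cp occupies.
inline std::size_t utf8_length_sketch(std::uint32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}
```

<p>
For the five code units in the example, the last two form one code point
(<code>0x1d11e</code>), so the UTF-8 lengths are 1 + 2 + 3 + 4 = 10 octets.
</p>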
1296	<h4>
1297	utf8::unchecked::utf8to16
1298	</h4>
1299	<p class="version">
1300	Available in version 1.0 and later.
1301	</p>
1302	<p>
Converts a UTF-8 encoded string to UTF-16.
1304	</p>
1305	<pre>
1306	<span class="keyword">template</span> <<span class=
1307	"keyword">typename</span> u16bit_iterator, typename octet_iterator>
1308	u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);
1309
1310	</pre>
1311	<p>
<code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
string to convert.<br>
<code>end</code>: an iterator pointing past the end of the UTF-8 encoded string to convert.<br>
1315	<code>result</code>: an output iterator to the place in the UTF-16 string where to
1316	append the result of conversion.<br>
1317	<span class="return_value">Return value</span>: An iterator pointing to the place
1318	after the appended UTF-16 string.
1319	</p>
1320	<p>
1321	Example of use:
1322	</p>
1323	<pre>
1324	<span class="keyword">char</span> utf8_with_surrogates[] = <span class=
1325	"literal">"\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"</span>;
1326	vector <<span class="keyword">unsigned short</span>> utf16result;
1327	unchecked::utf8to16(utf8_with_surrogates, utf8_with_surrogates + <span class=
1328	"literal">9</span>, back_inserter(utf16result));
1329	assert (utf16result.size() == <span class="literal">4</span>);
1330	assert (utf16result[<span class="literal">2</span>] == <span class=
1331	"literal">0xd834</span>);
1332	assert (utf16result[<span class="literal">3</span>] == <span class=
1333	"literal">0xdd1e</span>);
1334	</pre>
1335	<p>
1336	This is a faster but less safe version of <code>utf8::utf8to16</code>. It does not
1337	check for validity of the supplied UTF-8 sequence.
1338	</p>
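<p>
The two surrogate values asserted in the example come from splitting a code point
above <code>0xffff</code> into two 16 bit units. A standalone sketch of that split
is shown here; <code>append_utf16_sketch</code> is a hypothetical name, and the
code performs no validation.
</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: append one code point to a UTF-16 buffer. Code points
// above 0xFFFF do not fit in one 16 bit unit and become a surrogate pair.
inline void append_utf16_sketch(std::uint32_t cp, std::vector<std::uint16_t>& out) {
    if (cp > 0xFFFF) {
        cp -= 0x10000;  // leaves a 20 bit value
        out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));    // high 10 bits
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));  // low 10 bits
    } else {
        out.push_back(static_cast<std::uint16_t>(cp));
    }
}
```

<p>
Feeding it <code>0x65e5</code>, <code>0x0448</code> and <code>0x1d11e</code>
yields four units, the last two being <code>0xd834</code> and
<code>0xdd1e</code>, as in the example.
</p>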
1339	<h4>
1340	utf8::unchecked::utf32to8
1341	</h4>
1342	<p class="version">
1343	Available in version 1.0 and later.
1344	</p>
1345	<p>
1346	Converts a UTF-32 encoded string to UTF-8.
1347	</p>
1348	<pre>
1349	<span class="keyword">template</span> <<span class=
1350	"keyword">typename</span> octet_iterator, <span class=
1351	"keyword">typename</span> u32bit_iterator>
1352	octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);
1353
1354	</pre>
1355	<p>
1356	<code>start</code>: an iterator pointing to the beginning of the UTF-32 encoded
1357	string to convert.<br>
<code>end</code>: an iterator pointing past the end of the UTF-32 encoded
string to convert.<br>
1360	<code>result</code>: an output iterator to the place in the UTF-8 string where to
1361	append the result of conversion.<br>
1362	<span class="return_value">Return value</span>: An iterator pointing to the place
1363	after the appended UTF-8 string.
1364	</p>
1365	<p>
1366	Example of use:
1367	</p>
1368	<pre>
1369	<span class="keyword">int</span> utf32string[] = {<span class=
1370	"literal">0x448</span>, <span class="literal">0x65e5</span>, <span class=
1371	"literal">0x10346</span>, <span class="literal">0</span>};
1372	vector<<span class="keyword">unsigned char</span>> utf8result;
unchecked::utf32to8(utf32string, utf32string + <span class=
"literal">3</span>, back_inserter(utf8result));
1375	assert (utf8result.size() == <span class="literal">9</span>);
1376	</pre>
1377	<p>
1378	This is a faster but less safe version of <code>utf8::utf32to8</code>. It does not
1379	check for validity of the supplied UTF-32 sequence.
1380	</p>
1381	<h4>
1382	utf8::unchecked::utf8to32
1383	</h4>
1384	<p class="version">
1385	Available in version 1.0 and later.
1386	</p>
1387	<p>
1388	Converts a UTF-8 encoded string to UTF-32.
1389	</p>
1390	<pre>
1391	<span class="keyword">template</span> <<span class=
1392	"keyword">typename</span> octet_iterator, typename u32bit_iterator>
1393	u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);
1394
1395	</pre>
1396	<p>
1397	<code>start</code>: an iterator pointing to the beginning of the UTF-8 encoded
1398	string to convert.<br>
<code>end</code>: an iterator pointing past the end of the UTF-8 encoded string
to convert.<br>
1401	<code>result</code>: an output iterator to the place in the UTF-32 string where to
1402	append the result of conversion.<br>
1403	<span class="return_value">Return value</span>: An iterator pointing to the place
1404	after the appended UTF-32 string.
1405	</p>
1406	<p>
1407	Example of use:
1408	</p>
1409	<pre>
1410	<span class="keyword">char</span>* twochars = <span class=
1411	"literal">"\xe6\x97\xa5\xd1\x88"</span>;
1412	vector<<span class="keyword">int</span>> utf32result;
1413	unchecked::utf8to32(twochars, twochars + <span class=
1414	"literal">5</span>, back_inserter(utf32result));
1415	assert (utf32result.size() == <span class="literal">2</span>);
1416	</pre>
1417	<p>
1418	This is a faster but less safe version of <code>utf8::utf8to32</code>. It does not
1419	check for validity of the supplied UTF-8 sequence.
1420	</p>
1421	<h3 id="typesunchecked">
1422	Types From utf8::unchecked Namespace
1423	</h3>
1424	<h4>
utf8::unchecked::iterator
1426	</h4>
1427	<p class="version">
1428	Available in version 2.0 and later.
1429	</p>
1430	<p>
1431	Adapts the underlying octet iterator to iterate over the sequence of code points,
1432	rather than raw octets.
1433	</p>
1434	<pre>
1435	<span class="keyword">template</span> <<span class="keyword">typename</span> octet_iterator>
1436	<span class="keyword">class</span> iterator;
1437	</pre>
1438
1439	<h5>Member functions</h5>
1440	<dl>
<dt><code>iterator();</code> <dd> the default constructor; the underlying <code>octet_iterator</code> is
constructed with its default constructor.
<dt><code><span class="keyword">explicit</span> iterator (const octet_iterator& octet_it);
</code> <dd> a constructor
that initializes the underlying <code>octet_iterator</code> with <code>octet_it</code>.
1446	<dt><code>octet_iterator base () <span class="keyword">const</span>;</code> <dd> returns the
1447	underlying <code>octet_iterator</code>.
<dt><code>uint32_t operator * () <span class="keyword">const</span>;</code> <dd> decodes the UTF-8 sequence
the underlying <code>octet_iterator</code> is pointing to and returns the code point.
<dt><code><span class="keyword">bool operator</span> == (const iterator& rhs)
<span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span>
if the two underlying iterators are equal.
<dt><code><span class="keyword">bool operator</span> != (const iterator& rhs)
<span class="keyword">const</span>;</code> <dd> returns <span class="keyword">true</span>
if the two underlying iterators are not equal.
1456	<dt><code>iterator& <span class="keyword">operator</span> ++ (); </code> <dd> the prefix increment - moves
1457	the iterator to the next UTF-8 encoded code point.
1458	<dt><code>iterator <span class="keyword">operator</span> ++ (<span class="keyword">int</span>); </code> <dd>
1459	the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one.
1460	<dt><code>iterator& <span class="keyword">operator</span> -- (); </code> <dd> the prefix decrement - moves
1461	the iterator to the previous UTF-8 encoded code point.
1462	<dt><code>iterator <span class="keyword">operator</span> -- (<span class="keyword">int</span>); </code> <dd>
1463	the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one.
1464	</dl>
1465	<p>
1466	Example of use:
1467	</p>
1468	<pre>
1469	<span class="keyword">char</span>* threechars = <span class="literal">"\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88"</span>;
1470	utf8::unchecked::iterator<<span class="keyword">char</span>*> un_it(threechars);
1471	utf8::unchecked::iterator<<span class="keyword">char</span>*> un_it2 = un_it;
1472	assert (un_it2 == un_it);
1473	assert (*un_it == <span class="literal">0x10346</span>);
1474	assert (*(++un_it) == <span class="literal">0x65e5</span>);
1475	assert ((*un_it++) == <span class="literal">0x65e5</span>);
1476	assert (*un_it == <span class="literal">0x0448</span>);
1477	assert (un_it != un_it2);
utf8::unchecked::iterator<<span class="keyword">char</span>*> un_endit (threechars + <span class="literal">9</span>);
1479	assert (++un_it == un_endit);
1480	assert (*(--un_it) == <span class="literal">0x0448</span>);
1481	assert ((*un_it--) == <span class="literal">0x0448</span>);
1482	assert (*un_it == <span class="literal">0x65e5</span>);
1483	assert (--un_it == utf8::unchecked::iterator<<span class="keyword">char</span>*>(threechars));
1484	assert (*un_it == <span class="literal">0x10346</span>);
1485	</pre>
1486	<p>
1487	This is an unchecked version of <code>utf8::iterator</code>. It is faster in many cases, but offers
1488	no validity or range checks.
1489	</p>
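<p>
To show why such an adapter combines well with standard containers and
algorithms, here is a deliberately reduced sketch. It is not the library's class:
<code>code_point_iterator_sketch</code> is a hypothetical, increment-only input
iterator over <code>std::string</code>, with no decrement and no checks.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical, increment-only sketch of an unchecked code point iterator.
class code_point_iterator_sketch {
    std::string::const_iterator it_;
    // Sequence length implied by a lead octet; the octet is trusted.
    static int length(unsigned char lead) {
        return lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    }
public:
    // Typedefs that std::iterator_traits and the standard containers look for.
    typedef std::input_iterator_tag iterator_category;
    typedef std::uint32_t value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const std::uint32_t* pointer;
    typedef std::uint32_t reference;  // decoded by value, not a real reference

    code_point_iterator_sketch() : it_() {}
    explicit code_point_iterator_sketch(std::string::const_iterator it) : it_(it) {}
    std::string::const_iterator base() const { return it_; }

    std::uint32_t operator*() const {
        const unsigned char lead = static_cast<unsigned char>(*it_);
        const int n = length(lead);
        // Drop the length marker bits of a multi-octet lead octet.
        std::uint32_t cp = n == 1 ? lead : lead & (0xFF >> (n + 1));
        for (int i = 1; i < n; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(it_[i]) & 0x3F);
        return cp;
    }
    code_point_iterator_sketch& operator++() {
        it_ += length(static_cast<unsigned char>(*it_));
        return *this;
    }
    code_point_iterator_sketch operator++(int) {
        code_point_iterator_sketch old(*this);
        ++*this;
        return old;
    }
    bool operator==(const code_point_iterator_sketch& rhs) const { return it_ == rhs.it_; }
    bool operator!=(const code_point_iterator_sketch& rhs) const { return it_ != rhs.it_; }
};

// Usage: the vector range constructor walks the string one code point at a time.
inline std::vector<std::uint32_t> code_points_sketch(const std::string& s) {
    return std::vector<std::uint32_t>(code_point_iterator_sketch(s.begin()),
                                      code_point_iterator_sketch(s.end()));
}
```

<p>
The adapter only has to supply dereference, increment and comparison; the
container constructor does the rest.
</p>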
1490	<h2 id="points">
Points of Interest
1492	</h2>
1493	<h4>
1494	Design goals and decisions
1495	</h4>
1496	<p>
1497	The library was designed to be:
1498	</p>
1499	<ol>
1500	<li>
1501	Generic: for better or worse, there are many C++ string classes out there, and
1502	the library should work with as many of them as possible.
1503	</li>
1504	<li>
Portable: the library should be portable across both platforms and
compilers. The only non-portable code is a small section that declares unsigned
integers of different sizes: three typedefs. They can be changed by the users of
the library if they don't match their platform. The default setting should work
for Windows (both 32 and 64 bit), and most 32 bit and 64 bit Unix derivatives.
1510	</li>
1511	<li>
Lightweight: follow the "pay only for what you use" guideline.
1513	</li>
1514	<li>
1515	Unintrusive: avoid forcing any particular design or even programming style on the
1516	user. This is a library, not a framework.
1517	</li>
1518	</ol>
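<p>
As an illustration of the portability point, a section of three typedefs could
look roughly like the sketch below. The type names here are hypothetical and the
library's own names and choices may differ; the compile-time checks (which need
C++11) are an addition for this sketch.
</p>

```cpp
#include <cassert>
#include <climits>

// Hypothetical names; adjust the underlying types if a platform differs.
typedef unsigned char  octet_sketch_t;   // one UTF-8 code unit
typedef unsigned short unit16_sketch_t;  // one UTF-16 code unit
typedef unsigned int   unit32_sketch_t;  // one UTF-32 code unit or code point

// Fail the build early if the guesses above do not hold on this platform.
static_assert(sizeof(octet_sketch_t) * CHAR_BIT == 8, "octet type must be 8 bits");
static_assert(sizeof(unit16_sketch_t) * CHAR_BIT == 16, "16 bit type has wrong size");
static_assert(sizeof(unit32_sketch_t) * CHAR_BIT == 32, "32 bit type has wrong size");
```

<p>
Keeping the platform assumptions in one small place is what lets the rest of the
code stay portable.
</p>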
1519	<h4>
1520	Alternatives
1521	</h4>
1522	<p>
1523	In case you want to look into other means of working with UTF-8 strings from C++,
1524	here is the list of solutions I am aware of:
1525	</p>
1526	<ol>
1527	<li>
1528	<a href="http://icu.sourceforge.net/">ICU Library</a>. It is very powerful,
complete, feature-rich, mature, and widely used. It is also big, intrusive,
non-generic, and doesn't play well with the Standard Library. I definitely
recommend looking at ICU even if you don't plan to use it.
1532	</li>
1533	<li>
1534	<a href=
1535	"http://www.gtkmm.org/gtkmm2/docs/tutorial/html/ch03s04.html">Glib::ustring</a>.
A class specifically made to work with UTF-8 strings, and also to feel like
<code>std::string</code>. If you prefer to have yet another string class in your
code, it may be worth a look. Be aware of the licensing issues, though.
1539	</li>
1540	<li>
1541	Platform dependent solutions: Windows and POSIX have functions to convert strings
1542	from one encoding to another. That is only a subset of what my library offers,
1543	but if that is all you need it may be good enough, especially given the fact that
1544	these functions are mature and tested in production.
1545	</li>
1546	</ol>
1547	<h2 id="conclusion">
1548	Conclusion
1549	</h2>
1550	<p>
Until Unicode becomes officially recognized by the C++ Standard Library, we need to
use other means to work with UTF-8 strings. The template functions I describe in this
article may be a good step in this direction.
1554	</p>
1555	<h2 id="links">
1556	Links
1557	</h2>
1558	<ol>
1559	<li>
1560	<a href="http://www.unicode.org/">The Unicode Consortium</a>.
1561	</li>
1562	<li>
1563	<a href="http://icu.sourceforge.net/">ICU Library</a>.
1564	</li>
1565	<li>
1566	<a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8 at Wikipedia</a>
1567	</li>
1568	<li>
1569	<a href="http://www.cl.cam.ac.uk/~mgk25/unicode.html">UTF-8 and Unicode FAQ for
1570	Unix/Linux</a>
1571	</li>
1572	</ol>
1573	</body>
1574	</html>
