tsm: unicode: do not encode invalid UTF8

We must under all conditions avoid encoding invalid UTF8. Otherwise, we would rely on other applications to do error-recovery. Unfortunately, this is no syntactical change but a semnatical fix as the Unicode standard defines several codepoints which are invalid or which must never be used in UTF8. See the Unicode standard if you're interested in these codepoint ranges. Signed-off-by: David Herrmann <dh.herrmann@googlemail.com>
2012-09-30 17:59:36 +02:00 · 2012-09-30 17:59:36 +02:00 · 17a56a24f2
commit 17a56a24f2
parent e0d30b2283
1 changed files with 12 additions and 0 deletions
--- a/src/tsm_unicode.c
+++ b/src/tsm_unicode.c
@ -344,10 +344,22 @@ err_id:
 * indicates how long the written UTF8 string is.
 *
 * Please note @g is a real UCS4 code and not a tsm_symbol_t object!
+ *
+ * Unicode symbols between 0xD800 and 0xDFFF are not assigned and reserved for
+ * UTF16 compatibility. It is an error to encode them. Same applies to numbers
+ * greater than 0x10FFFF, the range 0xFDD0-0xFDEF and codepoints ending with
+ * 0xFFFF or 0xFFFE.
 */

 size_t tsm_ucs4_to_utf8(uint32_t g, char *txt)
 {
+	if (g >= 0xd800 && g <= 0xdfff)
+		return 0;
+	if (g > 0x10ffff || (g & 0xffff) == 0xffff || (g & 0xffff) == 0xfffe)
+		return 0;
+	if (g >= 0xfdd0 && g <= 0xfdef)
+		return 0;
+
 	if (g < (1 << 7)) {
 		txt[0] = g & 0x7f;
 		return 1;