c - Length of a multibyte sequence in bytes, (Unicode) code points, characters and cursor positions
I'm not sure if my assumptions are correct, but as I understand it the 4 kinds of length of a multibyte sequence can all be different. Let me illustrate:
Say the multibyte encoding is UTF-8, and we have the string "\xc3\xb8 \xe2\x86\x82 e\xcc\x88",
which is the UTF-8 encoding of "\u00f8 \u2182 e\u0308",
which is "ø ↂ ë".
This string has a length of:
1.) 10 bytes
2.) 6 Unicode code points
3.) 5 characters
4.) 6 screen positions (with a monospaced font, because ↂ takes 2 positions)
1.) is returned by strlen, and 2.) can be determined with the <wchar.h> functions.
But is there a portable way of determining 3.) and 4.)? I'm not sure whether ↂ taking 2 cursor positions is defined font-independently by its code point or by the font in use, since "monospaced font" and "some characters take more than 1 space" seem contradictory. At least in my monospaced font, the character does cover 2 cursor positions. The Unicode chart U2150 doesn't say anything about cursor positions.
Lastly, can the number of positions be negative for a character (I mean, a character that moves the cursor one position to the left in a left-to-right script, or vice versa)?
The POSIX interface wcwidth can be used to find the number of "cursor positions" of a wchar_t. In order to get wchar_t values (one at a time), you can use the C99 standard library function mbtowc, which extracts a single multibyte character from a string and returns the number of bytes consumed. (Repeatedly calling mbtowc on a string and updating the string pointer each time will tell you how many multibyte characters are present in the string, at least if the multibyte coding is UTF-8.)
The combination of wcwidth and mbtowc can more or less tell you how many glyphs you have in the string (your question #3). A wchar_t for which wcwidth returns 0 is either a zero-width format control or a combining character, and a wchar_t for which wcwidth returns -1 is either a non-character or a control character (like \n). Either way, it can be ignored, so the glyph count is the count of wchar_t whose width is >0.
That makes it clear that your four questions have different answers:
1.) The number of bytes.
2.) The number of multibyte codepoints.
3.) The number of multibyte codepoints whose wcwidth is greater than 0.
4.) The sum of the wcwidth of the multibyte codepoints whose wcwidth is greater than 0.
Having said that, there is no guarantee that the value returned by wcwidth corresponds to either the actual character widths of the current console font or the Unicode version being used by the application. (I've had problems with both of these.) The values returned by wcwidth are extracted from the current locale, so you can edit and recompile the locale files to fix errors. See, for example, the answer here: how ncurses output astral plane unicode characters
c unicode encoding