Module utf8 | Tarantool

Module utf8

utf8 is Tarantool’s module for handling UTF-8 strings. Some of its functions are compatible with those in Lua 5.3, but Tarantool provides many more. For example, because Tarantool internally contains a complete copy of the “International Components for Unicode” (ICU) library, there are comparison functions which understand the default ordering for Cyrillic (Capital Letter Zhe Ж = Small Letter Zhe ж) and Japanese (Hiragana A = Katakana A).

Name                                     Use
casecmp and cmp                          Comparisons
lower and upper                          Case conversions
isalpha, isdigit, islower and isupper    Determine character types
sub                                      Substrings
len                                      Length in characters
next                                     Character-at-a-time iterations

utf8.casecmp(UTF8-string, UTF8-string)
Parameters:
  • string (UTF8-string) – a string encoded with UTF-8
Return:

-1 meaning “less”, 0 meaning “equal”, +1 meaning “greater”

Rtype:

number

Compare two strings with the Default Unicode Collation Element Table (DUCET) for the Unicode Collation Algorithm. Thus ‘å’ is less than ‘B’, even though the code-point value of å (229) is greater than the code-point value of B (66), because the algorithm depends on the values in the Collation Element Table, not the code-point values.

The comparison is done with primary weights. Therefore the elements which affect secondary or later weights (such as “case” in Latin or Cyrillic alphabets, or “kana differentiation” in Japanese) are ignored. If asked “is this like a Microsoft case-insensitive accent-insensitive collation” we tend to answer “yes”, though the Unicode Collation Algorithm is far more sophisticated than those terms imply.

Example:

tarantool> utf8.casecmp('é','e'),utf8.casecmp('E','e')
---
- 0
- 0
...
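
Because utf8.casecmp returns 0 for strings that differ only in case or accents, it can serve as an equality test for lookups that should ignore such differences. The following is a minimal sketch meant for the Tarantool console or a Tarantool script; the helper name same_ignoring_case is hypothetical and not part of the module.

```lua
-- Hypothetical helper: true when two UTF-8 strings are equal at the
-- primary collation level (case and accent differences are ignored).
local function same_ignoring_case(a, b)
    return utf8.casecmp(a, b) == 0
end

print(same_ignoring_case('Ж', 'ж'))  -- true
print(same_ignoring_case('é', 'E'))  -- true
print(same_ignoring_case('a', 'b'))  -- false
```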
utf8.char(code-point[, code-point ...])
Parameters:
  • number (code-point) – a Unicode code point value, repeatable
Return:

a UTF-8 string

Rtype:

string

The code-point number is the value that corresponds to a character in the Unicode Character Database. This is not the same as the byte values of the encoded character, because the UTF-8 encoding scheme is more complex than a simple copy of the code-point number.

Another way to construct a string with Unicode characters is the \u{hex-digits} escape mechanism. For example, ‘\u{41}\u{42}’ and utf8.char(65,66) both produce the string ‘AB’.

Example:

tarantool> utf8.char(229)
---
- å
...
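
The difference between a code point and its encoded bytes can be checked directly. This short sketch, assuming it is run in the Tarantool console, uses only utf8.char and the standard string.byte function:

```lua
local s = utf8.char(229)          -- the one-character string 'å'

print(#s)                         -- 2: the encoded form takes two bytes
print(string.byte(s, 1, -1))      -- 195 165: the byte values, not 229
print(s == '\u{E5}')              -- true: the escape yields the same string
print(utf8.char(65, 66) == 'AB')  -- true
```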
utf8.cmp(UTF8-string, UTF8-string)
Parameters:
  • string (UTF8-string) – a string encoded with UTF-8
Return:

-1 meaning “less”, 0 meaning “equal”, +1 meaning “greater”

Rtype:

number

Compare two strings with the Default Unicode Collation Element Table (DUCET) for the Unicode Collation Algorithm. Thus ‘å’ is less than ‘B’, even though the code-point value of å (229) is greater than the code-point value of B (66), because the algorithm depends on the values in the Collation Element Table, not the code-point values.

The comparison is done with at least three weights. Therefore the elements which affect secondary or later weights (such as “case” in Latin or Cyrillic alphabets, or “kana differentiation” in Japanese) are not ignored, and upper case comes after lower case.

Example:

tarantool> utf8.cmp('é','e'),utf8.cmp('E','e')
---
- 1
- 1
...
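
One common use of utf8.cmp is as the ordering function for table.sort. The sketch below assumes the Tarantool console and contrasts it with the default string comparison, which in LuaJIT (the Lua implementation that Tarantool embeds) orders strings by byte value. The sample words are illustrative only.

```lua
-- Default ordering: strings are compared by their byte values.
local by_bytes = {'E', 'B', 'e', 'å'}
table.sort(by_bytes)
print(table.concat(by_bytes, ' '))      -- B E e å

-- Collation ordering: a sorts before b when utf8.cmp(a, b) is negative.
local by_collation = {'E', 'B', 'e', 'å'}
table.sort(by_collation, function(a, b)
    return utf8.cmp(a, b) < 0
end)
print(table.concat(by_collation, ' '))  -- å B e E
```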
utf8.isalpha(UTF8-character)
Parameters:
  • string-or-number (UTF8-character) – a single UTF-8 character, expressed as a one-character string or a code point value
Return:

true or false

Rtype:

boolean

Return true if the input character is an “alphabetic-like” character, otherwise return false. Generally speaking a character will be considered alphabetic-like provided it is typically used within a word, as opposed to a digit or punctuation. It does not have to be a character in an alphabet.

Example:

tarantool> utf8.isalpha('Ж'),utf8.isalpha('å'),utf8.isalpha('9')
---
- true
- true
- false
...
utf8.isdigit(UTF8-character)
Parameters:
  • string-or-number (UTF8-character) – a single UTF-8 character, expressed as a one-character string or a code point value
Return:

true or false

Rtype:

boolean

Return true if the input character is a digit, otherwise return false.

Example:

tarantool> utf8.isdigit('Ж'),utf8.isdigit('å'),utf8.isdigit('9')
---
- false
- false
- true
...
utf8.islower(UTF8-character)
Parameters:
  • string-or-number (UTF8-character) – a single UTF-8 character, expressed as a one-character string or a code point value
Return:

true or false

Rtype:

boolean

Return true if the input character is lower case, otherwise return false.

Example:

tarantool> utf8.islower('Ж'),utf8.islower('å'),utf8.islower('9')
---
- false
- true
- false
...
utf8.isupper(UTF8-character)
Parameters:
  • string-or-number (UTF8-character) – a single UTF-8 character, expressed as a one-character string or a code point value
Return:

true or false

Rtype:

boolean

Return true if the input character is upper case, otherwise return false.

Example:

tarantool> utf8.isupper('Ж'),utf8.isupper('å'),utf8.isupper('9')
---
- true
- false
- false
...
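
The character-type functions also accept code point values, so they combine naturally with utf8.next (described further down) when classifying every character of a string. A minimal sketch, assuming the Tarantool console; the counts table and the sample string are illustrative only.

```lua
local counts = {alpha = 0, digit = 0, other = 0}

-- utf8.next yields the code point of each character in turn; the
-- utf8.is* functions are called here with that code point value.
for _, codepoint in utf8.next, 'Ж9å!' do
    if utf8.isalpha(codepoint) then
        counts.alpha = counts.alpha + 1
    elseif utf8.isdigit(codepoint) then
        counts.digit = counts.digit + 1
    else
        counts.other = counts.other + 1
    end
end

print(counts.alpha, counts.digit, counts.other)  -- 2 1 1
```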
utf8.len(UTF8-string[, start-byte[, end-byte]])
Parameters:
  • string (UTF8-string) – a string encoded with UTF-8
  • integer (start-byte) – byte position of the first character
  • integer (end-byte) – byte position where to stop
Return:

the number of characters in the string, or between start and end

Rtype:

number

Byte positions for start and end can be negative, which indicates “calculate from end of string” rather than “calculate from start of string”.

If the string contains a byte sequence which is not valid in UTF-8, each byte in the invalid byte sequence will be counted as one character.

UTF-8 is a variable-size encoding scheme. Typically a simple Latin letter takes one byte, a Cyrillic letter takes two bytes, a Chinese/Japanese character takes three bytes, and the maximum is four bytes.

Example:

tarantool> utf8.len('G'),utf8.len('ж')
---
- 1
- 1
...

tarantool> string.len('G'),string.len('ж')
---
- 1
- 2
...
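
The variable-size encoding described above can be observed by printing the character count next to the byte count for one sample character of each size. A short sketch, assuming the Tarantool console; the sample characters (including the four-byte emoji) are chosen here for illustration.

```lua
-- One sample character per encoded size: 1, 2, 3 and 4 bytes.
local samples = {'G', 'ж', 'あ', '😀'}

for _, ch in ipairs(samples) do
    -- utf8.len counts characters; the # operator counts bytes.
    print(ch, utf8.len(ch), #ch)
end
-- G   1  1
-- ж   1  2
-- あ  1  3
-- 😀  1  4
```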
utf8.lower(UTF8-string)
Parameters:
  • string (UTF8-string) – a string encoded with UTF-8
Return:

the same string, lower case

Rtype:

string

Example:

tarantool> utf8.lower('ÅΓÞЖABCDEFG')
---
- åγþжabcdefg
...
utf8.next(UTF8-string[, start-byte])
Parameters:
  • string (UTF8-string) – a string encoded with UTF-8
  • integer (start-byte) – byte position where to start within the string, default is 1
Return:

byte position of the following character, and the code point value of the character found at start-byte

Rtype:

two numbers

The next function is often used in a loop to get one character at a time from a UTF-8 string.

Example:

In the string ‘åa’ the first character is ‘å’. It starts at position 1 and takes two bytes to store, so the character after it is at position 3. Its Unicode code point value is (decimal) 229.

tarantool> -- show next-character position + first-character codepoint
tarantool> utf8.next('åa', 1)
---
- 3
- 229
...
tarantool> -- (loop) show codepoint of every character
tarantool> for position,codepoint in utf8.next,'åa' do print(codepoint) end
229
97
...
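
The same loop form can split a string into its individual characters by turning each code point back into a string with utf8.char. A minimal sketch, assuming the Tarantool console; the variable names are illustrative.

```lua
local s = 'åa'
local chars = {}

-- The first loop variable is the byte position of the character that
-- follows; the second is the code point of the current character.
for next_position, codepoint in utf8.next, s do
    chars[#chars + 1] = utf8.char(codepoint)
end

print(#chars)                    -- 2
print(table.concat(chars, ','))  -- å,a
```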
utf8.sub(UTF8-string, start-character[, end-character])
Parameters:
  • string (UTF8-string) – a string encoded as UTF-8
  • number (start-character) – the position of the first character
  • number (end-character) – the position of the last character
Return:

a UTF-8 string, the “substring” of the input value

Rtype:

string

Character positions for start and end can be negative, which indicates “calculate from end of string” rather than “calculate from start of string”.

The default value for end-character is the length of the input string. Therefore, saying utf8.sub('abc', 1) will return ‘abc’, the same as the input string.

Example:

tarantool> utf8.sub('åγþжabcdefg', 5, 8)
---
- abcd
...
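
Since utf8.sub counts characters while the standard string.sub counts bytes, the two give different results as soon as a string contains multi-byte characters. A short comparison, assuming the Tarantool console:

```lua
local s = 'åγþжabcdefg'

print(utf8.sub(s, 1, 4))    -- åγþж    (characters 1 through 4)
print(string.sub(s, 1, 4))  -- åγ      (bytes 1 through 4; each of these letters takes 2 bytes)
print(utf8.sub(s, 5))       -- abcdefg (end-character defaults to the string length)
```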
utf8.upper(UTF8-string)
Parameters:
  • string (UTF8-string) – a string encoded with UTF-8
Return:

the same string, upper case

Rtype:

string

Note

In rare cases the upper-case result may be longer than the lower-case input, for example utf8.upper('ß') is ‘SS’.

Example:

tarantool> utf8.upper('åγþжabcdefg')
---
- ÅΓÞЖABCDEFG
...
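
The length change mentioned in the note can be confirmed with utf8.len. A small sketch, assuming the Tarantool console; the variable names are illustrative.

```lua
local lower = 'ß'
local upper = utf8.upper(lower)

print(upper)                             -- SS
print(utf8.len(lower), utf8.len(upper))  -- 1 2
```

Code that sizes buffers or columns from the input length should therefore measure the result after conversion rather than before.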