Module utf8

Overview

utf8 is Tarantool's module for handling UTF-8 strings. It includes some functions that are compatible with those in Lua 5.3, but Tarantool offers many more. For example, because Tarantool internally contains a complete copy of the "International Components For Unicode" (ICU) library, there are comparison functions that understand the default ordering for Cyrillic (Capital Letter Zhe Ж = Small Letter Zhe ж) and Japanese (Hiragana A = Katakana A).

The module is fully built-in so require('utf8') is not necessary.

Index

Below is a list of all utf8 functions.

Name                                     Use
casecmp and cmp                          Comparisons
char                                     Build a string from code points
lower and upper                          Case conversions
isalpha, isdigit, islower and isupper    Determine character types
sub                                      Substrings
len                                      Length in characters
next                                     Character-at-a-time iterations

utf8.casecmp(UTF8-string, UTF8-string)
Parameters:
  • string (UTF8-string) – a string encoded with UTF-8
Return:

-1 meaning "less", 0 meaning "equal", +1 meaning "greater"

Rtype:

number

Compare two strings with the Default Unicode Collation Element Table (DUCET) for the Unicode Collation Algorithm. Thus 'å' is less than 'B', even though the code-point value of å (229) is greater than the code-point value of B (66), because the algorithm depends on the values in the Collation Element Table, not the code values.

The comparison is done with primary weights only. Therefore the elements that affect secondary or later weights (such as "case" in Latin or Cyrillic alphabets, or "kana differentiation" in Japanese) are ignored. If asked "is this like a Microsoft case-insensitive accent-insensitive collation?" we tend to answer "yes", though the Unicode Collation Algorithm is far more sophisticated than those terms imply.

Example:

tarantool> utf8.casecmp('é','e'),utf8.casecmp('E','e')
---
- 0
- 0
...
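
As a usage sketch, a primary-weight comparison can back a simple "same word, ignoring case and accents" check. The helper name same_base_letters is hypothetical and exists only for this illustration; the expected results follow from the equalities described above.

```lua
-- Hypothetical helper: true when two strings differ only in case or accents,
-- that is, when utf8.casecmp reports them as equal at primary weight.
local function same_base_letters(a, b)
    return utf8.casecmp(a, b) == 0
end

local cyrillic_match = same_base_letters('Ж', 'ж')          -- only case differs
local accent_match = same_base_letters('RÉSUMÉ', 'resume')  -- case and accents differ
local different = same_base_letters('a', 'b')               -- different letters

print(cyrillic_match, accent_match, different)  -- true  true  false
```

Such a check is looser than plain string equality, so it suits lookups and de-duplication rather than exact matching.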
utf8.char(code-point[, code-point ...])
Parameters:
  • number (code-point) – a Unicode code point value, repeatable
Return:

a UTF-8 string

Rtype:

string

The code-point number is the value that corresponds to a character in the Unicode Character Database. This is not the same as the byte values of the encoded character, because the UTF-8 encoding scheme is more complex than a simple copy of the code-point number.

Another way to construct a string with Unicode characters is the \u{hex-digits} escape mechanism. For example, '\u{41}\u{42}' and utf8.char(65,66) both produce the string 'AB'.

Example:

tarantool> utf8.char(229)
---
- å
...
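
The difference between a code point and its encoded bytes can be seen with a short sketch. It only uses utf8.char together with the standard Lua length operator and string.byte; the variable names are illustrative.

```lua
-- Build the same string from code points and from \u{...} escapes.
local from_code_points = utf8.char(65, 66)
local from_escapes = '\u{41}\u{42}'
assert(from_code_points == from_escapes)  -- both are 'AB'

-- Code point 229 (å) is stored as the two bytes 0xC3 0xA5 (195, 165),
-- not as a single byte with the value 229.
local a_ring = utf8.char(229)
local byte_count = #a_ring
local first_byte, second_byte = a_ring:byte(1, -1)

print(byte_count, first_byte, second_byte)  -- 2  195  165
```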
utf8.cmp(UTF8-string, utf8-string)
Parameters:
  • string (UTF8-string) – a string encoded with UTF-8
Return:

-1 meaning "less", 0 meaning "equal", +1 meaning "greater"

Rtype:

number

Compare two strings with the Default Unicode Collation Element Table (DUCET) for the Unicode Collation Algorithm. Thus 'å' is less than 'B', even though the code-point value of å (229) is greater than the code-point value of B (66), because the algorithm depends on the values in the Collation Element Table, not the code values.

The comparison is done with all weights, so case and accents are not ignored, and upper case comes after lower case (as the example shows, 'E' compares greater than 'e').

Example:

tarantool> utf8.cmp('é','e'),utf8.cmp('E','e')
---
- 1
- 1
...
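
Because utf8.cmp returns -1, 0 or +1, it can serve as the basis of a sort predicate. The following is a minimal sketch, assuming the standard table.sort function; the word list is made up for the illustration.

```lua
local words = {'B', 'å', 'Z', 'a'}

-- table.sort expects a "less than" predicate, so treat -1 as true.
table.sort(words, function(x, y)
    return utf8.cmp(x, y) < 0
end)

-- With collation-based ordering 'å' lands before 'B',
-- which a plain byte-wise sort would not do.
print(table.concat(words, ' '))
```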
utf8.isalpha(UTF8-character)
Parameters:
  • string-or-number (UTF8-character) – a single UTF-8 character, expressed as a one-character string or a code point value
Return:

true or false

Rtype:

boolean

Return true if the input character is an "alphabetic-like" character, otherwise return false. Generally speaking, a character is considered alphabetic-like if it is typically used within a word, as opposed to a digit or punctuation. It does not have to be a character in an alphabet.

Example:

tarantool> utf8.isalpha('Ж'),utf8.isalpha('å'),utf8.isalpha('9')
---
- true
- true
- false
...
utf8.isdigit(UTF8-character)
Parameters:
  • string-or-number (UTF8-character) – a single UTF-8 character, expressed as a one-character string or a code point value
Return:

true or false

Rtype:

boolean

Return true if the input character is a digit, otherwise return false.

Example:

tarantool> utf8.isdigit('Ж'),utf8.isdigit('å'),utf8.isdigit('9')
---
- false
- false
- true
...
utf8.islower(UTF8-character)
Parameters:
  • string-or-number (UTF8-character) – a single UTF-8 character, expressed as a one-character string or a code point value
Return:

true or false

Rtype:

boolean

Return true if the input character is lower case, otherwise return false.

Example:

tarantool> utf8.islower('Ж'),utf8.islower('å'),utf8.islower('9')
---
- false
- true
- false
...
utf8.isupper(UTF8-character)
Parameters:
  • string-or-number (UTF8-character) – a single UTF-8 character, expressed as a one-character string or a code point value
Return:

true or false

Rtype:

boolean

Return true if the input character is upper case, otherwise return false.

Example:

tarantool> utf8.isupper('Ж'),utf8.isupper('å'),utf8.isupper('9')
---
- true
- false
- false
...
utf8.len(UTF8-string[, start-byte[, end-byte]])
Parameters:
  • string (UTF8-string) – a string encoded with UTF-8
  • integer (start-byte) – byte position of the first character
  • integer (end-byte) – byte position where to stop
Return:

the number of characters in the string, or between start and end

Rtype:

number, or error if the input string is not valid UTF-8

Byte positions for start and end can be negative, which indicates "calculate from end of string" rather than "calculate from start of string".

If an error occurs, the error return will include, as a second value, the byte position where the invalid UTF-8 character was found.

UTF-8 is a variable-size encoding scheme. Typically a simple Latin letter takes one byte, a Cyrillic letter takes two bytes, a Chinese/Japanese character takes three bytes, and the maximum is four bytes.

Example:

tarantool> utf8.len('G'),utf8.len('ж')
---
- 1
- 1
...

tarantool> string.len('G'),string.len('ж')
---
- 1
- 2
...
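
The optional byte positions and the error return can be sketched as follows. This is an illustration rather than a definitive reference: it assumes that the first return value is nil when the input is not valid UTF-8, and the variable names are made up.

```lua
local s = 'жG'  -- 'ж' occupies bytes 1-2, 'G' occupies byte 3

local chars = utf8.len(s)          -- 2: characters in the whole string
local bytes = string.len(s)        -- 3: bytes in the whole string
local from_third = utf8.len(s, 3)  -- 1: only 'G' starts at byte 3
local from_last = utf8.len(s, -1)  -- 1: -1 is the last byte, which holds 'G'

-- Byte 255 can never appear in valid UTF-8.
local broken = 'ab' .. string.char(255) .. 'cd'

-- Assumption: on invalid input the first value is nil and the
-- second value is the byte position of the problem.
local count, bad_pos = utf8.len(broken)
if count == nil then
    print('invalid UTF-8 near byte ' .. tostring(bad_pos))
end
```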
utf8.lower(UTF8-string)
Parameters:
  • string (UTF8-string) – a string encoded with UTF-8
Return:

the same string, lower case

Rtype:

string, or error if the input string is not valid UTF-8

Example:

tarantool> utf8.lower('ÅΓÞЖABCDEFG')
---
- åγþжabcdefg
...
utf8.next(UTF8-string[, start-byte])
Parameters:
  • string (UTF8-string) – a string encoded with UTF-8
  • integer (start-byte) – byte position where to start within the string, default is 1
Return:

byte position of the next character and the code point value of the next character

Rtype:

two numbers (the byte position and the code point value), or error if the input string is not valid UTF-8

The next function is often used in a loop to get one character at a time from a UTF-8 string.

Example:

In the string 'åa' the first character is 'å'. It starts at position 1 and takes two bytes to store, so the character after it is at position 3. Its Unicode code point value is 229 (decimal).

tarantool> utf8.next('åa', 1)
---
- 3
- 229
...
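
A character-at-a-time loop built on utf8.next might look like the sketch below. It assumes that utf8.next returns nothing (so the first value is nil) once the start position is past the end of the string; the table code_points is only there to collect results for the illustration.

```lua
local s = 'åa'
local code_points = {}

local pos = 1
while true do
    -- next_pos is the byte position of the following character,
    -- code_point is the value of the character that starts at pos.
    local next_pos, code_point = utf8.next(s, pos)
    if next_pos == nil then
        break  -- assumption: no return values at the end of the string
    end
    code_points[#code_points + 1] = code_point
    pos = next_pos
end

print(table.concat(code_points, ' '))  -- 229 97
```

Advancing by the returned byte position, rather than by 1, is what keeps the loop aligned on character boundaries.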
utf8.sub(UTF8-string[, start-character[, end-character]])
Parameters:
  • string (UTF8-string) – a string encoded with UTF-8
  • number (start-character) – the position of the first character
  • number (end-character) – the position of the last character
Return:

a UTF-8 string, the "substring" of the input value

Rtype:

string

Character positions for start and end can be negative, which indicates "calculate from end of string" rather than "calculate from start of string".

The default value for start-character is 1, and the default value for end-character is the length of the input string. Therefore, saying utf8.sub('abc') will return 'abc', the same as the input string.

Example:

tarantool> utf8.sub('åγþжabcdefg', 5, 8)
---
- abcd
...
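
The practical difference from the byte-oriented string.sub can be sketched with the same input. The variable names are illustrative, and the negative-position call relies on the "calculate from end of string" behavior described above.

```lua
local s = 'åγþжabcdefg'

-- Positions are counted in characters, so the four two-byte
-- letters at the front come back intact.
local first_four_chars = utf8.sub(s, 1, 4)   -- 'åγþж'

-- A negative start counts characters from the end of the string.
local last_three_chars = utf8.sub(s, -3)     -- 'efg'

-- string.sub counts bytes, so four bytes cover only two of those letters.
local first_four_bytes = string.sub(s, 1, 4) -- 'åγ'

print(first_four_chars, last_three_chars, first_four_bytes)
```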
utf8.upper(UTF8-string)
Parameters:
  • string (UTF8-string) – a string encoded with UTF-8
Return:

the same string, upper case

Rtype:

string, or error if the input string is not valid UTF-8

Note

In rare cases the upper-case result may be longer than the lower-case input, for example utf8.upper('ß') is 'SS'.

Example:

tarantool> utf8.upper('åγþжabcdefg')
---
- ÅΓÞЖABCDEFG
...