Unicode Text Segmentation

The TextSegment shard implements Unicode text segmentation as specified in Unicode Standard Annex #29 (Unicode version 13.0.0) to determine the grapheme cluster boundaries of Unicode text.

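In Crystal, the String class provides a codepoints method that returns the Unicode code points of a string. However, multiple code points may combine into a single user-perceived character, which the Unicode specification calls a grapheme cluster. For example, the thumbs-up emoji with a skin-tone modifier, 👍🏼, consists of two code points (U+1F44D and U+1F3FC) but is one grapheme cluster.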
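To make the mismatch concrete, here is a minimal sketch that uses only Crystal's standard String methods (bytesize, size, codepoints) and does not need this shard. The variable names are illustrative.

```crystal
# Standard library only; this shard is not required here.
# U+1F44D (thumbs up) followed by U+1F3FC (skin-tone modifier): one visible character.
thumbs_up = "\u{1F44D}\u{1F3FC}"

# Format each code point as U+XXXX for display.
code_points = thumbs_up.codepoints.map { |cp| "U+" + cp.to_s(16).upcase }

puts thumbs_up.bytesize    # => 8 (UTF-8 bytes)
puts thumbs_up.size        # => 2 (code points, not user-perceived characters)
puts code_points.join(" ") # => U+1F44D U+1F3FC
```

Neither count matches what a reader sees on screen, which is a single character.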

This shard provides a tool to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.

Installation

Add the dependency to your shard.yml:

dependencies:
  textseg:
    github: naqvis/uni_text_seg

Run shards install

Usage

require "textseg"

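TextSegment.each_grapheme("🔮👍🏼!") do |cluster|
  pp cluster.codepoints # code points that make up this cluster
  pp cluster.positions  # position of this cluster in the input
  pp cluster.str        # the cluster as a string
  pp cluster.bytes      # the cluster's bytes
end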
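As a further sketch, the same iterator can be used to count user-perceived characters and to split a string at cluster boundaries. This assumes that cluster.str returns a String, as the usage above suggests. The count and clusters variables are illustrative and not part of the shard.

```crystal
require "textseg"

text = "🔮👍🏼!"

count = 0
clusters = [] of String # assumes cluster.str is a String

TextSegment.each_grapheme(text) do |cluster|
  count += 1              # one iteration per grapheme cluster
  clusters << cluster.str # keep each cluster as its own string
end

puts count            # number of user-perceived characters
puts clusters.join("|") # the input split at cluster boundaries
```

Under the UAX #29 rules the skin-tone modifier attaches to the preceding emoji, so this input should yield three clusters even though it contains four code points.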

Development

To run all tests:

crystal spec

Contributing

Fork it (https://github.com/naqvis/uni_text_seg/fork)
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create a new Pull Request

Contributors

Ali Naqvi - creator and maintainer