Sudachi version 0.6.0

@github-actions github-actions released this 09 Jun 01:26

Highlights

  • Analysis is about 20% faster than in 0.5.3
  • New typed configuration API (Config)
  • Regex matcher plugin
  • OOV Handlers can use fully-customized POS tags
  • API for compiling dictionaries

New features

API for building dictionaries

In addition to the command-line interface, Sudachi now provides an API for building dictionaries.

See the DicBuilder class and the CLI for usage examples. There is no Javadoc for it yet.

Configuration API

A new typed configuration API has been introduced; see the Config class. It supports flexible path resolution against the classpath (with customizable prefixes and classloaders) and the filesystem. The dictionary creation API that uses the old Settings class is deprecated.

The new configuration framework allows some resources (dictionaries, character tables) to be supplied preloaded and prebuilt.

For usage details, see the Javadoc for the Config class.
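As a rough illustration of what resolving a resource against both the filesystem and the classpath can look like, here is a minimal standard-library sketch. It is not Sudachi's implementation and does not use the Config class: ResourceLocator, its fields, and the lookup order (filesystem first, then classpath with a prefix) are all assumptions made for this example.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical helper, NOT part of Sudachi: resolves a resource name to a URL. */
final class ResourceLocator {
    private final Path baseDir;           // directory tried first (assumed order)
    private final String classpathPrefix; // prepended to the name for classpath lookups
    private final ClassLoader loader;     // class loader used for the classpath lookup

    ResourceLocator(Path baseDir, String classpathPrefix, ClassLoader loader) {
        this.baseDir = baseDir;
        this.classpathPrefix = classpathPrefix;
        this.loader = loader;
    }

    /** Returns the filesystem hit if the file exists, otherwise the classpath hit, if any. */
    Optional<URL> resolve(String name) {
        Path candidate = baseDir.resolve(name);
        if (Files.exists(candidate)) {
            try {
                return Optional.of(candidate.toUri().toURL());
            } catch (MalformedURLException e) {
                throw new IllegalStateException(e);
            }
        }
        return Optional.ofNullable(loader.getResource(classpathPrefix + name));
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("resources");
        Files.createFile(dir.resolve("char.def"));
        ResourceLocator locator =
                new ResourceLocator(dir, "", ClassLoader.getSystemClassLoader());
        System.out.println(locator.resolve("char.def").isPresent());
        System.out.println(locator.resolve("no-such-resource.bin").isPresent());
    }
}
```

For the actual resolution rules, prefixes, and classloader handling, the Javadoc of the Config class remains the authority.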

Fully-custom POS tags in OOV providers

OOV providers can now be given POS tags that are not present in the dictionary. In that case, you must add "userPOS": "allow" to the OOV plugin configuration. POS tags must still have a 6-layer structure.

"oovProviderPlugin" : [
        { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
          "oovPOS" : [ "この", "たぐ", "", "ぞんざい", "しない", "" ],
          "userPOS": "allow",
          "leftId" : 8,
          "rightId" : 8,
          "cost" : 6000 }
    ]
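Since a custom tag still needs exactly six layers, a small check before writing the configuration can catch mistakes early. The sketch below is a hypothetical helper and not part of Sudachi; it only checks the layer count and rejects null entries. Empty strings are accepted because the example above uses them.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-flight check, NOT part of Sudachi. */
final class PosTagCheck {
    static final int POS_DEPTH = 6; // layer count stated in the release notes

    /** True when the tag has exactly six non-null layers. */
    static boolean hasSixLayers(List<String> pos) {
        if (pos == null || pos.size() != POS_DEPTH) {
            return false;
        }
        for (String layer : pos) {
            if (layer == null) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> ok = Arrays.asList("custom", "tag", "", "layer4", "layer5", "");
        List<String> tooShort = Arrays.asList("custom", "tag");
        System.out.println(hasSixLayers(ok));       // prints true
        System.out.println(hasSixLayers(tooShort)); // prints false
    }
}
```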

Regex OOV Provider Plugin

Introduced a new OOV provider which matches a regular expression.

Recommendations:

  • Use non-capturing groups, (?:like this), rather than capturing groups, (like this), in regular expressions.
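The difference between the two group kinds can be seen with plain java.util.regex, independent of Sudachi. In this small sketch, both patterns accept the same input, but only the second one records a group. GroupKinds is a name made up for the example.

```java
import java.util.regex.Pattern;

/** Illustration only: compares capturing and non-capturing groups in java.util.regex. */
final class GroupKinds {
    /** Number of capturing groups declared by the pattern. */
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        String nonCapturing = "(?:ab)+c";
        String capturing = "(ab)+c";
        // Both patterns accept the same text.
        System.out.println(Pattern.matches(nonCapturing, "ababc")); // prints true
        System.out.println(Pattern.matches(capturing, "ababc"));    // prints true
        // Only the capturing form makes the engine track a group.
        System.out.println(groupCount(nonCapturing)); // prints 0
        System.out.println(groupCount(capturing));    // prints 1
    }
}
```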

Caveats:

  • Matches may start only at boundaries where the character type changes.
  • Matches may not produce words that are already present in the dictionary.
  • Match length is limited to 63 UTF-16 code units.
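The length limit is counted in UTF-16 code units, which is what String.length() returns in Java, so characters outside the Basic Multilingual Plane count twice. The sketch below only illustrates that counting. MatchLength and fitsLimit are hypothetical names, and how the plugin treats longer matches is not stated here.

```java
/** Illustration only: counts UTF-16 code units the way String.length() does. */
final class MatchLength {
    static final int MAX_UNITS = 63; // limit stated in the caveats above

    /** True when the candidate is at most 63 UTF-16 code units long. */
    static boolean fitsLimit(CharSequence candidate) {
        return candidate.length() <= MAX_UNITS;
    }

    /** Small helper so the example does not depend on String.repeat (Java 11+). */
    static String repeat(String s, int times) {
        StringBuilder sb = new StringBuilder(s.length() * times);
        for (int i = 0; i < times; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ascii = repeat("a", 63);
        // One emoji is a surrogate pair: 1 code point, 2 code units.
        String emoji = repeat("\uD83D\uDE00", 32);
        System.out.println(fitsLimit(ascii));                        // prints true
        System.out.println(emoji.codePointCount(0, emoji.length())); // prints 32
        System.out.println(emoji.length());                          // prints 64
        System.out.println(fitsLimit(emoji));                        // prints false
    }
}
```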

Example for matching URLs:

{
    "class": "com.worksap.nlp.sudachi.RegexOovProvider",
    "leftId": 5968,
    "rightId": 5968,
    "cost": 19000,
    "regex": "^(?:https?://|www)[\\-_.!~*'()a-zA-Z0-9;/?:@&=+$,%#¯−]+",
    "pos": [ "補助記号", "一般", "URL", "*", "*", "*" ],
    "userPOS": "allow"
}
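To see what the regular expression in this example accepts, it can be tried with java.util.regex outside of Sudachi. The sketch below uses the pattern after JSON unescaping and anchors it at the start of the input. How the plugin applies the pattern internally is not shown here, so UrlRegexDemo and matchPrefix are illustrative names only.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: applies the URL pattern from the configuration example. */
final class UrlRegexDemo {
    // Same pattern as in the JSON above. The last two members of the character
    // class (U+00AF and U+2212) are written as escapes to keep this source ASCII-only.
    static final Pattern URL = Pattern.compile(
            "^(?:https?://|www)[\\-_.!~*'()a-zA-Z0-9;/?:@&=+$,%#\u00AF\u2212]+");

    /** Returns the URL-like prefix of the text, if the text starts with one. */
    static Optional<String> matchPrefix(String text) {
        Matcher m = URL.matcher(text);
        return m.lookingAt() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(matchPrefix("https://example.com/a?b=1 rest"));
        System.out.println(matchPrefix("www.example.com"));
        System.out.println(matchPrefix("ftp://example.com"));
    }
}
```

The match stops at the first character outside the character class, such as a space, which is why trailing text is not included in the result.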

Speedup

  • Lattice construction logic has been improved: it is now faster and generates less GC pressure.
  • Trie index lookup logic has been improved: it is now slightly faster and generates much less GC pressure.

Deprecations

All APIs deprecated in this section will be removed in the 1.0 release.

  • DictionaryFactory methods which use Settings
  • getPath method of Settings; use getResource instead.

Internal & Infrastructure

  • Build now uses Gradle instead of Maven
  • Tests can be written in Kotlin in addition to Java
  • The internal API of OOV provider plugins has changed. A plugin must now add its candidate nodes to the provided list and return the number of OOVs it created. See the Javadoc for details.
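The "fill the provided list, return the count" contract described in the last item can be sketched as follows. Every type and method name here (Node, OovProvider, provideOov, SingleCharProvider) is a hypothetical stand-in and does not reproduce Sudachi's real classes or signatures, which are documented in the Javadoc.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: hypothetical types, NOT Sudachi's plugin API. */
final class OovContractSketch {
    /** Hypothetical stand-in for a lattice node covering [begin, end) of the text. */
    static final class Node {
        final int begin;
        final int end;

        Node(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Hypothetical interface: append candidates to the list, return how many were added. */
    interface OovProvider {
        int provideOov(String text, int offset, List<Node> out);
    }

    /** Emits one candidate covering a single code point at the offset. */
    static final class SingleCharProvider implements OovProvider {
        @Override
        public int provideOov(String text, int offset, List<Node> out) {
            if (offset >= text.length()) {
                return 0; // nothing to create at or past the end of the text
            }
            int end = text.offsetByCodePoints(offset, 1);
            out.add(new Node(offset, end));
            return 1;
        }
    }

    public static void main(String[] args) {
        OovProvider provider = new SingleCharProvider();
        List<Node> candidates = new ArrayList<>(); // the caller owns the list
        int created = provider.provideOov("abc", 1, candidates);
        System.out.println(created + " candidate(s), list size " + candidates.size());
    }
}
```

One plausible benefit of this shape is that the caller can pass the same list to every provider instead of each provider allocating its own. That fits the reduced GC pressure mentioned under Speedup, although these notes do not state it as the motivation.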