Releases: WorksApplications/Sudachi

Sudachi version 0.7.5

05 Nov 05:55

Highlights

  • The behavior of the dictionary printer and builder has changed (#234)
    • DictionaryPrinter now prints word references in the (Surface, POS, Reading) triple format, instead of the line number format.
    • DictionaryBuilder now allows the dictionary form to be written in the triple format, not only the line number format.

Added

  • Benchmark scripts are added (#235)

Fixed

  • Tutorial and readme are updated (#237, #240)
  • Config.Resource.asByteBuffer now always returns a ByteBuffer with little-endian byte order (#239)
    • StringUtil.readAllBytes now also returns a ByteBuffer with little-endian byte order.
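
Byte order matters here because a freshly created java.nio.ByteBuffer is big-endian by default, so multi-byte reads give different values unless the order is set explicitly. The following is a minimal sketch using only the JDK; the class and method names are illustrative and are not part of Sudachi.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderSketch {
    /** Reads the first 32-bit integer from the bytes using the given byte order. */
    static int firstInt(byte[] bytes, ByteOrder order) {
        // ByteBuffer.wrap starts out BIG_ENDIAN; order(...) switches it.
        return ByteBuffer.wrap(bytes).order(order).getInt(0);
    }

    public static void main(String[] args) {
        byte[] raw = {0x01, 0x00, 0x00, 0x00};
        System.out.println(ByteBuffer.wrap(raw).order());           // BIG_ENDIAN
        System.out.println(firstInt(raw, ByteOrder.BIG_ENDIAN));    // 16777216
        System.out.println(firstInt(raw, ByteOrder.LITTLE_ENDIAN)); // 1
    }
}
```

Code that previously called order(ByteOrder.LITTLE_ENDIAN) on the returned buffer should keep working, since setting the same order again is harmless.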

Sudachi version 0.7.4

02 Jul 07:27

Highlights

  • Added Tokenizer.lazyTokenizeSentences(SplitMode mode, Readable input), which performs analysis lazily and reduces memory usage (#231)
    • Tokenizer.tokenizeSentences(SplitMode mode, Reader input) is marked as deprecated.
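
To illustrate what lazy processing of a Readable can look like, here is a minimal JDK-only sketch of an iterator that pulls input in small chunks and yields one sentence at a time. It is not Sudachi's implementation: the class name, the 8-character chunk size, and splitting on the ideographic full stop (U+3002) are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class LazySentences implements Iterator<String> {
    private static final String STOP = "\u3002"; // ideographic full stop
    private final Readable input;
    private final CharBuffer chunk = CharBuffer.allocate(8); // tiny on purpose
    private final StringBuilder pending = new StringBuilder();
    private boolean eof = false;

    LazySentences(Readable input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        // Read more input only while no complete sentence is buffered.
        while (pending.indexOf(STOP) < 0 && !eof) {
            chunk.clear();
            try {
                if (input.read(chunk) < 0) {
                    eof = true;
                    break;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            chunk.flip();
            pending.append(chunk);
        }
        return pending.length() > 0;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int end = pending.indexOf(STOP);
        // Without a terminator (end of input), return whatever is left.
        int cut = end < 0 ? pending.length() : end + 1;
        String sentence = pending.substring(0, cut);
        pending.delete(0, cut);
        return sentence;
    }

    public static void main(String[] args) {
        Iterator<String> it = new LazySentences(new StringReader("abcdefghij\u3002kl\u3002m"));
        while (it.hasNext()) {
            System.out.println(it.next().length()); // 11, then 3, then 1
        }
    }
}
```

Only the text up to the next sentence boundary (plus at most one chunk) is held in memory at a time, which is the general idea behind the reduced memory usage.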

Fixed

  • Do not segfault when tokenizing with a closed dictionary (#217)
  • The default config sudachi.json sets non-existent property joinKanjiNumeric in JoinNumericPlugin (#221)
  • Fixed incorrect size calculation when expanding (#227)
  • Update tutorial.md (#226)

Sudachi version 0.7.3

26 Jun 02:07

This is a support release for the 3.1.0 release of the Elasticsearch/OpenSearch integration.

Highlights

  • Added Config.fromResource method for reading Configs via PathAnchor. (#212)

Internals

  • Plugin classloading is done by PathAnchor and supports multiple classloaders (#210, #209)

Notes about v0.7.2

Release v0.7.2 contains a subset of the functionality of this release but lacks some crucial features. It is not a broken release, but it has no user-visible changes from v0.7.1.

Sudachi version 0.7.1

09 Mar 09:51

This is a maintenance release

Highlights

  • Fixed analysis truncation when using analysis with sentence splitting and the input does not contain data which can be treated as splittable sentences
  • Fixed O(N^2) performance in sentence splitting when the underlying reader does not fill the buffer fully at once
  • Stopped calling into the reader with a full buffer
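
The last two items concern readers that return fewer characters than requested. As a generic illustration (not the actual Sudachi fix), the sketch below fills a CharBuffer by looping until it is full or the input ends, and never calls the reader when no space remains. TrickleReader is a hypothetical helper that simulates short reads by returning at most one character per call.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;

final class FillBufferSketch {
    /** Hypothetical reader that returns at most one char per call. */
    static final class TrickleReader extends Reader {
        private final Reader delegate;
        int calls = 0;

        TrickleReader(Reader delegate) {
            this.delegate = delegate;
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            calls++;
            return delegate.read(cbuf, off, Math.min(len, 1));
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }

    /** Reads until the buffer is full or EOF; returns -1 only if EOF is hit before any char is read. */
    static int fill(Readable input, CharBuffer buffer) {
        int total = 0;
        try {
            while (buffer.hasRemaining()) { // never call read with a full buffer
                int n = input.read(buffer);
                if (n < 0) {
                    return total == 0 ? -1 : total;
                }
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        TrickleReader reader = new TrickleReader(new StringReader("hello world"));
        CharBuffer buffer = CharBuffer.allocate(4);
        System.out.println(fill(reader, buffer)); // 4
        buffer.flip();
        System.out.println(buffer);               // hell
    }
}
```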

0.6.4

09 Mar 09:51

This is a maintenance release

Highlights

  • Fixed analysis truncation when using analysis with sentence splitting and the input does not contain data which can be treated as splittable sentences
  • Fixed O(N^2) performance in sentence splitting when the underlying reader does not fill the buffer fully at once
  • Stopped calling into the reader with a full buffer

Sudachi version 0.6.3

29 Aug 12:50
Pre-release

Ported the relaxed boundary mode from 0.7.0 while keeping ABI compatibility with pre-0.7.0 versions.

Sudachi version 0.7.0

16 Aug 03:00

Highlights

  • Tokenizer.tokenize API returns MorphemeList instead of List<Morpheme>. This change is ABI-incompatible with previous versions and applications which use Sudachi require recompilation. The change should be source-compatible with no changes required to the source code which uses Sudachi.
  • New API: MorphemeList.split re-splits a C-mode token sequence into lower-level units without re-analyzing the whole string.
  • Added relaxed boundary matching mode for Regex OOV handler
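
The difference between source and ABI compatibility in the first item can be shown with a small JDK-only sketch. It assumes the new return type is still a List, which is what source compatibility implies; WordList and tokenize here are hypothetical stand-ins, not Sudachi classes. Existing caller code compiles unchanged, but the JVM method descriptor includes the return type, so class files compiled against the old signature no longer link.

```java
import java.lang.invoke.MethodType;
import java.util.AbstractList;
import java.util.Arrays;
import java.util.List;

final class AbiSketch {
    /** Hypothetical stand-in for a list subtype such as MorphemeList. */
    static final class WordList extends AbstractList<String> {
        private final List<String> words;

        WordList(List<String> words) {
            this.words = words;
        }

        @Override
        public String get(int index) {
            return words.get(index);
        }

        @Override
        public int size() {
            return words.size();
        }
    }

    /** "New" API shape: returns the more specific list type. */
    static WordList tokenize(String text) {
        return new WordList(Arrays.asList(text.split(" ")));
    }

    /** JVM descriptor of a method taking a String and returning the given type. */
    static String descriptor(Class<?> returnType) {
        return MethodType.methodType(returnType, String.class).toMethodDescriptorString();
    }

    public static void main(String[] args) {
        // Source compatible: a caller written against List<String> still compiles.
        List<String> words = tokenize("a b c");
        System.out.println(words); // [a, b, c]

        // ABI incompatible: old class files reference the first descriptor,
        // while the new method only exists under the second one.
        System.out.println(descriptor(List.class)); // (Ljava/lang/String;)Ljava/util/List;
        System.out.println(descriptor(WordList.class));
    }
}
```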

Sudachi version 0.6.2

21 Jun 01:05

Highlights

  • Fixed invalid POS tags which appeared when using user-defined POS tags both in user dictionaries and OOV handlers. You are not affected by this bug if you did not use user-defined POS in OOV handlers.

Sudachi version 0.6.1

10 Jun 08:24

Highlights

  • DO NOT USE 0.6.0, IT IS INCOMPATIBLE WITH 0.6.1
  • Regex OOV plugin has configurable maximum token length
  • SettingsAnchor is renamed to PathAnchor to make its purpose clearer
  • Add useful Config methods, e.g. for a common case of loading default configuration with provided PathAnchor to resolve default paths in another directory.
  • Filesystem-based PathAnchor now works correctly when a SecurityManager is present (e.g. in Elasticsearch).

Regex OOV length

Use the maxLength field of the plugin configuration object to set the maximum allowed length, in UTF-8 bytes (32 by default). The unit will change to Unicode code points in the future.
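
Because the unit is UTF-8 bytes rather than characters, the effective limit depends on the script. The JDK-only sketch below compares the three length measures that appear in these notes; the class and method names are illustrative. Non-ASCII characters are written as Unicode escapes so the file compiles under any source encoding.

```java
import java.nio.charset.StandardCharsets;

final class LengthUnits {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    static int utf16Units(String s) {
        return s.length();
    }

    public static void main(String[] args) {
        String ascii = "abc";
        String kanji = "\u6771\u4EAC"; // two kanji (Tokyo)
        String emoji = "\uD83D\uDE00"; // one code point outside the BMP

        System.out.println(utf8Bytes(ascii) + " " + codePoints(ascii) + " " + utf16Units(ascii)); // 3 3 3
        System.out.println(utf8Bytes(kanji) + " " + codePoints(kanji) + " " + utf16Units(kanji)); // 6 2 2
        System.out.println(utf8Bytes(emoji) + " " + codePoints(emoji) + " " + utf16Units(emoji)); // 4 1 2
    }
}
```

Kana and common kanji take 3 bytes each in UTF-8, so the default of 32 bytes corresponds to at most 10 such characters.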

Sudachi version 0.6.0

09 Jun 01:26

Highlights

  • Improved analysis speed by ~20% compared to 0.5.3
  • New typed configuration API (Config)
  • Regex matcher plugin
  • OOV Handlers can use fully-customized POS tags
  • API for compiling dictionaries

New features

API for building dictionaries

In addition to the command-line interface for building dictionaries, Sudachi now provides an API.

See DicBuilder class and CLI for usage examples. No Javadocs here yet.

Configuration API

Introduced a new typed configuration API. See Config class. It supports flexible path resolution with respect to classpath (with customizable prefixes and classloaders) and filesystem. Dictionary creation API which uses old Settings is deprecated.

The new configuration framework allows some resources (dictionaries, character tables) to be specified as preloaded and prebuilt.

For details on usage, see Javadoc for Config class.

Fully-custom POS tags in OOV providers

It is now possible to specify POS tags for OOV providers which are not present in the dictionary. In that case, you must add "userPOS": "allow" to the OOV plugin configuration. POS tags must still have a 6-layer structure.

"oovProviderPlugin" : [
    { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
      "oovPOS" : [ "この", "たぐ", "", "ぞんざい", "しない", "" ],
      "userPOS" : "allow",
      "leftId" : 8,
      "rightId" : 8,
      "cost" : 6000 }
]

Regex OOV Provider Plugin

Introduced a new OOV provider which matches a regular expression.

Recommendations:

  • Use non-capturing groups in regular expressions: (?:like this), but not capturing groups (like this)
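
The notes do not state the reason for this recommendation; the sketch below only shows the difference between the two forms using java.util.regex. Both patterns accept the same input, but the capturing form makes the engine record an extra group. The patterns and class name are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupSketch {
    /** Number of capturing groups the pattern defines. */
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        String nonCapturing = "(?:https?|ftp)://\\w+";
        String capturing = "(https?|ftp)://\\w+";

        System.out.println(groups(nonCapturing)); // 0
        System.out.println(groups(capturing));    // 1

        // Both accept the same text; only the capturing form tracks group 1.
        Matcher m = Pattern.compile(capturing).matcher("ftp://host");
        System.out.println(m.matches());  // true
        System.out.println(m.group(1));   // ftp
        System.out.println(Pattern.matches(nonCapturing, "ftp://host")); // true
    }
}
```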

Caveats:

  • Matches may start only on boundaries where character type changes.
  • Matches may not produce words which are already present in the dictionary.
  • Match length is limited to 63 UTF-16 code units.

Example for matching URLs:

{
    "class": "com.worksap.nlp.sudachi.RegexOovProvider",
    "leftId": 5968,
    "rightId": 5968,
    "cost": 19000,
    "regex": "^(?:https?://|www)[\\-_.!~*'()a-zA-Z0-9;/?:@&=+$,%#¯−]+",
    "pos": [ "補助記号", "一般", "URL", "*", "*", "*" ],
    "userPOS": "allow"
}
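
To try out a pattern like this before putting it into the configuration, it can be exercised with java.util.regex directly. The sketch below is not how the plugin is implemented; it only shows one way to test an anchored pattern at a given offset, using a matcher region so that ^ matches at the region start. The pattern is the one above with the two non-ASCII characters at the end of the character class left out, and matchAt is a hypothetical helper.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlMatchSketch {
    // ASCII-only variant of the URL pattern from the configuration example.
    static final Pattern URL =
            Pattern.compile("^(?:https?://|www)[\\-_.!~*'()a-zA-Z0-9;/?:@&=+$,%#]+");

    /** Returns the match that starts exactly at offset, or null if there is none. */
    static String matchAt(String text, int offset) {
        Matcher m = URL.matcher(text).region(offset, text.length());
        // Anchoring bounds are enabled by default, so ^ matches at the region start.
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "see https://example.com/a?b=1 now";
        System.out.println(matchAt(text, 4)); // https://example.com/a?b=1
        System.out.println(matchAt(text, 0)); // null
    }
}
```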

Speedup

  • Improved lattice construction logic; it is now faster and generates less GC pressure
  • Improved trie index lookup logic; it is now slightly faster and generates much less GC pressure

Deprecations

All APIs deprecated in this section will be removed in the 1.0 release.

  • DictionaryFactory methods which use Settings
  • getPath method of Settings, use getResource instead.

Internal & Infrastructure

  • Build now uses Gradle instead of Maven
  • Tests can be written in Kotlin in addition to Java
  • OOV provider plugin internal API has changed. A plugin must now add candidate nodes to the provided list and return the number of OOVs it created. See Javadoc for details.
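
The "fill the provided list, return the count" contract from the last item can be sketched as follows. Everything here is a hypothetical stand-in written for illustration: Node, Provider, and the provide signature are not the real Sudachi types, so consult the Javadoc for the actual method and its parameters.

```java
import java.util.ArrayList;
import java.util.List;

final class OovContractSketch {
    /** Hypothetical stand-in for a lattice node. */
    static final class Node {
        final String surface;

        Node(String surface) {
            this.surface = surface;
        }
    }

    /** Hypothetical provider: appends candidates to result and returns how many it added. */
    interface Provider {
        int provide(String text, int offset, List<Node> result);
    }

    /** Emits one single-code-point candidate at the offset, if any input is left. */
    static final Provider ONE_CHAR = (text, offset, result) -> {
        if (offset >= text.length()) {
            return 0;
        }
        int end = text.offsetByCodePoints(offset, 1);
        result.add(new Node(text.substring(offset, end)));
        return 1;
    };

    public static void main(String[] args) {
        // The caller owns the list and can reuse it across calls.
        List<Node> nodes = new ArrayList<>();
        int created = ONE_CHAR.provide("abc", 1, nodes);
        System.out.println(created + " " + nodes.get(0).surface); // 1 b
    }
}
```

Passing the list in lets the caller reuse one list across many calls instead of allocating a new one per call.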