Sudachi version 0.6.0
Highlights
- Improved analysis speed ~20% compared to 0.5.3
- New typed configuration API (Config)
- Regex matcher plugin
- OOV Handlers can use fully-customized POS tags
- API for compiling dictionaries
New features
API for building dictionaries
In addition to the command-line interface for building dictionaries, Sudachi now provides an API for the same task.
See the DicBuilder class and the CLI for usage examples. Javadoc is not available for this API yet.
Configuration API
Introduced a new typed configuration API; see the Config class. It supports flexible path resolution against both the classpath (with customizable prefixes and classloaders) and the filesystem. The dictionary creation API that uses the old Settings class is deprecated.
The new configuration framework allows some resources (dictionaries, character tables) to be passed in preloaded and prebuilt.
For usage details, see the Javadoc for the Config class.
Fully-custom POS tags in OOV providers
It is now possible to specify POS tags for OOV providers that are not present in the dictionary. In that case, you must add "userPOS": "allow" to the OOV plugin configuration. POS tags must still have a 6-layer structure.
"oovProviderPlugin" : [
    { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
      "oovPOS" : [ "この", "たぐ", "は", "ぞんざい", "しない", "よ" ],
      "userPOS": "allow",
      "leftId" : 8,
      "rightId" : 8,
      "cost" : 6000 }
]
Regex OOV Provider Plugin
Introduced a new OOV provider which matches a regular expression.
Recommendations:
- Use non-capturing groups in regular expressions, (?:like this), rather than capturing groups, (like this).
Caveats:
- Matches may start only on boundaries where character type changes.
- Matches may not produce words which are already present in the dictionary.
- Match length is limited to 63 UTF-16 code units.
Example for matching URLs:
{
    "class": "com.worksap.nlp.sudachi.RegexOovProvider",
    "leftId": 5968,
    "rightId": 5968,
    "cost": 19000,
    "regex": "^(?:https?://|www)[\\-_.!~*'()a-zA-Z0-9;/?:@&=+$,%#¯−]+",
    "pos": [ "補助記号", "一般", "URL", "*", "*", "*" ],
    "userPOS": "allow"
}
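Before a pattern goes into the plugin configuration, it can be tried out with plain java.util.regex. The sketch below is only an approximation of the recommendations and caveats above: it does not call Sudachi, the names RegexOovCheck, matchAtStart and MAX_MATCH_UNITS are hypothetical, and it simply rejects a match over the limit, which may not be how the plugin itself behaves. It uses the ASCII part of the URL pattern from the example and relies on String.length() counting UTF-16 code units.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexOovCheck {
    // Limit taken from the caveats above; the constant name is hypothetical.
    static final int MAX_MATCH_UNITS = 63;

    // ASCII subset of the URL pattern from the example configuration.
    static final Pattern URL = Pattern.compile(
            "^(?:https?://|www)[\\-_.!~*'()a-zA-Z0-9;/?:@&=+$,%#]+");

    /** Returns the match at the start of text, or null if there is none or it is too long. */
    static String matchAtStart(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        if (!m.lookingAt()) {
            return null;
        }
        String match = m.group();
        // String.length() counts UTF-16 code units, the unit the limit is stated in.
        return match.length() <= MAX_MATCH_UNITS ? match : null;
    }

    public static void main(String[] args) {
        // A non-capturing group (?:...) adds no capture group; a plain (...) adds one.
        System.out.println(URL.matcher("").groupCount());
        System.out.println(Pattern.compile("^(https?://|www)").matcher("").groupCount());

        // One supplementary code point occupies two UTF-16 code units.
        System.out.println("\uD842\uDFB7".length());

        System.out.println(matchAtStart(URL, "https://example.com/path?q=1 is a URL"));
        System.out.println(matchAtStart(URL, "see https://example.com"));
    }
}
```

The second call to matchAtStart returns null because the pattern is anchored with ^ and the text does not begin with a URL.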
Speedup
- Improved lattice construction logic: it is now faster and generates less GC pressure.
- Improved trie index lookup logic: it is now slightly faster and generates much less GC pressure.
Deprecations
All APIs deprecated in this section will be removed in the 1.0 release.
- DictionaryFactory methods which use Settings
- getPath method of Settings; use getResource instead
Internal & Infrastructure
- Build now uses Gradle instead of Maven
- Tests can be written in Kotlin in addition to Java
- The internal API of OOV provider plugins has changed. A plugin must now add its candidate nodes to the provided list and return the number of OOVs it created. See the Javadoc for details.
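The OOV provider contract described in the last item can be pictured with a small self-contained sketch. Everything in it is a stand-in: Node, OovProvider, provide and DigitRunProvider are hypothetical names that are not part of Sudachi, and the real method signature is the one documented in the Javadoc. The point is only the shape of the call: the caller owns the list, and the provider appends to it and reports how many nodes it added.

```java
import java.util.ArrayList;
import java.util.List;

public class OovContractSketch {
    // Hypothetical stand-in for a lattice node; not a Sudachi class.
    static final class Node {
        final int begin;
        final int end;

        Node(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Hypothetical stand-in for the plugin interface.
    interface OovProvider {
        /** Appends candidate nodes starting at offset to result and returns how many were added. */
        int provide(String text, int offset, List<Node> result);
    }

    // Toy provider: proposes one node covering a run of ASCII digits.
    static final class DigitRunProvider implements OovProvider {
        @Override
        public int provide(String text, int offset, List<Node> result) {
            int end = offset;
            while (end < text.length() && text.charAt(end) >= '0' && text.charAt(end) <= '9') {
                end++;
            }
            if (end == offset) {
                return 0; // nothing created, list left untouched
            }
            result.add(new Node(offset, end));
            return 1;
        }
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>(); // allocated by the caller
        OovProvider provider = new DigitRunProvider();
        int created = provider.provide("abc12345", 3, nodes);
        Node first = nodes.get(0);
        System.out.println(created + " node(s), first covers [" + first.begin + ", " + first.end + ")");
    }
}
```

With this shape, a caller-owned list can be cleared and reused across calls instead of allocating a new collection for every lookup.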