Add performance comparison; write a more straightforward description
geezmolycos committed Mar 10, 2023
1 parent 19a1b47 commit 921a0e0
Showing 24 changed files with 22,486 additions and 51 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -3,3 +3,4 @@ dist
node_modules
.vscode-test/
*.vsix
extension-profiles/
1 change: 1 addition & 0 deletions .vscodeignore
@@ -9,3 +9,4 @@ vsc-extension-quickstart.md
**/.eslintrc.json
**/*.map
**/*.ts
images/large/
12 changes: 5 additions & 7 deletions CHANGELOG.md
@@ -4,20 +4,18 @@ All notable changes to the "vscode-hanzi-counter" extension will be documented i

## TODO

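- Write text for testing
- Add comparisons with other word-count extensions and online tools
- Performance comparison
- Correctness comparison
- Add a configuration-writing guide
- Twitter character count
- More accurate Japanese manuscript-paper conversion (「原稿用紙換算」)
- Publish to Open VSX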

## [1.3.2]
## [1.3.2] - 2023-03-10

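- Include the Letter_Number category in Hanzi and CJK characters, so that 「〇」 and Suzhou numerals (苏州码字) are also counted as Hanzi
- Include the Letter_Number category in Hanzi and CJK characters, so that 「〇」 and Suzhou numerals (苏州码子) are also counted as Hanzi
- Change the non-whitespace character rule; add Segmenter
- Wrote a correctness comparison between this extension and other tools
- Added a document comparing the correctness of this extension with other tools
- Added a performance comparison document
- Revised `README.md` to be more straightforward and easier to understand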

## [1.3.1] - 2023-03-08

56 changes: 20 additions & 36 deletions README.md
@@ -11,21 +11,7 @@ Customizable word counter with great support of Chinese characters (Hanzi), Japa
The author does not speak Japanese or Korean, but the character and word counts for those languages are accurate. See below for more details. (You may want to use [Google Translate](https://translate.google.com/).)

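## Feature highlights

- This extension counts the full text only when a file is opened or saved. While you edit, it counts only the changed parts and updates the totals dynamically. Other extensions recount the whole text on every edit; this extension uses very little computing resources regardless of document length. At about 500,000 words, [the VS Code sample word count extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode.wordcount) stutters noticeably while editing (its Markdown check was modified so that it starts in plain-text mode, to eliminate the performance impact of Markdown syntax highlighting), whereas this extension still updates in real time with no perceptible delay.
- The content shown in the status bar and in the hover tooltip can be customized with JavaScript, and counting rules can be added with regular expressions. Writing your own rules and changing the format covers the vast majority of personal word-counting needs.
- Different settings can be configured for different programming languages, and the display can be enabled or disabled per language.
- A rich set of preset configurations serves users from different countries and languages. A configuration guide and English descriptions will be written in the future.
- `Intl.Segmenter` is used to split characters and words, so combining characters are counted together; emoji are supported.

## Screenshots

![Chinese interface](images/screenshot-tooltip.png)
![English](images/screenshot-tooltip-english.png)
![Highlight](images/screenshot-highlight.png)

## Features
## Feature overview

When the extension is first installed, a pencil icon showing the word count appears at the bottom right, in the status bar. Hovering over it brings up a tooltip with a usage tutorial; follow the prompts to change the settings.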

@@ -37,34 +23,32 @@ The author hasn't learned Japanese or Korean language, but the character/word co

The status bar item is shown on the right by default; this can be changed in the settings.

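## Korean character (Hangul) rules
## Comparison with other extensions

**No lag, good performance**

Counting results are cached per line, so the extension does not slow down as the document grows. Below is a speed comparison with other extensions ([details](comparison.md#性能对比)).

[<img alt="speed-comparison.png" src="images/speed-comparison.png" width="400px" />](comparison.md#性能对比)

**Multilingual support, including Chinese, Japanese, and Korean**

Korean uses Hangul as its main writing system. In Unicode, Hangul can be represented in two ways, a syllable form and a conjoining form: in the syllable form, one Hangul syllable block corresponds to one character; in the conjoining form, one syllable block can be made up of several component characters (initial, medial, and final jamo).
The default configurations cover English, Simplified Chinese, Traditional Chinese, Japanese, and Korean, each tailored to what users of that language need.

By treating the different character types differently, the number of syllable blocks in Hangul text can be counted accurately. The rules are as follows:
<img alt="Chinese interface" src="images/screenshot-tooltip.png" width="200px" />
<img alt="English" src="images/screenshot-tooltip-english.png" width="200px" />

|Character type|Regex|Description|
|-|-|-|
|Syllables and compatibility characters|`[\\u{ac00}-\\u{d7af}\\u{3130}-\\u{318f}\\u{ffa0}-\\u{ffdf}]`|All characters in [Hangul Syllables](https://en.wikipedia.org/wiki/Hangul_Syllables) and [Hangul Compatibility Jamo](https://en.wikipedia.org/wiki/Hangul_Compatibility_Jamo), plus the halfwidth Hangul characters in [Halfwidth and Fullwidth Forms](https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block))|
|Initial (L)|`[\\u{1100}-\\u{115f}\\u{a960}-\\u{a97f}]`|All L-type characters in [Hangul Jamo](https://en.wikipedia.org/wiki/Hangul_Jamo_(Unicode_block)) and [Hangul Jamo Extended-A](https://en.wikipedia.org/wiki/Hangul_Jamo_Extended-A)|
|Medial (V)|`[\\u{1160}-\\u{11a7}\\u{d7b0}-\\u{d7ca}]`|All V-type characters in [Hangul Jamo](https://en.wikipedia.org/wiki/Hangul_Jamo_(Unicode_block)) and [Hangul Jamo Extended-B](https://en.wikipedia.org/wiki/Hangul_Jamo_Extended-B)|
|Final (T)|`[\\u{11a8}-\\u{11ff}\\u{d7cb}-\\u{d7ff}]`|All T-type characters in [Hangul Jamo](https://en.wikipedia.org/wiki/Hangul_Jamo_(Unicode_block)) and [Hangul Jamo Extended-B](https://en.wikipedia.org/wiki/Hangul_Jamo_Extended-B)|
**Correct, intuitive results**

References:\
<https://stackoverflow.com/questions/9928505/what-does-the-expression-x-match-when-inside-a-regex>\
<https://stackoverflow.com/questions/53198407/is-there-a-regular-expression-which-matches-a-single-grapheme-cluster>
Counting words correctly is harder than it sounds. This extension uses Unicode properties to decide which class a character belongs to, and uses the modern JavaScript segmentation API `Intl.Segmenter` to handle combining characters and combining marks. It works very well with the scripts of many languages and with emoji! ([details](comparison.md#))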
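
As a rough illustration of classifying characters by Unicode property, the sketch below counts Han characters with JavaScript's Unicode property escapes. The two regular expressions and the `countMatches` helper are assumptions made for this example, not the extension's actual rules; they only show why including the `Letter_Number` (`Nl`) category matters for 「〇」 and the Suzhou numerals.

```javascript
// Hypothetical rule: a "Han character" is any code point whose script is Han
// and whose general category is Letter (L) or Letter_Number (Nl).
const HAN_WITH_NL = /(?=\p{Script=Han})[\p{L}\p{Nl}]/gu;

// The same rule restricted to letters only, for comparison.
const HAN_LETTERS_ONLY = /(?=\p{Script=Han})\p{L}/gu;

// Count how many times `regex` (global flag set) matches in `text`.
function countMatches(text, regex) {
  return (text.match(regex) || []).length;
}

// Sample made up for this sketch: four ordinary Han letters,
// 「〇」 (U+3007), and the Suzhou numerals one to three (U+3021..U+3023).
const sample = "二\u3007二三年\u3021\u3022\u3023";

console.log(countMatches(sample, HAN_LETTERS_ONLY)); // letters only
console.log(countMatches(sample, HAN_WITH_NL));      // letters and Letter_Number
```

With the letters-only rule, 「〇」 and the Suzhou numerals are skipped because their general category is `Nl` rather than a letter category.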

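- A syllable or compatibility character counts as one character
- An L counts as one character
- A V not preceded by an L counts as one character
- A T not preceded by a V counts as one character
**Click to highlight**

The combined regular expression:
Can't find where a certain non-ASCII character is hiding in your document? One click highlights it. This also makes a handy tool for visualizing character types.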

`[\\u{ac00}-\\u{d7af}\\u{3130}-\\u{318f}\\u{ffa0}-\\u{ffdf}]|[\\u{1100}-\\u{115f}\\u{a960}-\\u{a97f}]|(?<![\\u{1100}-\\u{115f}\\u{a960}-\\u{a97f}])[\\u{1160}-\\u{11a7}\\u{d7b0}-\\u{d7ca}]|(?<![\\u{1160}-\\u{11a7}\\u{d7b0}-\\u{d7ca}])[\\u{11a8}-\\u{11ff}\\u{d7cb}-\\u{d7ff}]`
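
One way to sanity-check the combined expression is to run it over the same word in both encodings. The sketch below is only an illustration: the `countHangul` helper and the sample word are assumptions for this example. The four character classes are copied from the table as JavaScript string literals, which is why the backslashes stay doubled.

```javascript
// Character classes from the rules table, as string literals.
const SYLLABLE = "[\\u{ac00}-\\u{d7af}\\u{3130}-\\u{318f}\\u{ffa0}-\\u{ffdf}]";
const L = "[\\u{1100}-\\u{115f}\\u{a960}-\\u{a97f}]";
const V = "[\\u{1160}-\\u{11a7}\\u{d7b0}-\\u{d7ca}]";
const T = "[\\u{11a8}-\\u{11ff}\\u{d7cb}-\\u{d7ff}]";

// Syllable, or L, or V not preceded by L, or T not preceded by V.
const HANGUL = new RegExp(`${SYLLABLE}|${L}|(?<!${L})${V}|(?<!${V})${T}`, "gu");

// Hypothetical helper: number of Hangul syllable blocks in `text`.
function countHangul(text) {
  return (text.match(HANGUL) || []).length;
}

// The same two-syllable word in syllable form and in conjoining form.
const precomposed = "\uD55C\uAE00";
const decomposed = precomposed.normalize("NFD"); // six conjoining jamo

console.log(precomposed.length, countHangul(precomposed));
console.log(decomposed.length, countHangul(decomposed));
```

Both forms should give the same count even though the conjoining form uses three code points per syllable block; a lone medial jamo such as `"\u1161"` is counted on its own because no L precedes it.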
<img alt="highlight" src="images/screenshot-highlight.png" width="400px" />

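## Grapheme cluster boundary and word boundary rules
**Hackable to your liking**

A Unicode grapheme cluster is [an approximation](https://unicode.org/reports/tr29/) of what a writing system [generally regards as a "character"](http://utf8everywhere.org/#characters). A character with combining marks is a single grapheme cluster as a whole, even though it consists of multiple code points.
The author, 「金毛」, believes every tool should leave plenty of room for modification, since there are always power users with a strong urge to customize and control. This extension therefore lets you add your own regular expressions to match almost anything you want, and control what is displayed with JavaScript function templates! (A configuration guide is being written…)

The Unicode website provides [the rules for grapheme clusters and words](https://unicode.org/reports/tr29/), and JavaScript has a built-in [`Intl.Segmenter`] for splitting text into grapheme clusters and words. This extension uses that API for locale-specific word and sentence segmentation; see the configuration guide for details (not yet written; it will be done when time permits).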
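
For readers unfamiliar with `Intl.Segmenter`, here is a minimal sketch of how it can split text at the three granularities it supports. The `segmentCounts` function and the sample strings are assumptions for illustration and say nothing about how the extension wires the API into its configuration; the code needs a runtime that ships `Intl.Segmenter`, such as a recent Node.js.

```javascript
// Hypothetical helper: count grapheme clusters, word-like segments,
// and sentences in `text` for the given locale.
function segmentCounts(text, locale = "en") {
  const count = (granularity, keep = () => true) =>
    [...new Intl.Segmenter(locale, { granularity }).segment(text)]
      .filter(keep).length;

  return {
    graphemes: count("grapheme"),
    // With word granularity, spaces and punctuation are also returned
    // as segments; `isWordLike` filters them out.
    words: count("word", (segment) => segment.isWordLike),
    sentences: count("sentence"),
  };
}

console.log(segmentCounts("Call me Ishmael. Some years ago."));

// "e" followed by a combining acute accent: two code points, one cluster.
console.log(segmentCounts("e\u0301").graphemes);
```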
38 changes: 31 additions & 7 deletions comparison.md
@@ -35,7 +35,7 @@
**Test results**

|Tool|Characters|Non-whitespace characters|Words|Chinese characters (with/without punctuation)|Western characters|Sentences|Paragraphs|Other|
|-|-|-|-|-|-|-|-|-|
|-|-:|-:|-:|-:|-:|-:|-:|-|
|Multi-purpose Hanzi and Word Counter|453|453|1|453/382||24|||
|Word Count CJK|453|453|0|382|||||
|Japanese Word Count|453|453||||||原稿用紙換算(400x?枚): 2|
Expand All @@ -56,9 +56,9 @@
> 家蚕(学名:Bombyx mori)是鳞翅目蚕蛾科家蚕蛾属的完全变态昆虫,为丝绸的主要原料来源,在人类经济生活及文化历史上占有重要地位。家蚕原产中国,其幼虫在华南地区俗称蚕宝宝或娘仔,成虫称为蚕蛾。
|Tool|Characters|Words|Chinese characters (with/without punctuation)|Kana|Hangul|Western characters|Sentences|Paragraphs|Other|
|-|-|-|-|-|-|-|-|-|-|
|-|-:|-:|-:|-:|-:|-:|-:|-:|-|
|Multi-purpose Hanzi and Word Counter|376/354|23|166/132|92(66/26)|100/66||10|||
|Word Count CJK|376/354|6|166/132|92(66/26)|100/66||10|||
|Word Count CJK|376/354|6|132||||10|||
|Japanese Word Count|374/354||||||||原稿用紙換算(400x?枚): 1|
|WordCounter|376|23||||||1|3Lines, ~0m7s reading time|
|Microsoft Word|374/354|14|(314)|(314)|(314)|||3|Lines: 13|
@@ -132,7 +132,7 @@
> Call me Ishmael. Some years ago⁠—never mind how long precisely⁠—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people’s hats off⁠—then, I account it high time to get to sea as soon as I can.
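|Tool|Total/non-whitespace characters|Letters/punctuation|Words|Sentences|Other|
|-|-|-|-|-|-|
|-|-:|-:|-:|-:|-|
|Multi-purpose Hanzi and Word Counter|798/657|636/18|142/146/145|4|Word counts are for "space-separated words", "basic words", and "segmenter words", respectively|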
|Word Count CJK|798/657||146|||
|WordCounter|798||142||1 Line, 1 Paragraph, ~0m43s reading time|
@@ -148,7 +148,7 @@
> qú lì qwək qaˤ ʂaŋ sí jɨ́ kʰɨ́j ʔin ʁwá puq bìe tsʰjuo tɕɨ
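|Tool|Total/non-whitespace characters|Letters/punctuation|Words|Sentences|Other|
|-|-|-|-|-|-|
|-|-:|-:|-:|-:|-|
|Multi-purpose Hanzi and Word Counter|56/50|43/0|14/16/14|1|Word counts are for "space-separated words", "basic words", and "segmenter words", respectively|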
|Word Count CJK|63/50||18|||
|WordCounter|63||14||1 Line, 1 Paragraph, ~0m4s reading time|
@@ -162,7 +162,7 @@
> qú lì qwək qaˤ ʂaŋ sí jɨ́ kʰɨ́j ʔin ʁwá puq bìe tsʰjuo tɕɨ
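|Tool|Total/non-whitespace characters|Letters/punctuation|Words|Sentences|Other|
|-|-|-|-|-|-|
|-|-:|-:|-:|-:|-|
|Multi-purpose Hanzi and Word Counter|56/45|43/0|14/15/14|1|Word counts are for "space-separated words", "basic words", and "segmenter words", respectively|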
|Word Count CJK|58/45||18|||
|WordCounter|58||14||1 Line, 1 Paragraph, ~0m4s reading time|
@@ -192,7 +192,7 @@
- A result of 21 characters means the tool simply uses the UTF-16 length as the string length

|Tool|Characters|Other|
|-|-|-|
|-|-:|-|
|Multi-purpose Hanzi and Word Counter|4||
|Word Count CJK|21||
|Japanese Word Count|4||
@@ -202,3 +202,27 @@
|wordcounter.net|21|1 word 21 characters|
|wordcount.com|21|0 words 21 characters|
|countwordsworth.com|21|1 words; 21 characters|
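
The gap between a UTF-16 length and a user-perceived character count can be reproduced in a few lines. This is a sketch with a made-up sample string (a three-person family emoji built from a ZWJ sequence), not the test text used in the table above; the `lengths` helper is hypothetical.

```javascript
// Hypothetical helper: three different notions of "length" for a string.
function lengths(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    utf16: text.length,                            // UTF-16 code units
    codePoints: [...text].length,                  // Unicode code points
    graphemes: [...segmenter.segment(text)].length // grapheme clusters
  };
}

// MAN + ZWJ + WOMAN + ZWJ + GIRL, rendered as a single family emoji.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

console.log(lengths(family));
```

A tool that reports `text.length` sees eight units here, while a reader sees one character.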

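## Performance comparison

This extension counts the full text only when a file is opened or saved, and caches the results per line. While you edit, it counts only the changed parts and updates the totals dynamically. Regardless of document length, it uses very little computing resources, whereas other extensions perform worse and use more CPU as the document grows.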
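
The per-line caching idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the extension's code: `LineCountCache`, `replaceLines`, and the non-whitespace `countLine` rule are all hypothetical names and choices made for this example.

```javascript
// Hypothetical counting rule: non-whitespace code points in one line.
function countLine(line) {
  return (line.match(/\S/gu) || []).length;
}

const sum = (numbers) => numbers.reduce((a, b) => a + b, 0);

class LineCountCache {
  // Full count: done once, e.g. when the file is opened or saved.
  constructor(text) {
    this.counts = text.split("\n").map(countLine);
    this.total = sum(this.counts);
  }

  // Incremental update: `removed` lines starting at index `start`
  // are replaced by `newLines`. Only the new lines are counted again.
  replaceLines(start, removed, newLines) {
    const fresh = newLines.map(countLine);
    const stale = this.counts.splice(start, removed, ...fresh);
    this.total += sum(fresh) - sum(stale);
    return this.total;
  }
}

const cache = new LineCountCache("aa bb\ncc");
console.log(cache.total);
// Append a line "aa bb" after the last line, then delete it again.
console.log(cache.replaceLines(1, 1, ["cc", "aa bb"]));
console.log(cache.replaceLines(1, 2, ["cc"]));
```

The cost of an edit depends on the number of changed lines rather than on the size of the document, which is the property the paragraph above describes.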

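The performance test procedure is as follows:

- Disable all other word-count extensions and enable the one under test
- Start the Extension Host Profile
- Open the test file
- Repeat 100 times:
  - Type `\naa bb` at the end of the file (`\n` stands for a newline)
  - Delete everything that was typed
- Save the file and close it
- Stop the Extension Host Profile and save the profile result
- In the profile result, find the function inside the extension with the longest total time.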

|Extension|moby-dick-1k|moby-dick-10k|moby-dick-100k|moby-dick-1000k|
|-|-:|-:|-:|-:|
|Multi-purpose Hanzi and Word Counter|79.41ms|106.30ms|273.72ms|2,051.07ms|
|Word Count CJK|86.64ms|454.77ms|3,877.94ms|38,058.62ms|
|Japanese Word Count|2,562.11ms|Froze|Crashed|Crashed|
|WordCounter|54.18ms|187.56ms|1,280.74ms|12,288.55ms|
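
To read the table as growth rates rather than raw timings, the numbers can be divided out. The sketch below copies the timings from the table (Japanese Word Count is left out because it did not complete the larger files); the `growth` and `relativeTo` helpers are hypothetical names for this example.

```javascript
// Timings in milliseconds for the 1k, 10k, 100k and 1000k files,
// copied from the table above.
const timings = {
  "Multi-purpose Hanzi and Word Counter": [79.41, 106.30, 273.72, 2051.07],
  "Word Count CJK": [86.64, 454.77, 3877.94, 38058.62],
  "WordCounter": [54.18, 187.56, 1280.74, 12288.55],
};

// How many times slower the largest file is than the smallest one.
function growth(name) {
  const t = timings[name];
  return t[t.length - 1] / t[0];
}

// How many times slower `name` is than `baseline` on the largest file.
function relativeTo(baseline, name) {
  const last = (n) => timings[n][timings[n].length - 1];
  return last(name) / last(baseline);
}

for (const name of Object.keys(timings)) {
  console.log(name, growth(name).toFixed(1));
}
console.log(relativeTo("Multi-purpose Hanzi and Word Counter", "Word Count CJK").toFixed(1));
```

From the smallest to the largest file, the time for Multi-purpose Hanzi and Word Counter grows by a factor in the tens, while the other two extensions grow by factors in the hundreds.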

Empty file added config-guide.md
Empty file.
Binary file added images/large/icon.png