Merge pull request #4 from mozillazg/develop
v0.3.0
mozillazg authored Aug 19, 2016
2 parents 79cbc32 + dc1d68f commit 1b473fb
Showing 7 changed files with 42,250 additions and 42,088 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,14 @@
# ChangeLog


## 0.3.0 (2016-08-19):

* Fixed format of zdic.txt via [b8e4394](https://github.com/mozillazg/pinyin-data/commit/b8e439490d2c6e8c711652983db52fb69136919b).
* Fixed some pinyin: 罗 via [468ffaa](https://github.com/mozillazg/pinyin-data/commit/468ffaa8eb678637c7565a02e6836255bd0df06c).
* Support Chinese characters in the PUA ([Private Use Area](https://en.wikipedia.org/wiki/Private_Use_Areas)) via [#2](https://github.com/mozillazg/pinyin-data/pull/2).
* `pinyin.txt` now includes comment lines that start with `#` via [9944f79](https://github.com/mozillazg/pinyin-data/commit/9944f795e191fb3606d65ada84b6fad5665f8776).


## 0.2.0 (2016-07-19):

* Update to the latest version of [Unihan Database](http://www.unicode.org/charts/unihan.html):
70 changes: 70 additions & 0 deletions PUA.txt
@@ -0,0 +1,70 @@
U+E815: yè # 
U+E816: zuǒ,yǒu # 
U+E818: gǔn # 
U+E81A: zhòu,zhū # 
U+E81B: zhòu,zhū # 
U+E81D: jié,jiē # 
U+E81F: wāi # 
U+E820: hǎn # 
U+E821: hǎn # 
U+E824: zhòu # 
U+E825: zhòu # 
U+E826: shǒu # 
U+E827: gāng # 
U+E828: kuǎi # 
U+E829: sǒng # 
U+E82A: sǒng # 
U+E82B: fēng # 
U+E82C: gòng # 
U+E82D: gāng # 
U+E82E: huì,kuì # 
U+E82F: tà # 
U+E830: jiān # 
U+E831: ēn # 
U+E832: xiǎo # 
U+E834: lóu,lǘ # 
U+E835: cǎn,shān,cēn # 
U+E836: zhú # 
U+E837: chōu,chóu # 
U+E838: wǎng # 
U+E83A: yáng,xiáng # 
U+E83B: zāi # 
U+E83C: bà,bēi # 
U+E83D: bà,bēi # 
U+E83F: zhuān,chuán,chún,zhuǎn # 
U+E840: qióng # 
U+E841: kuì,huì # 
U+E842: kuì,huì # 
U+E843: juǎn # 
U+E844: xīn # 
U+E845: yàn # 
U+E846: qíng # 
U+E847: qíng # 
U+E849: shàn # 
U+E84A: yé,yá # 
U+E84B: pō # 
U+E84C: shàn # 
U+E84D: zhuō # 
U+E84E: shàn # 
U+E84F: jué # 
U+E850: chuài # 
U+E851: zhèng # 
U+E852: chuài # 
U+E853: zhèng # 
U+E854: zhuó # 
U+E855: yíng # 
U+E856: yú # 
U+E857: yìn # 
U+E858: chūn # 
U+E859: qiū # 
U+E85A: yú # 
U+E85B: téng # 
U+E85C: shī # 
U+E85D: jiāo # 
U+E85E: liè # 
U+E85F: jīng # 
U+E860: jú # 
U+E861: tī # 
U+E862: pì # 
U+E863: yǎn # 
U+E864: luán # 
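
Every code point added in `PUA.txt` is expected to sit in the Private Use Area. The following is a minimal sketch of how that could be checked, assuming the Basic Multilingual Plane PUA range U+E000 to U+F8FF; `code_points_outside_pua` and `LINE_RE` are hypothetical names and are not part of this repository.

```python
import re

# Basic Multilingual Plane Private Use Area (assumed range for this check).
PUA_START, PUA_END = 0xE000, 0xF8FF

# Matches the leading "U+XXXX:" part of a data line.
LINE_RE = re.compile(r'^U\+([0-9A-F]{4,6}):')


def code_points_outside_pua(lines):
    """Return the code points in `lines` that fall outside the BMP PUA."""
    outside = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # blank line, comment, or anything that is not an entry
        code_point = int(match.group(1), 16)
        if not PUA_START <= code_point <= PUA_END:
            outside.append('U+{:04X}'.format(code_point))
    return outside


sample = ['U+E815: yè', 'U+E864: luán', 'U+4E2D: zhōng,zhòng']
print(code_points_outside_pua(sample))
```

Running the same function over the lines of `PUA.txt` should return an empty list if every entry is in that range.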
11 changes: 9 additions & 2 deletions README.md
@@ -5,7 +5,11 @@

## Data overview

Data format: `{code point}: {pinyins} # {hanzi}` (example: `U+4E2D: zhōng,zhòng # 中`)
Data format:

* Format: `{code point}: {pinyins} # {hanzi}` (example: `U+4E2D: zhōng,zhòng # 中`)
* Lines starting with `#` are comments


[Unihan Database][unihan] data version:
> Date: 2016-06-01 07:01:48 GMT [JHJ]
@@ -14,8 +18,9 @@
* `kHanyuPinyin.txt`: pinyin data from the [kHanyuPinyin](http://www.unicode.org/reports/tr38/#kHanyuPinyin) field of the [Unihan Database][unihan] (pinyin sourced from 《漢語大字典》)
* `kXHC1983.txt`: pinyin data from the [kXHC1983](http://www.unicode.org/reports/tr38/#kXHC1983) field of the [Unihan Database][unihan] (pinyin sourced from 《现代汉语词典》)
* `kHanyuPinlu.txt`: pinyin data from the [kHanyuPinlu](http://www.unicode.org/reports/tr38/#kHanyuPinlu) field of the [Unihan Database][unihan] (pinyin sourced from 《現代漢語頻率詞典》)
* `nonCJKUI.txt`: characters that are not a [CJK Unified Ideograph](https://en.wikipedia.org/wiki/CJK_Unified_Ideographs) but still have pinyin
* `kMandarin.txt`: pinyin data from the [kMandarin](http://www.unicode.org/reports/tr38/#kMandarin) field of the [Unihan Database][unihan] (the single most common Mandarin reading; zh-CN takes precedence, and the zh-TW reading is used when zh-CN has none)
* `PUA.txt`: Chinese characters in the [Private Use Area](https://en.wikipedia.org/wiki/Private_Use_Areas) that have pinyin
* `nonCJKUI.txt`: characters that are not a [CJK Unified Ideograph](https://en.wikipedia.org/wiki/CJK_Unified_Ideographs) but still have pinyin
* `overwrite.txt`: manually corrected pinyin data (**all of the pinyin data above is generated by a program; to make a correction, edit only this file**)
* `pinyin.txt`: pinyin data produced by merging the files above
* `zdic.txt`: pinyin data from [汉典网](http://zdic.net)
@@ -25,5 +30,7 @@

* [Unihan Database Lookup](http://www.unicode.org/charts/unihan.html)
* [汉典 zdic.net](http://www.zdic.net/)
* [字海网,叶典网](http://zisea.com/)
* [Chinese characters in Unicode, GB2312, GBK, and GB18030](http://www.fmddlmyy.cn/text24.html)

[unihan]: http://www.unicode.org/charts/unihan.html
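
The data format described in the README (`{code point}: {pinyins} # {hanzi}`, with `#`-prefixed comment lines) can be read with a few lines of Python. This is only a sketch under that stated format; `parse_line` is an illustrative name and is not the repository's own `parse_pinyins`, whose implementation is not shown in this diff.

```python
def parse_line(line):
    """Parse one data line into (code_point, pinyins, hanzi).

    Returns None for blank lines and for comment lines starting with '#'.
    """
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    # Everything after the first '#' is the trailing hanzi comment.
    entry, _, hanzi = line.partition('#')
    code_point, _, pinyins = entry.partition(':')
    pinyin_list = [p.strip() for p in pinyins.split(',') if p.strip()]
    return code_point.strip(), pinyin_list, hanzi.strip()


print(parse_line('U+4E2D: zhōng,zhòng  # 中'))
print(parse_line('# version: 0.3.0'))
```

Returning `None` for comments keeps the caller simple: it can filter out `None` results without knowing about the header lines this commit adds to `pinyin.txt`.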
6 changes: 6 additions & 0 deletions merge_unihan.py
@@ -70,6 +70,9 @@ def extend_pinyins(old_map, new_map, only_no_exists=False):
# because the kHanyuPinlu pinyin data contains some unwanted neutral-tone pinyin
# and some tone marks placed on the wrong letter, e.g. ``ǒu`` written as ``oǔ``
extend_pinyins(raw_pinyin_map, khanyupinyinlu, only_no_exists=True)
with open('PUA.txt') as fp:
    pua_pinyin_map = parse_pinyins(fp.readlines())
extend_pinyins(raw_pinyin_map, pua_pinyin_map)

with open('overwrite.txt') as fp:
    overwrite_pinyin_map = parse_pinyins(fp.readlines())
@@ -88,5 +91,8 @@ def extend_pinyins(old_map, new_map, only_no_exists=False):
assert set(kxhc1983.keys()) - code_set == set()
assert set(adjust_pinyin_map.keys()) - code_set == set()
assert set(overwrite_pinyin_map.keys()) - code_set == set()
assert set(pua_pinyin_map.keys()) - code_set == set()
with open('pinyin.txt', 'w') as fp:
    fp.write('# version: 0.3.0\n')
    fp.write('# source: https://github.com/mozillazg/pinyin-data\n')
    save_data(new_pinyin_map, fp)
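
The two `fp.write` calls above put a `# version` and a `# source` line at the top of `pinyin.txt`. A consumer could read them back roughly as follows; this is a sketch that assumes the header stays in `# key: value` form, and `read_header` is a hypothetical helper that does not exist in `merge_unihan.py`.

```python
import io


def read_header(fp):
    """Collect leading '# key: value' comment lines into a dict."""
    header = {}
    for line in fp:
        if not line.startswith('#'):
            break  # the first data line ends the header
        # Split on the first ':' only, so URLs in the value stay intact.
        key, sep, value = line[1:].partition(':')
        if sep:
            header[key.strip()] = value.strip()
    return header


# In-memory stand-in for the first lines of pinyin.txt.
sample = io.StringIO(
    '# version: 0.3.0\n'
    '# source: https://github.com/mozillazg/pinyin-data\n'
    'U+3007: líng  # 〇\n'
)
header = read_header(sample)
print(header['version'], header['source'])
```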
1 change: 1 addition & 0 deletions overwrite.txt
@@ -3,3 +3,4 @@
# Data format: {code point}: {pinyins} # {hanzi}
# Example:
# U+4E2D: zhōng,zhòng # 中
U+7F57: luó # 罗
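
The effect of this new `overwrite.txt` entry shows up in `pinyin.txt`, where `U+7F57` changes from `luō,luó` to `luó`. A minimal sketch of that replace-on-conflict behaviour follows, assuming each map goes from a code point string to a list of pinyins; `apply_overwrite` is a hypothetical helper, and the actual merge logic in `merge_unihan.py` is not shown in this diff and may differ.

```python
def apply_overwrite(merged, overwrite):
    """Return a copy of `merged` in which entries from `overwrite` win."""
    result = dict(merged)
    # An overwrite entry replaces the whole pinyin list; it is not appended.
    result.update(overwrite)
    return result


merged = {'U+7F56': ['luó'], 'U+7F57': ['luō', 'luó']}
overwrite = {'U+7F57': ['luó']}

final = apply_overwrite(merged, overwrite)
print(final['U+7F57'])
```

Replacing the list outright, rather than extending it, is what allows a manual correction to remove an unwanted reading such as `luō`.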
74 changes: 73 additions & 1 deletion pinyin.txt
@@ -1,3 +1,5 @@
# version: 0.3.0
# source: https://github.com/mozillazg/pinyin-data
U+3007: líng # 〇
U+3400: qiū # 㐀
U+3401: tiàn # 㐁
@@ -18381,7 +18383,7 @@ U+7F53: gāng # 罓
U+7F54: wǎng,wáng # 罔
U+7F55: hǎn,hàn # 罕
U+7F56: luó # 罖
U+7F57: luō,luó # 罗
U+7F57: luó # 罗
U+7F58: fú # 罘
U+7F59: shēn # 罙
U+7F5A: fá # 罚
@@ -26658,6 +26660,76 @@ U+9FCE: tǎ # 鿎
U+9FCF: mài # 鿏
U+9FD4: gē # 鿔
U+9FD5: dān # 鿕
U+E815: yè # 
U+E816: zuǒ,yǒu # 
U+E818: gǔn # 
U+E81A: zhòu,zhū # 
U+E81B: zhòu,zhū # 
U+E81D: jié,jiē # 
U+E81F: wāi # 
U+E820: hǎn # 
U+E821: hǎn # 
U+E824: zhòu # 
U+E825: zhòu # 
U+E826: shǒu # 
U+E827: gāng # 
U+E828: kuǎi # 
U+E829: sǒng # 
U+E82A: sǒng # 
U+E82B: fēng # 
U+E82C: gòng # 
U+E82D: gāng # 
U+E82E: huì,kuì # 
U+E82F: tà # 
U+E830: jiān # 
U+E831: ēn # 
U+E832: xiǎo # 
U+E834: lóu,lǘ # 
U+E835: cǎn,shān,cēn # 
U+E836: zhú # 
U+E837: chōu,chóu # 
U+E838: wǎng # 
U+E83A: yáng,xiáng # 
U+E83B: zāi # 
U+E83C: bà,bēi # 
U+E83D: bà,bēi # 
U+E83F: zhuān,chuán,chún,zhuǎn # 
U+E840: qióng # 
U+E841: kuì,huì # 
U+E842: kuì,huì # 
U+E843: juǎn # 
U+E844: xīn # 
U+E845: yàn # 
U+E846: qíng # 
U+E847: qíng # 
U+E849: shàn # 
U+E84A: yé,yá # 
U+E84B: pō # 
U+E84C: shàn # 
U+E84D: zhuō # 
U+E84E: shàn # 
U+E84F: jué # 
U+E850: chuài # 
U+E851: zhèng # 
U+E852: chuài # 
U+E853: zhèng # 
U+E854: zhuó # 
U+E855: yíng # 
U+E856: yú # 
U+E857: yìn # 
U+E858: chūn # 
U+E859: qiū # 
U+E85A: yú # 
U+E85B: téng # 
U+E85C: shī # 
U+E85D: jiāo # 
U+E85E: liè # 
U+E85F: jīng # 
U+E860: jú # 
U+E861: tī # 
U+E862: pì # 
U+E863: yǎn # 
U+E864: luán # 
U+20000: hē # 𠀀
U+20001: qī # 𠀁
U+20003: qiě,jī # 𠀃