Commit

change CH mark to EN mark (#355)
* change CH mark to EN mark

* minor improvements

---------

Co-authored-by: null <[email protected]>
BeachWang and drcege authored Jul 16, 2024
1 parent e3cb8cb commit 9c7f316
Showing 4 changed files with 22 additions and 22 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -410,8 +410,8 @@ Data-Juicer thanks and refers to several community projects, such as
If you find our work useful for your research or development, please kindly cite the following [paper](https://arxiv.org/abs/2309.02033).
```
@inproceedings{chen2024datajuicer,
-title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
-author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pan and Ce Ge and Dawei Gao and Yuexiang Xie and Zhaoyang Liu and Jinyang Gao and Yaliang Li and Bolin Ding and Jingren Zhou},
+title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
+author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pan and Ce Ge and Dawei Gao and Yuexiang Xie and Zhaoyang Liu and Jinyang Gao and Yaliang Li and Bolin Ding and Jingren Zhou},
booktitle={International Conference on Management of Data},
year={2024}
}
4 changes: 2 additions & 2 deletions README_ZH.md
@@ -388,8 +388,8 @@ Data-Juicer thanks and refers to the following community open-source projects:

```
@inproceedings{chen2024datajuicer,
-title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
-author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pan and Ce Ge and Dawei Gao and Yuexiang Xie and Zhaoyang Liu and Jinyang Gao and Yaliang Li and Bolin Ding and Jingren Zhou},
+title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
+author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pan and Ce Ge and Dawei Gao and Yuexiang Xie and Zhaoyang Liu and Jinyang Gao and Yaliang Li and Bolin Ding and Jingren Zhou},
booktitle={International Conference on Management of Data},
year={2024}
}
12 changes: 6 additions & 6 deletions docs/Operators.md
@@ -52,7 +52,7 @@ All the specific operators are listed below, each featured with several capabili
|-----------------------------------------------------|--------------------|--------|---------------------------------------------------------------------------------------------------------------|
| audio_ffmpeg_wrapped_mapper | Audio | - | Simple wrapper to run a FFmpeg audio filter |
| chinese_convert_mapper | General | zh | Converts Chinese between Traditional Chinese, Simplified Chinese and Japanese Kanji (by [opencc](https://github.com/BYVoid/OpenCC)) |
-| clean_copyright_mapper | Code | en, zh | Removes copyright notice at the beginning of code files (:warning: must contain the word *copyright*) |
+| clean_copyright_mapper | Code | en, zh | Removes copyright notice at the beginning of code files (must contain the word *copyright*) |
| clean_email_mapper | General | en, zh | Removes email information |
| clean_html_mapper | General | en, zh | Removes HTML tags and returns plain text of all the nodes |
| clean_ip_mapper | General | en, zh | Removes IP addresses |
@@ -125,16 +125,16 @@ All the specific operators are listed below, each featured with several capabili
| stopwords_filter | General | en, zh | Keeps samples with stopword ratio above the specified threshold |
| suffix_filter | General | en, zh | Keeps samples with specified suffixes |
| text_action_filter | General | en, zh | Keeps samples containing action verbs in their texts |
-| text_entity_dependency_filter | General | en, zh | Keeps samples containing entity nouns related to other tokens in the dependency tree of the texts |
+| text_entity_dependency_filter | General | en, zh | Keeps samples containing dependency edges for an entity in the dependency tree of the texts |
| text_length_filter | General | en, zh | Keeps samples with total text length within the specified range |
| token_num_filter | General | en, zh | Keeps samples with token count within the specified range |
| video_aesthetics_filter | Video | - | Keeps samples whose specified frames have aesthetics scores within the specified range |
| video_aspect_ratio_filter | Video | - | Keeps samples containing videos with aspect ratios within the specified range |
-| video_duration_filter | Video | - | Keep data samples whose videos' durations are within a specified range
-| video_frames_text_similarity_filter | Multimodal | - | Keep data samples whose similarities between sampled video frame images and text are within a specific range
-| video_motion_score_filter | Video | - | Keep samples with video motion scores within a specific range
+| video_duration_filter | Video | - | Keep data samples whose videos' durations are within a specified range |
+| video_frames_text_similarity_filter | Multimodal | - | Keep data samples whose similarities between sampled video frame images and text are within a specific range |
+| video_motion_score_filter | Video | - | Keep samples with video motion scores within a specific range |
| video_nsfw_filter | Video | - | Keeps samples containing videos with NSFW scores below the threshold |
-| video_ocr_area_ratio_filter | Video | - | Keep data samples whose detected text area ratios for specified frames in the video are within a specified range
+| video_ocr_area_ratio_filter | Video | - | Keep data samples whose detected text area ratios for specified frames in the video are within a specified range |
| video_resolution_filter | Video | - | Keeps samples containing videos with horizontal and vertical resolutions within the specified range |
| video_watermark_filter | Video | - | Keeps samples containing videos with predicted watermark probabilities below the threshold |
| video_tagging_from_frames_filter | Video | - | Keep samples containing videos with given tags |
24 changes: 12 additions & 12 deletions docs/Operators_ZH.md
@@ -51,7 +51,7 @@ Operators in Data-Juicer are divided into the following 5 types.
|-----------------------------------------------------|-----------------------|-----------|--------------------------------------------------------|
| audio_ffmpeg_wrapped_mapper | Audio | - | Simple wrapper to run an FFmpeg audio filter |
| chinese_convert_mapper | General | zh | Converts between Traditional Chinese, Simplified Chinese and Japanese Kanji (using [opencc](https://github.com/BYVoid/OpenCC)) |
-| clean_copyright_mapper | Code | en, zh | Removes the copyright notice at the beginning of code files (:warning: must contain the word *copyright*) |
+| clean_copyright_mapper | Code | en, zh | Removes the copyright notice at the beginning of code files (must contain the word *copyright*) |
| clean_email_mapper | General | en, zh | Removes email information |
| clean_html_mapper | General | en, zh | Removes HTML tags and returns plain text of all the nodes |
| clean_ip_mapper | General | en, zh | Removes IP addresses |
@@ -61,8 +61,8 @@ Operators in Data-Juicer are divided into the following 5 types.
| image_blur_mapper | Image | - | Blurs images |
| image_captioning_from_gpt4v_mapper | Multimodal | - | Generates text based on gpt-4-vision and images |
| image_captioning_mapper | Multimodal | - | Generates samples whose captions are produced by another auxiliary model (e.g. blip2) from the images in the original samples. |
-| image_diffusion_mapper | Multimodal | - | Generates images with stable diffusion to augment images
-| image_face_blur_mapper | Image | - | Blurs faces in images
+| image_diffusion_mapper | Multimodal | - | Generates images with stable diffusion to augment images |
+| image_face_blur_mapper | Image | - | Blurs faces in images |
| nlpaug_en_mapper | General | en | Applies simple augmentation to English text using the `nlpaug` library |
| nlpcda_zh_mapper | General | zh | Applies simple augmentation to Chinese text using the `nlpcda` library |
| punctuation_normalization_mapper | General | en, zh | Normalizes various Unicode punctuation marks to their ASCII equivalents |
Expand All @@ -81,9 +81,9 @@ Data-Juicer 中的算子分为以下 5 种类型。
| video_captioning_from_frames_mapper | Multimodal | - | 生成样本,其标题是基于一个文字生成图片的模型和原始样本视频中指定帧的图像。不同帧产出的标题会拼接为一条单独的字符串。 |
| video_captioning_from_summarizer_mapper | Multimodal | - | 通过对多种不同方式生成的文本进行摘要以生成样本的标题(从视频/音频/帧生成标题,从音频/帧生成标签,...) |
| video_captioning_from_video_mapper | Multimodal | - | 生成样本,其标题是根据另一个辅助模型(video-blip)和原始样本中的视频中指定帧的图像。 |
| video_face_blur_mapper | Video | - | 对视频中的人脸进行模糊处理
| video_face_blur_mapper | Video | - | 对视频中的人脸进行模糊处理 |
| video_ffmpeg_wrapped_mapper | Video | - | 运行 FFmpeg 视频过滤器的简单封装 |
| video_remove_watermark_mapper | Video | - | 去除视频中给定区域的水印
| video_remove_watermark_mapper | Video | - | 去除视频中给定区域的水印 |
| video_resize_aspect_ratio_mapper | Video | - | 将视频的宽高比调整到指定范围内 |
| video_resize_resolution_mapper | Video | - | 将视频映射到给定的分辨率区间 |
| video_split_by_duration_mapper | Multimodal | - | 根据时长将视频切分为多个片段 |
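@@ -127,14 +127,14 @@ Operators in Data-Juicer are divided into the following 5 types.
| text_length_filter | General | en, zh | Keeps samples with total text length within the specified range |
| token_num_filter | General | en, zh | Keeps samples with token count within the specified range |
| video_aspect_ratio_filter | Video | - | Keeps samples containing videos with aspect ratios within the specified range |
-| video_duration_filter | Video | - | Keeps samples containing videos with durations within the specified range
+| video_duration_filter | Video | - | Keeps samples containing videos with durations within the specified range |
| video_aesthetics_filter | Video | - | Keeps samples whose specified frames have aesthetics scores within the specified range|
-| video_frames_text_similarity_filter | Multimodal | - | Keeps samples whose image-text feature cosine similarities (based on the CLIP model) for specified video frames are within the specified range
-| video_motion_score_filter | Video | - | Keeps samples containing videos with motion scores (based on dense optical flow) within the specified range
-| video_nsfw_filter | Video | - | Keeps samples containing videos with NSFW scores below the specified threshold
-| video_ocr_area_ratio_filter | Video | - | Keeps samples whose detected text area ratios in specific frames of the video are within the specified range
-| video_resolution_filter | Video | - | Keeps samples containing videos with resolutions (both horizontal and vertical) within the specified range
-| video_watermark_filter | Video | - | Keeps samples containing videos with watermark probabilities below the specified threshold
+| video_frames_text_similarity_filter | Multimodal | - | Keeps samples whose image-text feature cosine similarities (based on the CLIP model) for specified video frames are within the specified range |
+| video_motion_score_filter | Video | - | Keeps samples containing videos with motion scores (based on dense optical flow) within the specified range |
+| video_nsfw_filter | Video | - | Keeps samples containing videos with NSFW scores below the specified threshold |
+| video_ocr_area_ratio_filter | Video | - | Keeps samples whose detected text area ratios in specific frames of the video are within the specified range |
+| video_resolution_filter | Video | - | Keeps samples containing videos with resolutions (both horizontal and vertical) within the specified range |
+| video_watermark_filter | Video | - | Keeps samples containing videos with watermark probabilities below the specified threshold |
| video_tagging_from_frames_filter | Video | - | Keeps samples containing videos with given tags |
| words_num_filter | General | en, zh | Keeps samples with word count within the specified range |
| word_repetition_filter | General | en, zh | Keeps samples with word-level n-gram repetition ratio within the specified range |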
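
Several rows in this diff differ only by a trailing `|` that was missing before the commit. A check along the following lines could flag such rows before they are committed. It is a minimal sketch in Python and not part of the Data-Juicer repository: the function name `find_unterminated_rows` is hypothetical, and it assumes that any line starting with `|` is a table row.

```python
def find_unterminated_rows(markdown_text):
    """Return (line_number, row) pairs for table rows lacking a closing pipe."""
    problems = []
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        # Assumption: every line that starts with a pipe is a table row.
        if stripped.startswith("|") and not stripped.endswith("|"):
            problems.append((number, stripped))
    return problems


# Two rows taken from the docs/Operators.md hunk: the first is unterminated.
sample = "\n".join([
    "| video_duration_filter | Video | - | Keep data samples whose videos' durations are within a specified range",
    "| video_nsfw_filter | Video | - | Keeps samples containing videos with NSFW scores below the threshold |",
])

for number, row in find_unterminated_rows(sample):
    print(f"line {number}: missing trailing pipe: {row[:40]}...")
```

Running the function over `docs/Operators.md` and `docs/Operators_ZH.md` would list each affected row by line number; the pipe-prefix heuristic would need adjusting for files that use pipes outside tables.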