Update README to alert noise data existence (#110)
BlankerL committed Apr 19, 2022
1 parent 473767b commit f4c720f
Showing 2 changed files with 8 additions and 3 deletions.
8 changes: 6 additions & 2 deletions README.en.md
@@ -33,8 +33,12 @@ there is a city-level document recording Nanyang (Dengzhou inclusive) and Dengzh
Therefore, the "Dengzhou" data will be counted twice during summation.
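
The double counting described above can be avoided at summation time. The following is a minimal sketch, not code from this repository: the function name `sum_without_double_count`, the `included_in` mapping, and the sample counts are all hypothetical and only illustrate the idea of skipping a city whose figures are already part of another entry.

```python
def sum_without_double_count(city_counts, included_in):
    """Sum city-level counts, skipping cities already included in another entry.

    city_counts: dict mapping city name -> count (hypothetical structure).
    included_in: dict mapping a city -> the entry that already contains it.
    """
    total = 0
    for city, count in city_counts.items():
        parent = included_in.get(city)
        # Skip the city only if the entry that already contains it is present.
        if parent is not None and parent in city_counts:
            continue
        total += count
    return total


# Made-up counts, for illustration only.
henan = {"Nanyang": 100, "Dengzhou": 30, "Zhengzhou": 50}
total = sum_without_double_count(henan, {"Dengzhou": "Nanyang"})
```

If the containing entry is missing from the input, the city is counted normally, so the helper does not silently drop data.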

### Noise Data
-At present, some time series data for Zhejiang and Hubei have been found to contain noise.
-A possible reason is that the manually processed data were recorded incorrectly.
+1. At present, some time series data for Zhejiang and Hubei have been found to contain noise.
+A possible reason is that the manually processed data were recorded incorrectly.
+2. [Issue #110](https://github.com/BlankerL/DXY-COVID-19-Data/issues/110)
+reported that the confirmed case counts for Changchun and Jilin City in Jilin Province were swapped
+in the March 15 update by Dingxiang Yuan. To preserve data integrity,
+I have not modified the data, so please adjust it manually when you use it if you consider it necessary.

The crawler only records what it sees and does not handle any noisy data.
Therefore, if you use the data for scientific research, please preprocess and clean the data properly.
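
The manual adjustment suggested for Issue #110 could look like the sketch below. It is an illustration under stated assumptions, not the maintainer's method: the record layout (`city`, `date`, `confirmed` keys), the function name `swap_city_counts`, the year 2022, and the counts are all hypothetical, and the real files in this repository may use different column names and several snapshots per day.

```python
from datetime import date


def swap_city_counts(records, city_a, city_b, on_date, field="confirmed"):
    """Return a copy of records with `field` swapped between two cities on one date."""
    fixed = [dict(r) for r in records]  # copy so the raw data stays untouched
    a_rows = [r for r in fixed if r["city"] == city_a and r["date"] == on_date]
    b_rows = [r for r in fixed if r["city"] == city_b and r["date"] == on_date]
    if len(a_rows) != 1 or len(b_rows) != 1:
        raise ValueError("expected exactly one record per city on that date")
    a_rows[0][field], b_rows[0][field] = b_rows[0][field], a_rows[0][field]
    return fixed


# Made-up counts; the year is an assumption, the README only says "March 15".
raw = [
    {"city": "Changchun", "date": date(2022, 3, 15), "confirmed": 10},
    {"city": "Jilin City", "date": date(2022, 3, 15), "confirmed": 20},
]
cleaned = swap_city_counts(raw, "Changchun", "Jilin City", date(2022, 3, 15))
```

Working on a copy keeps the crawled data intact, which matches the decision above to leave the stored data unmodified.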
3 changes: 2 additions & 1 deletion README.md
@@ -18,7 +18,8 @@
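1. Some data are double-counted. As described in [Issue #21](https://github.com/BlankerL/DXY-COVID-19-Data/issues/21), the city-level data for Henan Province contain both a "Nanyang (Dengzhou inclusive)" entry and a "Dengzhou" entry, so the "Dengzhou" data are counted twice during summation.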

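### Data Anomalies
-Anomalies have been found in some time series data for Zhejiang and Hubei provinces. A possible reason is that the Dingxiang Yuan data are entered manually, so some values may have been entered incorrectly. For example, one crawl recorded 537 recovered cases for Zhejiang Province, and the figure was changed back to the normal value a few minutes later.
+1. Anomalies have been found in some time series data for Zhejiang and Hubei provinces. A possible reason is that the Dingxiang Yuan data are entered manually, so some values may have been entered incorrectly. For example, one crawl recorded 537 recovered cases for Zhejiang Province, and the figure was changed back to the normal value a few minutes later.
+2. [Issue #110](https://github.com/BlankerL/DXY-COVID-19-Data/issues/110) reported that the confirmed case counts for Changchun and Jilin City in Jilin Province were swapped in the March 15 update by Dingxiang Yuan. To preserve data integrity, I have not modified these data; please adjust them manually when you use them.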

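The crawler in this project only fetches and stores the data that Dingxiang Yuan publishes, and it does not detect or handle outliers. If you use the data for research purposes, please clean the data yourself. I have also opened an [anomalous data feedback channel](https://github.com/BlankerL/DXY-COVID-19-Crawler/issues/34) in the issues, where you can report potential anomalies directly; I will check and handle them periodically.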
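
One simple way to clean short-lived anomalies like the Zhejiang recovered-count example is to flag a snapshot that disagrees with both of its neighbours while the neighbours agree with each other. The sketch below is only a heuristic under assumptions: the function name `flag_transient_spikes`, the `tolerance` parameter, and every sample value except 537 are hypothetical, and it will not catch errors that persist across several crawls.

```python
def flag_transient_spikes(values, tolerance=0):
    """Return indices of snapshots that deviate from both neighbours.

    A point is flagged when the previous and next values are within
    `tolerance` of each other but the point itself is not within
    `tolerance` of either.
    """
    flagged = []
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        neighbours_agree = abs(nxt - prev) <= tolerance
        deviates = abs(cur - prev) > tolerance and abs(cur - nxt) > tolerance
        if neighbours_agree and deviates:
            flagged.append(i)
    return flagged


# Consecutive crawls of one count; only 537 comes from the README, the rest is made up.
snapshots = [200, 200, 537, 200, 205]
suspects = flag_transient_spikes(snapshots, tolerance=10)
```

Flagged indices are candidates for review rather than automatic deletion, since a genuine large update followed by a correction would look the same.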
