Thank you for your support.
Recently, while browsing the database, I found that the Ding Xiang Yuan data were being updated abnormally: the `createTime` and `modifyTime` fields of a large amount of overseas data and a small amount of mainland China data change even when the epidemic figures themselves have not changed. As a result, overseas data were stored in the database multiple times, and the duplicated entries differ from one another only in `createTime` and `modifyTime`.
My guess is that the `createTime` and `modifyTime` fields change whenever Ding Xiang Yuan modifies the data of any country/province/city, which causes this problem. Therefore, in the two most recent updates to the real-time crawler, ced5fda and 540ae98, I removed these two fields, so similar problems should not occur in the future.
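As a rough illustration of why dropping the two fields helps, here is a minimal sketch and not the crawler's actual code. It assumes each record is a plain dict and that a new record is compared with the latest stored one before insertion. The helper names `strip_volatile` and `should_store`, and the sample field names other than `createTime` and `modifyTime`, are hypothetical.

```python
# Fields that change even when the epidemic figures do not.
VOLATILE_FIELDS = ("createTime", "modifyTime")


def strip_volatile(record):
    """Return a copy of the record without the volatile timestamp fields."""
    return {k: v for k, v in record.items() if k not in VOLATILE_FIELDS}


def should_store(new_record, latest_stored):
    """Store a record only if it differs from the latest stored record
    once the volatile fields have been dropped (hypothetical policy)."""
    if latest_stored is None:
        return True
    return strip_volatile(new_record) != strip_volatile(latest_stored)


if __name__ == "__main__":
    old = {"countryName": "X", "confirmedCount": 10, "createTime": 1, "modifyTime": 2}
    new = {"countryName": "X", "confirmedCount": 10, "createTime": 3, "modifyTime": 4}
    # Only the timestamps differ, so the new record is not treated as a change.
    print(should_store(new, old))
```

With the timestamps out of the comparison, two crawls that return the same figures no longer look like two different records.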
At the same time, for the historical data, I deleted the duplicate entries according to the following rules:
- Keep the data obtained the first time and delete the remaining duplicates. For example, if the same epidemic data were obtained at three different time points, only the data from the first time point are retained, and the data from the other two time points are deleted.
- Only the epidemic-data fields are checked for duplication. For example, if the `operator` who entered the data is different, both entries are retained even if the numbers are the same.

(This description may be imprecise in places.)
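The two rules above can be sketched as follows. This is only an illustration under stated assumptions, not the script that was actually run: records are plain dicts, the time at which a record was fetched is stored in a hypothetical `crawlTime` field, and any later record whose remaining fields (including `operator`) are identical to an earlier one is treated as a duplicate, regardless of what was fetched in between.

```python
import json

# Timestamp fields that are ignored when deciding whether two records match.
IGNORED_FIELDS = {"createTime", "modifyTime"}


def deduplicate(records, time_field="crawlTime"):
    """Keep the earliest record of each group of identical records.

    `time_field` is a hypothetical name for the fetch time; it is used for
    ordering and is excluded from the comparison. `operator` stays part of
    the comparison key, so entries by different operators are both kept.
    """
    skip = IGNORED_FIELDS | {time_field}
    seen = set()
    kept = []
    for record in sorted(records, key=lambda r: r[time_field]):
        # A canonical JSON string serves as a hashable comparison key.
        key = json.dumps(
            {k: v for k, v in record.items() if k not in skip},
            sort_keys=True,
            ensure_ascii=False,
            default=str,
        )
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


if __name__ == "__main__":
    sample = [
        {"provinceName": "A", "confirmedCount": 5, "operator": "u1",
         "createTime": 10, "modifyTime": 10, "crawlTime": 1},
        {"provinceName": "A", "confirmedCount": 5, "operator": "u1",
         "createTime": 20, "modifyTime": 20, "crawlTime": 2},
        {"provinceName": "A", "confirmedCount": 5, "operator": "u2",
         "createTime": 30, "modifyTime": 30, "crawlTime": 3},
    ]
    kept = deduplicate(sample)
    print(len(sample) - len(kept), "duplicate(s) removed")
```

Because `operator` is part of the key, the third sample record survives even though its numbers match the first.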
In total, 12716 duplicate documents were removed.
In the latest data update, d166029, and in all later updates, duplicate entries are no longer retained. If you need to trace the duplicated entries, you can find them in c8d6947 and earlier data.