This repository has been archived by the owner on Mar 10, 2023. It is now read-only.

Merge pull request #10 from gricn/master
fix NATCM's bug
gricn authored Aug 27, 2021
2 parents 39a0c2d + 80a506d commit cdb1276
Showing 1 changed file with 2 additions and 4 deletions.
6 changes: 2 additions & 4 deletions webSpider/spiders/NATCM.py
@@ -125,11 +125,9 @@ def detailPage(self, response):
  # change "时间:2020-12-10 15:40:19" to "2020-12-10"
  item["publishingDate"] = re.search("(?<=:)\S*", date_origin).group(0)

- item["source"] = response.xpath(
-     "//td[@valign]/table[2]//td/span/p[last()]/text()"
- ).get()
+ item["source"] = "国家中医药管理局"

- article = "".join(response.xpath("//td[@valign]/table[2]//td").getall())
+ article = "".join(response.xpath("//td[@valign]/table[2]//td/span/p").getall())
  item["article"] = article

  item["plaintext"] = re.sub(r"\s(\s)+", " ", remove_tags(article))
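
For readers unfamiliar with the two regular expressions in this hunk, here is a minimal standalone sketch of what they do. It uses only the standard library. The helper names (`extract_date`, `strip_tags`, `collapse_whitespace`, `to_plaintext`) are hypothetical and not part of the spider, and `strip_tags` is a crude stand-in for the `remove_tags` call in the diff, whose exact behavior is not shown here.

```python
import re


def extract_date(date_origin: str) -> str:
    """Return the run of non-whitespace characters right after the first ':'.

    Mirrors the pattern in the diff, written as a raw string. Returns ""
    when the input has no colon (the spider itself would raise instead,
    because it calls .group(0) on the result directly).
    """
    match = re.search(r"(?<=:)\S*", date_origin)
    return match.group(0) if match else ""


def strip_tags(html: str) -> str:
    # Hypothetical, simplified replacement for remove_tags: drops anything
    # that looks like <...>. Not robust against malformed HTML.
    return re.sub(r"<[^>]+>", "", html)


def collapse_whitespace(text: str) -> str:
    # Same pattern as the diff: runs of two or more whitespace characters
    # become one space; a single whitespace character is left untouched.
    return re.sub(r"\s(\s)+", " ", text)


def to_plaintext(article_html: str) -> str:
    return collapse_whitespace(strip_tags(article_html))


if __name__ == "__main__":
    print(extract_date("时间:2020-12-10 15:40:19"))  # 2020-12-10
    print(to_plaintext("<p>line one</p>\n\n  <p>line two</p>"))  # line one line two
```

Note that the lookbehind anchors on the first ASCII colon, so the time part ("15:40:19") is never reached as long as a colon precedes the date. If the page used a full-width colon after the label, the pattern would match later in the string instead.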
