Subclassing BreadthCrawler: Chinese text from fetched pages prints as mojibake #108

linye271709915 · 2019-09-03T09:51:47Z

Inside visit:
String name = page.select("h1").text();
String content = page.select("h2").html();

System.out.println("名称" + name);     // "名称" = "name"
System.out.println("内容" + content);  // "内容" = "content"

Console output:
名称姝ｈ��瑁��ㄨ��ㄧО��瑁��虹�-DXDK110
内容姝ｈ��瑁��ㄨ��ㄧО��瑁��虹�-DXDK110浜у��绠�浠

hujunxianligong · 2019-09-03T11:27:19Z

Call page.charset("utf-8") to set the page's actual encoding first, then run the operations above.
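For anyone wondering why setting the charset helps: the garbled output above has the look of UTF-8 bytes decoded with a GBK-family charset, although the thread does not confirm which charset was actually guessed. Below is a minimal standalone sketch of that effect using only the JDK. The class name MojibakeDemo and the helper decode are made up for illustration and are not part of WebCollector.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: decode raw response bytes with a named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "\u540d\u79f0" is the label "名称"; escapes keep the demo
        // independent of the source file's own encoding.
        String original = "\u540d\u79f0";
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        // Wrong charset: the same bytes are read as unrelated GBK characters.
        System.out.println("decoded as GBK:   " + decode(raw, "GBK"));
        // Right charset: the original text comes back.
        System.out.println("decoded as UTF-8: " + decode(raw, "UTF-8"));
    }
}
```

The bytes are identical in both calls; only the charset used to interpret them differs, which is what page.charset("utf-8") changes before page.select(...) is called.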

xiejx618 · 2019-09-09T04:41:29Z

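@hujunxianligong Could cn.edu.hfut.dmic.webcollector.util.CharsetDetector#guessEncoding be changed so that when it guesses gb2312 it returns GB18030 instead? GB18030 is compatible with both GBK and GB2312. Take this page, for example:
http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/44/4419.html
The page is clearly gb2312, yet cn.edu.hfut.dmic.webcollector.model.Page#html() returns mojibake, while a browser renders it correctly. Calling page.charset("GB18030") also fixes it, but I would rather not set it on every page.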
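A rough sketch of the mapping proposed here, written as a standalone helper rather than as a patch to CharsetDetector. The class name CharsetWidening and the method widen are hypothetical, and nothing below reflects how guessEncoding is actually implemented. The demo character U+9555 is one that is generally cited as present in GBK but absent from GB2312.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class CharsetWidening {
    // Hypothetical helper: replace a guessed GB2312/GBK label with GB18030,
    // which covers both; any other label is returned unchanged.
    static String widen(String guessed) {
        if (guessed == null) {
            return null;
        }
        String upper = guessed.trim().toUpperCase(Locale.ROOT);
        if (upper.equals("GB2312") || upper.equals("GBK")) {
            return "GB18030";
        }
        return guessed;
    }

    public static void main(String[] args) {
        // A page labelled gb2312 may still contain GBK-only characters.
        String text = "\u9555";
        byte[] raw = text.getBytes(Charset.forName("GBK"));

        String narrow = new String(raw, Charset.forName("GB2312"));
        String wide = new String(raw, Charset.forName(widen("gb2312")));

        System.out.println("GB2312 decode matches original:  " + narrow.equals(text));
        System.out.println("GB18030 decode matches original: " + wide.equals(text));
    }
}
```

In a crawler, the widened name could be passed to page.charset(...) at the top of visit, which would at least keep the per-page workaround in one place until the detector itself changes.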
