Chinese text comes out garbled when fetching pages with a crawler that extends BreadthCrawler #108

Open
linye271709915 opened this issue Sep 3, 2019 · 2 comments

Comments

@linye271709915

Inside visit:
String name = page.select("h1").text();
String content = page.select("h2").html();

System.out.println("Name: " + name);
System.out.println("Content: " + content);

Console output:
Name: 姝h���瑁��ㄨ���ㄧО����瑁��虹�-DXDK110
Content: 姝h���瑁��ㄨ���ㄧО����瑁��虹�-DXDK110浜у��绠�浠

@hujunxianligong
Member

You can call page.charset("utf-8") to set the page's encoding first, and then run the code above.
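
The garbled output in the report is consistent with the page bytes being decoded with a charset other than the one the page was written in, which is what setting the charset before reading the text avoids. Below is a minimal JDK-only sketch of that effect. It is not WebCollector code: the class name CharsetMismatchDemo, the helper decode, and the sample string are made up for illustration, and treating the page as UTF-8 misread as GBK is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Hypothetical helper: decode raw bytes with the named charset.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so the
        // source file's own encoding does not affect the demo.
        String original = "\u540d\u79f0";

        // Assumption: the server sends the page as UTF-8 bytes.
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with a mismatched charset does not throw; it silently
        // yields different characters.
        String wrong = decode(raw, "GBK");

        // Decoding with the charset the bytes were written in restores the text.
        String right = decode(raw, "UTF-8");

        System.out.println("GBK decode matches original:   " + original.equals(wrong));
        System.out.println("UTF-8 decode matches original: " + original.equals(right));
    }
}
```

In the crawler itself the equivalent step is the page.charset("utf-8") call mentioned above, made before page.select(...) is used.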

@xiejx618

xiejx618 commented Sep 9, 2019

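@hujunxianligong Could cn.edu.hfut.dmic.webcollector.util.CharsetDetector#guessEncoding be changed so that when it guesses gb2312 it returns GB18030 instead? GB18030 is compatible with both GBK and GB2312. Take this page, for example:
http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/44/4419.html
The page is clearly gb2312, yet cn.edu.hfut.dmic.webcollector.model.Page#html() returns garbled text, while a browser renders it without any problem. Calling page.charset("GB18030") also fixes it, but I would rather not set it on every page.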
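
The proposal above can be sketched with the JDK alone. This is not a patch to CharsetDetector#guessEncoding: the class Gb18030Fallback and the methods widen and decode are hypothetical names, and the sample character is just one that GB18030 can encode but GB2312 cannot.

```java
import java.nio.charset.Charset;

public class Gb18030Fallback {

    // Hypothetical post-processing step for a guessed charset name:
    // a "gb2312" guess is widened to GB18030, anything else is kept.
    public static String widen(String guessed) {
        if (guessed != null && guessed.equalsIgnoreCase("gb2312")) {
            return "GB18030";
        }
        return guessed;
    }

    // Decode raw bytes with the named charset.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // U+9F8D is a traditional character that is outside the GB2312
        // repertoire but is covered by GB18030.
        String sample = "\u9f8d";
        byte[] raw = sample.getBytes(Charset.forName("GB18030"));

        String guessed = "gb2312"; // what a detector might report for such a page

        // Compare decoding with the raw guess against decoding with the widened name.
        System.out.println("gb2312 decode matches:  " + sample.equals(decode(raw, guessed)));
        System.out.println("widened decode matches: " + sample.equals(decode(raw, widen(guessed))));
    }
}
```

Because GB18030 is compatible with GB2312, pages that really are plain GB2312 should decode the same way after widening, so the mapping only changes the result for bytes that GB2312 cannot represent.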
