StackOverflowError in the computeInfo function of ContentExtractor #116
Comments
You can paste the modified code here.
I'll take a look when I have time. Sorry, I've been quite busy lately.
yanpeng <[email protected]> wrote on Fri, Nov 29, 2019 at 11:47 AM:
… `// Compute the statistics for a leaf node
private CountInfo computeLeafNodeInfo(Node node)
{
    if (node instanceof TextNode)
    {
        TextNode tn = (TextNode) node;
        CountInfo countInfo = new CountInfo();
        String text = tn.text();
        int len = text.length();
        countInfo.textCount = len;
        countInfo.leafList.add(len);
        return countInfo;
    }
    else
    {
        return new CountInfo();
    }
}

protected CountInfo computeInfo(Node node)
{
    if (node instanceof Element)
    {
        Deque<Node> stack = new ArrayDeque<Node>();
        Deque<Node> queue = new ArrayDeque<Node>();
        Set<Node> accessedNodes = new HashSet<Node>();
        Map<Node, CountInfo> nodeInfoMap = new HashMap<Node, CountInfo>();
        stack.addFirst(node);
        // Phase 1: build a post-order sequence of the subtree in the queue
        while (!stack.isEmpty())
        {
            Node headerNode = stack.getFirst();
            // If this is a non-leaf node seen for the first time, add it to the
            // visited set and push its children onto the stack in reverse order
            if ((headerNode instanceof Element) && !accessedNodes.contains(headerNode))
            {
                accessedNodes.add(headerNode);
                for (int i = headerNode.childNodeSize() - 1; i >= 0; --i)
                {
                    stack.addFirst(headerNode.childNode(i));
                }
            }
            else // Leaf nodes and already-visited non-leaf nodes are moved to the queue
            {
                queue.addLast(stack.removeFirst());
            }
        }
        // Phase 2: children always precede their parent in the queue, so each
        // element can be computed from the entries already in nodeInfoMap
        while (!queue.isEmpty())
        {
            Node headerNode = queue.removeFirst();
            if (headerNode instanceof Element)
            {
                Element tag = (Element) headerNode;
                CountInfo countInfo = new CountInfo();
                for (Node childNode : headerNode.childNodes())
                {
                    CountInfo childCountInfo = nodeInfoMap.get(childNode);
                    countInfo.textCount += childCountInfo.textCount;
                    countInfo.linkTextCount += childCountInfo.linkTextCount;
                    countInfo.tagCount += childCountInfo.tagCount;
                    countInfo.linkTagCount += childCountInfo.linkTagCount;
                    countInfo.leafList.addAll(childCountInfo.leafList);
                    countInfo.densitySum += childCountInfo.density;
                    countInfo.pCount += childCountInfo.pCount;
                }
                countInfo.tagCount++;
                String tagName = tag.tagName();
                if (tagName.equals("a"))
                {
                    countInfo.linkTextCount = countInfo.textCount;
                    countInfo.linkTagCount++;
                }
                else if (tagName.equals("p"))
                {
                    countInfo.pCount++;
                }
                int pureLen = countInfo.textCount - countInfo.linkTextCount;
                int len = countInfo.tagCount - countInfo.linkTagCount;
                if (pureLen == 0 || len == 0)
                {
                    countInfo.density = 0;
                }
                else
                {
                    countInfo.density = (pureLen + 0.0) / len;
                }
                infoMap.put(tag, countInfo);
                nodeInfoMap.put(headerNode, countInfo);
            }
            else
            {
                nodeInfoMap.put(headerNode, computeLeafNodeInfo(headerNode));
            }
        }
        return nodeInfoMap.get(node);
    }
    else
    {
        return computeLeafNodeInfo(node);
    }
}`
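For readers who want to try the idea in isolation, here is a minimal sketch of the same explicit-stack post-order traversal on a toy tree. `TreeNode`, `PostOrderSketch`, and `textLength` are hypothetical names made up for this illustration; they are not part of jsoup or WebCollector. Unlike the code above, the sketch folds the second loop into the pop step and keys its maps by object identity, so treat it as one possible variant rather than the posted patch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a DOM node (assumption, not a jsoup class).
final class TreeNode {
    final String text; // non-null only for text leaves
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(String text) {
        this.text = text;
    }
}

final class PostOrderSketch {
    // Sums the text length of all leaves under root without recursion.
    static int textLength(TreeNode root) {
        Deque<TreeNode> stack = new ArrayDeque<>();
        // Identity-based collections: nodes are compared by reference only.
        Set<TreeNode> expanded = Collections.newSetFromMap(new IdentityHashMap<>());
        Map<TreeNode, Integer> totals = new IdentityHashMap<>();

        stack.push(root);
        while (!stack.isEmpty()) {
            TreeNode top = stack.peek();
            if (!top.children.isEmpty() && !expanded.contains(top)) {
                // First visit of an inner node: keep it on the stack and
                // push its children in reverse so they pop in document order.
                expanded.add(top);
                for (int i = top.children.size() - 1; i >= 0; --i) {
                    stack.push(top.children.get(i));
                }
            } else {
                // Leaf, or inner node whose children are all finished.
                stack.pop();
                int sum = (top.text == null) ? 0 : top.text.length();
                for (TreeNode child : top.children) {
                    sum += totals.get(child);
                }
                totals.put(top, sum);
            }
        }
        return totals.get(root);
    }

    public static void main(String[] args) {
        // Build a chain 200,000 levels deep ending in one text leaf.
        TreeNode root = new TreeNode(null);
        TreeNode current = root;
        for (int i = 0; i < 200_000; i++) {
            TreeNode next = new TreeNode(null);
            current.children.add(next);
            current = next;
        }
        current.children.add(new TreeNode("hello"));
        System.out.println(textLength(root)); // prints 5
    }
}
```

Because the pending nodes live in an `ArrayDeque` on the heap, the nesting depth is bounded by available memory rather than by the thread's call stack, which is the property the non-recursive rewrite relies on.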
For example, processing the URL http://www.suixian.gov.cn/news/News_View.asp?NewsID=20939 triggers the error above. I have already rewritten the function as a non-recursive version. May I submit the code?