Fix/Meta extraction - rules and error update #48

Merged
merged 3 commits into from
Apr 18, 2024
4 changes: 2 additions & 2 deletions package-lock.json

Some generated files are not rendered by default.

4 changes: 2 additions & 2 deletions packages/client/package.json
@@ -1,6 +1,6 @@
{
"name": "@openreview/client",
- "version": "0.0.31",
+ "version": "0.0.32",
"description": "Node.js client library for OpenReview's academic publishing API",
"main": "src/index.js",
"type": "module",
@@ -42,4 +42,4 @@
"mocha": "^10.2.0"
},
"gitHead": "e83fe20886ca81f61c67c1884941b8df937c24c3"
- }
+ }
2 changes: 1 addition & 1 deletion packages/client/src/tools.js
@@ -894,7 +894,7 @@ export default class Tools {
const contentType = result.headers.get('content-type');
throw new OpenReviewError({
name: 'ExtractAbstractError',
- message: (contentType && contentType.indexOf('application/json') !== -1) ? await result.json() : await result.text(),
+ message: (contentType && contentType.indexOf('application/json') !== -1) ? JSON.stringify(await result.json()) : await result.text(),
status: result.status || 500
});
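
A minimal sketch of what this change appears to address: `result.json()` resolves to a parsed object, so passing it straight through as `message` would hand the error an object instead of a string. The helper `readErrorMessage` and the stub `fakeResult` below are hypothetical and are not part of `@openreview/client`; the stub only imitates the few members of a fetch-style response that the helper touches.

```javascript
// Hypothetical helper: builds a string message from a failed fetch-style response.
async function readErrorMessage(result) {
  const contentType = result.headers.get('content-type');
  const isJson = Boolean(contentType) && contentType.includes('application/json');
  // JSON bodies parse to objects, so serialize them back to text for the message.
  return isJson ? JSON.stringify(await result.json()) : result.text();
}

// Stand-in for a fetch Response (assumption: only these members are used).
const fakeResult = {
  status: 400,
  headers: {
    get: (name) => (name === 'content-type' ? 'application/json; charset=utf-8' : null),
  },
  json: async () => ({ name: 'Error', message: 'Abstract not found' }),
  text: async () => 'unused',
};

// An object coerced to a string loses its contents.
console.log(String({ message: 'Abstract not found' })); // [object Object]

// The serialized form keeps the server's payload readable.
readErrorMessage(fakeResult).then((message) => console.log(message));
```

With the serialized form, the original payload survives in logs and in any string interpolation downstream.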

2 changes: 1 addition & 1 deletion packages/meta-extraction/package.json
@@ -1,6 +1,6 @@
{
"name": "@openreview/meta-extraction",
- "version": "0.0.6",
+ "version": "0.0.7",
"description": "Extract abstracts for DBLP papers",
"main": "src/index.js",
"type": "module",
15 changes: 10 additions & 5 deletions packages/meta-extraction/src/abstractExtractionRules.js
@@ -586,19 +586,23 @@ const neuripsCCRule = {

const sections = await page.$$('h4');

- let abstract = null;
+ let nextElementText = null;
+ let nextNextElementText = null;

for (let index = 0; index < sections.length; index++) {
const textContent = await page.evaluate((p) => p.textContent, sections[index]);
if (textContent==='Abstract'){
- const abstractContentElement = await sections[index].evaluateHandle(el => el.nextElementSibling.nextElementSibling);
- abstract = await page.evaluate((p) => p.textContent, abstractContentElement);

+ const nextElement = await sections[index].evaluateHandle(el => el.nextElementSibling);
+ const nextNextElement = await sections[index].evaluateHandle(el => el.nextElementSibling.nextElementSibling);
+ nextElementText = await page.evaluate((p) => p?.textContent, nextElement);
+ nextNextElementText = await page.evaluate((p) => p?.textContent, nextNextElement);
}
}

const allEvidence = [
...highwirePressTags,
- { type: 'abstract', value: abstract }
+ { type: 'abstract', value: nextElementText },

Member:
Why do you need to set the abstract twice?

Contributor Author:
For some URLs, nextElementText holds the abstract text; for other URLs, nextNextElementText does.

Entries of type 'abstract' in allEvidence are candidate abstracts, and the longest one is taken.

+ { type: 'abstract', value: nextNextElementText }
];

const abstractEvidences = allEvidence.filter(
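
The selection described in the review thread (several candidates of type `'abstract'`, longest one wins) can be sketched as follows. This is an illustration under assumptions, not the repository's code: the function name `pickLongestAbstract` is hypothetical, and the trimming and null handling are guesses at reasonable behavior, since the actual filter body is not shown in this diff.

```javascript
// Hypothetical sketch: choose the longest non-empty candidate of type 'abstract'.
function pickLongestAbstract(allEvidence) {
  const candidates = allEvidence
    // Keep only abstract candidates whose value is actually a string;
    // a missing sibling element yields null or undefined.
    .filter((evidence) => evidence.type === 'abstract' && typeof evidence.value === 'string')
    .map((evidence) => evidence.value.trim())
    .filter((value) => value.length > 0);

  if (candidates.length === 0) return null;

  // Keep whichever candidate has the most characters.
  return candidates.reduce((longest, value) => (value.length > longest.length ? value : longest));
}

const evidence = [
  { type: 'abstract', value: 'Part of' }, // e.g. a short label next to the heading
  { type: 'abstract', value: 'We study a new method for extracting abstracts.' },
  { type: 'abstract', value: undefined }, // sibling that did not exist
];

console.log(pickLongestAbstract(evidence));
// We study a new method for extracting abstracts.
```

Registering both siblings as candidates means the rule does not need to know in advance which page layout it is looking at.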
@@ -976,6 +980,7 @@ const runAllRules = async (html, page, url) => {
const { abstract, pdf, error } = await rule.executeRule(html, page);

if (error) {
+ if (error==='openreview rule') return {};
return {error};
}

2 changes: 1 addition & 1 deletion packages/meta-extraction/src/helpers.js
@@ -4,7 +4,7 @@ const shouldEnableMultiRedirect = (url) => [/doi.org/, /linkinghub.elsevier.com/

const getTimeout = (url) => {
const defaultTimeout = 15_000;
- if ([/doi.org/, /spiedigitallibrary.org/, /iospress.com/].some((regex) => regex.test(url))) return defaultTimeout*2;
+ if ([/doi.org/, /spiedigitallibrary.org/, /iospress.com/].some((regex) => regex.test(url))) return defaultTimeout*3;
return defaultTimeout;
};
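
For reference, a standalone sketch of the resulting behavior: hosts on the slow list now get three times the default timeout (45 seconds instead of 30). The constant names `DEFAULT_TIMEOUT_MS` and `SLOW_HOSTS` are hypothetical, and the dots in the patterns are escaped here, which differs from the source, where an unescaped `.` matches any character.

```javascript
// Sketch of the post-change behavior; constant names are hypothetical.
const DEFAULT_TIMEOUT_MS = 15_000;

// Assumption: dots are escaped so they match only a literal '.'.
const SLOW_HOSTS = [/doi\.org/, /spiedigitallibrary\.org/, /iospress\.com/];

// Slow hosts get triple the default timeout; everything else gets the default.
const getTimeout = (url) =>
  SLOW_HOSTS.some((regex) => regex.test(url)) ? DEFAULT_TIMEOUT_MS * 3 : DEFAULT_TIMEOUT_MS;

console.log(getTimeout('https://doi.org/10.1000/xyz')); // 45000
console.log(getTimeout('https://example.com/paper')); // 15000
```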

3 changes: 1 addition & 2 deletions packages/meta-extraction/test/test.js

Some generated files are not rendered by default.