Possibly wrong Mediacloud test data? #154

Open
RadhiFadlillah opened this issue Jun 14, 2024 · 1 comment
Labels
question Further information is requested

Comments

@RadhiFadlillah
Contributor

Hi @adbar, thanks for this awesome library.

While porting this library to Go, I noticed there are two Mediacloud tests that might be wrong:

"https://www.baltimoresun.com/opinion/columnists/zurawik/bs-ed-zontv-media-year-20201223-cnvrlhkhnrbihcxx6wxcxt2b7y-story.html#ed=rss_www.baltimoresun.com/arcio/rss/category/latest/": {
	"file": "1805697156.html",
	"date": "2020-12-23"
},
"https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/": {
	"file": "1806793639.html",
	"date": "2020-12-25"
},

For baltimoresun, the JSON-LD contains the following snippet:

{
	// ... omitted
	"articleSection": "zurawik",
	"dateCreated": "2020-12-22T01:06:41.361Z",
	"datePublished": "2020-12-23T15:42:33.814Z",
	"dateModified": "2020-12-23T15:42:34.197Z",
	// ... omitted
}

From that snippet we can see that the creation date is 2020-12-22, one day before the published date. Since we want the original date, I think we should use 2020-12-22 instead of 2020-12-23?
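
To make the comparison concrete, here is a minimal Python sketch that parses the three timestamps from the snippet above and picks the earliest one. It only illustrates the "earliest date wins" idea raised in this issue; it is not htmldate's actual extraction logic, and `parse_utc` is a hypothetical helper written for this example.

```python
from datetime import datetime

# Values copied from the baltimoresun JSON-LD snippet above.
fields = {
    "dateCreated": "2020-12-22T01:06:41.361Z",
    "datePublished": "2020-12-23T15:42:33.814Z",
    "dateModified": "2020-12-23T15:42:34.197Z",
}


def parse_utc(value):
    """Hypothetical helper: parse an ISO 8601 timestamp ending in 'Z'."""
    # %z accepts a literal 'Z' as UTC on Python 3.7+.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")


parsed = {name: parse_utc(value) for name, value in fields.items()}

# Pick the field holding the earliest instant and report its calendar date.
earliest_field = min(parsed, key=parsed.get)
earliest_date = parsed[earliest_field].date().isoformat()
print(earliest_field, earliest_date)
```

Under the assumption that the original date is the earliest of the three, this selects `dateCreated`.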


For elbaladtv.net, the JSON-LD contains the following snippet:

{
	"@type": "WebPage",
	"@id": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/#webpage",
	"url": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/",
	"name": "\u062a\u0631\u0643\u0649 \u0622\u0644 \u0627\u0644\u0634\u064a\u062e \u0628\u0639\u062f \u0625\u0635\u0627\u0628\u0629 \u064a\u0633\u0631\u0627 \u0628\u0643\u0648\u0631\u0648\u0646\u0627: \u064a\u0627\u0631\u0628 \u064a\u0631\u0641\u0639 \u0639\u0646\u0643 - \u0642\u0646\u0627\u0629 \u0635\u062f\u0649 \u0627\u0644\u0628\u0644\u062f",
	"datePublished": "2020-12-25T01:59:50+02:00",
	"dateModified": "2020-12-25T01:59:50+02:00",
	"isPartOf": { "@id": "https://elbaladtv.net/#website" },
	"primaryImageOfPage": {
		"@id": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/#primaryImage"
	},
	"inLanguage": "ar"
}

It also contains the following meta tag:

<meta property="article:published_time" content="2020-12-24T23:59:50+00:00">

From those two, we can see that the published time in the JSON-LD and in the meta tag is actually the same instant; the former is expressed in UTC+2 while the latter is expressed in UTC+0.

So for the extraction result, I think we should use 2020-12-24, since it is based on UTC time instead of local time.
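
As a quick check of that claim, the following Python sketch parses both timestamps with the standard library and compares them. It only demonstrates the timezone arithmetic; whether the expected test date should follow the local date or the UTC date is still the open question here.

```python
from datetime import datetime, timezone

# Timestamps copied from the elbaladtv.net JSON-LD and meta tag above.
json_ld_published = datetime.fromisoformat("2020-12-25T01:59:50+02:00")
meta_published = datetime.fromisoformat("2020-12-24T23:59:50+00:00")

# Timezone-aware datetimes compare by absolute instant, not by wall-clock time.
same_instant = json_ld_published == meta_published

# Calendar date as written in the JSON-LD (local time, UTC+2).
local_date = json_ld_published.date().isoformat()

# Calendar date after normalizing the same instant to UTC.
utc_date = json_ld_published.astimezone(timezone.utc).date().isoformat()

print(same_instant, local_date, utc_date)
```

Both values denote the same moment, but the calendar date differs depending on whether the offset is applied before truncating to a date.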

adbar added the question label Jun 14, 2024
@adbar
Owner

adbar commented Jun 14, 2024

Hi @RadhiFadlillah, thanks for your feedback, I'll have a look.
