Paper indexing feature #4

tatecarson · 2020-07-20T15:25:23Z

Hello,

I was just checking out the NIME Publication ecosystem workshop and I thought I would post an idea for feedback.

I often find myself wanting to search the entire proceedings to see whether something has been mentioned before, but I can usually only do this with papers that are on Google Scholar or that I have on my computer. I know that you can search the titles on the proceedings page, but it can be very slow and it's not an ideal way to do research.

It would also be great if there was some way of searching the text of these documents from one place. I think this would make the archive much more meaningful.

I am not sure how difficult it would be to do this. I think you could generate an index of all of the papers offline and then have that be searchable and somehow linked to each PDF? I am interested in helping implement this but I do not know how to do each part. I am also not sure if it is even something that people need or want.

Thanks, looking forward to hearing your thoughts.

cpmpercussion · 2020-07-21T08:08:05Z

Hi Tate,

This is an awesome idea. The search-ability of the text of the NIME proceedings is actually pretty low!

One idea would be to do some text analysis by scraping the text out of every PDF. I have a few ideas about how to do this, but it might have to happen offline (e.g., with language analysis tools in Python).

It actually sounds like it would be a good project. I might come up with a few ideas about how we could do it sometime during NIME this week!

alexarje · 2020-07-21T13:36:38Z

I did text mining on the entire archive back in 2013 (see this paper). It is easy to get the text out using pdftotext, so I guess we could do that annually as part of the archiving step and make it available somehow?
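As a rough illustration of how the annual extraction step could be scripted, here is a minimal Python sketch that calls the pdftotext command-line tool once per PDF. It is not the script used for the 2013 text mining. The function names, the directory layout, and the example file name are assumptions made up for this sketch, and it assumes pdftotext is installed and on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def pdftotext_command(pdf_path: Path, out_dir: Path) -> List[str]:
    """Build the pdftotext call that writes <out_dir>/<pdf stem>.txt."""
    txt_path = out_dir / (pdf_path.stem + ".txt")
    # Basic invocation: pdftotext <input.pdf> <output.txt>
    return ["pdftotext", str(pdf_path), str(txt_path)]


def extract_all(pdf_dir: Path, out_dir: Path) -> List[Path]:
    """Run pdftotext on every PDF in pdf_dir and return the PDFs that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        result = subprocess.run(pdftotext_command(pdf_path, out_dir))
        if result.returncode != 0:
            # Keep going; collect failures so they can be checked by hand.
            failed.append(pdf_path)
    return failed


# Hypothetical file name, only to show the command that would be run.
example = pdftotext_command(Path("papers/nime2019_paper066.pdf"), Path("text"))
```

Calling `extract_all(Path("papers"), Path("text"))` would then leave one `.txt` file per PDF, and the returned list points at files that may need a closer look, for example scans from the early conferences.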

Does anyone want to help?

tatecarson · 2020-07-21T19:15:44Z

I would love to help, but I am not so clear on how to link the text scraped from the PDFs to the website. I can do the pdftotext portion though. I have also used ocrmypdf previously and it has worked well. I'm not sure if older NIME papers might require OCR.

I think it would be a good idea to look for a model of a proceedings that allows searching. I will look for that and see how it looks on the frontend. Maybe someone else has an idea.

alexarje · 2020-07-21T19:25:00Z

There is no need for OCR, although the PDF quality of some of the early conferences is a bit sketchy.

Great if you can look for some examples!

tatecarson · 2020-07-21T20:20:20Z

> I did text mining on the entire archive back in 2013 (see this paper). It is easy to get the text out using pdftotext, so I guess we could do that annually as part of the archiving step and make it available somehow?
>
> Does anyone want to help?

How did you automate this process? I'm having trouble figuring out that part.

After that, we can use something like elasticlunr to create a searchable front end. I am a little concerned that it will be too large to work well.
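To make the size concern concrete, here is a minimal Python sketch of the general idea behind a client-side search library: an inverted index built offline from the extracted text and serialized to JSON for a static site. It does not use elasticlunr's API or its index format. The paper IDs and texts are made up for illustration, and the search is a plain AND over terms with no ranking.

```python
import json
import re
from collections import defaultdict
from typing import Dict, List

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN.findall(text.lower())


def build_index(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each term to the sorted list of paper IDs that contain it."""
    postings = defaultdict(set)
    for paper_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(paper_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index: Dict[str, List[str]], query: str) -> List[str]:
    """Return the IDs of papers that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical paper IDs and texts, standing in for pdftotext output.
docs = {
    "nime2019_a": "A musical game for participatory performance",
    "nime2019_b": "Gesture sensing for musical interfaces",
}
index = build_index(docs)
index_json = json.dumps(index)  # the file a static site could serve
```

Measuring `len(index_json)` on the real extracted text would give a first estimate of whether a full-text index is small enough to ship to the browser, or whether it should be limited to titles, abstracts, and keywords.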

tatecarson · 2020-07-30T22:21:29Z

Mini-conf actually seems like a pretty good solution to this problem. Are you all looking at adopting some of its features? I saw it mentioned in another issue. It looks all around really great.

It doesn't do full-text searching but after some research, it seems like this would be a little difficult to do, especially with a static site. A search with titles and keywords is much better than no search at all though.

alexarje · 2020-07-31T05:01:26Z

Yes, and I like the network visualization they have there: http://www.mini-conf.org/paper_vis.html. How would this possibly work on nime.org?

alexarje · 2020-07-31T09:42:30Z

I just wonder whether any of this would help solve the Google Scholar issue as well. @tatecarson if you are interested in testing this out, that would be great!

tatecarson · 2020-07-31T16:43:32Z

I can look into how to combine this with the current NIME site. Could you describe more about the Google Scholar problem? Is it just that the articles are not indexing?

alexarje · 2020-07-31T18:58:35Z

> Could you describe more about the Google Scholar problem? Is it just that the articles are not indexing?

It is difficult to know exactly what is wrong, but it appears that only (or mainly) articles that have been self-archived in institutional repositories end up in Google Scholar. So there may be something wrong somewhere... We used to use the papercite plugin for WordPress, which has a way of creating metadata that works well with Google Scholar. But after we changed to the new web page, we have had problems (I think). There is a little more info in the cookbook about this.
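For context, Google Scholar's inclusion guidelines describe bibliographic `<meta>` tags such as `citation_title`, `citation_author`, `citation_publication_date`, and `citation_pdf_url` on each paper's own page. Whether missing tags are the actual cause of the problem here is not established. The Python sketch below only shows what generating such tags from one bibliography entry could look like; the function name, the dictionary keys, and the PDF URL are assumptions for illustration, and this is not how papercite or the current site produces its metadata.

```python
from html import escape
from typing import Dict, List


def scholar_meta_tags(entry: Dict) -> List[str]:
    """Render one bibliography entry as Highwire-style <meta> tags."""
    tags = [("citation_title", entry["title"])]
    # One citation_author tag per author, in order.
    tags += [("citation_author", author) for author in entry["authors"]]
    tags.append(("citation_publication_date", str(entry["year"])))
    tags.append(("citation_conference_title", entry["booktitle"]))
    tags.append(("citation_pdf_url", entry["pdf_url"]))
    return [
        '<meta name="%s" content="%s">' % (name, escape(value, quote=True))
        for name, value in tags
    ]


# Entry fields taken from the reference quoted in this thread;
# the pdf_url is a placeholder, not the real location of the PDF.
entry = {
    "title": "Mesh Garden: A creative-based musical game for participatory musical performance",
    "authors": ["Tate Carson"],
    "year": 2019,
    "booktitle": "Proceedings of the International Conference on New Interfaces for Musical Expression",
    "pdf_url": "https://example.org/papers/paper.pdf",
}
tags = scholar_meta_tags(entry)
```

Tags like these only make sense on a page dedicated to a single paper, so this approach would depend on the site having one page per entry rather than a single long reference list.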

tatecarson · 2020-07-31T19:31:07Z

Hm, I will look into that. I am currently looking at the NIME website trying to figure out how exactly the proceedings page is generated. I see that this shortcode is doing some work with the bibjekyll plugin but I don't quite understand where that {{references}} shortcode is hooked up to the plugin.

alexarje · 2020-08-01T07:48:23Z

I think @cpmpercussion needs to assist you here, since he set it up.

cpmpercussion · 2020-08-01T13:12:17Z

@tatecarson, that bibliography layout is the template for each entry in any of Jekyll-scholar's reference lists. The "reference" object is the formatted entry, e.g.:

Tate Carson. 2019. Mesh Garden: A creative-based musical game for participatory musical performance. Proceedings of the International Conference on New Interfaces for Musical Expression, UFRGS, pp. 339–342. http://doi.org/10.5281/zenodo.3672986

In the archives.md page, the tag {% bibliography --file nime_papers %} actually gets Jekyll-scholar to generate the big reference list.

I guess it hits this in Jekyll-scholar.

cpmpercussion added the "enhancement" (New feature or request) label · Jul 21, 2020

dimitriaatos mentioned this issue · Nov 22, 2020: Fixing searching bug on proceedings page #25 (Merged)

cpmpercussion mentioned this issue · Jun 30, 2023: use datapage_gen for generating entry pages #45 (Merged)
