Paper indexing feature #4

Open
tatecarson opened this issue Jul 20, 2020 · 13 comments
Labels
enhancement New feature or request

Comments

@tatecarson

tatecarson commented Jul 20, 2020

Hello,

I was just checking out the NIME Publication ecosystem workshop and I thought I would post an idea for feedback.

I often find myself wanting to search the entire proceedings to see whether something has been mentioned before, but I can usually only do this with papers that are on Google Scholar or that I have on my computer. I know that you can search the titles on the proceedings page, but it can be very slow and it's just not an ideal way to do research.

It would also be great if there was some way of searching the text of these documents from one place. I think this would make the archive much more meaningful.

I am not sure how difficult it would be to do this. I think you could generate an index of all of the papers offline and then have that be searchable and somehow linked to each PDF? I am interested in helping implement this but I do not know how to do each part. I am also not sure if it is even something that people need or want.
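The offline-index idea could be sketched roughly as follows. This is a minimal illustration, not an existing NIME tool: it assumes the text of each paper has already been extracted, and the function names and sample file paths are hypothetical. It builds a word-to-PDF mapping that can be saved as JSON and queried without a server-side database.

```python
import json
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(papers):
    """Map each word to the sorted list of PDF URLs whose text contains it.

    `papers` is a dict of {pdf_url: extracted_text}.
    """
    index = defaultdict(set)
    for pdf_url, text in papers.items():
        for token in tokenize(text):
            index[token].add(pdf_url)
    return {word: sorted(urls) for word, urls in index.items()}


def search(index, query):
    """Return the PDF URLs that contain every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        hits &= set(index.get(token, []))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical sample data; real input would be text extracted from the PDFs.
    papers = {
        "papers/example_a.pdf": "A mobile interface for participatory performance",
        "papers/example_b.pdf": "Gesture mapping for a mobile instrument",
    }
    index = build_index(papers)
    # The index is plain JSON-serialisable data, so it could be built offline.
    print(json.dumps(search(index, "mobile performance")))
```

Because each index entry stores the PDF path, every search hit links straight back to its paper.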

Thanks, looking forward to hearing your thoughts.

@cpmpercussion
Collaborator

Hi Tate,

This is an awesome idea. The searchability of the text of the NIME proceedings is actually pretty low!

One idea would be to do some text analysis by scraping the text out of every PDF. I have a few ideas about how to do this, but it might have to happen offline (e.g., with language analysis tools in Python).

It actually sounds like it would be a good project. I might remember a few ideas about how we could do it sometime during NIME this week!

@cpmpercussion cpmpercussion added the enhancement New feature or request label Jul 21, 2020
@alexarje
Contributor

I did text mining on the entire archive back in 2013 (see this paper). It is easy to get the text out using pdftotext, so I guess we could do that annually as part of the archiving step and make it available somehow?

Does anyone want to help?

@tatecarson
Author

I would love to help, but I am not so clear on how to link the text scraped from the PDFs to the website. I can do the pdftotext portion though. I have also used ocrmypdf previously and it has worked well. I'm not sure whether older NIME papers might require OCR.

I think it would be a good idea to look for a model of a proceedings that allows searching. I will look for that and see how it looks on the frontend. Maybe someone else has an idea.

@alexarje
Contributor

There is no need for OCR, although the PDF quality of some of the early conferences is a bit sketchy.

Great if you can look for some examples!

@tatecarson
Author

> I did text mining on the entire archive back in 2013 (see this paper). It is easy to get the text out using pdftotext, so I guess we could do that annually as part of the archiving step and make it available somehow?
>
> Does anyone want to help?

How did you automate this process? I'm having trouble figuring out that part.
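One way the batch extraction might be automated is sketched below. It is not the script used for the 2013 text mining; it only assumes that the `pdftotext` command-line tool is installed and that the PDFs sit under one directory with unique file names. The function names and directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def text_path_for(pdf_path, out_dir):
    """Return the .txt path that mirrors a PDF's file name inside out_dir."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def extract_all(pdf_dir, out_dir):
    """Run pdftotext on every PDF under pdf_dir, skipping files already converted."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for pdf in sorted(Path(pdf_dir).rglob("*.pdf")):
        target = text_path_for(pdf, out_dir)
        if target.exists():
            continue  # already extracted in a previous run
        # pdftotext takes the input PDF and the output text file as arguments.
        result = subprocess.run(["pdftotext", str(pdf), str(target)])
        if result.returncode == 0:
            converted.append(target)
    return converted
```

Since existing text files are skipped, the same script could be rerun each year and would only process the newly added papers.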

After that, we can use something like elasticlunr to create a searchable front end. I am a little concerned that it will be too large to work well.

@tatecarson
Author

Mini-conf actually seems like a pretty good solution to this problem. Are you all looking at adopting some of its features? I saw it mentioned in another issue. It looks all around really great.

It doesn't do full-text searching, but after some research, full-text search seems like it would be a little difficult to do, especially with a static site. A search over titles and keywords is much better than no search at all, though.
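A title-and-keyword search of this kind can be quite small. The sketch below is not how Mini-conf implements it; it assumes each paper is available as a record with a `title` string and a `keywords` list (hypothetical field names) and does plain substring matching.

```python
def matches(entry, query):
    """True if every query word is a substring of the entry's title or keywords."""
    fields = [entry.get("title", "")] + list(entry.get("keywords", []))
    haystack = " ".join(fields).lower()
    return all(word in haystack for word in query.lower().split())


def search_entries(entries, query):
    """Return the entries matching all query words (an empty query matches everything)."""
    return [entry for entry in entries if matches(entry, query)]
```

On a static site the same filtering would run in the browser over a metadata file, which stays far smaller than a full-text index.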

@alexarje
Contributor

Yes, and I like the network visualization they have there: http://www.mini-conf.org/paper_vis.html. How would this possibly work on nime.org?

@alexarje
Contributor

I just wonder whether any of this would help in solving the Google Scholar issue as well? @tatecarson, if you are interested in testing this out, that would be great!

@tatecarson
Author

I can look into how to combine this with the current NIME site. Could you describe more about the Google Scholar problem? Is it just that the articles are not indexing?

@alexarje
Contributor

> Could you describe more about the Google Scholar problem? Is it just that the articles are not indexing?

It is difficult to know exactly what is wrong, but it appears that only/mainly articles that have been self-archived in institutional repositories end up in Google Scholar. So there may be something wrong somewhere... We used to use the papercite plugin for Wordpress, which has a way of creating metadata that works well with Google Scholar. But after we changed to the new web page, we have had problems (I think). There is a little more info in the cookbook about this.
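For reference, Google Scholar's inclusion guidelines describe Highwire Press style `<meta>` tags such as `citation_title`, `citation_author`, `citation_publication_date`, and `citation_pdf_url`. Whether missing tags are the actual cause here is not established in this thread, so the following is only a sketch of how such tags could be generated per paper. The `entry` keys and the example URL are hypothetical.

```python
from html import escape


def scholar_meta_tags(entry):
    """Build Highwire Press style <meta> tags for one paper.

    The `entry` keys (title, authors, year, pdf_url) are hypothetical names
    chosen for this sketch, not fields of the NIME site's data.
    """
    pairs = [("citation_title", entry["title"])]
    # One citation_author tag per author, in author order.
    pairs += [("citation_author", author) for author in entry["authors"]]
    pairs.append(("citation_publication_date", str(entry["year"])))
    pairs.append(("citation_pdf_url", entry["pdf_url"]))
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in pairs
    )


if __name__ == "__main__":
    print(scholar_meta_tags({
        "title": "An Example Paper",
        "authors": ["First Author", "Second Author"],
        "year": 2019,
        "pdf_url": "https://example.org/paper.pdf",  # placeholder URL
    }))
```

The tags would need to appear in the `<head>` of a separate HTML page for each paper, which is a change to the site layout rather than to the bibliography file.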

@tatecarson
Author

Hm, I will look into that. I am currently looking at the NIME website trying to figure out how exactly the proceedings page is generated. I see that this shortcode is doing some work with the bibjekyll plugin but I don't quite understand where that {{references}} shortcode is hooked up to the plugin.

@alexarje
Contributor

alexarje commented Aug 1, 2020

I think @cpmpercussion needs to assist you here, since he set it up.

@cpmpercussion
Collaborator

@tatecarson, that bibliography layout is the template for each entry in any of Jekyll-scholar's reference lists. The "reference" object is the formatted entry, e.g.:

Tate Carson. 2019. Mesh Garden: A creative-based musical game for participatory musical performance. Proceedings of the International Conference on New Interfaces for Musical Expression, UFRGS, pp. 339–342. http://doi.org/10.5281/zenodo.3672986

In the archives.md page, the tag {% bibliography --file nime_papers %} actually gets Jekyll-scholar to generate the big reference list.

I guess it hits this in Jekyll-scholar.


3 participants