Paper indexing feature #4
Comments
Hi Tate, this is an awesome idea. The searchability of the text of the NIME proceedings is actually pretty low! One idea would be to do some text analysis by scraping the text out of every PDF. I have a few ideas about how to do this, but it might have to happen offline (e.g., with language analysis tools in Python). It actually sounds like it would be a good project. I might remember a few ideas about how we could do it sometime during NIME this week!
I did text mining on the entire archive back in 2013 (see this paper). It is easy to get the text out using pdftotext, so I guess we could do that annually as part of the archiving step and make it available somehow? Does anyone want to help?
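For anyone picking this up, here is a minimal sketch of what an annual extraction step might look like. It assumes the `pdftotext` command-line tool is installed and on the PATH. The directory layout, the function names, and the example file name are hypothetical and not taken from the NIME archive.

```python
import subprocess
from pathlib import Path


def text_path_for(pdf_path: Path, out_dir: Path) -> Path:
    """Map a PDF such as papers/nime2013_001.pdf (hypothetical name) to out_dir/nime2013_001.txt."""
    return out_dir / (pdf_path.stem + ".txt")


def build_command(pdf_path: Path, txt_path: Path) -> list[str]:
    # pdftotext usage: pdftotext [options] PDF-file [text-file]
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def extract_all(pdf_dir: Path, out_dir: Path) -> list[Path]:
    """Convert every PDF in pdf_dir, skipping files that were already converted."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        txt_path = text_path_for(pdf_path, out_dir)
        if txt_path.exists():
            continue  # already extracted in an earlier run
        subprocess.run(build_command(pdf_path, txt_path), check=True)
        written.append(txt_path)
    return written
```

Because files that already have a `.txt` counterpart are skipped, rerunning this each year would only process the newly added papers.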
I would love to help but am not so clear on how to link the text scraped from the pdfs to the website. I can do the pdftotext portion though. I have also used ocrmypdf previously and it has worked well. I'm not sure if older NIME papers might require OCR. I think it would be a good idea to look for a model of a proceedings that allows searching. I will look for that and see how it looks on the frontend. Maybe someone else has an idea.
There is no need for OCR, although the PDF quality of some of the early conferences is a bit sketchy. Great if you can look for some examples!
How did you automate this process? I'm having trouble figuring out that part. After that, we can use something like elasticlunr to create a searchable front end. I am a little concerned that it will be too large to work well.
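One way to put a number on the size concern before committing to a client-side index is to measure the JSON that the browser would have to download. The sketch below is only a rough proxy: it measures the raw document records, not a serialized elasticlunr index, which would differ in size. The record fields (`title`, `url`, `text`) and the `max_chars` truncation option are assumptions made for illustration.

```python
import gzip
import json
from typing import Optional


def build_documents(papers: list[dict], max_chars: Optional[int] = None) -> list[dict]:
    """Turn extracted papers into flat records that a client-side index could be built from."""
    docs = []
    for i, paper in enumerate(papers):
        # Optionally keep only the first max_chars characters to shrink the payload.
        body = paper["text"] if max_chars is None else paper["text"][:max_chars]
        docs.append({"id": i, "title": paper["title"], "url": paper["url"], "body": body})
    return docs


def payload_sizes(docs: list[dict]) -> tuple[int, int]:
    """Return (raw_bytes, gzipped_bytes) of the JSON payload."""
    raw = json.dumps(docs, ensure_ascii=False).encode("utf-8")
    return len(raw), len(gzip.compress(raw))
```

Comparing `payload_sizes(build_documents(papers))` with `payload_sizes(build_documents(papers, max_chars=2000))` would show how much is saved by indexing only the opening of each paper instead of the full text.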
Mini-conf actually seems like a pretty good solution to this problem. Are you all looking at adopting some of its features? I saw it mentioned in another issue. It looks all around really great. It doesn't do full-text searching but after some research, it seems like this would be a little difficult to do, especially with a static site. A search with titles and keywords is much better than no search at all though.
Yes, and I like the network visualization they have there: http://www.mini-conf.org/paper_vis.html. How would this possibly work on nime.org?
I just wonder whether any of this would help in solving the Google Scholar issue as well? @tatecarson, if you are interested in testing this out, that would be great!
I can look into how to combine this with the current NIME site. Could you describe more about the Google Scholar problem? Is it just that the articles are not indexing?
It is difficult to know exactly what is wrong, but it appears that only/mainly articles that have been self-archived in institutional repositories end up in Google Scholar. So there may be something wrong somewhere... We used to use the papercite plugin for Wordpress, which has a way of creating metadata that works well with Google Scholar. But after we changed to the new web page, we have had problems (I think). There is a little more info in the cookbook about this.
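The thread does not establish what is actually going wrong, so the following only illustrates the kind of per-paper metadata being discussed. Google Scholar's inclusion guidelines describe `citation_*` meta tags placed in the head of each paper's page; a minimal sketch of generating them from a bibliography entry could look like this. The entry fields (`title`, `authors`, `year`, `booktitle`, `pdf_url`) and the function name are hypothetical, and this is not the papercite or Jekyll-scholar implementation.

```python
from html import escape


def scholar_meta_tags(entry: dict) -> str:
    """Render Google Scholar style <meta> tags for one paper."""
    tags = [("citation_title", entry["title"])]
    # One citation_author tag per author, in order.
    tags += [("citation_author", author) for author in entry["authors"]]
    tags.append(("citation_publication_date", str(entry["year"])))
    tags.append(("citation_conference_title", entry["booktitle"]))
    tags.append(("citation_pdf_url", entry["pdf_url"]))
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in tags
    )
```

Whether missing tags of this kind are the cause of the indexing problem would still need to be checked against the pages the current site generates.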
Hm, I will look into that. I am currently looking at the NIME website trying to figure out how exactly the proceedings page is generated. I see that this shortcode is doing some work with the bibjekyll plugin but I don't quite understand where that |
I think @cpmpercussion needs to assist you here, since he set it up.
@tatecarson, that bibliography layout is the template for each entry in any of Jekyll-scholar's reference lists. The "reference" object is the formatted entry, e.g.:
In the I guess it hits this in Jekyll-scholar.
Hello,
I was just checking out the NIME Publication ecosystem workshop and I thought I would post an idea for feedback.
I am always finding myself wanting to search the entire proceedings to see if something has been mentioned before and usually can only do this with papers that are on Google Scholar, or ones I have on my computer. I know that you can do a search of the titles on the proceedings page but it can be very slow and it's just not an ideal way to do research.
It would also be great if there was some way of searching the text of these documents from one place. I think this would make the archive much more meaningful.
I am not sure how difficult it would be to do this. I think you could generate an index of all of the papers offline and then have that be searchable and somehow linked to each PDF? I am interested in helping implement this but I do not know how to do each part. I am also not sure if it is even something that people need or want.
Thanks, looking forward to hearing your thoughts.
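To make the idea of an offline index linked to each PDF concrete, here is a minimal sketch of an inverted index: a mapping from each word to the set of PDFs whose extracted text contains it. It is a toy illustration under stated assumptions, not a proposal for the final design. The function names and the example URLs are hypothetical, and a real index would likely also need stemming, stop words, and ranking.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(papers: dict[str, str]) -> dict[str, set[str]]:
    """papers maps a PDF URL to its extracted text; the result maps each word to the PDFs containing it."""
    index = defaultdict(set)
    for url, text in papers.items():
        for word in tokenize(text):
            index[word].add(url)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return the PDFs that contain every word of the query, sorted by URL."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Hypothetical example data: two papers keyed by their PDF location.
papers = {
    "2013/a.pdf": "Gesture sensing with accelerometers",
    "2014/b.pdf": "Granular synthesis and gesture mapping",
}
index = build_index(papers)
print(search(index, "gesture synthesis"))
```

Since every hit is a PDF URL, the search results link straight back to the archived papers, and the index itself can be rebuilt offline whenever new proceedings are added.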