Parse Papers on Local Machines #104

Open
danich1 opened this issue Jun 11, 2021 · 8 comments

@danich1
Contributor

danich1 commented Jun 11, 2021

A user requested a feature that would let them parse their own papers without first having to post them on bioRxiv or medRxiv. I was thinking of the following steps to accomplish this request:

  • Use docker to containerize our word2vec model that will generate document vectors
  • Send the document vector to our API server and have results sent back to the user
  • Could think about having a local rendering of the front-end for users to view results, but this is up to interpretation as I don't know the best way to handle the server response
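
To make the first two steps concrete, here is a minimal sketch of how a document vector could be built locally before being sent to the API server. It assumes the document vector is a plain average of the word2vec vectors of the words in the paper, which this thread does not confirm. The `document_vector` helper and the toy two-dimensional vocabulary are hypothetical and stand in for the real model.

```python
from typing import Dict, List, Optional, Sequence


def document_vector(
    tokens: Sequence[str], word_vectors: Dict[str, List[float]]
) -> Optional[List[float]]:
    """Average the vectors of all tokens found in the vocabulary.

    Assumption: averaging is how document vectors are formed; the real
    model may weight or filter words differently.
    Returns None when no token is in the vocabulary.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy stand-in for a trained word2vec vocabulary (hypothetical values).
toy_vectors = {"gene": [1.0, 0.0], "protein": [0.0, 1.0]}

# Out-of-vocabulary words are skipped rather than treated as zeros.
doc_vec = document_vector(["gene", "protein", "unknownword"], toy_vectors)
print(doc_vec)  # [0.5, 0.5]
```

Only `doc_vec` would need to leave the user's machine in this setup, so the paper's text itself is never uploaded.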
@dongbohu
Contributor

I wonder whether the front end can add a new feature to allow users to submit their own files. Once the backend receives the file, it parses it and returns the closest neighbors, just like it does with an input preprint.

@danich1
Copy link
Contributor Author

danich1 commented Jun 11, 2021

> I wonder whether the front end can add a new feature to allow users to submit their own files. Once the backend receives the file, it parses it and returns the closest neighbors, just like it does with an input preprint.

My only concern is that PDFs could contain malicious code that could potentially take down our server. We could implement the check for this mentioned in the blog, but an alternative solution is to have users generate the vectors locally.

@dongbohu
Contributor

I think it also depends on how popular this feature could be. If it's just for a few users, asking them to submit the doc vectors is okay. If many users want it, we probably should figure out how to minimize their efforts as much as possible (for example, allow users to submit a PDF file directly, but either the frontend or backend will do some checking/cleaning before parsing it).
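
As a rough idea of what "some checking" could look like before a PDF reaches the parser, here is a small sketch. It is only a heuristic pre-filter and not a real security control: PDF names can be obfuscated, so a file that passes is not proven safe. The `precheck_pdf` function, the marker list, and the size limit are all hypothetical choices for illustration.

```python
from typing import List

# PDF name objects commonly associated with active content.
# Heuristic only: names can be hex-escaped, which this does not catch.
SUSPICIOUS_MARKERS = (b"/JavaScript", b"/JS", b"/Launch", b"/OpenAction")


def precheck_pdf(data: bytes, max_bytes: int = 10_000_000) -> List[str]:
    """Return a list of reasons to reject the upload (empty if none found)."""
    problems = []
    if len(data) > max_bytes:
        problems.append("file too large")
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")
    for marker in SUSPICIOUS_MARKERS:
        if marker in data:
            problems.append("contains " + marker.decode("ascii"))
    return problems


clean = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
risky = b"%PDF-1.4\n<< /OpenAction << /S /JavaScript /JS (app.alert(1)) >> >>\n"

print(precheck_pdf(clean))  # []
print(precheck_pdf(risky))
```

A check like this could run on either the frontend or the backend, but it would complement sandboxed parsing rather than replace it.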

@cgreene
Member

cgreene commented Jun 11, 2021

What if we provided source code + some environment + an example notebook that would do the embeddings? Then folks could access a new API endpoint to get the results for those embeddings.
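
For illustration, the work behind such an endpoint could be as small as a nearest-neighbor lookup over stored preprint vectors. The sketch below assumes cosine similarity as the distance measure, which is not stated in this thread. The `closest_neighbors` function and the toy corpus are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_neighbors(
    query: Sequence[float], corpus: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k corpus entries most similar to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-in for the stored preprint vectors (hypothetical values).
toy_corpus = {
    "preprint_a": [1.0, 0.0],
    "preprint_b": [0.0, 1.0],
    "preprint_c": [1.0, 1.0],
}

top = closest_neighbors([0.9, 0.1], toy_corpus, k=2)
print([name for name, _ in top])  # ['preprint_a', 'preprint_c']
```

A real corpus would be far too large for a linear scan like this, so an index structure would likely be needed in practice.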

@dongbohu
Contributor

> What if we provided source code + some environment + an example notebook that would do the embeddings? Then folks could access a new API endpoint to get the results for those embeddings.

Sure. We can use this approach now. If many users like it and want to simplify the procedure, we'll figure out how.

@cgreene
Member

cgreene commented Jun 11, 2021

I just realized that it will make model versioning annoying. What if we created a notebook that calculated the words and their counts for each document, so that those words + counts could be uploaded and the results calculated from them?

It just feels like plain text is going to be easier to handle with less risk than PDFs and running the parsing library.
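
A notebook cell for the words + counts idea might look roughly like the sketch below. The tokenization rule (lowercase runs of letters and digits) and the JSON payload shape are assumptions; the real pipeline may tokenize differently, and the upload format has not been decided in this thread.

```python
import json
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lowercase alphanumeric tokens in plain text.

    Assumption: this simple regex tokenizer stands in for whatever
    tokenization the backend model actually expects.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


counts = word_counts("Gene expression regulates gene networks.")

# Plain JSON is easy to inspect and carries no executable content.
payload = json.dumps(counts, sort_keys=True)
print(payload)
# {"expression": 1, "gene": 2, "networks": 1, "regulates": 1}
```

Because only words and counts are uploaded, the server can apply whichever model version is current, which avoids the versioning problem of shipping the model to users.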

@dongbohu
Contributor

> I just realized that it will make model versioning annoying.

Do you mean the word model file that is used by the backend? How often is it supposed to be updated?

> It just feels like plain text is going to be easier to handle with less risk than PDFs and running the parsing library.

True. I was thinking earlier that we could require users to convert their documents, whatever format they are in, to plain text before allowing them to submit them.

@vincerubinetti
Collaborator

Per our post-scrum chat about this:

It sounds like a good plan is to have the frontend allow a PDF/TXT/Word document upload, use a JS library to convert it to plain text, then send that plain text to a new backend API endpoint. Once it reaches the backend, the process continues as normal, as if the plain text had come from the backend Python PDF parser.
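
On the backend side, the new endpoint would probably want to normalize whatever text arrives before handing it to the existing pipeline. Here is a minimal sketch of that step, assuming UTF-8 input; the `clean_uploaded_text` function and the length limit are hypothetical and not part of the current API.

```python
import re

MAX_CHARS = 500_000  # hypothetical upper bound on accepted document length


def clean_uploaded_text(body: bytes, max_chars: int = MAX_CHARS) -> str:
    """Decode and normalize plain text received from the frontend."""
    # Replace undecodable bytes instead of failing on odd encodings.
    text = body.decode("utf-8", errors="replace")
    if len(text) > max_chars:
        raise ValueError("document too long")
    # Replace control characters (other than tab and newline) with spaces.
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", " ", text)
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_uploaded_text(b"Title\r\n\r\nSome\x00 body   text\n")
print(cleaned)  # Title Some body text
```

From here the cleaned string could follow the same path as text produced by the backend PDF parser.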

There are potential hazards to letting users upload pdfs. However, if we're doing it on the frontend, there's almost no risk. The user could crash their browser, but that wouldn't affect anyone else. Doing PDF parsing on the backend is more perilous because it could mess the server up for others.

Also, since this probably isn't the way we want to recommend people use the tool, I think we should sort of "hide" this feature in the frontend. I think a good way to do that is to let the user drag files onto the search box from their computer to upload. There would be no indication the user could do this unless they were instructed to do it.

4 participants