Parse Papers on Local Machines #104

danich1 · 2021-06-11T17:16:04Z

A user requested a feature that would let them parse their own papers without having to post them on bioRxiv or medRxiv. I was thinking about the following steps to accomplish this request:

1. Use Docker to containerize our word2vec model that will generate document vectors
2. Send the document vector to our API server and have results sent back to the user
3. Could think about having a local rendering of the front-end for users to view results, but this is up to interpretation as I don't know the best way to handle the server response
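A minimal sketch of the local vector-generation step, assuming the document vector is the plain average of the word vectors for the tokens found in the text. That averaging scheme, the `TOY_VECTORS` table, and the function names are illustrative assumptions; the real word2vec model and its tokenization are not described in this thread.

```python
import re
from typing import Dict, List, Optional

# Hypothetical toy embeddings standing in for the real word2vec model.
TOY_VECTORS: Dict[str, List[float]] = {
    "gene": [1.0, 0.0],
    "protein": [0.0, 1.0],
    "cell": [1.0, 1.0],
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(
    text: str, vectors: Dict[str, List[float]]
) -> Optional[List[float]]:
    """Average the vectors of all known tokens; None if no token is known."""
    known = [vectors[token] for token in tokenize(text) if token in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
```

The resulting list of floats could then be serialized (for example as JSON) and sent to the API server. The endpoint and payload shape are not specified here, so they are left out of the sketch.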

dongbohu · 2021-06-11T17:47:32Z

I wonder whether the front end can add a new feature to allow users to submit their own files. Once the backend receives the file, it parses it and returns the closest neighbors, just like it does with an input preprint.
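As a rough illustration of "returns the closest neighbors", the sketch below ranks stored document vectors against a query vector. Cosine similarity, the `corpus` dictionary, and the function names are assumptions for illustration; the backend's actual distance metric and storage are not stated in this thread.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_neighbors(
    query: List[float], corpus: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k corpus entries most similar to the query, best first."""
    scored = [
        (doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A brute-force scan like this is only practical for small collections; a real service would more likely use a prebuilt index.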

danich1 · 2021-06-11T18:07:12Z

> I wonder whether the front end can add a new feature to allow users to submit their own files. Once the backend receives the file, it parses it and returns the closest neighbors, just like it did with an input preprint.

My only concern is that PDFs could contain malicious code that could take down our server. We could implement the check for this mentioned in the blog, but one alternative is to have users generate the vectors locally.

dongbohu · 2021-06-11T18:40:41Z

I think it also depends on how popular this feature could be. If it's just for a few users, asking them to submit the doc vectors is okay. If many users want it, we probably should figure out how to minimize their efforts as much as possible (for example, allow users to submit a PDF file directly, but either the frontend or backend will do some checking/cleaning before parsing it).

cgreene · 2021-06-11T18:47:59Z

What if we provided source code + some environment + an example notebook that would do the embeddings. Then folks could access a new API endpoint to get the results for that.

dongbohu · 2021-06-11T18:54:00Z

> What if we provided source code + some environment + an example notebook that would do the embeddings. Then folks could access a new API endpoint to get the results for that.

Sure. We can use this approach now. If many users like it and want to simplify the procedure, we'll figure out how.

cgreene · 2021-06-11T18:58:21Z

I just realized that it will make model versioning annoying. What if we created a notebook that calculated the words and their counts for each document, and those words + counts could then be uploaded and used for the calculation?

It just feels like plain text is going to be easier to handle with less risk than PDFs and running the parsing library.
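A small sketch of the words + counts idea, assuming a simple lowercase alphanumeric tokenizer and a JSON payload with a `counts` key. Both are hypothetical choices, since neither the tokenization nor the upload format is specified in this thread.

```python
import json
import re
from collections import Counter
from typing import Dict


def word_counts(text: str) -> Dict[str, int]:
    """Count lowercase alphanumeric tokens in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return dict(Counter(tokens))


def to_upload_payload(text: str) -> str:
    """Serialize the counts as JSON (hypothetical payload shape)."""
    return json.dumps({"counts": word_counts(text)}, sort_keys=True)
```

Because only words and counts leave the user's machine, the server could apply whichever model version it currently has, which is the versioning benefit raised above.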

dongbohu · 2021-06-11T19:06:13Z

> I just realized that it will make model versioning annoying.

You mean the word modeling file that was used by the backend? How often is it supposed to be updated?

> It just feels like plain text is going to be easier to handle with less risk than PDFs and running the parsing library.

True. I was thinking earlier that we could require users to convert their documents, whatever format they are in, to plain text before allowing them to submit.

vincerubinetti · 2021-06-11T20:05:17Z

Per our post-scrum chat about this:

It sounds like a good plan is to have the frontend allow a PDF/TXT/Word document upload, use a JS library to convert it to plain text, then send that to a new backend API endpoint as plain text. Once it reaches the backend, the process continues as normal, as if the plain text had come from the backend Python PDF parser.
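On the backend side, a new plain-text endpoint would probably want some light validation before handing the text to the usual pipeline. The sketch below is one hypothetical way to do that; the size limit, the `InvalidUpload` exception, and the function name are assumptions and not part of the existing backend.

```python
MAX_BYTES = 1_000_000  # hypothetical upper bound on upload size


class InvalidUpload(ValueError):
    """Raised when an uploaded payload fails basic validation."""


def clean_plaintext(raw: bytes, max_bytes: int = MAX_BYTES) -> str:
    """Validate an uploaded plain-text payload and return the cleaned text."""
    if len(raw) > max_bytes:
        raise InvalidUpload("payload too large")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise InvalidUpload("payload is not valid UTF-8") from exc
    # Keep tabs and line breaks; drop all other non-printable characters.
    text = "".join(ch for ch in text if ch in "\t\n\r" or ch.isprintable())
    text = text.strip()
    if not text:
        raise InvalidUpload("payload is empty")
    return text
```

The cleaned string could then be passed to the same step that currently consumes the output of the Python PDF parser.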

There are potential hazards to letting users upload PDFs. However, if we're doing the parsing on the frontend, there's almost no risk. The user could crash their own browser, but that wouldn't affect anyone else. Doing PDF parsing on the backend is more perilous because it could bring the server down for others.

Also, since this probably isn't the way we want to recommend people use the tool, I think we should sort of "hide" this feature in the frontend. I think a good way to do that is to let the user drag files onto the search box from their computer to upload. There would be no indication the user could do this unless they were instructed to do it.

danich1 assigned vincerubinetti and dongbohu Jun 11, 2021
