
Problems with parsing quotes #2

Open
NikhilPr95 opened this issue Jul 4, 2016 · 2 comments

NikhilPr95 commented Jul 4, 2016

There are four types of problems that come up when parsing quotes:

1. It very frequently splits a quote and the rest of its sentence into two separate sentences.

   E.g. "So what?" said Harry.

   Here ' "So what?" ' and ' said Harry. ' are parsed as two separate sentences, rather than one.

2. Similar to the first, it splits a quote and the rest of the sentence into two sentences, but here the first word after the quote is a character identified by a character id.

   E.g. "What is?" George demanded.

   This is parsed as two sentences, ' "What is?" ' and ' George demanded. '

3. It concatenates two separate quotes that belong in different sentences into the same sentence.

   E.g. "How are you?" "I'm fine, thank you", he replied.

   Here, although ' "How are you?" ' is a separate sentence, it is treated as part of the second sentence.

4. It takes the opening quote ' " ' of a piece of dialogue and treats it as the last token of the previous sentence.

   E.g. There was a big blue shape in the sky. " What is it? " Asked Beth.

   It parses these two individual sentences as ' There was a big blue shape in the sky. " ' and ' What is it ? " Asked Beth. '

   However, the 'in quotes' value for 'What' here is 'true', which makes these cases easy to detect.

I found these errors and corrected them by hard-coding fixes in my own program:

- For 1 and 2: checking whether the first word of a sentence is either in lower case or a character, and concatenating the sentences accordingly.
- For 3: checking for every instance of consecutive quotes and splitting the sentence there.
- For 4: checking whether the first word of a sentence is 'in quotes', the token before it (the last token of the previous sentence) is a double quote, and the token before that is a period, and correcting accordingly.
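A minimal Python sketch of the checks described above. It is an illustration only, not the code from the program mentioned here or from this repository. The `Token` class, its `in_quote` and `is_character` fields, the `QUOTE_MARKS` set, and all function names are hypothetical stand-ins for whatever the tokens file actually provides. The merge step also assumes that the previous sentence must end in a quote mark, a guard that the description above does not state.

```python
from dataclasses import dataclass
from typing import List

# Assumed spellings of a double-quote token; adjust to the tokenizer's output.
QUOTE_MARKS = {'"', "``", "''"}


@dataclass
class Token:
    """Hypothetical stand-in for one row of the tokens file."""
    word: str
    in_quote: bool = False      # the 'in quotes' flag
    is_character: bool = False  # True if the token carries a character id


Sentence = List[Token]


def move_stray_opening_quotes(sentences: List[Sentence]) -> List[Sentence]:
    """Issue 4: an opening quote attached to the end of the previous sentence."""
    fixed = [list(s) for s in sentences]
    for prev, cur in zip(fixed, fixed[1:]):
        if (cur and cur[0].in_quote and len(prev) >= 2
                and prev[-1].word in QUOTE_MARKS and prev[-2].word == "."):
            # Move the quote mark to the front of the sentence it opens.
            cur.insert(0, prev.pop())
    return fixed


def split_consecutive_quotes(sentences: List[Sentence]) -> List[Sentence]:
    """Issue 3: a closing quote directly followed by another opening quote."""
    fixed: List[Sentence] = []
    for sent in sentences:
        current: Sentence = []
        inside = False  # toggled on every quote mark
        for i, tok in enumerate(sent):
            current.append(tok)
            if tok.word in QUOTE_MARKS:
                inside = not inside
                nxt = sent[i + 1] if i + 1 < len(sent) else None
                if not inside and nxt is not None and nxt.word in QUOTE_MARKS:
                    # A quote just closed and another opens: cut here.
                    fixed.append(current)
                    current = []
        if current:
            fixed.append(current)
    return fixed


def merge_split_quotes(sentences: List[Sentence]) -> List[Sentence]:
    """Issues 1 and 2: the rest of a sentence cut off after a quote."""
    fixed: List[Sentence] = []
    for sent in sentences:
        first = sent[0] if sent else None
        continues = first is not None and (
            first.word[:1].islower() or first.is_character
        )
        # Assumed guard: only merge when the previous sentence ends in a quote.
        if fixed and continues and fixed[-1] and fixed[-1][-1].word in QUOTE_MARKS:
            fixed[-1] = fixed[-1] + list(sent)
        else:
            fixed.append(list(sent))
    return fixed
```

One plausible order is `move_stray_opening_quotes`, then `split_consecutive_quotes`, then `merge_split_quotes`, so that each step sees quote marks already sitting in the right sentence. The sketch only repairs sentence boundaries in the token list and does nothing about the dependency trees.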

I was pleased with the results UNTIL I realised that the parser which constructs the dependency trees does so on the original 'wrong' sentences and not on my corrected ones.
This left me trying to run MaltParser itself on the affected sentences, but I found that its output is not exactly the same. I assume that your code does not use MaltParser directly and uses extra information as well.

I would really like this fixed, as I am otherwise using only the tokens document produced by running your code, and this complicates things a lot.

If you could suggest a quick fix for this, it would be appreciated as well. In the meantime, I'll try to see whether I can make the necessary changes in your code myself.

P.S.

I am very grateful for this repository, without which a project I am working on (analyzing novels) would have been much, much more difficult. Thanks!


EDIT:

I realised that issues 1 and 2 may have been left in by design; however, issues 3 and 4 remain legitimate.


NikhilPr95 commented Jul 4, 2016

EDIT 2:

Upon examining the code, it seems that at least some of these issues are caused by problems in the StanfordCoreNLP API and its parser, rather than by anything in this repository.
