Transcript parsing #2

Open
wfdd opened this issue Oct 11, 2015 · 0 comments

wfdd commented Oct 11, 2015

  • Use poppler's pdftohtml -xml to convert PDFs into XML documents
  • Use the horizontal and vertical spacing between the arbitrary-length text
    objects, which are strewn arbitrarily across the page, to figure out
    whether they continue the same word or paragraph or are instead part of a
    table
  • Intelligently collapse contiguous blank text objects on page breaks
  • Construct syntax tree to assign parts of transcript to classes?
  • Parse into Akoma Ntoso
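
A rough sketch of the first three steps, to make the spacing idea concrete. It assumes the XML has the shape pdftohtml -xml typically emits (`<page>` elements containing `<text>` elements with `top`, `left`, `width` and `height` attributes). The function names and the thresholds `v_tol`, `word_gap` and `cell_gap` are hypothetical and would need tuning against real transcripts. Blank objects are simply dropped here, which is cruder than the "intelligent" collapsing the list asks for.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class TextObj:
    top: float
    left: float
    width: float
    height: float
    text: str


def load_text_objects(xml_string):
    """Yield (page_number, [TextObj, ...]) for each <page> element."""
    root = ET.fromstring(xml_string)
    for page in root.iter("page"):
        objs = [
            TextObj(
                top=float(t.get("top")),
                left=float(t.get("left")),
                width=float(t.get("width")),
                height=float(t.get("height")),
                # itertext() also picks up text nested in <b>, <i>, etc.
                text="".join(t.itertext()),
            )
            for t in page.iter("text")
        ]
        yield int(page.get("number")), objs


def group_into_lines(objs, v_tol=2):
    """Group objects into visual lines when their `top` is within v_tol
    of the first object on the current line (assumed threshold)."""
    lines = []
    for obj in sorted(objs, key=lambda o: (o.top, o.left)):
        if lines and abs(obj.top - lines[-1][0].top) <= v_tol:
            lines[-1].append(obj)
        else:
            lines.append([obj])
    for line in lines:
        line.sort(key=lambda o: o.left)
    return lines


def split_line(line, word_gap=1, cell_gap=20):
    """Merge the objects on one line into cells based on horizontal gaps.

    gap <= word_gap  : continuation of the same word
    gap >= cell_gap  : start of a new table cell
    otherwise        : next word in the same run of text
    """
    cells = [line[0].text]
    for prev, cur in zip(line, line[1:]):
        gap = cur.left - (prev.left + prev.width)
        if gap >= cell_gap:
            cells.append(cur.text)
        elif gap <= word_gap:
            cells[-1] += cur.text
        else:
            cells[-1] += " " + cur.text
    return cells


def parse_page(objs):
    """Drop blank objects, then return one list of cells per line."""
    visible = [o for o in objs if o.text.strip()]
    return [split_line(line) for line in group_into_lines(visible)]
```

A line that comes back with a single cell is a candidate for paragraph text, while several cells on one line hint at a table row. Deciding whether consecutive lines belong to the same paragraph would need a similar comparison of vertical gaps, which is left out here.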