This repository has been archived by the owner on May 11, 2021. It is now read-only.

Nov fixes and updates #5

Merged
merged 20 commits into from
Dec 12, 2018

Conversation

iross
Member

@iross iross commented Dec 4, 2018

  • Tesseract4 has an issue where a word's bbox can take up the whole page (tesseract-ocr/tesseract#1192, "Noise characters recognized with bbox as the entire page") -- blackstack now skips those words
  • Added simple wrapper script + envvar toggle to run docker-compose setup in either classified or training mode
  • There was some potential weirdness where the list of labels was getting read from annotated docs instead of from the labels table. It would crash if you started a new model without examples of each label.
  • A few python3 fixes -- 5dcda56 is a critical one. In python3, filter() returns a lazy iterator rather than a list, so len(filter(...)) was raising a TypeError that was getting silently caught. As a result, things downstream thought all areas contained zero words, and the document- and area-level heuristics broke.
  • Added annotated page dump (outputs the pages with areas labeled by category)
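
For anyone reading along, here is a minimal sketch of the two word-level fixes described above. It is not the code from 5dcda56 or from blackstack itself: the dict-based bbox layout, the helper names (`count_matching`, `inside`, `covers_page`) and the 0.95 tolerance are all assumptions made for illustration.

```python
def count_matching(words, predicate):
    """Count the words that satisfy predicate.

    In Python 3, filter() returns a lazy iterator, so len(filter(...))
    raises TypeError. Summing over a generator works on Python 2 and 3.
    """
    return sum(1 for w in words if predicate(w))


def inside(word, area):
    """True if the word's bbox lies entirely within the area's bbox."""
    return (word["x1"] >= area["x1"] and word["y1"] >= area["y1"]
            and word["x2"] <= area["x2"] and word["y2"] <= area["y2"])


def covers_page(word, page, tolerance=0.95):
    """Hypothetical check for noise words whose bbox spans (nearly) the whole page."""
    word_area = (word["x2"] - word["x1"]) * (word["y2"] - word["y1"])
    page_area = (page["x2"] - page["x1"]) * (page["y2"] - page["y1"])
    return page_area > 0 and word_area >= tolerance * page_area


# Made-up data: one normal word, one page-sized noise word, one word outside the area.
page = {"x1": 0, "y1": 0, "x2": 1000, "y2": 1400}
area = {"x1": 100, "y1": 100, "x2": 500, "y2": 500}
words = [
    {"x1": 120, "y1": 120, "x2": 200, "y2": 140},
    {"x1": 0, "y1": 0, "x2": 1000, "y2": 1400},
    {"x1": 600, "y1": 600, "x2": 700, "y2": 620},
]

# The old pattern fails on Python 3 instead of returning a count.
try:
    len(filter(lambda w: inside(w, area), words))
    old_pattern_failed = False
except TypeError:
    old_pattern_failed = True

kept = [w for w in words if not covers_page(w, page)]
n_words_in_area = count_matching(kept, lambda w: inside(w, area))
print(old_pattern_failed, len(kept), n_words_in_area)
```

Wrapping the call as `len(list(filter(...)))` would also work. The bigger point is that a broad `except` around this kind of code can hide the TypeError and leave every area looking empty.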

@iross iross requested a review from jczaplew December 4, 2018 21:44
Member Author

iross commented Dec 4, 2018

@jczaplew Want to give it a look + run-through and let me know if there's anything else you want clarified/fixed?

@iross iross merged commit 8e0c54c into UW-xDD:master Dec 12, 2018
@iross iross deleted the nov_fixes_and_updates branch December 12, 2018 21:25