Extract last citation #59
Many thanks for this! I took your regex and applied it to the examples below:
22b0e69a67357942: The issue here seems to be that the first mention of "exit" comes in the actual citation description.
a3d5f422b5a9394c: Here, "No" isn't identified as part of the boilerplate.
0a545d5765f5de92: A slightly more complicated version of the previous example; in this case, the whole "NOTE –" line should probably be considered part of the boilerplate.
8ee3d02e3d110b8c: Here, the regex doesn't find anything, I think because of the newline (instead of space) between
8f234caa827a5d81: In this example, the final line should probably be considered part of the boilerplate but is not being captured.
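For the newline case, one possible approach is to build the pattern so that any run of whitespace (space, tab, or newline) can sit between the words of a phrase. This is only a sketch: `phrase_pattern` is a hypothetical helper, and the phrase used in the usage note is a made-up placeholder, not one taken from the actual citations or from the regex in this PR.

```python
import re


def phrase_pattern(phrase: str) -> re.Pattern:
    """Compile a case-insensitive pattern for `phrase` that tolerates
    any whitespace (including newlines) between its words.

    Hypothetical helper for illustration; not the regex used in this PR.
    """
    words = phrase.split()
    # Escape each word so punctuation is matched literally, then allow
    # one or more whitespace characters between consecutive words.
    return re.compile(r"\s+".join(re.escape(w) for w in words), re.IGNORECASE)
```

With a placeholder phrase such as "Contact the nearest office", `phrase_pattern(...).search(text)` would match whether the words are separated by a single space or broken across lines.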
Thanks for this, Jeremy! Got a little busy the last few weeks, so haven't had much time to work on this. Tomorrow I'll work on fixing the regex and getting the TSA scraper set up.
I tested this simple regex approach on 1,000 files and it seems to work fine, though there may still be edge cases it doesn't handle. After a lot of investigating, there does not seem to be any way to consistently use the PDF layouts or newlines in the text to separate out these footer statements. I initially considered training a classifier, but after doing some frequency analysis of the text, these phrases seemed consistent and not shared by the text of the citation.
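For anyone who wants to repeat that kind of check, here is a minimal sketch of one way to do a frequency analysis. It is an assumption about the general idea, not the analysis actually run here: the function names, the n-gram size, and the `min_share` threshold are all hypothetical. It counts how many documents contain each word n-gram, ignoring line breaks, so phrases that recur across most files stand out as boilerplate candidates.

```python
from collections import Counter


def ngram_document_frequency(docs, n=3):
    """Count, for each word n-gram, how many documents contain it.

    Splitting on any whitespace means newlines in the PDF text do not
    affect the counts.
    """
    counts = Counter()
    for doc in docs:
        words = doc.lower().split()
        # Use a set so each n-gram is counted at most once per document.
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(grams)
    return counts


def common_phrases(docs, n=3, min_share=0.5):
    """Return n-grams that appear in at least `min_share` of the documents."""
    counts = ngram_document_frequency(docs, n)
    threshold = min_share * len(docs)
    return sorted(gram for gram, seen in counts.items() if seen >= threshold)
```

N-grams that come back from `common_phrases` but rarely occur inside citation descriptions would be reasonable candidates for the footer regex.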