Extract last citation #59

gcappaert · 2023-05-05T01:47:28Z

I tested this simple regex approach on 1,000 files and it seems to work fine, though there may be edge cases it doesn't handle. After a lot of investigating, there does not seem to be any way to consistently use the PDF layouts or newlines in the text to separate out these footer statements. I initially considered training a classifier, but after doing some frequency analysis of the text, these footer phrases seemed consistent and did not appear in the citation text itself.
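For readers following along, the general shape of this kind of split can be sketched as below. This is only an illustration: the phrase list and the function name `split_citation` are hypothetical stand-ins, not the pattern committed in this PR.

```python
import re

# Hypothetical footer openers; the PR's actual pattern may differ.
FOOTER_PHRASES = [
    r"This inspection and exit interview",
    r"Exit briefing",
    r"The inspection was conducted",
]
FOOTER_RE = re.compile("|".join(FOOTER_PHRASES), re.IGNORECASE)


def split_citation(text):
    """Return (citation, footer), splitting at the first footer phrase."""
    match = FOOTER_RE.search(text)
    if match is None:
        # No known footer phrase: treat the whole text as the citation.
        return text.strip(), ""
    return text[: match.start()].strip(), text[match.start():].strip()
```

Returning an empty footer when nothing matches keeps the citation text intact for reports that use unfamiliar wording.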

jsvine · 2023-05-09T01:31:34Z

Many thanks for this! I took your regex, applied it to data/combined/inspections-citations.csv, and went looking for edge cases. Below are a few quirky outliers, with ///// marking where the approach is currently making the split. The list is illustrative rather than exhaustive. Some examples are quirkier than others; some are more common than others. In some cases, the fix seems straightforward; in others, more difficult.

22b0e69a67357942

Issue here seems to be that the first mention of "exit" comes in the actual citation description.

On June 29, 2020 a 1-year old female pig-tailed macaque removed a feeder that was not locked by husbandry staff and
///// exited through the 3.5 x 5.75 inch feeder opening. The primate climbed onto another enclosure where a male pig-tailed macaque pulled her left arm through the 1 x 1 inch mesh enclosure. The husbandry staff were able to separate the 2 animals and called veterinary staff immediately. The female primate sustained multiple injuries and had to have her left arm amputated.
In order to protect the health and well-being of the animals, personnel should ensure all locks are in place and secure to maintain the animals in their enclosure. It is the responsibility of the research facility to provide continued training and instruction to all personnel with sufficient frequency to fulfill the research facility’s responsibilities.
Corrections were instituted prior to the inspection on December 1, 2020 in order to prevent recurrence.
This inspection and exit interview were conducted with the facility representatives.
End Section
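One possible fix for this case, sketched under the assumption that the split is triggered by a bare "exit": require "exit" to be a whole word followed by "interview" or "briefing". The `strict` pattern below is a hypothetical tightening, not the regex from this PR, and the sample text is abbreviated from the report above.

```python
import re

text = (
    "a feeder that was not locked by husbandry staff and exited through "
    "the 3.5 x 5.75 inch feeder opening.\n"
    "This inspection and exit interview were conducted with the facility "
    "representatives."
)

# A bare "exit" also matches the verb "exited" inside the citation itself.
loose = re.compile(r"exit", re.IGNORECASE)

# Hypothetical tightening: "exit" must be a whole word followed by
# "interview" or "briefing".
strict = re.compile(r"\bexit\s+(?:interview|briefing)\b", re.IGNORECASE)

loose_hit = text[loose.search(text).start():]
strict_hit = text[strict.search(text).start():]
```

Here `loose_hit` begins at "exited through", while `strict_hit` begins at "exit interview were". The strict match still starts mid-sentence, so the split point would need to move back to the start of that sentence.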

a3d5f422b5a9394c

Here, "No" isn't identified as part of the boilerplate.

[...] No ///// exit briefing was given because the registrant was not available and did not return my phone call by the time this report
was finalized.
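A rough sketch of one way to pull "No" into the boilerplate: find the phrase, then back up to the start of the sentence that contains it. Both patterns and the name `footer_start` are hypothetical, and the sentence boundary is assumed to be ".", "!" or "?" followed by whitespace, which will misfire on abbreviations. The sample text reuses lines from the examples in this comment.

```python
import re

EXIT_RE = re.compile(r"\bexit\s+(?:interview|briefing)\b", re.IGNORECASE)
# Assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
SENTENCE_END_RE = re.compile(r"[.!?]\s+")


def footer_start(text):
    """Return the index where the sentence containing the exit phrase begins."""
    match = EXIT_RE.search(text)
    if match is None:
        return None
    start = 0
    # Keep the end of the last sentence boundary found before the phrase.
    for boundary in SENTENCE_END_RE.finditer(text, 0, match.start()):
        start = boundary.end()
    return start


text = (
    "Correct by 9/21/17. No exit briefing was given because the registrant "
    "was not available."
)
start = footer_start(text)
citation, footer = text[:start].strip(), text[start:]
```

With this sample, `footer` begins at "No exit briefing" rather than at "exit briefing".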

0a545d5765f5de92

A slightly more complicated version of the previous example; in this case, the whole "NOTE –" line should probably be considered as part of the boilerplate.

[...] The travelling unit of the licensee that was inspected at the Franklin Fair in Greenfield, MA on 9/7/17 needs to have a copy of the written Program of Veterinary Care.
Correct by 9/21/17.
NOTE – Inspection conducted at the Franklin Fairgrounds in Greenfield, MA with representative of the Licensee.
///// Exit briefing held 9/7/17 on-site with representative of the Licensee and in person with the Licensee at the North Haven, CT Fairgrounds on 9/8/17.

8ee3d02e3d110b8c

Here, the regex doesn't find anything, I think because of the newline (instead of a space) between "exit" and "interview".

** In the muntjac's kennel there was an abundance of roaches living in its pine straw bedding. There are many
things that can be done to minimize pests infestation such as increasing the frequency of cleaning and changing out
bedding in the animals enclosure, modifying the pest control program (consulting attending veterinarian before
introducing any chemical treatments and/or using a different bedding that does not provide favorable living
conditions for the pests. Whichever methods are considered and implemented there shall be a safe and effective
program established and maintained for the control of insects and ectoparasites. Correct by October 14, 2015
The inspection was conducted with the Executive Director and the Conservation Program Coordinator. The exit
interview was conducted with the Conservation Program Coordinator.
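If the newline is indeed the cause, matching any run of whitespace instead of a literal space should be enough. A minimal sketch, with both patterns as illustrative assumptions rather than the PR's regex:

```python
import re

text = (
    "The inspection was conducted with the Executive Director and the "
    "Conservation Program Coordinator. The exit\n"
    "interview was conducted with the Conservation Program Coordinator."
)

# A literal space fails when the extracted text wraps right after "exit".
literal_space = re.compile(r"exit interview", re.IGNORECASE)

# \s+ accepts any run of whitespace, including a newline.
any_whitespace = re.compile(r"exit\s+interview", re.IGNORECASE)

missed = literal_space.search(text)
found = any_whitespace.search(text)
```

An alternative would be to collapse whitespace in the text before matching, but that discards the line breaks, which may matter elsewhere in the pipeline.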

8f234caa827a5d81

In this example, the final line should probably be considered as part of the boilerplate but is not being captured.

Three zebras were housed in a pasture which did not have a perimeter fence. The primary enclosure appeared to
be eight foot high fencing on three sides and the fourth was a lower fence bordering a paddock. The paddock also
housed a young zebra and Nilgai antelope, and at least one side of that enclosure did not have a perimeter fence.
The eight foot fence also went around the front of the property. Outdoor housing facilities must be enclosed by a
perimeter fence at least eight feet high for potentially dangerous animals or at least six feet high for other animals
and must be at least three feet from the primary enclosure. Fencing that does not meet these requirements must be
approved in writing by the Administrator.
The owner allowed a partial inspection of the facility. Noncompliant items were identified and discussed during the
tour of the facility.
The inspection was conducted with the VMO, two Florida Fish and Wildlife Investigators, the facility owner and
another resident at the facility.

gcappaert · 2023-05-16T19:54:22Z

Thanks for this Jeremy! Got a little busy the last few weeks, so haven't had much time to work on this. Tomorrow I'll work on fixing the regex and getting the TSA scraper set up.

gcappaert added 3 commits May 4, 2023 21:32

Add regex pattern to eliminate extra citation text

9997ead

Tested and seems to be working

c158e61

Fixed filepaths

d3dffc0

jsvine self-assigned this May 9, 2023

jsvine mentioned this pull request May 11, 2023

Figure out how to separate the end-of-report notes from inspections' final citations #58

Open
