NAICS code assignment #15

Open
calmc opened this issue Nov 19, 2024 · 0 comments
calmc commented Nov 19, 2024

Is the feature related to a problem? Please describe.
A single facility (unique Registry ID) may be assigned multiple NAICS codes across different EPA information systems, and these codes may fall in completely different sectors (e.g., agriculture vs. manufacturing vs. retail trade). The current approach in frs_extraction.py is naive: it takes the first code reported as the value of naicsCode and keeps all remaining codes as the value of naicsCodeAdditional. See the format_naics_csv method:

def format_naics_csv(self, data):

Describe the proposed solution
A solution should take a more informed approach to assigning NAICS codes when values differ across information systems. It should account for outlier assignments, as in the example below, where a single agriculture NAICS code is listed alongside many manufacturing NAICS codes.
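One possible way to flag outlier assignments is to group codes by NAICS sector (the two-digit prefix) and drop codes whose sector makes up only a small share of a facility's assignments. The sketch below is an assumption, not existing project code: the function names, the `min_share` threshold, and the sample codes are all hypothetical.

```python
from collections import Counter

# Two-digit prefixes that NAICS combines into a single sector.
SECTOR_GROUPS = {
    "31": "31-33", "32": "31-33", "33": "31-33",  # manufacturing
    "44": "44-45", "45": "44-45",                 # retail trade
    "48": "48-49", "49": "48-49",                 # transportation and warehousing
}


def sector(code):
    """Return the NAICS sector label for a code, based on its first two digits."""
    prefix = str(code)[:2]
    return SECTOR_GROUPS.get(prefix, prefix)


def drop_sector_outliers(codes, min_share=0.2):
    """Keep codes whose sector holds at least `min_share` of all assignments.

    `min_share` is a hypothetical tuning parameter. If every code would be
    dropped, the original list is returned unchanged.
    """
    counts = Counter(sector(c) for c in codes)
    total = sum(counts.values())
    kept = [c for c in codes if counts[sector(c)] / total >= min_share]
    return kept or list(codes)


# Illustrative input only: one agriculture code among five manufacturing codes.
codes = ["111998", "325199", "325211", "325998", "332999", "325180"]
filtered = drop_sector_outliers(codes)
```

With this input the agriculture code accounts for one sixth of the assignments, which is below the threshold, so only the manufacturing codes remain in `filtered`.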

It's not clear at this point what the best solution is. Each of the alternatives identified below should be explored.
A test should be written to compare the results of the proposed solutions with each other and with the original NAICS code assignment, highlighting where they agree and where they do not.
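A comparison along these lines could be sketched as follows. This assumes each facility's data is available as a list of `(information_system, naics_code)` pairs and that each candidate approach is a callable taking that list; `compare_assignments`, the strategy names, and the sample data are hypothetical.

```python
def compare_assignments(facilities, strategies):
    """Apply every strategy to every facility and flag agreement.

    facilities: dict mapping a registry ID to a list of (system, code) pairs.
    strategies: dict mapping a strategy name to a callable that takes the
                list of pairs and returns one NAICS code.
    """
    rows = []
    for registry_id, records in facilities.items():
        picks = {name: fn(records) for name, fn in strategies.items()}
        row = {"registry_id": registry_id}
        row.update(picks)
        # All strategies agree when they produce a single distinct code.
        row["agree"] = len(set(picks.values())) == 1
        rows.append(row)
    return rows


# Illustrative data only; IDs and codes are placeholders.
facilities = {
    "FAC-A": [("TRIS", "111998"), ("EIS", "325199"), ("GHGRP", "325199")],
    "FAC-B": [("EIS", "325211"), ("GHGRP", "325211")],
}
strategies = {
    # "original" mimics the first-code-reported behavior described above.
    "original": lambda records: records[0][1],
    "last_reported": lambda records: records[-1][1],
}
report = compare_assignments(facilities, strategies)
disagreements = [row["registry_id"] for row in report if not row["agree"]]
```

The `disagreements` list identifies the facilities worth inspecting by hand.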

Describe alternatives considered

  1. Prefer the NAICS codes from a specific EPA information system? For example, since energy estimates are derived from the NEI (EIS) and GHGRP information systems, should their NAICS codes be preferred over those from other information systems (TRIS, RCRAINFO, etc.)? If so, should NEI (EIS) > GHGRP or GHGRP > NEI (EIS)?
  2. Use the most prevalent NAICS code? For example, count the occurrences of each NAICS code and select the one with the largest count.
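Both alternatives can be sketched in a few lines, assuming a facility's assignments are available as `(information_system, naics_code)` pairs. The function names, the priority order, and the sample records are hypothetical; in particular, whether EIS or GHGRP should rank first is the open question raised in alternative 1.

```python
from collections import Counter

# Hypothetical priority order; swap the entries to test GHGRP > NEI (EIS).
SYSTEM_PRIORITY = ["EIS", "GHGRP"]


def select_by_system(records, priority=SYSTEM_PRIORITY):
    """Alternative 1: return the code from the highest-priority system present.

    Falls back to the first reported code when no prioritized system appears.
    """
    for system in priority:
        for rec_system, code in records:
            if rec_system == system:
                return code
    return records[0][1] if records else None


def select_most_prevalent(records):
    """Alternative 2: return the code reported most often across systems."""
    counts = Counter(code for _, code in records)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Illustrative records only.
records = [
    ("TRIS", "325199"),
    ("RCRAINFO", "325199"),
    ("EIS", "325211"),
    ("GHGRP", "325199"),
]
by_system = select_by_system(records)
by_count = select_most_prevalent(records)
```

For these sample records the two alternatives disagree: the priority rule picks the EIS code, while the count rule picks the code shared by three systems. Neither sketch defines how to break ties, which a real implementation would need to decide.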

Additional context
Here's an example of the FRS Detailed Facility Information query for Registry ID == 110000408274:
image

A more complicated example is Registry ID == 110001413239:
image

calmc added the enhancement (New feature or request) label Nov 19, 2024
calmc self-assigned this Nov 19, 2024