Is the feature related to a problem? Please describe.
A single facility (unique Registry ID) may be assigned multiple NAICS codes across different EPA information systems. These codes may fall in completely different sectors (e.g., agriculture vs. manufacturing vs. retail trade). However, frs_extraction.py currently selects a NAICS code naively: it takes the first code reported as the value of naicsCode and keeps all remaining codes as the value of naicsCodeAdditional. See the format_naics_csv method.
Describe the proposed solution
A solution should take a more informed approach to assigning NAICS codes when different information systems report different values. The solution should account for outlier assignments, such as the example below, where a single agriculture NAICS code is listed alongside many manufacturing NAICS codes.
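One way to think about outlier assignments is at the sector level. The sketch below is only an illustration, not the method used in frs_extraction.py: it groups a facility's codes by NAICS sector (the first two digits, with the multi-prefix sectors 31-33, 44-45, and 48-49 merged) and treats codes outside the most common sector as outliers. The function names and the example codes are hypothetical.

```python
from collections import Counter

# 2-digit prefixes that NAICS groups into a single multi-prefix sector.
_SECTOR_RANGES = {
    "31": "31-33", "32": "31-33", "33": "31-33",  # manufacturing
    "44": "44-45", "45": "44-45",                 # retail trade
    "48": "48-49", "49": "48-49",                 # transportation and warehousing
}


def naics_sector(code):
    """Return the sector label for a NAICS code (hypothetical helper)."""
    prefix = str(code)[:2]
    return _SECTOR_RANGES.get(prefix, prefix)


def split_outliers(codes):
    """Split codes into those in the dominant sector and those outside it.

    Returns (dominant_sector, in_sector_codes, outlier_codes).
    Ties between sectors go to the sector seen first.
    """
    codes = [str(c) for c in codes]
    if not codes:
        return None, [], []
    counts = Counter(naics_sector(c) for c in codes)
    dominant = counts.most_common(1)[0][0]
    in_sector = [c for c in codes if naics_sector(c) == dominant]
    outliers = [c for c in codes if naics_sector(c) != dominant]
    return dominant, in_sector, outliers


# Illustrative values only; not taken from an actual FRS record.
dominant, kept, dropped = split_outliers(["111998", "325199", "325188", "332813"])
```

A sector-level vote still leaves the question of which code to pick within the dominant sector, so it would need to be combined with one of the alternatives below.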
It's not clear at this point what the best solution is. Each of the alternatives identified below should be explored. A test should be written to compare the results of the proposed solutions with each other and with the originally assigned NAICS code, highlighting where they agree and where they differ.
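A comparison of this kind could be sketched roughly as follows. This is an assumed structure, not existing project code: each selection strategy is represented as a mapping from Registry ID to the NAICS code it chose, and compare_selections (a hypothetical name) reports, per facility, what each strategy picked and whether all of them match the original assignment. The NAICS codes in the example are made up for illustration.

```python
def compare_selections(original, candidates):
    """Compare candidate NAICS selections against the original assignment.

    original:   dict mapping registry_id -> originally assigned NAICS code
    candidates: dict mapping strategy name -> dict of registry_id -> code
    Returns one row (dict) per facility with an "all_agree" flag.
    """
    rows = []
    for registry_id, orig_code in original.items():
        # A strategy that has no answer for this facility yields None.
        picks = {name: result.get(registry_id) for name, result in candidates.items()}
        distinct = {orig_code, *picks.values()}
        rows.append({
            "registry_id": registry_id,
            "original": orig_code,
            **picks,
            "all_agree": len(distinct) == 1,
        })
    return rows


# Illustrative codes only; the Registry IDs are the two examples from this issue.
original = {"110000408274": "111998", "110001413239": "325199"}
candidates = {
    "system_priority": {"110000408274": "325199", "110001413239": "325199"},
    "most_prevalent": {"110000408274": "325199", "110001413239": "325199"},
}
rows = compare_selections(original, candidates)
disagreements = [r["registry_id"] for r in rows if not r["all_agree"]]
```

The rows could then be loaded into a table to summarize how often the strategies agree with each other and with the current first-code behavior.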
Describe alternatives considered
Prefer the NAICS codes from a specific EPA information system? For example, since energy estimates are derived from the NEI (EIS) and GHGRP information systems, should their NAICS codes take precedence over NAICS codes from other information systems (TRIS, RCRAINFO, etc.)? If so, should NEI (EIS) > GHGRP or GHGRP > NEI (EIS)?
Use the most prevalent NAICS codes? For example, calculate counts of each NAICS code and select the one with the largest count.
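The two alternatives above can be sketched side by side. This is a minimal sketch under stated assumptions: each facility's data is taken to be a list of (information system, NAICS code) pairs, the system labels ("EIS", "GHGRP", etc.) are assumed to match whatever the FRS data actually uses, and the priority order shown is a placeholder because the ordering question is still open. All function names and example codes are hypothetical.

```python
from collections import Counter

# Placeholder order; whether EIS or GHGRP should rank first is undecided.
SYSTEM_PRIORITY = ["EIS", "GHGRP"]


def select_by_system(records, priority=SYSTEM_PRIORITY):
    """Return the first code reported by the highest-priority system.

    records: list of (system, naics_code) tuples for one facility.
    Returns None when no prioritized system reports a code.
    """
    for system in priority:
        for rec_system, code in records:
            if rec_system == system:
                return code
    return None


def select_most_prevalent(records):
    """Return the code reported most often; ties go to the code seen first."""
    counts = Counter(code for _, code in records)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Illustrative records only; not taken from an actual FRS query.
records = [
    ("TRIS", "325199"),
    ("RCRAINFO", "325199"),
    ("EIS", "325110"),
    ("GHGRP", "325199"),
]
by_eis_first = select_by_system(records)
by_ghgrp_first = select_by_system(records, priority=["GHGRP", "EIS"])
by_count = select_most_prevalent(records)
```

In this made-up case the two priority orders disagree with each other, and only the GHGRP-first order matches the prevalence count, which is the kind of difference the comparison test should surface.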
Additional context
Add any other context or screenshots about the feature request here.
Here's an example of the FRS Detailed Facility Information query for Registry ID == 110000408274:
A more complicated example is Registry ID == 110001413239:
Source reference for format_naics_csv: foundational-industry-energy-data/fied/frs/frs_extraction.py, line 291 in 2c20c9f