Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Needs to be merged #22

Open
jsfenfen opened this issue Jan 16, 2018 · 8 comments
Open

Needs to be merged #22

jsfenfen opened this issue Jan 16, 2018 · 8 comments

Comments

@jsfenfen
Copy link
Collaborator

A distinct class of variable problem we see are those that represent the same thing, but have been given different names, and thus need to be merged. When given the option, try to pick the variable_name used on the more recent version.

@jsfenfen
Copy link
Collaborator Author

Hey @lecy here's an example of something that I think needs to be merged, which is the 'opposite' problem I was looking at with the scope_duplicates, which typically requires variable 'splitting'.

(I'm omitting the /Return/ReturnData/ below for brevity)

/IRS990ScheduleR/IdRelatedOrgTxblCorpTrGrp/USAddress/ZIPCd
is present in versions 2014v5.0;2014v6.0;2015v2.0;2015v2.1;2015v3.0;2016v3.0', and '/IRS990ScheduleR/IdRelatedOrgTxblCorpTrGrp/USAddress/ZIPCode' is in versions 2013v3.0;2013v3.1;2013v4.0'

But I see 'SR_04_PC_IROCTAZIP' for /IRS990ScheduleR/IdRelatedOrgTxblCorpTrGrp/USAddress/ZIPCd

and 'SR_04_PC_IROCTAZIPCOD' for /IRS990ScheduleR/IdRelatedOrgTxblCorpTrGrp/USAddress/ZIPCode

I think these should be the same thing. But note the variable_name has 'COD' appended in the second case above. So the merge I guess I'm suggesting would preserve the IROCTAZIP one and remove the ZIPCOD one, since the former is the more recent one.

@jsfenfen
Copy link
Collaborator Author

@lecy @borenstein @MiguelABarbosa Here's a more complete list as an excel file of variables that I think should be merged. The change that would be made to the concordance would touch only the variable_name field (by changing variable_name_original to merge_into_variable_name); the xpath and versions would remain unchanged.

In general I'm replacing the older variable with the new one. I only ran this to 'fix' things from 2013 schemas forwards, there may be similar errors for earlier schemas that these proposed merges don't touch.

@lecy
Copy link
Member

lecy commented Jan 18, 2018

@jsfenfen this is great progress. Thanks for pushing forward with these cases. I have been slammed this week so sorry for the slow response.

I agree with your strategies here for splitting and merging variables.

I am sharing the cases we identified at the validathon. They are under the tab "Need Love". Here is the form participants completed:

https://goo.gl/forms/UpYGvwsJ3QCaNeno1

And the resulting spreadsheet (attached).

Validatathon Response Form (Responses).xlsx

I am not sure if there is a good way to cross-check these two lists, but just wanted to make sure you have this info for reference.

@jsfenfen
Copy link
Collaborator Author

Yeah not sure of best way forward on this. Here's the diff of the merges I turned up by script, fwiw: jsfenfen@dc066b3

@jsfenfen
Copy link
Collaborator Author

Also, @lecy, just eyeballing it, I don't think there's very much overlap between the validatathon response and my suggested merges, which are mostly addresses / zipcodes / states. If you were sure that the file is correct, I could probably pull out specific rows that called for merges and turn it into a similar 'merge file', if that makes sense.

There may be some overlap with my variable splitting commits? Some of them dealt with issues that were sorta orthogonal to this but still required fixing? I could be wrong too.

@jsfenfen
Copy link
Collaborator Author

Yeah, so for instance F9_12_PC_FINSTATBOTH is dealt with here: jsfenfen@c30bdda though it also appears in the validatathon report (for the thing I fixed it for, I think).

@lecy
Copy link
Member

lecy commented Jan 18, 2018

I think the way forward is to continue with your current strategy of automated checks - finding ways to check the logic of variables using metadata and rules about what should be true if the concordance is correct.

When we have fixed as many issues as we can, then we start with the divide and conquer strategy of one human reviewing one Part of the 990 form, confirming the variables are consistent according to a set of rules provided, preview some of the data (David has generated a dataset and report for each xpath to facilitate this), and sign off that an entire Part has been validated.

I will have bandwidth in March to focus on the main parts from the 990-PC (I have no bandwidth right now). I would like to have a dataset for a conference in April.

I am sure we will continue to find problems moving forward, but as long as the general architecture of the concordance is solid and we have a clear protocol for how to identify bugs, they should iron out over time. Just keep track of which parts and schedules have not been validated by human, and provide the necessary caveats.

Does that sound feasible as a longer-term strategy?

@jsfenfen jsfenfen mentioned this issue Jan 19, 2018
@jsfenfen
Copy link
Collaborator Author

@lecy that sounds reasonable--I put the merges in a pull request, lemme know if there's any reason not to merge these, I'll do so in a day or two if no one finds anything. I guess my question is about the quality of the validatathon spreadsheet. If we believe that all the rows that say they need to be merged should get merged as directed, I can run that too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants