A problem that arose while consuming data from public-facing sites was that the same data was published under different field names, and it was non-trivial to identify which names corresponded to standard measurable values. The problem compounds when multiple providers upload data and the platform cannot work out where that data belongs. One approach would be to publish a standard list and ask uploaders to convert their data from the format they have into the format we define. But, as past efforts have shown, this is unsustainable: companies and countries are not incentivised to do the conversion and, as a result, will not do it.
Assume there are three inputs, Input1, Input2, and Input3, each with three fields to report:
Input1 defines them to be Field1, Field2, Field3
Input2 defines them to be F1, F2, F3
Input3 defines them to be f1, f2, f3
Assume that the platform expects these fields to be named field1, field2, field3. The platform must be able to infer, by parsing the names, that each provider's fields map to their correct domains. This inference could be powered by a simple text parser, an ML-based learning algorithm, or something else entirely. The idea is that the parsing layer acts as a black box: everything fed into it must come out cleanly formatted.
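As a concrete illustration of the "simple text parser" end of that spectrum, here is a minimal sketch of a rule-based normalizer. The canonical names, the separator-stripping rule, and the fuzzy-match cutoff are all assumptions made for this example, not part of any existing platform API:

```python
import difflib

# Canonical field names the platform expects (from the example above).
CANONICAL = ["field1", "field2", "field3"]

def normalize_field(name, canonical=CANONICAL, cutoff=0.4):
    """Map a provider's raw field name onto the closest canonical name.

    A deliberately simple sketch: lowercase and strip separators, try an
    exact match, then handle abbreviations, then fall back to fuzzy
    matching. An ML model could sit behind this same interface instead.
    """
    key = name.lower().replace("_", "").replace("-", "")
    # Exact hit after normalisation, e.g. "Field1" -> "field1".
    if key in canonical:
        return key
    # Abbreviations such as "F1"/"f1": same leading letter, same suffix.
    for cand in canonical:
        if cand.startswith(key[0]) and cand.endswith(key[1:]):
            return cand
    # Last resort: closest string match, or None if nothing is close.
    matches = difflib.get_close_matches(key, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# The three example providers all collapse onto the platform's names.
for provider_fields in (["Field1", "Field2", "Field3"],
                        ["F1", "F2", "F3"],
                        ["f1", "f2", "f3"]):
    print([normalize_field(f) for f in provider_fields])
```

All three provider schemas come out as `['field1', 'field2', 'field3']`, which is exactly the "everything put in comes out cleanly formatted" property described above.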
This black box could also potentially be used in other places where we might need inferential analysis (API endpoints, names, etc.). It would make a nice side project that plugs into the platform without requiring any platform changes: one could write a parser that works on 100 examples and then run it against the platform.
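The "develop against examples, then plug in" workflow suggested above can be sketched as a pluggable interface plus a small evaluation harness. The `Inferrer` type, the scoring function, and the rule table below are hypothetical names invented for this sketch:

```python
from typing import Callable, Dict, List, Tuple

# An inferrer is just a function from a raw name to a canonical one, so
# any implementation (regex rules, an ML model, ...) plugs in unchanged.
Inferrer = Callable[[str], str]

def evaluate(inferrer: Inferrer, examples: List[Tuple[str, str]]) -> float:
    """Score an inferrer on labelled (raw_name, expected) pairs.

    This mirrors the idea above: develop the parser offline against a
    set of examples, then drop it into the platform once it scores well.
    """
    hits = sum(inferrer(raw) == expected for raw, expected in examples)
    return hits / len(examples)

# A trivial rule-based inferrer (lookup table is purely illustrative).
RULES: Dict[str, str] = {"field1": "field1", "f1": "field1",
                         "field2": "field2", "f2": "field2",
                         "field3": "field3", "f3": "field3"}

def rule_inferrer(raw: str) -> str:
    # Fall back to the raw name if no rule matches.
    return RULES.get(raw.lower(), raw)

examples = [("Field1", "field1"), ("F2", "field2"), ("f3", "field3")]
print(evaluate(rule_inferrer, examples))  # -> 1.0
```

Because the platform would only depend on the `Inferrer` signature, the rule table could later be swapped for a learned model without any platform changes, which is the decoupling the paragraph above argues for.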
Data inferences