[question] assignment of stable identifier for a source (and not only its access points) #809

Open
tlongers opened this issue Jun 7, 2022 · 9 comments

tlongers commented Jun 7, 2022

@hancush In our model we assign a UUID to each source access point, but not to the source itself:

Source 1 -> Access point A (e38161cb-f0f2-4fa6-94ff-2c99f96225ea)
Source 1 -> Access point B (1c0fc3ea-b5fa-4fcf-acb7-0f8cb6b3b829)
Source 1 -> Access point C (bf0b493d-6b56-4347-8935-8be63cc44fe3)

Access points are citations of specific parts of a source, and we assign each one a stable UUID. For example, page 54 of Source 1 has a different access point (and UUID) from page 68 of Source 1. We don't, however, assign a stable UUID to Source 1 itself.

Although sfm-cms draws UUIDs for access points from our import sheets, it also assigns a UUID to the source. Check here, for example, using our long-neglected "sources" view in sfm-cms (login required):

https://back.securityforcemonitor.org/en/source/view/079ddd1a-55c4-4694-902a-f6287a2ca09b/1da0094b-02fe-4b4f-a87c-84df1414bea8/#evidence

The URL displays access point 1da0094b-02fe-4b4f-a87c-84df1414bea8, the record for which contains the following data:

| field | value |
| --- | --- |
| source:comments:admin | |
| source:status:admin | 3 |
| source:external_archive_sha_content:admin | |
| source:external_archive_sha_meta:admin | |
| source:access_point_id:admin | 1da0094b-02fe-4b4f-a87c-84df1414bea8 |
| source:type | document |
| source:title | BAHRAIN – M270 MULTIPLE LAUNCH ROCKET SYSTEMS (MLRS) UPGRADE |
| source:author | |
| source:url | https://www.dsca.mil/press-media/major-arms-sales/bahrain-m270-multiple-launch-rocket-systems-mlrs-upgrade |
| source:created_timestamp | |
| source:uploaded_timestamp | |
| source:published_timestamp | 2022-03-24 |
| source:accessed_timestamp | 2022-03-30 |
| source:access_point_type | archive |
| source:access_point_trigger | |
| source:archive_url | https://web.archive.org/web/20220327065604/https://www.dsca.mil/press-media/major-arms-sales/bahrain-m270-multiple-launch-rocket-systems-mlrs-upgrade |
| source:archive_timestamp | |
| source:publication_country | us |
| source:publication_name | Defense Security Cooperation Agency |
| source:publication_id:admin | 91eda787-dacd-41ee-93c0-e01e3120a28b |

However, sfm-cms also assigns 079ddd1a-55c4-4694-902a-f6287a2ca09b to the source, which in this case is the document called BAHRAIN – M270 MULTIPLE LAUNCH ROCKET SYSTEMS (MLRS) UPGRADE.

How is it doing this, and does it repeat the process each time data are imported? Is there a requirement for source uniqueness inside sfm-cms that is going unmet here, and that we should meet by assigning a stable UUID to each source (and not only its access points)?

tlongers commented Oct 6, 2022

bump @hancush

hancush commented Oct 6, 2022

Hi, @tlongers, sharing a relevant email from late last year where we pondered this very question together:


Source import flow

The revised source import loops over each row in the sources sheet. First, it creates a new access point, or retrieves and updates an existing one, based on "source:access_point_id:admin". Then, it creates, or retrieves and updates, the implicated source based on the combination of fields listed in Sources, below, and associates it with the access point.

If the source fields are not harmonized within records referring to the same source, then we'll see multiple versions of that source in our data. Referring back to the example in the current sheet, we have two sources for "By All Means Necessary", one with a publication date and one without, and the access points are split between those versions.
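The two-step flow above can be sketched in plain Python, with dictionaries standing in for the database. This is only an illustration under assumptions, not the sfm-cms code: the function name `import_sources`, the in-memory stores, and the choice to mint a fresh UUID for a new source are all hypothetical; only the column names come from the sheet.

```python
import uuid


def import_sources(rows, key_fields):
    """Two-step import: upsert the access point, then upsert its source."""
    access_points = {}  # access point ID -> access point record
    sources = {}        # tuple of key-field values -> source record

    for row in rows:
        # Step 1: create, or retrieve and update, the access point by its ID.
        ap_id = row["source:access_point_id:admin"]
        if ap_id not in access_points:
            access_points[ap_id] = {}
        access_points[ap_id].update(row)

        # Step 2: create, or retrieve and update, the source by the
        # combination of its identifying field values.
        key = tuple(row.get(field, "") for field in key_fields)
        if key not in sources:
            # Assumption: a new source gets a freshly generated UUID.
            sources[key] = {"uuid": str(uuid.uuid4()), "access_points": set()}
        sources[key].update({field: row.get(field, "") for field in key_fields})

        # Step 3: associate the access point with the source.
        sources[key]["access_points"].add(ap_id)
        access_points[ap_id]["source_uuid"] = sources[key]["uuid"]

    return access_points, sources
```

Because the source is looked up by field values rather than by an identifier of its own, any difference in those values between two rows produces two separate source records.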

Access points

We use "source:access_point_id:admin" to create or retrieve an existing access point to relate to the source. Am I understanding you correctly that it, alone, does not uniquely identify an access point?

Sources

I agree a unique identifier for sources would be amazing! We actually have one in our data model already, so it'd be a matter of updating it (or, perhaps more easily, flushing and re-importing all sources) if/when it becomes available on your end.

Barring that, the fields we use to resolve sources are:

  • "source:title"
  • "source:type"
  • "source:author"
  • "source:publication_name"
  • "source:publication_country"
  • "source:url"
  • "source:created_timestamp"
  • "source:published_timestamp"
  • "source:accessed_timestamp"
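To illustrate why unharmonized fields split a source, here is a small sketch that builds a composite key from the fields listed above. The helper names `source_key` and `count_sources` are hypothetical, and the publication date in the sample rows is a placeholder, not the real date of "By All Means Necessary".

```python
# Fields that together resolve a row to a source (from the list above).
SOURCE_KEY_FIELDS = (
    "source:title",
    "source:type",
    "source:author",
    "source:publication_name",
    "source:publication_country",
    "source:url",
    "source:created_timestamp",
    "source:published_timestamp",
    "source:accessed_timestamp",
)


def source_key(row):
    """Return the tuple of field values that identifies a row's source."""
    return tuple(row.get(field, "") for field in SOURCE_KEY_FIELDS)


def count_sources(rows):
    """Count how many distinct sources the rows would resolve to."""
    return len({source_key(row) for row in rows})


rows = [
    # Placeholder date, used only to show the mismatch.
    {"source:title": "By All Means Necessary", "source:published_timestamp": "2000-01-01"},
    {"source:title": "By All Means Necessary", "source:published_timestamp": ""},
]

split = count_sources(rows)  # 2: the missing date yields a second key

# Harmonize the field across both rows and the keys collapse into one.
rows[1]["source:published_timestamp"] = "2000-01-01"
merged = count_sources(rows)  # 1
```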

tlongers commented Oct 6, 2022

Aha, the Tom and Hannah of the past were wise and solved this issue already. Thanks; we'll probably implement this on our side.

hancush commented Oct 6, 2022

Wise then, shudder to think what we are now, @tlongers 😂

tlongers commented Oct 6, 2022

Thanks, we'll sort this out and let you know when we're done with it.

tlongers commented Oct 7, 2022

@hancush would you be able to:

  • point to the algorithm inside WWIC that does this job, so we can take a look; and,
  • generate a count of the number of unique sources in the current production database?

@smcalilly

@tlongers This code creates the sources: https://github.com/security-force-monitor/sfm-cms/blob/master/sfm_pc/management/commands/import_country_data.py#L1389-L1433

I queried the production database and counted 11,968 unique sources.

@tlongers

Thanks @smcalilly

@tlongers

This is now fixed in the source model: the field source:source_id:admin provides a unique, stable identifier for each source. What is required in sfm-cms to use these values rather than inferring an identity with the algorithm Hannah described above?
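One possible shape for such a change, sketched under assumptions: prefer source:source_id:admin when a row carries it, and fall back to the field combination otherwise. The function `resolve_source_identity` is hypothetical and does not exist in sfm-cms; the actual importer may need a different approach.

```python
def resolve_source_identity(row, fallback_fields):
    """Return a hashable identity for the source a row refers to.

    Rows with a stable ID resolve by that ID alone; rows without one
    resolve by the tuple of fallback field values.
    """
    source_id = (row.get("source:source_id:admin") or "").strip()
    if source_id:
        return ("id", source_id)
    return ("fields", tuple(row.get(field, "") for field in fallback_fields))
```

With an ID present, two rows for the same source resolve to the same identity even if a field such as source:published_timestamp is filled in on one and blank on the other.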
