Feedback on refactored download #816

Open
tlongers opened this issue Jul 22, 2022 · 7 comments

@tlongers
Member

Feedback thread on implementation of #807 (via PR #810 to #815)

Impressions:

  • Great work @hancush @smcalilly
  • I like the super simple offering of "all our data, in one place, have at it". Exactly what we needed.

Questions:

  • explain the mechanics, please. Does it take the raw inputs, stash them somewhere, and just output the collected files into a zip, or are there further processing steps? For example, the importer only keeps rows where the status column is 3; does the download apply the same filter?
  • is there an opportunity to process the files further, for example blanking some columns (e.g. the comments, status, notes fields) and applying some global conditions, e.g. only output rows where the entity status=3, and where the primary key for the entity or source exists in the WWIC database (meaning it has passed validation etc.)?
  • can we generate a variable for archive filesize + number of folders/files inside that can be passed to some additional explanatory text about the archive?
  • possible to compile each country's data into a single spreadsheet to spare the user some time doing that?
  • possible to expand the file naming convention to "[version]_[two-letter-iso]_entity.csv"? These variables are available in the importer filenames.
  • what's stopping spamming/misuse of the download URL? I see it generates a unique URL each time the page is refreshed.
  • is it wired into the web analytics so we can see its use?
  • behind the scenes, is it keeping the old downloads anywhere?

Next up:

  • revise the text accompanying the download (that's for me to do)
  • take the "ZIP" out of the download button, and place some small text beneath it saying ".zip (32.7 MB,
  • add primary nav item for "Download" and strip out download buttons from search and entity pages
@smcalilly

Thanks for your feedback and questions @tlongers. I'm noticing a theme in these questions — it seems like you want to do some extra processing. This is possible, we just need to decide on what to process so we can manage all of that without making the code too complex. cc @hancush

explain the mechanics, please.

It's quite simple. It downloads the raw data files from Google Drive and packages them in a zip archive. The zip archive is uploaded to AWS S3 and is then retrievable via an S3 URL whenever anybody visits the download page. There is no filtering: we download the same files that we feed into the importer and zip them up. This happens as part of the existing import process; we added a few extra steps so that the same files go into both the importer and the zip archive.
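To make that concrete, here's a minimal sketch of the packaging step (a sketch only, assuming boto3; the paths, bucket, and function names are placeholders, not the actual WWIC code):

```python
import os
import zipfile

import boto3


def package_and_upload(csv_paths, bucket="wwic-data", key="downloads/whowasincommand.zip"):
    """Zip the raw input files and push the archive to S3.

    Runs as an extra step at the end of the import, so a failed
    import never produces a fresh archive.
    """
    archive_path = "/tmp/whowasincommand.zip"
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in csv_paths:
            # store each file at the archive root under its own name
            zf.write(path, arcname=os.path.basename(path))
    boto3.client("s3").upload_file(archive_path, bucket, key)
```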

is there an opportunity to process the files further

We could do this, though it could become quite an involved task depending on how much you want to process. Removing columns wouldn't be too much work. But since the archive is just the raw files, there isn't a primary key in those files (at least not one created by the database; there might be some key in the spreadsheets?). If you want a primary key from the database, then we'd have to change the process so that we dump out what's in the database rather than using the raw data files. As for validation, the zip archive creation is part of the import process, and the archive won't be uploaded to S3 unless the importer has passed validation and imported all of the data.

can we generate a variable for archive filesize + number of folders/files inside that can be passed to some additional explanatory text about the archive?

It would be easy to get the archive size, but counting the number of folders/files inside could be tricky. Once the archive is uploaded to S3, the files are deleted from the server, so we'd have no way to count them without storing this information in the database or decompressing the archive (decompressing would slow down the download page load). However, we could share how many countries are contained within the archive without much extra work.
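For what it's worth, capturing the stats at packaging time, before the local files are deleted, would sidestep that. A rough sketch (hypothetical helper; the numbers would need to be stashed in the database alongside the S3 key):

```python
import os
import zipfile


def archive_stats(archive_path):
    """Measure the archive before it's uploaded and the local copy removed."""
    size_mb = os.path.getsize(archive_path) / (1024 * 1024)
    with zipfile.ZipFile(archive_path) as zf:
        # namelist() includes directory entries, which end in "/"; skip them
        file_count = sum(1 for name in zf.namelist() if not name.endswith("/"))
    return round(size_mb, 1), file_count
```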

possible to compile each country's data into a single spreadsheet to spare the user some time doing that?

Yes. It would require some extra processing, but nothing too involved.

possible to expand the file naming convention to "[version]_[two-letter-iso]_entity.csv"?

Yes, this would be quite simple.
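Something along these lines (hypothetical helper; version and iso would come from the importer's existing filename variables, and "entity" is the data type, e.g. units or persons):

```python
def download_filename(version, iso, entity):
    # "[version]_[two-letter-iso]_entity.csv", as proposed above
    return f"{version}_{iso}_{entity}.csv"


download_filename("20220722", "bf", "units")  # -> "20220722_bf_units.csv"
```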

what's stopping spamming/misuse of the download URL?

Good question. That URL will expire after an hour. We can do a number of things to prevent spam downloads. We could make the URL expire after a shorter period, which would narrow the window for abuse. We could set up the page with a cache so that it doesn't create a fresh link on every page load. We could also obfuscate the link and not serve the download directly from S3; this would be a bit involved, but might be worth it if there's a higher risk of spam downloading. S3's biggest vulnerability here is pricing: a spammer can send an arbitrary number of requests per second at that S3 object, and since it costs money per request to retrieve data from S3, this could become expensive. GET requests cost $0.0004 per 1,000 requests, though throttling can limit the problem.
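For reference, the expiring link is the standard S3 presigned-URL pattern, roughly like this (bucket and key are placeholders):

```python
import boto3


def download_url(bucket="wwic-data", key="downloads/whowasincommand.zip"):
    # anyone holding the returned URL can GET the object until it expires
    return boto3.client("s3").generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=3600,  # seconds; lower this to narrow the abuse window
    )
```

Caching the result for roughly its lifetime (e.g. with Django's cache framework) would also stop the page from minting a fresh link on every load.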

is it wired into the web analytics so we can see its use?

We'd be able to see the visits to the download page, but we haven't wired up any event tracking for when somebody presses the "download" button. We have tracking configured only for the production site, so we'd be able to set up download event tracking once we release this change to production.

behind the scenes, is it keeping the old downloads anywhere?

No. It replaces the old data when the latest data is uploaded. We discussed whether we should add a way to download old data, but decided that we'd keep it simple with this first version.

@tlongers
Member Author

tlongers commented Oct 6, 2022

My thoughts ...

Package content

Simple content modifications

We can perform these steps our side, but what would be the work WWIC-side to retain but blank out values from the following attributes in the download package?

unit:owner:admin
unit:comments:admin
person:comments:admin
person:owner:admin
incident:comments:admin
incident:owner:admin
source:owner:admin
source:comments:admin

Removing values from these attributes means that we keep some of our internal commentary private and don't have to worry about ensuring that it is legible to an outside audience.

Package structure

  • Is it possible to package all the .csv into a single .xlsx file for each country? This is more user-friendly. We use sheets.js.
  • Instead of the two-letter ISO code ("bf") for each country, could a slugified full name be used ("burkina_faso") for the filename and directory?
  • Can sources specific to a country's data be filtered out and pulled into a sources sheet just for that country? This is also something we can provide directly from our side as part of the import sheet.
  • Can we provide a full copy of the documentation inside the zip, to accompany the data? ReadTheDocs, which we use, builds a PDF version of the documentation.

Package metadata and licensing

What can be added into the download package (or each country spreadsheet, etc.) that states what the download is (full filelist, generation date), the author and licensing terms, along with our contact details? We tend to include a notes sheet that contains all this detail. At the least, a copy of the CC 4.0 unported license will need to be included.

Download page and text

Heading:

Download data from WhoWasInCommand.com

Paragraph/sub-heading

WhoWasInCommand.com is a free, public database of police, military and other security and defence forces. Click the "Download" button below to get a copy of all the data in the WhoWasInCommand.com database.

Single large button:

"Download(.zip, x MB)"

Sub-heading / para (See what looks good)

What's in the download file?

The download file includes spreadsheets containing data for each country in the WhoWasInCommand.com database. Each spreadsheet includes:

  • the organizational structure and command chain of each branch of the security forces we have researched.
  • details on command personnel and their postings to different units over time.
  • the geographical footprint of different forces, including infrastructure and areas of operation.
  • a file of geospatial information used to represent this data as a map.
  • all the sources used to evidence each piece of data.

The download file also contains:

  • the Security Force Monitor Research Handbook, which explains what each piece of data means and how it was created. You can also read the Research Handbook online.
  • copyright and licensing information explaining what you can do with this data, and what your obligations are should you use the data.

The data published on WhoWasInCommand are created by Security Force Monitor, an investigative team based at the Human Rights Institute of Columbia Law School. We provide this data to assist journalists, human rights workers, litigators and others seeking accountability for human rights abuses perpetrated by security and defense forces.

@smcalilly

@tlongers

what would be the work WWIC-side to retain but blank out values from the following attributes in the download package

We can do this within the download packaging process. It's not a heavy lift to filter out columns completely or change the values to be empty. We're already working on filtering out any rows that aren't publishable (where <data model>:status:admin != 3), so we'll just have to filter out those columns you listed, too. Would removing the columns suffice? It would be simplest to remove the columns completely instead of changing the values to be empty.
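Roughly, the packaging step would run each CSV through something like this (a sketch only; the helper name, status column argument, and the choice to blank values rather than drop columns are ours for illustration):

```python
import csv

# columns to keep but empty, per the list earlier in the thread
BLANK_COLUMNS = [
    "unit:owner:admin", "unit:comments:admin",
    "person:comments:admin", "person:owner:admin",
    "incident:comments:admin", "incident:owner:admin",
    "source:owner:admin", "source:comments:admin",
]


def publishable_rows(in_path, out_path, status_column):
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row.get(status_column) != "3":
                continue  # drop rows that aren't publishable
            for column in BLANK_COLUMNS:
                if column in row:
                    row[column] = ""  # keep the column, blank the value
            writer.writerow(row)
```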

Is it possible to package all the .csv into a single .xlsx file for each country?

Yes, we can do this, though shepherding CSVs into xlsx files isn't the cleanest process. Sometimes the processors cast floats into weird values; I've had this happen before with sheets.js and it can be frustrating. These sorts of issues are possible to work around, but it takes some extra time. I estimate we could do it in 4-6 hours in the best case, though it could take longer if we run into any problems. Plus, y'all would need to do extra QA to check that the data is right. With that in mind, do you think the value of the user-friendliness is worth the time and effort?
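If we did go ahead, writing every cell through as a string is the usual way to dodge the float coercion. A sketch of the idea in Python/openpyxl (for illustration only, not necessarily the library we'd use):

```python
import csv
import os

from openpyxl import Workbook


def csvs_to_xlsx(csv_paths, out_path):
    wb = Workbook()
    wb.remove(wb.active)  # drop the default empty sheet
    for path in csv_paths:
        # sheet names come from filenames; Excel caps titles at 31 chars
        title = os.path.splitext(os.path.basename(path))[0][:31]
        ws = wb.create_sheet(title=title)
        with open(path, newline="") as f:
            for row in csv.reader(f):
                ws.append(row)  # values stay strings, so no float coercion
    wb.save(out_path)
```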

Instead of the two-letter ISO code ("bf") for each country, could a slugified full name be used ("burkina_faso") for the filename and directory?

The iso code is coming from the import docket, so we use those values to name the file. If y'all could add the slugified name as a new column in the import docket, then we could use that for the names.
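Once that column exists, the lookup on our side is trivial, something like this (the column headers here are assumptions):

```python
import csv


def slugs_from_docket(docket_path):
    # e.g. slugs_from_docket("docket.csv")["bf"] -> "burkina_faso"
    with open(docket_path, newline="") as f:
        return {row["country_iso"]: row["country_slug"] for row in csv.DictReader(f)}
```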

Can sources specific to a country's data be filtered out and pulled into a sources sheet just for that country?

Ideally y'all could provide each country's sources within the country's spreadsheet; otherwise we'd need to do some extra processing. The existing import process knows about the sources' relationships with the different models, but this new process that packages the data for download is ignorant of those relationships. We can make it smarter, but that would be a good bit of work, and possibly redundant since we already manage that in the well-tested import process.

Can we provide a full copy of the documentation inside the zip, to accompany the data?

Yes, we can do that. We'll just have to download it directly from the link you provided. It would be best if y'all included a license that we could package up. As for the metadata-like pieces of information (filelist etc.), we think it would be best if y'all created a readme that includes that information.
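Packaging the handbook would be a small addition along these lines (the URL and the name inside the archive are placeholders):

```python
import urllib.request
import zipfile


def add_handbook(archive_path, pdf_url):
    # fetch the ReadTheDocs PDF and append it to the existing archive
    pdf_bytes = urllib.request.urlopen(pdf_url).read()
    with zipfile.ZipFile(archive_path, "a") as zf:
        zf.writestr("research-handbook.pdf", pdf_bytes)
```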

@tlongers
Member Author

Would removing the columns suffice? It would be simplest to remove the columns completely instead of changing the values to be empty.

We should keep the columns, but empty them.

With that in mind, do you think the value of the user-friendliness is worth the time and effort?

Yes, we've come across this specific issue with sheets.js too, and it proved quite painful to resolve. Forget this for now - it's a tar pit.

The iso code is coming from the import docket, so we use those values to name the file. If y'all could add the slugified name as a new column in the import docket, then we could use that for the names.

That's great; I can do that.

Ideally y'all could provide each country's sources within the country's spreadsheet; otherwise we'd need to do some extra processing. [...]

So you'd be looking for a sources tab in the Google Sheet used for import? Is the importer fussy about tabs it doesn't recognise?

Yes, we can do that. We'll just have to download it directly from the link you provided. It would be best if y'all included a license that we could package up. As for the metadata-like pieces of information (filelist etc.), we think it would be best if y'all created a readme that includes that information.

Yep, we can do that. Where should it live? In this repo?

@smcalilly

@tlongers No, the importer wouldn't be fussy about an unknown tab.

Good question about where the readme should live. Do you think y'all would want to make edits over time, or would the content be more static? We could source that from a Google Drive folder if you think y'all will make a lot of random changes; otherwise it can live in the repo. Living in the repo would be simplest from an implementation perspective.

@tlongers
Member Author

We could source that from a Google Drive folder if you think y'all will make a lot of random changes; otherwise it can live in the repo. Living in the repo would be simplest from an implementation perspective.

Likely we would update it with each import, indicating the data versions (which are just timestamps), and noting any unfamiliar additions (like new countries). We're happy contributing to a plaintext file in the repo, or in a separate one that the deploy process can call on. The issue I can see here is how to deal with deployments that are non-WWIC.

@smcalilly

@tlongers Gotcha. Let us know what y'all decide to do and we can add it to the packaging.

We're close to having these changes ready for your review. We just need these things:

  • readme
  • slugified name for each country in the import docket
  • sources in each country's spreadsheet
