Feedback on refactored download #816
Thanks for your feedback and questions @tlongers. I'm noticing a theme in these questions — it seems like you want to do some extra processing. This is possible, we just need to decide on what to process so we can manage all of that without making the code too complex. cc @hancush
It's quite simple. The process downloads the raw data files from Google Drive and packages them in a zip archive. The zip archive is uploaded to AWS S3 and retrieved via an S3 URL whenever anybody visits the download page. There is no filtering; we zip up the same files that we feed into the importer. The packaging is part of the existing import process: we added a few extra steps so that the same files go both into the importer and into the zip archive.
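For reference, a minimal sketch of that packaging step, assuming Python with the standard zipfile module and boto3; the bucket, key, and file names here are placeholders rather than the real implementation:

```python
import os
import zipfile

import boto3


def package_and_upload(raw_files, bucket="example-downloads-bucket",
                       key="whowasincommand-data.zip"):
    """Zip the raw data files pulled from Google Drive and push the
    archive to S3. No filtering happens here; the same files that feed
    the importer go straight into the archive."""
    archive_path = "/tmp/whowasincommand-data.zip"

    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in raw_files:
            # Store each file under its base name inside the archive
            archive.write(path, arcname=os.path.basename(path))

    # Upload the finished archive; the download page later points at this key
    s3 = boto3.client("s3")
    s3.upload_file(archive_path, bucket, key)

    # The local archive is removed once the upload succeeds
    os.remove(archive_path)
```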
We could do this, though it could become quite an involved task depending on what all you want to process. Removing columns wouldn't be too much work. But since the archive is just the raw files, there isn't a primary key in those files (at least not one created by the database; there might be some key in the spreadsheets?). If you want a primary key from the database, then we'd have to change the process so that we dump out what's in the database rather than using the raw data files. As for validation, the zip archive creation is part of the import process, and the archive won't be uploaded to S3 unless the importer has passed validation and imported all of the data.
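If blanking or dropping columns turns out to be the preferred option, it could slot into that same packaging step. A rough sketch, assuming the raw files are CSVs and using Python's csv module; the column names are placeholders for whatever attributes you end up choosing:

```python
import csv

# Hypothetical list of columns whose values should be kept but emptied
COLUMNS_TO_BLANK = {"notes:admin", "comments:internal"}


def blank_columns(in_path, out_path):
    """Copy a raw CSV, keeping every column but emptying the values of
    the columns flagged above, before the file goes into the zip archive."""
    with open(in_path, newline="", encoding="utf-8") as infile, \
         open(out_path, "w", newline="", encoding="utf-8") as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for column in COLUMNS_TO_BLANK:
                if column in row:
                    row[column] = ""
            writer.writerow(row)
```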
It would be easy to get the archive size, but counting the number of folders/files inside could be tricky. Once the archive is uploaded to S3, the files are deleted from the server, so we'd have no way to count without storing this information in the database or decompressing the archive (and decompressing would slow down the download page load). However, we could share how many countries are contained within the archive without having to do too much.
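If we do want to surface the size and the country count, the simplest route would be to record them while the archive is being built, before the raw files are deleted. A sketch of that idea; the naming scheme and the storage step are assumptions, since nothing like this exists yet:

```python
import os


def collect_archive_stats(archive_path, raw_files):
    """Work out the archive size and country count at packaging time,
    since neither can be cheaply recovered once the archive is on S3
    and the local files are gone."""
    size_mb = os.path.getsize(archive_path) / (1024 * 1024)

    # Assumes one spreadsheet per country, named by its ISO code,
    # e.g. "ng.csv" (adjust to the real naming scheme)
    country_count = len({os.path.basename(path).split(".")[0]
                         for path in raw_files})

    # Placeholder: persist these however makes sense (a database row, a
    # small JSON file next to the archive, etc.) so the download page can
    # show them without touching the archive itself
    return {"size_mb": round(size_mb, 1), "countries": country_count}
```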
Yes. It would require some extra processing, but nothing too involved.
Yes, this would be quite simple.
Good question. That URL will expire after an hour. We can do a number of things to prevent spam downloads. We could make the URL expire after a shorter period, which in theory shortens the window for abuse. We could set up the page with a cache so that it doesn't create a new link on every page load. We could also obfuscate the link and not download directly from S3; this would be a bit more involved, but might be worth it if there's a higher risk of spam downloading. S3's biggest vulnerability here is really pricing: a spammer can send an arbitrary number of requests per second against that S3 object, and this could become expensive. It costs money per request to retrieve data from S3 (GET requests cost $0.0004 per 1,000 requests), though throttling can limit the problem.
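For context, the expiring link is a standard S3 presigned URL, so shortening its lifetime is a one-line change. A sketch assuming boto3, with placeholder bucket and key names; caching the result server-side for a few minutes is what would stop a new link being minted on every page load:

```python
import boto3


def get_download_url(bucket="example-downloads-bucket",
                     key="whowasincommand-data.zip",
                     expires_in=300):
    """Generate a presigned URL for the archive. expires_in is in
    seconds, so 300 gives a five-minute window instead of the current
    one-hour default."""
    s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
```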
We'd be able to see the visits to the download page, but we haven't wired up any event tracking for when somebody presses the "download" button. We have tracking configured only for the production site, so we'd be able to set up download event tracking once we release this change to production.
No. It replaces the old data when the latest data is uploaded. We discussed whether we should add a way to download old data, but decided that we'd keep it simple with this first version. |
My thoughts ...

Package content

Simple content modifications

We can perform these steps our side, but what would be the work WWIC-side to retain but blank out values from the following attributes in the download package?

Removing values from these attributes means that we keep some of our internal commentary private and don't have to worry about ensuring that it is legible to an outside audience.

Package structure

Package metadata and licensing

What can be added into the download package (or each country spreadsheet) that states what the download is (full file list, generation date), the author and licensing terms, along with our contact details? We tend to include a

Download page and text

Heading: Download data from WhoWasInCommand.com

Paragraph/sub-heading: WhoWasInCommand.com is a free, public database of police, military and other security and defence forces. Click the "Download" button below to get a copy of all the data in the WhoWasInCommand.com database.

Single large button: "Download (.zip, x MB)"

Sub-heading/para (see what looks good): What's in the download file? The download file includes spreadsheets containing data for each country in the WhoWasInCommand.com database. Each spreadsheet includes:

The download file also contains:

The data published on WhoWasInCommand are created by Security Force Monitor, an investigative team based at the Human Rights Institute of Columbia Law School. We provide this data to assist journalists, human rights workers, litigators and others seeking accountability for human rights abuses perpetrated by security and defense forces. |
We can do this within the download packaging process. It's not a heavy lift to filter out columns completely or change the values to be empty. We're already working on filtering out any rows that aren't publishable (where
Yes, we can do this, though shepherding CSV into xlsx files isn't the cleanest process. Sometimes the processors can cast floats into weird values; I've had this happen before with sheets.js and it can be frustrating. These sorts of issues are possible to work around, but it takes some extra time. I estimate we could do it in 4-6 hours in the best case, though it could go longer if we run into any problems. Plus, y'all would need to do extra QA to check that the data is right. With that in mind, do you think the value of the user-friendliness is worth the time and effort?
The ISO code comes from the import docket, so we use those values to name the files. If y'all could add the slugified name as a new column in the import docket, then we could use that for the names.
Ideally y'all could provide each country's sources within the country's spreadsheet; otherwise we'd need to do some extra processing. The existing import process knows how sources relate to the different models, but this new process that packages the data for download is ignorant of those relationships. We can make it smarter, but that would be a good bit of work and possibly redundant, since we're already managing that in the well-tested import process.
|
We should keep the columns, but empty them.
Yes, we've come across this specific issue with sheets.js too, and it's proved quite painful to resolve. Forget this for now - it's a tar pit.
That's great; I can do that.
So you'd be looking for a
Yep, we can do that. Where should it live? In this repo? |
@tlongers No, the importer wouldn't be fussy about an unknown tab. Good question about where the readme should live. Do you think y'all would want to make edits over time, or would the content be more static? We could source it from a Google Drive folder if you think y'all will make a lot of random changes; otherwise it can live in the repo. Living in the repo would be simplest from an implementation perspective. |
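If the readme does end up as a plaintext file in the repo, folding it into the package would be a small addition to the archiving step. A sketch; the file name and archive path are placeholders:

```python
import zipfile


def add_readme(archive_path, readme_path="download_readme.txt"):
    """Append a repo-managed readme to the existing archive so the
    package documents its own contents, generation date, and licensing."""
    with zipfile.ZipFile(archive_path, "a") as archive:
        archive.write(readme_path, arcname="README.txt")
```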
Likely we would update it with each import, indicating the data versions (which are just timestamps) and noting any unfamiliar additions (like new countries). We're happy to contribute to a plaintext file in the repo, or to a separate one that the deploy process can call on. The issue I can see here is how to deal with deployments that are non-WWIC. |
@tlongers Gotcha. Let us know what y'all decide to do and we can add it to the packaging. We're close to having these changes ready for your review. We just need these things:
|
Feedback thread on implementation of #807 (via PR #810 to #815)
Impressions:
Questions:
Next up: