---
title: "PEP Data Management Plan"
author: "Stacie Koslovsky"
toc: true
theme: superhero
format: html
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
```
This document was last updated on `r Sys.Date()`.
# Objective
Welcome to the Polar Ecosystems Program (PEP; pronounced both as P-E-P and pep) Data Management Plan. The objective of this document is to outline and implement consistent information management procedures for PEP data.
# Overview
Data are our legacy. The data we collect, manage and analyze today will continue to be available and used by staff (and others) in the future. It is, therefore, important to maintain, organize and share our data in a way that will be useful and meaningful to current and future users. This includes:
- Keeping files organized and up-to-date;
- Storing data in machine-readable, human-readable, and tidy formats;
- Documenting information about the files and the process(es) for creating the files;
- Cleaning out files (and paperwork) that are not important for the future legacy of the data (e.g. intermediate files, outdated information); and
- Sharing data with the public and interested partners.
## Foundations to Data Management
A recently-ish published paper in PLOS Computational Biology, [Good Enough Practices in Scientific Computing](https://arxiv.org/pdf/1609.00037.pdf), provides a thorough overview of best practices and workflows for managing scientific data. While some of its focus is beyond our scope, a few key principles in the data management section are worth focusing on:
1. **Create the data you wish to see in the world**. The original, raw data collected in the field or offloaded from sensors are rarely of the type and quality we would like to share with the world and have our names associated with. Data formats, data organization, column names, value data types and formats (e.g. date-time) can all be transformed and improved into higher quality forms. The better formed the data are, the easier subsequent analyses will be, the more reproducible our science will be, and the more likely it is that others will find use in our data.
2. **Create analysis-friendly (e.g. tidy) data**. In many cases, analysis-friendly data will be equivalent to the data you wish to see in the world. The key principle at play here is that of "tidy" data. Hadley Wickham's [2014 manuscript](https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf) does an excellent job outlining the ideal structure of tidy data.
1. The key components of tidy data are:
a. Each variable forms a column.
b. Each observation forms a row.
c. Each type of observational unit forms a table.
2. Not only are these principles important for analyzing data in programming languages (such as R or Python), they are also key components of a well-organized database. Organizing data in this way will allow easier ingestion into a database; once your data are in a central database, more tools will be available for exploring and analyzing them (see the sketch after this list).
3. **Record all the steps used to process data**. Embracing scripts developed in a programming language (such as R or Python) is essential to providing a robust, reproducible workflow for your data. With large, complex datasets, manually processing data via mouse clicks and spreadsheet-centric workflows is often time-consuming and difficult to reproduce. As part of the planning process for the PEP data management workflow, we will evaluate existing data management processes and identify ways to simplify and automate existing or new steps.
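To make the tidy-data principle concrete, here is a minimal, self-contained R sketch; the species columns and count values are hypothetical examples, not a real PEP dataset.

``` r
library(tidyr)
library(tibble)

# Hypothetical wide-format field counts: one column per species (untidy)
raw <- tribble(
  ~site,    ~date,        ~bearded, ~ringed, ~spotted,
  "Site01", "2018-05-01", 4,        12,      7,
  "Site02", "2018-05-01", 1,        9,       3
)

# Reshape so each variable forms a column and each observation forms a row
tidy <- pivot_longer(raw,
                     cols = c(bearded, ringed, spotted),
                     names_to = "species",
                     values_to = "count")
```

Because the reshaping lives in a script rather than in spreadsheet edits, the processing steps are recorded and reproducible, which is exactly what principle 3 asks for.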
## General Guidelines
**Create folders with a top-down hierarchy**, i.e. folders should get more specific as you get deeper into folder organization. This will make long-term organization more efficient, will consolidate files to a layout that will be easy for others to interpret, and will improve information transfer among staff.
**Create and maintain metadata documentation for your files**. These can be shorter documents describing the files within a specific folder or longer documents describing the work that went into creating each of the files. The complexity of each file/folder dictates the level of detail that should be documented in the metadata. This level of metadata can be most simply managed using a Word document or a text file.
**Copy files…do not drag-and-drop**. When you need to move files on the network or your computer, either copy the files to the new location and then delete them from the old location, or cut and paste them from the old location to the new one (either option instead of drag-and-drop). Use batch processing whenever dealing with large volumes of data (example link to SMK code for this in R); a sketch of this approach follows below.
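As a minimal sketch of scripted copy-then-delete in R (in place of drag-and-drop), the following might serve as a starting point; the source and destination paths are hypothetical placeholders.

``` r
# Hypothetical locations; substitute real paths for your project
old_dir <- "C:/temp/field_photos"
new_dir <- "//nmfs/akc-nmml/Polar_Imagery/Inbox"

files <- list.files(old_dir, full.names = TRUE)

# Copy everything to the new location first...
copied <- file.copy(files, new_dir, overwrite = FALSE)

# ...then remove originals only for files that copied successfully
file.remove(files[copied])
```

Checking the logical vector returned by `file.copy()` before deleting anything is the scripted equivalent of verifying the copy landed before emptying the old folder.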
## File Naming Conventions
Name and organize your files in such a way that you know what they mean at 3 am (or in a crunch)!
### Three Principles for File Naming
- **Human readable**
- File names are easy to understand by anyone, not just you. ☺
- Use versioning (when appropriate) to track different versions of your files (dates are a great way to do this).
- Keep file names as short as possible.
- **Machine readable**
- Avoid spaces, punctuation, accented characters.
- Case-sensitivity!
- Deliberately use delimiters (e.g. `_`, `-`, `.`) instead of spaces.
- **Orders logically**
- Name files from most general to most specific, so similar items cluster together when sorted by name and are easy to read (e.g. HarborSeal_Locations and HarborSeal_Polygons).
- YYYY-MM-DD sorts better than DD-MMM-YYYY.
- Use leading zeroes! (e.g. 01, 02, 10 – instead of 1, 2, 10).
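As a small illustration of these principles, the R sketch below builds a machine-readable, sortable file name; the project name, date, and flight number are hypothetical.

``` r
# Hypothetical survey metadata
survey_date <- as.Date("2018-08-05")
flight_num  <- 3

# YYYYMMDD dates and zero-padded counters sort correctly by name
file_name <- sprintf("HarborSeal_Counts_%s_Flight%02d.csv",
                     format(survey_date, "%Y%m%d"), flight_num)
file_name
#> [1] "HarborSeal_Counts_20180805_Flight03.csv"
```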
### File Naming Examples
| Not-so-good | Improved File Name |
|-----------------------|----------------------------|
| Plan 1.docx | DataMgmtPlan_20170825.docx |
| Why won’t this work.R | CHESS_DataImport.R |
| Asdfasdf.jpg | Figure01.jpg |
| New.pdf | ActivityPlan_20170815.pdf |
| NewFinal.pdf | ActivityPlan_20170818.pdf |
| NewFinalFinal.pdf | ActivityPlan_20170822.pdf |
# DEIAB in PEP Data Management
Data management *might seem* beyond the scope of DEIAB, but there are important considerations that can be applied to data-related work. DEIAB stands for Diversity, Equity, Inclusivity, Accessibility, and Belonging.
- **Diversity** recognizes and appreciates our different skills, technical abilities, ideas, backgrounds, and ways of learning (among many other things). It is my intent to approach data management application and training in our program with this in mind. Our jobs and how we interact with data are also diverse; different people will have different needs and different applications for the data we collect and maintain.
- **Equity**, [from the World Economic Forum](https://www3.weforum.org/docs/WEF_Advancing_Data_Equity_2024.pdf), "is a shared responsibility that requires collective action to create data practices and systems that promote fair and just outcomes for all. ... By considering data equity throughout the data life cycle, data practices can be improved to promote fair, just and beneficial outcomes for all individuals" and projects.
- **Inclusivity** is "the practice of ensuring that data collection, storage, and analysis processes actively consider and represent diverse perspectives, experiences, and identities, preventing the exclusion of marginalized groups and promoting equitable representation across all datasets, ultimately leading to more accurate and comprehensive insights." While we don't have to consider inclusivity when it comes to seals, we do have to consider inclusivity with respect to the people who are collecting, managing and using the data. Some considerations include:
- Developing inclusive data collection protocols: design surveys and data gathering methods that are accessible.
- Implementing data governance practices: establish clear guidelines for collecting, storing, and using data. While this document is a blueprint for program projects, it does not replace the need for this to be considered and applied on a project-by-project basis.
- **Accessibility** - the data and the information about the data need to be accessible. This document (plus other efforts, like PEP Data Days) are intended to make information as accessible as possible for the diversity of needs and users within our program.
- **Belonging** is the feeling of being accepted and supported in each project and within our program. Ultimately, it is my goal that everyone feels like they have access to the information they need and are getting the support that is required. This is a two-way street between project and data management.
# Data Management Workflow
The figure below describes the general process that PEP uses for managing data, from start to finish and cyclically for on-going projects. This workflow consists of eight steps, some of which occur concurrently or on similar timelines yet independently of one another.
![](docs/PEP_DataMgmtPlan_files/images/DataManagementWorkflow.png){fig-align="center"}
## Plan
<details>
<summary>Click to Expand</summary>
Planning is crucial to the success of any project and all components of the work, including future data management. Early data management efforts can pay dividends later by establishing clear roles, responsibilities, expectations and timelines. Prior to any new data collection, work with S. Koslovsky to:
- Identify where data will be stored (both short- and long-term; and set up the workspace), what data products will be needed, and what data processing will be required.
- Create storage locations on network for final data and on Google Drive/elsewhere for intermediate processing; ensure necessary staff have access to these locations.
- Develop/evaluate/implement data collection strategies that facilitate and simplify data management steps that follow (e.g. datasheets, in-field data entry, in-field archive efforts).
- Coordinate (or at least communicate) with the AFSC OFIS team regarding network storage space needs and any other concerns.
- Decide how best to organize the metadata for the project. All metadata entries within InPort will be published publicly and, eventually, used to create standards-compliant metadata alongside any open data products. A one-to-one relationship between metadata records and eventual open data products often makes this process easier.
The [NOAA Data Management Planning Procedural Directive](https://nosc.noaa.gov/EDMC/PD.DMP.php) (PD) ([direct link](https://nosc.noaa.gov/EDMC/documents/EDMC-PD-DMP-2.0.1.pdf) to PDF of current version 2.0.1) requires a data management plan to be developed for all environmental data collected by NOAA programs or systems. The PD provides a generic template and guideline for developing a data management plan and a [data management plan repository](https://drive.google.com/drive/u/0/folders/0Bwl6f-PNVtnndG0yOUJ5cGU0RGs) has been established. A common tool for developing data management plans (<https://dmptool.org>) has a template for NOAA. These represent a minimum requirement and, in fact, PEP data management plans will likely be more thorough. All data management plans should be developed in collaboration with the PEP data science lead (Stacie Koslovsky) and established prior to data collection. Work with S. Koslovsky to complete and submit a data management plan (as appropriate).
## Integrate
<details>
<summary>Click to Expand</summary>
For on-going projects, we will incorporate feedback from the previous year's/season's effort to make improvements to the upcoming effort. For new projects, this might include considering lessons learned from other projects.
## Collect
<details>
<summary>Click to Expand</summary>
It is critical to have clear instructions and expectations for how data are collected. It is also important to consider how to handle and/or document deviations from planned data collection – which are bound to happen!
### Field Data Collection
During field data collection, begin preliminary data management, as able:
- Review datasheets for data recording errors.
- Scan datasheets for backup.
- Use temporary storage on external/portable drives, cloud providers, or individual government laptops to backup images, scanned datasheets and other critical files.
- Enter field data into cloud-based data entry tools.
## Process
<details>
<summary>Click to Expand</summary>
After field work is completed, our aim will be to download, process and QA/QC data as soon as possible. All data should be archived to the network no later than **one month** after the completion of the project.
### Raw and Original Data
The AFSC network (LAN) is the primary storage location for all data collected by PEP.
- In practice, the transfer of data from field collection storage to the network should be the first priority upon returning from the field. Temporary storage of files on external/portable drives, cloud providers, or individual government laptops should not exceed 30 days after return from the field.
- After data are transferred to the network, files and associated information should be reviewed for consistency.
- In general, 'raw' or 'original' data collected in the field should not be altered.
- In cases where the data are known to be incorrect or a more correct value is known, those files or entries should be edited.
- A common occurrence is for files to not be named properly in the field. File names should be corrected at this time.
- Additionally, data transcription errors can occur, and this is the appropriate time to fix those errors.
### Processed and Final Data
After raw/original data are archived to the network, data should be processed and reviewed for outliers. Data entry and QA/QC should be processed and finalized **within three months** of the field effort, so that other efforts (e.g. image review, counting) are able to proceed in a timely manner.
The processing and review steps will vary by project, but generally:
- We will use automated processes for streamlining and documenting workflows whenever possible.
- Custom data-entry forms and/or spatial processing templates will be created for data entry and processing when needed.
- Queries and/or other data visualizations will be used to facilitate data QA/QC.
- Spatial data (grid and other products) will be stored in their native projection in the database.
- For many projects, this will be WGS-84.
- Spatial data from outside sources (e.g. environmental data) will be stored in their original projection. This may require data to be reprojected for specific analyses or other needs (see the sketch after this list).
- Environmental data will be updated in the pep DB annually in August. Additional environmental data can be accessed from other data sources online. If you need help with this, contact E. Richmond or S. Koslovsky.
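Where reprojection is needed, a minimal sketch using the sf package might look like the following; the file path and layer are hypothetical placeholders, and the same `st_transform()` call works on data read from PostGIS.

``` r
library(sf)

# Read spatial data (e.g. exported from the pep DB or a shapefile)
haulouts <- st_read("C:/temp/HarborSeal_Haulouts.shp")

# Check the current coordinate reference system
st_crs(haulouts)

# Reproject to WGS-84 (EPSG:4326) for an analysis that requires it
haulouts_wgs84 <- st_transform(haulouts, crs = 4326)
```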
### Backup Procedures
All data copied to the network are backed up offsite in one of two ways: snap-mirroring to another NMFS facility, or tape backup delivered off site (reserved for large files that change infrequently, e.g. imagery, acoustic files, video). In addition, any data that are snap-mirrored are also backed up differentially, which allows incremental restoration (daily for 7 days, weekly for 6 months).
## Debrief
<details>
<summary>Click to Expand</summary>
After fieldwork is completed, part of the larger project debrief should include time for discussing in-field data collection and management. The general comments and any specific actionable changes uncovered during the debrief should be incorporated into the INTEGRATE step in the next cycle of the data management workflow for that project.
## Analyze
<details>
<summary>Click to Expand</summary>
Part of the data workflow is to ensure the final products are easily usable and accessible for analyses. Accessing data for each analysis will be different; work with S. Koslovsky to identify the most efficient way to extract data from the DB and/or network for your needs.
## Evaluate
<details>
<summary>Click to Expand</summary>
We want to emphasize the feedback loop from analytical processes to data management. Our goal with data management is to streamline processing and extraction of information for analyses. This may mean updates/improvements to data management processes over time. Feedback and communication are important for ensuring data products meet current and future analytical needs, and this information should be incorporated into the INTEGRATE step in the next cycle of the data management workflow for any related field projects.
## Share
<details>
<summary>Click to Expand</summary>
Data are considered final when a project is completed (e.g., ChESS) or when the annual data processing for a project (e.g., harbor seal surveys) is completed. After data are processed and considered final, S. Koslovsky will notify PI and data sharing staff to prepare final datasets for archive and distribution (when appropriate).
### Overall Workflow
![](docs/PEP_DataMgmtPlan_files/images/FinalDataWorkflow.png){fig-align="center"}
### Metadata
NOAA Fisheries requires that all data collected be documented within the official NOAA Fisheries metadata repository, [InPort](https://inport.nmfs.noaa.gov/inport). InPort provides an extensive suite of tools for editing and managing metadata. For PEP projects, we will update InPort as follows:
1. S. Koslovsky and C. Christman will work with project leads to establish a metadata plan and to get started with InPort.
2. The project lead will enter the appropriate metadata information into a GoogleForm that will be used to create and maintain program metadata records.
1. Creating, editing, and maintaining the GoogleForm entries is the responsibility of the project lead. Because these records will be available to the public and, in many cases, represent the authoritative documentation of the data set, project leads should devote an appropriate amount of time and thought to the development of metadata records.
2. Once the necessary and complete information is entered into the GoogleForm, notify C. Christman.
3. C. Christman will complete the necessary updates/entries.
For more information on metadata workflow, [review this documentation](https://docs.google.com/document/d/1kmEoKzXjS-O2r-6euBCL7A3We8mfAre3fpcEvytBD84/edit?tab=t.0) and/or contact C. Christman.
### Online Repositories
Much of the data collected and processed as part of PEP research activities is intended for public release, either in compliance with NOAA policies (e.g. Public Access to Research Results \[PARR\]) or in support of best practices related to open science and reproducible research. Keeping track of the evolving policies and expanding tools/repositories available can be challenging. Here, we outline our plan for the use of available repositories.
If you are publishing a manuscript and the journal requires the data to be provided on an open data portal, work with S. Koslovsky to identify the most appropriate repository and to ensure metadata are created.
## DEIAB in Data Management Workflow
Each step in this process is an opportunity to ensure we are achieving DEIAB in our data management practices within our projects. Misses at each step have the potential to introduce delays in processing, to cause staff to feel unwelcome, or to introduce errors.
- Planning
- Is the planning process inclusive (as appropriate)?
- Integrating
- Has feedback from a diversity of staff been incorporated into changes for the upcoming work?
- Collecting
- Are there clear instructions for how data will be collected?
- Have staff been trained and are they confident in their ability to collect data?
- Are staff comfortable sharing mis-steps or issues with data collection?
- Processing
- Have staff been given the appropriate training for processing data?
- Does documentation of processing steps exist? Is it accessible?
- Do we have processes in place to reduce processing errors, especially reliance on any single individual?
- Debriefing
- Are all perspectives being heard?
- Analyzing
- Is the information accessible and appropriate for analysis?
- Is the process for retrieving and updating data clear and designed to avoid errors?
- Is the process for sharing data for an analysis inclusive of a diversity of perspectives? The people who collected the data might have different concerns about how data are analyzed, based on their own observations/experiences, than the project lead or data lead.
- Evaluating
- Do we have a process for evaluating a dataset and/or analysis that is inclusive of all staff?
- Sharing
- Are we ensuring that our data are accessible from an availability perspective?
- Are the data available in a format that is accessible?
- Are we ensuring that our data are accessible from an equity perspective? [From the Urban Institute](https://www.urban.org/sites/default/files/2021/06/08/do-no-harm-guide-recommendations.pdf), these are considerations for applying equity awareness in data visualization.
# On-boarding
S. Koslovsky will orient new staff to the general information provided in the PEP Data Management Plan and the specifics detailed within this section. Staff responsible for other program-wide tools (e.g. PEP Dashboard, inventory DB, Zotero) should provide resources/training as needed.
## Software Requirements and Recommendations
<details>
<summary>Click to Expand</summary>
Below is a list of software required for PEP computers.
| Program | Justification |
|-------------------------------|-------------------------------------------------------------------------------------------------------------------|
| QGIS | For the most direct connection to PEP database for spatial data |
| 64-bit PostgreSQL ODBC driver | For connecting to PEP database via Microsoft Access |
| Microsoft Access 2016 | For using PEP database front-ends (this is part of the Microsoft Suite, so might already be installed by default) |
| EndNote and/or Zotero | For accessing the PEP library |
| ACDSee | For image management (field photos and data); we also have licenses for Lightroom |
| Adobe Acrobat | For managing data and forms stored in pdf |
Below is a list of software recommended for PEP computers.
| Program | Justification |
|------------------------------------------|------------------------------------------------|
| Anaconda (or Miniconda) with Python \>= 3.6 | Lots of our tools require Python |
| R | Lots of our tools require R; keep it updated |
| RStudio | A user-friendly GUI for R |
| ArcGIS Pro | Spatial processing and analysis |
| VLC Media Player | For video management |
### Details for ODBC Driver Installation
In order to connect to the DB from Access, you will need to ask IT to update your ODBC Data Sources to allow connection to PostgreSQL. Here's a website for reference: <https://help.interfaceware.com/v6/connect-to-postgresql-from-windows-with-odbc>. The most recent driver is available from <https://www.postgresql.org/ftp/odbc/versions/msi/> (x64.zip version).
After the driver is installed, configure the PostgreSQL ANSI ODBC connection using the details in the screenshot below, substituting your own username and password. Once IT adds PostgreSQL to the System DSN, they can test your connection against an existing Access database using the "Test" button in the Driver Setup screen.
![](images/clipboard-4182424335.png)
## Data Storage
### Individual Storage Resources
<details>
<summary>Click to Expand</summary>
No data should be stored exclusively in any of these resources.
(this section to be completed after PEP Data Days 2025)
#### Laptop
#### Google Drive
#### Individual Users folders on LAN
### Shared Storage Resources
<details>
<summary>Click to Expand</summary>
#### PEP_GoogleDrive
We have a centralized Google Drive (tied to a generic email account) that serves as our program [Google Drive repository](https://drive.google.com/drive/folders/1XMadJkz4AqgHctLn0U2xnVChSjHxnyjb). Google Drive is an excellent location for sharing files among staff for easy off-site access. Google Drive, however, should not be treated as a final storage location for data. Files can be stored here in perpetuity, as appropriate, but data should not. As projects and/or collaborations are completed, data should either be deleted (if they are temporary) or archived to the appropriate location on the network.
#### Network (aka LAN)
All original data and supplemental information to data collection will be stored on the LAN in the appropriate location (detailed below). If you have questions about where to store data for a project, contact S. Koslovsky.
- For those projects where data are stored on other locations on the network (e.g. CHESS, harbor seal survey images), create shortcuts to these other network locations within each project folder.
- Final files from intermediate locations will need to be moved to the appropriate location(s) on the network when processing is completed OR when the project becomes collaborative and multiple staff will need access. Project folders should contain final data, datasheets, databases, manuals, metadata, scripts, and intermediate files (as needed).
- For folders under Polar\\Analytics, Polar\\Data, Polar\\Projects on the network, this is a suggested folder structure for each project subfolder. If data easily fit into this schema, use it; if not, data should be organized and documented such that anyone in the program can identify what is where. Not all projects will be able to fit this mold, but this structure will help keep data organized across multiple projects.
- **\\Data** – contains data collected in numerous formats, including databases, spreadsheets, etc.
- **\\Docs** – contains project proposals, reports, processing manuals, original field data sheets, etc.
- **\\GIS** – spatial data and programs associated with the project.
- **\\Manuscripts** – drafts of manuscript and any associated files.
- **\\Presentations** – any presentations associated with the project.
- **\\Scripts** – R, Python, etc. script files.
- **READ_MEs** – include read me files wherever appropriate to provide additional information on files within folders.
We store our files in a variety of locations on the AFSC internal network:
- \\\\nmfs\\akc-nmml\\NMML_CHESS_Imagery - images and associated files from the 2016 CHESS survey
- \\\\nmfs\\akc-nmml\\Polar
- \\Analytics
- Will store files related to analytical processes.
- Subfolders should be named YYYY_LastNamePI_ProjectTitle.
- The PI is ultimately responsible for ensuring that all files get stored in this location.
- You do not have to store files in this location if it is inconvenient while you are working; files just need to be archived from computers and Google Drive when processing is complete.
- \\Data
- Will store files related to data collection and creating final datasets.
- Each folder should be a project name; we do not want so many folders cluttering this location that it becomes hard to find project data, so folders will rarely be added to the root directory.
- Archived datasheets, raw data, etc. should be stored within this directory. When data processing is complete (either for a field season or entirely), files from computers and Google Drive need to be archived to this location.
- \\CapturesAndHandling
- \\Data_CruiseUnderway: subfolders should be named YYYY_Location_Species_CruiseName and should match planning folder name under Projects.
- \\Legacy
- Will store older project planning and data files that need to be kept for long-term storage, but were not broken out into the new structure.
- Subfolders should be named YYYY-YYYY_ProjectName (as appropriate).
- \\Projects
- Will store files, maps and other information related to field operations (e.g. a folder for coastal harbor seal surveys would include maps for the plane, instructions for the field season, etc.).
- Subfolders should be named in the following format: YYYY_Region_Species_Platform (e.g. 2018_BeringSea_IceSeal_DysonCruise).
- After field work is complete, associated files from computers and Google Drive need to be archived to this location.
- \\ProgramMgmt – will store budget, library, permit, etc. information.
- \\ResearchProducts
- \\Manuscripts: subfolders or pdfs should be named YYYY_Author_Subject.
- \\Presentations: final presentations should be named YYYY_Author_Subject.
- \\Reports: subfolders or pdfs should be named YYYY\_ Subject.
- \\WorkshopsAndConferences: subfolders or pdfs should be named YYYY_Author_Subject.
- \\Users
- Subfolder should be named with the staff member’s last name.
- Subfolders should be cleaned out, and anything that needs to be kept moved to \_Legacy, when staff are no longer employed/involved.
- \\\\nmfs\\akc-nmml\\Polar_Imagery
- \\Field_Photos – Portfolio monitors this folder for new images
- \\Field_Video – Portfolio monitors this folder for new videos
- \\Inbox
- \\PhotoConsentForms – contains consent forms
- \\Portfolio
- \\Portfolio_Previews
- \\Surveys_HS – contains original and processed images for coastal and glacial harbor seal surveys
- \\Surveys_IceSeals – contains original and processed images for ice seal surveys, except CHESS and some BOSS (which can be found on NMML_CHESS_Imagery and Polar_Imagery_2, respectively)
- \\Techniques_Test – contains images collected from flights testing equipment and image processing
- \\User_Photos – photos taken by staff that do not need to be managed in Portfolio; each staff member has a folder, and they are responsible for managing the photos in this location
- \\\\nmfs\\akc-nmml\\Polar_Imagery_2
- \\Aleutian_HarborSeals_CaptureSiteAnalysis
- \\Surveys_IceSeasl_BOSS_2013
- \\Surveys_Iliamna_2013
- \\Techniques_test
- \\xTEMP_BOSS13_Skeyes - can this be deleted?
- \\\\nmfs\\akc-nmml\\Polar_Imagery_3
- \\jobss_2021
### PEP PostgreSQL Database
<details>
<summary>Click to Expand</summary>
#### Overview
The table below details the final data stored in the pep PostgreSQL database.
- All staff will have read-only access to all data.
- Staff will only have read-write access to the data they are required to edit.
- There are a number of automated scripts that will replicate and/or process data into their final product. This database should be the go-to location anytime data are needed. **Intermediate copies and/or previous exports should always be replaced to ensure you are using the most up-to-date data.**
- QGIS is the only software we will use for editing spatial data stored in the DB. It will also be used for importing shapefiles to PostGIS. ArcGIS or other GIS software should only be used for viewing spatial data stored in the DB or from the pepGeodatabase.
#### Current Schemas
| **Project** | **Schema** | **Details** |
|----------------------------------|---------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Acoustics** | acoustics | Contains data received from CAEP Acoustics program and pre-processed for import into the DB. |
| **PEP Dashboard** | administrative | Contains data associated with the PEP Dashboard (activity planning, data tracking, etc.). An Access front-end is available for managing data. |
| **Annotations** | annotations | Contains data associated with application of ML models to imagery; the annotations and imagery data are stored within their project schemas; this schema tracks information related to annotation management. An Access front-end is available for managing data. |
| **Base** | base | Contains “THE” grid and environmental covariates extracted to the grid. |
| **UAS Body Condition** | body_condition | Contains image, LRF and measurement data related to the UAS body condition data. An Access front-end is available for managing data. |
| **Capture** | capture | Data have been imported from Excel files, which were exports from the Oracle DB. An Access front-end is available for managing data. |
| **Environmental** | environ | Includes NARR weather data from 2004-2016 and NSIDC CDR sea ice concentration data\*. |
| **Inventory** | inventory | Contains data for managing PEP gear inventory and field packlists. An Access front-end is available for managing data. |
| **Species Misclassification** | species_misclass | Contains ice seal species misclassification information from across all ice seal survey projects. |
| **Stock** | stock | All data have been migrated from pepgeo to this database. |
| **BOSS** | surv_boss | All data from Oracle and from the original FMC logs have been ingested into DB. An Access front-end is available for viewing/querying data. |
| **ChESS** | surv_chess | Data are uploaded to DB from CSV files through R script. All imported data have been QA/QC’ed. A final effort trackline is available. An Access front-end is available for viewing/querying data. |
| **Ice Seals: 2024** | surv_ice_seals_2024 | Image metadata/inventory and annotations from 2024 Ice Seal Surveys (in Bering Sea). |
| **Ice Seals: 2025** | surv_ice_seals_2025 | To be created/populated after 2025 surveys. |
| **JoBSS** | surv_jobss | Image metadata/inventory and annotations from 2021 Beaufort Sea Surveys. An Access front-end is available for viewing/querying data. |
| **Ice Seal: Polar Bear** | surv_polar_bear | Image metadata/inventory and annotations from 2019 surveys out of Kotzebue and Deadhorse targeting polar bears. |
| **Harbor Seal: Coastal Surveys** | surv_pv_cst | Image metadata/inventory and counts from 2004-2021 are currently QA/QC’ed and available in the DB. ADF&G, NOAA 1996-1997, NOAA 1998-2002 and NOAA 2003 data are available in the DB. Authoritative Iliamna and Pribilof data are also actively managed within this schema. An Access front-end is available for managing data. |
| **Harbor Seal: Glacial Surveys** | surv_pv_gla | Flight data and counts from surveys are available in the DB. An Access front-end is available for managing data. |
| **Ice Seal: Kotz** | surv_test_kotz | Image data and annotations from 2019 testing of in-flight system out of Kotzebue. Flight 01 and R camera views should not be used for any AI/ML. |
| **Telemetry** | telem | All field-related telemetry information has been entered into the DB. Tag inventory and deployment information is managed through the Capture Access front-end. |
\*The sea ice data stored in the pep database are the CDR and sea ice extent products. The original versions of these datasets are also on the network, along with the Bootstrap and Nimbus datasets. The CDR and sea ice extent products should be the go-to sea ice data, which is why they are processed and available in the pep database.
#### Naming Conventions
General Guidelines
- Coordinate with S. Koslovsky before adding new data tables to the database or if you need a stored query added to the database.
- Do not use spaces in database item names. Use underscores between words.
- Data for analyses:
- If you are using data products that are finalized (e.g. CHESS survey) and you need data processed prior to use, the processed data should be stored in a view/query.
- If you are using data products that will continue to be updated (e.g. telemetry data) and you need the processed data preserved, the processed data should either be stored in the database as a separate table or be exported and stored elsewhere (network or computer) with other associated files.
- For tables
- lku\_ - use this prefix for look-up tables.
- geo\_ - use this prefix for rasters, basemaps, and other spatially explicit reference datasets stored in the postgres DB. If you have a data table that includes spatial data, use the tbl\_ prefix instead.
- tbl\_ - use this prefix for data tables. These can be spatially enabled.
- res\_ - use this prefix for results tables. These are tables derived from data stored in tables (tbl\_) that need to be stored but are not “raw” data. These should have a clear indication of what they are and, if appropriate, a date.
- For views
- qa\_ - use this prefix for QA/QC queries.
- res\_ - use this prefix for views that are for analytical purposes.
- Anything else – use logical consistency for any other views stored.
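For illustration, here are some hypothetical object names that follow these conventions (none of these are guaranteed to exist in the pep DB):

- `lku_species` - look-up table of species codes
- `tbl_sightings` - data table of survey sightings (possibly spatially enabled)
- `geo_analysis_grid` - gridded spatial reference layer
- `res_abundance_20240801` - stored results table, with date
- `qa_missing_counts` - QA/QC view flagging records with missing counts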
#### Accessing Data
##### **Spatial Data**
For viewing spatial data, you can access data from one of these platforms:
- pep_geodatabase
- ArcMap and ArcPro (to connect to the spatial data available on AGOL)
- ArcMap
- Click on the Add Data button, and select “Add Data From ArcGIS Online”
- In the upper right corner of the pop-up window, sign into ArcGIS online using your CAC credentials.
- Once you are logged in, you can add data shared with the PEP SpatialData group under “My Groups”.
- ArcPro
- You are automatically logged into ArcGIS online when you open ArcPro.
- Once you have created a new project, you can add data from the Portal. Data shared to the PEP Spatial Data group will be accessible from Groups under Portal.
- ArcGIS Online
- Once signed in to ArcGIS online, click on “Groups” at the top of the page.
- Click on the group “PEP Spatial Data” and open the item page for the dataset you want to download.
- On the right side of the page, go to “Export Data” and choose the format you want to export.
- An export box will appear with a default title for the data export. You can rename this if you’d like. You must enter at least one tag for your dataset to allow for export. Exported items from ArcGIS online are stored in the root folder of “My Content”. If exporting to shapefile, a compressed file (.zip) is created and can be downloaded and saved to your computer.
- QGIS (to connect to spatial data stored in the pep DB)
- Right click PostGIS, and select “New Connection…”
- In the Create a New PostGIS Connection window, specify the following options; then, click OK.
- Name: pep
- Service: (leave blank)
- Host: 161.55.120.122
- Port: 5432
- Database: pep
- Username: your user name (check box to save)
- Password: your password (check box to save)
- Once created, you will see a new connection to the database under PostGIS. The available data are listed under each schema, and only spatial data are visible.
For entering spatial data, process the data as instructed for each individual project.
##### **Tabular Data**
For viewing tabular data:
- Project-specific Access frontends are available for the following projects:
- Capture (for entering/viewing data entered from datasheets)
- Coastal Harbor Seal Surveys (for track level metadata)
- You can connect directly to the pep DB from R using the RPostgreSQL package; a starter sketch follows after this list.
- If you need access to or exports of other tabular data from the pep DB, contact S. Koslovsky.
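As a minimal sketch of the RPostgreSQL connection described above (substitute your own credentials; the schema and table in the query are hypothetical examples):

``` r
library(RPostgreSQL)

# Connect to the pep DB; use your own username and password
con <- dbConnect(PostgreSQL(),
                 dbname   = "pep",
                 host     = "161.55.120.122",
                 user     = "UserName",
                 password = "Password")

# Pull a table into a data frame, then close the connection
dat <- dbGetQuery(con, "SELECT * FROM capture.tbl_event")
dbDisconnect(con)
```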
##### pepDataConnect (R package)
```
install.packages("devtools")
devtools::install_github("StacieKoslovsky-noaa/pepDataConnect")
```
There are a lot of tables (and linkages among them) in the PEP database, which can be overwhelming to learn and can lead to potential issues in how data are linked together or extracted from the DB. The pepDataConnect R package was created to connect PEP staff to the database easily and to ensure the quality of the data being retrieved. The R code snippet below walks through installing the package and loading a data table from the DB. There is *extensive* documentation in the package's GitHub repository.
``` r
# Getting started with pepDataConnect
remotes::install_github('staciekoslovsky-noaa/pepDataConnect')

# To connect to the database, create a connection
con <- pepDataConnect::pep_connect()

# To load data into your R workspace, use one of the table functions
data <- pepDataConnect::surv_jobss.tbl_detections_processed_ir(con)
```
Currently, the pepDataConnect R package has been set-up to interact with the database for the following schemas:
- Coastal harbor seal survey data (surv_pv_cst)
- JoBSS ice seal survey (surv_jobss)
Other schemas will slowly be added to the R package, as prioritized each year.
#### Backup Procedure
The pep DB will be backed up according to the following schedule:
- Daily directory backup: The pep DB will be backed up daily to tape. The last 5 days will also be stored on the VM.
- Monthly full VM backup: The virtual machine on which the pep DB resides will be fully backed up to tape monthly. The last 3 months will also be stored on the VM.
The DB backups will be tested every 6 months (March and September) to ensure they are working properly, and data can be restored.
## Frequently Used Program-wide Tools
<details>
<summary>Click to Expand</summary>
### PEP Dashboard
The PEP dashboard is a schema (administrative) within the PEP PostgreSQL database that is used for tracking the roles of staff on projects and associated timelines and for entering responsibilities into the PAWS performance planning system. For more information:
- Documentation
- [Weekly report](https://htmlpreview.github.io/?https://github.com/staciekoslovsky-noaa/PEP_Dashboard/blob/main/SchedulingReport/SchedulingReport.html)
- Contact M. Cameron
### Inventory Database
This database is a schema (inventory) within the PEP PostgreSQL database and tracks inventory of gear and supplies used by staff in PEP for projects. For more information:
- Documentation
- Contact H. Ziel
### Extensis Portfolio
Extensis Portfolio is software used for tagging and querying images taken during PEP projects. For more information:
- Documentation
- Contact S. Walcott
### Zotero
Zotero houses a central library of journal articles, reports and other references. For more information:
- Documentation
- Contact S. Walcott
# Off-boarding
(this section to be completed after PEP Data Days 2025)