LivingAtlas: Additional fields for SpeciesListPipeline (ARGA) #865

Open
wants to merge 23 commits into
base: dev
19 changes: 15 additions & 4 deletions livingatlas/README.md
Original file line number Diff line number Diff line change
@@ -67,21 +67,32 @@ These steps will load a dataset into a SOLR index.

1. Download shape files from [here](https://pipelines-shp.s3-ap-southeast-2.amazonaws.com/pipelines-shapefiles.zip) and expand them into the `/data/pipelines-shp` directory
1. Download a test Darwin Core archive (e.g. https://archives.ala.org.au/archives/gbif/dr893/dr893.zip)
1. Create the following directory `/data/pipelines-data`
1. Build with maven `mvn clean package`
2. Copy it to /data/biocache-load/dr893
1. mkdir /data/biocache-load
2. mkdir /data/biocache-load/dr893
3. curl https://archives.ala.org.au/archives/gbif/dr893/dr893.zip -o /data/biocache-load/dr893/dr893.zip
4. Create the following directory `/data/pipelines-data`
5. Build with maven `mvn clean package`
6. Download vocabularies
1. mkdir /data/pipelines-vocabularies
2. cd /data/pipelines-vocabularies
3. curl -sS https://api.gbif.org/v1/vocabularies/DegreeOfEstablishment/releases/LATEST/export > DegreeOfEstablishment.json
4. curl -sS https://api.gbif.org/v1/vocabularies/LifeStage/releases/LATEST/export > LifeStage.json
5. curl -sS https://api.gbif.org/v1/vocabularies/EstablishmentMeans/releases/LATEST/export > EstablishmentMeans.json
6. curl -sS https://api.gbif.org/v1/vocabularies/Pathway/releases/LATEST/export > Pathway.json

### Running la-pipelines

1. Start required docker containers using
```bash
docker-compose -f pipelines/src/main/docker/ala-name-service.yml up -d
docker-compose -f pipelines/src/main/docker/solr8.yml up -d
docker-compose -f pipelines/src/main/docker/ala-sensitive-data-service.yml
docker-compose -f pipelines/src/main/docker/ala-sensitive-data-service.yml up -d
```
Note: `ala-sensitive-data-service.yml` can be omitted if you don't need to run the SDS pipeline, but you'll then need to add
```yaml
index:
includeSensitiveData: false
includeSensitiveDataChecks: false
```
to the file `configs/la-pipelines-local.yaml`.
1. `cd scripts`
2 changes: 2 additions & 0 deletions livingatlas/configs/la-pipelines.yaml
Original file line number Diff line number Diff line change
@@ -192,6 +192,8 @@ speciesLists:
maxDownloadAgeInMinutes: 1440
includeConservationStatus: true
includeInvasiveStatus: true
includeTaxonPresentInCountry: false
includeTraits: false

# Sampling specific configuration
sampling:
2 changes: 2 additions & 0 deletions livingatlas/pipelines/src/main/docker/solr8.yml
Original file line number Diff line number Diff line change
@@ -15,3 +15,5 @@ services:
- "start"
- "-cloud"
- "-f"
restart: on-failure
platform: linux/amd64
Original file line number Diff line number Diff line change
@@ -36,6 +36,8 @@
* <li>Links to species lists for records
* <li>stateProvince and country associated conservation status for the record
* <li>stateProvince and country associated invasive status for the record
* <li>optional `taxonPresentInCountry` flag for the record
* <li>optional species `trait` values for the record
* </ul>
*
* This pipeline is left for debug purposes only. Species lists are joined to the records in the
@@ -149,6 +151,8 @@ public KV<String, String> apply(KV<String, ALATaxonRecord> record) {

final boolean includeConservationStatus = options.getIncludeConservationStatus();
final boolean includeInvasiveStatus = options.getIncludeInvasiveStatus();
final boolean includeTaxonPresentInCountry = options.getIncludeTaxonPresentInCountry();
final boolean includeTraits = options.getIncludeTraits();

// join collections
return result.apply(
@@ -167,7 +171,11 @@ public void processElement(ProcessContext c) {
if (speciesLists != null) {
TaxonProfile.Builder builder =
SpeciesListUtils.createTaxonProfileBuilder(
speciesLists, includeConservationStatus, includeInvasiveStatus);
speciesLists,
includeConservationStatus,
includeInvasiveStatus,
includeTaxonPresentInCountry,
includeTraits);
// output a link to each occurrence record we've matched by taxonID
for (String occurrenceID : occurrenceIDs) {
builder.setId(occurrenceID);
Original file line number Diff line number Diff line change
@@ -137,7 +137,9 @@ public static Map<String, TaxonProfile> generateTaxonProfileCollection(
alaTaxonRecord,
speciesListMap,
options.getIncludeConservationStatus(),
options.getIncludeInvasiveStatus()))
options.getIncludeInvasiveStatus(),
options.getIncludeTaxonPresentInCountry(),
options.getIncludeTraits()))
.collect(Collectors.toList());

return profiles.stream()
@@ -152,15 +154,21 @@ static TaxonProfile convertToTaxonProfile(
ALATaxonRecord alaTaxonRecord,
Map<String, List<SpeciesListRecord>> speciesListMap,
boolean includeConservationStatus,
boolean includeInvasiveStatus) {
boolean includeInvasiveStatus,
boolean includeTaxonPresentInCountry,
boolean includeTraits) {

Iterable<SpeciesListRecord> speciesLists =
speciesListMap.get(alaTaxonRecord.getTaxonConceptID());

if (speciesLists != null) {
TaxonProfile.Builder builder =
SpeciesListUtils.createTaxonProfileBuilder(
speciesLists, includeConservationStatus, includeInvasiveStatus);
speciesLists,
includeConservationStatus,
includeInvasiveStatus,
includeTaxonPresentInCountry,
includeTraits);
builder.setId(alaTaxonRecord.getId());
return builder.build();
} else {
Original file line number Diff line number Diff line change
@@ -33,4 +33,14 @@ public interface SpeciesLevelPipelineOptions extends InterpretationPipelineOptio
Boolean getIncludeInvasiveStatus();

void setIncludeInvasiveStatus(Boolean includeInvasiveStatus);

@Default.Boolean(false)
Boolean getIncludeTaxonPresentInCountry();

void setIncludeTaxonPresentInCountry(Boolean includeTaxonPresentInCountry);

@Default.Boolean(false)
Boolean getIncludeTraits();

void setIncludeTraits(Boolean includeTraits);
}
Original file line number Diff line number Diff line change
@@ -52,7 +52,6 @@ public interface IndexFields {
String POINT_0_1 = "point-0.1";
String POINT_1 = "point-1";
String PROVENANCE = "provenance";
String TAXON_RANK = "taxonRank";
String RAW_STATE_CONSERVATION = "raw_stateConservation";
String RECORDED_BY_ID = "recordedByID";
String SENSITIVE = "sensitive";
@@ -67,10 +66,15 @@ public interface IndexFields {
String SUBSPECIES = "subspecies";
String SUBSPECIES_ID = "subspeciesID";
String TAXONOMIC_ISSUES = "taxonomicIssues";
String TAXON_PRESENT_IN_COUNTRY = "taxonPresentInCountry";
String TAXON_RANK = "taxonRank";
String VIDEO_IDS = "videoIDs";

String DYNAMIC_PROPERTIES_PREFIX = "dynamicProperties_";
String GGBN_TERMS_LOAN = "http://data.ggbn.org/schemas/ggbn/terms/Loan";
String LOAN_DESTINATION_TERM = "http://data.ggbn.org/schemas/ggbn/terms/loanDestination";
String LOAN_IDENTIFIER_TERM = "http://data.ggbn.org/schemas/ggbn/terms/loanIdentifier";
String AUS_TRAITS_FIRE_RESPONSE = "fire_response";
Collaborator review comment: Convert from snake case to camel

String AUS_TRAITS_POST_FIRE_RECRUITMENT = "post_fire_recruitment";
String AUS_TRAITS_PHOTOSYNTHETIC_PATHWAY = "photosynthetic_pathway";
}
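
The review comment above asks for the snake-case trait names to be converted to camel case. A minimal sketch of such a conversion is shown below. `SnakeToCamelSketch` and `snakeToCamel` are hypothetical names for illustration and are not part of this PR. Note that the indexing code in this PR appears to match incoming trait names against the `IndexFields` values, so renaming the constants would presumably also require renaming the matching Solr fields and normalising the trait names read from the species lists.

```java
final class SnakeToCamelSketch {

  private SnakeToCamelSketch() {}

  /** Converts e.g. "post_fire_recruitment" to "postFireRecruitment". */
  static String snakeToCamel(String snake) {
    StringBuilder out = new StringBuilder(snake.length());
    boolean upperNext = false;
    for (int i = 0; i < snake.length(); i++) {
      char ch = snake.charAt(i);
      if (ch == '_') {
        // Drop the underscore; capitalise the next letter unless nothing has been emitted yet
        upperNext = out.length() > 0;
      } else if (upperNext) {
        out.append(Character.toUpperCase(ch));
        upperNext = false;
      } else {
        out.append(ch);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String[] names = {"fire_response", "post_fire_recruitment", "photosynthetic_pathway"};
    for (String name : names) {
      System.out.println(name + " -> " + snakeToCamel(name));
    }
  }
}
```

For the three constants in this hunk the helper yields `fireResponse`, `postFireRecruitment` and `photosyntheticPathway`.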
Original file line number Diff line number Diff line change
@@ -805,6 +805,51 @@ private static void addSpeciesListInfo(
}
}
}

// index taxonPresentInCountry
if (tpr.getTaxonPresentInCountry() != null) {
indexRecord.getStrings().put(TAXON_PRESENT_IN_COUNTRY, tpr.getTaxonPresentInCountry());
}

// taxon-level traits from speciesLists
Map<String, String> traits = tpr.getTraits();
for (Map.Entry<String, String> trait : traits.entrySet()) {
if (trait.getKey() != null) {
// save to a <Map> for dynamic-properties fallback
Map<String, String> traitMap = new HashMap<>();
traitMap.put(trait.getKey(), trait.getValue());
// check if traitName is declared as a value in @au.org.ala.pipelines.transforms.IndexFields
java.lang.reflect.Field[] fields = IndexFields.class.getDeclaredFields();
boolean isTraitInDeclaredFields = false;
// Check each <IndexFields> field value to see if it matches the current trait name
for (java.lang.reflect.Field f : fields) {
String strValue = null;
try {
strValue = (String) f.get(null);
} catch (IllegalAccessException e) {
// Don't throw an exception - log.warn and fall through to the next field
log.warn(
"addSpeciesListInfo() - failed to get value for <IndexFields> field: "
+ f.getName()
+ ", with exception: "
+ e.getMessage());
}
if (trait.getKey().equals(strValue)) {
isTraitInDeclaredFields = true;
break;
}
}
// Dirty data may contain duplicate entries, so de-duplicate via a Set first
Set<String> traitValuesSet = new HashSet<>(Arrays.asList(trait.getValue().split("\\|")));
List<String> traitValuesList = new ArrayList<>(traitValuesSet);
// Add to indexedRecord either as multivalues or dynamicProperties
if (isTraitInDeclaredFields) {
addIfNotEmpty(indexRecord, trait.getKey(), traitValuesList);
} else {
indexRecord.setDynamicProperties(traitMap);
}
}
}
}
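
The trait handling above decides, per trait, whether the name is a declared index field (indexed as a multi-valued field) or falls back to dynamic properties. A minimal sketch of the same routing is shown below, under some stated assumptions: `DemoIndexFields` is a hypothetical stand-in for `IndexFields`, the output maps stand in for the index record, and the trait values are made up. The sketch differs from the diff in three ways that may or may not suit the real pipeline: it reads the constants by reflection once rather than per trait, it trims values and keeps first-seen order when de-duplicating, and it accumulates the fallback properties in one map instead of replacing the map for each trait.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TraitRoutingSketch {

  /** Hypothetical stand-in for IndexFields; only two constants are reproduced. */
  interface DemoIndexFields {
    String TAXON_RANK = "taxonRank";
    String AUS_TRAITS_FIRE_RESPONSE = "fire_response";
  }

  /** Collects the values of all static String constants declared on a type. */
  static Set<String> declaredStringValues(Class<?> type) {
    Set<String> values = new HashSet<>();
    for (Field field : type.getDeclaredFields()) {
      if (field.getType() != String.class || !Modifier.isStatic(field.getModifiers())) {
        continue;
      }
      try {
        Object value = field.get(null);
        if (value != null) {
          values.add((String) value);
        }
      } catch (IllegalAccessException e) {
        // An unreadable constant is skipped rather than failing the record
      }
    }
    return values;
  }

  /** Splits a pipe-delimited value, dropping blanks and duplicates, keeping first-seen order. */
  static List<String> splitDistinct(String raw) {
    Set<String> distinct = new LinkedHashSet<>();
    for (String part : raw.split("\\|")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        distinct.add(trimmed);
      }
    }
    return new ArrayList<>(distinct);
  }

  /** Routes each trait to a declared multi-valued field or to the dynamic-properties fallback. */
  static void route(
      Map<String, String> traits,
      Set<String> declared,
      Map<String, List<String>> multiValues,
      Map<String, String> dynamicProperties) {
    for (Map.Entry<String, String> trait : traits.entrySet()) {
      if (trait.getKey() == null || trait.getValue() == null) {
        continue;
      }
      if (declared.contains(trait.getKey())) {
        multiValues.put(trait.getKey(), splitDistinct(trait.getValue()));
      } else {
        dynamicProperties.put(trait.getKey(), trait.getValue());
      }
    }
  }

  public static void main(String[] args) {
    Set<String> declared = declaredStringValues(DemoIndexFields.class);
    Map<String, String> traits = new LinkedHashMap<>();
    traits.put("fire_response", "resprouts|fire_killed|resprouts"); // made-up values
    traits.put("leaf_shape", "ovate");
    traits.put("seed_mass", "12");
    Map<String, List<String>> multiValues = new LinkedHashMap<>();
    Map<String, String> dynamicProperties = new LinkedHashMap<>();
    route(traits, declared, multiValues, dynamicProperties);
    System.out.println(multiValues);
    System.out.println(dynamicProperties);
  }
}
```

With the made-up input in `main`, `fire_response` ends up as a two-element list and both undeclared traits are kept in the fallback map.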

private static MultimediaIndexRecord convertToMultimediaRecord(String uuid, Image image) {
Original file line number Diff line number Diff line change
@@ -1,9 +1,7 @@
package au.org.ala.pipelines.util;

import com.google.common.base.Strings;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.*;
Contributor Author review comment: Just noticed this - IDE did this and it might break coding rules?

import lombok.AccessLevel;
import lombok.NoArgsConstructor;
import org.gbif.pipelines.io.avro.*;
@@ -12,19 +10,25 @@
@NoArgsConstructor(access = AccessLevel.PRIVATE)
public class SpeciesListUtils {

private static final String LIST_COMMON_TRAIT = "COMMON_TRAIT";

/**
* Creates a reusable template (Builder) for a TaxonProfile based on the supplied species lists.
*/
public static TaxonProfile.Builder createTaxonProfileBuilder(
Iterable<SpeciesListRecord> speciesLists,
boolean includeConservationStatus,
boolean includeInvasiveStatus) {
boolean includeInvasiveStatus,
boolean includeTaxonPresentInCountry,
boolean includeTraits) {

Iterator<SpeciesListRecord> iter = speciesLists.iterator();

List<String> speciesListIDs = new ArrayList<>();
List<ConservationStatus> conservationStatusList = new ArrayList<>();
List<InvasiveStatus> invasiveStatusList = new ArrayList<>();
String taxonPresentInCountryValue = null;
Map<String, String> traitsMap = new HashMap<>();

while (iter.hasNext()) {

@@ -48,6 +52,13 @@ public static TaxonProfile.Builder createTaxonProfileBuilder(
.setSpeciesListID(speciesListRecord.getSpeciesListID())
.setRegion(speciesListRecord.getRegion())
.build());
} else if (includeTaxonPresentInCountry
&& speciesListRecord.getTaxonPresentInCountry() != null) {
taxonPresentInCountryValue = speciesListRecord.getTaxonPresentInCountry();
} else if (includeTraits
&& LIST_COMMON_TRAIT.equals(speciesListRecord.getListType())
&& speciesListRecord.getTraitName() != null) {
traitsMap.put(speciesListRecord.getTraitName(), speciesListRecord.getTraitValue());
}
}

@@ -56,6 +67,8 @@ public static TaxonProfile.Builder createTaxonProfileBuilder(
builder.setSpeciesListID(speciesListIDs);
builder.setConservationStatuses(conservationStatusList);
builder.setInvasiveStatuses(invasiveStatusList);
builder.setTaxonPresentInCountry(taxonPresentInCountryValue);
builder.setTraits(traitsMap);
return builder;
}
}
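
The two new branches in `createTaxonProfileBuilder` can be sketched in isolation as follows. This is an illustration under assumptions, not the PR's code: `DemoRecord` and `Result` are hypothetical stand-ins for the Avro-generated `SpeciesListRecord` and the builder state, and the conservation and invasive branches that precede these in the real method are left out. The sketch puts the constant on the left of `equals` so that a null `listType` (the schema declares it nullable) is simply a non-match, and it skips null trait values on the assumption that the `traits` map, declared with `string` values, should not hold nulls.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TraitAccumulationSketch {

  static final String LIST_COMMON_TRAIT = "COMMON_TRAIT";

  /** Hypothetical stand-in for SpeciesListRecord; only the fields used below. */
  static final class DemoRecord {
    final String listType;
    final String taxonPresentInCountry;
    final String traitName;
    final String traitValue;

    DemoRecord(String listType, String taxonPresentInCountry, String traitName, String traitValue) {
      this.listType = listType;
      this.taxonPresentInCountry = taxonPresentInCountry;
      this.traitName = traitName;
      this.traitValue = traitValue;
    }
  }

  /** Hypothetical holder for the two values the builder receives. */
  static final class Result {
    String taxonPresentInCountry;
    final Map<String, String> traits = new HashMap<>();
  }

  static Result accumulate(
      Iterable<DemoRecord> records, boolean includeTaxonPresentInCountry, boolean includeTraits) {
    Result result = new Result();
    for (DemoRecord record : records) {
      if (includeTaxonPresentInCountry && record.taxonPresentInCountry != null) {
        // The last record carrying a value wins
        result.taxonPresentInCountry = record.taxonPresentInCountry;
      } else if (includeTraits
          && LIST_COMMON_TRAIT.equals(record.listType) // null-safe: constant on the left
          && record.traitName != null
          && record.traitValue != null) {
        // Map.put replaces any earlier value stored under the same trait name
        result.traits.put(record.traitName, record.traitValue);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<DemoRecord> records =
        Arrays.asList(
            new DemoRecord(null, "Australia", null, null), // made-up rows
            new DemoRecord("COMMON_TRAIT", null, "fire_response", "resprouts"),
            new DemoRecord("COMMON_TRAIT", null, "fire_response", "fire_killed"));
    Result result = accumulate(records, true, true);
    System.out.println(result.taxonPresentInCountry);
    System.out.println(result.traits);
  }
}
```

One consequence worth noting: when the same trait name occurs on several records, only the last value survives, because `Map.put` overwrites. If all values are wanted, they would need to be merged, for example by joining them with the `|` separator that the indexing code later splits on.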
Original file line number Diff line number Diff line change
@@ -26,6 +26,7 @@ public class SpeciesList {
boolean isAuthoritative;
boolean isInvasive;
boolean isThreatened;
String taxonPresentInCountry;

@JsonPOJOBuilder(withPrefix = "")
@JsonIgnoreProperties(ignoreUnknown = true)
Original file line number Diff line number Diff line change
@@ -132,6 +132,9 @@ public static void run(SpeciesLevelPipelineOptions options) throws IOException {
int guidIdx = columnHeaders.indexOf("guid");
int statusIdx = columnHeaders.indexOf("status");
int sourceStatusIdx = columnHeaders.indexOf("sourceStatus");
int countIdx = columnHeaders.indexOf("count");
int traitNameIdx = columnHeaders.indexOf("traitName");
int traitValueIdx = columnHeaders.indexOf("traitValue");

String region = null;

@@ -164,16 +167,30 @@ public static void run(SpeciesLevelPipelineOptions options) throws IOException {

String status = statusIdx > 0 ? currentLine[statusIdx] : null;
String sourceStatus = sourceStatusIdx > 0 ? currentLine[sourceStatusIdx] : null;
String count = countIdx > 0 ? currentLine[countIdx] : null;
String traitName = traitNameIdx > 0 ? currentLine[traitNameIdx] : null;
String traitValue = traitValueIdx > 0 ? currentLine[traitValueIdx] : null;
// ARGA addition to set `taxonPresentInCountry` to the value specified in the list's
// `region` attribute, when list has type "OTHER", has region set and
// contains a `count` column (note: count not currently used)
String taxonPresentInCountry =
("OTHER".equals(list.getListType()) && region != null && count != null)
? region
: null;

SpeciesListRecord speciesListRecord =
SpeciesListRecord.newBuilder()
.setTaxonID(taxonID)
.setSpeciesListID(list.getDataResourceUid())
.setStatus(status)
.setRegion(region)
.setListType(list.getListType())
.setIsInvasive(list.isInvasive())
.setIsThreatened(list.isThreatened())
.setSourceStatus(sourceStatus)
.setTaxonPresentInCountry(taxonPresentInCountry)
.setTraitName(traitName)
.setTraitValue(traitValue)
.build();
dataFileWriter.append(speciesListRecord);
taxaRead++;
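
The column lookup and the `taxonPresentInCountry` rule in this hunk can be sketched as two small helpers. `ColumnLookupSketch`, `cell` and `taxonPresentInCountry` are hypothetical names, and the header and row values are made up. `List.indexOf` returns `-1` for a missing column, so the sketch tests `idx >= 0`. The hunk tests `> 0`, which also treats a column in position 0 as absent. That may be deliberate if the first column is never one of these, but it is an assumption worth checking.

```java
import java.util.Arrays;
import java.util.List;

final class ColumnLookupSketch {

  private ColumnLookupSketch() {}

  /** Returns the cell for a named column, or null when the column is absent or the row is short. */
  static String cell(List<String> headers, String[] row, String column) {
    int idx = headers.indexOf(column); // -1 when the column is not present
    return (idx >= 0 && idx < row.length) ? row[idx] : null;
  }

  /**
   * Mirrors the rule described in the hunk's comment: use the list's region when the list has
   * type "OTHER", has a region, and the row has a value in the count column.
   */
  static String taxonPresentInCountry(String listType, String region, String count) {
    return ("OTHER".equals(listType) && region != null && count != null) ? region : null;
  }

  public static void main(String[] args) {
    List<String> headers = Arrays.asList("count", "guid", "traitName"); // made-up header order
    String[] row = {"3", "urn:example:taxon", "fire_response"};

    String count = cell(headers, row, "count"); // column 0 is still found
    String traitValue = cell(headers, row, "traitValue"); // absent column gives null

    System.out.println(count);
    System.out.println(traitValue);
    System.out.println(taxonPresentInCountry("OTHER", "Australia", count));
    System.out.println(taxonPresentInCountry(null, "Australia", count));
  }
}
```

The bounds check on `row.length` is an addition over the hunk: it guards against rows that have fewer cells than the header, which the original indexing would turn into an `ArrayIndexOutOfBoundsException`.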
6 changes: 6 additions & 0 deletions livingatlas/solr/conf/managed-schema
Original file line number Diff line number Diff line change
@@ -191,6 +191,12 @@
<field name="stateConservation" type="string" docValues="true" multiValued="false" indexed="true" />
<field name="countryConservation" type="string" docValues="true" multiValued="false" indexed="true" />

<!-- ARGA fields -->
<field name="taxonPresentInCountry" type="string" docValues="true" multiValued="false" indexed="true" />
<field name="fire_response" type="string" docValues="true" multiValued="true" indexed="true" />
<field name="post_fire_recruitment" type="string" docValues="true" multiValued="true" indexed="true" />
<field name="photosynthetic_pathway" type="string" docValues="true" multiValued="true" indexed="true" />

<!-- Additional invasive fields -->
<field name="stateInvasive" type="string" docValues="true" multiValued="false" indexed="true" />
<field name="countryInvasive" type="string" docValues="true" multiValued="false" indexed="true" />
6 changes: 5 additions & 1 deletion sdks/models/src/main/avro/specific/species-list-record.avsc
Original file line number Diff line number Diff line change
@@ -8,8 +8,12 @@
{"name": "speciesListID","type":"string"},
{"name": "isThreatened", "type": "boolean"},
{"name": "isInvasive", "type": "boolean"},
{"name": "listType", "type": ["null", "string"], "default": null },
{"name": "region", "type": ["null", "string"], "default": null },
{"name": "status", "type": ["null", "string"]},
{"name": "sourceStatus", "type": ["null", "string"]}
{"name": "sourceStatus", "type": ["null", "string"]},
{"name": "taxonPresentInCountry", "type": ["null", "string"], "default": null},
{"name": "traitName", "type": ["null", "string"], "default": null},
{"name": "traitValue", "type": ["null", "string"], "default": null}
]
}
4 changes: 3 additions & 1 deletion sdks/models/src/main/avro/specific/taxon-profile.avsc
Original file line number Diff line number Diff line change
@@ -30,7 +30,9 @@
{"name": "id", "type": ["null", "string"]},
{"name": "speciesListID", "type": {"type" : "array", "items" : "string"}, "default" : []},
{"name": "conservationStatuses", "type": {"type" : "array", "items" : "ConservationStatus"}, "default" : []},
{"name": "invasiveStatuses", "type": {"type" : "array", "items" : "InvasiveStatus"}, "default" : []}
{"name": "invasiveStatuses", "type": {"type" : "array", "items" : "InvasiveStatus"}, "default" : []},
{"name": "taxonPresentInCountry", "type": ["null", "string"], "default" : null},
{"name": "traits", "type": {"type": "map", "values": "string"}, "default" : {}}
]
}
]