Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance for getSamplingFeatureDatasets #144

Open
Elijahwalkerwest opened this issue Jan 30, 2018 · 6 comments
Open

Improve performance for getSamplingFeatureDatasets #144

Elijahwalkerwest opened this issue Jan 30, 2018 · 6 comments

Comments

@Elijahwalkerwest
Copy link
Contributor

I have already utilized eager loading for the SQLALchemy query in the endpoint, but due to the nature of the data needed to be gathered, the current structure is incredibly slow as more Samplingfeatures are queried for. Need to make it usable for a query of all SamplingFeatures of a given Type.

@horsburgh
Copy link
Member

@emiliom and @lsetiawan - this is the issue I was referring to on the phone today. I'm not sure it should be titled "improve performance", but rather may require a rethinking of the current approach.

@emiliom
Copy link
Member

emiliom commented Jan 30, 2018

Thanks @Elijahwalkerwest and @horsburgh

@lsetiawan and I may or may not have time this week (after today) to give input. But if you can point us to the branch and code where @Elijahwalkerwest's current / latest implementation can be seen, that'll be helpful. Again, no promises!

@Elijahwalkerwest
Copy link
Contributor Author

Elijahwalkerwest commented Jan 30, 2018

The most recent work should be in the development branch. I've been working on a few iterations to try and resolve this but nothing that has worked thus far, and so isn't up on github yet.

Added by Emilio, for convenience: https://github.com/ODM2/ODM2PythonAPI/blob/development/odm2api/ODM2/services/readService.py#L969

@emiliom
Copy link
Member

emiliom commented Feb 1, 2018

Don and I have taken a quick look. Unfortunately we won't have time to help out on this possibly through next week (we have a hands-on workshop late next week that I'm co-organizing).

In the meantime, two things come to mind:

  • If you have an equivalent SQL SELECT statement that you've developed and run directly on the ODM2 database, that is reasonably fast (and I assume not directly translatable to SQLAlchemy queries), please share it here.
  • In general (and based in part on some quick looks from Don), it seems like getSamplingFeatureDatasets tries to get a ton of information all at once, and the outcome is inevitably large. We wonder if some sort of break up into smaller, complementary functions may be necessary.

@Elijahwalkerwest
Copy link
Contributor Author

result = self._session_factory.engine.execute("SELECT * FROM odm2.samplingfeatures as SF
LEFT JOIN odm2.results as R on R.FeatureActionID in (
SELECT FeatureActionID
FROM odm2.featureactions as FA
WHERE FA.SamplingFeatureID = SF.SamplingFeatureID
)
LEFT JOIN odm2.datasets as DS on DS.DatasetID in (
SELECT DatasetID
FROM odm2.datasetsresults as DR
WHERE R.ResultID = DR.ResultID
)
WHERE SF.SamplingFeatureID=sfid")

Got SQLAlchemy to run raw query, this is the query I'm using. This look right to you guys?

@Elijahwalkerwest
Copy link
Contributor Author

Updated version of this SQL query that is currently working, but is VERY SLOW.

Also I'm not sure if it's getting all the data needed. Here is the data that is needed for that endpoint.

DataSetID,
DataSetTitle,
DataSetAbstract,
ResultTypeCV,
SampledMediumCV,
VariableCode,
VariableNameCV,
startDate: minDate,
endDate: maxDate,
siteType: SiteTypeCV,
latitude:samplingFeature.related_features.Latitude,
longitude:samplingFeature.related_features.Longitude
SamplingFeatureCode,
SamplingFeatureName,

result = self._session_factory.engine.execute("SELECT * FROM odm2.samplingfeatures as SF
LEFT JOIN odm2.results as R on R.FeatureActionID in (
SELECT FeatureActionID
FROM odm2.featureactions as FA
WHERE FA.SamplingFeatureID = SF.SamplingFeatureID
)
LEFT JOIN odm2.datasets as DS on DS.DatasetID in (
SELECT DatasetID
FROM odm2.datasetsresults as DR
WHERE R.ResultID = DR.ResultID
)
WHERE SF.SamplingFeatureID in %s",
(
((sf_list),),
)
)

Currently this query is taking ~ 50 seconds PER Sampling feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants