-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regarding queries that use PERCENTILE_DISC() #10
Comments
Yes indeed this is why these queries fail in our SQL Server tests. @schuemie any suggestions? |
Yes, there is no consistency between database platforms on this. Not only do the functions have very different syntax, they also create very different output. For this reason, applications like Achilles and FeatureExtraction avoid the PERCENTILE_DISC function altogether, and instead create a temporary ordered table of values, and use that to compute the percentiles. Here is example code from FeatureExtraction. It is not pretty though. |
Urggh, just exploring the possibilities of adding translation rules. Here is a correctly formulated query to compute the median year of birth per gender in SQL Server: SELECT DISTINCT gender_concept_id,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY year_of_birth) OVER (PARTITION BY gender_concept_id)
FROM person; The equivalent in PostGreSQL is: SELECT DISTINCT gender_concept_id,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY year_of_birth)
FROM person
GROUP BY gender_concept_id; (note that DISTINCT is not really required in Postgres). In SQL Server, you can't have an ORDER BY clause because it will complain about the year_of_birth variable not being in an aggregate function or GROUP BY clause. You must have DISTINCT, or else you'll get the same results duplicated over and over again (presumably for each row in the data). In PostgreSQL, you can't have the OVER clause. Instead, you must have a GROUP BY clause to partition the data. Surprisingly, both queries run fine on Oracle. |
Here's the equivalent SQL that avoids PERCENTILE_DISC functions: SELECT ordered_data.gender_concept_id,
MIN(CASE WHEN order_nr < .50 * population_size THEN 9999 ELSE year_of_birth END) AS median
FROM (
SELECT gender_concept_id,
year_of_birth,
ROW_NUMBER() OVER (PARTITION BY gender_concept_id ORDER BY year_of_birth) order_nr
FROM person
) ordered_data
INNER JOIN (
SELECT gender_concept_id,
COUNT(*) AS population_size
FROM person
GROUP BY gender_concept_id
) population_sizes
ON ordered_data.gender_concept_id = population_sizes.gender_concept_id
GROUP BY ordered_data.gender_concept_id; As I said: not very pretty. But it will run on all platforms. |
Thanks Martijn, this shows why we need the QueryLibrary :) It is important it should return the same results on all DBMS. (Where do we explain to our users they should not use this function? clearly in OMOP times we did not know) |
The function is not listed in the list of functions and structures supported by SqlRender. But I can forgive people for not having noticed that ;-) One idea is to create a reference document for OHDSI SQL that treats OHDSI SQL as its own dialect (a subset of SQL Server SQL) and describes all syntax that is supported. |
thanks a lot for your input @schuemie Indeed, not the nicest query but at least with this alternative function we can keep the query platform agnostic ;) . I will update the queries next week |
…stic way to estimate the percentiles (see thread #10) and modidying as well the way that the age of a person given a condition is calculated
Hi all,
I noticed that a significant amount of the queries (CO21, DEX26, DEX27, DEX29, DEX31, DEX34, DEX36, DEX41, OP07, OP08, OP11, OP12, OP19, OP20) make use of the PERCENTILE_DISC() function. All these queries failed the test due to the function must be called in different ways depending on the SQL platform.
The main reason is that percentile_disc takes an OVER partition by clause as an argument in Microsoft SQL server, whereas in Postgres it is not an argument.
A few examples:
postgresql: PERCENTILE_DISC(0.25) WITHIN GROUP(ORDER BY age)
oracle/sql server: PERCENTILE_DISC(0.25) WITHIN GROUP(ORDER BY age) OVER()
In Postgres the statement PERCENTILE_DISC(0.25) WITHIN GROUP(ORDER BY age) OVER() fails due to the OVER() clause, but in sql server it's a must.
I don't really know how we can approach this issue because in the SqlRender package (https://github.com/OHDSI/SqlRender/blob/master/inst/csv/replacementPatterns.csv) there's not any rule that make usage of percentile_disc.
Does someone have a possible/alternative solution for this? Otherwise, it may be wise to raise this in the Sqlrender repository, given that this may not be a difficult rule to incorporate in the replacement patterns.
The text was updated successfully, but these errors were encountered: