Regarding queries that use PERCENTILE_DISC() #10

tropicalberto · 2019-03-15T08:50:00Z

Hi all,

I noticed that a significant amount of the queries (CO21, DEX26, DEX27, DEX29, DEX31, DEX34, DEX36, DEX41, OP07, OP08, OP11, OP12, OP19, OP20) make use of the PERCENTILE_DISC() function. All these queries failed the test due to the function must be called in different ways depending on the SQL platform.
The main reason is that percentile_disc takes an OVER partition by clause as an argument in Microsoft SQL server, whereas in Postgres it is not an argument.

A few examples:

postgresql: PERCENTILE_DISC(0.25) WITHIN GROUP(ORDER BY age)
oracle/sql server: PERCENTILE_DISC(0.25) WITHIN GROUP(ORDER BY age) OVER()

In Postgres the statement PERCENTILE_DISC(0.25) WITHIN GROUP(ORDER BY age) OVER() fails due to the OVER() clause, but in sql server it's a must.

I don't really know how we can approach this issue because in the SqlRender package (https://github.com/OHDSI/SqlRender/blob/master/inst/csv/replacementPatterns.csv) there's not any rule that make usage of percentile_disc.

Does someone have a possible/alternative solution for this? Otherwise, it may be wise to raise this in the Sqlrender repository, given that this may not be a difficult rule to incorporate in the replacement patterns.

PRijnbeek · 2019-03-15T08:55:29Z

Yes indeed this is why these queries fail in our SQL Server tests.

@schuemie any suggestions?

schuemie · 2019-03-15T09:39:53Z

Yes, there is no consistency between database platforms on this. Not only do the functions have very different syntax, they also create very different output. For this reason, applications like Achilles and FeatureExtraction avoid the PERCENTILE_DISC function altogether, and instead create a temporary ordered table of values, and use that to compute the percentiles. Here is example code from FeatureExtraction. It is not pretty though.

schuemie · 2019-03-15T10:10:18Z

Urggh, just exploring the possibilities of adding translation rules. Here is a correctly formulated query to compute the median year of birth per gender in SQL Server:

SELECT DISTINCT gender_concept_id, 
   PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY year_of_birth) OVER (PARTITION BY gender_concept_id) 
FROM person;

The equivalent in PostGreSQL is:

SELECT DISTINCT gender_concept_id, 
   PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY year_of_birth) 
FROM person 
GROUP BY gender_concept_id;

(note that DISTINCT is not really required in Postgres).

In SQL Server, you can't have an ORDER BY clause because it will complain about the year_of_birth variable not being in an aggregate function or GROUP BY clause. You must have DISTINCT, or else you'll get the same results duplicated over and over again (presumably for each row in the data).

In PostgreSQL, you can't have the OVER clause. Instead, you must have a GROUP BY clause to partition the data.

Surprisingly, both queries run fine on Oracle.

schuemie · 2019-03-15T10:22:27Z

Here's the equivalent SQL that avoids PERCENTILE_DISC functions:

SELECT ordered_data.gender_concept_id,
  MIN(CASE WHEN order_nr < .50 * population_size THEN 9999 ELSE year_of_birth END) AS median
FROM (
  SELECT gender_concept_id,
    year_of_birth,
    ROW_NUMBER() OVER (PARTITION BY gender_concept_id ORDER BY year_of_birth) order_nr
  FROM person
) ordered_data
INNER JOIN (
  SELECT gender_concept_id,
    COUNT(*) AS population_size
  FROM person
  GROUP BY gender_concept_id
) population_sizes
ON ordered_data.gender_concept_id = population_sizes.gender_concept_id
GROUP BY ordered_data.gender_concept_id;

As I said: not very pretty. But it will run on all platforms.

PRijnbeek · 2019-03-15T10:49:53Z

Thanks Martijn, this shows why we need the QueryLibrary :)

It is important it should return the same results on all DBMS.

(Where do we explain to our users they should not use this function? clearly in OMOP times we did not know)

schuemie · 2019-03-15T11:15:11Z

The function is not listed in the list of functions and structures supported by SqlRender. But I can forgive people for not having noticed that ;-)

One idea is to create a reference document for OHDSI SQL that treats OHDSI SQL as its own dialect (a subset of SQL Server SQL) and describes all syntax that is supported.

tropicalberto · 2019-03-15T16:59:17Z

thanks a lot for your input @schuemie Indeed, not the nicest query but at least with this alternative function we can keep the query platform agnostic ;) . I will update the queries next week

…stic way to estimate the percentiles (see thread #10) and modidying as well the way that the age of a person given a condition is calculated

tropicalberto added help wanted Extra attention is needed question Further information is requested labels Mar 15, 2019

bmosseveld closed this as completed May 20, 2019

eldarallakhverdiiev mentioned this issue May 20, 2019

Remaining issues Odysseus #14

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regarding queries that use PERCENTILE_DISC() #10

Regarding queries that use PERCENTILE_DISC() #10

tropicalberto commented Mar 15, 2019

PRijnbeek commented Mar 15, 2019

schuemie commented Mar 15, 2019

schuemie commented Mar 15, 2019

schuemie commented Mar 15, 2019

PRijnbeek commented Mar 15, 2019

schuemie commented Mar 15, 2019

tropicalberto commented Mar 15, 2019

Regarding queries that use PERCENTILE_DISC() #10

Regarding queries that use PERCENTILE_DISC() #10

Comments

tropicalberto commented Mar 15, 2019

PRijnbeek commented Mar 15, 2019

schuemie commented Mar 15, 2019

schuemie commented Mar 15, 2019

schuemie commented Mar 15, 2019

PRijnbeek commented Mar 15, 2019

schuemie commented Mar 15, 2019

tropicalberto commented Mar 15, 2019