Improve sql queries : case aggregation #11

claudia-codeco · 2024-06-03T16:17:24Z

Problem

Currently, line list of cases are queried and then aggregated. In order to include group_by in the sql query, we need to
compute the epiweek.

Why do that

To optimize consultation, this is also useful for the webpage

current code (AlertTools::getCases.R)

     comando <- paste("SELECT * from \"Municipio\".\"Notificacao\" WHERE dt_digita <= '",lastday, 
                         "' AND dt_digita >= '",firstday, "' AND municipio_geocodigo IN (", sqlcity, 
                         ") AND cid10_codigo IN(", sqlcid,")", sep="")
 
        dd <- dbGetQuery(datasource,comando)


        dd$se_sin_pri <- epiweek(as.Date(dd$dt_sin_pri, format = "%Y-%m-%d"))
            dd <- dd %>% 
                  mutate(ano_sinpri = epiyear(dt_sin_pri),
                         SE = ano_sinpri*100+se_sin_pri)

 casos <- dd %>% 
            group_by(municipio_geocodigo, SE) %>%
            summarise(
                  casos = length(classi_fin),
                  cas_prov = sum(classi_fin != 5, na.rm = TRUE),
                  cas_lab = sum(classi_fin != 5 & criterio == 1 , na.rm = TRUE))

fccoelho · 2024-06-03T16:57:23Z

I suggest creating a view on the database like this (untested code):

CREATE VIEW casos as SELECT 
    count(classi_fin) FILTER(WHERE classi_fin IS NOT NULL) AS casos_suspeitos,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND classi_fin <> '') AS casos_provaveis,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND criterio = 1) as casos_lab
FROM  "Municipio"."Notificacao"
WHERE
    dt_digita <= {lastday} AND dt_digita >={firstday} AND municipio_geocodigo IN ({sqlcity}) AND cid10_codigo IN({sqlcid});

@claudia-codeco, is this the query you need?

claudia-codeco · 2024-06-03T17:14:49Z

almost there. we need to group_by epiweek using dt_sin_pri

fccoelho · 2024-06-03T17:21:27Z

It seems that first creating this aggregation by municipio_geocódigo and epiweek as a separate view, named say weekly_cases_by_city will simplify the query above which can then select from this view, instead of "Municipio"."Notificação".

@luabida can you finalize this, please?

claudia-codeco · 2024-06-03T18:00:28Z

Yes, the final table should have the following columns

municipio_geocodigo
year
week (epiweek) (1 to 53)
casos
casos_provaveis
casos_lab

fccoelho · 2024-06-03T18:15:02Z

So it seems that the view has to be something like this:

CREATE VIEW casos as SELECT 
    municipio_geocodigo,
    year,
    SE,
    count(classi_fin) FILTER(WHERE classi_fin IS NOT NULL) AS casos_suspeitos,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND classi_fin <> '') AS casos_provaveis,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND criterio = 1) as casos_lab
FROM  "Municipio"."Notificacao"
WHERE
    dt_digita <= {lastday} AND dt_digita >={firstday} AND municipio_geocodigo IN ({sqlcity}) AND cid10_codigo IN({sqlcid})
GROUP by municipio_geocodigo, year, SE;

luabida · 2024-06-04T14:11:31Z

@fccoelho @claudia-codeco this is the result from the query with minor adjustments in the column names:

SELECT
    municipio_geocodigo,
    ano_notif,
    se_notif,
    count(classi_fin) FILTER(WHERE classi_fin IS NOT NULL) AS casos_suspeitos,
    count(classi_fin) FILTER(WHERE classi_fin <> 5) AS casos_provaveis,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND criterio = 1) as casos_lab
FROM  "Municipio"."Notificacao"
WHERE
    dt_digita BETWEEN '2023-01-01' AND '2023-01-03' 
    AND municipio_geocodigo IN (3304557) 
    AND cid10_codigo IN ('A90')
GROUP by municipio_geocodigo, ano_notif, se_notif;

The only thing I didn't quite understand is the purpose of having the column epiweek in the Copernicus table, but the table on production has just completed the UPDATE in every row with the epidemiological week:

dengue=# select date, epiweek, geocodigo from weather.copernicus_brasil limit 1;
    date    | epiweek | geocodigo
------------+---------+-----------
 2016-12-13 | 201650  |   2209955
(1 row)

fccoelho · 2024-06-04T16:36:47Z

There is only one issue: we cannot use se_notif for the group by we need to use the epiweek of dt_sinpri.

So unless we find a way to calculate the epiweek using SQL, we may need to call a pl-python function in this view.

The definition of EpiWeek is this (according to PAHO):

The epidemiological week begins on Sunday and ends on Saturday. The first epidemiological week of the year ends on the first Saturday of January, provided that it falls at least four or more days into the month. Therefore, the first epidemiological week may actually begin in December of the previous year.

fccoelho · 2024-06-04T16:39:30Z

Adding the epiweek in copernicus table is important to allow for grouping by epiweek.

luabida · 2024-06-11T01:50:10Z

I tried to keep the exact logic as the epiweek does, the system parameter is not quite necessary, but it may be portable to windows if necessary too

SQL Function with plpython3u:

CREATE OR REPLACE FUNCTION extract_SE(date DATE)
RETURNS INT AS $$
from datetime import date as dt

def _system_adjustment(system: str) -> int:
    systems = ("iso", "cdc")  # Monday, Sunday
    return systems.index(system.lower())

def _year_start(year: int, system: str) -> int:
    adjustment = _system_adjustment(system)
    mid_weekday = 3 - adjustment  # Sun is 6 .. Mon is 0
    jan1 = dt(year, 1, 1)
    jan1_ordinal = jan1.toordinal()
    jan1_weekday = jan1.weekday()
    week1_start_ordinal = jan1_ordinal - jan1_weekday - adjustment
    if jan1_weekday > mid_weekday:
        week1_start_ordinal += 7
    return week1_start_ordinal

def fromdate(date: dt, system: str = "cdc") -> int:
    if isinstance(date, str):
        date = dt.fromisoformat(date)
    year = date.year
    date_ordinal = date.toordinal()
    year_start_ordinal = _year_start(year, system)
    week = (date_ordinal - year_start_ordinal) // 7
    if week < 0:
        year -= 1
        year_start_ordinal = _year_start(year, system)
        week = (date_ordinal - year_start_ordinal) // 7
    elif week >= 52:
        year_start_ordinal = _year_start(year + 1, system)
        if date_ordinal >= year_start_ordinal:
            year += 1
            week = 0
    week += 1
    return int(str(year) + f"{week:02d}")

return fromdate(date, "cdc")
$$ LANGUAGE plpython3u;

Updated query:

SELECT
    municipio_geocodigo,
    ano_notif,
    extract_SE(dt_sin_pri) as SE,
    count(classi_fin) FILTER(WHERE classi_fin IS NOT NULL) AS casos_suspeitos,
    count(classi_fin) FILTER(WHERE classi_fin <> 5) AS casos_provaveis,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND criterio = 1) as casos_lab
FROM  "Municipio"."Notificacao"
WHERE
    dt_digita BETWEEN '2023-01-01' AND '2023-01-03'
    AND municipio_geocodigo IN (3304557)
    AND cid10_codigo IN ('A90')
GROUP by municipio_geocodigo, ano_notif, extract_SE(dt_sin_pri);

Result:

fccoelho · 2024-06-11T11:52:34Z

We should add this python function to the setup script for the infodengue database so that we always have access to it. in case of a redeploy somewhere.

claudia-codeco added the help wanted Extra attention is needed label Jun 3, 2024

claudia-codeco added this to the Agrega milestone Jun 3, 2024

claudia-codeco assigned fccoelho and claudia-codeco Jun 3, 2024

fccoelho assigned turicas and luabida Jun 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve sql queries : case aggregation #11

Improve sql queries : case aggregation #11

claudia-codeco commented Jun 3, 2024 •

edited

Loading

fccoelho commented Jun 3, 2024 •

edited

Loading

claudia-codeco commented Jun 3, 2024

fccoelho commented Jun 3, 2024 •

edited

Loading

claudia-codeco commented Jun 3, 2024

fccoelho commented Jun 3, 2024 •

edited

Loading

luabida commented Jun 4, 2024 •

edited

Loading

fccoelho commented Jun 4, 2024

fccoelho commented Jun 4, 2024

luabida commented Jun 11, 2024 •

edited

Loading

fccoelho commented Jun 11, 2024

Improve sql queries : case aggregation #11

Improve sql queries : case aggregation #11

Comments

claudia-codeco commented Jun 3, 2024 • edited Loading

Problem

Why do that

current code (AlertTools::getCases.R)

fccoelho commented Jun 3, 2024 • edited Loading

claudia-codeco commented Jun 3, 2024

fccoelho commented Jun 3, 2024 • edited Loading

claudia-codeco commented Jun 3, 2024

fccoelho commented Jun 3, 2024 • edited Loading

luabida commented Jun 4, 2024 • edited Loading

fccoelho commented Jun 4, 2024

fccoelho commented Jun 4, 2024

luabida commented Jun 11, 2024 • edited Loading

fccoelho commented Jun 11, 2024

claudia-codeco commented Jun 3, 2024 •

edited

Loading

fccoelho commented Jun 3, 2024 •

edited

Loading

fccoelho commented Jun 3, 2024 •

edited

Loading

fccoelho commented Jun 3, 2024 •

edited

Loading

luabida commented Jun 4, 2024 •

edited

Loading

luabida commented Jun 11, 2024 •

edited

Loading