Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve sql queries : case aggregation #11

Open
claudia-codeco opened this issue Jun 3, 2024 · 10 comments
Open

Improve sql queries : case aggregation #11

claudia-codeco opened this issue Jun 3, 2024 · 10 comments
Assignees
Labels
help wanted Extra attention is needed
Milestone

Comments

@claudia-codeco
Copy link
Contributor

claudia-codeco commented Jun 3, 2024

Problem

Currently, line list of cases are queried and then aggregated. In order to include group_by in the sql query, we need to
compute the epiweek.

Why do that

To optimize consultation, this is also useful for the webpage

current code (AlertTools::getCases.R)

     comando <- paste("SELECT * from \"Municipio\".\"Notificacao\" WHERE dt_digita <= '",lastday, 
                         "' AND dt_digita >= '",firstday, "' AND municipio_geocodigo IN (", sqlcity, 
                         ") AND cid10_codigo IN(", sqlcid,")", sep="")
 
        dd <- dbGetQuery(datasource,comando)


        dd$se_sin_pri <- epiweek(as.Date(dd$dt_sin_pri, format = "%Y-%m-%d"))
            dd <- dd %>% 
                  mutate(ano_sinpri = epiyear(dt_sin_pri),
                         SE = ano_sinpri*100+se_sin_pri)

 casos <- dd %>% 
            group_by(municipio_geocodigo, SE) %>%
            summarise(
                  casos = length(classi_fin),
                  cas_prov = sum(classi_fin != 5, na.rm = TRUE),
                  cas_lab = sum(classi_fin != 5 & criterio == 1 , na.rm = TRUE))
@claudia-codeco claudia-codeco added the help wanted Extra attention is needed label Jun 3, 2024
@claudia-codeco claudia-codeco added this to the Agrega milestone Jun 3, 2024
@fccoelho
Copy link
Contributor

fccoelho commented Jun 3, 2024

I suggest creating a view on the database like this (untested code):

CREATE VIEW casos as SELECT 
    count(classi_fin) FILTER(WHERE classi_fin IS NOT NULL) AS casos_suspeitos,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND classi_fin <> '') AS casos_provaveis,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND criterio = 1) as casos_lab
FROM  "Municipio"."Notificacao"
WHERE
    dt_digita <= {lastday} AND dt_digita >={firstday} AND municipio_geocodigo IN ({sqlcity}) AND cid10_codigo IN({sqlcid});

@claudia-codeco, is this the query you need?

@claudia-codeco
Copy link
Contributor Author

almost there. we need to group_by epiweek using dt_sin_pri

@fccoelho
Copy link
Contributor

fccoelho commented Jun 3, 2024

It seems that first creating this aggregation by municipio_geocódigo and epiweek as a separate view, named say weekly_cases_by_city will simplify the query above which can then select from this view, instead of "Municipio"."Notificação".

@luabida can you finalize this, please?

@claudia-codeco
Copy link
Contributor Author

Yes, the final table should have the following columns

municipio_geocodigo
year
week (epiweek) (1 to 53)
casos
casos_provaveis
casos_lab

@fccoelho
Copy link
Contributor

fccoelho commented Jun 3, 2024

So it seems that the view has to be something like this:

CREATE VIEW casos as SELECT 
    municipio_geocodigo,
    year,
    SE,
    count(classi_fin) FILTER(WHERE classi_fin IS NOT NULL) AS casos_suspeitos,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND classi_fin <> '') AS casos_provaveis,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND criterio = 1) as casos_lab
FROM  "Municipio"."Notificacao"
WHERE
    dt_digita <= {lastday} AND dt_digita >={firstday} AND municipio_geocodigo IN ({sqlcity}) AND cid10_codigo IN({sqlcid})
GROUP by municipio_geocodigo, year, SE;

@luabida
Copy link

luabida commented Jun 4, 2024

@fccoelho @claudia-codeco this is the result from the query with minor adjustments in the column names:

SELECT
    municipio_geocodigo,
    ano_notif,
    se_notif,
    count(classi_fin) FILTER(WHERE classi_fin IS NOT NULL) AS casos_suspeitos,
    count(classi_fin) FILTER(WHERE classi_fin <> 5) AS casos_provaveis,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND criterio = 1) as casos_lab
FROM  "Municipio"."Notificacao"
WHERE
    dt_digita BETWEEN '2023-01-01' AND '2023-01-03' 
    AND municipio_geocodigo IN (3304557) 
    AND cid10_codigo IN ('A90')
GROUP by municipio_geocodigo, ano_notif, se_notif;

image

The only thing I didn't quite understand is the purpose of having the column epiweek in the Copernicus table, but the table on production has just completed the UPDATE in every row with the epidemiological week:

dengue=# select date, epiweek, geocodigo from weather.copernicus_brasil limit 1;
    date    | epiweek | geocodigo
------------+---------+-----------
 2016-12-13 | 201650  |   2209955
(1 row)

@fccoelho
Copy link
Contributor

fccoelho commented Jun 4, 2024

There is only one issue: we cannot use se_notif for the group by we need to use the epiweek of dt_sinpri.

So unless we find a way to calculate the epiweek using SQL, we may need to call a pl-python function in this view.

The definition of EpiWeek is this (according to PAHO):

The epidemiological week begins on Sunday and ends on Saturday. The first epidemiological week of the year ends on the first Saturday of January, provided that it falls at least four or more days into the month. Therefore, the first epidemiological week may actually begin in December of the previous year.

@fccoelho
Copy link
Contributor

fccoelho commented Jun 4, 2024

Adding the epiweek in copernicus table is important to allow for grouping by epiweek.

@luabida
Copy link

luabida commented Jun 11, 2024

I tried to keep the exact logic as the epiweek does, the system parameter is not quite necessary, but it may be portable to windows if necessary too

SQL Function with plpython3u:

CREATE OR REPLACE FUNCTION extract_SE(date DATE)
RETURNS INT AS $$
from datetime import date as dt

def _system_adjustment(system: str) -> int:
    systems = ("iso", "cdc")  # Monday, Sunday
    return systems.index(system.lower())

def _year_start(year: int, system: str) -> int:
    adjustment = _system_adjustment(system)
    mid_weekday = 3 - adjustment  # Sun is 6 .. Mon is 0
    jan1 = dt(year, 1, 1)
    jan1_ordinal = jan1.toordinal()
    jan1_weekday = jan1.weekday()
    week1_start_ordinal = jan1_ordinal - jan1_weekday - adjustment
    if jan1_weekday > mid_weekday:
        week1_start_ordinal += 7
    return week1_start_ordinal

def fromdate(date: dt, system: str = "cdc") -> int:
    if isinstance(date, str):
        date = dt.fromisoformat(date)
    year = date.year
    date_ordinal = date.toordinal()
    year_start_ordinal = _year_start(year, system)
    week = (date_ordinal - year_start_ordinal) // 7
    if week < 0:
        year -= 1
        year_start_ordinal = _year_start(year, system)
        week = (date_ordinal - year_start_ordinal) // 7
    elif week >= 52:
        year_start_ordinal = _year_start(year + 1, system)
        if date_ordinal >= year_start_ordinal:
            year += 1
            week = 0
    week += 1
    return int(str(year) + f"{week:02d}")

return fromdate(date, "cdc")
$$ LANGUAGE plpython3u;

Updated query:

SELECT
    municipio_geocodigo,
    ano_notif,
    extract_SE(dt_sin_pri) as SE,
    count(classi_fin) FILTER(WHERE classi_fin IS NOT NULL) AS casos_suspeitos,
    count(classi_fin) FILTER(WHERE classi_fin <> 5) AS casos_provaveis,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND criterio = 1) as casos_lab
FROM  "Municipio"."Notificacao"
WHERE
    dt_digita BETWEEN '2023-01-01' AND '2023-01-03'
    AND municipio_geocodigo IN (3304557)
    AND cid10_codigo IN ('A90')
GROUP by municipio_geocodigo, ano_notif, extract_SE(dt_sin_pri);

Result:
image

@fccoelho
Copy link
Contributor

We should add this python function to the setup script for the infodengue database so that we always have access to it. in case of a redeploy somewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants