Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add trendline command #748

Closed
wants to merge 6 commits into from

Conversation

kt-eliatra
Copy link
Contributor

Description

This PR implements the trendline PPL command.

Issues Resolved

#655

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@YANG-DB
Copy link
Member

YANG-DB commented Oct 7, 2024

@kt-eliatra
please add the needed documentation in the following locations:

Afterward - add documentation and samples:

@kt-eliatra
Copy link
Contributor Author

@kt-eliatra please add the needed documentation in the following locations:

* [ppl planning](https://github.com/opensearch-project/opensearch-spark/tree/main/docs/ppl-lang/planning)

Afterward - add documentation and samples:

* [ppl commands list](https://github.com/opensearch-project/opensearch-spark/tree/main/docs/ppl-lang)

* [ppl functions list](https://github.com/opensearch-project/opensearch-spark/tree/main/docs/ppl-lang/functions)

* [ppl examples doc](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/ppl-lang/PPL-Example-Commands.md)

@YANG-DB
Sure, I'll do that. In the meantime, I have a few questions:

  • Do we assume that the data is already sorted, e.g. by a preceding sort command, or should the trendline command syntax be extended to support sorting?

  • Should we calculate sma/wma when number of data-points in current window is less than number of data-points required by user?

    Having such data:

    YEAR  | SALES
    2024  |	5.0
    2023  | 6.5
    2022  | 7.0
    2021  | 4.3
    2020  | 3.6
    2019  | 9.0
    

    and number of data-points = 3

    Years 2024-2021

    SMA(2024) = (5.0+6.5+7.0)/3 = 6,166666667
    SMA(2023) = (6.5+7.0+4.3)/3 = 5,933333333
    SMA(2022) = (7.0+4.3+3.6)/3 = 4,966666667
    SMA(2021) = (4.3+3.6+9.0)/3 = 5,633333333
    

    Years 2020-2019

    SMA(2020) = NULL since there are only 2 data-points - 3.6 and 9.0 or?
    SMA(2019) = NULL since there is only 1 data-point- 9.0 or?
    

@YANG-DB
Copy link
Member

YANG-DB commented Oct 10, 2024

@kt-eliatra please add the needed documentation in the following locations:

* [ppl planning](https://github.com/opensearch-project/opensearch-spark/tree/main/docs/ppl-lang/planning)

Afterward - add documentation and samples:

* [ppl commands list](https://github.com/opensearch-project/opensearch-spark/tree/main/docs/ppl-lang)

* [ppl functions list](https://github.com/opensearch-project/opensearch-spark/tree/main/docs/ppl-lang/functions)

* [ppl examples doc](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/ppl-lang/PPL-Example-Commands.md)

@YANG-DB Sure, I'll do that. In the meantime, I have a few questions:

  • Do we assume that the data is already sorted, e.g. by a preceding sort command, or should the trendline command syntax be extended to support sorting?

  • Should we calculate sma/wma when number of data-points in current window is less than number of data-points required by user?
    Having such data:

    YEAR  | SALES
    2024  |	5.0
    2023  | 6.5
    2022  | 7.0
    2021  | 4.3
    2020  | 3.6
    2019  | 9.0
    

    and number of data-points = 3
    Years 2024-2021

    SMA(2024) = (5.0+6.5+7.0)/3 = 6,166666667
    SMA(2023) = (6.5+7.0+4.3)/3 = 5,933333333
    SMA(2022) = (7.0+4.3+3.6)/3 = 4,966666667
    SMA(2021) = (4.3+3.6+9.0)/3 = 5,633333333
    

    Years 2020-2019

    SMA(2020) = NULL since there are only 2 data-points - 3.6 and 9.0 or?
    SMA(2019) = NULL since there is only 1 data-point- 9.0 or?
    

hi

  • Do we assume that the data is already sorted ? - IMO we cant assume that so please allow trendline to facilitate this functionality
  • Should we calculate sma/wma when number of data-points in current window is less ... - IMO we need to return a response stating there are insufficient data point to calculate the trendline based on the requested number

@YANG-DB
Copy link
Member

YANG-DB commented Oct 11, 2024

@kt-eliatra can u plz merge & resolve the conflicts ?
thanks

@kt-eliatra
Copy link
Contributor Author

@YANG-DB

    • Do we assume that the data is already sorted ? - IMO we cant assume that so please allow trendline to facilitate this functionality

    My proposal for the syntax of the trendline command

    • without sorting

      • ... | trendline sma(2, price) wma(3, x)
      • ... | trendline wma(4, price) sma(5, x)
    • using sorting

      • ... | trendline sort - date sma(2, price) wma(3, x)
      • ... | trendline sort + date wma(4, price) sma(5, x)

    ANTLR

    trendlineCommand
       : TRENDLINE (SORT sortField)? trendlineClause (trendlineClause)*
       ;
    
    trendlineClause
       : trendlineType LT_PRTHS numberOfDataPoints = integerLiteral COMMA field = fieldExpression RT_PRTHS AS alias = fieldExpression
       ;
    
    trendlineType
       : SMA
       | WMA
       ;
    
  • * `Should we calculate sma/wma when number of data-points in current window is less ...` - IMO we need to return a response stating there are insufficient data point to calculate the trendline based on the requested number
    

    Query like

    select 
      case when (count(1) over (order by year desc rows between 1 preceding and current row)) < 2
        then
          "Insufficient data point to calculate the trendline" 
        else
          (avg(sales) over (order by year desc rows between 1 preceding and current row)) end as sma_result 
    from sales;
    

    causes the calculated average to be casted to a string. Wouldn't it be better to return null in such a case?

@YANG-DB
Copy link
Member

YANG-DB commented Oct 24, 2024

@kt-eliatra thanks for the detailed review - I confirm please continue

Wouldn't it be better to return null in such a case?

Yes it does make sense

Signed-off-by: Kacper Trochimiak <[email protected]>
Signed-off-by: Kacper Trochimiak <[email protected]>
Signed-off-by: Kacper Trochimiak <[email protected]>
Signed-off-by: Kacper Trochimiak <[email protected]>
Signed-off-by: Kacper Trochimiak <[email protected]>
@lukasz-soszynski-eliatra
Copy link
Contributor

WMA extracted to PR eliatra#7

@salyh
Copy link
Contributor

salyh commented Oct 29, 2024

Superseded by #833 and eliatra#7

This PR should be closed in favor of the two mentioned aboved

@kt-eliatra kt-eliatra closed this Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.6 Lang:PPL Pipe Processing Language support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants