Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack overflow with LEAD and LAG functions #12731

Closed
iilyak opened this issue Oct 3, 2024 · 24 comments
Closed

Stack overflow with LEAD and LAG functions #12731

iilyak opened this issue Oct 3, 2024 · 24 comments
Assignees
Labels
bug Something isn't working

Comments

@iilyak
Copy link

iilyak commented Oct 3, 2024

Describe the bug

The following query causes stack overflow in rust

SELECT seq
  FROM (
    SELECT seq,
      LEAD(seq) OVER (ORDER BY seq) AS next_seq
      FROM parquet_table
  ) AS subquery
  WHERE next_seq - seq > 1
thread 'main' has overflowed its stack
error: process didn't exit successfully: `target\debug\seq.exe` (exit code: 0xc00000fd, STATUS_STACK_OVERFLOW)

To Reproduce

use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
  let ctx = datafusion::prelude::SessionContext::new();
  ctx
    .register_parquet(
      "parquet_table",
      "batches.parquet",
      datafusion::prelude::ParquetReadOptions::default(),
    )
    .await?;

  let sql_query = "
  SELECT seq
  FROM (
    SELECT seq,
      LEAD(seq) OVER (ORDER BY seq) AS next_seq
      FROM parquet_table
  ) AS subquery
  WHERE next_seq - seq > 1
  ";

  let df = ctx.sql(sql_query).await?;

  df.show().await?;
  Ok(())
}

Expected behavior

The query should detect gaps in the seq number.

Additional context

datafusion = { version = "41.0.0", features = ["parquet", "default"] }
arrow = "53.0.0"
@iilyak iilyak added the bug Something isn't working label Oct 3, 2024
@iilyak
Copy link
Author

iilyak commented Oct 3, 2024

Version 42.0.0 exhibit the same issue.

@Eason0729
Copy link
Contributor

take

@Eason0729
Copy link
Contributor

Eason0729 commented Oct 4, 2024

I can reproduce it on Windows(can't on linux), it overflow on this loop.

I would try to write a unit test in sqlparser to cause it overflow, then I will file an issue on sqlparser-rs if needed.

@Eason0729
Copy link
Contributor

Eason0729 commented Oct 4, 2024

I found stack overflow not happening if stack size is set to 16MiB(default is 1 MiB on Windows), so you can reproduce this issue on linux by making stack size smaller.

I don't think this is an bug, but it could be better if we document it somewhere(document stack size is very small by default on windows).

Also of note, optimized build significantly reduce memory usage, 1 MiB may be enough.

@iilyak
Copy link
Author

iilyak commented Oct 4, 2024

I don't think this is an bug, but it could be better if we document it somewhere(document stack size is very small by default on windows).

Any stack overflow is a security risk https://book.hacktricks.xyz/binary-exploitation/stack-overflow

@Eason0729
Copy link
Contributor

Any stack overflow is a security risk https://book.hacktricks.xyz/binary-exploitation/stack-overflow

Sorry for missing additional context, I think such stack overflow won't happen if one of those requirement is meet:

  1. build with optimization(try running with cargo run --release)
  2. Stack size is adjusted(1MB is unreasonably small for modern application)

And on datafusion side, I don't think there is a reasonable way reduce stack size except using iteration instead of recursive on sqlparser.

I will try rewriting parser with iteration instead of recursive.

@iilyak
Copy link
Author

iilyak commented Oct 4, 2024

Is it possible to detect this condition programmatically and raise an exception? Controlled exception is always better then running over to someone else region of memory.

@iilyak
Copy link
Author

iilyak commented Oct 4, 2024

I think such stack overflow won't happen if one of those requirement is meet

Imagine a service which implements an application to analyze structured log events. Where log events are stored in parquet files. The application is a web server which accepts adhoc SQL queries and runs them to extract useful patterns from log events. In such scenario the malicious user can craft complex SQL query to cause stack overflow and mount remote code execution attack.

In this scenario no matter how big the stack size is. It will always be possible to craft more complex query to cause the problem.

@Eason0729
Copy link
Contributor

I think such stack overflow won't happen if one of those requirement is meet

Imagine a service which implements an application to analyze structured log events. Where log events are stored in parquet files. The application is a web server which accepts adhoc SQL queries and runs them to extract useful patterns from log events. In such scenario the malicious user can craft complex SQL query to cause stack overflow and mount remote code execution attack.

In this scenario no matter how big the stack size is. It will always be possible to craft more complex query to cause the problem.

I personally think that not something to be taken into consideration under optimized build, as query with thousand of token might needed, in that case, it should be bound checked(as physical memory might not be enough).

But totally I agree that stack overflow under unoptimized build is unacceptable(it hurt developer experience), so I will work on that.

For now, you can develop the application with more stack size, and release it with optimization, it should be safe if the input were bound checked for some reasonable large number.

@alamb
Copy link
Contributor

alamb commented Oct 8, 2024

I think rust consumes quite a bit more stack space for debug builds than release builds (there are several past tickets in this repo related to this)

In this scenario no matter how big the stack size is. It will always be possible to craft more complex query to cause the problem.

Indeed -- it would be great if someone could put a stack limit in to try and avoid overflows (make a real error) if the query is too deply nested

@alamb
Copy link
Contributor

alamb commented Oct 8, 2024

This particular query doesn't look like it have deep nesting though -- can someone post the stack trace here? Maybe we can improve the behavior for whatever function is happening

@Eason0729
Copy link
Contributor

Here is one stacktrace.
partial_stacktrace.txt

I am trying to understand which stack frame take some stack memory by manually read rbp and rsp in lldb. It's quite inefficient, so I might need some help on this.

In addition, if I recall correctly, it take about 100 frames to parse it. (that's where stack overflow happen)

@Eason0729
Copy link
Contributor

bt_windows.txt

Here is another one that actually stack overflow.

@Eason0729
Copy link
Contributor

I think rust consumes quite a bit more stack space for debug builds than release builds (there are several past tickets in this repo related to this)

In this scenario no matter how big the stack size is. It will always be possible to craft more complex query to cause the problem.

Indeed -- it would be great if someone could put a stack limit in to try and avoid overflows (make a real error) if the query is too deply nested

I think it's better to open another issue. I would like this issue to focus on reducing stack usage.

We could use asm macro to get stack usage, and detect usage(return Error) at some point

fn get_stack_usage() -> usize {
    let rbp: usize;
    let rsp: usize;
    unsafe {
        asm!("mov {0}, rbp", out(reg) rbp);
        asm!("mov {0}, rsp", out(reg) rsp);
    }
    return rbp - rsp;
}

@Eason0729
Copy link
Contributor

My original thought to reduce stack usage is to move more thing to heap, but I just found another possible solution using stacker.

I will try stacker next day, then submit a PR to see if it's appropriate.

@iilyak
Copy link
Author

iilyak commented Oct 9, 2024

My original thought to reduce stack usage is to move more thing to heap, but I just found another possible solution using stacker.

Interesting library

@iilyak
Copy link
Author

iilyak commented Oct 9, 2024

Another option is to make parser tail recursive and annotate the function using https://docs.rs/tailcall/latest/tailcall/

@alamb
Copy link
Contributor

alamb commented Oct 9, 2024

The sqlparser library already supports limiting recursion: https://docs.rs/sqlparser/latest/sqlparser/parser/struct.Parser.html#method.with_recursion_limit

Which is a pretty good protection against stack overflows

I think stacker is a somewhat low level solution and that we would have to ensure any use doesn't cause performance or safety issues

@alamb
Copy link
Contributor

alamb commented Oct 9, 2024

I think it's better to open another issue. I would like this issue to focus on reducing stack usage.

I agree

@alamb
Copy link
Contributor

alamb commented Oct 9, 2024

Given this is a stack overflow in sqlparser, perhaps we can move the ticket there and try to solve it in the sqlparser library?

@alamb
Copy link
Contributor

alamb commented Oct 9, 2024

@Eason0729
Copy link
Contributor

apache/datafusion-sqlparser-rs#1465

@alamb
Copy link
Contributor

alamb commented Oct 10, 2024

Let's use apache/datafusion-sqlparser-rs#1465 to track this issue

@alamb alamb closed this as completed Oct 10, 2024
@Eason0729
Copy link
Contributor

BTW, you can change stack size with Cargo.toml

# 64 bit MSVC
[target.x86_64-pc-windows-msvc]
rustflags = [
	"-C", "link-arg=/STACK:8000000"
]

# 64 bit Mingw
[target.x86_64-pc-windows-gnu]
rustflags = [
    "-C", "link-arg=-Wl,--stack,8000000"
]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants