SNOW-1612981: SNOW-1640968 SNOW-1629635 Massive Memory Load #1004
Comments
My company has been seeing this issue ramp up since at least early July, and it's causing OOM crashes in containers, which makes for a bad user experience for our customers. We're having to rearchitect a large portion of our application to work around this problem. A container with a 2 GB hard limit that simply queries Snowflake can crash OOM rapidly. .NET's OutOfMemoryException isn't even thrown; the Linux OOM killer simply terminates the process and we lose the application. We've done some basic analysis and there appears to be a memory allocation issue somewhere in the REST API querying part of the code, and the .NET GC doesn't apply to whatever is being allocated.
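To illustrate the kind of analysis described above, here is a minimal diagnostic sketch (illustrative only, not the reporter's code): it periodically compares the GC-managed heap size with the total process working set, so growth the GC cannot reclaim (native/unmanaged allocations) shows up as a widening gap between the two numbers.
using System;
using System.Diagnostics;
using System.Threading;

// Minimal diagnostic sketch: logs managed heap size versus the process working set
// so unmanaged growth the GC cannot touch becomes visible over time.
class MemoryProbe
{
    static void Main()
    {
        while (true)
        {
            long managedMb = GC.GetTotalMemory(forceFullCollection: false) / (1024 * 1024);
            long workingSetMb = Process.GetCurrentProcess().WorkingSet64 / (1024 * 1024);
            Console.WriteLine($"managed={managedMb} MB, workingSet={workingSetMb} MB");
            Thread.Sleep(5000);
        }
    }
}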
hi all - thanks for raising this issue. Its appearance without any code change or driver change suggests that it probably does not come from the driver (= this repo) but perhaps the backend itself, but at this point of course there's not enough information to tell. Speaking of which, do any of you (or other readers of this ticket who hit the same issue) have any details on how to reproduce or trigger it? Most helpful would be code which, when run, leads to the unexpectedly high memory usage. Other details might also be relevant.
At this point really any additional info might help; if you have query IDs to share, that could also be useful. Finally, if you're not comfortable sharing all these details in a publicly open place like this, you can raise a ticket with Snowflake Support to work 1:1 with a support engineer. Thank you everyone in advance for every detail which could help narrow down this issue.
We are doing just SELECT-based statements on views that are getting data, and just using the C# driver. As far as I know, we have multiple databases, one for each environment, and the issue is happening on all of them. I've made sure the connection, command and data reader are disposed (and eligible for GC) by using await using around opening the connection to Snowflake, creating the command and reading the data, like this:
await using var snowflakeDbConnection = new SnowflakeDbConnection(configuration.GetConnectionString("SnowflakeDb"));
snowflakeDbConnection.Open();
await using var snowflakeDbCommand = snowflakeDbConnection.CreateCommand();
snowflakeDbCommand.CommandText = @"SELECT data FROM created_view WHERE parameter = :parameter";
snowflakeDbCommand.CommandTimeout = 30;
var parameter = snowflakeDbCommand.CreateParameter();
parameter.ParameterName = "parameter";
parameter.Value = value;
parameter.DbType = DbType.String;
snowflakeDbCommand.Parameters.Add(parameter);
await using var dataReader = await snowflakeDbCommand.ExecuteReaderAsync();
// materialize the mapped results while the reader and connection are still open
var models = mapper.Map<IEnumerable<Model>>(dataReader).ToList();
snowflakeDbConnection.Close();
return models;
Even after implementing this, the memory usage is still higher than what we want or are comfortable with.
FYI - We experienced the exact same issues here and had to contact Snowflake support to get this resolved. They confirmed that they had to make a load-balancer change for our EU connection; as soon as that was made, our memory pattern improved drastically. @watercable76 @dliberatore-th FYI
This is very interesting @elliottcrush; do you mind sharing the case number for reference here perhaps? It could serve not only as a reference, but also help our teams sync internally on the possible issues, if no driver change, no code change, and no change other than switching the load balancer could mitigate the issue. Thank you in advance! edit: I might have found it. Also, it could be indicative of the error if you see any elevated rate of HTTP 502/503 responses and/or connection resets besides the memory usage increase.
Yes, all of the errors were 502s when trying to make the calls. We thought it was an issue with SNAT connections, which is why we made sure to close connections and set up our variables to be cleaned up by GC. Trying to work with Snowflake to get a support case opened for this.
SF support said the memory issues we (our company) were seeing were causing 400s on the load balancer. If you email me @sfc-gh-dszmolka - [email protected] - I'll share the details privately.
Thank you @elliottcrush, in the meantime I could find it. Appreciate the quick turnaround with the offer though!
@sfc-gh-dszmolka we're starting to see the exact same issue happen again in production within the last 15 minutes. I've re-raised it with Snowflake support, any ideas?
Thank you for re-raising it with Snowflake Support, and please do keep working with Snowflake Support on this particular issue instead of here; we have very different SLOs for the two kinds of channels. As a retrospective, to my best knowledge the re-occurrence happened because the LB change had not been configured as permanent. Now it is, since you re-raised it yesterday, and I see from the associated support case that you confirmed it helped (again). Apologies for the issues this problem caused. edit: Our team responsible for the traffic ingest into Snowflake is investigating the root cause, and if it somehow turns out to be related to the driver, we will update here too.
We raised a support ticket too and were able to get the load balancer configuration updated to the previous one we were on. The initial look is that it's doing a lot better, but I'm going to be monitoring for another week before I can say it's actually fine again.
Similar situation here re: a support ticket leading to a load balancer change resolving the client-side memory usage. I'm inclined to say that this issue is indeed a valid concern for the connector library. Receiving an error response code from the REST APIs shouldn't cause a massive memory spike on the client side; at least, that's not expected or logical behavior from a client library. The server-side problem was masked by the fact that these API call failures aren't shown in the logs unless you're running at debug log level - for example, see Core/RestRequester.cs. This makes it difficult for monitoring software/services to alert on issues based on normal production (non-debug-level) log ingestion. Perhaps it shouldn't be relegated to the Debug log level but rather logged at Warning or Error level when an error status is returned from the Snowflake backend? I'm not sure if this is worth opening a separate issue or not. edit: Since we run the associated microservice in containers, reconfiguring production would have meant building a new container image with updated configuration, which is not really a great option for us.
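A possible stop-gap for surfacing those Debug-level failures is sketched below, under two assumptions worth verifying for your driver version (that the 3.x driver routes its logging through log4net, and that its loggers live under the Snowflake.Data namespace): it programmatically raises the driver's log level so the Debug-level REST error messages reach a console appender.
using log4net;
using log4net.Appender;
using log4net.Core;
using log4net.Layout;
using log4net.Repository.Hierarchy;

// Sketch only: raise log4net verbosity for the (assumed) "Snowflake.Data" logger
// hierarchy so Debug-level messages, including failed REST responses, become visible.
static class SnowflakeDebugLogging
{
    public static void Enable()
    {
        var hierarchy = (Hierarchy)LogManager.GetRepository();

        var appender = new ConsoleAppender
        {
            Layout = new PatternLayout("%date %-5level %logger - %message%newline")
        };
        ((PatternLayout)appender.Layout).ActivateOptions();
        appender.ActivateOptions();

        // Logger name is an assumption based on the driver's namespace.
        var snowflakeLogger = (Logger)hierarchy.GetLogger("Snowflake.Data");
        snowflakeLogger.Level = Level.Debug;
        snowflakeLogger.AddAppender(appender);

        hierarchy.Configured = true;
    }
}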
The change to the load balancer and notes on what could be impacted are in the above article. Which Snowflake drivers and clients are impacted?
Thank you all for the comments! The above article just says 'if you use these drivers, the new CA might be missing from the truststore, and you might need to manually install it'. At this point we are not sure if it's related. It could instead be related to how the HTTP connections are made and maintained with Snowflake from the driver. We're looking into this aspect, but at this moment we could not yet reproduce the issue at the magnitude where you folks see it. Here's what we tried (on Linux and macOS too):
using System;
using System.Data;
using System.IO;
using System.Threading;
using Snowflake.Data.Client;

class App
{
    const string TOKENPATH = "/snowflake/session/token";

    static string getConnectionString()
    {
        string? account = Environment.GetEnvironmentVariable("SNOWFLAKE_ACCOUNT");
        string? host = Environment.GetEnvironmentVariable("SNOWFLAKE_HOST");
        if (File.Exists(TOKENPATH))
        {
            // automatically set by env
            string token = File.ReadAllText(TOKENPATH);
            return $"account={account};authenticator=oauth;token={token}";
        }
        else
        {
            // basic auth, variables must be set by user
            string? user = Environment.GetEnvironmentVariable("SNOWFLAKE_USER");
            string? password = Environment.GetEnvironmentVariable("SNOWFLAKE_PASSWORD");
            return $"account={account};user={user};password={password}";
        }
    }

    static void SetEnvVariables()
    {
        Environment.SetEnvironmentVariable("SNOWFLAKE_ACCOUNT", "{account targeting Envoy}");
        Environment.SetEnvironmentVariable("SNOWFLAKE_USER", "{user}");
        Environment.SetEnvironmentVariable("SNOWFLAKE_PASSWORD", "{value}");
    }

    static int Main()
    {
        SetEnvVariables();
        for (var i = 0; i < int.MaxValue; i++)
        {
            try
            {
                //Console.WriteLine("Starting connection!");
                using (IDbConnection conn = new SnowflakeDbConnection())
                {
                    conn.ConnectionString = getConnectionString();
                    conn.Open();
                    using (IDbCommand cmd = conn.CreateCommand())
                    {
                        cmd.CommandText = "SELECT current_account() AS account, current_database() AS database, current_schema() AS schema";
                        using IDataReader reader = cmd.ExecuteReader();
                        if (reader.Read())
                        {
                            string account, database, schema;
                            account = reader.GetString(0);
                            database = reader.GetString(1);
                            schema = reader.GetString(2);
                            //Console.WriteLine("account: " + account);
                            //Console.WriteLine("database: " + database);
                            //Console.WriteLine("schema: " + schema);
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                return 1;
            }
            Thread.Sleep(100);
        }
        return 0;
    }
}
Our experience is that even though connections are frequently being opened, so far we could not reproduce the issue you folks are seeing. Obviously something is triggering this problem for multiple users, but at this moment we could not yet pinpoint it. So if it would be possible to:
Thank you in advance, and again for all the info you already shared here!
Quick status: we're still trying to reproduce the issue internally and have not managed to yet. Kudos to the one of you who provided a code example on the ticket opened with Snowflake Support! 👍 If you already have, or will have, a ticket for this issue with Snowflake Support, and you have a DEV/test account we could move over to Envoy (the new LB where the symptoms started) to profile it and observe the traffic more closely without interrupting important production flows, that could be very helpful. If you have such an account to spare, even for a limited amount of time (then migrate back to the previous LB), please send it on your Support case and mention it was requested for reproduction and profiling. You can of course also send it here. Thank you so much!
Thanks @sfc-gh-dszmolka - that was us with the code example. We're liaising with support with respect to going back to the new LB. We don't have a separate environment - however, we are willing to upgrade our snowflake-connector-net and migrate to the new LB during an agreed time frame. All of which I'll continue to discuss on the support ticket.
a short update: we discovered one possible culprit for the issue on the driver side and would like to address it in #1009. Either way, we're working on eliminating this possibility.
the aforementioned change has been merged and will be part of the next .NET driver release, scheduled to happen in the coming days. Due to the lack of a reliable reproduction of this issue, we cannot verify whether this change helps in this particular case, but after the next release it will be possible to test it and see how it performs.
in addition to that (and we appreciate the suggestion), #1008 is now also merged and will be part of the next release; it exposes the non-successful HTTP responses at a more visible log level than Debug.
@sfc-gh-dszmolka Thanks for getting that PR in and working through all of these problems we've had. Do you know when that next release will be published?
I believe it should come in the next few days; first we're hoping to get in yet another fix for a possible culprit (the team discovered some anomalies around Okta-type authenticators, which might even be related to this issue, and wanted to fix that as well -> #1010). I would also like to grab the opportunity to ask here, just to confirm whether the above fix in 1010 can be related or not: are you folks using Okta-type authenticators?
Not using Okta over here, more Azure-based than anything.
short update: we're also hoping to get #1014 in before the release. At this stage it warrants a list of the changes made so far in an attempt to mitigate this issue:
One possible factor here, different from your test code above which does things synchronously in a serial manner, is that we're doing things async from a REST API of our own to serve requests in parallel. This is intended to fill several components on a page. I can't share our code with you and don't have a nice condensed test case for you, unfortunately, but you could probably simulate it well enough by doing async calls with a maximum of maybe 15 at a time, using a startup delay of a few ms each, and having them use a query that takes a few seconds to complete (and only returns a few rows totaling a few KB) so they stack up. Thanks for keeping up the efforts to improve the software; we're looking forward to trying those fixes out. edit: Not using Okta here either.
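To illustrate that suggestion, here is a rough sketch of such a parallel workload (illustrative only, not the reporter's code: the SNOWFLAKE_CONNECTION_STRING variable, the 200-request count, and SYSTEM$WAIT as a slow placeholder query are assumptions):
using System;
using System.Collections.Generic;
using System.Data.Common;
using System.Threading;
using System.Threading.Tasks;
using Snowflake.Data.Client;

class ParallelRepro
{
    // Cap concurrency at roughly the level described above.
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(15);

    static async Task RunOneAsync(string connectionString, int delayMs)
    {
        await Task.Delay(delayMs); // staggered start of a few ms per request
        await Gate.WaitAsync();
        try
        {
            await using var conn = new SnowflakeDbConnection { ConnectionString = connectionString };
            await conn.OpenAsync();
            await using DbCommand cmd = conn.CreateCommand();
            cmd.CommandText = "SELECT SYSTEM$WAIT(3)"; // stand-in for a query that takes a few seconds
            await using DbDataReader reader = await cmd.ExecuteReaderAsync();
            while (await reader.ReadAsync()) { /* small result, discarded */ }
        }
        finally
        {
            Gate.Release();
        }
    }

    static async Task Main()
    {
        string connectionString = Environment.GetEnvironmentVariable("SNOWFLAKE_CONNECTION_STRING") ?? "";
        var tasks = new List<Task>();
        for (var i = 0; i < 200; i++)
        {
            tasks.Add(RunOneAsync(connectionString, delayMs: i * 5));
        }
        await Task.WhenAll(tasks);
    }
}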
short update: multiple teams are still working on root-causing (and then fixing) this issue with the highest priority. The fixes mentioned above, while they do fix various issues and are still important, do not seem to be enough in themselves to actually fix this issue. We have had a local repro for a while now, which provides some relief and the ability to iterate on debugging faster. What we observed is that disabling certificate validation (setting insecuremode=true in the connection string) avoids the excessive memory usage. Most of you are hopefully not impacted, having been migrated earlier back to the 'old' frontend where the containerised .NET application doesn't exhibit this behaviour - but in case you're not migrated back and are just discovering this issue, do raise a case with Snowflake Support so we can mitigate it for you.
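For anyone who needs that stop-gap in the meantime, a minimal sketch of the workaround looks like the following (placeholder credentials; note that insecuremode disables certificate revocation checking, so weigh the security trade-off before using it):
using Snowflake.Data.Client;

// Sketch of the temporary workaround discussed above: insecuremode=true turns off
// certificate revocation checking in the driver. Credentials below are placeholders.
await using var conn = new SnowflakeDbConnection
{
    ConnectionString = "account=myaccount;user=myuser;password=mypassword;insecuremode=true"
};
await conn.OpenAsync();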
short update: the issue seems to be narrowed down exclusively to Linux, and to how the dotnet runtime and the underlying OS handle memory allocation for CRLs (certificate revocation lists), which specifically becomes a problem with large CRLs that have lots of entries. The issue/observation has been reported to the dotnet maintainers, while we keep working internally at Snowflake on a possible mitigation.
We have opened this issue with dotnet as well: dotnet/runtime#108557 Our teams are looking at a mitigation to take on Snowflake's side for this, and we will provide updates when that is ready and confirmed to be working.
I've been debugging services for days and struggling with this. I experience exactly the same issues with the Snowflake driver 😢
sorry to see this issue caused so many problems for you (and others as well). I must also add that the root cause is the same as for your other issue with the login requests taking a really long time: #1025. The good news is that not only is it now easily mitigated by setting insecuremode=true, but a Snowflake-side mitigation is also being rolled out.
On 19 October 2024 (Saturday), in the early UTC hours, the Snowflake-side mitigation rollout finished, meaning the OCSP Stapling mitigation is now active.
We sincerely hope the Snowflake-side mitigation provides a full workaround even though the underlying dotnet runtime issue (dotnet/runtime#108557) is not yet resolved. Feedback about the fix, as always, is very much appreciated! edit: as of 21 October 2024, very early UTC hours, every account is now again on the new frontend, which has the fix (except accounts with which we have a separate, different agreement; those accounts will be migrated per that agreement).
I removed "insecuremode" and still observe high latency on EDIT: In general query latency is much lower, however |
@myshon thanks for testing it so quickly! Maybe there's still something ongoing in your particular case which needs further investigation. Editing my comment after seeing your edit that the initial 5s login-request execution was brought down to 2s: that is already a significant improvement, and it also confirms the fix worked in your use case too. However, it seems further improvements are necessary after investigating why login is still in the 2-3s range instead of the subsecond range. We'll work with you separately on the support case you have with us.
This post is to notify people that since the Snowflake-side mitigation (implementing OCSP Stapling) has been rolled out, and we already have confirmation that it brought memory usage back down to the pre-issue level, I'm going to close this issue as resolved in the coming days if there are no further reports of the massive memory increase for the .NET driver. In parallel, work on the related login-latency issue (#1025) continues.
Closing this issue as it has been completely mitigated as of 21 October 2024, early UTC hours, by the global rollout of OCSP Stapling (which avoids this issue entirely) plus migrating accounts to the Envoy frontend. Thank you all for bearing with us while this complex issue was debugged and fixed!
Awesome! Can't wait to see the performance difference and stability, and thanks for helping to look into this issue over the last few months.
Since the issue tracked in this ticket (1004) exclusively affected Azure accounts, it was fixed in all Azure deployments by rolling out OCSP Stapling. As far as I know, you're using an AWS account, so the fix (October 21) is not yet available there. There's a real possibility that the issues you observe
are also related and could be mitigated by rolling out OCSP Stapling in AWS; the second one, I hope, is likely to be better after it's implemented. Editing this comment to include the fresh info I got: the estimated delivery date for the AWS implementation is around mid-January 2025 (given other critical priorities and upcoming change freezes). In the meantime, insecuremode should fully mitigate the situation. Thank you for bearing with us!
Issue Description
Since the Snowflake outage on August 1st, 2024, my project has had a massive memory issue when making queries to Snowflake. Queries we've been using for over a year have never caused this many issues, and it all seemed to start after the outage.
We were using the Snowflake 3.1.0 C# package and had not made any code changes when the issue started.
Since the memory issue started, we did the following:
await using var snowflakeDbConnection = new SnowflakeDbConnection(configuration.GetConnectionString("SnowflakeDb"));
The issue seems to be slightly improved, but memory usage is still higher than we have previously seen and worse than we want it to be.