SNOW-1612981: SNOW-1640968 SNOW-1629635 Massive Memory Load #1004

Closed
watercable76 opened this issue Aug 6, 2024 · 37 comments
Labels: bug · status-fixed_awaiting_release (The issue has been fixed, its PR merged, and now awaiting the next release cycle of the connector.) · status-triage_done (Initial triage done, will be further handled by the driver team)

@watercable76

Issue Description

Since the Snowflake outage on August 1st, 2024, my project has had a massive memory issue when making queries to Snowflake. Queries we've been running for over a year have never caused this many issues; it all seemed to start after the outage.

We were using the Snowflake 3.1.0 C# package and had not made any code changes when the issue started.

Since the memory usage spiked, we have done the following:

  • Updated our connections to use await using var snowflakeDbConnection = new SnowflakeDbConnection(configuration.GetConnectionString("SnowflakeDb"));
  • Upgraded to 4.1.0

The issue seems to be partially resolved, but memory usage is still higher than we have ever seen and worse than we are comfortable with.

@github-actions github-actions bot changed the title Massive Memory Load SNOW-1612981: Massive Memory Load Aug 6, 2024
@dliberatore-th

dliberatore-th commented Aug 12, 2024

My company has been seeing this issue ramping up since at least early July, and it's causing OOM crashes in containers, which makes for a bad user experience for our customers. We're having to rearchitect a large portion of our application to work around this problem. A 2 GB hard-limit container that simply queries Snowflake can crash OOM rapidly. Dotnet's out-of-memory exception doesn't even get thrown; the Linux OOM killer simply terminates the process and we lose the application. We've done some basic analysis, and there appears to be a memory allocation issue somewhere in the REST API querying part of the code; the dotnet GC doesn't seem to apply to whatever is being allocated.

@sfc-gh-dszmolka
Contributor

Hi all - thanks for raising this issue. The fact that it appeared without any code or driver change suggests that it probably does not come from the driver (i.e. this repo) but perhaps from the backend itself; at this point, of course, there is not enough information to tell.

Speaking of which, do any of you (or other readers of this ticket who hit the same issue) have details on how to reproduce the issue? Most helpful would be code which, when run, leads to the unexpectedly high memory usage.

Other details might also be relevant, such as:

  • do you observe the issue on one particular Snowflake deployment only, or on multiple ones? (only relevant, of course, if you have accounts on multiple different deployments)
  • is it related to a specific type of query (SELECT, INSERT, etc.), a specific table or type of data, or a specific Snowflake object (e.g. only when using Streams, only when calling stored procedures, etc.)?
  • if executing the same or a similar workload from a different Snowflake driver (ODBC, Python, etc.) is possible, does it also exhibit the same behaviour?

At this point really any additional info might help; if you have queryIds to share, those could also be useful. Finally, if you're not comfortable sharing all these details in a public place like this, you can raise a ticket with Snowflake Support to work 1:1 with a support engineer.

Thank you all in advance for every detail that could help narrow down this issue.

@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Aug 13, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage Issue is under initial triage status-information_needed Additional information is required from the reporter labels Aug 13, 2024
@watercable76
Author

We are doing just SELECT-based statements on views that fetch data, and only using the C# driver. As far as I know, we have multiple databases, one for each environment, and the issue is happening on all of them.

I've made sure the connection to Snowflake, the command, and the data reader are all disposed (so GC can clean them up), like this:

await using var snowflakeDbConnection = new SnowflakeDbConnection(configuration.GetConnectionString("SnowflakeDb"));
snowflakeDbConnection.Open();

await using var snowflakeDbCommand = snowflakeDbConnection.CreateCommand();
snowflakeDbCommand.CommandText = @"SELECT data FROM created_view WHERE parameter = :parameter";
snowflakeDbCommand.CommandTimeout = 30;

var parameter = snowflakeDbCommand.CreateParameter();
parameter.ParameterName = "parameter";
parameter.Value = value;
parameter.DbType = DbType.String;
snowflakeDbCommand.Parameters.Add(parameter);

await using var dataReader = await snowflakeDbCommand.ExecuteReaderAsync();
snowflakeDbConnection.Close();
return mapper.Map<IEnumerable<Model>>(dataReader);

Even after implementing this, the memory usage is still higher than we want or are comfortable with.
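For reference, here is a fully asynchronous variant of that pattern - only a sketch, reusing the Model, mapper, value and connection-string name from the snippet above and assuming System.Linq is in scope - which opens the connection with OpenAsync and materializes the results before anything is disposed:

await using var snowflakeDbConnection = new SnowflakeDbConnection(configuration.GetConnectionString("SnowflakeDb"));
// OpenAsync avoids blocking a thread-pool thread during the login-request.
await snowflakeDbConnection.OpenAsync();

await using var snowflakeDbCommand = snowflakeDbConnection.CreateCommand();
snowflakeDbCommand.CommandText = "SELECT data FROM created_view WHERE parameter = :parameter";
snowflakeDbCommand.CommandTimeout = 30;

var parameter = snowflakeDbCommand.CreateParameter();
parameter.ParameterName = "parameter";
parameter.Value = value;
parameter.DbType = DbType.String;
snowflakeDbCommand.Parameters.Add(parameter);

await using var dataReader = await snowflakeDbCommand.ExecuteReaderAsync();
// Materialize the mapped rows while the reader and connection are still open;
// mapping a live reader after the connection has been closed can fail or come back empty.
var results = mapper.Map<IEnumerable<Model>>(dataReader).ToList();
return results;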

@elliottcrush

FYI - we experienced the exact same issues here and had to contact Snowflake Support to get this resolved. They confirmed that they had to make a load balancer change for our EU connection; as soon as that was made, our memory pattern improved drastically. @watercable76 @dliberatore-th FYI.

@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Aug 15, 2024

This is very interesting, @elliottcrush; would you mind sharing the case number here for reference, perhaps? It could not only serve as a reference but also let our teams sync internally on the possible issues, if no driver change, no code change, and no other change than switching the load balancer could mitigate the issue. Thank you in advance!

edit: I might have found it. Also, it could be indicative of the error if you see an elevated rate of HTTP 502/503 responses and/or connection resets besides the memory usage increase.

@watercable76
Author

Yes, all of the errors were 502s when trying to make the calls. We thought it was an issue with SNAT connections, which is why we made sure to close connections and set up our variables to be cleaned up by GC. We're trying to work with Snowflake to get a support case opened for this.

@elliottcrush

SF Support said the memory issues we (our company) were seeing were causing 400s on the load balancer.

If you email me @sfc-gh-dszmolka - [email protected] I'll share the details privately.

@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Aug 15, 2024

Thank you @elliottcrush - in the meantime I was able to find it. I appreciate the quick turnaround with the offer, though!

@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage status-information_needed Additional information is required from the reporter labels Aug 15, 2024
@elliottcrush

@sfc-gh-dszmolka we've started to see the exact same issue happen again in production within the last 15 minutes. I've re-raised it with Snowflake Support - any ideas?

@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Aug 16, 2024

Thank you for re-raising it with Snowflake Support, and please do keep working with Snowflake Support on this particular issue rather than here. We have very different SLOs for the two kinds of channels.

As a retrospective, to the best of my knowledge the recurrence happened because the LB change had not been configured as permanent. Now it is, since you re-raised it yesterday, and I see from the associated support case that you confirmed it helped (again). Apologies for the trouble this problem caused.

edit: Our team responsible for traffic ingest into Snowflake is investigating the root cause, and if it somehow turns out to be related to the driver, we will update here too.

@watercable76
Author

We raised a support ticket too and were able to get the load balancer configuration reverted to the previous one we were on. At first glance it's doing a lot better, but I'm going to keep monitoring for another week before I can say it's actually fine again.

@dliberatore-th

dliberatore-th commented Aug 16, 2024

Similar situation here re: a support ticket leading to a load balancer change resolving the client-side memory usage.

I'm inclined to say that this issue is indeed a valid concern for the connector library. Receiving an error response code from the REST APIs shouldn't cause a massive memory spike on the client side. At least, that's not expected or logical behavior from a client library.

The server-side problem was masked by the fact that these API call failures aren't shown in the logs unless you're running at debug log level - for example, see Core/RestRequester.cs. This makes it difficult for monitoring software/services to alert on issues based on a normal production (non-debug-level) logging configuration. Perhaps these failures shouldn't be relegated to the Debug log level, but rather logged at Warning or Error level when an error status is returned from the Snowflake backend? I'm not sure if this is worth opening a separate issue or not.

edit: Since we run the associated microservice in containers, reconfiguring production would have meant building a new container image with updated configuration, which is not really a great option for us.

@elliottcrush

https://community.snowflake.com/s/article/FAQ-Updates-on-Migration-of-Traffic-Serving-Proxy-Load-Balancing-Infrastructure

The change to the load balancer, and notes on what could be impacted, are in the article above.

Which Snowflake drivers and clients are impacted?
Any web browser or Snowflake driver relying on a trust store provided by the operating system or Java runtime environment could be impacted. The following Snowflake drivers rely on the operating system's trust store:
Snowflake NodeJS driver
Snowflake .NET driver

@sfc-gh-dszmolka
Contributor

Thank you all for the comments! The above article just refers to 'if you use these drivers, the new CA might be missing from the trust store, and you might need to manually install it'. At this point, I'm not sure if it's related.

It could indeed be related instead to how HTTP connections are made and maintained with Snowflake from the driver.
The new Envoy LB, which the Snowflake Community article is about, is indeed the new platform, and migrations have been ongoing for many months now. This issue, however, is fairly new.

Thus we're looking into this aspect, but at this moment we could not yet reproduce the issue at the magnitude you folks are seeing.

Here's what we tried (on both Linux and macOS):

  1. install .NET SDK 8 and dotnet-counters
  2. check out https://github.com/Snowflake-Labs/sf-samples/tree/main/samples/spcs/sf-connection/dotnet, which is a simple console app targeting .NET 8.0. It was originally made for testing our container platform SPCS; we now use it outside of that platform.
  3. modify Snowflake.App.Csproj to use a newer version of the .NET driver, because the example still uses the old 2.1.4. Tried this with both 3.1.0 and 4.1.0.
  4. (optionally) modify log4net.config to use the TRACE log level
  5. The program used in the repro attempt is a slight modification of the App.cs that comes with the dotnet example repo. It constantly opens connections and runs a simple query in a loop:
using System.Data;
using Snowflake.Data.Client;

class App
{
    const string TOKENPATH = "/snowflake/session/token";

    static string getConnectionString()
    {
        string? account = Environment.GetEnvironmentVariable("SNOWFLAKE_ACCOUNT");
        string? host = Environment.GetEnvironmentVariable("SNOWFLAKE_HOST");
        if (File.Exists(TOKENPATH))
        {
            // automatically set by env
            string token = File.ReadAllText(TOKENPATH);
            return $"account={account};authenticator=oauth;token={token}";
        }
        else
        {
            // basic auth, variables must be set by user
            string? user = Environment.GetEnvironmentVariable("SNOWFLAKE_USER");
            string? password = Environment.GetEnvironmentVariable("SNOWFLAKE_PASSWORD");
            return $"account={account};user={user};password={password}";
        }
    }

    static void SetEnvVariables()
    {
        Environment.SetEnvironmentVariable("SNOWFLAKE_ACCOUNT", "{account targeting Envoy}");
        Environment.SetEnvironmentVariable("SNOWFLAKE_USER", "{user}");
        Environment.SetEnvironmentVariable("SNOWFLAKE_PASSWORD", "{value}");
    }

    static int Main()
    {
        SetEnvVariables();
        for (var i = 0; i < int.MaxValue; i++)
        {
            try
            {
                //Console.WriteLine("Starting connection!");
                using (IDbConnection conn = new SnowflakeDbConnection())
                {
                    conn.ConnectionString = getConnectionString();
                    conn.Open();
                    using (IDbCommand cmd = conn.CreateCommand())
                    {
                        cmd.CommandText = "SELECT current_account() AS account, current_database() AS database, current_schema() AS schema";
                        IDataReader reader = cmd.ExecuteReader();
                        if (reader.Read())
                        {
                            string account, database, schema;
                            account = reader.GetString(0);
                            database = reader.GetString(1);
                            schema = reader.GetString(2);
                            //Console.WriteLine("account: " + account);
                            //Console.WriteLine("database: " + database);
                            //Console.WriteLine("schema: " + schema);
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                return 1;
            }

            Thread.Sleep(100);
        }

        return 0;
    }
}
  6. Run it under dotnet-counters and observe how memory changes: ~/.dotnet/tools/dotnet-counters monitor --refresh-interval 1 -- dotnet run Program.cs

Our experience is that even though connections are frequently being opened, GC Heap Size sits around 30-50 MB and Working Set sits around 200-215 MB.

So far we could not reproduce the issue you folks are seeing. Obviously there is something triggering this problem for multiple users, but at this moment we could not yet pinpoint it.

So it would be very helpful if you could:

  1. share which Snowflake .NET driver versions you're using (I've seen 3.1.0/4.1.0 mentioned in the original issue submission) and the OS on which you experience the problem
  2. share any details on how you manage connections with the driver: how you're connecting to Snowflake and closing those connections
  3. share, if possible, a reproducible scenario which leads to the issue you're seeing - that would be invaluable for troubleshooting and super appreciated too

Thank you in advance, and thanks again for all the info you've already shared here!

@sfc-gh-dszmolka
Contributor

Quick status: we're still trying to reproduce the issue internally and have not managed to yet. Kudos to the one of you who provided a code example on the ticket opened with Snowflake Support! 👍

May I ask, if you already have (or will have) a ticket for this issue with Snowflake Support, whether you have a DEV/test account we could move over to Envoy (the new LB where the symptoms started) to profile and observe the traffic more closely without interrupting important production flows? That could be very helpful.

If you have such an account to spare, even for a limited amount of time (we would then migrate it back to the previous LB), please send it on your Support case and mention it was requested for reproduction and profiling. You can of course also send it here. Thank you so much!

@elliottcrush

Thanks @sfc-gh-dszmolka - that was us with the code example. We're liaising with Support about going back to the new LB. We don't have a separate environment; however, we are willing to upgrade our snowflake-connector-net and migrate to the new LB during an agreed time frame. I'll continue to discuss all of that on the support ticket.

@sfc-gh-dszmolka
Contributor

A short update: we discovered one possible culprit for the issue on the driver side, and would like to address it in #1009. However, this might not be an actual fix: without a way to reproduce the issue, we cannot confirm that the proposed change actually fixes the problem described in this particular issue.

Anyway, we're working on eliminating this possibility.

@sfc-gh-dszmolka sfc-gh-dszmolka removed the status-pr_pending_merge A PR is made and is under review label Aug 27, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added the status-fixed_awaiting_release The issue has been fixed, its PR merged, and now awaiting the next release cycle of the connector. label Aug 27, 2024
@sfc-gh-dszmolka
Contributor

The aforementioned change has been merged and will be part of the next .NET driver release, scheduled for the coming days.

Due to the lack of a reliable reproduction, we cannot verify whether this change helps in this particular case, but after the next release it will be possible to test it and see how it performs.

@sfc-gh-dszmolka
Contributor

In addition to that (and we appreciate the suggestion), #1008 - which is now also merged and will be part of the next release - exposes the non-successful responses at Error log level (previously Debug), so hopefully they will be more easily visible.

@watercable76
Author

@sfc-gh-dszmolka Thanks for getting that PR in and working through all of these problems we've had. Do you know when that next release will be published?

@sfc-gh-dszmolka
Contributor

I believe it should come in the next few days; I think we're first hoping to get in yet another possible fix for a possible culprit (the team discovered some anomalies around Okta-type authenticators, which might even be related to this issue, and wanted to fix that as well -> #1010).

I would also like to take the opportunity to ask here, just to confirm whether the fix in #1010 can be related or not: are you using Okta-type authenticators in the applications that saw the unexplainable sudden increase in memory usage, or something else? Thanks for providing data on this!

@watercable76
Author

Not using Okta over here; more Azure-based than anything.

@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Sep 2, 2024

Short update: we're also hoping to get #1014 in before the release. At this stage that warrants a list of the changes that have been made in an attempt to mitigate this issue:

@sfc-gh-dszmolka sfc-gh-dszmolka changed the title SNOW-1612981: Massive Memory Load SNOW-1612981: SNOW-1640968 Massive Memory Load Sep 2, 2024
@dliberatore-th

dliberatore-th commented Sep 3, 2024

One possible factor here, different from your test code above which does things synchronously in a serial manner, is that we're doing things async from a REST API of our own to serve requests in parallel. This is intended to fill several components on a page.

I can't share our code with you and don't have a nice condensed test case for you, unfortunately, but you could probably simulate it well enough by doing async calls with a max of maybe 15 at a time, using a startup delay of a few ms each, and having it use a query that takes a few seconds to complete (and only return a few rows totaling a few KB) so they stack up.

Thanks for keeping up with the efforts to improve the software. We're looking forward to trying those fixes out.

edit: Not using Okta here either.
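For reference, a rough simulation of the parallel pattern suggested above might look like the sketch below. It is only a sketch: the 15-way concurrency, the few-millisecond stagger, the environment-variable connection string, and the use of SYSTEM$WAIT as a stand-in for a query that takes a few seconds and returns almost nothing are all assumptions taken from the description.

using System;
using System.Collections.Generic;
using System.Data.Common;
using System.Threading.Tasks;
using Snowflake.Data.Client;

class ParallelRepro
{
    static async Task Main()
    {
        // Connection string source is an assumption for this sketch.
        string? connectionString = Environment.GetEnvironmentVariable("SNOWFLAKE_CONNECTION_STRING");
        var tasks = new List<Task>();
        for (var i = 0; i < 15; i++)            // at most ~15 queries in flight at once
        {
            var startupDelayMs = i * 5;         // stagger each call by a few ms
            tasks.Add(RunQueryAsync(connectionString!, startupDelayMs));
        }
        await Task.WhenAll(tasks);
    }

    static async Task RunQueryAsync(string connectionString, int startupDelayMs)
    {
        await Task.Delay(startupDelayMs);
        await using var connection = new SnowflakeDbConnection(connectionString);
        await connection.OpenAsync();
        await using DbCommand command = connection.CreateCommand();
        // SYSTEM$WAIT stands in for a query that takes a few seconds to complete
        // and returns only a tiny result, so the parallel calls stack up.
        command.CommandText = "SELECT SYSTEM$WAIT(5)";
        await using var reader = await command.ExecuteReaderAsync();
        while (await reader.ReadAsync()) { /* discard the small result */ }
    }
}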

@sfc-gh-dszmolka sfc-gh-dszmolka changed the title SNOW-1612981: SNOW-1640968 Massive Memory Load SNOW-1612981: SNOW-1640968 SNOW-1629635 Massive Memory Load Sep 11, 2024
@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Sep 25, 2024

Short update: multiple teams are still working on root-causing (and then fixing) this issue with the highest priority. The fixes mentioned above, while they do fix various issues and are still important, do not seem to be enough in themselves to resolve this one. We have had a local repro for a while now, which provides some relief and the ability to iterate on debugging faster.

What we observed is that disabling certificate revocation checking (setting insecuremode=true in the connection string) does help reduce the memory usage and thus the OOMKilled events, even when the account is on the 'new' frontend where the issue was discovered. This behaviour is the current focus of the investigation.

Most of you are hopefully no longer impacted, having been migrated back earlier to the 'old' frontend, where the containerised .NET application doesn't exhibit this behaviour - but in case you haven't been migrated back and are just discovering this issue, do raise a case with Snowflake Support so we can mitigate it for you.
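For anyone who cannot be migrated back immediately, the workaround mentioned above is only a connection-string flag. A minimal sketch with placeholder values (to be clear, insecuremode disables certificate revocation checking, so treat it strictly as a temporary stopgap):

using Snowflake.Data.Client;

// Temporary mitigation only: insecuremode=true turns off CRL/OCSP revocation checking,
// which is what sidesteps the memory behaviour described above.
// {account}/{user}/{password} are placeholders, not real values.
using var connection = new SnowflakeDbConnection("account={account};user={user};password={password};insecuremode=true");
connection.Open();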

@sfc-gh-dszmolka
Contributor

Short update: the issue seems to be narrowed down exclusively to Linux, and to how the dotnet runtime and the underlying OS handle memory allocation for CRLs, which specifically becomes a problem with large CRLs that have lots of entries. (This is also why turning off CRL checking with insecuremode can provide relief.)

The issue/observation has been reported to the dotnet maintainers, while we continue to work internally at Snowflake on a possible mitigation.

@sfc-gh-mtatoczenko

We have opened this issue with dotnet as well: dotnet/runtime#108557

Our teams are looking at a mitigation to apply on Snowflake's side for this, and we will provide updates when that is ready and confirmed to be working.

@myshon

myshon commented Oct 17, 2024

I've been debugging services for days and struggling with OOMKilled issues... but I'm finally (somehow) happy for a moment that I found this thread...

I'm experiencing exactly the same issues with the Snowflake driver 😢

  1. .NET8 (latestFeature)
  2. Snowflake.Data 4.1.0
  3. Docker image: mcr.microsoft.com/dotnet/aspnet:8.0.10-jammy-chiseled
  4. GC Heap Size stays below 50 MB
  5. Kubernetes pods grow up to 2 GB of memory and get OOMKilled... (while other, very similar services based on EntityFramework & Postgres sit around ~300 MB)

@sfc-gh-dszmolka
Contributor

Sorry to see this issue has caused so many problems for you (and others as well). I must also add that the root cause is the same as for your other issue with login-requests taking a really long time: #1025.

The good news is that not only is it now easily mitigated by insecuremode, but (while it is also being investigated on the dotnet side) we're in the process of rolling out the Snowflake-side mitigation, which hopefully will prevent more people from running into it. Will keep this thread posted.

@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Oct 21, 2024

On 19 October 2024, Saturday, in the early UTC hours, the Snowflake-side mitigation rollout finished. Meaning:

  • hopefully no new occurrences will happen after the morning of 19 October 2024 😅
  • for those of you who were already affected and mitigated by:
    -> insecuremode (and nothing else): please try removing insecuremode from the connection string and see if the memory usage is still unexpectedly high / if login-request still needs literally multiple seconds to complete
    -> being migrated back to the 'old' frontend (nginx): if you already have an open ticket with our Support, we'll work with you individually to migrate to the 'new' frontend (Envoy) and test

We sincerely hope the Snowflake-side mitigation provides a full workaround, even though the dotnet-side investigation is still ongoing and at this moment it's uncertain whether it will yield any directly visible change in the near future.

Feedback about the fix, as always, is very much appreciated!

edit: as of the very early UTC hours of 21 October 2024, every account is now again on the new frontend, which has the fix (except those with whom we have a separate, different agreement; those accounts will be migrated per that agreement).

@myshon

myshon commented Oct 21, 2024

I removed "insecuremode" and still observe high latency on login-request today.

EDIT: In general, query latency is much lower; however, login-request still takes 2-3 seconds.
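In case it helps to compare numbers, a simple way to time this from the application side is to wrap the connection open in a Stopwatch. Only a sketch: it measures the whole Open call, which covers slightly more than the login-request itself, and it assumes a .NET top-level program with the connection string in an environment variable.

using System;
using System.Diagnostics;
using Snowflake.Data.Client;

// Rough client-side timing of connection open, which is dominated by the login-request.
var connectionString = Environment.GetEnvironmentVariable("SNOWFLAKE_CONNECTION_STRING");
var stopwatch = Stopwatch.StartNew();
await using var connection = new SnowflakeDbConnection(connectionString!);
await connection.OpenAsync();
stopwatch.Stop();
Console.WriteLine($"Open took {stopwatch.ElapsedMilliseconds} ms");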

@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Oct 21, 2024

@myshon thanks for testing it so quickly! Maybe there's still something ongoing in your particular case which needs further investigation.
We got confirmation from some other customers that the massive memory increase did resolve with the rollout of the Snowflake-side mitigation I spoke about (which is OCSP stapling). If it did not help with the slow login-request with CRL checking enabled, then we need to investigate that separately. Could you please create a fresh set of DEBUG logs for us and share them with my colleague on the ongoing Support case?

Editing my comment after seeing your edit that the initial 5 s login-request time was brought down to 2 s. That is already a significant improvement, and it also confirms the fix worked in your use case too. However, it seems further improvement is needed once we have investigated why login is still in the 2-3 s range instead of the sub-second range. We'll work with you separately on the support case you have with us.

@sfc-gh-dszmolka
Contributor

This post is to notify people that, since the Snowflake-side mitigation (implementing OCSP stapling) has been rolled out and we already have confirmation it brought memory usage back down to the pre-issue level, I'm going to close this issue as resolved in the coming days if there are no further reports of massive memory increases with the .NET driver.

In parallel, the dotnet issue (linked above) is ongoing separately.

@sfc-gh-dszmolka
Contributor

Closing this issue, as it was completely mitigated in the early UTC hours of 21 October 2024 by the global rollout of OCSP stapling (which avoids this issue entirely) plus migrating accounts to the Envoy frontend.

Thank you all for bearing with us while this complex issue was debugged and fixed!

@watercable76
Author

Awesome! Can't wait to see the difference in performance and stability - and thanks for helping to look into this issue over these last few months.

@myshon

myshon commented Nov 22, 2024

Unfortunately, I still experience memory issues and pod restarts 😞
Upgraded the connector to 4.2.
(screenshots attached)

@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Nov 22, 2024

Since the issue tracked in this ticket (#1004) exclusively affected Azure accounts, it was fixed in all Azure deployments by rolling out OCSP stapling. As far as I know, you're using an AWS account, so the fix (October 21) is not yet available to you.

There's a real possibility that the issues you observe:

  • originally: an average of 5 s needed for login-request; now it's around half that, 2 s AFAIK
  • now: increased pod memory usage (which is very similar to the issue pattern the other folks here, on Azure, were having)

are also related to, or could be mitigated by, rolling out OCSP stapling in AWS; the second one I'm hoping will likely improve after it's implemented.
I don't have an exact date for it yet, but I'll update this thread when I have any new info.

Editing this comment to include fresh info I got: the estimated delivery date for the AWS implementation is around mid-January 2025 (given other critical priorities and upcoming change freezes). In the meantime, insecuremode should fully mitigate the situation. Thank you for bearing with us!
