Network.Socket.accept: resource exhausted (Too many open files) #2042

Closed
steve-chavez opened this issue Nov 24, 2021 · 35 comments · Fixed by #2158

@steve-chavez commented Nov 24, 2021

(PostgREST v8.0.0.20211102 behind Kong/Nginx)

This happens when there's a good amount of traffic going into PostgREST; it then stops processing any new requests until it's restarted manually.

Nov 23 01:36:27 postgrest[11475]: 23/Nov/2021:01:36:27 +0000: Attempting to connect to the database...
Nov 23 01:36:27 postgrest[11475]: 23/Nov/2021:01:36:27 +0000: Connection successful
Nov 23 01:36:27 postgrest[11475]: 23/Nov/2021:01:36:27 +0000: Config re-loaded
Nov 23 01:36:27 postgrest[11475]: 23/Nov/2021:01:36:27 +0000: Schema cache loaded
Nov 23 03:54:41 postgrest[11475]: Network.Socket.accept: resource exhausted (Too many open files)
Nov 23 03:55:17 postgrest[11475]: 127.0.0.1 - - [23/Nov/2021:03:55:16 +0000] "POST /<redacted> HTTP/1.1" 503
Nov 23 04:13:51 systemd[1]: Stopped PostgREST.
Nov 23 04:13:51 systemd[1]: Started PostgREST.
Nov 23 04:13:51 postgrest[15073]: 23/Nov/2021:04:13:51 +0000: Attempting to connect to the database...
Nov 23 04:13:51 postgrest[15073]: 23/Nov/2021:04:13:51 +0000: Connection successful
Nov 23 04:13:51 postgrest[15073]: 23/Nov/2021:04:13:51 +0000: Listening on port 3000
Nov 23 04:13:51 postgrest[15073]: 23/Nov/2021:04:13:51 +0000: Config re-loaded
Nov 23 04:13:51 postgrest[15073]: 23/Nov/2021:04:13:51 +0000: Listening for notifications on the pgrst channel

Ideally PostgREST should recover or at least die and let systemd restart it.

This is a fix that needs to be done upstream: yesodweb/wai#830.

A patch is already up for review: yesodweb/wai#831

Also, a way to reproduce is mentioned on: yesodweb/wai#825.


Not yet confirmed, but increasing the db-pool size should help, since PostgREST will be able to process more requests without running out of sockets first.
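
As a reference for that knob, a minimal postgrest.conf sketch (values are purely illustrative, not recommendations):

# allow more concurrent database connections so requests queue up later
db-pool = 100
# seconds an unused pool connection is kept open (default is 10)
db-pool-timeout = 10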

@steve-chavez

Seems related to #1399 (comment) and #905.

@steve-chavez commented Jan 4, 2022

A way to reproduce this was found in #2092 (comment), basically using pg_sleep in a function. That makes sense: the db-pool connection slots are busy and the db can't handle the incoming traffic, yet postgrest keeps accepting new requests and then runs out of sockets.

@steve-chavez

Seems yesodweb/wai#831 would help with EMFILE in that postgrest won't become blocked and will start responding again if sockets become freed. However, if the db is still busy and can't handle incoming traffic, EMFILE will be reached again and thus postgrest will stop responding again.

I think it would be better if we apply backpressure whenever EMFILE happens, so we give db connections a chance to become available. So we would start rejecting any new request with 429 Too Many Requests; this would also serve as a hint to scale or tune the db.

@steve-chavez

whenever EMFILE happens

I guess we could try catching that exception somehow but I'm not sure how we would stop responding with 429. Can we count the number of used sockets?

@steve-chavez

Can we count the number of used sockets?

Another option could be to try and fix this at the pool level. With that we could even stop hitting EMFILE by somehow failing when too many requests are waiting for a pool resource.

Hasql.Pool is a small wrapper around Data.Pool, which seems to be lacking an introspection feature (bos/pool#36) that could help us. It also looks to be unmaintained (bos/pool#39).

@wolfgangwalther

So we would start rejecting any new request with 429 Too Many Requests; this would also serve as a hint to scale or tune the db.

429 is a client error, i.e. rate limiting. A user could hit this with just a single request - it's not their fault. This absolutely needs to be a 5xx error code.

@wolfgangwalther

Hasql.Pool is a small wrapper around Data.Pool, which seems to be lacking an introspection feature (bos/pool#36) that could help us. It also looks to be unmaintained (bos/pool#39).

Looks like there is a maintained fork: https://hackage.haskell.org/package/resource-pool-fork-avanov-0.2.4.0

Does that have the introspection feature?

Maybe we should just make Hasql.Pool use that fork instead.

@wolfgangwalther

Maybe we can also use https://hackage.haskell.org/package/unix-2.7.2.2/docs/System-Posix-Resource.html. We could:

  • Use setResourceLimit to increase our process' softlimit to the hardlimit allowed by the kernel. This seems to be what other webservers do, too [citation needed]. This could already avoid hitting EMFILE for most cases (see the sketch below).
  • Use getResourceLimit and compare that with the number of pending requests in the pool.
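
As a rough illustration of the first bullet (a sketch only; raiseNoFileSoftLimit is a hypothetical helper, not PostgREST code):

import System.Posix.Resource (Resource (ResourceOpenFiles), ResourceLimits (..), getResourceLimit, setResourceLimit)

-- raise this process' soft limit for open file descriptors up to the
-- hard limit allowed by the kernel
raiseNoFileSoftLimit :: IO ()
raiseNoFileSoftLimit = do
  limits <- getResourceLimit ResourceOpenFiles
  setResourceLimit ResourceOpenFiles limits { softLimit = hardLimit limits }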

@steve-chavez

429 is a client error, i.e. rate limiting. A user could hit this with just a single request - it's not their fault. This absolutely needs to be a 5xx error code.

True, agree. I think having a different status than 503 for this would be better. Relevant options:

  • 504 Gateway Timeout: more related to proxies and doesn't seem to convey a temporary state
  • 529 Site is overloaded: more precise but non-standard, only used by some companies

I'm leaning towards 529 for now.

@wolfgangwalther

I think having a different status than 503 for this would be better.

I don't see why, yet. 503 is perfect. For the client, there is no difference. The server admin needs to take a look into the logs anyway, to figure out what's happening.

@steve-chavez

Does that have the introspection feature?

Yes! It has PoolStats; I think the createFailures field could help.

Maybe we should just make Hasql.Pool use that fork instead.

Seeing this recommendation by Nikita, I think I'll vendor the Hasql pool code for now and go from there.

I don't see why, yet. 503 is perfect. For the client, there is no difference. The server admin needs to take a look into the logs anyway, to figure out what's happening.

I just thought that making the status more noticeable would help admins detect faster that it's time to scale or tune, but yeah, 503 with a special message will do as well.

@steve-chavez commented Jan 14, 2022

I think the createFailures field could help.

Doesn't help. So I have a branch that, for now, logs the pool stats on each request. With db-pool=2 and 2 connections consumed with pg_sleep (/rpc/sleep), the next request (/projects) logs:

13/Jan/2022:21:22:29 -0500: PoolStats {highwaterUsage = 2, currentUsage = 2, takes = 6, creates = 4, createFailures = 0}

And the request waits until the connections are freed, then it gets a response. Further requests only increase takes, but createFailures always stays at 0. Not sure when createFailures increases; it doesn't even when new requests arrive while pg is down.

Now currentUsage does coincide with the connections consumed. We could start rejecting new requests when currentUsage == db-pool, but I don't think that's right, since a connection could be freed some milliseconds later and the request could then succeed. It'd be better to have the size of the waiter queue (bos/pool#42) and reject requests when it surpasses a threshold (TBD).

The waiter queue is already included in avanov's fork but it doesn't expose its size, so adding that would be the next step.

Besides that, I'll do a load test and move the pool stats to a /metrics endpoint to see some numbers there.

Edit: These are the stats when running GETSingle with db-pool = 20:

{"dbPoolStats":{"currentUsage":20,"creates":43,"highwaterUsage":20,"createFailures":0,"takes":102097}}

@wolfgangwalther

Hm. Using pool stats for metrics seems like a nice idea. However, I don't really understand why we're even looking at the db-pool in terms of this issue?

I think we should be looking at the http connections, not the database connections or a queue for that. This won't help dealing with connections that don't hit the database anyway.

So what if we implemented a simple counter with warp's setOnOpen and setOnClose (https://hackage.haskell.org/package/warp-3.1.3/docs/Network-Wai-Handler-Warp.html)? We can even return False from setOnOpen to reject the connection if the counter is above the threshold.
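
A rough sketch of that counter idea (connectionLimitSettings and maxConns are made-up names; it assumes warp runs the onClose hook for every connection that went through onOpen):

import Data.IORef (atomicModifyIORef', newIORef)
import qualified Network.Wai.Handler.Warp as Warp

-- count open HTTP connections and refuse new ones once the counter
-- reaches maxConns, so we never get near the file descriptor limit
connectionLimitSettings :: Int -> IO Warp.Settings
connectionLimitSettings maxConns = do
  openConns <- newIORef (0 :: Int)
  let onOpen  _addr = atomicModifyIORef' openConns (\n -> (n + 1, n < maxConns))
      onClose _addr = atomicModifyIORef' openConns (\n -> (n - 1, ()))
  pure $ Warp.setOnOpen onOpen $ Warp.setOnClose onClose Warp.defaultSettings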

This would just reject the connection immediately and would not return a "nice" response to the client. But this is actually what we should do, because accepting the connection to be able to return a nice error response might lead to the same issue with enough workload. By rejecting connections temporarily we can at least keep the process alive.

We could use the same counter for both admin and main connections - but could then have two different thresholds (higher for admin) to be able to keep some connections free for that use. This way we can still respond 200 to the liveness probe. We could then return 503 for the readiness probe, when the main threshold has been exceeded. This way we would not get any new traffic routed to us in a cluster environment.

@steve-chavez commented Jan 14, 2022

I don't really understand why we're even looking at the db-pool in terms of this issue?

Because EMFILE is the consequence of a starved pool. When all the pool connections are consumed, each new request waits for a connection to become free (as mentioned above); during this time the http connection remains open and thus the socket is open as well, and enough of these requests will then generate EMFILE. So if we solve the starved pool (the root cause), EMFILE will be solved as well.

To be clear, the solution is to respond fast with a 529 status (or any 5xx) whenever the pool is starved, so we won't hit EMFILE.

I think we should be looking at the http connections,
We can even return False from setOnOpen to reject the connection if the counter is above the threshold
This would just reject the connection immediately and would not return a "nice" response to the client

That doesn't seem reliable at all to me: the number of http connections we can support could be high if the db queries are fast or low if they are slow; it seems too hard to measure that for every case. Also, this type of error should notify the admins/devs that the server is overloaded (needs scaling); it wouldn't be helpful at all to not return any error message.

On the contrary, db-pool stats are reliable and we are in full control of them in any case.

This won't help dealing with connections that don't hit the database anyway.

🤔 Can you elaborate on that?

@wolfgangwalther

Assuming db-pool=10 and the default file descriptor limits of the unix kernel (softlimit 1024 and hardlimit 4096, afaik), my understanding is the following:

  • Because of the 10 sessions to the database, we can handle 10 requests in parallel.
  • When all 10 database sessions are in use, new http connections will need to wait until a session is free to serve the request.
  • Now assume that we're getting a constant workload where the number of new http connections (i.e. requests) is more than we can handle from the pool/database side; the number of http connections that are waiting for a free slot will increase.
  • Each http connection keeps a file descriptor open.
  • As soon as we hit the kernels soft limit of 1024 http connections (real number is slightly lower, because we are using a few file descriptors for other purposes), we can't create any new file descriptors - and wai can't accept any new connections. It will then throw an EMFILE error and - unfortunately - be stuck.
  • To work around that bug in WAI, we can reject new http connections when we still have a few file descriptors available, let's say at 950 open http connections. This way, we will never get to the point of running out of file descriptors, we will never get EMFILE, the process will never be stuck again.

The reason we are hitting that limit is that we are getting more requests/s than we can handle. In my understanding this can happen in regular operation.

I don't really know what you mean by "starved pool". Maybe this is a separate problem that I don't fully understand yet, which can lead to the same consequence. But surely the "we're getting more requests than we can handle" is a problem that will lead to the EMFILE problem eventually, too - even if you solve the "starved pool" thing.


This won't help dealing with connections that don't hit the database anyway.

Can you elaborate on that?

Think OPTIONS requests. Or many requests with a wrong JWT token. Those will never make it to the database pool, right?

But still: When we're getting busted with maaaaany of those requests, I can imagine getting to the same limit of file descriptors and getting that EMFILE error.


That doesn't seem reliable at all to me: the number of http connections we can support could be high if the db queries are fast or low if they are slow; it seems too hard to measure that for every case.

The limit would be based on the number of available file descriptors. The limit would be there only to prevent reaching EMFILE. You can read the current process' softlimit for file descriptors and then base the threshold on that. You don't need to measure anything. This is much more reliable than anything else - if the goal is preventing EMFILE.

Also, this type of error should notify the admins/devs that the server is overloaded (needs scaling); it wouldn't be helpful at all to not return any error message.

The nature of the problem ("having no more http connections available") prevents us from responding with a proper error to every request in that case... because that's exactly the problem. We can't handle more requests. But we can (1) stay alive and (2) report that failure at the readiness endpoint - and this is what admins need to observe.

@steve-chavez

I don't really know what you mean by "starved pool"

That the pool is left without resources (connections); pool starvation seems to be a common term. See Dealing with database connection pool starvation on Oracle DB.

But surely the "we're getting more requests than we can handle" is a problem that will lead to the EMFILE problem eventually, too - even if you solve the "starved pool" thing.
But still: When we're getting busted with maaaaany of those requests, I can imagine getting to the same limit of file descriptors and getting that EMFILE error.

Oh, I don't think so. Warp is pretty fast; it can handle 100K reqs while we can only handle about 2K reqs (reads) due to the pool limitation.

That's why in general I think the direction should be improving the pool so we can get closer to Warp limits. Doesn't seem right to limit Warp when it's not the main culprit and it can handle so much.

Improving the pool means, for example, allowing the pool to use more than one db. I've found in supabase/benchmarks#24 that vertically scaling a pg instance doesn't lead to increased reads, so we need to enable horizontal scaling at the pool level. In fact, doing this would also clear EMFILE in most cases I've seen, since postgrest can then respond so much faster (numbers pending, but many recommend horizontal scaling for scaling reads).

Of course, improving the pool also means it should apply backpressure; tuning queries is another option to handle load.

Think OPTIONS requests. Or many requests with a wrong JWT token. Those will never make it to the database pool, right?

If those respond, that doesn't look like a problem to me. That being said, my main idea was to have a global state whenever we do backpressure, so those requests could be rejected as well. Same for the admin endpoint; it would fail.

As soon as we hit the kernels soft limit of 1024 http connections (real number is slightly lower, because we are using a few file descriptors for other purposes)

I don't know what other external factors can affect the http connection limits; that's why it seems so unreliable to me. Meanwhile, the pool is something we fully control.

Why so opposed to the pool approach :)? What's the actual drawback? I cannot be the only one who thought of such an approach (encode/starlette#802 (comment)).

@wolfgangwalther

Why so opposed to the pool approach :)? What's the actual drawback?

Because it does not relate to this issue. You have given many great ideas for improving performance in this post, which are well worth discussing. But those don't solve the problem we have here. They might make it less likely to occur, but the problem itself will still be there.

That's why in general I think the direction should be improving the pool so we can get closer to Warp limits. Doesn't seem right to limit Warp when it's not the main culprit and it can handle so much.

But Warp is the main culprit. Warp is not dealing with EMFILE correctly.

The issue here is the following:

This happens when there's a good amount of traffic going into PostgREST, then it stops processing any new requests until it's restarted manually.
[...]
Ideally PostgREST should recover or at least die and let systemd restart it.

Improving the pool is just a workaround for that very problem. Again: improving the pool and performance is a great thing on its own and we should discuss it - elsewhere.

There is no point in just trying to make the "postgrest hangs and needs to be restarted manually" problem less likely, when we can just eliminate it altogether.


I think we're just talking about different problems here. Given my example above, with a pool size of 10 and a file descriptor limit of 1024, I am talking about limiting the queue of waiting http requests at around 1000 to make sure we never have our PostgREST process hang.

Now, of course, once we are in that state there will be a different problem: those 1000 requests still in the queue might take a long time until all of them are served. And that seems to be what you are concerned about. It makes perfect sense to have a much smaller threshold to start rejecting new requests with proper error messages, and to base this on pool performance. But then again, it also makes a lot of sense to keep this part configurable. I assume it would make a big difference whether you run a single instance of postgrest or a cluster, and how you would want to deal with this in both cases. But: this is a different issue.

@steve-chavez

I am talking about limiting the queue of waiting http requests at around 1000

Looks like Akka has a similar way to limit its http connection queue.

At the db pool level, SQLAlchemy has a max_overflow setting.

@steve-chavez

Just to inform about some progress.

I found that Hasql.Pool uses takeResource, which is documented to block if the pool is exhausted. There's also a tryTakeResource which doesn't block (added here; it uses the pool's current connection usage) and returns Nothing when the pool is exhausted. To keep it really simple (without patching the pool lib), we could add, say, 10 connections (configurable; a poor man's version of the waiter queue) to db-pool internally and just start rejecting requests when the pool starts returning Nothing.

Using more db connections for internal purposes doesn't seem right though (and we also cannot add 200 connections for a waiter queue, because connections are scarce in many cases), so it would be better to improve tryTakeResource to consider the waiter queue.
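
A rough sketch of the fail-fast idea against Data.Pool (withConnectionOrReject is a hypothetical helper, not the actual PostgREST code):

import Control.Exception (onException)
import Data.Pool (Pool, destroyResource, putResource, tryTakeResource)

-- instead of blocking on an exhausted pool, return Nothing so the caller
-- can answer right away with a 5xx instead of keeping the socket open
withConnectionOrReject :: Pool conn -> (conn -> IO a) -> IO (Maybe a)
withConnectionOrReject pool action = do
  taken <- tryTakeResource pool
  case taken of
    Nothing -> pure Nothing  -- pool exhausted, reject the request
    Just (conn, local) -> do
      result <- action conn `onException` destroyResource pool local conn
      putResource local conn
      pure (Just result)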

@steve-chavez

When I try this test (simple table GET with filter) with 500 concurrent users (VUs in k6), I get a totalWaiters (avanov/pool#3) of 483:

{"dbPoolStats":{"currentUsage":10,"creates":421,"highwaterUsage":10,"createFailures":0,"totalWaiters":483,"takes":190445}}

After 30 seconds, the totalWaiters goes to 0, and all keeps working normally.

Now with 1000 concurrent users, I manage to reproduce EMFILE:

Feb 02 03:56:01 pgrstServer postgrest[12703]: Network.Socket.accept: resource exhausted (No file descriptors available)
Feb 02 03:56:02 pgrstServer postgrest[12703]: Network.Socket.accept: resource exhausted (No file descriptors available)

Which happens because the soft limit is 1024 file descriptors:

systemctl show postgrest.service  | grep LimitNOFILE
LimitNOFILE=524288
LimitNOFILESoft=1024

Cannot query the totalWaiters at this point because the admin port also fails with curl: (7) Failed to connect to localhost port 3001: Connection refused, as expected.

If I change the soft limit to 5000, EMFILE is not hit, and the total waiters can be seen to be 987:

{"dbPoolStats":{"currentUsage":10,"creates":12,"highwaterUsage":10,"createFailures":0,"totalWaiters":987,"takes":38264}}

So it seems limiting totalWaiters is not a good way to do backpressure, since there can be many waiters, but if PostgREST can respond soon enough it will be fine and keep running. You were right @wolfgangwalther, this is a different issue and what we should limit is the queue waiting time.


I'll step back and instead explore the soft limit threshold.

The nature of the problem ("having no more http connections available") prevents us from responding with a proper error to every request in that case... because that's exactly the problem. We can't handle more requests. But we can (1) stay alive and (2) report that failure at the readiness endpoint - and this is what admins need to observe.

Could we respond a proper client error if we compare totalWaiters to the soft limit? Though that wouldn't work for reqs that don't hit the db, like the ones that only reject JWT, as mentioned before. Hmm...

@wolfgangwalther

systemctl show postgrest.service  | grep LimitNOFILE
LimitNOFILE=524288
LimitNOFILESoft=1024

I'll step back and instead explore the soft limit threshold.

Note the huge difference between soft and hard threshold. I guess most distributions will allow a much higher hard limit. We can set the soft limit from within the postgrest process itself, as mentioned earlier. Other webservers do, too. Increasing the soft limit to the hard limit will already solve this issue in 95% of cases, because the hard limit is so high that the bug will just occur even less frequently.

The remaining 5% will be turning down new requests against some kind of threshold. We could compare against:

  • the number of open file descriptors: Most reliable, though I'm not sure how performant querying this from /proc for each request would be (see the sketch after this list)...
  • the number of http connections: Should have a linear relationship with file descriptors, although we'd need a safety buffer, because other parts of postgrest open file descriptors, too. I tested a few weeks ago - a bare process connected to a database has something around 15-20 FDs open, can't remember exactly.
  • the number of totalWaiters: Less reliable, because it correlates less with the number of open FDs, as you already mentioned regarding non-db-queries. Probably easiest to implement, because it's already there.
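
For the first option, a rough sketch of reading the count from /proc (Linux-only; countOpenFds is a made-up name):

import System.Directory (listDirectory)

-- count this process' open file descriptors by listing /proc/self/fd
-- (the listing itself briefly uses one extra descriptor)
countOpenFds :: IO Int
countOpenFds = length <$> listDirectory "/proc/self/fd"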

@steve-chavez

Other webservers do, too.

I do see Nginx having worker_rlimit_nofile, which manipulates the soft limit; tuning that is also a suggested solution to EMFILE. So I guess it's fine to manipulate the soft limit within postgrest itself.

Increasing the soft limit to the hard limit will already solve this issue in 95% of cases, because the hard limit is so high that the bug will just occur even less frequently.

If you are not careful with these limits, it can affect the performance of your website and even other processes in the system.

Hm, I'm not sure about setting the soft limit to the hard limit; do you remember which server did that? I'd like to see what the actual implications are.

@steve-chavez

Btw, yesodweb/wai#831 was recently merged.

@wolfgangwalther

Btw, yesodweb/wai#831 was recently merged.

Sweet, now updating warp and then raising the soft limit should be enough.

Hm, I'm not sure about setting the soft limit to the hard limit; do you remember which server did that? I'd like to see what the actual implications are.

I thought that was nginx, but reading the link you posted I'm not sure anymore. Can't remember where I read that. There's only two options really:

  • Raise the soft limit to the hard limit, or
  • Make it configurable

Because raising the soft limit to anything else is always quite an arbitrary choice.

Maybe updating warp is enough, too. It's not that often that the bug was hit, right? And the real problem was just that the server was unresponsive after that. Once we reject a few connections, the queue should be getting smaller quickly anyway...

@steve-chavez commented Feb 4, 2022

I tried the new warp version. The good news is that with this change postgrest continues to function when file descriptors become available; there are no more EMFILEs reported from warp.

However, there are EMFILEs coming from other places. Same 1000 concurrent users load test as before:

Feb 03 23:47:17 pgrstServer postgrest[22503]: 10.0.0.169 - postgres [03/Feb/2022:23:47:17 +0000] "GET /artist?select=*&artist_id=eq.39 HTTP/1.1" 503 - "" "k6/0.27.1 (https://k6.io/)"
Feb 03 23:47:17 pgrstServer postgrest[22503]: 03/Feb/2022:23:47:17 +0000: Attempting to connect to the database...
Feb 03 23:47:17 pgrstServer postgrest[22503]: 03/Feb/2022:23:47:17 +0000: Connection successful
Feb 03 23:47:17 pgrstServer postgrest[22503]: 03/Feb/2022:23:47:17 +0000: Failed reloading config: Error in config /nix/store/sbpmgrv1564lkw1w96qa04c7y14h3k2c-pgrst.conf: openFile: resource exhausted (No file descriptors available)

Another EMFILE, which I think comes from re-reading the config file (not critical). This is triggered by the 503 from the failed GET /artist request above, which on the client side prints:

time="2022-02-04T03:07:18Z" level=info msg="{\"details\":\"connection to server on socket \\\"/run/postgresql/.s.PGSQL.5432\\\" failed:
could not create socket: No file descriptors available\\n\",\"code\":\"\",\"message\":\"Database connection error. Retrying the connecti
on.\"}"

When re-running the load test, sometimes this error also shows on the postgrest logs:

Feb 03 23:57:41 pgrstServer postgrest[22503]: 03/Feb/2022:23:57:41 +0000: Attempting to connect to the database...
Feb 03 23:57:41 pgrstServer postgrest[22503]: 03/Feb/2022:23:57:41 +0000: {"details":"connection to server on socket \"/run/postgresql/.s.PGSQL.5432\" failed: could not create socket: No file descriptors available\n","code":"","message":"Database connection error. Retrying the connection."}
Feb 03 23:57:41 pgrstServer postgrest[22503]: 03/Feb/2022:23:57:41 +0000: Attempting to reconnect to the database in 0 seconds...
Feb 03 23:57:41 pgrstServer postgrest[22503]: 10.0.0.169 - postgres [03/Feb/2022:23:57:41 +0000] "GET /artist?select=*&artist_id=eq.27 >
Feb 03 23:57:41 pgrstServer postgrest[22503]: 10.0.0.169 - postgres [03/Feb/2022:23:57:41 +0000] "GET /artist?select=*&artist_id=eq.112>
Feb 03 23:57:41 pgrstServer postgrest[22503]: 10.0.0.169 - postgres [03/Feb/2022:23:57:41 +0000] "GET /artist?select=*&artist_id=eq.239>
Feb 03 23:57:42 pgrstServer postgrest[22503]: 03/Feb/2022:23:57:42 +0000: Connection successful

But as can be seen, postgrest recovers because it then retries the connection.

While the load test runs, I've also tried running other curl requests to see what happens.

Unlike before, curling the admin port doesn't give Connection refused; instead the request waits and a response is obtained when the load test finishes. Not good for the /metrics endpoint: when it finally responds, totalWaiters is 0. I think adding a maxWaiters (similar to highwaterUsage) metric would be more useful in this case.

The same thing happens for a new request on the main app port (curl localhost:3000/artist): it waits. Not sure how the warp patch works; is it somehow circumventing the soft limit?

There's only two options really:
Raise the soft limit to the hard limit, or
Make it configurable

@wolfgangwalther How about if we just recommend setting the file descriptors higher than 1024 in the docs? The maxWaiters metric should also give a hint for how much it should be, I think. It just seems safer to leave setting the soft limit to the systemd service.

@wolfgangwalther

@wolfgangwalther How about if we just recommend setting the file descriptors higher than 1024 in the docs? The maxWaiters metric should also give a hint for how much it should be, I think. It just seems safer to leave setting the soft limit to the systemd service.

My understanding of soft and hard limits is that soft limits enable a process to self-limit, while hard limits allow the admin/user to restrict the process to an enforced limit.

Having low soft limits makes sense for applications that don't expect to have many file descriptors open - it's basically a check against bugs in the application, which might keep those file descriptors open for longer than needed. But that's not the case here: with high load we expect a high number of file descriptors. It's perfectly fine; none of those file descriptors are orphaned.

Systemd allows setting LimitNOFILE in two ways: as a single value or as a soft:hard pair. The single value sets both the soft and the hard limit to the same value, so basically setting the hard limit is the default. At the same time, setting only the soft limit is not even possible.
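
For illustration, a hypothetical systemd drop-in (path and values made up):

# /etc/systemd/system/postgrest.service.d/limits.conf
[Service]
# a single value sets both the soft and the hard limit
LimitNOFILE=10000
# alternatively, an explicit soft:hard pair
# LimitNOFILE=4096:524288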

I think what we should do:

  • Increase the softlimit to the hardlimit from within postgrest, because we know having many file descriptors is expected with high load.
  • Advise in the docs how to set the hard limit in systemd and docker based setups.

This will set the default in a way that reaching the limit is more unlikely.

@wolfgangwalther

I tried the new warp version. The good news is that with this change postgrest continues to function when file descriptors become available; there are no more EMFILEs reported from warp.

Cool, that's the correct way to handle it.

However, there are EMFILEs coming from other places.

Yes, that would be expected.

Unlike before, curling the admin port doesn't give Connection refused; instead the request waits and a response is obtained when the load test finishes. Not good for the /metrics endpoint: when it finally responds, totalWaiters is 0. I think adding a maxWaiters (similar to highwaterUsage) metric would be more useful in this case.

The same thing happens for a new request on the main app port (curl localhost:3000/artist): it waits. Not sure how the warp patch works; is it somehow circumventing the soft limit?

Not sure what you're looking at here. Didn't you just say a few lines up, that you are getting EMFILEs instead of waiting - until the number of connections decreases again?

In any case: To still allow re-reading the config file and serving useful responses at the admin port, we need to keep a safety buffer of available file descriptors. This is exactly what I suggested by counting the open http connections - and rejecting new connections before we hit EMFILE.

@steve-chavez

Not sure what you're looking at here. Didn't you just say a few lines up, that you are getting EMFILEs instead of waiting - until the number of connections decreases again?

Those are curl requests I tried while the load test is running (I've edited my comment above), and no, I don't get EMFILEs (coming from the pg socket) there; they just wait. The EMFILEs I shared came from the k6 load test and seem to happen for some requests randomly (1-4 reqs). Maybe not a problem though, since we keep running.

In any case: To still allow re-reading the config file and serving useful responses at the admin port, we need to keep a safety buffer of available file descriptors. This is exactly what I suggested by counting the open http connections - and rejecting new connections before we hit EMFILE.

Hm, keeping the admin port working seems like the right behavior, yes. But I'm not sure if counting http connections would somehow affect performance. Ideally warp would offer us an option to reserve some fds for the admin server, but I don't think there is one. I'll keep digging on this topic...

@steve-chavez

we need to keep a safety buffer of available file descriptors.

I haven't found other webservers (in other languages) doing something like this. That makes me a bit uneasy about trying that path.

Increase the softlimit to the hardlimit from within postgrest, because we know having many file descriptors is expected with high load.

I'll try the soft-limit-to-hard-limit approach; that seems enough. The admin port should keep serving requests while the main port requests wait.

@steve-chavez

Maybe we can also use https://hackage.haskell.org/package/unix-2.7.2.2/docs/System-Posix-Resource.html. We could:
Use setResourceLimit to increase our process' softlimit to the hardlimit allowed by the kernel.

Tried setResourceLimit; it seems to work fine. However, this kind of comment makes me uneasy about raising the soft limit.

Additionally, if an unprivileged process can simply raise the soft limit itself, a forkbomb can raise the soft limit too all the way up to the hard limit

Now, we don't "fork", but I assume there must be some cost to having many green threads; the soft limit is preventing us from reaching 524288 of those.

Increase the softlimit to the hardlimit from within postgrest, because we know having many file descriptors is expected with high load.
Advise in the docs how to set the hard limit in systemd and docker based setups.

We might as well just advise in the docs to set LimitNOFILE to a single value, since the hard limit could be low anyway (NixOS's ulimit -Hn is 4096). It doesn't seem necessary to make a change in postgrest.

Make it configurable

That looks better, but I don't think we need it. Nginx docs say that worker_rlimit_nofile is for

Used to increase the limit without restarting the main process.

Makes sense for nginx, since changing LimitNOFILE in systemd would require a restart. However, in our case, failures (timeouts) under high load won't be solved by more file descriptors; the db needs to be tuned/scaled.


So I'll patch the EMFILE bug; that plus a section in the admin docs about timeouts/EMFILE and how to scale (this will take longer as it requires the multiple db pool) should solve this one.

@wolfgangwalther

So I'll patch the EMFILE bug; that plus a section in the admin docs about timeouts/EMFILE

👍

@steve-chavez commented Feb 8, 2022

The EMFILEs I shared came from the k6 load test and seem to happen for some requests randomly (1-4 reqs). Maybe not a problem though, since we keep running.

Noted that these EMFILE errors don't happen when db-pool-timeout=60 (the default is 10s; the load test runs for 30s). It seems some connections go idle somehow and get closed, and when the pool tries to create new ones, the "details\":\"connection to server on socket \\\"/run/postgresql/.s.PGSQL.5432\\\" failed: could not create socket: No file descriptors available\\n\" error happens because there are no file descriptors left, since other waiting http requests consumed them.

@wolfgangwalther

Hm. Did you do a test to confirm that those database-side EMFILEs are recoverable from? Given the observation you made, it could also be that 4 out of 10 pool connections are now in a dead state, similar to warp earlier.

You could set db-pool=1 and db-pool-timeout=1 and then try to hit one of those (maybe by launching postgrest with a lower softlimit, too) - and then, once you do, see if you can still make requests afterwards?

@steve-chavez

Hm. Did you do a test to confirm that those database-side EMFILEs are recoverable from?

Yes, when the db-side EMFILE happens the response is 503 and then:

-- a 503 triggers the connection worker, which releases the pool and re-establishes the connections
let isPGAway = Wai.responseStatus response == HTTP.status503
when isPGAway connWorker

The pool is released and new connections are made; doing a new load test (not hitting EMFILE, say 100 concurrent users) then reveals no errors.

@wolfgangwalther

Perfect.
