Some clarity around suggested metastore caching settings #49
Comments
We've been doing some additional thinking and investigation here, so just sharing in case it is helpful.
I don't know if this is good or bad, but it's a data point.
So that does suggest caching would be safe if the Presto cluster used to run dbt is the same cluster used to query it for other use cases. What it is really saying is that the other Presto clusters, if in use, may want different caching settings. FWIW, we've been running dbt with Spark and querying the models using Presto with those default EMR settings and never noticed an issue. (We are now porting the
This mirrors our experience. We ended up setting the cache TTL to an hour. AWS Glue Catalog is very limited in the number of requests it can handle, and AWS isn't very eager to increase the limits. One trick we found was that running
On the other hand, I assume that for Presto with a metadata repository that actually works well (like a normal Hive Metastore, or hopefully Iceberg in the future), caching might not really be necessary at all.
@wrborigin Are you using EMR? We've also found (and AWS support has been able to replicate) an issue where Presto on EMR doesn't cache at all when using the Glue Catalog. So we've struggled mightily with this while trying to move our dbt jobs from Spark to Presto. We are using the latest version of EMR, 6.3, which uses Presto 0.145.1. I've been told this is not an issue with PrestoSQL (AKA Trino) on EMR, and that caching works fine there.
In terms of what to recommend for dbt users, I think a different caching recommendation in general is warranted, and something like a 20m to 1hr TTL seems sensible. I'd probably make the further, broader recommendation that using a separate Presto cluster for dbt is likely a good idea. And finally, use the AWS Glue Catalog metastore at your own risk :)
AWS will definitely up your quota if you ask and give a justification. "Bug in EMR" is a pretty good justification.
For those who might stumble upon this issue, there's a hidden property for Presto on EMR that will enable caching when using the Glue Catalog:
This is currently not documented anywhere.
Hi, can you comment on that? Thanks.
Hello—
The recommended settings from the README are:
I believe hive.metastore-cache-ttl set to 0 tells Presto NOT to cache. The second setting then says to asynchronously refresh the cache every 5 seconds, though I think it won't be used. This can put a lot of load on the Presto server and the Metastore. We've seen rate limiting with the AWS Glue Catalog that I suspect is caused by these settings.
I'm trying to understand the implications of using caching to speed things up, with or without a refresh interval. Something like this:
So long as it takes less than 60 seconds to refresh the Metastore, users doing interactive dbt development should never experience slowness related to the Metastore. At the same time, we won't be thrashing the Metastore every 5 seconds.
But I feel like I might be missing something, so just wanted to see what the reasoning behind that catalog setting for hive would be.
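To make the reasoning above concrete, here is a toy Python model, not Presto's actual implementation. It assumes the cache behaves like an expire-after-write TTL combined with refresh-after-write, where an entry older than the refresh interval is served stale while a reload happens in the background. The names ToyMetastoreCache and simulate, the one-lookup-per-second workload, and the 3600s/60s values are all hypothetical and only there for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    value: str
    loaded_at: float


class ToyMetastoreCache:
    """Toy TTL + refresh-after-write cache. An assumption-laden sketch, not Presto code."""

    def __init__(self, ttl_s, refresh_s):
        self.ttl_s = ttl_s
        self.refresh_s = refresh_s
        self.entries = {}
        self.blocking_loads = 0    # caller waits on the metastore
        self.background_loads = 0  # stale value served, reload off the query path

    def get(self, key, now):
        entry = self.entries.get(key)
        if entry is None or now - entry.loaded_at >= self.ttl_s:
            # Missing or expired: the caller blocks on a metastore call.
            self.blocking_loads += 1
            entry = Entry(value=f"{key}@{now}", loaded_at=now)
            self.entries[key] = entry
            return entry.value
        if now - entry.loaded_at >= self.refresh_s:
            # Still within the TTL but due for refresh: reload in the
            # background and hand back the (possibly stale) cached value.
            self.background_loads += 1
            self.entries[key] = Entry(value=f"{key}@{now}", loaded_at=now)
        return entry.value


def simulate(ttl_s, refresh_s, duration_s=180):
    """One lookup of the same table per second; returns (blocking, background) loads."""
    cache = ToyMetastoreCache(ttl_s, refresh_s)
    for t in range(duration_s):
        cache.get("schema.table", float(t))
    return cache.blocking_loads, cache.background_loads


if __name__ == "__main__":
    print(simulate(ttl_s=0, refresh_s=5))      # (180, 0)
    print(simulate(ttl_s=3600, refresh_s=60))  # (1, 2)
```

Under these assumptions, a TTL of 0 sends every lookup to the metastore and the refresh interval never comes into play, while a non-zero TTL with a 60-second refresh leaves one blocking load plus an occasional background reload.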