Skip to content
This repository has been archived by the owner on Dec 18, 2023. It is now read-only.

Some clarity around suggested metastore caching settings #49

Open
friendofasquid opened this issue Jun 17, 2021 · 5 comments
Open

Some clarity around suggested metastore caching settings #49

friendofasquid opened this issue Jun 17, 2021 · 5 comments

Comments

@friendofasquid
Copy link

friendofasquid commented Jun 17, 2021

Hello—

The recommended settings from the README are:

hive.metastore-cache-ttl=0s
hive.metastore-refresh-interval = 5s

I believe the hive.metastore-cache-ttl set to 0 tells Presto NOT to cache. Then the second setting says to asynchronously refresh the cache every 5, though I think it won't be used. This can put a lot of load on the Presto server + Metastore. We've seen rate limiting using the AWS Glue Catalog that I suspect are because of these settings.

I'm trying to understand the implications of using caching to speed things up, with or without a refresh interval. Something like this:

hive.metastore-cache-ttl=120s
hive.metastore-refresh-interval = 60s

So long as it takes less than 60 seconds to refresh to metastore, users doing interactive dbt develoment should never experience slowness related to the Metastore. At the same time we won't be thrashing the Metastore every 5 seconds.

But I feel like I might be missing something, so just wanted to see what the reasoning behind that catalog setting for hive would be.

@friendofasquid
Copy link
Author

friendofasquid commented Jun 18, 2021

We've been doing some additional thinking and investigation here, so just sharing in case it is helpful.

  1. If using EMR, AWS sets the relevant Hive catalog properties as follows:
hive.metastore-cache-ttl=20m
hive.metastore-refresh-interval=1m

I don't know if this is good or bad, but it's a data point.

  1. Turning caching ttl to something non-zero, as expected, significantly speeds up dbt startup where is does a whole bunch of metadata queries. For us, this went from occasionally taking 10 minutes to about 30 seconds, sometimes a lot less.
  2. I found this post which states:

Presto’s metadata cache is in the coordinator, so it doesn’t suffer cache consistency problem if user only changes objects via Presto. However, if user also changes objects in the metastore via Hive, it suffers from inconsistency until a refresh occurs.

So that does suggest caching would be safe if the Presto cluster used to run dbt is the same cluster used to query it for other use cases. And what it is really saying is that the other presto clusters, if in use, may want different caching settings.

FWIW, we've been running dbt with Spark and querying the models using Presto with those default EMR settings and never noticed an issue. (We are now porting the dbt-spark project to dbt-presto hence the discussion)

@wrb2
Copy link

wrb2 commented Jul 9, 2021

This mirrors our experience. We ended up setting up the cache for an hour - AWS Glue Catalog is very limited with the number of requests it can handle and AWS isn't very eager to increase the limits.

One trick we found was that running show schemas or show tables in Presto / Trino makes it refresh immediately. That can fix problems with something like "we created a schema / database over AWS API and Presto can't see it".

On the other hand, I assume that Presto with metadata repository that actually works well (like normal Hive Metastore or hopefully Iceberg in the future), the caching might not really be necessary at all.

@friendofasquid
Copy link
Author

friendofasquid commented Jul 10, 2021

@wrborigin Are you using EMR?

We've also found (and AWS support has been able to replicate) an issue where Presto on EMR doesn't cache at all when using the Glue Catalog. So we've struggled mightily with this trying to move our dbt jobs from Spark to Presto. We are using the latest version of EMR: 6.3 which uses Presto 0.145.1

I've been told this is not an issue with PrestoSQL (AKA Trino) on EMR, and that the caching works fine with there.

In terms fo what to recommend for dbt users, I think a different caching recommendation in general is warranted, and something like a 20m to 1hr ttl seems sensible. I'd probably make the further, more broad recommendation that using a separate Presto cluster for dbt is likely a good idea.

And finally, use the AWS Glue Catalog metastore at your own risk :)

AWS will definitely up your quota if you ask and give a justification. "Bug in EMR" is a pretty good justification.

@friendofasquid
Copy link
Author

For those who might stumble upon this issue, there's a hidden property for Presto on EMR that will enable caching when using the Glue Catalog:

hive.metastore.glue.shared-metastore-cache-enabled=true

This is currently not documented anywhere.

@GlockGao
Copy link

Hi,
I tried metadata cache with config 'presto-connector-hive : hive.metastore.glue.shared-metastore-cache-enabled=true' on EMR 6.4 and EMR 6.8. But still cache doesn't work at all.

Can you comment on that ?

Thanks.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants