Feature Request: Support directly operating on files #372
Comments
Naturally this wouldn't just be limited to ORC. Spark supports all sorts of writers, from Parquet and Avro to even CSV and JSON files!
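As a quick sketch of those writers (the paths and input DataFrame below are placeholders, not anything from this project):

```scala
// Sketch only: illustrative paths, not part of VerdictDB.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("writer-formats").getOrCreate()
val df = spark.read.option("header", "true").csv("gs://my-bucket/raw/mydata.csv")

df.write.mode("overwrite").orc("gs://my-bucket/out/orc")                    // ORC
df.write.mode("overwrite").parquet("gs://my-bucket/out/parquet")            // Parquet
df.write.mode("overwrite").format("avro").save("gs://my-bucket/out/avro")   // needs the spark-avro module
df.write.mode("overwrite").json("gs://my-bucket/out/json")                  // JSON
```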
Thank you for the suggestion! I agree with you on the usefulness of directly operating on file(s), but it won't always be magical (in comparison to the SQL-based path). I see several orthogonal things in your comment:
Eventually, I think Verdict should support direct data loads from your local disk or from Google Cloud Storage, etc., without requiring you to do all the messy data transformation yourself.
Sorry, I just want to clarify: I am only talking about scramble creation here, not necessarily running Verdict directly on files as its own "query engine".

I find that Tez (the default engine underneath Hive in more recent versions) is not great at memory management compared to MR and Spark as execution engines; we see jobs fail much more regularly, even jobs that have previously completed on the same hardware. The only reliable way of processing anything on datasets of our size is to actually "reduce" the amount of processing that has to be done. For example, the lowest-effort way to convert to ORC at the moment is to create an external table on the CSV files, create an ORC table, and then insert from one into the other.

I think, for the time being, a way to simply create a scramble from an RDD / DataFrame and write it out to ORC files would be a huge step. I don't think it would be too much effort to develop (we had a brief try, but we have a deadline to meet), as all the code to do this is basically there; it currently just uses HiveContext to materialise the scrambles. If we can do the computationally difficult stuff, such as creating scrambles, in Spark, and then keep the SQL interface so it can do all of the query translation, that would be perfect :)
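For concreteness, here is a hedged sketch of what "create a scramble from a DataFrame and write it to ORC" could look like in plain Spark. This is not VerdictDB's actual scramble builder; the block column name, block count, and paths are assumptions used only for illustration.

```scala
// Hedged sketch: assign each row a random block id and write ORC partitioned by
// that id, entirely through the DataFrame API (no HiveContext / Tez involved).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{floor, rand}

val spark = SparkSession.builder()
  .appName("scramble-sketch")
  .getOrCreate()

// Assumption: in the real thing, the block count would come from VerdictDB's sizing logic.
val numBlocks = 100

val source = spark.read.option("header", "true").csv("gs://my-bucket/raw/*.csv")

// Assumed column name; VerdictDB would decide the actual scramble metadata columns.
val scrambled = source.withColumn("verdictdbblock", floor(rand() * numBlocks))

scrambled.write
  .mode("overwrite")
  .partitionBy("verdictdbblock")
  .orc("gs://my-bucket/scrambles/mytable_scrambled")
```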
@voycey Assuming you will be using Google BigQuery (or at least the Google Cloud ecosystem), what do you think of this workflow? This is to figure out the simplest way for you to use VerdictDB. Once the interface is determined, we can engineer its internals for higher efficiency.

1. You store raw (or compressed) files in Google Cloud Storage. Let's say on Jan 1 you collected mydata_jan1.csv, on Jan 2 you collected mydata_jan2.csv, and so on.
2. You let VerdictDB load your data into its own *some data store* by pointing it at the locations of those files. The *some data store* will simply be BigQuery tables (managed by VerdictDB).
3. You run queries through VerdictDB.
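For concreteness, a rough sketch of what steps 2 and 3 might look like through VerdictDB's SQL interface. The CREATE SCRAMBLE statement follows VerdictDB's documented syntax, but the BigQuery JDBC connection string and all table/dataset names are assumptions; step 1 (loading the CSVs from GCS into BigQuery) is not shown.

```scala
// Sketch of the proposed workflow via VerdictDB's JDBC interface.
import java.sql.DriverManager

// Assumed connection-string format, following VerdictDB's usual jdbc:verdict:<engine>://... pattern.
val conn = DriverManager.getConnection(
  "jdbc:verdict:bigquery://projectid=my-project;datasetid=mydataset")
val stmt = conn.createStatement()

// Step 2: build the scramble once, inside BigQuery, managed by VerdictDB.
stmt.execute("CREATE SCRAMBLE mydataset.mydata_scrambled FROM mydataset.mydata")

// Step 3: run approximate queries; VerdictDB rewrites them against the scramble.
val rs = stmt.executeQuery("SELECT state, COUNT(*) FROM mydataset.mydata GROUP BY state")
while (rs.next()) {
  println(s"${rs.getString(1)} -> ${rs.getLong(2)}")
}
conn.close()
```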
BigQuery will be able to create the scrambles much faster than any other DBMS, that's for sure; if everything can be run in there, I think that is a good solution.

However, *some data store* is likely to be problematic at the scale we are talking about, hence why I am suggesting working directly on the files for scramble creation. Spark (not SparkSQL) has been the only reliable method of cheaply processing bulk files at the scale we are at; the moment any kind of query engine is placed around it, the overheads become problematic (we reduced our cluster sizes yesterday and some counts now take 4 hours).

It may not be possible without this query engine, I don't know. But the

File.read().doSomething().write()

workflow is something that has reliably worked for us at scale. I think this is due to the fact that our data is partitioned, so we can run on exactly the day and state we want.
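A minimal Spark sketch of that read-transform-write pattern on partitioned data. The paths, the partition columns (dt, state), and the transform are placeholders, not anything from this project.

```scala
// The kind of "read -> transform -> write" job described above, as a sketch.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("direct-file-job").getOrCreate()

// Because the data is laid out by day and state, only the needed files are touched.
val day   = "2019-06-04"
val state = "CA"

spark.read
  .orc(s"gs://my-bucket/events/dt=$day/state=$state")
  .filter("event_type = 'click'")   // doSomething(): placeholder transform
  .write
  .mode("overwrite")
  .orc(s"gs://my-bucket/processed/dt=$day/state=$state")
```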
One thing we have noticed during our Big Data Journey is that while DB / Query engines are a convenient way to get access to data, if you want to do any kind of meaningful heavy lifting it is sometimes better to skip them altogether.
One thing we have been working on lately is directly operating on ORC files via Spark.
For example, doing a simple table copy from an unpartitioned table into a partitioned table, i.e.
insert into <x> select * from <y>
or
load data inpath ...
with the size of data and number of partitions we have (50 x 90 = 4500), takes days to complete via Hive with several well-powered nodes (16 CPU x 104 GB memory), even with stats generation turned off for inserted rows. However, looping the same query in Spark over the same dataset finishes in hours on a very small and basic 3-node standard cluster.
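A hedged sketch of that "loop in Spark" approach: instead of one giant INSERT ... SELECT through Hive/Tez, copy one (day, state) slice at a time straight into the partitioned ORC layout. All paths, column names, and partition values below are made up; a real run would enumerate the full 50 x 90 set of partition combinations mentioned above.

```scala
// Sketch: partition-by-partition copy with plain Spark, bypassing the Hive query engine.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioned-copy").getOrCreate()

val source = spark.read.orc("gs://my-bucket/unpartitioned_table")   // <y> in the SQL above
val days   = Seq("2019-06-01", "2019-06-02")                        // ~90 values in practice
val states = Seq("CA", "NY")                                        // ~50 values in practice

for (d <- days; s <- states) {
  source
    .filter(s"dt = '$d' AND state = '$s'")
    .write
    .mode("overwrite")
    .orc(s"gs://my-bucket/partitioned_table/dt=$d/state=$s")        // <x> in the SQL above
}
```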
As scramble generation is basically a transmutation of the data in one table into another, avoiding the overhead of these query engines would go a long way towards making it more efficient.
Currently, most of our plans for this have involved crippling the scramble generation somehow: whilst it is possible to get the scrambles created using a reduced partitioning scheme, that might not be necessary if you can take full advantage of an MPP system such as Spark directly (rather than relying on the HiveContext).
What do you think?