I want to count the distinct values of one column. I compare three methods: plain Spark SQL, VerdictDB on the original table, and VerdictDB on the scramble table. (I made a few changes to the Hello.scala example; the full code is at the end.) The results are:

spark time: 0.06 seconds
verdictDB scramble time: 4.67 seconds
verdictDB original time: 2.04 seconds
verdictDB scramble time: 3.13 seconds

I don't understand why VerdictDB is even slower than plain Spark. Do you have any thoughts?

Also, there are some pull requests that try to add VerdictDB support for PySpark, but they have not been merged into the master branch. Are there plans to support PySpark?

Thanks for your help!
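One caveat about my own measurement: spark.sql is lazy, so the "spark" timer in the code below only wraps plan construction, and the show() that actually runs the job happens after stop(). A minimal variant that forces execution inside the timed region (my own sketch, reusing the MyTimer class from the code below):

// Sketch: force an action inside the timed region so the "spark" number
// includes actual query execution, not just planning.
val action_timer = new MyTimer("spark (with action)")
val action_rs = spark.sql("select groupby, count(distinct metric_value) from caida.sales_scramble GROUP BY groupby")
action_rs.collect() // collect() (or count()/show()) triggers the actual Spark job
action_timer.stop()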
package example

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType, StructField}
import org.verdictdb.VerdictContext
import org.verdictdb.connection.SparkConnection
import scala.util.Random

// Simple wall-clock timer; starts counting when constructed.
class MyTimer(val text_para: String) {
  var start_time = System.nanoTime()
  var text = text_para

  def stop(): Unit = {
    val elapsed_time = (System.nanoTime() - start_time) / 1e9d
    printf("%s time: %.2f seconds\n", text, elapsed_time)
  }
}

object Hello extends App {
  val config = new SparkConf()
  config.set("spark.sql.storeAssignmentPolicy", "LEGACY")

  val spark = SparkSession
    .builder()
    .config(config)
    .appName("VerdictDB basic example")
    .enableHiveSupport()
    .getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")

  import spark.implicits._

  val verdict = VerdictContext.fromSparkSession(spark)

  // prepare data
  prepareData(spark, verdict)

  val sqlDF = spark.sql("SELECT * FROM caida.sales")
  sqlDF.show(5)

  // 1) plain Spark SQL on the scramble table
  val spark_timer = new MyTimer("spark")
  val spark_rs = spark.sql("select groupby, count(distinct metric_value) from caida.sales_scramble GROUP BY groupby")
  spark_timer.stop()
  spark_rs.show()

  // 2) VerdictDB on the scramble table
  val count_distinct_timer = new MyTimer("verdictDB scramble")
  val rs = verdict.sql("select groupby, count(distinct metric_value) from caida.sales_scramble GROUP BY groupby")
  count_distinct_timer.stop()
  // rs.show()

  // BYPASS hands the statement directly to the underlying engine,
  // skipping VerdictDB's query rewriting.
  verdict.sql("BYPASS DROP SCHEMA IF EXISTS verdictdbtemp CASCADE")

  // 3) VerdictDB on the original table
  val count_distinct_timer2 = new MyTimer("verdictDB original")
  val rs2 = verdict.sql("select groupby, count(distinct metric_value) from caida.sales GROUP BY groupby")
  count_distinct_timer2.stop()

  verdict.sql("BYPASS DROP SCHEMA IF EXISTS verdictdbtemp CASCADE")

  // 4) VerdictDB on the scramble table again (warm run)
  val count_distinct_timer3 = new MyTimer("verdictDB scramble")
  val rs3 = verdict.sql("select groupby, count(distinct metric_value) from caida.sales_scramble GROUP BY groupby")
  count_distinct_timer3.stop()

  def prepareData(spark: SparkSession, verdict: VerdictContext): Unit = {
    // create a schema and a table, and drop any leftover VerdictDB state
    spark.sql("DROP SCHEMA IF EXISTS caida CASCADE")
    spark.sql("CREATE SCHEMA IF NOT EXISTS caida")
    spark.sql("CREATE TABLE IF NOT EXISTS caida.sales (groupby string, metric_value string)")
    verdict.sql("BYPASS DROP TABLE IF EXISTS caida.sales_scramble")
    verdict.sql("BYPASS DROP SCHEMA IF EXISTS verdictdbtemp CASCADE")
    verdict.sql("BYPASS DROP SCHEMA IF EXISTS verdictdbmeta CASCADE")

    val input_files = "s3://sketch-public/input/1m.csv"
    val caida_schema = StructType(Array(
      StructField("srcip", StringType, true),
      StructField("dstip", StringType, true),
      StructField("proto", StringType, true),
      StructField("srcport", StringType, true),
      StructField("dstport", StringType, true),
      StructField("length", StringType, true)
    ))
    val df = spark.read.format("csv")
      .option("sep", ",")
      .schema(caida_schema)
      .option("header", "false")
      .load(input_files)
    df.createOrReplaceTempView("dfView")
    spark.sql("INSERT INTO caida.sales (SELECT dstip as groupby, CONCAT(srcip, '|', srcport, '|', dstport, '|', length) as metric_value FROM dfView)")

    // build the hash scramble used by the approximate queries above
    val scramble_timer = new MyTimer("scramble")
    verdict.sql("CREATE SCRAMBLE caida.sales_scramble FROM caida.sales METHOD HASH HASHCOLUMN metric_value")
    scramble_timer.stop()
  }
}
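Separately, a variant I have not tried yet: my reading of the VerdictDB docs is that CREATE SCRAMBLE accepts an optional SIZE ratio, so a smaller scramble might cut the scan cost of the approximate query. Treat the exact SIZE syntax as an assumption on my part:

// Hypothetical: SIZE 0.1 per my reading of the VerdictDB docs (unverified).
// A 10% scramble should give the approximate count-distinct less data to scan.
verdict.sql("CREATE SCRAMBLE caida.sales_scramble_small FROM caida.sales METHOD HASH HASHCOLUMN metric_value SIZE 0.1")
val small_timer = new MyTimer("verdictDB small scramble")
verdict.sql("select groupby, count(distinct metric_value) from caida.sales_scramble_small GROUP BY groupby")
small_timer.stop()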