Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Cache Argument Method for Spark 3.4 #97

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 20 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -62,32 +62,34 @@ The Almaren Framework provides a simplified consistent minimalistic layer over A
To add Almaren Framework dependency to your sbt build:

```
libraryDependencies += "com.github.music-of-the-ainur" %% "almaren-framework" % "0.9.10-3.4"
libraryDependencies += "com.github.music-of-the-ainur" %% "almaren-framework" % "0.9.11-3.4"
```

To run in spark-shell:

For scala version(2.12):
```
spark-shell --packages "com.github.music-of-the-ainur:almaren-framework_2.12:0.9.10-3.4"
spark-shell --packages "com.github.music-of-the-ainur:almaren-framework_2.12:0.9.11-3.4"
```
For scala version(2.13):
```
spark-shell --packages "com.github.music-of-the-ainur:almaren-framework_2.13:0.9.10-3.4"
spark-shell --packages "com.github.music-of-the-ainur:almaren-framework_2.13:0.9.11-3.4"
```
Almaren connector is available in
[Maven Central](https://mvnrepository.com/artifact/com.github.music-of-the-ainur) repository.

| version | Connector Artifact |
|----------------------------|------------------------------------------------------------------|
| Spark 3.4.x and scala 2.13 | `com.github.music-of-the-ainur:almaren-framework_2.13:0.9.10-3.4` |
| Spark 3.4.x and scala 2.12 | `com.github.music-of-the-ainur:almaren-framework_2.12:0.9.10-3.4` |
| Spark 3.3.x and scala 2.13 | `com.github.music-of-the-ainur:almaren-framework_2.13:0.9.10-3.3` |
| Spark 3.3.x and scala 2.12 | `com.github.music-of-the-ainur:almaren-framework_2.12:0.9.10-3.3` |
| Spark 3.2.x and scala 2.12 | `com.github.music-of-the-ainur:almaren-framework_2.12:0.9.10-3.2` |
| Spark 3.1.x and scala 2.12 | `com.github.music-of-the-ainur:almaren-framework_2.12:0.9.10-3.1` |
| Spark 2.4.x and scala 2.12 | `com.github.music-of-the-ainur:almaren-framework_2.12:0.9.10-2.4` |
| Spark 2.4.x and scala 2.11 | `com.github.music-of-the-ainur:almaren-framework_2.11:0.9.10-2.4` |
| version | Connector Artifact |
|----------------------------|-------------------------------------------------------------------|
| Spark 3.5.x and scala 2.13 | `com.github.music-of-the-ainur:almaren-framework_2.13:0.9.11-3.5` |
| Spark 3.5.x and scala 2.12 | `com.github.music-of-the-ainur:almaren-framework_2.12:0.9.11-3.5` |
| Spark 3.4.x and scala 2.13 | `com.github.music-of-the-ainur:almaren-framework_2.13:0.9.11-3.4` |
| Spark 3.4.x and scala 2.12 | `com.github.music-of-the-ainur:almaren-framework_2.12:0.9.11-3.4` |
| Spark 3.3.x and scala 2.13 | `com.github.music-of-the-ainur:almaren-framework_2.13:0.9.11-3.3` |
| Spark 3.3.x and scala 2.12 | `com.github.music-of-the-ainur:almaren-framework_2.12:0.9.11-3.3` |
| Spark 3.2.x and scala 2.12 | `com.github.music-of-the-ainur:almaren-framework_2.12:0.9.11-3.2` |
| Spark 3.1.x and scala 2.12 | `com.github.music-of-the-ainur:almaren-framework_2.12:0.9.11-3.1` |
| Spark 2.4.x and scala 2.12 | `com.github.music-of-the-ainur:almaren-framework_2.12:0.9.11-2.4` |
| Spark 2.4.x and scala 2.11 | `com.github.music-of-the-ainur:almaren-framework_2.11:0.9.11-2.4` |


### Batch Example
Expand Down Expand Up @@ -357,6 +359,11 @@ Cache/Uncache both DataFrame or Table
```scala
cache(true)
```
Cache Dataframe with Storage Level

```scala
cache(true,storageLevel = MEMORY_AND_DISK)
```

#### Coalesce

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ import com.github.music.of.the.ainur.almaren.Tree
import com.github.music.of.the.ainur.almaren.builder.Core
import com.github.music.of.the.ainur.almaren.state.core._
import org.apache.spark.sql.Column
import org.apache.spark.storage.StorageLevel

private[almaren] trait Main extends Core {
def sql(sql: String): Option[Tree] =
Expand All @@ -12,8 +13,8 @@ private[almaren] trait Main extends Core {
def alias(alias:String): Option[Tree] =
Alias(alias)

def cache(opType:Boolean = true,tableName:Option[String] = None): Option[Tree] =
Cache(opType, tableName)
def cache(opType:Boolean = true,tableName:Option[String] = None,storageLevel: Option[StorageLevel] = None): Option[Tree] =
Cache(opType, tableName = tableName, storageLevel = storageLevel)

def coalesce(size:Int): Option[Tree] =
Coalesce(size)
Expand All @@ -33,12 +34,12 @@ private[almaren] trait Main extends Core {
def dsl(dsl:String): Option[Tree] =
Dsl(dsl)

def sqlExpr(exprs:String*): Option[Tree] =
def sqlExpr(exprs:String*): Option[Tree] =
SqlExpr(exprs:_*)

def where(expr:String): Option[Tree] =
def where(expr:String): Option[Tree] =
Where(expr)

def drop(drop:String*): Option[Tree] =
def drop(drop:String*): Option[Tree] =
Drop(drop:_*)
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ package com.github.music.of.the.ainur.almaren.state.core
import com.github.music.of.the.ainur.almaren.State
import com.github.music.of.the.ainur.almaren.util.Constants
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.storage.StorageLevel

private[almaren] abstract class Main extends State {
override def executor(df: DataFrame): DataFrame = core(df)
Expand Down Expand Up @@ -81,22 +82,30 @@ case class Alias(alias:String) extends Main {
}
}

case class Cache(opType:Boolean = true,tableName:Option[String] = None) extends Main {
case class Cache(opType: Boolean = true, tableName: Option[String] = None, storageLevel: Option[StorageLevel] = None) extends Main {
override def core(df: DataFrame): DataFrame = cache(df)

def cache(df: DataFrame): DataFrame = {
logger.info(s"opType:{$opType}, tableName{$tableName}")
logger.info(s"opType:{$opType}, tableName:{$tableName}, StorageType:{$storageLevel}")
tableName match {
case Some(t) => cacheTable(df,t)
case None => cacheDf(df)
case Some(t) => cacheTable(df, t)
case None => cacheDf(df, storageLevel)
}
df
}
private def cacheDf(df:DataFrame): Unit = opType match {
case true => df.persist()

private def cacheDf(df: DataFrame, storageLevel: Option[StorageLevel]): Unit = opType match {
case true => {
storageLevel match {
case Some(value) => df.persist(value)
case None => df.persist()
}
}
case false => df.unpersist()

}
private def cacheTable(df:DataFrame,tableName: String): Unit =

private def cacheTable(df: DataFrame, tableName: String): Unit =
opType match {
case true => df.sqlContext.cacheTable(tableName)
case false => df.sqlContext.uncacheTable(tableName)
Expand All @@ -112,7 +121,7 @@ case class SqlExpr(exprs:String*) extends Main {

case class Where(where:String) extends Main {
override def core(df: DataFrame): DataFrame = {
logger.info(s"where:{$where}")
logger.info(s"where:{$where}")
df.where(where)
}
}
Expand All @@ -122,4 +131,4 @@ case class Drop(drop:String*) extends Main {
logger.info(s"""drop:{${drop.mkString("\n")}}""")
df.drop(drop:_*)
}
}
}
23 changes: 18 additions & 5 deletions src/test/scala/com/github/music/of/the/ainur/almaren/Test.scala
Original file line number Diff line number Diff line change
Expand Up @@ -5,8 +5,9 @@ import org.apache.spark.sql.avro._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{AnalysisException, Column, DataFrame, SaveMode}
import org.scalatest._
import org.apache.spark.sql.avro._
import org.scalatest.funsuite.AnyFunSuite
import org.apache.spark.sql.avro._
import org.apache.spark.storage.StorageLevel._

import java.io.File
import scala.collection.immutable._
Expand Down Expand Up @@ -383,6 +384,18 @@ class Test extends AnyFunSuite with BeforeAndAfter {
assert(bool_cache)
}

val testCacheDfStorage: DataFrame = almaren.builder.sourceSql("select * from cache_test").cache(true,storageLevel = Some(MEMORY_ONLY)).batch
val bool_cache_storage = testCacheDfStorage.storageLevel.useMemory
test("Testing Cache Memory Storage") {
assert(bool_cache_storage)
}

val testCacheDfDiskStorage: DataFrame = almaren.builder.sourceSql("select * from cache_test").cache(true, storageLevel = Some(DISK_ONLY)).batch
val bool_cache_disk_storage = testCacheDfDiskStorage.storageLevel.useDisk
test("Testing Cache Disk Storage") {
assert(bool_cache_disk_storage)
}

val testUnCacheDf = almaren.builder.sourceSql("select * from cache_test").cache(false).batch
val bool_uncache = testUnCacheDf.storageLevel.useMemory
test("Testing Uncache") {
Expand Down Expand Up @@ -515,14 +528,15 @@ class Test extends AnyFunSuite with BeforeAndAfter {
val jsonStr = Seq("""{"name":"John","age":21,"address":"New York"}""",
"""{"name":"Peter","age":18,"address":"Prague"}""",
"""{"name":"Tony","age":40,"address":"New York"}""").toDF("json_string").createOrReplaceTempView("sample_json_table")

val df = spark.sql("select * from sample_json_table")
val jsonSchema = "address STRING,age BIGINT,name STRING"
val jsonSchema = "`address` STRING,`age` BIGINT,`name` STRING"
val generatedSchema = Util.genDDLFromJsonString(df, "json_string", 0.1)
testSchema(jsonSchema, generatedSchema, "Test infer schema for json column")
}

def testInferSchemaDataframe(df: DataFrame): Unit = {
val dfSchema = "cast ARRAY<STRING>,genres ARRAY<STRING>,title STRING,year BIGINT"
val dfSchema = "`cast` ARRAY<STRING>,`genres` ARRAY<STRING>,`title` STRING,`year` BIGINT"
val generatedSchema = Util.genDDLFromDataFrame(df, 0.1)
testSchema(dfSchema, generatedSchema, "Test infer schema for dataframe")
}
Expand All @@ -534,5 +548,4 @@ class Test extends AnyFunSuite with BeforeAndAfter {

}


}
}
Loading