Spark Table Stats

Spark Table Stats computes per-column summary statistics efficiently. The current design aims to generate the statistics listed below in only two passes over the data, using repartitionAndSortWithinPartitions with a custom partitioner and foreachPartition with custom accumulators.

Summary Statistics By Column:

Sum
Average
Standard Deviation
Max
Min
Cardinality (the number of records of frequency / total records)
Count Nulls
Count Empties
Top (K) Values by Frequency (NOT COMPLETED)
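
Most of the statistics above can be folded into a small mergeable accumulator, which is what makes a per-partition pass practical. The following is a minimal sketch in plain Java, not the project's actual accumulator: the class name ColumnStats and its fields are hypothetical, cells are assumed to arrive as strings that parse as numbers, and Cardinality and Top(K) are left out.

```java
/** Hypothetical per-column accumulator; names are illustrative, not taken from the project. */
class ColumnStats {
    long count;      // non-null, non-empty values seen
    long nulls;      // null cells
    long empties;    // empty-string cells
    double sum;
    double sumSq;    // running sum of squares, used for the standard deviation
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;

    /** Fold one raw cell into the running totals (assumes numeric text). */
    void add(String cell) {
        if (cell == null) { nulls++; return; }
        if (cell.isEmpty()) { empties++; return; }
        double v = Double.parseDouble(cell);
        count++;
        sum += v;
        sumSq += v * v;
        min = Math.min(min, v);
        max = Math.max(max, v);
    }

    /** Combine totals from another partition; the merge order does not matter. */
    ColumnStats merge(ColumnStats other) {
        count += other.count;
        nulls += other.nulls;
        empties += other.empties;
        sum += other.sum;
        sumSq += other.sumSq;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
        return this;
    }

    double average() { return sum / count; }

    /** Population standard deviation derived from the running sums. */
    double stdDev() {
        double mean = average();
        return Math.sqrt(Math.max(0.0, sumSq / count - mean * mean));
    }

    public static void main(String[] args) {
        // Two "partitions" of the same column, accumulated separately and then merged.
        ColumnStats left = new ColumnStats();
        for (String cell : new String[] {"2", "4", "4", null}) left.add(cell);
        ColumnStats right = new ColumnStats();
        for (String cell : new String[] {"", "4", "5", "5", "7", "9"}) right.add(cell);

        ColumnStats all = left.merge(right);
        System.out.println("avg=" + all.average() + " stdDev=" + all.stdDev()
                + " min=" + all.min + " max=" + all.max
                + " nulls=" + all.nulls + " empties=" + all.empties);
    }
}
```

The sum-of-squares formula keeps the merge step trivial, but it can lose precision on large or tightly clustered values; a Welford-style update would likely be the more robust choice if that matters.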

TODO:

Top(K) - Evaluate the opportunity to use combineByKey, creating an empty min-queue for each key. Merge a value into the queue if the queue's size is < K. If the size is >= K, merge the value only if it exceeds the smallest element; in that case add it and remove the smallest element.
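
The bounded min-queue idea can be sketched without Spark. The code below is a plain-Java illustration under assumptions, not the planned implementation: TopK, ValueCount, and the method names are hypothetical, and each element is assumed to be a distinct column value paired with its frequency. The mergeValue and mergeQueues functions are shaped so they could serve as the second and third arguments of combineByKey, with the first argument building a queue from the first value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper for the Top(K) TODO; all names are illustrative. */
class TopK {
    /** A distinct column value and how often it occurs. */
    static final class ValueCount {
        final String value;
        final long count;
        ValueCount(String value, long count) { this.value = value; this.count = count; }
    }

    /** Empty min-queue ordered by frequency, so the smallest count sits at the head. */
    static PriorityQueue<ValueCount> createQueue() {
        return new PriorityQueue<>(Comparator.comparingLong((ValueCount vc) -> vc.count));
    }

    /** Add while below K; afterwards replace the head only if the new count is larger. */
    static PriorityQueue<ValueCount> mergeValue(PriorityQueue<ValueCount> queue, ValueCount vc, int k) {
        if (k <= 0) return queue;
        if (queue.size() < k) {
            queue.add(vc);
        } else if (vc.count > queue.peek().count) {
            queue.poll();   // drop the current smallest element
            queue.add(vc);
        }
        return queue;
    }

    /** Merge two partial queues by replaying the second one into the first. */
    static PriorityQueue<ValueCount> mergeQueues(PriorityQueue<ValueCount> a, PriorityQueue<ValueCount> b, int k) {
        for (ValueCount vc : b) mergeValue(a, vc, k);
        return a;
    }

    /** Values in the queue, most frequent first. */
    static List<String> descending(PriorityQueue<ValueCount> queue) {
        List<ValueCount> items = new ArrayList<>(queue);
        items.sort(Comparator.comparingLong((ValueCount vc) -> vc.count).reversed());
        List<String> values = new ArrayList<>();
        for (ValueCount vc : items) values.add(vc.value);
        return values;
    }

    public static void main(String[] args) {
        int k = 2;
        PriorityQueue<ValueCount> left = createQueue();
        mergeValue(left, new ValueCount("x", 5), k);
        mergeValue(left, new ValueCount("z", 1), k);
        mergeValue(left, new ValueCount("y", 9), k);

        PriorityQueue<ValueCount> right = createQueue();
        mergeValue(right, new ValueCount("w", 7), k);
        mergeValue(right, new ValueCount("v", 2), k);

        System.out.println(descending(mergeQueues(left, right, k)));
    }
}
```

Because each queue never holds more than K elements, the per-key state stays small regardless of how many distinct values a column has. Ties with the current smallest element are dropped in this sketch, since the TODO asks for values that exceed it.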

Collaborators:

Eric, Roderick, Brad
