Skip to content

roderickyao/Spark-TableStats

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Spark Table Stats

Spark Table Stats is intended to provide summary statistics by column in an efficient manner. As designed now the intent is to generate the following statistics in only two passes by making use of repartitionAndSortWithinPartitions leveraging custom partitioning and foreachPartition leveraging custom accumulators.

Summary Statistics By Column:

  • Sum
  • Average
  • Standard Deviation
  • Max
  • Min
  • Carnality (The number of records of frequency / total records)
  • Count Nulls
  • Count Empties
  • Top (K) Values by Frequency (NOT COMPLETED)

TODO:

  • Top(K) - Evaluate oppertunity to use combineByKey and create an empty min queue for each key. Merge values into the queue if its size is < K. If >= K, only merge the value if it exceeds the smallest element; if so add it and remove the smallest element.

Collaborators:

Eric, Roderick, Brad

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published