Spark Table Stats is intended to provide per-column summary statistics efficiently. The current design aims to generate the following statistics in only two passes over the data, using repartitionAndSortWithinPartitions with a custom partitioner and foreachPartition with custom accumulators:
- Sum
- Average
- Standard Deviation
- Max
- Min
- Cardinality (the number of distinct values / total records)
- Count Nulls
- Count Empties
- Top (K) Values by Frequency (NOT COMPLETED)
- Top(K): evaluate the opportunity to use combineByKey, creating an empty min-priority queue for each key. Merge a value into the queue if the queue's size is < K. If the size is >= K, merge the value only if it exceeds the smallest element; in that case, add it and remove the smallest element.
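
The Top(K) idea above can be sketched in plain Python as the three functions that combineByKey expects. This is only an illustration under stated assumptions, not the project's implementation: the names `K`, `create_combiner`, `merge_value`, `merge_combiners`, and `top_k_by_key` are hypothetical, each record is assumed to be a `(key, (count, value))` pair, and `top_k_by_key` is a local stand-in that imitates combining within partitions and then across them, so the sketch runs without a Spark cluster.

```python
import heapq

K = 3  # hypothetical queue size; assumed to be at least 1


def create_combiner(item):
    """Start a min-heap for a key from its first (count, value) pair."""
    return [item]


def merge_value(heap, item):
    """Merge one (count, value) pair into a heap that holds at most K items."""
    if len(heap) < K:
        heapq.heappush(heap, item)
    elif item > heap[0]:
        # heap[0] is the smallest element; replace it with the larger item.
        heapq.heapreplace(heap, item)
    return heap


def merge_combiners(left, right):
    """Merge two per-partition heaps by feeding one into the other."""
    for item in right:
        merge_value(left, item)
    return left


def top_k_by_key(partitions):
    """Local stand-in for combineByKey (assumption: not a Spark call).

    Combines records within each partition first, then merges the
    per-partition heaps, and returns each key's items largest first.
    """
    merged = {}
    for partition in partitions:
        local = {}
        for key, item in partition:
            if key in local:
                local[key] = merge_value(local[key], item)
            else:
                local[key] = create_combiner(item)
        for key, heap in local.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], heap)
            else:
                merged[key] = heap
    return {key: sorted(heap, reverse=True) for key, heap in merged.items()}
```

With PySpark, the same three functions could presumably be passed as `rdd.combineByKey(create_combiner, merge_value, merge_combiners)` on an RDD of `(key, (count, value))` pairs. Because each heap never grows beyond K items, the memory held per key stays bounded regardless of how many distinct values a column has.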
Eric, Roderick, Brad