Skip to content

Sql4DCompiler

srikalyan chandrashekar edited this page Jun 13, 2015 · 13 revisions

Description

A SQL language for generating druid queries.

Query types

GroupBy

Examples

  1. GroupBy with granularity day
SELECT COUNT(*) AS rows, LONG_SUM(cnt) AS edit_count, DOUBLE_SUM(added) AS chars_added  FROM wikipedia WHERE 
interval BETWEEN 2010-01-01T00:00 AND 2020-01-01T00 BREAK BY 'day' GROUP BY rows, edit_count 
  1. GroupBy with Having clause.
SELECT COUNT(*) AS rows, UNIQUE(added) AS chars_added  FROM wikipedia WHERE interval BETWEEN 2010-01-01T00:00 
AND 2020-01-01T00 GROUP BY namespace, type HAVING rows > 30
  1. GroupBy with complex Having clause.
SELECT COUNT(*) AS rows, LONG_SUM(cnt) AS edit_count, DOUBLE_SUM(added) AS chars_added  FROM wikipedia WHERE 
interval BETWEEN 2010-01-01T00:00 AND 2020-01-01T00 GROUP BY rows, edit_count HAVING namespace = 10 AND rows > 30
  1. GroupBy with sorting and limit. The following query requests to sort ascending by default and limit rows to 10.
SELECT timestamp, page, LONG_SUM(count) AS edit_count FROM wikipedia WHERE interval BETWEEN 
2010-01-01T00:00:00.000Z AND 2020-01-01T00:00:00.000Z AND country='United States' BREAK BY 'all'
 GROUP BY page  ORDER BY edit_count LIMIT 10;

The following query requests to sort descending and limit rows to 10

SELECT timestamp, page, LONG_SUM(count) AS edit_count FROM wikipedia WHERE interval BETWEEN 
2010-01-01T00:00:00.000Z AND 2020-01-01T00:00:00.000Z AND country='United States' BREAK BY 'all'
 GROUP BY page  ORDER BY edit_count DESC LIMIT 10;

The above 2 queries transform into TopN queries when only single dimension and metric are requested(as TopN is more efficient). Ex : The following 2 queries are TopN because only 1 dim(page) and 1 metric(edit_count) are involved.

SELECT page, LONG_SUM(count) AS edit_count FROM wikipedia WHERE interval BETWEEN 2010-01-01T00:00:00.000Z AND 2020-01-01T00:00:00.000Z AND country='United States' BREAK BY 'all' 
GROUP BY page  ORDER BY edit_count LIMIT 10;

SELECT page, LONG_SUM(count) AS edit_count FROM wikipedia WHERE interval BETWEEN 2010-01-01T00:00:00.000Z AND 2020-01-01T00:00:00.000Z AND country='United States' BREAK BY 'all'
 GROUP BY page  ORDER BY edit_count DESC LIMIT 10;
  1. TopN on the metric 'edit_count' with granularity period 1d and EST
SELECT COUNT(*) AS rows, LONG_SUM(cnt) AS edit_count, DOUBLE_SUM(added) AS chars_added  FROM wikipedia WHERE 
interval BETWEEN 2010-01-01T00:00 AND 2020-01-01T00 BREAK BY PERIOD('P1D', 'EST5EDT') GROUP BY rows, edit_count  
ORDER BY edit_count LIMIT 10 ;
  1. GroupBy with complex Post Aggregation.
SELECT COUNT(*) AS rows, DOUBLE_SUM(total) AS tot  FROM    wikipedia WHERE interval BETWEEN 2010-01-01T00:00 
AND 2020-01-01T00 GROUP BY dimension1, dimension2 THEN (100 / ((total AS tot / rows AS rows) AS avgRows)) AS 
average;
  1. Another GroupBy with complex Post Aggregation
SELECT COUNT(*) AS rows, DOUBLE_SUM(total) AS tot  FROM    wikipedia WHERE interval BETWEEN 2010-01-01T00:00 
AND 2020-01-01T00 GROUP BY rows, tot THEN ((UNIQUE(unique_users) + (100 * rows)) / 34) AS rows;
  1. GroupBy with regex filter.
SELECT COUNT(*) AS rows, DOUBLE_SUM(total) AS tot  FROM    wikipedia WHERE interval BETWEEN 2010-01-01T00:00 
AND 2020-01-01T00 AND rows LIKE '%tetete%' GROUP BY rows, tot;
  1. GroupBy with complex filter
SELECT COUNT(*) AS rows, DOUBLE_SUM(total) AS tot  FROM    wikipedia WHERE interval BETWEEN 2010-01-01T00:00 
AND 2020-01-01T00 AND (dam = 10) AND (rows LIKE '%tetete%') GROUP BY rows, tot;
  1. GroupBy with more complex filter
SELECT COUNT(*) AS rows, DOUBLE_SUM(total) AS tot  FROM    wikipedia WHERE interval BETWEEN 2010-01-01T00:00 
AND 2020-01-01T00 AND ((dam = 10) AND (rows LIKE '%tetete%')) OR NOT (tot =34) GROUP BY rows, tot;
  1. GroupBy with javascript based post aggregation
SELECT LONG_SUM(headline_views) AS headline_views, LONG_SUM(content_views) AS fullContentViews, 
LONG_SUM(shares) AS shares, UNIQUE(unique_content_count) AS unique_content_count, MAX(ISOTimestamp) AS max 
 FROM    tp042 WHERE interval BETWEEN  2014-03-09T01:00:00.000-05:00 AND 2014-03-10T22:00:01.000-04:00 AND 
provider_id='someprovider' AND content_type='cavideo' BREAK BY PERIOD('P1D', 'EST5EDT') GROUP BY
 headline_views THEN javascript:'sharesPer1000FullContentViews(fullContentViews, shares) { return (1000 * shares
 / fullContentViews).toFixed(2);}' HINT('timeseries');
  1. Search with single keyword
SELECT a, b FROM wiki WHERE interval BETWEEN 2010-01-01T00:00 AND 2020-01-01T00  WHICH CONTAINS('somestuff')
  1. Search with multiple keywords
SELECT a, b FROM wiki WHERE interval BETWEEN 2010-01-01T00:00 AND 2020-01-01T00  WHICH CONTAINS('somestuff', 
'anotherstuff')
  1. Search with keyword and sort by lexical order
SELECT a, b FROM wiki WHERE interval BETWEEN 2010-01-01T00:00 AND 2020-01-01T00 WHICH CONTAINS('somestuff') 
SORT('lexicographic')
  1. Search with keyword and Sort by string length
SELECT a, b FROM wiki WHERE interval BETWEEN 2010-01-01T00:00 AND 2020-01-01T00 WHICH CONTAINS('somestuff') 
SORT('strlen')
  1. Timeseries with micro time ranges
SELECT LONG_SUM(all_content_seen) AS content_seen FROM UniqueCountTable WHERE interval BETWEEN 
(['2014-06-11T23:00:00.000-04:00','2014-06-11T23:59:59.000-04:00'],
['2014-06-12T23:00:00.000-04:00','2014-06-12T23:59:59.000-04:00'],['2014-06-13T23:00:00.000-04:00','2014-06-13T23:59:59.000-04:00'],
['2014-06-14T23:00:00.000-04:00','2014-06-14T23:59:59.000-04:00'],
['2014-06-15T23:00:00.000-04:00','2014-06-15T23:59:59.000-04:00'],['2014-06-16T23:00:00.000-04:00','2014-06-16T23:59:59.000-04:00'],
['2014-06-17T23:00:00.000-04:00','2014-06-17T23:59:59.000-04:00']) AND provider_id = 'superpublisher' AND 
content_type = 'cavideo' BREAK BY PERIOD('P1D', 'EST5EDT', '2014-06-11T04:00:00.000Z') HINT('timeseries');
  1. Timeseries with micro time ranges expressed through INCLUDE() function in PERIOD() function The following query produces same query as above i.e (15)
SELECT LONG_SUM(all_content_seen) AS content_seen FROM UniqueContent WHERE interval BETWEEN 
2014-06-11T23:00:00.000-04:00 AND 2014-06-17T23:59:59.000-04:00 AND provider_id = 'superpublisher' AND 
content_type = 'cavideo' BREAK BY PERIOD('P1D', 'EST5EDT', '2014-06-11T04:00:00.000Z', INCLUDE([23,1])) 
HINT('timeseries');
  1. Select query(Select type druid query) select clause for this type of query can have metrics/dimensions they are figured out at runtime from the coordinator before firing the select query.
SELECT content_uuid, provider_id, content_views, follows FROM AggsTable WHERE interval BETWEEN '2014-07-01' AND '2014-07-02' LIMIT 5;
  1. Select query(select type druid query,for all columns)
SELECT * FROM AggsTable WHERE interval BETWEEN '2014-07-01' AND '2014-07-02' LIMIT 5;
  1. GroupBy With Join(currently JOIN,LEFT_JOIN, RIGHT_JOIN are supported)
SELECT timestamp , LONG_SUM(content_views) AS content_views, LONG_SUM(shares) AS shares FROM AggTable WHERE interval BETWEEN  2014-05-20T00:00:00.000-04:00 AND 2014-05-31T23:00:00.000-04:00 AND provider_id='superpublisher' AND content_type='cavideo' BREAK BY PERIOD('P1D', 'EST5EDT')  GROUP BY timestamp HINT('timeseries') JOIN (SELECT timestamp , LONG_SUM(all_content_seen) AS content_seen FROM UniqueCountTable WHERE interval BETWEEN  2014-05-20T00:00:00.000-04:00 AND 2014-05-31T23:00:00.000-04:00 AND provider_id='superpublisher' AND content_type='cavideo' BREAK BY PERIOD('P1D', 'EST5EDT') GROUP BY timestamp HINT('timeseries')) ON (timestamp, content_views);
  1. TimeBoudary query (simplest query)
SELECT FROM wikipedia;
  1. Nested query(i.e query data source)
SELECT uuid, DOUBLE_SUM(clicks) AS more_clicks FROM 
  (SELECT provider, uuid, DOUBLE_SUM(click) AS clicks from abf1 where interval between 2014-10-01 and  2014-11-30 GROUP BY uuid) 
WHERE interval BETWEEN 2014-10-01 and  2014-11-30;
Clone this wiki locally