-
Notifications
You must be signed in to change notification settings - Fork 41
Sql4DCompiler
srikalyan chandrashekar edited this page Jun 13, 2015
·
13 revisions
A SQL language for generating druid queries.
Examples
- GroupBy with granularity day
SELECT COUNT(*) AS rows, LONG_SUM(cnt) AS edit_count, DOUBLE_SUM(added) AS chars_added FROM wikipedia WHERE
interval BETWEEN 2010-01-01T00:00 AND 2020-01-01T00 BREAK BY 'day' GROUP BY rows, edit_count
- GroupBy with Having clause.
SELECT COUNT(*) AS rows, UNIQUE(added) AS chars_added FROM wikipedia WHERE interval BETWEEN 2010-01-01T00:00
AND 2020-01-01T00 GROUP BY namespace, type HAVING rows > 30
- GroupBy with complex Having clause.
SELECT COUNT(*) AS rows, LONG_SUM(cnt) AS edit_count, DOUBLE_SUM(added) AS chars_added FROM wikipedia WHERE
interval BETWEEN 2010-01-01T00:00 AND 2020-01-01T00 GROUP BY rows, edit_count HAVING namespace = 10 AND rows > 30
- GroupBy with sorting and limit. The following query requests to sort ascending by default and limit rows to 10.
SELECT timestamp, page, LONG_SUM(count) AS edit_count FROM wikipedia WHERE interval BETWEEN
2010-01-01T00:00:00.000Z AND 2020-01-01T00:00:00.000Z AND country='United States' BREAK BY 'all'
GROUP BY page ORDER BY edit_count LIMIT 10;
The following query requests to sort descending and limit rows to 10
SELECT timestamp, page, LONG_SUM(count) AS edit_count FROM wikipedia WHERE interval BETWEEN
2010-01-01T00:00:00.000Z AND 2020-01-01T00:00:00.000Z AND country='United States' BREAK BY 'all'
GROUP BY page ORDER BY edit_count DESC LIMIT 10;
The above 2 queries transform into TopN queries when only single dimension and metric are requested(as TopN is more efficient). Ex : The following 2 queries are TopN because only 1 dim(page) and 1 metric(edit_count) are involved.
SELECT page, LONG_SUM(count) AS edit_count FROM wikipedia WHERE interval BETWEEN 2010-01-01T00:00:00.000Z AND 2020-01-01T00:00:00.000Z AND country='United States' BREAK BY 'all'
GROUP BY page ORDER BY edit_count LIMIT 10;
SELECT page, LONG_SUM(count) AS edit_count FROM wikipedia WHERE interval BETWEEN 2010-01-01T00:00:00.000Z AND 2020-01-01T00:00:00.000Z AND country='United States' BREAK BY 'all'
GROUP BY page ORDER BY edit_count DESC LIMIT 10;
- TopN on the metric 'edit_count' with granularity period 1d and EST
SELECT COUNT(*) AS rows, LONG_SUM(cnt) AS edit_count, DOUBLE_SUM(added) AS chars_added FROM wikipedia WHERE
interval BETWEEN 2010-01-01T00:00 AND 2020-01-01T00 BREAK BY PERIOD('P1D', 'EST5EDT') GROUP BY rows, edit_count
ORDER BY edit_count LIMIT 10 ;
- GroupBy with complex Post Aggregation.
SELECT COUNT(*) AS rows, DOUBLE_SUM(total) AS tot FROM wikipedia WHERE interval BETWEEN 2010-01-01T00:00
AND 2020-01-01T00 GROUP BY dimension1, dimension2 THEN (100 / ((total AS tot / rows AS rows) AS avgRows)) AS
average;
- Another GroupBy with complex Post Aggregation
SELECT COUNT(*) AS rows, DOUBLE_SUM(total) AS tot FROM wikipedia WHERE interval BETWEEN 2010-01-01T00:00
AND 2020-01-01T00 GROUP BY rows, tot THEN ((UNIQUE(unique_users) + (100 * rows)) / 34) AS rows;
- GroupBy with regex filter.
SELECT COUNT(*) AS rows, DOUBLE_SUM(total) AS tot FROM wikipedia WHERE interval BETWEEN 2010-01-01T00:00
AND 2020-01-01T00 AND rows LIKE '%tetete%' GROUP BY rows, tot;
- GroupBy with complex filter
SELECT COUNT(*) AS rows, DOUBLE_SUM(total) AS tot FROM wikipedia WHERE interval BETWEEN 2010-01-01T00:00
AND 2020-01-01T00 AND (dam = 10) AND (rows LIKE '%tetete%') GROUP BY rows, tot;
- GroupBy with more complex filter
SELECT COUNT(*) AS rows, DOUBLE_SUM(total) AS tot FROM wikipedia WHERE interval BETWEEN 2010-01-01T00:00
AND 2020-01-01T00 AND ((dam = 10) AND (rows LIKE '%tetete%')) OR NOT (tot =34) GROUP BY rows, tot;
- GroupBy with javascript based post aggregation
SELECT LONG_SUM(headline_views) AS headline_views, LONG_SUM(content_views) AS fullContentViews,
LONG_SUM(shares) AS shares, UNIQUE(unique_content_count) AS unique_content_count, MAX(ISOTimestamp) AS max
FROM tp042 WHERE interval BETWEEN 2014-03-09T01:00:00.000-05:00 AND 2014-03-10T22:00:01.000-04:00 AND
provider_id='someprovider' AND content_type='cavideo' BREAK BY PERIOD('P1D', 'EST5EDT') GROUP BY
headline_views THEN javascript:'sharesPer1000FullContentViews(fullContentViews, shares) { return (1000 * shares
/ fullContentViews).toFixed(2);}' HINT('timeseries');
- Search with single keyword
SELECT a, b FROM wiki WHERE interval BETWEEN 2010-01-01T00:00 AND 2020-01-01T00 WHICH CONTAINS('somestuff')
- Search with multiple keywords
SELECT a, b FROM wiki WHERE interval BETWEEN 2010-01-01T00:00 AND 2020-01-01T00 WHICH CONTAINS('somestuff',
'anotherstuff')
- Search with keyword and sort by lexical order
SELECT a, b FROM wiki WHERE interval BETWEEN 2010-01-01T00:00 AND 2020-01-01T00 WHICH CONTAINS('somestuff')
SORT('lexicographic')
- Search with keyword and Sort by string length
SELECT a, b FROM wiki WHERE interval BETWEEN 2010-01-01T00:00 AND 2020-01-01T00 WHICH CONTAINS('somestuff')
SORT('strlen')
- Timeseries with micro time ranges
SELECT LONG_SUM(all_content_seen) AS content_seen FROM UniqueCountTable WHERE interval BETWEEN
(['2014-06-11T23:00:00.000-04:00','2014-06-11T23:59:59.000-04:00'],
['2014-06-12T23:00:00.000-04:00','2014-06-12T23:59:59.000-04:00'],['2014-06-13T23:00:00.000-04:00','2014-06-13T23:59:59.000-04:00'],
['2014-06-14T23:00:00.000-04:00','2014-06-14T23:59:59.000-04:00'],
['2014-06-15T23:00:00.000-04:00','2014-06-15T23:59:59.000-04:00'],['2014-06-16T23:00:00.000-04:00','2014-06-16T23:59:59.000-04:00'],
['2014-06-17T23:00:00.000-04:00','2014-06-17T23:59:59.000-04:00']) AND provider_id = 'superpublisher' AND
content_type = 'cavideo' BREAK BY PERIOD('P1D', 'EST5EDT', '2014-06-11T04:00:00.000Z') HINT('timeseries');
- Timeseries with micro time ranges expressed through INCLUDE() function in PERIOD() function The following query produces same query as above i.e (15)
SELECT LONG_SUM(all_content_seen) AS content_seen FROM UniqueContent WHERE interval BETWEEN
2014-06-11T23:00:00.000-04:00 AND 2014-06-17T23:59:59.000-04:00 AND provider_id = 'superpublisher' AND
content_type = 'cavideo' BREAK BY PERIOD('P1D', 'EST5EDT', '2014-06-11T04:00:00.000Z', INCLUDE([23,1]))
HINT('timeseries');
- Select query(Select type druid query) select clause for this type of query can have metrics/dimensions they are figured out at runtime from the coordinator before firing the select query.
SELECT content_uuid, provider_id, content_views, follows FROM AggsTable WHERE interval BETWEEN '2014-07-01' AND '2014-07-02' LIMIT 5;
- Select query(select type druid query,for all columns)
SELECT * FROM AggsTable WHERE interval BETWEEN '2014-07-01' AND '2014-07-02' LIMIT 5;
- GroupBy With Join(currently JOIN,LEFT_JOIN, RIGHT_JOIN are supported)
SELECT timestamp , LONG_SUM(content_views) AS content_views, LONG_SUM(shares) AS shares FROM AggTable WHERE interval BETWEEN 2014-05-20T00:00:00.000-04:00 AND 2014-05-31T23:00:00.000-04:00 AND provider_id='superpublisher' AND content_type='cavideo' BREAK BY PERIOD('P1D', 'EST5EDT') GROUP BY timestamp HINT('timeseries') JOIN (SELECT timestamp , LONG_SUM(all_content_seen) AS content_seen FROM UniqueCountTable WHERE interval BETWEEN 2014-05-20T00:00:00.000-04:00 AND 2014-05-31T23:00:00.000-04:00 AND provider_id='superpublisher' AND content_type='cavideo' BREAK BY PERIOD('P1D', 'EST5EDT') GROUP BY timestamp HINT('timeseries')) ON (timestamp, content_views);
- TimeBoudary query (simplest query)
SELECT FROM wikipedia;
- Nested query(i.e query data source)
SELECT uuid, DOUBLE_SUM(clicks) AS more_clicks FROM
(SELECT provider, uuid, DOUBLE_SUM(click) AS clicks from abf1 where interval between 2014-10-01 and 2014-11-30 GROUP BY uuid)
WHERE interval BETWEEN 2014-10-01 and 2014-11-30;