Optimize/Improve stratified sample generation #112

dongyoungy · 2018-03-16T20:24:33Z

Currently, generating a stratified sample requires multiple passes over the data and takes significantly longer than generating other types of samples.

It would be desirable to streamline the current stratified sample generation procedure so that samples are generated faster.

henryqin1997 · 2018-05-02T03:37:49Z

@dongyoungy Where is the current algorithm (the path)? I have some ideas, but the problem is not clear enough to me yet. Is the time spent mostly on I/O, or is it because of the algorithm itself? I want to study the current algorithm first.

dongyoungy · 2018-05-02T15:04:42Z

The logic for creating the different types of samples (i.e., uniform, stratified, universe) is implemented in the CreateSampleQuery class, which is located at /core/src/main/java/edu/umich/verdict/query/CreateSampleQuery.java. That should at least give you a starting point; you might need to look at other classes from there.

Possible improvements could be something like 1) removing the steps that generate a temp table for counting the number of groups; and/or 2) revising the sample generation query itself for better performance.
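
To make improvement 1) concrete, here is a rough sketch of a single statement that computes the counts in a derived table instead of a temp table. It is not the query that CreateSampleQuery builds; it assumes the temp table holds per-group row counts, uses SQLite through Python only so that it can be run standalone, and the function name, table name, and column name are all made up for illustration. It does Bernoulli sampling with per-group probability min(1, k/n), so group sizes in the sample are approximate rather than exact, and the database will probably still scan the table twice.

```python
import sqlite3


def stratified_sample(conn, table, group_col, k):
    """Sample roughly k rows per group with one statement (hypothetical helper).

    The per-group counts come from a derived table, so no temp table is
    created. Each row is kept with probability min(1, k / n), where n is
    the size of its group; groups with n <= k are kept in full.
    Identifiers are interpolated directly, so they must be trusted.
    Rows whose group value is NULL are dropped by the equality join.
    """
    query = f"""
        SELECT t.*
        FROM {table} AS t
        JOIN (SELECT {group_col} AS g, COUNT(*) AS n
              FROM {table}
              GROUP BY {group_col}) AS s
          ON t.{group_col} = s.g
        -- map RANDOM() into [0, n) and keep the row if it falls below k
        WHERE ((RANDOM() % s.n) + s.n) % s.n < ?
    """
    return conn.execute(query, (k,)).fetchall()
```

Whether this is actually faster than the temp-table version depends on the engine; on Hive or Impala the equivalent query would need that engine's own random function.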

henryqin1997 · 2018-05-06T03:13:37Z

To implement 1), I guess we could:
1. store the group sizes in memory (would the size of the table be a problem, or would the data transfer be slow?)
2. use nested queries (I wonder whether the 'group by' used in the 'count' is still usable in the later stratifying step)

Has either of these two approaches already been tried and found not to work, so that I should skip it? Are there other thoughts about what to try?
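
For reference, one way to read the first idea is a single pass that keeps the group sizes and a fixed-size reservoir per group in memory. This is only a sketch under my own assumptions, not what Verdict does: the function name and its parameters are hypothetical, it runs on the client side (so the data transfer concern above still applies), and memory grows with the number of groups times k.

```python
import random
from collections import defaultdict


def stratified_reservoir(rows, key, k, rng=None):
    """One-pass stratified sample via per-group reservoir sampling.

    rows: any iterable of records, read exactly once
    key:  function mapping a record to its group (stratum)
    k:    maximum number of records kept per group
    Returns (sample_by_group, group_sizes); both live in memory.
    """
    rng = rng or random.Random()
    reservoirs = defaultdict(list)  # group -> sampled records
    seen = defaultdict(int)         # group -> number of records seen so far

    for row in rows:
        group = key(row)
        seen[group] += 1
        bucket = reservoirs[group]
        if len(bucket) < k:
            # The first k records of each group are always kept.
            bucket.append(row)
        else:
            # Algorithm R: the i-th record replaces a random slot
            # with probability k / i.
            j = rng.randrange(seen[group])
            if j < k:
                bucket[j] = row
    return dict(reservoirs), dict(seen)
```

Groups with at most k records are kept in full, and larger groups end up with exactly k records, with the group sizes available as a by-product of the same pass.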

dongyoungy added the feature request label Mar 16, 2018

barzan added the medium priority label Mar 16, 2018
