Optimize/Improve stratified sample generation #112

Open
dongyoungy opened this issue Mar 16, 2018 · 3 comments

Comments

@dongyoungy
Contributor

Currently, generating a stratified sample requires multiple passes over the data and takes significantly longer than generating other types of samples.

It would be desirable to streamline the current stratified sample generation procedure so that samples are generated faster.

@henryqin1997

@dongyoungy Where is the current algorithm implemented (which path)? I have some ideas, but the problem is not yet clear enough to me. Is the time spent mostly on I/O, or is the algorithm itself the bottleneck? I would like to study the current algorithm first.

@dongyoungy
Contributor Author

dongyoungy commented May 2, 2018

The logic for creating the different types of samples (i.e., uniform, stratified, universe) is implemented in the CreateSampleQuery class, located at /core/src/main/java/edu/umich/verdict/query/CreateSampleQuery.java. It should at least give you a starting point; you might need to look at other classes from there.

Possible improvements include 1) removing the step that generates a temporary table for counting the number of groups; and/or 2) revising the sample generation query itself for better performance.
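
As a rough illustration of improvement 1), the per-group counts can be computed in an inline subquery and joined back to the base table, so that no temporary table is materialized. This is only a sketch under stated assumptions, not the logic in CreateSampleQuery: it assumes the temporary table holds per-group row counts, it uses SQLite (through Python's sqlite3 module) as a stand-in engine, and the function name, table name, and `rows_per_group` parameter are hypothetical. Each row is kept with probability min(1, rows_per_group / group_size).

```python
import sqlite3


def create_stratified_sample(conn, table, group_col, rows_per_group, sample_table):
    """Build a stratified sample in one statement, without a temp count table.

    Hypothetical helper: identifiers are interpolated directly, so they must
    come from trusted input. Rows whose group value is NULL are dropped by
    the equality join in this sketch.
    """
    k = int(rows_per_group)
    sql = f"""
    CREATE TABLE {sample_table} AS
    SELECT t.*,
           g.group_size,
           MIN(1.0, {k} * 1.0 / g.group_size) AS sampling_prob
    FROM {table} AS t
    JOIN (SELECT {group_col} AS grp, COUNT(*) AS group_size
          FROM {table}
          GROUP BY {group_col}) AS g
      ON t.{group_col} = g.grp
    -- random() is a signed 64-bit integer in SQLite; mask it to 31 bits
    -- and scale to obtain a uniform value in [0, 1).
    WHERE (random() & 2147483647) / 2147483648.0
          < MIN(1.0, {k} * 1.0 / g.group_size)
    """
    conn.execute(sql)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
    rows = [(i, "big", 1.0) for i in range(1000)]
    rows += [(1000 + i, "small", 2.0) for i in range(3)]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    create_stratified_sample(conn, "orders", "region", 5, "orders_sample")
    for region, n in conn.execute(
        "SELECT region, COUNT(*) FROM orders_sample GROUP BY region"
    ):
        print(region, n)
```

Groups with at most `rows_per_group` rows are kept in full, while larger groups are thinned to roughly `rows_per_group` rows on average. The statement still aggregates and then re-reads the base table, so whether it is actually faster than the temp-table approach depends on the engine and would need to be measured.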

@henryqin1997

To implement 1), I guess we could either:
1. Store the group sizes in memory. (Would the size of the table be a problem, or would the data transfer be slow?)
2. Use nested queries. (I wonder whether the 'group by' used for the 'count' would still be usable in the later stratifying step.)

Has either of these approaches already been tried and found not to work, so that I should skip it? Are there other ideas about what to try?
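
The first option (group sizes held in memory) could be sketched roughly as follows. This is a hypothetical illustration using SQLite through Python's sqlite3 module, not code from the project; the function names and the `rows_per_group` parameter are assumptions. Only one (group, count) pair per distinct group is transferred to the client, so the memory needed grows with the number of groups rather than with the number of rows. The per-group sampling probabilities are then passed back as bound parameters of a CASE expression.

```python
import sqlite3


def sampling_probabilities(conn, table, group_col, rows_per_group):
    """Pass 1: fetch per-group counts and turn them into keep-probabilities."""
    cur = conn.execute(
        f"SELECT {group_col}, COUNT(*) FROM {table} GROUP BY {group_col}"
    )
    return {grp: min(1.0, rows_per_group / size) for grp, size in cur}


def create_sample_from_probs(conn, table, group_col, probs, sample_table):
    """Pass 2: filter rows inside the database using the in-memory probabilities.

    Rows whose group is missing from `probs` (including NULL groups) fall
    through to the ELSE branch and are never sampled in this sketch.
    """
    # Create an empty table with the same columns as the base table.
    conn.execute(f"CREATE TABLE {sample_table} AS SELECT * FROM {table} WHERE 0")
    if not probs:
        return
    whens = " ".join("WHEN ? THEN ?" for _ in probs)
    params = [value for item in probs.items() for value in item]
    sql = f"""
    INSERT INTO {sample_table}
    SELECT * FROM {table}
    WHERE (random() & 2147483647) / 2147483648.0
          < CASE {group_col} {whens} ELSE 0.0 END
    """
    conn.execute(sql, params)
```

One limitation of this design is that the CASE expression grows with the number of distinct groups, so for a very large number of groups the generated statement may become impractical or exceed the engine's limit on bound parameters.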
