new bi sample benchmark tests #1995
base: master
Conversation
Commits 363c7e5 to 6076d15
Not necessary for this PR, but it might be nice to do these benchmarks also for the read_batch
method.
But this will probably be more reasonable if we add more of these data sets in the future.
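A `read_batch` benchmark could follow the same shape as the existing `time_` methods. A minimal sketch, reusing the `self.lib`/`self.symbol` naming from `bi_benchmarks.py` — the class name, constructor, and `params` values here are assumptions for illustration:

```python
class ReadBatchBenchmarkSketch:
    # Hypothetical times_bigger values, mirroring the ASV parametrisation.
    params = [1, 5, 10]

    def __init__(self, lib, symbol):
        self.lib = lib
        self.symbol = symbol

    def time_query_read_batch(self, times_bigger):
        # Read every scaled copy of the symbol in one batch call,
        # instead of issuing a separate read() per symbol.
        symbols = [f"{self.symbol}{n}" for n in range(1, times_bigger + 1)]
        self.lib.read_batch(symbols)
```

`Library.read_batch` takes a list of symbols and returns the results in a single round trip, so this would exercise a different code path than per-symbol `read` timing.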
We will need to set up git lfs for this file before merging the PR.
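The setup is a few one-off commands — a sketch; the parquet file pattern below is an assumption, adjust it to the real fixture path:

```shell
# One-time per clone: install the LFS hooks.
git lfs install
# Track the large parquet fixture (pattern is an assumption).
git lfs track "python/benchmarks/*.parquet"
# Commit the tracking rule alongside the file itself.
git add .gitattributes
```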
python/benchmarks/bi_benchmarks.py
Outdated
    df = self.lib.read(f"{self.symbol}{times_bigger}", query_builder=q)
    return df.data

def get_query_groupaby_city_count_isin_filter(self, q):
Typo: should be groupby
Applies to all the methods with the same typo
done!
python/benchmarks/bi_benchmarks.py
Outdated
def time_query_readall(self, times_bigger):
    self.lib.read(f"{self.symbol}{times_bigger}")

def get_query_groupaby_city_count_all(self, q):
nit: these don't need to be methods of the class; defining them outside the class would make the code a bit more readable.
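For example, the pandas reference query could live at module level — a sketch; the `City` and `Keyword` column names come from the dataset schema shown later in the thread, but the exact aggregation is an assumption:

```python
import pandas as pd

def get_query_groupby_city_count_all(df: pd.DataFrame) -> pd.DataFrame:
    # Plain module-level function: it uses no state from the benchmark
    # class, only the dataframe passed in.
    return df.groupby("City").agg({"Keyword": "count"})
```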
done!
python/benchmarks/bi_benchmarks.py
Outdated
_df = self.df.copy(deep=True)
arctic_df = self.time_query_groupaby_city_count_filter_two_aggregations(BIBenchmarks.params[0])
_df = self.get_query_groupaby_city_count_filter_two_aggregations(_df)
arctic_df.sort_index(inplace=True)
Looks like the sorting of the index is already done in the assert_frame_equal
function, so it is not needed here.
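A helper that sorts internally might look like this — a sketch, assuming `pandas.testing` underneath and that dtype differences (e.g. int32 vs int64 after a round trip) should be tolerated:

```python
import pandas as pd
from pandas import testing as pd_testing

def assert_frame_equal(pandas_df: pd.DataFrame, arctic_df: pd.DataFrame):
    # Sort both frames here, so call sites never need sort_index() first.
    pd_testing.assert_frame_equal(
        pandas_df.sort_index(),
        arctic_df.sort_index(),
        check_dtype=False,  # assumption: tolerate int32/int64 differences
    )
```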
Indeed! A late refactoring omission.
python/benchmarks/bi_benchmarks.py
Outdated
    del self.ac

def assert_frame_equal(self, pandas_df: pd.DataFrame, arctic_df: pd.DataFrame):
nit: Looks like this doesn't need to be a method of the class
agree!
python/benchmarks/bi_benchmarks.py
Outdated
    df = self.lib.read(f"{self.symbol}{times_bigger}", query_builder=q)
    return df.data

def peakmem_query_groupaby_city_count_filter_two_aggregations(self, times_bigger):
Might be a good idea to add a peakmem_... for the other time_... variants as well
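ASV discovers benchmarks by method prefix, so a `peakmem_` variant can simply reuse the body of its `time_` twin: the same call runs, but ASV reports peak memory instead of wall time. A sketch under the same naming assumptions as above (the class, constructor, and `params` are hypothetical):

```python
class ReadAllBenchmarkSketch:
    params = [1, 10]  # hypothetical times_bigger values

    def __init__(self, lib, symbol):
        self.lib = lib
        self.symbol = symbol

    def time_query_readall(self, times_bigger):
        self.lib.read(f"{self.symbol}{times_bigger}")

    def peakmem_query_readall(self, times_bigger):
        # Identical body; only the prefix changes what ASV measures.
        self.lib.read(f"{self.symbol}{times_bigger}")
```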
I was considering this as well. I will add them later; we can remove them easily.
I have now also added printing of info for the large dataframes produced by this test. So our first test runs with a 1 GB DF while the last runs with a 10 GB DF.
OUTPUT -------->
Parquet file exists!
The procedure is creating N times larger dataframes
by concatenating original DF N times
DF for iterration xSize original ready: 10
<class 'pandas.core.frame.DataFrame'>
Index: 9126570 entries, 0 to 912656
Data columns (total 31 columns):
# Column Dtype
--- ------ -----
0 City/Admin object
1 City/State object
2 City object
3 Created Date/Time float64
4 Date Joined float64
5 FF Ratio float64
6 Favorites int32
7 First Link in Tweet object
8 Followers int32
9 Following int32
10 Gender object
11 Influencer? int32
12 Keyword object
13 LPF float64
14 Language object
15 Lat float64
16 Listed Number int32
17 Long Domain object
18 Long float64
19 Number of Records int32
20 Region object
21 Short Domain object
22 State/Country object
23 State object
24 Tweet Text object
25 Tweets int32
26 Twitter Client object
27 User Bio object
28 User Loc object
29 Username 1 object
30 Username object
dtypes: float64(6), int32(7), object(18)
memory usage: 10.8 GB
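The enlargement step the log describes (concatenating the original DF N times, with the index repeating, hence "0 to 912656" for 9,126,570 rows) can be sketched as follows — the function name is hypothetical:

```python
import pandas as pd

def enlarge_n_times(df: pd.DataFrame, n: int) -> pd.DataFrame:
    # Concatenate the original frame n times.  The index is deliberately
    # not reset, so it repeats n times, matching the log output above.
    return pd.concat([df] * n)
```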
Reference Issues/PRs
What does this implement or fix?
Any other comments?
Checklist
Checklist for code changes...