new bi sample benchmark tests #1995

Open · wants to merge 2 commits into master
Conversation

@grusev (Collaborator) commented Nov 12, 2024

Reference Issues/PRs

What does this implement or fix?

    Sample benchmark test using one open-source BI CSV source.
    The logic of the test is as follows (see the sketch after this list):
        - download the source in .bz2 format if the parquet file does not exist
        - convert it to parquet format
        - prepare a library containing several symbols constructed from this DF
        - for each query we want to benchmark, pre-check that the query produces the SAME result in Pandas and ArcticDB
        - run the benchmark tests
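
Below is a minimal sketch of that setup flow. The URL, file names, library URI, and helper names are hypothetical placeholders rather than this PR's actual code; only the pandas and arcticdb calls are real APIs.

    import os
    import urllib.request

    import pandas as pd
    from arcticdb import Arctic
    from pandas.testing import assert_frame_equal

    PARQUET_FILE = "bi_sample.parquet"          # hypothetical local cache
    CSV_BZ2_URL = "https://example.com/bi.bz2"  # placeholder for the BI CSV source

    def setup_library(times_bigger_params=(1, 10)):
        # Download the .bz2 source only if the parquet file is missing,
        # then convert it (pandas decompresses .bz2 transparently).
        if not os.path.exists(PARQUET_FILE):
            urllib.request.urlretrieve(CSV_BZ2_URL, "bi_sample.csv.bz2")
            pd.read_csv("bi_sample.csv.bz2").to_parquet(PARQUET_FILE)

        df = pd.read_parquet(PARQUET_FILE)
        ac = Arctic("lmdb://bi_benchmarks")  # placeholder URI
        lib = ac.get_library("bi", create_if_missing=True)

        # Write several symbols, each being N concatenated copies of the base DF.
        for n in times_bigger_params:
            lib.write(f"symbol{n}", pd.concat([df] * n))
        return lib, df

    def check_query_parity(lib, df, pandas_query, q, symbol="symbol1"):
        # Pre-check: the query must produce the SAME result in Pandas
        # and in ArcticDB before it is benchmarked.
        expected = pandas_query(df.copy(deep=True))
        actual = lib.read(symbol, query_builder=q).data
        assert_frame_equal(expected.sort_index(), actual.sort_index(), check_dtype=False)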

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@G-D-Petrov (Collaborator) left a comment

Not necessary for this PR, but it might be nice to run these benchmarks for the read_batch method as well.
That will probably make more sense once we add more of these data sets in the future.
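
For reference, a read_batch variant might look like the sketch below. Library.read_batch is ArcticDB's real batch-read call, but this benchmark method is only illustrative and not part of the PR:

    def time_query_read_batch(self, times_bigger):
        # read_batch fetches several symbols in one call and returns one
        # result per requested symbol.
        results = self.lib.read_batch([f"{self.symbol}{n}" for n in (1, times_bigger)])
        return [r.data for r in results]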

Collaborator

We will need to set up git lfs for this file before merging the PR.

df = self.lib.read(f"{self.symbol}{times_bigger}", query_builder=q)
return df.data

def get_query_groupaby_city_count_isin_filter(self, q):
Collaborator

Typo: should be groupby
Applies to all the methods with the same typo

Collaborator Author

done!

def time_query_readall(self, times_bigger):
    self.lib.read(f"{self.symbol}{times_bigger}")

def get_query_groupaby_city_count_all(self, q):
Collaborator

nit: it looks like there is no need for these to be methods of the class; they can be defined outside the class, which will make it a bit more readable.
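
For example, such a helper could become a module-level function. The body below is a guess inferred from the method name, not code from this PR:

    def get_query_groupby_city_count_all(df):
        # Hypothetical aggregation: a count grouped by City, as the
        # method name suggests.
        return df.groupby("City").count()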

Collaborator Author

done!

_df = self.df.copy(deep=True)
arctic_df = self.time_query_groupaby_city_count_filter_two_aggregations(BIBenchmarks.params[0])
_df = self.get_query_groupaby_city_count_filter_two_aggregations(_df)
arctic_df.sort_index(inplace=True)
Collaborator
Looks like the sorting of the index is already done in the assert_frame_equal function, so it is not needed here.
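
A sketch of why the call-site sort is redundant, assuming the helper sorts internally as this comment describes:

    from pandas.testing import assert_frame_equal as pd_assert_frame_equal

    def assert_frame_equal(pandas_df, arctic_df):
        # Both frames are sorted here, so callers do not need to sort first.
        pd_assert_frame_equal(pandas_df.sort_index(), arctic_df.sort_index())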

Collaborator Author

Indeed! A late refactoring omission.


del self.ac

def assert_frame_equal(self, pandas_df:pd.DataFrame, arctic_df:pd.DataFrame):
Collaborator

nit: Looks like this doesn't need to be a method of the class

Collaborator Author

agree!

df = self.lib.read(f"{self.symbol}{times_bigger}", query_builder=q)
return df.data

def peakmem_query_groupaby_city_count_filter_two_aggregations(self, times_bigger):
Collaborator
Might be a good idea to add a peakmem_... for the other time_... variants as well
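
ASV discovers benchmarks by prefix: time_... methods measure wall time and peakmem_... methods measure peak memory, so each query can be covered by a pair of methods. A sketch following this PR's naming:

    def time_query_readall(self, times_bigger):
        self.lib.read(f"{self.symbol}{times_bigger}")

    def peakmem_query_readall(self, times_bigger):
        # Same body; ASV reports peak memory instead of elapsed time.
        self.lib.read(f"{self.symbol}{times_bigger}")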

Collaborator Author

I was considering this also. I will add them later; we can remove them easily.

Collaborator Author

I have now also added printing of info for the large dataframes produced by this test. Our first test runs with a 1 GB DF, while the last runs with a 10 GB DF.

         OUTPUT -------->
         Parquet file exists!
         The procedure is creating N times larger dataframes
         by concatenating original DF N times
         DF for iterration xSize original ready:  10
         <class 'pandas.core.frame.DataFrame'>
         Index: 9126570 entries, 0 to 912656
         Data columns (total 31 columns):
          #   Column               Dtype
         ---  ------               -----
          0   City/Admin           object
          1   City/State           object
          2   City                 object
          3   Created Date/Time    float64
          4   Date Joined          float64
          5   FF Ratio             float64
          6   Favorites            int32
          7   First Link in Tweet  object
          8   Followers            int32
          9   Following            int32
          10  Gender               object
          11  Influencer?          int32
          12  Keyword              object
          13  LPF                  float64
          14  Language             object
          15  Lat                  float64
          16  Listed Number        int32
          17  Long Domain          object
          18  Long                 float64
          19  Number of Records    int32
          20  Region               object
          21  Short Domain         object
          22  State/Country        object
          23  State                object
          24  Tweet Text           object
          25  Tweets               int32
          26  Twitter Client       object
          27  User Bio             object
          28  User Loc             object
          29  Username 1           object
          30  Username             object
         dtypes: float64(6), int32(7), object(18)
         memory usage: 10.8 GB
