new bi sample benchmark tests #1995

Open · wants to merge 2 commits into master
Conversation

@grusev (Collaborator) commented Nov 12, 2024

Reference Issues/PRs

What does this implement or fix?

    Sample benchmark test using one open-source BI CSV source.
    The logic of the test is as follows (see the sketch after this list):
        - download the source in .bz2 format if the parquet file does not exist
        - convert it to parquet format
        - prepare a library containing several symbols constructed from this DF
        - for each query we want to benchmark, pre-check that the query produces the SAME result in Pandas and ArcticDB
        - run the benchmark tests
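
Below is a minimal sketch of that setup flow. The URL, file names, library URI, and helper names are hypothetical placeholders rather than this PR's actual code; only the pandas and arcticdb calls are real APIs.

    import os
    import urllib.request

    import pandas as pd
    from arcticdb import Arctic
    from pandas.testing import assert_frame_equal

    PARQUET_FILE = "bi_sample.parquet"          # hypothetical local cache
    CSV_BZ2_URL = "https://example.com/bi.bz2"  # placeholder for the BI CSV source

    def setup_library(times_bigger_params=(1, 10)):
        # Download the .bz2 source only if the parquet file is missing,
        # then convert it (pandas decompresses .bz2 transparently).
        if not os.path.exists(PARQUET_FILE):
            urllib.request.urlretrieve(CSV_BZ2_URL, "bi_sample.csv.bz2")
            pd.read_csv("bi_sample.csv.bz2").to_parquet(PARQUET_FILE)

        df = pd.read_parquet(PARQUET_FILE)
        ac = Arctic("lmdb://bi_benchmarks")  # placeholder URI
        lib = ac.get_library("bi", create_if_missing=True)

        # Write several symbols, each being N concatenated copies of the base DF.
        for n in times_bigger_params:
            lib.write(f"symbol{n}", pd.concat([df] * n))
        return lib, df

    def check_query_parity(lib, df, pandas_query, q, symbol="symbol1"):
        # Pre-check: the query must produce the SAME result in Pandas
        # and in ArcticDB before it is benchmarked.
        expected = pandas_query(df.copy(deep=True))
        actual = lib.read(symbol, query_builder=q).data
        assert_frame_equal(expected.sort_index(), actual.sort_index(), check_dtype=False)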

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@G-D-Petrov (Collaborator) left a comment

Not necessary for this PR, but it might be nice to run these benchmarks for the read_batch method as well.
That will probably make more sense once we add more of these data sets in the future.
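
For reference, a read_batch variant might look like the sketch below. Library.read_batch is ArcticDB's real batch-read call, but this benchmark method is only illustrative and not part of the PR:

    def time_query_read_batch(self, times_bigger):
        # read_batch fetches several symbols in one call and returns one
        # result per requested symbol.
        results = self.lib.read_batch([f"{self.symbol}{n}" for n in (1, times_bigger)])
        return [r.data for r in results]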

Collaborator

We will need to set up git lfs for this file before merging the PR.

df = self.lib.read(f"{self.symbol}{times_bigger}", query_builder=q)
return df.data

def get_query_groupaby_city_count_isin_filter(self, q):
Collaborator

Typo: should be groupby
Applies to all the methods with the same typo

Collaborator Author

done!

def time_query_readall(self, times_bigger):
    self.lib.read(f"{self.symbol}{times_bigger}")

def get_query_groupaby_city_count_all(self, q):
Collaborator

nit: it looks like there is no need for these to be methods of the class; they can be defined outside the class, which will make it a bit more readable.
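
For example, such a helper could become a module-level function. The body below is a guess inferred from the method name, not code from this PR:

    def get_query_groupby_city_count_all(df):
        # Hypothetical aggregation: a count grouped by City, as the
        # method name suggests.
        return df.groupby("City").count()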

Collaborator Author

done!

_df = self.df.copy(deep=True)
arctic_df = self.time_query_groupaby_city_count_filter_two_aggregations(BIBenchmarks.params[0])
_df = self.get_query_groupaby_city_count_filter_two_aggregations(_df)
arctic_df.sort_index(inplace=True)
Collaborator
Looks like the sorting of the index is already done in the assert_frame_equal function, so it is not needed here.
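
A sketch of why the call-site sort is redundant, assuming the helper sorts internally as this comment describes:

    from pandas.testing import assert_frame_equal as pd_assert_frame_equal

    def assert_frame_equal(pandas_df, arctic_df):
        # Both frames are sorted here, so callers do not need to sort first.
        pd_assert_frame_equal(pandas_df.sort_index(), arctic_df.sort_index())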

Collaborator Author

Indeed! A late refactoring omission.


del self.ac

def assert_frame_equal(self, pandas_df:pd.DataFrame, arctic_df:pd.DataFrame):
Collaborator

nit: Looks like this doesn't need to be a method of the class

Collaborator Author

agree!

df = self.lib.read(f"{self.symbol}{times_bigger}", query_builder=q)
return df.data

def peakmem_query_groupaby_city_count_filter_two_aggregations(self, times_bigger):
Collaborator
Might be a good idea to add a peakmem_... for the other time_... variants as well
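
ASV discovers benchmarks by prefix: time_... methods measure wall time and peakmem_... methods measure peak memory, so each query can be covered by a pair of methods. A sketch following this PR's naming:

    def time_query_readall(self, times_bigger):
        self.lib.read(f"{self.symbol}{times_bigger}")

    def peakmem_query_readall(self, times_bigger):
        # Same body; ASV reports peak memory instead of elapsed time.
        self.lib.read(f"{self.symbol}{times_bigger}")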

Collaborator Author

I was considering this also. I will add them later; we can remove them easily.

Collaborator Author

I have now also added printing of info for the large dataframes produced by this test. Our first test runs with a 1 GB DF, while the last runs with a 10 GB DF.

         OUTPUT -------->
         Parquet file exists!
         The procedure is creating N times larger dataframes
         by concatenating original DF N times
         DF for iterration xSize original ready:  10
         <class 'pandas.core.frame.DataFrame'>
         Index: 9126570 entries, 0 to 912656
         Data columns (total 31 columns):
          #   Column               Dtype
         ---  ------               -----
          0   City/Admin           object
          1   City/State           object
          2   City                 object
          3   Created Date/Time    float64
          4   Date Joined          float64
          5   FF Ratio             float64
          6   Favorites            int32
          7   First Link in Tweet  object
          8   Followers            int32
          9   Following            int32
          10  Gender               object
          11  Influencer?          int32
          12  Keyword              object
          13  LPF                  float64
          14  Language             object
          15  Lat                  float64
          16  Listed Number        int32
          17  Long Domain          object
          18  Long                 float64
          19  Number of Records    int32
          20  Region               object
          21  Short Domain         object
          22  State/Country        object
          23  State                object
          24  Tweet Text           object
          25  Tweets               int32
          26  Twitter Client       object
          27  User Bio             object
          28  User Loc             object
          29  Username 1           object
          30  Username             object
         dtypes: float64(6), int32(7), object(18)
         memory usage: 10.8 GB
