Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

database search with multiple criteria is very slow #111

Open
nsjarvis opened this issue Nov 5, 2024 · 15 comments
Open

database search with multiple criteria is very slow #111

nsjarvis opened this issue Nov 5, 2024 · 15 comments

Comments

@nsjarvis
Copy link

nsjarvis commented Nov 5, 2024

I'm using the python interface. I find that when I add a 4th condition to the selection string, the query becomes very much slower, so much so that if I run it over the whole run range, it looks dead.

Would it be possible to streamline the queries under the hood, or have it run each part of the query over the results of the previous part, or is there a way for me to do that easily myself in python?

Please try my example, it is really surprising how much slower it becomes when I use the full query string, and here I am only searching over a small number of runs.

import rcdb
db = rcdb.RCDBProvider("mysql://rcdb@hallddb/rcdb2")
selection = "@is_production and not @is_empty_target and cdc_gas_pressure >=99.8"
runs = db.select_runs(selection,30274, 30300)
selection = "@is_production and not @is_empty_target and cdc_gas_pressure >=99.8 and cdc_gas_pressure<=100.2"
runs = db.select_runs(selection,30274, 30300)

@DraTeots
Copy link
Collaborator

select_runs is kind of an old interface to RCDB search. Could you try select_values instead?

https://github.com/JeffersonLab/rcdb/wiki/Select-values

@DraTeots
Copy link
Collaborator

I made this benchmark:

import time
import rcdb

selection = "@is_production and not @is_empty_target and cdc_gas_pressure >=99.8 and cdc_gas_pressure<=100.2"

def run_select_runs():
    db = rcdb.RCDBProvider("mysql://rcdb@hallddb/rcdb2")
    runs = db.select_runs(selection,30274, 30300)

def run_select_values():
    db = rcdb.RCDBProvider("mysql://rcdb@hallddb/rcdb2")
    runs = db.select_values(['polarization_angle','beam_current'], selection, run_min=30272, run_max=30300)

def benchmark_function(func, n=1):
    """
    Run the given function `n` times and measure the total and average time.
    """
    total_time = 0.0
    for i in range(n):
        start = time.time()
        func()
        end = time.time()
        elapsed = end - start
        total_time += elapsed
        print(f"Run {i+1}: {func.__name__} took {elapsed:.6f} seconds")
    avg_time = total_time / n
    print(f"Average time for {func.__name__} over {n} runs: {avg_time:.6f} seconds\n")

if __name__ == "__main__":
    # Adjust the number of runs as needed
    benchmark_function(run_select_runs, n=5)
    benchmark_function(run_select_values, n=5)

The results are:

C:\toolbox\anaconda3\python.exe C:\dev\rcdb\rcdb\python\tests\benchmark_selecting.py 
Run 1: run_select_runs took 3.320747 seconds
Run 2: run_select_runs took 0.197246 seconds
Run 3: run_select_runs took 0.185122 seconds
Run 4: run_select_runs took 0.193280 seconds
Run 5: run_select_runs took 0.184041 seconds
Average time for run_select_runs over 5 runs: 0.216087 seconds

Run 1: run_select_values took 0.109544 seconds
Run 2: run_select_values took 0.072191 seconds
Run 3: run_select_values took 0.073029 seconds
Run 4: run_select_values took 0.079685 seconds
Run 5: run_select_values took 0.066032 seconds
Average time for run_select_values over 5 runs: 0.080096 seconds

As you can see, the first time it took 3 seconds to execute the query. I believe that is because the query run for the first time and then MySQL just cached the query. But that actually is good and gives an estimate of RCDB overhead, which is ~0.2 sec for select_runs and 0.07 sec for select_values.

P.S. The first run of select_values also took ~0.04 sec longer than the consequent runs. But I assume it is because python went this way for the first time.
P.P.S Tests were done from JLab but from wifi (i.e. outside network)

@DraTeots
Copy link
Collaborator

Now I run the same benchmark for the whole run range from 30000 to 39999.

C:\toolbox\anaconda3\python.exe C:\dev\rcdb\rcdb\python\tests\benchmark_selecting.py 
Run 1: run_select_runs took 16.996676 seconds
Run 2: run_select_runs took 8.034362 seconds
Run 3: run_select_runs took 6.303619 seconds
Run 4: run_select_runs took 7.555733 seconds
Run 5: run_select_runs took 7.275873 seconds
Average time for run_select_runs over 5 runs: 9.233253 seconds

Run 1: run_select_values took 0.642131 seconds
Run 2: run_select_values took 0.127621 seconds
Run 3: run_select_values took 0.124136 seconds
Run 4: run_select_values took 0.148941 seconds
Run 5: run_select_values took 0.153389 seconds
Average time for run_select_values over 5 runs: 0.239244 seconds

Now it is a clear show of how new select_values outperform select_runs

@DraTeots
Copy link
Collaborator

Finally I made the search for the whole DB, it looks like this:

Run 1: run_select_runs took 114.611838 seconds
Run 2: run_select_runs took 110.689821 seconds
Run 3: run_select_runs took 116.609994 seconds
Run 4: run_select_runs took 129.203460 seconds
Run 5: run_select_runs took 119.896811 seconds
Average time for run_select_runs over 5 runs: 118.202385 seconds

Run 1: run_select_values took 8.262341 seconds
Run 2: run_select_values took 7.349750 seconds
Run 3: run_select_values took 6.979630 seconds
Run 4: run_select_values took 6.911734 seconds
Run 5: run_select_values took 7.225785 seconds
Average time for run_select_values over 5 runs: 7.345848 seconds

@DraTeots
Copy link
Collaborator

DraTeots commented Dec 18, 2024

Now... select_values actually have performance metrics, which one can print like this:

def run_select_values():
    db = rcdb.RCDBProvider("mysql://rcdb@hallddb/rcdb2")
    runs = db.select_values(['polarization_angle','beam_current'], selection, run_min=run_min, run_max=run_max)
    print("preparation      ", runs.performance["preparation"])
    print("query            ", runs.performance["query"])
    print("selection        ", runs.performance["selection"])    
    print("total            ", runs.performance["total"])

And the answer is this:

   preparation       0.02051440000650473
   query             7.012327500007814
   selection         0.09433279998484068
   total             7.127206499979366

Here "preparation" and "selection" is what python does and "query" is a pure time of MySQL DB Query. So we probably can't go faster than what select_values function does with current DB structure. But there were plans to speed it up with virtual tables. But!... Usually it is fast enough for run-range specific searches so there was no demand to really implement it yet.

@nsjarvis
Copy link
Author

OK, select-values looks good. Is that what the gui uses now?
I was getting much slower times than you are - 5 minutes or more just for 2017 runs.

@DraTeots
Copy link
Collaborator

The current GUI is a very old version of RCDB and we are now fighting working with CST team to update the server to RHEL9 to install new RCDB web site. The server update is required as they would like to put everything to RHEL9 and we need a modern python version.

@nsjarvis
Copy link
Author

This is what I settled on last week, can you recommend a neater method please? get_values is very nice, I could not see how to include the run start time etc into the results without doing separate db calls.

`runs = db.select_runs("@is_production and @status_approved", firstrun, lastrun)
vals = runs.get_values(['beam_on_current','polarization_angle','event_count','target_type','cdc_gas_pressure'], insert_r
un_number=True)

table = []

for x in range(0,len(runs)):

run = runs[x]
row = [run.number, run.start_time.strftime('%b %d %Y %H:%M:%S'), run.end_time.strftime('%b %d %Y %H:%M:%S')]

row.extend(vals[x])

table.append(row)
`

@nsjarvis
Copy link
Author

(I did first write it using select_values, but then reverted to the above when I realised that I also needed to extract the times)

@DraTeots
Copy link
Collaborator

DraTeots commented Dec 18, 2024

There are run_start_time and run_end_time conditions now (e.g. for run 30300):

run_start_time | 2017-02-05 15:15:54
run_end_time | 2017-02-05 16:43:04

so you could do runs.get_values([... ,run_start_time, run_end_time], ...)

@DraTeots
Copy link
Collaborator

They, btw is of type time, so they should be in python datetime when you select them

@nsjarvis
Copy link
Author

Ok, so I could do the whole thing with select_values now? Great!
Could you add this to the examples, please?

@DraTeots
Copy link
Collaborator

DraTeots commented Dec 18, 2024

There is a file with examples:

https://github.com/JeffersonLab/rcdb/blob/main/python/examples/example_select_values.py

Did you mean to add it to Wiki or to add run_start_time there?

@nsjarvis
Copy link
Author

If you could add it to the py example, that would be great. I did not know that run_start_time existed in that way.
Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants