database search with multiple criteria is very slow #111

nsjarvis · 2024-11-05T19:32:26Z

I'm using the python interface. I find that when I add a 4th condition to the selection string, the query becomes very much slower, so much so that if I run it over the whole run range, it looks dead.

Would it be possible to streamline the queries under the hood, or have it run each part of the query over the results of the previous part, or is there a way for me to do that easily myself in python?

Please try my example, it is really surprising how much slower it becomes when I use the full query string, and here I am only searching over a small number of runs.

import rcdb
db = rcdb.RCDBProvider("mysql://rcdb@hallddb/rcdb2")
selection = "@is_production and not @is_empty_target and cdc_gas_pressure >=99.8"
runs = db.select_runs(selection,30274, 30300)
selection = "@is_production and not @is_empty_target and cdc_gas_pressure >=99.8 and cdc_gas_pressure<=100.2"
runs = db.select_runs(selection,30274, 30300)

nsjarvis · 2024-11-05T19:45:15Z

I tried the same thing in the gui and it gives a proxy error.

https://halldweb.jlab.org/rcdb/runs/search?runFrom=30274&runTo=31057&q=%40is_production+and+%40status_approved+and+cdc_gas_pressure+%3E+99.9+and+cdc_gas_pressure+%3C+100.1+and+not+%40is_empty_target

DraTeots · 2024-12-17T23:50:42Z

select_runs is kind of an old interface to RCDB search. Could you try select_values instead?

https://github.com/JeffersonLab/rcdb/wiki/Select-values

DraTeots · 2024-12-18T17:48:29Z

I made this benchmark:

import time
import rcdb

selection = "@is_production and not @is_empty_target and cdc_gas_pressure >=99.8 and cdc_gas_pressure<=100.2"

def run_select_runs():
    db = rcdb.RCDBProvider("mysql://rcdb@hallddb/rcdb2")
    runs = db.select_runs(selection,30274, 30300)

def run_select_values():
    db = rcdb.RCDBProvider("mysql://rcdb@hallddb/rcdb2")
    runs = db.select_values(['polarization_angle','beam_current'], selection, run_min=30272, run_max=30300)

def benchmark_function(func, n=1):
    """
    Run the given function `n` times and measure the total and average time.
    """
    total_time = 0.0
    for i in range(n):
        start = time.time()
        func()
        end = time.time()
        elapsed = end - start
        total_time += elapsed
        print(f"Run {i+1}: {func.__name__} took {elapsed:.6f} seconds")
    avg_time = total_time / n
    print(f"Average time for {func.__name__} over {n} runs: {avg_time:.6f} seconds\n")

if __name__ == "__main__":
    # Adjust the number of runs as needed
    benchmark_function(run_select_runs, n=5)
    benchmark_function(run_select_values, n=5)

The results are:

C:\toolbox\anaconda3\python.exe C:\dev\rcdb\rcdb\python\tests\benchmark_selecting.py 
Run 1: run_select_runs took 3.320747 seconds
Run 2: run_select_runs took 0.197246 seconds
Run 3: run_select_runs took 0.185122 seconds
Run 4: run_select_runs took 0.193280 seconds
Run 5: run_select_runs took 0.184041 seconds
Average time for run_select_runs over 5 runs: 0.216087 seconds

Run 1: run_select_values took 0.109544 seconds
Run 2: run_select_values took 0.072191 seconds
Run 3: run_select_values took 0.073029 seconds
Run 4: run_select_values took 0.079685 seconds
Run 5: run_select_values took 0.066032 seconds
Average time for run_select_values over 5 runs: 0.080096 seconds

As you can see, the first time it took 3 seconds to execute the query. I believe that is because the query run for the first time and then MySQL just cached the query. But that actually is good and gives an estimate of RCDB overhead, which is ~0.2 sec for select_runs and 0.07 sec for select_values.

P.S. The first run of select_values also took ~0.04 sec longer than the consequent runs. But I assume it is because python went this way for the first time.
P.P.S Tests were done from JLab but from wifi (i.e. outside network)

DraTeots · 2024-12-18T17:51:48Z

Now I run the same benchmark for the whole run range from 30000 to 39999.

C:\toolbox\anaconda3\python.exe C:\dev\rcdb\rcdb\python\tests\benchmark_selecting.py 
Run 1: run_select_runs took 16.996676 seconds
Run 2: run_select_runs took 8.034362 seconds
Run 3: run_select_runs took 6.303619 seconds
Run 4: run_select_runs took 7.555733 seconds
Run 5: run_select_runs took 7.275873 seconds
Average time for run_select_runs over 5 runs: 9.233253 seconds

Run 1: run_select_values took 0.642131 seconds
Run 2: run_select_values took 0.127621 seconds
Run 3: run_select_values took 0.124136 seconds
Run 4: run_select_values took 0.148941 seconds
Run 5: run_select_values took 0.153389 seconds
Average time for run_select_values over 5 runs: 0.239244 seconds

Now it is a clear show of how new select_values outperform select_runs

DraTeots · 2024-12-18T18:05:02Z

Finally I made the search for the whole DB, it looks like this:

Run 1: run_select_runs took 114.611838 seconds
Run 2: run_select_runs took 110.689821 seconds
Run 3: run_select_runs took 116.609994 seconds
Run 4: run_select_runs took 129.203460 seconds
Run 5: run_select_runs took 119.896811 seconds
Average time for run_select_runs over 5 runs: 118.202385 seconds

Run 1: run_select_values took 8.262341 seconds
Run 2: run_select_values took 7.349750 seconds
Run 3: run_select_values took 6.979630 seconds
Run 4: run_select_values took 6.911734 seconds
Run 5: run_select_values took 7.225785 seconds
Average time for run_select_values over 5 runs: 7.345848 seconds

DraTeots · 2024-12-18T18:10:16Z

Now... select_values actually have performance metrics, which one can print like this:

def run_select_values():
    db = rcdb.RCDBProvider("mysql://rcdb@hallddb/rcdb2")
    runs = db.select_values(['polarization_angle','beam_current'], selection, run_min=run_min, run_max=run_max)
    print("preparation      ", runs.performance["preparation"])
    print("query            ", runs.performance["query"])
    print("selection        ", runs.performance["selection"])    
    print("total            ", runs.performance["total"])

And the answer is this:

   preparation       0.02051440000650473
   query             7.012327500007814
   selection         0.09433279998484068
   total             7.127206499979366

Here "preparation" and "selection" is what python does and "query" is a pure time of MySQL DB Query. So we probably can't go faster than what select_values function does with current DB structure. But there were plans to speed it up with virtual tables. But!... Usually it is fast enough for run-range specific searches so there was no demand to really implement it yet.

nsjarvis · 2024-12-18T18:12:02Z

OK, select-values looks good. Is that what the gui uses now?
I was getting much slower times than you are - 5 minutes or more just for 2017 runs.

DraTeots · 2024-12-18T18:14:31Z

The current GUI is a very old version of RCDB and we are now ~~fighting~~ working with CST team to update the server to RHEL9 to install new RCDB web site. The server update is required as they would like to put everything to RHEL9 and we need a modern python version.

nsjarvis · 2024-12-18T18:16:30Z

This is what I settled on last week, can you recommend a neater method please? get_values is very nice, I could not see how to include the run start time etc into the results without doing separate db calls.

`runs = db.select_runs("@is_production and @status_approved", firstrun, lastrun)
vals = runs.get_values(['beam_on_current','polarization_angle','event_count','target_type','cdc_gas_pressure'], insert_r
un_number=True)

table = []

for x in range(0,len(runs)):

run = runs[x]
row = [run.number, run.start_time.strftime('%b %d %Y %H:%M:%S'), run.end_time.strftime('%b %d %Y %H:%M:%S')]

row.extend(vals[x])

table.append(row)
`

nsjarvis · 2024-12-18T18:19:28Z

(I did first write it using select_values, but then reverted to the above when I realised that I also needed to extract the times)

DraTeots · 2024-12-18T18:20:11Z

There are run_start_time and run_end_time conditions now (e.g. for run 30300):

run_start_time | 2017-02-05 15:15:54
run_end_time | 2017-02-05 16:43:04

so you could do runs.get_values([... ,run_start_time, run_end_time], ...)

DraTeots · 2024-12-18T18:21:36Z

They, btw is of type time, so they should be in python datetime when you select them

nsjarvis · 2024-12-18T18:24:33Z

Ok, so I could do the whole thing with select_values now? Great!
Could you add this to the examples, please?

DraTeots · 2024-12-18T18:26:33Z

There is a file with examples:

https://github.com/JeffersonLab/rcdb/blob/main/python/examples/example_select_values.py

Did you mean to add it to Wiki or to add run_start_time there?

nsjarvis · 2024-12-18T18:34:03Z

If you could add it to the py example, that would be great. I did not know that run_start_time existed in that way.
Thank you!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

database search with multiple criteria is very slow #111

database search with multiple criteria is very slow #111

nsjarvis commented Nov 5, 2024

nsjarvis commented Nov 5, 2024

DraTeots commented Dec 17, 2024

DraTeots commented Dec 18, 2024

DraTeots commented Dec 18, 2024

DraTeots commented Dec 18, 2024

DraTeots commented Dec 18, 2024 •

edited

Loading

nsjarvis commented Dec 18, 2024

DraTeots commented Dec 18, 2024

nsjarvis commented Dec 18, 2024

nsjarvis commented Dec 18, 2024

DraTeots commented Dec 18, 2024 •

edited

Loading

DraTeots commented Dec 18, 2024

nsjarvis commented Dec 18, 2024

DraTeots commented Dec 18, 2024 •

edited

Loading

nsjarvis commented Dec 18, 2024

database search with multiple criteria is very slow #111

database search with multiple criteria is very slow #111

Comments

nsjarvis commented Nov 5, 2024

nsjarvis commented Nov 5, 2024

DraTeots commented Dec 17, 2024

DraTeots commented Dec 18, 2024

DraTeots commented Dec 18, 2024

DraTeots commented Dec 18, 2024

DraTeots commented Dec 18, 2024 • edited Loading

nsjarvis commented Dec 18, 2024

DraTeots commented Dec 18, 2024

nsjarvis commented Dec 18, 2024

nsjarvis commented Dec 18, 2024

DraTeots commented Dec 18, 2024 • edited Loading

DraTeots commented Dec 18, 2024

nsjarvis commented Dec 18, 2024

DraTeots commented Dec 18, 2024 • edited Loading

nsjarvis commented Dec 18, 2024

DraTeots commented Dec 18, 2024 •

edited

Loading

DraTeots commented Dec 18, 2024 •

edited

Loading

DraTeots commented Dec 18, 2024 •

edited

Loading