# Database search with multiple criteria is very slow #111
I tried the same thing in the GUI and it gives a proxy error.
`select_runs` is kind of an old interface to RCDB search. Could you try `select_values` instead?
I made this benchmark:

```python
import time
import rcdb

selection = "@is_production and not @is_empty_target and cdc_gas_pressure >= 99.8 and cdc_gas_pressure <= 100.2"

def run_select_runs():
    db = rcdb.RCDBProvider("mysql://rcdb@hallddb/rcdb2")
    runs = db.select_runs(selection, 30274, 30300)

def run_select_values():
    db = rcdb.RCDBProvider("mysql://rcdb@hallddb/rcdb2")
    runs = db.select_values(['polarization_angle', 'beam_current'], selection, run_min=30272, run_max=30300)

def benchmark_function(func, n=1):
    """Run the given function `n` times and measure the total and average time."""
    total_time = 0.0
    for i in range(n):
        start = time.time()
        func()
        end = time.time()
        elapsed = end - start
        total_time += elapsed
        print(f"Run {i+1}: {func.__name__} took {elapsed:.6f} seconds")
    avg_time = total_time / n
    print(f"Average time for {func.__name__} over {n} runs: {avg_time:.6f} seconds\n")

if __name__ == "__main__":
    # Adjust the number of runs as needed
    benchmark_function(run_select_runs, n=5)
    benchmark_function(run_select_values, n=5)
```

The results are:
As you can see, the first run took 3 seconds to execute the query. I believe that is because the query ran for the first time and MySQL then cached it. But that is actually good and gives an estimate of the RCDB overhead, which is ~0.2 sec.

P.S. The first run of …
Now I run the same benchmark for the whole run range from 30000 to 39999.
This clearly shows how the new `select_values` outperforms `select_runs`.
Finally I ran the search over the whole DB; it looks like this:

[results shown in the original issue]
Now... `select_values` actually has performance metrics, which one can print like this:

```python
def run_select_values():
    db = rcdb.RCDBProvider("mysql://rcdb@hallddb/rcdb2")
    runs = db.select_values(['polarization_angle', 'beam_current'], selection, run_min=run_min, run_max=run_max)
    print("preparation ", runs.performance["preparation"])
    print("query       ", runs.performance["query"])
    print("selection   ", runs.performance["selection"])
    print("total       ", runs.performance["total"])
```

And the answer is this:
Here "preparation" and "selection" are what Python does, and "query" is the pure time of the MySQL DB query. So we probably can't go faster than the query time itself.
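For reference, that breakdown can be turned into a quick overhead estimate. A minimal sketch, using a made-up placeholder dict shaped like the `runs.performance` mapping above (the numbers are illustrative, not from this thread):

```python
# Placeholder standing in for runs.performance returned by select_values;
# the values here are invented for illustration only.
performance = {"preparation": 0.012, "query": 0.180, "selection": 0.035, "total": 0.227}

# "preparation" + "selection" is time spent in Python; "query" is pure MySQL time.
python_overhead = performance["preparation"] + performance["selection"]
fraction = python_overhead / performance["total"]
print(f"Python overhead: {python_overhead:.3f} s ({fraction:.0%} of total)")
```

This makes it easy to see whether the bottleneck is the database or the Python-side processing.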
OK, `select_values` looks good. Is that what the GUI uses now?
The current GUI is a very old version of RCDB and we are now …
This is what I settled on last week; can you recommend a neater method please? `get_values` is very nice, but I could not see how to include the run start time etc. into the results without doing separate DB calls.

```python
runs = db.select_runs("@is_production and @status_approved", firstrun, lastrun)
table = []
for x in range(0, len(runs)):
    run = runs[x]
    # ... (the lines building `row` and `vals` were cut off in the original comment)
    row.extend(vals[x])
    table.append(row)
```
(I did first write it using `select_values`, but then reverted to the above when I realised that I also needed to extract the times.)
There are `run_start_time` and `run_end_time` conditions now (e.g. for run 30300):

[condition values shown in the original issue]

so you could do:

[example shown in the original issue]
They are, by the way, of type `time`, so they come back as Python `datetime` objects when you select them.
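Since the values arrive as `datetime` objects, things like run durations become plain subtraction. A small sketch with hypothetical timestamps (the dates below are invented, not taken from the database):

```python
from datetime import datetime

# Hypothetical values shaped like what run_start_time / run_end_time
# conditions would return for a single run once converted to datetime.
run_start_time = datetime(2017, 1, 27, 10, 5, 0)
run_end_time = datetime(2017, 1, 27, 12, 35, 0)

# Subtracting datetimes yields a timedelta, so no manual parsing is needed.
duration = run_end_time - run_start_time
print(f"Run lasted {duration.total_seconds() / 60:.0f} minutes")  # prints "Run lasted 150 minutes"
```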
OK, so I could do the whole thing with `select_values` now? Great!
There is a file with examples: https://github.com/JeffersonLab/rcdb/blob/main/python/examples/example_select_values.py

Did you mean to add it to the Wiki, or to add `run_start_time` there?
If you could add it to the Python example, that would be great. I did not know that `run_start_time` existed in that way.
I'm using the Python interface. I find that when I add a fourth condition to the selection string, the query becomes very much slower, so much so that if I run it over the whole run range, it looks dead.

Would it be possible to streamline the queries under the hood, or have it run each part of the query over the results of the previous part? Or is there a way for me to do that easily myself in Python?

Please try my example; it is really surprising how much slower it becomes when I use the full query string, and here I am only searching over a small number of runs.
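The "each part of the query over the results of the previous part" idea can be sketched in plain Python once the rows are fetched. This is only an illustration of the filtering pattern; the dict field names and sample rows below are invented, not the actual RCDB schema or API:

```python
# Invented sample rows standing in for already-fetched run condition values.
rows = [
    {"run": 30274, "is_production": True,  "is_empty_target": False, "cdc_gas_pressure": 100.0},
    {"run": 30275, "is_production": True,  "is_empty_target": False, "cdc_gas_pressure": 101.5},
    {"run": 30276, "is_production": False, "is_empty_target": False, "cdc_gas_pressure": 100.1},
]

# One predicate per search criterion, applied in sequence so each filter
# only sees the survivors of the previous one.
conditions = [
    lambda r: r["is_production"],
    lambda r: not r["is_empty_target"],
    lambda r: 99.8 <= r["cdc_gas_pressure"] <= 100.2,
]

for cond in conditions:
    rows = [r for r in rows if cond(r)]

selected_runs = [r["run"] for r in rows]
print(selected_runs)  # prints [30274]
```

Whether this actually helps depends on where the slowdown is; if the time is spent in the SQL query itself (the "query" metric above), pre-fetching and filtering in Python is the main way to sidestep it.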