Add embeddings to sqlite #228
base: main
Conversation
@florian-huber I wanted to update MS2Query to the new version of matchms. One of the issues was that we were using pickled pandas files for the embeddings. Here I moved the embeddings into the SQLite file. This is also a lot more intuitive to me: having one file containing all the library information reduces the risk of creating mismatches between embeddings and spectra. I still have a few small things to do before merging, but I thought it was good to ask for your feedback already.
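For illustration, a minimal sketch of what storing the embeddings alongside the spectra in a single SQLite file could look like; the function, table, and column names here are assumptions for the example, not MS2Query's actual schema:

```python
import sqlite3

import numpy as np
import pandas as pd

def add_embeddings_to_sqlite(sqlite_file, embeddings):
    """Store the embeddings in the same SQLite file as the spectra.

    The DataFrame index is assumed to hold the spectrum identifiers,
    which keeps embeddings and spectra linked inside one file.
    """
    conn = sqlite3.connect(sqlite_file)
    embeddings.to_sql("embeddings", conn, if_exists="replace",
                      index=True, index_label="spectrum_id")
    conn.close()

# Five toy spectra with 3-dimensional embeddings (real ones are much larger).
toy = pd.DataFrame(np.random.rand(5, 3), columns=["d0", "d1", "d2"],
                   index=[f"spectrum_{i}" for i in range(5)])
add_embeddings_to_sqlite("library.sqlite", toy)
```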
Looks like we have exceeded the limits of SQLite a bit here. A few alternatives come to mind.
I will have a look at the code changes in more detail in the coming days.
And some more thoughts...
@florian-huber Thanks! Parquet sounds simple to implement, but I would be concerned that Parquet has backwards compatibility issues in the same way that pickle had. I am not sure whether this is the case, though. Do you know?
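For reference, a Parquet round trip with pandas is indeed only a few lines. This sketch assumes a Parquet engine such as pyarrow is installed, and the array shape is made up:

```python
import numpy as np
import pandas as pd

# Fake embeddings: 1,000 rows of 200-dimensional float32 vectors.
embeddings = pd.DataFrame(np.random.rand(1_000, 200).astype(np.float32))
embeddings.columns = embeddings.columns.astype(str)  # Parquet requires string column names

embeddings.to_parquet("embeddings.parquet")
loaded = pd.read_parquet("embeddings.parquet")
assert loaded.shape == embeddings.shape
```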
@niekdejonge I am running some performance tests using multiple formats. So far it seems that Parquet or Feather will also not solve the issue. I'll keep you posted.
@florian-huber Great, thanks!
@florian-huber I just realized I used pd.read_sql_query, but there is also the option of pd.read_sql_table. This might speed up the process as well, since a query has the flexibility to load only part of the DataFrame, which is functionality we do not need. I will quickly check the speed of this.
@florian-huber I checked read_sql_table, but it only seemed to make things slower...
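One possible contributor to that slowdown: pd.read_sql_table only works through a SQLAlchemy connectable, not a raw sqlite3 connection, so it adds a layer rather than removing one. A small sketch of the two call styles, using the 'data' table name from the snippet below:

```python
import pandas as pd
from sqlalchemy import create_engine

# read_sql_table needs a SQLAlchemy engine; passing a plain sqlite3
# connection raises NotImplementedError in pandas.
engine = create_engine("sqlite:///library.sqlite")

df_table = pd.read_sql_table("data", engine)                # whole table
df_query = pd.read_sql_query("SELECT * FROM data", engine)  # same rows via SQL
```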
@florian-huber Cool! Which method did you use for storing and loading from sqlite? The decrepency with my test might be due to the way of storing or loading the data. If we can get to that speed for loading embeddings, I think sqlite should be the preferred option. |
I used:

```python
import sqlite3

import pandas as pd

def save_to_sqlite(df, filename):
    # Write the full DataFrame into a single table called 'data'.
    conn = sqlite3.connect(filename)
    df.to_sql('data', conn, if_exists='replace', index=False)
    conn.close()

def load_from_sqlite(filename):
    # Read the whole 'data' table back into a DataFrame.
    conn = sqlite3.connect(filename)
    df = pd.read_sql_query("SELECT * FROM data", conn)
    conn.close()
    return df
```
Hmm, very similar to what I did, but I did use an index. I will try it in exactly the way you did, to see if I can replicate this speed.
@florian-huber Hmm, surprising. I tried storing and loading in the exact way you did with the MS2Deepscore embeddings, and it still takes about 30 s for 314,000 embeddings; 100,000 embeddings take about 10 s. Do you have any idea what could be causing this? Maybe the dtype of the floats used as input? Or just my local hardware?
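The dtype is cheap to rule out: float64 embeddings take twice the space of float32, so checking and downcasting before storing is a quick experiment. The DataFrame here is a hypothetical stand-in:

```python
import numpy as np
import pandas as pd

# Stand-in for the embeddings being stored; np.random.rand defaults to float64.
embeddings_df = pd.DataFrame(np.random.rand(1_000, 200))

print(embeddings_df.dtypes.unique())              # [dtype('float64')]
embeddings_df = embeddings_df.astype(np.float32)  # halves the bytes to move
```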
Just to quote the Apache Arrow FAQ (https://arrow.apache.org/faq/) on the subject: as an example, the original format of 2013 is still perfectly readable today. It seems a perfect fit for the data you have, while SQLite (a very good piece of software for many applications) does not seem the obvious choice for storing dataframes (columnar data with no relational structure).
OUTDATED
By now this has been fixed with #233. However, some other restructuring changes made in this PR might still be valuable later, so I will leave it open for now.
Quick attempt to incorporate the embeddings into the SQLite file. I figured it would probably be pretty straightforward, and that turned out to be the case.
This will resolve our dependency on pandas, which is important for solving #199 and #191.
Speed
Loading all embeddings from pickle takes < 1 s, while loading them from SQLite takes about 30 s. So it is a tradeoff: a more logical way of storing the data, but a slower start. The loading only needs to happen once, when MS2Library is initialized.
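Schematically, that one-time cost could be confined to initialization as sketched below. This is an illustration only; the real MS2Library API, table name, and column layout may differ:

```python
import sqlite3

import pandas as pd

class MS2LibrarySketch:
    """Illustrative only: pays the ~30 s loading cost once, at initialization."""

    def __init__(self, sqlite_file):
        conn = sqlite3.connect(sqlite_file)
        # Assumes an 'embeddings' table keyed by a 'spectrum_id' column.
        self._embeddings = pd.read_sql_query(
            "SELECT * FROM embeddings", conn).set_index("spectrum_id")
        conn.close()

    def get_embedding(self, spectrum_id):
        # All later lookups are in-memory and effectively instant.
        return self._embeddings.loc[spectrum_id]
```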