Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AnnData automatically converts int index to str, therefore matching some Gene synonyms #2124

Open
felix0097 opened this issue Oct 31, 2024 · 14 comments
Assignees

Comments

@felix0097
Copy link
Contributor

Report

ln.Curator.from_anndata matches wrong gene symbols. For an AnnData object where the var index are just integers, the Curator matches SCARNA10 and EGR1 . Which is a bit odd as the gene names are just integers in this case.

That's the output of curator.validate:

• saving validated records of 'var_index'
✓ added 2 records from public with Gene.symbol for var_index: 'SCARNA10', 'EGR1'
• mapping var_index on Gene.symbol
!    998 terms are not validated: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ...
→ fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
False

Here's an example on how to reproduce the bug:

!lamin init --storage ./lamin-intro --schema bionty
import anndata
import numpy as np
import pandas as pd
import lamindb as ln
import bionty as bt

adata = anndata.AnnData(
    X=np.ones((10, 1000)),
    var=pd.DataFrame(index=range(1000))
)

curator = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    organism="human",
)
curator.validate()

Version information


anndata 0.10.9
bionty 0.52.0
lamindb 0.76.15
numpy 2.1.2
pandas 2.2.3
session_info 1.0.0

annotated_types 0.7.0
anyio NA
appdirs 1.4.4
appnope 0.1.4
arrow 1.3.0
asgiref 3.8.1
asttokens NA
attr 24.2.0
attrs 24.2.0
babel 2.16.0
botocore 1.35.51
certifi 2024.08.30
chardet 5.2.0
charset_normalizer 3.4.0
click 8.1.7
comm 0.2.2
cython_runtime NA
dateutil 2.9.0.post0
debugpy 1.8.7
decorator 5.1.1
deprecation 2.1.0
dj_database_url NA
django 5.1.2
dotenv NA
exceptiongroup 1.2.2
executing 2.1.0
fastjsonschema NA
fastobo 0.12.3
filelock 3.16.1
fqdn NA
fsspec 2024.10.0
gotrue 2.8.1
graphlib NA
h11 0.14.0
h2 4.1.0
h5py 3.12.1
hpack 4.0.0
httpcore 1.0.6
httpx 0.27.2
hyperframe 6.0.1
idna 3.10
ipykernel 6.29.5
isoduration NA
jedi 0.19.1
jinja2 3.1.4
jmespath 1.0.1
json5 0.9.25
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema_specifications NA
jupyter_events 0.10.0
jupyter_server 2.14.2
jupyterlab_server 2.27.3
lamin_utils 0.13.7
lamindb_setup 0.80.0
lnschema_core 0.76.1
markupsafe 3.0.2
natsort 8.4.0
nbformat 5.10.4
overrides NA
packaging 24.1
parso 0.8.4
pexpect 4.9.0
platformdirs 4.3.6
postgrest 0.13.2
prometheus_client NA
prompt_toolkit 3.0.48
pronto 2.5.5
psutil 6.1.0
psycopg2 2.9.10 (dt dec pq3 ext lo64)
ptyprocess 0.7.0
pure_eval 0.2.3
pyarrow 18.0.0
pydantic 2.9.2
pydantic_core 2.23.4
pydantic_settings 2.6.0
pydev_ipython NA
pydevconsole NA
pydevd 3.1.0
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.18.0
pythonjsonlogger NA
pytz 2024.2
realtime 1.0.6
referencing NA
requests 2.32.3
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
rich NA
rpds NA
scipy 1.14.1
send2trash NA
six 1.16.0
sniffio 1.3.1
sqlparse 0.5.1
stack_data 0.6.3
storage3 0.5.5
strenum 0.4.15
supabase 2.2.1
supafunc NA
tornado 6.4.1
traitlets 5.14.3
typing_extensions NA
upath 0.2.5
uri_template NA
urllib3 2.2.3
wcwidth 0.2.13
webcolors 24.8.0
websocket 1.8.0
websockets 12.0
yaml 6.0.2
zmq 26.2.0
zoneinfo NA

IPython 8.29.0
jupyter_client 8.6.3
jupyter_core 5.7.2
jupyterlab 4.2.5

Python 3.10.15 (main, Oct 3 2024, 02:33:33) [Clang 14.0.6 ]
macOS-10.16-x86_64-i386-64bit

Session information updated at 2024-10-31 10:38

@Zethson Zethson self-assigned this Nov 4, 2024
@Zethson
Copy link
Member

Zethson commented Nov 4, 2024

Okay so unfortunately that's a synonyms issue:

  1. EGR1 has the synonym 225
  2. SCARNA10 has the synonym 52

@felix0097 We can consider dropping those synonyms because I really don't know who'd ever want to use numbers as gene names.
@sunnyosun is dropping them the best idea? WDYT?

@falexwolf
Copy link
Member

Why not deal with this based on type comparison?

Re-phrase: Why don't we just stop casting integers to strings? It seems dangerous and unnecessary.

@Zethson
Copy link
Member

Zethson commented Nov 4, 2024

Ah yeah, that's a fair point. I'll have a look. Thanks!

In any case, it's weird to have numbers in there in the first place.

@falexwolf
Copy link
Member

Ah yeah, that's a fair point. I'll have a look. Thanks!

Great! I hope that's really simple. It's a fundamental issue with the code that I think should be fixed & tested soon.

In any case, it's weird to have numbers in there in the first place.

Yes, but we can't do anything about it. Our expectation unfortunately needs to be that there is a high amount of noise/nonsense in public sources.

Hence: our implementation needs to be as robust as it can be in the presence of noise rather than attempting to "clean ontologies from noise". Cleaning ontologies might be adequate in some particular cases but I don't believe it's adequate here. There might be numbers in other synonym fields of ontologies as well.

@Zethson
Copy link
Member

Zethson commented Nov 8, 2024

Here's the issue:

adata = anndata.AnnData(
    X=np.ones((10, 1000)),
    var=pd.DataFrame(index=range(1000))
)

gives us a str (categorical) index. This is documented behavior (https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.html):

Indexing into an AnnData object can be performed by relative position with numeric indices (like pandas’ iloc()), or by labels (like loc()). To avoid ambiguity with numeric indexing into observations or variables, indexes of the AnnData object are converted to strings by the constructor.

I can perform type checks and see whether the input is "number like" but it's of course not perfect...

@Zethson
Copy link
Member

Zethson commented Nov 8, 2024

I talked a lot with Phil about this. A perfect technical solution seems out of reach but maybe we can improve the situation at least. Ideas:

  1. Doing nothing and considering this a weird edge case. Phil says that type checking could become easier when AnnData switches to UUIDs.
  2. Attempting to guess whether the input identifiers are "number like" and then type checking.
  3. Removing these weird numbers from the gene ontology. I highly doubt that other ontologies have this issue or that these numbers are even useful. Who uses such gene symbol names ?!?
  4. Being really opinionated and always converting gene symbols to ensembl ids and then validating these. We could do that internally or even be super opinionated and modify the input AnnData object then.

@sunnyosun sunnyosun changed the title ln.Curator matches wrong gene symbols AnnData automatically converts int index to str, therefore matching some Gene synonyms Nov 8, 2024
@felix0097
Copy link
Contributor Author

Yes, but we can't do anything about it. Our expectation unfortunately needs to be that there is a high amount of noise/nonsense in public sources.

Hence: our implementation needs to be as robust as it can be in the presence of noise rather than attempting to "clean ontologies from noise". Cleaning ontologies might be adequate in some particular cases but I don't believe it's adequate here. There might be numbers in other synonym fields of ontologies as well.

Agree those ontologies are often noisy and it's pretty much impossible for us to clean them (that would probably be a huge community effort). However, I would say that it can be dangerous to just plainly trust them then, e.g. auto-matching to ontology terms without having a human in the loop or at least without giving the option to check them manually (again this is very hard if you have 40k genes to check). I only realized it in that case, cause I was expecting no matches.

Some odd matches, might lead to some weird behavior upstream, which is pretty much impossible to debug or at least very hard.

@falexwolf
Copy link
Member

falexwolf commented Nov 8, 2024

Proposed solution: Why does .validate() validate against synonyms? I think it shouldn't unless the user explicitly asks for it (even though I think that even then this is dangerous; I'd rather not make this possible). Hence, how about removing it and we'll no longer have this problem. I remember having multiple discussions with @sunnyosun about this but can't recall why we would consider a synonym "valid".

Background notes:

Indexing into an AnnData object can be performed by relative position with numeric indices (like pandas’ iloc()), or by labels (like loc()). To avoid ambiguity with numeric indexing into observations or variables, indexes of the AnnData object are converted to strings by the constructor.

Darn, I forgot about this. For many cases, the magical [] that relies on casting any index to str makes things easier for users. But yeah, one should have not done the casting and rather added .iloc for those cases in which people pass integer indexes. Isaac was very unhappy about it.

easier when AnnData switches to UUIDs

Interesting. My last discussion with Isaac about this dates 3 years back I guess. Back then I'd have just removed the casting and added .iloc. But well.

@sunnyosun
Copy link
Member

sunnyosun commented Nov 8, 2024

These synonyms are not noise, and I believe ensembl is one of the most curated ontologies as everyone uses them. Also, other ontology sources we use are reasonably well-curated.

For instance: 225 is an actual synonym of EGR1:

The current Curator does require a human in the loop and it's not possible to .save_artifact without being validated, so I don't think there's a danger as long as users aren't blindly running .add_new_from().

@sunnyosun
Copy link
Member

sunnyosun commented Nov 8, 2024

Proposed solution: Why does .validate() validate against synonyms?

Because this is a high-level function meant to require minimal work from users, if we ditch synonym mapping, this curation process will get longer and require more work.

We had no synonyms mapping at the beginning and at some point changed it because the scrna guide was too long/complicated.

@falexwolf
Copy link
Member

Ok, so, we're back to the big box that clarifies what "validated" means. We need this box and you should add it ASAP @sunnyosun @Zethson to this page: https://docs.lamin.ai/curate

I'm then almost certain that most users and we as data engineers don't want to consider a synonym to be a validated value.

This is the whole point of .standardize(). If you don't call .standardize(), your values will not validate when they contain synonyms.

@falexwolf
Copy link
Member

We had no synonyms mapping at the beginning and at some point changed it because the scrna guide was too long/complicated.

I think we need a solution that ascertains a meaningful definition of what a "validated dataset" is. As I said, I'm almost entirely sure that a validated dataset should not contain synonyms of anything.

In the second step we need to make the curation process bearable and concise. Calling .standardize() under the hood is something one can consider, but also seems somewhat dangerous.

@falexwolf
Copy link
Member

We had no synonyms mapping at the beginning and at some point changed it because the scrna guide was too long/complicated.

I wasn't aware of this change.

As it looks to me right now, this is a great danger for the integrity of the lakehouse and should be reverted ASAP.

Apologies if I'm misunderstanding things.

@falexwolf
Copy link
Member

falexwolf commented Nov 8, 2024

We already have an "Example" box here: https://docs.lamin.ai/curate

image

Imagine the "Experiment 1" label has a synonym "Experiment_1".

Do we want to consider a "Experiment_1" valid value under ULabel.name? I don't think so.

Of course we can call it valid under ULabel.synonyms but that's not what people should be using for curating data. To Curator classes, we pass constraints along the lines of Gene.ensembl_gene_id and not constraints that involve synonyms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants