AnnData automatically converts int index to str, therefore matching some Gene synonyms #2124
Comments
Okay so unfortunately that's a synonyms issue:
@felix0097 We can consider dropping those synonyms because I really don't know who'd ever want to use numbers as gene names.
Why not deal with this based on type comparison? To rephrase: why don't we just stop casting integers to strings? It seems dangerous and unnecessary.
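To make the type-comparison idea concrete, here is a minimal sketch in plain Python. It is not the lamindb or bionty implementation: the `SYNONYMS` table, the `match_synonym` helper, and the gene symbols in it are all hypothetical and only illustrate how an integer stops matching once it is no longer cast to `str`.

```python
# Hypothetical synonym table; symbols and aliases are made up for illustration.
SYNONYMS = {
    "GENE_A": {"alias_a", "3"},  # a numeric-looking synonym, as some ontologies contain
    "GENE_B": {"alias_b"},
}


def match_synonym(value, synonyms=SYNONYMS):
    """Return the symbol whose synonyms contain `value`, or None.

    Only genuine strings are looked up; integers are never cast to str.
    """
    if not isinstance(value, str):
        return None
    for symbol, aliases in synonyms.items():
        if value in aliases:
            return symbol
    return None


index = [0, 3, "alias_b"]

# Type-aware lookup: the integer 3 does not match the synonym "3".
matches_typed = [match_synonym(v) for v in index]

# Casting first reproduces the problem: "3" now matches GENE_A.
matches_cast = [match_synonym(str(v)) for v in index]
```

A check like this only helps while the original dtype is still available; once an index has been converted to strings, the integer `3` and the string `"3"` can no longer be told apart.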
Ah yeah, that's a fair point. I'll have a look. Thanks! In any case, it's weird to have numbers in there in the first place.
Great! I hope that's really simple. It's a fundamental issue with the code that I think should be fixed & tested soon.
Yes, but we can't do anything about it. Unfortunately, we have to expect a lot of noise/nonsense in public sources. Hence: our implementation needs to be as robust as it can be in the presence of noise rather than attempting to "clean ontologies from noise". Cleaning ontologies might be adequate in some particular cases, but I don't believe it's adequate here. There might be numbers in other synonym fields of ontologies as well.
Here's the issue:
gives us a str (categorical) index. This is documented behavior (https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.html):
I can perform type checks and see whether the input is "number like", but it's of course not perfect...
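A "number like" check could look roughly like the sketch below. The helper name `is_number_like` is hypothetical and this is only one possible heuristic, not what lamindb does; it also shows why such a check is imperfect.

```python
def is_number_like(value) -> bool:
    """Heuristic: True for ints/floats and for strings that parse as a float."""
    if isinstance(value, bool):
        # bool is a subclass of int, but True/False are not gene-like numbers.
        return False
    if isinstance(value, (int, float)):
        return True
    if isinstance(value, str):
        text = value.strip()
        if not text:
            return False
        try:
            float(text)
        except ValueError:
            return False
        return True
    return False


# Mixed, made-up index values for illustration.
index = ["0", "1", "EGR1", "42", "SCARNA10"]
flagged = [v for v in index if is_number_like(v)]
```

The heuristic is deliberately crude: `float()` also accepts strings such as `"nan"` or `"1e5"`, so an identifier that happens to parse as a float would be flagged as well.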
I talked a lot with Phil about this. A perfect technical solution seems out of reach but maybe we can improve the situation at least. Ideas:
Agreed, those ontologies are often noisy and it's pretty much impossible for us to clean them (that would probably be a huge community effort). However, I would say that it can be dangerous to just plainly trust them, e.g. auto-matching to ontology terms without having a human in the loop, or at least without giving the option to check the matches manually (again, this is very hard if you have 40k genes to check). I only noticed it in this case because I was expecting no matches. Some odd matches might lead to weird behavior further along the pipeline that is very hard, if not impossible, to debug.
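One way to keep a human in the loop is to report synonym hits separately instead of applying them silently. The sketch below is an assumption about how that could look, not the Curator's actual behavior; `MatchReport`, `review_matches`, and all symbols and aliases are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MatchReport:
    exact: list = field(default_factory=list)  # values that are official symbols
    synonym_candidates: dict = field(default_factory=dict)  # value -> proposed symbol
    unmatched: list = field(default_factory=list)  # values with no match at all


def review_matches(values, symbols, synonyms):
    """Split values into exact matches, synonym candidates, and unmatched values.

    Synonym hits are only proposed for review, never applied automatically.
    """
    alias_to_symbol = {
        alias: symbol for symbol, aliases in synonyms.items() for alias in aliases
    }
    report = MatchReport()
    for value in values:
        if value in symbols:
            report.exact.append(value)
        elif value in alias_to_symbol:
            report.synonym_candidates[value] = alias_to_symbol[value]
        else:
            report.unmatched.append(value)
    return report


# Made-up data for illustration only.
symbols = {"GENE_A", "GENE_B"}
synonyms = {"GENE_A": {"3"}, "GENE_B": {"alias_b"}}
report = review_matches(["GENE_A", "3", "alias_b", "17"], symbols, synonyms)
```

With 40k genes the candidate list is usually much shorter than the full index, so reviewing only `report.synonym_candidates` is more realistic than checking every gene by hand.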
Proposed solution: Why does Background notes:
Darn, I forgot about this. For many cases, the magical
Interesting. My last discussion with Isaac about this dates back about 3 years, I guess. Back then I'd have just removed the casting and added
These synonyms are not noise, and I believe ensembl is one of the most curated ontologies as everyone uses them. Also, other ontology sources we use are reasonably well-curated. For instance:
The current Curator does require a human in the loop and it's not possible to
Because this is a high-level function meant to require minimal work from users, if we ditch synonym mapping, this curation process will get longer and require more work. We had no synonym mapping at the beginning and at some point changed it because the scrna guide was too long/complicated.
Ok, so, we're back to the big box that clarifies what "validated" means. We need this box and you should add it ASAP @sunnyosun @Zethson to this page: https://docs.lamin.ai/curate
I'm then almost certain that most users and we as data engineers don't want to consider a This is the whole point of
I think we need a solution that ascertains a meaningful definition of what a "validated dataset" is. As I said, I'm almost entirely sure that a validated dataset should not contain synonyms of anything. In the second step we need to make the curation process bearable and concise. Calling
I wasn't aware of this change. As it looks to me right now, this is a great danger for the integrity of the lakehouse and should be reverted ASAP. Apologies if I'm misunderstanding things.
We already have an "Example" box here: https://docs.lamin.ai/curate
Imagine the Do we want to consider a Of course we can call it valid under
Report
ln.Curator.from_anndata matches wrong gene symbols. For an AnnData object whose var index consists of plain integers, the Curator matches SCARNA10 and EGR1, which is odd because the gene names are just integers in this case.
That's the output of curator.validate:
Here's an example of how to reproduce the bug:
Version information
anndata 0.10.9
bionty 0.52.0
lamindb 0.76.15
numpy 2.1.2
pandas 2.2.3
session_info 1.0.0
annotated_types 0.7.0
anyio NA
appdirs 1.4.4
appnope 0.1.4
arrow 1.3.0
asgiref 3.8.1
asttokens NA
attr 24.2.0
attrs 24.2.0
babel 2.16.0
botocore 1.35.51
certifi 2024.08.30
chardet 5.2.0
charset_normalizer 3.4.0
click 8.1.7
comm 0.2.2
cython_runtime NA
dateutil 2.9.0.post0
debugpy 1.8.7
decorator 5.1.1
deprecation 2.1.0
dj_database_url NA
django 5.1.2
dotenv NA
exceptiongroup 1.2.2
executing 2.1.0
fastjsonschema NA
fastobo 0.12.3
filelock 3.16.1
fqdn NA
fsspec 2024.10.0
gotrue 2.8.1
graphlib NA
h11 0.14.0
h2 4.1.0
h5py 3.12.1
hpack 4.0.0
httpcore 1.0.6
httpx 0.27.2
hyperframe 6.0.1
idna 3.10
ipykernel 6.29.5
isoduration NA
jedi 0.19.1
jinja2 3.1.4
jmespath 1.0.1
json5 0.9.25
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema_specifications NA
jupyter_events 0.10.0
jupyter_server 2.14.2
jupyterlab_server 2.27.3
lamin_utils 0.13.7
lamindb_setup 0.80.0
lnschema_core 0.76.1
markupsafe 3.0.2
natsort 8.4.0
nbformat 5.10.4
overrides NA
packaging 24.1
parso 0.8.4
pexpect 4.9.0
platformdirs 4.3.6
postgrest 0.13.2
prometheus_client NA
prompt_toolkit 3.0.48
pronto 2.5.5
psutil 6.1.0
psycopg2 2.9.10 (dt dec pq3 ext lo64)
ptyprocess 0.7.0
pure_eval 0.2.3
pyarrow 18.0.0
pydantic 2.9.2
pydantic_core 2.23.4
pydantic_settings 2.6.0
pydev_ipython NA
pydevconsole NA
pydevd 3.1.0
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.18.0
pythonjsonlogger NA
pytz 2024.2
realtime 1.0.6
referencing NA
requests 2.32.3
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
rich NA
rpds NA
scipy 1.14.1
send2trash NA
six 1.16.0
sniffio 1.3.1
sqlparse 0.5.1
stack_data 0.6.3
storage3 0.5.5
strenum 0.4.15
supabase 2.2.1
supafunc NA
tornado 6.4.1
traitlets 5.14.3
typing_extensions NA
upath 0.2.5
uri_template NA
urllib3 2.2.3
wcwidth 0.2.13
webcolors 24.8.0
websocket 1.8.0
websockets 12.0
yaml 6.0.2
zmq 26.2.0
zoneinfo NA
IPython 8.29.0
jupyter_client 8.6.3
jupyter_core 5.7.2
jupyterlab 4.2.5
Python 3.10.15 (main, Oct 3 2024, 02:33:33) [Clang 14.0.6 ]
macOS-10.16-x86_64-i386-64bit
Session information updated at 2024-10-31 10:38