Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a list of tokens that are always indexed #288

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

fstachura
Copy link
Collaborator

Currently, the list consists of arbitrarily selected tokens from c_common_reswords list present in GCC source code.

https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/c-family/c-common.cc;h=0341c44a2cd99771c3eeb44878d3df7a9ae816ea;hb=HEAD#l385

Needs testing and research about gcc/clang extensions.

Currently, the list consists of arbitrarily selected tokens from
c_common_reswords list present in GCC source code.

https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/c-family/c-common.cc;h=0341c44a2cd99771c3eeb44878d3df7a9ae816ea;hb=HEAD#l385
@tleb
Copy link
Member

tleb commented Jun 18, 2024

  • This does not contain builtins from here, eg __builtin_unreachable.
  • This does not contain types such as __int128. I took this randomly from the GCC compiler extensions doc.
  • Seeing a 100 elements array, I'm wondering if it can make things slow. This gets called for all tokens outputted from tokenizers that do not match a definition, which can be a lot.
  • As you noted, this might be missing some GCC/Clang specific extensions.
  • We might need to handle other languages.
  • We really ought to have a way to compare Elixir databases. Here we move in the blind when making changes to the indexing code. I'm working on my spare time on a rework of update.py that is much faster, I'll be sending a script to compare databases with it. The goal of this script will be to cover two cases:
    • When we make changes to the indexing that should not affect database output: we want to be able to compare databases and see no difference.
    • When we make changes to the indexing that change the database: we want an overview of the database diffs.

update.py Outdated
@@ -26,7 +26,7 @@
from threading import Thread, Lock, Event, Condition

import lib
from lib import script, scriptLines
from lib import script, scriptLines, always_indexed_tokens
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use lib.always_indexed_tokens instead and avoid a manual import.

update.py Outdated
# We only index CONFIG_??? in makefiles
ref_allowed = db.defs.exists(tok) or (tok in always_indexed_tokens)
# We only index CONFIG_??? in makefiles
config_or_not_makefile = tok.startswith(b'CONFIG_') or family != 'M'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to reverse the conditional order from family != 'M' or tok.startswith(b'CONFIG') to tok.startswith(b'CONFIG') or family != 'M'?

Copy link
Collaborator Author

@fstachura fstachura Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC I called the variable config_or_not_makefile and decided to reverse order to make it match the name. Assuming that expression will be evaluated left to right, old order is probably a bit better, I will fix that.

@fstachura
Copy link
Collaborator Author

We really ought to have a way to compare Elixir databases. Here we move in the blind when making changes to the indexing code. I'm working on my spare time on a rework of update.py that is much faster, I'll be sending a script to compare databases with it. The goal of this script will be to cover two cases:

I sometimes compare databases by doing db_dump -p of a database and a diff.

@fstachura
Copy link
Collaborator Author

fstachura commented Jun 18, 2024

We might need to handle other languages.

Elixir seems to mostly focus on C now. Handling other languages properly will require a bigger rework of the update script. Let's maybe focus on handling more interesting builtins/extensions used in Linux for now and focus on other languages later.

Seeing a 100 elements array, I'm wondering if it can make things slow. This gets called for all tokens outputted from tokenizers that do not match a definition, which can be a lot.

Python set has better lookup time (O(1) average) than a list. I will replace the list with a set.

@fstachura
Copy link
Collaborator Author

I'm working on my spare time on a rework of update.py that is much faster

Offtopic, but that's great to hear, I have seen very concerning I/O performance caused by the update script recently. On my VM, default update.py on many tags causes this:

image

Sure, it's mostly because the VM is massively underpowered (2GB of ram), but I noticed better CPU usage after I added 100MB of cache to each database (self.db.set_cachesize(0, 100000000) before open in BsdDb). Now CPU usage looks like this:

image

(200MB was even better, but then I accidentally got the script killed by OOM killer. Gonna upgrade to 4GB before I continue, I just need this job to finish first...)

Another thing about the update script, it often does lookups in the definitions and references database. I think having our own cache for definitions would improve speed. And postponing key updates until a file is processed (in update_definitions and references) could probably help too. Apparently Berkeley DB supports batch inserts, but I'm not sure if it's possible to utilize that from the python wrapper unfortunately

@tleb
Copy link
Member

tleb commented Jun 19, 2024

I sometimes compare databases by doing db_dump -p of a database and a diff.

This is good advice, thanks!

We might need to handle other languages.

Elixir seems to mostly focus on C now. Handling other languages properly will require a bigger rework of the update script. Let's maybe focus on handling more interesting builtins/extensions used in Linux for now and focus on other languages later.

There is a PR to add support for other languages, see #254. This is a useful direction for Elixir so we should be taking it into account even if it is not yet supported. The goal is to keep adding new languages as simple as possible. Ideally it would only be a new family and new implementations in script.sh.

About update.py: I do not see major changes for supporting other languages. See #254 that does none by the way. Maybe there are some steps (like parse-docs) we might want to avoid.


I've addressed update.py performance comments in a new issue #289 to keep things sorted.

@fstachura
Copy link
Collaborator Author

The goal is to keep adding new languages as simple as possible. Ideally it would only be a new family and new implementations in script.sh.

Ok, so should I create a dict with a different list for each family?

@tleb
Copy link
Member

tleb commented Jun 19, 2024

Try adding compiler-provided defs for already supported languages (other languages in the C family, devicetree, Kconfig, Makefiles). For example, for devicetrees, we have:

/dts-v1/
/plugin/
/memreserve/
/bits/
/delete-property/
/delete-node/
/omit-if-no-ref/
/incbin/

Ideally we would index them as /delete-node/ but I guess it would require some big changes to Elixir for it to support idents named like that. Else let's just index them as delete-node, this is fine.

This needs to be looked into for other already-supported languages. I can't think of any (useful to index) compiler-provided idents for Kconfig and Makefiles. But C++ surely has a lot of compiler-provided defs.

@fstachura
Copy link
Collaborator Author

@tleb Ok, so I started to read through GCC and Clang docs about C extensions and other reserved keywords. Point is: there are many of them, and the list is getting bigger. Not often, but I still doubt we will keep it up to date in the long term. Of course that doesn't mean we shouldn't try, but we could either limit the scope a bit, or think about a future-proof way handle many identifiers at once.

After reading this https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html and this https://clang.llvm.org/docs/LanguageExtensions.html and some GCC source code, I came to the (perhaps obvious) conclusion that there are a lot identifiers that start with __builtin. Same with __sync, __atomic, __cpp and some others. What do you think about handling all identifiers that start with these prefixes?
As I proposed during the meeting, we could even handle everything that contains a double underscore. That could of course result in some useless identifiers being indexed, but would it actually be a problem? Especially if we were to limit that to C.
This of course won't handle everything, but will get rid of a big part of the issue.

Second thing - some standard C functions (quite a long list) don't require headers to be used. See https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
Some have duplicates with suffixes related to different floating point types, see https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/builtins.def;h=f6f3e104f6a8219eff4ac26fda588da808e4b7c9;hb=HEAD#l125
Do you think it makes sense to handle these too?

I pushed a commit (not tested too well yet) with the proposal - a list of suffixes and more interesting special identifiers. It's still missing some stuff, for example I would move the list to a different file. Extended attributes and most per-architecture stuff is missing too. (And again, does it make sense to handle that?)

I couldn't find anything interesting to index in Kconfig and GNU Make docs:
https://www.kernel.org/doc/html/next/kbuild/kconfig-language.html#kconfig-syntax

https://www.gnu.org/software/make/manual/make.html#Quick-Reference
https://www.gnu.org/software/make/manual/make.html#Special-Targets
https://www.gnu.org/software/make/manual/make.html#Name-Index

@tpetazzoni
Copy link

Sorry for jumping in the middle of the conversation here, but what is the motivation behind indexing those identifiers? To be Elixir is about navigating in the Linux kernel code (or U-Boot, or etc.). Why would I search for __atomic_store or __atomic_exchange or _Generic ? The implementation is not in Linux, U-Boot, etc. So why searching for those identifiers matters?

@fstachura
Copy link
Collaborator Author

fstachura commented Jul 10, 2024

It seems that some built-in identifiers without definitions are interesting to some users. For example #237

Some C extensions are used in the kernel but are not indexed because they have no definitions (that are recognized by ctags at least). See https://elixir.bootlin.com/linux/latest/source/include/linux/overflow.h#L355
I think searching for some of these extensions could be useful for some people too.

@tleb
Copy link
Member

tleb commented Jul 10, 2024

@tpetazzoni I see three concrete reasons:

  • We index identifiers, it sounds intuitive for people to expect that we index those provided by the compiler.
  • See how projects use specific compiler-provided identifiers. The original request is about _Generic and makes sense: does the kernel use it? Can it be used in driver code or is it reserved to special cases?
  • Find abstractions on top of compiler-provided identifiers. Say you want to use __builtin_va_* stuff, you search for __builtin_va_start and find the include/linux/stdarg.h header.

@tpetazzoni
Copy link

I see but I'm not extremely convinced that this is really that important, compared to many other things we have planned for Elixir.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants