Add a list of tokens that are always indexed #288

fstachura · 2024-06-18T13:56:38Z

Currently, the list consists of arbitrarily selected tokens from c_common_reswords list present in GCC source code.

https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/c-family/c-common.cc;h=0341c44a2cd99771c3eeb44878d3df7a9ae816ea;hb=HEAD#l385

Needs testing and research about gcc/clang extensions.

Currently, the list consists of arbitrarily selected tokens from c_common_reswords list present in GCC source code. https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/c-family/c-common.cc;h=0341c44a2cd99771c3eeb44878d3df7a9ae816ea;hb=HEAD#l385

tleb · 2024-06-18T17:23:03Z

This does not contain builtins from here, eg __builtin_unreachable.
This does not contain types such as __int128. I took this randomly from the GCC compiler extensions doc.
Seeing a 100 elements array, I'm wondering if it can make things slow. This gets called for all tokens outputted from tokenizers that do not match a definition, which can be a lot.
As you noted, this might be missing some GCC/Clang specific extensions.
We might need to handle other languages.
We really ought to have a way to compare Elixir databases. Here we move in the blind when making changes to the indexing code. I'm working on my spare time on a rework of update.py that is much faster, I'll be sending a script to compare databases with it. The goal of this script will be to cover two cases:
- When we make changes to the indexing that should not affect database output: we want to be able to compare databases and see no difference.
- When we make changes to the indexing that change the database: we want an overview of the database diffs.

tleb · 2024-06-18T17:23:47Z

update.py

@@ -26,7 +26,7 @@
 from threading import Thread, Lock, Event, Condition

 import lib
-from lib import script, scriptLines
+from lib import script, scriptLines, always_indexed_tokens


Maybe use lib.always_indexed_tokens instead and avoid a manual import.

tleb · 2024-06-18T17:25:10Z

update.py

-                            # We only index CONFIG_??? in makefiles
+                        ref_allowed = db.defs.exists(tok) or (tok in always_indexed_tokens)
+                        # We only index CONFIG_??? in makefiles
+                        config_or_not_makefile = tok.startswith(b'CONFIG_') or family != 'M'


Any reason to reverse the conditional order from family != 'M' or tok.startswith(b'CONFIG') to tok.startswith(b'CONFIG') or family != 'M'?

IIRC I called the variable config_or_not_makefile and decided to reverse order to make it match the name. Assuming that expression will be evaluated left to right, old order is probably a bit better, I will fix that.

fstachura · 2024-06-18T18:52:01Z

We really ought to have a way to compare Elixir databases. Here we move in the blind when making changes to the indexing code. I'm working on my spare time on a rework of update.py that is much faster, I'll be sending a script to compare databases with it. The goal of this script will be to cover two cases:

I sometimes compare databases by doing db_dump -p of a database and a diff.

fstachura · 2024-06-18T18:55:56Z

We might need to handle other languages.

Elixir seems to mostly focus on C now. Handling other languages properly will require a bigger rework of the update script. Let's maybe focus on handling more interesting builtins/extensions used in Linux for now and focus on other languages later.

Seeing a 100 elements array, I'm wondering if it can make things slow. This gets called for all tokens outputted from tokenizers that do not match a definition, which can be a lot.

Python set has better lookup time (O(1) average) than a list. I will replace the list with a set.

fstachura · 2024-06-18T19:06:19Z

I'm working on my spare time on a rework of update.py that is much faster

Offtopic, but that's great to hear, I have seen very concerning I/O performance caused by the update script recently. On my VM, default update.py on many tags causes this:

Sure, it's mostly because the VM is massively underpowered (2GB of ram), but I noticed better CPU usage after I added 100MB of cache to each database (self.db.set_cachesize(0, 100000000) before open in BsdDb). Now CPU usage looks like this:

(200MB was even better, but then I accidentally got the script killed by OOM killer. Gonna upgrade to 4GB before I continue, I just need this job to finish first...)

Another thing about the update script, it often does lookups in the definitions and references database. I think having our own cache for definitions would improve speed. And postponing key updates until a file is processed (in update_definitions and references) could probably help too. Apparently Berkeley DB supports batch inserts, but I'm not sure if it's possible to utilize that from the python wrapper unfortunately

tleb · 2024-06-19T08:45:12Z

I sometimes compare databases by doing db_dump -p of a database and a diff.

This is good advice, thanks!

We might need to handle other languages.

Elixir seems to mostly focus on C now. Handling other languages properly will require a bigger rework of the update script. Let's maybe focus on handling more interesting builtins/extensions used in Linux for now and focus on other languages later.

There is a PR to add support for other languages, see #254. This is a useful direction for Elixir so we should be taking it into account even if it is not yet supported. The goal is to keep adding new languages as simple as possible. Ideally it would only be a new family and new implementations in script.sh.

About update.py: I do not see major changes for supporting other languages. See #254 that does none by the way. Maybe there are some steps (like parse-docs) we might want to avoid.

I've addressed update.py performance comments in a new issue #289 to keep things sorted.

fstachura · 2024-06-19T09:24:20Z

The goal is to keep adding new languages as simple as possible. Ideally it would only be a new family and new implementations in script.sh.

Ok, so should I create a dict with a different list for each family?

tleb · 2024-06-19T09:58:30Z

Try adding compiler-provided defs for already supported languages (other languages in the C family, devicetree, Kconfig, Makefiles). For example, for devicetrees, we have:

/dts-v1/
/plugin/
/memreserve/
/bits/
/delete-property/
/delete-node/
/omit-if-no-ref/
/incbin/

Ideally we would index them as /delete-node/ but I guess it would require some big changes to Elixir for it to support idents named like that. Else let's just index them as delete-node, this is fine.

This needs to be looked into for other already-supported languages. I can't think of any (useful to index) compiler-provided idents for Kconfig and Makefiles. But C++ surely has a lot of compiler-provided defs.

Add a list of always indexed prefixes

fstachura · 2024-07-10T10:36:42Z

@tleb Ok, so I started to read through GCC and Clang docs about C extensions and other reserved keywords. Point is: there are many of them, and the list is getting bigger. Not often, but I still doubt we will keep it up to date in the long term. Of course that doesn't mean we shouldn't try, but we could either limit the scope a bit, or think about a future-proof way handle many identifiers at once.

After reading this https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html and this https://clang.llvm.org/docs/LanguageExtensions.html and some GCC source code, I came to the (perhaps obvious) conclusion that there are a lot identifiers that start with __builtin. Same with __sync, __atomic, __cpp and some others. What do you think about handling all identifiers that start with these prefixes?
As I proposed during the meeting, we could even handle everything that contains a double underscore. That could of course result in some useless identifiers being indexed, but would it actually be a problem? Especially if we were to limit that to C.
This of course won't handle everything, but will get rid of a big part of the issue.

Second thing - some standard C functions (quite a long list) don't require headers to be used. See https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
Some have duplicates with suffixes related to different floating point types, see https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/builtins.def;h=f6f3e104f6a8219eff4ac26fda588da808e4b7c9;hb=HEAD#l125
Do you think it makes sense to handle these too?

I pushed a commit (not tested too well yet) with the proposal - a list of suffixes and more interesting special identifiers. It's still missing some stuff, for example I would move the list to a different file. Extended attributes and most per-architecture stuff is missing too. (And again, does it make sense to handle that?)

I couldn't find anything interesting to index in Kconfig and GNU Make docs:
https://www.kernel.org/doc/html/next/kbuild/kconfig-language.html#kconfig-syntax

https://www.gnu.org/software/make/manual/make.html#Quick-Reference
https://www.gnu.org/software/make/manual/make.html#Special-Targets
https://www.gnu.org/software/make/manual/make.html#Name-Index

tpetazzoni · 2024-07-10T11:50:13Z

Sorry for jumping in the middle of the conversation here, but what is the motivation behind indexing those identifiers? To be Elixir is about navigating in the Linux kernel code (or U-Boot, or etc.). Why would I search for __atomic_store or __atomic_exchange or _Generic ? The implementation is not in Linux, U-Boot, etc. So why searching for those identifiers matters?

fstachura · 2024-07-10T12:03:16Z

It seems that some built-in identifiers without definitions are interesting to some users. For example #237

Some C extensions are used in the kernel but are not indexed because they have no definitions (that are recognized by ctags at least). See https://elixir.bootlin.com/linux/latest/source/include/linux/overflow.h#L355
I think searching for some of these extensions could be useful for some people too.

tleb · 2024-07-10T12:23:32Z

@tpetazzoni I see three concrete reasons:

We index identifiers, it sounds intuitive for people to expect that we index those provided by the compiler.
See how projects use specific compiler-provided identifiers. The original request is about _Generic and makes sense: does the kernel use it? Can it be used in driver code or is it reserved to special cases?
Find abstractions on top of compiler-provided identifiers. Say you want to use __builtin_va_* stuff, you search for __builtin_va_start and find the include/linux/stdarg.h header.

tpetazzoni · 2024-07-10T12:34:41Z

I see but I'm not extremely convinced that this is really that important, compared to many other things we have planned for Elixir.

Add a list of tokens that are always indexed

63ac06f

Currently, the list consists of arbitrarily selected tokens from c_common_reswords list present in GCC source code. https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/c-family/c-common.cc;h=0341c44a2cd99771c3eeb44878d3df7a9ae816ea;hb=HEAD#l385

tleb reviewed Jun 18, 2024

View reviewed changes

tleb mentioned this pull request Jun 19, 2024

Improve indexing performance (update.py) #289

Open

fstachura added 2 commits July 8, 2024 15:48

Add more tokens to always indexed list

daa92ee

Add a list of always indexed prefixes

Add more always indexed tokens

2125c63

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a list of tokens that are always indexed #288

Add a list of tokens that are always indexed #288

fstachura commented Jun 18, 2024

tleb commented Jun 18, 2024 •

edited by fstachura

Loading

tleb Jun 18, 2024

tleb Jun 18, 2024

fstachura Jun 18, 2024 •

edited

Loading

fstachura commented Jun 18, 2024

fstachura commented Jun 18, 2024 •

edited

Loading

fstachura commented Jun 18, 2024

tleb commented Jun 19, 2024 •

edited

Loading

fstachura commented Jun 19, 2024

tleb commented Jun 19, 2024

fstachura commented Jul 10, 2024

tpetazzoni commented Jul 10, 2024

fstachura commented Jul 10, 2024 •

edited

Loading

tleb commented Jul 10, 2024

tpetazzoni commented Jul 10, 2024

Add a list of tokens that are always indexed #288

Are you sure you want to change the base?

Add a list of tokens that are always indexed #288

Conversation

fstachura commented Jun 18, 2024

tleb commented Jun 18, 2024 • edited by fstachura Loading

tleb Jun 18, 2024

Choose a reason for hiding this comment

tleb Jun 18, 2024

Choose a reason for hiding this comment

fstachura Jun 18, 2024 • edited Loading

Choose a reason for hiding this comment

fstachura commented Jun 18, 2024

fstachura commented Jun 18, 2024 • edited Loading

fstachura commented Jun 18, 2024

tleb commented Jun 19, 2024 • edited Loading

fstachura commented Jun 19, 2024

tleb commented Jun 19, 2024

fstachura commented Jul 10, 2024

tpetazzoni commented Jul 10, 2024

fstachura commented Jul 10, 2024 • edited Loading

tleb commented Jul 10, 2024

tpetazzoni commented Jul 10, 2024

tleb commented Jun 18, 2024 •

edited by fstachura

Loading

fstachura Jun 18, 2024 •

edited

Loading

fstachura commented Jun 18, 2024 •

edited

Loading

tleb commented Jun 19, 2024 •

edited

Loading

fstachura commented Jul 10, 2024 •

edited

Loading