From b469406e85b91a947c266ec84835ab81eaa6150e Mon Sep 17 00:00:00 2001 From: Varun Gandhi Date: Thu, 24 Oct 2024 21:14:56 +0800 Subject: [PATCH] docs: Add DESIGN.md (#289) --- DESIGN.md | 140 +++++++++++++++++++++++++++++++++++++++++ Readme.md => README.md | 8 +-- 2 files changed, 144 insertions(+), 4 deletions(-) create mode 100644 DESIGN.md rename Readme.md => README.md (92%) diff --git a/DESIGN.md b/DESIGN.md new file mode 100644 index 00000000..99f54df4 --- /dev/null +++ b/DESIGN.md @@ -0,0 +1,140 @@ +# Design rationale for SCIP + +Sourcegraph supports viewing and navigating code for +many different programming languages. +This is similar to an IDE, except for the actual "editor" part. + +The design of SCIP is based on this primary motivating use case, +as well as for fixing the pain points we encountered with using LSIF.[^1] + +SCIP is meant to be a _transmission_ format for sending +data from some producers to some consumers +-- it is not meant as a _storage_ format for querying. +Ideally, producers should be able to directly output SCIP +instead of going through an intermediate format +and then converting to SCIP for transmission. + +[^1]: + Sourcegraph historically supported LSIF uploads + as well as maintained our own LSIF indexers, + but we ran into issues of development velocity, + debugging, as well as indexer performance bottlenecks. + +## Goals + +### Primary goals + +- Support code navigation at the fidelity of state-of-the-art IDEs. + - Why: We want people to have an excellent experience navigating + their code inside Sourcegraph itself. +- Ease of writing indexers (i.e. producers). + - Adding cross-repo navigation support should be easy. + - Why: Scaling for Sourcegraph customers using code spread across many repos, + as well as supporting code navigation for package ecosystems. + - Adding file-level incrementality should be easy. + - Why: Scaling for large monorepos. + - Making the indexer parallel should be easy + - Why: Scaling for large monorepos. + - Writing indexers in different languages should be feasible + - Why: Generally, tooling for a language tends to be implemented + in the same language or an adjacent one; it would be impractical + to expect indexers to settle on a common implementation language. +- Robustness against indexer bugs: Incorrect code nav data for a certain entity + should have a limited blast radius. +- Ease of debugging. + +## Non-goals + +- Support use cases involving code modifications. + - Why: Sourcegraph's code search and navigation has historically + focused on read-only use cases. Adding support for code modifications + introduces more complexity. + While the same data can be used for tools for refactoring tools + and IDE tooling, supporting those is not the focus of SCIP. +- Ease of writing consumers. + - Why: We expect the number of SCIP producers to be much higher + than the number of consumers, + so it makes sense to optimize for producers. +- Be as compact as possible in uncompressed form.[^2] + - Why: Modern general-purpose compression formats like gzip + and zstd are already very good in terms of both compression + speed and compression ratio.[^3] +- Support efficient code navigation by itself. + + - Why: Code navigation fundamentally requires some form of bidirectional + lookup which is best served by a query engine. + + For example, finding subclasses and superclasses are dual operations; + supporting both across different indexes (for cross-repository navigation) + requires some way of connecting the data together anyways. + However, if the consumer is capable of supporting that, + then recording bidirectional links in index data is not useful. + +[^2]: The only compromise on this decision is in the choice of representation of source ranges, where benchmarks showed that using a variable length integer array encoding provided significant savings compared to a message-based encoding, even after compression. + +[^3]: In practice, SCIP data tends to be have a compression ratio around in the range of 10%-20%, as modern compressors are very good at de-duplicating away the repetitive textual symbols. + +## Core design decisions + +### Using Protobuf for the schema + +- Relatively compact binary format, reducing I/O overhead. +- Protobuf supports easy code generation. +- Many languages have Protobuf code generators. +- TLV format enables streaming reads and writes + as well as merging by concatenation. +- Rules for maintaining forward and backward compatibility + are easily understood. + +### Using strings for IDs + +Hash tables are a core data type used in compilers and +hence are likely to be useful in indexers generally. + +String types in mainstream languages support equality and hashing. +where other objects may not be. + +### Avoid direct encoding of graphs + +One very tempting design for code navigation data is to +think of all semantic entities as nodes and relationships between +entities as edges, and to simply record ALL data +using an adjacency list based graph representation. + +This is conceptually appealing, +but is not desirable for a few reasons: + +- It encourages a wholesale approach to writing + indexers as it involves merging all the data together. + Such indexers are less likely to be able + to easily support parallelism. +- This potentially requires keeping a lot of data of memory + at indexing time. Ideally, an indexer should be able + to load parts of the codebase, + append index data for that part to an open file, + and then move on to the next part having cleared all memory + being used from the previous iteration. +- It potentially requires keeping a lot of data in memory + on the consumer side. + +Instead, the approach to using documents and arrays +helps colocate relevant data and naturally allows for streaming +if the underlying data format allows streaming. + +### Avoiding integer IDs + +This includes avoiding structures like a symbol table +mapping string IDs to integers and mostly using the integer +IDs in raw data. + +As far as we know, the integer IDs in LSIF are primarily present as +an ad-hoc compression scheme due to the verbosity of JSON +and LSIF's graph-based encoding scheme. + +Avoiding integer IDs helps with limiting the blast radius +of indexer bugs. With LSIF, we've had off-by-one bugs in indexers +cause code navigation to fail repo-wide. + +This also helps with debugging, as the raw data itself +can be inspected much more easily without needing +a lot of surrounding context. diff --git a/Readme.md b/README.md similarity index 92% rename from Readme.md rename to README.md index d9abfe31..672a8668 100644 --- a/Readme.md +++ b/README.md @@ -7,7 +7,7 @@ such as Go to definition, Find references, and Find implementations. This repository includes: -- A [protobuf schema for SCIP](./scip.proto). +- A [Protobuf schema for SCIP](./scip.proto). - Rich Go and Rust bindings for SCIP: These include many utility functions to help build tooling on top of SCIP. - Auto-generated bindings for TypeScript and Haskell. @@ -15,7 +15,7 @@ This repository includes: a breeze to work with. If you're interested in better understanding the motivation behind SCIP, -check out the [announcement blog post](https://about.sourcegraph.com/blog/announcing-scip). +check out the [announcement blog post](https://about.sourcegraph.com/blog/announcing-scip) and the [design doc](./DESIGN.md). If you're interested in writing a new indexer that emits SCIP, check out our documentation on @@ -24,8 +24,8 @@ Also, check out the [Debugging section][] in the Development docs. If you're interested in consuming SCIP data, you can either use one of the [provided language bindings](https://github.com/sourcegraph/scip/tree/main/bindings), -or generate code for the [SCIP protobuf schema](./scip.proto) -using the protobuf toolchain for your language ecosystem. +or generate code for the [SCIP Protobuf schema](./scip.proto) +using the Protobuf toolchain for your language ecosystem. Also, check out the [Debugging section][] in the Development docs. [debugging section]: ./Development.md#debugging