Skip to content

Commit

Permalink
docs: Add DESIGN.md (#289)
Browse files Browse the repository at this point in the history
  • Loading branch information
varungandhi-src authored Oct 24, 2024
1 parent 727d53e commit b469406
Show file tree
Hide file tree
Showing 2 changed files with 144 additions and 4 deletions.
140 changes: 140 additions & 0 deletions DESIGN.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,140 @@
# Design rationale for SCIP

Sourcegraph supports viewing and navigating code for
many different programming languages.
This is similar to an IDE, except for the actual "editor" part.

The design of SCIP is based on this primary motivating use case,
as well as for fixing the pain points we encountered with using LSIF.[^1]

SCIP is meant to be a _transmission_ format for sending
data from some producers to some consumers
-- it is not meant as a _storage_ format for querying.
Ideally, producers should be able to directly output SCIP
instead of going through an intermediate format
and then converting to SCIP for transmission.

[^1]:
Sourcegraph historically supported LSIF uploads
as well as maintained our own LSIF indexers,
but we ran into issues of development velocity,
debugging, as well as indexer performance bottlenecks.

## Goals

### Primary goals

- Support code navigation at the fidelity of state-of-the-art IDEs.
- Why: We want people to have an excellent experience navigating
their code inside Sourcegraph itself.
- Ease of writing indexers (i.e. producers).
- Adding cross-repo navigation support should be easy.
- Why: Scaling for Sourcegraph customers using code spread across many repos,
as well as supporting code navigation for package ecosystems.
- Adding file-level incrementality should be easy.
- Why: Scaling for large monorepos.
- Making the indexer parallel should be easy
- Why: Scaling for large monorepos.
- Writing indexers in different languages should be feasible
- Why: Generally, tooling for a language tends to be implemented
in the same language or an adjacent one; it would be impractical
to expect indexers to settle on a common implementation language.
- Robustness against indexer bugs: Incorrect code nav data for a certain entity
should have a limited blast radius.
- Ease of debugging.

## Non-goals

- Support use cases involving code modifications.
- Why: Sourcegraph's code search and navigation has historically
focused on read-only use cases. Adding support for code modifications
introduces more complexity.
While the same data can be used for tools for refactoring tools
and IDE tooling, supporting those is not the focus of SCIP.
- Ease of writing consumers.
- Why: We expect the number of SCIP producers to be much higher
than the number of consumers,
so it makes sense to optimize for producers.
- Be as compact as possible in uncompressed form.[^2]
- Why: Modern general-purpose compression formats like gzip
and zstd are already very good in terms of both compression
speed and compression ratio.[^3]
- Support efficient code navigation by itself.

- Why: Code navigation fundamentally requires some form of bidirectional
lookup which is best served by a query engine.

For example, finding subclasses and superclasses are dual operations;
supporting both across different indexes (for cross-repository navigation)
requires some way of connecting the data together anyways.
However, if the consumer is capable of supporting that,
then recording bidirectional links in index data is not useful.

[^2]: The only compromise on this decision is in the choice of representation of source ranges, where benchmarks showed that using a variable length integer array encoding provided significant savings compared to a message-based encoding, even after compression.

[^3]: In practice, SCIP data tends to be have a compression ratio around in the range of 10%-20%, as modern compressors are very good at de-duplicating away the repetitive textual symbols.

## Core design decisions

### Using Protobuf for the schema

- Relatively compact binary format, reducing I/O overhead.
- Protobuf supports easy code generation.
- Many languages have Protobuf code generators.
- TLV format enables streaming reads and writes
as well as merging by concatenation.
- Rules for maintaining forward and backward compatibility
are easily understood.

### Using strings for IDs

Hash tables are a core data type used in compilers and
hence are likely to be useful in indexers generally.

String types in mainstream languages support equality and hashing.
where other objects may not be.

### Avoid direct encoding of graphs

One very tempting design for code navigation data is to
think of all semantic entities as nodes and relationships between
entities as edges, and to simply record ALL data
using an adjacency list based graph representation.

This is conceptually appealing,
but is not desirable for a few reasons:

- It encourages a wholesale approach to writing
indexers as it involves merging all the data together.
Such indexers are less likely to be able
to easily support parallelism.
- This potentially requires keeping a lot of data of memory
at indexing time. Ideally, an indexer should be able
to load parts of the codebase,
append index data for that part to an open file,
and then move on to the next part having cleared all memory
being used from the previous iteration.
- It potentially requires keeping a lot of data in memory
on the consumer side.

Instead, the approach to using documents and arrays
helps colocate relevant data and naturally allows for streaming
if the underlying data format allows streaming.

### Avoiding integer IDs

This includes avoiding structures like a symbol table
mapping string IDs to integers and mostly using the integer
IDs in raw data.

As far as we know, the integer IDs in LSIF are primarily present as
an ad-hoc compression scheme due to the verbosity of JSON
and LSIF's graph-based encoding scheme.

Avoiding integer IDs helps with limiting the blast radius
of indexer bugs. With LSIF, we've had off-by-one bugs in indexers
cause code navigation to fail repo-wide.

This also helps with debugging, as the raw data itself
can be inspected much more easily without needing
a lot of surrounding context.
8 changes: 4 additions & 4 deletions Readme.md → README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,15 +7,15 @@ such as Go to definition, Find references, and Find implementations.

This repository includes:

- A [protobuf schema for SCIP](./scip.proto).
- A [Protobuf schema for SCIP](./scip.proto).
- Rich Go and Rust bindings for SCIP: These include many utility functions
to help build tooling on top of SCIP.
- Auto-generated bindings for TypeScript and Haskell.
- The [`scip` CLI](./docs/CLI.md), which makes SCIP indexes
a breeze to work with.

If you're interested in better understanding the motivation behind SCIP,
check out the [announcement blog post](https://about.sourcegraph.com/blog/announcing-scip).
check out the [announcement blog post](https://about.sourcegraph.com/blog/announcing-scip) and the [design doc](./DESIGN.md).

If you're interested in writing a new indexer that emits SCIP,
check out our documentation on
Expand All @@ -24,8 +24,8 @@ Also, check out the [Debugging section][] in the Development docs.

If you're interested in consuming SCIP data,
you can either use one of the [provided language bindings](https://github.com/sourcegraph/scip/tree/main/bindings),
or generate code for the [SCIP protobuf schema](./scip.proto)
using the protobuf toolchain for your language ecosystem.
or generate code for the [SCIP Protobuf schema](./scip.proto)
using the Protobuf toolchain for your language ecosystem.
Also, check out the [Debugging section][] in the Development docs.

[debugging section]: ./Development.md#debugging
Expand Down

0 comments on commit b469406

Please sign in to comment.