
doc: Revise and update README.md, Concepts.md and tutorial jupyter notebooks (#658)

TASK: IL-305

Co-authored-by: Florian Schepers <[email protected]>
NiklasKoehneckeAA and FlorianSchepersAA authored Mar 28, 2024
1 parent e7335c4 commit ca707b8
Showing 11 changed files with 461 additions and 421 deletions.
80 changes: 38 additions & 42 deletions Concepts.md
@@ -2,12 +2,12 @@

The main focus of the Intelligence Layer is to enable developers to

- implement their LLM use cases by building upon and composing existing functionalities
- obtain insights into the runtime behavior of their implementations
- iteratively improve their implementations or compare them to existing implementations by evaluating them against
a given set of examples

How these focus points are realized in the Intelligence Layer is described in more detail in the following sections.

## Task

@@ -18,8 +18,8 @@ transforms an input-parameter to an output like a function in mathematics.
```
Task: Input -> Output
```

In Python this is realized by an abstract class with type-parameters and the abstract method `do_run`
in which the actual transformation is implemented:

```Python
class Task(ABC, Generic[Input, Output]):
@@ -30,13 +30,13 @@ class Task(ABC, Generic[Input, Output]):
```

`Input` and `Output` are normal Python datatypes that can be serialized from and to JSON. For this the Intelligence
Layer relies on [Pydantic](https://docs.pydantic.dev/). The types that can be used are defined in form
of the type-alias `PydanticSerializable`.
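
For illustration, the `Input` and `Output` of a small example task could be defined as plain Pydantic models like the following (the names are made up for this sketch and are not part of the library):

```Python
from pydantic import BaseModel


class GreetingInput(BaseModel):
    name: str


class GreetingOutput(BaseModel):
    greeting: str
```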

The second parameter `task_span` is used for [tracing](#Trace) which is described below.

`do_run` is the method that implements a concrete task and has to be provided by the user. It is executed through
the task's external interface method `run`:

```Python
class Task(ABC, Generic[Input, Output]):
@@ -45,7 +45,7 @@ class Task(ABC, Generic[Input, Output]):
...
```

The signatures of the `do_run` and `run` methods differ only in the [tracing](#Trace) parameters.
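
Building on the illustrative models above, a minimal concrete task could then be sketched as follows (the import path of `Task` and `TaskSpan` is an assumption and may differ between versions):

```Python
from intelligence_layer.core import Task, TaskSpan


class GreetingTask(Task[GreetingInput, GreetingOutput]):
    def do_run(self, input: GreetingInput, task_span: TaskSpan) -> GreetingOutput:
        # The actual transformation; a real use case would typically prompt an LLM here.
        return GreetingOutput(greeting=f"Hello, {input.name}!")
```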

### Levels of abstraction

@@ -56,17 +56,17 @@ with an LLM on a very generic or even technical level.

Examples for higher level tasks (Use Cases) are:

- Answering a question based on a given document: `QA: (Document, Question) -> Answer`
- Generate a summary of a given document: `Summary: Document -> Summary`

Examples for lower level tasks are:

- Let the model generate text based on an instruction and some context: `Instruct: (Context, Instruction) -> Completion`
- Chunk a text into smaller pieces at optimized boundaries (typically to make it fit into an LLM's context-size): `Chunk: Text -> [Chunk]`

### Composability

Typically you would build higher level tasks from lower level tasks. Given a task you can draw a dependency graph
that illustrates which sub-tasks it is using and in turn which sub-tasks they are using. This graph typically forms a hierarchy or
more generally, a directed acyclic graph. The following drawing shows this graph for the Intelligence Layer's `RecursiveSummarize`
task:
@@ -76,8 +76,8 @@

### Trace

A task implements a workflow. It processes its input, passes it on to sub-tasks, processes the outputs of the sub-tasks
and builds its own output. This workflow can be represented in a trace. For this a task's `run` method takes a `Tracer`
that takes care of storing details on the steps of this workflow like the tasks that have been invoked along with their
input, output and timing information. The following illustration shows the trace of a MultiChunkQa-task:

@@ -86,9 +86,9 @@
To represent this, tracing defines the following concepts:

- A `Tracer` is passed to a task's `run` method and provides methods for opening `Span`s or `TaskSpan`s.
- A `Span` is a `Tracer` and allows for grouping multiple logs and runtime durations together as a single, logical step in the
workflow.
- A `TaskSpan` is a `Span` that allows for grouping multiple logs together with the task's specific input and output.
An opened `TaskSpan` is passed to `Task.do_run`. Since a `TaskSpan` is a `Tracer` a `do_run` implementation can pass
this instance on to `run` methods of sub-tasks.

@@ -104,7 +104,7 @@ three abstract classes `Tracer`, `Span` and `TaskSpan` needs to be implemented.
- The `NoOpTracer` can be used when tracing information shall not be stored at all.
- The `InMemoryTracer` stores all traces in an in memory data structure and is most helpful in tests or
Jupyter notebooks.
- The `FileTracer` stores all traces in a json-file.
- The `OpenTelemetryTracer` uses an OpenTelemetry
[`Tracer`](https://opentelemetry-python.readthedocs.io/en/latest/api/trace.html#opentelemetry.trace.Tracer)
to store the traces in an OpenTelemetry backend.
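
For illustration, running the hypothetical `GreetingTask` from above with different tracers could look roughly like this (import paths assumed):

```Python
from intelligence_layer.core import InMemoryTracer, NoOpTracer

# Store the trace in memory, e.g. to inspect it in a Jupyter notebook.
tracer = InMemoryTracer()
output = GreetingTask().run(GreetingInput(name="World"), tracer)

# Discard all tracing information.
output = GreetingTask().run(GreetingInput(name="World"), NoOpTracer())
```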
@@ -127,8 +127,8 @@ The evaluation process helps to:

### Dataset

The basis of an evaluation is a set of examples for the specific task-type to be evaluated. A single `Example`
consists of:

- an instance of the `Input` for the specific task and
- optionally an _expected output_ that can be anything that makes sense in context of the specific evaluation (e.g.
@@ -139,6 +139,7 @@
To enable reproducibility of evaluations, datasets are immutable. A single dataset can be used to evaluate all
tasks of the same type, i.e. with the same `Input` and `Output` types.
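
As a sketch, a dataset for the hypothetical greeting task could be created roughly like this (the repository class and its method names are assumptions; the repositories themselves are described further below):

```Python
from intelligence_layer.evaluation import Example, InMemoryDatasetRepository

examples = [
    Example(input=GreetingInput(name="Alice"), expected_output="Hello, Alice!"),
    Example(input=GreetingInput(name="Bob"), expected_output="Hello, Bob!"),
]

dataset_repository = InMemoryDatasetRepository()
dataset = dataset_repository.create_dataset(examples=examples, dataset_name="greetings")
```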


### Evaluation Process

The Intelligence Layer supports different kinds of evaluation techniques. Most important are:
@@ -148,18 +149,16 @@
case the aggregated result could contain metrics like accuracy which can easily be compared with other
aggregated results.
- Comparing the individual outputs of different runs (all based on the same dataset)
in a single evaluation process and produce a ranking of all runs as an aggregated result. This technique is useful when it is hard to come up with an absolute metric to evaluate
a single output, but it is easier to compare two different outputs and decide which one is better. An example
use case could be summarization.

To support these techniques the Intelligence Layer differentiates between 3 consecutive steps:

1. Run a task by feeding it all inputs of a dataset and collecting all outputs
2. Evaluate the outputs of one or several runs and produce an evaluation result for each example. Typically a single run is evaluated if absolute
metrics can be computed and several runs are evaluated when the outputs of runs shall be compared.
3. Aggregate the evaluation results of one or several evaluation runs into a single object containing the aggregated
metrics. Aggregating over several evaluation runs supports amending a previous comparison result with
comparisons of new runs without the need to re-execute the previous comparisons again.

@@ -171,30 +170,27 @@ The following table shows how these three steps are represented in code:
| 2. Evaluate | `Evaluator` | `EvaluationLogic` | `EvaluationRepository` |
| 3. Aggregate | `Aggregator` | `AggregationLogic` | `AggregationRepository` |

The columns have the following meaning:

- "Executor" lists concrete implementations provided by the Intelligence Layer.
- "Custom Logic" lists abstract classes that need to be implemented with the custom logic.
- "Repository" lists abstract classes for storing intermediate results. The Intelligence Layer provides
different implementations for these. See the next section for details.

### Data Storage

During an evaluation process a lot of intermediate data is created before the final aggregated result can be produced.
To avoid repeating expensive computations when new results are to be produced based on previous ones,
all intermediate results are persisted. For this the different executor-classes make use of repositories.

There are the following Repositories:

- The `DatasetRepository` offers methods to manage datasets. The `Runner` uses it to read all `Example`s of a dataset and feeds them to the `Task`.
- The `RunRepository` is responsible for storing a task's output (in form of an `ExampleOutput`) for each `Example` of a dataset
which are created when a `Runner`
runs a task using this dataset. At the end of a run a `RunOverview` is stored containing some metadata concerning the run.
The `Evaluator` reads these outputs given a list of runs it should evaluate to create an evaluation
result for each `Example` of the dataset.
- The `EvaluationRepository` enables the `Evaluator` to store the evaluation result (in form of an `ExampleEvaluation`) for each `Example` along with an `EvaluationOverview`. The `Aggregator` uses this repository to read the evaluation results.
- The `AggregationRepository` stores the `AggregationOverview` containing the aggregated metrics on request of the `Aggregator`.

The following diagrams illustrate how the different concepts play together in case of the different types of evaluations.
@@ -210,7 +206,7 @@
`RunRepository` and the corresponding `Example` from the `DatasetRepository` and uses the `EvaluationLogic` to compute an `Evaluation`.
4. Each `Evaluation` gets wrapped in an `ExampleEvaluation` and stored in the `EvaluationRepository`.
5. The `Aggregator` reads all `ExampleEvaluation`s for a given evaluation and feeds them to the `AggregationLogic` to produce an `AggregatedEvaluation`.
6. The `AggregatedEvaluation` is wrapped in an `AggregationOverview` and stored in the `AggregationRepository`.
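
As a rough sketch, the absolute-evaluation flow above could be wired up as follows, reusing the hypothetical `GreetingTask` and its models from the earlier sketches. All class and method names below (the in-memory repositories, `Runner`, `Evaluator`, `Aggregator` and the logic base classes) follow the concepts described in this document, but the exact signatures are assumptions and may differ between versions:

```Python
from collections.abc import Iterable

from pydantic import BaseModel

from intelligence_layer.evaluation import (
    AggregationLogic,
    Aggregator,
    Evaluator,
    Example,
    InMemoryAggregationRepository,
    InMemoryDatasetRepository,
    InMemoryEvaluationRepository,
    InMemoryRunRepository,
    Runner,
    SingleOutputEvaluationLogic,
)


class GreetingEvaluation(BaseModel):
    correct: bool


class GreetingAggregation(BaseModel):
    accuracy: float


class GreetingEvaluationLogic(
    SingleOutputEvaluationLogic[GreetingInput, GreetingOutput, str, GreetingEvaluation]
):
    def do_evaluate_single_output(self, example, output) -> GreetingEvaluation:
        # Compare the task's output against the expected output of the Example.
        return GreetingEvaluation(correct=output.greeting == example.expected_output)


class GreetingAggregationLogic(AggregationLogic[GreetingEvaluation, GreetingAggregation]):
    def aggregate(self, evaluations: Iterable[GreetingEvaluation]) -> GreetingAggregation:
        evaluations = list(evaluations)
        correct = sum(evaluation.correct for evaluation in evaluations)
        return GreetingAggregation(accuracy=correct / len(evaluations) if evaluations else 0.0)


dataset_repository = InMemoryDatasetRepository()
run_repository = InMemoryRunRepository()
evaluation_repository = InMemoryEvaluationRepository()
aggregation_repository = InMemoryAggregationRepository()

examples = [Example(input=GreetingInput(name="Alice"), expected_output="Hello, Alice!")]
dataset = dataset_repository.create_dataset(examples=examples, dataset_name="greetings")

# 1./2. Run the task on all examples of the dataset and store the outputs.
runner = Runner(GreetingTask(), dataset_repository, run_repository, "greeting-run")
run_overview = runner.run_dataset(dataset.id)

# 3./4. Evaluate the stored outputs example by example.
evaluator = Evaluator(
    dataset_repository,
    run_repository,
    evaluation_repository,
    "greeting-eval",
    GreetingEvaluationLogic(),
)
evaluation_overview = evaluator.evaluate_runs(run_overview.id)

# 5./6. Aggregate the individual evaluations into a single overview.
aggregator = Aggregator(
    evaluation_repository,
    aggregation_repository,
    "greeting-aggregation",
    GreetingAggregationLogic(),
)
aggregation_overview = aggregator.aggregate_evaluation(evaluation_overview.id)
```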

The next diagram illustrates the more complex case of a relative evaluation.

@@ -219,13 +215,13 @@
<figcaption>Process of a relative Evaluation</figcaption>
</figure>

1. Multiple `Runner`s read the same dataset and produce the corresponding `Output`s for different `Task`s.
2. For each run all `Output`s are stored in the `RunRepository`.
3. The `Evaluator` gets as input previous evaluations (that were produced on the basis of the same dataset, but by different `Task`s) and the new runs of the current task.
4. Given the previous evaluations and the new runs the `Evaluator` can read the `ExampleOutput`s of both the new runs
and the runs associated to previous evaluations, collect all that belong to a single `Example` and pass them
along with the `Example` to the `EvaluationLogic` to compute an `Evaluation`.
5. Each `Evaluation` gets wrapped in an `ExampleEvaluation` and is stored in the `EvaluationRepository`.
6. The `Aggregator` reads all `ExampleEvaluation`s from all involved evaluations
   and feeds them to the `AggregationLogic` to produce an `AggregatedEvaluation`.
7. The `AggregatedEvaluation` is wrapped in an `AggregationOverview` and stored in the `AggregationRepository`.
16 changes: 8 additions & 8 deletions README.md
@@ -46,12 +46,12 @@ The environment can be activated via `poetry shell`. See the official poetry doc

### Getting started with the Jupyter Notebooks

After running the local installation steps, you can set whether you are using the Aleph-Alpha API or an on-prem setup via environment variables.

---
**Using the Aleph-Alpha API** \
\
In the Intelligence Layer the Aleph-Alpha API (`https://api.aleph-alpha.com`) is set as the default host URL. However, you will need an [Aleph Alpha access token](https://docs.aleph-alpha.com/docs/account/#create-a-new-token) to run the examples.
Set your access token with

```bash
export AA_TOKEN=<YOUR TOKEN HERE>

**Using an on-prem setup** \
\
In case you want to use an on-prem endpoint you will have to change the host URL by setting the `CLIENT_URL` environment variable:

```bash
export CLIENT_URL=<YOUR_ENDPOINT_URL_HERE>
```

The program will warn you if no `CLIENT_URL` is set explicitly.

---
After correctly setting up the environment variables you can run the jupyter notebooks.
@@ -188,10 +188,10 @@ Not sure where to start? Familiarize yourself with the Intelligence Layer using
If you prefer you can also read about the [concepts](Concepts.md) first.
## Tutorials
The tutorials aim to guide you through implementing several common use-cases with the Intelligence Layer. They introduce you to key concepts and enable you to create your own use-cases. In general the tutorials are built in a way that you can simply hop into the topic you are most interested in. However, for starters we recommend reading through the `Summarization` tutorial first. It explains the core concepts of the Intelligence Layer in more depth, while for the other tutorials we assume that these concepts are known.

| Order | Topic              | Description                                            | Notebook 📓                                                       |
| ----- | ------------------ |------------------------------------------------------|-----------------------------------------------------------------|
| 1 | Summarization | Summarize a document | [summarization.ipynb](./src/examples/summarization.ipynb) |
| 2 | Question Answering | Various approaches for QA | [qa.ipynb](./src/examples/qa.ipynb) |
| 3 | Classification | Learn about two methods of classification | [classification.ipynb](./src/examples/classification.ipynb) |
@@ -200,7 +200,7 @@
| 6 | Document Index | Connect your proprietary knowledge base | [document_index.ipynb](./src/examples/document_index.ipynb) |
| 7 | Human Evaluation | Connect to Argilla for manual evaluation | [human_evaluation.ipynb](./src/examples/human_evaluation.ipynb) |
| 8 | Performance tips | Contains some small tips for performance | [performance_tips.ipynb](./src/examples/performance_tips.ipynb) |
| 9     | Deployment         | Shows how to deploy a Task in a minimal FastAPI app.  | [fastapi_tutorial.md](./src/examples/fastapi_tutorial.md)        |

## How-Tos
The how-tos are quick lookups about how to do things. Compared to the tutorials, they are shorter and do not explain the concepts they are using in-depth.
