Developer's guide
Some useful resources to get started with digital color management, editing pipelines, calibration, view transforms, etc.:
- http://ves.devwalk.com/wp-content/uploads/2019/03/cinematiccolorveslo.pdf
- https://acescentral.com/
- http://last.hit.bme.hu/download/firtha/video/Colorimetry/Fairchild_M._Color_appearance_models__2005.pdf
- https://www.handprint.com/HP/WCL/wcolor.html
- https://www.filmlight.ltd.uk/pdf/whitepapers/FL-TL-TN-0417-StdColourSpaces.pdf
Too many programmers jump into their IDE before being sure they actually understand the problem they are trying to solve. darktable is full of Saturday-afternoon projects that lack polish, disregard ergonomics and get their inner colour science wrong. They sort of help and let users get some work done, but going the extra mile could have made them more (or simply) efficient.
Design is a process by which we match the needs of a category of users with a technical solution by building a tool. For the tool to be adapted to the users' needs, one has to know first which kind of users is targeted and what their real need is (not what they think their need is), then sketch several possible solutions, before finally bending some code to do it.
While designing a tool, draw, sketch, write, research what academia has to say about your problem and the state of the art, then finally prototype something. Don't open your IDE until you have everything figured out on paper first.
Hacking is nice and all, but often ends up with half-baked code that produces toys, not tools.
Pixels are essentially 4D RGBA vectors. Since 2004, processors have had special abilities to process vectors and apply a Single Instruction on Multiple Data (SIMD). This speeds up the computations by processing 1 pixel (SSE2) to 4 pixels (AVX-512) at the same time, saving a lot of CPU cycles.
darktable has 3 versions of its IOPs: pure C (scalar), SSE2 (vectorized for 4 floats) and OpenCL (vectorized on GPU). That triggers some redundancy in the code. However, modern compilers and OpenMP have auto-vectorization options that could optimize pure C, provided the code is written in a vectorizable way and uses some pragmas to give hints to the compiler.
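As a rough illustration of what such a hinted pure C loop could look like, here is a minimal sketch. The function name scale_pixels and its parameters are made up for this example and are not part of darktable. The pragma is only active when the code is compiled with OpenMP support, and whether the loop actually gets vectorized depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only (not darktable API):
 * multiply every channel of every RGBA pixel by `gain`. */
static void scale_pixels(const float *const restrict in, float *const restrict out,
                         const size_t npixels, const float gain)
{
  /* restrict promises the compiler that the two buffers do not overlap */
  assert(in != out);

#ifdef _OPENMP
#pragma omp simd
#endif
  for(size_t k = 0; k < 4 * npixels; ++k)
    out[k] = gain * in[k]; /* each iteration is independent of the others */
}
```

Calling scale_pixels(in, out, width * height, 0.5f) on two distinct buffers of 4 * width * height floats would halve every channel.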
Write vectorizable code : https://info.ornl.gov/sites/publications/files/Pub69214.pdf
Best practices for auto-vectorization:
- avoid branches in loops that change the control flow. Use inline statements like
absolute = (x > 0) ? x : -x;
so they can be converted to bit masks in SIMD,
- pixels should only be referenced from the base pointer of their array and the indices of the loops, not from implicit pointer increments, for example:
float *image = (float *)in;
for(size_t i = 0; i < height; ++i)
{
  float *pixel = (float *)image + i * width;
  for(size_t j = 0; j < width; ++j)
  {
    *pixel = whatever;
    pixel++;
  }
}
should be written :
float *const restrict image = (float *)in;
for(size_t i = 0; i < height; ++i)
{
  for(size_t j = 0; j < width; ++j)
  {
    image[i * width + j] = whatever;
  }
}
In the former, the address pointed to by pixel depends on the previous loop iteration, which prevents parallelization and vectorization, and also makes the code more difficult to read. The latter uses an indexing that only depends on the i and j loop increments, avoids false aliasing, and is easier to read (we immediately spot the array indexing).
- avoid carrying struct arguments in functions called in loops, and unpack the struct members before the loop. Vectorization can't be performed on structures, but only on float and int scalars and arrays. For example:
typedef struct iop_data_t
{
  float pixel[4];
  float factor;
} iop_data_t;
float foo(float x, const struct iop_data_t *bar)
{
  return bar->factor * (x + bar->pixel[0] + bar->pixel[1] + bar->pixel[2] + bar->pixel[3]);
}
void loop(const float *in, float *out, const size_t width, const size_t height, const struct iop_data_t *bar)
{
  for(size_t k = 0; k < height * width; ++k)
  {
    out[k] = foo(in[k], bar); // the non-vectorized function will be called at each iteration (expensive)
  }
}
should be written:
typedef struct iop_data_t
{
  float pixel[4] DT_ALIGNED_PIXEL; // align on 16-byte addresses
  float factor;
} iop_data_t;

/* declare the function vectorizable and inline it to avoid calls from within the loop */
#ifdef _OPENMP
#pragma omp declare simd
#endif
static inline float foo(const float x, const float pixel[4], const float factor)
{
  float sum = x;
  /* use a SIMD reduction to vectorize the sum */
#ifdef _OPENMP
#pragma omp simd aligned(pixel:16) reduction(+:sum)
#endif
  for(size_t k = 0; k < 4; ++k)
    sum += pixel[k];
  return factor * sum;
}

void loop(const float *const restrict in, // use constant pointers and restrict keyword to avoid false-aliasing
          float *const restrict out,
          const size_t width, const size_t height, const struct iop_data_t *const bar)
{
  /* unpack the struct members */
  const float *const restrict pixel = bar->pixel;
  const float factor = bar->factor;
#ifdef _OPENMP
#pragma omp parallel for simd default(none) \
  dt_omp_firstprivate(in, out, pixel, factor, width, height) \
  schedule(simd:static) aligned(in, out:64)
#endif
  for(size_t k = 0; k < height * width; ++k)
  {
    /*
     * now the code of the function foo is copied inside the loop
     * so we avoid function calls
     * and the compiler can vectorize the content of foo at the loop level
     * for example, on AVX2 platforms (256-bit registers), the compiler could
     * process 8 float elements of out and in at every loop step to save cycles.
     */
    out[k] = foo(in[k], pixel, factor);
  }
}
- if you use nested loops (e.g. loop on the width and height of the array), declare the pixel pointers in the innermost loop and use collapse(2) in the OpenMP pragma so the compiler will be able to optimize the cache/memory use and split the loop more evenly between the different threads,
- use flat indexing of arrays whenever possible (for(size_t k = 0 ; k < ch * width * height ; k += ch)) instead of nested width/height/channels loops,
- use the restrict keyword on image/pixels pointers to avoid aliasing and avoid inplace operations on pixels (*out must always be different from *in) so you don't trigger variable dependencies between threads,
- align arrays on 64 bytes and pixels on 16 bytes blocks so the memory is contiguous and the CPU can load full cache lines (and avoid segfaults),
- write small functions and optimize locally (one loop/function), using OpenMP and/or compiler pragmas,
- keep your code stupid simple, systematic and avoid smart-ass pointer arithmetic because it will only lead the compiler to detect variable dependencies and pointer aliasing where there are none,
- avoid type casts,
- declare input/output pointers as *const and variables as const to avoid false-sharing in parallel loops (using shared(variable) OpenMP pragma),
- look at RawTherapee source code because these guys got it right.
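Several of the practices above can be combined in a single loop. The sketch below is only an illustration: abs_rgb is a hypothetical filter, not a darktable function, and it assumes 4 interleaved float channels per pixel. It uses a branch-free ternary, flat indexing, const restrict pointers and separate input/output buffers. It leaves out the aligned clauses because it makes no assumption about how the caller allocated the buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical filter, for illustration only (not darktable API):
 * take the absolute value of R, G and B, and copy alpha unchanged. */
static void abs_rgb(const float *const restrict in, float *const restrict out,
                    const size_t width, const size_t height, const size_t ch)
{
  assert(in != out); /* no in-place processing */
  assert(ch == 4);   /* assumption of this sketch: RGBA pixels */

#ifdef _OPENMP
#pragma omp parallel for simd schedule(static)
#endif
  for(size_t k = 0; k < ch * width * height; k += ch)
  {
    /* every address depends only on the loop indices k and c */
    for(size_t c = 0; c < 3; ++c)
    {
      const float x = in[k + c];
      out[k + c] = (x > 0.0f) ? x : -x; /* ternary instead of an if/else branch */
    }
    out[k + 3] = in[k + 3];
  }
}
```

When the buffers are known to be aligned on 64 bytes, an aligned(in, out:64) clause could be added to the pragma, as in the example further up.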
Modules are the interfaces for IOPs, i.e. image-processing filters stacked in the pixelpipe. IOPs can be found in src/iop and the IOP API can be found in the header src/iop/iop_api.h.
Most IOPs have 3 variants of their pixel-filtering part:
- a pure C implementation, in process(),
- a C optimized version, with SSE2 intrinsics, in process_sse2(),
- an OpenCL version, offloading the computation to the GPU, in process_opencl().
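To give an idea of how a scalar and an intrinsics variant of the same filter relate to each other, here is a minimal sketch. All names (gain_scalar, gain_sse2, apply_gain, want_simd) are hypothetical, the signatures are deliberately simplified and are not those of the IOP API, and the way darktable itself selects a variant is not shown. The float intrinsics used here come from the original SSE set, which the SSE2 header emmintrin.h also provides.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar variant: plain C, one float at a time. */
static void gain_scalar(const float *const restrict in, float *const restrict out,
                        const size_t npixels, const float gain)
{
  for(size_t k = 0; k < 4 * npixels; ++k)
    out[k] = gain * in[k];
}

#if defined(__SSE2__)
/* Intrinsics variant: one RGBA pixel (4 floats) per instruction.
 * Unaligned loads/stores are used so the sketch works with any buffer. */
static void gain_sse2(const float *const restrict in, float *const restrict out,
                      const size_t npixels, const float gain)
{
  const __m128 g = _mm_set1_ps(gain);
  for(size_t k = 0; k < 4 * npixels; k += 4)
    _mm_storeu_ps(out + k, _mm_mul_ps(g, _mm_loadu_ps(in + k)));
}
#endif

/* Hypothetical dispatcher: use the intrinsics variant when it was compiled in
 * and the caller asks for it, otherwise fall back to the scalar one. */
static void apply_gain(const float *const restrict in, float *const restrict out,
                       const size_t npixels, const float gain, const int want_simd)
{
  assert(in != out);
#if defined(__SSE2__)
  if(want_simd)
  {
    gain_sse2(in, out, npixels, gain);
    return;
  }
#else
  (void)want_simd; /* no intrinsics variant available on this target */
#endif
  gain_scalar(in, out, npixels, gain);
}
```

Both variants should produce the same output for the same input, which makes the scalar one a convenient reference when testing the optimized one.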
An example of a dummy IOP can be found in src/iop/useless.c and used as a boilerplate.
If you add a new IOP, be sure to add the C file in src/iop/CMakeLists.txt#L69 and deal with its priority in the pixelpipe by adding a new node in tools/iop_dependencies.py.
darktable wiki is licensed under the Creative Commons BY-SA 4.0 terms.