
Data volume reduction on the client - data types and indices #355

Open
miranov25 opened this issue Jun 1, 2024 · 8 comments
@miranov25 (Owner) commented Jun 1, 2024

To facilitate data transfer between the server and client, user-defined lossy and lossless compression is employed.

The techniques discussed below can reduce the data volume on the client. The impact on the speed of subsequent data manipulation should be evaluated:

  • For integers, use the smallest representation possible—this can be included as an option.
  • For floats compressed with absolute precision, an integer array multiplied by a step size can be utilized.
  • For floats compressed relatively, a lookup table (either 8-bit or 16-bit) can be employed.
  • Use bit precision for integer representation, such as 1-bit, 2-bit, or 3-bit numbers.
  • Often, group-by statistics are employed to calculate various statistics (e.g., mean, median, RMS, entries, fit parameters) by group. In such cases, an index to another table can be used instead of flattening the data.

It is unclear how quickly these features can be implemented in JavaScript, how much they might slow down the code, and how much complexity they would introduce. A subset of these features should be implemented promptly, while the rest could likely be addressed using WebAssembly.
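As an illustration of the second bullet (floats with absolute precision stored as an integer array times a step size), a minimal sketch; the function names are hypothetical:

```javascript
// Quantize floats with a fixed absolute precision: store round(v / step)
// in a small integer array and reconstruct each value as index * step.
function quantizeAbsolute(values, step) {
    const out = new Int16Array(values.length);
    for (let i = 0; i < values.length; i++) {
        out[i] = Math.round(values[i] / step);
    }
    return out;
}

function dequantize(indices, step) {
    const out = new Float64Array(indices.length);
    for (let i = 0; i < indices.length; i++) {
        out[i] = indices[i] * step;
    }
    return out;
}

// Example: 0.01 absolute precision, 4x smaller than Float64 in memory.
const q = quantizeAbsolute([1.234, -0.567, 3.1415], 0.01);
const back = dequantize(q, 0.01);  // each value within step/2 of the original
```

The reconstruction error is bounded by step/2, which is exactly the "absolute precision" guarantee described above.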

@miranov25 (Owner, Author)

N-bit Integer Representation

An array of 1, 2, 3, ... N-bit integers can be represented using a bit array.

Consider an array of M numbers, each with N bits. To represent them, you can use a Uint8Array of size ceil((N * M) / 8).

How efficient is random access in such an array in JavaScript?
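A minimal sketch of random access into such a packed array, useful for measuring the cost of the question above (the packing layout here is an assumption: little-endian, least significant bit first per field, N ≤ 8):

```javascript
// Read the i-th N-bit unsigned integer out of a packed Uint8Array.
// A sketch for N <= 8; fields may cross a byte boundary, so two
// adjacent bytes are combined before shifting and masking.
function readPacked(bits, n, i) {
    const bitPos = n * i;
    const byte = bitPos >> 3;      // bitPos / 8
    const offset = bitPos & 7;     // bitPos % 8
    const word = bits[byte] | ((bits[byte + 1] || 0) << 8);
    return (word >> offset) & ((1 << n) - 1);
}

// Example: pack the 3-bit values 5, 3, 7, 1 into two bytes.
const packed = new Uint8Array(2);
[5, 3, 7, 1].forEach((v, i) => {
    const bitPos = 3 * i;
    packed[bitPos >> 3] |= (v << (bitPos & 7)) & 0xff;
    // Spill the high bits into the next byte when a 3-bit field
    // starts at offset 6 or 7 and crosses the byte boundary.
    if ((bitPos & 7) > 5) packed[(bitPos >> 3) + 1] |= v >> (8 - (bitPos & 7));
});
```

The per-access cost is a couple of shifts and masks, so the question is mainly whether the JIT keeps this competitive with plain typed-array indexing.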

@pl0xz0rz (Contributor) commented Jun 4, 2024

  • For integers, use the smallest representation possible—this can be included as an option.
    -- Already done two years ago, improved performance a lot
  • For floats compressed with absolute precision, an integer array multiplied by a step size can be utilized.
    -- Didn't have time to implement this for use in histograms, but they are already sent to the client as integers
  • For floats compressed relatively, a lookup table (either 8-bit or 16-bit) can be employed.
    -- This would probably make the performance worse because of cache locality
  • Use bit precision for integer representation, such as 1-bit, 2-bit, or 3-bit numbers.
    -- Doesn't seem very useful, packed representation is slow
  • Often, group-by statistics are employed to calculate various statistics (e.g., mean, median, RMS, entries, fit parameters) by group. In such cases, an index to another table can be used instead of flattening the data.
    -- Half implemented, but with three issues: a bad interface (to be improved to parity with the RDataFrame interface), bugs (I don't remember exactly what they were, only that there were some), and automatic expansion of the new columns. The last can probably be avoided by computing these after downsampling when drawing with downsampling, and by fusing this operation with the histogram in the easy histogram cases; I'm not sure whether streaming will work, however.

What is not mentioned here but should help improve memory consumption:

  • Bitmasks take up too much space. While this is not a big issue when there are more columns than histograms, reducing their footprint by a factor of ~2 should be simple.
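To illustrate the group-by bullet in the quoted list (an index into a per-group statistics table instead of flattening), a minimal sketch with illustrative array names:

```javascript
// Instead of flattening per-group statistics into every row, keep one
// small table of statistics per group and a per-row index into it.
const groupStats = {                      // one entry per group
    mean:    new Float64Array([1.5, 4.0]),
    entries: new Uint32Array([2, 3])
};
const groupIndex = new Uint8Array([0, 0, 1, 1, 1]);  // per-row group id

// Look up a group statistic for row i without expanding a column:
function rowMean(i) {
    return groupStats.mean[groupIndex[i]];
}
```

Memory is N_points bytes for the index plus one small table, instead of N_points copies of every statistic.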

@miranov25 (Owner, Author)

Bitmasks for Widget Selection

This section explains the bitmask representation used for widget selection, which is ideal for inclusion in READMEdeveloper.md or similar documentation.

Definition

  • N_points: Total number of points.
  • N_widgets: Total number of widgets.

We utilize a bitmask representation where each widget and point is represented by a bit.

Memory Allocation

Ideally, we should allocate a 32-bit integer array of size N_points * ceil(N_widgets / 32):

  • For each point, the value is a bitmask as follows: (w0, w1, w2, ..., wN).

Selection Logic

  • If N_widgets ≤ 32: A point is selected if (bitmask[ipoint] & bitmask_accept) == bitmask_accept. Typically, all bits should be set to 1 if we use a logical AND operation.
  • If N_widgets > 32: A range of 32-bit words must be checked to determine selection.
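A minimal sketch of the N_widgets ≤ 32 case of the selection logic above (names are illustrative):

```javascript
// Row-wise bitmask selection for up to 32 widgets: one Uint32 per point,
// bit w set if widget w accepts the point. A point passes when all bits
// requested by the accept mask are set (logical AND of widget selections).
function isSelected(bitmask, ipoint, acceptMask) {
    return (bitmask[ipoint] & acceptMask) === acceptMask;
}

// Example: 3 widgets, accept mask 0b111 (all widgets must accept).
const bitmask = new Uint32Array([0b111, 0b101, 0b111]);
const acceptMask = 0b111;
// points 0 and 2 pass; point 1 fails because widget 1 rejected it
```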

Implementation Differences (row bitmask described above vs. Marian's new pull request)

In the existing implementation by Marian, a different layout is used: there is one bitmask column per widget, each of size N_points/32 32-bit words.

Additional Memory Requirements for Special Functions

  • funCustom: The memory size required is N_points x 64 bits (float).
  • customSelection: The size is N_points x 32 bits (int), which needs to be converted to boolean. Additionally, this is added to the bitmasks.

The advantage of Marian’s column-per-widget implementation lies in its update method, which modifies only the corresponding column of the bitmask when a single widget changes. In the row bitmask representation, the decision-making process is swift and does not require a special column. However, since this column is small, the memory savings are minimal (N_points x 32 bits).
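A minimal sketch of the column-per-widget layout and its single-widget update (function names are illustrative, not Marian's actual code):

```javascript
// Column-per-widget layout: one bit array of ceil(N_points / 32) words
// per widget. When a single widget changes, only its own column is
// recomputed; all other widget columns stay untouched.
function updateWidgetColumn(column, predicate, nPoints) {
    column.fill(0);
    for (let i = 0; i < nPoints; i++) {
        if (predicate(i)) column[i >> 5] |= 1 << (i & 31);
    }
}

function pointSelected(columns, i) {   // logical AND over all widget columns
    const word = i >> 5, bit = 1 << (i & 31);
    return columns.every(col => (col[word] & bit) !== 0);
}

// Example: 40 points, two widgets (even index, index < 10).
const nPoints = 40, words = Math.ceil(nPoints / 32);
const c0 = new Uint32Array(words), c1 = new Uint32Array(words);
updateWidgetColumn(c0, i => i % 2 === 0, nPoints);
updateWidgetColumn(c1, i => i < 10, nPoints);
```

If widget 1 changes, only c1 is rebuilt, which is the update advantage described above.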

@miranov25 (Owner, Author)

Custom Logic Implementation (for later implementation)

Currently the selection is only a logical AND (&&) of selections from different widgets:

  • (s0 & s1 & s2 & ...)

In general we would also like to be able to configure custom logic, e.g.:

  • ((s0 | s1) & s2)

In the old PAW system this configuration was defined as a custom user-defined string, as shown above.
For us it means that widgets need IDs (as strings), and these IDs could then be used in the custom logic expression.
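One possible sketch of such string-configured logic, assuming widget IDs become parameters of a generated function (new Function is used purely for illustration; a real implementation would need to validate the expression):

```javascript
// Evaluate a user-defined selection expression over named widget
// selections, e.g. "(s0 | s1) & s2". Widget IDs become function
// parameters; per-point 0/1 selection bits are combined bitwise.
function makeSelectionLogic(widgetIds, expression) {
    // Hypothetical sketch: in production the expression string must be
    // validated/parsed rather than passed to new Function directly.
    return new Function(...widgetIds, "return (" + expression + ") !== 0;");
}

const logic = makeSelectionLogic(["s0", "s1", "s2"], "(s0 | s1) & s2");
logic(1, 0, 1);   // true:  (1 | 0) & 1
logic(0, 0, 1);   // false: (0 | 0) & 1
```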

@miranov25 (Owner, Author)

refer to #96

@miranov25 (Owner, Author)

Optimization of funCustom - needed to reduce the need for expanded functions

Title: Automatically Generate Function Header for JavaScript Function Based on Used Variables

Description:

We have a list of all variables present in a table and a JavaScript function (as a string without the header) that uses a subset of these variables. Our goal is to automatically generate the function header for this JavaScript function, including only the variables that are actually used in the function.

This is needed as we would like to keep columns compressed and use (expand to) float representation only if needed.

Use Case:

  1. Input:

    • A list of all variables (columns).
    • A JavaScript function as a string (functionString).
  2. Output:

    • A function header that includes only the variables from the list that are used in the function string.

Proposed Solution:

We will implement an algorithm to achieve this by iterating through the list of variables and checking whether each variable appears in the function string as a whole identifier (a plain substring test would wrongly match, e.g., v1 inside v10). If a variable is found, it is added to the function header.

Pseudocode:

function generateFunctionHeader(columns, functionString):
    Initialize an empty list `columnArgs`
    
    for each column in columns:
        if column appears as a whole identifier in functionString:
            add column to `columnArgs`
    
    Create the function header string `header`
    header = "function XXX(" + join(columnArgs, ", ") + ") {"
    
    return header

JavaScript Code:

function generateFunctionHeader(columns, functionString) {
    let columnArgs = [];

    columns.forEach(column => {
        // Match the column name as a whole identifier, not as a substring
        // (a plain includes() would match "v1" inside "v10").
        let escaped = column.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
        let pattern = new RegExp("\\b" + escaped + "\\b");
        if (pattern.test(functionString)) {
            columnArgs.push(column);
        }
    });

    let header = "function XXX(" + columnArgs.join(", ") + ") {";

    return header;
}

// Example usage:
let columns = ["v0", "v1", "v2", "v3", "v4"];
let functionString = "let sum = v0 + v1; let product = v1 * v3; return sum + product;";

let functionHeader = generateFunctionHeader(columns, functionString);
console.log(functionHeader);  // Output: "function XXX(v0, v1, v3) {"

@miranov25 (Owner, Author) commented Jun 14, 2024

Goal:

  • Allow the data source to handle a large number of columns.

  • The current limitation is in memory capacity, not CPU processing power.

  • Transition to using smaller float representations to reduce memory usage significantly. Currently, floats are represented in 64 bits, but with compression, we can achieve 8-bit or 16-bit quantization, which represents a memory reduction factor of 8 or 4.

  • A potential drawback is that transformations (such as selections and custom functions) may be slower. We need to quantify how CPU time scales with the number of rows and the number of columns.

  • 10^6-10^7 rows * 10^2 columns -> 0.1-1 GB in the case of a 1-byte representation, or 0.8-8 GB with a Float64 representation.
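A quick sanity check of the memory estimate in the last bullet (a minimal sketch):

```javascript
// Memory footprint estimate: rows * columns * bytesPerValue, in gigabytes.
function memoryGB(nRows, nCols, bytesPerValue) {
    return (nRows * nCols * bytesPerValue) / 1e9;
}

memoryGB(1e6, 100, 1);   // 0.1 GB with 1-byte quantization
memoryGB(1e7, 100, 1);   // 1 GB
memoryGB(1e7, 100, 8);   // 8 GB with Float64
```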

Node.js Speed Benchmark:

To evaluate the use of compressed float representations for data manipulation, we should compare CPU usage across different types of operations:

Representations:

  • Float 64: Standard double-precision floating-point.
  • Index Representation:
    • Absolute Compression: Value = val_i * step.
    • Relative Rounding: Value = ordered_lookup(i).
    • Lookup (e.g., for group by values).

Operations:

  • Selection: Can work with indices. Ranges can be transformed in the case of ordered data.
  • Standard Operations: Addition, subtraction, multiplication, and division by scalars and vectors.
  • Functions: Trigonometric functions like sine and cosine.
  • Histogramming: Evaluating the distribution of data.

Benchmark:

  • Measure CPU time per operation for the different representations as a function of ncols and nrows.
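A minimal Node.js timing harness for such a comparison might look like this (the operation, array sizes, and step are illustrative, not the actual benchmark):

```javascript
// Time the same scalar multiply over a Float64Array and over an
// 8-bit quantized representation reconstructed as index * step.
function timeIt(label, fn) {
    const t0 = process.hrtime.bigint();
    fn();
    const ms = Number(process.hrtime.bigint() - t0) / 1e6;
    console.log(label, ms.toFixed(2), "ms");
    return ms;
}

const n = 1_000_000, step = 0.01;
const f64 = new Float64Array(n).map(() => Math.random());
const u8 = Uint8Array.from(f64, v => Math.round(v / step) & 0xff);

timeIt("float64 multiply", () => {
    let s = 0;
    for (let i = 0; i < n; i++) s += f64[i] * 2.0;
    return s;
});
timeIt("uint8 * step multiply", () => {
    let s = 0;
    for (let i = 0; i < n; i++) s += (u8[i] * step) * 2.0;
    return s;
});
```

The same harness can be repeated with varying n and column counts to get the scaling curves mentioned above.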

@miranov25 (Owner, Author)

Node.js vs Browser CPU Benchmark Consideration

Marian's Observation

"When I look at performance, there is quite a big difference between the browser and Node.js; in some code pairs, one is faster in the browser and the other in Node.js.
Therefore, I believe it is more realistic to perform benchmarks in the browser and restrict Node.js to just unit tests."

Marian, could you please provide some examples? Are the differences a matter of percentage points or orders of magnitude?

Proposal for Continuing CPU Benchmarks in Node.js

I suggest we maintain CPU benchmarks for the numerical parts of the code in Node.js as well, since it is easier to automate and compare. For regression testing, this will be very important. We should soon move more towards WebAssembly, where I expect the results will be more predictable.

Suggested Solution - As per GPT:

Based on the observation that performance varies significantly between the browser and Node.js environments for different code snippets, the suggestion to focus benchmarking in the browser and limit Node.js to unit testing appears reasonable. Here's a more detailed approach to implementing this strategy:

  1. Isolate Performance-Sensitive Code: Identify parts of the codebase that are performance-sensitive and may behave differently in a browser compared to Node.js. This includes DOM manipulations, graphical operations, or operations that rely heavily on browser APIs.

  2. Conduct Benchmarks in the Browser: Use tools like Google Chrome’s Lighthouse or Mozilla Firefox’s Developer Tools to run performance benchmarks directly in the browser. This method provides a more accurate measurement of how the code performs in its intended environment.

  3. Set Up Automated Browser Tests: Implement automated browser testing frameworks such as Puppeteer or Selenium. These tools help automate benchmarks and integrate them into your continuous integration (CI) pipeline, regularly monitoring performance as part of your development process.

  4. Utilize Node.js for Unit Testing: Since Node.js may not accurately replicate the browser environment for all types of code, it should primarily be used for unit testing backend functionalities or non-UI logic that does not heavily depend on browser capabilities. Frameworks like Mocha, Jest, or Jasmine are suitable for these tasks.

  5. Performance Regression Testing: Regularly compare new performance measurements against previous benchmarks to identify any regressions or improvements. This can be integrated into your CI/CD pipeline to ensure continuous performance evaluation.

  6. Documentation and Reporting: Maintain detailed records of performance tests and their results. Utilize this data to generate reports for further analysis and to guide optimization efforts.

  7. Feedback Loop: Use insights from both browser benchmarks and Node.js tests to inform development decisions, focusing on optimizing areas where performance bottlenecks are identified.
