
[Java][Python] Avoid Reuse VectorSchemaRoot for exporting ArrowArrayStream to other language #36443

Open
hu6360567 opened this issue Jul 4, 2023 · 16 comments

Comments

@hu6360567
Contributor

hu6360567 commented Jul 4, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I'm trying to import/export data to a database from Python through an ArrowArrayStream, over pyarrow.jvm and JDBC.

In order to export an ArrowVectorIterator as a stream without unloading it to a RecordBatch on the Java side first, I wrap the ArrowVectorIterator in an ArrowReader as below:

public class ArrowVectorIteratorReader extends ArrowReader {

    private final Iterator<VectorSchemaRoot> iterator;
    private final Schema schema;
    private VectorSchemaRoot root;

    public ArrowVectorIteratorReader(BufferAllocator allocator, Iterator<VectorSchemaRoot> iterator, Schema schema) {
        super(allocator);
        this.iterator = iterator;
        this.schema = schema;
        this.root = null;
    }

    @Override
    public VectorSchemaRoot getVectorSchemaRoot() throws IOException {
        if (root == null) return super.getVectorSchemaRoot();
        return root;
    }

    @Override
    public boolean loadNextBatch() throws IOException {
        if (iterator.hasNext()) {
            VectorSchemaRoot lastRoot = root;
            root = iterator.next();
            if (root != lastRoot && lastRoot != null) lastRoot.close();
            return true;
        } else {
            return false;
        }
    }

    @Override
    public long bytesRead() {
        return 0;
    }

    @Override
    protected void closeReadSource() throws IOException {
        if (iterator instanceof AutoCloseable) {
            try {
                ((AutoCloseable) iterator).close();
            } catch (Exception e) {
                throw new IOException(e);
            }
        }
        if (root != null) root.close();
    }

    @Override
    protected Schema readSchema() throws IOException {
        return schema;
    }
}

When the ArrowVectorIterator uses a config with reuseVectorSchemaRoot enabled, utf8 arrays may be corrupted on the Python side, although everything works as expected on the Java side.

Java code is as below:

try (final ArrowReader source = porter.importData(1);  // returns ArrowVectorIteratorReader with batchSize=1
             final ArrowArrayStream stream = ArrowArrayStream.allocateNew(allocator)) {
            Data.exportArrayStream(allocator, source, stream);

            try (final ArrowReader reader = Data.importArrayStream(allocator, stream)) {
                while (reader.loadNextBatch()) {
                    // root from getVectorSchemaRoot() is legal on every vector
                    totalRecord += reader.getVectorSchemaRoot().getRowCount();
                }
            }
        }

On the Python side, the situation is harder to explain.
The exported stream from Java is wrapped into a RecordBatchReader and written into different file formats.

def wrap_from_java_stream_to_generator(java_arrow_stream, allocator=None, yield_schema=False):
    if allocator is None:
        allocator = get_java_root_allocator().allocator
    c_stream = arrow_c.new("struct ArrowArrayStream*")
    c_stream_ptr = int(arrow_c.cast("uintptr_t", c_stream))

    org = jpype.JPackage("org")
    java_wrapped_stream = org.apache.arrow.c.ArrowArrayStream.wrap(c_stream_ptr)

    org.apache.arrow.c.Data.exportArrayStream(allocator, java_arrow_stream, java_wrapped_stream)

    # noinspection PyProtectedMember
    with pa.RecordBatchReader._import_from_c(c_stream_ptr) as reader:  # type: pa.RecordBatchReader
        if yield_schema:
            yield reader.schema
        yield from reader


def wrap_from_java_stream(java_arrow_stream, allocator=None):
    generator = wrap_from_java_stream_to_generator(java_arrow_stream, allocator, yield_schema=True)
    schema = next(generator)

    return pa.RecordBatchReader.from_batches(schema, generator)

For CSV, it works as expected:

with wrap_from_java_stream(java_arrow_stream, allocator) as stream:
    with pa.csv.CSVWriter(csv_path, stream.schema) as writer:
        for record_batch in stream:
            writer.write_batch(record_batch)

For Parquet, writing with the dataset API as below fails:

with wrap_from_java_stream(java_arrow_stream, allocator) as stream:
    pa.dataset.write_dataset(stream, data_path, format="parquet")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
......./python3.8/site-packages/pyarrow/dataset.py:999: in write_dataset
    _filesystemdataset_write(
pyarrow/_dataset.pyx:3655: in pyarrow._dataset._filesystemdataset_write
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   pyarrow.lib.ArrowInvalid: Parquet cannot store strings with size 2GB or more

OR

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Length spanned by binary offsets (7) larger than values array (size 6)

In order to figure out which record raises the error, the RecordBatchReader is re-wrapped with a smaller batch size and the content is logged as below:

with wrap_from_java_stream(java_arrow_stream, allocator) as stream:
    def generator():
        for rb in stream:
            for i in range(rb.num_rows):
                slice = rb.slice(i,1)
                logger.info(slice.to_pylist())
                yield slice
    pa.dataset.write_dataset(pa.RecordBatchReader.from_batches(stream.schema, generator()), data_path, format="parquet")

Although the logger can print every slice, write_dataset still fails:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
......./python3.8/site-packages/pyarrow/dataset.py:999: in write_dataset
    _filesystemdataset_write(
pyarrow/_dataset.pyx:3655: in pyarrow._dataset._filesystemdataset_write
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: First or last binary offset out of bounds

For the arrow/feather format, the record batches seem to be written directly into the file, but they are invalid when read back from the file (code is similar to the above).

However, if I create the ArrowVectorIteratorReader without reuseVectorSchemaRoot, everything works fine on the Python side.
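
For reference, this is roughly how the reader is built with reuse disabled (a minimal sketch; openReader and its parameters are illustrative, while the builder calls are the ones from the Arrow JDBC adapter):

// Sketch (illustrative names): build the ArrowVectorIterator with reuseVectorSchemaRoot
// disabled, so every batch owns its own buffers and stays valid after it has been
// exported through the C Data stream.
static ArrowReader openReader(BufferAllocator allocator, ResultSet resultSet, Schema schema, int batchSize)
        throws SQLException, IOException {
    JdbcToArrowConfig config = new JdbcToArrowConfigBuilder(allocator, JdbcToArrowUtils.getUtcCalendar())
            .setTargetBatchSize(batchSize)
            .setReuseVectorSchemaRoot(false)  // each batch gets a fresh VectorSchemaRoot
            .build();
    ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(resultSet, config);
    return new ArrowVectorIteratorReader(allocator, iterator, schema);
}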

Component(s)

Java, Python

@lidavidm
Member

lidavidm commented Jul 4, 2023

Is it possible to share a self-contained reproduction?

That said, I think what might be happening is that the Parquet writer may request more than one batch from the reader, and if you request to share roots, then the previous batch will be overwritten. That is, I would expect this to fail:

reader = wrap_from_java_stream(...)
batch1 = reader.read_next_batch()
batch1.validate(full=True)  # OK
batch2 = reader.read_next_batch()
batch1.validate(full=True)  # Not OK because batch2 and batch1 share the same allocation

Exporting data via C Data does not copy the data, so it is your application's responsibility to properly manage the lifetime of the buffers. And Arrow Java uses mutable buffers, so if you enable reusing a VectorSchemaRoot, you'll find that reading new data invalidates previously read data.
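
If the producer must keep reuseVectorSchemaRoot enabled, one way to decouple exported batches from the reused root is to deep-copy each batch into a freshly allocated root before handing it to the exporter. A minimal sketch, assuming the vector types involved support copyFromSafe (copyRoot is a hypothetical helper, not an Arrow API):

// Hypothetical helper: copy the contents of a (possibly reused) root into a new
// VectorSchemaRoot that owns its own buffers. copyFromSafe does a value-by-value copy,
// so the result no longer aliases the source root's buffers.
static VectorSchemaRoot copyRoot(VectorSchemaRoot source, BufferAllocator allocator) {
  VectorSchemaRoot copy = VectorSchemaRoot.create(source.getSchema(), allocator);
  copy.allocateNew();
  int rowCount = source.getRowCount();
  for (int i = 0; i < source.getFieldVectors().size(); i++) {
    FieldVector from = source.getVector(i);
    FieldVector to = copy.getVector(i);
    for (int row = 0; row < rowCount; row++) {
      to.copyFromSafe(row, row, from);  // safe variant reallocates the target as needed
    }
    to.setValueCount(rowCount);
  }
  copy.setRowCount(rowCount);
  return copy;
}

The copy costs an extra allocation per batch, but the buffers handed to the consumer are then immune to the producer overwriting its root.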

@lidavidm
Member

lidavidm commented Jul 4, 2023

(you may have to read more batches than that to get things to fail, but I hope the idea is clear)

@hu6360567
Contributor Author

That said, I think what might be happening is that the Parquet writer may request more than one batch from the reader, and if you request to share roots, then the previous batch will be overwritten. That is, I would expect this to fail:

That explains the failure in my code. Arrow Java prefers to reuse the same root to populate data, and so does the default ArrowReader implementation, the base class used for exporting to C Data. Something like InMemoryArrowReader should fail in the same scenario.

Do we have any documentation explaining why the Java implementation prefers to reuse the same root, and why the C Data stream is bound to ArrowReader rather than Iterator<VectorSchemaRoot>?
The default ArrowReader implementation always needs to convert to a RecordBatch once if only loadNextBatch is overridden.

@lidavidm
Member

lidavidm commented Jul 4, 2023

There just isn't a good interface for this in Java. Arrow Java was designed differently from the C++ implementation.

@lidavidm
Member

lidavidm commented Jul 4, 2023

I suppose the C Data implementation should unload from the root when it exports.

@hu6360567 hu6360567 changed the title [Java][Python] export ArrowVectorIterator to python fails randomly, if reuseVectorSchemaRoot enabled [Java][Python] Avoid Reuse VectorSchemaRoot for exporting ArrowArrayStream to other language Jul 7, 2023
@lidavidm lidavidm reopened this Jul 7, 2023
@lidavidm
Member

lidavidm commented Jul 7, 2023

We should fix this behavior in Java (and ensure it's tested) since it is a surprise.

@lidavidm
Member

lidavidm commented Jul 7, 2023

CC @davisusanibar

@davisusanibar
Contributor

Hi @hu6360567, sorry to join late.

Could you help me validate whether this is working on your side, please?

I'm able to create a Parquet file with data obtained from the database using the JDBC adapter and then use the C Data Interface to read it from the Python side:

Testing:

1. Create jar with dependencies: `mvn clean package`
2. Print log for data read: `python src/main/java/org/example/consumer/consumerReaderAPI.py 2 True log`
3. Create parquet file: `python src/main/java/org/example/consumer/consumerReaderAPI.py 2 True parquet`
4. Validate parquet files: `parquet-tools cat jdbc/parquet/part-0.parquet`

Java producer of the Arrow reader:

package org.example.cdata;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Types;
import java.util.HashMap;

import org.apache.arrow.adapter.jdbc.ArrowVectorIterator;
import org.apache.arrow.adapter.jdbc.JdbcFieldInfo;
import org.apache.arrow.adapter.jdbc.JdbcToArrow;
import org.apache.arrow.adapter.jdbc.JdbcToArrowConfig;
import org.apache.arrow.adapter.jdbc.JdbcToArrowConfigBuilder;
import org.apache.arrow.adapter.jdbc.JdbcToArrowUtils;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;
import org.apache.ibatis.jdbc.ScriptRunner;

public class JavaReaderApi {
  final static BufferAllocator allocator = new RootAllocator();

  public static BufferAllocator getAllocatorForJavaConsumers() {
    return allocator;
  }

  public static ArrowReader getArrowReaderForJavaConsumers(int batchSize, boolean reuseVSR) {
    System.out.println("Java Parameters: BatchSize = " + batchSize + ", reuseVSR = " + reuseVSR);
    String query = "SELECT int_field1, bool_field2, bigint_field5, char_field16, list_field19 FROM TABLE1";
    final Connection connection;
    try {
      connection = DriverManager.getConnection("jdbc:h2:mem:h2-jdbc-adapter");
    } catch (SQLException e) {
      throw new RuntimeException(e);
    }
    final ScriptRunner runnerDDLDML = new ScriptRunner(connection);
    runnerDDLDML.setLogWriter(null);
    try {
      runnerDDLDML.runScript(new BufferedReader(
          new FileReader("./src/main/resources/h2-ddl.sql")));
    } catch (FileNotFoundException e) {
      throw new RuntimeException(e);
    }
    try {
      runnerDDLDML.runScript(new BufferedReader(
          new FileReader("./src/main/resources/h2-dml.sql")));
    } catch (FileNotFoundException e) {
      throw new RuntimeException(e);
    }
    final JdbcToArrowConfig config = new JdbcToArrowConfigBuilder(allocator,
        JdbcToArrowUtils.getUtcCalendar())
        .setTargetBatchSize(batchSize)
        .setReuseVectorSchemaRoot(reuseVSR)
        .setArraySubTypeByColumnNameMap(
            new HashMap() {{
              put("LIST_FIELD19",
                  new JdbcFieldInfo(Types.INTEGER));
            }}
        )
        .build();
    final ResultSet resultSetConvertToParquet;
    try {
      resultSetConvertToParquet = connection.createStatement().executeQuery(query);
    } catch (SQLException e) {
      throw new RuntimeException(e);
    }
    final ArrowVectorIterator arrowVectorIterator;
    try {
      arrowVectorIterator = JdbcToArrow.sqlToArrowVectorIterator(
          resultSetConvertToParquet, config);
    } catch (SQLException e) {
      throw new RuntimeException(e);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    // get jdbc row data as an arrow reader
    final ArrowReader arrowReader = new JDBCReader(allocator, arrowVectorIterator, config);
    return arrowReader;
  }
}

class JDBCReader extends ArrowReader {
  private final ArrowVectorIterator iter;
  private final JdbcToArrowConfig config;
  private VectorSchemaRoot root;
  private boolean firstRoot = true;

  public JDBCReader(BufferAllocator allocator, ArrowVectorIterator iter, JdbcToArrowConfig config) {
    super(allocator);
    this.iter = iter;
    this.config = config;
  }

  @Override
  public boolean loadNextBatch() throws IOException {
    if (firstRoot) {
      firstRoot = false;
      return true;
    }
    else {
      if (iter.hasNext()) {
        if (root != null && !config.isReuseVectorSchemaRoot()) {
          root.close();
        }
        else {
          root.allocateNew();
        }
        root = iter.next();
        return root.getRowCount() != 0;
      }
      else {
        return false;
      }
    }
  }

  @Override
  public long bytesRead() {
    return 0;
  }

  @Override
  protected void closeReadSource() throws IOException {
    if (root != null && !config.isReuseVectorSchemaRoot()) {
      root.close();
    }
  }

  @Override
  protected Schema readSchema() throws IOException {
    return null;
  }

  @Override
  public VectorSchemaRoot getVectorSchemaRoot() throws IOException {
    if (root == null) {
      root = iter.next();
    }
    return root;
  }
}

Python Side:

import jpype
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds
import sys
from pyarrow.cffi import ffi


def getRecordBatchReader(py_stream_ptr):
    generator = getIterableRecordBatchReader(py_stream_ptr)
    schema = next(generator)
    return pa.RecordBatchReader.from_batches(schema, generator)


def getIterableRecordBatchReader(py_stream_ptr):
    with pa.RecordBatchReader._import_from_c(py_stream_ptr) as reader:
        yield reader.schema
        yield from reader


# batchSize = int(sys.argv[1]), reuseVSR = eval(sys.argv[2]), log|parquet|csv = str(sys.argv[3])
jpype.startJVM(classpath=[
    "./target/java-python-by-cdata-1.0-SNAPSHOT-jar-with-dependencies.jar"])
java_reader_api = jpype.JClass('org.example.cdata.JavaReaderApi')
java_c_package = jpype.JPackage("org").apache.arrow.c
py_stream = ffi.new("struct ArrowArrayStream*")
py_stream_ptr = int(ffi.cast("uintptr_t", py_stream))
java_wrapped_stream = java_c_package.ArrowArrayStream.wrap(py_stream_ptr)
# get reader data exported into memoryAddress
print('Python Parameters: BatchSize = ' + sys.argv[1] + ', reuseVSR = ' +
      sys.argv[2])
java_c_package.Data.exportArrayStream(
    java_reader_api.getAllocatorForJavaConsumers(),
    java_reader_api.getArrowReaderForJavaConsumers(int(sys.argv[1]),
                                                   eval(sys.argv[2])),
    java_wrapped_stream)
with getRecordBatchReader(py_stream_ptr) as streamsReaderForJava:
    # print logs
    if str(sys.argv[3]) == 'log':
        for batch in streamsReaderForJava:
            print(batch.num_rows)
            print(batch.num_columns)
            print(batch.to_pylist())
    # create parquet file
    elif str(sys.argv[3]) == 'parquet':
        ds.write_dataset(streamsReaderForJava,
                         './jdbc/parquet',
                         format="parquet")
    # create csv file
    elif str(sys.argv[3]) == 'csv':
        with csv.CSVWriter('./jdbc/csv',
                           streamsReaderForJava.schema) as writer:
            for record_batch in streamsReaderForJava:
                writer.write_batch(record_batch)
    else:
        print('Invalid parameter. Values supported are: {log, parquet, csv}')

Java POM

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.example</groupId>
  <artifactId>java-python-by-cdata</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>java-python-by-cdata</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      <maven.compiler.source>11</maven.compiler.source>
      <maven.compiler.target>11</maven.compiler.target>
      <arrow.version>12.0.0</arrow.version>
  </properties>

  <dependencies>
      <dependency>
          <groupId>org.apache.arrow</groupId>
          <artifactId>arrow-vector</artifactId>
          <version>${arrow.version}</version>
      </dependency>
      <dependency>
          <groupId>org.apache.arrow</groupId>
          <artifactId>arrow-memory-netty</artifactId>
          <version>${arrow.version}</version>
      </dependency>
      <dependency>
          <groupId>org.apache.arrow</groupId>
          <artifactId>arrow-jdbc</artifactId>
          <version>${arrow.version}</version>
      </dependency>
      <dependency>
          <groupId>org.apache.arrow</groupId>
          <artifactId>arrow-dataset</artifactId>
          <version>${arrow.version}</version>
      </dependency>
      <dependency>
          <groupId>ch.qos.logback</groupId>
          <artifactId>logback-classic</artifactId>
          <version>1.2.11</version>
      </dependency>
      <dependency>
          <groupId>org.apache.ibatis</groupId>
          <artifactId>ibatis-core</artifactId>
          <version>3.0</version>
      </dependency>
      <dependency>
          <groupId>com.h2database</groupId>
          <artifactId>h2</artifactId>
          <version>2.1.214</version>
      </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>org.example.cdata.JavaReaderApi</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

DML/DDL

h2-ddl.sql:

create table TABLE1 (
    INT_FIELD1 int,
    BOOL_FIELD2 boolean,
    TINYINT_FIELD3 smallint,
    SMALLINT_FIELD4 smallint,
    BIGINT_FIELD5 bigint,
    DECIMAL_FIELD6 numeric(14, 3),
    DOUBLE_FIELD7 double,
    REAL_FIELD8 real,
    TIME_FIELD9 time,
    DATE_FIELD10 date,
    TIMESTAMP_FIELD11 timestamp,
    BINARY_FIELD12 blob,
    VARCHAR_FIELD13 varchar(256),
    BLOB_FIELD14 blob,
    CLOB_FIELD15 clob,
    CHAR_FIELD16 char(16),
    BIT_FIELD17 boolean,
    NULL_FIELD18 null,
    LIST_FIELD19 int array
);

h2-dml.sql:

INSERT INTO table1 VALUES (101, 1, 45, 12000, 1000000000300.0000001, 17345667789.111, 56478356785.345, 56478356785.345, PARSEDATETIME('12:45:35 GMT', 'HH:mm:ss z'),
                           PARSEDATETIME('2018-02-12 GMT', 'yyyy-MM-dd z'), PARSEDATETIME('2018-02-12 12:45:35 GMT', 'yyyy-MM-dd HH:mm:ss z'),
                           '736f6d6520746578742074686174206e6565647320746f20626520636f6e76657274656420746f2062696e617279', 'some text that needs to be converted to varchar',
                           '736f6d6520746578742074686174206e6565647320746f20626520636f6e76657274656420746f2062696e617279', 'some text that needs to be converted to clob', 'some char text', 1, null, ARRAY[1, 2, 3]);

INSERT INTO table1 VALUES (102, 1, 45, 12000, 100000000030.00000011, 17345667789.222, 56478356785.345, 56478356785.345, PARSEDATETIME('12:45:35 GMT', 'HH:mm:ss z'),
                           PARSEDATETIME('2018-02-12 GMT', 'yyyy-MM-dd z'), PARSEDATETIME('2018-02-12 12:45:35 GMT', 'yyyy-MM-dd HH:mm:ss z'),
                           '736f6d6520746578742074686174206e6565647320746f20626520636f6e76657274656420746f2062696e617279', 'some text that needs to be converted to varchar',
                           '736f6d6520746578742074686174206e6565647320746f20626520636f6e76657274656420746f2062696e617279', 'some text that needs to be converted to clob', 'some char text', 1, null, ARRAY[1, 2]);

INSERT INTO table1 VALUES (103, 1, 45, 12000, 10000000003.000000111, 17345667789.333, 56478356785.345, 56478356785.345, PARSEDATETIME('12:45:35 GMT', 'HH:mm:ss z'),
                           PARSEDATETIME('2018-02-12 GMT', 'yyyy-MM-dd z'), PARSEDATETIME('2018-02-12 12:45:35 GMT', 'yyyy-MM-dd HH:mm:ss z'),
                           '736f6d6520746578742074686174206e6565647320746f20626520636f6e76657274656420746f2062696e617279', 'some text that needs to be converted to varchar',
                           '736f6d6520746578742074686174206e6565647320746f20626520636f6e76657274656420746f2062696e617279', 'some text that needs to be converted to clob', 'some char text', 1, null, ARRAY[1]);

@davisusanibar
Contributor

Alternatively, you might want to consider https://github.com/davisusanibar/java-python-by-cdata

@lidavidm
Member

@davisusanibar the 'right' fix is probably to unload the root when exporting and to make exports work in terms of ArrowRecordBatch instead of VectorSchemaRoot
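
A rough sketch of that shape (illustrative only, not the actual implementation): snapshot the reader's current root into an ArrowRecordBatch at export time, e.g.

// Sketch only: take an ArrowRecordBatch snapshot of the reader's root at export time and
// export that instead of the live root. Whether this alone is sufficient still depends on
// the producer not overwriting the underlying buffers in place.
try (ArrowRecordBatch batch = new VectorUnloader(reader.getVectorSchemaRoot()).getRecordBatch()) {
  // hand `batch` to the C Data exporter instead of the VectorSchemaRoot itself
}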

@davisusanibar
Contributor

@davisusanibar the 'right' fix is probably to unload the root when exporting and to make exports work in terms of ArrowRecordBatch instead of VectorSchemaRoot

Yes, that is my second step. First I want to validate that it is working, then I can create a new issue for that enhancement.

@davisusanibar
Contributor

I have a doubt about how to close resources properly: all of them stay open, because none of them are inside try-with-resources.

  1. Is it okay to expose a Java method that closes all open resources and invoke it from Python? Or, what is the best way to close resources on the Java side?
  2. Is it also necessary to close something on the Python side?
  3. What is the best way / tool / mechanism / command line to check whether there are still resources that need to be closed on the Python or Java side? It works, but I'd like to know how to check for that kind of 'invisible' problem behind the scenes.

@lidavidm
Member

It would generally look like

try (final ArrowRecordBatch batch = ...) {
  exportBatch(batch); // increments reference count
}
// batch is not freed here unless exportBatch threw (but reference count is decreased)
// when Python frees the arrow::RecordBatch, the C Data callback will
// decrement the reference count and actually free the batch

The only problem is the lifetime of the BufferAllocator, which isn't solvable in the C Data Interface; either the Java application needs to have a singleton allocator, or (preferably) it should tie the lifetime of the allocator to something else (e.g. an init/finalize call, or, if the Java application implements a reader or some other object, it should tie the allocator to that object).
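
One shape the allocator ownership can take, as a sketch under the assumption of an explicit init/finalize call driven by the embedding application (ExportContext is an illustrative name, not an Arrow API):

// Illustrative sketch: tie the export allocator to an object with an explicit lifetime
// (created at init, closed at finalize) instead of relying on garbage collection.
public final class ExportContext implements AutoCloseable {
  private final BufferAllocator allocator;

  public ExportContext(BufferAllocator parent) {
    // child allocator dedicated to buffers exported over the C Data Interface
    this.allocator = parent.newChildAllocator("c-data-export", 0, Long.MAX_VALUE);
  }

  public BufferAllocator allocator() {
    return allocator;
  }

  @Override
  public void close() {
    // only safe once the consumer has released every imported batch and stream
    allocator.close();
  }
}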

@hu6360567
Contributor Author

The only problem is the lifetime of the BufferAllocator which isn't solvable in C Data Interface; either the Java application needs to have a singleton allocator or (preferably) it should tie the lifetime of the allocator to something else (e.g. an init/finalize call, or if the Java application implements a reader or some other object, it should tie the allocator to that object)

We cannot use a tightly scoped BufferAllocator, since the Python GC is not very reliable. We are using a separate allocator for exported streams that outlives them.
When a record_batch on the Python side is garbage collected, it makes a callback to the owner to release the memory. But it does not always happen as expected, since there is no RAII-like object that lives exactly as long as a with statement; you may check the earlier discussion on the mailing list:
https://lists.apache.org/thread/glog4g0gkm1c7lz7nx6opso7d97sot49

@davisusanibar
Contributor

Just created this PR for the cookbook; I would appreciate it if you could help me validate the notes it mentions:
apache/arrow-cookbook#325

    For Python Consumer and Java Producer, please consider:

    - The Root Allocator should be shared for all memory allocations.

    - The Python application may shut down the Java JVM while Java JNI C Data is still releasing exported objects, which is why some guards have been implemented to protect against such scenarios. The warning message "WARNING: Failed to release Java C Data resource" indicates this scenario.

    - We do not know when the Root Allocator will be closed. For this reason, the Root Allocator should survive until all exported/imported objects have been released. Here is an example of this scenario:

        + Whenever Java code calls `allocator.close` too early, a memory leak will occur, since many objects still have to be released on either the Python or the Java JNI side.

        + To solve memory leak problems, you would have to call Java `allocator.close` only after Python and Java JNI have released all of their objects, which is impossible to guarantee in practice.

    - In addition, Java applications should expose a method for closing all Java-created objects independently from Root Allocators.
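
A small sketch of what that last note could look like (names are illustrative; AutoCloseables is the Arrow utility class):

// Illustrative sketch: register everything created for export in one place and expose a
// single close method that the Python side can call explicitly, independently of when
// the root allocator itself is closed.
public final class ExportedResources {
  private static final java.util.List<AutoCloseable> RESOURCES = new java.util.ArrayList<>();

  public static synchronized <T extends AutoCloseable> T register(T resource) {
    RESOURCES.add(resource);
    return resource;
  }

  public static synchronized void closeAll() throws Exception {
    // org.apache.arrow.util.AutoCloseables closes every registered resource
    org.apache.arrow.util.AutoCloseables.close(RESOURCES);
    RESOURCES.clear();
  }
}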

@suibianwanwank

Hi, I’m new to Arrow, and I’ve encountered a similar issue where I need to convert an ArrowVectorIterator to an ArrowArrayStream. Do you have any good suggestions on how to approach this? Thanks for your advice.
