Add indexing for Snowflake parquet files #2645

Draft: wants to merge 5 commits into base: master

2 changes: 1 addition & 1 deletion bin/run.sh
@@ -1,3 +1,3 @@
#!/bin/sh

-java -cp `ls target/*-fatjar.jar` -Xms512M -Xmx192G --add-modules jdk.incubator.vector $@
+java -cp `ls target/*-fatjar.jar` -Xms512M -Xmx512G --add-modules jdk.incubator.vector $@
src/main/java/io/anserini/collection/ParquetDenseVectorCollection.java
@@ -137,24 +137,58 @@
// Read each record from the Parquet file
while ((record = reader.read()) != null) {
// Extract the docid (String) from the record
-String docid = record.getString("docid", 0);
+String docid = record.getString("doc_id", 0);

ids.add(docid);

// Extract the vector (double[]) from the record
-Group vectorGroup = record.getGroup("vector", 0); // Access the 'vector' field
+Group vectorGroup = record.getGroup("embedding", 0); // Access the 'embedding' field

int vectorSize = vectorGroup.getFieldRepetitionCount(0); // Get the number of elements in the vector
double[] vector = new double[vectorSize];
for (int i = 0; i < vectorSize; i++) {
Group listGroup = vectorGroup.getGroup(0, i); // Access the 'list' group
-vector[i] = listGroup.getDouble("element", 0); // Get the double value from the 'element' field
+vector[i] = listGroup.getFloat("element", 0); // Get the float value from the 'element' field (widened to double on assignment)

}
vector = normalizeVector(vector);

vectors.add(vector);
}

reader.close();
currentIndex = 0;
}
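
Note on the getFloat change: storing the result of getFloat in a double[] slot relies on Java's implicit float-to-double widening. The widened value is exactly the 32-bit float, not the nearest double to the original decimal, which can matter if indexed values are later compared against double-precision references. The following is a minimal sketch of that behavior, assuming the 'element' column holds 32-bit floats as the new call suggests. The class and method names are hypothetical and not part of this PR.

```java
final class FloatWideningSketch {
  /**
   * Copies float values into a double[]; each assignment is an implicit
   * widening conversion, the same kind the diff relies on.
   */
  static double[] widen(float[] values) {
    double[] out = new double[values.length];
    for (int i = 0; i < values.length; i++) {
      out[i] = values[i]; // float -> double, exact and lossless
    }
    return out;
  }

  public static void main(String[] args) {
    double[] widened = widen(new float[] {0.5f, 0.1f});
    // 0.5 is exactly representable as a float, so it survives unchanged.
    System.out.println(widened[0] == 0.5);
    // 0.1f widened to double is not the same value as the double literal 0.1.
    System.out.println(widened[1] == 0.1);
    // Narrowing back recovers the original float exactly.
    System.out.println((float) widened[1] == 0.1f);
  }
}
```

Widening never loses information, so the stored vectors carry float precision inside a double[] container.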

/**
* Computes the L2 norm (Euclidean norm) of a vector.
* @param vector the vector to compute the norm of
* @return the L2 norm of the vector
*/
private static double computeL2Norm(double[] vector) {
double sumOfSquares = 0.0;

for (double v : vector) {
sumOfSquares += v * v;

}
return Math.sqrt(sumOfSquares);

}

/**
* Normalizes a vector to have a norm of 1.
* @param vector the vector to normalize
* @return a new vector that is the normalized version of the input vector
*/
private static double[] normalizeVector(double[] vector) {
double norm = computeL2Norm(vector);
double[] normalizedVector = new double[vector.length];

if (norm == 0) {
throw new IllegalArgumentException("Zero vector cannot be normalized.");

}

for (int i = 0; i < vector.length; i++) {
normalizedVector[i] = vector[i] / norm;

}

return normalizedVector;

}
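
The two helpers above amount to standard L2 normalization: each component is divided by the square root of the sum of squared components. Below is a self-contained sketch for checking that behavior in isolation. It mirrors the logic in the diff, including the exception on a zero vector, but the class name VectorNormalizationSketch and the shortened method names are hypothetical and not part of this PR.

```java
import java.util.Arrays;

final class VectorNormalizationSketch {
  /** Returns the Euclidean (L2) norm of the vector. */
  static double l2Norm(double[] vector) {
    double sumOfSquares = 0.0;
    for (double v : vector) {
      sumOfSquares += v * v;
    }
    return Math.sqrt(sumOfSquares);
  }

  /** Returns a unit-length copy of the input; rejects the zero vector. */
  static double[] normalize(double[] vector) {
    double norm = l2Norm(vector);
    if (norm == 0) {
      throw new IllegalArgumentException("Zero vector cannot be normalized.");
    }
    double[] normalized = new double[vector.length];
    for (int i = 0; i < vector.length; i++) {
      normalized[i] = vector[i] / norm;
    }
    return normalized;
  }

  public static void main(String[] args) {
    // A 3-4-5 triangle makes the result easy to verify by hand.
    double[] unit = normalize(new double[] {3.0, 4.0});
    System.out.println(Arrays.toString(unit));
    System.out.println(l2Norm(unit));
  }
}
```

A check such as normalize(new double[] {3.0, 4.0}) returning approximately {0.6, 0.8}, together with the zero-vector exception, could serve as a starting point for a unit test of the new methods.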

/**
* Reads the next document in the segment.
*