1) Dataframes
A dataframe is a set of named columns that together represent a grid of data.
You can create a dataframe using:
// A dataframe contains columns
// A column is composed of a name and set of data
var df = Dataframes.create(
new Column<>("x", 0, 1, 2, 3, 4, 5),
new Column<>("y", 0, 1, 2, 3, 4, 5)
);
// Displays the last 10 lines of the dataframe
df.tail();
You can also print the content of the dataframe this way:
// Defines the number of rows you want to display from the top
df.show(10);
// Defines an interval
df.show(0, 10);
You can also use lists to declare your columns:
var df = Dataframes.create(
new Column<>("x", IntStream.range(0, 10).boxed().collect(Collectors.toList())),
new Column<>("y", LongStream.range(0, 10).boxed().collect(Collectors.toList())),
new Column<>("z", List.of(0D, 1D, 2D, 3D, 4D, 5D, 6D, 7D, 8D, 9D))
);
df.tail();
You can also build a dataframe from an array of column names and a set of rows:
var df = Dataframes.create(
new String[]{"x", "y", "z"}, new Row(1, 2, 3), new Row(4, 5, 6), new Row(7, 8, 8), new Row(10, 11, 12)
);
df.tail();
You can also load a dataframe from a CSV file:
// Loads a CSV using "," as the separator and "\"" as the enclosure by default,
// and uses the first line as the header for column names
var df = Dataframes.csv("/path/to/filename.csv");
// Here you define which separator you want to use
var df = Dataframes.csv("/path/to/filename.csv", ";");
// You can also choose to have no header; the column names will then default to
// c0, c1, ...
var df = Dataframes.csv("/path/to/filename.csv", ";", "\'", false);
To get your data ready for ML operations, you can use the TrainTestDataframe decorator:
// Use the dedicated methods to create your dataframes
var df = Dataframes.trainTest(
new Column<>("x", 0, 1, 2, 3, 4, 5),
new Column<>("y", 0, 1, 2, 3, 4, 5)
);
df = Dataframes.trainTest(
new String[]{"x", "y", "z"}, new Row(1, 2, 3), new Row(4, 5, 6), new Row(7, 8, 8), new Row(10, 11, 12)
);
df = Dataframes.csvTrainTest("/path/to/filename.csv", ";", "\"", false);
// You can then set your split threshold (between 0 and 1), shuffle your data, and then get your train/test split data
// Default split value is 0.75
var dfSplit = df.setSplitValue(0.65).shuffle().split();
dfSplit.train().tail();
dfSplit.test().tail();
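To make the split threshold concrete, here is a minimal sketch in plain Java of what a shuffled train/test split does to a list of rows. It is not the library's implementation: `TrainTestSplitSketch`, `Split`, `shuffleAndSplit` and the `seed` parameter are hypothetical names, and rounding the cut index to the nearest row is an assumption (the library may truncate instead).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of the library's API.
class TrainTestSplitSketch {

    // Simple holder for the two parts of a split.
    static final class Split<T> {
        private final List<T> train;
        private final List<T> test;

        Split(List<T> train, List<T> test) {
            this.train = train;
            this.test = test;
        }

        List<T> train() { return train; }
        List<T> test() { return test; }
    }

    // Shuffles a copy of the rows, then assigns the first `splitValue`
    // fraction to the train part and the remainder to the test part.
    static <T> Split<T> shuffleAndSplit(List<T> rows, double splitValue, long seed) {
        if (splitValue < 0 || splitValue > 1) {
            throw new IllegalArgumentException("splitValue must be between 0 and 1");
        }
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed));
        // Assumption: the cut index is rounded to the nearest row.
        int cut = (int) Math.round(shuffled.size() * splitValue);
        return new Split<>(
            new ArrayList<>(shuffled.subList(0, cut)),
            new ArrayList<>(shuffled.subList(cut, shuffled.size()))
        );
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            rows.add(i);
        }
        Split<Integer> split = shuffleAndSplit(rows, 0.65, 42L);
        System.out.println(split.train().size()); // prints 13
        System.out.println(split.test().size());  // prints 7
    }
}
```

With 20 rows and a split value of 0.65, 13 rows land in the train part and 7 in the test part. The fixed seed only makes the shuffle reproducible.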
If you only want to select a set of your columns within your dataframe:
var df = Dataframes.create(
new Column<>("x", IntStream.range(0, 10).boxed().collect(Collectors.toList())),
new Column<>("y", LongStream.range(0, 10).boxed().collect(Collectors.toList())),
new Column<>("z", List.of(0D, 1D, 2D, 3D, 4D, 5D, 6D, 7D, 8D, 9D))
);
var newDf = df.select("x", "z");
You can add a column on the go:
// Will add the "w" column
var newDf = df.addColumn(new Column<>("w", List.of(0D, 1D, 2D, 3D, 4D, 5D, 6D, 7D, 8D, 9D)));
Whenever you wish to remove columns:
// Keep all but "x" and "z" columns
var newDf = df.mapWithout("x", "z");
You can supply a new column with a Java Supplier:
// Will supply a new column with the value of the supplier
var newDf = df.map("expOfZero", () -> Math.exp(0));
You can add a new column based on the value of another using Java Functions:
// This will create the column "expOfX"
// and will apply "Math::exp" from the "x"
// column to the "expOfX" column
var newDf = df.map("expOfX", Math::exp, "x");
// You can also use more complex Functions and deal with other types
var newDf = df.map("Label Name", (Integer i) -> "Number " + i, "x");
You can create a column based on the values of two other columns using Java BiFunctions:
// Same as with Java Functions, but the BiFunction receives the values
// of 2 input columns as parameters
var newDf = df.map("x * y", (Integer x, Long y) -> x * y, "x", "y");
You can filter rows of your dataframe using Java Predicates:
// This will only keep the rows where the value of the column "x" is prime
// (IntMath is assumed here to be Guava's com.google.common.math.IntMath)
var newDf = df.filter("x", IntMath::isPrime);
You can also filter rows of your dataframe using Java BiPredicates:
var df = Dataframes.create(
new Column<>("a", List.of(0D, 1D, 2D, 3D, 4D, 5D)),
new Column<>("b", List.of(5D, 4D, 3D, 2D, 1D, 0D))
);
// This will only keep the rows where the value of the column "a" is greater than the value of the column "b"
var newDf = df.filter("a", "b", (Double a, Double b) -> a > b);
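To see which rows such a two-column filter keeps, the following plain Java sketch applies a `BiPredicate` row by row to two lists and returns the indices that pass. `BiPredicateFilterSketch` and `keptRows` are hypothetical names for illustration only. They are not the library's API and say nothing about how `filter` is implemented internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical helper, not part of the library's API.
class BiPredicateFilterSketch {

    // Returns the indices of the rows for which predicate.test(a[i], b[i]) is true.
    static <A, B> List<Integer> keptRows(List<A> a, List<B> b, BiPredicate<A, B> predicate) {
        if (a.size() != b.size()) {
            throw new IllegalArgumentException("Both columns must have the same number of rows");
        }
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < a.size(); i++) {
            if (predicate.test(a.get(i), b.get(i))) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Double> a = List.of(0D, 1D, 2D, 3D, 4D, 5D);
        List<Double> b = List.of(5D, 4D, 3D, 2D, 1D, 0D);
        // Same predicate as in the example above: keep rows where a > b.
        System.out.println(keptRows(a, b, (x, y) -> x > y)); // prints [3, 4, 5]
    }
}
```

For the "a" and "b" columns above, only the last three rows (a = 3, 4, 5) satisfy a > b.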
Whenever a column in your dataset contains labels, you can apply one-hot encoding to your dataframe for further use.
var df = Dataframes.create(
new Column<>("colors", List.of("red", "green", "blue", "yellow"))
);
// This will denormalize the column "colors" into 4 columns
// respectively to the number of labels (here 4)
// the value of that column will be "true" if the initial value was the
// color itself, otherwise false
var oneHotEncodeDf = df.oneHotEncode("colors");
oneHotEncodeDf.tail();
Output:
=======================================
colors ║ blue ║ green ║ red ║ yellow
=======================================
red ║ false ║ false ║ true ║ false
=======================================
green ║ false ║ true ║ false ║ false
=======================================
blue ║ true ║ false ║ false ║ false
=======================================
yellow ║ false ║ false ║ false ║ true
=======================================
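The table above can be reproduced with a short plain Java sketch of one-hot encoding: one boolean column per distinct label, set to true only on the rows that carry that label. `OneHotSketch` and its `oneHotEncode` method are hypothetical and independent of the library. Sorting the labels alphabetically is an assumption made to match the column order shown in the output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of the library's API.
class OneHotSketch {

    // Builds one boolean column per distinct label.
    // Assumption: new columns are ordered alphabetically by label.
    static Map<String, List<Boolean>> oneHotEncode(List<String> labels) {
        Map<String, List<Boolean>> columns = new LinkedHashMap<>();
        for (String label : new TreeSet<>(labels)) {
            List<Boolean> column = new ArrayList<>();
            for (String value : labels) {
                // true only on the rows whose original value is this label
                column.add(value.equals(label));
            }
            columns.put(label, column);
        }
        return columns;
    }

    public static void main(String[] args) {
        Map<String, List<Boolean>> encoded =
            oneHotEncode(List.of("red", "green", "blue", "yellow"));
        encoded.forEach((label, column) -> System.out.println(label + " -> " + column));
    }
}
```

Each row ends up with exactly one true value across the generated columns, which is what makes the encoding usable as numeric input later on.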
You can transform one of the columns of your Dataframe into an ND4J vector: INDArray
var df = Dataframes.create(
new Column<>("n", List.of(1, 2, 3, 4)),
new Column<>("colorsStr", List.of("1.1", "2", "3.2", "4.4")),
new Column<>("colors", List.of("red", "green", "blue", "yellow"))
);
// Will transform the column into an INDArray column vector (Matrix shape : [4, 1])
var numberVector = df.toVector("n");
// Output:
// [[1.0000],
// [2.0000],
// [3.0000],
// [4.0000]]
// If your strings are number-like, they will be automatically parsed into a Double or a Long
var numberVector = df.toVector("colorsStr");
// Will throw an Exception since these are labels
var numberVector = df.toVector("colors");
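The difference between number-like strings and labels can be sketched in plain Java as follows. This is only one way such parsing could work, not the library's actual code: `NumberLikeSketch` and `parseNumberLike` are hypothetical names, and trying `Long` before `Double` is an assumption.

```java
// Hypothetical helper, not part of the library's API.
class NumberLikeSketch {

    // Tries to read the string as a Long first, then as a Double.
    // Throws NumberFormatException for labels such as "red".
    static Number parseNumberLike(String raw) {
        String trimmed = raw.trim();
        try {
            return Long.valueOf(trimmed);
        } catch (NumberFormatException notALong) {
            // Not an integer: fall back to a floating-point value
            return Double.valueOf(trimmed);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseNumberLike("2"));   // prints 2
        System.out.println(parseNumberLike("1.1")); // prints 1.1
        try {
            parseNumberLike("red");
        } catch (NumberFormatException e) {
            System.out.println("Not number-like: red");
        }
    }
}
```

Under this sketch, "2" becomes a Long, "1.1" becomes a Double, and a label such as "red" fails with a `NumberFormatException`, which mirrors the behaviour described in the comments above.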
You can transform some of the columns or your entire Dataframe into an ND4J matrix : INDArray
// Let's say you have a matrix with a few values one hot encoded
var df = Dataframes.create(
new Column<>("n", List.of(1, 2, 3, 4)),
new Column<>("colorsStr", List.of("1.1", "2", "3.2", "4.4")),
new Column<>("colors", List.of("red", "green", "blue", "yellow"))
).oneHotEncode("colors");
// You can define all the columns you want to include in your Matrix
var matrixWithColInput = df.toMatrix("n", "colorsStr", "red", "green", "blue", "yellow");
// You can remove an unwanted column before turning the dataframe into a matrix
// Without defining the columns
var matrixWithoutColInput = df.mapWithout("colors").toMatrix();
// Both calls return a matrix with the shape [4, 6]