
1) Dataframes

Rémi Sultan edited this page May 27, 2021 · 11 revisions

Dataframes

A dataframe is a set of columns that together represent a grid of data.

Create and Display

Create a dataframe manually

You can create a dataframe using:

    // A dataframe contains columns
    // A column is composed of a name and set of data 
    var df = Dataframes.create(
        new Column<>("x", 0, 1, 2, 3, 4, 5),
        new Column<>("y", 0, 1, 2, 3, 4, 5)
    );

    // Displays the last 10 lines of the dataframe
    df.tail();

You can also print the content of the dataframe this way:

    // Displays the given number of rows, starting from the top
    df.show(10);

    // Displays the rows within the given interval
    df.show(0, 10);

Create a dataframe with columns using lists

You can also use lists to declare your columns:

    var df = Dataframes.create(
        new Column<>("x", IntStream.range(0, 10).boxed().collect(Collectors.toList())),
        new Column<>("y", LongStream.range(0, 10).boxed().collect(Collectors.toList())),
        new Column<>("z", List.of(0D, 1D, 2D, 3D, 4D, 5D, 6D, 7D, 8D, 9D))
    );

    df.tail();

Create a dataframe using rows and column names

    var df = Dataframes.create(
        new String[]{"x", "y", "z"}, new Row(1, 2, 3), new Row(4, 5, 6), new Row(7, 8, 8), new Row(10, 11, 12)
    );

    df.tail();

Load a dataframe from a file

    // Loads a CSV using "," as the separator and "\"" as the enclosure by default,
    // and reads the first line as the header for column names
    var df = Dataframes.csv("/path/to/filename.csv");

    // Here you define which separator you want to use
    df = Dataframes.csv("/path/to/filename.csv", ";");

    // You can also choose to have no header; the column names will then default to
    // c0, c1, ...
    df = Dataframes.csv("/path/to/filename.csv", ";", "\'", false);
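
To make the separator, enclosure, and header parameters more concrete, here is a minimal sketch in plain Java of how a single CSV line might be split. It is not the library's parser: `CsvLineSketch`, `splitLine`, and `defaultHeader` are hypothetical names, and the sketch assumes a single-character separator and enclosure and does not handle escaped enclosures.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvLineSketch {

    // Splits one line on `separator`, ignoring separators that appear
    // between two `enclosure` characters (hypothetical helper).
    static List<String> splitLine(String line, char separator, char enclosure) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean enclosed = false;
        for (char c : line.toCharArray()) {
            if (c == enclosure) {
                enclosed = !enclosed;            // toggle state, drop the enclosure itself
            } else if (c == separator && !enclosed) {
                fields.add(current.toString());  // end of a field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());          // last field
        return fields;
    }

    // Builds the fallback column names c0, c1, ... used when there is no header.
    static List<String> defaultHeader(int columnCount) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < columnCount; i++) {
            names.add("c" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> fields = splitLine("1;'a;b';3", ';', '\'');
        System.out.println(fields);                       // [1, a;b, 3]
        System.out.println(defaultHeader(fields.size())); // [c0, c1, c2]
    }
}
```

The enclosed `;` in `'a;b'` stays inside its field, which is the reason an enclosure character is needed at all.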

Train test shuffle split dataframes

To get your data ready for ML operations, you can use the TrainTestDataframe decorator:

    // Use the dedicated methods to create your dataframes
    var df = Dataframes.trainTest(
        new Column<>("x", 0, 1, 2, 3, 4, 5),
        new Column<>("y", 0, 1, 2, 3, 4, 5)
    );
    
    df = Dataframes.trainTest(
        new String[]{"x", "y", "z"}, new Row(1, 2, 3), new Row(4, 5, 6), new Row(7, 8, 8), new Row(10, 11, 12)
    );

    df = Dataframes.csvTrainTest("/path/to/filename.csv", ";", "\"", false);

    // You can then set your split threshold (between 0 and 1), shuffle your data,
    // and get your train and test dataframes
    // The default split value is 0.75
    var dfSplit = df.setSplitValue(0.65).shuffle().split();
    dfSplit.train().tail();
    dfSplit.test().tail();
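
The shuffle-then-split step can be pictured with a few lines of plain Java. This is only a sketch of the idea, not the library's implementation: `SplitSketch`, `shuffleSplit`, and the `seed` parameter are hypothetical, and cutting at the floor of `size * splitValue` is an assumption about how rounding is handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SplitSketch {

    // Shuffles a copy of the rows, then cuts it into [train, test]
    // at floor(size * splitValue). The rounding rule is an assumption.
    static <T> List<List<T>> shuffleSplit(List<T> rows, double splitValue, long seed) {
        if (splitValue < 0 || splitValue > 1) {
            throw new IllegalArgumentException("splitValue must be between 0 and 1");
        }
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed)); // seeded for reproducibility
        int cut = (int) Math.floor(shuffled.size() * splitValue);
        List<T> train = new ArrayList<>(shuffled.subList(0, cut));
        List<T> test = new ArrayList<>(shuffled.subList(cut, shuffled.size()));
        return List.of(train, test);
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            rows.add(i);
        }
        List<List<Integer>> parts = shuffleSplit(rows, 0.75, 42L);
        System.out.println("train: " + parts.get(0));
        System.out.println("test:  " + parts.get(1));
    }
}
```

Shuffling before cutting matters when the source data is ordered, for example sorted by label, since an unshuffled cut would put whole classes in only one of the two sets.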

Manipulate Dataframes

Select a set of columns

If you only want to select a set of your columns within your dataframe:

    var df = Dataframes.create(
        new Column<>("x", IntStream.range(0, 10).boxed().collect(Collectors.toList())),
        new Column<>("y", LongStream.range(0, 10).boxed().collect(Collectors.toList())),
        new Column<>("z", List.of(0D, 1D, 2D, 3D, 4D, 5D, 6D, 7D, 8D, 9D))
    );

    var newDf = df.select("x", "z");

Add a column

You can add a column on the go:

    // Will add the "w" column
    var newDf = df.addColumn(new Column<>("w", List.of(0D, 1D, 2D, 3D, 4D, 5D, 6D, 7D, 8D, 9D)));

Remove columns

Whenever you wish to remove columns:

    // Keep all but "x" and "z" columns
    var newDf = df.mapWithout("x", "z");

Supply a column

You can supply a new column with a Java Supplier:

    // Will supply a new column with the value of the supplier
    var newDf = df.map("expOfZero", () -> Math.exp(0));

Supply a column from an existing one

You can add a new column based on the values of another using Java Functions:

    // This will create the column "expOfX" by applying
    // "Math::exp" to each value of the "x" column
    var newDf = df.map("expOfX", Math::exp, "x");

    // You can also use more complex Functions and deal with other types
    var labeledDf = df.map("Label Name", (Integer i) -> "Number " + i, "x");

Supply a column from two existing columns

You can create a column based on the values of two others using Java BiFunctions:

    // Same as with Java Functions, but the BiFunction receives the values of
    // 2 input columns as parameters
    var newDf = df.map("x * y", (Integer x, Long y) -> x * y, "x", "y");
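
As a rough illustration of what combining two columns row by row amounts to, the sketch below applies a BiFunction to two plain Java lists. It is not how the library is implemented; `BiMapSketch` and `combine` are hypothetical names, and requiring both columns to have the same length is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class BiMapSketch {

    // Applies `fn` to the values found at the same index in both columns
    // and collects the results into a new column (hypothetical helper).
    static <A, B, R> List<R> combine(List<A> left, List<B> right, BiFunction<A, B, R> fn) {
        if (left.size() != right.size()) {
            throw new IllegalArgumentException("Both columns must have the same length");
        }
        List<R> result = new ArrayList<>(left.size());
        for (int i = 0; i < left.size(); i++) {
            result.add(fn.apply(left.get(i), right.get(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> x = List.of(0, 1, 2, 3);
        List<Long> y = List.of(0L, 1L, 2L, 3L);
        // Integer * Long is computed as a long, so the new column holds Long values
        List<Long> product = combine(x, y, (Integer a, Long b) -> a * b);
        System.out.println(product); // [0, 1, 4, 9]
    }
}
```

Note that the result type follows Java's numeric promotion rules: multiplying an Integer by a Long yields a Long column.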

Filter rows based on a column value

You can filter rows of your dataframe using Java Predicates:

    // This will only keep the rows where the value of the column "x" is prime
    var newDf = df.filter("x", IntMath::isPrime);
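
For readers who want to see the Predicate mechanics in isolation, here is a small plain-Java sketch. It filters a single list rather than whole dataframe rows, and `FilterSketch`, `keep`, and the hand-written `isPrime` are hypothetical stand-ins used so that the example has no external dependency.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FilterSketch {

    // Simple trial-division primality test (stand-in for a library method).
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int i = 2; (long) i * i <= n; i++) {
            if (n % i == 0) {
                return false;
            }
        }
        return true;
    }

    // Keeps only the values accepted by the predicate (hypothetical helper).
    static <T> List<T> keep(List<T> column, Predicate<T> predicate) {
        return column.stream().filter(predicate).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> x = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        System.out.println(keep(x, FilterSketch::isPrime)); // [2, 3, 5, 7]
    }
}
```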

Filter rows based on 2 column values

You can also filter rows of your dataframe using Java BiPredicates:

    var df = Dataframes.create(
        new Column<>("a", List.of(0D, 1D, 2D, 3D, 4D, 5D)),
        new Column<>("b", List.of(5D, 4D, 3D, 2D, 1D, 0D))
    );

    // This will only keep the rows where the value of column "a" is greater than the value of column "b"
    var newDf = df.filter("a", "b", (Double a, Double b) -> a > b);

Matrix Operations

One Hot Encoding

Whenever a column in your dataset contains labels, you can one hot encode it within your dataframe for further use.

    var df = Dataframes.create(
        new Column<>("colors", List.of("red", "green", "blue", "yellow"))
    );

    // This will expand the column "colors" into 4 columns,
    // one per distinct label (here 4).
    // The value of each new column is "true" if the initial value was
    // that color, otherwise "false"
    var oneHotEncodeDf = df.oneHotEncode("colors");
    oneHotEncodeDf.tail();
 

Output:

=======================================
colors ║  blue ║ green ║   red ║ yellow
=======================================
   red ║ false ║ false ║  true ║  false
=======================================
 green ║ false ║  true ║ false ║  false
=======================================
  blue ║  true ║ false ║ false ║  false
=======================================
yellow ║ false ║ false ║ false ║   true
=======================================
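
The encoding shown in the table can be reproduced with a short plain-Java sketch. This is an illustration of the technique under stated assumptions, not the library's code: `OneHotSketch` and its `oneHotEncode` method are hypothetical, and ordering the new columns alphabetically (via a TreeMap) is an assumption chosen to match the output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OneHotSketch {

    // Returns one boolean column per distinct label, keyed by label.
    // A TreeMap keeps the labels in alphabetical order (assumption).
    static Map<String, List<Boolean>> oneHotEncode(List<String> labels) {
        Map<String, List<Boolean>> encoded = new TreeMap<>();
        for (String label : labels) {
            encoded.putIfAbsent(label, new ArrayList<>());
        }
        // For every row, mark "true" in the column of its label, "false" elsewhere
        for (String label : labels) {
            for (Map.Entry<String, List<Boolean>> column : encoded.entrySet()) {
                column.getValue().add(column.getKey().equals(label));
            }
        }
        return encoded;
    }

    public static void main(String[] args) {
        Map<String, List<Boolean>> encoded =
            oneHotEncode(List.of("red", "green", "blue", "yellow"));
        encoded.forEach((label, column) -> System.out.println(label + " -> " + column));
    }
}
```

Each row ends up with exactly one "true" value across the new columns, which is what makes the representation "one hot".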

Vectors

You can transform one of the columns of your Dataframe into an ND4J vector (INDArray):

    var df = Dataframes.create(
        new Column<>("n", List.of(1, 2, 3, 4)),
        new Column<>("colorsStr", List.of("1.1", "2", "3.2", "4.4")),
        new Column<>("colors", List.of("red", "green", "blue", "yellow"))
    );
    
    // Will transform the column into an INDArray column vector (matrix shape: [4, 1])
    var numberVector = df.toVector("n");
    // Output:
    // [[1.0000],
    //  [2.0000],
    //  [3.0000],
    //  [4.0000]]

    // If your strings are number-like, they are automatically parsed into a Double or a Long
    var parsedVector = df.toVector("colorsStr");

    // Will throw an Exception since these are labels
    var labelVector = df.toVector("colors");
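
The "number-like" rule can be sketched in plain Java as follows. This is a guess at the general idea rather than the library's actual parsing logic: `NumberLikeSketch`, `parseNumberLike`, and `toColumn` are hypothetical names, and trying Long before Double as well as throwing an IllegalArgumentException for labels are assumptions.

```java
import java.util.Arrays;
import java.util.List;

public class NumberLikeSketch {

    // Parses a string into a Long when it is integral, a Double otherwise,
    // and rejects anything that is not number-like (hypothetical helper).
    static Number parseNumberLike(String raw) {
        try {
            return Long.parseLong(raw);
        } catch (NumberFormatException notALong) {
            // Fall through and try a decimal value instead
        }
        try {
            return Double.parseDouble(raw);
        } catch (NumberFormatException notADouble) {
            throw new IllegalArgumentException("Not number-like: " + raw, notADouble);
        }
    }

    // Converts a whole column of strings into primitive doubles.
    static double[] toColumn(List<String> values) {
        return values.stream()
            .mapToDouble(value -> parseNumberLike(value).doubleValue())
            .toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toColumn(List.of("1.1", "2", "3.2", "4.4"))));
        // prints [1.1, 2.0, 3.2, 4.4]

        try {
            toColumn(List.of("red", "green"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints Not number-like: red
        }
    }
}
```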

Matrix

You can transform some of the columns, or your entire Dataframe, into an ND4J matrix (INDArray):

    // Let's say you have a dataframe with a few values one hot encoded
    var df = Dataframes.create(
        new Column<>("n", List.of(1, 2, 3, 4)),
        new Column<>("colorsStr", List.of("1.1", "2", "3.2", "4.4")),
        new Column<>("colors", List.of("red", "green", "blue", "yellow"))
    ).oneHotEncode("colors");
    
    // You can list all the columns you want to include in your matrix
    var matrixWithColInput = df.toMatrix("n", "colorsStr", "red", "green", "blue", "yellow");

    // Or you can remove an unwanted column before turning the dataframe into a matrix,
    // without listing the columns
    var matrixWithoutColInput = df.mapWithout("colors").toMatrix();

    // Both calls return a matrix with the shape [4, 6]