
[Doubts] Custom spark physical operator Just like AlreadySortedExec #5

Open
dragno99 opened this issue Sep 20, 2024 · 2 comments

@dragno99

Hello Vladimir Prus, I read your blog on Medium; it was very interesting and I learned a lot from it. Following it, I tried to write a custom physical operator that derives an extra column (i.e. appends an extra UTF8 string at the end of each InternalRow), and after that I apply a groupBy aggregation. Everything seems to work, but I see that only 1 element takes part in the groupBy aggregation, whereas when I simply use mapPartitions to derive that column, many elements take part in the shuffle stage and the output is correct.

I need your help and suggestions to resolve this issue.
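For reference, a minimal sketch of the mapPartitions-based variant described above, which does produce the correct output (the In/Out schema and the queryInBulk stub here are hypothetical stand-ins, not the real code):

  import org.apache.spark.sql.{Dataset, SparkSession}

  case class In(key: String)
  case class Out(key: String, derived: String)

  // Stand-in for the real bulk lookup performed once per partition.
  def queryInBulk(keys: Seq[String]): Map[String, String] =
    keys.map(k => k -> s"derived-$k").toMap

  def withDerivedColumn(spark: SparkSession, ds: Dataset[In]): Dataset[Out] = {
    import spark.implicits._
    ds.mapPartitions { it =>
      val rows = it.toSeq                        // buffer the partition
      val lookup = queryInBulk(rows.map(_.key))  // one bulk query per partition
      rows.iterator.map(r => Out(r.key, lookup.getOrElse(r.key, null)))
    }
  }

  // A downstream aggregation over the derived column then shuffles all rows:
  // withDerivedColumn(spark, input).groupBy("derived").count()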

@vprus
Collaborator

vprus commented Sep 23, 2024

Can you put together a minimal example to reproduce this problem? E.g. as a gist at gist.github.com?

@dragno99
Author

Hi, apologies for my late response.

So when I wrote my doExecute() method like this, only 1 record took part in the shuffle stage (it seems HashAggregate was getting the same hash value for every row):

  override protected def doExecute(): RDD[InternalRow] = {
    val func = (partitionIndex: Int, it: Iterator[InternalRow]) => {

      val inputs = UnsafeProjection.create(child.output, output)
      inputs.initialize(partitionIndex)

      val retRows: ArrayBuffer[UnsafeRow] = ArrayBuffer()

      val queryKeys: util.List[String] = new util.ArrayList[String]()

      val keyIdx = child.schema.fieldIndex(keyFields.head.name)
      
      // here I am gathering all keys for a bulk query
      it.foreach(row => {
        val r =  inputs(row)
        if(!r.isNullAt(keyIdx)) {
          val key = r.getUTF8String(keyIdx).toString
          if(key != "") queryKeys.add(key)
        }
        retRows += r
      })

      var map: util.Map[String, Array[String]] = new util.HashMap[String, Array[String]]

      if (!queryKeys.isEmpty) {
        map = queryInBulk(queryKeys)
      }

      val rowWriter: UnsafeRowWriter = new UnsafeRowWriter(newAttributeReference.size)
      
      val joiner = GenerateUnsafeRowJoiner.create(child.schema, newAttributeReference.map(_.toAttribute).toStructType)

      retRows.map(row => {
        var res: UnsafeRow = null
        var queryRes = new Array[String](qualifiers.length)
        if(!row.isNullAt(keyIdx)) {
          val key = row.getUTF8String(keyIdx).toString
          queryRes = map.getOrDefault(key, new Array[String](qualifiers.length))
        }
        res = joiner.join(row, buildUnsafeRow(rowWriter, queryRes))
        res
      }).iterator
    }
    child.execute().mapPartitionsWithIndex(func, preservesPartitioning = true)
  }
   
 private def buildUnsafeRow(rowWriter: UnsafeRowWriter, values: Array[String]): UnsafeRow = {
    rowWriter.reset()
    for (i <- values.indices) {
      if (values(i) == null) {
        rowWriter.setNullAt(i)
      } else {
        rowWriter.write(i, UTF8String.fromString(values(i)))
      }
    }
    rowWriter.getRow
  }
  

But when I changed my code to use the .copy() method, it started working correctly. Below are the two changes that made the above code work:

  1. val r = inputs(row).copy()

  2. res = joiner.join(row, buildUnsafeRow(rowWriter, queryRes)).copy()
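For context, here is a minimal standalone sketch of the row-reuse behaviour that the .copy() calls work around (it uses Spark's internal catalyst APIs directly; the values are only for illustration):

  import scala.collection.mutable.ArrayBuffer
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
  import org.apache.spark.sql.types.{DataType, StringType}
  import org.apache.spark.unsafe.types.UTF8String

  // UnsafeProjection writes every result into the same backing UnsafeRow, so
  // buffering the results without copy() leaves all entries pointing at the
  // last projected value -- which is why the aggregation saw a single key.
  val proj = UnsafeProjection.create(Array[DataType](StringType))
  val buffered = ArrayBuffer[UnsafeRow]()

  Seq("a", "b", "c").foreach { s =>
    val row = InternalRow(UTF8String.fromString(s))
    buffered += proj(row)            // without copy(): all entries end up as "c"
    // buffered += proj(row).copy()  // with copy(): entries stay "a", "b", "c"
  }

  buffered.foreach(r => println(r.getUTF8String(0)))

The UnsafeRow returned by the generated UnsafeRowJoiner is reused across calls in the same way, which would explain why the second .copy() is also needed.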
