[NU-1823] Kafka source without topic schema #7066
Conversation
17b9df4 to 6ec935e (compare)
I left a few comments.
Let's start with a test using the scenario runner that shows the source works in the desired configuration.
After that we can figure out whether we can render a "content type" field instead of the "version" field for non-schemed topics.
cd3c797 to 4ef3b59 (compare)
Actionable comments posted: 9
🧹 Outside diff range and nitpick comments (14)
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/UniversalSchemaPayloadDeserializer.scala (1)
85-89: Document the plain content type handling behavior
The special handling for plain content type should be documented to explain when raw bytes are returned versus when deserialization occurs. This aligns with the PR objective of supporting topics without schemas.
```diff
+ /**
+  * Deserializes the payload based on the schema type:
+  * - For plain content type (no schema): returns raw bytes without transformation
+  * - For other JSON schemas: uses the schema's deserializer
+  */
  override def deserialize(
```
engine/flink/tests/src/test/scala/pl/touk/nussknacker/defaultmodel/KafkaJsonItSpec.scala (3)
73-74: Remove unused variable
The `longJsonInHex` variable is defined but never used in the test.
```diff
- val longJsonInHex =
-   "227b226669727374223a2022546f6d656b222c20226d6964646c65223a20224a616e222c20226c617374223a20224b6f77616c736b69227d22"
```
76-82: Remove debug code
Remove the debug-related code including the unused `BigInteger` conversion and print statements.
```diff
- val big = new BigInteger(shortJsonInHex, 16).toByteArray
-
- val str = new String(byteString)
- println(str)
- println(byteString.mkString("Array(", ", ", ")"))
- println(big.mkString("Array(", ", ", ")"))
```
66-114: Consider adding validation for message content
While the test verifies the byte-level equality, it would be valuable to also validate that the message content is a valid JSON string as expected.
```diff
  run(process) {
    val outputRecord = kafkaClient
      .createConsumer()
      .consumeWithConsumerRecord(outputTopic)
      .take(1)
      .head
    outputRecord.value() shouldBe byteString
+   // Verify the content is valid JSON
+   val outputJson = parser.parse(new String(outputRecord.value()))
+   outputJson.isRight shouldBe true
  }
```
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/source/UniversalKafkaSourceFactory.scala (3)
75-101: Reduce code duplication in schema handling
The JSON and plain cases have similar structure. Consider extracting the common logic into a helper method:
```scala
private def createRuntimeSchemaData(
    contentType: ContentTypes,
    typingResult: TypingResult
): (Option[RuntimeSchemaData[ParsedSchema]], TypingResult) = {
  val schema = contentType match {
    case ContentTypes.JSON  => ContentTypesSchemas.schemaForJson
    case ContentTypes.PLAIN => ContentTypesSchemas.schemaForPlain
  }
  (
    Some(
      RuntimeSchemaData[ParsedSchema](
        new NkSerializableParsedSchema[ParsedSchema](schema),
        Some(SchemaId.fromString(contentType.toString))
      )
    ),
    typingResult
  )
}
```
97-98: Address TODO comment about Array[Byte] handling
The comment indicates incomplete implementation for plain content type handling. Please clarify the implementation timeline or requirements.
Would you like me to help implement the Array[Byte] handling or create a GitHub issue to track this task?
80-80: Avoid using content type strings as schema IDs
Using content type strings as schema IDs could lead to conflicts if the schema registry is used in the future. Consider using dedicated schema identifiers; a short sketch of the prefix approach follows the list below.
Consider:
- Using a dedicated schema ID generation mechanism
- Adding a prefix to distinguish these special cases (e.g., "internal-json-schema")
- Documenting the schema ID allocation strategy
Also applies to: 93-93
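As a rough illustration of the prefix suggestion above (a hedged sketch with hypothetical names, assuming the existing SchemaId type is in scope; this is not code from the PR):
```scala
// Hypothetical sketch: derive clearly-marked internal schema IDs for schemaless topics
// so they cannot collide with IDs assigned by a real schema registry.
object InternalSchemaIds {
  private val Prefix = "internal-content-type-"

  // contentTypeName would be e.g. "JSON" or "PLAIN" (ContentTypes.JSON.toString)
  def forContentType(contentTypeName: String): SchemaId =
    SchemaId.fromString(Prefix + contentTypeName.toLowerCase)
}
```
A single helper like this would also give one obvious place to document the allocation strategy.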
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/UniversalKafkaDeserializer.scala (2)
56-56: Improve Error Message with Actual Content Type
When throwing the `IllegalStateException`, it's helpful to include the actual content type that was received. This provides clearer context for debugging. Apply this diff to enhance the error message:
```diff
- throw new IllegalStateException("Topic without schema should have ContentType Json or Plain, was neither")
+ throw new IllegalStateException(s"Topic without schema should have ContentType JSON or PLAIN, but was [$writerSchemaId]")
```
Line range hint 61-68: Add Tests for Mismatched Schema Types
The current code throws a `MismatchReaderWriterSchemaException` when the reader and writer schema types do not match. However, there is a TODO comment indicating that this case needs testing, especially when supporting JSON schema.
Consider adding unit tests to cover scenarios where the reader and writer schema types differ to ensure the deserializer handles these cases correctly. Would you like assistance in creating these test cases?
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/formatter/AbstractSchemaBasedRecordFormatter.scala (1)
120-139: Consider adding unit tests for schemaless topic handling
It's important to ensure that the new logic for handling schemaless topics is thoroughly tested. Adding unit tests covering scenarios for JSON and plain content types can help prevent future regressions.
Would you like assistance in creating unit tests for these cases?
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/sink/UniversalKafkaSinkFactory.scala (4)
171-175: Use named parameters in `copy` method to enhance clarity
In line 174, when calling the `copy` method on `jsonSchema`, it's advisable to use named parameters to explicitly indicate which field is being modified. This improves code readability and reduces potential errors if the class structure changes in the future. Apply this diff to use named parameters:
```diff
- jsonSchema.copy(new NkSerializableParsedSchema[ParsedSchema](ContentTypesSchemas.schemaForPlain))
+ jsonSchema.copy(
+   schema = new NkSerializableParsedSchema[ParsedSchema](ContentTypesSchemas.schemaForPlain)
+ )
```
174-174: Consider defining a separate `plainSchema` variable for better clarity
Reusing `jsonSchema.copy` to represent a plain schema can be confusing. Defining a separate variable `plainSchema` enhances code clarity and maintainability.
Define `plainSchema`:
```scala
private val plainSchema = RuntimeSchemaData[ParsedSchema](
  new NkSerializableParsedSchema[ParsedSchema](ContentTypesSchemas.schemaForPlain),
  None
)
```
Update the code to use `plainSchema`:
```diff
  val runtimeSchemaData = if (contentType.equals(ContentTypes.JSON.toString)) {
    jsonSchema
  } else {
-   jsonSchema.copy(
-     schema = new NkSerializableParsedSchema[ParsedSchema](ContentTypesSchemas.schemaForPlain)
-   )
+   plainSchema
  }
```
171-172: Use case-insensitive comparison for content type
When comparing strings like `contentType`, it's safer to perform a case-insensitive comparison to handle input variations gracefully. Apply this diff:
```diff
- val runtimeSchemaData = if (contentType.equals(ContentTypes.JSON.toString)) {
+ val runtimeSchemaData = if (contentType.equalsIgnoreCase(ContentTypes.JSON.toString)) {
```
254-255: Use case-insensitive comparison for `contentType` in `valueEditorParamStep`
Ensure that the comparison for `contentType` is case-insensitive to handle different input cases. Apply this diff:
```diff
- (`contentTypeParamName`, DefinedEagerParameter(contentType: String, _)) ::
+ (`contentTypeParamName`, DefinedEagerParameter(contentTypeRaw: String, _)) ::
  ...
- if (contentType.equals(ContentTypes.JSON.toString)) {
+ val contentType = contentTypeRaw.toUpperCase
+ if (contentType == ContentTypes.JSON.toString) {
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (11)
- engine/flink/tests/src/test/scala/pl/touk/nussknacker/defaultmodel/KafkaJsonItSpec.scala (1 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/KafkaUniversalComponentTransformer.scala (6 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/ContentTypes.scala (1 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/SchemaRegistryClient.scala (2 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/formatter/AbstractSchemaBasedRecordFormatter.scala (2 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/ParsedSchemaSupport.scala (2 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/RecordFormatterSupport.scala (2 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/UniversalKafkaDeserializer.scala (2 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/UniversalSchemaPayloadDeserializer.scala (2 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/sink/UniversalKafkaSinkFactory.scala (6 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/source/UniversalKafkaSourceFactory.scala (2 hunks)
🔇 Additional comments (17)
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/ContentTypes.scala (3)
1-4
: LGTM: Package structure and imports are well-organized
The package hierarchy and import statement are appropriate for schema registry functionality.
11-14
: Add documentation and consider schema validation implications
The current implementation uses extremely permissive schemas (empty object/string) which effectively bypass schema validation. While this aligns with the PR's goal of handling schema-less topics, it has several implications:
- There's no documentation explaining when these schemas should be used
- The empty schemas might mask potential data quality issues
- It's unclear how this relates to the metadata handling mentioned in the PR objectives
Let's check if these schemas are used appropriately:
#!/bin/bash
# Search for usages of these schemas to understand the context
rg -l "ContentTypesSchemas" | xargs rg "schemaFor(Json|Plain)"
Consider:
- Adding comprehensive documentation explaining the use cases and implications
- Implementing basic structural validation even for schema-less topics
- Adding logging when these permissive schemas are used (a brief sketch of this follows below)
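As a minimal sketch of that logging suggestion (the object and method names are assumptions for illustration, with `ContentTypesSchemas` and `OpenAPIJsonSchema` assumed to be in scope; this is not code from the PR):
```scala
import com.typesafe.scalalogging.LazyLogging

// Sketch only: log whenever a permissive "catch-all" schema is substituted for a real one,
// so that schemaless topics stay visible in the logs.
object PermissiveSchemaUsage extends LazyLogging {

  def schemaForContentType(contentTypeName: String): OpenAPIJsonSchema = {
    logger.debug(s"Using permissive $contentTypeName schema for a topic without a registered schema")
    if (contentTypeName.equalsIgnoreCase("JSON")) ContentTypesSchemas.schemaForJson
    else ContentTypesSchemas.schemaForPlain
  }
}
```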
1-14
: Verify handling of hermes metadata
The PR objectives mention issues with hermes metadata handling, but this implementation doesn't explicitly address how metadata fields will be handled with these content types.
Let's check how metadata is handled:
✅ Verification successful
Let me gather more specific information about metadata handling.
Let me check one more specific aspect of the implementation to understand how the content types interact with message handling.
Based on the gathered information, I can now provide the final response:
Implementation correctly handles metadata through content type distinction
The implementation properly addresses metadata handling through the content type system:
- Plain and JSON content types are handled separately in UniversalKafkaDeserializer with appropriate schema assignments
- The deserializer implementation maintains metadata through SchemaWithMetadata wrapper
- The system correctly handles both schema-based and schema-less messages with proper content type detection
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for hermes metadata handling in the codebase
rg -A 5 "hermes.*metadata"
# Look for any message transformation logic
ast-grep --pattern 'def transform$_($$$) {
$$$
}'
Length of output: 2630
Script:
#!/bin/bash
# Look for metadata handling in UniversalKafkaDeserializer
rg -A 10 "SchemaWithMetadata" utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/UniversalKafkaDeserializer.scala
# Check for any specific hermes metadata handling
rg -A 5 "hermes" utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/
# Look for metadata-related schema definitions
ast-grep --pattern 'case class $_Metadata($$$)'
Length of output: 3538
Script:
#!/bin/bash
# Check UniversalSchemaPayloadDeserializer for message handling
rg -A 10 "deserialize" utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/UniversalSchemaPayloadDeserializer.scala
# Look for any schema-less topic handling
rg -A 5 "schemaless" utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/
Length of output: 2433
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/SchemaRegistryClient.scala (1)
43-46
: 🛠️ Refactor suggestion
Consider improving error handling and performance
The implementation has several areas that could be enhanced:
- Error Handling: The method silently handles potential errors from `strategy.getTopics`. Consider propagating errors to allow proper handling by callers.
- Performance: Fetching all topics for each check could be inefficient. Consider caching results if this method is called frequently (a small caching sketch follows the suggested implementation below).
- Input Validation: Add parameter validation for null safety.
Consider this improved implementation:
- def isTopicWithSchema(topic: String, strategy: TopicSelectionStrategy): Boolean = {
- val topicsWithSchema = strategy.getTopics(this)
- topicsWithSchema.exists(_.map(_.name).contains(topic))
- }
+ def isTopicWithSchema(topic: String, strategy: TopicSelectionStrategy): Validated[SchemaRegistryError, Boolean] = {
+ require(topic != null, "Topic cannot be null")
+ require(strategy != null, "Strategy cannot be null")
+ strategy.getTopics(this).map(topics => topics.exists(_.name == topic))
+ }
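On the performance point, a minimal caching wrapper could look like the following (a sketch under assumed wiring, with `SchemaRegistryClient` and `TopicSelectionStrategy` in scope; a real implementation would also need invalidation or a TTL, since topics and schemas can change over time):
```scala
import scala.collection.concurrent.TrieMap

// Sketch only: memoise the "does this topic have a schema" answer so repeated checks
// do not re-fetch the full topic list from the schema registry on every call.
class CachingTopicSchemaCheck(client: SchemaRegistryClient, strategy: TopicSelectionStrategy) {

  private val cache = TrieMap.empty[String, Boolean]

  def isTopicWithSchema(topic: String): Boolean =
    cache.getOrElseUpdate(
      topic,
      strategy.getTopics(client).exists(_.map(_.name).contains(topic))
    )
}
```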
Let's verify the usage of this method to understand the impact:
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/UniversalSchemaPayloadDeserializer.scala (2)
10-10
: LGTM: Import statement is correctly placed
The new import for `ContentTypesSchemas` is appropriately placed with other schema-related imports and is necessary for the plain content type handling.
85-89
: Verify plain content type handling across the codebase
The introduction of special handling for plain content type may affect other components that consume the deserialized data. We should verify that consumers can handle raw byte arrays.
✅ Verification successful
The previous search didn't yield results because we were looking for the wrong class name. Let's broaden our search to understand the content type handling and byte array processing in the codebase.
Based on the search results, I can now make a determination about the plain content type handling. The codebase shows that:
- The plain content type handling is properly integrated in the UniversalSchemaPayloadDeserializer with `ContentTypesSchemas.schemaForPlain`
- The byte array handling is widely supported throughout the codebase:
  - Kafka sources and sinks work with `Array[Byte]` as their base type
  - The deserialization/serialization infrastructure has proper byte array support
  - The Lite engine and other components handle raw byte arrays correctly

Raw byte array handling is properly supported across the codebase
The special handling for plain content type is consistent with the overall architecture where raw byte arrays are a fundamental data type. Consumers are already equipped to handle raw byte arrays as this is the base type for Kafka integration.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for places where JsonSchemaPayloadDeserializer.deserialize is used
rg -l "JsonSchemaPayloadDeserializer.*deserialize" | while read -r file; do
echo "=== $file ==="
rg -A 5 "JsonSchemaPayloadDeserializer.*deserialize" "$file"
done
Length of output: 67
Script:
#!/bin/bash
# Let's search for any usage of UniversalSchemaPayloadDeserializer and content type handling
rg -A 5 "ContentTypesSchemas.schemaForPlain|UniversalSchemaPayloadDeserializer"
# Also search for byte array handling in the codebase
rg -A 5 "Array\[Byte\]|deserialize.*bytes"
Length of output: 235407
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/RecordFormatterSupport.scala (2)
8-8
: LGTM: Import statement is correctly updated
The addition of `ContentTypesSchemas` to the import statement is necessary for the schema-less topic handling functionality.
53-59
: Verify string content validation requirements
The implementation allows any string content when no schema is provided. Based on the PR objectives, this aligns with the requirement to treat input as `Unknown`. However, please verify if any basic validation or sanitization is needed for security or data quality purposes.
Run the following script to check for any validation patterns in the codebase:
✅ Verification successful
Let me analyze the findings and generate additional verification scripts to better understand the validation context.
Let me generate one final verification to check the schema validation context:
Based on the analysis of the codebase, I can now generate the final response:
String content validation is appropriately handled through schema validation
The implementation is correct as-is. When no schema is provided, treating input as raw bytes is the intended behavior. For cases with schemas:
- JSON content is validated against `ContentTypesSchemas.schemaForJson`
- Plain text content is validated against `ContentTypesSchemas.schemaForPlain`
- Schema validation is enforced through `UniversalSchemaValidator` and type-specific validators (AvroSchemaValidator, JsonSchemaValidator)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for string validation patterns in Kafka-related files
rg -t scala "validate.*String|sanitize.*String"
# Search for similar message handling patterns
ast-grep --pattern 'case j if j.isString => $$$'
Length of output: 12377
Script:
#!/bin/bash
# Check for schema validation patterns in the relevant files
rg -A 5 "schemaOpt match" utils/schemed-kafka-components-utils/
# Look for any string content validation in the schema registry related files
rg -t scala "validate.*content|validate.*message" utils/schemed-kafka-components-utils/
# Check the parent class/trait for any validation requirements
ast-grep --pattern 'trait RecordFormatterSupport {
$$$
}'
Length of output: 2530
Script:
#!/bin/bash
# Check how schema validation is handled in the broader context
rg -t scala -B 5 -A 10 "ParsedSchema" utils/schemed-kafka-components-utils/
# Look for any content validation in the parent directories
rg -t scala "validate.*ParsedSchema|validate.*Schema" utils/
Length of output: 294921
engine/flink/tests/src/test/scala/pl/touk/nussknacker/defaultmodel/KafkaJsonItSpec.scala (1)
20-64
: LGTM! Clean and focused test implementation
The JSON round-trip test is well-structured and follows best practices:
- Clear test data setup
- Proper topic naming
- Explicit content type specification
- Robust error handling in JSON parsing
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/ParsedSchemaSupport.scala (2)
34-34
: LGTM!
The import is necessary for the Try-based error handling in the formValueEncoder implementation.
161-176: Several issues need to be addressed in the formValueEncoder implementation.
- The code uses unsafe tuple accessors (_1, _2) which should be replaced with pattern matching as suggested in previous reviews.
- The type casting with asInstanceOf is unsafe and should include better error handling.
- Using "Failure" as a value is incorrect - it should be a Failure instance.
- The special case handling for "Value" key needs better documentation.
Here's a suggested refactoring that addresses these issues:
(value: Any) => {
- // In ad-hoc test without schema we create object `{ "Value" = userInputInAdHoc }`, so if present we should just take the input
- Try {
- val temp = value.asInstanceOf[Map[String, Map[String, Any]]].head
- val key = temp._1
- // Any try to create a variable with value temp._2 fails
- if (key.equals("Value")) {
- temp._2
- } else Failure
- } match {
- // For normal usage
- case Failure(_) => encoder.encodeOrError(value, rawSchema)
- // If source with topic without schema
- case Success(objectInside) => encoder.encodeOrError(objectInside, rawSchema)
- }
+ /**
+ * Special handling for ad-hoc testing without schema:
+ * When testing topics without schema, the input is wrapped in {"Value": actualMessage}.
+ * This unwraps such messages to ensure consistent behavior with direct topic publishing.
+ */
+ value match {
+ case map: Map[String, Map[String, Any]] =>
+ map.headOption.flatMap {
+ case ("Value", innerMap) => Some(innerMap)
+ case _ => None
+ }.map(encoder.encodeOrError(_, rawSchema))
+ .getOrElse(encoder.encodeOrError(value, rawSchema))
+ case _ => encoder.encodeOrError(value, rawSchema)
+ }
}
The refactored version:
- Uses pattern matching instead of unsafe casting
- Provides clear documentation for the special case
- Handles the "Value" key case more safely
- Removes the incorrect usage of Failure as a value
- Simplifies the control flow
Let's verify if this pattern is used consistently across the codebase:
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/source/UniversalKafkaSourceFactory.scala (1)
9-9
: LGTM: Import for serializable schema wrapper
The added import for `NkSerializableParsedSchema` is appropriate for making schemas serializable, which is necessary for Flink's distributed execution.
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/formatter/AbstractSchemaBasedRecordFormatter.scala (2)
115-119
: Logic for handling topics with schemas looks good
The implementation correctly processes topics that have associated schemas.
120-139
: Proper handling of schemaless topics based on content type
The code appropriately handles schemaless topics by checking the content type and processing JSON and plain text messages accordingly.
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/KafkaUniversalComponentTransformer.scala (3)
39-39
: Addition of `contentTypeParamName` aligns with existing conventions
The introduction of `contentTypeParamName` as a constant maintains consistency with the parameter naming strategy used throughout the transformer, facilitating clear and maintainable code.
117-146
: Effective handling of schema-less topics in getVersionOrContentTypeParam
The `getVersionOrContentTypeParam` method correctly differentiates between topics with and without schemas. By providing content type options when a schema is absent, it enhances the component's flexibility and user experience.
229-235
: Seamless integration of content type parameter in schemaParamStep
The updates in the `schemaParamStep` method efficiently incorporate `getVersionOrContentTypeParam`, ensuring that the appropriate parameters are generated based on the topic's schema presence without disrupting the existing flow.
created: #7205
📝 Walkthrough
The pull request introduces extensive modifications across several files to enhance the integration of Nussknacker with Kafka and Schema Registries. Key changes include a revised documentation file (docs/integration/KafkaIntegration.md). In the testing framework, new integration tests for JSON and plain text message processing are introduced, while existing topic selection strategies are updated to include checks for subject existence. Additionally, enhancements are made to the schemed-kafka source, sink, deserializer, and transformer components.
Possibly related PRs
Suggested labels
Suggested reviewers
📜 Recent review details
Configuration used: CodeRabbit UI
📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (4)
Actionable comments posted: 9
🧹 Outside diff range and nitpick comments (15)
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/ContentTypes.scala (1)
11-14
: Add documentation and consider validation strategy
While the empty schemas align with handling schemaless topics, please consider:
- Adding scaladoc to explain the purpose and usage of these schemas
- Documenting the implications of using empty schemas
- Clarifying if any validation should be performed even for schemaless topics
Here's a suggested improvement:
```diff
 object ContentTypesSchemas {
+  /** Schema for JSON content type.
+    * Used when no schema is available in the registry.
+    * Note: This allows any valid JSON without validation.
+    */
   val schemaForJson: OpenAPIJsonSchema = OpenAPIJsonSchema("{}")
+
+  /** Schema for plain text content type.
+    * Used when no schema is available in the registry.
+    * Note: This allows any string content without validation.
+    */
   val schemaForPlain: OpenAPIJsonSchema = OpenAPIJsonSchema("")
 }
```
engine/flink/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/source/flink/TopicSelectionStrategySpec.scala (1)
56-57
: Add documentation for schema handling behavior
The example demonstrates strategy override but lacks documentation about schema handling behavior.
Consider adding comments to explain:
- How topics without schemas are handled
- When to use this strategy vs TopicsWithExistingSubjectSelectionStrategy
- Impact of showTopicsWithoutSchema configuration
```diff
 test("show how to override topic selection strategy") {
+  // Example: Configure source factory to handle topics with pattern "test-.*"
+  // Topics without schemas will be treated as Unknown type based on showTopicsWithoutSchema setting
   new UniversalKafkaSourceFactory(
     schemaRegistryClientFactory,
     UniversalSchemaBasedSerdeProvider.create(schemaRegistryClientFactory),
     testModelDependencies,
     new FlinkKafkaSourceImplFactory(None)
   ) {
     override def topicSelectionStrategy =
       new TopicsMatchingPatternWithExistingSubjectsSelectionStrategy(Pattern.compile("test-.*"))
   }
 }
```
engine/flink/tests/src/test/scala/pl/touk/nussknacker/defaultmodel/KafkaJsonItSpec.scala (3)
18-22
: Consider adding more diverse test cases
While the current test data is valid, consider adding test cases that cover the points below (a small sketch of such records follows the list):
- Nested JSON structures
- Arrays
- Different data types (numbers, booleans, nulls)
- Special characters in strings
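For example, a handful of hypothetical records along these lines would cover the cases above (a sketch assuming circe, which the test already uses for parsing; the field names are illustrative only):
```scala
import io.circe.Json

// Illustrative test data only — not taken from the PR.
val nestedRecord: Json = Json.obj(
  "user" -> Json.obj(
    "name"     -> Json.fromString("Tomek \"Junior\" Kowalski"), // special characters
    "age"      -> Json.fromInt(42),                             // number
    "active"   -> Json.fromBoolean(true),                       // boolean
    "nickname" -> Json.Null,                                    // null
    "tags"     -> Json.arr(Json.fromString("a"), Json.fromString("b")) // array
  )
)
```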
51-51
: Document the rationale for using lax validation mode
The choice of `ValidationMode.lax` should be documented to explain why strict validation isn't used and what implications this has for the test.
97-105
: Enhance error handling in test assertions
The test should handle parsing errors more gracefully and provide better error messages. Consider:
- Adding timeout to the consumer
- Adding descriptive error messages for parsing failures
```diff
 run(process) {
-  val outputRecord = kafkaClient.createConsumer().consumeWithConsumerRecord(outputTopic).take(1).head
+  val outputRecord = kafkaClient.createConsumer()
+    .consumeWithConsumerRecordWithin(outputTopic, timeout = 10.seconds)
+    .headOption
+    .getOrElse(fail("No message received within timeout"))
   val parsedOutput = parser
     .parse(new String(outputRecord.value(), StandardCharsets.UTF_8))
-    .fold(throw _, identity)
+    .fold(
+      error => fail(s"Failed to parse output as JSON: $error"),
+      identity
+    )
   parsedOutput shouldBe jsonRecord
 }
```
utils/kafka-utils/src/main/scala/pl/touk/nussknacker/engine/kafka/KafkaConfig.scala (2)
32-33
: Add ScalaDoc for the new configuration parameter.
The new `showTopicsWithoutSchema` parameter would benefit from documentation explaining its purpose and impact. Consider adding ScalaDoc to clarify when this should be enabled/disabled. Add documentation above the parameter:
```diff
  sinkDeliveryGuarantee: Option[SinkDeliveryGuarantee.Value] = None,
+ /** Controls visibility of Kafka topics that don't have an associated schema in the schema registry.
+   * When true (default), all non-hidden topics are shown.
+   * When false, only topics with existing schemas are displayed.
+   */
  showTopicsWithoutSchema: Boolean = true
```
32-33
: Add changelog entry for the new configuration.
Since this introduces a new configuration capability, it should be documented in the changelog to help users understand and adopt the feature. Consider adding an entry like:
Consider adding an entry like:
### Added - New configuration option `showTopicsWithoutSchema` in `KafkaConfig` to control visibility of Kafka topics without schema registry entriesutils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/ParsedSchemaSupport.scala (1)
159-176
: Document the ad-hoc testing behaviorThe code includes special handling for ad-hoc testing scenarios, but this behavior is only documented in an inline comment.
Add a proper ScalaDoc comment explaining this special case:
/** Encodes the input value according to the given schema. * * For ad-hoc testing without schema, the input is expected to be in the format: * {{{ * Map("Value" -> actualValue) * }}} * In such cases, only the inner value is encoded. Otherwise, the entire value * is encoded normally. * * @param schema The schema to encode against * @param mode The validation mode to use * @return A function that encodes values according to the schema */docs/integration/KafkaIntegration.md (1)
207-207
: LGTM! Consider adding migration guide entry.The new configuration option
showTopicsWithoutSchema
is well documented and aligns perfectly with the PR objective of allowing users to select Kafka topics without schema. The default value oftrue
ensures backward compatibility.Consider adding an entry to the migration guide since this introduces a behavioral change in topic visibility. This will help users understand:
- What changed in topic visibility
- How to restore previous behavior by setting `showTopicsWithoutSchema: false`
- Impact on existing scenarios
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/TopicSelectionStrategy.scala (1)
44-59
: Consider logging the caught `KafkaException` for better observability
Currently, when a `KafkaException` occurs while listing topics from Kafka, the exception is swallowed, and an empty list is returned. Logging this exception can aid in debugging and provide visibility into potential issues during runtime. You might consider modifying the catch block to log the exception:
```diff
  } catch {
    // In some tests we pass dummy kafka address, so when we try to get topics from kafka it fails
-   case _: KafkaException =>
+   case e: KafkaException =>
+     logger.warn("Failed to list topics from Kafka", e)
      List.empty
  }
```
56-56
: Enhance Exception Message for Better DebuggingThe exception message on line 56 can be made more informative by including the unexpected content type value. This will aid in debugging by specifying the actual value encountered.
Apply this diff to improve the exception message:
```diff
- throw new IllegalStateException("Topic without schema should have ContentType Json or Plain, was neither")
+ throw new IllegalStateException(s"Topic without schema should have ContentType Json or Plain, but was '${value}'")
```
136-138
: Enhance exception messages with detailed contextConsider enhancing the exception messages to include details about the unexpected content type or schema ID. This additional information can aid in debugging by providing more context about the error.
Apply this diff to improve the exception messages:
- throw new IllegalStateException("Schemaless topic should have json or plain content type, got neither") + throw new IllegalStateException(s"Schemaless topic should have JSON or plain content type, got: $$contentType")and
- throw new IllegalStateException("Schemaless topic should have json or plain content type, got neither") + throw new IllegalStateException(s"Schemaless topic should have valid content type, got: $${record.valueSchemaId.getOrElse("None")}")utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/KafkaUniversalComponentTransformer.scala (1)
103-134
: Address the TODO: EnablePLAIN
content type when readyThere is a TODO comment indicating that the
PLAIN
content type is not working correctly. It's important to resolve this before merging or create a follow-up task to enable thePLAIN
content type when it's fully supported. Including commented-out code can lead to confusion and should be avoided in the main codebase.Would you like assistance in resolving the issue with the
PLAIN
content type or in creating a task to track this work?utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/sink/UniversalKafkaSinkFactory.scala (2)
365-366
: UseequalsIgnoreCase
forcontentType
comparisonsCurrently, the comparison of
contentType
strings is case-sensitive. To improve robustness and handle different cases of input, consider usingequalsIgnoreCase
.Apply this diff:
private def runtimeSchemaDataForContentType(contentType: String): RuntimeSchemaData[ParsedSchema] = { - if (contentType.equals(ContentTypes.JSON.toString)) { jsonSchema } - else if (contentType.equals(ContentTypes.PLAIN.toString)) { plainSchema } + if (contentType.equalsIgnoreCase(ContentTypes.JSON.toString)) { jsonSchema } + else if (contentType.equalsIgnoreCase(ContentTypes.PLAIN.toString)) { plainSchema } else { throw new IllegalStateException("Content Type should be JSON or PLAIN, is neither") } }
367-367
: ThrowIllegalArgumentException
for invalidcontentType
When an invalid argument is provided, it's more appropriate to throw an
IllegalArgumentException
instead of anIllegalStateException
. Additionally, including the invalid value in the exception message can aid in debugging.Apply this diff:
else { - throw new IllegalStateException("Content Type should be JSON or PLAIN, is neither") + throw new IllegalArgumentException(s"Invalid content type: $contentType. Content type should be JSON or PLAIN.") }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (13)
docs/integration/KafkaIntegration.md
(1 hunks)engine/flink/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/source/flink/TopicSelectionStrategySpec.scala
(4 hunks)engine/flink/tests/src/test/scala/pl/touk/nussknacker/defaultmodel/KafkaJsonItSpec.scala
(1 hunks)utils/kafka-utils/src/main/scala/pl/touk/nussknacker/engine/kafka/KafkaConfig.scala
(1 hunks)utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/KafkaUniversalComponentTransformer.scala
(6 hunks)utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/TopicSelectionStrategy.scala
(1 hunks)utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/ContentTypes.scala
(1 hunks)utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/SchemaRegistryClient.scala
(2 hunks)utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/formatter/AbstractSchemaBasedRecordFormatter.scala
(2 hunks)utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/ParsedSchemaSupport.scala
(1 hunks)utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/UniversalKafkaDeserializer.scala
(2 hunks)utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/sink/UniversalKafkaSinkFactory.scala
(7 hunks)utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/source/UniversalKafkaSourceFactory.scala
(3 hunks)
🔇 Additional comments (19)
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/ContentTypes.scala (2)
1-4
: LGTM!
The package structure and import are well-organized and specific.
5-9
: Previous review comment about sealed trait pattern is still applicable.
Let's verify the usage of ContentTypes to understand the impact of the suggested refactor:
#!/bin/bash
# Search for usages of ContentTypes to assess refactoring impact
rg -A 2 "ContentTypes\."
engine/flink/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/source/flink/TopicSelectionStrategySpec.scala (1)
9-12
: Verify complete migration to new topic selection strategies
The import changes suggest a shift towards strategies that explicitly handle existing subjects. This aligns with the PR objective of supporting topics without schemas.
✅ Verification successful
Migration to new topic selection strategies is complete
The verification shows that the old strategy classes (AllTopicsSelectionStrategy
and TopicPatternSelectionStrategy
) have been completely removed from the codebase, and the new strategies (TopicsWithExistingSubjectSelectionStrategy
and TopicsMatchingPatternWithExistingSubjectsSelectionStrategy
) are consistently used across all relevant files. The migration appears to be thorough and complete.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that old strategy classes are fully replaced
# Expected: No occurrences of old strategy classes
# Search for old strategy class usages
rg "AllTopicsSelectionStrategy|TopicPatternSelectionStrategy" --type scala
# Search for new strategy class usages to confirm migration
rg "TopicsWithExistingSubjectSelectionStrategy|TopicsMatchingPatternWithExistingSubjectsSelectionStrategy" --type scala
Length of output: 3770
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/SchemaRegistryClient.scala (1)
5-10
: LGTM: Import changes are well-organized
The new imports are properly organized and necessary for the added functionality.
utils/kafka-utils/src/main/scala/pl/touk/nussknacker/engine/kafka/KafkaConfig.scala (1)
32-33
: Verify the parameter usage across the codebase.
The addition of this parameter needs to be reflected in configuration files and documentation.
✅ Verification successful
Let me analyze the search results to verify the parameter usage. I can see that the parameter showTopicsWithoutSchema
is part of KafkaConfig
class and is used in the codebase. Let me search for its actual usage to determine if configuration files need updating.
Based on the search results, I can now provide a final response:
No configuration updates needed for this parameter
The showTopicsWithoutSchema
parameter is already well-integrated in the codebase:
- It has a default value of
true
in theKafkaConfig
class - It's documented in
docs/integration/KafkaIntegration.md
with clear description - The parameter is used in
KafkaUniversalComponentTransformer
to control topic selection strategy - It's an optional configuration that works with the default value, and users can override it if needed through the standard Kafka configuration mechanism
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for potential configuration files and documentation that might need updating
echo "Searching for configuration files..."
fd -e conf -e yaml -e properties --exec rg -l "kafka"
echo "Searching for documentation files..."
fd -e md --exec rg -l "kafka|schema"
echo "Searching for test configurations..."
rg "KafkaConfig" -A 5
Length of output: 105643
Script:
#!/bin/bash
# Search for usage of showTopicsWithoutSchema parameter
rg "showTopicsWithoutSchema" -A 3
# Search for kafka configuration in config files
rg "kafka" -A 5 ./nussknacker-dist/src/universal/conf/application.conf
Length of output: 2788
docs/integration/KafkaIntegration.md (1)
Line range hint 1-207
: Documentation is comprehensive and well-structured!
The documentation provides excellent coverage of Kafka integration, including:
- Clear prerequisites
- Detailed concepts
- Configuration options
- Authentication setup
- Schema handling
- Exception handling
The new changes integrate seamlessly with the existing content.
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/UniversalKafkaDeserializer.scala (2)
8-17
: Imports are Correct and Necessary
The added imports are appropriate for the new logic introduced and ensure all dependencies are properly referenced.
48-57
: Duplicate Issue: Incorrect Pattern Matching on writerSchemaId
As previously noted, pattern matching on writerSchemaId.value
is incorrect because writerSchemaId.value
is likely a String
, whereas StringSchemaId
is a case class expecting to match against writerSchemaId
itself. This mismatch will result in a match failure at runtime.
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/formatter/AbstractSchemaBasedRecordFormatter.scala (2)
115-122
: Correct handling of topics with schemas
The code correctly checks if the topic has a schema and processes the value accordingly. The use of schemaRegistryClient.isTopicWithSchema
ensures that messages are serialized properly when a schema is present.
124-143
: Robust handling of schemaless topics with JSON or PLAIN content types
The added logic accurately handles topics without schemas by checking the content type and processing JSON and plain text formats appropriately. This ensures that messages from schemaless topics are correctly serialized.
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/KafkaUniversalComponentTransformer.scala (5)
29-29
: Consistent addition of contentTypeParamName
constant
The addition of contentTypeParamName
aligns with the existing parameter naming conventions and enhances readability.
50-54
: Ensure showTopicsWithoutSchema
is properly documented
The use of kafkaConfig.showTopicsWithoutSchema
to determine the topic selection strategy is appropriate. Please ensure that the showTopicsWithoutSchema
configuration parameter is added to KafkaConfig
and thoroughly documented in the configuration documentation.
70-70
: Appropriate usage of kafkaConfig
in getTopics
method
Passing kafkaConfig
to topicSelectionStrategy.getTopics
ensures that the strategy has access to necessary configuration parameters. This modification enhances flexibility and is correctly implemented.
217-223
: Correct integration of versionOrContentTypeParam
in schemaParamStep
The integration of versionOrContentTypeParam
into the parameters and error handling ensures that the component accurately accounts for both version and content type based on the topic's schema presence. This implementation is sound and aligns with the intended functionality.
238-239
: Addition of contentTypeParamName
maintains consistency
Adding contentTypeParamName
as a protected lazy val maintains consistency with other parameter names in the class, enhancing maintainability.
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/source/UniversalKafkaSourceFactory.scala (4)
9-9
: Import is necessary and appropriate
The import of NkSerializableParsedSchema
is required for the creation of RuntimeSchemaData
instances and is correctly added.
47-47
: Constructor modification is appropriate
The addition of implProvider
to the constructor parameters is appropriate, as it is utilized within the implementation
method to create the Kafka source.
74-74
:
Avoid potential NullPointerException by swapping operands in equals
Using contentType.equals(...)
can lead to a NullPointerException
if contentType
is null
. Swap the operands to prevent this issue.
This concern was previously raised; please address it to ensure robustness.
74-100
: 🛠️ Refactor suggestion
Refactor duplicated schema creation logic
The code for creating RuntimeSchemaData
is duplicated between the JSON and PLAIN cases. Consider extracting this logic into a helper method to improve maintainability and reduce duplication.
This suggestion was previously provided and is still applicable. Refactoring will enhance the code's clarity and ease future modifications.
.../scala/pl/touk/nussknacker/engine/schemedkafka/source/flink/TopicSelectionStrategySpec.scala
Show resolved
Hide resolved
.../scala/pl/touk/nussknacker/engine/schemedkafka/source/flink/TopicSelectionStrategySpec.scala
Show resolved
Hide resolved
...main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/SchemaRegistryClient.scala
Show resolved
Hide resolved
engine/flink/tests/src/test/scala/pl/touk/nussknacker/defaultmodel/KafkaJsonItSpec.scala
Show resolved
Hide resolved
engine/flink/tests/src/test/scala/pl/touk/nussknacker/defaultmodel/KafkaJsonItSpec.scala
Show resolved
Hide resolved
...a/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/ParsedSchemaSupport.scala
Outdated
Show resolved
Hide resolved
...a/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/ParsedSchemaSupport.scala
Outdated
Show resolved
Hide resolved
.../main/scala/pl/touk/nussknacker/engine/schemedkafka/source/UniversalKafkaSourceFactory.scala
Show resolved
Hide resolved
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/TopicSelectionStrategy.scala (3)
21-30
: Consider documenting unused parameter
The `kafkaConfig` parameter is not used in this implementation. Consider adding a comment explaining why, or if this is intentional.
```diff
 class TopicsWithExistingSubjectSelectionStrategy extends TopicSelectionStrategy {
+  // kafkaConfig not used as this strategy only fetches topics from schema registry
   override def getTopics(
       schemaRegistryClient: SchemaRegistryClient,
       kafkaConfig: KafkaConfig
   ): Validated[SchemaRegistryError, List[UnspecializedTopicName]] = {
     schemaRegistryClient.getAllTopics
   }
 }
```
32-60
: Good implementation with a few suggestions for improvement
The implementation successfully combines topics from both sources and handles errors gracefully. Consider these improvements:
- The comment about dummy kafka address should be moved to ScalaDoc
- Consider extracting the hidden topic prefix "_" to a constant
- Consider adding logging for Kafka connection failures
```diff
 class AllNonHiddenTopicsSelectionStrategy extends TopicSelectionStrategy {
+  private val HiddenTopicPrefix = "_"
+
+  /** Retrieves topics from both schema registry and Kafka.
+    * Note: In test environments with dummy Kafka addresses, Kafka topic retrieval
+    * may fail silently and return empty list
+    */
   override def getTopics(
       schemaRegistryClient: SchemaRegistryClient,
       kafkaConfig: KafkaConfig
   ): Validated[SchemaRegistryError, List[UnspecializedTopicName]] = {
     val topicsFromSchemaRegistry = schemaRegistryClient.getAllTopics
     val validatorConfig          = kafkaConfig.topicsExistenceValidationConfig.validatorConfig
     val schemaLessTopics: List[UnspecializedTopicName] = {
       try {
         KafkaUtils.usingAdminClient(kafkaConfig) {
           _.listTopics(new ListTopicsOptions().timeoutMs(validatorConfig.adminClientTimeout.toMillis.toInt))
             .names()
             .get()
             .asScala
             .toSet
             .map(UnspecializedTopicName.apply)
-            .filterNot(topic => topic.name.startsWith("_"))
+            .filterNot(topic => topic.name.startsWith(HiddenTopicPrefix))
             .toList
         }
       } catch {
-        // In some tests we pass dummy kafka address, so when we try to get topics from kafka it fails
         case e: KafkaException =>
+          logger.warn("Failed to retrieve topics from Kafka", e)
           List.empty
       }
     }
```
64-72
: Consider documenting unused parameter
Similar to TopicsWithExistingSubjectSelectionStrategy, the `kafkaConfig` parameter is not used in this implementation.
```diff
 class TopicsMatchingPatternWithExistingSubjectsSelectionStrategy(val topicPattern: Pattern)
     extends TopicSelectionStrategy {
+  // kafkaConfig not used as this strategy only filters topics from schema registry
   override def getTopics(
       schemaRegistryClient: SchemaRegistryClient,
       kafkaConfig: KafkaConfig
   ): Validated[SchemaRegistryError, List[UnspecializedTopicName]] =
     schemaRegistryClient.getAllTopics.map(_.filter(topic => topicPattern.matcher(topic.name).matches()))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/TopicSelectionStrategy.scala
(1 hunks)
🔇 Additional comments (2)
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/TopicSelectionStrategy.scala (2)
4-6
: LGTM: Required imports added for Kafka admin functionality
The new imports support the enhanced topic selection capabilities with Kafka admin client integration.
14-17
: Verify impact of the breaking change in trait signature
Adding the `kafkaConfig` parameter is a breaking change that will affect all implementations of this trait.
✅ Verification successful
All implementations of TopicSelectionStrategy have been updated with the new parameter
The two implementations found in the codebase:
TopicsWithExistingSubjectSelectionStrategy
AllNonHiddenTopicsSelectionStrategy
Both have been properly updated to include the new `kafkaConfig` parameter in their `getTopics` method signature, matching the trait's interface. The breaking change has been handled correctly.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Find all implementations of TopicSelectionStrategy to ensure they're updated
ast-grep --pattern 'class $_ extends TopicSelectionStrategy'
Length of output: 6594
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
utils/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/azure/AzureSchemaBasedSerdeProviderIntegrationTest.scala (1)
32-37
: Consider making the test environment more CI-friendly.
The test relies on Azure Event Hubs namespace which might not be available in all CI environments. Consider the points below (a small sketch of the tagging approach follows the list):
- Adding clear documentation about required environment variables
- Providing a mock implementation for CI
- Marking the test with appropriate tags for conditional execution
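A minimal sketch of the third option, assuming ScalaTest and a hypothetical environment variable name (neither is taken from the repository):

import org.scalatest.funsuite.AnyFunSuite

class AzureSerdeProviderConditionalSpec extends AnyFunSuite {

  test("round-trips a record through the Azure schema registry") {
    // assume() cancels the test instead of failing it when the precondition is not met,
    // so CI environments without Azure credentials stay green
    assume(
      sys.env.contains("AZURE_EVENT_HUBS_NAMESPACE"),
      "Azure Event Hubs namespace not configured"
    )
    // ... the actual serde round-trip assertions would go here ...
  }
}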
utils/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/azure/AzureTestsFromFileIntegrationTest.scala (1)
42-42
: LGTM! Consider adding a comment explaining the flag's purpose.
The addition of showTopicsWithoutSchema = false is appropriate for this test class as it focuses on schema registry functionality. This ensures the test remains focused on verifying schema-based topic behavior.
Consider adding a comment explaining why this flag is set to false here:

+  // Explicitly disable schema-less topics as this test focuses on schema registry functionality
   private val kafkaConfig = KafkaConfig(Some(schemaRegistryConfigMap), None, showTopicsWithoutSchema = false)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
utils/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/azure/AzureSchemaBasedSerdeProviderIntegrationTest.scala (1 hunks)
utils/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/azure/AzureTestsFromFileIntegrationTest.scala (1 hunks)
🔇 Additional comments (2)
utils/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/azure/AzureSchemaBasedSerdeProviderIntegrationTest.scala (1)
32-37
: LGTM on the KafkaConfig parameter addition.
The addition of showTopicsWithoutSchema = false
is appropriate for this schema-based test case.
utils/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/azure/AzureTestsFromFileIntegrationTest.scala (1)
42-42
: Verify test coverage for schema-less topics.
While this test appropriately covers schema registry functionality, we should ensure there are corresponding test cases for the new schema-less topic support introduced in this PR.
Let's check for test coverage of schema-less topics:
✅ Verification successful
Schema-less topic test coverage is present in the codebase
The codebase has adequate test coverage for schema-less topics in multiple test files:
- KafkaJsonItSpec contains dedicated tests for topics without schema, testing both JSON and plain content types
- UniversalRecordFormatterSpec includes specific test cases for JSON record formatting without schema
- UniversalKafkaDeserializerTest covers deserialization scenarios for messages without schema IDs
The showTopicsWithoutSchema = false
setting in the reviewed test file is appropriate as it's specifically testing Azure Schema Registry functionality, while schema-less scenarios are covered elsewhere.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for test files that might contain schema-less topic tests
rg -t scala "showTopicsWithoutSchema\s*=\s*true" --type scala
# Look for potential test class names related to schema-less topics
rg -t scala "class.*Test.*Schema(Less|Free|Optional|Without)" --type scala
Length of output: 140
Script:
#!/bin/bash
# Let's first check for any test files that mention schema-less or related terms
rg -t scala -i "(schema.?less|without.?schema|no.?schema)" --type scala
# Let's also check for test files that use showTopicsWithoutSchema in any form
rg -t scala "showTopicsWithoutSchema" --type scala
# Let's look at the actual test file content to understand the test coverage
cat utils/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/azure/AzureTestsFromFileIntegrationTest.scala
Length of output: 14915
.../engine/schemedkafka/schemaregistry/azure/AzureSchemaBasedSerdeProviderIntegrationTest.scala
TODO: ad-hoc test and schema in sink
Added comments where not sure
…andling when couldn't list kafka topics (previously it wasn't needed)
…compiles Also a workaround in the same file for test, it will now fetch all topics from schema registry as before and from kafka, so tests should pass
When topic without schema is selected instead of version, user can now choose ContentType (which for now doesn't change anything, need to implement handling bytes array)
- created a function for repeated code - removed use of avro in test for json - used more constants where applicable
- commented out plain for now - all kafka topics are now filtered using topicSelectionStrategy - todo: add feature flag for showing topics without schema
…ema - false for now
- `showTopicsWithoutSchema` flag is now in KafkaConfig - TopicSelectionStrategies changed to include topics without schema
901d727
to
26d42b1
Compare
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (8)
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/UniversalKafkaDeserializer.scala (1)
44-59
: Consider extensibility for future content types
The current implementation hardcodes JSON and Plain content types. Consider making this extensible for future content types.
Example approach:
trait ContentTypeHandler {
  def canHandle(contentType: String): Boolean
  def getSchemaWithMetadata(contentType: String): SchemaWithMetadata
}

object JsonContentTypeHandler extends ContentTypeHandler {
  def canHandle(contentType: String): Boolean = contentType.equals(ContentTypes.JSON.toString)
  def getSchemaWithMetadata(contentType: String): SchemaWithMetadata =
    SchemaWithMetadata(ContentTypesSchemas.schemaForJson, SchemaId.fromString(ContentTypes.JSON.toString))
}

// Similar implementation for PlainContentTypeHandler

utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/formatter/AbstractSchemaBasedRecordFormatter.scala (2)
124-140
: Simplify content type handling logic
The current implementation has nested conditions and duplicate error messages. Consider using pattern matching for cleaner content type handling.
- val schema = record.valueSchemaId.flatMap {
-   case StringSchemaId(contentType) =>
-     if (contentType.equals(ContentTypes.JSON.toString)) {
-       Some(
-         SchemaWithMetadata(
-           ContentTypesSchemas.schemaForJson,
-           SchemaId.fromString(ContentTypes.JSON.toString)
-         ).schema
-       )
-     } else if (contentType.equals(ContentTypes.PLAIN.toString)) {
-       None
-     } else
-       throw new IllegalStateException("Schemaless topic should have json or plain content type, got neither")
-   case _ =>
-     throw new IllegalStateException("Schemaless topic should have json or plain content type, got neither")
+ val schema = record.valueSchemaId.flatMap {
+   case StringSchemaId(contentType) => contentType match {
+     case ct if ct == ContentTypes.JSON.toString =>
+       Some(SchemaWithMetadata(
+         ContentTypesSchemas.schemaForJson,
+         SchemaId.fromString(ContentTypes.JSON.toString)
+       ).schema)
+     case ct if ct == ContentTypes.PLAIN.toString => None
+     case _ => throw new IllegalStateException(s"Schemaless topic should have json or plain content type, got: $contentType")
+   }
+   case other => throw new IllegalStateException(s"Schemaless topic should have string schema id, got: $other")
+ }
114-144
: Document the schemaless topic handling
The new functionality for handling schemaless topics should be documented with examples and limitations.
Consider adding scaladoc comments explaining (a sketch of possible wording follows this list):
- The behavior for different content types
- Expected JSON structure for schemaless topics
- Handling of metadata from hermes (as mentioned in PR objectives)
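One possible shape for such a scaladoc note; the wording below is illustrative only and not taken from the codebase:

/**
 * Handling of topics without a registered schema:
 *  - Content type JSON: the payload is read as a JSON document and exposed to the scenario as Unknown;
 *    no validation against a registry schema is performed.
 *  - Content type PLAIN: the raw bytes are passed through unchanged (full byte-array handling is still a TODO).
 *  - Metadata added by producers such as hermes is not stripped here; scenarios have to navigate it
 *    themselves, e.g. via dynamic access.
 */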
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/KafkaUniversalComponentTransformer.scala (1)
114-118
: Consider improving error messages for better user experience.
The error message from schema registry is directly passed to the user. Consider providing more context-specific error messages that guide users on how to resolve the issue.
  case Invalid(e) =>
    Writer[List[ProcessCompilationError], List[Integer]](
-     List(CustomNodeError(e.getMessage, Some(topicParamName))),
+     List(CustomNodeError(s"Failed to retrieve schema versions: ${e.getMessage}. Please ensure the topic exists and has schemas registered.", Some(topicParamName))),
      Nil
    )

utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/source/UniversalKafkaSourceFactory.scala (1)
68-73
: Consider adding validation for content type parameter
The content type parameter is accepted without validation, which could lead to runtime errors if an invalid value is provided.
Consider adding a dedicated parameter with a fixed set of allowed values:
+ protected val contentTypeParam = Parameter(contentTypeParamName)
+   .copy(editor = Some(FixedValuesParameterEditor(ContentTypes.values.map(_.toString).toList)))
+
  override def contextTransformation(context: ValidationContext, dependencies: List[NodeDependencyValue])(
      implicit nodeId: NodeId
  ): ContextTransformationDefinition =
    topicParamStep orElse
+     Parameter.initial(contentTypeParam) orElse
      schemaParamStep(paramsDeterminedAfterSchema) orElse
      nextSteps(context, dependencies)

utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/sink/UniversalKafkaSinkFactory.scala (2)
85-86
: Consider caching schema instances more efficiently
The lazy vals for schemas are a good start, but consider moving them to a companion object to ensure they're truly singleton instances across all factory instances.
object UniversalKafkaSinkFactory {
+  private lazy val jsonSchema =
+    RuntimeSchemaData(new NkSerializableParsedSchema[ParsedSchema](ContentTypesSchemas.schemaForJson), None)
+  private lazy val plainSchema =
+    RuntimeSchemaData(new NkSerializableParsedSchema[ParsedSchema](ContentTypesSchemas.schemaForPlain), None)

   private val paramsDeterminedAfterSchema = List(
     // existing code
   )

Also applies to: 89-92
365-369
: Enhance error handling in runtimeSchemaDataForContentType
The error message could be more informative. Consider including the actual content type in the error message and using pattern matching for better type safety.
  private def runtimeSchemaDataForContentType(contentType: String): RuntimeSchemaData[ParsedSchema] = {
-   if (contentType.equals(ContentTypes.JSON.toString)) { jsonSchema }
-   else if (contentType.equals(ContentTypes.PLAIN.toString)) { plainSchema }
-   else { throw new IllegalStateException("Content Type should be JSON or PLAIN, is neither") }
+   contentType match {
+     case ct if ct.equals(ContentTypes.JSON.toString)  => jsonSchema
+     case ct if ct.equals(ContentTypes.PLAIN.toString) => plainSchema
+     case unsupported => throw new IllegalStateException(
+       s"Content Type should be JSON or PLAIN, got: $unsupported"
+     )
+   }
  }

docs/integration/KafkaIntegration.md (1)
207-207
: Configuration entry looks good but could use more context.
The configuration entry for showTopicsWithoutSchema is clear and well-documented. However, consider adding a note about its relationship with schema-less topics and how it affects the user experience when working with topics that don't have associated schemas.
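For illustration only, the flag is also exposed on KafkaConfig, as the test snippets quoted earlier in this review show. A sketch mirroring that constructor shape (the map contents below are placeholder values; remaining KafkaConfig fields keep their defaults):

  val schemaRegistryConfigMap = Map("schema.registry.url" -> "http://localhost:8081") // placeholder
  // enable listing of topics that have no subject registered in the schema registry
  val kafkaConfig = KafkaConfig(Some(schemaRegistryConfigMap), None, showTopicsWithoutSchema = true)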
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (15)
- docs/integration/KafkaIntegration.md (1 hunks)
- engine/flink/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/source/flink/TopicSelectionStrategySpec.scala (4 hunks)
- engine/flink/tests/src/test/scala/pl/touk/nussknacker/defaultmodel/KafkaJsonItSpec.scala (1 hunks)
- utils/kafka-utils/src/main/scala/pl/touk/nussknacker/engine/kafka/KafkaConfig.scala (1 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/KafkaUniversalComponentTransformer.scala (6 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/TopicSelectionStrategy.scala (1 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/ContentTypes.scala (1 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/SchemaRegistryClient.scala (2 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/formatter/AbstractSchemaBasedRecordFormatter.scala (2 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/ParsedSchemaSupport.scala (1 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/UniversalKafkaDeserializer.scala (2 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/sink/UniversalKafkaSinkFactory.scala (7 hunks)
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/source/UniversalKafkaSourceFactory.scala (3 hunks)
- utils/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/azure/AzureSchemaBasedSerdeProviderIntegrationTest.scala (1 hunks)
- utils/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/azure/AzureTestsFromFileIntegrationTest.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- engine/flink/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/source/flink/TopicSelectionStrategySpec.scala
- engine/flink/tests/src/test/scala/pl/touk/nussknacker/defaultmodel/KafkaJsonItSpec.scala
- utils/kafka-utils/src/main/scala/pl/touk/nussknacker/engine/kafka/KafkaConfig.scala
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/TopicSelectionStrategy.scala
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/ContentTypes.scala
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/SchemaRegistryClient.scala
- utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/ParsedSchemaSupport.scala
- utils/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/azure/AzureSchemaBasedSerdeProviderIntegrationTest.scala
- utils/schemed-kafka-components-utils/src/test/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/azure/AzureTestsFromFileIntegrationTest.scala
🔇 Additional comments (13)
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/universal/UniversalKafkaDeserializer.scala (2)
8-8
: LGTM: New imports support schema-less topics functionality
The added imports are well-organized and necessary for implementing the new feature to handle topics without schema.
Also applies to: 12-17
44-59
: Verify content type validation for schema-less topics
The current implementation assumes the message format matches the content type without validation. This could lead to runtime errors if the actual message format doesn't match the assumed content type.
Consider adding message format validation for schema-less topics. For example:
- For JSON content type: Verify the message can be parsed as valid JSON
- For Plain content type: Define clear expectations for the message format
Would you like me to propose a validation implementation?
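If such a check were added, a minimal sketch could look like the following; it assumes circe (already used in the project for JSON) and a hypothetical helper name, and is not the project's actual deserializer:

import java.nio.charset.StandardCharsets
import io.circe.Json
import io.circe.parser.parse

// Hypothetical pre-check for schemaless topics declared as JSON: reject payloads that do not parse.
// For the PLAIN content type no parsing is attempted and the raw bytes are accepted as-is.
def validateJsonPayload(payload: Array[Byte]): Either[String, Json] =
  parse(new String(payload, StandardCharsets.UTF_8))
    .left.map(failure => s"Value declared as JSON is not valid JSON: ${failure.message}")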
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/formatter/AbstractSchemaBasedRecordFormatter.scala (2)
11-19
: LGTM: New imports support schemaless topic functionality
The added imports are necessary and well-organized to support the new feature for handling topics without schemas.
141-142
: Verify JSON structure validation
For schemaless topics with JSON content type, ensure that the JSON structure is properly validated. The current implementation might accept any JSON value (string, number, array, etc.) without validation.
✅ Verification successful
Based on the code analysis, I can now provide a final response regarding the JSON validation concern:
JSON validation is properly handled through multiple layers
For schemaless topics with JSON content type, the validation is handled appropriately through:
- The JsonPayloadRecordFormatterSupport handles JSON message formatting and reading
- When schema is provided, validation is performed using JsonSchemaOutputValidator with proper validation modes (strict/lax)
- For schemaless JSON topics, basic structural validation is performed during serialization/deserialization
- The implementation correctly handles both schema-based and schemaless JSON messages through the universal formatter infrastructure
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for JSON validation logic in value parsing
ast-grep --pattern 'readValueMessage($schema, $topic, $jsonObj) {
$$$
}'
Length of output: 76
Script:
#!/bin/bash
# Let's first check the actual implementation of readValueMessage
rg -A 10 "readValueMessage" --type scala
# Also search for any JSON validation related code
rg -A 5 "JSON|json" --type scala
# Let's also check the entire file content to understand the context better
cat "utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/schemaregistry/formatter/AbstractSchemaBasedRecordFormatter.scala"
Length of output: 111078
Script:
#!/bin/bash
# Let's check the implementation of JsonPayloadRecordFormatterSupport which handles JSON validation
rg -A 10 "JsonPayloadRecordFormatterSupport" --type scala
# Also check for any JSON validation related code in the universal package
rg -A 5 "validate.*json|json.*validate" --type scala --ignore-case
# Let's also check the specific implementation of readMessage in JsonPayloadRecordFormatterSupport
ast-grep --pattern 'readMessage($topic, $schemaOpt, $jsonObj) {
$$$
}'
Length of output: 51680
Script:
#!/bin/bash
# Let's check the implementation of JsonPayloadRecordFormatterSupport more specifically
rg -A 15 "object JsonPayloadRecordFormatterSupport" --type scala
# Also check for any JSON schema validation in the universal package
rg -A 10 "JsonSchemaOutputValidator" --type scala
# Let's also check the specific implementation of readMessage in RecordFormatterSupport
rg -A 10 "private def readMessage" --type scala
Length of output: 27730
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/KafkaUniversalComponentTransformer.scala (5)
29-29
: LGTM: Content type parameter constant added.
The new constant follows the existing naming convention and is appropriately placed with other parameter name constants.
50-54
: LGTM: Topic selection strategy based on configuration.
The implementation cleanly switches between strategies based on the configuration flag, following the suggested approach from previous reviews.
70-70
: LGTM: Topic selection strategy now receives KafkaConfig.
The change properly implements the suggested refactoring to pass KafkaConfig to the topic selection strategy.
123-125
: Verify the status of PLAIN content type support.
There's a TODO comment about removing the PLAIN content type once it's working correctly. Please clarify:
- What issues exist with PLAIN content type?
- What's the timeline for enabling it?
- Should this TODO be tracked in an issue?
Would you like me to create a GitHub issue to track this TODO?
238-239
: LGTM: Content type parameter name constant follows existing patterns.
The implementation is consistent with other parameter name constants and correctly marked as @transient.
utils/schemed-kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/schemedkafka/sink/UniversalKafkaSinkFactory.scala (4)
22-27
: LGTM: Clean import organization for content type support
The new imports are well-organized and properly scoped for the content type functionality.
99-108
: LGTM: Clean pattern matching for raw editor parameters
The pattern matching for content type parameters is well-structured and maintains consistency with the existing schema-based approach.
162-195
: 🛠️ Refactor suggestion
Consider extracting duplicate validation logic
The validation logic in this block is nearly identical to the schema-based validation at lines 124-161. Consider extracting this into a shared method to improve maintainability.
+ private def validateWithSchema(
+ context: ValidationContext,
+ runtimeSchemaData: RuntimeSchemaData[ParsedSchema],
+ mode: String,
+ value: BaseDefinedParameter
+ )(implicit nodeId: NodeId): TransformationStepResult = {
+ schemaSupportDispatcher
+ .forSchemaType(runtimeSchemaData.schema.schemaType())
+ .extractParameter(
+ runtimeSchemaData.schema,
+ rawMode = true,
+ validationMode = extractValidationMode(mode),
+ rawParameter = rawValueParamDeclaration.createParameter(),
+ restrictedParamNames
+ )
+ .map { extractedSinkParameter =>
+ val validationAgainstSchemaErrors = extractedSinkParameter
+ .validateParams(Map(sinkValueParamName -> value))
+ .swap
+ .map(_.toList)
+ .getOrElse(List.empty)
+ FinalResults(
+ context,
+ validationAgainstSchemaErrors,
+ Some(TransformationState(runtimeSchemaData, extractedSinkParameter))
+ )
+ }
+ .valueOr { errors =>
+ FinalResults(context, errors.toList)
+ }
+ }
249-278
: Consider handling empty schemas more explicitly
The comment about empty schemas leading to infinite loops deserves more attention. Consider adding explicit validation or documentation for this edge case.
✅ Verification successful
Let me gather more information about the schema handling and parameter transformation logic to better understand the potential for infinite loops.
Based on the analysis of the codebase, I can now provide a final response regarding the empty schema concern:
Empty schema handling is properly implemented with safeguards
The code already handles empty schemas appropriately:
- It explicitly checks for empty parameters using valueParam.toParameters.isEmpty
- It returns FinalResults with the current state when empty, preventing infinite loops
- The validation mode system (ValidationMode.lax vs ValidationMode.strict) provides additional control over schema validation
- Similar pattern is consistently used across the codebase in other schema-handling components
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for empty schema handling across the codebase
rg -A 3 "empty\s+schema"
# Look for potential infinite loop scenarios in schema handling
ast-grep --pattern 'while ($condition) {
$$$
if ($condition) {
$$$
}
}'
Length of output: 4037
Script:
#!/bin/bash
# Look for schema validation and parameter transformation logic
ast-grep --pattern 'def extractParameter($$$) {
$$$
}'
# Search for other instances where toParameters is used
rg -A 3 "toParameters"
# Look for schema validation implementations
rg -A 5 "ValidationMode"
Length of output: 108045
…dSchemaSupport doesn't know about it, it will be checked only in tests
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
engine/lite/components/kafka/src/main/scala/pl/touk/nussknacker/engine/lite/components/LiteKafkaSourceImplFactory.scala (1)
89-94
: Good fix for the {"Value": message} wrapping issue.
The implementation correctly addresses the PR objective of fixing the ad-hoc JSON object formatting. However, consider these improvements for robustness:
- The "Value" string comparison is case-sensitive. Consider using case-insensitive comparison or defining it as a constant.
- Consider adding validation for the extracted inner value to ensure it's of the expected type.
Here's a suggested improvement:
+ private val ValueKey = "Value" // Define constant for better maintainability
+
  override def parametersToTestData(params: Map[ParameterName, AnyRef]): ConsumerRecord[Array[Byte], Array[Byte]] = {
    val unflattenedParams = TestingParametersSupport.unflattenParameters(params)
    val removedValue = if (unflattenedParams.size == 1) {
      unflattenedParams.headOption match {
-       case Some(("Value", inner)) => inner
+       case Some((key, inner)) if key.equalsIgnoreCase(ValueKey) =>
+         require(inner != null, "Value cannot be null")
+         inner
        case _ => unflattenedParams
      }
    } else unflattenedParams

engine/flink/kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/kafka/source/flink/FlinkKafkaSource.scala (1)
128-137
: Consider metadata handling requirements
The PR objectives mention uncertainty about handling metadata from hermes. While these changes address the basic value handling, we might need to consider:
- How metadata will be handled in test scenarios
- Whether metadata should be included in test data generation
- If additional parameters for metadata testing are needed
Consider adding support for metadata handling in test scenarios if required. This could involve:
- Adding metadata-related test parameters
- Implementing metadata extraction in test data generation (a rough sketch follows below)
- Documenting metadata handling behavior
Would you like me to help design the metadata handling solution?
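A very rough sketch of the second point above; the metadata key and helper name are hypothetical (not hermes' actual envelope) and only illustrate separating producer metadata from the business message in test data:

  // split hypothetical producer metadata (e.g. a hermes-style envelope field) from the message fields
  def splitMetadata(params: Map[String, AnyRef]): (Map[String, AnyRef], Map[String, AnyRef]) =
    params.partition { case (key, _) => key.startsWith("__metadata") }

  // usage: val (metadata, messageFields) = splitMetadata(unflattenedParams)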
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- engine/flink/kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/kafka/source/flink/FlinkKafkaSource.scala (1 hunks)
- engine/lite/components/kafka/src/main/scala/pl/touk/nussknacker/engine/lite/components/LiteKafkaSourceImplFactory.scala (1 hunks)
🔇 Additional comments (3)
engine/lite/components/kafka/src/main/scala/pl/touk/nussknacker/engine/lite/components/LiteKafkaSourceImplFactory.scala (1)
Line range hint 52-54
: Consider addressing the TODO comment about kafkaConfig properties.
The PR objectives mention uncertainty about handling metadata from hermes. This TODO comment about kafkaConfig properties might be related. Consider whether additional kafkaConfig properties should be used here for metadata handling.
Let's check for any related configuration handling:
engine/flink/kafka-components-utils/src/main/scala/pl/touk/nussknacker/engine/kafka/source/flink/FlinkKafkaSource.scala (2)
137-137
: Verify test record creation with the new parameter handling
The change to use removedValue in createTestRecord looks correct, but we should ensure it handles all test scenarios properly.
Let's verify the test record creation:
✅ Verification successful
Based on the search results, I can now generate the final response:
Test record creation with removedValue is correctly implemented
The usage of createTestRecord with removedValue is consistent with the Kafka source test pattern across the codebase. The KafkaTestParametersInfo case class defines a type-safe function createTestRecord: Any => TestRecord, and its usage in both Flink and Lite Kafka sources follows the same pattern.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check test record creation implementations
# Search for implementations of createTestRecord
ast-grep --pattern 'def createTestRecord($params) = {
$$$
}'
# Search for test cases using createTestRecord
rg -A 5 "createTestRecord"
Length of output: 23073
128-133
: Improved handling of test data parameters
The changes correctly address the PR objective by removing the ad-hoc JSON object formatting when there's only a "Value" parameter. This aligns with the goal of handling topics without schema more naturally.
Let's verify the test data handling in the codebase:
✅ Verification successful
The verification results show that the parametersToTestData change in FlinkKafkaSource is mirrored in LiteKafkaSourceImplFactory, both implementing the same logic for handling single "Value" parameters. This is a consistent pattern across Kafka-related components. The change is well-contained within these Kafka source implementations, and other source implementations have their own specific parameter handling logic.
Parameter handling change is consistent and properly scoped
The special handling of single "Value" parameter is correctly implemented in both Kafka-related sources while other sources maintain their specific parameter handling logic. This confirms the change is properly scoped to Kafka components and aligns with the goal of handling topics without schema.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for test cases that might be affected by the parameter handling change
# Search for test cases using parametersToTestData
rg -l "parametersToTestData" | while read -r file; do
echo "=== $file ==="
rg -A 5 "parametersToTestData" "$file"
done
# Search for test cases with single "Value" parameter
ast-grep --pattern 'Map($_, $v)' | while read -r match; do
rg -C 5 "Value.*=>" "$(echo "$match" | cut -d: -f1)"
done
Length of output: 9050
Describe your changes
We want to enable users to choose and use topics without schema present on schema registry.
We won't validate the input; it will be passed as Unknown.
Problems
Currently, ad-hoc testing creates the JSON object {"Value": message}, so it behaves differently from when message is put on the topic directly.
Similarly, I'm not sure if we should handle metadata from hermes here, or whether that is managed in the cloud. If we create a topic locally and send messages through hermes, we get the message wrapped together with hermes metadata, when we only need message -> this can be worked around for now by using dynamic access.
Checklist before merge
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests