This repository has been archived by the owner on Apr 18, 2018. It is now read-only.

API for adding java-based spouts to Pyleus topologies #99

Open: wants to merge 7 commits into base: develop

mzbyszynski

Adds the ability to integrate java-based spouts with Pyleus topologies, based on the way that the kafka spout was previously integrated into Pyleus.

In a nutshell, to add a java spout to your topology you need to:

  1. Write a Java class that implements the new SpoutProvider interface and package it in a jar.
  2. Add the jars you need, plus the mapping from spout type to Java SpoutProvider class, to your pyleus.conf.
  3. Add the spout to your topology yaml and define its type, output_fields and options.
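
To make the relationship between steps 2 and 3 concrete, here is a minimal sketch in Python of how a spout entry from the topology yaml could be matched against the type-to-class mapping. It is not the code in this pull request: the `SPOUT_PROVIDERS` dict, the `resolve_provider` helper, the type name `my_java_spout`, the class name `com.example.MySpoutProvider` and the option `batch_size` are all hypothetical, and the actual pyleus.conf syntax is described in external.rst.

```python
# Hypothetical stand-in for the spout type-to-class mapping kept in pyleus.conf.
# Neither the type name nor the class name comes from the pull request.
SPOUT_PROVIDERS = {
    "my_java_spout": "com.example.MySpoutProvider",
}

# Keys a spout entry is assumed to need before it can be resolved.
REQUIRED_KEYS = ("type", "output_fields")


def resolve_provider(spout_spec, providers):
    """Return the provider class name registered for a spout entry."""
    missing = [key for key in REQUIRED_KEYS if key not in spout_spec]
    if missing:
        raise ValueError("spout spec is missing: " + ", ".join(missing))
    spout_type = spout_spec["type"]
    if spout_type not in providers:
        raise KeyError("no SpoutProvider registered for type %r" % spout_type)
    return providers[spout_type]


# A spout entry as it might look once the topology yaml has been parsed.
spec = {
    "name": "events",
    "type": "my_java_spout",
    "output_fields": ["id", "payload"],
    "options": {"batch_size": 100},  # hypothetical option
}

provider_class = resolve_provider(spec, SPOUT_PROVIDERS)
print(provider_class)
```

The point of the sketch is only that the `type` in the topology yaml is the key that ties a spout to a Java class listed in the configuration; `output_fields` and `options` are then passed along to that class.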

Documentation

  • I added external.rst to the docs, which describes the above steps in greater detail.
  • I added a new example project called java_spout_provider that includes the Java and Python source for creating a Java spout provider.
  • Updated the yaml documentation to include the new options.

Testing:

  • I think I added unit tests to cover all the python changes I made.
  • I also manually tested these cases (and verified they were working):
    • java_spout_provider example locally using pyleus local with Storm 0.9.3.
    • java_spout_provider on storm cluster using pyleus submit with Storm 0.9.3.
    • kafka spout with simple bolt that logs the messages on kafka/storm cluster using pyleus submit with Storm 0.9.3 and Kafka 0.8.2.

All questions, feedback and code review comments welcome! I was thinking about adding a readme.md file to the java_spout_provider example, since there are a bunch of steps to build it, but I didn't see anything similar in the other examples so I didn't want to violate any project conventions. Some guidance on that would be great as well.

Thanks!

Closes #93
Closes #91

poros commented Mar 2, 2015

@mzbyszynski, sorry for the late answer. This is a feature that I believe would be a very good addition to pyleus and it is a fair amount of work, so thank you for doing that. And thank you for writing documentation as well :)

(Since it also closes #93, I guess this is based on #94, right?)

However, since this is such a huge change, also in terms of "user interface", I believe we should get people's thoughts on it (pinging @patricklucas and @ecanzonieri here) before we actually start discussing the details of the code.

  • Is this change in the direction we want the project to follow?
  • Are we OK with having people go through this kinda complex build procedure by themselves (the pyleus jar is not even available in Maven Central)?
  • Do we like this SpoutProvider pattern?
  • Is it OK to leave Bolts behind? (Spouts usually depend only on the data source and so are reusable, while Bolts tend to be rewritten for every topology, but still...)
  • Are we OK with adding a topology-level feature such as this one to pyleus.conf and breaking the separation between pyleus-level configuration and topology-level options?

On the first and most important question in the list: personally, I have mixed feelings about this change.
On the one hand, I have always wanted to get rid of all the Java bits except for the MsgpackSerializer, since that code is untested, introduces complexity and is difficult to maintain.
On the other hand, rewriting the Pyleus core in Python would require a huge amount of work that I do not have enough time for at the moment or in the near future. It also raises some serious open issues, such as losing the local run feature provided by Storm, adding Thrift compilation, and implementing and testing a whole new topology "building" and packaging system.
For these reasons, I might be convinced that continuing to add features to the Java core is not a bad idea and would not make the project much more painful to maintain.
Having said that, I am neither the owner of the project nor the person primarily responsible for it, and I have no right to block or pass any pull request single-handedly, so I am waiting for other people's input here.

slively commented Mar 31, 2015

I'm looking into using pyleus (love it a ton so far), but we are using Kinesis instead of Kafka, and AWS has a supported Kinesis spout for Storm that I'd really like to use. This feature would be really awesome for my use case. Looking through the changes and documentation, it looks pretty straightforward to use. I'm gonna give it a try and report back.
