Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Early Access: Task Worker proposal #2

Open
wants to merge 15 commits into
base: master
Choose a base branch
from
Open

Conversation

chadrik
Copy link

@chadrik chadrik commented Oct 20, 2020

This is a PR to make it easy to see what we've changed before we move forward with a PR against apache/beam

Comment on lines 42 to 53
except ImportError:
BatchV1Api = None
V1EnvVar = None
V1Container = None
V1PodTemplateSpec = None
V1PodSpec = None
V1ObjectMeta = None
V1Job = None
V1JobSpec = None
V1DeleteOptions = None
ApiException = None
load_kube_config = None
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this ok or necessary?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the pattern in some other modules that require "extras" that may not be installed. This will not work without the kubernetes API installed, but the module needs to be importable due to it getting imported via entrypoints regardless of whether or not it gets used.

I can clean this up a bit and add a logger warning message.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds like there is both a precedent and a reason. just leave a comment explaining "the module needs to be importable due to it getting imported via entrypoints regardless of whether or not it gets used"

Copy link
Author

@chadrik chadrik Oct 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though maybe this would be easier to look at and maintain if we did:

try:
  from kubernetes import client
except ImportError:
  client = None

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I ended up with something similar. I've also setup apache_beam.task.task_worker.kubejob.transforms so that it will raise an ImportError if the kubernetes API isn't available.

def process_bundle(self, instruction_id):
# type: (str) -> Tuple[List[beam_fn_api_pb2.DelayedBundleApplication], bool]
def process_bundle(self, instruction_id, use_task_worker=True):
# type: (str, bool) -> Tuple[List[beam_fn_api_pb2.DelayedBundleApplication], bool]
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@violalyu Is the BundleProcessor re-used for many bundles? If not, should use_task_worker be passed to __init__?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, a BundleProcessor may be reused, the number of bundle processor should be the same as the number of SDK worker we have I think (each SDK worker request BundleProcessor from BundleProcessorCache, and if there's no more "available" bundle processor it creates a new one, otherwise it use the available ones)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use_task_worker vary over the lifetime of a BundleProcessor? If not, we should make it an __init__ arg.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated in this commit: 884ffd3

input_op_by_transform_id, # type: Dict[str, DataInputOperation]
use_task_worker=True # type: bool
):
# type: (...) -> Union[Tuple[None, None], Tuple[List[beam_fn_api_pb2.DelayedBundleApplication], bool]]
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@violalyu like-wise, how many times in the lifespan of a BundleProcessor would maybe_process_remotely be called?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be called the same number of times as the number of bundles this current BundleProcessor is asked to process

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants