Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix "bundle init" when run from Databricks #1744

Closed
wants to merge 14 commits into from
Closed

Conversation

fjakobs
Copy link
Contributor

@fjakobs fjakobs commented Sep 3, 2024

Changes

When started from a Databricks cluster use import API when creating notebooks from bundle init.

Tests

Manually tested by executing databricks bundle init from a web terminal on a Databricks cluster.

libs/template/file.go Outdated Show resolved Hide resolved
libs/template/materialize.go Outdated Show resolved Hide resolved
libs/template/file.go Outdated Show resolved Hide resolved
Copy link
Contributor

@shreyas-goenka shreyas-goenka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This also needs some unit tests. You could use a mocked workspace client and set it in the context for the unit test. You can use t.Setenv to set the DATABRICKS_RUNTIME_VERSION env var for the tests.

See TestResolveClusterReference for a reference of a mocked workplace client.

libs/template/file.go Outdated Show resolved Hide resolved
libs/template/file.go Outdated Show resolved Hide resolved
libs/template/file.go Outdated Show resolved Hide resolved
libs/template/file.go Outdated Show resolved Hide resolved
libs/template/file.go Outdated Show resolved Hide resolved
libs/template/file.go Outdated Show resolved Hide resolved
libs/template/materialize.go Outdated Show resolved Hide resolved
libs/template/file.go Outdated Show resolved Hide resolved
libs/template/file.go Outdated Show resolved Hide resolved
libs/template/file.go Outdated Show resolved Hide resolved
libs/template/file.go Outdated Show resolved Hide resolved
libs/template/file.go Outdated Show resolved Hide resolved
libs/template/file.go Outdated Show resolved Hide resolved
@fjakobs fjakobs force-pushed the init_on_databricks branch 2 times, most recently from a46942f to 20ee23e Compare September 4, 2024 09:32
Copy link
Contributor

@lennartkats-db lennartkats-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at this! This was much less involved than I had expected. Just one remaining comment + one nit.


const envDatabricksRuntimeVersion = "DATABRICKS_RUNTIME_VERSION"

func RunsOnDatabricks(ctx context.Context) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for a common lib function. I'd expect this to also check for the existence of /Workspace though to avoid false positives. There may be all kinds of reasons why customers set the env var locally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not to test for /Workspace as it makes the code using it harder to test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, testing is a pain :( We can leave it out, but then we should at least emit a debug log message whenever there's a match. There will be false positives.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing is not an issue as long as you don't inline these expectations.

You can store a bool on the context that stores whether or not we're running on DBR (see bundle/context.go for inspiration). In tests, you mark it as always true or always false depending on what you want to test. For the real CLI run, you run a routine that performs the actual detection and stores it in the context.

For the check itself, it can look at /proc/mounts for the workspace fuse mount, it can check for /.fuse-mounts, it can check for /databricks, and I'm sure there are a couple other stable ways to determine this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented this approach in #1889.

Besides the environment variable, it also checks for the presence of a /databricks directory.

@@ -313,8 +319,7 @@ func (r *renderer) persistToDisk() error {
_, err := os.Stat(path)
if err == nil {
return fmt.Errorf("failed to initialize template, one or more files already exist: %s", path)
}
if err != nil && !errors.Is(err, fs.ErrNotExist) {
} else if !errors.Is(err, fs.ErrNotExist) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: isn't this easier to read without the else if?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is

Copy link
Contributor

@shreyas-goenka shreyas-goenka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good to me other than one comment. Please TAL!

libs/template/file.go Outdated Show resolved Hide resolved
if strings.HasPrefix(path, "/Workspace/") && runtime.RunsOnDatabricks(ctx) {
isNotebook, _, _ := notebook.DetectWithContent(path, content)
return isNotebook
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit: remove else block

libs/template/file.go Show resolved Hide resolved
libs/runtime/detect.go Show resolved Hide resolved
libs/template/file.go Show resolved Hide resolved
@fjakobs fjakobs enabled auto-merge September 4, 2024 12:09
@fjakobs fjakobs disabled auto-merge September 4, 2024 12:09
Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple small comments remaining.

FWIW, in hindsight it would have been neater to use a filer for writing everything. Then we could have swapped out the writing filer for the extension-aware workspace filer and things would have worked transparently (and all writes would use the API as opposed to only notebooks).

libs/notebook/detect.go Outdated Show resolved Hide resolved
assert.False(t, shouldUseImportNotebook(ctx, "/Workspace/foo/bar", data))
assert.False(t, shouldUseImportNotebook(ctx, "/Workspace/foo/bar.ipynb", data))

t.Setenv("DATABRICKS_RUNTIME_VERSION", "14.3")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use env.Set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference? I see t.Setenv used in tests. If I change it to env.Set then my tests break.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With env.Set it adds the variable to the context and passes it along in the context.

If it broke then either 1) you weren't capturing the returned ctx, or 2) the function doesn't use env.Get to access the environment variables.

libs/template/renderer.go Show resolved Hide resolved
@fjakobs fjakobs added this pull request to the merge queue Sep 17, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 17, 2024
assert.False(t, shouldUseImportNotebook(ctx, "/Workspace/foo/bar", data))
assert.False(t, shouldUseImportNotebook(ctx, "/Workspace/foo/bar.ipynb", data))

t.Setenv("DATABRICKS_RUNTIME_VERSION", "14.3")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With env.Set it adds the variable to the context and passes it along in the context.

If it broke then either 1) you weren't capturing the returned ctx, or 2) the function doesn't use env.Get to access the environment variables.


const envDatabricksRuntimeVersion = "DATABRICKS_RUNTIME_VERSION"

func RunsOnDatabricks(ctx context.Context) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing is not an issue as long as you don't inline these expectations.

You can store a bool on the context that stores whether or not we're running on DBR (see bundle/context.go for inspiration). In tests, you mark it as always true or always false depending on what you want to test. For the real CLI run, you run a routine that performs the actual detection and stores it in the context.

For the check itself, it can look at /proc/mounts for the workspace fuse mount, it can check for /.fuse-mounts, it can check for /databricks, and I'm sure there are a couple other stable ways to determine this.


func RunsOnDatabricks(ctx context.Context) bool {
value, ok := env.Lookup(ctx, envDatabricksRuntimeVersion)
return value != "" && ok
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you need the value to be non-empty, then you don't need to check if it exists (the OK).

You can use _ for unused variables, or use env.Get directly:

cli/libs/env/context.go

Lines 51 to 56 in a4ba0bb

// Get key from the context or the environment.
// Context has precedence.
func Get(ctx context.Context, key string) string {
v, _ := Lookup(ctx, key)
return v
}

Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parking this PR.

Discussed that we should take the import approach for all files.

github-merge-queue bot pushed a commit that referenced this pull request Nov 14, 2024
## Changes

Whether or not the CLI is running on DBR can be detected once and stored
in the command's context.

By storing it in the context, it can easily be mocked for testing.

This builds on the simpler approach and conversation in #1744. It
unblocks testing of the DBR-specific paths while not compromising on the
checks we can perform to test if the CLI is running on DBR.

## Tests

* Unit tests for the new `dbr` package
* New unit test for the `ConfigureWSFS` mutator
github-merge-queue bot pushed a commit that referenced this pull request Nov 20, 2024
## Changes

While working on the v2 of #1744, I found that:
* Template initialization first copies built-in templates to a temporary
directory before initializing them
* Reading a template's contents goes through a `filer.Filer` but is
hardcoded to a local one

This change updates the interface for reading templates to be `fs.FS`.
This is compatible with the `embed.FS` type for the built-in templates,
so they no longer have to be copied to a temporary directory before
being used.

The alternative is to use a `filer.Filer` throughout, but this would
have required even more plumbing, and we don't need to _read_ templates,
including notebooks, from the workspace filesystem (yet?).

As part of making `template.Materialize` take an `fs.FS` argument, the
logic to match a given argument to a particular built-in template in the
`init` command has moved to sit next to its implementation.

## Tests

Existing tests pass.
github-merge-queue bot pushed a commit that referenced this pull request Nov 20, 2024
## Changes

When running the CLI on Databricks Runtime (DBR), use the
extension-aware filer to write an instantiated template if the instance
path is located in the workspace filesystem.

Notebooks cannot be written through the workspace filesystem's FUSE
mount. As a result, this is the only method for initializing templates
that contain notebooks when running the CLI on DBR and writing to the
workspace filesystem.

Depends on #1910 and #1911.

Supersedes #1744.

## Tests

* Manually confirmed I can initialize a template with notebooks when
running the CLI from the web terminal.
@pietern
Copy link
Contributor

pietern commented Nov 20, 2024

This was superseded by #1912.

The new approach is to use a filer.Filer for all writes so that we do not differentiate on file type.

@pietern pietern closed this Nov 20, 2024
@pietern pietern deleted the init_on_databricks branch November 20, 2024 12:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants