[pull] master from tensorflow:master #238

PiperOrigin-RevId: 698049836

…e ALL broadcast-like inputs on TFLite ops that support implicit broadcasting PiperOrigin-RevId: 698054216

* Change the default QNN graph config to use the HTP FP16 precision backend config; this is required to correctly compile FP32 ops. * Create a 1-element 1D tensor out of a scalar value, because QNN ops always take ranked tensor types as input. PiperOrigin-RevId: 698081261

… the .td definition. PiperOrigin-RevId: 698082807

* Add FC Op legalization and test data. * Add Select/Select_v2 Op legalization. * Misc cleanups. PiperOrigin-RevId: 698094953

PiperOrigin-RevId: 698111562

…rializing any modules. Also pulled the deserialization a little further up the stack and only do it if the input doesn't already have a full module op. PiperOrigin-RevId: 698116466

PiperOrigin-RevId: 698121993

PiperOrigin-RevId: 698132925

PiperOrigin-RevId: 698133024

… Todo (resolved) PiperOrigin-RevId: 698133747

This fixes an issue with gsutil, which expects Python 3.5-3.11:
```
Error: gsutil requires Python version 2.7 or 3.5-3.11, but a different version is installed.
```
PiperOrigin-RevId: 698134102
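The version constraint quoted in the error message can be sketched as a small check. This is only an illustration of the stated ranges; the function name `gsutil_supports` is hypothetical and is not part of gsutil.

```python
def gsutil_supports(version):
    """Return True if (major, minor) is in the ranges quoted by the error.

    Hypothetical helper: it encodes "2.7 or 3.5-3.11" from the message
    above and nothing else.
    """
    major, minor = version[0], version[1]
    if major == 2:
        return minor == 7
    return major == 3 and 5 <= minor <= 11
```

For example, `gsutil_supports((3, 12))` is false, which is the situation the error message describes.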

PiperOrigin-RevId: 698137647

PiperOrigin-RevId: 698150097

PiperOrigin-RevId: 698163185

…. This Extend() call would also lead to a memory assignment issue since it wasn't accompanied by the necessary chunk commit requests. We also add a VerifyAllocations() function that uses a BufferIntervalTree to check for overlapping Allocations before scheduling the asynchronous copies. This is an extra check for the correctness of MsaAlgorithm allocations, and is only applied if options_.verify is enabled in MSA options. options_.verify is disabled by default. PiperOrigin-RevId: 698164396
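The overlap check described above can be sketched as follows. This is a minimal illustration, assuming each allocation occupies an inclusive time interval and a byte range; it uses a quadratic pairwise scan rather than a `BufferIntervalTree`, and the `Allocation` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Allocation:
    # Hypothetical fields, chosen for illustration only.
    name: str
    start_time: int  # inclusive
    end_time: int    # inclusive
    offset: int      # first byte of the chunk
    size: int        # chunk size in bytes


def overlaps(a, b):
    """Two allocations conflict only if they overlap in time AND in memory."""
    time_overlap = a.start_time <= b.end_time and b.start_time <= a.end_time
    space_overlap = (a.offset < b.offset + b.size
                     and b.offset < a.offset + a.size)
    return time_overlap and space_overlap


def find_conflicts(allocations):
    """Pairwise scan; an interval tree would avoid the quadratic cost."""
    return [(a.name, b.name)
            for a, b in combinations(allocations, 2)
            if overlaps(a, b)]
```

An empty result means no two allocations share memory while both are live, which is the property such a verification step would assert.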

PiperOrigin-RevId: 698164750

PiperOrigin-RevId: 698164921

This change adds the legalization pass from IFRT to VIFRT. Legalization uses a templated OpConversion class, which is refined via the `IFRT` <-> `VIFRT` and `mlir::Func::*` <-> `VIFRT` op mappings defined in `map_ifrt_to_vifrt.h`. The change also versions `mlir::func::FuncOp`, `mlir::func::ReturnOp` and `mlir::func::CallOp`, which provides the following advantages: 1) we can use the templated OpConversion class rather than implementing a separate converter for each op, and 2) we can restrict the surface of possible breaking changes to just builtin types and attributes. Moreover, the change versions `mlir::FunctionType` and `mlir::TypeAttr` in order to use the generic op converter and to restrict the types allowed in functions (just builtin and IFRT types). PiperOrigin-RevId: 698168526

Also fixed invalid C++ header usage. PiperOrigin-RevId: 698170878

…td definition. PiperOrigin-RevId: 698171237

PiperOrigin-RevId: 698174417

PiperOrigin-RevId: 698189797

PiperOrigin-RevId: 698196106

PiperOrigin-RevId: 698201598

PiperOrigin-RevId: 698212499

PiperOrigin-RevId: 698218778

PiperOrigin-RevId: 698221629

PiperOrigin-RevId: 698228306

… Todo(resolved) PiperOrigin-RevId: 698230798

PiperOrigin-RevId: 698230884

PiperOrigin-RevId: 698237370

…remove the .td definition. PiperOrigin-RevId: 698241447

… for bytes return from plugin in tests to avoid copy PiperOrigin-RevId: 698251740

PiperOrigin-RevId: 698271808

PiperOrigin-RevId: 698294876

PiperOrigin-RevId: 698294898

PiperOrigin-RevId: 698297679

We have support for lowering PTX in the runtime, so we can just use `MultiKernelLoaderSpec` and we get compilation and caching for free. PiperOrigin-RevId: 698297929

PiperOrigin-RevId: 698302393

This is not needed since the runtime can compile PTX for us. I'm actually surprised that this ever worked, because the original code compiled PTX into CUBIN and then forced the CUBIN into the PTX argument of the kernel creation helper. This is now fixed. PiperOrigin-RevId: 698303129

PiperOrigin-RevId: 698304950

…hunks Also the usual drive-by cleanups: - Remove unused includes - Add explicit includes for things we depended on transitively - Clean up dependencies of the build targets PiperOrigin-RevId: 698317755

func-bufferize pass is removed by llvm/llvm-project@e394fec PiperOrigin-RevId: 698322658

We turn on previously implemented heuristics by default. PiperOrigin-RevId: 698324486

PiperOrigin-RevId: 698329980

Imported from GitHub PR openxla/xla#19363 Sets the loop iteration counter increment in the backward transformation of the collective pipeliner pass to account for cases with non-zero initial value of the loop iteration counter. See #16953 and #18568. Copybara import of the project: -- 06137aa0618d372e2d4badbf16920bead9922cfb by Philipp Hack <[email protected]>: Modifies the loop counter increment set in the backward transformation of the collective pipeliner. -- 6da45bcb26643d8994bf608f05230fa748286b02 by Philipp Hack <[email protected]>: Modifies the loop counter increment set in the backward transformation of the collective pipeliner. Merging this change closes #19363 PiperOrigin-RevId: 698342374

Dequantized static data is cached. However, when there are multiple subgraphs, the data is overwritten by each subgraph. PiperOrigin-RevId: 698342673
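The overwrite described above can be illustrated with a toy cache. The commit message does not say how the cache is keyed or how the fix works, so this sketch only assumes that a cache keyed by tensor index alone would collide across subgraphs, and that adding the subgraph index to the key avoids it. Both class names are hypothetical.

```python
class FlatCache:
    """Keyed by tensor index only, so subgraphs sharing an index collide."""

    def __init__(self):
        self._data = {}

    def put(self, subgraph_index, tensor_index, values):
        # The subgraph index is ignored: later writes replace earlier ones.
        self._data[tensor_index] = values

    def get(self, subgraph_index, tensor_index):
        return self._data[tensor_index]


class PerSubgraphCache:
    """Keyed by (subgraph index, tensor index), so entries stay separate."""

    def __init__(self):
        self._data = {}

    def put(self, subgraph_index, tensor_index, values):
        self._data[(subgraph_index, tensor_index)] = values

    def get(self, subgraph_index, tensor_index):
        return self._data[(subgraph_index, tensor_index)]
```

With `FlatCache`, writing tensor 3 for subgraph 1 replaces the data stored for tensor 3 of subgraph 0; with `PerSubgraphCache` both entries survive.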

…or fusion candidates. Imported from GitHub PR openxla/xla#19393 Copybara import of the project: -- ec107a12fbee6826f1f668218b7c7a40f5886420 by Ilia Sergachev <[email protected]>: [GPU] Horizontal loop fusion: pass bitcasts when looking for fusion candidates. -- 71241097ce67412246ec18efca5165619601eace by Ilia Sergachev <[email protected]>: simplify cuDNN norm test Merging this change closes #19393 PiperOrigin-RevId: 698357838

PiperOrigin-RevId: 698360931

PiperOrigin-RevId: 698372450

…vice/gpu/te… Imported from GitHub PR openxla/xla#19484 …sts:gpu_input_fusible_slice_test Copybara import of the project: -- 0d307384bff386d5182f89ae5a5422f8ca1a1290 by Dragan Mladjenovic <[email protected]>: [ROCm] Fix //xla/tests:complex_unary_op_test and //xla/service/gpu/tests:gpu_input_fusible_slice_test Merging this change closes #19484 PiperOrigin-RevId: 698374588

… CUDA_ERROR_ILLEGAL_ADDRESS in (micro)benchmarks with FP8 Triton kernels during exhaustive autotuning. PiperOrigin-RevId: 698387396

PiperOrigin-RevId: 698388778

PiperOrigin-RevId: 698388866

PiperOrigin-RevId: 698389323

What this change does is it: 1. Identifies all `kTfLiteBuiltinDequantize` nodes converting `kTfLiteFloat16` to `kTfLiteFloat32` and plugging into a `kTfLiteBuiltinFullyConnected`, `kTfLiteBuiltinConv2d`, or `kTfLiteBuiltinDepthwiseConv2d` node. 2. Re-maps XNNPACK tensors pointing to the `kTfLiteFloat32` output to point to the original `kTfLiteFloat16` input. The `kTfLiteFloat16` weights/filters and biases are handled by XNNPACK directly. PiperOrigin-RevId: 698395748
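The two steps above can be sketched on a toy graph. This is not the delegate's implementation: nodes are plain dictionaries, op and type names are strings, and the function names are hypothetical. It only shows the shape of the rewrite, namely finding FP16-to-FP32 dequantize nodes that feed a supported op and pointing that op at the FP16 tensor instead.

```python
# Ops that are assumed to accept FP16 inputs directly in this sketch.
CONSUMER_OPS = {"FULLY_CONNECTED", "CONV_2D", "DEPTHWISE_CONV_2D"}


def find_fp16_dequantize_remaps(nodes, tensor_types):
    """Step 1: map each FP32 dequantize output to its FP16 input."""
    remap = {}
    for node in nodes:
        if node["op"] != "DEQUANTIZE":
            continue
        src, dst = node["inputs"][0], node["outputs"][0]
        if tensor_types[src] != "float16" or tensor_types[dst] != "float32":
            continue
        feeds_consumer = any(
            other["op"] in CONSUMER_OPS and dst in other["inputs"]
            for other in nodes
        )
        if feeds_consumer:
            remap[dst] = src
    return remap


def apply_remaps(nodes, remap):
    """Step 2: make consumer ops read the original FP16 tensor."""
    rewritten = []
    for node in nodes:
        if node["op"] in CONSUMER_OPS:
            inputs = [remap.get(t, t) for t in node["inputs"]]
            rewritten.append({**node, "inputs": inputs})
        else:
            rewritten.append(node)
    return rewritten
```

Dequantize nodes whose output feeds any other kind of op are left alone, matching the restriction to the three listed consumers.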

…g_util by using `PropagateShardingAlongDimsAndReplicateOthers`. This is a no-op change. PiperOrigin-RevId: 698403022

…ernal to a MutableOpResolver. PiperOrigin-RevId: 698407108

PiperOrigin-RevId: 698410181

We need to adjust the heuristic because our emitter previously had an issue that prevented Triton from doing proper layout optimizations. It was fixed in openxla/xla@7280b9a. Before that fix we needed a higher number of warps (up to 32) to compensate for the missing layout optimization, but now this can cause performance regressions, because Triton tends to insert shmem usage and barrier syncs. PiperOrigin-RevId: 698416298

…ather/scatter instructions. Implicit batching dims are also known as index parallel dims. Update `GetGatherScatterBatchParallelDims` accordingly. The sharding propagation and spmd partitioner will process explicit and implicit batching dims separately. PiperOrigin-RevId: 698421986

Minimal XLA:CPU runtime implementation optimized for low latency inference.

| Benchmark | Time | CPU | Iterations |
| --- | --- | --- | --- |
| BM_NanoRtAddScalars | 84.8 ns | 84.8 ns | 8277118 |
| BM_NanoRtFibonacci | 81.1 ns | 81.1 ns | 8468298 |
| BM_PjRtAddScalars | 1517 ns | 1517 ns | 460076 |
| BM_PjRtFibonacci | 1523 ns | 1523 ns | 460415 |

PiperOrigin-RevId: 698426607

PiperOrigin-RevId: 698427377

| name | old cpu/op | new cpu/op | delta |
| --- | --- | --- | --- |
| BM_CountDownSuccess/4 | 97.6ns ± 2% | 97.9ns ± 1% | ~ (p=0.841 n=5+5) |
| BM_CountDownSuccess/8 | 123ns ± 2% | 122ns ± 1% | ~ (p=0.548 n=5+5) |
| BM_CountDownSuccess/16 | 171ns ± 1% | 172ns ± 2% | ~ (p=0.548 n=5+5) |
| BM_CountDownSuccess/32 | 270ns ± 1% | 271ns ± 1% | ~ (p=0.310 n=5+5) |
| BM_CountDownError/4 | 215ns ± 1% | 212ns ± 3% | ~ (p=0.310 n=5+5) |
| BM_CountDownError/8 | 309ns ± 2% | 307ns ± 1% | ~ (p=0.421 n=5+5) |
| BM_CountDownError/16 | 500ns ± 1% | 496ns ± 2% | ~ (p=0.421 n=5+5) |
| BM_CountDownError/32 | 888ns ± 1% | 885ns ± 2% | ~ (p=0.548 n=5+5) |

PiperOrigin-RevId: 698431683

…ed) in CUDA BUILD file. PiperOrigin-RevId: 698444212

…esolved) PiperOrigin-RevId: 698444339

Todo(resolved) PiperOrigin-RevId: 698445171

Todo(resolved) PiperOrigin-RevId: 698445177

PiperOrigin-RevId: 698448973

PiperOrigin-RevId: 698452075

Currently the transposed convolution is orders of magnitude slower than the regular one. Ideally performance should be similar. Detailed results:

| Benchmark | Time | CPU | Iterations |
| --- | --- | --- | --- |
| BM_Conv2DStrided/process_time | 3737222 ns | 41608631 ns | 16 |
| BM_Conv2DTransposedStrided/process_time | 590079914 ns | 1.0847e+10 ns | 1 |

PiperOrigin-RevId: 698453016

The correct output dimension when dumped to HLO text is `bf0`, where `f` means the output feature dimension. There is no dimension called `o`. PiperOrigin-RevId: 698453240

… State

By keeping AsyncValueRef as a part of the State we avoid one extra reference counting operation when copying CountDownAsyncValue (and we expect to copy it `cnt` times).

| name | old cpu/op | new cpu/op | delta |
| --- | --- | --- | --- |
| BM_CountDownSuccess/8 | 95.8ns ± 4% | 81.7ns ± 1% | -14.64% (p=0.000 n=40+35) |
| BM_CountDownSuccess/16 | 142ns ± 1% | 127ns ± 1% | -10.05% (p=0.000 n=37+38) |
| BM_CountDownSuccess/32 | 229ns ± 2% | 216ns ± 1% | -5.56% (p=0.000 n=40+38) |
| BM_CountDownError/4 | 165ns ± 1% | 152ns ± 2% | -7.65% (p=0.000 n=39+40) |
| BM_CountDownError/8 | 238ns ± 2% | 225ns ± 1% | -5.65% (p=0.000 n=40+38) |
| BM_CountDownError/16 | 388ns ± 2% | 369ns ± 2% | -4.77% (p=0.000 n=40+36) |
| BM_CountDownError/32 | 684ns ± 1% | 666ns ± 1% | -2.50% (p=0.000 n=38+38) |

PiperOrigin-RevId: 698454410

…en calls to `Delegate::PrepareOpsToDelegate`. PiperOrigin-RevId: 698455074

PiperOrigin-RevId: 698462670

PiperOrigin-RevId: 698466696

… compiler. Imported from GitHub PR openxla/xla#19237 Copybara import of the project: -- 177f911fd4c6af86c25aba2e38ea09767477be03 by Ilia Sergachev <[email protected]>: [GPU] Fix passing of key-value store handle from client to compiler. -- ec2b96ccdf8cd81abdc25f3cff2bdf65df455219 by Ilia Sergachev <[email protected]>: use allowed_devices instead of CUDA_VISIBLE_DEVICES -- 77ba9fd7b172052269fafd1a1970d58d1d803a59 by Ilia Sergachev <[email protected]>: skip the added test on pre-Ampere GPUs Merging this change closes #19237 PiperOrigin-RevId: 698469112

PiperOrigin-RevId: 698473752

PiperOrigin-RevId: 698474274

No functional change is intended but it generates less IR. PiperOrigin-RevId: 698477060

…er_interval_comparator into msa/utils. PiperOrigin-RevId: 698481492

PiperOrigin-RevId: 698485833

As of this CL, all array operations (except `IsDeleted()`) are asynchronous. This CL also makes the following drive-by changes: 1. Version management is getting refactored to use an enum and a header file within /common. 2. All error responses from the server (except connection terminations, which follow the previous behavior) are now printed out as a WARNING. PiperOrigin-RevId: 698491308

PiperOrigin-RevId: 698494677

PiperOrigin-RevId: 698502341

Also add a bit more comments and re-organize some things. PiperOrigin-RevId: 698507311

…e the .td definition. PiperOrigin-RevId: 698510501

PiperOrigin-RevId: 698519430

…ion pointers. Check for correct union types in tensor cc api PiperOrigin-RevId: 698529069

PiperOrigin-RevId: 698529351

…Lite. This CL optimizes explicit broadcasting-like patterns in TFLite, because TFLite Ops support implicit broadcasting. Also, this CL is moving the existing fusions on broadcast-to+select to the dedicated pass. The patterns are: - Fuse splat const into select op. - Fuse fill-op into select op. PiperOrigin-RevId: 698530501

PiperOrigin-RevId: 698531058

PiperOrigin-RevId: 698535976

PiperOrigin-RevId: 698542878

PiperOrigin-RevId: 698548746

…ues). The function was previously assuming allocation_values can never be empty. PiperOrigin-RevId: 698548828

…e recursively calculating the range of an expression. PiperOrigin-RevId: 698551804

It is unused. PiperOrigin-RevId: 698554807

PiperOrigin-RevId: 698555795

PiperOrigin-RevId: 698557779

PiperOrigin-RevId: 698567794

PiperOrigin-RevId: 698572323

…ing input tensors to it. PiperOrigin-RevId: 698572634

* Add support for overriding cross program prefetch behavior. * Add support for filtering buffer intervals based on the uses of the buffer. * Add tests for overriding cross program prefetch behavior * Add tests for expanding filtering criteria. PiperOrigin-RevId: 698574108

PiperOrigin-RevId: 698575496

…gin. PiperOrigin-RevId: 698585431

PiperOrigin-RevId: 698592285

PiperOrigin-RevId: 698610190

PiperOrigin-RevId: 698625663

…ved) PiperOrigin-RevId: 6986421

PiperOrigin-RevId: 698644067

PiperOrigin-RevId: 698648940

…titioned Op. PiperOrigin-RevId: 698655747

… memory_space_assignment tests. PiperOrigin-RevId: 698657506

…chedule after/before for prefetch time override. PiperOrigin-RevId: 698666645

PiperOrigin-RevId: 698671850

…ompiler NVPTXCompiler was calling `cuda::GetDriverVersion` to determine whether the CUDA driver is new enough to consider it for PTX JIT compilation. This change makes it use the driver version available in the `DeviceDescription` type. PiperOrigin-RevId: 698672918

There seems to be no dedicated libdevice call for Log with F16 or BF16 type. Currently we upcast to F32 and use __nv_logf. However it seems likely that __nv_fast_logf is good enough for F16 and BF16 type, so use it as it is considerably faster. PiperOrigin-RevId: 698673580

Also delete the other things which were only referenced from that file. PiperOrigin-RevId: 698706755

- The remaining `GetRuntimeVersion` and `GetFuncBySymbol` functions get moved into the executors, the only place where they are needed. - For CUDA I also create an overload of `cuda::ToStatus` which can convert a CUDA runtime error (`cudaError_t`) into an `absl::Status`. - I also had to adjust the `RocmKernel` and `CudaKernel` tests which were using `GetFuncBySymbol` directly. Now they rely on `LoadKernel` from the executors. PiperOrigin-RevId: 698720699

There was a check in place that works around a performance bug in ptxas from CUDA 12.1. This check has various problems: 1. It's untested, and the way it's implemented it can't easily be tested. 2. The version check doesn't work with library compilation, which we are transitioning towards, because it checks the version of a local ptxas binary. 3. It's unclear whether the workaround is still needed with the new MLIR emitters. So I'm removing it here since it blocks me from doing more refactoring around PTX compilation. PiperOrigin-RevId: 698720761

If the IR is not canonicalized after unrolling, then the passes that follow unrolling in the pipeline don't converge sometimes. PiperOrigin-RevId: 698723354

…nal and loop Imported from GitHub PR openxla/xla#19528 We observed in a saxml workload that sharing the same command buffer cmd type (CONDITIONALS) for WHILE and CONDITIONAL commands kills lowering opportunities. In many cases a CONDITIONAL instruction could be lowered into a command buffer while a WHILE instruction could not. This PR uses separate command buffer cmd type flags for CONDITIONAL and WHILE instructions when the user specifies which types to lower. Copybara import of the project: -- 4d62fb512995e2fc6e9077a1b3251a6754c866ca by Shawn Wang <[email protected]>: use separte command buffer cmd flag for conditional and loop Merging this change closes #19528 PiperOrigin-RevId: 698729891

Imported from GitHub PR openxla/xla#19552 - avoid unnecessary work - bump log level at which complete computations are printed - add log statements Copybara import of the project: -- e273aea41dd15efbc5d79c363810cf634e73203e by Ilia Sergachev <[email protected]>: [GPU][NFC] Cleanup horizontal loop fusion. - avoid unnecessary work - bump log level at which complete computations are printed - add log statements Merging this change closes #19552 PiperOrigin-RevId: 698731719

With the feature enabled, XLA GPU will automatically match all kinds of normalization diamond patterns in the graph (Softmax, RmsNorm, etc.) and generate efficient kernels with Triton. In the compilation pipeline the following steps happen: 1. The `SoftmaxRewriterTriton` pass matches minimal normalization diamonds and creates new fusions with `kCustom` kind. The fusions also have a backend config attached with `__triton` kind and tiling information in `BlockLevelFusionConfig`. 2. `PriorityFusion` uses the Cost Model to potentially fuse more instructions into the matched fusions. 3. Fusions are emitted with the generic Triton fusion emitter. The Cost Model chooses tile sizes for each Triton fusion. Currently `SoftmaxRewriterTriton` only matches normalization patterns that reduce the minormost dimension. PiperOrigin-RevId: 698735843

This class is no longer used. PiperOrigin-RevId: 698736858

Updates LLVM usage to match [33fcd6acc755](llvm/llvm-project@33fcd6acc755) PiperOrigin-RevId: 698742870

Since `CudaDriverVersion()` is now only used in one place, let's inline the function and remove the target. PiperOrigin-RevId: 698747446

This has been upstreamed to LLVM, and we have updated to a revision containing this. PiperOrigin-RevId: 698748177

Imported from GitHub PR openxla/xla#18407

This PR aims to enable the XLA/mlir/tool test cases on the Windows platform. The //xla/mlir/tools/mlir_bisect/... tests were failing on Windows with the errors shown below.

Error 1. Error with `llvm::seq`: `no matching function for call to 'seq'` at `for (auto i : llvm::seq(0ul, sizeof...(T))) {`.
Solution: change to `llvm::seq(0, sizeof...(T))`. By explicitly specifying the type (unsigned long) in `llvm::seq`, the compiler now clearly understands the type of the sequence.

Error 2. Missing `dlfcn.h`. Location: xla/mlir/tools/mlir_interpreter/dialects/func.cc. `fatal error: 'dlfcn.h' file not found`.
Solution: include `windows.h` on the Windows platform.

Error 3. Use of undeclared identifiers `sym` and `RTLD_DEFAULT`. Location: xla/mlir/tools/mlir_interpreter/dialects/func.cc. `use of undeclared identifier 'sym'` at `sym = dlsym(RTLD_DEFAULT, callee.getSymName().str().c_str());` and `use of undeclared identifier 'RTLD_DEFAULT'`.
Solution: On Windows, the approach to obtaining a symbol's address differs from Unix-based systems. The `GetModuleHandle` function retrieves a handle to the specified module (DLL) that is loaded in the address space of the calling process. This handle is necessary to access the module's symbols. The `GetProcAddress` function locates the address of an exported function or variable by name.

Copybara import of the project: -- 1a428996c7991df8e093393e7989fbcf251dc0f4 by Raunak <[email protected]>: fix xla-mlir failures on windows -- 15009666c4ee861218bb798c6fe0d2493fa8e060 by Raunak <[email protected]>: resolve comments -- 2483001d510582179d74b94571f9fd6beb943aaa by Raunak <[email protected]>: Keep the original file -- 4c7fe5e4debed0ff39eb87f64f60f99ce6ee0a74 by Raunak <[email protected]>: fix the formatting issue -- 270898a2b0bca97a7de30435ce6a53b5980ca73e by mraunak <[email protected]>: Update symbol_finder_windows.cc -- 6b63a306ee4ef69f9849822418426d5f705e73ff by mraunak <[email protected]>: Update symbol_finder_linux.cc -- f0996fcc1c67e43bbb7b7829adddf0c7d8f5c738 by mraunak <[email protected]>: Update symbol_finder.h -- 0c0c9bba3dac548e663aab5e2e3af6fb96c77fde by Raunak <[email protected]>: Fix the build file -- 6d7f269262dcc8a85579c62db51e38dc534d6564 by Raunak <[email protected]>: Resolve the comments -- ef598af149e9ad96dc1fa27be763a7ffd219011c by Raunak <[email protected]>: Resolve the comments -- 7131b8d24ad353044b622614c36c342d90101d37 by Raunak <[email protected]>: added :find_symbol to dependency -- 64a6e9e45d6deef4201c3cd8da64e99b9d40ca78 by mraunak <[email protected]>: Update BUILD -- d47a8b27c89e5df01ea94a237080fd2ac3ad8e85 by mraunak <[email protected]>: Fix clang format -- 1a24df16d3de5f007065b69a67965158e821ffe3 by Raunak <[email protected]>: resolve the comments -- 12f69fc2d188f8bc368bc5e29b53a80d15b6dbac by Raunak <[email protected]>: adding namespace and header style consistent -- ec9b5051471a36f7881ed21215f60ec893f18e7d by Raunak <[email protected]>: Fix the build file Merging this change closes #18407 PiperOrigin-RevId: 698754912

Split the initialization into several methods to better separate their responsibilities. PiperOrigin-RevId: 698757702

… MLIR. An upstream MLIR PR [0] removed the `finalizing-bufferize` pass. We are using only two patterns from the pass. As suggested by the note in the PR description, we can copy those patterns. [0] llvm/llvm-project@cbc7802 PiperOrigin-RevId: 698761132

The build target doesn't exist anymore but there is still a header file which gets deleted in this change. PiperOrigin-RevId: 698778403

…oryBase

| name | old cpu/op | new cpu/op | delta |
| --- | --- | --- | --- |
| BM_NanoRtAddScalars | 82.2ns ± 2% | 63.1ns ± 2% | -23.17% (p=0.000 n=37+40) |
| BM_NanoRtFibonacci | 86.7ns ± 2% | 68.4ns ± 2% | -21.09% (p=0.000 n=37+35) |
| BM_PjRtAddScalars | 1.78µs ± 2% | 1.79µs ± 2% | ~ (p=0.280 n=39+38) |
| BM_PjRtFibonacci | 1.79µs ± 3% | 1.79µs ± 3% | ~ (p=0.355 n=38+38) |

PiperOrigin-RevId: 698783540

To avoid running into the 259 character path length limitation. PiperOrigin-RevId: 698786300

Imported from GitHub PR openxla/xla#19578 Copybara import of the project: -- 849d78bf539cc69387ecb3f9710b6188cee5a494 by Ilia Sergachev <[email protected]>: [doc] Fix a link to a page in the table of contents. Merging this change closes #19578 PiperOrigin-RevId: 698788574

…turn a new instance. This CL changes the `ConstraintExpression` class by making it a value type and using C++ operators for logical operations. This hopefully makes the code more concise and easier to read. PiperOrigin-RevId: 698791293
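The idea of a value type combined through operators can be sketched as below. This is an illustration only, not the `ConstraintExpression` API: the class name `Expr`, the `atom` helper, and the representation as an OR of ANDs over named constraints are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Immutable expression stored as an OR of ANDs (assumed representation).

    `terms` is a frozenset of conjunctions; each conjunction is a frozenset
    of constraint names.
    """
    terms: frozenset

    @staticmethod
    def atom(name):
        return Expr(frozenset({frozenset({name})}))

    def __and__(self, other):
        # Distribute AND over OR: pair every conjunction from both sides.
        return Expr(frozenset(a | b for a in self.terms for b in other.terms))

    def __or__(self, other):
        # OR simply collects the conjunctions of both operands.
        return Expr(self.terms | other.terms)
```

Because every operator returns a new instance, `(a | b) & c` reads like the logic it encodes and leaves `a`, `b`, and `c` untouched.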

PiperOrigin-RevId: 698800849

PiperOrigin-RevId: 698809905

PiperOrigin-RevId: 698812072

PiperOrigin-RevId: 698813006

…vice traceviewer PiperOrigin-RevId: 698820533

PiperOrigin-RevId: 698821973

PiperOrigin-RevId: 698822218

With llvm/llvm-project@3494ee9, upstream has stricter checks for ints. PiperOrigin-RevId: 698823182

PiperOrigin-RevId: 698823837

…timized for latency. This change introduces `cond_v2.fast_cond_v2()`, which is a tool for writing latency-optimized conditionals using the functional `IfOp` implementation. PiperOrigin-RevId: 698835221

…mensions do NOT overlap. These dims are processed separately in the spmd partitioner. 1. Explicit batching dims exist in all tensors (operand, indices, output). 2. Index pass-through dims exist in indices and output. 3. Operand pass-through dims exist in operand and output. We replace `GatherOutputShardingFromIndexIndexPassthroughDimensions` with `GatherOutputShardingFromIndex(bool consider_explict_batch_dims=true)`. The added test failed before this change because explicit batch dims were processed as index pass-through dims. This change fixes the issue. PiperOrigin-RevId: 698840297

PiperOrigin-RevId: 698843953

PiperOrigin-RevId: 698863847

…x uses PJRT. PiperOrigin-RevId: 698867902

No behavior change. PiperOrigin-RevId: 698870832

…verification fails. The fusion is extracted into a separate module, so it's easier to reproduce the issue. If the fusion is too long, stdout log will be cropped. PiperOrigin-RevId: 698872626

PiperOrigin-RevId: 698873637

Internal test is broken. Reverts aeef8f4 PiperOrigin-RevId: 698893785

Prototype of a test-only KernelEmitter API that can be used for writing XLA:CPU kernel tests. PiperOrigin-RevId: 698895333

Before this change, `GetGatherScatterBatchParallelDims` only returns the implicit batching dims in operand and indices. We still need to call `GetGatherParallelOutputDims` to return the corresponding dims in the output. With this change, `GetGatherScatterBatchParallelDims` returns the implicit batch dims in 3 tensors (operand, indices, and output). PiperOrigin-RevId: 698895717

…ge sunk collective is encountered. PiperOrigin-RevId: 698924172

PiperOrigin-RevId: 698927051

PiperOrigin-RevId: 698929552

PiperOrigin-RevId: 698932952

PiperOrigin-RevId: 698933032

PiperOrigin-RevId: 698939145

Imported from GitHub PR openxla/xla#16901 When the user does not specify the number of GPUs for auto sharding, XLA defaults to using all available GPUs. The current implementation uses the number of cores (SMs) on the GPU as the default shard count. For example, on an A100, the sharding algorithm will try to shard into 108 devices, which can be confusing for users. This patch changes the shard count to the number of cards, which has been tested to work correctly on an 8-card A100 machine. Copybara import of the project: -- 232a62ae2599e6fe76e2e235ea18452195bce799 by Tianyi Liu <[email protected]>: [XLA:GPU] Fix default device mesh for auto sharding Merging this change closes #16901 PiperOrigin-RevId: 698956243

PiperOrigin-RevId: 698958036

Some patterns added to quantize_patterns.td were making decisions about quantizing weights that are not annotated by Q-DQ nodes. This PR separates these two categories for cases where we want strict adherence to Q-DQ annotations (e.g. QAT). PiperOrigin-RevId: 698960224

…on.cc PiperOrigin-RevId: 698970177

PiperOrigin-RevId: 698973331

PiperOrigin-RevId: 698987161

PiperOrigin-RevId: 698995873

…ifier.cc PiperOrigin-RevId: 698997197

PiperOrigin-RevId: 699000269

PiperOrigin-RevId: 699004930

With llvm/llvm-project@3494ee9, upstream has stricter checks for ints. Setting `APInt(.., /*isSigned=*/ !isUnsigned, ..)` seems to break EvalCompareOpPattern, likely due to signed i1 not allowing 1. This change just keeps the status quo without making too many changes. PiperOrigin-RevId: 699031101
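The remark about signed i1 follows from two's-complement ranges, which can be checked with a few lines of arithmetic. This is plain integer math rather than LLVM's `APInt`; the helper names are hypothetical.

```python
def signed_range(bits):
    """Inclusive (min, max) of a two's-complement integer with `bits` bits."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def unsigned_range(bits):
    """Inclusive (min, max) of an unsigned integer with `bits` bits."""
    return 0, (1 << bits) - 1


def fits(value, bits, signed):
    lo, hi = signed_range(bits) if signed else unsigned_range(bits)
    return lo <= value <= hi
```

A signed 1-bit integer covers only -1 and 0, so the value 1 fits an unsigned i1 but not a signed one, which is consistent with a stricter range check rejecting it.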

PiperOrigin-RevId: 699035755

PiperOrigin-RevId: 699044311

PiperOrigin-RevId: 699044350

PiperOrigin-RevId: 699048665

PiperOrigin-RevId: 699052605

PiperOrigin-RevId: 699056083

This moves all the PTX compilation functions that spawn subprocesses (notably ptxas, nvlink, and fatbin) into a separate file. The goal is to make this optional and eventually disable it by default. Since we can compile through libraries like libnvjitlink, the rather brittle approach of calling external binaries is no longer needed. This also adds tests for all the helper functions. Tests for the actual compilation will follow separately. PiperOrigin-RevId: 699058086

…poser.cc PiperOrigin-RevId: 699059539

…tion_annotator.cc PiperOrigin-RevId: 699062814

PiperOrigin-RevId: 699067483

This adds a new target `:driver_compilation` and moves `LinkGpuAsm` into a new file `driver_compilation.cc`. I'm also bringing back the `StreamExecutor` argument so that `ActivateContext` can be called; I had mistakenly removed it in a previous CL. The active context is indeed needed. The goal is to separate out all the different PTX compilation and linking methods and make them independently testable and optional. PiperOrigin-RevId: 699071278

PiperOrigin-RevId: 699071557

Also renames the target for consistency. PiperOrigin-RevId: 699076843

PiperOrigin-RevId: 699079990

`LinkUsingNvlink` and `LinkGpuAsmUsingDriver` used to take a list of `CubinOrPTXImage` structs as inputs, but the functions don't even support compiling PTX, so this is very misleading. I changed the parameter type to a list of byte arrays (`std::vector<uint8_t>`), which is what we use everywhere else for representing compiled modules (CUBINs). PiperOrigin-RevId: 699082261

Imported from GitHub PR openxla/xla#19656 This PR fixes a bug related to handling missing (implied) indices and adds the corresponding tests. 1. When `scatter_dims_to_operand_dims` size is not equal to the operand rank, the `out_of_bound_tensor` has incorrect dimensions, resulting in mismatched shapes of the select op. This is fixed at line 718. 2. When the update is not scalar, the indices are recalculated - this requires updating the `out_of_bound_tensor` (lines 757-761). 3. After expanding the indices, the `has_scalar_indices` flag has to be updated (line 777). Also added a few cosmetic changes: 1. Removed `is_one_dimensional` branch in `ExpandIndices`, as this never happens (probably an artefact from prior implementation). 2. Broadcast the boundary constants instead of generating a (possibly big) literal. Copybara import of the project: -- 2e38efc0c9efc2f708058bd2ae526f13d2ed8354 by Sergey Kozub <[email protected]>: Fix implicit index handling in ScatterDeterminismExpander Merging this change closes #19656 PiperOrigin-RevId: 699083584

…ewriter This copies over the original_value attribute when a value is replaced during this pass. PiperOrigin-RevId: 699087576

…Contraction Rewriter Imported from GitHub PR openxla/xla#19679 This PR moves the addend shape check to the rewriter so that the code to append oneDNN post-ops can be shared between matmul and convolution kernels. Copybara import of the project: -- c6497851473b2ec5b5041de459e4aaa3c8c2cb93 by Akhil Goel <[email protected]>: Move addend check Merging this change closes #19679 PiperOrigin-RevId: 699095534

@hawkinsp

Imported from GitHub PR openxla/xla#19346 cc @hawkinsp Copybara import of the project: -- 292e7ebb7ee57e5af5977c08f0aaf28fc1f852e2 by vfdev-5 <[email protected]>: Bumped rules_python version to 0.39.0 Merging this change closes #19346 PiperOrigin-RevId: 699100796

The issues which we have hit previously seem to be fixed now. PiperOrigin-RevId: 699120716

PiperOrigin-RevId: 699125504

Imported from GitHub PR openxla/xla#19577 Copybara import of the project: -- b4180bb5e59c92b374eb16fc59d6f03d7f37db4a by Ilia Sergachev <[email protected]>: Cleanup handling of 3 fields of ExecutableBuildOptions. -- 21206eb838fa04dabaddec0aa8cdf73789ce8206 by Ilia Sergachev <[email protected]>: add a test -- a6571ef2ac7ec6a94056b1588a3260ecc7d9db17 by Ilia Sergachev <[email protected]>: cleanup -- 44d479f3d3d6320d37d35934cf81596e50e10c51 by Ilia Sergachev <[email protected]>: add missing newline -- c3c550f491b2fc03dacdf1101042a3fbadd51e7c by Ilia Sergachev <[email protected]>: add missing include -- 5acf13c0423b7aef87f81b86ac95a0a1471927f1 by Ilia Sergachev <[email protected]>: ignore key_value_store Merging this change closes #19577 PiperOrigin-RevId: 699125923

PiperOrigin-RevId: 699128079

…subprocess compilation This adds a new interface `CompilationProvider` which offers `PTX` to `CUBIN` compilation. It also adds the first implementation of this interface, the `SubprocessCompilationProvider`, which uses ptxas and nvlink to do the compilation. Some additional changes were also needed: - New type `CompilationOptions` which collects and documents all compilation options in one place. - Some additional overloads in `:subprocess_compilation` where needed so that the `SubprocessCompilationProvider` can control the exact file path to ptxas and nvlink. - A fairly comprehensive test suite for the compilation provider is also added. PiperOrigin-RevId: 699134414
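The provider pattern described above can be sketched as an abstract interface plus a test double. This is a Python illustration of the design, not the C++ API: the method name `compile`, the fields of `CompilationOptions`, `FakeCompilationProvider`, and `load_kernel` are all hypothetical, and nothing here invokes ptxas or nvlink.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class CompilationOptions:
    # Hypothetical fields; the real option set is not listed in the message.
    disable_optimizations: bool = False
    generate_debug_info: bool = False


class CompilationProvider(ABC):
    """Turns PTX text into a compiled module (CUBIN bytes)."""

    @abstractmethod
    def compile(self, ptx, options):
        ...


class FakeCompilationProvider(CompilationProvider):
    """Test double: records requests instead of spawning a compiler."""

    def __init__(self):
        self.requests = []

    def compile(self, ptx, options):
        self.requests.append((ptx, options))
        return b"CUBIN:" + ptx.encode("utf-8")


def load_kernel(provider, ptx):
    # Callers depend only on the interface, so providers can be swapped.
    return provider.compile(ptx, CompilationOptions())
```

Keeping callers behind one interface is what would let a subprocess-based provider and a library-based one be tested and enabled independently.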

| Benchmark | Time | CPU | Iterations |
| --- | --- | --- | --- |
| BM_CreateZeroCopyBuffer | 234 ns | 234 ns | 3075841 |

PiperOrigin-RevId: 699137060

tf_cuda_tests_tags() seems to work as well. Add hermetic_cuda_data_dir parameter as well, so that e.g. ptxas can be found. Also use linkopts = ["-Wl,-rpath,$$ORIGIN/../lit_lib"] so that the dynamic libraries are found, which are symlinked from the lit_lib directory. PiperOrigin-RevId: 699146874

…ersionPattern` to a walk. PiperOrigin-RevId: 699148617

The low_f32 should be rounded to bf16 rather than truncated. PiperOrigin-RevId: 699154452
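The difference between truncating and rounding an f32 to bf16 can be shown on the raw bit patterns. This is a minimal sketch using round-to-nearest-even on the upper 16 bits; it ignores NaN and overflow handling, and the function names are hypothetical rather than taken from the change.

```python
import struct


def f32_bits(x):
    """Bit pattern of x as an IEEE-754 float32, returned as an int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def truncate_to_bf16(x):
    """Keep the top 16 bits and drop the rest (rounds toward zero)."""
    return f32_bits(x) >> 16


def round_to_bf16(x):
    """Round to nearest, ties to even. NaN and overflow are not handled."""
    bits = f32_bits(x)
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_to_float(b):
    """Widen a bf16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]
```

For `x = 1.005859375`, truncation yields 1.0 while rounding yields 1.0078125, which is closer to the input; the truncated result always errs toward zero.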

PiperOrigin-RevId: 699159107

Updates LLVM usage to match [a12e79a85fc1](llvm/llvm-project@a12e79a85fc1) PiperOrigin-RevId: 699163893

…erpreter`. This is more in line with how the dialects were meant to be added according to the readme file in the parent directory. PiperOrigin-RevId: 699169422

This change: 1. Identifies all `kTfLiteBuiltinDequantize` nodes that convert `kTfLiteFloat16` to `kTfLiteFloat32` and feed into a `kTfLiteBuiltinFullyConnected`, `kTfLiteBuiltinConv2d`, or `kTfLiteBuiltinDepthwiseConv2d` node. 2. Re-maps the XNNPACK tensors that point to the `kTfLiteFloat32` output so that they point to the original `kTfLiteFloat16` input. The `kTfLiteFloat16` weights/filters and biases are handled by XNNPACK directly. PiperOrigin-RevId: 699184221
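The two steps above can be sketched on a toy graph representation. Everything in this example (`OpKind`, `DType`, `Node`, `FindFp16Remaps`) is a hypothetical simplification for illustration and is not the TFLite or XNNPACK delegate API; it also ignores the case where a dequantized tensor has additional consumers of other kinds.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins for the builtin op codes and types.
enum class OpKind { kDequantize, kFullyConnected, kConv2d, kDepthwiseConv2d, kOther };
enum class DType { kFloat16, kFloat32 };

struct Node {
  OpKind kind;
  std::vector<int> inputs;   // tensor indices
  std::vector<int> outputs;  // tensor indices
};

inline bool AcceptsFp16Weights(OpKind kind) {
  return kind == OpKind::kFullyConnected || kind == OpKind::kConv2d ||
         kind == OpKind::kDepthwiseConv2d;
}

// Returns a map from the f32 output tensor of a dequantize node to its
// f16 input tensor, restricted to outputs consumed by FC/Conv-style nodes.
inline std::unordered_map<int, int> FindFp16Remaps(
    const std::vector<Node>& nodes, const std::vector<DType>& tensor_types) {
  // Step 1: collect f16 -> f32 dequantize nodes.
  std::unordered_map<int, int> candidates;
  for (const Node& node : nodes) {
    if (node.kind == OpKind::kDequantize && node.inputs.size() == 1 &&
        node.outputs.size() == 1 &&
        tensor_types[node.inputs[0]] == DType::kFloat16 &&
        tensor_types[node.outputs[0]] == DType::kFloat32) {
      candidates[node.outputs[0]] = node.inputs[0];
    }
  }
  // Step 2: keep only candidates that feed a node accepting f16 weights.
  std::unordered_map<int, int> remap;
  for (const Node& node : nodes) {
    if (!AcceptsFp16Weights(node.kind)) continue;
    for (int tensor : node.inputs) {
      auto it = candidates.find(tensor);
      if (it != candidates.end()) remap[tensor] = it->second;
    }
  }
  return remap;
}
```

A consumer would then look up each input tensor in the returned map and substitute the f16 tensor where an entry exists.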

PiperOrigin-RevId: 699195297

We infer the update sharding from update to obtain `passthrough_sharding`. This `passthrough_sharding` should be merged with the existing update sharding, so that we keep the original sharding axes in update. The added all-reduces are along the sharding axes of the index pass-through dimensions. They should not be along the sharding axes of the explicit batch dims or the index vector dim. PiperOrigin-RevId: 699206933

PiperOrigin-RevId: 699215454

Imported from GitHub PR openxla/xla#19660 This PR switches the default rocm build to clang as the gcc config is broken at the moment. Copybara import of the project: -- ea48f7c480d110eab3f133ed6ea8989da0e1e724 by Alexandros Theodoridis <[email protected]>: [ROCm] switch rocm build to clang -- 2743fabafd6a358c05e858781064e7fa2e389c78 by Alexandros Theodoridis <[email protected]>: Remove explicit clang path from the bazelrc rocm config -- 202dea0a80602cafdbee6067d8f20dc3055c6bbb by Alexandros Theodoridis <[email protected]>: Address review comments Merging this change closes #19660 PiperOrigin-RevId: 699222609

Next step is to migrate NcclComm and NcclOwnedComm to std::unique_ptr<Communicator> and proper virtual inheritance. PiperOrigin-RevId: 699233544

…filer_test This was originally proposed in openxla/xla#16102, but I still ran into an issue where it failed by a slight margin:
```
Expected: (profiler.MeasureClockCyclesPerOp(HloOpcode::kDivide, F64) .value() .clock_cycles()) > (300), actual: 296 vs 300
```
That said, I ran 1000 tests and did not encounter this issue. Reducing the threshold to 280, since the bound seems very close and a flaky test is no good either way. PiperOrigin-RevId: 699233864

… for XLA:CPU PiperOrigin-RevId: 699234540

PiperOrigin-RevId: 699235057

Also fixes a few missing includes. Uses C++ includes instead of C ones. PiperOrigin-RevId: 699237969

PiperOrigin-RevId: 699238045

Define APIs for compiling LLVM modules to functions required by the XLA:CPU runtime: kernels, comparators, etc. Implementation largely exists as SimpleOrcJit in service/cpu, but it's tightly coupled with "legacy" XLA. PiperOrigin-RevId: 699239722

PiperOrigin-RevId: 699242142

PiperOrigin-RevId: 699247019

* De-dupe logic in test common and model_buffer. * Factor out the flatbuffer model wrapper from the class in test common and move to flatbuffer_tools. * Add some extra helpers for flatbuffers in flatbuffer_tools, and add test. * Hide all the usage of `std::filesystem` stuff in one cc. Technically `<filesystem>` is an unapproved header. * Update model_load to use the flatbuffer tools. * Pull some of the member functions of "model unpacker" out into non-member functions. PiperOrigin-RevId: 699249089

…utions Performance is comparable to the synchronous version. Detailed results (where 'old' is the synchronous execution, 'new' is async execution; both use the same, custom algorithm for transposed conv):

name                                                     old cpu/op   new cpu/op   delta
BM_Conv1DStrided/process_time                            29.4ms ± 6%  29.7ms ± 5%    ~     (p=0.841 n=5+5)
BM_Conv1DTransposedStrided/process_time                  29.6ms ± 2%  30.7ms ± 2%  +3.52%  (p=0.008 n=5+5)
BM_Conv1DTransposedStridedNonDefaultLayout/process_time  28.5ms ± 3%  28.3ms ± 1%    ~     (p=0.222 n=5+5)

name                                                     old time/op  new time/op  delta
BM_Conv1DStrided/process_time                            2.68ms ± 7%  2.72ms ± 5%    ~     (p=0.548 n=5+5)
BM_Conv1DTransposedStrided/process_time                  7.91ms ± 3%  7.98ms ± 5%    ~     (p=0.548 n=5+5)
BM_Conv1DTransposedStridedNonDefaultLayout/process_time  7.00ms ± 2%  7.32ms ± 4%  +4.58%  (p=0.016 n=5+5)

PiperOrigin-RevId: 699250549

Updates LLVM usage to match [556ea5265a25](llvm/llvm-project@556ea5265a25) PiperOrigin-RevId: 699251575

NCCL implementation detail will have private visibility, and for all external users (Thunks etc.) we'll export it via public header that uses xla/core/collectives APIs. PiperOrigin-RevId: 699256314

…ator, encoded as follows
```
_TENSOR_V1_<name>: {
  TENSOR_SHAPE: Vector<i64>,
  TENSOR_TYPE: tflite::TensorType (casted to i64),
  TENSOR_DATA: Vector<f32> or Vector<i64>
}
```
PiperOrigin-RevId: 699272982

The pass runs over a VIFRT module, and tries to convert it to a given target version. PiperOrigin-RevId: 699279298

PiperOrigin-RevId: 699279921

PiperOrigin-RevId: 699286343

PiperOrigin-RevId: 699309235

…annot be null PiperOrigin-RevId: 699310290

…ying the naming. PiperOrigin-RevId: 699317601

StreamExecutorGpuClient topology description as well. PiperOrigin-RevId: 699320139

nullptr is handled here. PiperOrigin-RevId: 699323007

Also: * Add some helper functions for checking that a litert op matches a tfl op, which can also be re-used in other contexts. * Add some quantization related helper functions to flatbuffer_tools * Update dump for quantization * Move things around a bit and add quantization stuff to model_util support checks PiperOrigin-RevId: 699333588

PiperOrigin-RevId: 699337598

StreamExecutorGpuTopologyDescription rather than parsing it for every compile. PiperOrigin-RevId: 699344815

… with array output and multiple users. It may trigger a compilation error, as in the added test target. PiperOrigin-RevId: 699357851

PiperOrigin-RevId: 699361885

PiperOrigin-RevId: 699397857

PiperOrigin-RevId: 699409569

PiperOrigin-RevId: 699467519

PiperOrigin-RevId: 699496299

PiperOrigin-RevId: 699497695

Reverts c5f0512 PiperOrigin-RevId: 699499360


Commits on Nov 19, 2024

Commits on Nov 20, 2024

Commits on Nov 21, 2024

Commits on Nov 22, 2024

Commits on Nov 23, 2024
