From 849003d05f1f42af9c0438492f08347df468a9c6 Mon Sep 17 00:00:00 2001
From: Istvan Kiss
Date: Thu, 23 May 2024 22:59:59 +0200
Subject: [PATCH] Markdown fixes

---
 CONTRIBUTING.md                       |  77 ++++++-----
 README.md                             |  74 +++++-----
 docs/how-to/faq.md                    | 189 ++++++++++++++++----
 docs/how-to/hip_porting_driver_api.md |  41 ++++--
 docs/how-to/hip_porting_guide.md      | 134 ++++++++----
 docs/how-to/hip_rtc.md                |  82 +++++++----
 docs/how-to/programming_manual.md     |  74 ++++++----
 docs/index.md                         |   2 +-
 docs/reference/terms.md               |   2 +-
 docs/understand/glossary.md           |  23 ++--
 util/vim/README.md                    |   4 +-
 11 files changed, 418 insertions(+), 284 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 026c335459..3696f0ccce 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -29,12 +29,12 @@ Some guidelines are outlined below:
 
 ### Add a new HIP API ###
 
-- Add a translation to the hipify-clang tool ; many examples abound.
-  - For stat tracking purposes, place the API into an appropriate stat category ("dev", "mem", "stream", etc).
-- Add a inlined NVIDIA implementation for the function in /hipnv/include/hip/nvidia_detail/nvidia_hip_runtime_api.h in the repository [hipother](https://github.com/ROCm/hipother).
-  - These are typically headers
-- Add an HIP definition and Doxygen comments for the function in /include/hip/hip_runtime_api.h, in the repository [hip](https://github.com/ROCm/hip).
-  - Source implementation typically go in clr/hipamd/src/hip_*.cpp in the reposotory [clr](https://github.com/ROCm/clr). The implementation involves calls to HIP runtime (ie for hipStream_t).
+* Add a translation to the hipify-clang tool; many examples abound.
+  * For stat tracking purposes, place the API into an appropriate stat category ("dev", "mem", "stream", etc.).
+* Add an inlined NVIDIA implementation for the function in /hipnv/include/hip/nvidia_detail/nvidia_hip_runtime_api.h in the repository [hipother](https://github.com/ROCm/hipother).
+  * These are typically headers.
+* Add a HIP definition and Doxygen comments for the function in /include/hip/hip_runtime_api.h, in the repository [hip](https://github.com/ROCm/hip).
+  * Source implementations typically go in clr/hipamd/src/hip_*.cpp in the repository [clr](https://github.com/ROCm/clr). The implementation involves calls to the HIP runtime (e.g., for hipStream_t).
 
 ### Run Unit Tests ###
 
@@ -42,23 +42,26 @@ For new features or bug fixes, it's mandatory to run associate [hip-tests](https
 Please go to the repo and follow the steps.
 
 For applications and benchmarks outside the hip-tests environment, developments should use a two-step development flow:
-- #1. Compile, link, and install HIP. See {ref}`Building the HIP runtime` notes.
-- #2. Relink the target application to include changes in HIP runtime file.
+
+1. Compile, link, and install HIP. See {ref}`Building the HIP runtime` notes.
+2. Relink the target application to pick up the changes in the HIP runtime.
 
 ## Coding Style ##
-
-- Code Indentation:
-  - Tabs should be expanded to spaces.
-  - Use 4 spaces indentation.
+
+* Code Indentation:
+  * Tabs should be expanded to spaces.
+  * Use 4-space indentation.
+* Capitalization and Naming
+  * Prefer camelCase for HIP interfaces and internal symbols. Note HCC uses _ for separator.
+    This guideline is not yet consistently followed in HIP code; eventual compliance is aspirational.
+  * Member variables should begin with a leading "_". This allows them to be easily distinguished from other variables or functions.
+
+* `{}` placement
+  * A namespace should be on the same line as `{` and separated by a space.
+  * A single-line if statement should still use a `{/}` pair (even though C++ does not require it).
+  * For functions, the opening `{` should be placed on a new line.
+  * For if/else blocks, the opening `{` is placed on the same line as the if/else. Use a space to separate `{` from if/else. For example,
+
 ```console
 if (foo) {
     doFoo()
@@ -67,16 +70,16 @@
 } else {
     doFooElse()
 }
 ```
 
-- Miscellaneous
-  - All references in function parameter lists should be const.
-  - "ihip" means internal hip structures. These should not be exposed through the HIP API.
-  - Keyword TODO refers to a note that should be addressed in long-term. Could be style issue, software architecture, or known bugs.
-  - FIXME refers to a short-term bug that needs to be addressed.
+* Miscellaneous
+  * All references in function parameter lists should be const.
+  * "ihip" means internal HIP structures. These should not be exposed through the HIP API.
+  * The keyword TODO marks a note that should be addressed in the long term; it could be a style issue, software architecture, or a known bug.
+  * FIXME refers to a short-term bug that needs to be addressed.
 
-- `HIP_INIT_API()` should be placed at the start of each top-level HIP API. This function will make sure the HIP runtime is initialized, and also constructs an appropriate API string for tracing and CodeXL marker tracing. The arguments to HIP_INIT_API should match those of the parent function.
-- `hipExtGetLastError()` can be called as the AMD platform specific API, to return error code from last HIP API called from the active host thread. `hipGetLastError()` and `hipPeekAtLastError()` can also return the last error that was returned by any of the HIP runtime calls in the same host thread.
-- All HIP environment variables should begin with the keyword HIP_
-Environment variables should be long enough to describe their purpose but short enough so they can be remembered - perhaps 10-20 characters, with 3-4 parts separated by underscores.
+* `HIP_INIT_API()` should be placed at the start of each top-level HIP API. This function ensures the HIP runtime is initialized, and also constructs an appropriate API string for tracing and CodeXL marker tracing. The arguments to HIP_INIT_API should match those of the parent function.
+* `hipExtGetLastError()` is an AMD-platform-specific API that returns the error code from the last HIP API called from the active host thread. `hipGetLastError()` and `hipPeekAtLastError()` can also return the last error that was returned by any of the HIP runtime calls in the same host thread.
+* All HIP environment variables should begin with the keyword HIP_.
+Environment variables should be long enough to describe their purpose but short enough so they can be remembered; perhaps 10-20 characters, with 3-4 parts separated by underscores.
 To see the list of current environment variables, along with their values, set HIP_PRINT_ENV and run any hip applications on ROCm platform.
 HIPCC or other tools may support additional environment variables which should follow the above convention.
@@ -91,16 +94,18 @@ Some tips:
 https://robots.thoughtbot.com/5-useful-tips-for-a-better-commit-message
 In particular :
-  - Use imperative voice, ie "Fix this bug", "Refactor the XYZ routine", "Update the doc".
-    Not : "Fixing the bug", "Fixed the bug", "Bug fix", etc.
-  - Subject should summarize the commit. Do not end subject with a period.
Use a blank line
-    after the subject.
+
+* Use imperative voice, e.g. "Fix this bug", "Refactor the XYZ routine", "Update the doc".
+  Not: "Fixing the bug", "Fixed the bug", "Bug fix", etc.
+* Subject should summarize the commit. Do not end the subject with a period. Use a blank line
+  after the subject.
 
 ### Deliverables ###
 
 HIP is an open source library. Because of this, we include the following license description at the top of every source file. If you create new source files in the repository, please include this text in them as well (replacing "xx" with the digits for the current year):
-```
+
+```cpp
 // Copyright (c) 20xx Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
@@ -139,5 +144,5 @@ To update the code in your PR (eg. in response to a code review discussion), you
 
 ### Doxygen Editing Guidelines ###
 
-- bugs should be marked with @bugs near the code where the bug might be fixed. The @bug message will appear in the API description and also in the
+* Bugs should be marked with @bug near the code where the bug might be fixed. The @bug message will appear in the API description and also in the
 doxygen bug list.
diff --git a/README.md b/README.md
index a6e887d478..b9ff89b5ad 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ The information presented in this document is for informational purposes only an
 
 © 2023 Advanced Micro Devices, Inc. All Rights Reserved.
 
-## Repository branches:
+## Repository branches
 
 The HIP repository maintains several branches. The branches that are of importance are:
 
@@ -26,27 +26,28 @@ The HIP repository maintains several branches. The branches that are of importan
 * Main branch: This is the stable branch. It is up to date with the latest release branch, for example, if the latest HIP release is rocm-4.3, main branch will be the repository based on this release.
 * Release branches.
These are branches corresponding to each ROCM release, listed with release tags, such as rocm-4.2, rocm-4.3, etc. -## Release tagging: +## Release tagging HIP releases are typically naming convention for each ROCM release to help differentiate them. * rocm x.yy: These are the stable releases based on the ROCM release. This type of release is typically made once a month.* -## More Info: -- [Installation](docs/install/install.rst) -- [HIP FAQ](docs/how-to/faq.md) -- [HIP Kernel Language](docs/reference/kernel_language.rst) -- [HIP Porting Guide](docs/how-to/hip_porting_guide.md) -- [HIP Porting Driver Guide](docs/how-to/hip_porting_driver_api.md) -- [HIP Programming Guide](docs/how-to/programming_manual.md) -- [HIP Logging ](docs/how-to/logging.rst) -- [Building HIP From Source](docs/install/build.rst) -- [HIP Debugging ](docs/how-to/debugging.rst) -- [HIP RTC](docs/how-to/hip_rtc.md) -- [HIP Terminology](docs/reference/terms.md) (including Rosetta Stone of GPU computing terms across CUDA/HIP/OpenCL) -- [HIPIFY](https://github.com/ROCm/HIPIFY/blob/amd-staging/README.md) -- Supported CUDA APIs: +## More Info + +* [Installation](docs/install/install.rst) +* [HIP FAQ](docs/how-to/faq.md) +* [HIP Kernel Language](docs/reference/kernel_language.rst) +* [HIP Porting Guide](docs/how-to/hip_porting_guide.md) +* [HIP Porting Driver Guide](docs/how-to/hip_porting_driver_api.md) +* [HIP Programming Guide](docs/how-to/programming_manual.md) +* [HIP Logging](docs/how-to/logging.rst) +* [Building HIP From Source](docs/install/build.rst) +* [HIP Debugging](docs/how-to/debugging.rst) +* [HIP RTC](docs/how-to/hip_rtc.md) +* [HIP Terminology](docs/reference/terms.md) (including Rosetta Stone of GPU computing terms across CUDA/HIP/OpenCL) +* [HIPIFY](https://github.com/ROCm/HIPIFY/blob/amd-staging/README.md) +* Supported CUDA APIs: * [Runtime API](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/tables/CUDA_Runtime_API_functions_supported_by_HIP.md) * [Driver 
API](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/tables/CUDA_Driver_API_functions_supported_by_HIP.md) * [cuComplex API](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/tables/cuComplex_API_supported_by_HIP.md) @@ -56,20 +57,21 @@ HIP releases are typically naming convention for each ROCM release to help diffe * [cuDNN](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/tables/CUDNN_API_supported_by_HIP.md) * [cuFFT](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/tables/CUFFT_API_supported_by_HIP.md) * [cuSPARSE](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/tables/CUSPARSE_API_supported_by_HIP.md) -- [Developer/CONTRIBUTING Info](CONTRIBUTING.md) -- [Release Notes](RELEASE.md) +* [Developer/CONTRIBUTING Info](CONTRIBUTING.md) +* [Release Notes](RELEASE.md) ## How do I get set up? See the [Installation](docs/install/install.rst) notes. ## Simple Example + The HIP API includes functions such as hipMalloc, hipMemcpy, and hipFree. Programmers familiar with CUDA will also be able to quickly learn and start coding with the HIP API. Compute kernels are launched with the "hipLaunchKernelGGL" macro call. Here is simple example showing a snippet of HIP API code: -``` +```cpp hipMalloc(&A_d, Nbytes); hipMalloc(&C_d, Nbytes); @@ -84,13 +86,11 @@ hipLaunchKernelGGL(vector_square, /* compute kernel*/ hipMemcpy(C_h, C_d, Nbytes, hipMemcpyDeviceToHost); ``` - The HIP kernel language defines builtins for determining grid and block coordinates, math functions, short vectors, atomics, and timer functions. It also specifies additional defines and keywords for function types, address spaces, and optimization controls (See the [HIP Kernel Language](docs/reference/kernel_language.rst) for a full description). Here's an example of defining a simple 'vector_square' kernel. 
-
 ```cpp
 template <typename T>
 __global__ void
 vector_square(T *C_d, const T *A_d, size_t N)
@@ -108,12 +108,14 @@ vector_square(T *C_d, const T *A_d, size_t N)
 
 The HIP Runtime API code and compute kernel definition can exist in the same source file - HIP takes care of generating host and device code appropriately.
 
 ## HIP Portability and Compiler Technology
+
 HIP C++ code can be compiled with either,
-- On the NVIDIA CUDA platform, HIP provides header file which translate from the HIP runtime APIs to CUDA runtime APIs. The header file contains mostly inlined
+
+* On the NVIDIA CUDA platform, HIP provides a header file which translates from the HIP runtime APIs to CUDA runtime APIs. The header file contains mostly inlined
 functions and thus has very low overhead - developers coding in HIP should expect the same performance as coding in native CUDA. The code is then compiled with nvcc, the standard C++ compiler provided with the CUDA SDK. Developers can use any tools supported by the CUDA SDK including the CUDA profiler and debugger.
-- On the AMD ROCm platform, HIP provides a header and runtime library built on top of HIP-Clang compiler. The HIP runtime implements HIP streams, events, and memory APIs,
+* On the AMD ROCm platform, HIP provides a header and runtime library built on top of the HIP-Clang compiler. The HIP runtime implements HIP streams, events, and memory APIs,
 and is a object library that is linked with the application. The source code for all headers and the library implementation is available on GitHub. HIP developers on ROCm can use AMD's ROCgdb (https://github.com/ROCm/ROCgdb) for debugging and profiling.
 
@@ -121,36 +123,36 @@ Thus HIP source code can be compiled to run on either platform. Platform-specif
 provides source portability to either platform. HIP provides the _hipcc_ compiler driver which will call the appropriate toolchain depending on the desired platform.
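To make the last point concrete, toolchain selection with hipcc can be sketched as follows. This is an illustrative sketch, not part of the original guide: the file name `square.cpp` is a placeholder, and the exact install paths vary by system.

```shell
# HIP_PLATFORM tells hipcc which toolchain to drive: "amd" for HIP-Clang,
# "nvidia" for nvcc. hipconfig --platform reports the value currently in effect.
hipconfig --platform

# Build the same HIP source for either back end (square.cpp is a placeholder).
HIP_PLATFORM=amd hipcc square.cpp -o square
HIP_PLATFORM=nvidia hipcc square.cpp -o square
```

These commands require a working ROCm or CUDA installation; they are shown here as configuration guidance rather than a runnable recipe.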
-## Examples and Getting Started: +## Examples and Getting Started * A sample and [blog](https://github.com/ROCm/hip-tests/tree/develop/samples/0_Intro/square) that uses any of [HIPIFY](https://github.com/ROCm/HIPIFY/blob/amd-staging/README.md) tools to convert a simple app from CUDA to HIP: - -```shell -cd samples/01_Intro/square -# follow README / blog steps to hipify the application. -``` + ```shell + cd samples/01_Intro/square + # follow README / blog steps to hipify the application. + ``` * Guide to [Porting a New Cuda Project](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_porting_guide.html#porting-a-new-cuda-project) - ## More Examples + The GitHub repository [HIP-Examples](https://github.com/ROCm/HIP-Examples) contains a hipified version of benchmark suite. Besides, there are more samples in Github [HIP samples](https://github.com/ROCm/hip-tests/tree/develop/samples), showing how to program with different features, build and run. ## Tour of the HIP Directories + * **include**: - * **hip_runtime_api.h** : Defines HIP runtime APIs and can be compiled with many standard Linux compilers (GCC, ICC, CLANG, etc), in either C or C++ mode. - * **hip_runtime.h** : Includes everything in hip_runtime_api.h PLUS hipLaunchKernelGGL and syntax for writing device kernels and device functions. hip_runtime.h can be compiled using a standard C++ compiler but will expose a subset of the available functions. - * **amd_detail/**** , **nvidia_detail/**** : Implementation details for specific platforms. HIP applications should not include these files directly. + * **hip_runtime_api.h** : Defines HIP runtime APIs and can be compiled with many standard Linux compilers (GCC, ICC, CLANG, etc), in either C or C++ mode. + * **hip_runtime.h** : Includes everything in hip_runtime_api.h PLUS hipLaunchKernelGGL and syntax for writing device kernels and device functions. 
hip_runtime.h can be compiled using a standard C++ compiler but will expose a subset of the available functions.
+  * **amd_detail/**, **nvidia_detail/**: Implementation details for specific platforms. HIP applications should not include these files directly.
 * **bin**: Tools and scripts to help with hip porting
-  * **hipcc** : Compiler driver that can be used to replace nvcc in existing CUDA code. hipcc will call nvcc or HIP-Clang depending on platform and include appropriate platform-specific headers and libraries.
-  * **hipconfig** : Print HIP configuration (HIP_PATH, HIP_PLATFORM, HIP_COMPILER, HIP_RUNTIME, CXX config flags, etc.)
+  * **hipcc** : Compiler driver that can be used to replace nvcc in existing CUDA code. hipcc will call nvcc or HIP-Clang depending on platform and include appropriate platform-specific headers and libraries.
+  * **hipconfig** : Print HIP configuration (HIP_PATH, HIP_PLATFORM, HIP_COMPILER, HIP_RUNTIME, CXX config flags, etc.)
 * **docs**: Documentation - markdown and doxygen info.
 
 ## Reporting an issue
+
 Use the [GitHub issue tracker](https://github.com/ROCm/HIP/issues).
 If reporting a bug, include the output of "hipconfig --full" and samples/1_hipInfo/hipInfo (if possible).
-
diff --git a/docs/how-to/faq.md b/docs/how-to/faq.md
index 2c6a11c7d8..52860e52ff 100644
--- a/docs/how-to/faq.md
+++ b/docs/how-to/faq.md
@@ -1,95 +1,105 @@
 # Frequently asked questions
 
 ## What APIs and features does HIP support?
+
 HIP provides the following:
-- Devices (hipSetDevice(), hipGetDeviceProperties(), etc.)
-- Memory management (hipMalloc(), hipMemcpy(), hipFree(), etc.)
-- Streams (hipStreamCreate(),hipStreamSynchronize(), hipStreamWaitEvent(), etc.)
-- Events (hipEventRecord(), hipEventElapsedTime(), etc.)
-- Kernel launching (hipLaunchKernel/hipLaunchKernelGGL is the preferred way of launching kernels.
hipLaunchKernelGGL is a standard C/C++ macro that can serve as an alternative way to launch kernels, replacing the CUDA triple-chevron (<<< >>>) syntax).
-- HIP Module API to control when adn how code is loaded.
-- CUDA-style kernel coordinate functions (threadIdx, blockIdx, blockDim, gridDim)
-- Cross-lane instructions including shfl, ballot, any, all
-- Most device-side math built-ins
-- Error reporting (hipGetLastError(), hipGetErrorString())
+
+* Devices (hipSetDevice(), hipGetDeviceProperties(), etc.)
+* Memory management (hipMalloc(), hipMemcpy(), hipFree(), etc.)
+* Streams (hipStreamCreate(), hipStreamSynchronize(), hipStreamWaitEvent(), etc.)
+* Events (hipEventRecord(), hipEventElapsedTime(), etc.)
+* Kernel launching (hipLaunchKernel/hipLaunchKernelGGL is the preferred way of launching kernels. hipLaunchKernelGGL is a standard C/C++ macro that can serve as an alternative way to launch kernels, replacing the CUDA triple-chevron (<<< >>>) syntax).
+* HIP Module API to control when and how code is loaded.
+* CUDA-style kernel coordinate functions (threadIdx, blockIdx, blockDim, gridDim)
+* Cross-lane instructions including shfl, ballot, any, all
+* Most device-side math built-ins
+* Error reporting (hipGetLastError(), hipGetErrorString())
 
 The HIP API documentation describes each API and its limitations, if any, compared with the equivalent CUDA API.
 
 ## What is not supported?
### Runtime/Driver API features + At a high-level, the following features are not supported: -- Textures (partial support available) -- Dynamic parallelism (CUDA 5.0) -- Graphics interoperability with OpenGL or Direct3D -- CUDA IPC Functions (Under Development) -- CUDA array, mipmappedArray and pitched memory -- Queue priority controls + +* Textures (partial support available) +* Dynamic parallelism (CUDA 5.0) +* Graphics interoperability with OpenGL or Direct3D +* CUDA IPC Functions (Under Development) +* CUDA array, mipmappedArray and pitched memory +* Queue priority controls See the [API Support Table](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/tables/CUDA_Runtime_API_functions_supported_by_HIP.md) for more detailed information. ### Kernel language features -- C++-style device-side dynamic memory allocations (free, new, delete) (CUDA 4.0) -- Virtual functions, indirect functions and try/catch (CUDA 4.0) -- `__prof_trigger` -- PTX assembly (CUDA 4.0). HIP-Clang supports inline GCN assembly. -- Several kernel features are under development. See the {doc}`/reference/kernel_language` for more information. +* C++-style device-side dynamic memory allocations (free, new, delete) (CUDA 4.0) +* Virtual functions, indirect functions and try/catch (CUDA 4.0) +* `__prof_trigger` +* PTX assembly (CUDA 4.0). HIP-Clang supports inline GCN assembly. +* Several kernel features are under development. See the {doc}`/reference/kernel_language` for more information. ## Is HIP a drop-in replacement for CUDA? + No. HIP provides porting tools which do most of the work to convert CUDA code into portable C++ code that uses the HIP APIs. Most developers will port their code from CUDA to HIP and then maintain the HIP version. HIP code provides the same performance as native CUDA code, plus the benefits of running on AMD platforms. ## What specific version of CUDA does HIP support? + HIP APIs and features do not map to a specific CUDA version. 
HIP provides a strong subset of the functionality provided in CUDA, and the hipify tools can scan code to identify any unsupported CUDA functions - this is useful for identifying the specific features required by a given application.
 However, we can provide a rough summary of the features included in each CUDA SDK and the support level in HIP.
 Each bullet below lists the major new language features in each CUDA release and then indicate which are supported/not supported in HIP:
 
-- CUDA 4.0 and earlier :
-  - HIP supports CUDA 4.0 except for the limitations described above.
-- CUDA 5.0 :
-  - Dynamic Parallelism (not supported)
-  - cuIpc functions (under development).
-- CUDA 6.0 :
-  - Managed memory (under development)
-- CUDA 6.5 :
-  - __shfl intrinsic (supported)
-- CUDA 7.0 :
-  - Per-thread default streams (supported)
-  - C++11 (Hip-Clang supports all of C++11, all of C++14 and some C++17 features)
-- CUDA 7.5 :
-  - float16 (supported)
-- CUDA 8.0 :
-  - Page Migration including cudaMemAdvise, cudaMemPrefetch, other cudaMem* APIs(not supported)
-- CUDA 9.0 :
-  - Cooperative Launch, Surface Object Management, Version Management
+* CUDA 4.0 and earlier:
+  * HIP supports CUDA 4.0 except for the limitations described above.
+* CUDA 5.0:
+  * Dynamic Parallelism (not supported)
+  * cuIpc functions (under development)
+* CUDA 6.0:
+  * Managed memory (under development)
+* CUDA 6.5:
+  * `__shfl` intrinsic (supported)
+* CUDA 7.0:
+  * Per-thread default streams (supported)
+  * C++11 (HIP-Clang supports all of C++11, all of C++14, and some C++17 features)
+* CUDA 7.5:
+  * float16 (supported)
+* CUDA 8.0:
+  * Page Migration including cudaMemAdvise, cudaMemPrefetch, and other cudaMem* APIs (not supported)
+* CUDA 9.0:
+  * Cooperative Launch, Surface Object Management, Version Management
 
 ## What libraries does HIP support?
+
 HIP includes growing support for the four key math libraries using hipBlas, hipFFt, hipRAND and hipSPARSE, as well as MIOpen for machine intelligence applications.
 These offer pointer-based memory interfaces (as opposed to opaque buffers) and can be easily interfaced with other HIP applications.
 The hip interfaces support both ROCm and CUDA paths, with familiar library interfaces.
 
-- [hipBlas](https://github.com/ROCmSoftwarePlatform/hipBLAS), which utilizes [rocBlas](https://github.com/ROCmSoftwarePlatform/rocBLAS).
-- [hipFFt](https://github.com/ROCmSoftwarePlatform/hipfft)
-- [hipsSPARSE](https://github.com/ROCmSoftwarePlatform/hipsparse)
-- [hipRAND](https://github.com/ROCmSoftwarePlatform/hipRAND)
-- [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen)
+* [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS), which utilizes [rocBLAS](https://github.com/ROCmSoftwarePlatform/rocBLAS).
+* [hipFFT](https://github.com/ROCmSoftwarePlatform/hipfft)
+* [hipSPARSE](https://github.com/ROCmSoftwarePlatform/hipsparse)
+* [hipRAND](https://github.com/ROCmSoftwarePlatform/hipRAND)
+* [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen)
 
 Additionally, some of the cublas routines are automatically converted to hipblas equivalents by the HIPIFY tools. These APIs use cublas or hcblas depending on the platform and replace the need to use conditional compilation.
 
 ## How does HIP compare with OpenCL?
+
 Both AMD and Nvidia support OpenCL 1.2 on their devices so that developers can write portable code.
 HIP offers several benefits over OpenCL:
-- Developers can code in C++ as well as mix host and device C++ code in their source files. HIP C++ code can use templates, lambdas, classes and so on.
-- The HIP API is less verbose than OpenCL and is familiar to CUDA developers.
-- Because both CUDA and HIP are C++ languages, porting from CUDA to HIP is significantly easier than porting from CUDA to OpenCL.
-- HIP uses the best available development tools on each platform: on Nvidia GPUs, HIP code compiles using NVCC and can employ the nSight profiler and debugger (unlike OpenCL on Nvidia GPUs).
-- HIP provides pointers and host-side pointer arithmetic.
-- HIP provides device-level control over memory allocation and placement.
-- HIP offers an offline compilation model.
+
+* Developers can code in C++ as well as mix host and device C++ code in their source files. HIP C++ code can use templates, lambdas, classes, and so on.
+* The HIP API is less verbose than OpenCL and is familiar to CUDA developers.
+* Because both CUDA and HIP are C++ languages, porting from CUDA to HIP is significantly easier than porting from CUDA to OpenCL.
+* HIP uses the best available development tools on each platform: on Nvidia GPUs, HIP code compiles using NVCC and can employ the Nsight profiler and debugger (unlike OpenCL on Nvidia GPUs).
+* HIP provides pointers and host-side pointer arithmetic.
+* HIP provides device-level control over memory allocation and placement.
+* HIP offers an offline compilation model.
 
 ## How does porting CUDA to HIP compare to porting CUDA to OpenCL?
+
 Both HIP and CUDA are dialects of C++, and thus porting between them is relatively straightforward.
 Both dialects support templates, classes, lambdas, and other C++ constructs.
 As one example, the hipify-perl tool was originally a Perl script that used simple text conversions from CUDA to HIP.
@@ -101,10 +111,12 @@ As a result, the OpenCL syntax is different from CUDA, and the porting tools hav
 The tools also struggle with more complex CUDA applications, in particular, those that use templates, classes, or other C++ features inside the kernel.
 
 ## What hardware does HIP support?
-- For AMD platforms, see the [ROCm documentation](https://github.com/RadeonOpenCompute/ROCm#supported-gpus) for the list of supported platforms.
-- For Nvidia platforms, HIP requires unified memory and should run on any device supporting CUDA SDK 6.0 or newer. We have tested the Nvidia Titan and Tesla K40. + +* For AMD platforms, see the [ROCm documentation](https://github.com/RadeonOpenCompute/ROCm#supported-gpus) for the list of supported platforms. +* For Nvidia platforms, HIP requires unified memory and should run on any device supporting CUDA SDK 6.0 or newer. We have tested the Nvidia Titan and Tesla K40. ## Do HIPIFY tools automatically convert all source code? + Typically, HIPIFY tools can automatically convert almost all run-time code. Most device code needs no additional conversion since HIP and CUDA have similar names for math and built-in functions. The hipify-clang tool will automatically modify the kernel signature as needed (automating a step that used to be done manually). @@ -112,17 +124,21 @@ Additional porting may be required to deal with architecture feature queries or In general, developers should always expect to perform some platform-specific tuning and optimization. ## What is NVCC? + NVCC is Nvidia's compiler driver for compiling "CUDA C++" code into PTX or device code for Nvidia GPUs. It's a closed-source binary compiler that is provided by the CUDA SDK. ## What is HIP-Clang? + HIP-Clang is a Clang/LLVM based compiler to compile HIP programs which can run on AMD platform. ## Why use HIP rather than supporting CUDA directly? + While HIP is a strong subset of the CUDA, it is a subset. The HIP layer allows that subset to be clearly defined and documented. Developers who code to the HIP API can be assured their code will remain portable across Nvidia and AMD platforms. In addition, HIP defines portable mechanisms to query architectural features and supports a larger 64-bit wavesize which expands the return type for cross-lane functions like ballot and shuffle from 32-bit ints to 64-bit ints. ## Can I develop HIP code on an Nvidia CUDA platform? + Yes. 
HIP's CUDA path only exposes the APIs and functionality that work on both NVCC and AMDGPU back-ends.
 "Extra" APIs, parameters, and features which exist in CUDA but not in HIP-Clang will typically result in compile-time or run-time errors.
 Developers need to use the HIP API for most accelerator code and bracket any CUDA-specific code with preprocessor conditionals.
@@ -130,44 +146,55 @@ Developers concerned about portability should, of course, run on both platforms,
 In some cases, CUDA has a richer set of modes for some APIs, and some C++ capabilities such as virtual functions - see the HIP @API documentation for more details.
 
 ## Can I develop HIP code on an AMD HIP-Clang platform?
+
 Yes. HIP's HIP-Clang path only exposes the APIs and functions that work on AMD runtime back ends. "Extra" APIs, parameters and features that appear in HIP-Clang but not CUDA will typically cause compile- or run-time errors. Developers must use the HIP API for most accelerator code and bracket any HIP-Clang specific code with preprocessor conditionals.
 Those concerned about portability should, of course, test their code on both platforms and should tune it for performance. Typically, HIP-Clang supports a more modern set of C++11/C++14/C++17 features, so HIP developers who want portability should be careful when using advanced C++ features on the HIP-Clang path.
 
 ## How to use HIP-Clang to build HIP programs?
+
 The environment variable can be used to set compiler path:
-- HIP_CLANG_PATH: path to hip-clang. When set, this variable let hipcc to use hip-clang for compilation/linking.
+
+* HIP_CLANG_PATH: path to hip-clang. When set, this variable lets hipcc use hip-clang for compilation and linking.
 
 There is an alternative environment variable to set compiler path:
-- HIP_ROCCLR_HOME: path to root directory of the HIP-ROCclr runtime. When set, this variable let hipcc use hip-clang from the ROCclr distribution.
+
+* HIP_ROCCLR_HOME: path to the root directory of the HIP-ROCclr runtime.
When set, this variable lets hipcc use hip-clang from the ROCclr distribution.
NOTE: If HIP_ROCCLR_HOME is set, there is no need to set HIP_CLANG_PATH, since hipcc will deduce it from HIP_ROCCLR_HOME.

## What is AMD clr?
+
AMD clr (Common Language Runtime) is a repository for the AMD platform, which contains the source code for AMD's compute language runtimes, as follows:

-- hipamd - contains implementation of HIP language for AMD GPU.
-- rocclr - contains virtual device interfaces that compute runtimes interact with backends, such as ROCr on Linux and PAL on Windows.
-- opencl - contains implementation of OpenCL™ on the AMD platform.
+* hipamd - contains the implementation of the HIP language for AMD GPUs.
+* rocclr - contains the virtual device interfaces through which compute runtimes interact with backends, such as ROCr on Linux and PAL on Windows.
+* opencl - contains the implementation of OpenCL™ on the AMD platform.

## What is hipother?
+
A new repository, 'hipother', was added in the ROCm 6.1 release, branched out from HIP. hipother supports the HIP back-end implementation on some non-AMD platforms, like NVIDIA.

## Can I get HIP open source repository for Windows?
+
No, there is no HIP repository open publicly on Windows.

## Can a HIP binary run on both AMD and Nvidia platforms?
+
HIP is a source-portable language that can be compiled to run on either the AMD or the NVIDIA platform. HIP tools don't create a "fat binary" that can run on either platform, however.

## On HIP-Clang, can I link HIP code with host code compiled with another compiler such as gcc, icc, or clang ?
+
Yes. HIP generates object code which conforms to the GCC ABI, and also links with libstdc++. This means you can compile host code with the compiler of your choice and link the generated object code with GPU code compiled with HIP.

Larger projects often contain a mixture of accelerator code (initially written in CUDA with nvcc) and host code (compiled with gcc, icc, or clang).
These projects can convert the accelerator code to HIP, compile that code with hipcc, and link with object code from their preferred compiler.

## Can HIP API support C style application? What is the difference between C and C++ ?
+
HIP is a C++ runtime API that supports C style applications as well.

Some C style applications (and interfaces to other languages (FORTRAN, Python)) would call certain HIP APIs but not use kernel programming.
They can be compiled with a C compiler and run correctly; however, small details must be considered in the code. For example, initialization, as shown in the simple application below, uses the HIP struct dim3 with the file name "test.hip.cpp"
-```
+
+```cpp
#include "hip/hip_runtime_api.h"
#include "stdio.h"

@@ -181,51 +208,61 @@ int main(int argc, char** argv) {
```

When using a C++ compiler,
-```
+
+```bash
$ gcc -x c++ $(hipconfig --cpp_config) test.hip.cpp -o test
$ ./test
dim3 grid1; x=1, y=1, z=1
dim3 grid2 = {1,1,1}; x=1, y=1, z=1
```
+
In which "dim3 grid1;" will yield a dim3 grid with all dimensional members x,y,z initialized to 1, as the default constructor behaves that way.
Further, if written:
-```
+
+```cpp
dim3 grid(2); // yields {2,1,1}
dim3 grid(2,3); // yields {2,3,1}
```

In comparison, when using the C compiler,
-```
+
+```bash
$ gcc -x c $(hipconfig --cpp_config) test.hip.cpp -o test
$ ./test
dim3 grid1; x=646881376, y=21975, z=1517277280
dim3 grid2 = {1,1,1}; x=1, y=1, z=1
```
+
In which "dim3 grid1;" does not imply any initialization; no constructor is called, and the dimensional values x,y,z of grid1 are undefined.
NOTE: To get the C++ default behavior, C programmers must additionally specify the right-hand side as shown below,
-```
+
+```cpp
dim3 grid = {1,1,1}; // initialized as in C++
```
-
## Can I install both CUDA SDK and HIP-Clang on the same machine?
+
Yes. You can use HIP_PLATFORM to choose which path hipcc targets.
This configuration can be useful when using HIP to develop an application which is portable to both AMD and NVIDIA.

+## HIP detected my platform (HIP-Clang vs nvcc) incorrectly - what should I do?

-## HIP detected my platform (HIP-Clang vs nvcc) incorrectly - what should I do?
HIP will set the platform to AMD and use HIP-Clang as the compiler if it sees that the AMD graphics driver is installed and has detected an AMD GPU.
-Sometimes this isn't what you want - you can force HIP to recognize the platform by setting the following,
-```
+Sometimes this isn't what you want - you can force HIP to recognize the platform by setting the following,
+
+```bash
export HIP_PLATFORM=amd
```
+
HIP will then set and use the correct AMD compiler and runtime:
HIP_COMPILER=clang
HIP_RUNTIME=rocclr

To choose the NVIDIA platform, you can set,
-```
+
+```bash
export HIP_PLATFORM=nvidia
```
+
In this case, HIP will set and use the following,
HIP_COMPILER=cuda
HIP_RUNTIME=nvcc
@@ -233,6 +270,7 @@ HIP_RUNTIME=nvcc
One symptom of this problem is the message "error: 'unknown error'(11) at square.hipref.cpp:56". This can occur if you have a CUDA installation on an AMD platform, and HIP incorrectly detects the platform as nvcc. HIP may be able to compile the application using the nvcc tool-chain but will generate this error at runtime since the platform does not have a CUDA device.

## On CUDA, can I mix CUDA code with HIP code?
+
Yes. Most HIP data structures (hipStream_t, hipEvent_t) are typedefs to CUDA equivalents and can be intermixed. Both CUDA and HIP use integer device ids.
One notable exception is that hipError_t is a new type, and cannot be used where a cudaError_t is expected. In these cases, refactor the code to remove the expectation.

Alternatively, hip_runtime_api.h defines functions which convert between the error code spaces:

hipErrorToCudaError
hipCUDAErrorTohipError

@@ -243,19 +281,24 @@ hipCUResultTohipError
If platform portability is important, use #ifdef __HIP_PLATFORM_NVIDIA__ to guard the CUDA-specific code.
## How do I trace HIP application flow?
+
See {doc}`/how-to/logging` for more information.

## What are the maximum limits of kernel launch parameters?
+
The product of block.x, block.y, and block.z should be less than 1024.
Please note, HIP does not support kernel launches with total work items defined in a dimension with size gridDim x blockDim >= 2^32, so gridDim.x * blockDim.x, gridDim.y * blockDim.y and gridDim.z * blockDim.z are always less than 2^32.

## Are __shfl_*_sync functions supported on HIP platform?
+
__shfl_*_sync functions are not supported on HIP; however, on the nvcc path with CUDA 9.0 and above, all shuffle calls get redirected to their sync versions.

## How to create a guard for code that is specific to the host or the GPU?
+
The compiler defines the `__HIP_DEVICE_COMPILE__` macro only when compiling the code for the GPU. It could be used to guard code that is specific to the host or the GPU.

## Why is _OPENMP undefined when compiling with -fopenmp?
+
When compiling an OpenMP source file with `hipcc -fopenmp`, the compiler may generate an error if there is a reference to the `_OPENMP` macro. This is due to a limitation in hipcc that treats any source file type (for example `.cpp`) as a HIP translation unit, leading to some conflicts with the OpenMP language switch. If the OpenMP source file doesn't contain any HIP language constructs, you could work around this issue by adding the `-x c++` switch to force the compiler to treat the file as regular C++. Another approach would be to guard the OpenMP code with `#ifdef _OPENMP` so that the code block is disabled when compiling for the GPU. The `__HIP_DEVICE_COMPILE__` macro defined by the HIP compiler when compiling GPU code could also be used for guarding code paths specific to the host or the GPU.

## Does the HIP-Clang compiler support extern shared declarations?
@@ -273,8 +316,8 @@ If you have compiled the application yourself, make sure you have given the corr
If you have a precompiled application/library (like rocblas, tensorflow etc) which gives you such an error, there are two possibilities.

- - The application/library does not ship code object bundles for *all* of your device(s): in this case you need to recompile the application/library yourself with correct `--offload-arch`.
- - The application/library does not ship code object bundles for *some* of your device(s), for example you have a system with an APU + GPU and the library does not ship code objects for your APU. For this you can set the environment variable `HIP_VISIBLE_DEVICES` or `CUDA_VISIBLE_DEVICES` on NVdia platform, to only enable GPUs for which code object is available. This will limit the GPUs visible to your application and allow it to run.
+* The application/library does not ship code object bundles for *all* of your device(s): in this case you need to recompile the application/library yourself with the correct `--offload-arch`.
+* The application/library does not ship code object bundles for *some* of your device(s), for example you have a system with an APU + GPU and the library does not ship code objects for your APU. For this you can set the environment variable `HIP_VISIBLE_DEVICES`, or `CUDA_VISIBLE_DEVICES` on the NVIDIA platform, to only enable the GPUs for which a code object is available. This will limit the GPUs visible to your application and allow it to run.

Note: In previous releases, the error code was hipErrorNoBinaryForGpu with the message "Unable to find code object for all current devices". The error code handling behavior has changed. The HIP runtime now shows the error code hipErrorSharedObjectInitFailed with the message "Error: shared object initialization failed" on unsupported GPUs.
@@ -298,8 +341,9 @@ typedef float2 hipFloatComplex;
typedef double2 hipDoubleComplex;

Any application that uses complex multiplication and division operations needs to replace the '*' and '/' operators with the following:
-- hipCmulf() and hipCdivf() for hipFloatComplex
-- hipCmul() and hipCdiv() for hipDoubleComplex
+
+* hipCmulf() and hipCdivf() for hipFloatComplex
+* hipCmul() and hipCdiv() for hipDoubleComplex

Note: These complex operations are equivalent to the corresponding types/functions on the NVIDIA platform.

@@ -329,4 +373,3 @@ hipRuntimeGetVersion(&runtimeVersion);
The version returned will always be greater than the versions in previous ROCm releases.

Note: The version definition of the HIP runtime is different from CUDA. On the AMD platform, the function returns the HIP runtime version, while on the NVIDIA platform, it returns the CUDA runtime version. There is no mapping/correlation between the HIP version and the CUDA version.
-
diff --git a/docs/how-to/hip_porting_driver_api.md b/docs/how-to/hip_porting_driver_api.md
index e2cfe23bb6..99847dbd11 100644
--- a/docs/how-to/hip_porting_driver_api.md
+++ b/docs/how-to/hip_porting_driver_api.md
@@ -1,16 +1,18 @@
# Porting CUDA Driver API

## Introduction to the CUDA Driver and Runtime APIs
+
CUDA provides separate Driver and Runtime APIs.
The two APIs have significant overlap in functionality:

-- Both APIs support events, streams, memory management, memory copy, and error handling.
-- Both APIs deliver similar performance.
-- Driver APIs calls begin with the prefix `cu` while Runtime APIs begin with the prefix `cuda`. For example, the Driver API API contains `cuEventCreate` while the Runtime API contains `cudaEventCreate`, with similar functionality.
-- The Driver API defines a different but largely overlapping error code space than the Runtime API, and uses a different coding convention.
For example, Driver API defines `CUDA_ERROR_INVALID_VALUE` while the Runtime API defines `cudaErrorInvalidValue`
+* Both APIs support events, streams, memory management, memory copy, and error handling.
+* Both APIs deliver similar performance.
+* Driver API calls begin with the prefix `cu` while Runtime API calls begin with the prefix `cuda`. For example, the Driver API contains `cuEventCreate` while the Runtime API contains `cudaEventCreate`, with similar functionality.
+* The Driver API defines a different but largely overlapping error code space from the Runtime API, and uses a different coding convention. For example, the Driver API defines `CUDA_ERROR_INVALID_VALUE` while the Runtime API defines `cudaErrorInvalidValue`.

The Driver API offers two additional pieces of functionality not provided by the Runtime API: cuModule and cuCtx APIs.

### cuModule API
+
The Module section of the Driver API provides additional control over how and when accelerator code objects are loaded.
For example, the driver API allows code objects to be loaded from files or memory pointers.
Symbols for kernels or global data can be extracted from the loaded code objects.
@@ -29,6 +31,7 @@ The Module functions can be used to load the generated code objects and launch k
As we will see below, HIP defines a Module API which provides similar explicit control over code object management.

### cuCtx API
+
The Driver API defines "Context" and "Devices" as separate entities.
Contexts contain a single device, and a device can theoretically have multiple contexts.
Each context contains a set of streams and events specific to the context.
@@ -62,12 +65,13 @@ Notably, there is not a fat binary format that can contain code for both NVCC an
The hipModule API can be used to load additional code objects, and in this way provides an extended capability to the automatically loaded code objects.
HIP-Clang allows both of these capabilities to be used together, if desired.
Of course it is possible to create a program with no kernels and thus no automatic loading. - ### hipCtx API + HIP provides a `Ctx` API as a thin layer over the existing Device functions. This Ctx API can be used to set the current context, or to query properties of the device associated with the context. The current context is implicitly used by other APIs such as `hipStreamCreate`. ### hipify translation of CUDA Driver API + The HIPIFY tools convert CUDA Driver APIs for streams, events, modules, devices, memory management, context, profiler to the equivalent HIP driver calls. For example, `cuEventCreate` will be translated to `hipEventCreate`. HIPIFY tools also convert error codes from the Driver namespace and coding convention to the equivalent HIP error code. Thus, HIP unifies the APIs for these common functions. @@ -78,25 +82,32 @@ The first flavor may be faster in some cases since they avoid host overhead to d HIP defines a single error space, and uses camel-case for all errors (i.e. `hipErrorInvalidValue`). #### Address Spaces + HIP-Clang defines a process-wide address space where the CPU and all devices allocate addresses from a single unified pool. Thus addresses may be shared between contexts, and unlike the original CUDA definition a new context does not create a new address space for the device. #### Using hipModuleLaunchKernel + `hipModuleLaunchKernel` is `cuLaunchKernel` in HIP world. It takes the same arguments as `cuLaunchKernel`. #### Additional Information -- HIP-Clang creates a primary context when the HIP API is called. So in a pure driver API code, HIP-Clang will create a primary context while HIP/NVCC will have empty context stack. + +* HIP-Clang creates a primary context when the HIP API is called. So in a pure driver API code, HIP-Clang will create a primary context while HIP/NVCC will have empty context stack. HIP-Clang will push primary context to context stack when it is empty. 
This can cause subtle differences in applications which mix the runtime and driver APIs.

### hip-clang Implementation Notes
+
#### .hip_fatbin
+
hip-clang links device code from different translation units together. For each device target, a code object is generated. Code objects for different device targets are bundled by clang-offload-bundler as one fatbinary, which is embedded as a global symbol `__hip_fatbin` in the .hip_fatbin section of the ELF file of the executable or shared object.

#### Initialization and Termination Functions
+
hip-clang generates initialization and termination functions for each translation unit for host code compilation. The initialization functions call `__hipRegisterFatBinary` to register the fatbinary embedded in the ELF file. They also call `__hipRegisterFunction` and `__hipRegisterVar` to register kernel functions and device-side global variables. The termination functions call `__hipUnregisterFatBinary`.
hip-clang emits a global variable `__hip_gpubin_handle` of void** type with linkonce linkage and an initial value of 0 for each host translation unit. Each initialization function checks `__hip_gpubin_handle`, registers the fatbinary only if `__hip_gpubin_handle` is 0, and saves the return value of `__hipRegisterFatBinary` to `__hip_gpubin_handle`. This guarantees that the fatbinary is only registered once. A similar check is done in the termination functions.

#### Kernel Launching
+
hip-clang supports kernel launching by the CUDA `<<<>>>` syntax and hipLaunchKernelGGL. The latter is a macro which expands to the CUDA `<<<>>>` syntax.
When the executable or shared library is loaded by the dynamic linker, the initialization functions are called. In the initialization functions, when `__hipRegisterFatBinary` is called, the code objects containing all kernels are loaded; when `__hipRegisterFunction` is called, the stub functions are associated with the corresponding kernels in the code objects.
@@ -108,6 +119,7 @@ By default, in the host code, for the `<<<>>>` statement, hip-clang first emits ### NVCC Implementation Notes #### Interoperation between HIP and CUDA Driver + CUDA applications may want to mix CUDA driver code with HIP code (see example below). This table shows the type equivalence to enable this interaction. |**HIP Type** |**CU Driver Type**|**CUDA Runtime Type**| @@ -121,12 +133,14 @@ CUDA applications may want to mix CUDA driver code with HIP code (see example be | hipArray | CUarray | cudaArray | #### Compilation Options + The `hipModule_t` interface does not support `cuModuleLoadDataEx` function, which is used to control PTX compilation options. HIP-Clang does not use PTX and does not support these compilation options. In fact, HIP-Clang code objects always contain fully compiled ISA and do not require additional compilation as a part of the load step. The corresponding HIP function `hipModuleLoadDataEx` behaves as `hipModuleLoadData` on HIP-Clang path (compilation options are not used) and as `cuModuleLoadDataEx` on NVCC path. For example (CUDA): -``` + +```cpp CUmodule module; void *imagePtr = ...; // Somehow populate data pointer with code object @@ -143,8 +157,10 @@ cuModuleLoadDataEx(module, imagePtr, numOptions, options, optionValues); CUfunction k; cuModuleGetFunction(&k, module, "myKernel"); ``` + HIP: -``` + +```cpp hipModule_t module; void *imagePtr = ...; // Somehow populate data pointer with code object @@ -166,7 +182,7 @@ hipModuleGetFunction(&k, module, "myKernel"); The below sample shows how to use `hipModuleGetFunction`. -``` +```cpp #include #include #include @@ -246,9 +262,9 @@ int main(){ ## HIP Module and Texture Driver API -HIP supports texture driver APIs however texture reference should be declared in host scope. Following code explains the use of texture reference for __HIP_PLATFORM_AMD__ platform. +HIP supports texture driver APIs however texture reference should be declared in host scope. 
Following code explains the use of texture reference for `__HIP_PLATFORM_AMD__` platform. -``` +```cpp // Code to generate code object #include "hip/hip_runtime.h" @@ -264,7 +280,8 @@ __global__ void tex2dKernel(hipLaunchParm lp, float* outputData, } ``` -``` + +```cpp // Host code: texture tex; diff --git a/docs/how-to/hip_porting_guide.md b/docs/how-to/hip_porting_guide.md index 7611a94c5d..ba088a3149 100644 --- a/docs/how-to/hip_porting_guide.md +++ b/docs/how-to/hip_porting_guide.md @@ -1,4 +1,5 @@ # HIP Porting Guide + In addition to providing a portable C++ programming environment for GPUs, HIP is designed to ease the porting of existing CUDA code into the HIP environment. This section describes the available tools and provides practical suggestions on how to port CUDA code and work through common issues. @@ -6,14 +7,17 @@ and provides practical suggestions on how to port CUDA code and work through com ## Porting a New CUDA Project ### General Tips -- Starting the port on a CUDA machine is often the easiest approach, since you can incrementally port pieces of the code to HIP while leaving the rest in CUDA. (Recall that on CUDA machines HIP is just a thin layer over CUDA, so the two code types can interoperate on nvcc platforms.) Also, the HIP port can be compared with the original CUDA code for function and performance. -- Once the CUDA code is ported to HIP and is running on the CUDA machine, compile the HIP code using the HIP compiler on an AMD machine. -- HIP ports can replace CUDA versions: HIP can deliver the same performance as a native CUDA implementation, with the benefit of portability to both Nvidia and AMD architectures as well as a path to future C++ standard support. You can handle platform-specific features through conditional compilation or by adding them to the open-source HIP infrastructure. 
-- Use **[hipconvertinplace-perl.sh](https://github.com/ROCm/HIPIFY/blob/amd-staging/bin/hipconvertinplace-perl.sh)** to hipify all code files in the CUDA source directory. + +* Starting the port on a CUDA machine is often the easiest approach, since you can incrementally port pieces of the code to HIP while leaving the rest in CUDA. (Recall that on CUDA machines HIP is just a thin layer over CUDA, so the two code types can interoperate on nvcc platforms.) Also, the HIP port can be compared with the original CUDA code for function and performance. +* Once the CUDA code is ported to HIP and is running on the CUDA machine, compile the HIP code using the HIP compiler on an AMD machine. +* HIP ports can replace CUDA versions: HIP can deliver the same performance as a native CUDA implementation, with the benefit of portability to both Nvidia and AMD architectures as well as a path to future C++ standard support. You can handle platform-specific features through conditional compilation or by adding them to the open-source HIP infrastructure. +* Use **[hipconvertinplace-perl.sh](https://github.com/ROCm/HIPIFY/blob/amd-staging/bin/hipconvertinplace-perl.sh)** to hipify all code files in the CUDA source directory. ### Scanning existing CUDA code to scope the porting effort + The **[hipexamine-perl.sh](https://github.com/ROCm/HIPIFY/blob/amd-staging/bin/hipexamine-perl.sh)** tool will scan a source directory to determine which files contain CUDA code and how much of that code can be automatically hipified. -``` + +```shell > cd examples/rodinia_3.0/cuda/kmeans > $HIP_DIR/bin/hipexamine-perl.sh. info: hipify ./kmeans.h =====> @@ -34,17 +38,19 @@ info: TOTAL-converted 89 CUDA->HIP refs( dev:3 mem:32 kern:2 builtin:37 math:0 s hipexamine-perl scans each code file (cpp, c, h, hpp, etc.) found in the specified directory: - * Files with no CUDA code (ie kmeans.h) print one line summary just listing the source file name. 
- * Files with CUDA code print a summary of what was found - for example the kmeans_cuda_kernel.cu file: -``` +* Files with no CUDA code (ie kmeans.h) print one line summary just listing the source file name. +* Files with CUDA code print a summary of what was found - for example the kmeans_cuda_kernel.cu file: + +```shell info: hipify ./kmeans_cuda_kernel.cu =====> info: converted 40 CUDA->HIP refs( dev:0 mem:0 kern:0 builtin:37 math:0 stream:0 event:0 ``` + * Interesting information in kmeans_cuda_kernel.cu : * How many CUDA calls were converted to HIP (40) * Breakdown of the CUDA functionality used (dev:0 mem:0 etc). This file uses many CUDA builtins (37) and texture functions (3). - * Warning for code that looks like CUDA API but was not converted (0 in this file). - * Count Lines-of-Code (LOC) - 185 for this file. + * Warning for code that looks like CUDA API but was not converted (0 in this file). + * Count Lines-of-Code (LOC) - 185 for this file. * hipexamine-perl also presents a summary at the end of the process for the statistics collected across all files. This has similar format to the per-file reporting, and also includes a list of all kernels which have been called. An example from above: @@ -60,18 +66,17 @@ info: TOTAL-converted 89 CUDA->HIP refs( dev:3 mem:32 kern:2 builtin:37 math:0 s ``` For each input file FILE, this script will: - - If "FILE.prehip file does not exist, copy the original code to a new file with extension ".prehip". Then hipify the code file. - - If "FILE.prehip" file exists, hipify FILE.prehip and save to FILE. -This is useful for testing improvements to the hipify toolset. +* If "FILE.prehip file does not exist, copy the original code to a new file with extension ".prehip". Then hipify the code file. +* If "FILE.prehip" file exists, hipify FILE.prehip and save to FILE. +This is useful for testing improvements to the hipify toolset. 
The [hipconvertinplace-perl.sh](https://github.com/ROCm/HIPIFY/blob/amd-staging/bin/hipconvertinplace-perl.sh) script will perform inplace conversion for all code files in the specified directory. This can be quite handy when dealing with an existing CUDA code base since the script preserves the existing directory structure and filenames - and includes work. After converting in-place, you can review the code to add additional parameters to directory names. - ```shell > hipconvertinplace-perl.sh MY_SRC_DIR ``` @@ -79,10 +84,11 @@ directory names. ### Library Equivalents Most CUDA libraries have a corresponding ROCm library with similar functionality and APIs. However, ROCm also provides HIP marshalling libraries that greatly simplify the porting process because they more precisely reflect their CUDA counterparts and can be used with either the AMD or NVIDIA platforms (see "Identifying HIP Target Platform" below). There are a few notable exceptions: - - MIOpen does not have a marshalling library interface to ease porting from cuDNN. - - RCCL is a drop-in replacement for NCCL and implements the NCCL APIs. - - hipBLASLt does not have a ROCm library but can still target the NVIDIA platform, as needed. - - EIGEN's HIP support is part of the library. + +* MIOpen does not have a marshalling library interface to ease porting from cuDNN. +* RCCL is a drop-in replacement for NCCL and implements the NCCL APIs. +* hipBLASLt does not have a ROCm library but can still target the NVIDIA platform, as needed. +* EIGEN's HIP support is part of the library. 
| CUDA Library | HIP Library | ROCm Library | Comment | |------------- | ----------- | ------------ | ------- | @@ -99,30 +105,28 @@ Most CUDA libraries have a corresponding ROCm library with similar functionality | EIGEN | EIGEN | N/A | C++ template library for linear algebra: matrices, vectors, numerical solvers, | NCCL | N/A | RCCL | Communications Primitives Library based on the MPI equivalents - - ## Distinguishing Compiler Modes - ### Identifying HIP Target Platform + All HIP projects target either AMD or NVIDIA platform. The platform affects which headers are included and which libraries are used for linking. -- `HIP_PLATFORM_AMD` is defined if the HIP platform targets AMD. +* `HIP_PLATFORM_AMD` is defined if the HIP platform targets AMD. Note, `HIP_PLATFORM_HCC` was previously defined if the HIP platform targeted AMD, it is deprecated. - -- `HIP_PLATFORM_NVDIA` is defined if the HIP platform targets NVIDIA. +* `HIP_PLATFORM_NVDIA` is defined if the HIP platform targets NVIDIA. Note, `HIP_PLATFORM_NVCC` was previously defined if the HIP platform targeted NVIDIA, it is deprecated. ### Identifying the Compiler: hip-clang or nvcc + Often, it's useful to know whether the underlying compiler is HIP-Clang or nvcc. This knowledge can guard platform-specific code or aid in platform-specific performance tuning. -``` +```cpp #ifdef __HIP_PLATFORM_AMD__ // Compiled with HIP-Clang #endif ``` -``` +```cpp #ifdef __HIP_PLATFORM_NVIDIA__ // Compiled with nvcc // Could be compiling with CUDA language extensions enabled (for example, a ".cu file) @@ -130,21 +134,20 @@ Often, it's useful to know whether the underlying compiler is HIP-Clang or nvcc. ``` -``` +```cpp #ifdef __CUDACC__ // Compiled with nvcc (CUDA language extensions enabled) ``` Compiler directly generates the host code (using the Clang x86 target) and passes the code to another host compiler. Thus, they have no equivalent of the \__CUDA_ACC define. 
- ### Identifying Current Compilation Pass: Host or Device nvcc makes two passes over the code: one for host code and one for device code. HIP-Clang will have multiple passes over the code: one for the host code, and one for each architecture on the device code. `__HIP_DEVICE_COMPILE__` is set to a nonzero value when the compiler (HIP-Clang or nvcc) is compiling code for a device inside a `__global__` kernel or for a device function. `__HIP_DEVICE_COMPILE__` can replace #ifdef checks on the `__CUDA_ARCH__` define. -``` +```cpp // #ifdef __CUDA_ARCH__ #if __HIP_DEVICE_COMPILE__ ``` @@ -152,6 +155,7 @@ HIP-Clang will have multiple passes over the code: one for the host code, and on Unlike `__CUDA_ARCH__`, the `__HIP_DEVICE_COMPILE__` value is 1 or undefined, and it doesn't represent the feature capability of the target device. ### Compiler Defines: Summary + |Define | HIP-Clang | nvcc | Other (GCC, ICC, Clang, etc.) |--- | --- | --- |---| |HIP-related defines:| @@ -175,27 +179,29 @@ Unlike `__CUDA_ARCH__`, the `__HIP_DEVICE_COMPILE__` value is 1 or undefined, an Some CUDA code tests `__CUDA_ARCH__` for a specific value to determine whether the machine supports a certain architectural feature. For instance, -``` +```cpp #if (__CUDA_ARCH__ >= 130) // doubles are supported ``` + This type of code requires special attention, since AMD and CUDA devices have different architectural capabilities. Moreover, you can't determine the presence of a feature using a simple comparison against an architecture's version number. HIP provides a set of defines and device properties to query whether a specific architectural feature is supported. The `__HIP_ARCH_*` defines can replace comparisons of `__CUDA_ARCH__` values: -``` + +```cpp //#if (__CUDA_ARCH__ >= 130) // non-portable if __HIP_ARCH_HAS_DOUBLES__ { // portable HIP feature query // doubles are supported } ``` -For host code, the `__HIP_ARCH__*` defines are set to 0. You should only use the __HIP_ARCH__ fields in device code. 
+For host code, the `__HIP_ARCH__*` defines are set to 0. You should only use the `__HIP_ARCH__` fields in device code.

### Device-Architecture Properties

Host code should query the architecture feature flags in the device properties that hipGetDeviceProperties returns, rather than testing the "major" and "minor" fields directly:

-```
+```cpp
hipGetDeviceProperties(&deviceProp, device);
//if ((deviceProp.major == 1 && deviceProp.minor < 2)) // non-portable
if (deviceProp.arch.hasSharedInt32Atomics) { // portable HIP feature query
@@ -204,6 +210,7 @@ if (deviceProp.arch.hasSharedInt32Atomics) { // portable HIP feature
```

### Table of Architecture Properties
+
The table below shows the full set of architectural properties that HIP supports.

|Define (use only in device code) | Device Property (run-time query) | Comment |
@@ -232,12 +239,11 @@ The table below shows the full set of architectural properties that HIP supports
|`__HIP_ARCH_HAS_3DGRID__` | has3dGrid | Grids and groups are 3D
|`__HIP_ARCH_HAS_DYNAMIC_PARALLEL__` | hasDynamicParallelism |
-
## Finding HIP

Makefiles can use the following syntax to conditionally provide a default HIP_PATH if one does not exist:

-```
+```shell
HIP_PATH ?= $(shell hipconfig --path)
```

@@ -245,15 +251,14 @@ HIP_PATH ?= $(shell hipconfig --path)
## Identifying HIP Runtime

HIP can depend on ROCclr or CUDA as the runtime:

-- AMD platform
+* AMD platform

On the AMD platform, HIP uses the Radeon Open Compute Common Language Runtime, called ROCclr.
ROCclr is a virtual device interface through which the HIP runtime interacts with different backends, which allows the runtime to work on Linux as well as Windows without much effort.

-- NVIDIA platform
+* NVIDIA platform

On the NVIDIA platform, HIP is just a thin layer on top of CUDA.
On non-AMD platforms, the HIP runtime determines if CUDA is available and can be used. If available, HIP_PLATFORM is set to nvidia and the underlying CUDA path is used.
-
## hipLaunchKernelGGL

hipLaunchKernelGGL is a macro that can serve as an alternative way to launch kernel, which accepts parameters of launch configurations (grid dims, group dims, stream, dynamic shared size) followed by a variable number of kernel arguments.
@@ -289,7 +294,6 @@ hipcc adds the necessary libraries for HIP as well as for the accelerator compil

hipcc adds -lm by default to the link command.

-
## Linking Code With Other Compilers

CUDA code often uses nvcc for accelerator code (defining and launching kernels, typically defined in .cu or .cuh files).
In some cases, you must take care to ensure the data types and alignment of the host compiler are identical to those of the device compiler.

HIP-Clang generates both device and host code using the same Clang-based compiler. The code uses the same API as gcc, which allows code generated by different gcc-compatible compilers to be linked together. For example, code compiled using HIP-Clang can link with code compiled using "standard" compilers (such as gcc, ICC and Clang). Take care to ensure all compilers use the same standard C++ header and library formats.

-
### libc++ and libstdc++

hipcc links to libstdc++ by default. This provides better compatibility between g++ and HIP.

If you pass "--stdlib=libc++" to hipcc, hipcc will use the libc++ library. Gene
When cross-linking C++ code, any C++ functions that use types from the C++ standard library (including std::string, std::vector and other containers) must use the same standard-library implementation. They include the following:

-- Functions or kernels defined in HIP-Clang that are called from a standard compiler
-- Functions defined in a standard compiler that are called from HIP-Clanng.
+* Functions or kernels defined in HIP-Clang that are called from a standard compiler
+* Functions defined in a standard compiler that are called from HIP-Clang.

Applications with these interfaces should use the default libstdc++ linking.
Applications which are compiled entirely with hipcc, and which benefit from advanced C++ features not supported in libstdc++, and which do not require portability to nvcc, may choose to use libc++. - ### HIP Headers (hip_runtime.h, hip_runtime_api.h) The hip_runtime.h and hip_runtime_api.h files define the types, functions and enumerations needed to compile a HIP program: -- hip_runtime_api.h: defines all the HIP runtime APIs (e.g., hipMalloc) and the types required to call them. A source file that is only calling HIP APIs but neither defines nor launches any kernels can include hip_runtime_api.h. hip_runtime_api.h uses no custom hc language features and can be compiled using a standard C++ compiler. -- hip_runtime.h: included in hip_runtime_api.h. It additionally provides the types and defines required to create and launch kernels. hip_runtime.h can be compiled using a standard C++ compiler but will expose a subset of the available functions. +* hip_runtime_api.h: defines all the HIP runtime APIs (e.g., hipMalloc) and the types required to call them. A source file that is only calling HIP APIs but neither defines nor launches any kernels can include hip_runtime_api.h. hip_runtime_api.h uses no custom hc language features and can be compiled using a standard C++ compiler. +* hip_runtime.h: included in hip_runtime_api.h. It additionally provides the types and defines required to create and launch kernels. hip_runtime.h can be compiled using a standard C++ compiler but will expose a subset of the available functions. CUDA has slightly different contents for these two files. In some cases you may need to convert hipified code to include the richer hip_runtime.h instead of hip_runtime_api.h. ### Using a Standard C++ Compiler + You can compile hip\_runtime\_api.h using a standard C or C++ compiler (e.g., gcc or ICC). 
The HIP include paths and defines (`__HIP_PLATFORM_AMD__` or `__HIP_PLATFORM_NVIDIA__`) must pass to the standard compiler; hipconfig then returns the necessary options: -``` + +```bash > hipconfig --cxx_config -D__HIP_PLATFORM_AMD__ -I/home/user1/hip/include ``` You can capture the hipconfig output and passed it to the standard compiler; below is a sample makefile syntax: -``` +```bash CPPFLAGS += $(shell $(HIP_PATH)/bin/hipconfig --cpp_config) ``` @@ -361,13 +365,16 @@ run hipcc when appropriate. ## Workarounds ### warpSize + Code should not assume a warp size of 32 or 64. See [Warp Cross-Lane Functions](https://rocm.docs.amd.com/projects/HIP/en/latest/reference/kernel_language.html#warp-cross-lane-functions) for information on how to write portable wave-aware code. ### Kernel launch with group size > 256 + Kernel code should use ``` __attribute__((amdgpu_flat_work_group_size(,)))```. For example: -``` + +```cpp __global__ void dot(double *a,double *b,const int n) __attribute__((amdgpu_flat_work_group_size(1, 512))) ``` @@ -382,7 +389,8 @@ Note that the symbol name needs to be encased in the HIP_SYMBOL macro, as shown For example: Device Code: -``` + +```cpp #include #include #include @@ -431,7 +439,8 @@ int main() To get pointer's memory type in HIP/HIP-Clang, developers should use hipPointerGetAttributes API. First parameter of the API is hipPointerAttribute_t which has 'type' as member variable. 'type' indicates input pointer is allocated on device or host. For example: -``` + +```cpp double * ptr; hipMalloc(reinterpret_cast(&ptr), sizeof(double)); hipPointerAttribute_t attr; @@ -442,10 +451,12 @@ hipHostMalloc(&ptrHost, sizeof(double)); hipPointerAttribute_t attr; hipPointerGetAttributes(&attr, ptrHost); /*attr.type will have value as hipMemoryTypeHost*/ ``` + Please note, hipMemoryType enum values are different from cudaMemoryType enum values. 
For example, on AMD platform, hipMemoryType is defined in hip_runtime_api.h, -``` + +```cpp typedef enum hipMemoryType { hipMemoryTypeHost = 0, ///< Memory is physically located on host hipMemoryTypeDevice = 1, ///< Memory is physically located on device. (see deviceId for specific device) @@ -454,8 +465,10 @@ typedef enum hipMemoryType { hipMemoryTypeManaged = 4 ///< Managed memory, automaticallly managed by the unified memory system } hipMemoryType; ``` + Looking into CUDA toolkit, it defines cudaMemoryType as following, -``` + +```cpp enum cudaMemoryType { cudaMemoryTypeUnregistered = 0, // Unregistered memory. @@ -464,6 +477,7 @@ enum cudaMemoryType cudaMemoryTypeManaged = 3, // Managed memory } ``` + In this case, memory type translation for hipPointerGetAttributes needs to be handled properly on nvidia platform to get the correct memory type in CUDA, which is done in the file nvidia_hip_runtime_api.h. So in any HIP applications which use HIP APIs involving memory types, developers should use #ifdef in order to assign the correct enum values depending on Nvidia or AMD platform. @@ -475,6 +489,7 @@ With the #ifdef condition, HIP APIs work as expected on both AMD and NVIDIA plat Note, cudaMemoryTypeUnregstered is currently not supported in hipMemoryType enum, due to HIP functionality backward compatibility. ## threadfence_system + Threadfence_system makes all device memory writes, all writes to mapped host memory, and all writes to peer memory visible to CPU and other GPU devices. Some implementations can provide this behavior by flushing the GPU L2 cache. HIP/HIP-Clang does not provide this functionality. As a workaround, users can set the environment variable `HSA_DISABLE_CACHE=1` to disable the GPU L2 cache. This will affect all accesses and for all kernels and so may have a performance impact. 
@@ -487,9 +502,8 @@ AMD compilers currently load all data into both the L1 and L2 caches, so __ldg i We recommend the following for functional portability: -- For programs that use textures only to benefit from improved caching, use the __ldg instruction -- Programs that use texture object and reference APIs, work well on HIP - +* For programs that use textures only to benefit from improved caching, use the __ldg instruction +* Programs that use texture object and reference APIs, work well on HIP ## More Tips @@ -499,7 +513,7 @@ On an AMD platform, set the AMD_LOG_LEVEL environment variable to log HIP applic The value of the setting controls different logging level, -``` +```cpp enum LogLevel { LOG_NONE = 0, LOG_ERROR = 1, @@ -512,7 +526,7 @@ LOG_DEBUG = 4 Logging mask is used to print types of functionalities during the execution of HIP application. It can be set as one of the following values, -``` +```cpp enum LogMask { LOG_API = 1, //!< (0x1) API call LOG_CMD = 2, //!< (0x2) Kernel and Copy Commands and Barriers @@ -538,9 +552,10 @@ enum LogMask { ``` ### Debugging hipcc + To see the detailed commands that hipcc issues, set the environment variable HIPCC_VERBOSE to 1. Doing so will print to stderr the HIP-clang (or nvcc) commands that hipcc generates. -``` +```bash export HIPCC_VERBOSE=1 make ... @@ -548,6 +563,5 @@ hipcc-cmd: /opt/rocm/bin/hipcc --offload-arch=native -x hip backprop_cuda.cu ``` ### Editor Highlighting -See the utils/vim or utils/gedit directories to add handy highlighting to hip files. - +See the utils/vim or utils/gedit directories to add handy highlighting to hip files. 
diff --git a/docs/how-to/hip_rtc.md b/docs/how-to/hip_rtc.md
index bd22beeebf..e112056614 100644
--- a/docs/how-to/hip_rtc.md
+++ b/docs/how-to/hip_rtc.md
@@ -5,16 +5,17 @@ Kernels can be stored as a text string and can be passed to HIPRTC APIs alongsid
NOTE:

- - This library can be used on systems without HIP installed nor AMD GPU driver installed at all (offline compilation). Therefore, it does not depend on any HIP runtime library.
- - But it does depend on COMGr. You may try to statically link COMGr into HIPRTC to avoid any ambiguity.
- - Developers can decide to bundle this library with their application.
+* This library can be used on systems without HIP or the AMD GPU driver installed (offline compilation). Therefore, it does not depend on any HIP runtime library.
+* But it does depend on COMGr. You may try to statically link COMGr into HIPRTC to avoid any ambiguity.
+* Developers can decide to bundle this library with their application.

## Example
+
To use HIPRTC functionality, HIPRTC header needs to be included first.

```#include ```

-
Kernels can be stored in a string:
+
```cpp
static constexpr auto kernel_source {
    R"(
@@ -43,6 +44,7 @@ hiprtcCreateProgram API also allows you to add headers which can be included in
For online compilation, the compiler pre-defines HIP device API functions, HIP specific types and macros for device compilation, but does not include standard C/C++ headers by default. Users can only include header files provided to hiprtcCreateProgram.

After associating the kernel string with hiprtcProgram, you can now compile this program using:
+
```cpp
hiprtcCompileProgram(prog, // hiprtcProgram
0, // Number of options
@@ -65,6 +67,7 @@ if (logSize) {
```

If the compilation is successful, you can load the compiled binary in a local variable.
+
```cpp
size_t codeSize;
hiprtcGetCodeSize(prog, &codeSize);
@@ -77,6 +80,7 @@ After loading the binary, hiprtcProgram can be destroyed.
```hiprtcDestroyProgram(&prog);```

The binary present in ```kernel_binary``` can now be loaded via ```hipModuleLoadData``` API.
+
```cpp
hipModule_t module;
hipFunction_t kernel;
@@ -88,6 +92,7 @@ hipModuleGetFunction(&kernel, module, "vector_add");
And now this kernel can be launched via hipModule APIs.

The full example is below:
+
```cpp
#include
#include
@@ -220,14 +225,17 @@ int main() {
```

## HIPRTC specific options
+
HIPRTC provides a few HIPRTC specific flags

- - ```--gpu-architecture``` : This flag can guide the code object generation for a specific gpu arch. Example: ```--gpu-architecture=gfx906:sramecc+:xnack-```, its equivalent to ```--offload-arch```.
- - This option is compulsory if compilation is done on a system without AMD GPUs supported by HIP runtime.
- - Otherwise, HIPRTC will load the hip runtime and gather the current device and its architecture info and use it as option.
- - ```-fgpu-rdc``` : This flag when provided during the hiprtcCompileProgram generates the bitcode (HIPRTC doesn't convert this bitcode into ISA and binary). This bitcode can later be fetched using hiprtcGetBitcode and hiprtcGetBitcodeSize APIs.
+* ```--gpu-architecture``` : This flag can guide the code object generation for a specific gpu arch. Example: ```--gpu-architecture=gfx906:sramecc+:xnack-```, it's equivalent to ```--offload-arch```.
+ * This option is compulsory if compilation is done on a system without AMD GPUs supported by HIP runtime.
+ * Otherwise, HIPRTC will load the hip runtime and gather the current device and its architecture info and use it as option.
+* ```-fgpu-rdc``` : This flag when provided during the hiprtcCompileProgram generates the bitcode (HIPRTC doesn't convert this bitcode into ISA and binary). This bitcode can later be fetched using hiprtcGetBitcode and hiprtcGetBitcodeSize APIs.

### Bitcode
+
In the usual scenario, the kernel associated with hiprtcProgram is compiled into the binary which can be loaded and run.
However, if -fpu-rdc option is provided in the compile options, HIPRTC calls comgr and generates only the LLVM bitcode. It doesn't convert this bitcode to ISA and generate the final binary. + ```cpp std::string sarg = std::string("-fgpu-rdc"); const char* options[] = { @@ -238,6 +246,7 @@ hiprtcCompileProgram(prog, // hiprtcProgram ``` If the compilation is successful, one can load the bitcode in a local variable using the bitcode APIs provided by HIPRTC. + ```cpp size_t bitCodeSize; hiprtcGetBitcodeSize(prog, &bitCodeSize); @@ -263,7 +272,9 @@ HIPRTC assumes **WGP mode by default** for gfx10+. This can be overridden by pas The bitcode generated using the HIPRTC Bitcode APIs can be loaded using hipModule APIs and also can be linked with other generated bitcodes with appropriate linker flags using the HIPRTC linker APIs. This also provides more flexibility and optimizations to the applications who want to generate the binary dynamically according to their needs. The input bitcodes can be generated only for a specific architecture or it can be a bundled bitcode which is generated for multiple architectures. ### Example + Firstly, HIPRTC link instance or a pending linker invocation must be created using hiprtcLinkCreate, with the appropriate linker options provided. + ```cpp hiprtcLinkCreate( num_options, // number of options options, // Array of options @@ -272,6 +283,7 @@ hiprtcLinkCreate( num_options, // number of options ``` Following which, the bitcode data can be added to this link instance via hiprtcLinkAddData (if the data is present as a string) or hiprtcLinkAddFile (if the data is present as a file) with the appropriate input type according to the data or the bitcode used. 
+
```cpp
hiprtcLinkAddData(rtc_link_state, // HIPRTC link state
input_type, // type of the input data or bitcode
@@ -282,6 +294,7 @@ hiprtcLinkAddData(rtc_link_state, // HIPRTC link state
0, // Array of options applied to this input
0); // Array of option values cast to void*
```
+
```cpp
hiprtcLinkAddFile(rtc_link_state, // HIPRTC link state
input_type, // type of the input data or bitcode
@@ -292,6 +305,7 @@ hiprtcLinkAddFile(rtc_link_state, // HIPRTC link state
```

Once the bitcodes for multiple architectures are added to the link instance, the linking of the device code must be completed using hiprtcLinkComplete which generates the final binary.
+
```cpp
hiprtcLinkComplete(rtc_link_state, // HIPRTC link state
&binary, // upon success, points to the output binary
@@ -299,18 +313,23 @@ hiprtcLinkComplete(rtc_link_state, // HIPRTC link state
```

If the hiprtcLinkComplete returns successfully, the generated binary can be loaded and run using the hipModule* APIs.
+
```cpp
hipModuleLoadData(&module, binary);
```

#### Note
- - The compiled binary must be loaded before HIPRTC link instance is destroyed using the hiprtcLinkDestroy API.
+
+* The compiled binary must be loaded before HIPRTC link instance is destroyed using the hiprtcLinkDestroy API.
+
```cpp
hiprtcLinkDestroy(rtc_link_state);
```
- - The correct sequence of calls is : hiprtcLinkCreate, hiprtcLinkAddData or hiprtcLinkAddFile, hiprtcLinkComplete, hiprtcModuleLoadData, hiprtcLinkDestroy.
+
+* The correct sequence of calls is: hiprtcLinkCreate, hiprtcLinkAddData or hiprtcLinkAddFile, hiprtcLinkComplete, hipModuleLoadData, hiprtcLinkDestroy.

### Input Types
+
HIPRTC provides hiprtcJITInputType enumeration type which defines the input types accepted by the Linker APIs. Here are the enum values of hiprtcJITInputType. However only the input types HIPRTC_JIT_INPUT_LLVM_BITCODE, HIPRTC_JIT_INPUT_LLVM_BUNDLED_BITCODE and HIPRTC_JIT_INPUT_LLVM_ARCHIVES_OF_BUNDLED_BITCODE are supported currently.
HIPRTC_JIT_INPUT_LLVM_BITCODE can be used to load both LLVM bitcode or LLVM IR assembly code. However, HIPRTC_JIT_INPUT_LLVM_BUNDLED_BITCODE and HIPRTC_JIT_INPUT_LLVM_ARCHIVES_OF_BUNDLED_BITCODE are only for bundled bitcode and archive of bundled bitcode. @@ -338,8 +357,9 @@ COMgr, a shared library, incorporates the LLVM/Clang compiler that HIPRTC relies To ensure smooth operation and compatibility, an application may choose to ship the specific versions of HIPRTC and COMgr dynamic libraries, or it may opt to clearly specify the version requirements and dependencies. This approach guarantees that the application can correctly compile the specified version of bitcode/IR. ### Link Options -- `HIPRTC_JIT_IR_TO_ISA_OPT_EXT` - AMD Only. Options to be passed on to link step of compiler by `hiprtcLinkCreate`. -- `HIPRTC_JIT_IR_TO_ISA_OPT_COUNT_EXT` - AMD Only. Count of options passed on to link step of compiler. + +* `HIPRTC_JIT_IR_TO_ISA_OPT_EXT` - AMD Only. Options to be passed on to link step of compiler by `hiprtcLinkCreate`. +* `HIPRTC_JIT_IR_TO_ISA_OPT_COUNT_EXT` - AMD Only. Count of options passed on to link step of compiler. Example: @@ -354,9 +374,11 @@ hiprtcLinkCreate(2, jit_options.data(), (void**)lopts, &linkstate); ``` ## Error Handling + HIPRTC defines the hiprtcResult enumeration type and a function hiprtcGetErrorString for API call error handling. hiprtcResult enum defines the API result codes. HIPRTC APIs return hiprtcResult to indicate the call result. hiprtcGetErrorString function returns a string describing the given hiprtcResult code, e.g., HIPRTC_SUCCESS to "HIPRTC_SUCCESS". For unrecognized enumeration values, it returns "Invalid HIPRTC error code". hiprtcResult enum supported values and the hiprtcGetErrorString usage are mentioned below. 
+ ```cpp HIPRTC_SUCCESS = 0, HIPRTC_ERROR_OUT_OF_MEMORY = 1, @@ -372,6 +394,7 @@ HIPRTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION = 10, HIPRTC_ERROR_NAME_EXPRESSION_NOT_VALID = 11, HIPRTC_ERROR_INTERNAL_ERROR = 12 ``` + ```cpp hiprtcResult result; result = hiprtcCompileProgram(prog, 1, opts); @@ -381,6 +404,7 @@ std::cout << "hiprtcCompileProgram fails with error " << hiprtcGetErrorString(re ``` ## HIPRTC General APIs + HIPRTC provides the following API for querying the version. hiprtcVersion(int* major, int* minor) - This sets the output parameters major and minor with the HIP Runtime compilation major version and minor version number respectively. @@ -388,16 +412,19 @@ hiprtcVersion(int* major, int* minor) - This sets the output parameters major a Currently, it returns hardcoded value. This should be implemented to return HIP runtime major and minor version in the future releases. ## Lowered Names (Mangled Names) + HIPRTC mangles the ```__global__``` function names and names of ```__device__``` and ```__constant__``` variables. If the generated binary is being loaded using the HIP Runtime API, the kernel function or ```__device__/__constant__``` variable must be looked up by name, but this is very hard when the name has been mangled. To overcome this, HIPRTC provides API functions that map ```__global__``` function or ```__device__/__constant__``` variable names in the source to the mangled names present in the generated binary. The two APIs hiprtcAddNameExpression and hiprtcGetLoweredName provide this functionality. First, a 'name expression' string denoting the address for the ```__global__``` function or ```__device__/__constant__``` variable is provided to hiprtcAddNameExpression. Then, the program is compiled with hiprtcCompileProgram. During compilation, HIPRTC will parse the name expression string as a C++ constant expression at the end of the user program. 
Finally, the function hiprtcGetLoweredName is called with the original name expression and it returns a pointer to the lowered name. The lowered name can be used to refer to the kernel or variable in the HIP Runtime API. ### Note - - The identical name expression string must be provided on a subsequent call to hiprtcGetLoweredName to extract the lowered name. - - The correct sequence of calls is : hiprtcAddNameExpression, hiprtcCompileProgram, hiprtcGetLoweredName, hiprtcDestroyProgram. - - The lowered names must be fetched using hiprtcGetLoweredName only after the HIPRTC program has been compiled, and before it has been destroyed. + +* The identical name expression string must be provided on a subsequent call to hiprtcGetLoweredName to extract the lowered name. +* The correct sequence of calls is : hiprtcAddNameExpression, hiprtcCompileProgram, hiprtcGetLoweredName, hiprtcDestroyProgram. +* The lowered names must be fetched using hiprtcGetLoweredName only after the HIPRTC program has been compiled, and before it has been destroyed. ### Example + kernel containing various definitions ```__global__``` functions/function templates and ```__device__/__constant__``` variables can be stored in a string. ```cpp @@ -415,6 +442,7 @@ template __global__ void f3(int *result) { *result = sizeof(T); } )"}; ``` + hiprtcAddNameExpression is called with various name expressions referring to the address of ```__global__``` functions and ```__device__/__constant__``` variables. ```cpp @@ -428,12 +456,14 @@ for (auto&& x : variable_name_vec) hiprtcAddNameExpression(prog, x.c_str()); ``` After which, the program is compiled using hiprtcCompileProgram and the generated binary is loaded using hipModuleLoadData. And the mangled names can be fetched using hirtcGetLoweredName. 
+ ```cpp for (decltype(variable_name_vec.size()) i = 0; i != variable_name_vec.size(); ++i) { const char* name; hiprtcGetLoweredName(prog, variable_name_vec[i].c_str(), &name); } ``` + ```cpp for (decltype(kernel_name_vec.size()) i = 0; i != kernel_name_vec.size(); ++i) { const char* name; @@ -442,7 +472,8 @@ for (decltype(kernel_name_vec.size()) i = 0; i != kernel_name_vec.size(); ++i) { ``` The mangled name of the variables are used to look up the variable in the module and update its value. -``` + +```cpp hipDeviceptr_t variable_addr; size_t bytes{}; hipModuleGetGlobal(&variable_addr, &bytes, module, name); @@ -460,16 +491,19 @@ hipModuleLaunchKernel(kernel, 1, 1, 1, 1, 1, 1, 0, nullptr, nullptr, config); Please have a look at hiprtcGetLoweredName.cpp for the detailed example. ## Versioning + HIPRTC follows the below versioning. - - Linux - - HIPRTC follows the same versioning as HIP runtime library. - - The `so` name field for the shared library is set to MAJOR version. For example, for HIP 5.3 the `so` name is set to 5 (hiprtc.so.5). - - Windows - - HIPRTC dll is named as hiprtcXXYY.dll where XX is MAJOR version and YY is MINOR version. For example, for HIP 5.3 the name is hiprtc0503.dll. +* Linux + * HIPRTC follows the same versioning as HIP runtime library. + * The `so` name field for the shared library is set to MAJOR version. For example, for HIP 5.3 the `so` name is set to 5 (hiprtc.so.5). +* Windows + * HIPRTC dll is named as hiprtcXXYY.dll where XX is MAJOR version and YY is MINOR version. For example, for HIP 5.3 the name is hiprtc0503.dll. ## HIP header support - - Added HIPRTC support for all the hip common header files such as library_types.h, hip_math_constants.h, hip_complex.h, math_functions.h, surface_types.h etc. from 6.1. HIPRTC users need not include any HIP macros or constants explicitly in their header files. All of these should get included via HIPRTC builtins when the app links to HIPRTC library. 
+ +* Added HIPRTC support for all the hip common header files such as library_types.h, hip_math_constants.h, hip_complex.h, math_functions.h, surface_types.h etc. from 6.1. HIPRTC users need not include any HIP macros or constants explicitly in their header files. All of these should get included via HIPRTC builtins when the app links to HIPRTC library. ## Deprecation notice - - Currently HIPRTC APIs are separated from HIP APIs and HIPRTC is available as a separate library libhiprtc.so/libhiprtc.dll. But on Linux, HIPRTC symbols are also present in libhipamd64.so in order to support the existing applications. Gradually, these symbols will be removed from HIP library and applications using HIPRTC will be required to explicitly link to HIPRTC library. However, on Windows hiprtc.dll must be used as the hipamd64.dll doesn't contain the HIPRTC symbols. - - Data types such as uint32_t, uint64_t, int32_t, int64_t defined in std namespace in HIPRTC are deprecated earlier and are being removed from ROCm release 6.1 since these can conflict with the standard C++ data types. These data types are now prefixed with __hip__, e.g. __hip_uint32_t. Applications previously using std::uint32_t or similar types can use __hip_ prefixed types to avoid conflicts with standard std namespace or application can have their own definitions for these types. Also, type_traits templates previously defined in std namespace are moved to __hip_internal namespace as implementation details. + +* Currently HIPRTC APIs are separated from HIP APIs and HIPRTC is available as a separate library libhiprtc.so/libhiprtc.dll. But on Linux, HIPRTC symbols are also present in libhipamd64.so in order to support the existing applications. Gradually, these symbols will be removed from HIP library and applications using HIPRTC will be required to explicitly link to HIPRTC library. However, on Windows hiprtc.dll must be used as the hipamd64.dll doesn't contain the HIPRTC symbols. 
+* Data types such as uint32_t, uint64_t, int32_t, int64_t defined in std namespace in HIPRTC are deprecated earlier and are being removed from ROCm release 6.1 since these can conflict with the standard C++ data types. These data types are now prefixed with ```__hip__```, e.g. ```__hip_uint32_t```. Applications previously using std::uint32_t or similar types can use ```__hip_``` prefixed types to avoid conflicts with standard std namespace or application can have their own definitions for these types. Also, type_traits templates previously defined in std namespace are moved to ```__hip_internal``` namespace as implementation details. diff --git a/docs/how-to/programming_manual.md b/docs/how-to/programming_manual.md index 5fe5c18ac7..f3761c2f78 100644 --- a/docs/how-to/programming_manual.md +++ b/docs/how-to/programming_manual.md @@ -3,14 +3,17 @@ ## Host Memory ### Introduction + hipHostMalloc allocates pinned host memory which is mapped into the address space of all GPUs in the system, the memory can be accessed directly by the GPU device, and can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc(). There are two use cases for this host memory: -- Faster HostToDevice and DeviceToHost Data Transfers: + +* Faster HostToDevice and DeviceToHost Data Transfers: The runtime tracks the hipHostMalloc allocations and can avoid some of the setup required for regular unpinned memory. For exact measurements on a specific system, experiment with --unpinned and --pinned switches for the hipBusBandwidth tool. -- Zero-Copy GPU Access: +* Zero-Copy GPU Access: GPU can directly access the host memory over the CPU/GPU interconnect, without need to copy the data. This avoids the need for the copy, but during the kernel access each memory access must traverse the interconnect, which can be tens of times slower than accessing the GPU's local device memory. 
Zero-copy memory can be a good choice when the memory accesses are infrequent (perhaps only once). Zero-copy memory is typically "Coherent" and thus not cached by the GPU but this can be overridden if desired. ### Memory allocation flags + There are flags parameter which can specify options how to allocate the memory, for example, hipHostMallocPortable, the memory is considered allocated by all contexts, not just the one on which the allocation is made. hipHostMallocMapped, will map the allocation into the address space for the current device, and the device pointer can be obtained with the API hipHostGetDevicePointer(). @@ -19,6 +22,7 @@ hipHostMallocNumaUser is the flag to allow host memory allocation to follow Numa All allocation flags are independent, and can be used in any combination without restriction, for instance, hipHostMalloc can be called with both hipHostMallocPortable and hipHostMallocMapped flags set. Both usage models described above use the same allocation flags, and the difference is in how the surrounding code uses the host memory. ### Numa-aware host memory allocation + Numa policy determines how memory is allocated. Target of Numa policy is to select a CPU that is closest to each GPU. Numa distance is the measurement of how far between GPU and CPU devices. @@ -26,26 +30,29 @@ Numa distance is the measurement of how far between GPU and CPU devices. By default, each GPU selects a Numa CPU node that has the least Numa distance between them, that is, host memory will be automatically allocated closest on the memory pool of Numa node of the current GPU device. Using hipSetDevice API to a different GPU will still be able to access the host allocation, but can have longer Numa distance. Note, Numa policy is so far implemented on Linux, and under development on Windows. - ### Coherency Controls + ROCm defines two coherency options for host memory: -- Coherent memory : Supports fine-grain synchronization while the kernel is running. 
For example, a kernel can perform atomic operations that are visible to the host CPU or to other (peer) GPUs. Synchronization instructions include threadfence_system and C++11-style atomic operations. + +* Coherent memory : Supports fine-grain synchronization while the kernel is running. For example, a kernel can perform atomic operations that are visible to the host CPU or to other (peer) GPUs. Synchronization instructions include threadfence_system and C++11-style atomic operations. In order to achieve this fine-grained coherence, many AMD GPUs use a limited cache policy, such as leaving these allocations uncached by the GPU, or making them read-only. -- Non-coherent memory : Can be cached by GPU, but cannot support synchronization while the kernel is running. Non-coherent memory can be optionally synchronized only at command (end-of-kernel or copy command) boundaries. This memory is appropriate for high-performance access when fine-grain synchronization is not required. +* Non-coherent memory : Can be cached by GPU, but cannot support synchronization while the kernel is running. Non-coherent memory can be optionally synchronized only at command (end-of-kernel or copy command) boundaries. This memory is appropriate for high-performance access when fine-grain synchronization is not required. HIP provides the developer with controls to select which type of memory is used via allocation flags passed to hipHostMalloc and the HIP_HOST_COHERENT environment variable. By default, the environment variable HIP_HOST_COHERENT is set to 0 in HIP. The control logic in the current version of HIP is as follows: -- No flags are passed in: the host memory allocation is coherent, the HIP_HOST_COHERENT environment variable is ignored. -- hipHostMallocCoherent=1: The host memory allocation will be coherent, the HIP_HOST_COHERENT environment variable is ignored. 
-- hipHostMallocMapped=1: The host memory allocation will be coherent, the HIP_HOST_COHERENT environment variable is ignored. -- hipHostMallocNonCoherent=1, hipHostMallocCoherent=0, and hipHostMallocMapped=0: The host memory will be non-coherent, the HIP_HOST_COHERENT environment variable is ignored. -- hipHostMallocCoherent=0, hipHostMallocNonCoherent=0, hipHostMallocMapped=0, but one of the other HostMalloc flags is set: - - If HIP_HOST_COHERENT is defined as 1, the host memory allocation is coherent. - - If HIP_HOST_COHERENT is not defined, or defined as 0, the host memory allocation is non-coherent. -- hipHostMallocCoherent=1, hipHostMallocNonCoherent=1: Illegal. + +* No flags are passed in: the host memory allocation is coherent, the HIP_HOST_COHERENT environment variable is ignored. +* hipHostMallocCoherent=1: The host memory allocation will be coherent, the HIP_HOST_COHERENT environment variable is ignored. +* hipHostMallocMapped=1: The host memory allocation will be coherent, the HIP_HOST_COHERENT environment variable is ignored. +* hipHostMallocNonCoherent=1, hipHostMallocCoherent=0, and hipHostMallocMapped=0: The host memory will be non-coherent, the HIP_HOST_COHERENT environment variable is ignored. +* hipHostMallocCoherent=0, hipHostMallocNonCoherent=0, hipHostMallocMapped=0, but one of the other HostMalloc flags is set: + * If HIP_HOST_COHERENT is defined as 1, the host memory allocation is coherent. + * If HIP_HOST_COHERENT is not defined, or defined as 0, the host memory allocation is non-coherent. +* hipHostMallocCoherent=1, hipHostMallocNonCoherent=1: Illegal. ### Visibility of Zero-Copy Host Memory + Coherent host memory is automatically visible at synchronization points. 
Non-coherent
@@ -56,21 +63,24 @@ Non-coherent
 | hipEventSynchronize | host waits for the specified event to complete | device-scope release | yes | depends - see below|
 | hipStreamWaitEvent | stream waits for the specified event to complete | none | yes | no |
 
-
 ### hipEventSynchronize
+
 Developers can control the release scope for hipEvents:
-- By default, the GPU performs a device-scope acquire and release operation with each recorded event. This will make host and device memory visible to other commands executing on the same device.
+
+* By default, the GPU performs a device-scope acquire and release operation with each recorded event. This will make host and device memory visible to other commands executing on the same device.
 
 A stronger system-level fence can be specified when the event is created with hipEventCreateWithFlags:
-- hipEventReleaseToSystem : Perform a system-scope release operation when the event is recorded. This will make both Coherent and Non-Coherent host memory visible to other agents in the system, but may involve heavyweight operations such as cache flushing. Coherent memory will typically use lighter-weight in-kernel synchronization mechanisms such as an atomic operation and thus does not need to use hipEventReleaseToSystem.
-- hipEventDisableTiming: Events created with this flag will not record profiling data and provide the best performance if used for synchronization.
-### Summary and Recommendations:
+* hipEventReleaseToSystem: Perform a system-scope release operation when the event is recorded. This will make both Coherent and Non-Coherent host memory visible to other agents in the system, but may involve heavyweight operations such as cache flushing. Coherent memory will typically use lighter-weight in-kernel synchronization mechanisms, such as an atomic operation, and thus does not need hipEventReleaseToSystem.
+* hipEventDisableTiming: Events created with this flag will not record profiling data and provide the best performance if used for synchronization.
+
+### Summary and Recommendations
 
-- Coherent host memory is the default and is the easiest to use since the memory is visible to the CPU at typical synchronization points. This memory allows in-kernel synchronization commands such as threadfence_system to work transparently.
-- HIP/ROCm also supports the ability to cache host memory in the GPU using the "Non-Coherent" host memory allocations. This can provide performance benefit, but care must be taken to use the correct synchronization.
+* Coherent host memory is the default and is the easiest to use since the memory is visible to the CPU at typical synchronization points. This memory allows in-kernel synchronization commands such as threadfence_system to work transparently.
+* HIP/ROCm also supports the ability to cache host memory in the GPU using the "Non-Coherent" host memory allocations. This can provide a performance benefit, but care must be taken to use the correct synchronization.
 
 ### Managed memory allocation
+
 Managed memory, including the `__managed__` keyword, is supported in HIP combined host/device compilation on Linux, but not yet on Windows (support is under development).
 
 Managed memory, via unified memory allocation, allows data to be shared and accessed by both the CPU and GPU through a single pointer.
@@ -78,7 +88,7 @@ The allocation will be managed by AMD GPU driver using the Linux HMM (Heterogene
 In a HIP application, it is recommended to perform the capability check before calling the managed memory APIs. For example:
-```
+```cpp
 int managed_memory = 0;
 HIPCHECK(hipDeviceGetAttribute(&managed_memory,
   hipDeviceAttributeManagedMemory,p_gpuDevice));
@@ -92,6 +102,7 @@ else {
 .
 .
 .
 }
 ```
+
 Please note, the managed memory capability check may not be necessary; however, if HMM is not supported, managed malloc will fall back to using system memory, and other managed memory API calls will have undefined behavior.
 
 Note that managed memory is implemented on Linux and is not yet supported on Windows.
@@ -110,6 +121,7 @@ For more details, please check the documentation HIP-API.pdf.
 
 Please note that HIP streams do not guarantee concurrency on AMD hardware when multiple (at least six) long-running streams execute concurrently and use hipStreamSynchronize(nullptr) for synchronization.
 
 ## Direct Dispatch
+
 The HIP runtime has Direct Dispatch enabled by default on Linux since ROCm 4.4. With this feature, we move away from the conventional producer-consumer model in which the runtime creates a worker thread (consumer) for each HIP stream, and the host thread (producer) enqueues commands to a command queue (per stream).
@@ -123,6 +135,7 @@ AMD_DIRECT_DISPATCH=0
 
 Note that Direct Dispatch is implemented on Linux. It is currently not supported on Windows.
 
 ## HIP Runtime Compilation
+
 HIP supports runtime compilation (HIPRTC), which can enable optimizations and performance improvements compared with regular offline static compilation.
 
 HIPRTC APIs accept HIP source code as character strings and create program handles by compiling that source without spawning separate processes.
@@ -132,6 +145,7 @@ For more details on HIPRTC APIs, refer to [HIP Runtime API Reference](https://ro
 For Linux developers, the sample [here](https://github.com/ROCm/hip-tests/blob/develop/samples/2_Cookbook/23_cmake_hiprtc/saxpy.cpp) shows how to program a HIP application using the runtime compilation mechanism, and a detailed [HIPRTC programming guide](./hip_rtc) is also available.
 
 ## HIP Graph
+
 HIP graph is supported. For more details, refer to the HIP API Guide.
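The producer-consumer model that Direct Dispatch replaces (described in the Direct Dispatch section above) can be sketched in plain C++. This is a generic illustration of the pattern, not the actual runtime internals; the `Stream` class and its members are hypothetical:

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// One worker thread (consumer) per stream; the host thread (producer)
// enqueues commands into a per-stream queue.
class Stream {
public:
    Stream() : worker_([this] { run(); }) {}
    ~Stream() {
        post(nullptr);           // nullptr acts as a shutdown sentinel
        worker_.join();
    }
    // Producer side: the host thread enqueues a command and returns.
    void enqueue(std::function<void()> cmd) {
        assert(cmd);
        post(std::move(cmd));
    }
    // Host-side synchronization: block until every enqueued command ran.
    void synchronize() {
        std::unique_lock<std::mutex> lk(m_);
        idle_.wait(lk, [this] { return pending_ == 0; });
    }
private:
    void post(std::function<void()> cmd) {
        std::lock_guard<std::mutex> lk(m_);
        if (cmd) ++pending_;
        q_.push(std::move(cmd));
        cv_.notify_one();
    }
    // Consumer side: the per-stream worker drains the queue in order.
    void run() {
        for (;;) {
            std::function<void()> cmd;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return !q_.empty(); });
                cmd = std::move(q_.front());
                q_.pop();
            }
            if (!cmd) return;    // shutdown sentinel
            cmd();               // e.g. a kernel launch or a memcpy
            {
                std::lock_guard<std::mutex> lk(m_);
                --pending_;
            }
            idle_.notify_all();
        }
    }
    std::mutex m_;
    std::condition_variable cv_, idle_;
    std::queue<std::function<void()>> q_;
    int pending_ = 0;
    std::thread worker_;
};
```

Direct Dispatch removes the queue and the worker thread from this path: the host thread issues work to the hardware queues directly, which reduces launch latency and thread-switching overhead.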
## Device-Side Malloc
@@ -150,11 +164,11 @@ And users can explicitly use "hipStreamPerThread" as per-thread default stream h
 
 ## Use of Long Double Type
 
-In HIP-Clang, long double type is 80-bit extended precision format for x86_64, which is not supported by AMDGPU. HIP-Clang treats long double type as IEEE double type for AMDGPU. Using long double type in HIP source code will not cause issue as long as data of long double type is not transferred between host and device. However, long double type should not be used as kernel argument type.
+In HIP-Clang, the long double type is the 80-bit extended precision format for x86_64, which is not supported by AMDGPU. HIP-Clang treats long double as the IEEE double type for AMDGPU. Using long double in HIP source code will not cause issues as long as long double data is not transferred between host and device. However, long double should not be used as a kernel argument type.
 
-## Use of _Float16 Type
+## Use of ``_Float16`` Type
 
-If a host function is to be used between clang (or hipcc) and gcc for x86_64, i.e. its definition is compiled by one compiler but the caller is compiled by a different compiler, _Float16 or aggregates containing _Float16 should not be used as function argument or return type. This is due to lack of stable ABI for _Float16 on x86_64. Passing _Float16 or aggregates containing _Float16 between clang and gcc could cause undefined behavior.
+If a host function is to be used between clang (or hipcc) and gcc for x86_64, i.e. its definition is compiled by one compiler but the caller is compiled by a different compiler, ``_Float16`` or aggregates containing ``_Float16`` should not be used as a function argument or return type. This is due to the lack of a stable ABI for ``_Float16`` on x86_64. Passing ``_Float16`` or aggregates containing ``_Float16`` between clang and gcc could cause undefined behavior.
## FMA and contractions
@@ -179,13 +193,17 @@ HIP-Clang supports generating two types of static libraries. The first type of s
 In addition, the first type of library contains host objects with device code embedded as fat binaries. It is generated using the flag --emit-static-lib. The second type of library contains relocatable device objects and is generated using ar.
 
 Here is an example to create and use static libraries:
-- Type 1 using --emit-static-lib:
-  ```
+
+* Type 1 using --emit-static-lib:
+
+  ```shell
  hipcc hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a
  gcc test.cpp -L. -lhipOptLibrary -L/path/to/hip/lib -lamdhip64 -o test.out
  ```
-
-- Type 2 using system ar:
-  ```
+
+* Type 2 using system ar:
+
+  ```shell
  hipcc hipDevice.cpp -c -fgpu-rdc -o hipDevice.o
  ar rcsD libHipDevice.a hipDevice.o
  hipcc libHipDevice.a test.cpp -fgpu-rdc -o test.out
diff --git a/docs/index.md b/docs/index.md
index 8f13c546ca..e5645772bc 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -58,6 +58,6 @@ portable applications for AMD and NVIDIA GPUs from single source code.
 Known issues are listed on the [HIP GitHub repository](https://github.com/ROCm/HIP/issues).
 To contribute features or functions to the HIP project, refer to [Contributing to HIP](https://github.com/ROCm/HIP/blob/develop/CONTRIBUTING.md).
-To contribute to the documentation, refer to {doc}`Contributing to ROCm docs ` page.
+To contribute to the documentation, refer to the {doc}`Contributing to ROCm docs ` page.
 You can find licensing information on the [Licensing](https://rocm.docs.amd.com/en/latest/about/license.html) page.
diff --git a/docs/reference/terms.md b/docs/reference/terms.md
index d79d955603..ce6d51f3ec 100644
--- a/docs/reference/terms.md
+++ b/docs/reference/terms.md
@@ -34,5 +34,5 @@
 |Vector|`float4`|`float4`|`float4`|
 
 ## Notes
-The indexing functions (starting with `thread-index`) show the terminology for a 1D grid. Some APIs use reverse order of xyz / 012 indexing for 3D grids.
+The indexing functions (starting with `thread-index`) show the terminology for a 1D grid. Some APIs use reverse order of xyz / 012 indexing for 3D grids.
diff --git a/docs/understand/glossary.md b/docs/understand/glossary.md
index 69f697f565..cb211efbb7 100644
--- a/docs/understand/glossary.md
+++ b/docs/understand/glossary.md
@@ -1,23 +1,24 @@
 # Glossary of terms
 
-- **host**, **host cpu** : Executes the HIP runtime API and is capable of initiating kernel launches to one or more devices.
-- **default device** : Each host thread maintains a default device.
+* **host**, **host cpu**: Executes the HIP runtime API and is capable of initiating kernel launches to one or more devices.
+* **default device**: Each host thread maintains a default device.
 Most HIP runtime APIs (including memory allocation, copy commands, kernel launches) do not accept an explicit device argument but instead implicitly use the default device. The default device can be set with ```hipSetDevice```.
 
-- **active host thread** - the thread which is running the HIP APIs.
+* **active host thread** - the thread that is running the HIP APIs.
 
-- **HIP-Clang** - Heterogeneous AMDGPU Compiler, with its capability to compile HIP programs on AMD platform (https://github.com/RadeonOpenCompute/llvm-project).
+* **HIP-Clang** - the heterogeneous AMDGPU compiler, which can compile HIP programs for the AMD platform (https://github.com/RadeonOpenCompute/llvm-project).
 
-- **clr** - a repository for AMD Common Language Runtime, contains source codes for AMD's compute languages runtimes: HIP and OpenCL.
+* **clr** - the repository for the AMD Common Language Runtime; it contains the source code for AMD's compute language runtimes: HIP and OpenCL.
 clr (https://github.com/ROCm/clr) contains the following three parts:
- - ```hipamd```: contains implementation of HIP language on AMD platform.
- - ```rocclr```: contains common runtime used in HIP and OpenCL, which provides virtual device interfaces that compute runtimes interact with different backends such as ROCr on Linux or PAL on Windows.
- - ```opencl```: contains implementation of OpenCL on AMD platform.
-- **hipify tools** - tools to convert CUDA code to portable C++ code (https://github.com/ROCm/HIPIFY).
+ * ```hipamd```: contains the implementation of the HIP language for the AMD platform.
+ * ```rocclr```: contains the common runtime used by HIP and OpenCL; it provides the virtual device interfaces through which the compute runtimes interact with different backends, such as ROCr on Linux or PAL on Windows.
+ * ```opencl```: contains the implementation of OpenCL for the AMD platform.
 
-- **hipconfig** - tool to report various configuration properties of the target platform.
+* **hipify tools** - tools to convert CUDA code to portable C++ code (https://github.com/ROCm/HIPIFY).
 
-- **nvcc** - NVIDIA CUDA ```nvcc``` compiler, do not capitalize.
+* **hipconfig** - a tool to report various configuration properties of the target platform.
+
+* **nvcc** - the NVIDIA CUDA ```nvcc``` compiler; do not capitalize.
diff --git a/util/vim/README.md b/util/vim/README.md
index 722eb670ad..b27ecaf1e2 100644
--- a/util/vim/README.md
+++ b/util/vim/README.md
@@ -1,9 +1,9 @@
 ### How to install? ###
+
 1. Add hip.vim to the ~/.vim/syntax/ directory
 2. Add the following text to the end of ~/.vimrc
-
-```
+```vim
 augroup filetypedetect
 au BufNewFile,BufRead *.cpp set filetype=cpp syntax=hip
 augroup END