PolyHok Documentation

This page documents PolyHok commands, constructors, and functions.

The PolyHok Module

To create kernels and device functions, you must define a PolyHok module. A PolyHok module behaves similarly to a standard Elixir module (you can even define regular Elixir functions inside it), but it also supports the definition of kernels and device functions through PolyHok's special keywords.

Note that PolyHok kernels and device functions can only be invoked through PolyHok.spawn/4 (described here). Attempting to invoke them directly will cause Elixir to raise an exception.

require PolyHok

PolyHok.defmodule Example do
  # Kernels, device functions, regular Elixir functions
end

Defining Kernels and Device Functions

Kernels and device functions can only be defined inside a PolyHok module. To define a kernel, use the defk keyword. To define a device function, use the defd keyword. Function names and argument lists are written exactly like regular Elixir functions.

require PolyHok

PolyHok.defmodule Example do
  # Defining a kernel
  defk map_kernel(arr, size, f) do
    tid = threadIdx.x + blockDim.x * blockIdx.x
    
    if (tid < size) do
      arr[tid] = f(arr[tid])
    end
  end
  
  # Defining a device function
  defd inc_fun(el) do
    el + 1
  end
end

Built-in CUDA and OpenCL Constructs

As you may have noticed in the example above, the tid variable is computed using CUDA's special built-in structures: threadIdx, blockDim, and blockIdx. PolyHok supports these constructs to give programmers fine-grained control over kernel execution behavior.

The following CUDA built-in structures are currently supported in PolyHok:

threadIdx
blockIdx
blockDim
gridDim

In OCL-PolyHok (the OpenCL-based version of PolyHok), both CUDA-style structures and OpenCL built-in work-item functions are supported, such as get_global_id, get_local_id, and get_group_id. Internally, OCL-PolyHok maps CUDA-style indexing structures to the corresponding OpenCL execution model primitives, allowing programmers to use CUDA-style syntax even on non-NVIDIA hardware.

This design choice allows existing PolyHok code to run in OCL-PolyHok without source-code modifications. Additionally, programmers are free to choose whichever indexing style they prefer.

To demonstrate, the following map kernel written in OCL-PolyHok is equivalent to the previous PolyHok map kernel:

require OCLPolyHok

OCLPolyHok.defmodule Example do
  # Defining a kernel
  defk map_kernel(arr, size, f) do
    tid = get_global_id(0)
    
    if (tid < size) do
      arr[tid] = f(arr[tid])
    end
  end
  
  # Defining a device function
  defd inc_fun(el) do
    el + 1
  end
end

The following OpenCL built-in functions are currently supported in OCL-PolyHok:

get_global_id
get_global_size
get_local_id
get_local_size
get_group_id
get_num_groups

Programming in PolyHok

PolyHok's syntax is very similar to Elixir's, as it is dynamically typed, garbage collected, and uses the same do...end block syntax. However, PolyHok is an imperative language, which is fundamentally different from the functional programming paradigm adopted by Elixir.

Like any imperative language, PolyHok performs in-place memory updates. Elixir, on the other hand, follows the immutability principles of functional programming, creating new data structures whenever values are modified.

For repetitive tasks in Elixir, such as iterating over a list, recursion is commonly used. In PolyHok, however, recursion is forbidden (GPUs generally do not handle recursion efficiently). Instead, PolyHok provides traditional loop constructs such as for and while (described here).

Keeping these differences in mind is important to fully take advantage of PolyHok's programming model. And do not get confused: these imperative constructs and behaviors are exclusive to PolyHok, regular Elixir code remains completely unaffected.

Don't forget: PolyHok introduces imperative semantics exclusively inside kernels and device functions. Regular Elixir code remains purely functional and immutable.

Higher-Order Kernels and GPU Lambdas

One of the main features of PolyHok and OCL-PolyHok is support for higher-order GPU kernels.

A higher-order kernel is a kernel capable of receiving device functions as arguments. This enables a much more expressive programming style, allowing developers to build algorithmic skeletons such as map, reduce, filter, and other parallel patterns.

Traditional GPU programming frameworks such as CUDA and OpenCL C do not naturally support higher-order kernels or GPU lambdas, making this feature one of the key abstractions provided by PolyHok.

To support this feature, PolyHok provides the PolyHok.phok construct (or OCLPolyHok.phok in OCL-PolyHok), which allows the creation of GPU anonymous functions (GPU lambdas).

The following program demonstrates a map skeleton implemented with a higher-order kernel and uses a GPU lambda to define the mapping function:

require OCLPolyHok

OCLPolyHok.defmodule PMap do
  defk map_ker(a1,a2,size,f) do
    index = blockIdx.x * blockDim.x + threadIdx.x
    stride = blockDim.x * gridDim.x

    for i in range(index,size,stride) do
      a2[i] = f(a1[i])
    end
  end

  def map(input, f) do
    shape = OCLPolyHok.get_shape(input)
    type = OCLPolyHok.get_type(input)

    result_gpu = OCLPolyHok.new_gnx(shape,type)

    size = Tuple.product(shape)

    threadsPerBlock = 128
    numberOfBlocks = div(size + threadsPerBlock - 1,
                         threadsPerBlock)

    OCLPolyHok.spawn(
      &PMap.map_ker/4,
      {numberOfBlocks,1,1},
      {threadsPerBlock,1,1},
      [input,result_gpu,size,f]
    )

    result_gpu
  end
end

a = Nx.tensor(Enum.to_list(1..1000),
              type: {:s, 32})

result = a
  |> OCLPolyHok.new_gnx
  |> PMap.map(
       OCLPolyHok.phok fn x ->
         x + 1
       end
     )
  |> OCLPolyHok.get_gnx

IO.inspect(result, limit: :infinity)

In this example, the map_ker kernel receives a function f as an argument and applies it to every element of the input array in parallel. The kernel also uses a grid-stride loop, a common CUDA programming technique that allows kernels to efficiently process inputs larger than the total number of spawned threads.

The anonymous GPU function is created using OCLPolyHok.phok:

OCLPolyHok.phok fn x ->
  x + 1
end

Like kernels and device functions, GPU lambdas follow PolyHok's imperative programming model and may perform in-place memory updates.

The syntax and behavior of PolyHok.phok are identical in both PolyHok and OCL-PolyHok.

Note that using GPU lambdas is entirely optional. You can also pass references to device functions directly to higher-order kernels:

require OCLPolyHok

OCLPolyHok.defmodule PMap do
  defk map_ker(a1,a2,size,f) do
    index = blockIdx.x * blockDim.x + threadIdx.x
    stride = blockDim.x * gridDim.x

    for i in range(index,size,stride) do
      a2[i] = f(a1[i])
    end
  end

  defd inc(x) do
    x + 1
  end

  def map(input, f) do
    [...]
  end
end

a = Nx.tensor(Enum.to_list(1..1000),
              type: {:s, 32})

result = a
  |> OCLPolyHok.new_gnx
  |> PMap.map(
       &PMap.inc/1
     )
  |> OCLPolyHok.get_gnx

IO.inspect(result, limit: :infinity)

Higher-order kernels allow PolyHok programs to express reusable GPU computation patterns while still providing fine-grained control over execution. You can create your own custom skeletons and reuse them throughout your applications!

Launching Kernels

To execute a kernel in PolyHok, you must use the PolyHok.spawn/4 function. The function has the following syntax:

PolyHok.spawn(
  kernel,
  num_blocks,
  threads_per_block,
  kernel_args
)

kernel: a function reference to the kernel you wish to launch. It must follow Elixir's function reference syntax: &ModuleName.kernel_name/arity;
num_blocks: a 3-element tuple specifying the number of blocks in each grid dimension;
threads_per_block: a 3-element tuple specifying the number of threads in each block dimension;
kernel_args: an Elixir list containing the arguments passed to the kernel. The arguments must follow the exact same order as the kernel declaration;

To launch the previously defined map_kernel with an array containing 1,000 elements, we could write:

arr = Nx.tensor(Enum.to_list(1..1000), type: {:s, 32}) |> PolyHok.new_gnx
size = 1000

threadsPerBlock = 256
numberOfBlocks = div(size + threadsPerBlock - 1, threadsPerBlock)

PolyHok.spawn(
  &Example.map_kernel/3,
  {numberOfBlocks, 1, 1},
  {threadsPerBlock, 1, 1},
  [arr, size, &Example.inc_fun/1]
)

In this example, we create an Nx tensor containing 1,000 signed 32-bit integers. You can think of an Nx tensor as a contiguous array of elements with the same type stored in CPU memory (RAM).

Since GPUs cannot directly access CPU memory, we must copy the data to GPU memory (VRAM). To do this, we use the PolyHok.new_gnx function. A GNx is essentially an Nx tensor allocated in VRAM. PolyHok kernels and device functions can only read from and write to GNxes. We discuss GNxes in detail here.

With our input data stored in a GNx, we launch the kernel on the GPU using blocks containing 256 threads each. The number of blocks is computed dynamically to ensure that all elements of the input array are processed by a GPU thread.

The following expression is commonly used in CUDA programming to compute the minimum number of blocks required to process the entire input:

numberOfBlocks = div(size + threadsPerBlock - 1, threadsPerBlock)

This formula performs a ceiling division between size and threadsPerBlock. In our example, the array contains 1,000 elements and each block contains 256 threads. The division results in 4, meaning we need to launch at least 4 blocks of that size to guarantee full coverage of the input array. If fewer blocks were used, some elements of the array would not be processed.

Tip: the Elixir expression div(size + threadsPerBlock - 1, threadsPerBlock) computes the ceiling division of size / threadsPerBlock, ensuring that enough blocks are launched to cover the entire input.

PolyHok Statements and Commands

Synchronize

The PolyHok.synchronize/0 function blocks the host (CPU) until the GPU finishes all previously submitted tasks (like kernel executions, memory copy etc).

PolyHok.synchronize()

Most of the time, explicitly calling PolyHok.synchronize/0 is not necessary. For example, the PolyHok.get_gnx/1 function is blocking and automatically waits for all pending GPU operations to complete before copying data back from the GPU to the CPU.

Tip: kernel launches in PolyHok are asynchronous by default. This means that PolyHok.spawn/4 does not wait for the kernel to finish before returning, allowing the CPU to continue executing while the GPU runs the kernel.

If / Else

if condition do 
  # CODE
end

# --- or ---

if condition do 
  # CODE - IF
else
  # CODE - ELSE
end

For loop

Iterates from 0 to n with increments of 1:

for i in range(n) do
  # CODE
end

Iterates from start to stop with increments of 1:

for i in range(start, stop) do
  # CODE
end

Iterates from start to stop with increments of step:

for i in range(start, stop, step) do
  # CODE
end

While loop

while condition do
  # CODE
end

Shared memory declaration

The syntax is the same both for PolyHok and OCL-PolyHok.

__shared__(buffer[size])

Note: size must be a literal integer for shared memory declaration.

Return

return expression

Tip: when writing device functions, you don't need to explicitly add a return statement to return a value. Simply leave the value as the last statement in the body of the function. This behavior is the same of Elixir.

Operators

Arithmetic

+, -, *, /

Boolean

<=, <, >, >=, &&, ||, !, !=, ==

Bitwise

<<<: left shift
>>>: right shift
&&&: bitwise AND
|||: bitwise OR
+++: bitwise XOR

Modulus (Remainder)

number ~>> divisor

PolyHok uses unconventional operators for bitwise and modulus operations due to limitations imposed by Elixir's syntax rules. Unlike languages such as C++ or Python, Elixir does not allow arbitrary operator redefinition. Traditional bitwise operators (such as << and >>) are already reserved by the language and use incompatible semantics.

As a result, PolyHok adopts alternative operators from the subset of operators that Elixir allows developers to override. More information about Elixir operators can be found here.

Memory

As briefly discussed before, PolyHok kernels can only access memory allocated on the GPU. To manage this, PolyHok provides the GNx abstraction, built on top of Elixir's Nx library.

Numerical Elixir (Nx) is an Elixir library for numerical computing that enables manipulation of multi-dimensional tensors. Nx tensors support multiple dimensions, well-defined data types, and provide an easy-to-use interface for numerical programming. For this reason, PolyHok uses Nx as the foundation for the GNx (GPU Nx) abstraction.

A GNx is simply an Nx tensor allocated in VRAM that can be accessed and manipulated inside PolyHok kernels and device functions.

It is important to note that GNx data can only be accessed inside PolyHok kernels and device functions. In regular Elixir code, GNxes cannot be directly manipulated.

The inverse is also true for standard Nx tensors: they cannot be accessed inside kernels or device functions, as they reside in CPU memory and are only accessible from Elixir code executing on the CPU.

To create a GNx, you have two options:

Creating a GNx from an Existing Nx Tensor

arr_nx = Nx.tensor([1,2,3,4,5], type: {:s, 32})
arr_gnx = PolyHok.new_gnx(arr_nx)

In this example, we create a new Nx tensor from an Elixir list and set its type to a signed 32-bit integer. Then, we create a GNx from this tensor. PolyHok copies the tensor data to the GPU while preserving the original tensor's shape and data type.

Creating an Empty GNx

arr_gnx = PolyHok.new_gnx(10, 4, {:s, 32})

In this example, we create an empty GNx tensor with 10 rows and 4 columns (40 elements in total) using signed 32-bit integers.

PolyHok provides multiple variations of this function, which are listed below:

PolyHok.new_gnx(lines, columns, type)
PolyHok.new_gnx({columns}, type)
PolyHok.new_gnx({lines, columns}, type)
PolyHok.new_gnx({lines, columns, depth}, type)

About type support: currently, PolyHok and OCL-PolyHok support only the following data types for GNxes:

Signed 32-bit integers (:s32)
32-bit floating-point numbers (:f32)
64-bit floating-point numbers (:f64)

We plan on supporting all Nx data types in the future.

Reading Data Back from a GNx

Once data has been processed on the GPU, you will often want to transfer the results back to CPU memory so they can be accessed from Elixir code. PolyHok provides the PolyHok.get_gnx/1 function for this purpose:

cpu_nx = PolyHok.get_gnx(gnx_array)

This function reads the contents of a GNx from GPU memory and returns a traditional Nx tensor containing the copied data. The resulting tensor preserves all properties of the original GNx, including shape and data type.

Helper Functions

PolyHok also provides several utility functions that can be used with both GNxes and standard Nx tensors:

type = PolyHok.get_type_gnx(gnx_array)
type = PolyHok.get_type(nx_tensor)

shape_tuple = PolyHok.get_shape_gnx(gnx_array)
shape_tuple = PolyHok.get_shape(nx_tensor)

Table of Contents