PolyHok: GPGPU in Elixir
Graphics Processing Units (GPUs) are now a common part of modern computing systems, powering not only graphics but also general-purpose applications. However, developing software for GPUs is a challenging task, since the main tools available are APIs (Application Programming Interfaces) designed for low-level languages such as C and C++.
Software development in low-level languages (especially targeting GPUs) introduces several challenges, such as manual memory management, device synchronization, and explicit parallelism orchestration. These complexities have driven the search for simpler and more productive approaches.
With this goal in mind, PolyHok was developed: a Domain-Specific Language (DSL) embedded in the Elixir programming language. PolyHok adopts several functional programming features to provide high expressiveness and ease of use, such as higher-order polymorphic kernels (i.e., functions that run on the GPU and can receive other functions as parameters), dynamic typing, and automatic device memory management.
Due to its strong integration with Elixir, PolyHok can easily take advantage of the language’s powerful capabilities for parallel and distributed computing. An example of this is demonstrated in this demo.
Currently, there are two versions (or “flavors”) of PolyHok: one that uses CUDA as its execution platform (recommended for NVIDIA GPUs), and a second version based on OpenCL, designed to run on any accelerator from any vendor (such as AMD GPUs, Intel iGPUs, etc.). The OpenCL-based version of PolyHok was nicknamed “OCL-PolyHok” when it was created, but don’t be misled: they are the same language, differing only in their execution backends for broader compatibility.
Hello, World!
Below, we present a simple program to demonstrate the basic functionality of PolyHok. In this example, we create an array of numbers on the CPU, copy it to GPU memory (VRAM), add 1 to each element in parallel using a kernel, and bring the result back to the CPU.
```elixir
require PolyHok

PolyHok.defmodule SimpleTest do
  defk simple_kernel(array, size) do
    index = blockIdx.x * blockDim.x + threadIdx.x
    if (index < size) do
      array[index] = array[index] + 1.0
    end
  end
end

array_size = 100

# Create a tensor on the CPU of type float with values from 1 to array_size
array_cpu = Nx.tensor(Enum.to_list(1..array_size), type: {:f, 32})
IO.inspect(array_cpu, label: "CPU Array")

# Create a tensor on the GPU copying the data from the CPU tensor
array_gpu = array_cpu |> PolyHok.new_gnx()

# Spawn the kernel to run on the GPU
PolyHok.spawn(
  &SimpleTest.simple_kernel/2, # Kernel function
  {1, 1, 1},                   # Number of blocks
  {array_size, 1, 1},          # Threads per block
  [array_gpu, array_size])     # Kernel parameters

# Get result back to CPU
result = PolyHok.get_gnx(array_gpu)
IO.inspect(result, label: "Result after kernel execution")
```
Understanding the Code
Our program follows the classic workflow of heterogeneous computing: data preparation, transfer to the device, kernel execution, and result retrieval.
- CPU Preparation: We start by creating an Nx tensor (which essentially behaves like an array) of floating-point numbers with size 100 on the CPU. We use the variable `array_size` and the expression `Enum.to_list(1..array_size)` to populate it with values from 1 to 100. You can experiment with much larger values if you wish. Once created, we print the array to the screen so we can compare it later with the result.
- Copying to the GPU: To allow the GPU to process the data, we create a GNx (GPU Nx) from the CPU tensor using the command `array_cpu |> PolyHok.new_gnx()`. A GNx is an array that resides in GPU memory (VRAM) and can only be modified by kernels and device functions executed on the GPU.
- Kernel Execution: We launch the `simple_kernel` kernel using the `PolyHok.spawn/4` function. Notice the execution configuration: we use 1 block (`{1, 1, 1}`) containing 100 threads (`{array_size, 1, 1}`). Inside the kernel, each thread computes its own global index (`index = blockIdx.x * blockDim.x + threadIdx.x`) to independently determine which element of the array it should access and increment.
- Reading the Results: After execution, the modified GNx remains in GPU memory. We use `PolyHok.get_gnx/1` to bring the data back to the CPU. Once retrieved, we print the result to compare it with the original array.
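The index arithmetic and the bounds check are the heart of the kernel. As a rough mental model (a plain-Python simulation written for this tutorial, not PolyHok's actual execution model), imagine each (block, thread) pair running the kernel body once; the `index < size` guard keeps surplus threads from writing out of bounds whenever the grid is larger than the array:

```python
# Simulate a 1-D GPU launch on the CPU (illustrative sketch only).
def spawn_kernel(kernel, blocks, threads_per_block, args):
    """Run `kernel` once per (block, thread) pair, like a 1-D grid."""
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, *args)

def simple_kernel(block_idx, block_dim, thread_idx, array, size):
    index = block_idx * block_dim + thread_idx  # global thread index
    if index < size:                            # guard surplus threads
        array[index] = array[index] + 1.0

array = [float(i) for i in range(1, 101)]       # values 1.0 .. 100.0
# 4 blocks x 32 threads = 128 threads for 100 elements: the guard matters.
spawn_kernel(simple_kernel, 4, 32, (array, len(array)))
print(array[0], array[-1])  # 2.0 101.0
```

With the tutorial's configuration of 1 block and `array_size` threads, `blockIdx.x` is always 0, so `index` reduces to `threadIdx.x`; the multi-block form above shows why the general formula and the guard are still written that way.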
Expected Output
```
$ mix run tests/SimpleTest.exs
CPU Array: #Nx.Tensor<
  f32[100]
  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, ...]
>
Result after kernel execution: #Nx.Tensor<
  f32[100]
  [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, ...]
>
```
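As a quick sanity check of this output (plain Python, independent of PolyHok), the result should be exactly the input shifted by 1.0, i.e. 2.0 through 101.0:

```python
# Verify the expected kernel result: each element incremented by 1.0.
input_values = [float(i) for i in range(1, 101)]   # 1.0 .. 100.0
expected = [v + 1.0 for v in input_values]         # 2.0 .. 101.0
print(expected[:3], expected[-1])  # [2.0, 3.0, 4.0] 101.0
```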
Try PolyHok!
Both PolyHok and OCL-PolyHok (its OpenCL-based version) are available on GitHub, where you can find instructions on how to download and use the language on your machine. Click on the version you want to try to access its repository: