3 changes: 2 additions & 1 deletion Compiling.md
@@ -33,6 +33,7 @@ As also mentioned in the instructions below but repeated here for visibility, if
* If using the OpenCL backend, a modern GPU that supports OpenCL 1.2 or greater, or else something like [this](https://software.intel.com/en-us/opencl-sdk) for CPU. But if using CPU, Eigen should be better.
* If using the CUDA backend, CUDA 11 or later and a compatible version of CUDNN based on your CUDA version (https://developer.nvidia.com/cuda-toolkit) (https://developer.nvidia.com/cudnn) and a GPU capable of supporting them.
* If using the TensorRT backend, in addition to a compatible CUDA Toolkit (https://developer.nvidia.com/cuda-toolkit), you also need TensorRT (https://developer.nvidia.com/tensorrt) that is at least version 8.5.
* If using the ROCm backend, ROCm 6.4 or later and a GPU capable of supporting it. See the [installation documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/), and be sure to install all of the ROCm developer packages, not just the ROCm runtime packages.
* If using the Eigen backend, Eigen3. With Debian packages, (i.e. apt or apt-get), this should be `libeigen3-dev`.
* zlib, libzip. With Debian packages (i.e. apt or apt-get), these should be `zlib1g-dev`, `libzip-dev`.
* If you want to do self-play training and research, probably Google perftools `libgoogle-perftools-dev` for TCMalloc or some other better malloc implementation. For unknown reasons, the allocation pattern in self-play with large numbers of threads and parallel games causes a lot of memory fragmentation under glibc malloc that will eventually run your machine out of memory, but better mallocs handle it fine.
@@ -41,7 +42,7 @@ As also mentioned in the instructions below but repeated here for visibility, if
* `git clone https://github.com/lightvector/KataGo.git`
* Compile using CMake and make in the cpp directory:
* `cd KataGo/cpp`
* `cmake . -DUSE_BACKEND=OPENCL` or `cmake . -DUSE_BACKEND=CUDA` or `cmake . -DUSE_BACKEND=TENSORRT` or `cmake . -DUSE_BACKEND=EIGEN` depending on which backend you want.
* `cmake . -DUSE_BACKEND=OPENCL` or `cmake . -DUSE_BACKEND=CUDA` or `cmake . -DUSE_BACKEND=TENSORRT` or `cmake . -DUSE_BACKEND=EIGEN` or `cmake . -DUSE_BACKEND=ROCM` depending on which backend you want.
* Specify also `-DUSE_TCMALLOC=1` if using TCMalloc.
* Compiling will also run git commands to embed the git hash into the compiled executable; specify `-DNO_GIT_REVISION=1` to disable this if it is causing issues for you.
* Specify `-DUSE_AVX2=1` to also compile Eigen with AVX2 and FMA support, which will make it incompatible with old CPUs but much faster. (If you want to go further, you can also add `-DCMAKE_CXX_FLAGS='-march=native'` which will specialize to precisely your machine's CPU, but the exe might not run on other machines at all).
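Putting the options above together, a full ROCm configure line might look like the sketch below. The `/opt/rocm` path and the exact flag combination are assumptions to adapt for your system; the snippet only prints the command, so it is safe to run even without ROCm installed.

```shell
# Assemble an illustrative ROCm configure command (printed, not executed).
build_cmd() {
  echo "cmake . -DUSE_BACKEND=ROCM -DCMAKE_PREFIX_PATH=/opt/rocm -DUSE_TCMALLOC=1"
}
build_cmd
# On a machine with the ROCm toolchain installed, you would then run:
#   $(build_cmd) && make -j"$(nproc)"
```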
55 changes: 31 additions & 24 deletions README.md
@@ -1,27 +1,30 @@
# KataGo

* [Overview](#overview)
* [Training History and Research](#training-history-and-research)
* [Where To Download Stuff](#where-to-download-stuff)
* [Setting Up and Running KataGo](#setting-up-and-running-katago)
* [GUIs](#guis)
* [Windows and Linux](#windows-and-linux)
* [MacOS](#macos)
* [OpenCL vs CUDA vs TensorRT vs Eigen](#opencl-vs-cuda-vs-tensorrt-vs-eigen)
* [How To Use](#how-to-use)
* [Tuning for Performance](#tuning-for-performance)
* [Common Questions and Issues](#common-questions-and-issues)
* [Issues with specific GPUs or GPU drivers](#issues-with-specific-gpus-or-gpu-drivers)
* [Common Problems](#common-problems)
* [Other Questions](#other-questions)
* [Features for Developers](#features-for-developers)
* [GTP Extensions](#gtp-extensions)
* [Analysis Engine](#analysis-engine)
* [Compiling KataGo](#compiling-katago)
* [Source Code Overview](#source-code-overview)
* [Selfplay Training](#selfplay-training)
* [Contributors](#contributors)
* [License](#license)
- [KataGo](#katago)
- [Overview](#overview)
- [Training History and Research and Docs](#training-history-and-research-and-docs)
- [Where To Download Stuff](#where-to-download-stuff)
- [Setting Up and Running KataGo](#setting-up-and-running-katago)
- [GUIs](#guis)
- [Windows and Linux](#windows-and-linux)
- [MacOS](#macos)
- [OpenCL vs CUDA vs TensorRT vs ROCm vs Eigen](#opencl-vs-cuda-vs-tensorrt-vs-rocm-vs-eigen)
- [How To Use](#how-to-use)
- [Human-style Play and Analysis](#human-style-play-and-analysis)
- [Other Commands:](#other-commands)
- [Tuning for Performance](#tuning-for-performance)
- [Common Questions and Issues](#common-questions-and-issues)
- [Issues with specific GPUs or GPU drivers](#issues-with-specific-gpus-or-gpu-drivers)
- [Common Problems](#common-problems)
- [Other Questions](#other-questions)
- [Features for Developers](#features-for-developers)
- [GTP Extensions:](#gtp-extensions)
- [Analysis Engine:](#analysis-engine)
- [Compiling KataGo](#compiling-katago)
- [Source Code Overview:](#source-code-overview)
- [Selfplay Training:](#selfplay-training)
- [Contributors](#contributors)
- [License](#license)

## Overview

@@ -84,20 +87,22 @@ The community also provides KataGo packages for [Homebrew](https://brew.sh) on M

Use `brew install katago`. The latest config files and networks are installed in KataGo's `share` directory. Find them via `brew list --verbose katago`. A basic way to run katago is `katago gtp -config $(brew list --verbose katago | grep 'gtp.*\.cfg') -model $(brew list --verbose katago | grep .gz | head -1)`. You should choose the network according to the release notes and customize the provided example config, as with every other way of installing KataGo.

### OpenCL vs CUDA vs TensorRT vs Eigen
KataGo has four backends, OpenCL (GPU), CUDA (GPU), TensorRT (GPU), and Eigen (CPU).
### OpenCL vs CUDA vs TensorRT vs ROCm vs Eigen
KataGo has five backends: OpenCL (GPU), CUDA (GPU), TensorRT (GPU), ROCm (GPU), and Eigen (CPU).

The quick summary is:
* **To easily get something working, try OpenCL if you have any good or decent GPU.**
* **For often much better performance on NVIDIA GPUs, try TensorRT**, but you may need to install TensorRT from Nvidia.
* Use Eigen with AVX2 if you don't have a GPU or if your GPU is too old/weak to work with OpenCL, and you just want a plain CPU KataGo.
* Use Eigen without AVX2 if your CPU is old or on a low-end device that doesn't support AVX2.
* The CUDA backend can work for NVIDIA GPUs with CUDA+CUDNN installed but is likely worse than TensorRT.
* The ROCm backend can work for AMD GPUs with ROCm+MIOpen installed.

More in detail:
* OpenCL is a general GPU backend that should be able to run with any GPUs or accelerators supporting [OpenCL](https://en.wikipedia.org/wiki/OpenCL), including NVIDIA GPUs, AMD GPUs, as well as CPU-based OpenCL implementations or things like Intel Integrated Graphics. This is the most general GPU version of KataGo and doesn't require a complicated install like CUDA does, so it is the most likely to work out of the box as long as you have a fairly modern GPU. **However, it also needs to take some time to tune itself when run for the very first time.** For many systems, this will take 5-30 seconds, but on a few older/slower systems it may take many minutes or longer. Also, the quality of OpenCL implementations is sometimes inconsistent, particularly for Intel Integrated Graphics and for AMD GPUs more than several years old, so it might not work on very old machines, as well as on specific buggy newer AMD GPUs; see also [Issues with specific GPUs or GPU drivers](#issues-with-specific-gpus-or-gpu-drivers).
* CUDA is a GPU backend specific to NVIDIA GPUs (it will not work with AMD or Intel or any other GPUs) and requires installing [CUDA](https://developer.nvidia.com/cuda-zone) and [CUDNN](https://developer.nvidia.com/cudnn) and a modern NVIDIA GPU. On most GPUs, the OpenCL implementation will actually beat NVIDIA's own CUDA/CUDNN at performance. The exception is for top-end NVIDIA GPUs that support FP16 and tensor cores, in which case sometimes one is better and sometimes the other is better.
* TensorRT is similar to CUDA, but only uses NVIDIA's TensorRT framework to run the neural network with more optimized kernels. For modern NVIDIA GPUs, it should work whenever CUDA does and will usually be faster than CUDA or any other backend.
* ROCm is a GPU backend specific to AMD GPUs (it will not work with NVIDIA or Intel or any other GPUs) and requires installing [ROCm](https://rocm.docs.amd.com) and [MIOpen](https://rocm.docs.amd.com/projects/MIOpen) and a modern AMD GPU. On most GPUs, the OpenCL implementation will actually beat AMD's own ROCm/MIOpen at performance. The exception is top-end AMD GPUs with strong FP16 support, in which case sometimes one is better and sometimes the other is better.
* Eigen is a *CPU* backend that should work widely *without* needing a GPU or fancy drivers. Use this if you don't have a good GPU or really any GPU at all. It will be quite significantly slower than OpenCL or CUDA, but on a good CPU can still often get 10 to 20 playouts per second if using the smaller (15 or 20) block neural nets. Eigen can also be compiled with AVX2 and FMA support, which can provide a big performance boost for Intel and AMD CPUs from the last few years. However, it will not run at all on older CPUs (and possibly even some recent but low-power modern CPUs) that don't support these fancy vector instructions.
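To get a quick idea of which of these backends your machine could plausibly use, you can probe for the diagnostic tool that each GPU stack usually ships with. This is only a heuristic sketch; the tool names are the standard ones, but the absence of a tool does not strictly rule a backend out.

```shell
# Rough probe: nvidia-smi suggests CUDA/TensorRT, rocminfo suggests ROCm,
# clinfo suggests an OpenCL runtime is installed.
probe_backends() {
  for tool in nvidia-smi rocminfo clinfo; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: not found"
    fi
  done
}
probe_backends
```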

For **any** implementation, it's recommended that you also tune the number of threads used if you care about optimal performance, as it can make a factor of 2-3 difference in the speed. See "Tuning for Performance" below. However, if you mostly just want to get it working, the default untuned settings should still be reasonable.
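If you want to script that comparison, KataGo's benchmark command accepts a comma-separated list of thread counts via `-t`. The model and config filenames below are placeholders; the snippet only prints the command:

```shell
# Print an illustrative benchmark invocation comparing several thread counts.
# Replace model.bin.gz and gtp.cfg with your actual files before running it.
print_benchmark_cmd() {
  echo "./katago benchmark -model model.bin.gz -config gtp.cfg -t 8,16,32"
}
print_benchmark_cmd
```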
@@ -175,6 +180,8 @@ This section summarizes a number of common questions and issues when running Kat
#### Issues with specific GPUs or GPU drivers
If you are observing any crashes in KataGo while attempting to run the benchmark or the program itself, and you have one of the below GPUs, then this is likely the reason.

* **AMD GPUs** - If you choose to use the ROCm backend, you need a GPU listed in the official [system requirements](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html) (at least an AMD Radeon RX 7700 XT). The ROCm backend currently supports only Linux, because MIOpen and CMake's HIP language support do not yet work on Windows. We suggest installing the latest version of the ROCm developer stack.
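To check what architecture name your GPU reports, `rocminfo` (part of the ROCm runtime) lists agents with lines like `Name: gfx90a`, which you can compare against the supported-GPU list. A small helper can pull out the first such target name; the piped input below is illustrative stand-in output:

```shell
# extract_gfx: print the first gfx target name found on stdin.
extract_gfx() { grep -oEm1 'gfx[0-9a-f]+'; }

# On a real system: rocminfo | extract_gfx
printf 'Agent 2\n  Name: gfx1100\n' | extract_gfx   # prints: gfx1100
```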

* **AMD Radeon RX 5700** - AMD's drivers for OpenCL for this GPU have been buggy ever since this GPU was released, and as of May 2020 AMD has still never released a fix. If you are using this GPU, you will just not be able to run KataGo (Leela Zero and other Go engines will probably fail too) and will probably also obtain incorrect calculations or crash if doing anything else scientific or mathematical that uses OpenCL. See for example these reddit threads: [[1]](https://www.reddit.com/r/Amd/comments/ebso1x/its_not_just_setihome_any_mathematic_or/) or [[2]](https://www.reddit.com/r/BOINC/comments/ebiz18/psa_please_remove_your_amd_rx5700xt_from_setihome/) or this [L19 thread](https://lifein19x19.com/viewtopic.php?f=18&t=17093).
* **OpenCL Mesa** - These drivers for OpenCL are buggy. Particularly if on startup before crashing you see KataGo printing something like
`Found OpenCL Platform 0: ... (Mesa) (OpenCL 1.1 Mesa ...) ...`
101 changes: 99 additions & 2 deletions cpp/CMakeLists.txt
@@ -1,6 +1,10 @@
cmake_minimum_required(VERSION 3.18.2)
if(USE_BACKEND STREQUAL "METAL")
project(katago LANGUAGES CXX Swift)
elseif(USE_BACKEND STREQUAL "ROCM")
set(CMAKE_C_COMPILER /opt/rocm/bin/hipcc CACHE FILEPATH "" FORCE)
set(CMAKE_CXX_COMPILER /opt/rocm/bin/hipcc CACHE FILEPATH "" FORCE)
project(katago LANGUAGES C CXX HIP)
else()
project(katago)
endif()
@@ -32,7 +36,7 @@ endif()
set(BUILD_DISTRIBUTED 0 CACHE BOOL "Build with http support for contributing to distributed training")
set(USE_BACKEND CACHE STRING "Neural net backend")
string(TOUPPER "${USE_BACKEND}" USE_BACKEND)
set_property(CACHE USE_BACKEND PROPERTY STRINGS "" CUDA TENSORRT OPENCL EIGEN)
set_property(CACHE USE_BACKEND PROPERTY STRINGS "" CUDA TENSORRT OPENCL EIGEN ROCM)

set(USE_TCMALLOC 0 CACHE BOOL "Use TCMalloc")
set(NO_GIT_REVISION 0 CACHE BOOL "Disable embedding the git revision into the compiled exe")
@@ -139,6 +143,41 @@ elseif(USE_BACKEND STREQUAL "EIGEN")
set(NEURALNET_BACKEND_SOURCES
neuralnet/eigenbackend.cpp
)
# --------------------------- ROCm backend (AMD GPU / HIP + MIOpen) ---------------------------
elseif(USE_BACKEND STREQUAL "ROCM")
message(STATUS "-DUSE_BACKEND=ROCM, using AMD ROCm backend.")

enable_language(HIP)
set(CMAKE_HIP_STANDARD 17)

if(CMAKE_PREFIX_PATH STREQUAL "" OR NOT DEFINED CMAKE_PREFIX_PATH)
if(DEFINED ENV{HIP_PATH})
# Windows HIP SDK
list(APPEND CMAKE_PREFIX_PATH $ENV{HIP_PATH})
message(STATUS "Auto-detected HIP_PATH=$ENV{HIP_PATH}; appending to CMAKE_PREFIX_PATH")
elseif(EXISTS "/opt/rocm")
# Linux
list(APPEND CMAKE_PREFIX_PATH "/opt/rocm")
message(STATUS "CMAKE_PREFIX_PATH not given; defaulting to /opt/rocm")
endif()
endif()

# Users can manually specify GFX architectures, e.g. -DCMAKE_HIP_ARCHITECTURES="gfx90a;gfx942"
if(NOT DEFINED CMAKE_HIP_ARCHITECTURES)
# Defaults cover CDNA (MI100/MI200/MI300) and RDNA3/RDNA4 cards; trim the list as needed
set(CMAKE_HIP_ARCHITECTURES 90a 942 908 1100 1101 1200 1201 CACHE STRING "AMD GPU targets")
endif()

# Backend source files. rocmhelpers.hip contains the GPU kernels; don't forget it
set(NEURALNET_BACKEND_SOURCES
neuralnet/rocmbackend.cpp
neuralnet/rocmutils.cpp
neuralnet/rocmhelpers.hip
)

# Optional: enable model-size-based autotuning and other macros
# add_compile_definitions(HIP_SUPPORTS_FP16)

elseif(USE_BACKEND STREQUAL "")
message(WARNING "${ColorBoldRed}WARNING: Using dummy neural net backend, intended for non-neural-net testing only, will fail on any code path requiring a neural net. To use neural net, specify -DUSE_BACKEND=CUDA or -DUSE_BACKEND=TENSORRT or -DUSE_BACKEND=OPENCL or -DUSE_BACKEND=EIGEN or -DUSE_BACKEND=ROCM to compile with the respective backend.${ColorReset}")
set(NEURALNET_BACKEND_SOURCES neuralnet/dummybackend.cpp)
@@ -418,6 +457,64 @@ elseif(USE_BACKEND STREQUAL "OPENCL")
link_directories(${OpenCL_LIBRARY})
target_link_libraries(katago ${OpenCL_LIBRARY})
endif()
# --------------------------- ROCm linking stage ---------------------------
elseif(USE_BACKEND STREQUAL "ROCM")
# Macro: used in source code with #ifdef USE_ROCM_BACKEND
target_compile_definitions(katago PRIVATE USE_ROCM_BACKEND)
target_compile_definitions(katago PRIVATE HIP_TARGET_VERSION=${CMAKE_HIP_COMPILER_VERSION})

string(TOLOWER "${CMAKE_HIP_ARCHITECTURES}" _gfxlist) # e.g. "90a;942"
if(_gfxlist MATCHES "803|900|90a|94[0-9]|110[0-9]|120[0-9]")
target_compile_definitions(katago PRIVATE HIP_SUPPORTS_FP16)
message(STATUS "Detected FP16-capable GFX arch (${CMAKE_HIP_ARCHITECTURES}); defining HIP_SUPPORTS_FP16")
endif()

# Find the ROCm runtime & libraries. Since ROCm 6.x, CMake config-mode packages are included. If they are not found, add -DCMAKE_PREFIX_PATH=/opt/rocm
find_package(hip QUIET CONFIG) # Export hip::device / hip::host
find_package(hipblas QUIET CONFIG) # Export roc::hipblas
find_package(miopen QUIET CONFIG) # Export roc::miopen
# ---------- fallback: HIP runtime ----------
if(NOT hip_FOUND)
find_path(HIP_INCLUDE_DIR hip/hip_runtime.h
HINTS ${CMAKE_PREFIX_PATH} /opt/rocm
PATH_SUFFIXES include)
find_library(HIP_RUNTIME_LIB amdhip64
HINTS ${CMAKE_PREFIX_PATH} /opt/rocm
PATH_SUFFIXES lib lib64)
if(NOT HIP_INCLUDE_DIR OR NOT HIP_RUNTIME_LIB)
message(FATAL_ERROR "HIP headers or runtime NOT found; install ROCm or set CMAKE_PREFIX_PATH.")
endif()
add_library(hip::device UNKNOWN IMPORTED)
set_target_properties(hip::device PROPERTIES
IMPORTED_LOCATION "${HIP_RUNTIME_LIB}"
INTERFACE_INCLUDE_DIRECTORIES "${HIP_INCLUDE_DIR}")
target_include_directories(katago SYSTEM PRIVATE ${HIP_INCLUDE_DIR})
endif()

# ---------- fallback: hipBLAS / MIOpen ----------
foreach(_pkg hipblas miopen)
if(NOT ${_pkg}_FOUND)
find_library(${_pkg}_LIB ${_pkg}
HINTS ${CMAKE_PREFIX_PATH} /opt/rocm
PATH_SUFFIXES lib lib64)
if(${_pkg}_LIB)
add_library(roc::${_pkg} UNKNOWN IMPORTED)
set_target_properties(roc::${_pkg} PROPERTIES
IMPORTED_LOCATION "${${_pkg}_LIB}")
target_include_directories(katago SYSTEM PRIVATE ${HIP_INCLUDE_DIR})
else()
message(FATAL_ERROR "Required ROCm component ${_pkg} not found; install it or set CMAKE_PREFIX_PATH.")
endif()
endif()
endforeach()

# Header file paths are resolved by the config-mode targets; no need to hard-code them
target_link_libraries(katago
hip::device # HIP runtime & kernel offload
roc::hipblas # BLAS
MIOpen # DNN primitives (MIOpen's config package exports this target name)
)
elseif(USE_BACKEND STREQUAL "EIGEN")
target_compile_definitions(katago PRIVATE USE_EIGEN_BACKEND)
if(NOT (MSVC))
@@ -547,7 +644,7 @@ if(MSVC)
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} /STACK:8388608")
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "GNU" OR CMAKE_CXX_COMPILER_ID STREQUAL "Clang" OR CMAKE_CXX_COMPILER_ID STREQUAL "AppleClang")
message(STATUS "Setting up build for GNU, Clang or MinGW.")
if(NOT (${CMAKE_SYSTEM_PROCESSOR} MATCHES "(arm|aarch32|aarch64)"))
if(NOT (${CMAKE_SYSTEM_PROCESSOR} MATCHES "(arm|aarch32|aarch64)") AND NOT USE_BACKEND STREQUAL "ROCM")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mfpmath=sse")
else()
# For ARM architecture, as a hack, ensure that char is signed
2 changes: 1 addition & 1 deletion cpp/README.md
@@ -15,7 +15,7 @@ Summary of source folders, in approximate dependency order, from lowest level to
* `nninputs.{cpp,h}` - Implements the input features for the neural net.
* `sgfmetadata.{cpp,h}` - Implements the input features for the [HumanSL neural net](https://github.com/lightvector/KataGo/blob/master/docs/Analysis_Engine.md#human-sl-analysis-guide), for conditioning on various SGF metadata about human players from training data.
* `nninterface.h` - Common interface that is implemented by every low-level neural net backend.
* `{cuda,opencl,eigen,trt,dummy}backend.cpp` - Various backends.
* `{cuda,opencl,eigen,trt,rocm,metal,dummy}backend.cpp` - Various backends.
* `nneval.{cpp,h}` - Top-level handle to the neural net used by the rest of the engine, implements thread-safe batching of queries.
* `search` - The main search engine.
* `timecontrols.cpp` - Basic handling of a few possible time controls.
5 changes: 5 additions & 0 deletions cpp/command/benchmark.cpp
@@ -265,6 +265,11 @@ int MainCmds::benchmark(const vector<string>& args) {
cout << "If you have a strong GPU capable of FP16 tensor cores (e.g. RTX2080), "
<< "using the Cuda version of KataGo instead may give a mild performance boost." << endl;
#endif
#ifdef USE_ROCM_BACKEND
cout << "You are currently using the ROCm version of KataGo." << endl;
cout << "If you have a strong AMD GPU with good FP16 performance (e.g. RX 6900 XT), ROCm is likely a good fit; "
<< "otherwise the OpenCL version of KataGo may give a mild performance boost." << endl;
#endif
#ifdef USE_EIGEN_BACKEND
cout << "You are currently using the Eigen (CPU) version of KataGo. Due to having no GPU, it may be slow." << endl;
#endif
Expand Down