{ "cells": [ { "cell_type": "markdown", "id": "0d6fecdf-48c0-4745-b802-2117fb3137cf", "metadata": {}, "source": [ "# Basics of CuPy" ] }, { "cell_type": "markdown", "id": "15a05d43-0bf5-48d3-9c88-6074eed82a04", "metadata": {}, "source": [ "## Overview\n", "### In this tutorial, you will learn:\n", "\n", "* Basics of CuPy and GPU computing\n", "* Data transfer between host and device\n", "* Speed comparison with NumPy\n", "\n", "## Prerequisites\n", "\n", "| Concepts | Importance | Notes |\n", "| --- | --- | --- |\n", "| [Familiarity with NumPy](https://foundations.projectpythia.org/core/numpy.html) | Necessary | |\n", "\n", "- **Time to learn**: 30 minutes\n", "\n", "## Introduction to CuPy\n", "CuPy is an open-source GPU-accelerated array library for Python that is compatible with NumPy/SciPy. \n", "\n", "\n", "\n", "CuPy uses NVIDIA CUDA to run operations on the GPU, which can provide significant performance improvements for numerical computations compared to running on the CPU, especially at larger data sizes. CuPy provides a NumPy-like interface for array manipulation and supports a wide range of mathematical operations, making it a powerful tool for scientific computing on GPUs.\n", "\n", "
\n", " In simple terms, CuPy can be described as the GPU equivalent of NumPy.\n", "
\n", "\n", "CuPy is a library with capabilities similar to NumPy's, but with important distinctions that make it well suited to GPU computing. CuPy provides:\n", "\n", "* An object similar to NumPy's multidimensional array, except that it resides in the memory of the GPU, allowing for faster computations involving large data sets.\n", "\n", "* A system for applying \"universal functions\" (`ufuncs`) that adhere to broadcasting rules. This system leverages the parallel computing power of GPUs for better performance.\n", "\n", "* An extensive collection of CUDA-ready array functions. CUDA is NVIDIA's parallel computing platform and API model, which allows software developers to use a CUDA-enabled GPU for general-purpose processing. These pre-implemented mathematical functions can be used on arrays right away, taking full advantage of GPU acceleration.\n", "\n", "For more information about CuPy, please visit:\n", "\n", "[CuPy Documentation](https://docs.cupy.dev/en/stable/index.html#)\n", "\n", "[CuPy GitHub](https://github.com/cupy/cupy)\n", "\n", "In this tutorial, we will explore the distinctive features of CuPy and show how they differ from NumPy. Let's get started!" ] }, { "cell_type": "markdown", "id": "77343efb-de6d-423c-b1cd-934c5d6d68e1", "metadata": {}, "source": [ "## Getting Started with CuPy" ] }, { "cell_type": "markdown", "id": "1c0a8fe5-0923-464e-8ea0-77e8d46b7977", "metadata": {}, "source": [ "Once CuPy is installed, we can import it in the same way as NumPy:\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "55c72b7d-8899-4e2f-9432-e9cf1531cbdf", "metadata": {}, "outputs": [], "source": [ "## Import NumPy and CuPy\n", "import cupy as cp\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "62af1f7c-0ac2-4bad-ab92-8f1bcfbaffe3", "metadata": {}, "source": [ "### Arrays in CuPy vs. 
NumPy\n", "\n", "CuPy arrays are instances of the `cupy.ndarray` class, much like NumPy arrays are instances of `numpy.ndarray`. However, it is important to note that while NumPy arrays are allocated on the CPU (referred to as the \"host\"), CuPy arrays are allocated on the GPU (known as the \"device\").\n", "\n", "CuPy arrays look just like NumPy arrays:" ] }, { "cell_type": "code", "execution_count": 2, "id": "c98d68a4-3b43-4a7d-91e2-53afdb121273", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "On the CPU: [1 2 3 4 5]\n", "\n" ] } ], "source": [ "# create a 1D array with 5 elements on CPU\n", "arr_cpu = np.array([1, 2, 3, 4, 5])\n", "print(\"On the CPU: \", arr_cpu)\n", "print(type(arr_cpu))" ] }, { "cell_type": "code", "execution_count": 3, "id": "7f09bd38-67fd-465f-a3f7-547b2b989b62", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "On the GPU: [1 2 3 4 5]\n", "\n" ] } ], "source": [ "# create a 1D array with 5 elements on GPU\n", "arr_gpu = cp.array([1, 2, 3, 4, 5])\n", "print(\"On the GPU: \", arr_gpu)\n", "print(type(arr_gpu))" ] }, { "cell_type": "markdown", "id": "e4d08c51-65a1-471f-841d-418ad0df592c", "metadata": {}, "source": [ "You can also create multi-dimensional arrays:" ] }, { "cell_type": "code", "execution_count": 4, "id": "693b52b5-0b94-464d-b3bd-1c4a53b4f17d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "On the CPU: [[0. 0. 0. 0.]\n", " [0. 0. 0. 0.]\n", " [0. 0. 0. 0.]]\n", "\n" ] } ], "source": [ "# create a 2D array of zeros with 3 rows and 4 columns\n", "arr_cpu = np.zeros((3, 4))\n", "print(\"On the CPU: \", arr_cpu)\n", "print(type(arr_cpu))" ] }, { "cell_type": "code", "execution_count": 5, "id": "9845d93b-0d04-450b-ae68-47fc911f339d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "On the GPU: [[0. 0. 0. 0.]\n", " [0. 0. 0. 0.]\n", " [0. 0. 0. 
0.]]\n", "\n" ] } ], "source": [ "arr_gpu = cp.zeros((3, 4))\n", "print(\"On the GPU: \", arr_gpu)\n", "print(type(arr_gpu))" ] }, { "cell_type": "markdown", "id": "266ab29b-d11f-419d-b52b-9b6be5638945", "metadata": {}, "source": [ "As we can see in the above examples, CuPy arrays look just like NumPy arrays, except that CuPy arrays are stored on the GPU, whereas NumPy arrays are stored on the CPU." ] }, { "cell_type": "markdown", "id": "5398e305-063a-4b15-b259-5eddf29c8cf9", "metadata": {}, "source": [ "### Basic Operations \n", "CuPy provides equivalents for many common NumPy functions, although not all. Most of CuPy's functions have the same call signature as their NumPy counterparts. See the API reference for the supported subset of the NumPy API.\n", "\n", "| NumPy | CuPy |\n", "| :--- | :--- |\n", "| numpy.identity | cupy.identity |\n", "| numpy.matmul | cupy.matmul |\n", "| numpy.nan_to_num | cupy.nan_to_num |\n", "| numpy.zeros | cupy.zeros |\n", "| numpy.ones | cupy.ones |\n", "| numpy.shape | cupy.shape |\n", "| numpy.reshape | cupy.reshape |\n", "| numpy.tensordot | cupy.tensordot |\n", "| numpy.transpose | cupy.transpose |\n", "| numpy.fft.fft | cupy.fft.fft |\n", "\n", "CuPy also provides equivalent functions for some SciPy functions, but its SciPy coverage is less extensive than its NumPy coverage.\n", "\n", "See [here](https://docs.cupy.dev/en/stable/reference/comparison.html) for a full list of CuPy's NumPy and SciPy equivalent functions.\n", "\n", "\n", "[CuPy API Reference](https://docs.cupy.dev/en/stable/reference/index.html)" ] }, { "cell_type": "code", "execution_count": 6, "id": "0850adf9-0c24-4687-b8de-1b7da734347e", "metadata": {}, "outputs": [], "source": [ "# NumPy: Create an array\n", "numpy_a = np.array([1, 2, 3, 4, 5])\n", "\n", "# CuPy: Create an array\n", "cupy_a = cp.array([1, 2, 3, 4, 5])" ] }, { "cell_type": "markdown", "id": "45fe9f2a-e00f-4cfa-b0a5-eb5a5682e743", "metadata": {}, "source": [ "Basic arithmetic operations are written identically in 
NumPy and CuPy. " ] }, { "cell_type": "code", "execution_count": 7, "id": "f6e7880b-2238-4c3b-a431-157e6c5389dc", "metadata": {}, "outputs": [], "source": [ "# Basic arithmetic operations\n", "numpy_b = numpy_a + 2\n", "cupy_b = cupy_a + 2\n", "\n", "numpy_c = numpy_a * 2\n", "cupy_c = cupy_a * 2\n", "\n", "numpy_d = numpy_a.dot(numpy_a)\n", "cupy_d = cupy_a.dot(cupy_a)\n", "\n", "# Reshaping arrays\n", "numpy_e = numpy_a.reshape(5, 1)\n", "cupy_e = cupy_a.reshape(5, 1)\n", "\n", "# Transposing arrays\n", "numpy_f = numpy_e.T\n", "cupy_f = cupy_e.T\n", "\n", "# More involved example: element-wise exponential normalized by its sum (softmax)\n", "numpy_g = np.exp(numpy_a) / np.sum(np.exp(numpy_a))\n", "cupy_g = cp.exp(cupy_a) / cp.sum(cp.exp(cupy_a))" ] }, { "cell_type": "markdown", "id": "9f25ee88-1adf-45fd-8b24-04e30fe4488f", "metadata": {}, "source": [ "### Data Transfer\n", "\n", "#### Data Transfer to a Device\n", "`cupy.asarray()` can be used to copy a NumPy array to the current device (GPU)." ] }, { "cell_type": "code", "execution_count": 8, "id": "20bd69c4-5a8b-4147-9169-efc65a49b5e4", "metadata": {}, "outputs": [], "source": [ "# Move data to GPU\n", "arr_gpu = cp.asarray(arr_cpu)" ] }, { "cell_type": "markdown", "id": "39ccf012-f467-49f4-99cd-0eea489e21a0", "metadata": {}, "source": [ "#### Data Transfer to the Host\n", "\n", "Moving a device array to the host (i.e. 
CPU) can be done with `cupy.asnumpy()` as follows:" ] }, { "cell_type": "code", "execution_count": 9, "id": "2e557105-755f-48ec-977e-bedee81b99c9", "metadata": {}, "outputs": [], "source": [ "# Move data back to host\n", "arr_cpu = cp.asnumpy(arr_gpu)" ] }, { "cell_type": "markdown", "id": "30386bbc-26b0-4afd-904b-30bb34d80d6a", "metadata": {}, "source": [ "We can also use `cupy.ndarray.get()`:" ] }, { "cell_type": "code", "execution_count": 10, "id": "bc09a284-dbd1-4f30-b262-0761b2832bfa", "metadata": {}, "outputs": [], "source": [ "arr_cpu = arr_gpu.get()" ] }, { "cell_type": "markdown", "id": "46dfb920-eb81-4cd1-b407-00099b76f633", "metadata": { "tags": [] }, "source": [ "### Device Information \n", "CuPy introduces the concept of a *current* device, which represents the default GPU device for array allocation, manipulation, calculations, and other operations. \n", "\n", "The `cupy.ndarray.device` attribute can be used to determine the device on which a CuPy array is allocated: " ] }, { "cell_type": "code", "execution_count": 11, "id": "114120c2-99c1-4f0f-9ad8-40486dfff4e5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cupy_g.device" ] }, { "cell_type": "markdown", "id": "e6310585-8dbe-4cc4-a235-6694db49d44a", "metadata": {}, "source": [ "To obtain the total number of accessible devices, use the `cupy.cuda.runtime.getDeviceCount()` function." ] }, { "cell_type": "code", "execution_count": 12, "id": "e808fa97-7360-4f4a-b239-12d6a3cacbaf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cp.cuda.runtime.getDeviceCount()" ] }, { "cell_type": "markdown", "id": "983b94ba-8127-461c-89cb-651fe123ccbb", "metadata": {}, "source": [ "By default, code runs on device 0, but we can allocate arrays on other devices with CuPy using `cp.cuda.Device()`. 
This capability becomes important when your code is designed to harness the power of multiple GPUs.\n", "\n", "If you want to switch to a different GPU device, you can do so with the `cp.cuda.Device` context manager. For example, the following creates an array on the device with ID 2 (device IDs are zero-based, so this requires at least three GPUs): \n", "\n", "``` python \n", "with cp.cuda.Device(2):\n", "    x_on_gpu2 = cp.array([1, 2, 3, 4, 5])\n", "```\n", "\n", "There is no need for explicit device switching when only one device is available." ] }, { "cell_type": "markdown", "id": "747151e6-dc5f-4444-a906-528d1066a1dd", "metadata": {}, "source": [ "## CuPy vs NumPy: Speed Comparison\n", "\n", "Now that we are familiar with CuPy, let's explore the performance improvements that CuPy can provide in comparison to NumPy for different data sizes. \n", "\n", "First, we look at matrix multiplication for two 3000x3000 arrays." ] }, { "cell_type": "code", "execution_count": 13, "id": "1545d1e5-3ae8-422b-95b5-cd88e7eb64e7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NumPy time: 0.7095739841461182 seconds\n", "CuPy time: 0.6216685771942139 seconds\n", "CuPy provides a 1.14 x speedup over NumPy.\n" ] } ], "source": [ "import time\n", "\n", "# create two 3000x3000 matrices\n", "n = 3000\n", "\n", "a_np = np.random.rand(n, n)\n", "b_np = np.random.rand(n, n)\n", "\n", "a_cp = cp.asarray(a_np)\n", "b_cp = cp.asarray(b_np)\n", "\n", "# perform matrix multiplication with NumPy and time it\n", "start_time = time.time()\n", "c_np = np.matmul(a_np, b_np)\n", "end_time = time.time()\n", "\n", "numpy_time = end_time - start_time\n", "print(\"NumPy time:\", numpy_time, \"seconds\")\n", "\n", "# perform matrix multiplication with CuPy and time it\n", "start_time = time.time()\n", "c_cp = cp.matmul(a_cp, b_cp)\n", "cp.cuda.Stream.null.synchronize() # wait for GPU computation to finish\n", "end_time = time.time()\n", "\n", "cupy_time = end_time - start_time\n", "\n", "print(\"CuPy time:\", cupy_time, 
\"seconds\")\n", "print(\"CuPy provides a\", round(numpy_time / cupy_time, 2), \"x speedup over NumPy.\")" ] }, { "cell_type": "markdown", "id": "a1cb881d-9ef1-4fbb-8044-d5bd4c8cb8b6", "metadata": {}, "source": [ "Now, let's run the same CuPy operation again:" ] }, { "cell_type": "code", "execution_count": 14, "id": "a83b1fd5-7896-49ed-9e64-74bcd1417c2c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CuPy time: 0.01408529281616211 seconds\n", "CuPy provides a 50.38 x speedup over NumPy.\n" ] } ], "source": [ "# perform matrix multiplication with CuPy and time it\n", "start_time = time.time()\n", "c_cp = cp.matmul(a_cp, b_cp)\n", "cp.cuda.Stream.null.synchronize() # wait for GPU computation to finish\n", "end_time = time.time()\n", "\n", "cupy_time = end_time - start_time\n", "\n", "print(\"CuPy time:\", cupy_time, \"seconds\")\n", "print(\"CuPy provides a\", round(numpy_time / cupy_time, 2), \"x speedup over NumPy.\")" ] }, { "cell_type": "markdown", "id": "ca229603-89a0-49ca-8920-b40d29a2b703", "metadata": {}, "source": [ "### What happened? Why CuPy is faster the second time?\n", "When running these functions for the first time, you may experience a brief pause. This occurs as CuPy compiles the CUDA functions for the first time and cached them on disk for future use.\n" ] }, { "cell_type": "markdown", "id": "662bb0c3-4051-4125-b801-173b8b3c30b5", "metadata": {}, "source": [ "Now, let's make the same comparison with different array sizes." ] }, { "cell_type": "markdown", "id": "29b798e6-4c8b-44c6-86df-b92abdb0a683", "metadata": {}, "source": [ "We can use the following function to find the size of a variable on memory. 
" ] }, { "cell_type": "code", "execution_count": 15, "id": "e4fdad11-9e3b-4f65-9dce-9ffbd33dc419", "metadata": {}, "outputs": [], "source": [ "# Define function to display variable size in MB\n", "import sys\n", "\n", "\n", "def var_size(in_var):\n", " result = sys.getsizeof(in_var) / 1e6\n", " print(f\"Size of variable: {result:.2f} MB\")" ] }, { "cell_type": "code", "execution_count": 33, "id": "bba9681e-ca7b-486c-92c8-1a79434ba0da", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "n = 100\n", "Size of variable: 0.08 MB\n", "CuPy provides a 6.45 x speedup over NumPy.\n", "\n", "n = 200\n", "Size of variable: 0.32 MB\n", "CuPy provides a 1.28 x speedup over NumPy.\n", "\n", "n = 500\n", "Size of variable: 2.00 MB\n", "CuPy provides a 9.83 x speedup over NumPy.\n", "\n", "n = 1000\n", "Size of variable: 8.00 MB\n", "CuPy provides a 41.17 x speedup over NumPy.\n", "\n", "n = 2000\n", "Size of variable: 32.00 MB\n", "CuPy provides a 72.55 x speedup over NumPy.\n", "\n", "n = 5000\n", "Size of variable: 200.00 MB\n", "CuPy provides a 77.9 x speedup over NumPy.\n", "\n", "n = 10000\n", "Size of variable: 800.00 MB\n", "CuPy provides a 80.68 x speedup over NumPy.\n", "\n" ] } ], "source": [ "speed_ups = []\n", "arr_sizes = []\n", "sizes = [100, 200, 500, 1000, 2000, 5000, 10000]\n", "for n in sizes:\n", " print(\"n =\", n)\n", "\n", " # create two nxn matrices\n", " a_np = np.random.rand(n, n)\n", " b_np = np.random.rand(n, n)\n", "\n", " a_cp = cp.asarray(a_np)\n", " b_cp = cp.asarray(b_np)\n", "\n", " arr_size = a_cp.nbytes / 1e6\n", " print(f\"Size of variable: {arr_size:.2f} MB\")\n", "\n", " # perform matrix multiplication with NumPy and time it\n", " start_time = time.time()\n", " c_np = np.matmul(a_np, b_np)\n", " end_time = time.time()\n", " numpy_time = end_time - start_time\n", "\n", " # perform matrix multiplication with CuPy and time it\n", " start_time = time.time()\n", " c_cp = cp.matmul(a_cp, b_cp)\n", " 
cp.cuda.Stream.null.synchronize() # wait for GPU computation to finish\n", " end_time = time.time()\n", " cupy_time = end_time - start_time\n", "\n", " speed_up = round(numpy_time / cupy_time, 2)\n", "\n", " speed_ups.append(speed_up)\n", " arr_sizes.append(arr_size)\n", " # print the speedup\n", " print(\"CuPy provides a\", speed_up, \"x speedup over NumPy.\\n\")" ] }, { "cell_type": "markdown", "id": "fc3dd3c8-032f-437b-a6d0-7dae7e55b73c", "metadata": {}, "source": [ "We can also create a plot of data size vs. speed-ups:" ] }, { "cell_type": "code", "execution_count": 36, "id": "fd5dbadf-8286-464d-880c-c25297ed310d", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.figure(figsize=(5, 5))\n", "plt.plot(sizes, speed_ups, marker=\"o\")\n", "plt.xlabel(\"Matrix size\")\n", "plt.ylabel(\"Speedup (CuPy time / NumPy time)\")\n", "# plt.xticks(sizes)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "93da5acf-ba39-4617-8246-ac7f7b3fd8df", "metadata": {}, "source": [ "```{note}\n", "As we can see above, GPUs computations can be slower than CPUs!\n", "```\n", "\n", "There are several reasons for this: \n", " \n", "* The size of our arrays: The GPU's performance relies on parallelism, processing thousands of values simultaneously. To fully leverage the GPU's capabilities, we require a significantly larger array. As we see in the above example, for bigger matrix size we see more speed-ups. \n", "\n", "* The simplicity of our calculation: Transferring a calculation to the GPU involves considerable overhead compared to executing a function on the CPU. If our calculation lacks a sufficient number of mathematical operations (known as \"arithmetic intensity\"), the GPU will spend most of its time waiting for data movement.\n", "\n", "* Data copying to and from the GPU impacts performance: While including copy time can be realistic for a single function, there are instances where we need to execute multiple GPU operations sequentially. In such cases, it is advantageous to transfer data to the GPU and keep it there until all processing is complete." ] }, { "cell_type": "markdown", "id": "1fc57cc2-d237-49e9-bf4f-ef74511673d7", "metadata": {}, "source": [ "Congratulations! You have now uncovered the capabilities of CuPy. It's time to unleash its power and accelerate your own code by replacing NumPy with CuPy wherever applicable and appropriate. In the next chapters we will delve into Cupy Xarray capabilities. 
\n", "\n", "## Summary\n", "\n", "In this notebook, we have learned about:\n", "\n", "* CuPy basics\n", "* Data transfer between device and host\n", "* Performance of CuPy vs. NumPy on different array sizes \n", "\n", "```{seealso}\n", "\n", "[CuPy Homepage](https://cupy.dev/) \n", "[CuPy GitHub](https://github.com/cupy/cupy) \n", "[CuPy User Guide](https://docs.cupy.dev/en/stable/user_guide/index.html)\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:gpu-xdev]", "language": "python", "name": "conda-env-gpu-xdev-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }