HPC Access

Preliminaries

  1. Fill out the Google form provided.
  2. After validation, you will receive an email containing the port number and password for your Jupyter notebook.
    • Note: Validation is done manually. Please do not fill out the form twice.
  3. Open your Web browser and go to:

    https://hpc.eee.upd.edu.ph:<port-number>
    (e.g.: https://hpc.eee.upd.edu.ph:9999)


    Note: If you receive the “This site can’t be reached” error message, make sure you are using https and not http.

  4. Login using the password provided.
  5. On the Launcher page, you can see the different utility tools, such as Terminal, Slurm Queue Manager, and Jupyter Notebooks.


    Note: You can go to the Launcher page again by navigating to File > New Launcher. You can also open it using the corresponding keyboard shortcut.

Creating and using environments using Conda

Creating and using a virtual environment is useful if you are working on a project that needs a specific version of Python or other software. Conda is a tool for creating isolated Python environments (other tools include virtualenv and pipenv). It stores specific versions of packages and their dependencies, so instead of loading many different modules with different versions, you can create and activate an environment that holds all the dependencies a package requires.
Creating a virtual environment is also required if you want to run tasks that use a GPU (covered later).

Preliminaries

  1. From the Launcher page, select Terminal to open a command line text interface.
  2. Check if conda is running using the command:
    conda info


Create an environment using Conda

  1. On the Terminal, type conda create -n cuda-test, where cuda-test is the name of the virtual environment. Press [y] to proceed.
  2. List all the available environments using
    conda info --envs
  3. Activate the newly created environment using conda activate cuda-test. Notice the change in the bash prompt, which normally shows the name of the active environment.

  4. To deactivate the current environment:
    conda deactivate 
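
If you are not sure which environment is active, the check can also be done from Python. The following is a minimal sketch, not part of the original instructions: active_conda_env is a hypothetical helper name, and it relies on the CONDA_DEFAULT_ENV variable that conda sets when an environment is activated.

```python
import os
import sys


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None.

    `environ` defaults to the real process environment; passing a dict
    makes the function easy to try out without activating anything.
    """
    environ = os.environ if environ is None else environ
    # conda sets CONDA_DEFAULT_ENV when an environment is activated.
    return environ.get("CONDA_DEFAULT_ENV") or None


if __name__ == "__main__":
    name = active_conda_env()
    if name is None:
        print("No conda environment is active.")
    else:
        print(f"Active conda environment: {name}")
    # sys.prefix is the directory of the environment this interpreter runs from.
    print(f"Interpreter prefix: {sys.prefix}")
```

After conda activate cuda-test, the script should report cuda-test and a prefix inside that environment's directory.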

Installing packages inside an environment

  1. Activate the environment you want to use. 
  2. For this tutorial, I will install PyTorch and CUDA toolkit 11.3 on cuda-test (your software and packages might vary; search online for the conda installation of your packages, or refer to https://repo.anaconda.com/pkgs/), using
    conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch 
  3. Wait for the installation to finish. You can see the packages that are in the environment using
    conda list
  4. Validate the PyTorch version.
    In the terminal with your environment activated, type: python
    >>> import torch
    >>> print(torch.__version__)
    1.12.1
    >>> exit()
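
The same check can be saved as a small script that also reports the CUDA side of the installation. This is only a sketch under assumptions: torch_report is a hypothetical helper name, and the values printed depend on your own environment. On a machine where no GPU is visible, cuda_available is expected to be False even if the CUDA toolkit is installed.

```python
import importlib.util


def torch_report():
    """Collect basic facts about the PyTorch install, if there is one."""
    # find_spec lets us test for the package without raising ImportError.
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch

    return {
        "installed": True,
        "version": torch.__version__,
        # CUDA version PyTorch was built against; None for CPU-only builds.
        "cuda_build": torch.version.cuda,
        # True only if a usable GPU is visible to this process.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in torch_report().items():
        print(f"{key}: {value}")
```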

[Optional] Use an environment as a Jupyter notebook kernel

If you want to use a Jupyter notebook to develop your project, follow these steps to add your environment as a kernel for the notebook.

  1. Activate the environment you want to use. 
  2. Install ipykernel on your environment:
    conda install ipykernel
  3. Add the virtual environment to Jupyter:
    python -m ipykernel install --user --name='environment_name'
  4. Refresh your Web browser and go to the Launcher. You should see a new Jupyter Notebook entry for your environment.
  5. Validate the PyTorch version. In a Jupyter notebook that uses your environment as its kernel (cuda-test in this tutorial),
    type the following:
    import torch
    print(torch.__version__)

  6. IMPORTANT! Commands run in a Jupyter notebook will NOT use GPU/CPU resources from the compute nodes. To use those resources, you have to convert the notebook into a Python script and submit it to SLURM (covered later). You can convert a notebook to Python using this command on the Terminal:
    jupyter nbconvert Untitled1.ipynb --to python
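
To see roughly what the conversion does, note that an .ipynb file is a JSON document whose code cells carry their source text. The sketch below is a dependency-free illustration of that idea and is not how nbconvert itself is implemented; load_notebook and notebook_code are hypothetical names, and for real work the nbconvert command above is the tool to use.

```python
import json


def load_notebook(path):
    """Parse an .ipynb file (plain JSON) into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def notebook_code(notebook):
    """Join the source of every code cell in a parsed notebook."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source.rstrip("\n"))
    return "\n\n".join(chunks) + "\n"


if __name__ == "__main__":
    # A tiny in-memory notebook standing in for a real file.
    example = {
        "cells": [
            {"cell_type": "markdown", "source": ["# Version check"]},
            {"cell_type": "code",
             "source": ["import torch\n", "print(torch.__version__)"]},
        ]
    }
    print(notebook_code(example), end="")
```

For a real file, notebook_code(load_notebook("Untitled1.ipynb")) would return the code cells as one string. Notebook-only syntax such as magics is not handled here.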

Running non-GPU tasks

By default, the provided JupyterLab can perform non-GPU tasks, which is good for single-core jobs or for debugging your code. In other words, you can perform tasks such as pre-processing, development, and post-processing of your project on its own, without submitting any jobs to SLURM. However, if you need additional CPUs for multicore tasks, you can submit a SLURM job request for additional CPU resources (see #SBATCH --cpus-per-task=1 below).
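
When a job does request several CPUs, the script still has to make use of them. One possible approach, sketched here under assumptions, is to size a worker pool from the SLURM_CPUS_PER_TASK variable that SLURM sets inside a job when --cpus-per-task is given. The names allotted_cpus and square are hypothetical, and the squaring task is only a stand-in for real work.

```python
import os
from multiprocessing import Pool


def allotted_cpus(environ=None):
    """Return the CPU count granted by SLURM, falling back to 1."""
    environ = os.environ if environ is None else environ
    try:
        # Set by SLURM inside a job when --cpus-per-task is specified.
        return max(1, int(environ.get("SLURM_CPUS_PER_TASK", "1")))
    except ValueError:
        return 1  # unexpected value: stay on a single core


def square(x):
    """Stand-in for a CPU-bound task."""
    return x * x


if __name__ == "__main__":
    workers = allotted_cpus()
    print(f"Using {workers} worker process(es)")
    with Pool(processes=workers) as pool:
        print(pool.map(square, range(8)))
```

Outside of a SLURM job the variable is not set, so the script falls back to one worker, which matches the single-core use described above.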

Submitting and running GPU tasks via SLURM

For tasks that need a GPU for acceleration, such as training and inference, a job has to be submitted to the shared GPU cluster via SLURM. Note that jobs are processed on a first-come, first-served basis. As this is a shared resource, please expect waiting/queue times.

In this tutorial, we will submit a job to one of the GPU servers on the cluster.

  1. From Launcher, select Text File, then rename the file to: test.py
    In test.py, enter:
    import torch
    print(torch.__version__)
  2. Create an environment (see above). Then create another new file and rename it to: job.slurm

    In job.slurm, enter:

    #!/bin/bash

    #SBATCH --output=notMNIST.out
    #SBATCH --gres=gpu:1
    #SBATCH --cpus-per-task=1
    #SBATCH --partition=samsung

    # Activate your environment
    source /opt/miniconda3/etc/profile.d/conda.sh
    conda activate cuda-test

    # Insert your code here.
    srun python test.py

    # [Optional] You can run other commands here like compiling your code, etc.
    srun nvidia-smi
    srun sleep 10

With this job.slurm, you will be submitting a job to partition samsung, requesting 1 GPU and 1 CPU, and the output file will be notMNIST.out. You can see the available partitions by entering sinfo on the Terminal. To see how many CPUs are available in your partition, run: scontrol show partition <partition-name>.

Note: You may not have permission to use partitions other than samsung or the partition assigned to you.

Note: It is important to activate your environment first before running your code.
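
If you submit many similar jobs, writing job.slurm by hand gets repetitive. The following sketch generates the same kind of script from a few parameters. render_job_script and its argument names are hypothetical, not part of any SLURM tooling; the conda.sh path is the one used in this tutorial and may differ on other systems.

```python
def render_job_script(script, env_name, partition, output, gpus=1, cpus=1):
    """Return the text of a SLURM batch script like the one in this tutorial."""
    lines = ["#!/bin/bash", f"#SBATCH --output={output}"]
    if gpus > 0:
        # Only request GPUs when the job actually needs them.
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    lines += [
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --partition={partition}",
        "",
        "# Activate the environment before running any code",
        "source /opt/miniconda3/etc/profile.d/conda.sh",  # path from this tutorial
        f"conda activate {env_name}",
        "",
        f"srun python {script}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_job_script(
        script="test.py",
        env_name="cuda-test",
        partition="samsung",
        output="notMNIST.out",
    )
    print(text, end="")
    # To save it for submission, write the text to a file, for example:
    # with open("job.slurm", "w") as handle:
    #     handle.write(text)
```

Keeping the activation lines ahead of the srun line preserves the ordering that the note above asks for.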

  • Submit your job to SLURM. You can submit it in two ways:
    1. Using Terminal, type:
      sbatch job.slurm

      If your job was submitted successfully, you should see Submitted batch job followed by a job ID.

      If you see sbatch: error: QOSMaxSubmitJobPerUserLimit, it means you have reached the maximum number of submitted jobs per user. Cancel a currently running/submitted job before submitting a new one.

      If you see sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified, you might be using the wrong partition. Change the partition value in your job.slurm.

    2. You can also use the SLURM queue manager utility on the Launcher. Navigate to Submit Jobs and choose job.slurm from your current directory. Errors and notifications can be viewed on Job Notifications while currently running jobs can be viewed on Slurm Queue.
  • To view the status of your running job:
    1. From the Terminal, type squeue. Status R means your job is currently running; PD means it is waiting for available resources. Take note of your JobID.
    2. You can also use the SLURM queue manager. Use the Update Queue button to check the queue list.
  • After your job runs, a file named notMNIST.out should appear.
  • To cancel currently running/submitted jobs, you can use Kill Job(s) in the Slurm queue manager, or type on the Terminal: scancel JobID
  • Note: You can only kill jobs you submitted.
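
If you script your submissions, the messages described above can be read programmatically. This is a minimal sketch with assumptions: parse_job_id and describe_state are hypothetical helper names, the job ID 12345 is made up for illustration, and only the two status codes mentioned in this guide are mapped.

```python
import re

# Only the two squeue status codes covered in this guide.
STATE_MEANINGS = {
    "R": "running",
    "PD": "pending, waiting for available resources",
}


def parse_job_id(sbatch_output):
    """Extract the job ID from sbatch's success message, or return None."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return int(match.group(1)) if match else None


def describe_state(code):
    """Translate a squeue status code into plain words."""
    return STATE_MEANINGS.get(code, f"other state ({code})")


if __name__ == "__main__":
    # Example strings; 12345 is a made-up job ID.
    print(parse_job_id("Submitted batch job 12345"))
    print(parse_job_id("sbatch: error: QOSMaxSubmitJobPerUserLimit"))
    print(describe_state("PD"))
```

The job ID returned by parse_job_id is the value you would pass to scancel.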