HPC Access
Preliminaries
- Fill out the Google form provided.
- After validation, you will receive an email containing the port number and password for your Jupyter notebook.
- Note: Validation is done manually. Please do not fill out the form twice.
- Open your Web browser and go to:
https://hpc.eee.upd.edu.ph:<port-number>
(e.g.: https://hpc.eee.upd.edu.ph:9999)
Note: If you get the “This site can’t be reached” error message, make sure you are using https and not http.
- Log in using the password provided.
- On the Launcher page, you can see the different utility tools, such as Terminal, Slurm Queue Manager, and Jupyter Notebooks.
Note: You can return to the Launcher page by navigating to File > New Launcher. You can also open it using the corresponding keyboard shortcut.
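The address format described above can be captured in a small helper. This is only a minimal sketch: the function name notebook_url is hypothetical, and the host name is the one given in this guide.

```python
HOST = "hpc.eee.upd.edu.ph"  # host name taken from this guide


def notebook_url(port: int) -> str:
    """Build the Jupyter address for an assigned port, always using https."""
    if not 1 <= port <= 65535:
        raise ValueError(f"not a valid TCP port: {port}")
    return f"https://{HOST}:{port}"


if __name__ == "__main__":
    # 9999 is the example port used in this guide; use the port from your email.
    print(notebook_url(9999))
```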
Creating and using environments using Conda
Creating and using a virtual environment is useful if you are working on a project that needs a specific version of Python or other software. Conda is a tool for creating isolated Python environments (other such tools include virtualenv and pipenv). It stores specific versions of packages and their dependencies, so instead of loading many different modules with different versions, you can create and activate an environment that holds all the dependencies a package requires.
Creating a virtual environment is also a requirement if you want to run tasks that use a GPU (covered later in this guide).
Preliminaries
- From the Launcher page, select Terminal to open a command line text interface.
- Check if conda is running using the command:
conda info
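The same check can be done from Python, for example at the start of a setup script. This is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes conda is expected to be on PATH as in the Terminal provided here.

```python
import shutil
import subprocess


def conda_available() -> bool:
    """Return True if a conda executable can be found on PATH."""
    return shutil.which("conda") is not None


def conda_info() -> str:
    """Run `conda info` and return its text output, or a notice if conda is missing."""
    if not conda_available():
        return "conda not found on PATH"
    result = subprocess.run(
        ["conda", "info"], capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    print(conda_info())
```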
Create an environment using Conda
- On the Terminal, type
conda create --prefix ~/<Environment-Name>
where Environment-Name is the name of the virtual environment. Example:
conda create --prefix ~/cuda-test
Press [y] to proceed.
- List all the available environments using
conda info --envs
- Activate the newly created environment using conda activate /home/<Username>/<Environment-Name>. Notice the changes in the bash environment. Username is the name of the current user, which appears before the ‘@’. In the examples in this guide, marlo is the username.
For example, to activate cuda-test for username marlo,
conda activate /home/marlo/cuda-test
- To deactivate the current environment:
conda deactivate
- To remove an environment:
conda remove --prefix /home/marlo/cuda-test --all
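Because every command in this section depends on the same environment path, it can help to build that path once. The sketch below is illustrative only: the helper names are hypothetical, and it assumes home directories live under /home/<Username>, as in the examples above.

```python
from pathlib import PurePosixPath
from typing import List


def env_prefix(username: str, env_name: str) -> str:
    """Return the environment path in the form /home/<Username>/<Environment-Name>."""
    return str(PurePosixPath("/home") / username / env_name)


def lifecycle_commands(username: str, env_name: str) -> List[str]:
    """List the conda commands from this section, in the order they are used."""
    prefix = env_prefix(username, env_name)
    return [
        f"conda create --prefix {prefix}",
        "conda info --envs",
        f"conda activate {prefix}",
        "conda deactivate",
        f"conda remove --prefix {prefix} --all",
    ]


if __name__ == "__main__":
    for command in lifecycle_commands("marlo", "cuda-test"):
        print(command)
```

The function only prints the commands; you still type them in the Terminal yourself.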
Installing packages inside an environment
- Activate the environment you want to use.
- For this tutorial, I will install PyTorch and CUDA toolkit 11.3 in cuda-test. Your software and packages may vary: search online for the conda installation instructions for your packages, or refer to https://repo.anaconda.com/pkgs/. The command used here is
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
- Wait for the installation to finish. You can see the packages that are in the environment using
conda list
- Validate the PyTorch version.
In the terminal where your environment is active, type python, then enter:
>>> import torch
>>> print(torch.__version__)
1.12.1
>>> exit()
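A slightly fuller check can also report which CUDA version the installed PyTorch was built against. This is a minimal sketch with a hypothetical function name. It does not fail if PyTorch is missing, and the GPU flag is only expected to be meaningful inside a job running on a GPU node.

```python
def torch_summary() -> dict:
    """Collect PyTorch version details without failing if PyTorch is absent."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        # CUDA version PyTorch was built against; None for CPU-only builds.
        "cuda_build": torch.version.cuda,
        # True only if a CUDA device is visible to this process.
        "gpu_visible": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in torch_summary().items():
        print(f"{key}: {value}")
```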
[optional] Use environment as Jupyter notebook kernel
As of Mar 23, 2023, there is a bug that prevents installed packages from appearing in a Jupyter notebook kernel. For now, please skip this part of the guide.
If you want to use a Jupyter notebook for developing your project, follow these steps to add your environment as a kernel for the Jupyter notebook.
- Activate the environment you want to use.
- Install ipykernel on your environment:
conda install ipykernel
- Add the virtual environment to Jupyter:
python -m ipykernel install --user --name='environment_name'
- Refresh your Web browser. Go to the Launcher; you should see the new Jupyter Notebook for your environment.
- Validate the PyTorch version. In the Jupyter notebook that uses your environment (cuda-test in this example), type the following:
import torch
print(torch.__version__)
- IMPORTANT! Commands run in a Jupyter notebook will NOT use GPU/CPU resources from the compute nodes. To use those resources, you have to convert the notebook into a Python script and submit it to SLURM (covered later in this guide). You can convert a notebook to a Python script by running this command in the Terminal:
jupyter nbconvert Untitled1.ipynb --to python
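To see roughly what ends up in the converted script, the sketch below pulls the code cells out of a notebook file using only the standard library. It is not how nbconvert is implemented and it skips things the real tool handles, such as markdown cells. The function name is hypothetical, and jupyter nbconvert remains the route this guide recommends.

```python
import json


def code_cells_to_script(notebook_json: str) -> str:
    """Join the source of every code cell in a notebook into one script."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source.rstrip("\n"))
    return "\n\n".join(chunks) + "\n"


if __name__ == "__main__":
    example = json.dumps({
        "cells": [
            {"cell_type": "markdown", "source": ["# Version check"]},
            {"cell_type": "code",
             "source": ["import torch\n", "print(torch.__version__)"]},
        ]
    })
    print(code_cells_to_script(example), end="")
```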
Running non-GPU tasks
By default, the provided Jupyter lab can run non-GPU tasks and is suited to single-core jobs or debugging your code. In other words, on its own it can handle tasks such as pre-processing, development, and post-processing without submitting any jobs to SLURM. However, if you need additional CPUs for multicore tasks, you can submit a SLURM job requesting more CPU resources (see #SBATCH --cpus-per-task=1 below).
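When a job does request more CPUs, the script still has to use them. The sketch below sizes a worker pool from SLURM_CPUS_PER_TASK, the environment variable SLURM sets when --cpus-per-task is given, and falls back to one worker elsewhere. The function names and the toy workload are hypothetical.

```python
import os
from multiprocessing import Pool


def allocated_cpus(default: int = 1) -> int:
    """Return the CPU count SLURM allocated to this task, or a default outside SLURM."""
    value = os.environ.get("SLURM_CPUS_PER_TASK", "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    return default


def square(x: int) -> int:
    """Toy workload standing in for a real per-item computation."""
    return x * x


if __name__ == "__main__":
    # One worker per allocated CPU, so the job does not oversubscribe its allocation.
    with Pool(processes=allocated_cpus()) as pool:
        print(pool.map(square, range(8)))
```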
Submitting and running GPU tasks via SLURM
For tasks that need a GPU for acceleration, such as training and inference, a job has to be submitted to the shared GPU cluster via SLURM. Note that jobs are processed on a first-come, first-served basis. Because this is a shared resource, please expect waiting/queue times.
In this tutorial, we will submit a job to one of the GPU servers on the cluster.
- From the Launcher, select Text File, then rename the file to: test.py
In test.py, enter:
import torch
print(torch.__version__)
- Create an environment (see above). Then create another new file and rename it to: job.slurm
In job.slurm, enter:
#!/bin/bash
#SBATCH --output=notMNIST.out
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=1
#SBATCH --partition=samsung

# Activate your environment
source /opt/miniconda3/etc/profile.d/conda.sh
conda activate /home/marlo/cuda-test

# Insert your code here.
srun python test.py

# [Optional] You can run other commands here like compiling your code, etc.
srun nvidia-smi
srun sleep 10
With this job.slurm, you will submit a job to partition samsung, requesting 1 GPU and 1 CPU, and the output will be written to notMNIST.out. You can list the available partitions by entering sinfo in the Terminal. To see how many CPUs a partition has, run: scontrol show partition <partition-name>.
Note: You may not have permission to use partitions other than samsung or the partition assigned to you.
Note: It is important to activate your environment before running your code.
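If you submit many similar jobs, the job file can be generated instead of edited by hand. The following is a minimal sketch with a hypothetical function name. The default values (partition, output file, conda.sh path) are simply the ones shown in this guide and may differ for your account.

```python
def render_job_script(
    env_path: str,
    script: str = "test.py",
    partition: str = "samsung",
    gpus: int = 1,
    cpus: int = 1,
    output: str = "notMNIST.out",
    conda_sh: str = "/opt/miniconda3/etc/profile.d/conda.sh",
) -> str:
    """Return the text of a job.slurm file like the one shown in this guide."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --output={output}",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --partition={partition}",
        "",
        "# Activate your environment before running any code",
        f"source {conda_sh}",
        f"conda activate {env_path}",
        "",
        "# Run your code",
        f"srun python {script}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the file, then submit it from the Terminal with: sbatch job.slurm
    with open("job.slurm", "w") as handle:
        handle.write(render_job_script("/home/marlo/cuda-test"))
```

Each #SBATCH directive sits on its own line, with comments on separate lines, which is the layout SLURM expects.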
- Submit your job to SLURM. There are two ways to submit:
- Using Terminal, type:
sbatch job.slurm
If your job was submitted successfully, you should see Submitted batch job followed by a job ID.
If you see sbatch: error: QOSMaxSubmitJobPerUserLimit, you have reached the maximum number of submitted jobs per user. Cancel a currently running or submitted job before submitting a new one.
If you see sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified, you might be using the wrong partition. Change the partition value in your job.slurm.
- You can also use the SLURM queue manager utility on the Launcher. Navigate to Submit Jobs and choose job.slurm from your current directory. Errors and notifications can be viewed on Job Notifications while currently running jobs can be viewed on Slurm Queue.
- To view the status of your running job:
- From the Terminal, type squeue. Status R means your job is currently running, PD means it is currently waiting for available resources. Take note of your JobID.
- You can also use the SLURM queue manager. Use the Update Queue button to check the queue list.
- After running your job, a file named notMNIST.out should appear.
- To cancel running or submitted jobs, use the Kill Job(s) option in the SLURM queue manager, or run this in the Terminal: scancel <JobID>
- Note: You can only kill jobs you submitted.
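When these steps are scripted, two small helpers cover the messages described above: one extracts the job ID from the sbatch confirmation line, the other explains the squeue status codes mentioned in this guide. This is a minimal sketch with hypothetical names, and only the R and PD codes from this guide are mapped.

```python
import re
from typing import Optional

# Only the two status codes described in this guide.
STATE_MEANINGS = {
    "R": "running",
    "PD": "pending (waiting for available resources)",
}


def parse_job_id(sbatch_output: str) -> Optional[int]:
    """Return the job ID from a 'Submitted batch job <id>' line, or None if absent."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return int(match.group(1)) if match else None


def describe_state(code: str) -> str:
    """Translate a squeue status code into plain words."""
    return STATE_MEANINGS.get(code, f"other state ({code})")


if __name__ == "__main__":
    job_id = parse_job_id("Submitted batch job 12345")
    print(job_id, describe_state("PD"))
```

The returned job ID is the value to pass to scancel if you need to cancel the job.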