Multi-Host Multi-GPU Monitoring Solution for Labs

· / , , , ·

182

reads

AI TranslationSimplified ChineseEnglish

Key Insights

AI · GEN

Foreword

This blog records the evolution process and lacks reference value. It is recommended to directly check the following projects and deploy quickly using Docker.

agent2

The agent run on the GPU machines to collect the GPU metrics.

Python

dashboard3

The dashboard run on the web server to show the GPU metrics.

TypeScript

Screenshots:

Light	Dark

Old Solution: SSH to Get nvidia-smi Output

My previous front-end experience was limited to generating HTML with Python, so I configured a GPU monitoring solution on my small host:

Use the ssh command to obtain the output of nvidia-smi and parse information such as memory usage from it.
According to the pid of the process occupying the GPU, use the ps command to get the user and command.
Use Python to output the above information as markdown, and then output it as HTML via Markdown.
Configure cron to execute the above steps every minute, and set the web page root in nginx to the directory where the HTML is located.

The corresponding code is as follows:

CodeBlock Loading...

This solution has several obvious drawbacks: low update frequency and complete reliance on backend updates, meaning data is constantly refreshed regardless of whether anyone is accessing it.

New Solution: Front-End and Back-End Separation

I had always wanted to implement a GPU monitoring system with front-end and back-end separation, where each server runs a fastapi that returns the required data upon request. The recently developed NJU Charge gave me the confidence to develop a front end that fetches data from the API and renders it on the page.

FastAPI Backend

Recently, I accidentally discovered that nvitop supports Python calls. I had always thought it could only visualize data via commands.
Great, this makes it much easier to get the required data, and the amount of code is greatly reduced! ( •̀ ω •́ )✧

However, a troublesome issue is that our lab servers are behind a router that I don't control, and only SSH port forwarding is enabled.
Here I chose to use frp to map each server's API port to my small host on campus. Coincidentally, my small host already runs several web services, making it convenient to access the APIs via domain names.

I was being silly earlier; actually, SSH (ssh -fN -L 8000:localhost:8000 user@ip) can be used for port mapping, which allows removing frp-related content from the code and makes it easier to start the web side with Docker.

CodeBlock Loading...

There are three environment variables in the code:

SUBURL: Used to configure the API path, for example, specifying the server name.
FRP_PATH: The path where frp and its configuration are located, used to map the API port to my on-campus small host. If your servers can be accessed directly, you can delete the related functions, change the last line to 0.0.0.0, and then access via IP (or configure a domain name for each server).
PORT: The port where the API is located.

Here I only wrote two interfaces ~~, actually only one is used~~

/count: Returns how many GPUs there are.
/status: Returns specific status information; see the example below for the returned data. However, I also wrote two optional parameters:
- idx: Comma-separated numbers to get the status of specified GPUs.
- process: Used to filter the returned processes. When I use it, I directly set it to C to show only computing tasks.

CodeBlock Loading...

Vue Frontend

Here I took a shortcut ~~, actually because I didn't know how~~, and temporarily copied the UI originally generated with Python.

CodeBlock Loading...

npm run build successfully produced the release files, and configuring nginx's root to that folder completed the task.
The implemented effect: https://nvtop.njucite.cn/
Although the UI is still ugly, at least it can refresh dynamically now, yay!

New UI

~~First, draw the pie here, waiting for me to return after learning. (￣_,￣ )~~

~~More beautiful UI~~ (I'm not sure if this has been achieved; I'm a design waste)
~~Add utilization line chart~~
~~Support dark mode~~

2024/12/27: Completed the above TODOs using Next.js, and also implemented the function to hide some hosts, setting the hidden hosts as cookies for the same state when reopened next time.

2025/03/11: Next.js updated with email login functionality, restricting access to authorized users only.

2025/09/18: Organized the code; now the functionality is basically complete and convenient for others to deploy directly.

Full code available at:

agent2

The agent run on the GPU machines to collect the GPU metrics.

Python

dashboard3

The dashboard run on the web server to show the GPU metrics.

TypeScript

Foreword

This blog records the evolution process and lacks reference value. It is recommended to directly check the following projects and deploy quickly using Docker.

agent2

The agent run on the GPU machines to collect the GPU metrics.

Python

dashboard3

The dashboard run on the web server to show the GPU metrics.

TypeScript

Screenshots:

Light	Dark

Old Solution: SSH to Get nvidia-smi Output

My previous front-end experience was limited to generating HTML with Python, so I configured a GPU monitoring solution on my small host:

Use the ssh command to obtain the output of nvidia-smi and parse information such as memory usage from it.
According to the pid of the process occupying the GPU, use the ps command to get the user and command.
Use Python to output the above information as markdown, and then output it as HTML via Markdown.
Configure cron to execute the above steps every minute, and set the web page root in nginx to the directory where the HTML is located.

The corresponding code is as follows:

PYTHON

# main.py
import subprocess
from copy import deepcopy
import json
from markdown import markdown
import time

from parse import parse, parse_proc
from gen_md import gen_md

num_gpus = {
    "s1": 4,
    "s2": 4,
    "s3": 2,
    "s4": 4,
    "s5": 5,
}


def get1GPU(i, j):
    cmd = ["ssh", "-o", "ConnectTimeout=2", f"s{i}", "nvidia-smi", f"-i {j}"]
    try:
        output = subprocess.check_output(cmd)
    except subprocess.CalledProcessError as e:
        return None, None
    ts = int(time.time())
    output = output.decode("utf-8")
    ret = parse(output)
    processes = deepcopy(ret["processes"])
    ret["processes"] = []
    for pid in processes:
        cmd = [
            "ssh",
            f"s{i}",
            "ps",
            "-o",
            "pid,user:30,command",
            "--no-headers",
            "-p",
            pid[0],
        ]
        output = subprocess.check_output(cmd)
        output = output.decode("utf-8")
        proc = parse_proc(output, pid[0])
        ret["processes"].append(proc)
        ret["processes"][-1]["pid"] = pid[0]
        ret["processes"][-1]["used_mem"] = pid[1]
    return ret, ts


def get_html(debug=False):
    results = {}
    for i in range(1, 6):
        results_per_host = {}
        for j in range(num_gpus[f"s{i}"]):
            ret, ts = get1GPU(i, j)
            if ret is None:
                continue
            results_per_host[f"GPU{j}"] = ret
        results[f"s{i}"] = results_per_host
    md = gen_md(results)

    with open("html_template.html", "r") as f:
        template = f.read()
        html = markdown(md, extensions=["tables", "fenced_code"])
        html = template.replace("{{html}}", html)
        html = html.replace(
            "{{update_time}}", time.strftime("%Y-%m-%d %H:%M", time.localtime())
        )
    if debug:
        with open("results.json", "w") as f:
            f.write(json.dumps(results, indent=2))
        with open("results.md", "w", encoding="utf-8") as f:
            f.write(md)
    with open("index.html", "w", encoding="utf-8") as f:
        f.write(html)


if __name__ == "__main__":
    import sys

    debug = False
    if len(sys.argv) > 1 and sys.argv[1] == "debug":
        debug = True
    get_html(debug)

CodeBlock Loading...

PYTHON

# parse.py
def parse(text: str) -> dict:
    lines = text.split('\n')
    used_mem = lines[9].split('|')[2].split('/')[0].strip()[:-3]
    total_mem = lines[9].split('|')[2].split('/')[1].strip()[:-3]
    temperature = lines[9].split('|')[1].split()[1].replace('C', '')
    used_mem, total_mem, temperature = int(used_mem), int(total_mem), int(temperature)

    processes = []
    for i in range(18, len(lines) - 2):
        line = lines[i]
        if 'xorg/Xorg' in line:
            continue
        if 'gnome-shell' in line:
            continue
        pid = line.split()[4]
        use = line.split()[7][:-3]
        processes.append((pid, int(use)))
    return {
        'used_mem': used_mem,
        'total_mem': total_mem,
        'temperature': temperature,
        'processes': processes
    }

def parse_proc(text: str, pid: str) -> dict:
    lines = text.split('\n')
    for line in lines:
        if not line:
            continue
        if line.split()[0] != pid:
            continue
        user = line.split()[1]
        cmd = ' '.join(line.split()[2:])
        return {
            'user': user,
            'cmd': cmd
        }

CodeBlock Loading...

PYTHON

# gen_md.py
def per_server(server: str, results: dict) -> str:
    md = f'# {server}\n\n'
    for gpu, ret in results.items():
        used, total, temperature = ret['used_mem'], ret['total_mem'], ret['temperature']
        md += f'<div class="oneGPU">\n'
        md += f'    <code>{gpu}: </code>\n'
        md += f'    <div class="g-container" style="display: inline-block;">\n'
        md += f'        <div class="g-progress" style="width: {used/total*100}%;"></div>\n'
        md += f'    </div>\n'
        md += f'    <code>  {used:5d}/{total} MiB  {temperature}℃</code>\n'
        md += '</div>\n'
    md += '\n'
    if any([len(ret['processes']) > 0 for ret in results.values()]):
        md += '\n| GPU | PID | User | Command | GPU Usage |\n'
        md += '| --- | --- | --- | --- | --- |\n'
        for gpu, ret in results.items():
            for proc in ret['processes']:
                md += f'| {gpu} | {proc["pid"]} | {proc["user"]} | {proc["cmd"]} | {proc["used_mem"]} MB |\n'
    md += '\n\n'
    return md

def gen_md(results: dict) -> dict:
    md = ''
    for server, ret in results.items():
        md += per_server(server, ret)
    return md

CodeBlock Loading...

This solution has several obvious drawbacks: low update frequency and complete reliance on backend updates, meaning data is constantly refreshed regardless of whether anyone is accessing it.

New Solution: Front-End and Back-End Separation

FastAPI Backend

PYTHON

# main.py
from fastapi import FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.middleware.gzip import GZipMiddleware
from fastapi.responses import JSONResponse
import uvicorn
from nvitop import Device, bytes2human
import os
import asyncio
from contextlib import asynccontextmanager

suburl = os.environ.get("SUBURL", "")
if suburl != "" and not suburl.startswith("/"):
    suburl = "/" + suburl
frp_path = os.environ.get("FRP_PATH", "/home/peijie/Nvidia-API/frp")
if not os.path.exists(f"{frp_path}/frpc") or not os.path.exists(
    f"{frp_path}/frpc.toml"
):
    raise FileNotFoundError("frpc or frpc.toml not found in FRP_PATH")


@asynccontextmanager
async def run_frpc(app: FastAPI): # frp tunnel to my on-campus small host
    command = [f"{frp_path}/frpc", "-c", f"{frp_path}/frpc.toml"]
    process = await asyncio.create_subprocess_exec(
        *command,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.DEVNULL,
        stdin=asyncio.subprocess.DEVNULL,
        close_fds=True,
    )
    try:
        yield
    finally:
        try:
            process.terminate()
            await process.wait()
        except ProcessLookupError:
            pass


app = FastAPI(lifespan=run_frpc)

app.add_middleware(GZipMiddleware, minimum_size=100)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


@app.get(f"{suburl}/count")
async def get_ngpus(request: Request):
    try:
        ngpus = Device.count()
        return JSONResponse(content={"code": 0, "data": ngpus})
    except Exception as e:
        return JSONResponse(
            content={"code": -1, "data": None, "error": str(e)}, status_code=500
        )


@app.get(f"{suburl}/status")
async def get_status(request: Request):
    try:
        ngpus = Device.count()
    except Exception as e:
        return JSONResponse(
            content={"code": -1, "data": None, "error": str(e)}, status_code=500
        )

    idx = request.query_params.get("idx", None)
    if idx is not None:
        try:
            idx = idx.split(",")
            idx = [int(i) for i in idx]
            for i in idx:
                if i < 0 or i >= ngpus:
                    raise ValueError("Invalid GPU index")
        except ValueError:
            return JSONResponse(
                content={"code": 1, "data": None, "error": "Invalid GPU index"},
                status_code=400,
            )
    else:
        idx = list(range(ngpus))
    process_type = request.query_params.get("process", "")
    if process_type not in ["", "C", "G", "NA"]:
        return JSONResponse(
            content={
                "code": 1,
                "data": None,
                "error": "Invalid process type, choose from C, G, NA",
            },
            status_code=400,
        )
    try:
        devices = []
        processes = []
        for i in idx:
            device = Device(i)
            devices.append(
                {
                    "idx": i,
                    "fan_speed": device.fan_speed(),
                    "temperature": device.temperature(),
                    "power_status": device.power_status(),
                    "gpu_utilization": device.gpu_utilization(),
                    "memory_total_human": f"{round(device.memory_total() / 1024 / 1024)}MiB",
                    "memory_used_human": f"{round(device.memory_used() / 1024 / 1024)}MiB",
                    "memory_free_human": f"{round(device.memory_free() / 1024 / 1024)}MiB",
                    "memory_utilization": round(
                        device.memory_used() / device.memory_total() * 100, 2
                    ),
                }
            )
            now_processes = device.processes()
            sorted_pids = sorted(now_processes)
            for pid in sorted_pids:
                process = now_processes[pid]
                if process_type == "" or process_type in process.type:
                    processes.append(
                        {
                            "idx": i,
                            "pid": process.pid,
                            "username": process.username(),
                            "command": process.command(),
                            "type": process.type,
                            "gpu_memory": bytes2human(process.gpu_memory()),
                        }
                    )
        return JSONResponse(
            content={
                "code": 0,
                "data": {"count": ngpus, "devices": devices, "processes": processes},
            }
        )
    except Exception as e:
        return JSONResponse(
            content={"code": -1, "data": None, "error": str(e)}, status_code=500
        )


if __name__ == "__main__":
    port = int(os.environ.get("PORT", "8000"))
    uvicorn.run(app, host="127.0.0.1", port=port, reload=False)

CodeBlock Loading...

There are three environment variables in the code:

SUBURL: Used to configure the API path, for example, specifying the server name.
FRP_PATH: The path where frp and its configuration are located, used to map the API port to my on-campus small host. If your servers can be accessed directly, you can delete the related functions, change the last line to 0.0.0.0, and then access via IP (or configure a domain name for each server).
PORT: The port where the API is located.

Here I only wrote two interfaces ~~, actually only one is used~~

/count: Returns how many GPUs there are.
/status: Returns specific status information; see the example below for the returned data. However, I also wrote two optional parameters:
- idx: Comma-separated numbers to get the status of specified GPUs.
- process: Used to filter the returned processes. When I use it, I directly set it to C to show only computing tasks.

JSON

{
  "code": 0,
  "data": {
    "count": 2,
    "devices": [
      {
        "idx": 0,
        "fan_speed": 41,
        "temperature": 71,
        "power_status": "336W / 350W",
        "gpu_utilization": 100,
        "memory_total_human": "24576MiB",
        "memory_used_human": "18653MiB",
        "memory_free_human": "5501MiB",
        "memory_utilization": 75.9
      },
      {
        "idx": 1,
        "fan_speed": 39,
        "temperature": 67,
        "power_status": "322W / 350W",
        "gpu_utilization": 96,
        "memory_total_human": "24576MiB",
        "memory_used_human": "18669MiB",
        "memory_free_human": "5485MiB",
        "memory_utilization": 75.97
      }
    ],
    "processes": [
      {
        "idx": 0,
        "pid": 1741,
        "username": "gdm",
        "command": "/usr/lib/xorg/Xorg vt1 -displayfd 3 -auth /run/user/125/gdm/Xauthority -background none -noreset -keeptty -verbose 3",
        "type": "G",
        "gpu_memory": "4.46MiB"
      },
      {
        "idx": 0,
        "pid": 2249001,
        "username": "xxx",
        "command": "~/.conda/envs/torch/bin/python -u train.py",
        "type": "C",
        "gpu_memory": "18618MiB"
      },
      {
        "idx": 1,
        "pid": 1741,
        "username": "gdm",
        "command": "/usr/lib/xorg/Xorg vt1 -displayfd 3 -auth /run/user/125/gdm/Xauthority -background none -noreset -keeptty -verbose 3",
        "type": "G",
        "gpu_memory": "9.84MiB"
      },
      {
        "idx": 1,
        "pid": 1787,
        "username": "gdm",
        "command": "/usr/bin/gnome-shell",
        "type": "G",
        "gpu_memory": "6.07MiB"
      },
      {
        "idx": 1,
        "pid": 2249002,
        "username": "xxx",
        "command": "~/.conda/envs/torch/bin/python -u train.py",
        "type": "C",
        "gpu_memory": "18618MiB"
      }
    ]
  }
}

CodeBlock Loading...

Vue Frontend

Here I took a shortcut ~~, actually because I didn't know how~~, and temporarily copied the UI originally generated with Python.

VUE

<!-- App.vue -->
<script setup>
import GpuMonitor from './components/GpuMonitor.vue';

let urls = [];
let titles = [];
for (let i = 1; i <= 5; i++) {
  urls.push(`https://xxxx/status?process=C`);
  titles.push(`s${i}`);
}
const data_length = 100; // Length of GPU utilization history data, used for drawing line charts (drawing a pie in the sky for now)
const sleep_time = 500;  // Interval for refreshing data, in milliseconds
</script>

<template>
  <h3><a href="https://www.do1e.cn/posts/citelab/server-help">Server Usage Instructions</a></h3>
  <GpuMonitor v-for="(url, index) in urls" :key="index" :url="url" :title="titles[index]" :data_length="data_length" :sleep_time="sleep_time" />
</template>

<style scoped>
body {
  margin-left: 20px;
  margin-right: 20px;
}
</style>

CodeBlock Loading...

VUE

<!-- components/GpuMonitor.vue -->
<template>
  <div>
    <h1>{{ title }}</h1>
    <article class="markdown-body">
      <div v-for="device in data.data.devices" :key="device.idx">
        <b>GPU{{ device.idx }}: </b>
        <b>Memory: </b>
        <div class="g-container">
          <div class="g-progress" :style="{ width: device.memory_utilization + '%' }"></div>
        </div>
        <code style="width: 25ch;">{{ device.memory_used_human }}/{{ device.memory_total_human }} {{ device.memory_utilization }}%</code>
        <b>Utilization: </b>
        <div class="g-container">
          <div class="g-progress" :style="{ width: device.gpu_utilization + '%' }"></div>
        </div>
        <code style="width: 5ch;">{{ device.gpu_utilization }}%</code>
        <b>Temperature: </b>
        <code style="width: 4ch;">{{ device.temperature }}°C</code>
      </div>
      <table v-if="data.data.processes.length > 0">
        <thead>
          <tr><th>GPU</th><th>PID</th><th>User</th><th>Command</th><th>GPU Usage</th></tr>
        </thead>
        <tbody>
          <tr v-for="process in data.data.processes" :key="process.pid">
            <td>GPU{{ process.idx }}</td>
            <td>{{ process.pid }}</td>
            <td>{{ process.username }}</td>
            <td>{{ process.command }}</td>
            <td>{{ process.gpu_memory }}</td>
          </tr>
        </tbody>
      </table>
    </article>
  </div>
</template>

<script>
import axios from 'axios';
import { Chart, registerables } from 'chart.js';

Chart.register(...registerables);

export default {
  props: {
    url: String,
    title: String,
    data_length: Number,
    sleep_time: Number
  },
  data() {
    return {
      data: {
        code: 0,
        data: {
          count: 0,
          devices: [],
          processes: []
        }
      },
      gpuUtilHistory: {}
    };
  },
  mounted() {
    this.fetchData();
    this.interval = setInterval(this.fetchData, this.sleep_time);
  },
  beforeDestroy() {
    clearInterval(this.interval);
  },
  methods: {
    fetchData() {
      axios.get(this.url)
        .then(response => {
          if (response.data.code !== 0) {
            console.error('Error fetching GPU data:', response.data);
            return;
          }
          this.data = response.data;
          for (let device of this.data.data.devices) {
            if (!this.gpuUtilHistory[device.idx]) {
              this.gpuUtilHistory[device.idx] = Array(this.data_length).fill(0);
            }
            this.gpuUtilHistory[device.idx].push(device.gpu_utilization);
            this.gpuUtilHistory[device.idx].shift();
          }
        })
        .catch(error => {
          console.error('Error fetching GPU data:', error);
        });
    }
  }
};
</script>

<style>
.g-container {
  width: 200px;
  height: 15px;
  border-radius: 3px;
  background: #eeeeee;
  display: inline-block;
}
.g-progress {
  height: inherit;
  border-radius: 3px 0 0 3px;
  background: #6e9bc5;
}
code {
  display: inline-block;
  text-align: right;
  background-color: #ffffff !important;
}
</style>

CodeBlock Loading...

// main.js
import { createApp } from 'vue'
import App from './App.vue'

createApp(App).mount('#app')

CodeBlock Loading...

HTML

<!DOCTYPE html>
<html lang="">
  <head>
    <meta charset="UTF-8">
    <link rel="icon" href="/favicon.ico">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/github-markdown-css/5.2.0/github-markdown.min.css">
    <title>Lab GPU Usage</title>
  </head>
  <body>
    <div id="app"></div>
    <script type="module" src="/src/main.js"></script>
  </body>
</html>

CodeBlock Loading...

New UI

~~First, draw the pie here, waiting for me to return after learning. (￣_,￣ )~~

~~More beautiful UI~~ (I'm not sure if this has been achieved; I'm a design waste)
~~Add utilization line chart~~
~~Support dark mode~~

2024/12/27: Completed the above TODOs using Next.js, and also implemented the function to hide some hosts, setting the hidden hosts as cookies for the same state when reopened next time.

2025/03/11: Next.js updated with email login functionality, restricting access to authorized users only.

2025/09/18: Organized the code; now the functionality is basically complete and convenient for others to deploy directly.

Full code available at:

agent2

The agent run on the GPU machines to collect the GPU metrics.

Python

dashboard3

The dashboard run on the web server to show the GPU metrics.

TypeScript