Training Deep Learning Models Using ThinkSystem SR680a V3, SR780a V3, SR685a V3 Compute Nodes with DDN AI400X2 Storage Nodes

Reference Architecture

Published: 8 Sep 2024
Form Number: LP2021
PDF size: 42 pages, 3.3 MB

Abstract

Training deep learning models, including Generative AI (GenAI) and its subset, Large Language Models (LLMs), requires data movement through the network to be efficient and rapid with little to no data backlog. Lenovo approaches this challenge with powerful, high-performance data center servers (SR680a V3, SR780a V3, and SR685a V3) that support NVIDIA 8-way GPU configurations, paired with DDN's AI-optimized storage appliances for high-performance storage.

This reference architecture treats data movement during training and GPU utilization as the primary design considerations. The architecture uses the latest NVIDIA H100 and H200 GPUs along with an InfiniBand network topology to deliver the speeds necessary to train large, comprehensive models. The components of the architecture are described, an example of a scalable unit is provided, and the bill of materials for the design is included.

Table of Contents

1. Introduction
2. Architectural Overview
3. Compute Layer
4. DDN Storage Layer
5. Neptune Water Cooled Technology
Appendix: Lenovo Bill of Materials
Resources

To view the document, click the Download PDF button.

Related product families

The following product families are related to this document: