Hopper (microarchitecture)

Grace Hopper GH200
Designed by	Nvidia
Manufactured by	TSMC
Fabrication process	TSMC 4N
Codename(s)	Grace Hopper
Specifications
Compute	GPU: 132 Hopper SMs; CPU: 72 Neoverse V2 cores
Shader clock rate	1980 MHz
Memory support	GPU: 96 GB HBM3 or 144 GB HBM3e; CPU: 480 GB LPDDR5X

Hopper
Launched	September 20, 2022
Designed by	Nvidia
Manufactured by	TSMC
Fabrication process	TSMC 4N
Product Series
Server/datacenter	Tesla H series
Specifications
L1 cache	256 KB (per SM)
L2 cache	50 MB
Memory support	HBM3
PCIe support	PCI Express 5.0
Media Engine
Encoder(s) supported	NVENC
History
Predecessor	Ampere
Variant	Ada Lovelace (consumer and professional)
Successor	Blackwell

Hopper is a graphics processing unit (GPU) microarchitecture developed by Nvidia. It is designed for datacenters and is used alongside the Ada Lovelace microarchitecture. It belongs to the line of products formerly branded as Nvidia Tesla, now Nvidia Data Center GPUs.

Named for computer scientist and United States Navy rear admiral Grace Hopper, the Hopper architecture was leaked in November 2019 and officially revealed in March 2022. It improves upon its predecessors, the Turing and Ampere microarchitectures, featuring a new streaming multiprocessor, a faster memory subsystem, and a transformer acceleration engine.

Architecture

The Nvidia Hopper H100 GPU is implemented using the TSMC 4N process with 80 billion transistors. It consists of up to 144 streaming multiprocessors.^[1] Due to the increased memory bandwidth provided by the SXM5 socket, the Nvidia Hopper H100 offers better performance when used in an SXM5 configuration than in a typical PCIe configuration.^[2]

Streaming multiprocessor

The streaming multiprocessors for Hopper improve upon the Turing and Ampere microarchitectures, although the maximum number of concurrent warps per streaming multiprocessor (SM) remains 64, the same as in Ampere.^[3] The Hopper architecture provides a Tensor Memory Accelerator (TMA), which supports bidirectional asynchronous memory transfer between shared memory and global memory.^[4] Under TMA, applications may transfer tensors of up to five dimensions. When writing from shared memory to global memory, elementwise reduction and bitwise operators may be used, avoiding registers and SM instructions while enabling users to write warp-specialized code. TMA is exposed through cuda::memcpy_async.^[5]

When parallelizing applications, developers can use thread block clusters. A thread block may perform atomic operations in the shared memory of other thread blocks within its cluster, a capability known as distributed shared memory. Distributed shared memory may be used by an SM simultaneously with L2 cache; when used to communicate data between SMs, this can utilize the combined bandwidth of distributed shared memory and L2. The maximum portable cluster size is 8, although the Nvidia Hopper H100 can support a cluster size of 16 when the cudaFuncAttributeNonPortableClusterSizeAllowed attribute is set, potentially at the cost of a reduced number of active blocks.^[6] With L2 multicasting and distributed shared memory, the required bandwidth for dynamic random-access memory reads and writes is reduced.^[7]
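
The cluster-size limits described above can be sketched as a small host-side check. This is a Python illustration rather than CUDA code: the helper name check_cluster_launch and its parameters are hypothetical, the limits of 8 and 16 are taken from the text, and the rule that each grid dimension must be a multiple of the corresponding cluster dimension is an assumption about how cluster launches are validated.

```python
from math import prod

PORTABLE_MAX_CLUSTER_SIZE = 8   # maximum portable cluster size (from the text)
H100_NON_PORTABLE_MAX = 16      # H100 limit with the non-portable opt-in (from the text)


def check_cluster_launch(grid_dim, cluster_dim, allow_non_portable=False):
    """Return the number of clusters in a launch, or raise ValueError.

    grid_dim and cluster_dim are (x, y, z) tuples counted in thread blocks.
    Hypothetical helper for illustration; it is not part of CUDA.
    """
    cluster_size = prod(cluster_dim)
    limit = H100_NON_PORTABLE_MAX if allow_non_portable else PORTABLE_MAX_CLUSTER_SIZE
    if cluster_size < 1 or cluster_size > limit:
        raise ValueError(f"cluster size {cluster_size} is outside 1..{limit}")
    # Assumption: every grid dimension must be a multiple of the cluster dimension.
    for g, c in zip(grid_dim, cluster_dim):
        if g % c != 0:
            raise ValueError(f"grid dimension {g} is not a multiple of {c}")
    return prod(grid_dim) // cluster_size
```

For example, check_cluster_launch((32, 1, 1), (8, 1, 1)) describes a grid of 32 blocks grouped into 4 clusters, while a cluster dimension of (16, 1, 1) is rejected unless allow_non_portable is set.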

Hopper features improved single-precision floating-point format (FP32) throughput, with twice as many FP32 operations per cycle per SM as its predecessor. Additionally, the Hopper architecture adds new instructions, including ones that accelerate the Smith–Waterman algorithm.^[6] Like Ampere, Hopper supports TensorFloat-32 (TF32) arithmetic. The mapping pattern for both architectures is identical.^[8]

Memory

The Nvidia Hopper H100 supports HBM3 and HBM2e memory up to 80 GB; the HBM3 memory system supports 3 TB/s, an increase of 50% over the Nvidia Ampere A100's 2 TB/s. Across the architecture, the L2 cache capacity and bandwidth were increased.^[9]

Hopper allows CUDA compute kernels to utilize automatic inline compression, including in individual memory allocation, which allows accessing memory at higher bandwidth. This feature does not increase the amount of memory available to the application, because the data (and thus its compressibility) may be changed at any time. The compressor will automatically choose between several compression algorithms.^[9]

The Nvidia Hopper H100 increases the capacity of the combined L1 cache, texture cache, and shared memory to 256 KB. Like its predecessors, it combines L1 and texture caches into a unified cache designed to be a coalescing buffer. The attribute cudaFuncAttributePreferredSharedMemoryCarveout may be used to define the carveout of the L1 cache. Hopper introduces enhancements to NVLink through a new generation with faster overall communication bandwidth.^[10]

Memory synchronization domains

Some CUDA applications may experience interference when performing fence or flush operations due to memory ordering. Because the GPU cannot know which writes are guaranteed and which are visible only by chance timing, it may wait on unnecessary memory operations, thus slowing down fence or flush operations. For example, when a kernel performs computations in GPU memory and a parallel kernel performs communications with a peer, the local kernel will flush its writes, resulting in slower NVLink or PCIe writes. In the Hopper architecture, memory synchronization domains allow the GPU to narrow the net cast by a fence operation, so that the fence waits only on writes within its own domain.^[11]

DPX instructions

The Hopper architecture math application programming interface (API) exposes functions in the SM such as __viaddmin_s16x2_relu, which performs the per-halfword $\max(\min(a+b,c),0)$. In the Smith–Waterman algorithm, __vimax3_s16x2_relu can be used; it performs a per-halfword three-way maximum followed by a clamp to zero.^[12] Similarly, Hopper speeds up implementations of the Needleman–Wunsch algorithm.^[13]
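
The per-halfword semantics of these two functions can be emulated on the host to make them concrete. The following Python sketch is an illustration only, not Nvidia's implementation: the helper names (pack_s16x2, unpack_s16x2 and the unprefixed function names) are hypothetical, and sums are assumed to fit in 16 bits, so any overflow behaviour of the hardware instructions is not modelled.

```python
def _to_s16(x):
    """Interpret the low 16 bits of x as a signed 16-bit integer."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def unpack_s16x2(word):
    """Split a 32-bit word into its (low, high) signed halfwords."""
    return _to_s16(word), _to_s16(word >> 16)


def pack_s16x2(lo, hi):
    """Pack two signed halfwords into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def viaddmin_s16x2_relu(a, b, c):
    """Emulate per-halfword max(min(a + b, c), 0) on packed operands."""
    lanes = zip(unpack_s16x2(a), unpack_s16x2(b), unpack_s16x2(c))
    return pack_s16x2(*[max(min(x + y, z), 0) for x, y, z in lanes])


def vimax3_s16x2_relu(a, b, c):
    """Emulate per-halfword max(a, b, c, 0) on packed operands."""
    lanes = zip(unpack_s16x2(a), unpack_s16x2(b), unpack_s16x2(c))
    return pack_s16x2(*[max(x, y, z, 0) for x, y, z in lanes])


# Two Smith-Waterman cells updated at once, one per halfword:
# H = max(diagonal + score, up - gap, left - gap, 0)
diag_plus_score = pack_s16x2(3 + 2, 0 - 1)
up_minus_gap = pack_s16x2(1, -2)
left_minus_gap = pack_s16x2(4, -3)
cells = unpack_s16x2(vimax3_s16x2_relu(diag_plus_score, up_minus_gap, left_minus_gap))
```

Here the first lane keeps the diagonal candidate and the second lane, whose candidates are all negative, is clamped to zero, which is the local-alignment reset that Smith–Waterman requires.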

Transformer engine

The Hopper architecture was the first Nvidia architecture to implement the transformer engine.^[14] The transformer engine accelerates computations by dynamically reducing them from higher numerical precisions (e.g., FP16) to lower precisions that are faster to perform (e.g., FP8) when the loss in precision is deemed acceptable.^[14] The transformer engine is also capable of dynamically allocating bits in the chosen precision to either the mantissa or exponent at runtime to maximize precision.^[15]
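
The trade-off between mantissa and exponent bits can be illustrated with a small sketch. It assumes the two 8-bit floating-point encodings commonly called E4M3 (4 exponent bits, 3 mantissa bits, largest finite value 448) and E5M2 (5 exponent bits, 2 mantissa bits, largest finite value 57344); the article does not name them. The selection rule in choose_fp8_format is a hypothetical heuristic written for this example and is not a description of how Nvidia's transformer engine decides.

```python
# Assumed FP8 encodings: (exponent bits, mantissa bits, largest finite value).
FP8_FORMATS = {
    "E4M3": (4, 3, 448.0),
    "E5M2": (5, 2, 57344.0),
}


def choose_fp8_format(values):
    """Pick an FP8 format and a per-tensor scale for a list of floats.

    Hypothetical heuristic: prefer E4M3 (more mantissa bits) when the ratio of
    the largest to the smallest non-zero magnitude fits in its normal range,
    otherwise fall back to E5M2 (more exponent bits).
    """
    mags = [abs(v) for v in values if v != 0.0]
    if not mags:
        return "E4M3", 1.0
    amax, amin = max(mags), min(mags)
    for name in ("E4M3", "E5M2"):
        exp_bits, _, fmax = FP8_FORMATS[name]
        bias = 2 ** (exp_bits - 1) - 1
        min_normal = 2.0 ** (1 - bias)
        if amax / amin <= fmax / min_normal:
            break
    # Scale so that the largest magnitude lands on the largest finite FP8 value.
    return name, fmax / amax
```

A tensor with a narrow range, such as [0.5, -2.0, 1.0], keeps the extra mantissa bit of E4M3, whereas one spanning six orders of magnitude, such as [1e-6, 1.0], falls back to the wider exponent of E5M2.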

Power efficiency

The SXM5 form factor H100 has a thermal design power (TDP) of 700 watts. Because of its support for asynchronous execution, the Hopper architecture may attain high degrees of utilization and thus may have a better performance-per-watt.^[16]

Grace Hopper

The GH200 combines a Hopper-based H100 GPU with a Grace-based 72-core CPU on a single module. The total power draw of the module is up to 1000 W. CPU and GPU are connected via NVLink, which provides memory coherence between CPU and GPU memory.^[17]

History

In November 2019, a well-known Twitter account posted a tweet revealing that the next architecture after Ampere would be called Hopper, named after computer scientist and United States Navy rear admiral Grace Hopper, one of the first programmers of the Harvard Mark I. The account stated that Hopper would be based on a multi-chip module design, which would result in a yield gain with lower wastage.^[18]

During the 2022 Nvidia GTC, Nvidia officially announced Hopper.^[19] By 2023, during the AI boom, H100s were in great demand. Larry Ellison of Oracle Corporation said that year that at a dinner with Nvidia CEO Jensen Huang, he and Elon Musk of Tesla, Inc. and xAI "were begging" for H100s, "I guess is the best way to describe it. An hour of sushi and begging".^[20]

In January 2024, Raymond James Financial analysts estimated that Nvidia was selling the H100 GPU in the price range of $25,000 to $30,000 each, while on eBay, individual H100s cost over $40,000.^[21] As of February 2024, Nvidia was reportedly shipping H100 GPUs to data centers in armored cars.^[22]

H100 accelerator and DGX H100

Comparison of accelerators used in DGX:^[23]^[24]^[25]

Model	Architecture	Socket	FP32 CUDA cores	FP64 cores (excl. tensor)	Mixed INT32/FP32 cores	INT32 cores	Boost clock	Memory clock	Memory bus width	Memory bandwidth	VRAM	Single precision (FP32)	Double precision (FP64)	INT8 (non-tensor)	INT8 dense tensor	INT32	FP4 dense tensor	FP16	FP16 dense tensor	bfloat16 dense tensor	TensorFloat-32 (TF32) dense tensor	FP64 dense tensor	Interconnect (NVLink)	GPU	L1 Cache	L2 Cache	TDP	Die size	Transistor count	Process	Launched
B200	Blackwell	SXM6	N/A	N/A	N/A	N/A	N/A	8 Gbit/s HBM3e	8192-bit	8 TB/sec	192 GB HBM3e	N/A	N/A	N/A	4.5 POPS	N/A	9 PFLOPS	N/A	2.25 PFLOPS	2.25 PFLOPS	1.2 PFLOPS	40 TFLOPS	1.8 TB/sec	GB100	N/A	N/A	1000 W	N/A	208 B	TSMC 4NP	Q4 2024 (expected)
B100	Blackwell	SXM6	N/A	N/A	N/A	N/A	N/A	8 Gbit/s HBM3e	8192-bit	8 TB/sec	192 GB HBM3e	N/A	N/A	N/A	3.5 POPS	N/A	7 PFLOPS	N/A	1.98 PFLOPS	1.98 PFLOPS	989 TFLOPS	30 TFLOPS	1.8 TB/sec	GB100	N/A	N/A	700 W	N/A	208 B	TSMC 4NP	Q4 2024 (expected)
H200	Hopper	SXM5	16896	4608	16896	N/A	1980 MHz	6.3 Gbit/s HBM3e	6144-bit	4.8 TB/sec	141 GB HBM3e	67 TFLOPS	34 TFLOPS	N/A	1.98 POPS	N/A	N/A	N/A	990 TFLOPS	990 TFLOPS	495 TFLOPS	67 TFLOPS	900 GB/sec	GH100	33792 KB (256 KB × 132)	51200 KB	1000 W	814 mm²	80 B	TSMC 4N	Q3 2023
H100	Hopper	SXM5	16896	4608	16896	N/A	1980 MHz	5.2 Gbit/s HBM3	5120-bit	3.35 TB/sec	80 GB HBM3	67 TFLOPS	34 TFLOPS	N/A	1.98 POPS	N/A	N/A	N/A	990 TFLOPS	990 TFLOPS	495 TFLOPS	67 TFLOPS	900 GB/sec	GH100	33792 KB (256 KB × 132)	51200 KB	700 W	814 mm²	80 B	TSMC 4N	Q3 2022
A100 80GB	Ampere	SXM4	6912	3456	6912	N/A	1410 MHz	3.2 Gbit/s HBM2e	5120-bit	2.04 TB/sec	80 GB HBM2e	19.5 TFLOPS	9.7 TFLOPS	N/A	624 TOPS	19.5 TOPS	N/A	78 TFLOPS	312 TFLOPS	312 TFLOPS	156 TFLOPS	19.5 TFLOPS	600 GB/sec	GA100	20736 KB (192 KB × 108)	40960 KB	400 W	826 mm²	54.2 B	TSMC N7	Q1 2020
A100 40GB	Ampere	SXM4	6912	3456	6912	N/A	1410 MHz	2.4 Gbit/s HBM2	5120-bit	1.52 TB/sec	40 GB HBM2	19.5 TFLOPS	9.7 TFLOPS	N/A	624 TOPS	19.5 TOPS	N/A	78 TFLOPS	312 TFLOPS	312 TFLOPS	156 TFLOPS	19.5 TFLOPS	600 GB/sec	GA100	20736 KB (192 KB × 108)	40960 KB	400 W	826 mm²	54.2 B	TSMC N7	Q1 2020
V100 32GB	Volta	SXM3	5120	2560	N/A	5120	1530 MHz	1.75 Gbit/s HBM2	4096-bit	900 GB/sec	32 GB HBM2	15.7 TFLOPS	7.8 TFLOPS	62 TOPS	N/A	15.7 TOPS	N/A	31.4 TFLOPS	125 TFLOPS	N/A	N/A	N/A	300 GB/sec	GV100	10240 KB (128 KB × 80)	6144 KB	350 W	815 mm²	21.1 B	TSMC 12FFN	Q3 2017
V100 16GB	Volta	SXM2	5120	2560	N/A	5120	1530 MHz	1.75 Gbit/s HBM2	4096-bit	900 GB/sec	16 GB HBM2	15.7 TFLOPS	7.8 TFLOPS	62 TOPS	N/A	15.7 TOPS	N/A	31.4 TFLOPS	125 TFLOPS	N/A	N/A	N/A	300 GB/sec	GV100	10240 KB (128 KB × 80)	6144 KB	300 W	815 mm²	21.1 B	TSMC 12FFN	Q3 2017
P100	Pascal	SXM/SXM2	N/A	1792	3584	N/A	1480 MHz	1.4 Gbit/s HBM2	4096-bit	720 GB/sec	16 GB HBM2	10.6 TFLOPS	5.3 TFLOPS	N/A	N/A	N/A	N/A	21.2 TFLOPS	N/A	N/A	N/A	N/A	160 GB/sec	GP100	1344 KB (24 KB × 56)	4096 KB	300 W	610 mm²	15.3 B	TSMC 16FF+	Q2 2016
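
Several columns of the table are related by simple arithmetic, which the following Python sketch reproduces as a rough cross-check. The function names are hypothetical, and the formulas are the usual peak-rate estimates (bus width times per-pin data rate for bandwidth, two operations per core per clock for FP32), not figures taken from the cited sources; the results land close to the tabulated values but do not match them exactly because the table rounds its inputs.

```python
def memory_bandwidth_gb_s(bus_width_bits, data_rate_gbit_s):
    """Peak bandwidth in GB/s: pins times per-pin rate, divided by 8 bits per byte."""
    return bus_width_bits * data_rate_gbit_s / 8


def fp32_tflops(fp32_cores, boost_clock_mhz):
    """Peak FP32 TFLOPS, counting one fused multiply-add as two operations."""
    return fp32_cores * 2 * boost_clock_mhz * 1e6 / 1e12


rows = {
    # model: (bus width in bits, data rate in Gbit/s, FP32-capable cores, boost clock in MHz)
    "H100": (5120, 5.2, 16896, 1980),
    "V100": (4096, 1.75, 5120, 1530),
    "P100": (4096, 1.4, 3584, 1480),
}

for model, (bus, rate, cores, clock) in rows.items():
    print(model,
          round(memory_bandwidth_gb_s(bus, rate)), "GB/s,",
          round(fp32_tflops(cores, clock), 1), "TFLOPS")
```

For the P100 row, the core count is taken from the mixed INT32/FP32 column, since the table lists no dedicated FP32 cores for that model.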

References

Citations

^[1] Elster & Haugdahl 2022, p. 4.
^[2] Nvidia 2023c, p. 20.
^[3] Nvidia 2023b, p. 9.
^[4] Fujita et al. 2023, p. 6.
^[5] Nvidia 2023b, pp. 9–10.
^[6] Nvidia 2023b, p. 10.
^[7] Vishal Mehta (September 2022). CUDA Programming Model for Hopper Architecture. Santa Clara: Nvidia. Retrieved May 29, 2023.
^[8] Fujita et al. 2023, p. 4.
^[9] Nvidia 2023b, p. 11.
^[10] Nvidia 2023b, p. 12.
^[11] Nvidia 2023a, p. 44.
^[12] Tirumala, Ajay; Eaton, Joe; Tyrlik, Matt (December 8, 2022). "Boosting Dynamic Programming Performance Using NVIDIA Hopper GPU DPX Instructions". Nvidia. Retrieved May 29, 2023.
^[13] Harris, Dion (March 22, 2022). "NVIDIA Hopper GPU Architecture Accelerates Dynamic Programming Up to 40x Using New DPX Instructions". Nvidia. Retrieved May 29, 2023.
^[14] Salvator, Dave (March 22, 2022). "H100 Transformer Engine Supercharges AI Training, Delivering Up to 6x Higher Performance Without Losing Accuracy". Nvidia. Retrieved May 29, 2023.
^[15] "Nvidia's Next GPU Shows That Transformers Are Transforming AI". IEEE Spectrum. Retrieved October 23, 2024.
^[16] Elster & Haugdahl 2022, p. 8.
^[17] "NVIDIA: Grace Hopper Has Entered Full Production & Announcing DGX GH200 AI Supercomputer". AnandTech. May 29, 2023.
^[18] Pirzada, Usman (November 16, 2019). "NVIDIA Next Generation Hopper GPU Leaked – Based On MCM Design, Launching After Ampere". Wccftech. Retrieved May 29, 2023.
^[19] Vincent, James (March 22, 2022). "Nvidia reveals H100 GPU for AI and teases 'world's fastest AI supercomputer'". The Verge. Retrieved May 29, 2023.
^[20] Fitch, Asa (February 26, 2024). "Nvidia's Stunning Ascent Has Also Made It a Giant Target". The Wall Street Journal. Retrieved February 27, 2024.
^[21] Vanian, Jonathan (January 18, 2024). "Mark Zuckerberg indicates Meta is spending billions of dollars on Nvidia AI chips". CNBC. Retrieved June 6, 2024.
^[22] Bousquette, Isabelle; Lin, Belle (February 14, 2024). "Armored Cars and Trillion Dollar Price Tags: How Some Tech Leaders Want to Solve the Chip Shortage". The Wall Street Journal. Retrieved May 30, 2024.
^[23] Smith, Ryan (March 22, 2022). "NVIDIA Hopper GPU Architecture and H100 Accelerator Announced: Working Smarter and Harder". AnandTech.
^[24] Smith, Ryan (May 14, 2020). "NVIDIA Ampere Unleashed: NVIDIA Announces New GPU Architecture, A100 GPU, and Accelerator". AnandTech.
^[25] "NVIDIA Tesla V100 tested: near unbelievable GPU power". TweakTown. September 17, 2017.

Works cited

Elster, Anne; Haugdahl, Tor (March 2022). "Nvidia Hopper GPU and Grace CPU Highlights". Computing in Science & Engineering. 24 (2): 95–100. Bibcode:2022CSE....24b..95E. doi:10.1109/MCSE.2022.3163817. hdl:11250/3051840. S2CID 249474974. Retrieved May 29, 2023.
Fujita, Kohei; Yamaguchi, Takuma; Kikuchi, Yuma; Ichimura, Tsuyoshi; Hori, Muneo; Maddegedara, Lalith (April 2023). "Calculation of cross-correlation function accelerated by TensorFloat-32 Tensor Core operations on NVIDIA's Ampere and Hopper GPUs". Journal of Computational Science. 68. doi:10.1016/j.jocs.2023.101986.
CUDA C++ Programming Guide (PDF). Nvidia. April 17, 2023.
Hopper Tuning Guide (PDF). Nvidia. April 13, 2023.
NVIDIA H100 Tensor Core GPU Architecture (PDF). Nvidia. 2022.