Terry_Craig - Monday, March 18, 2024 - link
So the gain per GPU in dense operations is less than 14%. The biggest improvement comes from supporting very low precision FP4, connecting two GPUs together, plus a massive amount of HBM and bandwidth in the sauce :P
whatthe123 - Monday, March 18, 2024 - link
They're practically doubling everything here, with the only real (admitted) regression being in FP64 tensor unless that's a typo, so I don't know where you're getting the idea that the gain is less than 14%. The improved data movement alone should see huge gains in general at every level except FP64 tensor, for some reason.
Terry_Craig - Monday, March 18, 2024 - link
H100 is a single GPU. B200 is two GPUs connected; the gain per GPU was less than 14%, as shown in the table.
Unashamed_unoriginal_username_x86 - Monday, March 18, 2024 - link
Welcome to 2.5D: the MI300X is a 15% downgrade in tensor FP32 "per GPU" compared to the MI250X, but per card it's a 1.7x uplift and per software-addressable GPU it is a 3.4x uplift. The B200 isn't a big improvement in perf/mm² thanks to chiplet overhead, but the per-rack density improvement speaks for itself.
Dante Verizon - Monday, March 18, 2024 - link
MI300 uses much smaller chips; it's not even comparable. My point is that Blackwell's architectural advances aren't huge. I also don't understand why they're still using 4nm.
MoogleW - Wednesday, March 20, 2024 - link
How are they not comparable? Scaling more through chips/chiplets, even if the individual die is not much better, is all the same. Especially when Hopper is seen by the OS as one GPU.
Dante Verizon - Friday, March 22, 2024 - link
That's Jensen's fallacy. AMD connected 8 dies of 115mm² each, creating a product (MI300X) up to 2.4x more powerful than the H100. Looking at how mediocre the B200 is, AMD can simply save money, create a product with 16 CDNA3 dies, update the memory system, and continue to have the most powerful product.
Bruzzone - Friday, March 22, 2024 - link
TSMC 4 = a pretty much depreciated process, for low marginal cost of production on knowns that are high on the learning curve. mb
name99 - Monday, March 18, 2024 - link
Not necessarily. Yes, the peak matrix FLOPs may have increased by a limited amount, but peak is not everything...
One dimension of improvement is simply continually reducing overhead – overhead in scheduling, in starting new tasks, in moving from one NN layer to the next, etc etc.
Another dimension of improvement is moving trivial work off expensive compute hardware to simpler hardware. For example, look at how Apple's ANE has evolved. The primary compute engines used to also perform pooling, but that was clearly a dumb use of expensive resources, so pooling and related "simple" tasks (bias and scaling, ReLU lookup, etc.) have been moved to the separate Planar Engine. Right now I'm unaware of anything similar in nVidia's designs, but it would make sense.
Another tweak that would make sense is better convolution hardware. Sure, if all you care about is language then convolution won't matter, but with 2024 probably being the year of multi-modal models (the way 2023 was the year of language models), betting against vision seems dumb! And nV's current solution for convolution as I understand it (faking convolution as a matrix multiplication) is horrifying in terms of the memory and bandwidth costs compared to better-targeted hardware.
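(A rough sketch of why the "convolution as a multiplication" approach is memory-hungry: the usual im2col-style lowering copies every k×k input patch into a big matrix before the GEMM. The layer shape below is made up for illustration, and this is an assumption about the general technique the comment refers to, not a description of how nVidia's libraries actually implement it.)

```python
def im2col_footprint(h, w, c, k, dtype_bytes=2):
    """Bytes for an input activation tensor vs. its im2col-lowered matrix
    (stride 1, 'same' padding), assuming FP16 activations."""
    input_bytes = h * w * c * dtype_bytes
    # Each of the h*w output positions gets its own copy of a k*k*c patch.
    lowered_bytes = h * w * (k * k * c) * dtype_bytes
    return input_bytes, lowered_bytes

inp, low = im2col_footprint(h=56, w=56, c=256, k=3)
print(f"input: {inp / 2**20:.1f} MiB, im2col matrix: {low / 2**20:.1f} MiB "
      f"({low / inp:.0f}x blow-up)")  # a 3x3 kernel costs ~9x the memory
```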
nV runs smart (like Apple, unlike Intel); they don't just throw transistors blindly at the problem. My guess is when we see the actual numbers (and as much of a technology explanation as they give us) we will see significant "algorithmic" improvements of the sort I describe, though of course I can't be sure they'll match the exact examples I have given.
As for FP4, Apple did some significant work a few years ago on binarized neural networks, where some of the layers (not all, but some) were run as single-bit layers (essentially single-bit present/not-present feature lists dotted against single-bit feature sets). Their conclusion was that this worked well enough to be interesting, with substantial performance (and memory) boosts, but was hurt by the cost of occasionally having to toggle back from a binarized layer to some sort of FP layer.
But if you provide hardware to make those transitions efficient...
Again I can't speak to nVidia, but Apple has versions of such hardware ("cost-free" mapping between certain formats at data load/store/transfer). So the point is, you do many of your layers in FP16 or FP8 or whatever, then (especially if you have cost-free mapping hardware) do other layers in FP4...
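(A toy illustration of that mixed-precision idea: keep most layers in higher precision and quantize selected layers' weights onto a 4-bit grid, with an explicit, ideally cheap, conversion at the boundary. The quantizer below is a generic scaled round-to-nearest sketch of my own, not NVIDIA's FP4 format or Apple's hardware mapping.)

```python
import numpy as np

def quantize_to_4bit(w, levels=16):
    """Map weights onto a uniform 4-bit grid (scaled round-to-nearest).
    Real FP4 uses a non-uniform floating-point grid; this is just a toy."""
    scale = np.max(np.abs(w)) / (levels / 2 - 1)       # per-tensor scale
    q = np.clip(np.round(w / scale), -levels / 2, levels / 2 - 1)
    return q.astype(np.int8), scale                     # 16-level codes + scale

def dequantize(q, scale):
    return q.astype(np.float16) * np.float16(scale)     # cheap cast back up

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float16)  # one layer's weights
x = rng.standard_normal((1, 256)).astype(np.float16)    # one activation vector

q, s = quantize_to_4bit(w)          # the "FP4-ish" version of the layer
y_lp = x @ dequantize(q, s)         # run the layer from 4-bit weights
y_hp = x @ w                        # same layer kept entirely in FP16
print("relative error:",
      float(np.linalg.norm(y_lp - y_hp) / np.linalg.norm(y_hp)))
```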
haplo602 - Tuesday, March 19, 2024 - link
That is all nice and fine, but you are citing NN-specific improvements. These devices are not used only in AI workloads, so marketing on FP4 performance is relevant to only a small part of the market.
If you look at the general compute numbers as presented in the table, they managed a 2x increase for a roughly 2x increase in transistor count. Yes, specific workloads might benefit from other improvements, but in general there is very little progress on simple compute performance...
name99 - Tuesday, March 19, 2024 - link
A small part of the market? OMG dude, you are SO UTTERLY CLUELESS about what is going on here.
mode_13h - Wednesday, March 20, 2024 - link
Yeah, HPC is the niche market for them, at this point.
I have no idea how they can justify still spending so much die space on FP64, at this point. I had expected them to bifurcate those product lines by now. For the HPC folks who need both, you could put a mix of each type of SXM board in an HPC server, but this way you'd not be making the AI market pay for a bunch of dark FP64 silicon they'll never use.
Same reason 100-series GPUs, starting with the A100, don't have graphics hardware accel.
LordSojar - Wednesday, April 10, 2024 - link
Wait, a small part of the market? Uh........... ho'kay den.
mode_13h - Wednesday, March 20, 2024 - link
> Apple did some significant work a few years ago on binarized neural networks
Yes, and they were far from the first.
I know it's not directly analogous, but the concept of binary neural networks reminds me a bit of delta-sigma audio. Sure, you can get good quality with a 2.8 MHz DSD bitstream, but that's the same bitrate as you get from an 88.2 kHz stream @ 32 bits per sample! So, is it *really* progress over PCM, if you need a higher bitrate to achieve comparable quality?
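(That bitrate equivalence, worked out explicitly; the 2.8224 MHz figure is the standard DSD64 sample rate.)

```python
dsd_bitrate = 2_822_400 * 1     # DSD64: 2.8224 MHz at 1 bit per sample
pcm_bitrate = 88_200 * 32       # 88.2 kHz PCM at 32 bits per sample
print(dsd_bitrate, pcm_bitrate, dsd_bitrate == pcm_bitrate)  # both 2,822,400 bit/s
```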
GeoffreyA - Thursday, March 21, 2024 - link
I think an autoencoder.
name99 - Thursday, March 21, 2024 - link
I think you are thinking of spiking networks, which are directly analogous to biology but are terrible for digital implementation in terms of power, because there is so much switching going on. I don't know if anyone serious is bothering with those these days.
The Apple work (probably inherited from their xnor.ai acquisition) is more about the point that there are layers within a net that can do very well at very low precision.
An intuition for this is that at least some layers are essentially about taking in a feature vector, counting up the "yes/no" presence of a long list of features, and feeding the sum on to the next stage – this is the sort of thing that can be done by a binarized layer.
And again, the point is not binary; it's that saying "oh, FP4 is dumb" betrays an ignorance of NNs. NNs have many layers, many of which have different properties. No one is claiming you can train a useful model from scratch using FP4, but you may well be able to run a reasonable fraction of your inference layers in FP4, and that's where the value is.
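(A minimal sketch of that "yes/no feature counting" intuition: with ±1 activations and weights packed into bit masks, a dot product reduces to XNOR plus popcount, which is why binarized layers are so cheap. This is a generic illustration, not Apple's or xnor.ai's actual kernel.)

```python
def binarized_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two ±1 vectors packed into n-bit integers
    (bit = 1 means +1, bit = 0 means -1): XNOR, popcount, recenter."""
    matches = ~(a_bits ^ w_bits) & ((1 << n) - 1)   # XNOR, masked to n bits
    return 2 * bin(matches).count("1") - n          # agreements minus disagreements

# features: +1 -1 +1 +1        weights: +1 +1 +1 -1
a = 0b1011
w = 0b1110
print(binarized_dot(a, w, 4))   # 2 agreements, 2 disagreements -> 0
```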
mode_13h - Friday, March 22, 2024 - link
> I think you are thinking of spiking networks
No...
> I don't know if anyone serious is bothering with those these days.
I heard of a company with an ASIC for spiking neural networks like ~5 years ago, but I don't know if that went anywhere. It was primarily for embedded applications.
> the point is not binary; it's that saying "oh FP4 is dumb" betrays an ignorance of NNs.
Yeah, I didn't take a position on fp4. I know you were citing an extreme example, but I was just commenting on that. I don't know enough about the tradeoffs involved in using fp4 to say whether I think it's worthwhile, but I guess there must be consensus that it is.
shing3232 - Friday, March 22, 2024 - link
FP4 is kind of useless right now. INT8 or FP8 might be a lot more useful for current LLMs. Who knows? Maybe someone could make FP4 work for LLMs, but currently most people use BF16.
quarph - Monday, March 18, 2024 - link
With that high-bandwidth C2C interconnect, it should work as a single GPU after all.
Hopper was already a two-partition design with a separate crossbar & L2 cache for each partition. The on-chip interconnect bridging the two partitions has high enough BW to support 7~8 TB/s of L2 cache bandwidth on the lower-clocked PCIe version (https://github.com/te42kyfo/gpu-benches), meaning the bidirectional BW of the bridging interconnect should be higher than 4 TB/s.
The Blackwell design simply separates each partition into its own die while keeping the same two-partition design, and probably the same coherence support. 10 TB/s of bidirectional BW should be sufficient to support the even higher L2 BW of Blackwell.
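(A back-of-the-envelope version of that estimate. The interleaving assumption and the exact numbers below are mine, for illustration only; the 10 TB/s figure is the stated Blackwell die-to-die bandwidth.)

```python
# Assumption (for illustration): addresses interleave uniformly across the two
# L2 partitions, so roughly half of the measured L2 traffic crosses the bridge.
measured_l2_bw_tbs = 8.0            # ~7-8 TB/s observed on the lower-clocked H100 PCIe
cross_partition_fraction = 0.5
implied_bridge_bw = measured_l2_bw_tbs * cross_partition_fraction
print(f"implied Hopper bridge bandwidth: >~{implied_bridge_bw:.1f} TB/s bidirectional")

blackwell_die_to_die_tbs = 10.0     # stated Blackwell die-to-die link bandwidth
print(f"headroom vs. that implied bridge: {blackwell_die_to_die_tbs / implied_bridge_bw:.1f}x")
```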
Eliadbu - Monday, March 18, 2024 - link
Wrong, it's even written out for you:
"NVIDIA is intent on skipping the awkward “two accelerators on one chip” phase, and moving directly on to having the entire accelerator behave as a single chip."
Multiple dies, but acting like a single GPU (accelerator).
name99 - Monday, March 18, 2024 - link
What's your complaint? The point is clear.
nVidia (like Apple's Ultra designs) wants the two-chip product to present to the developer and the OS as a single large GPU, not as two separate GPUs.
This is as opposed to the hacks of the past (things like Crossfire or SLI) which, in theory, allowed a developer to exploit dual GPUs, but which never really got any traction for a variety of reasons.
It would be easy (but stupid!) for nVidia to ship Blackwell as a dual-GPU solution on top of SLIv2, but the point is that the time for hacks is over; both Apple and nV (and I assume soon enough Intel and AMD) agree that it's now time to *engineer* these solutions properly.
Terry_Craig - Tuesday, March 19, 2024 - link
He just wants to play with words to make the new architecture more impressive than it really is. If they had put two H100 chips together, the performance would be very close.
mode_13h - Thursday, March 21, 2024 - link
You trivialize what it takes to scale up from monolithic to 2 dies. Not only did Nvidia squeeze more performance from each Blackwell die, but they also found enough transistor budget for the massive interconnect. Don't think that sort of thing comes cheap!
mode_13h - Thursday, March 21, 2024 - link
BTW, this comment was a reply to Terry_Craig.
NextGen_Gamer - Tuesday, March 19, 2024 - link
Most of those specs in the table seem incorrect. A single Blackwell GPU is still going from 80B transistors to 104B - a huge increase, along with all the architecture improvements and higher clock speeds. There won't be any regressions, nor any performance increase of less than 25% in any metric from a single GPU, although as noted in the article, all of these will be sold as two GPUs on one package anyway. I am guessing the consumer versions will still be monolithic, and on 3nm, in order to fit in more transistors at the same die size. NVIDIA can then use all of their allotted 4nm wafer space from TSMC for the super-high-profit B200/H100s, and use the consumer GB202/etc. line on 3nm to iron out early kinks and yield issues.
Bruzzone - Friday, March 22, 2024 - link
A fully depreciated TSMC 4 process is even more important for a GPU mass-market product. AD is 602 mm², plenty of reticle area to play with. Albeit, from the Ada-on-4 mobile dGPU design generation (with Intel H attach in mind) to the Blackwell-on-4 return-to-desktop generation, just add a wide bus and more VRAM. mb
nandnandnand - Tuesday, March 19, 2024 - link
"While they are not disclosing the size of the individual dies, we’re told that they are “reticle-sized” dies, which should put them somewhere over 800mm2 each."Blackwell = 2x the die size, ~2.27x the performance
Not that it's a bad thing for some customers, since this unified MCM with 192 GB could be much better than two separate chips. But it's easy to see why there is disappointment. Reply
name99 - Thursday, March 21, 2024 - link
You are assuming "performance" is equal to the POPs count. That is just massively ignorant.Peak OPS is a marketing and fanboi number; it's of limited interest to professionals.
What matters far more is all the overhead that prevents you from hitting that peak number, often to a massive extent. This can be obvious things like memory bandwidth, slightly less obvious things like reshaping, and non-obvious things like costs of synchronization and switching layers. We have no idea how much nV has improved these, but the history of nV chips is substantial improvements in these sorts of things every generation.
As just one, very obvious, example, one of the things Bill Dally was most proud of in Hopper was the addition of the Dynamic Programming instructions. Those don't show up in the peak OPs numbers, and most of the people here have never heard of them and have no idea that they even exist or what they do...
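(For anyone wondering what those are: Hopper's DPX instructions accelerate the fused add-then-min/max updates at the heart of dynamic-programming kernels such as Smith-Waterman or Floyd-Warshall. Below is a plain-Python sketch of the kind of inner recurrence they target, not the CUDA intrinsics themselves.)

```python
def edit_distance(a: str, b: str) -> int:
    """Classic DP recurrence: every cell is a min over three additions.
    DPX-style instructions fuse exactly this add-then-min/max pattern."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance("blackwell", "hopper"))  # 8
```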
mode_13h - Friday, March 22, 2024 - link
> one of the things Bill Dally was most proud of in Hopper was the addition of the Dynamic Programming instructions. ... most of the people here have never heard of them
Maybe Nvidia's partly to blame, for that? I follow GPU news better than average, but I don't recall hearing about them.
Check it out: https://developer.nvidia.com/blog/boosting-dynamic...
Dante Verizon - Friday, March 22, 2024 - link
We're not talking about games. In the tasks that these GPUs perform, raw computational power is a fairly accurate indication of the performance obtained, even more so within an architecture that is just one evolution on top of the other.
GeoffreyA - Tuesday, March 19, 2024 - link
I'm sure it's excellent. But it's quite concerning that the AI field is growing more and more dependent on Nvidia. Replykn00tcn - Tuesday, March 19, 2024 - link
Is it? A couple or more years ago it was pretty much exclusive to Nvidia; now more libraries run on more brands, especially for inference compared to training.
GeoffreyA - Tuesday, March 19, 2024 - link
That's good. We're sick of monopoly in this world. At any rate, when envious entities are in the picture, it is concerning.
GeoffreyA - Tuesday, March 19, 2024 - link
* dominating the picture
Threska - Tuesday, March 19, 2024 - link
Exactly why the big users came up with their own chips.
GeoffreyA - Tuesday, March 19, 2024 - link
Yes. I believe even OpenAI wants to build their own.
Kevin G - Tuesday, March 19, 2024 - link
Per-die performance only gets a mediocre increase, but as a package this is a pretty respectable jump. I do think nVidia made an error in not spinning the interconnect and some select accelerators off to their own die to end-cap the compute dies. This would have permitted a chain of 3 or 4 compute dies to further increase performance, bandwidth, and memory capacity. Once you move to chiplets, it becomes a question of scalability for a leapfrog performance gain in the design. Just having two dies is an improvement, but it feels like a baby step from what it could've been. Now that nVidia has made this move, I am excited to see where they go with it. In particular, how they plan to incorporate CPU and high-speed network IO chiplets (photonics?) into future iterations. With the recent news that nVidia wants to move to yearly updates, disaggregation of the design via chiplets will let them tweak and improve various aspects on that cadence without the need to reengineer an entirely new monolithic die each time.
I do question the need for FP4 support: you only have 16 values to work with in a floating-point manner. This seems to be very niche, but I would presume the performance benefits are there for it. Ditto for FP6.
Since Hopper, there have been a couple of advancements in efficiently computing matrices. Even though the theoretical throughput wouldn't change, it'd be a decisive efficiency win. This change may not have been worthy of inclusion in the presentation, but I would hope that such an advancement, if present, gets a nod in the white paper.
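(For concreteness on the "16 values" point above: assuming the common E2M1 layout, 1 sign / 2 exponent / 1 mantissa bit with no infinities or NaNs, as in the OCP microscaling FP4 format, the whole representable set can be enumerated. In practice a shared per-block scale factor is what makes such a coarse grid usable.)

```python
def fp4_e2m1_values():
    """Enumerate every value of a 4-bit E2M1 float (sign/exp/mantissa = 1/2/1 bits),
    assuming exponent bias 1 and no inf/NaN encodings."""
    vals = set()
    for sign in (1, -1):
        for exp in range(4):            # 2-bit exponent field
            for man in range(2):        # 1-bit mantissa field
                if exp == 0:            # subnormal: 0.m * 2^(1 - bias)
                    v = sign * (man / 2) * 2 ** (1 - 1)
                else:                   # normal: 1.m * 2^(exp - bias)
                    v = sign * (1 + man / 2) * 2 ** (exp - 1)
                vals.add(v)
    return sorted(vals)

print(fp4_e2m1_values())
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# 16 encodings, 15 distinct values (+0 and -0 coincide)
```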
nandnandnand - Wednesday, March 20, 2024 - link
Going from 1 compute die to 3-4 is a big step.
2 seconds of Googling will find that there is interest in, and benefits from, FP4:
https://papers.nips.cc/paper/2020/file/13b91943825...
It was already apparent that 4-bit precision can speed things up while not cutting accuracy too much, and INT4 was already supported by Nvidia Turing, the AMD XDNA NPUs, etc.
https://arxiv.org/pdf/2305.12356.pdf
Rοb - Wednesday, March 20, 2024 - link
Kevin, it is possible to work with binary, ternary, and FP4; one doesn't simply truncate, you need to optimize for the lower precision, and it doesn't work in all cases.
See also:
https://github.com/nbasyl/LLM-FP4 and
https://www.explainxkcd.com/wiki/index.php/2170:_C...
Rοb - Wednesday, March 20, 2024 - link
The comment software likes to break your links: https://www.google.com/search?q=https%3A%2F%2Fwww....
mode_13h - Thursday, March 21, 2024 - link
> Just having two dies is an improvement but feels like a baby step from what it could've been.
Which is I'm sure why you bashed AMD for EPYC supporting only 2P scalability, right?
Sometimes, scaling to 2P is enough. Who says there'd have been room for more compute dies on half an SXM board? Also, once you go beyond 2 dies, maybe latency and bandwidth bottlenecks become too significant for the thing to continue presenting to software as one device.
There are complexities it seems you're just trying to wish away, for the sake of some imagined benefit that might not exist.
Kevin G - Wednesday, March 27, 2024 - link
I have actually been critical of AMD sticking to dual-socket support with Genoa. While a bit odd, the connectivity is there for three sockets without sacrificing latency and with only a hit to bandwidth. Quad socket would require some compromises, but for memory-capacity-hungry tasks it'd be a win.
The size of an SXM board is not hyper-critical for nVidia. Making it longer is an easy trade-off if they can pack more compute into a rack.
As for scaling past 2 dies, you are correct that bandwidth and latency do impact scalability. The design does need to be mindful of the layout. Die-to-die traffic has to be prioritized, especially traffic where a remote source and a remote destination are passing through an intermediary die. Internal bandwidth needs to be higher than that of the local die's own memory bus. These are challenges for scalability, but they are not insurmountable. One benefit nVidia has here is that they dictate the internal designs themselves: they know precisely how long it takes for data to move across the chip, unlike a traditional dual-socket system where board layout is handled by a 3rd party. In other words, scaling up in die count has a deterministic effect on latency.
There are other techniques that can be used to cut latency down: increasing the clock speed and processing capability of the command-and-control logic for coherency. For a chip whose execution units are running at ~2 GHz max, a 4 GHz bus and logic for cache coherency is within reason. And yes, the wire distance reachable per cycle does decrease with higher clock speed, but signal repeaters exist for this reason. This grants the coherency logic more clock cycles to work and move data around with respect to the execution engines.
The main benefit of chiplets is being able to obtain n-level scaling: a row of compute dies lined up, with dedicated stacks of HBM flanking each compute die. Numerous compute units, lots of chip-to-chip bandwidth, large memory capacity, and great memory bandwidth to let the design scale upward.
name99 - Thursday, March 21, 2024 - link
"Per die performance only gets a mediocre increase but as a package this is a pretty respectable jump."This is simply not true, To hear the stories claimed above, using a Blackwell to train a large model would only require half as many GPUs (same number of underlying chips) and same amount of energy.
In fact It requires one quarter as many GPUs (ie each "chip" is 2x as effective), and one quarter as much energy...
And it's not like this is secret - Jensen specifically pointed this out.
Like I keep saying, peak FMAs is only a small part of the real-world performance of these chips and to obsess over it only shows ignorance. Reply
Dante Verizon - Saturday, March 23, 2024 - link
It should have nothing to do with the massive bandwidth and capacity of the HBM you now have available to power your very low-precision (FP4) tasks. LoL
Kevin G - Wednesday, March 27, 2024 - link
Looking at equal precision, the Blackwell package looks to be roughly 2.25x that of the monolithic Hopper. In other words, one Blackwell die is roughly 1.125x that of Hopper, which is faster but not that dramatic.
The other chunk of performance gain is lowering precision via FP4.
CiccioB - Tuesday, March 26, 2024 - link
You are oversimplifying the scaling problem.
You are not taking into account that each connection port to the neighboring die requires a big transistor budget. And power too.
If you want that GPU to work as one single GPU with predictable behavior, you have to consider the bandwidth of a single port as shared between ALL dies: if die 1 wants data present in the memory space of die 3, the data has to pass through the port connecting dies 3 and 2 and then through the port connecting dies 2 and 1. A single transmission uses double the bandwidth.
And it is worse if you have a fourth die.
You have to take into account the latency of such a transmission, which is obviously higher than one from die 2 to die 1, and even higher than one from the memory controllers on die 1.
Distributing data over a NUMA architecture is not simple, and the complexity rises as more dies are involved.
You also have to take into account the size of the underlying interposer (which has a cost) and the energy required to transfer bunches of data between the interconnection ports.
You have to consider Amdahl's law, as scaling is not free in terms of overhead (space, time, and energy, as said).
And in the end, the total energy required by all those dies. Here we are already at 1200W for a single package. If you want to place another die and have the system scale to +50%, you will be consuming more than 1800W.
And those dies would be bigger; and to not have unused parts (unconnected ports, as you are not designing different dies for those at the extremes with a single port and those in the middle with two), you have to place 4 of them in a square structure, thus ending up with more than 2400W of total consumption for scaling near 4x.
Losing only 20% of the calculation power to data movement overheads between dies (and the resulting loss of total available bandwidth), you end up with a bit more than 3x scaling for a 4-die architecture.
Consuming more than 4x by the way.
So you have diminishing returns in terms of area, computation, power consumption, cost, and thus margins.
If it were so easy to scale up with multi-die architectures, we would already have packages as large as 1600 cm² (40x40 cm) with 32 memory channels.
But it is not, so just scaling up to 4 dies is already a challenge.
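(A toy model of the chained-die argument above: with uniformly distributed memory accesses on a linear chain of dies, traffic that crosses intermediate dies occupies every link along its path, so the busiest link must outgrow per-die compute. The numbers are illustrative only, not measurements of any real part.)

```python
def busiest_link_traffic(n_dies: int) -> float:
    """Traffic on the most loaded link of a linear chain, in units of one die's
    own memory-access rate, assuming every die reads all dies' memory uniformly."""
    # The link at cut position k carries all traffic between the k dies on its
    # left and the (n - k) dies on its right: a 2*k*(n-k)/n^2 fraction of all
    # accesses, with n dies each issuing 1 unit of accesses.
    return max(2 * k * (n_dies - k) / n_dies for k in range(1, n_dies))

for n in (2, 3, 4):
    print(f"{n} dies: busiest die-to-die link carries "
          f"{busiest_link_traffic(n):.2f}x one die's access rate")
# 2 dies: 1.00x, 3 dies: 1.33x, 4 dies: 2.00x
```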
nikaldro - Wednesday, March 20, 2024 - link
The node is N4P, not 4NP, and the previously used 4N was a 5nm-class node.
mdriftmeyer - Friday, March 22, 2024 - link
People should really pay attention to patents granted. AMD forced Nvidia's hand early with the MI300 series, and Nvidia comes out with Blackwell.
I look forward to the excuses Nvidia fans make when the MI400 series debuts. Most people couldn't care less about reading granted patents, but they should. It really shows how far ahead AMD is in the world of MCM designs, now and down the road.
name99 - Thursday, March 28, 2024 - link
Why don't you help us out?
I do my part trying to collect and summarize Apple patents, how they fit together, and why they are valuable. You could do the same for AMD patents.
webtasarimo - Wednesday, March 27, 2024 - link
With this development, I doubled my Nvidia shares. For now, we are at the beginning of new technologies that will come of age. I experienced the Apple Vision Pro. Soon phones will be antiques. Artificial intelligence has grown to an incredible scale thanks to these chips. I'm so excited for the future.