21 Comments
trivik12 - Tuesday, April 9, 2024 - link
This is not firing on all cylinders. Intel should have used HBM3 and TSMC N3B or Intel 3. This is the most important product category right now, and Intel is being too conservative with its choices. It's not going to make much of a splash in the market. Hopefully Falcon Shores goes all in with Intel 18A, HBM3E, and everything else. Reply
elmagio - Tuesday, April 9, 2024 - link
A chip that size on a 3nm node would cost so much, and have yields so catastrophic, that they'd probably lose money even at the insane pricing of AI chips.

Regardless, it's most likely a foolish endeavor to bet everything on producing the *best* AI chip at any cost today, considering Nvidia's head start. It's a much smarter play to aim for somewhere between an H100 and H200 at acceptable costs, because that's actually a realistic goal. Reply
Threska - Tuesday, April 9, 2024 - link
AI in a lot of IoT. Reply
Silma - Wednesday, April 10, 2024 - link
Even NVIDIA is going for TSMC's 4nm, not the 3nm. There must be a reason: either low yield, or cost, or Apple's exclusivity.
Regarding memory, it's strange. I wonder if Intel's decision is based on not wanting to spend money, or if there aren't enough HBM3/E available on the market. After all, NVIDIA surely did secure enough memory chips for its accelerators. Reply
Bruzzone - Friday, April 19, 2024 - link
Depreciated process cost TSMC 4__. mb Reply
Kevin G - Tuesday, April 9, 2024 - link
While a step up from Gaudi 2, this seems too little, too late given that nVidia's Blackwell was announced beforehand. Granted, neither of these is shipping yet, so the common comparison point is Hopper. While faster than that comparison point per Intel's chosen benchmarks, it really has to do better than Blackwell, which doesn't seem likely given what nVidia has said about its own generational improvements.

The other huge factor here is interconnects to scale up. Ethernet is a more commodity choice but does permit scaling upward to huge node counts and has the benefit of being able to interoperate with other systems/accelerators. It would have been nice if Intel hadn't sold off its Ethernet switch division 15 months ago, as there is a good pairing of technologies here: essentially, throw a newer Barefoot switch ASIC on the baseboard and leverage a 100GBASE-KX-style spec to link everything together on the board, with massive uplink bandwidth to go outside the node.
Scaling counts upward is supposed to be Gaudi 3's specialty, and Intel should power and price these to enable this, permitting clusters to outperform those from their competitors. If you can't win a 1:1 battle, make it a 2:1 battle where you can win. At 900W per board, there are costs and cooling complexities if Intel is driving the design hard. Lowering clocks and voltages may be warranted if operating at 600W can deliver 85% of the performance at 900W; it'd make things cheaper to cool and maintain. There is a niche opportunity to win by replacing existing clusters without needing to go to liquid cooling, which, given Intel's position, it should press hard. Similarly, Intel should be undercutting its competitors on price in an effort to gain market share here while emphasizing performance per dollar (which includes the power side of things). I don't see Intel going in this direction for various reasons, and they'll wind up with another 'success' like Ponte Vecchio on their hands. Reply
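The 600W-versus-900W efficiency argument above can be sketched numerically. Note that the 85%-at-600W figure is the commenter's hypothetical, not a measured number:

```python
# Illustrative check of the efficiency argument above: if running at
# 600 W yields 85% of the performance seen at 900 W (a hypothetical
# figure from the comment), how much does performance-per-watt improve?

def perf_per_watt(relative_perf: float, watts: float) -> float:
    """Performance (arbitrary units) per watt of board power."""
    return relative_perf / watts

baseline = perf_per_watt(1.00, 900.0)  # 900 W OAM board at full speed
derated = perf_per_watt(0.85, 600.0)   # hypothetical derated operating point

ratio = derated / baseline
print(f"perf/W improvement at 600 W: {ratio:.3f}x")
```

Under those assumptions, perf/W improves by roughly 1.275x at the lower power point, which is the crux of the "make it a 2:1 battle" case: more derated boards for the same power and cooling budget.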
Blastdoor - Tuesday, April 9, 2024 - link
Kind of a bummer that Intel isn’t manufacturing this themselves, but rather relying on TSMC. Reply
ballsystemlord - Tuesday, April 9, 2024 - link
Why sell off their fabs when they can outsource both their production capabilities and needs? (sarcasm). Reply
name99 - Saturday, April 20, 2024 - link
I think the word you mean is “red flag”, not “bummer”… Reply
soaringrocks - Sunday, April 21, 2024 - link
I'm guessing that with all those Ethernet lanes this chip will not be from an Intel fab. Intel fabs are great, but the mixed-signal IP required for Ethernet was never a strong point for them. It's expensive to develop, and they don't make enough designs to amortize the high cost of developing the IP. Reply
Oxford Guy - Tuesday, April 9, 2024 - link
'Lowering clocks and voltages maybe warranted if operating at 600W can do 85% of the performance at 900W. It'd make things cheaper to cool and maintain.'

Per the article, Intel will sell a 600W PCIe version. Reply
Kevin G - Wednesday, April 10, 2024 - link
True, this was more in reference to the OAM module. The PCIe version probably should be 300 W. Reply
Gm2502 - Wednesday, April 10, 2024 - link
Lol, I think a lot of commenters here have no real-life experience with AI. While Nvidia is winning at the high end, any decent AI accelerator that balances power/performance, and to some extent price, is going to sell out given the massive demand and overall shortage of AI accelerators. We are currently quoting 18-month lead times for most decent accelerators (H100/H200/MI200/250), with only smaller cards like the A40 available in bulk. The Gaudi will sell, and sell well. Reply
Bruzzone - Friday, April 19, 2024 - link
L40_ bulk volume sounds reasonable, and with 4090 channel supply up +148% in the last month, I see it as run-end dumping of AIB inventories. However, channel supply data consistently places H100 volume in front.

This is full-run channel data, and I get that the channel does not necessarily see direct end-customer sales.
GH200 = 2.07%
H100 = 57.74%
H800 = 15.6%
L40S = 21.48%
L40 = 3.11%
Maybe we can start with this inquiry, checking for 'fit', to understand the L40 discrepancy: why so few L40 relative to L40S in the channel supply data?
Thanks for your thoughts on this. mb Reply
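As a quick arithmetic check on the channel-share figures quoted above (the percentages are the commenter's data, not independently verified):

```python
# Sanity check: do the five quoted SKU shares account for the whole
# channel sample? If they sum to ~100%, the list is exhaustive, and the
# low L40 share is a real feature of the data rather than a missing SKU.
shares = {
    "GH200": 2.07,
    "H100": 57.74,
    "H800": 15.60,
    "L40S": 21.48,
    "L40": 3.11,
}

total = sum(shares.values())
print(f"total share: {total:.2f}%")
```

The shares sum to 100.00%, so the L40-versus-L40S gap is in the data itself, not an artifact of an omitted SKU.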
ballsystemlord - Tuesday, April 9, 2024 - link
It looks like a solid offering if they price it low enough. Reply
wr3zzz - Tuesday, April 9, 2024 - link
Obviously there is a cost advantage to a tensor-core-only design over Nvidia's CUDA offerings, but I've yet to see any actual numbers. Are there good sources for number comps? Reply
Ryan Smith - Tuesday, April 9, 2024 - link
The latest MLPerf Inference results are a good starting point: https://mlcommons.org/2024/03/mlperf-inference-v4/ Reply
onewingedangel - Wednesday, April 10, 2024 - link
Surprised Intel haven't disclosed more about Falcon Shores at this point - from their public releases it's unclear to what extent it will carry forward the Gaudi architecture versus Ponte Vecchio, so you wonder if there is a reticence among customers to commit to Gaudi when it may prove an architectural dead end. Reply
Kevin G - Wednesday, April 10, 2024 - link
That is the problem with Intel: they don't have a very clear roadmap between their various acceleration architectures. Xe, Gaudi, and Ponte Vecchio/Falcon Shores are all different architectures from different groups within the company. The Xe and Ponte Vecchio/Falcon Shores designs were supposed to merge at some point, but it is muddy whether that ever happened or will happen. Even on the CPU side of things, there are inconsistencies with the AVX and AMX extensions on top of the P-core and E-core designs. This does make their oneAPI appealing, as it covers the disjointed hardware map and includes support for Altera FPGAs. Reply
Bruzzone - Thursday, April 18, 2024 - link
Gaudi 3 price?

Xeon Phi 7120 P/X = $4125
Xeon Phi 5110 D/P = $2725
Xeon Phi 3120 A/P = $1625
Average Weighted Price of the three, on 2,203,062 units of production, = $2779. Intel was aiming for around $3185; however, 7120 production seemed to fizzle.
Stampede TACC card sample = $400 what a deal
Shanghai Jiao Tong University sample = $400 (now export restricted?)
Russia Academy of Science, JSCC RAS Tornado (also now export restricted?)
Gaudi system on substrate is $16,147, approximately in line with Nvidia x00 and AMD x00 gross per unit; the key component cost on an Nvidia model is $3608, and at an $11,516 'net' take on an Nvidia accelerator, the component cost drops to $2573.
So, Intel could sell them at cost x4, which is a competitive profit (just shy of x5, which is an economic profit point), where at x3 to x4 Intel will fly just under AMD and Nvidia net, even if Intel relies on TSMC front-end component fabrication where Intel then handles its own back-end packaging. Around the $1K price of a high-end Xeon.
mb Reply
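The markup arithmetic above can be laid out explicitly. All dollar figures are the commenter's estimates, not confirmed costs:

```python
# Hedged sketch of the pricing logic above: take the estimated key
# component costs and apply the x3/x4/x5 multiples the comment discusses.
component_cost_gross = 3608  # estimated component cost, gross basis ($)
component_cost_net = 2573    # estimated component cost, 'net' take basis ($)

for multiple in (3, 4, 5):
    price_gross = component_cost_gross * multiple
    price_net = component_cost_net * multiple
    print(f"x{multiple}: gross-basis ${price_gross:,}, net-basis ${price_net:,}")
```

At x4 on the gross basis that is $14,432, which lands just under the $16,147 quoted for a Gaudi system on substrate, consistent with the "fly just under AMD and Nvidia" positioning the comment argues for.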
name99 - Saturday, April 20, 2024 - link
Apple’s ANE also has the same TOPS rate for 8-bit and 16-bit. The design *allows for* easy handling of one or more versions of FP8, and BF16, but officially, as far as I know, what’s supported is INT8 and FP16. The lack of BF16 (added across CPU SIMD and GPU with the A15/M2 generation) is especially surprising, and I’m guessing it reflects lack of documentation, not lack of ability.

ANE also accumulates to 32b (so kinda like nVidia’s TF32) and has essentially transparent support for biasing and rescaling data as it flows through (so like what nVidia calls the “Transformer Engine”, for supporting tensors using just a few bits in memory that are expanded as they flow through the engine). Reply
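The "accumulate wide, rescale in flight" pattern described above can be sketched in a few lines. This is illustrative only, not Apple's actual ANE implementation, and the `scale`/`bias` parameters here are hypothetical per-tensor values:

```python
# Illustrative sketch of narrow-input / wide-accumulate arithmetic:
# int8 inputs, 32-bit accumulation, then a scale and bias applied as the
# result flows out, mimicking the biasing/rescaling behavior described.

def int8_dot_rescaled(a, b, scale, bias):
    """Dot product of two int8-range vectors with wide accumulation,
    then rescaled to a float on the way out."""
    acc = 0  # Python ints are arbitrary precision; think of this as an int32 accumulator
    for x, y in zip(a, b):
        assert -128 <= x <= 127 and -128 <= y <= 127, "inputs must be int8-range"
        acc += x * y  # products and partial sums never overflow the wide accumulator
    return acc * scale + bias  # rescale from quantized units back to real values

# Small values stored in 8 bits, expanded to a meaningful range via scale:
result = int8_dot_rescaled([100, -50, 25], [120, 80, -40], scale=0.01, bias=0.0)
print(result)
```

The point of the wide accumulator is that long dot products of 8-bit values never lose precision mid-sum; only the final rescale reintroduces rounding, which is also the idea behind storing tensors "using just a few bits in memory" and expanding them in the engine.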