Quote:
Maxwell, or GM204, is NVIDIA's 10th-generation GPU architecture; the chip is made up of 5.2 billion transistors and measures 398 mm². NVIDIA's goals for Maxwell were increased gaming performance, incredible energy efficiency, and added support for VXGI lighting. The comparison for Maxwell should be made back to the original Kepler GK104 GPU, which powered the GeForce GTX 680. Compared to Kepler, Maxwell has 2x the performance per watt and 40% more performance per CUDA core.
The original Kepler diagram can be compared here. In GeForce GTX 980, each GPC ships with a dedicated raster engine and four SMMs. Each SMM has 128 CUDA cores, a PolyMorph Engine, and eight texture units. With 16 SMMs, the GeForce GTX 980 ships with a total of 2048 CUDA cores and 128 texture units.
The GeForce GTX 980 features four 64-bit memory controllers, for a 256-bit interface in total. Tied to each memory controller are 16 ROP units and 512KB of L2 cache. The full chip ships with a total of 64 ROPs and 2048KB of L2 cache (compared to 32 ROPs and 512KB of L2 on GK104). NVIDIA was able to integrate twice as many SMs without doubling the die size.
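The totals quoted above follow directly from the per-unit counts. A minimal sketch of that arithmetic, using only figures from the text; the variable names are illustrative, and the GPC count is inferred from 16 SMMs at four per GPC rather than stated:

```python
# Per-unit counts for GeForce GTX 980 (GM204), as quoted in the text above.
SMM_COUNT = 16
SMMS_PER_GPC = 4
CORES_PER_SMM = 128
TEX_UNITS_PER_SMM = 8
MEM_CONTROLLERS = 4
BUS_BITS_PER_CONTROLLER = 64
ROPS_PER_CONTROLLER = 16
L2_KB_PER_CONTROLLER = 512

# Inferred, not stated in the text: 16 SMMs / 4 SMMs per GPC.
gpc_count = SMM_COUNT // SMMS_PER_GPC

totals = {
    "gpcs": gpc_count,
    "cuda_cores": SMM_COUNT * CORES_PER_SMM,
    "texture_units": SMM_COUNT * TEX_UNITS_PER_SMM,
    "memory_bus_bits": MEM_CONTROLLERS * BUS_BITS_PER_CONTROLLER,
    "rops": MEM_CONTROLLERS * ROPS_PER_CONTROLLER,
    "l2_kb": MEM_CONTROLLERS * L2_KB_PER_CONTROLLER,
}

for name, value in totals.items():
    print(f"{name}: {value}")
```

The computed values match the quoted totals: 2048 CUDA cores, 128 texture units, a 256-bit bus, 64 ROPs, and 2048KB of L2.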
We know the GTX 680 is not the fastest Kepler GPU out there right now; that is the GeForce GTX 780 Ti. We will make comparisons to the GTX 780 Ti when we look at the card itself on the next page. There is a reason these comparisons are being made to the GTX 680, and we will talk about that later in the "Who is this card meant for?" section.
Based on efficiency and workload analysis, and math vs. texture processing requirements of modern games, NVIDIA engineers determined that eight texture units per SMM is the best architectural balance for Maxwell; therefore, the total number of texture units is the same as Kepler, 128. However, thanks to GeForce GTX 980’s higher clocks, texture fill rate improves by 12% from one generation to the next. To improve performance in high AA/high resolution gaming scenarios, we doubled the number of ROPs from 32 to 64. Again, thanks to the added benefit of higher clocks, pixel fill-rate is actually more than double that of GTX 680: 72 Gpixels/sec for GTX 980 versus 32.2 Gpixels/sec for GTX 680.
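The fill-rate figures can be cross-checked against each other. The sketch below assumes the usual convention that peak pixel fill rate is ROP count times core clock and peak texture fill rate is texture unit count times core clock. The clocks are not given in the text, so they are backed out of the quoted pixel fill rates rather than taken from a spec sheet:

```python
def implied_clock_ghz(gpixels_per_sec: float, rops: int) -> float:
    """Clock implied by a peak pixel fill rate, assuming one pixel per ROP per clock."""
    return gpixels_per_sec / rops

# Pixel fill rates and ROP counts as quoted in the text.
GTX980_GPIXELS, GTX980_ROPS = 72.0, 64
GTX680_GPIXELS, GTX680_ROPS = 32.2, 32
TEX_UNITS = 128  # same on both cards, per the text

gtx980_clock = implied_clock_ghz(GTX980_GPIXELS, GTX980_ROPS)
gtx680_clock = implied_clock_ghz(GTX680_GPIXELS, GTX680_ROPS)

# Assumed model: one texel per texture unit per clock.
gtx980_gtexels = TEX_UNITS * gtx980_clock
gtx680_gtexels = TEX_UNITS * gtx680_clock

texture_gain = gtx980_gtexels / gtx680_gtexels - 1
pixel_ratio = GTX980_GPIXELS / GTX680_GPIXELS

print(f"implied clocks: {gtx980_clock:.3f} GHz vs {gtx680_clock:.3f} GHz")
print(f"texture fill-rate gain: {texture_gain:.1%}")
print(f"pixel fill-rate ratio: {pixel_ratio:.2f}x")
```

Under these assumptions the texture fill-rate gain comes out just under 12% and the pixel fill-rate ratio a little above 2.2x, which is consistent with both claims in the paragraph.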
The memory subsystem has also been significantly revamped. GTX 980’s memory clock is over 15% higher than GTX 680’s, and GM204’s cache is larger and more efficient than Kepler’s design, reducing the number of memory requests that have to be made to DRAM. Improvements in our implementation of memory compression provide a further benefit in reducing DRAM traffic, effectively amplifying the raw DRAM bandwidth in the system.
Quote:
You can compare the original Kepler SMX diagram here. You will see the Maxwell SMM units are divided up differently. Firstly, a new PolyMorph 3.0 engine is being used, upgraded from Kepler's PolyMorph 2.0. The PolyMorph engine is the heart of tessellation performance in Kepler and Maxwell. With the architectural improvements to the PolyMorph Engine in Maxwell, plus more SMs in GM204, the engine can achieve up to a 3x performance improvement with high tessellation expansion factors.
Each SMM contains four warp schedulers, and each warp scheduler is capable of dispatching two instructions per warp every clock. The scheduler has been improved to reduce redundant re-computation of scheduling decisions. Maxwell SMM is partitioned into four distinct 32-CUDA core processing blocks, each with its own dedicated resources for scheduling and instruction buffering. Maxwell SMM units feature a 96KB dedicated shared memory, while the L1 caching function has been moved to be shared with the texture caching function.
Maxwell achieves 2x performance per watt vs. Kepler. There is an improved scheduler and new data path organization on Maxwell. Overall there is a 40% improvement in performance per CUDA core. What does that mean? You need fewer CUDA cores to match the same performance. So yes, there are fewer CUDA cores compared to GeForce GTX 780 Ti, but each of these CUDA cores delivers 40% more performance, so that makes up the difference! Plus all the other improvements.
As a result of these changes, each Maxwell CUDA core is able to deliver roughly 1.4x the performance of a Kepler CUDA core, and 2x the performance per watt. At the SM level, with 33% fewer total cores per SM but 1.4x performance per core, each Maxwell SMM can deliver total per-SM performance similar to Kepler’s SMX, and the area savings from this more efficient architecture enabled us to then double the total SM count compared to GK104.
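The per-SM claim is a short piece of arithmetic. A sketch, taking the SMM's 128 cores, the 1.4x per-core figure, and the doubled SM count from the text, and assuming 192 cores per Kepler SMX (the count implied by "33% fewer"; the text does not state it directly):

```python
SMM_CORES = 128        # Maxwell SMM, from the text
SMX_CORES = 192        # Kepler SMX; assumption implied by "33% fewer cores per SM"
PER_CORE_GAIN = 1.4    # Maxwell vs. Kepler performance per core, from the text
SM_COUNT_RATIO = 2     # GM204 has twice the SM count of GK104, from the text

# Fraction of cores removed per SM going from SMX to SMM.
core_reduction = 1 - SMM_CORES / SMX_CORES

# Per-SM throughput of an SMM relative to an SMX (1.0 = equal).
relative_sm_perf = (SMM_CORES * PER_CORE_GAIN) / SMX_CORES

# Rough chip-level ratio at equal clocks; ignores clock speed, memory,
# and every other difference between the two chips.
relative_chip_perf = SM_COUNT_RATIO * relative_sm_perf

print(f"cores per SM reduced by {core_reduction:.0%}")
print(f"SMM vs SMX per-SM performance: {relative_sm_perf:.2f}x")
print(f"GM204 vs GK104 at equal clocks: {relative_chip_perf:.2f}x")
```

On these numbers an SMM lands at roughly 0.93x an SMX, which is the "similar" per-SM performance described above, and doubling the SM count is what carries the chip-level gain.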
The 7xx is pretty much dead in the water.