Shortly after development of the Polaris architecture for mid-range graphics cards and mobile GPUs began, AMD decided to create a second, parallel architecture aimed at other markets. AMD has introduced new FirePro graphics accelerators in recent months, but the company "slept through" the trend toward machine learning, deep learning and artificial intelligence. Because of their architecture, AMD's current GPUs do not perform particularly well in these areas. The Vega architecture is meant to eliminate this drawback. Today AMD published new information about Vega that lets us estimate what to expect in 2017.
In this article we reveal some details of the architecture that we were not allowed to discuss in mid-December, when we published our article from the Radeon Instinct event. Some details of the Vega architecture were already known, but AMD had not officially confirmed them.
In developing the architecture, AMD broadened its requirements: they now cover not only the raw computing performance of the stream processors but also the memory infrastructure. The development of High Bandwidth Memory, GDDR5X and other stacked memory options such as 3D XPoint from Intel and HMC from Micron clearly shows that memory is becoming an important factor. Games write gigabytes of data into memory, which are then read back. In video processing the data volume grows to several petabytes. In the compute field, which AMD plans to attack with the Vega architecture, several exabytes are no longer a rarity. With the new architecture, AMD is trying to correct the imbalance between computing performance and the memory subsystem.
The presentation of the Vega architecture therefore paid considerable attention to the new memory architecture. An important component is the High-Bandwidth Cache, which works with every kind of memory available to the GPU, whether GDDR5, GDDR5X, or first- or second-generation HBM.
The advantages of new memory technologies such as HBM2 are quite obvious. Memory bandwidth doubles to 1,024 GB/s. Capacity reaches up to 32 GB when four stacks of 8 GB each are used.
So far, High Bandwidth Memory has been used in the desktop segment only in the Radeon R9 Fury X family of graphics cards, and only in its first generation. AMD uses four HBM stacks of 1 GB each, with a bandwidth of 128 GB/s per stack. The clock is 500 MHz, giving a total of 4 GB of memory with a bandwidth of 512 GB/s.
The Tesla P100 GPU accelerator, which NVIDIA introduced last year, is the only model with second-generation HBM. The chip uses four memory stacks, each with a capacity of 4 GB and a bandwidth of 180 GB/s. In total, the Tesla P100 gets 16 GB of memory with a bandwidth of 720 GB/s. The HBM2 clock can theoretically be doubled, which would give 256 GB/s per stack, but NVIDIA apparently did not want to take that risk with the Tesla P100, or Samsung's memory chips could not run at such a high clock.
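The per-stack figures for the Fury X and the Tesla P100 follow from simple arithmetic. The Python sketch below reproduces them under two assumptions that are not stated in this article: a 1,024-bit interface per stack, as defined in the HBM specification, and two transfers per clock. The function and constant names are purely illustrative.

```python
# Assumption: each HBM stack has a 1,024-bit interface (HBM specification)
# and transfers data twice per clock (double data rate).
BUS_WIDTH_BITS = 1024
TRANSFERS_PER_CLOCK = 2

def stack_bandwidth_gbs(clock_mhz):
    """Peak bandwidth of one stack in GB/s for a given memory clock in MHz."""
    bits_per_second = BUS_WIDTH_BITS * TRANSFERS_PER_CLOCK * clock_mhz * 1e6
    return bits_per_second / 8 / 1e9

def implied_memory_clock_mhz(bandwidth_gbs):
    """Memory clock in MHz implied by a per-stack bandwidth in GB/s."""
    bits_per_second = bandwidth_gbs * 1e9 * 8
    return bits_per_second / (BUS_WIDTH_BITS * TRANSFERS_PER_CLOCK) / 1e6

hbm1 = stack_bandwidth_gbs(500)       # first-generation HBM on the Fury X
hbm2_max = stack_bandwidth_gbs(1000)  # HBM2 with the clock doubled
p100_clock = implied_memory_clock_mhz(180)

print(f"Fury X: {hbm1:.0f} GB/s per stack, {4 * hbm1:.0f} GB/s total")
print(f"HBM2 at 1,000 MHz: {hbm2_max:.0f} GB/s per stack")
print(f"Tesla P100: 180 GB/s per stack implies about {p100_clock:.0f} MHz")
```

Under these assumptions, the 180 GB/s per stack of the Tesla P100 corresponds to a memory clock of roughly 700 MHz, well below the theoretical 1,000 MHz.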
Unlike HBM1, HBM2 memory can be configured more flexibly. In particular, HBM2 stacks can be built with capacities of 2 GB (2Hi HBM2), 4 GB (4Hi HBM2), 8 GB (8Hi HBM2) and 16 GB (16Hi HBM2). So far, SK Hynix and Samsung have released HBM2 stacks only up to 8Hi. The 16Hi HBM2 specification has been approved, but for now it exists only on paper. The number of stacks can also vary, depending on the total capacity and bandwidth required.
Some advantages of High Bandwidth Memory were already visible in the first generation on the Radeon R9 Fury X. It delivered a very high bandwidth of 512 GB/s, while HBM1 was about twice as efficient because it consumes less power. Finally, High Bandwidth Memory sits on the substrate next to the GPU, inside the same package, so the memory takes up far less space on the PCB.
Unfortunately, the amount of DRAM cannot simply be increased. Doing so would require a larger PCB area, and power consumption would rise noticeably. A heterogeneous memory structure offers some flexibility. To support it, AMD designed the HBCC (High-Bandwidth Cache Controller), which manages and works with the different memory technologies.
In the real world, memory capacity and speed are not the only things that matter; many other factors improve the efficiency of the system as a whole. Some data, for example, does not need to be placed in fast memory, because the GPU processes it at a fixed rate anyway. Control over data placement therefore shifts from game developers to AMD's architecture. Until now, developers have simply used all the memory available, since they have full access to it. A high-quality, efficient implementation of their own, of course, costs time and money.
The Radeon Pro SSG with its SSD was the first step in this direction, and the Vega architecture supports the approach fully. It allows different memory technologies to be combined, while the hardware takes on the task of distributing the data and using the resources efficiently.
New concepts and features were announced as part of this new paradigm. AMD now calls the video memory and frame buffer the High-Bandwidth Cache. It works together with the new High Bandwidth Cache Controller (HBCC), which distributes data across network storage, system memory and the High-Bandwidth Cache. In total, the HBCC can provide up to 512 TB of virtual address space. The 49-bit memory addressing allows all GPUs to be combined in a single address space, and memory pools can be created. A single physical device may have a capacity of up to 256 TB. The HBCC decides where the data is ultimately written: to fast or to slow memory. Control takes place at the driver level (a technique first used with Fiji). In addition, memory access is optimized, because at present only about half of the data in the frame buffer is actually used later.
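The 512 TB figure follows directly from 49-bit addressing, as a short Python check shows. The 256 TB limit for a single physical device would correspond to 48 address bits; that bit width is an assumption made here for illustration and is not confirmed by AMD in this article.

```python
TIB = 2 ** 40  # bytes in one binary terabyte (TiB)

def addressable_tib(address_bits):
    """Size in TiB of a byte-addressed space with the given number of address bits."""
    return 2 ** address_bits // TIB

virtual_space_tib = addressable_tib(49)    # HBCC virtual address space
physical_device_tib = addressable_tib(48)  # assumption: 48-bit physical addressing

print(f"49-bit addressing: {virtual_space_tib} TB")
print(f"48-bit addressing: {physical_device_tib} TB")
```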
The following example clearly shows how inefficiently memory is used, despite many optimizations. A sample scene from Deus Ex: Mankind Divided contains 210 million polygons, but only 2 million of them are visible…
It is not yet clear what impact the new memory technology will have. To what extent will the driver control the controller, and how much influence will programmers have over its work? In DirectX 12, memory management is in the hands of developers. NVIDIA and AMD, for their part, continue to apply optimizations tailored to their respective GPUs.
In addition to the memory subsystem, AMD made other changes to the Vega architecture, including changes to the geometry pipeline. The pipeline's throughput per clock has doubled compared with previous AMD architectures.
Let us take a short excursion into AMD's GPU architectures. In recent years AMD has marketed its architecture and instruction sets under the Graphics Core Next (GCN) brand. AMD began this tradition with the first generation in the Radeon HD 7700 family. Internally the architecture was called GFX7. After that, the internal architecture numbering and the GCN generation numbering diverged. GCN 2.0 and GCN 3.0 (or GFX8) followed, and the Polaris architecture, introduced last summer, received the GCN 4.0 designation.
The Vega architecture brought many changes, which led to a change in numbering: the Vega architecture is called GFX9. The Vega 10 and Vega 11 variants, meanwhile, remain within the diverging GCN numbering scheme.
In the geometry pipeline, vertex and geometry shaders are no longer handled separately. They are combined into a primitive shader. It covers both vertex and geometry calculations but distributes the load across the available resources more effectively, helped by an improved load balancer unit. As a result, the stream processors are utilized more efficiently. The GPU continuously monitors the load in order to achieve an ideal distribution.
The third important component of the new architecture is the new Compute Engine, which has also changed significantly. In particular, its compute units now carry the name NCU, short for "Next-Generation Compute Unit."
The new Compute Engine can perform 512 8-bit operations, 256 16-bit operations or 128 32-bit operations per clock. This gives the Vega architecture a 4:2:1 ratio for these operations, as in the Hawaii GPU. Double precision, however, does not look as favorable. AMD uses a technique that NVIDIA calls mixed precision: a 32-bit register used for one 32-bit operation can be split into two 16-bit registers for two 16-bit operations. This step is very important for machine learning. In addition, other optimizations give the NCU higher single-threaded performance.
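The idea of splitting a 32-bit register into two 16-bit halves can be illustrated in Python. This is only a software model of the concept, not AMD's hardware or instruction set; the helper names `pack_half2`, `unpack_half2` and `packed_add` are hypothetical.

```python
import struct

# Operations per clock quoted for the new Compute Engine, keyed by operand width.
OPS_PER_CLOCK = {8: 512, 16: 256, 32: 128}

def pack_half2(lo, hi):
    """Pack two FP16 values into one 32-bit word (lo in the low 16 bits)."""
    lo_bits, = struct.unpack("<H", struct.pack("<e", lo))
    hi_bits, = struct.unpack("<H", struct.pack("<e", hi))
    return (hi_bits << 16) | lo_bits

def unpack_half2(word):
    """Split a 32-bit word back into its two FP16 values."""
    lo, = struct.unpack("<e", struct.pack("<H", word & 0xFFFF))
    hi, = struct.unpack("<e", struct.pack("<H", word >> 16))
    return lo, hi

def packed_add(a, b):
    """Add two packed registers: one 32-bit slot carries two FP16 additions."""
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    return pack_half2(a_lo + b_lo, a_hi + b_hi)

# Every operand width consumes the same number of register bits per clock,
# which is where the 4:2:1 ratio comes from.
for width, ops in OPS_PER_CLOCK.items():
    print(f"{width:2d}-bit: {ops} ops x {width} bits = {ops * width} bits per clock")

result = packed_add(pack_half2(1.5, 0.25), pack_half2(2.0, 0.5))
print(unpack_half2(result))
```

The values in the example are chosen so that they are exactly representable in FP16; in general, packed 16-bit arithmetic trades precision for throughput.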
The Pixel Engine block has also been improved. It now supports the Draw Stream Binning Rasterizer, a technology that provides better data compression. It reduces video memory requirements, which in turn benefits memory bandwidth. A scene contains objects that may or may not be visible to the rasterizer. The Draw Stream Binning Rasterizer algorithm discards the pixels of objects that are not visible, so they no longer need to be shaded. All this reduces both the memory space consumed and the memory bandwidth required. In previous AMD GPU architectures, pixel and texture memory were not synchronized, so data was often duplicated. In the Vega architecture, the Geometry Pipeline, the Compute Engine and the Pixel Engine share the L1 and L2 caches. The same applies to the back end of the rendering pipeline.
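The principle of binning fragments and shading only the visible ones can be shown with a toy model in Python. This is a conceptual sketch only and makes no claim about how AMD's Draw Stream Binning Rasterizer works internally; the tile size, the data layout and the function name `bin_and_resolve` are assumptions for illustration.

```python
from collections import defaultdict

TILE = 4  # hypothetical tile edge length in pixels

def bin_and_resolve(fragments):
    """Bin fragments into screen tiles and keep only the nearest one per pixel.

    fragments: iterable of (x, y, depth, object_id); a smaller depth is closer.
    Returns a dict mapping (x, y) to the object_id that has to be shaded.
    """
    bins = defaultdict(list)
    for x, y, depth, obj in fragments:
        bins[(x // TILE, y // TILE)].append((x, y, depth, obj))

    visible = {}
    for tile_fragments in bins.values():
        nearest = {}  # per-tile depth test, small enough to stay on chip
        for x, y, depth, obj in tile_fragments:
            if (x, y) not in nearest or depth < nearest[(x, y)][0]:
                nearest[(x, y)] = (depth, obj)
        for pixel, (_, obj) in nearest.items():
            visible[pixel] = obj
    return visible

fragments = [
    (0, 0, 0.9, "wall"),
    (0, 0, 0.2, "crate"),  # hides the wall at pixel (0, 0)
    (1, 0, 0.9, "wall"),
    (5, 5, 0.5, "floor"),
]
visible = bin_and_resolve(fragments)
print(f"{len(fragments)} fragments submitted, {len(visible)} pixels shaded")
```

In this model the hidden wall fragment at pixel (0, 0) is never shaded, which is the saving in shading work and memory traffic that the text describes.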
The growing number of transistors in AMD's GPUs was, of course, a serious challenge. It was nevertheless necessary to increase not only performance but also efficiency. Some of the improvement can be obtained by optimizing the manufacturing process, but a large part comes from the move to a new, more modern architecture. In the case of Vega this brought, for example, a new memory hierarchy as well as an optimized Compute Engine and Pixel Engine. One result of the development work, which began four years ago, is the Infinity Fabric interconnect.
In connection with the Zen or Ryzen processors, the Infinity Fabric or Infinity Control Fabric infrastructure has been mentioned repeatedly. But what lies behind this term? We will try to clarify.
With the Zen and Vega architectures, AMD has moved to a new interconnect called Infinity Fabric, on which it has worked for four years. The name "fabric" suggests a woven mesh, which hints at the structure of the interconnect. According to AMD, Infinity Fabric has a modular design and can be implemented at any level of complexity. This flexibility allows Infinity Fabric to be used in all new processors and GPUs.
Infinity Fabric is divided into a Control Fabric and a Data Fabric. The Control Fabric is responsible for managing the different parts of the chip. Functions such as power management, protection and security features, reset, initialization and testing are built on it. The Data Fabric, on the other hand, provides high-speed data transfer within the architecture. It is also used for the secure connection to memory. In the Vega GPU, the connection to the HBM2 chips runs at up to 512 GB/s. For mobile chips with DDR4 memory attached, 40 to 50 GB/s is quite enough. All this demonstrates how flexibly Infinity Fabric can be implemented.
Infinity Fabric is part not only of the Vega graphics architecture but also of the Summit Ridge (Ryzen) processors and of the Raven Ridge mobile processors, which will be released in the second half of 2017, also under the Ryzen brand. In the Vega architecture, the Infinity Fabric interconnect is implemented as a full mesh topology. The reason is that the GPU contains thousands of stream processors that have to be supplied with data, and efficient data distribution is best achieved with a fully meshed topology. The processors use a less complex Infinity Fabric topology. AMD did not disclose further details, but a ring topology, as used by Intel, for example, seems plausible.
But Infinity Fabric is not merely an interconnect inside a CPU or GPU chip. As AMD points out, Infinity Fabric can also be used to connect sockets. The physical basis of the interconnect is AMD's HyperTransport. Once Zen and Vega reach the market, we will certainly learn more about the Infinity Fabric interconnect.
The first hardware implementations will take some time
During the demonstrations at the Radeon Instinct event, Raja Koduri, head of the Radeon Technologies Group, mentioned that the hardware on display had been produced only a few weeks earlier. The first Vega 10 test chips were apparently made in the summer. AMD followed roughly the same schedule with the Polaris architecture: the first test chips appeared in late November or early December 2015, and the Radeon RX 480 graphics card was released in June 2016, just six months later. The same pace is likely to apply to Vega, so the first graphics cards should not be expected before May or June 2017.
The photographs from the event also reveal some interesting details about the Vega architecture. Raja Koduri held the GPU up to journalists' cameras several times. Two HBM stacks are clearly visible on the GPU package, which means the GPU can work with 8 or 16 GB of HBM2 memory. In the test system, however, we found only 8 GB.
At this point it is not known which implementation of the Vega architecture was shown. Most likely AMD showed Vega 10, that is, the most "junior" version. Let us return to the demo system with Vega and 8 GB of memory. With HBM2, it is logical for AMD to use two memory stacks. With 4Hi HBM2 stacks we get 8 GB, and at a clock of 1,000 MHz the bandwidth is 256 GB/s per stack, or 512 GB/s in total.
If we assume that the Radeon Instinct MI25, for which AMD quotes a computing performance of 25 TFLOPS (FP16), is also based on the Vega GPU and uses 4,096 stream processors, the GPU clock works out to roughly 1,520 MHz. We will see what clocks AMD sets for desktop graphics cards. Judging by the frame rate in DOOM at a resolution of 3,840 x 2,160 pixels with Ultra settings, the first Vega graphics cards will be a little faster than the GeForce GTX 1080. A more accurate estimate is not yet possible.
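The clock estimate can be reproduced with a short calculation. The sketch below assumes that each stream processor completes one fused multiply-add per clock (counted as two floating-point operations) and that packed math doubles the FP16 rate; both factors are assumptions of this estimate rather than figures confirmed by AMD, and the function name is illustrative.

```python
def implied_gpu_clock_mhz(tflops_fp16, stream_processors,
                          flops_per_fma=2, fp16_ops_per_lane=2):
    """GPU clock in MHz implied by a peak FP16 rate.

    Assumptions: one fused multiply-add per stream processor per clock,
    counted as two operations, and two packed FP16 values per 32-bit lane.
    """
    ops_per_clock = stream_processors * flops_per_fma * fp16_ops_per_lane
    return tflops_fp16 * 1e12 / ops_per_clock / 1e6

clock = implied_gpu_clock_mhz(25, 4096)
print(f"Implied GPU clock: about {clock:.0f} MHz")
```

The result is about 1,526 MHz, in line with the roughly 1,520 MHz quoted above; the small difference presumably comes from the rounded 25 TFLOPS figure.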
AMD will reveal the technical specifications gradually, and we will keep our readers up to date. The chips are likely to be manufactured on a 14 nm process. Judging from the photos of the GPU and the known footprint of the HBM2 stacks (7.75 mm × 11.87 mm, or 91.99 mm²), the chip area is between 520 and 540 mm². The Polaris 10 GPU packs 5.7 billion transistors into an area of 232 mm². For a Vega 10 chip with an area of 520 mm² we can therefore expect about 12.8 billion transistors, which puts the chip at the complexity level of the GP102 GPU in the NVIDIA Titan X. If the area significantly exceeds 500 mm², the first Vega GPU will be much more complex than originally anticipated, and its raw computing performance could be significantly higher than that of the NVIDIA GeForce GTX 1080.
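The transistor estimate is a simple scaling of the Polaris 10 figures. The sketch below assumes that Vega 10 reaches the same transistor density as Polaris 10 on the same 14 nm process, which is an assumption of this estimate and not something AMD has confirmed.

```python
POLARIS10_TRANSISTORS = 5.7e9
POLARIS10_AREA_MM2 = 232

# Assumption: Vega 10 reaches the same transistor density as Polaris 10.
DENSITY_PER_MM2 = POLARIS10_TRANSISTORS / POLARIS10_AREA_MM2

def estimate_transistors(area_mm2):
    """Estimated transistor count for a die of the given area in mm²."""
    return DENSITY_PER_MM2 * area_mm2

hbm2_area_mm2 = 7.75 * 11.87  # footprint of one HBM2 stack in mm²

print(f"HBM2 stack footprint: {hbm2_area_mm2:.2f} mm²")
print(f"Density: {DENSITY_PER_MM2 / 1e6:.1f} million transistors per mm²")
for area in (520, 540):
    print(f"{area} mm²: about {estimate_transistors(area) / 1e9:.1f} billion transistors")
```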
In any case, we will have to wait for the first graphics cards with the Vega GPU. By the time they are announced, AMD will probably share more information. Let us hope that we soon get answers to our questions about clock speeds and performance.