Artificial intelligence is not in a slump; it is generating a flood of applications. But many AI applications are highly scenario-specific, requiring not only proven CPUs and GPUs but also entirely new AI processors. The Intelligence Processing Unit (IPU) is one such architecture: it provides an entirely new platform for AI computing and is already seeing success in areas such as finance, healthcare, telecom, robotics, the cloud, and the Internet.
With UK startup Graphcore now mass-producing IPUs, a third class of AI processor has entered the market. Will IPUs become the revolutionary architecture they promise to be by outperforming CPUs and GPUs on the AI tasks those chips handle poorly?
1. How can IPUs bridge the gap between chip and AI application?
At the end of last year, Leiphone (LeiFeng.com) explained the uniqueness of Graphcore's IPU architecture in the article "AI Disrupts GPUs: IPUs Usher in the Third Revolution in Computer History". Here we take a closer look at Graphcore's mass-produced IPU, the GC2. The processor contains 1,216 IPU Tiles, each with its own IPU core for compute and its own In-Processor Memory. In total, the GC2 supports 7,296 threads and can run 7,296 programs in parallel.
Built on TSMC's 16 nm process, the GC2 integrates 23.6 billion transistors and delivers 125 TFLOPS of mixed-precision compute at 120 W. Its 300 MB of on-chip SRAM can hold a full model, backed by 15 TB/s of memory bandwidth, 8 TB/s of on-chip exchange, and 2.5 Tbps of chip-to-chip IPU-Links.
The IPU adopts a distributed on-chip memory system, breaking through the storage bottleneck of AI chips. But as Lu Tao, Graphcore's Vice President of Sales and General Manager for China, said at a recent media briefing, "There are many gaps between a chip and real deployment, including whether there is a good toolchain, how much software is available, how many software libraries support it, and how many mainstream algorithms, frameworks, and operating systems it works with."
In other words, an AI chip only reaches real deployments if easy-to-use software lets developers exploit it. For IPUs, efficient parallel programming of the hardware is a major challenge because of the architecture's specificity. To address it, Graphcore adopted BSP (Bulk Synchronous Parallel) in the GC2, a technique also used by Google, Facebook, and Baidu to build large data-center clusters. With hardware support for the BSP protocol, the entire computation is divided into compute, synchronization, and exchange phases.
"With BSP, software engineers and developers don't have to deal with locks at all. It is a very user-friendly innovation: users don't need to care whether there are 1,216 cores (Tiles) or more than 7,000 threads, or on which cores a task is executing," said Lu Tao.
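The compute/synchronize/exchange cycle described above can be sketched in a few lines of plain Python. This is only an illustrative model of a BSP "superstep", not Graphcore's implementation: each worker does purely local compute, waits at a barrier, then exchanges data, so no locks are needed anywhere in the compute phase.

```python
import threading

# Illustrative sketch of Bulk Synchronous Parallel (BSP): each worker
# repeats compute -> synchronize -> exchange "supersteps". No locks are
# needed because writes only happen in the exchange phase, between barriers.

NUM_WORKERS = 4
NUM_SUPERSTEPS = 3

barrier = threading.Barrier(NUM_WORKERS)
mailboxes = [0] * NUM_WORKERS   # one "mailbox" per worker, written only during exchange
results = [0] * NUM_WORKERS

def worker(tile_id: int) -> None:
    local = tile_id  # each tile starts from its own local state
    for _ in range(NUM_SUPERSTEPS):
        # 1. Compute: purely local work on in-processor memory.
        local += 1
        # 2. Synchronize: all tiles wait at the barrier.
        barrier.wait()
        # 3. Exchange: send local state to the next tile.
        mailboxes[(tile_id + 1) % NUM_WORKERS] = local
        barrier.wait()  # exchange must finish before the next compute phase
        local = mailboxes[tile_id]
    results[tile_id] = local

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

After three supersteps, each worker holds a value that originated three tiles away, yet the code never used a lock, which is exactly the programming convenience the quote above describes.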
Graphcore then launched Poplar, a software stack that sits between machine-learning frameworks and the hardware, providing a complete toolchain and libraries built around computational graphs. Poplar reportedly offers more than 50 optimization functions across over 750 high-performance compute elements, supports standard machine-learning frameworks such as TensorFlow 1 and 2, ONNX, and PyTorch, and will soon support PaddlePaddle.
Containerized deployments are also supported for rapid setup. Within the standards ecosystem, virtualization and security technologies such as Docker, Kubernetes, and Microsoft Hyper-V are supported. On the operating-system side, three widely used Linux distributions are supported: Ubuntu, Red Hat Enterprise Linux, and CentOS.
In May, Graphcore released an analysis tool, PopVision Graph Analyser, a visual graphical tool that lets developers and researchers programming on IPUs analyze software performance, debug, and tune. The Poplar developer documentation and community also went live this month.
A number of IPU-based applications already cover various areas of machine learning, including natural language processing, image/video processing, time-series analysis, recommendation/ranking, and probabilistic modeling. On GitHub, Graphcore provides not only articles on model porting but also a large number of example applications and models.
Whether developers need code-level changes when porting models to the IPU is another key question. Lu Tao told Leiphone, "90% of AI developers use an open-source framework and develop in Python, which makes code migration a very low-cost endeavor for them. Even for the remaining developers who work at the performance level, based on NVIDIA cuDNN, we try to provide a user experience similar to cuDNN's, and that effort now seems perfectly acceptable."
2. IPU throughput up to 260 times that of a GPU
With the software question addressed, which scenarios suit IPUs best? "Our strategy for future development is still to do both training and inference. A task can be training alone or inference alone, but we will focus more on scenarios that demand higher accuracy, lower latency, and higher throughput," Lu Tao said.
He added, "The mainstream CV models in wide use today mostly run in INT8, but today's NLP models, as well as the models and advertising algorithms used in some search engines, actually use FP16 or even FP32 as their main data format, because such models have higher accuracy requirements. So beyond INT8, there is also a large cloud-inference market for FP16 and FP32."
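The accuracy trade-off behind that INT8-versus-FP16/FP32 split can be seen in a minimal NumPy sketch of symmetric INT8 quantization. This is a generic textbook scheme, not any vendor's implementation: mapping weights onto 8-bit integers introduces a rounding error of up to half a quantization step, which accuracy-sensitive models may not tolerate.

```python
import numpy as np

# Minimal sketch of symmetric INT8 quantization: map [-max, max] onto
# [-127, 127]. The round-trip error is bounded by half a quantization step,
# which is why accuracy-sensitive models may stay in FP16/FP32.

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.2, size=1000).astype(np.float32)

scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale      # reconstruct the FP32 values

max_err = np.abs(weights - dequant).max()
print(f"scale={scale:.6f}, max quantization error={max_err:.6f}")
```

The larger the dynamic range of the weights, the larger the step `scale` becomes, and with it the worst-case error; FP16/FP32 avoid this coarse rounding at the cost of more memory and bandwidth.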
Zhu Jiang, Sales Director of Graphcore China, pointed out that beyond dense data, large-scale sparse data, which he sees as the direction AI as a whole is heading, shows very clear advantages when processed on IPUs. For large-scale sparse data, a newer convolution approach, grouped convolution, offers accuracy and performance improvements over the dense convolutions of the widely used ResNet family.
Graphcore provides a micro-benchmark that groups convolution kernels and compares group dimensions from 1 to 512. A group dimension of 512 corresponds to dense convolutional networks, with ResNet as the typical application; even there, the GC2's performance is almost double that of the V100. The IPU's advantage grows much larger at small group sizes such as 1 or 32, which correspond to depthwise-style models like MobileNet, where it can multiply performance while greatly reducing latency.
Because grouped-convolution data is not dense enough, it may not map well onto GPUs. The IPU's architecture, by contrast, is designed to exploit grouped convolution, delivering low latency and high throughput that GPUs find difficult, if not impossible, to match.
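To make the grouped-convolution idea concrete, here is an illustrative 1x1 grouped convolution in plain NumPy (not Graphcore's benchmark, and simplified to 1x1 kernels for brevity): with G groups, each output channel only sees C_in/G input channels, cutting both weight count and FLOPs by a factor of G.

```python
import numpy as np

# Illustrative grouped 1x1 convolution: with G groups, each output channel
# connects to only C_in/G input channels, so weights and FLOPs shrink by G.

def grouped_conv1x1(x, w, groups):
    # x: (C_in, H, W); w: (C_out, C_in // groups)
    c_in, h, wd = x.shape
    c_out = w.shape[0]
    cig, cog = c_in // groups, c_out // groups
    out = np.zeros((c_out, h, wd), dtype=x.dtype)
    for g in range(groups):
        xg = x[g * cig:(g + 1) * cig]      # this group's input channels
        wg = w[g * cog:(g + 1) * cog]      # this group's weights
        out[g * cog:(g + 1) * cog] = np.tensordot(wg, xg, axes=([1], [0]))
    return out

c_in, c_out, groups = 64, 64, 8
x = np.random.default_rng(1).normal(size=(c_in, 4, 4))
w = np.random.default_rng(2).normal(size=(c_out, c_in // groups))

y = grouped_conv1x1(x, w, groups)
dense_params = c_out * c_in                # weight count of a dense 1x1 conv
grouped_params = c_out * (c_in // groups)  # 8x fewer parameters here
print(y.shape, dense_params // grouped_params)
```

The per-group loop also shows why GPUs struggle: each group is a small, independent matrix multiply with little data reuse across groups, while the IPU's many small tiles can each take a group.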
Overall, Graphcore's IPU improves natural-language-processing performance by 20%-50% over NVIDIA's V100, and in image classification it achieves a 6x increase in throughput at lower latency. Real-world deployments show equally significant benefits.
In finance, applying algorithms such as Markov Chain Monte Carlo (MCMC) to risk management and algorithmic trading, IPUs can increase the sampling rate by 26 times over GPUs, and for reinforcement learning, which is widely used in finance, the IPU can cut training time to 1/13th. In addition, using an MLP (multi-layer perceptron) with embedded data for sales prediction, IPU throughput is 5.9 times that of a GPU.
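The MCMC sampling workload mentioned above looks, in its simplest textbook form, like the Metropolis-Hastings sampler below. This is a generic sketch of the algorithm for a standard normal target, not Graphcore's or any bank's implementation; the point is that each chain is a long stream of cheap, dependent steps, which is why sampling rate is the metric that matters.

```python
import math
import random

# Textbook Metropolis-Hastings sampler for a standard normal target,
# sketching the kind of MCMC sampling used in risk workloads.

def log_target(x: float) -> float:
    return -0.5 * x * x          # log-density of N(0, 1), up to a constant

def metropolis(n_samples: int, step: float = 1.0, seed: int = 0) -> list:
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)            # symmetric proposal
        # Accept with probability min(1, target(proposal) / target(x)).
        if math.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)
    return samples

draws = metropolis(20000)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(f"mean ~ {mean:.2f}, variance ~ {var:.2f}")  # near 0 and 1 for N(0, 1)
```

Production risk engines run thousands of such chains over far more expensive pricing models, so a 26x sampling-rate improvement translates directly into wall-clock savings.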
IPUs have also demonstrated their advantages in medical and life-science fields such as new drug discovery, medical imaging, medical research, and precision medicine. Using Microsoft's CXR chest X-ray model for COVID-19 image analysis, training that takes 5 hours on NVIDIA GPUs completes in 30 minutes on the IPU.
In telecom, machine intelligence can help analyze variations in wireless data, for example predicting future network performance for planning using LSTM models. In such time-series analysis, IPUs achieved 260 times the throughput of GPUs.
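For readers unfamiliar with the LSTM models mentioned here, a single LSTM cell step in plain NumPy is sketched below. The weights and sizes are invented for illustration; this is the standard LSTM recurrence, not the actual telecom workload.

```python
import numpy as np

# One step of a standard LSTM cell in plain NumPy; sizes and weights are
# made up for illustration. The cell carries a state c through time, which
# is what lets it model time series such as wireless-traffic sequences.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # x: (input,); h, c: (hidden,); W: (4*hidden, input); U: (4*hidden, hidden)
    z = W @ x + U @ h + b
    H = h.shape[0]
    i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]  # input/forget/cell/output
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # update cell state
    h_new = sigmoid(o) * np.tanh(c_new)                # emit hidden state
    return h_new, c_new

rng = np.random.default_rng(0)
input_dim, hidden = 3, 8
W = rng.normal(0, 0.1, (4 * hidden, input_dim))
U = rng.normal(0, 0.1, (4 * hidden, hidden))
b = np.zeros(4 * hidden)

h = c = np.zeros(hidden)
series = rng.normal(size=(10, input_dim))              # 10 time steps
for x in series:
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # final hidden state, e.g. fed to a linear prediction head
```

The step-by-step recurrence makes LSTMs latency-sensitive and hard to batch, which is one plausible reason time-series workloads benefit from a processor with large on-chip memory.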
IPUs can also increase throughput by a factor of 13 for the reinforcement learning required in 5G network slicing and resource management.
For innovative natural-language-processing (NLP) customer experiences, the typical model is BERT; currently, BERT training on the IPU is reportedly more than 25% faster than on the GPU.
The IPU is also making inroads in robotics, where Graphcore has partnered with Imperial College London to help robots perform more complex movements and advanced functions, mostly using spatial AI along with real-time localization and mapping techniques.
More important to Graphcore is the use of IPUs in the cloud and data centers, an early rollout area that is now a major focus of promotion, including Microsoft opening up IPU services on the Azure public cloud and European search-engine company Qwant using IPUs to improve image recognition in search by more than 3.5 times.
3. How to capture the high ground in China's AI market?
"In terms of IPU deployments, our overall strategy right now is still to work with cloud service providers and server vendors, and that is basically the case in every region," he said. Lu Tao admits that IPU deployments in the U.S. have been faster than in China, including Azure opening its IPU service on the public cloud and a partnership with Dell to launch IPU servers.
"This is because users in the U.S. may be more active and there are more researchers, while China focuses more on productization," he explains. "In China, with some of our local partners, developers may be more pragmatic. Adoption may be slower at first, but then it really starts to accelerate and the whole development process becomes very fast."
Lu Tao also revealed that Graphcore is working with Kingsoft Cloud to offer Chinese developers and innovators a free trial of a developer cloud.
On localized product services: "For a long time we have been very willing to customize products to the needs of the Chinese market. As for service, we have two technical teams. The engineering team handles two kinds of work: porting AI algorithm models to the IPU according to the characteristics and needs of local Chinese AI applications, and developing and enhancing features of the machine-learning framework software to meet local users' needs. The field application team assists customers with on-site technical support."
Naturally, Graphcore supports ODLA (Open Deep Learning API), the unified hardware-interface API that Alibaba abstracted for its underlying architectures, as well as Baidu's important deep-learning framework PaddlePaddle, both of which help Graphcore establish the IPU in China.
IPUs can also play to their strengths in future AI trends. "We are seeing a big trend in which training and inference need to be integrated with each other," says Lu Tao, citing online recommendation algorithms and predictive automotive applications as examples. IPUs can meet the needs of both training and inference while playing to their strengths.
In addition, "For algorithm designers, one of the simplest payoffs of grouped convolution is the ability to design models with fewer parameters and higher accuracy. I think this is a big trend for the future."
4. Conclusion
The IPU is a new architecture that many industry experts rate highly. But the journey from innovative architecture to chip to revolutionary product requires easy-to-use software and rich tooling, especially in the cloud chip market, which depends heavily on the software ecosystem. At present, Graphcore already has the corresponding toolchain, deployment support, and real-world cases in finance, healthcare, data centers, and other fields. Meanwhile, Graphcore's next-generation IPU, built on an advanced 7 nm process, is also on the way.
How will the broader market respond after these benchmark customers? And is Graphcore's go-to-market strategy as strong as its product?
Beyond Graphcore, other companies around the world are designing AI chips based on the IPU concept and beginning to roll them out. We are witnessing the dawn of the IPU era.









