Why the AI era is forcing a redesign of the entire computing backbone






The past few decades have seen almost unimaginable progress in computing performance and efficiency, enabled by Moore's Law and underpinned by commodity hardware and loosely coupled software. This architecture has delivered online services to billions of people worldwide and placed nearly all of human knowledge within reach.

But the next computing revolution will demand far more. Fulfilling the promise of AI requires a step change in capabilities well beyond the advances of the internet era. To achieve it, we as an industry must revisit some of the foundations that drove the previous transformation and innovate collectively to rethink the entire technology stack. Let's explore the forces driving this disruption and the shape of the architecture that will emerge.

From commodity hardware to specialized compute

For decades, the prevailing trend in computing was consolidating computation onto general-purpose architectures built from nearly identical commodity servers. That uniformity allowed flexible workload placement and efficient resource utilization. The demands of gen AI, which leans heavily on predictable mathematical operations over enormous datasets, are reversing this trend.

We are now witnessing a decisive shift toward specialized hardware, including ASICs, GPUs and tensor processing units (TPUs), that delivers order-of-magnitude gains in performance per dollar and per watt compared with general-purpose CPUs. This proliferation of domain-specific compute units, optimized for narrower tasks, will be essential to sustaining the rapid pace of progress in AI.




Beyond Ethernet: scaling up with specialized interconnects

These specialized systems often require "all-to-all" communication, with terabit-per-second bandwidth and latencies that approach local memory speeds. Today's networks, built largely on Ethernet and TCP/IP, are not equipped to handle such extreme demands.

As a result, to scale gen AI workloads across vast clusters of specialized accelerators, we are seeing the rise of specialized interconnects, such as ICI for TPUs and NVLink for GPUs. These purpose-built networks prioritize direct memory-to-memory transfers and use dedicated hardware to accelerate information sharing among processors, effectively bypassing the overhead of traditional layered networking stacks.
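To make concrete what these interconnects accelerate, here is a minimal, pure-Python simulation of a ring all-reduce, the bandwidth-efficient collective at the heart of distributed training. This is a sketch for illustration only, not any vendor's API:

```python
# Simulation of a ring all-reduce: every worker contributes a vector,
# and every worker ends up holding the elementwise sum.  Real
# interconnects (ICI, NVLink) run these transfers as direct
# memory-to-memory copies in hardware.

def ring_all_reduce(vectors):
    n = len(vectors)                     # number of workers in the ring
    size = len(vectors[0])
    chunks = [list(v) for v in vectors]  # each worker's working copy
    # Chunk c covers indices [bounds[c], bounds[c + 1])
    bounds = [c * size // n for c in range(n + 1)]

    # Phase 1: reduce-scatter.  After n-1 steps, worker w holds the
    # fully reduced chunk (w + 1) % n.
    for step in range(n - 1):
        for w in range(n):
            c = (w - step) % n           # chunk worker w passes along
            dst = (w + 1) % n
            for i in range(bounds[c], bounds[c + 1]):
                chunks[dst][i] += chunks[w][i]

    # Phase 2: all-gather.  Completed chunks circulate around the
    # ring; receivers overwrite instead of adding.
    for step in range(n - 1):
        for w in range(n):
            c = (w + 1 - step) % n
            dst = (w + 1) % n
            chunks[dst][bounds[c]:bounds[c + 1]] = chunks[w][bounds[c]:bounds[c + 1]]
    return chunks
```

Each worker sends roughly 2(n-1)/n times its data volume in total, which is why the ring pattern is considered bandwidth-optimal; the hardware's job is to make each of those hops as close to memory speed as possible.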

This move toward tightly integrated, compute-centric networking will be essential to overcoming communication bottlenecks and efficiently scaling the next generation of AI.

Breaking through the memory wall

For decades, gains in compute performance have outpaced growth in memory bandwidth. While techniques such as caching and stacked SRAM partially mitigate this, the data-intensive nature of AI only exacerbates the problem.

The insatiable need to feed ever more powerful compute units has made high-bandwidth memory (HBM), which stacks memory directly on the processor package to boost bandwidth and reduce latency, increasingly essential. Yet even HBM faces fundamental limits: the physical perimeter of the chip restricts total data flow, and moving enormous datasets at terabit speeds imposes significant energy costs.

These constraints highlight the critical need for higher-bandwidth connectivity and underscore the urgency of breakthroughs in processing and memory architecture. Without such innovations, our powerful compute resources will sit idle waiting for data, sharply limiting efficiency and scale.
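The memory wall can be made concrete with back-of-envelope roofline arithmetic. The figures below are illustrative assumptions, not the specifications of any particular accelerator:

```python
# Roofline back-of-envelope: is a kernel limited by compute or by
# memory bandwidth?  The two constants are assumptions for
# illustration, not real chip specs.

PEAK_FLOPS = 1.0e15       # assume 1 PFLOP/s of matrix compute
HBM_BYTES_PER_S = 3.0e12  # assume 3 TB/s of HBM bandwidth

def attainable_flops(arithmetic_intensity):
    """Peak achievable FLOP/s for a kernel performing
    `arithmetic_intensity` FLOPs per byte of HBM traffic."""
    return min(PEAK_FLOPS, HBM_BYTES_PER_S * arithmetic_intensity)

# A kernel must exceed this many FLOPs per byte to be compute-bound:
machine_balance = PEAK_FLOPS / HBM_BYTES_PER_S   # ~333 FLOPs/byte

# A bandwidth-bound op (e.g. an elementwise add on fp32, roughly
# 1 FLOP per 12 bytes moved) reaches a tiny fraction of peak:
elementwise = attainable_flops(1 / 12)
print(f"machine balance: {machine_balance:.0f} FLOPs/byte")
print(f"elementwise add: {elementwise / PEAK_FLOPS:.2%} of peak")
```

The point of the exercise: unless arithmetic intensity is very high, the compute units starve, which is exactly why bandwidth, not raw FLOPs, is the binding constraint.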

From server farms to high-density systems

Advanced machine learning (ML) models often depend on carefully orchestrated computation across tens to hundreds of thousands of identical compute elements, consuming tremendous power. This tight coupling at microsecond granularity imposes new demands. Unlike systems that tolerate heterogeneity, ML computations require homogeneous elements; mixing hardware generations would bottleneck the faster units. Communication paths must also be planned and highly efficient, since a delay in a single element can stall an entire job.
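Why homogeneity matters can be seen in a few lines: a synchronous step completes only when the slowest worker does, so step time is the maximum, not the average, of per-worker times. The numbers here are hypothetical:

```python
import random

# A synchronous training step ends at a barrier: every worker waits
# for the slowest one.  The step times below are made-up
# illustrations, not measurements.

def step_time(worker_times):
    return max(worker_times)   # barrier semantics: the straggler sets the pace

random.seed(0)
n_workers = 10_000
# Homogeneous fleet: every chip takes ~10 ms per step.
fast = [random.uniform(9.5, 10.5) for _ in range(n_workers)]
# Same fleet with one older-generation chip that takes 20 ms.
mixed = fast[:-1] + [20.0]

print(step_time(fast))    # just under 10.5 ms
print(step_time(mixed))   # 20.0 ms: a single slow chip halves throughput
```

One mismatched element out of ten thousand doubles the step time for everyone, which is why these systems demand identical hardware and carefully planned communication paths.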

These extreme demands for coordination and power drive the need for unprecedented compute density. Minimizing the physical distance between processors becomes essential to reduce latency and energy consumption, paving the way for a new class of ultra-dense AI systems.

This drive for extreme density and tightly coordinated computation fundamentally changes optimal infrastructure design, demanding a radical rethink of physical layout and dynamic power management to prevent performance bottlenecks and maximize efficiency.

A new approach to fault tolerance

Traditional fault tolerance relies on redundancy across loosely connected systems to achieve high uptime. ML computing demands a different approach.

First, the sheer scale of the computation makes heavy over-provisioning too costly. Second, model training is a tightly synchronized process, so a single failure among thousands of processors can stall everything. Finally, advanced ML hardware often pushes the limits of current technology, which can mean higher failure rates.

Instead, the emerging strategy combines frequent checkpointing (periodically saving computation state) with real-time monitoring, spare resources and rapid restart. The underlying hardware and network design must enable fast failure detection and seamless component replacement to sustain performance.
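A minimal sketch of the checkpoint-and-restart pattern, assuming a single-process toy loop. Production systems shard state across accelerators and write to distributed storage; all names here are illustrative:

```python
import os
import pickle

# Checkpoint/restart sketch: save training state periodically so a
# failure costs at most one checkpoint interval of work.

def save_checkpoint(path, step, state):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:            # write to a temp file first,
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)                 # then atomically rename

def load_checkpoint(path):
    if not os.path.exists(path):
        return 0, {"loss": None}          # fresh start
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps, ckpt_every=10):
    step, state = load_checkpoint(path)   # resume after any failure
    while step < total_steps:
        state = {"loss": 1.0 / (step + 1)}  # stand-in for a real step
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(path, step, state)
    return step, state
```

The write-then-rename in `save_checkpoint` matters: `os.replace` is atomic on POSIX filesystems, so a crash mid-checkpoint leaves the previous checkpoint intact rather than a corrupt file, and a restarted job loses at most `ckpt_every` steps of work.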

A more sustainable approach to power

Today and for the foreseeable future, access to power is the primary bottleneck to scaling AI compute. Where traditional system design chases maximum performance per chip, we must shift to an end-to-end design that maximizes delivered performance per watt at scale. This approach is vital because it treats all components of the system (compute, network, memory, power delivery, cooling and fault tolerance) as a whole that must work together to sustain performance. Optimizing components in isolation leaves substantial overall system efficiency on the table.

As we push for ever greater performance, individual chips demand more power than traditional air-cooled data centers can dissipate. This necessitates a shift toward more complex, but ultimately more efficient, liquid cooling, and a fundamental redesign of cooling infrastructure at the data center level.

Beyond cooling, traditional redundant power sources, such as dual utility feeds and diesel generators, impose significant cost and slow the delivery of new capacity. Instead, we should combine diverse, multi-gigawatt power sources, orchestrated by microgrid controllers in real time. By exploiting the flexibility of AI workloads and their geographic distribution, we can deliver more capacity without expensive backup systems that are needed for only a few hours a year.

This evolved power model enables real-time response to power availability, from pausing computation during shortages to advanced techniques such as frequency scaling for workloads that can tolerate reduced performance. All of this requires real-time telemetry and actuation at a granularity not available today.
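The control loop such a power model implies can be sketched simply; the policy and thresholds below are hypothetical, not a description of any deployed system:

```python
# Sketch of workload-aware demand response: a hypothetical microgrid
# controller trades compute throughput against available power.
# The policy and the 40% floor are invented for illustration.

def plan_frequency(available_mw, full_load_mw, min_fraction=0.4):
    """Return the clock fraction at which to run flexible training
    jobs: 1.0 = full speed, 0.0 = checkpoint and pause."""
    if available_mw >= full_load_mw:
        return 1.0                       # power surplus: run at full speed
    fraction = available_mw / full_load_mw
    if fraction < min_fraction:
        return 0.0                       # below a useful floor: pause
    return fraction                      # scale clocks to fit the budget

assert plan_frequency(120, 100) == 1.0   # surplus
assert plan_frequency(70, 100) == 0.7    # throttle to the budget
assert plan_frequency(30, 100) == 0.0    # shortage: pause and checkpoint
```

Even this toy controller shows the telemetry requirement: it is only as good as its real-time view of `available_mw`, which is precisely the measurement and actuation capability the article argues does not yet exist at scale.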

Security and privacy: baked in, not bolted on

A decisive lesson of the internet era is that security and privacy cannot be effectively bolted onto an existing architecture. Threats from ever more sophisticated actors will only grow, so protections for user data and proprietary intellectual property must be woven into the fabric of ML infrastructure. One important observation is that AI will ultimately amplify attackers' capabilities. That, in turn, means we must ensure AI simultaneously strengthens our defenses at least as much.

This includes end-to-end data encryption, robust data provenance tracking with verifiable access logs, hardware-enforced security enclaves to protect sensitive computation, and sophisticated key management systems. Integrating these safeguards from the ground up will be essential to protecting users and maintaining their trust. Real-time monitoring of what will amount to petabits per second of telemetry and logging will be key to identifying and neutralizing needle-in-a-haystack threats, including those originating from insiders.
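One of these primitives, a verifiable access log, can be illustrated with a hash chain, where each entry's digest commits to the entire history before it. This is a sketch of the idea, not a production design:

```python
import hashlib
import json

# Tamper-evident access log: each entry's hash covers the previous
# entry's hash plus the current record, so rewriting history breaks
# the chain.  Record fields are illustrative.

def append(log, record):
    prev = log[-1]["hash"] if log else "0" * 64
    body = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    log.append({"record": record, "hash": digest})

def verify(log):
    prev = "0" * 64
    for entry in log:
        body = json.dumps(entry["record"], sort_keys=True)
        if hashlib.sha256((prev + body).encode()).hexdigest() != entry["hash"]:
            return False                 # chain broken: tampering detected
        prev = entry["hash"]
    return True

log = []
append(log, {"user": "svc-trainer", "object": "dataset/shard-17", "op": "read"})
append(log, {"user": "svc-eval", "object": "model/ckpt-9", "op": "read"})
assert verify(log)
log[0]["record"]["user"] = "attacker"    # retroactive edit...
assert not verify(log)                   # ...is caught by verification
```

A real system would anchor the chain head in tamper-resistant hardware or replicate it externally so an insider cannot simply recompute the whole chain; the sketch shows only the core tamper-evidence property.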

Speed as a strategic necessity

The pace of hardware innovation is accelerating dramatically. Unlike the incremental evolution of traditional infrastructure, deploying ML supercomputers demands a fundamentally different approach. ML compute does not run well on heterogeneous deployments; the compute code, algorithms and compilers must be tuned specifically to each new generation of hardware to exploit its full capabilities. The rate of innovation is also unprecedented, with new hardware often delivering integer-factor performance improvements year over year.

Therefore, instead of incremental upgrades, what is needed now is the massive, synchronized deployment of homogeneous hardware, often across entire data centers. With annual hardware refreshes delivering substantial performance gains, the ability to stand up these enormous AI engines quickly becomes critical.

The goal should be to compress the timeline from chip design to full deployment of systems exceeding 100,000 chips, so that hardware efficiency improvements land while the algorithms they support are still current. This requires radical acceleration and automation at every stage, demanding a manufacturing-like model for infrastructure. From architecture to monitoring and repair, every step must be streamlined and automated to take full advantage of each new hardware generation.

Meeting the moment: a collective effort for next-generation AI infrastructure

Gen AI is not merely an evolution; it is a revolution that demands a radical reimagining of our computing infrastructure. The challenges ahead, in specialized hardware, interconnect networks and sustainable operations, are significant, but so are the transformative capabilities they will unlock.

It is easy to see that the resulting compute infrastructure will be unrecognizable within a few years, which means we cannot simply improve on the blueprints we have already drawn. Instead we must, collectively and spanning research to industry, re-examine the requirements of AI computing from first principles and build a new blueprint for the underlying global infrastructure. This, in turn, will unlock fundamentally new capabilities, from medicine to education to business, at unprecedented scale and efficiency.

Amin Vahdat is VP and GM for machine learning, systems and Cloud AI at Google Cloud.


