A short overview of IBM’s new 7 nm Telum mainframe CPU

Each Telum package consists of two 7nm, eight-core / sixteen-thread processors running at a <em>base</em> clock speed above 5GHz. A typical system will have sixteen of these chips in total, arranged in four-socket
Enlarge / Every Telum bundle consists of two 7nm, eight-core / sixteen-thread processors working at a base clock velocity above 5GHz. A typical system can have sixteen of those chips in whole, organized in four-socket “drawers.”

From the angle of a conventional x86 computing fanatic—or skilled—mainframes are unusual, archaic beasts. They’re bodily huge, power-hungry, and costly by comparability to extra conventional data-center gear, usually providing much less compute per rack at a better value.

This raises the query, “Why hold utilizing mainframes, then?” When you hand-wave the cynical solutions that boil right down to “as a result of that is how we have all the time completed it,” the sensible solutions largely come right down to reliability and consistency. As AnandTech’s Ian Cutress factors out in a speculative piece centered on the Telum’s redesigned cache, “downtime of those [IBM Z] programs is measured in milliseconds per yr.” (If true, that is no less than seven nines.)

IBM’s personal announcement of the Telum hints at simply how totally different mainframe and commodity computing’s priorities are. It casually describes Telum’s reminiscence interface as “able to tolerating full channel or DIMM failures, and designed to transparently recuperate information with out impression to response time.”

While you pull a DIMM from a reside, working x86 server, that server doesn’t “transparently recuperate information”—it merely crashes.

IBM Z-series structure

Telum is designed to be one thing of a one-chip-to-rule-them-all for mainframes, changing a way more heterogeneous setup in earlier IBM mainframes.

The 14 nm IBM z15 CPU that Telum is changing options 5 whole processors—two pairs of 12-core Compute Processors and one System Controller. Every Compute Processor hosts 256MiB of L3 cache shared between its 12 cores, whereas the System Controller hosts a whopping 960MiB of L4 cache shared between the 4 Compute Processors.

5 of those z15 processors—every consisting of 4 Compute Processors and one System Controller—constitutes a “drawer.” 4 drawers come collectively in a single z15-powered mainframe.

Though the idea of a number of processors to a drawer and a number of drawers to a system stays, the structure inside Telum itself is radically totally different—and significantly simplified.

Telum structure

Telum is considerably easier at first look than z15 was—it is an eight-core processor constructed on Samsung’s 7nm course of, with two processors mixed on every bundle (just like AMD’s chiplet method for Ryzen). There isn’t a separate System Controller processor—all of Telum’s processors are similar.

From right here, 4 Telum CPU packages mix to make one four-socket “drawer,” and 4 of these drawers go right into a single mainframe system. This supplies 256 whole cores on 32 CPUs. Every core runs at a base clockrate over 5 GHz—offering extra predictable and constant latency for real-time transactions than a decrease base with increased turbo price would.

Pockets stuffed with cache

Getting rid of the central System Processor on every bundle meant redesigning Telum’s cache, as effectively—the large 960MiB L4 cache is gone, in addition to the per-die shared L3 cache. In Telum, every particular person core has a non-public 32MiB L2 cache—and that is it. There isn’t a {hardware} L3 or L4 cache in any respect.

That is the place issues get deeply bizarre—whereas every Telum core’s 32MiB L2 cache is technically personal, it is actually solely nearly personal. When a line from one core’s L2 cache is evicted, the processor appears to be like for empty area within the different cores’ L2. If it finds some, the evicted L2 cache line from core x is tagged as an L3 cache line and saved in core y‘s L2.

OK, so we’ve got a digital, shared up-to-256MiB L3 cache on every Telum processor, composed of the 32MiB “personal” L2 cache on every of its eight cores. From right here, issues go one step additional—that 256MiB of shared “digital L3” on every processor can, in flip, be used as shared “digital L4” amongst all processors in a system.

Telum’s “digital L4” works largely the identical approach its “digital L3” did within the first place—evicted L3 cache strains from one processor search for a house on a distinct processor. If one other processor in the identical Telum system has spare room, the evicted L3 cache line will get retagged as L4 and lives within the digital L3 on the opposite processor (which is made up of the “personal” L2s of its eight cores) as a substitute.

AnandTech’s Ian Cutress goes into extra detail on Telum’s cache mechanisms. He ultimately sums them up by answering “How is that this attainable?” with a easy “magic.”

AI inference acceleration

IBM’s Christian Jacobi briefly outlines Telum’s AI acceleration on this two-minute clip.

Telum additionally introduces a 6TFLOPS on-die inference accelerator. It is meant for use for—amongst different issues—real-time fraud detection throughout monetary transactions (versus shortly after the transaction).

Within the quest for optimum efficiency and minimal latency, IBM threads a number of needles. The brand new inference accelerator is positioned on-die, which permits for decrease latency interconnects between the accelerator and CPU cores—however it’s not constructed into the cores themselves, a la Intel’s AVX-512 instruction set.

The issue with in-core inference acceleration like Intel’s is that it sometimes limits the AI processing energy accessible to any single core. A Xeon core working an AVX-512 instruction solely has the {hardware} inside its personal core accessible to it, which means bigger inference jobs should be cut up amongst a number of Xeon cores to extract the complete efficiency accessible.

Telum’s accelerator is on-die however off-core. This permits a single core to run inference workloads with the may of the whole on-die accelerator, not simply the portion constructed into itself.

Itemizing picture by IBM

Recent Articles

Home windows 11 hurts AMD Ryzen efficiency much more than we thought

Home windows 11 might need simply had its Chernobyl second. AMD and Microsoft already confirmed that the brand new working system increases L3 cache...

Twitter testing annoying advert technique with advertisements within the replies

Twitter is getting annoying with the rising variety of advertisements on the platform, particularly on cellular apps. Get able to bad-mouth Twitter much more...

5 greatest video conferencing apps for Android

Conferences are so much simpler than they was once. There weren’t a ton of choices, most of them had been costly, and video high...

Google Sends 50,000 Warnings to Customers Focused by State Hackers

Picture: Kenzo Tribouillard / AFP (Getty Pictures)If the web is a digital Wild West, it’s time to lock your...

Related Stories

Stay on op - Ge the daily news in your inbox