The VISC Instruction Set and Global Front End

Common instruction set architectures (ISAs) such as x86, ARMv8, Power, SPARC and other more esoteric ones rely on system code being converted into predefined instructions that each design can handle. VISC comes with its own ISA as well, separate from the others, which both the VISC cores and the virtual cores use. When running native VISC code, the global front end splits the instructions into smaller ‘virtual hardware threadlets’, which are then dispatched to separate virtual cores. These virtual cores can then issue them to the available resources on any of the physical cores and keep track of where the data goes. Multiple virtual cores can push threadlets into the reorder buffer of a single physical core, which can feed partial instructions and data from multiple threadlets through its execution ports at the same time. We were told that each virtual core keeps track of where its corresponding output ends up.
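
To make that flow concrete, here is a minimal Python sketch, our own illustration rather than Soft Machines' implementation, of a global front end splitting a native instruction stream into threadlets and a virtual core issuing them round-robin to ports on any physical core. Every name and data structure here is hypothetical.

```python
# Conceptual sketch only: not Soft Machines' implementation. A global front
# end splits a decoded instruction stream into "threadlets" and hands them
# to virtual cores, which issue them across ports on any physical core.
from dataclasses import dataclass, field
from itertools import cycle

@dataclass
class Threadlet:
    thread_id: int
    instructions: list  # slice of the decoded instruction stream

@dataclass
class VirtualCore:
    vc_id: int
    pending: list = field(default_factory=list)

    def issue(self, physical_ports):
        # Issue each instruction to the next port in round-robin order
        # across all physical cores, recording where each result will
        # appear so the virtual core can track its output.
        placements = {}
        port_iter = cycle(physical_ports)
        for tl in self.pending:
            for insn in tl.instructions:
                placements[insn] = next(port_iter)
        self.pending.clear()
        return placements

def global_front_end(stream, virtual_cores, threadlet_size=4):
    """Split the stream into threadlets and round-robin them to virtual cores."""
    vcs = cycle(virtual_cores)
    for i in range(0, len(stream), threadlet_size):
        tl = Threadlet(thread_id=i // threadlet_size,
                       instructions=stream[i:i + threadlet_size])
        next(vcs).pending.append(tl)

vcores = [VirtualCore(0), VirtualCore(1)]
global_front_end([f"op{n}" for n in range(10)], vcores)
print(vcores[0].issue(["core0.p0", "core0.p1", "core1.p0", "core1.p1"]))
```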

The true kicker (and much of what sets VISC apart) is that when multiple virtual cores are in flight at one time, the core design allows resources to be reallocated between virtual cores dynamically at near single-cycle latency (we were told 1-4 cycles, depending on the change in allocation). Thus if two virtual cores are competing for resources, appropriate algorithms are in place to determine which resources are allocated where.
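
The company disclosed only the reallocation latency, not the policy, so the sketch below assumes a simple proportional-to-demand split; the `arbitrate` function and its latency model are illustrative assumptions, not the actual algorithm.

```python
# Hypothetical per-cycle port arbitration between two competing virtual
# cores. The 1-4 cycle reallocation cost comes from the article; the
# proportional policy and the "cost scales with ports moved" model are
# our assumptions.
def arbitrate(demand_a, demand_b, total_ports, current_split):
    """Return a new (ports_a, ports_b) split and the reallocation latency."""
    total_demand = demand_a + demand_b
    if total_demand == 0:
        return current_split, 0
    ports_a = round(total_ports * demand_a / total_demand)
    ports_a = max(0, min(total_ports, ports_a))
    new_split = (ports_a, total_ports - ports_a)
    # Assume latency scales with how many ports change hands, capped at 4.
    moved = abs(new_split[0] - current_split[0])
    latency = min(4, max(1, moved)) if moved else 0
    return new_split, latency

split = (4, 4)  # start with an even split of 8 execution ports
for demands in [(6, 2), (6, 2), (1, 7)]:
    split, cycles = arbitrate(*demands, total_ports=8, current_split=split)
    print(f"demand={demands} -> split={split}, reallocation cost={cycles} cycles")
```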

One big area of focus in optimizing processor designs for single-thread performance is speculation – being able to deal with branches in code and/or prefetch relevant data from memory before it is needed. Typically, because all the data for a single thread is contained within one core, it is easy enough to deal with code paths that rely on previous data or to recover from bad speculation.

In the virtual core scenario, however, this becomes trickier. VISC tackles this in two ways. First, the threadlet generation is designed to minimize cross-core communication, because communication adds latency and reduces performance. Second, the cores can communicate with each other through either the register files or the L1 data caches. The register files have a single-cycle latency but can only transmit tens of values at a time, whereas the L1 cache has a four-cycle latency but can transmit thousands of values.
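
As a toy illustration of that trade-off, the following sketch picks a transfer path from the two described above. The specific capacity figures (32 and 4096 values) and the selection rule are our assumptions, since only "tens" and "thousands" were stated.

```python
# Toy cost model for cross-core data transfer based on the two paths the
# article describes: register file (1-cycle latency, ~tens of values) vs
# L1 data cache (4-cycle latency, ~thousands of values). Exact capacities
# are assumed for illustration.
REG_FILE = {"latency": 1, "capacity": 32}    # assumed: "tens of values"
L1_CACHE = {"latency": 4, "capacity": 4096}  # assumed: "thousands of values"

def pick_transfer_path(num_values):
    """Prefer the register file when the payload fits; fall back to L1."""
    if num_values <= REG_FILE["capacity"]:
        return "register file", REG_FILE["latency"]
    if num_values <= L1_CACHE["capacity"]:
        return "L1 cache", L1_CACHE["latency"]
    raise ValueError("payload exceeds single-transfer capacity")

for n in (8, 200):
    path, lat = pick_transfer_path(n)
    print(f"{n} values -> {path} ({lat}-cycle latency)")
```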

Typically, communicating through a register file is seen as a risky maneuver and difficult to control, especially when there are multiple physical cores and each core needs to be able to place data into, and take data from, the right registers of every other core. Soft Machines told us that a large part of their design work has been in this area of speculation and data transfer. Specifically on speculation and branch prediction, we postulated that they were over ten years behind Intel here, and the response we got was in a similar vein, conceding that using Intel’s branch prediction methods could offer at least 20-30% better performance with branching code. However, we were told that the VISC design is quicker to recover in the event of a failed branch, needing only a few cycles.

The Pipeline

The first VISC core available for license is Shasta, a dual-core design that enables up to two virtual cores or threads (2C/2VC), and we were given a base overview of its pipeline.

Normally we would see the pipeline of a single core, but this is the pipeline of both cores of Shasta. It is also deeper than that of the original VISC prototype. The pipeline starts out looking much like any other: the thread either takes an instruction or issues a fetch for data into the instruction assembly. Turning the VISC instructions and data into threadlets takes another three stages, but allocation and scheduling take six (plus one). On that subject, Soft Machines mentioned that keeping track of data across multiple cores per virtual core is tricky, as is dealing with reorder buffers and parallel instruction management, which is why there are so many stages here. The ‘plus one’ goes back to the variable physical core allocation methodology, ensuring that if two threads are active, the heavier one gets the most resources. The threadlets are then executed on the ports of each core, with a possible 1-4 cycle delay if data needs to be transferred across core boundaries via the registers or the L1 cache.
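
To make the stage arithmetic explicit, here is a rough tally based on the description above. The fetch/decode stage count is an assumption, since only the threadlet-formation (three) and allocate/schedule (six plus one) counts were given.

```python
# Rough reconstruction of the Shasta front-end stage counts as described
# above, purely to make the arithmetic explicit.
PIPELINE = [
    ("fetch / instruction assembly", 4),   # assumed count, not stated
    ("threadlet formation", 3),            # stated: three stages
    ("allocation and scheduling", 6),      # stated: six stages
    ("variable core allocation (+1)", 1),  # the "plus one"
    ("execute on physical-core ports", 1),
]

depth = sum(stages for _, stages in PIPELINE)
print(f"approximate front-end + execute depth: {depth} stages")
# Cross-core operand transfers add 1 cycle (registers) to 4 cycles (L1).
```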

With the variable allocation of fractions of a core to a virtual core, VISC is designed for this situation:

If one heavier thread needs more resources, it can take them from idle ports on a second core (or a third, or a fourth). The virtual cores can also be configured in software to limit their use (e.g. keeping a VC to half a physical core), and this can be reconfigured at runtime at the expense of 10-12 cycles. There is a quality-of-service implementation as well, so if a virtual core takes a high-priority thread, it will have access to more resources by default.
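
A short sketch of what those software-visible controls might look like follows. The class name, the priority boost factor, and the API shape are purely illustrative assumptions; only the half-core cap example, the QoS behavior, and the 10-12 cycle reconfiguration cost come from the description above.

```python
# Hypothetical software-visible controls: capping a virtual core at a
# fraction of a physical core, plus a QoS flag that grants a high-priority
# thread more resources by default. API shape and boost factor are assumed.
class VirtualCoreConfig:
    RECONFIG_COST_CYCLES = (10, 12)  # runtime reconfiguration cost, per article

    def __init__(self, max_core_fraction=1.0, high_priority=False):
        self.max_core_fraction = max_core_fraction  # e.g. 0.5 = half a core
        self.high_priority = high_priority

    def effective_share(self, free_fraction):
        """Resources granted: capped by config, boosted for QoS threads."""
        share = min(self.max_core_fraction, free_fraction)
        if self.high_priority:
            share = min(free_fraction, share * 1.5)  # assumed boost factor
        return share

vc = VirtualCoreConfig(max_core_fraction=0.5, high_priority=True)
print(vc.effective_share(free_fraction=1.0))  # -> 0.75 of one core
```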

Comments

  • vladx - Sunday, February 14, 2016 - link

    If it works, Intel or ARM won't be able to copy them because they've already patented the techniques used.
  • valinor89 - Saturday, February 13, 2016 - link

    AMD, Samsung and GlobalFoundries are chief investors, so it is doubtful Intel or Nvidia will be able to acquire this company.
  • xthetenth - Friday, February 12, 2016 - link

    Why is that such a red flag? They show the optimal part of the curve for the A72, and they show the suboptimal tail for all of them, although they extend it farther for the A72 to show what it takes to get it up to the same performance level (basically, if accurate, it's non-viable and they're in a different class), and they say as much. There's a huge list of objections the article raises, and this isn't on it for pretty good reason. It's just not nearly as big a deal as the rest.
  • Andrei Frumusanu - Friday, February 12, 2016 - link

    It's incorrect to simply extend a curve of an existing design beyond its designed operating range. It's perfectly possible to design the physical implementation to be optimized for very high frequencies - in such a case the curve would be less steep but consume more power at the low frequencies. Extending the curve of a low-power design is rather misleading in this case.
  • extide - Friday, February 12, 2016 - link

    Yeah, and I don't think they should adjust the Intel cores at all. Intel chips come as Intel makes them, that's it. You will never see the Skylake arch on TSMC or GF foundry processes. You should take the results from the Intel chip as they are, because that is what you will be competing against, not some made-up adjusted result that will never exist in the wild.

    As for adjusting the OTHER chips, well, OK, I see what they are going for here, but I still think they took it a bit too far, like adjusting for more or less cache. Although you can see those other chips on various processes from GF and TSMC, so the process correction isn't really as big of a red flag to me.
  • name99 - Saturday, February 13, 2016 - link

    The curve is not illegitimate; you're missing the point. The goal of the curve is not to show how great their CPU is, it is to show how great their TECHNOLOGY is (ie their microarchitecture). This is best done by comparisons that hold all else equal (ie same process, same compiler, same caches, etc; only different microarchitecture).

    If you're going to criticize the presentation, criticize it on grounds that actually make sense:

    - their "performance" score is garbage because they claim to be in the business of speeding up SINGLE-THREADED code, but then mix in a number of benchmarks that are very naturally parallelized. This is much like comparing an ARMv8 CPU with NEON switched off to an x86 using AVX-512, to test matrix multiplication speed --- it's simply NOT telling you anything about single-threaded performance.

    - the robustness of their normalizations is dodgy and they provide little evidence that the ways in which they have normalized are legit.
  • gamerk2 - Friday, February 12, 2016 - link

    This is where CPUs are eventually going to go, since it's really the only way to get maximized CPU performance without adding a lot of power-hungry components onto the die.

    That being said, the likely outcome is someone (Intel most likely, possibly NVIDIA) acquires Soft Machines and integrates their IP onto their own chips.
  • vladx - Sunday, February 14, 2016 - link

    Doubt it; the only chance for NVidia would be to license it, and Intel would most likely be blocked from buying such a company.
  • Avendit - Friday, February 12, 2016 - link

    How does this all compare to the Transmeta/Crusoe parts? Those had a different purpose but did take the translation abstraction layer approach, though they didn't seem to go anywhere, unfortunately. Are there any parallels or lessons to be had?
