The Automata Processor - Practical processing in memory, Pt 2
Evaluating a processing architecture requires examining the most fundamental components of the processing cycle. Processors and the software they run can be modeled as a cycle that repeats until the desired result is achieved. This cycle is often called the fetch-decode-execute (FDX) cycle, and it is one of the most fundamental ways to describe the operation of a microprocessor. The FDX cycle is shown in Figure 1, along with a simplified block diagram of a microprocessor.
Figure 1. The fetch-decode-execute (FDX) cycle (left) and simplified block diagram of a microprocessor.
The basic FDX cycle is easy to understand, but there are many variations depending on the specific microprocessor being used and the specific instruction being executed. Note that the three stages of the FDX cycle do not include storing results back to the memory system. Instead, the store can be treated as the next instruction to be processed by the microprocessor, which requires another complete FDX cycle.
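The cycle described above can be sketched as a toy accumulator machine. This is a minimal illustration, not a model of any real microprocessor: the `ToyCPU` class, its four opcodes (`LOAD`, `ADD`, `STORE`, `HALT`), and the memory layout are all hypothetical names chosen for the example. The point it shows is that writing a result back to memory is simply one more instruction, and so costs one more full FDX cycle.

```python
from dataclasses import dataclass


@dataclass
class ToyCPU:
    """Hypothetical accumulator machine used only to illustrate the FDX cycle."""
    memory: list          # instructions and data share one address space
    pc: int = 0           # program counter: address of the next instruction
    acc: int = 0          # accumulator register
    cycles: int = 0       # number of completed FDX cycles
    halted: bool = False

    def fetch(self):
        # Fetch: read the instruction at the program counter, then advance it.
        instruction = self.memory[self.pc]
        self.pc += 1
        return instruction

    def decode(self, instruction):
        # Decode: split the instruction into an opcode and an address operand.
        opcode, address = instruction
        return opcode, address

    def execute(self, opcode, address):
        # Execute: perform the operation named by the opcode.
        if opcode == "LOAD":
            self.acc = self.memory[address]
        elif opcode == "ADD":
            self.acc += self.memory[address]
        elif opcode == "STORE":
            # Storing the result is its own instruction, so it needs
            # its own fetch, decode, and execute.
            self.memory[address] = self.acc
        elif opcode == "HALT":
            self.halted = True
        else:
            raise ValueError(f"unknown opcode: {opcode!r}")

    def run(self):
        while not self.halted:
            instruction = self.fetch()
            opcode, address = self.decode(instruction)
            self.execute(opcode, address)
            self.cycles += 1
        return self.cycles


# Addresses 0-3 hold the program; addresses 4-6 hold the data.
program_and_data = [
    ("LOAD", 4),    # acc = memory[4]
    ("ADD", 5),     # acc += memory[5]
    ("STORE", 6),   # memory[6] = acc (a separate, complete FDX cycle)
    ("HALT", None),
    2, 3, 0,
]

cpu = ToyCPU(memory=program_and_data)
total_cycles = cpu.run()
print(total_cycles, cpu.memory[6])  # prints "4 5"
```

Adding two numbers and saving the sum takes four cycles in this sketch: two to bring in the operands, one to store the result, and one to halt.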
Increasing the rate at which FDX cycles are completed is the key to scaling computing performance. Methods for increasing the FDX cycle rate include:
- Increasing core clock frequencies
- Decreasing memory latency
- Pipelining operations
- Multi-core and super-scalar architectures
Each of these traditional methods has limitations. Scaling clock frequencies leads to sharp, superlinear increases in power consumption. Reducing memory latencies requires more expensive memory architectures. Improvements from pipelining are generally limited by the depth of the pipeline and do not scale beyond that point. Lastly, multi-core and super-scalar architectures provide the ability to execute more FDX cycles in parallel, but they are hindered by programming complexity.
In the FDX cycle, the most costly operations are the ones that require information to be retrieved from or stored into memory. The penalty increases when the source or destination of the data is further up the memory hierarchy. Level 1 (L1) cache operations are relatively fast, but memory operations take increasingly longer as system memory (DRAM) and finally storage-class memory (HDD/SSD) are accessed.
The penalty for accessing higher levels of the memory hierarchy is exacerbated by the fact that the information being retrieved is often overhead associated with the von Neumann architecture itself. Consider that all algorithms require data to be manipulated in one way or another. In order to implement these data manipulations, the von Neumann architecture relies on the concept of machine instructions and addressable memory. These instructions and memory addresses are not typically part of the final results of the algorithm, but they are handled extensively during the processing operations.
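To make the overhead concrete, the following sketch tallies memory traffic while summing a short list on a hypothetical accumulator machine. Everything here is an assumption made for illustration: the function name `count_memory_traffic`, the opcodes, and the fully unrolled program. The program and data are kept in separate Python lists only to simplify the bookkeeping. The counts describe this toy model, not measurements of a real processor.

```python
def count_memory_traffic(values):
    """Sum `values` on a hypothetical accumulator machine.

    Returns the sum plus a tally of instruction fetches (overhead)
    and data accesses (the reads and writes the algorithm needs).
    """
    n = len(values)
    result_addr = n                    # the result lives just past the inputs
    data = list(values) + [0]

    # Straight-line program: load the first value, add the rest, store the sum.
    program = [("LOAD", 0)]
    program += [("ADD", i) for i in range(1, n)]
    program += [("STORE", result_addr), ("HALT", None)]

    counts = {"instruction_fetches": 0, "data_accesses": 0}
    acc = 0
    for opcode, addr in program:
        counts["instruction_fetches"] += 1   # every instruction must be fetched
        if opcode == "LOAD":
            acc = data[addr]
            counts["data_accesses"] += 1
        elif opcode == "ADD":
            acc += data[addr]
            counts["data_accesses"] += 1
        elif opcode == "STORE":
            data[addr] = acc
            counts["data_accesses"] += 1
        elif opcode == "HALT":
            break
    return data[result_addr], counts


total, traffic = count_memory_traffic([2, 3, 5, 7])
print(total, traffic)
# prints "17 {'instruction_fetches': 6, 'data_accesses': 5}"
```

In this toy model, more than half of the memory accesses fetch instructions rather than data, and every instruction also carries an address that never appears in the final result.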