Recognizing technology’s inflections
Inflections involve improving performance, cost, or both. More important, they simplify systems for their users.
By Robert Cravotta, Technical Editor -- EDN, January 21, 2010
| AT A GLANCE |
| Technology inflections are easier to spot in hindsight, but correctly responding to the market during an inflection can produce large winners. Technology inflections happen when they integrate the right mix of components to simplify a developer’s view of system complexity. As software consumes even more of a design budget, the hardware must support the tool’s ability to abstract more of the system complexity. |
Technology inflection points occur when there is a fundamental change in how technology achieves some goal or how you use that technology. These changes can profoundly affect entire industries, but inflection points are sometimes not apparent when they are occurring. A company’s ability to recognize and respond to these inflection points can mean the difference between becoming a huge winner and turning out as a historical footnote. Technology inflection points for embedded processing have the added complexity of being mostly invisible to end users, so the populace for the most part doesn’t notice fundamental shifts.
Consider, for example, Intel’s 4004 microprocessor and Texas Instruments’ 32010 DSP. The Intel 4004 originated when Nippon Calculating Machine Corp contracted with Intel to convert Nippon’s Busicom 141-PF printing-calculator logic design into 12 custom chips, of which Nippon sold approximately 100,000 units. The company was looking to take competitive advantage of the availability of MOS LSI (large-scale-integration) technology to shift from electromechanical calculators to electronic ones. The Intel design team for the MCS-4 project proposed replacing the 12-chip approach with a four-chip implementation that included a single chip that designers could program for use in multiple tasks. The programmable-chip approach made the system possible and provided a level of flexibility and reliability that the 12-chip approach could not. With a payment of $60,000, Intel was able to change the license agreement between the two companies so that Intel secured the rights for the microprocessor design and the rights to market it for applications other than calculators.
Intel in November 1971 introduced the 4004 microprocessor, the first general-purpose “building-block” processor on the market, and it has since been a leading player in the microprocessor market. The programmable microprocessor fundamentally changed how manufacturers designed and built products, replaced mechanical-control mechanisms with microcontrollers, and enabled more precise control and monitoring of all types of end systems.
In contrast, Texas Instruments in 1983 introduced the TMS32010 DSP—not the first on the market but the first to integrate a 16-bit MAC (multiply/accumulate)-unit accelerator that made it easier for developers to use multiplication in their applications. According to Ray Simar, a professor at Rice University and a former TI fellow and DSP-design-team manager, the company originally built and marketed the 32010 for speech processing but quickly discovered that its customers were using it for other applications. TI then changed its marketing position and message to general-purpose digital signal processing and has since been a leader in that market. Digital signal processing has become so pervasive that you could consider it an embedded technology within embedded systems. In that scenario, semiconductor suppliers are now providing software stacks to allow access to the integrated application-specific accelerators without the need for developers to become signal-processing experts (Reference 1).
Hiding complexity
Inflection points are not just about technical capability; capability alone is usually not enough to cause an inflection point. The inflection shift involves hiding complexity from the user of the system. This approach does not reduce the overall complexity of the system but instead simplifies the learning curve and understanding model that a user must develop to effectively use the technology. Consider Microsoft’s Windows operating system and Apple’s iPhone products. Microsoft Windows did not encourage an inflection in the market until Version 3.0 emerged, five years after Version 1.0. Windows 3.0 simplified the management of the vast array of optional peripherals available for the desktop PC, and it gained widespread third-party support. It also enabled, simplified, and hid much of the complexity of sharing data between programs. Desktop computers were already supporting a robust third-party-peripheral market, and Windows 3.0 hid some of the complexity so that more users could confidently choose best-in-class components. Users could already transfer data among applications, but that task involved the use of translation programs and the loss of data from special features. Windows 3.0 hid the complexity of selecting those translation programs and provided a data-interchange format and mechanism that further improved users’ ability to share data among applications.
The Apple iPhone changed the way people think of touch and gesture interfaces, but it was not the first use of touch interfaces on a smartphone (Reference 2). The IBM Simon predates it by 14 years. The iPhone enjoys significantly higher processing performance than the earlier Simon device, however, allowing the iPhone to incorporate more smarts in the control system to successfully handle input ambiguities. The system does a better job than earlier systems of adapting to users.
However, it is still unclear whether this round of touch-sensitive and gesture-recognition systems, which the iPhone represents, is sufficient to enable an inflection point in embedded-system designs. In an attempt to avoid the lost opportunity that Nippon experienced with Busicom, many semiconductor companies are taking no chances. Last year, a dozen or so companies released or upgraded their touch-sensing kits. Although many touch-sensing kits are available, they may not simplify touch interfaces enough to justify the use of those interfaces in embedded designs. A later article will explore these kits’ maturity, level of abstraction, bundled software, and development tools.
Adding integration
A technology inflection point involves a fundamental shift in how designers and users perform their tasks. The economical integration of formerly separate pieces enables this shift. The 4004 integrated a software-programmable core to replace a purely custom logic design. The TMS32010 integrated a 16-bit hardware accelerator and an accompanying instruction-set architecture that simplified the implementation of multiplication. Windows 3 integrated device drivers and data sharing to simplify the support of the available peripherals and applications. The iPhone integrated smarter input processing to enable the touch and gesture interface to reliably handle ambiguous conditions that plagued similar interfaces in earlier products.
The ability to recognize an imminent inflection point is not essential, but it lessens the reliance on luck to respond appropriately to the market reactions to such changes. You may wonder whether inflection points share some common trait that can help you identify when such an opportunity exists. As Moore’s Law continues to double, approximately every two years, the number of transistors that you can inexpensively place on an IC, the opportunities increase to “waste” transistors on features that make it easier, faster, and more reliable to build systems in fundamentally different ways from before. You can use these extra transistors to make redundant resources and to provide resources for other parts of the design value chain, such as operating systems, development tools, and on-chip debugging and profiling resources.
Long before multicore processors became popular, processor architectures had evolved on a path toward having more parallel resources. The scale of the number and organization of transistors available to a processor architecture changes what is important to developers. The earliest transistor-count-constrained architectures often could not afford to waste any transistors on redundant resources. The transition to a register file made sense as more transistors became available to waste on parallel, redundant accumulators that would greatly improve processing performance because they could eliminate the temporary data moves that were necessary when there was only one accumulator. This type of change provided the first level of relief for clock-cycle counting for many applications.
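A toy model makes the accumulator-versus-register-file point concrete. The sketch below is illustrative only: the two instruction sequences and the cost model (counting data-memory accesses) are assumptions made for this example and do not describe any particular processor.

```python
# Toy cost model (an assumption for illustration): count the data-memory
# accesses needed to evaluate (a + b) * (c + d).

def single_accumulator(mem):
    """One accumulator: the first partial sum must be spilled to memory."""
    accesses = 0
    acc = mem["a"]        # LOAD a
    accesses += 1
    acc += mem["b"]       # ADD b
    accesses += 1
    mem["tmp"] = acc      # STORE tmp (the temporary data move)
    accesses += 1
    acc = mem["c"]        # LOAD c
    accesses += 1
    acc += mem["d"]       # ADD d
    accesses += 1
    acc *= mem["tmp"]     # MUL tmp (re-reads the spilled partial sum)
    accesses += 1
    return acc, accesses

def register_file(mem):
    """Several registers: both partial sums stay on chip."""
    accesses = 0
    r0 = mem["a"]         # LOAD r0, a
    accesses += 1
    r1 = mem["b"]         # LOAD r1, b
    accesses += 1
    r2 = mem["c"]         # LOAD r2, c
    accesses += 1
    r3 = mem["d"]         # LOAD r3, d
    accesses += 1
    r0 = r0 + r1          # ADD r0, r1   (register to register)
    r2 = r2 + r3          # ADD r2, r3
    r0 = r0 * r2          # MUL r0, r2
    return r0, accesses

values = {"a": 1, "b": 2, "c": 3, "d": 4}
print(single_accumulator(dict(values)))   # result and access count
print(register_file(dict(values)))
```

In this model the register-file version needs no store and no reload of a temporary, which is the kind of saving the paragraph above describes.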
Transistors eventually became available for designers to waste on redundant, wider address and data buses and accompanying arithmetic units. Witness the ingenuity of 8-bit 8051 processors to deliver an address space exceeding 2 Mbytes through block addressing and bank switching. As designers could economically waste more transistors on larger, parallel, and redundant integrated memory structures, the processors could perform more complicated tasks because the software code could account for more details in the task it needed to perform. Processors with large-enough address spaces allow developers to avoid the complexities of managing bank switching and conceptually treat memory as one block. The expansion of the available memory helps relax the urgency of byte counting in many designs.
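To show what developers avoid when memory can be treated as one block, here is a minimal sketch of bank switching. The layout is hypothetical: a 32-Kbyte window at 0x8000 and a single bank-select register are assumptions chosen for illustration, not the scheme of any specific 8051 derivative.

```python
BANK_SIZE = 0x8000     # 32-Kbyte banked window (assumed layout)
WINDOW_BASE = 0x8000   # window occupies 0x8000-0xFFFF of the 64-Kbyte space

class BankedMemory:
    """Hypothetical banked memory behind a 16-bit address bus."""

    def __init__(self, num_banks):
        self.banks = [bytearray(BANK_SIZE) for _ in range(num_banks)]
        self.bank_reg = 0          # models the bank-select register
        self.bank_switches = 0     # bookkeeping the developer must think about

    def _select(self, bank):
        # Rewrite the bank register only when the target bank changes.
        if bank != self.bank_reg:
            self.bank_reg = bank
            self.bank_switches += 1

    def cpu_address(self, linear_addr):
        """16-bit address the core actually drives for a linear address."""
        return WINDOW_BASE + linear_addr % BANK_SIZE

    def write(self, linear_addr, value):
        bank, offset = divmod(linear_addr, BANK_SIZE)
        self._select(bank)
        self.banks[bank][offset] = value & 0xFF

    def read(self, linear_addr):
        bank, offset = divmod(linear_addr, BANK_SIZE)
        self._select(bank)
        return self.banks[bank][offset]

mem = BankedMemory(num_banks=64)    # 64 x 32 Kbytes = 2 Mbytes
mem.write(0x12345, 0xAB)
print(hex(mem.read(0x12345)), mem.bank_reg, hex(mem.cpu_address(0x12345)))
```

With a flat address space, the `_select` step and the bookkeeping around it disappear from the developer's view.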
The ability to waste transistors on a parallel and redundant hardware multiplier accelerator opened a new industry for signal processing. The processor retains its ALU (arithmetic-logic unit) in addition to the new multiply accelerator, and designers can still use the ALU to perform multiplication, albeit more slowly. You can make similar statements for nearly every accelerator that contemporary processor architectures integrate.
The earliest integrated redundant resources hid or transferred some of the complexity away from the software developer, and each of these transfers enabled the processors to take on complexities that they could not previously handle. Processors with wide-enough ALUs or hardware accelerators allow developers to avoid having to break up multiplications, floating-point operations, Viterbi algorithms, and other tasks into lower- and higher-order bit operations that require manual management and combination. In contrast, general-purpose multicore implementations often require software developers to explicitly identify where parallelism exists—a step back to where the industry was.
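As an illustration of the manual management that wide-enough hardware removes, the sketch below builds a 16×16-bit multiplication from 8×8-bit partial products. The function names are hypothetical, and `mul8` merely stands in for a narrow hardware multiplier.

```python
def mul8(a, b):
    """Stand-in for a narrow 8x8 -> 16-bit hardware multiplier."""
    if not (0 <= a <= 0xFF and 0 <= b <= 0xFF):
        raise ValueError("operands must fit in 8 bits")
    return a * b

def mul16_from_8bit_parts(x, y):
    """Multiply two 16-bit values using only 8-bit multiplies."""
    if not (0 <= x <= 0xFFFF and 0 <= y <= 0xFFFF):
        raise ValueError("operands must fit in 16 bits")
    x_lo, x_hi = x & 0xFF, x >> 8      # split into lower- and higher-order bytes
    y_lo, y_hi = y & 0xFF, y >> 8
    low = mul8(x_lo, y_lo)                      # weight 2**0
    mid = mul8(x_lo, y_hi) + mul8(x_hi, y_lo)   # weight 2**8
    high = mul8(x_hi, y_hi)                     # weight 2**16
    # Combine the partial products at their bit positions.
    return (high << 16) + (mid << 8) + low

print(hex(mul16_from_8bit_parts(0x1234, 0xABCD)))
```

Four narrow multiplies plus shifts and additions replace what a 16-bit multiplier does in one operation; the same pattern scales to wider operands.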
Architectural inflection points do not necessarily apply across the entire application area. An innovation that transforms one application might be inappropriate for another. Pipelines and caches are redundant resources that can help reduce the complexity of managing memory-access times for software developers by masking the significant access latency when the system must access data in cheaper memory that is farther from the core (Reference 3). However, pipelines and caches are inappropriate for some embedded designs, especially those, such as motor controllers, that require fast, deterministic behavior. As a result, many motor controllers do not use pipelines or caches because they would unnecessarily increase the developer’s exposure to complexity.
Steve Leibson, a consultant for semiconductor and EDA companies and a former editor-in-chief of EDN, points out that energy dissipation drives the need for many contemporary parallel-processing implementations and algorithms. It is no longer practical to keep driving clock rates faster because the industry has crossed a threshold in which static leakage current is a larger issue than it was at larger process geometries and slower clock rates. To deliver more processing performance for a unit of time, high-end processors employ multiple cores in one device. For parallel-processing applications, such as video processing, this approach works well for the same reasons that extra registers and accelerators work well for other applications: They offload some of the data management and scheduling complexity from the software developer.
However, similar to pipelines and caches for highly deterministic applications, applying a multicore approach to a general-purpose problem exposes more complexity and adds to the software developer’s already-significant load. In addition to the proper execution of the functional tasks, the software developer must prevent timing dependencies that were simpler in a single-instruction-engine architecture. Another area of complexity is how to identify parallelism and then partition and balance the workload across the cores. A memory architecture that you can no longer treat as a single block further complicates this design task; data may reside in main memory or in one or more of several local memories of other cores, each with different implications for access latency. The developer may have to provide data coherence in software if the processor lacks a hardware coherence controller. All of these complexities scream for software-development tools to help shoulder the additional load from the software developer.
Software tools
Along with the processor-architecture changes, software-development tools have undergone a number of inflection shifts, but they happened in step with the changes in silicon. The earliest processors might come with an assembler and some application notes to help developers figure out how to use them. The assembler was a direct reflection of the underlying architecture and instruction set; it was primarily a tool to help the developer think in terms of the steps the processor executed. The developer had to manually translate even simple algebraic expressions into a series of machine reads, stores, shifts, and additions.
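The manual translation looks something like the following sketch, which evaluates y = 10x + 3 without a multiply instruction. The accumulator-style steps and the mnemonics in the comments are assumptions for illustration, not a real instruction set.

```python
def times10_plus3(x):
    """Evaluate 10*x + 3 using only loads, stores, shifts, and additions."""
    acc = x            # LOAD  x
    acc = acc << 3     # SHL   3      -> 8x
    tmp = acc          # STORE tmp
    acc = x            # LOAD  x
    acc = acc << 1     # SHL   1      -> 2x
    acc = acc + tmp    # ADD   tmp    -> 8x + 2x = 10x
    acc = acc + 3      # ADD   #3     -> 10x + 3
    return acc

print(times10_plus3(7))
```

A compiler for a high-level language performs this decomposition automatically from the single expression `10 * x + 3`.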
High-level languages helped simplify the translation to machine or assembly code, but they usually produced code that was significantly worse than what a developer could do manually—on a system with severely constrained resources. Processor architectures became more compiler-friendly as they implemented register files and orthogonal instruction sets and as they could support larger memories. In other words, fewer improvements in compiler technology would have occurred if the appropriate silicon support had been lacking. Contemporary compilers are good enough to use for almost all programming except for those leading-edge functions that still benefit from differentiated, application-specific resources that a targeted processor might include.
As compilers improved, development tools underwent an integration effort not unlike that for SOCs (systems on chips). Integrated development environments grew into sets of many tools, including editors, compilers, debuggers, and profilers. The value of this integrated environment lies in the fact that it simplifies a developer’s learning curve and shortens compilation and build time, especially as software continues to be an ever-larger portion of the development budget, not just for end-equipment designers but increasingly for semiconductor companies. These environments allow a developer to spread common tasks among targets so that the developer can focus on the differences between targets instead of differences in the development tools.
Semiconductor companies are also helping to hide complexity by allowing developers to choose their processor target later in the design cycle. Freescale’s Flexis and Atmel’s AVR lines blur the line between 8 and 32 bits by sharing common IP (intellectual property) between the processor groups. Microchip takes the same approach with some of its 16- and 32-bit PIC devices. Many other companies offer large device families that allow developers to move up and down through the family to size the target processor as late as possible in the design cycle. These abstractions or choices allow developers to focus on what function the design needs to perform and less on how to fit a chosen processor into the design.
Exploratory compilation is potentially emerging as a trend among compilers for complex signal-processing systems that can help offset the increasing complexity of software. Texas Instruments a few years ago implemented exploratory compilation for its VLIW (very-long-instruction-word) C64x processors that have eight execution engines that can operate in parallel. Ceva recently added a similar capability to its compiler tool set (Figure 1). The compilers perform multiple compilations of each function with different settings. These tools then present a developer with information about how the settings affect code size and performance so that the developer can fine-tune the results of the compilation. Academic variations on this concept can include profiler-based feedback that further refines the compilation settings.
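The reporting step of such a tool can be sketched as a search for the best size/performance trade-offs among the candidate builds. The sketch below is not the TI or Ceva implementation: the settings names are hypothetical, the numbers are placeholders rather than measurements, and no compiler is actually invoked.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    settings: str    # hypothetical label for one set of compiler settings
    code_size: int   # bytes (placeholder value)
    cycles: int      # execution cycles (placeholder value)

def pareto_front(candidates):
    """Keep builds that no other build beats on both size and cycles."""
    front = []
    for c in candidates:
        dominated = any(
            o.code_size <= c.code_size and o.cycles <= c.cycles
            and (o.code_size < c.code_size or o.cycles < c.cycles)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.code_size)

# Placeholder results for one function compiled four different ways.
SAMPLE = [
    Candidate("size-first", 400, 1200),
    Candidate("balanced", 520, 900),
    Candidate("speed-first", 700, 650),
    Candidate("unrolled-x8", 760, 700),   # larger and slower than speed-first
]

for c in pareto_front(SAMPLE):
    print(f"{c.settings:12s} {c.code_size:5d} bytes {c.cycles:5d} cycles")
```

The developer then picks a point on this front per function, for example the smallest build that still meets a cycle budget.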
Exploratory compilation and profiling feedback have greater potential as processor architectures become more complex with even more heterogeneous or homogeneous execution units. A compiler that can generate dozens, hundreds, or perhaps thousands of candidate configurations of the resources and code and then dynamically test and rank each of them would provide a significant level of abstraction to the developer. The system would also need to be able to provide confidence, maybe through mathematical proof, that the final candidate configurations are equivalent to the source code. Such a technology could also provide the essential capability for code reuse because the source code could focus on function and synchronization specifications while the compiler tries out many configurations on whatever resources the target processor supports.
To make all of these capabilities possible, hardware resources had to coincide with the software advancements. Further advancements in software tools will probably require as-yet-unknown specialized hardware resources. As a point of interest, 2009 saw two significant acquisitions of operating-system companies. Cavium Networks acquired MontaVista, and Intel acquired Wind River. These moves may signal an imminent inflection shift for operating systems and hypervisors to hide complexity from developers.
| References |
|
Apple
www.apple.com
ARM
www.arm.com
Atmel
www.atmel.com
Cavium Networks
www.caviumnetworks.com
Ceva
www.ceva-dsp.com
Cypress Semiconductor
www.cypress.com
Freescale
www.freescale.com
Intel
www.intel.com
Microchip
www.microchip.com
Microsoft
www.microsoft.com
MontaVista
www.mvista.com
PLX Technology
www.plxtech.com
Power.org
www.power.org
Solarflare Communications
www.solarflare.com
Steve Leibson Consulting
www.sleibson.com
Texas Instruments
www.ti.com
Wind River
www.windriver.com
-
The kind of application that is a 'natural' use of multicore systems is tomography in general. Right now it's used for medical imaging and geology, but there are other potential uses, some as simple as engine or structure diagnosis. To my knowledge, no one has 'hidden' the complexity of gathering hundreds of simultaneous sample sets from known coordinates for building a 3-D model based on those samples. Having someone without a heavy programming background being able to construct a model based on this data would create a number of entirely new worlds.
Meredith Poor - 2010-24-3 20:25:00 PDT -
At some point a sensing platform can be constructed that is indistinguishable from a lump of rock, even if one attempts to bust it open and look for chips, batteries, antennae, etc. Part of this is a simple reduction in size, some of it is being able to construct an amorphous crystalline structure of which some elements are 'circuits', and the surrounding matrix is inert 'packaging'. The obvious use is intelligence gathering, whether military or law enforcement. How far are we from having it, and who would be making it?
Meredith Poor - 2010-24-3 20:15:00 PDT -
"Inflection points are not just about technical capability" - consider that Forth is sufficient to enable an inflection point in embedded-system designs. Only a few companies have tried to avoid the lost opportunity of using Forth; most companies are taking no chances. While for big companies successfully using Forth is a secret worth hiding, others follow the mainstream and miss this opportunity.
"A technology inflection point involves a fundamental shift in how designers and users perform their tasks. The economical integration of formerly separate pieces enables this shift." Forth integrated data stacks, dictionaries, and an accompanying instruction-set architecture that simplified the implementation of practically every program and programming system.
Chuck Moore’s Forth continues to decrease the need to “waste” transistors on features that make it easier, faster, and more reliable to build systems in fundamentally different ways than before. Chuck Moore's stack machines need a lower count of transistors than any other microprocessor machine, the latest example being the SEAforth® multicore processors, which offer unprecedented flexibility and scalability. Derived from a proprietary Scalable Embedded Array™ (SEA) Platform, SEAforth solutions are poised to raise the performance-per-watt bar in a host of embedded applications.
From its earliest days, Forth used virtual memory to allow developers to avoid the complexities of managing bank switching and conceptually treat memory as a block.
It is strange that contemporary processor architectures have not integrated data stacks as accelerators since then.
Forth allows developers to avoid having to break up multiplications, floating-point operations, and other tasks into lower- and higher-order bit operations that require manual management and combination.
Pipelines and caches are not used with Forth microprocessors, and therefore Forth microprocessors have fast, deterministic behavior, decreasing the developer’s exposure to complexity.
And even so, Forth multicore implementations using VentureForth make life easier for software developers to handle parallelism — a step forward into the future.
There is another neglected inflection point on the sidelines: The transputer was a pioneering concurrent computing microprocessor design in the 1980s. To avoid all these complexities connected to parallel programming, a special programming language, Occam, was developed, which took care of all these tasks. Using Occam, the same program could run efficiently on a single transputer or on a sea of transputers; the compiler managed it. Occam is a concurrent programming language that builds on the Communicating Sequential Processes (CSP) process algebra. The "not invented here" syndrome now leads to reinventing the wheel. It's a waste of resources and a waste of progress already made.
Dirk Bruehl - 2010-4-2 07:33:00 PST -
?? How is it that Microsoft gets the credit for technology Apple first brought to the masses?
Chris - 2010-3-2 12:42:00 PST -
Some inflection points are the sum of emerging technologies that may not appear related. There is a very practical consequence of the 'wireless power transmission' technology that one can now buy off the shelf at consumer electronics stores. It is more significant when coupled with Lithium-Ion batteries, with their higher energy densities and avoidance of materials like lead and cadmium. I can build a robot to run around the house and perhaps dock periodically to recharge. A wireless power recharging system allows me to create a far more pervasive service presence, since I can put these in multiple locations where they may appear to be 'invisible'. In short, I can stop thinking about periodic round trips to the charger, and simply focus on building the robot.
Meredith Poor - 2010-23-1 00:31:00 PST


















