A Discussion With
Our Founders

Until the mid-2000s, software developers relied on computer architects to translate increasing transistor counts into performance gains without requiring software changes. However, with the end of Dennard scaling and the physical limitations of power and energy dissipation, this “free ride” came to an end. While modern architectures offer additional hardware resources, leveraging them effectively introduces significant software complexity, particularly in determining how best to utilize these resources. Furthermore, software systems have become increasingly configurable, requiring users to manage performance-related parameters directly. The interaction between these software and hardware parameters creates a highly complex and challenging environment to model.

In real-world deployments, systems must meet multiple, often conflicting, design requirements and continue to perform optimally despite unpredictable operating environments and workloads. These challenges reflect two distinct issues:

  1. Inherent Complexity: Hardware and software expose a diverse set of configurable parameters, whose interactions can have non-linear effects on mission-critical metrics. Additional factors such as environment, maintenance, upgrades and developer expertise exacerbate this complexity.
  2. Dynamics: Computing systems must adapt reliably to unpredictable changes in operating environments, input workloads and user needs.

To address these challenges, Self-aware Computing Systems have emerged as a solution. These systems leverage principles of self-awareness to monitor their state and environment, reason about potential adaptations and act autonomously to meet performance objectives while balancing complexity and dynamics. This approach provides a scalable and adaptive foundation for managing modern computing systems in increasingly unpredictable and demanding real-world scenarios.

Addressing Complexity and Dynamics

To address these challenges, we have undertaken extensive research and developed the transformative technologies that enable Self-Aware™ computing for every system.

Self-Aware software and hardware systems are aware of their quantifiable goals—defined in terms of metrics like throughput, latency, power, energy and accuracy.

Self-aware systems automatically adapt behavior to meet user-specified goals in complex, dynamic computing systems. The entire development process has been based upon sound mathematical models, which permit reasoning and assurance about when systems will meet user goals and, perhaps more importantly, provide understanding of when those goals are unreachable.

They continuously monitor critical metrics and adapt their internal behavior to ensure their goals are maintained in complex, dynamic environments, by combining machine learning (to address complexity) with control theory (to handle dynamics).

While most academic research tends to be narrowly focused, the true value of self-aware computing is demonstrated by the breadth of domains in which we have consistently shown that self-aware computing outperforms prior, unaware systems: at the circuit level, the architecture level, the OS level, the application level, and in coordination across multiple complex system interfaces. We have applied these techniques to manage energy on embedded microcontrollers, to manage power for the Theta supercomputer in collaboration with Argonne National Laboratory, and even for quantum computers.

A key aspect of self-aware computing is the integration of AI and machine learning into dynamic computing systems.

We’ll briefly discuss each component separately and then discuss their combination into a powerful set of configuration, optimization and run-time tools.

Learning Models of Complex Computer Systems

AI has an obvious use in resource management and scheduling, especially for power and energy in datacenters and supercomputers. However, AI has not proven to be a panacea and must be applied carefully, because improving learning accuracy does not necessarily improve the system outcome.

This is a somewhat counterintuitive result: one would assume that improving the AI's accuracy in estimating a workload's latency and energy for different resource assignments would improve energy efficiency; yet in our work, even fairly large accuracy improvements did not always reduce energy.

After exploration, we determined the reason: this is a constrained optimization problem (meet a target latency with minimal energy) and, of course, only the resource configurations on the optimal frontier of power and performance are useful to solve this problem. Empirically, however, most resource configurations are not on the optimal frontier (typically we find about 90% are not optimal for any given task).

Unfortunately, when the AI model is optimized for accuracy, it gets the biggest win by improving accuracy for these non-optimal configurations (since they are the majority), which does not improve the system result (which only concerns optimal configurations, which are a small minority).
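This structure can be made concrete with a small sketch using synthetic data: compute which configurations lie on the latency/energy Pareto frontier and observe that only a small minority do. The helper below is illustrative, not our learner, and the configuration data is invented.

```python
# Sketch: find the latency/energy Pareto frontier among configurations.
# All data here is synthetic; a real system would profile these values.
import random

random.seed(0)
# Each configuration maps to a measured (latency, energy) pair.
configs = [(random.uniform(1, 10), random.uniform(1, 10)) for _ in range(200)]

def pareto_frontier(points):
    """Return points not dominated in both latency and energy (lower is better)."""
    frontier = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            frontier.append(p)
    return frontier

optimal = pareto_frontier(configs)
# Typically only a small minority of configurations are Pareto-optimal,
# so a model tuned purely for overall accuracy spends most of its capacity
# on configurations the optimizer never uses.
print(f"{len(optimal)} of {len(configs)} configurations are Pareto-optimal")
```

An accuracy-optimized model improves mostly on the dominated majority; a structure-aware model concentrates on the frontier points that actually decide the optimization.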

Once we accounted for this structure, we designed a learning approach that was less accurate overall, but better for predicting the optimal configurations. Our AI gets 26% closer to optimal energy than the most accurate prior AI we could find. This is an essential observation for those deploying AI to manage complex computer systems: the learning approaches should account for the structure of the systems problem they are deployed to solve rather than strictly optimize for accuracy.

Controlling Computing Systems Through Dynamic Workload Fluctuations

Control theory is a powerful branch of engineering for ensuring that systems meet goals in dynamic environments. Control theory represents a general set of techniques, but its implementations are almost always system-specific. While there is a rich history of applying control solutions to meet latency and throughput goals in computer systems, these solutions require laborious profiling and tuning and must be reimplemented or redesigned wholesale when ported to a new computer system. Essentially, the great bulk of control theory applied to computing requires engineers to be experts in both control and computing.

In practice, we have replaced existing static configuration parameters or heuristics with control-based solutions that dynamically adjust parameters based on runtime conditions. We repeatedly find that the control-based approach keeps the system up and meeting its goals in scenarios where even patches posted by expert developers fail. We have even built controllers that control other control systems.
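A minimal sketch of the idea, using a toy system model rather than one of our production controllers: an integral feedback controller nudges a single resource knob, say a core count, until measured latency tracks the goal. All names and constants here are illustrative.

```python
# Sketch of an integral (I) feedback controller that adjusts one
# configuration knob to track a latency goal. The "system" is a toy model;
# in a real deployment the measurement would come from runtime telemetry.

def make_controller(goal, gain, lo, hi):
    """Integral controller: accumulate error, clamp the knob to [lo, hi]."""
    state = {"knob": lo}
    def step(measured_latency):
        error = measured_latency - goal      # positive => too slow
        state["knob"] += gain * error        # add resources when too slow
        state["knob"] = max(lo, min(hi, state["knob"]))
        return state["knob"]
    return step

# Toy system model: latency falls as the knob (e.g. core count) rises.
def system_latency(knob, load):
    return load / knob

controller = make_controller(goal=2.0, gain=0.5, lo=1.0, hi=16.0)
knob, load = 1.0, 8.0
for _ in range(50):
    knob = controller(system_latency(knob, load))
print(round(system_latency(knob, load), 2))  # settles at the 2.0 goal
```

Unlike a static configuration parameter, the same controller re-converges automatically if `load` changes at runtime, which is exactly the failure mode static heuristics miss.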

We have embedded control systems into existing computer systems. As an alternative, we have explored techniques that hide control parameters from users by embedding them into systems below the user interface. For example, we have embedded controllers into larger systems to manage approximate computing applications, to create portable, energy-efficient embedded software, to guarantee accuracy for machine learning inference in scientific simulations, to meet energy budgets for embedded systems, to manage bandwidth in network protocols for video analytics, and to automatically configure large-scale software systems like HDFS, HBase and Hadoop MapReduce.

Combining Machine Learning and Control Theory

As noted above, AI techniques are well-suited to build models of complex, configurable computing systems and control techniques are ideal for configuring those systems to meet goals despite dynamic fluctuations. It makes intuitive sense to combine both, and that capability is core for Config Dynamics’ Self-Aware computing technology.

It is not obvious how to bridge the gap between learned, non-linear models of discrete computing systems and the continuous, linear models used by common control systems. Our original research in this space initially explored several specific problems:

  1. Building AI models of system resource efficiency and combining those with control of application alternatives to maximize application quality for a given energy budget;
  2. Using AI to model application resource requirements and then combining that with control to meet quality-of-service guarantees for minimal cost in infrastructure-as-a-service platforms;
  3. Using AI to understand application resource needs to control latency with minimal energy for GPUs.

The key lesson learned from this work was the interface needed to combine the AI model with the control system. In short, we have the learners produce piece-wise linear models representing tradeoffs (for example the tradeoff between energy and latency for a given application and system) and pass these to the controller, which uses this linear model to make efficient resource allocation decisions in response to dynamics. One additional insight was that results improved if the AI system could also communicate its uncertainty to the control system.
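As a sketch of that interface (the frontier knots below are invented numbers, not measured data): the learner hands the controller a piecewise-linear latency/energy tradeoff, and the controller interpolates along it to find the minimum-energy operating point that meets a latency goal, reporting failure when the goal is unreachable.

```python
# Sketch of the learner/controller interface: the learner emits a
# piecewise-linear tradeoff model (latency vs. energy knots), and the
# controller interpolates to pick the cheapest point meeting a latency goal.
# The knot values are illustrative, not measurements.

# Pareto frontier as (latency, energy) knots, sorted by increasing latency.
frontier = [(1.0, 10.0), (2.0, 6.0), (4.0, 3.0), (8.0, 2.0)]

def min_energy_for_latency(frontier, latency_goal):
    """Linearly interpolate between knots to find the minimal energy
    achievable within the latency goal; None if the goal is unreachable."""
    if latency_goal < frontier[0][0]:
        return None  # even the fastest configuration is too slow
    for (l0, e0), (l1, e1) in zip(frontier, frontier[1:]):
        if l0 <= latency_goal <= l1:
            t = (latency_goal - l0) / (l1 - l0)
            return e0 + t * (e1 - e0)
    return frontier[-1][1]  # goal is looser than the slowest knot

print(min_energy_for_latency(frontier, 3.0))  # interpolated between knots: 4.5
print(min_energy_for_latency(frontier, 0.5))  # unreachable goal: None
```

The `None` case is the "report when requirements cannot be met" behavior: because the model is explicit, infeasibility is detected rather than silently missed.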

The result is a general methodology where an abstract control system is customized at runtime by a learner. We find that it enables:

  1. Fast reaction to dynamic changes,
  2. Error tolerance based on runtime feedback,
  3. Robust design that guarantees that operating requirements will be respected when possible, or the ability to report when those requirements cannot be met.

This is why we say that any Self-Aware system is better than any unaware system: the self-aware system is future-proofed and capable of meeting goals even when deployment conditions do not match the conditions anticipated at development time.

The Next Step: Making AI Adaptive

The CS community and the world at large have been transformed by advances in AI. And of course, like all computing systems, deployed AI inference systems must also meet quantifiable goals, including latency and energy but also inference accuracy.

Regrettably, AI inference has not been "configurable" in the classic sense: it is a product of AI training, which is itself a product of specific neural network architecture choices. The same limitations apply to agents, themselves products of very targeted inference.

But now, the good news is that our prior research on self-aware computing, approximate computing and loop perforation has placed us in a leadership position for dynamically managing AI to meet latency, energy and accuracy goals in dynamic environments.

We can now make AI configurable, and can then make AI Self-Aware, using goals to respond to workloads and extract performance. For us, AI, whether for training, inference or agent use, is another complex system problem, with huge consequences, made more performant when it is dynamically tuned through goals. With these breakthroughs, we are addressing two fundamental questions as we add AdaptiveAI™ tools to our Sequitur™ platform for general use:

  • First, how can we design and train neural networks that support dynamic (inference-time) tradeoffs between resource usage and accuracy? The answer so far has been to structure the networks to maximize internal data reuse and to use a novel orthogonalized stochastic gradient descent to balance accuracy across multiple network configurations.
  • Second, how do we dynamically control these networks to create better computing systems? Here the answer has been to combine the novel network structure from the first problem with novel control theoretic techniques designed for these adaptable neural networks.
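The first question can be illustrated generically (this sketch is not Config Dynamics' orthogonalized-SGD method, and all cost/accuracy numbers are invented): a trained network exposes several configurations, such as early exits, and a runtime policy picks the most accurate configuration that fits a fluctuating energy budget.

```python
# Generic sketch of inference-time resource/accuracy tradeoffs: a network
# offers several configurations (e.g. early exits or reduced widths), each
# with an energy cost and an expected accuracy, and a runtime policy picks
# the most accurate one that fits the current budget. Numbers are invented.

exits = [  # (energy cost per inference, expected accuracy)
    (1.0, 0.80),   # earliest exit: cheap, least accurate
    (2.5, 0.88),
    (5.0, 0.93),   # full network: expensive, most accurate
]

def pick_exit(exits, energy_budget):
    """Most accurate configuration whose cost fits the budget (None if none)."""
    feasible = [e for e in exits if e[0] <= energy_budget]
    return max(feasible, key=lambda e: e[1]) if feasible else None

# As the available energy fluctuates, the chosen configuration adapts.
for budget in (6.0, 3.0, 0.5):
    print(budget, pick_exit(exits, budget))
```

The second question then replaces this greedy policy with a feedback controller, so the configuration choice tracks goals through dynamics rather than reacting to a single snapshot.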

The results are exciting. For example, working with scientists at Argonne National Laboratory, we have developed techniques that use control theory to guarantee the accuracy of ML inference in the context of material science simulations. We have also used these techniques on embedded systems that harvest energy from the environment to maximize inference accuracy given a dynamically changing energy budget. As the world grows more dependent on AI, the value of AdaptiveAI will only increase, from embedded systems to enterprise solutions and everything in-between.

The Sequitur™ Platform

We’ve taken our know-how and technology and are making them broadly available for developers’ use. Whether for local use or in collaboration with our system integration team, we can help improve legacy systems, provide analysis and development support for configuring new, modern complex systems, improve AI tools themselves, collect data on system performance under various goals and conditions, and manage run-time operations to improve performance.

Config Dynamics’ Sequitur™ Platform is based on research that makes application of Self-Aware and AdaptiveAI computing functionality widely accessible, by supporting the design and implementation of self-aware systems without requiring developers to be experts in AI, learning, or control.

This is accomplished through a series of APIs providing access to advanced computational methods, goal-setting support, and resident configuration-analysis and run-time optimization tools, permitting developers to build systems with formal guarantees that their goals will be met.


Being Self-Aware is Always Better
Than Being Un-Aware.

Reason and Adapt

Dynamically Configure

Optimize to Goals