Circuits and Systems

Spatial Computing
Around 2005, the decades of clock-speed improvement associated with Moore's Law ended, and we entered the era of many-core computing. The basic rationale is that advances in performance must now come from using larger areas of silicon rather than from faster device operation. This trend is clearly visible where the blue data rolls over in 2005, while capacity continues unabated in the red line; price drops as a consequence, as does power (primarily through aggressive techniques rather than raw technology benefits). This particular data is specific to FPGA technology.

The CAL Lab is actively exploring architectures that embrace high device counts, and one particular area of exploration is algorithmic acceleration on malleable substrates. Programmable logic has typically enjoyed an increase in device size with each technology shift, as well as a fortuitous re-partnering with fabrication facilities that gives it head-of-the-line status for new nodes, because the regular nature of its patterning provides a good evaluation platform for process tuning.

Circuit specialization has long demonstrated both power and performance benefits of up to 1000X, but at the expense of cost and time-to-market. It is impractical in HPC to equip general-purpose platforms with dedicated ASICs, except in exceptional cases, such as the GRAPE and Anton bio-computing tasks, where the application justifies the cost and loss of flexibility.

Embedded devices with severe power constraints, like battery-powered cellular phones, justify significant dedicated acceleration. These accelerators are so-called dark silicon, because those circuit areas are powered off when not in use. This is the predominant mechanism for energy efficiency in portable devices. It is our task in the CAL Lab to adapt the method for HPC needs as we enter the Exascale era. The Apple A8 chip contains an estimated 29 specialized accelerator areas according to David Brooks (purple areas).

Programmable devices like FPGAs have demonstrated an advantage of around 10X in performance (energy/speed of computation) over conventional microprocessors, but they sacrifice several orders of magnitude of the potential benefit in order to remain "programmable", at least on a different timescale than CPUs.

The use of FPGAs embedded alongside standard host processors is termed acceleration, or sometimes algorithmic acceleration in reference to the task being improved. At the University of Illinois between 2005 and 2007, CAL Lab manager Dan Burke developed a full API and Linux co-execution stack based upon loadable linked libraries, and this work is now being leveraged for HPC tasks.

Collaborators at Microsoft Research have concurrently developed Big Data specific key-match search accelerators and deep-learning convolutional neural net classifiers, and deployed them into a pilot 1,600-node cluster; we have access to the test cluster for our evaluations through a research partnership.

Search improvement was documented at a conservative 2X in throughput but, more importantly, at a 29% reduction in result delivery time, as demonstrated in a full production environment within a Bing datacenter. It is our assertion that this is a potentially viable path for HPC enhancement going forward.

Advanced Prototyping Substrate
Between 2010 and 2012 at BWRC, a maskless silicon substrate technology based upon femtosecond laser ablation was (mostly) developed. This project was brought to the Lab and is currently inactive, awaiting a next round of funding. It is described here as a mechanism to advance Exascale evaluation by integrating advanced memory components (HMC, HBM, NVRAM), various technology nodes, photonics components, and prototype processors (RISC-V).

  • Explore on-chip network topologies useful for HPC and characterize them with executing applications directly connected to high-bandwidth memory and possibly silicon photonics I/O; develop an efficient software layer for memory access.
  • Incorporate commercial engineering samples within a neutral, protected environment, and expose the cores to actual workloads.

Near-Threshold Operation
Experimental efforts have demonstrated energy improvements of 5X to 8X (possibly 10X) from operating very near the transistor threshold voltage. This comes at the expense of some speed, and standard 6T SRAM is not easy to implement in this region; however, as multiple voltage domains become common practice, this concern eases.

The CAL Lab has a strong association with the Berkeley Wireless Research Center and has participated in chip evaluation of the Raven series of low-power RISC-V cores, which achieve 34+ GFLOPS/W with double precision and vector acceleration and contain other significant specialized power-reduction components (V-F clocking, on-chip DC-DC conversion, and low-threshold 8T SRAM).

Near-threshold computing embraces these ideals:

  • Avoid energy wastage by not operating transistors in strong inversion mode
  • Recover lost performance by exploiting massive parallel execution
  • Aggressively manage idle power, as leakage and switching become dominant contributors
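
The first two ideals above can be illustrated with a rough back-of-the-envelope model. The sketch below is not a Lab result: it assumes dynamic energy per operation scales with the square of the supply voltage, and it uses the alpha-power delay model (which is only approximate this close to threshold) to estimate how many parallel units would be needed to recover the lost clock rate. The supply voltages (1.0 V and 0.4 V), the threshold voltage (0.3 V), and the exponent (1.3) are illustrative assumptions, and leakage is ignored.

```python
import math


def dynamic_energy_ratio(v_nominal, v_low):
    """Energy-per-operation improvement, assuming E is proportional to V^2."""
    return (v_nominal / v_low) ** 2


def relative_frequency(v_dd, v_th, alpha):
    """Relative clock rate from the alpha-power model: f ~ (V - Vth)^alpha / V."""
    return (v_dd - v_th) ** alpha / v_dd


def slowdown(v_nominal, v_low, v_th, alpha):
    """How many times slower one unit runs at the lower supply voltage."""
    return (relative_frequency(v_nominal, v_th, alpha)
            / relative_frequency(v_low, v_th, alpha))


def units_to_match_throughput(v_nominal, v_low, v_th, alpha):
    """Parallel units needed to match nominal throughput (perfect scaling assumed)."""
    return math.ceil(slowdown(v_nominal, v_low, v_th, alpha))


if __name__ == "__main__":
    # All four values below are illustrative assumptions, not measured data.
    V_NOMINAL, V_LOW, V_TH, ALPHA = 1.0, 0.4, 0.3, 1.3
    print("energy improvement :", dynamic_energy_ratio(V_NOMINAL, V_LOW))
    print("per-unit slowdown  :", slowdown(V_NOMINAL, V_LOW, V_TH, ALPHA))
    print("units to break even:",
          units_to_match_throughput(V_NOMINAL, V_LOW, V_TH, ALPHA))
```

Under these assumed values the energy improvement falls within the 5X to 8X range quoted above, and the slowdown is of a similar magnitude, which is why massive parallelism is needed to recover performance.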

Sub-Threshold Operation 
Conventional CMOS technology can achieve very high levels of complexity by operating the individual transistors in their subthreshold region, where the drain current is an exponential function of the gate-source voltage and remains near-linear on a logarithmic scale for many decades of current. In this regime of operation, amplifiers can be operated with current levels in the range from 10^-12 A to 10^-7 A. At these low currents, the drain current of the individual transistors saturates at drain voltages above 100 to 200 mV in analog mode. In digital use, when logic levels are redefined from a saturating regime to an exponential "1", circuits consume very little energy, albeit at the expense of speed.
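
The exponential behavior described above can be checked numerically with the common simplified subthreshold model I_D = I_0 · exp(V_GS / (n·V_T)) · (1 − exp(−V_DS / V_T)), where V_T = kT/q is the thermal voltage. This is a minimal sketch rather than a device model from the Lab: the scale current `i0`, the slope factor `n = 1.5`, and the 300 K temperature are assumed illustrative values.

```python
import math

K_BOLTZMANN = 1.380649e-23    # Boltzmann constant, J/K
Q_ELECTRON = 1.602176634e-19  # elementary charge, C


def thermal_voltage(temp_k=300.0):
    """Thermal voltage V_T = kT/q in volts."""
    return K_BOLTZMANN * temp_k / Q_ELECTRON


def saturation_factor(v_ds, temp_k=300.0):
    """Drain-voltage term (1 - exp(-V_DS/V_T)); approaches 1 once saturated."""
    return 1.0 - math.exp(-v_ds / thermal_voltage(temp_k))


def drain_current(v_gs, v_ds, i0=1e-12, n=1.5, temp_k=300.0):
    """Simplified subthreshold drain current; i0 and n are assumed values."""
    v_t = thermal_voltage(temp_k)
    return i0 * math.exp(v_gs / (n * v_t)) * saturation_factor(v_ds, temp_k)


def gate_swing_per_decade(n=1.5, temp_k=300.0):
    """Gate-voltage change that multiplies the drain current by ten."""
    return n * thermal_voltage(temp_k) * math.log(10.0)


if __name__ == "__main__":
    for v_ds in (0.1, 0.2):
        print(f"V_DS = {v_ds:.1f} V -> saturation factor "
              f"{saturation_factor(v_ds):.4f}")
    swing = gate_swing_per_decade()
    print(f"gate swing per decade: {swing * 1e3:.1f} mV")
    print(f"swing for 5 decades (1e-12 A to 1e-7 A): {5 * swing * 1e3:.0f} mV")
```

With these assumptions the drain-voltage term is already within a few percent of its limit at 100 mV, consistent with the 100 to 200 mV saturation range stated above, and the five decades of current from 10^-12 A to 10^-7 A are spanned by well under a volt of gate swing.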

This approach may prove useful when deployed in spatial computing where power concerns dominate clocking and circuits are intentionally self-timed or asynchronous (not chip-crossing) and plentiful in number.