Committee

  • Aarti Gupta (advisor)
  • Andrew Appel
  • Zak Kincaid

Title

Sequential abstractions of data-parallelism in hardware

Abstract

Semi-programmable hardware accelerators are ubiquitous in modern systems-on-chip, but verifying correctness properties about the interaction between hardware accelerators and software is very challenging. In particular, many accelerators for machine learning use data-parallelism in hardware, and existing modeling and verification techniques would typically need to perform expensive reasoning about the interleaved updates to the system state made by the parallel hardware components.

In this talk, I present my work on two sequential abstractions of data-parallelism in hardware, "SIMD instructions" and "n-way parallelism", which I use to extend an existing specification model for hardware called an ILA (instruction-level abstraction). An ILA provides architecture-level specifications for hardware accelerators in much the same way that an ISA (instruction set architecture) provides specifications for processors. Essentially, an ILA instruction corresponds to certain inputs at the interface of the accelerator and captures the updates to architectural state that the accelerator performs in response to those inputs.
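
As a rough illustration (my own sketch in Python, not the ILAng API), an ILA-style instruction can be thought of as a decode predicate over the interface inputs together with per-state-variable update functions; the "StartCompute" instruction, the command register, and the "acc" state variable below are hypothetical.

    from dataclasses import dataclass
    from typing import Callable, Dict

    State = Dict[str, int]   # architectural state: variable name -> value
    Inputs = Dict[str, int]  # inputs seen at the accelerator's interface

    @dataclass
    class Instruction:
        name: str
        decode: Callable[[Inputs], bool]                    # when does the instruction fire?
        update: Dict[str, Callable[[State, Inputs], int]]   # next value of each state variable

        def execute(self, state: State, inp: Inputs) -> State:
            # All updates are computed against the old state and committed
            # together, as one atomic architecture-level step.
            new = dict(state)
            for var, fn in self.update.items():
                new[var] = fn(state, inp)
            return new

    # Hypothetical instruction: fires when the host writes 1 to a command
    # register and accumulates the data input into "acc".
    start_compute = Instruction(
        name="StartCompute",
        decode=lambda inp: inp["cmd"] == 1,
        update={"acc": lambda st, inp: st["acc"] + inp["data"]},
    )

    state = {"acc": 0}
    inp = {"cmd": 1, "data": 5}
    if start_compute.decode(inp):
        state = start_compute.execute(state, inp)
    print(state)  # {'acc': 5}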

For these two abstractions, I show how each allows an ILA instruction to capture the parallel updates due to data-parallelism in hardware, while still maintaining a sequential state-transition semantics. This avoids expensive reasoning in verification about interleavings of the parallel updates. Next, I show a further use of these abstractions by defining a "SIMD transformation" that takes a low-level model of a data-parallel accelerator and produces a higher-level model that is guaranteed to be equivalent to the low-level model. This has applications in automated code generation for accelerator hardware. Finally, I discuss a proof-of-concept implementation of these abstractions in ILAng, a platform for modeling and verification of software/hardware systems using ILA models, demonstrate the use of these abstractions on a real-world accelerator designed for machine learning, and describe directions in which this work can be extended.
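
To make the intuition behind the first abstraction concrete, here is a small sketch (my own illustration, not taken from the implementation): the parallel lanes of a data-parallel accelerator update disjoint slices of state, and every lane reads the same old state, so their combined effect is a single sequential state transition with no interleavings to consider. The lane count and the per-lane operation below are made up for the example.

    NUM_LANES = 4

    def lane_update(old_vec, lane, operand):
        # What one hardware lane does: a multiply-accumulate on its own slot.
        return old_vec[lane] + operand[lane] * 2

    def simd_instruction(old_vec, operand):
        # One ILA-level "SIMD instruction": every lane update is computed
        # against the same old state and all results are committed together,
        # so the instruction has ordinary sequential semantics even though
        # the hardware performs the lane updates in parallel.
        return [lane_update(old_vec, i, operand) for i in range(NUM_LANES)]

    vec = [0, 1, 2, 3]
    operand = [10, 10, 10, 10]
    print(simd_instruction(vec, operand))  # [20, 21, 22, 23]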

Reading List

  1. (Foundational, Textbook) Software Foundations, vol. 1 & 2, Pierce et al.
  2. (Foundational, Textbook) Logic in Computer Science, Huth and Ryan
  3. (Foundational SMT) Handbook of Model Checking, chapter on Satisfiability Modulo Theories, Barrett and Tinelli
  4. (Foundational MC) Handbook of Model Checking, chapter on Interpolation and Model Checking, McMillan
  5. (Foundational HW) Automatic verification of pipelined microprocessor control, Burch & Dill CAV’94
  6. (Foundational, MCM) A Primer on Memory Consistency and Cache Coherence, chapters 3, 5, 11
  7. (Closely Related, ILA) Instruction-Level Abstraction (ILA): A Uniform Specification for System-on-Chip (SoC) Verification, Huang et al. TODAES’18
  8. (Closely Related, 3LA) Relay: A New IR for Machine Learning Frameworks, Roesch et al. MAPL’18
  9. (Related, Loop optimization) Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines, Ragan-Kelley et al. PLDI’13
  10. (Related, Prior Art) The Design and Implementation of a Verification Technique for GPU Kernels, Betts et al. TOPLAS’15

Slides

They can be found here in PPTX format.

Questions

(including follow-up questions)
  1. (Appel) Give an operational semantics for the simply typed lambda calculus. (I gave a call-by-value semantics; a sketch appears after this list.) Follow-up: The semantics you gave is called call-by-value; how would these rules change under another evaluation strategy? (I mentioned call-by-name and described how the rules would change. I also incorrectly mentioned something about capture, leading to the next question.)
  2. (Appel) Why do your rules not require us to worry about capture? Why would modifying the semantics still not require us to worry about capture? (Substituted expressions don't have free variables.)
  3. (Kincaid) If you model a transition system with predicates I(x) (initial state), T(x, x') (next state), and have an error condition E(x), how would you check whether the transition system avoids the error state?
    (I explained the bounded model checking approach, sketched after this list, and the interpolation approach. Follow-up on the interpolation approach: if your main query is UNSAT, you either extract a new interpolant and keep going, or your interpolant is an invariant and you stop. What happens if your main query is SAT? I said it gives a counterexample, but I was wrong, because the interpolant overapproximates the set of reachable states, so the satisfying assignment may be spurious. Correct answer: we need to increase the number of steps we look ahead.)
  4. (Gupta) Describe how hardware-to-FSM equivalence is checked in the ILA methodology and in Burch and Dill.
  5. (Appel) What are some of the problems with two cores executing at the same time over shared memory? How can the cores avoid these issues? Andrew followed up with various questions, but the main one was: Why would someone want to move away from sequential consistency in hardware?
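
For question 1, here is a minimal sketch (my own illustration, written after the fact) of small-step call-by-value semantics for the untyped core of the lambda calculus, as a Python interpreter rather than inference rules. The term representation and example are hypothetical. It also shows why capture was not a concern in my rules (question 2): evaluation starts from a closed term, so only closed values ever get substituted.

    from dataclasses import dataclass
    from typing import Optional, Union

    # Terms: variables, lambda abstractions, applications.
    @dataclass(frozen=True)
    class Var:
        name: str

    @dataclass(frozen=True)
    class Lam:
        param: str
        body: "Term"

    @dataclass(frozen=True)
    class App:
        fn: "Term"
        arg: "Term"

    Term = Union[Var, Lam, App]

    def is_value(t: Term) -> bool:
        # Under call-by-value, the only values are lambda abstractions.
        return isinstance(t, Lam)

    def subst(t: Term, x: str, v: Term) -> Term:
        # t[x := v]. We only ever substitute closed values, so no variable
        # capture can occur and no renaming is needed (the point of question 2).
        if isinstance(t, Var):
            return v if t.name == x else t
        if isinstance(t, Lam):
            return t if t.param == x else Lam(t.param, subst(t.body, x, v))
        return App(subst(t.fn, x, v), subst(t.arg, x, v))

    def step(t: Term) -> Optional[Term]:
        # One small step of call-by-value evaluation, or None if t is a value
        # (or stuck): evaluate the function, then the argument, then beta-reduce.
        if isinstance(t, App):
            if not is_value(t.fn):
                nxt = step(t.fn)
                return App(nxt, t.arg) if nxt is not None else None
            if not is_value(t.arg):
                nxt = step(t.arg)
                return App(t.fn, nxt) if nxt is not None else None
            return subst(t.fn.body, t.fn.param, t.arg)
        return None

    # Example: (\x. x) ((\y. y) (\z. z)) steps to \z. z
    term: Term = App(Lam("x", Var("x")), App(Lam("y", Var("y")), Lam("z", Var("z"))))
    while (nxt := step(term)) is not None:
        term = nxt
    print(term)  # Lam(param='z', body=Var(name='z'))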
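
For question 3, here is a minimal sketch of the bounded model checking part of the answer, using the Z3 Python bindings. The toy predicates I, T, and E below (a counter that errs at 10) are my own placeholders for the ones in the question.

    from z3 import Int, Solver, sat

    # Placeholder transition system: starts at 0, increments by 1, errs at 10.
    def I(x):      return x == 0        # initial states
    def T(x, xn):  return xn == x + 1   # transition relation
    def E(x):      return x == 10       # error states

    def bmc(bound):
        # For each depth k <= bound, ask whether
        #   I(x0) /\ T(x0, x1) /\ ... /\ T(x_{k-1}, x_k) /\ E(x_k)
        # is satisfiable; a satisfying assignment is a concrete error trace.
        xs = [Int("x%d" % i) for i in range(bound + 1)]
        for k in range(bound + 1):
            s = Solver()
            s.add(I(xs[0]))
            for i in range(k):
                s.add(T(xs[i], xs[i + 1]))
            s.add(E(xs[k]))
            if s.check() == sat:
                return k, s.model()     # error reachable in k steps
        return None                     # no error found within the bound

    print(bmc(15))  # for this toy system: error found at depth 10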