# Bloom Compiler

(This is work-in-progress code that @JoshRosen wrote in grad school).

This is an experimental compiler for a new language based on Bloom.

## Usage

The command-line interface to the Bloom compiler is implemented by the `Compiler` object.

During development, the easiest way to run the compiler is through `sbt run`; use

```
sbt "run --infile <sourcefile> --target [dataflow|rxflow]"
```

to compile `sourcefile` and generate either a GraphViz `.dot` file representing a dataflow graph (for the `dataflow` target) or JavaScript code (for the `rxflow` target).

To build a binary distribution of the compiler, use the sbt-native-packager sbt targets; for example, running `sbt stage` will stage a binary distribution in the `compiler/target/universal/stage` directory, where you can use

```
./compiler/target/universal/stage/bin/bloom-compiler
```

to run the compiler.

We don't have a language reference yet, but in the meantime there are some example programs that give a flavor of the language.

## Phases

To compile a Bloom program, the compiler executes the following phases:

- Parse source
- Resolve names
- Assign types
- Analyze dependencies
- Stratify collections and rules
- Generate intermediate dataflow graph representation
- Calculate invalidation and rescan sets
- Emit source code to instantiate dataflow on the target runtime
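
As a rough illustration, the ordering above can be sketched as a fold over a list of phases; the phase names mirror the list, but the functions and the state shape here are hypothetical, not the compiler's actual classes:

```javascript
// Hypothetical sketch of the phase pipeline; each phase transforms the
// compilation state, and here each one just records its own name so
// the ordering is visible.
const phases = [
  "parse",
  "resolveNames",
  "assignTypes",
  "analyzeDependencies",
  "stratify",
  "generateDataflow",
  "invalidationAnalysis",
  "emitTargetCode",
];

function compile(source) {
  return phases.reduce(
    (state, name) => ({ ...state, log: [...state.log, name] }),
    { source, log: [] }
  );
}
```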

## Implementation Details / Developer Notes

A goal of this implementation is to maintain a clean separation of concerns between the different compilation and analysis components. For example, the stratification logic is split across two files: one contains all of the dependency-analysis code, and the other declaratively describes the stratification conditions in terms of those dependencies.
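
To illustrate that split, stratification can be stated declaratively over a dependency graph: a collection's stratum must be at least that of each positive dependency and strictly greater than that of each negated one. A minimal sketch (the function and field names are mine, not the compiler's):

```javascript
// Hypothetical sketch of declarative stratum assignment. An edge
// { from, to, negated } means `to` depends on `from`; a negated
// dependency forces `to` into a strictly higher stratum. Assumes a
// stratifiable program (no negation inside a dependency cycle), so
// the fixed-point iteration terminates.
function stratify(collections, deps) {
  const strata = new Map(collections.map((c) => [c, 0]));
  let changed = true;
  while (changed) {
    changed = false;
    for (const { from, to, negated } of deps) {
      const needed = strata.get(from) + (negated ? 1 : 0);
      if (strata.get(to) < needed) {
        strata.set(to, needed);
        changed = true;
      }
    }
  }
  return strata;
}
```

For example, if `c` depends on the negation of `b`, it lands in a higher stratum than `b`, so `b` is fully computed before `c`'s rules run.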

Several components of the compiler are implemented using Kiama. We use its positional parser utilities, lazy attribute grammars, and tree-rewriting systems to address "the AST typing problem".

## Structure of the Generated Dataflow Code

The runtime-specific dataflow code generators try to isolate the Bloom-specific code to a few operators that sit at the edges of the dataflow, plus some orchestration code that sits outside the dataflow. Because no Bloom-specific requirements are imposed on the remaining dataflow operators, we can reuse existing dataflow systems by implementing at most one or two new operators.

The RxFlow backend translates a Bloom program into a JavaScript object that encapsulates a dataflow and exposes RxJS Observer and Observable interfaces for integration with external data sources.

This description is out of date, now that I'm moving to punctuation-based control flow:

To execute a single logical timestep of a Bloom program, an instance of the RxFlow JavaScript runtime performs the following actions:

1. For all tables, check whether they have pending deletions. Perform those deletions and record whether any records were actually deleted; this recording step is necessary because deletions may try to remove nonexistent tuples.
2. Determine the set of invalidated elements: for any table that performed deletions, look up the set of invalidated elements by indexing into a static structure generated by the compiler's invalidation analysis module.
3. For each invalidated dataflow element, perform the appropriate calls to clear its cached state.
4. In order, execute each stratum's dataflow until fixed point:
   - To initiate the computation of this stratum, rescan the invalidated tables, then perform any pending insertions that were deferred in the previous timestep.
   - To detect termination:
     - Acyclic dataflow sources can emit "end-of-round" messages once they have produced all of their outputs; these messages propagate downstream until the dataflow sinks have finished the round.
     - For cyclic dataflows, we can rely on the fact that every cycle includes either a scratch or a table: we buffer insertions into these tables and call `flush()` on each table to initiate another cycle. If a round passes in which no tuples were flushed, we declare termination. This effectively breaks the cycles, replacing them with iterative execution of an acyclic dataflow; a completely asynchronous, cyclic dataflow is possible, but it requires more complex termination-detection techniques.
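
The deletion check and the flush-based fixed point described above can be sketched as follows; `Table` and `runStratum` here are simplified, hypothetical stand-ins, not the actual RxFlow runtime classes:

```javascript
// Simplified sketch of the timestep mechanics: step 1's deletion
// check, plus step 4's flush-until-no-progress fixed point.
class Table {
  constructor() {
    this.tuples = new Set();
    this.pendingDeletions = new Set();
    this.buffered = new Set(); // insertions buffered during a round
  }

  // Step 1: perform pending deletions, reporting whether any tuple was
  // actually removed (deletions may target nonexistent tuples).
  applyDeletions() {
    let removed = false;
    for (const t of this.pendingDeletions) {
      if (this.tuples.delete(t)) removed = true;
    }
    this.pendingDeletions.clear();
    return removed;
  }

  // Flush buffered insertions to initiate another cycle; returns true
  // only if a new tuple actually entered the table.
  flush() {
    let fresh = false;
    for (const t of this.buffered) {
      if (!this.tuples.has(t)) {
        this.tuples.add(t);
        fresh = true;
      }
    }
    this.buffered.clear();
    return fresh;
  }
}

// Step 4: run one stratum's rules (`round`) repeatedly, declaring
// termination when a round flushes no new tuples into any table.
function runStratum(tables, round) {
  let progress = true;
  while (progress) {
    round();
    // Flush every table first, then ask whether any made progress.
    progress = tables.map((t) => t.flush()).some(Boolean);
  }
}
```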

In this architecture, the `Table` element is somewhat Bloom-specific, but the other dataflow elements are completely generic; they only need to support a clear/reset method to clear cached state and possibly an end-of-round mechanism to cause buffering operators to emit all of their outputs.
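
That generic element contract is small enough to sketch directly; the class names here are hypothetical, not the runtime's actual API:

```javascript
// Hypothetical sketch of the generic dataflow-element contract: only
// clear() for invalidation and an optional endOfRound() hook are
// required, so most elements stay free of Bloom-specific behavior.
class DataflowElement {
  clear() {}      // reset any cached state after invalidation
  endOfRound() {} // emit buffered outputs at end of round (optional)
}

// Example generic element: a buffering operator that caches inputs and
// drops its cache when invalidated. Nothing here is Bloom-specific.
class BufferingOperator extends DataflowElement {
  constructor() {
    super();
    this.cache = [];
  }
  clear() {
    this.cache = [];
  }
}
```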