Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve recalc engine algorithm #308

Open
MikeStall opened this issue Apr 14, 2022 · 12 comments
Open

Improve recalc engine algorithm #308

MikeStall opened this issue Apr 14, 2022 · 12 comments
Labels
good first issue Good for newcomers

Comments

@MikeStall
Copy link
Contributor

(this is a bit more advanced for a first issue - requires knowledge of topological sorts. But it's marked as a "first issue" since the code is very accessible with few dependencies)

RecalcEngine's lets you define Formulas (via SetFormula) and then it will automatically recalc the formula if the dependencies change.

But the implementation here is extremely hack:

// $$$ This is a really bad, inefficient, buggy implementation.

This is the ability to call UpdateVariable() and cause other Formulas (created from SetFormula) to be recalculated.

It's good enough to pass the tests, but a very poor implementation.
See tests here:

A good implementation:

  • would statically understand what each formula depends on
  • efficient recalc - does not unnecessary work.
  • robust handling for circular references
  • include aggressive tests.
  • allow deleting a formula.

The current implementation is also very inefficient. We could construct a perf benchmark to show before and after.

@MikeStall MikeStall added the good first issue Good for newcomers label Apr 14, 2022
@omprakaash
Copy link
Contributor

Can I take up this issue? I am interested in working on this. My current understanding is that the implementation of RecalcEngine avoids cycles by only allowing new formulas to refer to previously defined formulas and by not allowing formulas to be redefined. Would an improved implementation allow for formula redefinition by detecting cycles in the top-sort algorithm ?

@orcmid
Copy link

orcmid commented Apr 16, 2022

This is an off-hand comment. I have not examined RecalcEngine, just musing from somewhere close to first principles. I recall that there was once an iteration cap that was used to clamp off recalcs with cycles. Is that still around?

@omprakaash asks: Would an improved implementation allow for formula redefinition by detecting cycles in the top-sort algorithm ?

Technically, there is no topological sort of a directed graph having cycles longer than 1. I just dug this out of "The Art of Computer Programming", volume 1 Exercise 2.3.4.2-4.

So, the ideal conditions given by @MikeStall require some sort of hacks for which there is assured termination and minimal update (addition, change, deletion) consequences. That's a large order.

The other aspect (in comparison with Knuth's solution) is that it would be unusual for the (Excel) dependency tree to be rooted. There are usually multiple cells that are not depended on and may or may not depend on others although I presume that the inversion (ones depended on that depend on no others) is important for recalc. Then there's the matter of determining when to recalc when change(s) occur and how generational updates occur (I am guessing though.)

PS: I have this as a practical problem with a pet data structure, although the conditions for topological sorting are satisfied. The inversion is desired for extracting the structure to a sequential stream. The optimization challenge is having nodes appear in the stream close to where they are (first) needed (depended on) as chunks of the data structure are composed. I expect there will have to be a good-enough heuristic for that.

@MikeStall
Copy link
Contributor Author

@omprakaash - feel free to take a look! I'd recommend a) understanding why the current implementation is poor; b) perhaps starting with some new tests (bonus if they break the current impl), c) a draft PR to capture the essence of your fix.

(fyi, @anderson-joyle had also expressed interest in this).

@orcmid - we do not support cycles - so SetFormula() should fail if attempting to create a cycle. It does beg the question if we should have an overload of SetFormula() that accepts multiple formulas and sets in bulk.

This is still preview features, so we can make breaking changes to the SetFormula contract here if really needed.

@omprakaash
Copy link
Contributor

Got it. Thank You! Spent some time testing. Will create a Draft PR to discuss more about it.

@orcmid Thank you for the detailed reply. Sorry I wasn't clear on the top-sort part. I meant that the algorithm itself could detect cycles as a means of error management. Thus, cycles can now be prevented without placing restrictions on formula redefinition.

@anderson-joyle
Copy link
Contributor

Will be good to see 2 or more approaches to this problem.

@MikeStall
Copy link
Contributor Author

One thing that's hacky about the current implementation - we shouldn't need Hash in RecalcEngineWorker. If we update A, we should know statically the other vars that need to be updated, without having to do string comparisons or hooking the resolver or IScope.

@jcoc611-microsoft
Copy link
Contributor

jcoc611-microsoft commented Apr 18, 2022

What do you all think of these C# APIs? I think this transaction-style approach would be slightly simpler (orthogonal to a better implementation for tracking dependencies)

RecalcEngine re = new();

// None of these affect the underlying RecalcEngine until the re.Recalc(rt) call
RecalcTransaction rt = re.NewTransaction();
rt.UpdateVariable(name, formulaValue);
rt.UpdateFormula(name, formula);
rt.RemoveVariable(name);
rt.RemoveFormula(name);
//...
rt.DirtyFormulas;         // fast, topo-sorted, and deduped no-alloc enumerator
rt.DirtySymbols;          // both formulas and variables
rt.Errors, rt.HasErrors; // e.g. missing symbols, cycles, etc


await re.Recalc(rt); // async and parallelized, maybe throws if rt.HasErrors?

Not sure if onUpdate callbacks are also needed.

Another interesting issue that hasn't been mentioned is "volatile" functions such as Today() which technically don't depend on other variables but should probably be recomputed on every recalc (this is the Excel policy)

@MikeStall
Copy link
Contributor Author

@bryancroteau-MSFT is looking at adding topological sorts. That could be useful here.
#320

@bryancroteau-MSFT
Copy link
Contributor

#320 has been merged into main. The next step is to modify the binder, or add a new binder-like tree visitor, which can produce a list of known edges between formulas.

@jcoc611-microsoft
Copy link
Contributor

#320 has been merged into main. The next step is to modify the binder, or add a new binder-like tree visitor, which can produce a list of known edges between formulas.

what's the difference between that and using checkResult.TopLevelIdentifiers?

@bryancroteau-MSFT
Copy link
Contributor

I will need to see how that is being used. The difference is that we would want the binding to skip some work if we are just checking for dependencies between formulas without doing full type reconstruction or type checking.

@omprakaash
Copy link
Contributor

There seems to be a breaking bug in the current implementation of RecalcWorker when the dependency graph is diamond shaped. I included some tests in PR #315. Eg: a = 1,b = 2*a, c = 2* a, d = b + c.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

6 participants