
Improve testing framework #37 (Open)

SUPERCILEX wants to merge 36 commits into main
Conversation

SUPERCILEX (Contributor)
The current tests are quite painful to work with since it's very difficult to tell what the expected values should actually be. This PR flips the testing around by passing in a sequence of expected values that the test framework will validate.

Note: depends on #19.
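For a sense of the shape this takes, here is a minimal sketch of the idea, with a hypothetical StepExpectation type and check_steps driver standing in for the real framework (these names are illustrative, not the PR's actual API):

#[derive(Debug, PartialEq)]
struct StepExpectation {
    progress: f32,
    times_completed: u32,
}

// Drive the system once per expected step and compare against the declared value.
fn check_steps(
    mut tick: impl FnMut() -> StepExpectation,
    expected: impl IntoIterator<Item = StepExpectation>,
) {
    for (i, want) in expected.into_iter().enumerate() {
        assert_eq!(tick(), want, "mismatch at step {i}");
    }
}

#[test]
fn tween_progresses_linearly() {
    // Stand-in for ticking a real 1-second tween by 200ms per step.
    let mut elapsed_ms = 0u32;
    check_steps(
        move || {
            elapsed_ms += 200;
            StepExpectation {
                progress: elapsed_ms.min(1000) as f32 / 1000.0,
                times_completed: elapsed_ms / 1000,
            }
        },
        [
            StepExpectation { progress: 0.2, times_completed: 0 },
            StepExpectation { progress: 0.4, times_completed: 0 },
            StepExpectation { progress: 0.6, times_completed: 0 },
            StepExpectation { progress: 0.8, times_completed: 0 },
            StepExpectation { progress: 1.0, times_completed: 1 },
        ],
    );
}

Written this way, the expected sequence is spelled out literally, so a reviewer can read the test top to bottom without reverse-engineering any math.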

SUPERCILEX added 19 commits May 29, 2022 22:10
@SUPERCILEX SUPERCILEX changed the title Improve testing framework [WIP] Improve testing framework Aug 3, 2022
@SUPERCILEX (Contributor, author)

@djeedai This PR is ready and is already showing its value by making it super easy to see that a bunch of stuff is broken (the transforms, completion counts, progress). I didn't bother converting the sequence/track/delay tests because I'll do those in a pre-PR before rewriting the implementations.

@SUPERCILEX SUPERCILEX changed the title [WIP] Improve testing framework Improve testing framework Aug 4, 2022
Review thread on src/tweenable.rs (outdated):
repeat(TweenState::Active),
[
Transform::from_translation(Vec3::splat(0.6)),
Transform::from_translation(Vec3::splat(-0.2)),
djeedai (Owner)
How is it ever possible that a tween animating a transform between x=0 and x=1 produces a negative x value?!

SUPERCILEX (Contributor, author)
Because main is broken. :) Progress currently increases forever, so when we pong you get 1.0 - 1.2 = -0.2.

I'm actually super happy that you spotted this because it was broken in #19 but the tests were too complicated for us to even notice.
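To spell out the arithmetic (a reconstruction of the failure mode, not the actual library code): if the mirrored "pong" leg is computed as 1.0 - progress, and progress is allowed to grow to 1.2 instead of wrapping, the lerp factor goes negative, which is where the -0.2 above comes from:

fn main() {
    // Reconstruction of the bug described above; the real tween code is more
    // involved, but the sign flip is the same.
    let progress = 1.2_f32;          // should have wrapped back into [0, 1]
    let mirrored = 1.0 - progress;   // "pong" leg: 1.0 - 1.2 = -0.2
    let start = 0.0_f32;
    let end = 1.0_f32;
    let x = start + mirrored * (end - start); // lerp with a negative factor
    assert!((x - (-0.2)).abs() < 1e-6);
    println!("translation.x = {x}"); // prints -0.20000005 (f32 rounding)
}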

djeedai (Owner)
I'm now in favor of progress being just this thing, always in [0, 1], that only tells you how far along you are within a single loop. Anything else, especially since it's (or will be) a second-class concept, is too complicated, I think, and no longer valuable.
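A minimal sketch of that reading of progress, assuming a free-standing helper over elapsed and cycle durations (the names are illustrative, not the crate's API); the value resets every loop and stays in [0, 1):

use std::time::Duration;

// Per-loop progress: the fractional position within the current cycle.
fn loop_progress(elapsed: Duration, cycle: Duration) -> f32 {
    let frac = elapsed.as_secs_f64() / cycle.as_secs_f64();
    (frac - frac.floor()) as f32 // always in [0, 1); exactly 0.0 at loop boundaries
}

fn main() {
    let cycle = Duration::from_secs(1);
    assert_eq!(loop_progress(Duration::from_millis(250), cycle), 0.25);
    // 1.25 cycles in: the second loop is 25% done.
    assert_eq!(loop_progress(Duration::from_millis(1250), cycle), 0.25);
}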

SUPERCILEX (Contributor, author)
I'm personally still in favor of not including a progress API at all since it's not that hard to implement yourself with whatever semantics you like. That said, this won't be used internally so it doesn't matter too much and I'm fine implementing it either way.

SUPERCILEX (Contributor, author) commented Aug 10, 2022
Though that's a discussion for #38.

djeedai (Owner)
Can we write the test as if there was no bug, and merge a failing test, rather than merge a test that ensures there is a bug? I'm uncomfortable having tests that are wrong on purpose.

djeedai (Owner)
Or, comment out the test for now if the concern is that CI will fail.
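For reference, Rust's built-in #[ignore] attribute is another way to keep a known-broken test out of CI without deleting or commenting out its body; it still compiles, and can be run on demand with cargo test -- --ignored (the test name and message here are illustrative):

#[test]
#[ignore = "broken on main; see the negative-transform discussion above"]
fn tween_tick_loop_finite() {
    // test body unchanged; skipped during normal `cargo test` runs
}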

SUPERCILEX (Contributor, author)
I actually disagree on this one, opinions forthcoming.

Tests validate behavior, not correctness — only a human can assert correctness. In other words, tests document the current behavior.

By removing the tests, we lose the ability to see how the code currently works. That has value in and of itself: someone who believes something is broken can find a test that confirms their understanding of the current behavior and then propose new behavior that the fix must adhere to.

Furthermore, if there are no existing tests, then when a change comes along to "fix" the behavior, we won't have old tests to validate that the bug was actually fixed. That is, adding a new test only alongside the fix prevents you from seeing the transition from incorrect to correct, meaning you can't verify that the test is doing anything at all, because you as the reviewer have never seen it fail. Again, this transition provides effortless documentation of the change in behavior that would otherwise be invisible... are the new tests just adding more coverage or validating some new behavior? Who knows!

Adding tests that validate behavior, right or wrong, documents the state of the world and its transitions.

@djeedai djeedai added the enhancement New feature or request label Aug 6, 2022
@SUPERCILEX (Contributor, author)

@djeedai This one should be ready for another round of review. It'd also be nice to get a decision on #44 since #38 will need to update Animator APIs otherwise.

@djeedai (Owner) left a comment:
OK, I was not convinced at first, but that does indeed look a lot easier to read now. I've left a couple of comments, mostly nit-picking that you may skip if you wish, but I'd like to clarify #43 and the negative-transform issue before merging.


@SUPERCILEX (Contributor, author)

OK, I've been thinking about these tests some more and I'm realizing I've been dumb. What I'm trying to do is replicate expect tests, but I'm forcing us to write code to generate the goldens instead of just using a golden-file test. I'm going to switch to that, which should be way less code and even easier to understand.
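For concreteness, a hedged sketch of what such a test might look like with the goldenfile crate (which the CI output quoted further down appears to use); the directory, file name, and per-tick lines are placeholders:

use std::io::Write;

use goldenfile::Mint;

#[test]
fn tween_tick_loop_finite() {
    // Each golden lives under tests/goldenfiles/ and is diffed against the
    // test's output when `mint` is dropped at the end of the test.
    let mut mint = Mint::new("tests/goldenfiles");
    let mut golden = mint
        .new_goldenfile("tween_tick_loop_finite.stdout")
        .unwrap();

    // Placeholder for driving the real tween: dump a readable summary per tick.
    for step in 1..=5 {
        writeln!(golden, "Tick {step}: progress={}", step as f32 * 0.2).unwrap();
    }
    // Run with REGENERATE_GOLDENFILES=1 to (re)create the checked-in file.
}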

@SUPERCILEX (Contributor, author)

@djeedai Ok, WDYT?

@djeedai (Owner) commented Sep 21, 2022

> @djeedai Ok, WDYT?

I think I don't like goldens conceptually. I can see the value for something like a codegen tool, or some logs, or other text-heavy things. But using that to compare numerical values output as text sounds pretty contrived when you can compare the values directly. This also removes the value of having an assert!() telling you exactly where the test failed; I guess you can recover it by diffing the stdout, but CI, for example, probably won't tell you right away. rstest was fine, but personally I don't like this new iteration.

@SUPERCILEX (Contributor, author)

> This also removes the value of having an assert!() telling you exactly where the test failed

I strongly disagree here. The asserts are basically useless, which is part of why I'm trying to get rid of them. Let me demonstrate with an incorrect completion count.

Here's what a failing golden test looks like:

failures:

---- tweenable::tests::tween_tick_loop_finite stdout ----

Goldenfile diff for "tween_tick_loop_finite.stdout":
To regenerate the goldenfile, run
    REGENERATE_GOLDENFILES=1 cargo test
------------------------------------------------------------
thread 'tweenable::tests::tween_tick_loop_finite' panicked at 'assertion failed: `(left == right)`'
...

Differences (-left|+right):
 1.333333332s/1s elapsed
 Direction: Forward
 State: Active
 Progress: 1.3333334
-Total completions: 0
+Total completions: 1
 Transform: Transform { translation: Vec3(1.3333334, 1.3333334, 1.3333334), rotation: Quat(0.0, 0.0, 0.0, 1.0), scale: Vec3(1.0, 1.0, 1.0) }
 Event received: TweenCompleted { entity: 42v0, user_data: 69420 }
...

The OG tests:

failures:

---- tweenable::tests::tween_tick stdout ----
TweeningType: count=Finite(1) strategy=Repeat dir=Forward
Expected: progress=0.2 factor=0.2 times_completed=0 direction=Forward state=Active just_completed=false translation=Vec3(0.2, 0.2, 0.2)
Expected: progress=0.4 factor=0.4 times_completed=0 direction=Forward state=Active just_completed=false translation=Vec3(0.4, 0.4, 0.4)
Expected: progress=0.6 factor=0.6 times_completed=0 direction=Forward state=Active just_completed=false translation=Vec3(0.6, 0.6, 0.6)
Expected: progress=0.8 factor=0.8 times_completed=0 direction=Forward state=Active just_completed=false translation=Vec3(0.8, 0.8, 0.8)
Expected: progress=1 factor=1 times_completed=0 direction=Forward state=Completed just_completed=true translation=Vec3(1.0, 1.0, 1.0)
thread 'tweenable::tests::tween_tick' panicked at 'assertion failed: `(left == right)`
  left: `1`,
 right: `0`', src/tweenable.rs:1096:21

The previous iteration of the new framework:

failures:

---- tweenable::tests::tween_tick_loop_finite stdout ----
thread 'tweenable::tests::tween_tick_loop_finite' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `1`: times_completed', src/tweenable.rs:1058:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

For me, the goldens are hands down the easiest to understand. The OG one requires you to look at the code to find out which value is wrong while the previous iteration clearly needs some context on the current state of the tween. That aside, I think this goes to your next point:

> But using that to compare numerical values output as text sounds pretty contrived when you can compare the values directly.

I kind of agree here. But! The problem I'm trying to solve is understandability and ease of test addition. The OG tests are pretty much impossible to understand with all the indexing and branching (which also means it's impossible to add/update the tests). In the previous iteration, I think it became reasonable to understand which values were expected, but it was still hard to reason about the state context (i.e. after the 7th tick, what should my completion count be?). That meant writing the tests was still a pain because you have to come up with the math to generate the right values and it's super easy to make off-by-one errors or just goof the math. I do agree that the goldens are a bit of a weird approach, but they make writing tests effortless (just create a tween and say what your deltas are) and most importantly they're super easy to audit (they read like a story).

I generally bias towards doing whatever makes the tests easiest to write, because that will lead to the most coverage and therefore the most bugs found.


Alrighty, WDYT now? 😁 Any of that convincing?

@djeedai (Owner) commented Sep 22, 2022

I like the idea and rationale to push for tests which are both easier to understand and to maintain. But I'm not convinced we need goldens for that.

I think the value argued here is that the version with goldens outputs human-understandable text that developers can look at to get some context on the failure. On the other hand, the fact that goldens use text comparison to check correctness is not required, and probably not wanted at all. Conversely, the assert version lacks that clarity for developers.

So can we have the best of both worlds? Can we output to the console the same paragraph of summary text for each step, and then right after that use assert!() to check the validity of each item? That way we retain the debuggability/maintainability of the tests and avoid the text comparison. I think if we output anything with println!(), the test harness will by default swallow the output if the test passes and only show it if the test fails (if I remember cargo correctly), which is exactly what we want, I think. Any thoughts on this?
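Concretely, a minimal sketch of the print-then-assert pattern being proposed here (the TickSummary type and check_tick helper are illustrative only):

#[derive(Debug)]
struct TickSummary {
    progress: f32,
    times_completed: u32,
}

fn check_tick(step: usize, got: &TickSummary, want: &TickSummary) {
    // cargo's test harness captures stdout and only prints it for failing
    // tests, so this context is free when everything passes.
    println!(
        "step {step}: progress={} completions={}",
        got.progress, got.times_completed
    );
    assert_eq!(
        got.times_completed, want.times_completed,
        "times_completed at step {step}"
    );
    assert!(
        (got.progress - want.progress).abs() < 1e-5,
        "progress at step {step}: got {}, want {}",
        got.progress,
        want.progress
    );
}

#[test]
fn print_then_assert_example() {
    let got = TickSummary { progress: 0.2, times_completed: 0 };
    let want = TickSummary { progress: 0.2, times_completed: 0 };
    check_tick(1, &got, &want);
}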

@SUPERCILEX (Contributor, author)

> I think the value argued here is that the version with goldens outputs human-understandable text that developers can look at to get some context on the failure.

Yes except for the last part. I've been thinking about this off and on and I've discovered the following underlying assumption motivating the golden approach: I believe writing the math to generate correct values is too difficult. That is, generating the sequence of correct values takes too long to think about (so fewer tests will get written: to put numbers on this, writing the 6 tests in dffbaec (#53) took ~1h30m) and involves too much cognitive overhead (meaning it is too difficult for someone to reason about the correctness of the test). And then the sub-assumptions there are that tests should be easy to write and easy to check for correctness. A failing assert is irrelevant in this scenario because I'm saying you don't know whether or not it should have failed. A passing test that's wrong is much scarier than a failing test. (See the appendix for code examples.)

Hence, the golden approach embraces the idea that figuring out what the correct values are is too difficult and just says, "Hey, check out what this program did!" Correctness is then determined by deciding whether or not you agree with what the program did rather than pre-determining what the program should do.

> On the other hand, the fact that goldens use text comparison to check correctness is not required, and probably not wanted at all.

I think this misses the key difference presented above: do you want to write a test that checks what the program did or one that determines what the program should do? The argument is that "did" is much easier to write and understand (for correctness) than "should do."

> So can we have the best of both worlds? [...] Any thoughts on this?

If the golden approach is abandoned, I'll definitely do this since it will make understanding the source of failures much easier, but it won't address the difficulty in writing the test or reasoning about its correctness when it's passing.


Appendix

Is the transform correct?

let tween = Tween::new(
    EaseMethod::Linear,
    Duration::from_secs(1),
    TransformPositionLens {
        start: Vec3::ZERO,
        end: Vec3::ONE,
    },
)
.with_direction(TweeningDirection::Forward)
.with_repeat_count(RepeatCount::Finite(3))
.with_repeat_strategy(RepeatStrategy::Repeat);

let expected_values = ExpectedValues {
    ...
    transforms: successors(Some(0.), |progress| Some(f32::min(3., progress + 1. / 3.)))
        .map(|progress| Transform::from_translation(Vec3::splat(progress))),
};

We both know it's not, because of #42, but I would argue that's not immediately obvious. You could make it easier to see by expanding out the successors call (i.e. [0., 1./3., 2./3., ...]), but writing that test would be a massive PITA. Furthermore, it doesn't protect you from [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3] where you have no idea if the numbers start in the right places. Maybe it should be [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3], who knows. The golden shows you those values in context so you can easily reason about their correctness:

Initial tween state:
Lens=TransformPositionLens {
    start: Vec3(
        0.0,
        0.0,
        0.0,
    ),
    end: Vec3(
        1.0,
        1.0,
        1.0,
    ),
}

...

Tick by 333.333333ms:
Completion received for 42v0
1.333333332s/1s elapsed
Direction: Forward
State: Active
Progress: 1.3333334
Total completions: 1
Transform: Transform { translation: Vec3(1.3333334, 1.3333334, 1.3333334), rotation: Quat(0.0, 0.0, 0.0, 1.0), scale: Vec3(1.0, 1.0, 1.0) }
Event received: TweenCompleted { entity: 42v0, user_data: 69420 }

@djeedai (Owner) commented Sep 26, 2022

> That is, generating the sequence of correct values takes too long to think about (so fewer tests will get written: to put numbers on this, writing the 6 tests in dffbaec (#53) took ~1h30m) and involves too much cognitive overhead (meaning it is too difficult for someone to reason about the correctness of the test). And then the sub-assumptions there are that tests should be easy to write and easy to check for correctness.
> [...]
> Hence, the golden approach embraces the idea that figuring out what the correct values are is too difficult and just says, "Hey, check out what this program did!" Correctness is then determined by deciding whether or not you agree with what the program did rather than pre-determining what the program should do.

Fair enough, it's hard to write animation tests because there are a fair number of moving parts, and we're testing them in groups, so we need to reason about all of them at once. And maybe that's the issue: we should test individual quantities (repeat count, duration, progress, etc.) in separate tests. But that doesn't mean making those tests easier to write should take priority over having tests that are correct.

  • Text comparison for floating-point values, or for any numeric value for that matter, is never guaranteed to produce anything useful by any library in any language I know (except maybe libraries explicitly trying to provide that edge-case feature, if any). So testing that floating-point values output by the library are correct by looking at their text representation is extremely brittle; there's no telling what factors influence the conversion, and Rust certainly doesn't guarantee it, even less so across compiler versions. I do not want this library to rely on such an approach; with the demotion of progress to a second-class citizen and the use of Duration everywhere, we were actually moving in the right direction on correctness, toward a point where we can hope to achieve exact bit-level accuracy on playback despite the mess of floating-point inaccuracies. Goldens would move us backward, in the complete opposite direction.
  • You said it, and we obviously reach different conclusions on whether this is wanted, but goldens mean that instead of writing a test about what you think the program does and having the test confirm that (and if not, you find a bug and fix it), you now tell the test "tell me what the program actually does", then look at a text file, try to figure out whether it's correct, and give a single binary yes/no answer for the entire content of a large file containing hundreds of moving data points. If anything, that sounds exactly like what you're describing you want to avoid, only worse. I simply cannot see how this can ever prevent wrong assumptions and incorrect tests. It's already hard enough to write an unbiased test; if a developer needs to analyze a 50- to 100-line file instead of focusing on the value a single assert!() should have, they're bound to make an error and assume the golden is correct when it's not.

> I think this misses the key difference presented above: do you want to write a test that checks what the program did or one that determines what the program should do? The argument is that "did" is much easier to write and understand (for correctness) than "should do."

Yes, "did" is much easier to write. It's also not what a test should do. I know you disagree with that, but I'll repeat my position which is that a test is here to confirm the developer's assumptions of how the code they wrote is supposed to behave. You have a mental model, convert it into some code, then write a test that says "here I wrote the code to achieve this result". And then you run the test and confirm that assumption. And to the extreme, with test-driven development, you actually write the test first.

I see very little value in testing what the program actually does (as opposed to what it was meant to do):

  • if it's doing what's expected, then it's actually the same as testing the expected behavior as above, so there's nothing to discuss.
  • if not, then you have either of:
    • the only real case of interest, an unexpected behavior, and here "unexpected" can only mean "differs from documentation", because that's the only other source documenting behavior that is shared by everyone, developers and users alike.
    • a documented behavior, and then that's not a bug but a feature so there's nothing to discuss.

In short, testing the actual behavior serves as a second set of documentation (you said it yourself) that necessarily conflicts with the textual docs (comments); otherwise you'd have a feature. It's also all too easy to forget about such a "wrong behavior" test in the middle of all the other tests. It encourages users looking at the test code to think of bugs as features and take a dependency on them, backing developers into a corner where the "bug" stops being a bug and can never be fixed again (looking at you, Windows API, and so many others). Finally, I don't see where you draw the line between an "acceptable" wrong behavior and an "unacceptable" one. Is the behavior acceptable for users if the library produces "3.999999" instead of the "4.0" the developer intended? Probably, for most users at least, even if not 100% accurate. What about a negative progress()? That starts to get quite awkward: most users won't expect it, since it represents a percentage, so they probably won't handle negative values and the library will cause errors in their code. And where do we draw the line? Can we format the user's HDD when playing an animation and say "ah yes, you lost everything, but there's a test that validates that behavior, so you should have expected it!"? That's obviously exaggerated, but it illustrates the point that there's no longer any rule about what the test is validating.

With intended testing, you test what the behavior should be, as imagined by whoever wrote the code. If something is found wrong in practice, you either fix it now, or, if that's not possible, you log a bug and comment out the test assert!() so the rest of the test(s) continue to check the other behaviors and produce some value, and you leave the bug reference (link) in a comment. Users browsing the test code see that there's a bug, both because the assert is commented out and because there's a link to the bug, and they can follow its evolution. They can find another reference to the bug later in the CHANGELOG once it's fixed, and know the behavior has been restored.
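As a sketch of that convention (the issue number is borrowed from this thread; the assertions themselves are illustrative):

#[test]
fn tween_tick_ping_pong() {
    // ... drive the tween as usual ...
    let times_completed = 1_u32; // placeholder for the real query
    assert_eq!(times_completed, 1);

    // FIXME(#42): progress currently grows past 1.0 and the mirrored
    // transform goes negative; re-enable once the bug is fixed.
    // assert!(translation.x >= 0.0);
}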

Also, a small digression, but on that topic we should keep a clean main branch with only expected behavior. The "fix it in the next PR" approach that led to #42 is a surefire way to end up with a bug at release time, because either whoever was going to fix it forgets, or whoever releases does so "too fast" without knowing a fix was due.
