(sorry for the wall of text, somewhat subtle, TLDR at bottom)
@regisd has made a valiant effort porting the old regression test suite into something more modern, and it's time to think about removing the old ones (as proposed in #678).
From the recent experience with refactoring macros for char classes (#654), it looks like we're close to being able to do that, but not quite there yet.
The main observation is that I had the bazel test suite passing quite some time before the old one (for the same tests), and in the meantime the old setup was catching a number of bugs that I would have missed in the new one. Unfortunately, this was mainly thanks to the thing we wanted to get rid of -- the comparison of string output.
Basically, what happens is that especially the --dump output gives information about the char classes and NFA/DFA structure that is very hard to test for by the behaviour of the next stage down (the lexer itself). In the old setup, this information is too fine-grained -- it is sensitive to any change in JFlex at all, including ones that only touch messages and don't change behaviour. In the new setup, the information is too coarse-grained. For instance, I missed a case in the refactor where the wrong char classes were generated for some of the operators, but wrong subtly enough that everything works fine for many inputs. It's hard to test these by scanner behaviour, because you'd have to check the entire set of possible input characters (positively and negatively), and you'd have to do it in the right sequence (depending on what the lexer spec does).
It might be possible to check whether the char class partitions in particular are the same as a given "good" partition (especially now that final char classes don't depend on the order of occurrence of things in the spec any more). We'd need to be able to get information out of the JFlex run, instead of only out of the scanner/runtime engine. It might even be useful for JFlex to offer an API for that. Or, even though it might be slightly distasteful, we could add an option that reports only this info on output, without the additional chatter around it, which would make it a lot more reliable to test against.
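To illustrate, comparing partitions against a "good" one can be an order-insensitive set comparison, since neither the order of the classes nor the order of members within a class matters. A minimal sketch, assuming a hypothetical representation of a partition as a collection of code-point sets (the `CharClassCheck` class and this representation are made up here -- JFlex does not currently expose such an API):

```java
import java.util.*;

public class CharClassCheck {
  // Two partitions are the same if they contain the same classes,
  // regardless of the order the classes (or their members) appear in.
  static boolean samePartition(Collection<Set<Integer>> expected,
                               Collection<Set<Integer>> actual) {
    return new HashSet<>(expected).equals(new HashSet<>(actual));
  }

  public static void main(String[] args) {
    // hypothetical partition for a spec that distinguishes [a-c] from d, e
    Set<Integer> letters = new HashSet<>(Arrays.asList((int) 'a', (int) 'b', (int) 'c'));
    Set<Integer> rest = new HashSet<>(Arrays.asList((int) 'd', (int) 'e'));
    List<Set<Integer>> expected = Arrays.asList(letters, rest);
    List<Set<Integer>> actual = Arrays.asList(rest, letters); // classes in a different order
    System.out.println(samePartition(expected, actual)); // prints true
  }
}
```

A test using this would fail exactly when the generated partition differs in content, not merely in ordering -- which is the property the string comparison could never give us.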
For NFA/DFA structure, it is a bit harder. We don't want to test that for all specs we have in the test set, but we have a few that are specifically designed to inspect manually if something changes. Usually (not always) if that happens there is a bug. The main output relevant for tests is the final minimised DFA, because that is unique (up to renaming of states) for a given set of regular expressions, and changes in that would indicate a reasonably major behaviour change in JFlex. These are pretty much impossible to test by behaviour of the runtime scanner, because you don't know beforehand which inputs will distinguish two DFAs (without looking at their structure). There's always a finite input that can distinguish two DFAs if they are different, but I don't think you can always tell in finite time if two DFAs are equivalent by inputs only (if you've tested to input length 100, length 101 might be the one that finds a problem etc). So we could do something similar as with the char classes and provide either an API or a string output for just that information on the JFlex side.
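Comparing minimised DFAs "up to renaming of states" is mechanical once the transition tables are accessible: canonicalise each DFA by renumbering states in BFS discovery order from the start state (with a fixed ordering of input classes), then compare the canonical tables. A sketch on a toy DFA representation (this is not JFlex's internal representation, just an illustration of the technique; unreachable states are ignored, which is fine for a minimised DFA):

```java
import java.util.*;

public class DfaCompare {
  // trans[state][inputClass] = next state, or -1 for no transition;
  // fin[state] = true if the state is accepting.
  // Returns a canonical encoding: states renumbered in BFS discovery
  // order, transitions enumerated in fixed input-class order.
  static int[] canonical(int[][] trans, boolean[] fin, int start) {
    int[] rename = new int[trans.length];
    Arrays.fill(rename, -1);
    ArrayDeque<Integer> queue = new ArrayDeque<>();
    rename[start] = 0;
    queue.add(start);
    int next = 1;
    List<Integer> out = new ArrayList<>();
    while (!queue.isEmpty()) {
      int s = queue.poll();
      out.add(fin[s] ? 1 : 0);
      for (int c = 0; c < trans[s].length; c++) {
        int t = trans[s][c];
        if (t >= 0 && rename[t] < 0) { rename[t] = next++; queue.add(t); }
        out.add(t < 0 ? -1 : rename[t]);
      }
    }
    return out.stream().mapToInt(Integer::intValue).toArray();
  }

  static boolean sameDfa(int[][] t1, boolean[] f1, int s1,
                         int[][] t2, boolean[] f2, int s2) {
    return Arrays.equals(canonical(t1, f1, s1), canonical(t2, f2, s2));
  }

  public static void main(String[] args) {
    // Two DFAs for "a+" that differ only in state numbering:
    int[][] t1 = {{1}, {1}}; boolean[] f1 = {false, true}; // start 0
    int[][] t2 = {{0}, {0}}; boolean[] f2 = {true, false}; // start 1
    System.out.println(sameDfa(t1, f1, 0, t2, f2, 1)); // prints true
  }
}
```

Because the automata are deterministic, equal canonical encodings hold exactly when the two DFAs are isomorphic, so an expected DFA could be stored once and compared structurally instead of by string dump.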
We don't need to do the DFA comparison for many test cases, so it'd be OK for this to be more verbose to write, but we should do it for some (basically the ones we use --dump for in the old framework), so that we catch bugs in the generation engine itself. That engine changes rarely, but even an otherwise innocent refactor could do something we hadn't thought of.
For most "normal" test cases, I just look at the number of states in the minimised DFA -- if it is not equal to before, the DFAs are not equal, and if it is equal, usually everything is fine (that's why many of the old tests don't use --dump, but don't use -q either). This is probably something we can check easily in the test case, and we should, whenever the old test doesn't use -q.
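As a sketch, the lenient check is a one-liner once the state count is obtainable from the generator (the `checkStateCount` helper is hypothetical); the asymmetry is worth keeping in mind -- a differing count proves the DFAs changed, an equal count proves nothing:

```java
public class StateCountCheck {
  // Lenient regression check: a differing state count in the minimised DFA
  // proves the automaton changed; an equal count is merely not suspicious.
  static void checkStateCount(int expected, int actual) {
    if (expected != actual) {
      throw new AssertionError("minimised DFA has " + actual
          + " states, expected " + expected + " -- generated automaton changed");
    }
  }

  public static void main(String[] args) {
    checkStateCount(2, 2); // passes silently
  }
}
```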
Ok, so TLDR, we need mechanisms for:
- comparing generated char classes against given expected ones
- comparing the full generated minimised DFA against given expected ones
- as a more lenient option, comparing the number of states in the minimised DFA against a given expected number
I'm unsure about warnings. It'd be good to check that expected warnings are
given, at least by default. Do we also need to check that no more warnings
than expected are produced?
Yes, I want to get rid of sysout comparison. We should test behaviour, not check for modifications of the output.
You have a good point that it would be useful to verify against the minimized DFA.
I agree that checking that certain warnings are produced would be useful. I think we need a refactoring of the logging framework for that, in order to check the warning content rather than its output form.
Cool, I think we agree on where we want to go with the tests. A logging framework that we could query for the set of warnings that were delivered sounds like a good idea.
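As a sketch of what querying such a logging framework could look like: warnings collected as structured values (here just strings of a made-up "CODE@line" form -- the representation and the `WarningCheck` helper are hypothetical, nothing like this exists in JFlex yet), checked against an expected set, with extra warnings either tolerated (the lenient default discussed above) or rejected:

```java
import java.util.*;

public class WarningCheck {
  // expected/actual hold structured warning descriptions, e.g. "NEVER_MATCHED@12"
  // (hypothetical "code@line" encoding), collected from the generator run
  // instead of scraped from console output.
  static void assertWarnings(Set<String> expected, Set<String> actual, boolean exact) {
    Set<String> missing = new HashSet<>(expected);
    missing.removeAll(actual);
    if (!missing.isEmpty())
      throw new AssertionError("missing expected warnings: " + missing);
    if (exact) { // strict mode: also reject warnings we did not expect
      Set<String> extra = new HashSet<>(actual);
      extra.removeAll(expected);
      if (!extra.isEmpty())
        throw new AssertionError("unexpected warnings: " + extra);
    }
  }

  public static void main(String[] args) {
    Set<String> expected = new HashSet<>(Arrays.asList("NEVER_MATCHED@12"));
    Set<String> actual = new HashSet<>(Arrays.asList("NEVER_MATCHED@12", "DEPRECATED@3"));
    assertWarnings(expected, actual, false); // lenient: extra warning tolerated
    System.out.println("ok"); // prints ok
  }
}
```

Whether the strict mode should be the default is exactly the open question from the issue; having the flag at least makes both policies expressible per test.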