Making a Testing Plan¶
When contributing to a project as large and open-ended as WPT, it’s easy to get lost in the details. It can be helpful to start by making a rough list of tests you intend to write. That plan will let you anticipate how much work will be involved, and it will help you stay focused once you begin.
Many people come to WPT with a general testing goal in mind:
- specification authors often want to test for new spec text
- browser maintainers often want to test new features or fixes to existing features
- web developers often want to test discrepancies between browsers on their web applications
(If you don’t have any particular goal, we can help you get started. Check out
the issues labeled with
Leave a comment if you’d like to get started with one, and don’t hesitate to
ask clarifying questions!)
This guide will help you write a testing plan by:
- showing you how to use the specifications to learn what kinds of tests will be most helpful
- developing your sense for what doesn’t need to be tested
- demonstrating methods for figuring out which tests (if any) have already been written for WPT
The level of detail in useful testing plans can vary widely. A plan might be a list of specific cases, an outline of important coverage areas, or an annotated version of the specification under test. The appropriate fidelity depends on your needs, so you can be as precise as you feel is helpful.
Understanding the “testing surface”¶
Web platform specifications are instructions about how a feature should work. They’re critical for implementers to “build the right thing,” but they are also important for anyone writing tests. We can use the same instructions to infer what kinds of tests would be likely to detect mistakes. Here are a few common patterns in specification text and the kind of tests they suggest.
Algorithms may accept input from many sources. Modifying the input is the most direct way we can influence the browser’s behavior and verify that it matches the specifications. That’s why it’s helpful to be able to recognize different sources of input.
| Type of feature | Potential input sources |
|---|---|
| HTML | element content, attributes, attribute values |
| CSS | selector strings, property values, markup |
Determine which input sources are relevant for your chosen feature, and build a list of values which seem worthwhile to test (keep reading for advice on identifying worthwhile values). For features that accept multiple sources of input, remember that the interaction between values can often produce interesting results. Every value you identify should go into your testing plan.
Example: This is the first step of the
Notification constructor from the
A thorough test suite for this constructor will include tests for the behavior of many different values of the title parameter and the options parameter. Choosing those values can be a challenge unto itself; see “Avoid excessive breadth” for advice.
The state of the browser may also influence algorithm behavior. Examples include the current document, the dimensions of the viewport, and the entries in the browsing history. Just like with direct input, a thorough set of tests will likely need to control these values. Browser state is often more expensive to manipulate (whether in terms of code, execution time, or system resources), and you may want to design your tests to mitigate these costs (e.g. by writing many subtests from the same state).
You may not be able to control all relevant aspects of the browser’s state.
includes issues for web platform features which cannot be controlled in a
cross-browser way. You should include tests like these in your plan both to
communicate your intention and to remind you when/if testing solutions become available.
Example: In the
Notification constructor referenced
above, the type of “the
current global object” is also a form of input. The test suite should include
tests which execute with different types of global objects.
When an algorithm branches based on some condition, that’s an indication of an interesting behavior that might be missed. Your testing plan should have at least one test that verifies the behavior when the branch is taken and at least one more test that verifies the behavior when the branch is not taken.
Example: The following algorithm from the HTML
standard describes how the
localStorage.getItem method works:
The getItem(key) method must return the current value associated with the given key. If the given key does not exist in the list associated with the object then this method must return null.
This algorithm exhibits different behavior depending on whether or not an item
exists at the provided key. To test this thoroughly, we would write two tests:
one test would verify that
null is returned when there is no item at the
provided key, and the other test would verify that an item we previously stored
was correctly retrieved when we called the method with its name.
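The two tests described above can be sketched as follows. This is a minimal illustration, not a WPT test: `FakeStorage` and `check` are hypothetical stand-ins so the sketch runs outside a browser, and a real test would exercise `localStorage` itself through testharness.js.

```javascript
// Hypothetical in-memory stand-in for localStorage (an assumption made so
// this sketch runs anywhere; it is not the browser's implementation).
class FakeStorage {
  constructor() {
    this.items = new Map();
  }
  setItem(key, value) {
    this.items.set(String(key), String(value));
  }
  getItem(key) {
    const name = String(key);
    return this.items.has(name) ? this.items.get(name) : null;
  }
}

// Hypothetical assertion helper: throws on failure, reports on success.
function check(condition, description) {
  if (!condition) {
    throw new Error(`FAIL: ${description}`);
  }
  return `PASS: ${description}`;
}

const storage = new FakeStorage();

// Branch 1: the key does not exist, so the method must return null.
const missingResult = check(
  storage.getItem('absent') === null,
  'getItem returns null when no item exists at the key'
);

// Branch 2: the key exists, so the method must return the stored value.
storage.setItem('present', 'value');
const presentResult = check(
  storage.getItem('present') === 'value',
  'getItem returns the value previously stored at the key'
);

console.log(missingResult);
console.log(presentResult);
```

Each branch gets its own named check, so a failure report points directly at the condition that was mishandled.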
Even without branching, the interplay between sequential algorithm steps can suggest interesting test cases. If two steps have observable side-effects, then it can be useful to verify they happen in the correct order.
Most of the time, step sequence is implicit in the nature of the algorithm–each step operates on the result of the step that precedes it, so verifying the end result implicitly verifies the sequence of the steps. But sometimes, the order of two steps isn’t particularly relevant to the result of the overall algorithm. This makes it easier for implementations to diverge.
There are many common patterns where step sequence is observable but not necessarily inherent to the correctness of the algorithm:
- input validation (when an algorithm verifies that two or more input values satisfy some criteria)
- event dispatch (when an algorithm fires two or more events)
- object property access (when an algorithm retrieves two or more property values from an object provided as input)
Example: The following text is an abbreviated excerpt of the algorithm that runs during drag operations (from the HTML specification):
[…] 4. Otherwise, if the user ended the drag-and-drop operation (e.g. by releasing the mouse button in a mouse-driven drag-and-drop interface), or if the drag event was canceled, then this will be the last iteration. Run the following steps, then stop the drag-and-drop operation:
- If the current drag operation is “none” (no drag operation) […] Otherwise, the drag operation might be a success; run these substeps:
- Fire a DND event named dragend at the source node.
A thorough test suite will verify that the drop event is fired as specified, and it will also verify that the dragend event is fired as specified. An even better test suite will also verify that the drop event is fired before the dragend event.
In September of 2019, Chromium accidentally changed the ordering of these events, and as a result, real web applications stopped functioning. If there had been a test for the sequence of these events, then this confusion could have been avoided.
When making your testing plan, be sure to look carefully for event dispatch and the other patterns listed above. They won’t always be as clear as the “drag” example!
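One way to verify event sequence is to record each event in a shared log and then assert on the log's contents. The sketch below assumes a runtime that provides `EventTarget` and `Event` (browsers and recent Node.js). The two targets and `simulateSuccessfulDrop` are hypothetical stand-ins for the browser's drag-and-drop machinery, which a real test would have to trigger through user input or automation.

```javascript
// Hypothetical stand-ins for the element receiving the drop and the node
// where the drag started.
const dropTarget = new EventTarget();
const sourceNode = new EventTarget();

// Every listener appends to the same log, so the log preserves the order in
// which the events were dispatched.
const observed = [];
dropTarget.addEventListener('drop', () => observed.push('drop'));
sourceNode.addEventListener('dragend', () => observed.push('dragend'));

// Hypothetical replacement for the browser: dispatches the two events in the
// order that the text above expects.
function simulateSuccessfulDrop() {
  dropTarget.dispatchEvent(new Event('drop'));
  sourceNode.dispatchEvent(new Event('dragend'));
}

simulateSuccessfulDrop();

// Assert on the whole sequence rather than on each event in isolation.
const orderIsCorrect =
  observed.length === 2 &&
  observed[0] === 'drop' &&
  observed[1] === 'dragend';

console.log(observed.join(' -> '), orderIsCorrect ? 'PASS' : 'FAIL');
```

The same logging approach applies to the other patterns listed above. For property access, for example, the log can be filled by getters defined on the object passed as input.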
Specifications occasionally allow browsers discretion in how they implement certain features. These are described using RFC 2119 terms like “MAY” and “OPTIONAL”. Although browsers should not be penalized for deciding not to implement such behavior, WPT offers tests that verify the correctness of the browsers which do. Be sure to label the test as optional according to WPT’s conventions so that people reviewing test results know how to interpret failures.
Example: The algorithm underpinning
includes the following paragraph:
That statement uses the word “may,” so even though it modifies the behavior of the preceding algorithm, it is strictly optional. The test we write for this should be designated accordingly.
It’s important to read these sections carefully because the distinction between “mandatory” behavior and “optional” behavior can be nuanced. In this case, the optional behavior is never allowed if the document’s type has changed. That makes for a mandatory test, one that verifies browsers don’t return the same result when the document’s type changes.
When writing conformance tests, choosing what not to test is sometimes just as hard as finding what needs testing.
Don’t dive too deep¶
Algorithms are composed of many other algorithms which themselves are defined in terms of still more algorithms. It can be intimidating to consider exhaustively testing one of those “nested” algorithms, especially when they are shared by many different APIs.
In general, you should plan to write “surface tests” for the nested algorithms. That means only verifying that they exhibit the basic behavior you are expecting.
It’s definitely important to test exhaustively, but it’s just as important to do so in a structured way. Reach out to the test suite’s maintainers to learn if and how they have already tested those algorithms. In many cases, it’s acceptable to test them in just one place (and maybe through a different API entirely), and rely only on surface-level testing everywhere else. While it’s always possible for more tests to uncover new bugs, the chances may be slim. The time we spend writing tests is highly valuable, so we have to be efficient!
As described earlier in this guide, we’d certainly want to test the branch
regarding the parsing failure. However, there are many ways a string might fail
to parse–should we verify them all in the tests for
document.querySelectorAll? Should we test them all there, too?
The answers depend on the current state of the test suite: whether or not tests for selector parsing exist and where they are located. That’s why it’s best to confer with the people who are maintaining the tests.
Avoid excessive breadth¶
When the set of input values is finite, it can be tempting to test them all exhaustively. When the set is very large, test authors can reduce repetition by defining tests programmatically in loops.
Using advanced control flow techniques to dynamically generate tests can actually reduce test quality. It may obscure the intent of the tests since readers have to mentally “unwind” the iteration to determine what is actually being verified. The practice is also more susceptible to bugs, and those bugs may not be obvious: they may not cause failures, and they may silently exercise fewer cases than intended. Finally, tests authored using this approach often take a relatively long time to complete, and that puts a burden on people who collect test results in large numbers.
The severity of these drawbacks varies with the complexity of the generation logic. For example, the drawbacks would be pronounced in a test which conditionally made different assertions within many nested loops. Conversely, they would be minor in a test which only iterated over a list of values in order to make the same assertions about each. Recognizing when the benefits outweigh the risks requires discretion, so once you understand them, you should use your best judgement.
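The sketch below illustrates how a bug in generation logic can go unnoticed. It is built around a hypothetical feature (`accepts`, which allows the integers 1 through 10) that does not come from any specification. The generated loop contains an off-by-one error, so it never exercises the upper boundary, yet every case it does produce still passes.

```javascript
// Hypothetical feature under test: accepts the integers 1 through 10.
const accepts = (n) => Number.isInteger(n) && n >= 1 && n <= 10;

// Generated cases. The loop condition has a bug: `<` should be `<=`, so the
// value 10 is never produced.
const generatedCases = [];
for (let n = 1; n < 10; n += 1) {
  generatedCases.push(n);
}

// Nothing fails, so the missing case is easy to overlook.
const generatedAllPass = generatedCases.every(accepts);

// Declarative cases: every value under test is visible at a glance.
const declarativeCases = [
  { input: 0, expected: false },
  { input: 1, expected: true },
  { input: 10, expected: true },
  { input: 11, expected: false },
];
const declarativeAllPass = declarativeCases.every(
  (testCase) => accepts(testCase.input) === testCase.expected
);

console.log(`generated: ${generatedCases.length} cases, all pass: ${generatedAllPass}`);
console.log(`declarative: ${declarativeCases.length} cases, all pass: ${declarativeAllPass}`);
```

The declarative list is shorter, but it covers both boundaries, and a reviewer can confirm that by reading it.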
Example: We can see this consideration in the very first step of the
Response constructor from the Fetch
Response(body, init) constructor, when invoked, must run these steps:
- If init[“status”] is not in the range 200 to 599, inclusive, then throw a
This function accepts exactly 400 values for the “status.” With WPT’s testharness.js, it’s easy to dynamically create one test for each value. Unless we have reason to believe that a browser may exhibit drastically different behavior for any of those values (e.g. correctly accepting 546 but incorrectly rejecting 547), the complexity of testing those cases probably isn’t warranted.
Instead, focus on writing declarative tests for specific values which are novel
in the context of the algorithm. For ranges like in this example, testing the
boundaries is a good idea.
599 should not produce an error while
600 should produce an error. Feel free to use what you know about
the feature to choose additional values. In this case, HTTP response status
codes are classified by the “hundred” order of magnitude, so we might also want
to test a “3xx” value and a “4xx” value.
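The boundary-focused approach can be sketched as a short, explicit list of values. This sketch assumes the valid range is 200 to 599 (the lower bound is inferred from the 400 accepted values ending at 599), and it checks only whether the constructor throws. It uses the runtime's `Response` constructor where one exists. `StandInResponse` and `throwsForStatus` are hypothetical helpers, and the stand-in implements nothing beyond the range check.

```javascript
// Use the runtime's Fetch Response constructor when available (browsers,
// recent Node.js). Otherwise fall back to a hypothetical stand-in that only
// implements the status range check, so that the sketch still runs.
const ResponseUnderTest = typeof Response === 'function'
  ? Response
  : class StandInResponse {
      constructor(body, init = {}) {
        const status = init.status === undefined ? 200 : init.status;
        if (status < 200 || status > 599) {
          throw new RangeError('status out of range');
        }
        this.status = status;
      }
    };

// Hypothetical helper: reports whether constructing with `status` throws.
function throwsForStatus(status) {
  try {
    new ResponseUnderTest(null, { status });
    return false;
  } catch (error) {
    return true;
  }
}

// A small declarative list: both boundaries, plus one value from two of the
// "hundred" classes mentioned above.
const expectations = [
  { status: 199, shouldThrow: true },
  { status: 200, shouldThrow: false },
  { status: 301, shouldThrow: false },
  { status: 404, shouldThrow: false },
  { status: 599, shouldThrow: false },
  { status: 600, shouldThrow: true },
];

const mismatches = expectations.filter(
  (expectation) => throwsForStatus(expectation.status) !== expectation.shouldThrow
);

console.log(mismatches.length === 0 ? 'all expectations met' : mismatches);
```

Six explicit values replace four hundred generated ones, and each entry documents why it was chosen.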
It’s very likely that WPT already has some tests for the feature (or at least the specification) that you’re interested in testing. In that case, you’ll have to learn what’s already been done before starting to write new tests. Understanding the design of existing tests will let you avoid duplicating effort, and it will also help you integrate your work more logically.
Even if the feature you’re testing does not have any tests, you should still keep these guidelines in mind. Sooner or later, someone else will want to extend your work, so you ought to give them a good starting point!
The names of existing files and folders in the repository can help you find tests that are relevant to your work. This page on the design of WPT goes into detail about how files are generally laid out in the repository.
Generally speaking, every conformance test is stored in a subdirectory dedicated to the specification it verifies. The structure of these subdirectories varies. Some organize tests in directories related to algorithms or behaviors. Others have a more “flat” layout, where all tests are listed together.
Whatever the case, test authors try to choose names that communicate the behavior under test, so you can use them to make an educated guess about where your tests should go.
Example: Imagine you wanted to write a test to verify that headers were made
immutable by the
Response.error method defined in the Fetch
standard. Here’s the algorithm:
In order to figure out where to write the test (and whether it’s needed at
all), you can review the contents of the
fetch/ directory in WPT. Here’s how
that looks on a UNIX-like command line:
$ ls fetch
api/                           data-urls/   range/
content-encoding/              http-cache/  README.md
content-length/                images/      redirect-navigate/
content-type/                  metadata/    security/
corb/                          META.yml     stale-while-revalidate/
cors-rfc1918/                  nosniff/
cross-origin-resource-policy/  origin/
This test is for a behavior directly exposed through the API, so we should look in the api/ directory:
$ ls fetch/api
abort/  cors/         headers/           policies/  request/    response/
basic/  credentials/  idlharness.any.js  redirect/  resources/
And since this is a static method on the Response constructor, we would expect the test to belong in the response/ directory:
$ ls fetch/api/response
multi-globals/                   response-static-error.html
response-cancel-stream.html      response-static-redirect.html
response-clone.html              response-stream-disturbed-1.html
response-consume-empty.html      response-stream-disturbed-2.html
response-consume.html            response-stream-disturbed-3.html
response-consume-stream.html     response-stream-disturbed-4.html
response-error-from-stream.html  response-stream-disturbed-5.html
response-error.html              response-stream-disturbed-6.html
response-from-stream.any.js      response-stream-with-broken-then.any.js
response-init-001.html           response-trailer.html
response-init-002.html
There seems to be a test file for the error static method: response-static-error.html. We can open that to decide if the behavior is already covered. If not, then we know where to write the test.
Failures on wpt.fyi¶
There are many behaviors that are difficult to describe in a succinct file name. That’s commonly the case with low-level rendering details of CSS specifications. Test authors may resort to generic number-based naming schemes for their files, e.g. feature-002.html, etc. This makes it difficult to determine if a test case exists judging only by the names of files.
If the behavior you want to test is demonstrated by some browsers but not by others, you may be able to use the results of the tests to locate the relevant test.
wpt.fyi is a website which publishes results of WPT in various browsers. Because most browsers pass most tests, the pass/fail characteristics of the behavior you’re testing can help you filter through a large number of highly similar tests.
Example: Imagine you’ve found a bug in the way Safari renders the top CSS
border of HTML tables. By searching through directory names and file names,
you’ve determined the probable location for the test: the
directory. However, there are three hundred files that begin with
border-top-! None of the names mention the
<table> element, so any one of
the files may already be testing the case you found.
Luckily, you also know that Firefox and Chrome do not exhibit this bug. You
could find such tests by visual inspection of the wpt.fyi
results overview, but the website’s “search” feature includes operators that
let you query for this information.
To find the tests which begin with border-top-, pass in Chrome, pass in Firefox, and fail in Safari, you could write `border-top- chrome:pass firefox:pass safari:fail`.
The results show only three such tests exist:
These may not describe the behavior you’re interested in testing; the only way to know for sure is to review their contents. However, this is a much more manageable set to work with!
Querying file contents¶
Some web platform features are enabled with a predictable pattern. For example, HTML attributes follow a fairly consistent format. If you’re interested in testing a feature like this, you may be able to learn where your tests belong by querying the contents of the files in WPT.
You may be able to perform such a search on the web. WPT is hosted on GitHub.com, and GitHub offers some basic functionality for querying code. If your search criteria are short and distinctive (e.g. all files containing “querySelectorAll”), then this interface may be sufficient. However, more complicated criteria may require regular expressions. For that, you can download the WPT repository and use git to perform more powerful searches.
The following table lists some common search criteria and examples of how they can be expressed using regular expressions:
| Criteria | Example match | Example regular expression |
|---|---|---|
| HTML tag names | | |
| CSS property name | | |
Bear in mind that searches like this are not necessarily exhaustive. Depending on the feature, it may be difficult (or even impossible) to write a query that correctly identifies all relevant tests. This strategy can give a helpful guide, but the results may not be conclusive.
Example: Imagine you’re interested in testing how the
src attribute of the
iframe element works with
directories, you’ve found a lot of potential locations for such a test. You
also know many tests use
name. How can you find where to contribute new tests?
You can design a regular expression that matches many cases where a
src property in HTML. You can use the
git grep command to query the contents of the
You will still have to review the contents to know which are relevant for your
purposes (if any), but compared to the 5,000 files in the
this list is far more approachable!
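As a rough illustration of this kind of content query, the sketch below applies a regular expression to a few in-memory strings. The pattern, the file names, and the file contents are all hypothetical, and none of them comes from the WPT repository. The pattern looks for an iframe start tag that carries a src attribute, and it would need refinement for any real search.

```javascript
// Hypothetical pattern: an <iframe> start tag followed, before the closing
// ">", by whitespace and a src attribute. Requiring whitespace before "src"
// avoids matching names such as data-src, and requiring "=" after it avoids
// matching srcdoc.
const iframeWithSrc = /<iframe\b[^>]*\ssrc\s*=/i;

// Hypothetical file contents standing in for files in the repository.
const files = {
  'a.html': '<iframe src="about:blank"></iframe>',
  'b.html': '<IFRAME name="frame" SRC=resources/page.html>',
  'c.html': '<img src="image.png">',
  'd.html': '<iframe name="empty"></iframe>',
  'e.html': '<iframe srcdoc="<p>inline</p>"></iframe>',
};

// Keep only the names of files whose contents match the pattern.
const matchingFiles = Object.keys(files).filter(
  (name) => iframeWithSrc.test(files[name])
);

console.log(matchingFiles);
```

Trying a pattern against a handful of known positive and negative samples like this is a cheap way to catch mistakes before running it over thousands of files with a tool such as git grep.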