Validate early and don't let go!
April 12, 2024
A few years ago I came across a JavaScript library called Zod. Zod provides tools for validating data, with a smooth TypeScript integration. I love it.
What I’m most grateful to Zod for is not its capabilities, though: it’s that its documentation links to the article “Parse, don’t validate” by Alexis King. This article got me thinking deeply about architectural philosophy and patterns around data validation, and helped me grow.
Below, I present my conclusions around data validation and then show off Zod and how its design makes it easier to follow what I believe are good practices.
Data validation makes programs predictable
We make assumptions about how data is structured every time we write code.
function getName(person) {
// assumes `person` is an object with property `name`
return person.name;
}
If data doesn’t fulfill these assumptions, then what our code does is unpredictable. In the best case it throws an error. In the worst case it silently continues to process this data and ends up corrupting something, presenting a security vulnerability, or producing some other impactful negative result.
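To make this concrete, here’s what the earlier getName function does for a few illustrative inputs (the inputs themselves are made up for demonstration):
getName({ name: "Ada" }); // "Ada": the assumption holds
getName(null); // best case: throws a TypeError, failing loudly
getName({ firstName: "Ada" }); // worst case: silently returns undefined and the bad data flows onward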
This is especially concerning for data that comes from outside our application, and in the world of web development almost every application interacts with external data. Common sources include command-line arguments, environment variables, files, and HTTP requests. Because our application doesn’t control this data, we don’t know if it fulfills the assumptions made in our code. We need to check the structure of the data before we use it. This is called data validation.
Just-in-time validation
The path of least resistance is to validate data just before it’s used. I’ll call this just-in-time validation.
function getName(person) {
if (!Object.hasOwn(person, "name")) {
return undefined;
}
return person.name;
}
Just-in-time validation is the most approachable way to add validation. It also gives us maximum confidence that the data upholds our assumptions, because there’s no chance for it to have mutated (barring race conditions and cosmic rays).
Just-in-time validation has downsides, though. It adds complexity to the code because the function using the data now has to handle validation and all its potential outcomes in addition to the business logic it was already responsible for. Often this results in the function needing to provide an additional return value to indicate a validation failure. The function’s consumers in turn end up needing to add more logic to handle this additional return value.
function getGreeting(person) {
const name = getName(person);
if (name === undefined) {
return undefined;
}
return `Hello ${name}`;
}
function getName(person) {
if (!Object.hasOwn(person, "name")) {
return undefined;
}
return person.name;
}
This extra handling can keep bubbling up, all the way to where the data is first received. The net effect is that validation handling spreads to pollute a whole vertical slice of the application.
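Here’s a sketch of that bubbling, one level further up (greetUser and its fallback behavior are illustrative):
function greetUser(person) {
  const greeting = getGreeting(person);
  if (greeting === undefined) {
    // yet another layer forced to handle the validation outcome
    return "Hello stranger";
  }
  return greeting;
}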
Just-in-time validation also misses an opportunity to share the validation outcome, because the outcome is buried within a function and code at a higher scope can’t access it without difficulty. The path of least resistance is for every function to implement its own validation logic, leading to duplicate validation and compounding complexity as validation logic squirms its way into every nook and cranny.
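For example, a sibling function that also needs the name ends up repeating the same check (getInitials is an illustrative addition):
function getInitials(person) {
  // the same validation, re-implemented
  if (!Object.hasOwn(person, "name")) {
    return undefined;
  }
  return person.name
    .split(" ")
    .map((part) => part[0] ?? "")
    .join("");
}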
Mixing validation and business logic like this is a known anti-pattern called Shotgun Parsing, which brings up two more downsides.
The first is that mixing validation and business logic leads to late-discovered errors because an invalid part of the data might not be found until it’s validated deep within the application. Late-discovered errors can be difficult to reason about and handle because the state of the application is less well known and the code may already have done something difficult or impossible to revert.
The second downside is that mixing validation and business logic also makes it difficult to systematically guarantee that the data as a whole is valid, because validation is spread far apart and interleaved with operations that can mutate the data. Proving data validity in this circumstance can feel like chasing a moving target that’s also an expert in guerrilla warfare.
A better way: hoisting validation
An alternative to just-in-time validation is to hoist the validation to a higher scope, which I’ll creatively refer to as validation hoisting.
function getGreeting(person) {
if (!Object.hasOwn(person, "name")) {
return undefined;
}
const name = getName(person);
return `Hello ${name}`;
}
function getName(person) {
return person.name;
}
By hoisting validation, the function using the data no longer has to include validation logic because it already knows the data it gets is valid. It can focus on handling business logic. The function’s consumer often ends up doing the same amount of handling it would have done with just-in-time validation, so this is a net simplification.
When we hoist validation out of the consumer as well, and then continue to hoist it as far up as we can, this strategy yields two delightful benefits.
First, we extract validation logic from the whole vertical slice of application code. The application code can focus on meaningful business logic, and any remaining complexity is solely due to meaningful business requirements. The code becomes easier to read, reason about, and maintain.
Second, we collect the validation logic into a single place: right where we first receive the data. Here we can validate the data completely and atomically, avoiding Shotgun Parsing. We discover validation errors as soon as possible, making errors easier to understand and handle because application state is well known and no business logic has been run. We also get a complete understanding of the state of our data at this point in time.
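Applied to the earlier example, fully hoisted validation might look roughly like this (validatePerson, rawPerson, and the choice to throw on failure are illustrative):
// right where the data is first received
function validatePerson(person) {
  if (typeof person !== "object" || person === null || !Object.hasOwn(person, "name")) {
    throw Error("Person must be an object with a 'name' property");
  }
  return person;
}
const person = validatePerson(rawPerson);

// business logic, free of validation
function getGreeting(person) {
  return `Hello ${getName(person)}`;
}
function getName(person) {
  return person.name;
}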
Validation hoisting can be a bit more work up front in a codebase that doesn’t already use it. For applications that do already use validation hoisting, though, adding validation ends up being a small change in a single place and is just as straightforward as adding just-in-time validation.
Carrying confidence in validated data forward
While validation hoisting gives us complete confidence and clarity about our data at the point of validation, it can become difficult to remain confident that the data matches our assumptions when it’s used later in the application. What if something mutates it along the way? We need a vehicle to bring the confidence forward.
One option is to make the data immutable. If we know the data can’t be changed, then we can be confident the structure is the same as it was directly after validation.
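In JavaScript, one shallow way to do this is Object.freeze(), reusing the illustrative validatePerson and rawPerson names from the sketch above (nested objects would need freezing too):
const person = Object.freeze(validatePerson(rawPerson));
person.name = "Mallory"; // silently ignored, or a TypeError in strict mode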
A second option is to encapsulate the data in a structure that re-validates it whenever it changes. This restores our confidence in the validity of the data after a change while keeping the validation out of the business logic and avoiding Shotgun Parsing. In exchange we incur some runtime cost and discover errors later. Providing clear error messages can make handling the late error discovery palatable, at least.
This can be done in JavaScript using a class with a setter method to validate the data whenever it’s updated:
class Integer {
  #value;
  constructor(value) {
    this.value = value; // runs the setter, validating the initial value
  }
  set value(value) {
    if (!Number.isInteger(value)) {
      throw Error("Value must be an integer");
    }
    this.#value = value;
  }
  get value() {
    return this.#value;
  }
}
A third option is to use a static type system. Once the structure of the data is validated, the type system tracks changes to the data as it moves through the application. Using a static type system is generally the best option when available: it doesn’t incur runtime costs, it uncovers errors early at compile time, and it maintains confidence even as the data is transformed.
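With TypeScript, for example, the greeting logic from earlier can lean entirely on the type (the Person interface name is an illustrative choice):
interface Person {
  name: string;
}

// once validation has established the structure, the type carries the confidence forward
function getGreeting(person: Person): string {
  return `Hello ${person.name}`; // no runtime check needed here
}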
Static type systems are not a silver bullet, though. Every type system has limitations on what it can practically describe. TypeScript has a number type, for example, but doesn’t differentiate between integers and floats. When we validate assumptions that can’t easily be expressed with a static type system, we can additionally use immutability and encapsulation to bring confidence forward.
Putting theory into practice
Let’s say we have an application that reads a JSON configuration file on startup.
While the user populating the config file probably has good intentions, the contents of the file are outside our application’s control and in some sense are untrustworthy. To make sure the contents meet our application’s assumptions about the structure of the configuration data, our application should validate the contents.
For demonstration purposes, let’s say our application expects a simple structure for the configuration data: an object with a port field that it uses to determine which port it should listen on. Here’s what a valid configuration file might look like.
{
"port": 8080
}
We can validate the configuration data with a few conditional checks and some control flow. We’ll assume the configuration file has already been read and parsed as JSON successfully (though the JSON parsing is also part of the validation process).
interface Config {
port: number;
}
function validateConfig(rawConfig: unknown): Readonly<Config> {
if (typeof rawConfig !== "object" || rawConfig === null) {
throw Error("Config must be an object");
}
if (!("port" in rawConfig)) {
throw Error("Config field 'port' must be defined");
}
if (typeof rawConfig.port !== "number") {
throw Error("Config field 'port' must be a number");
}
if (!Number.isInteger(rawConfig.port)) {
throw Error("Config field 'port' must be an integer");
}
if (rawConfig.port < 1024 || rawConfig.port > 49151) {
throw Error(
"Config field 'port' must be between 1024 and 49151 (inclusive)",
);
}
// Reconstructing the object helps TypeScript
// pick up the validated types
const config: Readonly<Config> = {
port: rawConfig.port,
};
// Freezing the object makes it immutable at runtime
Object.freeze(config);
return config;
}
const config = validateConfig(rawConfig);
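For completeness, the step we assumed away (reading the file and parsing it as JSON) might look roughly like this; the file path and the use of Node's synchronous fs API are illustrative assumptions:
import { readFileSync } from "node:fs";

let rawConfig: unknown;
try {
  rawConfig = JSON.parse(readFileSync("./config.json", "utf8"));
} catch (error) {
  throw Error("Config file could not be read or parsed as JSON", { cause: error });
}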
We capture the structure of the validated data in a TypeScript interface so we can carry forward our confidence in the data.
Some of the validation is not practical to capture in TypeScript, though, such as the port number being an integer.
To help retain confidence, we make the config object immutable by freezing it with Object.freeze() and by typing it as Readonly.
It’s fairly straightforward, but it’s also a lot of code to validate such a simple piece of data. Applications may have dozens of configuration options, and writing validation code can become tedious. Doing bespoke validation for many applications can become a chaotic and heavy maintenance burden. Using a validation library can make this more standardized, concise, robust, and maintainable.
Zod is just such a validation library. Let’s see the same validation using Zod.
import { z } from "zod";
const configSchema = z
.object({
port: z.number().int().min(1024).max(49151),
})
.readonly();
// config is typed as Readonly<{ port: number }>
const config = configSchema.parse(rawConfig);
// we can export the type directly using z.infer<>
export type Config = z.infer<typeof configSchema>;
Way more concise!
If you’ve used other validation libraries like Yup or Joi, seeing this example may give you déjà vu. What makes Zod special is its focus on TypeScript.
Notice that we didn’t need to manually define a type for the validated data. Zod generates it for us, seamlessly handing off responsibility for data validity over to TypeScript. TypeScript can carry our confidence in the structure of the data forward from there.
Zod’s TypeScript integration goes even deeper, down to how Zod structures its validation schema.
Let’s say our application also expects the configuration object to have a boolean field, exposeMetrics, for choosing whether or not to expose metrics. If exposeMetrics is true, our application further expects the metricsEndpoint field to be set to configure the endpoint the metrics will be exposed on.
Here are two examples of what a valid configuration file might look like now.
{
"port": 8080,
"exposeMetrics": false
}
{
"port": 8080,
"exposeMetrics": true,
"metricsEndpoint": "/metrics"
}
TypeScript types can be composed using unions and intersections, and Zod can use the same approach to compose schemas to capture both of these possibilities.
import { z } from "zod";
const metricsConfigSchema = z.union([
z.object({
exposeMetrics: z.literal(false), // matches exactly 'false'
}),
z.object({
exposeMetrics: z.literal(true), // matches exactly 'true'
metricsEndpoint: z.string().nonempty(),
}),
]);
const configSchema = z
.intersection(
z.object({
port: z.number().int().min(1024).max(49151),
}),
metricsConfigSchema,
)
.readonly();
const config = configSchema.parse(rawConfig);
A union is a logical OR, so data satisfies metricsConfigSchema if it satisfies either of the sub-schemas. We could have more simply marked metricsEndpoint as an optional field and used a single schema, but using a union captures our intention better because we only care about metricsEndpoint when exposeMetrics is true.
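For comparison, the simpler optional-field version might look like the sketch below; it accepts the same data but doesn't encode the relationship between the two fields:
const metricsConfigSchema = z.object({
  exposeMetrics: z.boolean(),
  // nothing stops metricsEndpoint from appearing when exposeMetrics is false
  metricsEndpoint: z.string().nonempty().optional(),
});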
An intersection is a logical AND, so data satisfies configSchema when it satisfies both the inline object schema containing the port field and metricsConfigSchema.
Modeling validation the same way TypeScript models types provides a low-friction developer experience: we can use our experience with TypeScript to help construct Zod schemas, and it eases context switching between Zod schemas and the TypeScript type of the validated data.
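For instance, asking for the inferred type of the composed schema gives back roughly what we would have written by hand in TypeScript:
type Config = z.infer<typeof configSchema>;
// roughly:
// Readonly<
//   { port: number } &
//   ({ exposeMetrics: false } | { exposeMetrics: true; metricsEndpoint: string })
// >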
TL;DR
Validate data as soon as it’s received. Doing so simplifies our application’s logic, makes it more secure and robust, and makes discovering and handling validation errors easier.
Pick a way to carry forward the confidence in the structure of data gained from validation: immutability, encapsulation, or static types. Some cases necessitate using multiple strategies. When available, try static types first.
For TypeScript projects, Zod is an excellent choice for handling data validation.