There’s no such thing as a schemaless database. I know, lots of people want a schemaless database, and lots of companies are promoting their products as schemaless DBMSs. And schemaless DBMSs exist. But schemaless
databases are mythical beasts because there is always a schema somewhere. Usually in multiple places, which I will later claim is what causes grief.
There Is Always A Schema
We should define “schema” first. It comes from Greek roots, meaning “form, figure” according to my dictionary. Wikipedia says, roughly,
A database schema is its structure; a set of integrity constraints imposed on a database. These integrity constraints ensure compatibility between parts of the schema.
In other words, a schema expresses expectations about what fields exist in a database, and what their types will be. It also enforces those expectations, at least to some extent (there’s usually some flexibility).
My claim is that there’s always a schema, because somewhere,
something has expectations about what’s in a database. At least, any useful, practical, real database. The DBMS itself may not have such expectations, but something else does.
Schema In The Database
When the DBMS enforces the schema, then we say the schema is in the database. If you’re using MySQL and you try to insert a value into a column that doesn’t exist, you’ll get an error like this:
ERROR 1054 (42S22): Unknown column 'flavor' in 'field list'
Whoops. I’ll have to run an
ALTER TABLE
if I want to do that.
Schema In The Code
If I used MongoDB, I wouldn’t have this problem. I could write my code to insert
flavor
fields in documents, and read back those documents and do something with the
flavor
field. I don’t have to have the schema in the database (the DBMS doesn’t have to enforce it).
Now my schema is in my code, isn’t it? I can’t do anything useful with something’s
flavor
attribute unless the code knows it’s there. You could argue that maybe my code doesn’t have to know about it; perhaps it just mindlessly accesses whatever it finds and lets something else do what it pleases with it. In that case, though, the schema is in the client application or user. The buck has to stop somewhere.
It reminds me of the semantic web, micro-formats, and the like. All very nice, but somewhere, something or someone has to know what a person is, what an address is, what a song is, what an album and artist is. It can’t be infinite turtles all the way down, can it?
Schema In Both Places
I’ve just claimed that the schema is in the code if it’s not in the DBMS. If I use MySQL and add a
flavor
column to the table, then my DBMS knows that this attribute is valid. But even when the DBMS has the schema,
the code does too. If my code doesn’t know, respect, and agree with the schema in the DBMS, then we’re going to have problems like the
Unknown column
error above.
This is where the fallacy enters, in my opinion. People say their database has no schema, is unstructured, etc. It would be more accurate to say “there is no single centralized schema definition. It is scattered throughout my code.”
Is that a bad thing?
In my opinion, no. A strongly enforced central definition is a dependency that doesn’t scale well, in human terms. Large codebases end up with dependencies on centralized schema definitions that are brittle and require lots of things to be updated at a single time, instead of allowing the code to cope with a fluid and evolving schema definition and gradually be updated.
I remember working at an eCommerce website that had many hundreds of databases, thousands of tables, and if I recall correctly, millions of stored procedures. We used a vendor tool to scan all our source code and databases and show us graphs of the relationships between all these things. After months of waiting for the indexing to complete, we opened up the application and the moment of truth arrived. “Let’s look at the order inventory table,” someone suggested. A glorious hairball emerged, slowly painting line after line until the screen was just a big black blob. It was useless and just told us what we already knew: the schema of the order inventory table was expressed in so many places, a change to it was probably impossible. I don’t know, but I’d bet a donut it hasn’t changed since then.
The other point of view on this is that the database’s job is to define the data and ensure only valid data is entered. I know this is a common point of pride among people who like PostgreSQL better than MySQL. And it’s surely valid, as well. It’s true that if the DBMS is permissive, you can end up with garbage in it. But my experience with large applications has been that this feels good at first and then becomes a problem later on. Just my two cents.
Conclusion
Since this is more or less a rant, I should not go on too much longer. Main points:
- A database isn’t just a DBMS and the schema and data in it. The apps that interact with the data are usually part of the database per se, too.
- There’s no such thing as schemaless. The schema is always in the code; the question is whether it’s also centrally enforced in the DBMS.
- My experience has been that centralized schema definitions are harder to scale on large applications and codebases.