Look, everyone makes mistakes. It's true. But not all of us have the chance to make mistakes that end up costing millions of dollars in hardware and production support costs. Any one of the following five mistakes listed below will add costs to your company. It's guaranteed. The costs could be hardware related (extra disk space, network bandwidth), which tend to add up quickly. The costs are also support related (bad performance, database re-design, report creation, etc.), and they add up very quickly. Want to save your company some money, some headaches for your end-users, and ensure data quality all at the same time? Then avoid these five database design mistakes.
Watch the on-demand webcast:How to Conduct a Database Design Review
1. Going BIG, just in case
I often see data types being chosen as if they don't matter. But the truth is, they do matter. If you know that the only possible values for a certain column are between 0 and 100,000, then you don't need to slap a BIGINT data type for that column when a INT would do just fine. Why does this matter? The BIGINT data type requires 8 bytes of storage, and the INT requires only 4 bytes of storage. That means for each row of data you could be wasting 4 bytes. Doesn't sound like much, right? OK then, let's consider that your table has two million rows. Multiply those rows by 4 bytes and you have 8 million bytes, or roughly 7.8MB of wasted space. It doesn't sound like a lot, does it? Well, it adds up, and quickly. I've only shown you an example for just one column, but what about those NCHAR(2000) columns that are storing mostly first and last names? How much extra overhead for those? How about your date columns? Do you need calendar dates prior to the year 1900, or after the year 2079? Then SMALLDATETIME is likely to work just fine for you. Oh, and let's not forget that these columns can be indexed, and those indexes will also be unnecessarily wider as well. Choosing the right data type matters, for all sorts of reasons. Take the time and make an effort to get it right at the start.
2. Expecting a DBMS feature to mean the same thing across all systems
Bigger companies tend to have a variety of systems, all of which need to exchange data. It is quite common to see familiar terms in use by those systems. For example, you are likely to see the use of the term TIMESTAMP. You simply cannot assume that the format of TIMESTAMP will be the same between systems. If you do, you are likely to have issues at some point. If you are lucky, the issues will manifest themselves early. If you are unlucky, the issue will lay dormant until the worst possible time for your system to come to a grinding halt. I chose TIMESTAMP as my example here because it is one of my favorite misnomers inside of SQL Server. Why? Because TIMESTAMP has nothing to do with date or time. And I love this hidden gem inside the BOL: The Transact-SQL timestamp data type is different from the timestamp data type defined in the ISO standard. So that means other systems that do adhere to the ISO standard of what it means to be a TIMESTAMP are going to return something vastly different. Take Oracle, for example, and how they define a TIMESTAMP value. Even DB2 has some subtle differences as well. The lesson here is that when you are working with disparate systems, do not assume that familiar terminologies will work in exactly the same way.
3. Expecting database-generated identifiers to be sequential
Being able to insert data into a database in sequential order has many benefits, and most database designers will look to implement some type of mechanism to ensure that such sequences are used. However, many of those mechanisms will leave "gaps" from time to time, and these gaps lead to confusion as people expect to see {1, 2, 3, 4, 5...} and not something like {1, 2, 4, 5, 6...}. As soon as a gap is found, people send out a search party looking for the missing data. I know I've lost hours trying to track down "bugs" in systems where people thought rows of data were missing, only to realize that the integrity of the data was fine but the method for generating the sequence was not. Now, for reasons I can't explain, people just love GUIDs in database designs. One of the (many) issues with GUIDs is that they are not sequential by default. This will lead to decreased performance, often in a very short amount of time. For a great detailed explanation of this, check out this post by Kimberly Tripp at SQL Skills. The part you want to pay attention to is here: "But, a GUID that is not sequential...can be a horribly bad choice - primarily because of the fragmentation that it creates in the base table but also because of its size. It's unnecessarily wide..." Yeah, that *is* what she said.
4. Not examining your foreign keys (FKs) as part of your indexing strategy
I am assuming, of course, that you even have foreign keys defined. I've seen *many* databases that have little to no primary keys, foreign keys, or even any indexes defined. No, I don't know who would do such a thing either. But they're out there, and sooner or later you will find them, too. Assuming you have FKs defined, you should be evaluating to see if it would make sense to add indexes to match those FK definitions. In some cases, it will. In other cases, it won't. But you should make certain that this type of review is part of your overall design process.
5. Applying the surrogate key without having the original business key as an alternate
Back to those folks who love the GUIDs for a moment. If you are using a surrogate key (ideally, one that is sequential) in your database design, you should not forget that the data in your table should also have an alternate key based on the business logic that necessitated the creation of the table in the first place. Many database designers tend to focus on the surrogate key only, and forget all about the other columns. This is bad for your data quality, as it would allow duplicate values to creep into your business layer if you do not put a unique constraint/index on that alternate key. Nobody wants that, trust me.
And there you have it. Avoid these five database design mistakes and save your company money and headaches, and ensure data quality all at the same time.
Thomas LaRock is a Head Geek™ at SolarWinds and a Microsoft® Certified Master, Microsoft Data Platform MVP, VMware® vExpert, and former Microsoft Certified Trainer. He has over…