If you don't know your data, you don't know your application.

The art of data modelling is definitely lost on some [most] people, or they never found it, even though they think they did. Over dinner with my good friend Morgan last night we were swapping recent stories on the topic.

Morgan recently wrote I want my 4 bytes back damn it., an interesting example of storing an ISBN. A further, rather more impractical ISBN example can be found at Getting started with MySQL.

Disk is cheap now, so the attitude, and poor excuse, can be: well, a few extra bytes don't matter. Well, no! Maybe if you're a social hacker with a website that peaks at two concurrent connections. But much like some recent Java code reviewing I performed, just because the system isn't 24×7 doesn't excuse you from writing code that handles concurrency and thread safety, and that is as efficient as possible; in this case, RAT versus CAT. (I'll need to write about this, as it seemed to go over the heads of even some of the other professionals.)
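As a rough sketch of that last distinction (assuming RAT and CAT stand for row-at-a-time versus chunk-at-a-time processing; the table here is illustrative):

    -- RAT: the application loops, issuing one statement per row
    UPDATE account SET balance = balance * 1.05 WHERE account_id = 1;
    UPDATE account SET balance = balance * 1.05 WHERE account_id = 2;
    -- ... repeated for every row

    -- CAT: one set-based statement does the same work in a single pass
    UPDATE account SET balance = balance * 1.05 WHERE account_type = 'SAVINGS';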

I can remember a very specific example some 10 years ago when doing performance analysis on a site. I'd identified the need for an additional index on a table. Now, this table was sized for 200 million rows, and it already held about 70 million. The problem was that adding another index required a 4GB disk allocation. These things have an effect on sizing, growth and backups.
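You can watch this footprint yourself; a quick sketch (the schema name is illustrative), using the statistics MySQL 5 exposes:

    -- Data versus index disk footprint per table (MySQL 5.0+)
    SELECT table_name,
           ROUND(data_length / 1024 / 1024)  AS data_mb,
           ROUND(index_length / 1024 / 1024) AS index_mb
    FROM   information_schema.TABLES
    WHERE  table_schema = 'mydb';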

So the impact of appropriate sizing clearly matters. If it were just one poorly sized column, that would be acceptable (just), but normally it's a pattern that litters a data model.

What's important to realise is that it's not just disk space, it's also memory. Without really touching on sizing data, I did mention some examples previously in Improving Open Source Databases – WordPress. There, the use of BIGINT(20) for primary keys proved my point. That's 8 bytes, but unless you are going to have 4 billion categories, it's a waste. It's a waste when it's a foreign key in a table, and it's a big waste when it's indexed and that index is then in memory, wasting more precious resources.
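A minimal sketch of the alternative (the table and sizes are illustrative): a category lookup table will never need an 8-byte key.

    -- BIGINT wastes 8 bytes per key, per foreign key, per index entry.
    -- SMALLINT UNSIGNED (2 bytes) still allows 65,535 categories.
    CREATE TABLE category (
      category_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
      name        VARCHAR(30) NOT NULL,
      PRIMARY KEY (category_id)
    );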

So how do you identify when the designer of the model has no idea about the intrinsic value of the data being stored? If you see tables with VARCHAR(255), that's a clear sign. They have no idea regarding the data, so a default limit is used. Morgan referred to it as "Shooting guns in the dark with your eyes closed". Books don't help the cause. I was just skimming High Performance MySQL last night (one of the many freebies from the UC). There on page 82 is a table definition with not one column but two declared as varchar(255). Hmmm!

If you see any new applications with VARCHAR(255) they are even more lost, because MySQL 5, which has been around quite some time now, supports VARCHAR(65535). Speaking of that, has anybody actually seen a VARCHAR(65535)? I'd like to know.
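For the record, a sketch of how close you can get (my figures, assuming MySQL 5.0 and a single-byte character set): the declared length is bounded by the 65,535-byte maximum row size, and the 2-byte length prefix plus the NULL flag steal a few bytes.

    -- VARCHAR(65535) itself is rejected as exceeding the row size;
    -- a single nullable latin1 column tops out at 65,532.
    CREATE TABLE big_varchar (
      c VARCHAR(65532)
    ) CHARACTER SET latin1;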

Another example is Point 4 of Sheeri's Top 8 SQL Best Practices, regarding storing IPs efficiently. If you log, for example, every page hit, this can be your largest table, and moving from varchar(15) to an unsigned int can save you up to 11 bytes per row alone.
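A sketch of the technique (the table and column names are mine): store the dotted quad as a 4-byte INT UNSIGNED via INET_ATON(), converting back with INET_NTOA() for display.

    CREATE TABLE page_hit (
      hit_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
      ip     INT UNSIGNED NOT NULL,  -- 4 bytes versus up to 16 for varchar(15)
      hit_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
      PRIMARY KEY (hit_id)
    );

    INSERT INTO page_hit (ip) VALUES (INET_ATON('192.168.10.50'));
    SELECT INET_NTOA(ip) AS ip FROM page_hit;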

These may just be simple examples, but it's the principle. When you define a column, consider its data; if you don't know it, then take the time to do the research and learn.

Comments

  1. sheeri says

    Fascinating. I just did some consulting and reviewed a schema, and varchar(255) was everywhere. I asked the client if there were limits being enforced (do you really need varchar(255) for a first name?) and he said that the application enforced them. I mentioned that if the application does it, so should the db, but remember to change the application AND the db if you tweak the limits.

    Although, this was a new database. I mentioned he could (if he wanted) leave the varchar(255)'s there, because varchar, blob and varbinary types are dynamic and only consume as much space as the actual data. Then, after a year or so, when he had "a bunch of" data, he could use procedure analyse() to figure out what the real data types should be.

    After all, the db will be small to start with. And, you never know what kinds of data people will put in there, so sometimes it's good to go with a default. (I have a post coming up about procedure analyse() when importing data from a flat file, although Markus just posted about the procedure...)
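    A sketch of that workflow (the table name is illustrative):

    SELECT * FROM customer PROCEDURE ANALYSE();
    -- returns, per column, the min/max values and lengths seen,
    -- plus a suggested Optimal_fieldtype based on the stored data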

    So, yeah. The danger is, of course, saying “do this now, but do that later on.” Hopefully they will remember later on they need to do “that”….

  2. sheeri says

    BTW, what you say is true — however for new services/companies, you often do NOT know your data OR your app — remember, the first version is almost always gutted. So your schema probably will be too — if you normalize but use fields bigger than you need, when you get a “good” amount of data you can size it down.

    Of course, you can just, say, go to your email program, get all the e-mail addresses there, put them in the db, run procedure analyse() and say “that’s a good representative sample of emails, so I’ll use this”.

  3. says

    Some follow-up to Sheeri's comments.

    Like most things today, it's all in moderation. Historically we spent a lot of time in the planning stages of data models, partly because the traditional waterfall approach to software development needed that rigid structure; otherwise, software development would never have been completed.

    Today, in an Agile framework (which I fully support; it's just common sense) you have to consider that things can change rapidly. At certain points we don't know what the structure of the data will be, so we have to consider the circumstances. Why then choose varchar(255)? Why not investigate the possible types of values that are going to be recorded? You must have some initial idea, otherwise you would not know how to name the column, for instance. You could then consider the risk and assign a suitable buffer zone, be it a number of characters or even 100%.

    However, the premise that First Name is varchar(255) is a clear sign of laziness. With empirical data we use and know every day, like first name, last name, email, address and zipcode, we can set better limits. First Name can be set to varchar(20), as in the sketch below. It's only when you get an exception from somebody that you should consider increasing the size.
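    A sketch of better-bounded everyday fields (the lengths are my educated guesses; tune them to your own data):

    CREATE TABLE person (
      first_name VARCHAR(20) NOT NULL,
      last_name  VARCHAR(30) NOT NULL,
      email      VARCHAR(60) NOT NULL,
      zipcode    VARCHAR(10) NOT NULL  -- assumes US-style codes plus a buffer
    );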

    Would this not be a better judgement of data boundaries? As a general rule of thumb, try to make an educated guess, even with a suitable variance for uncertainty. It's more likely you will not have to revisit things if they fit, leaving you free to focus on more complex problems with your application in the future.

  4. says

    I had an offline comment from a reader; I thought I'd add it and my response for the readers.

    Query:
    What do you intend to do with IPv6 addresses?

    Response:
    I was wondering if anybody would pick that up. That's a good question, and as Sheeri pointed out, you can use the MySQL function INET_ATON().
    The docs at http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html state “Addresses may be 4- or 8-byte addresses.”
    Tests show:

    mysql> select inet_aton('255.255.255.255'), inet_aton('255.255.255.255.255.255');
    +------------------------------+--------------------------------------+
    | inet_aton('255.255.255.255') | inet_aton('255.255.255.255.255.255') |
    +------------------------------+--------------------------------------+
    |                   4294967295 |                      281474976710655 |
    +------------------------------+--------------------------------------+
    1 row in set (0.00 sec)
    

    In this case you need to store it in a BIGINT, but 8 bytes is better than 23.

  5. Scott says

    What’s wrong with lazy? In this case, it’s just another term for modular or encapsulated. Why make the data layer enforce an application layer decision?

    As far as I can tell, the database doesn’t care if you use varchar(255) or varchar(20). Both occupy L + 1 bytes. They require identical memory allocation, identical bandwidth going back to the client, and identical index requirements. Have I missed something here?

    Duplication is bad. Duplicate data means non-normalized structure. Duplicate code means poor use of functions and libraries. Why deliberately create duplicate length constraints?

    I say if it absolutely has to be a varchar, make it a varchar(255) and be done with it. Worry about it only if you cross the 256 boundary.

  6. James Day says

    Scott, they will not require identical memory allocation. In assorted situations MySQL internally allocates buffers of the maximum possible size for the field. They will require identical disk space, but that's a different thing.
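    One concrete case (a sketch, assuming the documented behaviour of implicit in-memory temporary tables, which do not support variable-length rows): each VARCHAR is held as a fixed-width CHAR of its full declared length.

    -- If this GROUP BY builds an in-memory temporary table, a
    -- VARCHAR(255) name column reserves 255+ bytes per row there,
    -- where a VARCHAR(20) would reserve 20.
    SELECT name, COUNT(*) FROM customer GROUP BY name;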