To enum or not to enum?

I’ve never used database columns that embedded defined valid values within the schema definition. Within MySQL there are 2 definitions, ENUM and SET. There are a few reasons why, but first an explanation of these data types.

In summary, using the MySQL Sample Database.

CREATE TABLE film (
film_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
...
rating ENUM('G','PG','PG-13','R','NC-17') DEFAULT 'G',
special_features SET('Trailers','Commentaries','Deleted Scenes','Behind the Scenes') DEFAULT NULL,
PRIMARY KEY (film_id)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

So from this, the following commands allow you to inspect this information via mysql.


DESCRIBE film;
SHOW COLUMNS FROM film LIKE 'rating';

With the introduction of the INFORMATION_SCHEMA in MySQL 5, a more traditional method using a valid SELECT statement.


SELECT COLUMN_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA='sakila'
AND TABLE_NAME='film'
AND COLUMN_NAME = 'rating'

This is great, however you have to know a lot of information, the table_name and column_name ok, they are fixed, but now the table_schema information is required to be known by the application.

Why I don’t use these?

Historically, there have been 3 reasons why I’ve never used ENUM.

  • First, it’s not standard SQL (at least to my knowledge), and historically hasn’t been consistent between Database products
  • More importantly as data, it’s terrible to maintain easily, as it’s within the table structure
  • The management of valid values within an application, and the need to manage this data dynamically.

Now, I’m not certain if within MySQL there are any funky ways to manage this type of information. I’d welcome comments on what people do. However from my previous experience, this is my method of implementation.

All information of this type, I refer to as a code. I implement this within a common table, defined as reference_data (based historically on the Oracle cg_ref_codes). Here is a cut down version of my definition for simplicity purposes.


CREATE TABLE reference_data(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
type VARCHAR(20) NOT NULL,
code VARCHAR(20) NOT NULL,
value VARCHAR(100) NOT NULL,
PRIMARY KEY (id)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

Of course, I’ve extended this basic structure to include ordering of data per type, default value of a given type, grouping of types (for larger systems) and a status to manage historically codes no longer valid for new data.

Within tables, I define all codes with an extention of _code so as to clearly identify the content of a given column.

So, for this example I would add data such as:


INSERT INTO reference_data(type,code,value) VALUES
('rating','G','General Audience - All Ages Admitted'),
('rating','PG','Parental Guidance Suggested. Some Material May Not Be Suitable For Children'),
('rating','PG-13','Parents Strongly Cautioned. Some Material May Be Inappropriate For Children Under 13'),
('rating','R','Restricted. Under 17 Requires Accompanying Parent or Adult Guardian'),
('rating','NC-17','No One 17 And Under Admitted');

This for me provides the following benefits:

  • The Data is data, meaning it can be easily modified via normal application and sql functionality
  • Additional information such as a more general description can be provided and presented to end users
  • Additional attributes non shown in this example, provides ordering support, and a default value
  • The same set of data for a given type can be used in multiple tables

Now, the clear downside of this level of functionality is that the data integrity that is managed by ENUM at the database level has now been lost. You need to temper this with the level of access to your database, and you can easily implement via Database Triggers this level of security. However, I think that clear ability to tie the required values for an ENUM column to easy accessibility via an application, and allow for easy management is clearly a strength.

I’m yet to be convinced otherwise.

Comments

  1. Joelle says

    Thanks for this. I’ve always put each set of reference data in it’s own table. But that can get to be a lot of tables to manage and keep track of. I like this method – much cleaner.

  2. says

    Hi Joelle,

    I’ll give you the trick for the ordering as well as I’ve seen people stumble here and it’s very easy to solve.

    column order_seq TINYINT(3) UNSIGNED NOT NULL DEFAULT 100

    Now whenever you are retrieving data based on a type, e.g. type=’rating’ you order by order_seq,value
    So by default, if you never mess with the order_seq, you get alphabetical ordering. In our example with rating, we may not want to order alphabetically, so we adjust the order_seq. The benefit of default to 100 rather then 1, is normally alphabetical may be ok, except for one or two values, with a default of 100, you only have to change 2 rather then them all.

    With this storage structure, I cache all this data within memory at an application level, making access to this data instant.

  3. Chris says

    I came across your blog entry while searching for evidence on why ENUM is evil.

    But your entry has contradicted my theory of why storing reference data in the database is base. I manage a very popular website and when you start serving 10k worth of hits per minute. small queries to reference display information become a huge overhead — In initial development it makes sense, but when your app goes large scale and is pushed to the limit due to budget restrictions etc etc etc Reference calls are the first to go and get embedded into your application using Arrays or Hashes. Using the file system for simple quick values is always going to be faster.

    About ENUM and how your entry applies — ENUM was meant to help organize or divide data or groups of data. If your adding extra attributes to your desired group of data than ENUM is not for you.

    Hope this helps

  4. says

    I like ENUMs as well, never had a problem. A company I used to work for stored MPAA rating data just like you did, in VARCHAR columns, and trust me they paid DEARLY for it in performance, and integrity sucked.

  5. ojak says

    Organizationally speaking, I prefer small reference tables as they’re a really nice way of commenting your otherwise ENUM-able values (as in the ‘ratings’ example). I prefer to use ENUM only for unchangeable sets in which the ENUM value is actually defined in the ENUM value itself (Male, Female, etc).

    For everything else, I call the reference tables only when the app server loads and cache the values in memory, similarly to what Joelle said above (#2). That way, no performance hit (other than what usually amounts to a tiny amount of additional memory usage) and the app deals with the rest. Only issue to keep in mind is that if the reference values change, all apps must be rebooted/re-cached to apply the DB changes (which is good practice anyhow, assuming production environment allows for it)…

    Anyhow, great write-up and comments.

  6. Jim says

    Ronald,

    1) I ran this thru phpMyAdmin=

    CREATE TABLE film (
    film_id INT UNSIGNED NOT NULL AUTO_INCREMENT,

    rating ENUM(‘G’,’PG’,’PG-13′,’R’,’NC-17′) DEFAULT ‘G’,
    special_features SET(‘Trailers’,’Commentaries’,’Deleted Scenes’,’Behind the Scenes’) DEFAULT NULL,
    PRIMARY KEY (film_id)
    )ENGINE=InnoDB DEFAULT CHARSET=utf8;

    2= RESULTED error

    #1064 – You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘.. rating ENUM(‘G’,’PG’,’PG-13′,’R’,’NC-17′) DEFAULT ‘G’, special_features SET’ at line 3

    Will I get an e-mail when a response is posted here please?

    Thx…Jim

  7. says

    I never found it to difficult to manage enums. Your storage and performance sizes generally seem to be less with enums in MySQL, of course depending on your structure it might be nominal or the other way. I never liked the argument that getting enum values is difficult, with PHP it is fairly easy:
    // returns an associative array of enum_value => enum_value
    function EnumValues($link_name, $table_name, $column_name){
    $open_link = $this->OpenLink($link_name);
    if(!$open_link) return false;

    $result = mysql_query(‘SHOW COLUMNS FROM `’.$table_name.’` LIKE ”.$column_name.”’);
    if(!$result){
    if(DEBUG) echo ‘ERROR: SHOWING ENUM FIELDS, failed on "’.$mysql_hup[$link_name.”_database”].’".MySQL said: ‘.mysql_error().”;
    return false;
    }
    $row = mysql_fetch_array($result , MYSQL_NUM);
    // extract the values, the values are enclosed in single quotes and separated by commas
    preg_match_all(“/'(.*?)’/” , $row[1], $enum_array);
    // set the value from $enum_array = to the key
    $enum_fields = array_combine($enum_array[1], $enum_array[1]);
    return $enum_fields;
    }

    The only real tangible arguments against them is that you would still need a reference table to store additional info, like a description, or if your planning on moving away from MySQL, which from my experience, very very few web apps ever do.

  8. says

    You make some very strong points and in some cases I would agree that the ability to organize, modify, add data that’s stored in a separate table is great, it’s not always necessary.

    ENUM is great if used right. Sure, it might not be 100% supported by all DB platforms/programing frameworks, but if you know that your combination of DB/framework will work with ENUM then there is no reason not to take advantage of it.

    This is not to say that you should implement it everywhere… but there are times when it makes sense to do so.

    I like to use ENUM when I need to define things like “true/false”, “draft/pending/approved/banned”, “Mr/Mrs/Ms/Dr/Prof” and other similar data. When you have simple values that don’t change then you can save a lot of headaches and make your code A LOT cleaner by taking advantage of the ENUM type. These values are fixed and never change for the life of an app (mostly) and there is no reason to complicate your code to get the extra flexibility that you DON’T NEED (ie. storing this data in a separate table)… it’s easier to use ENUM and might actually prevent a few future bugs. It will certainly improve the speed of your queries by eliminating an extra JOIN or two.

    that’s just my 2c… I might be wrong (it happens) ;)