Deleting from ARCHIVE tables

I can’t say I’ve used the ARCHIVE storage engine before, but at the NY MySQL Meetup last night there was discussion of the improvements to ARCHIVE in 5.1 and the fact that you could not DELETE from archive. A simple test confirmed this indeed throws an error.

DROP TABLE IF EXISTS url_log;
CREATE TABLE url_log(
log_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
log_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
user_id INT UNSIGNED NULL,
url VARCHAR(100) NOT NULL,
PRIMARY KEY (log_id))
ENGINE=ARCHIVE;

DELETE FROM url_log;
ERROR 1031 (HY000): Table storage engine for 'url_log' doesn't have this option

However, part of MySQL 5.1 which is RC status, there is partitioning. Thinking that one could probably partition say a log table by DAY OF MONTH, and then you could do what you want with the data in a partition and delete the partition, I tried the following test.
NOTE: for the purposes of testing, I used SECOND() rather then DAY() and smaller ranges for simplicity.

DROP TABLE IF EXISTS url_log;
CREATE TABLE url_log(
log_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
log_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
user_id INT UNSIGNED NULL,
url VARCHAR(100) NOT NULL,
PRIMARY KEY (log_id))
ENGINE=ARCHIVE
PARTITION BY RANGE ( SECOND(log_date) ) (
    PARTITION p0 VALUES LESS THAN (10),
    PARTITION p1 VALUES LESS THAN (20),
    PARTITION p2 VALUES LESS THAN (30),
    PARTITION p3 VALUES LESS THAN MAXVALUE
);

However this throws an error.

ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function

Primary keys, and AUTO_INCREMENT do not play well with partitioning, so for the purpose of this proof of concept, I’ll drop these.

DROP TABLE IF EXISTS url_log;
CREATE TABLE url_log(
log_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
user_id INT UNSIGNED NULL,
url VARCHAR(100) NOT NULL)
ENGINE=ARCHIVE
PARTITION BY RANGE ( SECOND(log_date) ) (
    PARTITION p0 VALUES LESS THAN (10),
    PARTITION p1 VALUES LESS THAN (20),
    PARTITION p2 VALUES LESS THAN (30),
    PARTITION p3 VALUES LESS THAN MAXVALUE
);

Create a simple Stored Procedure to randomly generate some data. The function is not efficient using RAND() and SLEEP() but it does provide the generation of some data.

DELIMITER $$
DROP PROCEDURE IF EXISTS load_url_log;
CREATE PROCEDURE load_url_log (insert_count INT)
BEGIN
  DECLARE i INT DEFAULT 1;

  WHILE i < insert_count
  DO
     INSERT INTO url_log(user_id, url)
     VALUES (FLOOR(RAND()*100), CONCAT(REPEAT('x',FLOOR(RAND()*99)),SLEEP(RAND())));
  END WHILE;
END $$

DELIMITER ;

CALL load_url_log(500);

I quick check shows a distribution of data.

mysql> select distinct(second(log_date)) from url_log;
mysql> select distinct(second(log_date)) from url_log where second(log_date) < 10;
+--------------------+
| (second(log_date)) |
+--------------------+
|                  0 |
|                  1 |
|                  2 |
|                  3 |
|                  4 |
|                  5 |
|                  6 |
|                  7 |
|                  8 |
|                  9 |
+--------------------+
10 rows in set (0.00 sec)

And now the purpose of the test. Deleting data via deleting a partition.

mysql> ALTER TABLE url_log DROP PARTITION p0;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> select distinct(second(log_date)) from url_log where second(log_date) < 10;
Empty set (0.00 sec)

And it works. Re-creating however did not.

ALTER TABLE url_log ADD PARTITION (PARTITION p0 VALUES LESS THAN (10));
ERROR 1481 (HY000): MAXVALUE can only be used in last partition definition

Ok, so in my example I was lazy, I didn’t create specific partitions as I would in real world here, e.g. 31 partitions for DAYS. Simulate a little better.

DROP TABLE IF EXISTS url_log;
CREATE TABLE url_log(
log_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
user_id INT UNSIGNED NULL,
url VARCHAR(100) NOT NULL)
ENGINE=ARCHIVE
PARTITION BY RANGE ( SECOND(log_date) ) (
    PARTITION p0 VALUES LESS THAN (10),
    PARTITION p1 VALUES LESS THAN (20),
    PARTITION p2 VALUES LESS THAN (30),
    PARTITION p3 VALUES LESS THAN (60)
);
ALTER TABLE url_log DROP PARTITION p0;
ALTER TABLE url_log ADD PARTITION (PARTITION p0 VALUES LESS THAN (10));
ERROR 1493 (HY000): VALUES LESS THAN value must be strictly increasing for each partition


Still doesn’t work. RTFM indicates this.

DROP TABLE IF EXISTS url_log;
CREATE TABLE url_log(
log_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
user_id INT UNSIGNED NULL,
url VARCHAR(100) NOT NULL)
ENGINE=ARCHIVE
PARTITION BY LIST ( SECOND(log_date) ) (
    PARTITION p0 VALUES IN (0,1,2,3,4,5,6,7,8,9),
    PARTITION p1 VALUES IN (10,11,12,13,14,15,16,17,18,19),
    PARTITION p2 VALUES IN (20,21,22,23,24,25,26,27,28,29),
    PARTITION p3 VALUES IN (30,31,32,33,34,35,36,37,38,39),
    PARTITION p4 VALUES IN (40,41,42,43,44,45,46,47,48,49),
    PARTITION p5 VALUES IN (50,51,52,53,54,55,56,57,58,59)
);

ALTER TABLE url_log DROP PARTITION p0;
ALTER TABLE url_log ADD PARTITION (PARTITION p0 VALUES IN (0,1,2,3,4,5,6,7,8,9));

And that works.

So by creating an ARCHIVE table with 31 LIST partitions, one for each day of the month, you could use ARCHIVE to log data into DAY partitions, analyze, summarize,log,copy the data from the previous day, and purge it within 28 days.

Comments

  1. says

    Hi!

    I made Archive not delete data on purpose. The only way to delete data was via drop table (so a different set of privs). This was a feature request by Robin at the time.

    Personally I would call the above a bug, and I would suspect that having partitioning optimizing the delete to what must be a fast truncate means that it is not following the GENERATE macro. Which probably means that Partitioning is not handling HTON flags for different engines correctly :)

    Cheers,
    -Brian

  2. says

    More in keeping with the intent of the Archive engine would just be to actually create different tables… have a permanently empty template table from which you could run CREATE TABLE LIKE statements, and then mark the date in the table name. Then you can safely just drop old tables at whatever interval you feel appropriate.