There are three simple practices that can improve general INSERT throughput. Each requires consideration on how the data is collected and what is acceptable data loss in a disaster.
General inserting of rows can be performed as single INSERT’s for example.
INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?); INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?); INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?);
While this works, there are two scalability limitations. First is the network overhead of the back and forth of each SQL statement, the second is the synchronous nature, that is your code can not continue until your INSERT is successfully completed.
The first improvement is to use MySQL’s multi values capability with INSERT. That is you can insert multiple rows with a single INSERT statement. For example:
INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?), (?, ?, ?), (?, ?, ?);
Depending on how you collect the information to be inserted, you may be able to easily implement this. The benefit, as previously mentioned in The value of multi insert values shows an 88% improvement in performance.
One system variable to be aware of is max_allowed_packet. You may have to adjust this for larger INSERT statements.
Next is the ability to INSERT data based on information already in another table. You can also leverage for example another storage engine like MEMORY to batch up data to be inserted via this approach.
INSERT INTO table (col1, col2, col3) SELECT col1, col2, col3 FROM anothertable
The third option is to batch load your data from a flat file. Depending on how you source the information you are inserting, that may also be a significant improvement in throughput in bulk inserting data.
LOAD DATA [CONCURRENT] INFILE 'file' INTO TABLE (col1, col2, col3)
On a closing note, the choice of storage engine can also have a significant effect on INSERT throughput. MySQL also offers other non ANSI options including DELAYED, IGNORE and LOW_PRIORITY. These should definitely be avoided.
Morgan Tocker says
I wouldn’t recommend the memory batch option if you are using replication or want point in time recovery (SBR). It makes slaves not able to adhoc shutdown/restart, because you need to make sure no temporary tables are open.
Or to paraphrase that: Using MEMORY/TEMPORARY tables is the quickest way to break your slaves.
For the batch inserting, it’s also important to remember that there are good situations to not recommend this. Applications using READ/WRITE splitting need to make sure that every statement is as fast as possible to limit lag.
sbester says
INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?), VALUES (?, ?, ?), (?, ?, ?);
should be
INSERT INTO table (col1, col2, col3) VALUES (?, ?, ?),(?, ?, ?), (?, ?, ?);
ronald says
@sbester Thanks
@morgan You are right, different approaches can definitely affect consistency between replication and/or sharding. As I opened, “what is acceptable data loss in a disaster.” includes many considerations.
This post was more to highlight the different approaches because I’ve seen people open a data file, read line by line, then insert into the DB. I’ve seen people insert data, that can be batched, and can have acceptable loss be a synchronous burden to online performance.
Everything is a trade-off and requires appropriate design and knowledge of all limitations.
Paul says
With InnoDB, arranging the inserts in primary key order (where possible) can give large improvements.