WeSQL Introduction – MySQL running on S3

I recently became aware of WeSQL, a MySQL-compatible database that separates compute and storage, using S3 as the storage layer. The product uses a columnar format by default, which is significantly more space-efficient than InnoDB.

WeSQL introduces a new storage engine called SmartEngine, an LSM-tree-based structure that is well suited to an object-storage backend, and the documentation describes Raft replication to combat latency concerns. There is a lot more to review, including the serverless architecture and WeScale, a database proxy and resource manager.
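
As a quick sketch, and purely as my assumption that the engine registers under the name SMARTENGINE (verify with SHOW ENGINES on your build), explicitly creating a table on the new engine would look like:

SHOW ENGINES;
-- Assumption: the engine name is SMARTENGINE; confirm against the SHOW ENGINES output first
CREATE TABLE t_smartengine (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  payload VARCHAR(100) NOT NULL
) ENGINE=SMARTENGINE;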

It was very easy to take it for an initial spin using a Docker container and an AWS S3 bucket. I would really like to try CloudFlare R2, which implements the S3 API.

Under the covers there are over 180 new variables: 83 for smartengine, 57 for raft, 22 for objectstore, and more. Combined with 79 new status variables, this implies a lot of tunable options and a lot of complexity to optimize for a variety of workloads.
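
These are standard MySQL introspection statements, so they should work against a WeSQL instance if you want to eyeball the new settings and counters yourself:

SHOW GLOBAL VARIABLES LIKE 'smartengine%';
SHOW GLOBAL VARIABLES LIKE 'raft_replication%';
SHOW GLOBAL VARIABLES LIKE '%objectstore%';
SHOW GLOBAL STATUS LIKE 'Smartengine%';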

I was able to launch a demo and confirm the versions:

mysql> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 8.0.35    |
+-----------+
1 row in set (0.01 sec)

mysql> SELECT @@wesql_version;
+-----------------+
| @@wesql_version |
+-----------------+
| 0.1.0           |
+-----------------+
1 row in set (0.00 sec)

One of my early tests showed that it does not support FOREIGN KEY constraints, which is not a major concern.

ERROR 1235 (42000) at line 10: SE currently doesn't support foreign key constraints
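
For reference, a minimal sketch of the kind of DDL that triggers this error (illustrative table names, not my original test schema):

CREATE TABLE parent (id INT UNSIGNED NOT NULL PRIMARY KEY);
CREATE TABLE child (
  id INT UNSIGNED NOT NULL PRIMARY KEY,
  parent_id INT UNSIGNED NOT NULL,
  FOREIGN KEY (parent_id) REFERENCES parent(id)
);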

I did have some subsequent issues with the version currently documented, 8.0.35-0.1.0_beta1.37, and I reverted to a prior version from the docs earlier this week, 8.0.35-0.1.0_beta1.gedaf338.36. Given it’s a very new product, I am sure there is a lot of ongoing development.

This is just a quick introduction, but it is definitely a different architecture in the RDBMS landscape for MySQL compatibility. I hope to run some more tests using the provided sysbench use cases and my own workloads to delve under the covers more.

New Variables

branch_objectstore_id
clone_autotune_concurrency
clone_block_ddl
clone_buffer_size
clone_ddl_timeout
clone_delay_after_data_drop
clone_donor_timeout_after_network_failure
clone_enable_compression
clone_max_concurrency
clone_max_data_bandwidth
clone_max_network_bandwidth
clone_ssl_ca
clone_ssl_cert
clone_ssl_key
clone_valid_donor_list
initialize_branch_objectstore_id
initialize_from_objectstore
initialize_objectstore_bucket
initialize_objectstore_endpoint
initialize_objectstore_provider
initialize_objectstore_region
initialize_objectstore_use_https
initialize_repo_objectstore_id
initialize_smartengine_objectstore_data
objectstore_bucket
objectstore_endpoint
objectstore_mtr_test_bucket_dir
objectstore_provider
objectstore_region
objectstore_use_https
raft_replication_allow_no_valid_entry
raft_replication_appliedindex_force_delay
raft_replication_archive_log_bin_index
raft_replication_archive_recovery
raft_replication_archive_recovery_stop_datetime
raft_replication_auto_leader_transfer
raft_replication_auto_leader_transfer_check_seconds
raft_replication_auto_reset_match_index
raft_replication_check_commit_index_interval
raft_replication_checksum
raft_replication_cluseter_info_on_objectstore
raft_replication_cluster_id
raft_replication_cluster_info
raft_replication_configure_change_timeout
raft_replication_current_term
raft_replication_disable_election
raft_replication_disable_fifo_cache
raft_replication_dynamic_easyindex
raft_replication_election_timeout
raft_replication_flow_control
raft_replication_force_change_meta
raft_replication_force_recover_index
raft_replication_force_reset_meta
raft_replication_force_single_mode
raft_replication_force_sync_epoch_diff
raft_replication_heartbeat_thread_cnt
raft_replication_io_thread_cnt
raft_replication_large_batch_ratio
raft_replication_large_event_split_size
raft_replication_large_trx
raft_replication_learner_heartbeat
raft_replication_learner_node
raft_replication_learner_pipelining
raft_replication_learner_timeout
raft_replication_log_cache_size
raft_replication_log_level
raft_replication_log_type_node
raft_replication_max_delay_index
raft_replication_max_log_size
raft_replication_max_packet_size
raft_replication_min_delay_index
raft_replication_mts_recover_use_index
raft_replication_new_follower_threshold
raft_replication_optimistic_heartbeat
raft_replication_pipelining_timeout
raft_replication_prefetch_cache_size
raft_replication_prefetch_wakeup_ratio
raft_replication_prefetch_window_size
raft_replication_purged_gtid
raft_replication_recover_backup
raft_replication_recover_new_cluster
raft_replication_reset_prefetch_cache
raft_replication_send_timeout
raft_replication_start_index
raft_replication_sync_follower_meta_interva
raft_replication_with_cache_log
raft_replication_worker_thread_cnt
recovery_snapshot_from_objectstore
recovery_snapshot_only
recovery_snapshot_timestamp
recovery_snapshot_tmpdir
repo_objectstore_id
server_id_on_objectstore
serverless
smartengine_auto_shrink_enabled
smartengine_auto_shrink_schedule_interval
smartengine_batch_group_max_group_size
smartengine_batch_group_max_leader_wait_time_us
smartengine_batch_group_slot_array_size
smartengine_block_cache_size
smartengine_block_size
smartengine_bottommost_level
smartengine_bulk_load_size
smartengine_compact
smartengine_compaction_delete_percent
smartengine_compaction_task_extents_limit
smartengine_compaction_threads
smartengine_compression_options
smartengine_compression_per_level
smartengine_concurrent_writable_file_buffer_num
smartengine_concurrent_writable_file_buffer_switch_limit
smartengine_concurrent_writable_file_single_buffer_size
smartengine_data_dir
smartengine_deadlock_detect
smartengine_disable_auto_compactions
smartengine_disable_instant_ddl
smartengine_disable_online_ddl
smartengine_disable_parallel_ddl
smartengine_dump_memtable_limit_size
smartengine_enable_2pc
smartengine_estimate_cost_depth
smartengine_flush_delete_percent
smartengine_flush_delete_percent_trigger
smartengine_flush_delete_record_trigger
smartengine_flush_log_at_trx_commit
smartengine_flush_memtable
smartengine_flush_threads
smartengine_hotbackup
smartengine_idle_tasks_schedule_time
smartengine_level0_file_num_compaction_trigger
smartengine_level0_layer_num_compaction_trigger
smartengine_level1_extents_major_compaction_trigger
smartengine_level2_usage_percent
smartengine_level_compaction_dynamic_level_bytes
smartengine_lock_scanned_rows
smartengine_lock_wait_timeout
smartengine_master_thread_compaction_enabled
smartengine_master_thread_monitor_interval_ms
smartengine_max_background_dumps
smartengine_max_free_extent_percent
smartengine_max_row_locks
smartengine_max_shrink_extent_count
smartengine_max_write_buffer_number_to_maintain
smartengine_memtable_size
smartengine_min_write_buffer_number_to_merge
smartengine_mutex_backtrace_threshold_ns
smartengine_parallel_flush_log
smartengine_parallel_read_threads
smartengine_parallel_recovery_thread_num
smartengine_parallel_wal_recovery
smartengine_pause_background_work
smartengine_persistent_cache_dir
smartengine_persistent_cache_mode
smartengine_persistent_cache_size
smartengine_purge_invalid_subtable_bg
smartengine_query_trace_print_slow
smartengine_query_trace_sum
smartengine_query_trace_threshold_time
smartengine_rate_limiter_bytes_per_sec
smartengine_reset_pending_shrink
smartengine_row_cache_size
smartengine_scan_add_blocks_limit
smartengine_shrink_allocate_interval
smartengine_shrink_table_space
smartengine_sort_buffer_size
smartengine_stats_dump_period_sec
smartengine_strict_collation_check
smartengine_strict_collation_exceptions
smartengine_table_cache_numshardbits
smartengine_table_cache_size
smartengine_total_max_shrink_extent_count
smartengine_total_memtable_size
smartengine_total_wal_size
smartengine_unsafe_for_binlog
smartengine_wal_dir
smartengine_wal_recovery_mode
smartengine_write_disable_wal
snapshot_archive
snapshot_archive_dir
snapshot_archive_expire_auto_purge
snapshot_archive_expire_seconds
snapshot_archive_innodb_tar_mode
snapshot_archive_on_objectstore
snapshot_archive_period
snapshot_archive_smartengine_backup_checkpoint
snapshot_archive_smartengine_tar_mode
table_on_objectstore
wesql_version

New Status Variables

Com_show_consensuslogs
Com_raft_replication_start
Com_raft_replication_stop
Com_native_admin_proc
Com_native_trans_proc
Com_show_consensuslog_events
Smartengine_block_cache_miss
Smartengine_block_cache_hit
Smartengine_block_cache_add
Smartengine_block_cache_index_miss
Smartengine_block_cache_index_hit
Smartengine_block_cache_filter_miss
Smartengine_block_cache_filter_hit
Smartengine_block_cache_data_miss
Smartengine_block_cache_data_hit
Smartengine_row_cache_add
Smartengine_row_cache_hit
Smartengine_row_cache_miss
Smartengine_memtable_hit
Smartengine_memtable_miss
Smartengine_number_keys_written
Smartengine_number_keys_read
Smartengine_number_keys_updated
Smartengine_bytes_written
Smartengine_bytes_read
Smartengine_block_cachecompressed_miss
Smartengine_block_cachecompressed_hit
Smartengine_wal_synced
Smartengine_wal_bytes
Smartengine_write_self
Smartengine_write_other
Smartengine_write_wal
Smartengine_number_superversion_acquires
Smartengine_number_superversion_releases
Smartengine_number_superversion_cleanups
Smartengine_number_block_not_compressed
Smartengine_snapshot_conflict_errors
Smartengine_wal_group_syncs
Smartengine_rows_deleted
Smartengine_rows_inserted
Smartengine_rows_updated
Smartengine_rows_read
Smartengine_system_rows_deleted
Smartengine_system_rows_inserted
Smartengine_system_rows_updated
Smartengine_system_rows_read
Smartengine_max_level0_layers
Smartengine_max_imm_numbers
Smartengine_max_level0_fragmentation_rate
Smartengine_max_level1_fragmentation_rate
Smartengine_max_level2_fragmentation_rate
Smartengine_max_level0_delete_percent
Smartengine_max_level1_delete_percent
Smartengine_max_level2_delete_percent
Smartengine_all_flush_megabytes
Smartengine_all_compaction_megabytes
Smartengine_top1_subtable_size
Smartengine_top2_subtable_size
Smartengine_top3_subtable_size
Smartengine_top1_mod_mem_info
Smartengine_top2_mod_mem_info
Smartengine_top3_mod_mem_info
Smartengine_global_external_fragmentation_rate
Smartengine_write_transaction_count
Smartengine_pipeline_group_count
Smartengine_pipeline_group_wait_timeout_count
Smartengine_pipeline_copy_log_size
Smartengine_pipeline_copy_log_count
Smartengine_pipeline_flush_log_size
Smartengine_pipeline_flush_log_count
Smartengine_pipeline_flush_log_sync_count
Smartengine_pipeline_flush_log_not_sync_count

Managing SQL Drift: Ensuring Stability in Database Transitions

SQL drift is a significant challenge that occurs when SQL statements from an existing system produce unexpected results after migration to a new environment or system. These issues manifest in several critical ways: SQL statements may generate new execution errors, experience significant performance degradation, or return different results that compromise data integrity. Such challenges extend beyond simple compatibility issues, stemming from variations in database engines, optimization strategies, and SQL implementations. SQL drift represents a fundamental shift in how SQL behaves across different platforms and versions. Whether during on-premises to cloud migrations, transitions to managed services, database vendor switches, or even routine version upgrades, SQL drift presents a critical consideration for data-driven applications.

SQL drift frequently occurs during:

  • On-premises to cloud migrations
  • Cloud to managed service transitions
  • Cross-product migrations (e.g., switching database vendors)
  • Database version upgrades
  • Platform modernization efforts

The implications of SQL drift can be significant, leading to application instability, increased operational costs, and delayed migration timelines. The impact often extends to compromised data quality and results in a degraded user experience as systems become less reliable and responsive. Successfully managing SQL drift involves four key stages:

  1. Identification
  2. Prioritization
  3. Correction
  4. Validation

Identification is the critical first step in managing SQL drift, focusing on systematically discovering potential issues. This phase involves detecting SQL statements that may behave differently in the new environment, analyzing syntax compatibility across platforms, establishing performance baselines, and validating data outputs to ensure consistency.
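
As a minimal sketch of this discovery step, assuming performance_schema is enabled, MySQL's statement digest summary can surface the most frequently executed statements, and any that already raise errors or warnings, to build the candidate list:

SELECT DIGEST_TEXT, COUNT_STAR, SUM_ERRORS, SUM_WARNINGS
FROM performance_schema.events_statements_summary_by_digest
ORDER BY COUNT_STAR DESC
LIMIT 20;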

Prioritization involves evaluating SQL drift issues based on business impact, risk assessment, resource allocation, and migration scheduling to determine the optimal order for resolution.

Correction addresses SQL drift through code remediation, performance optimization, syntax updates, and developing alternative solutions when necessary.

Validation confirms SQL drift corrections through comprehensive testing, performance verification against established baselines, and data integrity checks to ensure the corrected SQL maintains its intended functionality.

An effective way to demonstrate the impact of SQL drift is by using a sample collection of SQL statements executed across different versions of MySQL. The End of Life (EOL) for MySQL 5.7, coupled with AWS RDS and AWS RDS Aurora beginning extended support in 2024, has increased costs for organizations that are not proactive in managing database migrations. This situation is particularly common in development-focused teams that lack dedicated architecture and operations resources.

A MySQL demonstration of SQL Drift

Using a subset of SQL statements executed in MySQL 5.7 and subsequent MySQL versions 8.0, 8.4, 9.0, and 9.1, Next BaseLine can examine the impact of SQL drift. This output shows the changing state of errors, deprecations, warnings, and notices for the 42 example SQL statements.

Example Output from Next BaseLine

In MySQL 5.7, the use of the keyword SQL_NO_CACHE in an SQL statement presents as a deprecation warning.

17 Deprecations
ID: 5, Hash: f31f2e99b2
  SQL: "SELECT SQL_NO_CACHE 1;"
  Deprecation: (1681) 'SQL_NO_CACHE' is deprecated and will be removed in a future release.

In MySQL 8.0, the MySQL Query Cache is removed; however, the use of SQL_NO_CACHE in SQL statements is still valid. Even in the next GA version, 8.4, this SQL keyword is still on the deprecated list, and it continues to be deprecated in the current 9.1 innovation release.

Different examples of deprecated functions are ENCRYPT and DES_ENCRYPT.

ID: 17, Hash: 947fcef53a
  SQL: "SELECT ENCRYPT('BaseLine',1);"
  Deprecation: (1287) 'ENCRYPT' is deprecated and will be removed in a future release. Please use AES_ENCRYPT instead
ID: 18, Hash: 364c0ffbf4
  SQL: "SELECT DES_ENCRYPT('BaseLine');"
  Deprecation: (1287) 'DES_ENCRYPT' is deprecated and will be removed in a future release. Please use AES_ENCRYPT instead

In MySQL 8.0, these SQL statements produce a hard error. They actually present as routines missing from the current schema rather than a “FUNCTION does not exist” error. (More on this later.)

ID: 17, Hash: 947fcef53a
  SQL: "SELECT ENCRYPT('BaseLine',1);"
  Error 1370 (42000): execute command denied to user 'nextbaseline'@'%' for routine 'airport.ENCRYPT'
ID: 18, Hash: 364c0ffbf4
  SQL: "SELECT DES_ENCRYPT('BaseLine');"
  Error 1370 (42000): execute command denied to user 'nextbaseline'@'%' for routine 'airport.DES_ENCRYPT'
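
As the 5.7 deprecation messages suggest, AES_ENCRYPT is the replacement. A minimal sketch that runs on both 5.7 and 8.0 (the key string here is purely illustrative):

SELECT HEX(AES_ENCRYPT('BaseLine', 'example-key-only'));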

Here are some example GIS SQL statements that present as deprecated in MySQL 5.7; note that each carries a different warning code.

ID: 19, Hash: f319748e0c
  SQL: "SELECT CONTAINS(ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'), ST_GeomFromText('POINT(5 5)'));"
  Deprecation: (1287) 'CONTAINS' is deprecated and will be removed in a future release. Please use MBRCONTAINS instead
ID: 20, Hash: d686267b19
  SQL: "SELECT ST_GeomFromWKB(Point(0, 0));"
  Deprecation: (3195) st_geometryfromwkb(geometry) is deprecated and will be replaced by st_srid(geometry, 0) in a future version. Use st_geometryfromwkb(st_aswkb(geometry), 0) instead.

In MySQL 8.0+, these two deprecated statements produce different error messages.

ID: 19, Hash: f319748e0c
  SQL: "SELECT CONTAINS(ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'), ST_GeomFromText('POINT(5 5)'));"
  Error 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '(ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'), ST_GeomFromText('POI' at line 1
ID: 20, Hash: d686267b19
  SQL: "SELECT ST_GeomFromWKB(Point(0, 0));"
  Error 3037 (22023): Invalid GIS data provided to function st_geomfromwkb.
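
For reference, the replacements named in the 5.7 deprecation messages should run unchanged on both 5.7 and 8.0+; a quick sketch:

SELECT MBRCONTAINS(ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'), ST_GeomFromText('POINT(5 5)'));
SELECT ST_GeomFromWKB(ST_AsWKB(Point(0, 0)), 0);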

Migrating a WordPress site

A more realistic example involves taking a production workload, such as WordPress running on a self-hosted MySQL 5.7 server, and assessing the potential impact of switching to MySQL 8.0 without upgrading the application (not a recommended approach). We have collected representative production SQL statements for this WordPress setup, referred to as a BaseLine.

After collecting SQL traffic and testing this workload against a MySQL 5.7 environment, previously unnoticed SQL warnings were highlighted for the team.

When executed against an upgraded MySQL 8.0 instance, problematic SQL statements were immediately identified. For a larger, more complex product, this process would help prioritize where resources are most needed.

A modern cloud database implementation

Finally, let’s consider TiDB from PingCap as an example of validating your application with a cloud implementation. Using the same small set of 42 SQL statements, TiDB has taken a proactive approach by entirely eliminating warnings in their MySQL protocol. In TiDB, SQL statements are now either valid SQL syntax or produce a hard error.

What was a deprecation for ENCRYPT is now a hard error. A more correct error message is also provided: ‘FUNCTION does not exist’.

ID: 17, Hash: 947fcef53a
  SQL: "SELECT ENCRYPT('BaseLine',1);"
  Error 1305 (42000): FUNCTION ENCRYPT does not exist
ID: 18, Hash: 364c0ffbf4
  SQL: "SELECT DES_ENCRYPT('BaseLine');"
  Error 1305 (42000): FUNCTION DES_ENCRYPT does not exist

In MySQL 5.7, ENCODE was deprecated and in MySQL 8.0+ it was removed. In TiDB, it is a valid function.

TiDB also produces some interesting artifacts in error messages for SQL statements, messages not seen with MySQL. An example is Error 1235 ... has only noop implementation in tidb now .... The message, however, shows that a setting (tidb_enable_noop_functions) can change how these SQL statements are handled.

ID: 14, Hash: dbcb4b05a2
  SQL: "SELECT table_name, count(*) FROM information_schema.tables GROUP BY table_name ASC;"
  Error 1235 (42000): function GROUP BY expr ASC|DESC has only noop implementation in tidb now, use tidb_enable_noop_functions to enable these functions
...
ID: 26, Hash: 7369c77d51
  SQL: "SELECT SQL_CALC_FOUND_ROWS * FROM information_schema.schemata;"
  Error 1235 (42000): function SQL_CALC_FOUND_ROWS has only noop implementation in tidb now, use tidb_enable_noop_functions to enable these functions

Even during development, an unintended bug in early testing resulted in an interesting error from TiDB.

ID: 31, Hash: 9cae50cbfc
  SQL: "SELECT DATE('2024-01-01   10:00:00'); /* Example of bad data causing warning */SELECT 'abc' AS full;"
  Error 8130 (HY000): client has multi-statement capability disabled. Run SET GLOBAL tidb_multi_statement_mode='ON' after you understand the security risk

Conclusion

Next BaseLine is now available in limited beta. Eliminate the uncertainty around “Will the migration work?” by performing an independent risk assessment of your product in a migrated database environment before committing to ad-hoc engineering efforts. If you’re interested in seeing a demo with your own SQL workload, you can register here.

Next BaseLine currently supports MySQL, PostgreSQL, Oracle, and SQL Server RDBMS products, covering both self-hosted and cloud-managed implementations across AWS, GCP, Azure, and Alibaba. It supports multiple MySQL- and PostgreSQL-compatible databases, including TiDB, SingleStore, Neon Serverless, Nile, ElephantSQL, TimeScale, and more. Additional compatibility is available for Snowflake, ClickHouse, and DuckDB.

RDS MySQL Aurora 3.07.0 is unusable for upgrades

Yesterday I detailed an incompatible breakage with RDS MySQL Aurora 3.06.0, and one option stated was to upgrade to the just-released 3.07.0.

Turns out that does not work. It is not possible to upgrade any version of AWS RDS MySQL Aurora 3.x to 3.07.0, making this release effectively useless.

3.06.0 to 3.07.0 fails

$ aws rds modify-db-cluster --db-cluster-identifier $CLUSTER_ID --engine-version 8.0.mysql_aurora.3.07.0 --apply-immediately

An error occurred (InvalidParameterCombination) when calling the ModifyDBCluster operation: Cannot upgrade aurora-mysql from 8.0.mysql_aurora.3.06.0 to 8.0.mysql_aurora.3.07.0

3.06.0 to 3.06.1 succeeds

Sometimes you need to be on the current point release of a prior version.

3.06.1 to 3.07.0 fails

$ aws rds modify-db-cluster --db-cluster-identifier $CLUSTER_ID --engine-version 8.0.mysql_aurora.3.07.0 --apply-immediately

An error occurred (InvalidParameterCombination) when calling the ModifyDBCluster operation: Cannot upgrade aurora-mysql from 8.0.mysql_aurora.3.06.1 to 8.0.mysql_aurora.3.07.0

There is no upgrade path

You can look at all valid ValidUpgradeTarget entries for all versions. There is in fact no version that can upgrade to AWS RDS Aurora MySQL 3.07.0.
It seems like a common test pattern was overlooked.

$ aws rds describe-db-engine-versions --engine aurora-mysql

...

{
            "Engine": "aurora-mysql",
            "Status": "available",
            "DBParameterGroupFamily": "aurora-mysql8.0",
            "SupportsLogExportsToCloudwatchLogs": true,
            "SupportsReadReplica": false,
            "DBEngineDescription": "Aurora MySQL",
            "SupportedFeatureNames": [],
            "SupportedEngineModes": [
                "provisioned"
            ],
            "SupportsGlobalDatabases": true,
            "SupportsParallelQuery": true,
            "EngineVersion": "8.0.mysql_aurora.3.04.1",
            "DBEngineVersionDescription": "Aurora MySQL 3.04.1 (compatible with MySQL 8.0.28)",
            "ExportableLogTypes": [
                "audit",
                "error",
                "general",
                "slowquery"
            ],
            "ValidUpgradeTarget": [
                {
                    "Engine": "aurora-mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": false,
                    "Description": "Aurora MySQL 3.04.2 (compatible with MySQL 8.0.28)",
                    "EngineVersion": "8.0.mysql_aurora.3.04.2"
                },
                {
                    "Engine": "aurora-mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": false,
                    "Description": "Aurora MySQL 3.05.2 (compatible with MySQL 8.0.32)",
                    "EngineVersion": "8.0.mysql_aurora.3.05.2"
                },
                {
                    "Engine": "aurora-mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": false,
                    "Description": "Aurora MySQL 3.06.0 (compatible with MySQL 8.0.34)",
                    "EngineVersion": "8.0.mysql_aurora.3.06.0"
                },
                {
                    "Engine": "aurora-mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": false,
                    "Description": "Aurora MySQL 3.06.1 (compatible with MySQL 8.0.34)",
                    "EngineVersion": "8.0.mysql_aurora.3.06.1"
                }
            ]
        },

Database testing for all version changes (including minor versions)

We know that SQL statement compatibility can change with major database version upgrades and that you should adequately test for them. But what about minor version upgrades?

It is dangerous to assume that your existing SQL statements work with a minor update, especially when using an augmented version of an open-source database from a cloud provider that may not be as transparent about all changes.

While I have always considered reading the release notes an essential architectural principle over the decades, many organizations skip this step and get caught off guard when there are no dedicated DBAs and architects in the engineering workforce.

Real-world examples of minor version upgrade issues

Here are two real-world situations common in the AWS RDS ecosystem using MySQL.

  1. You are an organization that uses RDS Aurora MySQL for its production systems, and you upgrade one minor version at a time. A diligent approach is to be one minor version behind unless a known bug is fixed in a newer version you depend on.
  2. You are an organization that, to save costs across a comprehensive engineering team, uses AWS RDS MySQL (not Aurora) for developer and some testing environments.

I’ve simplified a real-world example to a simple SQL statement and combined these two separate use cases into one simulated situation for demonstration purposes.

mysql> SELECT content_type FROM reserved2;
Empty set (0.00 sec)

mysql> SELECT VERSION(), @@aurora_version;
+-----------+------------------+
| VERSION() | @@aurora_version |
+-----------+------------------+
| 8.0.28    | 3.04.2           |
+-----------+------------------+

mysql> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 8.0.34    |
+-----------+
1 row in set (0.00 sec)

This is a simple enough query. It runs in AWS RDS Aurora MySQL 3.04.2 (which is the current Aurora MySQL long-term support (LTS) release). This release is based on MySQL 8.0.28, which, FWIW, is no longer a supported AWS RDS MySQL version; the minimum is now 8.0.32 (Supported MySQL minor versions on Amazon RDS).

It also runs in AWS RDS MySQL 8.0.34, which is, for example, the version your developer setup uses.

An AWS RDS MySQL Aurora minor version upgrade

You decide to upgrade from Aurora 3.04.x/3.05.x to 3.06.x. This Aurora version is actually based on MySQL 8.0.34 (the version you just tested in RDS). Without adequate due diligence, you roll out to production, only to find after the fact that this SQL statement (this is one simplified example for demonstration purposes) now breaks for no apparent reason.

mysql> select content_type from reserved2;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'content_type from reserved2' at line 1

mysql> SELECT VERSION(), @@aurora_version;
+-----------+------------------+
| VERSION() | @@aurora_version |
+-----------+------------------+
| 8.0.34    | 3.06.0           |
+-----------+------------------+

Now you need to investigate the problem, which can take hours, even days, of resource time and a lot of shaking heads before realizing it has nothing to do with your application code but with the minor version upgrade, which you simply cannot roll back. See Risks from auto upgrades with managed database services for some interesting facts.

Wait, what just happened?

If you performed this upgrade to the latest AWS RDS Aurora MySQL 3.06.0 version sometime after the release on 3/7/24 and before 6/4/24, a 3-month period, you are left with one choice. You have to make application code changes to address the breakage.

How many man-hours/man-days does this take? If you upgraded to this version in the past two weeks, technically you have a second choice. You can go to the most current version, 3.07.0, but you have already spent time testing and deploying 3.06.0, which you need to re-test, then roll out in non-production accounts, and then roll out to production. How many man-days of work is this?

It may be hard to justify the cost of automated testing until you uncover a situation like this one; however, it can easily be avoided in the future.

So why did this happen?

Let’s look deeper at the fine print.

RDS Aurora MySQL 3.06.0

Aurora MySQL version 3.06.0 supports Amazon Bedrock integration and introduces the new reserved keywords accept, aws_bedrock_invoke_model, aws_sagemaker_invoke_endpoint, content_type, and timeout_ms. Check the object definitions for the usage of the new reserved keywords before upgrading to version 3.06.0. To mitigate the conflict with the new reserved keywords, quote the reserved keywords used in the object definitions. For more information on the Amazon Bedrock integration and handling the reserved keywords, see What is Amazon Bedrock? in the Amazon Aurora User Guide. For additional information, see Keywords and Reserved Words, The INFORMATION_SCHEMA KEYWORDS Table, and Schema Object Names in the MySQL documentation.

From AWS RDS Aurora MySQL 3.06.0 release notes (3/7/24).

While it is less likely that you would name a column aws_bedrock_invoke_model, content_type and timeout_ms are common column names.
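
A minimal pre-upgrade check, a sketch using information_schema (adjust the schema filter to your own databases), to find column names that collide with the new reserved keywords:

SELECT table_schema, table_name, column_name
FROM information_schema.columns
WHERE column_name IN ('accept', 'aws_bedrock_invoke_model', 'aws_sagemaker_invoke_endpoint', 'content_type', 'timeout_ms')
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');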

RDS Aurora MySQL 3.07.0

Aurora MySQL version 3.06.0 added support for Amazon Bedrock integration. As part of this, new reserved keywords (accept, aws_bedrock_invoke_model, aws_sagemaker_invoke_endpoint, content_type, and timeout_ms) were added. In Aurora MySQL version 3.07.0, these keywords have been changed to nonreserved keywords, which are permitted as identifiers without quoting. For more information on how MySQL handles reserved and nonreserved keywords, see Keywords and reserved words in the MySQL documentation.

From the AWS RDS Aurora MySQL 3.07.0 release notes (6/4/24). Clearly someone at AWS saw the breaking change, and it was reverted. While it’s possible many customers may never hit this situation, this is one specific use case.

Conclusion

The moral of the database story here is Be Prepared.

You should always be prepared for future breaking compatibility. You should test with a regular software upgrade cadence and leverage automation as much as possible.

Next BaseLine is a software product that automates testing for many use cases, including this simple SQL compatibility issue. Adding it to your CI/CD pipeline can help identify risk in all SQL database access, including new engineering software releases or infrastructure updates. The product can be implemented in a few hours and costs significantly less than the large amount of time lost in this one realistic situation.

Next BaseLine - Helping to create a better and faster next version of your data-driven product

Footnote

This example was not uncovered from a customer situation. It was uncovered and used as a demonstration because I read the release notes.

Test Case


SELECT VERSION();
SELECT VERSION(), @@aurora_version; /* No way to comment out the !Aurora example */
CREATE SCHEMA IF NOT EXISTS test;
USE test;
CREATE TABLE reserved1(id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, accept CHAR(1) NOT NULL DEFAULT 'N');
CREATE TABLE reserved2(id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, content_type VARCHAR(10) NULL DEFAULT 'text/plain');
CREATE TABLE reserved3(id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, timeout_ms INT UNSIGNED NOT NULL);
SELECT accept FROM reserved1;
SELECT content_type FROM reserved2;
SELECT timeout_ms FROM reserved3;
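
If renaming the columns is not practical, quoting the identifiers with backticks (as the 3.06.0 release notes recommend) keeps these statements working; a sketch against the test case above:

SELECT `accept` FROM reserved1;
SELECT `content_type` FROM reserved2;
SELECT `timeout_ms` FROM reserved3;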

Are you patching your AWS RDS MySQL 5.7 EOL databases?

Recently, I noticed a second AWS RDS MySQL 5.7 version available, 5.7.44-rds.20240408. Curious what this was, as 5.7.44 is the only RDS 5.7.x EOL version available, I launched an instance to rule out errant metadata.

Today I noticed another version, 5.7.44-rds.20240529. I do not run a MySQL 5.7 AWS RDS instance or pay the AWS Extended Support tax, so I would not receive any notices or recommendations that customers may be receiving.

AWS RDS MySQL 5.7 EOL notifications (image generated by ChatGPT; mistakes left as a reminder genAI is not there yet for text).

I needed to do some searching before I found a reference here and then this announcement, which mentions 5.7.44-RDS.20240408 as a vulnerability fix. This document does not mention the second version; however, based on the dates, this was released 18 days ago? There is also no what’s-new announcement of this second version. With more searching I also came across Extended Support Version Standards, which describes this new version format but is not linked from the extended support page.

Are AWS customers being informed that they need to continue with a minor version upgrade cadence as they would normally perform? Is it now more important because only the more severe vulnerabilities will be backported? CVE-2024-20963 only mentions MySQL 8.0 and above, but that is because Oracle has officially marked 5.7 as EOL. This does align with the AWS Extended Support commitment to keep patching MySQL 5.7.

If you are a customer that has auto-minor upgrades enabled for MySQL 5.7, be aware of the risks from auto upgrades with managed database services.

If you are running AWS RDS PostgreSQL 11 (11.22), the same need applies; there are also version updates. These Postgres docs also show extension fixes for 20240529 but no mention of a vulnerability. This may be the trigger for the identically named RDS MySQL version, with no actual MySQL modifications?

Why are you still running MySQL 5.7?

Next BaseLine identifies and categorizes the risks for SQL migration from MySQL 5.7 to MySQL 8.0 and can accelerate moving off this EOL software version. Stop paying the AWS Extended Support Tax. Get started at https://app.kanangra.io/.

Current AWS RDS MySQL versions

$ aws rds describe-db-engine-versions --engine mysql
{
    "DBEngineVersions": [
        {
            "Engine": "mysql",
            "Status": "available",
            "DBParameterGroupFamily": "mysql5.7",
            "SupportsLogExportsToCloudwatchLogs": true,
            "SupportsReadReplica": true,
            "DBEngineDescription": "MySQL Community Edition",
            "SupportedFeatureNames": [],
            "SupportsGlobalDatabases": false,
            "SupportsParallelQuery": false,
            "EngineVersion": "5.7.44",
            "DBEngineVersionDescription": "MySQL 5.7.44",
            "ExportableLogTypes": [
                "audit",
                "error",
                "general",
                "slowquery"
            ],
            "ValidUpgradeTarget": [
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": false,
                    "Description": "MySQL 5.7.44-rds.20240408",
                    "EngineVersion": "5.7.44-rds.20240408"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": false,
                    "Description": "MySQL 5.7.44-rds.20240529",
                    "EngineVersion": "5.7.44-rds.20240529"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.28",
                    "EngineVersion": "8.0.28"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.32",
                    "EngineVersion": "8.0.32"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.33",
                    "EngineVersion": "8.0.33"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.34",
                    "EngineVersion": "8.0.34"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.35",
                    "EngineVersion": "8.0.35"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.36",
                    "EngineVersion": "8.0.36"
                }
            ]
        },
        {
            "Engine": "mysql",
            "Status": "available",
            "DBParameterGroupFamily": "mysql5.7",
            "SupportsLogExportsToCloudwatchLogs": true,
            "SupportsReadReplica": true,
            "DBEngineDescription": "MySQL Community Edition",
            "SupportedFeatureNames": [],
            "SupportsGlobalDatabases": false,
            "SupportsParallelQuery": false,
            "EngineVersion": "5.7.44-rds.20240408",
            "DBEngineVersionDescription": "MySQL 5.7.44-rds.20240408",
            "ExportableLogTypes": [
                "audit",
                "error",
                "general",
                "slowquery"
            ],
            "ValidUpgradeTarget": [
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": false,
                    "Description": "MySQL 5.7.44-rds.20240529",
                    "EngineVersion": "5.7.44-rds.20240529"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.28",
                    "EngineVersion": "8.0.28"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.32",
                    "EngineVersion": "8.0.32"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.33",
                    "EngineVersion": "8.0.33"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.34",
                    "EngineVersion": "8.0.34"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.35",
                    "EngineVersion": "8.0.35"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.36",
                    "EngineVersion": "8.0.36"
                }
            ]
        },
        {
            "Engine": "mysql",
            "Status": "available",
            "DBParameterGroupFamily": "mysql5.7",
            "SupportsLogExportsToCloudwatchLogs": true,
            "SupportsReadReplica": true,
            "DBEngineDescription": "MySQL Community Edition",
            "SupportedFeatureNames": [],
            "SupportsGlobalDatabases": false,
            "SupportsParallelQuery": false,
            "EngineVersion": "5.7.44-rds.20240529",
            "DBEngineVersionDescription": "MySQL 5.7.44-rds.20240529",
            "ExportableLogTypes": [
                "audit",
                "error",
                "general",
                "slowquery"
            ],
            "ValidUpgradeTarget": [
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.28",
                    "EngineVersion": "8.0.28"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.32",
                    "EngineVersion": "8.0.32"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.33",
                    "EngineVersion": "8.0.33"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.34",
                    "EngineVersion": "8.0.34"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.35",
                    "EngineVersion": "8.0.35"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": true,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.36",
                    "EngineVersion": "8.0.36"
                }
            ]
        },
        {
            "Engine": "mysql",
            "Status": "available",
            "DBParameterGroupFamily": "mysql8.0",
            "SupportsLogExportsToCloudwatchLogs": true,
            "SupportsReadReplica": true,
            "DBEngineDescription": "MySQL Community Edition",
            "SupportedFeatureNames": [],
            "SupportsGlobalDatabases": false,
            "SupportsParallelQuery": false,
            "EngineVersion": "8.0.32",
            "DBEngineVersionDescription": "MySQL 8.0.32",
            "ExportableLogTypes": [
                "audit",
                "error",
                "general",
                "slowquery"
            ],
            "ValidUpgradeTarget": [
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.33",
                    "EngineVersion": "8.0.33"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.34",
                    "EngineVersion": "8.0.34"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": true,
                    "Description": "MySQL 8.0.35",
                    "EngineVersion": "8.0.35"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.36",
                    "EngineVersion": "8.0.36"
                }
            ]
        },
        {
            "Engine": "mysql",
            "Status": "available",
            "DBParameterGroupFamily": "mysql8.0",
            "SupportsLogExportsToCloudwatchLogs": true,
            "SupportsReadReplica": true,
            "DBEngineDescription": "MySQL Community Edition",
            "SupportedFeatureNames": [],
            "SupportsGlobalDatabases": false,
            "SupportsParallelQuery": false,
            "EngineVersion": "8.0.33",
            "DBEngineVersionDescription": "MySQL 8.0.33",
            "ExportableLogTypes": [
                "audit",
                "error",
                "general",
                "slowquery"
            ],
            "ValidUpgradeTarget": [
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.34",
                    "EngineVersion": "8.0.34"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": true,
                    "Description": "MySQL 8.0.35",
                    "EngineVersion": "8.0.35"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.36",
                    "EngineVersion": "8.0.36"
                }
            ]
        },
        {
            "Engine": "mysql",
            "Status": "available",
            "DBParameterGroupFamily": "mysql8.0",
            "SupportsLogExportsToCloudwatchLogs": true,
            "SupportsReadReplica": true,
            "DBEngineDescription": "MySQL Community Edition",
            "SupportedFeatureNames": [],
            "SupportsGlobalDatabases": false,
            "SupportsParallelQuery": false,
            "EngineVersion": "8.0.34",
            "DBEngineVersionDescription": "MySQL 8.0.34",
            "ExportableLogTypes": [
                "audit",
                "error",
                "general",
                "slowquery"
            ],
            "ValidUpgradeTarget": [
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": true,
                    "Description": "MySQL 8.0.35",
                    "EngineVersion": "8.0.35"
                },
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.36",
                    "EngineVersion": "8.0.36"
                }
            ]
        },
        {
            "Engine": "mysql",
            "Status": "available",
            "DBParameterGroupFamily": "mysql8.0",
            "SupportsLogExportsToCloudwatchLogs": true,
            "SupportsReadReplica": true,
            "DBEngineDescription": "MySQL Community Edition",
            "SupportedFeatureNames": [],
            "SupportsGlobalDatabases": false,
            "SupportsParallelQuery": false,
            "EngineVersion": "8.0.35",
            "DBEngineVersionDescription": "MySQL 8.0.35",
            "ExportableLogTypes": [
                "audit",
                "error",
                "general",
                "slowquery"
            ],
            "ValidUpgradeTarget": [
                {
                    "Engine": "mysql",
                    "IsMajorVersionUpgrade": false,
                    "AutoUpgrade": false,
                    "Description": "MySQL 8.0.36",
                    "EngineVersion": "8.0.36"
                }
            ]
        },
        {
            "Engine": "mysql",
            "Status": "available",
            "DBParameterGroupFamily": "mysql8.0",
            "SupportsLogExportsToCloudwatchLogs": true,
            "SupportsReadReplica": true,
            "DBEngineDescription": "MySQL Community Edition",
            "SupportedFeatureNames": [],
            "SupportsGlobalDatabases": false,
            "SupportsParallelQuery": false,
            "EngineVersion": "8.0.36",
            "DBEngineVersionDescription": "MySQL 8.0.36",
            "ExportableLogTypes": [
                "audit",
                "error",
                "general",
                "slowquery"
            ],
            "ValidUpgradeTarget": []
        }
    ]
}

The curse of MySQL warnings

MySQL warnings are an anti-pattern when it comes to maintaining data integrity. When the information retrieved from a database does not match what was entered, and this is not identified immediately, data can be permanently lost.

For several decades, until the most recent versions, MySQL by default enabled you to insert incorrect data, insert data that was then truncated, or follow other patterns that resulted in failed data integrity. Very few applications considered handling warnings as errors, and there is a generation of software products that have never informed their developers that warnings were occurring.

The simplest example is:

CREATE SCHEMA IF NOT EXISTS warnings;
USE warnings;

CREATE TABLE short_name(
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(20) NOT NULL,
  PRIMARY KEY(id)
);

INSERT INTO short_name (name) VALUES ('This name is too long and will get truncated');
ERROR 1406 (22001): Data too long for column 'name' at row 1

This is what you would expect to happen. In many, many applications, IT DOES NOT.

For almost 20 years the default setting was to support possible data corruption

If you used an older version without configuring a stricter SQL_MODE than the default, you end up with:

INSERT INTO short_name (name) VALUES ('This Name is too long and will get truncated');
Query OK, 1 row affected, 1 warning (0.00 sec)

SELECT * FROM short_name;
+----+----------------------+
| id | name                 |
+----+----------------------+
|  1 | This name is too lon |
+----+----------------------+
1 row in set (0.00 sec)

Only by running SHOW WARNINGS immediately after the actual SQL statement would you know. There is no other way to find this information in any logs, and no way to retrieve it later.

mysql> SHOW WARNINGS;
+---------+------+-------------------------------------------+
| Level   | Code | Message                                   |
+---------+------+-------------------------------------------+
| Warning | 1265 | Data truncated for column 'name' at row 1 |
+---------+------+-------------------------------------------+
1 row in set (0.00 sec)

Numerous other examples can shock a customer when, after some time, expected data in a production system is lost and unretrievable.

If you came from a stricter RDBMS background, or you tuned your MySQL installation, or you uncovered this and many other poor defaults, you would have improved your data integrity with an improved SQL_MODE.
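
A minimal sketch of tightening the mode at the session level (the exact combination of modes is a choice; set it globally and in your configuration file once verified against your workload):

SET SESSION sql_mode = 'STRICT_ALL_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_ZERO_DATE,NO_ZERO_IN_DATE,NO_ENGINE_SUBSTITUTION';
SELECT @@SESSION.sql_mode;
INSERT INTO short_name (name) VALUES ('This name is too long and will get truncated');
-- With a strict mode, this statement now fails with ERROR 1406 instead of silently truncating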

So MySQL warnings are bad? No, they are ideal when used appropriately. However, the next critical dilemma occurs.

Warnings are valuable when used to identify important characteristics of an SQL statement that a developer or database administrator should be aware of. However, the only way to retrieve these warnings is from the application making the connection to the database at each statement, and generally, these warnings are just lost.

Here are some examples of warnings, such as deprecation notices, that the engineering team should be aware of and that are important for production database upgrades.

SELECT JSON_MERGE('["a"]','["b"]'); 
Warning (Code 1287): 'JSON_MERGE' is deprecated and will be removed in a future release. Please use JSON_MERGE_PRESERVE/JSON_MERGE_PATCH instead

SELECT ST_GeomFromWKB(Point(0, 0));
Warning: (3195) st_geometryfromwkb(geometry) is deprecated and will be replaced by st_srid(geometry, 0) in a future version. Use st_geometryfromwkb(st_aswkb(geometry), 0) instead.

SELECT DATE('2024-01-01 10:00:00') 
Warning (Code 4096): Delimiter ' ' in position 11 in datetime value '2024-01-01 10:00:00' at row 1 is superfluous and is deprecated. Please remove.

SELECT BINARY 'a' = 'A' 
Warning (Code 1287): 'BINARY expr' is deprecated and will be removed in a future release. Please use CAST instead 

You definitely want to know about these: collect them (which is hard), add them to your backlog, and don’t leave them until it’s too late and you are facing a critical last-minute database upgrade.

There are also warnings that should be collected and used for performance verification, which apply to running systems. I wanted to show one specific example uncovered during testing of a MySQL upgrade to version 8.0.

Warning (Code 3170): Memory capacity of 8388608 bytes for 'range_optimizer_max_mem_size' exceeded. Range optimization was not done for this query.

In fact, this warning occurs in MySQL 5.7, but the customer never knew because they did not look at the warnings. How many other SQL statements in your application produce warnings now? How can you find this out?

It was rather easy to create a reproducible test case, but what now?

  • Do you set range_optimizer_max_mem_size=0? (see the sketch after this list)
  • Do you set to the value you need, which you can identify with SELECT * FROM performance_schema.memory_summary_by_thread_by_event_name WHERE thread_id=PS_CURRENT_THREAD_ID() AND event_name='memory/sql/test_quick_select'\G
  • Do you need to modify your optimizer_switch settings?
  • Do you try something else?
  • Do you refactor your application?
  • Do you just leave it as is?
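
As a sketch of the first two options (the 16MB value is purely illustrative; 0 disables the limit for this variable):

SHOW GLOBAL VARIABLES LIKE 'range_optimizer_max_mem_size';
-- Option 1: remove the limit entirely (0 = unlimited)
SET GLOBAL range_optimizer_max_mem_size = 0;
-- Option 2: raise the limit to a measured value, e.g. 16MB
SET GLOBAL range_optimizer_max_mem_size = 16777216;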

When you want to consider several different options, which one works best for this query? What about the impact on your entire production workload? Knowing statistically which is the best choice for your full workload and under various conditions is the optimal output, but how?

Next BaseLine was built to perform experiments comparing changes to your data, configuration, and infrastructure to validate that the next version of your product statistically performs better than your current version across your entire application at different workloads.

Next BaseLine also provides numerous benefits for a major database upgrade, so I’ve focused on getting these capabilities to customers more quickly to save money. It detects SQL statements that produce errors in the next MySQL version, enabling you to categorize and prioritize areas of your application that must be corrected. It also captures important information about the performance and quality of the data from your MySQL queries, which can help identify the most critical aspects of your application in which to invest engineering time and mitigate risk in your database upgrade plan. It can also collect warning messages such as those discussed here when considering migrating from MySQL 5.7 to MySQL 8.0, or it can simply find them in your current application.

What is your pain point with MySQL database upgrades? What are you doing right now to help reduce this additional budget spend? Join our private beta program now to find out more.

Next BaseLine

Helping to create a better and faster next version of your data-driven product

What happened to Digital Tech Trek Digest?

I started 2024 with several goals. The first goal was to iterate over some weekend project ideas and actually deploy them. These were never designed to make money or have widespread value; they were an exercise in iterating over an idea in preparation for a larger project. This led to InstanceHunt, which turned out to be very useful and led to a few interesting leads, including MicroConf. I also have a number of future features on my to-do list. I have always wanted to keep product configs, so I dropped Configs Info the next month as a weekend project. Likewise, there are plenty of additional feature ideas.

A second goal was to read more outside my comfort zone. As part of keeping notes, I started publishing the weekly Digital Tech Trek Digest for the first three months of 2024. It was not part of the original goal to share the details; it just worked out that way. I still read many of my sources regularly, including TLDR, Forbes Daily, ThoughtWorks Podcasts, Daily Dose of Data Science, and BoringCashCow, as well as Scientific American Technology, Fareed’s Global Briefing, Software Design: Tidy First? by Kent Beck, Last Week in AWS, Micro Newsletter, and more. I guess my draft of lucky #13 never made it to publication. However, my writing time went toward preparing for my next goal of 2024.

My third goal for 2024 was to pursue taking an idea into an actual product and to attempt founder-led sales. I expected this would take most of 2024 before I had a deliverable. Of the 5-7 ideas I considered, I decided on one and started creating a POC, creating some slides, talking with people, and spending a lot of time on how I could market and sell my startup idea. On March 26, I decided to take my large startup idea and narrow the focus to a specific and immediate problem that could be solved; however, this is not the goal of my startup. That problem was helping customers stop paying the AWS tax of RDS Extended Support.

For the past two months, I’ve been striving to accelerate creating a SaaS solution: learning how to rewrite all my code in Go, thinking about how to identify my Ideal Customer Profile (ICP), determining a Monthly Recurring Revenue (MRR) figure, and talking with prospects and potential partners. All this is also happening in May, which is surprisingly a month of way too many end-of-school events.

Stay tuned. News is coming soon.

Digital Tech Trek Digest [#Issue 2024.12]

Falsehoods programmers believe about time zones

If I told you there was a time zone offset 45 minutes past the hour, would you believe me? In a small section of Western Australia (around Eucla), there is. However, time zones (TZ) are way more complicated. If you have not missed a meeting due to a TZ mixup, you haven’t worked in a multi-national company or had a meeting in a country with more than one time zone.
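As a quick MySQL-flavored illustration (a minimal sketch, assuming the server's time zone tables have been loaded with mysql_tzinfo_to_sql), CONVERT_TZ shows how odd these offsets get; Australia/Eucla (+8:45) and Asia/Kathmandu (+5:45) are genuine tz database zones.

-- Requires the MySQL time zone tables to be loaded (mysql_tzinfo_to_sql).
-- Both zones are real tz database entries whose offsets are not whole hours.
SELECT CONVERT_TZ('2024-03-20 12:00:00', 'UTC', 'Australia/Eucla') AS eucla_plus_0845,
       CONVERT_TZ('2024-03-20 12:00:00', 'UTC', 'Asia/Kathmandu') AS kathmandu_plus_0545;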

I shared this article with several friends, and this response sums it up. “I think documenting failure is as important as celebrating success. Always share what you’ve learned. And adding in humor makes it human.”

PostgreSQL Schema Changes with pg_osc

While Online Schema Change (OSC) has been part of the MySQL ecosystem for decades, evolving from pt-osc (previously Maatkit) to gh-ost and now spirit, this is the first time I’ve heard of OSC for PostgreSQL. It likely has existed in some form for some time.

Source: LinkedIn

Pricing your product

One key takeaway from the MicroConf Founder-led Sales event was pricing. “Be transparent and provide a number” is basically the essence of what I noted. Well, I’m going to track sites that do not offer pricing to evaluate why. My first two entrants are Vanta and CultureAmp. Why is a price not offered? Is it a high entry point, too complex to articulate easily, a way to charge the customer what they are willing to pay, or perhaps it’s not credit-card worthy but a detailed commitment? As a child, I remember being told that if I had to ask the price when it was not visible, I could not afford it. Does this apply to SaaS?

Postgres is eating the database world

Extensions are a fundamental growth mechanism for PostgreSQL. Combined with the PostgreSQL protocol for TCP communications, a robust and growing ecosystem can be seen with PostgreSQL. Add the wisdom of moving to an annual release cadence, and these factors would seem to highlight that PostgreSQL is quickly outpacing MySQL in the RDBMS open-source ecosystem. This article, written by the creators of Pigsty, and other players including Tembo and Aiven reinforce the narrative.

Source: LinkedIn

Why should I not upload images of code/data/errors?

This post is a comment by Bill Karwin on Stop with the Video Documentation by Jon Sustar. It is the recipe that should be enshrined in the ticket support system of any company whenever a user tries to upload an image. When a text command is provided, and the response is not provided back as text, it’s hugely inefficient.

Source: LinkedIn

About “Digital Tech Trek Digest”

Most days, I take some time early in the morning to scan my inbox newsletters, the news, LinkedIn, or other sources to read something new about professional and personal topics of interest. I turn what I read into actionable notes in a short, committed time window, summarizing what I learned, what I should learn and use, or what is of random interest. And thus my Digital Tech Trek.

Some of my regular sources include TLDR, Forbes Daily, ThoughtWorks Podcasts, Daily Dose of Data Science and BoringCashCow. Also Scientific American Technology, Fareed’s Global Briefing, Software Design: Tidy First? by Kent Beck, Last Week in AWS, Micro Newsletter to name a few.

New Additions

I have added Building a boring, but wildly profitable, online business portfolio as a new source to review.

Digital Tech Trek Digest [#Issue 2024.11]

In his newsletter, solopreneur Ian Nuttal writes, “I sold my startup (again).”

In 4 months URL Monitor scaled far beyond what I expected:

550+ customers
2 million indexed pages
17 million pages monitored
$100k+ ARR

You can follow Ian on Twitter/X. (Will the world ever drop the word Twitter? I would say no.)

New AWS Console functionality

If you have ever tried to keep up with AWS News and Product Announcements even for a subset of products, let me know how. While I try to keep monitoring, sometimes you only see a new feature by accident. I am not a fan of using GUI interfaces. I’m all about the CLI and APIs. However, one must always spin up the AWS Console to look at what new blurb is being presented across the 10+ database products.

AWS has started offering more in-depth recommendations, but you need to install or update the AWS CLI to version 2.15+ to see them programmatically.

What is the doc format to use

I have moved to using Markdown (.md) for all of my repo documentation, but there are different Markdown variants. I was struck by the above AWS documentation using `.rst`, known as reStructuredText, for its documentation.

MicroConf Remote 8.0: Early Stage SaaS Sales!

I recently attended this event as I am an entrepreneur looking at how to price my product.

Founder-Led Sales Best Practices: Getting the 80/20 out of your sales efforts by Craig Hewitt

Selling with words: what early-stage startup founders tend to get wrong by Sam Howard. Several attendees, including myself, installed Hotjar following this presentation.

How to Build Scalable Founder-led Sales by Rachel Liaw
How to Build Your First Sales Process as a Technical Founder by Daniel Herbert.

I also got to speak to several founders 1:1 about everything from “I have an idea” to executing successful startups, including Sponsy (impressive logos) and PlaybookWriter.

About “Digital Tech Trek Digest”

Most days, I take some time early in the morning to scan my inbox newsletters, the news, LinkedIn, or other sources to read something new about professional and personal topics of interest. I turn what I read into actionable notes in a short, committed time window, summarizing what I learned, what I should learn and use, or what is of random interest. And thus my Digital Tech Trek.

Some of my regular sources include TLDR, Forbes Daily, ThoughtWorks Podcasts, Daily Dose of Data Science and BoringCashCow. Also Scientific American Technology, Fareed’s Global Briefing, Software Design: Tidy First? by Kent Beck, Last Week in AWS, Micro Newsletter to name a few.

Today’s interesting website experience

A nice 500 error page from GitHub

Digital Tech Trek Digest [#Issue 2024.10]

Google advances with vector search in MySQL, leapfrogging Oracle in LLM support

As the title states, GCP is the first MySQL-managed service to offer “vector” support. Clearly, “vector” is the buzzword of 2024 along with RAG; genAI and LLM are so 2023.

IMO, Oracle should just rename MySQL Heatwave to Heatwave. It would distinguish the product as unique, which it is.

4 Signs You Belong In a Startup Accelerator for SaaS Founders

YouTube

Accelerators help early-stage startups and enable businesses to grow faster, get mentorship, and obtain funding/equity, and they are generally industry specific. In this example, TinySeed focusses on B2B.

Some tips for considering an accelerator.

1. You know your numbers and you’re focussing on the right ones
– MRR
– Churn
– LTV or ACV – Lifetime Value, Annual Contract Value
– Avoid vanity/candy metrics (email subscribers, free trial users, unique visitors) without context on how they are growing paid users

2. You are experimenting, especially in marketing and sales
– You should always be tweaking and finding the right market.

3. You’re coachable
– You are going to use the resources provided and you are vulnerable (receiving a lot of feedback, lowering your defenses)
– You’re accountable to groups, e.g. masterminds
– Be open to criticism

4. You are fast and furious
– You have the drive and need to keep moving forward, never satisfied with the status quo
– You try new things
– Nothing is perfect at the beginning; experimentation is key

The Founder’s Guide to Stealth Startups

EverydaySpy

In this podcast, “The Diary of a CEO“, Steven Bartlett interviews Andrew Bustamante. Andrew is a former covert CIA intelligence officer and US Air Force combat veteran. He is the founder of EverydaySpy, an online education platform that teaches real-world international espionage techniques that can be used in everyday life.

Messaging builds narrative. Don’t use mass marketing via social media; believe in your brand.
In marketing, present a message crafted with emotion, with the response showing motivation.

Competition is for Losers with Peter Thiel (How to Start a Startup 2014: 5)

This is a very old presentation that was recently re-shared with me.

In this presentation, Thiel starts off with the concept of “avoid competition.”

Creating value is a very simple formula of two things: create $X of value for the world, and capture Y% of X, where X and Y are independent variables. For example, creating $1 billion of value and capturing 1% of it yields $10 million.

A big piece of a small pie can drastically affect profit margin. Compare all United States airline carriers combined with Google: Google has much smaller revenue, yet much higher value.

He goes on to talk about effectively two types of businesses: a competitive business or a monopoly; there is no in-between.

Are you talking about data the WRONG WAY?

Scott Taylor, a colleague I discovered lived in a neighboring town and whom I could meet in person after attending a virtual conference event, asks a very valid question about the importance of data management. He reiterates the “3Vs” of effective storytelling: Vocabulary, Voice, and Vision. You can discover a lot more information in his book Telling Your Data Story: Data Storytelling for Data Management. The art of effective data storytelling is something I practice daily. It is easy for a data specialist to have the data facts and visualize the data, but the art is being able to drive change by combining data, visualizations, and what I consider the most important component, narrative. I highly recommend Brent Dykes’ book of the same name, and when combined with Be Data Driven by Jordan Morrow you have a foundation of strategy when discussing data management with organizations that want to become data-driven. It is way more difficult to implement than any plan and strategy you may read and prepare for.

About “Digital Tech Trek Digest”

Most days, I take some time early in the morning to scan my inbox newsletters, the news, LinkedIn, or other sources to read something new about professional and personal topics of interest. I turn what I read into actionable notes in a short, committed time window, summarizing what I learned, what I should learn and use, or what is of random interest. And thus my Digital Tech Trek.

Some of my regular sources include TLDR, Forbes Daily, ThoughtWorks Podcasts, Daily Dose of Data Science and BoringCashCow. Also Scientific American Technology, Fareed’s Global Briefing, Software Design: Tidy First? by Kent Beck, Last Week in AWS, Micro Newsletter to name a few.

New Additions to my reading

SaaS Developer Community

Digital Tech Trek Digest [#Issue 2024.09]

As an entrepreneur, pricing is an important consideration in any evaluation, development, and customer testing. In How To Price A SaaS Product, we see different pricing strategies, cost-based pricing, competitor-based pricing, penetration pricing, value-based pricing, freemium pricing. None of these match what I am ultimately considering: consumption-based pricing. Pricing is critical to define the value proposition statement and determine the range of the total lifetime value (TLV). It can vary greatly for B2C, B2B, and B2B enterprise offerings. If we look at YCombinator https://www.ycombinator.com/library/6h-startup-pricing-101 a basic principle is determining the gap between price and cost. That is your margin and your incentive to sell, and you work with either cost-plus or value-based pricing. Starting with founder-led sales is difficult as you do not have the luxury of a dedicated and experienced head of sales to work on different models and guide a technical founder, even before you enter the minefield of enterprise sales with applicable bids, contract, and compliance complexities. I am drawn back to “Consumption-based pricing is a pricing model that charges customers based on their product or service usage. Consumption-based pricing calculates pricing based on usage volume rather than the number of users and is a popular pricing model for IT services, SaaS, and cloud computing and storage” Cite: Consumption-Based Pricing.

Moving a Billion Postgres Rows on a $100 Budget

I wrote recently about the 1 Billion Row Challenge (1BRC). This week, I found this article on the same number of rows with a different title. The objective was not performance; it was cost. PeerDB enables the efficient extraction of data from PostgreSQL into a data warehouse, such as BigQuery, ClickHouse or Snowflake. It was interesting to see Avro as the format used over, for example, Parquet. The product also offers different streaming modes, including log-based (CDC), cursor-based (timestamp or integer), and XMIN-based. I will need to do further research on this new term, XMIN-based.
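For anyone else new to the term: xmin is a PostgreSQL system column on every row that records the id of the inserting transaction, which is what an XMIN-based cursor relies on. A minimal sketch (the orders table is hypothetical, and the text/bigint cast is a common workaround for comparing the xid type):

-- PostgreSQL: xmin is a hidden system column you can select explicitly.
-- Rows written by later transactions carry higher values (until xid wraparound).
SELECT xmin::text::bigint AS tx_id, id, updated_at
FROM orders
ORDER BY xmin::text::bigint
LIMIT 10;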

Test queries against your production database (responsibly)

This post links off to a YouTube video of The Safest Way to Test Postgres Destructive Queries, which provides a basic introduction to branching in the Neon PostgreSQL DBaaS. While the title originally interested me, the example showing the mechanics is, like many other product examples, extremely simplistic and not a true representation of “production” size or workload. I see this as a similar concept to AWS RDS Aurora cloning. However, any example should modify the structure of a table, measure the impact of that structure against production queries (note the plural), and provide additional metadata rather than just a response time. These are important considerations in my own evaluation of test coverage of data access and the gathering of configuration, data, and infrastructure when running experiments to determine a more optimal data access path or a new functionality requirement. More documentation can be found here on Neon Branching.

About “Digital Tech Trek Digest”

Most days, I take some time early in the morning to scan my inbox newsletters, the news, LinkedIn, or other sources to read something new about professional and personal topics of interest. I turn what I read into actionable notes in a short, committed time window, summarizing what I learned, what I should learn and use, or what is of random interest. And thus my Digital Tech Trek.

Some of my regular sources include TLDR, Forbes Daily, ThoughtWorks Podcasts, Daily Dose of Data Science and BoringCashCow. Also Scientific American Technology, Fareed’s Global Briefing, Software Design: Tidy First? by Kent Beck, Last Week in AWS, Micro Newsletter to name a few.

Random Wisdom

This week, I was reminded via a very interesting statement that work-life balance and joy in what you do are critically important. You will not find on a tombstone the statement:

“I never worked enough hours.”

Digital Tech Trek Digest [#Issue 2024.08]

The One Billion Row Challenge Shows That Java Can Process a One Billion Rows File in Two Seconds

Well, it’s way under 2 seconds for the 1BRC. The published results are in, and if you’re good, you can read 1 billion data points of weather data and analyze them. The final best number, as per the article release, is “00:00.323”. Yes, that answer is 323 milliseconds (“Result (m:s.ms)”). Mind-blowing.
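For context, the challenge task is essentially the aggregation below, just over a 1,000,000,000-line text file. This is a minimal SQL sketch of the equivalent result; the measurements table and its station/temperature columns are my own naming assumptions, not from the challenge repository.

-- The essence of the 1BRC task expressed as SQL: min, mean and max
-- temperature per weather station, rounded to one decimal place.
SELECT station,
       MIN(temperature) AS min_temp,
       ROUND(AVG(temperature), 1) AS mean_temp,
       MAX(temperature) AS max_temp
FROM measurements
GROUP BY station
ORDER BY station;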

ScyllaDB Summit 2024

Last week, I attended this virtual event. All the presentations can be found online. I had never used the product before, so while some new features like Tablets were not as applicable in understanding the full impact, the DynamoDB performance and cost comparisons were very applicable.

So what is ScyllaDB? It is a distributed NoSQL DBaaS that speaks the Cassandra protocol (do large companies still use this?) and also speaks the AWS DynamoDB protocol. That is really interesting to me. You can choose a cloud-hosted offering, or if you’re into managing your own setup, you can use the open-source ScyllaDB version available from GitHub. I started at ScyllaDB University to get a grip on the basics. I have yet to try the local Docker Compose setup.

Thanks also to the team for the swag which I received.

Playing a game with your CI/CD pipeline

My friend Sergey has created a game in GitLab called GitTerra. Drop a few lines into your .gitlab-ci.yml, and each build will give you a generated 3D map of a city based on your commit. I look forward to some of his next steps, leveraging potentially different colors for languages or different building structures for artifacts found in your commit.

We raised 11.6M to build Serverless Postgres for Modern SaaS

Congrats to Gwen and her co-founder for getting seed funding for Nile Serverless Postgres for Modern SaaS. Awesome news for an entrepreneur, and I’m very hopeful for the success of Nile.

The Safest Way to Test Postgres Destructive Queries

While I am a user of ElephantSQL serverless PostgreSQL and Neon, Nile and Xata are just a few of the products competing in this space. With multiple other products that also speak the PostgreSQL protocol, you can easily trial a small product in an RDBMS in the cloud at no cost. PostgreSQL is definitely outdoing MySQL in this space. You also have the extensive set of NoSQL cloud offerings, ScyllaDB which I just mentioned, and D1 by Cloudflare. I have yet to try this branching feature for your database; it sounds interesting, and I’ve added it to my list of products to try, which is just as long as my list of books to read. Nit: It’s PostgreSQL, not Postgres.

About “Digital Tech Trek Digest”

I take some time early in the morning to scan my inbox newsletters, the news, LinkedIn, or other sources to read something new covering professional and personal topics of interest. Turning what I read into some actionable notes in a short, committed time window is a summary of what I learned, what I should learn and use, or what is of random interest. And thus my Digital Tech Trek.

Some of my regular sources include TLDR, Forbes Daily, ThoughtWorks Podcasts, Daily Dose of Data Science and BoringCashCow. Also Scientific American Technology, Fareed’s Global Briefing, Software Design: Tidy First? by Kent Beck, Last Week in AWS, Micro Newsletter to name a few.

Digital Tech Trek Digest [#Issue 2024.07]

Everything you need to know about seed funding for startups

A recent call with a startup founder funded by TinySeed led me to learn about MicroConf and Rob Walling. (Thanks Tony for the info). This has led to a lot of great info in several new newsletters and videos including this video. A few very valuable tips I learned included the answer to Why should you raise funds at all? The 1-9-90 rule, and different types of funding including Indie funding. It was interesting to find out that the TinySeed accelerator is 1 year, and not 13 weeks, which is common in NY. Rather than sharing my notes, go watch the video.

5 Books That Paved My Path to Entrepreneurial Success

I have not heard of any of these books, and I have such a long reading list that perhaps I need to publish it and elicit feedback on prioritizing. The list from this article is as follows:

  1. Mastering negotiations with Never Split the Difference
  2. Embracing risk with Skin in the Game
  3. Building habit-forming products with Hooked
  4. The roadmap to a billion dollar app in How to Build a Billion Dollar App

Visualization

Last week I was at two events in Brussels. I chose to head to London to fly home. I found this map present in many tube stations (the Tube is London’s metro/subway system). It’s been a decade since I was in London, and over two decades since I lived in the UK. I found the new map great. When I mentioned it as a good visualization, I was surprised that locals of the London area thought it was horrible. I saw the value in the visualization, but perhaps others see it like art, “in the eye of the beholder”. It could also be habit.
London Tube Map - 2024 presentation
Typical London Tube map

Cats and Dogs

How many *NIX `cat` memes can there be? Well, a lot. `cat` is the most misused command by programmers new to Linux, and I cringe every time someone uses it incorrectly in a bash script. The thread below shows proper uses of cat only.

Hey, dogs, you are in the count too with HTTP STATUS DOGS. My picks are 300 Multiple Choices and 429 Too Many Requests.

Upcoming Events on my radar

About “Digital Tech Trek Digest”

I take some time early in the morning to scan my inbox newsletters, the news, LinkedIn, or other sources to read something new covering professional and personal topics of interest. Turning what I read into some actionable notes in a short, committed time window is a summary of what I learned, what I should learn and use, or what is of random interest. And thus my Digital Tech Trek.

Some of my regular sources include TLDR, Forbes Daily, ThoughtWorks Podcasts, Daily Dose of Data Science and BoringCashCow. Also Scientific American Technology, Fareed’s Global Briefing, Software Design: Tidy First? by Kent Beck, Last Week in AWS to name a few.

Random

I shared with a colleague on Feb 9. “3 SQL databases walked into a NoSQL bar. A little while later, they walked out because they couldn’t find a table.”

Digital Tech Trek Digest [#Issue 2024.06]

MySQL Belgian Days 2024 and FOSDEM 2024

In this past week, I’ve been able not just to read or watch digital content online but to meet people in person. In Brussels, first at the MySQL Belgian Days 2024 event, followed by FOSDEM 2024.

There was a wide array of presentations covering many different topics; this is just a summary. Fred talked about the history of command-line monitoring and gave an intro to the new player Dolphie. Dave Stokes talked security, and Sunny Bains gave us a brain dump of the TiDB scalable architecture. We got an update on PMM and MySQL on k8s from Peter Zaitsev as well as a chat about his new product coroot. And then came a great intro to a new generation of online schema change at scale with spirit by Morgan Tocker. Alex Rubin showed us not how to hack MySQL, but how MySQL can hack you. We have all crossed paths as MySQL Inc. employees or MySQL community members since 2006.

Marcelo Altmann gave us a detailed intro to a new era of caching with ReadySet. We also heard updates on Vitess. And that was just the Day 1 presentations. The evening event was at the incredibly wall-to-wall packed Delirium Café, sponsored by ReadySet, to whom we offer great thanks and cheers.

Day 2 was packed with great content about MySQL Shell, MySQL Heatwave ML and Vector, MySQL Router, and the MySQL optimizer from many well-known Oracle MySQLers before amazing awards, Belgian beer, and black vodka, of course.

Congratulations Giuseppe Maxia on your MySQL Legends award at MySQL Belgium Days 2024. It is well deserved for all of your community contributions over the decades.

Check out the details at Unveiling the Highlights: A Look Back at MySQL Belgian Days 2024.

Saturday and Sunday were FOSDEM 24 at its usual location. With so many people crossing the university, tunnels, and weird transit paths between all the university lecture halls, it can feel like a blur. For the first time, I had no fixed agenda, so I could check out random talks on random topics.

A shout-out to many people I know and some new people I met: Colin Charles, Alkin Tezuysal, Walter Heck, Charly Batista, Robert Hodges, Jens Bollmann, Monty Widenius, Matthias Crauwels, Michael Pope, Marcelo Altmann, Emerson Gaudencio, Aldo Junior and tons more I have forgotten to mention by name. There were also many conversations with random community people whose names I didn’t even get, for example, the team at Canonical.

About “Digital Tech Trek Digest”

Most days, I take some time early in the morning to scan my inbox newsletters, the news, LinkedIn, or other sources to read something new about professional and personal topics of interest. I turn what I read into actionable notes in a short, committed time window, summarizing what I learned, what I should learn and use, or what is of random interest. And thus my Digital Tech Trek.

Some of my regular sources include TLDR, Forbes Daily, ThoughtWorks Podcasts, Daily Dose of Data Science and BoringCashCow. Also Scientific American Technology, Fareed’s Global Briefing, Software Design: Tidy First? by Kent Beck, Last Week in AWS to name a few.

Digital Tech Trek Digest [#Issue 2024.05]

Because the world needs better dashboards

While my professional interests in Building Better Data Insights Faster rely on using visuals and narratives to show data-driven results, “Am I starting from first principles?” is the question you have to ask. Identifying the quality of data sources, the time to delivery, and the confidence of accuracy are critical aspects of any dashboard.

Source: WrapText by Equals

This is the second article I’ve read about Equals in a week, and while I’m not ready to go back to a spreadsheet, this company has some great previous posts with excellent content, such as the 2023 summary and How to ship fast. An appropriate statement would be:

What a year. We embraced AI. We reimagined BI. We waved freemium goodbye. And as the cliché goes, we’re only just getting started.

[Last Week in AWS] Issue #352: New Year, New You, Here’s December in Review

Damn right; I think you are giving too much credit by saying “a year”. More than once I had to rewrite code because AWS was years behind standard Python releases. AWS Lambda adds support for Python 3.12.

Whatever was going on with the delays in getting new language runtimes out a year or more after the language version itself was released seems to have been resolved. I wonder how long it’ll take that unpleasant chapter to fade from the collective awareness around Lambda.

Source: Last Week in AWS

Latency is the new outage

While this is technically a video that I listened to, Getting Started with ElastiCache for Redis Performance & Cost Optimization, “latency is the new outage” needs to be a slogan used more frequently. It is so true. The speaker in the opening minutes also describes some compelling reasons why our proliferation of data can contribute to a negative impact.

Source: Random AWS reading.

About “Digital Tech Trek Digest”

Most days I take some time early in the morning to scan my inbox newsletters, the news, LinkedIn, or other sources to read something new covering professional and personal topics of interest. Turning what I read into some actionable notes in a short, committed time window is a summary of what I learned, what I should learn and use, or what is of random interest. And thus my Digital Tech Trek.

Some of my regular sources include TLDR, Forbes Daily, ThoughtWorks Podcasts, Daily Dose of Data Science and BoringCashCow. Also Scientific American Technology, Fareed’s Global Briefing, Software Design: Tidy First? by Kent Beck, Last Week in AWS to name a few.

Digital Tech Trek Digest [#Issue 2024.04]

NoOps and Serverless solutions

I was reminded of an upcoming expiry of a test website that I have on PythonAnywhere. This site enables you to host, run, and code Python in the cloud without any infrastructure, starting with a free account and then a $5 account. Striving towards NoOps and serverless is an important consideration for any small and simple application; I’d forgotten completely about this service.

5 IT services industry trends on tap for 2024

As major companies either want to use a service provider or maintain a relationship with one, knowing the trends lets you consider what the SaaS providers of all the services you use (authentication, security, chatbots, support systems, and more) are thinking about.

This article considers these trends:

  • Cloud cost optimization
  • Focused transformation, innovation
  • Investment in generative AI skills
  • Vertical market focus
  • Partner programs, reconsidered

Rapid developments in AI will also shape business prospects for consulting firms, MSPs, and systems integrators. AI could potentially provide a way to deliver new capabilities in shorter timeframes that satisfy the C-level demand for a quick ROI.

Source: https://www.techtarget.com/searchitchannel/feature/IT-services-industry-trends-on-tap

Context switching is killing your productivity

I believe the title says it all. The article provides several ways to combat this productivity killer.
Source: https://asana.com/resources/context-switching

Exploding Topics

A colleague pointed me to Exploding Topics. It is an interesting look at the growth of certain topics over recent years. I’m not sure if they are measuring articles, products, websites, or just conversations on the topic in question.

Thoughtworks Technology Radar

I spent a lot of time reviewing the recent Thoughtworks Technology Radar. I was hoping that 2024 would bring a current version; however, Sep 2023 is still recent. My thoughts on the tools, techniques, platforms, and frameworks in vogue I’ll leave for a separate post.

Why I’m excited about profit-sharing startups

Every year there is a list of the startups that failed, and 2023 was no different. There is also the list of likely IPOs for the year. Is it going to be SpaceX, Databricks, and Reddit, for example?

This article, along with a host of links reaching out to sites such as Creator Fund, Humanism and Weekend Fund and other interesting stories, reiterates that it is great people, and not great ideas, that are the right foundation for being an entrepreneur. The concept of investing that asks for a return of 1-5% of future earnings is an interesting alternative to going down the VC slog.

The article lists these points:

  • There’s a culture shift in tech toward profit-generating businesses.
  • There’s a tech shift that enables talent to build more with less.
  • There’s a regulatory shift that makes exits challenging.

… believe a few big shifts will drive more founders and investors to pursue profit-sharing models in 2024 and beyond.

This tweet talks about Gumroad issuing dividends back to its investors. I always understood that investors wanted to see a return, or a positive change in the return capabilities, within a 5-year horizon. Also interesting is this Challenging your assumptions about startups video.

Combined with Why the Future of Startups are Studios, this really helps me consider what I started back in 2011 with a number of technology leaders in New York as a viable alternative to what we know about funding a startup. We were always able to get through the first three steps easily.

  • Generate an idea
  • Flesh out the idea
  • Launch and experiment
  • Create a project
  • Create a big company

I believe Graham was ahead of his time with Ultra Light Startups some 15 years ago.

Source: TLDR

About ‘Digital Tech Trek Digest’

Most days I take some time early in the morning to scan my inbox newsletters, the news, LinkedIn, or other sources to read something new covering the professional and personal topics of interest. Turning what I read into some actionable notes in a short committed time window is a summary of what I learned today, what I should learn and use, or what is of random interest. And thus my Digital Tech Trek.

Some of my regular sources include TLDR, Forbes Daily, ThoughtWorks Podcasts, Daily Dose of Data Science and BoringCashCow. Also Scientific American Technology, Fareed’s Global Briefing, Software Design: Tidy First? by Kent Beck to name a few.

Digital Tech Trek Digest [#Issue 2024.03]

Lessons from going freemium: a decision that broke our business

As an entrepreneur always considering how to produce a sustaining passive revenue, what licensing model to use, and how to acquire and retain customers, the allure of a freemium model is ever present in so many offerings. You may wish to read this article and look at the visualizations provided with the narrative. I found this a useful data storytelling example.

The allure of seeing a new product is the strongest motivator new users have to complete setup. If you make onboarding too easy, they’ll never come back to do the hard task you let them skip.

Read more at Lessons from going freemium: a decision that broke our business. Source: TLDR

Newsletters and online content creators

Lenny’s Newsletter, from the prior article, listed with over 574,000 subscribers, is one of several Substack newsletters I subscribe to. Substack is described as “The subscription network for independent writers and creators”. I have been collecting the number of subscribers from several newsletters I follow; however, there is no way to see that growth over time. Also missing are the price rates over time and the ratio of free to paying subscribers. Random idea: what is missing is a history of this information. Other stats I’ve noted previously include 66,000 subscribers for Kent Beck’s newsletter with 3 subscription plan offerings, 1,250,000 subscribers for the free TLDR (I can remember this being much less years ago) and 65,000 subscribers for the Seattle Data Guy newsletter.

FWIW, this post from Lenny’s Newsletter, This newsletter is growing up, is from 2020.

Golden Kitty Awards 2022

I came across the Golden Kitty Awards, which unfortunately are only current to 2022 (a fail on being current). It was interesting to scan the list for innovative ideas. I’ve yet to visit any of the sites, but I’m always encouraged by what people think of and commit to building, regardless of the motivation or incentive. What counts is that an entrepreneur takes an idea and releases a product.

Source: Random

Streamer JS – Video stream layout manager for OBS Studio and other streaming applications.

I am a new user of Twitch streaming for personal projects. My good friend Sergey Chernyshev, organizer of the large New York Web Performance Group, has created Streamer JS as a means to drive more dynamic content in the browser with the common languages of HTML/CSS/JavaScript, using OBS more as the streaming-only component. One objective is better version control management of assets/scenes/sources/filters/etc. It’s interesting that PouchDB is an eventually consistent distributed datastore in JavaScript. Yet another simple data store to review for suitability.

Source: Word of mouth

The 37th Chaos Communication Congress (37C3) by the Chaos Computer Club

Last month I was introduced to the Chaos Computer Club. This large German-based annual tech conference focuses on security & infrastructure/hacking.  Over 100 talks from the most recent event last week have been posted here.

Source: Word of mouth

About ‘Digital Tech Trek Digest’

Most days I take some time early in the morning to scan my inbox newsletters, the news, LinkedIn, or other sources to read something new covering professional and personal topics of interest. Turning what I read into actionable notes in a short, committed time window summarizes what I learned today, what I should learn and use, or what is of random interest. And thus my Digital Tech Trek.

Some of my regular sources include TLDR, Forbes Daily, ThoughtWorks Podcasts, Daily Dose of Data Science and BoringCashCow. Also Scientific American Technology, Fareed’s Global Briefing, Software Design: Tidy First? by Kent Beck to name a few.

Digital Tech Trek Digest [#Issue 2024.02]

Indie Newsletter Tool Generates $15,000 a Month

There are so many different email newsletter sites that you could wonder if there is market saturation. MailChimp, Mailgun, ConvertKit, Sendgrid (now part of Twilio, it seems), Moosend and Mailersend come to mind.

It seems the space still has plenty of revenue-producing options, including buttondown.email, reportedly a side gig generating $15k per month. Source: BoringCashCow

When I asked a good friend and author of the Technical SEO Weekly about his use of ConvertKit, he directed me to this Baremetrics Dashboard, which is another product to look at sometime.

LLMs and Programming in the first days of 2024

How do you use an LLM? If you are still on the fence, start getting into the habit of using one more frequently. I now use ChatGPT and Claude AI daily, and with a crowded market there are many other emerging technologies to also consider.

I use ChatGPT for coding and image generation with DALL.E. I use Claude more for reviewing large documents that seem to be ideal for producing a summary, or to generate a fictitious movie script from those documents.

I do not like JavaScript, nor do I wish to actually learn this language, however I write it daily via ChatGPT. JavaScript is the ever-changing technology of web development, and it’s impossible to keep up with the next product, or the next version of a product you may know. ChatGPT helps me navigate this, combined with asking for HTML and TailwindCSS. However, it’s not perfect; you need to be an experienced engineer who has learned how to write code for many years to ask the right questions and to correct the LLM when it does not produce what you expect. Let’s look at CSS. Now there is flex and grid, and it’s hard to keep up with the changing features that browsers support. This is where ChatGPT has helped me. I have been using TailwindCSS, but it still took an expert friend 30 minutes to help me debug a CSS formatting issue of a future OBS Twitch streaming project to correctly size the content all in a 1920×1080 box. I learned a lot of new features of the Google Chrome Developer Tools Inspector I did not know, and they are probably just the start of expert debugging features.

Until a few months ago I never knew it’s now much easier to read JSON in Javascript.

// Fetch and parse a JSON file, logging and returning the parsed data.
async function fetchData() {
  try {
    const response = await fetch('data.json');
    const data = await response.json();
    console.log(data);
    return data;
  } catch (error) {
    console.error('Error fetching data:', error);
  }
}

// Top-level await requires an ES module (or a modern browser console).
let data = await fetchData();

I’d like to remind users that ChatGPT can make mistakes. Consider checking important information. Source: TLDR

ParadeDB (GitHub Repo)

Every day there is another PostgreSQL product to review.  I am a current user of ElephantSQL which I didn’t know existed two months ago. Neon and Tembo are two more PostgreSQL serverless-related products on my product review list.  Now adding ParadeDB as well as reading Thoughts on PostgreSQL in 2024.

About ‘Digital Tech Trek Digest’

Most days I take some time early in the morning to scan my inbox newsletters, the news, LinkedIn, or other sources to read something new covering professional and personal topics of interest. Turning what I read into some actionable notes in a short committed time window is a summary of what I learned today, what I should learn and use, or what is of random interest. And thus my Digital Tech Trek.

Some of my regular sources include TLDR, Forbes Daily, ThoughtWorks Podcasts, Daily Dose of Data Science and BoringCashCow to name a few.

Digital Tech Trek Digest [#Issue 2024.01]

The Tiny Stack (Astro, SQLite, Litestream)

I spent many years in the LAMP stack, and there are often many more acronyms for technology stacks in our evolving programming ecosystem. New today is “The Tiny Stack”, consisting of Astro, a modern meta-framework for JavaScript (not my words), and Litestream, which continuously streams SQLite changes. I’ve never been a fan of JavaScript, a necessary evil in modern stacks, but it changes so rapidly that it’s a constant stream of new products with never the time to learn any. Litestream is interesting. Replication of SQL operations to a database is nothing new, being the Change Data Capture (CDC) of your data; however, I’d not thought of SQLite, which is embedded everywhere, offering this type of capability.

Amazon Aurora Introduces Long-Awaited RDS Data API to Simplify Serverless Workloads

AWS Aurora Serverless version 2 has been out for at least a year (actually 20 months – Apr 21, 2022), but a feature of version 1 that was not available in version 2 is the Data API. This gives developers without SQL skills a RESTful interface to the database; however, it only works in AppSync, only for recent versions of PostgreSQL, and only in certain regions. I’ve never used it myself, but it is news.

Speaking of what is available in what AWS region, recently released InstanceHunt allows you to identify the instance families/types available in different regions across various AWS Database services. I developed this in just a few days and released it only last week as a working MVP. Future goals are to include other clouds and other categories of services such as Compute. The prior announcement may facilitate a future version that supports the features of services in regions.

Stop Stalling And Start Your Dream Side Business In 2024

The title kinda says it all. As an aspiring entrepreneur, my pursuits have only offered limited success over the decades and never a passive revenue stream. While the article did not provide valuable nuggets, the title did. One of my goals for 2024 is to elevate my creation and release of side projects, regardless of whether each project is a source of revenue. I consider that refining my design, development, testing, and implementation skills, and providing information of value, are all resources of a soft income that showcase some of my diverse skills.

About ‘Digital Tech Trek Digest’

Most days I take some time early in the morning to scan my inbox newsletters, the news, LinkedIn, or other sources to read something new covering professional and personal topics of interest. Turning what I read into some actionable notes in a short committed time window is a summary of what I learned today, what I should learn and use, or what is of random interest. And thus my Digital Tech Trek.

Announcing InstanceHunt

InstanceHunt identifies the instance (families/types/classes) available for a cloud service across all the regions of that cloud.

The initial version is a working example for several AWS database services. Future releases will enable advanced filtering, cover other service categories (e.g. compute) as well as the GCP and Azure cloud platforms, and provide the full list of instance types within families within the service matrix.

For a few days’ investment, this MVP is a usable service, complete with adding new regions the same day; for example, ca-central-1 data was available the day of release. It is interesting and can answer questions like: In what regions are the new generation 7 instance families available? What consistent instance types can you use across European regions? Where is MemoryDB not available?

Feature requests are welcome. From today’s reading, being able to show a feature of a service may also be a useful future matrix, e.g. the AWS Aurora Serverless Data API is now available in Serverless v2, but only for one of two engines and only in a few regions.

China regions and AWS Gov Cloud regions are coming soon.

InstanceHunt - Find what instances you can use for your cloud services

Mastering MySQL 5.7 EOL migrations

In a recent podcast, Mastering EOL Migrations: Lessons learned from MySQL 5.7 to 8.0, I discuss with my colleague Adam North not only the technical issues that come with a major migration but also the key business and management requirements of having a well-articulated strategy that covers:

  • Planning
  • Testing
  • Being prepared
  • Being proactive

Having a plan is key to any significant task including data migrations. You should heed the warnings and the deprecations and consider all potential downstream product impacts such as connector upgrades. The plan includes a timeline but also needs to define all the stakeholders both technical and business, the definition of a successful migration, and most importantly the decision tree for a non-successful migration that would include any outage, failback, rollback, or fix-forward requirements.

Test, test, test. Leveraging the simple design pattern of read-write splitting (hint: if your application does not support this, it should), you get to test all of your application reads with minimal risk and with real load, from 1% to 100%. You can validate all writes, but this does not match production concurrency; however, you can emulate load testing and, using this two-way door strategy, verify and prevent many common problems before the decision point of failover.

Being prepared means assuming your migration will fail, rather than assuming it will succeed. Rehearse all steps so they are documented and reproducible. Validate that your backup and recovery strategy is still optimal and operational with the new version, and prepare supporting staff for availability before, during, and after the migration. There are probably not many technologists that can say, “Well, that was a boring, uneventful, successful migration”. The question is: why not?

Being proactive is just as important. Leaving a large migration to the last minute is procrastination and a cause of unneeded stress during a non-successful migration. The Meltdown/Spectre vulnerabilities are one example of a highly impactful event outside of your control that sidelined entire teams in many companies for months. Does an outage of your cloud provider impact your uptime requirements and force you to delay a last-minute migration due to customer SLA obligations? While being prepared is for the reasons you could think of, being proactive and prepared is for the situations you have not thought of.

Having solid architectural design practices will aid greatly in many critical business requirements of uptime, read-only mode, scale-out, scale-up, and sharding. These design patterns also greatly enhance the likelihood of a successful database migration.

We have also created a Checklist to cover the planning and execution of a migration. Any input is welcome.

You can check out the video podcast on YouTube or listen with your favorite podcast tool.

Data Masking 101

I continue to dig up and share this simple approach for production data masking via SQL to create testing data sets. Time to codify it into a post.

Rather than generating a set of names and data from tools such as Mockaroo, it is more practical to use actual data for a variety of testing reasons.

The SQL below is a self-explanatory approach to removing Personally Identifiable Information (PII) while keeping the data relevant. I use this approach for a number of reasons.

  • We are using production data rather than synthetic data. Data volume, distribution, and additional column values are realistic. This is a subset of an example, but dates and locations are therefore realistic.
  • Indexes (and unique indexes) still work, and distribution across the index is adequate for searching. Technically the index will be a little larger in disk footprint.
  • You cannot reverse engineer the masked value into a real value with just this data set. An engineer in a test environment cannot obtain the underlying information.
  • If you identify an issue with data quality for any row of data, there is a way to preserve the uniqueness of that row. This enables a person with production access to match the underlying row. Of course, any unique identifier (auto increment or UUID) should also be modified to mask real data; a sketch of that follows the sample output below.


SELECT CONCAT(SUBSTR(first_name,1,2),REPEAT('*',LENGTH(first_name)-2)) AS first_name,
CONCAT(SUBSTR(last_name,1,3),REPEAT('*',LENGTH(last_name)-3),' ', SUBSTRING(MD5(CONCAT(first_name,last_name)),1,6)) AS last_name,
CONCAT(SUBSTR(organization,1,3),REPEAT('*',LENGTH(organization)-3),' ', SUBSTRING(MD5(CONCAT(organization)),1,6)) AS organization,
created, country
FROM customer
LIMIT 10;

+------------+--------------------+------------------+---------------------+---------+
| first_name | last_name          | organization     | created             | country |
+------------+--------------------+------------------+---------------------+---------+
| Sa****     | Cor**** 4c23cd     | Ski*** d21420    | 2022-09-20 03:30:14 | PH      |
| Fu****     | Wat*** 8b97de      | Jax***** e629c2  | 2022-04-08 03:20:22 | BY      |
| Mo****     | Zis***** b11d94    | Rhy**** b4073a   | 2022-10-06 15:58:38 | IR      |
| So****     | Bad** 232cc2       | Rhy*** 1734bd    | 2022-02-01 07:35:39 | ID      |
| Ni*****    | Ter***** d9ffb5    | Wor****** 6e476c | 2021-11-08 17:07:34 | IL      |
| Ka******   | Scr***** 9201db    | Jax**** 481fd8   | 2022-08-18 19:17:54 | BR      |
| Li***      | Coz** 0447f6       | Nlo**** 11da59   | 2022-07-29 06:47:56 | HR      |
| Ch*****    | Hal******** f5d9c8 | Zoo**** c6e07d   | 2022-09-28 04:54:30 | UA      |
| Er******   | Ste******* d005f2  | Eid** ffc305     | 2022-04-28 18:50:11 | PT      |
| Fo**       | O'S***** b35c44    | Buz**** 2c8598   | 2022-09-11 02:05:55 | RU      |
+------------+--------------------+------------------+---------------------+---------+
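Building on the last point above about unique identifiers, here is a minimal sketch of one way to mask a numeric primary key within the same query. The id column name and the 'mask-2024' salt are assumptions for illustration; a salted, truncated hash keeps the value unique and joinable within the masked data set without exposing the real identifier.

-- Hypothetical extension of the same masking approach for an AUTO_INCREMENT id.
-- Keep the salt secret so the hash cannot be trivially recomputed.
SELECT SUBSTRING(SHA2(CONCAT('mask-2024', id), 256), 1, 12) AS masked_id,
       CONCAT(SUBSTR(first_name,1,2),REPEAT('*',LENGTH(first_name)-2)) AS first_name,
       created, country
FROM customer
LIMIT 10;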

A reliable and dependable application requires observability

Observability (o11y) is a critical prerequisite component in software architecture when advocating for and preparing organizations to make informed decisions on the success of their application. OpenTelemetry from the Cloud Native Computing Foundation is the go-to standard regardless of your choice of monitoring tools. However, observability is just a building block that I need to explain when advocating for reliable and dependable systems. Observability will not inform you “Was the customer actually impacted, or how many, or for how long?”. Observability will not tell you the root cause of a problem.

My five layers of building blocks for Reliability are:

  1. Observability – The collection of telemetry (metrics, traces and logs) should just be there. If you are using Kubernetes (k8s) and a Java/Python/Node.js application it is already built in. Just do it.
  2. Reproducibility – The ability with a known set of steps and a given configuration and setup you can reproduce an outcome showing the same observed results is a necessary pre-requisite for any feature development or bug fixes.
  3. Testability – After being able to consistently reproduce an observed event with measurable results, the running of various experiments using a variety of changes enables you to adequately test future improvements or corrections to the initial situation, whether it’s a known bug, or a new piece of functionality. Reproducible and consistent testing is an essential component to the release of software for a reliable application.
  4. Scalability – It is impossible to adequately test a system to failure without an observable, reproducible, and testable framework. Many organizations suffer from the management “Can we support X operations” syndrome, when instead the application should know what “X” is automatically, and have adequate safeguards in place to prevent its occurrence. The ability to proactively disable [expensive] features for the good of the entire system is not a common practice for software (aka a dark mode). In fact, many organizations do not even have the capability of customer-level and individual component-level feature flags or related rate limits that can manually be implemented.
  5. Dependability – A reliable, highly available, and dependable application requires all of the prior layers to be in place to give a level of assurance to your customers and your company that your product is dependable.

AWS RDS Aurora wish list

I’ve had this list on a post-it note on my monitor for all of 2022. I figured it was time to write it down, and reuse the space.

In summary, AWS suffers from the same problem that almost every other product does. It sacrifices improved security for backward compatibility of functionality. IMO this is not in the best practices of a data ecosystem that is under constant attack.

  • Storage should be encrypted by default. When you launch an RDS cluster its storage is not encrypted. This goes against their own AWS Well-Architected Framework Section 2 – Security.
  • Plain text passwords. To launch a cluster you must specify a password in plain text on the command line, again not security best practice. At least change this to using a known secret from AWS secrets manager.
  • TLS for administrative accounts should be the only option. The root user should only connect with REQUIRE SSL (MySQL syntax); see the sketch after this list.
  • Expanding on the AWS secrets manager usage for passwords, there should not need to be lambda code and cloudwatch cron event for rotation, it should just be automatically built in.
  • The awscli has this neat wait command that will block until you can execute the next statement in a series of sequential events to prepare and launch a cluster, but it doesn’t work for create-db-cluster. You have to build in your own manual “wait” until “available” process.
  • In my last position, I was unable to enforce TLS communications to the database from the application. This insecure practice is a more touchy situation, however, there needs to be some way to ensure security best practices over application developer laziness in the future.
  • AWS has internal special flags that only AWS support can set when say you have a bug in a version. Call it a per-client feature flag. However, there is no visibility into what is set, which account, which cluster, etc. Transparency is of value so that the customer knows to get that special flag unset after minor upgrades.
  • When you launch a new RDS cluster, for example, MySQL 2.x, you get the oldest version; back earlier in the year it was something like 2.7.2, even when 2.10.1 was released. AWS should default to a more current version when only an engine is specified. I would not advocate that the latest version be the automatic choice, but it’s better to be more current.
  • The ALTER SYSTEM CRASH functionality is great, but it’s incomplete. You cannot, for example, crash a global cluster, forcing a region-specific failover. If you have a disaster resiliency plan that is multi-region, it’s impossible to actually test it. You can emulate a controlled failover, but this is a different use case to a real failover (aka Dec 2021).
  • Use arn when it’s required, not id. This goes back to my earlier point of maximum compatibility over usability, but when a --db-instance-identifier or --db-cluster-identifier requires the value to be the ARN, then the parameter should be specific. IMO --identifier is what you use for that argument, e.g. --db-cluster-identifier. When you specify, for example, --replication-source-identifier this must be (as per docs) “The Amazon Resource Name (ARN) of the source DB instance or DB cluster if this DB cluster is created as a read replica.” It should then be --replication-source-arn. There are a number of different occurrences of this situation.
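As a minimal sketch of the MySQL syntax referenced in the TLS point above (the account names and password placeholder are illustrative only), enforcing TLS on an administrative account looks like this:

-- Require TLS for an existing administrative account.
ALTER USER 'admin'@'%' REQUIRE SSL;

-- Or enforce it at creation time.
CREATE USER 'app_admin'@'%' IDENTIFIED BY '<strong-password>' REQUIRE SSL;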

Our Data Security Moonshot Starts With Prevention

The recent re-announcement of the Cancer Moonshot highlighted a common enemy of many endeavors to improve our society as a whole: not using common sense and already-known methods.

At a high level, “The goal of the Cancer Moonshot Scholars program is to inspire and support the next generation of world-class and diverse researchers focused on scientific breakthroughs that will make a difference for patients and drive progress toward the goal of ending cancer as we know it today.” (source: fact sheet)

As stories of this announcement filtered through news outlets with interviews of medical professionals, a common thread appeared. Lacking in the message, yet the single greatest and already-known advancement to the problem, is prevention. This includes known prevention measures, early detection measures, and education.

As a Data Strategist, I see Data Security as a critical component of any business, and the single best defense is prevention and using common sense.

Here are just some simple basics that seem to have to be discussed and argued repeatedly company after company, product after product, yet there is no single effort to eliminate these poor practices.

  • No clear text passwords. If you have to enter a password on the command line (cough cough AWS CLI) or put a clear text password in a configuration file (cough cough 100s of products), you enable simple techniques to obtain unvetted access to your data.
  • Using clear text passwords is amplified when products offer a more secure means of access and identity management but do not employ it everywhere. Check out Password Plaintext Storage.
  • Clear text transport. It pains me to say it, but even in recent employment that held critical PII data, I could not enforce TLS communication between applications and databases. While it was as simple as a configuration option, the constant excuse from engineering management was that it was too hard to implement (cough cough BS). A quick audit query for this follows the list.
  • The default configuration settings for a product need to be secure, not the defaults that are most compatible with prior versions. For example, if you launch a new cloud database instance with defaults, are those the most secure options, or the least secure?
  • Credential rotation. Long-lived credentials should just be eliminated. Often these are also not named users, but commonly used processes.
  • Communicating passwords in clear text. This should never ever happen, yet it does. Have you ever received an (insecure protocol) email saying here is your username and password? A short known list of 5880+ sites can be found in the https://plaintextoffenders.com/ list on github offenders.csv.
  • Data systems accessible via the public internet. For example: MongoDB article, MySQL/MariaDB article, Redis & Elasticsearch, etc.
  • Data systems that have no credentials required
  • Data systems that have default credentials that were never changed
  • Storing passwords in clear text
  • Storing passwords with a single salt
  • Storing passwords with a symmetric encryption approach
  • Administrators that use a common account for “root” privileges, not individual named accounts
  • Not patching products with fixed vulnerabilities CVS Program Mission
  • and the list could go on and on….

In all of the above points, there are numerous examples of these data security anti-patterns. While many are due to the products in use, some of these examples represent poor business practices. It should not have to be explained that most attacks and breaches are internal. The common and very incorrect attitude of, we are within our Virtual Private Cloud (VPC) we do not need to encrypt our data is well, plainly wrong.

Transparency

One of the greatest threats to businesses is ransomware. Attackers gain access to system via various means, those above are just the simplest means and then hold businesses ransom. Ransomware has multiple impacts including the loss of a business operating, the process and time of making a decision, the penalty for payment to release the random, and generally the threat of release of their data if a fine is not paid.

There is a lot to unpack even with this ransomware statement. Can you not restore your entire business operations within a suitable RTO and RPO? Is important data not encrypted. Are passwords in your business able to de-encrypted (this should never even be possible). Do you have a disaster recovery (DR) strategy? Can you access critical data via others means and systems independently?

The stigma of a ransomware attack is organizations do not share this openly. They do not share why it happened, what could have been done to prevent this, and sharing all information with federal authorities that should be tracking all occurrences. This information is an important and critical education feedback loop for the whole industry and IMO lacking of attention. Do you know of a website that shared known ransomware attack vectors.

Conclusion

If security is an important aspect to the data in your organization, can you name the people in your security department? Can any individual point out an insecure product with a known fixed vulnerability? Is that information transparent? Is there a process to address that as a top priority, moving engineering and operations goals accordingly? While organizations may employ an error budget for outages, do they employ a security vulnerability budget? Do companies note version updates of all their software, have people read ALL the release notes of each point release, or even know every version of each software product in use in the organization?

For more information, check out

There will always be better and more determined attempts to attack data systems, we have to stop the most obvious first, and we have to participate in identification and remediation endeavors.

Using a simple relatable example to every person, your home. We should start with not leaving the door open, or leaving the keys in the door or simply removing the door all together.

Spoiler – Owning your data isn’t good enough

While this is a catchy title, if you use Software as a Service (SaaS), or an online cloud provider, do you actually own and have total control of your business data and its infrastructure? For all the free and paid services your business uses, what happens if one day, a portion of that were no longer available?

When you have data in a CRM, an analytics platform, a marketing platform, a payments platform, if one of those service providers locks you out of your data, you have lost control and access to a part of your business. Can you still operate unaffected? What is the actual impact? What is your contingency? You could be lucky and the impact is temporary, such as a day or a week, but it could also be longer or even indefinite.

Let me give you a simple but concrete example. Fellow woodworker Eric of Spencley Design posted recently on YouTube “I just lost half of my business”. If you listen to just 2 1/2 minutes from 12:00 to 14:30 of his youtube video explanation you will understand that this business relies on several online SaaS services. Many are free, but for an unexplained reason, whether bad code, bad ML/AI, or several other plausible reasons, one of his income streams was shut down without notice. This was not by his doing, or any of his actions but for unrelated reasons. Online attempts to appeal this situation caused a permanent suspension. Talking to a human to understand what happened, why it happened, and how this can be resolved, was also unanswered because there is no ability to physically speak to a human.

This problem is not limited to online services. A great example of just a decade ago is your business credit card stops working, transactions are declined. If you were lucky you could physically call your bank manager, or go to your bank manager to get to the bottom of the situation. You knew your bank account contained sufficient funds as you maintained on-premise accounting practices and you could provide evidence of such facts. If you run a small business today, do you think you can talk to a human that would have the ability to correct this problem, or would you have to talk to 5 humans, multiple automated (and annoying) systems, costing countless hours of time and frustration?

If you rely on Acme George Inc workspaces product for your small business email and shared documents, what if that becomes blocked? How do you communicate with your customers? What if you use Acme Archie Inc for your customer support ticketing system, and for a week it is unavailable to use? Not only can your customers not report issues, but you have no access to see what issues were already outstanding and work on them independently.

At times there are widespread outages of online presences that have a wide effect across industries from hours to weeks. Cloudflare Jun 21, 2022, Fastly June 8, 2021, Amazon Web Services Dec 7, 2021, and then Dec 15 and Dec 22. A blog post called it the AWS’s December Outagepalooza. The Atlassian April 2022 outage for paying customers lasted upto 2 weeks. Even a free social media company and its related entities incurred widespread impact Facebook Oct 4 ,2021 that affected many gig economy businesses. These outages can have far ranging effects. Actual examples include you cannot pay your employees, your staff at a hospital cannot authenticate to access patient records, transportation and logistics of your shipping business is halted.

I am referring here to loss of access to your data in a SaaS environment, and loss of cloud infrastructure that supported your SaaS services or even your internally developed and maintained systems running on cloud infrastructure. If you are not convinced of the larger ramifications of an extreme loss of infrastructure services what was the impact to Parler in 2021.

My point here is you cannot simply stop using these services, or your cloud provider(s) infrastructure. You need to be prepared. In a traditional system, you backup your data for some degree of disaster, and you support the capability to recover both infrastructure and data from this, and if you a smart you actually test this. Sidebar a colleague recently shared that even with massive investment in infrastructure and global redundancy, a scheduled test for this large bank took down services for 12 hours.

Large SaaS organizations could offer services that offer multi-region or multi-cloud capabilities, but they are also at the mercy of the SaaS providers they use. Do you know all the interdependencies? Look no further than the wipe out of Okta’s stock (down 30%) in one day. CEO of Okta Todd McKinnon cited several factors including a security impact by text message provider Twilio. Read more about that at Twilio Employeee, Customer Accounts Breached Through Texts. And yes, the headline here has an incorrect spelling. I tried to add a comment to offer feedback, but the MarketWatch paywall of 4 articles would not let me create an account to login to leave a comment!

The solution is not to host all of your own infrastructure either. Facebook’s very long outage was self-inflicted and they controlled all of their own infrastructure. It not only had an impact on their websites, their internal staff were unable to use security badges to access critical infrastructure to correct the problem because they were physically locked out of buildings holding the infrastructure.

Returning to the small business owner who uses a marketing platform, an analytics platform, a CRM, a payment platform or even a social media platform. Do you keep current copies of your data in these systems so that if there were a loss, you knew who to communicate with? In the first cited case, did Eric have a list of all of his subscribers, a copy of all his online content, and all comments made by subscribers. Was there a means to communicate with them via other means, or was access to sufficient PII not even possible for what was his original content?

In future posts I will share some of my techniques for ensuring you have a data acquisition strategy.

What is the right length of a blog post?

A question without a definitive answer. Finding opinions from authoritative sources can also be easily obscured due to search engine optimization or even the choice of words used while searching.

I used the following search terms initially in Google and DuckDuckGo.

  • what is the right size of a blog post
  • what is the ideal length of a blog post

I then started with the term “ideal blog post”, and here are the type-ahead responses. Clearly “length” is the definitive winner in word association. My first thought was “size”, is that a technical difference?

DuckDuckGo

  • ideal blog post length
  • ideal blog post length for seo
  • ideal blog post size
  • ideal blog post length 2021
  • ideal word count for a blog post
  • ideal length for a blog post
  • ideal length of a blog post
  • ideal length for a blog post

NOTE: Size mentioned only once.

Google

  • ideal blog post length
  • ideal blog post length for seo
  • ideal blog post title length
  • ideal blog post length 2022
  • ideal blog post length for seo 2022
  • ideal blog post length 2021
  • ideal blog post length 2020
  • ideal blog post length for seo 2020
  • ideal blog post frequency
  • ideal blog posts

NOTE: Size not mentioned once. As a result the original title of my post was changed from size to length.

Search Outcomes

Using Google, which now often will provide a summarized result (known as a feature snippet) before examples of what People also ask, or ad results that are even before ranked actual results.

what is the right size of a blog post – Google

2,100-2,400 words
For SEO, the ideal blog post length should be 2,100-2,400 words, according to HubSpot data. We averaged the length of our 50 most-read blog posts in 2019, which yielded an average word count of 2,330. Individual blog post lengths ranged from 333 to 5,581 words, with a median length of 2,164 words. Mar 2, 2020

ideal blog post length – Google

about 1,500 to 2,000 words
Although your blog post length may vary depending on your topic and audience, it is often best to aim for about 1,500 to 2,000 words for articles or posts. Longer pieces seem to do better when it comes to ranking on SERPs.

DuckDuckGo

I have not yet seen, nor in these examples is DuckDuckGo creating a single answer summary. Probably IMO a good thing.

what is the right size of a blog post – Bing

Branching out I was curious what other possible engines provided.

1,600 words – According to 2 sources

And then a non copy/paste answer that I had to extract from developer tools

In the infographic “ The Internet is a Zoo: The Ideal Length of Everything Online ” from Buffer, they find that the ideal blog post length is 1,600 words. But some sources think a good blog post should be even longer than that. In a Medium article, the writer says that posts with an average read time of 7 minutes captured the most attention.

According to research done by popular blogging platform, Medium, the ideal length for blog posts is 1,600 words (or seven minutes of reading). This number is based on an analysis of the “average total seconds spent on each post and compared this to the post length.”

ideal blog post length – Bing

To sum up, here’s a list of common blog posts lengths to help you find your own ideal length:

Micro content: 75–300 words. Super-short posts are best for generating discussion. They rarely get many shares on social…
Short-form content: 300–600 words. This is the standard blogging length, recommended by many “expert” bloggers. Shorter…

More …

what is the right size of a blog post – Yahoo

Above the fold, after ads and before People also ask and actual results was

For SEO , the ideal blog post length should be 2,100-2,400 words, according to HubSpot data. We averaged the length of our 50 most-read blog posts in 2019, which yielded an average word count of 2,330. Individual blog post lengths ranged from 333 to 5,581 words, with a median length of 2,164 words.

ideal blog post length – Baidu

As the homepage was all Chinese and I wasn’t sure if I should continue but I cut/pasted english and hit the button and got results in English.

The text of the first search response was something I’d not seen on any other page, so for reference apparently there are Blog styles :)

Ideal Blog Post Length for SEO Blog posts vary in length from a few short paragraphs (Seth Godin style) to 40,000 words (Neil Patel style).

What an SEO SME says

So I reached out to my most knowledgeable friend in SEO and asked them the question Without googling or searching online, based on your SME.

Q: What is the right size of a blog post?
A: You mean content length? 1500 to start, ideally more towards the 5,000 or 10,000

Q: What is the best reading time for a blog post?
A: depends – long form vs short – some times a simple paragraph is all you need. Other times you want a book.

Summary

Using what the engines provide as a single recommendation, not the top organic search result.

Source Response
Google 1,500-2,000 or 2,100-2,400 depending on question
DuckDuckGo -
Bing 1,600 (only to mention time of 7 minutes)
Yahoo 2,100-2,400
Human SEO SME 1,500

Additional Helpers

A recent edition to my short reading email summaries of useful articles is TLDR. While this is not new information the inclusion of 1 minute read, 2 minute read, 11 minute read is useful data to me in making an informed decision based on the factors at the moment. Other information that helps this example which is a newsletter is 300,000 Subscribers and 43% Open Rate. There are also other data points that help, and could narrow your audience and determine what you may consider and ideal size.

Returning to the summarized results of various search engines, only one, Bing, provided this additional measurement of time, and the answer was “average read time of 7 minutes captured the most attention.” which translated into 1,600 words.

I cannot ofter any personal validation of either of these data points, but I should perhaps start collecting it.

Conclusion

What is the answer? Well, only your target audience can inform you of this. The question(s) is then who is your target audience? Is your target audience who you think they are?

For the record, my last blog post was 1973 words long, and this one is 1216 words long, therefore averaging 1594 words. NOTE: These numbers were the original versions length, both of which have changed/evolved over time with additional feedback.

This leads to a more important question. How are you measuring the impact of your blog posts and how does size/length/time play a role in that?

Sidebar: Is a blog post actually the best way for people to read your content, or at least gain insights into what may be useful for your readers. Is a newsletter a better option?

Going back to the TLDR newsletter for a moment, this information can be found on the website.

  1. Highly technical audience, primarily software engineers and other tech workers
  2. 30% United States, 10% United Kingdom, 10% Canada, 25% other EU, 25% other non-EU
  3. 50% ages 25 to 34, 20% ages 18 to 24, 20% ages 35 to 44, 10% other
  4. Primary sponsors get between 1000 to 1250 clicks
  5. Developer sponsors get between 750 to 1000 clicks
  6. Subscribers from companies like: Google, Amazon, Facebook, Apple, … (it’s interesting this is a list of logos, and what order they are in, FWIW)

I do not have access to the data so I am unable to gain more insights as to what is most read articles based on time. Hint: Interesting infographic for TLDR to publish.

I would ask how do they know point 1 and point 3 of my information without additional data mining providing this detail? I provided an @gmail email address, and my location can be determined via IP.

Can a picture replace a text description?

Data visualization, data storytelling, and data lineage are all ways to better describe and visualize a specific situation for a set of data. Generally, I find these techniques are used as a means to uncover or identify information that ultimately pertains to individuals. For example how many sales have we made across time/location/business unit? How many customers do we have? How many social media photos has a person provided over a period of time? Unfortunately, this is not the kind of data that I feel has real-world meaning to me. It doesn’t describe the advancements made in the biomedical field to help fight disease, it doesn’t tell us the amount of energy that we have saved or the amount of energy that we failed to collect and the impact that has locally and globally on our world, or it doesn’t describe valuable human experiences in history about people and places. I find this value in data visualizations of others.

During a recent vacation, I thought about the impact of the visualization of my experiences and just how much information was not collected, and how much information was collected but is of average or poor value or is extremely valuable. How hard it was to collate even what I had collected, and to who or what the value of this information is? While these are personal experiences and not that of a commercial organization, Google certainly informed me of how many people were viewing a public image I uploaded and a comment I made of an iconic Australian location and food.

Inaccessible value in a text description

I recently caught up with a very dear friend. She had lost her husband 10 years ago after more than 50 years of marriage. He kept a written diary every year since 1946, starting a year after the end of world war II that he served it.

So from 1946 to 2012, that is 66 years, there is a wealth of information that includes personal feelings, expectations, and perhaps thoughts about what is going well and what’s not going well. It would also include valuable information about the world view from a very intelligent and influential individual. These diaries are still located today on the same shelf in the same office they have for the short 25 years I was aware.

To draw a conclusion to the question with a data analogy. A single copy, in a single location of un-indexed information, which first-hand sources know has unlocked potential. It is also an immutable and finite time capsule. It would also IMHO contain great value in feelings, emotions, family, and history that is important to a small community. Could a picture represent this data?

A picture

Whilst traveling I used my camera as a means to record the experiences that I was having with my family. I am a photographer and not a videographer so my expression is a picture rather than a video and optionally audio. Sidebar, we did give our child a diary for this vacation to write in, however the attempt to build this new habit only lasted 2 days. Forming new habits can be hard.

But can a picture relay adequate information to describe when and where this photo was taken, why it was taken, how I felt, who I was with, what did it inspire, how did it make me think about related experiences in the past.

Today there is technology now that can take a picture and describe the contents, effectively it could create a summarized description of the picture. With additional metadata such as Exif data where you can extract more details such as time and location. With machine learning you can do picture comparison to identify locations even if location was not specified with the photo.

You can now have AI create an image from a description, if only my Dall-E-2 account would not keep crashing I would try it out.

A picture on its own only contains some value. If you collect all this information and combined with other sources, for example when I used my phone and not my camera, this is stored use google photos. This company can use this information to create a timeline of where you were, when you were there, perhaps you were with and combined with all the sources this company has such as your Google calendar, and Gmail it can and does create a timeline much like the timeline you see in social media platforms such as Facebook if you are regularly user of such a platform.

So we have not a picture, but a collection of pictures, including those not taken or owned by yourself, combined with other structured and unstructured data that can provide an improved timeline.

In comparison in data visualization there is usually a time component for most data. Animated data visualizations which can be awesome usually represent data across time.

Example pictures

Let me give you some simple images as an example and I’ll add some information that is not included in the photos specifically what is available in today’s modern technology such as GPS location. First my existing EOS 5D camera does not provide that information and second I do not enable that on my phone because I want to keep that information private and Google does not provide a capability that would enable me to store personal information but do not share that information for consumption by for example use that information for other machine learning capabilities.

Emu

I had never seen an emu roaming in the wild in Australia before this trip. From a conversation with a friend on a different topic did she provide information that driving between Jeriderie and Narandera you will find emus in the wild. Indeed we did multiple times on this specific highway without having to randomly goto some isolated place hoping for the same outcome.

This is a truly unique animal without only an ostrich looking similar.
Emu in nature, NSW Australia

AWS Rekognition output with values >90% categorize this photo with Antelope, Animal, Mammal, Wildlife, Bird, Sheep, with parents categories of Wildlife, Mammal, Animal. Well that is horribly wrong. An Antelope has 4 legs, this clearly has 2. An Emu is not a Mammal.

AWS Rekognition Response – August 2022. Larger version

That was so bad (and unexpected) I wanted to give the technology another chance with a different emu pic.
Emu in nature, NSW Australia
This time AWS Rekognition output with values >90% describe this photo with Bird, Animal, Emu, Sheep, Mammal, and at 88% Antelope and Wildlife. So if you get Emu (that has two legs), why would you say Sheep which is 4 legs and not a bird? And if you said Emu and Bird, why would you then select Antelope, also with 4 legs, but so not a Bird.

AWS Rekogintion Response – August 2022. Larger version.

Feeling a bit duped by technology, I tried Google Image search next. The first image was recognized as “Tasmanian Emu”. I didn’t know there was such a thing, but it did say Emu and all other related visual matches were Emu. I was surprised it only picked 3 of the 4 animals in the first pic.

The second image was recognized as “Tasmanian emu”, “Emu” and “Common ostrich”. Doh!

Platypus

This next image I was confident AWS Rekognition would be spot on. It’s even a more unique animal, and there are no grass or obstacles to obscure the animal. Boy was I wrong.

Platypus in nature, Eungella QLD Australia

AWS Rekognition output with values >90% describe this photo with Wildlife, Animal, Mammal and at 87% Hippo. It is true that a Hippo is a mammal, and you do find them in water, but?

Platypus in nature, Eungella QLD Australia
Finding an even more evident picture that anybody would recognize, well the software could not. Wildlife, Animal, Mammal. At 85% Lizard, Reptile and Otter. A lizard is not a mammal?

This post is starting to turn into a self proposition even more than I thought.

Had I provided GPS, it would have said Eungella, QLD and any more additional searching would show that it’s a popular destination for finding a Platypus in nature. In-fact a precise GPS location would give the name literally as “Platypus Deck”. here. A human would quickly articulate this by reading text from basic online searches.

Would AWS Rekognition discount values responses if it could know the precise location or even the country. Somehow I feel not.

For what it’s worth, a Google Image Search of Platypus yields tons of pictures AWS should use in it’s recognition validate and ML model.

What this image does not say with whom I was traveling with, that is a local of the area and his comment that in 20 years I’ve not seen platypus as easily and playful as this. It would not describe that on social media, many locals were surprised with the quality of images and videos. It would also not describe what the video shows, how they dive and then burrow into the muddy water using their bill, then rise to the surface and dive again.

Trying Google Image search again, the first image yielded platypus which it is. For the second image, google found no results. Again, I was rather shocked in comparison to the images of platypus a Google Image search shows and the fact this second image showed more detail IMO.

If I correctly label this image with alt text as a platypus, will Google Image search in future show this within search results, or will the output of correct recognition improve at a later time.

Other animals

At this time I decided to give up on animals. An echidna is a unique animal unlike almost any other, a quokka is also, but does resemble a small kangaroo. I do have a hippo, I’ll have to try that out.

Other images

I decided to try some more easier images and I was again overall disappointed and therefore decided to stop adding content here after these two images.

Surfers during summer waves in Oahu

AWS Rekognition above 90% accuracy was Sea, Ocean, Water, Nature, Outdoors, Person, Human, Surfing, Sea Waves, Sport, Sports.
Google Image search gave the first visual response as Viewing summer Swells on Oahu, which is spot on. It was Oahu, not Maui (wonders if I have a surfer image from Maui), and it was more importantly in Summer and not winter which has much larger waves.

Finally I tried to pick one of the most uniquely observable images that you will only find in one location and this image also included readable text.
U.S.S Arizona Memorial at Pearl Harbor Hawaii

AWS Rekognition above 90% was Flag, Symbol, Watercraft, Vessel, Vehicle, Transportation, Waterfront, Water, with Ferry and Boat being 88%. Rather shocked it could not pull out not 1 but 2 specific and clear areas of text and used this. The visible text is “U.S.S ARIZONA MEMORIAL” and “USS ARIZONA BB 39″.

Google Image search described this image as the Pearl Harbor National Memorial which is exactly what it is (in summary).

Even a picture of the Sydney Opera House, a truly unique building did not yield the result of Sydney via AWS Rekognition.

I look forward to this post being indexed to see if Google can give the source of the image as my site. I am also going to have to look into Google APIs for image recognition rather then the very slow and painful web browser option.

In conclusion

To return to the question of this post. Can a picture replace a text description? In short, no. A picture can summarize and quickly convey meaning simply because of how our mind processes visuals but it cannot replace a full text description.

In these examples I would expect almost every reader to be about to summarize these images more accurately, whereas (accessible) machine learning has a very long way to go. Even with location information, only some people would be able to add additional value or better infer an initial description, however all the information I could convey that is of applicable value is not within the picture itself. You cannot interpret additional meaning and value without more context, or without intrinsic knowledge. Using the surfers example, you can find waves used by surfers all around the world. In Oahu specifically in summer, surfable waves are infrequent and small, whereas in winter apparently they are very different.

With data storytelling, a data visualization is going to provide a similar outcome. Better visualizations will contain color and legends that visually describe information more clearly. They will also offer additional insights in creative ways, such as magnifications or clear differentiators. Should photographs as these shown here, automatically contain further context that is shown just in the image. Should image recognition automatically suggest titles that can be further edited by the author? Should an audio summary of the image be able to be recorded with the image? Should any single picture use context of other pictures around locality or time or similarities in the view to build a better picture. It is interesting to consider how technology could improve to provide greater value to the consumer.

How much text is needed is an entirely different question. Is the saying “A picture is worth a 1,000 words” approximately accurate?

Even a complex picture that takes months to review all the detail does not replace a full text description. If you would like an example for comparison checkout the “L. Tellier Kitchen Poster”. A piece of art that I own.

A summer sabbatical

In recent weeks I have been sharing more informal thoughts and in the upcoming weeks, there will be a period of greater radio silence.

After three decades as a professional, I am taking the entire summer off. This will be a chance to intentionally not sit at a desk, stare at a screen, look at my phone, read emails, read articles, and all those other work and personal related activities one does.

Weekly Musings – July 8, 2022

A very succinct description of the responsibilities of leadership by Jawad Nagda (infographic below) shows a number of key features of management that are also needed in data storytelling such as empathy, integrity, and listening. I wish annual 1:1 performance reviews gave employees an opportunity to rate their manager in such detail? It would be an interesting infographic to compare your managers for the past 3-5 jobs visually?

The Data Management Value Realization Journey by Bill Schmarzo really shows the depth and breadth of what an organization needs to be prepared for. The infographic (shown below) provides a lot of detail and each time you look at it from a different perspective you can see a wealth of key terms and thoughts to value. This starts with your data, the velocity, variety, and volume of data, which is described as fast, diverse and deep. The value journey to operationalizing this information clearly outweighs the risks of not being prepared. As much of Sydney is now underwater, and with the frequency of “rain bombs” in Australia, how could your company prepare for an influx of data (a once-in 500-year event)? Could you filter valuable data from invaluable data and draw insights quickly, or would you need to create an infrastructure to do so and train resources? The timeliness of your investment may be too late. The Data Management Value Creation Journey Map should be at the forefront of your business for planning how information drives your business success.

When your computer is idle, is it really idle? Peter Zaitsev shared this article What does an idle CPU do? which is a great read. Reminds me of my very old Unix (yes before Linux) kernel core dump analysis, where I had the Unix source code in question. A computer does not just do nothing unless it’s powered off or sleeping.

AI is used in many different fields. DALL-E 2 is a service that will create art and images based on a description. While this sounds interesting, I consider ML/AI as tools to help improve our society, and our decision-making and remove and replace redundant workloads. I feel creative expression is a talent and gift of an individual and the value of the work is in the eyes of the beholder. DALL-E 2 had to learn by imitating other famous works of art, some artists would learn this way, but some are just naturally talented. Will there be an AI to critique works of art, and how would it describe DALL-2 E’s works?

Speaking of art. I have always been fascinated with water structures and large outdoor works of art. The Bellagio fountain is one example of that. And for those naysays of water usage do your research, this project is actually very water efficient. This video of ultra-slow motion fluid dynamics (2 minutes) is just incredible. (some screenshots below)

Some images of the week.

Volcano + lightning