Digital Tech Trek Digest [#Issue 2024.09]

As an entrepreneur, pricing is an important consideration in any evaluation, development, and customer testing. In How To Price A SaaS Product, we see different pricing strategies, cost-based pricing, competitor-based pricing, penetration pricing, value-based pricing, freemium pricing. None of these match what I am ultimately considering: consumption-based pricing. Pricing is critical to define the value proposition statement and determine the range of the total lifetime value (TLV). It can vary greatly for B2C, B2B, and B2B enterprise offerings. If we look at YCombinator https://www.ycombinator.com/library/6h-startup-pricing-101 a basic principle is determining the gap between price and cost. That is your margin and your incentive to sell, and you work with either cost-plus or value-based pricing. Starting with founder-led sales is difficult as you do not have the luxury of a dedicated and experienced head of sales to work on different models and guide a technical founder, even before you enter the minefield of enterprise sales with applicable bids, contract, and compliance complexities. I am drawn back to “Consumption-based pricing is a pricing model that charges customers based on their product or service usage. Consumption-based pricing calculates pricing based on usage volume rather than the number of users and is a popular pricing model for IT services, SaaS, and cloud computing and storage” Cite: Consumption-Based Pricing.

Moving a Billion Postgres Rows on a $100 Budget

I wrote recently about the 1 Billion Row Challenge (1BR). This week, I found this article on the same number with a different title. The objective was not performance; it was cost. PeerDB enables the efficient extraction of data from PostgreSQL into a data warehouse, such as Big Query, ClickHouse or Snowflake. It was interesting to see Arvo as a format used over, for example, Parquet. The product also offers different streaming modes, including log-based (CDC), cursor-based (timestamp or integer), and XMIN-based. I will need to do further research on this new term XMIN-Based.

Test queries against your production database (responsibly)

This post links off to a YouTube video of The Safest Way to Test Postgres Destructive Queries, which provides a basic introduction to branching of the Neon PostgreSQL DBaaS. While the title originally interested me, the example showing the mechanics is like many other product examples in which it is extremely simplistic and not a true representation of “production” size or workload. I see this as a similar concept to AWS RDS Aurora cloning. However, any example should modify the structure of a table, measure the impact of that structure against production queries (note plural), and provide additional metadata rather than just a response time. These are important considerations in my own evaluation of test coverage of data access and the gathering of configuration, data, and infrastructure when running experiments to determine a more optimal data access path or a new functionality requirement. More documentation can be found here on Neon Branching

About “Digital Tech Trek Digest”

Most days, I take some time early in the morning to scan my inbox newsletters, the news, LinkedIn, or other sources to read something new about professional and personal topics of interest. I turn what I read into actionable notes in a short, committed time window, summarizing what I learned, what I should learn and use, or what is of random interest.

Some of my regular sources include TLDR, Forbes Daily, ThoughWorks Podcasts, Daily Dose of Data Science and BoringCashCow. Also Scientific American Technology, Fareed’s Global Briefing, Software Design: Tidy First? by Kent Beck, Last Week in AWS, Micro Newsletter to name a few.

Random Wisdom

This week, I was reminded via a very interesting statement that work-life balance and joy in what you do are critically important. You will not find on a tombstone the statement:

“I never worked enough hours.”

Data Masking 101

I continue to dig up and share this simple approach for production data masking via SQL to create testing data sets. Time to codify it into a post.

Rather than generating a set of names and data from tools such as Mockaroo, it is more practical to use actual data for a variety of testing reasons.

The SQL below is a self-explanatory approach of removing Personal Identifiable Information (PII), but keeping data relevant. I use this approach for a number of reasons.

  • We are using production data rather than synthetic data. Data volume, distribution, and additional column values are realistic. This is a subset of an example, but dates and locations are therefore realistic
  • Indexes (and unique indexes) still work, and distribution across the index is adequate for searching. Technically the index will be a little larger in disk footprint.
  • You cannot reverse engineer the masked value into a real value with just this data set. An engineer in a test environment cannot obtain the underlying information.
  • If you identify an issue with data quality for any row of data, there is a way to present the uniqueness of that row. This enables a person with production access to match the underlying row. Of course, any unique identifier (auto increment or UUID) should also be modified to mask real data.


SELECT CONCAT(SUBSTR(first_name,1,2),REPEAT('*',LENGTH(first_name)-2)) AS first_name,
CONCAT(SUBSTR(last_name,1,3),REPEAT('*',LENGTH(last_name)-3),' ', SUBSTRING(MD5(CONCAT(first_name,last_name)),1,6)) AS last_name,
CONCAT(SUBSTR(organization,1,3),REPEAT('*',LENGTH(organization)-3),' ', SUBSTRING(MD5(CONCAT(organization)),1,6)) AS organization,
created, country
FROM customer
LIMIT 10;

+------------+--------------------+------------------+---------------------+---------+
| first_name | last_name | organization | created | country |
+------------+--------------------+------------------+---------------------+---------+
| Sa**** | Cor**** 4c23cd | Ski*** d21420 | 2022-09-20 03:30:14 | PH |
| Fu**** | Wat*** 8b97de | Jax***** e629c2 | 2022-04-08 03:20:22 | BY |
| Mo**** | Zis***** b11d94 | Rhy**** b4073a | 2022-10-06 15:58:38 | IR |
| So**** | Bad** 232cc2 | Rhy*** 1734bd | 2022-02-01 07:35:39 | ID |
| Ni***** | Ter***** d9ffb5 | Wor****** 6e476c | 2021-11-08 17:07:34 | IL |
| Ka****** | Scr***** 9201db | Jax**** 481fd8 | 2022-08-18 19:17:54 | BR |
| Li*** | Coz** 0447f6 | Nlo**** 11da59 | 2022-07-29 06:47:56 | HR |
| Ch***** | Hal******** f5d9c8 | Zoo**** c6e07d | 2022-09-28 04:54:30 | UA |
| Er****** | Ste******* d005f2 | Eid** ffc305 | 2022-04-28 18:50:11 | PT |
| Fo** | O'S***** b35c44 | Buz**** 2c8598 | 2022-09-11 02:05:55 | RU |
+------------+--------------------+------------------+---------------------+---------+

A reliable and dependable application requires observability

Observability (o11y) is a critical pre-requisite component in software architecture when advocating for and preparing organizations for making informed decisions on the success of their application. Open Telemetry from the Cloud Native Computing Foundation is the goto standard regardless of your choices of monitoring tools. However, observability is just a building block that I need to explain when advocating for having reliable and dependable systems. Observability will not inform you “Was the customer actually impacted, or how many, or how long?”. Observability will not tell you “the root cause of a problem?”.

My five layers of building blocks for Reliability are:

  1. Observability – The collection of telemetry (metrics, traces and logs) should just be there. If you are using Kubernetes (k8s) and a Java/Python/Node.js application it is already built in. Just do it.
  2. Reproducibility – The ability with a known set of steps and a given configuration and setup you can reproduce an outcome showing the same observed results is a necessary pre-requisite for any feature development or bug fixes.
  3. Testability – After being able to consistently reproduce an observed event with measurable results, the running of various experiments using a variety of changes enables you to adequately test future improvements or corrections to the initial situation, whether it’s a known bug, or a new piece of functionality. Reproducible and consistent testing is an essential component to the release of software for a reliable application.
  4. Scalability – It is impossible to adequately test a system to failure without an observable, reproducible, and testable framework. Many organizations suffer from the management “Can we support X operations” syndrome, when instead the application should know what “X” is automatically, and have adequate safeguards in place to prevent its occurrence. The ability to proactively disable [expensive] features for the good of the entire system is not a common practice for software (aka a dark mode). In fact, many organizations do not even have the capability of customer-level and individual component-level feature flags or related rate limits that can manually be implemented.
  5. Dependability – A reliable, highly available, and dependable application requires all of the prior layers to be in place to give a level of assurance to your customers and your company that your product is dependable.

Our Data Security Moonshot Starts With Prevention

The recent re-announcement of the Cancer Moonshot highlighted a common enemy to many endeavors to improve our society as a whole, and that is using common sense and already known methods.

At a high level The goal of the Cancer Moonshot Scholars program is to inspire and support the next generation of world-class and diverse researchers focused on scientific breakthroughs that will make a difference for patients and drive progress toward the goal of ending cancer as we know it today. source fact sheet

As stories of this announcement filtered thru news outlets with interviews of medical professionals, a known thread appeared. Both lacking in the message, and the single greatest advancement to the problem, which is already known, is prevention. This includes known prevention measures, early detection measures, and education.

As a Data Strategist, Data Security is a critical component of any business and the single best defense is prevention and using common sense.

Here are just some simple basics that seem to have to be discussed and argued repeatedly company after company, product after product, yet there is no single effort to eliminate these poor practices.

  • No clear text passwords. If you have to enter a password on the command line (cough cough AWS CLI) or put a clear text password in a configuration file (cough cough 100s of products), you enable simple techniques to obtain unvetted access to your data.
  • Using clear text passwords is amplified when products offer a more secure means of access and identity management but they do not employ it everywhere.Check out Password Plaintext Storage
  • Clear text transport. It pains me to say it but even in recent employment that held critical PII data, I could not enforce TLS communication between applications and databases. While it was as simple as a configuration option, the constant excuses by engineering management were it was too hard to implement (cough cough BS).
  • The default configuration settings for a product need to be secure, not the default that is most compatible with prior versions. For example, if you launch a new cloud instance database with defaults, is it the most secure options, or the least secure options>
  • Credential rotation. Long-lived credentials should just be eliminated. Often these are also not named users, but commonly used processes.
  • Communicating passwords in clear text. This should never ever happen, yet it does. Have you ever received an (insecure protocol) email saying here is your username and password? A short known list of 5880+ sites can be found in the https://plaintextoffenders.com/ list on github offenders.csv.
  • Data systems accessible via the public internet For example MongoDB article, MySQL/MariaDB article, Redis & ElasticSearch etc, etc
  • Data systems that have no credentials required
  • Data systems that have default credentials that were never changed
  • Storing passwords in clear text
  • Storing passwords with a single salt
  • Storing passwords with a symmetric encryption approach
  • Administrators that use a common account for “root” privileges, not individual named accounts
  • Not patching products with fixed vulnerabilities CVS Program Mission
  • and the list could go on and on….

In all of the above points, there are numerous examples of these data security anti-patterns. While many are due to the products in use, some of these examples represent poor business practices. It should not have to be explained that most attacks and breaches are internal. The common and very incorrect attitude of, we are within our Virtual Private Cloud (VPC) we do not need to encrypt our data is well, plainly wrong.

Transparency

One of the greatest threats to businesses is ransomware. Attackers gain access to system via various means, those above are just the simplest means and then hold businesses ransom. Ransomware has multiple impacts including the loss of a business operating, the process and time of making a decision, the penalty for payment to release the random, and generally the threat of release of their data if a fine is not paid.

There is a lot to unpack even with this ransomware statement. Can you not restore your entire business operations within a suitable RTO and RPO? Is important data not encrypted. Are passwords in your business able to de-encrypted (this should never even be possible). Do you have a disaster recovery (DR) strategy? Can you access critical data via others means and systems independently?

The stigma of a ransomware attack is organizations do not share this openly. They do not share why it happened, what could have been done to prevent this, and sharing all information with federal authorities that should be tracking all occurrences. This information is an important and critical education feedback loop for the whole industry and IMO lacking of attention. Do you know of a website that shared known ransomware attack vectors.

Conclusion

If security is an important aspect to the data in your organization, can you name the people in your security department? Can any individual point out an insecure product with a known fixed vulnerability? Is that information transparent? Is there a process to address that as a top priority, moving engineering and operations goals accordingly? While organizations may employ an error budget for outages, do they employ a security vulnerability budget? Do companies note version updates of all their software, have people read ALL the release notes of each point release, or even know every version of each software product in use in the organization?

For more information, check out

There will always be better and more determined attempts to attack data systems, we have to stop the most obvious first, and we have to participate in identification and remediation endeavors.

Using a simple relatable example to every person, your home. We should start with not leaving the door open, or leaving the keys in the door or simply removing the door all together.