-
Notifications
You must be signed in to change notification settings - Fork 65
Using StringFormat for one‐off parsing
I had a program, it reads a bunch of timeline data from a database table, and applies some business rules to generate output billing data.
For simplicity, let's assume the method looks like:
BillingResult runBilling(List<BillingTimelineRow> input);
One day in production we observed some unexpected result. And we narrowed it down to a certain time range (let's say it's 10 minutes of data).
In the principle of TDD, it's best if I can reproduce the problem in a unit test; do some local debugging to find where the bug is; then fix the bug. The new unit test will be submitted to prevent future regression.
The first thing I did was running a SQL query against the prod db: SELECT * FROM BillingTimeline WHERE timestamp between ... AND ...
.
The commandline SQL tool printed some dozens of rows that looked like this:
+---------+--------------------+------------+---------------------------+--------+---------------------+--------------------+-------------+----------------------------+
| ShardId | PayloadTaskId | Tier | StartTimestamp | PlanId | RedactedUserGroupId | ProfileName | StartSecond | Timeline |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 27 | tttsk1 | PREMIUM | 2023-12-05 17:51:00-08:00 | 234 | 123 | testonlyaccount-2 | 21 | [1334, 3923] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 30 | tttsk2 | PREMIUM | 2023-12-05 17:43:00-08:00 | 125 | 123 | testonlyaccount-2 | 39 | [161, 54888, 77364, 14468] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 34 | tttsk1 | PREMIUM | 2023-12-05 17:42:00-08:00 | 789 | 999 | testonlyaccount-1 | 40 | [129, 8877] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 47 | tttsk1 | PREMIUM | 2023-12-05 17:42:00-08:00 | 110 | 123 | testonlyaccount-3 | 28 | [1739, 6026] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 22 | tttsk7 | PREMIUM | 2023-12-05 17:35:00-08:00 | 226 | 456 | testonlyaccount-24 | 1 | [26031, 42307, 13820] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+--------------------+-------------+-------------------+
...
51 row(s)
I need to feed this input data to the unit test.
Do I do some vim data masssaging to make it a csv file? Or maybe that commandline tool already has an option to output csv? Or maybe find another SQL GUI tool with richer supports?
But I hate yak shaving! On a second thought, why can't I just use this nice ASCII-arted data directly? It's just a freaking test, right?
With StringFormat
, parsing is a pretty easy thing to do. Of course I need to first define the format of each data row:
StringFormat inputFormat = new StringFormat(
"|{shard_id}|{payload_task_id}|{tier}|{start_timestamp}|{plan_id}|{user_group_id}|{profile_name}|{start_second}|{timeline}|\n");
Then I just need to read the text file, and scan through it for all valid rows:
String input = readTheTextFile(...);
List<BillingTimelineRow> timelines =
inputFormat.scan(
input,
(shardId, taskId, tier, startTime, planId, userGroup, profile, startSecond, timeline) -> {
try {
return BillingTimelineRow.builder()
.setShardId(Integer.parseInt(shardId.trim()))
.setTaskId(taskId.trim())
.setTier(Tier.valueOf(tier.trim()))
...
.build();
} catch (NumberFormatException notValidRow) {
return null; // skip
}
}).collect(toList());
The parser is one-off. But in the future if I need to do similar debugging and tests, I can always just copy-paste from the commandline tool's output without worrying about data conversion. And it's nice to submit the pretty-printed input file to the repo so anyone can easily read it.