-
Notifications
You must be signed in to change notification settings - Fork 65
Using StringFormat for one‐off parsing
I had a program, it reads a bunch of timeline data from a database table, and applies some business rules to generate output billing data.
For simplicity, let's assume the method looks like:
BillingResult runBilling(List<BillingTimelineRow> input);
One day in production we observed some unexpected result. And we narrowed it down to a certain time range (let's say it's 10 minutes of data).
In the principle of TDD, it's best if I can reproduce the problem in a unit test; do some local debugging to find where the bug is; then fix the bug. The new unit test will be submitted to prevent future regression.
The first thing I did was running a SQL query against the prod db: SELECT * FROM BillingTimeline WHERE timestamp BETWEEN ... AND ...
(this table has no PII data)
The commandline SQL tool printed some dozens of rows that looked like this:
+---------+--------------------+------------+---------------------------+--------+---------------------+--------------------+-------------+----------------------------+
| ShardId | PayloadTaskId | Tier | StartTimestamp | PlanId | RedactedUserGroupId | ProfileName | StartSecond | Timeline |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 27 | tttsk1 | PREMIUM | 2023-12-05 17:51:00-08:00 | 234 | 123 | testonlyaccount-2 | 21 | [1334, 3923] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 30 | tttsk2 | PREMIUM | 2023-12-05 17:43:00-08:00 | 125 | 123 | testonlyaccount-2 | 39 | [161, 54888, 77364, 14468] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 34 | tttsk1 | PREMIUM | 2023-12-05 17:42:00-08:00 | 789 | 999 | testonlyaccount-1 | 40 | [129, 8877] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 47 | tttsk1 | PREMIUM | 2023-12-05 17:42:00-08:00 | 110 | 123 | testonlyaccount-3 | 28 | [1739, 6026] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 22 | tttsk7 | PREMIUM | 2023-12-05 17:35:00-08:00 | 226 | 456 | testonlyaccount-24 | 1 | [26031, 42307, 13820] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+--------------------+-------------+-------------------+
...
51 row(s)
I need to feed this input data to the unit test.
Do I do some vim data masssaging to make it a csv file? Or maybe that commandline tool has an option to tweak the output format? Or maybe find another SQL GUI tool with richer support?
But I hate yak shaving! On a second thought, why can't I just use this nice ASCII-arted data directly? It's just a freakin test, right?
With StringFormat
, parsing is a pretty easy thing to do.
First I need to first define the format of each data row (The shardId isn't useful to me and the tier is always PREMIUM so I don't need it either):
StringFormat inputFormat = new StringFormat(
"|{...}|{payload_task_id}|{...}|{start_timestamp}|{plan_id}|{user_group_id}|{profile_name}|{start_second}|{timeline}|");
I'll also need to parse the start_timestamp
into Instant
objects as expected by the code. For that I should get the right DateTimeFormatter
that matches the format in db. Luckily DateTimeFormatter
gets me that without me having to check the cheatsheet.
Then I just need to read the text file, and scan through it for all valid rows:
import static com.google.mu.time.DateTimeFormats.formatOf;
String input = readTheTextFile(...);
DateTimeFormatter timestampFormat = formatOf("2023-12-05 17:51:00-08:00");
List<BillingTimelineRow> timelines = inputFormat.scan(
input,
(taskId, startTime, planId, userGroup, profile, startSecond, timeline) -> {
try {
return BillingTimelineRow.builder()
.setUserGroupId(Long.parseLong(userGroup.trim()))
.setStartTimestamp(
OffsetDateTime.parse(startTime.trim(), timestampFormat)).toInstant())
...
.build();
} catch (NumberFormatException notValidRow) {
return null; // skip
}
}).collect(toList());
The parser is one-off. But in the future if I need to do similar debugging and tests, I can always just copy-paste from the commandline tool's output without worrying about data conversion. And it's nice to submit the pretty-printed input file to the repo so anyone can easily read it.