-
Notifications
You must be signed in to change notification settings - Fork 65
Using StringFormat for one‐off parsing
I had a program, it reads a bunch of timeline data from a database table, and applies some business rules to generate output billing data.
For simplicity, let's assume the method looks like:
BillingResult runBilling(List<BillingTimelineRow> input);
One day in production we observed some unexpected result. And we narrowed it down to a certain time range (let's say it's 10 minutes of data).
In the principle of TDD, it's best if I can reproduce the problem in a unit test; do some local debugging to find where the bug is; then fix the bug. The new unit test will be submitted to prevent future regression.
The first thing I did was running a SQL query against the prod db: SELECT * FROM BillingTimeline WHERE timestamp BETWEEN ... AND ...
(this table has no PII data)
The commandline SQL tool printed some dozens of rows that looked like this:
+---------+--------------------+------------+---------------------------+--------+---------------------+--------------------+-------------+----------------------------+
| ShardId | PayloadTaskId | Tier | StartTimestamp | PlanId | RedactedUserGroupId | ProfileName | StartSecond | Timeline |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 27 | tttsk1 | PREMIUM | 2023-12-05 17:51:00-08:00 | 234 | 123 | testonlyaccount-2 | 21 | [1334, 3923] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 30 | tttsk2 | PREMIUM | 2023-12-05 17:43:00-08:00 | 125 | 123 | testonlyaccount-2 | 39 | [161, 54888, 77364, 14468] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 34 | tttsk1 | PREMIUM | 2023-12-05 17:42:00-08:00 | 789 | 999 | testonlyaccount-1 | 40 | [129, 8877] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 47 | tttsk1 | PREMIUM | 2023-12-05 17:42:00-08:00 | 110 | 123 | testonlyaccount-3 | 28 | [1739, 6026] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 22 | tttsk7 | PREMIUM | 2023-12-05 17:35:00-08:00 | 226 | 456 | testonlyaccount-24 | 1 | [26031, 42307, 13820] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+--------------------+-------------+-------------------+
...
51 row(s)
I need to feed this input data to the unit test.
Do I do some vim data masssaging to make it a csv file? Or maybe that commandline tool has an option to tweak the output format? Or maybe find another SQL GUI tool with richer support?
But I hate yak shaving! On a second thought, why can't I just use this nice ASCII-arted data directly? It's just a freakin test, right?
With StringFormat
, parsing is a pretty easy thing to do.
First I need to define the format of each data row. The ShardId and Tier aren't needed so I just ignore them in the format string:
StringFormat inputFormat = new StringFormat(
"|{...}|{payload_task_id}|{...}|{start_timestamp}|{plan_id}|{user_group_id}|{profile_name}|{start_second}|{timeline}|");
I'll also need to parse the start_timestamp
into Instant
objects as expected by the code. This requires the right DateTimeFormatter
that matches the format in db. Luckily DateTimeFormats
gets me that without needing to check some cheatsheet.
Then I just need to read the text file, and scan through it for all valid rows:
import static com.google.mu.time.DateTimeFormats.formatOf;
String input = readTheTextFile(...);
DateTimeFormatter timestampFormat = formatOf("2023-12-05 17:51:00-08:00");
List<BillingTimelineRow> timelines = inputFormat.scan(
input,
(taskId, startTime, planId, userGroup, profile, startSecond, timeline) -> {
try {
return BillingTimelineRow.builder()
.setUserGroupId(Long.parseLong(userGroup.trim()))
.setStartTimestamp(
OffsetDateTime.parse(startTime.trim(), timestampFormat)).toInstant())
...
.build();
} catch (NumberFormatException notValidRow) {
return null; // skip
}
}).collect(toList());
The parser is one-off. But in the future if I need to do similar debugging and tests, I can always just copy-paste from the commandline tool's output without worrying about data conversion. And it's nice to submit the pretty-printed input file to the repo so anyone can easily read it.