Skip to content

Using StringFormat for one‐off parsing

Ben Yu edited this page Mar 31, 2024 · 21 revisions

I had a program, it reads a bunch of timeline data from a database table, and applies some business rules to generate output billing data.

For simplicity, let's assume the method looks like:

BillingResult runBilling(List<BillingTimelineRow> input);

One day in production we observed some unexpected result. And we narrowed it down to a certain time range (let's say it's 10 minutes of data).

In the principle of TDD, it's best if I can reproduce the problem in a unit test; do some local debugging to find where the bug is; then fix the bug. The new unit test will be submitted to prevent future regression.

The first thing I did was running a SQL query against the prod db: SELECT * FROM BillingTimeline WHERE timestamp BETWEEN ... AND ... (this table has no PII data)

The commandline SQL tool printed some dozens of rows that looked like this:

+---------+--------------------+------------+---------------------------+--------+---------------------+--------------------+-------------+----------------------------+
| ShardId |  PayloadTaskId     | Tier       | StartTimestamp            | PlanId | RedactedUserGroupId | ProfileName        | StartSecond |      Timeline              |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 27      | tttsk1             | PREMIUM    | 2023-12-05 17:51:00-08:00 | 234    | 123                 | testonlyaccount-2  | 21          | [1334, 3923]               |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 30      | tttsk2             | PREMIUM    | 2023-12-05 17:43:00-08:00 | 125    | 123                 | testonlyaccount-2  | 39          | [161, 54888, 77364, 14468] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 34      | tttsk1             | PREMIUM    | 2023-12-05 17:42:00-08:00 | 789    | 999                 | testonlyaccount-1  | 40          | [129, 8877]                |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 47      | tttsk1             | PREMIUM    | 2023-12-05 17:42:00-08:00 | 110    | 123                 | testonlyaccount-3  | 28          | [1739, 6026]               |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 22      | tttsk7             | PREMIUM    | 2023-12-05 17:35:00-08:00 | 226    | 456                 | testonlyaccount-24 | 1           | [26031, 42307, 13820]      |
+---------+--------------------+------------+---------------------------+--------+------------------------------+--------------------+-------------+-------------------+
...
51 row(s)

I need to feed this input data to the unit test.

Do I do some vim data masssaging to make it a csv file? Or maybe that commandline tool has an option to tweak the output format? Or maybe find another SQL GUI tool with richer support?

But I hate yak shaving! On a second thought, why can't I just use this nice ASCII-arted data directly? It's just a freakin test, right?

With StringFormat, parsing is a pretty easy thing to do.

First I need to define the format of each data row (The shardId isn't useful to me and the tier is always PREMIUM so I don't need it either):

StringFormat inputFormat = new StringFormat(
  "|{...}|{payload_task_id}|{...}|{start_timestamp}|{plan_id}|{user_group_id}|{profile_name}|{start_second}|{timeline}|");

I'll also need to parse the start_timestamp into Instant objects as expected by the code. For that I should get the right DateTimeFormatter that matches the format in db. Luckily DateTimeFormatter gets me that without me having to check the cheatsheet.

Then I just need to read the text file, and scan through it for all valid rows:

import static com.google.mu.time.DateTimeFormats.formatOf;

String input = readTheTextFile(...);
DateTimeFormatter timestampFormat = formatOf("2023-12-05 17:51:00-08:00");
List<BillingTimelineRow> timelines = inputFormat.scan(
    input,
    (taskId, startTime, planId, userGroup, profile, startSecond, timeline) -> {
      try {
        return BillingTimelineRow.builder()
            .setUserGroupId(Long.parseLong(userGroup.trim()))
            .setStartTimestamp(
                OffsetDateTime.parse(startTime.trim(), timestampFormat)).toInstant())
            ...
            .build();
      } catch (NumberFormatException notValidRow) {
        return null; // skip
      }
    }).collect(toList());

The parser is one-off. But in the future if I need to do similar debugging and tests, I can always just copy-paste from the commandline tool's output without worrying about data conversion. And it's nice to submit the pretty-printed input file to the repo so anyone can easily read it.