Skip to content

Using StringFormat for one‐off parsing

Ben Yu edited this page Mar 30, 2024 · 21 revisions

I had a program, it reads a bunch of timeline data from a database table, and applies some business rules to generate output billing data.

For simplicity, let's assume the method looks like:

BillingResult runBilling(List<BillingTimelineRow> input);

One day in production we observed some unexpected result. And we narrowed it down to a certain time range (let's say it's 10 minutes of data).

In the principle of TDD, it's best if I can reproduce the problem in a unit test; do some local debugging to find where the bug is; then fix the bug. The new unit test will be submitted to prevent future regression.

The first thing I did was running a SQL query against the prod db: SELECT * FROM BillingTimeline WHERE timestamp between ... AND ....

The commandline SQL tool printed some dozens of rows that looked like this:

+---------+--------------------+------------+---------------------------+--------+---------------------+--------------------+-------------+----------------------------+
| ShardId |  PayloadTaskId     | Tier       | StartTimestamp            | PlanId | RedactedUserGroupId | ProfileName        | StartSecond |      Timeline              |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 27      | tttsk1             | PREMIUM    | 2023-12-05 17:51:00-08:00 | 234    | 123                 | testonlyaccount-2  | 21          | [1334, 3923]               |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 30      | tttsk2             | PREMIUM    | 2023-12-05 17:43:00-08:00 | 125    | 123                 | testonlyaccount-2  | 39          | [161, 54888, 77364, 14468] |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 34      | tttsk1             | PREMIUM    | 2023-12-05 17:42:00-08:00 | 789    | 999                 | testonlyaccount-1  | 40          | [129, 8877]                |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 47      | tttsk1             | PREMIUM    | 2023-12-05 17:42:00-08:00 | 110    | 123                 | testonlyaccount-3  | 28          | [1739, 6026]               |
+---------+--------------------+------------+---------------------------+--------+------------------------------+-----------+-------------+----------------------------+
| 22      | tttsk7             | PREMIUM    | 2023-12-05 17:35:00-08:00 | 226    | 456                 | testonlyaccount-24 | 1           | [26031, 42307, 13820]      |
+---------+--------------------+------------+---------------------------+--------+------------------------------+--------------------+-------------+-------------------+
...
51 row(s)

I need to feed this input data to the unit test.

Do I do some vim data masssaging to make it a csv file? Or maybe that commandline tool already has an option to output csv? Or maybe find another SQL GUI tool with richer supports?

But I hate yak shaving! On a second thought, why can't I just use this nice ASCII-arted data directly? It's just a freaking test, right?

With StringFormat, parsing is a pretty easy thing to do. Of course I need to first define the format of each data row:

StringFormat inputFormat = new StringFormat(
  "|{shard_id}|{payload_task_id}|{tier}|{start_timestamp}|{plan_id}|{user_group_id}|{profile_name}|{start_second}|{timeline}|\n");

Then I just need to read the text file, and scan through it for all valid rows:

String input = readTheTextFile(...);
List<BillingTimelineRow> timelines =
    inputFormat.scan(
        input,
        (shardId, taskId, tier, startTime, planId, userGroup, profile, startSecond, timeline) -> {
          try {
            return BillingTimelineRow.builder()
              .setShardId(Integer.parseInt(shardId.trim()))
              .setTaskId(taskId.trim())
              .setTier(Tier.valueOf(tier.trim()))
              ...
              .build();
          } catch (NumberFormatException notValidRow) {
            return null; // skip
          }
        }).collect(toList());

The parser is one-off. But in the future if I need to do similar debugging and tests, I can always just copy-paste from the commandline tool's output without worrying about data conversion. And it's nice to submit the pretty-printed input file to the repo so anyone can easily read it.