
Improving Performance #8

Open
j3soon opened this issue Apr 21, 2022 · 2 comments
Labels: enhancement (New feature or request)

Comments

j3soon (Owner) commented Apr 21, 2022

Suppose we have an event file containing 10^6 scalar events:

import os
from torch.utils.tensorboard import SummaryWriter

N_EVENTS = 10 ** 6
log_dir = "./tmp"
# Write 10^6 scalar events into a single run directory.
writer = SummaryWriter(os.path.join(log_dir, 'run'))
for i in range(N_EVENTS):
    writer.add_scalar('y=2x', i * 2, i)
# Flush and close the writer so all events are written to disk.
writer.close()

and compare the loading time between pivot=False and pivot=True:

import time
from tbparse import SummaryReader

def time_tbparse():
    # Compare the loading time without and with pivoting.
    for use_pivot in (False, True):
        start = time.time()
        reader = SummaryReader("./tmp", pivot=use_pivot)
        df = reader.scalars
        end = time.time()
        print(f"pivot={use_pivot}:", end - start)

time_tbparse()

The results are roughly 11 seconds for pivot=False and 24 seconds for pivot=True on my Intel i7-9700 CPU and Seagate ST8000DM004 HDD. Using pivot=True takes about twice as long as pivot=False, and the slowdown is even more pronounced when parsing multiple event files.

If we profile the code with cProfile:

import cProfile
cProfile.run('time_tbparse()')

we can see the results:

         206029117 function calls (191028625 primitive calls) in 66.427 seconds

   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...
        6    0.000    0.000   34.819    5.803 apply.py:143(agg)
        3    0.000    0.000   34.819   11.606 apply.py:308(agg_list_like)
...
  3000000    5.838    0.000   24.541    0.000 summary_reader.py:209(_merge_values)
      6/2    0.001    0.000   35.408   17.704 summary_reader.py:237(get_events)
...
        2    0.001    0.001   35.409   17.705 summary_reader.py:304(scalars)
      6/2    0.169    0.028   31.403   15.701 summary_reader.py:61(__init__)
...

The bottleneck is located in the _merge_values function (called from get_events in the profile above), which is not executed when pivot=False.
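For reference, the profile can also be narrowed to tbparse's own frames with pstats to make the hotspot easier to spot (a minimal sketch; the output file name is arbitrary):

import cProfile
import pstats

# Save the full profile to a file, then print only frames from
# summary_reader.py, sorted by cumulative time.
cProfile.run('time_tbparse()', 'tbparse.prof')
stats = pstats.Stats('tbparse.prof')
stats.sort_stats('cumulative').print_stats('summary_reader')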

I believe the _merge_values function can be optimized to improve the performance when using pivot=True.

Moreover, it would be nice to provide some benchmarks and document the performance analysis in the README file, which would be useful for future optimizations.

j3soon (Owner) commented May 18, 2022

The performance is slightly improved in commits 5d69fa1 and 4bd8740. Several benchmarks are provided in tbparse/profiling.

To further accelerate the parsing process, there are two potential solutions: Numba (supported by pandas) and cuDF.

For parsing a single event file, the bottlenecks are located in get_cols(...) and grouped.aggregate(self._merge_values).

  • Accelerating _merge_values with Numba is not straightforward due to the object data type and the unknown length of the resulting output.
  • As for get_cols(...), we know the number of rows/columns and the data types beforehand (based on the tensorboard event data). Therefore, it's possible to replace the lists with NumPy arrays of fixed length and a non-object dtype.

So the next step is to rewrite the get_cols(...) functions in NumPy array style and provide an option that lets Numba JIT-compile them.
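A rough sketch of the idea (not the actual get_cols implementation; it assumes the step/value pairs for one tag are already available as typed NumPy arrays and that the steps are contiguous integers starting at 0, as in the synthetic run above):

import numpy as np
from numba import njit

@njit(cache=True)
def build_column(steps, values, n_steps):
    # n_steps is known beforehand from the event data, so the output can be
    # preallocated with a fixed length and a concrete (non-object) dtype,
    # which is what allows Numba to compile this loop.
    col = np.full(n_steps, np.nan, dtype=np.float64)
    for i in range(steps.shape[0]):
        col[steps[i]] = values[i]
    return col

# Hypothetical usage with the synthetic run generated above:
# steps = np.arange(N_EVENTS, dtype=np.int64)
# values = 2.0 * steps
# col = build_column(steps, values, N_EVENTS)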

Update (2022/11/17): Similar to Numba, cuDF also does not support the object data type as mentioned here.

j3soon changed the title from "Poor performance when using pivot=True" to "Improving Performance" on Aug 6, 2022
j3soon (Owner) commented Aug 6, 2022

When parsing many event files inside a deep filesystem hierarchy, parsing can be very slow.

This is due to the recursive tree-parsing logic (a poor design) that combines the DataFrames constructed in each subroutine, making the worst-case time complexity $O(n^2)$ for $n$ files.

The solution is to remove the recursive combination step and concatenate all DataFrames at once, improving the worst-case time complexity to $O(n)$.
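A minimal sketch of the two strategies, using a placeholder parse_file instead of the real per-file parsing logic:

import pandas as pd

def parse_file(path):
    # Placeholder for the real per-file parsing logic.
    return pd.DataFrame({"file": [path], "value": [0.0]})

def combine_recursively(paths):
    # Concatenating inside the loop copies all earlier rows on every
    # iteration, so the total work is O(n^2) for n files.
    df = pd.DataFrame()
    for p in paths:
        df = pd.concat([df, parse_file(p)], ignore_index=True)
    return df

def combine_at_once(paths):
    # Collect all per-file DataFrames first and concatenate once: O(n).
    dfs = [parse_file(p) for p in paths]
    return pd.concat(dfs, ignore_index=True)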

j3soon added the enhancement label on Dec 29, 2022