Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very slow (and increasingly slow) to read cells from a read-only worksheet #66

Open
Phrogz opened this issue Nov 25, 2024 · 2 comments
Open

Comments

@Phrogz
Copy link

Phrogz commented Nov 25, 2024

Looping through 1000 rows and finding the first cell in each row in a read-only workbook takes 13s and calls openpyxl.worksheet._reader.WorkSheetParser.parse_cell() over 7 million times.

I have a ~large workbook with some ~large sheets. The .xlsx is 583kB, there are 9 worksheets:

sheet='alpha'   1015x18 = 18270 cells
sheet='beta'    786x18  = 14148 cells
sheet='gamma'   778x18  = 14004 cells
sheet='delta'   778x18  = 14004 cells
sheet='epsilon' 90x78   = 7020 cells
sheet='zeta'    104x112 = 11648 cells
sheet='eta'     198x18  = 3564 cells
sheet='theta'   55x18   = 990 cells
sheet='iota'    35x18   = 630 cells
import openpyxl
import time
from openpyxl.worksheet._reader import WorkSheetParser

book = openpyxl.load_workbook("test.xlsx", read_only=True, data_only=True)
sheet = book['alpha']
maxr = sheet.max_row
rows = range(1, maxr + 1)

print(maxr)                        #=> 1015
print(WorkSheetParser.parse_calls) #=> 0
t0 = time.perf_counter()
[sheet.cell(r, 1) for r in rows]   # I'm not even asking for the value!
print(time.perf_counter() - t0)    #=> 13.412
print(WorkSheetParser.parse_calls) #=> 7291413

The only change I made to the library was this change in _reader.py:

    parse_calls = 0
    def parse_cell(self, element):
        WorkSheetParser.parse_calls += 1
@Phrogz
Copy link
Author

Phrogz commented Nov 25, 2024

The first row takes ~0.4ms. The last row takes ~31ms. The slowdown is linear across the number of rows.

@Phrogz
Copy link
Author

Phrogz commented Nov 25, 2024

Using iter_rows() is significantly faster:

print(sheet.max_row)               #=> 1015
print(WorkSheetParser.parse_calls) #=> 0
t0 = time.perf_counter()
[row[0] for row in sheet.iter_rows()]
print(time.perf_counter() - t0)    #=> 0.0368
print(WorkSheetParser.parse_calls) #=> 14535

Interestingly:

  • If the workbook is loaded with read_only=False this test gets 6x faster.
  • If the workbook is loaded with read_only=True, Worksheet.iter_cols() does not exist.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant