Wire in new layout model
VikParuchuri committed May 1, 2024
1 parent 30da488 commit 1dfbd0d
Showing 7 changed files with 103 additions and 30 deletions.
32 changes: 32 additions & 0 deletions .github/workflows/cla.yml
@@ -0,0 +1,32 @@
name: "Marker CLA Assistant"
on:
  issue_comment:
    types: [created]
  pull_request_target:
    types: [opened,closed,synchronize]

# explicitly configure permissions, in case your GITHUB_TOKEN workflow permissions are set to read-only in repository settings
permissions:
  actions: write
  contents: write
  pull-requests: write
  statuses: write

jobs:
  CLAAssistant:
    runs-on: ubuntu-latest
    steps:
      - name: "Marker CLA Assistant"
        if: (github.event.comment.body == 'recheck' || github.event.comment.body == 'I have read the CLA Document and I hereby sign the CLA') || github.event_name == 'pull_request_target'
        uses: contributor-assistant/[email protected]
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # the below token should have repo scope and must be manually added by you in the repository's secret
          # This token is required only if you have configured to store the signatures in a remote repository/organization
          PERSONAL_ACCESS_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        with:
          path-to-signatures: 'signatures/version1/cla.json'
          path-to-document: 'https://github.com/VikParuchuri/marker/blob/master/CLA.md'
          # branch should not be protected
          branch: 'master'
          allowlist: VikParuchuri
24 changes: 24 additions & 0 deletions CLA.md
@@ -0,0 +1,24 @@
Marker Contributor Agreement

This Marker Contributor Agreement ("MCA") applies to any contribution that you make to any product or project managed by us (the "project"), and sets out the intellectual property rights you grant to us in the contributed materials. The term "us" shall mean Vikas Paruchuri. The term "you" shall mean the person or entity identified below.

If you agree to be bound by these terms, sign by writing "I have read the CLA document and I hereby sign the CLA" in response to the CLA bot GitHub comment. Read this agreement carefully before signing. These terms and conditions constitute a binding legal agreement.

1. The term 'contribution' or 'contributed materials' means any source code, object code, patch, tool, sample, graphic, specification, manual, documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and registrations, in your contribution:
- you hereby assign to us joint ownership, and to the extent that such assignment is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license to exercise all rights under those copyrights. This includes, at our option, the right to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements, including dual-license structures for commercial customers;
- you agree that each of us can do all things in relation to your contribution as if each of us were the sole owners, and if one of us makes a derivative work of your contribution, the one who makes the derivative work (or has it made) will be the sole owner of that derivative work;
- you agree that you will not assert any moral rights in your contribution against us, our licensees or transferees;
- you agree that we may register a copyright in your contribution and exercise all ownership rights associated with it; and
- you agree that neither of us has any duty to consult with, obtain the consent of, pay or render an accounting to the other for any use or distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, royalty-free license to:
- make, have made, use, sell, offer to sell, import, and otherwise transfer your contribution in whole or in part, alone or in combination with or included in any product, work or materials arising out of the project to which your contribution was submitted, and
- at our option, to sublicense these same rights to third parties through multiple levels of sublicensees or other licensing arrangements.
If you or your affiliates institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the contribution or any project it was submitted to constitutes direct or contributory patent infringement, then any patent licenses granted to you under this agreement for that contribution shall terminate as of the date such litigation is filed.
4. Except as set out above, you keep all right, title, and interest in your contribution. The rights that you grant to us under these terms are effective on the date you first submitted a contribution to us, even if your submission took place before the date you sign these terms. Any contribution we make available under any license will also be made available under a suitable FSF (Free Software Foundation) or OSI (Open Source Initiative) approved license.
5. You covenant, represent, warrant and agree that:
- each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this MCA;
- to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and
- each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws.
You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. Vikas Paruchuri may publicly disclose your participation in the project, including the fact that you have signed the MCA.
6. This MCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.
17 changes: 4 additions & 13 deletions marker/convert.py
@@ -2,6 +2,7 @@

 from marker.cleaners.table import merge_table_blocks, create_new_tables
 from marker.debug.data import dump_bbox_debug_data
+from marker.layout.layout import surya_layout, annotate_block_types
 from marker.ocr.lang import replace_langs_with_codes, validate_langs
 from marker.ocr.detection import surya_detection
 from marker.ocr.recognition import run_ocr
@@ -22,12 +23,6 @@
 from marker.settings import settings


-def annotate_spans(blocks: List[Page], block_types: List[BlockType]):
-    for i, page in enumerate(blocks):
-        page_block_types = block_types[i]
-        page.add_block_types(page_block_types)
-
-
 def convert_single_pdf(
         fname: str,
         model_lst: List,
@@ -79,18 +74,14 @@ def convert_single_pdf(
         print(f"Could not extract any text blocks for {fname}")
         return "", out_meta

-    block_types = detect_document_block_types(
-        doc,
-        pages,
-        layoutlm_model,
-        batch_size=int(settings.LAYOUT_BATCH_SIZE * parallel_factor)
-    )
+    surya_layout(doc, pages, layout_model)

     # Find headers and footers
     bad_span_ids = filter_header_footer(pages)
     out_meta["block_stats"] = {"header_footer": len(bad_span_ids)}

-    annotate_spans(pages, block_types)
+    # Add block types in
+    annotate_block_types(pages)

     # Dump debug data if flags are set
     dump_bbox_debug_data(doc, pages)
39 changes: 39 additions & 0 deletions marker/layout/layout.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
from typing import List

from surya.layout import batch_layout_detection

from marker.pdf.images import render_image
from marker.schema.page import Page
from marker.settings import settings


def surya_layout(doc, pages: List[Page], layout_model):
    images = [render_image(doc[pnum], dpi=settings.SURYA_LAYOUT_DPI) for pnum in range(len(pages))]
    text_detection_results = [p.text_lines for p in pages]

    processor = layout_model.processor
    layout_results = batch_layout_detection(images, layout_model, processor, detection_results=text_detection_results)
    for page, layout_result in zip(pages, layout_results):
        page.layout = layout_result


def annotate_block_types(pages: List[Page]):
    for page in pages:
        # Map block index -> (best intersection pct, index of the matching layout box)
        max_intersections = {}
        for i, block in enumerate(page.blocks):
            bbox = block.bbox
            for j, layout_block in enumerate(page.layout.bboxes):
                layout_bbox = layout_block.bbox
                intersection_pct = bbox.intersection_pct(layout_bbox)
                if i not in max_intersections or intersection_pct > max_intersections[i][0]:
                    max_intersections[i] = (intersection_pct, j)

        for i, block in enumerate(page.blocks):
            if i in max_intersections:
                j = max_intersections[i][1]
                block_type = page.layout.bboxes[j].label
            else:
                # No layout boxes were detected on this page, so fall back to plain text
                block_type = "Text"
            block.block_type = block_type
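
Note: the matching step in annotate_block_types gives each text block the label of the layout box it overlaps most. The sketch below illustrates that idea on plain coordinate tuples so it can be run without marker or surya installed. Everything in it is an assumption made for illustration: the BBox alias, the intersection_pct definition (overlap area divided by the block's own area), and the assign_labels helper are hypothetical and are not part of this commit. It also deviates from the code above in one respect: a block with zero overlap falls back to "Text" instead of taking the label of the first layout box.

```python
from typing import List, Tuple

# Hypothetical stand-in for a bounding box: (x0, y0, x1, y1)
BBox = Tuple[float, float, float, float]


def intersection_pct(box: BBox, other: BBox) -> float:
    """Fraction of `box`'s area covered by `other` (assumed definition)."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area <= 0:
        return 0.0
    width = min(box[2], other[2]) - max(box[0], other[0])
    height = min(box[3], other[3]) - max(box[1], other[1])
    if width <= 0 or height <= 0:
        return 0.0
    return (width * height) / area


def assign_labels(
    blocks: List[BBox],
    layout: List[Tuple[BBox, str]],
    default: str = "Text",
) -> List[str]:
    """Give each block the label of the layout box with the largest overlap."""
    labels = []
    for block in blocks:
        best_pct, best_label = 0.0, default
        for layout_bbox, label in layout:
            pct = intersection_pct(block, layout_bbox)
            # Strictly greater: zero overlap never replaces the default label
            if pct > best_pct:
                best_pct, best_label = pct, label
        labels.append(best_label)
    return labels


# Made-up coordinates: a header strip and a table region
layout = [((0, 0, 100, 20), "Page-header"), ((0, 30, 100, 200), "Table")]
blocks = [(5, 2, 95, 18), (10, 40, 90, 150), (200, 200, 220, 210)]
print(assign_labels(blocks, layout))  # ['Page-header', 'Table', 'Text']
```

The third block lies outside both layout boxes, so it keeps the default label. Whether a zero-overlap block should default to "Text" or inherit a detected label is a design choice; the commit's version only falls back when the page has no layout boxes at all.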
18 changes: 2 additions & 16 deletions marker/schema/page.py
@@ -3,7 +3,7 @@

 from marker.schema.bbox import BboxElement
 from marker.schema.schema import Block, Span
-from surya.schema import TextDetectionResult
+from surya.schema import TextDetectionResult, LayoutResult


 class Page(BboxElement):
@@ -12,6 +12,7 @@ class Page(BboxElement):
     column_count: Optional[int] = None
     rotation: Optional[int] = None # Rotation degrees of the page
     text_lines: Optional[TextDetectionResult] = None
+    layout: Optional[LayoutResult] = None

     def get_nonblank_lines(self):
         lines = self.get_all_lines()
@@ -27,21 +28,6 @@ def get_nonblank_spans(self) -> List[Span]:
         spans = [s for l in lines for s in l.spans if s.text.strip()]
         return spans

-    def add_block_types(self, page_block_types):
-        if len(page_block_types) != len(self.get_all_lines()):
-            print(f"Warning: Number of detected lines {len(page_block_types)} does not match number of lines {len(self.get_all_lines())}")
-
-        i = 0
-        for block in self.blocks:
-            for line in block.lines:
-                if i < len(page_block_types):
-                    line_block_type = page_block_types[i].block_type
-                else:
-                    line_block_type = "Text"
-                i += 1
-                for span in line.spans:
-                    span.block_type = line_block_type
-
     def get_font_stats(self):
         fonts = [s.font for s in self.get_nonblank_spans()]
         font_counts = Counter(fonts)
2 changes: 1 addition & 1 deletion marker/schema/schema.py
@@ -18,7 +18,6 @@ class Span(BboxElement):
     font: str
     font_weight: float
     font_size: float
-    block_type: Optional[str] = None


     @field_validator('text')
@@ -42,6 +41,7 @@ def start(self):
 class Block(BboxElement):
     lines: List[Line]
     pnum: int
+    block_type: Optional[str] = None

     @property
     def prelim_text(self):
1 change: 1 addition & 0 deletions marker/settings.py
@@ -74,6 +74,7 @@ def OCR_ENGINE_INTERNAL(self) -> str:
     TEXIFY_MODEL_NAME: str = "vikp/texify"

     # Layout model
+    SURYA_LAYOUT_DPI: int = 96
     BAD_SPAN_TYPES: List[str] = ["Caption", "Footnote", "Page-footer", "Page-header", "Picture"]
     LAYOUT_MODEL_MAX: int = 512
     LAYOUT_CHUNK_OVERLAP: int = 64
