Documentation docstrings, and changelog
Moved documentation for functions from README to
the code itself, where it can then be reflected on
the documentation website itself.

Added a CHANGELOG to note what's changed between
releases.
BryceStevenWilley committed Feb 24, 2023
1 parent 3bfed5d commit 81e83f9
Showing 3 changed files with 194 additions and 108 deletions.
45 changes: 45 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,45 @@
# CHANGELOG

## Version v0.1.1

### Added

* You can now pass in your SPOT and OpenAI tokens directly to their functions (https://github.com/SuffolkLITLab/FormFyxer/commit/5555bc15e399a8e10894a9f919be32a102554e7a)

### Fixed

* If GPT-3 says the readability is too high (i.e. high likelihood we have garbage), we will use ocrmypdf to re-evaluate the text in a PDF (https://github.com/SuffolkLITLab/FormFyxer/commit/a6dcd9872d2d0a6542f687aa46b1b9b00f16d3e5)
* Adds more actionable information to the stats returned from `parse_form` (https://github.com/SuffolkLITLab/FormFyxer/pull/83):
  * Gives more context for citations found in the text: https://github.com/SuffolkLITLab/FormFyxer/pull/83/commits/b62bd41958fc1bd0373b7698adde1a234779f77a

### Changed

* Changed many of the internal functions in `pdf_wrangling` to enable re-labeling existing fields: https://github.com/SuffolkLITLab/FormFyxer/commit/71d903804b0178ff409dd15c49785663fcaf59c6
* Renamed `swap_pdf_page` to `copy_pdf_fields`, deprecated the former: https://github.com/SuffolkLITLab/FormFyxer/commit/71d903804b0178ff409dd15c49785663fcaf59c6
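
A rename with a deprecated alias is commonly done with a thin forwarding wrapper. The following is only a minimal sketch of that pattern, not the code in the linked commit: the body of `copy_pdf_fields` here is a hypothetical stand-in that just echoes its arguments, and the warning text is an assumption.

```python
import warnings


def copy_pdf_fields(source_pdf, destination_pdf, **kwargs):
    """Hypothetical stand-in for the real function; echoes its arguments."""
    return (source_pdf, destination_pdf, kwargs)


def swap_pdf_page(source_pdf, destination_pdf, **kwargs):
    """Deprecated alias that warns and then forwards to copy_pdf_fields."""
    warnings.warn(
        "swap_pdf_page is deprecated; use copy_pdf_fields instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this shim
    )
    return copy_pdf_fields(source_pdf, destination_pdf, **kwargs)
```

Existing callers of the old name keep working under this pattern, while the warning points them at the new name.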

## Version v0.1.0

### Added

* Added the `form_complexity` function (https://github.com/SuffolkLITLab/FormFyxer/commit/60acfdb082fc8f1e701a528ac277ef8783f000c6).
* Added the `need_calculations` metric to see if a form needs any mathematical calculations (https://github.com/SuffolkLITLab/FormFyxer/commit/60acfdb082fc8f1e701a528ac277ef8783f000c6).
* Added OpenAI functions: `plain_lang`, `describe_form`, and `guess_form_name` (https://github.com/SuffolkLITLab/FormFyxer/commit/4fcf5dbd877ec48a9718803384a22f1928062681, https://github.com/SuffolkLITLab/FormFyxer/commit/a8aa7d39463eb0d610baf6651c6485c5bf569127).
* Returns any errors from `parse_form` in the returned dictionary (https://github.com/SuffolkLITLab/FormFyxer/pull/75)

### Fixed

* Gets the correct PDF fields on some types of PDFs (https://github.com/SuffolkLITLab/FormFyxer/commit/fbf5b64c67bd8bc6d14ba4dc34041191e34c22b8):
  * PDF fields can be nested, so we should recursively get all of the `Kids` fields if there are any
  * Filter out push button fields, which don't save data in the form itself
* Speed up `time_to_answer_form` by using numpy more, and not looping as much (https://github.com/SuffolkLITLab/FormFyxer/pull/75)
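
The nested-field fix above can be illustrated with a small recursive walk. This is a rough sketch under stated assumptions, not FormFyxer's implementation: fields are modelled as plain dictionaries rather than pikepdf objects, `collect_fields` is a hypothetical name, and every entry with `Kids` is treated as a pure container (in real PDFs, `Kids` may also hold widget annotations of a single field). The push button test uses bit 17 of the `Ff` flags on a `Btn` field, as defined in the PDF specification.

```python
PUSHBUTTON_FLAG = 1 << 16  # bit 17 of /Ff marks a push button in the PDF spec


def collect_fields(fields):
    """Flatten a nested field tree into a list, skipping push buttons."""
    found = []
    for field in fields:
        kids = field.get("Kids")
        if kids:
            # Container node: recurse into its children.
            found.extend(collect_fields(kids))
        elif field.get("FT") == "Btn" and (field.get("Ff", 0) & PUSHBUTTON_FLAG):
            # Push buttons trigger actions and hold no form data.
            continue
        else:
            found.append(field)
    return found


tree = [
    {"T": "name", "FT": "Tx"},
    {"T": "group", "Kids": [
        {"T": "city", "FT": "Tx"},
        {"T": "submit", "FT": "Btn", "Ff": PUSHBUTTON_FLAG},
        {"T": "inner", "Kids": [{"T": "agree", "FT": "Btn", "Ff": 0}]},
    ]},
]
names = [f["T"] for f in collect_fields(tree)]
```

Checkboxes and radio buttons are also `Btn` fields, which is why the sketch checks the flag bit instead of the field type alone.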

## Version v0.0.10.1

### Internal

* formatting and missing mypy dependencies (https://github.com/SuffolkLITLab/FormFyxer/commit/dfb0804d0d09e9c2eea93ec5b84eff0a9cbd03cc)

## Version 0.0.10

October 2022 release. Previous releases are not documented in this CHANGELOG.
If you are interested, you can browse the [project's previous history](https://github.com/SuffolkLITLab/FormFyxer/compare/f7f3154890d92...v0.0.10.1).
151 changes: 50 additions & 101 deletions README.md
@@ -16,22 +16,56 @@ If you are on Anaconda, simply run `conda install poppler`. Otherwise, follow th

## Functions

- [re_case](#formfyxerre_casetext)
- [regex_norm_field](#formfyxerregex_norm_fieldtext)
- [reformat_field](#formfyxerreformat_fieldtextmax_length30)
- [normalize_name](#formfyxernormalize_namejurgroupnperlast_fieldthis_field)
- [vectorize](#formfyxervectorizetextnormalize0)
- [spot](#formfyxerspottextlower025pred05upper06verbose0)
- [guess_form_name](#formfyxerguess_form_nametext)
- [plain_lang](#formfyxerplain_langtext)
- [describe_form](#formfyxerdescribe_formtext)
- [parse_form](#formfyxerparse_formfileloctitlenonejurnonecatnonenormalize1use_spot0rewrite0)
- [cluster_screens](#formfyxercluster_screensfieldsdamping07)
- [set_fields](#formfyxerset_fields)
- [rename_pdf_fields](#formfyxerrename_pdf_fields)
- [swap_pdf_page](#formfyxerswap_pdf_page)
- [get_possible_fields](#formfyxerget_possible_fields)
- [auto_add_fields](#formfyxerauto_add_fields)
Functions from `pdf_wrangling` are found on [our documentation site](https://suffolklitlab.org/docassemble-AssemblyLine-documentation/docs/reference/formfyxer/pdf_wrangling).

- [FormFyxer](#formfyxer)
- [Installation and updating](#installation-and-updating)
- [Functions](#functions)
- [formfyxer.re\_case(text)](#formfyxerre_casetext)
- [Parameters:](#parameters)
- [Returns:](#returns)
- [Example:](#example)
- [formfyxer.regex\_norm\_field(text)](#formfyxerregex_norm_fieldtext)
- [Parameters:](#parameters-1)
- [Returns:](#returns-1)
- [Example:](#example-1)
- [formfyxer.reformat\_field(text,max\_length=30)](#formfyxerreformat_fieldtextmax_length30)
- [Parameters:](#parameters-2)
- [Returns:](#returns-2)
- [Example:](#example-2)
- [formfyxer.normalize\_name(jur,group,n,per,last\_field,this\_field)](#formfyxernormalize_namejurgroupnperlast_fieldthis_field)
- [Parameters:](#parameters-3)
- [Returns:](#returns-3)
- [Example:](#example-3)
- [formfyxer.vectorize(text,normalize=0)](#formfyxervectorizetextnormalize0)
- [Parameters:](#parameters-4)
- [Returns:](#returns-4)
- [Example:](#example-4)
- [formfyxer.spot(text,lower=0.25,pred=0.5,upper=0.6,verbose=0)](#formfyxerspottextlower025pred05upper06verbose0)
- [Parameters:](#parameters-5)
- [Returns:](#returns-5)
- [Example:](#example-5)
- [formfyxer.guess\_form\_name(text)](#formfyxerguess_form_nametext)
- [Parameters:](#parameters-6)
- [Returns:](#returns-6)
- [Example:](#example-6)
- [formfyxer.plain\_lang(text)](#formfyxerplain_langtext)
- [Parameters:](#parameters-7)
- [Returns:](#returns-7)
- [Example:](#example-7)
- [formfyxer.describe\_form(text)](#formfyxerdescribe_formtext)
- [Parameters:](#parameters-8)
- [Returns:](#returns-8)
- [Example:](#example-8)
- [formfyxer.parse\_form(fileloc,title=None,jur=None,cat=None,normalize=1,use\_spot=0,rewrite=0)](#formfyxerparse_formfileloctitlenonejurnonecatnonenormalize1use_spot0rewrite0)
- [Parameters:](#parameters-9)
- [Returns:](#returns-9)
- [Example:](#example-9)
- [formfyxer.cluster\_screens(fields,damping=0.7)](#formfyxercluster_screensfieldsdamping07)
- [Parameters:](#parameters-10)
- [Returns:](#returns-10)
- [Example:](#example-10)
- [License](#license)


### formfyxer.re_case(text)
@@ -394,91 +428,6 @@ An object grouping together similar field names.
```
[back to top](#formfyxer)

### formfyxer.set_fields
This function adds fields to an input PDF, writing the new PDF to a new file.
#### Parameters:
* `in_file: Union[str, Path, BinaryIO]`: the input file name or path of the PDF that we're adding the fields to
* `out_file: Union[str, Path, BinaryIO]`: the output file name or path where the new version of `in_file` will be written. Doesn't need to exist.
* `fields_per_page: Iterable[Iterable[FormField]]`: for each page, a series of fields that should be added to that page.
* `overwrite:bool`: if the input file already has some fields (AcroForm fields specifically) and this value is true, it will erase those existing fields and just
add `fields_per_page`. If not true and the input file has fields, we won't generate a PDF, since we don't currently have a way to merge AcroForm fields from different PDFs at the moment.
### Returns:
Nothing

#### Example:
```python
set_fields('no_fields.pdf', 'four_fields_on_second_page.pdf',
[
[], # nothing on the first page
[ # Second page
FormField('new_field', 'text', 110, 105, configs={'width': 200, 'height': 30}),
# Choice needs value to be one of the possible options, and options to be a list of strings or tuples
FormField('new_choices', 'choice', 110, 400, configs={'value': 'Option 1', 'options': ['Option 1', 'Option 2']}),
# Radios need to have the same name, with different values
FormField('new_radio1', 'radio', 110, 600, configs={'value': 'option a'}),
FormField('new_radio1', 'radio', 110, 500, configs={'value': 'option b'})
]
]
)
```

### formfyxer.rename_pdf_fields
Given a dictionary that maps existing PDF field names to the corresponding desired names, this function renames the PDF fields from an input file.
#### Parameters:
* `in_file: str`: the file name of an input file
* `out_file: str`: the output file name. Doesn't need to exist, and will be overwritten if it does exist.
* `mapping: Mapping`: a python dict from a current field name to the desired name
#### Returns:
Nothing.
#### Example:
```python
rename_pdf_fields('current.pdf', 'new_field_names.pdf',
{'abc123': 'user1_name', 'abc124': 'user1_address_city'})
```

### formfyxer.swap_pdf_page
Copies the AcroForm fields from one PDF to another PDF (without AcroForm fields). Useful for getting started with an updated PDF form, where the old fields
are pretty close to where they should go on the new document.
#### Parameters:
* `source_pdf: Union[str, Path, Pdf]`: a file name or path to a PDF that has AcroForm fields
* `destination_pdf: Union[str, Path, Pdf]`: a file name or path to a PDF without AcroForm fields. Existing fields will be removed.
* `source_offset: int`: the starting page that fields will be copied from. Defaults to 0
* `destination_offset: int`: the start page that fields will be copied to. Defaults to 0
* `append_annotations: bool`: controls whether formfyxer will try to append form fields instead of overwriting. Defaults to false; when enabled may lead to undefined behavior
#### Returns:
A pikepdf.Pdf with the new fields. If `blank_pdf` was a pikepdf.Pdf object, the same object is returned
#### Example:
```python
new_pdf_with_fields = swap_pdf_page(source_pdf="old_pdf.pdf", destination_pdf="new_pdf_with_no_fields.pdf")
new_pdf_with_fields.save("new_pdf_with_fields.pdf")
```

### formfyxer.get_possible_fields
Given an input PDF, runs a series of heuristics to predict where there might be places for user enterable information (i.e. PDF fields), and returns those predictions
#### Parameters:
`in_pdf_file: Union[str, Path, bytes]`: the input PDF
#### Returns:
For each page in the input PDF, a list of predicted form fields
#### Example:
```python
fields = get_possible_fields('no_fields.pdf')
print(fields)
[[Type: FieldType.TEXT, Name: name, User name: , X: 67.68, Y: 666.0, Configs: {'fieldFlags': 'doNotScroll', 'width': 239.4, 'height': 16}, Type: FieldType.TEXT, Name: address, User name: , X: 67.68, Y: 638.28, Configs: {'fieldFlags': 'doNotScroll', 'width': 239.4, 'height': 16}, Type: FieldType.TEXT, Name: city__state__zip, User name: , X: 67.67999999999999, Y: 610.5600000000001, Configs: {'fieldFlags': 'doNotScroll', 'width': 239.4, 'height': 16}, Type: FieldType.TEXT, Name: phone, User name: , X: 67.67999999999999, Y: 582.84, Configs: {'fieldFlags': 'doNotScroll', 'width': 239.4, 'height': 16}, Type: FieldType.TEXT, Name: email, User name: , X: 67.67999999999999, Y: 552.6, Configs: {'fieldFlags': 'doNotScroll', 'width': 239.4, 'height': 16}, Type: FieldType.TEXT, Name: email, User name: , X: 62.28, Y: 536.76, Configs: {'fieldFlags': 'doNotScroll', 'width': 479.16, 'height': 16}, Type: FieldType.TEXT, Name: in_the_district_justice_court_of_utah, User name: , X: 304.56000000000006, Y: 481.68, Configs: {'fieldFlags': 'doNotScroll', 'width': 125.64000000000001, 'height': 16}, Type: FieldType.TEXT, Name: judicial_district_county, User name: , X: 125.64000000000001, Y: 481.68, Configs: {'fieldFlags': 'doNotScroll', 'width': 78.84, 'height': 16}, Type: FieldType.TEXT, Name: court_address, User name: , X: 161.28000000000003, Y: 453.59999999999997, Configs: {'fieldFlags': 'doNotScroll', 'width': 374.40000000000003, 'height': 16}, Type: FieldType.TEXT, Name: , User name: , X: 325.08, Y: 352.43999999999994, Configs: {'fieldFlags': 'doNotScroll', 'width': 211.32, 'height': 16}, Type: FieldType.TEXT, Name: , User name: , X: 67.32000000000001, Y: 348.84, Configs: {'fieldFlags': 'doNotScroll', 'width': 234.35999999999999, 'height': 16}, Type: FieldType.TEXT, Name: judge, User name: , X: 325.08, Y: 312.84, Configs: {'fieldFlags': 'doNotScroll', 'width': 211.32, 'height': 16}, Type: FieldType.TEXT, Name: respondent_name_and_address, User name: , X: 62.28, Y: 253.44, Configs: 
{'fieldFlags': 'doNotScroll', 'width': 479.16, 'height': 16}, Type: FieldType.TEXT, Name: court_appoint_name__as_your_guardian_to, User name: , X: 157.32, Y: 156.24, Configs: {'fieldFlags': 'doNotScroll', 'width': 203.4, 'height': 16}, Type: FieldType.TEXT, Name: page_1_of_4, User name: , X: 67.67999999999999, Y: 46.800000000000004, Configs: {'fieldFlags': 'doNotScroll', 'width': 478.08, 'height': 16}]]
```

### formfyxer.auto_add_fields
This function uses [`get_possible_fields`](#formfyxergetpossiblefields) and [`set_fields`](#formfyxersetfields) to automatically add new detected fields to an input PDF.
#### Parameters:
* `in_pdf_file: Union[str, Path]`: the input file name or path of the PDF where we'll try to find possible fields.
* `out_pdf_file: Union[str, Path]`: the output file name or path of the PDF where a new version of `in_pdf_file` will be stored, with the new fields. Doesn't need to exist, but if a file does exist at that file name, it will be overwritten.

#### Returns:
Nothing.

#### Example:
```python
auto_add_fields('no_fields.pdf', 'newly_add_fields.pdf')
```

## License
[MIT](https://github.com/SuffolkLITLab/FormFyxer/blob/main/LICENSE)
106 changes: 99 additions & 7 deletions formfyxer/pdf_wrangling.py
@@ -310,10 +310,10 @@ def set_fields(
*,
overwrite=False,
):
"""Adds fields per page to the in_file PDF, writing the new PDF to out_file.
"""Adds fields per page to the in_file PDF, writing the new PDF to a new file.
Example usage:
```
```python
set_fields('no_fields.pdf', 'four_fields_on_second_page.pdf',
[
[], # nothing on the first page
@@ -328,6 +328,21 @@ def set_fields(
]
)
```
Args:
in_file: the input file name or path of a PDF that we're adding the fields to
out_file: the output file name or path where the new version of in_file will
be written. Doesn't need to exist.
fields_per_page: for each page, a series of fields that should be added to that
page.
overwrite: if the input file already has some fields (AcroForm fields specifically)
and this value is true, it will erase those existing fields and just add
`fields_per_page`. If not true and the input file has fields, this won't generate
a PDF, since there isn't currently a way to merge AcroForm fields from
different PDFs.
Returns:
Nothing.
"""
if not fields_per_page:
# Nothing to do, lol
@@ -341,7 +356,7 @@ def set_fields(
_create_only_fields(io_obj, fields_per_page)
temp_pdf = Pdf.open(io_obj)

in_pdf = swap_pdf_page(source_pdf=temp_pdf, destination_pdf=in_pdf)
in_pdf = copy_pdf_fields(source_pdf=temp_pdf, destination_pdf=in_pdf)
in_pdf.save(out_file)


@@ -351,7 +366,22 @@ def rename_pdf_fields(
mapping: Mapping[str, str],
) -> None:
"""Given a dictionary that maps old to new field names, rename the AcroForm
field with a matching key to the specified value"""
field with a matching key to the specified value.
Example:
```python
rename_pdf_fields('current.pdf', 'new_field_names.pdf',
{'abc123': 'user1_name', 'abc124': 'user1_address_city'})
```
Args:
in_file: the filename of an input file
out_file: the filename of the output file. Doesn't need to exist,
will be overwritten if it does exist.
mapping: the python dict that maps from a current field name to the desired name
Returns:
Nothing
"""
in_pdf = Pdf.open(in_file, allow_overwriting_input=True)

for parent_field in iter(in_pdf.Root.AcroForm.Fields):
@@ -465,10 +495,35 @@ def copy_pdf_fields(
destination_offset: int = 0,
append_fields: bool = False,
) -> Pdf:
"""Copies the AcroForm fields from one PDF to another blank PDF form. Optionally, choose a starting page for both
"""Copies the AcroForm fields from one PDF to another blank PDF form (without AcroForm fields).
Useful for getting started with an updated PDF form, where the old fields are pretty close to where
they should go on the new document.
Optionally, you can choose a starting page for both
the source and destination PDFs. By default, it will remove any existing annotations (which include form fields)
in the destination PDF. If you wish to append annotations instead, specify `append_fields = True`
Example:
```python
new_pdf_with_fields = copy_pdf_fields(
source_pdf="old_pdf.pdf",
destination_pdf="new_pdf_with_no_fields.pdf")
new_pdf_with_fields.save("new_pdf_with_fields.pdf")
```
Args:
source_pdf: a file name or path to a PDF that has AcroForm fields
destination_pdf: a file name or path to a PDF without AcroForm fields. Existing fields will be removed.
source_offset: the starting page that fields will be copied from. Defaults to 0.
destination_offset: the starting page that fields will be copied to. Defaults to 0.
append_fields: controls whether formfyxer will try to append form fields instead of
overwriting. Defaults to false; when enabled may lead to undefined behavior.
Returns:
A pikepdf.Pdf object with new fields. If `destination_pdf` was a pikepdf.Pdf object, the
same object is returned.
"""

if isinstance(source_pdf, (str, Path)):
source_pdf = Pdf.open(source_pdf)
if isinstance(destination_pdf, (str, Path)):
@@ -834,6 +889,27 @@ def get_possible_fields(
in_pdf_file: Union[str, Path, BinaryIO],
textboxes: Optional[List[List[Textbox]]] = None,
) -> List[List[FormField]]:
"""Given an input PDF, runs a series of heuristics to predict where there
might be places for user enterable information (i.e. PDF fields), and returns
those predictions.
Example:
```python
fields = get_possible_fields('no_field.pdf')
print(fields[0][0])
# Type: FieldType.TEXT, Name: name, User name: , X: 67.68, Y: 666.0, Configs: {'fieldFlags': 'doNotScroll', 'width': 239.4, 'height': 16}
```
Args:
in_pdf_file: the input PDF
textboxes (optional): the location of various lines of text in the PDF.
If not given, will be calculated automatically. This allows us to
pass through expensive info to calculate through several functions.
Returns:
For each page in the input PDF, a list of predicted form fields
"""

images = convert_from_path(in_pdf_file, dpi=dpi)

tmp_files = [tempfile.NamedTemporaryFile() for i in range(len(images))]
@@ -1209,8 +1285,24 @@ def sort_contours(cnts, method: str = "left-to-right"):


def auto_add_fields(in_pdf_file: Union[str, Path], out_pdf_file: Union[str, Path]):
"""Uses `get_possible_fields` and `set_fields` to automatically add new fields
to an input PDF."""
"""Uses [get_possible_fields](#formfyxer.pdf_wrangling.get_possible_fields) and
[set_fields](#formfyxer.pdf_wrangling.set_fields) to automatically add new detected fields
to an input PDF.
Example:
```python
auto_add_fields('no_fields.pdf', 'newly_added_fields.pdf')
```
Args:
in_pdf_file: the input file name or path of the PDF where we'll try to find possible fields
out_pdf_file: the output file name or path of the PDF where a new version of `in_pdf_file` will
be stored, with the new fields. Doesn't need to exist, but if a file does exist at that
filename, it will be overwritten.
Returns:
Nothing
"""
textboxes = get_textboxes_in_pdf(in_pdf_file)
fields = get_possible_fields(in_pdf_file, textboxes=textboxes)
fields = improve_names_with_surrounding_text(fields, textboxes=textboxes)