Text from the document is not extracted when a super note-tag is present in the document. #5

achouhan93

Current issue with the package:

The package extracts the text and its references. However, when a statement contains a super note-tag, the rest of the statement's text is omitted and only the note-tag text is returned as output.

For example, the following text appears directly below the title in test document 32019R0947:

Having regard to Regulation (EU) 2018/1139 of the European Parliament and of the Council of 4 July 2018 on common rules in the field of civil aviation and establishing a European Union Aviation Safety Agency, and amending Regulations (EC) No 2111/2005, (EC) No 1008/2008, (EU) No 996/2010, (EU) No 376/2014 and Directives 2014/30/EU and 2014/53/EU of the European Parliament and of the Council, and repealing Regulations (EC) No 216/2008 and (EC) No 552/2004 of the European Parliament and of the Council and Council Regulation (EEC) No 3922/91 [(1)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32019R0947#ntr1-L_2019152EN.01004501-E0001), and in particular Article 57 thereof,

The expected output is:

text = "Having regard to Regulation (EU) 2018/1139 of the European Parliament and of the Council of 4 July 2018 on common rules in the field of civil aviation and establishing a European Union Aviation Safety Agency, and amending Regulations (EC) No 2111/2005, (EC) No 1008/2008, (EU) No 996/2010, (EU) No 376/2014 and Directives 2014/30/EU and 2014/53/EU of the European Parliament and of the Council, and repealing Regulations (EC) No 216/2008 and (EC) No 552/2004 of the European Parliament and of the Council and Council Regulation (EEC) No 3922/91 [(1)], and in particular Article 57 thereof,"

Current Output is:

text = "1"
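
One plausible explanation for the bare "1" is how ElementTree stores mixed content. This is an assumption, since the package's extraction code is not shown in this PR. Text before a child element lives in the parent's `.text`, text after a child lives in that child's `.tail`, and the note number itself is the `.text` of the innermost `span`. The sketch below uses a shortened, hypothetical snippet modelled on the markup described above, not the actual EUR-Lex page:

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened markup; the real document is much longer.
html = (
    '<p>Council Regulation (EEC) No 3922/91 '
    '<a href="#ntr1">(<span class="super note-tag">1</span>)</a>'
    ', and in particular Article 57 thereof,</p>'
)

p = ET.fromstring(html)

leading_text = p.text               # only the text before the first child
span_text = p.find(".//span").text  # the note number on its own
anchor_tail = p.find("a").tail      # the text that follows the closing </a>
full_text = "".join(p.itertext())   # every text fragment, in document order

print(repr(span_text))
print(full_text)
```

Code that reads only the innermost element's `.text` would see just the note number, while `itertext()` walks all fragments in order.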

Changes made in the script:
The issue occurs for statements that contain a super note-tag. A regular expression is therefore added that rewrites the note-tag link in the HTML before it is passed to ETree for information extraction.

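import re

# Replace each note-tag link, e.g. <a ...>(<span class="super note-tag">1</span>)</a>,
# with a plain-text placeholder such as [LINK = 1]
modified_html = re.sub(
    r'<a[^>]*>\(<span class="super note-tag">([^<]*)</span>\)</a>',
    r'[LINK = \1]',
    html
)

# Parse the modified HTML using ElementTree
tree = ETree.fromstring(modified_html)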
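
The substitution above can be tried in isolation with a minimal sketch. The sample markup, the `replace_note_tags` helper, and the use of the standard library's `xml.etree.ElementTree` in place of the script's `ETree` are assumptions made for illustration:

```python
import re
import xml.etree.ElementTree as ET

# Same pattern as in the change above, compiled once for reuse.
NOTE_TAG_LINK = re.compile(
    r'<a[^>]*>\(<span class="super note-tag">([^<]*)</span>\)</a>'
)


def replace_note_tags(html: str) -> str:
    """Hypothetical helper: turn each note-tag link into '[LINK = n]'."""
    return NOTE_TAG_LINK.sub(r'[LINK = \1]', html)


# Hypothetical, shortened sample; attribute values are illustrative only.
sample = (
    '<p>Council Regulation (EEC) No 3922/91 '
    '<a id="ntc1" href="#ntr1">(<span class="super note-tag">1</span>)</a>'
    ', and in particular Article 57 thereof,</p>'
)

modified = replace_note_tags(sample)
paragraph = ET.fromstring(modified)

# With the nested elements gone, the whole sentence is plain text of <p>.
text = paragraph.text
print(text)
```

The pattern is strict: it expects the parentheses directly inside the `<a>` element and exactly `class="super note-tag"` on the `span`, so markup that deviates from this shape would be left unchanged.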

After the code changes, for the text below:

Having regard to Regulation (EU) 2018/1139 of the European Parliament and of the Council of 4 July 2018 on common rules in the field of civil aviation and establishing a European Union Aviation Safety Agency, and amending Regulations (EC) No 2111/2005, (EC) No 1008/2008, (EU) No 996/2010, (EU) No 376/2014 and Directives 2014/30/EU and 2014/53/EU of the European Parliament and of the Council, and repealing Regulations (EC) No 216/2008 and (EC) No 552/2004 of the European Parliament and of the Council and Council Regulation (EEC) No 3922/91 [(1)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32019R0947#ntr1-L_2019152EN.01004501-E0001), and in particular Article 57 thereof,

The output is as follows:

text = "Having regard to Regulation (EU) 2018/1139 of the European Parliament and of the Council of 4 July 2018 on common rules in the field of civil aviation and establishing a European Union Aviation Safety Agency, and amending Regulations (EC) No 2111/2005, (EC) No 1008/2008, (EU) No 996/2010, (EU) No 376/2014 and Directives 2014/30/EU and 2014/53/EU of the European Parliament and of the Council, and repealing Regulations (EC) No 216/2008 and (EC) No 552/2004 of the European Parliament and of the Council and Council Regulation (EEC) No 3922/91 [LINK = 1], and in particular Article 57 thereof,"

…nks, then text for that statement was replaced with the number of the super note-tag. The issue is addressed by adding a regular expression that handles the super note-tag and modifies the HTML before its elements are extracted.