More testing #17

Merged
merged 8 commits into from
Apr 12, 2014
20 changes: 17 additions & 3 deletions lib/llt/segmenter.rb
@@ -31,7 +31,7 @@ def self.default_options
# the xml escaped characters cannot be refactored to something along
# &(?:amp|quot); - it's an invalid pattern in the look-behind
SENTENCE_CLOSER = /(?<!#{AWB})\.(?!\.)|[\?!:]|((?<!&amp|&quot|&apos|&lt|&gt);)/
-DIRECT_SPEECH_DELIMITER = /['"]|&(?:apos|quot);/ # the bracketed part had ” before too - which throws an error - look into that
+DIRECT_SPEECH_DELIMITER = /['"]|&(?:apos|quot);/
TRAILERS = /\)|\s*<\/.*?>/

def segment(string, add_to: nil, **options)
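The comment above SENTENCE_CLOSER says the escaped entities have to be spelled out as top-level alternatives inside the look-behind. The following is a minimal sketch of just the semicolon branch, assuming a reduced pattern; the constant and method names are hypothetical and not part of lib/llt/segmenter.rb. It shows that a semicolon ending an XML entity is skipped, while a plain semicolon still counts as a boundary.

```ruby
# Hypothetical, reduced copy of the semicolon branch of SENTENCE_CLOSER.
# Each entity is its own top-level alternative in the negative look-behind.
ENTITY_SAFE_SEMICOLON = /(?<!&amp|&quot|&apos|&lt|&gt);/

# Returns the offsets of all semicolons that are not the tail of an entity.
def semicolon_boundaries(text)
  offsets = []
  pos = 0
  # Regexp#match with a start position keeps the text before pos visible
  # to the look-behind, unlike matching against a sliced substring.
  while (m = ENTITY_SAFE_SEMICOLON.match(text, pos))
    offsets << m.begin(0)
    pos = m.end(0)
  end
  offsets
end

p semicolon_boundaries("a &amp; b; c")
```

Only the semicolon after "b" is reported; the one closing `&amp;` is rejected by the look-behind.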
@@ -51,8 +51,14 @@ def setup(options)
@indexing = parse_option(:indexing, options)
@id = 0 if @indexing

# newline_boundary is only active when we aren't working with xml!
nl_boundary = parse_option(:newline_boundary, options)
@sentence_closer = Regexp.union(SENTENCE_CLOSER, /\n{#{nl_boundary}}/)

@sentence_closer = build_sentence_closer_regexp(nl_boundary)
end

def build_sentence_closer_regexp(nl_boundary)
@xml ? SENTENCE_CLOSER : Regexp.union(SENTENCE_CLOSER, /\n{#{nl_boundary}}/)
end

# Used to normalize wonky whitespace in front of or behind direct speech
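The effect of build_sentence_closer_regexp can be sketched in isolation. This is an illustration only: SIMPLE_CLOSER is a hypothetical stand-in for SENTENCE_CLOSER, the xml flag is passed as a keyword instead of being read from @xml, and the default of 2 for nl_boundary is an assumption made for the example.

```ruby
require "strscan"

# Hypothetical stand-in for SENTENCE_CLOSER (the real pattern is more involved).
SIMPLE_CLOSER = /[.?!:]/

# Mirrors the shape of build_sentence_closer_regexp: newline runs only
# act as a boundary when the input is not XML.
def build_closer(xml:, nl_boundary:)
  xml ? SIMPLE_CLOSER : Regexp.union(SIMPLE_CLOSER, /\n{#{nl_boundary}}/)
end

# Returns the text up to and including the first boundary.
def first_segment(text, xml:, nl_boundary: 2)
  StringScanner.new(text).scan_until(build_closer(xml: xml, nl_boundary: nl_boundary))
end

p first_segment("Title\n\nBody text.", xml: false)
p first_segment("Title\n\nBody text.", xml: true)
```

With xml: false the blank line ends the first segment; with xml: true the scanner runs on to the full stop, since whitespace between XML elements is usually just formatting.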
@@ -115,8 +121,12 @@ def toggle_direct_speech_status

def scan_through_string(scanner, sentences = [])
while scanner.rest?
loop_guard = scanner.pos

sentence = scan_until_next_sentence(scanner, sentences)

raise if scanner.pos == loop_guard

if @xml
rebuild_xml_tags(scanner, sentence, sentences)
take_all_closing_tags(scanner, sentence)
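The loop_guard added to scan_through_string protects against a scanner that stops advancing. The sketch below shows the same idea on a stripped-down loop; split_with_guard and its fallback to the rest of the string are hypothetical simplifications, and the zero-width pattern in the second call is only one assumed way such a stall could arise.

```ruby
require "strscan"

# Hypothetical, simplified scanning loop with the same guard idea:
# remember the position before each step and bail out if it did not move.
def split_with_guard(text, closer)
  scanner = StringScanner.new(text)
  pieces = []
  while scanner.rest?
    loop_guard = scanner.pos
    # Fall back to consuming the remainder when no closer is found.
    piece = scanner.scan_until(closer) || scanner.scan(/.*/m)
    raise "scanner stuck at position #{loop_guard}" if scanner.pos == loop_guard
    pieces << piece
  end
  pieces
end

p split_with_guard("One. Two", /\./)

begin
  # /x*/ can match the empty string, so the scanner never advances.
  split_with_guard("abc", /x*/)
rescue RuntimeError => e
  puts "guard fired: #{e.message}"
end
```

Without the guard, the second call would loop forever, because scan_until returns an empty (but truthy) string and rest? stays true.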
@@ -133,6 +143,10 @@ def scan_through_string(scanner, sentences = [])
sentences
end

def scan_to_first_real_text(scanner)
scanner.scan_until(/<.*?>\s*(?=\w)/)
end

def scan_until_next_sentence(scanner, sentences)
scanner.scan_until(@sentence_closer) ||
rescue_no_delimiters(sentences, scanner)
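What the pattern in scan_to_first_real_text consumes can be checked on a small input. This is a sketch under the assumption that the method is meant to skip leading markup; skip_to_first_real_text is a hypothetical wrapper that returns both the skipped prefix and the remaining text.

```ruby
require "strscan"

# Same pattern as in scan_to_first_real_text: a tag, optional whitespace,
# then a look-ahead for a word character that is not consumed.
FIRST_REAL_TEXT = /<.*?>\s*(?=\w)/

# Hypothetical helper: returns [skipped_prefix, remaining_text].
def skip_to_first_real_text(xml)
  scanner = StringScanner.new(xml)
  skipped = scanner.scan_until(FIRST_REAL_TEXT)
  [skipped, scanner.rest]
end

skipped, rest = skip_to_first_real_text(%(<lg met="elegiacum">\n<l n="1">Conjugis ut carae</l>))
p skipped
p rest
```

Because `.` does not match a newline here, the first tag followed only by a line break is not a match on its own; scan_until moves on to the tag that directly precedes a word character and returns everything up to that point.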
@@ -171,7 +185,7 @@ def take_all_closing_tags(scanner, sentence)
end

def closing_tags_only?(str)
-str.match(/\A(\s*<.*?\/.*?>\s*)+\z/)
+str.match(/\A(\s*<\/.*?>\s*|\s*<.*?\/>\s*)+\z/)
end
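The two patterns in closing_tags_only? differ in how strictly they place the slash. The comparison below is a sketch with hypothetical constant names and sample strings; it suggests that the earlier pattern accepts any tag containing a slash anywhere, while the later one requires either `</...>` or `<.../>`.

```ruby
# Both patterns copied from the diff; the constant names are hypothetical.
OLD_CLOSING_TAGS_ONLY = /\A(\s*<.*?\/.*?>\s*)+\z/
NEW_CLOSING_TAGS_ONLY = /\A(\s*<\/.*?>\s*|\s*<.*?\/>\s*)+\z/

# Reports whether each pattern treats the string as "closing tags only".
def closing_tag_verdicts(str)
  { old: OLD_CLOSING_TAGS_ONLY.match?(str), new: NEW_CLOSING_TAGS_ONLY.match?(str) }
end

["</l>\n</lg>", "<lb/>", "<l>text</l>", '<ref target="a/b">'].each do |sample|
  puts "#{sample.inspect}: #{closing_tag_verdicts(sample)}"
end
```

The first two samples pass both patterns. The last two pass only the old one: an element with text content and an opening tag with a slash inside an attribute value were both misread as closing tags.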


203 changes: 203 additions & 0 deletions spec/fixtures/petrov_eleg01_cleaned.xml
@@ -0,0 +1,203 @@
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.stoa.org/epidoc/schema/latest/tei-epidoc.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<text xml:lang="lat">
<body>
<div type="edition" subtype="poesis-elegia" n="urn:cts:croala:petrov02.eleg01.lat1">
<lg met="elegiacum">
<l n="1">Conjugis ut carae patrio mens debita caelo</l>

<l n="2"> Pars melior, miseram laeta reliquit humum:</l>

<l n="3">Postquam, illa rapta, simul omnis rapta voluptas</l>

<l n="4"> Est mihi (jam ex illo tempore mensis abit)</l>

<l n="5">Non nisi perpetuos fundunt mea lumina fletus;</l>

<l n="6">Lux sequitur noctem tristis, et umbra diem.</l>

<l n="7">Pabula sunt <sic>lacrymae</sic>, <sic>lacrymae</sic> sunt pocula, sed quae</l>

<l n="8"> Plus, Medea, tuo gramine fellis habent.</l>

<l n="9">Sic me qui pascit, simul enecat humor, et hospes</l>

<l n="10"> Ipse mei cordis, cor mihi luctus edit.</l>

<l n="11">Ac veluti turtur, sociam cui barbarus auceps</l>

<l n="12"> Exceptam structis perdidit insidiis,</l>

<l n="13">Si quam forte videt sine vite, et frondibus ulmum,</l>

<l n="14"> Aequalem sorti, consimilemque suae,</l>

<l n="15">Flectit iter, ramoque sedens miserabilis ales</l>

<l n="16"> Gutture subrauco nil nisi triste gemit.</l>

<l n="17">Non illum exhilarat facies pulcherrima Veris,</l>

<l n="18"> Nulla sibi in notis pabula quaerit agris:</l>

<l n="19">Non sociae possunt volucres abducere ramo,</l>

<l n="20"> Ad prope labentes non sitis urget aquas.</l>

<l n="21">Sic ego. Sic vitam sine te, dulcissima conjux,</l>

<l n="22"> Si vitae haec nomen vita meretur, ago.</l>

<l n="23">Sola queri misero, sola est mihi flere voluptas,</l>

<l n="24"> Sola loci facies maesta, silensque placet.</l>

<l n="25">Non aures cantus, non fila loquacia mulcent;</l>

<l n="26"> Non oculos formae gratia, flosque rapit.</l>

<l n="27">Unam te in sylvis, unam in florentibus hortis,</l>

<l n="28">Per juga, per valles quaero, nec invenio.</l>

<l n="29">Nec magis Eurydice est Vati quaesita marito,</l>

<l n="30"> Tartareum quamvis viderit ille canem:</l>

<l n="31">Nec magis est Cephalo Procris defleta, videnti</l>

<l n="32"> Deceptae errorem, flagitiumque manus;</l>

<l n="33">Quam totas ego te noctes, mea vita, diesque</l>

<l n="34"> Quaero, nec inventam maestus abesse queror.</l>

<l n="35">Et tamen ante oculos errat tua semper imago,</l>

<l n="36"> (Quid non fingit amans?) et tua verba sonant.</l>

<l n="37">Si qua avis in densis, Siren innoxia, lucis</l>

<l n="38"> Est audita mihi fundere dulce melos,</l>

<l n="39">Sisto gradum, et similis deceptus imagine vocis</l>

<l n="40"> Est, inquam, est cantus conjugis ille meae.</l>

<l n="41">Si quando ad fontes, aut ad vernantia prata,</l>

<l n="42"> Aut maris ad placidas me tulit error aquas,</l>

<l n="43">Hic locus est, dico, quem visere saepe solebat.</l>

<l n="44"> Quae mora (jam sol est ortus) abesse facit?</l>

<l n="45">Sed jam jam veniet; latet illa forte sub umbra,</l>

<l n="46"> Aut illi pietas est sua causa morae.</l>

<l n="47">Causa morae est certe pietas: nisi fallimur, haec est,</l>

<l n="48"> Fundentem ad superos quae videt hora preces.</l>

<l n="49">Mox sat ut illusum me liquit amabilis error,</l>

<l n="50"> Protinus ex oculis bina fluenta cadunt.</l>

<l n="51">Bina fluenta cadunt, quorum hinc dolor elicit unum,</l>

<l n="52"> Inde aliud, tanti causa doloris, amor.</l>

<l n="53">Meque ipsum incuso, quod sim tam stultus, et amens,</l>

<l n="54"> Et pascam aerumnas crudelitate meas:</l>

<l n="55">Rursus in errores tamen hos delabor, et hujus</l>

<l n="56"> Erroris rursus paenitet esse reum.</l>

<l n="57">Sic pugnant mea vota meis contraria votis,</l>

<l n="58"> Nec placet, heu! misero quod modo dulce fuit.</l>

<l n="59">Nec quod sim discors, angit modo; saevius angit</l>

<l n="60"> Vivere me longos te sine posse dies.</l>

<l n="61">Ah! ubi sunt voces illae, et mea fortia verba?</l>

<l n="62"> Ah! ubi, quae verbis debet inesse, fides?</l>

<l n="63">Me quoque rapturam subito, quae te hora tulisset,</l>

<l n="64"> Et pariter praedam mortis utrumque fore?</l>

<l n="65">Ecce tamen vivo, nec post nova cornua Phoebes</l>

<l n="66"> Vis me maeroris perdere longa potest.</l>

<l n="67">Heu! quae dura silex, quod inexsuperabile robur,</l>

<l n="68"> Quod ferrum, et triplex aes mihi pectus obit?</l>

<l n="69">Vivo equidem, vivo, sed morte est tristior ipsa,</l>

<l n="70"> Quae sine te, conjux, vita relicta mihi est.</l>

<l n="71">At tu nunc choreis Natorum immixta tuorum,</l>

<l n="72"> Qui (prona) facili ad Superos te praeiere (via) gradu,</l>

<l n="73">Plena Deo frueris, nec, quae tibi parta, bonorum</l>

<l n="74"> Amittendorum te timor ullus habet.</l>

<l n="75">Nam tua non tristes pietas te duxit in oras:</l>

<l n="76"> Debetur sedes non nisi laeta piis.</l>

<l n="77">Te plaga (credo equidem) summi plaga lucida caeli,</l>

<l n="78"> Te laeta aeterno vere vireta tenent.</l>

<l n="79">Ipsum ipsum Auctorem rerum, quem qui videt, ultra</l>

<l n="80"> Nil habet optandum, jam sine nube vides.</l>

<l n="81">Usque et ubique vides, at non saturata videndo</l>

<l n="82"> Illo oculos pascis; pressa sed usque fame es.</l>

<l n="83">Te vis implet opum, sed non (licet impleat) explet;</l>

<l n="84"> Excipit unum aliud, subsequiturque bonum.</l>

<l n="85">Non te humiles curae, non te mortalia tangunt;</l>

<l n="86"> Prae caelo, et stellis quam tibi sordet humus!</l>

<l n="87">Sordet humus certe. non sic tamen, ut tua nunquam</l>

<l n="88"> Ad miserum flectas lumina blanda virum;</l>

<l n="89">Audire aut flentem fugias, et saucia flentis,</l>

<l n="90"> Qua licet, admota corda fovere manu;</l>

<l n="91">Iactatumque diu ventisque undisque vocare</l>

<l n="92"> Ad laeta Eridani littora stelliferi.</l>

<l n="93">Quam tua sors felix, quam nostra simillima morti est,</l>

<l n="94"> Felle ego, tu Divum vesceris ambrosia.</l>

<l n="95">Non tamen invideo tua gaudia, sed miser opto,</l>

<l n="96"> Laetitiae consors quam prius esse tuae.</l>
</lg>
</div>
</body>
</text>
</TEI>