Edmundson summarizer #21

nick-magnini · 2015-04-06T23:31:37Z

Hi,

I checked the code for the Edmundson summarizer. As far as I can tell, it doesn't do anything for English. Basically, it is supposed to extract cue words, significant words, and title words, and then rank the sentences based on these scores and on sentence location. When the input is a raw text file, though, the summarizer works based only on the location of the sentence. Is that right? There is no method to extract the cue words, significant words, or title words from the text, so I suppose the implementation is wrong. Let me know if I misunderstood your code or am making a mistake. Thanks.

nick-magnini · 2015-04-07T00:12:01Z

I realized that even the location method in Edmundson doesn't work when the input is a raw text document in one-sentence-per-line format.

miso-belica · 2015-04-10T17:04:42Z

Hi, I assume you mean some format of "plain text", but I'm not sure I understand you. Can you give an example of the text? And what does "it doesn't do anything for English" mean? Does it mean that the summarizer works correctly for other languages? What do you suggest? How do you think the summarizer should behave?

nick-magnini · 2015-04-10T17:17:23Z

Hi,

Well, it does produce output, but the output is not based on the Edmundson algorithm. Basically, the lists of cue words and significant words are the non-English versions, which are in parser/parse.py:

SIGNIFICANT_WORDS = (
    "významný",
    "vynikající",
    "podstatný",
    "význačný",
    "důležitý",
    "slavný",
    "zajímavý",
    "eminentní",
    "vlivný",
    "supr",
    "super",
    "nejlepší",
    "dobrý",
    "kvalitní",
    "optimální",
    "relevantní",
)
STIGMA_WORDS = (
    "nejhorší",
    "zlý",
    "šeredný",
)

These are then used from the main:

if summarizer_class is EdmundsonSummarizer:
    summarizer.null_words = stop_words
    summarizer.bonus_words = parser.significant_words
    summarizer.stigma_words = parser.stigma_words

So when the Edmundson summarizer is called for English, it will not find any significant/stigma words in English text. If the document is one sentence per line, the location method in edmundson_location.py will not give the correct output either. So the Edmundson method gets totally wrong inputs. Correct me if I'm wrong.
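
To illustrate the point, here is a minimal sketch that does not use the library's code at all. `cue_score` is a hypothetical helper, and "bonus hits minus stigma hits" is only an assumed simplification of the cue method. The English word lists are made up for the example.

```python
# Hypothetical stand-in for the cue method: +1 per bonus word, -1 per stigma word.
# This is NOT the library's implementation, only an illustration of the problem.
CZECH_BONUS = ("významný", "důležitý", "nejlepší", "dobrý")
CZECH_STIGMA = ("nejhorší", "zlý", "šeredný")


def cue_score(sentence, bonus_words, stigma_words):
    """Count bonus words minus stigma words in a single sentence."""
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    bonus = sum(1 for w in words if w in bonus_words)
    stigma = sum(1 for w in words if w in stigma_words)
    return bonus - stigma


english = [
    "This is the most important result of the paper.",
    "The weather was bad on that day.",
]

# With the Czech lists, no English sentence ever matches a cue word.
czech_scores = [cue_score(s, CZECH_BONUS, CZECH_STIGMA) for s in english]

# With (made-up) English lists, the sentences get distinct scores.
english_scores = [
    cue_score(s, ("important", "best"), ("bad", "worst")) for s in english
]
```

With the Czech lists every English sentence scores zero, so the cue words contribute nothing to the ranking.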

miso-belica · 2015-04-10T17:52:13Z

Yes, you are absolutely right. I totally forgot about it. I tested the summarizers with Czech texts and left it there. This should be fixed. Thanks a lot for this :)

But as far as I remember, there is no method for gathering stigma/bonus words from the text. They should be provided per language, like stop-words are.
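
A rough sketch of what "provided per language" could look like. The names `CUE_WORDS` and `get_cue_words` are hypothetical and not part of the current code, and the English entries are placeholders only. The Czech entries are taken from the lists quoted above.

```python
# Hypothetical per-language registry, analogous to how stop-words are selected.
CUE_WORDS = {
    "czech": {
        "bonus": ("významný", "důležitý", "nejlepší", "dobrý"),
        "stigma": ("nejhorší", "zlý", "šeredný"),
    },
    # Placeholder English lists, for illustration only.
    "english": {
        "bonus": ("significant", "important", "best", "relevant"),
        "stigma": ("worst", "bad", "ugly"),
    },
}


def get_cue_words(language):
    """Return (bonus_words, stigma_words) for a language, or raise LookupError."""
    try:
        entry = CUE_WORDS[language.lower()]
    except KeyError:
        raise LookupError("No cue words available for %r" % language) from None
    return frozenset(entry["bonus"]), frozenset(entry["stigma"])
```

Raising an error for an unknown language, instead of silently falling back to the Czech lists, would have made the problem reported here visible immediately.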

nick-magnini · 2015-04-10T18:36:54Z

OK, we should think about it then. Stigma/bonus words should be extracted from the text being summarized. A general list will not help. It can be done using various methods such as topic extraction, phrase extraction, ... We can work on it. I'll come back with some modules and points on that soon.
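
As a starting point, here is a minimal sketch of the simplest option I can think of: plain frequency counting over the document's non-stop words. It is not topic or phrase extraction, and `extract_significant_words` is a hypothetical name, not an existing function.

```python
import re
from collections import Counter


def extract_significant_words(text, stop_words, top_n=5):
    """Return the top_n most frequent non-stop words of the text itself."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return [word for word, _ in counts.most_common(top_n)]


text = (
    "Summarization selects sentences. "
    "Sentences with frequent words score higher. "
    "Frequent words describe the topic."
)
significant = extract_significant_words(text, {"with", "the"}, top_n=3)
```

The result could then be assigned to `summarizer.bonus_words` in place of the fixed list, assuming that attribute accepts any collection of words.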

nick-magnini · 2015-04-10T18:37:24Z

Also, regarding the location: it should be fixed in edmundson_location.py.

miso-belica self-assigned this Apr 10, 2015

miso-belica added the bug label Apr 10, 2015
