This repository has been archived by the owner on Sep 20, 2022. It is now read-only.

[WIP][HIVEMALL-126] Maximum Entropy Model using OpenNLP MaxEnt #93

Open
wants to merge 52 commits into master

Conversation

@helenahm commented Jul 2, 2017

What changes were proposed in this pull request?

A Distributed Max Entropy Model

What type of PR is it?

Feature
HIVEMALL-126

What is the Jira issue?

[WIP][HIVEMALL-126]

How was this patch tested?

There are two tests at the moment: hivemall.smile.classification.MaxEntUDTFTest.java
and hivemall.smile.tools.MaxEntPredictUDFTest.java,

plus I have tested the code on EMR:

add jar hivemall-core-0.4.2-rc.2-maxent-with-dependencies.jar;
add jar opennlp-maxent-3.0.0.jar;
source define-all.hive;
create temporary function train_maxent_classifier as 'hivemall.smile.classification.MaxEntUDTF';
create temporary function predict_maxent_classifier as 'hivemall.smile.tools.MaxEntPredictUDF';
drop table tmodel_maxent;
CREATE TABLE tmodel_maxent
STORED AS SEQUENCEFILE
AS
select
  train_maxent_classifier(features, klass, "-attrs Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q,Q,Q,Q,Q,Q,Q,Q")
from
  t_test_maxent;

create table tmodel_combined as
select model, attributes, features, klass from t_test_maxent join tmodel_maxent;

create table tmodel_predicted as
select
predict_maxent_classifier(model, attributes, features) result, klass from tmodel_combined;

Source table:
drop table t_test_maxent;
create table t_test_maxent as select
array( x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,
cast(tWord(x37) as double),
cast(tWord(x38) as double),
cast(tWord(x39) as double),
cast(tWord(x40) as double),
cast(tWord(x41) as double),
cast(tWord(x42) as double),
cast(tWord(x43) as double),
cast(tWord(x44) as double),
cast(contentWord(x45) as double),
cast(contentWord(x46) as double),
cast(contentWord(x47) as double),
cast(contentWord(x48) as double),
cast(contentWord(x49) as double),
cast(contentWord(x50) as double),
cast(contentWord(x51) as double),
cast(contentWord(x52) as double),
cast(contentWord(x53) as double),
cast(presentationWord(x54) as double),
cast(presentationWord(x55) as double),
cast(presentationWord(x56) as double),
cast(presentationWord(x57) as double),
cast(presentationWord(x58) as double),
cast(presentationWord(x59) as double),
cast(presentationWord(x60) as double),
cast(presentationWord(x61) as double),
cast(presentationWord(x62) as double),
x63,x64,x65,x66,x67,x68,x69,x70) features
, klass from pdfs_and_tiffs_instances_combined_instances where regexp_replace(tp, 'T', '') == '76_698_855_347';

How to use this feature?

Maximum Entropy Classifier is, from my point of view, the most useful classification technique for many NLP tasks, as well as many tasks unrelated to NLP. It is used for part-of-speech tagging, NER, and other tasks.

I have been searching for a distributed version of it and found only one article on the topic: "Efficient Large Scale Distributed Training of Conditional Maximum Entropy Models" by Mehryar Mohri (quite well known) and his colleagues at Google. (Please let me know how I can send you the article if you cannot find it by googling.) Thus, I think it is time to implement it. I use the Mixture Weight Method they describe.

create temporary function aggregate_classifiers as 'hivemall.smile.tools.MaxEntMixtureWeightUDAF';
select aggregate_classifiers(model) from tmodel5;
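For readers unfamiliar with the Mixture Weight Method, the idea the UDAF implements can be sketched as follows: each shard trains its own maxent model, and the final model's parameters are the uniformly weighted average of the per-shard parameters. The class and method names below are illustrative, not the actual MaxEntMixtureWeightUDAF code:

```java
// Toy sketch of the Mixture Weight Method (Mohri et al.):
// average K per-shard parameter vectors into one mixed model.
public class MixtureWeight {

    /** Uniformly averages K parameter vectors of equal length. */
    public static double[] mix(double[][] shardWeights) {
        final int k = shardWeights.length;
        final int d = shardWeights[0].length;
        double[] mixed = new double[d];
        for (double[] w : shardWeights) {
            for (int i = 0; i < d; i++) {
                mixed[i] += w[i] / k;   // uniform mixture weight 1/k
            }
        }
        return mixed;
    }

    public static void main(String[] args) {
        double[][] shards = { {1.0, 2.0}, {3.0, 4.0} };
        double[] mixed = mix(shards);
        System.out.println(mixed[0] + " " + mixed[1]); // 2.0 3.0
    }
}
```

In the Hive setting, each mapper produces one row of model weights and the UDAF performs this merge on the reduce side.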

See if you like the idea and are willing to accept the code. It is based on Apache maxent, which is open source and written in a simple way.

Regards,
Elena.

@helenahm (Author) commented Jul 2, 2017

And please, disregard the LDAUDTFTest.java. We have already discussed that.

@myui (Member) commented Jul 2, 2017

@helenahm Thank you for your contribution!

  • Could you resolve the conflict? You can add public void testUTF8() and rename existing public void test2() to public void testASCII().
  • Could you apply Hivemall code formatting? You can use mvn formatter:format or this style file for Eclipse/IntelliJ.
  • Could you add DDLs in resources/ddl? grep tree_predict for a reference to add DDLs.
  • Could you create a JIRA issue for MaxEntropy classifier and rename the PR title to [WIP][HIVEMALL-xxx] ..... ?

BTW, the hivemall.smile package is used for the fork of smile. It's better to use another package such as hivemall.opennlp (needs discussion). I'm considering moving general classes such as hivemall.smile.data.Attribute to hivemall.common.

@myui (Member) left a comment

Left some comments. Review is still on-going though.

import opennlp.model.MaxentModel;
import opennlp.model.RealValueFileEventStream;

public class MaxEntPredictUDF extends GenericUDF {

@Override
public String getDisplayString(String[] children) {
return "tree_predict(" + Arrays.toString(children) + ")";

tree_predict is not correct. Better to rename it to maxent_predict or something.

* @return the Event object which is next in this EventStream
*/
public Event next () {
while (next == null && (rowNum < ds.numRows()))

{} is needed per the Google coding style that Hivemall follows.


private Event createEvent(double[] obs, int y) {
rowNum++;
if (obs == null)

{} same as the above.

float[] values = new float[obs.length];
for (int i = 0; i < obs.length; i++){
if (attributes[i].type == AttributeType.NOMINAL){
names[i] = i + "_" + String.valueOf(obs[i]).toString();

String.valueOf(obs[i]) is enough. toString() is redundant.


i + "_" + obs[i] will be compiled by javac into
new StringBuilder().append(i).append("_").append(obs[i]) anyway.

for (int i = 0; i < obs.length; i++){
if (attributes[i].type == AttributeType.NOMINAL){
names[i] = i + "_" + String.valueOf(obs[i]).toString();
values[i] = Double.valueOf(1.0).floatValue();

Why not just values[i] = 1.f;
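Taken together, the review suggestions above (braces everywhere, dropping the redundant toString(), and a plain 1.f literal) would make the loop look roughly like this self-contained sketch. AttributeType is stubbed so it compiles standalone, and the numeric branch is an assumption about how non-nominal attributes are handled, not code from the PR:

```java
// Consolidated version of the reviewed loop after applying the comments.
public class EventEncoding {

    enum AttributeType { NOMINAL, NUMERIC } // stub; the PR has its own Attribute types

    /** Encodes one observation row into feature names; values are filled in place. */
    public static String[] encode(double[] obs, AttributeType[] attrs, float[] values) {
        String[] names = new String[obs.length];
        for (int i = 0; i < obs.length; i++) {
            if (attrs[i] == AttributeType.NOMINAL) {
                names[i] = i + "_" + obs[i]; // javac emits the StringBuilder chain itself
                values[i] = 1.f;             // nominal attributes are one-hot
            } else {
                // assumed handling for numeric attributes (illustrative only)
                names[i] = String.valueOf(i);
                values[i] = (float) obs[i];
            }
        }
        return names;
    }

    public static void main(String[] args) {
        float[] values = new float[2];
        String[] names = encode(new double[] {2.0, 0.5},
            new AttributeType[] {AttributeType.NOMINAL, AttributeType.NUMERIC}, values);
        System.out.println(names[0] + " -> " + values[0]); // 0_2.0 -> 1.0
    }
}
```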

}
}

String opts_str = "";
@myui (Member) commented Jul 2, 2017
if (rowNum < ds.numRows()){
next = createEvent(ds.getRow(rowNum), y[rowNum]);
};

}; => }

for (int i = 0; i < obs.length; i++){
if (attrs[i].type == AttributeType.NOMINAL){
names[i] = i + "_" + String.valueOf(obs[i]).toString();
values[i] = Double.valueOf(1.0).floatValue();

values[i] = 1.f;

@helenahm (Author) commented Jul 4, 2017

Just added a MaxEntMixtureWeightUDAF to aggregate weights of models obtained on each part of data.
create temporary function aggregate_classifiers as 'hivemall.smile.tools.MaxEntMixtureWeightUDAF';

tested it on EMR only:

add jar hivemall-core-0.4.2-rc.2-maxent-with-dependencies.jar;
add jar opennlp-maxent-3.0.0.jar;
source define-all.hive;
create temporary function train_maxent_classifier as 'hivemall.smile.classification.MaxEntUDTF';
create temporary function predict_maxent_classifier as 'hivemall.smile.tools.MaxEntPredictUDF';
create temporary function aggregate_classifiers as 'hivemall.smile.tools.MaxEntMixtureWeightUDAF';
select aggregate_classifiers(model) from tmodel5;

where tmodel5 contains 5 rows of the same model.

I will do more testing.

@helenahm (Author) commented Jul 4, 2017

•Could you create a JIRA issue for MaxEntropy classifier and rename the PR title to [WIP][HIVEMALL-xxx] ..... ?

How do I do that?

•Could you apply Hivemall code formatting? You can use mvn formatter:format or this style file for Eclipse/IntelliJ.

just mvn formatter:format ? How does it work? Is there something in the pom that tells maven how to re-format files?

@myui (Member) commented Jul 4, 2017

•Could you create a JIRA issue for MaxEntropy classifier and rename the PR title to [WIP][HIVEMALL-xxx] ..... ?

How do I do that?

I created an issue for you. https://issues.apache.org/jira/browse/HIVEMALL-126
You need to create an account at Apache JIRA ^.

•Could you apply Hivemall code formatting? You can use mvn formatter:format or this style file for Eclipse/IntelliJ.

just mvn formatter:format ? How does it work? Is there something in the pom that tells maven how to re-format files?

mvn formatter:format uses the following style format internally.
https://github.com/apache/incubator-hivemall/blob/master/resources/eclipse-style.xml

If you are using Eclipse or IntelliJ, just import the above style file as the project code formatter and run source code formatting on IDE.

@helenahm (Author) commented Jul 5, 2017

I have just added myself as a watcher on HIVEMALL-126 and run the mvn formatter. The formatter made changes to files written by other people too. Shall I commit all the changes, or would it be better to restrict it to my own files only?

@myui (Member) commented Jul 24, 2017

Not sure. Please don't worry about coveralls.

It seems we need to fix Travis config due to this issue.
travis-ci/travis-ci#7964

Verbose outputs from MaxEntUDTFTest and MaxEntPredictUDFTest should be avoided.
https://travis-ci.org/apache/incubator-hivemall/jobs/256887101

Also, it seems CI is killed due to some reason while the latest commit in the master is build passing.
https://docs.travis-ci.com/user/common-build-problems/#My-build-script-is-killed-without-any-error

<dependency>
<groupId>opennlp</groupId>
<artifactId>maxent</artifactId>
<version>3.0.0</version>
(Member)

Consider using a recent version of OpenNLP instead. The maxent code was moved into opennlp-tools (latest version is 1.8.1). The version you use here is from a couple of years ago.

(Author)

Thank you for the comment. Is there a reason for chasing the latest version? Algorithms do not age... Was an error discovered in the maxent 3.0.0 training?

(Member)

We fixed a couple of bugs over time, added new features, and did the usual maintenance (e.g. testing on recent Java versions), so yeah, it probably makes sense to use a recent version when you build something new. Also, it now supports multi-threaded training.

The version you are linking above is 7 years old.

(Author)

In general I totally agree. I think it would be good to perform the move to another version of maxent in a few steps.

  1. The code I have re-used is that of GISTrainer. It more or less updates the weights in a matrix, where the matrix is Hivemall's Matrix. Everything else just follows your class structure. I have checked that the resulting models are the same, and I have also confirmed that the resulting model makes sense on my own data, so the resulting weights must be correct. Can we say that training is correct and accept the current version as the correct and functioning one?

  2. After that there are a few options:
    We could try to re-write the code in a way that will accept the newest version of opennlp maxent and all following versions. I guess that would require changes in opennlp maxent too, but perhaps it is better than manually altering GISTrainer every time something is updated, and both projects would benefit from such collaboration.

If not, perhaps for Hivemall as a project we may consider re-writing iterative scaling from scratch to make it Hivemall-efficient, perhaps using the tricks OpenNLP uses to make the code more efficient, and making sure the resulting weights are comparable, but without aiming to be able to plug in a new OpenNLP jar each time a new version appears.

What do you think?

Regards,
Elena.

(Member)

The same as before: I recommend using a recent version.

Since you are including parts of the code directly, I kindly ask you to also update the NOTICE file with Apache OpenNLP attribution.

@helenahm (Author) commented Aug 1, 2017

I think the code is ready to be checked and pulled now.

@myui (Member) commented Aug 1, 2017

@helenahm I think it's better to use Apache opennlp-tools v1.8.1 as @kottmann mentioned.

v3.0.0 is not supported anymore and may have bugs when training other datasets.
Could you explain the difficulties in applying v1.8.1 a bit more? Has the API changed significantly between v3.0.0 and v1.8.1?

The NOTICE file should be updated too if ported sources (not a jar) are included.

@helenahm (Author) commented Aug 2, 2017

It will involve some work.

Let me explain.

You were right when you said that the OpenNLP implementation is poor memory-wise. Indeed, they store the data in [][], and several times over. Using their code directly causes Java heap space errors, GC errors, etc. (I tested that on my 97 million data rows; the newer version of the code has the same problems.) And you were right about the wonderful CSRMatrix, and DoKMatrix too: they allow storing more data. Thus, more or less, I have changed all the [][] related to input data to CSRMatrix, and the [][] holding weights to DoKMatrix.

To explain further, it is best to look at the source code for GISTrainer, in fact all three of them: the old maxent, the new maxent, and Hivemall's BigGISTrainer. The links are below.

Newer GISTrainer:
https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/ml/maxent/GISTrainer.java

Older (3.0.0) GISTrainer:
https://sourceforge.net/projects/maxent/files/ - whole archive
GISTrainer attached:
GISTrainer.txt

Hivemall GISTrainer:
https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/BigGISTrainer.java

Notice how trainModel of BigGISTrainer gets a MatrixForTraining (https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/MatrixForTraining.java), which contains references to the Matrix (a CSRMatrix) and the outcomes.

Row data is collected from the CSRMatrix in MatrixForTraining instead of from a double[][], as in

ComparableEvent ev = x.createComparableEvent(ti, di.getPredicateIndex(), di.getOMap());

(They use this convenience Event object to work with a row of data. Instead of storing a List of Events in memory, the modified code builds an event only when needed.)

And results are stored in

Matrix predCount = new DoKMatrix(numPreds, numOutcomes);

instead of a [][] again.

GISTrainer did not change very dramatically. If 3.0.0 training is reliable enough, I would, of course, consider the existing version as 1.0, and do the work of adapting the newer GISTrainer later on. It makes sense to do that, I totally agree. And perhaps it makes sense to continue after that towards understanding the training process in greater detail, and perhaps write a newer, comparable trainer that is independent of OpenNLP.
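The [][] to sparse-matrix change described above can be illustrated with a toy dictionary-of-keys matrix. Hivemall's actual CSRMatrix/DoKMatrix classes are richer than this stand-in; the sketch only shows why storing touched cells saves memory compared to a dense double[numPreds][numOutcomes], which allocates every cell up front:

```java
// Minimal dictionary-of-keys (DoK) sparse matrix: only non-zero cells
// are stored, keyed by a flattened (row, col) index.
import java.util.HashMap;
import java.util.Map;

public class DoKSketch {
    private final Map<Long, Double> cells = new HashMap<>();
    private final int numCols;

    public DoKSketch(int numRows, int numCols) {
        this.numCols = numCols; // no dense allocation here
    }

    /** Accumulates delta into cell (row, col), creating it on first touch. */
    public void add(int row, int col, double delta) {
        long key = (long) row * numCols + col;
        cells.merge(key, delta, Double::sum);
    }

    public double get(int row, int col) {
        return cells.getOrDefault((long) row * numCols + col, 0.d);
    }

    /** Number of cells actually stored. */
    public int nnz() {
        return cells.size();
    }

    public static void main(String[] args) {
        // e.g. something like predCount in BigGISTrainer: numPreds x numOutcomes
        DoKSketch predCount = new DoKSketch(1_000_000, 100);
        predCount.add(42, 7, 1.0);
        predCount.add(42, 7, 0.5);
        System.out.println(predCount.get(42, 7) + " nnz=" + predCount.nnz()); // 1.5 nnz=1
    }
}
```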

@kottmann (Member) commented Aug 2, 2017

@helenahm as far as I know the training data is stored once in memory, and then for each thread a copy of the parameters is stored.

Yeah, so if you have a lot of training data then running out of memory is one symptom you run into, but that is not the actual problem of this implementation. The actual cause is that it won't scale beyond one machine.

Bottom line: if you want to use GIS training with lots of data, don't use this implementation. Training requires a certain amount of CPU time, and it increases with the amount of training data. In case you manage to make this run with much more data, the time it takes will be uncomfortably high.

@helenahm (Author) commented Aug 2, 2017

"Yeah, so if you have a lot of training data then running out of memory is one symptom you run into, but that is not the actual problem of this implementation."

  • It was a big problem for me on Hadoop, and that is why I had to alter the training code.
  • The newer version of the code is as bad as the old one from this point of view.

"The actual cause is that it won't scale beyond one machine."

  • Yes, that is why I really like what the Hivemall project is about, and that is why I needed MaxEnt for Hive.

"In case you manage to make this run with much more data the time it will take to run will be uncomfortably high."
-- That is why I tested my new implementation on almost 100 million training samples and saw each of the 302 mappers finish its work in very reasonable time.

@myui (Member) commented Aug 2, 2017

@helenahm I agree to use Hivemall's Matrix to reduce memory consumption and create a custom BigGISTrainer for Hivemall.

My concern is that the modification should be based on the latest release of Apache OpenNLP, v1.8.1, if there is no reason to use the pre-Apache release.

Anyway, I'll look into your PR after merging #105, maybe next week. Some refactoring would be applied (such as removing debug prints and unused code) by forking your PR branch.

BTW, multi-threading should be avoided when running a task in a Yarn container. Better to parallelize via Hive.

@kottmann (Member) commented Aug 2, 2017

Sure, there are ways to make this work across multiple machines, but then you can't use it as we ship it. Maybe the best solution for you would be to take just the code you need, strip it down, and get rid of opennlp as a dependency?

@kottmann (Member) commented Aug 2, 2017

@myui the maxent 3.0.1 version went through Apache IP clearance when the code base was moved from SourceForge, and should be almost identical to 3.0.0.

@myui (Member) commented Aug 2, 2017

@kottmann Do you know in which version the maxent classifier was moved into opennlp-tools?
The versioning schemes of the opennlp-maxent and opennlp-tools modules are very different.

https://mvnrepository.com/artifact/org.apache.opennlp/opennlp-maxent
https://mvnrepository.com/artifact/org.apache.opennlp/opennlp-tools

@kottmann (Member) commented Aug 2, 2017

@myui that was done for the 1.6.0 release, and in maxent 3.0.3 it was modified to run in multiple threads.

You probably need to take a similar approach to the one we took for multi-threaded training, e.g. split the amount of work done per iteration and scale it out to multiple machines, merge the parameters, and repeat for the next iteration.
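The per-iteration scale-out pattern described above can be sketched as follows. This toy version just sums per-shard contributions and averages them into shared parameters; a real distributed GIS would compute expected feature counts per shard and apply the GIS update rule, and the shard loop would run as parallel tasks rather than sequentially:

```java
// Sketch of iterative parameter mixing: each iteration, workers compute
// per-shard contributions, a reducer merges them, the shared parameters
// are updated, and the next iteration starts from the merged state.
public class IterativeMerge {

    public static double[] run(double[][] shards, int iterations) {
        final int d = shards[0].length;
        double[] params = new double[d];
        for (int it = 0; it < iterations; it++) {
            double[] merged = new double[d];
            for (double[] shard : shards) {      // "map": one task per shard
                for (int i = 0; i < d; i++) {
                    merged[i] += shard[i];
                }
            }
            for (int i = 0; i < d; i++) {        // "reduce" + parameter update
                params[i] += merged[i] / shards.length;
            }
        }
        return params;
    }

    public static void main(String[] args) {
        double[] p = run(new double[][] { {1, 0}, {0, 1} }, 2);
        System.out.println(p[0] + " " + p[1]); // 1.0 1.0
    }
}
```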

@helenahm (Author) commented Aug 4, 2017

@myui I share the concern: the modification should be based on the latest release of Apache OpenNLP, v1.8.1, if there is no reason to use the pre-Apache release. If I had known about the newer version of maxent at the very beginning, I would have used it.

I will examine the newer maxent code in the next few days.

As you said, have a look at the PR when you have time. And then a decision what to do can be made.
