[WIP][HIVEMALL-126] Maximum Entropy Model using OpenNLP MaxEnt #93
base: master
Conversation
And please, disregard the LDAUDTFTest.java. We have already discussed that.
@helenahm Thank you for your contribution!
BTW, left some comments. Review is still on-going though.
import opennlp.model.MaxentModel;
import opennlp.model.RealValueFileEventStream;

public class MaxEntPredictUDF extends GenericUDF {
UDF/UDTF/UDAF classes should have a UDF annotation, as seen in https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/smile/tools/TreePredictUDF.java#L56. It is used when desc function extended xxx is run.
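For illustration, here is a minimal, self-contained sketch of the annotation mechanism. The Description annotation below is a local stand-in mirroring Hive's org.apache.hadoop.hive.ql.exec.Description (a real UDF would use Hive's own annotation, as TreePredictUDF does); reading it reflectively is roughly what desc function extended does:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Local stand-in for Hive's @Description annotation; the real one lives in
// org.apache.hadoop.hive.ql.exec.Description.
@Retention(RetentionPolicy.RUNTIME)
@interface Description {
    String name();
    String value();
}

// A UDF class annotated the way the reviewer asks for, so that
// `desc function extended maxent_predict` has text to display.
@Description(name = "maxent_predict",
        value = "_FUNC_(model, attributes, features) - predicts by a maxent model")
public class AnnotatedUdfSketch {
    // Reads the annotation reflectively, roughly what Hive does when it
    // answers `desc function extended`.
    public static String describe(Class<?> udf) {
        Description d = udf.getAnnotation(Description.class);
        return d == null ? "" : d.name() + ": " + d.value();
    }

    public static void main(String[] args) {
        System.out.println(describe(AnnotatedUdfSketch.class));
    }
}
```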
@Override
public String getDisplayString(String[] children) {
    return "tree_predict(" + Arrays.toString(children) + ")";
tree_predict is not correct here. Better to rename it to maxent_predict or something similar.
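A standalone sketch of the suggested fix, with the display-string logic extracted into a plain static method so it runs by itself (maxent_predict is the suggested name, not necessarily the final one):

```java
import java.util.Arrays;

public class DisplayStringSketch {
    // What getDisplayString would return once renamed to match the
    // function this UDF actually implements.
    public static String displayString(String[] children) {
        return "maxent_predict(" + Arrays.toString(children) + ")";
    }

    public static void main(String[] args) {
        // prints: maxent_predict([model, features])
        System.out.println(displayString(new String[] {"model", "features"}));
    }
}
```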
 * @return the Event object which is next in this EventStream
 */
public Event next() {
    while (next == null && (rowNum < ds.numRows()))
{} is needed per the Google coding style that Hivemall follows.
private Event createEvent(double[] obs, int y) {
    rowNum++;
    if (obs == null)
{} is needed here too, same as above.
added 2 maxent functions
float[] values = new float[obs.length];
for (int i = 0; i < obs.length; i++) {
    if (attributes[i].type == AttributeType.NOMINAL) {
        names[i] = i + "_" + String.valueOf(obs[i]).toString();
String.valueOf(obs[i]) is enough; toString() is redundant.
Moreover, i + "_" + obs[i] will be perfectly compiled into new StringBuilder().append(i).append("_").append(obs[i]), so the explicit conversion is not needed at all.
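To illustrate the point, the concatenation alone already produces the desired string. This standalone sketch builds the nominal feature name without the String.valueOf(...).toString() round trip:

```java
public class ConcatSketch {
    // Builds the nominal feature name the way the reviewed code does, minus
    // the redundant String.valueOf(...).toString(): javac compiles this
    // concatenation into StringBuilder appends anyway.
    public static String featureName(int i, double obs) {
        return i + "_" + obs;
    }

    public static void main(String[] args) {
        System.out.println(featureName(3, 1.0)); // prints: 3_1.0
    }
}
```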
for (int i = 0; i < obs.length; i++) {
    if (attributes[i].type == AttributeType.NOMINAL) {
        names[i] = i + "_" + String.valueOf(obs[i]).toString();
        values[i] = Double.valueOf(1.0).floatValue();
Why not just values[i] = 1.f;?
    }
}

String opts_str = "";
Better to use StringBuilder for opts_str. http://www.pellegrino.link/2015/08/22/string-concatenation-with-java-8.html
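A standalone sketch of the suggested change, with hypothetical names (buildOpts and tokens are illustrative, not from the PR): accumulate into one StringBuilder instead of repeated String concatenation, which allocates a new String per +=:

```java
public class OptsSketch {
    // Accumulates option tokens in a single StringBuilder; each += on a
    // String inside a loop would instead create a fresh String object.
    public static String buildOpts(String[] tokens) {
        StringBuilder opts = new StringBuilder();
        for (String t : tokens) {
            if (opts.length() > 0) {
                opts.append(' ');
            }
            opts.append(t);
        }
        return opts.toString();
    }

    public static void main(String[] args) {
        // prints: -attrs Q,C
        System.out.println(buildOpts(new String[] {"-attrs", "Q,C"}));
    }
}
```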
if (rowNum < ds.numRows()) {
    next = createEvent(ds.getRow(rowNum), y[rowNum]);
};
}; => }
for (int i = 0; i < obs.length; i++) {
    if (attrs[i].type == AttributeType.NOMINAL) {
        names[i] = i + "_" + String.valueOf(obs[i]).toString();
        values[i] = Double.valueOf(1.0).floatValue();
values[i] = 1.f;
Just added a MaxEntMixtureWeightUDAF to aggregate the weights of models obtained on each part of the data. Tested it on EMR only: add jar hivemall-core-0.4.2-rc.2-maxent-with-dependencies.jar; where tmodel5 contains 5 lines of the same model. I will do more testing.
• Could you create a JIRA issue for the MaxEntropy classifier and rename the PR title to [WIP][HIVEMALL-xxx] .....? — How do I do that?
• Could you apply Hivemall code formatting? You can use mvn formatter:format or this style file for Eclipse/IntelliJ. — Just mvn formatter:format? How does it work? Is there something in the pom that tells Maven how to re-format files?
I created an issue for you. https://issues.apache.org/jira/browse/HIVEMALL-126
If you are using Eclipse or IntelliJ, just import the above style file as the project code formatter and run source code formatting in the IDE.
I have just put myself as a watcher on HIVEMALL-126 and ran mvn formatter. The formatter changed files written by other people too. Shall I commit all the changes, or would it be better to pinpoint my own files only?
Not sure; please avoid committing it. For coveralls, it seems we need to fix the Travis config due to this issue. Also, it seems CI is killed due to some reason, while the latest commit in master is build passing.
<dependency>
  <groupId>opennlp</groupId>
  <artifactId>maxent</artifactId>
  <version>3.0.0</version>
</dependency>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using a recent version of OpenNLP instead. The maxent code was moved into opennlp-tools (latest version is 1.8.1). The version you use here is from a couple of years ago.
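If the move the reviewer suggests were made, the dependency block above would change along these lines. This is an assumption based on the comment (maxent was folded into opennlp-tools, whose latest version at the time was 1.8.1), not a change present in the PR:

```xml
<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.8.1</version>
</dependency>
```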
Thank you for the comment. Is there a reason for chasing the latest version? Algorithms do not age... Was an error discovered in the maxent 3.0.0 training?
We fixed a couple of bugs over time, added new features, and did the usual maintenance (e.g. testing on recent Java versions), so yeah, it probably makes sense to use a recent version when you build something new. It also now supports multi-threaded training.
The version you are linking above is 7 years old.
In general I totally agree. I think it would be good to perform the move to another version of maxent in a few steps.
- The code I have re-used is that of GISTrainer. That is more or less updating the weights in a matrix, where the matrix is Hivemall's Matrix. Everything else just follows your class structure. I have checked that the resulting models are the same, and I have also confirmed that the resulting model makes sense on my own data. So the resulting weights must be correct. Can we say that training is correct and accept the current version as correct and functioning?
- After that there are a few options:
• we could try to re-write the code in a way that will accept the newest version of OpenNLP maxent and all following versions. I guess that would require changes in OpenNLP maxent too, but perhaps it is better than manually altering GISTrainer every time you update something, and both projects would benefit from such collaboration.
• if not, perhaps for Hivemall as a project we could consider re-writing iterative scaling from scratch to make it Hivemall-efficient, perhaps using the tricks OpenNLP uses to make the code more efficient, and making sure the resulting weights are comparable, but without aiming to plug in a new OpenNLP jar each time a new version appears.
What do you think?
Regards,
Elena.
The same as before: I recommend using a recent version.
Since you are including parts of the code directly, I kindly ask you to also update the NOTICE file with Apache OpenNLP attribution.
merge into newer directory
I think the code is ready to be checked and pulled now.
@helenahm I think it's better to use Apache opennlp-tools.
The NOTICE file should be updated too if ported sources (not the jar) are included.
It will involve some work. Let me explain. You were right when you said that the OpenNLP implementation is poor memory-wise. Indeed, they store data in [][] arrays, and several times over. Using their code directly causes Java heap space errors, GC errors, etc. (I tested that on my 97 million data rows; the newer version of the code has the same problems.) And you were right about the wonderful CSRMatrix, and DoKMatrix too. They allow storing more data. Thus, more or less, I have changed all the [][] arrays related to input data to CSRMatrix, and the [][] holding weights to DoKMatrix.

To explain that in more detail, it is best to look at the source code of GISTrainer, in fact all 3 of them: old maxent, new maxent, and Hivemall's BigGISTrainer. The links are below.
Newer GISTrainer:
Older (3.0.0) GISTrainer:
Hivemall GISTrainer:
Notice how trainModel of BigGISTrainer gets a MatrixForTraining (https://github.com/helenahm/incubator-hivemall/blob/master/core/src/main/java/hivemall/opennlp/tools/MatrixForTraining.java), which contains references to the Matrix and the outcomes. This is a CSRMatrix, and row data is collected from the CSRMatrix in MatrixForTraining instead of a double[][]. The way results are stored in GISTrainer did not change very dramatically.

If 3.0.0 training is reliable enough, I would of course consider the existing version as 1.0, and put in the effort to adapt the newer GISTrainer later on. It makes sense to do that, I totally agree. And perhaps it makes sense to continue after that to understand the training process in greater detail and perhaps write a new comparable trainer that is independent of OpenNLP.
@helenahm as far as I know, the training data is stored once in memory, and then a copy of the parameters is stored for each thread. So yes, if you have a lot of training data, running out of memory is one symptom you run into, but that is not the actual problem of this implementation. The actual cause is that it won't scale beyond one machine. Bottom line: if you want to use GIS training with lots of data, don't use this implementation; the training requires a certain amount of CPU time, and it increases with the amount of training data. In case you manage to make this run with much more data, the time it takes to run will be uncomfortably high.
"Yeah, so if you have a lot of training data then running out of memory is one symptom you run into, but that is not the actual problem of this implementation."
"The actual cause is that it won't scale beyond one machine."
"In case you manage to make this run with much more data the time it will take to run will be uncomfortably high."
@helenahm I agree to use Hivemall's Matrix to reduce memory consumption and create a custom BigGISTrainer for Hivemall. My concern is that the modification should be based on the latest release of Apache OpenNLP. Anyway, I'll look into your PR after merging #105, maybe in the next week. Some refactoring would be applied (such as removing debug prints and unused code) by forking your PR branch. BTW, multi-threading should be avoided when running a task in a Yarn container. Better to parallelize by Hive.
Sure, there are ways to make this work across multiple machines, but then you can't use it the way we ship it. Maybe the best solution for you would be to just take the code you need, strip it down, and get rid of OpenNLP as a dependency?
@myui the maxent 3.0.1 version went through Apache IP clearance when the code base was moved from SourceForge, and should be almost identical to 3.0.0.
@kottmann Do you know in which version the maxent classifier was moved to opennlp-tools? https://mvnrepository.com/artifact/org.apache.opennlp/opennlp-maxent
@myui that was done for the 1.6.0 release, and in maxent 3.0.3 it was modified to run in multiple threads. You probably need to take a similar approach to the one we took for multi-threaded training, e.g. split the amount of work done per iteration and scale it out to multiple machines, merge the parameters, and repeat for the next iteration.
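The scale-out recipe described above (split the per-iteration work, merge, repeat) can be sketched as follows. This is illustrative Java, not OpenNLP's actual API: the names, array shapes, and merge step are assumptions; the update rule is the classic GIS step lambda_j += log(observed_j / expected_j) / C.

```java
import java.util.Arrays;

public class GisStepSketch {
    // One distributed GIS iteration, sketched: each machine computes the
    // model's expected feature counts on its data shard; the driver sums
    // the partial counts and applies the GIS update to every parameter.
    public static double[] gisStep(double[] lambda, double[] observed,
                                   double[][] partialExpected, double correctionC) {
        double[] expected = new double[lambda.length];
        for (double[] shard : partialExpected) {
            for (int j = 0; j < expected.length; j++) {
                expected[j] += shard[j]; // merge step across machines
            }
        }
        double[] updated = lambda.clone();
        for (int j = 0; j < updated.length; j++) {
            updated[j] += Math.log(observed[j] / expected[j]) / correctionC;
        }
        return updated; // broadcast for the next iteration
    }

    public static void main(String[] args) {
        // Two shards whose expected counts sum to the observed count,
        // so the parameter does not move on this iteration.
        double[] next = gisStep(new double[] {0.5}, new double[] {4.0},
                new double[][] {{1.0}, {3.0}}, 1.0);
        System.out.println(Arrays.toString(next)); // prints: [0.5]
    }
}
```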
@myui I share the concern that the modification should be based on the latest release of Apache OpenNLP, v1.8.1, if there is no reason to use the pre-Apache release. If I had known about the newer version of maxent at the very beginning, I would have used it. I will examine the newer maxent code in the next few days. As you said, have a look at the PR when you have time, and then a decision on what to do can be made.
What changes were proposed in this pull request?
A Distributed Max Entropy Model
What type of PR is it?
Feature
What is the Jira issue?
HIVEMALL-126
How was this patch tested?
There are two tests at the moment, hivemall.smile.classification.MaxEntUDTFTest.java
and hivemall.smile.tools.MaxEntPredictUDFTest.java
plus I have tested the code on EMR:
add jar hivemall-core-0.4.2-rc.2-maxent-with-dependencies.jar;
add jar opennlp-maxent-3.0.0.jar;
source define-all.hive;
create temporary function train_maxent_classifier as 'hivemall.smile.classification.MaxEntUDTF';
create temporary function predict_maxent_classifier as 'hivemall.smile.tools.MaxEntPredictUDF';
drop table tmodel_maxent;
CREATE TABLE tmodel_maxent
STORED AS SEQUENCEFILE
AS
select
train_maxent_classifier(features, klass, "-attrs
Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,Q,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q,Q,Q,Q,Q,Q,Q,Q")
from
t_test_maxent;
create table tmodel_combined as
select model, attributes, features, klass from t_test_maxent join tmodel_maxent;
create table tmodel_predicted as
select
predict_maxent_classifier(model, attributes, features) result, klass from tmodel_combined;
Source table:
drop table t_test_maxent;
create table t_test_maxent as select
array( x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,
cast(tWord(x37) as double),
cast(tWord(x38) as double),
cast(tWord(x39) as double),
cast(tWord(x40) as double),
cast(tWord(x41) as double),
cast(tWord(x42) as double),
cast(tWord(x43) as double),
cast(tWord(x44) as double),
cast(contentWord(x45) as double),
cast(contentWord(x46) as double),
cast(contentWord(x47) as double),
cast(contentWord(x48) as double),
cast(contentWord(x49) as double),
cast(contentWord(x50) as double),
cast(contentWord(x51) as double),
cast(contentWord(x52) as double),
cast(contentWord(x53) as double),
cast(presentationWord(x54) as double),
cast(presentationWord(x55) as double),
cast(presentationWord(x56) as double),
cast(presentationWord(x57) as double),
cast(presentationWord(x58) as double),
cast(presentationWord(x59) as double),
cast(presentationWord(x60) as double),
cast(presentationWord(x61) as double),
cast(presentationWord(x62) as double),
x63,x64,x65,x66,x67,x68,x69,x70) features
, klass from pdfs_and_tiffs_instances_combined_instances where regexp_replace(tp, 'T', '') == '76_698_855_347';
How to use this feature?
Maximum Entropy Classifier is, from my point of view, the most useful classification technique for many NLP tasks and for many other tasks that are not related to NLP. It is used for part-of-speech tagging, NER, and some other tasks.
I have been searching for a distributed version of it and found only one article that talks about it: "Efficient Large Scale Distributed Training of Conditional Maximum Entropy Models" by Mehryar Mohri [quite well-known] and his colleagues at Google. (Please let me know how I can send you the article if you cannot get it by googling.) Thus, I think it is time to implement it. I use the Mixture Weight Method they describe.
create temporary function aggregate_classifiers as 'hivemall.smile.tools.MaxEntMixtureWeightUDAF';
select aggregate_classifiers(model) from tmodel5;
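The Mixture Weight Method referenced above, as the UDAF is described here, boils down to element-wise averaging of the weight vectors of models trained on separate data partitions. A hedged sketch with illustrative names (this is not the actual MaxEntMixtureWeightUDAF internals):

```java
import java.util.Arrays;

public class MixtureWeightSketch {
    // Mixture Weight Method: K models are trained independently on K data
    // partitions, and their weight vectors are averaged element-wise into
    // a single combined model.
    public static double[] mix(double[][] modelWeights) {
        double[] mixed = new double[modelWeights[0].length];
        for (double[] w : modelWeights) {
            for (int j = 0; j < mixed.length; j++) {
                mixed[j] += w[j] / modelWeights.length;
            }
        }
        return mixed;
    }

    public static void main(String[] args) {
        double[][] models = {{1.0, 3.0}, {3.0, 5.0}};
        System.out.println(Arrays.toString(mix(models))); // prints: [2.0, 4.0]
    }
}
```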
See if you like the idea and will accept the code. It is based on Apache maxent, which is open source and written in a simple way.
Regards,
Elena.