Failed to train the comments-content model with train_many_models #91

Open
kun-ninja opened this issue May 3, 2019 · 0 comments

Hi, I am new to machine learning and am trying to train a model by following https://github.com/dragnet-org/dragnet#training-content-extraction-models, but training fails with the error ValueError: setting an array element with a sequence.
I am using only the dragnet data, and my code is simply as follows:

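from dragnet.data_processing import extract_all_gold_standard_data
from dragnet.extractor import Extractor
from dragnet.model_training import train_many_models
from sklearn.ensemble import ExtraTreesClassifier

# rootdir and train_out_dir are path variables defined earlier in train.py (not shown here).
features = ['kohlschuetter', 'weninger', 'readability']
to_extract = ['content', 'comments']

# Extract the gold standard blocks from the dragnet data before training.
extract_all_gold_standard_data(
    data_dir=rootdir,
    nprocesses=20
)
model = ExtraTreesClassifier()
base_extractor = Extractor(
    features=features,
    to_extract=to_extract,
    model=model
)
# Grid search over the number of trees; this is the call that raises the error.
param_grid = {'n_estimators': [10, 20, 50, 75]}
extractor = train_many_models(base_extractor, param_grid, rootdir, train_out_dir, verbose=1)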

The detailed error message is:

train.py
WARNING:root:extraction failed: too few blocks (1)
/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/model_selection/_search.py:643: DeprecationWarning: "fit_params" as a constructor argument was deprecated in version 0.19 and will be removed in version 0.21. Pass fit parameters to the "fit" method instead.
  '"fit" method instead.', DeprecationWarning)
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Traceback (most recent call last):
  File "/Users/XXXXXX/dev/src/train.py", line 31, in <module>
    extractor = train_many_models(base_extractor, param_grid, rootdir, train_out_dir, verbose=1)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/dragnet/model_training.py", line 190, in train_many_models
    gscv = gscv.fit(train_features, train_labels)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 722, in fit
    self._run_search(evaluate_candidates)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 1191, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 711, in evaluate_candidates
    cv.split(X, y, groups)))
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 917, in __call__
    if self.dispatch_one_batch(iterator):
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 528, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 253, in fit
    sample_weight = check_array(sample_weight, ensure_2d=False)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 527, in check_array
    array = np.asarray(array, dtype=dtype, order=order)
  File "/Users/XXXXXX/dev/venv3/lib/python3.6/site-packages/numpy/core/numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

However, if I change
to_extract = ['content', 'comments'] to to_extract = ['content'], training succeeds.
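
For reference, the last frames of the traceback show the failure happening while sample_weight is converted with np.asarray inside check_array. The sketch below reproduces the same message outside of dragnet, using plain NumPy. The helper name try_as_float_array and both weight lists are hypothetical, made up for illustration, and the dtype=float conversion is an assumption about what check_array ends up doing here. It only shows the kind of input that produces this ValueError (an entry that is itself a sequence); it does not confirm that dragnet builds the weights this way when two targets are requested.

```python
import numpy as np


def try_as_float_array(values):
    """Try to convert values to a float array; return (ok, message).

    Hypothetical helper for illustration only, not part of dragnet or sklearn.
    """
    try:
        arr = np.asarray(values, dtype=float)
    except ValueError as exc:
        # NumPy refuses to store a sequence in a single float slot.
        return False, str(exc)
    return True, "shape={}".format(arr.shape)


# One scalar weight per sample: converts cleanly to a 1-D float array.
flat_weights = [1.0, 0.5, 2.0]

# Hypothetical ragged input: the second entry is itself a sequence.
ragged_weights = [1.0, [0.5, 2.0], 2.0]

ok_flat, msg_flat = try_as_float_array(flat_weights)
ok_ragged, msg_ragged = try_as_float_array(ragged_weights)

print(ok_flat, msg_flat)
print(ok_ragged, msg_ragged)
```

If the weights passed to fit turn out to have a similar mixed or nested structure when to_extract has two entries, that would be consistent with the single-target run succeeding, but this is a guess to be checked against the actual arrays.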

Could you give me some guidance on this failure? Thanks.
