The problem is to estimate the growth stage of a wheat crop from an image sent in by the farmer. The model must take an image as input and output a prediction for the growth stage of the wheat shown, on a scale from 1 (crop just showing) to 7 (mature crop).
The following system requirements should be satisfied:
- OS: Ubuntu 16.04
- Python: 3.6
- CUDA: 10.1
- cuDNN: 7
- pipenv (`pip install pipenv`)
The training has been done on 2 x GeForce RTX 2080 GPUs (training the final ensemble takes about 15 hours). The batch sizes are selected accordingly.
- Change the list of available GPUs in `./conf/config.yaml`. The parameter is called `gpu_list`
- Override the batch size if needed in `train.sh` and `inference.sh` by providing the `training.batch_size` argument
- Navigate to the project directory
- Run `pipenv install --deploy --ignore-pipfile` to install dependencies
- Run `pipenv shell` to activate the environment
- Download data from the competition website and save it to the `./data/` directory
- Unzip `Images.zip` there: `unzip Images.zip`
- The best Private LB submission (0.39949 RMSE) is available in `./lightning_logs/best_model.csv`
- Download `zindi_wheat_weights.zip` to the project directory
- Unzip it there: `unzip zindi_wheat_weights.zip`
- For inference, run `./inference.sh`. Final ensemble predictions will be saved to `./lightning_logs/2_3_4_5_6_ens.csv`
- To start training the model from scratch, run `./train.sh` (takes about 15 hours on 2 x RTX 2080)
- Afterwards, `./inference.sh` can be run
The dataset has two sets of labels, `bad` and `good` quality, but the test dataset consists only of good quality labels.
First of all, there is no clear correspondence between bad and good labels (good labels contain only five classes: 2, 3, 4, 5, 7).
Secondly, bad and good quality images can be easily distinguished by a simple binary classifier, so they come from different distributions. Looking at the Grad-CAM of such a model suggests that the major difference between the two sets of images is the white sticks (poles) that appear in some of them.
That's why the training process consists of two steps:
- Pre-train the model on the mix of bad and good quality labels
- Finetune the model only on the good quality labels (see the sketch after this list)
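A minimal sketch of this two-stage schedule in plain PyTorch; the helper, the loaders, and the learning rates here are illustrative, not the project's actual code (the real pipeline is driven by `train.sh` and the configs):

```python
import torch
import torch.nn as nn

def run_stage(model, loader, epochs, lr, device="cuda"):
    # One training stage: plain cross-entropy classification.
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

def train_two_stage(model, bad_good_loader, good_loader):
    # Stage 1: pre-train on the mix of bad- and good-quality labels.
    run_stage(model, bad_good_loader, epochs=10, lr=1e-4)
    # Stage 2: fine-tune on the good-quality labels only (the real pipeline
    # also reduces the learning rate on plateau, omitted here).
    run_stage(model, good_loader, epochs=50, lr=1e-5)
```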
Best single model (average of 5 folds on the test set) achieves 0.40327 RMSE on Private LB.
- Architecture: ResNet101
- Problem type: classification
- Loss: cross entropy
- FC Dropout probability: 0.3
- Input size: (256, 256)
- Predicted probabilities are multiplied by the class labels and summed up (see the decoding sketch below)
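In effect, the final prediction is the expected value of the predicted class distribution. A minimal sketch, assuming the model outputs logits over the five good-label classes (function and variable names are illustrative):

```python
import torch

# The good-quality labels contain only five classes: 2, 3, 4, 5, 7.
CLASS_LABELS = torch.tensor([2.0, 3.0, 4.0, 5.0, 7.0])

def decode_predictions(logits: torch.Tensor) -> torch.Tensor:
    # Softmax over the classes, then the expectation over class labels:
    # a continuous growth-stage estimate, which suits the RMSE metric.
    probs = torch.softmax(logits, dim=1)
    return (probs * CLASS_LABELS).sum(dim=1)

logits = torch.randn(4, 5)         # a dummy batch of 4 model outputs
print(decode_predictions(logits))  # continuous values between 2 and 7
```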
The median image size in the data is about `180 x 512`. For preprocessing, each image is first padded to `img_width // 2 x img_width` and then resized to `256 x 256`, as in the sketch below.
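A rough sketch of that padding-then-resizing step with OpenCV, reconstructed from the description above (the zero padding and vertical centering are assumptions, not confirmed project details):

```python
import cv2
import numpy as np

def preprocess(image: np.ndarray, size: int = 256) -> np.ndarray:
    # Pad the (typically wide) image to height = width // 2; centering
    # and zero padding are assumptions about the implementation.
    h, w = image.shape[:2]
    pad = max(w // 2 - h, 0)
    top, bottom = pad // 2, pad - pad // 2
    image = cv2.copyMakeBorder(image, top, bottom, 0, 0,
                               borderType=cv2.BORDER_CONSTANT, value=0)
    # Resize the padded image to the square network input size.
    return cv2.resize(image, (size, size))
```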
The augmentation list includes:
- Horizontal flips
- RandomBrightnessContrast
- ShiftScaleRotate
- Imitating additional white sticks (poles) on the images
- Label augmentation (changing class labels to neighboring classes with a low probability; see the sketch after this list)
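The label augmentation could look roughly like this (a sketch; the class set follows the good-quality labels, and the probability value is an illustrative guess):

```python
import random

# Ordered growth-stage classes present in the good-quality labels.
CLASSES = [2, 3, 4, 5, 7]

def augment_label(label: int, p: float = 0.05) -> int:
    # With a low probability, move the label to an adjacent class.
    if random.random() < p:
        i = CLASSES.index(label)
        j = min(max(i + random.choice([-1, 1]), 0), len(CLASSES) - 1)
        return CLASSES[j]
    return label
```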
- Use horizontal flips as TTA (see the flip-TTA sketch after this list)
- Pre-train on mix of good and bad labels for 10 epochs
- Fine-tune on good labels for 50 epochs, reducing the learning rate on plateau
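Horizontal-flip TTA amounts to averaging predictions over an image and its mirrored copy; a minimal sketch (whether the averaging happens on probabilities or on decoded values is an assumption):

```python
import torch

def predict_with_flip_tta(model, images: torch.Tensor) -> torch.Tensor:
    # Average class probabilities over the original batch and its
    # horizontal flip (dim 3 is the width axis in NCHW layout).
    with torch.no_grad():
        probs = torch.softmax(model(images), dim=1)
        probs_flip = torch.softmax(model(torch.flip(images, dims=[3])), dim=1)
    return (probs + probs_flip) / 2
```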
Ensembling multiple models worked pretty well in this problem. My final solution is just an average of five 5-fold models (25 checkpoints overall; see the averaging sketch after this list):
- Architectures: ResNet50, ResNet101, ResNeXt50
- Input sizes: `256x256` and `512x512`
- It achieved 0.39949 RMSE on Private LB
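The ensemble itself is a plain average over checkpoints; a sketch, assuming each checkpoint's test predictions have already been decoded to continuous values:

```python
import numpy as np

def ensemble(per_checkpoint_preds):
    # per_checkpoint_preds: one array of continuous growth-stage
    # predictions per checkpoint (25 arrays: 5 models x 5 folds).
    return np.mean(per_checkpoint_preds, axis=0)
```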
- Various pseudolabeling options: generating pseudolabels for the bad quality labels and the test set; pre-training on them; mixing them with the actual labels, etc.
- MixUp and CutMix augmentations
- EfficientNet architectures
- Treating the problem as a regression (didn't help even in the ensemble)
- Stacking over the first-level models' predictions