Enhancing accuracy in ship classification through deep learning ensembles and multi-teacher knowledge distillation
Abstract
In maritime surveillance, reliably classifying ships from optical imagery remains challenging due to the wide variability in ship appearance, lighting conditions, and occlusions. In this work, we study classification committees formed by the optimized selection of eight Convolutional Neural Network (CNN) models: EfficientNetV2, ResNetRS, ResNetV2, MobileNetV3, InceptionResNetV2, InceptionV3, Xception, and NASNet. Among the evaluated ensemble strategies, the stacking-based meta-model, which combines CatBoost, XGBoost, LightGBM, and LightGBM-Large, achieved the highest accuracy of 96.49% on the InaTechShips dataset, a 0.60 percentage point gain over the best standalone model, EfficientNetV2B3, which reached 95.89%. The evaluation was conducted using 5-fold cross-validation, ensuring greater robustness of the results. The approach was also validated on the MARVEL dataset, confirming its robustness across different domains. To reduce computational cost, a unified preprocessing block and parallel execution strategies were employed, reducing inference time by 58.03%, from 36.60 to 15.36 milliseconds, compared to the sequential approach with individual preprocessing blocks. Despite this improvement in efficiency, the solution still demands substantial GPU resources, which may limit its application on resource-constrained devices. Therefore, in another part of the study, the five stacking-based meta-models were used as teachers in a multi-teacher knowledge distillation scheme to train an EfficientNetV2B3 student model. The resulting model achieved an accuracy of 96.14%, a 0.25 percentage point gain over the same model trained without distillation, while maintaining an inference time of approximately 1.08 milliseconds.
These results highlight the trade-off between two distinct approaches: one that maximizes accuracy through committees at the expense of increased computational cost, and another that replicates ensemble behavior in a single model, achieving more modest gains while preserving computational efficiency. The dataset, the fold splits for training, validation, and testing, and the trained models are publicly available1.
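The multi-teacher distillation summarized above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes the common scheme of averaging the teachers' temperature-scaled soft targets and blending a KL-divergence term with the hard-label cross-entropy, with the function name, temperature T, and weight alpha chosen here for illustration.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled, numerically stable softmax.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels, T=4.0, alpha=0.5):
    """Multi-teacher distillation loss (illustrative):
    alpha * T^2 * KL(avg_teacher || student) + (1 - alpha) * CE(student, labels)."""
    # Average the teachers' softened distributions into one soft target.
    p_teacher = np.mean([softmax(t, T) for t in teacher_logits_list], axis=0)
    p_student = softmax(student_logits, T)
    # KL divergence between the averaged teacher and the student (per sample).
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)), axis=-1)
    # Standard cross-entropy against the ground-truth labels (T = 1).
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    # T^2 rescaling keeps the soft-target gradients comparable across temperatures.
    return alpha * (T ** 2) * kl.mean() + (1 - alpha) * ce.mean()
```

In a training loop, `teacher_logits_list` would hold the (frozen) outputs of the five stacking-based meta-model teachers on the current batch, and the student (here, EfficientNetV2B3) would be updated by minimizing this loss.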