Abstract
Existing methods for dynamic scene recognition mostly use global features extracted from the entire video frame or a video segment. In this paper, a part-based method is proposed to aggregate local features from video frames. A pre-trained Fast R-CNN model is used to extract local convolutional features from the regions of interest of training images. These features are clustered to locate representative parts. A set cover problem is then formulated to select the discriminative parts, which are further refined by fine-tuning the Fast R-CNN model. Local features from a video segment are extracted at different layers of the fine-tuned Fast R-CNN model and aggregated both spatially and temporally. Extensive experimental results show that the proposed method is very competitive with state-of-the-art approaches.
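The part-selection step can be illustrated with a small sketch. The abstract says only that a set cover problem is formulated to select discriminative parts; it does not say how that problem is solved. The sketch below assumes the standard greedy approximation, and the part names and the image indices each part "covers" are hypothetical toy data, not taken from the paper.

```python
from typing import Dict, List, Set


def greedy_set_cover(universe: Set[int], candidates: Dict[str, Set[int]]) -> List[str]:
    """Greedily pick candidate parts until no further image can be covered.

    universe   -- indices of training images that should be covered
    candidates -- hypothetical mapping: part name -> images on which that part fires
    """
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered and candidates:
        # Take the part that covers the most still-uncovered images.
        # Iterating over sorted names makes ties resolve alphabetically.
        best = max(sorted(candidates), key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            break  # the remaining images are not covered by any candidate
        chosen.append(best)
        uncovered -= gained
    return chosen


# Toy data (assumption): six training images and four candidate parts.
images = {0, 1, 2, 3, 4, 5}
parts = {
    "part_a": {0, 1, 2},
    "part_b": {2, 3},
    "part_c": {3, 4, 5},
    "part_d": {1, 4},
}
print(greedy_set_cover(images, parts))  # ['part_a', 'part_c']
```

On this toy input, two parts are enough to cover all six images, so the redundant candidates are discarded. The greedy rule is only an approximation to the optimal cover, and the paper's actual formulation may weight or constrain the parts differently.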
| Original language | English |
|---|---|
| Pages (from-to) | 7353-7370 |
| Number of pages | 18 |
| Journal | Neural Computing and Applications |
| Volume | 33 |
| Issue number | 13 |
| Publication status | Published - Jul 2021 |
Keywords
- Deep neural networks
- Dynamic scene recognition
- Feature aggregation
- Part-based models