Spatial And Temporal Modeling For Human Activity Recognition From Multimodal Sequential Data