The videos total 2 hours in duration. The fire videos were shot with different cameras, during both day and night. The dataset can be used for tasks such as fire detection.
Human lipreading performance increases for longer words, indicating the importance of features capturing temporal context in an ambiguous communication channel.
These expressions are produced at two levels of emotional intensity (regular and strong), except for the neutral emotion, which is produced only at regular intensity.
Each video contains a number of procedure steps that together fulfill a recipe. All procedure segments are temporally localized in the video with a start time and an end time. The distributions of 1) video duration, 2) number of recipe steps per video, 3) recipe segment duration and 4) number of words per sentence are shown below.
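Separately from the reported distributions, here is a minimal sketch of how the four statistics could be computed from segment annotations. The annotation layout (the keys `duration`, `segments`, `start`, `end`, `sentence`), the toy values, and the function name `summarize` are all assumptions made for illustration, not the dataset's actual schema or data.

```python
from statistics import mean

# Hypothetical annotation layout (an assumption, not the dataset's real schema):
# each video has a duration in seconds and a list of temporally localized
# recipe segments, each with a start time, an end time, and a sentence.
toy_videos = {
    "video_a": {
        "duration": 120.0,
        "segments": [
            {"start": 10.0, "end": 25.0, "sentence": "crack the eggs into a bowl"},
            {"start": 30.0, "end": 50.0, "sentence": "whisk until smooth"},
        ],
    },
    "video_b": {
        "duration": 60.0,
        "segments": [
            {"start": 5.0, "end": 15.0, "sentence": "add salt"},
        ],
    },
}


def summarize(videos):
    """Collect the four per-dataset quantities and return their means."""
    all_segments = [seg for v in videos.values() for seg in v["segments"]]
    return {
        # 1) video duration
        "mean_video_duration": mean(v["duration"] for v in videos.values()),
        # 2) number of recipe steps per video
        "mean_steps_per_video": mean(len(v["segments"]) for v in videos.values()),
        # 3) recipe segment duration (end time minus start time)
        "mean_segment_duration": mean(s["end"] - s["start"] for s in all_segments),
        # 4) number of whitespace-separated words per sentence
        "mean_words_per_sentence": mean(len(s["sentence"].split()) for s in all_segments),
    }


stats = summarize(toy_videos)
```

Replacing `mean` with a histogram over the same lists would give the full distributions rather than a single summary value.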