Human action datasets are used to train AI/ML models that recognize and interpret real human movement in video. Handling these datasets well typically requires large volumes of labeled data and substantial training.
For those interested in Human Action video datasets, Twine has brought together our top selection – so you don’t have to go looking.
Are you ready?
Let’s dive into our list of the best Human Action video datasets in 2022.
Do you want to build a custom dataset? We specialize in helping companies create high-quality custom audio and video datasets. Find out more here.
Here are our top picks for Human Action video datasets:
1. Largest Human Action Video Dataset
Kinetics-700 is a large-scale video dataset that includes human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging.
Each clip is annotated with an action class and lasts approximately 10 seconds.
- 650,000 video clips
- 700 human action classes
- The Kinetics-600 Dataset consists of around 500,000 video clips covering 600 human action classes, with at least 600 video clips per action class. The videos are collected from YouTube.
- NTU RGB+D is a large-scale dataset for RGB-D human action recognition. It involves 56,880 samples of 60 action classes collected from 40 subjects.
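Kinetics-style datasets distribute annotations as CSV files that reference YouTube clips by ID, label, and a start/end time. Here is a minimal sketch of parsing such annotations into clip records; the column names and sample rows are illustrative assumptions, not copied from the actual release.

```python
import csv
import io

# Hypothetical Kinetics-style annotation rows: label, youtube_id, start, end, split.
SAMPLE_CSV = """label,youtube_id,time_start,time_end,split
playing guitar,abc123,10,20,train
shaking hands,def456,35,45,val
"""

def parse_annotations(csv_text):
    """Parse Kinetics-style annotation CSV text into clip records."""
    clips = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        start, end = int(row["time_start"]), int(row["time_end"])
        clips.append({
            "label": row["label"],
            "id": row["youtube_id"],
            "duration": end - start,  # Kinetics clips last roughly 10 seconds
            "split": row["split"],
        })
    return clips

clips = parse_annotations(SAMPLE_CSV)
print(clips[0]["label"], clips[0]["duration"])  # playing guitar 10
```

In practice you would feed these records to a downloader or video loader; the parsing step stays the same.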
2. Best Multimodal Activity Recognition Video Dataset
The Moments in Time Dataset is a research project dedicated to building a very large-scale dataset to help AI systems recognize and understand actions and events in videos. Each video captures the gist of a dynamic scene, allowing models to learn from natural, organic physical movement.
- One million labeled, 3-second video clips
- Involves people, animals, objects, and natural phenomena
- The UTD-MHAD Dataset consists of 27 different actions performed by 8 subjects. Each subject repeated the action four times, resulting in 861 action sequences in total. The RGB, depth, skeleton, and inertial sensor signals were recorded.
- Home Action Genome is a large-scale multi-view video database of indoor daily activities. Every activity is captured by synchronized multi-view cameras, including an egocentric view. There are 30 hours of videos with 70 classes of daily activities and 453 classes of atomic actions.
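Datasets like Moments in Time attach one action label per short clip, so a common first step is building a label-to-index vocabulary and one-hot encoding targets for training. A minimal sketch follows; the label names are illustrative, not taken from the actual dataset.

```python
# Illustrative clip labels (one label per 3-second clip).
labels = ["running", "cooking", "raining", "barking", "cooking"]

def build_vocab(labels):
    """Map each distinct label to a stable integer index (sorted for determinism)."""
    return {name: i for i, name in enumerate(sorted(set(labels)))}

def one_hot(label, vocab):
    """Return a one-hot target vector for a single clip label."""
    vec = [0] * len(vocab)
    vec[vocab[label]] = 1
    return vec

vocab = build_vocab(labels)
print(one_hot("cooking", vocab))  # [0, 1, 0, 0]
```

Sorting the label set keeps index assignment reproducible across runs, which matters when checkpoints are shared between machines.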
3. Best Pose Estimation Video Dataset
MPII Human Pose dataset is a state-of-the-art benchmark for the evaluation of articulated human pose estimation. The images were systematically collected using an established taxonomy of everyday human activities. Each image was extracted from a YouTube video and provided with preceding and following un-annotated frames.
- 25K images
- 40K people with annotated body joints
- 410 human activities
- The Leeds Sports Pose (LSP) Dataset contains 2,000 images of sportspersons gathered from Flickr, 1,000 for training and 1,000 for testing. Each image is annotated with 14 joint locations, where left and right joints are consistently labeled from a person-centric viewpoint.
- The YCB-Video Dataset is a large-scale video dataset for 6D object pose estimation. It provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames.
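Pose estimation benchmarks such as MPII are commonly scored with the Percentage of Correct Keypoints (PCK): a predicted joint counts as correct when it falls within a distance threshold of the ground-truth joint (MPII-style evaluation normalizes the threshold by head size). A minimal sketch with toy coordinates:

```python
import math

def pck(pred, gt, threshold):
    """Percentage of Correct Keypoints: fraction of predicted joints
    within `threshold` pixels of their ground-truth locations."""
    correct = 0
    for (px, py), (gx, gy) in zip(pred, gt):
        if math.hypot(px - gx, py - gy) <= threshold:
            correct += 1
    return correct / len(gt)

# Toy example with three joints; coordinates are illustrative.
gt = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
pred = [(11.0, 10.0), (25.0, 20.0), (30.0, 31.0)]
print(pck(pred, gt, threshold=2.0))
```

Here two of three joints fall within the threshold, so the score is 2/3.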
4. Best Scene Understanding Video Dataset
The ADE20K Dataset is a large-scale semantic segmentation dataset. Every scene-centric image is exhaustively annotated with pixel-level object and object-part labels. With a variety of categories and a large number of images, this is a perfect dataset for building a scene-understanding model.
- 20K scene-centric images
- 150 semantic categories – includes sky, road, grass, and discrete objects like person, car, bed, etc.
- SceneNet is a dataset of labeled synthetic indoor scenes. There are 57 total indoor scenes, with 3,699 objects.
- KITTI Road Dataset is a road and lane estimation benchmark that consists of 289 training and 290 test images. It contains three different categories of road scenes.
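Segmentation models trained on datasets like ADE20K are typically scored with mean intersection-over-union (mIoU) across classes. A minimal sketch over flattened label lists, with a toy 2-class example:

```python
def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes; `pred` and `gt`
    are flat lists of per-pixel class ids."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy flattened 2x3 "image" with two classes.
gt   = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0]
print(mean_iou(pred, gt, num_classes=2))  # 0.5
```

Real pipelines compute the same quantity from integer label maps (e.g. NumPy arrays) rather than Python lists, but the metric is identical.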
5. Best Emotion Recognition Video Dataset
The Expression in-the-Wild (ExpW) dataset was created to investigate whether fine-grained, high-level relation traits can be characterized and quantified from face images in the wild. The accompanying work trains a deep model that learns a rich face representation capturing gender, expression, head pose, and age-related attributes, then performs pairwise-face reasoning for relation prediction.
- Contains 91,793 faces manually labeled with expressions
- Seven basic expression categories: "angry", "disgust", "fear", "happy", "sad", "surprise", and "neutral"
- EmoBank is a corpus of 10k English sentences balancing multiple genres, annotated with dimensional emotion metadata in the Valence-Arousal-Dominance (VAD) representation format. EmoBank is notable for its bi-perspectival (writer and reader) and bi-representational design.
- Aff-Wild is a dataset for emotion recognition from facial images in a variety of head poses, illumination conditions, and occlusions.
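With seven discrete expression categories, a natural sanity check for an expression classifier is per-class accuracy, which exposes classes that a plain overall-accuracy number would hide. A minimal sketch; the predictions below are illustrative, not real model outputs.

```python
# The seven basic expression categories used by ExpW-style labels.
EXPRESSIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def per_class_accuracy(pred, gt):
    """For each expression class present in the ground truth, return the
    fraction of its faces that were predicted correctly."""
    scores = {}
    for c in EXPRESSIONS:
        total = sum(1 for g in gt if g == c)
        if total:
            correct = sum(1 for p, g in zip(pred, gt) if g == c and p == c)
            scores[c] = correct / total
    return scores

# Illustrative predictions over five faces.
gt   = ["happy", "happy", "sad", "neutral", "sad"]
pred = ["happy", "sad",   "sad", "neutral", "sad"]
print(per_class_accuracy(pred, gt))
```

Averaging these per-class scores gives a class-balanced accuracy, which is useful when expression labels are heavily imbalanced, as face datasets collected in the wild usually are.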
To conclude, here are our top picks for the best Human Action video datasets for your projects:
- Largest Human Action Video Dataset: Kinetics-700 Dataset
- Best Multimodal Activity Recognition Video Dataset: Moments in Time Dataset
- Best Pose Estimation Video Dataset: MPII Human Pose Dataset
- Best Scene Understanding Video Dataset: ADE20K Dataset
- Best Emotion Recognition Video Dataset: The Expression in-the-Wild (ExpW) Dataset
We hope that this list has either helped you find a dataset for your project or shown you the myriad of options available to you.
If there are any datasets you would like us to add to the list then please let us know here.
If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!
Let us help you do the math – check our AI dataset project calculator.