Best Human Action Video Datasets of 2022

Human action datasets are used to train AI/ML models that recognize real-time actions and natural, organic human movement. Handling these datasets well typically requires large volumes of data and substantial training.

For those interested in Human Action video datasets, Twine has brought together our top selection – so you don’t have to go looking.

Are you ready?

Let’s dive into our list of the best Human Action video datasets in 2022.

Do you want to build a custom dataset? We specialize in helping companies create high-quality custom audio and video datasets. Find out more here.

Here are our top picks for Human Action video datasets:

1. Largest Human Action Video Dataset

Kinetics-700 is a large-scale video dataset that includes human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. 

Each clip is annotated with an action class and lasts approximately 10 seconds.


  • 650,000 video clips
  • 700 human action classes

Access the dataset


  • The Kinetics-600 Dataset consists of around 500,000 video clips covering 600 human action classes, with at least 600 clips per class. The videos are collected from YouTube.
  • NTU RGB+D is a large-scale dataset for RGB-D human action recognition. It contains 56,880 samples spanning 60 action classes, collected from 40 subjects. 
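If you work with Kinetics-style annotations, clips are typically distributed as CSV rows pointing into YouTube videos rather than as the videos themselves. Here is a minimal Python sketch of parsing such a file; the sample rows and the exact column names (label, youtube_id, time_start, time_end, split) are assumptions to verify against the actual Kinetics release.

```python
import csv
import io

# Hypothetical sample mirroring the Kinetics annotation CSV layout
# (label, youtube_id, time_start, time_end, split) -- check the column
# names against the actual release before relying on them.
SAMPLE = """label,youtube_id,time_start,time_end,split
playing guitar,abc123XYZ00,20,30,train
hugging,def456UVW11,5,15,val
"""

def load_clips(csv_text):
    """Parse annotation rows into dicts with integer clip boundaries."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["time_start"] = int(row["time_start"])
        row["time_end"] = int(row["time_end"])
        rows.append(row)
    return rows

for clip in load_clips(SAMPLE):
    # Each Kinetics clip spans roughly 10 seconds of the source video
    print(clip["label"], clip["time_end"] - clip["time_start"])
```

From here you would fetch and trim each referenced YouTube segment with your downloader of choice.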

2. Best Multimodal Activity Recognition Video Dataset 

The Moments in Time Dataset is a research project dedicated to building a very large-scale dataset to help AI systems recognize and understand actions and events in videos. Each video captures the gist of a dynamic scene, allowing models to be built upon the most organic physical movement.


  • One million, 3-second, labeled videos
  • Involves people, animals, objects, and natural phenomena

Access the dataset


  • The UTD-MHAD Dataset consists of 27 different actions performed by 8 subjects. Each subject repeated each action four times, resulting in 861 action sequences in total. RGB, depth, skeleton, and inertial sensor signals were recorded.
  • Home Action Genome is a large-scale multi-view video database of indoor daily activities. Every activity is captured by synchronized multi-view cameras, including an egocentric view. There are 30 hours of videos with 70 classes of daily activities and 453 classes of atomic actions.
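Because every Moments in Time clip is a fixed 3 seconds, a common preprocessing step is to sample a fixed number of frames uniformly from each clip before feeding them to a model. Here is a minimal sketch, assuming an illustrative 30 fps (about 90 frames per clip); neither value is part of the dataset spec.

```python
def uniform_frame_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly across a clip.

    For a 3-second Moments in Time clip at an assumed 30 fps,
    num_frames would be ~90 (both values are illustrative, not
    part of the dataset specification).
    """
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the midpoint of each of num_samples equal segments
    return [int(step * i + step / 2) for i in range(num_samples)]

print(uniform_frame_indices(90, 8))  # 8 indices spread across 90 frames
```

The midpoint strategy avoids clustering samples at the clip boundaries, which matters for short clips where the action may start or end abruptly.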

3. Best Pose Estimation Video Dataset

The MPII Human Pose dataset is a state-of-the-art benchmark for evaluating articulated human pose estimation. The images were systematically collected using an established taxonomy of everyday human activities. Each image was extracted from a YouTube video and is provided with preceding and following unannotated frames.


  • 25K images
  • 40K people with annotated body joints
  • 410 human activities

Access the dataset


  • The Leeds Sports Pose (LSP) Dataset contains 2,000 images of sportspersons gathered from Flickr: 1,000 for training and 1,000 for testing. Each image is annotated with 14 joint locations, where left and right joints are consistently labeled from a person-centric viewpoint.
  • The YCB-Video Dataset is a large-scale video dataset for 6D object pose estimation. It provides accurate 6D poses of 21 objects from the YCB dataset, observed in 92 videos comprising 133,827 frames.
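MPII results are conventionally reported with the PCKh metric: a predicted joint counts as correct if it lies within a fraction of the person's head-segment length from the ground truth. A minimal NumPy sketch follows; the toy coordinates are made up purely for illustration.

```python
import numpy as np

def pckh(pred, gt, head_sizes, alpha=0.5):
    """PCKh: fraction of predicted joints within alpha * head size
    of the ground truth -- the standard MPII evaluation metric.

    pred, gt:    (N, J, 2) arrays of joint coordinates
    head_sizes:  (N,) per-person head-segment lengths
    """
    dists = np.linalg.norm(pred - gt, axis=-1)   # (N, J) joint errors
    thresh = alpha * head_sizes[:, None]         # (N, 1) per-person cutoff
    return float((dists <= thresh).mean())

# Toy check: two people, two joints each, head size 10 px
gt = np.zeros((2, 2, 2))
pred = np.array([[[1.0, 0.0], [10.0, 0.0]],
                 [[0.0, 2.0], [0.0, 0.0]]])
heads = np.array([10.0, 10.0])
print(pckh(pred, gt, heads))  # 3 of 4 joints fall within 5 px
```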

4. Best Scene Understanding Video Dataset

The ADE20K Dataset is a large-scale semantic segmentation dataset. Every scene-centric image is exhaustively annotated with pixel-level object and object-part labels. With its variety of categories and large number of images, it is an excellent dataset for building scene-understanding models. 


  • 20K scene-centric images
  • 150 semantic categories – includes “stuff” like sky, road, and grass, as well as discrete objects like person, car, and bed

Access the dataset


  • SceneNet is a dataset of labeled synthetic indoor scenes. There are 57 total indoor scenes, with 3,699 objects.
  • KITTI Road Dataset is a road and lane estimation benchmark that consists of 289 training and 290 test images. It contains three different categories of road scenes.
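Models trained on ADE20K-style segmentation masks are usually scored with mean intersection-over-union (mIoU) across the semantic categories. Here is a minimal NumPy sketch of that metric, using toy 2×2 masks rather than real ADE20K annotations.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Mean intersection-over-union across semantic classes, the
    usual metric for ADE20K-style segmentation benchmarks."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union:  # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x2 label masks with two classes
gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
print(per_class_iou(pred, gt, num_classes=2))
```

For real evaluations you would also exclude an "ignore" label and accumulate intersections and unions over the whole validation set before dividing.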

5. Best Emotion Recognition Video Dataset

The Expression in-the-Wild (ExpW) dataset was built to investigate whether fine-grained, high-level relation traits can be characterized and quantified from face images in the wild. The accompanying deep model learns a rich face representation that captures gender, expression, head pose, and age-related attributes, then performs pairwise-face reasoning for relation prediction. 


  • Contains 91,793 faces manually labeled with expressions
  • Seven basic expression categories: “angry”, “disgust”, “fear”, “happy”, “sad”, “surprise”, and “neutral”

Access the dataset


  • EmoBank is a corpus of 10k English sentences balancing multiple genres, annotated with dimensional emotion metadata in the Valence-Arousal-Dominance (VAD) representation format. It stands out for its bi-perspectival (writer and reader) and bi-representational design. Note that it is a text corpus rather than a video dataset.
  • Aff-Wild is a dataset for emotion recognition from facial images in a variety of head poses, illumination conditions, and occlusions.
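When consuming ExpW labels, you typically map the integer expression label back to one of the seven categories above. A minimal sketch follows; the 0–6 index order used here is an assumption to verify against the ExpW label documentation.

```python
# The seven basic expression categories listed above. The 0-6 index
# order is an assumption -- confirm it against the ExpW label docs.
EXPRESSIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def decode_label(idx):
    """Map an integer expression label to its category name."""
    if not 0 <= idx < len(EXPRESSIONS):
        raise ValueError(f"unknown expression label: {idx}")
    return EXPRESSIONS[idx]

print(decode_label(3))  # happy
```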

Wrapping up

To conclude, here are our top picks for the best Human Action video datasets for your projects:

  1. Largest Human Action Video Dataset: Kinetics-700 Dataset
  2. Best Multimodal Activity Recognition Video Dataset: Moments in Time Dataset
  3. Best Pose Estimation Video Dataset: MPII Human Pose Dataset
  4. Best Scene Understanding Video Dataset: ADE20K Dataset
  5. Best Emotion Recognition Video Dataset: The Expression in-the-Wild (ExpW) Dataset

We hope that this list has either helped you find a dataset for your project or shown you the myriad of options available to you. 

If there are any datasets you would like us to add to the list then please let us know here.

If you would like to find out more about how we could help build a custom dataset for your project then please don’t hesitate to contact us!

Let us help you do the math – check our AI dataset project calculator.

Ready to learn more? Check out our Dataset Archives:

Twine AI

Harness Twine’s established global community of over 400,000 freelancers from 190+ countries to scale your dataset collection quickly. We have systems to record, annotate and verify custom video datasets at an order of magnitude lower cost than existing methods.