Journey into Visual AI: Exploring FiftyOne Together — Part III Preparing a Computer Vision Challenge.
Author: Paula Ramos (Senior DevRel and Applied AI Research Advocate at Voxel51)
This blog is part of the series “Journey into Visual AI: Exploring FiftyOne Together,” in which I share my experience using FiftyOne at multiple stages. Don’t miss the previous blogs:
Blog 1: Journey into Visual AI: Exploring FiftyOne Together — Part I Introduction.
Blog 2: Journey into Visual AI: Exploring FiftyOne Together — Part II Getting Started
In this Blog 3, we’ll explore the new Elderly Action Recognition Challenge I’m working on, its goals, the challenges we face, and how the open-source community can collaborate to address them. At the end of this blog, I hope you are interested in participating in the challenge and bringing your ideas to AI for Good.
From the early days of my professional career, I’ve been passionate about applications for automated systems. In recent years, my focus has naturally gravitated toward cutting-edge AI trends. Yet, despite the advancements, many unresolved challenges remain. I vividly remember working during my master’s degree on a system designed to detect falls in the elderly. The idea involved developing a sensor-based belt that activated an inflatable device to prevent injuries. Such concepts have matured over the years, and companies now market similar solutions.
However, with the rise of computer vision and robotics to assist humans in their daily lives, we face a new challenge: leveraging camera-based technology to detect human actions. I recall my first blog with OpenVINO, “Human Action Recognition,” where I implemented an encoder-decoder architecture to generate embeddings from 16 frames and determine actions captured in videos. You cannot miss that notebook—I have my son in there, lovely!
Since then, models have evolved dramatically, with new architectures released nearly every week. This rapid evolution in model development raises the question: Can we generate reliable data at a pace that matches this rate of innovation?
What Is the Elderly Action Recognition Challenge?
This challenge aims to tackle one of the most critical applications in human action recognition: identifying activities of daily living (ADLs) and fall detection for the elderly. The competition invites participants to train models on a significant, generic benchmark of human action recognition and apply transfer learning using a subset of data and class labels specific to elderly-related actions.
Key Details:
Goals: Enable more efficient and accurate recognition of elderly actions, addressing real-world healthcare and assisted living challenges.
Deadlines: Submissions close February 15, 2025.
Evaluation: Given a path to an mp4 file, the evaluation script should take the video as input and output a category and a label. The evaluation framework uses several metrics to ensure a fair and comprehensive assessment.
Submissions must include: 1) an evaluation submission file (CSV/JSON) with the prediction results over the Evaluation Dataset, 2) a Hugging Face link to your PyTorch model weights, and 3) a PDF report documenting the data curation process and the datasets used.
Target Audience: The challenge is open to AI researchers, students, developers, and enthusiasts interested in advancing action recognition in critical domains.
Here is the submission platform: https://eval.ai/web/challenges/challenge-page/2427/overview
Discord Channel: https://discord.com/channels/1266527359511564372/1319053378843836448
Note: This challenge is part of the Computer Vision for Smalls Workshop (CV4Smalls) hosted at WACV 2025.
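The evaluation contract described in the key details (a video path goes in, a category and a label come out) can be sketched as follows. This is only an illustration under stated assumptions: `predict`, the column names, and the example file name are hypothetical placeholders, and the official submission format is the one defined on the submission platform.

```python
import csv
import io


def predict(video_path):
    """Hypothetical stand-in for a trained model: returns (category, label)."""
    # A real implementation would decode the mp4 and run inference here.
    return "Locomotion", "walking"


def write_predictions(video_paths, out_file):
    """Write one row per video; the column names are placeholders."""
    writer = csv.writer(out_file)
    writer.writerow(["video", "category", "label"])
    for path in video_paths:
        category, label = predict(path)
        writer.writerow([path, category, label])


# Write to an in-memory buffer so the sketch runs without touching disk
buffer = io.StringIO()
write_predictions(["clip_001.mp4"], buffer)
print(buffer.getvalue())
```

Keeping inference (`predict`) separate from serialization (`write_predictions`) makes it easy to swap in the real output format once you have checked the official template.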
Human Action Recognition in the Era of Vision Transformers
The development of models for human action recognition has been transformed significantly in the era of Vision Transformers (ViTs). While convolutional architectures laid the foundation, ViTs have introduced a new paradigm with their ability to effectively capture long-range dependencies and process spatiotemporal data.
However, this challenge seeks a solution that doesn’t necessarily rely on Vision Transformers. It is open to different approaches, even simpler ones, emphasizing the solution's practicality and accessibility rather than exclusively adopting cutting-edge architectures.
Data complexity and model generalization are the main challenges in model development and deployment. Handling spatiotemporal data is resource-intensive and demands robust architectures, and achieving high accuracy across diverse datasets remains difficult.
On the data side, unfortunately, data creation has not kept pace with model development, and the available data is still too restrictive. Open-access datasets for elderly action recognition are limited, which complicates reproducibility and benchmarking.
Current Data for Detecting ADLs and Falls in the Elderly
The availability of open-access datasets for elderly action recognition is a critical bottleneck. Most existing datasets have limitations in scale, diversity, or licensing. The key issues I can identify after preparing this material for potential participants of the challenge are:
- Data Limitations: Many datasets lack coverage of diverse scenarios or fail to represent real-world variability.
- Licensing Challenges: Open-access datasets often have restrictive licenses, limiting their utility for commercial or collaborative applications.
The Role of FiftyOne in Video Data Management
As you can see in my previous blogs, FiftyOne is a powerful open-source tool for handling and analyzing data. The new aspect of this blog is that FiftyOne can also process video data, offering critical functionality for dataset curation and exploration in complex datasets.
With FiftyOne, we can create video datasets and streamline importing, organizing, and visualizing video data. It makes the metadata associated with each dataset easy to manage, enabling better insights and analysis. Its data curation tools also help with efficiently visualizing, cleaning, filtering, and curating video datasets, ensuring high-quality inputs for model training.
Here, you can find extra resources for video management with FiftyOne:
- Exploring the UCF101 Dataset: A Large-Scale, YouTube-Based Action Recognition Dataset
- Video Labels — FiftyOne Tips and Tricks (10/14/2023)
Getting Hands-On: Exploring ADL and Fall Detection Datasets
For this demonstration, we’ll dive into the GMDCSA24 dataset, which provides a comprehensive collection of elderly activity and fall detection videos.
- Contains 160 videos (mp4) covering diverse indoor scenarios related to ADLs (81 videos) and falls (79 videos).
- Includes rich metadata for better context and model interpretability.
- Each video could have two or more actions.
- Activities: Drinking, eating, exercising, reading, sitting, sleeping, standing, walking, writing.
- Fall classes: Fall backward (BW), fall forward (FW), fall sideways (SW).
Using FiftyOne, we’ll navigate this dataset, showcasing how to explore its structure, visualize key insights, and prepare it for training robust AI models.
Step 1 – Defining Path for Dataset and Checking if Dataset Exists:
After installing the required libraries and importing the necessary modules, the first step is to define the dataset path and create a new dataset. To avoid conflicts with previous executions, we first check if a dataset with the same name already exists. If it does, we delete it to start fresh.
# Modules used throughout this walkthrough
import os
import re
import fiftyone as fo
import pandas as pd

# Define the path to your dataset
dataset_path = "/path/to/the/GMDCSA24/folder"  # Replace with the actual path
dataset_name = "ADL_Fall_Videos"

# Check if the dataset already exists
if fo.dataset_exists(dataset_name):
    # Delete the existing dataset
    fo.delete_dataset(dataset_name)
# Create a FiftyOne dataset
fo_dataset = fo.Dataset(dataset_name)
Step 2 – Setting up helper functions:
To process the dataset effectively, we define two key helper functions:
2.1 Function to Parse the Classes
This function extracts action names and their respective time ranges from the dataset. Since each video can include multiple actions, the label file specifies which actions occur at specific timestamps. We use this information to split videos into smaller clips and prepare a new dataset based on these segments.
# Function to parse the Classes column
def parse_classes(classes_str):
    actions = []
    if pd.isna(classes_str):
        return actions
    # Split by ';' to handle multiple actions
    class_entries = classes_str.split(';')
    for entry in class_entries:
        match = re.match(r"(.+?)\[(.+?)\]", entry.strip())
        if match:
            action = match.group(1).strip()  # Extract action name
            time_ranges = match.group(2).strip()  # Extract time ranges within brackets
            # Split time ranges by ';' and process each range
            ranges = time_ranges.split(';')
            for time_range in ranges:
                time_match = re.match(r"(\d+(\.\d+)?) to (\d+(\.\d+)?)", time_range.strip())
                if time_match:
                    start_time = float(time_match.group(1))
                    end_time = float(time_match.group(3))
                    # Ensure start_time is less than or equal to end_time
                    if start_time > end_time:
                        continue  # Skip invalid ranges
                    actions.append({"action": action, "start_time": start_time, "end_time": end_time})
    return actions
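To make the parsing step concrete, here is a small standalone sketch that applies the same two regular expressions to a label string. The sample string is hypothetical, written in the "Action[start to end]" style the parser expects; check the real CSV files for the exact format.

```python
import re

# Same patterns as in the parsing function: "Action[...]" and "start to end"
ENTRY_RE = re.compile(r"(.+?)\[(.+?)\]")
RANGE_RE = re.compile(r"(\d+(\.\d+)?) to (\d+(\.\d+)?)")

# Hypothetical Classes value with three consecutive actions
sample = "Sitting[0 to 2.5]; Standing[2.5 to 4]; Walking[4 to 9.5]"

segments = []
for entry in sample.split(";"):
    entry_match = ENTRY_RE.match(entry.strip())
    if not entry_match:
        continue  # Skip entries that are not in "Action[...]" form
    range_match = RANGE_RE.match(entry_match.group(2).strip())
    if range_match:
        segments.append(
            (
                entry_match.group(1).strip(),   # action name
                float(range_match.group(1)),    # start time in seconds
                float(range_match.group(3)),    # end time in seconds
            )
        )

print(segments)
```

Each tuple corresponds to one clip that will later be cut from the full video.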
2.2 Function to Map Actions to Categories
One of the goals of the challenge is to categorize actions. This function maps each action to a predefined category to ensure the action recognition task also includes a higher-level classification.
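The body of this function is not reproduced here, so the following is only a minimal sketch of what such a mapping could look like. The category names and the grouping of actions are hypothetical placeholders chosen for illustration; use the official categories from the challenge page when preparing a submission.

```python
# Hypothetical action-to-category mapping (not the official challenge categories)
ACTION_TO_CATEGORY = {
    "walking": "Locomotion and Posture",
    "standing": "Locomotion and Posture",
    "sitting": "Locomotion and Posture",
    "exercising": "Locomotion and Posture",
    "eating": "Eating and Drinking",
    "drinking": "Eating and Drinking",
    "reading": "Leisure and Stationary",
    "writing": "Leisure and Stationary",
    "sleeping": "Leisure and Stationary",
}

# Fall codes used by the dataset: backward, forward, sideways
FALL_CODES = {"bw", "fw", "sw"}


def get_category(action):
    """Return a higher-level category for an action name (illustrative only)."""
    key = action.strip().lower()
    if key in FALL_CODES or key.startswith("fall"):
        return "Fall"
    # Unknown actions fall back to a catch-all category instead of raising
    return ACTION_TO_CATEGORY.get(key, "Other")


print(get_category("Walking"))
print(get_category("Fall backward"))
```

Normalizing the action name before the lookup keeps the mapping robust to differences in capitalization and stray whitespace in the label files.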
Step 3 – Iteration in the Main Folders, Per Subject, and Splitting Video by Actions Using FiftyOne.
This section combines several important tasks:
Adding Samples to the Dataset: We read the dataset from CSV files, extract metadata (e.g., file name, action, and description), and add these as new samples. We also enrich the metadata by adding fields like subject, type_of_activity (e.g., ADL or Fall), and categories derived from the actions.
Splitting Videos into Clips: We split videos into smaller clips using the parsed action information for each specific action. This is achieved by creating a metadata field called events, which stores the timestamps and frames corresponding to each action.
Exporting the Dataset: The updated dataset can be exported into a FiftyOne format or a Classification Directory Tree after processing. The latter option is especially useful for working with split clips instead of full videos.
# Iterate through the main folders (one per subject)
for subject_folder in os.listdir(dataset_path):
    subject_path = os.path.join(dataset_path, subject_folder)
    if not os.path.isdir(subject_path):
        continue

    # Extract the subject number from the folder name
    subject_number = subject_folder.split("_")[-1]  # Adjust the split logic if needed

    # Look for ADL and Fall folders and CSV files
    adl_folder = os.path.join(subject_path, "ADL")
    fall_folder = os.path.join(subject_path, "Fall")
    label_files = [f for f in os.listdir(subject_path) if f.endswith(".csv")]

    # Load the label table from each CSV file
    for label_file in label_files:
        label_path = os.path.join(subject_path, label_file)
        labels_df = pd.read_csv(label_path)
        # Strip stray whitespace from column names (the Classes header may carry a leading space)
        labels_df.columns = labels_df.columns.str.strip()
        print(label_path)

        for _, row in labels_df.iterrows():
            file_name = row["File Name"]
            length = row["Length (seconds)"]
            time_of_recording = row["Time of Recording"]
            attire = row["Attire"]
            description = row["Description"]
            classes = row["Classes"]

            # Parse the Classes column
            parsed_classes = parse_classes(classes)

            # Determine the video's path from the CSV file name
            # (checking the file name avoids false matches on parent folder names)
            if "ADL" in label_file:
                video_path = os.path.join(adl_folder, file_name)
                subset = "ADL"
            elif "Fall" in label_file:
                video_path = os.path.join(fall_folder, file_name)
                subset = "Fall"
            else:
                continue

            if not os.path.exists(video_path):
                print(f"Video file not found: {video_path}")
                continue

            # Create a FiftyOne sample (a separate name avoids shadowing the label table)
            video_metadata = fo.VideoMetadata.build_for(video_path)
            sample = fo.Sample(filepath=video_path, metadata=video_metadata)

            # Build one TemporalDetection per labeled action
            temp_detections = []
            for action in parsed_classes:
                start_time = float(action["start_time"])
                end_time = float(action["end_time"])
                # Clamp end_time to the video duration
                if end_time > video_metadata.duration:
                    end_time = video_metadata.duration
                event = fo.TemporalDetection.from_timestamps(
                    [start_time, end_time],
                    label=action["action"],
                    sample=sample,
                )
                temp_detections.append(event)
            sample["events"] = fo.TemporalDetections(detections=temp_detections)

            # Add metadata to the sample
            sample["subset"] = subset
            sample["subject_number"] = subject_number
            sample["length"] = length
            sample["time_of_recording"] = time_of_recording
            sample["attire"] = attire
            sample["description"] = description
            sample["classes"] = classes

            # Assign category based on actions
            categories = [get_category(action["action"]) for action in parsed_classes]
            sample["category"] = list(set(categories))  # Deduplicate categories

            # Add the sample to the dataset
            fo_dataset.add_sample(sample)

fo_dataset.compute_metadata()
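Each temporal detection stores its extent as a pair of 1-based frame numbers, which is why the timestamps from the label files have to be converted using the video's metadata. The sketch below shows the idea with a simplified conversion. It is an approximation written for illustration, not FiftyOne's exact implementation, and the helper names, frame rate, and frame count are assumptions.

```python
import math


def timestamp_to_frame(timestamp, frame_rate, total_frames):
    """Map a time in seconds to a 1-based frame number (simplified)."""
    frame = int(math.floor(timestamp * frame_rate)) + 1
    # Clamp to the valid frame range of the video
    return max(1, min(frame, total_frames))


def event_to_support(start_time, end_time, frame_rate, total_frames):
    """Return [first_frame, last_frame] for an event given in seconds."""
    return [
        timestamp_to_frame(start_time, frame_rate, total_frames),
        timestamp_to_frame(end_time, frame_rate, total_frames),
    ]


# Hypothetical 10-second video at 30 fps; the event end exceeds the duration
support = event_to_support(2.5, 12.0, frame_rate=30.0, total_frames=300)
print(support)
```

The clamping mirrors the duration check in the ingestion loop: an annotated end time that runs past the end of the video is cut back to the last available frame.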
Step 4 – Launch the App
Once the dataset is prepared, you can launch the FiftyOne App from code and interact with the dataset. This allows you to explore the dataset visually, create views, and export those views to various formats for further analysis or sharing.
The FiftyOne app provides a highly interactive way to:
- Inspect the dataset and its metadata.
- Visualize events and clips.
- Filter and sort data based on specific criteria.
- Export customized views to your desired format.
session = fo.launch_app(fo_dataset)
view = fo_dataset.to_clips("events")
session.view = view
print(view)
After launching the app, I can see my new metadata and events in the menu on the left side, along with the rest of the dataset's metadata, all of which I added through the code shared above and in the notebook.
Step 5 – Exporting Clips and Single Actions
Using TemporalDetections, we can focus on specific ranges of frames within the original videos, corresponding to individual actions. The events field in the metadata marks these individual events with precise timestamps, enabling clear segmentation.
After this process, we can export only the relevant clips and single actions instead of entire videos with complex labels. This streamlined dataset structure is ideal for training machine learning models or for submission to challenges that require precise action recognition.
view.export(
    export_dir="/path/to/the/GMDCSA24/new_folder",
    dataset_type=fo.types.VideoClassificationDirectoryTree,
)
By isolating and exporting these segments, we reduce dataset size and improve clarity and usability for downstream tasks.
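A classification directory tree places each clip in a subfolder named after its label, so a quick sanity check after exporting is to count the clips per class. The sketch below does this with the standard library only. The helper name is hypothetical, and the demo builds a throwaway folder with made-up class names instead of reading a real export.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_clips_per_class(export_dir, extension=".mp4"):
    """Count video files in each class subfolder of a directory tree export."""
    counts = Counter()
    for class_dir in Path(export_dir).iterdir():
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir() if f.suffix.lower() == extension
            )
    return dict(counts)


# Demo on a temporary folder with hypothetical classes and empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for label, n_clips in {"walking": 2, "fall_forward": 1}.items():
        class_dir = Path(tmp) / label
        class_dir.mkdir()
        for i in range(n_clips):
            (class_dir / f"clip_{i}.mp4").touch()
    demo_counts = count_clips_per_class(tmp)

print(demo_counts)
```

On a real export, pass the export directory instead of the temporary folder. Strongly unbalanced counts are a useful early signal that some actions may need more data or augmentation.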
FiftyOne can manage different kinds of datasets; in this notebook, we used a custom dataset and added each sample to it one by one. Now it is time to export it in the FiftyOne format to use more of FiftyOne's capabilities. For more information about which dataset types FiftyOne can manage, take a look at the FiftyOne documentation.
export_dir = "/path/to/the/GMDCSA24/new_folder_FO_Dataset"
# Export the full dataset created above (fo_dataset) in FiftyOne's native format
fo_dataset.export(
    export_dir=export_dir,
    dataset_type=fo.types.FiftyOneDataset,
)
Additional resources:
- Notebook for digesting GMDCSA24 Dataset: https://github.com/voxel51/fiftyone-examples/blob/master/examples/elderly_action_recognition.ipynb
- GMDCSA24 Dataset: https://github.com/ekramalam/GMDCSA24-A-Dataset-for-Human-Fall-Detection-in-Videos
- Tips and tricks for human action recognition with FiftyOne: https://voxel51.com/blog/exploring-ucf101-youtube-based-action-recognition-dataset/
- Try the FiftyOne APP in a browser: https://try.fiftyone.ai/
- EAR Challenge: https://voxel51.com/computer-vision-events/elderly-action-recognition-challenge-wacv-2025/
- FiftyOne Documentation: https://docs.voxel51.com/
Just wrapping up! 😀
Thank you for joining me in exploring the Elderly Action Recognition Challenge and the powerful tools FiftyOne provides for dataset preparation and video data management. We have learned how to define a complex dataset, launch the FiftyOne App, and export actionable clips, and we’ve seen how FiftyOne streamlines the complexities of handling video datasets.
I invite you to participate in the challenge, test the notebook shared in this blog, and share your experience with FiftyOne.
I would love to hear about your experiences! Please share your thoughts, ask questions, and provide testimonials. Your insights might help others in our next posts. Don’t forget to participate in the challenge and try out the notebook I have created for you all.
Together, we can innovate in action recognition and make meaningful contributions to AI for Good. Let’s build something impactful!
Stay tuned for the next post, in which we’ll explore FiftyOne’s advanced features and evaluate the model.
Let’s make this journey with FiftyOne a collaborative and enriching experience. Happy coding!
Stay Connected:
- Follow me on Medium: https://medium.com/@paularamos_phd
- Follow Me on LinkedIn: https://www.linkedin.com/in/paula-ramos-phd/
- Join the Conversation: Discord Fiftyone-community
What is next?
I’m excited to share more about my journey at Voxel51! 🚀 If you’d like to follow along as I explore the world of AI and grow professionally, feel free to connect or follow me on LinkedIn. Let’s inspire each other to embrace change and reach new heights!
You can find me at some Voxel51 events (https://voxel51.com/computer-vision-events/).