EnvisionBOX

Building Open eXchange (BOX) for learners of multimodal social signal processing and analyses


This platform promotes a community of learners and a culture of sharing knowledge and labor within fields that work with multimodal data to understand embodied minds in interaction and communication. Here you can learn new programming skills and routines that will help you qualify, process, analyze, and classify multimodal data streams reflecting the diversity of ways that human and non-human animals communicate.

We aim to develop and compile resources that help researchers make use of the open-source resources that are increasingly available. Next to our coding modules, check out our curated list of open-access multimodal datasets and our overview of academic-led diamond open access journals that are potential outlets for this community. We also organize events such as summer/winter schools and workshops to support social scientists in coding and data analysis.

Subscribe to our newsletter to receive content updates and summer/winter school announcements. You can also put in a request for new modules and share your experiences via our 2-minute survey.

Points of departure

Community of learners

As researchers, we often communicate the results of studies and describe methods, but we don't always tailor our communication to teach or guide readers on how to use or reproduce our methods. This platform aims to promote literacy and use of each other's methods, and everyone is welcome to contribute.

Didactical

Each contributed module introduces code with instructions on how to use it in practice. The modules are designed to introduce learners to new concepts and routines, rather than just sharing code with other experts.

Self-ownership

Each module contributor provides the appropriate citation for their module, so they are properly acknowledged when they have been helpful to others in the community.

Build to grow

We envision this platform expanding in scope to host lectures on general theoretical and methodological frameworks, as well as a well-curated update bulletin featuring new papers and tools of interest to the community. If you have ideas and time to help out, please reach out.

Modules quick search

Modules

Phoneme-Level Text-to-Speech Synchronization (for Python)

In this module, we demonstrate how the Montreal Forced Aligner (MFA), an open-source tool, can be used to automatically align speech with its corresponding text transcription at both the word and phoneme levels.
By Shuguang Sheng, Davide Ahmar & Wim Pouw

EnvisionObjectAnnotator (Python App)

This module provides a new Python desktop app that leverages SAM2 to automatically track any object in videos and detect spatial overlaps between a specified target and objects in the scene.
By Davide Ahmar, Babajide Owoyele & Wim Pouw

Audio Processing & Speech Analysis Suite (Python)

Comprehensive toolkit for audio analysis, speech transcription, and speaker identification using multiple Python libraries.
By Marianne de Heer Kloots

Audio Analysis with Parselmouth

Extract and visualize speech features using Praat's Python interface.

Speaker Diarization with pyannote

Identify who spoke when using the pyannote-audio toolkit.

Speech-to-Text with Whisper

Generate automatic transcriptions using OpenAI's Whisper model.

EnvisionHGdetector Package Suite (Python)

Complete gesture detection and analysis toolkit for research and real-time applications using machine learning.
By Wim Pouw, Bosco Yung, Sharjeel Shaikh, James Trujillo, Gerard de Melo, Babajide Owoyele

Automatic Gesture Analysis & Visualization

Automatically annotate hand gesture stroke events, analyze kinematics, and create dashboard visualizations.

Real-Time Gesture Detection

Detect gestures in real-time from webcam feed using Light Gradient Boosting Machine.

SPUDNIG-PYTHON Motion-detection Assisted Gesture Annotation (Python)

This module uses motion detection to aid automatic gesture annotation, based on the tool SPUDNIG.
By James Trujillo (based on SPUDNIG by Ripperda, Drijvers & Holler)

Full-body tracking, +masking/blurring, and movement tracing (Python)

This module shows how to track the face, hands, and body using MediaPipe, with the option of masking, blurring, and movement tracing.
By Wim Pouw & Sho Akamine

Post-synchronizing video and audio recordings from separate devices (Python)

This module provides a multi-purpose pipeline for post-synchronizing video and audio recordings of the same event, recorded separately on different devices.
By Hamza Nalbantoğlu & Šárka Kadavá
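A common way to post-synchronize two recordings of the same event is to cross-correlate their audio tracks and find the lag that maximizes the correlation. The sketch below is illustrative only (not the module's actual pipeline; `estimate_offset` is a hypothetical helper), assuming two mono signals sampled at the same rate:

```python
import numpy as np

def estimate_offset(sig_a, sig_b, sr):
    """Estimate how many seconds sig_b lags behind sig_a, using the
    peak of the full cross-correlation of the two mono signals."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    # np.correlate's zero lag sits at index len(sig_b) - 1
    lag_samples = (len(sig_b) - 1) - np.argmax(corr)
    return lag_samples / sr

# toy check: sig_b is sig_a delayed by 100 samples at sr = 1000 Hz
rng = np.random.default_rng(0)
sig_a = rng.standard_normal(2000)
sig_b = np.concatenate([np.zeros(100), sig_a[:-100]])
print(estimate_offset(sig_a, sig_b, 1000))  # → 0.1
```

In practice you would then trim or pad one recording by the estimated offset before merging the streams.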

Automatic Stimuli Creation with Blocked Facial Information (Python)

This Python module provides scripts for systematically masking facial information at different intensities. This can be used to reduce the communicative potential of mouthing in signed languages or of articulatory gestures in spoken language.
By Wim Pouw & Annika Schiefner
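One simple masking strategy, sketched here for illustration (not necessarily the module's own method), is to pixelate a facial region by averaging over tiles, with the block size controlling the masking intensity:

```python
import numpy as np

def pixelate_region(img, y0, y1, x0, x1, block=8):
    """Coarsen a rectangular image region by replacing each block x block
    tile with its mean value; larger blocks remove more information."""
    out = img.copy().astype(float)
    region = out[y0:y1, x0:x1]  # a view: edits propagate to out
    h, w = region.shape[:2]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            tile = region[by:by + block, bx:bx + block]
            tile[...] = tile.mean(axis=(0, 1))
    return out
```

The region coordinates would come from a face detector (e.g., MediaPipe's face landmarks) in an actual pipeline.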

Quantifying Interpersonal Synchrony (Python)

This module provides an introduction to calculating interpersonal movement synchrony, including time-lag assessment and pseudo-pair calculation.
By James Trujillo
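The core of a lagged synchrony analysis can be sketched as a Pearson correlation computed at a range of sample lags (the function name is illustrative, not the module's code):

```python
import numpy as np

def lagged_corr(x, y, lag):
    """Pearson correlation between two movement time series at an integer
    sample lag; a positive lag tests whether y follows x by that many
    samples."""
    if lag > 0:
        a, b = x[:-lag], y[lag:]
    elif lag < 0:
        a, b = x[-lag:], y[:lag]
    else:
        a, b = x, y
    return float(np.corrcoef(a, b)[0, 1])
```

Scanning `lag` over a window and taking the peak gives the time-lag assessment; the pseudo-pair baseline mentioned above amounts to recomputing the same measure for participants who never actually interacted.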

Multi-person tracking with YOLO and computing social proximity (Python)

This module uses the reliable Ultralytics YOLO pose tracking for multiple persons, suitable for top-view or other perspectives, and shows a simple calculation of the interpersonal distance between two persons.
By Wim Pouw, Arkadiusz Białek, and James Trujillo
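Once per-frame keypoints are available, interpersonal distance reduces to a per-frame Euclidean norm. A minimal sketch, assuming pose arrays of shape (frames, joints, 2) such as those Ultralytics YOLO can export (the function name and joint indexing are illustrative):

```python
import numpy as np

def interpersonal_distance(kp_a, kp_b, joint=0):
    """Per-frame Euclidean distance between one chosen joint (e.g., a hip
    or nose keypoint) of two tracked persons.
    kp_a, kp_b: arrays of shape (frames, joints, 2) with x, y coords."""
    diff = kp_a[:, joint, :] - kp_b[:, joint, :]
    return np.linalg.norm(diff, axis=1)
```

For physically meaningful distances the pixel coordinates would need to be calibrated, e.g., via a known reference length in the top-view image.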

Visual Communication (ViCOM) tutorial with exercises: A complete kinematic feature analysis pipeline (Python)

This module contains a kinematic feature extraction pipeline with exercises for students of communicative motion analysis.
By Wim Pouw

Behavioral Classification Using Convolutional Neural Networks (Python)

This module takes you through training a model to automatically annotate bodily gestures.
By Wim Pouw

Decision Tree-Based Classification Algorithms (R)

This module takes you through using decision trees to make sense of high-dimensional data.
By Alexander Kilpatrick

Multimodal annotation distances (Python and R)

This module takes in ELAN annotations and lets you compare the overlap between them using the multimodal-annotation-distance tool.
By Camila Antônio Barros
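The idea of comparing annotation tiers can be illustrated with plain interval arithmetic (a sketch only, not the multimodal-annotation-distance tool itself; tiers are assumed to be lists of (start, end) tuples in seconds, as exported from ELAN):

```python
def interval_overlap(a, b):
    """Overlap in seconds between two (start, end) annotation intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def total_overlap(tier_a, tier_b):
    """Summed pairwise overlap between two annotation tiers."""
    return sum(interval_overlap(a, b) for a in tier_a for b in tier_b)
```

Dividing the total overlap by the summed duration of one tier yields a simple agreement-like proportion between two annotators or modalities.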

Creating video-embedded time series animations (Python)

This module takes in a video, and then creates movement-sound time series animations embedded in the video.
By Wim Pouw

Turn-Taking Dynamics and Entropy (Python)

This module introduces the calculation of turn-taking measures, such as gaps and overlaps, as well as entropy, from conversations with two or more speakers.
By James Trujillo
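Gaps, overlaps, and speaker entropy can be sketched in a few lines of plain Python (function names are illustrative; turns are assumed to be (start, end, speaker) tuples):

```python
import math

def gaps_and_overlaps(turns):
    """Floor-transfer offsets between consecutive turns of different
    speakers, sorted by start time: positive = gap, negative = overlap."""
    turns = sorted(turns)
    return [nxt[0] - cur[1] for cur, nxt in zip(turns, turns[1:])
            if nxt[2] != cur[2]]

def speaker_entropy(turns):
    """Shannon entropy (bits) of the distribution of speaking time,
    a rough index of how evenly the floor is shared."""
    totals = {}
    for start, end, spk in turns:
        totals[spk] = totals.get(spk, 0.0) + (end - start)
    total = sum(totals.values())
    probs = [d / total for d in totals.values()]
    return -sum(p * math.log2(p) for p in probs)
```

Two speakers holding the floor equally long give the maximal entropy of 1 bit; a monologue gives 0.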

Gesture networks and DTW (Python)

This module demonstrates how to implement gesture networks and gesture spaces using dynamic time warping.
By Wim Pouw
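DTW itself is a short dynamic program. A minimal 1-D sketch (gesture trajectories would typically be multivariate, and the module may use an optimized library rather than this loop):

```python
import numpy as np

def dtw_distance(s, t):
    """Classic dynamic-programming DTW distance between two 1-D series,
    allowing non-linear temporal alignment of similar shapes."""
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

A gesture network then follows by computing this distance for every pair of gesture trajectories and treating the resulting distance matrix as a (thresholded) adjacency matrix.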

Demo for OpenPose with 3D tracking with Pose2Sim (Python)

This module provides a Python pipeline for OpenPose tracking and 3D triangulation with Pose2Sim.
By Šárka Kadavá & Wim Pouw

Dynamic visualization dashboard (Python)

This module provides an example of a dynamic dashboard that displays audio-visual and static data.
By Wim Pouw

Head rotation tracking by adapting mediapipe (Python)

This module shows a way to track head direction, in addition to the face, hand, and body tracking with MediaPipe.
By Wim Pouw

Running OpenPose in batches (Batch script)

This module demonstrates how to use batch scripting to run OpenPose on a set of videos.
By James Trujillo & Wim Pouw

Recording from multiple cameras synchronously while also streaming to LSL

This module demonstrates how to record from multiple cameras synchronously, which is very helpful for creating your own 3D motion tracking pipeline.
By Šárka Kadavá & Wim Pouw

3D tracking from 2D videos using anipose and deeplabcut (Python)

This module shows how to set up a 3D motion tracking system with multiple 2D cameras, using anipose and human pose tracking with DeepLabCut.
By Wim Pouw

Aligning and pre-processing multiple data streams (R)

This module provides an overview of how to wrangle multiple data streams (motion tracking, acoustics, annotations) and preprocess them (smoothing) to create a single long time series dataset ready for further processing.
By Wim Pouw

Aligning and pre-processing multiple data streams (Python)

This module provides an overview of how to wrangle multiple data streams (motion tracking, acoustics, annotations) and preprocess them (smoothing) so that you end up with a single long time series dataset ready for further processing.
By Wim Pouw
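A typical first step when merging streams recorded at different rates is resampling everything onto one shared timebase with linear interpolation. The helper below is an illustrative sketch, not the module's code:

```python
import numpy as np

def align_streams(t_common, streams):
    """Resample several (timestamps, values) streams onto one shared
    timebase with linear interpolation, yielding a single array with
    one aligned column per stream."""
    return np.column_stack(
        [np.interp(t_common, ts, vals) for ts, vals in streams])
```

Smoothing (e.g., a moving average or Savitzky-Golay filter) would then be applied per column before further processing.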

Extracting a smoothed amplitude envelope from sound (R)

This module demonstrates how to extract a smoothed amplitude envelope from a sound file.
By Wim Pouw
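The module itself is in R, but the underlying idea — take the magnitude of the analytic (Hilbert-transformed) signal and low-pass it — can be sketched in Python with NumPy alone. The FFT-based Hilbert transform below is standard; the moving average is a crude stand-in for a proper low-pass filter, and the function name is illustrative:

```python
import numpy as np

def amplitude_envelope(x, sr, cutoff_hz=5):
    """Smoothed amplitude envelope: magnitude of the analytic signal
    (FFT-based Hilbert transform), then a moving average roughly
    corresponding to a low-pass at cutoff_hz."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1
    if n % 2 == 0:
        h[n // 2] = 1
        h[1:n // 2] = 2
    else:
        h[1:(n + 1) // 2] = 2
    env = np.abs(np.fft.ifft(X * h))
    win = max(1, int(sr / cutoff_hz))
    kernel = np.ones(win) / win
    return np.convolve(env, kernel, mode="same")
```

For a pure tone of amplitude 1 the envelope is flat at 1, which makes for an easy sanity check before running real speech audio through the pipeline.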

Motion tracking analysis: Kinematic feature extraction (Python)

This module provides an example of how to analyze motion tracking data using kinematic feature extraction.
By James Trujillo
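Basic kinematic features reduce to differencing a position trace. A minimal sketch assuming a (frames, 2) array of x, y positions (the feature set and names are illustrative, not the module's exact output):

```python
import numpy as np

def kinematic_features(pos, sr):
    """Simple kinematic features from a (frames, 2) position trace
    sampled at sr Hz: peak speed, mean speed, and path length."""
    step = np.linalg.norm(np.diff(pos, axis=0), axis=1)  # per-frame displacement
    speed = step * sr                                     # units per second
    return {
        "peak_speed": float(speed.max()),
        "mean_speed": float(speed.mean()),
        "path_length": float(step.sum()),
    }
```

Real pipelines typically smooth the positions first, since differentiating amplifies tracking jitter.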

Feature extraction for machine classification & practice dataset SAGA (R)

This module introduces a practice dataset and provides R code for setting up a kinematic and speech acoustic feature dataset that can be used to train a machine classifier for gesture types.
By Wim Pouw

Cross-Wavelet Analysis of Speech-Gesture Synchrony (R)

This module introduces the use of Cross-Wavelet analysis as a way to measure temporal synchrony of speech and gesture (or other visual signals).
By James Trujillo