Mikhail Volkov, MSc1, Daniel A Hashimoto, MD, MS2, Guy Rosman, PhD1, Ozanan Meireles, MD2, Daniela Rus1. 1MIT, 2Massachusetts General Hospital
Background and Objective: While recording laparoscopic surgical procedures is quick and easy, reviewing video remains a time-consuming process. Context-aware video segmentation must be done manually by browsing through the video to identify different steps of an operation. As an alternative, by using state-of-the-art machine learning methods trained on surgical video annotations, we demonstrate the use of coresets to automatically segment, summarize, and classify a surgical video according to its constituent steps.
Description: Coresets are compact data reduction constructs that can be used to efficiently approximate the results of computationally expensive algorithms. Coreset algorithms can take a video stream and compute a temporal segmentation of the video based on detected features. The end result is a small, hierarchical subset of video frames that summarizes the semantic content of the video. We present a system that computes such a summary automatically, online, and in real time. This system was trained and subsequently tested on videos of laparoscopic vertical sleeve gastrectomy (LSG).
Preliminary Results: We conducted experiments on 4 surgical videos of the LSG procedure, amounting to approximately 3 hours of footage. Expert laparoscopic surgeons identified a set of 7 steps for this procedure as an initial starting point. Since not all the videos contained all the steps, we restricted initial experiments to just the major steps in this procedure, each of which may combine several separate steps.
The expert surgeons provided qualitative semantic descriptions corresponding to their annotations, which constitute the ground truth segmentation of each video sequence. By combining and augmenting standard computer vision algorithms, we constructed a new representation space based on these qualitative semantic descriptions, yielding features that are highly discriminative of their respective steps.
A cascade of support vector machines (SVMs) was then constructed and trained to classify each step of the surgery. We completed 3 experiments, each time leaving out an entire video as the test set and splitting the remaining combined video frames into 80/20 training/validation sets.
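The leave-one-video-out protocol with an 80/20 split can be sketched as follows. This is a minimal illustration rather than the code used in this work: the function name `leave_one_video_out`, the `frames_by_video` dictionary, and the random shuffling before the split are all assumptions made for the example, and the sketch simply yields one fold per video it is given.

```python
import random


def leave_one_video_out(frames_by_video, val_fraction=0.2, seed=0):
    """Yield (test_video, train_frames, val_frames, test_frames) per held-out video.

    frames_by_video is a hypothetical mapping: video id -> list of frame records.
    """
    rng = random.Random(seed)
    for test_video in sorted(frames_by_video):
        # Pool the frames of every video except the held-out one.
        pooled = [
            frame
            for video, frames in sorted(frames_by_video.items())
            if video != test_video
            for frame in frames
        ]
        # Assumed here: frames are shuffled before the 80/20 split.
        rng.shuffle(pooled)
        n_val = int(round(val_fraction * len(pooled)))
        val_frames = pooled[:n_val]
        train_frames = pooled[n_val:]
        test_frames = list(frames_by_video[test_video])
        yield test_video, train_frames, val_frames, test_frames
```

Holding out a whole video, rather than a random subset of frames, keeps near-duplicate neighboring frames of the test video out of the training and validation sets.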
The system was then run live on each test video, classifying incoming frames and producing a vector of binary SVM outputs for each frame. Finally, a Hidden Markov Model (HMM) was trained and used to infer the most likely sequence of underlying states given the incoming stream of observations.
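To make the HMM stage concrete, the sketch below decodes the most likely step sequence from per-frame binary SVM output vectors using the Viterbi algorithm. It is an illustration under stated assumptions, not the model trained in this work: the emission model (each SVM output treated as an independent Bernoulli variable given the step), the helper names, and every probability value are hypothetical.

```python
import math


def make_bernoulli_emission(p_fire):
    """p_fire[s][k]: assumed probability that SVM k outputs 1 during step s."""
    def log_emit(state, obs):
        return sum(
            math.log(p if bit else 1.0 - p)
            for p, bit in zip(p_fire[state], obs)
        )
    return log_emit


def viterbi(observations, log_start, log_trans, log_emit):
    """Return the most likely state sequence for a non-empty observation list."""
    n_states = len(log_start)
    # score[s]: best log-probability of any path that ends in state s.
    score = [log_start[s] + log_emit(s, observations[0]) for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emit(s, obs))
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical 3-step model: steps tend to persist and to advance in order.
start = [0.98, 0.01, 0.01]
trans = [[0.90, 0.09, 0.01],
         [0.01, 0.90, 0.09],
         [0.01, 0.01, 0.98]]
p_fire = [[0.9, 0.1, 0.1],
          [0.1, 0.9, 0.1],
          [0.1, 0.1, 0.9]]

log_start = [math.log(p) for p in start]
log_trans = [[math.log(p) for p in row] for row in trans]
log_emit = make_bernoulli_emission(p_fire)

# One binary vector per frame; the third frame is a spurious detection of step 1.
svm_outputs = [(1, 0, 0), (1, 0, 0), (0, 1, 0), (1, 0, 0), (1, 0, 0),
               (0, 1, 0), (0, 1, 0), (0, 1, 0), (0, 0, 1), (0, 0, 1)]
decoded = viterbi(svm_outputs, log_start, log_trans, log_emit)
```

With these illustrative parameters, the isolated spurious detection in the third frame is absorbed into the surrounding step, which is the kind of temporal smoothing an HMM adds on top of frame-by-frame classification. Note that this offline decoder sees the whole sequence at once; a live system would need an online variant.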
Preliminary results show that prediction accuracy is on par with similar state-of-the-art approaches (85% or higher).
Future Directions: Video-based coaching and debriefing of laparoscopic surgical procedures has been demonstrated to contribute to enhanced surgical performance; however, modern pressures on training and productivity preclude spending hours viewing and editing surgical video for the purpose of routine video-based coaching. Automated segmentation of surgical videos would allow important and interesting parts of large video sequences to be efficiently indexed for use in training and during time-critical expert consultations for targeted feedback and coaching. A summarization of the surgical video can be archived for teaching and training purposes and to help identify errors or examples of good technique that may have occurred during the procedure.