David Z Li¹, Masaru Ishii, MD, PhD², Russell H Taylor, PhD¹, Gregory D Hager, PhD¹, Ayushi Sinha, PhD¹. ¹The Johns Hopkins University, ²Johns Hopkins Medical Institutions
Objective: The aim of our work is to develop an unsupervised approach for tool detection in endoscopic video data. An abundance of medical imaging data is available, but most of it is unlabeled because manual annotation is extremely tedious. To overcome this limitation in endoscopic video data, we hope to coarsely classify endoscopic video frames into two classes – frames with tools and frames without tools. These coarse labels can then open up the potential for more fine-grained labels, such as tool segmentation and tool pose classification.
Description: During endoscopic procedures, surgical tools enter and leave the endoscopic field of view. Knowing when these events occur provides crucial information about surgical phase and activity. Additionally, computer vision-based navigation systems rely on anatomical features from endoscopic video to align video and preoperative image data. Therefore, being able to ignore frames with tools, or to mask out the tools within such frames, is important. Detecting frames with and without surgical tools allows these two distinct classes of data to be exploited in different ways, enabling more fine-grained learning tasks.
Method: Variational autoencoders (VAEs) are generative models that are well suited to unsupervised learning since they do not require labeled training data. VAEs model the data distribution as a nonlinear transformation of latent variables. Inference of these latent variables given the observed data is performed by an encoding model that learns a lower-dimensional representation of the data, while a decoding model generates samples from the modeled distribution given the latent variables.
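For illustration, a minimal VAE of this form might look as follows in PyTorch. This is a sketch of the general encoder/decoder structure described above, not the architecture used in this work; the fully connected layers, 64×64 grayscale input, and latent dimension are assumptions made for the example.

```python
# Minimal VAE sketch (illustrative; layer sizes and input shape are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: maps a frame to the mean and log-variance of q(z|x).
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 512), nn.ReLU())
        self.fc_mu = nn.Linear(512, latent_dim)
        self.fc_logvar = nn.Linear(512, latent_dim)
        # Decoder: maps latent variables back to a reconstructed frame in [0, 1].
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 64 * 64), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.enc(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        # Sample z ~ q(z|x) with the reparameterization trick.
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.dec(z).view_as(x), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term + KL divergence to the standard normal prior
    # (assumes pixel intensities scaled to [0, 1]).
    bce = F.binary_cross_entropy(recon, x, reduction='sum')
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld
```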
The first step in our study is to use VAEs to learn a useful latent representation of sequences of endoscopic video. We hope to manipulate the variables in latent space and study their effect on the decoded output in order to learn, for instance, which latent variables are responsible for encoding tool movement versus background movement. We expect the latent variables encoding tool movement to change drastically when the tool is not present in the frame, and we can leverage this to separate frames with and without tools.
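A sketch of this latent-space analysis is shown below, assuming a trained model like the VAE sketched above. Scoring frames by per-dimension latent change and sweeping a single latent dimension are illustrative assumptions, not a confirmed part of the method.

```python
# Illustrative latent-space analysis (assumes a trained `vae` as sketched above).
import torch

@torch.no_grad()
def latent_change_scores(vae, frames):
    """frames: tensor of shape (T, 1, 64, 64), consecutive video frames.
    Returns per-dimension latent change between adjacent frames; large jumps
    in tool-related dimensions may indicate tool entry or exit."""
    mu, _ = vae.encode(frames)          # (T, latent_dim) posterior means
    return (mu[1:] - mu[:-1]).abs()     # (T-1, latent_dim) frame-to-frame change

@torch.no_grad()
def traverse(vae, frame, dim, values):
    """Decode one frame's latent code while sweeping a single dimension,
    to inspect what that dimension encodes (e.g., tool vs. background)."""
    mu, _ = vae.encode(frame.unsqueeze(0))
    outs = []
    for v in values:
        z = mu.clone()
        z[0, dim] = v                   # overwrite the swept dimension
        outs.append(vae.dec(z).view(1, 64, 64))
    return torch.stack(outs)            # (len(values), 1, 64, 64)
```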
Conclusion: Our unsupervised approach to detecting tool presence in endoscopic video data will allow us to separate video frames with and without tools and to treat these two classes differently. This initial coarse classification can make more fine-grained learning tasks easier to approach. For instance, most anatomical structures can be expected to move coherently relative to the endoscope, while a surgical tool can move randomly. Knowing whether a sequence of frames contains tools allows us to look for differences in, for instance, the optical flow between frames to detect or segment the tool from the background tissue. Additionally, by encoding tool movement, we can learn whether different surgical tasks or phases appear different in latent space, enabling surgical phase detection.
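As a sketch of the optical-flow comparison mentioned above, OpenCV's Farneback dense flow could flag pixels whose motion deviates from the dominant scene motion. The median-based heuristic and the deviation threshold are assumptions for illustration, not a stated part of this work.

```python
# Illustrative optical-flow heuristic (threshold and median heuristic are assumptions).
import cv2
import numpy as np

def flow_outlier_mask(prev_gray, curr_gray, thresh=3.0):
    """prev_gray, curr_gray: consecutive grayscale frames (uint8, same size).
    Returns a boolean mask of pixels whose motion deviates from the dominant
    scene motion -- candidate tool pixels."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2)
    # Dominant motion of the scene (camera and anatomy moving coherently).
    dominant = np.median(flow.reshape(-1, 2), axis=0)
    # Pixels moving very differently from the dominant motion.
    deviation = np.linalg.norm(flow - dominant, axis=2)
    return deviation > thresh
```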
Presented at the SAGES 2017 Annual Meeting in Houston, TX.
Abstract ID: 98877
Program Number: ETP749
Presentation Session: Emerging Technology Poster Session (Non CME)
Presentation Type: Poster