Ayushi Sinha, PhD1, Masaru Ishii, MD, PhD2, Russell H Taylor, PhD1, Gregory D Hager, PhD1, Austin Reiter, PhD1. 1The Johns Hopkins University, 2Johns Hopkins Medical Institutions
Objective: The aim of this work is to enable automatic initialization of vision-based navigation systems developed for endoscopy. Most such systems present innovations in registration between endoscopy video and a preoperative image; however, many registration algorithms are initialized manually during evaluation, which interferes with surgical workflow and is infeasible during surgery. Our work focuses on classifying the camera pose directly from endoscopy video. Registration algorithms can then be initialized with this estimated pose in order to fine-tune the alignment.
Description: Minimally invasive procedures performed through the nasal cavity involve an endoscope and a navigation system, along with a preoperative image, to guide the endoscope safely to the surgical site. A vision-based navigation system does not introduce any additional hardware into the surgical environment; instead, it uses a tool already present: the endoscope. Our work aims to automatically estimate the endoscope pose in the preoperative image coordinate frame with sufficient accuracy for registration algorithms to converge and initiate navigation.
Method: Since labeled endoscopy video with ground-truth camera poses is scarce, we generate a simulated dataset in OpenGL using a textured mesh of the nasal cavity. A camera with a single co-located light source is simulated and steered through the textured mesh environment, producing 1420 camera views within the simulated nasal cavity. The 300×300 RGB images rendered from these views are saved together with their corresponding camera poses.
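The pose-sampling side of such a simulated data collection can be sketched as follows. This is a minimal, hypothetical illustration: the centerline points, jitter scale, and the idea of aiming each camera at the next centerline point are our assumptions, not the abstract's actual OpenGL pipeline, and the rendering step itself is omitted.

```python
import random

def sample_poses(centerline, n_views, jitter=1.0, seed=0):
    """Return n_views (position, look_at) pairs near a cavity centerline.

    Hypothetical sketch: each camera position is a jittered centerline
    point, and the camera looks toward the next point, i.e. deeper into
    the cavity, as an endoscope would.
    """
    rng = random.Random(seed)
    poses = []
    for _ in range(n_views):
        # Pick a point along the path and perturb it slightly so the
        # rendered views cover the cavity rather than a single track.
        i = rng.randrange(len(centerline) - 1)
        x, y, z = centerline[i]
        pos = (x + rng.uniform(-jitter, jitter),
               y + rng.uniform(-jitter, jitter),
               z + rng.uniform(-jitter, jitter))
        look_at = centerline[i + 1]
        poses.append((pos, look_at))
    return poses

# Toy centerline (depth increases along z, units arbitrary).
centerline = [(0.0, 0.0, float(z)) for z in range(0, 80, 5)]
poses = sample_poses(centerline, n_views=1420)
```

Each (position, look_at) pair would then be handed to the renderer to produce one 300×300 RGB image, and the pair itself saved as that image's ground-truth pose label.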
Our preliminary study evaluates whether the camera position can be coarsely classified. Camera poses are grouped into 4 classes based on how far into the nasal cavity the endoscope has traversed, so the rendered scenes from the 4 classes capture different parts of the cavity: scenes in the first class observe the anterior part of the nasal turbinates, scenes in the second observe the length of the turbinates, those in the third observe the posterior part of the turbinates, and those in the fourth observe the pharyngeal recess beyond the turbinates. We randomly chose 1270 rendered images to train a 2-layer CNN to classify images into these 4 classes. We trained and tested our network 5 times and report the mean accuracy.
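The labeling and split described above can be sketched as below. The depth thresholds separating the four anatomical regions are illustrative assumptions (the abstract does not give numeric boundaries); only the class count and the 1270/150 split come from the text.

```python
import random

# Hypothetical class boundaries along the traversal depth (arbitrary units).
DEPTH_EDGES = [20.0, 40.0, 60.0]

def depth_to_class(depth):
    """0: anterior turbinates, 1: length of turbinates,
    2: posterior turbinates, 3: pharyngeal recess."""
    for c, edge in enumerate(DEPTH_EDGES):
        if depth < edge:
            return c
    return 3

def train_test_split(samples, n_train=1270, seed=0):
    """Shuffle and split samples into training and test sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Toy depths standing in for the 1420 rendered views.
depths = [i * 80.0 / 1420 for i in range(1420)]
labelled = [(d, depth_to_class(d)) for d in depths]
train, test = train_test_split(labelled)  # 1270 train, 150 test
```

The training set would then drive the 2-layer CNN, with the held-out 150 images used for the accuracy figures reported below.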
Results: Preliminary results on our 150-image test set show a mean accuracy of 76.53 (±1.19)%, with misclassifications occurring mainly between neighboring classes. This is reasonable, since images rendered near the border between two regions are likely to be similar in appearance. These results are promising given the limited size of our current dataset, the small network, and the absence of hyperparameter tuning.
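The claim that errors fall mainly into neighboring classes can be quantified from a 4×4 confusion matrix (rows = true class, columns = predicted class). The matrix values below are illustrative placeholders, not the study's actual results.

```python
def neighbor_error_fraction(cm):
    """Fraction of misclassifications landing in an adjacent class."""
    wrong = adjacent = 0
    for t, row in enumerate(cm):
        for p, count in enumerate(row):
            if t != p:
                wrong += count
                if abs(t - p) == 1:
                    adjacent += count
    return adjacent / wrong if wrong else 0.0

# Illustrative confusion matrix for the 4 depth classes.
cm = [[30, 5, 1, 0],
      [4, 28, 4, 0],
      [0, 3, 29, 3],
      [0, 0, 4, 32]]
```

A fraction near 1 indicates that almost all errors are off-by-one along the traversal depth, consistent with border images from adjacent regions looking similar.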
Conclusions: Since our simplified classification task showed promise, we are currently expanding our dataset to train a deeper network and solve a more fine-grained classification problem over both position and orientation. We will also use different meshes and textures to simulate a more diverse population, and different lighting conditions to simulate different endoscopes. Finally, we hope to transfer the learned parameters to in vivo endoscopy video and evaluate whether the classifications produce initializations that lead to reliable final registrations.
Presented at the SAGES 2017 Annual Meeting in Houston, TX.
Abstract ID: 98757
Program Number: ETP717
Presentation Session: Emerging Technology Poster Session (Non CME)
Presentation Type: Poster