Compiling ProVation MD – A Powerful Data Mining Tool Amenable to Statistical Analysis

Victor B Tsirline, MD MS, Igor Belyansky, MD, David A Klima, MD, Cynthia M Hlavacek, Kristian T Dacey, Amy E Lincourt, PhD, Ronald F Sing, DO FACS, B. Todd Heniford, MD FACS. Carolinas Medical Center, Charlotte, NC 28204

Clinical research is highly dependent on the availability of data and on its accuracy, completeness, and suitability for analysis. ProVation™ MD is a dictation system for endoscopic procedures made by Wolters Kluwer and is used by over 700 sites and 7,000 clinical users. ProVation contains a powerful text analysis module, which potentially enables rapid analysis of a virtually unlimited number of procedure reports. Because of its complexity, however, the data generated by ProVation does not lend itself to statistical analysis; as a result, its use as a research tool is limited. We have developed a methodology to overcome this problem.

At the Carolinas Healthcare System some 30,000 endoscopic procedures are dictated in the ProVation system annually. We queried procedures performed between 2003 and 2010. The text mining engine extracted statements from endoscopy reports into the following categories: Procedure Name, Providers and their Roles, Indications for Procedure, Findings with Locations and corresponding Maneuvers, Complications, Medications, Impressions, Recommendations, and Endoscopy Instruments. The result was a 3-dimensional variable-length array of discrete text values. The maximum number of values in each category across all records was explicitly determined; together, these maxima allowed up to 250 data elements per endoscopy record. The resulting dataset was imported into Microsoft Access as a raw data table. A series of dynamic action queries was generated using Visual Basic for Applications (VBA) to construct a relational database with one table per category, each linked to the master table using standard indexing methodologies. The variables in the dataset were electronically examined for sparsity and variability, and were assimilated into clinically relevant supersets. The data was then compiled into the final table with one procedure record per row, capturing commonly occurring data in multiple dichotomy sets and combining sparse data into aggregate variables.
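The steps above can be illustrated with a minimal sketch. It is not the authors' VBA and Access implementation; it uses Python with the standard-library `sqlite3` module instead, and the column names, sample values, and the `min_count` threshold are all hypothetical. It shows a wide raw table being split into one indexed table per category, then flattened back into one row per exam with dichotomous (0/1) variables for common values and a single aggregate variable for sparse ones.

```python
import sqlite3
from collections import Counter

# Hypothetical wide "raw" rows: one exam per row, a fixed number of slots
# per category, with unused slots left blank.
RAW_ROWS = [
    {"exam_id": 1, "Indication_1": "Screening", "Indication_2": "",
     "Finding_1": "Polyp", "Finding_2": "Diverticulosis"},
    {"exam_id": 2, "Indication_1": "Anemia", "Indication_2": "Screening",
     "Finding_1": "Polyp", "Finding_2": ""},
    {"exam_id": 3, "Indication_1": "Screening", "Indication_2": "",
     "Finding_1": "Angioectasia", "Finding_2": ""},
]
CATEGORIES = ["Indication", "Finding"]  # assumed subset of the categories


def build_relational(rows):
    """Split the wide raw table into a master table plus one table per category."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE master (exam_id INTEGER PRIMARY KEY)")
    for cat in CATEGORIES:
        conn.execute(
            f"CREATE TABLE {cat} ("
            "exam_id INTEGER REFERENCES master(exam_id), slot INTEGER, value TEXT)"
        )
        conn.execute(f"CREATE INDEX idx_{cat}_exam ON {cat} (exam_id)")
    for row in rows:
        conn.execute("INSERT INTO master VALUES (?)", (row["exam_id"],))
        for key, value in row.items():
            cat, _, slot = key.rpartition("_")
            if cat in CATEGORIES and value:  # skip exam_id and blank slots
                conn.execute(
                    f"INSERT INTO {cat} VALUES (?, ?, ?)",
                    (row["exam_id"], int(slot), value),
                )
    return conn


def flatten(conn, category, min_count=2):
    """One row per exam: 0/1 columns for common values, one 'other' column for sparse ones."""
    pairs = conn.execute(f"SELECT exam_id, value FROM {category}").fetchall()
    counts = Counter(value for _, value in pairs)
    common = sorted(v for v, n in counts.items() if n >= min_count)
    exam_ids = [r[0] for r in conn.execute("SELECT exam_id FROM master ORDER BY exam_id")]
    table = {
        eid: {**{f"{category}:{v}": 0 for v in common}, f"{category}:other": 0}
        for eid in exam_ids
    }
    for eid, value in pairs:
        col = f"{category}:{value}" if value in common else f"{category}:other"
        table[eid][col] = 1
    return table


conn = build_relational(RAW_ROWS)
findings = flatten(conn, "Finding")
```

Keeping the category tables in long form (one row per value) avoids the padded blank slots of the raw table, and the threshold decides which values earn their own dichotomous column.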

A total of 95,378 exam records were generated by ProVation, containing up to 3 procedures per exam, for a total of 189,156 procedures. The dataset contained 103,805 lower GI endoscopy reports, 66,888 upper GI endoscopy reports, 6,197 bronchoscopies, and 4,195 ERCPs. The relational database contained 9 categories and 2 subcategories, with each record having up to 20 data values per category – a total of 3,667,119 non-blank text elements. Each category comprised up to 1,000 unique elements. Contextually similar elements were grouped together by descriptors, resulting in 20 to 100 unique descriptors per category. The relational database was flattened into a spreadsheet of one endoscopy record per row for further statistical analysis. The entire conversion process, from ProVation text mining to the final spreadsheet amenable to analysis, required no additional input from the user.
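The abstract does not state how contextually similar elements were matched to descriptors. As one possible approach, the sketch below assumes simple keyword rules; the descriptor names, patterns, and sample elements are hypothetical and serve only to show how many distinct text elements can collapse into a few descriptors.

```python
import re

# Hypothetical rules: each descriptor is matched by a keyword stem.
DESCRIPTOR_PATTERNS = {
    "polyp": re.compile(r"\bpolyp", re.IGNORECASE),
    "diverticulosis": re.compile(r"\bdiverticul", re.IGNORECASE),
    "hemorrhoids": re.compile(r"\bhemorrhoid", re.IGNORECASE),
}


def to_descriptor(element):
    """Return the first descriptor whose pattern matches, else 'other'."""
    for descriptor, pattern in DESCRIPTOR_PATTERNS.items():
        if pattern.search(element):
            return descriptor
    return "other"


def group_elements(elements):
    """Map each descriptor to the set of unique text elements it absorbs."""
    groups = {}
    for element in elements:
        groups.setdefault(to_descriptor(element), set()).add(element)
    return groups


# Hypothetical unique elements from a single category.
elements = [
    "Sessile polyp", "Pedunculated polyp", "Multiple polyps",
    "Diverticulosis", "Small-mouthed diverticula",
    "Internal hemorrhoids", "Lipoma",
]
groups = group_elements(elements)
```

Here seven unique elements reduce to four descriptors, with unmatched elements collected under `other` so that they can be reviewed or folded into an aggregate variable.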

The application of systematic data mining methods to the ProVation MD dictation system allows rapid analysis of vast amounts of clinical data, providing a powerful research tool for those who use the system in clinical practice. Our conversion process required limited user input irrespective of database size, making it practical to analyze a virtually unlimited number of reports in a timeframe of hours to days.

Session: Emerging Technology Poster
Program Number: ETP090

« Return to SAGES 2011 abstract archive