The main objectives of the simpatix project are to a) develop methods for constructing structured, process-oriented case representations from large data sets that include unstructured documents; b) research algorithms for process-enhanced similarity search over richly annotated case collections; and c) design and implement a generic repository that stores process-enhanced case collections and supports scalable, effective similarity search. The work programme consists of five work packages.

WP1: Information Extraction.

The first work package targets information extraction from electronic health records (EHRs) and accompanying documents. Relevant entities will be recognized and normalized to identifiers from common medical ontologies, building as much as possible on existing tools for clinical information extraction. Each annotation will carry a confidence level to represent uncertainty about the chosen identifier.
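As an illustration of what such an annotation could look like, the following minimal sketch pairs a normalized identifier with a confidence level. The class name, field names, identifier strings, and the threshold are all hypothetical and are not part of the project's specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One recognized entity mention (hypothetical structure)."""
    text: str          # surface form as it appears in the document
    start: int         # character offset where the mention begins
    end: int           # character offset where the mention ends (exclusive)
    concept_id: str    # placeholder ontology identifier, not a real code
    confidence: float  # certainty about concept_id, in the range 0..1


def confident(annotations, threshold=0.5):
    """Keep only annotations whose confidence reaches the threshold."""
    return [a for a in annotations if a.confidence >= threshold]


mentions = [
    Annotation("fever", 10, 15, "ONTO:0001", 0.92),
    Annotation("cold", 30, 34, "ONTO:0002", 0.35),  # ambiguous mention
]
reliable = confident(mentions, threshold=0.5)
```

Keeping the confidence on the annotation itself, rather than discarding uncertain mentions early, would let later steps decide how much weight an uncertain identifier should receive.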

WP2: Construction of Structured Case Representations.

The second work package focuses on the systematic construction of structured case representations from the given document collections, based on the documents’ temporal alignment. Several sources of information will be used, including document metadata, date and time data extracted from the documents, and clinical guidelines serving as skeleton process templates. An intermediate evaluation with medical professionals will assess how correct the representations constructed in this WP are and how consistent they are with the corresponding annotated cases.
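The temporal alignment step can be pictured with a small sketch. It assumes, purely for illustration, that each document has a metadata date and optionally a date extracted from its text, and that the extracted date is preferred when present. The names `Document`, `event_date`, and `build_timeline` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Document:
    doc_id: str
    metadata_date: date                   # e.g. the file creation date
    extracted_date: Optional[date] = None  # date mentioned in the text, if any


def event_date(doc: Document) -> date:
    """Prefer the date extracted from the text; fall back to metadata."""
    return doc.extracted_date or doc.metadata_date


def build_timeline(docs: List[Document]) -> List[Document]:
    """Order a patient's documents chronologically by their event date."""
    return sorted(docs, key=event_date)


docs = [
    Document("discharge_letter", date(2020, 3, 20)),
    Document("lab_report", date(2020, 3, 18), extracted_date=date(2020, 3, 2)),
    Document("admission_note", date(2020, 3, 1)),
]
timeline = [d.doc_id for d in build_timeline(docs)]
```

A guideline-derived process template could then be laid over such a timeline to group documents into process steps; that part is not shown here.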

WP3: Similarity of Richly Annotated Structured Cases.

Once a set of medical case representations is available and approved, methods for assessing the similarity of the collected medical cases will be studied. The patients’ medical cases will be represented as time-structured processes of annotated events created in WP2, accompanied by time-invariant data such as the patient’s age or genetic profile, where available. Assessment of similarity will then consider four levels: the level of whole patient cases including static data and process information, the level of whole or partial processes, the level of single events (within each process), and the level of different entity types and their values (annotating each event). To resolve inter-event causalities, we will investigate aligning our automatically extracted cases to processes manually created from clinical guidelines.
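How the four levels might compose can be sketched as nested functions, each level built on the one below it. This is only an illustration under strong simplifying assumptions: entity values are compared by exact match, events are matched to their best counterpart regardless of order, and the static weight `w_static` is an arbitrary placeholder. The measures to be developed in this work package are not specified here.

```python
def entity_similarity(a, b):
    """Entity level: exact match of two values of the same entity type."""
    return 1.0 if a == b else 0.0


def event_similarity(e1, e2):
    """Event level: share of entity types on which both events agree."""
    keys = set(e1) | set(e2)
    if not keys:
        return 1.0
    shared = (k for k in keys if k in e1 and k in e2)
    return sum(entity_similarity(e1[k], e2[k]) for k in shared) / len(keys)


def process_similarity(p1, p2):
    """Process level: symmetric average of best-matching event pairs.

    Simplification: event order is ignored in this sketch.
    """
    if not p1 or not p2:
        return 1.0 if not p1 and not p2 else 0.0

    def directed(a, b):
        return sum(max(event_similarity(x, y) for y in b) for x in a) / len(a)

    return 0.5 * (directed(p1, p2) + directed(p2, p1))


def case_similarity(c1, c2, w_static=0.3):
    """Case level: weighted mix of static data and process similarity."""
    static = event_similarity(c1["static"], c2["static"])
    process = process_similarity(c1["events"], c2["events"])
    return w_static * static + (1 - w_static) * process


case_a = {"static": {"sex": "f"},
          "events": [{"diagnosis": "D1"}, {"drug": "M1"}]}
case_b = {"static": {"sex": "f"},
          "events": [{"diagnosis": "D1"}]}
score = case_similarity(case_a, case_b)
```

A time-aware process measure, for example one based on sequence alignment, would replace `process_similarity` without changing the other three levels.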

WP4: Case Repository.

A repository framework will be designed and implemented to allow efficient retrieval of cases using different similarity measures. Research questions encompass the definition of a suitable storage format and similarity-indexing algorithm; furthermore, we will implement an interface to allow retrieval and display of the stored cases. For easy setup and maintenance, we plan to use existing applications and libraries as much as possible.

WP5: Evaluation.

The similarity measures proposed in WP4 will be evaluated together with the medical experts cooperating on the project. This evaluation will specifically target the use case of decision support: for a given case, find the k most similar patients for which data is available beyond the current point of care of the query case.
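The decision-support query can be sketched as a filtered top-k search. The sketch assumes, hypothetically, that the point of care is expressed as a number of observed days per case (`days_observed`) and that a similarity function is supplied by the caller; neither is prescribed by the project description.

```python
import heapq


def top_k_similar(query, candidates, similarity, k=5):
    """Return the k most similar cases that extend beyond the query.

    Only candidates with more observed days than the query are eligible,
    because only they contain information about what happened next.
    """
    eligible = [c for c in candidates
                if c["days_observed"] > query["days_observed"]]
    return heapq.nlargest(k, eligible, key=lambda c: similarity(query, c))


def toy_similarity(a, b):
    """Placeholder measure: closeness of age only (illustrative)."""
    return 1.0 / (1.0 + abs(a["age"] - b["age"]))


query = {"id": "q", "age": 60, "days_observed": 10}
cases = [
    {"id": "p1", "age": 61, "days_observed": 40},
    {"id": "p2", "age": 60, "days_observed": 5},   # too short, filtered out
    {"id": "p3", "age": 70, "days_observed": 90},
    {"id": "p4", "age": 58, "days_observed": 30},
]
neighbours = [c["id"] for c in top_k_similar(query, cases, toy_similarity, k=2)]
```

This linear scan only illustrates the query semantics; a repository of realistic size would answer the same query through a similarity index.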