2012 Program

Day 1



Suchi Saria and Randall Wetzel


Integrating Data for Analysis, Anonymization and Sharing (iDASH): New Models of Sharing Clinical and Laboratory Data

Lucila Ohno-Machado, UC San Diego, Division of Biomedical Informatics

iDASH (integrating data for analysis, anonymization, and sharing) is the newest National Center for Biomedical Computing funded by the NIH. It focuses on algorithms and tools for sharing data in a privacy-preserving manner. Foundational privacy technology research performed within iDASH is coupled with innovative engineering for collaborative tool development and data-sharing capabilities in a private Health Insurance Portability and Accountability Act (HIPAA)-certified cloud. Driving Biological Projects, which span different biological levels (from molecules to individuals to populations) and focus on various health conditions, help guide research and development within this Center. Furthermore, training and dissemination efforts connect the Center with its stakeholders and educate data owners and data consumers on how to share and use clinical and biological data. Through these various mechanisms, iDASH implements its goal of providing biomedical and behavioral researchers with access to data, software, and a high-performance computing environment, thus enabling them to generate and test new hypotheses.

Removing Confounding Factors via Constraint-Based Clustering: An Application to Finding Homogeneous Groups of MS Patients

Carla Brodley, Tufts University, Department of Computer Science

Confounding factors in unsupervised data can lead to undesirable clustering results. For example, in medical data sets, age is often a confounding factor in tests designed to judge the severity of a patient's disease through measures of mobility, eyesight and hearing. In such cases, removing age from each instance will not remove its effect from the data, as other features will be correlated with age. We present a method based on constraint-based clustering to remove the impact of such confounding factors and compare it to the standard approach of detrending. Motivated by the need to find homogeneous groups of MS patients, we apply our approach to remove physician subjectivity from patient data. The result is a promising novel grouping of patients that can help uncover the factors that impact disease progression in MS.

The Importance of Workflows in Big Data Research

Yolanda Gil, USC Information Sciences Institute

Big data goes beyond large volume, and encompasses data that is diverse along a vast number of dimensions. A recent editorial in Science reported that the majority of research labs already lack the expertise required to analyze their data. Researchers often resort to forming teams of collaborators that have complementary expertise. This approach to big data analytics is rapidly becoming extremely time-consuming and in most cases impractical, and will not scale as the complexity and variety of the data continue to increase. These big data challenges have created great opportunities for artificial intelligence to make data analytic processes easier to share and more efficient to execute, lowering the barriers to tackling more complex problems. I will describe our ongoing research on intelligent workflow systems that assist users with complex data analysis problems.


Interactive Session 1: Problems of Interest in Clinical Settings

Randall Wetzel, M.D., Children's Hospital LA, Robinder Khemani, M.D., M.S.C.I., Children's Hospital LA, Heather Duncan, M.B., Birmingham Children's Hospital, Jim Fackler, M.D., Johns Hopkins University School of Medicine

The wealth of clinical data in electronic health care records (EHRs) holds no end of fascinating problems and potential solutions for researchers and innovators from other disciplines, but which of these has the potential to impact real-world care that patients receive and to improve outcomes? The first step in deriving meaning and value from digital clinical data is to identify the key problems of interest to clinicians, patients, and other stakeholders. A panel of clinical experts, led by Randall Wetzel, M.D., will make a series of short presentations on problems they would like to see addressed and then lead an open discussion that prepares participants for the brainstorming session after lunch.


Interactive Session 2: Active Collaborative Problem Brainstorming

Suchi Saria

Following from pre-lunch presentations and discussion on open, data-related problems in medicine, in this session participants will be organized into groups (according to numbers on their badges) to brainstorm interesting problems and potential solutions that could be addressed by applying computational methods to large amounts of clinical data.

Combining Crowdsourcing and Machine Learning

Gert Lanckriet, Ph.D., Associate Professor, Department of Electrical and Computer Engineering, University of California, San Diego

Combining crowdsourcing with machine learning enables us to leverage both the effectiveness of human computation and the scalability of machine learning. In this talk I will describe how we have applied this paradigm in a number of settings, including game-based music annotation and clinical diagnosis.

SMART Platform for Enabling Apps on EMR Data

Kenneth Mandl, M.D., M.P.H., Associate Professor, Division of Emergency Medicine, Harvard Medical School; Director, Intelligent Health Laboratory, Children's Hospital Informatics Program, Harvard-MIT Division of Health Sciences and Technology, Children's Hospital Boston

Most vendor electronic health record (EHR) products are architected monolithically, making modification difficult for hospitals and physician practices. An alternative approach is to reimagine EHRs as iPhone-like platforms that support substitutable apps-based functionality. Substitutability is the capability inherent in a system of replacing one application with another of similar functionality. I will discuss the Substitutable Medical Applications, Reusable Technologies (SMART) Platforms project, which seeks to develop a health information technology platform with substitutable apps constructed around core services. The goal of SMART is to create a common platform to support an “app store for health” as an approach to drive down healthcare costs, support standards evolution, accommodate differences in care workflow, foster competition in the market, and accelerate innovation.

Sequential Multiple Assignment Randomized Trials and Treatment Policies

Susan A. Murphy, Ph.D., H.E. Robbins Professor, Department of Statistics; Professor, Department of Psychiatry; Research Professor, Institute for Social Research, University of Michigan

The effective treatment and management of many health disorders requires individualized, sequential decision making, in which the treatment is dynamically adapted over time based on an individual's changing illness course. Treatment policies operationalize this sequential decision making via a sequence of decision rules that specify whether, how, for whom, and when to alter the intensity, type, or delivery of behavioral and pharmacological treatments. In this talk, we discuss, and provide examples of, a randomized clinical trial design--the sequential multiple assignment randomized trial--that is currently being used to develop and optimize treatment policies. We review how one can use the resulting patient data to develop treatment policies; in particular we illustrate a data analysis algorithm that is a "reverse engineered" version of Q-Learning or Fitted Q iteration from the field of reinforcement learning. We will illustrate the use of this algorithm on data from a sequential multiple assignment randomized trial on children with ADHD.
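
For readers unfamiliar with the idea, the following is a minimal sketch of batch Q-learning by backward induction for a two-stage trial. It is not the speaker's algorithm or data: it assumes discrete states and actions, estimates each Q-function with simple cell means instead of a regression model, and uses invented trajectories. The names `cell_means`, `best_action`, and `fit_two_stage_policy` are hypothetical.

```python
from collections import defaultdict

def cell_means(rows):
    """Average the outcome within each (history, action) cell."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for history, action, outcome in rows:
        sums[(history, action)] += outcome
        counts[(history, action)] += 1
    return {key: sums[key] / counts[key] for key in sums}

def best_action(q, history):
    """Return (action, value) with the highest estimated Q for this history."""
    options = [(value, action) for (h, action), value in q.items() if h == history]
    value, action = max(options)
    return action, value

def fit_two_stage_policy(trajectories):
    """Fit stage-2 and stage-1 Q-functions from (s1, a1, s2, a2, y) tuples."""
    # Stage 2: estimate the final outcome for each stage-2 history and action.
    q2 = cell_means([((s1, a1, s2), a2, y) for s1, a1, s2, a2, y in trajectories])
    # Stage 1: replace each observed outcome with the value of acting
    # optimally at stage 2, then estimate Q for each first-stage action.
    stage1_rows = [
        (s1, a1, best_action(q2, (s1, a1, s2))[1])
        for s1, a1, s2, a2, y in trajectories
    ]
    q1 = cell_means(stage1_rows)
    return q1, q2

# Invented toy trajectories: (baseline state, first treatment,
# intermediate response, second treatment, final outcome; higher is better).
trajectories = [
    ("mild", "A", "resp", "continue", 8),
    ("mild", "A", "resp", "intensify", 6),
    ("mild", "A", "nonresp", "continue", 2),
    ("mild", "A", "nonresp", "intensify", 5),
    ("mild", "B", "resp", "continue", 9),
    ("mild", "B", "resp", "intensify", 5),
    ("mild", "B", "nonresp", "continue", 1),
    ("mild", "B", "nonresp", "intensify", 3),
]

q1, q2 = fit_two_stage_policy(trajectories)
print(best_action(q2, ("mild", "A", "nonresp")))  # ('intensify', 5.0)
print(best_action(q1, "mild"))                    # ('A', 6.5)
```

The fitted decision rules are read off by maximizing each Q-function: first choose the stage-1 action, then choose the stage-2 action given the observed response. With continuous covariates, the cell means would typically be replaced by a regression fit at each stage.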

Core Research Questions for Natural Language Processing of Clinical Text

Noemie Elhadad, Ph.D., Assistant Professor, Department of Biomedical Informatics, Columbia University

The Electronic Health Record comes laden with an ambitious array of promises: at the point of patient care, it will improve quality of documentation, reduce cost of care, and promote patient safety; in parallel, as more and more data is collected about patients, the EHR gives rise to exciting opportunities for mining patient characteristics and holds out the hope of compiling comprehensive phenotypic information. Leveraging the information present in the EHR is not a trivial step, however, especially when it comes to the information conveyed in clinical notes. In this talk, I will discuss some of the success stories of clinical NLP and will review the core underlying research questions entailed in processing the EHR, as well as current progress in the field.

Dinner and Poster Session

Day 2


Running the Big Machine: Using Real Time EMR Data in High Risk, High Reward Environments

Warren Sandberg, M.D., Ph.D., Chair, Department of Anesthesiology; Professor, Department of Anesthesiology, Surgery and Biomedical Informatics; Division of Multispecialty Adult Anesthesiology, Vanderbilt University Medical Center

Real-time use of electronic documentation data to support decision-making in the operating room is complicated by unique challenges posed by the tight time frame for decision-making and a 'last mile' problem. Moreover, the OR and procedural areas already have high information flux and high inherent risks, both medical and operational, and all of these features raise the stakes for decision support systems. Nevertheless, these same features, manifesting as frequent opportunities for missed vigilance, as well as decision errors of omission or wrong decisions, demand development of effective automated process monitoring and process control systems capable of reliable operation on a seconds-to-minutes time scale. These results must be pushed, without a requirement for active information seeking or searching, into the hands of providers at the bedside. In this talk, examples of the problem space will be shown, along with results of proof-of-concept solutions (and their limitations) that are beginning to see routine implementation at institutions willing to invest to improve upon the current commercial state-of-the-art clinical OR IT systems. A framework for how such systems are financially self-sustaining is demonstrated. A qualitative description of different forms of meaningful use and complexity, taking into account the unique features of the acute care procedural environment, will be developed.

Predictive Modeling in Intensive Care

Peter Szolovits, Ph.D., Professor, Computer Science and Engineering, Department of Electrical Engineering and Computer Science; Professor, Health Sciences and Technology, Harvard/MIT Division of Health Sciences and Technology; Head, Clinical Decision-Making Group, MIT Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology

From growing collections of intensive care data, we can now build surprisingly accurate models that predict death, morbidities and opportunities to wean patients from dangerous interventions. These models are built using simple machine learning techniques, assuming that the experience of previous patients predicts future ones. Practicing intensivists, however, rely on more than just such experience, and we are developing models that take pathophysiologic knowledge into account, combining expert knowledge and evidence from data. I will review our current work aimed at this objective. Clinical application of such models also remains a future challenge.

The Privacy-Usefulness Tradeoff

Katrina Ligett, Ph.D., Assistant Professor, Computer Science and Economics, California Institute of Technology

Large databases of medical data have enormous potential; thoughtful exploitation of this data will surely save many lives, improve outcomes, and make medical care more efficient and cost-effective. However, such data, ranging from diagnoses to DNA, is also private and potentially quite sensitive. This talk will overview recent work in computer science on formalizing notions of data privacy, which can allow us to reason in a principled manner about the tradeoff between exploiting rich datasets and protecting the people from whom the data are derived. I will focus on "differential privacy," a particular privacy definition, and on some recent tools for achieving it that may have relevance to medical data.
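
As one concrete illustration of differential privacy, the sketch below applies the standard Laplace mechanism to a counting query. It is a textbook construction and not necessarily one of the tools covered in the talk; the function names and the toy records are hypothetical and invented for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution with this scale."""
    # The difference of two independent Exp(1) draws is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy predicate.

    Adding or removing one person's record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon makes this
    single query epsilon-differentially private.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Invented toy records, for illustration only.
patients = [
    {"age": 71, "diabetic": True},
    {"age": 34, "diabetic": False},
    {"age": 58, "diabetic": True},
    {"age": 45, "diabetic": False},
    {"age": 66, "diabetic": True},
]

noisy = private_count(patients, lambda p: p["diabetic"], epsilon=0.5)
print(f"noisy count of diabetic patients: {noisy:.2f}")
```

A smaller `epsilon` means more noise and stronger privacy, which is the tradeoff in the talk title. Under basic sequential composition, the epsilons of repeated queries on the same data add up, so each released answer spends part of an overall privacy budget.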


Interactive Session 3: Multidisciplinary Collaborations, led by Héctor Corrada Bravo

Héctor Corrada Bravo, Ph.D., Assistant Professor, Department of Computer Science, UM Institute for Advanced Computer Studies, Center for Bioinformatics and Computational Biology, University of Maryland

Building effective methods and tools that can derive meaning and value from digital clinical data and support clinical decision making is a challenging problem that demands an interdisciplinary effort and teams that include experts from fields as diverse as medicine, biology, statistical learning, decision theory, human-computer interaction, hardware engineering, systems design, and cognitive psychology. Building such teams and getting them to work together productively can present a variety of challenges, including cross-domain communication, goal alignment, and clashing cultures. Prof. Héctor Corrada Bravo and an interdisciplinary panel of researchers will lead an interactive discussion about how to perform successful cross-domain research.


Emotion Tracking for Memory, Health, and Awareness

Mary Czerwinski, Ph.D., Research Area Manager, Visualization and Interaction (VIBE) Research Group, Microsoft Research

In this talk I will describe novel systems that allow users to reflect on or react to their moods, both retrospectively and in real time. We surveyed potential users of such systems to see what they remembered about their mood swings and emotional behavior patterns over time, and it was clear that they felt they did not have a good handle on this after even 48 hours. They also let us know that they would value systems that notified them of their moods or behavioral trends with simple mobile phone alerts. We then built systems to help users track their moods, and tested them on real users over a period of time. The results were promising. Users found interesting patterns in the data and gave us great feedback on how to evolve the user interface visualizations for real-time feedback on emotional reactions, mood swings and activities.

Modeling and Prediction with ICU Electronic Health Records Data

Benjamin Marlin, Ph.D., Assistant Professor, Department of Computer Science, University of Massachusetts Amherst

As a growing number of hospitals have adopted the use of electronic records systems to manage data collected during the course of patient care, leveraging this data to improve the quality of care has emerged as a key problem. The physiological data contained in ICU electronic health records can be thought of as a multivariate time series that begins when the patient is admitted and ends when the patient is discharged. Each time series contains measurements for a different physiological variable like heart rate or blood pressure. These data are entered by medical staff during routine care and are not continuously recorded. This results in data with several very challenging properties that push the boundaries of machine learning and computational statistics. In this talk, I will describe a number of these properties including temporal sparsity, irregular sampling, the possible presence of sample selection bias and/or non-random missing data, and the confounding effect of interventions. I will present research addressing temporal sparsity using probabilistic mixture models and highlight current research targeting irregular sampling using novel kernel-based models and neural networks.
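
To make irregular sampling concrete, here is a small sketch that resamples sparsely charted readings onto a regular time grid with a generic Gaussian-kernel (Nadaraya-Watson) smoother. This is a standard technique swapped in for illustration, not the kernel-based model from the talk; the function name `kernel_smooth`, the bandwidth, and the readings are all assumptions.

```python
import math

def kernel_smooth(times, values, query_times, bandwidth):
    """Nadaraya-Watson estimate with a Gaussian kernel at each query time.

    Each estimate is a weighted average of the observed values, with
    weights that decay with distance in time from the query point.
    """
    estimates = []
    for t in query_times:
        weights = [math.exp(-0.5 * ((t - ti) / bandwidth) ** 2) for ti in times]
        total = sum(weights)
        if total == 0.0:
            # Every observation is too far away to carry any weight.
            estimates.append(float("nan"))
        else:
            weighted = sum(w * v for w, v in zip(weights, values))
            estimates.append(weighted / total)
    return estimates

# Invented heart-rate readings charted at irregular times (minutes since admission).
obs_minutes = [0, 7, 31, 38, 55]
heart_rate = [92.0, 95.0, 110.0, 108.0, 99.0]

grid = [0, 15, 30, 45, 60]  # regular 15-minute grid
smoothed = kernel_smooth(obs_minutes, heart_rate, grid, bandwidth=10.0)
for minute, estimate in zip(grid, smoothed):
    print(f"t={minute:>2} min  estimated heart rate = {estimate:.1f}")
```

A smoother like this ignores why a measurement was taken at a given time, so it does nothing about the non-random missingness and intervention effects the abstract lists.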

The Life of a Hadoop Cluster (Or, Whatever I Want to Talk About Today)

Joshua Wills, Director, Data Science, Cloudera

Most organizations initially adopt Hadoop in order to have cheap online storage for data of uncertain value and to take advantage of the MapReduce parallel programming framework in order to ensure that important ETL jobs meet their service-level agreements. However, when you talk to those organizations a year later, the value that they find in Hadoop lies in its flexibility: the ability to organize their data in order to answer new questions, run large-scale machine learning algorithms to influence the future, and build fast, scalable backend systems to provide intelligent services to their customers. In this talk, we will walk through the lifecycle of a Hadoop cluster at successful organizations, from the common starter use cases to building a data science team to production deployment, and give an overview of advanced analytical applications deployed by Cloudera customers across a variety of industries, from insurance to healthcare to bioinformatics.


It's Not the Size but What You Do with It: "Tiny Data" and Business Value in a "Perverse" Market

Josh Rosenthal, RowdMap

Interactive Session 4: Meaningful Business Use of Clinical Data

Led by Josh Rosenthal

Clinical Data Hack-a-thon Begins! Evening Reception (drinks and food provided)

Senior Program Committee:

Randall Wetzel, M.D.
Children's Hospital Los Angeles
John Langford, Ph.D.
Microsoft Research
Suchi Saria, Ph.D.
Johns Hopkins Computer Science and Public Health

Need more information?

If you have any questions regarding the symposium, please send us an email.