Course Description

Keynote and Courses


  • Courses

  • Kenji Takeda
    (Director, Health and AI Partnerships, Microsoft Research)
    Big Data and AI - What's It Really Good for?


    There is tremendous hype around big data and AI. This talk aims to cut through the hype by focussing on how big data, AI and machine learning approaches are being successfully used across a range of disciplines, illustrated with real-world examples.


    Dr Kenji Takeda is Director of health and AI partnerships for Microsoft Research Cambridge (UK). He is working on the application of AI and machine learning to transform healthcare by exploiting data in the cloud, empowering those at the frontline of healthcare, and moving towards precision medicine. This includes work in medical imaging on Project InnerEye and working with the global academic and healthcare research communities. He was previously global lead for Microsoft’s Azure for Research, helping researchers take best advantage of cloud computing, including through data science, high-performance computing, and the internet of things. He has extensive experience in cloud computing, high performance and high productivity computing, data-intensive science, scientific workflows, scholarly communication, engineering and educational outreach. He has a passion for open science, and developing novel computational approaches to tackle fundamental and applied problems in healthcare, science and engineering. He is a visiting industry fellow at the Alan Turing Institute, and visiting Associate Professor at the University of Southampton, UK.

    Thomas Bäck & Hao Wang
    (Leiden University) [introductory/intermediate]
    Data Driven Modeling and Optimization for Industrial Applications


    In industrial applications related to production processes and virtual product design, one is often interested in combining data-driven modeling and optimization to support decision making in an efficient way. Using data driven models in combination with optimization is also called prescriptive analytics and has important applications in Industry 4.0.

    In this tutorial, we discuss various aspects of such a combination of data driven modeling and optimization, including multi-objective optimization. For costly applications (e.g., in terms of time), this combination is embodied by methods such as efficient global optimization (EGO) and Bayesian global optimization, since they are able to produce good solutions using a small budget of function evaluations. Such methods rely on supervised machine learning methods for building a nonlinear regression model of the objective function, such that they are directly related to data driven modeling algorithms. In this context, Gaussian process regression is widely used, and we explain such an approach as well as its extension, cluster Kriging, for dealing with large data sets efficiently. Based on this, we will discuss the multi-objective extension to EGO, which greatly supports the decision-making process.

    In the unsupervised machine learning domain, methods for anomaly detection have recently gained increasing interest in many real-world applications. We discuss a new method for anomaly detection in high-dimensional data and illustrate applications of anomaly detection in the automotive industry. Other applications include the combination of simulation and optimization, where efficient global optimization is also a very powerful approach. The tutorial concludes by giving an outlook on applications in Industry 4.0, with examples from steel production and the automotive industry.

    • Industrial applications in simulation-based optimization
    • Industrial applications in production process optimization / Industry 4.0
    • Efficient global optimization / Bayesian global optimization
    • Gaussian process regression / Kriging
    • Cluster Kriging and fuzzy cluster Kriging for big data
    • Infill criteria and generalizations for parallel evaluations and heterogeneous search spaces
    • Infill criteria Optimization / Evolutionary Algorithm
    • Exploration – exploitation tradeoff
    • Multi-objective efficient global optimization
    • Unsupervised data-driven anomaly detection
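
    As an illustration of the infill criteria listed above, the following is a minimal sketch of the expected improvement criterion commonly used in efficient global optimization, written for minimization. It assumes that a Gaussian process model has already produced a predictive mean and standard deviation at each candidate point; the function names and the candidate values are hypothetical and are not taken from the tutorial material.

```python
import math


def expected_improvement(mean, std, best_so_far):
    """Expected improvement (for minimization) at one candidate point.

    mean, std   : predictive mean and standard deviation of the surrogate
                  model at the candidate (assumed to come from a GP).
    best_so_far : smallest objective value observed so far.
    """
    if std <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(best_so_far - mean, 0.0)
    z = (best_so_far - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # First term rewards a low predicted mean (exploitation),
    # second term rewards high uncertainty (exploration).
    return (best_so_far - mean) * cdf + std * pdf


def pick_next(candidates, best_so_far):
    """Return the label of the candidate with the largest expected improvement.

    candidates: list of (label, mean, std) tuples (hypothetical format).
    """
    return max(
        candidates,
        key=lambda c: expected_improvement(c[1], c[2], best_so_far),
    )[0]


# Hypothetical predictions: "a" matches the best value with no uncertainty,
# "b" has a worse mean but a large predictive uncertainty.
candidates = [("a", 1.0, 0.0), ("b", 1.5, 2.0)]
print(pick_next(candidates, best_so_far=1.0))
```

    In this toy setting the uncertain candidate is preferred, which is one way to see the exploration and exploitation tradeoff: a point with a worse predicted mean can still be worth evaluating when the model is unsure about it.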

    Short Bio - Thomas Bäck

    Thomas Bäck is full professor of computer science at the Leiden Institute of Advanced Computer Science (LIACS), Leiden University, The Netherlands, where he has been head of the Optimization, Data Analytics and Industry 4.0 group since 2002. Thomas Bäck has more than 350 publications in predictive analytics, optimization, and industrial applications, as well as two books on evolutionary algorithms: Evolutionary Algorithms in Theory and Practice (1996), and Contemporary Evolution Strategies (2013). He is co-editor of the Handbook of Evolutionary Computation and co-editor-in-chief of the Handbook of Natural Computing. He is also co-editor-in-chief of Theoretical Computer Science C, and editorial board member and associate editor of a number of journals on evolutionary and natural computing. Thomas received the best dissertation award from the German Society of Computer Science (Gesellschaft für Informatik, GI) in 1995, has been a fellow of the International Society for Genetic and Evolutionary Computation since 2003, and received the IEEE CIS Evolutionary Computation Pioneer Award in 2015.

    Thomas has ample experience in real-life applications of optimization and predictive analytics through working with global enterprises such as BMW, Daimler, Honda, Tata Steel, and many others. His work with companies focuses on applications of predictive analytics and optimization, in particular for product development, production process optimization, predictive maintenance, anomaly detection, and related areas.

    Short Bio - Hao Wang

    Hao Wang is a postdoctoral researcher at the Leiden Institute of Advanced Computer Science (LIACS), Leiden University, The Netherlands, where he is a member of the Natural Computing group. Hao Wang received his master's degree in Computer Science from Leiden University in 2013 and obtained his PhD (cum laude, promotor: Prof. Thomas Bäck) in Computer Science from the same university in 2018. His research interests are proposing, improving and analysing stochastic optimization algorithms, especially Evolution Strategies and Bayesian Optimization. In addition, he works on developing statistical machine learning algorithms for big and complex industrial data. He also aims at combining state-of-the-art optimization algorithms with data mining / machine learning techniques to make real-world optimization tasks more efficient and robust.

    Richard Bonneau
    (New York University) [introductory]
    Large Scale Machine Learning Methods for Integrating Protein Sequence and Structure to Predict Gene Function


    There are a large number of ways that machine learning is advancing biology, biotech and medicine. A key grand challenge in biology is the annotation of genomes. We sequence large numbers of genomes for diverse reasons: from diagnosing and tailoring treatment for diseases to cataloging and mining biodiversity. We will dive deeply into the problem of predicting the function of proteins and protein families. This is a good illustrative problem at the intersection of biology and ML for several reasons: 1) input data is diverse, with networks, sequences, and three-dimensional structures all in great abundance, 2) protein function is organized into a hierarchy of many thousands of labels, both organized and challengingly rich, and 3) the potential for positive impact is quite high, with applications in bioremediation, biosynthesis, ecology, and medicine.


    • 1: Framing of the problem: Overview of features, labels, and past work in protein function prediction.
    • 1.1 the data: proteins, sequences, structures and networks.
    • 1.2 Previous work on protein function prediction.
    • 1.3 Common task challenges: the critical assessment of function annotation (CAFA) and MouseFunc
    • 1.4 Introduction to autoencoders with examples drawn from biology and social networks analysis.
    • 2: Learning from networks: deep multimodal autoencoders applied to the problem of integrating multiple biological networks.
    • 2.1 Biological networks: a rich but complicated input
    • 2.2 Using non-negative matrix factorization to combine high dim features
    • 2.3 Deep multimodal autoencoders for combining networks
    • 2.4 Behind the curtain: would other simple methods work better? alternate architectures
    • 3: Learning from structure using graph convolutional neural networks and integrating features
    • 3.1 Similarities between sequences and structure: biology’s workhorses
    • 3.2 Sequence autoencoders
    • 3.3 Protein sequence and protein family CNNs
    • 3.4 Salience mapping to localize function
    • 3.5 Protein structure features overview
    • 3.6 Encapsulating and learning from protein structure using Graph CNNs



    Familiarity with statistics and basic concepts of machine learning. Biology will be introduced and biology knowledge is not needed, but it couldn't hurt to read up on basic concepts like sequence alignment, biological sequence databases, and protein structure.

    Short Bio

    Richard Bonneau joined the Simons Foundation in 2014 to develop next-generation computational biology methods for the Center for Computational Biology at the Flatiron Institute. He is also a faculty member (and former acting director) at NYU’s Center for Data Science. He focuses on creating new methods for using protein structure modeling to interpret genetic variation and new methods for understanding biological networks. Before coming to the foundation, Bonneau was a senior scientist at the Institute for Systems Biology in Seattle, Washington, and before that he was a senior scientist at Structural GenomiX in San Diego, California. He is co-director of the Social Media and Political Participation Lab at New York University. He holds a Ph.D. in biomolecular structure and design from the University of Washington, Seattle.

    Short Bio (Vladimir Gligorijevic, Research Fellow):

    Vladimir Gligorijevic joined the Simons Foundation in March 2017 as a member of the Systems Biology group at the Center for Computational Biology to develop protein function prediction methods using deep learning techniques. Prior to this, Gligorijevic was a research assistant in the computing department at Imperial College London. There, he worked on developing machine learning methods for integration of large-scale, heterogeneous biological data with applications in protein function prediction and precision medicine. Gligorijevic holds a B.Sc. and M.Sc. in physics from the University of Belgrade in Serbia and a Ph.D. in computer science from Imperial College London, United Kingdom.

    Altan Cakir
    (Istanbul Technical University) [introductory/intermediate]
    Processing Big Data with Apache Spark: From Science to Industrial Applications


    Apache Spark, an open-source cluster-computing framework providing a fast and general engine for large-scale processing, has been one of the most exciting technologies of recent years in big data development. The main idea behind this technology is to provide a memory abstraction that allows data to be shared efficiently, in memory, across the different stages of a map-reduce job. Our lecture starts with a brief introduction to Spark and its ecosystem, and then shows some common techniques (classification, collaborative filtering, and anomaly detection, among others) applied to fields such as particle physics, genomics, social media analysis, web analytics and finance. If you have an entry-level understanding of machine learning and statistics, and program in Python or Scala, you will find these subjects useful for working on your own big data projects.


    • Introduction to Data Analysis with Apache Spark
    • Spark Programming Model with RDD
    • Running Spark Applications on Hadoop / AWS Systems
    • Spark SQL
    • Spark Streaming
    • Machine Learning with Spark MLlib
    • Advanced Analytics Applications with Spark
    • Analysis of real-world applications
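
    To give a feel for the programming model behind the RDD topic above, here is a minimal sketch of the flatMap, map and reduceByKey pattern as a word count. It is written in plain Python and deliberately does not use the Spark API: the helper functions `flat_map` and `reduce_by_key` are hypothetical stand-ins that only mimic the shape of the corresponding RDD transformations on a single machine.

```python
from collections import defaultdict
from functools import reduce


def flat_map(func, records):
    """Apply func to every record and flatten the resulting lists."""
    return [out for record in records for out in func(record)]


def reduce_by_key(func, pairs):
    """Group (key, value) pairs by key and fold each group with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


def word_count(lines):
    """Word count expressed as flat_map -> map -> reduce_by_key."""
    words = flat_map(str.split, lines)            # one record per word
    pairs = [(word.lower(), 1) for word in words]  # map step: (word, 1)
    return reduce_by_key(lambda a, b: a + b, pairs)


# Hypothetical input lines standing in for a distributed text file.
print(word_count(["big data", "Big Spark"]))
```

    In Spark itself the intermediate collections would be partitioned across a cluster and kept in memory between stages; the sketch only shows how a job decomposes into per-record transformations followed by a per-key aggregation.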


    • Unified Analytics Engine for Big Data
    • Advanced Analytics with Spark: Patterns For Learning From Data at Scale, A. Teller, M. Pumperla, M. Malohlava
    • Mastering Machine Learning with Apache Spark 2.x, S. Amirgodshi, M. Rajendran, B. Hall, S. Mei


    Python, Statistics, Machine Learning

    Short Bio

    Altan Cakir received his M.Sc. degree in theoretical particle physics from Izmir Institute of Technology in 2006 and then went straight to graduate school at the Karlsruhe Institute of Technology, Germany, from which he earned a Ph.D. in experimental high energy and particle physics in 2010. During his Ph.D., he was responsible for research on new physics searches with the CMS detector at the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN). Thereafter he was a post-doctoral research fellow at Deutsches Elektronen-Synchrotron (DESY), a national nuclear research center in Hamburg, Germany, where he spent 5 years, and then took up his present position as a full professor at Istanbul Technical University (ITU), Istanbul, Turkey. Currently, Altan Cakir is group leader of the ITU-CMS group at CERN and leads a data analysis group at the CMS detector. Furthermore, he was a visiting faculty member at Fermi National Accelerator Laboratory (Fermilab), Illinois, USA in 2017. His group’s expertise is focused on machine learning techniques in large-scale data analysis. However, their research is very much interdisciplinary, with expertise in the group ranging from particle physics, detector development and big data synthesis to economy and industrial applications.

    Altan Cakir has been involved in a large number of high-profile research projects at CERN, DESY and Fermilab over the last thirteen years. He enjoys being able to integrate his research with teaching key concepts of science and big data technologies. He finds it rewarding to be part of the development of the next generation of scientists, and to help his students move on to careers all over the world, in academia, industry and government.

    The following lectures on big data are periodically given by Assoc. Prof. Dr. Altan Cakir in the Big Data and Business Analytics Program at Istanbul Technical University: Big Data Technologies and Applications, and Machine Learning with Big Data.

    Jiannong Cao
    (Hong Kong Polytechnic University) [introductory/intermediate]
    Cross-Domain Multi-Source Big Data Fusion and Analytics


    Big data analytics using cross-domain multi-source datasets allows us to study phenomena of interest by fusing views from multiple angles, helping us to identify meaningful problems and discover new insights. However, we need methods and techniques to address challenges such as heterogeneity, uncertainty and high dimensionality in analyzing cross-domain datasets. In this lecture, I will describe a general framework for cross-domain big data fusion and analytics, and introduce existing work, including our own, on fusing and analyzing datasets from multiple domains to uncover the underlying patterns, correlations and interactions. Example applications include human and urban dynamics, such as predicting traffic congestion, optimizing demand dispatching in emerging on-demand services, and designing wireless networks.


      1. Introduction to big data analytics (0.5 hour)
      2. Cross-domain multi-source data analytics: opportunities and challenges (1 hour)
      3. Cross-domain multi-source data analytics framework and techniques (1.5 hours)
      4. Examples: our work on cross-domain data analytics (1 hour)
      5. Summary and future directions (0.5 hour)


    • [1] Y. Zheng, “Methodologies for Cross-Domain Data Fusion: An Overview”, IEEE Transactions on Big Data, 1(1): 16-34 (2015)
    • [2] C. Shi, Y. Li, J. Zhang, Y. Sun, P. S. Yu, “A Survey of Heterogeneous Information Network Analysis”, IEEE Transactions on Knowledge and Data Engineering, 29(1), January (2017)
    • [3] C-W. Tsai, C-F. Lai, H-C. Chao, A. V. Vasilakos, “Big Data Analytics: a Survey”, Journal of Big Data, 2:21, December (2015)
    • [4] Y. Zheng, L. Capra, O. Wolfson, H. Yang, “Urban Computing: Concepts, Methodologies, and Applications”, ACM Transactions on Intelligent Systems and Technology, Vol. 5, No. 3, September (2014)


    Basic knowledge in big data analytics, machine learning, and urban computing applications.

    Short bio:

    Dr. Cao is currently a Chair Professor in the Department of Computing at The Hong Kong Polytechnic University, Hong Kong. He is also the director of the Internet and Mobile Computing Lab in the department and the director of the University’s Research Facility in Big Data Analytics. His research interests include parallel and distributed computing, wireless sensing and networks, pervasive and mobile computing, and big data and cloud computing. He has co-authored 5 books, co-edited 9 books, and published over 500 papers in major international journals and conference proceedings. He has received Best Paper Awards from conferences including DSAA 2017, IEEE SMARTCOMP 2016, ISPA 2013, and IEEE WCNC 2011.

    Dr. Cao served as the Chair of the Technical Committee on Distributed Computing of the IEEE Computer Society from 2012 to 2014, and as a member of the IEEE Fellows Evaluation Committee of the Computer Society and the Reliability Society, a member of the IEEE Computer Society Education Awards Selection Committee, a member of the IEEE Communications Society Awards Committee, and a member of the Steering Committee of IEEE Transactions on Mobile Computing. Dr. Cao has also served as chair and member of organizing and technical committees of many international conferences, and as associate editor and member of the editorial boards of many international journals. Dr. Cao is a Fellow of IEEE and an ACM Distinguished Member. In 2017, he received the Overseas Outstanding Contribution Award from the China Computer Federation.

    Nitesh Chawla
    (University of Notre Dame) [intermediate/advanced]
    Network Science: Representation Learning and Higher Order Networks


    Network-based representation has quickly emerged as the norm in representing rich interactions among the components of a complex system for analysis and modeling. It is thus critical for the network to truly represent the inherent phenomena in the complex system to avoid incorrect analysis results or conclusions. For a variety of complex systems, representing them with the conventional first-order (Markov property) networks, which is the norm, captures only the first order relationship connections in the underlying data, missing the variable and higher order of dependencies that might be driving the system. An accurate network representation of the underlying data is a prerequisite for reliable analyses that build upon the network. The limited representation of the first-order network can lead to inaccurate results that rely on the network representation of the underlying data. This has spurred recent research that goes beyond the dyadic interactions to higher and variable orders of interactions to more accurately construct the network representation of the underlying data from a complex system.

    The goal of this tutorial is to provide an introduction to higher order networks and related techniques, and to representation learning for networks (inclusive of higher order and first order networks).
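
    The difference between a first-order and a higher order representation can be made concrete with a small sketch. The code below estimates transition probabilities from observed paths, using either only the current node as context (first order) or the last two nodes (second order). It is a simplified, fixed-order illustration under assumed toy data; it is not the variable-order construction described in the readings, and the function name `transition_probs` is hypothetical.

```python
from collections import Counter, defaultdict


def transition_probs(paths, order=1):
    """Estimate next-step probabilities given the last `order` nodes.

    paths : list of node sequences (e.g. observed trajectories).
    order : how many preceding nodes form the context.
    Returns {context_tuple: {next_node: probability}}.
    """
    counts = defaultdict(Counter)
    for path in paths:
        for i in range(order, len(path)):
            context = tuple(path[i - order:i])
            counts[context][path[i]] += 1
    return {
        context: {
            nxt: n / sum(counter.values()) for nxt, n in counter.items()
        }
        for context, counter in counts.items()
    }


# Hypothetical trajectories: where a walker goes after C depends on
# whether it arrived from A or from B.
paths = [["A", "C", "D"], ["A", "C", "D"], ["B", "C", "E"], ["B", "C", "E"]]

first = transition_probs(paths, order=1)
second = transition_probs(paths, order=2)

print(first[("C",)])        # first order: D and E look equally likely
print(second[("A", "C")])   # second order: coming from A, always D
print(second[("B", "C")])   # second order: coming from B, always E
```

    With these toy paths the first-order network mixes the two flows through C, while the second-order contexts keep them apart, which is the kind of dependency a first-order network cannot represent.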


    • Introduction to Higher Order Networks
    • Modeling Higher Order Networks
    • Visualization of Higher Order Networks
    • Anomaly Detection Using Higher Order Networks
    • Applications of Higher Order Networks
    • Learning Embeddings on Homogeneous Networks
    • Learning Embeddings on Heterogeneous Networks
    • Learning Embeddings on Higher Order Networks
    • Applications of Representation Learning


    Introductory background in network science, machine learning, deep learning

    Relevant Readings

    • Xu, Jian, Thanuka L. Wickramarathne, and Nitesh V. Chawla. "Representing higher-order dependencies in networks." Science advances 2.5 (2016): e1600028.

    • Rosvall, Martin, et al. "Memory in network flows and its effects on spreading dynamics and community detection." Nature communications 5 (2014): 4630.

    • Dong, Yuxiao, Nitesh V. Chawla, and Ananthram Swami. "metapath2vec: Scalable representation learning for heterogeneous networks." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.

    • Grover, Aditya, and Jure Leskovec. "node2vec: Scalable feature learning for networks." Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016.

    • Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "Deepwalk: Online learning of social representations." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014

    Short Bio

    Nitesh Chawla is the Frank M. Freimann Professor of Computer Science and Engineering, and director of the research center on network and data sciences at the University of Notre Dame. His research is focused on machine learning and network science with a special initiative on societal impact and advancing the common good. He started his tenure-track career at Notre Dame in 2007, and quickly advanced from assistant professor to an endowed (chaired) full professor position in 2016. He has brought in over $26M in research funding to Notre Dame since 2007. He has received numerous awards for research, innovation, and teaching. He is the recipient of the 2015 IEEE CIS Outstanding Early Career Award, the IBM Watson Faculty Award, the IBM Big Data and Analytics Faculty Award, the National Academy of Engineering New Faculty Fellowship, and the 1st Source Bank Technology Commercialization Award, and his PhD dissertation also received the Outstanding Dissertation Award. His papers have received several outstanding paper nominations and awards. In addition, his students are also recipients of several honors; recent ones include an Honorable Mention for the Outstanding Dissertation Award at KDD’17 and second prize at the 2017 ACM Grace Hopper student research competition. In recognition of the societal impact of his research, he was recognized with the Rodney Ganey Award and Michiana 40 Under 40. He is a two-time recipient of the Outstanding Teaching Award at Notre Dame. He is a Fellow of the Reilly Center for Science, Technology, and Values; a Fellow of the Institute of Asia and Asian Studies; and a Fellow of the Kroc Institute for International Peace Studies at Notre Dame. He is the founder of Aunalytics, a data science software and solutions company.

    Nello Cristianini
    (University of Bristol) [introductory]
    The Interface between Big Data and Society



    Ethical and Social Implications of AI

    The introduction of AI in the midst of society has created new opportunities and new challenges, that include deep issues of fairness, transparency, and human autonomy. The solution of those new problems cannot be just technical, but there is a role for technical solutions too, within a more general effort to understand what can be done - so that we can safely coexist with intelligent machines in a data-driven society.

    Machine decisions can affect our rights, and we need to ensure that Artificial Intelligence does not absorb biases by being trained on biased data. Is our autonomy affected by interacting with intelligent machines designed to persuade us?

    Social Media Analysis

    The analysis of social media content can reveal new information about society, public opinion and even possibly biology. We review recent work that is based on statistical text mining, and that integrates various signals to reveal novel psychological and behavioural patterns in large populations.

    What can we learn about our psychological state by analysing social media content on a vast scale? There are patterns of variation in our cognitive and emotional states, as well as personal concerns.

    News Content Analysis

    The analysis of media content, both present and historical, can reveal new insights about trends, biases and events, that would otherwise be difficult to analyse. We show a series of examples - starting with the digitisation process and ending with the creation of maps and timelines - illustrating how big data and digital humanities can meet and provide new help for historians.

    We discuss a few specific studies. The first study is the first look at 150 years of British regional newspapers, aimed at identifying large-scale trends (temporal and geographic) which would otherwise be difficult to detect. The rest is a series of follow-up studies, and the discussion of the tools and data behind the first study. The final study is based on the digitisation and analysis of historical newspapers from the Italian town of Gorizia.



    See above.

    Short Bio

    Nello Cristianini has been a Professor of Artificial Intelligence at the University of Bristol since March 2006, and is a recipient of both an ERC Advanced Grant and a Royal Society Wolfson Merit Award. He has wide research interests in the areas of data science, artificial intelligence, machine learning, and applications to computational social sciences, digital humanities, and news content analysis.

    He has contributed extensively to the field of statistical AI. Before the appointment to Bristol he held faculty positions at the University of California, Davis, and visiting positions at the University of California, Berkeley, and at many other institutions. Before that he was a research assistant at Royal Holloway, University of London. He has also held positions in industry. He has a PhD from the University of Bristol, an MSc from Royal Holloway, University of London, and a degree in Physics from the University of Trieste. He is co-author of the books 'An Introduction to Support Vector Machines' and 'Kernel Methods for Pattern Analysis' with John Shawe-Taylor, and 'Introduction to Computational Genomics' with Matt Hahn (all published by Cambridge University Press).

    Geoffrey C. Fox
    (Indiana University, Bloomington) [intermediate]
    High Performance Big Data Computing


    Big data problems can be classified into three main categories: batch processing (Hadoop), stream processing (Apache Flink and Apache Heron) and iterative machine learning and graph problems (Apache Spark). Each of these problems has different processing, communication and storage requirements. Therefore, each system provides separate solutions to these needs.

    All these systems use a dataflow programming model to perform distributed computations. With this model, big data frameworks represent a computation as a generic graph in which the nodes perform computations and the edges represent the communication between them. The nodes of the graph can be executed on different machines in the cluster depending on the requirements of the application.
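
    The dataflow model described above can be sketched in a few lines. The example below represents a computation as a graph of named tasks connected by edges and runs each task once all of its upstream tasks have finished. It is a single-process illustration using Python's standard `graphlib` module (Python 3.9 or later), not the API of Twister2 or any other framework; the task names and the `run_dataflow` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_dataflow(tasks, edges):
    """Execute a dataflow graph in dependency order.

    tasks : dict mapping task name -> function that takes a list of
            upstream outputs (in edge order) and returns an output.
    edges : list of (source, destination) pairs; each edge carries the
            output of `source` to `destination`.
    Returns a dict mapping task name -> output.
    """
    upstream = {name: [] for name in tasks}
    for src, dst in edges:
        upstream[dst].append(src)

    results = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(upstream).static_order():
        inputs = [results[u] for u in upstream[name]]
        results[name] = tasks[name](inputs)
    return results


# Hypothetical four-node graph: one source, two parallel branches, one sink.
tasks = {
    "source": lambda inputs: [1, 2, 3, 4],
    "square": lambda inputs: [x * x for x in inputs[0]],
    "double": lambda inputs: [2 * x for x in inputs[0]],
    "sum": lambda inputs: sum(inputs[0]) + sum(inputs[1]),
}
edges = [
    ("source", "square"),
    ("source", "double"),
    ("square", "sum"),
    ("double", "sum"),
]

print(run_dataflow(tasks, edges)["sum"])  # (1+4+9+16) + (2+4+6+8) = 50
```

    In a distributed framework the "square" and "double" nodes could run on different machines, and the edges would correspond to network communication rather than in-memory lookups.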

    We identify four key tasks in big data systems:

    1) Acquiring computing resources, 2) spawning and managing executor processes/threads, 3) handling communication between processes, and 4) managing the data, both static and intermediate. An independent component can be developed for each of these tasks. However, current systems provide tightly coupled solutions to these tasks, excluding resource scheduling.

    Twister2 [1-3] is a loosely-coupled component-based approach to big data. Each of the four essential abstractions has different implementations to support various applications. Therefore, it has a pluggable architecture. It can be used to solve all three types of big data problems mentioned above.

    In this tutorial, we review big data problems and systems, explain the Twister2 architecture and features, and provide examples for developing and running applications on the Twister2 system. By learning Twister2, big data developers will gain experience with a flexible big data solution that can be used to solve all three types of big data problems.


    • Introduction to big data problems and systems
    • Decoupling big data solutions (big data stack)
    • Twister2 overview
    • Resource scheduling (Kubernetes, Mesos)
    • Communication (MPI on Twister2)
    • Task scheduling and execution (Fault Tolerance)
    • Data representation
    • Developing Big Data Solutions in Twister2
    • Batch Processing example
    • Streaming example
    • Machine learning example
    • Conclusion


    • Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe, Vibhatha Abeykoon, Geoffrey Fox, "Twister2: Design of a Big Data Toolkit" in EXAMPI 2017 workshop November 12 2017 at SC17 conference, Denver CO 2017.
    • Supun Kamburugamuve, Pulasthi Wickramasinghe, Kannan Govindarajan, Ahmet Uyar, Gurhan Gunduz, Vibhatha Abeykoon, Geoffrey Fox, "Twister:Net - Communication Library for Big Data Processing in HPC and Cloud Environments", Proceedings of Cloud 2018 Conference July 2-7 2018, San Francisco.
    • Kannan Govindarajan, Supun Kamburugamuve, Pulasthi Wickramasinghe, Vibhatha Abeykoon, Geoffrey Fox, "Task Scheduling in Big Data - Review, Research Challenges, and Prospects", Proceedings of 9th International Conference on Advanced Computing (ICoAC), December 14-16, 2017, India.


    Basic knowledge of computer algorithms and software; knowledge of machine learning; some knowledge about big data systems including streaming systems and batch systems.

    Short Bio

    Geoffrey Fox received a Ph.D. in Theoretical Physics from Cambridge University where he was Senior Wrangler. He is now a distinguished professor of Engineering, Computing, and Physics at Indiana University where he is director of the Digital Science Center, and Department Chair for Intelligent Systems Engineering at the School of Informatics, Computing, and Engineering. He previously held positions at Caltech, Syracuse University, and Florida State University after being a postdoc at the Institute for Advanced Study at Princeton, Lawrence Berkeley Laboratory, and Peterhouse College Cambridge. He has supervised the Ph.D. of 73 students and published around 1300 papers (over 500 with at least 10 citations) in physics and computing, with an h-index of 77 and over 35000 citations. He is a Fellow of APS (Physics) and ACM (Computing) and works on the interdisciplinary interface between computing and applications. His current application collaborations are in Biology, Pathology, Sensor Clouds and Ice-sheet Science, Image processing and Particle Physics. His architecture work is built around High-performance computing enhanced Software Defined Big Data Systems on Clouds and Clusters. The analytics focuses on scalable parallel machine learning. He is an expert on streaming data and robot-cloud interactions. He is involved in several projects to enhance the capabilities of Minority Serving Institutions. He has experience in online education and its use in MOOCs for areas like Data and Computational Science.

    David Gerbing
    (Portland State University) [introductory]
    Data Visualization with R


    This seminar introduces the R language via data visualization, aka computer graphics, in the context of a discussion of best practices and considerations for the analysis of big data. Code to generate the graphs is presented in terms of R base graphics, Hadley Wickham's ggplot2 package, and the author's lessR package. The content of the seminar is summarized with R Markdown files that include commentary and implementation of all the code presented in the seminar, available to all participants. These explanatory examples serve as templates for applications to new data sets.


    Session 1

    Introduction to R

    • R functions and syntax
    • R variable types
    • Read data into R

    Specialized Graphic Functions

    • Functions from the lessR package
    • The ggplot function from the ggplot2 package
    • Base R graphics


    Session 2

    Bar Charts for Distributions of Categorical Variables

    • R factor variables and lessR doFactors function
    • Counts of one variable
    • Joint frequencies of two variables
    • Statistics for a second variable plotted against one variable

    Graphs for Distributions of a Continuous Variable

    • Histograms and binning
    • Densities
    • Boxplot
    • Scatterplot, 1-dimensional
    • Introduction to the integrated Violin/Box/Scatterplot, the VBS plot

    Scatterplots, 2-dimensional

    • With two or more continuous variables
    • A categorical variable with a continuous variable
    • Bubble plots with categorical variables
    • Two variable plot with a third variable, categorical or continuous

    Session 3

    Scatterplots, 2-dimensional (continued)

    • Visualization of relationships for big data sets

    Time Series Plots

    • One-variable plot
    • Stacked time-series plot
    • Area plots
    • Forecasts

    Interactive R Visualizations

    • Shiny
    • Plotly


    Gerbing, D. W. (2013). R Data Analysis without Programming. NY: Routledge.
    Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.


    Basic understanding of data analysis


    David Gerbing, Ph.D., since 1987 Professor of Quantitative Methods, School of Business Administration, Portland State University. He received his Ph.D. in quantitative psychology from Michigan State University in 1979. From 1979 until 1987 he was an Assistant and then Associate Professor of Psychology and Statistics at Baylor University. He has authored R Data Analysis without Programming, which describes his lessR package, and many articles on statistical techniques and their application in a variety of journals that span several academic disciplines.

    Craig Knoblock
    (University of Southern California) [intermediate/advanced]
    Building Knowledge Graphs


    There is a tremendous amount of data spread across the web and stored in databases that can be turned into an integrated semantic network of data, called a knowledge graph. Knowledge graphs have been applied to a variety of challenging real-world problems including combating human trafficking by analyzing web ads, identifying illegal arms sales from online marketplaces, and predicting cyber attacks using data extracted from both the open and dark web. However, exploiting the available data to build knowledge graphs is difficult due to the heterogeneity of the sources, the scale of the data, and noise in the data. In this course I will present the techniques for building knowledge graphs, including extracting data from online sources, cleaning and transforming the data, aligning the data to a common terminology, linking the data across sources, and constructing and querying knowledge graphs at scale.


      1. Knowledge graphs
      2. Web data extraction
      3. Data cleaning and transformation
      4. Source alignment
      5. Entity linking
      6. Graph construction and querying
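
    The entity linking and graph construction steps in the outline above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `TripleStore` class, the `normalize` key, the `seenIn` predicate, and the sample records are hypothetical, and real systems use far more robust linking than exact matching on a normalized name.

```python
import re


def normalize(name):
    """Crude entity key: lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


class TripleStore:
    """A tiny in-memory knowledge graph of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def query(self, s=None, p=None, o=None):
        """Return the sorted triples matching the pattern; None is a wildcard."""
        return sorted(
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


def build_graph(sources):
    """Link records across sources by normalized name, then emit triples.

    `sources` maps a source label to a list of dict records with a "name" field.
    """
    kg = TripleStore()
    for source, records in sources.items():
        for record in records:
            # Entity linking step: records with the same key become one node.
            entity = "entity:" + normalize(record["name"])
            kg.add(entity, "seenIn", source)
            for field, value in record.items():
                if field != "name":
                    kg.add(entity, field, value)
    return kg


# Hypothetical records from two sources that spell the same name differently.
sources = {
    "site_a": [{"name": "ACME Corp.", "city": "Los Angeles"}],
    "site_b": [{"name": "Acme Corp", "phone": "555-0100"}],
}
kg = build_graph(sources)
```

    Calling `kg.query(s="entity:acmecorp")` returns the attributes contributed by both sources for the single linked entity, which is the point of integrating heterogeneous sources into one graph.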


    Background in computer science and some basic knowledge of AI, machine learning, and databases will be helpful, but not required.


    Short Bio

    Craig Knoblock is a Research Professor of both Computer Science and Spatial Sciences at the University of Southern California (USC), Executive Director of the USC Information Sciences Institute, Research Director of the Center on Knowledge Graphs, and Associate Director of the Informatics Program at USC. He received his Bachelor of Science degree from Syracuse University and his Master’s and Ph.D. from Carnegie Mellon University in computer science. His research focuses on techniques for describing, acquiring, and exploiting the semantics of data. He has worked extensively on source modeling, schema and ontology alignment, entity and record linkage, data cleaning and normalization, extracting data from the Web, and combining all of these techniques to build knowledge graphs. He has published more than 300 journal articles, book chapters, and conference papers on these topics and has received 7 best paper awards on this work. Dr. Knoblock is a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI), a Fellow of the Association of Computing Machinery (ACM), past President and Trustee of the International Joint Conference on Artificial Intelligence (IJCAI), and winner of the 2014 Robert S. Engelmore Award.

    Geoff McLachlan
    (University of Queensland) [intermediate/advanced]
    Applying Finite Mixture Models to Big Data



    Attention is focussed initially on the role of finite mixture models in modelling and clustering heterogeneous data. The use of the EM algorithm to fit mixture distributions via maximum likelihood is reviewed. Extensions of the commonly used normal (Gaussian) mixture models are considered through the use of hidden variables to formulate various skew component distributions suitable for representing clusters in the presence of skewness and kurtosis. Also, hidden variables in the form of latent factors are adopted to fit mixtures of factor analyzers in situations where the dimension of the feature data is large relative to the number of observations within a cluster. Consideration is given to further extensions of this approach to handle big data of possibly high dimension after an appropriate reduction where necessary in the number of variables. This latter reduction in the first instance is effected by a clustering of the variables via mixtures of linear mixed models. There is coverage also of deep normal mixtures to increase the flexibility of these models. Another role of mixture models addressed is their use for controlling the false positive rate in multiple comparisons and testing. The statistical methodology to be presented is to be highlighted by consideration of several real-data examples from various fields, including flow cytometry, bioinformatics, and image analysis.


    Role of mixture distributions in modelling and clustering heterogeneous data; maximum likelihood fitting of mixture models via the EM algorithm; Normal (Gaussian) mixture models and extensions for clusters in the presence of skewness and kurtosis; mixtures of linear mixed models; mixtures of factor analyzers for high-dimensional data; deep mixture factor analyzers; role of mixture models in controlling the false discovery rate in multiple comparisons and testing; applications of the aforementioned methodology to data from flow cytometry, microarray analyses, magnetic resonance imaging, and other various studies.
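
    As a rough illustration of maximum likelihood fitting of a normal mixture via the EM algorithm, the following sketch fits a univariate Gaussian mixture. It is not taken from the course material: the function names, the quantile-based initialization, the fixed number of iterations, and the variance floor are all assumptions, and it covers only the basic normal case, not the skew, factor-analytic, or deep extensions.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, k=2, n_iter=100):
    """Fit a k-component univariate normal mixture by EM.

    Returns (weights, means, variances). Initialization at evenly spaced
    quantiles is an illustrative choice, not a recommendation.
    """
    n = len(data)
    ordered = sorted(data)
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    variances = [grand_var] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: posterior probability that each observation belongs to each component.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])

        # M-step: weighted updates of mixing proportions, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # floor guards against degenerate components

    return weights, means, variances


# Synthetic two-cluster data (hypothetical example, fixed seed for repeatability).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.0) for _ in range(200)]
weights, means, variances = em_gmm_1d(data, k=2)
```

    The responsibilities computed in the E-step are the hidden-variable posteriors; assigning each observation to the component with the largest responsibility gives the model-based clustering.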


      1. Lee, S.X., Leemaqz, K.L., and McLachlan, G.J. (2018). A block EM algorithm for multivariate skew normal and skew t-mixture models. IEEE Transactions on Neural Networks and Learning Systems. (Advance Access published 09 March, 2018). To appear. Preprint arXiv:1608.02797.
      2. Lee, S.X. and McLachlan, G.J. (2016a). Finite mixtures of canonical fundamental skew t-distributions: the unification of the restricted and unrestricted skew t-mixture models. Statistics and Computing 26, 573-589. Correction. Preprint (2014) arXiv:1401.8182.
      3. Lee, S.X. and McLachlan, G.J. (2018). EMMIXcskew: an R Package for the fitting of a mixture of canonical fundamental skew t-distributions. Journal of Statistical Software 83, No. 3.
      4. Lee, S.X., McLachlan, G.J., and Pyne, S. (2016). Modelling of inter-sample variation in flow cytometric data with the joint clustering and matching (JCM) procedure. Cytometry: Part A 89A, 30-43.
      5. McLachlan, G.J., Bean, R.W., and Peel, D. (2002). A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18, 413-422.
      6. Ng, S.K., McLachlan, G.J., Wang, K., Nagymanyoki, Z., Liu, S., and Ng, S.W. (2015). Inference on differential expression using cluster-specific contrasts of mixed effects. Biostatistics 16, 98-112.
      7. Nguyen, H.D., McLachlan, G.J., Ullmann, J.F.P., and Janke, A.L. (2016). Spatial clustering of time-series via mixtures of autoregressive models and Markov random fields. Statistica Neerlandica 70, 414-439.
      8. Viroli, C. and McLachlan, G.J. (2018). Deep Gaussian mixture models. Statistics and Computing. (Advance Access, published 1 December, 2017). To appear. Preprint arXiv:1711.06929.


    Participants are expected to have a knowledge of Statistics at a level corresponding to at least the third year of a three-year undergraduate course in Science. Background knowledge particularly useful for this Course can be obtained from the two monographs McLachlan and Krishnan (2008, The EM Algorithm and Extensions. Second Edition, Wiley) and McLachlan and Peel (2000, Finite Mixture Models, Wiley). This material, however, will be briefly reviewed.

    Short Bio

    Geoff McLachlan is Professor of Statistics in the School of Mathematics and Physics at the University of Queensland. His research in Statistics is in the related fields of classification, cluster and discriminant analyses, image analysis, machine learning, neural networks, and pattern recognition, and in the field of statistical inference. The focus in the latter field has been on the theory and applications of finite mixture models and on estimation via the EM algorithm. A common theme of his research in these fields has been statistical computation, with particular attention being given to the computational aspects of the statistical methodology. He has written over 270 peer-reviewed research articles which have received over 40,000 citations. He has written six monographs on discriminant analysis (McLachlan, 1992), mixture models (McLachlan and Basford, 1988; McLachlan and Peel, 2000), the EM algorithm (McLachlan and Krishnan, 1997 & 2008), and the analysis of gene expression data (McLachlan, Do, and Ambroise, 2004). He is currently an associate editor of several journals including Advances in Data Analysis and Classification, BMC Bioinformatics, Journal of Classification, Statistics and Computing, and Statistical Modelling. He is a former president of the International Federation of Classification Societies (IFCS). He is a fellow of the Australian Academy of Science and also a fellow of the American Statistical Association and the Royal Statistical Society. From 2007 to 2011 he was an Australian Research Council Professorial Fellow. He has received several awards, including the Pitman Medal of the Statistical Society of Australia in 2010, the IEEE ICDM Research Contributions Award in 2011, and the IFCS Research Medal for Outstanding Research Achievements in 2017.

    Folker Meyer
    (Argonne National Laboratory) [intermediate]
    Skyport2: A Multi Cloud Framework for Executing Scientific Workflows


    Executing scientific workflows at scale poses a significant challenge to many teams and institutions. We present a unified system for portable, reproducible execution on local and remote resources.

    The Skyport system [1] provides containerized workflow execution with Docker [2], allowing researchers to execute scientific workflows across system boundaries using the AWE [1, 3, 4] workflow engine and SHOCK [5] as an active object store. AWE and SHOCK are implemented as RESTful services for managing and executing workflows; workflows are specified in Common Workflow Language (CWL) format [6].

    CWL is a single, multi-vendor language to describe scientific workflows, created by a community of practitioners. In addition to being multi-vendor, and thus supporting multiple engines for computing, another critical feature of CWL is the separation of science content and computational implementation. This allows experts in each of the domains to focus on their area.

    One large-scale use case driving the development of Skyport is MG-RAST [7, 8], a hosted data analytics system used by over 40,000 researchers all over the planet.


    • Session 1: Initial system setup and execution of demo workflow
      • System overview
        • Scientific computing & workflows
        • Distributed computing
        • Example use case MG-RAST
        • CWL, Docker (why did we use those)
      • Install Skyport2 services (using single Docker Compose image)
        • Install authentication services, AWE-server and SHOCK-server
        • Load demo data into SHOCK
        • Load demo workflow
        • Install awe-worker node
        • Download results from SHOCK
    • Session 2: Customizing system setup and monitoring execution
      • Setup of expanded system
        • Basis for customization
        • Creation of custom data types
        • Monitoring execution via the web interface or cmd-line
        • Adding tools to workflows
        • Adding workflow steps
      • Hands-on exercise creating and executing a customized workflow
    • Session 3: Advanced topics
      • Combining AWE with other execution engines
        • Using Singularity [9]
      • Adding data types to SHOCK


    The assumption is that participants will bring a laptop with the ability to execute multiple Docker containers. Participants should test their available memory and hard drive space. Software systems required:

    • Docker
      • Docker Compose (recent version)
    • ASCII editor: e.g. Emacs, vi, Textmate, ...

    The participants will be asked to install software (via Docker), modify configuration files and perform other Unix command line style activities.


      1. Wolfgang Gerlach WT, Andreas Wilke, Dan Olson, Folker Meyer: Container orchestration for scientific workflows. In: Cloud Engineering (IC2E), 2015 IEEE International Conference on: 2015/3/9. IEEE Transactions on Knowledge and Data Engineering 2016: 377-378.
      2. Merkel D: Docker: lightweight Linux containers for consistent development and deployment. Linux Journal 2014, 2014(239):2.
      3. Tang W, Bischof J, Desai N, Mahadik K, Gerlach W, Harrison T, Wilke A, Meyer F: Workload Characterization for MG-RAST Metagenomic Data Analytics Service in the Cloud. In: Proc of IEEE Int’l Conf on Big Data. 2014.
      4. Tang W, Wilkening J, Bischof J, Gerlach W, Wilke A, Desai N, Meyer F: Building Scalable Data Management and Analysis Infrastructure for Metagenomics. In: 5th International Workshop on Data-Intensive Computing in the Clouds. IEEE 2013.
      5. Bischof J, Wilke A, Gerlach W, Harrison T, Paczian T, Tang W, Trimble W, Wilkening J, Desai N, Meyer F: Shock: Active Storage for Multicloud Streaming Data Analysis. In: 2nd IEEE/ACM International Symposium on Big Data Computing: 2015; Limassol, Cyprus.
      6. Common Workflow Language, v1.0.
      7. Wilke A, Bischof J, Gerlach W, Glass E, Harrison T, Keegan KP, Paczian T, Trimble WL, Bagchi S, Grama A et al: The MG-RAST metagenomics database and portal in 2015. Nucleic Acids Res 2016, 44(D1):D590-594.
      8. Meyer F, Bagchi S, Chaterji S, Gerlach W, Grama A, Harrison T, Paczian T, Trimble WL, Wilke A: MG-RAST version 4—lessons learned from a decade of low-budget ultra-high-throughput metagenome analysis. Briefings in bioinformatics 2017.
      9. Kurtzer GM, Sochat V, Bauer MW: Singularity: Scientific containers for mobility of compute. PLoS One 2017, 12(5):e0177459.

    Short Bio

    Folker Meyer is a computational biologist at Argonne National Laboratory and a Professor of Bioinformatics at the University of Chicago. He is also the Deputy Division Director for the Biology Division at Argonne National Laboratory. Folker was trained as a computer scientist, and with that came his interest in building software systems to answer complex biological questions. He is the driving force behind the MG-RAST project.

    Wladek Minor
    (University of Virginia) [introductory/advanced]
    Big Data in Biomedical Sciences



    • Big Data and Big Data in Biomedical Sciences
    • Why big data is perceived as a big problem - technological considerations
    • Data reduction - should we preserve unreduced (raw) data?
    • Databases and databanks
    • Data mining with the use of raw data, databanks and databases
    • Data Integration
    • Automatic and semi-automatic curation of large amounts of data
    • Conversion of databanks into databases
    • Database priorities – content and design
    • Interaction between databases
    • Modern data management in biomedical sciences – necessity or luxury
    • Automatic data harvesting – close reality or still on the horizon
    • Reproducibility of the biomedical experiments - drug discovery considerations
    • Artificial Intelligence and machine learning in drug discovery
    • Big data in medicine - new possibilities
    • Future considerations



    Short Bio

    Harrison Distinguished Professor of Molecular Physiology and Biological Physics, University of Virginia. Development of methods for structural biology, in particular macromolecular structure determination by protein crystallography. Data management in structural biology, data mining as applied to drug discovery, bioinformatics. Member of Center of Structural Genomics of Infectious Diseases. Former Member of Midwest Center for Structural Genomics, New York Center for Structural Genomics and Enzyme Function Initiative.

    Soumya Mohanty
    (University of Texas Rio Grande Valley) [introductory/intermediate]
    Swarm Intelligence Methods for Statistical Regression


    Big Data applications generically require the use of flexible, hence high-dimensional, statistical models to capture meaningful patterns in the data, and this usually leads to challenging non-linear and non-convex global optimization problems. The large data volume that must be handled further increases their difficulty. This course will introduce methods from the field of computational swarm intelligence (SI), with the focus being on an in-depth presentation of Particle Swarm Optimization (PSO), that have proven useful in solving such optimization problems. PSO is a metaheuristic inspired by observations of cooperative behavior in multi-agent biological systems. It shows a remarkable robustness across a wide range of optimization problems, reducing the burden that is generally involved in tuning stochastic global optimization algorithms. The course will use concrete problems to illustrate the application of PSO to statistical regression and address practical issues that are often encountered in creating a successful implementation.


    The following is a list of the main topics to be covered in the course.

      • Overview:
          • The role of optimization in statistical analysis
          • Fundamental results in optimization theory
              • Survey of SI algorithms
          • Particle Swarm Optimization
          • Performance benchmarking
      • Tuning considerations
      • Applications of PSO to:
          • Parametric regression
          • Non-parametric regression
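
    To make the topics above concrete, here is a minimal sketch of a global-best PSO applied to a small parametric regression problem. It is an illustration under stated assumptions, not the course's implementation: the function name `pso`, the inertia and acceleration constants (commonly quoted values), the bound-clipping rule, and the noise-free exponential-decay data are all choices made for this example.

```python
import math
import random


def pso(fitness, bounds, n_particles=30, n_iter=200, w=0.7298, c1=1.4962, c2=1.4962, seed=0):
    """Minimize `fitness` over a box given by `bounds` with global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia term plus attraction to personal best and global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)  # clip to the box
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


# Parametric regression example: recover (a, b) in y = a * exp(-b * x)
# from noise-free synthetic data by minimizing the sum of squared residuals.
xs = [0.1 * i for i in range(51)]
ys = [2.0 * math.exp(-0.5 * x) for x in xs]


def sum_squared_residuals(params):
    a, b = params
    return sum((a * math.exp(-b * x) - y) ** 2 for x, y in zip(xs, ys))


best_params, best_sse = pso(sum_squared_residuals, bounds=[(0.0, 5.0), (0.0, 5.0)])
```

    The only problem-specific piece is the fitness function, so swapping in a different regression model changes `sum_squared_residuals` and `bounds` but not the optimizer.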


    None, as this is an introductory course. However, familiarity with basic probability theory and statistics will be a plus.


      • The following textbooks (and references therein):
        • "Swarm intelligence methods for statistical regression", Soumya D. Mohanty, To be published (CRC press, Fall 2018).
        • "Fundamentals of Computational Swarm Intelligence", A. P. Engelbrecht, Wiley.
        • "Particle Swarm Optimization", M. Clerc, ISTE.
      • Additional material to be provided during the course.

    Short bio

    Soumya D. Mohanty, Professor of Physics at UTRGV, completed his PhD degree in 1997 at the Inter-University Center for Astronomy and Astrophysics, India. He subsequently held post-doctoral positions at Northwestern University, Penn State, and the Max-Planck Institute for Gravitational Physics. He was also a visiting scholar with the LIGO project at Caltech. Mohanty's research has focused on solving some of the important data analysis challenges faced in the realization of Gravitational Wave (GW) astronomy across all observational frequency bands. These include semi-parametric regression of very weak signals in noisy data, high-dimensional non-linear parametric regression, time series classification, and analysis of data from large heterogeneous sensor arrays. His work has been funded by grants from the Research Corporation, the U.S. National Science Foundation, and NASA.

    Sankar K. Pal
    (Indian Statistical Institute) [introductory/advanced]
    Machine Intelligence and Soft Granular Mining: Features, Applications and Challenges


    The lecture has two parts. Beginning with the role of pattern recognition in data mining and machine intelligence, it describes the various components of granular computing (GrC), information granules, the significance of fuzzy sets and rough sets in GrC, and its relevance in mining large data sets. Uncertainty modelling in a fuzzy-rough framework, including generalised entropy measures, concerning data analytics is emphasized.

    The second part deals with some mining applications such as video tracking in ambiguous situations, selection of genes and miRNAs in bioinformatics, and community detection, target set selection and link prediction in social networks. All these data possess Big Data characteristics. Roles of different kinds of granules, f-information measures and rough lower approximation in mining are demonstrated. Significance of lower approximation for knowledge encoding in designing granular neural networks, estimating the object model in unsupervised tracking, and determining the probability of definite and doubtful regions in cancer classification is illustrated. Finally, the concept of perception granules in natural language understanding and the use of z-numbers in abstracting various semantic-precisiations are explained. New terms like generalized rough entropy, granular flow graph, rough filter, intuitionistic entropy, granular social network model, fuzzy-rough community, double bounded rough sets, and z*-numbers are defined.

    Several examples are provided to explain the aforesaid concepts. The talk concludes by mentioning the challenging issues and the future directions of research, including the significance in computational theory of perception, natural computing and granulated deep learning.
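
    The rough lower approximation mentioned above can be shown in a few lines for the classical (crisp) case. This is a sketch under assumptions: the function names and the toy data are hypothetical, and it uses plain equivalence-class granules rather than the fuzzy-rough or generalized variants the lecture covers.

```python
from collections import defaultdict


def granules(objects, key):
    """Partition objects into information granules: sets of objects that are
    indiscernible because key(obj) gives the same value."""
    classes = defaultdict(set)
    for obj in objects:
        classes[key(obj)].add(obj)
    return list(classes.values())


def approximations(classes, target):
    """Return the (lower, upper) approximation of `target` for the given granules.

    lower: union of granules entirely inside the target (the definite region)
    upper: union of granules that overlap the target (definite plus doubtful)
    """
    lower, upper = set(), set()
    for g in classes:
        if g <= target:
            lower |= g
        if g & target:
            upper |= g
    return lower, upper


# Toy data: six objects described by one attribute value each.
attribute = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c", 6: "c"}
target = {1, 2, 3}

classes = granules(attribute.keys(), key=lambda obj: attribute[obj])
lower, upper = approximations(classes, target)
boundary = upper - lower               # the doubtful region
accuracy = len(lower) / len(upper)     # rough accuracy of the approximation
```

    Here objects 3 and 4 share an attribute value but only 3 is in the target, so the granule {3, 4} falls in the boundary rather than in the lower approximation.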


    Knowledge of pattern recognition, fuzzy sets, rough sets, neural networks, data mining, probability theory

    Short Bio

    Sankar K. Pal received PhD degrees from Calcutta University and Imperial College, London. He joined the Indian Statistical Institute, Calcutta in 1975 as a CSIR-SRF where he became a full professor in 1987, a distinguished scientist in 1998, and the Director in 2005. He is currently an INSA Distinguished Professor Chair. He founded the Machine Intelligence Unit and the Center for Soft Computing Research at his Institute. He worked at Imperial College, London; UC Berkeley and UMD, College Park; the NASA JSC, Houston, Texas; and the US Naval Research Lab, Washington DC; has served as an IEEE CS Distinguished Visitor since 1987; and held several visiting positions in Italy, Poland, Hong Kong, and Australia. A Fellow of IEEE, TWAS, IAPR, IFSA, and all four National Academies for Science/Engg. in India, he is a coauthor of 20 books and more than 400 research publications in the areas of pattern recognition, machine learning, image/video processing, data mining, web intelligence, soft computing, bioinformatics, and cognitive machines. He is/was on the editorial boards of 20 journals including some IEEE Transactions. Coveted national/international awards received include: S.S. Bhatnagar Prize (India), Padma Shri (India), Khwarizmi International Award (Iran), and NASA Tech Brief Award (USA). He has visited 45 countries as a keynote/invited speaker.

    Lior Rokach
    (Ben-Gurion University of the Negev) [introductory/advanced]
    Ensemble Learning


    Ensemble learning imitates our second nature to seek several opinions before making a crucial decision. The core principle is to weigh several individual models and combine them in order to reach a decision or prediction that is better than the one obtained by each of them separately. Researchers from various disciplines have explored the use of ensemble methods since the late seventies.

    This short course aims to provide a methodical and coherent presentation of classical ensemble methods as well as extensions and novel approaches that were recently introduced. Along with algorithmic descriptions of each method, we will provide a description of the settings in which this method is applicable and the trade-offs incurred by using the method.
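
    The core principle of combining several individual models can be sketched with bagging and majority voting. The example below is illustrative only: the decision-stump base learner, the function names, and the toy one-feature data are assumptions, and it stands in for the richer methods (random forests, boosting, and so on) rather than reproducing any of them.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a one-feature threshold classifier with labels 0/1.

    The returned model predicts `hi` when x >= t and `lo` otherwise, using the
    threshold and label orientation with the fewest training errors.
    """
    best = None
    for t in sorted(set(xs)):
        for lo, hi in ((0, 1), (1, 0)):
            errors = sum((hi if x >= t else lo) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, lo, hi)
    _, t, lo, hi = best
    return lambda x: hi if x >= t else lo


def bagging(xs, ys, n_models=25, seed=0):
    """Train each base model on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy data: the label is 1 exactly when the feature is at least 10.
xs = list(range(20))
ys = [int(x >= 10) for x in xs]
models = bagging(xs, ys)
```

    An odd number of base models avoids ties in a two-class vote; weighted voting or stacking would replace `predict` while leaving the training loop unchanged.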


    Decision tree learning, Introduction to Ensemble Learning, Random forest, Gradient boosting, Fusion methods, Error-correcting output codes, Ensemble diversity, Ensemble pruning, Bias-variance tradeoff, Combining deep learning with ensemble learning


    Familiarity with probability theory, statistics, and algorithms will be assumed, at the level typically taught at the bachelor level in computer science or engineering programs. Basic understanding of machine learning and data mining would be helpful.


    1. Chen, Tianqi, and Carlos Guestrin. "Xgboost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, 2016.
    2. Kuncheva, Ludmila I. Combining pattern classifiers: methods and algorithms. John Wiley & Sons, 2004.
    3. Rokach, Lior, and Oded Z. Maimon. Data mining with decision trees: theory and applications. Second Edition. World Scientific, 2014.
    4. Rokach, Lior. Pattern classification using ensemble methods. Vol. 75. World Scientific, 2010.
    5. Rokach, Lior. Decision forest: Twenty years of research. Information Fusion 27, 111-125.
    6. Zhou, Zhi-Hua. Ensemble methods: foundations and algorithms. Chapman and Hall/CRC, 2012.


    Lior Rokach is a professor of data science at the Ben-Gurion University of the Negev, where he currently serves as the chair of the Department of Software and Information System Engineering. His research interests lie in the areas of Machine Learning, Big Data, Deep Learning and Data Mining and their applications. Prof. Rokach is the author of over 300 peer-reviewed papers in leading journals and conference proceedings. He has authored several popular books in data science, including Data Mining with Decision Trees (1st edition, World Scientific Publishing, 2007; 2nd edition, World Scientific Publishing, 2015). He is also the editor of "The Data Mining and Knowledge Discovery Handbook" (1st edition, Springer, 2005; 2nd edition, 2010) and "Recommender Systems Handbook" (1st edition, Springer, 2011; 2nd edition, 2015). He currently serves as an editorial board member of ACM Transactions on Intelligent Systems and Technology (ACM TIST) and an area editor for Information Fusion (Elsevier).

    Michael Rosenblum
    (University of Potsdam) [introductory/intermediate]
    Synchronization Approach to Time Series Analysis


    The course presents data analysis techniques based on the theory of coupled oscillators and aimed at reconstruction of the network structure and inference of node properties from observations. The approach assumes that the multivariate time series under study are outputs of weakly coupled self-sustained oscillators and that the signals are appropriate for phase estimation. Therefore, the course will begin with a short introduction to the theory of interacting oscillators and their synchronization. We will discuss effects of phase and frequency locking and illustrate them with numerous examples. Next, we will discuss how these ideas can be used in data analysis. The main idea is to infer a model for phase dynamics of the observed network. Hence, the first step is to estimate phases from time series, and this step will be discussed in detail. Then we will proceed with an analysis of the phase model with the goal of obtaining the strength of directed links and inferring the network structure. Thus, our technique represents an approach to the connectivity problem, relevant for physiology, neuroscience, and other fields. We demonstrate that our technique provides effective phase connectivity which is close, though not identical, to the structural one. However, for weak coupling we achieve a good separation between existing and non-existing connections. We also discuss how the frequencies and phase response curves of interacting units can be estimated. Next, we extend the approach to cover the case of pulse-coupled neuron-like units, where only times of spikes can be registered, so that the data represent point processes. We will also demonstrate how the inferred phase model can predict synchronisation domains in experiments.


    1. Self-sustained oscillator. Forced and coupled oscillators, phase and frequency locking, synchronization domains. Model of phase dynamics.
    2. Phase and its estimation from data. Hilbert Transform. Phase vs angle variable.
    3. Strength of interaction. Synchronization indices (phase locking values).
    4. Direction of interaction. Reconstruction and analysis of phase model for two interacting units.
    5. Networks, structural and effective connectivity. Triplet analysis, true and spurious connections.
    6. An example: cardio-respiratory interaction.
    7. Case of pulse coupled oscillators.
    8. Properties of nodes: natural frequencies, phase response curves.
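
    Items 2 and 3 of the outline above (phase estimation via the Hilbert transform and the phase locking value) can be sketched as follows. This is an assumption-laden illustration rather than the course's procedure: the function names are hypothetical, the analytic signal is built with a naive O(n²) discrete Fourier transform to stay dependency-free, and the test signals are clean cosines with an integer number of cycles, which real data will not be.

```python
import cmath
import math


def analytic_signal(x):
    """Return the analytic signal of a real sequence x.

    Its imaginary part is the discrete Hilbert transform of x. Negative
    frequencies are zeroed and positive ones doubled in the DFT domain.
    """
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    half = (n + 1) // 2
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            weight = 1.0   # keep DC and (for even n) the Nyquist bin as they are
        elif k < half:
            weight = 2.0   # positive frequencies
        else:
            weight = 0.0   # negative frequencies
        spectrum[k] *= weight
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def instantaneous_phase(x):
    """Phase estimate: the argument of the analytic signal at each sample."""
    return [cmath.phase(z) for z in analytic_signal(x)]


def phase_locking_value(phi1, phi2):
    """Synchronization index |<exp(i (phi1 - phi2))>|, between 0 and 1."""
    mean = sum(cmath.exp(1j * (a - b)) for a, b in zip(phi1, phi2)) / len(phi1)
    return abs(mean)


# Hypothetical test signals: same frequency with a phase lag, and a different frequency.
n = 64
x = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]
y = [math.cos(2 * math.pi * 4 * t / n - 0.7) for t in range(n)]
z = [math.cos(2 * math.pi * 7 * t / n) for t in range(n)]

plv_locked = phase_locking_value(instantaneous_phase(x), instantaneous_phase(y))
plv_unlocked = phase_locking_value(instantaneous_phase(x), instantaneous_phase(z))
```

    A value near 1 indicates a nearly constant phase difference, while a value near 0 indicates a phase difference that drifts through all angles. Note that this index measures the strength of interaction only; the direction of interaction requires reconstructing the phase model, which this sketch does not attempt.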


    • A. Pikovsky, M. Rosenblum, and J. Kurths, Synchronization. A Universal Concept in Nonlinear Sciences, Cambridge University Press, 2001.


    Basic knowledge of calculus and differential equations.

    Short Bio

    MICHAEL ROSENBLUM has been a research scientist and Professor in the Department of Physics and Astronomy, University of Potsdam, Germany, since 1997.

    His main research areas are nonlinear dynamics, synchronization theory, and time series analysis, with application to biological systems. The most important results include description of phase synchronization of chaotic systems, analysis of complex collective dynamics in large networks of interacting oscillators, development of feedback techniques for control of collective synchrony in neuronal networks (as a model of deep brain stimulation of Parkinsonian patients), methods for reconstruction of oscillatory networks from observations, application of these methods to analysis of cardio-respiratory interaction in humans.

    Michael Rosenblum studied physics at Moscow Pedagogical University, and went on to work in the Mechanical Engineering Research Institute of the USSR Academy of Sciences, where he was awarded a PhD in physics and mathematics. He was a Humboldt fellow in the Max-Planck research group on nonlinear dynamics, and a visiting scientist at Boston University. He is a co-author (with A. Pikovsky and J. Kurths) of the book "Synchronization. A Universal Concept in Nonlinear Sciences", Cambridge University Press, 2001 and has published over 100 peer-reviewed publications.

    Michael Rosenblum served as a member of the Editorial Board of Physical Review E from 2008 to 2013. Since 2014 he has been an Editor of Chaos: Int. J. of Nonlinear Science. He was named an American Physical Society Outstanding Referee for 2015.

    Hanan Samet
    (University of Maryland) [introductory/intermediate]
    Sorting in Space: Multidimensional, Spatial, and Metric Data Structures for Applications in Spatial and Spatio-textual Databases, Geographic Information Systems (GIS), and Location-based Services


    The representation of multidimensional, spatial, and metric data is an important issue in applications of spatial and spatiotextual databases, geographic information systems (GIS), and location-based services. Recently, there has been much interest in hierarchical data structures such as quadtrees, octrees, and pyramids which are based on image hierarchies, as well as methods that make use of bounding boxes which are based on object hierarchies. Their key advantage is that they provide a way to index into space. In fact, they are little more than multidimensional sorts. They are compact and, depending on the nature of the spatial data, they save space as well as time and also facilitate operations such as search.
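
    One way to see why such structures amount to multidimensional sorts is the Morton (Z-order) code, which interleaves coordinate bits so that a one-dimensional sort respects a quadtree-style decomposition. The sketch below is an illustration chosen for this passage, not necessarily the course's own presentation; the function names and the 16-bit coordinate limit are assumptions.

```python
def morton_code(x, y, bits=16):
    """Interleave the low `bits` bits of non-negative integers x and y.

    Bit i of x goes to position 2*i and bit i of y to position 2*i + 1.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def sort_in_space(points, bits=16):
    """Order 2-D integer points along the Z-order curve."""
    return sorted(points, key=lambda p: morton_code(p[0], p[1], bits))


def block_id(x, y, level, bits=16):
    """Identify the aligned 2^level by 2^level block containing (x, y).

    Dropping the 2*level lowest bits of the Morton code leaves the code of
    the enclosing block, so all cells in a block share one contiguous range.
    """
    return morton_code(x, y, bits) >> (2 * level)


# The four cells of the 2x2 block with corner (2, 2) get consecutive codes.
codes = sorted(morton_code(x, y) for x in (2, 3) for y in (2, 3))
ordered = sort_in_space([(1, 1), (0, 0), (1, 0), (0, 1)])
```

    Because the cells of each aligned block occupy one contiguous code range, a sorted list of codes can be stored in an ordinary one-dimensional index such as a B-tree, with range scans standing in for block lookups.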

    We describe hierarchical representations of points, lines, collections of small rectangles, regions, surfaces, and volumes. For region data, we point out the dimension-reduction property of the region quadtree and octree. We also demonstrate how to use them for both raster and vector data. For metric data that does not lie in a vector space so that indexing is based simply on the distance between objects, we review various representations such as the vp-tree, gh-tree, and mb-tree. In particular, we demonstrate the close relationship between these representations and those designed for a vector space.
    For all of the representations, we show how they can be used to compute nearest objects in an incremental fashion so that the number of objects need not be known in advance. The VASCO JAVA applet that illustrates these methods is presented. They are also used in applications such as the SAND Internet Browser.
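
    The idea of reporting nearest objects incrementally can be sketched with a single priority queue that holds both index cells (keyed by their minimum possible distance to the query) and individual points (keyed by their actual distance). The code below is a simplified illustration, assuming a flat list of bounding boxes instead of a full quadtree or R-tree; the names `box_min_dist`, `incremental_nearest`, and the sample data are hypothetical.

```python
import heapq
import math
from itertools import count


def box_min_dist(q, box):
    """Smallest distance from point q to the box (xmin, ymin, xmax, ymax)."""
    dx = max(box[0] - q[0], 0.0, q[0] - box[2])
    dy = max(box[1] - q[1], 0.0, q[1] - box[3])
    return math.hypot(dx, dy)


def incremental_nearest(q, cells):
    """Lazily yield (distance, point) pairs in order of increasing distance.

    `cells` is a list of (box, points) pairs, a one-level stand-in for a
    hierarchical spatial index.
    """
    tie = count()  # unique tie-breaker so payloads are never compared
    heap = [(box_min_dist(q, box), next(tie), False, pts) for box, pts in cells]
    heapq.heapify(heap)
    while heap:
        dist, _, is_point, item = heapq.heappop(heap)
        if is_point:
            # No unopened cell can contain anything closer than this point.
            yield dist, item
        else:
            for p in item:
                heapq.heappush(heap, (math.dist(q, p), next(tie), True, p))


cells = [
    ((0, 0, 1, 1), [(0.2, 0.2), (0.9, 0.9)]),
    ((2, 0, 3, 1), [(2.1, 0.5)]),
    ((0, 2, 1, 3), [(0.5, 2.5)]),
]
q = (1.0, 1.0)
nearest = incremental_nearest(q, cells)
first = next(nearest)  # only as much of the index is opened as is needed
```

    Because the function is a generator, the caller can stop after any number of neighbours without fixing that number in advance.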

    The above has been in the context of the traditional geometric representation of spatial data, while in the final part we review the more recent textual representation which is used in location-based services where the key issue is that of resolving ambiguities. For example, does ``London'' correspond to the name of a person or a location, and if it corresponds to a location, which of the over 700 different instances of ``London'' is it? The NewsStand and TwitterStand systems are examples. See also the cover article of the October 2014 issue of Communications of the ACM and the accompanying video.


    1. Introduction a. Sample queries b. Spatial Indexing c. Sorting approach d. Minimum bounding rectangles (e.g., R-tree) e. Disjoint cells (e.g., R+-tree, k-d-B-tree) f. Uniform grid g. Location-based queries vs. feature-based queries h. Region quadtree i. Dimension reduction j. Pyramid k. Region quadtrees vs. pyramids l. Space ordering methods

    2. Points a. point quadtree b. MX quadtree c. PR quadtree d. k-d tree e. Bintree f. BSP tree

    3. Lines a. Strip tree b. PM1 quadtree c. PM2 quadtree d. PM3 quadtree e. PMR quadtree

    4. Rectangles and arbitrary objects a. MX-CIF quadtree b. Loose quadtree c. Partition fieldtree d. R-tree

    5. Surfaces and Volumes a. Restricted quadtree b. Region octree c. PM octree

    6. Metric Data a. vp-tree b. gh-tree c. mb-tree

    7. Operations a. Incremental nearest object location b. Boolean set operations

    8. Spatial Database Issues a. General issues b. Specific issues

    9. Indexing for spatiotextual databases and location-based services delivered on platforms such as smart phones and tablets a. Incorporation of spatial synonyms in search engines b. Toponym recognition c. Toponym resolution d. Spatial reader scope e. Incorporation of spatiotemporal data f. System integration issues g. Demos of live systems on smart phones

    10. Example systems a. SAND internet browser b. JAVA spatial data applets c. STEWARD d. NewsStand e. TwitterStand


    1. H. Samet. ``Foundations of Multidimensional and Metric Data Structures.'' Morgan-Kaufmann, San Francisco, 2006.

    2. H. Samet. ``A sorting approach to indexing spatial data.'' International Journal of Shape Modeling, 14(1):15--37, June 2008.

    3. G. R. Hjaltason and H. Samet. ``Index-driven similarity search in metric spaces.'' ACM Transactions on Database Systems, 28(4):517--580, December 2003.

    4. G. R. Hjaltason and H. Samet. ``Distance browsing in spatial databases.'' ACM Transactions on Database Systems, 24(2):265--318, June 1999. Also Computer Science TR-3919, University of Maryland, College Park, MD.

    5. G. R. Hjaltason and H. Samet. ``Ranking in spatial databases.'' In Advances in Spatial Databases --- 4th International Symposium, SSD'95, M. J. Egenhofer and J. R. Herring, eds., Portland, ME, August 1995, 83--95. Also Springer-Verlag Lecture Notes in Computer Science

    6. H. Samet. ``Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS.'' Addison-Wesley, Reading, MA, 1990.

    7. H. Samet. ``The Design and Analysis of Spatial Data Structures.'' Addison-Wesley, Reading, MA, 1990.

    8. C. Esperanca and H. Samet. ``Experience with SAND/Tcl: a scripting tool for spatial databases.'' Journal of Visual Languages and Computing, 13(2):229--255, April 2002.

    9. H. Samet, H. Alborzi, F. Brabec, C. Esperanca, G. R. Hjaltason, F. Morgan, and E. Tanin. ``Use of the SAND spatial browser for digital government applications.'' Communications of the ACM, 46(1):63--66, January 2003.

    10. B. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. ``NewsStand: A new view on news.'' Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Irvine, CA, November 2008, 144--153. SIGSPATIAL 10-Year Impact Award.

    11. H. Samet, J. Sankaranarayanan, M. D. Lieberman, M. D. Adelfio, B. C. Fruin, J. M. Lotkowski, D. Panozzo, J. Sperling, and B. E. Teitler. ``Reading news with maps by exploiting spatial synonyms.'' Communications of the ACM, 57(10):64--77, October 2014.

    12. J. Sankaranarayanan, H. Samet, B. Teitler, M. D. Lieberman, and J. Sperling. ``TwitterStand: News in tweets.'' Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Seattle, WA, November 2009, 42--51.

    13. M. D. Lieberman, H. Samet, and J. Sankaranarayanan. ``Geotagging with local lexicons to build indexes for textually-specified spatial data.'' Proceedings of the 26th IEEE International Conference on Data Engineering, Long Beach, CA, March 2010, 201--212.

    14. M. D. Lieberman and H. Samet. ``Multifaceted Toponym Recognition for Streaming News.'' Proceedings of the ACM SIGIR Conference. Beijing, July 2011, 843--852.

    15. M. D. Lieberman and H. Samet. ``Adaptive Context Features for Toponym Resolution in Streaming News.'' Proceedings of the ACM SIGIR Conference. Portland, OR, August 2012, 731--740.

    16. M. D. Lieberman and H. Samet. ``Supporting Rapid Processing and Interactive Map-Based Exploration of Streaming News.'' Proceedings of the 20th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. Redondo Beach, CA, November 2012, 179--188.

    17. H. Samet, B. C. Fruin, and S. Nutanong. ``Duking it out at the smartphone mobile app mapping API corral: Apple, Google, and the competition.'' In Proceedings of the 1st ACM SIGSPATIAL International Workshop on Mobile Geographic Information Systems (MobiGIS 2012), Redondo Beach, CA, November 2012.

    18. H. Samet, S. Nutanong, and B. C. Fruin. ``Dynamic presentation consistency issues in smartphone mapping apps.'' Communications of the ACM, 59(9):58--67, September 2016.

    19. H. Samet, S. Nutanong, and B. C. Fruin. ``Static presentation consistency issues in smartphone mapping apps.'' Communications of the ACM, 59(5):88--98, May 2016.

    20. Spatial Data Structure applets.


    Practitioners working in the areas of big spatial data and spatial data science that involve spatial databases, geographic information systems, and location-based services will be given a different perspective on data structures found to be useful in most applications. Familiarity with computer terminology and some programming experience is needed to follow this course.

    Short Bio

    Hanan Samet is a Distinguished University Professor of Computer Science at the University of Maryland, College Park and is a member of the Institute for Computer Studies. He is also a member of the Computer Vision Laboratory at the Center for Automation Research where he leads a number of research projects on the use of hierarchical data structures for database applications, geographic information systems, computer graphics, computer vision, image processing, games, robotics, and search. He received the B.S. degree in engineering from UCLA, and the M.S. degree in operations research and the M.S. and Ph.D. degrees in computer science from Stanford University. His doctoral dissertation dealt with proving the correctness of translations of LISP programs, which was the first work in translation validation and the related concept of proof-carrying code. He is the author of the recent book ``Foundations of Multidimensional and Metric Data Structures'' published by Morgan-Kaufmann, an imprint of Elsevier, in 2006, an award winner in the 2006 best book in Computer and Information Science competition of the Professional and Scholarly Publishers (PSP) Group of the Association of American Publishers (AAP), and of the first two books on spatial data structures, ``Design and Analysis of Spatial Data Structures'' and ``Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS,'' both published by Addison-Wesley in 1990. He is the Founding Editor-In-Chief of the ACM Transactions on Spatial Algorithms and Systems (TSAS), the founding chair of ACM SIGSPATIAL, a recipient of a Science Foundation of Ireland (SFI) Walton Visitor Award at the Centre for Geocomputation at the National University of Ireland at Maynooth (NUIM), the 2009 UCGIS Research Award, the 2010 CMPS Board of Visitors Award at the University of Maryland, the 2011 ACM Paris Kanellakis Theory and Practice Award, and the 2014 IEEE Computer Society Wallace McDowell Award, and a Fellow of the ACM, IEEE, AAAS, IAPR (International Association for Pattern Recognition), and UCGIS (University Consortium for Geographic Information Science). He received best paper awards in the 2007 Computers & Graphics Journal, the 2008 ACM SIGMOD and SIGSPATIAL ACMGIS Conferences, the 2012 SIGSPATIAL MobiGIS Workshop, and the 2013 SIGSPATIAL GIR Workshop, as well as a best demo award at the 2011 SIGSPATIAL ACMGIS'11 Conference. The 2008 ACM SIGSPATIAL ACMGIS best paper award winner also received the SIGSPATIAL 10-Year Impact Award. His paper at the 2009 IEEE International Conference on Data Engineering (ICDE) was selected as one of the best papers for publication in the IEEE Transactions on Knowledge and Data Engineering. He was elected to the ACM Council as the Capitol Region Representative for the term 1989-1991, and is an ACM Distinguished Speaker.

    Rory Smith
    (Monash University) [intermediate/advanced]
    Bayesian Inference: 18th century insight into 21st century data science


    What is the statistically optimal way to detect and extract information from signals in noisy data? After detecting ensembles of signals, what can we learn about the population of all the signals? This course will address these questions using the language of Bayesian inference. After reviewing the basics of Bayes theorem, we will frame the problem of signal detection in terms of hypothesis testing and model selection. Extracting information from signals will be cast in terms of computing posterior density functions of signal parameters. After reviewing model selection and parameter estimation, the course will focus on practical methods. Specifically, we will implement sampling algorithms which we will use to perform model selection and parameter estimation on signals in synthetic data sets. Finally, we will ask what can be learned about the population properties of an ensemble of signals. This population-level inference will be studied as a hierarchical inference problem.


    • The basics of Bayesian inference
    • Parameter estimation, hypothesis testing and model selection
    • Sampling methods: MCMC and Nested Sampling
    • Illustrative examples: Detecting signals in noise using Bayesian inference
    • Illustrative examples: Performing parameter estimation to learn about a signal's properties
    • Hierarchical inference: Using "Hyper-parameter" estimation to learn about populations of signals
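
    As a flavour of the sampling methods listed above, the following is a minimal Metropolis (MCMC) sketch for parameter estimation. It assumes a toy problem that is not from the course material: a constant signal `mu` buried in unit-variance Gaussian noise, with a flat prior. All function names, the prior width, and the step size are illustrative choices.

```python
import math
import random


def log_posterior(mu, data, sigma=1.0, prior_width=10.0):
    """Log posterior (up to a constant) for a constant signal in Gaussian noise."""
    if abs(mu) > prior_width:  # flat prior on [-prior_width, prior_width]
        return -math.inf
    return -0.5 * sum((d - mu) ** 2 for d in data) / sigma ** 2


def metropolis(data, n_steps=5000, step=0.5, seed=0):
    """Random-walk Metropolis sampler; returns the chain of mu values."""
    rng = random.Random(seed)
    mu = 0.0
    logp = log_posterior(mu, data)
    chain = []
    for _ in range(n_steps):
        proposal = mu + rng.gauss(0.0, step)
        logp_new = log_posterior(proposal, data)
        # Accept with probability min(1, p_new / p_old).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            mu, logp = proposal, logp_new
        chain.append(mu)
    return chain


# Synthetic data: true signal value 2.0 plus unit-variance Gaussian noise.
noise = random.Random(1)
data = [2.0 + noise.gauss(0.0, 1.0) for _ in range(50)]
chain = metropolis(data)
posterior_samples = chain[1000:]  # discard burn-in
estimate = sum(posterior_samples) / len(posterior_samples)
```

    With a flat prior and Gaussian noise, the posterior mean should lie close to the sample mean of the data, which gives a simple sanity check on the sampler.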


    • Probability Theory: The Logic of Science, E.T. Jaynes, Cambridge University Press.
    • Nested sampling for general Bayesian computation, J. Skilling, Bayesian Analysis (2006).
    • For an overview of Markov Chain Monte Carlo (MCMC), see e.g. Bayesian Data Analysis, Andrew Gelman, Chapman & Hall.
    • For a practical implementation of MCMC, see e.g. emcee, D. Foreman-Mackey.


    Basic probability theory, sampling, Python and JupyterHub.

    Short Bio

    Dr. Rory Smith is a lecturer in physics at Monash University in Melbourne, Australia. In 2013-2017, he was a senior postdoctoral fellow at the California Institute of Technology, where he worked on searches for gravitational waves. Dr. Smith participated in the landmark first detection of gravitational waves for which the 2017 Nobel prize in physics was awarded. Dr. Smith's research focuses on detecting astrophysical gravitational-wave signals from black holes and neutron stars, and extracting the rich astrophysical information encoded within to study the fundamental nature of spacetime.

    Jaideep Srivastava
    (University of Minnesota) [intermediate]
    Social Computing - Concepts and Applications


    Social Computing is an emerging discipline, and just like any discipline at a nascent stage it can often mean different things to different people. However, there are three distinct threads that are emerging. The first thread is often called Socio-Technical Systems, which focuses on building systems that allow large scale interactions of people, whether for a specific purpose or in general. Examples include social networks like Facebook and Google Plus, and Multi Player Online Games like World of Warcraft and Farmville. The second thread is often called Computational Social Science, whose goal is to use computing as an integral tool to push the research boundaries of various social and behavioral science disciplines, primarily Sociology, Economics, and Psychology. The third is the idea of solving problems of societal relevance using a combination of computing and humans. The three modules of this course are structured according to this description. The goal of this course is to discuss, in a tutorial manner, through case studies, and through discussion, what Social Computing is, where it is headed, and where it is taking us.


    • Module 1: Socio-technical systems
      • Introduction to Social Computing
      • Socio-technical systems
        • Examples of a number of social computing systems, e.g. Twitter, Facebook, MMO games, etc.
      • Applying data mining to social computing systems
    • Module 2: Computational Social Science
      • Online trust
      • Social influence
      • Individual and group/team performance
      • Identifying and preventing bad behavior
    • Module 3: Solving Problems of Societal Relevance
      • Social computing for humanitarian assistance
      • Wrap-up discussion
        • Privacy and ethics
        • Where are we headed


    To be provided later.


    This course is intended primarily for graduate students. Following are the potential audiences:

    • Computer Science graduate students: All that is needed for this audience is interest in one of the themes of social computing
    • Social Science graduate students: Some exposure to building models from data, at least what these techniques are and what they can do
    • Management graduate students: Those with MIS focus

    Short Bio

    Jaideep Srivastava is Professor of Computer Science at the University of Minnesota, where he directs a laboratory focusing on research in Web Mining, Social Analytics, and Health Analytics. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), and has been an IEEE Distinguished Visitor and a Distinguished Fellow of Allina’s Center for Healthcare Innovation. He has been awarded the Distinguished Research Contributions Award of the PAKDD, for his lifetime contributions to the field of machine learning and data mining. Dr. Srivastava has significant experience in the industry, in both consulting and executive roles. Most recently he was the Chief Scientist for Qatar Computing Research Institute (QCRI), which is part of Qatar Foundation. Earlier, he worked as a data mining architect, built a data analytics department at Yodlee, and served as the Chief Technology Officer for Persistent Systems. He has provided technology and strategy advice to Cargill, United Technologies, IBM, Honeywell, KPMG, 3M, TCS, and Eaton. Dr. Srivastava co-founded Ninja Metrics, based on his research in behavioral analytics. He was advisor and Chief Scientist for CogCubed, an innovative company with the goal to revolutionize the diagnosis and therapy of cognitive disorders through the use of online games, which was subsequently acquired by Teladoc, a public company. He has been a technology advisor to a number of startups at various stages, including Jornaya, a leader in cross-industry lead management, and Kipsu, which provides an innovative approach to improving service quality in the hospitality industry. Dr. Srivastava has held distinguished professorships at Heilongjiang University and Wuhan University, China. He has held advisory positions with the State of Minnesota, and the State of Maharashtra, India. He is a technology advisor to the Unique ID (UID) project of the Government of India, whose goal is to provide biometrics-based social security numbers to the 1.3 billion citizens of India. Dr. Srivastava has a Bachelor of Technology from the Indian Institute of Technology (IIT), Kanpur, India, and MS and PhD from the University of California, Berkeley.

    Mayte Suárez-Fariñas
    (Icahn School of Medicine at Mount Sinai) [intermediate]
    A Practical Guide to the Analysis of Longitudinal Data Using R


    Longitudinal data is obtained when a time-sequence of measurements is made on a response variable for each of a number of subjects in an experimental or observational study. In such cases, individuals usually display a high degree of similarity in responses over time and, thus, classical regression models are inadequate. This course is aimed at giving its attendees insights into the theoretical concepts of, and practical experience with, the models used for analysis of longitudinal data, particularly mixed-effect models. It will provide (1) an introduction to the theoretical foundations of mixed models, (2) a guide to build, examine, interpret and compare mixed-effect models as well as to conduct hypothesis testing and (3) the necessary resources to conduct a wide range of longitudinal analyses. All practical exercises will be conducted in R. Participants are encouraged to bring datasets to the course and apply the principles to their specific areas of research.


    1. Introduction to linear Mixed-Effect Models.
    2. Random-intercept and random-intercept-and-slope models
    3. Mixed-Effects Models in R’s nlme package
    4. Build, examine, interpret, expand and compare mixed effects models
    5. Testing hypotheses in mixed-effect models through R’s emmeans package
    6. Non-linear Mixed-effect Models
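
    For orientation, a linear mixed-effect model with a random intercept and a random slope can be written as below. This is the standard textbook form rather than notation taken from the course; the symbols are assumptions: y_{ij} is the response of subject i at measurement occasion j, t_{ij} the corresponding time, \beta_0 and \beta_1 the fixed effects, b_{0i} and b_{1i} the subject-specific random effects with covariance matrix D, and \sigma^2 the residual variance.

```latex
y_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\, t_{ij} + \varepsilon_{ij},
\qquad
\begin{pmatrix} b_{0i} \\ b_{1i} \end{pmatrix} \sim \mathcal{N}(\mathbf{0}, D),
\qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

    Setting b_{1i} = 0 for all subjects recovers the random-intercept model, in which subjects differ only in their baseline level.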


    Students must be familiar with basic R functions to read and manipulate data and generate basic data plots. Familiarity with statistical concepts and a basic understanding of regression and ANOVA are expected. Installed versions of R and RStudio on a laptop are needed for completing the exercises.


    • Diggle, P. J., Heagerty, P., Liang, K-Y. and Zeger, S. L. (2002) Analysis of Longitudinal Data. Second Edition. Oxford: Oxford University Press.
    • Fitzmaurice, G.M., Laird, N.M., and Ware, J.H. (2004) Applied Longitudinal Analysis. New York: Wiley.

    Short Bio

    Mayte Suárez-Fariñas, PhD is currently an Associate Professor at the Center for Biostatistics and the Department of Genetics and Genomic Sciences of the Icahn School of Medicine at Mount Sinai, New York. She received a master's in mathematics from the University of Havana, Cuba and, in 2003, a Ph.D. degree in quantitative analysis from the Pontifical Catholic University of Rio de Janeiro, Brazil. Prior to joining Mount Sinai, she was co-director of Biostatistics at the Center for Clinical and Translational Science at the Rockefeller University, where she developed methodologies for data integration across omic studies, and a framework to evaluate drug response at the molecular level in proof of concept studies in inflammatory skin diseases using mixed-effect models and machine learning. Her long-term goals are to develop robust statistical techniques to mine and integrate complex high-throughput data, tailored to specific disease models, with an emphasis on immunological diseases, and to develop precision medicine algorithms to predict treatment response and phenotype.

    Jeffrey Ullman
    (Stanford University) [introductory]
    Big-data Algorithms That Aren't Machine Learning


    We shall study algorithms that have been found useful in querying large datasets. The emphasis is on algorithms that cannot be considered "machine learning."


    • Locality-sensitive hashing: shingling, minhashing, applications;
    • PageRank and related ideas: topic-specific PageRank, combatting link spam;
    • Stream-processing algorithms: counting occurrences, counting unique values, sampling;
    • Graph-processing algorithms: counting neighborhoods, counting triangles, transitive closure.
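
    The shingling and minhashing steps named in the first topic can be sketched as follows. This is a small illustration under stated assumptions, not the course's own code: documents are split into character 3-shingles, each shingle is mapped to an integer with CRC32, and random linear hash functions modulo a prime stand in for random permutations. The function names and the two sample sentences are made up for the example.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # Mersenne prime, larger than any 32-bit CRC value


def shingles(text, k=3):
    """Set of all length-k character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


def minhash_signature(items, n_hashes=256, seed=0):
    """Signature of per-hash minima; the seed must match across documents."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(n_hashes)]
    base = [zlib.crc32(s.encode("utf-8")) for s in items]
    return [min((a * x + b) % PRIME for x in base) for a, b in coeffs]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox jumped over a lazy dog")
exact = jaccard(doc_a, doc_b)
approx = estimate_similarity(minhash_signature(doc_a), minhash_signature(doc_b))
```

    The fraction of agreeing signature positions estimates the Jaccard similarity of the shingle sets, so short signatures can be compared instead of the full sets; locality-sensitive hashing then bands these signatures to find candidate pairs.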


    A course in algorithms at the advanced-undergraduate level is important. A course in database systems is helpful, but not required.


    We will be covering (parts of) Chapters 3, 4, 5, and 10 of the free text Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.

    Short Bio

    A brief on-line bio is available at

    Andrey Ustyuzhanin
    (National Research University Higher School of Economics) [intermediate/advanced]
    Challenge-driven Data Science: Cracking Domain Problems by Crowd Intelligence


    The AI hype today is only partially fueled by the development of deep learning techniques. Two other drivers are platforms and crowd (citizen) intelligence [1]. The latter is mostly exploited by services similar to Mechanical Turk or Figure Eight. Usually, such services do not require significant intellectual effort from the participants. However, there is a notable exception: machine learning challenges that are hosted on competition platforms. Usually, IT companies use that approach to solve an ML problem for a reasonable price.

    In this mini-course, I'll give examples of research problems that were addressed by the citizen science approach. I will also attest to the enormous potential of such a method and to its limitations. We'll look at the main ingredients of a good competition organisation and compare the capabilities of the most popular platforms.

    By the end of the mini-course, you will be able to host your own competition that can address a particular research problem.


    • introduction to challenge-driven data science: benefits, motivation, and examples of common pitfalls;
    • resources for data sharing, licensing, and data preparation;
    • main data challenge platforms, limitations and capabilities;
    • practical session on organising your own challenge;
    • data leakage examples, detection and prevention;
    • best practices for future challenges.



    • Python 3 [basic level], familiarity with machine learning concepts and tools [basic level]



    Dr Andrey Ustyuzhanin is the head of Yandex-CERN joint research projects as well as the head of the Laboratory of Methods for Big Data Analysis at NRU HSE. His team is a member of frontier international research collaborations: LHCb, a collaboration at the Large Hadron Collider, and SHiP (Search for Hidden Particles), an experiment designed for the discovery of New Physics. His group is unique in both collaborations, since the majority of the team members come from the computer and data science worlds. The primary priority of his research is the design of new machine learning methods and their use in solving tough scientific enigmas, thus improving the fundamental understanding of our world. Among the projects he has been working on are the efficiency improvement of online triggers at LHCb, the speed-up of the BDT-based online processing formula, the design of custom convolutional neural networks for processing tracks of muon-like particles on smartphone cameras, and the development of the algorithm for tracking in scintillator optical fibre detectors and emulsion cloud chambers. These projects aid research at various experiments: LHCb, OPERA, SHiP and CRAYFIS. Discovering the deeper truth about the Universe by applying data analysis methods is the primary source of inspiration in Andrey’s lifelong journey. Andrey is a co-author of the Coursera course on machine learning aimed at solving particle physics challenges, and the organiser of annual international summer schools on a similar set of topics. Andrey graduated from the Moscow Institute of Physics and Technology in 2000 and received his PhD in 2007 at the Institute of System Programming of the Russian Academy of Sciences.

    Wil van der Aalst
    (RWTH Aachen University) [introductory/intermediate]
    Process Mining: Data Science in Action


    Process mining is the missing link between model-based process analysis and data-oriented analysis techniques. Through concrete data sets and easy to use software the course provides data science knowledge that can be applied directly to analyze and improve processes in a variety of domains.

    The course explains the key analysis techniques in process mining. Participants will learn various process discovery algorithms. These can be used to automatically learn process models from raw event data. Various other process analysis techniques that use event data will be presented. Moreover, the course will provide easy-to-use software, real-life data sets, and practical skills to directly apply the theory in a variety of application domains.

    Process mining provides not only a bridge between data mining and business process management; it also helps to address the classical divide between "business" and "IT". Evidence-based business process management based on process mining helps to create a common ground for business process improvement and information systems development.

    Note that Gartner recently identified process mining software as a new and important class of software. Currently, there are over 25 vendors providing commercial process mining tools and a rapid uptake of the new technology is expected.


    The course focuses on process mining as the bridge between data science and process science. The course will introduce the three main types of process mining.

      1. The first type of process mining is discovery. A discovery technique takes an event log and produces a process model without using any a-priori information. An example is the Alpha-algorithm that takes an event log and produces a process model (a Petri net) explaining the behavior recorded in the log.
      2. The second type of process mining is conformance. Here, an existing process model is compared with an event log of the same process. Conformance checking can be used to check if reality, as recorded in the log, conforms to the model and vice versa.
      3. The third type of process mining is enhancement. Here, the idea is to extend or improve an existing process model using information about the actual process recorded in some event log. Whereas conformance checking measures the alignment between model and reality, this third type of process mining aims at changing or extending the a-priori model. An example is the extension of a process model with performance information, e.g., showing bottlenecks.

    Process mining techniques can be used in an offline, but also online setting. The latter is known as operational support. An example is the detection of non-conformance at the moment the deviation actually takes place. Another example is time prediction for running cases, i.e., given a partially executed case the remaining processing time is estimated based on historic information of similar cases.
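
    To make the discovery step more concrete, the sketch below computes the directly-follows relation and the resulting footprint of a small event log, which is the information the Alpha-algorithm starts from. It is only this first step, not the full algorithm that constructs a Petri net; the function names and the three-trace example log are illustrative assumptions.

```python
from collections import Counter


def directly_follows(log):
    """Count how often activity a is immediately followed by b in a trace."""
    counts = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def footprint(log):
    """Classify each activity pair: '->' causal, '<-' reverse causal,
    '||' parallel (both orders observed), '#' never adjacent."""
    df = directly_follows(log)
    activities = sorted({a for trace in log for a in trace})
    relation = {}
    for a in activities:
        for b in activities:
            ab, ba = (a, b) in df, (b, a) in df
            if ab and ba:
                relation[(a, b)] = "||"
            elif ab:
                relation[(a, b)] = "->"
            elif ba:
                relation[(a, b)] = "<-"
            else:
                relation[(a, b)] = "#"
    return relation


# Each trace is the ordered list of activities recorded for one case.
event_log = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["a", "e", "d"],
]
df_counts = directly_follows(event_log)
relations = footprint(event_log)
```

    In this log, b and c occur in both orders and are therefore classified as parallel, while a is always followed (directly or via b, c, or e) by d without ever being adjacent to it.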

    The course uses many examples based on real-life event logs to illustrate the concepts and algorithms. After taking this course, participants will be able to run process mining projects and will have a good understanding of the Business Process Intelligence field.


    This course is aimed at both students (Master or PhD level) and professionals. A basic understanding of logic, sets, and statistics (at the undergraduate level) is assumed. Basic computer skills are required to use the software provided with the course (but no programming experience is needed). Participants are also expected to have an interest in process modeling and data mining but no specific prior knowledge is assumed as these concepts are introduced in the course.


    W.M.P. van der Aalst. Process Mining: Data Science in Action. Springer-Verlag, Berlin, 2016. (The course will also provide access to slides, several articles, software tools, and data sets.)

    Short Bio

    Wil van der Aalst is a full professor at RWTH Aachen University leading the Process and Data Science (PADS) group. He is also part-time affiliated with the Technische Universiteit Eindhoven (TU/e). Until December 2017, he was the scientific director of the Data Science Center Eindhoven (DSC/e) and led the Architecture of Information Systems group at TU/e. Since 2003, he has held a part-time position at Queensland University of Technology (QUT). Currently, he is also a visiting researcher at Fondazione Bruno Kessler (FBK) in Trento and a member of the Board of Governors of Tilburg University. His research interests include process mining, Petri nets, business process management, workflow management, process modeling, and process analysis. Wil van der Aalst has published over 200 journal papers, 20 books (as author or editor), 450 refereed conference/workshop publications, and 65 book chapters. Many of his papers are highly cited (he is one of the most cited computer scientists in the world; according to Google Scholar, he has an H-index of 138 and has been cited over 85,000 times) and his ideas have influenced researchers, software developers, and standardization committees working on process support. Next to serving on the editorial boards of over ten scientific journals, he is also playing an advisory role for several companies, including Fluxicon, Celonis, Processgold, and Bright Cape. Van der Aalst received honorary degrees from the Moscow Higher School of Economics (Prof. h.c.), Tsinghua University, and Hasselt University (Dr. h.c.). He is also an elected member of the Royal Netherlands Academy of Arts and Sciences, the Royal Holland Society of Sciences and Humanities, and the Academy of Europe. In 2017, he was awarded a Humboldt Professorship.

    Zhongfei Zhang
    (Binghamton University) [introductory/advanced]
    Relational and Multimedia Data Learning


    This course aims at giving the audience an introduction to knowledge discovery and machine learning theories, and to case studies in real-world applications, for relational and multimedia data as well as the relationships between them. The course begins with an extensive introduction to the fundamental concepts and theories of knowledge discovery and machine learning for relational and multimedia data, and then showcases several important real-world applications as case studies of big data knowledge discovery and learning from relational and multimedia data.


    The course consists of three two-hour sessions. The syllabus is as follows:

    • First session: Introduction to the fundamental concepts and theories for relational and multimedia data, with a focus on an overview of the wide spectrum of available techniques and technologies, their relationships for knowledge discovery and learning, and their applications to big data scenarios through real-world case studies;
    • Second session: Detailed discussion of the classic and state-of-the-art methods for relational data learning;
    • Third session: Detailed discussion of the state-of-the-art methods for multimedia data learning.


    College-level mathematics and fundamentals of computer science.


      1. Bo Long, Zhongfei (Mark) Zhang, and Philip S. Yu, Relational Data Clustering: Models, Algorithms, and Applications, Taylor & Francis/CRC Press, 2010, ISBN: 9781420072617
      2. Zhongfei (Mark) Zhang and Ruofei Zhang, Multimedia Data Mining -- A Systematic Introduction to Concepts and Theory, Taylor & Francis Group/CRC Press, 2008, ISBN: 9781584889663
      3. Zhongfei (Mark) Zhang, Bo Long, Zhen Guo, Tianbing Xu, and Philip S. Yu, Machine Learning Approaches to Link-Based Clustering, in Link Mining: Models, Algorithms and Applications, Edited by Philip S. Yu, Christos Faloutsos, and Jiawei Han, Springer, 2010
      4. Zhen Guo, Zhongfei Zhang, Eric P. Xing, and Christos Faloutsos, Multimodal Data Mining in a Multimedia Database Based on Structured Max Margin Learning, ACM Transactions on Knowledge Discovery and Data Mining, ACM Press, 2015

    Short Bio:

    Zhongfei (Mark) Zhang is a full professor of Computer Science at the State University of New York (SUNY) at Binghamton, where he directs the Multimedia Research Computing Laboratory. While on leave from SUNY Binghamton, he also served as a QiuShi Chair Professor at Zhejiang University, China, and as the Director of the Data Science and Engineering Research Center there. He received a B.S. in Electronics Engineering (with Honors) and an M.S. in Information Sciences, both from Zhejiang University, China, and a PhD in Computer Science from the University of Massachusetts at Amherst, USA. His research interests include machine learning and artificial intelligence, data mining and knowledge discovery, multimedia information indexing and retrieval, computer vision, and pattern recognition. He is the author and co-author of the first monograph on multimedia data mining and the first monograph on relational data clustering, respectively. His research is sponsored by a wide spectrum of government funding agencies, industrial labs, and private agencies. He has published over 200 papers in premier venues in his areas and is an inventor on more than 30 patents. He has served on several journal editorial boards and received several professional awards, including best paper awards at premier conferences in his areas.