2nd International Winter School on Big Data

Bilbao, Spain, February 8-12, 2016

Keynote Speakers


Nektarios Benekos (European Organization for Nuclear Research), Role of Computing and Software in Particle Physics

Summary. The challenges of computing in science and technology have grown in several aspects: computing software and infrastructure is mission critical, requirements in communication, storage and CPU capacity are often above the currently feasible and affordable level and the large science projects have a lifetime that is much longer than one cycle in hardware and software technology. Computing in particle physics and related research domains faces unprecedented challenges, determined by the complexity of physics under study, of the detectors and of the environments where they operate. Cross-disciplinary solutions in physics and detector simulation, data analysis as well as in software engineering and Grid technology are presented, with examples of their use in global science projects in the fields of particle physics and space.

Bio. Nektarios Chr. Benekos received a Ph.D. in particle physics-detector R&D and data analysis and is now a visiting Associate Professor in the Department of Applied Physics and Computing at the National Technical University of Athens and a Senior Data Analyst at CERN. He previously held positions at the University of Illinois Urbana-Champaign and the Max Planck Institute for Physics, after being a Fellow at the Physics Division of the European Organization for Nuclear Research (CERN). He has supervised the Ph.D. and Master's work of 10 students and published around 65 papers in physics and computer science, with an h-index of 100 and over 50,000 citations.

He currently works on applying computer science, from infrastructure to analytics, in large-scale particle and astroparticle physics experiments, image processing, network science and particle physics. The infrastructure work is built around Software Defined Systems on Clouds and Clusters. He is involved in several projects related to Big Data and Big Computing Centers. He has experience in online education and its use in MOOCs and other webinar platforms for areas like Data and Computational Science.

He is also acting as senior scientific advisor at GF-ACCORD, a fast Business Consulting company, supervising the R&D efforts in the Data Science Unit for Business Intelligence Applications.


Chih-Jen Lin (National Taiwan University), When and When Not to Use Distributed Machine Learning

Summary. In the big-data era, data are often too large to be stored on one computer. However, whether one should conduct machine learning tasks or data analytics in a distributed environment is still a debatable issue. Sub-sampling data to one machine for analysis is always an option. In this talk I will argue that the traditional single-machine setting and the new distributed setting are both important for big-data machine learning in the future. I will discuss their advantages and disadvantages and when to use which.

Bio. National Taiwan University page.


Jeffrey Ullman (Stanford University), Theory of MapReduce Algorithms

Summary. Not all parallel algorithms are MapReduce algorithms. A complexity theory for MapReduce algorithms has developed recently, and it involves the trade-off between the memory used by reducers ("reducer size") and the communication between the map and reduce phases ("replication rate"). We shall outline the theory, review some of the problems whose complexity in this sense has been analyzed, and mention some of the open problems that remain in this theory.
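The trade-off named in the summary can be illustrated with a small sketch. The all-pairs problem and the group-based mapping used here are assumptions chosen for illustration, not material taken from the talk: n inputs are split into g groups, and one reducer is created per unordered pair of groups so that every pair of inputs meets at some reducer. The function names are hypothetical.

```python
from itertools import combinations


def all_pairs_mapping(n, g):
    """Assign inputs 0..n-1 to reducers keyed by unordered pairs of groups.

    Assumes g >= 2. Input i belongs to group i % g and is sent to every
    reducer whose key contains that group.
    """
    reducers = {pair: [] for pair in combinations(range(g), 2)}
    for i in range(n):
        group = i % g
        for pair in reducers:
            if group in pair:
                reducers[pair].append(i)
    return reducers


def costs(n, g):
    """Return (reducer size, replication rate) for the mapping above."""
    reducers = all_pairs_mapping(n, g)
    reducer_size = max(len(inputs) for inputs in reducers.values())
    # Replication rate: average number of reducers each input is sent to.
    replication_rate = sum(len(inputs) for inputs in reducers.values()) / n
    return reducer_size, replication_rate


if __name__ == "__main__":
    for g in (2, 3, 4, 6):
        q, r = costs(12, g)
        print(f"groups={g}  reducer size={q}  replication rate={r}")
```

In this sketch, increasing the number of groups shrinks the reducer size (about 2n/g) while the replication rate grows (g - 1), which is the kind of tension between memory per reducer and communication that the summary describes.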

Bio. Stanford page.


Alexandre Vaniachine (Argonne National Laboratory), Big Data Technologies and Data Science Methods in the Higgs Boson Discovery

Summary. The data volumes collected at the Large Hadron Collider represent research challenges common for data-intensive sciences. The scientific goals of the LHC include high precision tests of the Standard Model of fundamental particles and searches for new physics. These goals require detailed comparison of the expected physics models and detector behavior with data. Progress in Big Data technologies enables LHC data collection and analytics, including comparison with computational models simulating the expected data behavior. We use the Higgs boson example to describe the end-to-end data flow for scientific discovery, highlighting the roles of distributed computing and Data Science classification methods maximizing signal significance over background noise models.

Bio. Argonne page.

Nektarios Benekos (European Organization for Nuclear Research), [introductory/intermediate] Exploring the Mysteries of our Cosmos: the Big Deal between Big Data and Big Science


Summary. We are living in "the age of big data," according to The World Economic Forum. I agree too. The collection and mining of massive amounts of digital data currently defines the term big data. Those are processes that businesses largely handle. However, the analysis of that data, the magic ingredient of algorithms and advanced mathematics that bridges the gap between knowledge and insight, is big science. It is where the value is. It is the future. Getting the right message to the right customer at the right time is the promise of relevant, real-time marketing. Big science, not big data, will bring this to life. CERN is one of the world’s largest and most respected centres for scientific research, where big data meets big science. Its business is fundamental physics: finding out what the Universe is made of and how it works by analyzing raw and derived data at massive scale. CERN operates the Large Hadron Collider (LHC), the world’s largest and most powerful particle accelerator, where the ATLAS and CMS experiments announced their observations of a particle consistent with the long-sought Higgs boson in the summer of 2012. The particle’s detection has set the worldwide scientific community buzzing, but behind the success of the work undertaken at the LHC lies a story of Big Data success that is truly ground-breaking.

The rapid increase in performance of the LHC accelerator is having an impact on the computing requirements, since it increases the rate (Velocity), complexity (Variety) and quantity (Volume) of data that the LHC experiments need to store, distribute and process.

Put simply, the analysis that big science brings to the table makes big data relevant. Therefore, this course will introduce students to how big science combines with big data to create big opportunities in three significant ways: real-time relevant content, data visualization, and predictive analytics, and to how these are performed at the LHC.

Syllabus. Introduction to the mysteries of our Cosmos, introduction to the Large Hadron Collider and particle physics experiments, introduction to big data programming, streaming, management, triggering, filtering, visualization, monitoring and analyzing in real time.


Pre-requisites. Some background in physics experiments, programming techniques and algorithms, databases, and probability and statistics will be useful.

Bio. Nektarios Chr. Benekos received a Ph.D. in particle physics-detector R&D and data analysis and is now a visiting Associate Professor in the Department of Applied Physics and Computing at the National Technical University of Athens and a Senior Data Analyst at CERN. He previously held positions at the University of Illinois Urbana-Champaign and the Max Planck Institute for Physics, after being a Fellow at the Physics Division of the European Organization for Nuclear Research (CERN). He has supervised the Ph.D. and Master's work of 10 students and published around 65 papers in physics and computer science, with an h-index of 100 and over 50,000 citations.

He currently works on applying computer science, from infrastructure to analytics, in large-scale particle and astroparticle physics experiments, image processing, network science and particle physics. The infrastructure work is built around Software Defined Systems on Clouds and Clusters. He is involved in several projects related to Big Data and Big Computing Centers. He has experience in online education and its use in MOOCs and other webinar platforms for areas like Data and Computational Science.

He is also acting as senior scientific advisor at GF-ACCORD, a fast Business Consulting company, supervising the R&D efforts in the Data Science Unit for Business Intelligence Applications.

Hendrik Blockeel (KU Leuven), [intermediate] Decision Trees for Big Data Analytics


Summary. Decision trees, and derived methods such as Random Forests, are among the most popular methods for learning predictive models from data. This is to a large extent due to the versatility and efficiency of these algorithms. This course will introduce students to the basic methods for learning decision trees, as well as to variations and more sophisticated versions of decision tree learners, with a particular focus on those methods that make decision trees work in the context of big data.

Syllabus. Classification and regression trees, multi-output trees, clustering trees, model trees, ensembles of trees, incremental learning of trees, learning decision trees from large datasets, learning from data streams.

References. Relevant references will be provided as part of the course.

Pre-requisites. Familiarity with mathematics, probability theory, statistics, and algorithms is expected, on the level it is typically introduced at the bachelor level in computer science or engineering programs.

Bio. Hendrik Blockeel is a professor at KU Leuven, Belgium, and part-time associate professor at Leiden University, the Netherlands. He received his Ph.D. degree in 1998 from KU Leuven. His research interests lie mostly within artificial intelligence, with a focus on machine learning and data mining, and in the use of AI-based modeling in other sciences. Dr. Blockeel has made a variety of contributions on topics such as inductive logic programming, probabilistic-logical learning, and decision trees. He is an action editor of Machine Learning and a member of the editorial board of several other journals. He has chaired or organized several conferences, including ECMLPKDD 2013, and organized the ACAI summer school in 2007. He has served on the board of the European Coordinating Committee for Artificial Intelligence, and currently serves on the ECMLPKDD steering committee as publications chair. He became an ECCAI fellow in 2015.

Nello Cristianini (University of Bristol), [introductory] THINKBIG: Towards Large Scale Computational Social Sciences, History and Digital Humanities


Summary. We describe a series of studies in which massive sets of data (mostly text and images) are mined in order to gain new insights about society, the media system and history. These studies are only possible with large scale AI techniques, and we expect them to become increasingly common in the future. Among other things, we will study gender in the media, mood on Twitter, and cultural change in history, by analysing several millions of documents. The methods used can be directly transferred and applied to a variety of other domains.


  1. on data driven AI
  2. what is a pattern?
  3. extracting information from social media (mood, flu, etc) / large scale
  4. mining newspapers: science, politics, history... and analysis of news images. / large scale
  5. can science be automated? / large scale
  6. ethical aspects

Pre-requisites. No prerequisites; it is an introductory course.

Bio. Wiki page.

Ernesto Damiani (University of Milan and EBTIC/Khalifa University), [introductory/intermediate] Architectures, Models and Tools for Big-Data-as-a-Service


Summary. Many companies and organisations worldwide have become aware of the potential competitive advantage they could get from timely and accurate Big Data Analytics (BDA), but lack the data management expertise and budget to fully exploit BDA. To help overcome this hurdle, this course will introduce model-based BDA-as-a-service (MBDAaaS), providing clear models of the entire Big Data analysis process and of its artefacts. The approach supports automation and commoditisation of Big Data analytics, while enabling BDA customization to domain-specific customer requirements. Besides models for representing all aspects of BDA, the course will discuss and compare available architectural patterns and toolkits for repeatable set-up and management of Big Data analytics pipelines. Repeatable patterns can bring the costs of Big Data analytics within reach of EU organizations (including SMEs) that do not have either in-house Big Data expertise or the budget for expensive data consultancy. Activities discussed in the course will include (i) planning the preparation of Big Data sources, handling Big Data opacity, diversity, security, and privacy compliance; (ii) Service Level Agreements (SLAs) for BDA detailing privacy, timing, and accuracy needs; (iii) data management and algorithm parallelisation strategies, from distributed data acquisition/storage to the design and parallel deployment of analytics and presentation of results; (iv) auditing and assessment of legal compliance (for example, with privacy regulations) of BDA enactment.

Bio. Ernesto Damiani joined Khalifa University, Abu Dhabi as Director of the Information Security Center and Leader of the Big Data Initiative at the Etisalat British Telecom Innovation Center (EBTIC). He is on extended leave from the Department of Computer Science, Università degli Studi di Milano, Italy, where he leads the SESAR research lab. Ernesto's research interests include secure service-oriented architectures, and privacy-preserving Big Data analytics. Ernesto holds/has held visiting positions at a number of international institutions, including George Mason University in Virginia, US, Tokyo Denki University, Japan, LaTrobe University in Melbourne, Australia, and the Institut National des Sciences Appliquées (INSA) at Lyon, France. He is a Fellow of the Japanese Society for the Progress of Science.

Since December 2015, Ernesto has been the Principal Investigator of the TOREADOR H2020 project on Model-driven Big-Data-as-a-Service. He has served as a PI in a number of large-scale research projects funded by the European Commission, the Italian Ministry of Research and by private companies such as British Telecom, Cisco Systems, SAP, Telecom Italia, Siemens Networks (now Nokia Siemens) and many others. Ernesto serves on the editorial board of several international journals; he is the EIC of the International Journal on Big Data and of the International Journal of Knowledge and Learning. He is Associate Editor of IEEE Transactions on Service-oriented Computing and of the IEEE Transactions on Fuzzy Systems. Ernesto is a senior member of the IEEE and served as Vice-Chair of the IEEE Technical Committee on Industrial Informatics. In 2008, Ernesto was nominated ACM Distinguished Scientist and received the Chester Sall Award from the IEEE Industrial Electronics Society. Ernesto has co-authored over 350 scientific papers and many books and international patents.

Francisco Herrera (University of Granada), [introductory] Big Data Preprocessing


Summary. Data Preprocessing for Data Mining addresses one of the most important issues within the well-known Knowledge Discovery from Data process. Data taken directly from the source are usually not ready to be considered for a data mining process. Data preprocessing techniques adapt the data to fulfill the input demands of each data mining algorithm. Data preprocessing includes data preparation methods for cleaning, transformation or managing imperfect data (missing values and noisy data) and data reduction techniques, which aim at reducing the complexity of the data and detecting or removing irrelevant and noisy elements from the data, including feature and instance selection and discretization.

The knowledge extraction process from Big Data has become a very difficult task for most of the classical and advanced existing techniques. The main challenges are to deal with the increasing amount of data, considering the number of instances and/or features, and the complexity of the problem. The design of data preprocessing methods for big data requires redesigning the methods and adapting them to new paradigms such as MapReduce and the directed acyclic graph model used by Apache Spark.

In this course we will pay attention to preprocessing approaches for classification in big data. We will analyze the design of preprocessing methods for big data (feature selection, discretization, data preprocessing for imbalanced classification, noisy data cleaning, …), discussing how to include data preprocessing methods along the knowledge discovery process. We will pay attention to their design for the MapReduce paradigm and the Apache Spark framework.


  1. Data preprocessing
  2. Big data reduction (feature selection, discretization, prototype generation)
  3. Noise and big data
  4. Imbalanced big data classification preprocessing

References. See the page.

Bio. Francisco Herrera is a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada.

He has been the supervisor of 38 Ph.D. students and published more than 300 journal papers. He is co-author of the book "Data Preprocessing in Data Mining" (Springer, 2015). He currently acts as Editor in Chief of the international journals "Information Fusion" (Elsevier) and "Progress in Artificial Intelligence" (Springer). He acts as editorial member of a dozen journals.

He has been given many awards and honors for his personal work or for his publications in journals and conferences, among others: ECCAI Fellow 2009, IFSA Fellow 2013, the IEEE Transactions on Fuzzy Systems Outstanding Paper Award for 2008 and 2012 (bestowed in 2011 and 2015), and nomination as a Highly Cited Researcher in the areas of Engineering and Computer Sciences.

Chih-Jen Lin (National Taiwan University), [introductory/intermediate] Large-scale Linear Classification


Summary. Many classification methods such as kernel methods or decision trees are nonlinear approaches. However, linear methods that use a simple weight vector as the model remain very useful for many applications. With careful feature engineering and data in a rich dimensional space, the performance may be competitive with that of a highly nonlinear classifier. Successful application areas include document classification and computational advertising (CTR prediction). In the first part of this talk, we give an overview of linear classification by introducing commonly used formulations through different aspects. This discussion is useful because many people are confused about the relationships between, for example, SVM and logistic regression. We also discuss the connection between linear and kernel classification. In the second part we investigate techniques for solving optimization problems for linear classification. In particular, we show details of two representative settings: coordinate descent methods and Newton methods. The third part of the talk discusses issues in applying linear classification to big-data analytics. We present effective training methods in both multi-core and distributed environments. After demonstrating some promising results we discuss future challenges of linear classification.
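As a rough illustration of the relationship between SVM and logistic regression mentioned in the summary, both can be written as the same L2-regularized problem with different loss functions. This is a standard textbook formulation rather than material from the talk; the notation (weight vector w, penalty parameter C, loss ξ, and l training pairs (x_i, y_i) with labels y_i in {-1, +1}) is assumed here for illustration.

```latex
\min_{\mathbf{w}} \; \frac{1}{2}\,\mathbf{w}^{T}\mathbf{w}
  + C \sum_{i=1}^{l} \xi(\mathbf{w}; \mathbf{x}_i, y_i),
\qquad
\xi_{\mathrm{SVM}}(\mathbf{w}; \mathbf{x}_i, y_i) = \max\bigl(0,\; 1 - y_i\,\mathbf{w}^{T}\mathbf{x}_i\bigr),
\qquad
\xi_{\mathrm{LR}}(\mathbf{w}; \mathbf{x}_i, y_i) = \log\bigl(1 + e^{-y_i\,\mathbf{w}^{T}\mathbf{x}_i}\bigr).
```

Under this view the two methods differ only in how they penalize a training example as a function of its margin y_i w^T x_i.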



Pre-requisites. No prerequisites.

Bio. Chih-Jen Lin is currently a distinguished professor at the Department of Computer Science, National Taiwan University. He obtained his B.S. degree from National Taiwan University in 1993 and his Ph.D. degree from the University of Michigan in 1998. His major research areas include machine learning, data mining, and numerical optimization. He is best known for his work on support vector machines (SVM) for data classification. His software LIBSVM is one of the most widely used and cited SVM packages. For his research work he has received many awards, including the ACM KDD 2010 and ACM RecSys 2013 best paper awards. He is an IEEE fellow, an AAAI fellow, and an ACM fellow for his contribution to machine learning algorithms and software design. More information about him can be found at the National Taiwan University page.

George Karypis (University of Minnesota), [intermediate/advanced] Scaling Up Recommender Systems


Summary. Recommender systems are designed to identify the items that a user will like or find useful based on the user’s prior preferences and activities. These systems have become ubiquitous and are an essential tool for information filtering and (e-)commerce. Over the years, collaborative filtering, which derives these recommendations by leveraging past activities of groups of users, has emerged as the most prominent approach for solving this problem.

This course is designed to provide an overview of the different state-of-the-art methods and applications of recommender systems with a focus towards deploying them in domains with large amounts of data and/or low-latency recommendations. The course consists of two major parts. The first will cover various serial algorithms for solving some of the most common recommendation problems including rating prediction, top-N recommendation, and cold-start. The second will cover various serial and parallel algorithms, formulations, and approaches that allow these methods to scale to large problems.

In order to succeed in the course, students need to have a background in algorithms, numerical optimization, and parallel computing.


Bio. George Karypis is an ADC Chair of Digital Technology Professor at the Department of Computer Science & Engineering at the University of Minnesota, Twin Cities. His research interests span the areas of data mining, high performance computing, information retrieval, collaborative filtering, bioinformatics, cheminformatics, and scientific computing. His research has resulted in the development of software libraries for serial and parallel graph partitioning (METIS and ParMETIS), hypergraph partitioning (hMETIS), parallel Cholesky factorization (PSPASES), collaborative filtering-based recommendation algorithms (SUGGEST), clustering high dimensional datasets (CLUTO), finding frequent patterns in diverse datasets (PAFI), and protein secondary structure prediction (YASSPP). He has coauthored over 250 papers on these topics and two books (“Introduction to Protein Structure Prediction: Methods and Algorithms” (Wiley, 2010) and “Introduction to Parallel Computing” (Publ. Addison Wesley, 2003, 2nd edition)). In addition, he is serving on the program committees of many conferences and workshops on these topics, and on the editorial boards of the IEEE Transactions on Knowledge and Data Engineering, ACM Transactions on Knowledge Discovery from Data, Data Mining and Knowledge Discovery, Social Network Analysis and Data Mining Journal, International Journal of Data Mining and Bioinformatics, the journal on Current Proteomics, Advances in Bioinformatics, and Biomedicine and Biotechnology.

Geoff McLachlan (University of Queensland), [intermediate/advanced] Big Data Extensions of Some Methods of Classification and Clustering


Summary. Attention is focussed first on supervised classification (discriminant analysis) for high-dimensional datasets. Issues discussed include variable selection and the estimation of the associated error rates to circumvent selection bias problems. The unsupervised classification (cluster analysis) is considered next with the focus on the use of finite mixture distributions, in particular multivariate normal distributions, to provide a model-based approach to clustering. A coverage of the main issues involved with the maximum likelihood fitting of mixture models via the EM algorithm will be given, including extensions of normal component distributions for long-tailed and/or skew data. Finally, consideration is given to further extensions of these mixture models to handle big data of possibly high-dimensions through the use of factor models after an appropriate reduction where necessary in the number of variables. Various real-data examples are given.


  1. Construction of classifiers, including variable selection and estimation of their error rates, for the supervised classification of high-dimensional datasets. (2 hours).
  2. An introduction to finite mixture models for unsupervised classification (clustering) with extensions of normal component distributions to t-distributions to handle long-tailed clusters. (2 hours).
  3. Extensions of normal and t-mixture models for the clustering of big datasets of possibly high dimensions and with clusters that are not necessarily elliptically symmetric. The use of mixtures of factor models with t- and skew t-component distributions is to be highlighted. (2 hours).
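
To make the idea of maximum likelihood fitting of a normal mixture via the EM algorithm concrete, here is a minimal sketch for the simplest case: two univariate normal components. It is an illustrative assumption, not the course's own material; the function names, the initialisation at the smallest and largest observation, and the variance floor `min_var` are all hypothetical choices, and the t-, skew and factor-model extensions listed above are not covered.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_component_mixture(data, iterations=100, min_var=1e-6):
    """Fit a two-component univariate normal mixture with EM.

    Returns (weights, means, variances). Initialisation and the variance
    floor are simple illustrative choices.
    """
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    weights = [0.5, 0.5]
    means = [min(data), max(data)]
    variances = [max(overall_var, min_var)] * 2

    for _ in range(iterations):
        # E-step: posterior probability that each point came from each component.
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])

        # M-step: weighted maximum likelihood updates of the parameters.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, min_var)

    return weights, means, variances


if __name__ == "__main__":
    sample = [-1.0, -0.5, 0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0]
    print(fit_two_component_mixture(sample))
```

Each data point can then be assigned to the component with the highest posterior probability, which is what makes the fitted mixture usable as a model-based clustering.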


Pre-requisites. A good knowledge of multivariate statistics at least at an advanced undergraduate level.

Bio. Homepage at University of Queensland.

Wladek Minor (University of Virginia), [introductory/intermediate] Big Data in Biomedical Sciences



Raymond Ng (University of British Columbia), [introductory/intermediate] Mining and Summarizing Text Conversations


Summary. With the ever-increasing popularity of Internet technologies and communication devices such as smartphones and tablets, and with huge amounts of conversational data generated on an hourly basis, intelligent text analytic approaches can greatly benefit organizations and individuals. For example, managers can find the information exchanged in forum discussions crucial for decision making. Moreover, the posts and comments about a product can help business owners to improve the product.

In this lecture, we first give an overview of important applications of mining text conversations, using sentiment summarization of product reviews as a case study. Then we examine three topics in this area: (i) topic modeling; (ii) natural language summarization; and (iii) extraction of rhetorical structure and relationships in text.


  1. Text conversations and business intelligence (0.5 hours)
  2. Sentiment extraction and summarization as applications (1.5 hours)
  3. Topic modeling (1 hour)
  4. Extractive and abstractive summarization (1 hour)
  5. Rhetorical analysis (1.5 hours)
  6. Summary (0.5 hours)


  1. Shafiq Joty, Giuseppe Carenini and Raymond Ng. Topic Segmentation and Labeling in Asynchronous Conversations. Journal of AI Research (JAIR) (2013), Vol. 47, Page 521-573
  2. Shafiq Joty, Giuseppe Carenini, Gabriel Murray and Raymond Ng. Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), MIT, Massachusetts, USA
  3. Yashar Mehdad, Giuseppe Carenini, Raymond Ng and Shafiq Joty. Towards Topic Labeling with Phrase Entailment and Aggregation. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), Atlanta, USA
  4. Yashar Mehdad, Giuseppe Carenini, and Raymond Ng. Abstractive Summarization of Spoken and Written Conversations based on Phrasal Queries. In Proceedings of Association of Computational Linguistics (ACL 2014)
  5. Shafiq Joty, Giuseppe Carenini and Raymond Ng. A Novel Discriminative Framework for Sentence-Level Discourse Analysis. EMNLP 2012
  6. Shafiq Joty, Giuseppe Carenini, Raymond Ng and Yashar Mehdad. Combining Intra- and Multi-sentential Rhetorical Parsing for Document-level Discourse Analysis. ACL 2013
  7. Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond Ng and Bita Nejat. Abstractive Summarization of Product Reviews Using Discourse Structure. EMNLP 2014
  8. Kelsey Allen, Giuseppe Carenini and Raymond Ng, Detecting Disagreement in Conversations using Pseudo-Monologic Rhetorical Structure, EMNLP 2014

Pre-requisites. Basic knowledge of machine learning and natural language processing is preferred but not required.

Bio. Dr. Raymond Ng is a professor in Computer Science at the University of British Columbia. His main research area for the past two decades has been data mining, with a specific focus on health informatics and text mining. He has published over 180 peer-reviewed publications on data clustering, outlier detection, OLAP processing, health informatics and text mining. He is the recipient of two best paper awards: one from the 2001 ACM SIGKDD conference, which is the premier data mining conference worldwide, and one from the 2005 ACM SIGMOD conference, which is one of the top database conferences worldwide. He co-authored the books entitled “Methods for mining and summarizing text conversations” and “Perspectives for Business Intelligence”.

Sankar K. Pal (Indian Statistical Institute), [introductory/advanced] Machine Intelligence and Granular Mining: Relevance to Big Data



Bio. Sankar K. Pal is a Distinguished Scientist and former Director of the Indian Statistical Institute. He is also a J.C. Bose Fellow of the Govt. of India. He founded the Machine Intelligence Unit and the Center for Soft Computing Research: A National Facility in the Institute in Calcutta. He received a Ph.D. in Radio Physics and Electronics from the University of Calcutta in 1979, and another Ph.D. in Electrical Engineering along with DIC from Imperial College, University of London in 1982.

Prof. Pal worked at the University of California, Berkeley and the University of Maryland, College Park in 1986-87; the NASA Johnson Space Center, Houston, Texas in 1990-92 & 1994; and the US Naval Research Laboratory, Washington DC in 2004. Since 1997 he has been a Distinguished Visitor of the IEEE Computer Society for the Asia-Pacific Region, and has held several visiting positions at universities in Italy, Poland, Hong Kong and Australia.

He is a co-author of seventeen books and more than four hundred research publications in the areas of Pattern Recognition and Machine Learning, Image Processing, Data Mining, Web Intelligence, Soft Computing, Neural Nets, Genetic Algorithms, Fuzzy Sets, Rough Sets, Social Network Analysis and Bioinformatics. He has visited about forty countries as a Keynote/Invited speaker.

He is a Fellow of the IEEE, the Academy of Sciences for the Developing World (TWAS), International Association for Pattern Recognition, International Association of Fuzzy Systems, and all the four National Academies for Science & Engineering in India. He serves or has served on the editorial boards of twenty-two international journals including several IEEE Transactions.

He has received the 1990 S.S. Bhatnagar Prize (the most coveted award for a scientist in India), 2013 Padma Shri (one of the highest civilian awards) by the President of India and many prestigious awards in India and abroad including the 1999 G.D. Birla Award, 1993 Jawaharlal Nehru Fellowship, 2000 Khwarizmi International Award from the President of Iran, 1993 NASA Tech Brief Award (USA), 1994 IEEE Trans. Neural Networks Outstanding Paper Award, and 2005-06 Indian Science Congress-P.C. Mahalanobis Birth Centenary Gold Medal from the Prime Minister of India for Lifetime Achievement.

More info: Indian Statistical Institute page.

Erhard Rahm (University of Leipzig), [introductory/intermediate] Scalable and Privacy-preserving Data Integration


Summary. Data integration is a key challenge for Big Data applications that analyze large sets of heterogeneous data of potentially different kinds, including structured database records as well as semi-structured entities from web sources or social networks. In many cases, there is also a need to deal with a very high number of data sources, e.g. product offers from many e-commerce websites. The integration of sensitive personal information from different sources, e.g. about customers of different companies or patients of different hospitals, poses the additional challenge of protecting the privacy of the individuals while still supporting certain data analysis tasks on the anonymized data. We will cover proposed approaches to deal with the key data integration tasks of (large-scale) entity resolution and schema or ontology matching. In particular, we discuss how entity resolution can be performed in parallel on Hadoop platforms, together with so-called blocking approaches to avoid comparing too many entities with each other and load balancing techniques to deal with data skew. For privacy-preserving record linkage we focus on the use of Bloom filters for encrypting sensitive attribute values while still permitting effective match decisions. We discuss the use of different configurations with or without a dedicated linkage unit and their implications regarding privacy and runtime complexity. Another topic is graph-based data integration and analysis approaches that keep all relevant relationships between entities to enable more sophisticated analysis tasks. Such approaches are not only useful for typical graph applications such as social networks, but can also lead to enhanced business intelligence on enterprise data.
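
The Bloom filter idea for privacy-preserving record linkage can be sketched as follows. This is a minimal illustration under stated assumptions, not the method presented in the course: the character-bigram tokenisation, the keyed HMAC-SHA256 hashing, the filter length, the number of hash functions and the Dice similarity are all illustrative choices, and every function name is hypothetical.

```python
import hashlib
import hmac


def bigrams(value):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, secret, num_bits=256, num_hashes=4):
    """Encode a string as the set of bit positions set in a Bloom filter.

    `secret` is a shared key (bytes) assumed to be known only to the data
    owners, so the party comparing filters cannot recompute the hashes.
    """
    positions = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            message = f"{k}:{gram}".encode()
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


def dice_similarity(a, b):
    """Dice coefficient of two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    key = b"shared-secret"  # hypothetical key for the example
    smith = bloom_encode("Smith", key)
    for other in ("Smith", "Smyth", "Garcia"):
        print(other, round(dice_similarity(smith, bloom_encode(other, key)), 2))
```

Because similar strings share most of their bigrams, their filters share most of their set bits, so approximate matches can be decided on the encoded values without exchanging the original attribute values.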


  1. Data integration
    • Scalable entity resolution / link discovery
    • Large-scale schema/ontology matching
    • Holistic data integration
  2. Privacy-preserving record linkage (PPRL)
    • Encryption of sensitive information
    • PPRL with linkage unit
    • Secure multi-party approaches
  3. Graph-based data integration and analytics
    • ETL for graph data
    • Graph-based business intelligence
    • Hadoop-based graph analytics


Pre-requisites. Participants should have a computer science background and be familiar with traditional database systems and data warehouses. Knowledge about basic Big Data technologies such as Hadoop is beneficial.

Bio. Erhard Rahm is full professor for databases at the computer science institute of the University of Leipzig, Germany. His current research focusses on Big Data and data integration. He has authored several books and more than 200 peer-reviewed journal and conference publications. His research on data integration and schema matching has been awarded several times, in particular with the renowned 10-year best-paper award of the conference series VLDB (Very Large Databases) and the Influential Paper Award of the conference series ICDE (Int. Conf. on Data Engineering). Prof. Rahm is one of the two scientific coordinators of the new German center of excellence on Big Data ScaDS (competence center for SCAlable Data services and Solutions) Dresden/Leipzig.

Hanan Samet (University of Maryland), [introductory/intermediate] Sorting in Space: Multidimensional, Spatial, and Metric Data Structures for Applications in Spatial Databases, Geographic Information Systems (GIS), and Location-based Services


Summary. The representation of multidimensional, spatial, and metric data is an important issue in applications of spatial databases, geographic information systems (GIS), and location-based services. Recently, there has been much interest in hierarchical data structures such as quadtrees, octrees, and pyramids, which are based on image hierarchies, as well as methods that make use of bounding boxes, which are based on object hierarchies. Their key advantage is that they provide a way to index into space. In fact, they are little more than multidimensional sorts. They are compact and, depending on the nature of the spatial data, they save space as well as time and also facilitate operations such as search.

We describe hierarchical representations of points, lines, collections of small rectangles, regions, surfaces, and volumes. For region data, we point out the dimension-reduction property of the region quadtree and octree. We also demonstrate how to use them for both raster and vector data. For metric data that does not lie in a vector space, so that indexing is based simply on the distance between objects, we review various representations such as the vp-tree, gh-tree, and mb-tree. In particular, we demonstrate the close relationship between these representations and those designed for a vector space. For all of the representations, we show how they can be used to compute nearest objects in an incremental fashion so that the number of objects need not be known in advance. The VASCO Java applet, which illustrates these methods, is also presented.
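To make the idea of incremental nearest-object computation concrete, here is a minimal sketch of a PR quadtree over 2D points together with a best-first traversal that yields points in order of increasing distance from a query point, so a caller can stop after any number of results. It is an illustration under stated assumptions rather than the course's code: the class and function names are hypothetical, all points are assumed to lie inside the root square, and each leaf is assumed to hold at most `capacity` points.

```python
import heapq
import itertools
import math


class PRQuadtree:
    """Point-region quadtree over the square with corner (x0, y0) and side `size`."""

    def __init__(self, x0, y0, size, capacity=1):
        self.x0, self.y0, self.size = x0, y0, size
        self.capacity = capacity
        self.points = []      # points stored here while this node is a leaf
        self.children = None  # four equal sub-quadrants once the leaf is split

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        # Split an overfull leaf; the size guard stops endless splitting on duplicates.
        if len(self.points) > self.capacity and self.size > 1e-9:
            half = self.size / 2
            self.children = [
                PRQuadtree(self.x0 + dx * half, self.y0 + dy * half, half, self.capacity)
                for dy in (0, 1) for dx in (0, 1)
            ]
            pending, self.points = self.points, []
            for q in pending:
                self._child_for(q).insert(q)

    def _child_for(self, p):
        half = self.size / 2
        col = 1 if p[0] >= self.x0 + half else 0
        row = 1 if p[1] >= self.y0 + half else 0
        return self.children[2 * row + col]

    def min_dist(self, q):
        """Smallest possible distance from q to any location inside this block."""
        dx = max(self.x0 - q[0], 0.0, q[0] - (self.x0 + self.size))
        dy = max(self.y0 - q[1], 0.0, q[1] - (self.y0 + self.size))
        return math.hypot(dx, dy)


def nearest(tree, q):
    """Lazily yield (point, distance) pairs in order of increasing distance from q."""
    tie = itertools.count()  # tie-breaker so the heap never compares nodes or points
    heap = [(tree.min_dist(q), next(tie), tree)]
    while heap:
        d, _, item = heapq.heappop(heap)
        if isinstance(item, PRQuadtree):
            if item.children is None:
                for p in item.points:
                    heapq.heappush(heap, (math.dist(p, q), next(tie), p))
            else:
                for child in item.children:
                    heapq.heappush(heap, (child.min_dist(q), next(tie), child))
        else:
            yield item, d
```

Because the priority queue mixes blocks and points, a block is only opened when nothing closer remains, which is what allows the next-nearest point to be requested on demand, for example with `itertools.islice(nearest(tree, q), k)`.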

The above is set in the context of the traditional geometric representation of spatial data. In the final part, we review the more recent textual representation, which is used in location-based services, where the key issue is that of resolving ambiguities. For example, does "London" correspond to the name of a person or a location, and if it corresponds to a location, which of the over 700 different instances of "London" is it? The NewsStand system at newsstand.umiacs.umd.edu and the TwitterStand system at TwitterStand.umiacs.umd.edu are examples. See also the cover article of the October 2014 issue of Communications of the ACM (or a cached version) and the accompanying video.


  1. Introduction
    1. Sample queries
    2. Spatial Indexing
    3. Sorting approach
    4. Minimum bounding rectangles (e.g., R-tree)
    5. Disjoint cells (e.g., R+-tree, k-d-B-tree)
    6. Uniform grid
    7. Location-based queries vs. feature-based queries
    8. Region quadtree
    9. Dimension reduction
    10. Pyramid
    11. Region quadtrees vs. pyramids
    12. Space ordering methods
  2. Points
    1. Point quadtree
    2. MX quadtree
    3. PR quadtree
    4. k-d tree
    5. Bintree
    6. BSP tree
  3. Lines
    1. Strip tree
    2. PM1 quadtree
    3. PM2 quadtree
    4. PM3 quadtree
    5. PMR quadtree
  4. Rectangles and arbitrary objects
    1. MX-CIF quadtree
    2. Loose quadtree
    3. Partition fieldtree
    4. R-tree
  5. Surfaces and Volumes
    1. Restricted quadtree
    2. Region octree
    3. PM octree
  6. Metric Data
    1. vp-tree
    2. gh-tree
    3. mb-tree
  7. Operations
    1. Incremental nearest object location
    2. Boolean set operations
  8. Spatial Database Issues
    1. General issues
    2. Specific issues
  9. Indexing spatiotextual data for location-based services delivered on platforms such as smart phones and tablets
    1. Incorporation of spatial synonyms in search engines
    2. Toponym recognition
    3. Toponym resolution
    4. Spatial reader scope
    5. Incorporation of spatiotemporal data
    6. System integration issues
    7. Demos of live systems on smart phones
  10. Example systems
    1. AND internet browser
    2. JAVA spatial data applets
    3. STEWARD
    4. NewsStand
    5. TwitterStand


  1. H. Samet. Foundations of Multidimensional and Metric Data Structures. Morgan-Kaufmann, San Francisco, 2006
  2. H. Samet. A sorting approach to indexing spatial data. International Journal of Shape Modeling, 14(1):15--37, June 2008
  3. G. R. Hjaltason and H. Samet. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems, 28(4):517--580, December 2003
  4. G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. ACM Transactions on Database Systems, 24(2):265--318, June 1999. Also Computer Science TR-3919, University of Maryland, College Park, MD
  5. G. R. Hjaltason and H. Samet. Ranking in spatial databases. In Advances in Spatial Databases --- 4th International Symposium, SSD'95, M. J. Egenhofer and J. R. Herring, eds., pages 83--95, Portland, ME, August 1995. Also Springer-Verlag Lecture Notes in Computer Science 951
  6. H. Samet. Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison-Wesley, Reading, MA, 1990
  7. H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA, 1990
  8. C. Esperanca and H. Samet. Experience with SAND/Tcl: a scripting tool for spatial databases. Journal of Visual Languages and Computing, 13(2):229--255, April 2002
  9. H. Samet, H. Alborzi, F. Brabec, C. Esperanca, G. R. Hjaltason, F. Morgan, and E. Tanin. Use of the SAND spatial browser for digital government applications. Communications of the ACM, 46(1):63--66, January 2003
  10. B. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. NewsStand: A new view on news. Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 144--153, Irvine, CA, November 2008
  11. H. Samet, J. Sankaranarayanan, M. D. Lieberman, M. D. Adelfio, B. C. Fruin, J. M. Lotkowski, D. Panozzo, J. Sperling, and B. E. Teitler. Reading news with maps by exploiting spatial synonyms. Communications of the ACM, 57(10):64--77, October 2014
  12. J. Sankaranarayanan, H. Samet, B. Teitler, M. D. Lieberman, and J. Sperling. TwitterStand: News in tweets. Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 42--51, Seattle, WA, November 2009
  13. M. D. Lieberman, H. Samet, and J. Sankaranarayanan. Geotagging with local lexicons to build indexes for textually-specified spatial data. Proceedings of the 26th IEEE International Conference on Data Engineering, pages 201--212, Long Beach, CA, March 2010
  14. M. D. Lieberman and H. Samet. Multifaceted Toponym Recognition for Streaming News. Proceedings of the ACM SIGIR Conference, pages 843--852, Beijing, July 2011
  15. M. D. Lieberman and H. Samet. Adaptive Context Features for Toponym Resolution in Streaming News. Proceedings of the ACM SIGIR Conference, pages 731--740, Portland, OR, August 2012
  16. Spatial Data Structure applets

Intended audience and pre-requisites. Practitioners working in the areas of spatial databases, geographic information systems, and location-based services will be given a different perspective on data structures found to be useful in most applications. Familiarity with computer terminology and some programming experience is needed to follow this course.

Bio. Hanan Samet is a Distinguished University Professor of Computer Science at the University of Maryland, College Park and is a member of the Institute for Computer Studies. He is also a member of the Computer Vision Laboratory at the Center for Automation Research, where he leads a number of research projects on the use of hierarchical data structures for database applications, geographic information systems, computer graphics, computer vision, image processing, games, robotics, and search. He received the B.S. degree in engineering from UCLA, and the M.S. degree in operations research and the M.S. and Ph.D. degrees in computer science from Stanford University. His doctoral dissertation dealt with proving the correctness of translations of LISP programs, which was the first work in translation validation and the related concept of proof-carrying code. He is the author of the recent book "Foundations of Multidimensional and Metric Data Structures," published by Morgan-Kaufmann, an imprint of Elsevier, in 2006, an award winner in the 2006 best book in Computer and Information Science competition of the Professional and Scholarly Publishers (PSP) Group of the Association of American Publishers (AAP), and of the first two books on spatial data structures, "Design and Analysis of Spatial Data Structures" and "Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS," both published by Addison-Wesley in 1990.
He is the Founding Editor-In-Chief of the ACM Transactions on Spatial Algorithms and Systems (TSAS), the founding chair of ACM SIGSPATIAL, a recipient of a Science Foundation of Ireland (SFI) Walton Visitor Award at the Centre for Geocomputation at the National University of Ireland at Maynooth (NUIM), the 2009 UCGIS Research Award, the 2010 CMPS Board of Visitors Award at the University of Maryland, the 2011 ACM Paris Kanellakis Theory and Practice Award, and the 2014 IEEE Computer Society Wallace McDowell Award, and a Fellow of the ACM, IEEE, AAAS, IAPR (International Association for Pattern Recognition), and UCGIS (University Consortium for Geographic Information Science). He received best paper awards in the 2007 Computers & Graphics Journal, the 2008 ACM SIGMOD and SIGSPATIAL ACMGIS Conferences, the 2012 SIGSPATIAL MobiGIS Workshop, and the 2013 SIGSPATIAL GIR Workshop, as well as a best demo award at the 2011 SIGSPATIAL ACMGIS'11 Conference. His paper at the 2009 IEEE International Conference on Data Engineering (ICDE) was selected as one of the best papers for publication in the IEEE Transactions on Knowledge and Data Engineering. He was elected to the ACM Council as the Capitol Region Representative for the term 1989-1991, and is an ACM Distinguished Speaker.

Jaideep Srivastava (Qatar Computing Research Institute), [intermediate] Social Computing: Computing as an Integral Tool to Understanding Human Behavior and Solving Problems of Social Relevance


Summary. Social Computing is an emerging discipline, and like any discipline at a nascent stage it can mean different things to different people. However, three distinct threads are emerging. The first thread is often called Socio-Technical Systems, which focuses on building systems that allow large-scale interactions of people, whether for a specific purpose or in general. Examples include social networks like Facebook and Google Plus, and Multi Player Online Games like World of Warcraft and Farmville. The second thread is often called Computational Social Science, whose goal is to use computing as an integral tool to push the research boundaries of various social and behavioral science disciplines, primarily Sociology, Economics, and Psychology. The third is the idea of solving problems of societal relevance using a combination of computing and humans.

The three modules of this course are structured according to this description. The goal of this course is to discuss, in a tutorial manner and through case studies and discussion, what Social Computing is, where it is headed, and where it is taking us.


References. To be provided later.

Pre-requisites. This course is intended primarily for graduate students. Following are the potential audiences:

Bio. Jaideep Srivastava is the Director of the Social Computing division at QCRI. He is on leave from the University of Minnesota, where he directs a laboratory focusing on research in Web Mining, Social Analytics, and Health Analytics. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), has been an IEEE Distinguished Visitor, and is a Distinguished Fellow of Allina’s Center for Healthcare Innovation. He has been awarded the Distinguished Research Contributions Award of the PAKDD, for his lifetime contributions to the field of data mining. Six of his papers have won best paper awards.

Dr. Srivastava is currently co-leading a multi-institutional, multi-disciplinary project in the rapidly emerging area of social computing. He has significant experience in the industry, in both consulting and executive roles. He has led a data mining team at Amazon.com, built a data analytics department at Yodlee, and served as the Chief Technology Officer for Persistent Systems. He is a Co-Founder of Ninja Metrics, and an adviser and Chief Scientist of CogCubed, an innovative company whose goal is to revolutionize the diagnosis and therapy of cognitive disorders through the use of online games.

Dr. Srivastava has held distinguished professorships at Heilongjiang University and Wuhan University, China. He has held advisory positions with the State of Minnesota and the State of Maharashtra, India. He is a technology advisor to the Unique ID (UID) project of the Government of India, whose goal is to provide biometrics-based social security numbers to the 1.2 billion citizens of India. He has a Bachelor of Technology from the Indian Institute of Technology (IIT), Kanpur, India, and an MS and PhD from the University of California, Berkeley.

Jeffrey Ullman (Stanford University), [introductory] Big Data Algorithms that Aren't Machine Learning


Summary. We shall study algorithms that have been found useful in querying large data volumes. The emphasis is on algorithms that cannot be considered "machine learning."


References. We will be covering (parts of) Chapters 3, 4, 5, and 10 of the free text

Pre-requisites. A course in algorithms at the advanced-undergraduate level is important. A course in database systems is helpful, but not required.

Bio. Stanford page.

Alexandre Vaniachine (Argonne National Laboratory), [introductory/advanced] Big Data: Comparison with Computational Models


Summary. The scientific goals of the Large Hadron Collider include high precision tests of the Standard Model and searches for new physics. These goals require detailed comparison of data with computational models simulating the expected data behavior. We highlight the role that modeling and simulation play in scientific discovery, and share experience with methods for processing real and simulated data that are growing in volume and variety.


  1. Do not let the data speak for themselves: comparison of data-driven vs. model-based approaches; signal and noise. (2 hours, introductory)
  2. Big Data Processing in High Energy Physics: Distributed Computing; Data Science methods in Higgs boson discovery. (2 hours, intermediate)
  3. Higgs Boson Machine Learning Challenge: practical data analysis. (2 hours, advanced)


  1. http://www.fourthparadigm.org
  2. Bird I, Computing for the Large Hadron Collider. Annu. Rev. Nucl. Part. S. (2011) 61: 99
  3. A.V. Vaniachine on behalf of the ATLAS and CMS Collaborations. Advancements in Big Data Processing in the ATLAS and CMS Experiments, arXiv:1303.1950, 2013
  4. ATLAS collaboration. Dataset from the ATLAS Higgs Boson Machine Learning Challenge, CERN Open Data Portal. DOI:10.7483/OPENDATA.ATLAS.ZBP2.M5T8, 2014
  5. Adam-Bourdarios C, Cowan G, Germain C, Guyon I, Kégl B et al. Learning to discover: the Higgs boson machine learning challenge - Documentation. CERN Open Data Portal. DOI:10.7483/OPENDATA.ATLAS.MQ5J.GHXA, 2014

Pre-requisites. Lecture 3: familiarity with Python, machine learning and multivariate analysis will be useful, but is not required to follow this lecture.

Bio. Argonne page.

Xiaowei Xu (University of Arkansas, Little Rock), [introductory/advanced] Big Data Analytics for Social Networks


Summary. Recent explosive growth of online social networks such as Facebook and Twitter provides a unique opportunity for many data mining applications including real time event detection, community structure detection and viral marketing. The course covers big data analytics for social networks. The emphasis will be on scalable algorithms for community structure detection and structural pattern mining.
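As a small, hedged illustration of structural clustering for community detection, the sketch below follows the general idea of the SCAN algorithm cited in the reading list: two adjacent vertices are considered similar when their closed neighborhoods overlap strongly, and clusters are grown from "core" vertices that have enough similar neighbors. It is a simplified sketch, not the published algorithm in full: the function names and the default values of `eps` and `mu` are illustrative assumptions, and vertices left without a label are not further classified into hubs and outliers.

```python
import math
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map {vertex: set of neighbors} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def structural_similarity(adj, u, v):
    """Overlap of the closed neighborhoods of u and v, normalized to [0, 1]."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))


def scan_clusters(adj, eps=0.7, mu=3):
    """Return {vertex: cluster id} for vertices reachable from core vertices."""

    def eps_neighborhood(v):
        return {w for w in adj[v] | {v} if structural_similarity(adj, v, w) >= eps}

    labels = {}
    next_id = 0
    for v in adj:
        if v in labels or len(eps_neighborhood(v)) < mu:
            continue  # already clustered, or not a core vertex
        labels[v] = next_id
        queue = deque([v])
        while queue:
            x = queue.popleft()
            similar = eps_neighborhood(x)
            if len(similar) < mu:
                continue  # border vertex: joins the cluster but does not expand it
            for w in similar:
                if w not in labels:
                    labels[w] = next_id
                    queue.append(w)
        next_id += 1
    return labels
```

On a graph made of two densely connected groups joined by a single bridging edge, the similarity across the bridge is low, so the two groups would be expected to end up in different clusters.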



  1. Aaron Clauset, M. E. J. Newman, and Cristopher Moore, Finding community structure in very large networks, Phys. Rev. E 70, 066111, 2004
  2. X. Xu, N. Yuruk, Z. Feng, and T. A. Schweiger. Scan: a structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 824–833. ACM, 2007
  3. Raghavan, Usha Nandini and Albert, Reka and Kumara, Soundar, Near linear time algorithm to detect community structures in large-scale networks, Phys. Rev. E 76, 036106, 2007
  4. S. Sintos and P. Tsaparas. Using strong triadic closure to characterize ties in social networks. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1466–1475. ACM, 2014
  5. Weizhong Zhao, Venkata Swamy Martha, and Xiaowei Xu. PSCAN: A Parallel Structural Clustering Algorithm for Big Networks in MapReduce. AINA 2013, pages 862-869

Pre-requisites. Basic knowledge in computer algorithms and graph theory.

Bio. Xiaowei Xu is a professor in the Department of Information Science at the University of Arkansas at Little Rock (UALR). He received his Ph.D. in computer science from the University of Munich in 1998. Prior to his appointment at UALR, Dr. Xu was a senior research scientist in Siemens Corporate Technology. Dr. Xu is an adjunct professor in the Department of Mathematics at the University of Arkansas. Dr. Xu was an Oak Ridge Institute for Science and Education (ORISE) Faculty Research Program Member in the National Center for Toxicological Research's (NCTR) Center for Bioinformatics in the Division of Systems Biology from 2010 to 2014. He is also a consultant for companies including Siemens, Acxiom, Dataminr and Neusoft. Dr. Xu's research focuses on algorithms for data mining and machine learning. Dr. Xu is a recipient of the 2014 ACM SIGKDD Test of Time Award for his work on the density-based clustering algorithm DBSCAN, which has received over 7,000 citations according to Google Scholar. Dr. Xu has served as a program committee member and session chair for premier forums including the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) and the IEEE International Conference on Data Mining (ICDM).

Mohammed J. Zaki (Rensselaer Polytechnic Institute), [introductory/intermediate] Large Scale Graph Analytics and Mining


Summary. Increasingly, today's massive data is in the form of complex graphs and networks. Examples include the world wide web, social networks, biological networks, semantic networks, and so on. The study of such complex and large scale networks (often referred to as network science) – understanding their intrinsic properties, changes to their structure over time or due to other external factors, and the behavior of entities and communities within them – can afford important insight to domain researchers and organizations alike. Given that networks are part and parcel of the complex and connected social, physical and biological world we live in, a coordinated and concerted approach combining these strands of research is essential to make progress on this grand-challenge complex systems problem of our time.

In this course, we study the fundamental algorithms to model and mine graph data. We focus on graph and network modeling, graph pattern mining, as well as graph clustering and classification tasks. We will also cover both the algorithms and frameworks for parallel and distributed graph mining over massive graphs.


References. We will cover material from Chapters 4, 5, 11, 13, and 16 of our freely available textbook (and other supplements)

Pre-requisites. Introductory courses in discrete mathematics, linear algebra, and probability and statistics.

Bio. Mohammed J. Zaki is a Professor of Computer Science at RPI. He received his Ph.D. degree in computer science from the University of Rochester in 1998. His research interests focus on developing novel data mining techniques, especially for applications in bioinformatics and social networks. He has published over 225 papers and book-chapters on data mining and bioinformatics, including the textbook "Data Mining and Analysis: Fundamental Concepts and Algorithms," Cambridge University Press, 2014. He is the founding co-chair for the BIOKDD series of workshops. He is currently Area Editor for Statistical Analysis and Data Mining, and an Associate Editor for Data Mining and Knowledge Discovery, ACM Transactions on Knowledge Discovery from Data, and Social Networks and Mining. He was the program co-chair for SDM'08, SIGKDD'09, PAKDD'10, BIBM'11, CIKM'12, and ICDM'12. He is currently the program co-chair for IEEE BigData'15. He is currently serving on the Board of Directors for ACM SIGKDD. He received the National Science Foundation CAREER Award in 2001 and the Department of Energy Early Career Principal Investigator Award in 2002. He received an HP Innovation Research Award in 2010, 2011, and 2012, and a Google Faculty Research Award in 2011. He is a senior member of the IEEE, and an ACM Distinguished Scientist.