4th International Winter School on Big Data

Timişoara, Romania, January 22-26, 2018

Course Description


Keynotes

Courses







Keynotes


Bing Liu   
Distinguished Professor, Department of Computer Science, University of Illinois at Chicago (UIC)
Towards Machines that Learn like Humans

Abstract

The classic machine learning (ML) paradigm works by learning a model from a large set of labeled examples. However, humans probably don't learn this way. In my life, I have never been given a set of labeled documents and asked to build a text classifier. Even if I am given 10,000 labeled documents, I will not be able to do it if I do not understand the language. Moreover, after learning to recognize a set of objects, we can not only recognize these objects but also identify objects that do not belong to the learned set. We can also learn new knowledge continuously without forgetting what we have learned in the past. We can also easily interact with a dynamic environment and learn from it efficiently, whereas current reinforcement learning algorithms are too inefficient and hard to use in practice, although they do extremely well in games. In this talk, I will describe some of these shortcomings of the classic learning paradigm based on my practical experiences in sentiment analysis and self-driving cars, and discuss how we are trying to pursue a paradigm shift and build machines that learn like humans.

Short Bio

Bing Liu is a distinguished professor of Computer Science at the University of Illinois at Chicago. He received his Ph.D. in Artificial Intelligence from the University of Edinburgh. His research interests include lifelong learning, sentiment analysis, data mining, machine learning, and natural language processing. He has published extensively in top conferences and journals. Two of his papers have received 10-year Test-of-Time awards from KDD. He also authored four books: one on lifelong learning, two on sentiment analysis, and one on Web mining. Some of his work has been widely reported in the press, including a front-page article in the New York Times. On professional services, he served as the Chair of ACM SIGKDD (ACM Special Interest Group on Knowledge Discovery and Data Mining) from 2013-2017. He has also served as program chair of many leading data mining conferences, including KDD, ICDM, CIKM, WSDM, SDM, and PAKDD, as associate editor of leading journals such as TKDE, TWEB, and DMKD, and as area chair or senior PC member of numerous natural language processing, AI, Web, and data mining conferences. He is a Fellow of ACM, AAAI and IEEE.








Jeffrey Ullman   
Stanford W. Ascherman Professor of Computer Science (Emeritus)
Data Science: Is it Real?

Abstract

We shall discuss the various ways in which data science is approached by different communities, including the Statistics, Machine Learning, and Database communities. Each presents a different viewpoint and values different outcomes. Some consequences of these approaches will be discussed. We also contrast approaches to educating the large number of data scientists expected to be needed in the near future.

Short Bio

Link to the bio







Courses


Paul Bliese   
Associate Professor of Business Administration in the Management Department of the Darla Moore School of Business at the University of South Carolina.
Using R for Mixed-effects (Multilevel) Models [introductory/intermediate]

Summary:

Mixed-effects or multilevel models are commonly used when data have some form of nested structure. For instance, individuals may be nested within workgroups, or repeated measures may be nested within individuals. Nested structures in data are often accompanied by some form of non-independence. For example, in work settings, individuals in the same workgroup typically display some degree of similarity with respect to performance, or they provide similar responses to questions about aspects of the work environment. Likewise, in repeated measures data, individuals usually display a high degree of similarity in responses over time. Non-independence may be considered either a nuisance variable or something to be substantively modeled, but the prevalence of nested data requires that analysts have a variety of tools for handling it. This course provides an introduction to (1) the theoretical foundation, and (2) resources necessary to conduct a wide range of multilevel analyses. All practical exercises are conducted in R. Participants are encouraged to bring datasets to the course and apply the principles to their specific areas of research.
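As a concrete illustration of how such non-independence is quantified, the intraclass correlation ICC(1) compares between-group to within-group variance. The sketch below is in Python with made-up workgroup scores, purely for illustration; the course itself works in R with the nlme and multilevel packages.

```python
from statistics import mean

# Toy nested data: individual scores nested within three workgroups.
# Scores within a group are more alike than scores across groups,
# i.e., the observations are non-independent.
groups = {
    "A": [4.1, 4.3, 4.0, 4.2],
    "B": [2.9, 3.1, 3.0, 3.2],
    "C": [5.0, 5.2, 4.9, 5.1],
}

k = 4                       # common group size
n_groups = len(groups)
grand_mean = mean(x for g in groups.values() for x in g)

# One-way ANOVA mean squares.
ss_between = sum(k * (mean(g) - grand_mean) ** 2 for g in groups.values())
ms_between = ss_between / (n_groups - 1)
ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
ms_within = ss_within / (n_groups * (k - 1))

# ICC(1): proportion of variance attributable to group membership.
icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(round(icc1, 3))   # ≈ 0.984
```

A value near 1, as here, means group membership accounts for most of the variance — exactly the situation in which ignoring the nesting would be misleading.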

Syllabus:

Session 1:
1. Introduction and overview of Multilevel Models
2. Introduction to R and the nlme and multilevel packages
Session 2:
3. Nested Data and Mixed-Effects Models in nlme
4. R Code for Models and Introduction to Functions Commonly used in Data Manipulation
Session 3:
5. Repeated Measures data and Growth Models in nlme
6. R Code for Models and Introduction to Functions Commonly used in Data Manipulation

Pre-requisites:

Basic understanding of regression. An installed version of R (https://cran.r-project.org/) on a laptop for completing exercises. Users are also encouraged to install RStudio (https://www.rstudio.com/).

References:

Bliese, P. D. (2016). Multilevel Modeling in R (v. 2.6). https://cran.r-project.org/doc/contrib/Bliese_Multilevel.pdf

Short Bio

Paul D. Bliese, Ph.D. joined the Management Department at the Darla Moore School of Business, University of South Carolina in 2014. Prior to joining South Carolina, he spent 22 years as a research psychologist at the Walter Reed Army Institute of Research, where he conducted research on stress, adaptation, leadership, well-being, and performance. Professor Bliese has long-term interests in understanding how analytics contribute to theory development and in applying analytics to complex organizational problems. He built and maintains the multilevel package for R. Professor Bliese has served on numerous editorial boards, and has been an associate editor at the Journal of Applied Psychology since 2010. In July of 2017 he took over as editor-in-chief of Organizational Research Methods.








Hendrik Blockeel   
Katholieke Universiteit Leuven
Decision Trees for Big Data Analytics [intermediate]

Summary:

Decision trees, and derived methods such as Random Forests, are among the most popular methods for learning predictive models from data. This is to a large extent due to the versatility and efficiency of these algorithms. This course will introduce students to the basic methods for learning decision trees, as well as to variations and more sophisticated versions of decision tree learners, with a particular focus on those methods that make decision trees work in the context of big data.
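All the tree variants mentioned above build on the same greedy split search. The following is a minimal, illustrative Python version of impurity-based split selection for a single numeric feature (the data, names, and exhaustive threshold scan are our toy assumptions; the course covers the full algorithms):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Pick the threshold on one numeric feature that minimizes the
    weighted Gini impurity of the two resulting children."""
    best = (float("inf"), None)
    for t in sorted(set(xs))[:-1]:          # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[0]:
            best = (score, t)
    return best[1]

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))   # the threshold that separates the two classes
```

A full tree learner applies this search recursively over all features; the big-data variants in the course replace the exhaustive scan with sampled or approximate statistics.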

Syllabus:

Classification and regression trees, multi-output trees, clustering trees, model trees, ensembles of trees, incremental learning of trees, learning decision trees from large datasets, learning from data streams.

Pre-requisites:

Familiarity with mathematics, probability theory, statistics, and algorithms is expected, at the level at which these subjects are typically introduced in bachelor-level computer science or engineering programs.

References:

Relevant references will be provided as part of the course.

Short Bio

Hendrik Blockeel is a full professor at KU Leuven, Belgium. He received his Ph.D. degree in 1998 from KU Leuven. His research interests lie mostly within artificial intelligence, with a focus on machine learning and data mining, and in the use of AI-based modeling in other sciences. Prof. Blockeel has made a variety of contributions on topics such as inductive logic programming, probabilistic-logical learning, and decision trees. He is an action editor for the journals Machine Learning and Data Mining and Knowledge Discovery, and a member of the editorial board of several other journals. He has chaired or organized several conferences, including ECMLPKDD 2013, and organized the ACAI summer school in 2007. He has served on the board of the European Coordinating Committee for Artificial Intelligence (now EurAI), and currently serves on the ECMLPKDD steering committee as publications chair. He has been a EurAI Fellow since 2015.








Saso Dzeroski   
Jozef Stefan Institute, Dept. of Knowledge Technologies
Multi-target Prediction: Techniques and Applications [introductory/intermediate]

Summary:

Increasingly often, data mining has to learn predictive models from big data, which may have many examples and many input/output dimensions, or may be streaming at very high rates. When more than one target variable has to be predicted, we talk about multi-target prediction. Predictive modeling problems may also be complex in other ways, e.g., they may involve incompletely/partially labelled data or data placed in a network context.

The course will first give an introduction to the different tasks of multi-target prediction, such as multi-target classification and regression, hierarchical versions thereof, and versions of the tasks that involve additional complexity (such as semi-supervised multi-target regression). It will then present methods, first basic and then advanced, for solving such tasks. Finally, it will review different applications of multi-target prediction, ranging from gene function prediction, through image annotation, to space exploration.
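To make the task itself concrete: in multi-target prediction a single model outputs a whole vector of targets at once. The toy Python sketch below uses a 1-nearest-neighbour predictor purely for illustration (the data and method are made up; the course covers tree- and rule-based methods and their ensembles):

```python
# Toy multi-target regression: each training example pairs one feature
# vector with *several* target values that are predicted jointly.

def predict_multi(x, train):
    """Return the full target vector of the training example closest to x."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda ex: dist2(ex[0], x))[1]

train = [
    ([0.0, 0.0], (1.0, 10.0)),   # (features, (target1, target2))
    ([1.0, 1.0], (2.0, 20.0)),
    ([5.0, 5.0], (9.0, 90.0)),
]
print(predict_multi([4.5, 5.2], train))   # nearest neighbour's target vector
```

The point of dedicated multi-target methods is that one model predicting all targets jointly can exploit correlations between them, rather than fitting an independent model per target.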

Syllabus:

Session I: Tasks and basic methods
- The different tasks of multi-target prediction: multi-target regression, multi-target classification, multi-label classification, hierarchical versions of the tasks
- Additional complexity aspects: Incomplete annotations, Massive/streaming data, Network context
- Trees and rules (and ensembles thereof) for multi-target prediction
Session II: Advanced methods
- Feature ranking for multi-target prediction (Extensions of ReliefF, Ranking based on tree ensembles)
- Semi-supervised multi-target prediction
- Multi-target prediction on data streams
Session III: Applications
- Applications of multi-target prediction in ecology and environmental sciences
- Applications of multi-target prediction in medicine and life sciences
- Miscellaneous applications (e.g., image annotation and retrieval)

Pre-requisites:

Familiarity with algorithms/computer programming and mathematics/statistics will be assumed, at the level typically taught in undergraduate engineering/computer science programs. A basic understanding of machine learning and data mining would be helpful, but is not strictly necessary.

References:

Appropriate references will be provided in the lecture notes (slides) for the course.

Short Bio

Saso Dzeroski (Sašo Džeroski) is a scientific councillor at the Jozef Stefan Institute and the Centre of Excellence for Integrated Approaches in Chemistry and Biology of Proteins, both in Ljubljana, Slovenia. He is also a full professor at the Jozef Stefan International Postgraduate School and the University of Ljubljana, Faculty of Computer and Information Sciences. He leads a twenty-strong research group that investigates machine learning and data mining methods (including structured output prediction and automated modeling of dynamic systems), as well as their applications (in environmental sciences, incl. ecology/ecological modelling, and life sciences, incl. systems biology/systems medicine).
He has participated in many international research projects and has coordinated three of them, including the FET XTrack project MAESTRA (Learning from Massive, Incompletely annotated, and Structured Data). He is currently one of the principal investigators in the FET Flagship Human Brain Project. He has been scientific and/or organizational chair of numerous international conferences, including ECML PKDD 2017, DS-2014, MLSB-2009/10, ECEM and EAML-2004, ICML-1999 and ILP-1997/99. He has been a fellow of EurAI, the European Association for Artificial Intelligence (formerly ECCAI), since 2008, a foreign member of the Macedonian Academy of Sciences and Arts since 2015, and a member of Academia Europaea (the European Academy) since 2016.








Geoffrey C. Fox   
Chair, Intelligent Systems Engineering, School of Informatics and Computing; Distinguished Professor of Computing, Engineering and Physics; Director of the Digital Science Center, Indiana University – Bloomington
Integration of HPC, Big Data Analytics and Software Ecosystem [Intermediate]

Summary:

Two major trends in computing systems are the growth in high performance computing (HPC) with an international exascale initiative, and the big data phenomenon with an accompanying cloud infrastructure of well-publicized, dramatic and increasing size and sophistication. This tutorial weaves these trends together using some key building blocks. The first is HPC-ABDS, the High Performance Computing (HPC) enhanced Apache Big Data Stack (ABDS). Here we aim at using the major open-source Big Data software environment while developing the principles that allow the use of HPC software and hardware to achieve good performance. We give several examples of software (for example, Hadoop and Heron) and of algorithms implemented in this software. The second building block is the SPIDAL library (Scalable Parallel Interoperable Data Analytics Library) of scalable machine learning and data analysis software. We give examples including clustering, topic modelling and dimension reduction, and their visualization with a framework called Harp. The third building block is an analysis of simulation and big data use cases in terms of 64 separate features (varying from data volume to “suitable for MapReduce” to kernel algorithm used). This allows an understanding of what type of hardware and software is needed for what type of exhibited features; it allows one to either unify or distinguish applications across the simulation and Big Data regimes. We show that supporting a broad range of applications requires a variety of capabilities that seem best packaged as a reconfigurable toolkit, Twister2.

Syllabus:

Session 1: HPC-ABDS and the Ogres
-Rationale for using ABDS (Apache Big Data Stack)
-Architecture of ABDS
-Reasons to enhance ABDS with HPC
-Motivating Applications and Big Data Ogres
-Examples including Harp (for Hadoop), HPC-Heron; rationale for Twister2

Session 2: Twister2 and Harp
-Design of Twister2 -- a toolkit of the parts in Heron, Spark, Flink, Hadoop, MPI, Harp
-Design of Harp -- a High Performance Machine Learning Framework
-Using Harp and Twister2

Session 3: SPIDAL Scalable Parallel Interoperable Data Analytics Library
-Some important issues in getting high performance in parallel applications
-A few short discussions of individual machine learning cases and their use in applications
-These are intermixed with performance results, including accelerators
-'SPIDAL Java' -- principles to make Java run fast on parallel applications

Pre-requisites:

Some familiarity with ABDS software such as Hadoop, Spark, Flink, Storm, and Heron, and with HPC technologies such as MPI, would be helpful, as would some familiarity with parallel computing (algorithms and software) and with data analytics.

References:

Geoffrey Fox, David Crandall, Judy Qiu, Gregor Von Laszewski, Shantenu Jha, John Paden, Oliver Beckstein, Tom Cheatham, Madhav Marathe, Fusheng Wang, 'Tutorial Program', BigDat 2017 MIDAS and SPIDAL Tutorial, Bari, Italy, February 13-14, 2017
http://dsc.soic.indiana.edu/publications/SPIDAL-DIBBSreport_July2016.pdf 21 month report of SPIDAL(Scalable Parallel Interoperable Data Analytics Library) project.
Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe, Vibhatha Abeykoon, Geoffrey Fox, 'Twister2: Design of a Big Data Toolkit'
http://hpc-abds.org/kaleidoscope/ HPC-ABDS and Big Data Ogres Analysis
Geoffrey C. Fox, Vatche Ishakian, Vinod Muthusamy, Aleksander Slominski, 'Status of Serverless Computing and Function-as-a-Service(FaaS) in Industry and Research', Report from workshop and panel at the First International Workshop on Serverless Computing (WoSC) Atlanta, June 5 2017
B. Peng, B. Zhang, L. Chen, M. Avram, R. Henschel, C. Stewart, S. Zhu, E. Mccallum, L. Smith, T. Zahniser, J. Omer, J. Qiu. 'HarpLDA+: Optimizing Latent Dirichlet Allocation for Parallel Efficiency' Technical Report (August 2017)
Supun Kamburugamuve, Pulasthi Wickramasinghe, Saliya Ekanayake, Geoffrey C. Fox, 'Anatomy of machine learning algorithm implementations in MPI, Spark, and Flink', International Journal of High Performance Computing Applications, to be published.
Also see Projects (with updates) at https://www.researchgate.net/profile/Geoffrey_Fox and presentations at https://www.dsc.soic.indiana.edu/presentations

Short Bio

Geoffrey Fox received a Ph.D. in Theoretical Physics from Cambridge University where he was Senior Wrangler. He is now a distinguished professor of Engineering, Computing, and Physics at Indiana University where he is director of the Digital Science Center, and both Department Chair and Associate Dean for Intelligent Systems Engineering at the School of Informatics, Computing, and Engineering. He previously held positions at Caltech, Syracuse University, and Florida State University after being a postdoc at the Institute for Advanced Study at Princeton, Lawrence Berkeley Laboratory, and Peterhouse College Cambridge. He has supervised the Ph.D. theses of 70 students and published around 1300 papers (over 450 with at least 10 citations) in physics and computing, with an h-index of 75 and over 31,500 citations. He is a Fellow of APS (Physics) and ACM (Computing) and works on the interdisciplinary interface between computing and applications. He currently researches the application of computer science from infrastructure to analytics in Biology, Pathology, Sensor Clouds, Earthquake and Ice-sheet Science, Image processing, Deep Learning, Network Science, Financial Systems and Particle Physics. The infrastructure work is built around Software Defined Systems on Clouds and Clusters. The analytics focuses on scalable parallelism. He is an expert on streaming data and robot-cloud interactions. He is involved in several projects to enhance the capabilities of Minority Serving Institutions. He has experience in online education and its use in MOOCs for areas like Data and Computational Science.








Minos Garofalakis   
Professor, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece
Data Streaming Analytics [intermediate/advanced]

Summary:

Effective Big Data analytics need to rely on algorithms for querying and analyzing massive, continuous data streams (that is, data that is seen only once and in a fixed order) with limited memory and CPU-time resources. Such streams arise naturally in emerging large-scale event monitoring applications; for instance, network-operations monitoring in large ISPs, where usage information from numerous network devices needs to be continuously collected and analyzed for interesting trends and real-time reaction to different scenarios (e.g., hotspots or DDoS attacks). In addition to memory- and time-efficiency concerns, the inherently distributed nature of such applications also raises important communication-efficiency issues, making it critical to carefully optimize the use of the underlying communication infrastructure. This course will provide an overview of some key algorithmic tools for supporting effective, real-time analytics over streaming data. Our primary focus will be on small-space sketch synopses for approximating continuous data streams, and their applicability in both centralized and distributed settings.
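To give a feel for what a small-space sketch synopsis looks like, here is a minimal Python rendering of the Count-Min idea; the width, depth, and salted use of Python's built-in hash are illustrative simplifications of the pairwise-independent hash families used in the literature.

```python
import random

class CountMin:
    """Count-Min sketch: d rows of w counters. Point queries return
    overestimates whose error is bounded by the stream length / w."""
    def __init__(self, w=256, d=4, seed=42):
        rnd = random.Random(seed)
        self.w, self.d = w, d
        self.salts = [rnd.random() for _ in range(d)]   # one "hash" per row
        self.table = [[0] * w for _ in range(d)]

    def _cells(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.w

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def query(self, item):
        # Minimum over rows: never underestimates the true count.
        return min(self.table[row][col] for row, col in self._cells(item))

cm = CountMin()
for _ in range(1000):
    cm.add("10.0.0.1")     # heavy hitter, e.g. a flooding source IP
cm.add("10.0.0.2")
print(cm.query("10.0.0.1"))   # ≥ 1000; with so few items, typically exact
```

The sketch uses d*w counters regardless of how many distinct items the stream contains, which is precisely the memory guarantee that makes it usable for monitoring high-speed streams.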

Syllabus:

1. Introduction and Motivation
2. Data Streaming Models and Mathematical Tools
3. Basic Algorithmic Tools for Data Streams
   * Reservoir Sampling
   * Bag Synopses: AMS and CountMin Sketches
   * Set Synopses: FM Sketches and Distinct Sampling
4. The Sliding Window Model and Exponential Histograms
5. Distributed Data Streaming
   * Basic Models and Techniques
   * The Geometric Method and Convex Safe Zones
6. Conclusions and Looking Forward
7. (Time-permitting) Hands-on Experience with Streaming Tools
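Of the sampling tools above, reservoir sampling is the simplest to state completely. A short Python sketch of the classic Algorithm R (variable names are ours):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: maintain a uniform random sample of k items from a
    stream of unknown length, in one pass and O(k) memory."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)              # fill the reservoir first
        else:
            j = rng.randrange(i + 1)          # keep item with prob k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(10_000), 5))
```

Each item seen so far ends up in the reservoir with equal probability k/n, which is what makes the synopsis a valid uniform sample no matter when the stream is queried.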

Pre-requisites:

Database management systems, design and analysis of algorithms, randomized algorithms

References:

Surveys/Monographs:
1. Graham Cormode, Minos Garofalakis, Peter J. Haas, and Chris Jermaine. “Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches”, Foundations and Trends in Databases 4(1-3): 1-294 (2012)
2. Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi (Eds.). “Data-Stream Management — Processing High-Speed Data Streams”, Springer-Verlag, New York (Data-Centric Systems and Applications Series), July 2016 (ISBN 978-3-540-28607-3).
Papers:
1. Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. ACM STOC 1996.
2. Noga Alon, Phillip B. Gibbons, Yossi Matias, Mario Szegedy: Tracking Join and Self-Join Sizes in Limited Storage. ACM PODS 1999.
3. Graham Cormode, S. Muthukrishnan: An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. LATIN 2004.
4. Phillip B. Gibbons: Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. VLDB 2001.
5. Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani: Maintaining Stream Statistics over Sliding Windows. SIAM J. on Computing 31(6), 2002.
6. Graham Cormode, Minos N. Garofalakis: Approximate continuous querying over distributed streams. ACM Trans. Database Syst. 33(2), 2008.
7. Izchak Sharfman, Assaf Schuster, Daniel Keren: A geometric approach to monitoring threshold functions over distributed data streams. ACM SIGMOD Conference 2006.
8. Minos N. Garofalakis, Daniel Keren, Vasilis Samoladas: Sketch-based Geometric Monitoring of Distributed Stream Queries. PVLDB 6(10), 2013.
9. Arnon Lazerson, Izchak Sharfman, Daniel Keren, Assaf Schuster, Minos N. Garofalakis, Vasilis Samoladas: Monitoring Distributed Streams using Convex Decompositions. PVLDB 8(5), 2015.

Short Bio

Minos Garofalakis is the Director of the Institute for the Management of Information Systems (IMIS) at the Athena Research and Innovation Centre in Athens, Greece, and a Professor of Computer Science at the School of Electrical and Computer Engineering of the Technical University of Crete (TUC), where he also directs the Software Technology and Network Applications Laboratory (SoftNet). He received his PhD in Computer Science from the University of Wisconsin-Madison in 1998, and has held positions as a Member of Technical Staff at Bell Labs, Lucent Technologies in Murray Hill, NJ (1998-2005), as a Senior Researcher at Intel Research Berkeley in Berkeley, CA (2005-2007), and as a Principal Research Scientist at Yahoo! Research in Santa Clara, CA (2007-2008). In parallel, he also held an Adjunct Associate Professor position at the EECS Department of the University of California, Berkeley (2006-2008). Prof. Garofalakis's research interests are in the broad areas of Big Data analytics and large-scale machine learning, including database systems, centralized/distributed data streams, data synopses and approximate query processing, uncertain databases, and data mining and knowledge discovery. He has published over 150 scientific papers in top-tier international conferences and journals in these areas. His work has resulted in 36 US Patent filings (29 patents issued) for companies such as Lucent, Yahoo!, and AT&T. Google Scholar shows over 12,000 citations to Prof. Garofalakis's work, and an h-index of 60. He is an IEEE Fellow (Class of 2017, 'for contributions to data streaming analytics'), an ACM Distinguished Scientist (2011), and a recipient of the TUC 'Excellence in Research' Award (2015), the Bell Labs President's Gold Award (2004), and the Bell Labs Teamwork Award (2003).








David Gerbing   
Professor of Quantitative Methods. Portland State University
Data Visualization with R [introductory]

Summary:

This seminar introduces the R language via data visualization (computer graphics), in the context of a discussion of best practices and considerations for the analysis of big data. Code to generate the graphs is presented in terms of base R graphics, Hadley Wickham's ggplot2 package, and the author's lessR package. The content of the seminar is summarized in R Markdown files, available to all participants, that include commentary and implementations of all the code presented in the seminar. These explanatory examples serve as templates for applications to new data sets.

Syllabus:

Day 1
-----
Introduction to R
R functions and syntax
R variable types
Read data into R

Specialized Graphic Functions
Functions from the lessR package
The ggplot function from the ggplot2 package
Base R graphics

Themes

Day 2
-----
Bar Charts for Distributions of Categorical Variables
R factor variables
Counts of one variable
Joint frequencies of two variables
Statistics of a second variable plotted against one variable

Graphs for Distributions of a Continuous Variable
Histograms and binning
Densities
Boxplot
Scatterplot, 1-dimensional
Introduction to the integrated Violin/Box/Scatterplot, the VBS plot
Scatterplots, 2-dimensional
With two or more continuous variables
A categorical variable with a continuous variable
Bubble plots with categorical variables
Two variable plot with a third variable, categorical or continuous

Day 3
-----
Scatterplots, 2-dimensional (continued)
Visualization of relationships for big data sets
Time Series Plots
One-variable plot
Stacked time-series plot
Area plots
Forecasts
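The binning behind the histograms of Day 2 reduces to a small counting computation. A sketch in Python purely for illustration (the seminar itself works in R; the bin width and data below are made up):

```python
from collections import Counter

def bin_counts(values, bin_width, origin=0.0):
    """Equal-width binning: map each value to its bin's left edge and
    count occurrences — the computation underlying a histogram's bars."""
    counts = Counter(origin + bin_width * ((v - origin) // bin_width)
                     for v in values)
    return dict(sorted(counts.items()))

data = [1.2, 1.9, 2.5, 2.7, 3.1, 3.3, 3.8, 7.9]
print(bin_counts(data, bin_width=2.0))   # {0.0: 2, 2.0: 5, 6.0: 1}
```

The choice of bin width (and origin) is exactly what the seminar's discussion of binning is about: the same data can look unimodal or multimodal depending on these parameters.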

Pre-requisites:

Basic understanding of data analysis

References:

Gerbing, D. W. (2013). R Data Analysis without Programming, NY: Routledge.
Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

Short Bio

David Gerbing, Ph.D., since 1987 Professor of Quantitative Methods, School of Business Administration, Portland State University. He received his Ph.D. in quantitative psychology from Michigan State University in 1979. From 1979 until 1987 he was an Assistant and then Associate Professor of Psychology and Statistics at Baylor University. He has authored R Data Analysis without Programming, which describes his lessR package, and many articles on statistical techniques and their application in a variety of journals that span several academic disciplines.








Maurizio Lenzerini   
Full professor in Computer Science. Sapienza Università di Roma.
Semantic technologies for open data publishing [intermediate/advanced]

Summary:

Semantic technologies may promote new ways of managing data within an organization. In particular, the paradigm of ontology-based data management provides techniques for accessing, using, and maintaining data by means of an ontology, i.e., a conceptual representation of the domain of interest in the underlying information system. This paradigm aims at addressing one important challenge of modern information systems, namely managing the autonomous, distributed, and heterogeneous data sources of an organization, and devising tools for deriving useful information and knowledge from them. On the other hand, many of today's organizations face, among other challenges, the problem of publishing Open Data. Despite the current interest in this subject, a formal and comprehensive methodology that supports an organization in deciding which data to publish and that provides precise procedures for publishing high-quality data is still missing. In the course, we first provide an introduction to ontology-based data management, then we discuss the main techniques for using an ontology to access the data layer of an information system, and finally we illustrate the basic elements of a methodology for ontology-based Open Data publishing.

Syllabus:

Introduction to ontology-based data management (OBDM); languages for OBDM; query answering in OBDM; meta-modeling and higher-order ontology languages; the problem of open data publishing; ontology-based open data publishing.

Pre-requisites:

Basic notions of databases, logic, computational complexity.

References:

Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Riccardo Rosati: Tractable Reasoning and Efficient Query Answering in Description Logics: The DL-Lite Family. J. Autom. Reasoning 39(3): 385-429 (2007)
Antonella Poggi, Domenico Lembo, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Riccardo Rosati: Linking Data to Ontologies. J. Data Semantics 10: 133-173 (2008)
Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, Mariano Rodriguez-Muro, Riccardo Rosati: Ontologies and Databases: The DL-Lite Approach. Reasoning Web 2009: 255-356
Roman Kontchakov, Mariano Rodriguez-Muro, Michael Zakharyaschev: Ontology-Based Data Access with Databases: A Short Course. Reasoning Web 2013: 194-229
Domenico Lembo, Maurizio Lenzerini, Riccardo Rosati, Marco Ruzzi, Domenico Fabio Savo: Inconsistency-tolerant query answering in ontology-based data access. J. Web Sem. 33: 3-29 (2015)

Short Bio

Maurizio Lenzerini (http://www.dis.uniroma1.it/~lenzerini) is a Professor of Data Management at the Dipartimento di Ingegneria Informatica Automatica e Gestionale Antonio Ruberti of Sapienza Università di Roma, where he leads a research group working on Database Theory, Data Management, Knowledge Representation and Automated Reasoning, and Ontology-based Data Management and Integration. He is the author of more than 300 publications on the above topics, which have received about 24,000 citations. According to Google Scholar, his h-index is currently 75. He has been an invited keynote speaker at many international conferences. He is the recipient of two IBM Faculty Awards, a Fellow of EurAI (formerly the European Coordinating Committee for Artificial Intelligence, ECCAI) since 2008, a Fellow of the ACM (Association for Computing Machinery) since 2009, a Fellow of the AAAI (Association for the Advancement of Artificial Intelligence) since 2017, and a member of Academia Europaea (the European Academy) since 2011.








Bing Liu   
Distinguished Professor, Department of Computer Science, University of Illinois at Chicago (UIC)
Lifelong Learning and its Applications in NLP [intermediate/advanced]

Summary:

Lifelong Learning is an advanced machine learning (ML) paradigm that learns continuously, accumulates the knowledge learned in the past, and uses it to help future learning. In the process, the learner becomes more and more knowledgeable and effective at learning. This learning ability is one of the hallmarks of human intelligence. However, the current dominant ML paradigm learns in isolation: given a training dataset, it runs an ML algorithm on the dataset to produce a model. It does not retain the learned knowledge and use it in future learning. Although this isolated learning paradigm has been very successful, it requires a large amount of training data and is only suitable for well-defined and narrow tasks. In comparison, we humans can learn effectively from a few examples because we have accumulated so much knowledge in the past, which enables us to learn with little data or effort. Lifelong learning aims to achieve this capability. Applications such as chatbots and physical robots that interact with real-life environments all call for such learning capabilities. Without this ability, a system will probably never be truly intelligent. In this lecture, I will introduce lifelong learning and discuss some of its applications in natural language processing (NLP).

Syllabus:

1. Introduction and motivations
2. Definition of lifelong learning
3. Related learning paradigms
4. Lifelong supervised learning
5. Open world learning
6. Learning during model application
7. Lifelong topic modeling
8. Lifelong Learning in Information Extraction
9. Lifelong learning in belief propagation
10. Summary
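The core idea of knowledge accumulation can be caricatured in a few lines of Python. The toy sentiment example below is not any specific algorithm from the reference list; it merely illustrates retaining simple statistics across tasks and reusing them on a new task:

```python
from collections import Counter

# Toy lifelong-learning illustration: a sentiment classifier keeps a
# shared knowledge store of word polarity counts accumulated across
# past tasks (domains) and reuses it when a new task arrives.

knowledge = {"pos": Counter(), "neg": Counter()}

def learn_task(labelled_docs):
    """Accumulate word counts from one task into the shared store."""
    for words, label in labelled_docs:
        knowledge[label].update(words)

def classify(words):
    """Score a new document using knowledge from *all* past tasks."""
    score = sum(knowledge["pos"][w] - knowledge["neg"][w] for w in words)
    return "pos" if score >= 0 else "neg"

# Task 1: camera reviews; Task 2: hotel reviews.
learn_task([(["great", "picture"], "pos"), (["blurry", "bad"], "neg")])
learn_task([(["great", "view"], "pos"), (["dirty", "bad"], "neg")])

# A new (phone) task benefits from words seen in earlier tasks.
print(classify(["great", "battery"]))   # → pos
print(classify(["bad", "screen"]))      # → neg
```

Even this caricature shows the key contrast with isolated learning: nothing is relearned from scratch, and each new task both benefits from and adds to the accumulated knowledge.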

Pre-requisites:

Basic knowledge of machine learning

References:

1. Zhiyuan Chen and Bing Liu. Lifelong Machine Learning. Morgan & Claypool Publishers, Nov 2016
2. Zhiyuan Chen and Bing Liu. Mining Topics in Documents: Standing on the Shoulders of Big Data. KDD-2014.
3. Zhiyuan Chen and Bing Liu. Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data. ICML-2014.
4. Zhiyuan Chen, Nianzu Ma and Bing Liu. Lifelong Learning for Sentiment Classification. ACL-2015, (short paper).
5. Geli Fei, Shuai Wang, and Bing Liu. 2016. Learning Cumulatively to Become More Knowledgeable. KDD-2016.
6. T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling. Never-ending learning. AAAI, 2015.
7. Paul Ruvolo and Eric Eaton. ELLA: An efficient lifelong learning algorithm. ICML-2013.
8. Lei Shu, Hu Xu, and Bing Liu. Lifelong Learning CRF for Supervised Aspect Extraction. ACL-2017, (short paper).
9. Lei Shu, Bing Liu, Hu Xu, and Annice Kim. Lifelong-RL: Lifelong Relaxation Labeling for Separating Entities and Aspects in Opinion Targets. EMNLP 2016.
10. Daniel L. Silver, Qiang Yang, and Lianghao Li. Lifelong machine learning systems: Beyond learning algorithms. In AAAI Spring Symposium: Lifelong Machine Learning, pages 49–55, 2013.
11. Sebastian Thrun. Is Learning the N-th Thing Any Easier than Learning the First? In Advances in neural information processing systems, pp. 640–646. Morgan Kaufmann Publishers, 1996.
12. Sebastian Thrun and Tom Mitchell. Lifelong robot learning. Springer, 1995.

Short Bio

Bing Liu is a distinguished professor of Computer Science at the University of Illinois at Chicago. He received his Ph.D. in Artificial Intelligence from the University of Edinburgh. His research interests include lifelong learning, sentiment analysis, data mining, machine learning, and natural language processing. He has published extensively in top conferences and journals. Two of his papers have received 10-year Test-of-Time awards from KDD. He also authored four books: one on lifelong learning, two on sentiment analysis, and one on Web mining. Some of his work has been widely reported in the press, including a front-page article in the New York Times. On professional services, he served as the Chair of ACM SIGKDD (ACM Special Interest Group on Knowledge Discovery and Data Mining) from 2013-2017. He has also served as program chair of many leading data mining conferences, including KDD, ICDM, CIKM, WSDM, SDM, and PAKDD, as associate editor of leading journals such as TKDE, TWEB, and DMKD, and as area chair or senior PC member of numerous natural language processing, AI, Web, and data mining conferences. He is a Fellow of ACM, AAAI and IEEE.








B.S. Manjunath   
Distinguished Professor, Electrical and Computer Engineering, University of California, Santa Barbara
Unstructured (Big) Data [introductory]

Summary:

Multimodal, unstructured data is ubiquitous: from consumer devices such as smart phones to scientific imaging, we encounter this data constantly, everywhere. This data is voluminous, accounting for a significant part of the digital data (one could speculate this to be >90%) generated around the world, daily. This data is complex and unstructured. In many applications, this data varies over time, and these time scales differ depending on the application. However, much of this multi-scale, multi-modal, unstructured and dynamic data remains under-exploited and un-interrogated. This lecture explores the challenges associated with such data analytics and how this differs from the more traditional big-data problems. Some interesting case studies in life sciences and medicine will be presented, with a focus on imaging data. The lecture will conclude with an overview of the BisQue software platform that is being developed at UCSB towards addressing the challenges associated with managing such data and creating reproducible workflows to analyze imaging data.

Syllabus:

Unstructured big-data challenges and examples.
Feature extraction in images/video: traditional methods to recent advances in deep learning methods.
Towards reproducible image informatics: BisQue open source project.

Pre-requisites:

Undergraduate-level exposure to linear algebra and calculus. A course in image processing/computer vision will help but is not required.

References:

Recent publications (conference/journal articles) on the above topics (to be added). For Bisque, see http://bioimage.ucsb.edu and http://cyverse.org

Short Bio

Manjunath is a Distinguished Professor of Electrical and Computer Engineering at the University of California, Santa Barbara. He received his Ph.D. in Electrical Engineering from the University of Southern California and his M.E. in Systems Science and Automation from the Indian Institute of Science. His research interests are in image informatics, and in recent years he has focused on applications to the life and health sciences. He has published over 300 peer-reviewed articles, is an inventor on 24 patents, and co-edited a book on MPEG-7.








Folker Meyer   
Argonne National Laboratory
Efficient Multi Cloud Execution of Reproducible Data Analytics using Common Workflow Language, AWE and SHOCK [introductory/intermediate]

Summary:

Executing scientific workflows at scale poses a significant challenge to many teams and institutions. We present a unified system for portable, reproducible execution on local and remote resources. The Skyport system [1] provides containerized workflow execution with Docker [2] across system boundaries, allowing researchers to execute scientific workflows using the AWE [1, 3, 4] workflow engine together with SHOCK [5] as an active object store. AWE and SHOCK are implemented as RESTful services for managing and executing workflows; workflows are specified in Common Workflow Language (CWL) format [6]. CWL is a single, multi-vendor language for describing scientific workflows, created by a community of practitioners. In addition to being multi-vendor (and thus supporting multiple execution engines), a critical feature of CWL is the separation of scientific content from computational implementation, which allows experts in each domain to focus on their own area (CWL, http://commonwl.org).
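To make the separation of workflow description from execution engine concrete, here is a minimal, hypothetical sketch in plain Python (not actual CWL, AWE, or SHOCK code): the workflow is pure data that names steps and their dependencies, and a tiny engine runs each step once its inputs are ready. In Skyport the steps would instead be containerized tools and the intermediate files would live in SHOCK.

```python
# A workflow as pure data: each step names its tool and the step it reads from.
# For simplicity each step here takes at most one input step.
workflow = {
    "trim":  {"run": lambda x: x.strip(), "inputs": []},
    "upper": {"run": lambda x: x.upper(), "inputs": ["trim"]},
    "count": {"run": lambda x: len(x),    "inputs": ["upper"]},
}

def execute(workflow, initial):
    """Tiny engine: run each step once all of its input steps are done."""
    done = {}
    pending = dict(workflow)
    while pending:
        for name, step in list(pending.items()):
            if all(i in done for i in step["inputs"]):
                arg = done[step["inputs"][0]] if step["inputs"] else initial
                done[name] = step["run"](arg)
                del pending[name]
    return done

results = execute(workflow, "  hello world  ")
print(results["count"])  # prints 11
```

Because the workflow itself is only data, the same description could in principle be handed to a different engine, which is exactly the multi-vendor property CWL aims for.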

Syllabus:

Session 1: Initial system setup and execution of demo workflow

-- System overview

  -Scientific computing & workflows

  -Distributed computing

  -Example use case MG-RAST

  -CWL, Docker (why did we use those)

-- Install skyport2 services (using single docker compose image)

  -Install authentication services, AWE-server and SHOCK-server

  -Load demo data into SHOCK

  -Load demo workflow

  -Install awe-worker node

  -Download results from SHOCK



Session 2: Customizing system setup and monitoring execution

-- Setup of expanded system

  -Basis for customization

  -Creation of custom data types

  -Monitoring execution via the web interface or cmd-line

  -Adding tools to workflows

  -Adding workflow steps

-- Hands on exercise creating and executing a customized workflow



Session 3: Advanced topics

-- Combining AWE with other execution engines

  -Using Singularity [7]

-- Adding data types to SHOCK

Pre-requisites:

The assumption is that participants will bring a laptop with the ability to execute multiple Docker containers. Participants should test their available memory and hard drive space. Software systems required:
- Docker
- Ansible
- ASCII editor, e.g. Emacs, vi, Textmate, ...
The participants will be asked to install software (via Docker), modify configuration files and perform other Unix command line style activities.

References:

1. Wolfgang Gerlach, Wei Tang, Andreas Wilke, Dan Olson, Folker Meyer: Container orchestration for scientific workflows. In: Proceedings of the 2015 IEEE International Conference on Cloud Engineering (IC2E), 2015: 377-378.

2. Merkel D: Docker: lightweight Linux containers for consistent development and deployment. Linux Journal 2014, 2014(239):2.

3. Tang W, Bischof J, Desai N, Mahadik K, Gerlach W, Harrison T, Wilke A, Meyer F: Workload Characterization for MG-RAST Metagenomic Data Analytics Service in the Cloud. In: Proc of IEEE Int’l Conf on Big Data. 2014.

4. Tang W, Wilkening J, Bischof J, Gerlach W, Wilke A, Desai N, Meyer F: Building Scalable Data Management and Analysis Infrastructure for Metagenomics. In: 5th International Workshop on Data-Intensive Computing in the Clouds. IEEE 2013.

5. Bischof J, Wilke A, Gerlach W, Harrison T, Paczian T, Tang W, Trimble W, Wilkening J, Desai N, Meyer F: Shock: Active Storage for Multicloud Streaming Data Analysis. In: 2nd IEEE/ACM International Symposium on Big Data Computing: 2015; Limassol, Cyprus.

6. Common Workflow Language, v1.0.

7. Kurtzer GM, Sochat V, Bauer MW: Singularity: Scientific containers for mobility of compute. PLoS One 2017, 12(5):e0177459.

Short Bio

Folker Meyer is a computational biologist at Argonne National Laboratory and a senior fellow at the Computation Institute at the University of Chicago. He is also the associate division director of the Institute for Genomics and Systems Biology at Argonne National Laboratory. Meyer was trained as a computer scientist and with that came his interest in building software systems to answer complex biological questions. He is the driving force behind the MG-RAST project.








Wladek Minor   
Professor of Molecular Physiology and Biological Physics. University of Virginia, Charlottesville, USA
Big Data in Biomedical Sciences [introductory/advanced]

Summary:

Syllabus:

-Big Data and Big Data in Biomedical Sciences
-Why big data is perceived as a big problem - technological consideration
-Data reduction - should we preserve unreduced (raw) data?
-Databases and databanks
-Data mining with the use of raw data, databanks and databases
-Data Integration
-Automatic and semi-automatic curation of large amounts of data
-Conversion of databanks into databases
-Database priorities – content and design
-Interaction between databases
-Modern data management in biomedical sciences – necessity or luxury
-Automatic data harvesting – close reality or still on the horizon
-Reproducibility of the biomedical experiments - drug discovery considerations
-Big data in medicine - new possibilities
-Future considerations

Pre-requisites:

References:

Porebski PJ., Sroka P., Zheng H., Cooper DR., Minor W. (2017) Molstack-Interactive visualization tool for presentation, interpretation, and validation of macromolecules and electron density maps. Protein Sci. [Pub Med ID: 28815771]

Zheng H., Langner KM., Shields GP., Hou J., Kowiel M., Allen FH., Murshudov G., Minor W. (2017) Data mining of iron(II) and iron(III) bond-valence parameters, and their relevance for macromolecular crystallography. Acta Crystallogr D Struct Biol 73(Pt 4):316-325. Times cited: 1. [Pub Med ID: 28375143] [Pub Med Central ID: PMC5503122]

Zheng H., Cooper DR., Porebski PJ., Shabalin IG., Handing KB., Minor W. (2017) CheckMyMetal: a macromolecular metal-binding validation tool. Acta Crystallogr D Struct Biol 73(Pt 3):223-233. Times cited: 1. [Pub Med ID: 28291757] [Pub Med Central ID: PMC5349434]

Grabowski M., Minor W. (2017) Sharing Big Data. IUCrJ 4(Pt 1):3-4. [Pub Med ID: 28250936] [Pub Med Central ID: PMC5331460]

Zheng H., Porebski PJ., Grabowski M., Cooper DR., Minor W. (2017) Databases, Repositories, and Other Data Resources in Structural Biology. Methods Mol. Biol. 1607:643-665. [Pub Med ID: 28573593] [Pub Med Central ID: PMC5587190]

Rupp B., Wlodawer A., Minor W., Helliwell JR., Jaskolski M. (2016) Correcting the record of structural publications requires joint effort of the community and journal editors. FEBS J. 283(24):4452-4457. Times cited: 6. [Pub Med ID: 27229767] [Pub Med Central ID: PMC5124416]

Grabowski M., Langner KM., Cymborowski M., Porebski PJ., Sroka P., Zheng H., Cooper DR., Zimmerman MD., Elsliger MA., Burley SK., Minor W. (2016) A public database of macromolecular diffraction experiments. Acta Crystallogr D Struct Biol 72(Pt 11):1181-1193. Times cited: 4. [Pub Med ID: 27841751] [Pub Med Central ID: PMC5108346]

Grabowski M., Niedzialkowska E., Zimmerman MD., Minor W. (2016) The impact of structural genomics: the first quindecennial. J. Struct. Funct. Genomics 17(1):1-16. Times cited: 2. [Pub Med ID: 26935210] [Pub Med Central ID: PMC4834271]

Niedzialkowska E., Gasiorowska O., Handing KB., Majorek KA., Porebski PJ., Shabalin IG., Zasadzinska E., Cymborowski M., Minor W. (2016) Protein purification and crystallization artifacts: The tale usually not told. Protein Sci. 25(3):720-33. Times cited: 4. [Pub Med ID: 26660914] [Pub Med Central ID: PMC4815408]

Minor W., Dauter Z., Helliwell JR., Jaskolski M., Wlodawer A. (2016) Safeguarding Structural Data Repositories against Bad Apples. Structure 24(2):216-20. Times cited: 9. [Pub Med ID: 26840827] [Pub Med Central ID: PMC4743038]

Porebski PJ., Cymborowski M., Pasenkiewicz-Gierula M., Minor W. (2016) Fitmunk: improving protein structures by accurate, automatic modeling of side-chain conformations. Acta Crystallogr D Struct Biol 72(Pt 2):266-80. Times cited: 5. [Pub Med ID: 26894674] [Pub Med Central ID: PMC4756610]

Minor W., Dauter Z., Jaskolski M. (2016) The young person’s guide to the PDB. Postepy Biochemii 62(3):242-249. [Pub Med ID: 28132477] [Pub Med Central ID: PMC5610703]

Shabalin I., Dauter Z., Jaskolski M., Minor W., Wlodawer A. (2015) Crystallography and chemistry should always go together: a cautionary tale of protein complexes with cisplatin and carboplatin. Acta Crystallogr D Biol Crystallogr. 71(Pt 9):1965-79. Times cited: 18. [Pub Med ID: 26327386] [Pub Med Central ID: PMC4556316]

Zheng H., Handing KB., Zimmerman MD., Shabalin IG., Almo SC., Minor W. (2015) X-ray crystallography over the past decade for novel drug discovery - where are we heading next? Expert Opin Drug Discov 10(9):975-89. Times cited: 8. [Pub Med ID: 26177814] [Pub Med Central ID: PMC4655606]

Berman HM., Gabanyi MJ., Groom CR., Johnson JE., Murshudov GN., Nicholls RA., Reddy V., Schwede T., Zimmerman MD., Westbrook J., Minor W. (2015) Data to knowledge: how to get meaning from your result. IUCrJ 2(Pt 1):45-58. Times cited: 5. [Pub Med ID: 25610627] [Pub Med Central ID: PMC4285880]

Dauter Z., Wlodawer A., Minor W., Jaskolski M., Rupp B. (2014) Avoidable errors in deposited macromolecular structures: an impediment to efficient data mining. IUCrJ 1(Pt 3):179-93. Times cited: 30. [Pub Med ID: 25075337] [Pub Med Central ID: PMC4086436]

Short Bio

Link








Fionn Murtagh   
Professor of Data Science, University of Huddersfield.
The New Science of Big Data Analytics, Based on the Geometry and the Topology of Complex, Hierarchic Systems [introductory/advanced]

Summary:

These foundations of Data Science are solidly based on mathematics and computational science. The hierarchical nature of complex reality is part and parcel of this new, mathematically well-founded way of observing and interacting with (physical, social, and all other) realities.
These lectures cover pattern recognition and knowledge discovery, machine learning, and statistics, and address how geometry and topology can uncover and empower the semantics of data. Key themes include: text mining; computational, linear-time hierarchical clustering, search, and retrieval; and the Correspondence Analysis platform, which performs latent semantic factor-space mapping with accompanying hierarchical clustering.
Various application domains are covered in the case studies, including text mining of literary texts and social media (Twitter), and clustering in astronomy, chemistry, and psychoanalysis. The final discussion addresses the increasingly important domains of smart environments, the Internet of Things, health analytics, and the further general scope of Big Data.

Syllabus:

Topics
- General Introduction. The Visualization and the Verbalization of Data.
- Analytics through the Geometry and Topology of Complex Systems. Metric, Ultrametric Frameworks. Hierarchy and Symmetry.
- Search and Discovery, Clustering and Regression: Pattern Recognition in Very High Dimensions.
- Text and Related Analytics. Between Lives of Narratives and Narratives of Lives.
Applications include:
- Social science, following Pierre Bourdieu.
- A few issues of cosmology.
- Literary work, between style and semantics.
- Large data analytics in astronomy, chemistry, finance.
- Social media analytics: Letting the data speak.
- Computational psychoanalysis.

The case studies are implemented in R. Although the general discussion uses R, it can also benefit users of other software environments. The presentation encompasses general background and introduction, as well as potentially innovative developments.

Session 1: Semantic mapping, both metric and ultrametric.
Session 2: Application of textual narrative.
Session 3: Applications in search and discovery; new perspectives and new approaches.

Pre-requisites:

Engagement in, or current plans for, data analytics, and perspectives or plans in application domains.

References:

F. Murtagh, Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics, Chapman & Hall, CRC Press, 2017. In the course material, relevant references will be included.

Short Bio

Fionn Murtagh is Professor of Data Science and has been Professor of Computer Science, including Department Head, at many universities. Following his primary degrees in Mathematics and Engineering Science and an MSc in Computer Science (in Information Retrieval) at Trinity College Dublin, his first position, as Statistician/Programmer, was in national-level (first- and second-level) education research. His PhD at Université P&M Curie, Paris 6, with Prof. Jean-Paul Benzécri, was carried out in conjunction with the national geological research centre, BRGM. After an initial four years as a lecturer in computer science, there was a period working on atomic reactor safety at the European Joint Research Centre in Ispra (VA), Italy. As a European Space Agency Senior Scientist working on the Hubble Space Telescope, Fionn was based at the European Southern Observatory in Garching, Munich, for 12 years. For five years, Fionn was a Director in Science Foundation Ireland, managing mathematics and computing, and nanotechnology, and introducing and growing all that is related to environmental science and renewable energy.
Fionn was Editor-in-Chief of the Computer Journal (British Computer Society) for more than 10 years, and is an Editorial Board member of many journals. He has over 300 refereed articles and 30 books authored or edited. His fellowships and scholarly academies include: Fellow of the British Computer Society (FBCS), the Institute of Mathematics and Its Applications (FIMA), the International Association for Pattern Recognition (FIAPR), the Royal Statistical Society (FRSS), and the Royal Society of Arts (FRSA); Elected Member of the Royal Irish Academy (MRIA) and Academia Europaea (MAE); and Senior Member of the IEEE.
Website: http://www.fmurtagh.info








Raymond Ng   
Professor of Computer Science at the University of British Columbia
Mining and Summarizing Text Conversations [introductory]

Summary:

With the ever-increasing popularity of Internet technologies and communication devices such as smartphones and tablets, and with huge amounts of such conversational data generated on an hourly basis, intelligent text analytic approaches can greatly benefit organizations and individuals. For example, managers can find the information exchanged in forum discussions crucial for decision making; clinicians can use patients’ discussions to assist in chronic disease management.
In this lecture, we first give an overview of important applications of mining text conversations, using clinical applications and sentiment summarization of product reviews as case studies. Then we examine three topics in this area: (i) topic modeling; (ii) natural language summarization; and (iii) extraction of rhetorical structure and relationships in text.
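As a tiny, self-contained illustration of topic (ii), here is a classic frequency-based extractive summarizer, simulated in plain Python. This is a standard baseline, not one of the specific methods covered in the lecture, and the sample conversation is invented.

```python
import re
from collections import Counter

def summarize(text, n=1):
    """Toy extractive summarizer: return the n sentences whose words are,
    on average, most frequent across the whole conversation."""
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    freq = Counter(w for s in sentences for w in s.lower().split())

    def score(sentence):
        words = sentence.lower().split()
        return sum(freq[w] for w in words) / len(words)

    return sorted(sentences, key=score, reverse=True)[:n]

# A hypothetical product-review conversation:
convo = ("The battery drains fast. Battery life is the main problem. "
         "I like the screen. Shipping was quick.")
print(summarize(convo))  # prints ['The battery drains fast']
```

Even this crude baseline surfaces the conversation's dominant concern; the lecture's methods additionally exploit conversational structure, which flat word counts ignore.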

Syllabus:

Pre-requisites:

Basic knowledge of machine learning and natural language processing is preferred but not required.

References:

Short Bio

Raymond Ng is a Professor of Computer Science, Canada Research Chair in Data Science and Analytics, and Chief Informatics Officer of the PROOF Centre. His main research area for the past two decades has been data mining, with a specific focus on health informatics and text mining. He has published over 200 peer-reviewed publications on data clustering, outlier detection, OLAP processing, health informatics and text mining. He is the recipient of two best paper awards – from the 2001 ACM SIGKDD conference, the premier data mining conference in the world, and the 2005 ACM SIGMOD conference, one of the top database conferences worldwide. For the past decade, he has co-led several large-scale genomic projects funded by Genome Canada, Genome BC and industrial collaborators. Since the inception of the PROOF Centre of Excellence, which focuses on biomarker development for end-stage organ failures, he has held the position of Chief Informatics Officer of the Centre. From 2009 to 2014, Dr. Ng was the associate director of the NSERC-funded strategic network on business intelligence.








Srinivasan Parthasarathy   
Ohio State University
Network Science Fundamentals [introductory/intermediate]

Summary:

We shall cover basic and intermediate concepts in Network Science as outlined below.

Syllabus:

Lecture 1: Introductory Concepts: Networks in the real world; Basic graph theory concepts and algorithms; Paths, cycles and components; Modeling directionality and weights; Basic network measures.
Lecture 2: Intermediate Concepts I: Advanced network measures; Modularity: theory and applications; Tie-strength, link analysis and prediction; Social models of group formation and basic community discovery; Signed networks and structural balance theory.
Lecture 3: Intermediate Concepts II: Recent advances in community discovery; Graph sparsification and sampling strategies; Deep-dive into recent advances in stochastic flow clustering of networks.
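As a small worked example of one measure from Lecture 2, the sketch below computes Newman's modularity Q = Σ_c (l_c/m − (d_c/2m)²) for a hypothetical partition of a toy undirected graph (l_c: edges inside community c, d_c: total degree of c, m: total edges). The graph and partition are invented for illustration.

```python
def modularity(edges, community):
    """Newman modularity of a partition: sum over communities c of
    l_c/m - (d_c/(2m))**2, for an undirected edge list."""
    m = len(edges)
    intra = {}   # l_c: number of edges with both endpoints in c
    degree = {}  # d_c: total degree of nodes in c
    for u, v in edges:
        degree[community[u]] = degree.get(community[u], 0) + 1
        degree[community[v]] = degree.get(community[v], 0) + 1
        if community[u] == community[v]:
            intra[community[u]] = intra.get(community[u], 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

# Two triangles joined by a single bridge edge (2, 3):
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, community), 4))  # prints 0.3571
```

A clearly positive Q confirms that the two-triangle partition has more intra-community edges than a random graph with the same degrees would.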

Pre-requisites:

Prior background in algorithms and linear algebra will be helpful albeit not essential.

References:

1. Networks, Crowds and Markets: Reasoning about a highly connected world, D. Easley, J. Kleinberg https://www.cs.cornell.edu/home/kleinber/networks-book/
2. Networks an Introduction: M. Newman. http://www.oxfordscholarship.com/view/10.1093/acprof:oso/9780199206650.001.0001/acprof-9780199206650
3 https://sites.google.com/site/stochasticflowclustering/

Short Bio

Dr. Srinivasan Parthasarathy is a Professor in the Computer Science and Engineering Department at the Ohio State University (OSU). He directs the data mining research laboratory at OSU and co-directs the university-wide undergraduate program in Data Analytics that he helped co-found. His research interests are broadly in the areas of Data Analytics, Machine Learning and High Performance Computing. He has an h-index of over 50 and his work has been cited over 12,000 times. He is a recipient of an Ameritech Faculty fellowship in 2001, an NSF CAREER award in 2003, a DOE Early Career Award in 2004, and multiple grants or fellowships from IBM, Google and Microsoft. His papers have received eight best paper awards or similar honors from leading conferences in the field, including ones at the SIAM International Conference on Data Mining (SDM), the IEEE International Conference on Data Mining (ICDM), the Very Large Databases Conference (VLDB), ACM Knowledge Discovery and Data Mining (SIGKDD), ACM Bioinformatics, and ISMB. He has served on the program and editorial board committees of leading conferences and journals in the fields of data mining, databases, and high performance computing. He currently serves as the chair of the steering committee for the SIAM data mining conference series.








Hanan Samet   
Center for Automation Research. Institute for Advanced Computer Studies. University of Maryland
Sorting in Space: Multidimensional, Spatial, and Metric Data Structures for Applications in Spatial Databases, Geographic Information Systems (GIS), and Location-based Services [introductory/intermediate]

Summary:

The representation of multidimensional, spatial, and metric data is an important issue in applications of spatial databases, geographic information systems (GIS), and location-based services. Recently, there has been much interest in hierarchical data structures such as quadtrees, octrees, and pyramids, which are based on image hierarchies, as well as methods that make use of bounding boxes, which are based on object hierarchies. Their key advantage is that they provide a way to index into space; in fact, they are little more than multidimensional sorts. They are compact and, depending on the nature of the spatial data, they save space as well as time and also facilitate operations such as search. We describe hierarchical representations of points, lines, collections of small rectangles, regions, surfaces, and volumes. For region data, we point out the dimension-reduction property of the region quadtree and octree. We also demonstrate how to use them for both raster and vector data. For metric data that does not lie in a vector space, so that indexing is based simply on the distance between objects, we review various representations such as the vp-tree, gh-tree, and mb-tree. In particular, we demonstrate the close relationship between these representations and those designed for a vector space. For all of the representations, we show how they can be used to compute nearest objects in an incremental fashion, so that the number of objects need not be known in advance. The VASCO JAVA applet that illustrates these methods is presented (found at http://www.cs.umd.edu/~hjs/quadtree/index.html). They are also used in applications such as the SAND Internet Browser (found at http://www.cs.umd.edu/~brabec/sandjava). The above is in the context of the traditional geometric representation of spatial data; in the final part we review the more recent textual representation, which is used in location-based services where the key issue is that of resolving ambiguities.
For example, does "London" correspond to the name of a person or a location, and if it corresponds to a location, which of the over 700 different instances of "London" is it? The NewsStand system at newsstand.umiacs.umd.edu and the TwitterStand system at TwitterStand.umiacs.umd.edu are examples. See also the cover article of the October 2014 issue of Communications of the ACM at http://tinyurl.com/newsstand-cacm or a cached version at http://www.cs.umd.edu/~hjs/pubs/cacm-newsstand.pdf and the accompanying video at https://vimeo.com/106352925
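The remark that these structures are "little more than multidimensional sorts" can be made concrete with a Morton (Z-order) code, one of the space ordering methods in the syllabus: interleaving the bits of the coordinates yields a one-dimensional key whose ordinary sort tends to keep nearby points together. A minimal sketch (the point set is invented for illustration):

```python
def morton2(x, y, bits=16):
    """Interleave the bits of non-negative integers x and y
    (x bits in even positions, y bits in odd positions)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bit i of x -> position 2i
        key |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y -> position 2i+1
    return key

points = [(5, 1), (0, 0), (1, 1), (2, 2), (7, 7)]
# Sorting by Morton key is a plain one-dimensional sort that
# nevertheless respects two-dimensional locality:
print(sorted(points, key=lambda p: morton2(*p)))
```

This ordering is exactly what lets structures like the region quadtree be linearized and stored in an ordinary one-dimensional index such as a B-tree.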

Syllabus:

1. Introduction
a. Sample queries
b. Spatial Indexing
c. Sorting approach
d. Minimum bounding rectangles (e.g., R-tree)
e. Disjoint cells (e.g., R+-tree, k-d-B-tree)
f. Uniform grid
g. Location-based queries vs: feature-based queries
h. Region quadtree
i. Dimension reduction
j. Pyramid
k. Region quadtrees vs: pyramids
l. Space ordering methods

2. Points
a. point quadtree
b. MX quadtree
c. PR quadtree
d. k-d tree
e. Bintree
f. BSP tree

3. Lines
a. Strip tree
b. PM1 quadtree
c. PM2 quadtree
d. PM3 quadtree
e. PMR quadtree

4. Rectangles and arbitrary objects
a. MX-CIF quadtree
b. Loose quadtree
c. Partition fieldtree
d. R-tree

5. Surfaces and Volumes
a. Restricted quadtree
b. Region octree
c. PM octree

6. Metric Data
a. vp-tree
b. gh-tree
c. mb-tree

7. Operations
a. Incremental nearest object location
b. Boolean set operations

8. Spatial Database Issues
a. General issues
b. Specific issues

9. Indexing spatiotextual data for location-based services delivered on platforms such as smart phones and tablets
a. Incorporation of spatial synonyms in search engines
b. Toponym recognition
c. Toponym resolution
d. Spatial reader scope
e. Incorporation of spatiotemporal data
f. System integration issues
g. Demos of live systems on smart phones

10. Example systems
a. SAND internet browser
b. JAVA spatial data applets
c. STEWARD
d. NewsStand
e. TwitterStand

Pre-requisites:

Practitioners working in the areas of big spatial data and spatial data science that involve spatial databases, geographic information systems, and location-based services will be given a different perspective on data structures found to be useful in most applications. Familiarity with computer terminology and some programming experience is needed to follow this course.

References:

1. H. Samet. ``Foundations of Multidimensional and Metric Data Structures.'' Morgan-Kaufmann, San Francisco, 2006.
2. H. Samet. ``A sorting approach to indexing spatial data.'' International Journal of Shape Modeling, 14(1):15--37, June 2008.
3. G. R. Hjaltason and H. Samet. ``Index-driven similarity search in metric spaces.'' ACM Transactions on Database Systems, 28(4):517--580, December 2003.
4. G. R. Hjaltason and H. Samet. ``Distance browsing in spatial databases.'' ACM Transactions on Database Systems, 24(2):265--318, June 1999. Also Computer Science TR-3919, University of Maryland, College Park, MD.
5. G. R. Hjaltason and H. Samet. ``Ranking in spatial databases.'' In Advances in Spatial Databases --- 4th International Symposium, SSD'95, M. J. Egenhofer and J. R. Herring, eds., Portland, ME, August 1995, 83--95. Also Springer-Verlag Lecture Notes in Computer Science 951.
6. H. Samet. ``Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS.'' Addison-Wesley, Reading, MA, 1990.
7. H. Samet. ``The Design and Analysis of Spatial Data Structures.'' Addison-Wesley, Reading, MA, 1990.
8. C. Esperanca and H. Samet. ``Experience with SAND/Tcl: a scripting tool for spatial databases.'' Journal of Visual Languages and Computing, 13(2):229--255, April 2002.
9. H. Samet, H. Alborzi, F. Brabec, C. Esperanca, G. R. Hjaltason, F. Morgan, and E. Tanin. ``Use of the SAND spatial browser for digital government applications.'' Communications of the ACM, 46(1):63--66, January 2003.
10. B. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. ``NewsStand: A new view on news.'' Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Irvine, CA, November 2008, 144--153.
11. H. Samet, J. Sankaranarayanan, M. D. Lieberman, M. D. Adelfio, B. C. Fruin, J. M. Lotkowski, D. Panozzo, J. Sperling, and B. E. Teitler. ``Reading news with maps by exploiting spatial synonyms.'' Communications of the ACM, 57(10):64--77, October 2014.
12. J. Sankaranarayanan, H. Samet, B. Teitler, M. D. Lieberman, and J. Sperling. ``TwitterStand: News in tweets.'' Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Seattle, WA, November 2009, 42--51.
13. M. D. Lieberman, H. Samet, and J. Sankaranarayanan. ``Geotagging with local lexicons to build indexes for textually-specified spatial data.'' Proceedings of the 26th IEEE International Conference on Data Engineering, Long Beach, CA, March 2010, 201--212.
14. M. D. Lieberman and H. Samet. ``Multifaceted Toponym Recognition for Streaming News.'' Proceedings of the ACM SIGIR Conference. Beijing, July 2011, 843--852.
15. M. D. Lieberman and H. Samet. ``Adaptive Context Features for Toponym Resolution in Streaming News.'' Proceedings of the ACM SIGIR Conference. Portland, OR, August 2012, 731--740.
16. M. D. Lieberman and H. Samet. ``Supporting Rapid Processing and Interactive Map-Based Exploration of Streaming News.'' Proceedings of the 20th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. Redondo Beach, CA, November 2012, 179--188.
17. Spatial Data Structure applets at: http://www.cs.umd.edu/~hjs/quadtree/index.html.

Short Bio

Hanan Samet (http://www.cs.umd.edu/~hjs/) is a Distinguished University Professor of Computer Science at the University of Maryland, College Park and is a member of the Institute for Advanced Computer Studies. He is also a member of the Computer Vision Laboratory at the Center for Automation Research where he leads a number of research projects on the use of hierarchical data structures for database applications, geographic information systems, computer graphics, computer vision, image processing, games, robotics, and search. He received the B.S. degree in engineering from UCLA, and the M.S. degree in operations research and the M.S. and Ph.D. degrees in computer science from Stanford University. His doctoral dissertation dealt with proving the correctness of translations of LISP programs, which was the first work in translation validation and the related concept of proof-carrying code. He is the author of the book ``Foundations of Multidimensional and Metric Data Structures'' (http://www.cs.umd.edu/~hjs/multidimensional-book-flyer.pdf), published by Morgan-Kaufmann, an imprint of Elsevier, in 2006, an award winner in the 2006 best book in Computer and Information Science competition of the Professional and Scholarly Publishers (PSP) Group of the Association of American Publishers (AAP), and of the first two books on spatial data structures, ``The Design and Analysis of Spatial Data Structures'' and ``Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS,'' both published by Addison-Wesley in 1990.
He is the Founding Editor-in-Chief of the ACM Transactions on Spatial Algorithms and Systems (TSAS), the founding chair of ACM SIGSPATIAL, and a recipient of a Science Foundation of Ireland (SFI) Walton Visitor Award at the Centre for Geocomputation at the National University of Ireland at Maynooth (NUIM), the 2009 UCGIS Research Award, the 2010 CMPS Board of Visitors Award at the University of Maryland, the 2011 ACM Paris Kanellakis Theory and Practice Award, and the 2014 IEEE Computer Society Wallace McDowell Award. He is a Fellow of the ACM, IEEE, AAAS, IAPR (International Association for Pattern Recognition), and UCGIS (University Consortium for Geographic Information Science). He received best paper awards from the 2007 Computers & Graphics Journal, the 2008 ACM SIGMOD and SIGSPATIAL ACMGIS Conferences, the 2012 SIGSPATIAL MobiGIS Workshop, and the 2013 SIGSPATIAL GIR Workshop, as well as a best demo award at the 2011 SIGSPATIAL ACMGIS Conference. His paper at the 2009 IEEE International Conference on Data Engineering (ICDE) was selected as one of the best papers for publication in the IEEE Transactions on Knowledge and Data Engineering. He was elected to the ACM Council as the Capitol Region Representative for the term 1989-1991, and is an ACM Distinguished Speaker.








Kyuseok Shim   
Professor of Electrical and Computer Engineering Department, Seoul National University, Korea
MapReduce Algorithms for Big Data Analysis [introductory/intermediate]

Summary:

A growing number of applications must handle big data, yet analyzing big data remains very challenging. For such applications, the MapReduce framework has attracted a lot of attention. MapReduce is a programming model that allows easy development of scalable parallel applications to process big data on large clusters of commodity machines. Google’s MapReduce and its open-source equivalent Hadoop are powerful tools for building such applications. In this tutorial, I will first introduce the MapReduce framework based on the Hadoop system, which is available to everyone for running distributed computing algorithms with MapReduce. I will next discuss how to design efficient MapReduce algorithms and present the state-of-the-art in MapReduce algorithms for big data analysis. Since Spark was developed to overcome the shortcomings of MapReduce, which is not optimized for iterative algorithms or interactive data analysis, I will also present an outline of Spark as well as the differences between MapReduce and Spark. The intended audience of this tutorial is professionals who plan to develop efficient MapReduce algorithms and researchers who should be aware of the state-of-the-art in MapReduce algorithms available today for big data analysis.

Syllabus:

Introduction to Hadoop and MapReduce
- Why parallel computing for big data analysis?
- Introduction to Map/Reduce
- Hadoop distributed file systems
- Word counting, inverted index building, and PageRank algorithms
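The word-counting example listed above is the canonical illustration of the MapReduce programming model. The following is a minimal pure-Python simulation of the map, shuffle, and reduce phases (an illustration of the model only, not Hadoop itself; the function names are our own):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc_id, text):
    # Mapper: emit an intermediate (word, 1) pair for every word.
    for word in text.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Reducer: sum the partial counts emitted for each word.
    return (word, sum(counts))

docs = {1: "big data is big", 2: "data analysis on big data"}
mapped = chain.from_iterable(map_phase(d, t) for d, t in docs.items())
result = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
print(result)  # {'big': 3, 'data': 3, 'is': 1, 'analysis': 1, 'on': 1}
```

In Hadoop, the same mapper and reducer would run in parallel across machines, with the framework handling the shuffle, fault tolerance, and I/O.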

MapReduce Algorithms for Database Systems
- Theta joins
- Similarity joins
- K-nearest neighbor joins
- Skyline computations
- Interval joins
- Subgraph enumeration
- Triangle counting
- Wavelet computation

MapReduce Algorithms for Data Mining
- K-means clustering
- EM, PLSI and LDA clustering
- Density-based clustering
- Association rule mining
- Sequential pattern mining
Introduction to Spark
Summary

Pre-requisites:

References:

Short Bio

Kyuseok Shim is currently a professor in the Department of Electrical and Computer Engineering at Seoul National University, Korea. Before that, he was an assistant professor in the computer science department at KAIST and a member of technical staff for the Serendip Data Mining Project at Bell Laboratories. He was also a member of the Quest Data Mining Project at the IBM Almaden Research Center and visited Microsoft Research in Redmond several times as a visiting scientist. Kyuseok was named an ACM Fellow in 2013 for his contributions to scalable data mining and query processing research. He has been working in the area of databases, focusing on data mining, search engines, recommendation systems, MapReduce algorithms, privacy preservation, query processing and query optimization. His writings have appeared in a number of professional conferences and journals including ACM, VLDB and IEEE publications. He served as a Program Committee member for the SIGKDD, SIGMOD, ICDE, ICDM, ICDT, EDBT, PAKDD, VLDB and WWW conferences. He also served as a Program Committee Co-Chair for PAKDD 2003, WWW 2014, ICDE 2015 and APWeb 2016. Kyuseok was previously on the editorial boards of the VLDB and IEEE TKDE journals and is currently a member of the VLDB Endowment Board of Trustees. He received the BS degree in electrical engineering from Seoul National University in 1986, and the MS and PhD degrees in computer science from the University of Maryland, College Park, in 1988 and 1993, respectively.








Jaideep Srivastava   
University of Minnesota
Social Computing: Computing as an Integral Tool to Understanding Human Behavior and Solving Problems of Social Relevance [introductory/intermediate]

Summary:

Social Computing is an emerging discipline, and like any discipline at a nascent stage it can mean different things to different people. However, three distinct threads are emerging. The first thread is often called Socio-Technical Systems, which focuses on building systems that allow large-scale interactions of people, whether for a specific purpose or in general. Examples include social networks like Facebook and Google Plus, and Multiplayer Online Games like World of Warcraft and Farmville. The second thread is often called Computational Social Science, whose goal is to use computing as an integral tool to push the research boundaries of various social and behavioral science disciplines, primarily Sociology, Economics, and Psychology. The third is the idea of solving problems of societal relevance using a combination of computing and humans. The three modules of this course are structured according to this description. The goal of this course is to discuss, in a tutorial manner, through case studies and discussion, what Social Computing is, where it is headed, and where it is taking us.

Syllabus:

-Module 1: Socio-technical systems
• Introduction to Social Computing
• Socio-technical systems
• Examples of a number of social computing systems, e.g. Twitter, Facebook, MMO games, etc.
• Applying data mining to social computing systems
-Module 2: Computational Social Science
• Online trust
• Social influence
• Individual and group/team performance
• Identifying and preventing bad behavior
-Module 3: Solving Problems of Societal Relevance
• Social computing for humanitarian assistance
• Wrap-up discussion
• Privacy and ethics
• Where are we headed?

Pre-requisites:

This course is intended primarily for graduate students. Potential audiences:
- Computer Science graduate students: all that is needed is interest in one of the themes of social computing.
- Social Science graduate students: some exposure to building models from data, or at least an understanding of what these techniques are and what they can do.
- Management graduate students: those with an MIS focus.

References:

Provided with slides.

Short Bio

Jaideep Srivastava (https://www.linkedin.com/in/jaideep-srivastava-50230/) is Professor of Computer Science at the University of Minnesota, where he directs a laboratory focusing on research in Web Mining, Social Analytics, and Health Analytics. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), and has been an IEEE Distinguished Visitor and a Distinguished Fellow of Allina’s Center for Healthcare Innovation. He has been awarded the Distinguished Research Contributions Award of the PAKDD for his lifetime contributions to the field of machine learning and data mining. Dr. Srivastava has significant industry experience, in both consulting and executive roles. Most recently he was the Chief Scientist for the Qatar Computing Research Institute (QCRI), which is part of Qatar Foundation. Earlier, he was the data mining architect for Amazon.com (www.amazon.com), built a data analytics department at Yodlee (www.yodlee.com), and served as the Chief Technology Officer for Persistent Systems (www.persistentsys.com). He has provided technology and strategy advice to Cargill, United Technologies, IBM, Honeywell, KPMG, 3M, TCS, and Eaton. Dr. Srivastava co-founded Ninja Metrics (www.ninjametrics.com), based on his research in behavioral analytics. He was advisor and Chief Scientist for CogCubed (www.cogcubed.com), an innovative company with the goal of revolutionizing the diagnosis and therapy of cognitive disorders through the use of online games, which was subsequently acquired by Teladoc (https://www.teladoc.com/), a public company. He has been a technology advisor to a number of startups at various stages, including Jornaya (https://www.jornaya.com/), a leader in cross-industry lead management, and Kipsu (http://kipsu.com/), which provides an innovative approach to improving service quality in the hospitality industry. Dr. Srivastava has held distinguished professorships at Heilongjiang University and Wuhan University, China. 
He has held advisory positions with the State of Minnesota, and the State of Maharashtra, India. He is a technology advisor to the Unique ID (UID) project of the Government of India, whose goal is to provide biometrics-based social security numbers to the 1.3 Billion citizens of India. Dr. Srivastava has a Bachelors of Technology from the Indian Institute of Technology (IIT), Kanpur, India, and MS and PhD from the University of California, Berkeley.








Jeffrey Ullman   
Stanford W. Ascherman Professor of Computer Science (Emeritus)
Big-data Algorithms That Aren't Machine Learning [introductory]

Summary:

We shall study algorithms that have been found useful in querying large data volumes. The emphasis is on algorithms that cannot be considered 'machine learning'.

Syllabus:

Pre-requisites:

A course in algorithms at the advanced-undergraduate level is important. A course in database systems is helpful, but not required.

References:

We will be covering (parts of) Chapters 3, 4, 5, and 10 of the free text Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman, and Jeff Ullman, available at www.mmds.org.

Short Bio

Link to the bio








Sebastián Ventura   
Professor of Computer Science and Artificial Intelligence at the University of Córdoba
Pattern Mining on Big Data [intermediate/advanced]

Summary:

Data analysis is of growing interest in many fields; it is concerned with the development of methods and techniques for making sense of data. Hence, there is a real incentive to collect, manage, and transform raw data into significant and meaningful information that may be used for subsequent analysis leading to better decision making. When talking about data analysis, the key element is the pattern, which is used to represent any type of homogeneity and regularity in data, serving as a way of describing intrinsic and important properties of data. Pattern mining, however, is a really challenging task that requires deep study, especially on massive and complex data where the computational and memory requirements are too high. Early exhaustive search approaches in this field were improved by adding constraints to the mining process so the search space could be heavily reduced. These constraints aided the user’s exploration and control, confining the space of solutions to those of interest. In spite of everything, the extraction of patterns from huge datasets still required large amounts of memory, since the number of feasible patterns increases exponentially with the number of items in the data. Hence, different ways of solving this arduous task were proposed, with metaheuristics being a good option to avoid analyzing the whole search space. Nevertheless, approaches based on metaheuristics are still time-consuming for extremely large datasets, since every pattern is evaluated on every transaction. In this sense, novel data structures as well as parallel pattern mining methods have recently emerged as really interesting and promising research areas. Parallel processing is, perhaps, the principal research topic (in connection with runtime) considered by the pattern mining community. In this regard, two main directions are being studied: (1) clusters of computers and (2) graphics processing units (GPUs). 
GPUs, for example, have been successfully applied by analyzing each transaction in parallel so the runtime is reduced. MapReduce, by contrast, decomposes the problem into two phases: map and reduce. The input dataset is split into subsets, and the map phase produces all the patterns within each of these subsets, assigning as a value the frequency of each pattern. Then, similar patterns are merged so the reduce phase can work on these sets to produce the final frequencies. MapReduce is one of the most widely studied emerging paradigms for intensive computing, achieving excellent results in a simple and robust way. However, recent research studies have demonstrated that these approaches are only recommended for really big data, since the time required to load the parallel structure can exceed that of the mining process itself.
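The map/reduce decomposition just described can be sketched in a few lines of Python (a toy simulation of pattern counting over data splits, not a Hadoop implementation; the item names and `max_len` cap are hypothetical choices for illustration):

```python
from collections import Counter
from itertools import combinations

def map_patterns(partition, max_len=2):
    # Map: for each transaction in this data split, emit every
    # itemset of size up to max_len with a partial count of 1.
    for transaction in partition:
        items = sorted(set(transaction))
        for k in range(1, max_len + 1):
            for pattern in combinations(items, k):
                yield (pattern, 1)

def reduce_patterns(mapped_streams):
    # Reduce: merge the partial counts emitted by all mappers
    # into the final frequency of each pattern.
    totals = Counter()
    for stream in mapped_streams:
        for pattern, count in stream:
            totals[pattern] += count
    return totals

# Two partitions of a toy transaction dataset.
part1 = [["bread", "milk"], ["bread", "butter"]]
part2 = [["milk", "butter"], ["bread", "milk", "butter"]]
freqs = reduce_patterns([map_patterns(part1), map_patterns(part2)])
print(freqs[("bread", "milk")])  # 2
```

A real MapReduce job would run the mappers on separate machines and let the framework shuffle patterns to reducers; enumerating all itemsets per transaction is exactly the exponential blow-up that motivates the constrained and heuristic approaches discussed above.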

Syllabus:

-Pattern mining: foundations and algorithms (time and memory requirements)
-Evolutionary algorithms for mining patterns (reducing the requirements)
-Data structure to reduce the evaluation process
-Parallel solutions:
a) based on GPUs
b) based on MapReduce

Pre-requisites:

Foundations of Pattern Mining (classical exhaustive approaches); Foundations of Evolutionary Computation

References:

Basic References:

Charu C. Aggarwal. Data Mining: The Textbook. 1st Edition. Springer (2015). ISBN 978-3-319-14141-1
Charu C. Aggarwal and Jiawei Han. Frequent Pattern Mining. 1st Edition. Springer (2014). ISBN 978-3-319-07820-5.
Sebastián Ventura, José María Luna: Pattern Mining with Evolutionary Algorithms. 1st Edition, Springer (2016), ISBN 978-3-319-33857-6.

Supplementary references

José María Luna, José Raúl Romero, Sebastián Ventura: Design and behavior study of a grammar-guided genetic programming algorithm for mining association rules. Knowl. Inf. Syst. 32(1): 53-76 (2012)
Alberto Cano, José María Luna, Sebastián Ventura: High performance evaluation of evolutionary-mined association rules on GPUs. The Journal of Supercomputing 66(3): 1438-1461 (2013)
José María Luna, Alberto Cano, Mykola Pechenizkiy, Sebastián Ventura: Speeding-Up Association Rule Mining With Inverted Index Compression. IEEE Trans. Cybernetics 46(12): 3059-3072 (2016)
José María Luna, Francisco Padillo, Mykola Pechenizkiy, Sebastián Ventura: Apriori Versions Based on MapReduce for Mining Frequent Patterns on Big Data. IEEE Trans. Cybernetics (2017). DOI: 10.1109/TCYB.2017.2751081

Short Bio








Xiaowei Xu   
Professor, Department of Information Science, University of Arkansas at Little Rock
Mining Big Networked Data [introductory/advanced]

Summary:

The recent explosive growth of online social networks such as Facebook and Twitter provides a unique opportunity for many data mining applications, including real-time event detection, community structure detection, and viral marketing. The course covers big data analytics for social networks. The emphasis will be on scalable algorithms for community structure detection, social tie modeling, and structural pattern mining for big networks.

Syllabus:

Modularity-based community structure detection algorithms [1]
Structural clustering algorithms [2]
Label propagation algorithms [3]
Social tie modeling [4]
Parallel network clustering algorithm [5]
Discovering multiple social ties for characterization of individuals in online social networks [6]
Anytime network clustering algorithm for very big networks [7]
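The label propagation approach [3] in the syllabus is perhaps the simplest of these algorithms to sketch. Below is a minimal illustrative Python version on a hand-built adjacency list (a sketch of the general idea with synchronous-per-node random updates, not the course's implementation):

```python
import random
from collections import Counter

def label_propagation(adj, max_iters=100, seed=42):
    # Each node starts with its own label; nodes then repeatedly
    # adopt the most frequent label among their neighbors until
    # no label changes (near-linear time per sweep).
    rng = random.Random(seed)
    labels = {node: node for node in adj}
    for _ in range(max_iters):
        changed = False
        nodes = list(adj)
        rng.shuffle(nodes)  # update nodes in random order
        for node in nodes:
            if not adj[node]:
                continue
            freq = Counter(labels[nbr] for nbr in adj[node])
            best = max(freq.values())
            # Break ties randomly among the most frequent labels.
            new_label = rng.choice(
                [lab for lab, c in freq.items() if c == best])
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels

# Toy graph: two triangles joined by a single bridge edge (2-3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
labels = label_propagation(adj)
print(labels)
```

Nodes sharing a final label form a community; because of the random tie-breaking, different runs (or seeds) can split the toy graph differently around the bridge edge.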

Pre-requisites:

Basic knowledge in computer algorithms and graph theory.

References:

1. Finding community structure in very large networks, Aaron Clauset, M. E. J. Newman, and Cristopher Moore, Phys. Rev. E 70, 066111 (2004).
2. X. Xu, N. Yuruk, Z. Feng, and T. A. Schweiger. Scan: a structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 824–833. ACM, 2007. 

3. Near linear time algorithm to detect community structures in large-scale networks, Raghavan, Usha Nandini and Albert, Reka and Kumara, Soundar, Phys. Rev. E 76, 036106 (2007)
4. S. Sintos and P. Tsaparas. Using strong triadic closure to characterize ties in social networks. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1466–1475. ACM, 2014. 

5. Weizhong Zhao, Venkata Swamy Martha, and Xiaowei Xu. PSCAN: A Parallel Structural Clustering Algorithm for Big Networks in MapReduce. Proceedings of AINA 2013, 862-869.
6. Ming-Hua Chung, Gang Chen, Weizhong Zhao, Guohua Hao, Julian Pan, and Xiaowei Xu. Discovering Multiple Social Ties for Characterization of Individuals in Online Social Networks. The Third European Network Intelligence Conference (ENIC 2016), Wrocław, Poland, September 5-7, 2016, 1-8.
7. Weizhong Zhao, Gang Chen, Xiaowei Xu. AnySCAN: An Efficient Anytime Framework with Active Learning for Large-scale Network Clustering. Proceedings of IEEE International Conference on Data Mining (ICDM 2017), New Orleans, November 18-21, 2017.

Short Bio

Xiaowei Xu is a professor in the Department of Information Science at the University of Arkansas at Little Rock (UALR). He received his Ph.D. in computer science from the University of Munich in 1998. Prior to his appointment at UALR, Dr. Xu was a senior research scientist at Siemens Corporate Technology. He is an adjunct professor in the Department of Mathematics at the University of Arkansas. Dr. Xu was an Oak Ridge Institute for Science and Education (ORISE) Faculty Research Program Member in the National Center for Toxicological Research's (NCTR) Center for Bioinformatics in the Division of Systems Biology from 2010 to 2014. He is also a consultant for companies including Siemens, Acxiom, Dataminr and L’Oreal. Dr. Xu’s research focuses on algorithms for data mining and machine learning. He is a recipient of the 2014 ACM SIGKDD Test of Time Award for his work on the density-based clustering algorithm DBSCAN, which has received over 10,000 citations according to Google Scholar. Dr. Xu has served as a program committee member and session chair for premier forums including the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) and the IEEE International Conference on Data Mining (ICDM).








Zhongfei Zhang   
Professor, Department of Computer Science, Watson School of Engineering and Applied Science, Binghamton University
Relational and Media Data Learning and Knowledge Discovery [introductory/advanced]

Summary:

This course aims at giving the audience a complete introduction to knowledge discovery and machine learning theories, together with case studies of real-world applications, for relational and media data. The course begins with an extensive introduction to the fundamental concepts and theories of knowledge discovery and machine learning for relational and media data, and then showcases several important real-world applications as case studies of big data knowledge discovery and learning.

Syllabus:

The course consists of three two-hour sessions. The syllabus is as follows:
First session: Introduction to the fundamental concepts and theories for relational and media data, with specific foci on an overview of the wide spectrum of techniques and technologies available as well as their relationships and applications to big data scenarios through real-world case studies;
Second session: Specific discussions on the classic and state-of-the-art methods for relational data knowledge discovery and learning;
Third session: Specific discussions on the state-of-the-art methods for media data knowledge discovery.

Pre-requisites:

College math, fundamentals about computer science

References:

1. Bo Long, Zhongfei (Mark) Zhang, and Philip S. Yu, Relational Data Clustering: Models, Algorithms, and Applications, Taylor & Francis/CRC Press, 2010, ISBN: 9781420072617
2. Zhongfei (Mark) Zhang and Ruofei Zhang, Multimedia Data Mining -- A Systematic Introduction to Concepts and Theory, Taylor & Francis Group/CRC Press, 2008, ISBN: 9781584889663
3. Zhongfei (Mark) Zhang, Bo Long, Zhen Guo, Tianbing Xu, and Philip S. Yu, Machine Learning Approaches to Link-Based Clustering, in Link Mining: Models, Algorithms and Applications, Edited by Philip S. Yu, Christos Faloutsos, and Jiawei Han, Springer, 2010
4. Zhen Guo, Zhongfei Zhang, Eric P. Xing, and Christos Faloutsos, Multimodal Data Mining in a Multimedia Database Based on Structured Max Margin Learning, ACM Transactions on Knowledge Discovery and Data Mining, ACM Press, 2015
5. http://www.cs.binghamton.edu/~forweb/publicationsactive.html

Short Bio

Zhongfei (Mark) Zhang is a full professor of Computer Science at the State University of New York (SUNY) at Binghamton, where he directs the Multimedia Research Computing Laboratory. He has also served as a QiuShi Chair Professor at Zhejiang University, China, and as the Director of the Data Science and Engineering Research Center at that university while on leave from SUNY Binghamton. He received a B.S. in Electronics Engineering (with Honors) and an M.S. in Information Sciences, both from Zhejiang University, China, and a Ph.D. in Computer Science from the University of Massachusetts at Amherst, USA. His research interests include knowledge discovery and machine learning for media and relational data, multimedia information indexing and retrieval, artificial intelligence, computer vision, and pattern recognition. He is the author or co-author of the first monograph on multimedia data mining and the first monograph on relational data clustering. His research is sponsored by a wide spectrum of government funding agencies, industrial labs, and private agencies, notably including the US NSF, US AFRL, CNRS in France, JSPS in Japan, MOST and NSFC in China, the New York State Government in the US, and the Zhejiang Provincial Government in China, as well as Kodak Research and Microsoft Research in the US, Alibaba Group in China, and the Huang Kuancheng Foundation in Hong Kong, China. He has published over 200 papers in premier venues in his areas and is an inventor on more than 30 patents. He has served on several journal editorial boards and received several professional awards.