3rd International Winter School on Big Data

Bari, Italy, February 13-17, 2017

Course Description



Keynotes


Ernesto Damiani   
EBTIC, Khalifa University, Abu Dhabi & University of Milan
Model-driven Development of Big Data Applications

Summary:

The advent of Big Data has highlighted a number of problems in the design, development and deployment of applications. Big data applications involve multiple components (collection and cleaning, data lake creation and management, analytics parallelization etc.) that have different features and lifecycles, and correspond to a rich panoply of software tools. Experience has shown that model-based approaches, leading from technology-independent to technology-dependent models and finally to deployment, support the design of data-intensive applications well. However, while classic MDA transformations introduce each technology-dependent feature at a pre-set stage of the model refinement chain, Big Data computations deliver their best in terms of scalability and efficiency by making binding architectural and data modeling decisions at the last possible moment, i.e. when information on Big Data distribution, volume and variety becomes available.
This talk discusses how Big-Data-as-a-Service, where data features can be different for each deployment of the analytics, can benefit from a Software Product-Line (SPL) parametric approach to keep multiple alternative models alive and postpone binding modeling decisions via variation points, in order to make them at the right (i.e., the last possible) moment. We discuss how to delay and parameterize modeling decisions for all key aspects of a Big Data application, including data preparation, representation and storage, analytics parallelization and visualization.

Short Bio:

Ernesto Damiani joined KUSTAR as Chair of the Information Security Group and Program, and EBTIC as Research Professor. He is on extended leave from the Department of Computer Science, Università degli Studi di Milano, Italy, where he leads the SESAR research lab and is the Head of the Ph.D. Program in Computer Science. Ernesto's research interests include secure service-oriented architectures, privacy-preserving Big Data analytics and Cyber-Physical Systems security.
Ernesto holds/has held visiting positions at a number of international institutions, including George Mason University in Virginia, US, Tokyo Denki University, Japan, LaTrobe University in Melbourne, Australia, and the Institut National des Sciences Appliquées (INSA) at Lyon, France. He is a Fellow of the Japanese Society for the Progress of Science.
He has been Principal Investigator in a number of large-scale research projects funded by the European Commission, the Italian Ministry of Research and by private companies such as British Telecom, Cisco Systems, SAP, Telecom Italia, Siemens Networks (now Nokia Siemens) and many others.
Ernesto serves on the editorial boards of several international journals; among others, he is the EIC of the International Journal on Big Data and of the International Journal of Knowledge and Learning. He is Associate Editor of IEEE Transactions on Service-oriented Computing and of the IEEE Transactions on Fuzzy Systems. Ernesto is a senior member of the IEEE and served as Vice-Chair of the IEEE Technical Committee on Industrial Informatics. In 2008, Ernesto was nominated ACM Distinguished Scientist and received the Chester Sall Award from the IEEE Industrial Electronics Society, and in 2016 he received the Stephen S. Yau Services Computing Award. Ernesto has co-authored over 350 scientific papers and many books, including "Open Source Systems Security Certification" (Springer 2009).







Daniele Quercia   
Computer scientist, currently building the Social Dynamics team at Bell Labs, Cambridge, UK
Good City Life

Summary:

The corporate smart-city rhetoric is about efficiency, predictability, and security. “You’ll get to work on time; no queue when you go shopping, and you are safe because of CCTV cameras around you”. Well, all these things make a city acceptable, but they don’t make a city great. We are launching goodcitylife.org - a global group of like-minded people who are passionate about building technologies whose focus is not necessarily to create a smart city but to give a good life to city dwellers. The future of the city is, first and foremost, about people, and those people are increasingly networked. We will see how a creative use of network-generated data can tackle hitherto unanswered research questions. Can we rethink existing mapping tools [happy-maps]? Is it possible to capture smellscapes of entire cities and celebrate good odors [smelly-maps]? And soundscapes [chatty-maps]?

References:

  1. [happy-maps] http://www.ted.com/talks/daniele_quercia_happy_maps
  2. [smelly-maps] http://goodcitylife.org/smellymaps/index.html
  3. [chatty-maps] http://goodcitylife.org/chattymaps/index.html

Short Bio:

Daniele Quercia leads the Social Dynamics group at Bell Labs in Cambridge (UK). He has been named one of Fortune magazine's 2014 Data All-Stars, and spoke about “happy maps” at TED. His research focuses on urban informatics and has received best paper awards from Ubicomp 2014 and from ICWSM 2015, and an honourable mention from ICWSM 2013. He was Research Scientist at Yahoo Labs, a Horizon senior researcher at the University of Cambridge, and Postdoctoral Associate at the department of Urban Studies and Planning at MIT. He received his PhD from University College London. His thesis was sponsored by Microsoft Research and was nominated for BCS Best British PhD dissertation in Computer Science.







Courses information


Paul Bliese   
Associate Professor of Business Administration in the Management Department of the Darla Moore School of Business at the University of South Carolina.
Using R for Mixed-effects (Multilevel) Models

Summary:

Mixed-effects or multilevel models are commonly used when data have some form of nested structure. For instance, individuals may be nested within workgroups, or repeated measures may be nested within individuals. Nested structures in data are often accompanied by some form of non-independence. For example, in work settings, individuals in the same workgroup typically display some degree of similarity with respect to performance or they provide similar responses to questions about aspects of the work environment. Likewise, in repeated measures data, individuals usually display a high degree of similarity in responses over time. Non-independence may be considered either a nuisance variable or something to be substantively modeled, but the prevalence of nested data requires that analysts have a variety of tools to deal with nested data. This course provides an introduction to (1) the theoretical foundation, and (2) resources necessary to conduct a wide range of multilevel analyses. All practical exercises are conducted in R. Participants are encouraged to bring datasets to the course and apply the principles to their specific areas of research.
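As a rough illustration of the non-independence described above, the sketch below computes ICC(1), a common one-way ANOVA estimate of the share of variance that lies between groups. It is plain Python rather than the R packages used in the course, it assumes balanced groups (equal group sizes), and the function name `icc1` and the sample data are hypothetical.

```python
def icc1(groups):
    """Estimate ICC(1) from a list of equally sized groups of numeric responses.

    Assumption: balanced design (every group has the same size k >= 2).
    """
    n_groups = len(groups)
    k = len(groups[0])
    if n_groups < 2 or k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need at least 2 groups of equal size >= 2")

    grand_mean = sum(sum(g) for g in groups) / (n_groups * k)
    group_means = [sum(g) / k for g in groups]

    # Mean square between groups: how far group means sit from the grand mean.
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (n_groups - 1)

    # Mean square within groups: how far individuals sit from their group mean.
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (n_groups * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical ratings from three workgroups of three people each.
workgroups = [[4, 5, 5], [2, 3, 2], [6, 7, 7]]
print(icc1(workgroups))
```

Values near 1 indicate that members of the same group respond very similarly, which is the situation where ignoring the nesting is most misleading.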

Syllabus:

Session 1:
  1. Introduction and overview of Multilevel Models
  2. Introduction to R and the nlme and multilevel packages
Session 2:
  1. Nested Data and Mixed-Effects Models in nlme
  2. R Code for Models and Introduction to Functions Commonly used in Data Manipulation
Session 3:
  1. Repeated Measures data and Growth Models in nlme
  2. R Code for Models and Introduction to Functions Commonly used in Data Manipulation

References:

  1. Bliese, P. D. (2016). Multilevel Modeling in R (v. 2.6). https://cran.r-project.org/doc/contrib/Bliese_Multilevel.pdf

Pre-requisites:

Basic understanding of regression. An installed version of R (https://cran.r-project.org/) on a laptop for completing exercises.

Short Bio:

Paul D. Bliese, Ph.D. joined the Management Department at the Darla Moore School of Business, University of South Carolina in 2014. Prior to joining South Carolina, he spent 22 years as a research psychologist at the Walter Reed Army Institute of Research where he conducted research on stress, adaptation, leadership, well-being, and performance. Professor Bliese has long-term interests in understanding how analytics contribute to theory development and in applying analytics to complex organizational problems. He built and maintains the multilevel package for R. Professor Bliese has served on numerous editorial boards, and has been an associate editor at the Journal of Applied Psychology since 2010. He is the incoming editor-in-chief for Organizational Research Methods (July 2017).








Hendrik Blockeel   
Professor at the KU Leuven, Belgium
Decision Trees for Big Data Analytics

Summary:

Decision trees, and derived methods such as Random Forests, are among the most popular methods for learning predictive models from data. This is to a large extent due to the versatility and efficiency of these algorithms. This course will introduce students to the basic methods for learning decision trees, as well as to variations and more sophisticated versions of decision tree learners, with a particular focus on those methods that make decision trees work in the context of big data.

Syllabus:

Classification and regression trees, multi-output trees, clustering trees, model trees, ensembles of trees, incremental learning of trees, learning decision trees from large datasets, learning from data streams.

References:

Relevant references will be provided as part of the course.

Prerequisites:

Familiarity with mathematics, probability theory, statistics, and algorithms is expected, at the level typically introduced in bachelor's programs in computer science or engineering.

Bio:

Hendrik Blockeel is a professor at KU Leuven, Belgium, and part-time associate professor at Leiden University, the Netherlands. He received his Ph.D. degree in 1998 from KU Leuven. His research interests lie mostly within artificial intelligence, with a focus on machine learning and data mining, and in the use of AI-based modeling in other sciences. Dr. Blockeel has made a variety of contributions on topics such as inductive logic programming, probabilistic-logical learning, and decision trees. He is an action editor for the journals Machine Learning and Data Mining and Knowledge Discovery, and a member of the editorial board of several other journals. He has chaired or organized several conferences, including ECMLPKDD 2013, and organized the ACAI summer school in 2007. He has served on the board of the European Coordinating Committee for Artificial Intelligence, and currently serves on the ECMLPKDD steering committee as publications chair. He became an ECCAI fellow in 2015.








Tamás Budavári   
Assistant Professor, Johns Hopkins University
Big Data Approaches in Astronomy

Summary:

With the advent of large-format CCD detectors, astronomy has undergone a major paradigm shift. Over the last two decades, astronomers have developed completely new tool sets and mindsets to work with the large amounts of data pouring out of dedicated telescopes that survey the sky every night. The required innovations included new hardware and software solutions as well as novel statistical methodologies to maximize the scientific outcome. Much of this was driven by the Sloan Digital Sky Survey, which, despite its modest 2.5m mirror, has become the most impactful astronomy experiment.

Syllabus:

Introduction to modern astronomy, Databases and spatial searches, Efficient and streaming analytics, Probabilistic record linkage, Bayesian inference on GPUs,...

References:

Pre-requisites:

Basic knowledge of databases, linear algebra, and statistics

Short bio:

Tamás Budavári is Assistant Professor of Applied Mathematics & Statistics in the Whiting School of Engineering at the Johns Hopkins University, where he focuses on the mathematical and computational challenges of big data. His contributions include probabilistic record linkage, streaming robust Principal Component Analysis and hierarchical Bayesian inference on many-core architectures. He developed multi-dimensional search packages that accelerate science queries against the largest astronomy catalogs including the Sloan Digital Sky Survey, the Galaxy Evolution Explorer, the Hubble Source Catalog, and the Millennium simulations. He is founding editor of the journal Astronomy & Computing.








Diego Calvanese   
Full Professor at the Research Centre for Knowledge and Data (KRDB), at the Faculty of Computer Science of the Free University of Bozen-Bolzano.
Data-aware Processes: Modeling and Verification

Summary:

The need to combine static (i.e., data-related) and dynamic (i.e., process-related) aspects has been increasingly recognized as a key requirement for the design, verification, and understanding of complex systems. An essential element for traditional verification is that states are propositional, resulting in a finite-state transition system. However, in the presence of data, states need to be modeled relationally, causing the transition system to become infinite-state in general. Furthermore, data call for verification languages based on first-order temporal logics. The resulting verification problem is much harder than in the finite-state setting, leading to undecidability even for severely restricted systems. In this course, we address the fundamental problem of studying data-aware process formalisms and appropriate verification languages, which on the one hand guarantee decidability of verification, and on the other hand allow one to capture real-world scenarios.

Syllabus:

  1. Introduction.
    • The need to combine the process and the data perspectives.
    • Summary of concrete languages for data-aware process modeling.
    • Overview of temporal logics, transition systems, and formal verification.
  2. Formal modeling of data-aware processes.
    • Relational transition systems.
    • Data-centric dynamic systems (DCDS) as a representative example of data-aware processes: syntax, execution semantics, examples.
    • First-order temporal logics and properties of interest.
  3. Sources of undecidability in DCDSs.
    • Genericity of first-order logic.
    • Bisimulations.
    • Undecidability of reachability.
    • State-bounded DCDSs and decidability of reachability.
    • Undecidability of first-order linear temporal logics over state-bounded DCDSs.
  4. Model checking state-bounded DCDSs.
    • Decidability results.
    • Logics with “persistent quantification”.
    • Decidability results for first-order linear-time logics.
    • Decidability results for first-order branching-time logics.

References:

The area tackled in the course is relatively new, and a comprehensive textbook or monograph covering the relevant topics is still missing. Therefore, we list some key survey papers that provide a good insight into the problems tackled in this course, their practical relevance, the challenges they pose, and solutions and techniques that have been proposed. The technical development in the course will then rely on the results presented in reference [1].

  1. Diego Calvanese, Giuseppe De Giacomo, and Marco Montali. Foundations of data-aware process analysis: A database theory perspective. In Proc. of the 32nd ACM Symposium on Principles of Database Systems (PODS). ACM Press and Addison Wesley, 2013.
  2. Richard Hull. Artifact-centric business process models: Brief survey of research results and challenges. In Proc. of the On the Move Confederated Int. Conf. (OTM), volume 5332 of Lecture Notes in Computer Science, pages 1152–1163. Springer, 2008.
  3. Richard Hull, Michael Benedikt, Vassilis Christophides, and Jianwen Su. E-services: a look behind the curtain. In Proc. of the 22nd ACM Symposium on Principles of Database Systems (PODS), pages 1–14, 2003.
  4. Moshe Y. Vardi. Model checking for database theoreticians. In Proc. of the 10th Int. Conf. on Database Theory (ICDT), volume 3363 of Lecture Notes in Computer Science, pages 1–16. Springer, 2005.
  5. Victor Vianu. Automatic verification of database-driven systems: a new frontier. In Proc. of the 12th Int. Conf. on Database Theory (ICDT), pages 1–13, 2009.

Prerequisites:

Basic knowledge of the relational model and of first-order logic.

Short bio:

Diego Calvanese is a full professor at the Research Centre for Knowledge and Data (KRDB), Faculty of Computer Science, Free University of Bozen-Bolzano, where he teaches graduate and undergraduate courses on knowledge bases and databases, ontologies, theory of computing, and formal languages. He received a PhD from Sapienza University of Rome in 1996. His research interests include formalisms for knowledge representation and reasoning, ontology-based data access and integration, description logics, Semantic Web, graph data management, data-aware process verification, and service modeling and synthesis. He has been actively involved in several national and international research projects in the above areas (including FP6-7603 TONES, FP7-257593 ACSI, FP7-318338 Optique). He is the author of more than 300 refereed publications, including ones in the most prestigious international journals and conferences in Databases and Artificial Intelligence, with more than 23000 citations and an h-index of 64, according to Google Scholar. He is one of the editors of the Description Logic Handbook. He has served in over 100 program committee roles for international events, and is an associate editor of JAIR. In 2012-2013 he was a visiting researcher at the Technical University of Vienna as Pauli Fellow of the “Wolfgang Pauli Institute”. He was the program chair of the 34th ACM Symposium on Principles of Database Systems (PODS 2015) and program co-chair of the 28th Description Logic Workshop (DL 2015), and he is the general chair of the 28th European Summer School in Logic, Language and Information (ESSLLI 2016). He was nominated Fellow of the European Association for Artificial Intelligence (EurAI, formerly ECCAI) in 2015.









Geoffrey C. Fox   
Chair, Intelligent Systems Engineering, School of Informatics and Computing. Distinguished Professor of Computing, Engineering and Physics. Director of the Digital Science Center, Indiana University – Bloomington
Using High Performance Computing for Big Data Analytics

Summary:

Two major trends in computing systems are the growth in high performance computing (HPC) with an international exascale initiative, and the big data phenomenon with an accompanying cloud infrastructure of well-publicized, dramatically increasing size and sophistication. This tutorial weaves these trends together using some key building blocks. The first is HPC-ABDS, the High Performance Computing (HPC) enhanced Apache Big Data Stack (ABDS). Here we aim at using the major open source Big Data software environment but develop the principles allowing use of HPC software and hardware to achieve good performance. We give several examples of software (for example Hadoop and Heron) and algorithms implemented in this software. The second building block is the SPIDAL library (Scalable Parallel Interoperable Data Analytics Library) of scalable machine learning and data analysis software. We give examples including clustering, topic modelling and dimension reduction and their visualization. The third building block is an analysis of simulation and big data use cases in terms of 64 separate features (varying from data volume to “suitable for MapReduce” to kernel algorithm used). This allows an understanding of what type of hardware and software is needed for what type of exhibited features; it allows one to either unify or distinguish applications across the simulation and Big Data regimes. The final building block is DevOps and Software defined Systems. These allow one to package software so it runs across a variety of hardware (albeit with varying performance) with just a mouse click. These building blocks are finally linked together as a proposed convergence of Big Data and Exascale Computing.

Syllabus:

Session 1: HPC-ABDS

Session 2: SPIDAL

Session 3: Big Data Ogres and Cloudmesh

Prerequisites:

Some familiarity with ABDS software such as Hadoop, Spark, Flink, Storm, Heron and HPC technologies such as MPI would be helpful. Java will be mostly used. Some familiarity with parallel computing (algorithms and software) and with data analytics is also helpful.

References:

The 3 links below have many subsidiary links

Short Bio:

Fox received a Ph.D. in Theoretical Physics from Cambridge University where he was Senior Wrangler. He is now distinguished professor of Computing, Engineering and Physics at Indiana University where he is director of the Digital Science Center, and Chair of the Department of Intelligent Systems Engineering at the School of Informatics and Computing. He previously held positions at Caltech, Syracuse University, and Florida State University after being a postdoc at the Institute for Advanced Study at Princeton, Lawrence Berkeley Laboratory and Peterhouse College Cambridge. He has supervised the PhD of 70 students and published around 1200 papers (over 400 with at least 10 citations) in physics and computer science with an h-index of 74 and over 28500 citations. He currently works in applying computer science from infrastructure to analytics in Biology, Pathology, Sensor Clouds, Earthquake and Ice-sheet Science, Image processing, Deep Learning, Network Science, Financial Systems and Particle Physics. The infrastructure work is built around Software Defined Systems on Clouds and Clusters. The analytics focuses on scalable parallelism. He is involved in several projects to enhance the capabilities of Minority Serving Institutions. He has experience in online education and its use in MOOCs for areas like Data and Computational Science. He is a Fellow of APS (Physics) and ACM (Computing).








Minos Garofalakis   
Professor of Computer Science at the School of Electrical and Computer Engineering of the Technical University of Crete (TUC)
Streaming Big Data Analytics

Summary:

Effective Big Data analytics need to rely on algorithms for querying and analyzing massive, continuous data streams (that is, data that is seen only once and in a fixed order) with limited memory and CPU-time resources. Such streams arise naturally in emerging large-scale event monitoring applications; for instance, network-operations monitoring in large ISPs, where usage information from numerous sites needs to be continuously collected and analyzed for interesting trends. In addition to memory- and time-efficiency concerns, the inherently distributed nature of such applications also raises important communication-efficiency issues, making it critical to carefully optimize the use of the underlying network infrastructure.
This seminar will give an overview of some key algorithmic tools for effective query processing over streaming data. The focus will be on small-space sketching structures for approximating continuous data streams in both centralized and distributed settings.
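To make the idea of a small-space sketching structure concrete, here is a minimal Count-Min style sketch in Python. It is only an illustration under stated assumptions: the class name, the default `width` and `depth`, and the use of salted SHA-256 digests in place of the pairwise-independent hash functions used in the published analysis are all choices made for this example, not part of the seminar material.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter backed by a fixed depth x width table."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Assumption: one salted SHA-256 digest per row stands in for the
        # independent hash functions assumed by the theory.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Each arriving item touches exactly one counter per row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


# Hypothetical stream of site identifiers, seen once and in order.
sketch = CountMinSketch()
for site in ["a", "b", "a", "c", "a", "b"]:
    sketch.add(site)
print(sketch.estimate("a"))
```

Memory use is fixed at `width * depth` counters regardless of stream length, and estimates never fall below the true count (for non-negative updates), at the cost of possible overestimation when items collide.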

Syllabus:

Introduction; Data Streaming Models; Basic Streaming Algorithmic Tools: Reservoir Sampling, AMS/CountMin/FM sketching and Distinct Sampling, Exponential Histograms; Distributed Data Streaming: Models, Techniques, the Geometric Method and extensions to Safe Zone construction; Conclusions and Looking Forward.
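Of the tools listed in the syllabus, reservoir sampling is perhaps the simplest to sketch. The Python function below follows the classic one-pass scheme that keeps a uniform random sample of `k` items from a stream of unknown length; the function name and the optional `rng` parameter are hypothetical choices for this illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a single pass over stream.

    If the stream has fewer than k items, all of them are returned.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical usage: sample 5 readings from a stream of one million.
sample = reservoir_sample(range(1_000_000), 5, rng=random.Random(42))
print(sample)
```

Only the `k` sampled items are ever held in memory, which matches the limited-memory, single-pass setting described in the summary.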

References:

Surveys/Monographs:

  1. Graham Cormode, Minos N. Garofalakis, Peter J. Haas, Chris Jermaine: Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Foundations and Trends in Databases 4(1-3): 1-294 (2012)
  2. Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi (Eds.). "Data-Stream Management -- Processing High-Speed Data Streams", Springer-Verlag, New York (Data-Centric Systems and Applications Series), July 2016 (ISBN 978-3-540-28607-3).

Papers:

  1. Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. STOC 1996: 20-29
  2. Noga Alon, Phillip B. Gibbons, Yossi Matias, Mario Szegedy: Tracking Join and Self-Join Sizes in Limited Storage. PODS 1999: 10-20
  3. Graham Cormode, S. Muthukrishnan: An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. LATIN 2004: 29-38
  4. Phillip B. Gibbons: Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. VLDB 2001: 541-550
  5. Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani: Maintaining Stream Statistics over Sliding Windows. SIAM J. Comput. 31(6): 1794-1813, 2002
  6. Graham Cormode, Minos N. Garofalakis: Approximate continuous querying over distributed streams. ACM Trans. Database Syst. 33(2), 2008
  7. Izchak Sharfman, Assaf Schuster, Daniel Keren: A geometric approach to monitoring threshold functions over distributed data streams. SIGMOD Conference 2006: 301-312
  8. Nikos Giatrakos, Antonios Deligiannakis, Minos Garofalakis, Izchak Sharfman, Assaf Schuster: Prediction-based Geometric Monitoring over Distributed Data Streams. SIGMOD Conference 2012: 265-276
  9. Minos N. Garofalakis, Daniel Keren, Vasilis Samoladas: Sketch-based Geometric Monitoring of Distributed Stream Queries. PVLDB 6(10): 937-948, 2013
  10. Arnon Lazerson, Izchak Sharfman, Daniel Keren, Assaf Schuster, Minos N. Garofalakis, Vasilis Samoladas: Monitoring Distributed Streams using Convex Decompositions. PVLDB 8(5): 545-556 (2015)

Short Bio:

Minos Garofalakis received the Diploma degree in Computer Engineering and Informatics from the University of Patras, Greece in 1992, and the M.Sc. and Ph.D. degrees in Computer Science from the University of Wisconsin-Madison in 1994 and 1998, respectively. He worked as a Member of Technical Staff at Bell Labs, Lucent Technologies in Murray Hill, NJ (1998-2005), as a Senior Researcher at Intel Research Berkeley in Berkeley, CA (2005-2007), and as a Principal Research Scientist at Yahoo! Research in Santa Clara, CA (2007-2008). In parallel, he also held an Adjunct Associate Professor position at the EECS Department of the University of California, Berkeley (2006-2008). As of October 2008, he is a Professor of Computer Science at the School of Electrical and Computer Engineering of the Technical University of Crete (TUC), and the Director of the Software Technology and Network Applications Laboratory (SoftNet). Prof. Garofalakis’ research focuses on Big Data analytics, spanning areas such as database systems, data streams, data synopses and approximate query processing, probabilistic databases, and large-scale machine learning and data mining. His work has resulted in over 140 published scientific papers in these areas, and 36 US Patent filings (29 patents issued) for companies such as Lucent, Yahoo!, and AT&T. Google Scholar lists over 11,000 citations to his work, and an h-index value of 56. Prof. Garofalakis is an ACM Distinguished Scientist (2011), a Senior Member of the IEEE, and a recipient of the 2015 TUC Research Excellence Award, the IEEE ICDE Best Paper Award (2009), the Bell Labs President’s Gold Award (2004), and the Bell Labs Teamwork Award (2003).








David W. Gerbing   
Professor of Quantitative Methods. Portland State University
Data Visualization with R

Summary:

The R language is illustrated for data visualization, aka computer graphics, in the context of a discussion of best practices. Graphics are demonstrated with R base graphics, Hadley Wickham's ggplot2 package, and the author's lessR package. The content of the seminar is summarized with R Markdown files that include commentary and implementation of all the code presented in the seminar.

Syllabus:

Session 1

Session 2

Session 3

References:

  1. Gerbing, D. W. (2013). R Data Analysis without Programming, NY: Routledge.
  2. Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

Pre-requisites:

Basic understanding of data analysis

Short bio:

David Gerbing, Ph.D., has been Professor of Quantitative Methods in the School of Business Administration at Portland State University since 1987. He received his Ph.D. in quantitative psychology from Michigan State University in 1979. From 1979 until 1987 he was an Assistant and then Associate Professor of Psychology and Statistics at Baylor University. He has authored R Data Analysis without Programming, which describes his lessR package, and many articles on statistical techniques and their application in a variety of journals that span several academic disciplines.








Georgios Giannakis   
Professor, ADC Chair in Wireless Telecommunications. Director, Digital Technology Center. Department of Electrical and Computer Engineering, University of Minnesota
Signal Processing Tools for Big Network Data Analytics

Summary:

We live in an era of data deluge, where pervasive interconnected sensors collect massive amounts of information on every bit of our lives. Accordingly, propelled by emergent social networking services as well as high-definition streaming platforms, communication networks have evolved from specialized, research and tactical transmission systems to large-scale and highly complex interconnections of intelligent devices that carry massive volumes of multimedia traffic. While Big Data can definitely be perceived as a big blessing, big challenges also arise with large-scale network data. The sheer volume of data often makes it impossible to run analytics using a central processor and storage unit, and distributed processing with parallelized multi-processors is preferred while the data themselves are stored in the cloud. As many sources continuously generate data in real time, analytics must often be performed “on-the-fly,” without an opportunity to revisit past entries. Due to their disparate origins, the resultant network datasets are often incomplete and include a sizable portion of missing entries. Overall, Big Data present challenges in which resources such as time, space, and energy are intertwined in complex ways with data resources. As "netizens" demand a seamless networking experience with not only higher speeds, but also resilience to failures and malicious cyber-attacks, ample opportunities for data-driven signal processing (SP) research arise. This tutorial seeks to provide an overview of ongoing research in novel models applicable to a wide range of Big Data analytics problems arising with, e.g., dynamic network monitoring, as well as algorithms and architectures to handle the practical challenges, while revealing fundamental limits and insights on the mathematical trade-offs involved.

Syllabus:

This tutorial aims at (i) delineating the theoretical and algorithmic background and the relevance of Signal Processing (SP) tools to the emerging field of Big Data; and (ii) introducing the community to the challenges and opportunities for SP research on (massive-scale) data analytics. The latter entails an extended and continuously refined wish-list of technologies, envisioned to encompass high-dimensional, decentralized, online, and robust statistical SP, as well as large, distributed, fault-tolerant, and intelligent systems engineering. The goal is to selectively cover a diverse gamut of Big Data challenges and opportunities through a comprehensive tutorial surveying methodological advances, as well as more application-oriented topics and illustrative examples.
In this context, the outline of this 3-hour long tutorial is as follows.
I. Introduction, motivation and context (20 mins.)
II. Theoretical and statistical foundations for Big Data Analytics (1 hr.)
 a) High-dimensional statistical SP and succinct data representations;
  i. Compressive sampling, sparsity, and (non-linear) dimensionality reduction
  ii. Low-rank models, matrix completion, and regularization
 b) Robust approaches to coping with outliers and missing data;
 c) Big tensor data models and factorizations
III. Algorithmic advances for mining massive datasets (1 hr.)
 a) Scalable, online, and decentralized learning and optimization;
 b) Randomized algorithms for very large matrix, graph, and regression problems;
 c) Convergence analysis, computational complexity, and performance
IV. Architectures for large-scale data analysis and signal processing (30 mins.)
 a) Scalable, distributed computing, e.g., MapReduce, Hadoop;
 b) Large-scale graph processing, e.g., Pregel, Giraph
V. Concluding remarks (10 mins.)

Prerequisites:

The target audience includes graduate students, researchers and engineers with general interests in machine learning, network science, communications, and (sparsity-aware) signal processing. The background needed is that of an M.Sc. degree holder or commensurate experience with random processes, optimization, linear algebra, machine learning, and statistical signal processing.

References:

Specific citations are provided on the slides themselves. A useful tutorial that the presentation is built on, is the following:

Short bio:

Georgios B. Giannakis (Fellow’97) received his Diploma in Electrical Engr. from the Ntl. Tech. Univ. of Athens, Greece, 1981. From 1982 to 1986 he was with the Univ. of Southern California (USC), where he received his MSc. in Electrical Engineering, 1983, MSc. in Mathematics, 1986, and Ph.D. in Electrical Engr., 1986. He was with the University of Virginia from 1987 to 1998, and since 1999 he has been a professor with the Univ. of Minnesota, where he holds an Endowed Chair in Wireless Telecommunications, a University of Minnesota McKnight Presidential Chair in ECE, and serves as director of the Digital Technology Center.
His general interests span the areas of data and network sciences, statistical signal processing, and communications - subjects on which he has published more than 400 journal papers, 680 conference papers, 30 book chapters, two edited books and 2 research monographs (h-index 123). Current research focuses on learning from Big Data, wireless cognitive radios, and network science with applications to social, brain, and power networks with renewables. He is the (co-) inventor of 28 patents issued, and the (co-) recipient of 8 best paper awards from the IEEE Signal Processing (SP) and Communications Societies, including the G. Marconi Prize Paper Award in Wireless Communications. He also received Technical Achievement Awards from the SP Society (2000), from EURASIP (2005), a Young Faculty Teaching Award, the G. W. Taylor Award for Distinguished Research from the University of Minnesota, and the IEEE Fourier Technical Field Award (2015). He is a Fellow of EURASIP, and has served the IEEE in a number of posts, including that of a Distinguished Lecturer for the IEEE-SP Society.








Sander Klous   
Professor in Big Data Ecosystems for Business and Society at the University of Amsterdam
We Are Big Data

Summary:

Everything is measurable, from our heartbeat during our morning run to the music we listen to, even our walking patterns through department stores. We can create impressive new insights by analysing such data, for example to prevent traffic jams, suppress epidemics or offer personalized medicine, precisely targeted at our individual needs. A new information society is emerging, interwoven with technology. Nevertheless, there is some serious resistance to the rise of this new society. People fear that their privacy is in jeopardy and want to put their data under lock and key. In this presentation Sander Klous discusses the inevitability of a continuously growing role of data in our society and what is needed for the responsible application of data analysis.

Syllabus:

We are Big Data: http://www.springer.com/gp/book/9789462391826

References:

None

Pre-requisites:

Book http://www.springer.com/gp/book/9789462391826

Short bio:

Sander Klous holds a PhD in High Energy Physics and contributed to the ATLAS and LHCb experiments at CERN. Klous is responsible for the data and analytics services of KPMG and is professor in Big Data Ecosystems for Business and Society at the University of Amsterdam. His book “We are Big Data” is a best-seller in the Netherlands and was runner-up for the management book of the year award in 2015.








Laks V.S. Lakshmanan   
Professor, Department of Computer Science, The University of British Columbia
Analysis of Large Social Networks

Summary:

tba.








Maurizio Lenzerini   
Full professor in Computer Science. DIAG, Sapienza Università di Roma.
Ontology-based Data Management

Summary:

The need to effectively manage the data sources of an organization, which are often autonomous, distributed, and heterogeneous, and to devise tools for deriving useful information and knowledge from them is widely recognized as one of the challenging issues in modern information systems. Ontology-based data management aims at accessing, using, and maintaining data by means of an ontology, i.e., a conceptual representation of the domain of interest in the underlying information system. This new paradigm provides several interesting features, many of which have already proved effective in managing complex information systems. In the course, we first provide an introduction to ontology-based data management, then we illustrate the main techniques for using an ontology to access the data layer of an information system, and finally we discuss several important issues that are still the subject of extensive investigations, including meta-modeling, meta-querying, and inconsistency tolerance.

Syllabus:

  1. Introduction to ontology-based data management (OBDM).
  2. Languages for OBDM.
  3. Query answering in OBDM.
  4. Meta-modeling and higher-order ontology languages.
  5. The problem of meta-querying; inconsistency tolerance in OBDM.

References:

Pre-requisites:

Basic notions of databases, logic, computational complexity.

Short bio:

Maurizio Lenzerini (http://www.dis.uniroma1.it/~lenzerini) is a Professor of Data Management at the Dipartimento di Ingegneria Informatica Automatica e Gestionale Antonio Ruberti of Sapienza Università di Roma, where he leads a research group working on Database Theory, Data Management, Knowledge Representation and Automated Reasoning, and Ontology-based Data Integration. He is the author of more than 300 publications on the above topics, which have received more than 20,000 citations. According to Google Scholar, his h-index is currently 70. He has been an invited keynote speaker at many international conferences. He is the recipient of two IBM Faculty Awards, a Fellow of EurAI (formerly European Coordinating Committee for Artificial Intelligence, ECCAI) since 2008, a Fellow of the ACM (Association for Computing Machinery) since 2009, and a member of the Academia Europaea - The European Academy since 2011.








Soumya D. Mohanty   
Associate Professor, University of Texas Rio Grande Valley.
Swarm Intelligence Methods and Optimization Problems in Big Data Analytics

Summary:

Big Data applications generically require the use of flexible, hence high-dimensional, models to capture meaningful patterns in the data, and this usually leads to challenging non-linear and non-convex global optimization problems. The large data volume that must be handled further increases their difficulty. This course will introduce methods from the field of computational swarm intelligence (SI) that are useful for solving such optimization problems. SI is a metaheuristic inspired by observations of cooperative behavior in multi-agent biological systems. This relatively new paradigm follows in the wake of well-known biology-inspired optimization methods such as Genetic Algorithms. The course will introduce several SI methods, with the focus being on an in-depth presentation of PSO -- Particle Swarm Optimization. PSO shows a remarkable robustness across a wide range of optimization problems, reducing the burden that is generally involved in tuning stochastic global optimization algorithms to obtain good performance. In terms of practical applications, optimization problems from a diverse range of Big Data fields will be considered, with a focus on some recent successes achieved with SI methods on data analysis challenges in Astronomy.
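
As an illustration of the kind of method the course focuses on, the following is a minimal sketch of a global-best PSO variant in plain Python. It is not taken from the course material: the function name pso_minimize, the parameter values (inertia 0.72, acceleration constants 1.49), the clamping of positions to the search box, and the sphere test function are all assumptions chosen for illustration.

```python
import random


def pso_minimize(f, dim, bounds, n_particles=30, n_iters=200,
                 inertia=0.72, c_cog=1.49, c_soc=1.49, seed=0):
    """Global-best PSO sketch (illustrative parameters, not tuned).

    Returns (best_position, best_value) found for the objective f.
    """
    rng = random.Random(seed)
    lo, hi = bounds
    # Random initial positions inside the search box, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Each particle remembers its own best position so far.
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    # The swarm shares the best position found by any particle.
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia + pull to personal best + pull to global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_cog * r1 * (pbest[i][d] - pos[i][d])
                             + c_soc * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp to the search box (one of several boundary rules).
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Simple convex test objective with its minimum of 0 at the origin."""
    return sum(xi * xi for xi in x)


best_x, best_val = pso_minimize(sphere, dim=3, bounds=(-5.0, 5.0))
```

The sphere function is only a sanity check; the non-convex problems mentioned in the summary would replace it with a data-dependent fitness function, and the parameters would likely need adjusting.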

Syllabus:

The following is a list of the main topics to be covered in the course.

References:

The following textbooks (and references therein):

  1. "Fundamentals of Computational Swarm Intelligence", A. P. Engelbrecht, Wiley.
  2. "Particle Swarm Optimization", M. Clerc, ISTE.
A list of papers will be provided during the course.

Pre-requisites:

None, as this is an introductory course. However, familiarity with basic probability theory and statistics will be a plus.

Short bio:

Soumya D. Mohanty, Professor of Physics at UTRGV, completed his PhD degree in 1997 at the Inter-University Center for Astronomy and Astrophysics, India. He subsequently held post-doctoral positions at Northwestern University, Penn State, and the Max-Planck Institute for Gravitational Physics. He was also a visiting scholar with the LIGO project at Caltech. Mohanty’s research has focused on solving some of the important data analysis challenges faced in the realization of Gravitational Wave (GW) astronomy across all observational frequency bands. These include semi-parametric regression of very weak signals in noisy data, high-dimensional non-linear parametric regression, time series classification, and analysis of data from large heterogeneous sensor arrays. He is the recipient of several grants, from the Research Corporation, the U.S. National Science Foundation, and NASA, in support of this work.








Bernhard Pfahringer   
Professor at the Computer Science Department at the University of Waikato.
Introduction to Data Stream Mining for Big Data

Summary:

Data streams are everywhere, but regular machine learning fails to deal with streams properly. In a nutshell, the world is not i.i.d. (independently and identically distributed). Stream mining assumes, and appropriately deals with, non-stationary data sources.
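
A small sketch may help make the non-stationarity point concrete. It is not part of the course material: the function track_means, the forgetting factor alpha, and the synthetic stream with an abrupt level shift are assumptions chosen for illustration. It compares a running mean, which implicitly treats all items as drawn from one fixed distribution, with an exponentially weighted mean that gradually forgets old items.

```python
def track_means(stream, alpha=0.1):
    """Process a stream in one pass and return (running_mean, forgetting_mean).

    running_mean weights every item equally (stationarity assumption);
    forgetting_mean is an exponentially weighted average that discounts old items.
    """
    n = 0
    running = 0.0
    ewma = None
    for x in stream:
        n += 1
        # Incremental update of the plain mean; no past items are stored.
        running += (x - running) / n
        # Exponential forgetting: recent items dominate the estimate.
        ewma = x if ewma is None else (1 - alpha) * ewma + alpha * x
    return running, ewma


# Synthetic stream whose level jumps from 0 to 10 halfway through.
stream = [0.0] * 500 + [10.0] * 500
running_mean, forgetting_mean = track_means(stream)
```

After the shift, the running mean settles between the old and new levels, while the forgetting estimate follows the current level. Practical stream learners typically rely on more refined mechanisms, such as sliding windows or explicit change detectors.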

Syllabus:

The following topics will be covered:

References:

Knowledge Discovery from Data Streams, Joao Gama, 2010 by Chapman and Hall/CRC, ISBN 9781439826119

Pre-requisites:

Some basic machine learning knowledge, including standard algorithms like decision trees, ensembles, clustering, and evaluation, is expected.

Short bio:

Bernhard Pfahringer received his PhD degree from the Vienna University of Technology in Austria in 1995. He is currently a Professor with the Department of Computer Science at the University of Waikato in New Zealand. His interests span a range of data mining and machine learning sub-fields, with a focus on streaming, randomization, and complex data.








Krithi Ramamritham   
Professor, Department of Computer Science and Engg. Head, Center for Urban Sc. and Engg. IIT Bombay.
Harnessing Big Data for Building Smart Things

Summary:

These days, unless something has the epithet "smart" attached to it, it is nothing. Smart Energy solutions promise cleaner, cheaper and more reliable energy. Smart Cities promise better quality of life for their citizens. We will argue that for a "system" to be SMART, it should Sense Meaningfully, Analyze and Respond Timely. Using real-world examples from the domains of Smart Energy and Smart Cities, this talk will illustrate the central role of big data in being SMART.

Syllabus:

  1. What are the attributes of something that is smart?
  2. How do we build something with these attributes?
  3. Examples of smart things and the role of Big Data
  4. Issues in the design and implementation of smart things: Sensing, Analysis and Response
  5. Case studies from smart buildings, smart energy, smart transportation, smart cities, etc.

References:

  1. Srinivasan Iyengar, Navin Sharma, David E. Irwin, Prashant J. Shenoy, Krithi Ramamritham: SolarCast: a cloud-based black box solar predictor for smart homes. BuildSys@SenSys 2014: 40-49
  2. G. Karmakar, A. Kabra, and K. Ramamritham. Maintaining thermal comfort in buildings: Feasibility, algorithms, implementation, evaluation. Real-Time Systems, 51(5):485–525, Sept. 2015.
  3. N. Nasir, K. Palani, A. Chugh, V. Prakash, U. Arote, A. Krishnan, and K. Ramamritham. Fusing sensors for occupancy sensing in smart buildings. In R. Natarajan, G. Barua, and M. Patra, editors, Distributed Computing and Internet Technology, Volume 8956 of the series Lecture Notes in Computer Science, pp. 73-92.
  4. Kedar Khandeparkar, Krithi Ramamritham, Rajeev Gupta, Anil Kulkarni, Gopal Gajjar, and Shreevardhan Soman. 2015. Timely Query Processing in Smart Electric Grids: Algorithms and Performance. In Proceedings of the 2015 ACM Sixth International Conference on Future Energy Systems (e-Energy '15). ACM, New York, NY, USA, 161-170. DOI=http://dx.doi.org/10.1145/2768510.2768535
  5. Agarwal, A., Munigala, V., and Ramamritham, K., Observability: A Principled Approach to Provisioning Sensors in Buildings. In Proceedings of the 3rd ACM International Conference on Systems for Energy-Efficient Built Environments (BuildSys 2016), November 16-17, 2016, Stanford, CA, USA

Pre-requisites:

Exposure to Masters level CS subjects: OS, distributed systems, databases, networking.

Short bio:

Prof. Krithi Ramamritham holds a Ph.D. degree in Computer Science from the University of Utah. He did his B.Tech (Electrical Engineering) and M.Tech (Computer Science) degrees from IIT Madras. After a long stint at the University of Massachusetts, he moved to IIT Bombay as the Vijay and Sita Vashee Chair Professor in the Department of Computer Science and Engineering. During 2006-2009, he served as Dean (R&D) at IIT Bombay. He currently heads IIT Bombay's new Center for Urban Science and Engineering. Prof. Ramamritham's research explores timeliness and consistency issues in computer systems, in particular, databases, real-time systems, and distributed applications. His recent work addresses these issues in the context of sensor networks, embedded systems, mobile environments and smart grids. During the last few years he has been interested in the use of Information and Communication Technologies for creating tools aimed at socio-economic development. He is a Fellow of the IEEE, ACM, Indian Academy of Sciences, National Academy of Sciences, India, and the Indian National Academy of Engineering. He was honored with a Doctor of Science (Honoris Causa) by the University of Sydney. He is also a recipient of the Distinguished Alumnus Award from IIT Madras. Twice he received the IBM Faculty Award. He just received the 2015 Outstanding Technical Contributions and Leadership Award from the IEEE Technical Committee for Real-Time Systems and IEEE's CEDA Outstanding Service Award. He has been associated with the editorial board of various journals. These include IEEE Embedded Systems Letters and Springer's Real-Time Systems Journal (Editor-in-Chief), IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Mobile Computing, IEEE Internet Computing, ACM Computing Surveys and the VLDB (Very Large Databases) Journal. 
He has served on the Board of Directors of Persistent Systems, Pune, on the Board of Trustees of the VLDB Endowment, and on the Technical Advisory Board of TTTech, Vienna, Austria, Microsoft Research India, and Tata Consultancy Services.








Michael Rosenblum   
Statistical Physics / Theory of Chaos Group. Dept. of Physics and Astronomy. University of Potsdam
Coupled Oscillators Approach in Time Series Analysis

Summary:

Many natural systems can be understood as networks of oscillatory units. We start by considering basic notions of synchronization theory, which describes the dynamics of interacting oscillators. Next, we use these results to solve an inverse problem and obtain information about the networks from observations. This model-based data analysis provides information about the strength of the interaction and allows us to address the connectivity problem. The connectivity problem, i.e., the reconstruction of causal directional links in a network from data, is especially important in the analysis of physiological data, in particular in neuroscience. We discuss how this problem can be solved by reconstructing the model of phase dynamics of the network. We illustrate the theory with real-world examples.
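
To give a flavor of the basic notions involved, the sketch below simulates two phase oscillators and computes a phase synchronization index from their phases. It is only an illustration under stated assumptions, not the method taught in the course: the sine coupling (a Kuramoto-type model), the natural frequencies, the coupling strengths, the Euler integration step, and the function name sync_index are all chosen here for demonstration. The index is the magnitude of the time-averaged unit vector exp(i(phi1 - phi2)): it is close to 1 when the phase difference stays bounded and smaller when the phases drift apart.

```python
import cmath
import math


def sync_index(eps, w1=1.0, w2=1.2, dt=0.01, steps=100_000):
    """Simulate two sine-coupled phase oscillators and return the
    synchronization index |<exp(i(phi1 - phi2))>| over the second half
    of the run (the first half is discarded as a transient).
    """
    phi1, phi2 = 0.0, 1.0
    acc = 0j
    count = 0
    for n in range(steps):
        # Assumed model: d(phi1)/dt = w1 + eps*sin(phi2 - phi1), and symmetrically.
        d1 = w1 + eps * math.sin(phi2 - phi1)
        d2 = w2 + eps * math.sin(phi1 - phi2)
        # Explicit Euler step (adequate for this smooth, slow dynamics).
        phi1 += dt * d1
        phi2 += dt * d2
        if n >= steps // 2:
            acc += cmath.exp(1j * (phi1 - phi2))
            count += 1
    return abs(acc / count)


# The frequency mismatch is 0.2; in this model locking requires 2*eps >= 0.2.
weak = sync_index(0.05)   # coupling too weak: phases drift, index well below 1
strong = sync_index(0.3)  # coupling strong enough: phases lock, index near 1
```

In data analysis the phases are not simulated but estimated from measured signals, which is the subject of the second session of the syllabus.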

Syllabus:

First session

  1. Introduction: coupled oscillators and oscillatory time series
  2. Basics of synchronization theory
Second session
  1. Coupled oscillators in data analysis: formulation of the inverse problem
  2. Phase estimation from data
  3. How to reveal and quantify weak interaction
Third session
  1. The connectivity problem: recovering directional coupling in oscillatory networks
  2. Application example

References:

  1. A. Pikovsky, M. Rosenblum, and J. Kurths, Synchronization, A Universal Concept in Nonlinear Sciences, Cambridge University Press, 2001
  2. M. Rosenblum, L. Cimponeriu, and A.S. Pikovsky, Coupled oscillators approach in analysis of bivariate data, In: Handbook of Time Series Analysis, Wiley-VCH, Editors: B. Schelter, M. Winterhalder, and J. Timmer, Chapter 7, pp. 159-180, 2006.
  3. B. Kralemann, M. Frühwirth, A. Pikovsky, M. Rosenblum, T. Kenner, J. Schaefer, and M. Moser, In vivo cardiac phase response curve elucidates human respiratory heart rate variability, Nature Communications, 4, p. 2418, 2013
  4. B. Kralemann, A. Pikovsky, and M. Rosenblum, Reconstructing effective phase connectivity of oscillator networks from observations, New Journal of Physics, 16, 085013, 2014

Pre-requisites:

Basic knowledge of mathematics, in particular, basic facts about ordinary differential equations

Short bio:

Michael Rosenblum is a professor of physics at the University of Potsdam, Germany. His main research interests include chaotic dynamics, dynamics of networks of coupled oscillators, control of oscillatory dynamics, synchronisation theory and its application to data analysis. He served two terms as a member of the Editorial Board of Physical Review E and is currently an editor of Chaos: An Interdisciplinary Journal of Nonlinear Science. He has published over 100 papers in peer-reviewed journals, including 4 papers in Nature and 11 in Physical Review Letters.








Pierangela Samarati   
Professor at the Computer Science Department of the Università degli Studi di Milano.
Data Security and Privacy in the Cloud

Summary:

The “cloud” has become a successful paradigm for conveniently storing, accessing, processing, and sharing information. With its significant benefits of scalability and elasticity, the cloud paradigm has appealed to companies and users, who are increasingly resorting to the multitude of available cloud providers for storing and processing data. Unfortunately, such convenience comes at the price of loss of control over these data by their owner, and consequent new security threats that can limit the potential widespread adoption and acceptance of the cloud computing paradigm. In this lecture, I will discuss security and privacy issues arising in the cloud scenario, addressing problems related to guaranteeing confidentiality and integrity of data stored or processed by external providers, ensuring access privacy, regulating and controlling access to data in the cloud, and performing queries on protected data.

Syllabus:

Security and privacy issues in the cloud, data confidentiality, access control, query execution on protected data, access privacy, data and computation integrity.

References:

  1. P. Samarati, S. De Capitani di Vimercati, "Cloud Security: Issues and Concerns," in Encyclopedia on Cloud Computing, S. Murugesan, I. Bojanova (eds.), Wiley, 2016.
  2. S. De Capitani di Vimercati, S. Foresti, G. Livraga, P. Samarati, "Practical Techniques Building on Encryption for Protecting and Managing Data in the Cloud," in Festschrift for David Kahn, P. Ryan, D. Naccache, J.-J. Quisquater (eds.), Springer, 2016.
  3. S. De Capitani di Vimercati, S. Foresti, S. Paraboschi, G. Pelosi, P. Samarati, "Three-Server Swapping for Access Confidentiality," in IEEE Transactions on Cloud Computing (TCC), 2015 (pre-print).
  4. S. De Capitani di Vimercati, S. Foresti, S. Paraboschi, G. Pelosi, P. Samarati, "Shuffle Index: Efficient and Private Access to Outsourced Data," in ACM Transactions on Storage (TOS), vol. 11, n. 4, October 2015, pp. 1-55 (Article 19).
  5. S. De Capitani di Vimercati, S. Foresti, S. Jajodia, G. Livraga, S. Paraboschi, P. Samarati, "Fragmentation in Presence of Data Dependencies," in IEEE Transactions on Dependable and Secure Computing (TDSC), vol. 11, n. 6, November-December 2014, pp. 510-523.
  6. S. De Capitani di Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, P. Samarati, "Integrity for Join Queries in the Cloud," in IEEE Transactions on Cloud Computing (TCC), vol. 1, n. 2, July-December 2013, pp. 187-200.
  7. S. De Capitani di Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, P. Samarati, "Encryption Policies for Regulating Access to Outsourced Data," in ACM Transactions on Database Systems (TODS), vol. 35, n. 2, April 2010, pp. 1-46 (Article 12).

Pre-requisites:

Basic background in databases.

Short bio:

Pierangela Samarati is a Professor at the Department of Computer Science of the Università degli Studi di Milano, Italy. Her main research interests are access control policies, models and systems, data security and privacy, information system security, and information protection in general. She is the project coordinator of the ESCUDO-CLOUD project, funded by the EC H2020 programme, and she has participated in several projects involving different aspects of information protection. On these topics she has published more than 250 peer-reviewed articles in international journals, conference proceedings, and book chapters. She has been a Computer Scientist in the Computer Science Laboratory at SRI, CA (USA). She has been a visiting researcher at the Computer Science Department of Stanford University, CA (USA), and at the Center for Secure Information Systems of George Mason University, VA (USA). She is the chair of the IEEE Systems Council Technical Committee on Security and Privacy in Complex Information Systems (TCSPCIS), of the ERCIM Security and Trust Management Working Group (STM), and of the Steering Committees of the European Symposium On Research In Computer Security (ESORICS) and of the ACM Workshop on Privacy in the Electronic Society (WPES). She is a member of several steering committees. She is an ACM Distinguished Scientist (named 2009) and an IEEE Fellow (named 2012). She has received the IEEE Computer Society Technical Achievement Award (2016). She has been awarded the IFIP TC11 Kristian Beckman Award (2008) and the IFIP WG 11.3 Outstanding Research Contributions Award (2012). She has served as General Chair, Program Chair, and program committee member of several international conferences and workshops.








V.S. Subrahmanian   
Professor of Computer Science and Director of the Lab for Computational Cultural Dynamics and Director of the Center for Digital International Government at the University of Maryland
Big Data and Cybersecurity

Summary:

This course will study the use of machine learning methods in cybersecurity. Predictive analytics methods have been used for a wide variety of cybersecurity needs. For instance, they have been used to separate malware from safeware, to separate ransomware from both safeware and other malware, to predict whether certain programs are keyloggers or not, to detect phishing attempts, and more. In this course, we will start with a basic 45-minute overview of classification methods, followed by 5 case studies on the use of ML in cybersecurity. The 5 case studies will focus on: (i) phishing attack detection and prediction, (ii) predicting whether a piece of code is ransomware or not, (iii) predicting whether a given URL is malicious or not, (iv) identifying the malware family to which a given malware sample belongs, and (v) predicting the number of hosts in a given host population that will be infected by a specific piece of malware.

Syllabus:

References:

  1. Fette, I., Sadeh, N., & Tomasic, A. (2007, May). Learning to detect phishing emails. In Proceedings of the 16th international conference on World Wide Web (pp. 649-656). ACM.
  2. Almomani, Ammar, B. B. Gupta, Samer Atawneh, A. Meulenberg, and Eman Almomani. "A survey of phishing email filtering techniques." IEEE communications surveys & tutorials 15, no. 4 (2013): 2070-2090.
  3. Lee, Sangho, and Jong Kim. "WarningBird: Detecting Suspicious URLs in Twitter Stream." NDSS. Vol. 12. 2012.
  4. Zhao, Peilin, and Steven CH Hoi. "Cost-sensitive online active learning with application to malicious URL detection." Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013.
  5. Rahbarinia, B., Balduzzi, M., & Perdisci, R. (2016, May). Real-Time Detection of Malware Downloads via Large-Scale URL-> File-> Machine Graph Mining. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security (pp. 783-794). ACM.
  6. Ortolani, Stefano, Cristiano Giuffrida, and Bruno Crispo. "Bait your hook: a novel detection technique for keyloggers." International Workshop on Recent Advances in Intrusion Detection. Springer Berlin Heidelberg, 2010.
  7. Sagiroglu, Seref, and Gurol Canbek. "Keyloggers." IEEE technology and society magazine 28.3 (2009): 10-17

Pre-requisites:

A broad knowledge of computer science is all that is needed. A course on machine learning and/or statistics would be helpful but is not necessary.

Short bio:

V.S. Subrahmanian is Professor of Computer Science at the University of Maryland and heads the Center for Digital International Government, having previously served as Director of the University of Maryland's Institute for Advanced Computer Studies. Prof. Subrahmanian is an expert on big data analytics, in particular on learning behavioral models from data, forecasting actions and events, and influencing behaviors. He pioneered the use of data science in international security, counter-terrorism, conservation, finance, and cyber-security applications. His work has been featured in numerous outlets such as the Baltimore Sun, the Economist, Science, Nature, the Washington Post, and American Public Media. He serves on the editorial boards of numerous ACM and IEEE journals as well as Science, the Board of Directors of the Development Gateway Foundation (set up by the World Bank), SentiMetrix, Inc., and on the Research Advisory Board of Tata Consultancy Services. He previously served on DARPA's Executive Advisory Council on Advanced Logistics and as an ad-hoc member of the US Air Force Science Advisory Board (2001).








Alexander S. Tuzhilin   
Leonard N. Stern Professor of Business, Professor of Information Systems, Chair, Department of Information, Operations and Management Sciences. New York University.
Recommender Systems and Big Data

Summary:

In the first part of this course, we will provide a broad overview of recommender systems and examine the key problems and methods developed for providing useful recommendations across various application domains. In the second part of the course, we will focus on large-scale recommenders deployed in industry and will study various methods specifically designed to scale up to such applications, which manage large volumes of heterogeneous data. We will also examine real-world cases of how such techniques are applied in practice.

Short bio:

Alexander Tuzhilin is the Leonard N. Stern Professor of Business and the Chair of the IOMS Department at the Stern School of Business, NYU. Professor Tuzhilin’s current research interests include personalization, recommender systems and data mining. He has produced more than 120 research publications on these and other topics in various journals, books and conference proceedings. Professor Tuzhilin has served on the organizing and program committees of numerous conferences, including as Program Chair and General Chair of the IEEE International Conference on Data Mining (ICDM), and as Program Chair, Conference Chair, and Chair of the Steering Committee of the ACM Conference on Recommender Systems (RecSys). He currently serves as the Editor-in-Chief of the ACM Transactions on Management Information Systems.
Professor Tuzhilin received his B.S. in Mathematics from NYU, the M.S. in Engineering Economics from Stanford University, and the Ph.D. in Computer Science from the Courant Institute of Mathematical Sciences at NYU.








Jeffrey Ullman   
Stanford W. Ascherman Professor of Computer Science (Emeritus)
Big Data Algorithms that Aren't Machine Learning

Summary:

We shall study algorithms that have been found useful in querying large data volumes. The emphasis is on algorithms that cannot be considered "machine learning."

Syllabus:

Pre-requisites:

A course in algorithms at the advanced-undergraduate level is important. A course in database systems is helpful, but not required.

References:

We will be covering (parts of) Chapters 3, 4, 5, and 10 of the free text Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman, and Jeff Ullman, available at www.mmds.org.

Short bio:

http://i.stanford.edu/~ullman/pub/opb.txt








Lyle Ungar   
Professor of Computer and Information Science at the University of Pennsylvania.
Sentiment Mining from User Generated Content

Summary:

The proliferation of user-generated content on the Web is driving a new wave of work determining user sentiment from web texts such as message boards, blogs, tweets, and Facebook status updates. Both researchers and practitioners are developing and applying new methods to determine how users feel about everything: products and politicians, friends and family, scientific articles and celebrities. This short course will cover the state of the art in this rapidly growing area, including recent advances that combine information extraction with sentiment analysis to improve accuracy in assessing sentiment about specific entities, and the use of NLP to understand people's personality and health behaviors. Real-world applications of sentiment analysis will be presented.

Syllabus:

  1. Introduction: What is Sentiment Analysis and why is it hard?
  2. Sentiment analysis and Information Extraction (IE)
    1. Why sentiment analysis requires IE
    2. The core components of IE Systems
      1. Entity recognition
      2. Relationship extraction
  3. Sentiment Analysis
    1. Sentiment analysis types and uses
      1. Kinds of sentiment
      2. Polarity, intensity and subjectivity
      3. Sentiment vs emotion
    2. Approaches to sentiment analysis
      1. Dictionary-based
      2. Pattern-based
      3. Event-based
    3. Machine learning methods for sentiment analysis
      1. Unsupervised - LDA and PCA
      2. Supervised - regression and randomized forests
      3. Semi-supervised
    4. NLP and sentiment analysis
      1. POS tagging, word sense disambiguation, and parsing
  4. Detailed sentiment analysis case studies
    1. Discussion Boards
      1. Extracting product comparisons
      2. Assessing market structure
    2. Medical Forums
      1. How users feel about various drugs
      2. When do they switch to other drugs and why?
    3. Facebook and Twitter
      1. The language of age, sex, and personality
    4. Crowdsourcing social media
      1. Language use and health outcomes: individual and community
  5. Conclusions
    1. What works when
    2. Resources for sentiment analysis
      1. Sentiment dictionaries
      2. Corpora
      3. Software
    3. Open Discussion

References:

Text Mining and Information Extraction:

Bing Liu’s tutorial on Sentiment Analysis:

Open source text mining and NLP tools:

Pre-requisites:

No special background is required

Short bio:

Dr. Lyle Ungar is a Professor of Computer and Information Science at the University of Pennsylvania, where he also holds appointments in multiple departments in the Schools of Business, Medicine, Arts and Sciences, and Engineering and Applied Science. Lyle received a B.S. from Stanford University and a Ph.D. from M.I.T. He has published over 200 articles, supervised over two dozen PhD theses, and is co-inventor on eleven patents. His current research focuses on developing scalable machine learning methods for data mining and text mining, including the analysis of social media to better understand the drivers of physical and mental well-being.








Zhongfei Zhang   
Professor of Computer Science at State University of New York (SUNY) at Binghamton
Knowledge Discovery from Relational and Multimedia Data

Summary:

This course aims to give the audience a complete introduction to knowledge discovery from relational and multimedia data. It begins with an extensive introduction to the fundamental concepts and theories of knowledge discovery from relational and multimedia data, and then showcases several important real-world applications as examples of big data knowledge discovery.

Syllabus:

The course consists of three two-hour sessions. The syllabus is as follows:

References:

  1. Bo Long, Zhongfei (Mark) Zhang, and Philip S. Yu, Relational Data Clustering: Models, Algorithms, and Applications, Taylor & Francis/CRC Press, 2010, ISBN: 9781420072617
  2. Zhongfei (Mark) Zhang and Ruofei Zhang, Multimedia Data Mining -- A Systematic Introduction to Concepts and Theory, Taylor & Francis Group/CRC Press, 2008, ISBN: 9781584889663
  3. Zhongfei (Mark) Zhang, Bo Long, Zhen Guo, Tianbing Xu, and Philip S. Yu, Machine Learning Approaches to Link-Based Clustering, in Link Mining: Models, Algorithms and Applications, Edited by Philip S. Yu, Christos Faloutsos, and Jiawei Han, Springer, 2010
  4. Zhen Guo, Zhongfei Zhang, Eric P. Xing, and Christos Faloutsos, Multimodal Data Mining in a Multimedia Database Based on Structured Max Margin Learning, ACM Transactions on Knowledge Discovery and Data Mining, ACM Press, 2015

http://www.cs.binghamton.edu/~forweb/publicationsactive.html

Pre-requisites:

College-level mathematics and fundamentals of computer science.

Short bio:

Zhongfei (Mark) Zhang is a full professor in the Computer Science Department at the State University of New York (SUNY) at Binghamton, where he directs the Multimedia Research Laboratory. He has also served as a QiuShi Chair Professor at Zhejiang University, China, where he directs the Data Science and Engineering Research Center while on leave from SUNY Binghamton, USA. He received a B.S. in Electronics Engineering (with Honors) and an M.S. in Information Sciences, both from Zhejiang University, China, and a PhD in Computer Science from the University of Massachusetts at Amherst, USA. His research interests include knowledge discovery from multimedia data and relational data, multimedia information indexing and retrieval, and computer vision and pattern recognition. He is the author of the very first monograph on multimedia data mining and a co-author of the very first monograph on relational data clustering. His research is supported by a wide spectrum of government funding agencies and industrial labs, notably including US NSF, US AFRL, and the New York State Government in the US, CNRS in France, JSPS in Japan, and MOST, NSFC, and the Zhejiang Provincial Government in China, as well as Kodak Research, Microsoft Research, and Alibaba Group. He has published over 200 papers in premier venues in his areas and is an inventor on more than 30 patents. He has served on several journal editorial boards and received several professional awards.