Summary. We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the "cloud" (whether private or public) to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today's researchers.
Bio. Ian Foster is the Arthur Holly Compton Distinguished Service Professor of Computer Science at the University of Chicago and an Argonne Distinguished Fellow at Argonne National Laboratory. He is also the Director of the Computation Institute, a joint unit of Argonne and the University. His research is concerned with the acceleration of discovery in a networked world. Dr. Foster is a fellow of the American Association for the Advancement of Science, the Association for Computing Machinery, and the British Computer Society. Awards include the British Computer Society's Lovelace Medal, honorary doctorates from the University of Canterbury, New Zealand, and CINVESTAV, Mexico, and the IEEE Tsutomu Kanai award.
Summary. There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However, the same is less true for data intensive computing, even though commercial clouds devote far more resources to data analytics than supercomputers devote to simulations.
We look at a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures. We suggest a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks and use these to identify a few key classes of hardware/software architectures.
We propose a software approach that builds on combining HPC and the Apache software stack that is well used in modern cloud computing. Initial results on academic and commercial clouds and HPC Clusters are presented. We suggest an international community effort to build a high performance data analytics library SPIDAL or Scalable Parallel Interoperable Data-Analytics Library.
Summary. Collections of scholarly documents are usually not thought of as big data. However, large collections of scholarly documents often have many millions of publications, authors, citations, equations, figures, etc., and large scale related data and structures such as social networks, slides, data sets, etc. We discuss scholarly big data challenges, insights, methodologies and applications. We illustrate scholarly big data issues with examples of specialized search engines and recommendation systems based on the SeerSuite software. Using information extraction and data mining, we illustrate applications in such diverse areas as computer science, chemistry, archaeology, acknowledgements, citation recommendation, collaboration recommendation, and others.
Bio. Dr. C. Lee Giles is the David Reese Professor of Information Sciences and Technology at the Pennsylvania State University with appointments in the departments of Computer Science and Engineering, and Supply Chain and Information Systems. His research interests are in intelligent cyberinfrastructure and big data, web tools, specialty search engines, information retrieval, digital libraries, web services, knowledge and information extraction, data mining, entity disambiguation, and social networks. He has published nearly 400 papers in these areas with over 24,000 citations and an h-index of 73 according to Google Scholar. He was a cocreator of the popular search engine CiteSeer (now CiteSeerX) and related scholarly search engines. He is a fellow of the ACM, IEEE, and INNS.
Summary. High-performance computing (HPC) systems can be excellent platforms for data-centric problems. However, their I/O systems are often weak in comparison to their compute power, and most of the focus on programming HPC systems is on using the processors rather than their I/O capabilities. This talk will discuss the I/O part of the Message-Passing Interface (MPI), which is the dominant programming system for distributed memory systems used in scientific computing. MPI-IO provides a powerful model for handling large I/O data sets, where large may be a petabyte or more, and for efficiently sharing data among tens of thousands of nodes. The talk will cover the ideas behind MPI-IO, the important concepts behind collective I/O, and how to use MPI-IO efficiently from computational science applications.
Pre-requisites. None, though familiarity with MPI and distributed memory computers is recommended.
Bio. William Gropp is the Thomas M. Siebel Chair in the Department of Computer Science and Director of the Parallel Computing Institute at the University of Illinois in Urbana-Champaign. He received his Ph.D. in Computer Science from Stanford University in 1982, was assistant and later associate professor in the Computer Science Department of Yale University until 1990. He then joined Argonne National Laboratory where he held the positions of Senior Scientist (1998-2007) and Associate Division Director (2000-2006). He joined Illinois in 2007. His research interests are in parallel computing, software for scientific computing, and numerical methods for partial differential equations. While at Argonne, he created the MPICH project, which developed the implementation of the Message Passing Interface used on most of the leading HPC systems as well as parallel clusters of all sizes. Other projects include PETSc, one of the most popular scalable numerical libraries, and pnetCDF, a scalable version of the netCDF I/O library. His current projects include Blue Waters, an extreme scale computing system, and the development of new programming systems and numerical algorithms for scalable scientific computing. Among his awards are the Gordon Bell prize, the IEEE Sidney Fernbach Award, and the IEEE Medal for Excellence in Scalable Computing. He is a Fellow of ACM, IEEE, and SIAM and a member of the National Academy of Engineering.
Summary. Decision trees, and derived methods such as Random Forests, are among the most popular methods for learning predictive models from data. This is to a large extent due to the versatility and efficiency of these algorithms. This course will introduce students to the basic methods for learning decision trees, as well as to variations and more sophisticated versions of decision tree learners, with a particular focus on those methods that make decision trees work in the context of big data.
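As a small illustration of the basic tree-learning step mentioned above, the following sketch chooses a single threshold split on one numeric feature by minimizing weighted Gini impurity. It is not taken from the course materials; the function names `gini` and `best_split` are hypothetical, and Gini impurity is only one of several possible split criteria.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best split `x <= threshold`.

    If no split improves on the unsplit impurity, the threshold is None.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        # Only consider thresholds between two distinct feature values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best
```

A full tree learner would apply such a split search recursively to each resulting subset; the scalable variants covered in the course replace the exhaustive scan with incremental or approximate statistics.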
Syllabus. Classification and regression trees, multi-output trees, clustering trees, model trees, ensembles of trees, incremental learning of trees, learning decision trees from large datasets, learning from data streams.
References. Relevant references will be provided as part of the course.
Pre-requisites. Familiarity with mathematics, probability theory, statistics, and algorithms is expected, on the level it is typically introduced at the bachelor level in computer science or engineering programs.
Bio. Hendrik Blockeel is a professor at KU Leuven, Belgium, and part-time associate professor at Leiden University, the Netherlands. He received his Ph.D. degree in 1998 from KU Leuven. His research interests lie mostly within artificial intelligence, with a focus on machine learning and data mining, and in the use of AI-based modeling in other sciences. Dr. Blockeel has made a variety of contributions on topics such as inductive logic programming, probabilistic-logical learning, and decision trees. He is an action editor of Machine Learning and a member of the editorial board of several other journals. He has chaired or organized several conferences, including ECMLPKDD 2013, and organized the ACAI summer school in 2007. He has served on the board of the European Coordinating Committee for Artificial Intelligence, and currently serves on the ECMLPKDD steering committee as publications chair.
Summary. Ontologies allow one to describe complex domains at a high level of abstraction, providing end-users with an integrated coherent view over data sources that maintain the information of interest. In addition, ontologies provide mechanisms for performing automated inference over data taking into account domain knowledge, thus supporting a variety of data management tasks. Ontology-based Data Access (OBDA) is a recent paradigm concerned with providing access to data sources through a mediating ontology, which has gained increased attention both from the knowledge representation and from the database communities. OBDA poses significant challenges in the context of accessing large volumes of data with a complex structure and high dynamicity. It thus requires not only carefully tailored languages for expressing the ontology and the mapping to the data, but also suitably optimized algorithms for efficiently processing queries over the ontology by directly accessing the underlying data sources.
In this course we start by introducing the foundations of OBDA relying on the OWL2 QL fragment of the W3C standard Web Ontology Language OWL2. This language is based on the DL-Lite family of lightweight ontology languages. We discuss the use of ontologies for accessing structured data sources, and then turn to the problems of query answering and inference over large amounts of data stored in such sources. We present novel theoretical and practical results for OBDA that are currently being developed in the context of the FP7 large scale integrating project Optique. These results make it possible to scale the OBDA approach so as to cope with the challenges that arise in complex real world scenarios coming from different domains.
Pre-requisites. Basics about relational databases, first-order logic, and data modeling. A background in logics for knowledge representation, description logics, and complexity theory, might be useful, but is not required to follow the course.
Bio. Diego Calvanese is a full professor at the KRDB Research Centre for Knowledge and Data, Free University of Bozen-Bolzano, Italy. His research interests include formalisms for knowledge representation and reasoning, ontology languages, description logics, conceptual data modeling, data integration, graph data management, data-aware process verification, and service modeling and synthesis. He has been actively involved in several national and international research projects in the above areas. He is the author of almost 300 refereed publications, including ones in the most prestigious international journals and conferences in Databases and Artificial Intelligence, with over 20000 citations and an h-index of 60, according to Google Scholar. He is one of the editors of the Description Logic Handbook. In 2012-2013 he was a visiting researcher at the Technical University of Vienna as Pauli Fellow of the "Wolfgang Pauli Institute". He is the program chair of PODS 2015, program co-chair of the 2015 Description Logic Workshop, and the general chair of ESSLLI 2016.
Summary. Big data is characterized by its volume, velocity and variety, and applies to information that exceeds the processing capacity of conventional database systems. It provides tremendous opportunities but also imposes great challenges. Among these, how to provide effective and efficient support for programming with big data has attracted a lot of attention from both researchers and practitioners. Various system platforms have been developed for big data management and processing. In this lecture, I will introduce the requirements, difficulties and current status of programming support for big data applications. Then, I will review and provide a tutorial on typical big data programming systems, including MapReduce (Hadoop), Spark, and Storm. Finally, I will briefly describe MatrixMap, a new programming model developed by our lab for high level abstraction and efficient implementation of matrix operations, which are heavily used by data mining and graph algorithms.
Pre-requisites. Basic programming skills in a language such as Java or C++; parallel and distributed processing.
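To make the MapReduce programming model named above concrete, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to word counting. It only mimics the structure of the model in plain Python; it does not use the Hadoop API, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

For example, `word_count(["big data", "Big streams"])` returns `{"big": 2, "data": 1, "streams": 1}`. In a real deployment the map and reduce calls run in parallel on different machines, and the shuffle moves data across the network.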
Bio. Dr. Cao is a chair professor and the head of the Department of Computing at The Hong Kong Polytechnic University. He is also the director of the Internet and Mobile Computing Lab in the department. His research interests include parallel and distributed systems, wireless networks, mobile and pervasive computing, and fault tolerance. He has co-authored 3 books, co-edited 9 books, and published over 300 papers in major international journals and conference proceedings. He has directed and participated in numerous research and development projects funded by Hong Kong Research Grant Council, Hong Kong Innovation and Technology Commission, China’s Natural Science Foundation, and industries like IBM, Huawei and Nokia.
Dr. Cao is a fellow of IEEE, a member of ACM, and a senior member of China Computer Federation (CCF). He served as the Chair of Technical Committee on Distributed Computing (TPDC), IEEE Computer Society (2012-2014), and as a vice chairman of CCF's Technical Committee on Computer Architecture (2004-2007). Dr. Cao has served as an associate editor and a member of the editorial boards of many international journals, including ACM Transactions on Sensor Networks, IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, IEEE Networks, and Pervasive and Mobile Computing Journal. He has also served as a chair and member of organizing / program committees for many international conferences, including PERCOM, INFOCOM, ICDCS, IPDPS, ICPP, RTSS, DSN, ICNP, SRDS, MASS, PRDC, ICC, GLOBECOM, and WCNC.
Summary. This tutorial presents architectures and algorithms for big data analytics over structured and unstructured data in both batch and streaming mode. We first introduce a typical open source stack for big data analytics, composed of, e.g., Hadoop, Storm, Flume, Impala, Redis, and Scalding. We then depict resource management issues and parallel methods for addressing IO and computation bottlenecks of some key machine learning algorithms, including Frequent Itemset Mining, Spectral Clustering, SVMs, LDA, and deep belief networks [6,7]. The tutorial uses a couple of real-world datasets/applications to demonstrate big-data challenges and solutions [8,9,10]. These datasets will be made available to the participants for conducting big-data research and experiments.
I. Introduction to an open-source architecture (see Figure 1) and building blocks for big data analytics in batch and stream mode.
II. Approaches, both algorithmic and distributed computing, for scaling up key machine learning algorithms:
Pre-requisites. Knowledge of key machine learning concepts and algorithms, and distributed computing.
Figure 1. An Open Stack for Big Data Analytics
Bio. Edward Chang has been the Vice President of Advanced Technology & Research at HTC since July 2012, heading software and hardware future technology research and development. Prior to his HTC post, he was a director of Google Research for 6.5 years, in charge of research and development in several areas including indoor positioning, big data mining, social networking and search integration, and Web search (spam fighting). His team's seminal work on indoor positioning and associated sensor-calibration patents are now being deployed in the world by several companies (see XINX paper and the editor summary of his ASIS/SIGIR/ICADL keynotes). His pioneering contributions in parallel machine learning algorithms and big-data mining are widely recognized via several keynote invitations (see MMDS/CIVR/EMMDS/MM/AAIM/ADMA/CIKM keynote deck and tutorial deck for details). The open-source, big-data mining codes developed by his team (PSVM, PLDA+, Parallel Spectral Clustering, and Parallel Frequent Pattern Mining) have been collectively downloaded over 10,000 times. The Google Q&A system (codename Confucius) developed by his team was launched in 60+ countries including China, Russia, Thailand, Vietnam, Indonesia, 17 Arabic and 40 African nations. His team also worked on developing algorithms and components for Web search (spam fighting), Google+ (recommendation engine and Open ID), Chrome, and PicasaWeb. Ed's book titled Foundations of Large-Scale Multimedia Information Management and Retrieval provides a good summary of his experience in applying big data techniques in feature extraction, learning, and indexing for organizing multimedia data to support both management and retrieval.
Prior to Google, Ed was a full professor of Electrical Engineering at University of California, Santa Barbara (UCSB). He joined UCSB in 1999 after receiving his PhD from Stanford University, and was tenured in 2003 and promoted to full professor in 2006. Ed has served on ACM (SIGMOD, KDD, MM, CIKM), VLDB, IEEE, WWW, and SIAM conference program committees, and co-chaired several conferences including MMM, ACM MM, ICDE, and WWW. He is a recipient of the NSF Career Award and Google Innovation Award.
Summary. The course starts with a self-contained introduction to data streams and to event sampling techniques suitable for the representation and analysis of behavioral logs. It discusses finite, semantically rich event streams generated by social network applications or by 4G/5G mobile networks and presents some approaches toward sketching, querying and analyzing them. Finally, the open problem of preserving confidentiality and privacy of logs and streams without losing predictive power is presented.
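The summary does not say which sampling techniques the course covers. As one standard example of sampling from an event stream of unknown length, the sketch below implements reservoir sampling (Algorithm R), which keeps a uniform random sample of k events in O(k) memory. The function name `reservoir_sample` and its `rng` parameter are assumptions made for this illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    The stream is read once, in order, and only k items are held in memory.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Passing a seeded generator, e.g. `reservoir_sample(events, 100, random.Random(42))`, makes the sample reproducible.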
Bio. Ernesto Damiani is a full professor at the Università degli Studi di Milano and the Head of the University's PhD program in computer science. Ernesto's areas of interest include cloud-based services and processes, process analysis and cloud security. Ernesto has published several books and about 300 papers and international patents. He leads/has led a number of international research projects: he was the Principal Investigator of the ASSERT4SOA project (STREP) on the certification of SOA and leads the activity of SESAR research unit within PRACTICE (IP), ASSERT4SOA (STREP), CUMULUS (STREP), ARISTOTELE (IP), and SecureSCM (STREP) projects funded by the EC in the 7th Framework Program. Ernesto has been an Associate Editor of the IEEE Trans. on Service-Oriented Computing since its inception, and is Associate Editor of the IEEE Trans. on Fuzzy Systems. Also, Ernesto is Editor in Chief of the International Journal of Knowledge and Learning (Inderscience) and of the International Journal of Web Technology and Engineering (IJWTE). He has served in all capacities on many congress, conference, and workshop committees. He is a senior member of the IEEE and an ACM Distinguished Scientist (2008). He was the recipient of the IFIP Outstanding Service Award (2013).
Summary. Deep Web databases, i.e., hidden databases only accessible by restricted web query interfaces, are widely prevalent on the Web. They represent an intriguing and prototypical instance of Big Data: huge, heterogeneous, not easily indexed, and inaccessible other than via restrictive and proprietary query interfaces such as keyword/form-based search and hierarchical/graph-based browsing interfaces. Efficient ways of exploring and mining contents in such hidden repositories are of increasing importance. There are two key challenges: one on the proper understanding of interfaces, and the other on the efficient exploration, e.g., crawling, sampling and analytical processing, of very large repositories. In this course, we focus on the fundamental developments in the field, including web interface understanding, crawling, sampling, and data analytics over web repositories with various types of interfaces and containing structured or unstructured data.
Syllabus. Overview of hidden databases, taxonomy of different query interfaces, interface understanding and data exploration challenges, crawling, sampling, and mining techniques, conclusions and open challenges.
Pre-requisites. Some background in algorithms, databases, and probability and statistics will be useful.
Bio. Gautam Das is Professor of Computer Science and Engineering and Head of Database Exploration Laboratory (DBXLab) at the University of Texas at Arlington. He has also held positions at Microsoft Research and the University of Memphis. He graduated with a B.Tech from IIT Kanpur, India, and with a Ph.D from the University of Wisconsin, Madison. His research interests span data mining, information retrieval, big data management, social media analytics, and theoretical algorithms. His research has resulted in over 150 papers, many of which have appeared in premier conferences and journals such as SIGMOD, VLDB, ICDE, KDD, TODS, and TKDE. His work has received several awards, including the IEEE ICDE 2012 Influential Paper award, Best Student Paper Award of CIKM 2013, VLDB Journal special issues on Best Papers of VLDB 2012 and 2007, Best Paper of ECML/PKDD 2006, and Best Paper (runner up) of ACM SIGKDD 1998. Dr. Das is on the Editorial Board of ACM TODS and IEEE TKDE, and has served on the organization and program committees of numerous premier conferences. His research has been supported by NSF, ONR, Department of Education, Texas Higher Education Coordinating Board, Microsoft Research, Nokia Research, and Cadence Design Systems.
Summary. This tutorial examines the software used in many commercial activities to study Big Data and how it can be enhanced by HPC. The backdrop for the course is the ~120 software subsystems illustrated at http://hpc-abds.org/kaleidoscope/. We will describe the software architecture represented by this collection, which we term HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack). We start by introducing Software Defined systems controlled by a Python framework, Cloudmesh, that also integrates workflow with the system definition using DevOps tools. We then discuss key parts of the stack, including Resource management, Storage, Programming model and runtime -- horizontal scaling parallelism, Collective and Point-to-Point communication, Streaming. We cover key principles of efficient parallel computing for data analytics and compare with better known simulation areas. Example applications will be covered.
Pre-requisites. Python and Java will be used.
Bio. Geoffrey Charles Fox (email gcf (at) indiana.edu) received a Ph.D. in Theoretical Physics from Cambridge University and is now distinguished professor of Informatics and Computing, and Physics at Indiana University where he is director of the Digital Science Center and Senior Associate Dean for Research and Director of the Data Science program at the School of Informatics and Computing. He previously held positions at Caltech, Syracuse University and Florida State University after being a postdoc at the Institute for Advanced Study at Princeton, Lawrence Berkeley Laboratory and Peterhouse College Cambridge. He has supervised the PhDs of 66 students and published around 1000 papers in physics and computer science with an h-index of 70 and over 25000 citations.
He currently works in applying computer science from infrastructure to analytics in Biology, Pathology, Sensor Clouds, Earthquake and Ice-sheet Science, Image processing, Network Science and Particle Physics. The infrastructure work is built around Software Defined Systems on Clouds and Clusters. He is involved in several projects to enhance the capabilities of Minority Serving Institutions including the eHumanity portal. He has experience in online education and its use in MOOCs for areas like Data and Computational Science. He is a Fellow of APS and ACM.
Summary. Effective Big Data analytics need to rely on algorithms for querying and analyzing massive, continuous data streams (that is, data that is seen only once and in a fixed order) with limited memory and CPU-time resources. Such streams arise naturally in emerging large-scale event monitoring applications; for instance, network-operations monitoring in large ISPs, where usage information from numerous sites needs to be continuously collected and analyzed for interesting trends. In addition to memory- and time-efficiency concerns, the inherently distributed nature of such applications also raises important communication-efficiency issues, making it critical to carefully optimize the use of the underlying network infrastructure.
This seminar will give an overview of some key algorithmic tools for effective query processing over streaming data. The focus will be on small-space sketching structures for approximating continuous data streams in both centralized and distributed settings.
Syllabus. Introduction, Data Streaming Models; Basic Streaming Algorithmic Tools: Reservoir Sampling, AMS, CountMin, FM sketching and Distinct Sampling, Exponential Histograms; Distributed Data Streaming: Models, Techniques, the Geometric Method; Conclusions and Looking Forward.
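As a rough illustration of the small-space sketching structures listed in the syllabus, here is a minimal Count-Min sketch for approximate frequency counting over a stream. It is not the seminar's own code: the class name, the default `width` and `depth`, and the use of salted BLAKE2b as a stand-in for the pairwise-independent hash family assumed in the analysis are all choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts in depth x width counters.

    With non-negative updates, estimates never fall below the true count;
    they can exceed it when different items collide in every row.
    """

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row; an illustrative stand-in for a
        # pairwise-independent hash family.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        """Record `count` occurrences of `item`."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        """Return the minimum counter over all rows for `item`."""
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```

Larger `width` reduces the overestimation error and larger `depth` reduces the probability of a bad estimate, so memory use is fixed in advance regardless of stream length.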
Bio. Minos Garofalakis received the Diploma degree in Computer Engineering and Informatics from the University of Patras, Greece in 1992, and the M.Sc. and Ph.D. degrees in Computer Science from the University of Wisconsin-Madison in 1994 and 1998, respectively. He worked as a Member of Technical Staff at Bell Labs, Lucent Technologies in Murray Hill, NJ (1998-2005), as a Senior Researcher at Intel Research Berkeley in Berkeley, CA (2005-2007), and as a Principal Research Scientist at Yahoo! Research in Santa Clara, CA (2007-2008). In parallel, he also held an Adjunct Associate Professor position at the EECS Department of the University of California, Berkeley (2006-2008). As of October 2008, he is a Professor of Computer Science at the School of Electronic and Computer Engineering of the Technical University of Crete, and the Director of the Software Technology and Network Applications Laboratory (SoftNet). Prof. Garofalakis’ research focuses on Big Data analytics, spanning areas such as database systems, data streams, data synopses and approximate query processing, probabilistic databases, and data mining. His work has resulted in over 120 published scientific papers in these areas, and 35 US Patent filings (27 patents issued) for companies such as Lucent, Yahoo!, and AT&T. GoogleScholar gives over 9000 citations to his work, and an h-index value of 50. Prof. Garofalakis is an ACM Distinguished Scientist (2011), and a recipient of the IEEE ICDE Best Paper Award (2009), the Bell Labs President’s Gold Award (2004), and the Bell Labs Teamwork Award (2003).
Summary. We have entered a data-rich era. Advanced computing, imaging, and sensing technologies enable scientists to study natural and physical phenomena at unprecedented precision, resulting in an explosive growth of data. The size of the collected information about the Internet and mobile device users is expected to be even greater. To make sense and maximize utilization of such vast amounts of data for knowledge discovery and decision-making, we need a new set of tools beyond conventional data mining and statistical analysis. Visualization has been shown to be very effective in understanding large, complex data, and has thus become an indispensable tool for many areas of research and practice. In my lectures, I will present techniques that my group at UC Davis has introduced for visualizing large-scale volume, network, and time-varying data commonly found in many real-world applications.
Bio. Dr. Kwan-Liu Ma is Professor of computer science and Chair of the Graduate Group in Computer Science at the University of California-Davis. He leads the VIDI (Visualization and Interface Design Innovation) research group, and directs the UC Davis Center for Visualization. His research spans the fields of visualization, computer graphics, high-performance computing, and user interface design. Professor Ma received his PhD in computer science from the University of Utah in 1993. During 1993-1999, he was with ICASE/NASA Langley Research Center as a research scientist. He joined UC Davis in 1999 and received the prestigious PECASE award in 2000. In 2012 he was elected as an IEEE Fellow, and in 2013 he received the IEEE VGTC Technical Achievement Award, the highest recognition by the visualization research community. Professor Ma has been actively serving the research community by playing leading roles in several professional activities including VizSEC, UltraVis, EGPGV, IEEE Vis, IEEE PacificVis, and LDAV. Professor Ma was an associate editor for the IEEE Transactions on Visualization and Computer Graphics (TVCG) during 2007-2011. He presently serves as the AEIC of the IEEE Computer Graphics and Applications (CG&A), and on the editorial board of the Journal of Computational Science and Discovery, and the Journal of Visualization.
Summary. The new in-memory data processing technology originated in a research project at the Hasso Plattner Institute. Based on the outcome of this research, SAP developed their innovative in-memory database SAP-HANA, which makes it possible to process huge data sets - the famous big data - in almost real-time. The talk outlines the basics and the history of the in-memory technology as a successful cooperation between science and business, and demonstrates the power of this technology by means of some examples from fields like IT security or social media analytics. It also describes how this new technology can be used to turn the vision of personalized medicine into reality.
Syllabus. The presentation gives an overview of achievements in Big Data processing by in-memory technology.
Pre-requisites. No special pre-requisites.
Bio. Univ.-Prof. Dr. Christoph Meinel is Head and Executive Director of the Hasso Plattner Institute (HPI) for IT-Systems Engineering and professor for Internet-Technologies and Systems at the University of Potsdam.
Christoph Meinel (Univ.-Prof., Dr.sc.nat., Dr.rer.nat., 1954) is Scientific Director and CEO of HPI, a university institute that is privately financed by the foundation of Hasso Plattner, one of the founders of SAP. HPI provides outstanding Bachelor, Master and PhD study courses in IT-Systems Engineering as well as a Design Thinking program. Christoph Meinel is a member of acatech, the national German academy of science and engineering. In 2006, together with Hasso Plattner he hosted the first National IT-summit of German Chancellor Dr. Angela Merkel at HPI.
Christoph Meinel is full professor (C4) for computer science at HPI and the University of Potsdam, holding a chair in "Internet Technologies and Systems". His research focuses on Future Internet Technologies, in particular Internet and Information Security, Web 3.0: Semantic, Social and Service Web, as well as on innovative Internet applications, especially in the domains of e-Learning and Telemedicine. Besides this, he is scientifically active in the field of Innovation research and Design Thinking. He started his scientific career with research work in computational complexity and the design of efficient data structures and algorithms. Christoph Meinel teaches Bachelor and Master courses in IT-Systems Engineering and at the HPI School of Design Thinking. He heads the MOOC platform openHPI.de and the tele-TASK lecture portal (www.tele-task.de).
Christoph Meinel is author or co-author of 14 books, and editor of various conference proceedings. He has published more than 400 papers in high-profile scientific journals and at international conferences, and holds various international patents. He is honorary professor at the Computer Sciences School at Beijing University of Technology, visiting professor at Shanghai University in China and a senior research fellow at the Interdisciplinary Centre for Security, Reliability and Trust (SnT) at the University of Luxembourg. Since 2008, together with Larry Leifer from Stanford University he heads the HPI-Stanford Design Thinking Research Program as a program director.
Summary. Data-related challenges are quickly dominating computational and data-enabled sciences, and are limiting the potential impact of scientific applications enabled by current and emerging high-performance distributed computing environments. These data-intensive application workflows involve dynamic coordination, interactions, and data coupling between multiple application processes that run at scale on different resources, and with services for monitoring, analysis, visualization, and archiving. In this course I will explore data grand challenges in simulation-based science and investigate how solutions based on data sharing abstractions, managed data pipelines, in-memory data staging, in-situ placement and execution, and in-transit data processing can be used to address these data challenges at extreme scales. This course is based on DataSpaces, a programming system for enabling extreme-scale, data-intensive, coupled simulation workflows using data staging. DataSpaces is developed and deployed by the Rutgers Discovery Informatics Institute (RDI2).
Bio. Manish Parashar is Professor of Computer Science at Rutgers University. He is also the founding Director of the Rutgers Discovery Informatics Institute (RDI2) and site Co-Director of the NSF Cloud and Autonomic Computing Center (CAC). His research interests are in the broad areas of Parallel and Distributed Computing and Computational and Data-Enabled Science and Engineering. Manish serves on the editorial boards and organizing committees of a large number of journals and international conferences and workshops, and has deployed several software systems that are widely used. He has also received a number of awards and is a Fellow of AAAS, a Fellow of IEEE/IEEE Computer Society, and a Senior Member of ACM. For more information please visit http://parashar.rutgers.edu/.
Summary. Data mining and analytics is concerned with extracting actionable and interpretable knowledge from data as efficiently as possible. This series of lectures focuses on how to extract that information efficiently, and is organized around three key ideas for scalable performance. In the first lecture I will introduce the idea of Locality Sensitive Hashing and a key specialization, minwise independent permutation hashing. I will then describe variants of this idea and how they can be used to scale up various algorithms ranging from social network analysis to outlier detection. In the second lecture I will discuss the use of sampling as a technique to scale up various machine learning and data mining algorithms, and also describe how such ideas can play a role in modern systems design. In the third lecture I will focus on scalable sequential and parallel algorithms for various data mining tasks, with an emphasis on architecture-conscious solutions. Collectively these ideas are largely orthogonal and can be used by a myriad of applications to scale up performance and extract knowledge efficiently, enabling the human in the loop to act effectively upon such knowledge.
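As a rough illustration of the minwise hashing idea named above (not material from the lectures), the following sketch estimates the Jaccard similarity of two sets from short signatures. It assumes that seeded BLAKE2 hashes are an acceptable stand-in for random permutations; the function names are hypothetical, and the banding step that turns signatures into a full Locality Sensitive Hashing index is omitted.

```python
import hashlib


def seeded_hash(seed, item):
    """Deterministic 64-bit hash of a string item under an integer seed."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """One minimum per seeded hash function; `items` must be a non-empty set of strings."""
    return [min(seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison on small inputs."""
    return len(a & b) / len(a | b)
```

Two sets agree at a given signature position with probability equal to their Jaccard similarity, so longer signatures give tighter estimates at the cost of more hashing.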
Pre-requisites. Basic understanding of algorithms, some machine learning/data mining would be helpful.
Bio. Srinivasan Parthasarathy is a Professor at Ohio State University and works in the areas of Data Mining, Database Systems and High Performance Computing. His work in these areas has received a number of best paper awards or similar honors from conferences such as VLDB, SIGKDD, ICDM(2), SIAM DM(2), and ISMB. His work has also received awards from various agencies and organizations in the US including an NSF CAREER award, a DOE Early Career Award, Research Awards from Google, Microsoft and IBM and twice from the State of Ohio. He leads the Data Mining Research laboratory at Ohio State and co-directs (along with a colleague in Statistics) a brand new undergraduate program in Data Analytics at Ohio State (among the first of its kind).
Summary. An unprecedented amount of data is generated every day from a variety of social platforms and collaboration networks. The analysis of such data requires understanding the interplay between the relationships and interactions of the users that form the network and the information content itself. The course will cover recent research on the models, metrics and algorithms that abstract the basic properties of the underlying networks. The course will also study how content and information are propagated through the network and explore the diffusion models and the theory that underlie viral marketing and disease or virus propagation. Finally, the course will explore methods for link analysis and community detection.
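To make the notion of a diffusion model concrete, here is a minimal sketch of one widely used model, the independent cascade. The summary does not say which models the course covers, so the choice of model, the function name, and the single uniform activation probability `p` are all assumptions for illustration.

```python
import random


def independent_cascade(graph, seeds, p, rng):
    """Simulate one cascade and return the set of activated nodes.

    graph: dict mapping each node to a list of its out-neighbours.
    seeds: initially active nodes.
    p:     probability that a newly active node activates each inactive neighbour.
    rng:   a random.Random instance, passed in so that runs are reproducible.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, []):
                # Each newly active node gets exactly one chance per neighbour.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active
```

Averaging the size of the returned set over many runs, for example with `rng = random.Random(0)`, gives an estimate of the expected spread of a seed set, the quantity that viral marketing formulations typically try to maximize.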
Pre-requisites. Some background in algorithms and graph theory will be useful but is not necessary.
Bio. Evaggelia Pitoura is a Professor at the Computer Science and Engineering Department of the University of Ioannina, Greece where she also leads the Distributed Management of Data Laboratory. She received a BSc degree from the University of Patras, Greece, and an MS and PhD degree from Purdue University, USA. Her research interests are in the area of data management systems with a recent focus on data networks. Her publications in this area include more than 150 articles in international journals (including TODS, TKDE, TPDS, PVLDB) and conferences (including SIGMOD, ICDE, EDBT, CIKM, WWW) and a highly-cited book on mobile computing. Her research in distributed computing and more recently in social and cloud computing has been funded by the EC and national sources. She has received three Best Paper Awards (ICDE 1999, DBSocial 2013, PVLDB 2013), a Marie Curie Fellowship (2009) and two Recognition of Service Awards from ACM. She has served as a group leader, senior PC member and co-chair for many highly ranked international data management conferences (including SIGMOD, WWW, CIKM, EDBT, MDM, VLDB, WISE, ICDE) and serves on the editorial board of TKDE and DAPD.
Summary. Data streams arriving from multiple data sources such as sensors, logs, and social media exhibit structural patterns and can be modeled as time-evolving graphs. With the rapid growth of the Internet of Things (IoT), as well as the availability of large-scale data from social media, sensors, and smartphones, there is great interest in structuring real-world observations from these sources as dynamic graphs. The size and complexity of these graphs are, however, growing to span millions of nodes and billions of edges, and hence present several challenges in terms of processing, analyzing, and visualizing this data. Large-scale time-evolving graphs are being studied in various applications such as disaster management, cyber security, fraud detection, and social community network analysis. This course will introduce you to time-evolving graphs, their properties, various graph mining algorithms, tools for storing, processing, analyzing, and visualizing these graph data sets, and some applications.
Syllabus. This course will introduce you to the theory of time evolving graphs, their properties, and emerging computational techniques for data analysis, data management, visualization, and interactions with these graphs. The course will also provide some example solutions that have been implemented for visual analysis of large-scale time-evolving graphs.
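One simple way to picture a time evolving graph is as a stream of timestamped edges from which snapshots are cut. The sketch below assumes integer timestamps and undirected edges; the function names are hypothetical and are not tied to any tool used in the course.

```python
from collections import defaultdict


def snapshot(edge_stream, t_start, t_end):
    """Adjacency sets of the graph formed by edges with t_start <= t < t_end.

    edge_stream: iterable of (u, v, t) tuples, treated as undirected edges.
    """
    adjacency = defaultdict(set)
    for u, v, t in edge_stream:
        if t_start <= t < t_end:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return dict(adjacency)


def degree_over_time(edge_stream, node, window, t_max):
    """Degree of `node` in each consecutive window of length `window` over [0, t_max)."""
    return [
        len(snapshot(edge_stream, t, t + window).get(node, ()))
        for t in range(0, t_max, window)
    ]


edges = [("a", "b", 0), ("a", "c", 1), ("b", "c", 5), ("a", "d", 6)]
print(degree_over_time(edges, "a", window=5, t_max=10))  # prints [2, 1]
```

Rescanning the whole stream for every window keeps the sketch short but does not scale; graphs with billions of edges call for indexed or incremental storage.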
Pre-requisites. Students who attend this course should have some basic understanding of graphs, working experience with databases, programming, or scripting.
Bio. Dr. Vijay Raghavan is the Alfred and Helen Lamson/BoRSF Endowed Professor in Computer Science at the Center for Advanced Computer Studies and the Director of the NSF-sponsored Industry/University Cooperative Research Center for Visual and Decision Informatics. As the director, he coordinates several multi-institutional, industry-driven research projects and manages a budget of over $500K/year. From 1997 to 2003, he led a $2.3M research and development project in close collaboration with the USGS National Wetlands Research Center and with the Department of Energy's Office of Science and Technical Information on creating a digital library with data mining capabilities incorporated. His research interests are in data mining, information retrieval, machine learning and Internet computing. He has published over 250 peer-reviewed research papers, appearing in top-level journals and proceedings, that cumulatively accord him an h-index of 31, based on citations. He has served as major advisor for 24 doctoral students and has garnered $10 million in external funding. Besides substantial technical expertise, Dr. Raghavan has vast experience managing interdisciplinary and multi-institutional collaborative projects. He has also directed industry-sponsored research, for companies such as GE Healthcare and Araicom Life Sciences L.L.C., on projects pertaining to neuro-imaging-based dementia detection and literature-based biomedical hypotheses generation, respectively.
His service work at the university includes coordinating the Louis Stokes-Alliance for Minority Participation (LS-AMP) program. He chaired the IEEE International Conference on Data Mining in 2005 and received the ICDM 2005 Outstanding Service Award. Dr. Raghavan serves as a member of the Executive Committee of the IEEE Technical Committee on Intelligent Informatics (IEEE-TCII), the Web Intelligence Consortium (WIC) Technical Committee and the Web Intelligence and Intelligent Agent Technology Conferences’ Steering Committee. He was the Program Committee Chair of the 2013 ACM/IEEE Conference on Web Intelligence and Intelligent Agent Technologies and one of the Conference Co-Chairs of the IEEE 2013 Big Data Conference. For his many years of service to the community, he received the WIC 2013 Outstanding Service Award. He is an ACM Distinguished Scientist and served as an ACM Distinguished Lecturer from 1993 to 2006. In addition, he served as a member of the Advisory Committee of the NSF Computer and Information Science and Engineering directorate (CISE-AC) during 2008 – 2010.
Raghavan currently serves as a co-editor of Elsevier Handbook of Statistics on Big Data Analytics, a program committee Co-Chair of the first international workshop on Big Data to Wise Decisions (BD2WD) (jointly with the 9th International Conference on Rough Sets and Knowledge Technology) to be held on Oct 24 - 26, 2014 at Tongji University, Shanghai, China, and a member of the Steering Committee of IEEE BigData 2014 conference to be held on Oct. 27 – 30, 2014 at Washington, D.C. He is an Associate Editor of the ACM Transactions on Internet Technology, the Web Intelligence and Agent Systems journal and the International J. of Computer Science & Applications, and a member of the International Rough Set Society Advisory Board.
Summary. The rapid advancements in Information and Communication Technologies (ICTs) have enabled the emergence of the “cloud” as a successful paradigm for conveniently storing, accessing, processing, and sharing information. With its significant benefits of scalability and elasticity, the cloud paradigm has appealed to companies and users, which are increasingly resorting to the multitude of available providers for storing and processing data. Unfortunately, such convenience comes at the price of a loss of control over these data, and of consequent new security threats that can limit the widespread adoption and acceptance of the cloud computing paradigm. In this lecture, I will illustrate some security and privacy issues arising in the cloud scenario, focusing in particular on the problem of guaranteeing confidentiality and integrity of data stored or processed by external providers.
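As a small, generic illustration of the integrity half of this problem (not a technique taken from the lecture), a data owner can keep a secret key, store an HMAC tag alongside each record held by the provider, and recheck the tag on retrieval. The function names below are hypothetical, and the sketch does not address confidentiality, which would require encryption as well.

```python
import hashlib
import hmac


def make_tag(key, data):
    """Compute an HMAC-SHA256 tag over `data` with the owner's secret `key` (both bytes)."""
    return hmac.new(key, data, hashlib.sha256).digest()


def verify(key, data, expected_tag):
    """Return True only if `data` still matches the tag computed before outsourcing."""
    # compare_digest avoids leaking, through timing, where the tags first differ.
    return hmac.compare_digest(make_tag(key, data), expected_tag)
```

Because the provider never sees the key, it cannot forge a valid tag for modified data; it can still withhold or replay old records, which this sketch alone does not detect.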
Syllabus. Security and privacy issues in the cloud, data confidentiality, access control and query execution on protected data, access privacy, data and computation integrity.
Pre-requisites. Basic background in databases.
Bio. Pierangela Samarati is a Professor at the Department of Computer Science of the Università degli Studi di Milano. Her main research interests are access control policies, models and systems, data security and privacy, information system security, and information protection in general. She has participated in several projects involving different aspects of information protection. On these topics she has published more than 240 peer-reviewed articles in international journals, conference proceedings, and book chapters. She has been a Computer Scientist in the Computer Science Laboratory at SRI, CA (USA). She has been a visiting researcher at the Computer Science Department of Stanford University, CA (USA), and at the Center for Secure Information Systems of George Mason University, VA (USA).
She is the chair of the IEEE Systems Council Technical Committee on Security and Privacy in Complex Information Systems (TCSPCIS), of the Steering Committees of the European Symposium on Research in Computer Security (ESORICS), and of the ACM Workshop on Privacy in the Electronic Society (WPES). She is a member of several steering committees. She is an ACM Distinguished Scientist (named 2009) and an IEEE Fellow (named 2012). She has been awarded the IFIP TC11 Kristian Beckman award (2008) and the IFIP WG 11.3 Outstanding Research Contributions Award (2012).
Summary. For many applications, the data sets to be processed grow much faster than can be handled with the traditionally available algorithms. We therefore have to come up with new, dramatically more scalable approaches. In order to do that, we have to bring together know-how from the application, techniques from traditional algorithm theory, and knowledge of low-level aspects like parallelism, memory hierarchies, energy efficiency, and fault tolerance. The methodology of algorithm engineering, with its emphasis on realistic models and its cycle of design, analysis, implementation, and experimental evaluation, can serve as a glue between these requirements.
The course will introduce the methodology of algorithm engineering and then demonstrate it with a running example: sorting large data sets. Then we will move to other examples from the work of my group: graph algorithms (route planning, graph partitioning, ...), basic algorithms for in-memory databases, 4D image processing, particle tracking in physics, etc.
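To give a flavor of why sorting large data sets involves the memory hierarchy, here is a textbook external merge sort in miniature: sort chunks that fit in memory, write each as a run on disk, then merge the runs. This is a generic sketch rather than the course's own algorithm; it assumes integer inputs, and `chunk_size` stands in for the available memory.

```python
import heapq
import os
import tempfile


def external_sort(values, chunk_size):
    """Yield the integers in `values` in sorted order, holding at most
    `chunk_size` of them in memory during the run-formation phase."""
    run_paths = []
    chunk = []

    def flush():
        # Sort the current chunk in memory and write it to disk as one run.
        if not chunk:
            return
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{x}\n" for x in chunk)
        run_paths.append(path)
        chunk.clear()

    for x in values:
        chunk.append(x)
        if len(chunk) >= chunk_size:
            flush()
    flush()

    files = [open(path) for path in run_paths]
    try:
        # k-way merge: only the current head of each run is held in memory.
        yield from heapq.merge(*((int(line) for line in f) for f in files))
    finally:
        for f in files:
            f.close()
        for path in run_paths:
            os.remove(path)
```

For example, `list(external_sort([5, 3, 9, 1, 4, 8, 2], chunk_size=3))` forms three runs on disk and merges them into `[1, 2, 3, 4, 5, 8, 9]`. An engineered implementation would also tune block sizes, overlap I/O with computation, and parallelize both phases.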
Bio. Peter Sanders received his PhD in computer science from Universität Karlsruhe, Germany in 1996. After 7 years at the Max-Planck-Institute for Informatics in Saarbrücken he returned to Karlsruhe as a full professor in 2004. He has more than 200 publications, mostly on algorithms for large data sets. This includes parallel algorithms (load balancing, ...), memory hierarchies, graph algorithms (route planning, graph partitioning, ...), randomized algorithms, full text indices, ... He is very active in promoting the methodology of algorithm engineering. For example, he currently heads a DFG priority program on AE in Germany. Peter Sanders has won a number of prizes, perhaps most notably the DFG Leibniz Award 2012, which amounts to 2.5 million Euros of research money.
Summary. Support vector machines and kernel methods offer a powerful class of data-driven and predictive models, with numerous successful applications reported in the literature. However, the new era of big data and its Vs (Volume, Velocity, Variety) is posing important new challenges on how to make the methods scalable and applicable to massive problem sizes.
For this purpose we propose a framework of Fixed-size Kernel Models. In this approach the model is characterized through primal and dual model representations. An approximate feature map is employed with estimation of the model in the primal. It directly leads to sparse models. The methodology exists for a wide range of problems in supervised, unsupervised and semi-supervised learning.
Applications include e.g. power load forecasting, black-box weather forecasting, multilevel hierarchical clustering and community detection in complex networks.
Syllabus. The material is organized in 3 parts of 2 hours:
Starting from the basics of support vector machines and least-squares support vector machines for classification and regression, its connection to kernel principal component analysis and kernel spectral clustering is explained. Next, both basic fixed-size kernel models and advanced versions with improved sparsity are discussed. The fixed-size kernel models are explained for problems in supervised, unsupervised and semi-supervised learning. High performance computing aspects will also be addressed.
Pre-requisites. Basic linear algebra
Bio. More information available here.
Summary. The analysis of the massive and distributed data repositories that are available today requires the combined use of smart data analysis techniques and scalable architectures to find and extract useful information from them. Parallel systems, Grids and Cloud computing platforms offer effective support for addressing both the computational and data storage needs of Big Data mining and parallel analytics applications. In fact, complex data mining tasks involve data- and compute-intensive algorithms that require large storage facilities together with high performance processors to get results in suitable times. In this lecture we introduce the most relevant topics and the main research issues in high performance data mining including parallel data mining strategies, distributed analysis techniques, knowledge Grids and Cloud data mining. We also present some data mining frameworks designed for developing distributed data analytics applications as workflows of services on Grids and Clouds. In these environments, data sets, analysis tools, data mining algorithms and knowledge models are implemented as single services that are combined through a visual programming interface in distributed workflows. Some applications will also be discussed.
Syllabus. Parallel data mining techniques, distributed data mining, Grid-based knowledge discovery, Cloud-based data analytics workflows.
Pre-requisites. Some background in data mining algorithms, parallel and distributed systems.
Bio. Domenico Talia is a full professor of computer engineering at the University of Calabria and the director of ICAR-CNR. He is a partner of two startups, Exeura and Eco4Cloud. His research interests include parallel and distributed data mining, cloud computing, Grid services, knowledge discovery, mobile computing, green computing systems, peer-to-peer systems, and parallel computing. Talia has published ten books and more than 300 papers in archival journals such as CACM, Computer, IEEE TKDE, IEEE TSE, IEEE TSMC-B, IEEE Micro, ACM Computing Surveys, FGCS, Parallel Computing, IEEE Internet Computing and international conference proceedings. He is a member of the editorial boards of IEEE Transactions on Computers, the Future Generation Computer Systems journal, the International Journal on Web and Grid Services, the Scalable Computing: Practice and Experience journal, MultiAgent and Grid Systems: An International Journal, and the Web Intelligence and Agent Systems International journal. Talia has been a project for several international institutions such as the European Commission, Aeres in France, Austrian Science Fund, Croucher Foundation, and the Russian Federation Government. He served as a chair, organizer, or program committee member of several international conferences and gave many invited talks and seminars in conferences and schools. Talia is a member of the ACM and the IEEE Computer Society.
Summary. Parsimonious representations based on sparsity and low rank have found successful applications in many areas, including machine learning, data mining, signal processing, computer vision, and biomedical informatics. In this short course, I will give a comprehensive overview of the formulations, algorithms, and applications of sparse learning and low rank modeling. I will first introduce the necessary background for sparse learning and low rank modeling, and present various sparse and low rank formulations based on L1-norm and trace norm regularization and their variants. I will then present efficient algorithms for solving sparse and low rank formulations. The successful applications of these techniques will be demonstrated in several application domains. Finally, I will discuss recent advances and future directions in the area.
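For a concrete instance of an L1-regularized formulation and an algorithm for it, the sketch below applies iterative soft-thresholding (ISTA, a standard proximal gradient method) to the lasso problem min_x 0.5*||Ax - b||^2 + lam*||x||_1. It is a generic textbook method, not necessarily one presented in the course, written in plain Python for small dense inputs; the function names and the conservative step-size bound are illustrative choices.

```python
def soft_threshold(z, t):
    """Proximal operator of t*|z|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def ista(A, b, lam, iters=500):
    """Approximately minimize 0.5*||Ax - b||^2 + lam*||x||_1.

    A is a list of rows (m x n), b a list of length m.
    """
    m, n = len(A), len(A[0])
    # Step size 1/L, where L must be at least the largest eigenvalue of A^T A.
    # The squared Frobenius norm of A is a simple, conservative upper bound.
    L = sum(a * a for row in A for a in row)
    x = [0.0] * n
    for _ in range(iters):
        residual = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        gradient = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
        # Gradient step on the smooth term, then shrinkage for the L1 term.
        x = [soft_threshold(x[j] - gradient[j] / L, lam / L) for j in range(n)]
    return x
```

With `A` the 3 x 3 identity, `b = [3.0, 0.5, -2.0]`, and `lam = 1.0`, the iterates converge to approximately `[2.0, 0.0, -1.0]`: the small coefficient is set exactly to zero, which is the sparsity effect of the L1 penalty. Trace norm regularization follows the same pattern, with the shrinkage applied to singular values instead of coordinates.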
Pre-requisites. Basic linear algebra and optimization.
Bio. Jieping Ye is an Associate Professor of Computer Science and Engineering at Arizona State University. He is a core faculty member of the Bio-design Institute at ASU. He received his Ph.D. degree in Computer Science from the University of Minnesota, Twin Cities in 2005. His research interests include machine learning, data mining, and biomedical informatics. He has served as Senior Program Committee/Area Chair/Program Committee Vice Chair of many conferences including NIPS, KDD, IJCAI, ICDM, SDM, ACML, and PAKDD. He serves as an Associate Editor of IEEE Transactions on Pattern Analysis and Machine Intelligence. He won the SCI Young Investigator of the Year Award at ASU in 2007, the SCI Researcher of the Year Award at ASU in 2009, and the NSF CAREER Award in 2010. His papers have been selected for the outstanding student paper at the International Conference on Machine Learning in 2004, the KDD best research paper honorable mention in 2010, the KDD best research paper nomination in 2011 and 2012, the SDM best research paper runner up in 2013, and the KDD best research paper runner up in 2013.