In the field of Immunology we are just beginning to explore repurposing public datasets to build our knowledge base, gain insight into new discoveries, and generate data-driven hypotheses that were not formulated in the original studies. With increasing awareness of the importance of sharing research data and findings, however, comes the additional need to demonstrate how best to leverage shared datasets across research domains. Through this coursework we will showcase major efforts in the meta-analysis of open immunological data. Participants will gain a clear understanding of recent trends in conducting data-driven science.
No strict prerequisites; some familiarity with clinical trials and high-throughput technologies, and an interest in ‘big data’ analysis, will be helpful.
Sanchita Bhattacharya is a bioinformatics project team leader at the Bakar Institute of Computational Health Sciences at the University of California, San Francisco, and scientific program director for ImmPort, a National Institute of Allergy and Infectious Diseases Division of Allergy, Immunology, and Transplantation (NIAID-DAIT) sponsored shared data repository for subject-level human immunology study data and protocols. She is currently leading efforts to leverage open-access clinical trials and immunology research studies. Her projects include “The 10,000 Immunomes Project”, a diverse human immunology reference derived from over 44,000 individuals across 291 studies. Sanchita brings twenty years of experience as a data scientist at academic institutions including the Stanford School of Medicine, Lawrence Berkeley National Laboratory, and MIT. Her formal training in bioinformatics, coupled with expertise in computational modeling and immunology, has led to a number of publications demonstrating the repurposing of big data in immunology and other research areas to facilitate translational research. Recently, her team has embarked on a deep learning project to better understand high-throughput single-cell cytometry data using convolutional neural network algorithms for applications in clinical immunology.
Recently, semantic technologies have been successfully deployed to overcome the typical difficulties in accessing and integrating data stored in different kinds of legacy sources. In particular, knowledge graphs are being used as a mechanism to provide a uniform representation of heterogeneous information. In such graphs, data are represented in the RDF format, and are complemented by an ontology that can be queried using the standard SPARQL language. The RDF graph is often obtained by materializing source data, following the traditional extract-transform-load workflow. Alternatively, the sources are declaratively mapped to the ontology, and the RDF graph is maintained virtual. In such an approach, usually called Virtual Knowledge Graphs (VKG), query answering is based on sophisticated query transformation techniques. In this tutorial: (i) we provide a general introduction to relevant semantic technologies; (ii) we illustrate the principles underlying the VKG approach to data integration, providing insights into its theoretical foundations, and describing well-established algorithms, techniques, and tools; (iii) we discuss relevant use-cases using VKGs; (iv) we provide a hands-on experience with the state-of-the-art VKG system Ontop.
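To make the graph-and-query idea concrete, the toy sketch below stores RDF-style triples as Python tuples and answers a SPARQL-like basic graph pattern by joining triple patterns. All data and names are invented for illustration; a real VKG system such as Ontop would instead rewrite such patterns into SQL over the original sources rather than materialize the triples.

```python
# Toy illustration (not Ontop): RDF-style triples and a tiny
# SPARQL-like basic graph pattern matcher over hypothetical data.

TRIPLES = {
    ("alice", "worksFor", "acme"),
    ("acme", "locatedIn", "berlin"),
    ("bob", "worksFor", "initech"),
    ("initech", "locatedIn", "austin"),
}

def match(pattern, binding):
    """Yield extended variable bindings for one triple pattern (?x = variable)."""
    for triple in TRIPLES:
        b = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):            # variable: bind, or check prior binding
                if b.get(term, value) != value:
                    ok = False
                    break
                b[term] = value
            elif term != value:                 # constant: must match exactly
                ok = False
                break
        if ok:
            yield b

def query(patterns):
    """Join all triple patterns, like a SPARQL basic graph pattern."""
    bindings = [{}]
    for p in patterns:
        bindings = [b2 for b in bindings for b2 in match(p, b)]
    return bindings

# SELECT ?person ?city WHERE { ?person worksFor ?org . ?org locatedIn ?city }
results = query([("?person", "worksFor", "?org"),
                 ("?org", "locatedIn", "?city")])
```

The join over patterns is exactly what makes naive evaluation expensive, and why VKG systems invest in sophisticated query transformation and optimization.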
▸ Guohui Xiao, Diego Calvanese, Roman Kontchakov, Domenico Lembo, Antonella Poggi, Riccardo Rosati, and Michael Zakharyaschev. "Ontology-Based Data Access: A Survey". In: Proc. of the 27th Int. Joint Conf. on Artificial Intelligence (IJCAI). IJCAI Org., 2018, pp. 5511-5519. doi: 10.24963/ijcai.2018/777.
▸ Guohui Xiao, Linfang Ding, Benjamin Cogrel, and Diego Calvanese. "Virtual Knowledge Graphs: An Overview of Systems and Use Cases". In: Data Intelligence 1.3 (2019), pp. 201-223. doi: 10.1162/dint_a_00011.
▸ Diego Calvanese, Benjamin Cogrel, Sarah Komla-Ebri, Roman Kontchakov, Davide Lanti, Martin Rezk, Mariano Rodriguez-Muro, and Guohui Xiao. "Ontop: Answering SPARQL Queries over Relational Databases". In: Semantic Web J. 8.3 (2017), pp. 471-487. doi: 10.3233/SW-160217.
▸ Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, Mariano Rodriguez-Muro, and Riccardo Rosati. "Ontologies and Databases: The DL-Lite Approach". In: Reasoning Web: Semantic Technologies for Information Systems - 5th Int. Summer School Tutorial Lectures (RW). Vol. 5689. Lecture Notes in Computer Science. Springer, 2009, pp. 255-356.
Basics about relational databases, first-order logic, and data modeling, as typically taught in BSc-level Computer Science courses. A background in logics for knowledge representation, description logics, and complexity theory, might be useful to establish cross-connections, but is not required to follow the course.
Diego Calvanese is a full professor at the Research Centre for Knowledge and Data (KRDB), Faculty of Computer Science, Free University of Bozen-Bolzano (Italy), and since November 2019 he has also been a Wallenberg Visiting Professor at the Department of Computing Science, Umea University (Sweden). He received a PhD from Sapienza University of Rome in 1996. His research interests include formalisms for knowledge representation and reasoning, virtual knowledge graphs for data access and integration, description logics, the Semantic Web, graph data management, data-aware process verification, and service modeling and synthesis. He has been actively involved in several national and international research projects in the above areas (including FP6-7603 TONES, FP7-257593 ACSI, FP7-318338 Optique, and H2020-863410 INODE). He is the author of more than 350 refereed publications, including ones in the most prestigious international journals and conferences in Databases and Artificial Intelligence, with more than 30,000 citations and an h-index of 69, according to Google Scholar. He is one of the editors of the Description Logic Handbook. He has served in more than 150 organization and program committee roles for international events, is or has been an associate editor of Artificial Intelligence and JAIR, and is a member of the editorial board of the Journal of Automated Reasoning. In 2012-2013 he was a visiting researcher at the Technical University of Vienna as Pauli Fellow of the "Wolfgang Pauli Institute". He was the program chair of the 34th ACM Symposium on Principles of Database Systems (PODS 2015) and the general chair of the 28th European Summer School in Logic, Language and Information (ESSLLI 2016), and he is the program co-chair of the 17th International Conference on Principles of Knowledge Representation and Reasoning (KR 2020). He is one of the inventors of the VKG approach, and the initiator of the Ontop VKG system.
He is a co-founder of the startup Ontopic, which delivers data management solutions and services based on virtual knowledge graphs, developing technologies for their management. He has been nominated EurAI Fellow in 2015 and ACM Fellow in 2019.
Data visualization has been defined as ‘the use of computer-supported, interactive, visual representations of abstract data to amplify cognition’. Data visualization draws upon a variety of fields, including computer science, visual perception, design, and communication theory, to visually and interactively represent data so that it is explorable and discoverable. In this course, we will take a hands-on, data-driven approach to learning the process of visually representing data. In particular, we will make active use of visual variables and draw upon principles from externalization and sketch-based design to explore the challenges and possibilities of data visualization.
• Representation of data
• Data-to-visual mappings
• Sketching and externalization
• Data-driven design
▸ Jacques Bertin. Semiology of Graphics: Diagrams, Networks, Maps. University of Wisconsin Press, 1983. (Readings: pages 2-13 and 42-97.)
▸ Jean-Daniel Fekete, Jarke J. van Wijk, John T. Stasko, and Chris North. "The Value of Information Visualization". In: Information Visualization, pp. 1-18. Springer, 2008.
▸ Samuel Huron, Sheelagh Carpendale, Alice Thudt, Anthony Tang, and Michael Mauerer. "Constructive Visualization". In: Proc. of the ACM Conference on Designing Interactive Systems (DIS), pp. 433-442. ACM, 2014.
▸ Samuel Huron, Yvonne Jansen, and Sheelagh Carpendale. "Constructing Visual Representations: Investigating the Use of Tangible Tokens". IEEE Transactions on Visualization and Computer Graphics 20(12):2102-2111, Dec. 2014.
▸ Stuart Card, Jock Mackinlay, and Ben Shneiderman. Readings in Information Visualization: Using Vision to Think, Chapter 1. Morgan Kaufmann, 1999.
▸ Saul Greenberg, Sheelagh Carpendale, Nicolai Marquardt, and Bill Buxton. Sketching User Experiences: The Workbook. Elsevier, 2011.
▸ Ben Shneiderman. "The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations". In: Proc. 1996 IEEE Symposium on Visual Languages; also Maryland HCIL TR 96-13.
▸ Stuart Card and Jock Mackinlay. "The Structure of the Information Visualization Design Space". In: Proc. IEEE InfoVis 1997.
Sheelagh Carpendale is a Full Professor at Simon Fraser University in the School of Computing Science. She holds the NSERC/AITF/SMART Industrial Research Chair in Interactive Technologies. Her leadership role in the international data visualization research community has been repeatedly confirmed through many awards, including the IEEE Visualization Career Award and induction into both the IEEE Visualization Academy and the ACM CHI (Computer-Human Interaction) Academy. Her other awards include the Canadian NSERC E.W.R. Steacie Fellowship, a British BAFTA (equivalent to an Oscar) in Interactive Learning, the Alberta ASTech Award, and the Canadian Human Computer Communications Society Achievement Award. Her research focuses on information visualization, interaction design, and qualitative empirical research. By studying how people interact with information both in work and social settings, she works towards designing more natural, accessible, and understandable interactive visual representations of data. She combines information visualization, visual analytics, and human-computer interaction with innovative new interaction techniques to better support the everyday practices of people who are viewing, representing, and interacting with information.
Learning from imbalanced data is pervasive across applications: the class(es) of interest have far fewer instances, and this under-representation presents challenges all the way from learning to evaluation. Standard learning algorithms can be biased towards the larger classes and thus under-perform on the actual class of interest. For example, consider fraud detection, where non-fraudulent instances make up 99% of the data and fraudulent instances make up 1%. When a classifier is presented with such a skewed data distribution, it tends to favor the larger class and is biased towards making non-fraud predictions at the expense of predictions of fraud, when the goal is actually to catch fraudulent activity. This also exposes the shortcomings of accuracy as a metric, as always predicting the non-fraud class produces a 99% accurate classifier. The problem is further compounded in the presence of streaming data with changing distributions, or concept drift. In this tutorial, I will provide an introduction to learning from imbalanced data and concept drift; give an overview of popular methods for class imbalance, from sampling to learning algorithms, as well as for class imbalance in the presence of concept drift; discuss how to effectively evaluate the performance of learning algorithms; and cover the challenges associated with big data and the interface of deep learning and class imbalance. I will also include a perspective on different applications.
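A few lines of Python make the accuracy pitfall above concrete (the 99/1 split and the labels are invented for illustration): a classifier that never predicts fraud scores 99% accuracy while catching no fraud at all.

```python
# Illustrative 99%/1% class split, as in the fraud example above.
labels = [0] * 990 + [1] * 10        # 0 = non-fraud, 1 = fraud
preds  = [0] * 1000                  # a "classifier" that always says non-fraud

# Overall accuracy looks excellent...
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

# ...but recall on the fraud class, the class we actually care about, is zero.
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
recall = tp / (tp + fn)

print(accuracy)   # 0.99
print(recall)     # 0.0
```

This is why imbalanced-learning work evaluates with class-sensitive metrics (recall, F-measure, AUC) and rebalances the training data with sampling methods.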
Introductory course in machine learning or data science or data mining
Nitesh Chawla is the Frank M. Freimann Professor of Computer Science and Engineering and Director of the Center on Network and Data Science (CNDS) at the University of Notre Dame. His research is focused on machine learning, AI, and network science fundamentals and interdisciplinary applications that advance the common good. He is the recipient of several awards and honors, including best paper awards and nominations, an Outstanding Dissertation Award, the NIPS Classification Challenge Award, the IEEE CIS Outstanding Early Career Award, the IBM Watson Faculty Award, the IBM Big Data and Analytics Faculty Award, the National Academy of Engineering New Faculty Fellowship, and the 1st Source Bank Technology Commercialization Award. In recognition of the societal impact of his research, he received the Rodney F. Ganey Award. He is a co-founder of Aunalytics, a data science software and solutions company.
The rise of Bitcoin and other peer-to-peer cryptocurrencies has opened many interesting and challenging problems in cryptography, distributed systems, and databases. The main underlying data structure is the blockchain, a scalable fully replicated structure that is shared among all participants and guarantees a consistent view of all user transactions by all participants in the system. In this course, we discuss the basic protocols used in blockchain, and elaborate on its main advantages and limitations. To overcome these limitations, we provide the necessary distributed systems background in managing large scale fully replicated ledgers, using Byzantine Agreement protocols to solve the consensus problem. Finally, we expound on some of the most recent proposals to design scalable and efficient blockchains in both permissionless and permissioned settings. The focus of the tutorial is on the distributed systems and database aspects of the recent innovations in blockchains.
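As a minimal sketch of the hash-chaining idea underlying a blockchain (not any particular system's protocol; block contents are invented), each block below records the hash of its predecessor, so tampering with any block breaks every later link:

```python
import hashlib
import json

def block_hash(block):
    """Hash a block's full contents, which include the previous block's hash."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain, transactions):
    """Link a new block to the current tail of the chain."""
    prev = block_hash(chain[-1]) if chain else "0" * 64   # genesis sentinel
    chain.append({"prev_hash": prev, "transactions": transactions})

def verify(chain):
    """Recompute every link; any tampered block invalidates all later links."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )

chain = []
append_block(chain, ["alice pays bob 5"])
append_block(chain, ["bob pays carol 2"])
append_block(chain, ["carol pays dave 1"])
```

What this sketch deliberately omits, proof-of-work or Byzantine Agreement to decide *who* may append the next block, is precisely the consensus problem the course addresses.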
Basic knowledge of data structures and operating systems.
Amr El Abbadi is a Professor of Computer Science at the University of California, Santa Barbara. He received his B. Eng. from Alexandria University, Egypt, and his Ph.D. from Cornell University. His research interests are in the fields of fault-tolerant distributed systems and databases, focusing recently on Cloud data management and blockchain based systems. Prof. El Abbadi is an ACM Fellow, AAAS Fellow, and IEEE Fellow. He was Chair of the Computer Science Department at UCSB from 2007 to 2011. He has served as a journal editor for several database journals, including, The VLDB Journal, IEEE Transactions on Computers and The Computer Journal. He has been Program Chair for multiple database and distributed systems conferences. He currently serves on the executive committee of the IEEE Technical Committee on Data Engineering (TCDE) and was a board member of the VLDB Endowment from 2002 to 2008. In 2007, Prof. El Abbadi received the UCSB Senate Outstanding Mentorship Award for his excellence in mentoring graduate students. In 2013, his student, Sudipto Das received the SIGMOD Jim Gray Doctoral Dissertation Award. Prof. El Abbadi is also a co-recipient of the Test of Time Award at EDBT/ICDT 2015. He has published over 300 articles in databases and distributed systems and has supervised over 35 PhD students.
These three lectures will explain the essential theory and practice of neural networks as used in current applications.
The lectures will be less mathematical than https://mitpress.mit.edu/books/deep-learning but more mathematical than https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/. Both these books are highly recommended.
Mathematics at the level of an undergraduate degree in computer science: basic multivariate calculus, probability theory, and linear algebra.
Charles Elkan is the global head of machine learning and a managing director at Goldman Sachs in New York. He is also an adjunct professor of computer science at the University of California, San Diego (UCSD). From 2014 to 2018 he was the first Amazon Fellow, leading a team of over 30 scientists and engineers in Seattle, Palo Alto, and New York doing research and development in applied machine learning in both e-commerce and cloud computing. Before joining Amazon, he was a full-time professor of computer science at UCSD. His Ph.D. is from Cornell in computer science, and his undergraduate degree is from Cambridge in mathematics. For publications, see https://scholar.google.com/citations?user=im5aMngAAAAJ&hl=en
Effective Big Data analytics need to rely on algorithms for querying and analyzing massive, continuous data streams (that is, data that is seen only once and in a fixed order) with limited memory and CPU-time resources. Such streams arise naturally in emerging large-scale event monitoring applications; for instance, network-operations monitoring in large ISPs, where usage information from numerous network devices needs to be continuously collected and analyzed for interesting trends and real-time reaction to different scenarios (e.g., hotspots or DDoS attacks). In addition to memory- and time-efficiency concerns, the inherently distributed nature of such applications also raises important communication-efficiency issues, making it critical to carefully optimize the use of the underlying communication infrastructure. This course will provide an overview of some key algorithmic tools for supporting effective, real-time analytics over streaming data. Our primary focus will be on small-space sketch synopses for approximating continuous data streams, and their applicability in both centralized and distributed settings.
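As one concrete example of a small-space sketch synopsis, here is a minimal Count-Min sketch (parameters and the "traffic" below are illustrative, and SHA-256 stands in for the pairwise-independent hash functions used in practice). It tracks item frequencies in a fixed-size table, never underestimating a count and overestimating only when hash collisions inflate every row:

```python
import hashlib

class CountMinSketch:
    """Small-space frequency sketch for data streams."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]   # depth x width counters

    def _cells(self, item):
        # One hashed cell per row; sha256 here stands in for proper hash families.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Min over rows: never below the true count, above it only on collisions.
        return min(self.table[row][col] for row, col in self._cells(item))

# Illustrative "network traffic": one hotspot source plus background noise.
cms = CountMinSketch()
cms.add("203.0.113.7", 10000)
for i in range(50):
    cms.add(f"10.0.0.{i}")
```

The sketch uses width x depth counters regardless of how many distinct items the stream contains, which is what makes it attractive for the ISP-monitoring scenarios above; sketches from distributed sites can also be merged by adding tables cell-wise.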
Algorithms, Complexity, Randomized Algorithms, Elementary Probability and Linear Algebra.
Dr. Minos Garofalakis is the Director of the Information Management Systems Institute (IMSI) at the ATHENA Research and Innovation Center and a Professor of Computer Science at the Technical University of Crete, where he also directs the Software Technology and Network Applications Lab (SoftNet). He received the MSc and PhD degrees in Computer Science from the University of Wisconsin-Madison in 1994 and 1998, respectively, and worked as a Member of Technical Staff at Bell Labs, Lucent Technologies (1998-2005), as a Senior Researcher at Intel Research Berkeley (2005-2007), and as a Principal Research Scientist at Yahoo! Research (2007-2008). In parallel, he also held an Adjunct Associate Professor position at the EECS Department of the University of California, Berkeley (2006-2008).
Prof. Garofalakis' research interests are in the broad areas of Big Data Analytics and Large-Scale Machine Learning. He has published over 170 scientific papers in refereed international conferences and journals in these areas, is the co-editor of a volume on Data Stream Management published by Springer in 2016, and has delivered several invited keynote talks and tutorials in major international events. Prof. Garofalakis' work has resulted in 36 US Patent filings (29 patents issued) for companies such as Lucent, Yahoo!, and AT&T; in addition, he is/has been the PI in a number of important research projects funded by the European Union. Google Scholar gives over 14,000 citations to his work, and an h-index value of 63. Prof. Garofalakis is an ACM Fellow (2018) "for contributions to data processing and analytics", an IEEE Fellow (2017) "for contributions to data streaming analytics", and a recipient of the TUC "Excellence in Research" Award (2015), a Marie-Curie International Reintegration Fellowship (2010-2014), the 2009 IEEE ICDE Best Paper Award, the Bell Labs President's Gold Award (2004), and the Bell Labs Teamwork Award (2003).
The real-world big data are largely unstructured, interconnected, and dynamic, in the form of natural language text. It is highly desirable to transform such massive unstructured data into structured knowledge. Many researchers rely on labor-intensive labeling and curation to extract knowledge from such data. However, such approaches may not be scalable, especially considering that a lot of text corpora are highly dynamic and domain-specific. We argue that massive text data itself may disclose a large body of hidden patterns, structures, and knowledge. Equipped with domain-independent and domain-dependent knowledge-bases, we should explore the power of massive data itself for turning unstructured data into structured knowledge. Moreover, by organizing massive text documents into multidimensional text cubes, we show that structured knowledge can be extracted and used effectively. In this talk, we introduce a set of methods developed recently in our group for such an exploration, including mining quality phrases, entity recognition and typing, multi-faceted taxonomy construction, and construction and exploration of multi-dimensional text cubes. We show that a data-driven approach could be a promising direction for transforming massive text data into structured knowledge.
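As a toy illustration of frequency-driven phrase mining (not the group's actual tools), the sketch below scores candidate bigrams over an invented mini-corpus by pointwise mutual information: word pairs that co-occur far more often than chance score highly and are candidate quality phrases.

```python
import math
from collections import Counter

# Invented mini-corpus for illustration only.
docs = [
    "support vector machine learning",
    "support vector machines are popular",
    "machine learning with support vector methods",
    "deep learning and machine learning",
]

words = [w for d in docs for w in d.split()]
unigrams = Counter(words)
# Count bigrams within each document (never across document boundaries).
bigrams = Counter(p for d in docs for p in zip(d.split(), d.split()[1:]))
total = len(words)

def pmi(a, b):
    """Pointwise mutual information: log of how much more often (a, b)
    appears together than expected if the words were independent."""
    p_ab = bigrams[(a, b)] / total
    p_a, p_b = unigrams[a] / total, unigrams[b] / total
    return math.log(p_ab / (p_a * p_b))
```

Real phrase-mining systems go well beyond raw PMI (e.g., longer n-grams, completeness and informativeness signals, distant supervision), but the co-occurrence statistic is the common starting point.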
• Familiarity with elementary machine learning, data mining, and natural language processing.
Jiawei Han is the Michael Aiken Chair Professor in the Department of Computer Science, University of Illinois at Urbana-Champaign. His research spans data mining, information network analysis, database systems, and data warehousing, with over 900 journal and conference publications. He has chaired or served on the program committees of many international data mining and database conferences. He also served as the founding Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data (2008-2011), the Director of the Information Network Academic Research Center supported by the U.S. Army Research Lab (2009-2016), and the co-Director of KnowEnG, an NIH-funded Center of Excellence in Big Data Computing, since 2014. He is a Fellow of ACM and a Fellow of IEEE. He received the ACM SIGKDD Innovations Award (2004), the IEEE Computer Society Technical Achievement Award (2005), the IEEE W. Wallace McDowell Award (2009), and Japan's Funai Achievement Award (2018), and was named Michael Aiken Chair Professor at the University of Illinois (2019).
There is a tremendous amount of data spread across the web and stored in databases that can be turned into an integrated semantic network of data, called a knowledge graph. Knowledge graphs have been applied to a variety of challenging real-world problems, including combating human trafficking, finding illegal arms sales in online marketplaces, and identifying threats in space. However, exploiting the available data to build knowledge graphs is difficult due to the heterogeneity of the sources, the scale of the data, and noise in the data. In this course I will review techniques for building knowledge graphs, including extracting data from online sources, aligning the data to a common terminology, linking the data across sources, and representing knowledge graphs and querying them at scale.
►Part 1: Knowledge graphs
 - Web data extraction
►Part 2: Source alignment
►Part 3: Representing and querying knowledge graphs
 - Existing knowledge graphs to reuse
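As a toy illustration of the alignment-and-linking steps (entity names are hypothetical; this is not the course's actual tooling), the sketch below links entity mentions across two sources by token-set Jaccard similarity:

```python
# Toy record-linkage sketch: match entity mentions across two sources
# by Jaccard similarity of their token sets. Data is invented.

def jaccard(a, b):
    """Jaccard similarity of the lowercased token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

source_a = ["International Business Machines", "Apple Inc"]
source_b = ["IBM", "Apple Incorporated", "Business Machines International"]

# Link any cross-source pair whose similarity clears a threshold.
links = [(x, y) for x in source_a for y in source_b if jaccard(x, y) >= 0.5]
```

Note what the toy gets wrong: "IBM" never matches its full name, and "Apple Inc" vs. "Apple Incorporated" falls below the threshold. These failure modes are exactly why real entity-linkage pipelines add abbreviation expansion, learned similarity functions, and blocking for scale.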
Background in computer science and some basic knowledge of AI, machine learning, and databases will be helpful, but not required.
Craig Knoblock is a Research Professor of both Computer Science and Spatial Sciences at the University of Southern California (USC), Keston Executive Director of the USC Information Sciences Institute, and Director of the Data Science Program at USC. He received his Bachelor of Science degree from Syracuse University and his Master’s and Ph.D. from Carnegie Mellon University in computer science. His research focuses on techniques for describing, acquiring, and exploiting the semantics of data. He has worked extensively on source modeling, schema and ontology alignment, entity and record linkage, data cleaning and normalization, extracting data from the Web, and combining all of these techniques to build knowledge graphs. He has published more than 300 journal articles, book chapters, and conference papers on these topics and has received 7 best paper awards on this work. Dr. Knoblock is a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI), a Fellow of the Association of Computing Machinery (ACM), past President and Trustee of the International Joint Conference on Artificial Intelligence (IJCAI), and winner of the 2014 Robert S. Engelmore Award.
Contemporary scientific research takes advantage of the ever-increasing volume and variety of data to make new discoveries. Similarly, researchers and data scientists/developers working within every sector of society are trying to use vast amounts of data to improve many aspects of our daily lives, including how to make our lives more pleasant and last longer. The ultimate goal in front of us is to transform gargantuan amounts of data into practical knowledge. This conversion is the real challenge of the 21st century. One of the major impediments to this conversion is the fact that data collected from various sources are sometimes contradictory, and it is not clear how to resolve contrary information without going to the primary source. This situation leads to a lack of reproducibility, which is the most critical bottleneck in modern research. Experimental reproducibility is the cornerstone of scientific research, and the veracity of scientific publications is crucial because subsequent lines of investigation frequently rely on previous knowledge. Several recent systematic surveys of academic results published in biomedical journals reveal that a large fraction of representative studies in a variety of fields cannot be reproduced in another laboratory. Artificial intelligence and Big Data approaches are coming to the rescue. The goal of these lectures is to discuss a strategy to increase the reproducibility of reported results for a wide range of experiments by building a set of “best practices,” culled by extensive data harvesting and curation combined with experimental verification of the parameters crucial for reproducibility. Experimental verification assisted by automatic/semi-automatic harvesting of data from laboratory equipment into an already developed, sophisticated laboratory information management system (LIMS) will be presented. This data-in, information-out paradigm will be discussed in detail.
Harrison Distinguished Professor of Molecular Physiology and Biological Physics, University of Virginia. Development of methods for structural biology, in particular macromolecular structure determination by protein crystallography. Data management in structural biology, data mining as applied to drug discovery, bioinformatics. Member of the Center for Structural Genomics of Infectious Diseases. Former member of the Midwest Center for Structural Genomics, the New York Center for Structural Genomics, and the Enzyme Function Initiative.
Intelligent personalized applications such as recommender systems help alleviate information overload by tailoring their output to users' personal preferences. Some of the most effective providers of online information and services, such as Amazon, LinkedIn, Netflix, Spotify, Facebook, YouTube, and others, rely heavily on machine learning methods that predict user preferences and recommend or suggest personalized content. These systems, however, often do not take into account the fact that users interact with systems in a particular context, and that users' preferences change over time as they transition among different contexts. The role of context has been studied in many disciplines. In psychology, a change in context during learning has been shown to have an impact on recall. Research in linguistics has shown that context plays the important role of a disambiguation function. More recently, a variety of approaches and architectures have emerged for incorporating context or situational awareness in the recommendation process. In this short course, I will provide a broad overview of the problem of contextual recommendation and some of the solutions proposed in Context-Aware Recommender Systems (CARS) research. We will begin with some general background on recommender systems and basic architectural approaches for CARS design. We will then examine several approaches and algorithms for CARS in more detail, including those involving representational context (where explicit contextual variables are available) as well as those based on interactional context, where contextual states are implicit and are inferred as a set of latent factors.
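One simple CARS strategy with representational context, contextual pre-filtering, can be sketched in a few lines: restrict the rating data to the target context before computing recommendations. The ratings and the context variable below are invented for illustration.

```python
# Toy contextual pre-filtering sketch: filter ratings by context,
# then recommend the item with the highest in-context mean rating.

ratings = [
    # (user, item, rating, context) -- all values are illustrative
    ("u1", "comedy", 5, "weekend"),
    ("u2", "comedy", 4, "weekend"),
    ("u1", "drama",  2, "weekend"),
    ("u2", "drama",  5, "weekday"),
    ("u3", "drama",  4, "weekday"),
]

def recommend(context):
    """Keep only ratings given in `context`, then rank items by mean rating."""
    by_item = {}
    for user, item, r, ctx in ratings:
        if ctx == context:
            by_item.setdefault(item, []).append(r)
    means = {item: sum(rs) / len(rs) for item, rs in by_item.items()}
    return max(means, key=means.get)
```

The same data yields different recommendations per context (comedies on weekends, dramas on weekdays here), which is the point; the interactional-context methods covered later infer such contextual states as latent factors instead of filtering on an explicit variable.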
- Basic knowledge of recommender systems (though a brief introduction will be provided)
- Basic knowledge of probability theory and linear algebra
Bamshad Mobasher is a professor of Computer Science and the director of the Center for Web Intelligence at the DePaul University College of Computing and Digital Media. His research areas include artificial intelligence and machine learning, and he is considered one of the leading authorities in Web personalization and recommender systems. He has published over 200 scientific articles on these topics. As the director of the Center for Web Intelligence, he directs research on intelligent Web applications, including several NSF-funded projects, and regularly works with industry on joint projects. He serves in leadership positions of several international conferences, including the ACM Conference on Recommender Systems, the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, and the ACM Conference on User Modeling, Adaptation, and Personalization.
At present, we are going through the fourth industrial revolution, fueled by data, which has been termed the 'new oil'. We can easily find examples where insight drawn from large data sets is driving growth in different areas of industry. For example, large corpora of texts available in many different languages have led to breakthroughs in Natural Language Processing tasks such as neural machine translation, sentiment analysis, text summarization, question answering, and picture and video captioning. Just as we have huge corpora of texts, we now have many software repositories with billions of lines of code easily available on platforms such as GitHub, Bitbucket, etc. We can call this data 'Big Code', since it has all four Vs (volume, variety, velocity, and veracity) of Big Data. These software repositories not only contain billions of lines of code but also associated text in natural language, such as class, method, and variable names, comments, doc strings, version history, bug reports, etc. This huge volume of data makes a good case for applying machine learning to source code and drawing insight for better software development, although it also brings unique challenges, such as how to model source code. In this course, I will discuss a framework for machine learning on source code and present a set of use cases where it has been applied. I will introduce a set of tools and techniques to build an end-to-end inference pipeline, starting from data acquisition, cleaning, and feature engineering through to modelling and inference, for applying machine learning on source code.
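As a taste of the first step of such a pipeline, the sketch below uses Python's standard ast module to turn a source file into a bag of identifier names, one simple way to featurize code for a downstream vectorizer or model (the example function is invented, and real pipelines use richer representations such as ASTs or token sequences directly):

```python
import ast
from collections import Counter

# A hypothetical source file to featurize.
source = """
def moving_average(values, window):
    total = sum(values[:window])
    return total / window
"""

class IdentifierCollector(ast.NodeVisitor):
    """Walk the AST, counting variable references and function names."""

    def __init__(self):
        self.names = Counter()

    def visit_Name(self, node):          # variable / builtin references
        self.names[node.id] += 1
        self.generic_visit(node)

    def visit_FunctionDef(self, node):   # defined function names
        self.names[node.name] += 1
        self.generic_visit(node)

collector = IdentifierCollector()
collector.visit(ast.parse(source))
# collector.names is now a bag-of-identifiers feature vector for this file.
```

Identifier names carry much of the natural-language signal mentioned above, which is why even this crude representation supports tasks like suggesting method names or detecting misleading ones.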
Dr. Jayanti Prasad received his Ph.D. in Physics (Astrophysics) from the Harish-Chandra Research Institute, Allahabad, India, and was a postdoctoral fellow at the National Centre for Radio Astrophysics (NCRA), Pune, India, and the Inter-University Centre for Astronomy and Astrophysics (IUCAA), Pune, India. Dr. Prasad has been a recipient of international and national research grants and has published more than 100 research papers, both with small groups and in large collaborations. During his six years as a member of the LIGO Scientific Collaboration, he contributed to major discoveries such as the first detection of gravitational waves. He also worked as a consultant for the LIGO Data Grid Center at IUCAA. Dr. Prasad has worked on computing- and data-intensive problems in astronomy and astrophysics, such as galaxy formation and clustering, cosmological N-body simulations, radio astronomy data processing, the Cosmic Microwave Background Radiation, and gravitational waves. Some of his work on problems such as Particle Swarm Optimization and Maximum Entropy Deconvolution has also been well received outside the astronomy community. Recently, Dr. Prasad has become very interested in Data Science, Machine Learning, and Software Engineering, and is working as a data scientist for a software company in India.
Recommender systems learn the preferences of their users and predict items that they will like. Recommendation engines are deployed in many services today, across various domains, and aim to assist users in locating the exact item they need or like.
In this course we will cover the fundamental algorithms in recommender systems, from collaborative filtering, content-based filtering, and matrix factorization through the advanced deep learning techniques that were recently developed to enhance these basic approaches. We will discuss the challenges the field is facing, such as sparsity, cold start, scalability, and the reliable evaluation and measurement of recommender performance, and show how these challenges may be addressed with recent developments.
• Types of recommenders
• Applying deep learning algorithms to the different types of recommenders
• Real-world challenges and solutions in recommender systems
• Evaluation and measurement of recommender systems
• Hands-on practice in building recommender systems
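As a taste of the matrix-factorization material above, here is a minimal sketch (illustrative only; the function names, hyperparameters, and toy ratings are my own, not course code) of plain stochastic gradient descent factorization, where the ratings matrix R is approximated by the product of user factors P and item factors Q:

```python
import random

def factorize(ratings, k=2, steps=3000, lr=0.01, reg=0.02, seed=0):
    """SGD matrix factorization: R is approximated by P . Q^T.
    ratings: dict mapping (user, item) -> observed rating."""
    rnd = random.Random(seed)
    users = {u for u, _ in ratings}
    items = {i for _, i in ratings}
    P = {u: [rnd.uniform(0, 0.5) for _ in range(k)] for u in users}
    Q = {i: [rnd.uniform(0, 0.5) for _ in range(k)] for i in items}
    for _ in range(steps):
        for (u, i), r in ratings.items():
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))
```

The deep learning techniques covered in the course replace the inner dot product with learned nonlinear interaction functions, but the factor-learning structure is the same.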
Bracha Shapira is a professor at the Software and Information Systems Engineering Department at Ben-Gurion University of the Negev. She is currently the deputy dean of research at the Faculty of Engineering Sciences at BGU. She established the Information Retrieval Lab there and leads numerous research projects related to personalization, recommender systems, user profiling, privacy, and the application of machine learning methods to cyber security. She has published more than 200 papers in major scientific conferences and journals and has several registered patents. Prof. Shapira is also the editor of "Recommender Systems Handbook" (1st edition, Springer, 2011; 2nd edition, 2015).
Lior Rokach is a professor of data science at the Ben-Gurion University of the Negev, where he currently serves as the chair of the Department of Software and Information System Engineering. His research interests lie in the areas of Machine Learning, Big Data, Deep Learning and Data Mining and their applications. Prof. Rokach is the author of over 300 peer-reviewed papers in leading journals and conference proceedings. Rokach has authored several popular books in data science, including Data Mining with Decision Trees (1st edition, World Scientific Publishing, 2007; 2nd edition, 2015). He is also the editor of "The Data Mining and Knowledge Discovery Handbook" (1st edition, Springer, 2005; 2nd edition, 2010) and "Recommender Systems Handbook" (1st edition, Springer, 2011; 2nd edition, 2015). He currently serves as an editorial board member of ACM Transactions on Intelligent Systems and Technology (ACM TIST) and an area editor for Information Fusion (Elsevier).
Real data often contain anomalous cases, also known as outliers. Depending on the situation, outliers may be (a) undesirable errors, which can adversely affect the data analysis, or (b) valuable nuggets of unexpected information. In either case, the ability to detect such anomalies is essential. A useful tool for this purpose is robust statistics, which aims to detect the outliers by first fitting the majority of the data and then flagging data points that deviate from it. We present an overview of several robust methods and the resulting graphical outlier detection tools. We discuss robust procedures for univariate, low-dimensional, and high-dimensional data, such as estimating location and scatter, linear regression, and principal component analysis. The emerging topic of cellwise outliers is also introduced.
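The "fit the majority first" idea can be sketched in a few lines: a robust z-score uses the median as location and the consistency-scaled median absolute deviation (MAD) as scale, so that outliers cannot mask themselves as they can with the mean and standard deviation. (This toy sketch is mine; the 3.5 cutoff is a common rule of thumb, not a universal constant.)

```python
import statistics

def robust_zscores(x):
    """Robust z-scores: center = median, scale = MAD,
    scaled by 1.4826 for consistency with the normal sd."""
    med = statistics.median(x)
    mad = statistics.median(abs(v - med) for v in x) * 1.4826
    return [(v - med) / mad for v in x]

def flag_outliers(x, cutoff=3.5):
    # Flag points that deviate strongly from the majority fit.
    return [abs(z) > cutoff for z in robust_zscores(x)]
```

With a classical z-score, a single extreme value inflates the standard deviation and can hide itself; the median/MAD version keeps the outlier's score large.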
Some basic knowledge of linear regression and principal components is helpful.
Peter Rousseeuw obtained his PhD at ETH Zurich, Switzerland, and afterward became a professor at universities in the Netherlands, Switzerland and Belgium. For over a decade he worked full-time at Renaissance Technologies in the US. Currently he is a Professor at KU Leuven in Belgium. His main research topics are cluster analysis (unsupervised classification) and anomaly detection by robust fitting, always with a focus on methodology as well as efficient algorithms and practical implementation. His work has been cited over 70,000 times. For more information see https://en.wikipedia.org/wiki/Peter_Rousseeuw and https://scholar.google.com/citations?user=5LMM6rsAAAAJ&hl=en .
Deep learning owes its success to the advent of massively parallel computing enabled by FPGAs (Field Programmable Gate Arrays), GPUs (Graphical Processing Units) and other special processors. However, many other neural network architectures can exploit such massively parallel computing. In this course, I will introduce the basic concepts and architectures of heterogeneous computing using FPGAs and GPUs. There are two basic languages for programming such hardware – OpenCL for FPGAs (from Intel, Xilinx and others) and CUDA for Nvidia GPUs. I will introduce the basic features of these languages and show how to implement parallel computations in these languages.
In the second part of this course, I will show how to implement some basic neural architectures on this kind of hardware. In addition, we can do much more with such hardware including feature selection, hyperparameter tuning and finding a good neural architecture. Finding the best combination of features, the best neural network design and the best hyperparameters is critical to neural networks. With the availability of massive parallelism, it is relatively easy now to explore, in parallel, many different combinations of features, neural network designs and hyperparameters.
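The parallel-exploration idea is hardware-agnostic; the sketch below uses CPU threads in Python purely to illustrate the structure (on FPGAs or GPUs each candidate would map to parallel kernels instead), and evaluate() is a stand-in for actually training a network on a given hyperparameter combination. All names here are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def evaluate(params):
    # Stand-in for training a network and returning validation loss;
    # a toy quadratic keeps the sketch self-contained.
    lr, width = params
    return (lr - 0.01) ** 2 + (width - 64) ** 2 / 1e4

def grid_search(lrs, widths, workers=4):
    # Evaluate all hyperparameter combinations in parallel,
    # then keep the one with the lowest loss.
    grid = list(product(lrs, widths))
    with ThreadPoolExecutor(max_workers=workers) as ex:
        losses = list(ex.map(evaluate, grid))
    return min(zip(losses, grid))[1]
```

The same fan-out/reduce pattern extends to feature-subset and architecture search, which is exactly where massive hardware parallelism pays off.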
In the last part of the course, I will discuss why it is becoming important that machine learning for IoT happen at the edge rather than in the cloud, and how FPGAs and GPUs can facilitate that. Beyond IoT, localized machine learning from streaming sensor data is becoming increasingly important in a wide range of application domains, from robotics to remote patient monitoring. GPUs, in particular, are available in a wide range of capabilities and prices, and one can use them in many such applications where localized machine learning is desirable.
►Lecture 1: Massively parallel, heterogeneous computing using FPGAs and GPUs – heterogeneous computing concepts and architectures; comparison of FPGAs and GPUs; programming languages for parallel computing (OpenCL, CUDA)
►Lecture 2: Implementation of basic neural network algorithms on FPGAs and GPUs exploiting massive parallelism; exploiting massive parallelism to explore different feature combinations and neural network designs and for hyperparameter tuning
►Lecture 3: Machine learning at the edge of IoT in real-time from streaming sensor data using FPGAs and GPUs – classification, function approximation, clustering, anomaly detection
Fundamentals of computer science, basic knowledge of neural networks
Asim Roy is a professor of information systems at Arizona State University. He earned his bachelor's degree from Calcutta University, his master's degree from Case Western Reserve University, and his doctorate from the University of Texas at Austin. He has been a visiting scholar at Stanford University and a visiting scientist at the Robotics and Intelligent Systems Group at Oak Ridge National Laboratory, Tennessee. Professor Roy serves on the Governing Board of the International Neural Network Society (INNS) and is currently its VP of Industrial Relations. He is the founder of two INNS Sections, one on Autonomous Machine Learning and the other on Big Data Analytics. He was the Guest Editor-in-Chief of the Frontiers in Psychology open-access eBook "Representation in the Brain". He was also the Guest Editor-in-Chief of two special issues of Neural Networks, one on autonomous learning and the other on big data analytics. He is the Senior Editor of Big Data Analytics and serves on the editorial boards of Neural Networks and Cognitive Computation.
He has served on the organizing committees of many scientific conferences. He started the Big Data conference series of INNS and was the General Co-Chair of the first one in San Francisco in 2015. He was the Technical Program Co-Chair of IJCNN 2015 in Ireland and the IJCNN Technical Program Co-Chair for the World Congress on Computational Intelligence 2018 (WCCI 2018) in Rio de Janeiro, Brazil. He is currently the IJCNN General Chair for WCCI 2020 in Glasgow, UK (https://www.wcci2020.org/). He is currently working on hardware-based (GPU, FPGA-based) machine learning for real-time learning from streaming data at the edge of the Internet of Things (IoT). He is also working on Explainable AI.
The representation of multidimensional, spatial, and metric data is an important issue in applications of spatial and spatiotextual databases, geographic information systems (GIS), and location-based services. Recently, there has been much interest in hierarchical data structures such as quadtrees, octrees, and pyramids, which are based on image hierarchies, as well as methods that make use of bounding boxes, which are based on object hierarchies. Their key advantage is that they provide a way to index into space. In fact, they are little more than multidimensional sorts. They are compact and, depending on the nature of the spatial data, they save space as well as time and also facilitate operations such as search.
We describe hierarchical representations of points, lines, collections of small rectangles, regions, surfaces, and volumes. For region data, we point out the dimension-reduction property of the region quadtree and octree. We also demonstrate how to use them for both raster and vector data. For metric data that does not lie in a vector space, so that indexing is based simply on the distance between objects, we review various representations such as the vp-tree, gh-tree, and mb-tree. In particular, we demonstrate the close relationship between these representations and those designed for a vector space.
For all of the representations, we show how they can be used to compute nearest objects in an incremental fashion, so that the number of objects need not be known in advance. The VASCO Java applet that illustrates these methods is presented (found at http://www.cs.umd.edu/~hjs/quadtree/index.html). They are also used in applications such as the SAND Internet Browser (found at http://www.cs.umd.edu/~brabec/sandjava).
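To give a flavor of incremental nearest-neighbor search over a hierarchical index, here is a much-simplified PR-quadtree sketch (my own toy code, not the VASCO/SAND implementation): a single priority queue holds both quadtree cells, keyed by their minimum distance to the query point, and individual points, so neighbors are reported in distance order on demand and the number of neighbors need not be known in advance.

```python
import heapq
import itertools

class Quad:
    """PR-quadtree node with leaf capacity 1; cell is the square (x, y, size)."""
    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size
        self.point = None
        self.kids = None  # four sub-quadrants, created on split

    def insert(self, p):
        if self.kids is None and self.point is None:
            self.point = p
            return
        if self.kids is None:
            self._split()
        self._child(p).insert(p)

    def _split(self):
        h = self.size / 2
        self.kids = [Quad(self.x + dx * h, self.y + dy * h, h)
                     for dy in (0, 1) for dx in (0, 1)]
        old, self.point = self.point, None
        self._child(old).insert(old)

    def _child(self, p):
        h = self.size / 2
        return self.kids[(p[1] >= self.y + h) * 2 + (p[0] >= self.x + h)]

    def min_dist2(self, q):
        # Squared distance from q to this cell (0 if q lies inside it).
        dx = max(self.x - q[0], 0, q[0] - (self.x + self.size))
        dy = max(self.y - q[1], 0, q[1] - (self.y + self.size))
        return dx * dx + dy * dy

def incremental_nearest(root, q):
    """Yield points in increasing distance from q, one at a time."""
    counter = itertools.count()  # tie-breaker so the heap never compares nodes
    heap = [(root.min_dist2(q), next(counter), root)]
    while heap:
        d, _, elem = heapq.heappop(heap)
        if isinstance(elem, tuple):      # a point: report it
            yield elem
        else:                            # a quadtree cell: expand it
            if elem.point is not None:
                p = elem.point
                pd = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                heapq.heappush(heap, (pd, next(counter), p))
            for kid in elem.kids or []:
                heapq.heappush(heap, (kid.min_dist2(q), next(counter), kid))
```

Because the queue mixes cells and points, the search only descends into a cell when its minimum distance is reached, which is the essence of the incremental algorithm described above.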
The above has been in the context of the traditional geometric representation of spatial data, while in the final part we review the more recent textual representation used in location-based services, where the key issue is that of resolving ambiguities. For example, does "London" correspond to the name of a person or a location, and if it corresponds to a location, which of the over 700 different instances of "London" is it? The NewsStand system at newsstand.umiacs.umd.edu and the TwitterStand system at TwitterStand.umiacs.umd.edu are examples. See also the cover article of the October 2014 issue of Communications of the ACM at http://tinyurl.com/newsstand-cacm or a cached version at http://www.cs.umd.edu/~hjs/pubs/cacm-newsstand.pdf and the accompanying video at https://vimeo.com/106352925 .
Practitioners working in the areas of big spatial data and spatial data science that involve spatial databases, geographic information systems, and location-based services will be given a different perspective on data structures found to be useful in most applications. Familiarity with computer terminology and some programming experience is needed to follow this course.
Hanan Samet (http://www.cs.umd.edu/~hjs/) is a Distinguished University Professor of Computer Science at the University of Maryland, College Park, and is a member of the Institute for Advanced Computer Studies. He is also a member of the Computer Vision Laboratory at the Center for Automation Research, where he leads a number of research projects on the use of hierarchical data structures for database applications, geographic information systems, computer graphics, computer vision, image processing, games, robotics, and search. He received the B.S. degree in engineering from UCLA, and the M.S. degree in operations research and the M.S. and Ph.D. degrees in computer science from Stanford University. His doctoral dissertation dealt with proving the correctness of translations of LISP programs, which was the first work in translation validation and the related concept of proof-carrying code.
He is the author of the book "Foundations of Multidimensional and Metric Data Structures" (http://www.cs.umd.edu/~hjs/multidimensional-book-flyer.pdf), published by Morgan-Kaufmann, an imprint of Elsevier, in 2006, a winner of the 2006 best book in Computer and Information Science competition of the Professional and Scholarly Publishers (PSP) Group of the Association of American Publishers (AAP), and of the first two books on spatial data structures, "Design and Analysis of Spatial Data Structures" and "Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS", both published by Addison-Wesley in 1990. He is the Founding Editor-in-Chief of the ACM Transactions on Spatial Algorithms and Systems (TSAS), the founding chair of ACM SIGSPATIAL, a recipient of a Science Foundation Ireland (SFI) Walton Visitor Award at the Centre for Geocomputation at the National University of Ireland at Maynooth (NUIM), the 2009 UCGIS Research Award, the 2010 CMPS Board of Visitors Award at the University of Maryland, the 2011 ACM Paris Kanellakis Theory and Practice Award, and the 2014 IEEE Computer Society Wallace McDowell Award, and a Fellow of the ACM, IEEE, AAAS, IAPR (International Association for Pattern Recognition), and UCGIS (University Consortium for Geographic Information Science). He was recently elected to the SIGGRAPH Academy. He received best paper awards in the 2007 Computers & Graphics Journal, the 2008 ACM SIGMOD and SIGSPATIAL ACMGIS Conferences, the 2012 SIGSPATIAL MobiGIS Workshop, and the 2013 SIGSPATIAL GIR Workshop, as well as a best demo award at the 2011 SIGSPATIAL ACMGIS Conference. The 2008 ACM SIGSPATIAL ACMGIS best paper award winner also received the SIGSPATIAL 10-Year Impact Award. His paper at the 2009 IEEE International Conference on Data Engineering (ICDE) was selected as one of the best papers for publication in the IEEE Transactions on Knowledge and Data Engineering. He was elected to the ACM Council as the Capitol Region Representative for the term 1989-1991, and is an ACM Distinguished Speaker.
What is the statistically optimal way to detect and extract information from signals in noisy data? After detecting ensembles of signals, what can we learn about the population of all the signals? This course will address these questions using the language of Bayesian inference. After reviewing the basics of Bayes' theorem, we will frame the problem of signal detection in terms of hypothesis testing and model selection. Extracting information from signals will be cast in terms of computing posterior density functions of signal parameters. After reviewing model selection and parameter estimation, the course will focus on practical methods. Specifically, we will implement sampling algorithms which we will use to perform model selection and parameter estimation on signals in synthetic data sets. Finally, we will ask what can be learned about the population properties of an ensemble of signals. This population-level inference will be studied as a hierarchical inference problem.
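As a minimal preview of the sampling part (a toy model of my own, not the course's code): a Metropolis sampler drawing from the posterior of a constant signal amplitude buried in white Gaussian noise, with a flat prior, so the posterior samples can be summarized into parameter estimates.

```python
import math
import random

def log_posterior(a, data, sigma=1.0):
    # Gaussian likelihood for a constant signal of amplitude a in
    # white noise of known sigma; flat prior on a (up to a constant).
    return -0.5 * sum((d - a) ** 2 for d in data) / sigma ** 2

def metropolis(data, n_samples=5000, step=0.5, seed=1):
    rnd = random.Random(seed)
    a = 0.0
    lp = log_posterior(a, data)
    samples = []
    for _ in range(n_samples):
        prop = a + rnd.gauss(0, step)        # symmetric random-walk proposal
        lp_prop = log_posterior(prop, data)
        if math.log(rnd.random()) < lp_prop - lp:  # Metropolis accept rule
            a, lp = prop, lp_prop
        samples.append(a)
    return samples
```

Discarding an initial burn-in and averaging the remaining samples approximates the posterior mean; the same machinery generalizes to the multi-parameter signal models and hierarchical population inference covered in the course.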
Basic probability theory, sampling, Python, and JupyterHub.
Dr. Rory Smith is a lecturer in physics at Monash University in Melbourne, Australia. From 2013-2017, he was a senior postdoctoral fellow at the California Institute of Technology where he worked on searches for gravitational waves. Dr. Smith participated in the landmark first detection of gravitational waves for which the 2017 Nobel Prize in physics was awarded. Dr. Smith’s research focuses on detecting astrophysical gravitational-wave signals from black holes and neutron stars, and extracting the rich astrophysical information encoded within to study the fundamental nature of spacetime.
Social Computing is an emerging discipline, and just like any discipline at a nascent stage it can often mean different things to different people. However, there are three distinct threads that are emerging. The first thread is often called Socio-Technical Systems, which focuses on building systems that allow large-scale interactions of people, whether for a specific purpose or in general. Examples include social networks like Facebook and Google Plus, and Multi-Player Online Games like World of Warcraft and Farmville. The second thread is often called Computational Social Science, whose goal is to use computing as an integral tool to push the research boundaries of various social and behavioral science disciplines, primarily Sociology, Economics, and Psychology. The third is the idea of solving problems of societal relevance using a combination of computing and humans. The goal of this course is to discuss, in a tutorial manner, through case studies, and through discussion, what Social Computing is, where it is headed, and where it is taking us.
• Module 1: Introduction
• Module 2: Science
• Module 3: Applications
This course is intended primarily for graduate students. A list of potential audiences will be provided later.
Jaideep Srivastava (https://www.linkedin.com/in/jaideep-srivastava-50230/) is Professor of Computer Science at the University of Minnesota, where he directs a laboratory focusing on research in Web Mining, Social Analytics, and Health Analytics. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), and has been an IEEE Distinguished Visitor and a Distinguished Fellow of Allina's Center for Healthcare Innovation. He has been awarded the Distinguished Research Contributions Award of the PAKDD for his lifetime contributions to the field of machine learning and data mining. He has supervised 39 PhD dissertations and over 65 MS theses. He has also mentored a number of post-doctoral fellows and junior scientists in industry and academia. He has authored or co-authored over 420 papers in journals and conferences, and filed 8 patents. Seven of his papers have won best paper awards, and he has a Google Scholar citation count of over 25,000 and an h-index of 59 (https://scholar.google.com/citations?user=Y4J5SOwAAAAJ&hl=en&oi=ao).
Dr. Srivastava’s research has been supported by a broad range of government agencies, including NSF, NASA, ARDA, DARPA, IARPA, NIH, CDC, US Army, US Air Force, and MNDoT; and industries, including IBM, United Technologies, Eaton, Honeywell, Cargill, Allina and Huawei. He is a regular participant in the evaluation committees of various US and international funding agencies, on the organizing and steering committees of various international scientific forums, and on the editorial boards of a number of journals.
Dr. Srivastava has significant experience in the industry, in both consulting and executive roles. Most recently he was the Chief Scientist for Qatar Computing Research Institute (QCRI), which is part of Qatar Foundation. Earlier, he was the data mining architect for Amazon.com (www.amazon.com), built a data analytics department at Yodlee (www.yodlee.com), and served as the Chief Technology Officer for Persistent Systems (www.persistentsys.com). He has provided technology and strategy advice to Cargill, United Technologies, IBM, Honeywell, KPMG, 3M, TCS, and Eaton. Dr. Srivastava co-founded Ninja Metrics (www.ninjametrics.com), based on his research in behavioral analytics. He was advisor and Chief Scientist for CogCubed (www.cogcubed.com), an innovative company with the goal to revolutionize the diagnosis and therapy of cognitive disorders through the use of online games, which was subsequently acquired by Teladoc (https://www.teladoc.com/), and for Jornaya (https://www.jornaya.com/). He is presently a technology advisor to a number of startups at various stages, including Kipsu (http://kipsu.com/), which provides an innovative solution to improving service quality in the hospitality industry, and G2lytics (https://g2lytics.com/), an organization that uses machine learning to identify tax compliance problems.
Dr. Srivastava has held distinguished professorships at Heilongjiang University and Wuhan University, China. He has held advisory positions with the State of Minnesota, and the State of Maharashtra, India. He is an advisor to the Unique ID (UID) project of the Government of India, whose goal is to provide biometrics-based social security numbers to the 1.25+ billion citizens of India.
Dr. Srivastava has delivered over 170 invited talks in over 35 countries, including more than a dozen keynote addresses at major international conferences. He has a Bachelor of Technology from the Indian Institute of Technology (IIT), Kanpur, India, and an MS and PhD from the University of California, Berkeley.
Meta-analysis plays an important role in summarizing and synthesizing scientific evidence derived from multiple studies. By combining multiple data sources, higher statistical power is achieved, leading to more accurate effect estimates and greater reproducibility. While the wealth of omics data in public repositories offers great opportunities to understand molecular mechanisms and identify biomarkers of human diseases, differences in design and methodology across studies often translate into poor reproducibility. Hence the importance of meta-analytical approaches for making more robust inferences from this type of data.
In this course, we will learn the most common meta-analysis strategies to integrate high-throughput biological data and to implement such analysis using R capabilities. All practical exercises will be conducted in R. Participants are encouraged to bring datasets to the course and apply the principles to their specific areas of research.
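Although the course exercises are in R (where, for instance, the metafor package's rma() fits such models), the core fixed-effect calculation is compact enough to sketch here in Python: each study's effect estimate is weighted by the inverse of its variance, so more precise studies dominate the pooled estimate. Function names are my own.

```python
import math

def fixed_effect_meta(effects, variances):
    """Inverse-variance weighted (fixed-effect) meta-analysis.
    Returns the pooled effect estimate and its standard error."""
    w = [1.0 / v for v in variances]
    pooled = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    se = math.sqrt(1.0 / sum(w))   # variance of the pooled estimate is 1/sum(w)
    return pooled, se
```

Random-effects models, which the course also covers, add a between-study variance component to each weight to account for heterogeneity across studies.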
Students must be proficient in R and familiar with the analysis of some omics data types. Familiarity with statistical concepts and a basic understanding of regression and ANOVA are also expected. An installed version of R (https://cran.r-project.org/) and RStudio (https://www.rstudio.com/) on a laptop is required for completing the exercises.
Mayte Suarez-Farinas, PhD, is currently an Associate Professor at the Center for Biostatistics and the Department of Genetics and Genomic Sciences of the Icahn School of Medicine at Mount Sinai, New York. She received an MSc in mathematics from the University of Havana, Cuba, in 1995, and a Ph.D. in quantitative analysis from the Pontifical Catholic University of Rio de Janeiro, Brazil, in 2003. Prior to joining Mount Sinai, she was co-director of Biostatistics at the Center for Clinical and Translational Science at the Rockefeller University, where she developed methodologies for data integration across omics studies and a framework to evaluate drug response at the molecular level in proof-of-concept studies in inflammatory skin diseases using mixed-effect models and machine learning. Her long-term goals are to develop robust statistical techniques to mine and integrate complex high-throughput data, with an emphasis on immunological diseases, and to develop precision medicine algorithms to predict treatment response and phenotype.
We shall study algorithms that have been found useful in querying large datasets. The emphasis is on algorithms that cannot be considered "machine learning."
A course in algorithms at the advanced-undergraduate level is important.
We will be covering (parts of) Chapters 3, 4, and 10 of the free text: Mining of Massive Datasets (third edition) by Jure Leskovec, Anand Rajaraman, and Jeff Ullman, available at www.mmds.org
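As an example of the style of algorithm covered (min-hashing for finding similar items appears in Chapter 3 of the text), here is a compact sketch of my own: two documents' shingle sets are summarized by short signatures whose agreement rate estimates their Jaccard similarity. Note it relies on Python's built-in hash(), so signatures are only comparable within a single run.

```python
import random

def shingles(s, k=2):
    # The set of all k-character substrings of a document.
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash_signature(shingle_set, num_hashes=100, seed=0):
    """Min-hash signature: for each random affine hash function,
    keep the minimum hash value over the set's shingles."""
    rnd = random.Random(seed)
    P = 2**61 - 1  # large prime modulus
    funcs = [(rnd.randrange(1, P), rnd.randrange(P)) for _ in range(num_hashes)]
    return [min((a * hash(s) + b) % P for s in shingle_set) for a, b in funcs]

def estimate_jaccard(sig1, sig2):
    # Fraction of hash functions on which the two signatures agree.
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

The estimate is unbiased because each min-hash function matches with probability exactly equal to the Jaccard similarity; locality-sensitive hashing, also in Chapter 3, then bands these signatures to avoid all-pairs comparison.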
A brief on-line bio is available at i.stanford.edu/~ullman/pub/opb.txt
Process mining is the missing link between model-based process analysis and data-oriented analysis techniques. The use of process mining is rapidly increasing and there are over 30 commercial vendors of process mining software. Through concrete data sets and easy-to-use software, the course provides data science knowledge that can be applied directly to analyze and improve processes in a variety of domains.
The course explains the key analysis techniques in process mining. Participants will learn various process discovery algorithms. These can be used to automatically learn process models from raw event data. Various other process analysis techniques that use event data will be presented. Moreover, the course will provide easy-to-use software, real-life data sets, and practical skills to directly apply the theory in a variety of application domains.
Process mining provides not only a bridge between data mining and business process management; it also helps to address the classical divide between "business" and "IT". Evidence-based business process management based on process mining helps to create a common ground for business process improvement and information systems development.
Note that Gartner has identified process-mining software as a new and important class of software. One can witness the rapid uptake by looking at the successful vendors (e.g., Celonis, Disco, ProcessGold, myInvenio, PAFnow, Minit, QPR, Mehrwerk, Puzzledata, LanaLabs, StereoLogic, Everflow, TimelinePI, Signavio, and Logpickr) and the organizations applying process mining at a large scale with thousands of users (e.g., Siemens and BMW). Yet many traditional mainstream-oriented data scientists (machine learners and data miners) are not aware of this, which explains the relevance of the course for BigDat 2020 participants.
The course focuses on process mining as the bridge between data science and process science. The course will introduce the three main types of process mining.
The course uses many examples using real-life event logs to illustrate the concepts and algorithms. After taking this course, one is able to run process mining projects and have a good understanding of the Business Process Intelligence field.
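The starting point of most process discovery algorithms is the directly-follows relation extracted from an event log; a minimal sketch (toy code of my own, not any specific tool's API):

```python
from collections import Counter

def directly_follows(log):
    """Count directly-follows pairs in an event log.
    log: a list of traces, each trace a list of activity names."""
    dfg = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):   # consecutive activity pairs
            dfg[(a, b)] += 1
    return dfg
```

Discovery algorithms such as the Alpha miner and its successors turn these counts into a process model, e.g. a Petri net; the course's software does this at scale on real-life logs.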
W.M.P. van der Aalst. Process Mining: Data Science in Action. Springer-Verlag, Berlin, 2016. (The course will also provide access to slides, several articles, software tools, and data sets.)
This course is aimed at both students (Master or PhD level) and professionals. A basic understanding of logic, sets, and statistics (at the undergraduate level) is assumed. Basic computer skills are required to use the software provided with the course (but no programming experience is needed). Participants are also expected to have an interest in process modeling and data mining but no specific prior knowledge is assumed as these concepts are introduced in the course.
Prof.dr.ir. Wil van der Aalst is a full professor at RWTH Aachen University leading the Process and Data Science (PADS) group. He is also part-time affiliated with the Fraunhofer-Institut für Angewandte Informationstechnik (FIT), where he leads FIT's Process Mining group, and the Technische Universiteit Eindhoven (TU/e). Until December 2017, he was the scientific director of the Data Science Center Eindhoven (DSC/e) and led the Architecture of Information Systems group at TU/e. Since 2003, he has held a part-time position at Queensland University of Technology (QUT). Currently, he is also a distinguished fellow of Fondazione Bruno Kessler (FBK) in Trento and a member of the Board of Governors of Tilburg University. His research interests include process mining, Petri nets, business process management, workflow management, process modeling, and process analysis. Wil van der Aalst has published over 220 journal papers, 20 books (as author or editor), 500 refereed conference/workshop publications, and 75 book chapters. Many of his papers are highly cited (he is one of the most cited computer scientists in the world; according to Google Scholar, he has an H-index of 144 and has been cited over 96,000 times) and his ideas have influenced researchers, software developers, and standardization committees working on process support. Next to serving on the editorial boards of over ten scientific journals, he is also playing an advisory role for several companies, including Fluxicon, Celonis, Processgold, and Bright Cape. Van der Aalst received honorary degrees from the Moscow Higher School of Economics (Prof. h.c.), Tsinghua University, and Hasselt University (Dr. h.c.). He is also an elected member of the Royal Netherlands Academy of Arts and Sciences, the Royal Holland Society of Sciences and Humanities, and the Academy of Europe. In 2018, he was awarded an Alexander von Humboldt Professorship.