Course Description


Keynotes




  • Maria Girone
    (European Organization for Nuclear Research) [-]
    Big Data Challenges at the CERN HL-LHC

    Summary

    High Energy Physics has been a leading big data science for decades. The process of discovery in collider physics involves producing massive numbers of events and searching for new processes that occur at rates as low as one in one trillion. After a 10-year construction program, the European Organization for Nuclear Research (CERN) has been operating the world’s largest and most powerful particle accelerator, the Large Hadron Collider (LHC), since 2010. The combined output in raw data, derived results, and simulated events stored, served, and analyzed from the four experiments is roughly an exabyte. CERN is preparing to upgrade the accelerator to create the High-Luminosity LHC (HL-LHC), which will be operational in 2027. The program is facing unprecedented computing challenges for the next phase, where each experiment will produce exabytes of data yearly. In this talk I will discuss the future computing needs in High Energy Physics to meet our big data challenges and how we are exploring alternative resources to expand the available capacity. I will discuss our work engaging HPC sites and the R&D efforts in using new processing and storage architectures. I will close with plans for the future and a discussion of our collaborative work with other data-intensive science projects.

    Short bio

    Maria has a PhD in particle physics. She also has extensive knowledge in computing for high-energy physics experiments, having worked in scientific computing since 2002.

    Maria has worked for many years on the development and deployment of services and tools for the Worldwide LHC Computing Grid (WLCG), the global grid computing system used to store, distribute, and analyse the data produced by the experiments on the Large Hadron Collider (LHC).

    Maria was the founder of the WLCG operations coordination team, which she also previously led. This team is responsible for overseeing core operations and commissioning new services.

    Throughout 2014 and 2015, Maria was the software and computing coordinator for one of the four main LHC experiments, called CMS. She was responsible for about seventy computing centres on five continents, and managed a distributed team of several hundred people.

    Prior to joining CERN, Maria was a Marie Curie fellow and research associate at Imperial College London. She worked on hardware development and data analysis for another of the LHC experiments, called LHCb — as well as for an experiment called ALEPH, built on the accelerator that preceded the LHC.

    In her role as CTO of CERN openlab, Maria manages its overall technical strategy, directing R&D in computing architectures, HPC and AI in collaboration with the LHC experiments’ upgrade programmes for software and computing, and promoting opportunities for collaboration with industry.



    Lisa Schurer Lambert
    (Oklahoma State University) [-]
    Research Methods as a Lens: How We Know What We Know

    Summary

    The methods we use shape what we know and what we don’t know. The act of selecting a methodological approach for data collected for a research question indicates the type of answer we think we will find in the data. I will review some methodological choices, some simple and some more complex, including examples of measurement design, data splitting, linear and nonlinear relationships, difference scores and ratios, and moderated mediation analyses. These examples illustrate how our choice of method may sometimes hide the meaning we are looking for. If you know only one method, one approach, then you will only look for what you think is there.

    Short Bio

    Dr. Lisa Schurer Lambert is an Associate Professor and William S. Spears Chair of Business at Oklahoma State University. Her scholarship has focused on the employment relationship, leadership, psychological contracts, person-environment fit theory, and research methods. She is an Associate Editor and incoming Co-Editor of Organizational Research Methods and serves on the editorial boards of multiple journals. She has also served as Division Chair for the Research Methods Division of the Academy of Management, and is in the leadership rotation for the Board of the Southern Management Association. She is a frequent instructor for the Consortium for the Advancement of Research Methods and Analysis (CARMA) and an SMA Fellow.



Courses

    Thomas Bäck & Hao Wang
    (Leiden University) [introductory/intermediate]
    Data Driven Modeling and Optimization for Industrial Applications


    Paul Bliese
    (University of South Carolina) [introductory/intermediate]
    Using R for Mixed-effects (Multilevel) Models

    Summary:

    Mixed-effects or multilevel models are commonly used when data have some form of nested structure. For instance, individuals may be nested within workgroups, or repeated measures may be nested within individuals. Nested structures in data are often accompanied by some form of non-independence. For example, in work settings, individuals in the same workgroup typically display some degree of similarity with respect to performance, or they provide similar responses to questions about aspects of the work environment. Likewise, in repeated measures data, individuals usually display a high degree of similarity in responses over time. Non-independence may be considered either a nuisance variable or something to be substantively modeled, but the prevalence of nested data requires that analysts have a variety of tools to deal with it. This course provides an introduction to (1) the theoretical foundation of multilevel models and (2) the resources necessary to conduct a wide range of multilevel analyses. All practical exercises are conducted in R. Participants are encouraged to bring datasets to the course and apply the principles to their specific areas of research.
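    The degree of non-independence in nested data is commonly quantified with the intraclass correlation, ICC(1), computed from a one-way ANOVA (a statistic provided by Bliese's multilevel package in R). As a language-neutral illustration of the idea only (a minimal sketch in plain Python; the course itself works entirely in R):

```python
# ICC(1) from a one-way ANOVA: the share of outcome variance
# attributable to group membership (i.e., non-independence).
def icc1(groups):
    """groups: list of lists, one inner list per workgroup (equal sizes assumed)."""
    k = len(groups[0])                       # observations per group
    n_groups = len(groups)
    all_obs = [x for g in groups for x in g]
    grand = sum(all_obs) / len(all_obs)
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group mean squares
    msb = k * sum((m - grand) ** 2 for m in means) / (n_groups - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (len(all_obs) - n_groups)
    return (msb - msw) / (msb + (k - 1) * msw)

# Three workgroups whose members strongly resemble each other:
data = [[10, 11, 12], [20, 21, 22], [29, 30, 31]]
print(round(icc1(data), 3))  # 0.989: responses within groups are highly non-independent
```

    An ICC(1) near zero would indicate that group membership tells us little about individual responses, and ordinary single-level regression would suffice.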

    Syllabus

    Session 1

      1. Introduction and overview of Multilevel Models
      2. Introduction to R and the nlme and multilevel packages

    Session 2:

      1. Nested Data and Mixed-Effects Models in nlme
      2. R Code for Models and Introduction to Functions Commonly used in Data Manipulation

    Session 3:

      1. Repeated Measures data and Growth Models in nlme
      2. R Code for Models and Introduction to Functions Commonly used in Data Manipulation

    References

    • Bliese, P. D. (2016). Multilevel Modeling in R (v. 2.6). https://cran.r-project.org/doc/contrib/Bliese_Multilevel.pdf

    • Bliese, P. D., Maltarich, M. A., & Hendricks, J. L. (2018). Back to Basics with Mixed-Effects Models: Nine Take-Away Points. Journal of Business and Psychology, 33, 1-23.

    • Bliese, P. D., & Ployhart, R. E. (2002). Growth modeling using random coefficient models: Model building, testing, and illustrations. Organizational Research Methods, 5, 362-387.

    • Bliese, P. D., Schepker, D. J., Essman, S. M., & Ployhart, R. E. (2020). Bridging methodological divides between macro- and microresearch: Endogeneity and methods for panel data. Journal of Management, 46, 70-99.

    • Bodner, T. E., & Bliese, P. D. (2018). Detecting and differentiating the direction of change and intervention effects in randomized trials. Journal of Applied Psychology, 103, 37-53.

    Pre-requisites

    Short bio

    Paul D. Bliese, Ph.D. joined the Management Department at the Darla Moore School of Business, University of South Carolina in 2014. Prior to joining South Carolina, he spent 22 years as a research psychologist at the Walter Reed Army Institute of Research, where he conducted research on stress, adaptation, leadership, well-being, and performance. Professor Bliese has long-term interests in understanding how analytics contribute to theory development and in applying analytics to complex organizational problems. He built and maintains the multilevel package for R. Professor Bliese has served on numerous editorial boards and has been an associate editor at the Journal of Applied Psychology since 2010. In July of 2017 he took over as editor-in-chief of Organizational Research Methods.



    Gianluca Bontempi
    (Université Libre de Bruxelles) [intermediate/advanced]
    Machine Learning against Credit-card Fraud: Lessons Learned from a Real Case

    Summary

    Billions of dollars are lost every year to fraudulent credit card transactions. The design of efficient fraud detection algorithms is essential for reducing these losses, and more and more algorithms rely on advanced machine learning techniques to assist fraud investigators. Designing fraud detection algorithms is, however, particularly challenging due to the non-stationary distribution of the data, highly imbalanced class distributions, and continuous streams of transactions. At the same time, public data are scarcely available for confidentiality reasons, leaving many unanswered questions about the best strategies to deal with these issues. In this course, I will detail several theoretical and practical lessons learned during our long-standing collaboration with the R&D team of Worldline. In particular, we will focus on: supervised and unsupervised detection models; the impact of class imbalance and non-stationarity; the role of active, incremental and transfer learning; best practices for assessment; and the role of big data infrastructure.
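    For example, one practical lesson from the references below concerns class imbalance: a classifier trained on data in which the majority (legitimate) class has been undersampled produces biased posterior probabilities, which can be corrected analytically. A minimal sketch of such a correction, following Dal Pozzolo et al. (2015) as I read it (the function and variable names here are my own, not from the paper):

```python
def calibrate(ps, beta):
    """Correct a posterior estimated on undersampled training data.

    ps   : fraud probability predicted by a model trained after keeping each
           majority-class (legitimate) transaction with probability beta
    beta : undersampling rate of the majority class, 0 < beta <= 1
    Returns the probability mapped back to the original class ratio.
    """
    return beta * ps / (beta * ps - ps + 1)

# A model trained on near-balanced data says 0.9; if only 1% of legitimate
# transactions were kept during training, the calibrated probability is far lower.
print(calibrate(0.9, 0.01))  # roughly 0.08
print(calibrate(0.9, 1.0))   # beta = 1: no undersampling, so no correction
```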

    Syllabus

    An online book will be made available at the beginning of next year. If you are interested in my vision of machine learning, you can read my Handbook on Statistical Foundations of Machine Learning: https://www.researchgate.net/publication/242692234_Handbook_on_Statistical_foundations_of_machine_learning

    References

    • Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

    • Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Aël; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915-4928, 2014.

    • Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3784-3797, 2018.

    • Dal Pozzolo, Andrea. Adaptive Machine Learning for Credit Card Fraud Detection. ULB MLG PhD thesis (supervised by G. Bontempi).

    • Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. SCARFF: a scalable framework for streaming credit card fraud detection with Spark. Information Fusion, 41, 182-194, 2018.

    • Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. International Journal of Data Science and Analytics, 5(4), 285-300, 2018.

    • Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi. Deep-Learning Domain Adaptation Techniques for Credit Card Fraud Detection. INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp. 78-88, 2019.

    • Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi. Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection. Information Sciences, 2019.

    • Credit Card Fraud Detection Kaggle dataset. https://www.kaggle.com/mlg-ulb/creditcardfraud

    Pre-requisites

    Basic notions of machine learning.

    Short bio

    Gianluca Bontempi is Full Professor in the Computer Science Department at the Université Libre de Bruxelles (ULB), Brussels, Belgium, and co-head of the ULB Machine Learning Group (mlg.ulb.ac.be). He was Director of (IB)2, the ULB/VUB Interuniversity Institute of Bioinformatics in Brussels (ibsquare.be), from 2013 to 2017. His main research interests are big data mining, machine learning, bioinformatics, causal inference, predictive modeling and their application to complex tasks in engineering (time series forecasting, fraud detection) and life science (network inference, gene signature extraction). He was a Marie Curie fellow, received awards in two international data analysis competitions, and took part in many research projects in collaboration with universities and private companies all over Europe. He is the author of more than 200 scientific publications and his Google Scholar h-index is 56. He is the Belgian (French Community) national contact point of the CLAIRE network, co-leader of the CLAIRE COVID19 Task Force and an IEEE Senior Member. He is also co-author of several open-source software packages for bioinformatics, data mining and prediction. If you want to know more about his vision of machine learning (or more generally about science) and society, check his blog https://datascience741.wordpress.com



    Altan Cakir
    (Istanbul Technical University) [intermediate]
    Big Data Analytics with Apache Spark

    Summary

    Apache Spark, an open-source cluster-computing framework providing a fast and general engine for large-scale data processing, has been one of the most exciting technologies in big data development in recent years. The main idea behind this technology is to provide a memory abstraction that allows us to efficiently share data across the different stages of a map-reduce job, i.e., in-memory data sharing. Our lecture starts with a brief introduction to Spark and its Hadoop-related ecosystem, and then shows some common techniques (classification, collaborative filtering, and anomaly detection, among others) applied to scientific applications, social media analysis, web analytics, and finance. If you have an entry-level understanding of machine learning and statistics, and program in Python or Scala, you will find these subjects useful for working on your own big data challenges.
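    To make the programming model concrete before touching a cluster, the chainable map/filter/reduceByKey style of Spark's RDD API can be mimicked in a few lines of plain Python. This is a toy, single-machine sketch only; real Spark partitions the data across executors, evaluates transformations lazily, and can cache intermediate results in cluster memory:

```python
class ToyRDD:
    """In-memory stand-in for Spark's RDD: chainable map/flatMap/filter/reduceByKey."""
    def __init__(self, data):
        self.data = list(data)              # the "cached" dataset shared across stages
    def map(self, f):
        return ToyRDD(f(x) for x in self.data)
    def flatMap(self, f):
        return ToyRDD(y for x in self.data for y in f(x))
    def filter(self, pred):
        return ToyRDD(x for x in self.data if pred(x))
    def reduceByKey(self, f):
        acc = {}
        for k, v in self.data:              # combine values sharing a key
            acc[k] = f(acc[k], v) if k in acc else v
        return ToyRDD(acc.items())
    def collect(self):
        return list(self.data)

# Classic word count, written Spark-style:
lines = ToyRDD(["big data", "big spark", "data"])
counts = (lines.flatMap(str.split)
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
print(sorted(counts))  # [('big', 2), ('data', 2), ('spark', 1)]
```

    The same chain, written against a real SparkContext, distributes each stage over the cluster instead of a local list.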

    Syllabus

    • Introduction to Data Analysis with Apache Spark
    • Spark Programming Model with RDD objects and DataFrames
    • Running Spark Applications on Hadoop / Cloud-based Cluster Systems
    • Spark SQL
    • Spark Streaming
    • Machine Learning with Spark MLlib/ML
    • Advanced Analytics Applications with Spark
    • Analysis of real-world applications

    References

    • https://spark.apache.org, Unified Analytics Engine for Big Data
    • Advanced Analytics with Spark: Patterns for Learning from Data at Scale, S. Ryza, U. Laserson, S. Owen, J. Wills
    • Mastering Machine Learning with Apache Spark 2.x, S. Amirghodsi, M. Rajendran, B. Hall, S. Mei

    Pre-requisites

    Python, Machine Learning, Distributed Computing

    Short Bio

    Altan Cakir received his M.Sc. degree in physics from Izmir Institute of Technology in 2006 and then went straight to graduate school at the Karlsruhe Institute of Technology, Germany. During his Ph.D., he was responsible for scientific research on new physics searches with the CMS detector at the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN). Thereafter he was a post-doctoral research fellow at Deutsches Elektronen-Synchrotron (DESY), a national research center in Hamburg, Germany, where he spent 5 years before taking up his present full professor position at Istanbul Technical University (ITU), Istanbul, Turkey. Currently, Altan Cakir is the leader of the ITU-CMS group at CERN and leads a data analysis group on the CMS detector. He was also a visiting faculty member at Fermi National Accelerator Laboratory (Fermilab), Illinois, USA in 2017. His group’s expertise is focused on machine learning techniques in large-scale data analysis. However, their research is very much interdisciplinary, with expertise in the group ranging from science and big data synthesis to economics, industrial applications and operations research. Today, he consults for various companies worldwide, sharing his expertise in big data application areas, strategies, skills and competencies based on real-world scenarios.

    Altan Cakir has been involved in a large number of high-profile research projects at CERN, DESY and Fermilab over the last fifteen years. He enjoys being able to integrate his research with teaching key concepts of science and big data technologies. It is rewarding to be part of the development of the next generation of scientists and engineers, and to help his students move on to careers all over the world, in academia, industry and government.

    The following lectures on big data are periodically given by Assoc. Prof. Dr. Altan Cakir in the Big Data and Business Analytics Program (http://bigdatamaster.itu.edu.tr) at Istanbul Technical University: Big Data Technologies and Applications, and Machine Learning with Big Data. Altan Cakir is also an executive member of the ITU AI Center (ai.itu.edu.tr) and was one of the lecturers of the Cambridge Big Data Program in BigDat2019, University of Cambridge, United Kingdom. Further information can be found at https://irdta.eu/BigDat2019/



    Michael X. Cohen
    (Radboud University Nijmegen) [introductory]
    Dimension Explosion and Dimension Reduction in Brain Electrical Activity

    Summary

    The brain is a ludicrously complex system. It is naive to think that we can understand it, and yet neuroscience research is a booming field. As computing power and technology for recording brain activity have increased, so has the need for data analysis methods that allow us to characterize the multidimensional complexity of brain recordings. This seminar will begin with a theoretical background on neuroscience and a brief overview of the history of data analysis methods in neuroscience. But most of the time will be spent on hands-on exploration of simulated and real data. The mood will range from complete hopelessness to optimism to thought-provoking questions about consciousness.

    Syllabus

    • Broad introduction to scales and measurements in neuroscience.
    • Why neuroscience fears yet needs high-dimensional data.
    • Problems with traditional neuroscience analyses and their modern solutions.
    • Problem statements, mathematical intuitions, and analysis solutions.
    • Analysis validation in simulated data (hands-on).
    • Proof-of-principle in real data (hands-on).
    • Discussion: tepid enthusiasm for deep learning in neuroscience data.

    References

    Pre-requisites

    No strict requirements. Some coding skills in MATLAB or Python (both will be available) will help with the demos. Some linear algebra knowledge (e.g., eigendecomposition) will help with understanding the math, but is not necessary.
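    For those who want to preview the linear algebra, much of dimension reduction for multichannel recordings comes down to an eigendecomposition of a channel-by-channel covariance matrix, as in this generic PCA sketch (my own illustration, not course material; it assumes NumPy and simulates a crude stand-in for correlated electrode recordings in which three channels share one source):

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 time points from 3 "channels" driven by one shared source plus sensor noise
source = rng.standard_normal(1000)
data = np.column_stack([source + 0.1 * rng.standard_normal(1000) for _ in range(3)])

X = data - data.mean(axis=0)           # center each channel
cov = (X.T @ X) / (len(X) - 1)         # channel-by-channel covariance matrix
evals, evecs = np.linalg.eigh(cov)     # eigendecomposition (eigenvalues ascending)
explained = evals[::-1] / evals.sum()  # variance explained, largest component first

print(explained.round(3))  # one component carries nearly all the variance
```

    The eigenvectors (columns of evecs) are the spatial filters; projecting the data onto the top one collapses the three correlated channels to a single component.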

    Short bio

    Mike X Cohen is an associate professor at the Radboud University (the Netherlands) and heads the Synchronization in Neural Systems lab, which has received funding from the ERC, Dutch government, Radboud Medical Centre, and several private organizations. His lab uses a combination of electrical recordings and optical/electrical stimulation techniques to characterize multidimensional dynamics in the brain. A second major focus is developing and validating new multivariate data analysis methods using a combination of empirical data and simulated data. Dr. Cohen has also authored several textbooks and creates online courses for applied math, statistics, and scientific programming.



    Ramez Elmasri
    (University of Texas, Arlington) [intermediate]
    Spatial, Temporal, and Spatio-Temporal Data


    Ian Fisk
    (Flatiron Institute) [introductory]
    The Infrastructure to Support Data Science

    Summary

    As the complexity of instruments has increased and the availability and reliability of simulation have improved, the number of science domains that can be classified as data-intensive is increasing. Scientific progress in these fields often depends on accessing large quantities of diverse and complex data. In this class we will learn about the infrastructure and services required to support data-intensive science. We will look at the infrastructure to store, discover, and serve data samples. We will look at data management and data serving tools. We will discuss the IO required for various access patterns, types of resource scheduling, and the need for specialized hardware. Many of the examples will be taken from the experience of building up the computing infrastructure at the Flatiron Institute to support the local research community in astrophysics, biology, quantum systems, and mathematics, but the conclusions should be broadly applicable to other organizations and science communities.

    Syllabus

    • Storing, finding, and accessing data
    • User management and access
    • Resource provisioning, IO, and specialized processing needs
    • Data sharing, data management, and archiving

    Pre-requisites

    Basic understanding of unix environments and an interest in data analysis

    Bio

    Ian Fisk is a Co-Director of the Scientific Computing Core (SCC) at the Flatiron Institute, a division of the Simons Foundation. As an SCC director, Fisk is responsible for the design, construction and operations of the computing facilities and services at the Flatiron Institute in New York. The Flatiron Institute supports 200 scientists spread across astrophysics, biology, quantum systems, mathematics and neuroscience. Fisk came to the Simons Foundation in 2014 after working in the computing division at the Fermi National Accelerator Laboratory in Illinois since 2003. Fisk was the Compact Muon Solenoid (CMS) experiment’s computing coordinator at CERN from 2010 to 2013, overseeing operations and improvements for the global CMS computing infrastructure. Before that, Fisk served in several roles to prepare the experiment for the first run. He oversaw service integration, facility development and operations, and project management. Fisk holds an M.S. and a Ph.D. in physics from the University of California, San Diego.



    Michael Freeman
    (University of Washington) [intermediate]
    Interactive Data Visualization Using D3 + Observable

    Summary

    In this short course, participants will learn how to load, wrangle, and visualize data using the D3 JavaScript library on the Observable platform. D3 continues to be the leading visualization package for the web, providing a powerful API for crafting custom visualizations. Observable, a (free) JavaScript Notebook environment, offers a workspace for rapid data exploration and discovery. This hands-on workshop will introduce the tools and techniques necessary to quickly explore datasets in an interactive and sharable format.

    Syllabus

    • Part I: Loading and Wrangling Data in Observable Notebooks
    • Part II: Visualization Foundations with D3.js
    • Part III: Advanced Charting Techniques and Animations

    References

    Pre-requisites

    A basic familiarity with web programming (HTML, CSS, and JavaScript) is expected for the course. Participants should be familiar with using different JavaScript object types (e.g., objects, arrays) to store and retrieve information.

    Short Bio

    Michael Freeman is an Associate Teaching Professor at the University of Washington Information School, where he teaches courses in data science, interactive data visualization, and web development. Prior to his teaching career, he worked as a data visualization specialist and research fellow at the Institute for Health Metrics and Evaluation.

    Michael is interested in applications of data science to social justice, and holds a Master's in Public Health from the University of Washington.



    David Gerbing
    (Portland State University) [introductory]
    Derive Meaning from Data with R Visualizations

    Summary

    This seminar introduces the R language via data visualization (computer graphics) in the context of a discussion of best practices and considerations for the analysis of big data. Based on the presenter's recent CRC Press book, R Visualizations: Deriving Meaning from Data, code to generate the graphs is presented in terms of the presenter's lessR package and Hadley Wickham's ggplot2 package. Visualizations with lessR are designed to be as simple to obtain as logically possible. The content of the seminar is summarized with R Markdown files, available to all participants, that include commentary and implementations of all the code presented in the seminar.

    Syllabus

    Session 1

    • Introduction to R
    • R functions and syntax
    • R variable types
    • Read data into R
    • Specialized Graphic Functions
    • Functions from the lessR package
    • The ggplot function from the ggplot2 package
    • Themes

    Session 2

    • Bar Charts for Distributions of Categorical Variables

      • R factor variables and lessR doFactors function
      • Counts of one variable
      • Joint frequencies of two variables
      • Statistics for a second variable plotted against one variable
    • Graphs for Distributions of a Continuous Variable

      • Histograms and binning
      • Densities
      • Boxplot
      • Scatterplot, 1-dimensional
      • Introduction to the integrated Violin/Box/Scatterplot, the VBS plot
    • Scatterplots, 2-dimensional

      • With two or more continuous variables
      • A categorical variable with a continuous variable
      • Bubble plots with categorical variables
      • Two variable plot with a third variable, categorical or continuous

    Session 3

    • Scatterplots, 2-dimensional (continued)
      • Visualization of relationships for big data sets
    • Time Series Plots
      • One-variable plot
      • Stacked time-series plot
      • Area plots
      • Forecasts
    • Interactive Visualizations with Shiny

    References

    • Gerbing, D. W. (2020). R Data Analysis without Programming, CRC Press.
    • Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

    Pre-requisites

    Basic understanding of data analysis

    Short Bio

    David Gerbing, Ph.D., has been Professor of Data Analytics at The School of Business, Portland State University, since 1987. He received his Ph.D. in quantitative psychology from Michigan State University in 1979. From 1979 until 1987 he was an Assistant and then Associate Professor of Psychology and Statistics at Baylor University. He has authored R Data Analysis without Programming and R Visualizations: Deriving Meaning from Data, as well as many articles on statistical techniques and their application in a variety of journals that span several academic disciplines.



    Christopher W.V. Hogue
    (Ericsson Inc.) [introductory]
    Applied Information Theory for Scalable Database Schema and Query Templates

    Summary

    With an understanding of how much sparse and repetitive information exists in your data, it is possible to use a hybrid table structure, the Power-Schema, to compress big data inside an SQL database without giving up query capability.

    Syllabus

    • SQL tables, rows and columns, database normalization, Star Schema, KV Stores, Hash Tables
    • What is in a database column? Distinct data, sparse data, repetitive data, unexpected object fields
    • Shannon Entropy and data compression as an estimate of entropy
    • The 3 component Power-Schema
    • The KV-store for sparse/unexpected data columns
    • The Hashed table for repetitive data
    • How the Power-Schema reduces database size
    • A practical heuristic for column partitioning
    • What to expect when querying each partition
    • Populating tables from JSON
    • Query the Power Schema table set with PostgreSQL JSONB query extensions
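    The column-partitioning heuristic listed above can be previewed with a per-column Shannon-entropy estimate. The sketch below is hypothetical: the routing thresholds (50% NULLs for the KV-store, 2 bits of entropy for the hashed table) are invented for illustration and are not taken from the course or the linked repository:

```python
import math
from collections import Counter

def column_entropy(values):
    """Shannon entropy (bits) of a column's observed value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def suggest_partition(column):
    """Toy routing of one column into the three Power-Schema components."""
    null_frac = sum(v is None for v in column) / len(column)
    if null_frac > 0.5:
        return "kv-store"            # sparse / unexpected fields
    present = [v for v in column if v is not None]
    if column_entropy(present) < 2.0:
        return "hashed"              # repetitive: store each distinct value once
    return "main"                    # distinct data stays in the base table

print(suggest_partition(["ok"] * 90 + ["fail"] * 10))   # hashed
print(suggest_partition(list(range(100))))              # main
print(suggest_partition([None] * 80 + ["x"] * 20))      # kv-store
```

    The entropy estimate is also a lower bound on how well the column will compress, which is why low-entropy columns are the ones worth hashing.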

    References

    Link: github.com/erixzone/power-schema

    Pre-requisites

    Familiarity with SQL and command line tools.

    Short Bio

    Christopher W. V. Hogue (Ph.D. Biochemistry, Ottawa, 1994) is a Data Scientist/Engineer in the Ericsson Global AI Accelerator ML/AI Systems & Services (GAIA MASS), working from San Jose, California, on automated AI/ML learning systems for cell tower radio switching. Prior to joining Ericsson, Dr. Hogue ran an academic Bioinformatics lab, first at the Lunenfeld-Tanenbaum Research Institute (LTRI) in Toronto, Canada, and then at the National University of Singapore. He has worked on structured data, data representations, and databases for over 30 years. His lab drove the establishment of HUPO-PSI data formats, curation standards and practices to capture molecular interaction data from experiments. Later work focused on large-scale statistical approaches for docking intrinsically disordered protein structures to understand their structure and function. For publications, see: https://publons.com/researcher/3909098/christopher-w-v-hogue/



    Ravi Kumar
    (Google) [intermediate/advanced]
    Clustering for Big Data

    Summary

    tbd.

    Syllabus

    tbd.

    References

    tbd.

    Pre-requisites

    Familiarity with basics of probability, algorithms, and machine learning.

    Short Bio

    Ravi Kumar has been a research scientist at Google since 2012. Prior to this, he was at the IBM Almaden Research Center and at Yahoo! Research. His interests include Web search and data mining, algorithms for massive data, and the theory of computation.



    Victor O.K. Li
    (University of Hong Kong) [intermediate]
    Deep Learning and Applications

    Summary

    Overview of machine learning; deep feedforward networks; regularization for deep learning; convolutional networks; recurrent and recursive nets; AI ethics; applications.

    Syllabus

    • Machine Learning Basics
    • Deep Feedforward Networks
    • Regularization for Deep Learning
    • Convolutional Networks
    • Recurrent and Recursive Nets
    • Bayesian Deep Learning
    • Explainable AI and Causal Inference
    • AI Ethics

    References

      1. Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016
      2. The Book of Why: The Science of Cause and Effect by Judea Pearl and Dana Mackenzie, Basic Books, New York, 2018
      3. Han, Y., Lam, J.C.K., Li, V.O.K. and Reiner, D., “A Bayesian LSTM Model to evaluate the effects of air pollution control regulations in Beijing, China,” accepted for publication, Elsevier Environmental Science and Policy.
      4. Han, Y., Lam, J.C.K., Li, V.O.K. and Zhang, Q., “A domain-specific Bayesian Deep-Learning approach for air pollution forecast,” accepted for publication, IEEE Trans. on Big Data.
      5. Li, V.O.K., Han, Y., Lam, J.C.K., Zhu, J., and Bacon-Shone, J., “Air pollution and environmental injustice: Are the socially deprived exposed to more PM2.5 pollution in Hong Kong?” Elsevier Environmental Science & Policy, 80 (2018), pp. 53-61.

    Pre-requisites

    Probability & Statistics, Linear Algebra

    Short Bio

    Victor O.K. Li (S’80 – M’81 – F’92) received SB, SM, EE and ScD degrees in Electrical Engineering and Computer Science from MIT. Prof. Li is Chair of Information Engineering and Cheng Yu-Tung Professor in Sustainable Development at the Department of Electrical & Electronic Engineering (EEE) at the University of Hong Kong. He is the Director of the HKU-Cambridge Clean Energy and Environment Research Platform, and of the HKU-Cambridge AI to Advance Well-being and Society Research Platform, which are interdisciplinary collaborations with Cambridge University. He was the Head of EEE, Assoc. Dean (Research) of Engineering, and Managing Director of Versitech Ltd. He serves on the board of Sunevision Holdings Ltd., listed on the Hong Kong Stock Exchange, and co-founded Fano Labs Ltd., an artificial intelligence (AI) company, with his PhD student. Previously, he was Professor of Electrical Engineering at the University of Southern California (USC), Los Angeles, California, USA, and Director of the USC Communication Sciences Institute. He served as Visiting Professor at the Department of Computer Science and Technology at the University of Cambridge from April to August 2019. His research interests include big data, AI, optimization techniques, and interdisciplinary clean energy and environment studies. In Jan 2018, he was awarded a USD 6.3M RGC Theme-based Research Project to develop deep learning techniques for personalized and smart air pollution monitoring and health management. Sought by government, industry, and academic organizations, he has lectured and consulted extensively internationally. He has received numerous awards, including the PRC Ministry of Education Changjiang Chair Professorship at Tsinghua University, the UK Royal Academy of Engineering Senior Visiting Fellowship in Communications, the Croucher Foundation Senior Research Fellowship, and the Order of the Bronze Bauhinia Star, Government of the HKSAR.
He is a Fellow of the Hong Kong Academy of Engineering Sciences, the IEEE, the IAE, and the HKIE.



    B.S. Manjunath
    (University of California, Santa Barbara) [introductory]
    Digital Media Forensics


    Wladek Minor
    (University of Virginia) [introductory/advanced]
    Big Data in Biomedical Sciences

    Syllabus

    • Big Data and Big Data in Biomedical Sciences
    • Why Big Data is perceived as a big problem - technological considerations
    • Data reduction - should we preserve unreduced (raw) data?
    • Databases and databanks
    • Data mining with the use of raw data
    • Data mining in databanks and databases
    • Data Integration
    • Automatic and semi-automatic curation of large amounts of data
    • Conversion of databanks into databases and Advanced Information Systems
    • Experimental results and knowledge
    • Database priorities – content and design
    • Interaction between databases
    • Modern data management in biomedical sciences – necessity or luxury?
    • Automatic data harvesting – close reality or still on the horizon?
    • Reproducibility of the biomedical experiments - drug discovery considerations
    • Artificial Intelligence and machine learning in drug discovery
    • Big data in medicine - new possibilities
    • Personalized medicine
    • COVID-19
    • Future considerations

    Pre-requisites

    None

    Short bio

    Co-author of 220+ publications. Co-author of 440+ PDB macromolecular structures. Education: M.Sc., Physics, University of Warsaw, 1969; Ph.D., Physics, University of Warsaw, 1978.

    Research Interests: Development of methods for structural biology, in particular macromolecular structure determination by protein crystallography; data management in structural biology; data mining as applied to drug discovery; and bioinformatics. Structure–function relationships. Member of the Center for Structural Genomics of Infectious Diseases. Former member of the Midwest Center for Structural Genomics, the New York Center for Structural Genomics, and the Enzyme Function Initiative.

    Employment: University of Warsaw 1969-1985, Purdue University 1985-1995, University of Virginia 1995- present.

    Research Impact: Citations: 46,000+, Relative Citation Ratio (RCR): 750+.

    Trainees: 70+ students, 25+ post-docs and Research Faculty. There are seven PIs among lab alumni.

    Major Awards/Elected Functions: Edlich-Henderson Innovator of the Year Award, University of Warsaw/AFUW Award, American Crystallographic Association Fellow, American Association for the Advancement of Science (AAAS) Fellow, and Chair of the IUCr Commission of Macromolecular Biology.



    José M.F. Moura
    (Carnegie Mellon University) [introductory]
    Graph Signal Processing

    Summary

    Data is now multimodal and unstructured, arising in many applications and formats. The course shows how to extend traditional Digital Signal Processing methods to data supported by graphs (Graph Signal Processing, an expression coined in the last few years) and how to modify the structure of deep learning models to reflect the underlying data geometry (Geometric Learning). Graph Signal Processing (GSP) structures the data by representing its underlying geometry with a graph where the nodes index the data and its edges capture data dependencies. We present the basics of GSP with a brief historical overview, illustrate some of its concepts, recent developments, and extensions. We end by showing how GSP helps with designing from basic principles convolutional neural networks (CNN) for graph-based data – Graph Neural Networks.

    Syllabus

    Topics to be covered include: brief review of Linear Algebra concepts; Graph Signal Processing – basics on graphs, including graph representations through adjacency and graph Laplacian matrices, graph models, and graph spectral analysis; graph shift, graph shift invariance, graph signals, graph filtering, graph Fourier transform, graph convolution and modulation, graph frequency and graph spectral analysis of graph signals, among others; and Geometric Learning – extending deep learning models to learning with data supported by graphs.
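    As a minimal sketch of the core objects listed above (the graph and the signal values are invented for illustration; only NumPy is assumed), the graph Fourier transform of a signal can be computed from the eigendecomposition of the graph Laplacian:

```python
import numpy as np

# Adjacency matrix of a small undirected path graph: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Graph Laplacian L = D - A, with D the diagonal degree matrix
D = np.diag(A.sum(axis=1))
L = D - A

# Eigendecomposition: the eigenvectors form the graph Fourier basis,
# the eigenvalues play the role of graph frequencies
eigvals, eigvecs = np.linalg.eigh(L)

# A graph signal: one scalar value per node (made up for the example)
x = np.array([1.0, 2.0, 3.0, 4.0])

# Graph Fourier transform: project the signal onto the eigenbasis
x_hat = eigvecs.T @ x

# Inverse transform recovers the original signal exactly
x_rec = eigvecs @ x_hat
assert np.allclose(x, x_rec)
```

    Because the Laplacian of an undirected graph is symmetric, its eigenvectors are orthonormal, so the transform is invertible by a simple transpose, exactly as in the classical DFT.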

    References

    • Aliaksei Sandryhaila and José M. F. Moura. Discrete signal processing on graphs. IEEE Transactions on Signal Processing, 61(7):1644–1656, 2013.
    • D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, May 2013.
    • Aliaksei Sandryhaila and José M. F. Moura. Signal processing on graphs: Frequency analysis. IEEE Transactions on Signal Processing, 62(12):3042–3054, June 2014.
    • Aliaksei Sandryhaila and José M. F. Moura. Big data analysis with discrete signal processing on graphs. IEEE Signal Processing Magazine, Special Issue on Big Data, 31(5):80–90, September 2014.
    • A. Ortega, P. Frossard, J. Kovacevic, J. M. F. Moura, and P. Vandergheynst. Graph signal processing: Overview, challenges, and applications. Proceedings of the IEEE, 106(5):808–828, May 2018.
    • J. Du, J. Shi, S. Kar, and J. M. F. Moura. On graph convolution for graph CNNs. In 2018 IEEE Data Science Workshop (DSW), pages 1–5, June 2018.

    Pre-requisites

    None, though prior exposure to Linear Algebra and/or Linear Signal Processing is advisable.

    Short Bio

    José M. F. Moura is the Philip L. and Marsha Dowd University Professor at CMU, with interests in signal processing and data science. A detector in two of his patents with Alek Kavcic is found in over 60% of the disk drives of all computers sold worldwide in the last 15 years (4 billion and counting), leading to a US $750 million settlement between CMU and Marvell. He was the 2019 IEEE President and CEO, and was Editor-in-Chief of the IEEE Transactions on Signal Processing. He is a Fellow of the IEEE, the AAAS, and the US National Academy of Inventors, holds an honorary doctorate from the University of Strathclyde, is a corresponding member of the Academy of Sciences of Portugal, and is a member of the US National Academy of Engineering. He received the Great Cross of the Order of the Infante D. Henrique, bestowed on him by the President of the Republic of Portugal.



    Panos Pardalos
    (University of Florida) [intermediate/advanced]
    Optimization and Data Sciences Techniques for Large Networks


    Valeriu Predoi
    (University of Reading) [introductory]
    A Beginner's Guide to Big Data Analysis: How to Connect Scientific Software Development with Real World Problems

    Summary

    Whether it be astronomy, biology, climate science or buying groceries, we are surrounded by a lot of data, raw numerical data that gets distilled into huge amounts of numbers and parameters that we eventually use to discover more about our world. This data crunching process is done via many, many computers all over the world that we control via code we write and then let it run, eagerly waiting for results - the proverbial "grab a coffee while the code is running". There is, however, a great deal of variability in terms of how we approach the data problem: first, depending on the way it is collected - whether it is directly observed or produced by a theoretical model, and anything else in between, and, second, depending on how we forge our tools to analyze it. I will try to work out with you a few model cases that replicate such varied situations; I will also show you that complex problems in data analysis can be solved by using simple concepts and conversely, apparently simple problems will need complex solutions, both in terms of the methods and the computational tools needed. I will use three real-world problems drawn from astronomy, biology and climate modelling and I will show you how to solve them, in the process highlighting modern software development techniques. This is an introductory course that you may be overqualified for, but I'll try to challenge you no matter what.

    Syllabus

    Ideally three separate days for three different use cases.

    References

    None needed, I will point the attendees to a few works after the course.

    Pre-requisites

    Basic calculus, basic physics and basic statistics; also, beginner Python knowledge.

    Short Bio

    I am a research software engineer with Computational Modelling Services in the Department of Meteorology at the University of Reading, UK, and a core developer of the United Kingdom Earth System Model (UKESM), the UK integrated climate model. I am a former member of the LIGO (Laser Interferometer Gravitational-Wave Observatory) Scientific Collaboration and have been a gravitational waves data scientist since 2008. I obtained my PhD in gravitational waves data analysis in 2012 and have held various research posts since, including a postdoc stint as a computational virologist. All my work so far has been mainly as a software engineer and data scientist, so I don't consider myself a scientist but rather an engineer who solves problems and provides real scientists with tools to get their papers published.



    Karsten Reuter
    (Max Planck Society) [introductory/intermediate]
    Machine Learning for Materials and Energy Applications

    Summary

    Reflecting the general data revolution, knowledge-based methods are now also entering theoretical catalysis and energy-related research with full might. Automated workflows and the training of machine learning approaches with first-principles data generate predictive-quality insight into elementary processes and process energetics at undreamed-of pace. Computational screening and data mining allow these databases to be explored for promising materials and correlations such as structure–property relationships to be extracted. At present, these efforts are still largely based on highly reductionist models that break down the complex interdependencies of working catalysts and energy conversion systems into a tractable number of so-called descriptors, i.e. microscopic parameters that are believed to govern the macroscopic function. Current efforts concentrate on using artificial intelligence also in the actual generation and reinforced improvement of the reductionist models, to ultimately fulfill the dream of an inverse (de novo) design from function to structure.
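    As a toy illustration of descriptor-based screening (all numbers are synthetic and the three descriptors are hypothetical; this is a generic ridge-regression sketch, not the specific workflows covered in the course):

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 hypothetical materials, each described by 3 numerical descriptors
X = rng.normal(size=(50, 3))

# Assume the target property (e.g. an adsorption energy) is a noisy
# linear function of the descriptors -- purely for illustration
w_true = np.array([0.8, -1.2, 0.3])
y = X @ w_true + 0.05 * rng.normal(size=50)

# Closed-form ridge regression: w = (X^T X + lambda I)^-1 X^T y
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Screen unseen candidates by their predicted property and pick the
# one with the lowest predicted value
X_new = rng.normal(size=(5, 3))
y_pred = X_new @ w
best = int(np.argmin(y_pred))
```

    In practice the descriptor-to-property map is nonlinear and learned with kernel methods or neural networks, but the screening loop (featurize, fit, predict, rank) has this same shape.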

    Syllabus

    • Computational screening, funneling and active learning
    • Data representation
    • Machine-learned interatomic potentials
    • Descriptor generation
    • Machine learning in chemical and reaction space

    References

    • Bruix, A., Margraf, J.T., Andersen, M. & Reuter, K. First-principles based multiscale modeling of heterogeneous catalysis. Nature Catal. 2, 659 (2019).
    • Rupp, M., von Lilienfeld, O. A. & Burke, K. Guest editorial: special topic on data-enabled theoretical chemistry. J. Chem. Phys. 148, 241401 (2018).
    • Behler, J. Perspective: machine learning potentials for atomistic simulations. J. Chem. Phys. 145, 170901 (2016).

    Pre-requisites

    • Introductory-level Quantum Mechanics and Physical Chemistry

    Short bio

    Karsten is director of the Theory Department of the Fritz Haber Institute (FHI) of the Max Planck Society in Berlin, Germany. He did his doctoral studies on theoretical surface physics in Erlangen, Madrid and Milwaukee. Following research experiences at the FHI and the FOM Institute in Amsterdam, he held the Chair for Theoretical Chemistry at the Technical University of Munich (TUM) from 2009 to 2020. Karsten was recently a visiting professor at Stanford (2014), the Massachusetts Institute of Technology (2018) and Imperial College London (2019). His research activities focus on the quantitative modeling of materials properties and functions. He specifically works on multiscale models that combine predictive-quality first-principles techniques with coarse-grained methodologies and machine learning to achieve microscopic insight into the processes in working catalysts and energy conversion devices.



    Ramesh Sharda
    (Oklahoma State University) [introductory/intermediate]
    Network-based Health Analytics

    Summary

    This course will introduce network-level properties and illustrate how such network measures can be used to help inform medical decision-making. It will show how these measures can be computed and then used in machine learning models to improve predictive performance.
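    A minimal sketch of one such network measure (the comorbidity network below is entirely invented): degree centrality of each diagnosis in a graph whose edges mark conditions that co-occur in patients.

```python
from collections import defaultdict

# Toy comorbidity network: nodes are diagnoses, an edge means the two
# conditions co-occurred in the same patients (all data invented)
edges = [("diabetes", "hypertension"),
         ("diabetes", "obesity"),
         ("hypertension", "obesity"),
         ("hypertension", "kidney disease"),
         ("hypertension", "heart failure"),
         ("depression", "obesity")]

# Build an undirected adjacency list
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# Degree centrality: degree divided by (n - 1)
n = len(adj)
centrality = {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

# The most central condition co-occurs with the most other conditions;
# scores like this can be fed as features into a predictive model
print(max(centrality, key=centrality.get))  # hypertension (0.8)
```

    In the papers listed below, measures of this kind are computed from electronic medical records and then used as covariates when predicting outcomes such as length of stay.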

    Syllabus

    • Introduction to Network Measures
    • Applications of network metrics in health analytics – comorbidities
    • Descriptive analytics in health demographics based upon comorbidities
    • Network measures in health analytics modeling – incorporating comorbidities to predict hospital lengths of stay
    • Clique modeling to identify disease combinations that impact mortality
    • Conclusions and future work

    References

    • Pankush Kalgotra, Ramesh Sharda, and Julie M. Croff. (2020). "Examining multimorbidity differences across racial groups: a network analysis of electronic medical records". Scientific Reports. (10), 13538.
    • Pankush Kalgotra, Ramesh Sharda, and Julie M Croff. (2017). "Examining Health Disparities by Gender: A Multimorbidity Network Analysis of Electronic Medical Record". International Journal of Medical Informatics. (108), 22–28.
    • J. Loscalzo, A.-L. Barabási, E.K. Silverman (Eds.), Network medicine: complex systems in human disease and therapeutics, Harvard University Press (2017)

    Pre-requisites

    None

    Short Bio

    Ramesh Sharda is the Vice Dean for Research and of the Watson Graduate School of Management, Watson/ConocoPhillips Chair and a Regents Professor of Management Science and Information Systems in the Spears School of Business at Oklahoma State University. He has coauthored two textbooks (Analytics, Data Science, and Artificial Intelligence: Systems for Decision Support, 11th edition, Pearson, and Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th edition, Pearson). His research has been published in major journals in management science and information systems including Management Science, Operations Research, Information Systems Research, Decision Support Systems, Interfaces, INFORMS Journal on Computing, and many others. He is a member of the editorial boards of journals such as Decision Support Systems, Decision Sciences, ACM Database, and Information Systems Frontiers. He served as the Executive Director of the Teradata University Network through 2020 and was inducted into the Oklahoma Higher Education Hall of Fame in 2016. Ramesh is a Fellow of INFORMS and AIS.



    Steven Skiena
    (Stony Brook University) [introductory/intermediate]
    Word and Graph Embeddings for Machine Learning

    Summary

    Word/graph embedding methods (like word2vec and DeepWalk) are powerful approaches for constructing feature vectors for symbols and nodes in an unsupervised manner, for use in building machine learning models. We will review the basics of how word embeddings are constructed, as well as graph neural networks and other approaches to building graph embeddings, along with interesting applications of these techniques.
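    As a minimal sketch of the first step in word2vec's skip-gram model (the sentence and window size are chosen arbitrarily for illustration), each word is paired with the words in a symmetric context window around it; these (center, context) pairs are the training data from which the embeddings are learned:

```python
# Extract (center, context) training pairs from a token sequence
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=1)
# With window=1, each interior word yields two pairs, the ends one each
print(pairs[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

    DeepWalk applies the same machinery to graphs by treating random walks over nodes as "sentences", so node embeddings fall out of the identical training procedure.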

    Syllabus

    • word embeddings
    • graph embeddings

    References

    • Skiena, Steven S. The data science design manual. Springer, 2017.

    Pre-requisites

    None

    Short bio

    Steven Skiena is Distinguished Teaching Professor of Computer Science and Director of the Institute for AI-Driven Discovery and Innovation at Stony Brook University. His research interests include data science, bioinformatics, and algorithms. He is the author of six books, including "The Algorithm Design Manual", "The Data Science Design Manual", and "Who's Bigger: Where Historical Figures Really Rank".

    Skiena received his B.S. in Computer Science from the University of Virginia (Wahoo-Wa!) and his Ph.D. in Computer Science from the University of Illinois in 1988. He is the author of over 150 technical papers. He is a Fellow of the American Association for the Advancement of Science (AAAS), a former Fulbright scholar, and recipient of the ONR Young Investigator Award and the IEEE Computer Science and Engineer Teaching Award. More info is available at http://www.cs.stonybrook.edu/~skiena/.



    Alexandre Vaniachine
    (VirtualHealth) [intermediate]
    Open-source Columnar Databases


    Sebastián Ventura
    (University of Córdoba) [intermediate/advanced]
    Supervised Descriptive Pattern Mining

    Summary

    The objective of this course is to give a gentle introduction to several disciplines that aim to discover useful descriptive patterns, represented as rules induced from labeled data. For each of these disciplines we will examine its goals, the main algorithms described in the literature, and some illustrative examples showing their usefulness in real-world problems. We will finish with a reflection on scalability and the adaptation of these algorithms to big data problems.
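    For instance, subgroup discovery (one of the disciplines covered) commonly scores candidate patterns with weighted relative accuracy (WRAcc), which trades off a subgroup's coverage against how much its target rate exceeds the baseline. A minimal sketch, with all counts invented for illustration:

```python
# WRAcc(S) = p(S) * (p(target | S) - p(target))
def wracc(n_total, n_pos, n_sub, n_sub_pos):
    coverage = n_sub / n_total        # p(S): fraction covered by subgroup
    sub_rate = n_sub_pos / n_sub      # p(target | S): rate inside subgroup
    base_rate = n_pos / n_total       # p(target): overall rate
    return coverage * (sub_rate - base_rate)

# Invented example: 100 records, 40 positive; a candidate subgroup
# covers 20 records, of which 15 are positive
q = wracc(100, 40, 20, 15)
print(round(q, 3))  # 0.07
```

    A positive WRAcc means the subgroup's rule is more accurate than guessing the base rate; search algorithms rank candidate rules by scores like this one.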

    Syllabus

    • Introduction to Supervised Descriptive Pattern Mining
    • Contrast Sets
    • Emerging Patterns
    • Subgroup Discovery
    • Class Association Rules
    • Exceptional Models

    References

    • S. Ventura & J.M. Luna. Supervised Descriptive Pattern Mining. Springer, 2018.
    • G. Dong. Exploiting the Power of Group Differences: Using Patterns to Solve Data Analysis Problems. Synthesis Lectures on Data Mining and Knowledge Discovery, Vol. 11, no. 1. Morgan & Claypool, 2019.
    • G. Dong & J. Bailey (eds). Contrast Data Mining: Concepts, Algorithms, and Applications. CRC Press, 2013.
    • P. K. Novak, N. Lavrac, and G. I. Webb. Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. Journal of Machine Learning Research, 10:377–403, 2009.

    Pre-requisites

    • Frequent itemset mining: concept, essential algorithms and their implementation.
    • Association rule mining: concept, essential algorithms and implementation.
    • Computer programming.

    Short Bio

    Sebastián Ventura is currently a Full Professor in the Department of Computer Science and Numerical Analysis at the University of Córdoba (Córdoba, Spain), where he heads the Knowledge Discovery and Intelligent Systems Research Laboratory. He is also an affiliated professor at the Department of Computer Science, Virginia Commonwealth University (Richmond, USA). He has published three books and about 300 papers in journals and scientific conferences, and he has edited about ten books and special issues in international journals in his area of expertise. He has also served as General and Program Chair of several conferences in the fields of machine learning and artificial intelligence, and he currently serves on the editorial boards of journals such as Information Fusion, Engineering Applications of Artificial Intelligence, and Computational Intelligence, as well as being Editor-in-Chief of the Progress in Artificial Intelligence journal. His main research interests are in the fields of data science, computational intelligence, and their applications. Dr. Ventura is a senior member of the IEEE Computer, the IEEE Computational Intelligence and the IEEE Systems, Man and Cybernetics Societies, as well as the Association for Computing Machinery (ACM).



    Xiaowei Xu
    (University of Arkansas, Little Rock) [introductory/advanced]
    Language Models and Applications

    Summary

    A language model is a probabilistic model of language. This course covers the basic concepts of probabilistic models of language, including n-gram, latent semantic, and topic models, as well as recent neural network-based models including skip-gram, CBOW, and ELMo. The focus is on the most recent advances in transformer-based language models such as Google's BERT and OpenAI's GPT-3. The course presents some breakthrough applications of advanced language models, including machine creative writing, reasoning, and causal inference.
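    As a minimal illustration of the n-gram idea (the toy corpus is invented), a bigram model estimates the probability of each word given the previous one directly from counts:

```python
from collections import Counter

# Toy corpus, pre-tokenized for simplicity
corpus = "the cat sat on the mat . the cat ran .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate: count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

# "the" occurs 3 times, followed by "cat" twice and "mat" once
print(bigram_prob("the", "cat"))  # 2/3
```

    Neural models such as skip-gram and the transformers covered later replace these count-based estimates with learned continuous representations, but the underlying task of predicting words from context is the same.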

    Syllabus

      1. Introduction (1.5 hours)
        • 1.1 Vector space model
        • 1.2 Word2vec (Skip-gram and CBOW)
        • 1.3 GloVe
        • 1.4 FastText
        • 1.5 ELMo
      2. Transformer-based language models (1.5 hours)
        • 2.1 BERT
        • 2.2 GPT
      3. Applications (1.5 hours)
        • 3.1 NLP
        • 3.2 Text mining
        • 3.3 Creative writing
        • 3.4 Reasoning from text
      4. Future directions

    References

      1. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119, 2013.
      2. Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.
      3. Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and Joulin, A. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
      4. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237, 2018.
      5. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
      6. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI technical report, 2019.
      7. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
      8. Cakaloglu, T., and Xu, X. A multi-resolution word embedding for document retrieval from large unstructured knowledge bases. arXiv preprint arXiv:1902.00663, 2019.

    Pre-requisites

    Basic knowledge of linear algebra, statistics and machine learning.

    Short Bio

    Xiaowei Xu, a professor of Information Science at the University of Arkansas, Little Rock (UALR), received his Ph.D. degree in Computer Science from the University of Munich in 1998. Before his appointment at UALR, he was a senior research scientist at Siemens in Munich, Germany. His research spans machine learning and artificial intelligence. Dr. Xu is a recipient of the 2014 ACM SIGKDD Test of Time Award for his contribution to the density-based clustering algorithm DBSCAN, which is one of the most widely used clustering algorithms and among the most cited in the scientific literature, with 18,939 citations according to Google Scholar.