DeustoTech, School of Engineering
University of Deusto
Avda. Universidades, 24
48014 Bilbao

Links of Interest

Active Links

Past Links

BigDat 2015

Contact Information

Lilica Voicu: florentinalilica.voicu (at) urv.cat

2^nd International Winter School on Big Data

Bilbao, Spain, February 8-12, 2016

Keynote Speakers

Nektarios Benekos (European Organization for Nuclear Research), Role of Computing and Software in Particle Physics

Summary. The challenges of computing in science and technology have grown in several aspects: computing software and infrastructure is mission critical, requirements in communication, storage and CPU capacity are often above the currently feasible and affordable level and the large science projects have a lifetime that is much longer than one cycle in hardware and software technology. Computing in particle physics and related research domains faces unprecedented challenges, determined by the complexity of physics under study, of the detectors and of the environments where they operate. Cross-disciplinary solutions in physics and detector simulation, data analysis as well as in software engineering and Grid technology are presented, with examples of their use in global science projects in the fields of particle physics and space.

Bio. Nektarios Chr. Benekos received a Ph.D. in particle physics-detector R&D and data analysis and is now visiting Associate Professor in the Department of Applied Physics and Computing at the National Technical University of Athens and Senior Data Analyst at CERN. He previously held positions at Illinois Urbana- Champaign University, Max Planck Institute for Physics after being a Fellow at the Physics Division of the European Organization for Nuclear Research-CERN. He hasupervised the PhD and Master of 10 students and published around 65 papers in physics and computer science with an hindex of 100 and over 50000 citations.

He currently works in applying computer science from infrastructure to analytics in large-scale particle and astroparticle physics experiments, Image processing, Network Science and Particle Physics. The infrastructure work is built around Software Defined Systems on Clouds and Clusters. He is involved several projects to Big Data and Big Computing Centers. He has experience in online education and its use in MOOC’s and other webinar platforms for areas like Data and Computational Science.

He is also acting as senior scientific advisor at the GF-ACCORD, a fast Business Consulting company, supervising the R&D efforts in Data Science Unit for Business Intelligence Applications.

Chih-Jen Lin (National Taiwan University), When and When Not to Use Distributed Machine Learning

Summary. In the big-data era, data are often too large tobe stored in one computer.However, whether one should conduct machine learning tasks or dataanalytics in the distributed environment is still a debatable issue.Sub-sampling data to one machine for analysis is always an option.In this talk I will argue that both the traditional single-machinesetting and the new distributed setting are both importantfor big-data machine learning in the future. I will discuss theiradvantage/disadvantages and when to use which.

Bio. National Taiwan University page.

Jeffrey Ullman (Stanford University), Theory of MapReduce Algorithms

Summary. Not all parallel algorithms are MapReduce algorithms. A complexity theory for MapReduce algorithms has developed recently, and it involves the trade-off between the memory used by reducers ("reducer size") and the communication between the map and reduce phases ("replication rate"). We shall outline the theory, review some of the problems whose complexity in this sense has been analyzed, and mention some of the open problems that remain in this theory.

Bio. Stanford page.

Alexandre Vaniachine (Argonne National Laboratory), Big Data Technologies and Data Science Methods in the Higgs Boson Discovery

Summary. Big Data technologies and Data Science methods in Higgs boson discovery The data volumes collected at the Large Hadron Collider represent research challenges common for data-intensive sciences. The scientific goals of the LHC include high precision tests of the Standard Model of fundamental particles and searches for new physics. These goals require detailed comparison of the expected physics models and detector behavior with data. Progress in Big Data technologies enables LHC data collection and analytics, including comparison with computational models simulating the expected data behavior. We use the Higgs boson example to describe the end-to-end data flow for scientific discovery highlighting the roles of distributed computing and Data Science classification methods maximizing signal significance over background noise models.

Bio. Argonne page.

Nektarios Benekos (European Organization for Nuclear Research), [introductory/intermediate] Exploring the Mysteries of our Cosmos: the Big Deal between Big Data and Big Science

Summary. We are living in "the age of big data," according to The World Economic Forum. I agree too. The collection and mining of massive amounts of digital data currently defines the term big data. Those are processes that businesses largely handle. However, the analysis of that data -- that magic ingredient of algorithms and advanced mathematics that bridges the gap between knowledge and insight -- is big science. It is wherhe value is. It is the future. Getting right message to the right customer at the right time is the promise of relevant, real-time marketing. Big science, not big data, will bring this to life. CERN is one of the world’s largest and most respected centres for scientific research where big data meets big science. Its business is fundamental physics, finding out what the Universe is made of and how it works, after analyzing massive scale of raw and derived data. CERN operates the Large Hadron Collider (LHC), the world’s largest and most powerful particle accelerator where th ATLAS and CMS experiments announced their observations of a particle consistent with the long-sought Higgs boson, in summer of 2012. The particle’s detection has set the worldwide scientific community buzzing, but behind the success of the work undertaken at the LHC, lies a story of Big Data success that is truly ground-breaking.

This rapid increase in performance of the LHC accelerator is having an impact on the computing requirements since it increases the rate (Velocity), complexity (Variety) and quantity (Volume) of data that the LHC experiments need to store, distribute and process.

Put simply, the analysis that big science brings to the table makes big data relevant. Therefore, this course will introduce students to big science combining with big data to create big opportunities in three significant ways: real-time relevant content, data visualization, and predictive analytics and how these performed at the LHC.

Syllabus. Introductions to mysteries of our Cosmos, introduction to the Large Hadron Collider and particle physics expements, introduction to big data programming, streaming, management, triggering, filtering, visualization, monitoring and analyzing in real time.

References

The Large Hadron Collider
LHC Brochure, English version. A presentation of the largest and the most powerful particle accelerator in the world, the Large Hadron Collider (LHC), which started up in 2008. Its role, characteristics, technologies, etc. are explained for the general public. CERN-Brochure-2010-006-Eng. LHC Brochure, English version. CERN. Retrieved 20 January 2013
What is the Worldwide LHC Computing Grid? (Public 'About' page): Currently WLCG is made up of more than 170 computing centers in 36 countries. The WLCG is now the world's largest computing grid
New results indicate that new particle is a Higgs boson. CERN. 14 March 2013. Retrieved 14 March 2013
IBM What is big data? — Bringing big data to the enterprise. www.ibm.com. Retrieved 2013-08-26
De Mauro, Andrea; Greco, Marco; Grimaldi, Michele What is big data? A consensual definition and a review of key research topics. AIP Conference Proceedings 1644: 97–104. doi:10.1063/1.4907823, 2015
Hellerstein, Joseph (27 February 2008). Quantitative Data Cleaning for Large Databases. EECS Computer Science Division: 3. Retrieved 26 October 2013.

Pre-requisites. Some background in physics experiments, programming techniques and algorithms, databases, and probability and statistics will be useful.

He is also acting as senior scientific advisor at the GF-ACCORD, a fast Business Consulting company, supervising the R&D efforts in Data Science Unit for Business Intelligence Applications.

Hendrik Blockeel (KU Leuven), [intermediate] Decision Trees for Big Data Analytics

Summary. Decision trees, and derived methods such as Random Forests, are among the most popular methods for learning predictive models from data. This is to a large extent due to the versatility and efficiency of these algorithms. This course will introduce students to the basic methods for learning decision trees, as well as to variations and more sophisticated versions of decision tree learners, with a particular focus on those methods that make decision trees work in the context of big data.

Syllabus. Classification and regression trees, multi-output trees, clustering trees, model trees, ensembles of trees, incremental learning of trees, learning decision trees from large datasets, learning from data streams.

References. Relevant references will be provided as part of the course.

Pre-requisites. Familiarity with mathematics, probability theory, statistics, and algorithms is expected, on the level it is typically introduced at the bachelor level in computer science or engineering programs.

Bio. Hendrik Blockeel is a professor at KU Leuven, Belgium, and part-time associate professor at Leiden University, the Netherlands. He received his Ph.D. degree in 1998 from KU Leuven. His research interests lie mostly within artificial intelligence, with a focus on machine learning and data mining, and in the use of AI-based modeling in other sciences. Dr. Blockeel has made a variety of contributions on topics such as inductive logic programming, probabilistic-logical learning, and decision trees. He is an action editor of Machine Learning and a member of the editorial board of several other journals. He has chaired or organized several conferences, including ECMLPKDD 2013, and organized the ACAI summer school in 2007. He has served on the board of the European Coordinating Committee for Artificial Intelligence, and currently serves on the ECMLPKDD steering committee as publications chair. He became an ECCAI fellow in 2015.

Nello Cristianini (University of Bristol), [introductory] THINKBIG: Towards Large Scale Computational Social Sciences, History and Digital Humanities

Summary. We describe a series of studies in which massive sets of data (mostly text and images) and mined in order to gain new insights about society, the media system and history. These studies are only possible with large scale AI techniques, and we expect them to become increasingly common in the future. Among other things, we will study gender in the media, mood on twitter, cultural change in history, by analysing several millions of documents. The methods used can directly be trasferred and applied to a variety of other domains.

Syllabus.

on data driven AI
what is a pattern ?
extracting information from social media (mood, flu, etc) / large scale
mining newspapers: science, politics, history... and analysis of news images. / large scale
can science be automated? / large scale
ethical aspects

Pre-requisites. No prerequitites, it is an introductory course.

Bio. Wiki page.

Ernesto Damiani (University of Milan and EBTIC/Khalifa University), [introductory/intermediate] Architectures, Models and Tools for Big-Data-as-a-Service

Summary. Many companies andorganisations worldwide have become aware of the potential competitiveadvantage they could get by timely and accurate Big Data Analytics (BDA), but lackthe data management expertise and budget to fully exploit BDA. To help overcoming this hurdle,this course will introduce model-based BDA-as-a-service (MBDAaaS), providing clear models of the entire Big Data analysis process and of its artefacts. The approach supports automation and commoditisation of Big Data analytics, while enabling BDA customization to domain-specific customer requirements. Besides models for representing all aspects ofBDA, the course will discuss and compare available architectural patterns and toolkits for repeatable set-up and management of Big Data analyticspipelines. Repeatable patterns can drive costs of Big Data analytics withinreach of EU organizations (including SMEs) that do not have either in-house BigData expertise or budget for expensive data consultancy. Activities discussed in the course will include (i) planning Big Data sourcespreparation handling Big Data opacity, diversity, security, and privacy compliance (ii) Service Level Agreements (SLAs)for BDA detailing privacy, timing, and accuracy needs (iii) datamanagement and algorithm parallelisation strategies, from distributed data acquisition/storage to the design and parallel deployment of analytics and presentation of results. (iv) auditing andassessment of legal compliance (for example, to privacy regulations) of BDAenactment.

Bio. Ernesto Damiani joined Khalifa University, Abu Dhabi as Director of the Information Security Center and Leader of the Big Data Initiative at the Etisalat British Telecom Innovation Center (EBTIC). He is on extended leave from the Department of Computer Science, Università degli Studi di Milano, Italy, where he leads the SESAR research lab. Ernesto's research interests include secure service-oriented architectures, and privacy-preserving Big Data analytics. Ernesto holds/has held visiting positions at a number of international institutions, including George Mason University in Virginia, US, Tokyo Denki University, Japan, LaTrobe University in Melbourne, Australia, and the Institut National des Sciences Appliquées (INSA) at Lyon, France. He is a Fellow of the Japanese Society for the Progress of Science.

From December 2015, Ernesto is the Principal Investigator of the TOREADOR H2020 project on Model-driven Big-Data-as-a-Service. He has served as a PI in a number of large-scale research projects funded by the European Commission, the Italian Ministry of Research and by private companies such as British Telecom, Cisco Systems, SAP, Telecom Italia, Siemens Networks (now Nokia Siemens) and many others. Ernesto serves in the editorial board of several international journals; he is the EIC of the International Journal on Big Data and of the International Journal of Knowledge and Learning. He is Associate Editor of IEEE Transactions on Service-oriented Computing and of the IEEE Transactions on Fuzzy Systems. Ernesto is a senior member of the IEEE and served as Vice-Chair of the IEEE Technical Committee on Industrial Informatics. In 2008, Ernesto was nominated ACM Distinguished Scientist and received the Chester Sall Award from the IEEE Industrial Electronics Society. Ernesto has co-authored over 350 scientific papers and many books and international patents.

Francisco Herrera (University of Granada), [introductory] Big Data Preprocessing

Summary. Data Preprocessing for Data Mining addresses one of the most important issues within the well-known Knowledge Discovery from Data process. Data directly taken from the source usually are not ready to be considered for a data mining process. Data preprocessing techniques adapt the data to fulfill the input demands of each data mining algorithm. Data preprocessing includes data preparation methods for cleaning, transformation or managing imperfect data (missing values and noise data) and data reduction techniques, which aim at reducing the complexity of the data, detecting or removing irrelevant and noisy elements from the data, including feature and instance selection and discretization.

The knowledge extraction process from Big Data has become a very difficult task for most of the classical and advanced existing techniques. The main challenges are to deal with the increasing amount of data considering the number of instances and/or features, and the complexity of the problem. The design of data preprocessing methods for big data requires to redesign the methods adapting them to the new paradigms such as MapReduce and the directed acyclic graph model using Apache Spark.

In this course we will pay attention to preprocessing approaches for classification big data. We will analyze the design of preprocessing methods for big data (feature selection discretization, data preprocessing for imbalance classification, noise data cleaning,…), discussing how to include data preprocessing methods along the knowledge discovery process. We will pay attention to their design for MapReduce paradigm and Apache Spark framework.

Syllabus

Data preprocessing
Big data reduction (feature selection, discretization, prototype generation)
Noise and big data
Imbalanced big data classification preprocessing

References. See the page.

Bio. Francisco Herrera is a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada.

He has been the supervisor of 38 Ph.D. students and published more than 300 journal papers. He is co-author of the book "Data Preprocessing in Data Mining" (Springer, 2015). He currently acts as Editor in Chief of the international journals "Information Fusion" (Elsevier) and “Progress in Artificial Intelligence (Springer). He acts as editorial member of a dozen of journals.

He has been given many awards and honors for his personal work or for his publications in journals and conferences, among others; ECCAI Fellow 2009, IFSA Fellow 2013, IEEE Transactions on Fuzzy System Outstanding 2008 and 2012 Paper Award ((bestowed in 2011 and 2015), and nomination as Highly Cited Researchers in the areas of Engineering and Computer Sciences.

Chih-Jen Lin (National Taiwan University), [introductory/intermediate] Large-scale Linear Classification

Summary. Many classification methods such as kernel methods or decision trees are nonlinear approaches. However, linear methods of using a simple weight vector as the model remain to be very useful for many applications. By careful feature engineering and having data in a rich dimensional space, the performance may be competitive with that of using a highly nonlinear classifier. Successful application areas include document classification and computational advertising (CTR prediction). In the first part of this talk, we give an overview of linear classification by introducing commonly used formulations through different aspects. This discussion is useful because many people are confused about the relationships between, for example, SVM and logistic regression. We also discuss the connection between linear and kernel classification. In the second part we move to investigate techniques for solving optimization problems for linear classification. In particular, we show details of two representative settings: coordinate descent methods and Newton methods. The third part of the talk discusses issues in applying linear classification for big-data analytics. We present effective training methods in both multi-core and distributed environments. After demonstrating some promising results we discuss future challenges of linear classification.

Syllabus

Linear classification
Kernel classification
Linear versus kernel classification
Solving optimization problems
Big-data machine learning
Multi-core linear classification
Distributed linear classification
Discussion and conclusions

References

G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent Advances of Large-scale Linear Classification. Proceedings of the IEEE, 100(2012), 2584-2603
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9(2008), 1871-1874

Pre-requisites. No prerequisites.

Bio. Chih-Jen Lin is currently a distinguished professor at the Department of Computer Science, National Taiwan University. He obtained his B.S. degree from National Taiwan University in 1993 and Ph.D. degree from University of Michigan in 1998. His major research areas include machine learning, data mining, and numerical optimization. He is best known for his work on support vector machines (SVM) for data classification. His software LIBSVM is one of the most widely used and cited SVM packages. For his research work he has received many awards, including the ACM KDD 2010 and ACM RecSys 2013 best paper awards. He is an IEEE fellow, a AAAI fellow, and an ACM fellow for his contribution to machine learning algorithms and software design. More information about him can be found at the National Taiwan University page.

George Karypis (University of Minnesota), [intermediate/advanced] Scaling Up Recommender Systems

Summary. Recommender systems are designed to identify the items that a user will like or find useful based on the user’s prior preferences and activities. These systems have become ubiquitous and are an essential tool for information filtering and (e-)commerce. Over the years, collaborative filtering, which derive these recommendations by leveraging past activities of groups of users, has emerged as the most prominent approach for solving this problem.

This course is designed to provide an overview of the different state-of-the-art methods and applications of recommender systems with a focus towards deploying them in domains with large amounts of data and/or low-latency recommendations. The course consists of two major parts. The first will cover various serial algorithms for solving some of the most common recommendation problems including rating prediction, top-N recommendation, and cold-start. The second will cover various serial and parallel algorithms, formulations, and approaches that allow these methods to scale to large problems.

In order to succeed in the course, students need to have a background in algorithms, numerical optimization, and parallel computing.

References

Recommender Systems Handbook, Francesco Ricci, Lior Rokach, Bracha Shapira, Paul B. Kantor (Eds.), Springer, 2011 (a new edition is in press)

Bio. George Karypis is an ADC Chair of Digital Technology Professor at the Department of Computer Science & Engineering at the University of Minnesota, Twin Cities. His research interests spans the areas of data mining, high performance computing, information retrieval, collaborative filtering, bioinformatics, cheminformatics, and scientific computing. His research has resulted in the development of software libraries for serial and parallel graph partitioning (METIS and ParMETIS), hypergraph partitioning (hMETIS), for parallel Cholesky factorization (PSPASES), for collaborative filtering-based recommendation algorithms (SUGGEST), clustering high dimensional datasets (CLUTO), finding frequent patterns in diverse datasets (PAFI), and for protein secondary structure prediction (YASSPP). He has coauthored over 250 papers on these topics and two books (“Introduction to Protein Structure Prediction: Methods and Algorithms” (Wiley, 2010) and “Introduction to Parallel Computing” (Publ. Addison Wesley, 2003, 2nd edition)). In addition, he is serving on the program committees of many conferences and workshops on these topics, and on the editorial boards of the IEEE Transactions on Knowledge and Data Engineering, ACM Transactions on Knowledge Discovery from Data, Data Mining and Knowledge Discovery, Social Network Analysis and Data Mining Journal, International Journal of Data Mining and Bioinformatics, the journal on Current Proteomics, Advances in Bioinformatics, and Biomedicine and Biotechnology.

Geoff McLachlan (University of Queensland), [intermediate/advanced] Big Data Extensions of Some Methods of Classification and Clustering

Summary. Attention is focussed first on supervised classification (discriminant analysis) for high-dimensional datasets. Issues discussed include variable selection and the estimation of the associated error rates to circumvent selection bias problems. The unsupervised classification (cluster analysis) is considered next with the focus on the use of finite mixture distributions, in particular multivariate normal distributions, to provide a model-based approach to clustering. A coverage of the main issues involved with the maximum likelihood fitting of mixture models via the EM algorithm will be given, including extensions of normal component distributions for long-tailed and/or skew data. Finally, consideration is given to further extensions of these mixture models to handle big data of possibly high-dimensions through the use of factor models after an appropriate reduction where necessary in the number of variables. Various real-data examples are given.

Syllabus

Construction of classifiers, including variable selection and estimation of their error rates, for the supervised classification of high-dimensional datasets. (2 hours).
An introduction to finite mixture models for unsupervised classification (clustering) with extensions of normal component distributions tot-distributions to handle long-tailed clusters. (2 hours).
Extensions of normal andt-mixture models for the clustering of big datasets of possibly high dimensions and with clusters that are not necessarily elliptically symmetric. The use of mixtures of factor models witht- and skew t-component distributions is to be highlighted. (2 hours).

References

Lee, S. and McLachlan, G.J. Finite mixtures of multivariate skewt-distributions: some recent and new results. Statistics and Computing24, 181-202, 2014
Lin, T.-I., Wu, P.H., McLachlan, G.J., and Lee, S.X. A robust factor analysis model using the restricted skewt-distribution. TEST 24, 510-531, 2015
McLachlan, G.J., Bean, R.W., and Peel, D. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18, 413-422, 2002
McLachlan, G.J. and Krishnan, T. The EM Algorithm and Extensions. Second Edition. Hoboken, New Jersey: Wiley, 2008
McLachlan, G.J. and Peel, D. Finite Mixture Models. New York: Wiley, 2000

Pre-requisites. A good knowledge of multivariate statistics at least at an advanced undergraduate level.

Bio. Homepage atUniversity of Queensland.

Wladek Minor (University of Virginia), [introductory/intermediate] Big Data in Biomedical Sciences

Syllabus.

Big Data and Big Data in Biomedical Sciences
Why big data is perceived as a big problem - technological consideration
Data reduction - should we preserve unreduced (raw) data?
Databases and databanks
Data mining with the use of raw data, databanks and databases
Data Integration
Automatic and semi-automatic curation of the large amount of data
Conversion of databanks into databases
Database priorities – content and design
Interaction between databases
Modern data management in biomedical sciences – necessity or luxury
Automatic data harvesting – close reality or still on the horizon
Reproducibility of the biomedical experiments - drug discovery considerations
Big data in medicine - new possibilities
Future considerations

Raymond Ng (University of British Columbia), [introductory/intermediate] Mining and Summarizing Text Conversations

Summary. With the ever-increasing popularity of Internet technologies and communication devices such as smartphones and tablets, and with huge amounts of such conversational data generated on hourly basis, intelligent text analytic approaches can greatly benefit organizations and individuals. For example, managers can find the information exchanged in forum discussions crucial for decision making. Moreover, the posts and comments about a product can help business owners to improve the product.

In this lecture, we first give an overview of important applications of mining text conversations, using sentiment summarization of product reviews as a case study. Then we examine three topics in this area: (i) topic modeling; (ii) natural language summarization; and (iii) extraction of rhetorical structure and relationships in text.

Syllabus

Text conversations and business intelligence (0.5 hours)
Sentiment extraction and summarization as applications (1.5 hours)
Topic modeling (1 hour)
Extractive and abstractive summarization (1 hour)
Rhetorical analysis (1.5 hour)
Summary (0.5 hour)

References

Shafiq Joty, Giuseppe Carenini and Raymond Ng. Topic Segmentation and Labeling in Asynchronous Conversations. Journal of AI Research (JAIR) (2013), Vol. 47, Page 521-573
Shafiq Joty, Giuseppe Carenini, Gabriel Murray and Raymond Ng. Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), MIT, Massachusetts, USA
Yashar Mehdad, Giuseppe Carenini, Raymond Ng and Shafiq Joty. Towards Topic Labeling with Phrase Entailment and Aggregation. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), Atlanta, USA
Yashar Mehdad, Giuseppe Carenini, and Raymond Ng. Abstractive Summarization of Spoken and Written Conversations based on Phrasal Queries. In Proceedings of Association of Computational Linguistics (ACL 2014)
Shafiq Joty, Giuseppe Carenini and Raymond Ng. A Novel Discriminative Framework for Sentence-Level Discourse Analysis. EMNLP 2012
Shafiq Joty, Giuseppe Carenini, Raymond Ng and Yashar Mehdad. Combining Intra- and Multi-sentential Rhetorical Parsing for Document-level Discourse Analysis. ACL 2013
Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond Ng and Bita Nejat. Abstractive Summarization of Product Reviews Using Discourse Structure. EMNLP 2014
Kelsey Allen, Giuseppe Carenini and Raymond Ng, Detecting Disagreement in Conversations using Pseudo-Monologic Rhetorical Structure, EMNLP 2014

Pre-requisites. Basic knowledge of machine learning and natural language processing is preferred but not required.

Bio. Dr. Raymond Ng is a professor in Computer Science at the University of British Columbia. His main research area for the past two decades is on data mining, with a specific focus on health informatics and text mining. He has published over 180 peer-reviewed publications on data clustering, outlier detection, OLAP processing, health informatics and text mining. He is the recipient of two best paper awards - from 2001 ACM SIGKDD conference, which is the premier data mining conference worldwide, and the 2005 ACM SIGMOD conference, which is one of the top database conferences worldwide. He co-authors the books entitled “Methods for mining and summarizing text conversations,” and “Perspectives for Business Intelligence”.

Sankar K. Pal (Indian Statistical Institute), [introductory/advanced] Machine Intelligence and Granular Mining: Relevance to Big Data

Syllabus

What is Machine Intelligence?
Relevance of Pattern Recognition (PR)
Data Mining from PR Perspective
Soft Computing

Roles of Fuzzy Sets, Neural Networks and Genetic Algorithms
Relevance of Different Synergistic Integrations

Significance of Soft Computing in PR
What is Granular Computing (GrC)?
Rough Sets (RS) and Information Granules

Example: Case mining
Rough knowledge encoding and multi-spectral Image segmentation
Uncertainty handling
Significance in data mining
Other applications of rough granules

Why Rough-fuzzy Computing?

Fuzzy lower and upper approximations

Generalized Rough Sets and Entropy

Different cases of crisp/fuzzy sets and granules
Upper and lower approximations
Generalized Entropy

Challenging Issues in GrC

Fuzzy granular computing
Granular fuzzy computing

Application in Image and Video Analysis

Neighborhood Rough sets and class dependent granulation

Application to remote sensing images

Rough-fuzzy clustering and image segmentation
Generalized RS in object extraction & video tracking

Significance of rough-fuzzy entropy in image
Generalized rough image ambiguity measures
Effect of generalization
Effect of granule size

Granular Flow Graph, Adaptive Rule Generation and Tracking

Granules of arbitrary shape
Effect of flow graph

Application in Bioinformatics
- Rough-fuzzy c-medoids and Bio-bases determination from protein sequence
- f-information measures and Gene selection from microarray data
- miRNA selection in groups
Application in Web intelligence

Rough set based ensemble classifier
Web service classification

Why web service classification?
Characterization features of web services
Relevance of tensor space model (TSM)
Experimental results

Hypertext (web page) mining

Granular Neural Networks

Modular rough-fuzzy MLP: Data to knowledge
Fuzzy-rough granular classification
Fuzzy-rough granular SOM
Applications

Granular Model for Social Networks (SN)

Why fuzzy granules in SN?
Fuzzy set theoretic extension of SN properties
Fuzzy-rough community detection

Evaluating community structures

Target set selection

Challenging issues
Summary
Relevance to Big Data Analysis

Bio. Sankar K. Pal is a Distinguished Scientist and former Director of the Indian Statistical Institute. He is also a J.C. Bose Fellow of the Govt. of India. He founded the Machine Intelligence Unit and the Center for Soft Computing Research: A National Facility in the Institute in Calcutta. He received a Ph.D. in Radio Physics and Electronics from the University of Calcutta in 1979, and another Ph.D. in Electrical Engineering along with DIC from Imperial College, University of London in 1982.

Prof. Pal worked at the University of California, Berkeley and the University of Maryland, College Park in 1986-87; the NASA Johnson Space Center, Houston, Texas in 1990-92 & 1994; and in US Naval Research Laboratory, Washington DC in 2004. Since 1997 he has been a Distinguished Visitor of the IEEE Computer Society, for the Asia-Pacific Region, and held several visiting positions in Italy, Poland, Hong Kong and Australian universities.

He is a co-author of seventeen books and more than four hundred research publications in the areas of Pattern Recognition and Machine Learning, Image Processing, Data Mining, Web Intelligence, Soft Computing, Neural Nets, Genetic Algorithms, Fuzzy Sets, Rough Sets Social Network Analysis and Bioinformatics. He visited about forty countries as a Keynote/ Invited speaker.

He is a Fellow of the IEEE, the Academy of Sciences for the Developing World (TWAS), International Association for Pattern Recognition, International Association of Fuzzy Systems, and all the four National Academies for Science & Engineering in India. He serves(d) in the editorial boards of twenty-two international journals including several IEEE Transactions.

He has received the 1990 S.S. Bhatnagar Prize (the most coveted award for a scientist in India), 2013 Padma Shri (one of the highest civilian awards) by the President of India and many prestigious awards in India and abroad including the 1999 G.D. Birla Award, 1993 Jawaharlal Nehru Fellowship, 2000 Khwarizmi International Award from the President of Iran, 1993 NASA Tech Brief Award (USA), 1994 IEEE Trans. Neural Networks Outstanding Paper Award, and 2005-06 Indian Science Congress-P.C. Mahalanobis Birth Centenary Gold Medal from the Prime Minister of India for Lifetime Achievement.

More info Indian Statistical Institute page.

Erhard Rahm (University of Leipzig), [introductory/intermediate] Scalable and Privacy-preserving Data Integration

Summary. Data integration is a key challenge for Big Data applications to analyze large sets of heterogeneous data of potentially different kinds, including structured database records as well as semi-structured entities from web sources or social networks. In many cases, there is also a need to deal with a very high number of data sources, e.g. product offers from many e-commerce websites. The integration of sensitive personal information from different sources, e.g. about customers of different companies or patients of different hospitals, poses additional challenge to protect the privacy of the individuals while still supporting certain data analysis tasks on the anonymized data. We will cover proposed approaches to deal with the key data integration tasks of (large-scale) entity resolution and schema or ontology matching. In particular we discuss how entity resolution can be performed in parallel on Hadoop platforms together with so-called blocking approaches to avoid comparing to many entities with each other and load balancing techniques to deal with data skew. For privacy-preserving record linkage we focus on the use of Bloom filters for encrypting sensitive attribute values while still permitting effective match decisions. We discuss the use of different configurations with or without the use of a dedicated linkage unit and their implications regarding privacy and runtime complexity. Another topic are graph-based data integration and analysis approaches that keep all relevant relationships between entities to enable more sophisticated analysis tasks. Such approaches are not only useful for typical graph applications such as for social networks, but can also lead to enhanced business intelligence on enterprise data.

Syllabus

Data integration

Scalable entity resolution / link discovery
Large-scale schema/ontology matching
Holistic data integration

Privacy-preserving record linkage (PPRL)

Encryption of sensitive information
PPRL with linkage unit
Secure multi-party approaches

Graph-based data integration and analytics

ETL for graph data
Graph-based business intelligence
Hadoop-based graph analytics

References

Z. Bellahsene, A. Bonifati, E. Rahm (eds.): Schema Matching and Mapping, Springer 2011
P. Christen: Data Matching. Springer 2012
X. L.Dong, D. Srivastava: Big Data Integration. Synthesis Lectures on Data Management, Morgan Kaufman 2015
M.Junghanns et al.: GRADOOP: Scalable Graph Data Management and Analytics with Hadoop. Arxiv report 2015
L. Kolb, A. Thor, E. Rahm: Dedoop: Efficient Deduplication with Hadoop, PVLDB 2012
A. Petermann et al.: Graph-based Data Integration and Business Intelligence with BIIIG, PVLDB 2014
D.Vatsalan, P. Christen, V.S. Verykios: A taxonomy of privacy-preserving record linkage techniques. Information Systems 2013

Pre-requisites. Participants should have a computer science background and be familiar with traditional database systems and data warehouses. Knowledge about basic Big Data technologies such as Hadoop is beneficial.

Bio. Erhard Rahm is full professor for databases at the computer science institute of the University of Leipzig, Germany. His current research focusses on Big Data and data integration. He has authored several books and more than 200 peer-reviewed journal and conference publications. His research on data integration and schema matching has been awarded several times, in particular with the renowned 10-year best-paper award of the conference series VLDB (Very Large Databases) and the Influential Paper Award of the conference series ICDE (Int. Conf. on Data Engineering). Prof. Rahm is one of the two scientific coordinators of the new German center of excellence on Big Data ScaDS (competence center for SCAlable Data services and Solutions) Dresden/Leipzig.

Hanan Samet (University of Maryland), [introductory/intermediate] Sorting in Space: Multidimensional, Spatial, and Metric Data Structures for Applications in Spatial Databases, Geographic Information Systems (GIS), and Location-based Services

Summary. The representation of multidimensional, spatial, and metric data is an important issue in applications of spatial database, geographic information systems (GIS), and location-based services. Recently, there has been much interest in hierarchical data structures such as quadtrees, octrees, and pyramids which are based on image hierarchies, as well methods that make use of bounding boxes which are based on object hierarchies. Their key advantage is that they provide a way to index into space. In fact, they are little more than multidimensional sorts. They are compact and depending on the nature of the spatial data they save space as well as time and also facilitate operations such as search.

We describe hierarchical representations of points, lines, collections of small rectangles, regions, surfaces, and volumes. For region data, we point out the dimension-reduction property of the region quadtree and octree. We also demonstrate how to use them for both raster and vector data. For metric data that does not lie in a vector space so that indexing is based simply on the distance between objects, we review various representations such as the vp-tree, gh-tree, and mb-tree. In particular, we demonstrate the close relationship between these representations and those designed for a vector space. For all of the representations, we show how they can be used to compute nearest objects in an incremental fashion so that the number of objects need not be known in advance. The VASCO JAVA applet is presented that illustrates these methods.

The above has been in the context of the traditional geometric representation of spatial data, while in the final part we review the more recent textual representation which is used in location-based services where the key issue is that of resolving ambiguities. For example, does ``London'' correspond to the name of a person or a location, and if it corresponds to a location, which of the over 700 different instances of ``London'' is it. The NewsStand system at newsstand.umiacs.umd.edu and the TwitterStand system at TwitterStand.umiacs.umd.edu system are examples. See also the cover article of the October 2014 issue of Communications of the ACM (or a cached version) and the accompanying video.

Syllabus

Introduction

Sample queries
Spatial Indexing
Sorting approach
Minimum bounding rectangles (e.g., R-tree)
Disjoint cells (e.g., R+-tree, k-d-B-tree)
Uniform grid
Location-based queries vs: feature-based queries
Region quadtree
Dimension reduction
Pyramid
Region quadtrees vs: pyramids
Space ordering methods

Points

point quadtree
MX quadtree
PR quadtree
k-d tree
Bintree
BSP tree

Lines

Strip tree
PM1 quadtree
PM2 quadtree
PM3 quadtree
PMR quadtree

Rectangles and arbitrary objects

MX-CIF quadtree
Loose quadtree
Partition fieldtree
R-tree

Surfaces and Volumes

Restricted quadtree
Region octree
PM octree

Metric Data

vp-tree
gh-tree
mb-tree

Operations

Incremental nearest object location
Boolean set operations

Spatial Database Issues

General issues
Specific issues

Indexing spatiotextual data for location-based services delivered on platforms such as smart phones and tablets

Incorporation of spatial synonyms in search engines
Toponym recognition
Toponym resolution
Spatial reader scope
Incorporation of spatiotemporal data
System integration issues
Demos of live systems on smart phones

Example systems

AND internet browser
JAVA spatial data applets
STEWARD
NewsStand
TwitterStand

References

H. Samet. Foundations of Multidimensional Data Structures. Morgan-Kaufmann, San Francisco, 2006
H. Samet. A sorting approach to indexing spatial data. International Journal of Shape Modeling. 14(1):15--37, 28(4):517--580, June 2008
G. R. Hjaltason and H. Samet. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems, 28(4):517--580, December 2003
G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. ACM Transactions on Database Systems, 24(2):265--318, June 1999. Also Computer Science TR-3919, University of Maryland, College Park, MD
G. R. Hjaltason and H. Samet. Ranking in spatial databases. In Advances in Spatial Databases --- 4th International Symposium, SSD'95, M. J. Egenhofer and J. R. Herring, eds., pages 83--95, Portland, ME, August 1995. Also Springer-Verlag Lecture Notes in Computer Science 951
H. Samet. Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison-Wesley, Reading, MA, 1990
H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA, 1990
C. Esperanca and H. Samet

Experience with SAND/Tcl: a scripting tool for spatial databases.

H. Samet, H. Alborzi, F. Brabec, C. Esperanca, G. R. Hjaltason, F. Morgan, and E. Tanin. Use of the SAND spatial browser for digital government applications. Communications of the ACM, 46(1):63--66, January 2003
B. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. NewsStand: A new view on news. Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 144--153, Irvine, CA, November 2008
H. Samet, J. Sankaranarayanan, M. D. Lieberman, M. D. Adelfio, B. C. Fruin, J. M. Lotkowski, D. Panozzo, J. Sperling, and B. E. Teitler

Reading news with maps by exploiting spatial synonyms.

J. Sankaranarayanan, H. Samet, B. Teitler, M. D. Lieberman, and J. Sperling. TwitterStand: News in tweets. Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 42--51, Seattle, WA, November 2009
M. D. Lieberman, H. Samet, and J. Sankaranarayanan

Geotagging with local lexicons to build indexes for textually-specified spatial data.

M. D. Lieberman and H. Samet. Multifaceted Toponym Recognition for Streaming News. Proceedings of the ACM SIGIR Conference
M. D. Lieberman and H. Samet

Adaptive Context Features for Toponym Resolution in Streaming News.

Spatial Data Structure applets

Intended audience and pre-requisites. Practitioners working in the areas of spatial databases, geographic information systems, and location-based services will be given a different perspective on data structures found to be useful in most applications. Familiarity with computer terminology and some programming experience is needed to follow this course.

Bio. Hanan Samet is a Distinguished University Professor of Computer Science at the University of Maryland, College Park and is a member of the Institute for Computer Studies. He is also a member of the Computer Vision Laboratory at the Center for Automation Research where he leads a number of research projects on the use of hierarchical data structures for database applications, geographic information systems, computer graphics, computer vision, image processing, games, robotics, and search. He received the B.S. degree in engineering from UCLA, and the M.S. Degree in operations research and the M.S. and Ph.D. degrees in computer science from Stanford University. His doctoral dissertation dealt with proving the correctness of translations of LISP programs which was the first work in translation validation and the related concept of proof-carrying code. He is the author of the recent book `` Foundations of Multidimensional and Metric Data Structures'' published by Morgan-Kaufmann, an imprint of Elsevier, in 2006, an award winner in the 2006 best book in Computer and Information Science competition of the Professional and Scholarly Publishers (PSP) Group of the American Publishers Association (AAP), and of the first two books on spatial data structures ``Design and Analysis of Spatial Data Structures'', and ``Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS,'' both published by Addison-Wesley in 1990. He is the Founding Editor-In-Chief of the ACM Transactions on Spatial Algorithms and Systems (TSAS), the founding chair of ACM SIGSPATIAL, a recipient of a Science Foundation of Ireland (SFI) Walton Visitor Award at the Centre for Geocomputation at the National University of Ireland at Maynooth (NUIM), 2009 UCGIS Research Award, 2010 CMPS Board of Visitors Award at the University of Maryland, 2011 ACM Paris Kanellakis Theory and Practice Award, 2014 IEEE Computer Society Wallace McDowell Award, and a Fellow of the ACM, IEEE, AAAS, IAPR (International Association for Pattern Recognition), and UCGIS (University Consortium for Geographic Science). He received best paper awards in the 2007 Computers & Graphics Journal, the 2008 ACM SIGMOD and SIGSPATIAL ACMGIS Conferences, the 2012 SIGSPATIAL MobiGIS Workshop, and the 2013 SIGSPATIAL GIR Workshop, as well as a best demo award at the 2011 SIGSPATIAL ACMGIS'11 Conference. His paper at the 2009 IEEE International Conference on Data Engineering (ICDE) was selected as one of the best papers for publication in the IEEE Transactions on Knowledge and Data Engineering. He was elected to the ACM Council as the Capitol Region Representative for the term 1989-1991, and is an ACM Distinguished Speaker.

Jaideep Srivastava (Qatar ComputingResearch Institute), [intermediate] Social Computing: Computing as an Integral Tool to Understanding Human Behavior and Solving Problems of Social Relevance

Summary. Social Computing is an emerging discipline, and just like any discipline at a nascent stage it can often mean different things to different people. However, there are three distinct threads that are emerging. First thread is often called Socio-Technical Systems, which focuses on building systems that allow large scale interactions of people, whether for a specific purpose or in general. Examples include social networks like Facebook and Google Plus, and Multi Player Online Games like World of Warcraft and Farmville. The second thread is often called Computational Social Science, whose goal is to use computing as an integral tool to push the research boundaries of various social and behavioral science disciplines, primarily Sociology, Economics, and Psychology. Third is the idea of solving problems of societal relevance using a combination of computing and humans.

The three modules of this course are structured according to this description. The goal of this course is to discuss, in a tutorial manner, through case studies, and through discussion, what Social Computing is, where it is headed, and where is it taking us.

Syllabus

Module 1: Socio-technical systems

Introduction to Social Computing
Socio-technical systems

Examples of a number of social computing systems, e.g. Twitter, FaceBook, MMO games, etc.

Applying data mining to social computing systems

Module 2: Computational Social Science

Online trust
Social influence
Individual and group/team performance
Identifying and preventing bad behavior

Module 3: Solving Problems of Societal Relevance

Social computing for humanitarian assistance
Wrap-up discussion

Privacy and ethics
Where are we headed

References. Will provide later.

Pre-requisites. This course is intended primarily for graduate students. Following are the potential audiences:

Computer Science graduate students: All that is needed for this audience is interest in one of the themes of social computing
Social Science graduate students: Some exposure to building models from data, at least what these techniques are and what they can do
Management graduate students: Those with MIS focus

Bio. Jaideep Srivastava is the Director of the Social Computing division at QCRI. He is on leave from the University of Minnesota, where he directs a laboratory focusing on research in Web Mining, Social Analytics, and Health Analytics. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), has been an IEEE Distinguished Visitor, and is a Distinguished Fellow of Allina’s Center for Healthcare Innovation. He has been awarded the Distinguished Research Contributions Award of the PAKDD, for his lifetime contributions to the field of data mining. Six of his papers have won best paper awards.

Dr. Srivastava is currently co-leading a multi-institutional, multi-disciplinary project in the rapidly emerging area of social computing. He has significant experience in the industry, in both consulting and executive roles. He has led a data mining team at Amazon.com, built a data analytics department at Yodlee, and served as the Chief Technology Officer for Persistent Systems. He is a Co-Founder of Ninja Metrics, and an adviser and Chief Scientist of CogCubed, an innovative company whose goal is to revolutionize the diagnosis and therapy of cognitive disorders through the use of online games.

Dr. Srivastava has held distinguished professorships at Heilongjiang University and Wuhan University, China. He has held advisory positions with the State of Minnesota, and the State of Maharashtra, India. He is a technology advisor to the Unique ID (UID) project of the Government of India, whose goal is to provide biometrics-based social security numbers to the 1.2 Billion citizens of India. He has a Bachelors of Technology from the Indian Institute of Technology (IIT), Kanpur, India, and MS and PhD from the University of California, Berkeley.

Jeffrey Ullman (Stanford University), [introductory] Big Data Algorithms that Aren't Machine Learning

Summary. We shall study algorithms that have been found useful in querying large data volumes. The emphasis is on algorithms that cannot be considered "machine learning.

Syllabus.

Locality-sensitive hashing: shingling, minhashing, applications;
PageRank and related ideas: hubs-and-authorities, spam detection, topic-specific PageRank;
Stream-processing algorithms: counting occurrences, counting unique values, sampling;
Graph-processing algorithms: centrality, counting neighborhoods, counting triangles.

References. We will be covering (parts of) Chapters 3, 4, 5, and 10 of the free text

Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets

Pre-requisites. A course in algorithms at the advanced-undergraduate level is important. A course in database systems is helpful, but not required.

Bio. Stanford page.

Alexandre Vaniachine (Argonne National Laboratory), [introductory/advanced] Big Data: Comparison with Computational Models

Summary. The scientific goals of the Large Hadron Collider include high precision tests of the Standard Model and searches for new physics. These goals require detailed comparison of data with computational models simulating the expected data behavior. We highlight the role which modeling and simulation plays in scientific discovery and experience with methods for processing real and simulated data growing in volume and variety.

Syllabus

Do not let the data speak for themselves: comparison of data-driven vs. model-based approaches; signal and noise. (2 hours, introductory)
Big Data Processing in High Energy Physics: Distributed Computing; Data Science methods in Higgs boson discovery. (2 hours, intermediate)
Higgs Boson Machine Learning Challenge: practical data analysis. (2 hours, advanced)

References

http://www.fourthparadigm.org
Bird I, Computing for the Large Hadron Collider. Annu. Rev. Nucl. Part. S. (2011) 61: 99
A.V. Vaniachine on behalf of the ATLAS and CMS Collaborations. Advancements in Big Data Processing in the ATLAS and CMS Experiments, arXiv:1303.1950, 2013
ATLAS collaborationDataset from the ATLAS Higgs Boson Machine Learning Challenge, CERN Open Data Portal. DOI:10.7483/OPENDATA.ATLAS.ZBP2.M5T8, 2014
Adam-Bourdarios C, Cowan G, Germain C, Guyon I, Kégl B et al. Learning to discover: the Higgs boson machine learning challenge - Documentation. CERN Open Data Portal. DOI:10.7483/OPENDATA.ATLAS.MQ5J.GHXA, 2014

Pre-requisites. Lecture 3: familiarity with python, machine learning and multivariate analysis will be useful, but is not required to follow this lecture.

Bio. Argonne page.

Xiaowei Xu (University of Arkansas, Little Rock), [introductory/advanced] Big Data Analytics for Social Networks

Summary. Recent explosive growth of online social networks such as Facebook and Twitter provides a unique opportunity for many data mining applications including real time event detection, community structure detection and viral marketing. The course covers big data analytics for social networks. The emphasis will be on scalable algorithms for community structure detection and structural pattern mining.

Syllabus

Modularity-based community structure detection algorithms [1]
Structural clustering algorithms for networks [2]
Label propagation algorithms [3]
Social tie modeling and characterization [4]
Parallel algorithms for big social networks [5]

References

Aaron Clauset, M. E. J. Newman, and Cristopher Moore, Finding community structure in very large networks, Phys. Rev. E 70, 066111, 2004
X. Xu, N. Yuruk, Z. Feng, and T. A. Schweiger. Scan: a structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 824–833. ACM, 2007
Raghavan, Usha Nandini and Albert, Reka and Kumara, Soundar, Near linear time algorithm to detect community structures in large-scale networks, Phys. Rev. E 76, 036106, 2007
S. Sintos and P. Tsaparas. Using strong triadic closure to characterize ties in social networks. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1466–1475. ACM, 2014
Weizhong Zhao Venkata Swamy Martha Xiaowei Xu PSCAN: A Parallel Structural Clustering Algorithm for Big Networks in MapReduce. 862-869 2013 AINA

Pre-requisites. Basic knowledge in computer algorithms and graph theory.

Bio. Professor Xiaowei Xu is a professor in the Department of Information Science at the University of Arkansas at Little Rock (UALR). He received his Ph.D. in computer science from the University of Munich in 1998. Prior to his appointment at UALR, Dr. Xu was a senior research scientist in Siemens Corporate Technology. Dr. Xu is adjunct professor in the Department of Mathematics at the University of Arkansas. Dr. Xu was an Oak Ridge Institute for Science and Education (ORISE) Faculty Research Program Member in the National Center for Toxicological Research’s (NCTR) Center for Bioinformatics in the Division of Systems Biology from 2010 to 2014. He is also a consultant for companies including Siemens, Acxiom, Dataminr and Neusoft. Dr. Xu’s research focuses on algorithms for data mining and machine learning. Dr. Xu is a recipient of 2014 ACM SIGKDD Test of Time Award for his work in density-based clustering algorithm (DBSCAN), which has received over 7,000 citations based on Google Scholar. Dr. Xu is program committee members and session chairs for premier forums including ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), and IEEE International Conferences on Data Mining (ICDM).

Mohammed J. Zaki (Rensselaer Polytechnic Institute), [introductory/intermediate] Large Scale Graph Analytics and Mining

Summary. Increasingly, today's massive data is in the form of complex graphs and networks. Examples include the world wide web, social networks, biological networks, semantic networks, and so on. The study of such complex and large scale networks (often referred to as network science) – understanding their intrinsic properties, changes to their structure over time or due to other external factors, and the behavior of entities and communities within them – can afford important insight to domain researchers and organizations alike. Given that networks are part and parcel of the complex and connected social, physical and biological world we live in, a coordinated and concerted approach combining these strands of research is essential to make progress on this grand challenge complex systems problem of our times.

In this course, we study the fundamental algorithms to model and mine graph data. We focus on graph and network modeling, graph pattern mining, as well as graph clustering and classification tasks. We will also cover both the algorithms and frameworks for parallel and distributed graph mining over massive graphs.

Syllabus

Graph Data Analysis
Graph Models
Graph Pattern Mining
Graph Clustering
Graph Classification
Parallel and Distributed Graph Mining

References. We will cover material from Chapters 4,5,11,13,16 from our freely available textbook (and other supplements)

Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, 2014

Pre-requisites. Introductory courses in discrete mathematics, linear algebra, and probability and statistics.

Bio. Mohammed J. Zaki is a Professor of Computer Science at RPI. He received his Ph.D. degree in computer science from the University of Rochester in 1998. His research interests focus on developing novel data mining techniques, especially for applications in bioinformatics and social networks. He has published over 225 papers and book-chapters on data mining and bioinformatics, including the textbook "Data Mining and Analysis: Fundamental Concepts and Algorithms," Cambridge University Press, 2014. He is the founding co-chair for the BIOKDD series of workshops. He is currently Area Editor for Statistical Analysis and Data Mining, and an Associate Editor for Data Mining and Knowledge Discovery, ACM Transactions on Knowledge Discovery from Data, and Social Networks and Mining. He was the program co-chair for SDM'08, SIGKDD'09, PAKDD'10, BIBM'11, CIKM'12, and ICDM'12. He is currently the program co-chair for IEEE BigData'15. He is currently serving on the Board of Directors for ACM SIGKDD. He received the National Science Foundation CAREER Award in 2001 and the Department of Energy Early Career Principal Investigator Award in 2002. He received an HP Innovation Research Award in 2010, 2011, and 2012, and a Google Faculty Research Award in 2011. He is a senior member of the IEEE, and an ACM Distinguished Scientist.

Contents

Links of Interest

Active Links

Past Links

Contact Information

2nd International Winter School on Big Data

Bilbao, Spain, February 8-12, 2016

Keynote Speakers

Nektarios Benekos (European Organization for Nuclear Research), Role of Computing and Software in Particle Physics

Chih-Jen Lin (National Taiwan University), When and When Not to Use Distributed Machine Learning

Jeffrey Ullman (Stanford University), Theory of MapReduce Algorithms

Alexandre Vaniachine (Argonne National Laboratory), Big Data Technologies and Data Science Methods in the Higgs Boson Discovery

Nektarios Benekos (European Organization for Nuclear Research), [introductory/intermediate] Exploring the Mysteries of our Cosmos: the Big Deal between Big Data and Big Science

Hendrik Blockeel (KU Leuven), [intermediate] Decision Trees for Big Data Analytics

Nello Cristianini (University of Bristol), [introductory] THINKBIG: Towards Large Scale Computational Social Sciences, History and Digital Humanities

Ernesto Damiani (University of Milan and EBTIC/Khalifa University), [introductory/intermediate] Architectures, Models and Tools for Big-Data-as-a-Service

Francisco Herrera (University of Granada), [introductory] Big Data Preprocessing

Chih-Jen Lin (National Taiwan University), [introductory/intermediate] Large-scale Linear Classification

George Karypis (University of Minnesota), [intermediate/advanced] Scaling Up Recommender Systems

Geoff McLachlan (University of Queensland), [intermediate/advanced] Big Data Extensions of Some Methods of Classification and Clustering

Wladek Minor (University of Virginia), [introductory/intermediate] Big Data in Biomedical Sciences

Raymond Ng (University of British Columbia), [introductory/intermediate] Mining and Summarizing Text Conversations

Sankar K. Pal (Indian Statistical Institute), [introductory/advanced] Machine Intelligence and Granular Mining: Relevance to Big Data

Erhard Rahm (University of Leipzig), [introductory/intermediate] Scalable and Privacy-preserving Data Integration

Hanan Samet (University of Maryland), [introductory/intermediate] Sorting in Space: Multidimensional, Spatial, and Metric Data Structures for Applications in Spatial Databases, Geographic Information Systems (GIS), and Location-based Services

Jaideep Srivastava (Qatar ComputingResearch Institute), [intermediate] Social Computing: Computing as an Integral Tool to Understanding Human Behavior and Solving Problems of Social Relevance

Jeffrey Ullman (Stanford University), [introductory] Big Data Algorithms that Aren't Machine Learning

Alexandre Vaniachine (Argonne National Laboratory), [introductory/advanced] Big Data: Comparison with Computational Models

Xiaowei Xu (University of Arkansas, Little Rock), [introductory/advanced] Big Data Analytics for Social Networks

Mohammed J. Zaki (Rensselaer Polytechnic Institute), [introductory/intermediate] Large Scale Graph Analytics and Mining

2^nd International Winter School on Big Data