★ The internship will take place at the FOX team of CRIStAL laboratory, at the University of Lille.
Summary: In this project, we will work on time series analysis. Its main tasks are the following:
- Time series classification: Time series classification is an important topic in time series analysis and has been applied in many areas, such as finance, biometrics, networking, and artificial intelligence. Given a set of p time series, the goal is to assign each of them to one of several predefined classes. A key step in this process is computing the distance between each of the p time series and the other time series in a dataset D. Hence, in time series data mining, a distance function or similarity measure is often used to describe the relationship between two time series.
- Motif discovery in time series: Time series motifs are approximately repeating patterns in real-valued time series data. They are useful for exploratory data mining and are typically used as inputs for various time series clustering, classification, segmentation, rule discovery, and visualization algorithms. If a time series pattern is conserved, we may assume that some high-level atomic mechanism or behavior causes that pattern to be conserved. That behavior may be desirable in certain cases (e.g., a perfect badminton shot) or undesirable in others (e.g., the cough of a sick pig). In either case, the discovery of motifs is typically the first step in various kinds of higher-level time series analytics.
- Anomaly detection in time series: Outlier detection in time series analysis (typically called anomaly detection) is the process of identifying data points or sequences that deviate significantly from the established pattern of a temporal dataset. Unlike static data, time series outliers are tricky because a value might be perfectly normal in one month but a major red flag in another.
For time series data analysis, the matrix profile is an important technique. It is based on the all-pairs-similarity-search problem (also known as similarity join), which comes in several variants. The basic task is this: given a collection of data objects, retrieve the nearest neighbor of each object. Numerous algorithms have already been proposed for computing the matrix profile, e.g., STAMP, STOMP, and SCRIMP++. All these algorithms share two characteristics:
- They use the z-normalized Euclidean distance to measure the distance between subsequences.
- They were developed mainly to handle 1D time series data.
However, as our earlier work illustrates, for some datasets the matrix profile based on the non-normalized (classical) Euclidean distance is more useful. Thus, efficient matrix profile techniques based on both z-normalized and non-normalized distances are necessary for knowledge extraction from different time series datasets.
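To make the difference between the two distances concrete, here is a minimal brute-force sketch of a matrix profile in pure Python. It is only an illustration, not one of the algorithms cited above: the function names `znorm` and `matrix_profile_naive`, the exclusion zone of `m // 2`, and the convention of mapping constant subsequences to all zeros are assumptions made for this example.

```python
import math


def znorm(x):
    """Rescale x to zero mean and unit (population) standard deviation."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    if std == 0.0:
        # Constant subsequence: convention chosen for this sketch only.
        return [0.0] * len(x)
    return [(v - mean) / std for v in x]


def matrix_profile_naive(ts, m, normalize=True):
    """Brute-force matrix profile of ts for subsequence length m.

    Returns (profile, index): for each subsequence, the distance to its
    nearest non-trivial neighbour and the start position of that neighbour.
    """
    n_sub = len(ts) - m + 1
    subs = [ts[i:i + m] for i in range(n_sub)]
    if normalize:
        subs = [znorm(s) for s in subs]
    excl = max(1, m // 2)  # hypothetical exclusion zone to skip trivial matches
    profile = [math.inf] * n_sub
    index = [-1] * n_sub
    for i in range(n_sub):
        for j in range(n_sub):
            if abs(i - j) < excl:
                continue
            d = math.dist(subs[i], subs[j])
            if d < profile[i]:
                profile[i], index[i] = d, j
    return profile, index


# The shape [0, 1, 0] reappears as [10, 11, 10] at a different offset.
ts = [0, 1, 0, 4, 10, 11, 10, 4, 7, 3]
profile_z, index_z = matrix_profile_naive(ts, 3, normalize=True)
profile_raw, index_raw = matrix_profile_naive(ts, 3, normalize=False)
print(index_z[0], index_raw[0])
```

With z-normalization, the first subsequence is matched to the shifted copy of the same shape; without it, the nearest neighbor is the subsequence that is closest in raw amplitude. Which behavior is preferable depends on whether the offset and scale carry meaning in the dataset at hand.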
AAMP and ACAMP: In one of our papers, we proposed such efficient techniques. We first proposed an efficient algorithm called AAMP for computing the matrix profile with the non-normalized Euclidean distance. We then extended this algorithm to the p-norm distance. We also proposed two algorithms, called ACAMP and ACAMP-Optimized, that use the same principle as AAMP but compute the matrix profile with the z-normalized Euclidean distance.
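As a rough illustration of why the non-normalized distance can be computed efficiently, the sketch below updates the squared Euclidean distance in constant time when both subsequences slide by one position along a diagonal of the distance matrix. This is an assumption-laden sketch written for this description, not the published AAMP implementation; the function name `matrix_profile_diagonal` is hypothetical, and no exclusion zone is applied beyond skipping the comparison of a subsequence with itself.

```python
import math


def matrix_profile_diagonal(ts, m):
    """Non-normalized matrix profile with an O(1) update along each diagonal."""
    n_sub = len(ts) - m + 1
    best_sq = [math.inf] * n_sub
    for k in range(1, n_sub):  # k = offset between the two subsequences
        # Full O(m) computation only for the first pair (0, k) of the diagonal.
        d_sq = sum((ts[p] - ts[p + k]) ** 2 for p in range(m))
        best_sq[0] = min(best_sq[0], d_sq)
        best_sq[k] = min(best_sq[k], d_sq)
        for i in range(1, n_sub - k):
            j = i + k
            # Slide both windows by one step: add the newest pair of points
            # and drop the pair that just left the windows.
            d_sq += (ts[i + m - 1] - ts[j + m - 1]) ** 2
            d_sq -= (ts[i - 1] - ts[j - 1]) ** 2
            best_sq[i] = min(best_sq[i], d_sq)
            best_sq[j] = min(best_sq[j], d_sq)
    # Guard against tiny negative values caused by floating-point round-off.
    return [math.sqrt(max(v, 0.0)) for v in best_sq]


ts = [0, 1, 0, 4, 10, 11, 10, 4, 7, 3]
print(matrix_profile_diagonal(ts, 3))
```

Each diagonal costs O(m) for its first entry and O(1) for every later entry, so the whole profile takes time quadratic in the series length and independent of m after the first entry of each diagonal. With floating-point input, the running sum can accumulate round-off error on long diagonals, which a real implementation would need to monitor.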
Processing Multi-Dimensional Data: Since the introduction of the first motif discovery algorithm for univariate time series, many researchers have tried to generalize motifs to the multidimensional case. However, almost all of these efforts attempt to find motifs on all dimensions. It has been shown in the literature, and we also believe, that using all dimensions will generally not produce meaningful motifs, except in the most contrived situations.
In this context, we plan to work on the following topics:
- Continue our work on the AAMP and ACAMP algorithms
- Adapt AAMP and ACAMP to multidimensional data for motif discovery and anomaly detection
- Propose algorithms for outlier detection in multidimensional time series
Time Series Classification: The problem of time series classification has been intensively investigated during the last few years, helped by the increase in the computational capabilities of computers. Being able to compute the distance between time series is particularly important, and many distance measures for time series similarity exist. Nevertheless, the simple method combining the nearest neighbor (1NN) classifier with some form of dynamic time warping (DTW) distance has been shown to be one of the best-performing time series classification techniques. DTW is a classical and well-established distance measure that is well suited to comparing time series.
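The 1NN plus DTW combination can be sketched in a few lines of pure Python. This is a minimal, unconstrained version with a squared pointwise cost and no warping window or lower bounding; the function names `dtw_distance` and `classify_1nn` and the toy data are assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping, squared pointwise cost."""
    n, m = len(a), len(b)
    # prev holds row i-1 of the cumulative cost table, curr holds row i.
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])


def classify_1nn(query, train_series, train_labels):
    """Return the label of the training series closest to query under DTW."""
    best = min(range(len(train_series)),
               key=lambda k: dtw_distance(query, train_series[k]))
    return train_labels[best]


train_series = [[0, 0, 1, 2, 1, 0], [2, 2, 1, 0, 1, 2]]
train_labels = ["up", "down"]
print(classify_1nn([0, 1, 2, 2, 1, 0, 0], train_series, train_labels))
```

The query has a different length and a stretched peak compared with the first training series, yet DTW aligns them at zero cost, which is the property that makes it attractive for series that differ by local time shifts.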
Previously, we performed an experimental study of many such distance computation algorithms for time series analysis. These are dynamic programming-based techniques, such as DTW and its variants. We also included other dynamic programming-based methods that are not derived from DTW. The study was performed on the matching of scanned word images (a domain known as Word Spotting). In this project, we would like to extend this experimental study to 1D time series analysis. In this context, we will perform the following work:
- Perform a comparative experimental study of all these algorithms on 1D time series datasets.
- For this experimental study, we will use the UCR Time Series Archive, which contains 128 different time series datasets.
Time series analysis using machine learning: For time series classification, the literature reports that Convolutional Neural Networks (CNNs) have outperformed approaches that are not based on machine learning at capturing the relevant features in time series. In a CNN, feature extraction consists in computing linear combinations of a fixed number of consecutive time steps. The deeper the model, the larger its receptive field, i.e., the region of the input that a unit at a given depth of the network depends on. A larger receptive field is generally beneficial, since it lets the model capture longer-range temporal patterns. In this context, we will work in the following direction:
- We will explore in more detail the various deep learning-based models for time series analysis.
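The relation between depth and receptive field can be made concrete with a small calculation. The sketch below applies the standard receptive field recurrence for a stack of 1D convolutions; the function name `receptive_field` and the `(kernel_size, stride, dilation)` tuple layout are assumptions chosen for this example, and pooling layers can be described the same way.

```python
def receptive_field(layers):
    """Receptive field (in input time steps) of stacked 1D convolutions.

    layers: list of (kernel_size, stride, dilation) tuples, ordered from
    the input to the output of the network.
    """
    rf = 1    # a single input sample sees only itself
    jump = 1  # distance, in input steps, between adjacent units of a layer
    for kernel_size, stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf


# Three plain convolutions with kernel size 3.
print(receptive_field([(3, 1, 1)] * 3))
# Same depth, with dilations 1, 2, 4.
print(receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)]))
```

With plain convolutions, the receptive field grows linearly with depth, whereas doubling the dilation at each layer makes it grow exponentially for the same number of parameters.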
Desired Profile :
- Final-year Master’s student (M2) or engineering student specializing in machine learning, data analysis, or a related field.
- Knowledge of machine learning and deep learning.
- Strong mathematical background.
- Programming skills (Python).
- Autonomy, rigor, and critical thinking skills.
Address of the Internship :
CAMPUS Haute-Borne CNRS IRCICA-IRI-RMN
Parc Scientifique de la Haute Borne, 50 Avenue Halley, BP 70478, 59658 Villeneuve d’Ascq Cédex, France.
Candidature :
If this proposal interests you, please send the following documents to Dr. Tanmoy MONDAL (tanmoy.mondal@univ-lille.fr), Damien MARCHAL (damien.marchal@univ-lille.fr), Redha KASSI (Redha.Kassi@univ-lille.fr), and Hedia MARZOUKI (hedia.marzouki@univ-lille.fr):
- CV
- Motivation Letter
- Transcripts of grades obtained in Bachelor’s/Master’s/Engineering school as well as class ranking
- Name and contact details of at least one reference person who can be contacted if necessary
