Research
I want to do a PhD, so I've made this page to gather all the research work I'll be doing during my internships and within the research initiation programme I'm attending. I'm not sure what I want to specialise in yet, but the topics I feel drawn to are combinatorial optimisation, very large databases and AI.

My profile sits at the intersection of the maths nerd, the computer nerd, the philosophy nerd and the art nerd. This helps me fit into various environments, but I feel that bringing original, multilayered approaches to problems can be my real strength in research.

Dynamic query optimisation in data integration systems

This is the work I began in June 2024 for my research initiation programme at LIMOS, supervised by Jean-Philippe Gayon for the general research aspects and Maxime Buron for the technical aspects. This section introduces the subject and the first ideas of my work. A work-in-progress paper and some code are available on this GitLab repo.

Goals

The problem arises in a big data environment, specifically in data integration systems, whose goal is to provide homogeneous access to data from different sources. A major characteristic of data integration systems is the need to work with data sources that evolve independently, use different formats, have different access times, and come with little available information about the data they hold. This is a key difference from regular DBMSs, and it calls for specific optimisation techniques, as the following paragraphs will show.

Data integration systems share the goal of DBMSs of providing semantics-oriented access to data, answering human queries expressed in a unified language. We have chosen to focus on the relational model, where queries are parsed as logical expressions over relational algebra. An expression that answers a particular query is called a "query plan", and each query admits several equivalent plans, which produce the same output from the same sources but do so through a different sequence of operators. Although such plans are algebraically equivalent, the difference between their computation times can be arbitrarily large.
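To make this concrete, here is a minimal sketch in Python (with toy relations and attribute names I made up for illustration) of two algebraically equivalent plans for the same three-way join: both return the same rows, but the intermediate results they build differ wildly in size.

```python
# Toy relations, each a list of dicts; attribute names are made up for illustration.
R = [{"a": i, "b": i % 3} for i in range(100)]
S = [{"b": j, "c": j} for j in range(3)]
T = [{"c": 0, "d": "x"}]

def join(left, right, attr):
    """Naive nested-loop join on a single shared attribute."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]

# Plan 1: (R join S) join T  -> the intermediate R join S has 100 rows
rs = join(R, S, "b")
plan1 = join(rs, T, "c")

# Plan 2: R join (S join T)  -> the intermediate S join T has only 1 row
st = join(S, T, "c")
plan2 = join(R, st, "b")

# Same answer, very different intermediate work.
assert sorted(map(tuple, (sorted(r.items()) for r in plan1))) == \
       sorted(map(tuple, (sorted(r.items()) for r in plan2)))
print(len(rs), len(st))  # 100 vs 1
```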

The goal of query optimisation is to search for the best, or at least a good enough, query plan. For some operators, optimisation techniques exist that are guaranteed to be optimal; this is not the case for joins. Joins are a major operator in relational algebra and can appear in large numbers in complex queries. Unfortunately, their result sizes are difficult to predict, they are computationally expensive, and the number of ways of ordering them becomes impossible to explore exhaustively for queries with many joins.
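As a rough back-of-the-envelope illustration of that combinatorial explosion (standard counting results, not code from the project): the number of left-deep join orders over n sources is n!, and allowing arbitrary bushy join trees raises it to (2(n-1))!/(n-1)!.

```python
from math import factorial, comb

def left_deep_orders(n):
    """Number of left-deep join orders over n sources."""
    return factorial(n)

def bushy_plans(n):
    """Number of bushy join trees: n! orderings times the Catalan number C(n-1)."""
    return factorial(n) * comb(2 * (n - 1), n - 1) // n

for n in (3, 5, 10):
    print(n, left_deep_orders(n), bushy_plans(n))
# 10 sources already give 3,628,800 left-deep orders and ~17.6 billion bushy trees.
```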

Regular relational DBMSs address this problem by storing information about the data, which allows them to make decent predictions about the size of joins. However, this method can't be applied to a data integration setting, because the sources are independent. The ideas I am exploring fall under "dynamic" query optimisation: the aim is to start processing a sub-optimal plan and to switch to a better one during execution, thanks to information gained along the way. My work can be summarised as seeking answers to these questions:

- What information should we collect to be able to determine better plans when we have zero initial knowledge about our data sources?
- How to switch from one plan to another without losing the results computed so far?
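In pseudocode-like Python, the overall control loop I have in mind could look roughly like the hedged sketch below; every function name here is a hypothetical placeholder, and the hard part, hidden behind switch_without_losing_work, is precisely what these questions are about.

```python
def run_adaptively(plan, update_statistics, cheapest_equivalent_plan,
                   estimated_cost, switch_without_losing_work):
    """Hypothetical sketch of dynamic query optimisation: execute the current plan,
    refine statistics from the tuples seen so far, and switch to a cheaper
    equivalent plan whenever one is found. All helpers are placeholders."""
    stats = {}
    while not plan.finished():
        batch = plan.process_next_batch()        # produce some results
        stats = update_statistics(stats, batch)  # learn about the sources
        candidate = cheapest_equivalent_plan(plan, stats)
        if estimated_cost(candidate, stats) < estimated_cost(plan, stats):
            plan = switch_without_losing_work(plan, candidate)  # the hard part
        yield from batch
```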

Pipelining joins

Among the questions listed above, the first I was asked to look at was the last one: "How to switch from one plan to another without losing the results computed so far?" To begin with, we only considered plans in which sources are joined over a single shared attribute, and ignored all the other operators in the query. Maxime Buron had hypothesised that the "double-pipelined hash join", a specific algorithm for processing joins, might be useful in this case. I implemented this algorithm in Tatooine, a data integration simulation project that Maxime had already worked on. We improved this implementation until it reached performance similar to other join algorithms, justifying its use in practice.
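For readers unfamiliar with it, here is a minimal, self-contained sketch of the double-pipelined (symmetric) hash join idea, written independently of the Tatooine code base: each incoming tuple is inserted into the hash table of its own side and immediately probed against the table of the other side, so results stream out as soon as both matching tuples have arrived.

```python
from collections import defaultdict

def double_pipelined_hash_join(stream, attr):
    """Minimal sketch of a double-pipelined (symmetric) hash join.
    `stream` yields (side, tuple) pairs, side being 0 or 1, in arrival order."""
    tables = (defaultdict(list), defaultdict(list))  # one hash table per input
    for side, tup in stream:
        key = tup[attr]
        tables[side][key].append(tup)          # build: remember this tuple
        for other in tables[1 - side][key]:    # probe: match against the other side
            yield {**other, **tup} if side else {**tup, **other}

# Usage: tuples may arrive interleaved, in any order, from either source.
stream = [(0, {"b": 1, "x": "r1"}), (1, {"b": 1, "y": "s1"}), (0, {"b": 1, "x": "r2"})]
print(list(double_pipelined_hash_join(iter(stream), "b")))
```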

The next step of our work was to show how this implementation allows us to move from one plan to "neighbouring" plans in constant time, under our hypotheses. Our idea was to start with a plan, search for better ones among its neighbours, apply the transformation if one is found, and repeat this sequence several times. Unfortunately, this "step-by-step" approach turned out to be quite limited in our more recent work. However, we still think that the double-pipelined hash join has properties that are useful for transferring the work already computed from one plan to another.
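Under the hypotheses above (all joins on one shared attribute), a plan can essentially be read as an ordering of the sources, so one way to picture the step-by-step approach is greedy local search over adjacent swaps; the sketch below is only an illustration of that reading, with a placeholder cost function, not the actual implementation.

```python
def neighbours(order):
    """Plans reachable in one step: swap two adjacent sources in the join order."""
    for i in range(len(order) - 1):
        n = list(order)
        n[i], n[i + 1] = n[i + 1], n[i]
        yield tuple(n)

def step_by_step_search(order, cost):
    """Greedy local search: move to a cheaper neighbouring plan while one exists.
    As noted above, this approach turned out to be quite limited in practice."""
    improved = True
    while improved:
        improved = False
        for cand in neighbours(order):
            if cost(cand) < cost(order):
                order, improved = cand, True
                break
    return order
```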

                 After one week I made a poster to present my work. Some ideas and results are now outdated, but the main part of it is still a good introduction to my subject. It can be found here.

Dynamic histogram generation

I have only had three hours devoted to research in my timetable for the last few months, so it takes time to make progress. However, I finally came up with some interesting ideas to answer the question "What information should we collect to be able to determine better plans when we have zero initial knowledge about our data sources?". The main difficulty with this question is the lack of any hypothesis about the sources. Some heuristics exist to approximate the size of the output of a join, but they require at least the sizes of the sources as input, and more precise information about the data partitioning to be effective.
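One classic heuristic of that kind (a textbook equi-join estimate, not something specific to my work) shows exactly what inputs are needed:

```python
def estimate_join_size(size_r, size_s, distinct_r, distinct_s):
    """Textbook equi-join cardinality estimate:
    |R join S| is approximately |R| * |S| / max(V(R, a), V(S, a)),
    where V(X, a) is the number of distinct values of the join attribute a in X.
    It already needs the source sizes and distinct-value counts as input."""
    return size_r * size_s / max(distinct_r, distinct_s)

print(estimate_join_size(10_000, 2_000, 500, 100))  # about 40,000 result tuples
```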

Histograms are a common way of storing information about the data partitioning within a source, and can provide good estimates of join results. There are different types of histograms (equal-width, equal-height), but unfortunately they are unusable when the data is explored sample by sample from scratch. My idea is an algorithm that computes a kind of histogram that dynamically adapts to the data partitioning. My first rough tests seem promising, and if it proves effective, it could be a very useful tool for dynamic query optimisation.
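To fix ideas, here is a rough, hypothetical sketch of what I mean, not the actual algorithm, whose details are still moving: buckets are split whenever they become too heavy, so the bucket boundaries progressively adapt to the values seen so far.

```python
import bisect

class AdaptiveHistogram:
    """Rough sketch of a histogram built sample by sample from scratch:
    a bucket that grows beyond `max_count` is split at its midpoint,
    so bucket boundaries gradually follow the data distribution."""

    def __init__(self, lo, hi, max_count=100):
        self.bounds = [lo, hi]     # bucket i covers [bounds[i], bounds[i+1])
        self.counts = [0]
        self.max_count = max_count

    def add(self, value):
        i = bisect.bisect_right(self.bounds, value) - 1
        i = min(max(i, 0), len(self.counts) - 1)   # clamp values outside [lo, hi)
        self.counts[i] += 1
        if self.counts[i] > self.max_count:
            mid = (self.bounds[i] + self.bounds[i + 1]) / 2
            self.bounds.insert(i + 1, mid)
            # we do not know how the old bucket's values were distributed,
            # so split its count evenly between the two halves
            half = self.counts[i] // 2
            self.counts[i : i + 1] = [half, self.counts[i] - half]

    def estimate(self, lo, hi):
        """Estimated number of seen values falling in [lo, hi)."""
        total = 0
        for i, c in enumerate(self.counts):
            left, right = max(lo, self.bounds[i]), min(hi, self.bounds[i + 1])
            if right > left:
                total += c * (right - left) / (self.bounds[i + 1] - self.bounds[i])
        return total
```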

                 My current goals are to take this idea further. More specifically, here is what I would like to do before my internship in 2025: