A stylized map of Madrid overlaid with colored functional data curves representing pollution levels at different monitoring stations.

Unlocking Environmental Secrets: How We Cluster Complex Data, from Canadian Weather to Madrid’s Air

Alright, let’s talk about data that’s a bit… wiggly. Not just simple numbers, but stuff that changes over time or space – like temperature throughout the year, or pollution levels hour by hour across a city. When you have *multiple* of these wiggly things happening at once (say, temperature *and* precipitation, or PM10 *and* NO2), you’ve got what we statisticians call multivariate functional data. Analyzing this kind of data is super important, especially in areas like environmental science, but it’s also a bit of a puzzle.

See, traditional ways of looking at data often treat each point as separate. But with functional data, the *whole curve* matters. And with multivariate functional data, the curves for *each variable* matter, but so do the relationships *between* them. Think about it: temperature and precipitation aren’t independent; traffic density affects both PM10 and NO2. Capturing these interdependencies is key!

One big challenge is grouping these complex curves – a process called clustering. We want to find stations with similar air quality patterns, or weather stations with similar climate profiles. While clustering is well-established for simpler data, doing it right for multivariate functional data, considering those tricky interdependencies and the infinite-dimensional nature of functions, has been an active area of research. Lots of smart folks have worked on extending tools like functional principal component analysis or regression to this multivariate world, but clustering has lagged a bit behind, especially when it comes to methods that really look at the *shape* and *relative position* of the curves.

Why Multivariate Functional Data is Tricky

Imagine you’re tracking air quality at different spots in a city like Madrid. You’re not just getting one number per station; you’re getting a whole year’s worth of hourly readings for PM10, another series for NO2, maybe another for ozone, all at the same location. That’s multivariate functional data! Each station gives you a set of curves.

The tricky part?

  • These aren’t just points; they are continuous functions over time.
  • There are multiple variables (PM10, NO2) evolving together.
  • The relationship *between* PM10 and NO2 at a station is crucial.
  • Different stations might have similar overall levels but different *patterns* throughout the day or year.

Standard clustering methods often struggle with this. They might look at averages or summaries, but they lose the rich information in the curve shapes and how the different variables interact. We needed a way to capture this ‘shape’ and ‘position’ information in a way that works for multiple variables at once.

Enter Epigraph and Hypograph Indices

This is where things get interesting. For single-variable functional data, researchers came up with clever ideas like the epigraph and hypograph indices. Conceptually, these indices help you understand how a specific curve sits relative to a whole collection of other curves.

The basic idea is simple:

  • The epigraph index tells you, roughly, the proportion of other curves in your sample that sit *completely above* yours over the entire interval.
  • The hypograph index tells you the proportion of other curves that sit *completely below* yours.

Pretty neat for ordering curves from “bottom” to “top”! But there’s a catch: if your curves cross each other a lot (which they often do in real-world data), hardly any curve lies entirely above or below another, so these indices collapse towards the same uninformative value. So, modified versions (MEI and MHI) were introduced. These look at the *proportion of time* other curves are above or below yours, which is much more robust when curves intersect.
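If it helps to see the mechanics, here is a minimal Python sketch of the modified indices for a single variable. It assumes the curves are sampled on a common time grid and follows the convention described above (MEI: average proportion of time the sample curves sit above yours; MHI: below); the function name and toy data are purely illustrative, not code from our package.

```python
import numpy as np

def modified_indices(curves):
    """Modified epigraph/hypograph indices for univariate functional data.

    curves: (n_curves, n_timepoints) array, each row one curve on a common grid.
    Returns (MEI, MHI), one value per curve. MEI(x) = average proportion of
    time the sample curves lie above x; MHI(x) = same, but below x.
    """
    n = curves.shape[0]
    mei, mhi = np.empty(n), np.empty(n)
    for i in range(n):
        mei[i] = (curves >= curves[i]).mean()  # fraction of (curve, time) pairs above x_i
        mhi[i] = (curves <= curves[i]).mean()  # fraction of (curve, time) pairs below x_i
    return mei, mhi

# Toy example: 50 noisy sinusoids at slightly different levels
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)
sample = np.sin(2 * np.pi * t) + rng.normal(0, 0.3, (50, 1)) + rng.normal(0, 0.05, (50, 100))
mei, mhi = modified_indices(sample)
print(mei[:5], mhi[:5])  # low-lying curves get high MEI, high-lying curves get high MHI
```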

These indices have been used for cool stuff like creating functional boxplots or detecting outliers in single-variable functional data. But extending them to the multivariate world? That’s been the challenge.

Our New Take for the Multivariate World

Previous attempts to extend these indices to multiple variables often used a weighted average of the indices calculated for each variable separately. It’s like saying, “Okay, let’s see how this station’s PM10 curve compares to others, and how its NO2 curve compares, and then average those comparisons.” It’s a start, but it kind of misses the point of multivariate data – the *joint* behavior. It doesn’t naturally capture that crucial interdependency.

So, here’s the cool part: we proposed a novel formulation for the epigraph and hypograph indices specifically for multivariate functional data. Instead of averaging univariate comparisons, our new definition looks at the proportion of other multivariate curves where *all* their components (PM10 *and* NO2, for instance) are simultaneously above (for epigraph) or below (for hypograph) the corresponding components of the curve you’re looking at. Or, for the generalized versions (MEI and MHI), it’s the proportion of *time* that all components of other curves are simultaneously above or below yours.

This might sound like a small change, but it’s a big deal! It means our indices inherently consider the relationships *between* the variables. They provide a more integrated view of the multivariate functional data. Plus, they don’t require you to pick arbitrary weights for each variable, which is a common headache with the weighted average approach.
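To make the “simultaneously” part concrete, here is a hedged sketch of that multivariate definition in Python, assuming all curves are observed on a common grid and stacked into an array of shape (curves, variables, time points); again, the names are ours for illustration only.

```python
import numpy as np

def multivariate_mei_mhi(curves):
    """Multivariate epigraph/hypograph indices (illustrative sketch).

    curves: (n_curves, n_variables, n_timepoints), e.g. stations x {PM10, NO2} x hours.
    A sample curve counts towards MEI(x) at time t only if *all* of its components
    are simultaneously above the corresponding components of x (below, for MHI).
    """
    n = curves.shape[0]
    mei, mhi = np.empty(n), np.empty(n)
    for i in range(n):
        above_all = (curves >= curves[i]).all(axis=1)  # (n, T): every variable above x_i at t
        below_all = (curves <= curves[i]).all(axis=1)  # (n, T): every variable below x_i at t
        mei[i] = above_all.mean()
        mhi[i] = below_all.mean()
    return mei, mhi
```

Roughly speaking, if you replaced `.all(axis=1)` with a weighted average over the variables you would recover the older weighted-average construction; that one-line difference is exactly where the joint behaviour between variables enters.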

We didn’t just stop at defining them; we also dug into their theoretical properties, like how they behave under transformations (shifting or scaling the curves) and their consistency (how well the sample indices estimate the true population indices). Turns out, they behave nicely, making them suitable tools for analysis. We even showed how our new multivariate MEI and MHI relate back to the univariate ones and the weighted average versions, providing a clearer picture of what each definition captures.

A close-up of a complex knot, symbolizing the interdependencies within multivariate data.

Putting it to Work: The EHyClus Method

Okay, so we have these shiny new indices. How do we use them for clustering? We adapted a methodology called EHyClus, which was originally developed for clustering univariate functional data using the one-dimensional epigraph/hypograph indices.

EHyClus works in a few steps:

  • Prepare the data: Smooth out the raw data using something like cubic splines. This helps remove noise and makes it easy to calculate derivatives.
  • Get the derivatives: Calculate the first and second derivatives of the smoothed curves. Why? Because the *rate of change* and *acceleration* of the curves can also contain valuable information for clustering, sometimes even more than the original curves!
  • Apply the indices: This is where our new multivariate indices come in. We apply the MEI and MHI (we focus on these because the non-generalized EI and HI are often too restrictive with real data) to the original curves, the first derivatives, and the second derivatives. This transforms our functional data into a multivariate dataset of index values for each station/curve.
  • Cluster the index data: Now that we have a multivariate dataset (each station has several index values), we can use standard multivariate clustering techniques (like k-means, hierarchical clustering, spectral clustering) on these index values.
  • Find the best partition: Since we can apply indices to different combinations of original data and derivatives, and use different clustering methods, we get many possible clustering results. In simulations where we know the true clusters, we can use metrics like the Rand Index (RI) to see which combination worked best.

The key adaptation we made was in the third step: replacing the univariate index calculation with our new multivariate one. This allows EHyClus to handle multivariate functional data directly, leveraging the information our indices capture about relative position and shape, including the interdependencies between variables.
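To tie the steps together, here is a rough end-to-end sketch in Python. Treat it as illustrative only, not the code in our ehymet R package: it assumes a common sampling grid, uses SciPy spline smoothing as a stand-in for a proper basis representation, uses scikit-learn’s k-means as the clustering step, and the function names are ours.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from sklearn.cluster import KMeans

def multivariate_mei_mhi(curves):
    """Joint MEI/MHI for curves of shape (n_curves, n_variables, n_timepoints)."""
    n = curves.shape[0]
    mei = np.array([(curves >= curves[i]).all(axis=1).mean() for i in range(n)])
    mhi = np.array([(curves <= curves[i]).all(axis=1).mean() for i in range(n)])
    return mei, mhi

def ehyclus_sketch(raw, t, n_clusters=3, smoothing=None):
    """EHyClus-style pipeline sketch: smooth, differentiate, index, cluster.

    raw: (n_curves, n_variables, n_timepoints) noisy observations on the grid t.
    """
    n, p, T = raw.shape
    levels = np.zeros((3, n, p, T))  # smoothed curves, first and second derivatives
    for i in range(n):
        for j in range(p):
            spl = UnivariateSpline(t, raw[i, j], k=3, s=smoothing)  # cubic smoothing spline
            levels[0, i, j] = spl(t)
            levels[1, i, j] = spl.derivative(1)(t)
            levels[2, i, j] = spl.derivative(2)(t)
    # MEI and MHI on the curves and both derivatives -> 6 index values per curve
    features = np.column_stack([idx for d in range(3) for idx in multivariate_mei_mhi(levels[d])])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)
    return labels, features
```

In practice you would try several clustering methods (hierarchical, spectral, k-means with different distances) on the index matrix and, in simulations, compare the resulting partitions using the Rand Index.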

Tackling Real-World Messiness: Automated Selection

In simulations, we know the “right” answer (the true clusters), so we can test lots of combinations of indices and clustering methods to find the best one. But in the real world, we don’t have that ground truth! How do you know which combination of indices (applied to data, first derivative, second derivative) and which clustering method is best?

This is a common problem in clustering. To make EHyClus practical for real applications, we developed an automated approach to select the most informative variables (the index values) to feed into the clustering algorithm.

Here’s how it works:

  • First, we calculate a whole set of indices: MEI and MHI applied to the original data, the first derivatives, and the second derivatives. This gives us a set of variables for each curve.
  • Then, we filter these variables. We discard variables (index types) that have very low variability (e.g., less than 50% distinct values). Why? Because if an index gives almost the same value for everyone, it’s not going to help distinguish clusters.
  • Next, we look at the remaining variables and discard those that are highly correlated (e.g., correlation greater than 75%). Why? Because highly correlated variables provide redundant information and can bias the clustering. We want a set of variables that capture different aspects of the data’s shape and position.
  • The variables that survive this filtering process are the ones used for clustering in EHyClus.

We call this “auto-EHyClus.” We tested this automated selection process extensively in simulations and found that, while it might not always pick the *absolute* best combination (which you only know if you have ground truth), it consistently yields results very close to the optimal ones. This makes EHyClus much more user-friendly and reliable for real-world problems where you’re flying blind. Based on our simulations, k-means with Euclidean distance or spectral clustering seem to be good default clustering methods to use with the selected indices.
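For the curious, here is a small sketch of what that filtering logic could look like in Python. The thresholds come straight from the description above, while the function name and the greedy keep-the-first rule for correlated pairs are our illustration, not necessarily the exact rule used in the package.

```python
import numpy as np

def select_index_variables(features, names, min_distinct=0.5, max_corr=0.75):
    """auto-EHyClus-style variable filter (illustrative sketch).

    features: (n_curves, n_indices) matrix of index values (MEI/MHI on the data
    and its derivatives); names: matching column labels.
    """
    n = features.shape[0]
    # 1) drop near-constant indices: fewer than 50% distinct values cannot separate clusters
    keep = [j for j in range(features.shape[1])
            if len(np.unique(features[:, j])) / n >= min_distinct]
    # 2) drop redundant indices: keep a column only if it is not highly correlated
    #    (|r| > 0.75) with one already selected
    selected = []
    for j in keep:
        if all(abs(np.corrcoef(features[:, j], features[:, k])[0, 1]) <= max_corr
               for k in selected):
            selected.append(j)
    return features[:, selected], [names[j] for j in selected]
```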

A complex network of roads and buildings in a city, with distinct zones highlighted by color overlays, representing different data clusters.

Proving It Works: Simulation Studies

Before applying our shiny new tools to real-world problems, we put them through their paces with simulated data. We created several different scenarios (called Data Generating Processes, or DGPs) with known clusters, some simple, some more complex, some where the clusters were more obvious in the original data, others where they were clearer in the derivatives. We also included scenarios specifically designed for multivariate data, where the relationship between the variables was important.

We compared EHyClus using our new multivariate indices against several state-of-the-art methods for clustering multivariate functional data from the literature (like funclust, funHDDC, FGRC, and different k-means variations). We also compared against EHyClus using the older, weighted-average multivariate index definitions to see if our new indices made a difference.

The results were pretty encouraging!

  • In many scenarios, EHyClus with our novel indices performed competitively or significantly better than the benchmark methods, especially when the clustering structure was tied to the joint behavior of the multiple variables or their derivatives.
  • Our new indices (MEI and MHI) often led to better clustering results within the EHyClus framework compared to using the weighted-average indices, highlighting the value of capturing interdependencies directly.
  • The automated variable selection process proved its worth, consistently getting close to the best possible results achievable with EHyClus, even without knowing the true clusters.

Of course, no single method is perfect for *every* scenario. Some benchmark methods excelled in specific, carefully constructed situations (like funHDDC in one particular DGP). But overall, EHyClus with our new indices showed great promise as a robust and effective tool.

Case Study 1: Canadian Weather

One classic dataset in functional data analysis is the Canadian Weather dataset. It includes daily temperature and precipitation curves averaged over many years for 35 weather stations across Canada. These stations are traditionally grouped into four climate regions: Arctic, Atlantic, Continental, and Pacific. This is a perfect test case for multivariate functional data (temperature + precipitation) and clustering.

Unlike some previous studies that normalized the data (because temperature and precipitation are in different units), we didn’t need to. Our MEI and MHI indices work directly on the original scales because they compare the relative positions of the curves within each dimension and then combine this information naturally. The resulting index values are unitless (between 0 and 1), so the clustering algorithm doesn’t get confused by different scales.

First, we tried clustering into 4 groups, using the known regional classification as a kind of “ground truth” to evaluate our results. EHyClus, particularly when using hierarchical clustering on the first derivatives’ indices, performed very well compared to other methods, producing clusters that largely aligned with the geographical regions. We saw some interesting differences at regional boundaries, which might suggest the climate patterns don’t perfectly follow the administrative regions.

Then, we used the NbClust R package to suggest the optimal number of clusters based purely on the data, which recommended 3 clusters. We applied auto-EHyClus with 3 clusters. The resulting groups made a lot of sense climatically: a northern Arctic/subarctic group, a coastal maritime group (Atlantic and Pacific stations, plus some lake-influenced ones), and a central continental group. This showed how EHyClus can reveal meaningful structure in the data without relying on pre-defined groupings.
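If you want to reproduce this kind of “let the data suggest the number of clusters” step without R, a silhouette sweep over the index matrix is a simple stand-in (NbClust itself aggregates a few dozen internal criteria, so treat this only as a rough proxy, not what we actually ran):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def suggest_n_clusters(features, k_range=range(2, 7)):
    """Pick a cluster count for the index matrix by maximising the silhouette score."""
    scores = {k: silhouette_score(
                     features,
                     KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features))
              for k in k_range}
    return max(scores, key=scores.get), scores
```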

A weather map with abstract functional data curves overlaid, symbolizing the analysis of complex environmental patterns.

Case Study 2: Madrid Air Quality

Now for the main event – Madrid air quality! We got our hands on hourly data for PM10 and NO2 concentrations from 13 monitoring stations across the city for the year 2023. Our goal: cluster these stations based on their pollution patterns to understand spatial variations and potentially identify areas with similar pollution sources.

Again, we first used NbClust to suggest the number of clusters, which pointed towards three groups. Then, we unleashed auto-EHyClus. The automated process selected a combination of MEI on the first and second derivatives and MHI on the original data and first derivatives as the most informative variables. We used k-means clustering on these selected indices.

The resulting three clusters painted a clear picture of Madrid’s air quality landscape:

  • Cluster 1 (High Pollution): Included stations like Plaza Elíptica and Barajas. These are known hotspots, heavily impacted by road traffic (Plaza Elíptica is near a major interchange) and air traffic (Barajas airport). Escuelas Aguirre, also in this group, is close to busy roads. These stations show consistently higher pollution levels throughout the year.
  • Cluster 2 (Moderate Pollution): Stations in mixed urban areas. They experience significant pollution from traffic and urban activities but aren’t as extreme as the first group.
  • Cluster 3 (Lower Pollution): Stations located in less urbanized areas, often near green spaces or quieter residential neighborhoods. These show the lowest pollution levels, benefiting from less traffic and potentially the cleansing effect of parks.

This classification clearly differentiates zones influenced by different factors – traffic, urban density, proximity to green areas. It provides valuable insights for urban planners and public health officials looking to target interventions where they’re needed most.

A data scientist looking intently at multiple screens displaying functional data visualizations and maps of Madrid, symbolizing the analysis process.

Wrapping It Up

So, there you have it! We’ve developed a novel way to extend the useful epigraph and hypograph indices to the multivariate functional data setting, capturing the crucial interdependencies between variables. We’ve shown how these new indices can be integrated into the EHyClus methodology, making it a powerful tool for clustering complex functional data.

Through simulations, we’ve demonstrated that our approach is competitive, often outperforming existing methods. And with real-world case studies on Canadian weather and, excitingly, Madrid air quality, we’ve shown its practical utility in uncovering meaningful patterns. The automated variable selection process we introduced is key to making this methodology accessible and reliable for real applications where the ground truth is unknown.

Clustering complex data is never without its challenges – deciding on the number of clusters, choosing the right indices, picking the best clustering method. But we’ve provided guidelines and tools (like the auto-EHyClus feature in the ehymet R package) to help researchers navigate these decisions.

We’re pretty excited about the potential of these new multivariate indices. Beyond clustering, they could be used to enhance other functional data tools, like multivariate functional boxplots or tests for comparing groups of multivariate functional curves. There’s still plenty more to explore in the fascinating world of multivariate functional data analysis!

Source: Springer
