The future of time series data in sunpy#

In late 2022 I got a small development grant from NumFOCUS to scope the future of time series data in sunpy. The successful application can be read on the sunpy wiki. The application contains context that I won’t repeat here. This blog post is the key outcome of this grant, with a record of what I did, the recommendations I made, and any decisions we came to as a community.

User requirements#

The first stage of my work investigated what the user requirements are for a sunpy data container for time series data. As part of this I used my own experience and the following community engagement:

From these discussions came the following list of requirements:

| Requirement | Notes |
| --- | --- |
| Store data that is a function of time | This means the time column should be treated as the index or coordinates to the data, and be stored as a time-like type. |
| Handle different time scales | Data can have times defined in a variety of different time scales (e.g. UTC, TAI). |
| Store multi-dimensional data | Although time is a common index to timeseries data, it isn’t always the only one. As an example, velocity distribution functions measured in the solar wind are 4D datasets, with data as a function of time and three dimensions in velocity space. |
| Handle time scales with leap seconds | Some time scales can contain timestamps that occur within a leap second. |
| Store and use physical units with the data and any non-time indices | |
| Store data in a format that can be used with scientific Python libraries | |
| Support for storing out-of-memory datasets | |
| Store metadata alongside actual data | |
| Have a way to store an observer coordinate alongside the time index | |
| Have an easy way to do common data manipulation tasks | e.g. interpolating, resampling, rebinning |
| Have a way to combine multiple timeseries objects, and keep track of metadata | |
| Ability to convert to other common time series objects (e.g. pandas.DataFrame) | |
| Functionality for loading and saving out to common file formats | |

Existing options for a data container#

The next step was to identify a set of possible data containers that could be used to store time series data in sunpy. The identified options were:

  • astropy.timeseries.TimeSeries

  • pandas.DataFrame

  • xarray.DataArray (or xarray.Dataset)

  • numpy.ndarray

  • ndcube

What do other projects use?#

I also looked at what Python in Heliophysics projects use (as of writing, in Jan 2023):

| Package | Container |
| --- | --- |
| sunpy | Custom TimeSeries object, backed by pandas.DataFrame |
| HAPI Client | numpy.ndarray |
| pySPEDAS | Not clear if users can access the data itself |
| spacepy | Unclear if there is any specific timeseries container object |
| aidapy | xarray.DataArray |
| cdflib | numpy.ndarray |
| NDCube | NDCube |
| pytplot | xarray.DataArray |
| solo-epd-loader | pandas.DataFrame |
| speasy | Custom DataContainer object, backed by numpy.ndarray |

No single container is used across these projects. Of the possible options listed above, only astropy.TimeSeries is not used by any of them.

What datasets does sunpy currently support?#

sunpy currently has built-in support for reading CDF files that conform to the Space Physics Guidelines for CDF, as long as the dataset is one- or two-dimensional. Alongside this, several custom data readers have been written to support different data sources:

(links point to the data source information web page)

| Data product(s) | File format |
| --- | --- |
| SDO EVE/ESP L1 | FITS |
| SDO EVE/ESP L0CS | Text file |
| FERMI GBM summary | FITS |
| GOES XRS | FITS, netCDF |
| PROBA-2 LYRA lightcurve | FITS |
| NOAA solar cycle monthly indices | JSON |
| NOAA solar cycle predicted indices | JSON |
| NoRH radio | FITS |
| RHESSI x-ray summary | FITS |

Evaluating options#

Having found possible options, in this section I’ve evaluated them against the criteria set out above.

numpy.ndarray#

| Requirement | Rating | Notes |
| --- | --- | --- |
| Time-like index data | 🛑 | Can store datetime64 data, but no support for indexes |
| Different time scales | 🛑 | No support |
| Multi-dimensional data | 🟩 | |
| Physical units | 🛑 | No support |
| Interop with scientific Python | 🟩 | |
| Out of memory | 🛑 | numpy arrays are always in memory |
| Metadata | 🛑 | No support |
| Observer coordinates | 🛑 | No support |
| Easy data manipulation | 🟩 | |
| I/O | 🟠 | Can save to binary .npy format or text file |

pandas.DataFrame#

| Requirement | Rating | Notes |
| --- | --- | --- |
| Time-like index data | 🟩 | |
| Different time scales | 🛑 | No support |
| Multi-dimensional data | 🟠 | Possible, but recommended to use xarray instead |
| Physical units | 🛑 | No native support (tracking issue), could be possible with pint-pandas |
| Interop with scientific Python | 🟩 | |
| Out of memory | 🛑 | pandas DataFrames are always in memory |
| Metadata | 🟩 | Possible to add additional properties to a DataFrame |
| Observer coordinates | 🛑 | No support |
| Easy data manipulation | 🟩 | Many built-in methods for manipulating time-like data |
| I/O | 🟩 | Lots of I/O options |

xarray.DataArray#

| Requirement | Rating | Notes |
| --- | --- | --- |
| Time-like index data | 🟩 | |
| Different time scales | 🛑 | No support |
| Multi-dimensional data | 🟩 | |
| Physical units | 🛑 | No native support (tracking issue), could be possible with pint-xarray |
| Interop with scientific Python | 🟩 | |
| Out of memory | 🟩 | Support for computing using dask |
| Metadata | 🟩 | Possible to add metadata to a DataArray |
| Observer coordinates | 🟠 | Support for adding “non-dimensional” coordinates (e.g. longitude/latitude), but not clear if storing astropy SkyCoord would work |
| Easy data manipulation | 🟩 | Many built-in methods for manipulating time-like data |
| I/O | 🟩 | Lots of I/O options |
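As a rough illustration of several of the ratings above, the sketch below builds a small DataArray with a time index, a second (velocity) dimension, a non-dimension coordinate along time, and free-form metadata, then resamples it in time. All names, values and attribute keys here are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: 6 time steps x 3 velocity bins.
times = pd.date_range("2023-01-01", periods=6, freq="10min")
velocity = np.array([300.0, 400.0, 500.0])
values = np.arange(18.0).reshape(6, 3)

da = xr.DataArray(
    values,
    dims=("time", "velocity"),
    coords={
        "time": times,
        "velocity": velocity,
        # Non-dimension coordinate that varies along the time axis.
        "observer_lon": ("time", np.linspace(0.0, 5.0, 6)),
    },
    # Metadata is stored as a plain dictionary of attributes.
    attrs={"units": "cm-3", "instrument": "hypothetical"},
)

# Average the data into 30 minute bins along the time index.
resampled = da.resample(time="30min").mean()
```

Note that the units here are only a string in the metadata, which is why physical units are marked as unsupported above: xarray itself does nothing with them.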

astropy.timeseries.TimeSeries#

| Requirement | Rating | Notes |
| --- | --- | --- |
| Time-like index data | 🟩 | |
| Different time scales | 🟩 | |
| Multi-dimensional data | 🛑 | |
| Physical units | 🟩 | |
| Interop with scientific Python | 🟩 | |
| Out of memory | 🟠 | Apparently there is some support, but this is undocumented. |
| Metadata | 🟩 | Can store on the .meta attribute |
| Observer coordinates | 🟩 | |
| Easy data manipulation | 🟠 | |
| I/O | 🟩 | I/O is done via astropy.table.Table |

NDCube#

| Requirement | Rating | Notes |
| --- | --- | --- |
| Time-like index data | 🟩 | |
| Different time scales | 🟩 | |
| Multi-dimensional data | 🟩 | |
| Physical units | 🟩 | |
| Interop with scientific Python | 🟩 | |
| Out of memory | 🟠 | Seems to be supported in theory, but little docs |
| Metadata | 🟩 | Can store arbitrary FITS metadata |
| Observer coordinates | 🛑 | No support for extra coordinates |
| Easy data manipulation | 🛑 | Very few manipulation methods implemented |
| I/O | 🛑 | |

Initial recommendations#

  • numpy.ndarray doesn’t implement several key features, and these are almost certainly out of scope for future ndarray development, so I suggest ndarray is discounted.

  • xarray.DataArray builds on top of pandas.DataFrame with additional features that would be useful to us, so I suggest pandas.DataFrame is discounted.

  • NDCube is designed specifically to store data that is associated with a FITS world coordinate system (WCS). While some solar timeseries data is already in the FITS format, a large portion is in the CDF format, which is tabular, and FITS is not primarily designed to represent tabular data. So I suggest NDCube is discounted.

At a SunPy community meeting there was consensus that, going forward, astropy.TimeSeries and xarray.DataArray should be the two options under consideration.

These two options have the following comparison:

| Requirement | astropy.TimeSeries | xarray.DataArray |
| --- | --- | --- |
| Time-like index data | 🟩 | 🟩 |
| Different time scales | 🟩 | 🛑 |
| Multi-dimensional data | 🛑 | 🟩 |
| Physical units | 🟩 | 🛑 |
| Interop with scientific Python | 🟩 | 🟩 |
| Out of memory | 🟠 | 🟩 |
| Metadata | 🟩 | 🟩 |
| Observer coordinates | 🟩 | 🟠 |
| Easy data manipulation | 🟠 | 🟩 |
| I/O | 🟩 | 🟩 |

My initial recommendation would be to adopt xarray.DataArray, as the two red items have a strong possibility of being solved with DataArray:

  • It should (I haven’t confirmed this) be possible to convert times in different time scales (including ones with leap seconds) to a single timescale that doesn’t have leap seconds, and store this in an xarray.DataArray.

  • Although there is no native support for units in DataArray currently, there is interest and ongoing development to support them.

It is unclear to me (because I did not have time to investigate) how hard it would be to implement support for storing rich coordinates (i.e. astropy.SkyCoord) as non-dimension coordinates in xarray data structures.

In contrast, I think implementing multi-dimensional data in astropy.TimeSeries, adding documentation for out-of-memory datasets, and implementing easy data manipulation methods would take significantly more effort than this. Finally, xarray has a much bigger development community than astropy.TimeSeries, so implementing bug fixes and new features would probably be much easier with xarray.

Putting astropy objects in xarray structures#

For the final part of the small development grant, I investigated the changes needed to put astropy objects in xarray structures.

As a model for doing this, it is currently possible to store unitful data created with pint in xarray structures. Support for doing this has two components:

  • xarray natively supports storing duck arrays

  • pint-xarray provides a set of accessors that can be used to serialise and deserialise unitful data so that it can be saved to a file and loaded again. It does this by converting the unit data into metadata, with strings representing units.

It is not currently possible to store astropy.Quantity objects in xarray structures, as they inherit directly from ndarray, and get coerced from Quantity to ndarray during the xarray structure initialisation. I think fixing this is (at least initially) a one line change, changing (what was xarray/core/variable.py#L288 on commit hash 51554f2638bc9e4a527492136fe6f54584ffa75d) from

data = np.asarray(data)

to

# Leave ndarray subclasses (e.g. Quantity) untouched; only coerce other inputs.
if not isinstance(data, np.ndarray):
    data = np.asarray(data)

Before moving forward with this it needs to be possible to run the full unit tests in xarray with astropy.Quantity. I started work on this in these two PRs: