The future of time series data in sunpy
In late 2022 I got a small development grant from NumFOCUS to scope the future of time series data in sunpy.
The successful application can be read on the sunpy wiki.
The application contains context that I won’t repeat here.
This blog post is the key outcome of this grant, with a record of what I did, the recommendations I made, and any decisions we came to as a community.
User requirements

The first stage of my work investigated the user requirements for a sunpy data container for time series data. As part of this I used my own experience and the following community engagement:

- Discussion at one of the weekly sunpy community meetings in December 2022

From these discussions came the following list of requirements:
| Requirement | Notes |
|---|---|
| Store data that is a function of time | The time column should be treated as the index or coordinates of the data, and be stored as a time-like type. |
| Handle different time scales | Data can have times defined in a variety of different time scales (e.g. UTC, TAI). |
| Store multi-dimensional data | Although time is a common index to timeseries data, it isn't always the only one. As an example, velocity distribution functions measured in the solar wind are 4D datasets, with data as a function of time and three dimensions in velocity space (see the sketch after this table). |
| Handle time scales with leap seconds | Some time scales can contain timestamps that occur within a leap second. |
| Store and use physical units with the data and any non-time indices | |
| Store data in a format that can be used with scientific Python libraries | |
| Support for storing out-of-memory datasets | |
| Store metadata alongside the actual data | |
| Have a way to store an observer coordinate alongside the time index | |
| Have an easy way to do common data manipulation tasks | e.g. interpolating, resampling, rebinning |
| Have a way to combine multiple timeseries objects, and keep track of metadata | |
| Ability to convert to other common time series objects (e.g. a pandas.DataFrame) | |
| Functionality for loading from and saving to common file formats | |
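To make the multi-dimensional requirement concrete, here is a minimal sketch (with made-up shapes) of the velocity distribution function example mentioned in the table: the data are a function of time plus three velocity-space dimensions, so a single time index on its own is not enough.

```python
import numpy as np

# Hypothetical velocity distribution function dataset measured in the solar
# wind: 4D data, indexed by time and three velocity-space dimensions.
n_time, n_vx, n_vy, n_vz = 128, 32, 32, 32

# The time index, stored as a time-like type (numpy datetime64 here).
times = np.datetime64("2023-01-01T00:00") + np.arange(n_time) * np.timedelta64(1, "m")

# The data values, one 3D velocity-space cube per time stamp.
vdf = np.random.rand(n_time, n_vx, n_vy, n_vz)

print(times.shape, vdf.shape)  # (128,) (128, 32, 32, 32)
```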
Existing options for a data container

The next step was to identify a set of possible data containers that could be used to store time series data in sunpy. The identified options were:

- astropy.timeseries.TimeSeries
- pandas.DataFrame
- xarray.DataArray (or xarray.Dataset)
- numpy.ndarray
- ndcube
What do other projects use?

I also looked at what data containers Python in Heliophysics projects use (as of writing, in January 2023):
| Package | Container |
|---|---|
|  | Custom |
|  |  |
|  | Not clear if users can access the data itself |
|  | Unclear if there is any specific timeseries container object |
|  |  |
|  |  |
|  |  |
|  |  |
|  | Custom |
There is no common container in use; of the possible options listed above, only astropy.TimeSeries is not represented.
What datasets does sunpy currently support?

sunpy currently has built-in support for reading CDF files that conform to the Space Physics Guidelines for CDF, as long as the dataset is one- or two-dimensional. Alongside this, several custom data readers have been written to support different data sources:
(links point to the data source information web page)
| Data product(s) | File format |
|---|---|
|  | FITS |
|  | Text file |
|  | FITS |
| GOES XRS | FITS, netCDF |
| PROBA-2 LYRA lightcurve | FITS |
|  | JSON |
|  | JSON |
|  | FITS |
| RHESSI X-ray summary | FITS |
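As an illustration of the existing readers listed above, here is a minimal sketch of loading one of these data products with sunpy today, assuming the optional sample data has been downloaded:

```python
import sunpy.timeseries
from sunpy.data.sample import GOES_XRS_TIMESERIES  # bundled GOES XRS sample file

# The TimeSeries factory dispatches to the appropriate instrument-specific reader.
xrs = sunpy.timeseries.TimeSeries(GOES_XRS_TIMESERIES)

print(type(xrs))    # an instrument-specific timeseries class (XRSTimeSeries)
print(xrs.columns)  # the data columns, e.g. the two GOES X-ray channels
```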
Evaluating options

Having found possible options, in this section I've evaluated them against the criteria set out above.
numpy.ndarray

| Requirement | Status | Notes |
|---|---|---|
| Time-like index data | 🛑 | Can store datetime64 data, but no support for indexes |
| Different time scales | 🛑 | No support |
| Multi-dimensional data | 🟩 |  |
| Physical units | 🛑 | No support |
| Interop with scientific Python | 🟩 |  |
| Out of memory | 🛑 | numpy arrays are always in memory |
| Metadata | 🛑 | No support |
| Observer coordinates | 🛑 | No support |
| Easy data manipulation | 🟩 |  |
| I/O | 🟠 | Can save to binary .npy format or a text file |
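To illustrate the limits in the table above, a minimal sketch of the plain-numpy approach: times and data are just two unrelated arrays, with datetime64 time stamps and .npy files for I/O, but no units, metadata or time-scale handling.

```python
import numpy as np

# Time stamps stored as datetime64, but numpy has no notion of them being an index.
times = np.array(["2023-01-01T00:00:00", "2023-01-01T00:00:10"], dtype="datetime64[s]")
flux = np.array([1.1, 1.3])  # no units or metadata can be attached

# I/O is limited to raw binary (.npy) or text files.
np.save("times.npy", times)
np.save("flux.npy", flux)
```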
pandas.DataFrame

| Requirement | Status | Notes |
|---|---|---|
| Time-like index data | 🟩 |  |
| Different time scales | 🛑 | No support |
| Multi-dimensional data | 🟠 | Possible, but recommended to use xarray instead |
| Physical units | 🛑 | No native support (there is a tracking issue); could be possible with a third-party extension |
| Interop with scientific Python | 🟩 |  |
| Out of memory | 🛑 | pandas DataFrames are always in memory |
| Metadata | 🟩 | Possible to add additional properties to a DataFrame |
| Observer coordinates | 🛑 | No support |
| Easy data manipulation | 🟩 | Many built-in methods for manipulating time-like data |
| I/O | 🟩 |  |
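A minimal sketch of the pandas approach described above: a DatetimeIndex provides the time-like index and many manipulation methods, but there is no handling of time scales or physical units.

```python
import numpy as np
import pandas as pd

# A DataFrame indexed by time.
index = pd.date_range("2023-01-01", periods=120, freq="10s")
df = pd.DataFrame({"flux": np.random.rand(120)}, index=index)

# Built-in time-based manipulation, e.g. resampling to one-minute means.
resampled = df.resample("1min").mean()
print(resampled.head())
```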
xarray.DataArray

| Requirement | Status | Notes |
|---|---|---|
| Time-like index data | 🟩 |  |
| Different time scales | 🛑 | No support |
| Multi-dimensional data | 🟩 |  |
| Physical units | 🛑 | No native support (there is a tracking issue); could be possible with pint |
| Interop with scientific Python | 🟩 |  |
| Out of memory | 🟩 | Support for out-of-core computation using dask |
| Metadata | 🟩 | Possible to add metadata to a DataArray |
| Observer coordinates | 🟠 | Support for adding "non-dimensional" coordinates (e.g. longitude/latitude), but not clear if storing an astropy SkyCoord would work |
| Easy data manipulation | 🟩 | Many built-in methods for manipulating time-like data |
| I/O | 🟩 |  |
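A minimal sketch of the xarray approach described above: a DataArray with a time coordinate and an extra dimension, with optional dask chunking for larger-than-memory data (this assumes dask is installed).

```python
import numpy as np
import xarray as xr

times = np.datetime64("2023-01-01T00:00") + np.arange(100) * np.timedelta64(1, "m")

# Multi-dimensional data with a time coordinate plus a second dimension.
da = xr.DataArray(
    np.random.rand(100, 3),
    dims=("time", "component"),
    coords={"time": times, "component": ["x", "y", "z"]},
)

hourly = da.resample(time="1h").mean()  # built-in time manipulation
da_lazy = da.chunk({"time": 10})        # dask-backed, out-of-memory friendly
```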
astropy.timeseries.TimeSeries

| Requirement | Status | Notes |
|---|---|---|
| Time-like index data | 🟩 |  |
| Different time scales | 🟩 |  |
| Multi-dimensional data | 🛑 |  |
| Physical units | 🟩 |  |
| Interop with scientific Python | 🟩 |  |
| Out of memory | 🟠 | Apparently there is some support, but this is undocumented |
| Metadata | 🟩 | Can store metadata on the .meta attribute |
| Observer coordinates | 🟩 |  |
| Easy data manipulation | 🟠 |  |
| I/O | 🟩 | I/O is done via astropy's unified Table I/O interface |
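A minimal sketch of the astropy approach described above: the time column is an astropy Time, so different time scales (including leap seconds) are handled, and the data columns carry physical units.

```python
import astropy.units as u
from astropy.time import Time
from astropy.timeseries import TimeSeries

# A UTC time column that includes a timestamp inside a leap second.
time = Time(
    ["2016-12-31T23:59:59", "2016-12-31T23:59:60", "2017-01-01T00:00:00"],
    scale="utc",
)
ts = TimeSeries(time=time, data={"flux": [1.0, 1.1, 1.2] * u.W / u.m**2})

print(ts.time.tai)      # the same instants expressed on the TAI scale
print(ts["flux"].unit)  # W / m2
```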
NDCube

| Requirement | Status | Notes |
|---|---|---|
| Time-like index data | 🟩 |  |
| Different time scales | 🟩 |  |
| Multi-dimensional data | 🟩 |  |
| Physical units | 🟩 |  |
| Interop with scientific Python | 🟩 |  |
| Out of memory | 🟠 | Seems to be supported in theory, but there is little documentation |
| Metadata | 🟩 | Can store arbitrary FITS metadata |
| Observer coordinates | 🛑 | No support for extra coordinates |
| Easy data manipulation | 🛑 | Very few manipulation methods implemented |
| I/O | 🛑 |  |
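A minimal sketch of what an NDCube holding time series data might look like, assuming a simple 1D WCS can describe the time axis (the WCS construction here is illustrative; building an appropriate WCS is the main overhead compared with the other options).

```python
import numpy as np
from astropy.wcs import WCS
from ndcube import NDCube

# An illustrative 1D WCS with one sample per second from some reference epoch.
wcs = WCS(naxis=1)
wcs.wcs.ctype = ["TIME"]
wcs.wcs.cunit = ["s"]
wcs.wcs.cdelt = [1.0]
wcs.wcs.crpix = [1.0]
wcs.wcs.crval = [0.0]

cube = NDCube(np.random.rand(60), wcs=wcs)
print(cube.data.shape)  # (60,)
```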
Initial recommendations

- numpy.ndarray doesn't implement several key features, and these are almost certainly out of scope for future ndarray development, so I suggest ndarray is discounted.
- xarray.DataArray builds on top of pandas.DataFrame with additional features that would be useful to us, so I suggest pandas.DataFrame is discounted.
- NDCube is designed specifically to store data associated with a FITS world coordinate system (WCS). While some solar timeseries data is already in the FITS format, a large portion is in the CDF format, which is tabular, a structure that FITS is not primarily designed to represent. So I suggest NDCube is discounted.

At a sunpy community meeting there was consensus that, going forward, astropy.TimeSeries and xarray.DataArray are the two options to consider.
These two options compare as follows:

| Requirement | astropy.TimeSeries | xarray.DataArray |
|---|---|---|
| Time-like index data | 🟩 | 🟩 |
| Different time scales | 🟩 | 🛑 |
| Multi-dimensional data | 🛑 | 🟩 |
| Physical units | 🟩 | 🛑 |
| Interop with scientific Python | 🟩 | 🟩 |
| Out of memory | 🟠 | 🟩 |
| Metadata | 🟩 | 🟩 |
| Observer coordinates | 🟩 | 🟠 |
| Easy data manipulation | 🟠 | 🟩 |
| I/O | 🟩 | 🟩 |
My initial recommendation would be to adopt xarray.DataArray, as the two red items have a strong possibility of being solved with DataArray:
- It should (I haven't confirmed this) be possible to convert times in different time scales (including ones with leap seconds) to a single time scale that doesn't have leap seconds, and store this in an xarray.DataArray (see the sketch after this list).
- Although there is currently no native support for units in DataArray, there is interest and ongoing development to support them.
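A rough sketch (not verified against xarray internals) of the first point: astropy Time can parse timestamps on a scale with leap seconds, convert them to a leap-second-free scale such as TAI, and the result can be stored as datetime64 values on the DataArray's time coordinate.

```python
import numpy as np
import xarray as xr
from astropy.time import Time

# Parse UTC timestamps, including one inside a leap second.
utc = Time(
    ["2016-12-31T23:59:59", "2016-12-31T23:59:60", "2017-01-01T00:00:00"],
    scale="utc",
)

# Convert to TAI (no leap seconds) and represent as datetime64.
tai_times = utc.tai.datetime64

da = xr.DataArray(
    np.array([1.0, 1.1, 1.2]),
    dims=("time",),
    coords={"time": tai_times},
)
```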
It is unclear to me (because I did not have time to investigate) how hard it would be to implement support for storing rich coordinates (i.e. astropy.SkyCoord) as extra coordinates on xarray data structures.
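One possible interim approach (illustrative only, and not a substitute for proper SkyCoord support) would be to store the observer position as plain non-dimension coordinates along the time dimension, and rebuild a SkyCoord from them when needed:

```python
import numpy as np
import xarray as xr
import astropy.units as u
from astropy.coordinates import SkyCoord
import sunpy.coordinates  # noqa: F401  (registers the solar coordinate frames)

times = np.datetime64("2023-01-01") + np.arange(3) * np.timedelta64(1, "D")

# Observer position stored as plain per-time coordinate variables.
da = xr.DataArray(
    np.random.rand(3),
    dims=("time",),
    coords={
        "time": times,
        "observer_lon": ("time", [10.0, 10.5, 11.0]),  # degrees
        "observer_lat": ("time", [0.1, 0.2, 0.3]),     # degrees
        "observer_r": ("time", [1.0, 1.0, 1.0]),       # astronomical units
    },
)

# Rebuild a rich coordinate object on demand.
observer = SkyCoord(
    da["observer_lon"].values * u.deg,
    da["observer_lat"].values * u.deg,
    da["observer_r"].values * u.au,
    frame="heliographic_stonyhurst",
    obstime=da["time"].values,
)
```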
In contrast, I think implementing multi-dimensional data in astropy.TimeSeries, adding documentation for out-of-memory datasets, and implementing easy data manipulation methods would take significantly more effort than this.
Finally, xarray has a much bigger development community than astropy.TimeSeries, so implementing bug fixes and new features would probably be much easier with xarray.
Putting astropy objects in xarray structures
For the final part of the small development grant, I investigated the changes needed to put astropy objects in xarray structures.

As a model for doing this, it is currently possible to store unitful data created with pint in xarray structures.
Support for doing this has two components:

1. xarray can store pint arrays directly as the data of its structures, because they are duck arrays rather than numpy.ndarray subclasses.
2. pint-xarray provides a set of accessors that can be used to serialise and deserialise unitful data so that it can be saved to a file and loaded again. It does this by converting the unit information into metadata, with strings representing the units.
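A minimal example of the pint-xarray pattern described above (assuming pint and pint-xarray are installed): units live on the array data while working, and are moved into string attributes for (de)serialisation.

```python
import xarray as xr
import pint_xarray  # noqa: F401  (registers the .pint accessor)

# A plain DataArray with the unit recorded as a string attribute.
da = xr.DataArray([1.0, 2.0, 3.0], dims=("time",), attrs={"units": "m / s"})

quantified = da.pint.quantify()       # data becomes a unit-aware pint array
print(quantified.pint.units)          # meter / second

plain = quantified.pint.dequantify()  # back to a plain array + "units" attribute
print(plain.attrs["units"])
```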
It is not currently possible to store astropy.Quantity objects in xarray structures: they inherit directly from ndarray, and get coerced from Quantity to ndarray during the xarray structure initialisation. I think fixing this is (at least initially) a one-line change, changing (what was xarray/core/variable.py#L288 on commit hash 51554f2638bc9e4a527492136fe6f54584ffa75d) from

```python
data = np.asarray(data)
```

to

```python
if not isinstance(data, np.ndarray):
    data = np.asarray(data)
```
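A short demonstration of the coercion described above (as of the commit referenced): wrapping an astropy Quantity in a DataArray strips it back to a plain ndarray, and the unit information is lost.

```python
import astropy.units as u
import xarray as xr

quantity = [1.0, 2.0, 3.0] * u.m
da = xr.DataArray(quantity)

print(type(quantity))  # astropy.units.quantity.Quantity
print(type(da.data))   # numpy.ndarray -- the unit has been dropped
```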
Before moving forward with this, it needs to be possible to run the full xarray unit test suite with astropy.Quantity.
I started work on this in these two PRs: