
Consider h5netcdf dependency and default #484

Description

@veenstrajelmer

Passing engine='h5netcdf' might increase performance: https://docs.xarray.dev/en/stable/user-guide/io.html
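If h5netcdf stays an optional dependency, the engine could be chosen at runtime. The following is only a minimal sketch of that idea: `pick_netcdf_engine` is a hypothetical helper name (not part of dfm_tools or xarray), and it assumes that passing `engine=None` lets xarray pick its own default backend.

```python
import importlib.util


def pick_netcdf_engine(preferred="h5netcdf"):
    """Return `preferred` if that package is importable, else None.

    Hypothetical helper: engine=None leaves the backend choice to xarray.
    """
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return None


engine = pick_netcdf_engine()
print(f"engine to pass to xr.open_dataset: {engine!r}")
# e.g. ds = xr.open_dataset(file_nc, engine=engine)
```

This only checks whether the package can be found; it does not import it or verify that it works with the files at hand.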

h5netcdf also supports reading netCDF files via a git link with fsspec (only the scipy and h5netcdf engines support this, netcdf4 does not):

import fsspec
import xarray as xr
f = fsspec.open(file_nc_git)  # file_nc_git: link to a netCDF file in a git repository
ds_git = xr.open_dataset(f.open(), engine='h5netcdf')

h5netcdf was temporarily installed in the environment, so uninstall it first before doing the performance check.

Memory/timings assessment
Using h5netcdf turned out to be very slow in open_mfdataset when tested with partitions containing 2410 variables, as in #968. A cleaned-up version of that code and its results is included in this issue.

Code to run with memory_profiler via mprof run python memory_profiler.py:

import os
import glob
from time import sleep
import dfm_tools as dfmt
import datetime as dt

dir_model = r"p:\11210284-011-nose-c-cycling\runs_fine_grid\B05_waq_2012_PCO2_ChlC_NPCratios_DenWat_stats_2023.01\B05_waq_2012_PCO2_ChlC_NPCratios_DenWat_stats_2023.01\DFM_OUTPUT_DCSM-FM_0_5nm_waq"
file_nc_pat = os.path.join(dir_model, "DCSM-FM_0_5nm_waq_0*_map.nc")
file_nc_list_all = glob.glob(file_nc_pat)
file_nc_list = file_nc_list_all[:5]

uds = dfmt.open_partitioned_dataset(file_nc_list, remove_ghost=False) #, engine="h5netcdf")

print('>> plot single timestep: ',end='')
dtstart = dt.datetime.now()
uds["mesh2d_tureps1"].isel(time=-1, mesh2d_nInterfaces=-2).ugrid.plot()
print(f'{(dt.datetime.now()-dtstart).total_seconds():.2f} sec')
sleep(2)
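The manual `dt.datetime.now()` bookkeeping used for the timing printout can also be wrapped in a small context manager. This is a sketch only: `timed` and the `timings` dict are hypothetical names introduced here, and `time.sleep` stands in for the actual plotting call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the elapsed wall-clock time of the enclosed block.

    If `results` is a dict, the elapsed seconds are also stored under `label`.
    """
    print(f">> {label}: ", end="")
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{elapsed:.2f} sec")
        if results is not None:
            results[label] = elapsed


timings = {}
with timed("sleep briefly", timings):
    time.sleep(0.05)  # placeholder for e.g. the .ugrid.plot() call
```

Collecting the timings in a dict makes it easier to compare runs with and without engine="h5netcdf" afterwards.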

Memory usage including plot (peaks at around 900 MB):
image

>> xu.open_dataset() with 5 partition(s): 1 2 3 4 5 : 17.65 sec
>> xu.merge_partitions() with 5 partition(s): 16.65 sec
>> dfmt.open_partitioned_dataset() total: 34.30 sec
>> plot single timestep: 0.42 sec

Memory usage including plot and engine="h5netcdf" (peaks at around 480 MB):
image

>> xu.open_dataset() with 5 partition(s): 1 2 3 4 5 : 171.22 sec
>> xu.merge_partitions() with 5 partition(s): 15.89 sec
>> dfmt.open_partitioned_dataset() total: 187.12 sec
>> plot single timestep: 0.44 sec

So h5netcdf takes far longer to open the dataset (171.22 vs. 17.65 sec, because of the pure-Python implementation of h5netcdf), but its peak memory usage is significantly lower (around 480 MB vs. 900 MB). This might become very useful once h5netcdf is more performant: h5netcdf/h5netcdf#195 and h5netcdf/h5netcdf#251
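To put the reported numbers side by side, the ratios can be computed directly. The values below are copied from the two runs above; the `runs` dict and `ratio` helper are illustrative names only, and "default" refers to the run without the engine argument.

```python
# Timings (seconds) and approximate peak memory (MB) copied from the two runs above.
runs = {
    "default": {"open": 17.65, "merge": 16.65, "total": 34.30, "peak_mb": 900},
    "h5netcdf": {"open": 171.22, "merge": 15.89, "total": 187.12, "peak_mb": 480},
}


def ratio(key):
    """Return the h5netcdf value divided by the default-engine value."""
    return runs["h5netcdf"][key] / runs["default"][key]


for key in ("open", "merge", "total", "peak_mb"):
    print(f"{key:>8}: h5netcdf / default = {ratio(key):.2f}")
```

With these numbers, opening takes roughly 9.7 times as long and the total roughly 5.5 times as long, while peak memory is a little over half of the default run. The merge step is essentially unaffected.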

If the default engine is changed, also check the impact of varying the remove_ghost parameter: #957

Additionally:
