7.3.1.1. Abstract Scalar Dataloader

class meshiphi.dataloaders.scalar.abstract_scalar.ScalarDataLoader(bounds, params)

Abstract class for all scalar Datasets.

__init__(bounds, params)

This is where large-scale operations are performed, such as importing data, downsampling, reprojecting, and renaming variables.

Parameters:
  • bounds (Boundary) – Initial mesh boundary to limit scope of data ingest

  • params (dict) – Values needed by the dataloader to initialise. Unique to each dataloader.

self.data

Data stored by the dataloader for use when called upon by the mesh. Must be saved in WGS84 geographic coordinates (EPSG:4326), with coordinate names ‘lat’, ‘long’, and ‘time’ (if applicable).

Type:

pd.DataFrame or xr.Dataset

self.data_name

Name of the scalar variable. Must be the column name if self.data is a pd.DataFrame, or the variable name if self.data is an xr.Dataset.

Type:

str

Raises:

ValueError – If no data lies within the given boundary

add_default_params(params)

Sets default values for all scalar dataloaders. This method should be overridden to include any extra params for a specific dataloader.

Parameters:

params (dict) – Dictionary containing attributes that are required for each dataloader.

Returns:

Dictionary of attributes the dataloader will require, completed with default values if not provided in config.

Return type:

(dict)
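The default-filling pattern described above can be sketched in plain Python. This is a minimal illustration, not MeshiPhi's implementation: the default values and the ‘aggregate_type’ and ‘units’ keys are hypothetical, and add_extra_params is an invented stand-in for an overriding method in a specific dataloader.

```python
def add_default_params(params):
    """Return a copy of params with illustrative defaults filled in."""
    # Hypothetical defaults; the real keys and values are set by the library.
    defaults = {
        "downsample_factors": (1, 1),  # assumed default: no downsampling
        "data_name": None,             # assumed default: inferred later
        "aggregate_type": "MEAN",      # hypothetical key
    }
    completed = dict(defaults)
    completed.update(params)  # values supplied in the config take precedence
    return completed


def add_extra_params(params):
    """Stand-in for an override that adds dataloader-specific defaults."""
    params = add_default_params(params)
    params.setdefault("units", "m")  # hypothetical extra parameter
    return params
```

Values present in the config are never overwritten; only missing keys receive a default.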

calculate_coverage(bounds, data=None)

Calculates the fraction of the boundary covered by the dataset

Parameters:
  • bounds (Boundary) – Boundary being compared against

  • data (pd.DataFrame or xr.Dataset) – Dataset with ‘lat’ and ‘long’ coordinates. Extent is calculated from the min/max of these coordinates. Defaults to the object’s internal dataset.

Returns:

Decimal fraction of boundary covered by the dataset

Return type:

float
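One way to read the coverage calculation is as a rectangle overlap in lat/long space. The sketch below assumes exactly that; coverage_fraction and the tuple standing in for a Boundary object are hypothetical, and area is measured in square degrees rather than true surface area.

```python
def coverage_fraction(bounds, lats, longs):
    """Fraction of a lat/long box covered by the extent of the data.

    bounds is a hypothetical (lat_min, lat_max, long_min, long_max) tuple
    standing in for a Boundary object.
    """
    lat_min, lat_max, long_min, long_max = bounds
    # Data extent is taken from the min/max of its coordinates.
    d_lat_min, d_lat_max = min(lats), max(lats)
    d_long_min, d_long_max = min(longs), max(longs)
    # Overlap of the two rectangles, clipped at zero when they are disjoint.
    lat_overlap = max(0.0, min(lat_max, d_lat_max) - max(lat_min, d_lat_min))
    long_overlap = max(0.0, min(long_max, d_long_max) - max(long_min, d_long_min))
    bounds_area = (lat_max - lat_min) * (long_max - long_min)
    return (lat_overlap * long_overlap) / bounds_area


# Data spanning half of the boundary's latitude range and all of its longitude.
half = coverage_fraction((0, 10, 0, 10), lats=[0, 5], longs=[0, 10])
# Data lying entirely outside the boundary.
none = coverage_fraction((0, 10, 0, 10), lats=[20, 30], longs=[0, 10])
```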

downsample(agg_type=None)

Downsamples imported data so that it is easier to manipulate. Data size should be reduced by a factor of m*n, where (m, n) are the downsample_factors defined in the params. self.data can be a pd.DataFrame or an xr.Dataset.

Parameters:

agg_type (str) – Aggregation method used to bin the data when downsampling. Defaults to the method used for the homogeneity condition.

Returns:

Downsampled data

Return type:

xr.Dataset or pd.DataFrame
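The m*n reduction can be illustrated with block aggregation over a regular grid. The sketch below uses nested lists instead of an xr.Dataset, and which factor applies to which axis is an assumption; downsample_grid is a hypothetical helper, not part of the library.

```python
from statistics import mean


def downsample_grid(grid, factors, agg=mean):
    """Aggregate a 2D grid (list of rows) in blocks of m rows by n columns."""
    m, n = factors  # assumed: m applies to rows, n to columns
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows, m):
        out_row = []
        for c in range(0, cols, n):
            # Collect the block, truncating at the grid edge if needed.
            block = [
                grid[i][j]
                for i in range(r, min(r + m, rows))
                for j in range(c, min(c + n, cols))
            ]
            out_row.append(agg(block))
        out.append(out_row)
    return out


grid = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4, values 0..15
small = downsample_grid(grid, (2, 2))
# small == [[2.5, 4.5], [10.5, 12.5]]: 16 cells reduced to 4
```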

get_data_col_name()

Retrieves the name of the data column (for pd.DataFrame) or variable (for xr.Dataset). Used when data_name is not defined in params.

Returns:

Name of data column

Return type:

str

Raises:

ValueError – If multiple candidate data columns are found, so the data name cannot be determined
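The lookup can be thought of as "whatever column is left once the coordinates are removed". The sketch below works on a plain list of column names; the COORD_NAMES set and the decision to also raise when no candidate remains are assumptions.

```python
COORD_NAMES = {"lat", "long", "time"}  # coordinate names used in this section


def get_data_col_name(columns):
    """Return the single non-coordinate column name, or raise ValueError."""
    candidates = [name for name in columns if name not in COORD_NAMES]
    if len(candidates) != 1:
        # Assumption: an empty candidate list is treated as an error too.
        raise ValueError(f"Expected exactly one data column, found {candidates}")
    return candidates[0]


name = get_data_col_name(["lat", "long", "time", "SIC"])  # "SIC"
```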

get_hom_condition(bounds, splitting_conds, data=None)

Retrieves homogeneity condition of data within boundary.

Parameters:
  • bounds (Boundary) – Boundary object with limits of the data range to analyse

  • splitting_conds (dict) –

    Containing the following keys:

    ’threshold’:

    (float) The value against which each data point within this CellBox is compared, to determine whether it lies above or below

    ’upper_bound’:

    (float) The upper bound on the percentage of data points within this boundary that are above ‘threshold’; above this bound the cellbox is deemed homogeneous (‘HOM’)

    ’lower_bound’:

    (float) The lower bound on the percentage of data points within this boundary that are above ‘threshold’; below this bound the cellbox is deemed clear (‘CLR’)

    ’split_lock’:

    (bool) If True, a cellbox will not be split by other splitting conditions if it is deemed homogeneous. Default = False.

Returns:

The homogeneity condition returned is one of:

’CLR’ = the proportion of data points within this cellbox over a given threshold is lower than the lower bound

’HOM’ = the proportion of data points within this cellbox over a given threshold is higher than the upper bound

’MIN’ = the cellbox contains fewer than a minimum number of data points

’HET’ = the proportion of data points within this cellbox over a given threshold is between the upper and lower bounds

Return type:

str
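The four outcomes listed above can be sketched as a small classifier. This is an illustration under stated assumptions: the bounds are treated as fractions between 0 and 1, min_dp is a hypothetical parameter for the minimum number of data points, the ‘MIN’ check is assumed to come first, and values exactly equal to a bound fall into ‘HET’.

```python
def hom_condition(values, splitting_conds, min_dp=5):
    """Classify data points as 'MIN', 'CLR', 'HOM' or 'HET'."""
    # Too few points to judge (also guards against division by zero).
    if len(values) < max(min_dp, 1):
        return "MIN"
    threshold = splitting_conds["threshold"]
    # Proportion of data points strictly above the threshold.
    frac_over = sum(v > threshold for v in values) / len(values)
    if frac_over < splitting_conds["lower_bound"]:
        return "CLR"
    if frac_over > splitting_conds["upper_bound"]:
        return "HOM"
    return "HET"


conds = {"threshold": 0.5, "lower_bound": 0.1, "upper_bound": 0.9}
mixed = hom_condition([0.0, 1.0] * 5, conds)  # half the points exceed 0.5
```

With these example bounds, a cellbox where half of the points exceed the threshold sits between 0.1 and 0.9 and is classed as ‘HET’.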

get_value(bounds, data=None, agg_type=None, skipna=True)

Retrieves the aggregated value from within bounds

Parameters:
  • agg_type (str) – Method of aggregation of datapoints within bounds. Can be upper or lower case. Accepts ‘MIN’, ‘MAX’, ‘MEAN’, ‘MEDIAN’, ‘STD’, ‘COUNT’

  • bounds (Boundary) – Boundary object with limits of lat/long

  • skipna (bool) – Defines whether NaNs are ignored or propagated. Default = True (ignores NaNs)

Returns:

{variable (str): aggregated_value (float)} Aggregated value within bounds, computed according to agg_type

Return type:

dict

Raises:

ValueError – If agg_type is not in the list of available methods
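The aggregation contract above (case-insensitive method name, optional NaN skipping, a one-entry dict as the result) can be sketched with the standard library. The aggregate function is hypothetical and works on a plain list of values already selected from within the bounds; using the population standard deviation for ‘STD’, and returning NaN for an empty selection, are assumptions.

```python
import math
import statistics

AGGREGATORS = {
    "MIN": min,
    "MAX": max,
    "MEAN": statistics.mean,
    "MEDIAN": statistics.median,
    "STD": statistics.pstdev,  # assumption: population standard deviation
    "COUNT": len,
}


def aggregate(values, data_name, agg_type="MEAN", skipna=True):
    """Return {data_name: aggregated value} for a list of floats."""
    key = agg_type.upper()  # accept upper or lower case
    if key not in AGGREGATORS:
        raise ValueError(f"Unknown aggregation type: {agg_type!r}")
    if skipna:
        values = [v for v in values if not math.isnan(v)]
    elif any(math.isnan(v) for v in values):
        return {data_name: math.nan}  # propagate NaN
    if not values:
        # Assumption: nothing to aggregate gives a count of 0, otherwise NaN.
        return {data_name: 0.0 if key == "COUNT" else math.nan}
    return {data_name: float(AGGREGATORS[key](values))}


result = aggregate([1.0, 3.0, float("nan")], "SIC", agg_type="mean")
# result == {"SIC": 2.0}: the NaN is dropped because skipna defaults to True
```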

abstract import_data(bounds)

User-defined method for importing data from files, or even generating data from scratch

Returns:

Coordinates and data being imported from file

if xr.Dataset,
  • Must have coordinates ‘lat’ and ‘long’

  • Must have single data variable

if pd.DataFrame,
  • Must have columns ‘lat’ and ‘long’

  • Must have single data column

Downsampling and reprojecting happen in the __init__() method

Return type:

xr.Dataset or pd.DataFrame

reproject(in_proj='EPSG:4326', out_proj='EPSG:4326', x_col='lat', y_col='long')

Reprojects data using pyproj.Transformer. self.data can be a pd.DataFrame or an xr.Dataset.

Parameters:
  • in_proj (str) – Projection that the imported dataset is in. Must be allowed by pyproj.CRS (Coordinate Reference System).

  • out_proj (str) – Projection required for final data output. Must be allowed by pyproj.CRS (Coordinate Reference System). Shouldn’t change from the default value (EPSG:4326).

  • x_col (str) – Name of coordinate column 1

  • y_col (str) – Name of coordinate column 2. x_col and y_col will be cast into lat and long by the reprojection.

Returns:

Reprojected data with ‘lat’, ‘long’ columns replacing ‘x_col’ and ‘y_col’

Return type:

pd.DataFrame

set_data_col_name(new_name)

Sets name of data column/data variable

Parameters:

new_name (str) – Name to replace the currently stored name with

Returns:

Data with variable name changed

Return type:

xr.Dataset or pd.DataFrame

trim_datapoints(bounds, data=None)

Trims self.data to the datapoints that lie within the boundary defined by ‘bounds’. self.data can be a pd.DataFrame or an xr.Dataset.

Parameters:

bounds (Boundary) – Limits of lat/long/time to select data from

Returns:

Trimmed dataset in same format as self.data

Return type:

pd.DataFrame or xr.Dataset
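Trimming amounts to keeping only the points whose lat, long and (where present) time fall inside the limits. The sketch below uses a list of dicts in place of a pd.DataFrame and a hypothetical dict of (min, max) tuples in place of a Boundary object; treating the limits as inclusive is an assumption.

```python
def trim_datapoints(records, bounds):
    """Keep records whose coordinates fall inside every (min, max) range.

    records: list of dicts with 'lat', 'long' and optionally 'time'.
    bounds: hypothetical dict mapping a coordinate name to a (min, max) tuple.
    """
    def inside(record):
        for axis, (low, high) in bounds.items():
            # Coordinates absent from the record (e.g. 'time') are not checked.
            if axis in record and not (low <= record[axis] <= high):
                return False
        return True

    return [record for record in records if inside(record)]


records = [
    {"lat": -60.0, "long": -30.0, "time": "2020-01-01", "SIC": 0.8},
    {"lat": -60.0, "long": 10.0, "time": "2020-01-01", "SIC": 0.1},
    {"lat": -60.0, "long": -30.0, "time": "2021-06-01", "SIC": 0.5},
]
bounds = {
    "lat": (-70.0, -50.0),
    "long": (-40.0, -20.0),
    "time": ("2020-01-01", "2020-12-31"),  # ISO dates compare correctly as strings
}
trimmed = trim_datapoints(records, bounds)  # only the first record remains
```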