7.4.1.1. Abstract Vector Dataloader

class meshiphi.dataloaders.vector.abstract_vector.VectorDataLoader(bounds, params)

Abstract class for all vector Datasets.

__init__(bounds, params)

This is where large-scale operations are performed, such as importing data, downsampling, reprojecting, and renaming variables.

Parameters:

bounds (Boundary) – Initial mesh boundary to limit scope of data ingest
params (dict) – Values needed by dataloader to initialise. Unique to each dataloader

self.data

Data stored by dataloader to use when called upon by the mesh. Must be saved in the WGS 84 geographic coordinate system (EPSG:4326), with coordinate names ‘lat’, ‘long’, and ‘time’ (if applicable).

Type: pd.DataFrame or xr.Dataset

self.data_name

Name of the data variable. Must be the column name if self.data is pd.DataFrame, or the variable name if self.data is xr.Dataset.

Type: str

add_default_params(params)

Set default values for all vector dataloaders. This function should be overloaded to include any extra params for a specific dataloader.

Parameters: params (dict) – Dictionary containing attributes that are required for each dataloader.
Returns: Dictionary of attributes the dataloader will require, completed with default values if not provided in config.
Return type: dict

add_mag_dir(data=None, data_names=None)

Adds magnitude and direction variables/columns to data for easier retrieval of value

Parameters:

data (pd.DataFrame or xr.Dataset) – Data with ‘lat’ and ‘long’ columns/dimensions. Assumes that the existing data is in cartesian form (x and y components). If None, will use self.data
data_names (list) – List of data columns/variables to use in calculation. If None, will use self.data_name_list

Returns:

Original dataset with two new columns/variables called ‘_magnitude’ and ‘_direction’, containing the corresponding values for each.

Return type:

data (pd.DataFrame or xr.Dataset)
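As a rough illustration of the calculation described above, the sketch below derives magnitude and direction from two Cartesian component columns of a pd.DataFrame. The helper name, the column names uC and vC, and the direction convention (radians, anticlockwise from the positive x-axis) are assumptions made for this example; the library's own convention may differ.

```python
import numpy as np
import pandas as pd

def add_mag_dir_sketch(data, x_name, y_name):
    """Return a copy of `data` with '_magnitude' and '_direction' columns added."""
    out = data.copy()
    out["_magnitude"] = np.hypot(out[x_name], out[y_name])
    # Assumed convention: angle in radians, anticlockwise from the +x axis
    out["_direction"] = np.arctan2(out[y_name], out[x_name])
    return out

# Hypothetical vector data with x/y components in columns 'uC' and 'vC'
vectors = pd.DataFrame({
    "lat": [-60.0, -61.0],
    "long": [10.0, 11.0],
    "uC": [3.0, 0.0],
    "vC": [4.0, 2.0],
})
result = add_mag_dir_sketch(vectors, "uC", "vC")
```

The input frame is copied rather than modified in place, which is a design choice of this sketch only.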

calc_curl(bounds, data=None, collapse=True, agg_type='MAX')

Calculates the curl of vectors in a cellbox

Parameters:

bounds (Boundary) – Cellbox boundary in which all relevant vectors are contained
data (pd.DataFrame or xr.Dataset) – Dataset with ‘lat’ and ‘long’ columns/dimensions with vectors
collapse (bool) – Flag determining whether to return an aggregated value, or a vector field (values for each individual vector).
agg_type (str) – Method of aggregation if collapsing value. Accepts ‘MAX’ or ‘MEAN’

Returns:

float value of aggregated curl if collapse=True, or pd.DataFrame of curl vector field if collapse=False

Return type:

float or pd.DataFrame

Raises:

ValueError – If agg_type is not ‘MAX’ or ‘MEAN’
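The following is a minimal sketch of a curl calculation with the same collapse and agg_type behaviour, assuming the vectors lie on a regular grid and treating the coordinates as planar. It is not the library's implementation; the function name and the use of np.gradient are choices made for this example.

```python
import numpy as np

def curl_sketch(x, y, u, v, collapse=True, agg_type="MAX"):
    """z-component of curl, dv/dx - du/dy, for a field sampled on a regular grid.

    u and v have shape (len(y), len(x)); coordinates are treated as planar.
    """
    dv_dx = np.gradient(v, x, axis=1)
    du_dy = np.gradient(u, y, axis=0)
    curl = dv_dx - du_dy
    if not collapse:
        return curl  # one value per grid point
    if agg_type == "MAX":
        return float(np.max(curl))
    if agg_type == "MEAN":
        return float(np.mean(curl))
    raise ValueError(f"Unknown agg_type '{agg_type}', expected 'MAX' or 'MEAN'")

x = np.linspace(-1.0, 1.0, 5)
y = np.linspace(-1.0, 1.0, 4)
X, Y = np.meshgrid(x, y)
# Solid-body rotation: u = -y, v = x, whose curl is 2 everywhere
curl_value = curl_sketch(x, y, -Y, X)
curl_field = curl_sketch(x, y, -Y, X, collapse=False)
```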

calc_divergence(bounds, data=None, collapse=True, agg_type='MAX')

Calculates the divergence of vectors in a cellbox

Parameters:

bounds (Boundary) – Cellbox boundary in which all relevant vectors are contained
data (pd.DataFrame or xr.Dataset) – Dataset with ‘lat’ and ‘long’ columns/dimensions with vectors
collapse (bool) – Flag determining whether to return an aggregated value, or a vector field (values for each individual vector).
agg_type (str) – Method of aggregation if collapsing value. Accepts ‘MAX’ or ‘MEAN’

Returns:

float value of aggregated div if collapse=True, or pd.DataFrame of div vector field if collapse=False

Return type:

float or pd.DataFrame

Raises:

ValueError – If agg_type is not ‘MAX’ or ‘MEAN’
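Divergence can be illustrated in the same spirit. The sketch below assumes a regular, planar grid and uses np.gradient; the function name is hypothetical and the approach is not necessarily the one the library takes.

```python
import numpy as np

def divergence_sketch(x, y, u, v):
    """Divergence field du/dx + dv/dy on a regular, planar grid.

    u and v have shape (len(y), len(x)).
    """
    return np.gradient(u, x, axis=1) + np.gradient(v, y, axis=0)

x = np.linspace(0.0, 2.0, 5)
y = np.linspace(0.0, 1.0, 3)
X, Y = np.meshgrid(x, y)
# Uniform expansion: u = x, v = y, whose divergence is 2 everywhere
div_field = divergence_sketch(x, y, X, Y)
div_max = float(div_field.max())    # analogous to collapsing with 'MAX'
div_mean = float(div_field.mean())  # analogous to collapsing with 'MEAN'
```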

calc_dmag(bounds, data=None, collapse=True, agg_type='MEAN')

Calculates the dmag of vectors in a cellbox. dmag is defined as the magnitude of the difference between each vector and the mean vector within the bounds.

dmag = mag(vector - mean_vector)

Parameters:

bounds (Boundary) – Cellbox boundary in which all relevant vectors are contained
data (pd.DataFrame or xr.Dataset) – Dataset with ‘lat’ and ‘long’ columns/dimensions with vectors
collapse (bool) – Flag determining whether to return an aggregated value, or a vector field (values for each individual vector).
agg_type (str) – Method of aggregation if collapsing value. Accepts ‘MAX’ or ‘MEAN’

Returns:

float value of aggregated dmag if collapse=True, or pd.DataFrame of dmag vector field if collapse=False

Return type:

float or pd.DataFrame

Raises:

ValueError – If agg_type is not ‘MAX’ or ‘MEAN’
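The formula dmag = mag(vector - mean_vector) can be sketched for a pd.DataFrame as follows. The helper name and the component column names uC and vC are assumptions for this example, and the data is taken to be already trimmed to the cellbox.

```python
import numpy as np
import pandas as pd

def dmag_sketch(data, x_name, y_name, collapse=True, agg_type="MEAN"):
    """Magnitude of (vector - mean vector) for each row of `data`."""
    mean_x = data[x_name].mean()
    mean_y = data[y_name].mean()
    dmag = np.hypot(data[x_name] - mean_x, data[y_name] - mean_y)
    if not collapse:
        # One dmag value per vector, alongside its coordinates
        return data[["lat", "long"]].assign(dmag=dmag)
    if agg_type == "MEAN":
        return float(dmag.mean())
    if agg_type == "MAX":
        return float(dmag.max())
    raise ValueError(f"Unknown agg_type '{agg_type}', expected 'MAX' or 'MEAN'")

currents = pd.DataFrame({
    "lat": [0.0, 1.0, 2.0],
    "long": [0.0, 1.0, 2.0],
    "uC": [0.0, 6.0, 3.0],
    "vC": [0.0, 8.0, 4.0],
})
# The mean vector is (3, 4), so the per-vector dmag values are 5, 5 and 0
dmag_mean = dmag_sketch(currents, "uC", "vC")
dmag_max = dmag_sketch(currents, "uC", "vC", agg_type="MAX")
dmag_field = dmag_sketch(currents, "uC", "vC", collapse=False)
```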

calc_reynolds_number(bounds)

Calculates an approximate Reynolds number from the mean vector velocity and cellbox size.

Currently assumes the density and viscosity of seawater at 4°C. Will need minor reworking to include different fluids.

Parameters: bounds (Boundary) – Cellbox boundary to calculate characteristic length from
Returns: Reynolds number of cellbox
Return type: float
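For reference, the standard definition is Re = density × speed × length / dynamic viscosity. The sketch below applies that formula. The default density and viscosity are rough illustrative values for cold seawater chosen for this example and are not taken from the library; the function name is hypothetical.

```python
def reynolds_number_sketch(mean_speed, length, density=1028.0, viscosity=1.7e-3):
    """Re = density * speed * length / dynamic viscosity, all in SI units.

    The defaults are approximate values for cold seawater (assumed, not
    the library's constants): density in kg/m^3, viscosity in Pa*s.
    """
    return density * mean_speed * length / viscosity

# Mean current speed of 0.2 m/s across a cellbox roughly 5 km wide
re_cellbox = reynolds_number_sketch(mean_speed=0.2, length=5000.0)
```

Exposing the fluid properties as keyword arguments is one way the single-fluid limitation noted above could be lifted.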

calculate_coverage(bounds, data=None)

Calculates percentage of boundary covered by dataset

Parameters:

bounds (Boundary) – Boundary being compared against
data (pd.DataFrame or xr.Dataset) – Dataset with ‘lat’ and ‘long’ coordinates. Extent calculated from min/max of these coordinates. Defaults to the object’s internal dataset.

Returns:

Decimal fraction of boundary covered by the dataset

Return type:

float
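One way to picture the coverage fraction is as the overlap between the boundary rectangle and the rectangle spanned by the data's min/max coordinates. The sketch below makes that assumption and measures area in degree space; the function name and tuple-based arguments are hypothetical and stand in for the Boundary object.

```python
def coverage_sketch(bounds_lat, bounds_long, data_lats, data_longs):
    """Fraction of a lat/long rectangle overlapped by the data's min/max extent."""
    lat_min, lat_max = bounds_lat
    long_min, long_max = bounds_long
    # Clip the data extent to the boundary; no overlap gives zero width
    lat_overlap = max(0.0, min(lat_max, max(data_lats)) - max(lat_min, min(data_lats)))
    long_overlap = max(0.0, min(long_max, max(data_longs)) - max(long_min, min(data_longs)))
    bounds_area = (lat_max - lat_min) * (long_max - long_min)
    return lat_overlap * long_overlap / bounds_area

# Data spans lat 5..15 and long 0..10, boundary spans 0..10 in both
fraction = coverage_sketch((0.0, 10.0), (0.0, 10.0), [5.0, 15.0], [0.0, 4.0, 10.0])
# Data entirely outside the boundary
no_overlap = coverage_sketch((0.0, 10.0), (0.0, 10.0), [20.0, 30.0], [0.0, 10.0])
```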

downsample(agg_type=None)

Downsamples imported data to be more easily manipulated. Data size should be reduced by a factor of m*n, where (m,n) are the downsample_factors defined in the params. self.data can be pd.DataFrame or xr.Dataset

Parameters: agg_type (str) – Method of aggregation used to bin data when downsampling. Defaults to the method used for the homogeneity condition.
Returns: Downsampled data
Return type: xr.Dataset or pd.DataFrame
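The m*n size reduction can be illustrated with block aggregation on a plain 2D array. This is a sketch under simplifying assumptions: the data is a regular grid held in a NumPy array, and its shape is exactly divisible by the downsample factors. The library itself operates on pd.DataFrame or xr.Dataset, and the function name here is hypothetical.

```python
import numpy as np

def downsample_sketch(grid, factors, agg_type="MEAN"):
    """Aggregate non-overlapping (m, n) blocks of a 2D array into single values."""
    m, n = factors
    rows, cols = grid.shape
    # Split each axis into (number of blocks, block size)
    blocks = grid.reshape(rows // m, m, cols // n, n)
    if agg_type == "MEAN":
        return blocks.mean(axis=(1, 3))
    if agg_type == "MAX":
        return blocks.max(axis=(1, 3))
    raise ValueError(f"Unsupported agg_type '{agg_type}' in this sketch")

grid = np.arange(24.0).reshape(4, 6)
coarse = downsample_sketch(grid, (2, 3))
```

Here a 4×6 grid with factors (2, 3) becomes a 2×2 grid, so the number of values drops by a factor of 6.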

get_data_col_name()

Retrieve name of data column (for pd.DataFrame), or variable (for xr.Dataset). Used when data_name is not defined in params. Variable names are appended and comma-separated.

Returns: Name of data columns, comma-separated
Return type: str

get_data_col_name_list()

Retrieve names of data columns (for pd.DataFrame), or variable (for xr.Dataset). Used for when data_name not defined in params.

Returns: Contains strings of data names
Return type: list

get_hom_condition(bounds, splitting_conds, agg_type='MEAN', data=None)

Retrieves homogeneity condition of data within boundary.

Parameters:

bounds (Boundary) – Boundary object with limits of datarange to analyse
splitting_conds (dict) – Dictionary containing the following key:

‘threshold’ (float) – The threshold at which data points of type ‘value’ within this CellBox are checked to be either above or below

Returns:

The homogeneity condition returned is one of:

‘MIN’ = the cellbox contains fewer than a minimum number of data points

‘HET’ = Threshold values defined in config are exceeded

‘CLR’ = None of the HET conditions were triggered

Return type:

str
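The three return values can be read as a simple decision sequence. The sketch below is an assumption about how such a check might be ordered, not the library's logic: the function name, the min_points parameter, the use of a single pre-computed metric, and the comparison on its absolute value are all hypothetical.

```python
def hom_condition_sketch(n_points, metric, threshold, min_points=1):
    """Return 'MIN', 'HET' or 'CLR' for a cellbox.

    n_points:   number of data points inside the cellbox
    metric:     aggregated value for the cellbox (e.g. a mean or maximum)
    threshold:  value taken from splitting_conds['threshold']
    min_points: assumed minimum number of points needed to assess the cellbox
    """
    if n_points < min_points:
        return "MIN"
    if abs(metric) > threshold:
        return "HET"
    return "CLR"

splitting_conds = {"threshold": 0.05}
too_sparse = hom_condition_sketch(0, 0.0, splitting_conds["threshold"])
exceeded = hom_condition_sketch(40, 0.12, splitting_conds["threshold"])
clear = hom_condition_sketch(40, 0.01, splitting_conds["threshold"])
```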

get_value(bounds, agg_type=None, skipna=True, data=None)

Retrieve aggregated value from within bounds

Parameters:

bounds (Boundary) – Boundary object with limits of lat/long
agg_type (str) – Method of aggregation of datapoints within bounds. Can be upper or lower case. Accepts ‘MIN’, ‘MAX’, ‘MEAN’, ‘MEDIAN’, ‘STD’, ‘COUNT’
skipna (bool) – Defines whether to propagate NaNs or not. Default = True (ignores NaNs)

Returns:

{variable (str): aggregated_value (float)} Aggregated value within bounds, computed according to agg_type

Return type:

dict

Raises:

ValueError – If the aggregation type is not in the list of available methods
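A minimal sketch of the aggregation for a pd.DataFrame is shown below, returning one value per data column in the dict form described above. The function name, the tuple-based ranges standing in for the Boundary object, and the inclusive range check are assumptions for this example.

```python
import pandas as pd

# Accepted aggregation names mapped to pandas Series methods taking skipna
AGG_METHODS = {"MIN": "min", "MAX": "max", "MEAN": "mean",
               "MEDIAN": "median", "STD": "std"}

def aggregate_in_bounds(data, lat_range, long_range, data_names,
                        agg_type="MEAN", skipna=True):
    """Aggregate each named column over the rows that fall inside the ranges."""
    agg_type = agg_type.upper()
    if agg_type != "COUNT" and agg_type not in AGG_METHODS:
        raise ValueError(f"Unknown aggregation type '{agg_type}'")
    # Assumed inclusive on both edges; the library's Boundary may differ
    inside = data[data["lat"].between(*lat_range)
                  & data["long"].between(*long_range)]
    values = {}
    for name in data_names:
        if agg_type == "COUNT":
            values[name] = int(inside[name].count())
        else:
            method = getattr(inside[name], AGG_METHODS[agg_type])
            values[name] = float(method(skipna=skipna))
    return values

currents = pd.DataFrame({
    "lat": [0.0, 1.0, 2.0, 5.0],
    "long": [0.0, 1.0, 2.0, 5.0],
    "uC": [1.0, 2.0, 3.0, 100.0],
    "vC": [0.0, 0.0, 3.0, 100.0],
})
mean_values = aggregate_in_bounds(currents, (0.0, 3.0), (0.0, 3.0), ["uC", "vC"])
max_values = aggregate_in_bounds(currents, (0.0, 3.0), (0.0, 3.0), ["uC", "vC"], "max")
```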

abstract import_data(bounds)

User-defined method for importing data from files, or even generating data from scratch.

Returns:

Coordinates and data being imported from file

if xr.Dataset,

Must have coordinates ‘lat’ and ‘long’
Should have multiple data variables

if pd.DataFrame,

Must have columns ‘lat’ and ‘long’
Should have multiple data columns

Downsampling and reprojecting happen in __init__() method

Return type:

xr.Dataset or pd.DataFrame
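To show the expected shape of the returned data, the sketch below generates a synthetic pd.DataFrame with ‘lat’ and ‘long’ columns and two data columns. It is written as a stand-alone function rather than a subclass so that it runs without the library; the function name, the column names uC and vC, and the synthetic field itself are hypothetical.

```python
import numpy as np
import pandas as pd

def make_synthetic_vectors(lat_range, long_range, n_lat=3, n_long=4):
    """Build a DataFrame of vectors on a regular lat/long grid."""
    lats = np.linspace(lat_range[0], lat_range[1], n_lat)
    longs = np.linspace(long_range[0], long_range[1], n_long)
    long_grid, lat_grid = np.meshgrid(longs, lats)
    return pd.DataFrame({
        "lat": lat_grid.ravel(),
        "long": long_grid.ravel(),
        # Eastward component varies with latitude, northward component is zero
        "uC": np.cos(np.radians(lat_grid.ravel())),
        "vC": np.zeros(lat_grid.size),
    })

synthetic = make_synthetic_vectors((-65.0, -60.0), (-70.0, -50.0))
```

In a concrete dataloader, a body like this would sit inside import_data(bounds), with the ranges taken from the bounds argument.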

reproject(in_proj='EPSG:4326', out_proj='EPSG:4326', x_col='lat', y_col='long')

Reprojects data using pyproj.Transformer. self.data can be pd.DataFrame or xr.Dataset

Parameters:

in_proj (str) – Projection that the imported dataset is in. Must be allowed by pyproj.CRS (Coordinate Reference System)
out_proj (str) – Projection required for final data output. Must be allowed by pyproj.CRS (Coordinate Reference System). Shouldn’t change from default value (EPSG:4326)
x_col (str) – Name of coordinate column 1
y_col (str) – Name of coordinate column 2. x_col and y_col will be cast into lat and long by the reprojection

Returns:

Reprojected data with ‘lat’, ‘long’ columns replacing ‘x_col’ and ‘y_col’

Return type:

pd.DataFrame

set_data_col_name(new_names)

Sets name of data column/data variables from a comma-separated string

Parameters: new_names (str) – Comma-separated string of new variable names
Returns: Data with variable name changed
Return type: xr.Dataset or pd.DataFrame

set_data_col_name_list(new_names)

Sets name of data column/data variables from a list of strings. Also updates self.data_name_list with new names from list

Parameters: new_names (list) – List of strings containing new variable names
Returns: Original dataset with data variables renamed
Return type: pd.DataFrame or xr.Dataset
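Renaming from either a list or a comma-separated string can be sketched for a pd.DataFrame as follows. The function name and the explicit old_names argument are assumptions for this example; the dataloader itself tracks the current names internally.

```python
import pandas as pd

def rename_data_columns(data, old_names, new_names):
    """Rename data columns, accepting a list or a comma-separated string."""
    if isinstance(new_names, str):
        new_names = new_names.split(",")
    if len(new_names) != len(old_names):
        raise ValueError("Number of new names must match number of data columns")
    return data.rename(columns=dict(zip(old_names, new_names)))

raw = pd.DataFrame({"lat": [0.0], "long": [1.0], "a": [0.3], "b": [-0.1]})
from_list = rename_data_columns(raw, ["a", "b"], ["uC", "vC"])
from_string = rename_data_columns(raw, ["a", "b"], "uC,vC")
```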

trim_datapoints(bounds, data=None)

Trims datapoints from self.data within boundary defined by ‘bounds’. self.data can be pd.DataFrame or xr.Dataset

Parameters: bounds (Boundary) – Limits of lat/long/time to select data from
Returns: Trimmed dataset in same format as self.data
Return type: pd.DataFrame or xr.Dataset
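For a pd.DataFrame, trimming amounts to a row filter on the coordinate columns. The sketch below includes the optional time limit; the function name, the tuple-based ranges standing in for the Boundary object, and the inclusive comparisons are assumptions for this example.

```python
import pandas as pd

def trim_sketch(data, lat_range, long_range, time_range=None):
    """Keep only rows whose lat/long (and optionally time) fall inside the ranges."""
    mask = data["lat"].between(*lat_range) & data["long"].between(*long_range)
    if time_range is not None and "time" in data.columns:
        start = pd.to_datetime(time_range[0])
        end = pd.to_datetime(time_range[1])
        mask &= pd.to_datetime(data["time"]).between(start, end)
    return data[mask]

observations = pd.DataFrame({
    "lat": [-62.0, -61.0, -40.0],
    "long": [-55.0, -54.0, -54.0],
    "time": ["2020-01-01", "2020-03-01", "2020-01-01"],
    "uC": [0.1, 0.2, 0.3],
    "vC": [0.0, 0.1, 0.2],
})
trimmed = trim_sketch(observations, (-65.0, -60.0), (-60.0, -50.0),
                      time_range=("2020-01-01", "2020-01-31"))
```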