7.4.1.1. Abstract Vector Dataloader
- class meshiphi.dataloaders.vector.abstract_vector.VectorDataLoader(bounds, params)
Abstract class for all vector Datasets.
- __init__(bounds, params)
This is where large-scale operations are performed, such as importing data, downsampling, reprojecting, and renaming variables
- Parameters:
bounds (Boundary) – Initial mesh boundary to limit scope of data ingest
params (dict) – Values needed by dataloader to initialise. Unique to each dataloader
- self.data
Data stored by dataloader to use when called upon by the mesh. Must be saved in the WGS 84 geographic coordinate system (EPSG:4326), with coordinate names ‘lat’, ‘long’, and ‘time’ (if applicable).
- Type:
pd.DataFrame or xr.Dataset
- self.data_name
Name of the data variable(s). Must be the column name if self.data is pd.DataFrame, or the variable name if self.data is xr.Dataset
- Type:
str
- add_default_params(params)
Set default values for all vector dataloaders. This function should be overloaded to include any extra params for a specific dataloader
- Parameters:
params (dict) – Dictionary containing attributes that are required for each dataloader.
- Returns:
Dictionary of attributes the dataloader will require, completed with default values if not provided in config.
- Return type:
(dict)
- add_mag_dir(data=None, data_names=None)
Adds magnitude and direction variables/columns to data for easier retrieval of value
- Parameters:
data (pd.DataFrame or xr.Dataset) – Data with ‘lat’ and ‘long’ columns/dimensions. Assumes that the existing data is in cartesian form (x and y components). If None, will use self.data
data_names (list) – List of data columns/variables to use in calculation. If None, will use self.data_name_list
- Returns:
Original dataset with two new columns/variables called ‘_magnitude’ and ‘_direction’, containing the corresponding values for each.
- Return type:
data (pd.DataFrame or xr.Dataset)
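The conversion from cartesian components to magnitude and direction can be sketched in plain Python as below. This is an illustration only, not the library's implementation: the column names `uC` and `vC` are hypothetical, rows are plain dictionaries rather than a pd.DataFrame, and the direction convention (degrees anticlockwise from the positive x axis) is an assumption that may differ from the one meshiphi uses.

```python
import math

def add_mag_dir(rows, x_name, y_name):
    """Return copies of rows with '_magnitude' and '_direction' keys added."""
    out = []
    for row in rows:
        x, y = row[x_name], row[y_name]
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row["_magnitude"] = math.hypot(x, y)
        # Assumed convention: degrees anticlockwise from the +x axis.
        new_row["_direction"] = math.degrees(math.atan2(y, x))
        out.append(new_row)
    return out

# Hypothetical vector components 'uC' (x) and 'vC' (y).
rows = [
    {"lat": 60.0, "long": -30.0, "uC": 3.0, "vC": 4.0},
    {"lat": 60.5, "long": -30.0, "uC": 0.0, "vC": -2.0},
]
result = add_mag_dir(rows, "uC", "vC")
```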
- calc_curl(bounds, data=None, collapse=True, agg_type='MAX')
Calculates the curl of vectors in a cellbox
- Parameters:
bounds (Boundary) – Cellbox boundary in which all relevant vectors are contained
data (pd.DataFrame or xr.Dataset) – Dataset with ‘lat’ and ‘long’ columns/dimensions with vectors
collapse (bool) – Flag determining whether to return an aggregated value, or a vector field (values for each individual vector).
agg_type (str) – Method of aggregation if collapsing value. Accepts ‘MAX’ or ‘MEAN’
- Returns:
float value of aggregated curl if collapse=True, or pd.DataFrame of curl vector field if collapse=False
- Return type:
float or pd.DataFrame
- Raises:
ValueError – If agg_type is not ‘MAX’ or ‘MEAN’
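As a rough illustration of the collapse and agg_type behaviour described above, the sketch below computes a central-difference curl (dv/dx - du/dy) on a small regular grid. It assumes gridded inputs held in nested lists and unit grid spacing; the real method works on data inside a Boundary, and its numerical scheme is not documented in this section.

```python
def curl_field(u, v, dx=1.0, dy=1.0):
    """Central-difference curl (dv/dx - du/dy) at interior grid points."""
    n_rows, n_cols = len(u), len(u[0])
    curl = []
    for i in range(1, n_rows - 1):        # i indexes y (rows)
        for j in range(1, n_cols - 1):    # j indexes x (columns)
            dv_dx = (v[i][j + 1] - v[i][j - 1]) / (2 * dx)
            du_dy = (u[i + 1][j] - u[i - 1][j]) / (2 * dy)
            curl.append(dv_dx - du_dy)
    return curl

def calc_curl(u, v, collapse=True, agg_type="MAX"):
    """Return the curl field, or a single aggregated value if collapse=True."""
    values = curl_field(u, v)
    if not collapse:
        return values
    if agg_type == "MAX":
        return max(values)
    if agg_type == "MEAN":
        return sum(values) / len(values)
    raise ValueError(f"agg_type '{agg_type}' is not 'MAX' or 'MEAN'")

# Rigid rotation u = -y, v = x has a constant curl of 2.
xs = ys = [0.0, 1.0, 2.0]
u = [[-y for x in xs] for y in ys]
v = [[x for x in xs] for y in ys]
max_curl = calc_curl(u, v)
```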
- calc_divergence(bounds, data=None, collapse=True, agg_type='MAX')
Calculates the divergence of vectors in a cellbox
- Parameters:
bounds (Boundary) – Cellbox boundary in which all relevant vectors are contained
data (pd.DataFrame or xr.Dataset) – Dataset with ‘lat’ and ‘long’ columns/dimensions with vectors
collapse (bool) – Flag determining whether to return an aggregated value, or a vector field (values for each individual vector).
agg_type (str) – Method of aggregation if collapsing value. Accepts ‘MAX’ or ‘MEAN’
- Returns:
float value of aggregated div if collapse=True, or pd.DataFrame of div vector field if collapse=False
- Return type:
float or pd.DataFrame
- Raises:
ValueError – If agg_type is not ‘MAX’ or ‘MEAN’
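A minimal sketch of the divergence (du/dx + dv/dy) with the same two aggregation modes follows. It assumes a regular grid stored as nested lists; `AGGREGATORS` and `aggregate` are hypothetical helper names, not part of meshiphi.

```python
def divergence_field(u, v, dx=1.0, dy=1.0):
    """Central-difference divergence (du/dx + dv/dy) at interior grid points."""
    n_rows, n_cols = len(u), len(u[0])
    return [
        (u[i][j + 1] - u[i][j - 1]) / (2 * dx)
        + (v[i + 1][j] - v[i - 1][j]) / (2 * dy)
        for i in range(1, n_rows - 1)
        for j in range(1, n_cols - 1)
    ]

# Hypothetical lookup of the two documented aggregation methods.
AGGREGATORS = {"MAX": max, "MEAN": lambda vals: sum(vals) / len(vals)}

def aggregate(values, agg_type="MAX"):
    """Collapse a list of values, rejecting unknown aggregation types."""
    if agg_type not in AGGREGATORS:
        raise ValueError(f"agg_type '{agg_type}' is not 'MAX' or 'MEAN'")
    return AGGREGATORS[agg_type](values)

# Field u = 2x, v = y has a constant divergence of 3.
xs = ys = [0.0, 1.0, 2.0, 3.0]
u = [[2 * x for x in xs] for y in ys]
v = [[y for x in xs] for y in ys]
div = divergence_field(u, v)
```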
- calc_dmag(bounds, data=None, collapse=True, agg_type='MEAN')
Calculates the dmag of vectors in a cellbox. dmag is defined as the magnitude of the difference between each vector and the mean vector within the bounds:
dmag = mag(vector - mean_vector)
- Parameters:
bounds (Boundary) – Cellbox boundary in which all relevant vectors are contained
data (pd.DataFrame or xr.Dataset) – Dataset with ‘lat’ and ‘long’ columns/dimensions with vectors
collapse (bool) – Flag determining whether to return an aggregated value, or a vector field (values for each individual vector).
agg_type (str) – Method of aggregation if collapsing value. Accepts ‘MAX’ or ‘MEAN’
- Returns:
float value of aggregated dmag if collapse=True, or pd.DataFrame of dmag vector field if collapse=False
- Return type:
float or pd.DataFrame
- Raises:
ValueError – If agg_type is not ‘MAX’ or ‘MEAN’
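The formula dmag = mag(vector - mean_vector) can be checked with a short sketch. Vectors are given here as (x, y) tuples, which is an assumption made for brevity. The example uses two opposing vectors of equal length: their magnitudes are identical, yet each one differs from the mean vector by a full unit.

```python
import math

def calc_dmag(vectors, collapse=True, agg_type="MEAN"):
    """Magnitude of (vector - mean_vector) for each (x, y) vector."""
    mean_x = sum(x for x, _ in vectors) / len(vectors)
    mean_y = sum(y for _, y in vectors) / len(vectors)
    dmag = [math.hypot(x - mean_x, y - mean_y) for x, y in vectors]
    if not collapse:
        return dmag
    if agg_type == "MEAN":
        return sum(dmag) / len(dmag)
    if agg_type == "MAX":
        return max(dmag)
    raise ValueError(f"agg_type '{agg_type}' is not 'MAX' or 'MEAN'")

# Two opposing unit vectors: the mean vector is (0, 0), so each dmag is 1.
opposing = [(1.0, 0.0), (-1.0, 0.0)]
per_vector = calc_dmag(opposing, collapse=False)
mean_dmag = calc_dmag(opposing)
```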
- calc_reynolds_number(bounds)
Calculates an approximate Reynolds number from the mean vector velocity and cellbox size.
Note: currently assumes the density and viscosity of seawater at 4°C. Minor reworking is needed to support other fluids.
- Parameters:
bounds (Boundary) – Cellbox boundary to calculate characteristic length from
- Returns:
Reynolds number of cellbox
- Return type:
float
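For reference, the Reynolds number is Re = density × speed × length / dynamic viscosity. The sketch below applies that formula to the magnitude of the mean vector. The fluid constants are rough, assumed values for cold seawater and are not taken from the library, and the characteristic length is passed in directly because this section does not say how it is derived from the cellbox.

```python
import math

# Assumed, approximate properties of cold seawater (illustrative only).
SEAWATER_DENSITY = 1028.0     # kg / m^3
SEAWATER_VISCOSITY = 1.7e-3   # Pa s (dynamic viscosity)

def reynolds_number(vectors, length_m,
                    density=SEAWATER_DENSITY, viscosity=SEAWATER_VISCOSITY):
    """Re = density * |mean velocity| * length / dynamic viscosity."""
    mean_x = sum(x for x, _ in vectors) / len(vectors)
    mean_y = sum(y for _, y in vectors) / len(vectors)
    speed = math.hypot(mean_x, mean_y)  # magnitude of the mean vector, m/s
    return density * speed * length_m / viscosity

# Uniform 0.5 m/s current across a hypothetical 10 km characteristic length.
currents = [(0.3, 0.4), (0.3, 0.4)]
re_10km = reynolds_number(currents, length_m=10_000.0)
```

Because the mean vector is used, opposing currents cancel and give a Reynolds number of zero even when individual speeds are large.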
- calculate_coverage(bounds, data=None)
Calculates the fraction of the boundary covered by the dataset
- Parameters:
bounds (Boundary) – Boundary being compared against
data (pd.DataFrame or xr.Dataset) – Dataset with ‘lat’ and ‘long’ coordinates. Extent calculated from min/max of these coordinates. Defaults to object’s internal dataset.
- Returns:
Decimal fraction of boundary covered by the dataset
- Return type:
float
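One way to read this description is as the overlap between the data's bounding box and the boundary, divided by the boundary's area. The sketch below implements that reading in degree space. The tuple layout of `bounds` and the (lat, long) point format are assumptions, since the Boundary class is not described here.

```python
def calculate_coverage(bounds, points):
    """Fraction of the bounds rectangle overlapped by the data extent.

    bounds: (lat_min, lat_max, long_min, long_max), an assumed layout.
    points: iterable of (lat, long) pairs.
    """
    lat_min, lat_max, long_min, long_max = bounds
    lats = [lat for lat, _ in points]
    longs = [lon for _, lon in points]
    # Clamp at zero so disjoint extents give no coverage.
    overlap_lat = max(0.0, min(lat_max, max(lats)) - max(lat_min, min(lats)))
    overlap_long = max(0.0, min(long_max, max(longs)) - max(long_min, min(longs)))
    bounds_area = (lat_max - lat_min) * (long_max - long_min)
    return (overlap_lat * overlap_long) / bounds_area

bounds = (0.0, 10.0, 0.0, 10.0)
half = calculate_coverage(bounds, [(0.0, 0.0), (5.0, 10.0)])
none = calculate_coverage(bounds, [(20.0, 20.0), (30.0, 30.0)])
```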
- downsample(agg_type=None)
Downsamples imported data to be more easily manipulated. Data size should be reduced by a factor of m*n, where (m,n) are the downsample_factors defined in the params. self.data can be pd.DataFrame or xr.Dataset
- Parameters:
agg_type (str) – Method of aggregation to bin data by to downsample. Default is same method used for homogeneity condition.
- Returns:
Downsampled data
- Return type:
xr.Dataset or pd.DataFrame
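The m*n size reduction can be illustrated with block aggregation over a 2D grid. This is a sketch under stated assumptions: the grid is a nested list, m applies to rows and n to columns, and the mean is used as the aggregation. The library's handling of pd.DataFrame and xr.Dataset inputs is not shown.

```python
from statistics import mean

def downsample(grid, factors, agg=mean):
    """Aggregate a 2D grid in blocks of m rows by n columns."""
    m, n = factors
    out = []
    for i in range(0, len(grid), m):
        row = []
        for j in range(0, len(grid[0]), n):
            # Flatten the block; edge blocks may be smaller than m by n.
            block = [val for r in grid[i:i + m] for val in r[j:j + n]]
            row.append(agg(block))
        out.append(row)
    return out

# 4x4 grid holding the values 0.0 to 15.0, downsampled by (2, 2).
grid = [[float(4 * i + j) for j in range(4)] for i in range(4)]
small = downsample(grid, (2, 2))
```

Here 16 values become 4, a reduction by a factor of 2*2.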
- get_data_col_name()
Retrieve name of data column (for pd.DataFrame), or variable (for xr.Dataset). Used for when data_name not defined in params. Variable names are appended and comma separated
- Returns:
Name of data columns, comma separated
- Return type:
str
- get_data_col_name_list()
Retrieve names of data columns (for pd.DataFrame), or variable (for xr.Dataset). Used for when data_name not defined in params.
- Returns:
Contains strings of data names
- Return type:
list
- get_hom_condition(bounds, splitting_conds, agg_type='MEAN', data=None)
Retrieves homogeneity condition of data within boundary.
- Parameters:
bounds (Boundary) – Boundary object with limits of datarange to analyse
splitting_conds (dict) –
Containing the following keys:
- ’threshold’:
(float) The threshold at which data points of type ‘value’ within this CellBox are checked to be either above or below
- Returns:
The homogeneity condition returned is one of:
’MIN’ = the cellbox contains less than a minimum number of data points
’HET’ = Threshold values defined in config are exceeded
’CLR’ = None of the HET conditions were triggered
- Return type:
str
- get_value(bounds, agg_type=None, skipna=True, data=None)
Retrieve aggregated value from within bounds
- Parameters:
agg_type (str) – Method of aggregation of datapoints within bounds. Can be upper or lower case. Accepts ‘MIN’, ‘MAX’, ‘MEAN’, ‘MEDIAN’, ‘STD’, ‘COUNT’
bounds (Boundary) – Boundary object with limits of lat/long
skipna (bool) – Defines whether to ignore NaNs (True) or propagate them (False). Default = True
- Returns:
{variable (str): aggregated_value (float)} Aggregated value within bounds following agg_type
- Return type:
dict
- Raises:
ValueError – aggregation type not in list of available methods
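The shape of the returned dictionary and the effect of skipna can be sketched as follows. This is not the library's code: rows are plain dictionaries, the variable names `uC` and `vC` are hypothetical, and using the population standard deviation for ‘STD’ is an assumption.

```python
import math
import statistics

AGGREGATORS = {
    "MIN": min,
    "MAX": max,
    "MEAN": statistics.mean,
    "MEDIAN": statistics.median,
    "STD": statistics.pstdev,  # assumption: population standard deviation
    "COUNT": len,
}

def get_value(rows, variables, agg_type="MEAN", skipna=True):
    """Return {variable: aggregated value} over the given rows."""
    key = agg_type.upper()  # accept upper or lower case
    if key not in AGGREGATORS:
        raise ValueError(f"Unknown aggregation type: {agg_type}")
    result = {}
    for var in variables:
        values = [row[var] for row in rows]
        has_nan = any(math.isnan(v) for v in values)
        if skipna:
            values = [v for v in values if not math.isnan(v)]
        if key == "COUNT":
            result[var] = float(len(values))
        elif not values or (has_nan and not skipna):
            result[var] = float("nan")  # propagate NaN or signal no data
        else:
            result[var] = float(AGGREGATORS[key](values))
    return result

rows = [
    {"uC": 1.0, "vC": 2.0},
    {"uC": float("nan"), "vC": 4.0},
    {"uC": 3.0, "vC": 6.0},
]
means = get_value(rows, ["uC", "vC"])
```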
- abstract import_data(bounds)
User defined method for importing data from files, or even generating data from scratch
- Returns:
Coordinates and data being imported from file
- if xr.Dataset,
Must have coordinates ‘lat’ and ‘long’
Should have multiple data variables
- if pd.DataFrame,
Must have columns ‘lat’ and ‘long’
Should have multiple data columns
Downsampling and reprojecting happen in __init__() method
- Return type:
xr.Dataset or pd.DataFrame
- reproject(in_proj='EPSG:4326', out_proj='EPSG:4326', x_col='lat', y_col='long')
Reprojects data using pyproj.Transformer. self.data can be pd.DataFrame or xr.Dataset
- Parameters:
in_proj (str) – Projection that the imported dataset is in. Must be allowed by pyproj.CRS (Coordinate Reference System)
out_proj (str) – Projection required for final data output. Must be allowed by pyproj.CRS (Coordinate Reference System). Shouldn’t change from default value (EPSG:4326)
x_col (str) – Name of coordinate column 1
y_col (str) – Name of coordinate column 2. x_col and y_col will be cast into lat and long by the reprojection
- Returns:
Reprojected data with ‘lat’, ‘long’ columns replacing ‘x_col’ and ‘y_col’
- Return type:
pd.DataFrame
- set_data_col_name(new_names)
Sets name of data column/data variables from a comma-separated string
- Parameters:
new_names (str) – Comma-separated string containing the new variable names
- Returns:
Data with variable name changed
- Return type:
xr.Dataset or pd.DataFrame
- set_data_col_name_list(new_names)
Sets name of data column/data variables from a list of strings. Also updates self.data_name_list with new names from list
- Parameters:
new_names (list) – List of strings containing new variable names
- Returns:
Original dataset with data variables renamed
- Return type:
pd.DataFrame or xr.Dataset
- trim_datapoints(bounds, data=None)
Trims datapoints from self.data within boundary defined by ‘bounds’. self.data can be pd.DataFrame or xr.Dataset
- Parameters:
bounds (Boundary) – Limits of lat/long/time to select data from
- Returns:
Trimmed dataset in same format as self.data
- Return type:
pd.DataFrame or xr.Dataset
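Trimming amounts to keeping only the datapoints whose lat, long, and (if present) time fall inside the limits. A minimal sketch with rows as dictionaries is given below. The range arguments stand in for the Boundary object, and treating both ends of each range as inclusive is an assumption.

```python
from datetime import date

def trim_datapoints(rows, lat_range, long_range, time_range=None):
    """Keep rows whose lat/long (and time, if given) fall within the ranges."""
    def inside(row):
        ok = (lat_range[0] <= row["lat"] <= lat_range[1]
              and long_range[0] <= row["long"] <= long_range[1])
        # Time is only checked when both a range and a 'time' value exist.
        if ok and time_range is not None and "time" in row:
            ok = time_range[0] <= row["time"] <= time_range[1]
        return ok
    return [row for row in rows if inside(row)]

rows = [
    {"lat": -65.0, "long": -70.0, "time": date(2020, 1, 1), "uC": 0.1},
    {"lat": -65.0, "long": -70.0, "time": date(2020, 3, 1), "uC": 0.2},
    {"lat": -50.0, "long": -70.0, "time": date(2020, 1, 1), "uC": 0.3},
]
trimmed = trim_datapoints(
    rows,
    lat_range=(-70.0, -60.0),
    long_range=(-80.0, -60.0),
    time_range=(date(2020, 1, 1), date(2020, 1, 31)),
)
```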