The DataSet class is a top-level class of the Scientific DataSet library. It provides a single data model for different scientific datasets, so applications can store and retrieve data uniformly through an abstract view of various custom data storage formats. This makes an application less dependent on data formats and eases data transfer between software components.
A DataSet bundles several related arrays and their associated metadata in a single self-descriptive package, and enforces certain constraints on the arrays' shapes to ensure data consistency. The underlying data model has much in common with Unidata’s Common Data Model, which has stood the test of time.
To create or open a dataset, use the DataSet.Open(String) method with an appropriate DataSet URI:
Example
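// Each statement below is an alternative way to open a DataSet; use one of them.
// Open a CSV file by file name:
DataSet dataSet = DataSet.Open("csv_for_autotypes.csv");
// Open a CSV file by URI:
DataSet dataSet = DataSet.Open("msds:csv?file=csv_for_autotypes.csv");
// Open a NetCDF file by full path:
DataSet dataSet = DataSet.Open(@"c:\data\ncfile.nc");
// Open a NetCDF file by URI:
DataSet dataSet = DataSet.Open("msds:nc?file=ncfile.nc");
// Open with a URI that carries additional provider-specific parameters:
DataSet dataSet = DataSet.Open("msds:as?server=(local)&database=ActiveStorage" +
    "&integrated security=true&GroupName=mm5&UseNetcdfConventions=true");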
The URI indicates the particular DataSet implementation, which provides access to the specified data storage format. Such an implementation is called a DataSet provider.
For a list of providers included in the current release, see Managed Libraries.
The URI can include additional provider-specific information, such as security credentials, behavioral properties, and so on. For more information, see the documentation for each provider. To customize a URI before you open a DataSet, use the properties exposed by the DataSetUri class. For details, see the DataSetUri.Create(string) method in the complete reference included in the installation package.
A DataSet consists of a collection of Variable objects. Each Variable represents a single array with a collection of attributes attached. For information about attributes, see the Variable.Metadata property. If a provider works with large datasets, it typically does not load all data into memory, but provides on-demand data access through variables. See also Supported Types.
Example
The following code iterates through all variables of a DataSet loaded from the "sample.csv" file and displays information about the DataSet on the console:
using (DataSet ds = DataSet.Open("sample.csv"))
{
    // Print the dataset name and identifier
    Console.WriteLine("Scientific DataSet " + ds.Name + " (" + ds.DataSetGuid + ") contents:");
    Console.WriteLine(" Variables:");
    // Print a description of each variable
    foreach (Variable v in ds.Variables)
    {
        Console.WriteLine(v.ToString());
    }
}
Relationships between variables are expressed using shared dimensions. A DataSet dimension is an index space that has a distinct name. For example, if variable Observation shares a dimension with variable X, then Observation[i] corresponds to X[i] for every index i from the shared index space. Sharing a dimension also imposes a constraint on the DataSet:
- Two variables that share an index space must always be the same size along the shared dimension.
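The constraint can be illustrated with a minimal, self-contained sketch. It does not use the Scientific DataSet API: the DimensionCheck class, its IsConsistent method, and the (Name, Dims, Shape) tuples are hypothetical names introduced here only to model how a named dimension ties the sizes of variables together.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the Scientific DataSet library.
public static class DimensionCheck
{
    // Returns true when every named dimension has the same length
    // in all variables that use it.
    public static bool IsConsistent(
        IEnumerable<(string Name, string[] Dims, int[] Shape)> variables)
    {
        // Length recorded for each dimension name seen so far.
        var lengths = new Dictionary<string, int>();
        foreach (var v in variables)
        {
            for (int k = 0; k < v.Dims.Length; k++)
            {
                if (lengths.TryGetValue(v.Dims[k], out int known))
                {
                    // Dimension already used by another variable: sizes must match.
                    if (known != v.Shape[k]) return false;
                }
                else
                {
                    lengths[v.Dims[k]] = v.Shape[k];
                }
            }
        }
        return true;
    }

    public static void Main()
    {
        // X and Observation share dimension "i" and have the same length.
        var valid = new[]
        {
            (Name: "X", Dims: new[] { "i" }, Shape: new[] { 10 }),
            (Name: "Observation", Dims: new[] { "i" }, Shape: new[] { 10 }),
        };
        // Observation is longer than X along the shared dimension "i".
        var invalid = new[]
        {
            (Name: "X", Dims: new[] { "i" }, Shape: new[] { 10 }),
            (Name: "Observation", Dims: new[] { "i" }, Shape: new[] { 12 }),
        };
        Console.WriteLine(IsConsistent(valid));   // prints True
        Console.WriteLine(IsConsistent(invalid)); // prints False
    }
}
```

In this model, two variables that use different dimension names are independent and may have any sizes; only a shared name links them.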
In addition to attaching metadata to each Variable, you can attach global metadata to the DataSet. Metadata is a dictionary of string keys and typed values, represented by the MetadataDictionary class.
A DataSet instance can be read-only. Such a dataset cannot be modified. For details, see the DataSet.IsReadOnly property.
To add a new Variable to a DataSet, use the AddVariable<DataType>() methods.
Example
The following code adds a column to the text file "Tutorial1.csv", which is in comma-separated values (CSV) format:
// Sum, Zip, and Select require "using System.Linq;".
using (DataSet ds = DataSet.Open("Tutorial1.csv"))
{
    // Read input data
    double[] x = (double[])ds["X"].GetData();
    double[] y = (double[])ds["Observation"].GetData();
    // Compute the means of x and y
    var xm = x.Sum() / x.Length;
    var ym = y.Sum() / y.Length;
    // Least-squares slope (a) and intercept (b) of the line y = a * x + b
    var a = x.Zip(y, (xx, yy) => (xx - xm) * (yy - ym)).Sum()
        / x.Select(xx => (xx - xm) * (xx - xm)).Sum();
    var b = ym - a * xm;
    // Evaluate the fitted line at each x
    var model = x.Select(xx => a * xx + b).ToArray();
    // Add the result as a new variable
    ds.AddVariable<double>("Model", model);
}
The DataSetExtensions extension methods let you work with a DataSet in a way similar to a procedural API. You could rewrite the previous example using the extensions as follows:
Example
// Sum, Zip, and Select require "using System.Linq;".
using (DataSet ds = DataSet.Open("Tutorial1.csv"))
{
    // Read input data
    var x = ds.GetData<double[]>("X");
    var y = ds.GetData<double[]>("Observation");
    // Compute the means of x and y
    var xm = x.Sum() / x.Length;
    var ym = y.Sum() / y.Length;
    // Least-squares slope (a) and intercept (b) of the line y = a * x + b
    var a = x.Zip(y, (xx, yy) => (xx - xm) * (yy - ym)).Sum()
        / x.Select(xx => (xx - xm) * (xx - xm)).Sum();
    var b = ym - a * xm;
    // Evaluate the fitted line at each x
    var model = x.Select(xx => a * xx + b).ToArray();
    // Add a new variable, then write the data into it
    ds.Add<double[]>("Model");
    ds.PutData<double[]>("Model", model);
}
The DataSet uses a two-phase transaction mechanism to commit changes. For details, see the DataSet.Commit() method and the DataSet.IsAutocommitEnabled property.
See also the How-To topic.