Trying out the new DataFrame package in .NET Core

Yesterday, Prashanth Govindarajan posted an article about DataFrame on the .NET Blog. I got excited and wanted to try the library as soon as I could. In this post, I will explain what the library is intended for and what are my thoughts on it.

The package

The DataFrame related classes were introduced in the package Microsoft.Data.Analysis. The source code is available in the corefxlab repository, Microsoft.Data.Analysis.

This means that the package is experimental and is currently in preview.

Installing

As with all NuGet packages, you can install it either by using the Visual Studio NuGet Package Manager or using the CLI: dotnet add package Microsoft.Data.Analysis.
Note: You can also install the package in a .NET Jupyter Notebook.

Functionalities

  • Loading data from CSV (not mature enough).
  • Easy column/row selection.
  • Join/Merge DataFrames.
  • GroupBy.
  • Handle Null values.
  • And others…

Example

Creating Data

In this example, we'll be creating data instead of loading it from a CSV, since DataFrame.LoadCsv doesn't handle null values and all structs (e.g. DateTime) yet.

string[] names = { "John", "Ahmed", "Chris", "Albert" };
int[] salaries = { 20000, 30000, 40000, 10000 };
DateTime[] birthdays = { DateTime.Parse("23/4/1990"), DateTime.Parse("4/5/1982"),
    DateTime.Parse("2/1/1980"), DateTime.Parse("9/10/1994") };
string[] departments = { "Development", "Development", "HR", null };

var idColumn1 = new PrimitiveDataFrameColumn<int>("Id", new int[] { 1, 2, 3, 4 });
var nameColumn = new StringDataFrameColumn("Name", names);
var birthdayColumn = new PrimitiveDataFrameColumn<DateTime>("Birthday", birthdays);
var salaryColumn = new PrimitiveDataFrameColumn<int>("Salary", salaries);

var idColumn2 = new PrimitiveDataFrameColumn<int>("EmployeeId", new int[] { 3, 1, 2, 4 });
var departmentColumn = new StringDataFrameColumn("Department", departments);

var employeesDf = new DataFrame(idColumn1, nameColumn, birthdayColumn, salaryColumn);
var departmentsDf = new DataFrame(idColumn2, departmentColumn);

The example data is very simple, we have employees (id, name, birthday and salary) and which department they belong to.

Note: This code would change to 2 lines when LoadCsv works properly.

DataFrame information and description

employeesDf.PrettyPrint();
Id Name Birthday Salary
1 John 23/04/1990 20000
2 Ahmed 04/05/1982 30000
3 Chris 02/01/1980 40000
4 Albert 09/10/1994 10000
departmentsDf.PrettyPrint();
EmployeeId Department
3 Development
1 Development
2 HR
4 null
employeesDf.Info().PrettyPrint();
Info Id Name Birthday Salary
DataType System.Int32 System.String System.DateTime System.Int32
Length (excluding null values) 4 4 4 4
departmentsDf.Info().PrettyPrint();
Info EmployeeId Department
DataType System.Int32 System.String
Length (excluding null values) 4 3

I would've hoped to see more information in Info, for example the memory usage like in pandas.

employeesDf.Description().PrettyPrint();
Description Id Birthday Salary
Length (excluding null values) 4 4 4
Max 4 null 40000
Min 1 null 10000
Mean 2.5 null 25000

PrettyPrint

DataFrame.ToString only takes into account the maximum length of columns when printing the dataframe, which doesn't always work since some column values might be longer.
I created a function that does the opposite to get some good looking tables, the code is available at the end of the post.

Fill nulls with a value

departmentColumn.FillNulls("Other", inPlace: true);

In the future, it would be nice to have more ways to fill nulls, for example the most present value (in this case Development).

Creating a column from an existing one by applying a function

var currentYear = DateTime.Now.Year;
employeesDf["Age"] = birthdayColumn.Apply(d => currentYear - d.Year);
employeesDf.PrettyPrint();

For now, DataFrame.Apply only handle same-type function, meaning that if your colum is of type DateTime, the output needs to be the same, which doesn't work for our example (and for other use cases).

I opened an issue in the repo (here) and I'm working on an PR, for now, I'm using a custom extension method that will be available at the end of the post too.

Id Name Birthday Salary Age
1 John 23/04/1990 00:00:00 20000 29
2 Ahmed 04/05/1982 00:00:00 30000 37
3 Chris 02/01/1980 00:00:00 40000 39
4 Albert 09/10/1994 00:00:00 10000 25

Normalize a column

Note: This example intends to show how the columns can handle operations.

var minSalary = (float)(int)salaryColumn.Min();
var maxSalary = (int)salaryColumn.Max();
employeesDf["NormalizedSalary"] = (salaryColumn - minSalary) / (maxSalary - minSalary);
employeesDf.PrettyPrint();
Id Name Birthday Salary Age NormalizedSalary
1 John 23/04/1990 00:00:00 20000 29 0.3333333333333333
2 Ahmed 04/05/1982 00:00:00 30000 37 0.6666666666666666
3 Chris 02/01/1980 00:00:00 40000 39 1
4 Albert 09/10/1994 00:00:00 10000 25 0

Join 2 DataFrames

var df = employeesDf.Merge<int>(departmentsDf, "Id", "EmployeeId", joinAlgorithm: JoinAlgorithm.Inner);
df.PrettyPrint();
Id Name Birthday Salary Age NormalizedSalary EmployeeId Department
3 Chris 02/01/1980 00:00:00 40000 39 1 3 Development
1 John 23/04/1990 00:00:00 20000 29 0.3333333333333333 1 Development
2 Ahmed 04/05/1982 00:00:00 30000 37 0.6666666666666666 2 HR
4 Albert 09/10/1994 00:00:00 10000 25 0 4 Other

I'll have to admit that I got confused at first, thinking that Join should do this while Merge would only merge without a condition.

Drop a column and sort by a column

df.Columns.Remove("EmployeeId");
df = df.Sort("Id");
df.PrettyPrint();
Id Name Birthday Salary Age NormalizedSalary Department
1 John 23/04/1990 00:00:00 20000 29 0.3333333333333333 Development
2 Ahmed 04/05/1982 00:00:00 30000 37 0.6666666666666666 HR
3 Chris 02/01/1980 00:00:00 40000 39 1 Development
4 Albert 09/10/1994 00:00:00 10000 25 0 Other

Mean salary by department

var employeesByDepartment = df.GroupBy("Department");
employeesByDepartment.Mean("Salary").PrettyPrint();
Department Salary
Development 30000
HR 30000
Other 10000

Value counts of department

departmentColumn.ValueCounts().PrettyPrint();
Values Counts
Development 2
HR 1
Other 1

Sample rows

df.Sample(2).PrettyPrint();

Note: This method can give duplicate rows. I created an issue and hopefully it'll be fixed next preview.

Conclusion

I'm excited to see what this library becomes in the future. It still needs a lot of work but I hope it gets mature enough to get out of the lab and be officially supported, it can be a nice addition to .NET.

If you're interested in the code used in this post:

See you soon!

Zanid Haytam Written by:

Zanid Haytam is an enthusiastic programmer that enjoys coding, reading code, hunting bugs and writing blog posts.