Group By and Aggregates in .NET DataTable

Often there is a need to perform group by operations on in-memory .NET data tables. You can’t run a standard SQL statement and LINQ is not available prior version 3.0 of .NET. To compensate for this shortcoming I put together a small function that works in .NET 2.0. As input parameters it accepts source DataTable, column name to group by and a column name to perform aggregate operation on. It returns a grouped data as data table:

Function GroupBy(ByVal i_sGroupByColumn As String, ByVal i_sAggregateColumn As String, ByVal i_dSourceTable As DataTable) As DataTable


        Dim dv As New DataView(i_dSourceTable)

        'getting distinct values for group column
        Dim dtGroup As DataTable = dv.ToTable(True, New String() {i_sGroupByColumn}) 

        'adding column for the row count
        dtGroup.Columns.Add("Count", GetType(Integer)) 

        'looping thru distinct values for the group, counting
        For Each dr As DataRow In dtGroup.Rows
            dr("Count") = i_dSourceTable.Compute("Count(" & i_sAggregateColumn & ")", i_sGroupByColumn & " = '" & dr(i_sGroupByColumn) & "'")
        Next        

        'returning grouped/counted result
        Return dtGroup
End Function

The function first gets distinct values for group-by column, by creating a data view from source data table and using DataView’s “ToTable” method. It then loops thru these distinct values performing aggregate function on the source table using DataTable’s “Compute” method – Count in this case, but it can easily be replaced with other aggregates.

27 replies on “Group By and Aggregates in .NET DataTable”

  1. Similar. Function is modified so that instead of a single string – multiple strings (or array) that represent column names are passed. Then the first part will get distinct combination of column values and the second – aggregate for that combination. I can post source code if needed.
    I had a different requirement – grouping by 2 columns should be done in a way so that the first column values would go top-to-bottom in the datatable and the second column values left-to-right as column names (kinda like crosstab). This format is usual when binding data to a chart (each column represents chart series). I can post this solution as well if needed.

  2. Can you please post your code in C# for multiple column group by ?

    It will be very very helpful for me.

  3. Ok, here the VB version of grouping by multiple columns. Keep in mind, I haven’t got a chance to test to, so there’re potential errors here but the logic is clear.

     Function GroupByMultiple(ByVal i_sGroupByColumns As String(), ByVal i_sAggregateColumn As String, ByVal i_dSourceTable As DataTable) As DataTable
    
       Dim dv As New DataView(i_dSourceTable)
       Dim dtGroup As DataTable = dv.ToTable(True, i_sGroupByColumns)
    
       dtGroup.Columns.Add("Count", GetType(Integer))
    
       Dim sCondition As String
       For Each dr As DataRow In dtGroup.Rows
          sCondition = ""
    
          For I = 0 To i_sGroupByColumns.Length - 1
             sCondition &= i_sGroupByColumns(I) & " = '" & dr(i_sGroupByColumns(I)) & "' "
             If I < i_sGroupByColumns.Length - 1 Then sCondition &= " AND "
          Next
    
          dr("Count") = i_dSourceTable.Compute("Count(" & i_sAggregateColumn & ")", sCondition)
       Next
    
       Return dtGroup
    End Function

    Instead of passing a single column name – an array with column names is passed. That array is used both in creating a distinct view and condition for summary operation.

  4. @ Yuriy
    Very Nice.
    I`searched long Time for a Function like that.
    Could you please post also the “CrossTab” Version.

    BR

  5. @Joe Sorry, maybe sometime in the future, but currently don’t have much time.

    @Chintan I don’t really do C# at this time, but the function is pretty straightforward to write C# version. Or you can use online converters like http://www.developerfusion.com/tools/convert/vb-to-csharp/. Here what it produces for the GroupBy function:

    public DataTable GroupBy(string i_sGroupByColumn, string i_sAggregateColumn, DataTable i_dSourceTable)
    {
    
    	DataView dv = new DataView(i_dSourceTable);
    
    	//getting distinct values for group column
    	DataTable dtGroup = dv.ToTable(true, new string[] { i_sGroupByColumn });
    
    	//adding column for the row count
    	dtGroup.Columns.Add("Count", typeof(int));
    
    	//looping thru distinct values for the group, counting
    	foreach (DataRow dr in dtGroup.Rows) {
    		dr["Count"] = i_dSourceTable.Compute("Count(" + i_sAggregateColumn + ")", i_sGroupByColumn + " = '" + dr[i_sGroupByColumn] + "'");
    	}
    
    	//returning grouped/counted result
    	return dtGroup;
    }
  6. extremely slow on large datatables. the looping over all datarows works well for small sets (~5,000 rows) but anything over that and it take alot of time to do

  7. @dave f Technically you’re not supposed to have large in-memory datasets, if you do – maybe it’s a good idea to update design to query database in smaller batches. That said, I tested that function on datatables with 20K+ rows and it was very fast on my development laptop (which is not very fast). Entire operation is done in-memory and loop is going thru distinct groups, not all the rows. But if this method is slow for you and you’re using .NET 3.5+ try the LINQ method: http://codecorner.galanter.net/2009/12/17/grouping-ado-net-datatable-using-linq/

  8. public static DataTable TestGroupBy(string i_sGroupByColumn, string i_sAggregateColumn, DataTable i_dSourceTable)
    {
    DataView dv = new DataView(i_dSourceTable);

    //getting distinct values for group column
    DataTable dtGroup = dv.ToTable(true, new string[] { i_sGroupByColumn });

    //adding column for the row count
    dtGroup.Columns.Add(“Count”, typeof(int));

    //looping thru distinct values for the group, counting
    foreach (DataRow dr in dtGroup.Rows)
    {
    dr[“Count”] = i_dSourceTable.Compute(“Count(” + i_sGroupByColumn + “)”, i_sGroupByColumn + ” = ‘” + dr[i_sGroupByColumn] + “‘”);
    //dr[“Sum”] = i_dSourceTable.Compute(“Sum(” + i_sAggregateColumn + “)”, i_sGroupByColumn + ” = ‘” + dr[i_sGroupByColumn] + “‘”);
    }

    //returning grouped/counted result
    return dtGroup;
    }

    Throws an exception when executing:

    System.Data.SyntaxErrorException: Syntax error in aggregate argument: Expecting a single column argument with possible ‘Child’ qualifier.

    Please help, I have tried searching on various forum without any luck. Thanks, Khalid.

  9. @Khalid Mirza Do your column name contain spaces? In that case try surrounding them by “[” and “]”, e.g.

    dr["Count"] = i_dSourceTable.Compute("Count([" + i_sAggregateColumn + "])", "[" + i_sGroupByColumn + "] = '" + dr[i_sGroupByColumn] + "'");

    Btw in your sample you use i_sGroupByColumn twice both for = condition and Count aggregate – is this on purpose?

  10. I am trying to use this function to sum the values instead of providing a count, but it gives me an error saying strings cannot be summed. When I try to convert them to integer by using

    dr(“Count”) = i_dSourceTable.Compute(“Sum(Convert([” & i_sAggregateColumn & “], System.Int32)”, i_sGroupByColumn & ” = ‘” & dr(i_sGroupByColumn) & “‘”)

    It gives me an error saying: Syntax error in aggregate argument: Expecting a single column argument with possible ‘Child’ qualifier.

    Any suggestions?

  11. @Blake, “Sum” expects just a column name, so use just

    dr(“Count”) = i_dSourceTable.Compute(“Sum([” & i_sAggregateColumn & “])”, i_sGroupByColumn & ” = ‘” & dr(i_sGroupByColumn) & “‘”)

  12. I tried this and It gives me the original error I was getting: Invalid usage of aggregate function Sum() and Type: String

    I have created my own loop which sums the table for me. Just thought your function might help save some processing time.

  13. @Blake, yes unfortunately Sum does not work on strings in DataTable.Compute method and you can’t use conversion functions inside of it. You could create a separate column with integer values corresponding to the string ones in the original one, but that’s too much extra work and will be inefficient.

    If you’re using .NET 3.5 or above, I strongly suggest looking into LINQ aggregate solutions, for example:

    http://codecorner.galanter.net/2009/12/17/grouping-ado-net-datatable-using-linq/
    http://codecorner.galanter.net/2009/12/17/linq-truly-language-integrated/

    It’s much shorter, much more efficient, and you can use conversion functions right in the query.

  14. @Yuriy, your GroupByMultiple() function almost saved my day, but the returned datatable only includes the group by columns, while there are more columns in the original datatable. Is there any way of fixing this? Thanks

  15. @Robert: But that’s the idea, it’s by design. If you have data

    c1 c2 c3 c4
    a b c d
    a b x y
    f g 1 2
    f g 3 4
    f g q t

    And you group by columns C1 and C2 only – you will get data like

    c1 c2 Count
    a b 2
    f g 3

    Since it’s a group by operation – details a left off. Maybe you scenario is different? Could you explain how do you need to group by certain columns and still keep all the original columns?

  16. Very nice! Saves me a lot of time!!! Great work and thanks for sharing!

Leave a Reply

Your email address will not be published. Required fields are marked *