module Crysda::DataFrame

Overview

A "tabular" data structure representing cases/records (rows), each of which consists of a number of observations or measurements (columns). DataFrame is an immutable object; any mutation returns a new object.

Defined in:

crysda/dataframe.cr
crysda/joins.cr
crysda/reshape.cr

Constant Summary

DEF_NEST_COLUMN_NAME = "data"

Instance Method Detail

abstract def [](name : String) : DataCol #

Returns a column by name


[View source]
def [](index : Int32) #

Returns a column by index


[View source]
abstract def add_column(tf : ColumnFormula) : DataFrame #

Adds new variables and preserves existing ones.


[View source]
def add_column(col_name : String, &expression : TableExpression) : DataFrame #

Add a new column and preserve existing ones.

df.add_column("salary_category") { 3 }             # with constant value
df.add_column("age_3y_later") { |e| e["age"] + 3 } # by doing basic column arithmetic

[View source]
def add_columns(cols : Iterable(ColumnFormula)) : DataFrame #

[View source]
def add_columns(*cols : ColumnFormula) : DataFrame #

Adds multiple columns.

df.add_columns(
  "age_plus3".with { |e| e["age"] + 3 },
  "initials".with { |e| e["first_name"].map(&.to_s[0]).concatenate(e["last_name"].map(&.to_s[0])) }
)

[View source]
def add_row(*row) #

Returns a DataFrame containing the new row. The length of the new row must match the number of columns in the DataFrame.


[View source]
def add_row_number(name = "row_number") #

Adds the row number as a column to the data-frame.


[View source]
def bind_cols(cols : Iterable(DataCol)) #

Adds new columns. Rows are matched by position, so all data frames must have the same number of rows.


[View source]
def bind_rows(df : DataFrame) : DataFrame #

Adds new rows. Columns are matched by name, and the output of bind_rows contains a column if that column appears in any of the inputs. Missing entries are set to nil. Grouping is discarded when binding rows.


[View source]
def bind_rows(rows : Iterable(Hash(String, Any))) #

[View source]
def bind_rows(dfs : Iterable(DataFrame)) : DataFrame #

[View source]
def bind_rows(*rows : Hash(String, Any)) #

Adds new rows. Columns are matched by name, and the output of #bind_rows contains a column if that column appears in any of the inputs. Missing entries are set to nil. Grouping is discarded when binding rows.

row1 = {
  "person" => "james",
  "year"   => 1996,
  "weight" => 54.0,
  "sex"    => "M",
} of String => Any

row2 = {
  "person" => "nell",
  "year"   => 1997,
  "weight" => 48.1,
  "sex"    => "F",
} of String => Any
df.bind_rows(row1, row2)
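
The name-matching rule can be pictured without the library. The following is a plain-Crystal sketch of the behaviour described above, not Crysda's implementation; `Cell` and `bind_rows_sketch` are hypothetical names introduced only for this illustration.

```crystal
# Plain-Crystal sketch of row binding by column name; not Crysda code.
alias Cell = String | Int32 | Float64 | Nil

# Hypothetical helper: turns rows given as hashes into named column arrays.
def bind_rows_sketch(rows : Array(Hash(String, Cell))) : Hash(String, Array(Cell))
  # A column appears in the output if it appears in any input row.
  names = rows.flat_map(&.keys).uniq
  table = {} of String => Array(Cell)
  names.each do |name|
    # Hash#[]? returns nil for a missing key, which fills the gap.
    table[name] = rows.map { |row| row[name]? }
  end
  table
end

rows = [
  {"person" => "james", "year" => 1996} of String => Cell,
  {"person" => "nell", "weight" => 48.1} of String => Cell,
]
table = bind_rows_sketch(rows)
table.each { |name, values| puts "#{name}: #{values}" }
```

Each row contributes nil to every column it does not mention, so all output columns end up with the same length.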

[View source]
def bind_rows(*rows : DataFrameRow) #

[View source]
def bind_rows(*df : DataFrame) : DataFrame #

[View source]
abstract def cols : Array(DataCol) #

Ordered list of columns in this data-frame


[View source]
def complete(*column_names : String) : DataFrame #

Turns implicit missing values into explicit missing values. This is a wrapper around #expand.


[View source]
def count(selects : Array(String) = [] of String, name = "n") : DataFrame #

[View source]
def count(*selects : String, name = "n") : DataFrame #

Counts observations by group.

If no grouping attributes are provided, the method respects the grouping of the receiver or, for an ungrouped receiver, simply counts the rows in the data-frame.

selects : The variables to be used for cross-tabulation.
name : The name of the count column in the resulting table.

df.count("column name")
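
As a rough picture of what counting by group produces, here is a plain-Crystal sketch that tallies the values of a single column. It uses only the standard library and is not Crysda code; the sample values are made up for illustration.

```crystal
# Plain-Crystal sketch of counting observations per group; not Crysda code.
sexes = ["M", "M", "F", "F", "F"]

# Enumerable#tally maps each distinct value to its number of occurrences.
counts = sexes.tally

counts.each do |value, n|
  puts "#{value}\t#{n}"
end
```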

[View source]
def count_expr(*exprs : TableExpression, name = "n", table_expression : TableExpression | Nil = nil) : DataFrame #

Counts expressions

If no grouping attributes are provided, the method respects the grouping of the receiver or, for an ungrouped receiver, simply counts the rows in the data-frame.


[View source]
def distinct(selects : Array(String) = self.names) : DataFrame #

[View source]
def distinct(*selects : String) : DataFrame #

Retains only unique/distinct rows.

selects : Variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row is preserved.
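
The "first row wins" rule is the same one Crystal's own `Array#uniq` applies when given a block. The sketch below is plain Crystal with made-up rows, not Crysda code.

```crystal
# Plain-Crystal sketch of keeping the first row per key; not Crysda code.
rows = [
  {"person" => "max", "year" => "2014"},
  {"person" => "max", "year" => "2016"},
  {"person" => "anna", "year" => "2015"},
]

# Uniqueness is decided on "person" only; the first row per value is kept.
unique = rows.uniq { |row| row["person"] }

unique.each { |row| puts row }
```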


[View source]
def expand(*column_names : String) : DataFrame #

#expand is often useful in conjunction with left_join if you want to convert implicit missing values to explicit missing values.


[View source]
abstract def filter(&block : RowPredicate) : DataFrame #

Filters the rows of a table with a single predicate. The filter() function is used to subset a data frame, retaining all rows that satisfy the given condition.


[View source]
def filter(*predicates : DataFrame -> Array(Bool) | Array(Bool | Nil)) : DataFrame #

AND-filters a table with several predicates: a row is retained only if every predicate holds for it.

df.filter { |e| e["age"] == 23 }
df.filter { |e| e["weight"] > 50 }
df.filter { |e| e["first_name"].matching { |name| name.starts_with?("Ho") } }

[View source]
def filter_by_row(&row_filter : DataFrameRow -> Bool) : DataFrame #

Filters rows by a row predicate, which is invoked on each row of the dataframe.

df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
  "max", 2014, 33.1, "M",
  "max", 2016, nil, "M",
  "anna", 2015, 39.2, "F",
  "anna", 2016, 39.9, "F"
)
df.filter_by_row { |f| f["year"].as_i > 2015 }.print

[View source]
def gather(key : String, value : String, columns : Array(String) = self.names, convert : Bool = false) : DataFrame #

gather takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed. Use gather() when you notice that you have columns that are not variables.

key - Name of the key column to create in the output.
value - Name of the value column to create in the output.
columns - The columns to gather. The same selector syntax as for #select is supported here.
convert - If true, a type conversion is run automatically on the key column. This is useful if the column names are actually numeric, integer, or logical.
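
To make the wide-to-long reshaping concrete, here is a plain-Crystal sketch of the idea. It is not Crysda's implementation: `WideCell` and `gather_sketch` are hypothetical names, and the output row order (one block of rows per gathered column) is an assumption of this sketch rather than documented behaviour.

```crystal
# Plain-Crystal sketch of gathering columns into key/value pairs; not Crysda code.
alias WideCell = String | Int32 | Nil

# Hypothetical helper: collapses `columns` into a key column and a value column.
def gather_sketch(rows : Array(Hash(String, WideCell)), key : String,
                  value : String, columns : Array(String)) : Array(Hash(String, WideCell))
  long = [] of Hash(String, WideCell)
  columns.each do |col|
    rows.each do |row|
      # Columns that are not gathered are duplicated onto every output row.
      out = row.reject { |name, _| columns.includes?(name) }
      out[key] = col          # the former column name becomes the key
      out[value] = row[col]?  # the former cell becomes the value
      long << out
    end
  end
  long
end

wide = [
  {"person" => "max", "2014" => 33, "2016" => nil} of String => WideCell,
  {"person" => "anna", "2014" => nil, "2016" => 39} of String => WideCell,
]
long = gather_sketch(wide, "year", "weight", ["2014", "2016"])
long.each { |row| puts row }
```

Two input rows with two gathered columns give four output rows, each carrying the untouched "person" column.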


[View source]
def gather(key : String, value : String, columns : ColumnSelector, convert : Bool = false) : DataFrame #

[View source]
abstract def group_by(by : Iterable(String)) : DataFrame #

Creates a grouped data-frame given a list of grouping attributes. Most data operations are done on groups defined by variables. #group_by() takes the receiver data-frame and converts it into a grouped data-frame where operations are performed "by group". #ungroup() removes grouping.

Most verbs like #add_column(), #summarize(), etc. will be executed per group if a grouping is present.


[View source]
def group_by(*by : String) : DataFrame #

Creates a grouped data-frame given a list of grouping attributes. Most data operations are done on groups defined by variables. #group_by() takes the receiver data-frame and converts it into a grouped data-frame where operations are performed "by group". #ungroup() removes grouping.

Most verbs like #add_column(), #summarize(), etc. will be executed per group if a grouping is present.


[View source]
def group_by(&col_selector : ColumnSelector) : DataFrame #

Creates a grouped data-frame from a column selector function. See #select() for details about column selection.

Most data operations are done on groups defined by variables. #group_by() takes the receiver data-frame and converts it into a grouped data-frame where operations are performed "by group". #ungroup() removes grouping.


[View source]
def group_by_expr(table_expression : TableExpression | Nil = nil) : DataFrame #

[View source]
def group_by_expr(exprs : Iterable(TableExpression), table_expression : TableExpression | Nil = nil) : DataFrame #

[View source]
def group_by_expr(*exprs : TableExpression, table_expression : TableExpression | Nil = nil) : DataFrame #

Creates a grouped data-frame from one or more table expressions. See #add_column() for details about table expressions.

Most data operations are done on groups defined by variables. #group_by() takes the receiver data-frame and converts it into a grouped data-frame where operations are performed "by group". #ungroup() removes grouping.


[View source]
abstract def grouped_by : DataFrame #

Returns a data-frame of distinct grouping variable tuples for a grouped data-frame. An empty data-frame for ungrouped data


[View source]
abstract def groups : Array(DataFrame) #

Returns the groups of a grouped data frame or just a reference to self


[View source]
def head(rows = 5) #

Returns the top rows of the dataframe. Defaults to 5 rows.


[View source]
def inner_join(right : DataFrame, by : String, suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame #

[View source]
def inner_join(right : DataFrame, by : Iterable(String) = default_by(self, right), suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame #

[View source]
def inner_join(right : DataFrame, by : Iterable(Tuple(String, String)), suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame #

[View source]
def left_join(right : DataFrame, by : String, suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame #

[View source]
def left_join(right : DataFrame, by : Iterable(String) = default_by(self, right), suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame #

[View source]
def move_left(*col_names : String) : DataFrame #

Push some columns to the left end of a data-frame


[View source]
def move_right(*col_names : String) : DataFrame #

Push some columns to the right end of a data-frame


[View source]
abstract def names : Array(String) #

Ordered list of column names of this data-frame


[View source]
def nest(col_select : ColumnSelector = ColumnSelector.new do |c| c.except(grouped_by().names) end, column_name : String = DEF_NEST_COLUMN_NAME) : DataFrame #

Nest repeated values in a list-variable.

There are many possible ways one could choose to nest columns inside a data frame. nest() creates a list of data frames containing all the nested variables: this seems to be the most useful form in practice.

Usage

nest(data, ..., column_name = "data")

col_select - A selection of columns. If not provided, all except the grouping variables are selected.
column_name - The name of the new column, as a string or symbol.

See also https://github.com/tidyverse/tidyr/blob/master/R/nest.R


[View source]
abstract def num_col : Int32 #

Number of columns in this dataframe


[View source]
abstract def num_row : Int32 #

Number of rows in this dataframe


[View source]
def outer_join(right : DataFrame, by : String, suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame #

[View source]
def outer_join(right : DataFrame, by : Iterable(String) = default_by(self, right), suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame #

[View source]
def print(title = "A DataFrame", col_names = true, max_rows = PRINT_MAX_ROWS, max_width = PRINT_MAX_WIDTH, max_digits = PRINT_MAX_DIGITS, row_numbers = PRINT_ROW_NUMBERS, output = STDOUT) #

Prints a dataframe to output (defaults to STDOUT). df.to_s also works but takes no options.


[View source]
def reject(columns : Iterable(String)) : DataFrame #

[View source]
def reject(col_type : DataCol.class) : DataFrame #

Rejects columns by column type.


[View source]
def reject(*columns : String) : DataFrame #

Rejects the selected columns.


[View source]
def reject(&col_sel : ColumnSelector) : DataFrame #

[View source]
def reject(*col_sels : ColumnSelector) : DataFrame #

Removes the selected columns.


[View source]
def reject?(&pred : DataCol -> Bool) : DataFrame #

Select or reject columns by predicate


[View source]
def rename(cols : Array(RenamePair)) : DataFrame #

Rename one or several columns. Positions should be preserved.


[View source]
def rename(rules : Array(RenameRule)) : DataFrame #

[View source]
def rename(*cols : RenamePair) : DataFrame #

Rename one or several columns. Positions should be preserved.


[View source]
def rename(*rules : RenameRule) : DataFrame #

[View source]
def right_join(right : DataFrame, by : String, suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame #

[View source]
def right_join(right : DataFrame, by : Iterable(String) = default_by(self, right), suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame #

[View source]
abstract def row(index : Int32) : DataFrameRow #

Returns a row by index


[View source]
def row_number : Array(Int32) #

Returns array of row numbers starting from 1


[View source]
abstract def rows : Iterator(DataFrameRow) #

Returns an Iterator over all rows. Per row data is represented as DataFrameRow


[View source]
def rowwise : DataFrame #

Creates a grouped data-frame where each group consists of exactly one row.


[View source]
def sample_frac(fraction : Float64, replace = false) : DataFrame #

Select random rows from a table. If receiver is grouped, sampling is done per group.

fraction - Fraction of rows to sample
replace - Sample with or without replacement


[View source]
def sample_n(n : Int32, replace = false) : DataFrame #

Select random rows from a table. If receiver is grouped, sampling is done per group.

n - Number of rows to sample
replace - Sample with or without replacement
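
The difference the replace flag makes can be shown with the standard library alone. This is a plain-Crystal sketch over an array of row indices, not Crysda code; the variable names are made up for illustration.

```crystal
# Plain-Crystal sketch of sampling with and without replacement; not Crysda code.
pool = (1..10).to_a

# Without replacement: each element can be drawn at most once.
without_repl = pool.sample(3)

# With replacement: every draw is independent, so repeats are possible.
with_repl = Array.new(3) { pool.sample }

puts without_repl
puts with_repl
```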


[View source]
def schema(max_digits = 3, max_width = PRINT_MAX_WIDTH, output = STDOUT) #

Prints the schema (that is, column names, types, and the first few values per column) of a dataframe to output (defaults to STDOUT).


[View source]
abstract def select(columns : Iterable(String)) : DataFrame #

Create a new data frame with only selected columns


[View source]
def select(col_type : DataCol.class) : DataFrame #

Select column by column type


[View source]
def select(which : Array(Bool | Nil)) : DataFrame #

[View source]
def select(*columns : String) : DataFrame #

[View source]
def select(&col_sel : ColumnSelector) : DataFrame #

Keeps only the variables that match any of the given expressions.


[View source]
def select(*col_sels : ColumnSelector) : DataFrame #

[View source]
def select?(&pred : DataCol -> Bool) : DataFrame #

Select or reject columns by predicate


[View source]
def semi_join(right : DataFrame, by : String) : DataFrame #

Special case of inner join against distinct right side
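
Joining against a distinct right side means a left row is kept at most once, however often its key occurs on the right. The sketch below shows that idea in plain Crystal with made-up data; it is not Crysda's implementation.

```crystal
# Plain-Crystal sketch of a semi join on one key; not Crysda code.
left = [{"max", 2014}, {"anna", 2015}, {"nell", 2016}]
right_keys = ["anna", "max", "anna"]

# Deduplicating the right side ensures left rows are never multiplied.
wanted = right_keys.to_set

# Keep each left row whose key occurs on the right; no right columns are added.
kept = left.select { |row| wanted.includes?(row[0]) }

kept.each { |row| puts row }
```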


[View source]
def semi_join(right : DataFrame, by : Iterable(Tuple(String, String))) : DataFrame #

[View source]
def semi_join(right : DataFrame, by : Iterable(String) = default_by(self, right), suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame #

[View source]
def separate(column : String, into : Array(String), sep : Regex | String = /[^\w]/, remove : Bool = true, convert : Bool = false) : DataFrame #

Given a separator (a Regex or a String), separate() turns a single string column into multiple columns.

column - Name of the column to separate.
into - Names of the new columns to create.
sep - Separator between columns. If a String, it is interpreted as a regular expression. The default value is a regular expression that matches any single non-word character.
remove - If true, remove input column from output data frame.
convert - If set, a type conversion is attempted on all new columns. This is useful if the value column was a mix of variables that was coerced to a string.
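
For a single cell, the split can be pictured as follows. This is a plain-Crystal sketch, not Crysda code; it reuses the default separator from the signature above, and the cell value and column names are made up.

```crystal
# Plain-Crystal sketch of separating one cell into named parts; not Crysda code.
into = ["person", "year"]
sep = /[^\w]/ # the default separator shown in the signature above

# Split the cell on the separator, then pair each part with its new column name.
parts = "max-2014".split(sep)
row = Hash.zip(into, parts)

puts row
```

The parts stay strings here; the convert option described above is what would turn "2014" into a number.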


[View source]
def set_names(new_names : Array(String)) : DataFrame #

Replace current column names with new ones. The number of provided names must match the number of columns.


[View source]
def set_names(*new_name : String) : DataFrame #

Replace current column names with new ones. The number of provided names must match the number of columns.


[View source]
def shuffle : DataFrame #

Randomize the row order of a data-frame.


[View source]
def slice(slices : Range) #

Select rows by position while taking into account grouping in a data-frame.


[View source]
def slice(*slices : Int32) #

Select rows by position while taking into account grouping in a data-frame.


[View source]
abstract def sort_by(by : Iterable(String)) : DataFrame #

Sorts the receiver in ascending order (small values go to the top of the table). The first argument defines the primary attribute to sort by. Additional ones are used to resolve ties.

Missing values will come last in the sorted table.
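
The "missing values last" ordering can be expressed as a comparator. The following is a plain-Crystal sketch over one nullable column with made-up values, not Crysda's implementation.

```crystal
# Plain-Crystal sketch of an ascending sort with nil last; not Crysda code.
years = [2016, nil, 2014, 2015] of Int32?

sorted = years.sort do |a, b|
  if a.nil? && b.nil?
    0
  elsif a.nil?
    1 # a is missing, so it sorts after b
  elsif b.nil?
    -1 # b is missing, so a sorts first
  else
    a <=> b
  end
end

puts sorted
```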


[View source]
def sort_by(exp : Iterable(SortExpression)) #

[View source]
def sort_by : DataFrame #

[View source]
def sort_by(*by : String) : DataFrame #

[View source]
def sort_by(&exp : SortExpression) #

[View source]
def sort_desc_by(*by : String) : DataFrame #

Sorts the receiver in descending order (small values go to the bottom of the table). The first argument defines the primary attribute to sort by. Additional ones are used to resolve ties.


[View source]
def spread(key : String, value : String, fill = nil, convert = false) : DataFrame #

Spreads a key-value pair across multiple columns.

key - The name of the column whose values will be used as column headings.
value - The name of the column whose values will populate the cells.
fill - If set, missing values will be replaced with this value - NOT IMPLEMENTED
convert - If set, a type conversion is attempted on all new columns. This is useful if the value column was a mix of variables that was coerced to a string.


[View source]
def summarize(name : String, block : TableExpression) : DataFrame #

[View source]
abstract def summarize(sum_rules : Array(ColumnFormula)) : DataFrame #

Creates a summary of a table or a group. The provided expression is expected to evaluate to a scalar value and not into a column. #summarize() is typically used on grouped data created by group_by(). The output will have one row for each group.
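
The "one row per group, one scalar per expression" shape can be sketched in plain Crystal. This is not Crysda code; the rows and the two summaries (a count and a maximum) are made up for illustration.

```crystal
# Plain-Crystal sketch of a per-group summary; not Crysda code.
rows = [{"max", 33.1}, {"anna", 39.2}, {"anna", 39.9}]

# Split the rows into groups keyed by person.
groups = rows.group_by { |row| row[0] }

# Each group collapses to a single row of scalar values.
summary = groups.map do |person, members|
  {person, members.size, members.max_of { |m| m[1] }}
end

summary.each { |row| puts row }
```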


[View source]
def summarize(*sum_rules : ColumnFormula) : DataFrame #

[View source]
def summarize(name : String, &block : TableExpression) : DataFrame #

[View source]
def summarize_at(&col_sel : ColumnSelector) : DataFrame #

[View source]
def summarize_at(col_sel : ColumnSelector, aggfuns : Array(AggFunc)) : DataFrame #

[View source]
def summarize_at(col_sel : ColumnSelector, op : SummarizeFunc | Nil = nil) : DataFrame #

[View source]
def summarize_at(col_sel : ColumnSelector, *aggfuns : AggFunc) : DataFrame #

[View source]
def tail(rows = 5) #

[View source]
def take(rows = 5) #

[View source]
def take_last(rows : Int32) #

[View source]
def to_h #

Expose a view on the data as Hash from column names to nullable arrays.


[View source]
def to_s(io : Nil) #

[View source]
def to_s #

[View source]
def to_string(title = "A DataFrame", col_names = true, max_rows = PRINT_MAX_ROWS, max_width = PRINT_MAX_WIDTH, max_digits = PRINT_MAX_DIGITS, row_numbers = PRINT_ROW_NUMBERS) #

Converts the dataframe to its string representation. This is invoked by #print and #to_s.


[View source]
def transmute(*formula : ColumnFormula) #

Creates a new dataframe based on a list of column formulas, which are evaluated in the context of this instance.


[View source]
abstract def ungroup : DataFrame #

Removes the grouping (if present) from a data frame.


[View source]
def unite(col_name : String, which : Array(String), sep : String = "_", remove : Bool = true) : DataFrame #

Convenience function to paste together multiple columns into one.

col_name - Name of the column to add.
which - Names of the columns that should be concatenated together.
sep - Separator to use between values.
remove - If true, remove input columns from output data frame.

See #separate.
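
For a single row, pasting columns together looks roughly like this. The sketch is plain Crystal with made-up column names, not Crysda code; where the new column ends up relative to the others is an assumption of the sketch.

```crystal
# Plain-Crystal sketch of uniting two columns of one row; not Crysda code.
row = {"first_name" => "Anna", "last_name" => "Smith", "city" => "Oslo"}
which = ["first_name", "last_name"]

# Concatenate the chosen values with the separator.
united = which.map { |name| row[name] }.join("_")

# Mirrors remove = true: the input columns are dropped from the output.
out = row.reject { |name, _| which.includes?(name) }
out["full_name"] = united

puts out
```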


[View source]
def unite(col_name : String, *which : ColumnSelector, sep : String = "_", remove : Bool = true) : DataFrame #

[View source]
def unnest(column_name : String) : DataFrame #

If you have a list-column, this makes each element of the list its own row. It unfolds data vertically. unnest() can handle list-columns that contain atomic vectors, lists, or data frames (but not a mixture of the different types).


[View source]
def write_csv(filename : String, separator : Char = ',', quote_char : Char = '"') : Nil #

Saves the current dataframe to a separator-delimited file.


[View source]
def write_csv(io : IO, separator : Char = ',', quote_char : Char = '"') : Nil #

[View source]