module Crysda::DataFrame
Overview
A "tabular" data structure representing cases/records (rows), each of which consists of a number of observations or measurements (columns). DataFrame is immutable: any mutating operation returns a new object.
Defined in:
crysda/dataframe.cr
crysda/joins.cr
crysda/reshape.cr
Constant Summary
-
DEF_NEST_COLUMN_NAME =
"data"
Instance Method Summary
-
#[](name : String) : DataCol
Returns a column by name
-
#[](index : Int32)
Returns a column by index
-
#add_column(tf : ColumnFormula) : DataFrame
Adds new variables and preserves existing
-
#add_column(col_name : String, &expression : TableExpression) : DataFrame
Add a new column and preserve existing ones.
- #add_columns(cols : Iterable(ColumnFormula)) : DataFrame
-
#add_columns(*cols : ColumnFormula) : DataFrame
Adds multiple columns.
df.add_columns(
  "age_plus3".with { |e| e["age"] + 3 },
  "initials".with { |e| e["first_name"].map(&.to_s[0]).concatenate(e["last_name"].map(&.to_s[0])) }
)
-
#add_row(*row)
Returns a DataFrame containing the new row.
-
#add_row_number(name = "row_number")
Adds the row number as a column to the data-frame.
-
#bind_cols(cols : Iterable(DataCol))
Add new columns.
-
#bind_rows(df : DataFrame) : DataFrame
Adds new rows.
- #bind_rows(rows : Iterable(Hash(String, Any)))
- #bind_rows(dfs : Iterable(DataFrame)) : DataFrame
-
#bind_rows(*rows : Hash(String, Any))
Add new rows.
- #bind_rows(*rows : DataFrameRow)
- #bind_rows(*df : DataFrame) : DataFrame
-
#cols : Array(DataCol)
Ordered list of columns in this data-frame
-
#complete(*column_names : String) : DataFrame
Turns implicit missing values into explicit missing values.
- #count(selects : Array(String) = [] of String, name = "n") : DataFrame
-
#count(*selects : String, name = "n") : DataFrame
Counts observations by group.
-
#count_expr(*exprs : TableExpression, name = "n", table_expression : TableExpression | Nil = nil) : DataFrame
Counts expressions
- #distinct(selects : Array(String) = self.names) : DataFrame
-
#distinct(*selects : String) : DataFrame
Retains only unique/distinct rows. selects - Variables to use when determining uniqueness.
-
#expand(*column_names : String) : DataFrame
#expand is often useful in conjunction with left_join if you want to convert implicit missing values to explicit missing values.
-
#filter(& : RowPredicate) : DataFrame
Filter the rows of a table with a single predicate.
-
#filter(*predicates : DataFrame -> Array(Bool) | Array(Bool | Nil)) : DataFrame
AND-filter a table with different filters.
-
#filter_by_row(&row_filter : DataFrameRow -> Bool) : DataFrame
Filters rows by a row predicate, which is invoked on each row of the dataframe.
df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
  "max", 2014, 33.1, "M",
  "max", 2016, nil, "M",
  "anna", 2015, 39.2, "F",
  "anna", 2016, 39.9, "F"
)
df.filter_by_row { |f| f["year"].as_i > 2015 }.print
-
#filter_by_row_with_index(&row_filter : DataFrameRow, Int32 -> Bool) : DataFrame
Filters rows by a row predicate, which is invoked with each row of the dataframe and that row's index.
df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
  "max", 2014, 33.1, "M",
  "max", 2016, nil, "M",
  "anna", 2015, 39.2, "F",
  "anna", 2016, 39.9, "F"
)
df.filter_by_row_with_index { |f, i| f["year"].as_i > 2015 || i % 2 != 0 }.print
-
#gather(key : String, value : String, columns : Array(String) = self.names, convert : Bool = false) : DataFrame
gather takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed.
- #gather(key : String, value : String, columns : ColumnSelector, convert : Bool = false) : DataFrame
-
#group_by(by : Iterable(String)) : DataFrame
Creates a grouped data-frame given a list of grouping attributes.
-
#group_by(*by : String) : DataFrame
Creates a grouped data-frame given a list of grouping attributes.
-
#group_by(&col_selector : ColumnSelector) : DataFrame
Creates a grouped data-frame from a column selector function.
- #group_by_expr(table_expression : TableExpression | Nil = nil) : DataFrame
- #group_by_expr(exprs : Iterable(TableExpression), table_expression : TableExpression | Nil = nil) : DataFrame
-
#group_by_expr(*exprs : TableExpression, table_expression : TableExpression | Nil = nil) : DataFrame
Creates a grouped data-frame from one or more table expressions.
-
#grouped_by : DataFrame
Returns a data-frame of distinct grouping variable tuples for a grouped data-frame.
-
#groups : Array(DataFrame)
Returns the groups of a grouped data frame or just a reference to self
-
#head(rows = 5)
return the top rows from dataframe.
- #inner_join(right : DataFrame, by : String, suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame
- #inner_join(right : DataFrame, by : Iterable(String) = default_by(self, right), suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame
- #inner_join(right : DataFrame, by : Iterable(Tuple(String, String)), suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame
- #left_join(right : DataFrame, by : String, suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame
- #left_join(right : DataFrame, by : Iterable(String) = default_by(self, right), suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame
-
#move_left(*col_names : String) : DataFrame
Push some columns to the left end of a data-frame
-
#move_right(*col_names : String) : DataFrame
Push some columns to the right end of a data-frame
-
#names : Array(String)
Ordered list of column names of this data-frame
-
#nest(col_select : ColumnSelector = ColumnSelector.new(&.except(grouped_by().names)), column_name : String = DEF_NEST_COLUMN_NAME) : DataFrame
Nest repeated values in a list-variable.
-
#num_col : Int32
Number of columns in this dataframe
-
#num_row : Int32
Number of rows in this dataframe
- #outer_join(right : DataFrame, by : String, suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame
- #outer_join(right : DataFrame, by : Iterable(String) = default_by(self, right), suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame
-
#print(title = "A DataFrame", col_names = true, max_rows = PRINT_MAX_ROWS, max_width = PRINT_MAX_WIDTH, max_digits = PRINT_MAX_DIGITS, row_numbers = PRINT_ROW_NUMBERS, output = STDOUT)
Prints a dataframe to output (defaults to STDOUT).
- #reject(columns : Iterable(String)) : DataFrame
-
#reject(col_type : DataCol.class) : DataFrame
reject column by column type
-
#reject(*columns : String) : DataFrame
reject selected columns
- #reject(&col_sel : ColumnSelector) : DataFrame
-
#reject(*col_sels : ColumnSelector) : DataFrame
remove selected columns
-
#reject?(&pred : DataCol -> Bool) : DataFrame
Select or reject columns by predicate
-
#rename(cols : Array(RenamePair)) : DataFrame
Rename one or several columns.
- #rename(rules : Array(RenameRule)) : DataFrame
-
#rename(*cols : RenamePair) : DataFrame
Rename one or several columns.
- #rename(*rules : RenameRule) : DataFrame
- #right_join(right : DataFrame, by : String, suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame
- #right_join(right : DataFrame, by : Iterable(String) = default_by(self, right), suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame
-
#row(index : Int32) : DataFrameRow
Returns a row by index
-
#row_number : Array(Int32)
Returns array of row numbers starting from 1
-
#rows : Iterator(DataFrameRow)
Returns an Iterator over all rows.
-
#rowwise : DataFrame
Creates a grouped data-frame where each group consists of exactly one line.
-
#sample_frac(fraction : Float64, replace = false) : DataFrame
Select random rows from a table.
-
#sample_n(n : Int32, replace = false) : DataFrame
Select random rows from a table.
-
#schema(max_digits = 3, max_width = PRINT_MAX_WIDTH, output = STDOUT)
Prints the schema (that is column names, types, and the first few values per column) of a dataframe to output (defaults to STDOUT).
-
#select(columns : Iterable(String)) : DataFrame
Create a new data frame with only selected columns
-
#select(col_type : DataCol.class) : DataFrame
Select column by column type
- #select(which : Array(Bool | Nil)) : DataFrame
- #select(*columns : String) : DataFrame
-
#select(&col_sel : ColumnSelector) : DataFrame
Keeps only the variables that match any of the given expressions
- #select(*col_sels : ColumnSelector) : DataFrame
-
#select?(&pred : DataCol -> Bool) : DataFrame
Select or reject columns by predicate
-
#semi_join(right : DataFrame, by : String) : DataFrame
Special case of inner join against distinct right side
- #semi_join(right : DataFrame, by : Iterable(Tuple(String, String))) : DataFrame
- #semi_join(right : DataFrame, by : Iterable(String) = default_by(self, right), suffices : Tuple(String, String) = {".x", ".y"}) : DataFrame
-
#separate(column : String, into : Array(String), sep : Regex | String = /[^\w]/, remove : Bool = true, convert : Bool = false) : DataFrame
Given either a regular expression or a vector of character positions, separate() turns a single character column into multiple columns.
-
#set_names(new_names : Array(String)) : DataFrame
Replace current column names with new ones.
-
#set_names(*new_name : String) : DataFrame
Replace current column names with new ones.
-
#shuffle : DataFrame
Randomize the row order of a data-frame.
-
#slice(slices : Range)
Select rows by position while taking into account grouping in a data-frame.
-
#slice(*slices : Int32)
Select rows by position while taking into account grouping in a data-frame.
-
#sort_by(by : Iterable(String)) : DataFrame
Resorts the receiver in ascending order (small values go to the top of the table).
- #sort_by(exp : Iterable(SortExpression))
- #sort_by : DataFrame
- #sort_by(*by : String) : DataFrame
- #sort_by(&exp : SortExpression)
-
#sort_desc_by(*by : String) : DataFrame
Resorts the receiver in descending order (small values go to the bottom of the table).
-
#spread(key : String, value : String, fill = nil, convert = false) : DataFrame
spread a key-value pair across multiple columns.
- #summarize(name : String, block : TableExpression) : DataFrame
-
#summarize(sum_rules : Array(ColumnFormula)) : DataFrame
Creates a summary of a table or a group.
- #summarize(*sum_rules : ColumnFormula) : DataFrame
- #summarize(name : String, &block : TableExpression) : DataFrame
- #summarize_at(&col_sel : ColumnSelector) : DataFrame
- #summarize_at(col_sel : ColumnSelector, aggfuns : Array(AggFunc)) : DataFrame
- #summarize_at(col_sel : ColumnSelector, op : SummarizeFunc | Nil = nil) : DataFrame
- #summarize_at(col_sel : ColumnSelector, *aggfuns : AggFunc) : DataFrame
- #tail(rows = 5)
- #take(rows = 5)
- #take_last(rows : Int32)
-
#to_h
Expose a view on the data as Hash from column names to nullable arrays.
- #to_s(io : Nil)
- #to_s
-
#to_string(title = "A DataFrame", col_names = true, max_rows = PRINT_MAX_ROWS, max_width = PRINT_MAX_WIDTH, max_digits = PRINT_MAX_DIGITS, row_numbers = PRINT_ROW_NUMBERS)
Converts dataframe to its string representation.
-
#transmute(*formula : ColumnFormula)
Creates a new dataframe based on a list of column formulas, which are evaluated in the context of this instance.
-
#ungroup : DataFrame
Removes the grouping (if present) from a data frame.
-
#unite(col_name : String, which : Array(String), sep : String = "_", remove : Bool = true) : DataFrame
Convenience function to paste together multiple columns into one.
- #unite(col_name : String, *which : ColumnSelector, sep : String = "_", remove : Bool = true) : DataFrame
-
#unnest(column_name : String) : DataFrame
If you have a list-column, this makes each element of the list its own row.
-
#write_csv(filename : String, separator : Char = ',', quote_char : Char = '"') : Nil
Saves the current dataframe to a separator-delimited file.
- #write_csv(io : IO, separator : Char = ',', quote_char : Char = '"') : Nil
Instance Method Detail
Adds new variables and preserves existing
Add a new column and preserve existing ones.
df.add_column("salary_category") { 3 } # with constant value
df.add_column("age_3y_later") { |e| e["age"] + 3 } # by doing basic column arithmetics
add multiple columns
df.add_columns(
"age_plus3".with { |e| e["age"] + 3 },
"initials".with { |e| e["first_name"].map(&.to_s[0]).concatenate(e["last_name"].map(&.to_s[0])) }
)
Returns a DataFrame containing the new row. The new row length must match the number of columns in the DataFrame
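As an illustration of the length rule above, here is a minimal sketch of appending a row. It assumes the `crysda` shard is installed; the table and the values in the new row are made up for the example.

```crystal
require "crysda"

df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
  "max", 2014, 33.1, "M",
  "anna", 2015, 39.2, "F"
)

# Pass one value per column, in column order.
# The receiver is not modified; a new DataFrame is returned.
longer = df.add_row("nell", 1997, 48.1, "F")

puts df.num_row     # row count of the original table
puts longer.num_row # row count of the table with the extra row
```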
Add new columns. Rows are matched by position, so all data frames must have the same number of rows.
Adds new rows. Missing entries are set to nil. The output of #bind_rows will contain a column if that column appears in any of the inputs. When row-binding, columns are matched by name, and any missing columns will be filled with nil. Grouping will be discarded when binding rows.
Add new rows. Missing entries are set to nil. The output of #bind_rows will contain a column if that column appears in any of the inputs.
When row-binding, columns are matched by name, and any missing column will be filled with nil.
Grouping will be discarded when binding rows.
row1 = {
"person" => "james",
"year" => 1996,
"weight" => 54.0,
"sex" => "M",
} of String => Any
row2 = {
"person" => "nell",
"year" => 1997,
"weight" => 48.1,
"sex" => "F",
} of String => Any
df.bind_rows(row1, row2)
Turns implicit missing values into explicit missing values. This is a wrapper around #expand
Counts observations by group.
If no grouping attributes are provided, the method respects the grouping of the receiver; for an ungrouped receiver it simply counts the rows in the data-frame.
selects - The variables to be used for cross-tabulation.
name - The name of the count column in the resulting table.
df.count("column name")
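A slightly fuller sketch of `#count`, assuming the `crysda` shard is installed and reusing the sample table from the `#filter_by_row` example. The alternative count-column name `"rows"` is chosen only for illustration.

```crystal
require "crysda"

df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
  "max", 2014, 33.1, "M",
  "max", 2016, nil, "M",
  "anna", 2015, 39.2, "F",
  "anna", 2016, 39.9, "F"
)

# One output row per distinct value of "sex"; the tally goes into column "n".
by_sex = df.count("sex")
by_sex.print

# The count column can be given another name through the `name` argument.
renamed = df.count("sex", name: "rows")
renamed.print
```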
Counts expressions
If no grouping attributes are provided, the method respects the grouping of the receiver; for an ungrouped receiver it simply counts the rows in the data-frame.
Retains only unique/distinct rows.
selects - Variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved.
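A small sketch of the behaviour described above, assuming the `crysda` shard is installed; the sample data is the table used in the `#filter_by_row` example.

```crystal
require "crysda"

df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
  "max", 2014, 33.1, "M",
  "max", 2016, nil, "M",
  "anna", 2015, 39.2, "F",
  "anna", 2016, 39.9, "F"
)

# Uniqueness is judged on "person" only, so one row per person remains
# (the first one encountered for each person).
unique_people = df.distinct("person")
unique_people.print

# Without arguments every column takes part in the comparison;
# all four rows differ, so nothing is dropped here.
all_distinct = df.distinct
puts all_distinct.num_row
```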
#expand is often useful in conjunction with left_join if you want to convert implicit missing values to explicit missing values.
Filter the rows of a table with a single predicate. #filter is used to subset a data frame, retaining all rows that satisfy your conditions.
AND-filter a table with several predicates: a row is retained only if every predicate holds for it.
df.filter { |e| e["age"] == 23 }
df.filter { |e| e["weight"] > 50 }
df.filter { |e| e["first_name"].matching { |s| s.starts_with?("Ho") } }
filter rows by Row predicate, which is invoked on each row of the dataframe
df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
"max", 2014, 33.1, "M",
"max", 2016, nil, "M",
"anna", 2015, 39.2, "F",
"anna", 2016, 39.9, "F"
)
df.filter_by_row { |f| f["year"].as_i > 2015 }.print
Filters rows by a row predicate, which is invoked with each row of the dataframe and that row's index.
df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
"max", 2014, 33.1, "M",
"max", 2016, nil, "M",
"anna", 2015, 39.2, "F",
"anna", 2016, 39.9, "F"
)
df.filter_by_row_with_index { |f, i| f["year"].as_i > 2015 || i % 2 != 0 }.print
gather takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed. Use #gather when you notice that you have columns that are not variables.
key - Name of the key column to create in the output.
value - Name of the value column to create in the output.
columns - The columns to gather. The same selector syntax as for #select is supported here.
convert - If true, automatically runs a type conversion on the key column. This is useful if the column names are actually numeric, integer, or logical.
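To make the parameters above concrete, here is a minimal sketch that reshapes a wide table into a long one. It assumes the `crysda` shard is installed; the column names `y2015` and `y2016` and all values are invented for the example.

```crystal
require "crysda"

# A "wide" table: one weight column per year.
wide = Crysda.dataframe_of("person", "y2015", "y2016").values(
  "max", 33.1, 35.0,
  "anna", 39.2, 39.9
)

# Collapse the two year columns into a key column ("year") and a value
# column ("weight"); "person" is duplicated as needed.
long = wide.gather("year", "weight", ["y2015", "y2016"])
long.print
```

Each of the two input rows contributes one output row per gathered column.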
Creates a grouped data-frame given a list of grouping attributes.
Most data operations are done on groups defined by variables. #group_by() takes the receiver data-frame and converts it into a grouped data-frame where operations are performed "by group". #ungroup() removes grouping.
Most verbs like #add_column(), #summarize(), etc. will be executed per group if a grouping is present.
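The grouping round trip described above can be sketched as follows, assuming the `crysda` shard is installed and using the sample table from the `#filter_by_row` example. Only `#group_by`, `#groups`, `#grouped_by` and `#ungroup` from this page are used.

```crystal
require "crysda"

df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
  "max", 2014, 33.1, "M",
  "max", 2016, nil, "M",
  "anna", 2015, 39.2, "F",
  "anna", 2016, 39.9, "F"
)

by_sex = df.group_by("sex")

# One sub-frame per distinct value of the grouping attribute.
puts by_sex.groups.size

# The distinct grouping tuples, as a data-frame of their own.
by_sex.grouped_by.print

# Dropping the grouping again keeps all rows.
flat = by_sex.ungroup
puts flat.num_row
```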
Creates a grouped data-frame from a column selector function. See #select() for details about column selection.
Most data operations are done on groups defined by variables. #group_by() takes the receiver data-frame and converts it into a grouped data-frame where operations are performed "by group". #ungroup() removes grouping.
Creates a grouped data-frame from one or more table expressions. See #add_column() for details about table expressions.
Most data operations are done on groups defined by variables. #group_by() takes the receiver data-frame and converts it into a grouped data-frame where operations are performed "by group". #ungroup() removes grouping.
Returns a data-frame of distinct grouping variable tuples for a grouped data-frame. Returns an empty data-frame for ungrouped data.
Returns the groups of a grouped data frame or just a reference to self
Push some columns to the left end of a data-frame
Push some columns to the right end of a data-frame
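A short sketch of both reordering methods, assuming the `crysda` shard is installed; the table is a shortened version of the sample data used elsewhere on this page.

```crystal
require "crysda"

df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
  "max", 2014, 33.1, "M",
  "anna", 2015, 39.2, "F"
)

# "sex" becomes the leftmost column; the other columns follow.
sex_first = df.move_left("sex")
puts sex_first.names

# "person" becomes the rightmost column.
person_last = df.move_right("person")
puts person_last.names
```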
Nest repeated values in a list-variable.
There are many possible ways one could choose to nest columns inside a data frame. #nest creates a list of data frames containing all the nested variables: this seems to be the most useful form in practice.
Usage
nest(col_select, column_name = "data")
col_select - A selection of columns. If not provided, all except the grouping variables are selected.
column_name - The name of the new column, as a string.
Also see https://github.com/tidyverse/tidyr/blob/master/R/nest.R
Prints a dataframe to output (defaults to STDOUT). df.to_s will also work, but has no options.
Select or reject columns by predicate
Rename one or several columns. Positions should be preserved.
Returns an Iterator over all rows. Per row data is represented as DataFrameRow
Creates a grouped data-frame where each group consists of exactly one line.
Select random rows from a table. If the receiver is grouped, sampling is done per group.
fraction - Fraction of rows to sample.
replace - Sample with or without replacement.
Select random rows from a table. If the receiver is grouped, sampling is done per group.
n - Number of rows to sample.
replace - Sample with or without replacement.
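A minimal sketch of both sampling methods on an ungrouped table, assuming the `crysda` shard is installed. Which rows are drawn is random, so only the row counts are shown.

```crystal
require "crysda"

df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
  "max", 2014, 33.1, "M",
  "max", 2016, nil, "M",
  "anna", 2015, 39.2, "F",
  "anna", 2016, 39.9, "F"
)

# Draw exactly two rows, without replacement (the default).
two_rows = df.sample_n(2)
puts two_rows.num_row

# Draw a fraction of the rows instead of a fixed number.
half = df.sample_frac(0.5)
half.print
```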
Prints the schema (that is column names, types, and the first few values per column) of a dataframe to output (defaults to STDOUT).
Create a new data frame with only selected columns
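A short sketch of selecting columns by name, with `#reject` as its counterpart, assuming the `crysda` shard is installed; the table is a shortened version of the sample data used elsewhere on this page.

```crystal
require "crysda"

df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
  "max", 2014, 33.1, "M",
  "anna", 2015, 39.2, "F"
)

# Keep only the named columns.
subset = df.select("person", "weight")
puts subset.names

# Drop the named columns and keep everything else.
rest = df.reject("sex")
puts rest.names
```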
Keeps only the variables that match any of the given expression
Select or reject columns by predicate
Special case of inner join against distinct right side
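To contrast `#semi_join` with `#inner_join`, here is a minimal sketch assuming the `crysda` shard is installed. Both tables and their contents are invented for the example.

```crystal
require "crysda"

people = Crysda.dataframe_of("person", "sex").values(
  "max", "M",
  "anna", "F",
  "nell", "F"
)

weights = Crysda.dataframe_of("person", "weight").values(
  "max", 33.1,
  "max", 35.0,
  "anna", 39.2
)

# Inner join: one output row per matching pair, so "max" appears twice.
joined = people.inner_join(weights, by: "person")
joined.print

# Semi join: the right side is reduced to distinct keys first, so each
# row of `people` that has a match appears once; "nell" has none.
matched = people.semi_join(weights, by: "person")
matched.print
```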
Given either a regular expression or a vector of character positions, separate() turns a single character column into multiple columns.
column - Name of the column to split.
into - Names of the new variables to create, as an array of strings.
sep - Separator between columns. If a String, it is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.
remove - If true, remove the input column from the output data frame.
convert - If set, a type conversion is attempted on all new columns. This is useful if the value column was a mix of variables that was coerced to a string.
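A minimal sketch of splitting one column into two, assuming the `crysda` shard is installed. The column name `person_year`, the `-` separator and the values are invented for the example.

```crystal
require "crysda"

df = Crysda.dataframe_of("id", "person_year").values(
  1, "max-2014",
  2, "anna-2015"
)

# Split "person_year" at the dash into the new columns "person" and "year".
# With remove left at its default (true) the input column is dropped.
parts = df.separate("person_year", ["person", "year"], sep: "-")
parts.print
```

Passing `convert: true` would additionally attempt a type conversion on the new columns, as described above.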
Replace current column names with new ones. The number of provided names must match the number of columns.
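A short sketch of replacing all column names at once, assuming the `crysda` shard is installed; the new names are invented for the example.

```crystal
require "crysda"

df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
  "max", 2014, 33.1, "M",
  "anna", 2015, 39.2, "F"
)

# One new name per existing column, in column order.
renamed = df.set_names("name", "yr", "kg", "gender")
puts renamed.names

# The receiver keeps its original names.
puts df.names
```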
Select rows by position while taking into account grouping in a data-frame.
Resorts the receiver in ascending order (small values go to the top of the table). The first argument defines the primary attribute to sort by. Additional ones are used to resolve ties.
Missing values will come last in the sorted table.
Resorts the receiver in descending order (small values go to the bottom of the table). The first argument defines the primary attribute to sort by. Additional ones are used to resolve ties.
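A minimal sketch of both sort directions, assuming the `crysda` shard is installed. It reads the first row through `#rows` and the `as_i` accessor shown in the `#filter_by_row` example.

```crystal
require "crysda"

df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
  "max", 2016, nil, "M",
  "anna", 2015, 39.2, "F",
  "max", 2014, 33.1, "M",
  "anna", 2016, 39.9, "F"
)

# Ascending: "year" is the primary attribute, "person" resolves ties.
ascending = df.sort_by("year", "person")
first_year = ascending.rows.first["year"].as_i
puts first_year # smallest year in the table

# Descending: the largest year comes first.
descending = df.sort_desc_by("year")
last_year = descending.rows.first["year"].as_i
puts last_year # largest year in the table
```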
spread a key-value pair across multiple columns.
key - The name of the column whose values will be used as column headings.
value - The name of the column whose values will populate the cells.
fill - If set, missing values will be replaced with this value (NOT IMPLEMENTED).
convert - If set, a type conversion is attempted on all new columns. This is useful if the value column was a mix of variables that was coerced to a string.
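A minimal sketch of the inverse of `#gather`, assuming the `crysda` shard is installed. The long-format table is invented for the example.

```crystal
require "crysda"

# A "long" table: one row per person and year.
long = Crysda.dataframe_of("person", "year", "weight").values(
  "max", 2014, 33.1,
  "max", 2016, 35.0,
  "anna", 2014, 39.2,
  "anna", 2016, 39.9
)

# The distinct values of "year" become column headings and the matching
# "weight" values fill the cells, leaving one row per person.
wide = long.spread("year", "weight")
wide.print
```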
Creates a summary of a table or a group. The provided expression is expected to evaluate to a scalar value and not into a column.
#summarize() is typically used on grouped data created by #group_by(). The output will have one row for each group.
Creates a new dataframe based on a list of column formulas, which are evaluated in the context of this instance.
Removes the grouping (if present) from a data frame.
Convenience function to paste together multiple columns into one.
col_name - Name of the column to add.
which - Names of the columns which should be concatenated together.
sep - Separator to use between values.
remove - If true, remove input columns from the output data frame.
See #separate.
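A minimal sketch of pasting two columns together, assuming the `crysda` shard is installed. The column names and values are invented for the example.

```crystal
require "crysda"

df = Crysda.dataframe_of("first_name", "last_name", "age").values(
  "Max", "Doe", 23,
  "Anna", "Smith", 31
)

# Join the two name columns with a space into "full_name".
# With remove left at its default (true) the inputs are dropped.
united = df.unite("full_name", ["first_name", "last_name"], sep: " ")
united.print
```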
If you have a list-column, this makes each element of the list its own row. It unfolds data vertically. unnest() can handle list-columns that contain atomic vectors, lists, or data frames (but not a mixture of the different types).
Saves the current dataframe to a separator-delimited file.
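A minimal sketch of the `IO` overload, assuming the `crysda` shard is installed. Writing into an in-memory buffer (`IO::Memory` from the standard library) avoids touching the file system; the filename overload takes a path instead.

```crystal
require "crysda"

df = Crysda.dataframe_of("person", "year", "weight", "sex").values(
  "max", 2014, 33.1, "M",
  "anna", 2015, 39.2, "F"
)

# Comma-separated output (the default separator).
csv_io = IO::Memory.new
df.write_csv(csv_io)
csv = csv_io.to_s
puts csv

# Tab-separated output through the `separator` argument.
tsv_io = IO::Memory.new
df.write_csv(tsv_io, separator: '\t')
tsv = tsv_io.to_s
puts tsv
```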