Dataframe
The Dataframe shard allows programmers to work with and manipulate data in a dataframe (or dataset) object. Information can be easily imported or exported in CSV format, modified by column, filtered by row, or even combined with other Dataframe objects.
Installation
- Add the dependency to your shard.yml:
  dependencies:
    dataframe:
      github: HCLarsen/dataframe
- Run shards install
Usage
Dataframes can be created from column names and data, or imported from CSV or JSON. Data types of columns can be specified during initialization, or they can be automatically determined from the content of the data.
NOTE While dataframe columns are always nilable, they cannot be declared as any other union type. When the column type is determined automatically, the type of the first non-nil entry in that column is used.
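The inference rule in the note can be sketched in plain Crystal. This is only an illustration of "first non-nil entry wins", not the shard's own code: the `Cell` alias and the `infer_column_type` helper are hypothetical names, and returning nil for an all-nil column is an assumption.

```crystal
# Hypothetical stand-in for a dataframe cell; not the shard's own type.
alias Cell = String | Int32 | Float64 | Bool | Nil

# Returns the class of the first non-nil entry.
# Assumption: an all-nil column yields nil, since there is nothing to infer from.
def infer_column_type(values : Array(Cell))
  first = values.find { |v| !v.nil? }
  first.nil? ? nil : first.class
end

ages = [nil, 41, 47] of Cell
puts infer_column_type(ages) # prints Int32
```

Under this rule a column that mixes types is typed by whichever value comes first, so it may be safer to specify column types explicitly when the data is not uniform.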
require "dataframe"
headers = ["Name", "Age", "Address"]
rows = [
["Jim", 41, "Hawkins, Indiana, USA"] of Dataframe::Type,
["Yuri", 47, "Siberia, USSR"] of Dataframe::Type,
["Murray", 40, "Sesser, Illinois, USA"] of Dataframe::Type,
]
dataframe = Dataframe.new(headers, rows)
The values in the Dataframe can be accessed as an Array of arrays called #data, an Array of Row objects, or a Hash of Column objects.
dataframe.shape #=> {3, 3}
dataframe.data[1][0] #=> "Yuri"
dataframe.row[1] #=> Dataframe::Row{ "Name" => "Yuri", "Age" => 47, "Address" => "Siberia, USSR" }
dataframe.column("Age") #=> Dataframe::Column{41, 47, 40}
The Column class contains many statistical and mathematical methods, such as #sum, #avg, and #mode. While a Column in a Dataframe can't be modified directly, it can be reassigned using the #[]= operator with the column name.
age_column = dataframe["Age"].as(Dataframe::Column(Int32))
age_column.map! { |e| e.nil? ? nil : e + 1 }
dataframe["Age"] = age_column
dataframe["Age"] #=> Dataframe::Column{42, 48, 41}
NOTE Reassigning a column cannot change its type. Attempting to do so will raise a runtime error. If the calculation returns a new data type, it's best to add a new column with values generated by #map on the existing column.
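The difference between an in-place update and a type-changing calculation can be sketched with a plain Array standing in for a column. This is an assumption made for illustration; the shard's own Column API is not used here.

```crystal
# Plain Array standing in for a nilable Int32 column (illustrative assumption).
ages = [41, 47, nil] of Int32?

# map! keeps the element type, so it fits a same-type reassignment.
ages.map! { |e| e.nil? ? nil : e + 1 }

# map builds a new collection, so the element type can change (here to String?).
labels = ages.map { |e| e.nil? ? nil : "#{e} years" }

p ages   # prints [42, 48, nil]
p labels # prints ["42 years", "48 years", nil]
```

Because `labels` no longer holds integers, values like these would belong in a new column rather than being assigned back to "Age".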
Development
To Do
- Add #fillnil
- Add #tally method to get frequency count info for a column.
- Add #compact and #compact! (remove any row with a nil value, or with a nil value in specified columns).
- Add calculation for sparsity of the dataset.
- Add calculation for sparsity of a row.
- Add calculation for sparsity of a column.
- Add #uniq to dataframe to eliminate duplicate rows for all, or some columns.
- Add detection of "mergeable" rows.
- Add #info method to display row and column characteristics.
- Add #describe method to display info about a column.
- Get a sample of rows based on percentage.
- Create a JSON parser/generator.
- Add correlation analysis between two columns.
Contributing
All features must be properly tested, using minitest.cr. Methods and their params should always have an indicated type.
- Fork it (https://gitlab.com/HCLarsen/dataframe/-/forks/new)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
Contributors
- Chris Larsen - creator and maintainer