Quick Start
Jupyter
Online
Try out dataframe on Binder.
Run locally from docker
Clone the ihaskell-dataframe repository.
Ensure that docker is installed on your machine.
Run the following from the root directory:
sudo make up
Examples
There are pre-loaded examples in the Jupyter environment.
Running Haskell locally
Installation
Install GHC (the Haskell compiler) and cabal.
For macOS/Linux/WSL2:
curl --proto '=https' --tlsv1.2 -sSf https://get-ghcup.haskell.org | BOOTSTRAP_HASKELL_NONINTERACTIVE=1 sh
For Windows:
$ErrorActionPreference = 'Stop';Set-ExecutionPolicy Bypass -Scope Process -Force;[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072;try { & ([ScriptBlock]::Create((Invoke-WebRequest https://www.haskell.org/ghcup/sh/bootstrap-haskell.ps1 -UseBasicParsing))) -InBash -InstallDir "C:\" } catch { Write-Error $_ }
Cabal scripts
You can run standalone scripts with minimal setup using cabal scripts.
#!/usr/bin/env cabal
{- cabal:
build-depends: base >= 4, dataframe
-}
-- Test.hs
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TemplateHaskell #-}
import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Operators
-- Creates the column references used below (namely total_rooms and households)
-- This gives us type-safe column access.
$(D.declareColumnsFromCsvFile "./data/housing.csv")
main :: IO ()
main = do
    df <- D.readCsv "./data/housing.csv"
    print (df |> D.derive "rooms_per_household" (total_rooms / households))
Save the file as Test.hs and run with:
cabal run Test.hs
For scripts that need stronger type safety, we also provide a small monadic DSL.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TemplateHaskell #-}
module Main where
import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Monad
import Data.Text (Text)
import DataFrame.Functions ((.&&.), (.>=.))
$(D.declareColumnsFromCsvFile "./data/housing.csv")
main :: IO ()
main = do
    df <- D.readCsv "./data/housing.csv"
    print $ execFrameM df $ do
        -- 1) `median_house_value` and `median_income` are type-safe column references.
        -- 2) Each `deriveM` call returns a type-safe reference to the column it creates.
        is_expensive <- deriveM "is_expensive" (median_house_value .>=. 500000)
        luxury <- deriveM "luxury" (is_expensive .&&. median_income .>=. 8)
        filterWhereM luxury
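To show the general shape of such a DSL, here is a minimal sketch of a state-passing monad in which deriving a column returns a reference that later steps can reuse. It is not the library's implementation: `FrameM`, `Ref`, `deriveCol`, and `execFrame` are hypothetical names, the frame is simplified to a map of `Double` columns, and only `containers` is assumed.

```haskell
import qualified Data.Map.Strict as M

-- A toy frame: column name -> column values (Double-only for brevity).
type Frame = M.Map String [Double]

-- A reference to a column that is known to exist in the frame.
newtype Ref = Ref String

-- A hand-rolled state monad that threads the frame through each step.
newtype FrameM a = FrameM { runFrameM :: Frame -> (a, Frame) }

instance Functor FrameM where
  fmap f (FrameM g) = FrameM $ \s -> let (a, s') = g s in (f a, s')

instance Applicative FrameM where
  pure a = FrameM $ \s -> (a, s)
  FrameM f <*> FrameM g = FrameM $ \s ->
    let (h, s1) = f s
        (a, s2) = g s1
     in (h a, s2)

instance Monad FrameM where
  FrameM g >>= k = FrameM $ \s -> let (a, s') = g s in runFrameM (k a) s'

-- Combine two columns element-wise, store the result under a new name,
-- and hand back a reference to the new column.
deriveCol :: String -> (Double -> Double -> Double) -> Ref -> Ref -> FrameM Ref
deriveCol name f (Ref a) (Ref b) = FrameM $ \frame ->
  let column = zipWith f (M.findWithDefault [] a frame) (M.findWithDefault [] b frame)
   in (Ref name, M.insert name column frame)

-- Run the steps and keep only the final frame.
execFrame :: Frame -> FrameM a -> Frame
execFrame frame action = snd (runFrameM action frame)

main :: IO ()
main = do
  let frame = M.fromList [("total_rooms", [880.0, 7099.0]), ("households", [126.0, 1138.0])]
      result = execFrame frame $ do
        perHousehold <- deriveCol "rooms_per_household" (/) (Ref "total_rooms") (Ref "households")
        -- The returned reference feeds the next step directly.
        _ <- deriveCol "rooms_per_household_sq" (*) perHousehold perHousehold
        pure ()
  print (M.keys result)
```

Because each step returns its reference through the monad, a later step can only name a derived column after the step that created it has run.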
As a command-line tool
Install the package:
cabal install dataframe
Then start the dataframe REPL by running:
dataframe
The executable should be on your PATH after installation.
Example usage
GHCi/Jupyter notebooks
Inspect the structure of the columns.
dataframe> df <- D.readCsv "./data/housing.csv"
dataframe> D.describeColumns df
------------------------------------------------------------------------
Column Name | ## Non-null Values | ## Null Values | Type
--------------------|--------------------|----------------|-------------
Text | Int | Int | Text
--------------------|--------------------|----------------|-------------
total_bedrooms | 20433 | 207 | Maybe Double
ocean_proximity | 20640 | 0 | Text
median_house_value | 20640 | 0 | Double
median_income | 20640 | 0 | Double
households | 20640 | 0 | Double
population | 20640 | 0 | Double
total_rooms | 20640 | 0 | Double
housing_median_age | 20640 | 0 | Double
latitude | 20640 | 0 | Double
longitude | 20640 | 0 | Double
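The `Maybe Double` type reported for total_bedrooms reflects its 207 missing entries (20433 + 207 = 20640 rows). As a rough illustration of how such counts relate to a column of optional values, here is a plain-Haskell sketch; `nullSummary` is a hypothetical helper and not part of the dataframe API.

```haskell
import Data.Maybe (isJust, isNothing)

-- | Return (non-null count, null count) for a column of optional values.
-- Hypothetical helper for illustration only.
nullSummary :: [Maybe a] -> (Int, Int)
nullSummary xs = (length (filter isJust xs), length (filter isNothing xs))

main :: IO ()
main = do
  -- A toy column with one missing entry.
  let bedrooms = [Just 129.0, Nothing, Just 190.0] :: [Maybe Double]
  print (nullSummary bedrooms)
```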
Automatically generate column names.
dataframe> :declareColumns df
We can use the generated columns in expressions.
dataframe> import DataFrame.Operators
dataframe> df |> D.groupBy ["ocean_proximity"] |> D.aggregate [(F.mean median_house_value) `as` "avg_house_value" ]
-------------------------------------
ocean_proximity | avg_house_value
-----------------|-------------------
Text | Double
-----------------|-------------------
<1H OCEAN | 240084.28546409807
INLAND | 124805.39200122119
ISLAND | 380440.0
NEAR BAY | 259212.31179039303
NEAR OCEAN | 249433.97742663656
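To make the semantics of the grouped aggregation concrete, here is a minimal sketch of a group-by mean in plain Haskell using `Data.Map.Strict` from `containers`. It does not use the dataframe library; `meanBy` is a hypothetical helper and the sample rows are toy values chosen for illustration.

```haskell
import qualified Data.Map.Strict as M

-- | Group values by key and return the arithmetic mean of each group.
-- Hypothetical helper for illustration; not part of the dataframe API.
meanBy :: Ord k => [(k, Double)] -> M.Map k Double
meanBy rows = M.map (\(total, n) -> total / fromIntegral n) sums
  where
    -- Accumulate a running (sum, count) pair per key.
    sums = M.fromListWith combine [(k, (v, 1 :: Int)) | (k, v) <- rows]
    combine (s1, n1) (s2, n2) = (s1 + s2, n1 + n2)

main :: IO ()
main = do
  -- Toy (ocean_proximity, median_house_value) pairs.
  let rows =
        [ ("NEAR BAY", 452600.0)
        , ("NEAR BAY", 358500.0)
        , ("INLAND", 100000.0)
        , ("INLAND", 150000.0)
        ]
  mapM_ print (M.toList (meanBy rows))
```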
Create a new column based on other columns.
dataframe> df |> D.derive "rooms_per_household" (total_rooms / households)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | rooms_per_household
-----------|----------|--------------------|-------------|----------------|------------|------------|--------------------|--------------------|-----------------|--------------------
Double | Double | Double | Double | Maybe Double | Double | Double | Double | Double | Text | Double
-----------|----------|--------------------|-------------|----------------|------------|------------|--------------------|--------------------|-----------------|--------------------
-122.23 | 37.88 | 41.0 | 880.0 | Just 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY | 6.984126984126984
-122.22 | 37.86 | 21.0 | 7099.0 | Just 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY | 6.238137082601054
-122.24 | 37.85 | 52.0 | 1467.0 | Just 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY | 8.288135593220339
-122.25 | 37.85 | 52.0 | 1274.0 | Just 235.0 | 558.0 | 219.0 | 5.6431000000000004 | 341300.0 | NEAR BAY | 5.8173515981735155
-122.25 | 37.85 | 52.0 | 1627.0 | Just 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY | 6.281853281853282
-122.25 | 37.85 | 52.0 | 919.0 | Just 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 | NEAR BAY | 4.761658031088083
-122.25 | 37.84 | 52.0 | 2535.0 | Just 489.0 | 1094.0 | 514.0 | 3.6591 | 299200.0 | NEAR BAY | 4.9319066147859925
-122.25 | 37.84 | 52.0 | 3104.0 | Just 687.0 | 1157.0 | 647.0 | 3.12 | 241400.0 | NEAR BAY | 4.797527047913447
-122.26 | 37.84 | 42.0 | 2555.0 | Just 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 | NEAR BAY | 4.294117647058823
-122.25 | 37.84 | 52.0 | 3549.0 | Just 707.0 | 1551.0 | 714.0 | 3.6912000000000003 | 261100.0 | NEAR BAY | 4.970588235294118
If an expression combines columns of incompatible types, the compiler rejects it with a type error instead of letting it fail at runtime.
dataframe> df |> D.derive "nonsense_feature" (latitude + ocean_proximity) |> D.take 10
<interactive>:14:47: error: [GHC-83865]
    • Couldn't match type ‘Text’ with ‘Double’
      Expected: Expr Double
        Actual: Expr Text
    • In the second argument of ‘(+)’, namely ‘ocean_proximity’
      In the second argument of ‘derive’, namely
        ‘(latitude + ocean_proximity)’
      In the second argument of ‘(|>)’, namely
        ‘derive "nonsense_feature" (latitude + ocean_proximity)’
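The mechanism behind this kind of error can be sketched with a toy expression type that carries its result type as a parameter. This is an illustration under simplifying assumptions, not the library's actual `Expr` definition: the constructors, the `.+.` operator, and `render` are all hypothetical.

```haskell
-- A toy expression type indexed by the type of value it produces.
data Expr a
  = Col String            -- reference to a named column
  | Add (Expr a) (Expr a) -- sum of two expressions of the same type

-- Typed column references, analogous to what :declareColumns generates.
latitude, longitude :: Expr Double
latitude = Col "latitude"
longitude = Col "longitude"

oceanProximity :: Expr String
oceanProximity = Col "ocean_proximity"

-- Addition requires both operands to share one numeric type.
(.+.) :: Num a => Expr a -> Expr a -> Expr a
(.+.) = Add

-- Pretty-print an expression so the example has something to show.
render :: Expr a -> String
render (Col name) = name
render (Add l r) = "(" ++ render l ++ " + " ++ render r ++ ")"

main :: IO ()
main = do
  putStrLn (render (latitude .+. longitude))
  -- The next line is rejected at compile time if uncommented, because
  -- Expr String does not match Expr Double:
  -- putStrLn (render (latitude .+. oceanProximity))
```

Because the column's element type is part of the reference's type, the mismatch is found before any data is read.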
Key features in this example:
Intuitive, SQL-like API to get from data to insights.
Create type-safe references to columns in a dataframe using :declareColumns.
Type-safe column transformations for faster and safer exploration.
Fluid, chaining API that makes code easy to reason about.