Tutorial
Tablite¶
Introduction¶
Tablite fills the data-science space for incremental data processing where:
- Datasets are larger than memory.
- You don't want to worry about datatypes.
Tablite thereby competes with:
- Pandas, but saves the memory overhead.
- Numpy, but spares you from worrying about lower-level datatypes.
- SQLite, by sheer speed.
- Polars, by working beyond RAM.
- Other libraries for data cleaning, thanks to tablite's powerful datatypes module.
Install: pip install tablite
Usage: >>> from tablite import Table
Upgrade: pip install tablite --no-cache --upgrade
Overview¶
(Version 2023.6.0 and later. For older versions see this)
- Tablite handles all Python datatypes: str, float, bool, int, date, datetime, time, timedelta and None.
- you can select:
  - all rows in a column as table['A']
  - rows across all columns as table[4:8]
  - or a slice as table['A', 'B', slice(4,8)].
- you can update with table['A'][2] = new_value
- you can store or send data using json, by:
  - dumping to json: json_str = table.to_json(), or
  - loading it with Table.from_json(json_str).
- you can iterate over rows using for row in Table.rows.
- you can ask column_xyz in Table.columns?
- load from files with new_table = Table.from_file('this.csv'), which has automatic datatype detection
- perform inner, outer & left sql joins between tables as simply as table_1.inner_join(table2, keys=['A', 'B'])
- summarise using table.groupby( ... )
- create pivot tables using groupby.pivot( ... )
- perform multi-criteria lookup in tables using table1.lookup(table2, criteria=.....
- and of course a large selection of tools in from tablite.tools import *
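Taken together, the selection and update operations listed above look like this (a minimal sketch using a small throw-away table):
from tablite import Table
t = Table({'A': [1, 2, 3, 4, 5, 6, 7, 8], 'B': list('abcdefgh')})
col = t['A']                    ## all rows in a column
rows = t[4:8]                   ## rows across all columns
sub = t['A', 'B', slice(4,8)]   ## a column/row slice
t['A'][2] = 30                  ## update a single value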
Examples¶
Here are some examples:
from tablite import Table
## Creating a tablite table is as simple as populating a dictionary:
t = Table({'A':[1,2,3], 'B':['a','b','c']})
## In this notebook we can show tables in the HTML style:
t
# | A | B |
---|---|---|
0 | 1 | a |
1 | 2 | b |
2 | 3 | c |
## or the ascii style:
t.show()
+==+=+=+
|# |A|B|
+--+-+-+
| 0|1|a|
| 1|2|b|
| 2|3|c|
+==+=+=+
## or if you'd like to inspect the table, use:
print(str(t))
Table(2 columns, 3 rows)
## You can also add all columns at once (slower) if you prefer.
t2 = Table(headers=('A','B'), rows=((1,'a'),(2,'b'),(3,'c')))
assert t==t2
## or load data:
t3 = Table.from_file('tests/data/book1.csv')
## to view any table in the notebook just let jupyter show the table. If you're using the terminal use .show().
## Note that show gives either first and last 7 rows or the whole table if it is less than 20 rows.
t3
Collecting tasks: 'tests/data/book1.csv'
Dumping tasks: 'tests/data/book1.csv'
importing file: 100%|██████████| 1/1 [00:00<00:00, 487.82it/s]
# | a | b | c | d | e | f |
---|---|---|---|---|---|---|
0 | 1 | 0.060606061 | 0.090909091 | 0.121212121 | 0.151515152 | 0.181818182 |
1 | 2 | 0.121212121 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 |
2 | 3 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 |
3 | 4 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 |
4 | 5 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 |
5 | 6 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 |
6 | 7 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 | 62.06060606 |
... | ... | ... | ... | ... | ... | ... |
38 | 39 | 16659267088.0 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 |
39 | 40 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 |
40 | 41 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 |
41 | 42 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 |
42 | 43 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 |
43 | 44 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 |
44 | 45 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 | 17059100000000.0 |
## should you however want to select the headers instead of importing everything
## (which may be time-consuming), simply use get_headers(path)
from tablite.tools import get_headers
from pathlib import Path
path = Path('tests/data/book1.csv')
sample = get_headers(path, linecount=5)
print(f"sample is of type {type(sample)} and has the following entries:")
for k, v in sample.items():
    print(k)
    if isinstance(v, list):
        for r in sample[k]:
            print("\t", r)
sample is of type <class 'dict'> and has the following entries:
delimiter
book1.csv
	 ['a', 'b', 'c', 'd', 'e', 'f']
	 ['1', '0.060606061', '0.090909091', '0.121212121', '0.151515152', '0.181818182']
	 ['2', '0.121212121', '0.242424242', '0.484848485', '0.96969697', '1.939393939']
	 ['3', '0.242424242', '0.484848485', '0.96969697', '1.939393939', '3.878787879']
	 ['4', '0.484848485', '0.96969697', '1.939393939', '3.878787879', '7.757575758']
	 ['5', '0.96969697', '1.939393939', '3.878787879', '7.757575758', '15.51515152']
## to extend a table by adding columns, use t[new] = [new values]
t['C'] = [4,5,6]
## but make sure the column has the same length as the rest of the table!
t
# | A | B | C |
---|---|---|---|
0 | 1 | a | 4 |
1 | 2 | b | 5 |
2 | 3 | c | 6 |
## should you want to mix datatypes, tablite will not complain:
from datetime import datetime, date,time,timedelta
import numpy as np
## What you put in ...
t4 = Table()
t4['mixed'] = [
-1,0,1, # regular integers
-12345678909876543211234567890987654321, # very very large integer
None,np.nan, # null values
"one", "", # strings
True,False, # booleans
float('inf'), 0.01, # floats
date(2000,1,1), # date
datetime(2002,2,3,23,0,4,6660), # datetime
time(12,12,12), # time
timedelta(days=3, seconds=5678) # timedelta
]
## ... is exactly what you get out:
t4
# | mixed |
---|---|
0 | -1 |
1 | 0 |
2 | 1 |
3 | -12345678909876543211234567890987654321 |
4 | None |
5 | nan |
6 | one |
7 | |
8 | True |
9 | False |
10 | inf |
11 | 0.01 |
12 | 2000-01-01 |
13 | 2002-02-03 23:00:04.006660 |
14 | 12:12:12 |
15 | 3 days, 1:34:38 |
## likewise, when you claim the values back as a python list:
for item in list(t4['mixed']):
print(item)
-1
0
1
-12345678909876543211234567890987654321
None
nan
one

True
False
inf
0.01
2000-01-01
2002-02-03 23:00:04.006660
12:12:12
3 days, 1:34:38
The column itself (__repr__) shows us the pid, the file location and the entries, so you know exactly what you're working with.
t4['mixed']
Column(/tmp/tablite-tmp/pid-54911, [-1 0 1 -12345678909876543211234567890987654321 None nan 'one' '' True False inf 0.01 datetime.date(2000, 1, 1) datetime.datetime(2002, 2, 3, 23, 0, 4, 6660) datetime.time(12, 12, 12) datetime.timedelta(days=3, seconds=5678)])
## to view the datatypes in a column, use Column.types()
type_dict = t4['mixed'].types()
for k,v in type_dict.items():
print(k,v)
<class 'int'> 4
<class 'NoneType'> 1
<class 'float'> 3
<class 'str'> 2
<class 'bool'> 2
<class 'datetime.date'> 1
<class 'datetime.datetime'> 1
<class 'datetime.time'> 1
<class 'datetime.timedelta'> 1
## You may have noticed that all datatypes in t3 were identified as floats, despite originating from a text file.
## This is because tablite guesses the most probable datatype using the `.guess` function on each column.
## You can use the .guess function like this:
from tablite import DataTypes
t3['a'] = DataTypes.guess(t3['a'])
## You can also convert the datatype using a list comprehension
t3['b'] = [float(v) for v in t3['b']]
t3
# | a | b | c | d | e | f |
---|---|---|---|---|---|---|
0 | 1 | 0.060606061 | 0.090909091 | 0.121212121 | 0.151515152 | 0.181818182 |
1 | 2 | 0.121212121 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 |
2 | 3 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 |
3 | 4 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 |
4 | 5 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 |
5 | 6 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 |
6 | 7 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 | 62.06060606 |
... | ... | ... | ... | ... | ... | ... |
38 | 39 | 16659267088.0 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 |
39 | 40 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 |
40 | 41 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 |
41 | 42 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 |
42 | 43 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 |
43 | 44 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 |
44 | 45 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 | 17059100000000.0 |
API Examples¶
In the following sections, examples are given of the Tablite API's power features:
- Iteration
- Append
- Sort
- Filter
- Index
- Search All
- Search Any
- Lookup
- Join (inner, outer, left)
- GroupBy
- Pivot table
ITERATION!¶
Iteration supports for loops and list comprehension at the speed of light:
Just use [r for r in table.rows], or:
for row in table.rows:
    row ...
Here's a more practical use case:
(1) Imagine a table with columns a,b,c,d,e (all integers) like this:
t = Table()
for column_name in 'abcde':
    t[column_name] = [i for i in range(5)]
(2) we want to add two new columns using the functions:
def f1(a, b, c):
    return a + b + c + 1
def f2(b, c, d):
    return b * c * d
(3) and we want to compute two new columns f and g:
t.add_columns('f', 'g')
(4) we can now iterate over the selected columns and append the computed values to the two new columns:
f, g = [], []
for row in t['a', 'b', 'c', 'd'].rows:
    a, b, c, d = row
    f.append(f1(a, b, c))
    g.append(f2(b, c, d))
t['f'] = f
t['g'] = g
assert len(t) == 5
assert list(t.columns) == list('abcdefg')
t
# | a | b | c | d | e | f | g |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 1 | 1 | 1 | 1 | 4 | 1 |
2 | 2 | 2 | 2 | 2 | 2 | 7 | 8 |
3 | 3 | 3 | 3 | 3 | 3 | 10 | 27 |
4 | 4 | 4 | 4 | 4 | 4 | 13 | 64 |
Take note that if your dataset is asymmetric, a warning will be shown:
asymmetric_table = Table({'a':[1,2,3], 'b':[1,2]})
for row in asymmetric_table.rows:
    print(row)
## warning at the bottom ---v
[1, 1]
[2, 2]
[3, None]
/home/bjorn/github/tablite/tablite/base.py:1188: UserWarning: Column b has length 2 / 3. None will appear as fill value.
  warnings.warn(f"Column {name} has length {len(column)} / {n_max}. None will appear as fill value.")
Create Index / Indices¶
Index supports multi-key indexing using args such as: index = table.index('B','C')
.
Here's an example:
table7 = Table(columns={
'A': [1,1,2,2,3,4],
'B': [1,1,2,2,30,40],
'C': [-1,-2,-3,-4,-5,-6]
})
index = table7.index('A', 'B')
for k, v in index.items():
print("key", k, "indices", v)
key (1, 1) indices [0, 1]
key (2, 2) indices [2, 3]
key (3, 30) indices [4]
key (4, 40) indices [5]
The keys are created for each unique column-key-pair, and the value is the list of row indices where the key is found. To fetch all rows for key (2,2), we can use:
for ix, row in enumerate(table7.rows):
    if ix in index[(2,2)]:
        print(row)
[2, 2, -3]
[2, 2, -4]
APPEND¶
## to append one table to another, use + or +=
print('length before:', len(t3)) # length before: 45
t5 = t3 + t3
print('length after +', len(t5)) # length after + 90
t5 += t3
print('length after +=', len(t5)) # length after += 135
## if you need a lot of numbers for a test, you can repeat a table using * and *=
t5 *= 1_000
print('length after *=', len(t5)) # length after *= 135000
length before: 45
length after + 90
length after += 135
length after *= 135000
t5
# | a | b | c | d | e | f |
---|---|---|---|---|---|---|
0 | 1 | 0.060606061 | 0.090909091 | 0.121212121 | 0.151515152 | 0.181818182 |
1 | 2 | 0.121212121 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 |
2 | 3 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 |
3 | 4 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 |
4 | 5 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 |
5 | 6 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 |
6 | 7 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 | 62.06060606 |
... | ... | ... | ... | ... | ... | ... |
134,993 | 39 | 16659267088.0 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 |
134,994 | 40 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 |
134,995 | 41 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 |
134,996 | 42 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 |
134,997 | 43 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 |
134,998 | 44 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 |
134,999 | 45 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 | 17059100000000.0 |
## if you are in doubt whether your tables have the same columns, you can use .stack(other)
assert t.columns != t2.columns # compares list of column names.
t6 = t.stack(t2)
t6
# | a | b | c | d | e | f | g | A | B |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | None | None |
1 | 1 | 1 | 1 | 1 | 1 | 4 | 1 | None | None |
2 | 2 | 2 | 2 | 2 | 2 | 7 | 8 | None | None |
3 | 3 | 3 | 3 | 3 | 3 | 10 | 27 | None | None |
4 | 4 | 4 | 4 | 4 | 4 | 13 | 64 | None | None |
5 | None | None | None | None | None | None | None | 1 | a |
6 | None | None | None | None | None | None | None | 2 | b |
7 | None | None | None | None | None | None | None | 3 | c |
## As you can see above, t6 is padded with None wherever either table was missing the columns.
## if you need a more detailed view of the columns you can iterate:
for name in t.columns:
    col_from_t = t[name]
    if name in t2.columns:
        col_from_t2 = t2[name]
        print(name, col_from_t == col_from_t2)
    else:
        print(name, "not in t2")
a not in t2
b not in t2
c not in t2
d not in t2
e not in t2
f not in t2
g not in t2
## to make a copy of a table, use table.copy()
t3_copy = t3.copy()
## you can also perform multi-criteria selections using getitem [ ... ]
t3_slice = t3['a','b','d', 5:25:5]
t3_slice
# | a | b | d |
---|---|---|---|
0 | 6 | 1.939393939 | 7.757575758 |
1 | 11 | 62.06060606 | 248.2424242 |
2 | 16 | 1985.939394 | 7943.757576 |
3 | 21 | 63550.06061 | 254200.2424 |
## deleting items also works the same way:
del t3_slice[1:3] # deletes the 2nd & 3rd row
t3_slice
# | a | b | d |
---|---|---|---|
0 | 6 | 1.939393939 | 7.757575758 |
1 | 21 | 63550.06061 | 254200.2424 |
## to wipe a table, use .clear:
t3_slice.clear()
t3_slice
SAVE¶
## tablite uses .npy for storage because it is fast.
## this means you can make a table persistent using .save
local_file = Path("local_file.tpz")
t5.save(local_file)
old_t5 = Table.load(local_file)
print("the t5 table had", len(old_t5), "rows") # the t5 table had 135000 rows
del old_t5 # only removes the in-memory object
print("old_t5 still exists?", local_file.exists())
print("path:", local_file)
import os
os.remove(local_file)
loading 'local_file.tpz' file: 55%|█████▍ | 9851/18000 [00:02<00:01, 4386.96it/s]
loading 'local_file.tpz' file: 100%|██████████| 18000/18000 [00:04<00:00, 4417.27it/s]
the t5 table had 135000 rows
old_t5 still exists? True
path: local_file.tpz
If you want to save a table from one session to another, use save=True. This tells the garbage collector to leave the tablite Table on disk, so you can load it again without changing your code.
For example:
The first time you run t = Table.import_file(....big.csv) it may take a minute or two.
If you then set t.save=True and restart python, the second time you run t = Table.import_file(....big.csv) it will take a few milliseconds instead of minutes.
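A minimal sketch of that workflow (assuming a hypothetical big.csv; import_file is the same call as above):
from tablite import Table
t = Table.import_file('big.csv')  ## first run: full import, may take minutes
t.save = True                     ## tell the garbage collector to leave the table on disk
## ... restart python ...
t = Table.import_file('big.csv')  ## second run: loads in milliseconds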
FILTER!¶
unfiltered = Table({'a':[1,2,3,4], 'b':[10,20,30,40]})
true,false = unfiltered.filter(
[
{"column1": 'a', "criteria":">=", 'value2':3}
], filter_type='all'
)
true
# | a | b |
---|---|---|
0 | 3 | 30 |
1 | 4 | 40 |
false.show() # using show here to show that terminal users can have a nice view too.
+==+=+==+
|# |a|b |
+--+-+--+
| 0|1|10|
| 1|2|20|
+==+=+==+
Any! All?¶
Any and All are cousins of the filter. They're there so you can use them in the same way as you'd use any and all in python - as boolean evaluators:
ty = Table({'a':[1,2,3,4],'b': [10,20,30,40]})
## typical python
any(i > 3 for i in ty['a'])
True
## in tablite you can do:
any( ty.any(**{'a':lambda x:x>3}).rows )
True
## if you have multiple criteria this also works:
all( ty.all(**{'a': lambda x:x>=2, 'b': lambda x:x<=30}).rows )
True
## or this if you want to see the table.
ty.all(a=lambda x:x>2, b=lambda x:x<=30)
# | a | b |
---|---|---|
0 | 3 | 30 |
## As `all` and `any` return tables, this also means that you can chain operations:
ty.any(a=lambda x:x>2).any(b=30)
# | a | b |
---|---|---|
0 | 3 | 30 |
SORT!¶
table = Table({
'A':[ 1, None, 8, 3, 4, 6, 5, 7, 9],
'B':[10,'100', 1, 1, 1, 1, 10, 10, 10],
'C':[ 0, 1, 0, 1, 0, 1, 0, 1, 0],
})
table
# | A | B | C |
---|---|---|---|
0 | 1 | 10 | 0 |
1 | None | 100 | 1 |
2 | 8 | 1 | 0 |
3 | 3 | 1 | 1 |
4 | 4 | 1 | 0 |
5 | 6 | 1 | 1 |
6 | 5 | 10 | 0 |
7 | 7 | 10 | 1 |
8 | 9 | 10 | 0 |
sort_order = {'B': False, 'C': False, 'A': False}
assert not table.is_sorted(mapping=sort_order)
sorted_table = table.sort(mapping=sort_order)
sorted_table
creating sort index: 100%|██████████| 3/3 [00:00<00:00, 2719.45it/s]
creating sort index: 100%|██████████| 3/3 [00:00<00:00, 3434.20it/s]
join: 100%|██████████| 3/3 [00:00<00:00, 1902.47it/s]
Sort is reasonably effective, as it uses multiprocessing above a million fields.
Hint: You can set this limit in tablite.config, like this:
from tablite.config import Config
print(f"multiprocessing is used above {Config.SINGLE_PROCESSING_LIMIT:,} fields")
multiprocessing is used above 1,000,000 fields
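If your workload favours a different threshold, the limit can be set before sorting (a sketch; 2,000,000 is just an arbitrary example value):
from tablite.config import Config
Config.SINGLE_PROCESSING_LIMIT = 2_000_000  ## stay single-process below 2 million fields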
import math
n = math.ceil(1_000_000 / (9*3))
table = Table({
'A':[ 1, None, 8, 3, 4, 6, 5, 7, 9]*n,
'B':[10,'100', 1, 1, 1, 1, 10, 10, 10]*n,
'C':[ 0, 1, 0, 1, 0, 1, 0, 1, 0]*n,
})
table
# | A | B | C |
---|---|---|---|
0 | 1 | 10 | 0 |
1 | None | 100 | 1 |
2 | 8 | 1 | 0 |
3 | 3 | 1 | 1 |
4 | 4 | 1 | 0 |
5 | 6 | 1 | 1 |
6 | 5 | 10 | 0 |
... | ... | ... | ... |
333,335 | 8 | 1 | 0 |
333,336 | 3 | 1 | 1 |
333,337 | 4 | 1 | 0 |
333,338 | 6 | 1 | 1 |
333,339 | 5 | 10 | 0 |
333,340 | 7 | 10 | 1 |
333,341 | 9 | 10 | 0 |
import time as cputime
start = cputime.time()
sort_order = {'B': False, 'C': False, 'A': False}
sorted_table = table.sort(mapping=sort_order) # sorts 1M values.
print("table sorting took ", round(cputime.time() - start,3), "secs")
sorted_table
creating sort index: 100%|██████████| 3/3 [00:00<00:00, 4.20it/s]
join: 100%|██████████| 3/3 [00:00<00:00, 18.17it/s]
table sorting took 0.913 secs
GROUPBY !¶
n = math.ceil(1_000_000 / (9*3))
table = Table({
'A':[ 1, None, 8, 3, 4, 6, 5, 7, 9]*n,
'B':[10,'100', 1, 1, 1, 1, 10, 10, 10]*n,
'C':[ 0, 1, 0, 1, 0, 1, 0, 1, 0]*n,
})
table
# | A | B | C |
---|---|---|---|
0 | 1 | 10 | 0 |
1 | None | 100 | 1 |
2 | 8 | 1 | 0 |
3 | 3 | 1 | 1 |
4 | 4 | 1 | 0 |
5 | 6 | 1 | 1 |
6 | 5 | 10 | 0 |
... | ... | ... | ... |
333,335 | 8 | 1 | 0 |
333,336 | 3 | 1 | 1 |
333,337 | 4 | 1 | 0 |
333,338 | 6 | 1 | 1 |
333,339 | 5 | 10 | 0 |
333,340 | 7 | 10 | 1 |
333,341 | 9 | 10 | 0 |
from tablite import GroupBy as gb
grpby = table.groupby(keys=['C', 'B'], functions=[('A', gb.count)])
grpby
groupby: 100%|██████████| 333342/333342 [00:00<00:00, 427322.50it/s]
# | C | B | Count(A) |
---|---|---|---|
0 | 0 | 10 | 111114 |
1 | 1 | 100 | 37038 |
2 | 0 | 1 | 74076 |
3 | 1 | 1 | 74076 |
4 | 1 | 10 | 37038 |
Here is the list of groupby functions:
class GroupBy(object):
    max = Max  # shortcuts to avoid having to type a long list of imports.
    min = Min
    sum = Sum
    product = Product
    first = First
    last = Last
    count = Count
    count_unique = CountUnique
    avg = Average
    stdev = StandardDeviation
    median = Median
    mode = Mode
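Several aggregates can be computed in one pass by listing more (column, function) pairs. A short sketch reusing the table and the gb alias from above:
grp = table.groupby(keys=['C'], functions=[('A', gb.count), ('A', gb.count_unique), ('B', gb.count_unique)])
grp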
Did I say pivot table? Yes.¶
Pivot Table is included in the groupby functionality - so yes - you can pivot the groupby on any column that is used for grouping. Here's a simple example:
t = Table({
'A':[1, 1, 2, 2, 3, 3] * 2,
'B':[1, 2, 3, 4, 5, 6] * 2,
'C':[6, 5, 4, 3, 2, 1] * 2,
})
t
# | A | B | C |
---|---|---|---|
0 | 1 | 1 | 6 |
1 | 1 | 2 | 5 |
2 | 2 | 3 | 4 |
3 | 2 | 4 | 3 |
4 | 3 | 5 | 2 |
5 | 3 | 6 | 1 |
6 | 1 | 1 | 6 |
7 | 1 | 2 | 5 |
8 | 2 | 3 | 4 |
9 | 2 | 4 | 3 |
10 | 3 | 5 | 2 |
11 | 3 | 6 | 1 |
t2 = t.pivot(rows=['C'], columns=['A'], functions=[('B', gb.sum), ('B', gb.count)], values_as_rows=False)
t2
pivot: 100%|██████████| 14/14 [00:00<00:00, 3643.83it/s]
# | C | Sum(B,A=1) | Count(B,A=1) | Sum(B,A=2) | Count(B,A=2) | Sum(B,A=3) | Count(B,A=3) |
---|---|---|---|---|---|---|---|
0 | 6 | 2 | 2 | None | None | None | None |
1 | 5 | 4 | 2 | None | None | None | None |
2 | 4 | None | None | 6 | 2 | None | None |
3 | 3 | None | None | 8 | 2 | None | None |
4 | 2 | None | None | None | None | 10 | 2 |
5 | 1 | None | None | None | None | 12 | 2 |
JOIN!¶
numbers = Table()
numbers.add_column('number', data=[ 1, 2, 3, 4, None])
numbers.add_column('colour', data=['black', 'blue', 'white', 'white', 'blue'])
letters = Table()
letters.add_column('letter', data=[ 'a', 'b', 'c', 'd', None])
letters.add_column('color', data=['blue', 'white', 'orange', 'white', 'blue'])
## left join
## SELECT number, letter FROM numbers LEFT JOIN letters ON numbers.colour == letters.color
left_join = numbers.left_join(letters, left_keys=['colour'], right_keys=['color'], left_columns=['number'], right_columns=['letter'])
left_join
join: 100%|██████████| 2/2 [00:00<00:00, 1221.94it/s]
# | number | letter |
---|---|---|
0 | 1 | None |
1 | 2 | a |
2 | 2 | None |
3 | None | a |
4 | None | None |
5 | 3 | b |
6 | 3 | d |
7 | 4 | b |
8 | 4 | d |
## inner join
## SELECT number, letter FROM numbers JOIN letters ON numbers.colour == letters.color
inner_join = numbers.inner_join(letters, left_keys=['colour'], right_keys=['color'], left_columns=['number'], right_columns=['letter'])
inner_join
join: 100%|██████████| 2/2 [00:00<00:00, 1121.77it/s]
# | number | letter |
---|---|---|
0 | 2 | a |
1 | 2 | None |
2 | None | a |
3 | None | None |
4 | 3 | b |
5 | 3 | d |
6 | 4 | b |
7 | 4 | d |
## outer join
## SELECT number, letter FROM numbers OUTER JOIN letters ON numbers.colour == letters.color
outer_join = numbers.outer_join(letters, left_keys=['colour'], right_keys=['color'], left_columns=['number'], right_columns=['letter'])
outer_join
join: 100%|██████████| 2/2 [00:00<00:00, 1585.15it/s]
# | number | letter |
---|---|---|
0 | 1 | None |
1 | 2 | a |
2 | 2 | None |
3 | None | a |
4 | None | None |
5 | 3 | b |
6 | 3 | d |
7 | 4 | b |
8 | 4 | d |
9 | None | c |
Q: But ...I think there's a bug in the join...
A: Venn diagrams do not explain joins.
A Venn diagram is a widely-used diagram style that shows the logical relation between sets, popularised by John Venn in the 1880s. The diagrams are used to teach elementary set theory, and to illustrate simple set relationships
source: en.wikipedia.org
Joins operate over rows and when there are duplicate rows, these will be replicated in the output. Many beginners are surprised by this, because they didn't read the SQL standard.
Q: So what do I do?
A: If you want to get rid of duplicates using tablite, use the index functionality across all columns and pick the first row from each index. Here's the recipe that starts with plenty of duplicates:
old_table = Table({
'A':[1,1,1,2,2,2,3,3,3],
'B':[1,1,4,2,2,5,3,3,6],
})
old_table
# | A | B |
---|---|---|
0 | 1 | 1 |
1 | 1 | 1 |
2 | 1 | 4 |
3 | 2 | 2 |
4 | 2 | 2 |
5 | 2 | 5 |
6 | 3 | 3 |
7 | 3 | 3 |
8 | 3 | 6 |
## CREATE TABLE OF UNIQUE ENTRIES (a.k.a. DEDUPLICATE)
new_table = old_table.drop_duplicates()
new_table
9it [00:00, 11329.15it/s]
join: 100%|██████████| 2/2 [00:00<00:00, 1819.26it/s]
# | A | B |
---|---|---|
0 | 1 | 1 |
1 | 1 | 4 |
2 | 2 | 2 |
3 | 2 | 5 |
4 | 3 | 3 |
5 | 3 | 6 |
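For reference, the index-based recipe described in the answer above looks roughly like this (a sketch; it relies on .index() mapping each key-tuple to a list of row indices, as shown in the Index section):
## pick the first row index of each unique (A, B) key
index = old_table.index('A', 'B')
first_rows = sorted(rows[0] for rows in index.values())
deduped = Table({name: [old_table[name][ix] for ix in first_rows] for name in old_table.columns})
assert deduped == new_table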
You can also use groupby, as shown earlier.
LOOKUP!¶
Lookup is a special case of a search loop: Say for example you are planning a concert and want to make sure that your friends can make it home using public transport: You would have to find the first departure after the concert ends towards their home. A join would only give you a direct match on the time.
Lookup allows you "to iterate through a list of data and find the first match given a set of criteria."
Here's an example:
First we have our list of friends and their stops.
friends = Table({
"name":['Alice', 'Betty', 'Charlie', 'Dorethy', 'Edward', 'Fred'],
"stop":['Downtown-1', 'Downtown-2', 'Hillside View', 'Hillside Crescent', 'Downtown-2', 'Chicago'],
})
friends
# | name | stop |
---|---|---|
0 | Alice | Downtown-1 |
1 | Betty | Downtown-2 |
2 | Charlie | Hillside View |
3 | Dorethy | Hillside Crescent |
4 | Edward | Downtown-2 |
5 | Fred | Chicago |
Next we need a list of bus routes and their time and stops. I don't have that, so I'm making one up:
import random
random.seed(11)
table_size = 40
times = [DataTypes.time(random.randint(21, 23), random.randint(0, 59)) for i in range(table_size)]
stops = ['Stadium', 'Hillside', 'Hillside View', 'Hillside Crescent', 'Downtown-1', 'Downtown-2',
'Central station'] * 2 + [f'Random Road-{i}' for i in range(table_size)]
route = [random.choice([1, 2, 3]) for i in stops]
bus_table = Table({
"time":times,
"stop":stops[:table_size],
"route":route[:table_size],
})
bus_table.sort(mapping={'time': False})
print("Departures from Concert Hall towards ...")
bus_table[0:10]
creating sort index: 100%|██████████| 1/1 [00:00<00:00, 1459.90it/s]
join: 100%|██████████| 3/3 [00:00<00:00, 2421.65it/s]
Departures from Concert Hall towards ...
# | time | stop | route |
---|---|---|---|
0 | 21:02:00 | Random Road-6 | 2 |
1 | 21:05:00 | Hillside Crescent | 2 |
2 | 21:06:00 | Hillside | 1 |
3 | 21:25:00 | Random Road-24 | 1 |
4 | 21:29:00 | Random Road-16 | 1 |
5 | 21:32:00 | Random Road-21 | 1 |
6 | 21:33:00 | Random Road-12 | 1 |
7 | 21:36:00 | Random Road-23 | 3 |
8 | 21:38:00 | Central station | 2 |
9 | 21:38:00 | Random Road-8 | 2 |
Let's say the concert ends at 21:00 and it takes 10 minutes to get to the bus stop. The earliest departure must then be 21:10 - goodbye hugs included.
lookup_1 = friends.lookup(bus_table, (DataTypes.time(21, 10), "<=", 'time'), ('stop', "==", 'stop'))
lookup1_sorted = lookup_1.sorted(mapping={'time': False, 'name':False}, sort_mode='unix')
lookup1_sorted
100%|██████████| 6/6 [00:00<00:00, 1513.92it/s]
join: 100%|██████████| 3/3 [00:00<00:00, 2003.65it/s]
creating sort index: 100%|██████████| 2/2 [00:00<00:00, 2589.88it/s]
join: 100%|██████████| 5/5 [00:00<00:00, 2034.29it/s]
# | name | stop | time | stop_1 | route |
---|---|---|---|---|---|
0 | Fred | Chicago | None | None | None |
1 | Betty | Downtown-2 | 21:51:00 | Downtown-2 | 1 |
2 | Edward | Downtown-2 | 21:51:00 | Downtown-2 | 1 |
3 | Charlie | Hillside View | 22:19:00 | Hillside View | 2 |
4 | Alice | Downtown-1 | 23:12:00 | Downtown-1 | 3 |
5 | Dorethy | Hillside Crescent | 23:54:00 | Hillside Crescent | 1 |
Lookup's ability to use custom criteria thereby makes it far more versatile than SQL joins.
But with great power comes great responsibility.
Match¶
If you're looking to do a join where you afterwards remove the empty rows, match is the faster choice.
Here is an example.
Let's start with two tables:
materials = Table({
'bom_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
'partial_of': [1, 2, 3, 4, 5, 6, 7, 4, 6],
'sku': ['A', 'irrelevant', 'empty carton', 'pkd carton', 'empty pallet', 'pkd pallet', 'pkd irrelevant', 'ppkd carton', 'ppkd pallet'],
'material_id': [None, None, None, 3, None, 5, 3, 3, 5],
'quantity': [10, 20, 30, 40, 50, 60, 70, 80, 90]
})
## 9 is a partially packed pallet of 6
## we are looking for multiple values:
looking_for = Table({
'bom_id': [3,4,6],
'moq': [1,2,3]
})
Our goal is now to find the quantity from the materials table, based on the items in the looking_for table.
This requires two steps:
- lookup
- filter for all by dropping items that didn't match.
## step 1/2:
products_lookup = materials.lookup(looking_for, ("bom_id", "==", "bom_id"), ("partial_of", "==", "bom_id"), all=False)
products_lookup
100%|██████████| 9/9 [00:00<00:00, 3651.81it/s]
join: 100%|██████████| 2/2 [00:00<00:00, 1625.38it/s]
# | bom_id | partial_of | sku | material_id | quantity | bom_id_1 | moq |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | A | None | 10 | None | None |
1 | 2 | 2 | irrelevant | None | 20 | None | None |
2 | 3 | 3 | empty carton | None | 30 | 3 | 1 |
3 | 4 | 4 | pkd carton | 3 | 40 | 4 | 2 |
4 | 5 | 5 | empty pallet | None | 50 | None | None |
5 | 6 | 6 | pkd pallet | 5 | 60 | 6 | 3 |
6 | 7 | 7 | pkd irrelevant | 3 | 70 | None | None |
7 | 8 | 4 | ppkd carton | 3 | 80 | 4 | 2 |
8 | 9 | 6 | ppkd pallet | 5 | 90 | 6 | 3 |
## step 2/2:
products = products_lookup.all(bom_id_1=lambda x: x is not None)
products
# | bom_id | partial_of | sku | material_id | quantity | bom_id_1 | moq |
---|---|---|---|---|---|---|---|
0 | 3 | 3 | empty carton | None | 30 | 3 | 1 |
1 | 4 | 4 | pkd carton | 3 | 40 | 4 | 2 |
2 | 6 | 6 | pkd pallet | 5 | 60 | 6 | 3 |
3 | 8 | 4 | ppkd carton | 3 | 80 | 4 | 2 |
4 | 9 | 6 | ppkd pallet | 5 | 90 | 6 | 3 |
The faster way to solve this problem is to use match!
Here is the example:
products_matched = materials.match(looking_for, ("bom_id", "==", "bom_id"), ("partial_of", "==", "bom_id"))
products_matched
# | bom_id | partial_of | sku | material_id | quantity | bom_id_1 | moq |
---|---|---|---|---|---|---|---|
0 | 3 | 3 | empty carton | None | 30 | 3 | 1 |
1 | 4 | 4 | pkd carton | 3 | 40 | 4 | 2 |
2 | 6 | 6 | pkd pallet | 5 | 60 | 6 | 3 |
3 | 8 | 4 | ppkd carton | 3 | 80 | 4 | 2 |
4 | 9 | 6 | ppkd pallet | 5 | 90 | 6 | 3 |
assert products == products_matched
Are there other ways I can add data?¶
Yes - but row-based operations cause a lot of IO, so it'll work, just slower:
from tablite import Table
t = Table() # create table
t.add_columns('row','A','B','C') # add columns
The following examples are all valid and append the row (1,2,3) to the table.
t.add_rows(1, 1, 2, 3) # individual values
t.add_rows([2, 1, 2, 3]) # list of values
t.add_rows((3, 1, 2, 3)) # tuple of values
t.add_rows(*(4, 1, 2, 3)) # unpacked tuple
t.add_rows(row=5, A=1, B=2, C=3) # keyword - args
t.add_rows(**{'row': 6, 'A': 1, 'B': 2, 'C': 3}) # dict / json.
The following examples add two rows to the table
t.add_rows((7, 1, 2, 3), (8, 4, 5, 6)) # two (or more) tuples.
t.add_rows([9, 1, 2, 3], [10, 4, 5, 6]) # two or more lists
t.add_rows({'row': 11, 'A': 1, 'B': 2, 'C': 3},
{'row': 12, 'A': 4, 'B': 5, 'C': 6}) # two (or more) dicts as args.
t.add_rows(*[{'row': 13, 'A': 1, 'B': 2, 'C': 3},
{'row': 14, 'A': 1, 'B': 2, 'C': 3}]) # list of dicts.
t
# | row | A | B | C |
---|---|---|---|---|
0 | 1 | 1 | 2 | 3 |
1 | 2 | 1 | 2 | 3 |
2 | 3 | 1 | 2 | 3 |
3 | 4 | 1 | 2 | 3 |
4 | 5 | 1 | 2 | 3 |
5 | 6 | 1 | 2 | 3 |
6 | 7 | 1 | 2 | 3 |
7 | 8 | 4 | 5 | 6 |
8 | 9 | 1 | 2 | 3 |
9 | 10 | 4 | 5 | 6 |
10 | 11 | 1 | 2 | 3 |
11 | 12 | 4 | 5 | 6 |
12 | 13 | 1 | 2 | 3 |
13 | 14 | 1 | 2 | 3 |
As the row column incremented from 1 in the first of these examples and finished with row: 14, you can now see the whole table above.
Okay, great. How do I load data?¶
Easy. Use file_reader. Here's an example:
from pathlib import Path
path = Path('tests/data/book1.csv')
tx = Table.from_file(path)
tx
Collecting tasks: 'tests/data/book1.csv'
Dumping tasks: 'tests/data/book1.csv'
importing file: 100%|██████████| 1/1 [00:00<00:00, 444.08it/s]
# | a | b | c | d | e | f |
---|---|---|---|---|---|---|
0 | 1 | 0.060606061 | 0.090909091 | 0.121212121 | 0.151515152 | 0.181818182 |
1 | 2 | 0.121212121 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 |
2 | 3 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 |
3 | 4 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 |
4 | 5 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 |
5 | 6 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 |
6 | 7 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 | 62.06060606 |
... | ... | ... | ... | ... | ... | ... |
38 | 39 | 16659267088.0 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 |
39 | 40 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 |
40 | 41 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 |
41 | 42 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 |
42 | 43 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 |
43 | 44 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 |
44 | 45 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 | 17059100000000.0 |
Note that you can also add start, limit and chunk_size to the file reader. Here's an example:
path = Path('tests/data/book1.csv')
tx2 = Table.from_file(path, start=2, limit=15)
tx2
Collecting tasks: 'tests/data/book1.csv'
importing file: 100%|██████████| 1/1 [00:00<00:00, 391.22it/s]
Dumping tasks: 'tests/data/book1.csv'
# | a | b | c | d | e | f |
---|---|---|---|---|---|---|
0 | 3 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 |
1 | 4 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 |
2 | 5 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 |
3 | 6 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 |
4 | 7 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 | 62.06060606 |
5 | 8 | 7.757575758 | 15.51515152 | 31.03030303 | 62.06060606 | 124.1212121 |
6 | 9 | 15.51515152 | 31.03030303 | 62.06060606 | 124.1212121 | 248.2424242 |
7 | 10 | 31.03030303 | 62.06060606 | 124.1212121 | 248.2424242 | 496.4848485 |
8 | 11 | 62.06060606 | 124.1212121 | 248.2424242 | 496.4848485 | 992.969697 |
9 | 12 | 124.1212121 | 248.2424242 | 496.4848485 | 992.969697 | 1985.939394 |
10 | 13 | 248.2424242 | 496.4848485 | 992.969697 | 1985.939394 | 3971.878788 |
11 | 14 | 496.4848485 | 992.969697 | 1985.939394 | 3971.878788 | 7943.757576 |
12 | 15 | 992.969697 | 1985.939394 | 3971.878788 | 7943.757576 | 15887.51515 |
13 | 16 | 1985.939394 | 3971.878788 | 7943.757576 | 15887.51515 | 31775.0303 |
14 | 17 | 3971.878788 | 7943.757576 | 15887.51515 | 31775.0303 | 63550.06061 |
How good is the file_reader?
I've included all the formats in the test suite that are publicly available from the Alan Turing Institute, dateutils and Python's csv reader.
What about MM-DD-YYYY formats? Some users from the US ask why the csv reader doesn't read the month-day-year format.
The answer is simple: it's not an ISO8601 format. The US month-day-year format is a locale that may be used a lot in the US, but it isn't an international standard.
If you need to work with MM-DD-YYYY you will find that the file_reader imports the values as text (str). You can then reformat them with a custom function like:
s = "03-21-1998"
from datetime import date
f = lambda s: date(int(s[-4:]), int(s[:2]), int(s[3:5]))
f(s)
datetime.date(1998, 3, 21)
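To convert a whole column, a list comprehension does the job (a sketch; 'date' is a hypothetical column holding MM-DD-YYYY strings imported as text):
t['date'] = [f(v) for v in t['date']]  ## hypothetical column of MM-DD-YYYY strings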
Sweet. What formats are supported? Can I add my own file reader?¶
Yes! This is very good for special log files or custom json formats. Here's how you do it:
(1) Go through the existing readers in tablite.import_utils and find the closest match.
from tablite.import_utils import file_readers
for k, v in file_readers.items():
    print(k, v)
fods <function excel_reader at 0x7f36a3ef8c10>
json <function excel_reader at 0x7f36a3ef8c10>
html <function from_html at 0x7f36a3ef8b80>
hdf5 <function from_hdf5 at 0x7f36a3ef8a60>
simple <function excel_reader at 0x7f36a3ef8c10>
rst <function excel_reader at 0x7f36a3ef8c10>
mediawiki <function excel_reader at 0x7f36a3ef8c10>
xlsx <function excel_reader at 0x7f36a3ef8c10>
xls <function excel_reader at 0x7f36a3ef8c10>
xlsm <function excel_reader at 0x7f36a3ef8c10>
csv <function text_reader at 0x7f36a3ef9000>
tsv <function text_reader at 0x7f36a3ef9000>
txt <function text_reader at 0x7f36a3ef9000>
ods <function ods_reader at 0x7f36a3ef8ca0>
(2) define your new file reader
def my_magic_reader(path, **kwargs):  # define your new file reader.
    print(f"do magic with {path}")
    return
(3) add it to the list of readers.
file_readers['my_special_format'] = my_magic_reader
The file_readers are all registered in tablite.import_utils, so if you intend to extend the readers, I recommend that you start there.
Very nice. How about exporting data?¶
Just use .export, or one of the to_* methods like to_xlsx:
file = Path('example.xlsx')
tx2.to_xlsx(file)
os.remove(file)
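The same pattern applies to the other formats; for example the JSON round-trip mentioned in the overview (a short sketch):
json_str = tx2.to_json()         ## serialise the table to a json string
tx3 = Table.from_json(json_str)  ## ... and rebuild it
assert tx2 == tx3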
Cool. Does it play well with plotting packages?¶
Yes. Here's an example you can copy and paste:
from tablite import Table
t = Table({
'a':[1, 2, 8, 3, 4, 6, 5, 7, 9],
'b':[10, 100, 3, 4, 16, -1, 10, 10, 10],
})
t.sort(mapping={"a":False})
t
creating sort index: 100%|██████████| 1/1 [00:00<00:00, 1674.37it/s]
join: 100%|██████████| 2/2 [00:00<00:00, 1701.89it/s]
# | a | b |
---|---|---|
0 | 1 | 10 |
1 | 2 | 100 |
2 | 3 | 4 |
3 | 4 | 16 |
4 | 5 | 10 |
5 | 6 | -1 |
6 | 7 | 10 |
7 | 8 | 3 |
8 | 9 | 10 |
%pip install matplotlib -q
Note: you may need to restart the kernel to use updated packages.
import matplotlib.pyplot as plt
plt.plot(t['a'], t['b'])
plt.ylabel('Hello Figure')
plt.show()
I like sql. Can tablite understand SQL?¶
Almost. You can use table.to_sql and tablite will return ANSI-92 compliant SQL.
You can also create a table using Table.from_sql and tablite will consume ANSI-92 compliant SQL.
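A sketch of the round-trip, assuming from_sql accepts the SQL text that to_sql produces (exact signatures may differ):
sql = t.to_sql()              ## ANSI-92 compliant SQL for the table
t_back = Table.from_sql(sql)  ## consume ANSI-92 compliant SQL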
But what do I do if I'm about to run out of memory?¶
You won't. Every tablite table is backed by disk. The memory footprint of a table is only the metadata required to know the relationships between variable names and the data structures.
Let's do a comparison:
## Let's monitor the memory and record the observations into a table!
import psutil, os, gc
from time import process_time,sleep
process = psutil.Process(os.getpid())
def mem_time():  # go and check the task manager's memory usage.
    return process.memory_info().rss, process_time()
digits = 1_000_000
records = Table({'method':[], 'memory':[], 'time':[]})
The row based format: 1 million 10-tuples
before, start = mem_time()
L = [tuple([11 for _ in range(10)]) for _ in range(digits)]
after, end = mem_time()
del L
gc.collect()
records.add_rows(*('1e6 lists w. 10 integers', after - before, round(end-start,4)))
records
# | method | memory | time |
---|---|---|---|
0 | 1e6 lists w. 10 integers | 119054336 | 0.5045 |
The column based format: 10 columns with 1M values:
before, start = mem_time()
L = [[11 for i2 in range(digits)] for i1 in range(10)]
after,end = mem_time()
del L
gc.collect()
records.add_rows(('10 lists with 1e6 integers', after - before, round(end-start,4)))
We've thereby saved roughly 44 Mb by avoiding the overhead from managing 1 million lists.
Q: But why didn't I just use an array? It would have an even lower memory footprint.
A: First, arrays don't handle Nones, and we get those frequently in dirty csv data.
Second, Table needs even less memory.
Let's try with an array:
import array
before, start = mem_time()
L = [array.array('i', [11 for _ in range(digits)]) for _ in range(10)]
after,end = mem_time()
del L
gc.collect()
records.add_rows(('10 lists with 1e6 integers in arrays', after - before, round(end-start,4)))
records
# | method | memory | time |
---|---|---|---|
0 | 1e6 lists w. 10 integers | 119054336 | 0.5045 |
1 | 10 lists with 1e6 integers | 75276288 | 0.1906 |
2 | 10 lists with 1e6 integers in arrays | 39833600 | 0.3633 |
Finally let's use a tablite.Table:
before,start = mem_time()
t = Table(columns={str(i1): [11 for i2 in range(digits)] for i1 in range(10)})
after,end = mem_time()
records.add_rows(('Table with 10 columns with 1e6 integers', after - before, round(end-start,4)))
before,start = mem_time()
t2 = t.copy()
after,end = mem_time()
records.add_rows(('2 Tables with 10 columns with 1e6 integers each', after - before, round(end-start,4)))
## Let's show it, so we know nobody's cheating:
t2
# | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
1 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
2 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
3 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
4 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
5 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
6 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
999,993 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
999,994 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
999,995 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
999,996 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
999,997 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
999,998 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
999,999 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
records
# | method | memory | time |
---|---|---|---|
0 | 1e6 lists w. 10 integers | 119054336 | 0.5045 |
1 | 10 lists with 1e6 integers | 75276288 | 0.1906 |
2 | 10 lists with 1e6 integers in arrays | 39833600 | 0.3633 |
3 | Table with 10 columns with 1e6 integers | 0 | 1.9569 |
4 | 2 Tables with 10 columns with 1e6 integers each | 0 | 0.0001 |
Conclusion: whilst the common worst case (1M lists with 10 integers) takes up about 119 Mb of RAM, Tablite's tables vanish in the noise of memory measurement.
Conclusions¶
This concludes the mega-tutorial to tablite. There's nothing more to it.
But oh boy it'll save a lot of time.
Here's a summary of features:
- Everything a list can do.
- Import csv*, fods, json, html, simple, rst, mediawiki, xlsx, xls, xlsm, tsv, txt and ods using Table.from_file(...)
- Iterate over rows or columns.
- Create multikey index, sort, and use filter, any and all to select. Perform lookup across tables, including using custom functions.
- Perform multikey joins with other tables.
- Perform groupby and reorganise data as a pivot table with max, min, sum, first, last, count, unique, average, standard deviation, median and mode.
- Update tables with += which automatically sorts out the columns - even if they're not in perfect order.
FAQ¶
Question | Answer |
---|---|
I'm not in a notebook. Is there a nice way to view tables? | Yes. table.show() prints the ascii version |
I'm looking for the equivalent to apply in pandas. | Just use list comprehensions: table[column] = [f(x) for x in table[column]] |
What about map? | Just use the python function: mapping = map(f, table[column name]) |
Is there a where function? | It's called any or all, like in python: table.any(column_name > 0). |
I like sql and sqlite. Can I use sql? | Yes. Calling table.to_sql() returns an ANSI-92 SQL compliant table definition. You can use this in any SQL compliant engine. |
Sometimes I need to clean up data with datetimes. Is there any tool to help with that? | Yes. Look at DataTypes: DataTypes.round(value, multiple) allows rounding of datetimes. |
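Going only by the signature DataTypes.round(value, multiple) quoted above, a rounding call might look like this (a sketch; the exact semantics are in the DataTypes docs):
from datetime import datetime, timedelta
from tablite import DataTypes
DataTypes.round(datetime(2023, 6, 1, 11, 7, 39), timedelta(minutes=15))  ## round to a 15-minute multiple (assumed)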
Coming to Tablite from Pandas¶
If you're coming to Tablite from Pandas, you will notice some differences.
Here's the ultra-short comparison to the Pandas documentation called 10 minutes intro to pandas.
The tutorials provide the generic overview:
Some key differences
topic | Tablite |
---|---|
Viewing data | Just use table.show() in print outs, or if you're in a jupyter notebook just use the variable name table |
Selection | Slicing works both on columns and rows, and you can filter using any or all: table['A','B', 2:30:3].any(A=lambda x:x>3) To copy a table use: t2 = t.copy() This is a very fast deep copy with no memory overhead, as tablite's memory manager keeps track of the data. |
Missing data | Tablite uses mixed column format for any format that isn't uniform. To get rid of rows with Nones and np.nans use any: table.drop_na(None, np.nan) Alternatively you can use replace: table.replace(None,5), following the syntax: table.replace_missing_values(sources, target) |
Operations | Descriptive statistics are on a column-by-column basis: table['a'].statistics() The pandas function df.apply doesn't exist in tablite; use a list comprehension instead. For example: df.apply(np.cumsum) is just np.cumsum(t['A']) "histogramming" in tablite is per column: table['a'].histogram() String methods? Just use a list comprehension: table['A', 'B'].any(A=lambda x: "hello" in x, B=lambda x: "world" in x) |
Merge | Concatenation: Just use + or += as in t1 = t2 + t3; t1 += t4. If the columns are out of order, tablite will sort the headers according to the order in the first table. If you're worried about header mismatch, use t1.stack(t2). Joins are ANSI92 compliant: t1.join(t2, <...args...>, join_type=...). |
Grouping | Tablite supports multikey groupby using from tablite import GroupBy as gb, then table.groupby(keys, functions) |
Reshaping | To reshape a table use transpose. To perform pivot-table-like operations, use: table.pivot(rows, columns, functions) Subtotals aside, tablite will give you everything Excel's pivot table can do. |
Time series | To convert time series use a list comprehension: t1['GMT'] = [timedelta(hours=1) + v for v in t1['date']] To generate a date range use: from tablite import date_range then t['date'] = date_range(start=2022/1/1, stop=2023/1/1, step=timedelta(days=1)) |
Categorical | Pandas only seems to use this for sorting and grouping. Tablite table has .sort , .groupby and .pivot to achieve the same task. |
Plotting | Import your favorite plotting package and feed it the values, such as: import matplotlib.pyplot as plt plt.plot(t['a'],t['b']) plt.show() |
Import/Export | Tablite supports the same import/export options as pandas. Tablite pegs the free memory before IO and can therefore process larger-than-RAM files. Tablite also guesses the datatypes for all ISO formats and uses multiprocessing, and may therefore be faster. Should you want to inspect how guess works, use from tablite.tools import guess and try the function out. |
Gotchas | None really. Should you come across something non-pythonic, then please post it on the issue list. |
Pandas also permits the usage of namedtuples, which are unpacked upon entry.
from collections import namedtuple
import pandas as pd

Point = namedtuple("Point", "x y")
points = [Point(0, 0), Point(0, 3)]
pd.DataFrame(points)
Doing that in tablite is a bit different. To unpack the named tuple, you should do so explicitly:
t = Table({'x': [p.x for p in points], 'y': [p.y for p in points]})
However should you want to keep the points as namedtuple, you can do so in tablite:
t = Table()
t['points'] = points
Tablite will store a serialised version of the points, so your memory overhead will be close to zero.