Tutorial
Tablite¶
Introduction¶
Tablite fills the data-science space for incremental data processing where:
- Datasets are larger than memory.
- You don't want to worry about datatypes.
Tablite thereby competes with:
- Pandas, but saves the memory overhead.
- Numpy, but spares you from worrying about lower-level datatypes.
- SQLite, by sheer speed.
- Polars, by working beyond RAM.
- Other libraries for data cleaning, thanks to tablite's powerful datatypes module.
Install: pip install tablite
Usage: >>> from tablite import Table
Upgrade: pip install tablite --no-cache --upgrade
Overview¶
(Version 2023.6.0 and later. For older versions see this)
- Tablite handles all Python datatypes: str, float, bool, int, date, datetime, time, timedelta and None.
- you can select:
  - all rows in a column as table['A']
  - rows across all columns as table[4:8]
  - or a slice as table['A', 'B', slice(4,8)].
- you can update with table['A'][2] = new_value
- you can store or send data using json, by:
  - dumping to json: json_str = table.to_json(), or
  - loading it with Table.from_json(json_str).
- you can iterate over rows using for row in Table.rows.
- you can ask column_xyz in Table.columns?
- load from files with new_table = Table.from_file('this.csv'), which has automatic datatype detection
- perform inner, outer & left sql joins between tables as simply as table_1.inner_join(table2, keys=['A', 'B'])
- summarise using table.groupby( ... )
- create pivot tables using groupby.pivot( ... )
- perform multi-criteria lookup in tables using table1.lookup(table2, criteria=.....
- and of course a large selection of tools in from tablite.tools import *
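Taken together, the selection and update operations listed above look like this (a minimal sketch using a small throw-away table):
from tablite import Table
t = Table({'A': [1, 2, 3, 4, 5, 6, 7, 8], 'B': list('abcdefgh')})
col = t['A']                    ## all rows in a column
rows = t[4:8]                   ## rows across all columns
sub = t['A', 'B', slice(4,8)]   ## a column/row slice
t['A'][2] = 30                  ## update a single value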
Examples¶
Here are some examples:
from tablite import Table
## Creating a tablite table is as simple as populating a dictionary:
t = Table({'A':[1,2,3], 'B':['a','b','c']})
## In this notebook we can show tables in the HTML style:
t
# | A | B |
---|---|---|
0 | 1 | a |
1 | 2 | b |
2 | 3 | c |
## or the ascii style:
t.show()
+==+=+=+
|# |A|B|
+--+-+-+
| 0|1|a|
| 1|2|b|
| 2|3|c|
+==+=+=+
## or if you'd like to inspect the table, use:
print(str(t))
Table(2 columns, 3 rows)
## You can also add all columns at once (slower) if you prefer.
t2 = Table(headers=('A','B'), rows=((1,'a'),(2,'b'),(3,'c')))
assert t==t2
## or load data:
t3 = Table.from_file('tests/data/book1.csv')
## to view any table in the notebook just let jupyter show the table. If you're using the terminal use .show().
## Note that show gives either first and last 7 rows or the whole table if it is less than 20 rows.
t3
Collecting tasks: 'tests/data/book1.csv'
Dumping tasks: 'tests/data/book1.csv'
importing file: 100%|██████████| 1/1 [00:00<00:00, 487.82it/s]
# | a | b | c | d | e | f |
---|---|---|---|---|---|---|
0 | 1 | 0.060606061 | 0.090909091 | 0.121212121 | 0.151515152 | 0.181818182 |
1 | 2 | 0.121212121 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 |
2 | 3 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 |
3 | 4 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 |
4 | 5 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 |
5 | 6 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 |
6 | 7 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 | 62.06060606 |
... | ... | ... | ... | ... | ... | ... |
38 | 39 | 16659267088.0 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 |
39 | 40 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 |
40 | 41 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 |
41 | 42 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 |
42 | 43 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 |
43 | 44 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 |
44 | 45 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 | 17059100000000.0 |
## should you however want to select the headers instead of importing everything
## (which may be time-consuming), simply use get_headers(path)
from tablite.tools import get_headers
from pathlib import Path
path = Path('tests/data/book1.csv')
sample = get_headers(path, linecount=5)
print(f"sample is of type {type(sample)} and has the following entries:")
for k, v in sample.items():
    print(k)
    if isinstance(v, list):
        for r in sample[k]:
            print("\t", r)
sample is of type <class 'dict'> and has the following entries:
delimiter
book1.csv
	 ['a', 'b', 'c', 'd', 'e', 'f']
	 ['1', '0.060606061', '0.090909091', '0.121212121', '0.151515152', '0.181818182']
	 ['2', '0.121212121', '0.242424242', '0.484848485', '0.96969697', '1.939393939']
	 ['3', '0.242424242', '0.484848485', '0.96969697', '1.939393939', '3.878787879']
	 ['4', '0.484848485', '0.96969697', '1.939393939', '3.878787879', '7.757575758']
	 ['5', '0.96969697', '1.939393939', '3.878787879', '7.757575758', '15.51515152']
## to extend a table by adding columns, use t[new] = [new values]
t['C'] = [4,5,6]
## but make sure the column has the same length as the rest of the table!
t
# | A | B | C |
---|---|---|---|
0 | 1 | a | 4 |
1 | 2 | b | 5 |
2 | 3 | c | 6 |
## should you want to mix datatypes, tablite will not complain:
from datetime import datetime, date,time,timedelta
import numpy as np
## What you put in ...
t4 = Table()
t4['mixed'] = [
-1,0,1, # regular integers
-12345678909876543211234567890987654321, # very very large integer
None,np.nan, # null values
"one", "", # strings
True,False, # booleans
float('inf'), 0.01, # floats
date(2000,1,1), # date
datetime(2002,2,3,23,0,4,6660), # datetime
time(12,12,12), # time
timedelta(days=3, seconds=5678) # timedelta
]
## ... is exactly what you get out:
t4
# | mixed |
---|---|
0 | -1 |
1 | 0 |
2 | 1 |
3 | -12345678909876543211234567890987654321 |
4 | None |
5 | nan |
6 | one |
7 | |
8 | True |
9 | False |
10 | inf |
11 | 0.01 |
12 | 2000-01-01 |
13 | 2002-02-03 23:00:04.006660 |
14 | 12:12:12 |
15 | 3 days, 1:34:38 |
## likewise, when you claim the values back as a python list:
for item in list(t4['mixed']):
print(item)
-1
0
1
-12345678909876543211234567890987654321
None
nan
one

True
False
inf
0.01
2000-01-01
2002-02-03 23:00:04.006660
12:12:12
3 days, 1:34:38
The column itself (__repr__) shows us the pid, the file location and the entries, so you know exactly what you're working with.
t4['mixed']
Column(/tmp/tablite-tmp/pid-54911, [-1 0 1 -12345678909876543211234567890987654321 None nan 'one' '' True False inf 0.01 datetime.date(2000, 1, 1) datetime.datetime(2002, 2, 3, 23, 0, 4, 6660) datetime.time(12, 12, 12) datetime.timedelta(days=3, seconds=5678)])
## to view the datatypes in a column, use Column.types()
type_dict = t4['mixed'].types()
for k,v in type_dict.items():
print(k,v)
<class 'int'> 4
<class 'NoneType'> 1
<class 'float'> 3
<class 'str'> 2
<class 'bool'> 2
<class 'datetime.date'> 1
<class 'datetime.datetime'> 1
<class 'datetime.time'> 1
<class 'datetime.timedelta'> 1
## You may have noticed that all datatypes in t3 were identified as floats, despite originating from a text file.
## This is because tablite guesses the most probable datatype using the `.guess` function on each column.
## You can use the .guess function like this:
from tablite import DataTypes
t3['a'] = DataTypes.guess(t3['a'])
## You can also convert the datatype using a list comprehension
t3['b'] = [float(v) for v in t3['b']]
t3
# | a | b | c | d | e | f |
---|---|---|---|---|---|---|
0 | 1 | 0.060606061 | 0.090909091 | 0.121212121 | 0.151515152 | 0.181818182 |
1 | 2 | 0.121212121 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 |
2 | 3 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 |
3 | 4 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 |
4 | 5 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 |
5 | 6 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 |
6 | 7 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 | 62.06060606 |
... | ... | ... | ... | ... | ... | ... |
38 | 39 | 16659267088.0 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 |
39 | 40 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 |
40 | 41 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 |
41 | 42 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 |
42 | 43 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 |
43 | 44 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 |
44 | 45 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 | 17059100000000.0 |
API Examples¶
In the following sections, examples are given of the Tablite API's power features:
- Iteration
- Append
- Sort
- Filter
- Index
- Search All
- Search Any
- Lookup
- Join (inner, outer, left)
- GroupBy
- Pivot table
ITERATION!¶
Iteration supports for loops and list comprehension at the speed of light:
Just use [r for r in table.rows], or:
for row in table.rows:
    row ...
Here's a more practical use case:
(1) Imagine a table with columns a,b,c,d,e (all integers) like this:
t = Table()
for column_name in 'abcde':
    t[column_name] = [i for i in range(5)]
(2) we want to add two new columns using the functions:
def f1(a, b, c):
    return a + b + c + 1
def f2(b, c, d):
    return b * c * d
(3) and we want to compute two new columns f and g:
t.add_columns('f', 'g')
(4) we can now iterate over the selected columns and append the computed values to the two new columns:
f, g = [], []
for row in t['a', 'b', 'c', 'd'].rows:
    a, b, c, d = row
    f.append(f1(a, b, c))
    g.append(f2(b, c, d))
t['f'] = f
t['g'] = g
assert len(t) == 5
assert list(t.columns) == list('abcdefg')
t
# | a | b | c | d | e | f | g |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 1 | 1 | 1 | 1 | 4 | 1 |
2 | 2 | 2 | 2 | 2 | 2 | 7 | 8 |
3 | 3 | 3 | 3 | 3 | 3 | 10 | 27 |
4 | 4 | 4 | 4 | 4 | 4 | 13 | 64 |
Take note that if your dataset is asymmetric, a warning will be shown:
asymmetric_table = Table({'a':[1,2,3], 'b':[1,2]})
for row in asymmetric_table.rows:
    print(row)
## warning at the bottom ---v
[1, 1]
[2, 2]
[3, None]
/home/bjorn/github/tablite/tablite/base.py:1188: UserWarning: Column b has length 2 / 3. None will appear as fill value.
  warnings.warn(f"Column {name} has length {len(column)} / {n_max}. None will appear as fill value.")
Create Index / Indices¶
Index supports multi-key indexing using args such as: index = table.index('B','C')
.
Here's an example:
table7 = Table(columns={
'A': [1,1,2,2,3,4],
'B': [1,1,2,2,30,40],
'C': [-1,-2,-3,-4,-5,-6]
})
index = table7.index('A', 'B')
for k, v in index.items():
print("key", k, "indices", v)
key (1, 1) indices [0, 1]
key (2, 2) indices [2, 3]
key (3, 30) indices [4]
key (4, 40) indices [5]
The keys are created for each unique column-key-pair, and the value is the list of row indices where the key is found. To fetch all rows for key (2,2), we can use:
for ix, row in enumerate(table7.rows):
    if ix in index[(2,2)]:
        print(row)
[2, 2, -3]
[2, 2, -4]
APPEND¶
## to append one table to another, use + or +=
print('length before:', len(t3)) # length before: 45
t5 = t3 + t3
print('length after +', len(t5)) # length after + 90
t5 += t3
print('length after +=', len(t5)) # length after += 135
## if you need a lot of numbers for a test, you can repeat a table using * and *=
t5 *= 1_000
print('length after *=', len(t5)) # length after *= 135000
length before: 45
length after + 90
length after += 135
length after *= 135000
t5
# | a | b | c | d | e | f |
---|---|---|---|---|---|---|
0 | 1 | 0.060606061 | 0.090909091 | 0.121212121 | 0.151515152 | 0.181818182 |
1 | 2 | 0.121212121 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 |
2 | 3 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 |
3 | 4 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 |
4 | 5 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 |
5 | 6 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 |
6 | 7 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 | 62.06060606 |
... | ... | ... | ... | ... | ... | ... |
134,993 | 39 | 16659267088.0 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 |
134,994 | 40 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 |
134,995 | 41 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 |
134,996 | 42 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 |
134,997 | 43 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 |
134,998 | 44 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 |
134,999 | 45 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 | 17059100000000.0 |
## if you are in doubt whether your tables have the same columns, you can use .stack(other)
assert t.columns != t2.columns # compares list of column names.
t6 = t.stack(t2)
t6
# | a | b | c | d | e | f | g | A | B |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | None | None |
1 | 1 | 1 | 1 | 1 | 1 | 4 | 1 | None | None |
2 | 2 | 2 | 2 | 2 | 2 | 7 | 8 | None | None |
3 | 3 | 3 | 3 | 3 | 3 | 10 | 27 | None | None |
4 | 4 | 4 | 4 | 4 | 4 | 13 | 64 | None | None |
5 | None | None | None | None | None | None | None | 1 | a |
6 | None | None | None | None | None | None | None | 2 | b |
7 | None | None | None | None | None | None | None | 3 | c |
## As you can see above, t6 is padded with None wherever either table was missing the columns.
## if you need a more detailed view of the columns you can iterate:
for name in t.columns:
    col_from_t = t[name]
    if name in t2.columns:
        col_from_t2 = t2[name]
        print(name, col_from_t == col_from_t2)
    else:
        print(name, "not in t2")
a not in t2
b not in t2
c not in t2
d not in t2
e not in t2
f not in t2
g not in t2
## to make a copy of a table, use table.copy()
t3_copy = t3.copy()
## you can also perform multi-criteria selections using getitem [ ... ]
t3_slice = t3['a','b','d', 5:25:5]
t3_slice
# | a | b | d |
---|---|---|---|
0 | 6 | 1.939393939 | 7.757575758 |
1 | 11 | 62.06060606 | 248.2424242 |
2 | 16 | 1985.939394 | 7943.757576 |
3 | 21 | 63550.06061 | 254200.2424 |
## deleting items also works the same way:
del t3_slice[1:3] # deletes the 2nd & 3rd row
t3_slice
# | a | b | d |
---|---|---|---|
0 | 6 | 1.939393939 | 7.757575758 |
1 | 21 | 63550.06061 | 254200.2424 |
## to wipe a table, use .clear:
t3_slice.clear()
t3_slice
SAVE¶
## tablite uses .npy for storage because it is fast.
## this means you can make a table persistent using .save
local_file = Path("local_file.tpz")
t5.save(local_file)
old_t5 = Table.load(local_file)
print("the t5 table had", len(old_t5), "rows") # the t5 table had 135000 rows
del old_t5 # only removes the in-memory object
print("old_t5 still exists?", local_file.exists())
print("path:", local_file)
import os
os.remove(local_file)
loading 'local_file.tpz' file: 55%|█████▍ | 9851/18000 [00:02<00:01, 4386.96it/s]
loading 'local_file.tpz' file: 100%|██████████| 18000/18000 [00:04<00:00, 4417.27it/s]
the t5 table had 135000 rows
old_t5 still exists? True
path: local_file.tpz
If you want to save a table from one session to another, use save=True. This tells the garbage collector to leave the tablite Table on disk, so you can load it again without changing your code.
For example:
The first time you run t = Table.import_file(....big.csv) it may take a minute or two.
If you then set t.save=True and restart python, the second time you run t = Table.import_file(....big.csv) it will take a few milliseconds instead of minutes.
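A minimal sketch of that workflow (assuming a hypothetical big.csv; import_file is the same call as above):
from tablite import Table
t = Table.import_file('big.csv')  ## first run: full import, may take minutes
t.save = True                     ## tell the garbage collector to leave the table on disk
## ... restart python ...
t = Table.import_file('big.csv')  ## second run: loads in milliseconds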
FILTER!¶
unfiltered = Table({'a':[1,2,3,4], 'b':[10,20,30,40]})
true,false = unfiltered.filter(
[
{"column1": 'a', "criteria":">=", 'value2':3}
], filter_type='all'
)
true
# | a | b |
---|---|---|
0 | 3 | 30 |
1 | 4 | 40 |
false.show() # using show here to show that terminal users can have a nice view too.
+==+=+==+
|# |a|b |
+--+-+--+
| 0|1|10|
| 1|2|20|
+==+=+==+
Any! All?¶
Any and All are cousins of the filter. They're there so you can use them in the same way as you'd use any and all in python - as boolean evaluators:
ty = Table({'a':[1,2,3,4],'b': [10,20,30,40]})
## typical python
any(i > 3 for i in ty['a'])
True
## in tablite you can do:
any( ty.any(**{'a':lambda x:x>3}).rows )
True
## if you have multiple criteria this also works:
all( ty.all(**{'a': lambda x:x>=2, 'b': lambda x:x<=30}).rows )
True
## or this if you want to see the table.
ty.all(a=lambda x:x>2, b=lambda x:x<=30)
# | a | b |
---|---|---|
0 | 3 | 30 |
## As `all` and `any` return tables, this also means that you can chain operations:
ty.any(a=lambda x:x>2).any(b=30)
# | a | b |
---|---|---|
0 | 3 | 30 |
SORT!¶
table = Table({
'A':[ 1, None, 8, 3, 4, 6, 5, 7, 9],
'B':[10,'100', 1, 1, 1, 1, 10, 10, 10],
'C':[ 0, 1, 0, 1, 0, 1, 0, 1, 0],
})
table
# | A | B | C |
---|---|---|---|
0 | 1 | 10 | 0 |
1 | None | 100 | 1 |
2 | 8 | 1 | 0 |
3 | 3 | 1 | 1 |
4 | 4 | 1 | 0 |
5 | 6 | 1 | 1 |
6 | 5 | 10 | 0 |
7 | 7 | 10 | 1 |
8 | 9 | 10 | 0 |
sort_order = {'B': False, 'C': False, 'A': False}
assert not table.is_sorted(mapping=sort_order)
sorted_table = table.sort(mapping=sort_order)
sorted_table
creating sort index: 100%|██████████| 3/3 [00:00<00:00, 2719.45it/s]
creating sort index: 100%|██████████| 3/3 [00:00<00:00, 3434.20it/s]
join: 100%|██████████| 3/3 [00:00<00:00, 1902.47it/s]
Sort is reasonably effective, as it uses multiprocessing above a million fields.
Hint: You can set this limit in tablite.config, like this:
from tablite.config import Config
print(f"multiprocessing is used above {Config.SINGLE_PROCESSING_LIMIT:,} fields")
multiprocessing is used above 1,000,000 fields
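If your workload favours a different threshold, the limit can be set before sorting (a sketch; 2,000,000 is just an arbitrary example value):
from tablite.config import Config
Config.SINGLE_PROCESSING_LIMIT = 2_000_000  ## stay single-process below 2 million fields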
import math
n = math.ceil(1_000_000 / (9*3))
table = Table({
'A':[ 1, None, 8, 3, 4, 6, 5, 7, 9]*n,
'B':[10,'100', 1, 1, 1, 1, 10, 10, 10]*n,
'C':[ 0, 1, 0, 1, 0, 1, 0, 1, 0]*n,
})
table
# | A | B | C |
---|---|---|---|
0 | 1 | 10 | 0 |
1 | None | 100 | 1 |
2 | 8 | 1 | 0 |
3 | 3 | 1 | 1 |
4 | 4 | 1 | 0 |
5 | 6 | 1 | 1 |
6 | 5 | 10 | 0 |
... | ... | ... | ... |
333,335 | 8 | 1 | 0 |
333,336 | 3 | 1 | 1 |
333,337 | 4 | 1 | 0 |
333,338 | 6 | 1 | 1 |
333,339 | 5 | 10 | 0 |
333,340 | 7 | 10 | 1 |
333,341 | 9 | 10 | 0 |
import time as cputime
start = cputime.time()
sort_order = {'B': False, 'C': False, 'A': False}
sorted_table = table.sort(mapping=sort_order) # sorts 1M values.
print("table sorting took ", round(cputime.time() - start,3), "secs")
sorted_table
creating sort index: 100%|██████████| 3/3 [00:00<00:00, 4.20it/s]
join: 100%|██████████| 3/3 [00:00<00:00, 18.17it/s]
table sorting took 0.913 secs
GROUPBY !¶
n = math.ceil(1_000_000 / (9*3))
table = Table({
'A':[ 1, None, 8, 3, 4, 6, 5, 7, 9]*n,
'B':[10,'100', 1, 1, 1, 1, 10, 10, 10]*n,
'C':[ 0, 1, 0, 1, 0, 1, 0, 1, 0]*n,
})
table
# | A | B | C |
---|---|---|---|
0 | 1 | 10 | 0 |
1 | None | 100 | 1 |
2 | 8 | 1 | 0 |
3 | 3 | 1 | 1 |
4 | 4 | 1 | 0 |
5 | 6 | 1 | 1 |
6 | 5 | 10 | 0 |
... | ... | ... | ... |
333,335 | 8 | 1 | 0 |
333,336 | 3 | 1 | 1 |
333,337 | 4 | 1 | 0 |
333,338 | 6 | 1 | 1 |
333,339 | 5 | 10 | 0 |
333,340 | 7 | 10 | 1 |
333,341 | 9 | 10 | 0 |
from tablite import GroupBy as gb
grpby = table.groupby(keys=['C', 'B'], functions=[('A', gb.count)])
grpby
groupby: 100%|██████████| 333342/333342 [00:00<00:00, 427322.50it/s]
# | C | B | Count(A) |
---|---|---|---|
0 | 0 | 10 | 111114 |
1 | 1 | 100 | 37038 |
2 | 0 | 1 | 74076 |
3 | 1 | 1 | 74076 |
4 | 1 | 10 | 37038 |
Here is the list of groupby functions:
class GroupBy(object):
    max = Max  # shortcuts to avoid having to type a long list of imports.
    min = Min
    sum = Sum
    product = Product
    first = First
    last = Last
    count = Count
    count_unique = CountUnique
    avg = Average
    stdev = StandardDeviation
    median = Median
    mode = Mode
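Several aggregates can be computed in one pass by listing more (column, function) pairs. A short sketch reusing the table and the gb alias from above:
grp = table.groupby(keys=['C'], functions=[('A', gb.count), ('A', gb.count_unique), ('B', gb.count_unique)])
grp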
Did I say pivot table? Yes.¶
Pivot Table is included in the groupby functionality - so yes - you can pivot the groupby on any column that is used for grouping. Here's a simple example:
t = Table({
'A':[1, 1, 2, 2, 3, 3] * 2,
'B':[1, 2, 3, 4, 5, 6] * 2,
'C':[6, 5, 4, 3, 2, 1] * 2,
})
t
# | A | B | C |
---|---|---|---|
0 | 1 | 1 | 6 |
1 | 1 | 2 | 5 |
2 | 2 | 3 | 4 |
3 | 2 | 4 | 3 |
4 | 3 | 5 | 2 |
5 | 3 | 6 | 1 |
6 | 1 | 1 | 6 |
7 | 1 | 2 | 5 |
8 | 2 | 3 | 4 |
9 | 2 | 4 | 3 |
10 | 3 | 5 | 2 |
11 | 3 | 6 | 1 |
t2 = t.pivot(rows=['C'], columns=['A'], functions=[('B', gb.sum), ('B', gb.count)], values_as_rows=False)
t2
pivot: 100%|██████████| 14/14 [00:00<00:00, 3643.83it/s]
# | C | Sum(B,A=1) | Count(B,A=1) | Sum(B,A=2) | Count(B,A=2) | Sum(B,A=3) | Count(B,A=3) |
---|---|---|---|---|---|---|---|
0 | 6 | 2 | 2 | None | None | None | None |
1 | 5 | 4 | 2 | None | None | None | None |
2 | 4 | None | None | 6 | 2 | None | None |
3 | 3 | None | None | 8 | 2 | None | None |
4 | 2 | None | None | None | None | 10 | 2 |
5 | 1 | None | None | None | None | 12 | 2 |
JOIN!¶
numbers = Table()
numbers.add_column('number', data=[ 1, 2, 3, 4, None])
numbers.add_column('colour', data=['black', 'blue', 'white', 'white', 'blue'])
letters = Table()
letters.add_column('letter', data=[ 'a', 'b', 'c', 'd', None])
letters.add_column('color', data=['blue', 'white', 'orange', 'white', 'blue'])
## left join
## SELECT number, letter FROM numbers LEFT JOIN letters ON numbers.colour == letters.color
left_join = numbers.left_join(letters, left_keys=['colour'], right_keys=['color'], left_columns=['number'], right_columns=['letter'])
left_join
join: 100%|██████████| 2/2 [00:00<00:00, 1221.94it/s]
# | number | letter |
---|---|---|
0 | 1 | None |
1 | 2 | a |
2 | 2 | None |
3 | None | a |
4 | None | None |
5 | 3 | b |
6 | 3 | d |
7 | 4 | b |
8 | 4 | d |
## inner join
## SELECT number, letter FROM numbers JOIN letters ON numbers.colour == letters.color
inner_join = numbers.inner_join(letters, left_keys=['colour'], right_keys=['color'], left_columns=['number'], right_columns=['letter'])
inner_join
join: 100%|██████████| 2/2 [00:00<00:00, 1121.77it/s]
# | number | letter |
---|---|---|
0 | 2 | a |
1 | 2 | None |
2 | None | a |
3 | None | None |
4 | 3 | b |
5 | 3 | d |
6 | 4 | b |
7 | 4 | d |
## outer join
## SELECT number, letter FROM numbers OUTER JOIN letters ON numbers.colour == letters.color
outer_join = numbers.outer_join(letters, left_keys=['colour'], right_keys=['color'], left_columns=['number'], right_columns=['letter'])
outer_join
join: 100%|██████████| 2/2 [00:00<00:00, 1585.15it/s]
# | number | letter |
---|---|---|
0 | 1 | None |
1 | 2 | a |
2 | 2 | None |
3 | None | a |
4 | None | None |
5 | 3 | b |
6 | 3 | d |
7 | 4 | b |
8 | 4 | d |
9 | None | c |
Q: But ...I think there's a bug in the join...
A: Venn diagrams do not explain joins.
A Venn diagram is a widely-used diagram style that shows the logical relation between sets, popularised by John Venn in the 1880s. The diagrams are used to teach elementary set theory, and to illustrate simple set relationships
source: en.wikipedia.org
Joins operate over rows and when there are duplicate rows, these will be replicated in the output. Many beginners are surprised by this, because they didn't read the SQL standard.
Q: So what do I do?
A: If you want to get rid of duplicates using tablite, use the index functionality across all columns and pick the first row from each index. Here's the recipe that starts with plenty of duplicates:
old_table = Table({
'A':[1,1,1,2,2,2,3,3,3],
'B':[1,1,4,2,2,5,3,3,6],
})
old_table
# | A | B |
---|---|---|
0 | 1 | 1 |
1 | 1 | 1 |
2 | 1 | 4 |
3 | 2 | 2 |
4 | 2 | 2 |
5 | 2 | 5 |
6 | 3 | 3 |
7 | 3 | 3 |
8 | 3 | 6 |
## CREATE TABLE OF UNIQUE ENTRIES (a.k.a. DEDUPLICATE)
new_table = old_table.drop_duplicates()
new_table
9it [00:00, 11329.15it/s]
join: 100%|██████████| 2/2 [00:00<00:00, 1819.26it/s]
# | A | B |
---|---|---|
0 | 1 | 1 |
1 | 1 | 4 |
2 | 2 | 2 |
3 | 2 | 5 |
4 | 3 | 3 |
5 | 3 | 6 |
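For reference, the index-based recipe described in the answer above looks roughly like this (a sketch; it relies on .index() mapping each key-tuple to a list of row indices, as shown in the Index section):
## pick the first row index of each unique (A, B) key
index = old_table.index('A', 'B')
first_rows = sorted(rows[0] for rows in index.values())
deduped = Table({name: [old_table[name][ix] for ix in first_rows] for name in old_table.columns})
assert deduped == new_table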
You can also use groupby, as shown earlier.
LOOKUP!¶
Lookup is a special case of a search loop: Say for example you are planning a concert and want to make sure that your friends can make it home using public transport: You would have to find the first departure after the concert ends towards their home. A join would only give you a direct match on the time.
Lookup allows you "to iterate through a list of data and find the first match given a set of criteria."
Here's an example:
First we have our list of friends and their stops.
friends = Table({
"name":['Alice', 'Betty', 'Charlie', 'Dorethy', 'Edward', 'Fred'],
"stop":['Downtown-1', 'Downtown-2', 'Hillside View', 'Hillside Crescent', 'Downtown-2', 'Chicago'],
})
friends
# | name | stop |
---|---|---|
0 | Alice | Downtown-1 |
1 | Betty | Downtown-2 |
2 | Charlie | Hillside View |
3 | Dorethy | Hillside Crescent |
4 | Edward | Downtown-2 |
5 | Fred | Chicago |
Next we need a list of bus routes and their time and stops. I don't have that, so I'm making one up:
import random
random.seed(11)
table_size = 40
times = [DataTypes.time(random.randint(21, 23), random.randint(0, 59)) for i in range(table_size)]
stops = ['Stadium', 'Hillside', 'Hillside View', 'Hillside Crescent', 'Downtown-1', 'Downtown-2',
'Central station'] * 2 + [f'Random Road-{i}' for i in range(table_size)]
route = [random.choice([1, 2, 3]) for i in stops]
bus_table = Table({
"time":times,
"stop":stops[:table_size],
"route":route[:table_size],
})
bus_table.sort(mapping={'time': False})
print("Departures from Concert Hall towards ...")
bus_table[0:10]
creating sort index: 100%|██████████| 1/1 [00:00<00:00, 1459.90it/s]
join: 100%|██████████| 3/3 [00:00<00:00, 2421.65it/s]
Departures from Concert Hall towards ...
# | time | stop | route |
---|---|---|---|
0 | 21:02:00 | Random Road-6 | 2 |
1 | 21:05:00 | Hillside Crescent | 2 |
2 | 21:06:00 | Hillside | 1 |
3 | 21:25:00 | Random Road-24 | 1 |
4 | 21:29:00 | Random Road-16 | 1 |
5 | 21:32:00 | Random Road-21 | 1 |
6 | 21:33:00 | Random Road-12 | 1 |
7 | 21:36:00 | Random Road-23 | 3 |
8 | 21:38:00 | Central station | 2 |
9 | 21:38:00 | Random Road-8 | 2 |
Let's say the concert ends at 21:00 and it takes 10 minutes to get to the bus stop. The earliest departure must then be 21:10 - goodbye hugs included.
lookup_1 = friends.lookup(bus_table, (DataTypes.time(21, 10), "<=", 'time'), ('stop', "==", 'stop'))
lookup1_sorted = lookup_1.sorted(mapping={'time': False, 'name':False}, sort_mode='unix')
lookup1_sorted
100%|██████████| 6/6 [00:00<00:00, 1513.92it/s]
join: 100%|██████████| 3/3 [00:00<00:00, 2003.65it/s]
creating sort index: 100%|██████████| 2/2 [00:00<00:00, 2589.88it/s]
join: 100%|██████████| 5/5 [00:00<00:00, 2034.29it/s]
# | name | stop | time | stop_1 | route |
---|---|---|---|---|---|
0 | Fred | Chicago | None | None | None |
1 | Betty | Downtown-2 | 21:51:00 | Downtown-2 | 1 |
2 | Edward | Downtown-2 | 21:51:00 | Downtown-2 | 1 |
3 | Charlie | Hillside View | 22:19:00 | Hillside View | 2 |
4 | Alice | Downtown-1 | 23:12:00 | Downtown-1 | 3 |
5 | Dorethy | Hillside Crescent | 23:54:00 | Hillside Crescent | 1 |
Lookup's ability to use custom criteria thereby makes it far more versatile than SQL joins.
But with great power comes great responsibility.
Match¶
If you're looking to do a join where you afterwards remove the empty rows, match is the faster choice.
Here is an example.
Let's start with two tables:
materials = Table({
'bom_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
'partial_of': [1, 2, 3, 4, 5, 6, 7, 4, 6],
'sku': ['A', 'irrelevant', 'empty carton', 'pkd carton', 'empty pallet', 'pkd pallet', 'pkd irrelevant', 'ppkd carton', 'ppkd pallet'],
'material_id': [None, None, None, 3, None, 5, 3, 3, 5],
'quantity': [10, 20, 30, 40, 50, 60, 70, 80, 90]
})
## 9 is a partially packed pallet of 6
## we are looking for multiple values:
looking_for = Table({
'bom_id': [3,4,6],
'moq': [1,2,3]
})
Our goal is now to find the quantity from the materials table, based on the items in the looking_for table.
This requires two steps:
- lookup
- filter for all by dropping items that didn't match.
## step 1/2:
products_lookup = materials.lookup(looking_for, ("bom_id", "==", "bom_id"), ("partial_of", "==", "bom_id"), all=False)
products_lookup
100%|██████████| 9/9 [00:00<00:00, 3651.81it/s]
join: 100%|██████████| 2/2 [00:00<00:00, 1625.38it/s]
# | bom_id | partial_of | sku | material_id | quantity | bom_id_1 | moq |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | A | None | 10 | None | None |
1 | 2 | 2 | irrelevant | None | 20 | None | None |
2 | 3 | 3 | empty carton | None | 30 | 3 | 1 |
3 | 4 | 4 | pkd carton | 3 | 40 | 4 | 2 |
4 | 5 | 5 | empty pallet | None | 50 | None | None |
5 | 6 | 6 | pkd pallet | 5 | 60 | 6 | 3 |
6 | 7 | 7 | pkd irrelevant | 3 | 70 | None | None |
7 | 8 | 4 | ppkd carton | 3 | 80 | 4 | 2 |
8 | 9 | 6 | ppkd pallet | 5 | 90 | 6 | 3 |
## step 2/2:
products = products_lookup.all(bom_id_1=lambda x: x is not None)
products
# | bom_id | partial_of | sku | material_id | quantity | bom_id_1 | moq |
---|---|---|---|---|---|---|---|
0 | 3 | 3 | empty carton | None | 30 | 3 | 1 |
1 | 4 | 4 | pkd carton | 3 | 40 | 4 | 2 |
2 | 6 | 6 | pkd pallet | 5 | 60 | 6 | 3 |
3 | 8 | 4 | ppkd carton | 3 | 80 | 4 | 2 |
4 | 9 | 6 | ppkd pallet | 5 | 90 | 6 | 3 |
The faster way to solve this problem is to use match!
Here is the example:
products_matched = materials.match(looking_for, ("bom_id", "==", "bom_id"), ("partial_of", "==", "bom_id"))
products_matched
# | bom_id | partial_of | sku | material_id | quantity | bom_id_1 | moq |
---|---|---|---|---|---|---|---|
0 | 3 | 3 | empty carton | None | 30 | 3 | 1 |
1 | 4 | 4 | pkd carton | 3 | 40 | 4 | 2 |
2 | 6 | 6 | pkd pallet | 5 | 60 | 6 | 3 |
3 | 8 | 4 | ppkd carton | 3 | 80 | 4 | 2 |
4 | 9 | 6 | ppkd pallet | 5 | 90 | 6 | 3 |
assert products == products_matched
Are there other ways I can add data?¶
Yes - but row-based operations cause a lot of IO, so it'll work, just slower:
from tablite import Table
t = Table() # create table
t.add_columns('row','A','B','C') # add columns
The following examples are all valid and append the row (1,2,3) to the table.
t.add_rows(1, 1, 2, 3) # individual values
t.add_rows([2, 1, 2, 3]) # list of values
t.add_rows((3, 1, 2, 3)) # tuple of values
t.add_rows(*(4, 1, 2, 3)) # unpacked tuple
t.add_rows(row=5, A=1, B=2, C=3) # keyword - args
t.add_rows(**{'row': 6, 'A': 1, 'B': 2, 'C': 3}) # dict / json.
The following examples add two rows to the table
t.add_rows((7, 1, 2, 3), (8, 4, 5, 6)) # two (or more) tuples.
t.add_rows([9, 1, 2, 3], [10, 4, 5, 6]) # two or more lists
t.add_rows({'row': 11, 'A': 1, 'B': 2, 'C': 3},
{'row': 12, 'A': 4, 'B': 5, 'C': 6}) # two (or more) dicts as args.
t.add_rows(*[{'row': 13, 'A': 1, 'B': 2, 'C': 3},
{'row': 14, 'A': 1, 'B': 2, 'C': 3}]) # list of dicts.
t
# | row | A | B | C |
---|---|---|---|---|
0 | 1 | 1 | 2 | 3 |
1 | 2 | 1 | 2 | 3 |
2 | 3 | 1 | 2 | 3 |
3 | 4 | 1 | 2 | 3 |
4 | 5 | 1 | 2 | 3 |
5 | 6 | 1 | 2 | 3 |
6 | 7 | 1 | 2 | 3 |
7 | 8 | 4 | 5 | 6 |
8 | 9 | 1 | 2 | 3 |
9 | 10 | 4 | 5 | 6 |
10 | 11 | 1 | 2 | 3 |
11 | 12 | 4 | 5 | 6 |
12 | 13 | 1 | 2 | 3 |
13 | 14 | 1 | 2 | 3 |
As the row column incremented from 1 in the first of these examples and finished with row: 14, you can now see the whole table above.
Okay, great. How do I load data?¶
Easy. Use file_reader. Here's an example:
from pathlib import Path
path = Path('tests/data/book1.csv')
tx = Table.from_file(path)
tx
Collecting tasks: 'tests/data/book1.csv'
Dumping tasks: 'tests/data/book1.csv'
importing file: 100%|██████████| 1/1 [00:00<00:00, 444.08it/s]
# | a | b | c | d | e | f |
---|---|---|---|---|---|---|
0 | 1 | 0.060606061 | 0.090909091 | 0.121212121 | 0.151515152 | 0.181818182 |
1 | 2 | 0.121212121 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 |
2 | 3 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 |
3 | 4 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 |
4 | 5 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 |
5 | 6 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 |
6 | 7 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 | 62.06060606 |
... | ... | ... | ... | ... | ... | ... |
38 | 39 | 16659267088.0 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 |
39 | 40 | 33318534175.0 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 |
40 | 41 | 66637068350.0 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 |
41 | 42 | 133274000000.0 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 |
42 | 43 | 266548000000.0 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 |
43 | 44 | 533097000000.0 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 |
44 | 45 | 1066190000000.0 | 2132390000000.0 | 4264770000000.0 | 8529540000000.0 | 17059100000000.0 |
Note that you can also add start, limit and chunk_size to the file reader. Here's an example:
path = Path('tests/data/book1.csv')
tx2 = Table.from_file(path, start=2, limit=15)
tx2
Collecting tasks: 'tests/data/book1.csv'
importing file: 100%|██████████| 1/1 [00:00<00:00, 391.22it/s]
Dumping tasks: 'tests/data/book1.csv'
# | a | b | c | d | e | f |
---|---|---|---|---|---|---|
0 | 3 | 0.242424242 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 |
1 | 4 | 0.484848485 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 |
2 | 5 | 0.96969697 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 |
3 | 6 | 1.939393939 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 |
4 | 7 | 3.878787879 | 7.757575758 | 15.51515152 | 31.03030303 | 62.06060606 |
5 | 8 | 7.757575758 | 15.51515152 | 31.03030303 | 62.06060606 | 124.1212121 |
6 | 9 | 15.51515152 | 31.03030303 | 62.06060606 | 124.1212121 | 248.2424242 |
7 | 10 | 31.03030303 | 62.06060606 | 124.1212121 | 248.2424242 | 496.4848485 |
8 | 11 | 62.06060606 | 124.1212121 | 248.2424242 | 496.4848485 | 992.969697 |
9 | 12 | 124.1212121 | 248.2424242 | 496.4848485 | 992.969697 | 1985.939394 |
10 | 13 | 248.2424242 | 496.4848485 | 992.969697 | 1985.939394 | 3971.878788 |
11 | 14 | 496.4848485 | 992.969697 | 1985.939394 | 3971.878788 | 7943.757576 |
12 | 15 | 992.969697 | 1985.939394 | 3971.878788 | 7943.757576 | 15887.51515 |
13 | 16 | 1985.939394 | 3971.878788 | 7943.757576 | 15887.51515 | 31775.0303 |
14 | 17 | 3971.878788 | 7943.757576 | 15887.51515 | 31775.0303 | 63550.06061 |
How good is the file_reader?
I've included all the formats in the test suite that are publicly available from the Alan Turing Institute, dateutils and Python's csv reader.
What about MM-DD-YYYY formats? Some users from the US ask why the csv reader doesn't read the month-day-year format.
The answer is simple: it's not an ISO8601 format. The US month-day-year format is a locale that may be used a lot in the US, but it isn't an international standard.
If you need to work with MM-DD-YYYY you will find that the file_reader imports the values as text (str). You can then reformat them with a custom function like:
s = "03-21-1998"
from datetime import date
f = lambda s: date(int(s[-4:]), int(s[:2]), int(s[3:5]))
f(s)
datetime.date(1998, 3, 21)
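To convert a whole column, a list comprehension does the job (a sketch; 'date' is a hypothetical column holding MM-DD-YYYY strings imported as text):
t['date'] = [f(v) for v in t['date']]  ## hypothetical column of MM-DD-YYYY strings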
Sweet. What formats are supported? Can I add my own file reader?¶
Yes! This is very good for special log files or custom json formats. Here's how you do it:
(1) Go through the existing readers in tablite.import_utils and find the closest match.
from tablite.import_utils import file_readers
for k, v in file_readers.items():
    print(k, v)
fods <function excel_reader at 0x7f36a3ef8c10>
json <function excel_reader at 0x7f36a3ef8c10>
html <function from_html at 0x7f36a3ef8b80>
hdf5 <function from_hdf5 at 0x7f36a3ef8a60>
simple <function excel_reader at 0x7f36a3ef8c10>
rst <function excel_reader at 0x7f36a3ef8c10>
mediawiki <function excel_reader at 0x7f36a3ef8c10>
xlsx <function excel_reader at 0x7f36a3ef8c10>
xls <function excel_reader at 0x7f36a3ef8c10>
xlsm <function excel_reader at 0x7f36a3ef8c10>
csv <function text_reader at 0x7f36a3ef9000>
tsv <function text_reader at 0x7f36a3ef9000>
txt <function text_reader at 0x7f36a3ef9000>
ods <function ods_reader at 0x7f36a3ef8ca0>
(2) define your new file reader
def my_magic_reader(path, **kwargs):  # define your new file reader.
    print(f"do magic with {path}")
    return
(3) add it to the list of readers.
file_readers['my_special_format'] = my_magic_reader
The file_readers are all registered in tablite.import_utils, so if you intend to extend the readers, I recommend that you start there.
Very nice. How about exporting data?¶
Just use .export, or one of the to_* methods like to_xlsx:
file = Path('example.xlsx')
tx2.to_xlsx(file)
os.remove(file)
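The same pattern applies to the other formats; for example the JSON round-trip mentioned in the overview (a short sketch):
json_str = tx2.to_json()         ## serialise the table to a json string
tx3 = Table.from_json(json_str)  ## ... and rebuild it
assert tx2 == tx3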
Cool. Does it play well with plotting packages?¶
Yes. Here's an example you can copy and paste:
from tablite import Table
t = Table({
'a':[1, 2, 8, 3, 4, 6, 5, 7, 9],
'b':[10, 100, 3, 4, 16, -1, 10, 10, 10],
})
t.sort(mapping={"a":False})
t
creating sort index: 100%|██████████| 1/1 [00:00<00:00, 1674.37it/s]
join: 100%|██████████| 2/2 [00:00<00:00, 1701.89it/s]
# | a | b |
---|---|---|
0 | 1 | 10 |
1 | 2 | 100 |
2 | 3 | 4 |
3 | 4 | 16 |
4 | 5 | 10 |
5 | 6 | -1 |
6 | 7 | 10 |
7 | 8 | 3 |
8 | 9 | 10 |
%pip install matplotlib -q
Note: you may need to restart the kernel to use updated packages.
import matplotlib.pyplot as plt
plt.plot(t['a'], t['b'])
plt.ylabel('Hello Figure')
plt.show()
I like sql. Can tablite understand SQL?¶
Almost. You can use table.to_sql and tablite will return ANSI-92 compliant SQL.
You can also create a table using Table.from_sql and tablite will consume ANSI-92 compliant SQL.
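A sketch of the round-trip, assuming from_sql accepts the SQL text that to_sql produces (exact signatures may differ):
sql = t.to_sql()              ## ANSI-92 compliant SQL for the table
t_back = Table.from_sql(sql)  ## consume ANSI-92 compliant SQL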
But what do I do if I'm about to run out of memory?¶
You won't. Every tablite table is backed by disk. The memory footprint of a table is only the metadata required to know the relationships between variable names and the data structures.
Let's do a comparison:
## Let's monitor the memory and record the observations into a table!
import psutil, os, gc
from time import process_time,sleep
process = psutil.Process(os.getpid())
def mem_time():  # go and check the task manager's memory usage.
    return process.memory_info().rss, process_time()
digits = 1_000_000
records = Table({'method':[], 'memory':[], 'time':[]})
The row based format: 1 million 10-tuples
before, start = mem_time()
L = [tuple([11 for _ in range(10)]) for _ in range(digits)]
after, end = mem_time()
del L
gc.collect()
records.add_rows(*('1e6 lists w. 10 integers', after - before, round(end-start,4)))
records
# | method | memory | time |
---|---|---|---|
0 | 1e6 lists w. 10 integers | 119054336 | 0.5045 |
The column based format: 10 columns with 1M values:
before, start = mem_time()
L = [[11 for i2 in range(digits)] for i1 in range(10)]
after,end = mem_time()
del L
gc.collect()
records.add_rows(('10 lists with 1e6 integers', after - before, round(end-start,4)))
We've thereby saved roughly 44 Mb by avoiding the overhead from managing 1 million lists.
Q: But why didn't I just use an array? It would have an even lower memory footprint.
A: First, arrays don't handle Nones, and we get those frequently in dirty csv data.
Second, Table needs even less memory.
Let's try with an array:
import array
before, start = mem_time()
L = [array.array('i', [11 for _ in range(digits)]) for _ in range(10)]
after,end = mem_time()
del L
gc.collect()
records.add_rows(('10 lists with 1e6 integers in arrays', after - before, round(end-start,4)))
records
# | method | memory | time |
---|---|---|---|
0 | 1e6 lists w. 10 integers | 119054336 | 0.5045 |
1 | 10 lists with 1e6 integers | 75276288 | 0.1906 |
2 | 10 lists with 1e6 integers in arrays | 39833600 | 0.3633 |
Finally let's use a tablite.Table:
before,start = mem_time()
t = Table(columns={str(i1): [11 for i2 in range(digits)] for i1 in range(10)})
after,end = mem_time()
records.add_rows(('Table with 10 columns with 1e6 integers', after - before, round(end-start,4)))
before,start = mem_time()
t2 = t.copy()
after,end = mem_time()
records.add_rows(('2 Tables with 10 columns with 1e6 integers each', after - before, round(end-start,4)))
## Let's show it, so we know nobody's cheating:
t2
# | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
1 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
2 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
3 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
4 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
5 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
6 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
999,993 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
999,994 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
999,995 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
999,996 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
999,997 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
999,998 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
999,999 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
records
# | method | memory | time |
---|---|---|---|
0 | 1e6 lists w. 10 integers | 119054336 | 0.5045 |
1 | 10 lists with 1e6 integers | 75276288 | 0.1906 |
2 | 10 lists with 1e6 integers in arrays | 39833600 | 0.3633 |
3 | Table with 10 columns with 1e6 integers | 0 | 1.9569 |
4 | 2 Tables with 10 columns with 1e6 integers each | 0 | 0.0001 |
Conclusion: whilst the common worst case (1M lists with 10 integers) takes up about 119 Mb of RAM, Tablite's tables vanish in the noise of memory measurement.
Conclusions¶
This concludes the mega-tutorial to tablite. There's nothing more to it.
But oh boy it'll save a lot of time.
Here's a summary of features:
- Everything a list can do.
- Import csv*, fods, json, html, simple, rst, mediawiki, xlsx, xls, xlsm, tsv, txt and ods using Table.from_file(...)
- Iterate over rows or columns.
- Create multikey index, sort, and use filter, any and all to select. Perform lookup across tables, including using custom functions.
- Perform multikey joins with other tables.
- Perform groupby and reorganise data as a pivot table with max, min, sum, first, last, count, unique, average, standard deviation, median and mode.
- Update tables with += which automatically sorts out the columns - even if they're not in perfect order.
FAQ¶
Question | Answer |
---|---|
I'm not in a notebook. Is there a nice way to view tables? | Yes. table.show() prints the ascii version |
I'm looking for the equivalent to apply in pandas. | Just use list comprehensions: table[column] = [f(x) for x in table[column]] |
What about map? | Just use the python function: mapping = map(f, table[column name]) |
Is there a where function? | It's called any or all, like in python: table.any(column_name > 0). |
I like sql and sqlite. Can I use sql? | Yes. Calling table.to_sql() returns an ANSI-92 SQL compliant table definition. You can use this in any SQL compliant engine. |
Sometimes I need to clean up data with datetimes. Is there any tool to help with that? | Yes. Look at DataTypes: DataTypes.round(value, multiple) allows rounding of datetimes. |
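Going only by the signature DataTypes.round(value, multiple) quoted above, a rounding call might look like this (a sketch; the exact semantics are in the DataTypes docs):
from datetime import datetime, timedelta
from tablite import DataTypes
DataTypes.round(datetime(2023, 6, 1, 11, 7, 39), timedelta(minutes=15))  ## round to a 15-minute multiple (assumed)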
Coming to Tablite from Pandas¶
If you're coming to Tablite from Pandas, you will notice some differences.
Here's the ultra-short comparison to the Pandas documentation called 10 minutes intro to pandas.
The tutorials provide the generic overview:
Some key differences
topic | Tablite |
---|---|
Viewing data | Just use table.show() in print outs, or if you're in a jupyter notebook just use the variable name table |
Selection | Slicing works both on columns and rows, and you can filter using any or all: table['A','B', 2:30:3].any(A=lambda x:x>3) To copy a table use: t2 = t.copy() This is a very fast deep copy with no memory overhead, as tablite's memory manager keeps track of the data. |
Missing data | Tablite uses mixed column format for any format that isn't uniform. To get rid of rows with Nones and np.nans use any: table.drop_na(None, np.nan) Alternatively you can use replace: table.replace(None,5), following the syntax: table.replace_missing_values(sources, target) |
Operations | Descriptive statistics are on a column-by-column basis: table['a'].statistics() The pandas function df.apply doesn't exist in tablite; use a list comprehension instead. For example: df.apply(np.cumsum) is just np.cumsum(t['A']) "histogramming" in tablite is per column: table['a'].histogram() String methods? Just use a list comprehension: table['A', 'B'].any(A=lambda x: "hello" in x, B=lambda x: "world" in x) |
Merge | Concatenation: Just use + or += as in t1 = t2 + t3; t1 += t4. If the columns are out of order, tablite will sort the headers according to the order in the first table. If you're worried about header mismatch, use t1.stack(t2). Joins are ANSI92 compliant: t1.join(t2, <...args...>, join_type=...). |
Grouping | Tablite supports multikey groupby using from tablite import GroupBy as gb, then table.groupby(keys, functions) |
Reshaping | To reshape a table use transpose. To perform pivot-table-like operations, use: table.pivot(rows, columns, functions) Subtotals aside, tablite will give you everything Excel's pivot table can do. |
Time series | To convert time series use a list comprehension: t1['GMT'] = [timedelta(hours=1) + v for v in t1['date']] To generate a date range use: from tablite import date_range then t['date'] = date_range(start=2022/1/1, stop=2023/1/1, step=timedelta(days=1)) |
Categorical | Pandas only seems to use this for sorting and grouping. Tablite table has .sort , .groupby and .pivot to achieve the same task. |
Plotting | Import your favorite plotting package and feed it the values, such as: import matplotlib.pyplot as plt plt.plot(t['a'],t['b']) plt.show() |
Import/Export | Tablite supports the same import/export options as pandas. Tablite pegs the free memory before IO and can therefore process larger-than-RAM files. Tablite also guesses the datatypes for all ISO formats and uses multiprocessing, and may therefore be faster. Should you want to inspect how guess works, use from tablite.tools import guess and try the function out. |
Gotchas | None really. Should you come across something non-pythonic, then please post it on the issue list. |
Pandas also permits the usage of namedtuples, which are unpacked upon entry.
from collections import namedtuple
import pandas as pd

Point = namedtuple("Point", "x y")
points = [Point(0, 0), Point(0, 3)]
pd.DataFrame(points)
Doing that in tablite is a bit different. To unpack the named tuple, you should do so explicitly:
t = Table({'x': [p.x for p in points], 'y': [p.y for p in points]})
However should you want to keep the points as namedtuple, you can do so in tablite:
t = Table()
t['points'] = points
Tablite will store a serialised version of the points, so your memory overhead will be close to zero.