Tutorial

This is an introduction to LArray. It is not intended to be a fully comprehensive manual. It is mainly dedicated to help new users to familiarize with it and others to remind essentials.

The first step to use the LArray library is to import it:

In [1]: from larray import *

Axis creation

An axis represents a dimension of an LArray object. It consists of a name and a list of labels. They are several ways to create an axis:

# create a wildcard axis
In [2]: age = Axis(3, 'age')

# labels given as a list
In [3]: time = Axis([2007, 2008, 2009], 'time')

# create an axis using one string
In [4]: sex = Axis('sex=M,F')

# labels generated using a special syntax
In [5]: other = Axis('other=A01..C03')

In [6]: age, sex, time, other
Out[6]: 
(Axis(3, 'age'),
 Axis(['M', 'F'], 'sex'),
 Axis([2007, 2008, 2009], 'time'),
 Axis(['A01', 'A02', 'A03', 'B01', 'B02', 'B03', 'C01', 'C02', 'C03'], 'other'))

Array creation

A LArray object represents a multidimensional array with labeled axes.

From scratch

To create an array from scratch, you need to provide the data and a list of axes. Optionally, a title can be defined.

In [7]: import numpy as np

# list of the axes
In [8]: axes = [age, sex, time, other]

# data (the shape of data array must match axes lengths)
In [9]: data = np.random.randint(100, size=[len(axis) for axis in axes])

# title (optional)
In [10]: title = 'random data'

In [11]: arr = LArray(data, axes, title)

In [12]: arr
Out[12]: 
age*  sex  time\other  A01  A02  A03  B01  B02  B03  C01  C02  C03
   0    M        2007   11   26   93   57   75   82   45    2   13
   0    M        2008   55    9   50   76   33   91   38   25   69
   0    M        2009   53   28   72   40   30   69   35   24    7
   0    F        2007   38   60   34   41   23   51   55   83   78
   0    F        2008   23   81   60   31   88   78   85   16   43
   0    F        2009   69   63   92   48   34   69   51   61   18
   1    M        2007   17   37   86    9    3   58   88    9   41
   1    M        2008   15    4   74   47   23   81    8   25   15
   1    M        2009   36   30   63   27   35   34   93   31   86
   1    F        2007   50   18   86   95   79   46   76   11   39
   1    F        2008    2   99   11   97   24   18    1   27   23
   1    F        2009   70   59   13   82   80    6   39    6   88
   2    M        2007   61   74   81   57   34   66    9   42   79
   2    M        2008   23   67   29   58   77   43   29   74   49
   2    M        2009   85   46   44    4   74   17   64   34   95
   2    F        2007   18   98   29   62   86   43   19   12   65
   2    F        2008   28   87   32   91   22   51   40    4   38
   2    F        2009   41   76   97   97    5    3   45   58   40

Array creation functions

Arrays can also be generated in an easier way through creation functions:

  • ndrange() : fills an array with increasing numbers
  • ndtest() : same as ndrange but with axes generated automatically (for testing)
  • empty() : creates an array but leaves its allocated memory unchanged (i.e., it contains “garbage”. Be careful !)
  • zeros() : fills an array with 0
  • ones() : fills an array with 1
  • full() : fills an array with a given value

Except for ndtest, a list of axes must be provided. Axes can be passed in different ways:

  • as Axis objects
  • as integers defining the lengths of auto-generated wildcard axes
  • as a string : ‘sex=M,F;time=2007,2008,2009’ (name is optional)
  • as pairs (name, labels)

Optionally, the type of data stored by the array can be specified using argument dtype.

# start defines the starting value of data
In [13]: ndrange(['age=0..2', 'sex=M,F', 'time=2007..2009'], start=-1)
Out[13]: 
age  sex\time  2007  2008  2009
  0         M    -1     0     1
  0         F     2     3     4
  1         M     5     6     7
  1         F     8     9    10
  2         M    11    12    13
  2         F    14    15    16
# start defines the starting value of data
# label_start defines the starting index of labels
In [14]: ndtest((3, 3), start=-1, label_start=2)
Out[14]: 
a\b  b2  b3  b4
 a2  -1   0   1
 a3   2   3   4
 a4   5   6   7
# empty generates uninitialised array with correct axes (much faster but use with care!).
# This not really random either, it just reuses a portion of memory that is available, with whatever content is there.
# Use it only if performance matters and make sure all data will be overridden.
In [15]: empty(['age=0..2', 'sex=M,F', 'time=2007..2009'])
Out[15]: 
age  sex\time                   2007                    2008                    2009
  0         M   6.9180901376223e-310         2.84317546e-316   6.91809100536483e-310
  0         F  6.91806053025905e-310   6.91806051988683e-310                     0.0
  1         M                    0.0   6.91809100536483e-310                     0.0
  1         F                    0.0    6.9180660075894e-310   6.91809100536483e-310
  2         M                    0.0                     0.0   6.91809100536483e-310
  2         F   6.9180909566974e-310  2.865238098441256e+161  3.250037186622356e+178
# example with anonymous axes
In [16]: zeros(['0..2', 'M,F', '2007..2009'])
Out[16]: 
{0}  {1}\{2}  2007  2008  2009
  0        M   0.0   0.0   0.0
  0        F   0.0   0.0   0.0
  1        M   0.0   0.0   0.0
  1        F   0.0   0.0   0.0
  2        M   0.0   0.0   0.0
  2        F   0.0   0.0   0.0
# dtype=int forces to store int data instead of default float
In [17]: ones(['age=0..2', 'sex=M,F', 'time=2007..2009'], dtype=int)
Out[17]: 
age  sex\time  2007  2008  2009
  0         M     1     1     1
  0         F     1     1     1
  1         M     1     1     1
  1         F     1     1     1
  2         M     1     1     1
  2         F     1     1     1
In [18]: full(['age=0..2', 'sex=M,F', 'time=2007..2009'], 1.23)
Out[18]: 
age  sex\time  2007  2008  2009
  0         M  1.23  1.23  1.23
  0         F  1.23  1.23  1.23
  1         M  1.23  1.23  1.23
  1         F  1.23  1.23  1.23
  2         M  1.23  1.23  1.23
  2         F  1.23  1.23  1.23

All the above functions exist in {func}_like variants which take axes from another array

In [19]: ones_like(arr)
Out[19]: 
age*  sex  time\other  A01  A02  A03  B01  B02  B03  C01  C02  C03
   0    M        2007    1    1    1    1    1    1    1    1    1
   0    M        2008    1    1    1    1    1    1    1    1    1
   0    M        2009    1    1    1    1    1    1    1    1    1
   0    F        2007    1    1    1    1    1    1    1    1    1
   0    F        2008    1    1    1    1    1    1    1    1    1
   0    F        2009    1    1    1    1    1    1    1    1    1
   1    M        2007    1    1    1    1    1    1    1    1    1
   1    M        2008    1    1    1    1    1    1    1    1    1
   1    M        2009    1    1    1    1    1    1    1    1    1
   1    F        2007    1    1    1    1    1    1    1    1    1
   1    F        2008    1    1    1    1    1    1    1    1    1
   1    F        2009    1    1    1    1    1    1    1    1    1
   2    M        2007    1    1    1    1    1    1    1    1    1
   2    M        2008    1    1    1    1    1    1    1    1    1
   2    M        2009    1    1    1    1    1    1    1    1    1
   2    F        2007    1    1    1    1    1    1    1    1    1
   2    F        2008    1    1    1    1    1    1    1    1    1
   2    F        2009    1    1    1    1    1    1    1    1    1

Sequence

The special sequence() function allows you to create an array from an axis by iteratively applying a function to a given initial value. You can choose between inc and mult functions or define your own.

# With initial=1.0 and inc=0.5, we generate the sequence 1.0, 1.5, 2.0, 2.5, 3.0, ...
In [20]: sequence('sex=M,F', initial=1.0, inc=0.5)
Out[20]: 
sex    M    F
     1.0  1.5
# With initial=1.0 and mult=2.0, we generate the sequence 1.0, 2.0, 4.0, 8.0, ...
In [21]: sequence('age=0..2', initial=1.0, mult=2.0)
Out[21]: 
age    0    1    2
     1.0  2.0  4.0
# Using your own function
In [22]: sequence('time=2007..2009', initial=2.0, func=lambda value: value**2)
Out[22]: 
time  2007  2008  2009
       2.0   4.0  16.0

You can also create N-dimensional array by passing (N-1)-dimensional array to initial, inc or mult argument

In [23]: birth = LArray([1.05, 1.15], 'sex=M,F')

In [24]: cumulate_newborns = sequence('time=2007..2009', initial=0.0, inc=birth)

In [25]: cumulate_newborns
Out[25]: 
sex\time  2007  2008  2009
       M   0.0  1.05   2.1
       F   0.0  1.15   2.3
In [26]: initial = LArray([90, 100], 'sex=M,F')

In [27]: survival = LArray([0.96, 0.98], 'sex=M,F')

In [28]: pop = sequence('age=80..83', initial=initial, mult=survival)

In [29]: pop
Out[29]: 
sex\age     80                 81                 82                 83
      M   90.0  86.39999999999999             82.944           79.62624
      F  100.0               98.0  96.03999999999999  94.11919999999999

Load/Dump from files

Load from files

In [30]: example_dir = EXAMPLE_FILES_DIR

Arrays can be loaded from CSV files (see documentation of read_csv() for more details)

# read_tsv is a shortcut when data are separated by tabs instead of commas (default separator of read_csv)
# read_eurostat is a shortcut to read EUROSTAT TSV files
In [31]: household = read_csv(example_dir + 'hh.csv')

In [32]: household.info
Out[32]: 
26 x 3 x 7
 time [26]: 1991 1992 1993 ... 2014 2015 2016
 geo [3]: 'BruCap' 'Fla' 'Wal'
 hh_type [7]: 'SING' "'MAR0" 'MAR+' ... 'UNM+' 'H1P' 'OTHR'
dtype: int64

or Excel sheets (see documentation of read_excel() for more details)

# loads array from the first sheet if no sheetname is given
In [33]: pop = read_excel(example_dir + 'demography.xlsx', 'pop')

In [34]: pop.info
Out[34]: 
26 x 3 x 121 x 2 x 2
 time [26]: 1991 1992 1993 ... 2014 2015 2016
 geo [3]: 'BruCap' 'Fla' 'Wal'
 age [121]: 0 1 2 ... 118 119 120
 sex [2]: 'M' 'F'
 nat [2]: 'BE' 'FO'
dtype: int64

or HDF5 files (HDF5 is file format designed to store and organize large amounts of data. An HDF5 file can contain multiple arrays. See documentation of read_hdf() for more details)

In [35]: mortality = read_hdf(example_dir + 'demography.h5','qx')

In [36]: mortality.info
Out[36]: 
26 x 3 x 121 x 2 x 2
 time [26]: 1991 1992 1993 ... 2014 2015 2016
 geo [3]: 'BruCap' 'Fla' 'Wal'
 age [121]: 0 1 2 ... 118 119 120
 sex [2]: 'M' 'F'
 nat [2]: 'BE' 'FO'
dtype: float64

Dump in files

Arrays can be dumped in CSV files (see documentation of to_csv() for more details)

In [37]: household.to_csv('hh2.csv')

or in Excel files (see documentation of to_excel() for more details)

 # if the file does not already exist, it is created with a single sheet,
 # otherwise a new sheet is added to it
In [38]: household.to_excel('demography_2.xlsx', overwrite_file=True)

 # it is usually better to specify the sheet explicitly (by name or position) though
In [39]: household.to_excel('demography_2.xlsx', 'hh')

or in HDF5 files (see documentation of to_hdf() for more details)

In [40]: household.to_hdf('demography_2.h5', 'hh')

more Excel IO

# create a 3 x 2 x 3 array
In [41]: age, sex, time = Axis('age=0..2'), Axis('sex=M,F'), Axis('time=2007..2009')

In [42]: arr = ndrange([age, sex, time])

In [43]: arr
Out[43]: 
age  sex\time  2007  2008  2009
  0         M     0     1     2
  0         F     3     4     5
  1         M     6     7     8
  1         F     9    10    11
  2         M    12    13    14
  2         F    15    16    17

Write Arrays

Open an Excel file

In [44]: wb = open_excel('test.xlsx', overwrite_file=True)

Put an array in an Excel Sheet, excluding headers (labels)

 # put arr at A1 in Sheet1, excluding headers (labels)
In [45]: wb['Sheet1'] = arr

 # same but starting at A9
 # note that Sheet1 must exist
In [46]: wb['Sheet1']['A9'] = arr

Put an array in an Excel Sheet, including headers (labels)

 # dump arr at A1 in Sheet2, including headers (labels)
In [47]: wb['Sheet2'] = arr.dump()

 # same but starting at A10
In [48]: wb['Sheet2']['A10'] = arr.dump()

Save file to disk

In [49]: wb.save()

Close file

In [50]: wb.close()

Read Arrays

Open an Excel file

In [51]: wb = open_excel('test.xlsx')

Load an array from a sheet (assuming the presence of (correctly formatted) headers and only one array in sheet)

# save one array in Sheet3 (including headers)
In [52]: wb['Sheet3'] = arr.dump()

# load array from the data starting at A1 in Sheet3
In [53]: arr = wb['Sheet3'].load()

In [54]: arr
Out[54]: 
age  sex\time  2007  2008  2009
  0         M     0     1     2
  0         F     3     4     5
  1         M     6     7     8
  1         F     9    10    11
  2         M    12    13    14
  2         F    15    16    17

Load an array with its axes information from a range

# if you need to use the same sheet several times,
# you can create a sheet variable
In [55]: sheet2 = wb['Sheet2']

# load array contained in the 4 x 4 table defined by cells A10 and D14
In [56]: arr2 = sheet2['A10:D14'].load()

In [57]: arr2
Out[57]: 
age  sex\time  2007  2008
  0         M     0     1
  0         F     3     4
  1         M     6     7
  1         F     9    10

Read Ranges (experimental)

Load an array (raw data) with no axis information from a range

In [58]: arr3 = wb['Sheet1']['A1:B4']

In [59]: arr3
Out[59]: 
{0}*\{1}*  0   1
        0  0   1
        1  3   4
        2  6   7
        3  9  10

in fact, this is not really an LArray …

In [60]: type(arr3)
larray.io.excel.Range

… but it can be used as such

In [61]: arr3.sum(axis=0)
Out[61]: 
{0}*   0   1
      18  22

… and it can be used for other stuff, like setting the formula instead of the value:

In [62]: arr3.formula = '=D10+1'

In the future, we should also be able to set font name, size, style, etc.

In [63]: wb.close()

Inspecting

# load population array
In [64]: pop = load_example_data('demography').pop

Get array summary : dimensions + description of axes

In [65]: pop.info
Out[65]: 
26 x 3 x 121 x 2 x 2
 time [26]: 1991 1992 1993 ... 2014 2015 2016
 geo [3]: 'BruCap' 'Fla' 'Wal'
 age [121]: 0 1 2 ... 118 119 120
 sex [2]: 'M' 'F'
 nat [2]: 'BE' 'FO'
dtype: int64

Get axes

In [66]: time, geo, age, sex, nat = pop.axes

Get array dimensions

In [67]: pop.shape
Out[67]: (26, 3, 121, 2, 2)

Get number of elements

In [68]: pop.size
Out[68]: 37752

Get size in memory

In [69]: pop.nbytes
Out[69]: 302016

Start viewer (graphical user interface) in read-only mode. This will open a new window and block execution of the rest of code until the windows is closed! Required PyQt installed.

In [70]: view(pop)

Load array in an Excel sheet

In [71]: pop.to_excel()

Selection (Subsets)

LArray allows to select a subset of an array either by labels or positions

Selection by Labels

To take a subset of an array using labels, use brackets [ ]. Let’s start by selecting a single element:

# here we select the value associated with Belgian women of age 50 from Brussels region for the year 2015
In [72]: pop[2015, 'BruCap', 50, 'F', 'BE']
Out[72]: 4813

Continue with selecting a subset using slices and lists of labels

# here we select the subset associated with Belgian women of age 50, 51 and 52
# from Brussels region for the years 2010 to 2016
In [73]: pop[2010:2016, 'BruCap', 50:52, 'F', 'BE']
Out[73]: 
time\age    50    51    52
    2010  4869  4811  4699
    2011  5015  4860  4792
    2012  4722  5014  4818
    2013  4711  4727  5007
    2014  4788  4702  4730
    2015  4813  4767  4676
    2016  4814  4792  4740
# slices bounds are optional:
# if not given start is assumed to be the first label and stop is the last one.
# Here we select all years starting from 2010
In [74]: pop[2010:, 'BruCap', 50:52, 'F', 'BE']
Out[74]: 
time\age    50    51    52
    2010  4869  4811  4699
    2011  5015  4860  4792
    2012  4722  5014  4818
    2013  4711  4727  5007
    2014  4788  4702  4730
    2015  4813  4767  4676
    2016  4814  4792  4740
# Slices can also have a step (defaults to 1), to take every Nth labels
# Here we select all even years starting from 2010
In [75]: pop[2010::2, 'BruCap', 50:52, 'F', 'BE']
Out[75]: 
time\age    50    51    52
    2010  4869  4811  4699
    2012  4722  5014  4818
    2014  4788  4702  4730
    2016  4814  4792  4740
# one can also use list of labels to take non-contiguous labels.
# Here we select years 2008, 2010, 2013 and 2015
In [76]: pop[[2008, 2010, 2013, 2015], 'BruCap', 50:52, 'F', 'BE']
Out[76]: 
time\age    50    51    52
    2008  4731  4735  4724
    2010  4869  4811  4699
    2013  4711  4727  5007
    2015  4813  4767  4676

The order of indexing does not matter either, so you usually do not care/have to remember about axes positions during computation. It only matters for output.

# order of index doesn't matter
In [77]: pop['F', 'BE', 'BruCap', [2008, 2010, 2013, 2015], 50:52]
Out[77]: 
time\age    50    51    52
    2008  4731  4735  4724
    2010  4869  4811  4699
    2013  4711  4727  5007
    2015  4813  4767  4676

Warning

Selecting by labels as above works well as long as there is no ambiguity. When two or more axes have common labels, it may lead to a crash. The solution is then to precise to which axis belong the labels.

# let us now create an array with the same labels on several axes
In [78]: age, weight, size = Axis('age=0..80'), Axis('weight=0..120'), Axis('size=0..200')

In [79]: arr_ws = ndrange([age, weight, size])
# let's try to select teenagers with size between 1 m 60 and 1 m 65 and weight > 80 kg.
# In this case the subset is ambiguous and this results in an error:
In [80]: arr_ws[10:18, :80, 160:165]
<class 'ValueError'> slice(10, 18, None) is ambiguous (valid in age, weight, size)
# the solution is simple. You need to precise the axes on which you make a selection
In [81]: arr_ws[age[10:18], weight[:80], size[160:165]]
Out[81]: 
age  weight\size     160     161     162     163     164     165
 10            0  243370  243371  243372  243373  243374  243375
 10            1  243571  243572  243573  243574  243575  243576
 10            2  243772  243773  243774  243775  243776  243777
 10            3  243973  243974  243975  243976  243977  243978
 10            4  244174  244175  244176  244177  244178  244179
...          ...     ...     ...     ...     ...     ...     ...
 18           76  453214  453215  453216  453217  453218  453219
 18           77  453415  453416  453417  453418  453419  453420
 18           78  453616  453617  453618  453619  453620  453621
 18           79  453817  453818  453819  453820  453821  453822
 18           80  454018  454019  454020  454021  454022  454023

Special variable x

When selecting, assiging or using aggregate functions, an axis can be refered via the special variable x:

  • pop[x.age[:20]]
  • pop.sum(x.age)

This gives you acces to axes of the array you are manipulating. The main drawback of using x is that you lose the autocompletion available from many editors. It only works with non-wildcard axes.

# the previous example could have been also written as
In [82]: arr_ws[x.age[10:18], x.weight[:80], x.size[160:165]]
Out[82]: 
age  weight\size     160     161     162     163     164     165
 10            0  243370  243371  243372  243373  243374  243375
 10            1  243571  243572  243573  243574  243575  243576
 10            2  243772  243773  243774  243775  243776  243777
 10            3  243973  243974  243975  243976  243977  243978
 10            4  244174  244175  244176  244177  244178  244179
...          ...     ...     ...     ...     ...     ...     ...
 18           76  453214  453215  453216  453217  453218  453219
 18           77  453415  453416  453417  453418  453419  453420
 18           78  453616  453617  453618  453619  453620  453621
 18           79  453817  453818  453819  453820  453821  453822
 18           80  454018  454019  454020  454021  454022  454023

Selection by Positions

Sometimes it is more practical to use positions along the axis, instead of labels. You need to add the character i before the brackets: .i[positions]. As for selection with labels, you can use single position or slice or list of positions. Positions can be also negative (-1 represent the last element of an axis).

Note

Remember that positions (indices) are always 0-based in Python. So the first element is at position 0, the second is at position 1, etc.

# here we select the subset associated with Belgian women of age 50, 51 and 52
# from Brussels region for the first 3 years
In [83]: pop[x.time.i[:3], 'BruCap', 50:52, 'F', 'BE']
Out[83]: 
time\age    50    51    52
    1991  3739  4138  4101
    1992  3373  3665  4088
    1993  3648  3335  3615
# same but for the last 3 years
In [84]: pop[x.time.i[-3:], 'BruCap', 50:52, 'F', 'BE']
Out[84]: 
time\age    50    51    52
    2014  4788  4702  4730
    2015  4813  4767  4676
    2016  4814  4792  4740
# using list of positions
In [85]: pop[x.time.i[-9,-7,-4,-2], 'BruCap', 50:52, 'F', 'BE']
Out[85]: 
time\age    50    51    52
    2008  4731  4735  4724
    2010  4869  4811  4699
    2013  4711  4727  5007
    2015  4813  4767  4676

Warning

The end indice (position) is EXCLUSIVE while the end label is INCLUSIVE.

# with labels (3 is included)
In [86]: pop[2015, 'BruCap', x.age[:3], 'F', 'BE']
Out[86]: 
age     0     1     2     3
     6020  5882  6023  5861
# with position (3 is out)
In [87]: pop[2015, 'BruCap', x.age.i[:3], 'F', 'BE']
Out[87]: 
age     0     1     2
     6020  5882  6023

You can use .i[] selection directly on array instead of axes. In this context, if you want to select a subset of the first and third axes for example, you must use a full slice : for the second one.

# here we select the last year and first 3 ages
# equivalent to: pop.i[-1, :, :3, :, :]
In [88]: pop.i[-1, :, :3]
Out[88]: 
   geo  age  sex\nat     BE    FO
BruCap    0        M   6155  3104
BruCap    0        F   5900  2817
BruCap    1        M   6165  3068
BruCap    1        F   5916  2946
BruCap    2        M   6053  2918
BruCap    2        F   5736  2776
   Fla    0        M  29993  3717
   Fla    0        F  28483  3587
   Fla    1        M  31292  3716
   Fla    1        F  29721  3575
   Fla    2        M  31718  3597
   Fla    2        F  30353  3387
   Wal    0        M  17869  1472
   Wal    0        F  17242  1454
   Wal    1        M  18820  1432
   Wal    1        F  17604  1443
   Wal    2        M  19076  1444
   Wal    2        F  18189  1358

Assigning subsets

Assigning value

Assign a value to a subset

# let's take a smaller array
In [89]: pop = load_example_data('demography').pop[2016, 'BruCap', 100:105]

In [90]: pop2 = pop

In [91]: pop2
Out[91]: 
age  sex\nat  BE  FO
100        M  12   0
100        F  60   3
101        M  12   2
101        F  66   5
102        M   8   0
102        F  26   1
103        M   2   1
103        F  17   2
104        M   2   1
104        F  14   0
105        M   0   0
105        F   2   2
# set all data corresponding to age >= 102 to 0
In [92]: pop2[102:] = 0

In [93]: pop2
Out[93]: 
age  sex\nat  BE  FO
100        M  12   0
100        F  60   3
101        M  12   2
101        F  66   5
102        M   0   0
102        F   0   0
103        M   0   0
103        F   0   0
104        M   0   0
104        F   0   0
105        M   0   0
105        F   0   0

One very important gotcha though…

Warning

Modifying a slice of an array in-place like we did above should be done with care otherwise you could have unexpected effects. The reason is that taking a slice subset of an array does not return a copy of that array, but rather a view on that array. To avoid such behavior, use .copy() method.

Remember:

  • taking a slice subset of an array is extremely fast (no data is copied)
  • if one modifies that subset in-place, one also modifies the original array
  • .copy() returns a copy of the subset (takes speed and memory) but allows you to change the subset without modifying the original array in the same time
# indeed, data from the original array have also changed
In [94]: pop
Out[94]: 
age  sex\nat  BE  FO
100        M  12   0
100        F  60   3
101        M  12   2
101        F  66   5
102        M   0   0
102        F   0   0
103        M   0   0
103        F   0   0
104        M   0   0
104        F   0   0
105        M   0   0
105        F   0   0
# the right way
In [95]: pop = load_example_data('demography').pop[2016, 'BruCap', 100:105]

In [96]: pop2 = pop.copy()

In [97]: pop2[102:] = 0

In [98]: pop2
Out[98]: 
age  sex\nat  BE  FO
100        M  12   0
100        F  60   3
101        M  12   2
101        F  66   5
102        M   0   0
102        F   0   0
103        M   0   0
103        F   0   0
104        M   0   0
104        F   0   0
105        M   0   0
105        F   0   0
# now, data from the original array have not changed this time
In [99]: pop
Out[99]: 
age  sex\nat  BE  FO
100        M  12   0
100        F  60   3
101        M  12   2
101        F  66   5
102        M   8   0
102        F  26   1
103        M   2   1
103        F  17   2
104        M   2   1
104        F  14   0
105        M   0   0
105        F   2   2

Assigning Arrays & Broadcasting

Instead of a value, we can also assign an array to a subset. In that case, that array can have less axes than the target but those which are present must be compatible with the subset being targeted.

In [100]: sex, nat = Axis('sex=M,F'), Axis('nat=BE,FO')

In [101]: new_value = LArray([[1, -1], [2, -2]],[sex, nat])

In [102]: new_value
Out[102]: 
sex\nat  BE  FO
      M   1  -1
      F   2  -2
# this assigns 1, -1 to Belgian, Foreigner men
# and 2, -2 to Belgian, Foreigner women for all
# people older than 100
In [103]: pop[102:] = new_value

In [104]: pop
Out[104]: 
age  sex\nat  BE  FO
100        M  12   0
100        F  60   3
101        M  12   2
101        F  66   5
102        M   1  -1
102        F   2  -2
103        M   1  -1
103        F   2  -2
104        M   1  -1
104        F   2  -2
105        M   1  -1
105        F   2  -2

Warning

The array being assigned must have compatible axes with the target subset.

# assume we define the following array with shape 3 x 2 x 2
In [105]: new_value = zeros(['age=0..2', sex, nat])

In [106]: new_value
Out[106]: 
age  sex\nat   BE   FO
  0        M  0.0  0.0
  0        F  0.0  0.0
  1        M  0.0  0.0
  1        F  0.0  0.0
  2        M  0.0  0.0
  2        F  0.0  0.0
# now let's try to assign the previous array in a subset with shape 7 x 2 x 2
In [107]: pop[102:] = new_value
<class 'ValueError'> could not broadcast input array from shape (3,2,2) into shape (4,2,2)
# but this works
In [108]: pop[102:104] = new_value
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-108-182e038e029c> in <module>()
----> 1 pop[102:104] = new_value

~/checkouts/readthedocs.org/user_builds/larray-test/conda/stable/lib/python3.5/site-packages/larray-0.27-py3.5.egg/larray/core/array.py in __setitem__(self, key, value, collapse_slices)
   2168                 axes = self._get_axes_from_translated_key(translated_key)
   2169             value = value.broadcast_with(axes)
-> 2170             value.axes.check_compatible(axes)
   2171 
   2172             # replace incomprehensible error message "could not broadcast input array from shape XX into shape YY"

~/checkouts/readthedocs.org/user_builds/larray-test/conda/stable/lib/python3.5/site-packages/larray-0.27-py3.5.egg/larray/core/axis.py in check_compatible(self, axes)
   1713             local_axis = self.get_by_pos(axis, i)
   1714             if not local_axis.iscompatible(axis):
-> 1715                 raise ValueError("incompatible axes:\n{!r}\nvs\n{!r}".format(axis, local_axis))
   1716 
   1717     def extend(self, axes, validate=True, replace_wildcards=False):

ValueError: incompatible axes:
Axis([102, 103, 104], 'age')
vs
Axis([0, 1, 2], 'age')

In [109]: pop
Out[109]: 
age  sex\nat  BE  FO
100        M  12   0
100        F  60   3
101        M  12   2
101        F  66   5
102        M   1  -1
102        F   2  -2
103        M   1  -1
103        F   2  -2
104        M   1  -1
104        F   2  -2
105        M   1  -1
105        F   2  -2

Boolean filtering

Boolean filtering can be use to extract subsets.

#Let's focus on population living in Brussels during the year 2016
In [110]: pop = load_example_data('demography').pop[2016, 'BruCap']

# here we select all males and females with age less than 5 and 10 respectively
In [111]: subset = pop[((x.sex == 'H') & (x.age <= 5)) | ((x.sex == 'F') & (x.age <= 10))]

In [112]: subset
Out[112]: 
age_sex\nat    BE    FO
        0_F  5900  2817
        1_F  5916  2946
        2_F  5736  2776
        3_F  5883  2734
        4_F  5784  2523
        5_F  5780  2521
        6_F  5759  2290
        7_F  5518  2234
        8_F  5474  2066
        9_F  5354  1896
       10_F  5200  1785

Note

Be aware that after boolean filtering, several axes may have merged.

# 'age' and 'sex' axes have been merged together
In [113]: subset.info
Out[113]: 
11 x 2
 age_sex [11]: '0_F' '1_F' '2_F' ... '8_F' '9_F' '10_F'
 nat [2]: 'BE' 'FO'
dtype: int64

This may be not what you because previous selections on merged axes are no longer valid

# now let's try to calculate the proportion of females with age less than 10
In [114]: subset['F'].sum() / pop['F'].sum()
<class 'ValueError'> F is not a valid label for any axis

Therefore, it is sometimes more useful to not select, but rather set to 0 (or another value) non matching elements

In [115]: subset = pop.copy()

In [116]: subset[((x.sex == 'F') & (x.age > 10))] = 0

In [117]: subset['F', :20]
Out[117]: 
age\nat    BE    FO
      0  5900  2817
      1  5916  2946
      2  5736  2776
      3  5883  2734
      4  5784  2523
      5  5780  2521
      6  5759  2290
      7  5518  2234
      8  5474  2066
      9  5354  1896
     10  5200  1785
     11     0     0
     12     0     0
     13     0     0
     14     0     0
     15     0     0
     16     0     0
     17     0     0
     18     0     0
     19     0     0
     20     0     0
# now we can calculate the proportion of females with age less than 10
In [118]: subset['F'].sum() / pop['F'].sum()
Out[118]: 0.14618110657051941

Boolean filtering can also mix axes and arrays. Example above could also have been written as

In [119]: age_limit = sequence('sex=M,F', initial=5, inc=5)

In [120]: age_limit
Out[120]: 
sex  M   F
     5  10
In [121]: age = pop.axes['age']

In [122]: (age <= age_limit)[:20]
Out[122]: 
age\sex      M      F
      0   True   True
      1   True   True
      2   True   True
      3   True   True
      4   True   True
      5   True   True
      6  False   True
      7  False   True
      8  False   True
      9  False   True
     10  False   True
     11  False  False
     12  False  False
     13  False  False
     14  False  False
     15  False  False
     16  False  False
     17  False  False
     18  False  False
     19  False  False
     20  False  False
In [123]: subset = pop.copy()

In [124]: subset[x.age > age_limit] = 0

In [125]: subset['F'].sum() / pop['F'].sum()
Out[125]: 0.14618110657051941

Finally, you can choose to filter on data instead of axes

# let's focus on females older than 90
In [126]: subset = pop['F', 90:110].copy()

In [127]: subset
Out[127]: 
age\nat    BE   FO
     90  1477  136
     91  1298  105
     92  1141   78
     93   906   74
     94   739   65
     95   566   53
     96   327   25
     97   171   21
     98   135    9
     99    92    8
    100    60    3
    101    66    5
    102    26    1
    103    17    2
    104    14    0
    105     2    2
    106     3    3
    107     1    2
    108     1    0
    109     0    0
    110     0    0
# here we set to 0 all data < 10
In [128]: subset[subset < 10] = 0

In [129]: subset
Out[129]: 
age\nat    BE   FO
     90  1477  136
     91  1298  105
     92  1141   78
     93   906   74
     94   739   65
     95   566   53
     96   327   25
     97   171   21
     98   135    0
     99    92    0
    100    60    0
    101    66    0
    102    26    0
    103    17    0
    104    14    0
    105     0    0
    106     0    0
    107     0    0
    108     0    0
    109     0    0
    110     0    0

Manipulates axes from arrays

# let's start with
In [130]: pop = load_example_data('demography').pop[2016, 'BruCap', 90:95]

In [131]: pop
Out[131]: 
age  sex\nat    BE   FO
 90        M   539   74
 90        F  1477  136
 91        M   499   49
 91        F  1298  105
 92        M   332   35
 92        F  1141   78
 93        M   287   27
 93        F   906   74
 94        M   237   23
 94        F   739   65
 95        M   154   19
 95        F   566   53

Relabeling

Replace all labels of one axis

# returns a copy by default
In [132]: pop_new_labels = pop.set_labels(x.sex, ['Men', 'Women'])

In [133]: pop_new_labels
Out[133]: 
age  sex\nat    BE   FO
 90      Men   539   74
 90    Women  1477  136
 91      Men   499   49
 91    Women  1298  105
 92      Men   332   35
 92    Women  1141   78
 93      Men   287   27
 93    Women   906   74
 94      Men   237   23
 94    Women   739   65
 95      Men   154   19
 95    Women   566   53
# inplace flag avoids to create a copy
In [134]: pop.set_labels(x.sex, ['M', 'F'], inplace=True)
Out[134]: 
age  sex\nat    BE   FO
 90        M   539   74
 90        F  1477  136
 91        M   499   49
 91        F  1298  105
 92        M   332   35
 92        F  1141   78
 93        M   287   27
 93        F   906   74
 94        M   237   23
 94        F   739   65
 95        M   154   19
 95        F   566   53

Renaming axes

Rename one axis

In [135]: pop.info
Out[135]: 
6 x 2 x 2
 age [6]: 90 91 92 93 94 95
 sex [2]: 'M' 'F'
 nat [2]: 'BE' 'FO'
dtype: int64
# 'rename' returns a copy of the array
In [136]: pop2 = pop.rename(x.sex, 'gender')

In [137]: pop2
Out[137]: 
age  gender\nat    BE   FO
 90           M   539   74
 90           F  1477  136
 91           M   499   49
 91           F  1298  105
 92           M   332   35
 92           F  1141   78
 93           M   287   27
 93           F   906   74
 94           M   237   23
 94           F   739   65
 95           M   154   19
 95           F   566   53

Rename several axes at once

# No x. here because sex and nat are keywords and not actual axes
In [138]: pop2 = pop.rename(sex='gender', nat='nationality')

In [139]: pop2
Out[139]: 
age  gender\nationality    BE   FO
 90                   M   539   74
 90                   F  1477  136
 91                   M   499   49
 91                   F  1298  105
 92                   M   332   35
 92                   F  1141   78
 93                   M   287   27
 93                   F   906   74
 94                   M   237   23
 94                   F   739   65
 95                   M   154   19
 95                   F   566   53

Reordering axes

Axes can be reordered using transpose() method. By default, transpose reverse axes, otherwise it permutes the axes according to the list given as argument. Axes not mentioned come after those which are mentioned(and keep their relative order). Finally, transpose returns a copy of the array.

# starting order : age, sex, nat
In [140]: pop
Out[140]: 
age  sex\nat    BE   FO
 90        M   539   74
 90        F  1477  136
 91        M   499   49
 91        F  1298  105
 92        M   332   35
 92        F  1141   78
 93        M   287   27
 93        F   906   74
 94        M   237   23
 94        F   739   65
 95        M   154   19
 95        F   566   53
# no argument --> reverse axes
In [141]: pop.transpose()
Out[141]: 
nat  sex\age    90    91    92   93   94   95
 BE        M   539   499   332  287  237  154
 BE        F  1477  1298  1141  906  739  566
 FO        M    74    49    35   27   23   19
 FO        F   136   105    78   74   65   53

# .T is a shortcut for .transpose()
In [142]: pop.T
Out[142]: 
nat  sex\age    90    91    92   93   94   95
 BE        M   539   499   332  287  237  154
 BE        F  1477  1298  1141  906  739  566
 FO        M    74    49    35   27   23   19
 FO        F   136   105    78   74   65   53
# reorder according to list
In [143]: pop.transpose(x.age, x.nat, x.sex)
Out[143]: 
age  nat\sex    M     F
 90       BE  539  1477
 90       FO   74   136
 91       BE  499  1298
 91       FO   49   105
 92       BE  332  1141
 92       FO   35    78
 93       BE  287   906
 93       FO   27    74
 94       BE  237   739
 94       FO   23    65
 95       BE  154   566
 95       FO   19    53
# axes not mentioned come after those which are mentioned (and keep their relative order)
In [144]: pop.transpose(x.sex)
Out[144]: 
sex  age\nat    BE   FO
  M       90   539   74
  M       91   499   49
  M       92   332   35
  M       93   287   27
  M       94   237   23
  M       95   154   19
  F       90  1477  136
  F       91  1298  105
  F       92  1141   78
  F       93   906   74
  F       94   739   65
  F       95   566   53

Aggregates

Calculate the sum along an axis

In [145]: pop = load_example_data('demography').pop[2016, 'BruCap']

In [146]: pop.sum(x.age)
Out[146]: 
sex\nat      BE      FO
      M  375261  204534
      F  401554  206541

or along all axes except one by appending _by to the aggregation function

In [147]: pop[90:95].sum_by(x.age)
Out[147]: 
age    90    91    92    93    94   95
     2226  1951  1586  1294  1064  792

# is equivalent to
In [148]: pop[90:95].sum(x.sex, x.nat)
Out[148]: 
age    90    91    92    93    94   95
     2226  1951  1586  1294  1064  792

There are many other aggregate functions built-in:

  • mean, min, max, median, percentile, var (variance), std (standard deviation)
  • labelofmin, labelofmax (label indirect minimum/maxium – labels where the value is minimum/maximum)
  • indexofmin, indexofmax (positional indirect minimum/maxium – position along axis where the value is minimum/maximum)
  • cumsum, cumprod (cumulative sum, cumulative product)

Groups

One can define groups of labels (or indices)

In [149]: age = pop.axes['age']

# using indices (remember: 20 will not be included)
In [150]: teens = age.i[10:20]

# using labels
In [151]: pensioners = age[67:]

In [152]: strange = age[[30, 55, 52, 25, 99]]

In [153]: strange
Out[153]: age[30, 55, 52, 25, 99]

or rename them

# method 'named' returns a new group with the given name
In [154]: teens = teens.named('children')

# operator >> is a shortcut for 'named'
In [155]: pensioners = pensioners >> 'pensioners'

In [156]: pensioners
Out[156]: age[67:] >> 'pensioners'

Then, use them in selections

In [157]: pop[strange]
Out[157]: 
age  sex\nat    BE    FO
 30        M  5278  4725
 30        F  5253  5419
 55        M  4457  2196
 55        F  4953  2059
 52        M  4635  2640
 52        F  4740  2333
 25        M  5477  3590
 25        F  5539  4635
 99        M    20     2
 99        F    92     8

or aggregations

In [158]: pop.sum(pensioners)
Out[158]: 
sex\nat     BE     FO
      M  44138   9939
      F  70314  13241
# several groups (here you see the interest of groups renaming)
In [159]: pop.sum((teens, pensioners, strange))
Out[159]: 
           age  sex\nat     BE     FO
      children        M  49143  17100
      children        F  47226  16523
    pensioners        M  44138   9939
    pensioners        F  70314  13241
30,55,52,25,99        M  19867  13153
30,55,52,25,99        F  20577  14454
# combined with other axes
In [160]: pop.sum((teens, pensioners, strange), x.nat)
Out[160]: 
       age\sex      M      F
      children  66243  63749
    pensioners  54077  83555
30,55,52,25,99  33020  35031

Arithmetic operations

# go back to our 6 x 2 x 2 example array
In [161]: pop = load_example_data('demography').pop[2016, 'BruCap', 90:95]

In [162]: pop
Out[162]: 
age  sex\nat    BE   FO
 90        M   539   74
 90        F  1477  136
 91        M   499   49
 91        F  1298  105
 92        M   332   35
 92        F  1141   78
 93        M   287   27
 93        F   906   74
 94        M   237   23
 94        F   739   65
 95        M   154   19
 95        F   566   53

Usual Operations

One can do all usual arithmetic operations on an array, it will apply the operation to all elements individually

# addition
In [163]: pop + 200
Out[163]: 
age  sex\nat    BE   FO
 90        M   739  274
 90        F  1677  336
 91        M   699  249
 91        F  1498  305
 92        M   532  235
 92        F  1341  278
 93        M   487  227
 93        F  1106  274
 94        M   437  223
 94        F   939  265
 95        M   354  219
 95        F   766  253
# multiplication
In [164]: pop * 2
Out[164]: 
age  sex\nat    BE   FO
 90        M  1078  148
 90        F  2954  272
 91        M   998   98
 91        F  2596  210
 92        M   664   70
 92        F  2282  156
 93        M   574   54
 93        F  1812  148
 94        M   474   46
 94        F  1478  130
 95        M   308   38
 95        F  1132  106
# ** means raising to the power (squaring in this case)
In [165]: pop ** 2
Out[165]: 
age  sex\nat       BE     FO
 90        M   290521   5476
 90        F  2181529  18496
 91        M   249001   2401
 91        F  1684804  11025
 92        M   110224   1225
 92        F  1301881   6084
 93        M    82369    729
 93        F   820836   5476
 94        M    56169    529
 94        F   546121   4225
 95        M    23716    361
 95        F   320356   2809
# % means modulo (aka remainder of division)
In [166]: pop % 10
Out[166]: 
age  sex\nat  BE  FO
 90        M   9   4
 90        F   7   6
 91        M   9   9
 91        F   8   5
 92        M   2   5
 92        F   1   8
 93        M   7   7
 93        F   6   4
 94        M   7   3
 94        F   9   5
 95        M   4   9
 95        F   6   3

More interestingly, it also works between two arrays

# load mortality equivalent array
In [167]: mortality = load_example_data('demography').qx[2016, 'BruCap', 90:95]

# compute number of deaths
In [168]: death = pop * mortality

In [169]: death
Out[169]: 
age  sex\nat                  BE                  FO
 90        M   94.00000000000001  13.000000000000004
 90        F  204.00000000000003  19.000000000000004
 91        M                95.0                 9.0
 91        F  200.00000000000006                16.0
 92        M                70.0                 7.0
 92        F  195.00000000000006  13.000000000000004
 93        M   66.00000000000001                 6.0
 93        F  171.99999999999997                14.0
 94        M                59.0                 6.0
 94        F  155.00000000000003                14.0
 95        M                41.0                 5.0
 95        F               130.0  12.000000000000004

Note

Be careful when mixing different data types. See type promotion in programming. You can use the method astype() to change the data type of an array.

# to be sure to get number of deaths as integers
# one can use .astype() method
In [170]: death = (pop * mortality).astype(int)

In [171]: death
Out[171]: 
age  sex\nat   BE  FO
 90        M   94  13
 90        F  204  19
 91        M   95   9
 91        F  200  16
 92        M   70   7
 92        F  195  13
 93        M   66   6
 93        F  171  14
 94        M   59   6
 94        F  155  14
 95        M   41   5
 95        F  130  12

But operations between two arrays only works when they have compatible axes (i.e. same labels)

In [172]: pop[90:92] * mortality[93:95]
<class 'ValueError'> incompatible axes:
Axis([93, 94, 95], 'age')
vs
Axis([90, 91, 92], 'age')

You can override that but at your own risk. In that case only the position on the axis is used and not the labels.

In [173]: pop[90:92] * mortality[93:95].drop_labels(x.age)
Out[173]: 
age  sex\nat                  BE                  FO
 90        M  123.95121951219514  16.444444444444443
 90        F    280.401766004415   25.72972972972973
 91        M  124.22362869198312  12.782608695652174
 91        F  272.24627875507446  22.615384615384617
 92        M   88.38961038961038   9.210526315789473
 92        F  262.06713780918733   17.66037735849057

Boolean Operations

In [174]: pop2 = pop.copy()

In [175]: pop2['F'] = -pop2['F']

In [176]: pop2
Out[176]: 
age  sex\nat     BE    FO
 90        M    539    74
 90        F  -1477  -136
 91        M    499    49
 91        F  -1298  -105
 92        M    332    35
 92        F  -1141   -78
 93        M    287    27
 93        F   -906   -74
 94        M    237    23
 94        F   -739   -65
 95        M    154    19
 95        F   -566   -53
# testing for equality is done using == (a single = assigns the value)
In [177]: pop == pop2
Out[177]: 
age  sex\nat     BE     FO
 90        M   True   True
 90        F  False  False
 91        M   True   True
 91        F  False  False
 92        M   True   True
 92        F  False  False
 93        M   True   True
 93        F  False  False
 94        M   True   True
 94        F  False  False
 95        M   True   True
 95        F  False  False
# testing for inequality
In [178]: pop != pop2
Out[178]: 
age  sex\nat     BE     FO
 90        M  False  False
 90        F   True   True
 91        M  False  False
 91        F   True   True
 92        M  False  False
 92        F   True   True
 93        M  False  False
 93        F   True   True
 94        M  False  False
 94        F   True   True
 95        M  False  False
 95        F   True   True
# what was our original array like again?
In [179]: pop
Out[179]: 
age  sex\nat    BE   FO
 90        M   539   74
 90        F  1477  136
 91        M   499   49
 91        F  1298  105
 92        M   332   35
 92        F  1141   78
 93        M   287   27
 93        F   906   74
 94        M   237   23
 94        F   739   65
 95        M   154   19
 95        F   566   53
# & means (boolean array) and
In [180]: (pop >= 500) & (pop <= 1000)
Out[180]: 
age  sex\nat     BE     FO
 90        M   True  False
 90        F  False  False
 91        M  False  False
 91        F  False  False
 92        M  False  False
 92        F  False  False
 93        M  False  False
 93        F   True  False
 94        M  False  False
 94        F   True  False
 95        M  False  False
 95        F   True  False
# | means (boolean array) or
In [181]: (pop < 500) | (pop > 1000)
Out[181]: 
age  sex\nat     BE    FO
 90        M  False  True
 90        F   True  True
 91        M   True  True
 91        F   True  True
 92        M   True  True
 92        F   True  True
 93        M   True  True
 93        F  False  True
 94        M   True  True
 94        F  False  True
 95        M   True  True
 95        F  False  True

Arithmetic operations with missing axes

In [182]: pop.sum(x.age)
Out[182]: 
sex\nat    BE   FO
      M  2048  227
      F  6127  511
# arr has 3 dimensions
In [183]: pop.info
Out[183]: 
6 x 2 x 2
 age [6]: 90 91 92 93 94 95
 sex [2]: 'M' 'F'
 nat [2]: 'BE' 'FO'
dtype: int64
# and arr.sum(age) has two
In [184]: pop.sum(x.age).info
Out[184]: 
2 x 2
 sex [2]: 'M' 'F'
 nat [2]: 'BE' 'FO'
dtype: int64
# you can do operation with missing axes so this works
In [185]: pop / pop.sum(x.age)
Out[185]: 
age  sex\nat                   BE                   FO
 90        M        0.26318359375  0.32599118942731276
 90        F   0.2410641423208748  0.26614481409001955
 91        M        0.24365234375  0.21585903083700442
 91        F   0.2118491921005386   0.2054794520547945
 92        M          0.162109375  0.15418502202643172
 92        F  0.18622490615309287  0.15264187866927592
 93        M        0.14013671875  0.11894273127753303
 93        F  0.14787008323812634  0.14481409001956946
 94        M        0.11572265625   0.1013215859030837
 94        F  0.12061367716663947  0.12720156555772993
 95        M         0.0751953125  0.08370044052863436
 95        F  0.09237799902072792  0.10371819960861056

Axis order does not matter much (except for output)

You can do operations between arrays having different axes order. The axis order of the result is the same as the left array

In [186]: pop
Out[186]: 
age  sex\nat    BE   FO
 90        M   539   74
 90        F  1477  136
 91        M   499   49
 91        F  1298  105
 92        M   332   35
 92        F  1141   78
 93        M   287   27
 93        F   906   74
 94        M   237   23
 94        F   739   65
 95        M   154   19
 95        F   566   53
# let us change the order of axes
In [187]: pop_transposed = pop.T

In [188]: pop_transposed
Out[188]: 
nat  sex\age    90    91    92   93   94   95
 BE        M   539   499   332  287  237  154
 BE        F  1477  1298  1141  906  739  566
 FO        M    74    49    35   27   23   19
 FO        F   136   105    78   74   65   53
# mind blowing
In [189]: pop_transposed + pop
Out[189]: 
nat  sex\age    90    91    92    93    94    95
 BE        M  1078   998   664   574   474   308
 BE        F  2954  2596  2282  1812  1478  1132
 FO        M   148    98    70    54    46    38
 FO        F   272   210   156   148   130   106

Combining arrays

Append/Prepend

Append/prepend one element to an axis of an array

In [190]: pop = load_example_data('demography').pop[2016, 'BruCap', 90:95]

# imagine that you have now acces to the number of non-EU foreigners
In [191]: data = [[25, 54], [15, 33], [12, 28], [11, 37], [5, 21], [7, 19]]

In [192]: pop_non_eu = LArray(data, pop['FO'].axes)

# you can do something like this
In [193]: pop = pop.append(nat, pop_non_eu, 'NEU')

In [194]: pop
Out[194]: 
age  sex\nat    BE   FO  NEU
 90        M   539   74   25
 90        F  1477  136   54
 91        M   499   49   15
 91        F  1298  105   33
 92        M   332   35   12
 92        F  1141   78   28
 93        M   287   27   11
 93        F   906   74   37
 94        M   237   23    5
 94        F   739   65   21
 95        M   154   19    7
 95        F   566   53   19
# you can also add something at the start of an axis
In [195]: pop = pop.prepend(x.sex, pop.sum(x.sex), 'B')

In [196]: pop
Out[196]: 
age  sex\nat    BE   FO  NEU
 90        B  2016  210   79
 90        M   539   74   25
 90        F  1477  136   54
 91        B  1797  154   48
 91        M   499   49   15
 91        F  1298  105   33
 92        B  1473  113   40
 92        M   332   35   12
 92        F  1141   78   28
 93        B  1193  101   48
 93        M   287   27   11
 93        F   906   74   37
 94        B   976   88   26
 94        M   237   23    5
 94        F   739   65   21
 95        B   720   72   26
 95        M   154   19    7
 95        F   566   53   19

The value being appended/prepended can have missing (or even extra) axes as long as common axes are compatible

In [197]: aliens = zeros(pop.axes['sex'])

In [198]: aliens
Out[198]: 
sex    B    M    F
     0.0  0.0  0.0
In [199]: pop = pop.append(x.nat, aliens, 'AL')

In [200]: pop
Out[200]: 
age  sex\nat      BE     FO   NEU   AL
 90        B  2016.0  210.0  79.0  0.0
 90        M   539.0   74.0  25.0  0.0
 90        F  1477.0  136.0  54.0  0.0
 91        B  1797.0  154.0  48.0  0.0
 91        M   499.0   49.0  15.0  0.0
 91        F  1298.0  105.0  33.0  0.0
 92        B  1473.0  113.0  40.0  0.0
 92        M   332.0   35.0  12.0  0.0
 92        F  1141.0   78.0  28.0  0.0
 93        B  1193.0  101.0  48.0  0.0
 93        M   287.0   27.0  11.0  0.0
 93        F   906.0   74.0  37.0  0.0
 94        B   976.0   88.0  26.0  0.0
 94        M   237.0   23.0   5.0  0.0
 94        F   739.0   65.0  21.0  0.0
 95        B   720.0   72.0  26.0  0.0
 95        M   154.0   19.0   7.0  0.0
 95        F   566.0   53.0  19.0  0.0

Extend

Extend an array along an axis with another array with that axis (but other labels)

In [201]: _pop = load_example_data('demography').pop

In [202]: pop = _pop[2016, 'BruCap', 90:95]

In [203]: pop_next = _pop[2016, 'BruCap', 96:100]

# concatenate along age axis
In [204]: pop.extend(x.age, pop_next)
Out[204]: 
age  sex\nat    BE   FO
 90        M   539   74
 90        F  1477  136
 91        M   499   49
 91        F  1298  105
 92        M   332   35
 92        F  1141   78
 93        M   287   27
 93        F   906   74
 94        M   237   23
 94        F   739   65
 95        M   154   19
 95        F   566   53
 96        M    80    9
 96        F   327   25
 97        M    43    9
 97        F   171   21
 98        M    23    4
 98        F   135    9
 99        M    20    2
 99        F    92    8
100        M    12    0
100        F    60    3

Stack

Stack several arrays together to create an entirely new dimension

# imagine you have loaded data for each nationality in different arrays (e.g. loaded from different Excel sheets)
In [205]: pop_be, pop_fo = pop['BE'], pop['FO']

# first way to stack them
In [206]: nat = Axis('nat=BE,FO,NEU')

In [207]: pop = stack([pop_be, pop_fo, pop_non_eu], nat)

# second way
In [208]: pop = stack([('BE', pop_be), ('FO', pop_fo), ('NEU', pop_non_eu)], 'nat')

In [209]: pop
Out[209]: 
age  sex\nat    BE   FO  NEU
 90        M   539   74   25
 90        F  1477  136   54
 91        M   499   49   15
 91        F  1298  105   33
 92        M   332   35   12
 92        F  1141   78   28
 93        M   287   27   11
 93        F   906   74   37
 94        M   237   23    5
 94        F   739   65   21
 95        M   154   19    7
 95        F   566   53   19

Sorting

Sort an axis (alphabetically if labels are strings)

In [210]: pop_sorted = pop.sort_axes(x.nat)

In [211]: pop_sorted
Out[211]: 
age  sex\nat    BE   FO  NEU
 90        M   539   74   25
 90        F  1477  136   54
 91        M   499   49   15
 91        F  1298  105   33
 92        M   332   35   12
 92        F  1141   78   28
 93        M   287   27   11
 93        F   906   74   37
 94        M   237   23    5
 94        F   739   65   21
 95        M   154   19    7
 95        F   566   53   19

Give labels which would sort the axis

In [212]: pop_sorted.labelsofsorted(x.sex)
Out[212]: 
age  sex\nat  BE  FO  NEU
 90        0   M   M    M
 90        1   F   F    F
 91        0   M   M    M
 91        1   F   F    F
 92        0   M   M    M
 92        1   F   F    F
 93        0   M   M    M
 93        1   F   F    F
 94        0   M   M    M
 94        1   F   F    F
 95        0   M   M    M
 95        1   F   F    F

Sort according to values

In [213]: pop_sorted.sort_values((90, 'F'))
Out[213]: 
age  sex\nat  NEU   FO    BE
 90        M   25   74   539
 90        F   54  136  1477
 91        M   15   49   499
 91        F   33  105  1298
 92        M   12   35   332
 92        F   28   78  1141
 93        M   11   27   287
 93        F   37   74   906
 94        M    5   23   237
 94        F   21   65   739
 95        M    7   19   154
 95        F   19   53   566

Plotting

Create a plot (last axis define the different curves to draw)

In [214]: pop.plot()
Out[214]: <matplotlib.axes._subplots.AxesSubplot at 0x7f59ab5966d8>
_images/plot_tutorial_0.png
# plot total of both sex
In [215]: pop.sum(x.sex).plot()
Out[215]: <matplotlib.axes._subplots.AxesSubplot at 0x7f59ab426f60>
_images/plot_tutorial_1.png

Interesting methods

# starting array
In [216]: pop = load_example_data('demography').pop[2016, 'BruCap', 100:105]

In [217]: pop
Out[217]: 
age  sex\nat  BE  FO
100        M  12   0
100        F  60   3
101        M  12   2
101        F  66   5
102        M   8   0
102        F  26   1
103        M   2   1
103        F  17   2
104        M   2   1
104        F  14   0
105        M   0   0
105        F   2   2

with total

Add totals to one axis

In [218]: pop.with_total(x.sex, label='B')
Out[218]: 
age  sex\nat  BE  FO
100        M  12   0
100        F  60   3
100        B  72   3
101        M  12   2
101        F  66   5
101        B  78   7
102        M   8   0
102        F  26   1
102        B  34   1
103        M   2   1
103        F  17   2
103        B  19   3
104        M   2   1
104        F  14   0
104        B  16   1
105        M   0   0
105        F   2   2
105        B   2   2

Add totals to all axes at once

# by default label is 'total'
In [219]: pop.with_total()
Out[219]: 
  age  sex\nat   BE  FO  total
  100        M   12   0     12
  100        F   60   3     63
  100    total   72   3     75
  101        M   12   2     14
  101        F   66   5     71
  101    total   78   7     85
  102        M    8   0      8
  102        F   26   1     27
  102    total   34   1     35
  103        M    2   1      3
  103        F   17   2     19
  103    total   19   3     22
  104        M    2   1      3
  104        F   14   0     14
  104    total   16   1     17
  105        M    0   0      0
  105        F    2   2      4
  105    total    2   2      4
total        M   36   4     40
total        F  185  13    198
total    total  221  17    238

where

where can be used to apply some computation depending on a condition

# where(condition, value if true, value if false)
In [220]: where(pop < 10, 0, -pop)
Out[220]: 
age  sex\nat   BE  FO
100        M  -12   0
100        F  -60   0
101        M  -12   0
101        F  -66   0
102        M    0   0
102        F  -26   0
103        M    0   0
103        F  -17   0
104        M    0   0
104        F  -14   0
105        M    0   0
105        F    0   0

clip

Set all data between a certain range

# clip(min, max)
# values below 10 are set to 10 and values above 50 are set to 50
In [221]: pop.clip(10, 50)
Out[221]: 
age  sex\nat  BE  FO
100        M  12  10
100        F  50  10
101        M  12  10
101        F  50  10
102        M  10  10
102        F  26  10
103        M  10  10
103        F  17  10
104        M  10  10
104        F  14  10
105        M  10  10
105        F  10  10

divnot0

Replace division by 0 to 0

In [222]: pop['BE'] / pop['FO']
Out[222]: 
age\sex    M     F
    100  inf  20.0
    101  6.0  13.2
    102  inf  26.0
    103  2.0   8.5
    104  2.0   inf
    105  nan   1.0
# divnot0 replaces results of division by 0 by 0.
# Using it should be done with care though
# because it can hide a real error in your data.
In [223]: pop['BE'].divnot0(pop['FO'])
Out[223]: 
age\sex    M     F
    100  0.0  20.0
    101  6.0  13.2
    102  0.0  26.0
    103  2.0   8.5
    104  2.0   0.0
    105  0.0   1.0

diff

diff() calculates the n-th order discrete difference along given axis. The first order difference is given by out[n+1] = in[n + 1] - in[n] along the given axis.

In [224]: pop = load_example_data('demography').pop[2005:2015, 'BruCap', 50]

In [225]: pop
Out[225]: 
time  sex\nat    BE    FO
2005        M  4289  1591
2005        F  4661  1584
2006        M  4335  1761
2006        F  4781  1580
2007        M  4291  1806
2007        F  4719  1650
2008        M  4349  1773
2008        F  4731  1680
2009        M  4429  2003
2009        F  4824  1722
2010        M  4582  2085
2010        F  4869  1928
2011        M  4677  2294
2011        F  5015  2104
2012        M  4463  2450
2012        F  4722  2186
2013        M  4610  2604
2013        F  4711  2254
2014        M  4725  2709
2014        F  4788  2349
2015        M  4841  2891
2015        F  4813  2498
# calculates 'pop[year+1] - pop[year]'
In [226]: pop.diff(x.time)
Out[226]: 
time  sex\nat    BE   FO
2006        M    46  170
2006        F   120   -4
2007        M   -44   45
2007        F   -62   70
2008        M    58  -33
2008        F    12   30
2009        M    80  230
2009        F    93   42
2010        M   153   82
2010        F    45  206
2011        M    95  209
2011        F   146  176
2012        M  -214  156
2012        F  -293   82
2013        M   147  154
2013        F   -11   68
2014        M   115  105
2014        F    77   95
2015        M   116  182
2015        F    25  149
# calculates 'pop[year+2] - pop[year]'
In [227]: pop.diff(x.time, d=2)
Out[227]: 
time  sex\nat    BE   FO
2007        M     2  215
2007        F    58   66
2008        M    14   12
2008        F   -50  100
2009        M   138  197
2009        F   105   72
2010        M   233  312
2010        F   138  248
2011        M   248  291
2011        F   191  382
2012        M  -119  365
2012        F  -147  258
2013        M   -67  310
2013        F  -304  150
2014        M   262  259
2014        F    66  163
2015        M   231  287
2015        F   102  244

ratio

In [228]: pop.ratio(x.nat)
Out[228]: 
time  sex\nat                  BE                   FO
2005        M   0.729421768707483    0.270578231292517
2005        F  0.7463570856685349   0.2536429143314652
2006        M  0.7111220472440944   0.2888779527559055
2006        F  0.7516113818581984   0.2483886181418016
2007        M   0.703788748564868  0.29621125143513205
2007        F  0.7409326424870466  0.25906735751295334
2008        M  0.7103887618425351  0.28961123815746487
2008        F  0.7379503977538605  0.26204960224613943
2009        M  0.6885883084577115  0.31141169154228854
2009        F  0.7369385884509624  0.26306141154903756
2010        M  0.6872656367181641   0.3127343632818359
2010        F  0.7163454465205238   0.2836545534794762
2011        M  0.6709223927700474  0.32907760722995266
2011        F  0.7044528725944655  0.29554712740553446
2012        M  0.6455952553160712  0.35440474468392885
2012        F  0.6835552982049797  0.31644470179502027
2013        M  0.6390352093152204   0.3609647906847796
2013        F  0.6763819095477387   0.3236180904522613
2014        M   0.635593220338983   0.3644067796610169
2014        F  0.6708701134930644   0.3291298865069357
2015        M  0.6260993274702535   0.3739006725297465
2015        F  0.6583230748187663  0.34167692518123377

# which is equivalent to
In [229]: pop / pop.sum(x.nat)
Out[229]: 
time  sex\nat                  BE                   FO
2005        M   0.729421768707483    0.270578231292517
2005        F  0.7463570856685349   0.2536429143314652
2006        M  0.7111220472440944   0.2888779527559055
2006        F  0.7516113818581984   0.2483886181418016
2007        M   0.703788748564868  0.29621125143513205
2007        F  0.7409326424870466  0.25906735751295334
2008        M  0.7103887618425351  0.28961123815746487
2008        F  0.7379503977538605  0.26204960224613943
2009        M  0.6885883084577115  0.31141169154228854
2009        F  0.7369385884509624  0.26306141154903756
2010        M  0.6872656367181641   0.3127343632818359
2010        F  0.7163454465205238   0.2836545534794762
2011        M  0.6709223927700474  0.32907760722995266
2011        F  0.7044528725944655  0.29554712740553446
2012        M  0.6455952553160712  0.35440474468392885
2012        F  0.6835552982049797  0.31644470179502027
2013        M  0.6390352093152204   0.3609647906847796
2013        F  0.6763819095477387   0.3236180904522613
2014        M   0.635593220338983   0.3644067796610169
2014        F  0.6708701134930644   0.3291298865069357
2015        M  0.6260993274702535   0.3739006725297465
2015        F  0.6583230748187663  0.34167692518123377

percents

# or, if you want the previous ratios in percents
In [230]: pop.percent(x.nat)
Out[230]: 
time  sex\nat                  BE                  FO
2005        M    72.9421768707483    27.0578231292517
2005        F   74.63570856685348  25.364291433146516
2006        M   71.11220472440945  28.887795275590552
2006        F   75.16113818581984   24.83886181418016
2007        M    70.3788748564868  29.621125143513204
2007        F   74.09326424870466  25.906735751295336
2008        M   71.03887618425351   28.96112381574649
2008        F   73.79503977538606  26.204960224613945
2009        M   68.85883084577114  31.141169154228855
2009        F   73.69385884509624   26.30614115490376
2010        M   68.72656367181641  31.273436328183593
2010        F   71.63454465205237  28.365455347947623
2011        M   67.09223927700474   32.90776072299526
2011        F   70.44528725944654  29.554712740553448
2012        M   64.55952553160712  35.440474468392885
2012        F   68.35552982049798  31.644470179502026
2013        M   63.90352093152204   36.09647906847796
2013        F   67.63819095477388   32.36180904522613
2014        M  63.559322033898304  36.440677966101696
2014        F   67.08701134930644   32.91298865069357
2015        M   62.60993274702535   37.39006725297465
2015        F   65.83230748187663  34.167692518123374

growth_rate

using the same principle than diff…

In [231]: pop.growth_rate(x.time)
Out[231]: 
time  sex\nat                     BE                      FO
2006        M   0.010725110748426206     0.10685103708359522
2006        F   0.025745548165629694  -0.0025252525252525255
2007        M  -0.010149942329873126     0.02555366269165247
2007        F  -0.012967998326709893     0.04430379746835443
2008        M   0.013516662782568165   -0.018272425249169437
2008        F  0.0025429116338207248     0.01818181818181818
2009        M    0.01839503334099793     0.12972363226170333
2009        F   0.019657577679137603                   0.025
2010        M    0.03454504402799729    0.040938592111832255
2010        F   0.009328358208955223     0.11962833914053426
2011        M    0.02073330423395897     0.10023980815347722
2011        F   0.029985623331279524      0.0912863070539419
2012        M   -0.04575582638443447     0.06800348735832606
2012        F    -0.0584247258225324     0.03897338403041825
2013        M    0.03293748599596684     0.06285714285714286
2013        F  -0.002329521389241847     0.03110704483074108
2014        M   0.024945770065075923     0.04032258064516129
2014        F    0.01634472511144131     0.04214729370008873
2015        M    0.02455026455026455     0.06718346253229975
2015        F  0.0052213868003341685     0.06343124733929331

shift

The shift() method drops first label of an axis and shifts all subsequent labels

In [232]: pop.shift(x.time)
Out[232]: 
time  sex\nat    BE    FO
2006        M  4289  1591
2006        F  4661  1584
2007        M  4335  1761
2007        F  4781  1580
2008        M  4291  1806
2008        F  4719  1650
2009        M  4349  1773
2009        F  4731  1680
2010        M  4429  2003
2010        F  4824  1722
2011        M  4582  2085
2011        F  4869  1928
2012        M  4677  2294
2012        F  5015  2104
2013        M  4463  2450
2013        F  4722  2186
2014        M  4610  2604
2014        F  4711  2254
2015        M  4725  2709
2015        F  4788  2349
# when shift is applied on an (increasing) time axis, it effectively brings "past" data into the future
In [233]: pop.shift(x.time).drop_labels(x.time) == pop[2005:2014].drop_labels(x.time)
Out[233]: 
time*  sex\nat    BE    FO
    0        M  True  True
    0        F  True  True
    1        M  True  True
    1        F  True  True
    2        M  True  True
    2        F  True  True
    3        M  True  True
    3        F  True  True
    4        M  True  True
    4        F  True  True
    5        M  True  True
    5        F  True  True
    6        M  True  True
    6        F  True  True
    7        M  True  True
    7        F  True  True
    8        M  True  True
    8        F  True  True
    9        M  True  True
    9        F  True  True
# this is mostly useful when you want to do operations between the past and now
# as an example, here is an alternative implementation of the .diff method seen above:
In [234]: pop.i[1:] - pop.shift(x.time)
Out[234]: 
time  sex\nat    BE   FO
2006        M    46  170
2006        F   120   -4
2007        M   -44   45
2007        F   -62   70
2008        M    58  -33
2008        F    12   30
2009        M    80  230
2009        F    93   42
2010        M   153   82
2010        F    45  206
2011        M    95  209
2011        F   146  176
2012        M  -214  156
2012        F  -293   82
2013        M   147  154
2013        F   -11   68
2014        M   115  105
2014        F    77   95
2015        M   116  182
2015        F    25  149

Misc other interesting functions

There are a lot more functions available:

  • round, floor, ceil, trunc,
  • exp, log, log10,
  • sqrt, absolute, nan_to_num, isnan, isinf, inverse,
  • sin, cos, tan, arcsin, arccos, arctan
  • and many many more…

Sessions

You can group several arrays in a Session

# load several arrays
In [235]: arr1, arr2, arr3 = ndtest((3, 3)), ndtest((4, 2)), ndtest((2, 4))

# create and populate a 'session'
In [236]: s1 = Session()

In [237]: s1.arr1 = arr1

In [238]: s1.arr2 = arr2

In [239]: s1.arr3 = arr3

In [240]: s1
Out[240]: Session(arr1, arr2, arr3)

The advantage of sessions is that you can manipulate all of the arrays in them in one shot

# this saves all the arrays in a single excel file (each array on a different sheet)
In [241]: s1.save('test.xlsx')
# this saves all the arrays in a single HDF5 file (which is a very fast format)
In [242]: s1.save('test.h5')
# this creates a session out of all arrays in the .h5 file
In [243]: s2 = Session('test.h5')

In [244]: s2
Out[244]: Session(arr1, arr2, arr3)
# this creates a session out of all arrays in the .xlsx file
In [245]: s3 = Session('test.xlsx')

In [246]: s3
Out[246]: Session(arr1, arr2, arr3)

You can compare two sessions

In [247]: s1 == s2
Out[247]: 
name  arr1  arr2  arr3
      True  True  True
# let us introduce a difference (a variant, or a mistake perhaps)
In [248]: s2.arr1['a0', 'b1':] = 0
In [249]: s1 == s2
Out[249]: 
name   arr1  arr2  arr3
      False  True  True
In [250]: s1_diff = s1[s1 != s2]

In [251]: s1_diff
Out[251]: Session(arr1)
In [252]: s2_diff = s2[s1 != s2]

In [253]: s2_diff
Out[253]: Session(arr1)

This a bit experimental but can be useful nonetheless (Open a graphical interface)

In [254]: compare(s1_diff.arr1, s2_diff.arr1)