Introduction to Scipp#

Multi-dimensional arrays with labeled dimensions and physical units

scipp.github.io



Scipp is an open-source library developed by ESS for handling, manipulating and visualizing multi-dimensional data arrays.

It enriches raw NumPy-like arrays by adding named dimensions and associated coordinates. In addition, it supports

  • Physical units which are handled in arithmetic operations

  • Histograms, i.e., bin-edge axes, which are by 1 longer than the data extent

  • Propagation of uncertainties



%matplotlib inline
import numpy as np
import scipp as sc
import matplotlib.pyplot as plt

# import scipp_intro
from scipp_utils import quiz, plot, scatter

rng = np.random.default_rng(seed=1234)





1. Labeled dimensions: why do we need them?#

Say we have a 2D rectangular array of data

ny, nx = 10, 20
a = np.sin(np.arange(ny) / (ny / 4)).reshape((-1, 1)) * np.cos(np.arange(nx) / (ny / 4))
a.shape
(10, 20)

that looks like

plot(a)
../_images/0e0ff0dfecf60522d6d49a2703d7ba9771c7d63a43fc923e0b9983d83905f501.png

The task is now to slice out row number 4. Because of the shape of the array, we know that the row dimension is the smallest, so we slice the first dimension of the 2D array:

# Slice out row number 4
plot(a[4, :])
../_images/82d3fd64121fb5eb97d8b935ec144999747ef7d4d7223501e59aa9bd92d6fa74.png

We can’t always deduce from the shape#

Now say we have an array which has a square shape:

ny, nx = 20, 20
a = np.sin(np.arange(ny) / (ny / 4)).reshape((-1, 1)) * np.cos(np.arange(nx) / (ny / 4))
a.shape
(20, 20)
plot(a)
../_images/612af90be78ba4f3e9109e10b2681bfdc5e718c7c5cb8f4007fc727c25256ae2.png

Do we slice the first or the second index of the 2D array?

# Not always obvious which dimension is which
plot(a[:, 4], a[4, :])
../_images/4667add5ac505670d2f60842fecf13dcfa7d6b0fd42f706077348270f0978c62.png

The situation gets worse with more dimensions#

Say we now have an array that has 4 dimensions: x, y, z, t (in that order, maybe?, or is it z, y, x, t, or t, x, y, z?)

a = np.random.random([20] * 4)
a.shape
(20, 20, 20, 20)

Quiz time!

quiz(1)



Introducing labeled dimensions#

        

Xarray (https://docs.xarray.dev) introduced labels to multi-dimensional Numpy arrays.

real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc.

We have embraced, and to a large extent copied, the Xarray mechanism.

var = sc.array(dims=["x", "y", "z", "time"], values=a)
var
Show/Hide data repr Show/Hide attributes
scipp.Variable (1.22 MB)
    • (x: 20, y: 20, z: 20, time: 20)
      float64
      𝟙
      0.983, 0.107, ..., 0.536, 0.876
      Values:
      array([[[[0.9831159 , 0.10719758, 0.41145117, ..., 0.68502521, 0.15217414, 0.34799306], [0.22688799, 0.99166429, 0.27363283, ..., 0.08606551, 0.80876095, 0.7978453 ], [0.61938281, 0.82004747, 0.53072349, ..., 0.90788228, 0.3201691 , 0.47904103], ..., [0.20608843, 0.68895697, 0.22743767, ..., 0.18151568, 0.71060832, 0.96797188], [0.60406176, 0.12690292, 0.91787404, ..., 0.62517385, 0.39292353, 0.2617367 ], [0.79442557, 0.65409706, 0.79416072, ..., 0.31220827, 0.42764837, 0.04638983]], [[0.87189721, 0.52125384, 0.30148701, ..., 0.92429269, 0.60589755, 0.70312794], [0.90118685, 0.12882927, 0.90693596, ..., 0.25262956, 0.14779989, 0.97603977], [0.97884672, 0.29815079, 0.2530791 , ..., 0.9404299 , 0.41477241, 0.8577149 ], ..., [0.19752429, 0.09152861, 0.02202596, ..., 0.2640889 , 0.70628247, 0.98647451], [0.40692789, 0.14864531, 0.55649561, ..., 0.86700508, 0.73705668, 0.07469197], [0.39448171, 0.3563775 , 0.59130664, ..., 0.87735199, 0.50049317, 0.41939839]], [[0.3756235 , 0.78151252, 0.72416978, ..., 0.47620115, 0.39295543, 0.19577445], [0.64936884, 0.39784476, 0.17098117, ..., 0.89809013, 0.63268611, 0.12948835], [0.77388688, 0.4932722 , 0.18395173, ..., 0.92402066, 0.28054436, 0.06482627], ..., [0.54989202, 0.7668139 , 0.37777067, ..., 0.85255608, 0.76007433, 0.46037272], [0.4498564 , 0.42587336, 0.89972693, ..., 0.50608292, 0.39082896, 0.81737138], [0.13421467, 0.03524271, 0.09955752, ..., 0.90160676, 0.95614059, 0.21060229]], ..., [[0.66955678, 0.11561836, 0.47817125, ..., 0.08365637, 0.83487295, 0.28903305], [0.83798696, 0.28613051, 0.11963846, ..., 0.77201079, 0.17872403, 0.48977044], [0.01805862, 0.3607634 , 0.67292289, ..., 0.78408092, 0.69492193, 0.84287478], ..., [0.22140211, 0.86122721, 0.41208083, ..., 0.88110991, 0.24345075, 0.24074636], [0.87161412, 0.71992761, 0.42178735, ..., 0.28489086, 0.31876845, 0.68791583], [0.36568951, 0.13243958, 0.0688192 , ..., 0.67652363, 0.23372173, 0.12242755]], [[0.05951289, 0.80271275, 0.38614982, ..., 0.4344515 , 0.90526496, 0.24868711], [0.70027453, 0.56059858, 0.41451304, ..., 0.04364619, 0.47381292, 0.5542896 ], [0.79398492, 0.61147545, 0.74819835, ..., 0.08623527, 0.78729871, 0.24890762], ..., [0.35861392, 0.06161654, 0.62688673, ..., 0.37969773, 0.64842555, 0.59902677], [0.62909339, 0.77098006, 0.22526371, ..., 0.41640506, 0.28549397, 0.07026757], [0.20779511, 0.12827524, 0.15172894, ..., 0.42971469, 0.64169618, 0.49764402]], [[0.78598369, 0.43815279, 0.49050512, ..., 0.24350223, 0.62652124, 0.22691952], [0.42556127, 0.02928561, 0.96171468, ..., 0.89324856, 0.34395563, 0.18537712], [0.16015393, 0.08135381, 0.85388336, ..., 0.18098586, 0.40927251, 0.42358584], ..., [0.50065214, 0.40590893, 0.23298891, ..., 0.82226277, 0.67239681, 0.36096814], [0.12790011, 0.02495754, 0.25364194, ..., 0.70992428, 0.87778728, 0.86169296], [0.86576449, 0.47418768, 0.93051845, ..., 0.3738504 , 0.56750402, 0.07031641]]], [[[0.0810235 , 0.22385683, 0.3521541 , ..., 0.99755233, 0.003871 , 0.65009725], [0.68507577, 0.69280498, 0.51940692, ..., 0.76410643, 0.88370111, 0.63995563], [0.65466285, 0.35716005, 0.52869396, ..., 0.09042514, 0.71848767, 0.58626943], ..., [0.55033921, 0.32386935, 0.50506466, ..., 0.60882025, 0.16727542, 0.84516137], [0.39482934, 0.3284456 , 0.69847417, ..., 0.82385081, 0.65108475, 0.14496682], [0.0849452 , 0.0047162 , 0.35175525, ..., 0.89601829, 0.61270415, 0.25821068]], [[0.82155764, 0.52914775, 0.84584232, ..., 0.14120091, 0.37066022, 0.23492326], [0.47529645, 0.33716654, 0.57289323, ..., 0.93789858, 0.88088673, 0.59491728], [0.06991419, 0.1611961 , 0.46923903, ..., 0.85851376, 0.02729705, 0.79085418], ..., [0.99244964, 0.41315413, 0.72813182, ..., 0.07074519, 0.47286545, 0.67493473], [0.27169809, 0.21612414, 0.34565149, ..., 0.77130022, 0.32110187, 0.55429469], [0.7919708 , 0.19841168, 0.77273073, ..., 0.25071043, 0.73046886, 0.39669816]], [[0.35479376, 0.94248105, 0.36876325, ..., 0.43904996, 0.95202795, 0.11854037], [0.54747867, 0.23543254, 0.49435365, ..., 0.43739567, 0.08184948, 0.70703154], [0.82120541, 0.55235274, 0.5290074 , ..., 0.92829372, 0.41118577, 0.87791792], ..., [0.62545503, 0.43034121, 0.25861938, ..., 0.20510863, 0.2000553 , 0.96495074], [0.8239874 , 0.24792812, 0.31692709, ..., 0.87092639, 0.48303991, 0.90707038], [0.90028204, 0.77907888, 0.26917609, ..., 0.64452651, 0.96773749, 0.28580608]], ..., [[0.28266196, 0.98353818, 0.03032619, ..., 0.68637175, 0.29426191, 0.65421586], [0.17170535, 0.92507068, 0.85772247, ..., 0.51916504, 0.42270502, 0.38455032], [0.68883042, 0.51260386, 0.53451327, ..., 0.48813393, 0.24402579, 0.08023331], ..., [0.65861257, 0.94707027, 0.20651123, ..., 0.01635761, 0.75739334, 0.84949805], [0.57960882, 0.132154 , 0.38754624, ..., 0.83116126, 0.05464846, 0.60585071], [0.24408699, 0.27621013, 0.42061182, ..., 0.26396533, 0.22794136, 0.32735797]], [[0.74407192, 0.80596963, 0.4255383 , ..., 0.08843931, 0.42395844, 0.7652009 ], [0.36520032, 0.42652561, 0.51005031, ..., 0.58661311, 0.35887107, 0.35891527], [0.90383053, 0.24148413, 0.92004795, ..., 0.87696826, 0.94634779, 0.05778089], ..., [0.20358127, 0.92557072, 0.02726101, ..., 0.06617759, 0.36175771, 0.96996294], [0.02704698, 0.5116217 , 0.79063197, ..., 0.93358793, 0.39000866, 0.52338819], [0.64550534, 0.76151756, 0.16885907, ..., 0.99011483, 0.13998229, 0.21814392]], [[0.01663924, 0.63611496, 0.74741969, ..., 0.3557019 , 0.52469485, 0.12751597], [0.78083982, 0.12226366, 0.47088728, ..., 0.30194991, 0.01590673, 0.63492787], [0.63115764, 0.57728881, 0.10547427, ..., 0.53474767, 0.85179615, 0.62880084], ..., [0.89872964, 0.46860998, 0.38802529, ..., 0.94638679, 0.60066858, 0.75826086], [0.5135382 , 0.21559805, 0.86604188, ..., 0.37264589, 0.40980322, 0.34262607], [0.07254099, 0.0513138 , 0.63069305, ..., 0.27039194, 0.36254276, 0.5246378 ]]], [[[0.37662827, 0.01360986, 0.152563 , ..., 0.32183609, 0.34801126, 0.02430882], [0.80433827, 0.69676625, 0.70993109, ..., 0.27210553, 0.02268278, 0.29060652], [0.59628983, 0.36879887, 0.52070894, ..., 0.78035757, 0.67024744, 0.9682647 ], ..., [0.5798377 , 0.72411221, 0.04333832, ..., 0.46326505, 0.01193127, 0.75772521], [0.79788723, 0.40260299, 0.57699412, ..., 0.61826102, 0.51239577, 0.71156066], [0.35284412, 0.54877646, 0.77422604, ..., 0.35096179, 0.67933803, 0.5289583 ]], [[0.75612686, 0.81360643, 0.3354102 , ..., 0.22885406, 0.58929976, 0.87532708], [0.17941477, 0.61356651, 0.57462335, ..., 0.90433739, 0.92110873, 0.97927453], [0.81386622, 0.90079305, 0.14974827, ..., 0.61055261, 0.46939267, 0.77127763], ..., [0.56517953, 0.07779371, 0.81718187, ..., 0.83182359, 0.63024785, 0.13334549], [0.97890086, 0.37394046, 0.42117291, ..., 0.75603307, 0.91777376, 0.30304552], [0.2874504 , 0.68941378, 0.63531855, ..., 0.44154547, 0.6546478 , 0.45659147]], [[0.68161163, 0.72670984, 0.73776355, ..., 0.87654972, 0.87411016, 0.07848855], [0.65137482, 0.33879013, 0.2019538 , ..., 0.83373438, 0.93359598, 0.5032787 ], [0.67630034, 0.65662898, 0.61477922, ..., 0.14556242, 0.0344506 , 0.43331542], ..., [0.98287083, 0.02231671, 0.22414388, ..., 0.56903639, 0.62906692, 0.1099617 ], [0.67803664, 0.05718234, 0.4966607 , ..., 0.74883722, 0.62491645, 0.9953686 ], [0.1500487 , 0.26840732, 0.32581504, ..., 0.75020858, 0.93652315, 0.39455034]], ..., [[0.7472917 , 0.8483616 , 0.69670719, ..., 0.13209312, 0.53009177, 0.15385741], [0.49723449, 0.78983683, 0.59712488, ..., 0.96090173, 0.88924766, 0.35889161], [0.94681982, 0.26353545, 0.12691129, ..., 0.15844645, 0.30137071, 0.31920334], ..., [0.44822763, 0.79705096, 0.18941599, ..., 0.79210567, 0.29803706, 0.64724648], [0.51107109, 0.35554393, 0.14988997, ..., 0.82165044, 0.60011857, 0.60148218], [0.26421386, 0.46534893, 0.14599475, ..., 0.90378301, 0.66208288, 0.63354469]], [[0.58571716, 0.67223757, 0.08799764, ..., 0.65809095, 0.50440643, 0.62710872], [0.90459015, 0.36423165, 0.37523959, ..., 0.11553181, 0.72350383, 0.70497335], [0.13686471, 0.70187431, 0.62514037, ..., 0.55834565, 0.29052105, 0.13028478], ..., [0.24781693, 0.66688346, 0.73263588, ..., 0.96462843, 0.30546553, 0.1603606 ], [0.22675492, 0.25556477, 0.89820432, ..., 0.31448568, 0.79824035, 0.43277984], [0.80741175, 0.13353863, 0.92501974, ..., 0.71772415, 0.78833046, 0.51408333]], [[0.89001658, 0.171241 , 0.20701742, ..., 0.04906035, 0.96813317, 0.36871473], [0.90491579, 0.7836709 , 0.36175161, ..., 0.13714082, 0.35758516, 0.73389621], [0.74028799, 0.60225092, 0.83182787, ..., 0.42046834, 0.44001046, 0.14787732], ..., [0.90527258, 0.78868989, 0.50112354, ..., 0.85621952, 0.97636925, 0.31894808], [0.20754274, 0.78786682, 0.76456105, ..., 0.62288212, 0.23415147, 0.16775884], [0.77986585, 0.31313518, 0.52097307, ..., 0.30639618, 0.71144968, 0.65226039]]], ..., [[[0.28897777, 0.41809829, 0.78401411, ..., 0.09030589, 0.80049372, 0.96698212], [0.40560248, 0.40375766, 0.03320377, ..., 0.27885419, 0.88433182, 0.19005622], [0.58044225, 0.95215577, 0.37443729, ..., 0.50456517, 0.25143504, 0.75258544], ..., [0.8662712 , 0.5637265 , 0.53061135, ..., 0.35242589, 0.75715889, 0.5791487 ], [0.64542807, 0.63810255, 0.33931917, ..., 0.04830852, 0.65969083, 0.83278627], [0.48689428, 0.79424555, 0.38016363, ..., 0.64995464, 0.64154217, 0.95273877]], [[0.0452467 , 0.56980737, 0.50041077, ..., 0.07979944, 0.31484085, 0.89254475], [0.56965571, 0.49665184, 0.18307276, ..., 0.30223874, 0.70461367, 0.45553255], [0.07001136, 0.75833016, 0.72328839, ..., 0.34655761, 0.7583817 , 0.77759813], ..., [0.86755675, 0.27120306, 0.50220264, ..., 0.09265631, 0.83804185, 0.68059143], [0.64772869, 0.26639818, 0.63584085, ..., 0.51037184, 0.09435423, 0.0410012 ], [0.89778904, 0.07965095, 0.17147637, ..., 0.83151281, 0.78067749, 0.86111656]], [[0.29863662, 0.61089181, 0.23583289, ..., 0.63368909, 0.25144819, 0.41085924], [0.29589851, 0.84078869, 0.63453003, ..., 0.56421582, 0.02392647, 0.66485537], [0.80337285, 0.20360852, 0.43049746, ..., 0.73386142, 0.28963916, 0.04974921], ..., [0.16558923, 0.16748987, 0.35057927, ..., 0.71085434, 0.94347173, 0.86495165], [0.33192521, 0.79687409, 0.99443353, ..., 0.72549715, 0.19094724, 0.47764551], [0.08476964, 0.91385344, 0.84600194, ..., 0.64661289, 0.79904745, 0.99167192]], ..., [[0.66071191, 0.16042809, 0.83340744, ..., 0.88307952, 0.16469817, 0.28973696], [0.74015492, 0.50669768, 0.14356568, ..., 0.90324456, 0.61492119, 0.47247367], [0.13733939, 0.27006708, 0.14883468, ..., 0.05699596, 0.64535171, 0.78551129], ..., [0.5681443 , 0.31393547, 0.94625783, ..., 0.57854357, 0.18415403, 0.7519436 ], [0.5794495 , 0.08982994, 0.73830402, ..., 0.26232955, 0.36904501, 0.19255628], [0.87400888, 0.8431952 , 0.22380597, ..., 0.96692589, 0.63854195, 0.69871842]], [[0.43602847, 0.27303667, 0.48635819, ..., 0.56069231, 0.70533617, 0.158989 ], [0.94425562, 0.01632062, 0.09452716, ..., 0.20467916, 0.93789087, 0.8247514 ], [0.03379684, 0.63954844, 0.25762306, ..., 0.92538156, 0.26200588, 0.64873279], ..., [0.81205363, 0.53698451, 0.45108861, ..., 0.26532624, 0.35101593, 0.40416004], [0.52106992, 0.26827291, 0.62527143, ..., 0.03439733, 0.63048968, 0.78699457], [0.34255244, 0.69527547, 0.01862796, ..., 0.96437431, 0.19927145, 0.33392903]], [[0.18537555, 0.79352616, 0.36746149, ..., 0.63884753, 0.62453829, 0.82196125], [0.05999899, 0.43371588, 0.4058158 , ..., 0.56452214, 0.50857307, 0.78781363], [0.09413258, 0.23769409, 0.07168316, ..., 0.86494548, 0.25826305, 0.79701473], ..., [0.15781176, 0.17926414, 0.30111299, ..., 0.71383837, 0.19793462, 0.73423556], [0.48987274, 0.05541171, 0.80288845, ..., 0.65389416, 0.30998212, 0.8728796 ], [0.3709216 , 0.18806515, 0.80354319, ..., 0.49281414, 0.58115379, 0.06985836]]], [[[0.18308052, 0.24573569, 0.93842424, ..., 0.2591203 , 0.86375222, 0.04873919], [0.41463305, 0.19620368, 0.83119133, ..., 0.69244854, 0.63218874, 0.60241586], [0.49041417, 0.62697106, 0.50999164, ..., 0.38810929, 0.07703945, 0.57281782], ..., [0.08691434, 0.23908359, 0.67404986, ..., 0.75135073, 0.07396861, 0.05829659], [0.16722536, 0.72831553, 0.61549392, ..., 0.48874485, 0.40137536, 0.42490938], [0.3937442 , 0.47028155, 0.45071916, ..., 0.77293343, 0.17482084, 0.73879207]], [[0.39291953, 0.81812886, 0.42553067, ..., 0.17802388, 0.82146001, 0.10052558], [0.39331644, 0.33222575, 0.75746372, ..., 0.03667198, 0.02451217, 0.24312666], [0.23210528, 0.7850013 , 0.55225187, ..., 0.7357575 , 0.3965793 , 0.26650953], ..., [0.06797099, 0.79609021, 0.64754909, ..., 0.78580404, 0.32862163, 0.5448191 ], [0.44440685, 0.34238599, 0.32275056, ..., 0.03314827, 0.95831774, 0.50531766], [0.09667621, 0.8994035 , 0.80769839, ..., 0.91022379, 0.24966311, 0.29752085]], [[0.65159761, 0.07652644, 0.88319822, ..., 0.3582827 , 0.28722282, 0.87677689], [0.34634909, 0.78856857, 0.86455049, ..., 0.20129835, 0.47708511, 0.44950797], [0.53359239, 0.32982415, 0.89934769, ..., 0.66349936, 0.88067406, 0.39838581], ..., [0.17816724, 0.90960295, 0.71444861, ..., 0.20434683, 0.22993685, 0.29833253], [0.04080748, 0.73432406, 0.52156495, ..., 0.51031765, 0.82898833, 0.76222927], [0.05584104, 0.86817371, 0.50570589, ..., 0.84272766, 0.04029628, 0.90117925]], ..., [[0.15289849, 0.82025645, 0.7681076 , ..., 0.70524693, 0.53487862, 0.3869719 ], [0.08069913, 0.51210879, 0.52730136, ..., 0.3774398 , 0.5588179 , 0.01296483], [0.3677369 , 0.38657329, 0.43084338, ..., 0.6203898 , 0.80610955, 0.20265088], ..., [0.35250022, 0.23428845, 0.41410938, ..., 0.27891348, 0.9937924 , 0.36142577], [0.70804393, 0.65100883, 0.57713066, ..., 0.30657189, 0.88737481, 0.46362721], [0.08321195, 0.6976224 , 0.3086068 , ..., 0.88969315, 0.70997644, 0.98652819]], [[0.25222569, 0.46148946, 0.39179222, ..., 0.86912584, 0.40965129, 0.95602511], [0.97271716, 0.04395097, 0.80737458, ..., 0.10151205, 0.46233083, 0.70257443], [0.07394265, 0.76829432, 0.32925113, ..., 0.19191598, 0.69572326, 0.09120943], ..., [0.94663855, 0.31425786, 0.32203533, ..., 0.10200907, 0.83718269, 0.15146661], [0.883481 , 0.96554696, 0.34974642, ..., 0.37454546, 0.47972929, 0.44874989], [0.85086919, 0.82468642, 0.58437814, ..., 0.28884651, 0.35687385, 0.02403875]], [[0.64524316, 0.99403746, 0.22013019, ..., 0.90402422, 0.45372969, 0.11648419], [0.17185131, 0.58235867, 0.22661496, ..., 0.15461825, 0.70451024, 0.12247365], [0.83357583, 0.26697512, 0.2671855 , ..., 0.84533582, 0.7044121 , 0.08594047], ..., [0.30631435, 0.42423852, 0.33756825, ..., 0.36604261, 0.81536907, 0.29166207], [0.04998002, 0.0987755 , 0.08226919, ..., 0.10189586, 0.58545021, 0.16403482], [0.07374776, 0.15717542, 0.94468209, ..., 0.50106127, 0.70553933, 0.56046855]]], [[[0.79612887, 0.69984681, 0.58253428, ..., 0.68728334, 0.28730536, 0.84359168], [0.33529947, 0.92990259, 0.86497706, ..., 0.83474283, 0.45553143, 0.90379922], [0.60779245, 0.44854235, 0.95554725, ..., 0.14544303, 0.48911669, 0.90556703], ..., [0.60878465, 0.1258369 , 0.11317343, ..., 0.02067839, 0.81411411, 0.77770566], [0.22982169, 0.23471682, 0.44018992, ..., 0.60628969, 0.97318039, 0.89226003], [0.1219005 , 0.27033947, 0.76905104, ..., 0.89774284, 0.93285858, 0.96337748]], [[0.4451001 , 0.80497765, 0.9025736 , ..., 0.65759665, 0.88915054, 0.09847492], [0.62459343, 0.63631368, 0.65059125, ..., 0.61453064, 0.70442078, 0.43622055], [0.55852236, 0.28925104, 0.39951607, ..., 0.65520055, 0.12002531, 0.35061589], ..., [0.42447906, 0.29489632, 0.92880131, ..., 0.63065246, 0.65006586, 0.54672686], [0.4597447 , 0.22025616, 0.64030165, ..., 0.75520438, 0.31972411, 0.03949479], [0.5669226 , 0.09087324, 0.55963227, ..., 0.47658592, 0.12423615, 0.96482476]], [[0.52742725, 0.37005623, 0.7440747 , ..., 0.71089951, 0.48773766, 0.09519228], [0.36842121, 0.89264431, 0.55399524, ..., 0.40451315, 0.88671277, 0.80182513], [0.106526 , 0.09116179, 0.74137405, ..., 0.67965089, 0.48831742, 0.29818795], ..., [0.45073976, 0.01912947, 0.49926053, ..., 0.46090507, 0.41626492, 0.0298117 ], [0.09022588, 0.8544457 , 0.44667055, ..., 0.24920863, 0.14637087, 0.5601132 ], [0.30323382, 0.92242948, 0.85620303, ..., 0.08156375, 0.78604965, 0.17091402]], ..., [[0.18371477, 0.0961658 , 0.20497762, ..., 0.56801992, 0.52042481, 0.07577535], [0.66856454, 0.81375874, 0.99939605, ..., 0.39445158, 0.69532794, 0.93102736], [0.94653508, 0.81052472, 0.07344357, ..., 0.69580067, 0.80909056, 0.68757428], ..., [0.92631859, 0.58334046, 0.5843174 , ..., 0.96361508, 0.30580511, 0.52427972], [0.72674791, 0.87501667, 0.87895511, ..., 0.88240901, 0.64913473, 0.81177623], [0.27353047, 0.97933286, 0.25505259, ..., 0.85266405, 0.81406393, 0.47884571]], [[0.72281716, 0.78711358, 0.8808617 , ..., 0.58407864, 0.99392288, 0.60314355], [0.47815179, 0.48413966, 0.88999685, ..., 0.68598356, 0.43932894, 0.95152772], [0.47601707, 0.60945273, 0.58774511, ..., 0.32775071, 0.60605983, 0.17948314], ..., [0.15031913, 0.72478013, 0.71594343, ..., 0.73738194, 0.22655036, 0.85266639], [0.79176048, 0.5087181 , 0.45607419, ..., 0.33495609, 0.75142596, 0.78046752], [0.38084213, 0.78781025, 0.99358002, ..., 0.50999579, 0.67332853, 0.89733754]], [[0.67889079, 0.43856926, 0.86405702, ..., 0.31604336, 0.11425999, 0.54565878], [0.99794154, 0.97199402, 0.2450148 , ..., 0.48017391, 0.0305332 , 0.39533776], [0.46685505, 0.87406969, 0.85476088, ..., 0.71543678, 0.66335682, 0.89781525], ..., [0.51962647, 0.72803815, 0.04473266, ..., 0.48429839, 0.63145511, 0.16980772], [0.88278778, 0.58946675, 0.69090405, ..., 0.65411643, 0.14511123, 0.36379289], [0.26956958, 0.44042052, 0.89964745, ..., 0.40379701, 0.5360206 , 0.87586859]]]])

Quiz time again!

Can you guess the syntax?

quiz(2)



Getting the z slice is now easy and readable.



Adding coordinates#

  • Coordinates can be specified for each dimension.

  • They describe the extent of each axis, as well as how far each data point is from its neighbours.

Here is an array that represents air pollution levels as a function of altitude and time.

data = sc.array(
    dims=["altitude", "year"],
    values=np.linspace(500, 10, 5).reshape((5, 1)) * rng.random(10),
)
sc.show(data)
dims=('altitude', 'year'), shape=(5, 10), unit=dimensionless, variances=Falsevalues altitudeyear
data.plot()
../_images/c2f536765ada469e90700f06cb59856a7375703a89a461a9b3e151a1172bffa9.svg

In Scipp and Xarray, coordinates are added in a data structure called DataArray:

da = sc.DataArray(
    data=data,
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5),
    },
)
sc.show(da)
(dims=('altitude', 'year'), shape=(5, 10), unit=dimensionless, variances=False)values altitudeyear altit..altitude(dims=('altitude',), shape=(5,), unit=dimensionless, variances=False)values altitude
da
Show/Hide data repr Show/Hide attributes
scipp.DataArray (1.43 KB)
    • altitude: 5
    • year: 10
    • altitude
      (altitude)
      float64
      𝟙
      0.0, 2000.000, 4000.000, 6000.000, 8000.000
      Values:
      array([ 0., 2000., 4000., 6000., 8000.])
    • (altitude, year)
      float64
      𝟙
      488.350, 190.098, ..., 9.641, 2.636
      Values:
      array([[488.34988335, 190.09786751, 461.62311688, 130.84621193, 159.54852921, 59.04561648, 120.88314663, 159.26696439, 482.03962259, 131.82490214], [368.70416193, 143.52388997, 348.52545325, 98.78889001, 120.45913955, 44.57944044, 91.2667757 , 120.24655812, 363.93991505, 99.52780111], [249.05844051, 96.94991243, 235.42778961, 66.73156809, 81.3697499 , 30.11326441, 61.65040478, 81.22615184, 245.84020752, 67.23070009], [129.41271909, 50.37593489, 122.33012597, 34.67424616, 42.28036024, 15.64708837, 32.03403386, 42.20574556, 127.74049999, 34.93359907], [ 9.76699767, 3.80195735, 9.23246234, 2.61692424, 3.19097058, 1.18091233, 2.41766293, 3.18533929, 9.64079245, 2.63649804]])
da.plot()
../_images/568bbe14e93219dc6a2239cced78f8e71ea825767d44d7055f300e8b45a0dc86.svg

Accessing and adding more coordinates#

Coordinates are stored in a dict, and each dimension can have more than one coordinate.

Getting and setting coordinates is done using the same syntax as Python dicts:

print(da.coords.keys())
da.coords["altitude"]
<scipp.Dict.keys {altitude}>
Show/Hide data repr Show/Hide attributes
scipp.Variable (296 Bytes)
    • (altitude: 5)
      float64
      𝟙
      0.0, 2000.000, 4000.000, 6000.000, 8000.000
      Values:
      array([ 0., 2000., 4000., 6000., 8000.])

Exercise 1.1: Adding a new coordinate#

The air pollution data was collected every year from 2014 to 2023; [2014, 2024).
Let’s add a coordinate, year to the year dimension.

Tip: You can create a Variable with consecutive numbers by using sc.arange(dim, start, stop).

Hint:

da = sc.DataArray(
    data=data,
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5),
        "year": sc.arange(..., 2014, ...)
    },
)

or

da.coords['year'] = sc.arange(..., 2014, ...)

Solution:

Hide code cell content
da = sc.DataArray(
    data=data,
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5),
        "year": sc.arange("year", 2014, 2024),
    },
)
sc.show(da)
da
(dims=('altitude', 'year'), shape=(5, 10), unit=dimensionless, variances=False)values altitudeyear altit..altitude(dims=('altitude',), shape=(5,), unit=dimensionless, variances=False)values altitude yearyear(dims=('year',), shape=(10,), unit=dimensionless, variances=False)values year
Show/Hide data repr Show/Hide attributes
scipp.DataArray (1.76 KB)
    • altitude: 5
    • year: 10
    • altitude
      (altitude)
      float64
      𝟙
      0.0, 2000.000, 4000.000, 6000.000, 8000.000
      Values:
      array([ 0., 2000., 4000., 6000., 8000.])
    • year
      (year)
      int64
      𝟙
      2014, 2015, ..., 2022, 2023
      Values:
      array([2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023])
    • (altitude, year)
      float64
      𝟙
      488.350, 190.098, ..., 9.641, 2.636
      Values:
      array([[488.34988335, 190.09786751, 461.62311688, 130.84621193, 159.54852921, 59.04561648, 120.88314663, 159.26696439, 482.03962259, 131.82490214], [368.70416193, 143.52388997, 348.52545325, 98.78889001, 120.45913955, 44.57944044, 91.2667757 , 120.24655812, 363.93991505, 99.52780111], [249.05844051, 96.94991243, 235.42778961, 66.73156809, 81.3697499 , 30.11326441, 61.65040478, 81.22615184, 245.84020752, 67.23070009], [129.41271909, 50.37593489, 122.33012597, 34.67424616, 42.28036024, 15.64708837, 32.03403386, 42.20574556, 127.74049999, 34.93359907], [ 9.76699767, 3.80195735, 9.23246234, 2.61692424, 3.19097058, 1.18091233, 2.41766293, 3.18533929, 9.64079245, 2.63649804]])

Exercise 1.2: Compute new coordinate#

Add a new coordinate representing the Scipp-year.

Hint: Scipp was first released in 2020

Solution:

Hide code cell content
da.coords["scipp-year"] = da.coords["year"] - 2020
sc.show(da)
da
(dims=('altitude', 'year'), shape=(5, 10), unit=dimensionless, variances=False)values altitudeyear altit..altitude(dims=('altitude',), shape=(5,), unit=dimensionless, variances=False)values altitude scipp-yearscipp-year(dims=('year',), shape=(10,), unit=dimensionless, variances=False)values year yearyear(dims=('year',), shape=(10,), unit=dimensionless, variances=False)values year
Show/Hide data repr Show/Hide attributes
scipp.DataArray (2.09 KB)
    • altitude: 5
    • year: 10
    • altitude
      (altitude)
      float64
      𝟙
      0.0, 2000.000, 4000.000, 6000.000, 8000.000
      Values:
      array([ 0., 2000., 4000., 6000., 8000.])
    • scipp-year
      (year)
      int64
      𝟙
      -6, -5, ..., 2, 3
      Values:
      array([-6, -5, -4, -3, -2, -1, 0, 1, 2, 3])
    • year
      (year)
      int64
      𝟙
      2014, 2015, ..., 2022, 2023
      Values:
      array([2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023])
    • (altitude, year)
      float64
      𝟙
      488.350, 190.098, ..., 9.641, 2.636
      Values:
      array([[488.34988335, 190.09786751, 461.62311688, 130.84621193, 159.54852921, 59.04561648, 120.88314663, 159.26696439, 482.03962259, 131.82490214], [368.70416193, 143.52388997, 348.52545325, 98.78889001, 120.45913955, 44.57944044, 91.2667757 , 120.24655812, 363.93991505, 99.52780111], [249.05844051, 96.94991243, 235.42778961, 66.73156809, 81.3697499 , 30.11326441, 61.65040478, 81.22615184, 245.84020752, 67.23070009], [129.41271909, 50.37593489, 122.33012597, 34.67424616, 42.28036024, 15.64708837, 32.03403386, 42.20574556, 127.74049999, 34.93359907], [ 9.76699767, 3.80195735, 9.23246234, 2.61692424, 3.19097058, 1.18091233, 2.41766293, 3.18533929, 9.64079245, 2.63649804]])









2. Going further#

2.1 Physical units#

Every data variable and coordinate in Scipp has physical units. (see also pint, astropy.units, pint-xarray)

Array Variable with unit:

temperature = sc.array(dims=["time"], values=[300.0, 301.0, 312.0, 340.0], unit="K")
temperature
Show/Hide data repr Show/Hide attributes
scipp.Variable (288 Bytes)
    • (time: 4)
      float64
      K
      300.0, 301.0, 312.0, 340.0
      Values:
      array([300., 301., 312., 340.])

Scalar Variable (no dimensions) with unit:

sound_speed = sc.scalar(340.0, unit="m/s")
sound_speed
Show/Hide data repr Show/Hide attributes
scipp.Variable (264 Bytes)
    • ()
      float64
      m/s
      340.0
      Values:
      array(340.)

Coordinates and data with units in a DataArray:

cph_air = sc.DataArray(
    data=sc.array(
        dims=["altitude", "year"],
        values=np.linspace(500, 10, 5).reshape((5, 1)) * rng.random(10),
        unit="m-3",
    ),
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5, unit="m"),
        "year": sc.arange("year", 2014, 2024, unit="year"),
    },
)
cph_air
Show/Hide data repr Show/Hide attributes
scipp.DataArray (1.76 KB)
    • altitude: 5
    • year: 10
    • altitude
      (altitude)
      float64
      m
      0.0, 2000.000, 4000.000, 6000.000, 8000.000
      Values:
      array([ 0., 2000., 4000., 6000., 8000.])
    • year
      (year)
      int64
      Y
      2014, 2015, ..., 2022, 2023
      Values:
      array([2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023])
    • (altitude, year)
      float64
      1/m^3
      220.503, 304.935, ..., 1.721, 8.704
      Values:
      array([[220.50306103, 304.93540471, 431.81064828, 431.87883539, 337.44065667, 329.93717398, 367.87884916, 111.37682907, 86.03309233, 435.20748624], [166.47981108, 230.22623056, 326.01703946, 326.06852072, 254.76769579, 249.10256635, 277.74853111, 84.08950594, 64.95498471, 328.58165211], [112.45656112, 155.5170564 , 220.22343063, 220.25820605, 172.0947349 , 168.26795873, 187.61821307, 56.80218282, 43.87687709, 221.95581798], [ 58.43331117, 80.80788225, 114.4298218 , 114.44789138, 89.42177402, 87.4333511 , 97.48789503, 29.5148597 , 22.79876947, 115.32998385], [ 4.41006122, 6.09870809, 8.63621297, 8.63757671, 6.74881313, 6.59874348, 7.35757698, 2.22753658, 1.72066185, 8.70414972]])

Units are automatically handled in arithmetic operations.

Say we know the mean ultra-fine particle mass

ultra_fine_particle_mass = sc.scalar(1.0e-6, unit="kg")

cph_air *= ultra_fine_particle_mass
cph_air
Show/Hide data repr Show/Hide attributes
scipp.DataArray (1.76 KB)
    • altitude: 5
    • year: 10
    • altitude
      (altitude)
      float64
      m
      0.0, 2000.000, 4000.000, 6000.000, 8000.000
      Values:
      array([ 0., 2000., 4000., 6000., 8000.])
    • year
      (year)
      int64
      Y
      2014, 2015, ..., 2022, 2023
      Values:
      array([2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023])
    • (altitude, year)
      float64
      kg/m^3
      0.000, 0.000, ..., 1.721e-06, 8.704e-06
      Values:
      array([[2.20503061e-04, 3.04935405e-04, 4.31810648e-04, 4.31878835e-04, 3.37440657e-04, 3.29937174e-04, 3.67878849e-04, 1.11376829e-04, 8.60330923e-05, 4.35207486e-04], [1.66479811e-04, 2.30226231e-04, 3.26017039e-04, 3.26068521e-04, 2.54767696e-04, 2.49102566e-04, 2.77748531e-04, 8.40895059e-05, 6.49549847e-05, 3.28581652e-04], [1.12456561e-04, 1.55517056e-04, 2.20223431e-04, 2.20258206e-04, 1.72094735e-04, 1.68267959e-04, 1.87618213e-04, 5.68021828e-05, 4.38768771e-05, 2.21955818e-04], [5.84333112e-05, 8.08078822e-05, 1.14429822e-04, 1.14447891e-04, 8.94217740e-05, 8.74333511e-05, 9.74878950e-05, 2.95148597e-05, 2.27987695e-05, 1.15329984e-04], [4.41006122e-06, 6.09870809e-06, 8.63621297e-06, 8.63757671e-06, 6.74881313e-06, 6.59874348e-06, 7.35757698e-06, 2.22753658e-06, 1.72066185e-06, 8.70414972e-06]])



Units also provide protection#

Say we now also have air pollution data for another city, e.g., NYC.

We would like to compute the difference between CPH and NYC air pollution (as a function of altitude and year), but we forgot to multiply the NYC data by particle mass:

nyc_air = sc.DataArray(
    data=sc.array(
        dims=["altitude", "year"],
        values=np.linspace(800, 20, 5).reshape((5, 1)) * rng.random(10),
        unit="m-3",
    ),
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5, unit="m"),
        "year": sc.arange("year", 2014, 2024, unit="year"),
    },
)

cph_air - nyc_air
---------------------------------------------------------------------------
UnitError                                 Traceback (most recent call last)
Cell In[24], line 13
      1 nyc_air = sc.DataArray(
      2     data=sc.array(
      3         dims=["altitude", "year"],
   (...)
     10     },
     11 )
---> 13 cph_air - nyc_air

UnitError: Cannot subtract kg/m^3 and 1/m^3.
nyc_air *= ultra_fine_particle_mass

air_difference = cph_air - nyc_air
air_difference.plot()
../_images/0ad0ed5df04074c3478dc46f1d9f5c79f70385588e945ad68867499de1ba4063.svg
  • Units are very useful in early prevention of difficult-to-spot bugs in a workflow.

  • They save hours of debugging time, free-up mental capacity and let the user focus on the important thing: doing science.







Units for label-based indexing#

We also use units to distinguish between positional indexing and label-based indexing:

cph_air["altitude", 2000.0 * sc.Unit("m")].plot()
../_images/e1c937c85db8d803c9f0b6db572e3cc5eb898bf31da42f30531f33493796c129.svg

Positional indices are based on the dimension, and value indices are based on the coordinates.







Exercise 2: Coordinate and Units#

We have a data array that contains air polution as a function of year and altitude above the city of Copenhagen. However, we want to have a pressure coordinate for the altitude dimension instead of altitude.

Assuming a constant air temperature \(T\) of 300 K, the pressure as a function of height \(h\) is given by

\[ P = P_{0} \exp{\left[ \frac{-g_{0}Mh}{RT} \right]} \]

Here is the incomplete function altitude_to_pressure that converts altitude[m] into pressure[hPa].

Complete the function and use it to add the pressure coordinate to cph_air.

def altitude_to_pressure(altitude):
    M = sc.scalar(0.0289644, unit="kg/mol")
    g0 = sc.scalar(9.80665, unit="m/s2")
    R = sc.scalar(8.3144598, unit="J/mol/K")
    T = sc.scalar(300.0)
    p0 = sc.scalar(1013.25, unit="hPa")
    return p0 * sc.exp(-g0 * M * altitude / (R * T))

Solution:

Hide code cell content
def altitude_to_pressure(altitude):
    M = sc.scalar(0.0289644, unit="kg/mol")
    g0 = sc.scalar(9.80665, unit="m/s2")
    R = sc.scalar(8.3144598, unit="J/mol/K")
    T = sc.scalar(300.0, unit="K")
    p0 = sc.scalar(1013.25, unit="hPa")
    return p0 * sc.exp(-g0 * M * altitude / (R * T))


cph_air.coords["pressure"] = altitude_to_pressure(cph_air.coords["altitude"])
cph_air
Show/Hide data repr Show/Hide attributes
scipp.DataArray (2.05 KB)
    • altitude: 5
    • year: 10
    • altitude
      (altitude)
      float64
      m
      0.0, 2000.000, 4000.000, 6000.000, 8000.000
      Values:
      array([ 0., 2000., 4000., 6000., 8000.])
    • pressure
      (altitude)
      float64
      100Pa
      1013.250, 806.874, 642.532, 511.663, 407.449
      Values:
      array([1013.25 , 806.87395253, 642.53202593, 511.66282298, 407.44870895])
    • year
      (year)
      int64
      Y
      2014, 2015, ..., 2022, 2023
      Values:
      array([2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023])
    • (altitude, year)
      float64
      kg/m^3
      0.000, 0.000, ..., 1.721e-06, 8.704e-06
      Values:
      array([[2.20503061e-04, 3.04935405e-04, 4.31810648e-04, 4.31878835e-04, 3.37440657e-04, 3.29937174e-04, 3.67878849e-04, 1.11376829e-04, 8.60330923e-05, 4.35207486e-04], [1.66479811e-04, 2.30226231e-04, 3.26017039e-04, 3.26068521e-04, 2.54767696e-04, 2.49102566e-04, 2.77748531e-04, 8.40895059e-05, 6.49549847e-05, 3.28581652e-04], [1.12456561e-04, 1.55517056e-04, 2.20223431e-04, 2.20258206e-04, 1.72094735e-04, 1.68267959e-04, 1.87618213e-04, 5.68021828e-05, 4.38768771e-05, 2.21955818e-04], [5.84333112e-05, 8.08078822e-05, 1.14429822e-04, 1.14447891e-04, 8.94217740e-05, 8.74333511e-05, 9.74878950e-05, 2.95148597e-05, 2.27987695e-05, 1.15329984e-04], [4.41006122e-06, 6.09870809e-06, 8.63621297e-06, 8.63757671e-06, 6.74881313e-06, 6.59874348e-06, 7.35757698e-06, 2.22753658e-06, 1.72066185e-06, 8.70414972e-06]])











2.2 Histogramming and bin-edge coordinates#

  • It is sometimes necessary to have coordinates that represent a range for each data value.

  • E.g., “the temperature was 310 K in the time span between 10 and 20 seconds”.

  • This also arises every time we histogram data, as in the image above.

  • Scipp supports this by having bin-edge coordinates: a coordinate which has a length of 1 more than the dimension length.

The next data set is meant to represent photon events in a camera. We have a long list of x and y positions for the photons.

x = sc.array(dims=["row"], values=rng.normal(size=10000), unit="cm")
y = sc.array(dims=["row"], values=rng.normal(size=10000), unit="cm")
recording = sc.DataArray(
    data=sc.ones(sizes=x.sizes, unit="counts"), coords={"x": x, "y": y}
)
recording
Show/Hide data repr Show/Hide attributes
scipp.DataArray (235.63 KB)
    • row: 10000
    • x
      (row)
      float64
      cm
      -0.679, -0.621, ..., -0.788, 1.123
      Values:
      array([-0.67924997, -0.62053203, 1.33121422, ..., 0.66931119, -0.78846271, 1.12268421])
    • y
      (row)
      float64
      cm
      -0.306, -0.704, ..., -0.848, 0.739
      Values:
      array([-0.30610691, -0.70374693, -1.00885753, ..., -0.42935023, -0.84826518, 0.73921398])
    • (row)
      float64
      counts
      1.0, 1.0, ..., 1.0, 1.0
      Values:
      array([1., 1., 1., ..., 1., 1., 1.])
scatter(x.values, y.values)
../_images/31cf9241e60db621d5677c7fe04dd0f7d7722c8a30caa4a5a9d033c5cbf5435b.png

It is very common to histogram such data.

In Scipp, histogramming has a very concise and easy-to-use syntax. To make 8 bins in both the x and y dimensions:

image = recording.hist(y=8, x=8)
image.plot(aspect="equal")
../_images/4b3ae1a4705d506ab3862d342fd46a1c4ef81d53437f26dfd9e9713f3f750f96.svg

The x and y coordinates are now bin-edge coordinates.

sc.show(image)
image
(dims=('y', 'x'), shape=(8, 8), unit=counts, variances=False)values yx yy(dims=('y',), shape=(9,), unit=cm, variances=False)values y xx(dims=('x',), shape=(9,), unit=cm, variances=False)values x
Show/Hide data repr Show/Hide attributes
scipp.DataArray (1.89 KB)
    • y: 8
    • x: 8
    • x
      (x [bin-edge])
      float64
      cm
      -3.819, -2.831, ..., 3.095, 4.083
      Values:
      array([-3.81886182, -2.83110929, -1.84335675, -0.85560421, 0.13214833, 1.11990087, 2.10765341, 3.09540594, 4.08315848])
    • y
      (y [bin-edge])
      float64
      cm
      -3.646, -2.663, ..., 3.236, 4.219
      Values:
      array([-3.64602432, -2.66291545, -1.67980659, -0.69669773, 0.28641113, 1.26952 , 2.25262886, 3.23573772, 4.21884658])
    • (y, x)
      float64
      counts
      0.0, 2.0, ..., 1.0, 0.0
      Values:
      array([[0.000e+00, 2.000e+00, 8.000e+00, 1.200e+01, 1.000e+01, 5.000e+00, 0.000e+00, 0.000e+00], [1.000e+00, 1.100e+01, 8.100e+01, 1.460e+02, 1.190e+02, 4.600e+01, 6.000e+00, 1.000e+00], [4.000e+00, 6.200e+01, 3.290e+02, 7.620e+02, 6.310e+02, 2.370e+02, 3.000e+01, 3.000e+00], [7.000e+00, 1.210e+02, 5.790e+02, 1.298e+03, 1.179e+03, 4.170e+02, 7.000e+01, 2.000e+00], [5.000e+00, 8.800e+01, 4.440e+02, 1.030e+03, 8.770e+02, 3.240e+02, 4.400e+01, 1.000e+00], [3.000e+00, 3.600e+01, 1.490e+02, 3.280e+02, 2.710e+02, 9.300e+01, 1.400e+01, 2.000e+00], [0.000e+00, 5.000e+00, 1.900e+01, 3.500e+01, 3.600e+01, 1.100e+01, 1.000e+00, 0.000e+00], [0.000e+00, 0.000e+00, 1.000e+00, 2.000e+00, 1.000e+00, 0.000e+00, 1.000e+00, 0.000e+00]])
  • Numpy and Matplotlib return the bin edges and the data counts separately.

  • We have everything stored inside a single data structure.

You can, of course, adjust the number of bins:

recording.hist(y=100, x=100).plot(aspect="equal")
../_images/00d894121d17975505529c327c5e55b9c3bdb129d404c9bc0e057e12697fbe32.svg






Exercise 3: Histogramming#

We found a 2D detector that reads your mood!

We recorded a signal with it, and now we can visualize the signal by histogramming.

from scipp_utils import load_signal_to_histogram
signal_rng = np.random.default_rng(1)
signal = load_signal_to_histogram(signal_rng)
signal
Show/Hide data repr Show/Hide attributes
scipp.DataArray (4.66 MB)
    • row: 203700
    • x
      (row)
      float64
      cm
      7.278, -16.296, ..., 0.018, 0.016
      Values:
      array([ 7.27805283e+00, -1.62959448e+01, -7.09559402e+00, ..., 1.89787051e-02, 1.81696472e-02, 1.60307256e-02])
    • y
      (row)
      float64
      cm
      0.432, -0.784, ..., 0.061, 0.060
      Values:
      array([ 0.43193503, -0.78422306, -9.62131874, ..., 0.06044533, 0.06091188, 0.05965346])
    • (row)
      float64
      counts
      1.0, 1.0, ..., 1.0, 1.0
      Values:
      array([1., 1., 1., ..., 1., 1., 1.])

Exercise 3-1: Number of bins for histogramming.#

First, we need to find the right number of bins to histogram the signal.

We tried 200 bins and 4 bins for each axis, but none of them seems meaningful!

signal.hist(x=200, y=200).plot() + signal.hist(x=4, y=4).plot()
../_images/c7efdae3d10002cd7f71b5e3f6be46db9c24ce55db0c15fd5f6afa3499c6b436.svg

Solution:

Hide code cell content
# 30~50 bins are enough to see the meaningful shape!
signal.hist(x=50, y=50).plot() + signal.hist(x=30, y=30).plot()
../_images/834376c179c5d5cc7492571d2d9b24446f3713b772cb03731430cddeaaeb2c41.svg

Exercise 3-2: Custom histogram edges.#

However, there is a suspicious hot spot in the very middle of the image.

We want to investigate those signals within the specific range of x and y.

Let’s histogram the hot spot and see what is in there.

You can histogram the data with custom histogram edges like below.

Hint:

hist_edges_x = sc.linspace(dim='x', start=-10, stop=10, unit='cm', num=200)
hist_edges_y = sc.linspace(dim='y', start=-10, stop=10, unit='cm', num=200)
signal.hist(x=hist_edges_x, y=hist_edges_y).plot()

Solution:

Hide code cell content
# There was a smiley in the middle of the heart!

hist_edges_x = sc.linspace(dim='x', start=-0.15, stop=0.15, unit='cm', num=50)
hist_edges_y = sc.linspace(dim='y', start=-0.15, stop=0.15, unit='cm', num=50)
signal.hist(x=hist_edges_x, y=hist_edges_y).plot()
../_images/8d853d985bb20663714700e7cd87821ee42c75495387295da5534951e4068fca.svg













3. Binned data#

Scipp distinguishes histogrammed data from binned data:

  • Histogrammed data refers to regular dense arrays of, e.g., floating-point values with an associated bin-edge coordinate.

  • Binned data refers to the precursor of histogrammed data, i.e., each bin contains a “list” of contributing events or values. Binned data can be converted into a histogram by computing the sum over all events or values in a bin.

binned

This is conceptually similar to a multi-dimensional .

It is best illustrated with an example of data analysis. For this, we will use one of the NYC taxi datasets.



NYC yellow taxi dataset#

(https://vaex.readthedocs.io/en/latest/datasets.html, Dataset from 2015, obtained as a HDF5 file from the Vaex docs, and subsequently cleaned of outliers).

For today, we will use a small set of it.

!wget -nc --no-verbose https://public.esss.dk/groups/scipp/dmsc-summer-school/scipp/nyc_taxi_data_2015_small.tar.gz
!tar -xzf nyc_taxi_data_2015_small.tar.gz
2023-11-01 19:26:52 URL:https://public.esss.dk/groups/scipp/dmsc-summer-school/scipp/nyc_taxi_data_2015_small.tar.gz [362919163/362919163] -> "nyc_taxi_data_2015_small.tar.gz" [1]
# %matplotlib widget

da = sc.io.load_hdf5("nyc_taxi_data_2015_small.h5")
da
Show/Hide data repr Show/Hide attributes
scipp.DataArray (1.60 GB)
    • row: 17839810
    • dropoff_datetime
      (row)
      datetime64
      s
      2014-12-16T02:28:00, 2015-01-10T20:58:31, ..., 2016-01-01T00:11:37, 2016-01-01T00:14:14
      Values:
      array(['2014-12-16T02:28:00', '2015-01-10T20:58:31', '2015-01-10T20:39:23', ..., '2016-01-01T00:06:43', '2016-01-01T00:11:37', '2016-01-01T00:14:14'], dtype='datetime64[s]')
    • dropoff_hour
      (row)
      int64
      𝟙
      2, 20, ..., 0, 0
      Values:
      array([ 2, 20, 20, ..., 0, 0, 0])
    • dropoff_latitude
      (row)
      float64
      deg
      40.743, 40.750, ..., 40.763, 40.696
      Values:
      array([40.74289322, 40.74963379, 40.73989487, ..., 40.74245071, 40.76282883, 40.69619751])
    • dropoff_longitude
      (row)
      float64
      deg
      -73.996, -73.992, ..., -73.925, -73.980
      Values:
      array([-73.99645996, -73.99246979, -73.99521637, ..., -73.97740936, -73.92475128, -73.98009491])
    • fare_amount
      (row)
      float64
      \$
      5.0, 14.0, ..., 12.5, 17.0
      Values:
      array([ 5. , 14. , 6. , ..., 10.5, 12.5, 17. ])
    • pickup_datetime
      (row)
      datetime64
      s
      2014-12-16T02:26:00, 2015-01-10T20:33:39, ..., 2015-12-31T23:59:48, 2015-12-31T23:59:55
      Values:
      array(['2014-12-16T02:26:00', '2015-01-10T20:33:39', '2015-01-10T20:33:41', ..., '2015-12-31T23:59:46', '2015-12-31T23:59:48', '2015-12-31T23:59:55'], dtype='datetime64[s]')
    • pickup_hour
      (row)
      int64
      𝟙
      2, 20, ..., 23, 23
      Values:
      array([ 2, 20, 20, ..., 23, 23, 23])
    • pickup_latitude
      (row)
      float64
      deg
      40.756, 40.726, ..., 40.764, 40.731
      Values:
      array([40.75642014, 40.72600937, 40.73177719, ..., 40.77241898, 40.7635498 , 40.73109055])
    • pickup_longitude
      (row)
      float64
      deg
      -73.987, -73.983, ..., -73.971, -73.982
      Values:
      array([-73.98672485, -73.98327637, -74.0067215 , ..., -73.9466095 , -73.97135925, -73.98199463])
    • tip_amount
      (row)
      float64
      \$
      0.0, 0.0, ..., 0.0, 0.0
      Values:
      array([0., 0., 0., ..., 0., 0., 0.])
    • trip_distance
      (row)
      float64
      mi
      1.090, 2.200, ..., 3.130, 5.070
      Values:
      array([1.09000003, 2.20000005, 1.10000002, ..., 2.79999995, 3.13000011, 5.07000017])
    • (row)
      float64
      counts
      1.0, 1.0, ..., 1.0, 1.0
      Values:
      array([1., 1., 1., ..., 1., 1., 1.])
n = 100
x = da.coords["dropoff_longitude"].values[::n]
y = da.coords["dropoff_latitude"].values[::n]
scatter(x, y)
../_images/900d65306bf5c8cf5e4a46b36f0785a4f6959b67d2c6aee5e5703f5a60d55621.png

Binning the data records#

  • Working with binned data is most efficient when keeping the number of bins relatively low.

  • Binning is essentially like overlaying a grid of bin edges onto our data

ax = scatter(x, y, get_ax=True)
for lon in np.linspace(*ax.get_xlim(), 9):
    ax.axvline(lon, color="gray")
for lat in np.linspace(*ax.get_ylim(), 9):
    ax.axhline(lat, color="gray")
../_images/9e9a0429f33a6adb35e71f7ec2276e9fc012058b7e32e0fc90f2cf3c92203170.png
# Bin into 8 longitude & latitude bins
binned = da.bin(dropoff_latitude=8, dropoff_longitude=8)
binned
Show/Hide data repr Show/Hide attributes
scipp.DataArray (1.60 GB)
    • dropoff_latitude: 8
    • dropoff_longitude: 8
    • dropoff_latitude
      (dropoff_latitude [bin-edge])
      float64
      deg
      40.595, 40.635, ..., 40.875, 40.915
      Values:
      array([40.59500122, 40.63499641, 40.67499161, 40.7149868 , 40.75498199, 40.79497719, 40.83497238, 40.87496758, 40.91496277])
    • dropoff_longitude
      (dropoff_longitude [bin-edge])
      float64
      deg
      -74.050, -74.010, ..., -73.770, -73.730
      Values:
      array([-74.04999542, -74.00999641, -73.96999741, -73.9299984 , -73.88999939, -73.85000038, -73.81000137, -73.77000237, -73.73000336])
    • (dropoff_latitude, dropoff_longitude)
      DataArrayView
      binned data [len=16129, len=12695, ..., len=500, len=12]
      dim='row',
      content=DataArray(
                dims=(row: 17839810),
                data=float64[counts],
                coords={'dropoff_datetime':datetime64[s], 'pickup_datetime':datetime64[s],
                        'fare_amount':float64[\$], 'trip_distance':float64[mi],
                        'tip_amount':float64[\$], 'dropoff_latitude':float64[deg],
                        'dropoff_longitude':float64[deg], 'pickup_latitude':float64[deg],
                        'pickup_longitude':float64[deg], 'dropoff_hour':int64[dimensionless],
                        'pickup_hour':int64[dimensionless]})
# Histogramming is summing all the counts in each bin
binned_sum = binned.bins.sum()

binned_sum.plot(aspect="equal", norm="log")
../_images/cc25c34b8c72f396f0c9484ddf82d2725f49c2d324c266ab3f2e864ef4856321.svg





Selecting/slicing bins#

  • Binning groups the data into bins, but keeps the underlying table of records.

  • No information is lost, it is simply re-ordered.

  • The bins can then be used for slicing the data, providing extremely efficient data selection and filtering.

manh = binned["dropoff_longitude", 1]["dropoff_latitude", 4]
manh
Show/Hide data repr Show/Hide attributes
scipp.DataArray (385.92 MB out of 1.60 GB)
    • dropoff_latitude
      (dropoff_latitude [bin-edge])
      float64
      deg
      40.755, 40.795
      Values:
      array([40.75498199, 40.79497719])
    • dropoff_longitude
      (dropoff_longitude [bin-edge])
      float64
      deg
      -74.010, -73.970
      Values:
      array([-74.00999641, -73.96999741])
    • ()
      DataArrayView
      binned data [len=4215195]
      dim='row',
      content=DataArray(
                dims=(row: 17839810),
                data=float64[counts],
                coords={'dropoff_datetime':datetime64[s], 'pickup_datetime':datetime64[s],
                        'fare_amount':float64[\$], 'trip_distance':float64[mi],
                        'tip_amount':float64[\$], 'dropoff_latitude':float64[deg],
                        'dropoff_longitude':float64[deg], 'pickup_latitude':float64[deg],
                        'pickup_longitude':float64[deg], 'dropoff_hour':int64[dimensionless],
                        'pickup_hour':int64[dimensionless]})
# We can now histogram this with a much finer resolution

manh.hist(dropoff_latitude=300, dropoff_longitude=300).plot(norm="log", aspect="equal")
../_images/1f69bb67ddb6dfa89324b87345bf92fe7051c4ef32e113285dced6ca84dbd434.svg
# We select another bin, which contains the JFK airport

jfk = binned["dropoff_longitude", 6]["dropoff_latitude", 1]
jfk.hist(dropoff_latitude=300, dropoff_longitude=300).plot(norm="log", aspect="equal")
../_images/da9ae21e046232f0beb5b50f16bb1beb72a9316e1c05ebe5ac51d6937a96e7c9.svg

jfk

(https://commons.wikimedia.org/wiki/File:JFK_airport_terminal_map.png)



Binning into a new dimension#

  • Data that has already been binned can also be binned further into new dimensions.

manh
Show/Hide data repr Show/Hide attributes
scipp.DataArray (385.92 MB out of 1.60 GB)
    • dropoff_latitude
      (dropoff_latitude [bin-edge])
      float64
      deg
      40.755, 40.795
      Values:
      array([40.75498199, 40.79497719])
    • dropoff_longitude
      (dropoff_longitude [bin-edge])
      float64
      deg
      -74.010, -73.970
      Values:
      array([-74.00999641, -73.96999741])
    • ()
      DataArrayView
      binned data [len=4215195]
      dim='row',
      content=DataArray(
                dims=(row: 17839810),
                data=float64[counts],
                coords={'dropoff_datetime':datetime64[s], 'pickup_datetime':datetime64[s],
                        'fare_amount':float64[\$], 'trip_distance':float64[mi],
                        'tip_amount':float64[\$], 'dropoff_latitude':float64[deg],
                        'dropoff_longitude':float64[deg], 'pickup_latitude':float64[deg],
                        'pickup_longitude':float64[deg], 'dropoff_hour':int64[dimensionless],
                        'pickup_hour':int64[dimensionless]})
  • We look at the trip distances inside the Manhattan and JFK bins we have selected above.

# Use 100 distance bins
manh_dist = manh.bin(trip_distance=100)
manh_dist
Show/Hide data repr Show/Hide attributes
scipp.DataArray (385.92 MB)
    • trip_distance: 100
    • dropoff_latitude
      (dropoff_latitude)
      float64
      deg
      40.755, 40.795
      Values:
      array([40.75498199, 40.79497719])
    • dropoff_longitude
      (dropoff_longitude)
      float64
      deg
      -74.010, -73.970
      Values:
      array([-74.00999641, -73.96999741])
    • trip_distance
      (trip_distance [bin-edge])
      float64
      mi
      0.020, 0.781, ..., 75.379, 76.140
      Values:
      array([1.99999996e-02, 7.81199993e-01, 1.54239999e+00, 2.30359998e+00, 3.06479998e+00, 3.82599997e+00, 4.58719996e+00, 5.34839996e+00, 6.10959995e+00, 6.87079994e+00, 7.63199994e+00, 8.39319993e+00, 9.15439993e+00, 9.91559992e+00, 1.06767999e+01, 1.14379999e+01, 1.21991999e+01, 1.29603999e+01, 1.37215999e+01, 1.44827999e+01, 1.52439999e+01, 1.60051999e+01, 1.67663999e+01, 1.75275999e+01, 1.82887999e+01, 1.90499998e+01, 1.98111998e+01, 2.05723998e+01, 2.13335998e+01, 2.20947998e+01, 2.28559998e+01, 2.36171998e+01, 2.43783998e+01, 2.51395998e+01, 2.59007998e+01, 2.66619998e+01, 2.74231998e+01, 2.81843998e+01, 2.89455998e+01, 2.97067998e+01, 3.04679998e+01, 3.12291997e+01, 3.19903997e+01, 3.27515997e+01, 3.35127997e+01, 3.42739997e+01, 3.50351997e+01, 3.57963997e+01, 3.65575997e+01, 3.73187997e+01, 3.80799997e+01, 3.88411997e+01, 3.96023997e+01, 4.03635997e+01, 4.11247997e+01, 4.18859997e+01, 4.26471997e+01, 4.34083997e+01, 4.41695996e+01, 4.49307996e+01, 4.56919996e+01, 4.64531996e+01, 4.72143996e+01, 4.79755996e+01, 4.87367996e+01, 4.94979996e+01, 5.02591996e+01, 5.10203996e+01, 5.17815996e+01, 5.25427996e+01, 5.33039996e+01, 5.40651996e+01, 5.48263996e+01, 5.55875996e+01, 5.63487995e+01, 5.71099995e+01, 5.78711995e+01, 5.86323995e+01, 5.93935995e+01, 6.01547995e+01, 6.09159995e+01, 6.16771995e+01, 6.24383995e+01, 6.31995995e+01, 6.39607995e+01, 6.47219995e+01, 6.54831995e+01, 6.62443995e+01, 6.70055995e+01, 6.77667995e+01, 6.85279995e+01, 6.92891994e+01, 7.00503994e+01, 7.08115994e+01, 7.15727994e+01, 7.23339994e+01, 7.30951994e+01, 7.38563994e+01, 7.46175994e+01, 7.53787994e+01, 7.61399994e+01])
    • (trip_distance)
      DataArrayView
      binned data [len=676736, len=1480372, ..., len=0, len=1]
      dim='row',
      content=DataArray(
                dims=(row: 4215195),
                data=float64[counts],
                coords={'dropoff_datetime':datetime64[s], 'pickup_datetime':datetime64[s],
                        'fare_amount':float64[\$], 'trip_distance':float64[mi],
                        'tip_amount':float64[\$], 'dropoff_latitude':float64[deg],
                        'dropoff_longitude':float64[deg], 'pickup_latitude':float64[deg],
                        'pickup_longitude':float64[deg], 'dropoff_hour':int64[dimensionless],
                        'pickup_hour':int64[dimensionless]})
manh_dist.hist().plot()
../_images/d20cce0e13694c923ab1f9c7a26a1f5d75a6948b15b901b16d7c96142e55033e.svg
jfk_dist = jfk.bin(trip_distance=100)
jfk_dist.hist().plot()
../_images/f91b52eb2d7c41385a1122d849606ce792c9d642aa3db07713b5699d0108876a.svg



Other operations on bins: what is the fare amount as a function of distance?#

  • In addition to summing/histogramming, bins can be used for other reduction operations: min(), max(), and mean().

manh_dist
Show/Hide data repr Show/Hide attributes
scipp.DataArray (385.92 MB)
    • trip_distance: 100
    • dropoff_latitude
      (dropoff_latitude)
      float64
      deg
      40.755, 40.795
      Values:
      array([40.75498199, 40.79497719])
    • dropoff_longitude
      (dropoff_longitude)
      float64
      deg
      -74.010, -73.970
      Values:
      array([-74.00999641, -73.96999741])
    • trip_distance
      (trip_distance [bin-edge])
      float64
      mi
      0.020, 0.781, ..., 75.379, 76.140
      Values:
      array([1.99999996e-02, 7.81199993e-01, 1.54239999e+00, 2.30359998e+00, 3.06479998e+00, 3.82599997e+00, 4.58719996e+00, 5.34839996e+00, 6.10959995e+00, 6.87079994e+00, 7.63199994e+00, 8.39319993e+00, 9.15439993e+00, 9.91559992e+00, 1.06767999e+01, 1.14379999e+01, 1.21991999e+01, 1.29603999e+01, 1.37215999e+01, 1.44827999e+01, 1.52439999e+01, 1.60051999e+01, 1.67663999e+01, 1.75275999e+01, 1.82887999e+01, 1.90499998e+01, 1.98111998e+01, 2.05723998e+01, 2.13335998e+01, 2.20947998e+01, 2.28559998e+01, 2.36171998e+01, 2.43783998e+01, 2.51395998e+01, 2.59007998e+01, 2.66619998e+01, 2.74231998e+01, 2.81843998e+01, 2.89455998e+01, 2.97067998e+01, 3.04679998e+01, 3.12291997e+01, 3.19903997e+01, 3.27515997e+01, 3.35127997e+01, 3.42739997e+01, 3.50351997e+01, 3.57963997e+01, 3.65575997e+01, 3.73187997e+01, 3.80799997e+01, 3.88411997e+01, 3.96023997e+01, 4.03635997e+01, 4.11247997e+01, 4.18859997e+01, 4.26471997e+01, 4.34083997e+01, 4.41695996e+01, 4.49307996e+01, 4.56919996e+01, 4.64531996e+01, 4.72143996e+01, 4.79755996e+01, 4.87367996e+01, 4.94979996e+01, 5.02591996e+01, 5.10203996e+01, 5.17815996e+01, 5.25427996e+01, 5.33039996e+01, 5.40651996e+01, 5.48263996e+01, 5.55875996e+01, 5.63487995e+01, 5.71099995e+01, 5.78711995e+01, 5.86323995e+01, 5.93935995e+01, 6.01547995e+01, 6.09159995e+01, 6.16771995e+01, 6.24383995e+01, 6.31995995e+01, 6.39607995e+01, 6.47219995e+01, 6.54831995e+01, 6.62443995e+01, 6.70055995e+01, 6.77667995e+01, 6.85279995e+01, 6.92891994e+01, 7.00503994e+01, 7.08115994e+01, 7.15727994e+01, 7.23339994e+01, 7.30951994e+01, 7.38563994e+01, 7.46175994e+01, 7.53787994e+01, 7.61399994e+01])
    • (trip_distance)
      DataArrayView
      binned data [len=676736, len=1480372, ..., len=0, len=1]
      dim='row',
      content=DataArray(
                dims=(row: 4215195),
                data=float64[counts],
                coords={'dropoff_datetime':datetime64[s], 'pickup_datetime':datetime64[s],
                        'fare_amount':float64[\$], 'trip_distance':float64[mi],
                        'tip_amount':float64[\$], 'dropoff_latitude':float64[deg],
                        'dropoff_longitude':float64[deg], 'pickup_latitude':float64[deg],
                        'pickup_longitude':float64[deg], 'dropoff_hour':int64[dimensionless],
                        'pickup_hour':int64[dimensionless]})
  • To get the minimum and maximum fares for all trips that ended inside our Manhattan area, we can do

manh_dist.bins.coords["fare_amount"].min(), manh.bins.coords["fare_amount"].max()
(<scipp.Variable> ()    float64              [$]  -80,
 <scipp.Variable> ()    float64              [$]  900)
  • These values are somewhat strange, indicative of bad data in the table.

  • We restrict our fare range from $0 to $200.

# Make 100 bins between 0 and 200 dollars
nbins = 100
fare_bins = sc.linspace("fare_amount", 0, 200, nbins + 1, unit="dollar")

# Bin & plot our data
manh_dist.bin(fare_amount=fare_bins).hist().transpose().plot(norm="log")
../_images/009c0343ac5f24d892c906ae19acf1afe6a33f3d540ab1eadd70e3b9f45ee1df.svg

Some things we can say about the data:

  • there appears to be a (somewhat expected) correlation between fare amount and trip distance: the further you go, the more you’ll have to pay

  • for a given trip distance, clients usually pay above the diagonal line, rarely below

  • there appears to be a magic fare amount of $52 that will take you anywhere from 0 to 60 miles!



4. Plopp: interactive data visualization tools#

https://scipp.github.io/plopp

import plopp as pp

fare_lat_lon = da.hist(
    fare_amount=fare_bins, dropoff_latitude=300, dropoff_longitude=300
)
fare_lat_lon
Show/Hide data repr Show/Hide attributes
scipp.DataArray (68.67 MB)
    • fare_amount: 100
    • dropoff_latitude: 300
    • dropoff_longitude: 300
    • dropoff_latitude
      (dropoff_latitude [bin-edge])
      float64
      deg
      40.595, 40.596, ..., 40.914, 40.915
      Values:
      array([40.59500122, 40.59606776, 40.5971343 , 40.59820084, 40.59926737, 40.60033391, 40.60140045, 40.60246699, 40.60353353, 40.60460007, 40.60566661, 40.60673314, 40.60779968, 40.60886622, 40.60993276, 40.6109993 , 40.61206584, 40.61313238, 40.61419891, 40.61526545, 40.61633199, 40.61739853, 40.61846507, 40.61953161, 40.62059814, 40.62166468, 40.62273122, 40.62379776, 40.6248643 , 40.62593084, 40.62699738, 40.62806391, 40.62913045, 40.63019699, 40.63126353, 40.63233007, 40.63339661, 40.63446314, 40.63552968, 40.63659622, 40.63766276, 40.6387293 , 40.63979584, 40.64086238, 40.64192891, 40.64299545, 40.64406199, 40.64512853, 40.64619507, 40.64726161, 40.64832815, 40.64939468, 40.65046122, 40.65152776, 40.6525943 , 40.65366084, 40.65472738, 40.65579391, 40.65686045, 40.65792699, 40.65899353, 40.66006007, 40.66112661, 40.66219315, 40.66325968, 40.66432622, 40.66539276, 40.6664593 , 40.66752584, 40.66859238, 40.66965892, 40.67072545, 40.67179199, 40.67285853, 40.67392507, 40.67499161, 40.67605815, 40.67712468, 40.67819122, 40.67925776, 40.6803243 , 40.68139084, 40.68245738, 40.68352392, 40.68459045, 40.68565699, 40.68672353, 40.68779007, 40.68885661, 40.68992315, 40.69098969, 40.69205622, 40.69312276, 40.6941893 , 40.69525584, 40.69632238, 40.69738892, 40.69845545, 40.69952199, 40.70058853, 40.70165507, 40.70272161, 40.70378815, 40.70485469, 40.70592122, 40.70698776, 40.7080543 , 40.70912084, 40.71018738, 40.71125392, 40.71232045, 40.71338699, 40.71445353, 40.71552007, 40.71658661, 40.71765315, 40.71871969, 40.71978622, 40.72085276, 40.7219193 , 40.72298584, 40.72405238, 40.72511892, 40.72618546, 40.72725199, 40.72831853, 40.72938507, 40.73045161, 40.73151815, 40.73258469, 40.73365122, 40.73471776, 40.7357843 , 40.73685084, 40.73791738, 40.73898392, 40.74005046, 40.74111699, 40.74218353, 40.74325007, 40.74431661, 40.74538315, 40.74644969, 40.74751623, 40.74858276, 40.7496493 , 40.75071584, 40.75178238, 40.75284892, 40.75391546, 40.75498199, 40.75604853, 40.75711507, 40.75818161, 40.75924815, 40.76031469, 40.76138123, 40.76244776, 40.7635143 , 40.76458084, 40.76564738, 40.76671392, 40.76778046, 40.768847 , 40.76991353, 40.77098007, 40.77204661, 40.77311315, 40.77417969, 40.77524623, 40.77631276, 40.7773793 , 40.77844584, 40.77951238, 40.78057892, 40.78164546, 40.782712 , 40.78377853, 40.78484507, 40.78591161, 40.78697815, 40.78804469, 40.78911123, 40.79017776, 40.7912443 , 40.79231084, 40.79337738, 40.79444392, 40.79551046, 40.796577 , 40.79764353, 40.79871007, 40.79977661, 40.80084315, 40.80190969, 40.80297623, 40.80404277, 40.8051093 , 40.80617584, 40.80724238, 40.80830892, 40.80937546, 40.810442 , 40.81150853, 40.81257507, 40.81364161, 40.81470815, 40.81577469, 40.81684123, 40.81790777, 40.8189743 , 40.82004084, 40.82110738, 40.82217392, 40.82324046, 40.824307 , 40.82537354, 40.82644007, 40.82750661, 40.82857315, 40.82963969, 40.83070623, 40.83177277, 40.8328393 , 40.83390584, 40.83497238, 40.83603892, 40.83710546, 40.838172 , 40.83923854, 40.84030507, 40.84137161, 40.84243815, 40.84350469, 40.84457123, 40.84563777, 40.84670431, 40.84777084, 40.84883738, 40.84990392, 40.85097046, 40.852037 , 40.85310354, 40.85417007, 40.85523661, 40.85630315, 40.85736969, 40.85843623, 40.85950277, 40.86056931, 40.86163584, 40.86270238, 40.86376892, 40.86483546, 40.865902 , 40.86696854, 40.86803507, 40.86910161, 40.87016815, 40.87123469, 40.87230123, 40.87336777, 40.87443431, 40.87550084, 40.87656738, 40.87763392, 40.87870046, 40.879767 , 40.88083354, 40.88190008, 40.88296661, 40.88403315, 40.88509969, 40.88616623, 40.88723277, 40.88829931, 40.88936584, 40.89043238, 40.89149892, 40.89256546, 40.893632 , 40.89469854, 40.89576508, 40.89683161, 40.89789815, 40.89896469, 40.90003123, 40.90109777, 40.90216431, 40.90323085, 40.90429738, 40.90536392, 40.90643046, 40.907497 , 40.90856354, 40.90963008, 40.91069661, 40.91176315, 40.91282969, 40.91389623, 40.91496277])
    • dropoff_longitude
      (dropoff_longitude [bin-edge])
      float64
      deg
      -74.050, -74.049, ..., -73.731, -73.730
      Values:
      array([-74.04999542, -74.04892878, -74.04786214, -74.0467955 , -74.04572886, -74.04466222, -74.04359558, -74.04252894, -74.0414623 , -74.04039566, -74.03932902, -74.03826238, -74.03719574, -74.0361291 , -74.03506246, -74.03399582, -74.03292918, -74.03186254, -74.0307959 , -74.02972926, -74.02866262, -74.02759598, -74.02652934, -74.0254627 , -74.02439606, -74.02332942, -74.02226278, -74.02119614, -74.0201295 , -74.01906286, -74.01799622, -74.01692958, -74.01586294, -74.0147963 , -74.01372965, -74.01266301, -74.01159637, -74.01052973, -74.00946309, -74.00839645, -74.00732981, -74.00626317, -74.00519653, -74.00412989, -74.00306325, -74.00199661, -74.00092997, -73.99986333, -73.99879669, -73.99773005, -73.99666341, -73.99559677, -73.99453013, -73.99346349, -73.99239685, -73.99133021, -73.99026357, -73.98919693, -73.98813029, -73.98706365, -73.98599701, -73.98493037, -73.98386373, -73.98279709, -73.98173045, -73.98066381, -73.97959717, -73.97853053, -73.97746389, -73.97639725, -73.97533061, -73.97426397, -73.97319733, -73.97213069, -73.97106405, -73.96999741, -73.96893077, -73.96786413, -73.96679749, -73.96573085, -73.9646642 , -73.96359756, -73.96253092, -73.96146428, -73.96039764, -73.959331 , -73.95826436, -73.95719772, -73.95613108, -73.95506444, -73.9539978 , -73.95293116, -73.95186452, -73.95079788, -73.94973124, -73.9486646 , -73.94759796, -73.94653132, -73.94546468, -73.94439804, -73.9433314 , -73.94226476, -73.94119812, -73.94013148, -73.93906484, -73.9379982 , -73.93693156, -73.93586492, -73.93479828, -73.93373164, -73.932665 , -73.93159836, -73.93053172, -73.92946508, -73.92839844, -73.9273318 , -73.92626516, -73.92519852, -73.92413188, -73.92306524, -73.9219986 , -73.92093196, -73.91986532, -73.91879868, -73.91773204, -73.9166654 , -73.91559875, -73.91453211, -73.91346547, -73.91239883, -73.91133219, -73.91026555, -73.90919891, -73.90813227, -73.90706563, -73.90599899, -73.90493235, -73.90386571, -73.90279907, -73.90173243, -73.90066579, -73.89959915, -73.89853251, -73.89746587, -73.89639923, -73.89533259, -73.89426595, -73.89319931, -73.89213267, -73.89106603, -73.88999939, -73.88893275, -73.88786611, -73.88679947, -73.88573283, -73.88466619, -73.88359955, -73.88253291, -73.88146627, -73.88039963, -73.87933299, -73.87826635, -73.87719971, -73.87613307, -73.87506643, -73.87399979, -73.87293315, -73.87186651, -73.87079987, -73.86973323, -73.86866659, -73.86759995, -73.8665333 , -73.86546666, -73.86440002, -73.86333338, -73.86226674, -73.8612001 , -73.86013346, -73.85906682, -73.85800018, -73.85693354, -73.8558669 , -73.85480026, -73.85373362, -73.85266698, -73.85160034, -73.8505337 , -73.84946706, -73.84840042, -73.84733378, -73.84626714, -73.8452005 , -73.84413386, -73.84306722, -73.84200058, -73.84093394, -73.8398673 , -73.83880066, -73.83773402, -73.83666738, -73.83560074, -73.8345341 , -73.83346746, -73.83240082, -73.83133418, -73.83026754, -73.8292009 , -73.82813426, -73.82706762, -73.82600098, -73.82493434, -73.8238677 , -73.82280106, -73.82173442, -73.82066778, -73.81960114, -73.8185345 , -73.81746785, -73.81640121, -73.81533457, -73.81426793, -73.81320129, -73.81213465, -73.81106801, -73.81000137, -73.80893473, -73.80786809, -73.80680145, -73.80573481, -73.80466817, -73.80360153, -73.80253489, -73.80146825, -73.80040161, -73.79933497, -73.79826833, -73.79720169, -73.79613505, -73.79506841, -73.79400177, -73.79293513, -73.79186849, -73.79080185, -73.78973521, -73.78866857, -73.78760193, -73.78653529, -73.78546865, -73.78440201, -73.78333537, -73.78226873, -73.78120209, -73.78013545, -73.77906881, -73.77800217, -73.77693553, -73.77586889, -73.77480225, -73.77373561, -73.77266897, -73.77160233, -73.77053569, -73.76946905, -73.7684024 , -73.76733576, -73.76626912, -73.76520248, -73.76413584, -73.7630692 , -73.76200256, -73.76093592, -73.75986928, -73.75880264, -73.757736 , -73.75666936, -73.75560272, -73.75453608, -73.75346944, -73.7524028 , -73.75133616, -73.75026952, -73.74920288, -73.74813624, -73.7470696 , -73.74600296, -73.74493632, -73.74386968, -73.74280304, -73.7417364 , -73.74066976, -73.73960312, -73.73853648, -73.73746984, -73.7364032 , -73.73533656, -73.73426992, -73.73320328, -73.73213664, -73.73107 , -73.73000336])
    • fare_amount
      (fare_amount [bin-edge])
      float64
      \$
      0.0, 2.0, ..., 198.0, 200.0
      Values:
      array([ 0., 2., 4., 6., 8., 10., 12., 14., 16., 18., 20., 22., 24., 26., 28., 30., 32., 34., 36., 38., 40., 42., 44., 46., 48., 50., 52., 54., 56., 58., 60., 62., 64., 66., 68., 70., 72., 74., 76., 78., 80., 82., 84., 86., 88., 90., 92., 94., 96., 98., 100., 102., 104., 106., 108., 110., 112., 114., 116., 118., 120., 122., 124., 126., 128., 130., 132., 134., 136., 138., 140., 142., 144., 146., 148., 150., 152., 154., 156., 158., 160., 162., 164., 166., 168., 170., 172., 174., 176., 178., 180., 182., 184., 186., 188., 190., 192., 194., 196., 198., 200.])
    • (fare_amount, dropoff_latitude, dropoff_longitude)
      float64
      counts
      0.0, 0.0, ..., 0.0, 0.0
      Values:
      array([[[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], [[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], [[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], ..., [[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], [[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], [[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]]])
%matplotlib widget

inspect = pp.inspector(fare_lat_lon, dim="fare_amount", norm="log")
inspect
../_images/aebb7da2fbf31bcd31824160391d66a5409ac44bea4ee1bc38e4fffcfb9b8596.png ../_images/24d2905d89d24f5237f4d15d6d13b0381d17e8ef229137b4da800bbe6825e261.png








Exercise 4.1: Rush hours#

Histogram the Manhattan and JFK bins according to hour-of-the-day, to show the quiet and busy hours for both boroughs.

Solution:

Hide code cell content
# In Plopp, you can use the + and / operators to make tiled figures
manh.hist(dropoff_hour=24).plot(title='Manhattan') / jfk.hist(dropoff_hour=24).plot(title='JFK')

Exercise 4.2: Expensive hours#

The final exercise is to create an interactive figure that will show histograms of how expensive trips were, as a function of the hour-of-the-day, for the entire dataset.

You should:

  1. Create a price_per_mile coordinate on the original dataset da

  2. Bin da using two dimensions: hour-of-the-day and price_per_mile

  3. Use Plopp’s superplot function to make a figure with a 1D histogram and an interactive slider to navigate the hour dimension

Use the slider to find the hour of the day when trips are the most expensive!

Hint: For binning in hour-of-the-day, using 24 bins should work well. For binning in price_per_mile, you will have to manually set the bin boundaries.

Solution:

Hide code cell content
da.coords['price_per_mile'] = da.coords['fare_amount'] / da.coords['trip_distance']
sp = pp.superplot(
         da.bin(dropoff_hour=24,
                price_per_mile=sc.linspace('price_per_mile', 0, 20, 100, unit='dollar/mi')).hist())
sp
Hide code cell outputs








Bonus Exercise#

You decided to join an exchange program in NY.

But living expenses are too high there, even compared to Copenhagen!

Luckily, you can take over a car from a previous student in the same program, and you are allowed to have a part-time job for 2 hours every day, and there is no limit of income.

So you decide to be a shared-car driver. Your goal is to maximize your income within those 2 hours, so you are going to analyse which hours to drive in which borough!

You are free to choose 2 hours among all 24 in a day, and there are 2 places, Manhattan and JFK airport, where you can be registered as a driver.

Solution

Hide code cell content
# Bin the data again to get the `price_per_mile` coord in the
# Manhattan and JFK bins
binned = da.bin(dropoff_latitude=8, dropoff_longitude=8)

# Manhattan bin
manh = binned["dropoff_longitude", 1]["dropoff_latitude", 4]
# Create a new data array with the `price_per_mile` as weights
prices_manh = sc.DataArray(data=manh.values.coords['price_per_mile'],
                           coords={'dropoff_hour': manh.values.coords['dropoff_hour']})
# Bin by hour-of-the-day and get the mean inside each bin
mean_manh = prices_manh.bin(dropoff_hour=24).bins.mean()

# Repeat for JFK
jfk = binned["dropoff_longitude", 6]["dropoff_latitude", 1]
prices_jfk = sc.DataArray(data=jfk.values.coords['price_per_mile'],
                          coords={'dropoff_hour': jfk.values.coords['dropoff_hour']})
mean_jfk = prices_jfk.bin(dropoff_hour=24).bins.mean()

# Plot
fig = pp.plot({'Manhattan': mean_manh, 'JFK': mean_jfk})
fig
Hide code cell outputs
../_images/3a895e2ce6e11e6ef288b6607c8c27eb275424a36c3ebbe9101226958c81f22e.png