### A Pluto.jl notebook ###
# v0.19.11
using Markdown
using InteractiveUtils
# ╔═╡ 675e66aa-8aef-11eb-27be-5fe273e33297
# Load packages
begin
using BenchmarkTools
using PlutoUI
end
# ╔═╡ 15f5c31a-8aef-11eb-3f19-cf0a4e456e7a
md"""
# What scientists must know about hardware to write fast code
**Find this notebook at https://github.com/jakobnissen/hardware_introduction**
Programming is used in many fields of science today, where individual scientists often have to write custom code for their own projects. For most scientists, however, computer science is not their field of expertise; they have learned programming by necessity. I count myself as one of them. While we may be reasonably familiar with the *software* side of programming, we rarely have even a basic understanding of how computer *hardware* impacts code performance.
The aim of this tutorial is to give non-professional programmers a *brief* overview of the features of modern hardware that you must understand in order to write fast code. It is a distillation of what I have learned over the last few years. This tutorial will use Julia because it allows these relatively low-level considerations to be demonstrated easily in a high-level, interactive language.
## What this notebook is not
#### This is not a guide to the Julia programming language
To write fast code, you must first understand your programming language and its idiosyncrasies. But this is *not* a guide to the Julia programming language. I recommend reading the [performance tips section](https://docs.julialang.org/en/v1/manual/performance-tips/) of the Julia documentation.
#### This is not an explanation of specific data structures or algorithms
Besides knowing your language, you must also know your own code to make it fast. You must understand the idea behind big-O notation, why some algorithms are faster than others, and how different data structures work internally. Without knowing *what an `Array` is*, how could you possibly optimize code making use of arrays?
This, too, is outside the scope of this document. However, I would say that, as a minimum, a programmer should have an understanding of:
* How a binary integer is represented in memory
* How a floating point number is represented in memory (learning this is also necessary to understand computational inaccuracies from floating point operations, which is a must when doing scientific programming)
* The memory layout of a `String` including ASCII and UTF-8 encoding
* The basics of how an `Array` is structured, and what the difference is between a dense array of e.g. integers and an array of references to objects
* The principles behind how a `Dict` (i.e. hash table) and a `Set` work
Furthermore, I would also recommend familiarizing yourself with:
* Heaps
* Deques
* Tuples
#### This is not a tutorial on benchmarking your code
To write fast code *in practice*, it is necessary to profile your code to find bottlenecks where your machine spends the majority of the time. One must benchmark different functions and approaches to find the fastest in practice. Julia, like other languages, has tools for exactly this purpose, but I will not cover them here.
"""
# ╔═╡ 5dd2329a-8aef-11eb-23a9-7f3c325bcf74
md"""## Setting up this notebook
If you don't already have these packages installed, uncomment these lines and run them:
"""
# ╔═╡ 7490def0-8aef-11eb-19ce-4b11ce5a9328
# begin
# using Pkg
# Pkg.add(["BenchmarkTools", "PlutoUI"])
# end
# ╔═╡ 800d827e-8c20-11eb-136a-97a622a7c1e6
TableOfContents()
# ╔═╡ 9a24985a-8aef-11eb-104a-bd9abf0adc6d
md"""
## The basic structure of computer hardware
For now, we will work with a simplified mental model of a computer. Throughout this document, I will add more details to our model as they become relevant.
$$[CPU] ↔ [RAM] ↔ [DISK]$$
In this simple, uh, "diagram", the arrows represent data flow in either direction. The diagram shows three important parts of a computer:
* The central processing unit (CPU) is a chip the size of a stamp. This is where all the computation actually occurs, the brain of the computer.
* Random access memory (RAM, or just "memory") is the short-term memory of the computer. This memory requires electrical power to maintain, and its contents are lost when the computer is shut down. RAM serves as temporary storage of data between the disk and the CPU. Much of the time spent "loading" various applications and operating systems is actually spent moving data from disk to RAM and unpacking it there. A typical consumer laptop has around $10^{11}$ bits of RAM.
* The disk is a mass storage unit. Data on disk persists after the power is shut off, so the disk holds the long-term memory of the computer. It is also much cheaper per gigabyte than RAM, with consumer PCs having around $10^{13}$ bits of disk space.
"""
# ╔═╡ a2fad250-8aef-11eb-200f-e5f8caa57a67
md"""
## Avoid accessing disk too often
When discussing software performance, it's useful to distinguish between *throughput* and *latency*. Latency is the time it takes from when something begins until it finishes. Throughput is a measure of how much stuff gets done in a set amount of time.
On the surface, the relationship between latency and throughput seems obvious: If an operation takes $N$ seconds to compute, then $1/N$ operations can be done per second. So you would think:
Naive equation: $$\text{throughput} = \frac{1}{\text{latency}}$$
In reality, it's not so simple. For example, imagine an operation that has a 1 second "warmup" before it begins, but afterwards completes in 0.1 seconds. The latency is thus 1.1 seconds, but its throughput after the initial warmup is 10 ops/second.
Or, imagine a situation with an operation with a latency of 1 second, but where 8 operations can be run concurrently. In bulk, these operations can be run with a throughput of 8 ops/second.
One place where it's useful to distinguish between latency and throughput is when programs read from the disk. Most modern computers use a type of disk called a *solid state drive (SSD)*. In round numbers, current (2021) SSDs have latencies around 100 µs and read/write throughputs of well over 1 GB/s. Older, or cheaper, mass-storage disks are of the *hard disk drive (HDD)* type. These have latencies around 100 times higher, near 10 ms, and roughly 10 times lower throughput, around 100 MB/s.
Even the latest, fastest SSDs have latencies thousands of times higher than RAM, whose latency is below 100 nanoseconds. Disk latency is incurred whenever a read or write happens. To write fast code, you must therefore at all costs avoid repeatedly reading from, or writing to, disk.
The following example serves to illustrate the difference in latency: The first function opens a file, accesses one byte from the file, and closes it again. The second function randomly accesses 1,000,000 integers from RAM.
"""
# ╔═╡ cdde6fe8-8aef-11eb-0a3c-77e28f7a2c09
md"""
Benchmarking this is a little tricky, because the *first* invocation will include the compilation times of both functions. And in the *second* invocation, your operating system will have stored a copy of the file (or *cached* the file) in RAM, making the file seek almost instant. To time it properly, run it once, then *change the file* to another not-recently-opened file, and run it again. So in fact, we should update our computer diagram:
$$[CPU] ↔ [RAM] ↔ [DISK CACHE] ↔ [DISK]$$
On my computer, finding a single byte in a file (including opening and closing the file) takes about 500 µs, and accessing 1,000,000 integers from memory takes 200 milliseconds. So RAM latency is on the order of 2,500 times lower than disk's. Therefore, repeated access to files *must* be avoided in high performance computing.
Only a few years back, SSDs were uncommon and HDD throughput was lower than today. Therefore, old texts will often warn you not to let your program depend on the disk at all if you need high throughput. That advice is mostly outdated today, as most programs are incapable of bottlenecking at the throughput of even cheap, modern SSDs of 1 GB/s. The advice still stands only for programs that need *frequent* individual reads/writes to disk, where the high *latency* accumulates. In these situations, you should indeed keep your data in RAM.
The worst case for performance is if you need to read/write a large file in tiny chunks, for example one single byte at a time. In these situations, great speed improvements can be found by *buffering* the file. When buffering, you read in larger chunks, the *buffer*, to memory, and when you want to read from the file, you check if it's in the buffer. If not, read another large chunk into your buffer from the file. This approach minimizes disk latency. Both your operating system and your programming language will make use of caches, however, sometimes [it is necessary to manually buffer your files](https://github.com/JuliaLang/julia/issues/34195).
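To make the idea concrete, here is a minimal sketch of manual buffering (the function name and chunk size are my own, chosen for illustration):
```julia
# Read the file in large chunks and do the per-byte work from RAM, so the
# disk is only touched once per chunk instead of once per byte.
function sum_bytes_buffered(path::AbstractString; chunksize::Int=2^16)
    total = UInt(0)
    open(path) do io
        buffer = Vector{UInt8}(undef, chunksize)
        while !eof(io)
            nread = readbytes!(io, buffer)  # one disk read per chunk
            for i in 1:nread
                total += buffer[i]          # per-byte work hits RAM/cache, not disk
            end
        end
    end
    return total
end
```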
"""
# ╔═╡ f58d428c-8aef-11eb-3127-89d729e23823
md"""
## Avoid cache misses
The RAM is faster than the disk, and the CPU in turn is faster than RAM. A CPU ticks like a clock, with a speed of about 3 GHz, i.e. 3 billion ticks per second. One "tick" of this clock is called a *clock cycle*. While this is a simplification, you may imagine that every cycle, the CPU executes a single, simple command called a *CPU instruction* which does one operation on a small piece of data. The clock speed can then serve as a reference for other timings in a computer. It is worth realizing just how quick a clock cycle is: In one cycle, a photon will travel only around 10 cm. In fact, modern CPUs are so fast that a significant constraint on their physical layout is that one must take into account the time needed for electricity to move through the wires inside them, so-called wire delays.
On this scale, reading from RAM takes around 500 clock cycles. Similarly to how the high latency of disks can be mitigated by copying data to the faster RAM, data from RAM is copied to a smaller memory chip physically on the CPU, called a *cache*. The cache is faster because it is physically on the CPU chip (reducing wire delays), and because it uses a faster type of storage, static RAM, instead of the slower (but cheaper to manufacture) dynamic RAM which is what the main RAM is made of. Because the cache must be placed on the CPU, limiting its size, and because it is more expensive to produce, a typical CPU cache only contains around $10^8$ bits, around 1000 times less than RAM. There are actually multiple layers of CPU cache, but here we simplify it and just refer to "the cache" as one single thing:
$$[CPU] ↔ [CPU CACHE] ↔ [RAM] ↔ [DISK CACHE] ↔ [DISK]$$
When the CPU requests a piece of data from the RAM, say a single byte, it will first check if the memory is already in cache. If so, it will read it from there. This is much faster, usually just one or a few clock cycles, than access to RAM. If not, we have a *cache miss*, where your program will stall for around 100 nanoseconds while your computer copies data from RAM into the cache.
It is not possible, except in very low-level languages, to manually manage the CPU cache. Instead, you must make sure to use your cache effectively.
First, you strive to use as little memory as possible. With less memory, it is more likely that your data will be in cache when the CPU needs it. Remember, a CPU can do approximately 500 small operations in the time wasted by a single cache miss.
Effective use of the cache comes down to *locality*, temporal and spatial locality:
* By *temporal locality*, I mean that data you recently accessed likely resides in cache already. Therefore, if you must access a piece of memory multiple times, make sure you do it close together in time.
* By *spatial locality*, I mean that you should access data from memory addresses close to each other. Your CPU does not copy *just* the requested bytes to cache. Instead, your CPU will always copy data in larger chunks called *cache lines* (usually 512 consecutive bits, i.e. 64 bytes, depending on the CPU model).
To illustrate this, let's compare the performance of the `random_access` function above when it's run on a short (8 KiB) vector versus a long (128 MiB) one. The first is small enough that after just a few accesses, all the data has been copied to cache. The second is so large that nearly every new access causes a cache miss.
Notice the large discrepancy in time spent - a difference of around 70x.
"""
# ╔═╡ c6da4248-8c19-11eb-1c16-093695add9a9
md"""
We can play around with the function `random_access` from before. What happens if, instead of accessing the array randomly, we access it *in the worst possible order*?
For example, we could use the following function:
"""
# ╔═╡ d4c67b82-8c1a-11eb-302f-b79c86412ce5
md"""
`linear_access` does nearly the same computation as `random_access`, but accesses every 15th element. An `UInt` in Julia is 8 bytes (64 bits), so a step size of 15 means there are $15 \times 64 = 960$ bits (120 bytes) between each accessed element, more than the 64-byte (512-bit) cache line. That means *every single* access will cause a cache miss - in contrast to `random_access` with a large vector, where only *most* accesses force cache misses.
"""
# ╔═╡ 0f2ac53c-8c1b-11eb-3841-27f4ea1e9617
md"""
Surprise! The linear access pattern is more than 20 times faster! How can that be?
Next to the cache of the CPU lies a small circuit called the *prefetcher*. This electronic circuit collects data on which memory is being accessed by the CPU, and looks for patterns. When it detects a pattern, it will *prefetch* whatever data it predicts will soon be accessed, so that it already resides in cache when the CPU requests the data.
Our function `linear_access`, despite having worse *cache usage* than `random_access`, fetched the data in a completely predictable pattern, which allowed the prefetcher to do its job.
In summary, we have seen that
* A *cache miss* incurs a penalty equivalent to roughly 500 CPU operations, so it is absolutely critical for performance to avoid these
* To reduce cache misses:
- Use smaller data so it more easily fits in cache
- Access data in a predictable, regular pattern to allow the prefetcher to do its job
- Access data close together in memory instead of far apart
- When accessing data close together in memory, do so close together in time, so when it's accessed the second time, it's still in cache.
Cache usage has implications for your data structures. Hash tables such as `Dict`s and `Set`s are inherently cache inefficient and almost always cause cache misses, whereas arrays don't. Hence, while many operations of hash tables are $O(1)$ (i.e. they complete in constant time), their cost per operation is high.
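As a rough illustration (my own example; the exact numbers depend heavily on your machine), compare streaming through a `Vector` with doing one hash lookup per element of the same data:
```julia
data = rand(Int, 2^20)
lookup = Set(data)
count_members(s, xs) = count(x -> x in s, xs)

@btime sum(data) seconds=1                    # sequential reads: cache- and SIMD-friendly
@btime count_members(lookup, data) seconds=1  # one O(1) hash lookup per element, mostly cache misses
```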
Many of the optimizations in this document indirectly impact cache use, so this is important to have in mind.
"""
# ╔═╡ 12f1228a-8af0-11eb-0449-230ae20bfa7a
md"""
## Keep your data aligned to memory
As just mentioned, your CPU will move entire cache lines of usually 512 consecutive bits (64 bytes) between main RAM and the cache at a time. Your entire main memory is segmented into cache lines. For example, memory addresses 0 to 63 are one cache line, addresses 64 to 127 are the next, 128 to 191 the next, et cetera. Your CPU may only request one of these cache lines from memory, and not e.g. the 64 bytes from address 30 to 93.
This means that some data structures can straddle the boundaries between cache lines. If I request a 64-bit (8 byte) integer at address 60, the CPU must first generate two memory requests from the single requested memory address (namely for cache lines 0-63 and 64-127), and then retrieve the integer from both cache lines, wasting time.
The time wasted can be significant. In a situation where in-cache memory access proves the bottleneck, the slowdown can approach 2x. In the following example, I use a pointer to repeatedly access an array at a given offset from a cache line boundary. If the offset is in the range `0:56`, the integers all fit within one single cache line, and the function is fast. If the offset is in `57:63` all integers will straddle cache lines.
"""
# ╔═╡ 3a1efd5a-8af0-11eb-21a2-d1011f16555c
md"The consequences of unaligned memory access are very CPU-dependent. On my current CPU, I see a ~15% performance decrease. On my old computer where I originally wrote this notebook, the penalty was around 100%. Old processors can do [even worse things](https://www.kernel.org/doc/Documentation/unaligned-memory-access.txt) - incredibly, the CPU inside the Game Boy Advance from 2001 would _silently perform a different read!_ 😱
Fortunately, the compiler does a few tricks to make it less likely that you will access misaligned data. First, Julia (and other compiled languages) always places new objects in memory at the boundaries of cache lines. When an object is placed right at the boundary, we say that it is *aligned*. Julia also aligns the beginning of larger arrays:"
# ╔═╡ 5b10a2b6-8af0-11eb-3fe7-4b78b4c22550
md"Note that if the beginning of an array is aligned, then it's not possible for 1-, 2-, 4-, or 8-byte objects to straddle cache line boundaries, and everything will be aligned.
It would still be possible for an e.g. 7-byte object to be misaligned in an array. In an array of 7-byte objects, the 10th object would be placed at byte offset $7 \times (10-1) = 63$, and the object would straddle the cache line. However, the compiler usually does not allow structs with a nonstandard size for this reason. If we define a 7-byte struct:"
# ╔═╡ 6061dc94-8af0-11eb-215a-4f3af731774e
struct AlignmentTest
a::UInt32 # 4 bytes +
b::UInt16 # 2 bytes +
c::UInt8 # 1 byte = 7 bytes?
end;
# ╔═╡ 624eae74-8af0-11eb-025b-8b68dc55f31e
md"Then we can use Julia's introspection to get the relative position of each of the three integers in an `AlignmentTest` object in memory:"
# ╔═╡ d4c8c38c-8ee6-11eb-0b49-33fbfbd214f3
let
T = AlignmentTest
println("Size of $T: ", sizeof(T), "bytes")
for fieldno in 1:fieldcount(T)
print("Name: ", fieldname(T, fieldno), '\t')
print("Size: ", sizeof(fieldtype(T, fieldno)), '\t')
print("Offset: ", fieldoffset(T, fieldno), '\n')
end
end
# ╔═╡ 7b979410-8af0-11eb-299c-af0a5d740c24
md"""
We can see that, despite an `AlignmentTest` only having 4 + 2 + 1 = 7 bytes of actual data, it takes up 8 bytes of memory, and accessing an `AlignmentTest` object from an array will always be aligned.
As a coder, there are only a few situations where you can face alignment issues. I can come up with two:
1. If you manually create an object with a strange size, e.g. by accessing a dense integer array with pointers. This can save memory, but will waste time. [My implementation of a Cuckoo filter](https://github.com/jakobnissen/Probably.jl) does this to save space.
2. During matrix operations. If you have a matrix, its columns are sometimes unaligned because the matrix is stored densely in memory. E.g. in a 15x15 matrix of `Float32`s, only the first column is aligned; all the others are not. This can have serious effects when doing matrix operations: [I've seen benchmarks](https://juliasimd.github.io/LoopVectorization.jl/latest/examples/matrix_vector_ops/) where an 80x80 matrix/vector multiplication is 2x faster than a 79x79 one due to alignment issues.
"""
# ╔═╡ 8802ff60-8af0-11eb-21ac-b9fdbeac7c24
md"""
## Digression: Assembly code
To run, any program must be translated, or *compiled*, to CPU instructions. The CPU instructions are what is actually running on your computer, as opposed to the code written in your programming language, which is merely a *description* of the program. CPU instructions are usually presented to human beings in *assembly*. Assembly is a programming language which has a one-to-one correspondence with CPU instructions.
Viewing assembly code will be useful to understand some of the following sections which pertain to CPU instructions.
In Julia, we can easily inspect the compiled assembly code using the `code_native` function or the equivalent `@code_native` macro. We can do this for a simple function:
"""
# ╔═╡ a36582d4-8af0-11eb-2b5a-e577c5ed07e2
# View assembly code generated from this function call
function foo(x)
s = zero(eltype(x))
@inbounds for i in eachindex(x)
s = x[i ⊻ s]
end
return s
end;
# ╔═╡ a74a9966-8af0-11eb-350f-6787d2759eba
@code_native foo([UInt(1)])
# ╔═╡ ae9ee028-8af0-11eb-10c0-6f2db3ab8025
md"""
Let's break it down:
The lines beginning with `;` are comments, and explain which section of the code the following instructions come from. They show the nested series of function calls, and where in the source code they are. You can see that `eachindex` calls `axes1`, which calls `axes`, which calls `size`. Under the comment line containing the `size` call, we see the first CPU instruction. The instruction name is on the far left, `movq`. The name is composed of two parts, `mov`, the kind of instruction (to move content to or from a register), and a suffix `q`, short for "quad", which means 64-bit integer. There are the following suffixes: `b` (byte, 8 bit), `w` (word, 16 bit), `l` (long, 32 bit) and `q` (quad, 64 bit).
The next two columns in the instruction, `24(%rdi)` and `%rax`, are the arguments to `movq`. These are the names of the registers (we will return to registers later) where the data to operate on is stored.
You can also see (in the larger display of assembly code) that the code is segmented into sections beginning with a name starting with "L"; for example, when I ran it, there's a section `L32`. These sections are jumped between using if-statements, or *branches*. Here, section `L32` marks the actual loop. You can see the following two instructions in the `L32` section:
```
; ││┌ @ promotion.jl:401 within `=='
cmpq $1, %rdi
; │└└
jne L32
```
The first instruction, `cmpq` (compare quad), compares the data in register `rdi`, which holds the number of iterations left (plus one), with the number 1, and sets certain flags (wires) in the CPU based on the result. The next instruction, `jne` (jump if not equal), makes a jump if the "equal" flag is not set in the CPU, which happens if there is one or more iterations left. You can see it jumps to `L32`, meaning this section repeats.
"""
# ╔═╡ b73b5eaa-8af0-11eb-191f-cd15de19bc38
md"""
#### Fast instruction, slow instruction
Not all CPU instructions are equally fast. Below is a table of selected CPU instructions with *very rough* estimates of how many clock cycles they take to execute. You can find much more detailed tables [in this document](https://www.agner.org/optimize/instruction_tables.pdf). Here, I'll summarize the speed of instructions on modern Intel CPUs. It's very similar for all modern CPUs.
You will see that the time is given both as latency and as reciprocal throughput (that is, $1/throughput$). The reason is that CPUs contain multiple circuits for some operations that can operate in parallel. So while float multiplication has a latency of 5 clock cycles, if 10 multiplications can be in flight concurrently in different circuits, on average 2 multiplications complete per clock cycle, giving a reciprocal throughput of 0.5 clock cycles.
The following table measures time in clock cycles:
|Instruction |Latency|Rec. throughp.|
|------------------------|-------|--------------|
|move data | 1 | 0.25
|and/or/xor | 1 | 0.25
|test/compare | 1 | 0.25
|do nothing | 1 | 0.25
|int add/subtract | 1 | 0.25
|bitshift | 1 | 0.5
|float multiplication | 5 | 0.5
|vector int and/or/xor | 1 | 0.5
|vector int add/sub | 1 | 0.5
|vector float add/sub | 4 | 0.5
|vector float multiplic. | 5 | 0.5
|lea | 3 | 1
|int multiplic | 3 | 1
|float add/sub | 3 | 1
|float multiplic. | 5 | 1
|float division | 15 | 5
|vector float division | 13 | 8
|integer division | 50 | 40
The `lea` instruction takes three inputs, A, B and C, where A must be 1, 2, 4, or 8, and calculates AB + C. We'll come back to what the "vector" instructions do later.
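For instance (a toy example of my own, not from the table above), an integer expression of the form `A*x + C` with `A` equal to 1, 2, 4 or 8 matches the `lea` pattern, and on x86-64 it will typically compile down to a single `leaq`:
```julia
# Hypothetical example: 8x + 3 fits the lea pattern (A = 8, C = 3)
lea_like(x::Int) = 8x + 3
# Inspect the generated code; on x86-64 this usually shows one leaq instruction
@code_native debuginfo=:none dump_module=false lea_like(1)
```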
For comparison, we may also add some *very rough* estimates of other sources of delays:
|Delay |Cycles|
|-----------------------|----|
|move memory from cache | 1
|misaligned memory read | 10
|cache miss | 500
|read from disk | 5000000
"""
# ╔═╡ c0c757b2-8af0-11eb-38f1-3bc3ec4c43bc
md"If you have an inner loop executing millions of times, it may pay off to inspect the generated assembly code for the loop and check if you can express the computation in terms of fast CPU instructions. For example, if you have an integer you know to be 0 or above, and you want to divide it by 8 (discarding any remainder), you can instead do a bitshift, since bitshifts are way faster than integer division:
"
# ╔═╡ c5472fb0-8af0-11eb-04f1-95a1f7b6b9e0
begin
divide_slow(x) = div(x, 8)
divide_fast(x) = x >>> 3;
end;
# ╔═╡ ce0e65d4-8af0-11eb-0c86-2105c26b62eb
md"However, modern compilers are pretty clever, and will often figure out the optimal instructions to use in your functions to obtain the same result, by for example replacing an integer divide `idivq` instruction with a bitshift right (`shrq`) where applicable to be faster. You need to check the assembly code yourself to see:"
# ╔═╡ d376016a-8af0-11eb-3a15-4322759143d1
# Calling it with these keywords removes comments in the assembly code
@code_native debuginfo=:none dump_module=false divide_slow(UInt(1))
# ╔═╡ d70c56bc-8af0-11eb-1220-09e78dba26f7
md"## Allocations and immutability
As already mentioned, main RAM is much slower than the CPU cache. However, working in main RAM comes with an additional disadvantage: Programs are handed a bunch of RAM by the operating system (OS) to work with. Within that chunk of memory, the program itself needs to keep track of what RAM is in use by which objects. If the program didn't keep track, the creation of one object in memory could overwrite another object, causing data loss. Therefore, every creation and destruction of data in RAM requires book-keeping, which takes time.
The creation of new objects in RAM is termed *allocation*, and their destruction is called *deallocation*. Really, (de)allocation is not *creation* or *destruction* per se, but rather the act of starting and stopping the tracking of that memory. Memory that is not kept track of will eventually be overwritten by other data. Allocation and deallocation take a significant amount of time depending on the size of the objects, from a few tens of nanoseconds to a few microseconds per allocation.
In programming languages such as Julia, Python, R and Java, deallocation is automatically done using a program called the *garbage collector* (GC). This program keeps track of which objects are rendered unreachable by the programmer, and deallocates them. For example, if you do:"
# ╔═╡ dc24f5a0-8af0-11eb-0332-2bc0834d426c
begin
thing = [1,2,3]
thing = nothing
end
# ╔═╡ e3c136de-8af0-11eb-06f1-9393c0f95fbb
md"Then there is no way to get the original array `[1,2,3]` back, it is unreachable. Hence it is simply wasting RAM, and doing nothing. It is *garbage*. Allocating and deallocating objects sometimes cause the GC to start its scan of all objects in memory and deallocate the unreachable ones, which causes significant lag. You can also start the garbage collector manually:"
# ╔═╡ e836dac8-8af0-11eb-1865-e3feeb011fc4
GC.gc()
# ╔═╡ ecfd04e4-8af0-11eb-0962-f548d2eabad3
md"The following example illustrates the difference in time spent in a function that allocates a vector with the new result relative to one which simply modifies the vector, allocating nothing:"
# ╔═╡ f0e24b50-8af0-11eb-1a0e-5d925f3743e0
begin
function increment(x::Vector{<:Integer})
y = similar(x)
@inbounds for i in eachindex(x)
y[i] = x[i] + 1
end
return y
end
function increment!(x::Vector{<:Integer})
@inbounds for i in eachindex(x)
x[i] = x[i] + 1
end
return x
end
end;
# ╔═╡ 22512ab2-8af1-11eb-260b-8d6c16762547
md"""
On my computer, the allocating function is more than 15x slower on average. Also note the high maximum time spent on the allocating function. This is due to a few properties of the code:
* First, the allocation itself takes time
* Second, the allocated objects eventually have to be deallocated, also taking time
* Third, repeated allocations trigger the GC to run, causing overhead
* Fourth, more allocations sometimes means less efficient cache use because you are using more memory
Note that I used the mean time instead of the median, since for this function the GC only triggers approximately every 30th call, but consumes 30-40 µs when it does. All this means performant code should keep allocations to a minimum.
The `@btime` macro, and other benchmarking tools, tell you the number of allocations. This information is given because it is assumed that any programmer who cares to benchmark their code will be interested in reducing allocations.
#### Not all objects need to be allocated
Inside RAM, data is kept on either the *stack* or the *heap*. The stack is a simple data structure with a beginning and end, similar to a `Vector` in Julia. The stack can only be modified by adding or removing elements from the end, analogous to a `Vector` with only the two mutating operations `push!` and `pop!`. These operations on the stack are very fast. When we talk about "allocations", however, we talk about data on the heap. Unlike the stack, the heap has an unlimited size (well, it has the size of your computer's RAM), and can be modified arbitrarily, deleting and accessing any objects. You can think of the stack like a `Vector`, and the heap like a `Dict`.
Intuitively, it may seem obvious that all objects need to be placed in RAM, must be able to be retrieved and deleted at any time by the program, and therefore need to be allocated on the heap. And for some languages, like Python, this is true. However, this is not true in Julia and other efficient, compiled languages. Integers, for example, can often be placed on the stack.
Why do some objects need to be heap allocated, while others can be stack allocated? To be stack-allocated, the compiler needs to know for certain that:
* The object is a reasonably small size, so it fits on the stack. For technical reasons, the stack can't just be hundreds of megabytes in size.
* The compiler can predict exactly *when* it needs to create and destroy the object so it can destroy it timely by simply popping the stack (similar to calling `pop!` on a `Vector`). This is usually the case for local variables in compiled languages.
Julia has even more constraints on stack-allocated objects:
* The object should have a fixed size known at compile time.
* The compiler must know that the object never changes. The CPU is free to copy stack-allocated objects, and for immutable objects, there is no way to distinguish a copy from the original. This bears repeating: *With immutable objects, there is no way to distinguish a copy from the original*. This gives the compiler and the CPU certain freedoms when operating on them. The latter point is also why objects are immutable by default in Julia, and leads to one other performance tip: Use immutable objects wherever possible.
What does this mean in practice? In Julia, it means if you want fast stack-allocated objects:
* Your object must be created, used and destroyed in a fully compiled function so the compiler knows for certain when it needs to create, use and destroy the object. If the object is returned for later use (and not immediately returned to another, fully compiled function), we say that the object *escapes*, and must be allocated.
* Your type must be limited in size. I don't know exactly how large it is allowed to be, but 100 bytes is fine.
* The exact memory layout of your type must be known by the compiler (it nearly always is).
"""
# ╔═╡ 2a7c1fc6-8af1-11eb-2909-554597aa2949
begin
abstract type AllocatedInteger end
struct StackAllocated <: AllocatedInteger
x::Int
end
mutable struct HeapAllocated <: AllocatedInteger
x::Int
end
end
# ╔═╡ 2e3304fe-8af1-11eb-0f6a-0f84d58326bf
md"We can inspect the code needed to instantiate a `HeapAllocated` object with the code needed to instantiate a `StackAllocated` one:"
# ╔═╡ 33350038-8af1-11eb-1ff5-6d42d86491a3
@code_native debuginfo=:none dump_module=false HeapAllocated(1)
# ╔═╡ 3713a8da-8af1-11eb-2cb2-1957455227d0
md"Notice the `callq` instruction in the `HeapAllocated` one. This instruction calls out to other functions, meaning that in fact, much more code is really needed to create a `HeapAllocated` object that what is displayed. In constrast, the `StackAllocated` really only needs a few instructions:"
# ╔═╡ 59f58f1c-8af1-11eb-2e88-997e9d4bcc48
@code_native debuginfo=:none dump_module=false StackAllocated(1)
# ╔═╡ 5c86e276-8af1-11eb-2b2e-3386e6795f37
md"
Because immutable objects don't need to be stored on the heap and can be copied freely, immutables are stored *inline* in arrays. This means that immutable objects can be stored directly inside the array's memory. Mutable objects have a unique identity and location on the heap. They are distinguishable from copies, so they cannot be freely copied, and so arrays of them contain references to the memory locations on the heap where the objects are stored. Accessing such an object from an array therefore means first accessing the array to get the memory location, and then accessing the object itself using that memory location. Besides the double memory access, objects are stored less efficiently on the heap, meaning that more memory needs to be copied to CPU caches, causing more cache misses. Hence, even when stored in an array on the heap, immutables can be stored more effectively.
"
# ╔═╡ 6849d9ec-8af1-11eb-06d6-db49af4796bc
md"We can verify that, indeed, the array in the `data_stack` stores the actual data of a `StackAllocated` object, whereas the `data_heap` contains pointers (i.e. memory addresses):"
# ╔═╡ 74a3ddb4-8af1-11eb-186e-4d80402adfcf
md"## Registers and SIMD
It is time yet again to update our simplified computer schematic. A CPU operates only on data present in *registers*. These are small, fixed-size slots (e.g. 8 bytes in size) inside the CPU itself. A register is meant to hold one single piece of data, like an integer or a floating point number. As hinted in the section on assembly code, each instruction usually refers to one or two registers which contain the data the operation works on:
$$[CPU] ↔ [REGISTERS] ↔ [CACHE] ↔ [RAM] ↔ [DISK CACHE] ↔ [DISK]$$
To operate on data structures larger than one register, the data must be broken up into smaller pieces that fit inside the register. For example, when adding two 128-bit integers on my computer:"
# ╔═╡ 7a88c4ba-8af1-11eb-242c-a1813a9e6741
@code_native debuginfo=:none dump_module=false UInt128(5) + UInt128(11)
# ╔═╡ 7d3fcbd6-8af1-11eb-0441-2f88a9d59966
md"""There is no register that can do 128-bit additions. First the lower 64 bits must be added using a `addq` instruction, fitting in a register. Then the upper bits are added with a `adcq` instruction, which adds the digits, but also uses the carry bit from the previous instruction. Finally, the results are moved 64 bits at a time using `movq` instructions.
The small size of the registers serves as a bottleneck for CPU throughput: It can only operate on one integer/float at a time. In order to sidestep this, modern CPUs contain specialized 256-bit registers (or 128-bit in older CPUs, or 512-bit in the brand new ones) than can hold 4 64-bit integers/floats at once, or 8 32-bit integers, etc. Confusingly, the data in such wide registers are termed "vectors." The CPU have access to instructions that can perform various CPU operations on vectors, operating on 4 64-bit integers in one instruction. This is called "single instruction, multiple data," *SIMD*, or *vectorization*. Notably, a 4x64 bit operation is *not* the same as a 256-bit operation, e.g. there is no carry-over with between the 4 64-bit integers when you add two vectors. Instead, a 256-bit vector operation is equivalent to 4 individual 64-bit operations.
We can illustrate this with the following example:"""
# ╔═╡ 8c2ed15a-8af1-11eb-2e96-1df34510e773
md"""
Here, two 8×32 bit vectors are added together in one single instruction. You can see the CPU makes use of a single `vpaddd` (vector packed add doubleword) instruction to add 8 32-bit integers, as well as the corresponding move instruction `vmovdqu`. Note that vector CPU instructions begin with `v`.
It's worth mentioning the interaction between SIMD and alignment: If a series of 256-bit (32-byte) SIMD loads are misaligned, then up to half the loads could cross cache line boundaries, as opposed to just 1/8th of 8-byte loads. Thus, alignment is a much more serious issue when using SIMD. Since array beginnings are always aligned, this is usually not an issue, but in cases where you are not guaranteed to start from an aligned starting point, such as with matrix operations, this may make a significant difference. In brand new CPUs with 512-bit registers, the issue is even worse, as the SIMD size is the same as the cache line size, so *all* loads would be misaligned if the initial load is.
SIMD vectorization of e.g. 64-bit integers may increase throughput by almost 4x, so it is of huge importance in high-performance programming. Compilers will automatically vectorize operations if they can. What can prevent this automatic vectorization?
#### SIMD needs uninterrupted iteration of fixed length
Because vectorized operations operate on multiple data at once, it is not possible to interrupt the loop at an arbitrary point. For example, if 4 64-bit integers are processed in one clock cycle, it is not possible to stop a SIMD loop after 3 integers have been processed. Suppose you had a loop like this:
```julia
for i in 1:8
if foo()
break
end
# do stuff with my_vector[i]
end
```
Here, the loop could end on any iteration due to the break statement. Therefore, any SIMD instruction which loaded in multiple integers could operate on data *after* the loop is supposed to break, i.e. data which is never supposed to be read. This would be wrong behaviour, and so, the compiler cannot use SIMD instructions.
A good rule of thumb is that SIMD needs:
* A loop with a predetermined length, so it knows when to stop, and
* A loop with no branches (i.e. if-statements) in the loop
In fact, even boundschecking, i.e. checking that you are not indexing outside the bounds of a vector, causes a branch. After all, if the code is supposed to raise a bounds error after 3 iterations, even a single SIMD operation would be wrong! To achieve SIMD vectorization then, all boundschecks must be disabled.
Fortunately, in the latest versions of Julia, the compiler has gotten pretty good at figuring out when it can use SIMD even with bounds checking enabled.
To demonstrate the significant impact of SIMD, we can use a function that calls an input predicate to decide whether to break out of a loop. We can then compare its speed when given a predicate that the compiler knows will never break the loop (and so allows SIMD vectorization), with a predicate that might break the loop."""
# ╔═╡ 94182f88-8af1-11eb-207a-37083c1ead68
begin
# The loop in the function breaks if pred(x[i])
# returns `true`, and therefore cannot be SIMD-vectorized
function sum_predicate(pred, x::Vector)
n = zero(eltype(x))
for i in eachindex(x)
y = x[i]
pred(y) && break
n += y
end
return n
end
end;
# ╔═╡ aa3931fc-8af1-11eb-2f42-f582b8e639ad
md"""
On my computer, the SIMD code is 10x faster than the non-SIMD code. SIMD alone accounts for only about a 4x improvement (since we moved from 64 bits per iteration to 256 bits per iteration). The rest of the gain comes from not spending time checking the bounds and from automatic loop unrolling (explained later), which is also made possible by the `@inbounds` annotation.
#### SIMD needs a loop where loop order doesn't matter
SIMD can change the order in which the elements of an array are processed. If the result of any iteration depends on a previous iteration, such that the elements can't be re-ordered, the compiler will usually not SIMD-vectorize. Often, when a loop won't auto-vectorize, it's because subtleties in how data moves around in registers create a hidden memory dependency between elements in the array.
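A classic example (my own sketch, not one of the benchmarks below) is a prefix sum, where every iteration needs the value written by the previous one, so the elements cannot be reordered:
```julia
# Each iteration reads the element written by the previous iteration,
# so the loop cannot be reordered and will generally not SIMD-vectorize.
function prefix_sum!(x::Vector{Float64})
    for i in 2:length(x)
        x[i] += x[i - 1]
    end
    return x
end
```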
Imagine we want to sum some 64-bit integers in an array using SIMD. For simplicity, let's say the array has 8 elements, `A`, `B`, `C` ... `H`. In an ordinary non-SIMD loop, the additions would be done like so:
$$(((((((A + B) + C) + D) + E) + F) + G) + H)$$
Whereas when loading the integers using SIMD, four 64-bit integers would be loaded into one vector `<A, B, C, D>`, and the other four into another `<E, F, G, H>`. The two vectors would be added: `<A+E, B+F, C+G, D+H>`. After the loop, the four integers in the resulting vector would be added. So the overall order would be:
$$((((A + E) + (B + F)) + (C + G)) + (D + H))$$
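Written out as scalar Julia code, that reordered summation corresponds roughly to the sketch below (this illustrates the order of operations only; it is not the code the compiler actually emits):
```julia
# Accumulate into four "lanes", as a 4-wide SIMD register would, then combine at the end.
function sum_4lanes(x::Vector)
    acc = (zero(eltype(x)), zero(eltype(x)), zero(eltype(x)), zero(eltype(x)))
    i = 1
    while i + 3 <= length(x)
        acc = (acc[1] + x[i], acc[2] + x[i+1], acc[3] + x[i+2], acc[4] + x[i+3])
        i += 4
    end
    s = acc[1] + acc[2] + acc[3] + acc[4]
    while i <= length(x)   # handle any leftover elements
        s += x[i]
        i += 1
    end
    return s
end
```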
Perhaps surprisingly, addition of floating point numbers can give different results depending on the order (i.e. float addition is not associative):
"""
# ╔═╡ c01bf4b6-8af1-11eb-2f17-bfe0c93d48f9
begin
x = eps(1.0) * 0.4
1.0 + (x + x) == (1.0 + x) + x
end
# ╔═╡ c80e05ba-8af1-11eb-20fc-235b45f2eb4b
md"for this reason, float addition will not auto-vectorize:"
# ╔═╡ e3931226-8af1-11eb-0da5-fb3c1c22d12e
md"However, high-performance programming languages usually provide a command to tell the compiler it's alright to re-order the loop, even for non-associative loops. In Julia, this command is the `@simd` macro:"
# ╔═╡ e793e300-8af1-11eb-2c89-e7bc1be249f0
function sum_simd(x::Vector)
n = zero(eltype(x))
# Here we add the `@simd` macro to allow SIMD of floats
@inbounds @simd for i in eachindex(x)
n += x[i]
end
return n
end;
# ╔═╡ f0a4cb58-8af1-11eb-054c-03192285b5e2
md"""
Julia also provides the macro `@simd ivdep` which further tells the compiler that there are no memory-dependencies in the loop order. However, I *strongly discourage* the use of this macro, unless you *really* know what you're doing. In general, the compiler knows best when a loop has memory dependencies, and misuse of `@simd ivdep` can very easily lead to bugs that are hard to detect.
"""
# ╔═╡ f5c28c92-8af1-11eb-318f-5fa059d8fd80
md"""
## Struct of arrays
If we create an array containing four `AlignmentTest` objects `A`, `B`, `C` and `D`, the objects will lie end to end in the array, like this:
```
Objects: |      A      |      B      |      C      |      D      |
Fields:  | a   | b |c| | a   | b |c| | a   | b |c| | a   | b |c| |
Byte:    1             9             17            25            33
```
Note again that byte no. 8, 16, 24 and 32 are empty to preserve alignment, wasting memory.
Now suppose you want to do an operation on all the `.a` fields of the structs. Because the `.a` fields are scattered 8 bytes apart, SIMD operations are much less efficient (loading up to 4 fields at a time) than if all the `.a` fields were stored together (where 8 fields could fit in a 256-bit register). When working with the `.a` fields only, entire 64-byte cache lines are read in, of which only half, or 32 bytes, is useful. Not only does this cause more cache misses, we also need extra instructions to pick out the half of the data in the SIMD registers that we need.
The memory structure we have above is termed an "array of structs," because, well, it is an array filled with structs. Instead we can structure our 4 objects `A` to `D` as a "struct of arrays." Conceptually, it could look like:
"""
# ╔═╡ fc2d2f1a-8af1-11eb-11a4-8700f94e866e
struct AlignmentTestVector
a::Vector{UInt32}
b::Vector{UInt16}
c::Vector{UInt8}
end
# ╔═╡ 007cd39a-8af2-11eb-053d-f584d68f7d2f
md"""
With the following memory layout for each field:
```
Object: AlignmentTestVector
.a   |  A  |  B  |  C  |  D  |
.b   | A | B | C | D |
.c   |A|B|C|D|
```
Alignment is no longer a problem; no space is wasted on padding. When running through all the `a` fields, all cache lines contain full 64 bytes of relevant data, so SIMD operations do not need extra operations to pick out the relevant data:
"""
# ╔═╡ 72fbb3ec-8ee8-11eb-3836-11092ef74e86
function Base.rand(::Type{AlignmentTest})
AlignmentTest(rand(UInt32), rand(UInt16), rand(UInt8))
end;
# ╔═╡ abb45d6a-8aef-11eb-37a4-7b10847b39b4
begin
# Open a file
function test_file(path)
open(path) do file
# Go to the 1000th byte of the file and read it
seek(file, 1000)
read(file, UInt8)
end
end
# Randomly access data N times
function random_access(data::Vector{UInt}, N::Integer)
n = rand(UInt)
mask = length(data) - 1
@inbounds for i in 1:N
n = (n >>> 7) ⊻ data[n & mask + 1]
end
return n
end
end;
# ╔═╡ bff99828-8aef-11eb-107b-a5c67101c735
let
data = rand(UInt, 2^24)
@time test_file("../alen/src/main.rs")
@time random_access(data, 1000000)
nothing
end
# ╔═╡ b73605ca-8ee4-11eb-1a0d-bb6678de91c6
begin
@btime random_access($(rand(UInt, 1024)), 2^20) seconds=1
@btime random_access($(rand(UInt, 2^24)), 2^20) seconds=1
nothing
end
# ╔═╡ ffca4c72-8aef-11eb-07ac-6d5c58715a71
function linear_access(data::Vector{UInt}, N::Integer)
n = rand(UInt)
mask = length(data) - 1
for i in 1:N
n = (n >>> 7) ⊻ data[(15 * i) & mask + 1]
end
return n
end;
# ╔═╡ e71e4798-8ee4-11eb-3ea2-fdbbcdcf7410
let
data = rand(UInt, 2^24)
@btime random_access($data, 2^20) seconds=1
@btime linear_access($data, 2^20) seconds=1
nothing
end
# ╔═╡ 18e8e4b6-8af0-11eb-2f17-2726f162e9b0
function alignment_test(data::Vector{UInt}, offset::Integer)
# Jump randomly around the memory.
n = rand(UInt)
mask = (length(data) - 9) ⊻ 7
GC.@preserve data begin # protect the array from moving in memory
ptr = pointer(data)
iszero(UInt(ptr) & 63) || error("Array not aligned")
ptr += (offset & 63)
for i in 1:4096
n = (n >>> 7) ⊻ unsafe_load(ptr, (n & mask + 1) % Int)
end
end
return n
end;
# ╔═╡ 1f38f8c6-8ee5-11eb-1c01-f3706534a9cf
let
data = rand(UInt, 256 + 8)
@btime alignment_test($data, 0) seconds=1
@btime alignment_test($data, 60) seconds=1
nothing
end
# ╔═╡ 3fae31a0-8af0-11eb-1ea8-7980e7875039
let
memory_address = reinterpret(UInt, pointer(rand(1024)))
@assert iszero(memory_address % 64) # should not error!
end
# ╔═╡ 11c500e8-8ee2-11eb-3291-4382b60c5a2b
let
data = rand(UInt, 2^10)
show(stdout, MIME"text/plain"(), @benchmark increment($data) seconds=1)
println('\n')
show(stdout, MIME"text/plain"(), @benchmark increment!($data) seconds=1)
end
# ╔═╡ 61ee9ace-8af1-11eb-34bd-c5af962c8d82
let
Base.:+(x::Int, y::AllocatedInteger) = x + y.x
Base.:+(x::AllocatedInteger, y::AllocatedInteger) = x.x + y.x
data_stack = [StackAllocated(i) for i in rand(UInt16, 1000000)]
data_heap = [HeapAllocated(i.x) for i in data_stack]
@btime sum($data_stack) seconds=1
@btime sum($data_heap) seconds=1
nothing
end
# ╔═╡ 6ba266f4-8af1-11eb-10a3-3daf6e473142
let
data_stack = [StackAllocated(i) for i in rand(UInt16, 1)]
data_heap = [HeapAllocated(i.x) for i in data_stack]
println(rpad("First object of data_stack:", 36), data_stack[1])
println(
rpad("First data in data_stack array:", 36),
unsafe_load(pointer(data_stack)),
'\n'
)
println(rpad("First object of data_heap:", 36), data_heap[1])
first_data = unsafe_load(Ptr{UInt}(pointer(data_heap)))
println(rpad("First data in data_heap array:", 36), repr(first_data))
println(
"Data at address ",
repr(first_data), ": ",
unsafe_load(Ptr{HeapAllocated}(first_data))
)
end
# ╔═╡ 84c0d56a-8af1-11eb-30f3-d137b377c31f
let
add_tuple(a, b) = a .+ b
# Create a tuple of 8 32-bit integers.
# could also have created 4 64-bit numbers etc.
numbers = ntuple(i -> rand(UInt32), 8)
@code_native debuginfo=:none dump_module=false add_tuple(numbers, numbers)
nothing
end
# ╔═╡ a0286cdc-8af1-11eb-050e-072acdd4f0a0
let
# Make sure the vector is small so we don't time cache misses
data = rand(UInt64, 4096)
# For a function that always returns false, the compiler
# knows it can never break, and so will SIMD
@btime sum_predicate(Returns(false), $data) seconds=1
# This function has a 1/2^64 risk of returning true;
# while practically impossible, the compiler cannot
# guarantee it won't break the loop, and so will not SIMD
@btime sum_predicate(iszero, $data) seconds=1
nothing
end
# ╔═╡ cc99d9ce-8af1-11eb-12ec-fbd6df3becc8
let
data = rand(Float64, 4096)
@btime sum_predicate(Returns(false), $data) seconds=1
@btime sum_predicate(iszero, $data) seconds=1
nothing
end
# ╔═╡ e8d2ec8e-8af1-11eb-2018-1fa4df5b47ad
let
data = rand(Float64, 4096)
@btime sum_predicate(Returns(false), $data) seconds=1
@btime sum_simd($data) seconds=1
nothing
end
# ╔═╡ 054d848a-8af2-11eb-1f98-67f5d0b9f4ec
let
N = 1_000_000
array_of_structs = [rand(AlignmentTest) for i in 1:N]
struct_of_arrays = AlignmentTestVector(
rand(UInt32, N),
rand(UInt16, N),
rand(UInt8, N)
)
@btime sum(x -> x.a, $array_of_structs) seconds=1
@btime sum($struct_of_arrays.a) seconds=1
nothing
end
# ╔═╡ 0dfc5054-8af2-11eb-098d-35f4e69ae544
md"""
## Specialized CPU instructions
Most code makes use of only a score of simple CPU instructions like move, add, multiply, bitshift, and, or, xor, jumps, and so on. However, CPUs in the typical modern laptop support a *lot* of CPU instructions. Usually, if an operation is used heavily in consumer laptops, CPU manufacturers will add specialized instructions to speed up these operations. Depending on the hardware implementation of the instructions, the speed gain from using these instructions can be significant.
Julia only exposes a few specialized instructions, including:
* The number of set bits in an integer is effectively counted with the `popcnt` instruction, exposed via the `count_ones` function.
* The `tzcnt` instruction counts the number of trailing zeros in the bits of an integer, exposed via the `trailing_zeros` function
* The order of individual bytes in a multi-byte integer can be reversed using the `bswap` instruction, exposed via the `bswap` function. This can be useful when having to deal with [endianness](https://en.wikipedia.org/wiki/Endianness).
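As a quick sanity check of what these functions do (the literals are my own examples):
```julia
count_ones(0b1011)           # 3: three bits are set
trailing_zeros(0b1000)       # 3: three zero bits below the lowest set bit
bswap(0x0102030405060708)    # 0x0807060504030201: byte order reversed
```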
The following example illustrates the performance difference between a manual implementation of the `count_ones` function, and the built-in version, which uses the `popcnt` instruction:
"""
# ╔═╡ 126300a2-8af2-11eb-00ea-e76a979aef45
function manual_count_ones(x)
n = 0
while x != 0
n += x & 1
x >>>= 1
end
return n
end;
# ╔═╡ 14e46866-8af2-11eb-0894-bba824f266f0
let
data = rand(UInt, 10000)
@btime sum(manual_count_ones, $data) seconds=1
@btime sum(count_ones, $data) seconds=1
nothing
end
# ╔═╡ 1e7edfdc-8af2-11eb-1429-4d4220bad0f0
md"""
The timings you observe here will depend on whether your compiler is clever enough to realize that the computation in the first function can be expressed as a `popcnt` instruction, and thus will be compiled to that. On my computer, the compiler is not able to make that inference, and the second function achieves the same result more than 100x faster.
#### Call any CPU instruction
Julia makes it possible to call CPU instructions directly. This is not generally advised, since not all your users will have access to a CPU with the same instructions, and so your code may crash for users working on computers of other brands.
The latest CPUs contain specialized instructions for AES encryption and SHA256 hashing. If you wish to call these instructions, you can call Julia's backend compiler, LLVM, directly. In the example below, I create a function which calls the `vaesenc` (one round of AES encryption) instruction directly:
"""
# ╔═╡ 25a47c54-8af2-11eb-270a-5b58c3aafe6e
begin
# This is a 128-bit CPU "vector" in Julia
const __m128i = NTuple{2, VecElement{Int64}}
# Define the function in terms of LLVM instructions
aesenc(a, roundkey) = ccall(
"llvm.x86.aesni.aesenc", llvmcall, __m128i,
(__m128i, __m128i), a, roundkey
)
end;
# ╔═╡ 2dc4f936-8af2-11eb-1117-9bc10e619ec6
md"(Thanks to Kristoffer Carlsson for [the example](http://kristofferc.github.io/post/intrinsics/)). We can verify it works by checking the assembly of the function, which should contain only a single `vaesenc` instruction, as well as the `retq` (return) and the `nopw` (do nothing, used as a filler to align the CPU instructions in memory) instruction:"
# ╔═╡ 76a4e83c-8af2-11eb-16d7-75eaabcb21b6
@code_native debuginfo=:none dump_module=false aesenc(
__m128i((1, 1)), __m128i((1, 1))
)
# ╔═╡ 797264de-8af2-11eb-0cb0-adf3fbc95c90
md"""Algorithms which makes use of specialized instructions can be extremely fast. [In a blog post](https://mollyrocket.com/meowhash), the video game company Molly Rocket unveiled a new non-cryptographic hash function using AES instructions which reached unprecedented speeds."""
# ╔═╡ 80179748-8af2-11eb-0910-2b825104159d
md"## Inlining
Consider the assembly of this function:"
# ╔═╡ 36b723fc-8ee9-11eb-1b92-451b992acc0c
f() = error();
# ╔═╡ 8af63980-8af2-11eb-3028-83a935bac0db
md"""
This code contains the `callq` instruction, which calls another function. A function call comes with some overhead depending on the arguments of the function and other things. While the time spent on a function call is measured in nanoseconds, it can add up if the function called is in a tight loop.
However, if we show the assembly of this function:
"""
# ╔═╡ 50ab0cf6-8ee9-11eb-3e04-af5fef7f2850
call_plus(x) = x + 1;
# ╔═╡ 93af6754-8af2-11eb-0fe6-216d76e683de
@code_native debuginfo=:none dump_module=false call_plus(1)
# ╔═╡ a105bd68-8af2-11eb-31f6-3335b4fb0f08
md"""
The function `call_plus` calls `+`, and is compiled to a single `leaq` instruction (as well as some filler `retq` and `nopw`). But `+` is a normal Julia function, so `call_plus` is an example of one regular Julia function calling another. Why is there no `callq` instruction to call `+`?
The compiler has chosen to *inline* the function `+` into `call_plus`. That means that instead of calling `+`, it has copied the *content* of `+` directly into `call_plus`. The advantages of this are:
* There is no overhead from the function call
* There is no need to construct a `Tuple` to hold the arguments of the `+` function
* Whatever computation happens in `+` is compiled together with `call_plus`, allowing the compiler to use information from one in the other and possibly simplify some calculations.
So why aren't *all* functions inlined, then? Inlining copies code, which increases its size and consumes more RAM. Furthermore, the *CPU instructions themselves* also need to fit into the CPU cache (although CPU instructions have their own cache) in order to be retrieved efficiently. If everything were inlined, programs would be enormous and grind to a halt. Inlining only improves performance when the inlined function is small.
Instead, the compiler uses heuristics (rules of thumb) to determine when a function is small enough for inlining to increase performance. These heuristics are not bulletproof, so Julia provides the macros `@noinline`, which prevents inlining of small functions (useful e.g. for functions that raise errors, which must be assumed to be called rarely), and `@inline`, which does not *force* the compiler to inline, but *strongly suggests* that the function ought to be inlined.
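
As a small illustration of the error-path pattern mentioned above (the names are made up for this sketch and do not appear elsewhere in the notebook):

```julia
# Keep the rarely-taken error branch in its own non-inlined function,
# so the happy path stays small enough to be inlined at the call site.
@noinline throw_neg_error(x) = throw(DomainError(x, "input must be non-negative"))

@inline function checked_double(x)
    x < 0 && throw_neg_error(x)
    return 2x
end
```
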
If code contains a time-sensitive section, for example an inner loop, it is important to look at the assembly code to verify that the small functions in the loop are inlined. For example, in [this line in my kmer hashing code](https://github.com/jakobnissen/Kash.jl/blob/b9a6e71acf9651d3614f92d5d4b29ffd136bcb5c/src/kmersketch.jl#L41), overall minhashing performance drops by a factor of two if this `@inline` annotation is removed.
An extreme difference between inlining and no inlining can be demonstrated thus:
"""
# ╔═╡ a843a0c2-8af2-11eb-2435-17e2c36ec253
begin
@noinline noninline_poly(x) = x^3 - 4x^2 + 9x - 11
inline_poly(x) = x^3 - 4x^2 + 9x - 11
function time_function(F, x::AbstractVector)
n = 0
for i in x
n += F(i)
end
return n
end
end;
# ╔═╡ b4d9cbb8-8af2-11eb-247c-d5b16e0de13f
let
data = rand(UInt, 1024)
@btime time_function(noninline_poly, $data) seconds=1
@btime time_function(inline_poly, $data) seconds=1
nothing
end
# ╔═╡ bc0a2f22-8af2-11eb-3803-f54f84ddfc46
md"""
## Unrolling
Consider a function that sums a vector of 64-bit integers. If the memory address of the vector's data is stored in register `%r9`, the length of the vector in register `%r8`, the current index into the vector in `%rcx`, and the running total in `%rax`, the assembly of the inner loop could look like this:
```
L1:
; add the integer at location %r9 + %rcx * 8 to %rax
addq (%r9,%rcx,8), %rax
; increment index by 1
addq $1, %rcx
; compare index to length of vector
cmpq %r8, %rcx
; repeat loop if index is smaller
jb L1
```
For a total of 4 instructions per element of the vector. The actual code generated by Julia will be similar to this, but will also include extra instructions related to bounds checking that are not relevant here (and will, of course, include different comments).
However, if the function is written like this:
```julia
function sum_vector(v::Vector{Int})
n = 0
i = 1
for chunk in 1:div(length(v), 4)
n += v[i + 0]
n += v[i + 1]
n += v[i + 2]
n += v[i + 3]
i += 4
end
return n
end
```
The result is obviously the same if we assume the length of the vector is divisible by four. If the length is not divisible by four, we could simply use the loop above to sum the first `N - rem(N, 4)` elements, and then add the last few elements in a separate, ordinary loop.
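A minimal sketch of such remainder handling could look like this (illustrative only; the function name is made up for the example):

```julia
function sum_vector_remainder(v::Vector{Int})
    n = 0
    i = 1
    # unrolled main loop over the first length(v) - rem(length(v), 4) elements
    for chunk in 1:div(length(v), 4)
        n += v[i + 0]
        n += v[i + 1]
        n += v[i + 2]
        n += v[i + 3]
        i += 4
    end
    # ordinary loop over the remaining 0-3 elements
    while i ≤ length(v)
        n += v[i]
        i += 1
    end
    return n
end
```

Despite the functionally identical result, the assembly of the unrolled loop is different and may look something like: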
```
L1:
addq (%r9,%rcx,8), %rax
addq 8(%r9,%rcx,8), %rax
addq 16(%r9,%rcx,8), %rax
addq 24(%r9,%rcx,8), %rax
addq $4, %rcx
cmpq %r8, %rcx
jb L1
```
For a total of 7 instructions per 4 additions, or 1.75 instructions per addition. This is less than half the number of instructions per integer! The speed gain comes from checking less often whether we have reached the end of the loop. We call this process *unrolling* the loop, here by a factor of four. Naturally, unrolling can only be done if we know the number of iterations beforehand, so we don't "overshoot" the number of iterations. Often, the compiler will unroll loops automatically for extra performance, but it can be worth looking at the assembly. For example, this is the assembly for the innermost loop generated on my computer for `sum([1])`:
```
L144:
vpaddq 16(%rcx,%rax,8), %ymm0, %ymm0
vpaddq 48(%rcx,%rax,8), %ymm1, %ymm1
vpaddq 80(%rcx,%rax,8), %ymm2, %ymm2
vpaddq 112(%rcx,%rax,8), %ymm3, %ymm3
addq $16, %rax
cmpq %rax, %rdi
jne L144
```
Here you can see that the loop is both unrolled by a factor of four and uses 256-bit SIMD instructions, for a total of 128 bytes, or 16 integers added per iteration, or 0.44 instructions per integer.
Notice also that the compiler chooses to use 4 different `ymm` SIMD registers, `ymm0` to `ymm3`, whereas in my example assembly code, I just used one register `rax`. This is because, if you use 4 independent registers, then you don't need to wait for one `vpaddq` to complete (remember, it had a ~3 clock cycle latency) before the CPU can begin the next.
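
Expressed in plain Julia, the same trick of using several independent accumulators looks something like the sketch below (illustrative only; as the assembly above shows, the compiler typically performs this rewrite for integer sums on its own). As before, it assumes the length of the vector is divisible by four:

```julia
function sum_four_accumulators(v::Vector{Int})
    # Four independent running totals: each addition only depends on its own
    # previous value, so the CPU need not wait for the other three to finish.
    n1 = n2 = n3 = n4 = 0
    i = 1
    for chunk in 1:div(length(v), 4)
        n1 += v[i + 0]
        n2 += v[i + 1]
        n3 += v[i + 2]
        n4 += v[i + 3]
        i += 4
    end
    return n1 + n2 + n3 + n4
end
```
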
The story for unrolling is similar to that for SIMD: The compiler will only unroll a loop if it can tell _for sure_ that it will not overshoot the number of iterations. For example, compare the two following functions:
"""
# ╔═╡ f0bc1fdc-8ee9-11eb-2916-d71e1cf36375
let
data = fill(false, 2^20)
# any: Stops as soon as it finds a `true`
@btime any($data) seconds=1
# foldl: Loops over all values in the array
@btime foldl(|, $data) seconds=1
data[1] = true
@btime any($data) seconds=1
@btime foldl(|, $data) seconds=1
nothing
end
# ╔═╡ 36a2872e-8eeb-11eb-0999-4153ced71678
md"""
The _first_ function stops and returns as soon as it finds a `true` value - but this break in the loop disables SIMD and unrolling. The _second_ function continues through the entire array, even if the very first value is `true`. While this enables SIMD and unrolling, it's obviously wasteful if a `true` appears right at the beginning. Hence, the first is better when we expect to see the first `true` before around 1/4th of the way through the array, and the latter is better otherwise.
We can create a compromise by manually unrolling. In the functions below, `check128` checks 128 entries using `@inbounds`, without stopping along the way to check whether it has found a `true`, and is thus unrolled and SIMD-vectorized. `unroll_compromise` then uses `check128`, but breaks out of the loop as soon as it finds a `true`.
"""
# ╔═╡ 9ca70cfc-8eeb-11eb-361b-b929089ca109
begin
@inline function check128(data, i)
n = false
@inbounds for j in 0:127
n |= data[i+j]
end
n
end
function unroll_compromise(data)
found = false
i = 1
while !found & (i ≤ length(data))
check128(data, i) && return true
i += 128
end