---
title: "Introduction to Big Data"
author: "Arthur Katossky & Rémi Pépin"
date: "February 2020"
subtitle: Lesson 1 — Data-intensive computing
institute: ENSAI (Rennes, France)
output:
  xaringan::moon_reader:
    css: ["default", "css/presentation.css"]
    nature:
      ratio: 16:10
      scroll: false
---
```{r setup, message=FALSE, warning=FALSE, include=FALSE}
pacman::p_load(dplyr, ggplot2)
# TO DO:
# [X] Finish an outline of the presentation: just titles ; almost no text (1h)
# [ ] Add header with name, date, ....
# [ ] Add images, gifs and videos (2h)
# [ ] Clean up the beginning to match the minimalistic presentation style (30 min)
# [ ] Use extra time (bedtime = midnight) to pimp the presentation (colors, shadows, transitions)
# [?] Add something about cloud computing?
# [ ] find a way to store resources in common for all courses
# [ ] add resources (books, sites, etc.)
# [ ] better overall style
# [ ] better graphic style
# [ ] add colors
# [ ] add interactivity for text input (add it live to the presentation)
# [ ] introduce the distinction between big N and big K
# [ ] add this course's outline / summary
```
# Forewords
???
This course is in English. You are welcome to ask vocabulary questions at any time.
Try to ask questions in English, but if you have tried and can't, you've done your best and you can switch to French.
---
## Motivation
--
Machine-learning algorithms do not scale well.
--
- exact solutions do not scale // ex: matrix inversion is of $O(n^3)$ complexity
--
- approximate solutions do not scale either
???
Some very popular machine-learning algorithms do not scale well.
Exact solutions do not scale: matrix inversion (for instance) is a problem with an algorithmic complexity of $O(n^3)$ ([source](https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations)).
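A quick empirical check of that claim (a sketch only, `eval=FALSE`; exact timings depend on the machine and the BLAS library): inverting matrices of growing size in R shows how quickly a cubic algorithm becomes expensive.
```{r, eval=FALSE}
# Sketch: timing matrix inversion for increasing n (roughly O(n^3)).
ns <- c(500, 1000, 2000)
times <- sapply(ns, function(n) {
  m <- matrix(rnorm(n * n), n, n)
  system.time(solve(m))["elapsed"]
})
rbind(n = ns, seconds = round(times, 2))
# For a cubic algorithm, doubling n multiplies the time by roughly 2^3 = 8.
```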
---
background-image: url(img/servers.jpg)
.contrast[ Google's pre-training of BERT (a neural network for natural language processing) via gradient descent — a typically approximate method — took 4 days on a 64-TPU cluster. ]
.footnote[.contrast[**Source:** [Arxiv](https://arxiv.org/abs/1810.04805) — **Image:** [Wikimedia](https://commons.wikimedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_23.jpg)]]
???
Approximate solutions do not scale either: recently, Google's pre-training of BERT (a neural network for natural language processing) via gradient descent took 4 days on a mesmerizing 64-TPU cluster.
---
## Motivation
Machine-learning algorithms do not scale well.
- exact solutions do not scale // ex: matrix inversion is of $O(n^3)$ complexity
- approximate solutions do not scale either
--
It is thus natural to ask how these computation procedures may be accelerated.
--
- Using algorithmic tricks?
--
- Using more memory?
--
- Parallelizing on several computers?
--
- Relaxing our precision requirements? ...
--
There are many, many possible solutions... and each of these problems is far from trivial!
--
Even computing a median can prove challenging to speed up!
---
## Goal
At the end of this course, you will:
--
- understand the potential bottlenecks of data-intensive processes
--
- choose between concurrent solutions for dealing with such bottlenecks
--
- understand the map-reduce principle and have practical experience with Spark on HDFS
--
- have an overview of the cloud computing ecosystem and practical experience with one provider (AWS)
--
- get a glimpse of the *physical*, *ethical*, *economic*, *financial*, *environmental*, *statistical*, *political*... challenges, limitations and dangers of the suggested solutions
???
In this course, we will review the basics of computer science that may be required for accelerating statistical estimation processes.
Ex techniques: sampled vs. exhaustive, sequential vs. parallel, in memory vs. on disk, whole data vs. in chunks, local vs. cloud...
---
## Outline & schedule
--
**Courses:**
- **Course 0:** Introduction to cloud computing (1h30, March 3)
- **Course 1:** Data-intensive computing in a nutshell (3h, February 18)
- **Course 2:** Parallelized computing and distributed computing (3h, March 2)
--
**Tutorials:**
- **Tutorial 0:** Hands-on AWS (1h30, March 3)
- **Tutorial 1:** Hands-on Spark (3h, March 18)
- **Tutorial 2:** How to optimize a statistical algorithm? (3h, March 4)
- **Tutorial 3:** How to compute a statistical summary over distributed data? (3h, March 11)
- **Tutorial 4:** Coping with APIs, data streams and exotic formats (3h, March 25)
???
There will be 6h of lectures plus a 1h30 presentation of cloud-computing solutions.
There will be 9h of tutorials plus a 1h30 hands-on session with AWS.
The course's content this year is the same as the 2nd-year students' "Introduction to Big Data". One out of the 3 tutorials (How to accelerate a neural network?) is specific to this course.
<!-- TP2: How to speed up the computation of a statistical summary? -->
---
## Evaluation
- Multiple-choice questionnaire at the beginning of every tutorial
- 1 short tutorial report (tutorial 1)
- Written exam (most probably a multiple-choice questionnaire)
---
## Material
Course and tutorial material will be available on Moodle.
---
# From "Big Data" to "Large Scale"
<!-- intro on the quantity of data produced -->
---
## What is "Big Data" ?
--
.pull-right[![](https://image.flaticon.com/icons/png/512/906/906794.png)]
---
## What is "Big Data" ?
The term came into use in the 1990's and remains an ill-defined notion.
**Big Data** broadly refers to data that **cannot be managed by commonly-used software**.
???
(It is inherently relative to **who** is using it. What counts as common software for one person may not for another. Think of relational databases for someone used to spreadsheet software.)
---
## What is "Big Data" ?
.pull-left[
_Row and column limits of Microsoft Excel worksheets_
```
             Version      Rows   Columns
until 1995       7.0    16 384       256
until 2003      11.0    65 536       256
from 2007       12.0 1 048 576    16 384
```
]
.pull-right[
_Maximum number of cells stored in one worksheet_
```{r, echo=FALSE, fig.height=2.5, fig.width=3.5, out.width="100%"}
breaks <- 10^(1:10)
tibble(
measurement = "Excel (number of cells)",
year = c(1999,2003,2007),
lines = c(16384,65536,1048576),
columns = c(256,256,16384)
) %>%
ggplot() +
geom_line(aes(x=year,y=lines*columns, group="measurement")) +
scale_y_log10(name=NULL, breaks=breaks) + # ,labels =scales::label_number()
scale_x_continuous(name=NULL, breaks=1999:2010, minor_breaks = NULL) +
expand_limits(y=1000000) +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
```
]
???
The target is constantly moving, as the performance of machines and software keeps increasing.
For instance Excel, Microsoft's widely used spreadsheet program, can only cope with a limited number of rows and columns.
---
## What is "Big Data" ?
Size is **not** the only thing that matters.
--
.pull-left[In a spreadsheet program, what kind of information can't you store properly?]
.pull-right[![](https://image.flaticon.com/icons/png/512/906/906794.png)]
--
- relational data
--
- images, long texts
--
- unstructured data (ex: web page)
--
- rapidly varying data (ex: Twitter feed)
???
Nor is it only a question of size. Since we are talking about Excel, it is clear that many kinds of data cannot be stored (or can only be stored with difficulty) in Excel, for instance:
In a spreadsheet program, what kind of information can't you store properly? (ask the audience)
---
## What is "Big Data" ?
You will often find references to the "3 V's of Big Data":
- **V**olume
- **V**elocity
- **V**ariety¹
.footnote[¹: not really treated in this course]
???
- **V**olume (massive, taking place)
- **V**elocity (fast, updated constantly)
- **V**ariety (tabular, structured, unstructured, of unknown nature, mixed)
Each of these aspects generates specific challenges, and we are often confronted with two or three of them simultaneously! In this large-scale machine-learning course, we will tackle the *volume* and *velocity* aspects.
Marketers and management gurus like to add V's to the V's. You will get **Value**, **Veracity**... and more. You will hear about the "5 V's".
---
## How big is "Big Data" ?
<!-- Find some graphics and figures about the size of files.-->
???
Let us pay specific attention to volume. Data volume is measured in bits (FR: bit), the basic 0/1 unit of digital storage. One bit can store only 2 states / values. There are multiples of the bit, the first being the byte (FR: octet), equal to 8 bits and able to store as many as $2^8=256$ states / values.
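A quick sanity check in R (a sketch, not run during the presentation): a double-precision number occupies 8 bytes, so object sizes grow fast.
```{r, eval=FALSE}
# One million doubles take roughly 8 MB of memory...
x <- numeric(1e6)
print(object.size(x), units = "Mb")
# ...so a billion doubles would need roughly 8 GB, more than many laptops' RAM.
```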
---
## Large-scale computing
<!-- Find some graphics and figures about the computation power of machines.-->
???
Note that what characterises the *data* does not necessarily say anything about the computation you perform **on** the data. Machine-learning tasks (in the learning phase as well as the prediction phase) can be challenging even with modest amounts of data, since some algorithms (typically those involving matrix inversion) do not scale well at all. **Scaling** computation will be the focus of this course. (Of course all this is linked: if we want to update a prediction model on a rapidly arriving massive dataset, we need clever, fast computation! Think of Spotify's weekly recommendations for their millions of users.)
---
## The limits to storage and computing
???
There are many limits to our ability to store and compute.
A last (but important) dimension is cost. Data can be said to become "big" whenever storing it or computing statistics on it becomes so resource-intensive that the price is no longer negligible. Imagine having to buy an extra computer just to perform a given operation.
Ethical, ecological and political (autonomy) considerations are also at stake, and we will spend some time discussing them.
Before we dig into the details and discuss examples of large-scale machine-learning challenges, a detour through how a typical desktop computer works is needed.
---
# Computer science survival kit
--
A computer can be abstracted by four key components:
--
- processing (FR: processeurs)
--
- memory (FR: mémoire vive)
--
- storage (FR: stockage)
--
- wiring / network (FR: connexions / réseau)
???
Ask audience
---
background-image: url(img/cpu.jpg)
background-size: cover
.pull-right[
## Processors
]
???
The processor is the unit responsible for performing the actual computing.
A processor can be called a "core", a "chip"... and can be differentiated into CPU (central processing units), GPU (graphical processing units) and recently TPU (tensor processing units).[^1]
[^1]: Formally, the chip is the physical device, and a chip may enclose one or more logical processors, called cores, in its electronic circuits. Multiple CPU cores on the same die share circuit elements, use less energy, and exchange information faster (individual signals can be shorter, degrade more slowly and do not need to be repeated as often).
A processor's performance is measured by how many operations it can perform per second, typically in FLOPS (floating-point operations per second).
---
## Processors
**What is important for us to know?**
--
- computation happens _physically_
--
- programs must be converted to machine-code
--
- instructions are stored sequentially in a stack / thread
--
- multiple processors may share a stack, or have each their own
--
- processors can be specialized
???
- the elementary computation (addition, multiplication...) happens through a **physical** process
- in order to perform higher-level operations (the source code), instructions must be converted into more basic units of computation (the machine code)
- the low-level instructions are stored sequentially in a stack or thread
- processors draw from the stack; multiple processors may all draw from the same stack (multi-threaded architecture), or on the contrary each have their own (single-threaded)
- processors can be hard-wired to specialize in certain types of computation; this is the case for GPUs and TPUs
---
![](img/compution-cost-effectiveness.png)
.footnote[**Source:** Kurzweil ([link](http://www.kurzweilai.net/exponential-growth-of-computing)) via Our World in Data ([link](https://ourworldindata.org/technological-progress))]
???
Processors have made incredible progress over time. And this keeps happening.
---
.height600[![](img/compution-speed.png)]
---
.height600[![](img/compution-moore.png)]
???
"Law" by 1965 by the Intel co-founder Gordon E. Moore.
---
## Processors
**Processing is a limiting factor.**
???
It is costly (it uses energy). One unit can only perform so many operations per second. There is no easy way to combine multiple processors to perform a complex operation.
---
background-image: url(img/ram.jpg)
background-size: cover
## Memory
???
Memory is the unit where the computer temporarily stores the instructions and the data on which to perform computation.
It is also broadly called "cache" or "RAM"[^2].
Memory efficiency is measured by how much information it can contain and how fast it can be read from or written to.
---
## Memory
**What is important for us to know ?**
--
- moving data around takes time
--
- memory is fast (ns)
--
- memory is volatile
--
- memory-processor units are **heavily** optimized
???
What is important to know for us:
- memory is fast, with access time typically measured in nanoseconds
- memory is volatile, meaning that it is lost in case of power interruption
- the easiest way to perform a computation is to move all the data into memory — where it becomes accessible to the CPU
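A rough illustration of the speed gap (a sketch, `eval=FALSE`; exact numbers depend on the machine): summing a vector that already sits in RAM versus re-reading the same numbers from disk first.
```{r, eval=FALSE}
x <- rnorm(1e7)                        # ~ 80 MB of doubles
tmp <- tempfile(fileext = ".rds")
saveRDS(x, tmp)                        # write a copy to disk

system.time(sum(x))                    # data already in memory: very fast
system.time(sum(readRDS(tmp)))         # must go through the disk first: much slower
```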
---
## Memory
**Memory is a limiting factor.**
???
We can readily perform computation only on a chunk of data that fits into memory. RAM is very costly: it is expensive to buy, and also expensive at use time, since it needs electricity to persist data.
<!-- graph of the veolution of memory size -->
[^2]: Formally, there is intermediate memory between the RAM and the processor: the CPU registers are closest to the processor, then comes the CPU cache, to which the computer moves the data and instructions it is about to use, for efficiency reasons. "Caching" in general refers to the act of copying a piece of information closer to the place where it will be used, such as storing a local copy of a website while navigating the Internet. RAM (random-access memory) is the most-used type of memory.
---
background-image: url(img/storage.jpg)
background-size: contain
## Storage
???
Storage is the unit for long-term storage of information.
It is also called "(disk) space" or "(hard) disk".
Contrary to memory, the disk's non-volatile nature makes it suitable for conserving information over long periods. Its measures of efficiency are thus its size, how long it can retain information without error, and how fast you can read from and write to it.
There are other media of storage, such as... paper, but also digital tapes.
In reality, the distinction between memory and storage is more of a continuum. For instance, SSDs are faster to read from but more expensive to produce.
---
## Storage
**What is important for us to know ?**
--
- storage is fallible
--
- writing and reading is slow (ms)
--
- data is always cached for computation
???
What is important to know for us:
- storage is fallible: errors do occur and information slowly degrades (a phenomenon known as data decay: https://en.wikipedia.org/wiki/Data_degradation); solid-state memory like flash cards may be corrupted by current; magnetic fields can corrupt disks and tapes; scratches and dust can corrupt DVDs; rot / overuse can corrupt books, etc.; keeping copies often mitigates the problem, but at the cost of maintaining coherence!
- storage is slow, typically measured in milliseconds [^3]
- computation cannot be performed directly from disk, it has to be temporarily copied into memory (or "cached")
[^3]: writing to and reading from a disk requires the disk to turn and the reading head to move; on top of that, files are split into chunks stored at different places, and the computer needs a registry to record what is where
---
## Storage
![](img/xkcd-digital-data.png)
---
## Storage
**Storage is a limiting factor.**
???
On a given computer, we can only store so much. Today's price of disk space makes it less limiting economically. But if the data exceeds the size of the disk, it must be split between several disks, which complicates programming (**distributed storage**).
<!-- average storage capacity ; graph of information degradation over time -->
---
.height600[![](img/storage-cost.png)]
---
.height600[![](img/storage-average.png)]
???
2 more things before we move on: network and compilation / intepretation
---
## Network
???
Shorter == better.
<!-- Amazon truck for servers. -->
<!-- Speed vs. cost of services. -->
---
## Interpreting vs. compiling
???
When running a programme written in R or Python, the **source code** is translated into **machine code**.
There are schematically 2 ways to do that:
When **interpreting** the source code, instructions from source code are converted **one by one** into machine code. Say you have an error in your code at line 10. Your programme would execute the first 9 lines then halt with an error.
When **compiling** the source code, instructions from the source code are converted **as a whole** into machine code, while performing many optimisation procedures under the hood (ex: multiplying by 2 is just appending an extra 0 to the right in binary; also you may want to take advantage of the processor pipeline, i.e. the fact that reading the next task from memory can be made to happen *at the same time* as the execution of the preceding task). This means that code with an error would simply fail to compile, and that "the beginning" (which does not even make sense anymore) will never be executed. Compilation takes time, but it may be worth it if the same programme is run regularly.
Real-world computing is not black and white. For instance, most interpreters do perform optimisations on the fly, though not as thoroughly as compilers. A common intermediate pattern is **compilation to bytecode**, a platform-agnostic language that gets interpreted at run time, or even further compiled to machine code.
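A minimal sketch of explicit byte-compilation in R with the base `compiler` package (recent R versions already byte-compile functions on the fly, so the JIT is disabled here purely for illustration):
```{r, eval=FALSE}
library(compiler)
enableJIT(0)                                  # switch off automatic JIT compilation

f <- function(n) { s <- 0; for (i in seq_len(n)) s <- s + i; s }
f_bc <- cmpfun(f)                             # compile the function to bytecode

system.time(f(1e7))                           # interpreted
system.time(f_bc(1e7))                        # byte-compiled: usually noticeably faster

enableJIT(3)                                  # restore the default behaviour
```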
---
# Challenges with large-scale machine-learning
???
Intro:
AWS Snowmobile.
The photo of the black hole.
Some things are more trivial than others.
---
## What if ... ?
---
## What if data is too big to fit in RAM ?
.pull-right[![](https://image.flaticon.com/icons/png/512/906/906794.png)]
--
.pull-left[
1. do less (filter, sample)
2. buy more/bigger memory
3. do things in chunks (stream, pipeline; see the sketch below)
4. distribute the data
]
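A minimal sketch of option 3 with base R only (`big_file.csv` and its numeric column `x` are hypothetical): a running sum computed chunk by chunk, so the whole file never sits in RAM.
```{r, eval=FALSE}
con <- file("big_file.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]

total <- 0
repeat {
  lines <- readLines(con, n = 10000)                       # 10,000 rows at a time
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  total <- total + sum(chunk[["x"]])                       # update the running sum
}
close(con)
total
```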
---
## What if ... ?
---
## What if a file is too big for the file system ?
.pull-right[![](https://image.flaticon.com/icons/png/512/906/906794.png)]
--
.pull-left[
1. do less (filter, sample)
2. change the file system
3. cut the file into pieces and process it in chunks
4. use a database, that's what they are made for
5. go to the cloud, providers take care of the problem for you
]
???
Sometimes you don't have a choice: you get data you have to read before you can filter it.
---
## What if a file is too big for the file system ?
![](img/limit-size-aws.png)
---
## What if a file is too big for the file system ?
![](img/limit-size-azure.png)
---
## What if a file is too big for the file system ?
![](img/limit-size-gcp.png)
---
## What if ... ?
---
## What if data is too big to fit on disk ?
.pull-right[![](https://image.flaticon.com/icons/png/512/906/906794.png)]
--
.pull-left[
1. store less (sample, filter)
2. buy bigger physical disk
3. buy a storage server
4. rent storage space in the cloud
5. distribute the data yourself on several nodes
]
---
## What if ... ?
---
## What if computation takes ages ?
.pull-right[![](https://image.flaticon.com/icons/png/512/906/906794.png)]
--
.pull-left[
1. do less (filter, sample), especially in development stages
2. do less (fewer tests, less formatting)
3. go low-level (compiling ... or building chips!)
4. profile your code (how does it scale ? where are the bottlenecks?)
5. buy more or faster cores
6. rent cores on the cloud
7. avoid i/o or network operations
8. **parallelize on non-local nodes**
]
???
We have to distinguish between the estimation of a statistical model (always the same call, happens seldom, often only once or twice), development/research (many requests, often very different from each other) and a model in production, for instance for prediction (usually lighter computations, but possibly happening very often). It is rare to need both a very intense computation and to perform it often, an exception being for instance 3D rendering.
The most-used statistical procedures (computing a sum, a mean, a median, inverting a matrix...) are already heavily optimised in traditional software, so you will rarely need to dig far into the algorithmic machinery.
2. Ex: In R, `parse_date_time()` parses an input vector into a POSIXct date-time object; `parse_date_time2()` is a fast C parser of numeric orders. // Caveat: inputs have to be reliably standardized
3. compile, or even build dedicated chips (cf. TPUs)
7. minimize communication (maybe you can perform part of the computation in the database, like aggregation, and only then communicate the result); beware of the cost of data transfer! Especially when using the cloud! // Unless you can perform reads / writes in parallel with earlier / later tasks, a.k.a. pipelining or streaming.
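A minimal sketch of local parallelization with the base `parallel` package (the same `makeCluster()` call also accepts remote host names for non-local nodes); the bootstrap on `mtcars` is just a hypothetical workload:
```{r, eval=FALSE}
library(parallel)

cl <- makeCluster(4)                          # 4 local worker processes
clusterSetRNGStream(cl, iseed = 1)            # reproducible random numbers across workers

boot_coef <- function(i) {
  rows <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[rows, ]))   # refit the model on one resample
}

res <- parLapply(cl, 1:1000, boot_coef)       # spread the 1000 resamples over the workers
stopCluster(cl)

apply(do.call(rbind, res), 2, sd)             # bootstrap standard errors
```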
---
## What if ... ?
---
## What if computation / storage is too expensive ?
.pull-right[![](https://image.flaticon.com/icons/png/512/906/906794.png)]
--
.pull-left[
1. store or compute less (filter, sample)
2. consider cloud computing
3. be careful with database requests
4. go from RAM to disk
5. use many smaller computing units (ex: scientific grids)
]
???
4. stream, caveat: more communication between disk and memory
---
## What if ... ?
---
## What if data arrives or is requested too fast ?
.pull-right[![](https://image.flaticon.com/icons/png/512/906/906794.png)]
--
.pull-left[
1. for computing: stream / pipeline
2. for storage: fast databases
]
???
There are many distinct problems. For instance, you may have data that **arrives** continuously (think financial operations). Conversely, many users may want to access the same data often (think newspapers / websites). You may want to do both (think Twitter).
And you may (or may not) want to guarantee that your solution can scale, i.e. that it does not depend on the volume or arrival rate of the data.
Different solutions can cope with this:
-> streaming (on the computing side) => same problem as distribution: the computation has to be done in chunks
-> database solutions for fast write / fast read; there is always a trade-off between size, durability and reliability (do I correctly get the latest piece of information?), a huge subject of research and development; it is more a question of production (you won't necessarily be concerned with this aspect)
---
# Other issues with Large-Scale Machine-Learning
---
## Statistical issues
???
- statistical power
- false discovery rate & multiple tests
- low information density
- data improper for causal relationships
Big data uses mathematical analysis, optimization, inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density, in order to reveal relationships and dependencies, or to perform predictions of outcomes and behaviors.
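A minimal illustration of the multiple-testing problem (a sketch, `eval=FALSE`): with enough variables, "significant" correlations appear even in pure noise.
```{r, eval=FALSE}
set.seed(1)
noise <- as.data.frame(matrix(rnorm(100 * 200), nrow = 100))   # 200 columns of pure noise
pvals <- sapply(noise[-1], function(x) cor.test(noise[[1]], x)$p.value)
sum(pvals < 0.05)   # typically around 10 "discoveries" out of 199 tests at the 5% level
```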
---
## Ethical issues
???
- too much information -> everyone is identifiable
- retroactive content (the collection of data depends on what the algorithm produces and the algorithm learns from the data, ex: video recommendations on YouTube)
- security issues (all three giants are forced to comply with requests from American intelligence agencies)
---
## Environmental issues
???
- environmental resource consumption of storing / computing / communication
- both the energy consumption
- but also what to do with all the waste
---
## Political issues
???
- what independence do you get when all the data is in the hands of a handful of American companies? See OVH's struggle to exist in this market.
---
# Today's conclusion
--
1. MOST DATA IS SMALL BUT COMPUTATION ON IT CAN STILL BE A CHALLENGE
--
2. DISTRIBUTING DOES NOT HELP ; DO IT ONLY IF YOU CAN'T AVOID IT
--
3. SAMPLING IS OFTEN A VERY GOOD SOLUTION
--
4. THERE ARE MANY MANY WAYS TO SPEED UP A COMPUTATION
--
5. THERE ARE ENVIRONMENTAL, ECONOMIC, ETHICAL AND STATISTICAL CHALLENGES WITH THE ACCUMULATION OF SUCH QUANTITIES OF DATA