Skip to content

Commit

Permalink
Updated reveal slides
Browse files Browse the repository at this point in the history
  • Loading branch information
d9w committed Oct 3, 2023
1 parent 10be7fa commit 1965149
Show file tree
Hide file tree
Showing 76 changed files with 5,919 additions and 26 deletions.
349 changes: 349 additions & 0 deletions slides/0_0_intro.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,349 @@
<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />

<title>Tools of Big Data</title>
<link rel="shortcut icon" href="./favicon.ico" />
<link rel="stylesheet" href="./dist/reveal.css" />
<link rel="stylesheet" href="./static/css/reset.css" />
<link rel="stylesheet" href="./static/css/evo.css" />
<!-- <link rel="stylesheet" href="/_assets/evo" id="theme" /> -->
<link rel="stylesheet" href="./css/highlight/vs.css" />

<link rel="stylesheet" href="././_assets/static/css/evo.css" />
<link rel="stylesheet" href="././_assets/static/css/reset.css" />

</head>

<body>
<div class="reveal">
<div class="slides"><section data-markdown><script type="text/template">

## Tools of Big Data

**SDD, 2021**

Dennis WILSON

[supaerodatascience.github.io/DE](https://supaerodatascience.github.io/DE/)

</script></section><section data-markdown><script type="text/template">

### Software as a Service

![](static/img/web_simple.png)

What happens when:

+ There are more clients than one server can handle?
+ There's more data than one database can store?
+ There are different data types which aren't adapted for your data?
+ You want to show live analysis of the data which integrates the user's request?

</script></section><section data-markdown><script type="text/template">

### A growing issue

<table width="30%">
<tr><td>1 </td><td> B</td><td> byte</td></tr>
<tr><td>1000 </td><td> kB</td><td> kilobyte</td></tr>
<tr><td>1000^2 </td><td> MB</td><td> megabyte</td></tr>
<tr><td>1000^3 </td><td> GB</td><td> gigabyte</td></tr>
<tr><td>1000^4 </td><td> TB</td><td> terabyte</td></tr>
<tr><td>1000^5 </td><td> PB</td><td> petabyte</td></tr>
<tr><td>1000^6 </td><td> EB</td><td> exabyte</td></tr>
<tr><td>1000^7 </td><td> ZB</td><td> zettabyte</td></tr>
</table>

![](static/img/datasphere.png)

source: [IDC](https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf)

</script></section><section data-markdown><script type="text/template">

### Big Data Architectures

![](static/img/big-data-pipeline.png)

source: [Microsoft Azure](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/)

</script></section><section data-markdown><script type="text/template">

### Data Architecture Requirements

+ Availability
+ Recoverability
+ Redundancy
+ Consistency
+ Usability

![](static/img/referential_integrity.png)

How to ensure these characteristics for increasing amounts of data requiring large infrastructure?

</script></section><section data-markdown><script type="text/template">

### File system storage

<img src="static/img/inodes.png" width="50%">

Unix [inodes](https://man7.org/linux/man-pages/man7/inode.7.html)

</script></section><section data-markdown><script type="text/template">

### Databases

![](static/img/database.png)

[Relational Database](https://www.researchgate.net/publication/323466947_Design_and_Analysis_of_a_Relational_Database_for_Behavioral_Experiments_Data_Processing)

</script></section><section data-markdown><script type="text/template">

### Beyond Data

+ Store unstructured data that doesn't fit well into a database
+ Store analysis processes
+ Present different interfaces for internal and external processes

![](static/img/datalake.png)

</script></section><section data-markdown><script type="text/template">

### Silos, lakes, warehouses

![](static/img/lake_vs_warehouse.png)

[Amazon AWS](https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/?nc=sn&loc=2)

</script></section><section data-markdown><script type="text/template">

### AWS Data Lake

<img src="static/img/Zaloni-Data-Lake-2.png" width="70%">

Zaloni data lake on [AWS](https://aws.amazon.com/blogs/apn/turning-data-into-a-key-enterprise-asset-with-a-governed-data-lake-on-aws/)

</script></section><section data-markdown><script type="text/template">

### Data analysis

+ Separable from production data manipulation
+ Transferable between test environment and hardware types
+ Access to high performance CPUs or GPUs
+ High level languages like Python
+ Visualization interface

![](static/img/jupyter.png)

</script></section><section data-markdown><script type="text/template">

### Virtualization

<img src="static/img/virtualization.png" width="60%">

Source: [unixtutorial.org](https://www.unixtutorial.org/hw-virtualization/)

Why is this suboptimal for machine learning pipelines?

</script></section><section data-markdown><script type="text/template">

### Computing systems

![](static/img/hpc_cloud.bmp)

Cloud advantage:
<br/>hardware based on component requirements

</script></section><section data-markdown><script type="text/template">

### Cloud: the new compute paradigm

![](static/img/cloud_providers.jpg)

Source: [DigitalCMO](https://www.digitalcmo.fr/offre-cloud-vous-avez-de-plus-en-plus-le-choix/)

</script></section><section data-markdown><script type="text/template">

### Data distribution

<img src="static/img/akamai_latency.png" width="40%">

[Dyn](https://help.dyn.com/understanding-cdn-performance/) on Akamai latency

+ size of data
+ data transfer latency
+ redundancy
+ calculation parallelism

</script></section><section data-markdown><script type="text/template">

### The problem

<img src="static/img/distributed.png" width="40%" >

How do we execute a query when our data is distributed:
+ across multiple different servers?
+ across different replicas?
+ efficiently?

</script></section><section data-markdown><script type="text/template">

### Example: word count

<img src="static/img/mapreduce.png" width="40%">

+ Split data tasks into distributable components
+ Define operators and manipulate them
+ Send operations between different servers
+ Use **functional programming**

</script></section><section data-markdown><script type="text/template">

### Spark

<img src="static/img/spark.png" width="40%">

+ Data operations (functions) can be passed as objects
+ Written in Scala, a functional programming language
+ Interfaces in Python, C++, Java, and more

</script></section><section data-markdown><script type="text/template">

### Tools of Big Data

Understanding modern Big Data tools and systems, focusing on Machine Learning integration

+ Databases
+ 13h, Dennis Wilson and Failor Elfassi (MP Data)
+ SQL, NoSQL, PostgreSQL
+ Database presentation project
+ Data computation
+ 18h, Florient Chouteau (Airbus DS)
+ Cluster and cloud compute
+ Containers and orchestration
+ GPU computer (Laurent Risser)
+ Data distribution
+ 15h, Guillaume Eynard-Bontemps (CNES)
+ MapReduce, Spark, HDFS
+ Orchestration with Kubernetes
+ Dask BE evalutation

</script></section><section data-markdown><script type="text/template">

Introduction | | | Readings |
--- | --- | --- | ---
[Introduction to Big Data](slides/0_0_intro.md) | 2h | 27/09/2021 | [Global Datasphere](https://github.com/SupaeroDataScience/DE/tree/master/readings/idc_data.pdf)

Databases | | | |
--- | --- | --- | ---
[Databases overview](0_1_databases.md) | 2h | 27/09/2021 | [Databases and SQL](https://github.com/SupaeroDataScience/DE/tree/master/readings/fntdb07-architecture.pdf)
[PostgeSQL TP](0_2_postgres.md) | 3h | 29/09/2021 | [PostgeSQL](https://www.postgresql.org/docs/manuals/)
[Databases Project](0_3_project.md) | 3h | 06/10/2021 |
[Project presentations](0_3_project.md) | 3h | 02/11/2021 |

</script></section><section data-markdown><script type="text/template">

Data Computation | | | Readings |
--- | --- | --- | ---
[Cloud Computing & Google Cloud Platform](1_1_overview.md) | 3h | 18/01/2022 | [Readings](1_7_readings.md#about-cloud-computing)
[Containers](1_3_containers.md) | 2h| 19/01/2022 | [Readings](1_7_readings.md#about-orchestration)
[Orchestration](1_4_orchestration.md) | 1h | 19/01/2022 | [Readings](1_7_readings.md#about-containers) |
[Cloud Compute BE](1_4_be.md) | 6h | 25/01/2022 |
[GPU computing](1_5_gpu.md) | 3h <br/> 3h | 01/02/2022 <br/> 02/02/2022 | [GPGPU TP](https://lms.isae.fr/course/view.php?id=1226&section=2) |

</script></section><section data-markdown><script type="text/template">

| Data Distribution | | | Readings |
| --- | --- | --- | --- |
| [Hadoop and MapReduce](2_3_mapreduce.md) | 3h | 08/02/2022 | [MapReduce](https://github.com/SupaeroDataScience/DE/tree/master/readings/mapreduce.pdf) |
| [Spark](2_4_spark.md) | 3h | 08/02/2022 | [Spark](https://github.com/SupaeroDataScience/DE/tree/master/readings/spark.pdf), [PySpark](https://spark.apache.org/docs/latest/api/python/pyspark.html) |
| [Dask on Kubernetes](2_5_dask.md)| 3h | 14/02/2022 | [Dask documentation](https://docs.dask.org/en/latest/setup/kubernetes.html) |
| [Dask project](2_6_project.md) | 6h | 16/02/2022 | [Dask](https://github.com/SupaeroDataScience/DE/tree/master/readings/dask.pdf) |

</script></section><section data-markdown><script type="text/template">

### Case study: all the IPOs

Choose one of the following companies and come up with a short explanation of
what the company offers. Do they sell hard drives? Compute time on a virtual
machine? Specialized artificial intelligence?

+ Snowflake
+ Alteryx
+ Cloudera
+ Talend
+ Splunk
+ Dataiku

Use simple language that could be understood by your grandparents. Post your
explanation on Slack.
</script></section></div>
</div>
<!-- <div id="footer-container" style="display:none;"> -->
<div id="footer-container">
<div id="footer">
Tools of Big Data
<br />
<a href="https://supaerodatascience.github.io/deep-learning/">https://supaerodatascience.github.io/DE/</a>
<br />
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License"
style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/80x15.png" /></a>
</div>
</div>
<script src="./dist/reveal.js"></script>

<script src="./plugin/markdown/markdown.js"></script>
<script src="./plugin/highlight/highlight.js"></script>
<script src="./plugin/zoom/zoom.js"></script>
<script src="./plugin/notes/notes.js"></script>
<script src="./plugin/math/math.js"></script>
<script>
function extend() {
var target = {};
for (var i = 0; i < arguments.length; i++) {
var source = arguments[i];
for (var key in source) {
if (source.hasOwnProperty(key)) {
target[key] = source[key];
}
}
}
return target;
}

// default options to init reveal.js
var defaultOptions = {
controls: true,
progress: true,
history: true,
center: true,
transition: 'default', // none/fade/slide/convex/concave/zoom
plugins: [
RevealMarkdown,
RevealHighlight,
RevealZoom,
RevealNotes,
RevealMath
]
};

// options from URL query string
var queryOptions = Reveal().getQueryHash() || {};

var options = extend(defaultOptions, {"transition":"fade","transitionSpeed":"default","controls":true,"slideNumber":true,"width":"100%","height":"100%"}, queryOptions);
</script>


<script>
Reveal.initialize(options);
var footer = $('#footer-container').html();
$('div.reveal').append(footer);
var logo = $('#logo-container').html();
$('div.reveal').append(logo);
</script>
</body>

</html>
Loading

0 comments on commit 1965149

Please sign in to comment.