deploy: 9f0eb53

roualdes · Jan 10, 2024 · e4fe799 · e4fe799
1 parent 1522320
commit e4fe799
Showing 21 changed files with 377 additions and 337 deletions.
diff --git a/_images/4f857011502159d7f402ca6d760c498db8ea0e9fd03962e8ff4f5090de248dd6.png b/_images/4f857011502159d7f402ca6d760c498db8ea0e9fd03962e8ff4f5090de248dd6.png
diff --git a/_images/559aa8885ab31225702da0b09e00732d58428696bb5bac577f35651f06893d43.png b/_images/559aa8885ab31225702da0b09e00732d58428696bb5bac577f35651f06893d43.png
diff --git a/_images/61d7d3c1d03f647ec486cf9f94932b9e2de2e0ec0bbbb3291da99e93c235e96f.png b/_images/61d7d3c1d03f647ec486cf9f94932b9e2de2e0ec0bbbb3291da99e93c235e96f.png
diff --git a/_images/de6d1e6ec5c5c9f16e4dd8547c6ffa68b13fb6f5c7fb3cc4b2926da6cee66734.png b/_images/de6d1e6ec5c5c9f16e4dd8547c6ffa68b13fb6f5c7fb3cc4b2926da6cee66734.png
diff --git a/_sources/and-beyond.md b/_sources/and-beyond.md
@@ -16,6 +16,133 @@ kernelspec:
 
 ## Python w/out Google Colab
 
-* [JupyterLab Desktop](https://github.com/jupyterlab/jupyterlab-desktop)
-* [JupyterLab](https://jupyterlab.readthedocs.io/en/latest/)
-* [Virtual environment](https://docs.python.org/3/library/venv.html) -> Emacs/Vim/VS Code/other
+* <a href="https://github.com/jupyterlab/jupyterlab-desktop" target="_blank">JupyterLab Desktop</a>
+* <a href="https://jupyterlab.readthedocs.io/en/latest/" target="_blank">JupyterLab</a>
+* <a href="https://docs.python.org/3/library/venv.html">Virtual environment -> Emacs/Vim/VS Code/other</a>
+
+## Creating your own dataset
+
+When you get to the point that you start creating your own small- to
+medium-sized datasets, this section is for you.  This section
+offers some general advice on creating a dataset.
+
+Entering data into a spreadsheet is easy.  And that's good.  But there
+are some gotchas that you should avoid.  Below you'll find a list of
+dos and a list of don'ts, each followed by explanations, for
+creating your own datasets.
+
+### DOs
+
+* be consistent
+* use simple variable names
+  * prefer all lower case letters
+  * minimize numbers and special characters
+  * use underscore `_` instead of space ` `
+* organize files within directories
+
+**Be consistent**. When programming, having to repeatedly look back at your
+spreadsheet to figure out your variable names is beyond annoying.  It is beyond
+annoying because it interrupts your programming.  Programming is hard enough,
+so avoid inconsistencies that you could have prevented by following a
+simple rule.
+
+**Use simple variable names**.  Consider two variables you might want
+to name with multiple words, like miles per gallon and brain to body
+weight ratio. It is easy to name one variable using camel case,
+e.g. `MilesPerGallon`, and another with mixed capitalization,
+e.g. `Brain(g)bodyWeight(KG)`.  The first name is fine, so long as you
+are consistent and choose camel case for all of your variable names.
+The second variable name is both not simple and inconsistent.  Camel
+case would have you capitalize each new word, as in `BrainBodyWeight`.
+In this case, even the units are not capitalized the same.  This is a
+recipe for frustration.  Also see below, *don't put units in variable
+names*.
+
+It is recommended to make yourself a simple rule, like *prefer all
+lowercase letters*.  Maybe that's not the rule for you, but don't get
+caught up on the rule.  The rule itself doesn't matter.  Just be
+simple and consistently so.
+
+My go-to rule is all lowercase letters, no numbers or special
+characters other than `_`, and to separate words only when joining them
+would create contiguous repeated letters, such as `ee` or `ss`.  The
+separator I prefer is underscore `_` instead of
+space ` `, which is mostly a carryover rule from programming in R.
+Remember, the rule matters less than consistency with the rule.
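Note on the naming advice above: if a spreadsheet already has inconsistent headers, one option is to normalize them in code rather than by hand. The sketch below is only an illustration under assumptions. The helper `simplify_name` is hypothetical, and it applies one possible rule (all lowercase, every word separated by `_`), which is stricter than the personal rule described above. It will not remove units from a name; that still takes a human decision.

```python
import re


def simplify_name(name: str) -> str:
    """Turn a spreadsheet header into a lowercase, underscore-separated name.

    Hypothetical helper: it implements one possible naming rule, not the
    only reasonable one.
    """
    # Split camel case: "MilesPerGallon" -> "Miles Per Gallon".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Replace each run of characters other than letters and digits with "_".
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", spaced)
    # Drop leading/trailing underscores and lowercase the result.
    return cleaned.strip("_").lower()


headers = ["MilesPerGallon", "Brain Body Weight", "sleep total"]
print([simplify_name(h) for h in headers])
# ['miles_per_gallon', 'brain_body_weight', 'sleep_total']
```

Applying the same function to every header is what makes the names consistent; which rule the function encodes matters less.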
+
+**Organize files within directories**.  When editing files, it is
+tempting to write metadata into the file name.  For instance, it is
+unfortunately common for people to write file names such as
+`draft_manuscript.docx`, `draft2_manuscript.docx`,
+`draft3_manuscript.docx`, `final_manuscript.docx`,
+`final_final_manuscript.docx`.  File names are not intended to carry
+the metadata associated with draft versions.
+
+If you really need to maintain copies of drafts, and I suspect you most
+often do not need such copies, then you should create directories such as
+`draft` and `final`.  Each directory should contain a single copy
+of only the files you absolutely need alongside that version of the file.
+Any files, such as data, that are the same for all copies of the file
+should have their own directory.  It might help future you to put a
+separate notes file in each directory that reminds you of the exact
+purpose of the directory.
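Note on the directory layout described above: a minimal sketch of how it could be created with Python's standard `pathlib` module. The names `manuscript_project`, `data`, and `notes.txt` are assumptions chosen for illustration, and the example writes into a temporary directory so that running it leaves nothing behind.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical project root; in practice this would be your own folder.
    project = Path(tmp) / "manuscript_project"

    # One directory per version, plus one for files shared by all versions.
    for name in ["draft", "final", "data"]:
        folder = project / name
        folder.mkdir(parents=True)
        # A short notes file reminding future you what the directory is for.
        (folder / "notes.txt").write_text(f"Purpose of the {name} directory.\n")

    # Collect the resulting layout as paths relative to the project root.
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))

print(created)
```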
+
+### DON'Ts
+
+* don't start a variable name with a number
+* don't use special characters in variable names
+* don't put units in variable names
+* don't use abbreviations
+* don't organize through file names
+* don't put dates in your file names
+* don't have multiple copies of your data
+
+**Don't start a variable name with a number**.  In most programming
+languages, you can't start a variable name with a number.  So it's
+easiest to just avoid putting numbers in variable names altogether.
+Occasionally, it makes sense to use a number in a variable name.  Just
+don't start your variable name with a number.
+
+**Don't use special characters in variable names**.  This rule is much
+like the rule above.  In my experience, special characters,
+e.g. `~!@#$%^&*()+=,<>/|\`, only make remembering a variable name more
+difficult.  The only special character that you should allow, when
+necessary, in your variable names is underscore `_`.  See **Use simple
+variable names** above.
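Note on the two rules above: a quick way to check candidate names before typing them into a spreadsheet. This is a sketch under assumptions. The pattern `SIMPLE_NAME` and the function `is_simple_name` are hypothetical, and they encode one possible rule: start with a lowercase letter, then use only lowercase letters, digits, and `_`.

```python
import re

# Assumed rule: a lowercase letter first, then lowercase letters,
# digits, or underscores. Adjust the pattern to match your own rule.
SIMPLE_NAME = re.compile(r"[a-z][a-z0-9_]*")


def is_simple_name(name: str) -> bool:
    """Return True when the whole name matches the assumed rule."""
    return SIMPLE_NAME.fullmatch(name) is not None


for name in ["brain_weight", "2nd_trial", "weight(kg)", "trial_2"]:
    print(name, is_simple_name(name))
# brain_weight True
# 2nd_trial False
# weight(kg) False
# trial_2 True
```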
+
+**Don't put units in variable names**.  Units in variable names just
+open the door for inconsistent variable names.  It is easiest to just
+avoid putting units or other metadata into variable names.  Your data
+should instead have a separate file of all the associated metadata.
+
+**Don't use abbreviations**.  Abbreviated variable names are
+attractive, because they save typing.  For instance, one could imagine
+abbreviating micrograms as `ug`, `mcg`, or `μg`.  This creates
+opportunity for misremembering and inconsistency.  Such abbreviations
+in variable names also break the rule **Don't put units in variable
+names**.  Further, see **Use simple variable names** above.
+Instead, put such metadata in a separate file.
+
+**Don't organize through file names**.  The only metadata a file name
+should contain is the name of the file.  Instead, use directories to
+organize your files.  See **Organize files within directories** above.
+
+**Don't put dates in your file names**.  Dates are metadata, see
+**Don't organize through file names** above.
+
+**Don't have multiple copies of your data**.  Generally, you should
+only have one copy of your dataset.  See **Don't put dates in your
+file names** above.  If there are necessary edits to your data for a
+specific analysis, then you should program those edits in Python code
+and save that code for future re-use.  This way you can re-create data
+changes as necessary, and you minimize the risk of introducing permanent
+errors into your dataset.
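Note on programming edits instead of copying data: the sketch below shows the idea with only the standard library. Everything in it is illustrative. The inline CSV text stands in for the single original data file, its values are made up, and treating negative numbers as a missing-value code is an assumption.

```python
import csv
import io

# Stand-in for the one original data file; the values are made up.
raw_csv = "animal,sleep_total\ncow,4.0\ndog,10.1\nhorse,-999\n"


def load_rows(text):
    """Read the original data without changing it."""
    return list(csv.DictReader(io.StringIO(text)))


def analysis_rows(rows):
    """Return an edited copy of the rows for one specific analysis."""
    edited = []
    for row in rows:
        hours = float(row["sleep_total"])
        if hours < 0:  # assumed missing-value code, dropped for this analysis
            continue
        edited.append({"animal": row["animal"], "sleep_total": hours})
    return edited


rows = load_rows(raw_csv)    # original data, untouched
clean = analysis_rows(rows)  # edits re-created from code on every run
print(len(rows), len(clean))
# 3 2
```

Because the edits live in `analysis_rows`, they can be re-run, reviewed, or corrected later, and the original data never needs a second copy.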
+
+### Tidy data
+
+The most complete reference containing the advice above, and more, is
+Hadley Wickham's paper <a
+href="https://vita.had.co.nz/papers/tidy-data.pdf"
+target="_blank">Tidy Data (pdf)</a>.  The paper lays out a framework
+with the goal of making it easier to clean up (tidy) data, so that
+subsequent analysis is easier.
diff --git a/_sources/data.md b/_sources/data.md
@@ -15,6 +15,6 @@ kernelspec:
 # Data
 
 Dr. Robin Donatello hosts a number of datasets on her website:
-<https://www.norcalbiostat.com/data/>.  You can use any of these
-datasets for practice or for the Exploratory Data Analysis Project
-which concludes MATH 131.
+<https://www.norcalbiostat.com/data/>.  Consider using the datasets
+`Email Spam`, `HIV`, `Depression`, or `Police Shootings` for the
+Exploratory Data Analysis Project which concludes MATH 131.
diff --git a/_sources/week-00.md b/_sources/week-00.md
@@ -44,13 +44,15 @@ [email protected] account).
 
 ## Google Colab
 
-[Google Colab](https://colab.research.google.com) provides a notebook
-environment where the user can develop a reproducible document that blends text
-and code together.  Such reproducible documents are popular in the world of data
-science, statistics, machine learning, and the various applied sciences that use
-programming.  By combining text and code, you can walk (via text) your audience
-through an analysis (usually via code and/or math), showing the exact code you
-used to draw any conclusions about the data or otherwise.
+<a href="https://colab.research.google.com" target="_blank">Google
+Colab</a> provides a notebook environment where the user can develop a
+reproducible document that blends text and code together.  Such
+reproducible documents are popular in the world of data science,
+statistics, machine learning, and the various applied sciences that
+use programming.  By combining text and code, you can walk (via text)
+your audience through an analysis (usually via code and/or math),
+showing the exact code you used to draw any conclusions about the data
+or otherwise.
 
 We will use Google Colab for free, as part of your campus Google
 account [email protected].  The free aspect means we'll have
@@ -61,8 +63,10 @@ install Python on your personal machine, because I believe we can get
 started faster this way.  If you want to follow along with this course
 using different tools, and you understand the consequences you face
 for doing so, please see your options on the page [Week 06 and
-beyond][./and-beyond.md].
+beyond](./and-beyond.md).
 
-From here, there's really no better way to learn about Google Colab than to go
-touch it.  Here's a link to [the Colab notebook associated with Week 00: Start
-here](https://colab.research.google.com/drive/1weKuFgd98W76BloyuuB4d2HudB5KLYew?usp=sharing).
+From here, there's really no better way to learn about Google Colab
+than to go touch it.  Here's a link to <a
+href="https://colab.research.google.com/drive/1weKuFgd98W76BloyuuB4d2HudB5KLYew?usp=sharing"
+target="_blank">the Colab notebook associated with Week 00: Start
+here</a>.
diff --git a/_sources/week-01.md b/_sources/week-01.md
@@ -14,8 +14,8 @@ kernelspec:
 
 # Week 01: Python basics
 
-* [Week 01 Notes](https://colab.research.google.com/drive/1VQhUmSxM6WfSw1ZZeKfhkRhkfM9JPXQx?usp=sharing)
-* [Week 01 Assignment](https://colab.research.google.com/drive/1h9Ck7kWNN9_I2Yun9Yc4uBoI2lgv6chi?usp=sharing)
+* <a href="https://colab.research.google.com/drive/1VQhUmSxM6WfSw1ZZeKfhkRhkfM9JPXQx?usp=sharing" target="_blank">Week 01 Notes</a>
+* <a href="https://colab.research.google.com/drive/1h9Ck7kWNN9_I2Yun9Yc4uBoI2lgv6chi?usp=sharing" target="_blank">Week 01 Assignment</a>
 
 ## Learning objectives
 
@@ -39,7 +39,7 @@ To follow along with this Lesson, please open the Colab notebook [Week
 Notes](https://colab.research.google.com/drive/1VQhUmSxM6WfSw1ZZeKfhkRhkfM9JPXQx?usp=sharing).
 The first code cell of this notebook calls to the remote computer, on
 which the notebook is running, and installs the necessary packages.
-For practice, you are repsonible for importing the necessary packages.
+For practice, you are responsible for importing the necessary packages.
 
 ## Variable
 
@@ -602,5 +602,5 @@ Such tools have a steep learning curve and a huge payoff.
 ```
 
 ```{seealso}
-[Week 01 Assignment](https://colab.research.google.com/drive/1h9Ck7kWNN9_I2Yun9Yc4uBoI2lgv6chi?usp=sharing)
+<a href="https://colab.research.google.com/drive/1h9Ck7kWNN9_I2Yun9Yc4uBoI2lgv6chi?usp=sharing" target="_blank">Week 01 Assignment</a>
 ```
diff --git a/_sources/week-02.md b/_sources/week-02.md
@@ -14,8 +14,8 @@ kernelspec:
 
 # Week 02: Introduction to working with data
 
-* [Week 02 Notes](https://colab.research.google.com/drive/1qHzeZ_1RdfNe1l3KQsZi7xsSjLMVHbel?usp=sharing)
-* [Week 02 Assignment](https://colab.research.google.com/drive/1os3hSTKNFblsA1MUTe25pvCjtaKfId30?usp=sharing)
+* <a href="https://colab.research.google.com/drive/1qHzeZ_1RdfNe1l3KQsZi7xsSjLMVHbel?usp=sharing" target="_blank">Week 02 Notes</a>
+* <a href="https://colab.research.google.com/drive/1os3hSTKNFblsA1MUTe25pvCjtaKfId30?usp=sharing" target="_blank">Week 02 Assignment</a>
 
 ## Learning objectives
 
@@ -281,5 +281,5 @@ msleep["smrt"]
 ```
 
 ```{seealso}
-[Week 02 Assignment](https://colab.research.google.com/drive/1os3hSTKNFblsA1MUTe25pvCjtaKfId30?usp=sharing)
+<a href="https://colab.research.google.com/drive/1os3hSTKNFblsA1MUTe25pvCjtaKfId30?usp=sharing" target="_blank">Week 02 Assignment</a>
 ```
diff --git a/_sources/week-03.md b/_sources/week-03.md
@@ -14,8 +14,8 @@ kernelspec:
 
 # Week 03: Graphing and aggregating
 
-* [Week 03 Notes](https://colab.research.google.com/drive/1HqqhJvfHsWJAj_3dgBt0SOV5E90Sq1pG?usp=sharing)
-* [Week 03 Assignment](https://colab.research.google.com/drive/1_ZTWGesIh5DUB_l3UdTmR5KK9AFPEy_9?usp=sharing)
+* <a href="https://colab.research.google.com/drive/1HqqhJvfHsWJAj_3dgBt0SOV5E90Sq1pG?usp=sharing" target="_blank">Week 03 Notes</a>
+* <a href="https://colab.research.google.com/drive/1_ZTWGesIh5DUB_l3UdTmR5KK9AFPEy_9?usp=sharing" target="_blank">Week 03 Assignment</a>
 
 ## Learning outcomes
 
@@ -344,5 +344,5 @@ p.draw()
 ```
 
 ```{seealso}
-[Week 03 Assignment](https://colab.research.google.com/drive/1_ZTWGesIh5DUB_l3UdTmR5KK9AFPEy_9?usp=sharing)
+<a href="https://colab.research.google.com/drive/1_ZTWGesIh5DUB_l3UdTmR5KK9AFPEy_9?usp=sharing" target="_blank">Week 03 Assignment</a>
 ```