In this assignment, you will implement a neural network class from (almost) scratch. You will then apply your class to create both:
(1) a simple 64x16x64 autoencoder.
(2) a classifier for transcription factor binding sites.
You will begin by finishing the API for generating fully connected neural networks from scratch. You will then create Jupyter Notebooks in which you build, train, and test your autoencoder and classifier.
- Finish all methods with a `pass` statement in the `NeuralNetwork` class in the `nn.py` file.
- Finish the `sample_seqs` function in the `preprocess.py` file.
- Finish the `one_hot_encode_seqs` function in the `preprocess.py` file.
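For reference, below is a minimal sketch of one way `sample_seqs` could balance classes, by upsampling the minority class with replacement. The function signature, argument names, and return values here are assumptions; follow the docstring in `preprocess.py`.

```python
import random
from typing import List, Tuple


def sample_seqs(seqs: List[str], labels: List[bool], seed: int = 0) -> Tuple[List[str], List[bool]]:
    """Return a class-balanced set of sequences and labels (sketch only).

    Upsamples whichever class is smaller, with replacement, so both
    classes end up the same size.
    """
    random.seed(seed)
    pos = [s for s, label in zip(seqs, labels) if label]
    neg = [s for s, label in zip(seqs, labels) if not label]
    if len(pos) < len(neg):
        pos = random.choices(pos, k=len(neg))
    else:
        neg = random.choices(neg, k=len(pos))
    return pos + neg, [True] * len(pos) + [False] * len(neg)
```

Downsampling the majority class is an equally valid scheme; whichever you pick, you will need to justify it when building the classifier below.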
An autoencoder is a neural network that takes an input, encodes it into a lower-dimensional latent space through "encoding" layers, and then attempts to reconstruct the original input using "decoding" layers. Autoencoders are often used for dimensionality reduction.
You will train a 64x16x64 autoencoder on the digits dataset. All of the following work should be done in a Jupyter Notebook.
- Load the digits dataset through sklearn using `sklearn.datasets.load_digits()`.
- Split the data into training and validation sets.
- Generate an instance of your `NeuralNetwork` class with a 64x16x64 autoencoder architecture.
- Train your autoencoder on the training data.
- Plot your training and validation loss by epoch.
- Quantify your average reconstruction error over the validation set.
- Explain why you chose the hyperparameter values you did.
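Putting the steps above together, the workflow might look roughly like the sketch below. The `NeuralNetwork` constructor arguments and the `fit`/`predict` method names are assumptions about your own API, so adapt them to whatever you implemented in `nn.py`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# from nn import NeuralNetwork  # your implementation

# Load the 8x8 digits (64 features per image) and split into train/validation sets.
X = load_digits().data
X_train, X_val = train_test_split(X, test_size=0.2, random_state=42)

# Hypothetical 64x16x64 architecture specification (the layer-dict format is an assumption).
arch = [
    {"input_dim": 64, "output_dim": 16, "activation": "relu"},
    {"input_dim": 16, "output_dim": 64, "activation": "relu"},
]

# net = NeuralNetwork(arch, lr=1e-3, seed=42, batch_size=32, epochs=100,
#                     loss_function="mean_squared_error")
# train_loss, val_loss = net.fit(X_train, X_train, X_val, X_val)  # input == target

# Average reconstruction error over the validation set (mean squared error):
# reconstructions = net.predict(X_val)
# print(np.mean((X_val - reconstructions) ** 2))
```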
Transcription factors are proteins that bind DNA at promoters to drive gene expression. Most transcription factors preferentially bind specific sequences while ignoring others. Traditional methods for determining these sequences (called motifs) have assumed that the positions within a binding site are independent of one another. However, in some cases, motifs with positional interdependencies have been identified.
You will implement a multi-layer fully connected neural network using your `NeuralNetwork` class to predict whether a short DNA sequence is a binding site for the yeast transcription factor Rap1. The training data is highly imbalanced, with far fewer positive sequences than negative sequences, so you will implement a sampling scheme to ensure that class imbalance does not affect training. As with the autoencoder, all of the following work should be done in a Jupyter Notebook.
- Use the `read_text_file` function from `io.py` to read in the 137 positive Rap1 motif examples.
- Use the `read_fasta_file` function from `io.py` to read in all the negative examples. Note that these sequences are much longer than the positive sequences, so you will need to process them to the same length.
- Balance your classes using your `sample_seqs` function and explain why you chose the sampling scheme you did.
- One-hot encode the data using your `one_hot_encode_seqs` function.
- Split the data into training and validation sets.
- Generate an instance of your `NeuralNetwork` class with an appropriate architecture.
- Train your neural network on the training data.
- Plot your training and validation loss by epoch.
- Report the accuracy of your classifier on your validation dataset.
- Explain your choice of loss function and hyperparameters.
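Putting the data-preparation steps above together, a rough sketch is shown below. The file paths, the windowing approach for shortening the negative sequences, and the helper-function signatures are all assumptions; adapt them to the files provided and to your own implementations.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# from io import read_text_file, read_fasta_file          # this assignment's io.py
# from preprocess import sample_seqs, one_hot_encode_seqs

# Hypothetical file paths -- use the actual files provided with the assignment.
# pos_seqs = read_text_file("data/rap1_positives.txt")
# long_neg_seqs = read_fasta_file("data/negatives.fa")

# The negative sequences are much longer than the positives, so one possible scheme
# is to slice each one into non-overlapping windows of the positive-sequence length.
# k = len(pos_seqs[0])
# neg_seqs = [s[i:i + k] for s in long_neg_seqs for i in range(0, len(s) - k + 1, k)]

# Balance the classes, one-hot encode, and split into training/validation sets.
# seqs, labels = sample_seqs(pos_seqs + neg_seqs,
#                            [True] * len(pos_seqs) + [False] * len(neg_seqs))
# X = one_hot_encode_seqs(seqs)
# y = np.array(labels, dtype=float)
# X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
```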
- Proper implementation of methods in the `NeuralNetwork` class (13 points)
- Proper implementation of the `sample_seqs` function (1 point)
- Proper implementation of the `one_hot_encode_seqs` function (1 point)
- Read in data and generate training and validation sets (2 points)
- Successfully train your autoencoder (4 points)
- Plots of training and validation loss (2 points)
- Quantification of reconstruction error (1 point)
- Explanation of hyperparameters (1 point)
- Correctly read in all data (2 points)
- Explanation of your sampling scheme (2 points)
- Proper generation of a training set and a validation set (2 points)
- Successfully train your classifier (4 points)
- Plots of training and validation loss (2 points)
- Reporting validation accuracy of the classifier (1 point)
- Explanation of loss function and hyperparameters (2 points)
Proper unit tests for:
- `_single_forward` method (1 point)
- `forward` method (1 point)
- `_single_backprop` method (1 point)
- `predict` method (1 point)
- `binary_cross_entropy` method (0.5 points)
- `binary_cross_entropy_backprop` method (0.5 points)
- `mean_squared_error` method (0.5 points)
- `mean_squared_error_backprop` method (0.5 points)
- `sample_seqs` function (0.5 points)
- `one_hot_encode_seqs` function (0.5 points)
- Installable module (1 point)
- GitHub Actions (installing + testing) (2 points)
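As an example of the unit tests listed above, here is a minimal pytest-style sketch for `one_hot_encode_seqs`. The encoding order (A, T, C, G) and the flattened output shape are assumptions; assert whichever scheme your `preprocess.py` docstring specifies.

```python
import numpy as np

# from preprocess import one_hot_encode_seqs


def test_one_hot_encode_seqs():
    # Assumes A -> [1,0,0,0], T -> [0,1,0,0], C -> [0,0,1,0], G -> [0,0,0,1],
    # with each sequence flattened into a single vector of length 4 * len(seq).
    encodings = one_hot_encode_seqs(["AGA"])
    expected = np.array([[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]])
    assert encodings.shape == (1, 12)
    assert np.array_equal(encodings, expected)
```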