FEAT: add csv2fasta #61

Closed
wants to merge 6 commits into from
Changes from 3 commits
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -50,6 +50,7 @@

* `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).

* `csv2fasta`: Convert two columns from a CSV file to FASTA entries (PR #61).

## MAJOR CHANGES

92 changes: 92 additions & 0 deletions src/sequenceformats/csv2fasta/config.vsh.yaml
@@ -0,0 +1,92 @@
name: csv2fasta
description: Convert two columns from a CSV file to FASTA entries.
argument_groups:
  - name: Inputs
    arguments:
      - name: --input
        type: file
        direction: input
        example: barcodes.csv
        description: CSV file to be processed.
        required: true
  - name: "CSV Format arguments"
    arguments:
      - name: --header
        type: boolean_true
        description: |
          Parse the first line of the CSV file as a header.
      - name: --delimiter
        type: string
        description: |
          Column delimiter.
        default: ","
      - name: --quote_character
        type: string
        description: |
          Character used to denote the start and end of a quoted item.
        default: '"'
  - name: "CSV column arguments"
    description: |
      Parameters for the selection of columns from the CSV file.
      Only required when your CSV file contains more than 2 columns.
    arguments:
      - name: --sequence_column
        type: string
        description: |
          Name of the column containing the sequences. Implies 'header'.
          Cannot be used together with 'sequence_column_index'.
        required: false
      - name: "--name_column"
        type: string
        description: |
          Name of the column describing the FASTA headers. Implies 'header'.
          Cannot be used together with 'name_column_index'.
        required: false
      - name: "--sequence_column_index"
        type: integer
        min: 0
        description: |
          Index of the column to use as the FASTA sequences, counted from the left and
          starting from 0. Cannot be used in combination with the 'sequence_column' argument.
        required: false
      - name: "--name_column_index"
        type: integer
        min: 0
        description: |
          Index of the column to use as the FASTA headers, counted from the left and
          starting from 0. Cannot be used in combination with 'name_column'.
        required: false
  - name: Outputs
    arguments:
      - name: "--output"
        type: file
        example: barcodes.fasta
        direction: output
        description: Output FASTA file.

resources:
  - type: python_script
    path: script.py
test_resources:
  - type: python_script
    path: test_csvtofasta.py

engines:
  - type: docker
    image: python:slim
    setup:
      - type: apt
        packages:
          - procps
      - type: python
        packages:
          - dnaio
    test_setup:
      - type: python
        packages:
          - pytest
          - viashpy

runners:
  - type: executable
  - type: nextflow
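The `--delimiter` and `--quote_character` arguments above map onto the `delimiter` and `quotechar` parameters of Python's `csv.reader`. As a rough illustration of how the two interact, here is a minimal sketch using only the standard library. The sample data and variable names are made up for this example and are not part of the PR.

```python
import csv
import io

# Hypothetical sample data: semicolon-delimited, with one quoted field
# that itself contains the delimiter character.
sample = 'name;sequence\nbarcode1;"ACGT;TTGA"\nbarcode2;GGCC\n'

# newline='' mirrors how the csv module expects file-like input to be opened.
reader = csv.reader(io.StringIO(sample, newline=''), delimiter=';', quotechar='"')
rows = list(reader)

# The first row is the header; the quoted field keeps its embedded ';'.
header, data = rows[0], rows[1:]
print(header)
print(data)
```

Because the second field of the first data row is quoted, the embedded `;` is treated as part of the value rather than as a column separator.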
88 changes: 88 additions & 0 deletions src/sequenceformats/csv2fasta/script.py
@@ -0,0 +1,88 @@
from pathlib import Path
import dnaio
import csv

## VIASH START
par = {

}
## VIASH END

def resolve_header_name_to_index(header_entries, column_name):
    try:
        return header_entries.index(column_name)
    except ValueError as e:
        raise ValueError(f"Column name '{column_name}' could not "
                         "be found in the header of the CSV file.") from e


def csv_records(csv_file, delimiter, quote_character,
                header, sequence_column, name_column,
                sequence_column_index, name_column_index):
    with open(csv_file, newline='') as csvfile:
        csv_reader = csv.reader(csvfile,
                                delimiter=delimiter,
                                quotechar=quote_character)
        for linenum, line in enumerate(csv_reader):
            if not linenum: # First row
                num_columns = len(line)
                if header:
                    if sequence_column:
                        sequence_column_index = resolve_header_name_to_index(line, sequence_column)
                    if name_column:
                        name_column_index = resolve_header_name_to_index(line, name_column)
                    continue
            if not (linenum - header): # First 'data' line
                if (sequence_column_index is None and name_column_index is None and len(line) == 2):
                    name_column_index, sequence_column_index = 0, 1
                if sequence_column_index is None:
                    raise ValueError("Either 'sequence_column_index' or 'sequence_column' needs "
                                     "to be specified.")
                if name_column_index is None:
                    raise ValueError("Either 'name_column' or 'name_column_index' needs to "
                                     "be specified.")
                if sequence_column_index == name_column_index:
                    raise ValueError("The same columns were selected for both the FASTA sequences and "
                                     "headers.")
                if name_column_index >= num_columns:
                    raise ValueError(f"Requested to use column number {name_column_index} "
                                     f"(0 based) for the FASTA headers, but only {num_columns} "
                                     "were found on the first line.")
                if sequence_column_index >= num_columns:
                    raise ValueError(f"Requested to use column number {sequence_column_index} "
                                     f"(0 based) for the FASTA sequences, but only {num_columns} "
                                     "were found on the first line.")
            if len(line) != num_columns:
                raise ValueError(f"Number of columns ({len(line)}) found on line {linenum+1} "
                                 "is different compared to number of columns found "
                                 f"previously ({num_columns}).")
            yield line[name_column_index], line[sequence_column_index]


def main(par):
    par['input'], par['output'] = Path(par['input']), Path(par['output'])
    sequence_column, name_column = par['sequence_column'], par['name_column']
    sequence_column_index, name_column_index = par['sequence_column_index'], par['name_column_index']
    if (sequence_column or name_column) and not par['header']:
        par["header"] = True
    if sequence_column_index is not None and sequence_column:
        raise ValueError("Cannot specify both 'sequence_column_index' and 'sequence_column'")
    if name_column and name_column_index is not None:
        raise ValueError("Cannot specify both 'name_column_index' and 'name_column'")
    if sequence_column_index is not None and \
            (sequence_column_index == name_column_index):
        raise ValueError("The value specified for 'sequence_column_index' cannot be the same as "
                         "the value for 'name_column_index'.")
    with dnaio.open(par['output'], mode='w', fileformat="fasta") as writer:
        for header, sequence in csv_records(par['input'],
                                            par['delimiter'],
                                            par['quote_character'],
                                            par['header'],
                                            sequence_column,
                                            name_column,
                                            sequence_column_index,
                                            name_column_index):
            writer.write(dnaio.SequenceRecord(header, sequence))

if __name__ == "__main__":
    main(par)
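To make the intended input/output relationship concrete, the following is a small standard-library sketch of the same CSV-to-FASTA idea. It is not the PR's implementation: the script above writes records through `dnaio`, whereas this example builds the FASTA text by hand so it can run without extra dependencies. The helper name `rows_to_fasta` and its parameters are hypothetical.

```python
import csv
import io


def rows_to_fasta(csv_text, name_index=0, sequence_index=1, has_header=False):
    """Hypothetical helper: render two selected CSV columns as FASTA text."""
    reader = csv.reader(io.StringIO(csv_text, newline=''))
    lines = []
    for linenum, row in enumerate(reader):
        if has_header and linenum == 0:
            continue  # Skip the header row; it holds column names, not data.
        lines.append(f">{row[name_index]}")  # FASTA header line
        lines.append(row[sequence_index])    # Sequence line
    return "".join(line + "\n" for line in lines)


# Two-column CSV with a header: the name comes first, the sequence second.
fasta = rows_to_fasta("id,seq\nbc1,ACGT\nbc2,TTGA\n", has_header=True)
print(fasta, end="")
```

Each data row becomes one `>`-prefixed header line followed by one sequence line, which is the two-column default that the component falls back to when no column arguments are given.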