How to handle a join when join field is embedded in line? #1445

kubu4 · 2022-04-07T14:39:35Z

kubu4
Apr 7, 2022
Maintainer

I have two files. I'd like to join File 2 with/on File 1.

File 1:

Q7ZU29
Q6C9E7
F7B645
Q6GM29
Q9FZK4
Q9NPA5
Q55CU8
Q7TQE7
Q32P85
Q9UN37
O75351

File 2:

AC   Q8TF76; Q5U5K3; Q96MN1; Q9BXS7;
AC   A1X283; B6F0V2; Q9P2Q1;
AC   Q8CA72;
AC   Q03017; A4V0S3; Q0E8Q3; Q9V3M6;
AC   E7FAM5; F8W380;
AC   Q32NV1; Q6NRJ8;
AC   Q9NQP4; Q5TD11; Q92779;
AC   Q9VJ33; Q29QE5;
AC   P30939;
AC   P97718; O54913; Q8BV77;
AC   Q8ITC9; Q0IGX8; Q71RH8;
AC   Q9NTW7; A2A2N4; Q53H69; Q53XQ1; Q5JWM0; Q5JWM1; Q8WU98; Q9H9P1; Q9NPA5;

The problem: The entry Q9NPA5 occurs at the end of the line in File 2:

AC Q9NTW7; A2A2N4; Q53H69; Q53XQ1; Q5JWM0; Q5JWM1; Q8WU98; Q9H9P1; Q9NPA5

I'd like the join to retain all the stuff after AC in File 2.

I'm thinking a join might not even work in this instance. Might have to do some sort of looping over each string in the line in File 2 and compare to each line in File 1; followed by printing the entire line from File 2 if there's a match.

Anyone have any suggestions?

Shell preferred, but would take other solutions.
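The "print the whole File 2 line if any File 1 entry appears in it" idea described above can be sketched with grep, without an explicit loop. This is only a sketch under stated assumptions: the temporary directory and the variable names tmpdir and joined are hypothetical, only a few lines of the example files are recreated, and the -w option is assumed to be available (it is in GNU and BSD grep).

```shell
# Recreate a few lines of the example files in a temporary directory.
tmpdir="$(mktemp -d)"
printf '%s\n' Q7ZU29 Q9NPA5 O75351 > "${tmpdir}/File1.txt"
printf '%s\n' \
  'AC   Q8CA72;' \
  'AC   P30939;' \
  'AC   Q9NTW7; A2A2N4; Q53H69; Q53XQ1; Q5JWM0; Q5JWM1; Q8WU98; Q9H9P1; Q9NPA5;' \
  > "${tmpdir}/File2.txt"

# -F: patterns are fixed strings, not regexes.
# -w: a pattern must match a whole word, so a partial accession does not count.
# -f: read one pattern per line from File1.txt.
# sed then strips the leading "AC" tag and its padding.
joined="$(grep -F -w -f "${tmpdir}/File1.txt" "${tmpdir}/File2.txt" \
  | sed 's/^AC[[:space:]]*//')"

printf '%s\n' "${joined}"
```

With these inputs only the last File 2 line contains a File 1 accession (Q9NPA5), so only that line is printed, with everything after AC retained.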


sr320 · 2022-04-07T14:49:21Z

sr320
Apr 7, 2022
Maintainer

How about fuzzyjoin?

Example from my nb-2022 repo

betterspur <- spur %>% fuzzy_inner_join(ncbiP, by = "V2", match_fun = str_detect)


kubu4 Apr 7, 2022
Maintainer Author

This is intriguing...

Going to hold out for a bit to see if we can get a shell suggestion. Rest of work is in Jupyter Notebook and don't want to take time to switch over to Rmd...

kubu4 · 2022-04-07T21:10:47Z

kubu4
Apr 7, 2022
Maintainer Author

Potential awk solution:

awk ' BEGIN { FS = "^AC" } \
FNR == NR \
{ array[$0]; next } \
{ for ( item in array ) { if ( match ( $0,item ) ) { print $2 } } } ' \
File1.txt File2.txt \
| sed 's/[[:space:]]*//g'

Output:

Q8TF76;Q5U5K3;Q96MN1;Q9BXS7;
A1X283;B6F0V2;Q9P2Q1;
Q8CA72;
Q03017;A4V0S3;Q0E8Q3;Q9V3M6;
E7FAM5;F8W380;
Q32NV1;Q6NRJ8;
Q9NQP4;Q5TD11;Q92779;
Q9VJ33;Q29QE5;
P30939;
P97718;O54913;Q8BV77;
Q8ITC9;Q0IGX8;Q71RH8;
Q9NTW7;A2A2N4;Q53H69;Q53XQ1;Q5JWM0;Q5JWM1;Q8WU98;Q9H9P1;Q9NPA5;
Q9NTW7;A2A2N4;Q53H69;Q53XQ1;Q5JWM0;Q5JWM1;Q8WU98;Q9H9P1;Q9NPA5;

Explanation:

awk ' BEGIN { FS = "^AC" }: Sets the field separator to the string AC at the start of a line (^). For each File 2 line, the first field ($1) is therefore empty and the second field ($2) is everything after AC.
FNR == NR: True only while awk is reading the first input file, so the block that follows (designated by {}) runs only on File1.txt.
{ array[$0]; next }: Stores each entire line ($0) of the first file as a key in the array named array, then skips to the next input line.
{ for ( item in array ) { if ( match ( $0,item ) ) { print $2 } } }: Loops through each item in the array and, if that item matches somewhere in the current line of the second input file ($0), prints the second field ($2) of that line. Note that match() treats item as a regular expression, so this is a substring match rather than an exact comparison.
File1.txt File2.txt: The first and second input files.
| sed 's/[[:space:]]*//g': Pipes the output to sed, which removes all whitespace from each output line.
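Because match() does a regular-expression substring search, a blank line in File 1 would become an empty pattern that matches every line of File 2. That may be why the output above lists every line (and the Q9NPA5 line twice), though this is an assumption about the input rather than something confirmed here. A minimal sketch of an exact-token variant follows; the array name wanted, the variable names tmpdir and joined, and the deliberately added blank line in File1.txt are all hypothetical choices for illustration.

```shell
# Recreate a few lines of the example files; File1.txt deliberately
# contains a blank line to show that it is ignored.
tmpdir="$(mktemp -d)"
printf '%s\n' Q7ZU29 '' Q9NPA5 > "${tmpdir}/File1.txt"
printf '%s\n' \
  'AC   Q8CA72;' \
  'AC   Q9NTW7; A2A2N4; Q53H69; Q53XQ1; Q5JWM0; Q5JWM1; Q8WU98; Q9H9P1; Q9NPA5;' \
  > "${tmpdir}/File2.txt"

joined="$(awk '
  # First file: store each non-empty accession as an array key.
  FNR == NR { if ($0 != "") wanted[$0]; next }
  # Second file: drop the leading AC tag, split the rest on ";" plus
  # optional whitespace, and print the line once if any token is a key.
  {
    line = $0
    sub(/^AC[[:space:]]*/, "", line)
    n = split(line, tokens, /;[[:space:]]*/)
    for (i = 1; i <= n; i++) {
      if (tokens[i] in wanted) { print line; break }
    }
  }
' "${tmpdir}/File1.txt" "${tmpdir}/File2.txt")"

printf '%s\n' "${joined}"
```

The break ensures a File 2 line is printed at most once even if several of its accessions appear in File 1, and the exact "in" lookup avoids partial or empty matches.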


kubu4 · 2022-04-21T17:23:27Z

kubu4
Apr 21, 2022
Maintainer Author

The input files I actually used differ slightly from the examples above (the biggest difference is the number of columns in each file), but they are very similar. Here's the awk solution I ended up using:

awk \
-v FS='[;[:space:]]+' \
'NR==FNR \
{array[$1]=$0; next} \
($1 in array) \
{print $2"\t"array[$1]}' \
File02.txt File01.txt \
> "${joined_output}"

And, here's the code explanation:

awk -v FS='[;[:space:]]+': Sets the field separator variable to one or more semicolons or whitespace characters, so the ; in UniProt accessions is not treated as part of a field. Allows for proper searching.
NR==FNR: True only while awk is reading the first input file, so the block that follows (designated by {}) runs only on that file.
{array[$1]=$0; next}: Stores the entire line ($0) of the first file in the array named array, keyed by its first column ($1), and then skips to the next input line.
($1 in array): Checks whether the value of the first column ($1, which is SPID) from the second file is a key in the array (which contains the lines from the first file).
{print $2"\t"array[$1]}': If there's a match, prints the second column ($2, which is gene ID) from the second file, followed by a tab and the matching line from the first file.
File02.txt File01.txt: The first and second input files.
> "${joined_output}": Redirects the result of the join to the file named by the joined_output variable.

Credit goes to help from StackOverflow.
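Since the real input files are not shown, here is a small runnable sketch of the same command on made-up data. Everything about the inputs is an assumption: the column layouts, the placeholder values example_annotation_A, example_annotation_B, gene_0001 and gene_0002, and the temporary paths are hypothetical and only illustrate how the join behaves.

```shell
tmpdir="$(mktemp -d)"

# Hypothetical File02.txt: accession first, then a made-up annotation.
printf '%s\n' \
  'Q9NPA5; example_annotation_A' \
  'P30939; example_annotation_B' \
  > "${tmpdir}/File02.txt"

# Hypothetical File01.txt: accession (SPID), a tab, then a made-up gene ID.
printf 'Q9NPA5\tgene_0001\nQ7ZU29\tgene_0002\n' > "${tmpdir}/File01.txt"

joined_output="${tmpdir}/joined.txt"

awk -v FS='[;[:space:]]+' '
  # First file: remember each whole line, keyed by its first field.
  NR == FNR { array[$1] = $0; next }
  # Second file: when its first field is a known key, print its
  # second field, a tab, and the stored line from the first file.
  ($1 in array) { print $2 "\t" array[$1] }
' "${tmpdir}/File02.txt" "${tmpdir}/File01.txt" > "${joined_output}"

cat "${joined_output}"
```

Note that this join compares only the first field of each line, so in this form it would not find an accession that sits later in a line.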
