-
Notifications
You must be signed in to change notification settings - Fork 11
Description
Hello again everyone,
after correcting #151, I stumbled upon another error in which the created bedfile by the function convert_fragments_to_tn5_bed seemed to produce an error at the level of this line in scatac_generate_bigwig.sh
sort -k1,1 -k2,2n ${1} | bedtools slop -i - -g ${2} -b ${3} | \
bedtools intersect -a - -b ${4} -v | \
tee ${bed_name} | \
bedtools genomecov -i - -g ${2} -bg -scale ${7} | \
LC_COLLATE=C sort -k1,1 -k2,2n > ${bedgraph_name}
The error message was : "Unexpected file format. Please use tab-delimited BED, GFF, or VCF. Perhaps you have non-integer starts or ends at line 1?"
I am using single cell pseudobulk generated tsv files using getGroupFragments in ArchR.
In order to understand the error, I looked at the output of the function convert_fragments_to_tn5_bed in prepare_tools.py :
chr start stop barcode
0 chr1 832792.0 832793.0 astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1
1 chr1 930374.0 930375.0 astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1
2 chr1 939075.0 939076.0 astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1
3 chr1 940409.0 940410.0 astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1
4 chr1 941526.0 941527.0 astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1
The function seemed to be converting the start and stop column as float inducing the error in bedtools.
I am not sure as to why I am the only one experiencing this error. If you want to investigate, I provide the yml from my environment.
Anyway, the error is resolved by adding two lines in the function, insuring that start and stop are considered as integers by pandas.
def convert_fragments_to_tn5_bed(fragments_tsv: str, chroms: list):
"""Convert 10X scATAC fragments file to Tn5 insertion sites bed
Args:
fragments_tsv (str): Path to the scATAC-seq fragments file
Returns:
dataframe: Tn5 insertions sites in a BED formatted pandas dataframe
Examples:
>>> bed_file = convert_fragments_to_tn5_bed("Granja_frags.tsv", ["chr1", "chr2"])
"""
# Import fragments tsv as a dataframe
df = pd.read_table(fragments_tsv,
sep="\t",
header=None,
usecols=[0, 1, 2, 3],
names=["chr", "start", "stop", "barcode"]
)
# Filter chromosomes to those in the list
df = df[df["chr"].isin(chroms)]
# subset starts
df_starts = df[["chr", "start", "barcode"]].copy()
# subset stops
df_ends = df[["chr", "stop", "barcode"]].copy()
# Add a 1 to create 1 bp intervals representing the cut site
df_starts["stop"] = df_starts["start"].copy() + 1
# Subtract a 1 bp interval to represent the cut site. Reference miralidlab wiki page for scATAC-seq analysis
df_ends["start"] = df_ends["stop"].copy() - 1
# Concat the start and end positions into a single dataframe
df_cat = pd.concat([df_starts, df_ends])
# Reset the index
df_cat.reset_index(inplace=True, drop=True)
df_cat["start"] = df_cat["start"].astype(int)
df_cat["stop"] = df_cat["stop"].astype(int)
# If split is True, return two dataframes, one for each end of the fragment
return df_cat[["chr", "start", "stop", "barcode"]]