Skip to content

Error in convert_fragments_to_tn5_bed #153

@Pirhhot

Description

@Pirhhot

Hello again everyone,

after correcting #151, I stumbled upon another error in which the created bedfile by the function convert_fragments_to_tn5_bed seemed to produce an error at the level of this line in scatac_generate_bigwig.sh

sort -k1,1 -k2,2n ${1} | bedtools slop  -i - -g ${2} -b ${3} | \
bedtools intersect -a - -b ${4} -v | \
tee ${bed_name} | \
bedtools genomecov -i - -g ${2} -bg -scale ${7} | \
LC_COLLATE=C sort -k1,1 -k2,2n > ${bedgraph_name}

The error message was : "Unexpected file format. Please use tab-delimited BED, GFF, or VCF. Perhaps you have non-integer starts or ends at line 1?"

I am using single cell pseudobulk generated tsv files using getGroupFragments in ArchR.

In order to understand the error, I looked at the output of the function convert_fragments_to_tn5_bed in prepare_tools.py :

chr     start      stop                                   barcode
0  chr1  832792.0  832793.0  astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1
1  chr1  930374.0  930375.0  astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1
2  chr1  939075.0  939076.0  astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1
3  chr1  940409.0  940410.0  astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1
4  chr1  941526.0  941527.0  astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1

The function seemed to be converting the start and stop column as float inducing the error in bedtools.

I am not sure as to why I am the only one experiencing this error. If you want to investigate, I provide the yml from my environment.

Anyway, the error is resolved by adding two lines in the function, insuring that start and stop are considered as integers by pandas.

def convert_fragments_to_tn5_bed(fragments_tsv: str, chroms: list):
    """Convert 10X scATAC fragments file to Tn5 insertion sites bed

    Args:
        fragments_tsv (str): Path to the scATAC-seq fragments file

    Returns:
        dataframe: Tn5 insertions sites in a BED formatted pandas dataframe
        
    Examples:
    
    >>> bed_file = convert_fragments_to_tn5_bed("Granja_frags.tsv", ["chr1", "chr2"])
    """
    
    # Import fragments tsv as a dataframe
    df = pd.read_table(fragments_tsv,
                       sep="\t",
                       header=None,
                       usecols=[0, 1, 2, 3],
                       names=["chr", "start", "stop", "barcode"]
                       )

    # Filter chromosomes to those in the list
    df = df[df["chr"].isin(chroms)]
    
    # subset starts
    df_starts = df[["chr", "start", "barcode"]].copy()
    
    # subset stops
    df_ends = df[["chr", "stop", "barcode"]].copy()
    
    # Add a 1 to create 1 bp intervals representing the cut site
    df_starts["stop"] = df_starts["start"].copy() + 1
    
    # Subtract a 1 bp interval to represent the cut site. Reference miralidlab wiki page for scATAC-seq analysis
    df_ends["start"] = df_ends["stop"].copy() - 1

    # Concat the start and end positions into a single dataframe
    df_cat = pd.concat([df_starts, df_ends])

    # Reset the index
    df_cat.reset_index(inplace=True, drop=True)
    
    df_cat["start"] = df_cat["start"].astype(int)
    df_cat["stop"] = df_cat["stop"].astype(int)


    # If split is True, return two dataframes, one for each end of the fragment    
    return df_cat[["chr", "start", "stop", "barcode"]]

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions