Error in convert_fragments_to_tn5_bed

Hello again everyone, 

after correcting #151, I stumbled upon another error in which the created bedfile by the function convert_fragments_to_tn5_bed seemed to produce an error at the level of this line in scatac_generate_bigwig.sh
```bash
sort -k1,1 -k2,2n ${1} | bedtools slop  -i - -g ${2} -b ${3} | \
bedtools intersect -a - -b ${4} -v | \
tee ${bed_name} | \
bedtools genomecov -i - -g ${2} -bg -scale ${7} | \
LC_COLLATE=C sort -k1,1 -k2,2n > ${bedgraph_name}
```
The error message was : "Unexpected file format.  Please use tab-delimited BED, GFF, or VCF. Perhaps you have non-integer starts or ends at line 1?"

I am using single cell pseudobulk generated tsv files using getGroupFragments in ArchR. 

In order to understand the error, I looked at the output of the function convert_fragments_to_tn5_bed in prepare_tools.py :
```
chr     start      stop                                   barcode
0  chr1  832792.0  832793.0  astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1
1  chr1  930374.0  930375.0  astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1
2  chr1  939075.0  939076.0  astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1
3  chr1  940409.0  940410.0  astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1
4  chr1  941526.0  941527.0  astro_P0015_ini_S0028#CAGGGCTTCGATAACC-1
```
The function seemed to be converting the start and stop column as float inducing the error in bedtools. 

I am not sure as to why I am the only one experiencing this error. If you want to investigate, I provide the yml from my environment.

Anyway, the error is resolved by adding two lines in the function, insuring that start and stop are considered as integers by pandas. 

```python
def convert_fragments_to_tn5_bed(fragments_tsv: str, chroms: list):
    """Convert 10X scATAC fragments file to Tn5 insertion sites bed

    Args:
        fragments_tsv (str): Path to the scATAC-seq fragments file

    Returns:
        dataframe: Tn5 insertions sites in a BED formatted pandas dataframe
        
    Examples:
    
    >>> bed_file = convert_fragments_to_tn5_bed("Granja_frags.tsv", ["chr1", "chr2"])
    """
    
    # Import fragments tsv as a dataframe
    df = pd.read_table(fragments_tsv,
                       sep="\t",
                       header=None,
                       usecols=[0, 1, 2, 3],
                       names=["chr", "start", "stop", "barcode"]
                       )

    # Filter chromosomes to those in the list
    df = df[df["chr"].isin(chroms)]
    
    # subset starts
    df_starts = df[["chr", "start", "barcode"]].copy()
    
    # subset stops
    df_ends = df[["chr", "stop", "barcode"]].copy()
    
    # Add a 1 to create 1 bp intervals representing the cut site
    df_starts["stop"] = df_starts["start"].copy() + 1
    
    # Subtract a 1 bp interval to represent the cut site. Reference miralidlab wiki page for scATAC-seq analysis
    df_ends["start"] = df_ends["stop"].copy() - 1

    # Concat the start and end positions into a single dataframe
    df_cat = pd.concat([df_starts, df_ends])

    # Reset the index
    df_cat.reset_index(inplace=True, drop=True)
    
    df_cat["start"] = df_cat["start"].astype(int)
    df_cat["stop"] = df_cat["stop"].astype(int)


    # If split is True, return two dataframes, one for each end of the fragment    
    return df_cat[["chr", "start", "stop", "barcode"]]
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Error in convert_fragments_to_tn5_bed #153

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Error in convert_fragments_to_tn5_bed #153

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions