Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

May be some bugs in simulations. #4

Open
RuBP17 opened this issue Dec 8, 2023 · 4 comments
Open

May be some bugs in simulations. #4

RuBP17 opened this issue Dec 8, 2023 · 4 comments
Assignees

Comments

@RuBP17
Copy link

RuBP17 commented Dec 8, 2023

def get_introgressed_tracts(ts, chr_name, src_name, tgt_name, output):
    """
    Description:
        Outputs true introgressed tracts from a tree-sequence into a BED file.

    Arguments:
        ts tskit.TreeSequence: Tree-sequence containing introgressed tracts.
        chr_name int: Name of the chromosome.
        src_name str: Name of the source population.
        tgt_name str: Name of the target population.
        output string: Name of the output file.
    """
    introgressed_tracts = []

    for p in ts.populations():
        source_id = [p.id for p in ts.populations() if p.metadata['name']==src_name][0]
        target_id = [p.id for p in ts.populations() if p.metadata['name']==tgt_name][0]

    for m in ts.migrations():
        if m.dest == source_id: introgressed_tracts.append((int(m.left), int(m.right)))

    introgressed_tracts.sort(key=lambda x:x[0])
    with open(output, 'w') as o:
        for t in introgressed_tracts:
            o.write(f'1\t{t[0]}\t{t[1]}\n')

In this function, all the introgression segments from the "src" are recorded. In fact, not all the segments from the "src" are the introgression segments, some of them may not go into the target segments. And the "tgt_id" is not used. This bug may occur under complexed demographic models like 2 src models.

In 2src simulation (HumanNeanderthalDenisovan)

if wildcards.demog == 'HumanNeanderthalDenisovan':
            graph = demes.load("config/simulation/models/HumanNeanderthalDenisovan_PapuansOutOfAfrica_10J19.yaml")
            demography = msprime.Demography.from_demes(graph)
            ref_id = 'YRI'
            tgt_id = 'Papuan'
            src1_id = 'NeaA'
            src2_id = 'DenA'
            mutation_rate = 1.4e-8
            recombination_rate = 1e-8

            T_DenA = 59682 / 29
            T_NeaA = 75748 / 29

            samples = [
                msprime.SampleSet(int(wildcards.nref), ploidy=2, population=ref_id),
                msprime.SampleSet(int(wildcards.ntgt), ploidy=2, population=tgt_id),
                msprime.SampleSet(1, ploidy=2, population=src1_id, time=T_NeaA),
                msprime.SampleSet(1, ploidy=2, population=src2_id, time=T_DenA),
            ]

The src samples are NeaA and DenA. But in 'get_tracts', the src become "Nea1", "Den1" and "Den2". Maybe there is something wrong.

rule get_tracts:
    input:
        ts = rules.simulation.output.ts,
    output:
        src1_tracts = output_dir + "simulated_data/{demog}/nref_{nref}/ntgt_{ntgt}/{seed}/sim2src.src1.introgressed.tracts.bed",
        src2_tracts = output_dir + "simulated_data/{demog}/nref_{nref}/ntgt_{ntgt}/{seed}/sim2src.src2.introgressed.tracts.bed",
    threads: 1,
    resources: time_min=120, mem_mb=5000, cpus=1,
    run:
        ts = tskit.load(input.ts)

        if wildcards.demog == 'HumanNeanderthalDenisovan':
            src1_id = "Nea1"
            src2_id = "Den1"
            src3_id = "Den2"
            tgt_id = "Papuan"

            src3_tracts = output_dir + f'simulated_data/{wildcards.demog}/nref_{wildcards.nref}/ntgt_{wildcards.ntgt}/{wildcards.seed}/sim.src3.introgressed.tracts.bed'
            get_introgressed_tracts(ts, chr_name=1, src_name=src1_id, tgt_name=tgt_id, output=output.src1_tracts)
            get_introgressed_tracts(ts, chr_name=1, src_name=src2_id, tgt_name=tgt_id, output=output.src2_tracts)
            get_introgressed_tracts(ts, chr_name=1, src_name=src3_id, tgt_name=tgt_id, output=src3_tracts)

            a = pybedtools.BedTool(output.src2_tracts)
            b = pybedtools.BedTool(src3_tracts)
            a.cat(b).sort().merge().saveas(output.src2_tracts)
@xin-huang
Copy link
Member

Hello @RuBP17, thanks for testing the sstar analysis.

Regarding the first question,

not all the segments from the "src" are the introgression segments, some of them may not go into the target segments.

This is correct. However, here we used msprime, which is a backward time simulator, for simulation. As mentioned in the tskit tutorial for introgression, one should be careful in reverse time terminology.

Actually, these codes

for p in ts.populations():
        source_id = [p.id for p in ts.populations() if p.metadata['name']==src_name][0]
        target_id = [p.id for p in ts.populations() if p.metadata['name']==tgt_name][0]

 for m in ts.migrations():
        if m.dest == source_id: introgressed_tracts.append((int(m.left), int(m.right)))

are modified from the get_migrating_tracts function in the above tutorial, replacing neanderthal_id with source_id:

def get_migrating_tracts(ts):
    neanderthal_id = [p.id for p in ts.populations() if p.metadata['name']=='Neanderthal'][0]
    migrating_tracts = []
    # Get all tracts that migrated into the neanderthal population
    for migration in ts.migrations():
        if migration.dest == neanderthal_id:
            migrating_tracts.append((migration.left, migration.right))
    return np.array(migrating_tracts) 

As explained in the tutorial, this function finds

the set of tracts that exist in the Eurasian genome because they have come from Neanderthals via admixture at time T_ad
(be careful here: in reverse time terminology, we denote the “source” population as Eurasian and the “destination” population as Neanderthals). This is done simply by finding all migration records in which the “destination” population name is Neanderthal.

For the second question, the HumanNeanderthalDenisovan demographic model was taken from stdpopsim. As defined in the Catalog, NeaA is the Altai Neanderthal lineage, and DenA is the Altai Denisovan lineage, which are the sampled genomes, whereas Nea1 and Den1 are the actual populations providing introgression materials.

However, in this model, there are several introgressed populations.

pulses:
- {sources: [Nea1], dest: Ghost, time: 1853.0, proportions: [0.024]}
- {sources: [Den2], dest: Papuan, time: 1575.8620689655172, proportions: [0.018]}
- {sources: [Nea1], dest: CHB, time: 1566.0, proportions: [0.011]}
- {sources: [Nea1], dest: Papuan, time: 1412.0, proportions: [0.002]}
- {sources: [Den1], dest: Papuan, time: 1027.5862068965516, proportions: [0.022]}
- {sources: [Nea1], dest: CHB, time: 883.0, proportions: [0.002]}

Therefore, the truth tracts obtained from these codes:

rule get_tracts:
    input:
        ts = rules.simulation.output.ts,
    output:
        src1_tracts = output_dir + "simulated_data/{demog}/nref_{nref}/ntgt_{ntgt}/{seed}/sim2src.src1.introgressed.tracts.bed",
        src2_tracts = output_dir + "simulated_data/{demog}/nref_{nref}/ntgt_{ntgt}/{seed}/sim2src.src2.introgressed.tracts.bed",
    threads: 1,
    resources: time_min=120, mem_mb=5000, cpus=1,
    run:
        ts = tskit.load(input.ts)

        if wildcards.demog == 'HumanNeanderthalDenisovan':
            src1_id = "Nea1"
            src2_id = "Den1"
            src3_id = "Den2"
            tgt_id = "Papuan"

            src3_tracts = output_dir + f'simulated_data/{wildcards.demog}/nref_{wildcards.nref}/ntgt_{wildcards.ntgt}/{wildcards.seed}/sim.src3.introgressed.tracts.bed'
            get_introgressed_tracts(ts, chr_name=1, src_name=src1_id, tgt_name=tgt_id, output=output.src1_tracts)
            get_introgressed_tracts(ts, chr_name=1, src_name=src2_id, tgt_name=tgt_id, output=output.src2_tracts)
            get_introgressed_tracts(ts, chr_name=1, src_name=src3_id, tgt_name=tgt_id, output=src3_tracts)

            a = pybedtools.BedTool(output.src2_tracts)
            b = pybedtools.BedTool(src3_tracts)
            a.cat(b).sort().merge().saveas(output.src2_tracts)

not only contain the introgressed fragments in Papuans but also in the CHB and Ghost populations. Yes, these may affect the results of the performance comparison (@kuhlwilm).

@RuBP17
Copy link
Author

RuBP17 commented Dec 8, 2023

Maybe we can use the tree to find out where the segments go.

source_id = [p.id for p in ts.populations() if p.metadata['name']==src_name][0]
target_id = [p.id for p in ts.populations() if p.metadata['name']==tgt_name][0]

Testpopulation = ts.get_samples(target_id)
for m in ts.migrations():
	if m.dest == source_id:
		for tree in ts.trees(leaf_lists=True):
                        # find the trees among the (m.left and m.right)
			if m.left > tree.get_interval()[0]:
				continue
			if m.right <= tree.get_interval()[0]:
				break

			for l in tree.leaves(mr.node):
                        # use leave to find out where the segments go
				if l in Testpopulation:
					de_seg[l].append(tree.get_interval())

I tested this code under some simple models like HumanNeanderthal, It could find the same segments as the original code, because there is only one intro pop.

And I also tested this code under the complexed model HumanDenisovanNeanderthal, the output segments are much fewer than origin codes and the proportion of introgression segments seems more reasonable.

@xin-huang
Copy link
Member

Could you please try the following codes and see whether you can get reasonable results?

def _get_true_tracts(ts, tgt_id, src_id, ploidy):
    """
    Description:
        Helper function to obtain ground truth introgressed tracts from tree-sequence.

    Arguments:
        ts tskit.TreeSqueuece: Tree-sequence containing ground truth introgressed tracts.
        tgt_id str: Name of the target population. 
        src_id str: Name of the source population.
        ploidy int: Ploidy of the genomes.
    """
    tracts = {}
    introgression = []

    for p in ts.populations():
        source_id = [p.id for p in ts.populations() if p.metadata['name']==src_id][0]
        target_id = [p.id for p in ts.populations() if p.metadata['name']==tgt_id][0]

    for i in range(ts.num_samples):
        node = ts.node(i)
        if node.population == target_id: tracts[node.id] = []

    for m in ts.migrations():
        if m.dest == source_id: introgression.append(m)

    for i in introgression:
        for t in ts.trees():
            # Tree-sequences are sorted by the left ends of the intervals
            # Can skip those tree-sequences are not overlapped with the interval of i.
            if i.left >= t.interval.right: continue
            if i.right <= t.interval.left: break # [l, r)
            for n in tracts.keys():
                left = i.left if i.left > t.interval.left else t.interval.left
                right = i.right if i.right < t.interval.right else t.interval.right
                if t.is_descendant(n, i.node): tracts[n].append([1, int(left), int(right), f'tsk_{ts.node(n).individual}_{int(n%ploidy+1)}'])

    return tracts

@xin-huang xin-huang self-assigned this Dec 8, 2023
@RuBP17
Copy link
Author

RuBP17 commented Dec 11, 2023

yes, I followed your code and get reasonable results. The results were small intervals and I merged it.

def combine_segs(segs_dict):
    combined_segs = {}
    for node, segs in segs_dict.items():
        tgt_id = segs[0][3]
        segs_np = np.array(segs)[:,1:3].astype(int)
        merged = np.empty([0, 2],dtype=np.int64)
        sorted_segs = segs_np[np.argsort(segs_np[:, 1]), :]
        for higher in sorted_segs:
            if len(merged) == 0:
                merged = np.vstack([merged, higher])
            else:
                lower = merged[-1, :]
                if higher[0] <= lower[1]:
                    upper_bound = max(lower[1], higher[1])
                    merged[-1, :] = (lower[0], upper_bound)
                else:
                    merged = np.vstack([merged, higher])
        combined_segs[tgt_id] = merged.tolist()
    return combined_segs

the output is like

{'tsk_10_1': [[4752671, 4822156],
  [5515440, 5577464],
  [5656534, 5782219],
  [5786697, 5912558],
  [9751616, 9776781]],
 'tsk_10_2': [[1081574, 1155350],
  [5515440, 5646920],
  [5787124, 5848448],
  [6489042, 6508794]]}

The results of origin code are

[(3194731, 3215569),
 (990262, 996713),
 (4815364, 4822156),
 (2881951, 2913192),
 (5546817, 5577464),
 (5577464, 5602289),
 (5607978, 5640348),
 (8966597, 8969216),
 (5885349, 5912558),
 (4783723, 4799208),
 (7249153, 7268325),
 (7268325, 7269434),
 (7269434, 7273435),
 (7273435, 7274846),
 (7274846, 7329602),
 (4799208, 4815364),
 (5640348, 5646920),
 (914195, 990262),
 (5536479, 5546817),
 (1382099, 1386707),
 (3122723, 3126090),
 (8966176, 8966597),
 (5515440, 5536479),
 (8969216, 8996656),
 (4881243, 4914243),
 (7085758, 7107298),
 (7199256, 7269434),
 (7269434, 7281135),
 (2653571, 2654761),
 (3836413, 3870191),
 (5604596, 5607978),
 (7435721, 7446997),
 (7482554, 7600984),
 (3812997, 3823362),
 (408670, 513053),
 (618711, 641597),
 (831758, 914195),
 (3800141, 3812997),
 (3782819, 3800141),
 (3823362, 3825790),
 (3825790, 3836413),
 (2878930, 2881951),
 (6489042, 6508794),
 (5656534, 5782219),
 (5602289, 5604596),
 (1212642, 1225440),
 (1225440, 1242849),
 (5786697, 5787124),
 (5787124, 5848448),
 (5848448, 5885349),
 (2263423, 2292123),
 (9751616, 9776781),
 (9999209, 10000000),
 (3193501, 3194731),
 (7239338, 7249153),
 (9111095, 9138914),
 (9138914, 9171257),
 (3119511, 3122723),
 (7329602, 7348552),
 (3079017, 3110135),
 (3870191, 3875974),
 (7137745, 7150696),
 (4752671, 4783723),
 (1081574, 1155350),
 (1242849, 1252345),
 (1371432, 1382099),
 (593168, 616656)]

Total length is 10**7 bp.
the proportion of the introgressed segments of origin code is 0.08193445.
the proportion of the introgressed segments of new code is 0.0347276.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants