Any Python programmers willing to help me find a bug?

Asked by gorillapaws (30517) December 5th, 2014

I’m working on a Python script for my sister to help with her dissertation for her doctorate in marine biology. I’ve started a thread on dreamincode, but have been unable to pin down the source of the bug thus far. If you’re a Python coder, and have a bit of time to look through my amateurish code (the most recent, edited version is one of the later posts), I’d REALLY appreciate the help. It’s FOR SCIENCE!

11 Answers

Anything for the sake of science. I will look at the code and see if I can find the bug. As I understand it, you want to find the number of distinct locus values. By just looking at the csv file, I can’t figure out how the locus values are determined. I will start looking at the code to see if I can make more sense of it, but it would help if I knew just how the parsing is supposed to work. Also, what is the calling sequence to get the program started?

LostInParadise (31907) “Great Answer” (2)

Thanks! I just got an update from my sister, and she wants the data outputted differently. I’ve posted those changes, so the most recent post has the latest code.

As for your question, I’ll quote myself from the other thread:

“That’s correct: the loci row (header row) gives the number of columns (12 in the test data, excluding the fish_identifier and location columns, though the real data set is much larger). Each entry is the identifier for that locus. In my code, each individual fish has a genotype list containing one entry per locus (there should be 12 for this test data). The ACGT values of the loci in the genotype should be unique to each fish. In my code they appear to be treated as globals, even though that’s not my intent. Every 2 rows represent one fish (DNA has 2 strands). The goal is to output 4 rows for each fish (A, C, G & T). The value in each column should be 0.0, 0.5 (one of the DNA strands has this base at that locus), or 1.0 (both DNA strands have this base).”

This is slightly inaccurate now because the exporting requirements have changed. In the output file there should be 1 row for each fish, with each locus split into separate A, C, G and T columns. So the data remains the same; it is simply presented differently.
So the columns for the 84_16 locus should be:
84_16_A, 84_16_C, 84_16_G, 84_16_T
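The renamed columns can be generated straight from the original header. This is only a minimal sketch, assuming the loci names have already been read into a list; `expand_header` is a made-up helper name and is not taken from the posted script.

```python
BASES = ("A", "C", "G", "T")

def expand_header(loci):
    """Return one column name per (locus, base) pair, keeping locus order."""
    return ["{}_{}".format(locus, base) for locus in loci for base in BASES]

print(expand_header(["84_16", "90_71"]))
```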

gorillapaws (30517) “Great Answer” (1)

I am guessing that you kick the program off by making a call to fish_import.

In the csv file, I see an initial line with a list of pairs of numeric values separated by underscores. The rest of the file has rows starting with a name with 3 parts separated by underscores, followed by a list of numbers. These are, as you indicated, grouped in pairs. I am not getting how any of this goes together. In particular, I don’t know where the A, C, G and T come into this. I will do my best to figure out what is supposed to happen, but it would be really helpful if I knew what the program was supposed to do.

LostInParadise (31907) “Great Answer” (1)

Sorry for being unclear. In the .csv file the first row holds the names of the loci (e.g. 84_16, 90_71, 122_16 etc.). These names aren’t really important and are computer-generated (I think) from the DNA sequencing process.

The rows represent the fish. Every 2 rows is a different individual. Each row represents half the DNA (2 strands, so 2 rows). The program is designed to go through both strands, at each locus, for each fish. It’s easier to explain with an example:

1 = A
2 = C
3 = G
4 = T

If we look at the first strand of the first fish (FC75_Sesoko2011_Fish1Vsens5) at the first locus (84_16) we get a value of 4 in the .csv. This means we increment the T at that locus by 0.5. The second strand for that fish at the same locus is also a 4, so we increment T again by 0.5 (so it’s now at 1.0). The values for the first fish (FC75_Sesoko2011_Fish1Vsens5) at the first locus (84_16) would be A = 0.0, C = 0.0, G = 0.0 and T = 1.0.

She wants the data to output for the first locus in the header row (ignoring the first two columns) to look like:
84_16_A, 84_16_C, 84_16_G, 84_16_T
and the row for that fish would be 0.0, 0.0, 0.0, 1.0
This process would extend for all loci, so the first fish at the 2nd locus (90_71) would be
90_71_A, 90_71_C, 90_71_G, 90_71_T
and that row would continue with 1.0, 0.0, 0.0, 0.0
The third locus (122_16) for the first fish is
122_16_A, 122_16_C, 122_16_G, 122_16_T
and that row would continue with 0.0, 0.5, 0.0, 0.5

In total the output would look like this for the header row just for the first 3 loci:
Fish_ID,Population,84_16_A,84_16_C,84_16_G,84_16_T,90_71_A,90_71_C,90_71_G,90_71_T,122_16_A,122_16_C,122_16_G,122_16_T
The second row would be:
FC75_Sesoko2011_Fish1Vsens5,1,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5

Basically all rows, excluding the first two columns (fish name and population), will contain only 0.0, 0.5, or 1.0.
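The per-fish calculation described above could be sketched roughly like this. It assumes each strand arrives as a list of code strings with the fish name and population columns already stripped off; `fish_row` and `CODE_TO_BASE` are illustrative names, not part of the posted script, and any code other than 1 to 4 would raise a KeyError here.

```python
# Mapping from the numeric codes in the .csv to bases, as listed above.
CODE_TO_BASE = {"1": "A", "2": "C", "3": "G", "4": "T"}
BASES = ("A", "C", "G", "T")

def fish_row(strand_one, strand_two):
    """Turn one fish's two strand rows into per-locus base frequencies.

    Assumption: both lists hold only the locus values, in header order.
    """
    row = []
    for code_one, code_two in zip(strand_one, strand_two):
        counts = dict.fromkeys(BASES, 0.0)  # fresh counts for every locus
        for code in (code_one, code_two):
            counts[CODE_TO_BASE[code]] += 0.5  # each strand contributes half
        row.extend(counts[base] for base in BASES)
    return row

# Illustrative input matching the first three loci of the first fish:
# 4/4 at 84_16, 1/1 at 90_71, and a 2 and a 4 at 122_16 (strand order assumed).
print(fish_row(["4", "1", "2"], ["4", "1", "4"]))
```

Building a new `counts` dict inside the loop keeps the values for one locus from leaking into the next.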

I hope that’s intelligible. Please ask if you need me to explain it better.

gorillapaws (30517) “Great Answer” (0)

Edit: I thought I had it, but I was wrong. Somehow it seems that genotype is being added to for each fish: 40 fish times 12 genotypes = 480.

LostInParadise (31907) “Great Answer” (1)

THANK YOU SO MUCH!!!

gorillapaws (30517) “Great Answer” (0)

Wait, where am I not referencing it as self.genotype? Do I need to declare it differently to be an instance property as opposed to a class level property?

gorillapaws (30517) “Great Answer” (0)

I am a little rusty on Python classes. It seems to me that genotype is being treated as a class variable. I modified the code to set genotype = [] each time that Fish.import_data is called. That seems to work; at least the error message did not get printed. I previously put in debug statements, so I know that the genotype size increases by 12 each time.
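The growth by 12 per fish can be reproduced with a stripped-down stand-in for the class. This is not the posted code: the `import_data` signature and the dummy values are assumptions made only to show the effect of a list declared at class level.

```python
class Fish(object):
    genotype = []  # class-level: one list object shared by every Fish

    def import_data(self, values):
        # append() mutates the shared list in place; it never creates
        # a separate attribute on the instance.
        for value in values:
            self.genotype.append(value)

first = Fish()
first.import_data(range(12))
second = Fish()
second.import_data(range(12))

print(len(second.genotype))               # 24, not 12
print(first.genotype is second.genotype)  # True: the same list object
```

Writing `self.genotype` does not help by itself, because the attribute lookup falls back to the class when the instance has no attribute of that name.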

LostInParadise (31907) “Great Answer” (1)

Ok, thank you. I’ll read through the docs. I’m sure you’re right though, because it would perfectly account for the behavior.

gorillapaws (30517) “Great Answer” (0)

I just checked the Python documentation. You need to remove the declarations from the class level. Define a class constructor that looks like this:

def __init__(self):
    self.genotype = []
    self.name = ""
    self.population = ""

The lines in the body of the constructor must of course be indented.

That should get the program to work correctly.
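For comparison, here is the same kind of stripped-down class with the attributes created in the constructor. As before, the `import_data` signature and the dummy values are assumptions, not the posted script.

```python
class Fish(object):
    def __init__(self):
        self.genotype = []    # a new list for each instance
        self.name = ""
        self.population = ""

    def import_data(self, values):
        self.genotype.extend(values)

first = Fish()
first.import_data(range(12))
second = Fish()
second.import_data(range(12))

print(len(first.genotype))                # 12
print(len(second.genotype))               # 12
print(first.genotype is second.genotype)  # False: separate lists
```

The strings would not have shown the same symptom at class level, since strings are immutable and assigning `self.name = ...` creates an instance attribute, but setting all three in `__init__` keeps the per-fish state in one place.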

LostInParadise (31907