Archive for October, 2007
October 22nd, 2007
So my seminar last Monday (not today) went horribly. At first I had the simple goals of presenting on 1) learning parameters of Bayesian networks and 2) learning structure of Bayesian networks. In my quest I quickly discovered that these topics are as deep as you’d like them to be, and that it takes 2/3 of the Neapolitan text book to cover it all. Instead of doing a nice, concise presentation that covers only the essential details, I dragged my audience through my own tumultuous learning process. Even though I’ve posted notes from bad talks before, these ones aren’t going up.
Next week I will give it another shot. While my background knowledge must be extensive and airtight, my talk should be just an overview of important concepts. I look forward very much to posting notes from the talk that don’t suck so much.
October 12th, 2007
Over the summer and this semester I have been endeavoring to teach myself about Bayesian networks and their usability in modeling biological systems, and specifically gene interactions. The task has proved to be a difficult one, as my approach has not been very structured. It started with going through the 1995 tutorial by David Heckerman, but with each step I’ve had to make side journeys to fill in gaps in background knowledge.
So far in our lab seminar series (http://sysbio.fulton.asu.edu) I’ve covered the background in probability theory and in graph theory (Markov chains/blankets/equivalence, faithfulness, d-separation, etc.), and have begun discussing how to actually learn Bayesian networks from data. The last thing discussed was an intro on leaning posterior probability distributions, which will lead into learning network topology, and finally learning both.
I am posting my slides/notes from the four talks I’ve given, but take them only as an illustration of my progress and not as a good resource. For good resources, check out David Heckerman, Nir Friedman, Dana Pe’er, Eran Segal, Marco Ramoni, Andrew Moore, Jose Pena, and Daphne Koller, Richard Neapolitan, and of course the seminal text by Judea Pearl. All of those are who I am learning from.
Talk 1 (ppt) (horrible introduction, I didn’t know anything)
Talk 2 (ppt) (getting there, decent probability discussion)
Talk 3 (ppt) (getting better, decent graph discussion)
Talk 4 (pdf) (not so great, mostly review, notes only, no slides used)
October 9th, 2007
BioMart is pretty cool: http://www.biomart.org
I created an XML query using their form, narrowing the data set and species, then uploading my gene shopping list by Entrez gene IDs. I then specified that I wanted 2000bp upstream and downstream (2 separate queries) and created the XML files. I then removed line breaks in TextPad and ran the queries on a Linux machine using wget:
wget -O results.txt ‘http://www.biomart.org/biomart/martservice?query=MY_XML’
…where MY_XML is the newline-free XML query. The result (for one upstream and one downstream query) was 2 text files with 2000 nucleotides per line for each of 200 genes. I’ve sent this off to my biologist colleague to see if it’s indeed what we’re looking for, but I think it is.
My other colleague is looking at our target gene and getting a consensus sequence or a position weight matrix for the binding target sequence. We will then search for this sequence in the up/downstream sequences of our shopping list of genes and extract the best targets.
October 6th, 2007
This is a lot harder than I thought it would be. I guess there isn’t a tool pre-developed for any data acquisition one wishes. An interesting fact: I just downloaded the human genome. It’s just under 3 gigs of text. Talk about needles in a haystack.
October 4th, 2007
So it looks like I get to play the biologist today. I am going to try and figure out an easy way to get upstream and downstream sequences for a shopping list of genes. Then my colleague will see how likely our favorite transcription factor is to bind to each gene, either upstream or downstream. It seems pretty straightforward, but I have two worries. On my part, I’m not sure how easy it will be to get upstream or downstream regions for a collection of genes automated in bulk, not one at a time, and I’m not sure how far up or down to go in the intergenic region. 1kb? 2?
But after that, I’m not sure how reliable/usable our results will be, since our favorite transcription factor potentially interacts with lots of friends. Just because we determine it highly likely to bind upstream of some gene doesn’t mean it is not working in tandem with other transcription factors on some other gene, and we could miss that.
So far my biggest inhibitor of progress, at least as far as I can gauge from my mentor, is my dislike for trying things I doubt. For example, I am more likely to spend my time convincing myself that an undertaking (such as finding these 5′ and 3′ regions and binding sites) is worth the effort (and thus not making progress) than I am to try it and fail. I think some of that stems from my anxiety about finishing this degree in under a decade–I don’t want to waste time, per se. However, I guess you don’t learn unless you screw things up all the time, and I think I spend too much time avoiding mistakes than achieving success.