For my semester project in my algorithms class I implemented an articulation point detection algorithm by Harold Gabow. Articulation points, a.k.a. cut vertices, are vertices in an undirected graph which, when removed, break the graph into components. The stack-based algorithm is a version of Robert Tarjan’s LOWPOINT method, and it runs in roughly O(|V|*|E|). I picked it for my project since I’d eventually like to apply articulation point detection to gene regulatory network studies. I’ve posted the report, the software, sample input/output, and source code on my website: http://www.michaelverdicchio.com/articulation.
Archive for 2007
Busy times and much has transpired since the last Bayesian networks talk. We’ve now completed parameter learning and had our first talk on structure learning. Here are the materials:
Talk 5 (pdf) – Long talk on parameter learning–we got through page 8.
Talk 6 (pdf) – Short talk finishing the previous talk and completing parameter learning.
Talk 6 Handout (pdf) – This handout summarizes many of the definitions, theorems and lemmas from parameter learning. The reference numbers and notation correspond to Neapolitan’s text.
Talk 7 (ppt) – This presentation gave an overview of structure learning via local search. It is lacking in examples and explicit theory, but examples will be shown more as we come to causality, and the theory is based on the theory presented in parameter learning.
Hopefully after the semester I’ll get one big Latex version of all of these.
So my seminar last Monday (not today) went horribly. At first I had the simple goals of presenting on 1) learning parameters of Bayesian networks and 2) learning structure of Bayesian networks. In my quest I quickly discovered that these topics are as deep as you’d like them to be, and that it takes 2/3 of the Neapolitan text book to cover it all. Instead of doing a nice, concise presentation that covers only the essential details, I dragged my audience through my own tumultuous learning process. Even though I’ve posted notes from bad talks before, these ones aren’t going up.
Next week I will give it another shot. While my background knowledge must be extensive and airtight, my talk should be just an overview of important concepts. I look forward very much to posting notes from the talk that don’t suck so much.
Over the summer and this semester I have been endeavoring to teach myself about Bayesian networks and their usability in modeling biological systems, and specifically gene interactions. The task has proved to be a difficult one, as my approach has not been very structured. It started with going through the 1995 tutorial by David Heckerman, but with each step I’ve had to make side journeys to fill in gaps in background knowledge.
So far in our lab seminar series (http://sysbio.fulton.asu.edu) I’ve covered the background in probability theory and in graph theory (Markov chains/blankets/equivalence, faithfulness, d-separation, etc.), and have begun discussing how to actually learn Bayesian networks from data. The last thing discussed was an intro on leaning posterior probability distributions, which will lead into learning network topology, and finally learning both.
I am posting my slides/notes from the four talks I’ve given, but take them only as an illustration of my progress and not as a good resource. For good resources, check out David Heckerman, Nir Friedman, Dana Pe’er, Eran Segal, Marco Ramoni, Andrew Moore, Jose Pena, and Daphne Koller, Richard Neapolitan, and of course the seminal text by Judea Pearl. All of those are who I am learning from.
Talk 1 (ppt) (horrible introduction, I didn’t know anything)
Talk 2 (ppt) (getting there, decent probability discussion)
Talk 3 (ppt) (getting better, decent graph discussion)
Talk 4 (pdf) (not so great, mostly review, notes only, no slides used)
BioMart is pretty cool: http://www.biomart.org
I created an XML query using their form, narrowing the data set and species, then uploading my gene shopping list by Entrez gene IDs. I then specified that I wanted 2000bp upstream and downstream (2 separate queries) and created the XML files. I then removed line breaks in TextPad and ran the queries on a Linux machine using wget:
wget -O results.txt ‘http://www.biomart.org/biomart/martservice?query=MY_XML’
…where MY_XML is the newline-free XML query. The result (for one upstream and one downstream query) was 2 text files with 2000 nucleotides per line for each of 200 genes. I’ve sent this off to my biologist colleague to see if it’s indeed what we’re looking for, but I think it is.
My other colleague is looking at our target gene and getting a consensus sequence or a position weight matrix for the binding target sequence. We will then search for this sequence in the up/downstream sequences of our shopping list of genes and extract the best targets.
This is a lot harder than I thought it would be. I guess there isn’t a tool pre-developed for any data acquisition one wishes. An interesting fact: I just downloaded the human genome. It’s just under 3 gigs of text. Talk about needles in a haystack.
So it looks like I get to play the biologist today. I am going to try and figure out an easy way to get upstream and downstream sequences for a shopping list of genes. Then my colleague will see how likely our favorite transcription factor is to bind to each gene, either upstream or downstream. It seems pretty straightforward, but I have two worries. On my part, I’m not sure how easy it will be to get upstream or downstream regions for a collection of genes automated in bulk, not one at a time, and I’m not sure how far up or down to go in the intergenic region. 1kb? 2?
But after that, I’m not sure how reliable/usable our results will be, since our favorite transcription factor potentially interacts with lots of friends. Just because we determine it highly likely to bind upstream of some gene doesn’t mean it is not working in tandem with other transcription factors on some other gene, and we could miss that.
So far my biggest inhibitor of progress, at least as far as I can gauge from my mentor, is my dislike for trying things I doubt. For example, I am more likely to spend my time convincing myself that an undertaking (such as finding these 5′ and 3′ regions and binding sites) is worth the effort (and thus not making progress) than I am to try it and fail. I think some of that stems from my anxiety about finishing this degree in under a decade–I don’t want to waste time, per se. However, I guess you don’t learn unless you screw things up all the time, and I think I spend too much time avoiding mistakes than achieving success.
So in my year of graduate school I’ve found that occasionally, I’ll discover something useful on someone’s blog. So I decided to start my own to sort of chronicle my progress to a PhD, but also for a place to post problems I’m working on, and hopefully solutions to those problems. We’ll see how it goes.
