Tag: Python

Python: Getting page title

Published by Silveira on 2020-05-26

# Get the HTML page content
from urllib import request
html = request.urlopen("https://silveiraneto.net").read()

# Get the title tag
from bs4 import BeautifulSoup
title = BeautifulSoup(html, 'html.parser').find('title')

print(title.string)

This uses BeautifulSoup, a library for parsing HTML and XML. The same could be done with a regular expression, but parsing HTML with regex is usually a bad idea.
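If installing BeautifulSoup is not an option, the standard library's html.parser module can also extract the title without resorting to regex. The sketch below is only an illustration: the TitleParser class and get_title function are names made up for this example, and it assumes the page has already been decoded to a string (for instance with html.decode("utf-8")).

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and self.title is None:
            self.in_title = True
            self.title = ""

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False


def get_title(html):
    """Return the page title as a string, or None if there is no <title>."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title


print(get_title("<html><head><title>Example page</title></head></html>"))
```

For anything beyond a quick script, BeautifulSoup is still the more forgiving choice with malformed pages.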

Telephone keypad combinations

Published by Silveira on 2014-11-19

Problem: Given a sequence of numbers, show all possible letter combinations in a telephone keypad.

Recursive solution in Python:
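[python]
# Letters printed on each key of a telephone keypad.
keyboard = {
    '1': [],
    '2': ['a', 'b', 'c'],
    '3': ['d', 'e', 'f'],
    '4': ['g', 'h', 'i'],
    '5': ['j', 'k', 'l'],
    '6': ['m', 'n', 'o'],
    '7': ['p', 'q', 'r', 's'],
    '8': ['t', 'u', 'v'],
    '9': ['w', 'x', 'y', 'z'],
    '0': [],
}

def printkeys(numbers, prefix=""):
    # No digits left: the prefix is a complete combination.
    if len(numbers) == 0:
        print(prefix)
        return

    # Try each letter of the first digit, then recurse on the remaining digits.
    for letter in keyboard[numbers[0]]:
        printkeys(numbers[1:], prefix + letter)

printkeys("234")
[/python]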

Output:

adg
adh
adi
aeg
aeh
aei
afg
afh
afi
bdg
bdh
bdi
beg
beh
bei
bfg
bfh
bfi
cdg
cdh
cdi
ceg
ceh
cei
cfg
cfh
cfi
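The same combinations can also be produced without explicit recursion. The following is a minimal sketch, not part of the original solution, that uses itertools.product from the standard library; KEYPAD and keypad_combinations are names chosen for this example. It returns the combinations as a list instead of printing them.

```python
from itertools import product

# Same mapping as the keyboard dictionary, with each key's letters as a string.
KEYPAD = {
    "1": "", "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz", "0": "",
}


def keypad_combinations(numbers):
    """Return every letter combination for a string of digits, in keypad order."""
    letter_groups = [KEYPAD[digit] for digit in numbers]
    # product() picks one letter from each group, in the order of the digits.
    return ["".join(letters) for letters in product(*letter_groups)]


combos = keypad_combinations("234")
print(len(combos), combos[0], combos[-1])  # 27 adg cfi
```

As in the recursive version, a digit with no letters (0 or 1) produces no combinations at all.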

Permutations implemented in Python

Published by Silveira on 2014-11-18

In case you can’t use Python’s itertools, or in case you want a simple, recursive Python implementation that generates the permutations of a list:

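[python]
def perm(a, k=0):
    # Positions before k are fixed; when k reaches the end, a is a full permutation.
    if k == len(a):
        print(a)
    else:
        for i in range(k, len(a)):
            a[k], a[i] = a[i], a[k]  # put element i at position k
            perm(a, k + 1)
            a[k], a[i] = a[i], a[k]  # swap back to restore the list

perm([1, 2, 3])
[/python]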

Output:

[1, 2, 3]
[1, 3, 2]
[2, 1, 3]
[2, 3, 1]
[3, 2, 1]
[3, 1, 2]

This Python implementation is based on the algorithm presented in the book Computer Algorithms by Horowitz, Sahni and Rajasekaran.
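When the permutations are needed as values rather than printed, the same swap scheme can be written as a generator. This is a sketch of one possible variation, with iter_perm as a made-up name; it also checks the result against itertools.permutations, which yields the same permutations in a different order.

```python
from itertools import permutations


def iter_perm(a, k=0):
    """Yield each permutation of list a, using the same swap scheme as perm()."""
    if k == len(a):
        yield list(a)  # copy, because a keeps being modified in place
    else:
        for i in range(k, len(a)):
            a[k], a[i] = a[i], a[k]
            yield from iter_perm(a, k + 1)
            a[k], a[i] = a[i], a[k]


result = list(iter_perm([1, 2, 3]))
print(result)

# Same set of permutations as itertools, although in a different order.
assert sorted(result) == sorted(list(p) for p in permutations([1, 2, 3]))
```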

Merging k lists of size n

Published by Silveira on 2013-01-13

Merging k lists of size n, using two different approaches.

Original file on Github.
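The original file is not reproduced here, so the following is only a sketch of what two such approaches could look like. It assumes the input lists are already sorted, and the two strategies shown (concatenate and sort, versus a heap of list heads) are not necessarily the ones used in the original file. The function names are made up for this example.

```python
import heapq


def merge_by_sorting(lists):
    """Concatenate every list and sort the result: O(N log N) for N total items."""
    merged = []
    for lst in lists:
        merged.extend(lst)
    merged.sort()
    return merged


def merge_with_heap(lists):
    """Repeatedly take the smallest head among the k lists: O(N log k)."""
    # Each heap entry is (value, index of the list, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, i, pos = heapq.heappop(heap)
        merged.append(value)
        if pos + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    return merged


lists = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(merge_by_sorting(lists))
print(merge_with_heap(lists))
```

The standard library also offers heapq.merge, which performs the heap-based merge lazily.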

GWU Computer Science Graduate Classes Graph

Published by Silveira on 2012-06-03

To help me decide which classes to take each semester, I scraped the graduate and undergraduate bulletins. For every class I could get the class name, prerequisites, credits, teacher, program, description, etc., in a formatted tabular document.

Using Python’s csv library I could read the tables and convert the data to other formats. One format that is very useful for graph structures is the DOT language (part of the Graphviz project), in which you can describe both the structure of the graph and the layout of its elements.

The Python source code that converts the tables to graphs is on GitHub.
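As a rough illustration of the idea (not the script from the repository), the sketch below reads a small CSV table with csv.DictReader and emits a DOT digraph with an edge from each prerequisite to the class that requires it. The column names, the semicolon-separated prerequisites field and the class codes are all assumptions made up for this example.

```python
import csv
import io

# Hypothetical sample; the real tables have more columns and real class codes.
SAMPLE = """code,name,prerequisites
CS101,Intro to Programming,
CS201,Data Structures,CS101
CS301,Algorithms,CS201;MATH210
"""


def csv_to_dot(csv_file):
    """Build a DOT digraph with an edge from each prerequisite to its class."""
    lines = ["digraph classes {"]
    for row in csv.DictReader(csv_file):
        # One node per class, labelled with its code and name (\n is DOT's line break).
        lines.append('    "%s" [label="%s\\n%s"];' % (row["code"], row["code"], row["name"]))
        # An empty prerequisites field yields no edges.
        for prereq in filter(None, row["prerequisites"].split(";")):
            lines.append('    "%s" -> "%s";' % (prereq, row["code"]))
    lines.append("}")
    return "\n".join(lines)


dot = csv_to_dot(io.StringIO(SAMPLE))
print(dot)
```

The resulting text can be saved to a .dot file and rendered with Graphviz, for example with the dot command line tool.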

The final result: [image: graph of the classes and their prerequisites]

Limitations and comments:

Prerequisites are only displayed using AND logic. Other logic, such as OR (equivalent classes), is not shown.
Errors may exist due to the scraping process, the conversions, or errors in the original source.
The sources also include a function to convert the graph to Dracula (a JavaScript interactive graph representation), but the current result is too tangled.

Substitutions in a phylogenetic tree file

Published by Silveira on 2012-03-08

The Newick tree

The Newick tree format is a way of representing graph-theoretical trees with edge lengths using parentheses and commas.

A Newick tree example:

(((Espresso:2,(Milk Foam:2,Espresso Macchiato:5,((Steamed Milk:2,Cappucino:2,(Whipped Cream:1,Chocolate Syrup:1,Cafe Mocha:3):5):5,Flat White:2):5):5):1,Coffee arabica:0.1,(Columbian:1.5,((Medium Roast:1,Viennese Roast:3,American Roast:5,Instant Coffee:9):2,Heavy Roast:0.1,French Roast:0.2,European Roast:1):5,Brazilian:0.1):1):1,Americano:10,Water:1);

A graphical representation of the Newick tree above (using the http://www.jsphylosvg.com/ library):

The Newick format is commonly used to store phylogenetic trees.

The problem

A phylogenetic tree can be highly branched and dense, and even with proper visualization software it can be difficult to analyse. Additionally, as a tree is produced by a chain of different software with data from the laboratory, the label of each leaf or node can be something that is not meaningful to a human reader.

For this particular problem, an example of a node label could be SXS_3014_Albula_vulpes_id_30.

There was a spreadsheet with more meaningful information, in which the node label could be used as a primary key. Example for the node above:

Taxon Order	Family	Genus	Species	ID
Albuliformes	Albulidae	Albula	vulpes	SXS_3014_Albula_vulpes_id_30

The problem consists of using the tree and the spreadsheet to produce a new tree with the same structure, where each node has a more meaningful label.

The approach

The new tree can be built by substituting each label of the initial tree with the corresponding information from the spreadsheet. A script can be used to automate this process.

The solution

After converting the spreadsheet to a CSV file, which is more easily handled by Python’s csv library, the problem is reduced to file handling and string substitution. Fortunately, due to the simplicity of the Newick format and its limited vocabulary, a tree parser is not necessary.

Source-code at Github.
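A minimal sketch of the substitution, assuming the column names from the example row above and a made-up rule for building the new label (family, genus and species joined by underscores). The tree and spreadsheet strings are illustrative, and the relabel function is not the script from the repository.

```python
import csv
import io

# Illustrative data in the spirit of the example above, not the original files.
SPREADSHEET = (
    "Taxon Order,Family,Genus,Species,ID\n"
    "Albuliformes,Albulidae,Albula,vulpes,SXS_3014_Albula_vulpes_id_30\n"
)
TREE = "((SXS_3014_Albula_vulpes_id_30:2,Other_label:1):1);"


def relabel(tree, csv_file):
    """Replace every ID found in the spreadsheet with a more readable label."""
    for row in csv.DictReader(csv_file):
        # Hypothetical label format; any combination of columns would work.
        new_label = "%s_%s_%s" % (row["Family"], row["Genus"], row["Species"])
        tree = tree.replace(row["ID"], new_label)
    return tree


new_tree = relabel(TREE, io.StringIO(SPREADSHEET))
print(new_tree)
# ((Albulidae_Albula_vulpes:2,Other_label:1):1);
```

Plain string replacement is only safe when no ID is a prefix of another ID (for example id_3 and id_30); in that case, replacing the longest IDs first is one way around it.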

Difficulties found

The spreadsheet was originally in Microsoft Office Excel 2007 format (.xlsx). The conversion to CSV provided by Excel was not good and offered no configuration options. In the end, the conversion provided by the LibreOffice productivity suite was more configurable and its output was easier to read with the csv library.

In the script, the DictReader class proved in the long term to be much more reliable and tolerant of changes in the spreadsheet, as long as the names of the columns remain the same.

P.S. Due to the nature of the original sources for the tree and spreadsheet, I don’t have authorization to publish their complete and original content. The artificial data displayed here is merely illustrative.

GenBank renaming

Published by Silveira on 2012-02-20

DNA inspired sculpture by Charles Jencks. Creative Commons photo by Maria Keays.

What is GenBank?

The GenBank sequence database is a widely used collection of nucleotide sequences and their protein translations. A GenBank sequence record file typically has a .gbk or .gb extension and contains plain text. An example of a GenBank file can be found here.

Filename problem

Although several pieces of metadata are available inside a GenBank record, the name of the file is not always in accordance with its content. This is a potential source of confusion when organizing files and requires additional effort to rename the files according to their content.

Approach using Biopython

The Biopython project is a mature open source international collaboration of volunteer developers, providing Python libraries for a wide range of bioinformatics problems. Among other tools, Biopython includes modules for reading and writing different sequence file formats, including GenBank record files.

Although it is possible to write a parser for GenBank files, developing and maintaining such a tool would be a redundant effort. The parsing can be delegated to Biopython, so that the programming can focus on the renaming mechanism.

Biopython installation on Linux (Ubuntu 11.10) or Apple OS X (Lion)

For both Ubuntu 11.10 and OS X Lion, a modern version of Python already comes out of the box.

On Linux you just need to install the Biopython package. One way to install Biopython on an APT-based distribution such as Ubuntu 11.10 (Oneiric Ocelot) is:

# apt-get install python-biopython

On Apple OS X (Lion) you can install Biopython using easy_install, a popular package manager for Python. easy_install is bundled with Setuptools, a set of tools for Python.

To install Setuptools, download the .egg file for your Python version (probably setuptools-0.6c11-py2.7.egg) and execute it as a shell script:

sudo sh setuptools-0.6c11-py2.7.egg

After this you have easy_install in place and can use it to install the Biopython library:

sudo easy_install -f http://biopython.org/DIST/ biopython

On both operating systems you can test whether Biopython is installed using the Python interactive interpreter:

$ python
Python 2.7.2+ (default, Oct 4 2011, 20:03:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import Bio
>>> Bio.__version__
'1.57'
>>>

Automatic rename example through scripting

Below is the Python source code for a simple use of Biopython: renaming a GenBank file to its description, after removing commas and spaces.

Using the previous example of a GenBank file, suppose you have a file called sequence.gb. To rename this file to the description stored in its GenBank metadata, you can run the script:

python gbkrename.py sequence.gb

After this, the file will be called Hippopotamus_amphibius_mitochondrial_DNA_complete_genome.gbk.
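The renaming logic can be sketched roughly as follows. This is an approximation written for illustration, not the gbkrename.py script itself: the function names are made up, spaces are assumed to become underscores (as the resulting file name above suggests), and a trailing period in the description is dropped as a precaution. Reading the record relies on Biopython's SeqIO.read and the record's description attribute.

```python
import os


def filename_from_description(description, extension=".gbk"):
    """Turn a GenBank description into a file name without commas or spaces."""
    cleaned = description.strip().rstrip(".").replace(",", "")
    return "_".join(cleaned.split()) + extension


def rename_genbank(path):
    """Rename the file at path after the description of the record it holds."""
    from Bio import SeqIO  # third-party: Biopython

    record = SeqIO.read(path, "genbank")  # expects exactly one record in the file
    new_path = os.path.join(os.path.dirname(path),
                            filename_from_description(record.description))
    os.rename(path, new_path)
    return new_path


# The pure naming step can be tried without Biopython or any file on disk.
print(filename_from_description("Hippopotamus amphibius mitochondrial DNA, complete genome"))
```

Keeping the naming step separate from the file handling makes it easy to check the resulting names before anything on disk is touched.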

Improvements

There is plenty of room for improvement, such as:

Better command line parsing with optparse and parameterization of all possible configurations.
A graphical interface.
Handling special cases such as multiple sequences in a single GenBank file.

Python, flatten a list

Published by Silveira on 2011-10-08

Surprisingly, Python doesn’t have a shortcut for flattening a list (more generally, a list of lists of lists of…).

I made a simple implementation that doesn’t use recursion and tries to be written clearly.

I take an element from a “notflat” list (a list that can have other lists in it). If the element is not a list, we store it in our flat list. If the element is still a list, we deal with it later. The flat list only ever holds elements that are not lists.
To preserve the original order we reverse the elements at the end.
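The description above can be sketched as follows. This is a reconstruction from the text rather than the original code, so the names (flatten, pending) are assumptions; it uses a list as a stack instead of recursion and reverses the result at the end.

```python
def flatten(notflat):
    """Return a flat list with every non-list element of notflat, in order."""
    pending = list(notflat)  # copy, so the caller's list is left untouched
    flat = []
    while pending:
        element = pending.pop()  # take the last pending element
        if isinstance(element, list):
            pending.extend(element)  # still a list: deal with its items later
        else:
            flat.append(element)
    # Elements were collected from right to left, so restore the original order.
    flat.reverse()
    return flat


print(flatten([1, [2, [3, 4]], 5]))  # [1, 2, 3, 4, 5]
```

Because nested lists are unpacked onto the stack instead of through recursive calls, deeply nested input does not run into Python's recursion limit.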

Counting Digits in an Interval

Published by Silveira on 2010-01-29

How many zeros are there between one and one thousand?

Questions like this are easier to answer by writing small programs, using the support for functional programming and list comprehensions that some languages, such as Python, offer.

To count the zeros in a number, we convert it to a string and count how many '0' substrings it contains. For example, 800:

str(800).count('0')
# 2

To generate an ordered list with the elements of the interval between one and one thousand, including the values one and one thousand:

xrange(1,1001)
# [1, 2, ... , 999, 1000]

[str(x).count('0') for x in xrange(1,1001)]
# [0, 0, ... , 0, 3]

For example, 1 has no zeros. Neither does 2, nor 999. 1000 has three.

Summing all the elements of the list gives the number of zero digits between one and one thousand.

sum([str(x).count('0') for x in xrange(1,1001)])

The same could be obtained by counting how many zeros there are in the string representation of the list of the interval.

str(range(1,1001)).count('0')

The difference between range and xrange (in Python 2) is that range builds the actual list of the interval in memory, while xrange builds only a representation of that list. In general, but not always, the performance of xrange is better.
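The snippets above are written for Python 2. In Python 3, xrange no longer exists and range itself is lazy, so a version for current interpreters could look like the sketch below (count_digit is a name made up for this example).

```python
def count_digit(digit, start, stop):
    """Count how many times a digit character appears in start..stop, inclusive."""
    # A generator expression avoids building the intermediate list.
    return sum(str(x).count(digit) for x in range(start, stop + 1))


print(count_digit("0", 1, 1000))  # 192
```

Note that in Python 3 the string-representation trick needs an explicit list, as in str(list(range(1, 1001))).count('0'), because str(range(1, 1001)) is just the text 'range(1, 1001)'.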

Easily Sortable Date and Time Representation

Published by Silveira on 2010-01-20

I was looking for a date and time representation useful for registering stock quotes in a simple plain file.

I found that the ISO 8601 standard is just the answer for this. Its full title is “Data elements and interchange formats - Information interchange - Representation of dates and times”. Here is an example:

2010-01-20 22:14:38

There’s a good article by Markus Kuhn, “A summary of the international standard date and time notation”. This notation allows us to order events using simple lexicographical order.

Some examples of how to do this in Python (thanks to Jochen Voss’s article “Date and Time Representation in Python”). The first displays the current date and time:

from time import strftime
print(strftime("%Y-%m-%d %H:%M:%S"))
# 2010-01-20 22:34:22

Another possibility is using the strftime method of a datetime object.

from datetime import datetime
now = datetime.now()
print(now.strftime("%Y-%m-%d %H:%M:%S"))
# 2010-01-20 22:12:31

That’s it. With this notation at the beginning of each line, it is easy to sort the lines in any language or with the Unix sort command.
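A small illustration of why this works: because the fields go from most to least significant and are zero-padded, sorting the lines as plain strings also sorts them chronologically. The quote lines below are made up for the example.

```python
# Hypothetical quote log; the symbol and prices are made up.
lines = [
    "2010-01-20 22:14:38 ACME 10.25",
    "2009-12-31 09:00:00 ACME 9.80",
    "2010-01-20 08:05:12 ACME 10.10",
]

# Plain string sorting is enough: no date parsing is needed.
ordered = sorted(lines)
for line in ordered:
    print(line)
```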