# Tag: CSV

In order to help me to takeÂ decisions about which class to take every semesterÂ I did a web scrappingÂ from the graduate and undergraduate bulletin. For every class I could get classe name, prerequisites, credits, teacher, program, description, etc, in a formated tabular document.

Using Python CSV library I could read the tables and parse the data to other formats. One format very useful to handle graph structures is theÂ DOT language script (included in theÂ GraphvizÂ project), in which you can describe both the graph structure and the elements of the graph layout.

The final result (click to view in full size):

• Prerequisites are only displayed using AND logic. It’s not showing other logics as OR (equivalent classes).
• Errors may exists due to the scrapping process,Â conversions, or in the errors in the original source.
• In the sources there is also a function to convert the graph in Dracula (aÂ JavaScript interactive graph representation) but the current result is too tangled.

The newick tree

The Newick tree format is a way of representing a graph trees with edge lengths using parentheses and commas.

A newick tree example:

(((Espresso:2,(Milk Foam:2,Espresso Macchiato:5,((Steamed Milk:2,Cappucino:2,(Whipped Cream:1,Chocolate Syrup:1,Cafe Mocha:3):5):5,Flat White:2):5):5):1,Coffee arabica:0.1,(Columbian:1.5,((Medium Roast:1,Viennese Roast:3,American Roast:5,Instant Coffee:9):2,Heavy Roast:0.1,French Roast:0.2,European Roast:1):5,Brazilian:0.1):1):1,Americano:10,Water:1);

A graphical representation for the newick tree above (using the http://www.jsphylosvg.com/ library):

TheÂ Newick format is commonly used for storeÂ phylogenetic trees.

The problem

A phylogenetic tree can beÂ highly branched and dense and even using proper visualizationÂ softwareÂ can beÂ difficult to analyse it.Â Additionally, as a tree are produced by a chain of differentÂ software with data from the laboratory,Â the label for eachÂ leaf/node can be something notÂ meaningful for a human reader.

For this particular problem, an example of a node label could be SXS_3014_Albula_vulpes_id_30.

There was a spreadsheetÂ withÂ more meaningful informationÂ where a node label could be used as a primary key. Example for the node above:

Taxon Order Family Genus Species ID
Albuliformes Albulidae Albula vulpes SXS_3014_Albula_vulpes_id_30

The problem consists in using the tree and the spreadsheetÂ to produce a new tree with the same structure, where each node have a moreÂ meaningful label.

The approach

The new tree can be mounted by substituting each label of the initial tree with the respective information from the spreadsheet. A script can be used toÂ automate this process.

The solution

After converting the spreadsheet to a CSV fileÂ that could be more easily handled by a CSV Python libraryÂ the problem is reduced to a file handling and string substitution.Â Fortunately, due the simplicity of the Newick format and its limited vocabulary, a tree parser is not necessary.

Source-code at Github.

Difficulties found

The spreadsheet was originally in a Microsoft Office Excel 2007 (.xlsx) and the conversion to CSV provided by Excel was not good and there was no configuration option available. Finally, the conversion provided by LibreOffice Productivity Suite was more configurable and was easier to read by the CSV library.

In the script, the DictReader class showed in the the long-term much more reliable and tolerant to changes in the spreadsheet as long the names of the columns remain the same.

P.S. due to the nature of the original sources for the tree and spreadsheetÂ I don’t have the authorizationÂ for public publishing their complete and original content. The artificialÂ data displayed here isÂ merely illustrative.