Remodeling Cildb, a popular database for cilia and links for ciliopathies

Background: New generation technologies in cell and molecular biology generate large amounts of data that are hard to exploit for individual proteins. This is particularly true for ciliary and centrosomal research. Cildb is a multi-species knowledgebase gathering high throughput studies, which allows advanced searches to identify proteins involved in centrosome, basal body or cilia biogenesis, composition and function. Combined with the localization of genetic diseases on human chromosomes given by OMIM links, candidate ciliopathy proteins can be compiled through Cildb searches. Methods: Orthology between recent versions of the whole proteomes was computed using Inparanoid, and ciliary high throughput studies were remapped onto these recent versions. Results: Due to the constant evolution of the ciliary and centrosomal field, Cildb has recently been upgraded twice, with new whole proteomes and new ciliary studies, and the latest version displays a novel BioMart interface, much more intuitive than the previous ones. Conclusions: This already popular database is now designed for easier use and is up to date with regard to high throughput ciliary studies.


Background
Whatever the field of biology, the prevalence of new generation technologies makes retrieving relevant information from high throughput studies a major challenge. With this in mind, five years ago we developed Cildb, a knowledgebase for data mining on cilia and ciliopathies (http://cildb.cgm.cnrs-gif.fr/) [1]. Cildb progressively became a reference cilium database, with the number of users now reaching 700 per month. Since its creation and publication [1], Cildb has undergone several modifications and improvements, leading to Version 2.1 in 2010 and now to Version 3.0 in 2014. Although the data in Cildb are raw data processed automatically, so that false positives and false negatives may occur, the results are informative and make searches on ciliary genes easier.
The purpose of this note is fourfold: to remind the reader of the main uses of this database, already described in more detail by Arnaiz et al. [1]; to explain the updates; to describe the new interface; and to evaluate the orthology relationships as calculated in Cildb.
Cildb, a database for ciliary studies… and more
In the early 2000s, high throughput studies started to appear concerning cilia, a re-emerging organelle at that time [2], and centrioles [3], precursors of basal bodies of cilia in metazoans. Such studies generated large amounts of data on cilia, basal body, centriole, and centrosome proteomes, on transcriptome analyses performed under various conditions (ciliogenesis etc.), and on computations derived from comparative genomics between centric organisms (i.e. with cilia/flagella or at least centrioles at some stage of their life cycle) and acentric organisms. Developing a way to browse these data became essential, not only from the statistician's point of view, but also for experimental biologists who want to seek information on individual proteins from the bulk of the results.

Methods
The originality of Cildb lies in its backbone, which relates two kinds of information: on the one hand, a network of orthology between the whole proteomes (complete sets of protein sequences) of all the species taken pairwise, calculated with the Inparanoid algorithm version 4.1 with default parameters [4]; on the other hand, the detection of each protein in a set of ciliary studies [1]. The database therefore allows searches for possible ciliary properties in the whole proteome of one species, e.g. Homo sapiens, based on ciliary properties established by studies conducted in another species, e.g. flagellum proteomics in Chlamydomonas [5]. In addition, the whole human proteome has been linked to the OMIM database (http://www.ncbi.nlm.nih.gov/omim/), which gathers all known human genetic disorders with the corresponding genes. This makes it possible to search for proteins involved in diseases and to display the OMIM description as an attribute in the output of a search. Conversely, searches in the whole proteome of any nonhuman species can tell whether the resulting proteins are orthologous to human proteins linked to human diseases.
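The backbone described above can be pictured as a join between an orthology table and a table of study hits. The following is a minimal sketch under stated assumptions: the identifiers, the study name and the function `transfer_evidence` are invented for illustration, the orthology pairs are given by hand rather than computed by Inparanoid, and the data structures are not Cildb's actual schema.

```python
from collections import defaultdict

# Hypothetical orthology pairs (human protein ID, Chlamydomonas protein ID).
# In Cildb such relationships come from Inparanoid; here they are invented.
ORTHOLOGS = [
    ("HSAP_A", "CREI_1"),
    ("HSAP_B", "CREI_2"),
    ("HSAP_C", "CREI_2"),
]

# Hypothetical study hits: protein ID -> names of studies that detected it.
STUDY_HITS = {
    "CREI_1": {"flagellum_proteome"},
    "CREI_3": {"flagellum_proteome"},
}


def transfer_evidence(orthologs, study_hits):
    """Map each query-species protein to the studies supporting its orthologs."""
    evidence = defaultdict(set)
    for query_id, other_id in orthologs:
        evidence[query_id] |= study_hits.get(other_id, set())
    # Keep only proteins supported by at least one study.
    return {pid: studies for pid, studies in evidence.items() if studies}


result = transfer_evidence(ORTHOLOGS, STUDY_HITS)
print(result)
```

In this toy example only `HSAP_A` inherits ciliary evidence, because its ortholog is the only one detected in the invented study; `CREI_3` is detected but has no human ortholog in the table, so it contributes nothing to the human-centered search.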
In addition to the ciliary properties of proteins, Cildb contains other information such as synonyms, descriptions, molecular weight, isoelectric point, the probability of the presence of a signal peptide or of transmembrane helices, as well as the FASTA sequence. This extra information can be searched for and displayed as properties using Cildb.
Cildb was designed to handle the outputs of high throughput studies. Data from studies dedicated to the function of one or a few specific proteins are not included in Cildb, so that some ciliary proteins may escape Cildb searches if they have not been revealed by high throughput studies.

Results and discussion
What is new in Cildb V3.0?
Since the last version of Cildb, new high throughput ciliary studies have appeared and more model organisms have been used for ciliary studies. We therefore remodeled Cildb to include the proteomes of 44 species in total, comprising 41 eukaryotes and 3 bacteria (http://cildb.cgm.cnrs-gif.fr/v3/cgi/genome_versions; Figure 1), and 66 studies, among which 55 directly concern cilia and 11 concern other, related topics (http://cildb.cgm.cnrs-gif.fr/v3/cgi/ciliary_studies; Table 1). The BLAST server and human GBrowse facilities are maintained in the new version. In addition, a Motif Search tool has been implemented to search proteomes with a sequence motif using the patmatdb program from the EMBOSS package (http://bioweb2.pasteur.fr/docs/EMBOSS/patmatdb.html), based on the pattern format used in the PROSITE database (http://prosite.expasy.org/prosuser.html). For example, an amino acid motif such as MKK[KP]K, in which either K or P can stand at the fourth position, can be queried in the proteome of any species of Cildb.
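To illustrate what such a motif query does, the sketch below translates a simplified subset of the PROSITE pattern syntax into a regular expression and scans a toy proteome. It is a stand-in written for this note, not the patmatdb implementation: the functions `prosite_to_regex` and `find_motif` and the three sequences are hypothetical, and only the basic pattern elements (literal residues, `x`, `[...]`, `{...}`, repetition counts and the `<`/`>` anchors) are handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a Python regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "-":                      # optional separator between positions
            pass
        elif ch == "x":                    # any residue
            out.append(".")
        elif ch == "[":                    # set of allowed residues
            j = pattern.index("]", i)
            out.append("[" + pattern[i + 1:j] + "]")
            i = j
        elif ch == "{":                    # set of forbidden residues
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j] + "]")
            i = j
        elif ch == "(":                    # repetition: (n) or (n,m)
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j
        elif ch == "<":                    # anchor at the N-terminus
            out.append("^")
        elif ch == ">":                    # anchor at the C-terminus
            out.append("$")
        else:                              # literal residue
            out.append(ch)
        i += 1
    return "".join(out)


def find_motif(pattern, proteome):
    """Return {protein ID: [1-based start positions]} for matching proteins."""
    regex = re.compile(prosite_to_regex(pattern))
    hits = {}
    for pid, seq in proteome.items():
        positions = [m.start() + 1 for m in regex.finditer(seq)]
        if positions:
            hits[pid] = positions
    return hits


# Invented sequences, for illustration only.
PROTEOME = {
    "prot1": "MAAMKKKKLLE",
    "prot2": "GGMKKPKTT",
    "prot3": "MKKAKAAA",
}

hits = find_motif("MKK[KP]K", PROTEOME)
print(hits)
```

With the motif from the text, `prot1` and `prot2` match (K and P at the fourth position, respectively), whereas `prot3` carries an A at that position and is not reported. Note that `re.finditer` reports non-overlapping matches only, which may differ from the behavior of a dedicated pattern-matching program.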
Species implemented in Cildb V3.0
Cildb V3.0 now contains the whole proteomes of 41 eukaryotes, among which 32 are centric species. Fifteen of these species were used for the 66 high throughput studies of Cildb. The 17 other species are good models for ciliary experiments although no high throughput study has been published as of yet. Nine eukaryotic acentric species, which lack cilia and centrioles, were also included because they represent 'negative controls' in comparative genomics experiments: two species for which two analyses on spindle pole proteomes are available and seven species without relevant high throughput studies.
Since orthology relationships are a major tool in Cildb, we corrected an inconsistency in the proteome composition of various species. Indeed, the whole proteomes of the species present in Cildb were not homogeneous, some of them including organelle proteomes (mitochondria, chloroplasts) and others not. Organelle proteomes represent a minor part of all the proteins, but since some organellar proteins can be encoded either by nuclear genes or by the organelle, depending on the species, this may influence the orthology calculation in some cases. This issue has been fixed in Cildb V3.0. In addition, to study the origin of organellar proteins, we added the whole proteomes of three bacteria because they are closest to those of mitochondria (Rickettsia prowazekii) and chloroplasts (Synechocystis sp. PCC6803, Chlamydia pneumoniae).
Since the original publication of Cildb [1], the whole proteomes of 26 novel eukaryotic species have been introduced into Cildb. A notable proportion of fungi, eight fungal whole proteomes, are incorporated in Cildb mainly because fungi represent a phylum at a hinge position in the evolution of centric and acentric species.
Figure 1 The species whose whole proteome has been included into Cildb V3.0 are gathered by taxonomy groups, with indication whether they are centric or not and of the number of high throughput studies, ciliary or not, performed in the species.
The criteria for including species in Cildb were 1) species in which high throughput ciliary studies have been performed, 2) species routinely used as models in ciliary studies in general, and 3) centric and acentric species, because the presence/absence of certain proteins may be relevant for the conservation of ciliary proteins through evolution. The case of the Bug22/GTL3/C16orf80 protein, composed of a domain called DUF667 and essential for ciliary motility [6], was carefully examined for the choice of fungi to add to Cildb for comparative genomics. Bug22 is a protein highly conserved in all centric species, be they metazoans, protozoa, plants or fungi, and curiously also highly conserved in the acentric land plants, but absent from the genomes of higher fungi already sequenced at the time of the publication, i.e. acentric ascomycetes [6]. Owing to constant new genome sequencing, novel fungal whole proteomes appeared and the occurrence of Bug22 turned out to be different from what was thought earlier. It is still undetectable in ascomycetes, but is found conserved in the acentric Mortierella verticillata (accession MVEG_01915), and a more divergent Bug22 with a recognizable DUF667 domain is found in several basidiomycetes, represented in Cildb by Laccaria bicolor (accession 598201). This property was one of the reasons to include those two fungal proteomes in Cildb V3.0. It also emphasizes that the constant arrival of new knowledge as new genomes are sequenced can call into question former assumptions such as the absence of particular proteins in some species, here Bug22 in fungi.
The 66 studies incorporated in Cildb V3.0 mainly consist of high throughput proteomics, differential expression, and comparative genomics studies. 53 of these studies address ciliary and centriolar/basal body components, structure, function or biogenesis. We also integrated 13 studies concerning related topics, such as microtubule-associated proteins, spindle proteins, spindle pole bodies, nuclear-associated bodies, whole sperm proteome, and others. Compared to Cildb V1.0, 45 novel studies have been introduced in Cildb. High throughput studies concerning cilia appear monthly in the literature, but adding a study requires full recalculation of the database, so that Cildb cannot be updated each time. However, if the output of a study not present in Cildb has to be compared to a study already present, this can be done using the keyword box in the general properties filter, by querying a list of gene or protein IDs bordered by '%', one per line. The limitation is that the query is slow, since this is not the main task for which BioMart queries are designed.
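Preparing such a list by hand is error-prone when a study reports hundreds of identifiers. The following sketch shows one way to build the text to paste into the keyword box. The helper `to_keyword_query` is hypothetical, and dropping blank entries and duplicates is an assumption made here for convenience, not a documented requirement of Cildb; the two identifiers are the human Azi1 and cdk5rap2 protein IDs mentioned in this note.

```python
def to_keyword_query(ids):
    """Wrap each distinct, non-empty ID in '%' and return one ID per line."""
    seen = set()
    lines = []
    for raw in ids:
        pid = raw.strip()
        if pid and pid not in seen:   # skip blanks and duplicates (assumption)
            seen.add(pid)
            lines.append(f"%{pid}%")
    return "\n".join(lines)


# IDs as they might come out of a supplementary table, with stray whitespace,
# an empty cell and a repeated entry.
raw_ids = ["ENSP00000393583 ", "", "ENSP00000343818", "ENSP00000393583"]

query = to_keyword_query(raw_ids)
print(query)
```

The printed text can then be pasted into the keyword box; for long lists, splitting the query into several smaller batches may help with the slowness mentioned above.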
Simplified interface and structure for Cildb V3.0
For users trained with previous versions of Cildb, the most prominent change is the new interface, which takes advantage of the novel environment provided by BioMart Version 9 [58] (Figure 2). As a consequence, making an advanced search becomes much more intuitive than before, even for untrained users, who can easily access the functionalities of the database.
The simplification of the interface is accompanied by a simplification of the structure of the database. First of all, the orthology calculation now relies exclusively on Inparanoid [4]. Formerly, users could choose between Inparanoid and Inparanoid plus 'in house' filtered BLAST hits. The most recent version of Inparanoid appears efficient enough to avoid the many false negatives that occurred with the previous versions, so that the addition of 'in house' filtered BLAST hits is no longer necessary, as detailed in the next section and in the legend of Table 2. We also simplified the way to filter ciliary studies and removed other, less useful searches (operator 'OR', customized searches). However, the functions removed from the query menu compared to previous Cildb versions can still be reproduced by downloading the data as tables with the relevant attributes and then sorting these tables in spreadsheet software.
The changes brought to Cildb may have unexpected impacts, and we would be grateful for any feedback from users. In addition, since genome annotations evolve with time, proteins can be gained or lost in the deduced proteomes from one release to the next. For all these reasons, we kept the former "data freeze" versions of Cildb available through the "Version" menu for comparison when necessary.
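As an alternative to a spreadsheet, the removed 'OR' operator can be emulated with a few lines of code applied to a downloaded table. The sketch below is only an illustration: the column names (`protein_id`, `study_A`, `study_B`), the 'yes'/'no' cell values and the tab-separated layout are assumptions, and an actual Cildb download may be organized differently.

```python
import csv
import io

# Hypothetical excerpt of a downloaded table; column names are invented.
DOWNLOAD = """protein_id\tstudy_A\tstudy_B
P1\tyes\tno
P2\tno\tno
P3\tno\tyes
"""


def rows_matching_any(tsv_text, columns, wanted="yes"):
    """Return the IDs of rows where at least one listed column equals `wanted`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["protein_id"]
        for row in reader
        if any(row[col] == wanted for col in columns)   # logical OR over studies
    ]


selected = rows_matching_any(DOWNLOAD, ["study_A", "study_B"])
print(selected)
```

Replacing `any` by `all` gives the corresponding 'AND' selection, so the same downloaded table can serve for both kinds of comparison.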

Evolutionary conservation viewed through Cildb, the example of centrosomal proteins
To evaluate the identification of orthologs by Inparanoid, called 'inparalogs', we studied centrosomal proteins in more detail, since they are conserved proteins that are already well characterized. We asked whether centrosomal proteins identified in three studies in Homo sapiens would reveal the orthologs, when they exist, in other species. We used the following protocol: We chose to emphasize the orthologs in Mus musculus, Rattus norvegicus, Danio rerio, Apis mellifera and Drosophila melanogaster in the output to follow the evolutionary conservation, as viewed with Inparanoid. Among the 113 human proteins encoded by 77 genes found as centrosomal by this filter, inparalogs were detected for 76 genes in mouse, 75 in rat, 68 genes in fish, 37 genes in bee and 33 genes in fly (Table 2). A vast majority of these proteins were identified in mammals, as well as in fish, a vertebrate. More negative examples were found in the insects bee and fly. To check whether homologues were indeed absent when no Inparalogs were found, we performed BLAST searches on individual species proteomes using the Cildb BLAST. Except for the two cases discussed in the legend of Table 2, every absence of Inparalogs corresponds to no or weak BLAST hit detection. In addition, none of the BLAST targets were found in the previous version of Cildb as filtered best hits, a calculation method that we suppressed in the present version. Altogether, although reciprocal BLAST searches are always useful to study the occurrence of individual proteins in various species, the orthology calculation via Inparanoid is well suited for batch identification of conserved proteins using Cildb.
Figure 2 An advanced search on Cildb V3.0 is started by clicking on the 'Search' button on the top row on the right. Then, it is necessary to choose the species whose proteome is to be searched. The filter window then appears to adjust the filters in the left panel (no filter means that the full proteome will be retrieved). Similarly, the output window allows displaying particular properties (attributes) in columns for each filtered protein. A summary on the right reminds the user of all the filters and attributes currently used. This also allows direct modification of the order of the columns in the output by moving the attributes up and down in the list. The last operation of the process is to show the results. The results are given in pages of 20 items with a maximum of 1000 items. To see all results, they have to be downloaded as a file. At any time, if the result output seems incomplete or inappropriate, the filters and attributes can be modified by using the 'Back' button (edit results) to refine the search and show the results again. The quick search allows a rapid search by keywords. The result can be processed in the same way as described above, with the possibility to add attributes by 'Edit results' and to download the file. Note the direct access to BLAST, Human genome Gbrowse, Motif search, Help and older Versions of Cildb through the top row buttons on the right.
Table 2 This table presents the list of 77 human proteins obtained from a BioMart search described in the text. The output gives a total of 133 proteins encoded by 77 genes, due to the presence of splice variants. For clarity, only one protein ID per gene is presented in the table, after verification that all the splice variants of each gene display the same orthology relationships with the species presented here. This table illustrates evolutionary conservation, where a "yes" indicates that the human protein has an Inparalog in Cildb and a "no" that no Inparanoid orthology was found. The column 'class' serves to order the output genes in the table (from 5× 'yes' at the top to much fewer 'yes' at the bottom, along criteria of certain species being closer to each other than others), whereby the order from left to right goes human-mouse-rat (mammals), then fish (vertebrate), then bee and fly (insects). All instances of lacking orthology ("no") were individually verified by BLAST searches using the Cildb BLAST. The BLAST results were consistent with the absence of orthologs in the species, and only three exceptions contradict the Inparanoid results, highlighted as bold characters in the table.
1-Human Azi1 (ENSP00000393583) has no inparalog in Drosophila although an ortholog called dilatory exists. A BLAST search on the Drosophila genome indeed lights up dilatory, with a score very close to the one found for the Apis inparalogs by BLAST. The difference between these outputs may result from the default thresholds used by the Inparanoid program and from the different lengths of the proteins.
2-Human cdk5rap2 (ENSP00000343818) has no Inparalog in Apis, although homologs are found by BLAST. The top three Apis proteins in the list (XP_006563202.1, XP_006563201.1, XP_392107.3) appear to be Inparalogs of Drosophila centrosomin (cnn, cdk5rap2), for which 8 of 12 splice variant proteins display human Inparalogs. However, no direct Inparanoid relationships exist between the Apis proteins and any human protein.
3-Human dynactin/dctn1 (ENSP00000384844) surprisingly has no Inparalogs in mouse and rat whereas some are found in fish, bee and fly. However, mouse and rat homologs are easily found by BLAST search. After careful examination, it appears that ENSP00000384844, the only dynactin protein common to the three human centrosomal studies, is one of the splice variants excluded from Inparalog groups. Indeed, the 16 splice variants for the human dynactin gene ENSG00000204843 and the seven splice variants for its mouse counterpart ENSMUSG00000031865 are related by Inparanoid orthology through three groups, hsap_mmus.17187 (one human and one mouse gene), hsap_mmus.1073 (four human and one mouse gene) and hsap_mmus.977 (one human and two mouse genes). The remaining ten human protein variants (among which is ENSP00000384844) and three mouse protein variants encoded by these genes are not included in the orthology groups, probably because their exon composition was too different from that of the other protein variants.
These three examples represent the limits of Inparanoid orthology prediction, highlighting the fact that reciprocal BLAST searches cannot be avoided and thus represent an important complementary approach for the analysis of individual proteins.
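The kind of summary given for Table 2 (number of genes with an Inparalog per species, and ordering of genes from the most to the least conserved) can be reproduced from any downloaded yes/no table. The sketch below uses four invented rows, not values from Table 2, and the sorting rule is only one plausible reading of the 'class' column: genes are ordered by their yes/no pattern read from left to right, with 'yes' ranked first.

```python
# Species columns in the left-to-right order used in the table legend.
SPECIES = ["mouse", "rat", "fish", "bee", "fly"]

# Invented rows for illustration; these are not values from Table 2.
TABLE = {
    "geneA": ["yes", "yes", "yes", "yes", "yes"],
    "geneB": ["yes", "yes", "yes", "no", "no"],
    "geneC": ["yes", "yes", "no", "no", "no"],
    "geneD": ["yes", "yes", "yes", "yes", "no"],
}


def count_conserved(table):
    """Count, for each species, the genes that have an Inparalog ('yes')."""
    return {
        species: sum(row[i] == "yes" for row in table.values())
        for i, species in enumerate(SPECIES)
    }


def order_by_conservation(table):
    """Order genes by their yes/no pattern, most conserved first (assumed rule)."""
    # False sorts before True, so a 'yes' in a left-hand column ranks higher.
    return sorted(table, key=lambda gene: [value != "yes" for value in table[gene]])


counts = count_conserved(TABLE)
ordered = order_by_conservation(TABLE)
print(counts)
print(ordered)
```

Every 'no' in such a table remains a candidate for an individual BLAST check, as done for the three exceptions discussed in the legend.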

Conclusion
Version 3.0 of Cildb preserves its major original principle of relating orthology to ciliary studies but, by improving its structure and its interface, makes the database more suitable for advanced searches. Altogether, Cildb V3.0 is a particularly useful tool for unraveling ciliary and ciliopathy networks and will hopefully help in the identification of new orphan diseases.