You may never have heard of a bioinformatician, but there is big demand for them.

Today, while scanning Twitter, I came across two posts relating to the demand for data science/analytical/programming jobs in both academia and industry.

The first tweet was from the Royal Society and Royal Statistical Society promoting a report on the need for STEM (science, technology, engineering and maths) skills in the workforce. It is the product of a conference which brought together academics, the Government’s Chief Scientific Adviser and senior representatives from BT, Amazon and Jaguar Land Rover, all united by their need for computer-literate and numerate graduates. They estimate that up to 58,000 data science jobs are created each year, and a large number of these positions go unfilled because of a lack of suitable candidates. In industry there is demand to model data to make predictions and decisions on what trends to follow, and demand to visualize this data in a way that allows those without such strong numerical skills to make sense of it. Employers need people who can communicate effectively what they are doing and think creatively about what further information they can extract from the data to improve the commercial side of the business. It is a worthwhile read for anyone wondering where their mathematics or computer science training might take them.

The second was a tweet from Barefoot Computing stating that in the UK we are losing £63bn from GDP because we don’t have the appropriate technical or digital skills. I don’t know where the statistic comes from, but Barefoot are using it to encourage confidence and enthusiasm in the teaching of the primary school computer science curriculum, which is their underlying ethos.

Both of these reiterated to me the demand there is, and will be, in a wide variety of disciplines for individuals with strong mathematical or computer science skill sets. So if you are considering your career options, or know someone who is, encourage them to pursue a mathematics or computer science degree, as this demand will just keep on growing.


Diversity brings value; how might we get there?

Below is a copy of a post I wrote for the Software Sustainability Institute.

This post summarises a discussion with Lawrence Hudson, Roberto Murcio, Penny Andrew and Robin Long as part of the Fellow Selection Day 2017.

The question of how to improve diversity is suitably broad and vague to initially induce silence in a group, but eventually, true to its name, it promotes a wide-ranging discussion. Sometimes the task is divided up to target particular under-represented groups, as it starts to become a bit of a minefield to develop a scheme that improves diversity in general. What opens the door to some parts of society can simultaneously close the doors to others. Hackathon events are a common and successful method of attracting young people to computer science; however, if they take place over the weekend and are marketed as providing beer and pizza for sustenance, you start to exclude anyone with caring responsibilities or discourage anyone who doesn’t drink.

Before we can think about trying to improve diversity, it is helpful to consider what exactly we mean by it and what the benefits are. It is easy to see how a varied workforce can lead to a larger pool of ideas, skills, and experience, as well as a more harmonious environment where differences are embraced, minimising direct comparisons and competition between colleagues. It can also lead to broader outreach, whether exposing your product or brand, attracting new audiences, or inspiring the next generation. Diversity is often quantified in demographics (gender, age, religion, sexual orientation etc.); however, in a working environment it should also include background and previous experience. Your team may be culturally diverse, but if you have all got the same degree qualification from the same institution, trained in the same school of thought, where are the new ideas going to come from?

To improve diversity, it is important to recognise where the variety is lost. Was there a great selection of applicants from different backgrounds that got filtered out before the interview stage? If you can identify which factors caused the potentially diverse new team members to be excluded from consideration, this can be used to formulate more open criteria. For example, in a university, ranking candidates on number of publications tends to favour men over women; focusing on quality over quantity may reduce this bias. With this in mind, developing a range of metrics of equal merit rather than focusing on a single criterion will also favour a broader range of applicants. This may mean moving away from the standard template for job descriptions, which requires some time and effort on the part of the employer, but using a structure that allows potential employees to be creative with how they might meet the criteria creates opportunities for those with less traditional career paths.

Institutions can play their part by celebrating and promoting successes at all levels, as purely focusing on the achievements of the most senior employees often reinforces existing typecasts. In academia, there is a lot of truth in the stereotype of Professors as white, male and middle aged, so only covering the publications, media appearances and grant money brought in by these individuals may deter anyone who could never fit this demographic. Alternatively, publicising both the work (software developed, new recruits, promotions) and personal achievements (charity events, sporting triumphs or bake sales) of all members of staff starts to showcase the variety underlying the workforce and may inspire a broader scope of applicants. Active involvement in the wider community raises the profile of an organisation and generates positive feelings towards it. Having a creative recruitment and outreach strategy, with roles such as community managers, public engagement officers and more tailored positions such as artists in residence, can promote a welcoming environment and reach previously untapped employment streams.

Employers need to be open and flexible to new ways of working in order to appeal to a more varied pool of applicants. While many employers recognise the value of diversity and would always embrace a broad range of applicants to choose from, when it comes to the final decision, it can take a brave individual to select a candidate that differs from their usual employee. Pressure to hit the ground running creates barriers for individuals with great potential but who require a little more training or time to adjust to a new environment.

With increasing variety in backgrounds, training opportunities and career paths, the diversity we know will benefit us is continually expanding in the working population. A more open, flexible recruitment strategy will provide the opportunities for those looking for a change, both for employers and employees. However, diversity cannot be enforced. For the benefits to be realised, it needs to be an organic experience where the individuals involved recognise its value.

Bringing together statistics, genetics and genealogy

In this post I want to highlight a recent genetic study, published this week in Nature Communications, which uses genetic data alongside databases of family history to characterize the current population of the US and understand how it came to be.

Their starting point was a very large genetic data set of 774,516 people currently residing in the US, the majority of whom were also born there, with measurements at 709,358 different genetic positions.

They compared the genetic profiles of all pairs of individuals to identify regions of the genome (of a certain size) shared by both individuals, consistent with those two individuals having a common ancestor. It is important to note that this is very unlikely to be the case between two randomly selected or even two distantly related individuals. Therefore this study was only possible because they had accumulated such a large genetic data set, meaning they had enough pairs of individuals with such a genomic region in common to make any inferences. Based on this information they produce a plot of US states, where the distance between points represents the similarity in common ancestry between individuals born in those states, which closely resembles a geographical map of the US. What it means is that, in general, the closer together two individuals live, the closer their ancestry is likely to be. This isn’t hard to believe, and has been shown before; for example, similar studies in European populations have produced similar figures in the past.

The aim of the study was to divide the sample up into groups, referred to as clusters, of individuals whose genetic data implied common ancestry and which represented the substructure of the US population. What is perhaps novel to this study is the inclusion of information from participants relating to when and where their relatives were born, used to interpret the origins and migratory patterns of each cluster. All of this is then discussed in the context of known immigration and migration patterns in recent times (roughly the last 500 years).

A few things struck me about this article. Firstly, the data was taken from a genetic genealogy service, AncestryDNA, who use a saliva sample to genetically profile customers and generate statistics on their ancestry. Their analytical sample size was 774,516 individuals of US origin who provided consent for their data to be included in genetics research, demonstrating how interested the general population potentially is in the information that their genome holds. What’s more, these individuals are also keen for it to be used to improve our understanding of how genetics influences health and disease.

Secondly, the authors used network analysis to identify their clusters of individuals with common ancestry. The article is littered with mathematical terminology (“principal components”, “weight function”, “hierarchical clustering”, “spectral dimensionality reduction technique”), demonstrating not only the utility of statistics in genetics but also its application to supplementing our knowledge of modern history.

Thirdly, they make use of a range of large data sets (multiple genetic data sets and genealogy databases). This is increasingly necessary in genetics research in order to interpret findings and draw conclusions, making this a nice demonstration of how to think about incorporating additional sources of information (like a historian would) in order to contextualize your results.

Finally, if nothing else, this research serves as a timely reminder of the broad roots and origins of the current residents of the USA and how they came to be there.

Let’s test all the genes!

In this blog post (and others to follow) I want to give some examples of how statistics and statisticians have helped advance genetics research.

Most genetic studies these days consider and measure variation across all 20,000 human genes simultaneously. This is a great advance, as it means we can set aside all the old biological theories we had based previous research on but had as yet found no concrete support for. This is the basis of a genome-wide association study, often shortened to GWAS. GWAS are often referred to as a hypothesis-free approach. Technically, they are not completely hypothesis-free, as to do any statistics we need a hypothesis to test. They work on the hypothesis that the disease of interest has genetic risk factors; however, we don’t need to have any idea which gene or genes may be involved before we start. This means we may find a completely new gene or novel biological process which could revolutionize our understanding of a particular disease. Hence, they brought great promise, and new insight, to contemporary genetics research.

So when it comes to doing the statistical analysis for our GWAS, we are essentially performing the same mathematical routine over and over again for each genetic variant in turn. This procedure is automated by computer programmes designed to do this efficiently. At the end we have a vast table of summary statistics to draw our conclusions from (as a gene will have multiple genetic variants across it, this can contain hundreds of thousands or even millions of rows). One highly important number for each site is the p-value from each statistical test, which we can use to rank our table of results. There is no plausible way in which we can apply the standard checks of the individual statistical tests that a mathematician may typically have been taught to do (i.e. do the data meet the assumptions?) to every single genetic variant that we have tested. Instead we often look at the distribution of p-values across all the tests, generally using a Q-Q plot to compare the expected quantiles to the observed quantiles, to decide if there is major bias or any confounders affecting the results. Once happy in general, we can look at which genetic variants are significantly associated with our disease of interest.
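As a rough illustration, here is a minimal sketch (in Python with numpy/scipy, though the same check is easy in R) of the numerical heart of that Q-Q check: simulate many null tests, sort the observed p-values, and compare them to the uniform quantiles you would expect. The data here are made up purely for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate many null tests: two groups drawn from the same distribution,
# so any 'significant' p-value is purely a chance finding.
n_tests = 5_000
pvals = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(n_tests)
])

# Under the null, p-values are uniform on [0, 1], so the sorted (observed)
# p-values should closely track the expected uniform quantiles; this is
# exactly what a Q-Q plot displays. A large systematic deviation would
# suggest bias or confounding in the results.
expected = (np.arange(1, n_tests + 1) - 0.5) / n_tests
observed = np.sort(pvals)

max_deviation = np.max(np.abs(observed - expected))
print(f"Maximum deviation from uniform quantiles: {max_deviation:.3f}")
```

In a real GWAS the p-values come from the association tests themselves rather than a simulation, but the comparison against uniform quantiles is the same.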

With a number of computer software tools, it can be fairly straightforward to plug in the numbers and perform the required statistical test. The challenge is often the interpretation or drawing of conclusions, in particular when it comes to the p-value. This is made harder by the fact that most statistical training courses make the rather unrealistic assumption that you will only ever do one statistical test at a time, and teach you how to apply a significance threshold in that scenario. This knowledge is then taken forward and merrily applied in exactly the same manner to all statistical tests performed from that point on.

However, there is a potential trap.

When you perform multiple tests, you increase your chances of getting a significant finding even if there are no true associations. For example, let’s assume that there is no association between eating fruit and time spent watching TV. But to be 100% sure, we have found a group of people to ask about their TV watching habits and how many apples, bananas, oranges, strawberries, kiwis, melons, pears, blueberries, mangoes and plums they eat each week, and then we test each one of these ten different fruits individually. At a 10% significance level (i.e. p-value < 0.1) we would expect 0.1 x 10 = 1 test to identify a significant finding, which would be a false positive. The more things we test, the more we increase our chances of finding a significant association, even where none exists. This is called ‘multiple testing’, or ‘multiple comparisons’.
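You can see this happen in a quick simulation. Below is a toy Python sketch of the fruit example: all the numbers are invented, and no fruit is truly associated with TV time, yet on average about one of the ten tests per "study" comes out significant at the 10% level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy version of the fruit example: ten fruits, none truly associated with
# TV time, each tested at a 10% significance level.
n_people, n_fruits, alpha = 200, 10, 0.1
n_repeats = 500

false_positives = []
for _ in range(n_repeats):
    tv_hours = rng.normal(10, 3, n_people)
    # Each fruit's weekly count is generated independently of TV time,
    # so every 'significant' correlation is a false positive.
    fruit_counts = rng.poisson(4, size=(n_fruits, n_people))
    n_sig = sum(
        stats.pearsonr(tv_hours, counts)[1] < alpha
        for counts in fruit_counts
    )
    false_positives.append(n_sig)

# We expect about alpha * n_fruits = 1 chance finding per 'study'.
print(f"Mean false positives per study: {np.mean(false_positives):.2f}")
```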

This knowledge is crucial for correctly interpreting the results of a GWAS. Say we have tested 500,000 genetic variants: even if none of them were truly associated, at a significance threshold of P < 0.05 we would expect 500000 x 0.05 = 25000 associations! That is (potentially) a rather hefty number of false positives (associations you report as true but which are in fact false). To prevent this, we need to adjust our significance threshold to account for the number of tests we have performed, minimizing our chances of incorrectly reporting a false positive. Multiple methodologies have been proposed to resolve this issue, and this is one example of where statistics plays an important role in genetic research.
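The arithmetic above, plus one of those proposed fixes, fits in a few lines. The Bonferroni correction shown here is the simplest of the methodologies alluded to (the source doesn't name which ones the field uses, so take this as an illustrative choice):

```python
# Expected chance findings at GWAS scale, and the Bonferroni correction:
# the simplest of the many proposed fixes for multiple testing.
n_tests = 500_000
alpha = 0.05

# Expected number of chance 'associations' if no variant is truly associated.
expected_false_positives = n_tests * alpha
print(expected_false_positives)  # 25000.0

# Bonferroni divides the threshold by the number of tests, so the chance of
# even one false positive across the whole study stays near alpha.
bonferroni_threshold = alpha / n_tests
print(bonferroni_threshold)  # 1e-07
```

Other approaches, such as false discovery rate control, trade some of this stringency for more power, which is why the choice of correction is itself a statistical question.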

What’s more, given the high probability of chance findings in GWAS, there is a common consensus that all findings, even if they withstand the appropriate control for the number of genetic variants tested, must be replicated before they are ‘believed’ or taken seriously. Replication means repeating the whole GWAS process in a completely separate sample. So that’s more work for the statisticians then!

If you are interested in this topic you may enjoy this cartoon, which offers an alternative (comical) solution.


Most bioinformatician posts are based in Universities and often require a PhD.

I did my PhD because it helped me get the job I wanted, so I see them as the graduate training scheme for working in scientific or medical research.

I didn’t really know what a PhD was until I did mine, so I admit I went into it slightly clueless and picked up what it was I was supposed to be doing as I went along. This blog post should help anyone thinking about a PhD but not 100% sure what exactly that means.

Firstly, some technical points:

  • The end point of your PhD is a written thesis (~50,000-80,000 words) of original research conducted by you, which you have to defend (i.e. convince two examiners that you did the work and understand it) at a viva.
  • They can be started straight after an undergraduate degree, but if your degree wasn’t in a relevant subject, you didn’t do as well as you had hoped, or you are not 100% sure whether 3+ years in a research environment is for you, a Masters programme in between may be appropriate.
  • You will be assigned to a supervisor, or more likely multiple supervisors, who will guide you and are responsible for ensuring you get your PhD.

Every PhD is a unique experience; however, there are many commonalities. They are designed to be challenging, primarily educationally but also personally. The idea is to study something novel, so as you follow the road not previously travelled, it is inevitable that there will be problems or challenges along the way where the answer is not obvious. For some problems there may be no existing solution (and part of your research is to develop the answer), or there may be multiple paths and you have to decide which one to take.

So why do it? It is a great addition to your CV, even if you don’t see yourself staying in academia. You are recognized as an expert in your (perhaps niche) field of study, and have demonstrated the ability to manage and complete a project over a specific period of time. It is perhaps underappreciated how hard they can be to finish, as while the broader research project may go on, the PhD is finite and has a clearly defined end goal. Depending on the personalities involved, it can be either the student or the supervisor who struggles to make the distinction between the end of the PhD and the end of the research project. Ultimately, as the student you have to have the tenacity to put in the work to meet the requirements and achieve the degree.

Essentially a PhD should be seen as an opportunity. You are a student (and in the UK paid a stipend to support your living costs, not a salary) and therefore should take advantage: learn as many skills and go on as many courses as possible, even if they are not directly relevant (think of it as personal development), and generally maximize the opportunities presented. You should also get the chance to present your work outside of your day-to-day environment at conferences, so depending on a) your consumables budget and b) the reach of your research, you may get some chances to travel all over the world. As an informatician, my only real expense was a computer, so the rest of my budget enabled me to do a lot of travelling compared to fellow students who had expensive experiments to fund. There are lots of funding opportunities available to PhD students for travel, so even if your scheme doesn’t have much money available for this, you should still be able to identify sources of money to help.

Communication skills are very important and will inevitably be developed throughout your time as a PhD student. You will need to be able to communicate effectively with your supervisor and colleagues, to put together your thesis, to present your research internally and externally as a talk or poster, and finally to explain and answer questions about your work in your viva. You shouldn’t be afraid to disagree or follow your own intuition, but it helps if you can explain why.

Ultimately you need to be self-motivated, resourceful, and open to new experiences. You will learn a lot about your area of study, yourself and how research/academia works. It can be highly rewarding and set you up with a range of skills applicable to many careers.

If you would like to read more, take a look at this blog post which may be particularly relevant if you are based in the US.


DNA as a solution to computer storage

In this blog post, I wanted to draw your attention to an article published in Nature at the beginning of the week.

Essentially, it documents how a group of bioinformaticians have turned the seemingly ridiculous idea of storing data in DNA into a plausible option.

It is quite ironic really, as one of the major hurdles of sequencing the genome is how and where to store the ~1.5 Gb worth of As, Cs, Ts and Gs. We have now turned this on its head with the realization that DNA is such an efficient way of compacting lots of information into a small space that maybe we should take advantage of it. It is a neat example of how nature has a solution for a very modern problem.
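To see why DNA is such a dense medium, here is a toy Python sketch that packs two bits of data into each base (A=00, C=01, G=10, T=11). This is only an illustration of the packing idea; the scheme in the actual paper adds error correction and avoids biochemically awkward sequences, none of which is modelled here.

```python
# Toy sketch of storing bytes as DNA: two bits per base.
BASES = "ACGT"  # index doubles as the 2-bit value: A=0b00 ... T=0b11

def bytes_to_dna(data: bytes) -> str:
    # Each byte becomes four bases, most significant bit pair first.
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )

def dna_to_bytes(dna: str) -> bytes:
    # Reassemble each group of four bases back into one byte.
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)

message = b"DNA"
encoded = bytes_to_dna(message)
print(encoded)  # CACACATGCAAC
assert dna_to_bytes(encoded) == message
```

At four bases per byte, a gram of DNA holds an enormous number of such "letters", which is exactly the density the article is excited about.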

This article also highlights to me the computational challenge genetics faces in managing and processing data efficiently. Once we were able to sequence the genome, the technology continued to develop to do it faster, cheaper and more accurately, dramatically increasing the data output. Alongside this we need the software to keep up with, or even stay ahead of, the technologies.

Some software is developed by researchers as a necessity to get their projects done, but many companies will develop programs in parallel with developing sequencing machines in order to offer the complete package to consumers. This means that if you are not interested in the academic lifestyle or a career in scientific research, there are plenty of alternative opportunities in industry.

It means you will be working not only at the cutting edge of genomics technology but also, as the article highlights, at the cutting edge of computational solutions.

Getting started with pRogramming

With the new term fast approaching we will soon have a number of new students.

Whilst they will come expecting to improve their wet-lab skills by spending time in the lab learning new techniques and performing experiments, they may not be aware of the amount of time they will spend in front of the computer, programming to analyse the data they generate. Fortunately many biosciences undergraduate courses teach some statistics and, increasingly, a little bioinformatics, so the students don’t arrive completely unprepared.

With this in mind, I thought I’d run through some of the ways you can go about learning to program, specifically focusing on the statistical language R.

I always like to reiterate that all my programming skills are self-taught, and I highly doubt that I am the only one. Practically, it is a very hard skill to teach in a classroom setting, and the best way to learn is to get stuck in. Trial and error through experience is how you will progress, so it can be very attractive to employers or course providers if you have already had a go. It shows enthusiasm, a go-getting attitude and forward thinking, particularly if it is off the back of your own initiative.

When I started, I downloaded the software (freely available) and worked through the ‘Introduction to R’ manual. This is a very dry way to go about it – and I will acknowledge I did not make it to the end of the document. However, it helped me understand some of the basic principles of variables and functions. From there I was able (with the help of Google) to develop code to achieve the statistical tasks I needed to.

Since then I have discovered a number of online tutorials, which provide an interactive environment with hints and tips to make the process more successful and hopefully more enjoyable. In particular, DataCamp (again free) has been highly praised by colleagues starting out on their programming journey. It is designed for beginners, so is appropriate for any age, stage of education, or purpose.

I have recently tried it out with some work experience students, who really enjoyed the experience. Programming can seem intimidating: not knowing where to start, fearing you’ll break the computer or delete something important, being unsure what exactly it can be used to do. These online aids remove many of these worries, and are a great option if you think you may be interested in a career involving programming but don’t know how you’ll get on.

In fact I’d encourage everyone to have a go, it is more accessible than you think. You never know what you are capable of until you try and it may even help you decide what career path you wish to follow.

Over the next academic year you may be faced with decisions about what to do next: which subjects to study at GCSE or A-level, whether to look for a job or continue with your studies, which universities to apply to and what courses to do, what job or career path to follow? Or perhaps you just fancy learning a new skill that may lead in a new direction. Try out some programming; it may open some doors you didn’t know existed, just like it did for me!

Good luck!



You can’t do that!

I have previously discussed what I feel is the disconnect between taught statistics and the reality of being a statistician. Part of this is that the hard and fast rules are not always obeyed by the data you are working with. This can lead to a state of paralysis, either through confusion about what to do next or refusal to use any of the standard approaches.


Unfortunately though, I am paid to do data analysis. I am expected to present results, not a list of reasons why none of the tests I know were appropriate. Now I am not advocating that all the assumptions are there to be ignored, but sometimes you just have to give something a go, get stuck in and see how far you can bend the rules. For something like a regression model, some of the assumptions relate to the fitted model itself. For example, you can’t check whether the residuals are normally distributed until you have fit the model. Therefore you have to do the calculations and generate the results before you know if the model is appropriate.
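That fit-first, check-second order can be sketched in a few lines (Python with scipy here, on made-up data; in R the equivalent would be `lm()` followed by a residual check):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Made-up data: a genuine linear relationship plus normal noise.
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 100)

# Step 1: fit the model. You cannot inspect the residuals before this.
fit = stats.linregress(x, y)
residuals = y - (fit.slope * x + fit.intercept)

# Step 2: only now can the normality-of-residuals assumption be checked,
# e.g. with a Shapiro-Wilk test (a small p-value would flag a violation).
shapiro_p = stats.shapiro(residuals).pvalue
print(f"fitted slope: {fit.slope:.2f}, residual normality p: {shapiro_p:.3f}")
```

If the check fails, the fitted results have still told you something: which assumption broke, and therefore which alternative model to try next.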


A big part of statistical analysis is ensuring the robustness of your results. In other words, are they a fluke? Is there another variable you haven’t factored in? I find visualization helpful here: can you see any outliers that change the result if you were to take them out? Is there one particular group of samples that is driving your association? Is your sample size large enough that you can pick up subtle but noisy relationships? Does it hold true in males and females? Old and young? Essentially you are trying to break your finding.


In genomics, large datasets with measurements at thousands of markers for hundreds or thousands of individuals often mean repeating the same test for each marker. Doing statistics at this intensity makes it implausible to check the robustness of every single test. To prevent serious violations, fairly stringent filtering is applied to each of the markers prior to analysis. But the main approach to avoiding false positives is to try to replicate your findings in an independent dataset.


Often performing the analysis is quite quick: it’s checking and believing that it’s true that takes the time.

One aspect of my job I never really expected to be involved with is study design. Early in my career I worked with publicly available data, so had virtually no insight into how the experiment had been run or any experience of what might actually happen whilst working in a lab.

The nice thing about being in a group that generates a lot of data is the opportunity to be involved at the conception of an idea and have an input in how the project proceeds. I can imagine this may make some data analysts rather jealous as they get handed a dataset and a question to answer with no obvious link between the two, or a technical flaw that scuppers the proposed analysis.

There is no such thing as the perfect experiment. There are so many variables that may influence the outcome, either grossly or subtly: quality of the sample going in, temperature, batch of reagents, the individual(s) doing the experiment, day of the week, time of day; the list is endless. In larger studies you will inevitably need to perform the experiment multiple times over days, weeks or months. This will lead to batch effects. I don’t like the word ‘batch’ as I think it is used very loosely to cover a range of different factors. Broadly, it means a group of samples that have something in common which may make their data more similar to each other than to samples in other batches. Often this means they were processed at the same time (think of a batch of cakes), and it refers to technical factors relating to the experiment.

The challenge is to organise your samples prior to the experimental procedure so that these technical variations do not influence the statistics you want to do. If you are doing a case-control study, you want to randomly allocate the samples so that each batch contains a mix of both groups. What you don’t want is all the cases processed as one group and all the controls processed together, as then you can’t be sure whether the differences you see are due to the disease or to the fact that the experiment was run by two different people.
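One simple way to get that mix, sketched hypothetically in Python: shuffle the cases and controls separately, then deal them round-robin into batches so every batch receives an equal share of each group (the sample names and batch sizes here are invented for illustration).

```python
import random

random.seed(7)  # fixed seed so the allocation is reproducible

# Hypothetical study: 24 cases and 24 controls to be run in 4 batches.
cases = [f"case_{i}" for i in range(24)]
controls = [f"control_{i}" for i in range(24)]
random.shuffle(cases)
random.shuffle(controls)

n_batches = 4
batches = [[] for _ in range(n_batches)]
# Deal samples round-robin so each batch gets an equal share of each group.
for i, sample in enumerate(cases + controls):
    batches[i % n_batches].append(sample)

for n, batch in enumerate(batches, start=1):
    n_cases = sum(1 for s in batch if s.startswith("case"))
    print(f"Batch {n}: {n_cases} cases, {len(batch) - n_cases} controls")
```

Shuffling within each group before dealing means the batch a sample lands in is random, while the round-robin guarantees the case/control balance that a purely random split would only achieve on average.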

There are times when you want to make sure that what you are comparing comes from the same batch. For example, we do a lot of work with the discordant twin design. Here we are looking at the differences between the two members of a twin pair, so we want to be sure that these differences are not an artifact of the samples being processed two months apart.

While I have no desire to go into the lab to run any experiments, I have learnt a lot by having day-to-day interaction with the colleagues who generated the data. That knowledge can really help when it comes to processing the data. Comparing notes with the person who ran the experiments to identify why something doesn’t look like you were expecting, and resolving it, invariably gives you confidence in the data. This is the kind of interaction I always wanted out of a job. I enjoy bringing my skills and having responsibility for certain parts of a project whilst others with different skill sets are responsible for something else.

There is enough data out there that as a Bioinformatician I don’t have to work with a group who generate data. However I would strongly recommend spending some time in that environment as it is always beneficial to understand a bit more about how and where your data came from.

Dealing with unknowns.

Science is all about dealing with unknowns.

There are the big unknowns: ‘Can we eradicate cancer?’, ‘Why do we forget things as we get older?’, ‘Can we grow replacement organs?’. Then there are the day-to-day niggling unknowns. These are the ones that tend to cause the most anxiety, perhaps because we never expect to completely answer the big questions and are simply looking to add to the body of knowledge.

Pretty much all of the day-to-day problems I deal with relate to ‘how’ we are going to test a particular hypothesis. Once you have data in hand, it is not uncommon for some technicalities or oversights to emerge. We have to accept that the perfect study design is often unobtainable, and instead we strive to control for as many external factors that may influence the result as possible. Where you couldn’t do so in the way the experiment was conducted, you have a second chance at the analysis stage. This is limited by two things: 1) knowing what all of the possible confounders are, and 2) actually having a measure or proxy for each confounder.

There are two routes to dealing with confounders: either you perform the initial analysis and then see if it changes with the addition of further covariates, or you include all the variables from the outset. Personally I don’t see the point of doing an analysis if you are subsequently going to discount any of the results which you later find to be a product of some other factor. Of course this view may reflect my ‘omics background where, given the large number of features tested in every experiment, spurious results are expected as par for the course and the quicker you can discount them the better.

Recently I have been working with some data for which we are aware of many possible confounders. Some of these were obvious at the start, and we have the relevant information to include in the analysis. For some of the unknowns, we have calculated estimates from our data using a commonly accepted methodology; however, we are unsure how accurate these estimates are, as there is little empirical evidence to truly assess them, or whether they are capturing everything they should.

An alternative with high-dimensional data (that is, when you have lots of data points for each sample) is to use methods that create surrogate variables. These capture the variation in your dataset presumed to reflect the confounders we are concerned about (and those perhaps we haven’t thought of yet). I have always been cautious of such an approach, as I don’t like the idea of not understanding exactly what you are putting into your model. What’s more, there is a possibility that you are removing some of the true effects you are interested in. However, there is the opposing argument: ‘What does it matter? If it prevents false positive results then that’s the whole point.’

At present it is somewhat an open question which way we should proceed. It is good practice to question your approach and test it until it breaks. Having tried a few ways of doing something, all of which produce slightly different numbers, how do we decide which is the correct one? Part of the problem is that we don’t know what the right answer is. We can keep trying new things, but how do we know when to stop? Unlike school, we can’t just turn the textbook upside-down and flick to the back pages to mark our effort as right or wrong. Instead we have to think outside the box to come up with additional ways to check the robustness of our result. But this is par for the course; research is all about unknowns. These are the challenges we relish, and maybe eventually we will start to convince ourselves that our result might be true!

Often the gold standard is replication, that is repeating the analysis and finding the same result in a completely independent sample. Sometimes you might have a second cohort already lined up, so this validation can be internal and give you confidence in what you are doing. Or you may face a nervous wait to collect more data or for another group to follow up your work.

Sometimes though, you just have to go with what you have got. Sharing your work with the research community is a great opportunity for feedback and may prompt a long overdue conversation about the issues at hand. Ultimately, as long as you are clear about exactly what has been done, your findings can be interpreted appropriately.