The Genealogical World of Phylogenetic Networks: Coronavirus statistics are (almost) all misleading

There are plenty of places on the internet where we can access statistics about the current Covid-19 pandemic, caused by the rapid global spread of the SARS-CoV-2 virus — notably Johns Hopkins University (formally described here), and Worldometer. These are compilations of official government statistics, comparing different countries, or states within a country. These are potentially interesting, because we can see how things are progressing in our own location, and compare it to other places. If nothing else, this might inform our own actions for protecting ourselves.

The basic problem is that these data are often not comparable between jurisdictions, in the sense that they will have been collected in different ways and with different degrees of success. For example, consider these two recent articles about the country that is very likely to end up being the worst hit:

The second one contains this quote that sums up the issue: "India is the third-worst hit country in the world, but there are concerns a lack of testing could mean the true figure is far higher." Government organizations usually do their best to collate their local data, but their relative success in a situation like this will vary from "okay" to "abysmal". We cannot really know where any given dataset fits into that continuum, and this profoundly affects how we interpret the data.

Data must be comparable if we are to compare them. This is an obvious truism, especially in science; but achieving comparability is often very difficult in practice, and scientists spend much of their time trying to achieve it in their own work. I would hate to be the person delegated the job of summarizing this pandemic globally, because they will really be up against the wall. But someone will have a go at it, believe me, and I wish them every success.

In this post, I summarize the main data-collecting issues, as they are currently understood. The two main statistics reported are the number of infection cases and the number of resulting deaths, which have separate issues.

Case numbers

Deciding whether a particular person is a Covid-19 case is not straightforward. Three main criteria have been used to date:

disease symptoms (which are similar to influenza)
detection of a viral genome in the body (meaning the person currently has the virus)
detection of virus antibodies in the body (meaning the person has previously had the virus).

These three criteria will yield different estimates of the number of cases.

Since the virus seems to have originated in China, the Chinese were the first to officially count cases. They started by including only those people who had been tested for the virus itself (after they showed symptoms), but soon realized that this caused a delay before these people received medical treatment. So, the official data show a massive spike in case numbers, when the authorities switched to using symptoms alone to count cases. You can see in this graph (from Worldometer) which day that was.

Using symptoms alone presumably over-estimates the number of cases, because of the similarity of coronavirus symptoms to those resulting from influenza viruses. Clearly, symptoms need to be confirmed by a direct test for each particular type of virus.

However, without a concerted testing effort for SARS-CoV-2, the number of cases will be under-estimated, probably by a large margin. We now know that many people show few or no symptoms of this coronavirus, and will therefore not be detected if we test only those people with explicit symptoms, and who visit a testing center. Some countries have made massive testing efforts, relative to their population size, while many other countries have been much less active. This table shows the top data from Worldometer, counted as the number of tests per million people.

Clearly, the more of your population you test, the more likely you are to correctly detect all of your cases. The effect of this can be seen in this next Worldometer graph, for Sweden. The apparent burst in cases after June 5 was due to the government finally implementing large-scale virus testing, which naturally increases the detection rate for this type of situation. That is, the data were greatly under-estimated before June 5, and the official data were corrected during June, by catching up with many of the as-yet-undetected cases. This increased testing has continued, which means that the drop in cases during July is cause for optimism, as in any situation where you search for something bad and don't find it. Nevertheless, these tests cover only 8% of the population, to date, and so even now the data may still (theoretically) be under-estimates.

So, between-country comparisons are misleading, unless the same amount of virus testing has been conducted. This is the point I made about India, above, where testing is a real challenge given the size of the population. Those of you in the USA might like to contemplate just how many cases you really have — your officials have conducted more tests than anyone else except China, but you still have covered only 17% of your population (the table above is cut off at 30% coverage).

Alternatively, antibody testing is a good way to detect people who have had the virus without knowing it, since this studies their body's reaction to the virus rather than looking for the virus itself. As this sort of testing proceeds around the world, the number of official cases will continue to increase. However, the number of false positives and false negatives of the antibody tests means that even they are not entirely reliable (see False positive and false negative coronavirus test results explained). Indeed, a review article assessing the range of currently available antibody tests shows remarkable variation in their success rates (Diagnostic accuracy of serological tests for Covid-19: systematic review and meta-analysis).
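To see why false positives matter so much for antibody surveys, here is a minimal sketch of the standard positive predictive value calculation (Bayes' rule). The sensitivity, specificity and prevalence values are hypothetical, chosen only for illustration; they are not taken from the review cited above, and the function name is my own.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive antibody result reflects a real past infection."""
    true_pos = sensitivity * prevalence                    # infected and test-positive
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # never infected but test-positive
    return true_pos / (true_pos + false_pos)

# Hypothetical test characteristics and prevalences, for illustration only.
for prevalence in (0.01, 0.05, 0.20):
    ppv = positive_predictive_value(sensitivity=0.90, specificity=0.95,
                                    prevalence=prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.1%}")
```

Under these assumed numbers, the same test is far less trustworthy where few people have been infected, because the false positives from the large uninfected group swamp the true positives.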

A final point, which has been very obvious here in Sweden, is just how long a person is considered to be a Covid-19 case. As far as Sweden is concerned, there were apparently a lot of "active cases" early in the pandemic. However, what was happening was that most other jurisdictions were declaring cases as "recovered" once the person's symptoms receded, which takes about 7 days, and then removing them from the official list of cases. On the other hand, Sweden did not officially declare a case recovered until the person was completely free of the virus, which takes about 5 weeks. So, Sweden's reported number of active cases remained much higher than for most other places, for a much longer time. The number of Swedish cases was actively criticized by the foreign media, but the cause was never mentioned: the data were not comparable to elsewhere.
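The effect of the recovery definition on "active cases" can be sketched with a toy calculation. This assumes a hypothetical, constant 100 new cases per day, ignores deaths, and treats a case as active for a fixed number of days; the function name and all of the numbers are illustrative only.

```python
def active_cases(daily_new_cases, days_until_recovered):
    """Active cases on each day: those reported within the recovery window."""
    active = []
    for day in range(len(daily_new_cases)):
        start = max(0, day - days_until_recovered + 1)
        active.append(sum(daily_new_cases[start:day + 1]))
    return active

# Hypothetical epidemic: 100 new cases every day for 60 days.
new_cases = [100] * 60
short_rule = active_cases(new_cases, days_until_recovered=7)   # symptoms receded
long_rule = active_cases(new_cases, days_until_recovered=35)   # virus-free, about 5 weeks
print(short_rule[-1], long_rule[-1])
```

With identical new-case numbers, the 5-week rule reports five times as many active cases as the 7-day rule once both windows are full, purely as a result of the definition.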

Similarly, the reporting of cases is obviously not equal throughout any given week, so that daily reports are unreliable — there are obvious weekly cycles in almost all of the national datasets, with fewer reported cases or deaths on Saturdays and Sundays. The same thing applies to regional (geographic) patterns, of course. For example, both Spain and the United Kingdom have noted that their current outbreaks are all regional, with the majority of their countries being much less affected.
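A common way to remove the weekly reporting cycle is a 7-day moving average, so that every averaged value contains exactly one of each weekday. The sketch below uses made-up daily counts with a weekend dip; it is not based on any real national dataset.

```python
def rolling_mean(values, window=7):
    """Trailing moving average; days without a full window behind them are skipped."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical daily counts for three weeks (Mon..Sun), with weekend reporting dips.
week = [120, 130, 125, 128, 122, 60, 55]
daily = week * 3
smoothed = rolling_mean(daily)
print(min(daily), max(daily))          # raw values swing widely within each week
print(min(smoothed), max(smoothed))    # smoothed values stay flat
```

In this artificial example the underlying weekly total never changes, so the smoothed series is constant even though the raw daily reports more than halve at weekends.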

Number of deaths

This brings us to a consideration of counting deaths due to Covid-19. We all know what death is, but it is not so easy to assign a particular cause to any particular death. A death certificate signed by a professional medical practitioner will assign an official "cause of death", and possibly list other "contributing factors". So, when does a death count as a coronavirus death?

The simplest solution is to say that any dead person who has a virus genome in their body counts; and it is clear that some of the statistics around the world have counted Covid-19 deaths this way. Unfortunately, as has been pointed out ironically, this counts people who are carrying the virus when they get run over by a car; and this may not be what most people mean when referring to "a coronavirus death".

Just as importantly, some jurisdictions have clearly tested, and thus counted, only those people who died in hospital. Similarly, there are clear differences in counting due to social circumstances, especially in countries with large poor communities. These factors will lead to under-estimates of the actual death rate.

The main issue, however, is that most of the people severely affected by this new virus are elderly persons with pre-existing medical conditions. For example, 7.3% of the reported Covid-19 cases in Sweden have resulted in death, to date, but 89.1% of those deaths have been in the 70+ age group. This is a bit more extreme than elsewhere, as early on in the pandemic the virus got into several aged-care facilities in Sweden. In most of these cases, the SARS-CoV-2 virus was simply one thing too many, for people whose health was already declining — this is called co-morbidity (the presence of one or more additional conditions co-occurring with a primary medical condition).

So, where is the border between a main cause and a subsidiary factor? The answer to this question clearly differs around the world; and this makes the officially reported death data non-comparable. Some data will be over-estimates and some will be under-estimates, compared to some global standard definition. So, what does the following graph, from Worldometer, really tell us?

The generally accepted solution to this conundrum is to consider what is called excess mortality, which assumes that there has been a temporary change in the number of deaths during some specified period of time. That is, we do not assign deaths to particular causes, but simply compare the total number of deaths now to the total number of deaths in previous years. The difference can be attributed directly or indirectly to the current circumstances. This is not perfect, but it is the best we have got.

So, we should compare the number of deaths during the current pandemic period with some estimate of a baseline number of deaths under more normal circumstances. The baseline is commonly taken as the equivalent data from the immediately preceding 3–5 years, or so — how many more people have died during the pandemic, compared to the average deaths during the same months of prior years?
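As a minimal sketch of this comparison, the baseline can be taken as the mean of the same period in the preceding years, and the excess as the observed deaths minus that baseline. The death counts below are invented for illustration, and the simple mean is only one possible baseline; the groups mentioned in this post may well use more elaborate models.

```python
def excess_deaths(current, prior_years):
    """Return (baseline, excess, relative excess) against the mean of prior years."""
    baseline = sum(prior_years) / len(prior_years)
    excess = current - baseline
    return baseline, excess, excess / baseline

# Hypothetical death counts for the same calendar month, for illustration only.
prior = [9800, 10100, 10000, 9900, 10200]   # five preceding years
baseline, excess, relative = excess_deaths(current=12500, prior_years=prior)
print(f"baseline {baseline:.0f}, excess {excess:.0f} ({relative:.0%} above baseline)")
```

Note that no cause of death enters the calculation at all, which is exactly why the result is comparable between places that certify deaths differently.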

The U.S. Centers for Disease Control and Prevention has a compilation of these data for the states of the USA, updated daily: Excess deaths associated with COVID-19. The data are still provisional, but it would be nice to think that they are directly comparable. Whether the data are actually meaningful for the current pandemic is a point I discuss at the end of this post.

Similarly, the EuroMOMO collaborative network is supported by the European Centre for Disease Prevention and Control, and provides weekly data for public health threats in 24 European countries. If you look at their graphs, you can see the age-related effects of seasonal flu in every winter since 2016, as well as the magnitude of the current pandemic. Here is a graph of their current data, pooled across all age groups and countries. Roughly speaking, deaths are 80% greater than in previous years.

Elsewhere in the world, data are a bit more scarce. The principal problem is lack of suitable prior data — not everywhere on the planet has accurate estimates of the local death rate, for some combination of social, economic or political reasons. Nevertheless, we have data for all of the expected places; and some of the groups who are collating the excess mortality data for the current pandemic are listed by the Our World in Data site: Excess mortality from the coronavirus pandemic (COVID-19).

These groups include three newspapers, each of which is covering the current pandemic across c. 10 countries:

All three of these make their compiled data publicly available on GitHub.

Conclusion and final point

The world is a complex place, and biology is one of the most complex parts of it. Do not over-interpret simplistic data, no matter how prettily they are presented. In particular, for data to be meaningful, all parts of them need to be directly comparable; otherwise the conclusions are likely to be wonky.

Sadly, as a final point to emphasize the issues, I will note that the USA itself apparently has rather big practical problems, as discussed in: Covid-19 data in the US is an ‘information catastrophe’. According to this media report, there are serious problems with the hospitalization data:

Covid-19 data in the US — in fact, almost all public health data — is chaotic: not one pipe, but a tangle ... Every health system, every public health department, every jurisdiction really has their own ways of going about things ... It's very difficult to get an accurate and timely and geographically resolved picture of what's happening in the US, because there's such a jumble of data.

The issue seems to be the National Healthcare Safety Network, as used by the Centers for Disease Control and Prevention, which is responsible for collating the data nationally. The Department of Health and Human Services has now taken over direct responsibility for data concerning Covid-19 infections in hospitalized patients, much to the dismay of many people.

Monday, August 3, 2020
