Monday, June 10, 2013

A network analysis of the Bundesliga

The German Fußball-Bundesliga association football (soccer) competition has just completed its 50th season, having started in 1963. This seems like an appropriate time to celebrate by performing a network analysis of the annual competition results. (I have previously provided a similar analysis for the results of the FIFA World Cup.)

The 1.Bundesliga (the top level of the competition, with 18 teams) is one of the most prestigious leagues in the world, and is currently ranked third in Europe (behind La Liga in Spain and the Premier League in England). In the current Elo Football Club Rankings, 15 of the top 250 ranked teams play in the Bundesliga, including the top-ranked team, FC Bayern München. (England and Spain both have 18 teams in the top 250, and Serie A in Italy has 16.)

Average attendance at the games is c. 45,000 people, with the average being >80,000 per game for the Borussia Dortmund team in the 2012-13 season. (This implies a stadium-capacity attendance at every Dortmund home game!) This national average is second only to the US National Football League (c. 65,000 people per game), and is far more than for any other soccer nation.

Of course, German national football competitions existed before the founding of the Bundesliga, but football was previously organized as five separate Oberligen (premier leagues) representing different geographical regions. The first Bundesliga competition involved 16 teams selected from these regions, starting in August 1963. The number of teams was expanded to 18 in 1965, where it has remained, except for the 1991-1992 season when there were 20 teams due to German reunification. A total of 52 clubs have competed in the 1.Bundesliga since it was founded, although only one team, Hamburger SV, has competed at the top level in all 50 seasons.

The relationships between these 52 teams represent a network within each competition, based on their relative success at the games they played (and thus their final position in the league table), and this network changes through time across the various seasons. It is this network of relationships across the years that I am analyzing here, using a phylogenetic network.

In this case, the phylogenetic analysis is being used in an exploratory manner: it is intended to be a convenient visual summary of the data. It seeks to uncover the patterns associated with a group of objects (teams in this case) for which multi-variable data have been collected (i.e. competition results for multiple years). The network analysis assumes that the data have been formed by some historical process(es), and it produces a visualization that places objects with similar histories near each other in the network.

The analysis

I have taken the competition data for each season from the Deutsche Telekom T Online Sport page. The teams are simply ranked according to their official finishing position in each season. I have converted this rank to a score per season (1st = 18 points, 2nd = 17, ... 18th = 1 point), with a score of zero if the team was absent from the 1.Bundesliga for that season. Each of the 52 teams thus has a score from 0 to 18 for every season.
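The conversion from finishing position to score can be sketched in a few lines of Python. This is only an illustration of the scheme described above: the function name `season_score` is hypothetical, and because the post does not say how positions 19 and 20 in the 20-team 1991-1992 season were scored, the sketch simply rejects positions outside 1-18 rather than guessing.

```python
def season_score(position):
    """Convert a finishing position into a season score.

    position: 1..18 for a team that played in the 1.Bundesliga that season,
              or None if the team was absent.
    Returns 18 for 1st place down to 1 for 18th place, and 0 for an absent team.
    """
    if position is None:
        return 0  # absent from the 1.Bundesliga that season
    if not 1 <= position <= 18:
        # Assumption: positions beyond 18 are not covered by the stated scheme.
        raise ValueError("position must be between 1 and 18, or None")
    return 19 - position


# A made-up three-season history: champion, absent, then 18th place.
history = [season_score(p) for p in (1, None, 18)]
```

Applying this function to every team and every season would give a 52 x 50 table of scores, one row per team.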

The difference between the 50 season scores for each pair of teams was calculated using the Steinhaus dissimilarity. The Steinhaus dissimilarity ignores "negative matches", as discussed in a previous blog post, so that two teams are not considered to have similar histories just because they were both absent from the 1.Bundesliga in the same years (i.e. when they are absent their history is "unknown").
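For readers who want to see the calculation, here is a minimal sketch. It assumes the usual quantitative form of the Steinhaus coefficient, in which the similarity is twice the sum of the season-by-season minimum scores divided by the sum of all scores for both teams, and the dissimilarity is one minus that value (this is the same quantity often called the Bray-Curtis dissimilarity). The function name and the toy score lists are hypothetical, not the real data.

```python
def steinhaus_dissimilarity(x, y):
    """Steinhaus dissimilarity between two equal-length lists of season scores.

    D = 1 - 2 * sum(min(x_i, y_i)) / (sum(x) + sum(y))
    Seasons in which both teams score 0 add nothing to either sum,
    so joint absences do not make two teams look more alike.
    """
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    shared = sum(min(a, b) for a, b in zip(x, y))
    total = sum(x) + sum(y)
    if total == 0:
        raise ValueError("undefined when both teams have only zero scores")
    return 1.0 - 2.0 * shared / total


# Toy example: adding seasons in which both teams were absent changes nothing.
d_short = steinhaus_dissimilarity([10, 5], [5, 5])
d_padded = steinhaus_dissimilarity([10, 0, 5, 0], [5, 0, 5, 0])
```

Here `d_short` and `d_padded` are equal, which is the "negative matches are ignored" property in action. Two teams that never played in the same season get the maximum dissimilarity of 1.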

A Neighbor-net analysis was then used to display the between-team similarities as a phylogenetic network. This decomposes the similarities into a series of bi-partitions of the teams, and then tries to display as many of these bi-partitions as possible in two dimensions. Each bi-partition represents the division of the teams into two sub-groups, where the data indicate that the two sub-groups differ in some way. (See the post on How to interpret splits graphs.) That is, teams that are closely connected in the network are similar to each other based on their Bundesliga results, and those that are further apart are progressively more different from each other.

I have colour-coded some of the teams in order to highlight some of the patterns shown in the network:
Brown: Teams that have appeared in most of the seasons (43-50 out of 50), and have finished in the top 3 at least once
Pink: Teams that have finished in the top 6 (i.e. top third) at least once, and have participated in at least 20 seasons (20-34)
Blue: Teams that have finished in the top 8 at least once, and have participated in 10-20 seasons
Purple: Teams that have finished in the top 8 at least once, and have participated in <10 seasons
Black: Teams that have rarely competed in the 1.Bundesliga, and have not been very successful when they have been there
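
The four numerically defined colour groups can be written as a small rule set. This is a sketch only: the function name `team_colour` is hypothetical, the order in which the rules are tested is my assumption (it matters for a team with exactly 20 seasons, which could match both pink and blue), and black is left out because no numeric criteria are given for it, so anything that matches none of the four rules returns `None`.

```python
def team_colour(seasons, best_finish):
    """Assign a colour group from seasons played and best finishing position.

    seasons:     number of 1.Bundesliga seasons the team took part in (1..50)
    best_finish: best final league position achieved (1 = champion)
    Returns "brown", "pink", "blue", "purple", or None if no stated rule applies.
    """
    if 43 <= seasons <= 50 and best_finish <= 3:
        return "brown"
    if 20 <= seasons <= 34 and best_finish <= 6:
        return "pink"   # assumption: pink is tested before blue at 20 seasons
    if 10 <= seasons <= 20 and best_finish <= 8:
        return "blue"
    if seasons < 10 and best_finish <= 8:
        return "purple"
    return None  # black or uncoloured: no numeric rule is stated for these


# A team with 45 seasons and a best finish of 2nd falls in the brown group.
example = team_colour(45, 2)
```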

Note that most of these colours are strongly clustered in the network graph, especially the brown and pink ones. This is to be expected, because teams of the same colour have very similar Bundesliga histories.

The pink teams form two main clusters in the graph. These teams have been in the 1.Bundesliga for about half of the 50 seasons, and the two clusters represent which years they were present. That is, roughly speaking, when the five pink teams at the top of the graph were present then the four pink teams at the bottom were not, and vice versa. Bayer 04 Leverkusen is separate from the other pink teams because, rather than appearing in the 1.Bundesliga "on and off", it has been there continuously since 1979.

The blue teams form three groups in the graph. KFC Uerdingen 05 is separated from the other blue-coloured teams because its period of success was from 1975-1995, whereas the successes of Hansa Rostock, SC Freiburg and VfL Wolfsburg all date from 1993 onwards, and the successful years of DSC Arminia Bielefeld have been scattered throughout the 50 seasons.

The positions of the black and purple teams are based on similar considerations — their positions reflect which few seasons they participated in the 1.Bundesliga.

Note also that the very long terminal branches for most of the teams indicate that they actually do not have a great deal in common regarding their competition results. Teams have been successful to different extents in different years, relatively independently of each other. This is not surprising.

The largest split in the network separates the brown (10) and pink (10) teams, plus Arminia Bielefeld, from the others, thus highlighting these 21 teams as the most successful competitors. In the network graph, the brown-coloured teams form a distinct subset of this group (along with Bayer 04 Leverkusen), forming the upper echelon of success.

With one exception (VfL Wolfsburg), the competition champions have also come from this upper group; the number of championships is indicated in the network next to each of the 12 teams that have won. This highlights the unfortunate fact that both FC Schalke 04 (45 seasons, best finish 2nd) and Eintracht Frankfurt (43 seasons, best finish 3rd) have had consistent success without ever becoming champions.

This phylogenetic network thus provides a very effective visual summary of the main features of the 1.Bundesliga results when averaged over all of the 50 seasons.
