The Genealogical World of Phylogenetic Networks: Network analysis of New York neighborhoods

Trying to quantify the characteristics of a neighborhood is a tricky business. Part of the problem is trying to define the nebulous idea of "livability" with respect to a geographical area, and part is due to the impracticality of collecting most of the data that might allow us to quantify the various aspects of life in that area.

Nevertheless, in New York magazine Nate Silver had a go at this in 2010, by trying to identify The Most Livable Neighborhoods in New York. He tried this because:

there is a wealth of information to study. The Bloomberg administration gathers reams of data about almost every element of life in the city — from potholes to infant-mortality rates— as do New York University's Furman Center and the U.S. Census Bureau. Sites like Yelp provide a reasonably objective perspective on the popularity of neighborhood bars and restaurants. StreetEasy.com and Zillow.com publish the costs of apartment space per square foot. Ethnic diversity is now broken down in much finer gradients than black and white ... Our goal was to take advantage of this wealth of data and apply a little bit of science to the question. If there was anything that could plausibly affect one's quality of life in a particular neighborhood, we tried to incorporate it.

New York thus provides a unique opportunity to try quantifying the nebulous, and I think that it is worth looking at these data in more detail.

The data

The data were compiled into twelve broad categories, representing different characteristics of the various New York neighborhoods:

Affordability / Housing Cost (as measured on a price-per-square-foot basis, for both renters and buyers), Housing Quality (historic districts, code violations, cockroaches), Transit and Proximity (commute times to lower Manhattan and midtown, the density of subway coverage), Safety (as measured by violent- and nonviolent-crime rates), Public Schools (test scores and parent satisfaction), Shopping & Services (the number of neighborhood amenities, especially supermarkets), Food & Restaurants (judged by density and quality of options), Nightlife (ditto), Creative Capital (arts venues as well as the number of residents engaged in the arts), Diversity (in terms of both race and income), Green Space (park and waterfront access, street trees), and Health & Environment (noise, air quality, overall cleanliness).

The data were gathered from the stated sources, and are presented in the original magazine article for 50 of the 60 neighborhoods that were assessed. The data for all of the characteristics were then summed for each neighborhood, based on a particular weighting scheme for the 12 categories. This provided "a quantitative index of the 50 most satisfying places to live."

The total scores do not actually differ very much among the neighborhoods (73–78 out of 100), and therefore choosing between them on that basis is (as the author admits) "splitting hairs". More particularly, neighborhoods with very different characteristics can end up with the same total score; they simply reach that total by combining the category scores in very different ways (i.e. the neighborhoods have different strengths and weaknesses).
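To make the weighted-sum index concrete, and to show how two different profiles can reach the same total, here is a minimal Python sketch. The four categories shown, the weights, and the two neighborhoods' scores are all invented for illustration; they are not the magazine's actual weighting scheme or data.

```python
# Hypothetical weights (in percent) for a subset of the categories.
# These are NOT the weights used in the magazine article.
WEIGHTS = {"Affordability": 40, "Transit": 30, "Safety": 20, "Nightlife": 10}

# Invented category scores (0-100) for two made-up neighborhoods.
SCORES = {
    "Neighborhood A": {"Affordability": 90, "Transit": 60, "Safety": 70, "Nightlife": 80},
    "Neighborhood B": {"Affordability": 60, "Transit": 90, "Safety": 85, "Nightlife": 80},
}


def weighted_index(scores, weights):
    """Return the weighted mean of the category scores (same 0-100 scale)."""
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight


if __name__ == "__main__":
    for name, scores in SCORES.items():
        print(name, weighted_index(scores, WEIGHTS))
```

Both made-up neighborhoods receive the same index even though one is strong on Affordability and the other on Transit and Safety, which is exactly the information that a single summary number hides.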

So, this is a rather limited approach to assessing the data. Surely we can get more out of the data than this? What would be more useful is a picture showing which neighborhoods are similar to each other based on the way the scores are distributed across the different categories. This will tell us which neighborhoods have the same characteristics, and which are different from each other. This avoids splitting hairs, because it uses all of the data simultaneously, rather than summarizing the data down to a single number for each neighborhood.

The analysis

A phylogenetic network is ideal for doing this sort of thing, as I have emphasized many times in this blog, and so I have constructed one. As my analysis of choice, I have used the Manhattan distance (appropriately enough!) combined with a NeighborNet network. Neighborhoods that are closely connected in the network are similar to each other based on the various characteristics, and those that are further apart are progressively more different from each other.
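For readers unfamiliar with the Manhattan distance, it is simply the sum of the absolute differences between two neighborhoods' category scores. The sketch below computes a pairwise distance matrix in plain Python; the three profiles are invented, and the function names are my own. The network itself would then be built from such a matrix by a NeighborNet implementation (the SplitsTree program is one), and that step is not reproduced here.

```python
# Invented category scores for three made-up neighborhoods
# (the same category order is assumed in every list).
PROFILES = {
    "A": [90, 60, 70, 80],
    "B": [60, 90, 85, 80],
    "C": [88, 62, 72, 78],
}


def manhattan_distance(p, q):
    """Sum of absolute differences between two equal-length score lists."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same number of categories")
    return sum(abs(a - b) for a, b in zip(p, q))


def distance_matrix(profiles):
    """Return {(name1, name2): distance} for every ordered pair of names."""
    names = sorted(profiles)
    return {
        (m, n): manhattan_distance(profiles[m], profiles[n])
        for m in names
        for n in names
    }


if __name__ == "__main__":
    matrix = distance_matrix(PROFILES)
    for (m, n), d in sorted(matrix.items()):
        if m < n:
            print(m, n, d)
```

Here A and C are close (they differ by only a couple of points in each category), while B is far from both, so A and C would be expected to sit near each other in any network built from these distances.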

I have color-coded the neighborhoods based on their borough, using roughly the same colors as in the map shown above.

I have also placed an asterisk next to the top five neighborhoods based on their total scores. Two of these neighborhoods are near each other in the graph, with two a bit further away, and one is quite distant from the others. This indicates that, even though they have very similar total scores, these neighborhoods are actually quite different.

In general, the network shows a trend from Manhattan (at the right-hand end of the graph) to Queens and the Bronx (at the left-hand end), via Brooklyn (stretching through the middle). This seems to neatly summarize the overall impression of the relationships among the areas of New York, at least as it is usually presented to outsiders. So, I think that the network analysis has been a successful one, in the sense that it provides a useful picture of the relationships between the neighborhoods.

Going deeper, many of the detailed patterns in the network graph are fairly obvious. For example, at the right-hand end of the graph, the association of the southern Manhattan neighborhoods of Soho, Central Greenwich Village, Tribeca, Battery Park City, and the Financial District should surprise no one. Similarly, at the top of the graph, the linking of Manhattan's Inwood with the nearby Bronx neighborhoods of Belmont, Bedford Park and Riverdale is not unexpected. Furthermore, at the left-hand end of the graph, the connection of Astoria, Woodside, Jackson Heights, and Flushing (in north-western Queens) with Cobble Hill, Boerum Hill, and Bay Ridge (in western Brooklyn) is hardly surprising, even though the two borough areas are geographically separated.

Other patterns are less obvious, and thus more intriguing, such as the apparent similarity of Chinatown (southern Manhattan), Central Harlem (northern Manhattan), Co-op City (the Bronx), and West Brighton and New Dorp (both Staten Island) (at the bottom-left of the graph). This bears looking into, should you be looking for somewhere to live in New York. Perhaps the oddest juxtaposition is that of Chelsea (midtown Manhattan) with Corona Park (Queens) and Washington Heights (northern Manhattan).

Another possible use of the graph is that it makes suggestions for areas that might be suitable as alternatives to any neighborhood that is out of reach on the Affordability / Housing Cost criterion. That is, we might consider areas that are similar based on the other criteria and yet differ in Affordability. For example, Park Slope (northern Brooklyn) differs dramatically in Affordability from the Nolita & Little Italy neighborhood (lower Manhattan), and yet the only other characteristic they differ greatly on is Shopping & Services. Williamsburg, Greenpoint, and Carroll Gardens & Gowanus are indicated in the network as other neighborhoods worth considering.
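The idea of looking for alternatives can also be expressed directly in terms of distances: ignore Affordability when measuring similarity, and keep only those neighborhoods that are clearly more affordable than the target. The following Python sketch does this for invented data. It assumes that a higher Affordability score means a more affordable neighborhood, the `min_gain` threshold is arbitrary, and it is a distance-based analogue of reading near neighbors off the network rather than the NeighborNet procedure itself.

```python
AFFORDABILITY = "Affordability"

# Invented scores; a higher Affordability score is assumed to mean cheaper.
SCORES = {
    "Pricey": {"Affordability": 40, "Transit": 90, "Safety": 80, "Nightlife": 85},
    "Similar but cheaper": {"Affordability": 70, "Transit": 85, "Safety": 80, "Nightlife": 80},
    "Cheaper but different": {"Affordability": 80, "Transit": 50, "Safety": 70, "Nightlife": 40},
    "Equally pricey": {"Affordability": 42, "Transit": 90, "Safety": 80, "Nightlife": 85},
}


def distance_excluding(p, q, excluded):
    """Manhattan distance over every category except the excluded one."""
    return sum(abs(p[c] - q[c]) for c in p if c != excluded)


def cheaper_alternatives(target, scores, min_gain=10):
    """List neighborhoods at least min_gain points more affordable than the
    target, ordered from most to least similar on the other categories."""
    base = scores[target]
    candidates = [
        (distance_excluding(base, other, AFFORDABILITY), name)
        for name, other in scores.items()
        if name != target and other[AFFORDABILITY] - base[AFFORDABILITY] >= min_gain
    ]
    return [name for _, name in sorted(candidates)]


if __name__ == "__main__":
    print(cheaper_alternatives("Pricey", SCORES))
```

The ordering matters more than the cut-off: the first name in the list is the closest match on everything except cost, and names further down the list are progressively poorer substitutes.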

It seems unlikely, however, that anyone looking for a substitute for the Upper East Side of Manhattan (one of the most expensive neighborhoods in the USA) is going to look at Sheepshead Bay, as suggested by the network. The two neighborhoods differ dramatically in Transit and Proximity, since Sheepshead Bay is way down on the Atlantic coastline. Nor are those looking for a replacement for the Upper West Side going to consider Brooklyn's Prospect Heights, since these two differ more than somewhat in Housing Quality, for example. So, good though they are, the suggestions made by the network graph are not perfect!

Postscript

There is one other ranking scheme that I know of, at the StreetAdvisor Best Neighborhoods web page [on that page, click on Neighborhoods]. It is described as follows:

Our rankings begin with reviews written by locals. Each review contains certain scoring elements that tell us how good, or how bad a place is. We then combine all the scores and apply a 'fairness' factor that takes into account things such as volume of reviews, age of reviews and the type of person writing a review. We then apply a rank so we can compare and sort locations.

You will find many of the rankings odd, to say the least. For example, it seems doubtful that Country Club (the Bronx) is the "3rd best neighborhood in New York City" (after Carnegie Hill and Gramercy Park).

Not the least of the oddities is that the Upper East Side (7.6) scores much lower than neighboring Carnegie Hill (9.4) and Lenox Hill (8.1). Indeed, it scores worse than parts of Brooklyn (Carroll Gardens, Clinton Hill, Brooklyn Heights, Park Slope, Bay Ridge), Queens (Glendale, Richmond Hill, Forest Park), the Bronx (Country Club, Schuylerville) and Staten Island (Huguenot).

You can access the individual neighborhoods within the boroughs at these web pages:
Manhattan
Brooklyn
Queens
Bronx

Monday, March 25, 2013
