The goal of this project is to analyze biodiversity data from the National Park Service, particularly the species observed at different national park locations.
Scoping
It’s beneficial to create a project scope whenever a new project is started. The four sections below guide the project’s process and progress. The first section, project goals, defines the high-level objectives and sets the intentions for the project. The second section covers the data. In this project the data is already provided, but it still needs to be checked to confirm that the project goals can be met with it. The third section, analysis, covers the methods and questions that are aligned with the project goals. Lastly, evaluation helps build conclusions and findings from the analysis.
Project Goals
This project takes the perspective of a biodiversity analyst for the National Park Service. The National Park Service wants to ensure the survival of at-risk species in order to maintain the level of biodiversity within its parks. The main objectives for the analyst are therefore to understand the characteristics of the species and their conservation status, and how those species relate to the national parks. Some questions that are posed:
- What is the distribution of conservation status for species?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which animal is most prevalent and what is their distribution amongst parks?
Data
This project has two data sets that came with the package. The first csv file has information about each species, and the other has observations of species with park locations. This data will be used to address the goals of the project.
Analysis
In this section, descriptive statistics and data visualization techniques will be employed to understand the data better. Statistical inference will also be used to test if the observed values are statistically significant. Some of the key metrics that will be computed include:
- Distributions
- Counts
- Relationships between species
- Conservation status of species
- Observations of species in parks
Evaluation
Lastly, it’s a good idea to revisit the goals and check whether the output of the analysis corresponds to the questions first set out in the goals section. This section will also reflect on what has been learned through the process and on whether any of the questions could not be answered. It could also cover limitations, or parts of the analysis that could have been done using different methodologies.
#First, import the primary modules that will be used in this project:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
Loading the Data
To analyze the conservation status of species and their observations in national parks, load the datasets into DataFrames. Once loaded as DataFrames, the data can be explored and visualized with Python.
In the next steps, observations.csv and species_info.csv are read in as DataFrames called observations and species, respectively. The newly created DataFrames are previewed with .head() to check their contents.
Species
The species_info.csv file contains information on the different species in the National Parks. The columns in the data set include:
- category – The category of taxonomy for each species
- scientific_name – The scientific name of each species
- common_names – The common names of each species
- conservation_status – The species conservation status
species = pd.read_csv('species_info.csv',encoding='utf-8')
species.head()
Observations
The observations.csv file contains information from recorded sightings of different species throughout the national parks in the past 7 days. The columns included are:
- scientific_name – The scientific name of each species
- park_name – The name of the national park
- observations – The number of observations in the past 7 days
observations = pd.read_csv('observations.csv', encoding='utf-8')
observations.head()
Explore the Data
It is time to explore the species data in a little more depth. The first step is to find the number of distinct species in the data. Counting the unique values in the scientific_name column gives 5,541 species. There seem to be a lot of species in the national parks!
print(f"number of species:{species.scientific_name.nunique()}")
print(f"number of categories:{species.category.nunique()}")
print(f"categories:{species.category.unique()}")
number of species:5541
number of categories:7
categories:['Mammal' 'Bird' 'Reptile' 'Amphibian' 'Fish' 'Vascular Plant' 'Nonvascular Plant']
Here is a chance to drill one level deeper and see the count of each category in the data. Vascular plants are by far the largest group with 4,470 entries, while reptiles are the smallest with 79.
species.groupby("category").size()
category
Amphibian 80
Bird 521
Fish 127
Mammal 214
Nonvascular Plant 333
Reptile 79
Vascular Plant 4470
dtype: int64
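One detail worth checking here: the category sizes above count rows, whereas the 5,541 figure counts unique scientific names. The short sketch below only re-adds the numbers printed above (the variable names are illustrative, not part of the notebook) to show that the two totals differ, which suggests that some scientific names appear on more than one row.

```python
# Row counts per category, copied from the groupby output above.
category_rows = {
    "Amphibian": 80,
    "Bird": 521,
    "Fish": 127,
    "Mammal": 214,
    "Nonvascular Plant": 333,
    "Reptile": 79,
    "Vascular Plant": 4470,
}

# Number of unique scientific names reported by nunique() above.
unique_names = 5541

total_rows = sum(category_rows.values())
repeated_rows = total_rows - unique_names

print(f"total rows: {total_rows}")
print(f"rows beyond the first per scientific name: {repeated_rows}")
```

On the real DataFrame, something like species.scientific_name.duplicated().sum() would be one way to confirm this directly.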
Another column to explore is conservation_status. The column has 4 categories (Species of Concern, Endangered, Threatened, and In Recovery) as well as nan values.
print(f"number of conservation statuses:{species.conservation_status.nunique()}")
print(f"unique conservation statuses:{species.conservation_status.unique()}")
print(f"na values:{species.conservation_status.isna().sum()}")
print(species.groupby("conservation_status").size())
number of conservation statuses:4
unique conservation statuses:[nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery']
na values:5633
conservation_status
Endangered 16
In Recovery 4
Species of Concern 161
Threatened 10
dtype: int64
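Note that groupby leaves the missing values out of its tally, which is why they had to be counted separately with isna(). As a minimal sketch on a hypothetical toy column (the values below are made up and only stand in for species.conservation_status), value_counts with dropna=False reports categories and missing values in one table:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for species.conservation_status.
status = pd.Series(
    [np.nan, "Species of Concern", np.nan, "Endangered", "Species of Concern", np.nan],
    name="conservation_status",
)

# dropna=False keeps the missing values as their own row in the tally.
status_counts = status.value_counts(dropna=False)
print(status_counts)

# The same missing-value count, obtained the way the notebook does it.
missing = int(status.isna().sum())
print(f"na values: {missing}")
```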
Observations
The next section looks at the observations data. The first task is to check the number of parks in the dataset; there are only 4 national parks.
print(f"number of parks:{observations.park_name.nunique()}")
print(f"unique parks:{observations.park_name.unique()}")
print(f"number of observations:{observations.observations.sum()}")
number of parks:4
unique parks:['Great Smoky Mountains National Park' 'Yosemite National Park'
'Bryce National Park' 'Yellowstone National Park']
number of observations:3314739
Analysis
This section will begin analyzing the data after the initial exploration. The first task is to clean and explore the conservation_status column in species.
The column conservation_status has several possible values:
- Species of Concern: declining or appear to be in need of conservation
- Threatened: vulnerable to endangerment in the near future
- Endangered: seriously at risk of extinction
- In Recovery: formerly Endangered, but currently not in danger of extinction throughout all or a significant portion of its range
In the exploration, a lot of nan values were detected. These values will need to be converted to No Intervention.
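# Fill only the conservation_status column so that missing values elsewhere are left alone.
species['conservation_status'] = species['conservation_status'].fillna('No Intervention')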
species.groupby("conservation_status").size()
conservation_status
Endangered 16
In Recovery 4
No Intervention 5633
Species of Concern 161
Threatened 10
dtype: int64
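A frame-wide fillna would also overwrite missing values in any other column. The sketch below uses a hypothetical three-row frame (the names and the missing common name are invented for illustration) to show a fill restricted to the status column, with the other gap left untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names mirror the species data.
toy = pd.DataFrame({
    "scientific_name": ["species_a", "species_b", "species_c"],
    "common_names": ["Name A", np.nan, "Name C"],  # deliberately missing
    "conservation_status": [np.nan, "Endangered", np.nan],
})

# Fill only the status column; other columns keep their missing values.
toy["conservation_status"] = toy["conservation_status"].fillna("No Intervention")

print(toy)
```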
The next step is to check out the different categories nested in the conservation_status column, excluding the species that do not require an intervention. Both a table and a chart are shown below.
Among species with Endangered status, 7 were mammals and 4 were birds. Among those In Recovery, there were 3 birds and 1 mammal, which could possibly mean that the birds are bouncing back more than the mammals.
conservationCategory = species[species.conservation_status != "No Intervention"]\
.groupby(["conservation_status", "category"])['scientific_name']\
.count()\
.unstack()
conservationCategory
ax = conservationCategory.plot(kind = 'bar', figsize=(8,6),
stacked=True)
ax.set_xlabel("Conservation Status")
ax.set_ylabel("Number of Species");
In conservation
The next question is whether certain types of species are more likely to be endangered. This can be answered by creating a new column called is_protected that flags any species with a value other than No Intervention.
species['is_protected'] = species.conservation_status != 'No Intervention'
Once the new column is created, group by category and is_protected to show the breakdown of each species type and protection status.
It’s easy to see that Birds, Vascular Plants, and Mammals have a higher absolute number of species protected.
category_counts = species.groupby(['category', 'is_protected'])\
.scientific_name.nunique()\
.reset_index()\
.pivot(columns='is_protected',
index='category',
values='scientific_name')\
.reset_index()
category_counts.columns = ['category', 'not_protected', 'protected']
category_counts
Absolute numbers are not always the most useful statistic, so it’s important to calculate the rate of protection that each category exhibits in the data. From this analysis, one can see that ~17 percent of mammals were under protection, as well as ~15 percent of birds.
category_counts['percent_protected'] = category_counts.protected / \
(category_counts.protected + category_counts.not_protected) * 100
category_counts
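As a quick check of the arithmetic, the rate is simply protected / (protected + not protected) * 100. The sketch below redoes that calculation in plain Python for three categories, using the protected and not-protected counts that appear in the contingency tables later in this notebook; the dictionary layout is only for illustration.

```python
# (protected, not_protected) counts taken from the contingency tables
# used in the Statistical Significance section.
counts = {
    "Mammal": (30, 146),
    "Bird": (75, 413),
    "Reptile": (5, 73),
}

# Same formula as the percent_protected column above.
percent_protected = {
    category: protected / (protected + not_protected) * 100
    for category, (protected, not_protected) in counts.items()
}

for category, percent in percent_protected.items():
    print(f"{category}: {percent:.1f}% protected")
```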
Statistical Significance
This section will run some chi-squared tests to see if different species have statistically significant differences in conservation status rates. In order to run a chi-squared test, a contingency table needs to be created. The contingency table should look like this:
| | protected | not protected |
|---|---|---|
| Mammal | ? | ? |
| Bird | ? | ? |
The first test will be called contingency1 and will need to be filled with the correct numbers for mammals and birds.
The chi-squared test returns several values; the second value, 0.69, is the p-value. The standard significance threshold is 0.05, and 0.69 is much larger than that. For mammals and birds, the test therefore gives no evidence of a relationship between category and protection status, i.e. the data are consistent with the two variables being independent.
from scipy.stats import chi2_contingency
contingency1 = [[30, 146],
[75, 413]]
chi2_contingency(contingency1)
(0.1617014831654557,
0.6875948096661336,
1,
array([[ 27.8313253, 148.1686747],
[ 77.1686747, 410.8313253]]))
The next pair tests the difference between Reptile and Mammal.
The format is again as below:
| | protected | not protected |
|---|---|---|
| Mammal | ? | ? |
| Reptile | ? | ? |
This time the p-value is 0.039, which is below the standard threshold of 0.05, so the difference between reptiles and mammals can be taken as statistically significant. Mammals have a statistically significantly higher rate of needed protection than reptiles.
contingency2 = [[30, 146],
[5, 73]]
chi2_contingency(contingency2)
(4.289183096203645,
0.03835559022969898,
1,
array([[ 24.2519685, 151.7480315],
[ 10.7480315, 67.2519685]]))
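To make the numbers above less of a black box, here is a minimal hand calculation for a 2x2 table, written in plain Python. It assumes the test applied a continuity (Yates) correction, which is SciPy's documented default when there is one degree of freedom; the function name chi2_2x2_yates is hypothetical and not part of SciPy. Expected counts come from row total * column total / grand total, and the p-value for one degree of freedom is computed with math.erfc.

```python
import math


def chi2_2x2_yates(table):
    """Chi-squared statistic and p-value for a 2x2 table (illustrative helper)."""
    (a, b), (c, d) = table
    total = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if category and protection were independent.
            expected = row_totals[i] * col_totals[j] / total
            # Continuity correction: shrink |observed - expected| by at most 0.5.
            deviation = abs(observed - expected)
            deviation -= min(0.5, deviation)
            statistic += deviation ** 2 / expected

    # Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


stat_mammal_bird, p_mammal_bird = chi2_2x2_yates([[30, 146], [75, 413]])
stat_mammal_reptile, p_mammal_reptile = chi2_2x2_yates([[30, 146], [5, 73]])

print(f"mammal vs bird:    chi2={stat_mammal_bird:.4f}, p={p_mammal_bird:.4f}")
print(f"mammal vs reptile: chi2={stat_mammal_reptile:.4f}, p={p_mammal_reptile:.4f}")
```

The intermediate expected counts should line up with the arrays in the two outputs above, for example 176 * 105 / 664 for protected mammals in the first table.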
Species in Parks
The next set of analyses uses data from conservationists who have been recording sightings of different species at several national parks for the past 7 days.
The first step is to look at the common names from species to get an idea of the most prevalent animals in the dataset. The data will need to be split up into individual names.
from itertools import chain
import string
def remove_punctuations(text):
    # Strip every ASCII punctuation character from the text.
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text
common_Names = species[species.category == "Mammal"]\
.common_names\
.apply(remove_punctuations)\
.str.split().tolist()
common_Names[:6]
[['Gappers', 'RedBacked', 'Vole'],
['American', 'Bison', 'Bison'],
['Aurochs',
'Aurochs',
'Domestic',
'Cattle',
'Feral',
'Domesticated',
'Cattle'],
['Domestic', 'Sheep', 'Mouflon', 'Red', 'Sheep', 'Sheep', 'Feral'],
['Wapiti', 'Or', 'Elk'],
['WhiteTailed', 'Deer']]
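As an aside, the same punctuation stripping can be done in a single pass with str.translate. This is only an alternative sketch, not the notebook's method: the helper name strip_punctuation is made up, and the sample string is an assumed original for the first row shown above.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation in one pass (illustrative helper)."""
    return text.translate(PUNCTUATION_TABLE)


# Assumed raw common name for the first row of the output above.
sample = "Gapper's Red-Backed Vole"
cleaned_words = strip_punctuation(sample).split()

print(cleaned_words)
```

Because the apostrophe and hyphen are deleted rather than replaced by spaces, "Red-Backed" collapses into the single token "RedBacked", which matches what the output above shows.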
The next step is to clean up duplicate words in each row, since they should not be counted more than once per species.
cleanRows = []
for item in common_Names:
    # dict.fromkeys removes repeats while keeping the original word order.
    item = list(dict.fromkeys(item))
    cleanRows.append(item)
cleanRows[:6]
[['Gappers', 'RedBacked', 'Vole'],
['American', 'Bison'],
['Aurochs', 'Domestic', 'Cattle', 'Feral', 'Domesticated'],
['Domestic', 'Sheep', 'Mouflon', 'Red', 'Feral'],
['Wapiti', 'Or', 'Elk'],
['WhiteTailed', 'Deer']]
Next the words need to be collapsed into one list for easier use.
res = list(chain.from_iterable(i if isinstance(i, list) else [i] for i in cleanRows))
res[:6]
['Gappers', 'RedBacked', 'Vole', 'American', 'Bison', 'Aurochs']
Now the data is ready for counting the number of occurrences of each word. From this analysis, it seems that Bat occurred 23 times while Shrew came up 18 times.
words_counted = []
for i in res:
    # Count how often each word appears in the flattened list.
    x = res.count(i)
    words_counted.append((i, x))
pd.DataFrame(set(words_counted), columns =['Word', 'Count']).sort_values("Count", ascending = False).head(10)
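The flatten-and-count steps above can also be expressed with collections.Counter, which tallies every word in one pass instead of calling count once per word. The sketch below is an alternative, not the notebook's method; it runs on just the six de-duplicated rows printed earlier, so its counts cover that sample only and say nothing about the full Bat or Shrew totals.

```python
from collections import Counter
from itertools import chain

# The first six de-duplicated rows, copied from the output above.
clean_rows = [
    ["Gappers", "RedBacked", "Vole"],
    ["American", "Bison"],
    ["Aurochs", "Domestic", "Cattle", "Feral", "Domesticated"],
    ["Domestic", "Sheep", "Mouflon", "Red", "Feral"],
    ["Wapiti", "Or", "Elk"],
    ["WhiteTailed", "Deer"],
]

# Flatten the rows and count each word in a single pass.
word_counts = Counter(chain.from_iterable(clean_rows))

# Words that appear in more than one of the six sample rows.
repeated_words = sorted(word for word, n in word_counts.items() if n > 1)
print(repeated_words)
```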
In the data, there are several different scientific names for different types of bats. The next task is to figure out which rows of species refer to bats. A new column of boolean values, is_bat, will be created to flag those rows with True.
species['is_bat'] = species.common_names.str.contains(r"\bBat\b", regex = True)
species.head(10)
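The \b markers in the pattern are word boundaries, so the match requires Bat as a whole, capitalized word. A small sketch with Python's re module shows the effect; the sample names are hypothetical and chosen only to exercise the edge cases.

```python
import re

# Same pattern as used for the is_bat column above.
bat_pattern = re.compile(r"\bBat\b")

# Hypothetical names chosen to exercise the word boundaries.
sample_names = ["Little Brown Bat", "Batfish", "Wombat", "Bats"]

# True only where "Bat" appears as a standalone, capitalized word.
matches = [bool(bat_pattern.search(name)) for name in sample_names]

for name, matched in zip(sample_names, matches):
    print(f"{name!r}: {matched}")
```

One consequence is that a plural such as "Bats" is not flagged, so any row that only uses the plural form would be missed by this pattern.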
Here is a subset of the data where is_bat is true, showing the rows that matched. There seem to be a lot of bat species, with a mix of protected and non-protected species.
species[species.is_bat]
Next the results of the bat species will be merged with observations to create a DataFrame with observations of bats across the four national parks.
bat_observations = observations.merge(species[species.is_bat])
bat_observations
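When merge is called without arguments, pandas joins on the columns the two frames share, which here is scientific_name. The sketch below spells the join out on hypothetical toy frames (every name and number in it is invented) and adds the validate argument as an optional guard that the right-hand frame has one row per key.

```python
import pandas as pd

# Hypothetical toy frames; only the column names mirror the real data.
toy_species = pd.DataFrame({
    "scientific_name": ["bat_species_1", "other_species_1"],
    "is_protected": [True, False],
    "is_bat": [True, False],
})
toy_observations = pd.DataFrame({
    "scientific_name": ["bat_species_1", "bat_species_1", "other_species_1"],
    "park_name": ["Park A", "Park B", "Park A"],
    "observations": [10, 20, 30],
})

# Explicit inner join on the shared key; validate raises if a key is
# repeated in the right-hand frame.
toy_bats = toy_observations.merge(
    toy_species[toy_species.is_bat],
    on="scientific_name",
    how="inner",
    validate="many_to_one",
)

# Total observations per park for the matched rows.
park_totals = toy_bats.groupby("park_name").observations.sum()
print(toy_bats)
print(park_totals)
```

If a scientific name appears on more than one row of species, an inner merge repeats the matching observation rows and inflates the sums, so a check like validate (or de-duplicating first) may be worth considering on the real data.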
Let’s see how many total bat observations (across all species) were made at each national park.
The total number of bats observed in each park over the past 7 days is shown in the table below. Yellowstone National Park has the most with 8,362 observations, and the Great Smoky Mountains National Park has the fewest with 2,411.
bat_observations.groupby('park_name').observations.sum().reset_index()
Now let’s see each park broken down by protected bats vs. non-protected bat sightings. It seems that every park except for the Great Smoky Mountains National Park has more sightings of protected bats than not. This could be considered a great sign for bats.
obs_by_park = bat_observations.groupby(['park_name', 'is_protected']).observations.sum().reset_index()
obs_by_park
Below is a plot of the output from the last data manipulation. From this chart one can see that Yellowstone and Bryce National Parks seem to be doing a great job with their bat populations, since there are more sightings of protected bats than of non-protected species. The Great Smoky Mountains National Park might need to step up its conservation efforts, as it has seen more non-protected species.
plt.figure(figsize=(16, 4))
sns.barplot(x=obs_by_park.park_name, y= obs_by_park.observations, hue=obs_by_park.is_protected)
plt.xlabel('National Parks')
plt.ylabel('Number of Observations')
plt.title('Observations of Bats per Week')
plt.show()
Conclusions
The project was able to make several data visualizations and inferences about the various species in four of the National Parks that comprised this data set.
This project was also able to answer some of the questions first posed in the beginning:
- What is the distribution of conservation status for species?
  - The vast majority of species were not part of conservation (5,633 vs 191).
- Are certain types of species more likely to be endangered?
  - Mammals and birds had the highest percentage of species under protection.
- Are the differences between species and their conservation status significant?
  - While mammals and birds did not have a significant difference in conservation percentage, mammals and reptiles exhibited a statistically significant difference.
- Which animal is most prevalent and what is their distribution amongst parks?
  - The study found that bats occurred the most times and that they were most likely to be found in Yellowstone National Park.