CMSC320 Final Project

An Analysis of Heart Diseases and Attributes Leading to Heart Disease

By: Jordan Waite

Introduction

Heart disease is an issue plaguing society today; until recently, it was the number one cause of death every year in the United States. COVID-19 has taken its place this year, but it is also common knowledge now that obesity and other conditions that play a role in heart disease are common risk factors for more serious cases of COVID.
The goal of this project is to look at the determining factors of heart disease and other attributes, determine which people would be at higher risk, and identify which attributes you may want to control to reduce your risk. This dataset contains 13 attributes and then the target column. This column indicates the presence of heart disease or not (0 = heart disease, 1 = no disease present).

Imports

Data Collection

1. age
2. sex
3. chest pain type (4 values, 0-3, progressively increasing)
4. resting blood pressure (systolic number)
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
7. resting electrocardiographic results (values 0, 1, 2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. target: 0 = heart disease; 1 = no heart disease

Cleaning Data

The data is already in a usable format, so not much cleaning is needed. All values are already numeric, which will help us later when running random forests. One thing that I will be doing is renaming some of the columns so they are more easily identifiable. This will also help later when graphing certain conditions. I will also be adding IDs for each patient. The names of the patients were removed from the original databases for privacy, so IDs will allow me to refer to individuals.
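The renaming and ID step described above might look something like the sketch below. The short column names (`cp`, `trestbps`, `chol`, `thalach`), the descriptive replacements, and the `patient_id` column are assumptions for illustration, not necessarily the names used in this project, and the two rows are made-up values rather than records from the dataset.

```python
import pandas as pd

# Toy frame with made-up values; the short column names are assumed,
# not taken from the project's CSV.
df = pd.DataFrame({
    "age": [52, 61],
    "cp": [0, 2],
    "trestbps": [125, 140],
    "chol": [212, 260],
    "thalach": [168, 150],
    "target": [0, 1],
})

# Hypothetical descriptive names to make plots easier to label.
readable_names = {
    "cp": "chest_pain_type",
    "trestbps": "resting_blood_pressure",
    "chol": "cholesterol",
    "thalach": "max_heart_rate",
}
df = df.rename(columns=readable_names)

# Add a simple 1-based identifier so individual patients can be referenced.
df.insert(0, "patient_id", list(range(1, len(df) + 1)))

print(df.columns.tolist())
```

Putting the ID in the first column keeps it visible when printing rows, but using the data frame's index for this would work just as well.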

Graphs

A great way to visualize data is with graphs. I will correlate some of the data to see how each factor individually affects heart disease.
I have split the data into multiple data frames by age range so I can examine it further. This will allow me to correlate data to age ranges. The ages examined center on people in their 50s. Subjects in their 20s and 70s seem to be the outliers, while the main subjects are those in their 40s, 50s, and 60s.
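One way to split a data frame by age range is sketched below. The `age_range` column, the `frames_by_age` dictionary, and the rows themselves are hypothetical and only illustrate the idea; the project's own split may have been done differently.

```python
import pandas as pd

# Made-up ages and labels for illustration only
# (0 = heart disease, 1 = no disease).
df = pd.DataFrame({
    "age": [29, 41, 45, 52, 58, 63, 67, 74],
    "target": [1, 1, 0, 0, 1, 0, 0, 1],
})

# Label each row with its decade, e.g. 52 -> "50s".
df["age_range"] = (df["age"] // 10 * 10).astype(str) + "s"

# One data frame per age range, keyed by the decade label.
frames_by_age = {label: group for label, group in df.groupby("age_range")}

print(sorted(frames_by_age))
print(len(frames_by_age["40s"]))
```

Each entry such as `frames_by_age["50s"]` can then be plotted or summarized on its own.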
I used this graph to see whether any particular age range is more susceptible to heart disease. Based on the data, there seems to be no strong correlation between age and heart disease: many people with it are young and many are old, and there are also many people in every age range who don't have it. There is, however, an increasing trend of heart disease with age.
This graph shows that as chest pain increases, so does the likelihood of heart disease. Chest pain is most likely not a cause of heart disease but a byproduct of it. This suggests that people who have chest pain should get checked for heart disease so it can be addressed before it gets worse.
There appears to be little to no correlation between heart disease and blood pressure in this dataset, but we do know from other studies that hypertension is a common issue leading to heart disease.
This line plot shows an upward trend: people with a higher max heart rate have a lower likelihood of developing heart disease. This could possibly be because they are healthier individuals.
We see here from the regression line that as people's cholesterol levels get higher, they tend to have a higher risk of heart disease. Lowering cholesterol is a commonly known way to reduce the risk of heart complications.
There could be multiple interpretations of this data. I see that people who die of heart disease generally die younger, so this puts the average age around 56.6. To calculate this, I added up the ages of all people with heart disease in the dataset and counted the total number of people with heart disease. I then divided the sum of ages by that count.
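The calculation described above can be sketched in a few lines. The `patients` list holds made-up (age, target) pairs, not rows from the dataset, so the result here is not the 56.6 reported above.

```python
# Made-up (age, target) pairs for illustration;
# 0 = heart disease, 1 = no disease.
patients = [(50, 0), (60, 0), (64, 0), (35, 1), (70, 1)]

# Keep only the ages of people with heart disease (target == 0).
ages_with_disease = [age for age, target in patients if target == 0]

# Sum of ages divided by the number of people with heart disease.
average_age = sum(ages_with_disease) / len(ages_with_disease)

print(average_age)
```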

Relationships Between Attributes

I will be looking at relationships not only between attributes and heart disease but also between attributes and other attributes. Many diseases and disorders are progressive. This will allow me to see if one attribute can cause another.
While the regression line does not fit the data perfectly, it does show that there is a positive correlation between resting blood pressure and cholesterol level. As someone's cholesterol level increases, their resting blood pressure tends to increase. Both of these have a positive correlation with heart disease as well. Seeing this correlation could mean that reducing your cholesterol would allow you to have a lower blood pressure. This would help reduce two attributes that contribute to heart disease.
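For readers who want to see what sits behind a regression line like this one, here is a minimal sketch of an ordinary least-squares fit together with the Pearson correlation coefficient. The helper `fit_line` and the five data points are hypothetical; the project's plot was presumably produced by a plotting library rather than by hand.

```python
import math

def fit_line(xs, ys):
    """Return (slope, intercept, r) for an ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r

# Made-up cholesterol (mg/dl) and resting blood pressure values.
cholesterol = [180, 210, 240, 270, 300]
resting_bp = [118, 130, 126, 140, 138]

slope, intercept, r = fit_line(cholesterol, resting_bp)
print(f"slope={slope:.3f}, intercept={intercept:.1f}, r={r:.3f}")
```

A positive slope with `r` well below 1 matches the situation described above: the trend is upward, but the line does not fit every point.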
The ST segment covers the last two points a person sees in the wave for a heartbeat. PQRST is the whole wave: P is the upper peak before the first depression, Q is the first valley, R is the highest peak in the wave, S is the next valley, and T is the final point in the whole wave. We are looking at the S depression, the valley before the final T peak. An increase in this depression on the EKG is correlated with a rise in resting blood pressure. This means reducing blood pressure could reduce this depression, which would help decrease the risk of heart disease. More information on the PQRST wave is pictured below.

Machine Learning Algorithms

Normally these algorithms would require me to set up the data so it can be correctly interpreted by the functions. The data we were given, though, is completely numerical, which makes things easier because I do not need to clean the data to fit them at all. I chose K-NN classification and random forests for this part.

K-NN Classification

Based on the two lines, which are closest at 3 neighbors, this should give the most accurate measurements, so we will run K-NN classification with 3 neighbors.
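A comparison like the one described above can be sketched with scikit-learn as follows. The data comes from `make_classification` as a synthetic stand-in for the heart dataset, and the 75/25 split, the range of neighbors tried, and the name `gaps` are all assumptions, so the `best_k` found here will not necessarily be 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the heart data: 13 numeric features, binary target.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Gap between training and testing accuracy for each number of neighbors.
gaps = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc = knn.score(X_train, y_train)
    test_acc = knn.score(X_test, y_test)
    gaps[k] = abs(train_acc - test_acc)

# Pick the k where the two accuracy curves are closest.
best_k = min(gaps, key=gaps.get)
print(best_k)
```

The smallest gap is only one heuristic for choosing the number of neighbors; the testing accuracy at that value is worth checking as well.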
Running the K-NN algorithm with 3 neighbors gives us a very high 10-fold cross-validation score of 94.8%. It also gives us a standard error of 3%, which is less than our acceptable margin of 5%. This means that, given the data we have, using a K-NN algorithm to predict someone's risk of having heart disease is acceptable.

Random Forests

The testing accuracy and training accuracy converge on this random forest plot at 7 estimators, so we will use 7 estimators for our random forests test.
Using the optimal number of estimators, 7, I ran random forests on the data. This gave me a 10-fold cross-validation accuracy of 96%, so this is a very accurate predictor. The standard error comes in at 2.1%, well within our acceptable margin of 5%. This means that random forests is an acceptable algorithm to determine someone's risk of having heart disease.
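A 10-fold cross-validation run of this kind might look like the sketch below. It uses synthetic data from `make_classification` in place of the heart dataset, so the scores will differ from the 96% and 2.1% reported above. Reporting the spread as the standard deviation of the fold scores is also an assumption, since the exact formula used in this project is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart data: 13 numeric features, binary target.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Random forest with 7 trees, matching the number of estimators chosen above.
forest = RandomForestClassifier(n_estimators=7, random_state=0)

# One accuracy score per fold.
scores = cross_val_score(forest, X, y, cv=10)

mean_accuracy = scores.mean()
spread = scores.std()  # assumed stand-in for the reported error figure
print(f"accuracy: {mean_accuracy:.3f} (+/- {spread:.3f})")
```

Swapping `RandomForestClassifier` for `KNeighborsClassifier(n_neighbors=3)` would give the corresponding K-NN run.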

Conclusion

Heart disease is a very prominent issue in the United States, and it is preventable as well. Throughout this project we went through a data analysis step by step. We first found the data on Kaggle and imported it as a CSV file. This gave us a relatively clean data set, so we only needed to change some column names for better interpretation. If the data had come in a different form, we would have had to clean it before we could perform exploratory data analysis on it. We then used some machine learning principles and algorithms, including a K-NN classifier (not regressor) and random forests, to find the best performing algorithm to predict heart disease based on the attributes given to us in the data. Kaggle's format made this very easy: the data was entirely numerical, so I could run the algorithms without changing anything to meet ML's numerical requirements.

Extra Resources

Here are a few resources for further reading:
1. https://www.cdc.gov/heartdisease/facts.htm
2. https://towardsdatascience.com/heart-disease-prediction-73468d630cfc
3. https://ieee-dataport.org/open-access/heart-disease-dataset-comprehensive
4. https://www.webmd.com/heart-disease/default.htm
5. https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/
6. https://builtin.com/data-science/random-forest-algorithm
7. https://medium.com/@hjhuney/implementing-a-random-forest-classification-model-in-python-583891c99652