2. Data Problem Description and Analysis

2.1 Data Issues

Exploratory data analysis (EDA) and preprocessing revealed several data quality and consistency problems, the most prevalent being inconsistent values and a lack of standardization. The records come from multiple annual files that were maintained by hand, so they required detailed cleaning and preprocessing.

Data quality was a further challenge. The extent of cleaning and transformation required points to problems in the raw data that more standardized and automated data collection and management could mitigate in the future. In addition, null values in several critical variables called for a cautious handling strategy to avoid introducing bias or errors into subsequent analyses.

2.1.1 Implemented Solutions and Strategies

Several strategies were implemented to address these issues. Data standardization was a crucial step: the annual datasets were concatenated and standardized into a single, coherent dataset. Text processing and data handling techniques were then used to clean and transform the variables, improving data quality for later analyses. For NA values, a handling strategy was proposed based on the nature of each variable and its clinical context.
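The concatenation and standardization step can be illustrated with a minimal pandas sketch. It is not the project's actual implementation: the helper names `standardize_columns` and `concat_annual`, the naming rule (trimmed, lower-case, underscores), and the toy values are all assumptions made for illustration.

```python
import pandas as pd


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with trimmed, lower-case, underscore-separated column names."""
    out = df.copy()
    out.columns = (
        out.columns.astype(str)
        .str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )
    return out


def concat_annual(frames: dict[int, pd.DataFrame]) -> pd.DataFrame:
    """Standardize each annual frame, tag it with its year, and stack them."""
    tagged = [
        standardize_columns(df).assign(year=year) for year, df in frames.items()
    ]
    # Columns missing in a given year are filled with NaN by concat.
    return pd.concat(tagged, ignore_index=True, sort=False)


# Toy frames standing in for two annual files (values are invented).
frames = {
    2021: pd.DataFrame({"Hospital ": ["A"], "Motivo Ing": ["control"]}),
    2022: pd.DataFrame({"hospital": ["B"], "Complicaciones": ["no"]}),
}
combined = concat_annual(frames)
print(combined.columns.tolist())
```

Tagging each frame with its year before stacking keeps the origin of every record traceable after the files are merged.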

The use of Kedro facilitated a robust and reproducible data science workflow, ensuring that analyses and models could be easily replicated and audited.

2.1.2 Challenges and Future Considerations

Although measures have been taken to mitigate the identified data problems, some challenges remain. Automating the process, from data collection through the initial preprocessing phases, could improve efficiency and consistency. More sophisticated strategies or predictive models are also needed to handle missing data, especially in critical variables. Finally, because this work focused on cleaning and preprocessing, subsequent analysis stages should explore the patterns and relationships in the data in greater depth.

2.2 Raw Data Description

2.2.1 Data Dictionary

The raw data contain the following columns:

Data dictionary for the raw data
Column Name	Data Type	Description
Hospital	Object	Name or identifier of the hospital.
Servicio	Object	Service type or department within the hospital.
AP	Object	Possibly refers to a specific medical procedure or department.
Otros	Object	A category for other miscellaneous entries.
Diagnostico	Object	Diagnosis information.
Motivo Ing	Object	Reason for admission or inquiry (e.g., symptom control).
paliativo Onc	Object	Indicates if palliative care for oncology is provided.
Paliativo No Onc	Object	Indicates if palliative care for non-oncology is provided.
Fiebre	Object	Indicates presence of fever.
Disnea	Object	Indicates presence of shortness of breath.
Dolor	Object	Indicates presence of pain.
Delirium	Object	Indicates presence of delirium.
Astenia	Object	Indicates presence of asthenia (weakness).
Anorexia	Object	Indicates presence of anorexia.
Otros.1	Object	Another category for other miscellaneous entries.
P terminal	Object	Possibly refers to terminal phase of a condition.
Agonía	Object	Indicates whether the patient is in the agonal phase (last days of life).
PS/IK	Object	Possibly refers to performance status or a specific score/index.
Barthel	Object	Possibly refers to the Barthel Index, a measure of disability.
GDS-FAST	Object	Possibly refers to the Global Deterioration Scale or Functional Assessment Staging Test.
EVA ing	Object	Possibly refers to a type of assessment or score at admission.
Otros.2	Object	Another category for other miscellaneous entries.
Complicaciones	Object	Indicates presence of complications.
Nº estancias	Float64	Number of stays or admissions.
Nº visitas	Float64/Object	Number of visits.
SEDACIÓN	Object	Indicates presence of sedation.
Mot. ALTA	Object	Reason for discharge or end of care.
Médico	Object	Name or identifier of the medical professional.
unnamed.1	Float64	Unspecified column with sparse data, likely an error or misplaced data.
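The mixed Float64/Object type listed for Nº visitas suggests that some annual files contain non-numeric entries in that column. A minimal sketch of one way to coerce such a column, assuming invented sample values and that unparseable entries should become missing values rather than raise an error:

```python
import pandas as pd

# Invented values imitating a count column that mixes numbers and free text.
raw = pd.DataFrame({"Nº visitas": [3, "5", "2 visitas", None, "n/a"]})

# errors="coerce" turns anything unparseable into NaN instead of raising.
visits = pd.to_numeric(raw["Nº visitas"], errors="coerce")

# Keep the entries that were lost in the conversion for manual review.
unparsed = raw.loc[visits.isna() & raw["Nº visitas"].notna(), "Nº visitas"]

print(visits.tolist())
print(unparsed.tolist())
```

Listing the unparsed entries separately makes it possible to decide case by case whether they can be recovered before they are treated as missing.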

2.2.2 Data Features

The datasets vary year by year, not only in terms of the number of entries but also in their structure and quality. Below are the main characteristics of the annual datasets from 2017 to 2022:

Inconsistencies in Column Names: The datasets exhibit variability in column names, reflecting a lack of standardization in data capture and storage.
Variability in Columns: Some years have additional columns or fewer columns compared to other years, highlighting the need to align and reconcile these differences during preprocessing.
Presence of Null Values: Certain columns, such as “Discharge Date” in 2022 and “Complications” in other years, have a significant number of null values, which requires a considered strategy for handling missing data.
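The differences in columns between years can be made explicit with a presence table. The following is a sketch under assumptions: the helper `column_presence` is hypothetical, the toy frames stand in for the annual datasets, and the column name "Fecha Alta" is invented for illustration.

```python
import pandas as pd


def column_presence(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Build a boolean table: rows are column names, columns are dataset names."""
    all_columns = sorted(set().union(*(df.columns for df in frames.values())))
    return pd.DataFrame(
        {
            name: [col in df.columns for col in all_columns]
            for name, df in frames.items()
        },
        index=all_columns,
    )


# Toy frames; in the project these would be loaded from the Kedro catalog.
frames = {
    "hado_21": pd.DataFrame(columns=["Hospital", "Complicaciones"]),
    "hado_22": pd.DataFrame(columns=["Hospital", "Fecha Alta"]),
}
presence = column_presence(frames)

# Columns that are not present in every year need reconciling.
print(presence[~presence.all(axis=1)])
```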

Note

This code snippet explores multiple DataFrames within a Kedro project.

Requirements: Make sure the Kedro project has been run and its data are available. The DataFrame names must be defined in the Kedro catalog. If you are using an IDE such as VS Code, first run %load_ext kedro.ipython

Exploring Multiple DataFrames

# Exploring Multiple DataFrames
# Requirements: the Kedro project must have been run so its data exist, and the DataFrame names must be defined in the catalog.

# DataFrame names that you want to explore
dfs_names = ['hado_22', 'hado_21', 'hado_20', 'hado_19', 'hado_18', 'hado_17']

# Loop to print information for each DataFrame
for name in dfs_names:
    # Assuming 'catalog.load' loads the DataFrame based on its name
    # Adjust this line as needed if it's not the case
    df = catalog.load(name)

    print(f"Information and Describe for DataFrame: {name}")
    print("-----------------------------------")

    # Display the DataFrame structure; info() prints directly and returns None
    df.info()
    # Display summary statistics for all columns
    print(df.describe(include='all').T)

    print("\n\n")  # Add blank lines to separate the output for each DataFrame

2.2.3 Data Quality

Missing Data: The variable “Discharge Date” is available only for half of 2022, which limits any temporal analysis involving it. Missing data in the other variables must also be investigated and handled adequately to avoid bias in subsequent analyses.

Exploring null values

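import missingno as msno
# Load the concatenated dataset from the Kedro catalog
df = catalog.load('hado_concat')
# Draw the nullity matrix: gaps in a column mark missing values
msno.matrix(df)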
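A numeric summary can complement the visual matrix. The sketch below is an illustration only: the helper `null_summary` is hypothetical and the example frame uses invented values, to be replaced with the concatenated DataFrame.

```python
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame(
        {"n_missing": counts, "pct_missing": (counts / len(df) * 100).round(1)}
    )
    return summary.sort_values("n_missing", ascending=False)


# Invented example; replace with the concatenated DataFrame.
example = pd.DataFrame(
    {"Complicaciones": [None, None, "no", None], "Hospital": ["A", "B", "C", "D"]}
)
print(null_summary(example))
```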

Data Consistency: Data consistency will be evaluated by analyzing outliers and unexpected values in each variable.
Variable Cardinality: Variables such as “Main Diagnosis” and “Reason for Admission (ING)” have high cardinality, which complicates analysis. Strategies such as grouping or transforming categories are needed to handle this complexity.
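One simple way to reduce cardinality is to collapse infrequent categories into a single label. This is a sketch of that general technique, not the grouping used in the project: the helper `group_rare`, the threshold, the label "other", and the sample diagnoses are assumptions.

```python
import pandas as pd


def group_rare(series: pd.Series, min_count: int = 2, other: str = "other") -> pd.Series:
    """Replace categories seen fewer than `min_count` times with a single label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Missing values are left untouched so they can be handled separately.
    return series.where(~series.isin(rare), other)


# Invented sample values for a free-text diagnosis column.
diagnoses = pd.Series(["ca pulmon", "ca pulmon", "epoc", "icc", None])
grouped = group_rare(diagnoses)
print(grouped.tolist())
```

Frequency-based grouping is only a first pass; clinically meaningful groups usually need a mapping reviewed by domain experts.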

Pandas Profiling Report

from ydata_profiling import ProfileReport
profile = ProfileReport(df, title='Pandas Profiling Report')
# Render the report inline; `df` is the DataFrame loaded above
profile.to_notebook_iframe()

2.2.4 Variable Distribution

The Exploratory Data Analysis (EDA) in subsequent stages, through continuous iterations, will provide a clearer view of the distribution of clinical and demographic variables, such as the distribution of diagnoses, length of stays, and visits. This analysis also seeks to identify patterns and anomalies in the data that may be of interest.

2.2.5 Preprocessing Strategy

Data preprocessing focuses on handling missing data, resolving inconsistencies, and transforming high-cardinality variables. It also aims to generate new variables (feature engineering) that can enrich subsequent analyses, for example adding the year of each dataset, flagging the use of different antibiotics, and grouping diagnoses, discharges, and admissions. NLP techniques can be used to extract and categorize relevant information from free-text variables such as “Main Diagnosis”.
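Grouping free-text diagnoses can start with simple keyword rules before any heavier NLP is applied. The sketch below is illustrative only: the patterns in `DIAGNOSIS_GROUPS`, the group names, the helper `categorize_diagnosis`, and the sample values are hypothetical, and a real mapping would need clinical review.

```python
import re

import pandas as pd

# Hypothetical keyword patterns; a real mapping would need clinical review.
DIAGNOSIS_GROUPS = {
    "oncology": r"\b(?:ca|cancer|neoplasia|tumor)\b",
    "respiratory": r"\b(?:epoc|neumonia)\b",
    "cardiac": r"\b(?:icc|insuficiencia cardiaca)\b",
}


def categorize_diagnosis(text: object) -> str:
    """Return the first group whose pattern matches, else 'other' or 'unknown'."""
    if pd.isna(text):
        return "unknown"
    lowered = str(text).lower()
    for group, pattern in DIAGNOSIS_GROUPS.items():
        if re.search(pattern, lowered):
            return group
    return "other"


# Invented sample values for the Diagnostico column.
df = pd.DataFrame(
    {"Diagnostico": ["Ca pulmon", "EPOC reagudizado", "ICC", "Fractura", None]}
)
df["diagnosis_group"] = df["Diagnostico"].map(categorize_diagnosis)
print(df["diagnosis_group"].tolist())
```

Keeping "unknown" separate from "other" preserves the distinction between a missing diagnosis and one that no rule covers.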