2. Data Problem Description and Analysis

2.1 Data Issues

Throughout exploratory data analysis (EDA) and preprocessing, various challenges related to data quality and consistency were identified. Among these, inconsistencies and a lack of standardization were particularly prevalent. The records, which originate from multiple annual files and were managed manually, presented a series of inconsistencies and gaps that demanded detailed cleaning and preprocessing.

Data quality itself was also a challenge. The need for extensive cleaning and transformation indicated problems in the raw data that could be mitigated in the future through more standardized and automated data collection and management. Furthermore, the presence of null values in several critical variables required a cautious handling strategy to avoid introducing biases or errors into subsequent analyses.

2.1.1 Implemented Solutions and Strategies

To address these issues, several strategies and solutions were implemented. Data standardization was a crucial step: a unified and coherent dataset was created through the concatenation and standardization of the various annual datasets. Additionally, text processing and data handling techniques were used to clean and transform the variables, improving data quality for future analyses. In particular, a strategy was proposed to handle NA values, grounded in the nature of each variable and the clinical context.
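As an illustration of what a per-variable NA strategy can look like, the sketch below uses column names from the raw data dictionary, but the rows and the fill rules are invented for the example, not the project's actual rules:

```python
import pandas as pd

# Invented example rows; column names follow the raw data dictionary
df = pd.DataFrame({
    "Fiebre": ["si", None, "no", None],
    "Nº estancias": [12.0, None, 7.0, 3.0],
})

# Symptom flag: assume an empty cell means the symptom was not recorded,
# i.e. absent, so missing values are filled with "no" (a clinical-context assumption)
df["Fiebre"] = df["Fiebre"].fillna("no")

# Numeric count: left as NaN so summary statistics simply skip it
# instead of being distorted by an arbitrary imputed value
print(df["Fiebre"].tolist())      # ['si', 'no', 'no', 'no']
print(df["Nº estancias"].mean())  # NaN rows are excluded automatically
```

The key point is that the rule differs per column: categorical symptom flags get a clinically justified default, while numeric counts stay missing.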

The use of Kedro facilitated a robust and reproducible data science workflow, ensuring that analyses and models could be easily replicated and audited.
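Much of that reproducibility comes from Kedro's data catalog, where each dataset is declared once and then loaded by name (as in the `catalog.load` calls below). A minimal sketch of such catalog entries follows; the dataset types and file paths are assumptions for illustration, not the project's actual configuration:

```yaml
# conf/base/catalog.yml -- hypothetical entries mirroring the names used later
# (dataset type names follow kedro-datasets; older Kedro versions spell them
# pandas.ExcelDataSet / pandas.ParquetDataSet)
hado_22:
  type: pandas.ExcelDataset
  filepath: data/01_raw/hado_2022.xlsx

hado_concat:
  type: pandas.ParquetDataset
  filepath: data/03_primary/hado_concat.parquet
```

Because every dataset lives in the catalog, any notebook or pipeline node loads exactly the same data the same way, which is what makes the analyses replicable and auditable.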

2.1.2 Challenges and Future Considerations

Although measures have been taken to mitigate the identified data problems, challenges and considerations remain for the future. Automating the process, from data collection through the initial preprocessing phases, could improve efficiency and consistency. It is also imperative to develop more sophisticated strategies or predictive models to handle missing data, especially in critical variables. Finally, although this work focused on cleaning and preprocessing, subsequent analysis stages should explore the patterns and relationships in the data more deeply.

2.2 Raw Data Description

2.2.1 Data Dictionary

These are the columns in the raw data:

Data Dictionary: Raw Data

| Column Name | Data Type | Description |
| --- | --- | --- |
| Hospital | object | Name or identifier of the hospital. |
| Servicio | object | Service type or department within the hospital. |
| AP | object | Possibly refers to a specific medical procedure or department. |
| Otros | object | A category for other miscellaneous entries. |
| Diagnostico | object | Diagnosis information. |
| Motivo Ing | object | Reason for admission or inquiry (e.g., symptom control). |
| paliativo Onc | object | Indicates whether oncological palliative care is provided. |
| Paliativo No Onc | object | Indicates whether non-oncological palliative care is provided. |
| Fiebre | object | Indicates presence of fever. |
| Disnea | object | Indicates presence of shortness of breath. |
| Dolor | object | Indicates presence of pain. |
| Delirium | object | Indicates presence of delirium. |
| Astenia | object | Indicates presence of asthenia (weakness). |
| Anorexia | object | Indicates presence of anorexia. |
| Otros.1 | object | Another category for other miscellaneous entries. |
| P terminal | object | Possibly refers to the terminal phase of a condition. |
| Agonía | object | Indicates presence of agony. |
| PS/IK | object | Possibly refers to performance status or a specific score/index. |
| Barthel | object | Possibly refers to the Barthel Index, a measure of disability. |
| GDS-FAST | object | Possibly refers to the Global Deterioration Scale or Functional Assessment Staging Test. |
| EVA ing | object | Possibly refers to a type of assessment or score at admission. |
| Otros.2 | object | Another category for other miscellaneous entries. |
| Complicaciones | object | Indicates presence of complications. |
| Nº estancias | float64 | Number of stays or admissions. |
| Nº visitas | float64/object | Number of visits. |
| SEDACIÓN | object | Indicates presence of sedation. |
| Mot. ALTA | object | Reason for discharge or end of care. |
| Médico | object | Name or identifier of the medical professional. |
| unnamed.1 | float64 | Unspecified column with sparse data, likely an error or misplaced data. |

2.2.2 Data Features

The datasets vary year by year, not only in terms of the number of entries but also in their structure and quality. Below are the main characteristics of the annual datasets from 2017 to 2022:

  • Inconsistencies in column names: The datasets exhibit variability in column names, reflecting a lack of standardization in data capture and storage.

  • Variability in Columns: Some years have additional columns or fewer columns compared to other years, highlighting the need to align and reconcile these differences during preprocessing.

  • Presence of Null Values: Certain columns, such as “Discharge Date” in 2022 and “Complications” in other years, have a significant number of null values, which requires a considered strategy for handling missing data.

Note

This code snippet is designed for exploring multiple DataFrames within a Kedro project.

Requirements: the Kedro project and its data must be available, and the DataFrame names must be defined in the Kedro catalog. If you are working in an IDE such as VS Code, first run %load_ext kedro.ipython so that the catalog object is available.

Exploring Multiple DataFrames
# Exploring Multiple DataFrames
# Requirements: the Kedro project and data must be available;
# DataFrame names are defined in the catalog.

# DataFrame names to explore
dfs_names = ['hado_22', 'hado_21', 'hado_20', 'hado_19', 'hado_18', 'hado_17']

# Loop to print information for each DataFrame
for name in dfs_names:
    # 'catalog.load' loads the DataFrame registered under that name
    df = catalog.load(name)

    print(f"Information and describe for DataFrame: {name}")
    print("-----------------------------------")

    # df.info() prints directly to stdout and returns None,
    # so it is called on its own rather than inside print()
    df.info()
    print(df.describe(include='all').T)

    print("\n\n")  # blank lines to separate the output of different DataFrames

2.2.3 Data Quality

  • Missing Data: It is observed that the variable “Discharge Date” is only available for half of 2022, representing a limitation for any temporal analysis involving this variable. It is crucial to investigate the presence of missing data in other variables and manage them adequately to avoid biases in subsequent analyses.

Exploring null values

import missingno as msno

# Load the concatenated dataset and visualize its missing-value pattern
df = catalog.load('hado_concat')
msno.matrix(df)

Figure: missingno matrix for hado_concat.
  • Data Consistency: Data consistency will be evaluated by analyzing outliers and unexpected values in the different variables.

  • Variable Cardinality: Variables like “Main Diagnosis” and “Reason for Admission (ING)” present high cardinality, which complicates analysis. Strategies such as grouping or transforming categories are necessary to handle this complexity.
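One way to tame such cardinality is to map free-text values onto a few coarse categories. The sketch below is illustrative: the example values, keywords, and category names are invented, and the real mapping would be built from the actual “Diagnostico” vocabulary:

```python
import unicodedata

import pandas as pd

def normalize(text: str) -> str:
    """Lowercase and strip accents so 'neumonía' and 'neumonia' match."""
    t = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in t if not unicodedata.combining(c))

def group_diagnosis(text: str) -> str:
    """Map a free-text diagnosis onto a coarse category (illustrative rules)."""
    t = normalize(text)
    if "pulmon" in t:
        return "oncológico"
    if "neumonia" in t:
        return "respiratorio"
    return "otros"

# Hypothetical free-text values standing in for the "Diagnostico" column
s = pd.Series(["ca. pulmón", "neumonía", "Ca pulmon", "ICC", "neumonia"])
print(s.map(group_diagnosis).value_counts())
```

Accent stripping matters here because manually entered Spanish text mixes accented and unaccented spellings of the same term.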

Pandas profiling Report

from ydata_profiling import ProfileReport

# Generate a profiling report for the dataset and export it to HTML
profile = ProfileReport(df, title='Pandas Profiling Report')
profile.to_file('pandas_profiling_report.html')

2.2.4 Variable Distribution

The Exploratory Data Analysis (EDA) in subsequent stages, through continuous iterations, will provide a clearer view of the distribution of clinical and demographic variables, such as the distribution of diagnoses, length of stays, and visits. This analysis also seeks to identify patterns and anomalies in the data that may be of interest.
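As a sketch of the kind of distribution check planned, the snippet below uses invented stay lengths; in the project the values would come from columns such as “Nº estancias”:

```python
import pandas as pd

# Invented stay lengths standing in for the "Nº estancias" column
stays = pd.Series([3, 7, 12, 5, 30, 4, 6], name="estancias")

# describe() surfaces skew through the quantiles; a simple percentile
# threshold flags candidate anomalies for manual review
print(stays.describe())
n_extreme = (stays > stays.quantile(0.95)).sum()
print(n_extreme, "value(s) above the 95th percentile")
```

The same pattern (summary statistics plus a threshold-based flag) applies to the number of visits or any other numeric variable.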

2.2.5 Preprocessing Strategy

Data preprocessing focuses on managing missing data and inconsistencies and on transforming high-cardinality variables. Additionally, it seeks to generate new variables (feature engineering) that can enrich subsequent analyses, and NLP techniques can be used to extract and categorize relevant information from free-text variables like “Main Diagnosis”. Examples include adding the year to each annual dataset, flagging the use of different antibiotics, and grouping diagnoses, discharge reasons, and admission reasons.
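Two of those feature-engineering steps (tagging each annual file with its year, and flagging antibiotic use from free text) can be sketched as follows; the example rows, the antibiotic list, and the keyword approach are all assumptions for illustration:

```python
import pandas as pd

# Invented annual fragments standing in for the yearly files
frames = {
    2021: pd.DataFrame({"Diagnostico": ["neumonía, tto con levofloxacino"]}),
    2022: pd.DataFrame({"Diagnostico": ["ICC descompensada"]}),
}

# Tag each annual dataset with its year before concatenating
df = pd.concat(
    [d.assign(year=year) for year, d in frames.items()],
    ignore_index=True,
)

# Flag antibiotic use by keyword search over the free text
antibiotics = ["levofloxacino", "amoxicilina", "ceftriaxona"]
df["antibiotico"] = df["Diagnostico"].str.lower().str.contains("|".join(antibiotics))
print(df[["year", "antibiotico"]])
```

Adding the year at concatenation time preserves provenance, so later analyses can still compare cohorts across years even though the files are merged.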