Step 2 in detail: Preprocessing
As explained in a previous article, preprocessing is a complicated and very important element of the project architecture, so I want to explain it in detail.
What are the different elements of the preprocessing?
As mentioned, there are four main elements: cleaning the JSON files, eliminating duplicates, selecting the language, and generating stopwords.
[Diagram: pipeline from 44K papers → clean JSON files → 12K papers]
JSON DEFINITION (W.I.P)
CSV DEFINITION (W.I.P)
The dataset comes in JSON format, but that format has to be cleaned before it can be manipulated.
Therefore, we must first explore the subset of interest, biorxiv_medrxiv. Why this one? Since we are focusing this project on the genetic aspect, it is better to reduce the dataset to the subset whose articles relate to that aspect. There are indeed many articles about COVID-19, but they are centered on different aspects; the genetic and medical impact is analyzed in the articles of this particular subset, and that is why we chose it. It also gives us a useful first selection: we go from 44K to 12K papers.
After that, we need to load the JSON files into a list of nested dictionaries. To learn how, see this article (w.i.p). This lets us prepare the papers the way we want: exploring all the available categories and choosing the ones we want to keep.
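As a minimal sketch of that loading step, assuming the subset sits under a local data/biorxiv_medrxiv folder (a hypothetical path) and that each file follows the CORD-19 JSON schema, with paper_id, metadata, abstract, and body_text keys:

```python
import glob
import json

# Hypothetical location of the biorxiv_medrxiv subset; adjust to your copy.
paths = glob.glob("data/biorxiv_medrxiv/**/*.json", recursive=True)

papers = []
for path in paths:
    with open(path, encoding="utf-8") as f:
        # Each file is one paper: a nested dictionary with keys such as
        # paper_id, metadata, abstract and body_text.
        papers.append(json.load(f))

print(f"Loaded {len(papers)} papers")
```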
Once that is done, we generate a CSV to explore the information, as it is a much more user-friendly format.
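A short sketch of that flattening step, again assuming the CORD-19 field names, where abstract and body_text are lists of paragraph dictionaries whose text we join:

```python
import pandas as pd

# Flatten each nested dictionary into the columns we care about.
# The key names follow the CORD-19 JSON schema; adapt them if your files differ.
rows = []
for paper in papers:
    rows.append({
        "paper_id": paper.get("paper_id"),
        "title": paper.get("metadata", {}).get("title"),
        "abstract": " ".join(p["text"] for p in paper.get("abstract", [])),
        "body_text": " ".join(p["text"] for p in paper.get("body_text", [])),
    })

df = pd.DataFrame(rows)
df.to_csv("papers.csv", index=False)
```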
Eliminate duplicates
Due to the high number of papers and the amount of information they contain, it is normal for some of that information to be duplicated. As part of the preprocessing, it is important to eliminate duplicates to avoid noise.
To do so, we must first perform a word count on the abstract and body text, which are the most important fields for us. Once that is done, we need to identify the unique values in them. The text is then ready to have its duplicates eliminated, using pandas' drop_duplicates function.
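In pandas, that step might look like this sketch (the column names come from the CSV we built above):

```python
# Add word counts for the fields we deduplicate on.
df["abstract_word_count"] = df["abstract"].fillna("").str.split().str.len()
df["body_word_count"] = df["body_text"].fillna("").str.split().str.len()

# Inspect how many values are unique before dropping anything.
print(df["abstract"].nunique(), "unique abstracts out of", len(df))

# Keep the first occurrence of each (abstract, body_text) pair.
df = df.drop_duplicates(subset=["abstract", "body_text"], keep="first")
```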
Select language ENGLISH
In this case, we are going to focus on English, but the whole project can be done in as many languages as we want; the language only has to be specified. We use langdetect to identify the main language and then look at the repartition, to confirm that English is a good choice. Seeing that it is the main language, along with German, we select it.
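A minimal sketch with langdetect, fixing the seed since detection is otherwise non-deterministic; detecting on a slice of the body text keeps it fast (the 500-character cutoff is an arbitrary choice, not the article's):

```python
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic without a fixed seed

def detect_language(text):
    try:
        # Detect on the first 500 characters to keep things fast.
        return detect(text[:500])
    except Exception:  # very short or empty texts make detection fail
        return "unknown"

df["language"] = df["body_text"].fillna("").apply(detect_language)

# Look at the repartition of languages before filtering.
print(df["language"].value_counts())

# Keep only the English papers.
df = df[df["language"] == "en"]
```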
Generate STOPWORDS
STOPWORDS DEFINITION (W.I.P)
Stopwords are an important part of Natural Language Processing. Since we decided to start this example with the English language, we only need the list of English stopwords. To get it, we can use either NLTK or spaCy; both libraries are very useful. Once that is done, we can manually extend and personalize the list, in this case with scientific words that are heavily used in this kind of paper and need to be added to the stopwords, as they don't add useful information for our goal.
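Using NLTK, that might look like the sketch below; the scientific terms in the extension are illustrative guesses, not the project's actual list:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the NLTK stopword lists

# Base list of English stopwords.
custom_stopwords = set(stopwords.words("english"))

# Hypothetical domain-specific extension: words that are frequent in
# scientific papers but carry no signal for our goal. Adjust to taste.
custom_stopwords.update({
    "doi", "preprint", "copyright", "fig", "figure", "table",
    "et", "al", "author", "license", "https", "biorxiv", "medrxiv",
})
```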