- Basic Overview
- Data Collection and Preprocessing
- Model Architecture
- Results and Testing
- Deployment
- Modern paper submission platforms require authors to upload a paper title and abstract and then select the appropriate categories for the submission. With so many categories available, choosing the best fit is a challenge for authors.
- This classifier performs multi-label classification, predicting the appropriate research categories for a paper from its title and abstract.
DATASET LINK: https://www.kaggle.com/c/kriti-2024/data?select=train.csv (an arXiv dataset)
- It contains the titles and abstracts of 50,000+ research papers, each labelled with its appropriate research categories.
- To use all of the available text, we replaced the 'Title' and 'Abstract' columns with a single column named 'Context', the concatenation of the two (sketched below).
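A minimal sketch of this step, assuming a pandas workflow; the file name `train.csv` and the column names `Title` and `Abstract` come from the dataset link and description above:

```python
import pandas as pd

# Load the Kaggle training split and merge title + abstract into one text field
df = pd.read_csv("train.csv")
df["Context"] = df["Title"].fillna("") + " " + df["Abstract"].fillna("")
df = df.drop(columns=["Title", "Abstract"])
```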
- Data Cleaning (a sketch follows this list):
- URL Removal
- Punctuation Removal
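The two cleaning steps could look like the following sketch, using plain regular expressions; the function name `clean_text` is illustrative, not the project's actual code:

```python
import re
import string

def clean_text(text: str) -> str:
    # Remove URLs (http/https links and bare www links)
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Remove punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.lower().strip()

df["Context"] = df["Context"].apply(clean_text)
```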
- Text Tokenization: Tokenization splits paragraphs and sentences into smaller units (words) that can be more easily assigned meaning. We use a tokenizer to split each 'Context' into word-level tokens; tokenization and padding are sketched together after the next item.
- Sequence Padding: The tokenized sequences are padded to ensure uniform length.
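A combined sketch of tokenization and padding with the Keras preprocessing utilities; the vocabulary cap and sequence length are illustrative values, not the project's actual settings:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_WORDS = 50000   # vocabulary cap (assumed value)
MAX_LEN = 250       # uniform sequence length after padding (assumed value)

tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token="<OOV>")
tokenizer.fit_on_texts(df["Context"])

# Convert each text to a sequence of word indices, then pad to MAX_LEN
sequences = tokenizer.texts_to_sequences(df["Context"])
X = pad_sequences(sequences, maxlen=MAX_LEN, padding="post", truncating="post")
```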
- Word Embeddings: Word2Vec embeddings are used to represent words in a continuous vector space where semantically similar words are mapped to nearby points. This captures the context of words in the research papers, providing rich word representations that are fed into the model.
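Training Word2Vec on the cleaned corpus might look like this gensim sketch; the embedding dimension and training hyperparameters are assumed values:

```python
from gensim.models import Word2Vec

EMBED_DIM = 100  # embedding dimensionality (assumed value)

# gensim's Word2Vec expects pre-tokenized sentences (lists of words)
corpus = [text.split() for text in df["Context"]]
w2v = Word2Vec(sentences=corpus, vector_size=EMBED_DIM,
               window=5, min_count=2, workers=4)
```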
- Embedding Layer: The embedding layer acts as a lookup table that maps words in the input sequences to their corresponding dense vector representations. It is initialized with the pre-trained Word2Vec embeddings, which helps the model capture semantic relationships between words more effectively.
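One common way to wire pre-trained Word2Vec vectors into a Keras `Embedding` layer is to build a weight matrix indexed by the tokenizer's vocabulary. This sketch reuses the names from the snippets above; freezing the embeddings is an assumption, not a documented choice of the project:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding

# Build a (vocab_size, EMBED_DIM) matrix; row i holds the vector for word index i
vocab_size = min(MAX_WORDS, len(tokenizer.word_index) + 1)
embedding_matrix = np.zeros((vocab_size, EMBED_DIM))
for word, idx in tokenizer.word_index.items():
    if idx < vocab_size and word in w2v.wv:
        embedding_matrix[idx] = w2v.wv[word]

embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=EMBED_DIM,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # keep the pre-trained vectors fixed (an assumed choice)
)
```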
- Bidirectional LSTM Layers: LSTMs are a type of recurrent neural network (RNN) that are well-suited for sequential data and can capture long-range dependencies in the text. A BiLSTM enhances the LSTM model by processing data in both forward and backward directions. This means it consists of two LSTM networks:
- Forward LSTM: Processes the sequence from the beginning to the end.
- Backward LSTM: Processes the sequence from the end to the beginning.
Working of BiLSTM:
- Input Sequence: Given an input sequence, the forward LSTM processes it from the first element to the last, while the backward LSTM processes it from the last element to the first.
- Hidden States: Both LSTM networks generate their own hidden states for each time step.
- Concatenation: The hidden states from both LSTMs are concatenated or combined to produce the final output for each time step.
- Dense Layers: After the LSTM layers, dense (fully connected) layers are used to process the output and make final predictions. These layers are followed by an output layer with a sigmoid activation function for the 57-way multi-label classification.
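Putting the pieces together, the architecture described above could be sketched as the following Keras model; the LSTM and dense layer sizes and the dropout rate are illustrative, while the 57-unit sigmoid output follows the description:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense, Dropout

NUM_CLASSES = 57  # number of candidate categories, per the description above

model = Sequential([
    embedding_layer,                 # frozen Word2Vec lookup from the sketch above
    Bidirectional(LSTM(128)),        # forward + backward hidden states, concatenated
    Dense(64, activation="relu"),
    Dropout(0.3),                    # dropout rate is an assumed value
    Dense(NUM_CLASSES, activation="sigmoid"),  # one independent probability per label
])
```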
- Adam optimizer: Adam (Adaptive Moment Estimation) adapts a per-parameter learning rate during gradient descent, which makes it a good fit for sparse data.
- categorical_crossentropy: Categorical cross-entropy is used when the true labels are one-hot encoded, which matches how our category labels are encoded.
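The compile-and-train step, following the optimizer and loss named above; the target matrix `y`, epoch count, and batch size are assumptions:

```python
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",  # as stated above; binary_crossentropy is also common for multi-label targets
    metrics=["accuracy"],
)

# y: a (num_samples, 57) multi-hot label matrix (assumed shape)
model.fit(X, y, epochs=10, batch_size=64, validation_split=0.1)
```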
Deployed on Streamlit. Check it out: https://automated-research-paper-categorizer.streamlit.app
This project was made by: