Data Visualization on Research Citations

Make sense of the research citation data that the academic community is currently buzzing about.

Dr. Saptarsi Goswami
7 min read · Oct 11, 2023
Image source: https://unsplash.com/photos/w00FkE6e8zE

Why are citations important for a research article? (According to one blog, 5.14 million articles are published in a year.)

Let’s understand this with the example of YouTube videos or LinkedIn/Facebook posts. The more reactions a post gets, the more influence it has. The same principle applies to research articles: the more they are mentioned or cited by other researchers in their work, the greater the perceived impact of the article. (As a side note, research articles are often thought to be written in an ivory tower with no practical use, but of late, concepts from many ML/DL articles are being adopted by top research labs across industries.) Citations have a direct bearing on the ranking of the scientist and the institute, which in turn affects collaboration opportunities, funding, etc.

Every year for the last five years, Stanford and Elsevier have jointly published a list of the top-cited scientists across the world. We have downloaded this dataset and made it available on Kaggle.

While most other fields are self-explanatory, let’s quickly introduce a couple of important concepts, the H-index and the HM-index, before jumping to the data analysis section.

H-Index

The H-index is a metric that measures the citation impact of the publications of a particular author. It is the largest number h such that the author has h publications that have each been cited at least h times. For example, an h-index of 10 means that the author has published at least 10 papers that have each been cited at least 10 times. This is commonly available in Google Scholar.

H Index ( Image Source: Author)
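The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by Google Scholar or by the dataset; the function name `h_index` and the citation counts are made up for the example.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # the paper at this rank still has enough citations
        else:
            break      # every later paper has even fewer citations
    return h


# Hypothetical citation counts for one author's six papers
example_citations = [25, 8, 5, 3, 3, 1]
print(h_index(example_citations))  # 3
```

Here three papers have at least 3 citations each, but there are not four papers with at least 4 citations, so the h-index is 3.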

HM Index

The HM index is a modification of the H-index that takes into account the number of authors in each publication. It was proposed by Schreiber in 2008 in an attempt to address the fact that the H-index can be biased in favor of authors who publish in large research groups.
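As a rough sketch of how this works, assume the commonly described formulation in which each paper counts as 1/(number of authors) towards an "effective rank", and the index is the largest effective rank that the citation count of the paper at that position still reaches. The function name `hm_index` and the sample papers are hypothetical, and ties between equally cited papers are resolved arbitrarily here.

```python
def hm_index(papers):
    """papers: iterable of (citations, n_authors) tuples.

    Assumed formulation: sort by citations (descending), accumulate the
    fractional paper count 1/n_authors as an effective rank, and keep the
    largest effective rank that the paper's citations still reach.
    """
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    effective_rank = 0.0
    hm = 0.0
    for cites, n_authors in ranked:
        effective_rank += 1.0 / n_authors
        if cites >= effective_rank:
            hm = effective_rank
        else:
            break  # citations only fall and the effective rank only grows
    return hm


# Hypothetical papers: (citations, number of authors)
papers = [(10, 1), (8, 2), (5, 2), (1, 4)]
print(hm_index(papers))  # 2.0
```

For an author who writes every paper alone, the effective rank equals the ordinary rank, so under this formulation the result coincides with the H-index; co-authorship can only pull it below the H-index.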

Data Analysis:

Let’s formulate a few questions and answer them through code and graphs. The citation data has practically all the required fields. The one issue is that the country column holds only the three-letter country code. To get the expanded name, we have used a country master dataset and merged/joined the two datasets. The code snippet is given below. The answer to each question typically has Python code, followed by a chart, and then observations.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import pandasql as psql  # assumed: psql in the snippets below refers to pandasql

research = pd.read_csv("/kaggle/input/citation2023/Stanford Research Citation.csv", encoding='ISO-8859-1', low_memory=False)
# Required because the country master data stores the codes in upper case
research['cntry'] = research['cntry'].str.upper()

# This is country master data, imported in the notebook to get the country names
CountryM = pd.read_csv("/kaggle/input/country-dataset/all.csv")
# Merging to have the country name
merged_df = pd.merge(research, CountryM, left_on='cntry', right_on='alpha-3', how='inner')
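An inner join silently drops rows whose country code has no match in the master data. The sketch below, on toy data, shows one way to check for that with a left join and pandas' `indicator` flag. Only the column names `cntry`, `name` and `alpha-3` come from the snippet above; the `author` column and all values are made up.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are hypothetical)
research_toy = pd.DataFrame({
    "author": ["A", "B", "C"],
    "cntry": ["usa", "ind", "xyz"],   # "xyz" has no entry in the master data
})
country_toy = pd.DataFrame({
    "name": ["United States of America", "India"],
    "alpha-3": ["USA", "IND"],
})

# Same normalisation as in the article: codes must be upper case to match
research_toy["cntry"] = research_toy["cntry"].str.upper()

# A left join keeps every scientist; the _merge column records whether a match was found
checked = pd.merge(research_toy, country_toy, left_on="cntry",
                   right_on="alpha-3", how="left", indicator=True)
unmatched_codes = sorted(set(checked.loc[checked["_merge"] == "left_only", "cntry"]))
print(unmatched_codes)  # ['XYZ']
```

If the list of unmatched codes is empty, the inner join used above loses nothing; otherwise those codes are worth inspecting before reading too much into the country-level charts.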

Q1: Which are the top 25 countries, as far as no. of scientists are concerned?

# Group by country
ds = merged_df.groupby(['name'])['name'].count()
# Sort by no. of scientists
ds_sorted = ds.sort_values(ascending=False)
# Select the top 25 countries
top_25 = ds_sorted.head(25)

# Plot the above Series using Matplotlib: a red line overlaid on a bar chart
plt.plot(top_25.index, top_25.values, marker='o', linestyle='-', color='red')
top_25.plot(kind='bar')
plt.xlabel('Country')
plt.ylabel('No. of Scientists')
plt.title('No. of Scientists across top 25 Countries')
fig = plt.gcf()
fig.set_size_inches(12, 6)
# Show the plot
plt.show()

Fig 1: Finding the top countries ( Image Source: Author)
  • The top 5 countries are the USA, UK, Germany, China and Canada
  • India is in 14th place overall
  • A further direction of analysis may be to take each country’s population into account

Q2: Which are the top 25 countries in terms of the no. of institutes these scientists come from?

# Count the distinct institutes per country with SQL
query = "SELECT name, count(distinct inst_name) as InstCount from merged_df where name != '' group by name"
result = psql.sqldf(query)
rslt_sorted = result.sort_values(by='InstCount', ascending=False)
# Select the top 25 countries
top_25 = rslt_sorted.head(25)
plt.bar(top_25['name'], top_25['InstCount'])
plt.xlabel('Countries')
plt.ylabel('No. of Institutes')
plt.title('Countries by No. of Institutes')
plt.xticks(rotation=90, ha='center')  # 'ha' stands for horizontal alignment
fig = plt.gcf()
fig.set_size_inches(12, 6)
plt.show()

Fig 2: Finding the top countries in terms of Institutes ( Image Source: Author)
  • The top five countries are the USA, the UK, France, Japan, and Germany
  • India is in 7th Spot

If a country ranks higher by no. of institutes than by no. of scientists, it possibly means that scientific activity is more distributed, i.e. more institutes are participating.
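The rank comparison described above can be sketched directly in pandas, without SQL. This is only an illustration on toy data: the column names `name` and `inst_name` follow the snippets above, while the countries, institutes and the `rank_gain` column are hypothetical.

```python
import pandas as pd

# Toy data: one row per scientist (values are hypothetical)
toy = pd.DataFrame({
    "name": ["X", "X", "X", "X", "Y", "Y", "Y"],
    "inst_name": ["U1", "U1", "U1", "U2", "V1", "V2", "V3"],
})

by_country = toy.groupby("name").agg(
    scientists=("inst_name", "size"),     # rows per country
    institutes=("inst_name", "nunique"),  # distinct institutes per country
)

# Rank 1 = highest count; ties share the best rank
by_country["rank_scientists"] = by_country["scientists"].rank(ascending=False, method="min").astype(int)
by_country["rank_institutes"] = by_country["institutes"].rank(ascending=False, method="min").astype(int)

# Positive value: the country places better by institutes than by scientists
by_country["rank_gain"] = by_country["rank_scientists"] - by_country["rank_institutes"]
print(by_country)
```

In this toy case country Y has fewer scientists than X but spreads them over more institutes, so it gains a place when ranked by institutes.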

Q3: Which are the top 50 institutes by number of scientists?

The code is similar, so I will not spoil your appetite with repetitive snippets; let’s focus on the graph instead.

Fig 3: Finding the top institutes in terms of scientists ( Image Source: Author)
  • Harvard Medical School and the University of Washington tower over the rest
  • The USA shows sheer dominance in the top 50, followed by the UK
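For readers who want to reproduce a chart like Fig 3, the count behind it might look roughly like the sketch below. This is not the author's omitted code; the helper `top_institutes` and the toy values are hypothetical, and only the column names `inst_name` and `cntry` come from the earlier snippets.

```python
import pandas as pd


def top_institutes(df, n=50, country_code=None):
    """Return the n institutes with the most scientists, optionally for one country."""
    if country_code is not None:
        df = df[df["cntry"] == country_code]
    # value_counts sorts from the most to the least frequent institute
    return df["inst_name"].value_counts().head(n)


# Toy data: one row per scientist (values are hypothetical)
toy = pd.DataFrame({
    "inst_name": ["U1", "U1", "U1", "U2", "U2", "V1"],
    "cntry": ["USA", "USA", "USA", "USA", "USA", "IND"],
})

print(top_institutes(toy, n=2))
print(top_institutes(toy, country_code="IND"))
```

The resulting Series can be passed to `.plot(kind='bar')` in the same way as the country counts in Q1, and the optional country filter gives a single-country view such as the one discussed in Q8.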

Q4: This is not really a separate question, but rather probing the no. of Scientists per country question through a world map visualization

# Using GeoPandas for the visualization
# Note: gpd.datasets was removed in GeoPandas 1.0, so this call needs an older version

ds = merged_df.groupby(['cntry'])['cntry'].count()
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Join the per-country counts to the map on the three-letter ISO code
mergedds = world.merge(ds, left_on='iso_a3', right_index=True)
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
mergedds.plot(column='cntry', cmap='plasma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True)
ax.set_title('World No. of Scientist Heatmap')
ax.axis('off')
plt.show()

Fig 4: World heatmap by no. of scientists
  • The USA appears to be from an almost different “colorscape”
  • The UK, Germany, and China also appear in a different color from the rest
  • However, some countries appear white. The probable explanation is that no scientists from there are on the list, e.g. Bolivia and some countries in Central Africa (a wake-up call for policymakers on inclusivity). Another possibility is that a country’s code did not match during the merge.

Just as a thought experiment, we removed the USA and redrew the map.

Fig 4a: Redrawn heatmap
  • A quick observation is that other leaders like the UK, Germany, France, China, Australia, Canada, and Japan emerge. (It’s important to do such exercises.)
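The effect of this exercise is easy to see on a small example: one dominant value stretches the color scale so that everything else looks alike. The sketch below uses made-up counts (not figures from the dataset) and simply drops the largest entry before the map would be drawn.

```python
import pandas as pd

# Hypothetical scientist counts per country code, for illustration only
counts = pd.Series({"USA": 100, "GBR": 20, "DEU": 12, "CHN": 11}, name="cntry")

# Remove the single largest country so the remaining values fill the color scale
without_top = counts.drop(counts.idxmax())

# Spread of the color scale before and after removing the outlier
print(counts.max() - counts.min())            # 89
print(without_top.max() - without_top.min())  # 9
```

With the outlier gone, the remaining countries span the whole colormap, which is why other leaders become visible in the redrawn map.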

Q5: Which fields/disciplines or subfields have the most highly cited scientists?

# Count the scientists in each primary subfield
dsinst = research.groupby(['sm-subfield-1'])['sm-subfield-1'].count()
# Sort by no. of scientists per field
dsi_sorted = dsinst.sort_values(ascending=False)
# Select the top 40 fields
top_40 = dsi_sorted.head(40)
# Plot the Series using Matplotlib: a purple line overlaid on a bar chart
plt.plot(top_40.index, top_40.values, marker='o', linestyle='-', color='purple')
top_40.plot(kind='bar', color=['green'])
plt.xlabel('Field of Research')
plt.ylabel('No. of Scientists')
plt.title('No. of Scientists across top 40 Disciplines')
# Show the plot
fig = plt.gcf()
fig.set_size_inches(12, 6)
plt.show()

Fig 5: Impactful Fields ( Image Source: Author)
  • It is no surprise to find AI on top
  • Many important fields are related to medical science
  • Followed by Physics and Chemistry
  • Apart from these, the other fields to watch out for are Education, Psychiatry, and Ecology

Q6: What is the mix of new versus veteran scientists in the top 2% database?

For this one we have to work with an assumption: we take the year of first publication as a proxy measure of experience.

# Using a kernel density estimate to look at the distribution

sns.kdeplot(merged_df['firstyr'], fill=True)
# Add labels and a title
plt.xlabel('firstyr')
plt.ylabel('Density')
plt.title('KDE Plot by the year of first publication by scientists')
fig = plt.gcf()
fig.set_size_inches(12, 6)
# Show the plot
plt.show()

Fig 6: Experience distribution ( Image Source: Author)
  • The peak is around 1990 or so. Hence, a typical scientist on the list has been publishing for about 30 years
  • It is a left-skewed distribution, so not many scientists from the earlier generations make it.
  • A good proportion of the scientists on the list started publishing after 2000.
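The observations above can also be backed with a couple of numbers. The sketch below turns the first publication year into years of experience and computes the share of scientists who started after 2000. The years are toy values, and the reference year 2022 is an assumption based on the `h22`/`hm22` column names.

```python
import pandas as pd

# Toy first-publication years (hypothetical, not taken from the dataset)
firstyr = pd.Series([1975, 1988, 1990, 1992, 2003, 2011], name="firstyr")

reference_year = 2022  # assumption: the data covers citations up to 2022

# Proxy for experience: years since the first publication
experience = reference_year - firstyr
median_experience = experience.median()

# Share of scientists whose first publication came after 2000
share_after_2000 = (firstyr > 2000).mean()

print(median_experience)           # 31.0
print(round(share_after_2000, 2))  # 0.33
```

On the real data, replacing the toy Series with `merged_df['firstyr']` would give the corresponding figures for the full list.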

Q7: Which countries are lagging or leading in terms of group research?

Again we have used a proxy measure, the ratio hm22/h22. If it is close to 1, the work is more of an individual effort than a research group effort.

# Ratio of summed hm-index to summed h-index per country
query = "SELECT name, sum(h22), sum(hm22), sum(hm22)/sum(h22) as GroupEffect from merged_df where name != '' group by name"
result = psql.sqldf(query)
rslt_sorted = result.sort_values(by='GroupEffect', ascending=True)
# Select the 50 countries with the lowest ratio
top_50 = rslt_sorted.head(50)
# Note: gpd.datasets was removed in GeoPandas 1.0, so this call needs an older version
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
mergedds = world.merge(result, left_on='name', right_on='name')
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
mergedds.plot(column='GroupEffect', cmap='plasma', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True)
ax.set_title('World Group Effect (hm22/h22) Heatmap')
ax.axis('off')
plt.show()

Fig 7: Group Effect ( Image Source: Author)
  • Countries such as Mali and Uganda in Africa, and Andorra in Europe, have pretty low scores, i.e. little of their highly cited research is carried out outside groups
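The group-effect ratio used for Q7 can also be computed with a pandas `groupby` instead of SQL. The sketch below is an illustration on toy values; only the column names `name`, `h22`, `hm22` and `GroupEffect` follow the query above.

```python
import pandas as pd

# Toy data: one row per scientist (values are hypothetical)
toy = pd.DataFrame({
    "name": ["X", "X", "Y", "Y"],
    "h22": [20, 10, 30, 10],
    "hm22": [10.0, 5.0, 27.0, 9.0],
})

# Sum both indices per country, then take their ratio
sums = toy.groupby("name")[["h22", "hm22"]].sum()
sums["GroupEffect"] = sums["hm22"] / sums["h22"]

# Lowest ratio first: these countries lean most on co-authored work
ranked = sums.sort_values("GroupEffect")
print(ranked)
```

In this toy case country X has a ratio of 0.5 and country Y a ratio of 0.9, so under the proxy X's research is more group driven while Y's is closer to individual effort.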

Well, there are many more dimensions along which the dataset can be explored. The data from the last five years may also be taken together for trend analysis.

Q8: As a data miner from India, the last question is about top institutes from India in terms of no. of scientists.

Omitting the code here

  • The top institutes are IISc and the IITs, i.e. central institutes, with the exception of Jadavpur University
  • The research is concentrated in engineering; the only medical institute that figures in the list is AIIMS. (Maybe again a pointer to policymakers.)

Summary:

This blog is for budding data scientists, as well as academic leaders, to take note of the research trends across the globe. This is being developed as a community initiative; both the dataset and the notebook are publicly available.


Dr. Saptarsi Goswami

Asst Prof — CS Bangabasi Morning Clg, Lead Researcher University of Calcutta Data Science Lab, ODSC Kolkata Chapter Lead