Data sources and core processing

In EURITO we predominantly use four data sources:

  • EU-funded research from CORDIS
  • Technical EU research on arXiv
  • EU Patents from PATSTAT
  • EU Companies [under license, we can’t specify the source publicly. Contact us for more details!]

CORDIS

Data on funded projects is extracted from the CORDIS H2020 API and FP7 API using code found in this repository.

In total, 51250 organisations and 50640 projects were extracted from the API. There are 1102 proposal calls, 245465 publications and 34507 reports. In total 6545 are associated with the projects.

Software outputs are associated with the projects using the OpenAIRE API.

All of these entities are then linked together and stored in a Neo4j graph database. The code for automatically piping the data into Neo4j is provided here.

Cordis to Neo4j

Tools for piping data from a SqlAlchemy ORM to Neo4j, to be used in the Luigi pipeline.

orm_to_neo4j(session, transaction, orm_instance, parent_orm=None, rel_name=None)[source]

Pipe a SqlAlchemy ORM instance (a ‘row’ of data) to neo4j, inserting it as a node or relationship, as appropriate.

Parameters:
  • session (sqlalchemy.Session) – SQL DB session.
  • transaction (py2neo.Transaction) – Neo4j transaction
  • orm_instance (sqlalchemy.Base) – Instance of a SqlAlchemy ORM, i.e. a ‘row’ of data.
  • parent_orm (sqlalchemy.Base) – Parent ORM to build relationship to
  • rel_name (str) – Name of the relationship to be added to Neo4j
build_relationships(session, graph, orm, data_row, rel_name, parent_orm=None)[source]

Build a py2neo.Relationship object from SqlAlchemy objects.

Parameters:
  • session (sqlalchemy.Session) – SQL DB session.
  • transaction (py2neo.Transaction) – Neo4j transaction
  • orm (sqlalchemy.Base) – A SqlAlchemy ORM
  • rel_name (str) – Name of the relationship to be added to Neo4j
  • parent_orm (sqlalchemy.Base) – Another ORM to build relationship to. If this is not specified, it implies that orm is node, rather than a relationship.
Returns:

Relationships pointing to the node (inferred from ORM), and one pointing back to its associated project.

Return type:

{relationship, back_relationship}

set_constraints(orm, graph_schema)[source]

Set constraints in the neo4j graph schema.

Parameters:
  • orm (sqlalchemy.Base) – A SqlAlchemy ORM
  • graph_schema (py2neo.Graph.Schema) – Neo4j graph schema.
prepare_base_entities(table)[source]

Returns the objects required to generate a graph representation of the ORM.

Parameters:table (sqlalchemy.sql.Table) – SqlAlchemy table object from which to extract a graph representation.
Returns:Two ORMs and a string describing their relationship
Return type:{orm, parent_orm, rel_name}
flatten(orm_instance)[source]

Convert a SqlAlchemy ORM (i.e. a ‘row’ of data) to flat JSON.

Parameters:orm_instance (sqlalchemy.Base) – Instance of a SqlAlchemy ORM, i.e. a ‘row’ of data.
Returns:A flat row of data, inferred from orm_instance
Return type:row (dict)
flatten_dict(row, keys=[('title',), ('street', 'city', 'postalCode')])[source]

Flatten a dict by concatenating string values of matching keys.

Parameters:row (dict) – Data to be flattened
Returns:Concatenated data.
Return type:flat (str)
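
As a rough illustration of the behaviour described for flatten_dict, the sketch below joins the string values of the first key group found in a row. It is not the repository's implementation: the function name flatten_dict_sketch, the sep parameter, the first-match rule and the error raised on no match are all assumptions made for this example.

```python
def flatten_dict_sketch(row, keys=(("title",), ("street", "city", "postalCode")), sep=" "):
    """Join the string values of the first key group that appears in `row`.

    Hypothetical stand-in for flatten_dict: the separator and the
    first-match behaviour are assumptions, not taken from the source.
    """
    for group in keys:
        # Keep only the keys of this group that are present with string values
        parts = [row[k] for k in group if isinstance(row.get(k), str)]
        if parts:
            return sep.join(parts)
    raise ValueError("row contains none of the expected keys")


# A project-like row is reduced to its title, an address-like row to one string
project_text = flatten_dict_sketch({"title": "A project", "rcn": 1})
address_text = flatten_dict_sketch({"street": "1 Main St", "city": "Paris", "id": 3})
```

A tuple is used for the default keys here to avoid a mutable default argument; the documented signature uses a list.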
retrieve_node(session, graph, orm, parent_orm, data_row)[source]

Retrieve an existing node from neo4j, by first retrieving its id (field name AND value) via SqlAlchemy.

Parameters:
  • session (sqlalchemy.Session) – SQL DB session.
  • transaction (py2neo.Transaction) – Neo4j transaction
  • orm (sqlalchemy.Base) – SqlAlchemy ORM describing data_row
  • parent_orm (sqlalchemy.Base) – Parent ORM to build relationship to
  • data_row (dict) – Flat row of data retrieved from orm
Returns:

Node of data corresponding to data_row

Return type:

node (py2neo.Node)

table_from_fk(fks)[source]

Get the table name of the fk constraint, ignoring the cordis_projects table

Parameters:fks (list of SqlAlchemy.ForeignKey) – All foreign keys for a given table.
Returns:The table name corresponding to the non-Project foreign key.
Return type:tablename (str)
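
The selection logic described for table_from_fk can be sketched without a database. The example below assumes each foreign key has already been reduced to a "table.column" target string (with real SqlAlchemy ForeignKey objects, the table name would first need to be read from each key, for example via fk.column.table.name). The function name table_from_fk_sketch, the ignore parameter, the cordis_organisations table name and the error handling are hypothetical.

```python
def table_from_fk_sketch(fk_targets, ignore="cordis_projects"):
    """Return the single referenced table name that is not `ignore`.

    `fk_targets` is a list of "table.column" strings; this input shape
    is an assumption made for the sketch.
    """
    # Collect the distinct table names referenced by the foreign keys
    tables = {target.split(".")[0] for target in fk_targets}
    # Drop the projects table, which every link table points back to
    tables.discard(ignore)
    if len(tables) != 1:
        raise ValueError(f"expected exactly one other table, found {sorted(tables)}")
    return tables.pop()


# Hypothetical link table pointing at projects and organisations
tablename = table_from_fk_sketch(["cordis_projects.rcn", "cordis_organisations.id"])
```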
get_row(session, parent_orm, orm, data_row)[source]

Retrieve a flat row of data corresponding to the parent relation, inferred via foreign keys.

Parameters:
  • session (sqlalchemy.Session) – SQL DB session.
  • parent_orm (sqlalchemy.Base) – Parent ORM to build relationship to
  • orm (sqlalchemy.Base) – SqlAlchemy ORM describing data_row
  • data_row (dict) – Flat row of data retrieved from orm
Returns:

Flat row of data retrieved from parent_orm

Return type:

_row (dict)

Enrich Cordis with OpenAIRE

Tools for collecting OpenAIRE data (by Cordis project), and piping to Neo4j.

write_record_to_neo(record, output_type, graph)[source]

A utility function that takes a record and writes it to the neo4j graph.

Parameters:
  • record (dict) – a dictionary that contains metadata about a record
  • output_type (str) – type of record to be extracted from OpenAIRE API. Accepts “software”, “datasets”, “publications”, “ECProjects”
  • graph (graph_session) – connection to neo4j database
get_project_soups(currentUrl, reqsession, output_type, projectID)[source]

Gets a list of BeautifulSoup objects for the given output type and projectID.

Parameters:
  • currentUrl (str) – URL to OpenAIRE API
  • reqsession (instance of Requests session) – currently open HTTP request
  • output_type (str) – type of record to be extracted from OpenAIRE API. Accepts “software”, “datasets”, “publications”, “ECProjects”
  • projectID (str) – EC project identifier
Returns:

a list of BeautifulSoup objects that contain the results from API call

Return type:

souplist (list)

get_results_from_soups(souplist)[source]

Extracts string from all BeautifulSoup objects and merges them into one list

Parameters:souplist (list) – a list of BeautifulSoup objects that contain the results from API call
Returns:a list of strings with results metadata
Return type:resultlist (list)

arXiv

All articles from arXiv, which is the world’s premier repository of pre-prints of articles in the physical, quantitative and computational sciences, are already automatically collected, geocoded (using GRID) and enriched with topics (using MAG). Articles are assigned to EU NUTS regions (at all levels) using nesta’s nuts-finder python package.

Data is transferred to EURITO’s elasticsearch server via nesta’s es2es package. The lolvelty algorithm is then applied to the data in order to generate a novelty metric for each article. This procedure is described in more detail in this blog (see “Defining novelty”).

The indicators using this data source are presented in this other EURITO repository.

In total, 1598033 articles have been processed, of which 459371 have authors based in EU nations.

PATSTAT

All patents from the PATSTAT service have been collected in nesta’s own database using nesta’s pypatstat library. Since this database is very large, we have selected patents which belong to a patent family with a granted patent first published after the year 2000, with at least one person or organisation (inventor or applicant) based in an EU member state. This leads to 1552303 patents in the database.
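
The selection rule above can be written as a simple predicate. The sketch below is illustrative only and does not reflect the PATSTAT schema or the pypatstat API: the PatentFamily class, its field names and the three-country EU_SUBSET are hypothetical, and "after the year 2000" is read as a publication year strictly greater than 2000.

```python
from dataclasses import dataclass, field


@dataclass
class PatentFamily:
    """Hypothetical, simplified view of a patent family."""
    family_id: int
    has_granted_patent: bool
    first_publication_year: int
    person_countries: set = field(default_factory=set)  # inventor/applicant country codes


# Illustrative subset only, not the full list of EU member states
EU_SUBSET = {"DE", "FR", "IT"}


def keep_family(family, eu_countries=EU_SUBSET, min_year=2000):
    """True if the family matches the three selection criteria."""
    return (
        family.has_granted_patent
        and family.first_publication_year > min_year
        and bool(family.person_countries & eu_countries)
    )


families = [
    PatentFamily(1, True, 2005, {"DE", "US"}),   # kept
    PatentFamily(2, True, 1998, {"FR"}),         # published too early
    PatentFamily(3, False, 2010, {"IT"}),        # no granted patent
    PatentFamily(4, True, 2012, {"US"}),         # no EU-based person
]
selected_ids = [f.family_id for f in families if keep_family(f)]
```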

Data is transferred to EURITO’s elasticsearch server via nesta’s es2es package. The lolvelty algorithm is then applied to the data in order to generate a novelty metric for each patent. This procedure is described in more detail in this blog (see “Defining novelty”).

The indicators using this data source are presented in this other EURITO repository.

Companies

We have acquired private-sector company data under license. The dataset contains 550540 companies, of which 133641 are based in the EU.

Data is transferred to EURITO’s elasticsearch server via nesta’s es2es package. The lolvelty algorithm is then applied to the data in order to generate a novelty metric for each company. This procedure is described in more detail in this blog (see “Defining novelty”).

The indicators using this data source are presented in this other EURITO repository.