Data sources and core processing

In EURITO we predominantly use four data sources:

EU-funded research from CORDIS
Technical EU research on arXiv
EU Patents from PATSTAT
EU Companies [under license, we can’t specify the source publicly. Contact us for more details!]

CORDIS

Data on H2020- and FP7-funded projects is extracted from the CORDIS H2020 API and FP7 API, using code found in this repository.

In total, 51250 organisations and 50640 projects were extracted from the API. There are 1102 proposal calls, 245465 publications and 34507 reports. In total 6545 are associated with the projects.

Software outputs are associated with the projects using the OpenAIRE API.

All of these entities are then linked together and stored in a Neo4j graph database. The code for automatically piping the data into Neo4j is provided here.

Cordis to Neo4j

Tools for piping data from a SqlAlchemy ORM to Neo4j, to be used in the Luigi pipeline.

orm_to_neo4j(session, transaction, orm_instance, parent_orm=None, rel_name=None)

Pipe a SqlAlchemy ORM instance (a ‘row’ of data) to Neo4j, inserting it as a node or relationship, as appropriate.

Parameters:
    session (sqlalchemy.Session) – SQL DB session.
    transaction (py2neo.Transaction) – Neo4j transaction.
    orm_instance (sqlalchemy.Base) – Instance of a SqlAlchemy ORM, i.e. a ‘row’ of data.
    parent_orm (sqlalchemy.Base) – Parent ORM to build relationship to.
    rel_name (str) – Name of the relationship to be added to Neo4j.

build_relationships(session, graph, orm, data_row, rel_name, parent_orm=None)

Build a py2neo.Relationship object from SqlAlchemy objects.

Parameters:
    session (sqlalchemy.Session) – SQL DB session.
    transaction (py2neo.Transaction) – Neo4j transaction.
    orm (sqlalchemy.Base) – A SqlAlchemy ORM.
    rel_name (str) – Name of the relationship to be added to Neo4j.
    parent_orm (sqlalchemy.Base) – Another ORM to build relationship to. If this is not specified, it implies that `orm` is a node, rather than a relationship.
Returns:	Relationships pointing to the node (inferred from ORM), and one pointing back to its associated project.
Return type:	{relationship, back_relationship}
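
As a rough illustration of the node-versus-relationship decision described above (a missing `parent_orm` implies a node), the sketch below builds parameterised Cypher instead of py2neo objects. It is not the code from the repository: the helper name `merge_statement`, the labels, the key fields and the relationship name are all hypothetical.

```python
def merge_statement(label, key, row, parent=None, rel_name=None):
    """Build a (cypher, params) pair for one flat row of data.

    Hypothetical helper: with no parent the row becomes a node; with a
    parent, given as (label, key, value), the node is also linked to it.
    """
    params = {"key": row[key], "props": row}
    # Labels and property names cannot be Cypher parameters, so they are
    # interpolated; only the values travel as parameters.
    node = f"MERGE (n:{label} {{{key}: $key}}) SET n += $props"
    if parent is None:
        return node, params
    if rel_name is None:
        raise ValueError("rel_name is required when a parent is given")
    parent_label, parent_key, parent_value = parent
    cypher = (
        f"MATCH (p:{parent_label} {{{parent_key}: $parent_value}}) "
        f"{node} "
        f"MERGE (p)-[:{rel_name}]->(n)"
    )
    params["parent_value"] = parent_value
    return cypher, params


# Illustrative call with made-up labels and values.
cypher, params = merge_statement(
    "Organisation", "id", {"id": 7, "name": "Acme"},
    parent=("Project", "rcn", 123), rel_name="HAS_ORGANISATION",
)
```

Because labels and relationship names are interpolated into the query text, a helper like this should only ever receive them from trusted sources such as ORM table names.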

set_constraints(orm, graph_schema)

Set constraints in the Neo4j graph schema.

Parameters:
    orm (sqlalchemy.Base) – A SqlAlchemy ORM.
    graph_schema (py2neo.Graph.Schema) – Neo4j graph schema.

prepare_base_entities(table)

Returns the objects required to generate a graph representation of the ORM.

Parameters:	table (sqlalchemy.sql.Table) – SqlAlchemy table object from which to extract a graph representation.
Returns:	Two ORMs and a string describing their relationship
Return type:	{orm, parent_orm, rel_name}

flatten(orm_instance)

Convert a SqlAlchemy ORM (i.e. a ‘row’ of data) to flat JSON.

Parameters:	orm_instance (sqlalchemy.Base) – Instance of a SqlAlchemy ORM, i.e. a ‘row’ of data.
Returns:	A flat row of data, inferred from orm_instance
Return type:	row (dict)

flatten_dict(row, keys=[('title',), ('street', 'city', 'postalCode')])

Flatten a dict by concatenating string values of matching keys.

Parameters:	row (dict) – Data to be flattened
Returns:	Concatenated data.
Return type:	flat (str)
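
One plausible reading of "concatenating string values of matching keys" is sketched below. This is an illustration, not the repository's implementation: the name `flatten_dict_sketch`, the choice to stop at the first key group that matches, the `", "` separator and the tuple default are all assumptions.

```python
def flatten_dict_sketch(row, keys=(("title",), ("street", "city", "postalCode"))):
    """Join the string values of the first key group found in `row`.

    Assumption: key groups are tried in order and the first group with at
    least one string value wins.
    """
    for group in keys:
        values = [row[k] for k in group if isinstance(row.get(k), str)]
        if values:
            return ", ".join(values)
    raise KeyError(f"None of the key groups {keys} matched the row")


# A project-like row matches the ('title',) group; an address-like row
# matches the ('street', 'city', 'postalCode') group.
project = flatten_dict_sketch({"title": "Graphene sensors", "rcn": 42})
address = flatten_dict_sketch(
    {"street": "1 Main St", "city": "Paris", "postalCode": "75001"}
)
```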

retrieve_node(session, graph, orm, parent_orm, data_row)

Retrieve an existing node from Neo4j, by first retrieving its id (field name AND value) via SqlAlchemy.

Parameters:
    session (sqlalchemy.Session) – SQL DB session.
    transaction (py2neo.Transaction) – Neo4j transaction.
    orm (sqlalchemy.Base) – SqlAlchemy ORM describing `data_row`.
    parent_orm (sqlalchemy.Base) – Parent ORM to build relationship to.
    data_row (dict) – Flat row of data retrieved from orm.
Returns:	Node of data corresponding to data_row
Return type:	node (py2neo.Node)

table_from_fk(fks)

Get the table name of the foreign-key constraint, ignoring the cordis_projects table.

Parameters:	fks (`list` of SqlAlchemy.ForeignKey) – All foreign keys for a given table.
Returns:	The table name corresponding to the non-Project foreign key.
Return type:	tablename (str)
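
A minimal sketch of the idea follows. To stay dependency-free it assumes each foreign key is given as a "table.column" target string (the kind of value SqlAlchemy exposes as `ForeignKey.target_fullname`) and not as a `ForeignKey` object, so it differs from the documented signature. The name `table_from_fk_sketch` and the example table names other than cordis_projects are hypothetical.

```python
def table_from_fk_sketch(fk_targets, ignore="cordis_projects"):
    """Return the single non-project table referenced by the foreign keys.

    Assumption: each target is a 'table.column' string; everything before
    the last dot is treated as the table name.
    """
    tables = {target.rsplit(".", 1)[0] for target in fk_targets}
    others = tables - {ignore}
    if len(others) != 1:
        raise ValueError(f"Expected exactly one non-{ignore} table, got {sorted(others)}")
    return others.pop()


# A link table pointing at both projects and organisations resolves to the
# organisations table.
tablename = table_from_fk_sketch(["cordis_projects.rcn", "cordis_organisations.id"])
```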

get_row(session, parent_orm, orm, data_row)

Retrieve a flat row of data corresponding to the parent relation, inferred via foreign keys.

Parameters:
    session (sqlalchemy.Session) – SQL DB session.
    parent_orm (sqlalchemy.Base) – Parent ORM to build relationship to.
    orm (sqlalchemy.Base) – SqlAlchemy ORM describing `data_row`.
    data_row (dict) – Flat row of data retrieved from orm.
Returns:	Flat row of data retrieved from parent_orm
Return type:	_row (dict)

Enrich Cordis with OpenAIRE

Tools for collecting OpenAIRE data (by Cordis project), and piping to Neo4j.

write_record_to_neo(record, output_type, graph)

A utility function which takes a record and writes it to the Neo4j graph.

Parameters:
    record (dict) – A dictionary that contains metadata about a record.
    output_type (str) – Type of record to be extracted from the OpenAIRE API. Accepts “software”, “datasets”, “publications”, “ECProjects”.
    graph (graph_session) – Connection to the Neo4j database.

get_project_soups(currentUrl, reqsession, output_type, projectID)

Gets BeautifulSoup objects for the given output type and projectID.

Parameters:
    currentUrl (str) – URL to the OpenAIRE API.
    reqsession (instance of Requests session) – Currently open HTTP request session.
    output_type (str) – Type of record to be extracted from the OpenAIRE API. Accepts “software”, “datasets”, “publications”, “ECProjects”.
    projectID (str) – EC project identifier.
Returns:	a list of BeautifulSoup objects that contain the results from API call
Return type:	souplist(list)

get_results_from_soups(souplist)

Extracts the result strings from all BeautifulSoup objects and merges them into one list.

Parameters:	souplist (list) – a list of BeautifulSoup objects that contain the results from API call
Returns:	a list of strings with results metadata
Return type:	resultlist(list)
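
The merging step can be sketched with the standard library alone. The example below swaps BeautifulSoup for `xml.etree.ElementTree` so that it runs without extra dependencies; the function name `results_from_pages`, the `result` tag name and the sample XML are assumptions made for illustration, not the OpenAIRE response format or the repository's code.

```python
import xml.etree.ElementTree as ET


def results_from_pages(pages, tag="result"):
    """Serialise every `tag` element on every parsed page into one flat list."""
    results = []
    for page in pages:
        for element in page.iter(tag):
            # encoding="unicode" returns a str, not bytes
            results.append(ET.tostring(element, encoding="unicode"))
    return results


# Two made-up response pages, as if returned by two paginated API calls.
pages = [
    ET.fromstring(
        "<response><result><title>A</title></result>"
        "<result><title>B</title></result></response>"
    ),
    ET.fromstring("<response><result><title>C</title></result></response>"),
]
resultlist = results_from_pages(pages)
```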

arXiv

All articles from arXiv, which is the world’s premier repository of pre-prints of articles in the physical, quantitative and computational sciences, are already automatically collected, geocoded (using GRID) and enriched with topics (using MAG). Articles are assigned to EU NUTS regions (at all levels) using nesta’s nuts-finder python package.

Data is transferred to EURITO’s elasticsearch server via nesta’s es2es package. The lolvelty algorithm is then applied to the data in order to generate a novelty metric for each article. This procedure is better described in this blog (see “Defining novelty”).

The indicators using this data source are presented in this other EURITO repository.

In total, 1598033 articles have been processed, of which 459371 have authors based in EU nations.

PATSTAT

All patents from the PATSTAT service have been collected in nesta’s own database using nesta’s pypatstat library. Since this database is very large, we have selected patents which belong to a patent family with a granted patent first published after the year 2000, with at least one person or organisation (inventor or applicant) based in an EU member state. This leads to 1552303 patents in the database.
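
The selection rule above can be read in more than one way. The sketch below takes one interpretation: a family qualifies if it contains a granted patent first published after 2000, and a patent is kept if its family qualifies and it lists at least one EU-based person. The `Patent` fields are hypothetical and are not PATSTAT column names, and `EU_SAMPLE` is a deliberately incomplete set of country codes.

```python
from dataclasses import dataclass

# Illustrative subset only, not the full list of EU member states.
EU_SAMPLE = frozenset({"DE", "FR", "IT", "ES", "NL"})


@dataclass
class Patent:
    appln_id: int
    family_id: int
    granted: bool
    first_pub_year: int
    person_countries: frozenset


def select_patents(patents, eu_codes=EU_SAMPLE, min_year=2000):
    """Keep patents in a qualifying family with at least one EU-based person."""
    qualifying_families = {
        p.family_id for p in patents if p.granted and p.first_pub_year > min_year
    }
    return [
        p
        for p in patents
        if p.family_id in qualifying_families and p.person_countries & eu_codes
    ]


patents = [
    Patent(1, 10, True, 2005, frozenset({"DE"})),   # granted after 2000, EU person
    Patent(2, 10, False, 1999, frozenset({"FR"})),  # same family as patent 1
    Patent(3, 20, True, 1998, frozenset({"IT"})),   # family has no grant after 2000
    Patent(4, 30, True, 2010, frozenset({"US"})),   # no EU-based person
]
selected_ids = [p.appln_id for p in select_patents(patents)]
```

Under this reading, patent 2 is kept because its family contains patent 1, even though patent 2 itself was not granted.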

Data is transferred to EURITO’s elasticsearch server via nesta’s es2es package. The lolvelty algorithm is then applied to the data in order to generate a novelty metric for each patent. This procedure is better described in this blog (see “Defining novelty”).

The indicators using this data source are presented in this other EURITO repository.

Companies

We have acquired private-sector company data under license. The dataset contains 550540 companies, of which 133641 are based in the EU.

Data is transferred to EURITO’s elasticsearch server via nesta’s es2es package. The lolvelty algorithm is then applied to the data in order to generate a novelty metric for each company. This procedure is better described in this blog (see “Defining novelty”).

The indicators using this data source are presented in this other EURITO repository.
