NTCIR-7 Patent Mining Task
INFORMATION FOR PARTICIPANTS

- Formal run -


1. Task Description

(1) This subtask identifies IPC codes for given Japanese abstracts using USPTO patent data and Patent Abstracts of Japan data, both of which are described later. IPC codes are also identified for given English abstracts using Japanese patent applications.

2. Schedule

3. Data Collections and Tools

3.1 Documents

NOTE:
The following Japanese patent application file on the ntc4pk95 DVD is corrupt:
kkh/1995A/0204/DOCUMENT/A/95100001/95100001/95100061/95100065.TXT

3.2 Tools

3.3 Training Data

4. Documents

Japanese Patent Data

The following tags are inserted into each document by means of ntc4pk.prl.
<DOC>Document
<DOCNO>Document identifier (2)
<TEXT>Text body
<PASSAGE>Passage
<PNUM>Passage identifier (3)
(2) The format of document identifier
  PATENT-JA-UPA-1993-123456
     |    |  |    |     |
     |    |  |    |     +------ INID code 11 (publication number)
     |    |  |    +------------ publication year
     |    |  +----------------- publication of unexamined patent application
     |    +-------------------- Japanese patent
     +------------------------- patent document
(3) A passage in a document is identified by the identifier of the document, suffixed with the serial number of the passage (starting with zero) in the document. Although passages are extracted from specific fields, such as the claims and the detailed description of the invention, any field can be used for categorization purposes.
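
The identifier layout in (2) and the passage numbering in (3) can be handled with a few lines of code. The following Python sketch is illustrative only and is not part of the distributed tools: the names parse_docno and passage_id are hypothetical, the pattern is inferred from the single example identifier above, and the separator placed between the document identifier and the passage serial number is an assumption, because this page does not specify it.

```python
import re
from typing import NamedTuple


class DocId(NamedTuple):
    kind: str      # "PATENT" (patent document)
    country: str   # e.g. "JA" (Japanese patent)
    pub_type: str  # e.g. "UPA" (publication of unexamined patent application)
    year: int      # publication year
    number: str    # publication number, kept as text to preserve leading zeros


# Pattern inferred from the example identifier; the width of the final
# number is not assumed, so any run of digits is accepted.
DOCNO_RE = re.compile(r"^(PATENT)-([A-Z]+)-([A-Z]+)-(\d{4})-(\d+)$")


def parse_docno(docno: str) -> DocId:
    """Split a document identifier into its five components."""
    m = DOCNO_RE.match(docno.strip())
    if m is None:
        raise ValueError(f"unrecognized document identifier: {docno!r}")
    kind, country, pub_type, year, number = m.groups()
    return DocId(kind, country, pub_type, int(year), number)


def passage_id(docno: str, serial: int, sep: str = "-") -> str:
    """Build a passage identifier: document identifier plus serial number.

    Serial numbers start at zero. The separator `sep` is an ASSUMPTION;
    check the actual <PNUM> values in the data before relying on it.
    """
    if serial < 0:
        raise ValueError("passage serial numbers start at zero")
    return f"{docno}{sep}{serial}"
```

For example, parse_docno("PATENT-JA-UPA-1993-123456") yields the year 1993 and the number "123456".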

USPTO Patent Data

The following tags are inserted into each document by means of uspto_tag.prl.
<DOC>Document
<DOCNO>Document identifier (4)
<APP-NO>Application number
<APP-DATE>Application date
<PUB-NO>Publication number
<PUB-TYPE>Publication type
<PAT-NO>Patent number
<PAT-TYPE>Patent type (5)
<PUB-DATE>Publication date
<PRI-IPC>Primary IPC
<IPC-VER>IPC version
<PRI-USPC>Primary USPC
<PRIORITY>Priority information
<CITATION>Citation(s) (6)
<INVENTOR>Inventor(s)
<ASSIGNEE>Assignee(s)
<TITLE>Title
<ABST>Abstract
<SPEC>Specification
<CLAIM>Claim(s)
Only <DOC>, <DOCNO>, <TITLE>, <ABST>, <SPEC>, and <CLAIM> can be used for categorization purposes.
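
As a safeguard against accidentally using a disallowed field, the restriction above can be enforced in code. The sketch below assumes that a document has already been parsed into a dictionary mapping tag names (without angle brackets) to their text; that representation and the name restrict_fields are hypothetical and are not provided by uspto_tag.prl.

```python
from typing import Dict

# Tags that may be used for categorization, as listed above.
ALLOWED_TAGS = frozenset({"DOC", "DOCNO", "TITLE", "ABST", "SPEC", "CLAIM"})


def restrict_fields(fields: Dict[str, str]) -> Dict[str, str]:
    """Keep only the fields whose tag may be used for categorization."""
    return {tag: text for tag, text in fields.items() if tag in ALLOWED_TAGS}
```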

(4) The format of the document identifier

  PATENT-US-GRT-1993-123456
     |    |  |    |     |
     |    |  |    |     +------ patent number
     |    |  |    +------------ publication year
     |    |  +----------------- grant data
     |    +-------------------- USPTO patent
     +------------------------- patent document
(5) Patent Type

(6) Citation(s)
Multiple citations are separated by tabs (\t). Each citation has the form "patent number of cited patent/date", where the patent number of the cited patent corresponds to <PAT-NO>. Note that the date information is incomplete.
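
The citation format in (6) can be split as sketched below. This is a minimal illustration, not part of the distributed tools: the name parse_citations is hypothetical, and because the date information is incomplete, a missing date is returned as None. The date is kept as raw text since its format is not specified on this page.

```python
from typing import List, Optional, Tuple


def parse_citations(field: str) -> List[Tuple[str, Optional[str]]]:
    """Split a <CITATION> value into (patent number, date or None) pairs."""
    citations = []
    for item in field.split("\t"):
        item = item.strip()
        if not item:
            continue  # skip empty entries, e.g. from a trailing tab
        # Split at the first "/"; whitespace around either part is dropped.
        number, _, date = item.partition("/")
        citations.append((number.strip(), date.strip() or None))
    return citations
```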

NTCIR-1, 2 CLIR Task Test Collection and Patent Abstracts of Japan (PAJ)

Please refer to the README files contained in each data set.

5. Submission Format

Each result is a file in the TREC_EVAL format, preceded by the following tags. The newline code ('\n') should be inserted only at the end of each field, except in the <RESULT> field. All text must be in English, using only ASCII characters.
<SYSTEM-ID>System identifier that is the same as the group ID
<MODE>Category assignment mode: "full-auto", "semi-auto", or "manual"
<SUBTASK>"English", "Japanese", "English to Japanese" (Cross-lingual), or "Japanese to English" (Cross-lingual)
<DOC-TAG>List of document fields used for categorization purposes
<TOPIC-DOCUMENT>Part of topic document used in categorization (e.g., "full text of patent application", "claim", "PAJ abstract", etc.)
<TRAINING-CORPUS>List of data collections used for training purposes, including the sample topics. Please also describe how you used those data
<MODEL>Name or short description of categorization model (e.g., "C4.5", "CART", "Neural Net", "Naive Bayes", "SVM", "KNN", etc.)
<RESULT>Retrieval result in the TREC_EVAL format (7)

(7) Each line in the <RESULT> field is organized in the following format:

topic-id \t 0 \t IPC \t IPC-rank \t IPC-score \t run-id

The following example is a fragment of the file named "ntc7".

  <SYSTEM-ID>ntc7</SYSTEM-ID>
  <MODE>full-auto</MODE>
  ...
  <RESULT>

  1    0    G01N_29_24    1    9999    ntc7
  1    0    A61B_8_00     2    9998    ntc7
  1    0    G01N_29_22    3    9997    ntc7
  ...
  2    0    A61C_12_00    1    9999    ntc7
  ...
  </RESULT>
The maximum number of IPC codes for a single topic is 1000.
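
A run file in the format described above could be produced as in the following Python sketch. It is an illustration under stated assumptions, not an official tool: the function names and the input structure (a dictionary from topic ID to a list of (IPC code, score) pairs) are hypothetical, and closing tags are written as in the example fragment above.

```python
from typing import Dict, List, Tuple

MAX_IPC_PER_TOPIC = 1000  # limit stated above


def format_result_lines(
    scores: Dict[str, List[Tuple[str, float]]], run_id: str
) -> List[str]:
    """Return tab-separated lines: topic-id, 0, IPC, rank, score, run-id."""
    lines = []
    for topic_id, candidates in scores.items():
        # Highest score first, truncated to the per-topic limit.
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        for rank, (ipc, score) in enumerate(ranked[:MAX_IPC_PER_TOPIC], start=1):
            fields = [topic_id, "0", ipc, str(rank), f"{score:g}", run_id]
            lines.append("\t".join(fields))
    return lines


def build_submission(header: Dict[str, str], result_lines: List[str]) -> str:
    """Assemble the header tags and the <RESULT> field into one text."""
    # One newline at the end of each header field, as required above.
    parts = [f"<{tag}>{value}</{tag}>\n" for tag, value in header.items()]
    parts.append("<RESULT>\n" + "\n".join(result_lines) + "\n</RESULT>\n")
    text = "".join(parts)
    text.encode("ascii")  # raises UnicodeEncodeError if non-ASCII text is present
    return text
```

The header dictionary should contain all of the tags listed above (<SYSTEM-ID>, <MODE>, <SUBTASK>, and so on) in the required order; only two are shown in the example fragment.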

6. Submission

The formal run results should be submitted from this page.

The submission deadline is June 13, 2008.

7. Results

The evaluation results will be released by June 23, 2008. The run files submitted by all participant groups, together with the correct answers, will be distributed to the participant groups that submit their run files on June 13, 2008 (8). With these data, you can evaluate your system and compare its effectiveness with that of other systems. Please use this opportunity to present your research at international conferences and in journals.

(8) The run files and the correct answers will not be distributed to participant groups that do not submit run files.

8. Topics

9. Round Table Meeting

A round table meeting was held on March 31, 2008 at the National Institute of Informatics, Japan. The following are the materials for the meeting and the memorandum of the meeting.

Organizers

Contact

 

Last modified on May 14, 2008
