Welcome to Bryan's Home Page for MARC-related Perl code
My name is Bryan Baldus. I am a cataloger at Quality Books, Inc., in Oregon, Illinois.
This page has been set up, initially, to distribute a number of Perl scripts (and modules) I have written to deal with MARC21/USMARC records.
Please see the manifest.htm and readme.htm for more information, along with the modules and scripts themselves.
Perl code files on the site end in .txt to facilitate downloading. Change the extension to .pm for BBMARC, Lintadditions, Errorchecks, and CodeData (MARC::Lint::CodeData) and to .pl for the others.
The above-mentioned modules are based on, or are extensions to, the MARC::Record distribution, and are named MARC::[module name] (with CodeData being in a MARC::Lint directory).
They are referred to on this site as either *.pm or MARC::*.
The modules contain a number of known issues/to-do lists, and some checks are specific to Quality Books Inc.'s records.
Site arrangement:
bryanmodules
fullrecscripts
cleanupscripts
inprocess
prevversions
Each directory's contents are described in manifest.htm
Changes:
(May 25, 2008)
Module updates:
Errorchecks.pm:
Version 1.14: Updated Oct. 21, 2007, Jan. 21, 2008, May 20, 2008. Released May 25, 2008.
- Updated %ldrbytes with leader/19 per Update no. 8, Oct. 2007. Check for validity of leader/19 not yet implemented.
- Updated _check_book_bytes with code '2' ('Offprints') for 008/24-27, per Update no. 8, Oct. 2007.
- Updated check_245ind1vs1xx($record) with TODO item and comments
- Updated check_bk008_vs_300($record) to allow "leaves of plates" (as opposed to "leaves", when no p. or v. is present), "leaf", and "column"(s).
- Updated test in Errorchecks.t to remove check for LCCN starting with year greater than the current year. The test used 2008, which is no longer in the future. A test that is less likely to break with the passage of time may be implemented later.
(Oct. 21, 2007)
Module updates:
Lintadditions.pm:
Version 1.13: Updated Oct. 21, 2007. Released Oct. 21, 2007.
- Updated check_100 (and by call, all check_1xx, check_7xx, and check_8xx):
- Non-numeric reduced from non-digits to [0-5, 79], since 6 and 8 follow different rules.
- Added check for punctuation preceding $e.
- Updated check_260, check_440, and check_490 to deal with subfield 6 being 1st when checking for subfield a as first subfield.
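As a rough illustration of the "subfield 6 being 1st" handling noted above, the idea can be sketched on a plain list of subfield codes. This is not the module's actual code; the sub name subfield_a_is_first is hypothetical, and the sketch assumes the codes are passed in field order.

```perl
use strict;
use warnings;

# Hypothetical helper: given subfield codes in field order, return 1 if
# 'a' is the first subfield, ignoring a single leading subfield '6'.
sub subfield_a_is_first {
    my @codes = @_;
    shift @codes if @codes && $codes[0] eq '6';
    return (@codes && $codes[0] eq 'a') ? 1 : 0;
}

for my $codes (['a', 'b', 'c'], ['6', 'a', 'b'], ['b', 'a']) {
    printf "%-8s %s\n", join('', @$codes),
        subfield_a_is_first(@$codes) ? 'ok' : 'subfield a is not first';
}
```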
(Oct. 3, 2007)
Module updates:
Errorchecks.pm:
Version 1.13: Updated Aug. 26, 2007. Released Oct. 3, 2007.
- Uncommented valid MARC 21 leader values in %ldrbytes to remove local practice. Libraries wishing to restrict leader values should comment out individual bytes to enable errors when an unwanted value is encountered.
- Added ldrvalidate.t.pl and ldrvalidate.t tests.
- Includes version 1.18 of MARC::Lint::CodeData.
MARC::Lint::CodeData.pm:
Versions 1.15 to 1.18: Updated Feb. 28, 2007-Aug. 14, 2007.
- Added new source codes from Technical Notice of Aug. 13, 2007.
- Added new source codes from Technical Notice of July 13, 2007.
- Added new source codes from Technical Notice of Apr. 5, 2007.
- Added new country and geographic codes from Technical Notice of Feb. 28, 2007.
- Added 'yu ' to list of obsolete codes.
Lintadditions.pm:
Version 1.12: Updated Mar. 1-Aug 26, 2007. Released Oct. 3, 2007.
- Updated check_042 with new code, ukblderived, from Technical Notice for Aug. 13, 2007.
- Updated check_042 with new code, scipio, from Technical Notice for Mar. 1, 2007.
- Updated check_xxx methods (check_250) to account for subfield '6' as 1st subfield.
(Feb. 25, 2007)
Module updates:
Errorchecks.pm:
Version 1.12: Updated July 5-Nov. 17, 2006. Released Feb. 25, 2007.
- Updated check_bk008_vs_300($record) to look for extra p. or v. after parenthetical qualifier.
- Updated check_bk008_vs_300($record) to look for missing period after 'col' in subfield 'b'.
- Replaced $field->tag() with $tag in error message reporting in check_nonpunctendingfields($record).
- Turned off 50-field limit check in check_fieldlength($record).
- Updated parse008vs300b($illcodes, $field300subb) to look for /map[ \,s]/ rather than just 'map' when 008 is coded 'b'.
- Updated check_bk008_vs_bibrefandindex($record) to look for spacing on each side of parenthetical pagination.
- Updated check_internal_spaces($record) to report 10 characters on either side of each set of multiple internal spaces.
- Uncommented level-5 and level-7 leader values as acceptable. Level-3 is still commented out, but could be uncommented for libraries that allow it.
- Includes version 1.14 of MARC::Lint::CodeData.
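The check_internal_spaces($record) change listed above (reporting 10 characters on either side of each set of multiple internal spaces) might look roughly like the following when applied to a plain string. This is only a sketch, not the code in Errorchecks.pm; the sub name report_internal_spaces is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one context snippet per run of 2+ spaces,
# keeping up to 10 characters on either side of the run.
sub report_internal_spaces {
    my ($text) = @_;
    my @contexts;
    while ($text =~ / {2,}/g) {
        my ($start, $end) = ($-[0], $+[0]);          # offsets of this run
        my $from = $start > 10 ? $start - 10 : 0;    # up to 10 chars before
        push @contexts, substr($text, $from, ($end - $from) + 10);
    }
    return @contexts;
}

my @found = report_internal_spaces('Includes bibliographical  references and index.');
print "Multiple internal spaces near: [$_]\n" for @found;
```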
MARC::Lint::CodeData.pm:
Versions 1.09 to 1.14: Updated June 26, 2006-Jan. 8, 2007.
- Added new source codes from Technical Notice of Jan. 5, 2007.
- Added new source codes from Technical Notice of Nov. 14, 2006.
- Added new source code from Technical Notice of Oct. 19, 2006.
- Added new source codes from Technical Notice of Oct. 17, 2006.
- Added new source codes from Technical Notice of Aug. 29, 2006.
- Added new source codes from Technical Notice of June 23, 2006.
Lintadditions.pm:
Version 1.11: Updated June 12, 2006-Feb. 7, 2007. Released Feb. 25, 2007.
- Updated check_130(), check_6xx, and check_7xx to check for proper punctuation before subfield _l.
- Updated check_240() to allow ? and ! before subfield _l, based on LCRI revision in 2006.
- Updated check_050() to report error if subfield _a doesn't start with capital letters followed by digits.
- Updated check_050() to report error if subfield _a ends in capital letter.
- Replaced $field->tag() in warning statements with $tagno.
- Revised $2 validation to split on '/', thus ignoring edition and language additions on valid codes
- Updated check_440 to look for miscoded 2nd ind. using MARC::Lint::_check_article().
- Updated 130, 240, 630, 730, and 830 checks to look for article, using MARC::Lint::_check_article().
- Updated check_042() with source code from technical notice of Sept. 29, 2006.
- Added TO DO item for determining whether check_130, 630, and 730 can use the same code.
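The revised $2 validation listed above (splitting on '/' so that edition and language additions are ignored) can be sketched as follows. The code list here is a tiny illustrative subset (the real data lives in MARC::Lint::CodeData), the sample values are made up, and the sub name source_code_is_valid is hypothetical.

```perl
use strict;
use warnings;

# Illustrative subset only; the real list lives in MARC::Lint::CodeData.
my %valid_sources = map { $_ => 1 } qw(gsafd sears);

# Hypothetical helper: validate only the part of $2 before the first '/'.
sub source_code_is_valid {
    my ($subfield_2) = @_;
    return 0 unless defined $subfield_2;
    my ($base_code) = split m{/}, $subfield_2;
    return (defined $base_code && $valid_sources{$base_code}) ? 1 : 0;
}

for my $code ('sears', 'sears/19', 'notacode/eng') {
    printf "%-14s %s\n", $code, source_code_is_valid($code) ? 'ok' : 'not in list';
}
```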
Script updates:
LCSHchangesparserpl110.txt
Version 1.10: Updated Dec. 7, 2006
- Revised parsing to allow ! within heading (e.g., !Xu)
Version 1.09: Updated Sept. 8, 2006
- Misc. fixes, including:
- Closing up spaces in 682 fields
- Parsing of new 1xx with [proposed update]
Version 1.08: Updated Sept. 4, 2006
- Reports filename of files containing AC headings (new or changed) in AC headings section.
New Module in process:
MARC::Lint::Lint_Authority.pm:
Version 0.01--Feb. 21, 2007. Posted Feb. 25, 2007
(June 19, 2006)
Module updates:
MARC::Global_Replace.pm:
Version 0.06--Updated June 18, 2006. Released June 19, 2006.
- Added subs for personal names--closed date reporting.
- Added bin/parsedeathdateslists.pl for parsing a directory of html saves from OCLC's closed date archive into a plain text tab-delimited list of old-new name pairs for use with bin/globalreplaceidentnames.pl.
(June 6, 2006)
Module updates:
Errorchecks.pm:
Version 1.11: Updated June 5, 2006. Released June 6, 2006.
- Implemented check_006($record) to validate 006 (currently only does length check).
- Revised validate008($field008, $mattype, $biblvl) to use internal sub for material specific bytes (18-34)
- Revised validate008($field008, $mattype, $biblvl) language code (008/35-37) to report new 'zxx' code availability when '   ' (3 blanks) is the code in the record.
- Added 'mgmt.' to %abbexceptions for check_nonpunctendingfields($record).
MARC::Lint::CodeData.pm:
(Most current version is available through CVS on SourceForge with MARC::Lint.)
- Versions 1.05-1.08 were updated with additions of codes from technical notices.
Lintadditions.pm:
Version 1.10: Updated Oct. 17, 2005-May 18, 2006. Released June 6, 2006.
- Added check_024() for UPC and EAN validation. Uses Business::Barcode::EAN13 and Business::UPC for these checks.
- check_042() updated with valid source codes from MARC list for sources.
- check_050() updated to report cutters not preceded by period.
- Misc. bug fixes, including turning off uninitialized warnings for short 007 bytes.
MARC::Global_Replace.pm:
Version 0.05--Updated May 1, 2006. Released June 6, 2006.
- Revised identify_changed_hdgs($field, \%heading_data, \%changed_hdgs_sub_a) attempting to resolve problem of closed dates vs. open.
Version 0.04--Updated Feb. 13, 2006. Unreleased
- Modified identify_changed_hdgs($field, \%heading_data, \%changed_hdgs_sub_a) to not report headings where new and old are identical.
- Need to strip ending periods for match to work!!
- Testing needed for sears heading changes--currently appears to fail to match
Script updates:
LCSHchangesparserpl107.txt
Version 1.07: Updated May 8, 2006
- Revised changed heading regex to include "\&" (e.g. AT&T)
Version 1.06: Updated Oct. 5, 2005
- Added 682 parsing
- New_tag is set to 682 when headings are extracted from that field
- Global_Replace will need to take these into account during parsing and comparison, since there is a chance that the parsing done by this script will produce unexpected/unreliable results.
- 682 parsing is incomplete and will likely fail on headings with qualifiers.
Version 1.05: Updated Aug. 25, 2005
- Revised parsing to account for some lines previously counted as bad.
parsedeathdateslists.pl.txt
No version. Very preliminary test code
- Help needed in stripping entities other than subfield delimiter.
- Help needed in selecting best HTML/XML parser for OCLC's closed dates lists.
- Requires pure Perl solution (no ability to use compiler or to install extra, non-Perl programs).
- Cross-platform capable, non-Unicode/capable of stripping non-ASCII characters without worrying about Mac (Classic) vs. Windows character sets.
(Jan. 2, 2006)
Module updates:
Errorchecks.pm:
Version 1.10: Updated Sept. 5-Jan. 2, 2006. Released Jan. 2, 2006.
- Revised validate008($field008, $mattype, $biblvl) to use internal subs for material specific byte checking.
- Added:
- _check_cont_res_bytes($mattype, $biblvl, $bytes),
- _check_book_bytes($mattype, $biblvl, $bytes),
- _check_electronic_resources_bytes($mattype, $biblvl, $bytes),
- _check_cartographic_bytes($mattype, $biblvl, $bytes),
- _check_music_bytes($mattype, $biblvl, $bytes),
- _check_visual_material_bytes($mattype, $biblvl, $bytes),
- _check_mixed_material_bytes,
- _reword_008(@warnings), and
- _reword_006(@warnings).
- Updated Continuing resources byte 20 from ISSN center to Undefined per MARC 21 update of Oct. 2003.
- Updated wording in findfloatinghyphens($record) to report 10 chars on either side of floaters and check_floating_punctuation($record) to report some context if the field in question has more than 80 chars.
- check_bk008_vs_bibrefandindex($record) updated to check for 'p. ' following bibliographical references when pagination is present.
- check_5xxendingpunctuation($record) reports question mark or exclamation point followed by period as error.
- check_5xxendingpunctuation($record) now checks 505.
- Updated check_nonpunctendingfields($record) to account for initialisms with interspersed periods.
- Added check_floating_punctuation($record) looking for unwanted spaces before periods, commas, and other punctuation marks.
- Renamed findfloatinghyphens($record) to fix spelling.
- Revised check_bk008_vs_300($record) to account for textual materials on CD-ROM.
- Added abstract to name.
MARC::Lint::CodeData.pm:
(Most current version is available through CVS on SourceForge with MARC::Lint.)
Version 1.04: Updated Oct. 13, 2005.
- Added new source codes from Technical Notice of Oct. 12, 2005.
Version 1.03: Updated Aug. 31, 2005.
- Added new language codes for Ainu and Southern Altai (August 30, 2005 technical notice)
(Aug. 14, 2005)
Module updates:
Errorchecks.pm:
Version 1.09: Updated July 18, 2005. Released July 19, 2005 (Aug. 14, 2005 to CPAN).
Module in process:
MARC::File::MARCMaker.pm: (zipped and uncompressed as /marc-marcmaker/)
Version 0.03: Updated Aug. 2, 2005. Released Aug. 14, 2005.
- Revised decode() to fix problem with dollar sign conversion from mnemonics to characters.
MARC::Global_Replace.pm:
Version 0.03--Updated Aug. 3, 2005. Posted Aug. 14, 2005
- Added as_hyphenated($sh_field) to break field into a string with dashes separating subfields.
- Revised subs as needed.
- First version released (to my site)
- This version is very preliminary and will not necessarily install using the normal installation process.
- The test program, global_replace_ident.txt (global_replace_ident.pl) will require modification to work on other systems (it relies on my internal module, MARC::QBI::Misc for file IO).
Script updates:
LCSHchangesparserpl104.txt
Version 1.04: Updated July 28-Aug. 4, 2005
- Added thesaurus as 5th element of output lines, telling which thesaurus the line uses.
- Outputs AC headings as separate group at the end of the compiled sorted file of headings (allhash).
- Misc. fixes.
Currently planned and in-progress tasks:
1. Clean LCSH weekly lists to identify cancelled->replaced headings. Preliminary code for this is in the inprocess directory.
2. Use the cleaned LCSH weekly list cancel/replace headings to do global SH replace. I am working on a new module, MARC::Global_Replace, to do this. It is at a very early stage, and has been posted to the inprocess directory. It includes a script, global_replace_ident.pl, to identify changed headings given 'allhash.txt' generated by the LCSH changes parser (v. 1.07+) and a file of MARC records. The script and module have undergone very minimal testing, but seem to do OK at reporting possible changed headings in MARC records.
3. Clean up some of the templatified (full record/cleanup) scripts, adding type/creator information when using MacOS, for example, along with documentation.
4. Write additional lint checks, including (these will go into MARC::Errorchecks):
- Rewrite validate007 to be more efficient and consistent in error reporting (wording) (validate008 has been updated in version 1.04 of MARC::Errorchecks).
- Match data from 006-008 vs. appropriate fields lower in record.
- GMD validation and comparison with other fields.
- Additional 300-field comparisons based on record material type.
- Validate or compare subfield code in 6xx against approved list, mainly to find miscoded geographical headings (and topicals in geographical subdivisions). (Code available upon request for geographical headings and form subdivisions)
Item 4 is shorter now, as I added a number of check_XXX functions in MARC::Lintadditions.pm and MARC::Errorchecks.pm.
5. Work on integrating MARC::Lintadditions functionality into MARC::Lint. This has begun, with check_041, check_043, and check_245. The main hold-up is getting tests written for each check_xxx method.
6. Write tests for MARC::Errorchecks, MARC::Lintadditions, and MARC::BBMARC.
7. Work on creating MARC::File::MARCMaker. This is a rewrite of the MARCMaker-related code in MARC.pm, to allow MARC::Record to work with LC's MARCMaker format files (http://www.loc.gov/marc/makrbrkr.html). This has been uploaded to SourceForge CVS (marcpm, alongside MARC::Record, MARC::Lint, etc.)
8. Move file handling and other subroutines from internal MARC::QBI::Misc module to public MARC::BBMARC module. The functions in MARC::QBI::Misc are mainly for non-command line users, to put up prompts and reduce unwanted overwriting of files.
9. Work on creating MARC::Lint::Lint_Authority.pm. This will be a module essentially copying MARC::Lint, but with a data section and methods for validating MARC format for Authority data rather than Bibliographic. An initial version of this module appears in the inprocess directory.
With the added checks, the lint checker runs a bit slower, so I welcome any suggestions for improved efficiency.
I welcome any help with any of the above, especially number 7.
(July 19, 2005)
Module updates:
Errorchecks.pm:
Version 1.09: Updated July 18, 2005. Released July 19, 2005.
- Added check_010.t (and check_010.t.pl) tests for check_010($record).
- check_010($record) revisions.
- Turned off validation of 8-digit LCCN years. Code commented-out.
- Modified parsing of numbers to check spacing for 010a with valid non-digits after valid numbers.
- Validation of 10-digit LCCN years is based on current year.
- Fixed bug of uninitialized values for matchpubdates($record) 050 and 260 dates.
- Corrected comparison for year entered < 1980.
- Removed AutoLoader (which was a remnant of the initial module creation process)
(July 16, 2005)
Module updates:
Lintadditions.pm:
Version 1.09: Updated Mar. 31-Apr., 2005. Released July 16, 2005.
- check_260() updated to report error if subfield 'a' and 'b' are not present.
- More '==' etc. changed to 'eq' etc. for indicators.
- check_082() updated to set $dewey to empty string if no 082$a is present before checking for 3 digits.
Errorchecks.pm:
Version 1.08: Updated Feb. 15-July 11, 2005. Released July 16, 2005.
- Added 008errorchecks.t (and 008errorchecks.t.txt) tests for 008 validation
- Added check of current year, month, day vs. 008 creation date, reporting error if creation date appears to be later than local time. Assumes 008 dates of 00mmdd to 70mmdd represent post-2000 dates.
- This is a change from previous range, which gave dates as 00-06 as 200x, 80-99 as 19xx, and 07-79 as invalid.
- Added _get_current_date() internal sub to assist with check of creation date vs. current date.
- findemptysubfields($record) also reports error if period(s) and/or space(s) are the only data in a subfield.
- Revised wording of error messages for validate008($field008, $mattype, $biblvl)
- Revised parse008date($field008string) error message wording and bug fix.
- Bug fix in video007vs300vs538($record) for gathering multiple 538 fields.
- Added check in check_5xxendingpunctuation($record) for space-semicolon-space-period at the end of 5xx fields.
- Added field count check for more than 50 fields to check_fieldlength($record).
- Added 'webliography' as acceptable 'bibliographical references' term in check_bk008_vs_bibrefandindex($record), even though it is discouraged. Consider adding an error message indicating that the term should be 'bibliographical references'?
- Code indenting changed from tabs to 4 spaces per tab.
- Misc. bug fixes including changing '==' to 'eq' for tag numbers, bytes in 008, and indicators.
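The creation date check described above (008 dates of 00mmdd to 70mmdd treated as post-2000, then compared against the current date) can be sketched like this. It is a minimal illustration, not the module's code: the sub names are hypothetical, and treating years 71-99 as 19xx is an assumption that the changelog does not state outright.

```perl
use strict;
use warnings;

# Hypothetical helper: expand a 6-character 008 creation date (yymmdd)
# to yyyymmdd, treating years 00-70 as 20xx and (by assumption) 71-99 as 19xx.
sub expand_008_date {
    my ($yymmdd) = @_;
    return undef unless defined $yymmdd && $yymmdd =~ /^(\d{2})(\d{4})$/;
    my ($yy, $mmdd) = ($1, $2);
    my $century = $yy <= 70 ? '20' : '19';
    return "$century$yy$mmdd";
}

# Hypothetical helper: returns 1 if the creation date is later than $today.
# $today is a yyyymmdd string; it defaults to the local date.
sub creation_date_is_future {
    my ($yymmdd, $today) = @_;
    unless (defined $today) {
        my @t = localtime;
        $today = sprintf '%04d%02d%02d', $t[5] + 1900, $t[4] + 1, $t[3];
    }
    my $expanded = expand_008_date($yymmdd);
    return 0 unless defined $expanded;
    return $expanded gt $today ? 1 : 0;    # fixed-width digits compare as strings
}

for my $date ('050801', '991231') {
    printf "%s -> %s%s\n", $date, expand_008_date($date),
        creation_date_is_future($date, '20050716') ? ' (later than 20050716)' : '';
}
```

Passing the comparison date as a parameter keeps the sketch testable without depending on the clock.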
MARC::Lint::CodeData.pm:
Version 1.02: Updated June 21-July 12, 2005. Released (to CPAN) with new version of MARC::Errorchecks. Also posted to CVS on SourceForge with MARC::Lint.
- Added GAC and Country code changes for Australia (July 12, 2005 update)
- Added 6xx subfield 2 source code data for June 17, 2005 update.
- Updated valid Language codes to June 2, 2005 changes.
Module in process:
MARC::File::MARCMaker.pm: (zipped and uncompressed as /marc-marcmaker/)
Version 0.02: Updated July 12-13, 2005. Released July 16, 2005.
- Preliminary version of encode() for fields and records.
- Appears to work when no special chars are present (including dollar signs).
- See TODO.txt and readme0.02.txt for list of items still needing to be done and other notes.
- Note: This is a pre-alpha release, and little testing has been done on the results of decode() or encode().
Added and changed scripts:
Updated LCSH Changes Parser script, LCSHchangesparserpl103.txt:
- Now creates files with tab-separated lines: old_tag \t old_hdg \t new_tag \t new_hdg.
- Better parsing of weekly files.
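A consumer of the tab-separated output described above (old_tag, old_hdg, new_tag, new_hdg) could read each line roughly as follows. This is a sketch with made-up sample data; the sub name parse_heading_change is hypothetical and is not part of the parser script.

```perl
use strict;
use warnings;

# Hypothetical helper: split one output line into its four named parts.
sub parse_heading_change {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                      # drop the line ending
    my ($old_tag, $old_hdg, $new_tag, $new_hdg) = split /\t/, $line;
    return {
        old_tag => $old_tag,
        old_hdg => $old_hdg,
        new_tag => $new_tag,
        new_hdg => $new_hdg,
    };
}

my $change = parse_heading_change("150\tExample old heading\t150\tExample new heading\n");
print "$change->{old_tag} $change->{old_hdg} => $change->{new_tag} $change->{new_hdg}\n";
```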
(Mar. 7, 2005)
New Module in process:
MARC::File::MARCMaker.pm: (zipped and uncompressed as /marc-marcmaker/)
Version 0.01: Initial version, Nov. 21, 2004-Mar. 7, 2005. Released Mar. 7, 2005.
- Basic version, translates .mrk format file into MARC::Record objects.
- See TODO.txt for list of items still needing to be done.
- Note: This is a pre-alpha release, and little testing has been done on the results of decode().
(Feb. 27, 2005)
Module updates:
Lintadditions.pm:
Version 1.08: Updated Feb. 21-27, 2005. Released Feb. 27, 2005.
- Revision of check_020() in preparation for move to MARC::Lint.
- Moved check_020() to MARC::Lint (remains here during testing).
- validate007() revised to deal with possibility of subfields existing in pre-010 fields, or other non-legitimate 1st characters existing.
(Feb. 13, 2005)
Module updates:
Errorchecks.pm:
Version 1.07: Updated Dec. 11-Feb. 2005. Released Feb. 13, 2005.
- check_double_periods() skips field 856, where multiple punctuation is possible for URIs.
- Added code in check_internal_spaces() to account for spaces between angle brackets in open dates in field 260c.
- Updated various subs to verify that 008 exists (and quietly return if not; check_008 will report the error).
- Changed #! line, removed -w, replaced with use warnings.
- Added error message to check_bk008_vs_bibrefandindex($record) if 008 book index byte is not 0 or 1. This will result in duplicate errors if check_008 is also called on the record.
Lintadditions.pm:
Version 1.07: Updated Jan. 2-Feb. 1, 2005. Released Feb. 13, 2005.
- Updated check_260 to account for angle brackets for open dates in subfield c.
- Updated check_020 to handle 13-digit ISBNs. This relies upon the new internal _isbn13_check_digit($ean), temporary until Business::ISBN handles 13-digit ISBNs directly.
- Added basic check to check_600 (and by call, other 6xx) for subfield 2 codes. Similar code duplicated in check_655, due to difference in code lists for each field. Still need to deal with obsolete code error reporting.
- Moved check_245 to MARC::Lint (retained here as a POD section during testing).
- Moved check_041 and check_043 to MARC::Lint (retained here as a POD section during testing).
- Added warning to check_007 for obsolete byte 2.
- Removed pod info related to changes needed to MARC::Lint (which has been updated).
- Misc. cleanup.
- Revised check_1xx, check_6xx, check_7xx, and check_8xx to use check_100, etc. (to avoid code duplication). (based on code from Ian Hamilton)
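For reference, the standard ISBN-13 (EAN-13) check digit calculation that a helper like the internal _isbn13_check_digit($ean) mentioned above would need to perform can be sketched as below. This is not the module's implementation; the sub name isbn13_check_digit and its handling of hyphens are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in: computes the check digit from the first 12 digits
# of an EAN/ISBN-13 (any 13th digit supplied is ignored).
sub isbn13_check_digit {
    my ($ean) = @_;
    (my $digits = $ean) =~ s/[^0-9]//g;       # drop hyphens and spaces
    return undef unless length($digits) >= 12;
    my @d = split //, substr($digits, 0, 12);
    my $sum = 0;
    for my $i (0 .. 11) {
        $sum += $d[$i] * ($i % 2 ? 3 : 1);    # weights 1,3,1,3,...
    }
    my $check = (10 - $sum % 10) % 10;
    return $check;
}

my $isbn13 = '978-0-306-40615-7';
(my $bare = $isbn13) =~ s/[^0-9]//g;
my $expected = isbn13_check_digit($isbn13);
print "$isbn13: ",
    (substr($bare, 12, 1) eq $expected ? 'check digit ok' : 'bad check digit'), "\n";
```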
MARC::Lint::CodeData.pm:
Version 1.01: Updated Jan. 5-Feb. 10, 2005. Released (to CPAN) Feb. 13, 2005 (with new version of MARC::Errorchecks).
- Added code list data for 600-651 subfield 2 and for 655 subfield 2 sources.
- Updated codes based on changes made Jan. 19 (languages), Feb. 2 (sources), Feb. 9 (sources).
Added and changed scripts:
See the manifest.htm page for more information about these. All are in fullrecscripts.
- extractisbn.txt
- fieldindexer.txt
- lintallchecks.txt
- linttest.txt
- EAN_ISBNconverter.txt
- hasnosears.txt
- Errorchecks.t.txt
(Dec. 5, 2004)
New module:
MARC::Lint::CodeData.pm:
Version 1.00 (original version): First release, Dec. 5, 2004. Uploaded to SourceForge CVS, Jan. 3, 2005.
- Included in MARC::Errorchecks distribution on CPAN.
- Used by MARC::Lintadditions.
Module updates:
Errorchecks.pm:
Version 1.04: Updated Nov. 4-Dec. 4, 2004. Released Dec. 5, 2004.
- Updated validate008() to use MARC::Lint::CodeData.
- Removed DATA section, since this is now in MARC::Lint::CodeData.
- Updated check_008() to use the new validate008().
- Revised bib. refs. check to require 'reference' to be followed by optional 's', optional period, and word boundary (to catch things like 'referenced').
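The bib. refs. pattern described in the last item can be approximated with a regular expression like the one below. This is a sketch of the idea only; the exact expression used in Errorchecks.pm may differ, and the sub name mentions_bib_references is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 'reference', optional 's', optional period, then a
# word boundary, so that 'referenced' does not count as a match.
sub mentions_bib_references {
    my ($note) = @_;
    return $note =~ /references?\.?\b/ ? 1 : 0;
}

for my $note ('Includes bibliographical references (p. 200-210) and index.',
              'Works referenced in the text are listed separately.') {
    printf "%s: %s\n", mentions_bib_references($note) ? 'match' : 'no match', $note;
}
```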
Lintadditions.pm:
Version 1.06: Updated Nov. 21-24, 2004. Released Dec. 5, 2004.
- Removed readcodedata(), replaced with separate data pack, MARC::Lint::CodeData.
- Updated check_040, check_041 and check_043 to use MARC::Lint::CodeData.
- Deleted the DATA section based on the above changes.
- Misc. bug fixes.
- Reports 13 digit ISBNs as errors pending updating of Business::ISBN to account for 13 digit ISBNs.
BBMARC.pm:
Version 1.08: Updated Oct 31, 2004. Released Dec. 5, 2004.
- New method, as_array, an add-on to MARC::Field which breaks down a MARC::Field object into a flat array, returns a ref to that array.
- Misc. cleanup.
Added and changed scripts:
See the manifest.htm page for more information about these. All but the last are in fullrecscripts. The last is in inprocess.
- Errorchecks.t.txt
- extractbycontrolno.txt
- extractbyisbn.txt
- extractbystockorisbn.txt
- extractspecsubfield.txt
- fieldextractionwithregex.txt
- countrycodelistclean.txt
- gaccleanupscript.txt
- languagecodelistclean.txt
- deleteSHandDDC.txt
- EAN_ISBNconverter.txt
- printrecordasformatted.txt
- LCSHchangesparserpl102.txt
(Oct. 17, 2004)
Module updates:
Errorchecks.pm:
Version 1.03: Updated Aug. 30-Oct. 16, 2004. Released Oct. 17. First CPAN version.
- Moved subs to MARC::QBIerrorchecks
- check_003($record)
- check_CIP_for_stockno($record)
- check_082count($record)
- Fixed bug in check_5xxendingpunctuation for first 10 characters.
- Moved validate008() and parse008date() from MARC::BBMARC (to make MARC::Errorchecks more self-contained).
- Moved readcodedata() from BBMARC (used by validate008)
- Moved DATA from MARC::BBMARC for use in readcodedata()
- Removed dependency on MARC::BBMARC
- Added duplicate comma check in check_double_periods($record)
- Misc. bug fixes
- Planned (future versions):
- Account for undetermined dates in matchpubdates($record).
- Cleanup of validate008
- Standardization of error reporting
- Material specific byte checking (bytes 18-34) abstracted to allow 006 validation.
Lintadditions.pm:
Version 1.05: Updated Aug. 30-Oct. 16, 2004. Released Oct. 17, 2004.
- Moved institution-specific code from check_040 to MARC::QBIerrorchecks.
- check_040 still present to check $b language (currently commented-out)
- Moved check_037 to MARC::QBIerrorchecks.
- Updated check_082 to ensure decimal after 3rd digit in numbers longer than 3 digits.
- Moved validate007(\@bytesfrom007) from MARC::BBMARC (to make MARC::Lintadditions more self-contained).
- Fixed problem in 6xx check for subfield _2 (changed '==' to 'eq').
- Updated validate007(\@bytesfrom007) (bug fixes, misc. revisions)
- Updated check_050 to check for unfinished cutters (single capital letter followed by space or nothing)
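The check_082 update listed above (a decimal is required after the 3rd digit in numbers longer than 3 digits) comes down to a small pattern test. The following is a minimal sketch rather than the module's code; the sub name missing_dewey_decimal is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the classification number runs to four
# or more digits with no decimal point (or other character) after the 3rd.
sub missing_dewey_decimal {
    my ($dewey) = @_;
    return 0 unless defined $dewey;
    return $dewey =~ /^\d{4}/ ? 1 : 0;
}

for my $number ('813.54', '813', '81354') {
    printf "%-7s %s\n", $number,
        missing_dewey_decimal($number) ? 'decimal missing after 3rd digit' : 'ok';
}
```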
BBMARC.pm:
Version 1.07: Updated Aug. 30-Oct. 16, 2004. Released Oct. 16, 2004.
- Moved subroutine getcontrolstocknos() to MARC::QBIerrorchecks
- Moved validate007() to Lintadditions.pm
- Moved validate008() and related subs to Errorchecks.pm
- (Left readcodedata() in BBMARC, but it is now duplicated in Errorchecks.pm, along with a modified version in Lintadditions.pm).
- Also left parse008date, which may have uses outside of error checking.
- Updated read_controlnos([$filename]) with minor changes.
- This subroutine could be rewritten in a more general way, since it simply reads all lines from a file into an array and returns that array.
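A more general version along the lines suggested above might look like the following sketch. The sub name read_lines is hypothetical, and stripping line endings is an assumption about what callers would want; it handles Unix and Windows line endings only.

```perl
use strict;
use warnings;

# Hypothetical general-purpose reader: read every line of a file into an
# array (line endings removed) and return that array.
sub read_lines {
    my ($filename) = @_;
    open my $fh, '<', $filename or die "Cannot open $filename: $!";
    my @lines = <$fh>;
    close $fh;
    s/\r?\n\z// for @lines;
    return @lines;
}

# Example call (the file name is a placeholder):
# my @controlnos = read_lines('controlnos.txt');
```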
Added and changed scripts:
- 003cleanupscript.pl--Included in MARC::Errorchecks distribution, matches 001 and 003.
- 007cleanupscript.pl--Included in MARC::Errorchecks distribution, cleans bytes in 007 and reports errors.
- 010cleanupscript.pl--Included in MARC::Errorchecks distribution, cleans spacing in 010 subfield 'a'.
- cleantrailingspaces.pl--Included in MARC::Errorchecks distribution, removes spaces at the end of fields > 010, ignores 016.
- countrycodelistclean.pl--Included in MARC::Errorchecks distribution, given the ASCII-format MARC code list for Countries, creates list of code\tlanguage pairs, followed by tab-separated list of current, then obsolete, codes.
- gaccleanupscript.pl--Included in MARC::Errorchecks distribution, given the ASCII-format MARC code list for Geographic Area Codes, creates list of code\tlanguage pairs, followed by tab-separated list of current, then obsolete, codes.
- languagecodelistclean.pl--Included in MARC::Errorchecks distribution, given the ASCII-format MARC code list for Languages, creates list of code\tlanguage pairs, followed by tab-separated list of current, then obsolete, codes.
- lintallchecks.pl--Included in MARC::Errorchecks distribution.
Reorganized Full Record Script directory as seen below. Note: Many of the scripts have not been reviewed lately, and so may not work with the current versions of my modules. This is particularly true of the items in Tests for Errorchecks and Tests for Lintadditions.
Cleanup full recs
- 003cleanupscript.txt
- 007cleanupscript.txt
- 010cleanupscript.txt
- cleantrailingspaces.txt
Code list cleanup
- countrycodelistclean.txt
- gaccleanupscript.txt
- languagecodelistclean.txt
Counting
- comparemerge.txt
- countrecords.txt
- countrecsbytype.txt
- errreptcount.txt
- fieldsubfieldcounts.txt
Extraction
- extractbycontrolno.txt
- extractbycontrolnoignrspace.txt
- extracterrorsfrommodules.txt
- extractnonbookby008date.txt
- extractpcip.txt
- extractspecsubfield.txt
- fieldextraction.txt
- fieldextraction3.txt
- fieldextractioncleanspaces.txt
- fieldextractionnocontrols.txt
findmultiplefields.txt
hasbeenupdated.txt
Linting
- lintallchecks.txt
- lintcheck2.txt
- linttest.txt
- lintwithadditions.txt
- lintwithadditionsselective.txt
mermarcfiles.txt
outputchangestogether.txt
printrecordasformatted.txt
rawanddecodedscan.txt
splitmarcfile.pl.txt
Tests for Errorchecks
- 008checker.txt
- 008illvs300.txt
- 008matchvsotherfields.txt
- checkcipforstockno.txt
- Errorchecks.t.txt
- findemptysubfields.txt
- findlongrecords.txt
- findmultiperiodsafter010.txt
- findmultiplefields.txt
- findmultispacesafter010.txt
- findunderscoredollarinfield.txt
- ldrvalidatescript.txt
- pubdatecomparisons.txt
- testgetdate.txt
- testnewerrorchecks.txt
- viddvdvsvhs.txt
Tests for Lintadditions
- check022script.txt
- isbnvalidatescript.txt
- lintwithadditionsselective.txt
- validate007.t.txt
(Aug. 22, 2004):
Module updates:
Errorchecks.pm:
Version 1.02: Updated Aug. 11-22, 2004. Released Aug. 22, 2004.
- Implemented VERSION (uncommented)
- Added check for presence of 040 (check_040present($record)).
- Added check for presence of 2 082s in full-level, 1 082 in CIP-level records (check_082count($record)).
- Added temporary (test) check for trailing punctuation in 240, 586, 440, 490, 246 (check_nonpunctendingfields($record))
- which should not end in punctuation except when the data ends in such.
- Added check_fieldlength($record) to report fields longer than 1870 bytes.
- This should be rewritten to use the length in the directory of the raw MARC.
- Fixed workaround in check_bk008_vs_bibrefandindex($record) (Thanks again to Rich Ackerman).
Lintadditions.pm:
Version 1.04: Updated Aug. 10-22, 2004. Released Aug. 22, 2004.
- Implemented VERSION (uncommented)
- Revised check_050 exception (Thank you to all who posted about this).
- Moved VERSION HISTORY to end of module.
- Added preliminary checking of 245 2nd indicator in check_245 (Thanks to Ian Hamilton).
BBMARC.pm:
Version 1.06: Updated Aug. 10-22, 2004. Released Aug. 15, 2004.
- Implemented VERSION (uncommented)
- Added subroutine getcontrolstocknos()
- General readability cleanup (added tabs)
- Bug fix in validate008 for date2 check
Planned (next release):
- Cleanup of validate008 (and validate007)
- Standardization of error reporting
- Material specific byte checking (bytes 18-34) abstracted to allow 006 validation.
Added and changed scripts:
Updated LCSH Changes Parser script, LCSHchangesparser2.txt:
- Adds 500 to tag number if it is 1xx, so that it becomes 600-655, in preparation for use in global replacement.
- Misc. fixes.
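The tag adjustment described above (adding 500 to a 1xx authority tag so that, for example, 150 becomes 650) can be sketched as follows. The sub name subject_tag_for is hypothetical and the sketch is not taken from the parser script.

```perl
use strict;
use warnings;

# Hypothetical helper: map a 1xx tag to the corresponding 6xx tag by
# adding 500; any other tag is returned unchanged.
sub subject_tag_for {
    my ($tag) = @_;
    return $tag =~ /^1\d\d$/ ? $tag + 500 : $tag;
}

print "$_ -> ", subject_tag_for($_), "\n" for qw(100 150 151 155 682);
```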
(Aug. 15, 2004):
Module updates as described above (prereleased).
Added and changed scripts:
Updated LCSH Changes Parser script, LCSHchangesparserpl2.txt:
- Now prints 682 and 260 if cancelled heading is followed by one of these instead of a replacement 1xx.
- Prints out to an all.txt file (containing all changed 1xx headings) and to a bad.txt file (containing any cancelled headings not followed by a 1xx or headings with characters not matching the regular expression for finding an old heading).
(Aug. 8, 2004):
Module updates:
Errorchecks.pm:
Version 1.01: Updated July 20-Aug. 7, 2004. Released Aug. 8, 2004.
- Temporary (or not) workaround for check_bk008_vs_bibrefandindex($record) and bibliographies.
- Removed variables from some error messages and cleanup of messages.
- Code readability cleanup.
- Added subroutines
- check_240ind1vs1xx($record) -- Reports errors based on whether 240 and 1xx are both present and first indicator is 1 or 0.
- check_041vs008lang($record) -- Compares first code in subfield 'a' of 041 vs. 008 bytes 35-37.
- check_5xxendingpunctuation($record) -- Looks for final punctuation in several of the 5xx fields.
- findfloatinghyphens($record) -- Looks for space-hyphen-space in each field (in a list of given fields)
- video007vs300vs538($record) -- In video records, compares 007 values vs. 300 and 538 fields. Limited to VHS, DVD, and Video CD.
- ldrvalidate($record) -- Checks for valid bytes in the user-changeable leader bytes.
- geogsubjvs043($record) -- Reports missing 043 if 651 or 6xx$z is present.
- has list of exceptions (e.g. English-speaking countries)
- findemptysubfields($record) -- Looks for empty subfields (e.g. $x$xPsychology.)
- Changed subroutines:
- added cross-checking for codes a, b, c, g (ill., map(s), port(s)., music)
- added checking for 'p. ' or 'v. ' or 'leaves ' in subfield 'a'
- added checking for 'cm.', 'mm.', 'in.' in subfield 'c'
- revised check for 'm', phono. (which QBI doesn't currently use)
- Added check in check_bk008_vs_bibrefandindex($record) for 'Includes index.' (or indexes) in 504
- This has a workaround I would like to figure out how to fix.
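Two of the simpler checks listed above, the floating-hyphen and empty-subfield tests, could be sketched roughly as below. This is an illustration only: the subroutine names and message wording are hypothetical, and each field is modelled as a tag plus a list of [code, data] pairs, whereas the subroutines in Errorchecks.pm take a MARC::Record object.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for two of the checks listed above. They work on
# plain data (a tag and [code, data] pairs), not on MARC::Record objects.

# Report subfields whose data contains space-hyphen-space.
sub floating_hyphens {
    my ($tag, @subfields) = @_;
    my @errors;
    for my $subfield (@subfields) {
        my ($code, $data) = @$subfield;
        push @errors, "${tag}: Floating hyphen in subfield $code."
            if $data =~ / - /;
    }
    return @errors;
}

# Report subfields whose data is empty or contains only whitespace.
sub empty_subfields {
    my ($tag, @subfields) = @_;
    my @errors;
    for my $subfield (@subfields) {
        my ($code, $data) = @$subfield;
        push @errors, "${tag}: Subfield $code is empty."
            if $data !~ /\S/;
    }
    return @errors;
}

# Sample 650 with one floating hyphen and one empty subfield x.
my @field = (['a', 'Cats - Behavior'], ['x', ''], ['x', 'Psychology.']);
print "$_\n" for floating_hyphens('650', @field), empty_subfields('650', @field);
```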
Lintadditions.pm:
Version 1.03: Updated July 20-Aug. 7, 2004. Released Aug. 8, 2004.
- Added check_1xx and check_7xx sets -- Checks for proper ending punctuation (100, 110, 111, 130, 700, 710, 711, 730, 740).
- Added checks for non-filing indicator in 130, 630, 730, 740, and 830.
- Added indicator check for 700--ind1 == 3 -> error.
- Added validation of 041 against MARC Code List for Languages.
- Added check_028 and check_037 -- IOrQBI specific for 037, both verify presence of subfield 'b'.
- Removed some variables from warning messages.
- Added check_050 -- Reports double periods before Cutters (with exception--date after first Cutter in subfield 'a' followed by subfield 'b' item Cutter).
- Added check_040 (IOrQBI specific).
- Added check_440 and check_490.
- Added check_246.
- Changed check_245 ending punctuation errors based on MARC21 rule change vs. LCRI 1.0C from Nov. 2003.
- Added check for square brackets in 245 $h.
- Added check for 260 ending punctuation.
Added and changed scripts:
Most of these are test scripts created while writing the subroutines listed above.
The subroutines in the modules may have code not in the scripts, so it is best to use the module rather than the script for those checks (the last 3 full record scripts).
- Full record:
- fieldsubfieldcounts.txt -- Field and subfield count--will report totals for each tag and subfield.
- First version: Field tag counts only.
- testnewerrorchecks.txt -- Test script to call new subroutines in Errorchecks.pm (MARC::Errorchecks).
- ldrvalidatescript.txt -- In Errorchecks.pm
- viddvdvsvhs.txt -- In Errorchecks.pm.
- findemptysubfields.txt -- Looks for empty subfields. Skips 037 in CIP-level records. In Errorchecks.pm.
- Cleanup:
- find050doubleperiod.txt -- Test regex for finding pattern in 050$a. Preliminary code for MARC::Lintadditions::check_050()
- removetitlefromlintrpt.txt -- Removes titles from lintallchecks' output file.
- findmissing300apunctuation.txt -- Looks for missing period after p or v in 300a extract file. Initial step for MARC::Errorchecks::check_bk008_vs_300($record) code.
(July 17, 2004):
Module updates:
Errorchecks.pm:
Version 1.00 (update to 0.95): First release, July 17, 2004.
- Fixed bugs causing check_003 and check_010 subroutines to fail (Thanks to Rich Ackerman)
- Added to documentation
- Misc. cleanup
- Added subroutines (MARC::Errorchecks::*):
- check_end_punct_300($record)
- check_bk008_vs_300($record)
- check_490vs8xx($record)
- check_245ind1vs1xx($record)
- matchpubdates($record)
- check_bk008_vs_bibref($record)
- check_bk008_vs_bibrefandindex($record)
- Added skip of 787 fields to check_internal_spaces($record)
Lintadditions.pm:
Version 1.02: Updated July 2-16, 2004. Released July 17, 2004.
- Cleaned up some of the documentation
- Added global variable in hopes of improving efficiency of language/GAC/country code validation
- Modified check_043 and/or MARC::Lintadditions::readcodedata() to use the new global variable.
- Added check_6xx subroutines (600, 610, 611, 650, 651, 655)
- Added check for space between initials in 245 $c in check_245
- Added check_042 (valid values: lcac, lccopycat, pcc, nsdp)
- Added check_020 (relies upon Business::ISBN module)
- Added check_022 (relies upon Business::ISSN module)
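A check against a short list of valid values, such as the one described for check_042, might be sketched like this. The helper name invalid_042_codes is hypothetical and the code works on a plain list of subfield 'a' values; only the four valid codes come from the entry above.

```perl
use strict;
use warnings;

# The four authentication codes named in the check_042 entry above.
my %valid_042 = map { $_ => 1 } qw(lcac lccopycat pcc nsdp);

# Hypothetical helper: return the 042 subfield 'a' values that are not
# in the list of valid codes.
sub invalid_042_codes {
    my (@codes) = @_;
    return grep { !$valid_042{$_} } @codes;
}

# Report each unrecognized code from a sample list.
print "042: Unknown authentication code '$_'.\n"
    for invalid_042_codes(qw(pcc lcc nsdp));
```

A hash lookup keeps the check independent of the order of the codes and makes it easy to extend the list later.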
BBMARC.pm:
Version 1.05: Updated July 3, 2004, released July 17, 2004
- Cleaned some documentation
- Added global variable in hopes of improving efficiency of language/GAC/country code validation
- Modified MARC::BBMARC::validate008 and/or MARC::BBMARC::readcodedata() to use the new global variable.
- Moved MARC::BBMARC::readcodedata() and MARC::BBMARC::parse008date above MARC::BBMARC::validate008
Added and changed scripts:
- Updated lintallchecks.pl: Calls MARC::Lint, MARC::Lintadditions, and MARC::Errorchecks and outputs the controlno, title, and errors found.
- extractspecsubfield.pl: Based on fieldextraction.pl, pulls out specified subfields from a given field (or set of fields, such as 6xx)
- isbnvalidatescript.pl: Initial version of subroutine being added to Lintadditions, check_020
- testnewerrorchecks.pl: Unmaintained/initial script for testing new subroutines in Errorchecks.pm
- extractpcip.pl: Outputs records coded as CIP-level (8).
- 003cleanupscript.pl: Similar process to check_003 in Lintadditions.pm, but does the cleaning.
- cleantrailingspaces.pl: Removes spaces from the end of lines
- fieldextractioncleanspaces.pl: Field extraction code modified to clean trailing spaces and certain punctuation from the end of the field
- findmultiperiodsafter010.pl: Looks for more than one period within subfields after 010, ignoring ellipses. This has been integrated into MARC::Errorchecks
- listcomparison.txt: Compares 2 lists (uses List::Compare from CPAN). Useful with the output of the fieldextraction.pl scripts.
- cleansubfieldextracts.txt: Similar to the other fieldextraction cleanups, removes counts from subfield extraction files
(June 22, 2004):
New module:
Errorchecks.pm (MARC::Errorchecks): Collection of error checking subroutines similar to MARC::Lint and MARC::Lintadditions. This is currently version 0.95 due to problems with the subroutine calls to check_003 and check_010. Warnings indicate use of uninitialized Array references.
Associated script for using MARC::Errorchecks: lintallchecks.txt. This can replace most of the error checking scripts, along with the checking portion of the cleanup full record scripts. It should also work without changes as Errorchecks.pm is updated with new subroutines.
(June 20, 2004):
Two new scripts:
findmultispacesafter010.txt: Looks for multiple spaces in a field, for fields after 010. Could be improved by accounting for other fields where multiple spaces would be acceptable (such as 035).
010cleanupscript.txt: For 010 fields with only an 8 or 10 digit LCCN in subfield 'a', makes sure proper spacing precedes and follows the number and replaces that subfield in the record. Reports any problems with cleaning the subfield.
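The spacing step of the 010 cleanup could look something like the sketch below. It is not the script's code: the subroutine name normalize_lccn is hypothetical, and the spacing is an assumption based on the MARC 21 LCCN structures as I understand them (three blanks before and one after an 8-digit number, two blanks before a 10-digit number).

```perl
use strict;
use warnings;

# Hypothetical helper (not from 010cleanupscript): pad a digits-only LCCN
# to the assumed MARC 21 spacing. Returns undef when the value is not a
# plain 8- or 10-digit number, so the caller can report a problem.
sub normalize_lccn {
    my ($raw) = @_;
    (my $digits = $raw) =~ s/^\s+|\s+$//g;    # trim existing spacing
    return '   ' . $digits . ' ' if $digits =~ /^\d{8}$/;     # assumed structure A
    return '  ' . $digits        if $digits =~ /^\d{10}$/;    # assumed structure B
    return undef;
}

# Show the cleaned value in brackets so the spacing is visible.
for my $lccn ('2004012345', ' 98012345', 'n 98012345') {
    my $clean = normalize_lccn($lccn);
    print defined $clean ? "[$clean]\n" : "Could not clean '$lccn'\n";
}
```

LCCNs with a prefix or other extra characters are left alone in this sketch and reported instead, which mirrors the script's restriction to subfields holding only the number.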
Changes to my main modules:
Lintadditions.pm:
Version 1.01: Updated June 17, 2004. Released June 20, 2004.
- Added validation of 043 against GAC list.
- Added check_082.
- Added checks for $b, $h, $n, and $p in 245.
- Other changes/fixes.
BBMARC.pm:
Version 1.04: Updated June 16, 2004, released June 20, 2004
- Updated MARC::BBMARC::as_formatted2() to work with MARC::Record 1.38 (is_control_field() instead of is_control_tag()).
- Fixed bug in MARC::BBMARC::validate008 for visual materials running time (hyphen was not escaped, so it was being interpreted as a range indicator).
- Added MARC::BBMARC::parse008date($) to allow user to enter yymmdd and get yyyy\tmm\tdd\t$error string back (for other uses).
- Added DATA containing codes from the MARC lists for Countries, Geographic Areas, and Languages, to 2003. Each code set is separated by tabs, and Obsolete codes are given following each set of valid codes, in the same format.
- Added MARC::BBMARC::readcodedata() subroutine for reading in the data and returning the data in an array for use by validation code, such as in MARC::BBMARC::validate008()
- Modified MARC::BBMARC::validate008 subroutine to use the DATA to validate language and country codes.
Version 1.03: Updated June 10, not released.
- Contained many of the changes in 1.04, but 1.04 contains the update to MARC::BBMARC::validate008, so I wanted a new version.
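The kind of unescaped-hyphen bug mentioned in the 1.04 notes can be illustrated with the sketch below. The two patterns are hypothetical and are not the regex from validate008; they only show how a hyphen between two characters inside a character class is read as a range unless it is escaped.

```perl
use strict;
use warnings;

# Hypothetical patterns for a three-character value made of digits,
# 'n', '-', or '|'. These are illustrations, not the module's regex.
my $unescaped = qr/^[0-9n-|]{3}$/;     # "n-|" is read as a range from 'n' to '|'
my $escaped   = qr/^[0-9n\-|]{3}$/;    # "\-" is a literal hyphen

# Compare how each pattern treats a few sample values.
for my $value ('120', '---', 'nnn', 'xyz') {
    printf "%s\tunescaped=%s\tescaped=%s\n", $value,
        ($value =~ $unescaped ? 'match' : 'no match'),
        ($value =~ $escaped   ? 'match' : 'no match');
}
```

With the unescaped pattern, '---' is rejected and 'xyz' is accepted, because the range from 'n' to '|' covers the later lowercase letters but not the hyphen itself.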
(May 31, 2004):
Reorganized site arrangement. I removed separate directories for Mac, Win, and Unix, consolidating the files into the following directories:
cleanupscripts
fullrecscripts
inprocess
prevversions
bryanmodules
Each directory's contents are described in manifest.htm
The new inprocess directory contains alpha-or-so stage code, or code I may be having trouble with.
Currently this contains an LCSH Weekly Lists parser, which condenses a folder/directory of files into a file of tag-old-new headings, separated by tabs. It also compiles a file of all changed headings in the files in the input directory.
Updated MARC::BBMARC:
- Cleaned up some of the documentation.
- Added new function, updated_record_hash(), which is similar to updated_record_array() but stores raw USMARC records keyed by control number. This has not been fully tested, and will likely eat massive amounts of memory, especially for large files of records.
Added new module, MARC::Lintadditions.pm. This is an extension to MARC::Lint.pm, with added check_XXX functions (see the module for details).
Added script to go with Lintadditions.pm, lintwithadditions.pl, based on Example V3 of the MARC::Doc::Tutorial.
Added cleantrailingspaces.pl, which removes the space from the end of each field > 010. I have not yet dealt with the 010 trailing spaces cleanup.
Updated fieldextraction.pl. This should fix the problem created when I updated MARC::BBMARC::getthreedigits() to allow periods (so 6.. will retrieve all 6xx fields).
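Tag selection with periods as wildcards, as in the 6.. example above, could be sketched as follows. The helper tags_matching is hypothetical and is not MARC::BBMARC::getthreedigits(); it assumes each period stands for any single digit.

```perl
use strict;
use warnings;

# Hypothetical helper (not MARC::BBMARC::getthreedigits itself): accept a
# three-character tag pattern made of digits and periods, and return the
# tags from a list that match it, treating each period as "any digit".
sub tags_matching {
    my ($pattern, @tags) = @_;
    die "Pattern must be three digits or periods\n"
        unless $pattern =~ /^[0-9.]{3}$/;
    (my $regex = $pattern) =~ s/\./\\d/g;    # turn each period into \d
    return grep { /^$regex$/ } @tags;
}

# Pull the 6xx tags out of a sample list.
print join(' ', tags_matching('6..', qw(100 245 600 650 651 700))), "\n";
```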
(May 1, 2004):
Updated BBMARC with a new function, validate008, along with other changes, as listed in BBMARC.pm, including version number (not fully implemented).
Moved BBMARC to a separate directory, MARC-BBMARC-[version number], which also includes the two main subroutines as separate files (validate007 and validate008).
Added 008checker.pl to go along with validate008.
About the author:
I am a cataloger/librarian with very limited programming experience. I began teaching myself Perl, using Coriolis' Perl Black Book, along with online documentation and books, around November, 2003. The extent of my knowledge is limited to knowing enough to start using MARC::Record's modules.
Copyright (c) 2003-2008
Bryan Baldus
eijabb@cpan.org
Last updated May 25, 2008.